BAN668 Cheat Sheet: Data Manipulation & Analysis

Notes for Every Dataset

  • Always look at the data info and the starting descriptive statistics (see the snippet below).
  • After every altering command, view the changes.
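
For example, a quick first look (assuming df is already loaded):

df.info()  # column types and missing-value counts

df.describe()  # descriptive statistics for numeric columns

df.head()  # view the first rows after each change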

Loading and Describing Data

import pandas as pd

df = pd.read_csv('bank2.csv', sep='\t') = Read a file into a DataFrame. Be able to read different file types and know the right separator (e.g., ',' for CSV, '\t' for tab-separated).

Add headers by passing names to the read function (use header=None if the file has no header row).
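
A minimal sketch (these column names are hypothetical placeholders):

df = pd.read_csv('bank2.csv', sep='\t', header=None, names=['age', 'job', 'balance'])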

df.sort_values('column'), df.describe(), df['column'].value_counts() (for qualitative data)
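
For instance, with hypothetical columns 'balance' (numeric) and 'job' (qualitative):

df.sort_values('balance', ascending=False)  # sort by balance, largest first

df['job'].value_counts()  # frequency of each job category

df.groupby('job')['balance'].describe()  # balance statistics within each job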

Visualizing Data

import seaborn as sns

# Convert day variable

df['day'] = pd.to_datetime(df['day'])

sns.lineplot(data=df, x='X', y='Y') = Plot Y against X (pick the seaborn function for the plot type you need).
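
Other common plot types (column names are placeholders):

sns.histplot(data=df, x='X')  # distribution of one numeric variable

sns.boxplot(data=df, x='X')  # box plot, useful for spotting outliers

sns.scatterplot(data=df, x='X', y='Y')  # relationship between two variables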

Handling Missing Values

df['x'].isna().sum() = Count missing values

df = df.dropna() = Delete rows with missing values

df = df.drop(columns=['x']) = Delete variables with too many missing values

df['x'] = df['x'].fillna(df['x'].mode()[0]) = Impute missing values with the mode

Concatenating Data

Note: Rename columns in the two datasets so the names are identical (df.rename(columns={'old_name': 'new_name'}))

df_concat = pd.concat([df1, df2])
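
If the original row indexes don't matter, reset them during the concat (standard pandas usage, not course-specific):

df_concat = pd.concat([df1, df2], ignore_index=True)  # renumber rows 0 to n-1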

Merging Data

df_merged = pd.merge(df1, df2, how='inner') = Keep rows in common (inner join)

Change how to 'left' to keep only rows from the left dataset or 'outer' to keep all rows.
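
A sketch with an explicit key column (the 'id' key is a made-up example):

df_merged = pd.merge(df1, df2, on='id', how='left')  # keep every row of df1, matching df2 where possible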

Sorting and Slicing Data

df = df.sort_values('x') = Sort by a specific column (set ascending=True or ascending=False)

pd.to_datetime() = Convert a variable to date and time

For slicing, create a subset of the desired size: use head for the first rows, tail for the last rows, and sample for a random selection (see the examples below). To keep only specific columns, use:

df_subset = df[['x', 'y']]
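
For example:

df_subset = df.head(10)  # first 10 rows

df_subset = df.tail(10)  # last 10 rows

df_subset = df.sample(100, random_state=1)  # 100 random rows, reproducible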

Filtering Data

df[df['column_name'] == value] = Filter by column value

df[(df['column1'] == value1) & (df['column2'] < value2)] = Filter by multiple conditions

df[df['column_name'].between(lower_bound, upper_bound)] = Filtering between a range

Aggregating Data

For basic stats, use df['column_name'].function(). For instance:

df['column_name'].sum() = Aggregate by sum

df.groupby('column_name').agg({'column1': 'sum', 'column2': 'mean'}) = Applying multiple functions
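
A concrete sketch (column names are placeholders):

df.groupby('job').agg({'balance': 'sum', 'age': 'mean'})  # total balance and average age per job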

Cleaning Data

df['column'] = df['column'].str.lower() = Change a text column to lower case

Use .str.upper() for upper case and .str.capitalize() for sentence case.

df['column_name'] = df['column_name'].str.replace(r'[^A-Za-z ]', '', regex=True) = Replace anything that’s not a letter with nothing.

df['column_name'] = df['column_name'].str.replace(r'[^0-9]', '', regex=True) = Removing non-numeric characters.

df['column'] = pd.to_numeric(df['column']) = Convert to a numeric variable.

df[df.duplicated()] = Shows duplicated rows

df = df.drop_duplicates() = Removes duplicated rows (returns a new DataFrame, so no .copy() is needed).

Mutating Data

import numpy as np = Import NumPy (needed for np.select and np.where below).

df['new_column'] = df['column1'] + df['column2'] = Adding a new column from existing columns (subtraction, multiplication, or division also work).

df = pd.DataFrame({'column_name': []})  # fill in your data

df['new_column'] = np.select([df['column_name'] > threshold, df['column_name'] <= threshold], ['Value1', 'Value2']) = Adding a new column based on conditions (threshold stands for your cutoff value).
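
A fuller sketch with a default for rows matching neither condition (the cutoff 100 and the labels are made up):

df['new_column'] = np.select([df['column_name'] > 100, df['column_name'] <= 100], ['Value1', 'Value2'], default='Unknown')  # default covers rows where both conditions are False, e.g., NaN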

Handling Extreme Outliers

Make use of NumPy (np) for the replacement and Seaborn (sns) to visualize the outliers.

Q1 = df['column_name'].quantile(0.25)

Q3 = df['column_name'].quantile(0.75)

IQR = Q3 - Q1

lower_bound = Q1 - 1.5 * IQR

upper_bound = Q3 + 1.5 * IQR

outliers = df[(df['column_name'] < lower_bound) | (df['column_name'] > upper_bound)] = Show the outliers (rows outside the IQR fences)

df = df[(df['column_name'] >= lower_bound) & (df['column_name'] <= upper_bound)] = Remove the outliers (keep only rows within the bounds)

mean_value = df['column_name'].mean()

df['column_name'] = np.where((df['column_name'] < lower_bound) | (df['column_name'] > upper_bound), mean_value, df['column_name'])
= Replace outliers with the mean; the mode can be used instead, depending on the task.

Supervised Learning: Regression & Classification

from sklearn.preprocessing import StandardScaler

from sklearn.model_selection import train_test_split

from sklearn.linear_model import LinearRegression, LogisticRegression

from sklearn.metrics import ConfusionMatrixDisplay, f1_score, mean_absolute_error

from sklearn.tree import DecisionTreeClassifier

from sklearn.neighbors import KNeighborsClassifier

df = pd.get_dummies(df, columns=['column'], drop_first=True) = Dummy-encode (nominal variable)

Ordinal-Encode (Ordinal Variable)

df['column'] = df['column'].replace({'low': 1, 'medium': 2, 'high': 3}) = Map each category to a number; add one entry per category level (the labels here are placeholders).

y = df['column'] = Set the target variable

X = df.drop(columns=['column']) = Drop the target from the predictors (X)

X = StandardScaler().fit_transform(X) = Standardize the data

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1) = Split into training and test sets
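
The split defaults to 75% train / 25% test; set test_size to change it:

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)  # 30% test set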

Linear Regression

model = LinearRegression().fit(X_train, y_train)

y_predictions = model.predict(X_test)

mean_absolute_error(y_test, y_predictions) = Mean absolute error of the predictions

df['predictions'] = model.predict(X) = Add the model's predictions for every row back to the DataFrame

Logistic Regression
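
Fit and predict first (a minimal sketch, following the same pattern as linear regression):

model = LogisticRegression().fit(X_train, y_train)

y_predictions = model.predict(X_test)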

ConfusionMatrixDisplay.from_predictions(y_test, y_predictions) = Plot the confusion matrix

f1_score(y_test, y_predictions) = F1 score
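
The tree and k-NN classifiers imported above follow the same fit/predict pattern and use the same metrics (a sketch; n_neighbors=5 is just an example):

model = DecisionTreeClassifier(random_state=1).fit(X_train, y_train)

model = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)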