Pandas Data Manipulation: A Practical Guide with Examples
Importing Libraries and Data
Importing Libraries
import pandas as pd
import numpy as np
Reading CSV Files
coffee = pd.read_csv('./warmup-data/coffee.csv')
bios = pd.read_csv("./data/bios.csv")
nocs = pd.read_csv('./data/noc_regions.csv')
results = pd.read_csv('./data/results.csv')
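The CSV files above aren't bundled with these notes, so a sketch using `io.StringIO` as a stand-in for a file on disk shows the same `read_csv` call on inline text (the column names are assumptions matching the coffee examples below):

```python
import pandas as pd
from io import StringIO

# StringIO stands in for a file path; read_csv parses it the same way
csv_text = "Day,Coffee Type,Units Sold\nMonday,Espresso,25\nMonday,Latte,15\n"
coffee = pd.read_csv(StringIO(csv_text))
print(coffee.shape)  # (rows, columns)
```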
Creating DataFrames
df = pd.DataFrame([[1,2,3], [4,5,6], [7,8,9]], columns=['a', 'b', 'c'], index=["x", "y", "z"])
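A minimal sketch of the constructor above: passing explicit `columns` and `index` labels means later selection can be done by label rather than position.

```python
import pandas as pd

# Same toy frame as above: row labels x/y/z, column labels a/b/c
df = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]],
                  columns=['a', 'b', 'c'], index=['x', 'y', 'z'])

# .loc selects by those labels
print(df.loc['y', 'b'])
```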
Viewing and Inspecting Data
df.head()
df.info()
df.describe()
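A quick sketch of the inspection calls on a toy frame: `head(n)` returns the first `n` rows (default 5) and `describe()` summarizes the numeric columns.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4, 5, 6], 'b': [10, 20, 30, 40, 50, 60]})

first_two = df.head(2)      # first 2 rows
stats = df.describe()       # count, mean, std, min, quartiles, max per column
print(len(first_two), stats.loc['mean', 'a'])
```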
Selecting Data
Selecting Columns and Rows
coffee['Units Sold'].max()
coffee.loc[coffee['Units Sold'] > 0, 'Units Sold']
coffee.loc[0, 'Units Sold']
coffee.loc[0]
coffee.loc[0:6]
units = coffee.loc[0:5, ['Units Sold', 'Day']]
days = coffee.loc[10:, 'Day']
coffee.iloc[0:, [0, 2]]
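The key difference between the two selectors above: `.loc` slices by label and is inclusive of the end label, while `.iloc` slices by position and excludes the end position. A toy frame with the assumed coffee columns makes this concrete:

```python
import pandas as pd

coffee = pd.DataFrame({
    'Day': ['Monday', 'Monday', 'Tuesday', 'Tuesday'],
    'Coffee Type': ['Espresso', 'Latte', 'Espresso', 'Latte'],
    'Units Sold': [25, 15, 30, 20],
})

by_label = coffee.loc[0:2, ['Day', 'Units Sold']]   # rows 0, 1 AND 2
by_position = coffee.iloc[0:2, [0, 2]]              # rows 0 and 1 only
print(len(by_label), len(by_position))
```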
Selecting with Conditionals
bios.loc[bios["height_cm"] > 220]
bios.loc[bios["height_cm"] > 215, ["name", "NOC", "height_cm"]]
bios[bios["height_cm"] > 215][["name","height_cm"]].sort_values("height_cm", ascending=False)
bios.loc[bios["height_cm"] > 215, ["name", "height_cm"]].sort_values("height_cm")
bios[bios["name"].str.contains("keith | patrick", case=False)][["name", "born_country"]]
bios[bios["name"].str.contains(r'son$|sen$', case=False)]["name"]
bios[bios["born_date"].str.contains(r'^19', na=False)][["name", "born_date"]]
bios[bios["born_country"].isin(["GER"]) & (bios["name"].str.startswith("Steffi"))].sort_values("born_date")
Modifying Data
Assigning New Values
coffee.loc[1:6, "Units Sold"] = 90
Creating a New Column
coffee["price"] = 4.99
coffee.pop("price")
coffee["price"] = 5
coffee.drop(columns=["price"], inplace=True)
coffee["new_price"] = np.where(coffee["Coffee Type"] == "Espresso", 3.99 , 5.99)
coffee["revenue"] = coffee["Units Sold"] * coffee["new_price"]
coffee.rename(columns={"new_price":"Item Price"}, inplace=True)
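The three lines above chain together: `np.where` picks a per-row price from the condition, the product builds a derived column, and `rename` relabels it. A self-contained sketch on a toy frame:

```python
import pandas as pd
import numpy as np

coffee = pd.DataFrame({
    'Coffee Type': ['Espresso', 'Latte', 'Espresso'],
    'Units Sold': [25, 15, 30],
})

# Vectorized if/else: 3.99 where the condition holds, 5.99 elsewhere
coffee['new_price'] = np.where(coffee['Coffee Type'] == 'Espresso', 3.99, 5.99)
coffee['revenue'] = coffee['Units Sold'] * coffee['new_price']
coffee.rename(columns={'new_price': 'Item Price'}, inplace=True)
print(round(coffee['revenue'].sum(), 2))
```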
Handling Missing Data
coffee.loc[1:6, 'Units Sold'] = np.nan
coffee.loc[1, 'Units Sold'] = 35
coffee['Units Sold'].fillna(coffee['Units Sold'].mean())
coffee['Units Sold'].interpolate()
coffee['Units Sold'] = coffee['Units Sold'].interpolate()
coffee = coffee.dropna(subset=['Units Sold'])
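`interpolate()` is worth seeing on concrete numbers: it fills each gap linearly between the surrounding known values, whereas `fillna(mean)` puts the same value everywhere. A minimal sketch:

```python
import pandas as pd
import numpy as np

coffee = pd.DataFrame({'Units Sold': [10.0, np.nan, np.nan, 40.0]})

# Linear interpolation: the two gaps become evenly spaced steps
filled = coffee['Units Sold'].interpolate()
print(filled.tolist())  # [10.0, 20.0, 30.0, 40.0]
```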
Grouping and Sorting
units_sold = coffee[["Day", "Units Sold"]]
units_sold.sort_values(["Units Sold", "Day"], ascending=False)
bios[bios['born_country'] == 'USA']['born_region'].value_counts()
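`value_counts()` as used above tallies occurrences of each value and sorts the result most-frequent first. A sketch on toy data:

```python
import pandas as pd

bios = pd.DataFrame({'born_country': ['USA', 'USA', 'GBR'],
                     'born_region': ['California', 'Texas', None]})

# Counts per distinct value, descending by count
counts = bios['born_country'].value_counts()
print(counts.to_dict())
```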
Iterating through Data
for index, row in coffee.iterrows():
    print(index)
    print(row[["Day", "Units Sold"]])
    print("\n")
Applying Functions
Using Apply
bios['height_category'] = bios["height_cm"].apply(lambda x: 'Short' if x < 165 else ('Average' if x < 185 else 'Tall'))
Using a Custom Function
def categorize_athlete(row):
    if row['height_cm'] < 175 and row['weight_kg'] < 70:
        return 'Lightweight'
    elif row['height_cm'] < 185 and row['weight_kg'] <= 85:
        return 'Middleweight'
    else:
        return 'Heavyweight'
bios['Category'] = bios.apply(categorize_athlete, axis=1)
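With `axis=1`, `apply` passes one row at a time to the function, so the function can read several columns of that row. A self-contained sketch of the pattern above:

```python
import pandas as pd

bios = pd.DataFrame({'height_cm': [170, 180, 200],
                     'weight_kg': [65, 80, 100]})

def categorize_athlete(row):
    if row['height_cm'] < 175 and row['weight_kg'] < 70:
        return 'Lightweight'
    elif row['height_cm'] < 185 and row['weight_kg'] <= 85:
        return 'Middleweight'
    return 'Heavyweight'

# axis=1: the function receives each row as a Series
bios['Category'] = bios.apply(categorize_athlete, axis=1)
print(bios['Category'].tolist())
```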
Working with Dates
bios["born_date"] = pd.to_datetime(bios["born_date"])
bios["year"] = bios["born_date"].dt.year
bios = bios.dropna(subset=["year"])
bios.pop("year")
Merging DataFrames
bios_merge = pd.merge(bios, nocs, left_on='born_country', right_on='NOC', how='left')
bios_merge.rename(columns={'region': 'born_country_full'}, inplace=True)
combined = pd.merge(results, bios, on='athlete_id', how='left')
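`how='left'` keeps every row from the left frame even when the right frame has no match (unmatched rows get `NaN` in the right-hand columns). A sketch mirroring the `bios`/`nocs` merge with toy data:

```python
import pandas as pd

bios = pd.DataFrame({'athlete_id': [1, 2], 'born_country': ['USA', 'GER']})
nocs = pd.DataFrame({'NOC': ['USA', 'GER'],
                     'region': ['United States', 'Germany']})

# Join on differently named key columns via left_on / right_on
merged = pd.merge(bios, nocs, left_on='born_country', right_on='NOC', how='left')
merged.rename(columns={'region': 'born_country_full'}, inplace=True)
print(merged['born_country_full'].tolist())
```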
Concatenating DataFrames
usa = bios[bios['born_country'] == 'USA']
gbr = bios[bios['born_country'] == 'GBR']
usa_gbr = pd.concat([usa, gbr])
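`pd.concat` stacks the frames vertically and, by default, keeps each frame's original row labels; `ignore_index=True` renumbers the result from 0. A sketch:

```python
import pandas as pd

usa = pd.DataFrame({'name': ['A'], 'born_country': ['USA']})
gbr = pd.DataFrame({'name': ['B'], 'born_country': ['GBR']})

# Stack rows; ignore_index=True gives a fresh 0..n-1 index
usa_gbr = pd.concat([usa, gbr], ignore_index=True)
print(len(usa_gbr))
```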
Saving to CSV
bios.to_csv("./data/bios_new.csv", index=False)
Handling Null Values
Selecting Null Values
coffee[coffee['Units Sold'].isna()]
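`isna()` returns a boolean mask of missing values; its complement `notna()` selects the rows that do have data. A sketch:

```python
import pandas as pd
import numpy as np

coffee = pd.DataFrame({'Units Sold': [25.0, np.nan, 30.0]})

missing = coffee[coffee['Units Sold'].isna()]    # rows where the value is NaN
present = coffee[coffee['Units Sold'].notna()]   # rows with a real value
print(len(missing), len(present))
```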
Aggregating Data
Count
bios['born_city'].value_counts()
bios[bios['born_country'] == 'USA']['born_region'].value_counts()
GroupBy
bios.groupby(bios['born_date'].dt.year).count()
bios.groupby([bios['born_date'].dt.year])['name'].count().reset_index().sort_values('name', ascending=False)
bios.groupby(['born_city'])[['height_cm', 'weight_kg']].mean().sort_values('height_cm', ascending=False)
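`groupby` splits the rows by the key, applies the aggregation to each group separately, and returns one result per group. A minimal sketch of the per-city mean above:

```python
import pandas as pd

bios = pd.DataFrame({'born_city': ['Oslo', 'Oslo', 'Paris'],
                     'height_cm': [180, 190, 170]})

# Mean height within each city
means = bios.groupby('born_city')['height_cm'].mean()
print(means.to_dict())
```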
Multiple Aggregations
coffee.groupby(['Coffee Type', 'Day']).agg({'Units Sold': 'sum', 'Item Price': 'mean'})
bios.groupby(bios['born_date'].dt.year[bios['born_date'].dt.year > 1900])['height_cm'].mean()
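`agg` with a dict applies a different aggregation to each column within every group. A sketch (the 'Item Price' column matches the rename earlier in these notes):

```python
import pandas as pd

coffee = pd.DataFrame({
    'Coffee Type': ['Espresso', 'Espresso', 'Latte'],
    'Units Sold': [25, 30, 15],
    'Item Price': [3.99, 3.99, 5.99],
})

# Sum one column, average another, per group
summary = coffee.groupby('Coffee Type').agg({'Units Sold': 'sum',
                                             'Item Price': 'mean'})
print(summary.loc['Espresso', 'Units Sold'])
```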
Pivot Tables
pivot = coffee.pivot(columns='Coffee Type', index='Day', values='revenue')
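`pivot` reshapes long data into wide form: one row per `index` value, one column per `columns` value, cells filled from `values`. It requires each (index, columns) pair to be unique. A sketch:

```python
import pandas as pd

coffee = pd.DataFrame({
    'Day': ['Monday', 'Monday', 'Tuesday', 'Tuesday'],
    'Coffee Type': ['Espresso', 'Latte', 'Espresso', 'Latte'],
    'revenue': [99.75, 89.85, 119.70, 119.80],
})

# One row per Day, one column per Coffee Type
pivot = coffee.pivot(columns='Coffee Type', index='Day', values='revenue')
print(pivot.loc['Monday', 'Latte'])
```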
Advanced Techniques
Shift
coffee['yesterday_revenue'] = coffee['revenue'].shift(2)
coffee['pct_increase'] = (coffee['revenue'] / coffee['yesterday_revenue'] - 1) * 100
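`shift(2)` above presumably looks back one day because the coffee data has two rows per day (one per coffee type); a single-row shift on a toy frame shows the mechanics, with `NaN` where no previous value exists:

```python
import pandas as pd

coffee = pd.DataFrame({'revenue': [100.0, 110.0, 121.0]})

# shift(1) moves values down one row: each row sees the previous row's revenue
coffee['prev_revenue'] = coffee['revenue'].shift(1)
coffee['pct_of_prev'] = coffee['revenue'] / coffee['prev_revenue'] * 100
print(coffee['pct_of_prev'].tolist())
```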
Rank
bios['height_rank'] = bios['height_cm'].rank(ascending=False)
bios.sort_values(['height_rank'], inplace=True)
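With `ascending=False`, `rank` assigns 1 to the largest value; ties get the average of their positions by default. A sketch:

```python
import pandas as pd

bios = pd.DataFrame({'height_cm': [170, 190, 180]})

# Tallest gets rank 1.0
bios['height_rank'] = bios['height_cm'].rank(ascending=False)
print(bios['height_rank'].tolist())
```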
Cumulative Sum
coffee['cumulative_sum'] = coffee['revenue'].cumsum()
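`cumsum` accumulates a running total down the column, so each row holds the sum of everything up to and including itself:

```python
import pandas as pd

coffee = pd.DataFrame({'revenue': [100.0, 50.0, 25.0]})

# Running total of revenue
coffee['cumulative_sum'] = coffee['revenue'].cumsum()
print(coffee['cumulative_sum'].tolist())
```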
Rolling Calculations
latte = coffee[coffee['Coffee Type'] == 'Latte'].copy()
latte['Sold_Units_Last_3Days'] = latte['Units Sold'].rolling(3).sum()
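A rolling window of size 3 needs three observations before it emits a value, so the first two rows come back `NaN`. A minimal sketch of the rolling sum above:

```python
import pandas as pd

latte = pd.DataFrame({'Units Sold': [10.0, 20.0, 30.0, 40.0]})

# Sum over a sliding window of the last 3 rows
latte['Sold_Units_Last_3Days'] = latte['Units Sold'].rolling(3).sum()
print(latte['Sold_Units_Last_3Days'].tolist())
```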