Pandas Data Manipulation: A Practical Guide with Examples
Importing Libraries and Data
Importing Libraries
import pandas as pd
import numpy as np
Reading CSV Files
coffee = pd.read_csv('./warmup-data/coffee.csv')
bios = pd.read_csv("./data/bios.csv")
nocs = pd.read_csv('./data/noc_regions.csv')
results = pd.read_csv('./data/results.csv')
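The CSV files above aren't bundled with these notes, so a sketch using `io.StringIO` as a stand-in for a file on disk shows the same `read_csv` call on inline text (the column names are assumptions matching the coffee examples below):

```python
import pandas as pd
from io import StringIO

# StringIO stands in for a file path; read_csv parses it the same way
csv_text = "Day,Coffee Type,Units Sold\nMonday,Espresso,25\nMonday,Latte,15\n"
coffee = pd.read_csv(StringIO(csv_text))
print(coffee.shape)  # (rows, columns)
```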
Creating DataFrames
df = pd.DataFrame([[1,2,3], [4,5,6], [7,8,9]], columns=['a', 'b', 'c'], index=["x", "y", "z"])
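A minimal sketch of the constructor above: passing explicit `columns` and `index` labels means later selection can be done by label rather than position.

```python
import pandas as pd

# Same toy frame as above: row labels x/y/z, column labels a/b/c
df = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]],
                  columns=['a', 'b', 'c'], index=['x', 'y', 'z'])

# .loc selects by those labels
print(df.loc['y', 'b'])
```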
Viewing and Inspecting Data
df.head()
df.info()
df.describe()
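A quick sketch of the inspection calls on a toy frame: `head(n)` returns the first `n` rows (default 5) and `describe()` summarizes the numeric columns.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4, 5, 6], 'b': [10, 20, 30, 40, 50, 60]})

first_two = df.head(2)      # first 2 rows
stats = df.describe()       # count, mean, std, min, quartiles, max per column
print(len(first_two), stats.loc['mean', 'a'])
```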
Selecting Data
Selecting Columns and Rows
coffee['Units Sold'].max()
coffee.loc[coffee['Units Sold'] > 0, 'Units Sold']
coffee.loc[0, 'Units Sold']
coffee.loc[0]
coffee.loc[0:6]
units = coffee.loc[0:5, ['Units Sold', 'Day']]
days = coffee.loc[10:, 'Day']
coffee.iloc[0:, [0, 2]]
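The key difference between the two selectors above: `.loc` slices by label and is inclusive of the end label, while `.iloc` slices by position and excludes the end position. A toy frame with the assumed coffee columns makes this concrete:

```python
import pandas as pd

coffee = pd.DataFrame({
    'Day': ['Monday', 'Monday', 'Tuesday', 'Tuesday'],
    'Coffee Type': ['Espresso', 'Latte', 'Espresso', 'Latte'],
    'Units Sold': [25, 15, 30, 20],
})

by_label = coffee.loc[0:2, ['Day', 'Units Sold']]   # rows 0, 1 AND 2
by_position = coffee.iloc[0:2, [0, 2]]              # rows 0 and 1 only
print(len(by_label), len(by_position))
```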
Selecting with Conditionals
bios.loc[bios["height_cm"] > 220]
bios.loc[bios["height_cm"] > 215, ["name", "NOC", "height_cm"]]
bios[bios["height_cm"] > 215][["name","height_cm"]].sort_values("height_cm", ascending=False)
bios.loc[bios["height_cm"] > 215, ["name", "height_cm"]].sort_values("height_cm")
bios[bios["name"].str.contains("keith | patrick", case=False)][["name", "born_country"]]
bios[bios["name"].str.contains(r'son$|sen$', case=False)]["name"]
bios[bios["born_date"].str.contains(r'^19', na=False)][["name", "born_date"]]
bios[bios["born_country"].isin(["GER"]) & (bios["name"].str.startswith("Steffi"))].sort_values("born_date")
Modifying Data
Assigning New Values
coffee.loc[1:6, "Units Sold"] = 90
Creating a New Column
coffee["price"] = 4.99
coffee.pop("price")
coffee["price"] = 5
coffee.drop(columns=["price"], inplace=True)
coffee["new_price"] = np.where(coffee["Coffee Type"] == "Espresso", 3.99 , 5.99)
coffee["revenue"] = coffee["Units Sold"] * coffee["new_price"]
coffee.rename(columns={"new_price":"Item Price"}, inplace=True)
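The three lines above chain together: `np.where` picks a per-row price from the condition, the product builds a derived column, and `rename` relabels it. A self-contained sketch on a toy frame:

```python
import pandas as pd
import numpy as np

coffee = pd.DataFrame({
    'Coffee Type': ['Espresso', 'Latte', 'Espresso'],
    'Units Sold': [25, 15, 30],
})

# Vectorized if/else: 3.99 where the condition holds, 5.99 elsewhere
coffee['new_price'] = np.where(coffee['Coffee Type'] == 'Espresso', 3.99, 5.99)
coffee['revenue'] = coffee['Units Sold'] * coffee['new_price']
coffee.rename(columns={'new_price': 'Item Price'}, inplace=True)
print(round(coffee['revenue'].sum(), 2))
```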
Handling Missing Data
coffee.loc[1:6, 'Units Sold'] = np.nan
coffee.loc[1, 'Units Sold'] = 35
coffee['Units Sold'].fillna(coffee['Units Sold'].mean())
coffee['Units Sold'].interpolate()
coffee['Units Sold'] = coffee['Units Sold'].interpolate()
coffee = coffee.dropna(subset=['Units Sold'])
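`interpolate()` is worth seeing on concrete numbers: it fills each gap linearly between the surrounding known values, whereas `fillna(mean)` puts the same value everywhere. A minimal sketch:

```python
import pandas as pd
import numpy as np

coffee = pd.DataFrame({'Units Sold': [10.0, np.nan, np.nan, 40.0]})

# Linear interpolation: the two gaps become evenly spaced steps
filled = coffee['Units Sold'].interpolate()
print(filled.tolist())  # [10.0, 20.0, 30.0, 40.0]
```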
Grouping and Sorting
units_sold = coffee[["Day", "Units Sold"]]
units_sold.sort_values(["Units Sold", "Day"], ascending=False)
bios[bios['born_country'] == 'USA']['born_region'].value_counts()
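`value_counts()` as used above tallies occurrences of each value and sorts the result most-frequent first. A sketch on toy data:

```python
import pandas as pd

bios = pd.DataFrame({'born_country': ['USA', 'USA', 'GBR'],
                     'born_region': ['California', 'Texas', None]})

# Counts per distinct value, descending by count
counts = bios['born_country'].value_counts()
print(counts.to_dict())
```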
Iterating through Data
for index, row in coffee.iterrows():
    print(index)
    print(row[["Day", "Units Sold"]])
    print("\n")
Applying Functions
Using Apply
bios['height_category'] = bios["height_cm"].apply(lambda x: 'Short' if x < 165 else ('Average' if x < 185 else 'Tall'))
Using a Custom Function
def categorize_athlete(row):
    if row['height_cm'] < 175 and row['weight_kg'] < 70:
        return 'Lightweight'
    elif row['height_cm'] < 185 and row['weight_kg'] <= 85:
        return 'Middleweight'
    else:
        return 'Heavyweight'
bios['Category'] = bios.apply(categorize_athlete, axis=1)
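With `axis=1`, `apply` passes one row at a time to the function, so the function can read several columns of that row. A self-contained sketch of the pattern above:

```python
import pandas as pd

bios = pd.DataFrame({'height_cm': [170, 180, 200],
                     'weight_kg': [65, 80, 100]})

def categorize_athlete(row):
    if row['height_cm'] < 175 and row['weight_kg'] < 70:
        return 'Lightweight'
    elif row['height_cm'] < 185 and row['weight_kg'] <= 85:
        return 'Middleweight'
    return 'Heavyweight'

# axis=1: the function receives each row as a Series
bios['Category'] = bios.apply(categorize_athlete, axis=1)
print(bios['Category'].tolist())
```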
Working with Dates
bios["born_date"] = pd.to_datetime(bios["born_date"])
bios["year"] = bios["born_date"].dt.year
bios = bios.dropna(subset=["year"])
bios.pop("year")
Merging DataFrames
bios_merge = pd.merge(bios, nocs, left_on='born_country', right_on='NOC', how='left')
bios_merge.rename(columns={'region': 'born_country_full'}, inplace=True)
combined = pd.merge(results, bios, on='athlete_id', how='left')
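`how='left'` keeps every row from the left frame even when the right frame has no match (unmatched rows get `NaN` in the right-hand columns). A sketch mirroring the `bios`/`nocs` merge with toy data:

```python
import pandas as pd

bios = pd.DataFrame({'athlete_id': [1, 2], 'born_country': ['USA', 'GER']})
nocs = pd.DataFrame({'NOC': ['USA', 'GER'],
                     'region': ['United States', 'Germany']})

# Join on differently named key columns via left_on / right_on
merged = pd.merge(bios, nocs, left_on='born_country', right_on='NOC', how='left')
merged.rename(columns={'region': 'born_country_full'}, inplace=True)
print(merged['born_country_full'].tolist())
```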
Concatenating DataFrames
usa = bios[bios['born_country'] == 'USA']
gbr = bios[bios['born_country'] == 'GBR']
usa_gbr = pd.concat([usa, gbr])
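`pd.concat` stacks the frames vertically and, by default, keeps each frame's original row labels; `ignore_index=True` renumbers the result from 0. A sketch:

```python
import pandas as pd

usa = pd.DataFrame({'name': ['A'], 'born_country': ['USA']})
gbr = pd.DataFrame({'name': ['B'], 'born_country': ['GBR']})

# Stack rows; ignore_index=True gives a fresh 0..n-1 index
usa_gbr = pd.concat([usa, gbr], ignore_index=True)
print(len(usa_gbr))
```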
Saving to CSV
bios.to_csv("./data/bios_new.csv", index=False)
Handling Null Values
Selecting Null Values
coffee[coffee['Units Sold'].isna()]
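`isna()` returns a boolean mask of missing values; its complement `notna()` selects the rows that do have data. A sketch:

```python
import pandas as pd
import numpy as np

coffee = pd.DataFrame({'Units Sold': [25.0, np.nan, 30.0]})

missing = coffee[coffee['Units Sold'].isna()]    # rows where the value is NaN
present = coffee[coffee['Units Sold'].notna()]   # rows with a real value
print(len(missing), len(present))
```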
Aggregating Data
Count
bios['born_city'].value_counts()
bios[bios['born_country'] == 'USA']['born_region'].value_counts()
GroupBy
bios.groupby(bios['born_date'].dt.year).count()
bios.groupby([bios['born_date'].dt.year])['name'].count().reset_index().sort_values('name', ascending=False)
bios.groupby(['born_city'])[['height_cm', 'weight_kg']].mean().sort_values('height_cm', ascending=False)
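`groupby` splits the rows by the key, applies the aggregation to each group separately, and returns one result per group. A minimal sketch of the per-city mean above:

```python
import pandas as pd

bios = pd.DataFrame({'born_city': ['Oslo', 'Oslo', 'Paris'],
                     'height_cm': [180, 190, 170]})

# Mean height within each city
means = bios.groupby('born_city')['height_cm'].mean()
print(means.to_dict())
```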
Multiple Aggregations
coffee.groupby(['Coffee Type', 'Day']).agg({'Units Sold': 'sum', 'Item Price': 'mean'})
bios.groupby(bios['born_date'].dt.year[bios['born_date'].dt.year > 1900])['height_cm'].mean()
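`agg` with a dict applies a different aggregation to each column within every group. A sketch (the 'Item Price' column matches the rename earlier in these notes):

```python
import pandas as pd

coffee = pd.DataFrame({
    'Coffee Type': ['Espresso', 'Espresso', 'Latte'],
    'Units Sold': [25, 30, 15],
    'Item Price': [3.99, 3.99, 5.99],
})

# Sum one column, average another, per group
summary = coffee.groupby('Coffee Type').agg({'Units Sold': 'sum',
                                             'Item Price': 'mean'})
print(summary.loc['Espresso', 'Units Sold'])
```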
Pivot Tables
pivot = coffee.pivot(columns='Coffee Type', index='Day', values='revenue')
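`pivot` reshapes long data into wide form: one row per `index` value, one column per `columns` value, cells filled from `values`. It requires each (index, columns) pair to be unique. A sketch:

```python
import pandas as pd

coffee = pd.DataFrame({
    'Day': ['Monday', 'Monday', 'Tuesday', 'Tuesday'],
    'Coffee Type': ['Espresso', 'Latte', 'Espresso', 'Latte'],
    'revenue': [99.75, 89.85, 119.70, 119.80],
})

# One row per Day, one column per Coffee Type
pivot = coffee.pivot(columns='Coffee Type', index='Day', values='revenue')
print(pivot.loc['Monday', 'Latte'])
```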
Advanced Techniques
Shift
coffee['yesterday_revenue'] = coffee['revenue'].shift(2)
coffee['pct_increase'] = (coffee['revenue'] / coffee['yesterday_revenue'] - 1) * 100
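`shift(2)` above presumably looks back one day because the coffee data has two rows per day (one per coffee type); a single-row shift on a toy frame shows the mechanics, with `NaN` where no previous value exists:

```python
import pandas as pd

coffee = pd.DataFrame({'revenue': [100.0, 110.0, 121.0]})

# shift(1) moves values down one row: each row sees the previous row's revenue
coffee['prev_revenue'] = coffee['revenue'].shift(1)
coffee['pct_of_prev'] = coffee['revenue'] / coffee['prev_revenue'] * 100
print(coffee['pct_of_prev'].tolist())
```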
Rank
bios['height_rank'] = bios['height_cm'].rank(ascending=False)
bios.sort_values(['height_rank'], inplace=True)
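With `ascending=False`, `rank` assigns 1 to the largest value; ties get the average of their positions by default. A sketch:

```python
import pandas as pd

bios = pd.DataFrame({'height_cm': [170, 190, 180]})

# Tallest gets rank 1.0
bios['height_rank'] = bios['height_cm'].rank(ascending=False)
print(bios['height_rank'].tolist())
```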
Cumulative Sum
coffee['cumulative_sum'] = coffee['revenue'].cumsum()
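`cumsum` accumulates a running total down the column, so each row holds the sum of everything up to and including itself:

```python
import pandas as pd

coffee = pd.DataFrame({'revenue': [100.0, 50.0, 25.0]})

# Running total of revenue
coffee['cumulative_sum'] = coffee['revenue'].cumsum()
print(coffee['cumulative_sum'].tolist())
```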
Rolling Calculations
latte = coffee[coffee['Coffee Type'] == 'Latte'].copy()
latte['Sold_Units_Last_3Days'] = latte['Units Sold'].rolling(3).sum()
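A rolling window of size 3 needs three observations before it emits a value, so the first two rows come back `NaN`. A minimal sketch of the rolling sum above:

```python
import pandas as pd

latte = pd.DataFrame({'Units Sold': [10.0, 20.0, 30.0, 40.0]})

# Sum over a sliding window of the last 3 rows
latte['Sold_Units_Last_3Days'] = latte['Units Sold'].rolling(3).sum()
print(latte['Sold_Units_Last_3Days'].tolist())
```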