Pandas in Python: Essential Techniques
Posted on Feb 17, 2025 in Information and Documentation
Pandas – Part 2
Built-in Functions
- Average of column:
df_grades.Column.mean()- Average of multiple columns:
df_grades[["column1", "column2"]].mean()
- Max/min:
df_grades["column name"].max() or .min() - Statistical summary:
df_grades.describe()- Can get all key stats for numeric columns at once with the
describe() method- Count, mean, std, min, 25%, 50%, 75%, max
- Pick out chunks of the summary by storing the summary in a variable and then use
variable[["column name", "column name"]]
- Number of each unique value:
counts = df_grades["column name"].value_counts()- value_count() = gives a count of number of times each unique value appears in the column. Returns a series where indices are unique column values
- Number of a specific unique value:
counts["value"]
- What are the unique values:
unique = df_grades["column name"].unique() returns a string- Unique tells you what values are within the column specific (A, A-, B)
unique[index number] to pick a specific unique valuelen(unique) to find the number of unique values (3 if there’s A, A-, B)
Attributes vs. Methods
- Creating a new column:
df_data["new_column"] = xyz- You can pick anything to be in the new column including a new function itself using
df_data["column"] - e.g)
df_grades["new_column"]=1 // db_grades.head() will create a new column populated (filled) with 1s throughout - e.g)
df_grades["new_column"]=["Final"]/36 // db_grades.head()
Deleting Columns
- Deleting columns:
del df_grades["column name"] - Deleting multiple columns:
df_grades.drop(["column name", "column name"], axis = 1, inplace= True)- Axis = 1 for deleting specific column, 0 for rows
- inplace = whether to make change permanent (True) or not
- False will return dataframe with specified columns deleted, not changing df_grades
Sorting DataFrames
df_data.sort_values (by = "column name", inplace = True/False, ascending = True/False)- Will sort the column specified in ascending/descending order, making permanent or not
- Sorting data frames with tie breaker:
df_data.sort_values (by = ["column name", "tie breaker column"], inplace = True/False, ascending = [True, True])- So if values in the “column name” are tied, then the “tie breaker column” will determine which to show first based on which is bigger/smaller
- Setting index when reading in:
pd.read_csv("file path", index_col = column index number)- Will be the column to use as the row labels for DataFrame
Date and Time
- Converting to datetime: Two methods
- Adjust when reading in:
pd.read_csv("file path", index_col = X, parse_dates = ["column"])- Parse_dates will store the column type as datetime
- Adjust after reading:
df_data["column"] = pd.to_datetime(df_data["column"])- Wrap column in
to_datetime() built in function to store as datetime
- Time stamp attributes
- create variable locating the date:
datetime = df_data.loc[X, "Y"]datetime.hour ; datetime.dayofweek ; datetime.year ; datetime.is_leap_year (returns boolean)
- Creating new columns from datetime column:
df_data["DOW"] = df_data["datetime column"].dt.weekday_name or .dt.month- Need .dt if we are applying datetime attributes to a series