Pandas in Python: Essential Techniques
Posted on Feb 17, 2025 in Information and Documentation
Pandas – Part 2
Built-in Functions
- Average of column:
df_grades.Column.mean()
- Average of multiple columns:
df_grades[["column1", "column2"]].mean()
- Max/min:
df_grades["column name"].max()
or .min()
- Statistical summary:
df_grades.describe()
- Can get all key stats for numeric columns at once with the
describe()
method: count, mean, std, min, 25%, 50%, 75%, max
- Pick out chunks of the summary by storing the summary in a variable and then using
variable[["column name", "column name"]]
- Number of each unique value:
counts = df_grades["column name"].value_counts()
- value_counts() gives a count of the number of times each unique value appears in the column. Returns a Series where the indices are the unique column values
- Number of a specific unique value:
counts["value"]
- What are the unique values:
unique = df_grades["column name"].unique()
returns an array
- unique() tells you what distinct values are within the column (A, A-, B)
- Use
unique[index number]
to pick a specific unique value, and
len(unique)
to find the number of unique values (3 if there’s A, A-, B); see the sketch after this list
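A minimal sketch pulling these built-in functions together, assuming a small hypothetical df_grades with a numeric "Final" column and a letter-grade "Grade" column:

import pandas as pd

df_grades = pd.DataFrame({"Grade": ["A", "A-", "B", "A"], "Final": [34, 31, 28, 35]})

print(df_grades["Final"].mean())      # average of one column
summary = df_grades.describe()        # count, mean, std, min, 25%, 50%, 75%, max for numeric columns
print(summary[["Final"]])             # pick out a chunk of the summary

counts = df_grades["Grade"].value_counts()
print(counts["A"])                    # how many times "A" appears (2)

unique = df_grades["Grade"].unique()  # array of the distinct values
print(unique[0], len(unique))         # a specific unique value, and how many there are (3)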
Creating New Columns
- Creating a new column:
df_data["new_column"] = xyz
- You can put anything in the new column, including values computed from an existing column via
df_data["column"]
- e.g)
df_grades["new_column"]=1 // db_grades.head()
will create a new column populated (filled) with 1s throughout - e.g)
df_grades["new_column"]=["Final"]/36 // db_grades.head()
Deleting Columns
- Deleting columns:
del df_grades["column name"]
- Deleting multiple columns:
df_grades.drop(["column name", "column name"], axis=1, inplace=True)
- axis=1 deletes columns, axis=0 deletes rows
- inplace = whether to make the change permanent (True) or not
- False returns a new DataFrame with the specified columns dropped, leaving df_grades unchanged
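A sketch of both deletion approaches, using the same hypothetical df_grades:

import pandas as pd

df_grades = pd.DataFrame({"Name": ["Ana", "Ben"], "Quiz": [8, 9], "Final": [30, 33]})
del df_grades["Quiz"]                            # removes one column in place
trimmed = df_grades.drop(["Final"], axis=1)      # inplace defaults to False, so this returns a copy
df_grades.drop(["Final"], axis=1, inplace=True)  # this changes df_grades itself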
Sorting DataFrames
df_data.sort_values (by = "column name", inplace = True/False, ascending = True/False)
- Will sort the column specified in ascending/descending order, making permanent or not
- Sorting data frames with tie breaker:
df_data.sort_values(by=["column name", "tie breaker column"], inplace=True/False, ascending=[True, True])
- If values in the “column name” column are tied, the “tie breaker column” determines which row comes first, based on which value is bigger/smaller (see the sketch after this list)
- Setting index when reading in:
pd.read_csv("file path", index_col = column index number)
- Will be the column to use as the row labels for DataFrame
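A sketch of sorting with a tie breaker; the column names and file path are hypothetical:

import pandas as pd

df_data = pd.DataFrame({"Score": [90, 85, 90], "Quiz": [7, 9, 10]})
df_data.sort_values(by=["Score", "Quiz"], ascending=[False, False], inplace=True)
print(df_data)  # the two rows tied at Score 90 are ordered by Quiz (10 before 7)

# Setting the index while reading in (commented out because the file is hypothetical):
# df = pd.read_csv("grades.csv", index_col=0)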
Date and Time
- Converting to datetime: Two methods
- Adjust when reading in:
pd.read_csv("file path", index_col = X, parse_dates = ["column"])
- parse_dates will store the column type as datetime
- Adjust after reading:
df_data["column"] = pd.to_datetime(df_data["column"])
- Wrap the column in the
to_datetime()
built-in function to store it as datetime
- Timestamp attributes
- Create a variable locating the date:
datetime = df_data.loc[X, "Y"]
- Then access
datetime.hour, datetime.dayofweek, datetime.year, datetime.is_leap_year
(is_leap_year returns a boolean)
- Creating new columns from a datetime column:
df_data["DOW"] = df_data["datetime column"].dt.day_name()
or .dt.month
- (day_name() replaced the older weekday_name attribute in current pandas versions)
- Need .dt when applying datetime attributes to a Series; see the sketch below
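A sketch tying the datetime pieces together, built on an in-memory DataFrame rather than a CSV (the column names are hypothetical):

import pandas as pd

df_data = pd.DataFrame({"When": ["2025-02-17 09:30", "2025-02-18 14:00"], "Sales": [120, 95]})
df_data["When"] = pd.to_datetime(df_data["When"])  # convert after reading in

ts = df_data.loc[0, "When"]                        # a single Timestamp
print(ts.hour, ts.dayofweek, ts.year, ts.is_leap_year)

# .dt is needed when applying datetime attributes to the whole Series
df_data["DOW"] = df_data["When"].dt.day_name()
df_data["Month"] = df_data["When"].dt.month
print(df_data)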