Pandas in Python: Essential Techniques

Pandas – Part 2

Average of column: df_grades.Column.mean()
- Average of multiple columns: df_grades[["column1", "column2"]].mean()
Max/min: df_grades["column name"].max() or .min()
Statistical summary: df_grades.describe()
- Can get all key stats for numeric columns at once with the describe() method
  - Count, mean, std, min, 25%, 50%, 75%, max
- Pick out chunks of the summary by storing the summary in a variable and then use variable[["column name", "column name"]]
Number of each unique value: counts = df_grades["column name"].value_counts()
- value_count() = gives a count of number of times each unique value appears in the column. Returns a series where indices are unique column values
- Number of a specific unique value: counts["value"]
What are the unique values: unique = df_grades["column name"].unique() returns a string
- Unique tells you what values are within the column specific (A, A-, B)
- unique[index number] to pick a specific unique value
- len(unique) to find the number of unique values (3 if there’s A, A-, B)

Creating a new column: df_data["new_column"] = xyz
- You can pick anything to be in the new column including a new function itself using df_data["column"]
- e.g) df_grades["new_column"]=1 // db_grades.head() will create a new column populated (filled) with 1s throughout
- e.g) df_grades["new_column"]=["Final"]/36 // db_grades.head()

Deleting columns: del df_grades["column name"]
Deleting multiple columns: df_grades.drop(["column name", "column name"], axis = 1, inplace= True)
- Axis = 1 for deleting specific column, 0 for rows
- inplace = whether to make change permanent (True) or not
  - False will return dataframe with specified columns deleted, not changing df_grades

df_data.sort_values (by = "column name", inplace = True/False, ascending = True/False)
- Will sort the column specified in ascending/descending order, making permanent or not
Sorting data frames with tie breaker: df_data.sort_values (by = ["column name", "tie breaker column"], inplace = True/False, ascending = [True, True])
- So if values in the “column name” are tied, then the “tie breaker column” will determine which to show first based on which is bigger/smaller
Setting index when reading in: pd.read_csv("file path", index_col = column index number)
- Will be the column to use as the row labels for DataFrame

Converting to datetime: Two methods
- Adjust when reading in: pd.read_csv("file path", index_col = X, parse_dates = ["column"])
  - Parse_dates will store the column type as datetime
- Adjust after reading: df_data["column"] = pd.to_datetime(df_data["column"])
  - Wrap column in to_datetime() built in function to store as datetime
Time stamp attributes
- create variable locating the date: datetime = df_data.loc[X, "Y"]
  - datetime.hour ; datetime.dayofweek ; datetime.year ; datetime.is_leap_year (returns boolean)
Creating new columns from datetime column:
- df_data["DOW"] = df_data["datetime column"].dt.weekday_name or .dt.month
- Need .dt if we are applying datetime attributes to a series