Pandas in Python: Essential Techniques

Pandas – Part 2

Built-in Functions

  • Average of column: df_grades.Column.mean()
    • Average of multiple columns: df_grades[["column1", "column2"]].mean()
  • Max/min: df_grades["column name"].max() or .min()
  • Statistical summary: df_grades.describe()
    • Can get all key stats for numeric columns at once with the describe() method
      • Count, mean, std, min, 25%, 50%, 75%, max
    • Pick out chunks of the summary by storing the summary in a variable and then use variable[["column name", "column name"]]
  • Number of each unique value: counts = df_grades["column name"].value_counts()
    • value_count() = gives a count of number of times each unique value appears in the column. Returns a series where indices are unique column values
    • Number of a specific unique value: counts["value"]
  • What are the unique values: unique = df_grades["column name"].unique() returns a string
    • Unique tells you what values are within the column specific (A, A-, B)
    • unique[index number] to pick a specific unique value
    • len(unique) to find the number of unique values (3 if there’s A, A-, B)

Attributes vs. Methods

  • Creating a new column: df_data["new_column"] = xyz
    • You can pick anything to be in the new column including a new function itself using df_data["column"]
    • e.g) df_grades["new_column"]=1 // db_grades.head() will create a new column populated (filled) with 1s throughout
    • e.g) df_grades["new_column"]=["Final"]/36 // db_grades.head()

Deleting Columns

  • Deleting columns: del df_grades["column name"]
  • Deleting multiple columns: df_grades.drop(["column name", "column name"], axis = 1, inplace= True)
    • Axis = 1 for deleting specific column, 0 for rows
    • inplace = whether to make change permanent (True) or not
      • False will return dataframe with specified columns deleted, not changing df_grades

Sorting DataFrames

  • df_data.sort_values (by = "column name", inplace = True/False, ascending = True/False)
    • Will sort the column specified in ascending/descending order, making permanent or not
  • Sorting data frames with tie breaker: df_data.sort_values (by = ["column name", "tie breaker column"], inplace = True/False, ascending = [True, True])
    • So if values in the “column name” are tied, then the “tie breaker column” will determine which to show first based on which is bigger/smaller
  • Setting index when reading in: pd.read_csv("file path", index_col = column index number)
    • Will be the column to use as the row labels for DataFrame

Date and Time

  • Converting to datetime: Two methods
    • Adjust when reading in: pd.read_csv("file path", index_col = X, parse_dates = ["column"])
      • Parse_dates will store the column type as datetime
    • Adjust after reading: df_data["column"] = pd.to_datetime(df_data["column"])
      • Wrap column in to_datetime() built in function to store as datetime
  • Time stamp attributes
    • create variable locating the date: datetime = df_data.loc[X, "Y"]
      • datetime.hour ; datetime.dayofweek ; datetime.year ; datetime.is_leap_year (returns boolean)
  • Creating new columns from datetime column:
    • df_data["DOW"] = df_data["datetime column"].dt.weekday_name or .dt.month
    • Need .dt if we are applying datetime attributes to a series