Mastering Data Manipulation with dplyr and reshape2 in R

This document demonstrates data manipulation techniques using the dplyr and reshape2 packages in R.

Introduction to dplyr

dplyr is much faster and easier to read than base R. Here are some of its key functions:

  • select(): Selects column variables based on their names.
  • filter(): Filters rows based on their values.
  • arrange(): Changes the ordering of the rows.
  • summarize(): Reduces multiple values down to a single summary.
  • mutate(): Creates columns that are functions of existing variables.

Let’s explore these functions with examples using the nycflights13 dataset.

  
    install.packages("dplyr")
    library(dplyr)
    install.packages("nycflights13")
    library(nycflights13)
  

Using select()

Select multiple columns:

  
    head(flights)
    select(flights, year, month, carrier)
  

Select columns by position:

  
    head(select(flights, 2))
    head(select(flights, 2:5))
  

Select columns using helper functions:

  
    head(select(flights, contains("rr")))
    head(select(flights, starts_with("d")))
    head(select(flights, ends_with("y")))
    head(select(flights, everything()))
  

Using filter()

Filter rows based on conditions (similar to the WHERE clause in SQL):

  
    head(filter(flights, month == 07))
    filter(flights, month == 09, day == 03)
    filter(flights, month == 09, day == 03, origin == "LGA")
  

Using mutate() and transmute()

Add new columns with mutate():

  
    over_delay <- mutate(flights, overall_delay = arr_delay - dep_delay)
  

Show only new columns with transmute():

  
    over_delay <- transmute(flights, overall_delay = arr_delay - dep_delay)
  

Using summarize() and group_by()

Calculate descriptive statistics with summarize():

  
    summarise(flights, avg_air_time = mean(air_time, na.rm = TRUE))
    summarise(flights, total_air_time = sum(air_time, na.rm = TRUE))
  

Group data and then summarize with group_by():

  
    by_carrier <- group_by(flights, carrier)
    summarise(by_carrier)
    by_air_time <- group_by(flights, air_time)
    summarise(by_air_time, avg_air_time = mean(air_time, na.rm = TRUE))
  

Using the Pipe Operator (%>%)

The pipe operator chains operations together. For example, let’s count young boys named “Taylor” using the babynames dataset:

  
    install.packages("babynames")
    library(babynames)
    babynames %>% filter(sex=="M", name == "Taylor") %>% select(n) %>% sum()
  

Using arrange()

Sort data with arrange():

  
    head(arrange(flights, year, dep_time))
    head(arrange(flights, desc(dep_time)))
  

Introduction to reshape2

The reshape2 package facilitates data reshaping between wide and long formats.

  • melt(): Transforms data from wide to long format.
  • dcast(): Transforms data from long to wide format.

Using melt()

  
    library(reshape2)
    melt(mtcars)
    melt(mtcars, id = c("gear", "carb"))
    melt(mtcars, variable.name = "foo", value.name = "bar") %>% head()
  

Using dcast()

  
    df <- data.frame(idvar = rep(1:20, times = 3),
                     variable = rep(c("height", "weight", "girth"), each = 20),
                     value = c(rnorm(20, mean = 5.5, sd=1),
                               rnorm(20, mean = 180, sd =20),
                               rnorm(20, mean = 30, sd =4)))
    dcast(df, idvar ~ variable)
    df2 <- df %>% mutate(idvar2 = idvar %% 2 , idvar = idvar %% 10)
    dcast(df2, idvar + idvar2 ~ variable, fun.aggregate = mean)
  

The fun.aggregate parameter in dcast() specifies the function to use for aggregation (e.g., mean, median, sum). By default, it uses length to count the number of rows.