Mastering Data Manipulation with dplyr and reshape2 in R

Posted on Dec 24, 2024 in Technology

This document demonstrates data manipulation techniques using the dplyr and reshape2 packages in R.

Introduction to dplyr

dplyr is much faster and easier to read than base R. Here are some of its key functions:

select(): Selects column variables based on their names.
filter(): Filters rows based on their values.
arrange(): Changes the ordering of the rows.
summarize(): Reduces multiple values down to a single summary.
mutate(): Creates columns that are functions of existing variables.

Let’s explore these functions with examples using the nycflights13 dataset.

  
    install.packages("dplyr")
    library(dplyr)
    install.packages("nycflights13")
    library(nycflights13)

Using `select()`

Select multiple columns:

  
    head(flights)
    select(flights, year, month, carrier)

Select columns by position:

  
    head(select(flights, 2))
    head(select(flights, 2:5))

Select columns using helper functions:

  
    head(select(flights, contains("rr")))
    head(select(flights, starts_with("d")))
    head(select(flights, ends_with("y")))
    head(select(flights, everything()))

Using `filter()`

Filter rows based on conditions (similar to the WHERE clause in SQL):

  
    head(filter(flights, month == 07))
    filter(flights, month == 09, day == 03)
    filter(flights, month == 09, day == 03, origin == "LGA")

Using `mutate()` and `transmute()`

Add new columns with mutate():

  
    over_delay <- mutate(flights, overall_delay = arr_delay - dep_delay)

Show only new columns with transmute():

  
    over_delay <- transmute(flights, overall_delay = arr_delay - dep_delay)

Using `summarize()` and `group_by()`

Calculate descriptive statistics with summarize():

  
    summarise(flights, avg_air_time = mean(air_time, na.rm = TRUE))
    summarise(flights, total_air_time = sum(air_time, na.rm = TRUE))

Group data and then summarize with group_by():

  
    by_carrier <- group_by(flights, carrier)
    summarise(by_carrier)
    by_air_time <- group_by(flights, air_time)
    summarise(by_air_time, avg_air_time = mean(air_time, na.rm = TRUE))

Using the Pipe Operator (`%>%`)

The pipe operator chains operations together. For example, let’s count young boys named “Taylor” using the babynames dataset:

  
    install.packages("babynames")
    library(babynames)
    babynames %>% filter(sex=="M", name == "Taylor") %>% select(n) %>% sum()

Using `arrange()`

Sort data with arrange():

  
    head(arrange(flights, year, dep_time))
    head(arrange(flights, desc(dep_time)))

Introduction to reshape2

The reshape2 package facilitates data reshaping between wide and long formats.

melt(): Transforms data from wide to long format.
dcast(): Transforms data from long to wide format.

Using `melt()`

  
    library(reshape2)
    melt(mtcars)
    melt(mtcars, id = c("gear", "carb"))
    melt(mtcars, variable.name = "foo", value.name = "bar") %>% head()

Using `dcast()`

  
    df <- data.frame(idvar = rep(1:20, times = 3),
                     variable = rep(c("height", "weight", "girth"), each = 20),
                     value = c(rnorm(20, mean = 5.5, sd=1),
                               rnorm(20, mean = 180, sd =20),
                               rnorm(20, mean = 30, sd =4)))
    dcast(df, idvar ~ variable)
    df2 <- df %>% mutate(idvar2 = idvar %% 2 , idvar = idvar %% 10)
    dcast(df2, idvar + idvar2 ~ variable, fun.aggregate = mean)

The fun.aggregate parameter in dcast() specifies the function to use for aggregation (e.g., mean, median, sum). By default, it uses length to count the number of rows.

Mastering Data Manipulation with dplyr and reshape2 in R

Introduction to dplyr

Using `select()`

Using `filter()`

Using `mutate()` and `transmute()`

Using `summarize()` and `group_by()`

Using the Pipe Operator (`%>%`)

Using `arrange()`

Introduction to reshape2

Using `melt()`

Using `dcast()`

Recent Notes

Subjects

Publicidad

Mastering Data Manipulation with dplyr and reshape2 in R

Introduction to dplyr

Using select()

Using filter()

Using mutate() and transmute()

Using summarize() and group_by()

Using the Pipe Operator (%>%)

Using arrange()

Introduction to reshape2

Using melt()

Using dcast()

Recent Notes

Subjects

Publicidad

Using `select()`

Using `filter()`

Using `mutate()` and `transmute()`

Using `summarize()` and `group_by()`

Using the Pipe Operator (`%>%`)

Using `arrange()`

Using `melt()`

Using `dcast()`