Mastering Data Manipulation with dplyr and reshape2 in R
This document demonstrates data manipulation techniques using the dplyr
and reshape2
packages in R.
Introduction to dplyr
dplyr
is much faster and easier to read than base R. Here are some of its key functions:
select()
: Selects column variables based on their names.filter()
: Filters rows based on their values.arrange()
: Changes the ordering of the rows.summarize()
: Reduces multiple values down to a single summary.mutate()
: Creates columns that are functions of existing variables.
Let’s explore these functions with examples using the nycflights13
dataset.
install.packages("dplyr")
library(dplyr)
install.packages("nycflights13")
library(nycflights13)
Using select()
Select multiple columns:
head(flights)
select(flights, year, month, carrier)
Select columns by position:
head(select(flights, 2))
head(select(flights, 2:5))
Select columns using helper functions:
head(select(flights, contains("rr")))
head(select(flights, starts_with("d")))
head(select(flights, ends_with("y")))
head(select(flights, everything()))
Using filter()
Filter rows based on conditions (similar to the WHERE
clause in SQL):
head(filter(flights, month == 07))
filter(flights, month == 09, day == 03)
filter(flights, month == 09, day == 03, origin == "LGA")
Using mutate()
and transmute()
Add new columns with mutate()
:
over_delay <- mutate(flights, overall_delay = arr_delay - dep_delay)
Show only new columns with transmute()
:
over_delay <- transmute(flights, overall_delay = arr_delay - dep_delay)
Using summarize()
and group_by()
Calculate descriptive statistics with summarize()
:
summarise(flights, avg_air_time = mean(air_time, na.rm = TRUE))
summarise(flights, total_air_time = sum(air_time, na.rm = TRUE))
Group data and then summarize with group_by()
:
by_carrier <- group_by(flights, carrier)
summarise(by_carrier)
by_air_time <- group_by(flights, air_time)
summarise(by_air_time, avg_air_time = mean(air_time, na.rm = TRUE))
Using the Pipe Operator (%>%
)
The pipe operator chains operations together. For example, let’s count young boys named “Taylor” using the babynames
dataset:
install.packages("babynames")
library(babynames)
babynames %>% filter(sex=="M", name == "Taylor") %>% select(n) %>% sum()
Using arrange()
Sort data with arrange()
:
head(arrange(flights, year, dep_time))
head(arrange(flights, desc(dep_time)))
Introduction to reshape2
The reshape2
package facilitates data reshaping between wide and long formats.
melt()
: Transforms data from wide to long format.dcast()
: Transforms data from long to wide format.
Using melt()
library(reshape2)
melt(mtcars)
melt(mtcars, id = c("gear", "carb"))
melt(mtcars, variable.name = "foo", value.name = "bar") %>% head()
Using dcast()
df <- data.frame(idvar = rep(1:20, times = 3),
variable = rep(c("height", "weight", "girth"), each = 20),
value = c(rnorm(20, mean = 5.5, sd=1),
rnorm(20, mean = 180, sd =20),
rnorm(20, mean = 30, sd =4)))
dcast(df, idvar ~ variable)
df2 <- df %>% mutate(idvar2 = idvar %% 2 , idvar = idvar %% 10)
dcast(df2, idvar + idvar2 ~ variable, fun.aggregate = mean)
The fun.aggregate
parameter in dcast()
specifies the function to use for aggregation (e.g., mean
, median
, sum
). By default, it uses length
to count the number of rows.