DPLYR

DPLYR Cheat Sheet

One of the core packages of the tidyverse in the R programming language, dplyr is primarily a set of functions designed to enable dataframe manipulation in an intuitive, user-friendly way.

Syntax Cheat Sheet

Converts data to tbl class. tbl’s are easier to examine than
data frames. R displays only the data that fits onscreen:

dplyr::tbl_df(iris)

Information-dense summary of tbl data.

dplyr::glimpse(iris)

View data set in spreadsheet-like display (note capital V).

utils::View(iris)

Passes object on lef hand side as first argument (or .
argument) of function on righthand side.

dplyr::%>%

Reshaping Data Cheat Sheet

Gather columns into rows.

tidyr::gather(cases, "year", "n", 2:4)

Spread rows into columns.

tidyr::spread(pollution, size, amount)

Combine vectors into data frame (optimized).

dplyr::data_frame(a = 1:3, b = 4:6)

Order rows by values of a column (low to high).

dplyr::arrange(mtcars, mpg)

Order rows by values of a column (high to low).

dplyr::arrange(mtcars, desc(mpg)) 

Rename the columns of a data frame.

dplyr::rename(tb, y = year)

Separate one column into several.

tidyr::separate(storms, date, c("y", "m", "d"))

Unite several columns into one.

tidyr::unite(data, col, ..., sep)

Subset Observations Cheat Sheet

Extract rows that meet logical criteria.

dplyr::filter(iris, Sepal.Length > 7)

Remove duplicate rows.

dplyr::distinct(iris)

Randomly select a fraction of rows.

dplyr::sample_frac(iris, 0.5, replace = TRUE)

Randomly select n rows.

dplyr::sample_n(iris, 10, replace = TRUE)

Select rows by position.

dplyr::slice(iris, 10:15)

Select and order top n entries (by the group if grouped data).

dplyr::top_n(storms, 2, date)

Subset Variables Cheat Sheet

Select columns by name or helper function.

dplyr::select(iris, Sepal.Width, Petal.Length, Species)

Select columns whose name contains a character string.

select(iris, contains("."))

Select columns whose name ends with a character string.

select(iris, ends_with("Length"))

Select every column.

select(iris, everything())

Select columns whose name matches a regular expression.

select(iris, matches(".t."))

Select columns named x1, x2, x3, x4, x5.

select(iris, num_range("x", 1:5))

Select columns whose names are in a group of names.

select(iris, one_of(c("Species", "Genus")))

Select columns whose name starts with a character string.

select(iris, starts_with("Sepal"))

Select all columns between Sepal. Length and Petal.Width (inclusive).

select(iris, Sepal.Length:Petal.Width)

Select all columns except Species.

select(iris, -Species)

Summarise Data Cheat Sheet

Summarise data into a single row of values.

dplyr::summarise(iris, avg = mean(Sepal.Length))

Apply the summary function to each column.

dplyr::summarise_each(iris, funs(mean))

Count the number of rows with each unique value of a variable (with or without weights).

dplyr::count(iris, Species, wt = Sepal.Length)

First value of a vector.

dplyr::first

Last value of a vector.

dplyr::last

Nth value of a vector.

dplyr::nth

of values in a vector.

dplyr::n

of distinct values in a vector.

dplyr::n_distinct

Minimum value in a vector.

min

Maximum value in a vector.

max

Mean value of a vector.

mean

Median value of a vector.

median

The variance of a vector.

var

The standard deviation of a vector.

sd

Group Data Cheat Sheet

Group data into rows with the same value of Species.

dplyr::group_by(iris, Species)

Remove grouping information from a data frame.

dplyr::ungroup(iris)

Compute separate summary rows for each group.

iris %>% group_by(Species) %>% summarise(…)

Compute new variables by group

iris %>% group_by(Species) %>% mutate(…)

Combine Data Sets

Join matching rows from b to a.

dplyr::lef_join(a, b, by = "x1")

Join matching rows from a to b.

dplyr::right_join(a, b, by = "x1")

Join data. Retain only rows in both sets.

dplyr::inner_join(a, b, by = "x1")

Join data. Retain all values, all rows.

dplyr::full_join(a, b, by = "x1")

All rows in a that have a match in b.

dplyr::semi_join(a, b, by = "x1")

All rows in a that do not have a match in b.

dplyr::anti_join(a, b, by = "x1")

Rows that appear in both y and z.

dplyr::intersect(y, z)

Rows that appear in either or both y and z.

dplyr::union(y, z)

Rows that appear in y but not z.

dplyr::setdiff(y, z)

Append z to y as new rows.

dplyr::bind_rows(y, z)

Append z to y as new columns. Caution: matches rows by position.

dplyr::bind_cols(y, z)

Make New Variables

Compute and append one or more new columns.

dplyr::mutate(iris, sepal = Sepal.Length + Sepal. Width)

Apply the window function to each column.

dplyr::mutate_each(iris, funs(min_rank))

Compute one or more new columns. Drop original columns.

dplyr::transmute(iris, sepal = Sepal.Length + Sepal. Width)

Copy with values shifted by 1.

dplyr::lead

Copy with values lagged by 1.

dplyr::lag

Ranks with no gaps.

dplyr::dense_rank

Ranks. Ties get min rank.

dplyr::min_rank

Ranks rescaled to [0, 1].

dplyr::percent_rank

Ranks. Ties got to first value.

dplyr::row_number

Bin vector into n buckets.

dplyr::ntile

Are values between a and b?

dplyr::between

Cumulative distribution.

dplyr::cume_dist

Cumulative all

dplyr::cumall

Cumulative any

dplyr::cumany

Cumulative mean

dplyr::cummean

Cumulative sum

cumsum

Cumulative max

cummax

Cumulative min

cummin

Cumulative prod

cumprod

Element-wise max

pmax

Element-wise min

pmin