Code
# install from CRAN
install.packages("nplyr")
# install from github
::install_github("markjrieke/nplyr") devtools
A grammar of (nested) data manipulation
Mark Rieke
July 24, 2022
Data manipulation and transformation is a fundamental part of any analysis. There are excellent tools in the R ecosystem for manipulating data frames (dplyr, data.table, and arrow, to name a few). Sometimes, however, it is desirable to work with nested data frames, for which few tools are readily available.
This is where nplyr comes into play! nplyr is a grammar of nested data manipulation that allows users to perform dplyr-like manipulations on data frames nested within a list-col of another data frame. Most dplyr verbs have nested equivalents in nplyr. For example:
nest_mutate()
is the nested equivalent of mutate()
nest_select()
is the nested equivalent of select()
nest_filter()
is the nested equivalent of filter()
nest_summarise()
is the nested equivalent of summarise()
nest_group_by()
is the nested equivalent of group_by()
nplyr 0.1.0 is available on CRAN. Alternatively, you can install the development version from github with the devtools or remotes package:
To get started, we’ll create a nested column for the country data within each continent from the gapminder dataset.
#> # A tibble: 6 × 2
#> continent country_data
#> <fct> <list>
#> 1 Asia <tibble [578 × 5]>
#> 2 Europe <tibble [1,302 × 5]>
#> 3 Africa <tibble [637 × 5]>
#> 4 Americas <tibble [470 × 5]>
#> 5 FSU <tibble [139 × 5]>
#> 6 Oceania <tibble [187 × 5]>
dplyr can perform operations on the top-level data frame, but with nplyr, we can perform operations on the nested data frames:
#> # A tibble: 6 × 2
#> continent country_data
#> <fct> <list>
#> 1 Asia <tibble [43 × 6]>
#> 2 Europe <tibble [34 × 6]>
#> 3 Africa <tibble [53 × 6]>
#> 4 Americas <tibble [33 × 6]>
#> 5 FSU <tibble [9 × 6]>
#> 6 Oceania <tibble [11 × 6]>
#> # A tibble: 43 × 7
#> continent country year lifeExp pop gdpPercap pop_millions
#> <fct> <fct> <int> <dbl> <int> <dbl> <dbl>
#> 1 Asia Afghanistan 2007 43.8 31889923 975. 31.9
#> 2 Asia Azerbaijan 2007 67.5 8017309 7709. 8.02
#> 3 Asia Bahrain 2007 75.6 708573 29796. 0.709
#> 4 Asia Bangladesh 2007 64.1 150448339 1391. 150.
#> 5 Asia Bhutan 2007 65.6 2327849 4745. 2.33
#> 6 Asia Brunei 2007 77.1 386511 48015. 0.387
#> 7 Asia Cambodia 2007 59.7 14131858 1714. 14.1
#> 8 Asia China 2007 73.0 1318683096 4959. 1319.
#> 9 Asia Hong Kong, China 2007 82.2 6980412 39725. 6.98
#> 10 Asia India 2007 64.7 1110396331 2452. 1110.
#> # … with 33 more rows
nplyr also supports grouped operations with nest_group_by()
:
#> # A tibble: 6 × 2
#> continent country_data
#> <fct> <list>
#> 1 Asia <tibble [58 × 5]>
#> 2 Europe <tibble [58 × 5]>
#> 3 Africa <tibble [13 × 5]>
#> 4 Americas <tibble [57 × 5]>
#> 5 FSU <tibble [44 × 5]>
#> 6 Oceania <tibble [56 × 5]>
#> # A tibble: 58 × 6
#> continent year n lifeExp pop gdpPercap
#> <fct> <int> <int> <dbl> <dbl> <dbl>
#> 1 Europe 1950 22 65.8 7408264 6343.
#> 2 Europe 1951 18 65.7 7165515 6509.
#> 3 Europe 1952 31 65.9 7124673 5210.
#> 4 Europe 1953 17 67.3 7346100 6774.
#> 5 Europe 1954 17 68.0 7423300 7046.
#> 6 Europe 1955 17 68.5 7499400 7817.
#> 7 Europe 1956 17 68.5 7575800 8224.
#> 8 Europe 1957 31 67.5 7363802 6093.
#> 9 Europe 1958 18 69.6 8308052. 8833.
#> 10 Europe 1959 18 69.6 8379664. 9088.
#> # … with 48 more rows
In the previous set of examples, the output from nplyr’s nested operations could be obtained by unnesting and performing grouped dplyr operations.
#> # A tibble: 43 × 7
#> continent country year lifeExp pop gdpPercap pop_millions
#> <fct> <fct> <int> <dbl> <int> <dbl> <dbl>
#> 1 Asia Afghanistan 2007 43.8 31889923 975. 31.9
#> 2 Asia Azerbaijan 2007 67.5 8017309 7709. 8.02
#> 3 Asia Bahrain 2007 75.6 708573 29796. 0.709
#> 4 Asia Bangladesh 2007 64.1 150448339 1391. 150.
#> 5 Asia Bhutan 2007 65.6 2327849 4745. 2.33
#> 6 Asia Brunei 2007 77.1 386511 48015. 0.387
#> 7 Asia Cambodia 2007 59.7 14131858 1714. 14.1
#> 8 Asia China 2007 73.0 1318683096 4959. 1319.
#> 9 Asia Hong Kong, China 2007 82.2 6980412 39725. 6.98
#> 10 Asia India 2007 64.7 1110396331 2452. 1110.
#> # … with 33 more rows
#> # A tibble: 43 × 7
#> continent country year lifeExp pop gdpPercap pop_millions
#> <fct> <fct> <int> <dbl> <int> <dbl> <dbl>
#> 1 Asia Afghanistan 2007 43.8 31889923 975. 31.9
#> 2 Asia Azerbaijan 2007 67.5 8017309 7709. 8.02
#> 3 Asia Bahrain 2007 75.6 708573 29796. 0.709
#> 4 Asia Bangladesh 2007 64.1 150448339 1391. 150.
#> 5 Asia Bhutan 2007 65.6 2327849 4745. 2.33
#> 6 Asia Brunei 2007 77.1 386511 48015. 0.387
#> 7 Asia Cambodia 2007 59.7 14131858 1714. 14.1
#> 8 Asia China 2007 73.0 1318683096 4959. 1319.
#> 9 Asia Hong Kong, China 2007 82.2 6980412 39725. 6.98
#> 10 Asia India 2007 64.7 1110396331 2452. 1110.
#> # … with 33 more rows
Why, then, might we need to use nplyr? Well, in other scenarios, it may be far more convenient to work with nested data frames or it may not even be possible to unnest!
Consider a set of surveys that an organization might use to gather market data. It is common for organization to have separate surveys for separate purposes but to gather the same baseline set of data across all surveys (for example , a respondent’s age and gender may be recorded across all surveys, but each survey will have a different set of questions). Let’s use two fake surveys with the below questions for this example:
In this scenario, both surveys are collecting demographic information — age, location, and industry — but differ in the remaining questions. A convenient way to get the response files into the environment would be to use purrr::map()
to read each file to a nested data frame.
#> # A tibble: 2 × 2
#> survey_file survey_data
#> <chr> <list>
#> 1 job_survey <spc_tbl_ [500 × 6]>
#> 2 personal_survey <spc_tbl_ [750 × 6]>
tidyr::unnest()
can usually handle idiosyncracies in layout when unnesting, but in this case unnesting throws an error!
#> Error in `list_unchop()`:
#> ! Can't combine `x[[1]]$Q5` <double> and `x[[2]]$Q5` <character>.
This is because the surveys share column names but not necessarily column types! In this case, both data frames contain a column named Q5
, but in job_survey
it’s a double and in personal_survey
it’s a character.
#> Rows: 500
#> Columns: 7
#> $ survey_file <chr> "job_survey", "job_survey", "job_survey", "job_survey", "j…
#> $ survey_name <chr> "job", "job", "job", "job", "job", "job", "job", "job", "j…
#> $ Q1 <dbl> 100, 81, 51, 81, 80, 32, 65, 57, 43, 94, 25, 83, 61, 66, 8…
#> $ Q2 <chr> "Austin", "San Antonio", "Austin", "Austin", "Dallas", "Fo…
#> $ Q3 <chr> "Consulting", "Consulting", "Consulting", "Technology", "C…
#> $ Q4 <chr> "Somewhat dissatisfied", "Neither satisfied nor dissatisfi…
#> $ Q5 <dbl> 163, 48, 190, 25, 143, 233, 43, 243, 158, 235, 245, 195, 2…
#> Rows: 750
#> Columns: 7
#> $ survey_file <chr> "personal_survey", "personal_survey", "personal_survey", "…
#> $ survey_name <chr> "personal", "personal", "personal", "personal", "personal"…
#> $ Q1 <dbl> 91, 32, 40, 23, 88, 69, 96, 40, 57, 40, 39, 70, 29, 38, 57…
#> $ Q2 <chr> "Austin", "San Antonio", "San Antonio", "Austin", "Dallas"…
#> $ Q3 <chr> "Energy", "Healthcare", "Consulting", "Consulting", "Techn…
#> $ Q4 <chr> "Neither satisfied nor dissatisfied", "Extremely satisfied…
#> $ Q5 <chr> "Blandit eros! A, ligula facilisis imperdiet! Interdum pla…
We could potentially get around this issue with unnesting by reading in all columns as characters via readr::read_csv(x, col_types = cols(.default = "c"))
, but this presents its own challenges. Q5
would still be better represented as a double in job_survey
and, from the survey question text, Q4
has similar, but distinctly different, meanings across the survey files.
This is where nplyr can assist! Rather than malign the data types or create separate objects for each survey file, we can use nplyr to perform operations directly on the nested data frames.
surveys <-
surveys %>%
nest_mutate(survey_data,
age_group = if_else(Q1 < 65, "Adult", "Retirement Age")) %>%
nest_group_by(survey_data, Q3) %>%
nest_add_count(survey_data,
name = "n_respondents_in_industry") %>%
nest_mutate(survey_data,
median_industry_age = median(Q1)) %>%
nest_ungroup(survey_data)
surveys %>%
slice(1) %>%
tidyr::unnest(survey_data)
#> # A tibble: 500 × 10
#> survey_file survey_name Q1 Q2 Q3 Q4 Q5 age_g…¹ n_res…² media…³
#> <chr> <chr> <dbl> <chr> <chr> <chr> <dbl> <chr> <int> <dbl>
#> 1 job_survey job 100 Aust… Cons… Some… 163 Retire… 107 57
#> 2 job_survey job 81 San … Cons… Neit… 48 Retire… 107 57
#> 3 job_survey job 51 Aust… Cons… Extr… 190 Adult 107 57
#> 4 job_survey job 81 Aust… Tech… Extr… 25 Retire… 108 61.5
#> 5 job_survey job 80 Dall… Cons… Extr… 143 Retire… 107 57
#> 6 job_survey job 32 Fort… Ener… Some… 233 Adult 99 59
#> 7 job_survey job 65 Dall… Cons… Some… 43 Retire… 107 57
#> 8 job_survey job 57 Hous… Heal… Some… 243 Adult 75 65
#> 9 job_survey job 43 Dall… Gove… Neit… 158 Adult 111 57
#> 10 job_survey job 94 Fort… Heal… Extr… 235 Retire… 75 65
#> # … with 490 more rows, and abbreviated variable names ¹age_group,
#> # ²n_respondents_in_industry, ³median_industry_age
#> # A tibble: 750 × 10
#> survey_file surve…¹ Q1 Q2 Q3 Q4 Q5 age_g…² n_res…³ media…⁴
#> <chr> <chr> <dbl> <chr> <chr> <chr> <chr> <chr> <int> <dbl>
#> 1 personal_survey person… 91 Aust… Ener… Neit… Blan… Retire… 145 61
#> 2 personal_survey person… 32 San … Heal… Extr… Elem… Adult 131 62
#> 3 personal_survey person… 40 San … Cons… Some… Eget… Adult 149 61
#> 4 personal_survey person… 23 Aust… Cons… Extr… Scel… Adult 149 61
#> 5 personal_survey person… 88 Dall… Tech… Neit… Aene… Retire… 150 61
#> 6 personal_survey person… 69 Fort… Tech… Neit… Inte… Retire… 150 61
#> 7 personal_survey person… 96 Hous… Heal… Extr… Blan… Retire… 131 62
#> 8 personal_survey person… 40 Hous… Cons… Extr… Scel… Adult 149 61
#> 9 personal_survey person… 57 Fort… Ener… Extr… Pede… Adult 145 61
#> 10 personal_survey person… 40 Fort… Heal… Extr… Phar… Adult 131 62
#> # … with 740 more rows, and abbreviated variable names ¹survey_name,
#> # ²age_group, ³n_respondents_in_industry, ⁴median_industry_age
@online{rieke2022,
author = {Rieke, Mark},
title = {Introducing \{Nplyr\}},
date = {2022-07-24},
url = {https://www.thedatadiary.net/posts/2022-07-24-introducing-nplyr/},
langid = {en}
}