Data Rectangling with jq

“Data rectangling”: the process of turning highly nested data structures (e.g. JSON, XML) into a tabular format.

“Data rectangling” is a brilliant turn of phrase coined by Jenny Bryan (UBC, RStudio), a leader in the #rstats community. The recording and slides of Jenny’s talk on the subject give a much better introduction to the idea and to working with it in R, particularly through the purrr package.

As nice as purrr is for the task, I’ve recently found that the jqr package from Scott Chamberlain and co. can be a much easier way to go about rectangling your JSON. Here’s a quick comparison based on an example from the lesson Hadley and Jenny have on Data Rectangling.

#devtools::install_github("jennybc/repurrrsive")
library(jsonlite)
library(tidyverse)
library(repurrrsive)
library(jqr)

Using purrr

# purrr::flatten() is namespaced because jsonlite also exports a flatten()
gh_flat <- gh_repos %>% purrr::flatten()  # abandon nested structure and hope we didn't need it

gh_tibble <- tibble(
  name =     gh_flat %>% map_chr("name"),
  issues =   gh_flat %>% map_int("open_issues_count"),
  wiki =     gh_flat %>% map_lgl("has_wiki"),
  homepage = gh_flat %>% map_chr("homepage", .default = ""),
  owner =    gh_flat %>% map_chr(c("owner", "login"))
)

gh_tibble

# A tibble: 176 x 5
   name        issues wiki  homepage owner
 1 after            0 TRUE  ""       gaborcsardi
 2 argufy           6 TRUE  ""       gaborcsardi
 3 ask              4 TRUE  ""       gaborcsardi
 4 baseimports      0 TRUE  ""       gaborcsardi
 5 citest           0 TRUE  ""       gaborcsardi
 6 clisymbols       0 TRUE  ""       gaborcsardi
 7 cmaker           0 TRUE  ""       gaborcsardi
 8 cmark            0 TRUE  ""       gaborcsardi
 9 conditions       0 TRUE  ""       gaborcsardi
10 crayon           7 TRUE  ""       gaborcsardi
# … with 166 more rows

Note that with purrr we need to be explicit about both the type of each column (map_chr, map_int, map_lgl) and the default for missing values (.default).
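To make that point concrete, here is a minimal sketch on a hypothetical two-repo list (the repo names and URL are made up for illustration, not taken from gh_repos). It assumes only purrr is installed. The second repo has a NULL homepage, which is how a JSON null ends up in an R list.

```r
library(purrr)

# Hypothetical stand-in for gh_flat: two repos, one with a missing homepage
repos <- list(
  list(name = "repo-a", open_issues_count = 0L, homepage = "https://example.org"),
  list(name = "repo-b", open_issues_count = 6L, homepage = NULL)
)

# Without a default, the NULL homepage is not a length-one string,
# so map_chr() raises an error; we catch it to show what happened
strict <- tryCatch(
  map_chr(repos, "homepage"),
  error = function(e) "error"
)

# With an explicit default, the gap is filled with a value of the right type
homepages <- map_chr(repos, "homepage", .default = NA_character_)

# The typed mapper also pins down the column type
issues <- map_int(repos, "open_issues_count")
```

Here strict ends up holding the string "error", while homepages is a character vector whose second element is NA.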

Using jqr

Note that with jq we can simply rely on the types already encoded in the JSON (int, lgl, chr), rather than declaring them column by column.

f <- system.file("extdata/gh_repos.json", package="repurrrsive")

read_file(f) %>%
  jq('.[][] | {
    name: .name,
    issues: .open_issues_count,
    wiki: .has_wiki,
    homepage: .homepage,
    owner: .owner.login
  }') %>%
  jqr::combine() %>%  # combine the stream of objects into a single JSON array
  fromJSON()
     name                                 issues  wiki   homepage                                               owner
  1  after                                     0  TRUE                                                          gaborcsardi
  2  argufy                                    6  TRUE                                                          gaborcsardi
  3  ask                                       4  TRUE                                                          gaborcsardi
  4  baseimports                               0  TRUE                                                          gaborcsardi
  5  citest                                    0  TRUE                                                          gaborcsardi
  6  clisymbols                                0  TRUE                                                          gaborcsardi
  7  cmaker                                    0  TRUE                                                          gaborcsardi
  8  cmark                                     0  TRUE                                                          gaborcsardi
  9  conditions                                0  TRUE                                                          gaborcsardi
 10  crayon                                    7  TRUE                                                          gaborcsardi
 11  debugme                                   4  TRUE                                                          gaborcsardi
 12  diffobj                                   0  TRUE                                                          gaborcsardi
 13  disposables                               2  TRUE                                                          gaborcsardi
 14  dotenv                                    1  TRUE                                                          gaborcsardi
 15  elasticsearch-jetty                       0  TRUE                                                          gaborcsardi
 16  falsy                                     0  TRUE                                                          gaborcsardi
 17  fswatch                                   0  TRUE                                                          gaborcsardi
 18  gh                                        8  TRUE                                                          gaborcsardi
 19  gitty                                     0  TRUE                                                          gaborcsardi
 20  ISA                                       2  TRUE                                                          gaborcsardi
 21  keypress                                  1  TRUE                                                          gaborcsardi
 22  lintr                                     0  TRUE                                                          gaborcsardi
 23  macBriain                                 0  TRUE                                                          gaborcsardi
 24  maxygen                                   2  TRUE                                                          gaborcsardi
 25  MISO                                      6  TRUE   http://genes.mit.edu/burgelab/miso/index.html          gaborcsardi
 26  parr                                     14  FALSE                                                         gaborcsardi
 27  parsedate                                 3  TRUE                                                          gaborcsardi
 28  pingr                                     2  TRUE                                                          gaborcsardi
 29  pkgconfig                                 1  TRUE                                                          gaborcsardi
 30  playground                                1  TRUE                                                          gaborcsardi
 31  2013-11_sfu                               1  TRUE                                                          jennybc
 32  2014-01-27-miami                          4  TRUE                                                          jennybc
 33  2014-05-12-ubc                            1  TRUE                                                          jennybc
 34  2015-02-23_bryan-fields-talk              0  TRUE                                                          jennybc
 35  2015-06-28_r-summit-talk                  0  TRUE                                                          jennybc
 36  2015-08_bryan-jsm-stat-data-sci-talk      0  TRUE                                                          jennybc
 37  2015_Coartic                              0  FALSE                                                         jennybc
 38  2016-06_spreadsheets                      0  TRUE   https://speakerdeck.com/jennybc/spreadsheets           jennybc
 39  2016-07_data-carpentry-uzh                0  TRUE   https://markrobinsonuzh.github.io/2016-07-18-zurich/   jennybc
 40  545A_hw06                                 0  TRUE                                                          jennybc
 41  access-r-source                           0  TRUE                                                          jennybc
 42  adv-r                                     0  TRUE                                                          jennybc
 43  analyze-github-stuff-with-r               0  TRUE                                                          jennybc
 44  arms-length-render                        0  TRUE                                                          jennybc
 45  assertr                                   0  TRUE                                                          jennybc
 46  bellybutton                               0  TRUE                                                          jennybc
 47  bingo                                     3  TRUE   http://daattali.com/shiny/bingo/                       jennybc
 48  bioinformatics.ca-swc-r                   0  TRUE                                                          jennybc
 49  BIRS_13w5083                              0  TRUE                                                          jennybc
 50  blarg                                     0  TRUE                                                          jennybc
 51  bookdown                                  0  FALSE  https://bookdown.org                                   jennybc
 52  boot-camps                                0  TRUE                                                          jennybc
 53  candy                                     2  TRUE                                                          jennybc
 54  CoffeeCoop                                0  TRUE                                                          jennybc
 55  datacarpentry                             0  TRUE                                                          jennybc
 56  ddpcr                                     0  FALSE  http://daattali.com/shiny/ddpcr/                       jennybc
 57  devtools                                  0  TRUE                                                          jennybc
 58  diffr                                     0  TRUE                                                          jennybc
 59  dplyr                                     0  TRUE                                                          jennybc
 60  eigencoder                                0  TRUE   http://trestletech.com/2016/03/09/eigencoder/          jennybc
 61  advdatasci                                0  TRUE                                                          jtleek
 62  advdatasci-swirl                          1  TRUE                                                          jtleek
 63  advdatasci16                              0  TRUE                                                          jtleek
 64  advdatasci_swirl                          1  TRUE                                                          jtleek
 65  ballgown                                  0  TRUE                                                          jtleek
 66  capitalIn21stCenturyinR                   0  TRUE                                                          jtleek
 67  careerplanning                            0  TRUE                                                          jtleek
 68  dataanalysis                              5  TRUE                                                          jtleek
 69  datascientist                             0  TRUE                                                          jtleek
 70  datasharing                             399  TRUE                                                          jtleek
 71  datawomenontwitter                        1  TRUE                                                          jtleek
 72  derfinder                                 0  TRUE                                                          jtleek
 73  derfinder-1                               0  TRUE                                                          jtleek
 74  DSM                                       0  TRUE                                                          jtleek
 75  EDA-Project                               0  TRUE                                                          jtleek
 76  firstpaper                                0  TRUE                                                          jtleek
 77  futureofstats                             1  TRUE                                                          jtleek
 78  genomicspapers                            1  TRUE                                                          jtleek
 79  genstats                                  3  TRUE                                                          jtleek
 80  genstats_site                             0  TRUE                                                          jtleek
 81  googleCite                                0  TRUE                                                          jtleek
 82  graduate                                  0  TRUE                                                          jtleek
 83  healthvis                                 0  TRUE                                                          jtleek
 84  hyde                                      0  FALSE  http://hyde.getpoole.com                               jtleek
 85  inclassfeb62014                           0  TRUE                                                          jtleek
 86  jhsph753                                  0  TRUE                                                          jtleek
 87  jhsph753and4                              0  TRUE                                                          jtleek
 88  jhudash                                   0  TRUE                                                          jtleek
 89  jhudash-refugee                           0  TRUE                                                          jtleek
 90  jtleek.github.io                          0  TRUE                                                          jtleek
 91  2016-14                                   0  TRUE                                                          juliasilge
 92  choroplethrCaCensusTract                  0  TRUE                                                          juliasilge
 93  choroplethrUTCensusTract                  0  TRUE                                                          juliasilge
 94  CountyHealthApp                           0  TRUE                                                          juliasilge
 95  data-police-shootings                     0  TRUE                                                          juliasilge
 96  ExData_Plotting1                          0  TRUE                                                          juliasilge
 97  fall2016competition                       0  TRUE                                                          juliasilge
 98  ggthemes                                  0  TRUE                                                          juliasilge
 99  human_activity_smartphones                0  TRUE                                                          juliasilge
100  janeaustenr                               0  TRUE                                                          juliasilge
101  juliasilge.github.io                      0  TRUE   http://juliasilge.com/                                 juliasilge
102  leaflet                                   0  TRUE   http://rstudio.github.io/leaflet                       juliasilge
103  learning-python                           0  TRUE                                                          juliasilge
104  learning-sql                              0  TRUE                                                          juliasilge
105  minimal-mistakes                          0  TRUE                                                          juliasilge
106  nasanotebooks                             0  TRUE                                                          juliasilge
107  neissapp                                  0  TRUE                                                          juliasilge
108  populationapp                             0  TRUE                                                          juliasilge
109  PredictNamesApp                           0  TRUE                                                          juliasilge
110  ProgrammingAssignment2                    0  TRUE                                                          juliasilge
111  r-travis                                  0  TRUE                                                          juliasilge
112  RepData_PeerAssessment1                   0  TRUE                                                          juliasilge
113  SLCWaterMapping                           0  TRUE                                                          juliasilge
114  tidytext                                  5  TRUE                                                          juliasilge
115  unconf16                                  0  TRUE   http://unconf16.ropensci.org                           juliasilge
116  WeightLiftingMachineLearning              0  TRUE                                                          juliasilge
117  ampolcourse                               0  TRUE   http://www.thomasleeper.com/ampolcourse                leeper
118  apsa-leeper.bst                           0  TRUE                                                          leeper
119  arco                                      0  TRUE                                                          leeper
120  astrojs                                   0  TRUE                                                          leeper
121  batman                                    0  TRUE                                                          leeper
122  choco-r-devel                             0  TRUE                                                          leeper
123  choco-rtools                              0  TRUE                                                          leeper
124  ciplotm                                   1  TRUE                                                          leeper
125  colourlovers                              1  FALSE  http://cloud.r-project.org/package=colourlovers        leeper
126  conflictcourse                            0  TRUE                                                          leeper
127  congressional-district-boundaries         0  TRUE   cdmaps.polisci.ucla.edu                                leeper
128  cowsay                                    0  TRUE                                                          leeper
129  crandatapkgs                             12  TRUE                                                          leeper
130  csvy                                      2  TRUE                                                          leeper
131  data-versioning                           0  TRUE                                                          leeper
132  dataverse-1                               0  FALSE  http://dataverse.org                                   leeper
133  designcourse                              1  TRUE                                                          leeper
134  devtools                                  0  TRUE                                                          leeper
135  dkstat                                    0  TRUE                                                          leeper
136  docthis                                   0  TRUE                                                          leeper
137  drat                                      0  TRUE                                                          leeper
138  dvn                                       0  TRUE   http://cran.r-project.org/web/packages/dvn/index.html  leeper
139  effect-heterogeneity                      0  TRUE                                                          leeper
140  expcourse                                 0  TRUE   http://www.thomasleeper.com/expcourse                  leeper
141  exppolcourse                              0  TRUE                                                          leeper
142  expResults                                1  TRUE                                                          leeper
143  GK2011                                    0  TRUE                                                          leeper
144  GREA                                      0  TRUE                                                          leeper
145  hints                                     1  FALSE                                                         leeper
146  Impressive                                0  TRUE                                                          leeper
147  aqi_pdf                                   2  TRUE                                                          masalmon
148  catan_card_game                           0  TRUE                                                          masalmon
149  colourlovers_patterns                     1  TRUE                                                          masalmon
150  convertagd                                1  TRUE                                                          masalmon
151  cpcb                                      5  TRUE                                                          masalmon
152  domar_datos                               0  TRUE                                                          masalmon
153  duststorm                                 1  TRUE                                                          masalmon
154  EML                                       0  TRUE                                                          masalmon
155  fietsen                                   2  TRUE                                                          masalmon
156  first_7_jobs                              0  TRUE                                                          masalmon
157  geoparsing_tweets                         0  TRUE                                                          masalmon
158  ggExtra                                   0  TRUE   http://daattali.com/shiny/ggExtra-ggMarginal-demo/     masalmon
159  india_trains                              0  TRUE                                                          masalmon
160  janeausten                                0  TRUE                                                          masalmon
161  jss_genderizer                            0  TRUE                                                          masalmon
162  kervillebourg                             0  TRUE                                                          masalmon
163  laads                                     4  TRUE                                                          masalmon
164  masalmon.github.io                        0  TRUE   http://masalmon.github.io/                             masalmon
165  onboarding                                0  TRUE                                                          masalmon
166  openaq_figures                            2  TRUE                                                          masalmon
167  r-appveyor                                0  TRUE                                                          masalmon
168  railways                                  0  TRUE                                                          masalmon
169  RealTimeVsHistoric                        0  TRUE                                                          masalmon
170  rtimicropem                               5  TRUE                                                          masalmon
171  songlyrics                                0  TRUE                                                          masalmon
172  usaqmindia                                0  TRUE                                                          masalmon
173  watchme                                   1  TRUE                                                          masalmon
174  who_aq_db                                 0  TRUE                                                          masalmon
175  worldbank_data                            0  TRUE                                                          masalmon
176  youtubedata                               0  TRUE                                                          masalmon
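To see the typing claim on something small enough to inspect, here is a sketch that runs the same jq program on a hypothetical miniature of the file: one user with two repos. The repo names, owner, and URL are invented. It assumes jqr and jsonlite are installed and uses plain function calls instead of the pipe.

```r
library(jqr)
library(jsonlite)

# Hypothetical miniature of gh_repos.json: a list of users, each a list of repos
json <- '[[
  {"name": "repo-a", "open_issues_count": 0, "has_wiki": true,
   "homepage": null, "owner": {"login": "someuser"}},
  {"name": "repo-b", "open_issues_count": 6, "has_wiki": false,
   "homepage": "https://example.org", "owner": {"login": "someuser"}}
]]'

# .[][] walks users, then repos; the braces build one flat object per repo
rows <- jq(json, '.[][] | {
  name: .name,
  issues: .open_issues_count,
  wiki: .has_wiki,
  homepage: .homepage,
  owner: .owner.login
}')

# combine() joins the separate objects into one JSON array for fromJSON()
small <- fromJSON(combine(rows))
str(small)
```

No column types or defaults were declared anywhere: the JSON number, boolean, and string types carry through to the data frame, and the JSON null in homepage comes back as a missing value.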

This example only scratches the surface of jq syntax. The jq manual provides a nice overview of this intuitive syntax. jq can also perform a wide range of data processing on the elements, including conditionals, comparisons, regular expressions, math, and so forth. While these are great, most R users will want to learn just enough jq syntax to get back a nice data rectangle, and then let dplyr take over.
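As a small illustration of that division of labour, the sketch below uses jq's select() to keep only repos with open issues, then hands the resulting rectangle to dplyr for sorting. The JSON is a hypothetical three-repo array, not the real data, and the example assumes jqr, jsonlite, and dplyr are installed.

```r
library(jqr)
library(jsonlite)
library(dplyr)

# Hypothetical, already-flat input: one object per repo
json <- '[
  {"name": "repo-a", "issues": 0},
  {"name": "repo-b", "issues": 6},
  {"name": "repo-c", "issues": 2}
]'

# jq side: a comparison inside select() filters the stream,
# and {name, issues} is jq shorthand for {name: .name, issues: .issues}
busy <- jq(json, '.[] | select(.issues > 0) | {name, issues}')

# R side: rectangle first, then ordinary dplyr verbs
busy_df <- fromJSON(combine(busy)) %>%
  arrange(desc(issues))
```

The same filter could of course be written with dplyr::filter() after rectangling; doing it in jq mainly pays off when the condition depends on nested fields that would not survive into the rectangle.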