Data Import Cheatsheet

Data Import with readr, tibble, and tidyr

R's tidyverse is built around tidy data stored in tibbles, an enhanced version of a data frame. The front side of this sheet shows how to read text files into R with readr. The reverse side shows how to create tibbles with tibble and how to lay out tidy data with tidyr.

Read functions

Read tabular data to tibbles

These functions share the common arguments:

    read_*(file, col_names = TRUE, col_types = NULL, locale = default_locale(),
        na = c("", "NA"), quoted_na = TRUE, comment = "", trim_ws = TRUE, skip = 0,
        n_max = Inf, guess_max = min(1000, n_max), progress = interactive())

Each function is shown with an example file that holds the same small table:

    A  B  C
    1  2  3
    4  5  NA

read_csv() - Reads comma delimited files.
    File contents: a,b,c / 1,2,3 / 4,5,NA
    read_csv("file.csv")

read_csv2() - Reads semicolon delimited files.
    File contents: a;b;c / 1;2;3 / 4;5;NA
    read_csv2("file2.csv")

read_delim(delim, quote = "\"", escape_backslash = FALSE, escape_double = TRUE) - Reads files with any delimiter.
    File contents: a|b|c / 1|2|3 / 4|5|NA
    read_delim("file.txt", delim = "|")

read_fwf(col_positions) - Reads fixed width files.
    File contents: a b c / 1 2 3 / 4 5 NA
    read_fwf("file.fwf", col_positions = fwf_positions(start = c(1, 3, 5), end = c(1, 3, 6)))

read_tsv() - Reads tab delimited files. Also read_table().
    read_tsv("file.tsv")

Useful arguments

Example file
    write_csv(path = "file.csv", x = read_csv("a,b,c\n1,2,3\n4,5,NA"))
    File contents: a,b,c / 1,2,3 / 4,5,NA

No header
    read_csv("file.csv", col_names = FALSE)
    Result rows: A B C / 1 2 3 / 4 5 NA

Provide header
    read_csv("file.csv", col_names = c("x", "y", "z"))
    Result: header x y z, rows A B C / 1 2 3 / 4 5 NA

Skip lines
    read_csv("file.csv", skip = 1)
    Result rows: 1 2 3 / 4 5 NA

Read in a subset
    read_csv("file.csv", n_max = 1)
    Result: header A B C, row 1 2 3

Missing values
    read_csv("file.csv", na = c("4", "5", "."))
    Result: header A B C, rows 1 2 3 / NA NA NA

Read non-tabular data

read_file(file, locale = default_locale())
    Read a file into a single string.
read_file_raw(file)
    Read a file into a raw vector.
read_lines(file, skip = 0, n_max = -1L, locale = default_locale(), na = character(), progress = interactive())
    Read each line into its own string.
read_lines_raw(file, skip = 0, n_max = -1L, progress = interactive())
    Read each line into a raw vector.
read_log(file, col_names = FALSE, col_types = NULL, skip = 0, n_max = -1, progress = interactive())
    Apache style log files.

Parsing data types

readr functions guess the type of each column and convert types when appropriate (but will NOT convert strings to factors automatically). A message shows the type of each column in the result.

    ## Parsed with column specification:
    ## cols(
    ##   age = col_integer(),
    ##   sex = col_character(),
    ##   earn = col_double()
    ## )

Here age is an integer, sex is a character, and earn is a double (numeric).

1. Use problems() to diagnose problems

    x <- read_csv("file.csv"); problems(x)

2. Use a col_ function to guide parsing

    col_guess() - the default
    col_character()
    col_double()
    col_euro_double()
    col_datetime(format = "") Also col_date(format = "") and col_time(format = "")
    col_factor(levels, ordered = FALSE)
    col_integer()
    col_logical()
    col_number()
    col_numeric()
    col_skip()

    x <- read_csv("file.csv", col_types = cols(
        A = col_double(),
        B = col_logical(),
        C = col_factor()
    ))

3. Else, read in as character vectors then parse with a parse_ function.

    parse_guess(x, na = c("", "NA"), locale = default_locale())
    parse_character(x, na = c("", "NA"), locale = default_locale())
    parse_datetime(x, format = "", na = c("", "NA"), locale = default_locale()) Also parse_date() and parse_time()
    parse_double(x, na = c("", "NA"), locale = default_locale())
    parse_factor(x, levels, ordered = FALSE, na = c("", "NA"), locale = default_locale())
    parse_integer(x, na = c("", "NA"), locale = default_locale())
    parse_logical(x, na = c("", "NA"), locale = default_locale())
    parse_number(x, na = c("", "NA"), locale = default_locale())

    x$A <- parse_number(x$A)

Write functions

Save x, an R object, to path, a file path, with:

write_csv(x, path, na = "NA", append = FALSE, col_names = !append)
    Tibble/df to comma delimited file.
write_delim(x, path, delim = " ", na = "NA", append = FALSE, col_names = !append)
    Tibble/df to file with any delimiter.
write_excel_csv(x, path, na = "NA", append = FALSE, col_names = !append)
    Tibble/df to a CSV for Excel.
write_file(x, path, append = FALSE)
    String to file.
write_lines(x, path, na = "NA", append = FALSE)
    String vector to file, one element per line.
write_rds(x, path, compress = c("none", "gz", "bz2", "xz"), ...)
    Object to RDS file.
write_tsv(x, path, na = "NA", append = FALSE, col_names = !append)
    Tibble/df to tab delimited files.

Other types of data

Try one of the following packages to import other types of files:

    haven - SPSS, Stata, and SAS files
    readxl - Excel files (.xls and .xlsx)
    DBI - databases
    jsonlite - JSON
    xml2 - XML
    httr - Web APIs
    rvest - HTML (Web Scraping)
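The read, parse, and write steps on this side of the sheet can be combined into one short workflow. The following is a minimal sketch, not part of the original sheet: the file name people.csv, its contents, and the variable names are hypothetical, and "." is assumed to mark a missing value in that file.

```r
library(readr)

# Hypothetical example file, written to a temporary directory.
path <- file.path(tempdir(), "people.csv")
write_lines(c("age,sex,earn", "23,F,1500.5", "41,M,.", "35,F,2100"), path)

# Declare the column types instead of relying on guessing,
# and treat "." as a missing value (an assumption about this file).
people <- read_csv(
  path,
  col_types = cols(
    age  = col_integer(),
    sex  = col_character(),
    earn = col_double()
  ),
  na = c("", "NA", ".")
)

# problems() returns a tibble of parsing failures (zero rows if none occurred).
issues <- problems(people)

# Alternative route: read every column as character, then parse afterwards.
raw <- read_csv(path, col_types = cols(.default = col_character()))
raw$age  <- parse_integer(raw$age)
raw$earn <- parse_number(raw$earn, na = c("", "NA", "."))

# Write the typed tibble back to a comma delimited file.
out <- file.path(tempdir(), "people_clean.csv")
write_csv(people, out)
```

Declaring col_types up front tends to be the safer choice for files that are read repeatedly, since a change in the file then surfaces through problems() rather than as a silently different column type.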
RStudio is a trademark of RStudio, Inc. CC BY RStudio info@rstudio.com 844-448-1212 rstudio.com Learn more at browseVignettes(package = c("readr", "tibble", "tidyr")) readr 1.1.0 tibble 1.2.12 tidyr 0.6.0 Updated: 2017-01
Tibbles - an enhanced data frame

The tibble package provides a new S3 class for storing tabular data, the tibble. Tibbles inherit the data frame class, but improve three behaviors:

Display - When you print a tibble, R provides a concise view of the data that fits on one screen.

    tibble display:

    # A tibble: 234 × 6
       manufacturer model      displ
       <chr>        <chr>      <dbl>
     1 audi         a4           1.8
     2 audi         a4           1.8
     3 audi         a4           2.0
     4 audi         a4           2.0
     5 audi         a4           2.8
     6 audi         a4           2.8
     7 audi         a4           3.1
     8 audi         a4 quattro   1.8
     9 audi         a4 quattro   1.8
    10 audi         a4 quattro   2.0
    # ... with 224 more rows, and 3
    #   more variables: year <int>,
    #   cyl <int>, trans <chr>

    data frame display (a large table to display):

    156 1999 6 auto(l4)
    157 1999 6 auto(l4)
    158 2008 6 auto(l4)
    159 2008 8 auto(s4)
    160 1999 4 manual(m5)
    161 1999 4 auto(l4)
    162 2008 4 manual(m5)
    163 2008 4 manual(m5)
    164 2008 4 auto(l4)
    165 2008 4 auto(l4)
    166 1999 4 auto(l4)
    [ reached getOption("max.print") -- omitted 68 rows ]

Subsetting - [ always returns a new tibble, [[ and $ always return a vector.

No partial matching - You must use full column names when subsetting.

Control the default appearance with options:

    options(tibble.print_max = n, tibble.print_min = m, tibble.width = Inf)

View entire data set with View(x, title) or glimpse(x, width = NULL, ...)

Revert to data frame with as.data.frame() (required for some older packages)

Construct a tibble in two ways

tibble(...) - Construct by columns.

    tibble(x = 1:3,
           y = c("a", "b", "c"))

tribble(...) - Construct by rows.

    tribble(
      ~x, ~y,
       1, "a",
       2, "b",
       3, "c")

Both make this tibble:

    # A tibble: 3 × 2
          x     y
      <int> <chr>
    1     1     a
    2     2     b
    3     3     c

as_tibble(x, ...) Convert data frame to tibble.
enframe(x, name = "name", value = "value") Converts named vector to a tibble with a names column and a values column.
is_tibble(x) Test whether x is a tibble.

Tidy Data with tidyr

Tidy data is a way to organize tabular data. It provides a consistent data structure across packages.

A table is tidy if:
    Each variable is in its own column
    Each observation, or case, is in its own row

Tidy data:
    Makes variables easy to access as vectors
    Preserves cases during vectorized operations (for example, A * B -> C)

Reshape Data - change the layout of values in a table

Use gather() and spread() to reorganize the values of a table into a new layout. Each uses the idea of a key column: value column pair.

gather(data, key, value, ..., na.rm = FALSE, convert = FALSE, factor_key = FALSE)
    Gather moves column names into a key column, gathering the column values into a single value column.

    gather(table4a, `1999`, `2000`, key = "year", value = "cases")

    table4a                   result (key = year, value = cases)
    country  1999  2000       country  year  cases
    A        0.7K  2K         A        1999  0.7K
    B        37K   80K        B        1999  37K
    C        212K  213K       C        1999  212K
                              A        2000  2K
                              B        2000  80K
                              C        2000  213K

spread(data, key, value, fill = NA, convert = FALSE, drop = TRUE, sep = NULL)
    Spread moves the unique values of a key column into the column names, spreading the values of a value column across the new columns that result.

    spread(table2, type, count)

    table2 (key = type, value = count)    result
    country  year  type   count           country  year  cases  pop
    A        1999  cases  0.7K            A        1999  0.7K   19M
    A        1999  pop    19M             A        2000  2K     20M
    A        2000  cases  2K              B        1999  37K    172M
    A        2000  pop    20M             B        2000  80K    174M
    B        1999  cases  37K             C        1999  212K   1T
    B        1999  pop    172M            C        2000  213K   1T
    B        2000  cases  80K
    B        2000  pop    174M
    C        1999  cases  212K
    C        1999  pop    1T
    C        2000  cases  213K
    C        2000  pop    1T

Handle Missing Values

drop_na(data, ...)
    Drop rows containing NAs in ... columns.
fill(data, ..., .direction = c("down", "up"))
    Fill in NAs in ... columns with most recent non-NA values.
replace_na(data, replace = list(), ...)
    Replace NAs by column.

    x          drop_na(x, x2)    fill(x, x2)    replace_na(x, list(x2 = 2))
    x1  x2     x1  x2            x1  x2         x1  x2
    A   1      A   1             A   1          A   1
    B   NA     D   3             B   1          B   2
    C   NA                       C   1          C   2
    D   3                        D   3          D   3
    E   NA                       E   3          E   2

Expand Tables - quickly create tables with combinations of values

complete(data, ..., fill = list())
    Adds to the data missing combinations of the values of the variables listed in ...

    complete(mtcars, cyl, gear, carb)

expand(data, ...)
    Create new tibble with all possible combinations of the values of the variables listed in ...

    expand(mtcars, cyl, gear, carb)

Split and Combine Cells

Use these functions to split or combine cells into individual, isolated values.

separate(data, col, into, sep = "[^[:alnum:]]+", remove = TRUE, convert = FALSE, extra = "warn", fill = "warn", ...)
    Separate each cell in a column to make several columns.

    separate(table3, rate, into = c("cases", "pop"))

    table3                       result
    country  year  rate          country  year  cases  pop
    A        1999  0.7K/19M      A        1999  0.7K   19M
    A        2000  2K/20M        A        2000  2K     20M
    B        1999  37K/172M      B        1999  37K    172M
    B        2000  80K/174M      B        2000  80K    174M
    C        1999  212K/1T       C        1999  212K   1T
    C        2000  213K/1T       C        2000  213K   1T

separate_rows(data, ..., sep = "[^[:alnum:].]+", convert = FALSE)
    Separate each cell in a column to make several rows. Also separate_rows_().

    separate_rows(table3, rate)

    table3                       result
    country  year  rate          country  year  rate
    A        1999  0.7K/19M      A        1999  0.7K
    A        2000  2K/20M        A        1999  19M
    B        1999  37K/172M      A        2000  2K
    B        2000  80K/174M      A        2000  20M
    C        1999  212K/1T       B        1999  37K
    C        2000  213K/1T       B        1999  172M
                                 B        2000  80K
                                 B        2000  174M
                                 C        1999  212K
                                 C        1999  1T
                                 C        2000  213K
                                 C        2000  1T

unite(data, col, ..., sep = "_", remove = TRUE)
    Collapse cells across several columns to make a single column.

    unite(table5, century, year, col = "year", sep = "")

    table5                       result
    country  century  year       country  year
    Afghan   19       99         Afghan   1999
    Afghan   20       00         Afghan   2000
    Brazil   19       99         Brazil   1999
    Brazil   20       00         Brazil   2000
    China    19       99         China    1999
    China    20       00         China    2000
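The reshaping and cell-splitting functions on this side can be tried on a small tibble built with tribble(). The following is a rough sketch rather than an example from the sheet: the tibbles cases_wide and rates, and the numbers in them, are hypothetical stand-ins laid out like table4a and table3.

```r
library(tibble)
library(tidyr)

# Hypothetical data in the layout of table4a: one column per year.
cases_wide <- tribble(
  ~country, ~`1999`, ~`2000`,
  "A",          700,    2000,
  "B",        37000,   80000,
  "C",       212000,  213000
)

# gather(): move the year column names into a key column ("year")
# and their values into a single value column ("cases").
cases_long <- gather(cases_wide, `1999`, `2000`, key = "year", value = "cases")

# spread(): the inverse step, one new column per unique value of the key.
cases_back <- spread(cases_long, year, cases)

# Hypothetical data in the layout of table3: two values packed into one cell.
rates <- tribble(
  ~country, ~rate,
  "A", "700/19000000",
  "B", "37000/172000000"
)

# separate(): split each rate cell into two columns at the "/".
rates_split <- separate(rates, rate, into = c("cases", "pop"), sep = "/")

# unite(): collapse the two columns back into a single column.
rates_joined <- unite(rates_split, cases, pop, col = "rate", sep = "/")
```

Passing sep = "/" explicitly to separate() avoids depending on the default pattern, which splits on any run of non-alphanumeric characters.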