Currently, readr automatically recognises the following types of columns:

- col_logical() [l], containing only T, F, TRUE or FALSE.
- col_integer() [i], integers.
- col_double() [d], doubles.
- col_character() [c], everything else.
- col_date(format = "") [D], dates with the locale's date_format.
- col_time(format = "") [t], times with the locale's time_format.
- col_datetime(format = "") [T], ISO8601 date times.
- col_number() [n], numbers containing the grouping_mark.
To recognise these columns, readr inspects the first 1000 rows of your dataset. This is not guaranteed to be perfect, but it's fast and a reasonable heuristic. If you get a lot of parsing failures, you'll need to re-read the file, either increasing guess_max or overriding the default choices as described below.
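For instance, to base the guess on more rows (the file name here is hypothetical, purely for illustration):

```r
library(readr)

# Guess column types from the first 10,000 rows rather than the default 1,000.
# "flights.csv" is a hypothetical file used for illustration.
data <- read_csv("flights.csv", guess_max = 10000)
```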
You can also manually specify other column types:
- col_skip() [_, -], don't import this column.
- col_date(format), dates with given format.
- col_datetime(format, tz), date times with given format. If the timezone is UTC (the default), this is >20x faster than loading then parsing with strptime().
- col_time(format), times. Returned as number of seconds past midnight.
- col_factor(levels, ordered), parse a fixed set of known values into a factor.
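As a quick illustration of the seconds-past-midnight representation, parse_time() (the character-vector equivalent of col_time(), discussed later in this vignette) yields values that convert directly to seconds:

```r
library(readr)

x <- parse_time("01:30")
# Times are stored as the number of seconds past midnight:
as.numeric(x)
#> [1] 5400
```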
Use the col_types
argument to override the default choices. There are two ways to use it:
With a string: "dc__d": read the first column as a double, the second as character, skip the next two, and read the last column as a double. (There's no way to use this form with types that take additional parameters.)
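For example, applied to a hypothetical five-column file:

```r
library(readr)

# Read a hypothetical five-column file: first column as double, second as
# character, skip the next two, and read the last column as double.
data <- read_csv("measurements.csv", col_types = "dc__d")
```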
With a (named) list of col objects:
read_csv("iris.csv", col_types = cols(
  Sepal.Length = col_double(),
  Sepal.Width = col_double(),
  Petal.Length = col_double(),
  Petal.Width = col_double(),
  Species = col_factor(c("setosa", "versicolor", "virginica"))
))
Or, with their abbreviations:
read_csv("iris.csv", col_types = cols(
  Sepal.Length = "d",
  Sepal.Width = "d",
  Petal.Length = "d",
  Petal.Width = "d",
  Species = col_factor(c("setosa", "versicolor", "virginica"))
))
Any omitted columns will be parsed automatically, so the previous call will lead to the same result as:
read_csv("iris.csv", col_types = cols(
  Species = col_factor(c("setosa", "versicolor", "virginica"))
))
You can also set a default type that will be used instead of relying on the automatic detection for columns you don’t specify:
read_csv("iris.csv", col_types = cols(
  Species = col_factor(c("setosa", "versicolor", "virginica")),
  .default = col_double()
))
If you only want to read specified columns, use cols_only()
:
read_csv("iris.csv", col_types = cols_only(
  Species = col_factor(c("setosa", "versicolor", "virginica"))
))
When reading files interactively, the first 20 columns of the column specification used are printed. options(readr.num_columns = n) can be used to change the number of columns printed; setting the value to 0 disables printing.
readr attaches the spec used for the file to the output. It can be retrieved by calling spec()
on the object.
data <- read_csv(readr_example("mtcars.csv"))
#> Parsed with column specification:
#> cols(
#> mpg = col_double(),
#> cyl = col_integer(),
#> disp = col_double(),
#> hp = col_integer(),
#> drat = col_double(),
#> wt = col_double(),
#> qsec = col_double(),
#> vs = col_integer(),
#> am = col_integer(),
#> gear = col_integer(),
#> carb = col_integer()
#> )
data
#> # A tibble: 32 x 11
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> <dbl> <int> <dbl> <int> <dbl> <dbl> <dbl> <int> <int> <int> <int>
#> 1 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
#> 2 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
#> 3 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
#> 4 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
#> 5 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
#> 6 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
#> 7 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
#> 8 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
#> 9 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
#> 10 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
#> # ... with 22 more rows
# Every table returned has a spec attribute
s <- spec(data)
s
#> cols(
#> mpg = col_double(),
#> cyl = col_integer(),
#> disp = col_double(),
#> hp = col_integer(),
#> drat = col_double(),
#> wt = col_double(),
#> qsec = col_double(),
#> vs = col_integer(),
#> am = col_integer(),
#> gear = col_integer(),
#> carb = col_integer()
#> )
# Alternatively you can use spec_csv(), which only reads the
# first 1000 rows (configurable with guess_max)
s <- spec_csv(readr_example("mtcars.csv"))
#> Parsed with column specification:
#> cols(
#> mpg = col_double(),
#> cyl = col_integer(),
#> disp = col_double(),
#> hp = col_integer(),
#> drat = col_double(),
#> wt = col_double(),
#> qsec = col_double(),
#> vs = col_integer(),
#> am = col_integer(),
#> gear = col_integer(),
#> carb = col_integer()
#> )
s
#> cols(
#> mpg = col_double(),
#> cyl = col_integer(),
#> disp = col_double(),
#> hp = col_integer(),
#> drat = col_double(),
#> wt = col_double(),
#> qsec = col_double(),
#> vs = col_integer(),
#> am = col_integer(),
#> gear = col_integer(),
#> carb = col_integer()
#> )
# Automatically set the default to the most common type
cols_condense(s)
#> cols(
#> .default = col_integer(),
#> mpg = col_double(),
#> disp = col_double(),
#> drat = col_double(),
#> wt = col_double(),
#> qsec = col_double()
#> )
# If the default is col_skip(), the spec is printed as cols_only()
s$default <- col_skip()
s
#> cols_only(
#> mpg = col_double(),
#> cyl = col_integer(),
#> disp = col_double(),
#> hp = col_integer(),
#> drat = col_double(),
#> wt = col_double(),
#> qsec = col_double(),
#> vs = col_integer(),
#> am = col_integer(),
#> gear = col_integer(),
#> carb = col_integer()
#> )
# Otherwise set the default to the proper type
s$default <- col_character()
s
#> cols(
#> .default = col_character(),
#> mpg = col_double(),
#> cyl = col_integer(),
#> disp = col_double(),
#> hp = col_integer(),
#> drat = col_double(),
#> wt = col_double(),
#> qsec = col_double(),
#> vs = col_integer(),
#> am = col_integer(),
#> gear = col_integer(),
#> carb = col_integer()
#> )
# The print method takes an n parameter to print only that number of columns
print(s, n = 5)
#> cols(
#> .default = col_integer(),
#> mpg = col_double(),
#> disp = col_double(),
#> drat = col_double(),
#> wt = col_double(),
#> qsec = col_double()
#> )
# When reading, this is set to 20 by default; set
# options("readr.num_columns" = x) to change it
options("readr.num_columns" = 5)
data <- read_csv(readr_example("mtcars.csv"))
#> Parsed with column specification:
#> cols(
#> .default = col_integer(),
#> mpg = col_double(),
#> disp = col_double(),
#> drat = col_double(),
#> wt = col_double(),
#> qsec = col_double()
#> )
#> See spec(...) for full column specifications.
# Setting it to 0 disables printing
options("readr.num_columns" = 0)
data <- read_csv(readr_example("mtcars.csv"))
As well as specifying how to parse a column from a file on disk, each of the col_xyz() functions has an equivalent parse_xyz() that parses a character vector. These are useful for testing and examples, and for rapidly experimenting to figure out how to parse a vector given a few examples.
parse_logical()
, parse_integer()
, parse_double()
, and parse_character()
are straightforward parsers that produce the corresponding atomic vector.
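A couple of quick examples of these atomic parsers:

```r
library(readr)

parse_integer(c("1", "2", "3"))
#> [1] 1 2 3
parse_logical(c("TRUE", "FALSE", "T", "F"))
#> [1]  TRUE FALSE  TRUE FALSE
```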
Make sure to read vignette("locales")
to learn how to deal with doubles.
parse_integer()
and parse_double()
are strict: the input string must be a single number with no leading or trailing characters. parse_number()
is more flexible: it ignores non-numeric prefixes and suffixes, and knows how to deal with grouping marks. This makes it suitable for reading currencies and percentages:
parse_number(c("0%", "10%", "150%"))
#> [1] 0 10 150
parse_number(c("$1,234.5", "$12.45"))
#> [1] 1234.50 12.45
Note that guess_parser() will only guess that a string is a number if it has no leading or trailing characters (after trimming whitespace); otherwise it's too prone to false positives. That means you'll typically need to supply the column type for number columns explicitly.
guess_parser("$1,234")
#> [1] "character"
guess_parser("1,234")
#> [1] "number"
readr supports three types of date/time data: dates, times, and date-times.
readr will guess date time fields if they’re in ISO8601 format:
parse_datetime("2010-10-01 21:45")
#> [1] "2010-10-01 21:45:00 UTC"
parse_date("2010-10-01")
#> [1] "2010-10-01"
Otherwise, you’ll need to specify the format yourself:
parse_datetime("1 January, 2010", "%d %B, %Y")
#> [1] "2010-01-01 UTC"
parse_datetime("02/02/15", "%m/%d/%y")
#> [1] "2015-02-02 UTC"
When reading a column that has a known set of values, you can read directly into a factor.
parse_factor(c("a", "b", "a"), levels = c("a", "b", "c"))
#> [1] a b a
#> Levels: a b c
readr will never turn a character vector into a factor unless you explicitly ask for it.