* %in% (#43446)
* Fixed the str_sub binding to properly handle negative end values (@coussens, #44141)
* R functions that users write using only functions arrow already knows how to translate can now be used in queries. For example, time_hours <- function(mins) mins / 60 worked, but time_hours_rounded <- function(mins) round(mins / 60) did not; now both work. These are automatic translations rather than true user-defined functions (UDFs); for UDFs, see register_scalar_function(). (#41223)
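A minimal sketch of the translation described above (the trip_minutes column and its values are invented for the example):

```r
library(arrow)
library(dplyr)

# An ordinary R helper; arrow translates round() and `/`, so the whole
# function can be used inside a query without pulling data into R
time_hours_rounded <- function(mins) round(mins / 60)

arrow_table(trip_minutes = c(95, 130, 61)) |>
  mutate(trip_hours = time_hours_rounded(trip_minutes)) |>
  collect()
```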
* mutate() expressions can now include aggregations, such as x - mean(x). (#41350)
* summarize() supports more complex expressions, and correctly handles cases where column names are reused in expressions. (#41223)
* The na_matches argument to the dplyr::*_join() functions is now supported. This argument controls whether NA values are considered equal when joining. (#41358)
* When calling pull on grouped datasets, it now returns the expected column. (#43172)
* Bindings for base::prod have been added so you can now use it in your dplyr pipelines (i.e., tbl |> summarize(prod(col))) without having to pull the data into R (@m-muecke, #38601).
* Calling dimnames or colnames on Dataset objects now returns a useful result rather than just NULL (#38377).
* The code() method on Schema objects now takes an optional namespace argument which, when TRUE, prefixes names with arrow::, which makes the output more portable (@orgadish, #38144).
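For example, a small sketch of generating portable schema code (the fields are invented; the exact printed call may vary by version):

```r
library(arrow)

sch <- schema(x = int32(), y = utf8())

# Emit code that recreates the schema, with names prefixed by arrow::
sch$code(namespace = TRUE)
```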
* SystemRequirements (#39602)
* Improved handling when sub, gsub, stringr::str_replace, or stringr::str_replace_all are passed a length > 1 vector of values in pattern (@abfleishman, #39219).
* New content in ?open_dataset documenting how to use the ND-JSON support added in arrow 13.0.0 (@Divyansh200102, #38258).
* When working with S3 (s3_bucket, S3FileSystem), the debug log level for S3 can be set with the AWS_S3_LOG_LEVEL environment variable. See ?S3FileSystem for more information. (#38267)
* Using arrow together with duckdb (to_duckdb()) no longer results in warnings when quitting your R session. (#38495)
* Set LIBARROW_BINARY=true for the old behavior (#39861).
* The package can be built against a different Arrow C++ version (ARROW_R_ALLOW_CPP_VERSION_MISMATCH=true) and requires at least Arrow C++ 13.0.0 (#39739).
* In open_dataset(), the partition variables are now included in the resulting dataset (#37658).
* write_csv_dataset()
now wraps write_dataset()
and mirrors
the syntax of write_csv_arrow()
(@dgreiss, #36436).open_delim_dataset()
now accepts a quoted_na argument to allow empty strings
to be parsed as NA values (#37828).schema()
can now be called on data.frame
objects to retrieve their
inferred Arrow schema (#37843).read_csv2_arrow()
(#38002).CsvParseOptions
object creation now contains more
information about default values (@angela-li, #37909).fixed()
, regex()
etc.) now allow
variables to be reliably used in their arguments (#36784).ParquetReaderProperties
, allowing users to work with Parquet files
with unusually large metadata (#36992).add_filename()
are improved
(@amoeba, #37372).create_package_with_all_dependencies()
now properly escapes paths on
Windows (#37226).data.frame
and no other classes now have the class attribute dropped, so file reading functions and arrow_table() always return tibbles, making the type of returned objects consistent. Calling as.data.frame()
on Arrow Tabular objects now always returns a data.frame
object (#34775)open_dataset()
now works with ND-JSON files (#35055)schema()
on multiple Arrow objects now returns the object's schema (#35543).by
/by
argument now supported in arrow implementation of dplyr verbs (@eitsupi, #35667)dplyr::case_when()
now accepts .default
parameter to match the update in dplyr 1.1.0 (#35502)arrow_array()
can be used to create Arrow Arrays (#36381)scalar()
can be used to create Arrow Scalars (#36265)RecordBatchReader::ReadNext()
from DuckDB from the main R thread (#36307)set_io_thread_count()
with num_threads
< 2 (#36304)strptime()
in arrow will return a timezone-aware timestamp if %z
is part of the format string (#35671)group_by()
and across()
now matches dplyr (@eitsupi, #35473)read_parquet()
and read_feather()
functions can now accept URL
arguments (#33287, #34708).json_credentials
argument in GcsFileSystem$create()
now accepts
a file path containing the appropriate authentication token (@amoeba,
#34421, #34524).$options
member of GcsFileSystem
objects can now be inspected
(@amoeba, #34422, #34477).read_csv_arrow()
and read_json_arrow()
functions now accept literal text input wrapped in
I()
to improve compatibility with readr::read_csv()
(@eitsupi, #18487,
#33968).$
and [[
in dplyr expressions
(#18818, #19706).FetchNode
and OrderByNode
to improve performance
and simplify building query plans from dplyr expressions (#34437, #34685).arrow_table()
(#35038, #35039).data.frame
with NULL
column names to a Table
(#15247, #34798).open_csv_dataset()
family of functions (#33998, #34710).dplyr::n()
function is now mapped to the count_all
kernel to improve
performance and simplify the R implementation (#33892, #33917).s3_bucket()
filesystem helper
with endpoint_override
and fixed surprising behaviour that occurred
when passing some combinations of arguments (@cboettig, #33904, #34009).schema
is supplied and col_names = TRUE
in
open_csv_dataset()
(#34217, #34092).open_csv_dataset()
allows a schema to be specified. (#34217)dplyr:::check_names()
(#34369)map_batches()
is lazy by default; it now returns a RecordBatchReader
instead of a list of RecordBatch
objects unless lazy = FALSE
.
(#14521)open_csv_dataset()
, open_tsv_dataset()
, and
open_delim_dataset()
all wrap open_dataset()
- they don't provide new
functionality, but allow for readr-style options to be supplied, making it
simpler to switch between individual file-reading and dataset
functionality. (#33614)col_names
parameter allows specification of column names when
opening a CSV dataset. (@wjones127,
#14705)parse_options
, read_options
, and convert_options
parameters for
reading individual files (read_*_arrow()
functions) and datasets
(open_dataset()
and the new open_*_dataset()
functions) can be passed
in as lists. (#15270)read_csv_arrow()
.
(#14930)join_by()
has been implemented for dplyr joins
on Arrow objects (equality conditions only).
(#33664)dplyr::group_by()
/dplyr::summarise()
calls are used. (#14905)dplyr::summarize()
works with division when divisor is a variable.
(#14933)dplyr::right_join()
correctly coalesces keys.
(#15077)POSIXlt
objects.
(#15277)Array$create()
can create Decimal arrays.
(#15211)StructArray$create()
can be used to create StructArray objects.
(#14922)lubridate::as_datetime()
on Arrow objects can handle time in
sub-seconds. (@eitsupi,
#13890)head()
can be called after as_record_batch_reader()
.
(#14518)as.Date()
can go from timestamp[us]
to timestamp[s]
.
(#14935)check_dots_empty()
. (@daattali,
#14744)

Minor improvements and fixes:

* .data pronoun in dplyr::group_by() (#14484)

Several new functions can be used in queries:
dplyr::across()
can be used to apply the same computation across multiple
columns, and the where()
selection helper is supported in across()
;add_filename()
can be used to get the filename a row came from (only
available when querying ?Dataset
);slice_*
family: dplyr::slice_min()
,
dplyr::slice_max()
, dplyr::slice_head()
, dplyr::slice_tail()
, and
dplyr::slice_sample()
.

The package now has documentation that lists all dplyr
methods and R function
mappings that are supported on Arrow data, along with notes about any
differences in functionality between queries evaluated in R versus in Acero, the
Arrow query engine. See ?acero
.
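A sketch of a couple of the functions listed above in use (toy data; support for particular functions inside across() depends on the translations available in your version):

```r
library(arrow)
library(dplyr)

tbl <- arrow_table(mtcars)

# across() applies the same computation to several columns
tbl |>
  group_by(cyl) |>
  summarize(across(c(mpg, hp), mean)) |>
  collect()

# slice_max() is one of the newly supported slice_* verbs
tbl |>
  slice_max(mpg, n = 3) |>
  collect()
```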
A few new features and bugfixes were implemented for joins:

* The keep argument is now supported, allowing separate columns for the left and right hand side join keys in join output. Full joins now coalesce the join keys (when keep = FALSE), avoiding the issue where the join keys would be all NA for rows in the right hand side without any matches on the left.

Some changes to improve the consistency of the API:

* dplyr::pull() will return a ?ChunkedArray instead of an R vector by default in a future release. The current default behavior is deprecated. To update to the new behavior now, specify pull(as_vector = FALSE) or set options(arrow.pull_as_vector = FALSE) globally.
* Calling dplyr::compute() on a query that is grouped returns a ?Table instead of a query object.

Finally, long-running queries can now be cancelled and will abort their computation immediately.
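A short sketch of opting in to the new pull() behavior described above (column names are invented):

```r
library(arrow)
library(dplyr)

tbl <- arrow_table(x = 1:3, y = c("a", "b", "c"))

# Return a ChunkedArray rather than an R vector
tbl |> pull(x, as_vector = FALSE)

# Or opt in globally
options(arrow.pull_as_vector = FALSE)
```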
as_arrow_array()
can now take blob::blob
and ?vctrs::list_of
, which
convert to binary and list arrays, respectively. Also fixed an issue where
as_arrow_array()
ignored the type argument when passed a StructArray
.
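A small sketch of these conversions (assumes the blob and vctrs packages are installed):

```r
library(arrow)

# A blob becomes a binary array
as_arrow_array(blob::blob(as.raw(c(1, 2, 3))))

# A vctrs list_of becomes a list array
as_arrow_array(vctrs::list_of(1:3, 4:5))
```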
The unique()
function works on ?Table
, ?RecordBatch
, ?Dataset
, and
?RecordBatchReader
.
write_feather()
can take compression = FALSE
to choose writing uncompressed files.
Also, a breaking change for IPC files in write_dataset()
: passing
"ipc"
or "feather"
to format
will now write files with .arrow
extension instead of .ipc
or .feather
.
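For example (a sketch using temporary files):

```r
library(arrow)

# Write an uncompressed Feather V2 (Arrow IPC) file
tf <- tempfile(fileext = ".arrow")
write_feather(mtcars, tf, compression = FALSE)

# Datasets written with format = "feather" (or "ipc") now use the .arrow extension
td <- tempfile()
write_dataset(mtcars, td, format = "feather")
list.files(td)
```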
As of version 10.0.0, arrow requires C++17 to build. This means that:

* On Windows, you need R >= 4.0. Version 9.0.0 was the last version to support R 3.6.
* On CentOS 7, you can build the latest version of arrow, but you first need to install a newer compiler than the default system compiler, gcc 4.8. See vignette("install", package = "arrow") for guidance. Note that you only need the newer compiler to build arrow: installing a binary package, as from RStudio Package Manager, or loading a package you've already installed works fine with the system defaults.

* dplyr::union and dplyr::union_all are supported (#13090)
* dplyr::glimpse is supported (#13563)
* show_exec_plan() can be added to the end of a dplyr pipeline to show the underlying plan, similar to dplyr::show_query(). dplyr::show_query() and dplyr::explain() also work and show the same output, but may change in the future. (#13541)
* User-defined functions are supported in queries. Use register_scalar_function() to create them. (#13397)
* map_batches() returns a RecordBatchReader and requires that the function it maps returns something coercible to a RecordBatch through the as_record_batch() S3 function. It can also run in streaming fashion if passed .lazy = TRUE. (#13170, #13650)
* Functions can be called with package namespace prefixes (e.g. stringr::, lubridate::) within queries. For example, stringr::str_length will now dispatch to the same kernel as str_length. (#13160)
* Support for the lubridate::parse_date_time()
datetime parser: (#12589, #13196, #13506)
orders
with year, month, day, hours, minutes, and seconds components are supported.orders
argument in the Arrow binding works as follows: orders
are transformed into formats
which subsequently get applied in turn. There is no select_formats
parameter and no inference takes place (like is the case in lubridate::parse_date_time()
).lubridate
date and datetime parsers such as lubridate::ymd()
, lubridate::yq()
, and lubridate::ymd_hms()
(#13118, #13163, #13627)lubridate::fast_strptime()
(#13174)lubridate::floor_date()
, lubridate::ceiling_date()
, and lubridate::round_date()
(#12154)strptime()
supports the tz
argument to pass timezones. (#13190)lubridate::qday()
(day of quarter)exp()
and sqrt()
. (#13517)read_ipc_file()
and write_ipc_file()
are added.
These functions are almost the same as read_feather()
and write_feather()
,
but differ in that they only target IPC files (Feather V2 files), not Feather V1 files.

read_arrow()
and write_arrow()
, deprecated since 1.0.0 (July 2020), have been removed.
Instead of these, use the read_ipc_file()
and write_ipc_file()
for IPC files, or,
read_ipc_stream()
and write_ipc_stream()
for IPC streams. (#13550)write_parquet()
now defaults to writing Parquet format version 2.4 (was 1.0). Previously deprecated arguments properties
and arrow_properties
have been removed; if you need to deal with these lower-level properties objects directly, use ParquetFileWriter
, which write_parquet()
wraps. (#13555)write_dataset()
preserves all schema metadata again. In 8.0.0, it would drop most metadata, breaking packages such as sfarrow. (#13105)write_csv_arrow()
) will automatically (de-)compress data if the file path contains a compression extension (e.g. "data.csv.gz"
). This works locally as well as on remote filesystems like S3 and GCS. (#13183)FileSystemFactoryOptions
can be provided to open_dataset()
, allowing you to pass options such as which file prefixes to ignore. (#13171)S3FileSystem
will not create or delete buckets. To enable that, pass the configuration option allow_bucket_creation
or allow_bucket_deletion
. (#13206)GcsFileSystem
and gs_bucket()
allow connecting to Google Cloud Storage. (#10999, #13601)$num_rows()
method returns a double (previously integer), avoiding integer overflow on larger tables. (#13482, #13514)arrow.dev_repo
for nightly builds of the R package and prebuilt
libarrow binaries is now https://nightlies.apache.org/arrow/r/.open_dataset()
:
skip
argument for skipping header rows in CSV datasets.UnionDataset
.{dplyr}
queries:
RecordBatchReader
. This allows, for example, results from DuckDB
to be streamed back into Arrow rather than materialized before continuing the pipeline.dplyr::rename_with()
.dplyr::count()
returns an ungrouped dataframe.write_dataset()
has more options for controlling row group and file sizes when
writing partitioned datasets, such as max_open_files
, max_rows_per_file
,
min_rows_per_group
, and max_rows_per_group
.write_csv_arrow()
accepts a Dataset
or an Arrow dplyr query.option(use_threads = FALSE)
no longer
crashes R. That option is set by default on Windows.dplyr
joins support the suffix
argument to handle overlap in column names.is.na()
no longer misses any rows.map_batches()
correctly accepts Dataset
objects.read_csv_arrow()
's readr-style type T
is mapped to timestamp(unit = "ns")
instead of timestamp(unit = "s")
.{lubridate}
features and fixes:
lubridate::tz()
(timezone),lubridate::semester()
,lubridate::dst()
(daylight savings time boolean),lubridate::date()
,lubridate::epiyear()
(year according to epidemiological week calendar),lubridate::month()
works with integer inputs.lubridate::make_date()
& lubridate::make_datetime()
+
base::ISOdatetime()
& base::ISOdate()
to
create date-times from numeric representations.lubridate::decimal_date()
and lubridate::date_decimal()
lubridate::make_difftime()
(duration constructor)?lubridate::duration
helper functions,
such as lubridate::dyears()
, lubridate::dhours()
, lubridate::dseconds()
.lubridate::leap_year()
lubridate::as_date()
and lubridate::as_datetime()
base::difftime
and base::as.difftime()
base::as.Date()
to convert to datebase::format()
strptime()
returns NA
instead of erroring in case of format mismatch,
just like base::strptime()
.as_arrow_array()
and as_arrow_table()
for main Arrow objects. This includes Arrow tables,
record batches, arrays, chunked arrays, record batch readers, schemas, and
data types. This allows other packages to define custom conversions from their
types to Arrow objects, including extension arrays.?new_extension_type
.vctrs::vec_is()
returns TRUE (i.e., any object that can be used as a column in a
tibble::tibble()
), provided that the underlying vctrs::vec_data()
can be converted
to an Arrow Array.Arrow arrays and tables can be easily concatenated:
concat_arrays()
or, if zero-copy is desired
and chunking is acceptable, using ChunkedArray$create()
.c()
.cbind()
.rbind()
. concat_tables()
is also provided to
concatenate tables while unifying schemas.sqrt()
, log()
, and exp()
with Arrow arrays and scalars.read_*
and write_*
functions support R Connection objects for reading
and writing files.median()
and quantile()
will warn only once about approximate calculations regardless of interactivity.Array$cast()
can cast StructArrays into another struct type with the same field names
and structure (or a subset of fields) but different field types.set_io_thread_count()
would set the CPU count instead of
the IO thread count.RandomAccessFile
has a $ReadMetadata()
method that provides useful
metadata provided by the filesystem.grepl
binding returns FALSE
for NA
inputs (previously it returned NA
),
to match the behavior of base::grepl()
.create_package_with_all_dependencies()
works on Windows and Mac OS, instead
of only Linux.{lubridate}
features: week()
, more of the is.*()
functions, and the label argument to month()
have been implemented.summarize()
, such as ifelse(n() > 1, mean(y), mean(z))
, are supported.tibble
and data.frame
to create columns of tibbles or data.frames respectively (e.g. ... %>% mutate(df_col = tibble(a, b)) %>% ...
).factor
type) are supported inside of coalesce()
.open_dataset()
accepts the partitioning
argument when reading Hive-style partitioned files, even though it is not required.map_batches()
function for custom operations on dataset has been restored.encoding
argument when reading).open_dataset()
correctly ignores byte-order marks (BOM
s) in CSVs, as already was true for reading single fileshead()
no longer hangs on large CSV datasets.write_csv_arrow()
now follows the signature of readr::write_csv()
.$code()
method on a schema
or type
. This allows you to easily get the code needed to create a schema from an object that already has one.Duration
type has been mapped to R's difftime
class.decimal256()
type is supported. The decimal()
function has been revised to call either decimal256()
or decimal128()
based on the value of the precision
argument.write_parquet()
uses a reasonable guess at chunk_size
instead of always writing a single chunk. This improves the speed of reading and writing large Parquet files.write_parquet()
no longer drops attributes for grouped data.frames.proxy_options
.pkg-config
to search for system dependencies (such as libz
) and link to them if present. This new default will make building Arrow from source quicker on systems that have these dependencies installed already. To retain the previous behavior of downloading and building all dependencies, set ARROW_DEPENDENCY_SOURCE=BUNDLED
.glue
, which arrow
depends on transitively, has dropped support for it.str_count()
in dplyr queries

There are now two ways to query Arrow data:
dplyr::summarize()
, both grouped and ungrouped, is now implemented for Arrow Datasets, Tables, and RecordBatches. Because data is scanned in chunks, you can aggregate over larger-than-memory datasets backed by many files. Supported aggregation functions include n()
, n_distinct()
, min(), max()
, sum()
, mean()
, var()
, sd()
, any()
, and all()
. median()
and quantile()
with one probability are also supported and currently return approximate results using the t-digest algorithm.
Along with summarize()
, you can also call count()
, tally()
, and distinct()
, which effectively wrap summarize()
.
This enhancement does change the behavior of summarize()
and collect()
in some cases: see "Breaking changes" below for details.
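A sketch of the grouped aggregation described above (the path and column names are placeholders):

```r
library(arrow)
library(dplyr)

ds <- open_dataset("path/to/parquet_dir")

ds |>
  group_by(year) |>
  summarize(n = n(), avg_value = mean(value, na.rm = TRUE)) |>
  collect()
```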
In addition to summarize()
, mutating and filtering equality joins (inner_join()
, left_join()
, right_join()
, full_join()
, semi_join()
, and anti_join()
) are also supported natively in Arrow.
Grouped aggregation and (especially) joins should be considered somewhat experimental in this release. We expect them to work, but they may not be well optimized for all workloads. To help us focus our efforts on improving them in the next release, please let us know if you encounter unexpected behavior or poor performance.
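A quick sketch of a natively executed equality join (toy tables):

```r
library(arrow)
library(dplyr)

left  <- Table$create(id = 1:3, x = c("a", "b", "c"))
right <- Table$create(id = 2:4, y = c(10, 20, 30))

left |>
  left_join(right, by = "id") |>
  collect()
```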
New non-aggregating compute functions include string functions like str_to_title()
and strftime()
as well as compute functions for extracting date parts (e.g. year()
, month()
) from dates. This is not a complete list of additional compute functions; for an exhaustive list of available compute functions see list_compute_functions()
.
We've also worked to fill in support for all data types, such as Decimal
, for functions added in previous releases. All type limitations mentioned in previous release notes should be no longer valid, and if you find a function that is not implemented for a certain data type, please report an issue.
If you have the duckdb package installed, you can hand off an Arrow Dataset or query object to DuckDB for further querying using the to_duckdb()
function. This allows you to use duckdb's dbplyr
methods, as well as its SQL interface, to aggregate data. Filtering and column projection done before to_duckdb()
is evaluated in Arrow, and duckdb can push down some predicates to Arrow as well. This handoff does not copy the data; instead it uses Arrow's C interface (just like passing arrow data between R and Python), so no serialization or data copying costs are incurred.
You can also take a duckdb tbl
and call to_arrow()
to stream data to Arrow's query engine. This means that in a single dplyr pipeline, you could start with an Arrow Dataset, evaluate some steps in DuckDB, then evaluate the rest in Arrow.
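A rough sketch of that round trip (assumes the duckdb and dbplyr packages are installed):

```r
library(arrow)
library(dplyr)

tab <- Table$create(mtcars)

tab |>
  filter(cyl > 4) |>                # evaluated by Arrow
  to_duckdb() |>                    # hand off to DuckDB without copying
  group_by(cyl) |>
  summarise(avg_mpg = mean(mpg)) |>
  to_arrow() |>                     # stream the result back into Arrow
  collect()
```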
arrange()
the query result. For calls to summarize()
, you can set options(arrow.summarise.sort = TRUE)
to match the current dplyr
behavior of sorting on the grouping columns.dplyr::summarize()
on an in-memory Arrow Table or RecordBatch no longer eagerly evaluates. Call compute()
or collect()
to evaluate the query.head()
and tail()
also no longer eagerly evaluate, both for in-memory data and for Datasets. Also, because row order is no longer deterministic, they will effectively give you a random slice of data from somewhere in the dataset unless you arrange()
to specify sorting.sf::st_as_binary(col)
) or using the sfarrow package which handles some of the intricacies of this conversion process. We have plans to improve this and re-enable custom metadata like this in the future when we can implement the saving in a safe and efficient way. If you need to preserve the pre-6.0.0 behavior of saving this metadata, you can set options(arrow.preserve_row_level_metadata = TRUE)
. We will be removing this option in a coming release. We strongly recommend avoiding using this workaround if possible since the results will not be supported in the future and can lead to surprising and inaccurate results. If you run into a custom class besides sf columns that are impacted by this please report an issue.LIBARROW_MINIMAL=true
. This will have the core Arrow/Feather components but excludes Parquet, Datasets, compression libraries, and other optional features.create_package_with_all_dependencies()
function (also available on GitHub without installing the arrow package) will download all third-party C++ dependencies and bundle them inside the R source package. Run this function on a system connected to the network to produce the "fat" source package, then copy that .tar.gz package to your offline machine and install. Special thanks to @karldw for the huge amount of work on this.libz
) by setting ARROW_DEPENDENCY_SOURCE=AUTO
. This is not the default in this release (BUNDLED
, i.e. download and build all dependencies) but may become the default in the future.read_json_arrow()
) are now optional and still on by default; set ARROW_JSON=OFF
before building to disable them.options(arrow.use_altrep = FALSE)
Field
objects can now be created as non-nullable, and schema()
now optionally accepts a list of Field
swrite_parquet()
no longer errors when used with a grouped data.framecase_when()
now errors cleanly if an expression is not supported in Arrowopen_dataset()
now works on CSVs without header rowsT
and t
were reversed in read_csv_arrow()
log(..., base = b)
where b is something other than 2, e, or 10Table$create()
now has alias arrow_table()
This patch version contains fixes for some sanitizer and compiler warnings.
There are now more than 250 compute functions available for use in dplyr::filter()
, mutate()
, etc. Additions in this release include:
strsplit()
and str_split()
; strptime()
; paste()
, paste0()
, and str_c()
; substr()
and str_sub()
; str_like()
; str_pad()
; stri_reverse()
lubridate
methods such as year()
, month()
, wday()
, and so onlog()
et al.); trigonometry (sin()
, cos()
, et al.); abs()
; sign()
; pmin()
and pmax()
; ceiling()
, floor()
, and trunc()
ifelse()
and if_else()
for all but Decimal
types; case_when()
for logical, numeric, and temporal types only; coalesce()
for all but lists/structs. Note also that in this release, factors/dictionaries are converted to strings in these functions.is.*
functions are supported and can be used inside relocate()
The print method for arrow_dplyr_query
now includes the expression and the resulting type of columns derived by mutate()
.
transmute()
now errors if passed arguments .keep
, .before
, or .after
, for consistency with the behavior of dplyr
on data.frame
s.
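A small sketch using a few of the compute functions listed above (toy data):

```r
library(arrow)
library(dplyr)

tab <- Table$create(x = c(-2, 0, 3.5), s = c("a", "b", "c"))

tab |>
  mutate(
    label = paste0("item-", s),
    sign  = case_when(x > 0 ~ 1, x < 0 ~ -1, TRUE ~ 0),
    mag   = abs(x)
  ) |>
  collect()
```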
write_csv_arrow()
to use Arrow to write a data.frame to a single CSV filewrite_dataset(format = "csv", ...)
to write a Dataset to CSVs, including with partitioningreticulate::py_to_r()
and r_to_py()
methods. Along with the addition of the Scanner$ToRecordBatchReader()
method, you can now build up a Dataset query in R and pass the resulting stream of batches to another tool in process.Array$export_to_c()
, RecordBatch$import_from_c()
), similar to how they are in pyarrow
. This facilitates their use in other packages. See the py_to_r()
and r_to_py()
methods for usage examples.data.frame
to an Arrow Table
uses multithreading across columnsoptions(arrow.use_altrep = FALSE)
is.na()
now evaluates to TRUE
on NaN
values in floating point number fields, for consistency with base R.is.nan()
now evaluates to FALSE
on NA
values in floating point number fields and FALSE
on all values in non-floating point fields, for consistency with base R.Array
, ChunkedArray
, RecordBatch
, and Table
: na.omit()
and friends, any()
/all()
RecordBatch$create()
and Table$create()
are recycledarrow_info()
includes details on the C++ build, such as compiler versionmatch_arrow()
now converts x
into an Array
if it is not a Scalar
, Array
or ChunkedArray
and no longer dispatches base::match()
.LIBARROW_MINIMAL=false
) includes both jemalloc and mimalloc, and it still has jemalloc as the default, though this is configurable at runtime with the ARROW_DEFAULT_MEMORY_POOL
environment variable.LIBARROW_MINIMAL
, LIBARROW_DOWNLOAD
, and NOT_CRAN
are now case-insensitive in the Linux build script.

Many more dplyr
verbs are supported on Arrow objects:
dplyr::mutate()
is now supported in Arrow for many applications. For queries on Table
and RecordBatch
that are not yet supported in Arrow, the implementation falls back to pulling data into an in-memory R data.frame
first, as in the previous release. For queries on Dataset
(which can be larger than memory), it raises an error if the function is not implemented. The main mutate()
features that cannot yet be called on Arrow objects are (1) mutate()
after group_by()
(which is typically used in combination with aggregation) and (2) queries that use dplyr::across()
.dplyr::transmute()
(which calls mutate()
)dplyr::group_by()
now preserves the .drop
argument and supports on-the-fly definition of columnsdplyr::relocate()
to reorder columnsdplyr::arrange()
to sort rowsdplyr::compute()
to evaluate the lazy expressions and return an Arrow Table. This is equivalent to dplyr::collect(as_data_frame = FALSE)
, which was added in 2.0.0.

Over 100 functions can now be called on Arrow objects inside a dplyr
verb:
nchar()
, tolower()
, and toupper()
, along with their stringr
spellings str_length()
, str_to_lower()
, and str_to_upper()
, are supported in Arrow dplyr
calls. str_trim()
is also supported.sub()
, gsub()
, and grepl()
, along with str_replace()
, str_replace_all()
, and str_detect()
, are supported.cast(x, type)
and dictionary_encode()
allow changing the type of columns in Arrow objects; as.numeric()
, as.character()
, etc. are exposed as similar type-altering conveniencesdplyr::between()
; the Arrow version also allows the left
and right
arguments to be columns in the data and not just scalarsdplyr
verb. This enables you to access Arrow functions that don't have a direct R mapping. See list_compute_functions()
for all available functions, which are available in dplyr
prefixed by arrow_
.dplyr::filter(arrow_dataset, string_column == 3)
will error with a message about the type mismatch between the numeric 3
and the string type of string_column
.open_dataset()
now accepts a vector of file paths (or even a single file path). Among other things, this enables you to open a single very large file and use write_dataset()
to partition it without having to read the whole file into memory.write_dataset()
now defaults to format = "parquet"
and better validates the format
argumentschema
in open_dataset()
is now correctly handledScanner$Scan()
method has been removed; use Scanner$ScanBatches()
value_counts()
to tabulate values in an Array
or ChunkedArray
, similar to base::table()
.StructArray
objects gain data.frame-like methods, including names()
, $
, [[
, and dim()
.<-
) with either $
or [[
Schema
can now be edited by assigning in new types. This enables using the CSV reader to detect the schema of a file, modify the Schema
object for any columns that you want to read in as a different type, and then use that Schema
to read the data.Table
with a schema, with columns of different lengths, and with scalar value recycling\0
) characters, the error message now informs you that you can set options(arrow.skip_nul = TRUE)
to strip them out. It is not recommended to set this option by default since this code path is significantly slower, and most string data does not contain nuls.read_json_arrow()
now accepts a schema: read_json_arrow("file.json", schema = schema(col_a = float64(), col_b = string()))
vignette("install", package = "arrow")
for details. This allows a faster, smaller package build in cases where that is useful, and it enables a minimal, functioning R package build on Solaris.FORCE_BUNDLED_BUILD=true
.arrow
now uses the mimalloc
memory allocator by default on macOS, if available (as it is in CRAN binaries), instead of jemalloc
. There are configuration issues with jemalloc
on macOS, and benchmark analysis shows that this has negative effects on performance, especially on memory-intensive workflows. jemalloc
remains the default on Linux; mimalloc
is default on Windows.ARROW_DEFAULT_MEMORY_POOL
environment variable to switch memory allocators now works correctly when the Arrow C++ library has been statically linked (as is usually the case when installing from CRAN).arrow_info()
function now reports on the additional optional features, as well as the detected SIMD level. If key features or compression libraries are not enabled in the build, arrow_info()
will refer to the installation vignette for guidance on how to install a more complete build, if desired.vignette("developing", package = "arrow")
.ARROW_HOME
to point to a specific directory where the Arrow libraries are. This is similar to passing INCLUDE_DIR
and LIB_DIR
.flight_get()
and flight_put()
(renamed from push_data()
in this release) can handle both Tables and RecordBatchesflight_put()
gains an overwrite
argument to optionally check for the existence of a resource with the same namelist_flights()
and flight_path_exists()
enable you to see available resources on a Flight serverSchema
objects now have r_to_py
and py_to_r
methods+
, *
, etc.) are supported on Arrays and ChunkedArrays and can be used in filter expressions in Arrow dplyr
pipelines<-
) with either $
or [[
names()
rlang
pronouns .data
and .env
are now fully supported in Arrow dplyr
pipelines.arrow.skip_nul
(default FALSE
, as in base::scan()
) allows conversion of Arrow string (utf8()
) type data containing embedded nul \0
characters to R. If set to TRUE
, nuls will be stripped and a warning is emitted if any are found.arrow_info()
for an overview of various run-time and build-time Arrow configurations, useful for debuggingARROW_DEFAULT_MEMORY_POOL
before loading the Arrow package to change memory allocators. Windows packages are built with mimalloc
; most others are built with both jemalloc
(used by default) and mimalloc
. These alternative memory allocators are generally much faster than the system memory allocator, so they are used by default when available, but sometimes it is useful to turn them off for debugging purposes. To disable them, set ARROW_DEFAULT_MEMORY_POOL=system
.sf
tibbles to faithfully preserved and roundtripped (#8549).schema()
for more details.write_parquet()
can now write RecordBatchesreadr
's problems
attribute is removed when converting to Arrow RecordBatch and table to prevent large amounts of metadata from accumulating inadvertently (#9092)SubTreeFileSystem
gains a useful print method and no longer errors when printingr-arrow
package are available with conda install -c arrow-nightlies -c conda-forge --strict-channel-priority r-arrow
cmake
versionsvignette("install", package = "arrow")
, especially for known CentOS issuesdistro
package. If your OS isn't correctly identified, please report an issue there.write_dataset()
to Feather or Parquet files with partitioning. See the end of vignette("dataset", package = "arrow")
for discussion and examples.head()
, tail()
, and take ([
) methods. head()
is optimized but the others may not be performant.collect()
gains an as_data_frame
argument, default TRUE
but when FALSE
allows you to evaluate the accumulated select
and filter
query but keep the result in Arrow, not an R data.frame
read_csv_arrow()
supports specifying column types, both with a Schema
and with the compact string representation for types used in the readr
package. It also has gained a timestamp_parsers
argument that lets you express a set of strptime
parse strings that will be tried to convert columns designated as Timestamp
type.libcurl
and openssl
, as well as a sufficiently modern compiler. See vignette("install", package = "arrow")
for details.read_parquet()
, write_feather()
, et al.), as well as open_dataset()
and write_dataset()
, allow you to access resources on S3 (or on file systems that emulate S3) either by providing an s3://
URI or by providing a FileSystem$path()
. See vignette("fs", package = "arrow")
for examples.copy_files()
allows you to recursively copy directories of files from one file system to another, such as from S3 to your local machine.

Flight
is a general-purpose client-server framework for high performance
transport of large datasets over network interfaces.
The arrow
R package now provides methods for connecting to Flight RPC servers
to send and receive data. See vignette("flight", package = "arrow")
for an overview.
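A rough sketch of talking to a Flight server (assumes a server is reachable on localhost:8089 and that the Python pyarrow dependency used by the R Flight bindings is available):

```r
library(arrow)

client <- flight_connect(port = 8089)

# Send a data.frame to the server, list what is available, and fetch it back
flight_put(client, mtcars, path = "test_data/mtcars")
list_flights(client)
flight_get(client, "test_data/mtcars")
```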
==
, >
, etc.) and boolean (&
, |
, !
) operations, along with is.na
, %in%
and match
(called match_arrow()
), on Arrow Arrays and ChunkedArrays are now implemented in the C++ library.min()
, max()
, and unique()
are implemented for Arrays and ChunkedArrays.dplyr
filter expressions on Arrow Tables and RecordBatches are now evaluated in the C++ library, rather than by pulling data into R and evaluating. This yields significant performance improvements.dim()
(nrow
) for dplyr queries on Table/RecordBatch is now supportedarrow
now depends on cpp11
, which brings more robust UTF-8 handling and faster compilationInt64
type when all values fit with an R 32-bit integer now correctly inspects all chunks in a ChunkedArray, and this conversion can be disabled (so that Int64
always yields a bit64::integer64
vector) by setting options(arrow.int64_downcast = FALSE)
.ParquetFileReader
has additional methods for accessing individual columns or row groups from the fileParquetFileWriter
; invalid ArrowObject
pointer from a saved R object; converting deeply nested structs from Arrow to Rproperties
and arrow_properties
arguments to write_parquet()
are deprecated%in%
expression now faithfully returns all relevant rows.
or _
; files and subdirectories starting with those prefixes are still ignoredopen_dataset("~/path")
now correctly expands the pathversion
option to write_parquet()
is now correctly implementedparquet-cpp
library has been fixedcmake
is more robust, and you can now specify a /path/to/cmake
by setting the CMAKE
environment variablevignette("arrow", package = "arrow")
includes tables that explain how R types are converted to Arrow types and vice versa.uint64
, binary
, fixed_size_binary
, large_binary
, large_utf8
, large_list
, list
of structs
.character
vectors that exceed 2GB are converted to Arrow large_utf8
typePOSIXlt
objects can now be converted to Arrow (struct
)attributes()
are preserved in Arrow metadata when converting to Arrow RecordBatch and table and are restored when converting from Arrow. This means that custom subclasses, such as haven::labelled
, are preserved in round trip through Arrow.batch$metadata$new_key <- "new value"
int64
, uint32
, and uint64
now are converted to R integer
if all values fit in boundsdate32
is now converted to R Date
with double
underlying storage. Even though the data values themselves are integers, this provides more strict round-trip fidelityfactor
, dictionary
ChunkedArrays that do not have identical dictionaries are properly unifiedRecordBatch{File,Stream}Writer
will write V5, but you can specify an alternate metadata_version
. For convenience, if you know the consumer you're writing to cannot read V5, you can set the environment variable ARROW_PRE_1_0_METADATA_VERSION=1
to write V4 without changing any other code.ds <- open_dataset("s3://...")
. Note that this currently requires a special C++ library build with additional dependencies--this is not yet available in CRAN releases or in nightly packages.sum()
and mean()
are implemented for Array
and ChunkedArray
dimnames()
and as.list()
reticulate
coerce_timestamps
option to write_parquet()
is now correctly implemented.type
definition if provided by the userread_arrow
and write_arrow
are now deprecated; use the read/write_feather()
and read/write_ipc_stream()
functions depending on whether you're working with the Arrow IPC file or stream format, respectively.FileStats
, read_record_batch
, and read_table
have been removed.jemalloc
included, and Windows packages use mimalloc
CC
and CXX
values that R usesdplyr
1.0reticulate::r_to_py()
conversion now correctly works automatically, without having to call the method yourself.

This release includes support for version 2 of the Feather file format.
Feather v2 features full support for all Arrow data types,
fixes the 2GB per-column limitation for large amounts of string data,
and it allows files to be compressed using either lz4
or zstd
.
write_feather()
can write either version 2 or
version 1 Feather files, and read_feather()
automatically detects which file version it is reading.
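For example (a sketch; compression support depends on how the C++ library was built):

```r
library(arrow)

tf <- tempfile(fileext = ".feather")

# Version 2 files support all Arrow types and optional compression
write_feather(mtcars, tf, compression = "zstd")

# A V1 file can still be written for older readers
write_feather(mtcars, tf, version = 1)

df <- read_feather(tf)  # the file version is detected automatically
```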
Related to this change, several functions around reading and writing data
have been reworked. read_ipc_stream()
and write_ipc_stream()
have been
added to facilitate writing data to the Arrow IPC stream format, which is
slightly different from the IPC file format (Feather v2 is the IPC file format).
Behavior has been standardized: all read_<format>()
return an R data.frame
(default) or a Table
if the argument as_data_frame = FALSE
;
all write_<format>()
functions return the data object, invisibly.
To facilitate some workflows, a special write_to_raw()
function is added
to wrap write_ipc_stream()
and return the raw
vector containing the buffer
that was written.
To achieve this standardization, read_table()
, read_record_batch()
,
read_arrow()
, and write_arrow()
have been deprecated.
The 0.17 Apache Arrow release includes a C data interface that allows
exchanging Arrow data in-process at the C level without copying
and without libraries having a build or runtime dependency on each other. This enables
us to use reticulate
to share data between R and Python (pyarrow
) efficiently.
See vignette("python", package = "arrow")
for details.
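A minimal sketch of sharing a Table with Python (assumes pyarrow is installed in the Python environment reticulate uses):

```r
library(arrow)
library(reticulate)

pa <- import("pyarrow")

tab <- Table$create(x = 1:3, y = c("a", "b", "c"))

py_tab <- r_to_py(tab)   # hand the Table to pyarrow without copying
py_tab$num_rows
back <- py_to_r(py_tab)  # and bring it back as an R arrow Table
```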
dim()
method, which sums rows across all files (#6635, @boshek)UnionDataset
with the c()
methodNA
as FALSE
, consistent with dplyr::filter()
vignette("dataset", package = "arrow")
now has correct, executable codeNOT_CRAN=true
. See vignette("install", package = "arrow")
for details and more options.unify_schemas()
to create a Schema
containing the union of fields in multiple schemasread_feather()
and other reader functions close any file connections they openR.oo
package is also loadedFileStats
is renamed to FileInfo
, and the original spelling has been deprecatedinstall_arrow()
now installs the latest release of arrow
, including Linux dependencies, either for CRAN releases or for development builds (if nightly = TRUE
)LIBARROW_DOWNLOAD
or NOT_CRAN
environment variable is setwrite_feather()
, write_arrow()
and write_parquet()
now return their input,
similar to the write_*
functions in the readr
package (#6387, @boshek)list
and create a ListArray when all list elements are the same type (#6275, @michaelchirico)This release includes a dplyr
interface to Arrow Datasets,
which let you work efficiently with large, multi-file datasets as a single entity.
Explore a directory of data files with open_dataset()
and then use dplyr
methods to select()
, filter()
, etc. Work will be done where possible in Arrow memory. When necessary, data is pulled into R for further computation. dplyr
methods are conditionally loaded if you have dplyr
available; it is not a hard dependency.
See vignette("dataset", package = "arrow")
for details.
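A brief sketch of the workflow (the directory and column names are placeholders):

```r
library(arrow)
library(dplyr)

ds <- open_dataset("path/to/data_dir")

ds %>%
  select(carrier, dep_delay) %>%
  filter(dep_delay > 15) %>%
  collect()
```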
A source package installation (as from CRAN) will now handle its C++ dependencies automatically. For common Linux distributions and versions, installation will retrieve a prebuilt static C++ library for inclusion in the package; where this binary is not available, the package executes a bundled script that should build the Arrow C++ library with no system dependencies beyond what R requires.
See vignette("install", package = "arrow")
for details.
Table
s and RecordBatch
es also have dplyr
methods.dplyr
, [
methods for Tables, RecordBatches, Arrays, and ChunkedArrays now support natural row extraction operations. These use the C++ Filter
, Slice
, and Take
methods for efficient access, depending on the type of selection vector.array_expression
class has also been added, enabling among other things the ability to filter a Table with some function of Arrays, such as arrow_table[arrow_table$var1 > 5, ]
without having to pull everything into R first.write_parquet()
now supports compressioncodec_is_available()
returns TRUE
or FALSE
whether the Arrow C++ library was built with support for a given compression library (e.g. gzip, lz4, snappy)character
(as R factor
levels are required to be) instead of raising an errorClass$create()
methods. Notably, arrow::array()
and arrow::table()
have been removed in favor of Array$create()
and Table$create()
, eliminating the package startup message about masking base
functions. For more information, see the new vignette("arrow")
.ARROW_PRE_0_15_IPC_FORMAT=1
.as_tibble
argument in the read_*()
functions has been renamed to as_data_frame
(#5399, @jameslamb)arrow::Column
class has been removed, as it was removed from the C++ libraryTable
and RecordBatch
objects have S3 methods that enable you to work with them more like data.frame
s. Extract columns, subset, and so on. See ?Table
and ?RecordBatch
for examples.read_csv_arrow()
supports more parsing options, including col_names
, na
, quoted_na
, and skip
read_parquet()
and read_feather()
can ingest data from a raw
vector (#5141)~/file.parquet
(#5169)double()
), and time types can be created with human-friendly resolution strings ("ms", "s", etc.). (#5198, #5201)

Initial CRAN release of the arrow
package. Key features include: