Below is the code used to create the fake data needed for the data processing in this document. Click to expand it!
library(polars)
library(arrow)
library(dplyr)
library(data.table)
library(DBI)
library(duckdb)
library(tictoc)
library(microbenchmark)
library(readr)
library(fs)
library(ggplot2)
library(pryr)
library(dbplyr)
library(forcats)
library(collapse)

# Creation of the "Datasets" folder
dir.create(normalizePath("Datasets"))

set.seed(123)

# Creation of a large example R data.frame
DataMultiTypes <- data.frame(
  colDate1 = as.POSIXct(sample(as.POSIXct("2023-01-01"):as.POSIXct("2023-06-30"), 500000), origin = "1970-01-01"),
  colDate2 = as.POSIXct(sample(as.POSIXct("2023-07-01"):as.POSIXct("2023-12-31"), 500000), origin = "1970-01-01"),
  colInt = sample(1:10000, 500000, replace = TRUE),
  colNum = runif(500000),
  colString = sample(c("A", "B", "C"), 500000, replace = TRUE),
  colFactor = factor(sample(c("Low", "Medium", "High"), 500000, replace = TRUE))
)

DataMultiTypes_pl <- as_polars_df(DataMultiTypes)

# Creation of a large csv file
write.csv(
  x = DataMultiTypes,
  file = "Datasets/DataMultiTypes.csv",
  row.names = FALSE
)

# Creation of a unique parquet file
arrow::write_parquet(
  x = DataMultiTypes,
  sink = "Datasets/DataMultiTypes.parquet"
)

# Creation of 6 parquet files
arrow::write_dataset(
  dataset = DataMultiTypes,
  path = "Datasets/DataMultiTypes",
  format = c("parquet"),
  partitioning = c("colFactor")
)

# Creation of a duckdb file
con <- dbConnect(duckdb::duckdb(), "Datasets/DataMultiTypes.duckdb")
duckdb::dbWriteTable(con, "DataMultiTypes", DataMultiTypes, overwrite = TRUE)
dbDisconnect(con, shutdown = TRUE)
In this part of the book, we will compare the performance of polars with other syntaxes, in particular base R, dplyr, dbplyr, SQL and data.table.
This section is structured according to the type of file format used for the comparison.
Note
The data processing that is performed makes very little statistical sense, but it does strive to perform some of the operations most frequently used by data scientists.
Data processing steps (a polars sketch of these steps follows the list):
Conversion of two columns to Date format;
Filtering on an integer column whose values must be within a range;
Grouping by a string column;
Aggregation with calculation of the mean, minimum and maximum over two columns.
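To make these steps concrete, here is a minimal sketch of what they could look like in polars eager syntax on the DataMultiTypes_pl DataFrame created above. This is our own illustration rather than the benchmark code itself, and it assumes a recent r-polars release in which $group_by() and $agg() are available:

result_pl <- DataMultiTypes_pl$
  with_columns(
    # 1. Conversion of the two datetime columns to Date
    pl$col("colDate1")$cast(pl$Date),
    pl$col("colDate2")$cast(pl$Date)
  )$
  filter(
    # 2. Filter on the integer column
    pl$col("colInt") > 2000 & pl$col("colInt") < 8000
  )$
  group_by("colString")$
  agg(
    # 3./4. Grouping by the string column, then mean, min and max over two columns
    pl$col("colInt")$mean()$alias("mean_colInt"),
    pl$col("colInt")$min()$alias("min_colInt"),
    pl$col("colInt")$max()$alias("max_colInt"),
    pl$col("colNum")$mean()$alias("mean_colNum"),
    pl$col("colNum")$min()$alias("min_colNum"),
    pl$col("colNum")$max()$alias("max_colNum")
  )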
5.1 From an R object
This section analyses the different methods for making a query from an R object already loaded in memory.
Let's start by comparing polars with base R, dplyr and data.table. We'll also add collapse, a recent package that is very fast for data manipulation.
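As an indication of what the collapse version can look like, here is a minimal sketch using its fast grouped functions. It is our own illustration of the filter/group/aggregate steps described above, not necessarily the exact code that was benchmarked:

res_collapse <- DataMultiTypes |>
  # Filter rows on the integer column
  fsubset(colInt > 2000 & colInt < 8000) |>
  # Grouping by the string column
  fgroup_by(colString) |>
  # Aggregation with collapse's fast statistical functions
  fsummarise(
    min_colInt  = fmin(colInt),
    mean_colInt = fmean(colInt),
    max_colInt  = fmax(colInt),
    min_colNum  = fmin(colNum),
    mean_colNum = fmean(colNum),
    max_colNum  = fmax(colNum)
  )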
Now let's look at how to use the DuckDB engine on R objects. There are three main possibilities:
To use the DuckDB engine to query an R object with dplyr, you can use the duckdb::duckdb_register() method and then the dplyr::tbl() method to pass your dplyr instructions (dplyr/DuckDB). A sketch of this approach is given just after this list.
To use the DuckDB engine to query an R object with the standard DBI methods, you can use the duckdb::duckdb_register() method and then the DBI::dbGetQuery() method to pass your SQL query (SQL/DuckDB).
To use the DuckDB engine to query an R object in combination with the {arrow} package, you can use arrow::to_duckdb() and then pass your dplyr instructions (dplyr/arrow/DuckDB).
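Here is a minimal sketch of the first (dplyr/DuckDB) possibility. The function name robject_duckdb_dplyr is ours, and the aggregation simply mirrors the SQL query shown below:

robject_duckdb_dplyr <- function() {
  con <- DBI::dbConnect(duckdb::duckdb())
  # Register the in-memory data.frame as a virtual DuckDB table (no copy)
  duckdb::duckdb_register(con, "DataMultiTypes", DataMultiTypes)

  result <- dplyr::tbl(con, "DataMultiTypes") |>
    filter(colInt > 2000 & colInt < 8000) |>
    group_by(colString) |>
    summarise(
      min_colInt  = min(colInt, na.rm = TRUE),
      mean_colInt = mean(colInt, na.rm = TRUE),
      max_colInt  = max(colInt, na.rm = TRUE),
      min_colNum  = min(colNum, na.rm = TRUE),
      mean_colNum = mean(colNum, na.rm = TRUE),
      max_colNum  = max(colNum, na.rm = TRUE)
    ) |>
    collect()

  DBI::dbDisconnect(con, shutdown = TRUE)
  return(result)
}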
robject_duckdb_sql <- function() {
  con <- DBI::dbConnect(duckdb::duckdb())
  duckdb::duckdb_register(con, "DataMultiTypes", DataMultiTypes)

  result <- DBI::dbGetQuery(
    con,
    "SELECT colString,
            MIN(colInt) AS min_colInt,
            AVG(colInt) AS mean_colInt,
            MAX(colInt) AS max_colInt,
            MIN(colNum) AS min_colNum,
            AVG(colNum) AS mean_colNum,
            MAX(colNum) AS max_colNum
     FROM (
       SELECT colString, colInt, colNum
       FROM DataMultiTypes
       WHERE colInt > 2000 AND colInt < 8000
     ) AS filtered_data
     GROUP BY colString;")

  DBI::dbDisconnect(con, shutdown = TRUE)
  return(result)
}
One advantage of combining the DuckDB engine with dplyr is that you can use a feature implemented by DuckDB but not yet by Arrow. You can also do the opposite and return to the Arrow engine with arrow::to_arrow().
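For example, here is an illustrative sketch (ours, not the benchmark code) of such a round trip: the filter runs on the DuckDB engine, then the aggregation runs back on the Arrow engine (Acero):

result <- arrow::arrow_table(DataMultiTypes) |>
  arrow::to_duckdb() |>                   # hand the query over to the DuckDB engine
  filter(colInt > 2000 & colInt < 8000) |>
  arrow::to_arrow() |>                    # switch back to the Arrow engine (Acero)
  group_by(colString) |>
  summarise(mean_colInt = mean(colInt, na.rm = TRUE)) |>
  collect()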
However, among these DuckDB-based approaches, the benchmark results are clear: the SQL queries are by far the fastest!
Conclusion of this little benchmark using R objects already loaded in memory: the fastest to run is collapse, followed by data.table and dplyr, with polars close behind. The worst performer is, surprisingly, DuckDB with the dplyr syntax, while DuckDB with the SQL language does very well and comes 4th in this ranking.
5.2 From a csv file

csv_rbase <- function() {
  # Reading the csv file
  result <- read.csv("Datasets/DataMultiTypes.csv")

  # Conversion of 2 columns to Date format
  result$colDate1 <- as.Date(result$colDate1)
  result$colDate2 <- as.Date(result$colDate2)

  # Creation of a diff column between 2 dates (in days)
  result$diff <- round(
    as.integer(difftime(result$colDate2, result$colDate1, units = "days")),
    0
  )

  # Filter rows
  result <- result[result$colInt > 2000 & result$colInt < 8000, ]

  # Grouping and aggregation
  result_agg <- aggregate(
    cbind(colInt, colNum) ~ colString,
    data = result,
    FUN = function(x) c(mean = mean(x), min = min(x), max = max(x))
  )

  return(result_agg)
}

tic()
res_rbase <- csv_rbase()
toc()
csv_dplyr <- function() {
  # Reading the csv file
  result <- readr::read_csv("Datasets/DataMultiTypes.csv", show_col_types = FALSE)

  # Conversion of 2 columns to Date format
  result <- result |>
    mutate(
      colDate1 = as.Date(colDate1),
      colDate2 = as.Date(colDate2)
    )

  # Creation of a diff column between 2 dates (in days)
  result <- result |>
    mutate(diff = round(as.integer(difftime(colDate2, colDate1, units = "days")), 0))

  # Filter rows
  result <- result |>
    filter(colInt > 2000 & colInt < 8000)

  # Grouping and aggregation
  result_agg <- result |>
    group_by(colString) |>
    summarise(
      min_colInt = min(colInt),
      mean_colInt = mean(colInt),
      max_colInt = max(colInt),
      min_colNum = min(colNum),
      mean_colNum = mean(colNum),
      max_colNum = max(colNum)
    )

  return(result_agg)
}

tic()
res_dplyr <- csv_dplyr()
toc()
csv_arrow <- function() {
  # Reading the csv file
  result <- arrow::read_csv_arrow("Datasets/DataMultiTypes.csv", as_data_frame = FALSE)

  # Conversion of 2 columns to Date format
  result <- result |>
    mutate(
      colDate1 = as.Date(colDate1),
      colDate2 = as.Date(colDate2)
    )

  # Creation of a diff column between 2 dates (in days)
  # difftime(units = "days") is not supported in arrow yet
  result <- result |>
    mutate(diff = round(as.integer64(difftime(colDate2, colDate1)) %/% (60 * 60 * 24), 0))

  # Filter rows
  result <- result |>
    filter(colInt > 2000 & colInt < 8000)

  # Grouping and aggregation
  result_agg <- result |>
    group_by(colString) |>
    summarise(
      min_colInt = min(colInt),
      mean_colInt = mean(colInt),
      max_colInt = max(colInt),
      min_colNum = min(colNum),
      mean_colNum = mean(colNum),
      max_colNum = max(colNum)
    ) |>
    collect()

  return(result_agg)
}

tic()
res_arrow <- csv_arrow()
toc()
The data processing performed is not entirely equivalent, since it additionally includes:
- for polars (lazy mode), a conversion to an R data.frame at the end of processing;
- for data.table, a conversion to the data.table format at the start, then a conversion to an R data.frame at the end of processing.
5.2.1 Results: eager vs lazy mode
csv_bmk <- microbenchmark(
  "polars (eager) from csv file" = csv_eager_polars(),
  "polars (lazy) from csv file" = csv_lazy_polars()$collect(),
  "R base - from csv file" = csv_rbase(),
  "dplyr - from csv file" = csv_dplyr(),
  "dplyr (Acero) - from csv file" = csv_arrow(),
  "data.table - from csv file" = csv_dt(),
  times = 5
)
csv_bmk
Unit: milliseconds
                          expr       min         lq       mean    median         uq        max neval
  polars (eager) from csv file  100.4107  104.44400  107.22650  105.5091  105.54307  120.22562     5
   polars (lazy) from csv file   41.5927   41.73844   44.78536   42.2385   42.31532   56.04185     5
        R base - from csv file 5539.4923 5568.00176 5648.44625 5634.6846 5705.71878 5794.33380     5
         dplyr - from csv file  315.8509  324.93070  429.37978  480.7258  486.39342  538.99816     5
 dplyr (Acero) - from csv file  154.2273  156.41234  162.33605  157.1189  158.99038  184.93128     5
    data.table - from csv file  106.2667  122.96019  204.51577  132.7937  287.70828  372.84998     5
Conclusion of this little benchmark based on csv files: the big winners are polars (eager mode) and dplyr with {arrow}. The results will undoubtedly be even better with polars (lazy mode)… TO DO!
5.3 From a unique parquet file
For this comparison on a unique parquet file, we will use:
For polars (lazy), the pl$scan_parquet() method (a sketch is given just after this list)
For arrow (eager), the arrow::read_parquet() method
For arrow (lazy), the arrow::open_dataset() method
For DuckDB and SQL, DuckDB's read_parquet() table function together with the DBI::dbGetQuery() method
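As an indication, here is a minimal sketch (ours) of the polars (lazy) variant with pl$scan_parquet(), assuming a recent r-polars release in which $group_by() is available. The query is only executed when $collect() is called:

parquet_polars_lazy <- function() {
  pl$scan_parquet("Datasets/DataMultiTypes.parquet")$
    filter(pl$col("colInt") > 2000 & pl$col("colInt") < 8000)$
    group_by("colString")$
    agg(
      pl$col("colInt")$min()$alias("min_colInt"),
      pl$col("colInt")$mean()$alias("mean_colInt"),
      pl$col("colInt")$max()$alias("max_colInt"),
      pl$col("colNum")$min()$alias("min_colNum"),
      pl$col("colNum")$mean()$alias("mean_colNum"),
      pl$col("colNum")$max()$alias("max_colNum")
    )
}

tic()
res_polars_lazy <- parquet_polars_lazy()$collect()
toc()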
parquet_duckdb_sql <- function() {
  con <- dbConnect(duckdb::duckdb())

  result <- dbGetQuery(
    con,
    "SELECT colString,
            MIN(colInt) AS min_colInt,
            AVG(colInt) AS mean_colInt,
            MAX(colInt) AS max_colInt,
            MIN(colNum) AS min_colNum,
            AVG(colNum) AS mean_colNum,
            MAX(colNum) AS max_colNum
     FROM (
       SELECT colString, colInt, colNum
       FROM read_parquet('Datasets/DataMultiTypes.parquet')
       WHERE colInt > 2000 AND colInt < 8000
     ) AS filtered_data
     GROUP BY colString;")

  dbDisconnect(con, shutdown = TRUE)
  return(result)
}

tic()
parquet_duckdb_sql()
toc()
5.4 From a partitioned parquet file

partitioned_parquet_bmk <- microbenchmark(
  "arrow (lazy) - from partitioned parquet file" = partitioned_parquet_arrow_lazy(),
  "dplyr (duckdb) - from partitioned parquet file" = partitioned_parquet_dplyr_duckdb(),
  "polars (lazy) - from partitioned parquet file" = as.data.frame(partitioned_parquet_polars_lazy()),
  times = 5
)
print(partitioned_parquet_bmk)
Unit: milliseconds
                                           expr       min        lq      mean    median        uq       max neval
   arrow (lazy) - from partitioned parquet file 108.59320 108.75808 112.21279 108.88159 116.96340 117.86767     5
 dplyr (duckdb) - from partitioned parquet file 254.13925 260.15026 287.42754 266.29269 313.58977 342.96574     5
  polars (lazy) - from partitioned parquet file  18.47431  18.75874  21.26151  19.21465  19.63563  30.22422     5
Conclusion of this little benchmark based on partitioned parquet files: as for unique parquet files, the big winner is polars (lazy mode)!
5.5 From a DuckDB file
Let's look at how to perform queries on duckdb files.
For this comparison, we will use:
For SQL, the DBI::dbGetQuery() method. In this way, we use the standard DBI methods to work from a DuckDB file.
duckdb_dbfile_sql <- function() {
  con <- dbConnect(duckdb::duckdb(), "Datasets/DataMultiTypes.duckdb")

  result <- dbGetQuery(
    con,
    "SELECT colString,
            MIN(colInt) AS min_colInt,
            AVG(colInt) AS mean_colInt,
            MAX(colInt) AS max_colInt,
            MIN(colNum) AS min_colNum,
            AVG(colNum) AS mean_colNum,
            MAX(colNum) AS max_colNum
     FROM (
       SELECT colString, colInt, colNum
       FROM DataMultiTypes
       WHERE colInt > 2000 AND colInt < 8000
     ) AS filtered_data
     GROUP BY colString;")

  dbDisconnect(con, shutdown = TRUE)
  return(result)
}

tic()
duckdb_dbfile_sql()
toc()
Let's do a quick sort on the expr column before plotting the results (one possible plot is sketched after the code below):
# Sort the results
bmk_results$expr <- reorder(bmk_results$expr, bmk_results$time, decreasing = TRUE)
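For instance, a boxplot of the timings could be drawn like this. This is an illustrative sketch: bmk_results is assumed to be a data.frame of combined microbenchmark results, whose time column is in nanoseconds:

ggplot(bmk_results, aes(x = expr, y = time / 1e6)) +
  geom_boxplot() +
  coord_flip() +
  labs(x = NULL, y = "Time (milliseconds)")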
Final conclusions
A few conclusions can be drawn from this section on benchmarking:
It is more efficient to work from a parquet or duckdb file, except for polars with lazy evaluation, which is very fast in all cases;
In terms of execution speed, there is no great difference between a single parquet file and several partitioned parquet files (although the gap will undoubtedly widen in favour of partitioned files if the size of the initial work file is increased);
Lazy evaluation of polars performs best whatever the format of the file you are working on.
It's followed by SQL queries executed directly on a duckdb file.
5.6.2 Memory usage
We've just analysed the performance of the various alternatives to polars, but what about R's memory usage?
To do this, we're going to use mem_change() from the {pryr} package. This method tells you how memory changes during code execution. Positive numbers represent an increase in the memory used by R, and negative numbers represent a decrease.
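For example (an illustrative call, not the full measurement code), wrapping one of the csv functions defined above:

library(pryr)

# Net change in R's memory use caused by running the expression
mem_change(
  res_rbase <- csv_rbase()
)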
Memory usage conclusions
A few conclusions can be drawn from this section on memory usage:
Firstly, the method with data.table from a csv file surprisingly consumes a lot of RAM. Maybe it's related to the as.data.table() conversion? If a reader has an explanation, I'm interested; feel free to open an issue;
Regarding csv files, the base R and dplyr syntaxes consume the least RAM (but at the expense of speed);
Regarding parquet files, the arrow (eager) and DuckDB-with-SQL syntaxes consume the least RAM;
The SQL language used on a DuckDB file also consumes very little RAM.