Below is the code used to create the fake data needed to compare the eager and lazy modes in this document.
```r
library(polars)
library(arrow)
library(dplyr)
library(fakir)
library(tictoc)

# Creation of the "Datasets" folder
dir.create(normalizePath("Datasets"))

# Creation of a large example R data.frame
fake_data <- fake_ticket_client(vol = 100000)

# Creation of a large example csv dataset
write.csv(x = fake_data, file = normalizePath("Datasets/fakir_file.csv"))

# Creation of a large example parquet dataset
write_parquet(x = fake_data, sink = normalizePath("Datasets/fakir_file.parquet"))
```
Note
This chapter only deals with the lazy execution of Polars. It does not include a comparison with base R, dplyr, and data.table.
4.1 Introduction to lazy mode
Polars supports two modes of operation: lazy and eager.
Let's start this chapter by citing the official documentation:
In the eager API the query is executed immediately while in the lazy API the query is only evaluated once it is "needed". Deferring the execution to the last minute can have significant performance advantages; that is why the Lazy API is preferred in most cases. Delaying execution until the last possible moment allows Polars to apply automatic optimization to every query.
As you can see, with lazy mode you give the engine a chance to analyse what you want to do and propose an optimal execution plan (for both reading and transforming datasets). Lazy evaluation is a fairly common way of improving processing speed and is used by Spark, among others.
So far in this book we have only used the eager mode, but fortunately all the syntax we've seen applies to lazy mode too. Whichever mode is used, queries are executed transparently for users.
4.1.1 Creation of a LazyFrame with lazy()
To convert a DataFrame to a LazyFrame we can use the lazy() method.
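For example, here is a minimal sketch using the built-in iris dataset; this query matches the plan printed below:

```r
# Convert an R data.frame to a polars DataFrame, then make it lazy
lf <- pl$DataFrame(iris)$lazy()$
  filter(pl$col("Species") == "setosa")$
  select(pl$col(c("Species", "Petal.Length")))

# Printing a LazyFrame displays its naive query plan
lf
```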
```
polars LazyFrame
 $describe_optimized_plan() : Show the optimized query plan.

Naive plan:
 SELECT [col("Species"), col("Petal.Length")] FROM
  FILTER [(col("Species")) == (String(setosa))] FROM
   DF ["Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width"]; PROJECT */5 COLUMNS; SELECTION: "None"
```
This way, we can display the naive plan (that is, the non-optimized plan). Let's see what it contains for our example:
- FILTER [(col("Species")) == (String(setosa))] FROM DF ["Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width"] means that once the entire dataset has been read into memory, this DataFrame will be filtered for rows where Species equals "setosa";
- PROJECT */5 COLUMNS selects all 5 columns (* is the wildcard meaning all);
- SELECTION: "None" means no rows will be filtered out while reading.
As indicated in the console, we can use the describe_optimized_plan() method to see the optimized plan.
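Continuing the sketch above (the output shown here is reconstructed from the elements discussed next, so the exact formatting may differ across polars versions):

```r
lf$describe_optimized_plan()
#>  SELECT [col("Species"), col("Petal.Length")] FROM
#>   DF ["Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width"];
#>   PROJECT 2/5 COLUMNS; SELECTION: "[(col(\"Species\")) == (String(setosa))]"
```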
This example shows a simple but surprisingly effective element of query optimisation: projection.
Let's see what changed in this optimized plan:

- PROJECT 2/5 COLUMNS selects only 2 of the 5 columns;
- SELECTION: "[(col(\"Species\")) == (String(setosa))]" means that Polars will apply the filter condition on the Species column as the dataset is being read.
We can see that Polars has identified that only 2 columns are needed to execute our query, which is memory efficient! And Polars did this without me having to select these variables myself (for example with a select() method). The added value of Polars is that it applies some optimizations that I/you might not even have known about. 💪
4.1.3 Execute the plan
To actually execute the plan, we just need to invoke the collect() method.
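With the iris sketch from above, this looks like:

```r
# Run the optimized plan and materialize the result as a DataFrame
lf$collect()
```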
Consider another example: calculating the median of Petal.Width per Species.
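In eager mode this might look like the following sketch (the original snippet is not shown here; on older polars versions the grouping method is spelled groupby() instead of group_by()):

```r
# Eager mode: each step runs immediately and materializes a result
df <- pl$DataFrame(iris)
df_filtered <- df$filter(pl$col("Sepal.Length") > 5)
df_agg <- df_filtered$group_by("Species")$agg(
  pl$col("Petal.Width")$median()
)
df_agg
```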
Every step is executed immediately, returning the intermediate results. This can be very wasteful, as we might do work or load extra data that is not being used. If we instead use the lazy API and wait to execute until all the steps are defined, the query planner can perform various optimizations. In this case:
- Predicate pushdown: apply filters as early as possible while reading the dataset, thus only reading rows with a sepal length greater than 5;
- Projection pushdown: select only the columns that are needed while reading the dataset, thus removing the need to load additional columns.
Tip
To consult the list of optimisations made by Polars on queries in lazy mode, see this page.
Here is the equivalent code using the lazy API. At the end of the query, don't forget to use the collect() method to inform Polars that you want to execute it.
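A sketch of that lazy version, assuming the iris data has been written to a csv file (the file name here is hypothetical):

```r
# Lazy mode: scan_csv() only plans the read; nothing runs until collect()
pl$scan_csv("Datasets/iris.csv")$
  filter(pl$col("Sepal.Length") > 5)$
  group_by("Species")$
  agg(pl$col("Petal.Width")$median())$
  collect()
```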
Using lazy execution will significantly lower the load on memory and CPU, allowing you to fit bigger datasets in memory and process them faster.
The next section will demonstrate this time saving.
4.2.2 Limits of lazy mode
There are some operations that cannot be performed in lazy mode (whether in Polars or in other lazy frameworks, such as SQL databases). One limitation is that Polars needs to know the column names and dtypes at each step of the query plan.
For example, we can't pivot() (see here) in lazy mode, as the column names are data-dependent after a pivot. Indeed, when you pivot() a DataFrame, the future column names cannot be predicted because they depend on what is actually in your dataset!
When you have to perform operations that cannot be done in lazy mode, the recommendation is:

- run your query in lazy mode as far as possible;
- evaluate this lazy query with collect() when you need a non-lazy method;
- run the non-lazy methods;
- call lazy() on the output to continue in lazy mode.
Here's an example:
```r
pl$scan_parquet("Datasets/fakir_file.parquet")$
  # Call collect() because I need to pivot()
  collect()$
  pivot(
    index = "region",
    columns = "priority",
    values = "age",
    aggregate_function = "mean"
  )$
  # Continue in lazy mode
  lazy()$
  select(
    pl$col(c("region", "Gold", "Platinium"))
  )$
  collect()
```
For this fight, we're going to use a fake dataset with 1 000 000 rows and 25 columns created with the {fakir} package. The code for creating this dataset is available at the beginning of this document.
This fight will take place over 3 rounds:

- with an eager query versus a lazy query from a DataFrame;
- with an eager query versus a lazy query from a csv file;
- with an eager query versus a lazy query from a parquet file.
4.3.1 From a DataFrame
For this first round, and as seen above, let's start with a simple query from a DataFrame:
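Here is a sketch of what this comparison could look like, timed with {tictoc}. The fake_data object comes from the setup code at the top; the filter value "Bretagne" and the selected columns (region, priority, age, reused from the pivot example above) are illustrative assumptions:

```r
# Round 1a - eager: each step runs immediately on the in-memory DataFrame
tic()
pl$DataFrame(fake_data)$
  filter(pl$col("region") == "Bretagne")$
  select(pl$col(c("region", "priority", "age")))
toc()

# Round 1b - the same query in lazy mode: planned first, optimized at collect()
tic()
pl$DataFrame(fake_data)$lazy()$
  filter(pl$col("region") == "Bretagne")$
  select(pl$col(c("region", "priority", "age")))$
  collect()
toc()
```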
We can clearly see that we save a lot of time when executing the lazy version of the code!
4.3.3 From a parquet file
The read_parquet() method has not (yet) been implemented in the R Polars package, so for this fight we will use arrow::read_parquet() combined with {dplyr} syntax, which will compete with pl$scan_parquet().
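A sketch of the two competitors (same hypothetical filter and columns as in the first round; timings omitted):

```r
# Challenger: read the whole parquet file into R with arrow, then use dplyr
arrow::read_parquet("Datasets/fakir_file.parquet") |>
  dplyr::filter(region == "Bretagne") |>
  dplyr::select(region, priority, age)

# Polars lazy: scan the file so that predicate and projection pushdown
# happen inside the parquet reader itself
pl$scan_parquet("Datasets/fakir_file.parquet")$
  filter(pl$col("region") == "Bretagne")$
  select(pl$col(c("region", "priority", "age")))$
  collect()
```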