4  Lazy execution

Below is the code used to create the fake data needed to compare the eager and lazy modes in this document.
Click to expand it! πŸ‘‡

Code
library(polars)
library(arrow)
library(dplyr)
library(fakir)
library(tictoc)

# Creation the "Datasets" folder
dir.create(normalizePath("Datasets"))

# Creation of large example R data.frame
fake_data <- fake_ticket_client(vol = 100000)

# Creation of large example csv dataset
write.csv(
  x = fake_data,
  file = normalizePath("Datasets/fakir_file.csv"))

# Creation of large example parquet dataset
write_parquet(
  x = fake_data,
  sink = normalizePath("Datasets/fakir_file.parquet"))
Note

This chapter only deals with the lazy execution of Polars. It does not include a comparison with R base, dplyr and data.table

4.1 Introduction to lazy mode

Polars supports two modes of operation: lazy and eager.

Let’s start this chapter by citing the official documentation:

In the eager API the query is executed immediately while in the lazy API the query is only evaluated once it is β€˜needed’. Deferring the execution to the last minute can have significant performance advantages that is why the Lazy API is preferred in most cases. Delaying execution until the last possible moment allows Polars to apply automatic optimization to every query.

As you can see, with lazy mode, you give the engine the chance to analyse what you want to do in order to propose optimal execution (for both reading and transforming datasets). Lazy evaluation is a fairly common method of improving processing speed and is used by Spark, among others.

So far in this book, we have only used the eager mode but fortunately all the syntax we’ve seen applies to lazy mode too. Whatever mode is used, queries will be executed transparently for users.

4.1.1 Creation of a LazyFrame with lazy()

To convert a DataFrame to a LazyFrame we can use the lazy() contructor.

pl$DataFrame(iris)$lazy()
polars LazyFrame
 $describe_optimized_plan() : Show the optimized query plan.

Naive plan:
DF ["Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width"]; PROJECT */5 COLUMNS; SELECTION: "None"

We are no longer working on a DataFrame but on a LazyFrame.

4.1.2 First query passed on LazyFrame

Let’s look at what happens when we request this LazyFrame:

myquery <- pl$DataFrame(iris)$lazy()$filter(
  pl$col("Species") == "setosa"
)$select(
  pl$col(c("Species", "Petal.Length"))
)
myquery
polars LazyFrame
 $describe_optimized_plan() : Show the optimized query plan.

Naive plan:
 SELECT [col("Species"), col("Petal.Length")] FROM
  FILTER [(col("Species")) == (String(setosa))] FROM

  DF ["Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width"]; PROJECT */5 COLUMNS; SELECTION: "None"

This way, we can display the naive plan (which means it is an non-optimized plan). Let’s see what it contains for our example:

  • FILTER [(col("Species")) == (Utf8(setosa))] FROM DF ["Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width"] means that once the entire datasets has been read into memory, this DataFrame will be filtered for rows with Species equals to β€œsetosa”;
  • PROJECT */5 COLUMNS selects all 5 of the columns (* is the wildcard meaning all);
  • SELECTION: "None" means no rows will be filtered out.

As indicated in the console, we can use the describe_optimized_plan() method to see the optimized plan.

myquery$describe_optimized_plan()
DF ["Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width"]; PROJECT 2/5 COLUMNS; SELECTION: "[(col(\"Species\")) == (String(setosa))]"

This example shows a simple but surprisingly effective element of query optimisation: projection.

Let’s see what changed in this optimized plan:

  • PROJECT 2/5 COLUMNS selects only 2 columns;
  • SELECTION: "[(col(\"Species\")) == (Utf8(setosa))] means that Polars will apply the filter conditions on the Species column as the datasets is being read.

We can see that Polars has identified that only 2 columns are needed to execute our query which is memory efficient! And Polars did this without me having to select these variables myself (for example with a select method).
The added value of Polars is that it applies some optimizations that I/you might not even have known about. πŸ’ͺ

4.1.3 Execute the plan

To actually execute the plan, we just need to invoke the collect() method.

myquery$collect()
shape: (50, 2)
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Species ┆ Petal.Length β”‚
β”‚ ---     ┆ ---          β”‚
β”‚ cat     ┆ f64          β”‚
β•žβ•β•β•β•β•β•β•β•β•β•ͺ══════════════║
β”‚ setosa  ┆ 1.4          β”‚
β”‚ setosa  ┆ 1.4          β”‚
β”‚ setosa  ┆ 1.3          β”‚
β”‚ setosa  ┆ 1.5          β”‚
β”‚ setosa  ┆ 1.4          β”‚
β”‚ …       ┆ …            β”‚
β”‚ setosa  ┆ 1.4          β”‚
β”‚ setosa  ┆ 1.6          β”‚
β”‚ setosa  ┆ 1.4          β”‚
β”‚ setosa  ┆ 1.5          β”‚
β”‚ setosa  ┆ 1.4          β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

4.2 Lazy vs eager mode comparison

4.2.1 General principles

In this first example we use the eager API:

df <- pl$read_csv("examples/iris.csv")
df_small = df$filter(pl$col("Petal.Length") > 5)
df_agg = df_small$group_by("Species")$agg(pl$col("Petal.Width")$median())
df_agg
shape: (2, 2)
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Species    ┆ Petal.Width β”‚
β”‚ ---        ┆ ---         β”‚
β”‚ str        ┆ f64         β”‚
β•žβ•β•β•β•β•β•β•β•β•β•β•β•β•ͺ═════════════║
β”‚ virginica  ┆ 2.1         β”‚
β”‚ versicolor ┆ 1.6         β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

This example:

  • Read the iris dataset.
  • Filter the dataset based on Petal.Length
  • Calculate the median of the Petal.Width per Species

Every step is executed immediately returning the intermediate results. This can be very wastefull as we might do work or load extra data that is not being used. If we instead used the lazy API and waited on execution untill all the steps are defined then the query planner could perform various optimizations. In this case:

  • Predicate pushdown: Apply filters as early as possible while reading the dataset, thus only reading rows with sepal length greater than 5.
  • Projection pushdown: Select only the columns that are needed while reading the dataset, thus removing the need to load additional columns
Tip

To consult the list of optimisations made by Polars on queries in lazy mode, see this page..

Here is the equivalent code using the lazy API. At the end of the query, don’t forget to use the collect() method to inform Polars that you want to execute it.

pl$scan_csv("examples/iris.csv")$
  filter(
    pl$col("Petal.Length") > 5)$
  group_by("Species")$
  agg(pl$col("Petal.Width")$median())$
  collect() # <- don't forget collect() here!
shape: (2, 2)
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Species    ┆ Petal.Width β”‚
β”‚ ---        ┆ ---         β”‚
β”‚ str        ┆ f64         β”‚
β•žβ•β•β•β•β•β•β•β•β•β•β•β•β•ͺ═════════════║
β”‚ versicolor ┆ 1.6         β”‚
β”‚ virginica  ┆ 2.1         β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
Important

Use lazy execution will signficantly lower the load on memory & CPU thus allowing you to fit bigger datasets in memory and process faster.

The next section will demonstrate this time saving. πŸ‘‡

4.2.2 Limits of lazy mode

There are some operations that cannot be performed in lazy mode (whether in polars or other lazy frameworks such as SQL database). One limitation is that Polars needs to know the column names and dtypes at each step of the query plan.

For example, we can’t pivot() (see here) in lazy mode as the column names are data-dependant following a pivot. Indeed, when you have to pivot() a DataFrame your future columns names cannot be predicted because it depends on what it is actually in your datasets!

When you have to do operations that can be done in lazy mode, the recommandation is: - Running your query in lazy mode as far as possible; - Evaluating this lazy query with collect() when you need a non-lazy method; - Running the non-lazy methods; - Calling lazy() on the output to continue in lazy mode.

Here’s an example:

pl$scan_parquet("Datasets/fakir_file.parquet")$
  # Call collect() because I need to pivot()
  collect()$
  pivot(
    index = "region",
    columns = "priority",
    values = "age",
    aggregate_function = "mean"
  )$
  # Continue in lazy mode
  lazy()$
  select(
    pl$col(c("region","Gold","Platinium"))
  )$
  collect()
shape: (21, 3)
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ region                     ┆ Gold      ┆ Platinium β”‚
β”‚ ---                        ┆ ---       ┆ ---       β”‚
β”‚ str                        ┆ f64       ┆ f64       β”‚
β•žβ•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•ͺ═══════════β•ͺ═══════════║
β”‚ Île-de-France              ┆ 71.209059 ┆ null      β”‚
β”‚ Poitou-Charentes           ┆ 60.0      ┆ 73.172524 β”‚
β”‚ Aquitaine                  ┆ null      ┆ null      β”‚
β”‚ Centre                     ┆ null      ┆ null      β”‚
β”‚ Midi-PyrΓ©nΓ©es              ┆ 54.445931 ┆ null      β”‚
β”‚ …                          ┆ …         ┆ …         β”‚
β”‚ Lorraine                   ┆ null      ┆ 74.0      β”‚
β”‚ Provence-Alpes-CΓ΄te d'Azur ┆ 66.0      ┆ null      β”‚
β”‚ Alsace                     ┆ 40.0      ┆ null      β”‚
β”‚ Bourgogne                  ┆ null      ┆ null      β”‚
β”‚ Haute-Normandie            ┆ null      ┆ null      β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

4.3 Lazy vs eager mode : fight! βš”οΈ

For this fight, we’re going to use a fake dataset with 1 000 000 rows and 25 columns created with the {fakir} package. The code for creating this dataset is available at the beginning of this document.

This fight will take place over 3 rounds :

  1. With an eager query versus a lazy query from a DataFrame
  2. With an eager query versus a lazy query from a csv file
  3. With an eager query versus a lazy query from a parquet file

4.3.1 From a DataFrame

For this first round and as seen above, let’s start with a simple query from a DataFrame:

tic()
#| label: fight-eager_dataframe
pl$DataFrame(fake_data)$select(
    pl$col(c("region","departement","priority"))
  )$
  filter(
    pl$col("region") == "Aquitaine")
shape: (5_862, 3)
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ region    ┆ departement ┆ priority β”‚
β”‚ ---       ┆ ---         ┆ ---      β”‚
β”‚ str       ┆ str         ┆ cat      β”‚
β•žβ•β•β•β•β•β•β•β•β•β•β•β•ͺ═════════════β•ͺ══════════║
β”‚ Aquitaine ┆ Dordogne    ┆ Silver   β”‚
β”‚ Aquitaine ┆ Dordogne    ┆ Silver   β”‚
β”‚ Aquitaine ┆ Dordogne    ┆ Silver   β”‚
β”‚ Aquitaine ┆ Dordogne    ┆ Silver   β”‚
β”‚ Aquitaine ┆ Dordogne    ┆ Silver   β”‚
β”‚ …         ┆ …           ┆ …        β”‚
β”‚ Aquitaine ┆ Dordogne    ┆ Bronze   β”‚
β”‚ Aquitaine ┆ Dordogne    ┆ Bronze   β”‚
β”‚ Aquitaine ┆ Dordogne    ┆ Bronze   β”‚
β”‚ Aquitaine ┆ Landes      ┆ Silver   β”‚
β”‚ Aquitaine ┆ Dordogne    ┆ Bronze   β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
toc()
0.063 sec elapsed

As seen above, we’re going to use the lazy() method to convert a DataFrame to a LazyFrame:

tic()
#| label: fight-lazy_lazyframe
pl$DataFrame(fake_data)$lazy()$
  select(
    pl$col(c("region","departement","priority"))
  )$
  filter(
    pl$col("region") == "Aquitaine")$
  collect() # don't forget collect() here!
shape: (5_862, 3)
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ region    ┆ departement ┆ priority β”‚
β”‚ ---       ┆ ---         ┆ ---      β”‚
β”‚ str       ┆ str         ┆ cat      β”‚
β•žβ•β•β•β•β•β•β•β•β•β•β•β•ͺ═════════════β•ͺ══════════║
β”‚ Aquitaine ┆ Dordogne    ┆ Silver   β”‚
β”‚ Aquitaine ┆ Dordogne    ┆ Silver   β”‚
β”‚ Aquitaine ┆ Dordogne    ┆ Silver   β”‚
β”‚ Aquitaine ┆ Dordogne    ┆ Silver   β”‚
β”‚ Aquitaine ┆ Dordogne    ┆ Silver   β”‚
β”‚ …         ┆ …           ┆ …        β”‚
β”‚ Aquitaine ┆ Dordogne    ┆ Bronze   β”‚
β”‚ Aquitaine ┆ Dordogne    ┆ Bronze   β”‚
β”‚ Aquitaine ┆ Dordogne    ┆ Bronze   β”‚
β”‚ Aquitaine ┆ Landes      ┆ Silver   β”‚
β”‚ Aquitaine ┆ Dordogne    ┆ Bronze   β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
toc()
0.06 sec elapsed

4.3.2 From a csv file

Now, the eager mode is represented here by the read_csv() method…

tic()
#| label: fight-eager_read_csv
pl$read_csv("Datasets/fakir_file.csv", infer_schema_length=0)$
  select(
    pl$col(c("region","departement","priority","age")))$
  with_columns(
    pl$col("age")$cast(pl$Int32,strict = FALSE))$
  filter(
    pl$col("region") == "Bretagne")$
  group_by("departement","priority")$
  agg(pl$col("age")$mean())
shape: (4, 3)
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ departement   ┆ priority ┆ age       β”‚
β”‚ ---           ┆ ---      ┆ ---       β”‚
β”‚ str           ┆ str      ┆ f64       β”‚
β•žβ•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•ͺ══════════β•ͺ═══════════║
β”‚ Morbihan      ┆ Bronze   ┆ 18.832957 β”‚
β”‚ CΓ΄tes-d'Armor ┆ Bronze   ┆ 23.0      β”‚
β”‚ CΓ΄tes-d'Armor ┆ Silver   ┆ 61.0      β”‚
β”‚ FinistΓ¨re     ┆ Silver   ┆ 23.0      β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
toc()
0.073 sec elapsed

… while the lazy method is represented by the pl$scan_csv():

tic()
pl$scan_csv("Datasets/fakir_file.csv", infer_schema_length=0)$
  select(
    pl$col(c("region","departement","priority","age")))$
  with_columns(
    pl$col("age")$cast(pl$Int32,strict = FALSE))$
  filter(
    pl$col("region") == "Bretagne")$
  group_by("departement","priority")$
  agg(pl$col("age")$mean())$
  collect()
shape: (4, 3)
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ departement   ┆ priority ┆ age       β”‚
β”‚ ---           ┆ ---      ┆ ---       β”‚
β”‚ str           ┆ str      ┆ f64       β”‚
β•žβ•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•ͺ══════════β•ͺ═══════════║
β”‚ CΓ΄tes-d'Armor ┆ Silver   ┆ 61.0      β”‚
β”‚ CΓ΄tes-d'Armor ┆ Bronze   ┆ 23.0      β”‚
β”‚ Morbihan      ┆ Bronze   ┆ 18.832957 β”‚
β”‚ FinistΓ¨re     ┆ Silver   ┆ 23.0      β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
toc()
0.026 sec elapsed

We can clearly see that we save a lot of time when executing the lazy version of the code!

4.3.3 From a parquet file

The read_parquet() method has not been implemented in the R Polars package, but for this fight we will use arrow::read_parquet() and {dplyr} syntax, which will compete with pl$scan_parquet().

tic()
arrow::read_parquet("Datasets/fakir_file.parquet", as_data_frame = FALSE) |>
  filter(region == "Bretagne") |>
  group_by(departement,priority) |>
  summarise(mymean=mean(age, na.rm = TRUE)) |>
  arrange(departement) |>
  collect()
# A tibble: 4 Γ— 3
# Groups:   departement [3]
  departement   priority mymean
  <chr>         <fct>     <dbl>
1 CΓ΄tes-d'Armor Bronze     23  
2 CΓ΄tes-d'Armor Silver     61  
3 Finistère     Silver     23  
4 Morbihan      Bronze     18.8
toc()
0.182 sec elapsed
tic()
pl$scan_parquet("Datasets/fakir_file.parquet")$
  filter(
    pl$col("region") == "Bretagne")$
  group_by(c("departement","priority"))$
  agg(
    pl$col(c("age"))$mean()
)$sort("departement")$
  collect()
shape: (4, 3)
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ departement   ┆ priority ┆ age       β”‚
β”‚ ---           ┆ ---      ┆ ---       β”‚
β”‚ str           ┆ cat      ┆ f64       β”‚
β•žβ•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•ͺ══════════β•ͺ═══════════║
β”‚ CΓ΄tes-d'Armor ┆ Silver   ┆ 61.0      β”‚
β”‚ CΓ΄tes-d'Armor ┆ Bronze   ┆ 23.0      β”‚
β”‚ FinistΓ¨re     ┆ Silver   ┆ 23.0      β”‚
β”‚ Morbihan      ┆ Bronze   ┆ 18.832957 β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
toc()
0.011 sec elapsed

And it’s another victory for the lazy execution!

Important

Note that the {arrow} package also have ability to scan parquet files in a lazy way with the arrow::open_dataset function.

tic()
arrow::open_dataset("Datasets/fakir_file.parquet") |>
  filter(region == "Bretagne") |>
  group_by(departement,priority) |>
  summarise(mymean=mean(age, na.rm = TRUE)) |>
  arrange(departement) |>
  collect()
# A tibble: 4 Γ— 3
# Groups:   departement [3]
  departement   priority mymean
  <chr>         <fct>     <dbl>
1 CΓ΄tes-d'Armor Bronze     23  
2 CΓ΄tes-d'Armor Silver     61  
3 Finistère     Silver     23  
4 Morbihan      Bronze     18.8
toc()
0.09 sec elapsed