Below is the code used to create the fake data needed to compare the eager and lazy modes in this document.
```r
library(polars)
library(arrow)
library(dplyr)
library(fakir)
library(tictoc)

# Creation of the "Datasets" folder
dir.create(normalizePath("Datasets"))

# Creation of a large example R data.frame
fake_data <- fake_ticket_client(vol = 100000)

# Creation of a large example csv dataset
write.csv(x = fake_data, file = normalizePath("Datasets/fakir_file.csv"))

# Creation of a large example parquet dataset
write_parquet(x = fake_data, sink = normalizePath("Datasets/fakir_file.parquet"))
```
Note
This chapter only deals with the lazy execution of Polars. It does not include a comparison with base R, dplyr, and data.table.
4.1 Introduction to lazy mode
Polars supports two modes of operation: lazy and eager.
Let's start this chapter by citing the official documentation:
In the eager API the query is executed immediately while in the lazy API the query is only evaluated once it is "needed". Deferring the execution to the last minute can have significant performance advantages; that is why the Lazy API is preferred in most cases. Delaying execution until the last possible moment allows Polars to apply automatic optimization to every query.
As you can see, with lazy mode you give the engine a chance to analyse what you want to do and propose an optimal execution plan (for both reading and transforming datasets). Lazy evaluation is a fairly common way of improving processing speed and is used by Spark, among others.
So far in this book we have only used the eager mode, but fortunately all the syntax we've seen applies to lazy mode too. Whichever mode is used, queries are executed transparently for users.
4.1.1 Creation of a LazyFrame with lazy()
To convert a DataFrame to a LazyFrame we can use the lazy() method.
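For example, here is a minimal sketch using the built-in iris dataset; this query matches the plan printed below:

```r
# Convert an R data.frame to a polars DataFrame, then make it lazy
lf <- pl$DataFrame(iris)$lazy()$
  filter(pl$col("Species") == "setosa")$
  select(pl$col(c("Species", "Petal.Length")))

# Printing a LazyFrame displays its naive query plan
lf
```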
```
polars LazyFrame
 $describe_optimized_plan() : Show the optimized query plan.

Naive plan:
 SELECT [col("Species"), col("Petal.Length")] FROM
  FILTER [(col("Species")) == (String(setosa))] FROM
   DF ["Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width"]; PROJECT */5 COLUMNS; SELECTION: "None"
```
This way, we can display the naive plan (that is, the non-optimized plan). Let's see what it contains for our example:
- FILTER [(col("Species")) == (String(setosa))] FROM DF ["Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width"] means that once the entire dataset has been read into memory, this DataFrame will be filtered for rows where Species equals "setosa";
- PROJECT */5 COLUMNS selects all 5 columns (* is the wildcard meaning all);
- SELECTION: "None" means no rows will be filtered out while reading.
As indicated in the console, we can use the describe_optimized_plan() method to see the optimized plan.
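Continuing the sketch above (the output shown here is reconstructed from the elements discussed next, so the exact formatting may differ across polars versions):

```r
lf$describe_optimized_plan()
#>  SELECT [col("Species"), col("Petal.Length")] FROM
#>   DF ["Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width"];
#>   PROJECT 2/5 COLUMNS; SELECTION: "[(col(\"Species\")) == (String(setosa))]"
```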
This example shows a simple but surprisingly effective element of query optimisation: projection.
Let's see what changed in this optimized plan:

- PROJECT 2/5 COLUMNS selects only 2 of the 5 columns;
- SELECTION: "[(col(\"Species\")) == (String(setosa))]" means that Polars will apply the filter condition on the Species column as the dataset is being read.
We can see that Polars has identified that only 2 columns are needed to execute our query, which is memory efficient! And Polars did this without me having to select these variables myself (for example with a select() method). The added value of Polars is that it applies some optimizations that I/you might not even have known about. 💪
4.1.3 Execute the plan
To actually execute the plan, we just need to invoke the collect() method.
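With the iris sketch from above, this looks like:

```r
# Run the optimized plan and materialize the result as a DataFrame
lf$collect()
```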
Consider another example: calculating the median of Petal.Width per Species.
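In eager mode this might look like the following sketch (the original snippet is not shown here; on older polars versions the grouping method is spelled groupby() instead of group_by()):

```r
# Eager mode: each step runs immediately and materializes a result
df <- pl$DataFrame(iris)
df_filtered <- df$filter(pl$col("Sepal.Length") > 5)
df_agg <- df_filtered$group_by("Species")$agg(
  pl$col("Petal.Width")$median()
)
df_agg
```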
Every step is executed immediately, returning the intermediate results. This can be very wasteful, as we might do work or load extra data that is not being used. If we instead use the lazy API and wait to execute until all the steps are defined, the query planner can perform various optimizations. In this case:
- Predicate pushdown: apply filters as early as possible while reading the dataset, thus only reading rows with a sepal length greater than 5;
- Projection pushdown: select only the columns that are needed while reading the dataset, thus removing the need to load additional columns.
Tip
To consult the list of optimisations made by Polars on queries in lazy mode, see this page.
Here is the equivalent code using the lazy API. At the end of the query, don't forget to use the collect() method to inform Polars that you want to execute it.
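A sketch of that lazy version, assuming the iris data has been written to a csv file (the file name here is hypothetical):

```r
# Lazy mode: scan_csv() only plans the read; nothing runs until collect()
pl$scan_csv("Datasets/iris.csv")$
  filter(pl$col("Sepal.Length") > 5)$
  group_by("Species")$
  agg(pl$col("Petal.Width")$median())$
  collect()
```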
Using lazy execution will significantly lower the load on memory and CPU, allowing you to fit bigger datasets in memory and process them faster.
The next section will demonstrate this time saving.
4.2.2 Limits of lazy mode
There are some operations that cannot be performed in lazy mode (whether in Polars or in other lazy frameworks, such as SQL databases). One limitation is that Polars needs to know the column names and dtypes at each step of the query plan.
For example, we can't pivot() (see here) in lazy mode, as the column names are data-dependent after a pivot. Indeed, when you pivot() a DataFrame, the future column names cannot be predicted because they depend on what is actually in your dataset!
When you have to perform operations that cannot be done in lazy mode, the recommendation is:

- run your query in lazy mode as far as possible;
- evaluate this lazy query with collect() when you need a non-lazy method;
- run the non-lazy methods;
- call lazy() on the output to continue in lazy mode.
Here's an example:
```r
pl$scan_parquet("Datasets/fakir_file.parquet")$
  # Call collect() because I need to pivot()
  collect()$
  pivot(
    index = "region",
    columns = "priority",
    values = "age",
    aggregate_function = "mean"
  )$
  # Continue in lazy mode
  lazy()$
  select(
    pl$col(c("region", "Gold", "Platinium"))
  )$
  collect()
```
For this fight, we're going to use a fake dataset with 1 000 000 rows and 25 columns created with the {fakir} package. The code for creating this dataset is available at the beginning of this document.
This fight will take place over 3 rounds:

- with an eager query versus a lazy query from a DataFrame;
- with an eager query versus a lazy query from a csv file;
- with an eager query versus a lazy query from a parquet file.
4.3.1 From a DataFrame
For this first round, and as seen above, let's start with a simple query from a DataFrame:
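Here is a sketch of what this comparison could look like, timed with {tictoc}. The fake_data object comes from the setup code at the top; the filter value "Bretagne" and the selected columns (region, priority, age, reused from the pivot example above) are illustrative assumptions:

```r
# Round 1a - eager: each step runs immediately on the in-memory DataFrame
tic()
pl$DataFrame(fake_data)$
  filter(pl$col("region") == "Bretagne")$
  select(pl$col(c("region", "priority", "age")))
toc()

# Round 1b - the same query in lazy mode: planned first, optimized at collect()
tic()
pl$DataFrame(fake_data)$lazy()$
  filter(pl$col("region") == "Bretagne")$
  select(pl$col(c("region", "priority", "age")))$
  collect()
toc()
```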
We can clearly see that we save a lot of time when executing the lazy version of the code!
4.3.3 From a parquet file
The read_parquet() method has not (yet) been implemented in the R Polars package, so for this fight we will use arrow::read_parquet() combined with {dplyr} syntax, which will compete with pl$scan_parquet().
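A sketch of the two competitors (same hypothetical filter and columns as in the first round; timings omitted):

```r
# Challenger: read the whole parquet file into R with arrow, then use dplyr
arrow::read_parquet("Datasets/fakir_file.parquet") |>
  dplyr::filter(region == "Bretagne") |>
  dplyr::select(region, priority, age)

# Polars lazy: scan the file so that predicate and projection pushdown
# happen inside the parquet reader itself
pl$scan_parquet("Datasets/fakir_file.parquet")$
  filter(pl$col("region") == "Bretagne")$
  select(pl$col(c("region", "priority", "age")))$
  collect()
```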