1  First steps with Polars

First of all, we need to install all the packages and create a big random dataset needed for this book to work, so you don’t need to study the following code in detail:

# Installation of packages for cookbook-rpolars
packages <- c('dplyr','data.table','tidyr','arrow','DBI','fakir','tictoc','duckdb','microbenchmark','readr','fs','ggplot2','pryr','dbplyr','forcats','collapse')
installed_packages <- packages %in% rownames(installed.packages())
if (any(installed_packages == FALSE)) {
  install.packages(packages[!installed_packages], dependencies = TRUE)
}

# Loading packages
invisible(lapply(packages, library, character.only = TRUE))

# Creation of iris_dt
iris_dt <- as.data.table(iris)

1.1 Installation

Until the R polars package is uploaded to CRAN, the polars package development team offers several solutions for installation.

In my opinion, the most practical one at the moment is to install from R-universe:

install.packages("polars", repos = "https://rpolars.r-universe.dev")
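
If you rerun this setup regularly, you may prefer to install polars only when it is missing. The following is a minimal sketch that reuses the R-universe repository from the command above; the variable name `polars_repo` is introduced here purely for illustration.

```r
# Repository taken from the install command above; the variable name is illustrative
polars_repo <- "https://rpolars.r-universe.dev"

# Install polars only if it is not already available in the library
if (!requireNamespace("polars", quietly = TRUE)) {
  install.packages("polars", repos = polars_repo)
}

# Report the installed version
packageVersion("polars")
```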

To check which version of the polars package you have just installed and which features are enabled, use the polars_info() function.

library(polars)

polars_info()
Polars R package version : 0.16.2
Rust Polars crate version: 0.39.2

Thread pool size: 4 

Features:                               
default                    TRUE
full_features              TRUE
disable_limit_max_threads  TRUE
nightly                    TRUE
sql                        TRUE
rpolars_debug_print       FALSE

Code completion: deactivated 

1.2 First glimpse

Polars’ main functions are stored in the “pl” namespace and can be accessed using the “pl$” prefix to prevent conflicts with other packages and base R function names. For more, see here.

1.2.1 Convert an R data.frame to a Polars DataFrame

As a first example, let’s convert the most famous R data frame (iris) to a Polars DataFrame:

iris_polars <- pl$DataFrame(iris)
iris_polars
shape: (150, 5)
┌──────────────┬─────────────┬──────────────┬─────────────┬───────────┐
│ Sepal.Length ┆ Sepal.Width ┆ Petal.Length ┆ Petal.Width ┆ Species   │
│ ---          ┆ ---         ┆ ---          ┆ ---         ┆ ---       │
│ f64          ┆ f64         ┆ f64          ┆ f64         ┆ cat       │
╞══════════════╪═════════════╪══════════════╪═════════════╪═══════════╡
│ 5.1          ┆ 3.5         ┆ 1.4          ┆ 0.2         ┆ setosa    │
│ 4.9          ┆ 3.0         ┆ 1.4          ┆ 0.2         ┆ setosa    │
│ 4.7          ┆ 3.2         ┆ 1.3          ┆ 0.2         ┆ setosa    │
│ 4.6          ┆ 3.1         ┆ 1.5          ┆ 0.2         ┆ setosa    │
│ 5.0          ┆ 3.6         ┆ 1.4          ┆ 0.2         ┆ setosa    │
│ …            ┆ …           ┆ …            ┆ …           ┆ …         │
│ 6.7          ┆ 3.0         ┆ 5.2          ┆ 2.3         ┆ virginica │
│ 6.3          ┆ 2.5         ┆ 5.0          ┆ 1.9         ┆ virginica │
│ 6.5          ┆ 3.0         ┆ 5.2          ┆ 2.0         ┆ virginica │
│ 6.2          ┆ 3.4         ┆ 5.4          ┆ 2.3         ┆ virginica │
│ 5.9          ┆ 3.0         ┆ 5.1          ┆ 1.8         ┆ virginica │
└──────────────┴─────────────┴──────────────┴─────────────┴───────────┘
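
The conversion also works in the other direction. The sketch below assumes the `as_polars_df()` converter and the `as.data.frame()` method shipped with the package version shown above; `iris_back` is a name chosen here for illustration.

```r
library(polars)

# Convert an R data.frame to a Polars DataFrame with the generic converter
iris_polars <- as_polars_df(iris)

# Convert the Polars DataFrame back to a base R data.frame
iris_back <- as.data.frame(iris_polars)

class(iris_back)
dim(iris_back)
```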

1.2.2 Count the number of rows

For example, to count the number of rows of the iris data frame:

# With pl$ prefix
pl$DataFrame(iris)$height
[1] 150
# Using iris_polars
iris_polars$height
[1] 150
# With base R
nrow(iris)
[1] 150
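
Alongside `$height`, a DataFrame exposes its other dimensions in the same way. This is a small sketch, assuming the `$width` and `$shape` fields and the `dim()` method behave as in the package version shown above.

```r
library(polars)

iris_polars <- pl$DataFrame(iris)

# Number of columns
iris_polars$width

# Number of rows and columns at once
iris_polars$shape

# Base R's dim() also works on a Polars DataFrame
dim(iris_polars)
```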

1.2.3 Extract data from a DataFrame

To select the first 5 rows of iris and the Petal.Length and Species columns, the syntax is identical in Polars and base R. The dplyr and data.table equivalents follow for comparison:

iris_polars[1:5, c("Petal.Length", "Species")]
shape: (5, 2)
┌──────────────┬─────────┐
│ Petal.Length ┆ Species │
│ ---          ┆ ---     │
│ f64          ┆ cat     │
╞══════════════╪═════════╡
│ 1.4          ┆ setosa  │
│ 1.4          ┆ setosa  │
│ 1.3          ┆ setosa  │
│ 1.5          ┆ setosa  │
│ 1.4          ┆ setosa  │
└──────────────┴─────────┘
# With base R
iris[1:5, c("Petal.Length", "Species")]
  Petal.Length Species
1          1.4  setosa
2          1.4  setosa
3          1.3  setosa
4          1.5  setosa
5          1.4  setosa
# With dplyr
iris |> 
  slice(1:5) |> 
  select(Petal.Length,Species)
  Petal.Length Species
1          1.4  setosa
2          1.4  setosa
3          1.3  setosa
4          1.5  setosa
5          1.4  setosa
# With data.table
iris_dt[1:5, .(Petal.Length, Species)]
   Petal.Length Species
          <num>  <fctr>
1:          1.4  setosa
2:          1.4  setosa
3:          1.3  setosa
4:          1.5  setosa
5:          1.4  setosa
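
The same extraction can also be written with Polars' own methods instead of square brackets. This is only a sketch of one possible spelling, assuming `$head()`, `$select()` and `pl$col()` as provided by the package; `first_rows` is an illustrative name.

```r
library(polars)

iris_polars <- pl$DataFrame(iris)

# Keep the first 5 rows, then select two columns by name
first_rows <- iris_polars$head(5)$select(pl$col("Petal.Length", "Species"))
first_rows
```

Chaining methods this way is the style used by Polars expressions, introduced later in this chapter.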

1.3 Data Structures

The core data structures provided by Polars are Series and DataFrames.

1.3.1 Series and vectors

Important

Series are a 1-dimensional data structure. Within a series all elements have the same Data Type.

Among Polars objects, a Series is the counterpart of an R vector.
To create a Polars Series from scratch:

# Arguments are named because pl$Series() changes the order of unnamed arguments as of 0.17.0
mynumbers_serie <- pl$Series(values = 1:3)
mynumbers_serie
polars Series: shape: (3,)
Series: '' [i32]
[
    1
    2
    3
]
myletters_serie <- pl$Series(values = c("a","b","c"))
myletters_serie
polars Series: shape: (3,)
Series: '' [str]
[
    "a"
    "b"
    "c"
]
# To name a Series
pl$Series(name = "myletters", values = c("a","b","c"))
polars Series: shape: (3,)
Series: 'myletters' [str]
[
    "a"
    "b"
    "c"
]
# The equivalent R vectors
mynumbers_vector <- 1:3
mynumbers_vector
[1] 1 2 3
myletters_vector <- c("a","b","c")
myletters_vector
[1] "a" "b" "c"
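
To make the parallel with R vectors concrete, a Series can be inspected and converted back to a vector. The sketch below assumes `as_polars_series()`, the `$name` and `$dtype` fields, and the `length()` and `as.vector()` methods of the package version shown above; `mynumbers_back` is an illustrative name.

```r
library(polars)

# Build a named Series from an R vector
mynumbers_serie <- as_polars_series(1:3, name = "mynumbers")

# Inspect the name, the Polars data type and the length
mynumbers_serie$name
mynumbers_serie$dtype
length(mynumbers_serie)

# Convert the Series back to a plain R vector
mynumbers_back <- as.vector(mynumbers_serie)
identical(mynumbers_back, 1:3)
```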

1.3.2 DataFrame and data.frame

Note

A DataFrame is a 2-dimensional data structure that is backed by a Series, and it can be seen as an abstraction of a collection (e.g. list) of Series.

Among Polars objects, a DataFrame is the counterpart of an R data.frame and is close to a tibble or a data.table. A DataFrame has several attributes; see here to learn how to use them.

To create a Polars DataFrame from scratch:

# Creation of a DataFrame object with Series
mydf <- pl$DataFrame(
  col1 = mynumbers_serie,
  col2 = myletters_serie
)
# Creation of a DataFrame object with Series and vectors
pl$DataFrame(
  col1 = mynumbers_serie,
  col2 = myletters_vector
)
shape: (3, 2)
┌──────┬──────┐
│ col1 ┆ col2 │
│ ---  ┆ ---  │
│ i32  ┆ str  │
╞══════╪══════╡
│ 1    ┆ a    │
│ 2    ┆ b    │
│ 3    ┆ c    │
└──────┴──────┘
# With base R
data.frame(
  col1 = mynumbers_vector,
  col2 = myletters_vector
)
  col1 col2
1    1    a
2    2    b
3    3    c
# With tibble
tibble(
  col1 = mynumbers_vector,
  col2 = myletters_vector
)
# A tibble: 3 × 2
   col1 col2 
  <int> <chr>
1     1 a    
2     2 b    
3     3 c    
# With data.table
data.table(
  col1 = mynumbers_vector,
  col2 = myletters_vector
)
    col1   col2
   <int> <char>
1:     1      a
2:     2      b
3:     3      c
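
Once a DataFrame exists, its column names and data types can be checked without printing the whole table. This is a minimal sketch, assuming the `$columns` and `$schema` fields of the package version shown above; the DataFrame is rebuilt from plain vectors so the example stands alone.

```r
library(polars)

mydf <- pl$DataFrame(
  col1 = 1:3,
  col2 = c("a", "b", "c")
)

# Column names
mydf$columns

# Column names together with their Polars data types
mydf$schema
```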

1.3.2.1 Missing values

As in arrow, missing data is represented in Polars with a null value. This null value applies to all data types, including numerical ones.

You can define a missing value manually with R’s NA value:

pl$DataFrame(
  col1 = pl$Series(values = c(NA,"b","c"))
)
shape: (3, 1)
┌──────┐
│ col1 │
│ ---  │
│ str  │
╞══════╡
│ null │
│ b    │
│ c    │
└──────┘
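
As a small illustration of null handling, the sketch below counts the nulls in each column and replaces them in one column. It assumes the `$null_count()`, `$with_columns()` and `$fill_null()` methods of the package; the column names and the `"missing"` placeholder are chosen here for illustration.

```r
library(polars)

# NA becomes null whatever the column type
df_na <- pl$DataFrame(
  letters = c(NA, "b", "c"),
  numbers = c(1.5, NA, 3)
)

# Count the nulls in each column
df_na$null_count()

# Replace the nulls of one column with a default value
df_filled <- df_na$with_columns(
  pl$col("letters")$fill_null(pl$lit("missing"))
)
df_filled
```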

To learn more about dealing with missing values in polars, see here.

1.4 Manipulation of Series and DataFrames with R standard functions

Series and DataFrames can be manipulated with a lot of standard R functions.
Some examples with Series:

sum(mynumbers_serie)
[1] 6
paste(myletters_serie,collapse = "")
[1] "abc"

Some examples with DataFrames:

names(mydf)
[1] "col1" "col2"
ncol(mydf)
[1] 2
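
A few more base R generics can be tried in the same spirit. This sketch assumes that `length()`, `dim()`, `nrow()` and `head()` have methods for Polars objects in the package version shown above; other versions may differ.

```r
library(polars)

mynumbers_serie <- as_polars_series(1:3)
mydf <- pl$DataFrame(col1 = 1:3, col2 = c("a", "b", "c"))

# Length of a Series
length(mynumbers_serie)

# Dimensions of a DataFrame
dim(mydf)
nrow(mydf)

# First rows of a DataFrame
head(mydf, 2)
```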

1.5 Expressions

Here I’m quoting what Damian Skrzypiec said in his blog about Polars expressions:

One of fundamental building blocks in Polars are Polars expressions. In general Polars expression is any function that transforms Polars series into another Polars series. There are few advantageous aspects of Polars expressions. Firstly expressions are optimized. Particularly if expression need to be executed on multiple columns, then it will be parallelized. It’s one of reasons behind Polars high performance. Another aspect is the fact the Polars implements an extensive set of builtin expressions that user can compose (chain) into more complex expressions.

This is what a Polars expression looks like:

pl$col("Petal.Length")$round(decimals = 0)$alias("Petal.Length.rounded")

Which means that:

- Select column “Petal.Length”
- Then round the column with 0 decimals
- Then rename the column “Petal.Length.rounded”
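
On its own, an expression only describes a computation; it produces values once it is evaluated in a DataFrame context. The sketch below evaluates the expression above on the iris data with `$with_columns()`; `rounded_expr` and `iris_rounded` are names introduced here for illustration.

```r
library(polars)

iris_polars <- pl$DataFrame(iris)

# Build the expression; nothing is computed at this point
rounded_expr <- pl$col("Petal.Length")$round(decimals = 0)$alias("Petal.Length.rounded")

# Evaluate it in a DataFrame context: the result is added as a new column
iris_rounded <- iris_polars$with_columns(rounded_expr)
iris_rounded$columns
```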

Tip

Every expression produces a new expression, and expressions can be chained together.

For example:

pl$col("bar")$filter(pl$col("foo") == 1)$sum()
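
To see this chained expression at work, here is a minimal sketch on a toy DataFrame. The data are made up for illustration; only the column names foo and bar come from the expression above.

```r
library(polars)

# Toy data; the column names match the expression above
df <- pl$DataFrame(
  foo = c(1, 1, 2),
  bar = c(10, 20, 30)
)

# Sum "bar" over the rows where "foo" equals 1
res <- df$select(
  pl$col("bar")$filter(pl$col("foo") == 1)$sum()
)
res
```

With this toy data, the result is a one-row DataFrame whose bar value is 30 (10 + 20).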

To learn more about Polars expressions, see the official documentation.


If you have read this far and managed to reproduce the examples, congratulations! You are ready to dive into the deep end of Polars with R in the next parts of this cookbook! 🚀

1.6 DataFrames display on Windows

This section is for Windows and RStudio users only!

As a Windows and RStudio user, you may encounter a problem with the display of Polars DataFrames.

Here’s what can happen with Lucida Console, the default font in RStudio:

Displaying the mtcars DataFrame with Lucida Console font.

To resolve this display problem, I recommend using the Cascadia font:

Displaying the mtcars DataFrame with Cascadia font.