2  Data manipulation

Below is the code used to create the fake data needed to run code in this document. Click to expand it! πŸ‘‡

Code
library(polars)
library(dplyr)
library(tidyr)
library(data.table)
library(arrow)
library(microbenchmark)
iris_dt <- as.data.table(iris)

# Creation of Series
mynumbers_serie <- pl$Series(1:3)
myletters_serie <- pl$Series(c("a","b","c"))

# Creation of mydf DataFrame 
mydf <- pl$DataFrame(
  col1 = mynumbers_serie,
  col2 = myletters_serie
)

2.1 Introduction to methods

The added value of Polars consists in the methods. Those powerful methods are accessed using the $ operator.

  • For Series, see this section for the methods available in {polars}.
  • For DataFrames, see this section for the methods available in {polars}.

Some examples with Series:

# To get a sum
mynumbers_serie$sum()
[1] 6
# To sort
mynumbers_serie$sort()
polars Series: shape: (3,)
Series: '' [i32]
[
    1
    2
    3
]

Some examples with DataFrame:

# To get a character vector of column names
mydf$columns
[1] "col1" "col2"
# To get dimensions of DataFrame
mydf$shape
[1] 3 2
# We can mix standard R functions and methods
length(mydf$columns)
[1] 2

Polars includes a very useful chaining method in data manipulation operations. From this point of view, Polars is more like dplyr and data.table. This is how the chaining method is defined in the official documentation:

In polars our method chaining syntax takes the form object$m1()$m2(), where object is our data object, and m1() and m2() are appropriate methods, like subsetting or aggregation expressions.

Let’s see an example with the iris dataset:

pl$DataFrame(iris)$
  group_by(
  "Species")$
  median()
shape: (3, 5)
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Species    ┆ Sepal.Length ┆ Sepal.Width ┆ Petal.Length ┆ Petal.Width β”‚
β”‚ ---        ┆ ---          ┆ ---         ┆ ---          ┆ ---         β”‚
β”‚ cat        ┆ f64          ┆ f64         ┆ f64          ┆ f64         β”‚
β•žβ•β•β•β•β•β•β•β•β•β•β•β•β•ͺ══════════════β•ͺ═════════════β•ͺ══════════════β•ͺ═════════════║
β”‚ virginica  ┆ 6.5          ┆ 3.0         ┆ 5.55         ┆ 2.0         β”‚
β”‚ versicolor ┆ 5.9          ┆ 2.8         ┆ 4.35         ┆ 1.3         β”‚
β”‚ setosa     ┆ 5.0          ┆ 3.4         ┆ 1.5          ┆ 0.2         β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
aggregate(. ~ Species, iris, median)
     Species Sepal.Length Sepal.Width Petal.Length Petal.Width
1     setosa          5.0         3.4         1.50         0.2
2 versicolor          5.9         2.8         4.35         1.3
3  virginica          6.5         3.0         5.55         2.0
iris |>
  group_by(Species) |>
  summarise(across(everything(),median))
# A tibble: 3 Γ— 5
  Species    Sepal.Length Sepal.Width Petal.Length Petal.Width
  <fct>             <dbl>       <dbl>        <dbl>       <dbl>
1 setosa              5           3.4         1.5          0.2
2 versicolor          5.9         2.8         4.35         1.3
3 virginica           6.5         3           5.55         2  
iris_dt[, lapply(.SD, median), by = Species]
      Species Sepal.Length Sepal.Width Petal.Length Petal.Width
       <fctr>        <num>       <num>        <num>       <num>
1:     setosa          5.0         3.4         1.50         0.2
2: versicolor          5.9         2.8         4.35         1.3
3:  virginica          6.5         3.0         5.55         2.0
Tip

In the Polars code used above, you will notice that we have introduced line breaks. We could have written the whole code on the same line but for the sake of readability I prefer to separate the methods used by line breaks.

2.2 Conversion between Series/DataFrames and vector/data.frames

2.2.1 From vector/data.frames to Series/DataFrames

These conversions have already been seen earlier in this book.

# To convert vector to Polars Series
myvector <- pl$Series(c("a","b","c"))
Warning in pl$Series(c("a", "b", "c")): `pl$Series()` will handle unnamed arguments differently as of 0.17.0:
- until 0.17.0, the first argument corresponds to the values and the second argument to the name of the Series.
- as of 0.17.0, the first argument will correspond to the name and the second argument to the values.
Use named arguments in `pl$Series()` or replace `pl$Series(<values>, <name>)` by `as_polars_series(<values>, <name>)` to silence this warning.
# To convert data.frames to DataFrames
iris_polars <- pl$DataFrame(iris)

2.2.2 From Series/DataFrames to vector/data.frames

Here, we can use to_r() and to_data_frame() methods.

# To convert Polars Series to vector or list
myletters_serie$to_r()
[1] "a" "b" "c"
# To convert DataFrames to data.frames
mydf$to_data_frame()
  col1 col2
1    1    a
2    2    b
3    3    c

2.3 Initial informations on a DataFrame

Here is a list of instructions that I frequently use to quickly get information about a DataFrame.

2.3.1 Get the schema

A DataFrame has a schema attribute which is a named list of DataTypes.

mydf$schema
$col1
DataType: Int32

$col2
DataType: String
# This works also on LazyFrame
mydf$lazy()$schema
$col1
DataType: Int32

$col2
DataType: String

2.3.2 Get the dimensions

A DataFrame has a shape attribute which is a two length numeric vector of c(nrows,ncols).

mydf$shape
[1] 3 2

2.3.3 Get columns types

A DataFrame has a dtypes attribute which is a list of dtypes (for data type) for every column of the DataFrame.

Alternatively, the dtype_strings() method can be used to get columns types in a character/string vector.

# With dtypes attribute (with a "s" and without parentheses)
mydf$dtypes
[[1]]
DataType: Int32

[[2]]
DataType: String
# This works also on LazyFrame
mydf$lazy()$dtypes
[[1]]
DataType: Int32

[[2]]
DataType: String
# With dtype_strings() method (wihout a "s" and with parentheses)
mydf$dtype_strings()
[1] "i32" "str"

2.3.4 Get a glimpse

You can access a dense preview of a DataFrame by using the glimpse() method. f The formatting is done one line per column, so wide DataFrame show nicely. Each line will show the column name, the dtypes attributes and the first few values.

pl$DataFrame(iris)$glimpse()
& Sepal.Length <f64> 5.1, 4.9, 4.7, 4.6, 5, 5.4, 4.6, 5, 4.4, 4.9
& Sepal.Width  <f64> 3.5, 3, 3.2, 3.1, 3.6, 3.9, 3.4, 3.4, 2.9, 3.1
& Petal.Length <f64> 1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4, 1.5
& Petal.Width  <f64> 0.2, 0.2, 0.2, 0.2, 0.2, 0.4, 0.3, 0.2, 0.2, 0.1
& Species      <cat> setosa, setosa, setosa, setosa, setosa, setosa, setosa, setosa, setosa, setosa

2.3.5 Get summary statistics

You can access some summary statistics from a DataFrame by using the describe() method.

pl$DataFrame(iris)$describe()
shape: (9, 6)
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ statistic  ┆ Sepal.Length     ┆ Sepal.Width      ┆ Petal.Length    ┆ Petal.Width     ┆ Species   β”‚
β”‚ ---        ┆ ---              ┆ ---              ┆ ---             ┆ ---             ┆ ---       β”‚
β”‚ str        ┆ str              ┆ str              ┆ str             ┆ str             ┆ str       β”‚
β•žβ•β•β•β•β•β•β•β•β•β•β•β•β•ͺ══════════════════β•ͺ══════════════════β•ͺ═════════════════β•ͺ═════════════════β•ͺ═══════════║
β”‚ count      ┆ 150              ┆ 150              ┆ 150             ┆ 150             ┆ 150       β”‚
β”‚ null_count ┆ 0                ┆ 0                ┆ 0               ┆ 0               ┆ 0         β”‚
β”‚ mean       ┆ 5.84333333333333 ┆ 3.05733333333333 ┆ 3.7579999999999 ┆ 1.1993333333333 ┆ null      β”‚
β”‚            ┆ 4                ┆ 32               ┆ 996             ┆ 336             ┆           β”‚
β”‚ std        ┆ 0.82806612797786 ┆ 0.43586628493669 ┆ 1.7652982332594 ┆ 0.7622376689603 ┆ null      β”‚
β”‚            ┆ 3                ┆ 82               ┆ 664             ┆ 466             ┆           β”‚
β”‚ min        ┆ 4.3              ┆ 2.0              ┆ 1.0             ┆ 0.1             ┆ setosa    β”‚
β”‚ 25%        ┆ 5.1              ┆ 2.8              ┆ 1.6             ┆ 0.3             ┆ null      β”‚
β”‚ 50%        ┆ 5.8              ┆ 3.0              ┆ 4.4             ┆ 1.3             ┆ null      β”‚
β”‚ 75%        ┆ 6.4              ┆ 3.3              ┆ 5.1             ┆ 1.8             ┆ null      β”‚
β”‚ max        ┆ 7.9              ┆ 4.4              ┆ 6.9             ┆ 2.5             ┆ virginica β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

2.3.6 Get the rows number

A DataFrame has a width attribute which returns the width of the DataFrame.

# With width attribute
mydf$width
[1] 2

2.3.7 Get the columns number

A DataFrame has a height attribute which returns the width of the DataFrame.

# With height attribute
mydf$height
[1] 3

2.3.8 Get the columns names

A DataFrame has a columns attribute which returns the columns names of the DataFrame in a character vector.

mydf$columns
[1] "col1" "col2"
# This works also on LazyFrame
mydf$lazy()$columns
[1] "col1" "col2"

2.3.9 Rename columns

A character vector can be passed to the columns attribute to rename the columns names of the DataFrame.

mydf$columns <- c("colA","colB")
mydf$columns
[1] "colA" "colB"

2.3.10 Get the size

The estimated_size() method can be used to get an estimation of the total allocated size (in Bytes) of a DataFrame.

mydf$estimated_size()
[1] 15

2.3.11 Get the first n rows

The head() method can be used to get the first n rows of a DataFrame.

# To get the 2 first rows
mydf$head(2)
shape: (2, 2)
β”Œβ”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”
β”‚ colA ┆ colB β”‚
β”‚ ---  ┆ ---  β”‚
β”‚ i32  ┆ str  β”‚
β•žβ•β•β•β•β•β•β•ͺ══════║
β”‚ 1    ┆ a    β”‚
β”‚ 2    ┆ b    β”‚
β””β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”˜

2.3.12 Count values in a DataFrame

The value_counts() method can be used to count values in a Series of a DataFrame. value_counts() works with a Series. It must therefore be supplied either with square brackets or with the select() method. See here to learn about it.

# 1st option with square brackets
pl$DataFrame(iris)[,c("Species")]$value_counts()
shape: (3, 2)
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Species    ┆ count β”‚
β”‚ ---        ┆ ---   β”‚
β”‚ cat        ┆ u32   β”‚
β•žβ•β•β•β•β•β•β•β•β•β•β•β•β•ͺ═══════║
β”‚ setosa     ┆ 50    β”‚
β”‚ versicolor ┆ 50    β”‚
β”‚ virginica  ┆ 50    β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”˜
# 2nd option with select() method
pl$DataFrame(iris)$select(pl$col("Species"))$to_series()$value_counts()
shape: (3, 2)
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Species    ┆ count β”‚
β”‚ ---        ┆ ---   β”‚
β”‚ cat        ┆ u32   β”‚
β•žβ•β•β•β•β•β•β•β•β•β•β•β•β•ͺ═══════║
β”‚ setosa     ┆ 50    β”‚
β”‚ versicolor ┆ 50    β”‚
β”‚ virginica  ┆ 50    β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”˜
table(iris$Species)

    setosa versicolor  virginica 
        50         50         50 
iris |>
  count(Species)
     Species  n
1     setosa 50
2 versicolor 50
3  virginica 50
iris_dt[, .N, by = Species]
      Species     N
       <fctr> <int>
1:     setosa    50
2: versicolor    50
3:  virginica    50

2.3.13 Count NA over columns in a DataFrame

The null_count() method can be used to count NA values of a DataFrame.

mydfNA <- pl$DataFrame(
  colA = pl$Series(c("a",NA,"c")),
  colB = pl$Series(c("d",NA,NA)))
Warning in pl$Series(c("a", NA, "c")): `pl$Series()` will handle unnamed arguments differently as of 0.17.0:
- until 0.17.0, the first argument corresponds to the values and the second argument to the name of the Series.
- as of 0.17.0, the first argument will correspond to the name and the second argument to the values.
Use named arguments in `pl$Series()` or replace `pl$Series(<values>, <name>)` by `as_polars_series(<values>, <name>)` to silence this warning.
Warning in pl$Series(c("d", NA, NA)): `pl$Series()` will handle unnamed arguments differently as of 0.17.0:
- until 0.17.0, the first argument corresponds to the values and the second argument to the name of the Series.
- as of 0.17.0, the first argument will correspond to the name and the second argument to the values.
Use named arguments in `pl$Series()` or replace `pl$Series(<values>, <name>)` by `as_polars_series(<values>, <name>)` to silence this warning.
mydfNA$null_count()
shape: (1, 2)
β”Œβ”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”
β”‚ colA ┆ colB β”‚
β”‚ ---  ┆ ---  β”‚
β”‚ u32  ┆ u32  β”‚
β•žβ•β•β•β•β•β•β•ͺ══════║
β”‚ 1    ┆ 2    β”‚
β””β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”˜
mydfNA <- data.frame(
  colA = c("a",NA,"c"),
  colB = c("d",NA,NA))
sapply(mydfNA, function(x) sum(is.na(x)))
colA colB 
   1    2 
mydfNA |>
  summarise(across(everything(), ~sum(is.na(.))))
  colA colB
1    1    2
mydfNA_dt <- as.data.table(mydfNA)
mydfNA_dt[, lapply(.SD, function(x) sum(is.na(x))), .SDcols = names(mydfNA_dt)]
    colA  colB
   <int> <int>
1:     1     2

2.4 Filter rows

The first option to filter rows of a DataFrame is to use square brackets [] indexing (with integer row number).

data(iris)
# The first four lines
pl$DataFrame(iris)[1:4,]
shape: (4, 5)
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Sepal.Length ┆ Sepal.Width ┆ Petal.Length ┆ Petal.Width ┆ Species β”‚
β”‚ ---          ┆ ---         ┆ ---          ┆ ---         ┆ ---     β”‚
β”‚ f64          ┆ f64         ┆ f64          ┆ f64         ┆ cat     β”‚
β•žβ•β•β•β•β•β•β•β•β•β•β•β•β•β•β•ͺ═════════════β•ͺ══════════════β•ͺ═════════════β•ͺ═════════║
β”‚ 5.1          ┆ 3.5         ┆ 1.4          ┆ 0.2         ┆ setosa  β”‚
β”‚ 4.9          ┆ 3.0         ┆ 1.4          ┆ 0.2         ┆ setosa  β”‚
β”‚ 4.7          ┆ 3.2         ┆ 1.3          ┆ 0.2         ┆ setosa  β”‚
β”‚ 4.6          ┆ 3.1         ┆ 1.5          ┆ 0.2         ┆ setosa  β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
# The lines 1, 3 and 5
pl$DataFrame(iris)[c(1,3,5),]
shape: (3, 5)
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Sepal.Length ┆ Sepal.Width ┆ Petal.Length ┆ Petal.Width ┆ Species β”‚
β”‚ ---          ┆ ---         ┆ ---          ┆ ---         ┆ ---     β”‚
β”‚ f64          ┆ f64         ┆ f64          ┆ f64         ┆ cat     β”‚
β•žβ•β•β•β•β•β•β•β•β•β•β•β•β•β•β•ͺ═════════════β•ͺ══════════════β•ͺ═════════════β•ͺ═════════║
β”‚ 5.1          ┆ 3.5         ┆ 1.4          ┆ 0.2         ┆ setosa  β”‚
β”‚ 4.7          ┆ 3.2         ┆ 1.3          ┆ 0.2         ┆ setosa  β”‚
β”‚ 5.0          ┆ 3.6         ┆ 1.4          ┆ 0.2         ┆ setosa  β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
Important

It’s convenient when you have to quickly inspect your data. But you’ll quickly be limited by the square brackets, as they don’t accept conditions with the expressions. For example pl$DataFrame(iris)[Petal.Length > 6] doesn’t work.

The second and best option is to use the filter() method. It must be used with the Polars expression, here the col() method which allows to designate the columns on which the filter condition will be applied.

Let’s see in details what’s inside a filter() method with an example:

  • pl$col("Petal.Length"): this expression selects the Petal.Length column from iris;
  • >6: applies a Boolean condition to this expression (for all Petals that have a length > 6).

In the example below, we will use & operator to apply multiple conditions in filter() method:

pl$DataFrame(iris)$filter(
  pl$col("Petal.Length") > 6 & pl$col("Petal.Width") < 2)
shape: (2, 5)
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Sepal.Length ┆ Sepal.Width ┆ Petal.Length ┆ Petal.Width ┆ Species   β”‚
β”‚ ---          ┆ ---         ┆ ---          ┆ ---         ┆ ---       β”‚
β”‚ f64          ┆ f64         ┆ f64          ┆ f64         ┆ cat       β”‚
β•žβ•β•β•β•β•β•β•β•β•β•β•β•β•β•β•ͺ═════════════β•ͺ══════════════β•ͺ═════════════β•ͺ═══════════║
β”‚ 7.3          ┆ 2.9         ┆ 6.3          ┆ 1.8         ┆ virginica β”‚
β”‚ 7.4          ┆ 2.8         ┆ 6.1          ┆ 1.9         ┆ virginica β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
iris[iris$Petal.Length > 6 & iris$Petal.Width < 2,] # here don't forget the comma
    Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
108          7.3         2.9          6.3         1.8 virginica
131          7.4         2.8          6.1         1.9 virginica
iris |>
  filter(Petal.Length > 6 & Petal.Width < 2) 
  Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
1          7.3         2.9          6.3         1.8 virginica
2          7.4         2.8          6.1         1.9 virginica
iris_dt[Petal.Length > 6 & Petal.Width < 2]
   Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
          <num>       <num>        <num>       <num>    <fctr>
1:          7.3         2.9          6.3         1.8 virginica
2:          7.4         2.8          6.1         1.9 virginica
Tip

The equivalent of %in% R operator is the is_in() method. It should be used in association with the lit() method.

pl$DataFrame(
  colA = pl$Series(c("a","b","c"))
  )$filter(
  pl$col("colA")$is_in(pl$lit(c("a","b")))
  )
Warning in pl$Series(c("a", "b", "c")): `pl$Series()` will handle unnamed arguments differently as of 0.17.0:
- until 0.17.0, the first argument corresponds to the values and the second argument to the name of the Series.
- as of 0.17.0, the first argument will correspond to the name and the second argument to the values.
Use named arguments in `pl$Series()` or replace `pl$Series(<values>, <name>)` by `as_polars_series(<values>, <name>)` to silence this warning.
shape: (2, 1)
β”Œβ”€β”€β”€β”€β”€β”€β”
β”‚ colA β”‚
β”‚ ---  β”‚
β”‚ str  β”‚
β•žβ•β•β•β•β•β•β•‘
β”‚ a    β”‚
β”‚ b    β”‚
β””β”€β”€β”€β”€β”€β”€β”˜

Another reason for using the filter() method is that filter expressions can be optimised in lazy mode by the query optimiser. Square brackets [] can only be used in eager mode.

Tip

There is another way to speed up filter processing on rows: tell polars that the column(s) used to filter rows are already sorted! To do this, you can use the set_sorted() method.
Here’s an example:

mydf <- pl$DataFrame(
  col1 = pl$Series(sort(runif(10000000)))
)
Warning in pl$Series(sort(runif(1e+07))): `pl$Series()` will handle unnamed arguments differently as of 0.17.0:
- until 0.17.0, the first argument corresponds to the values and the second argument to the name of the Series.
- as of 0.17.0, the first argument will correspond to the name and the second argument to the values.
Use named arguments in `pl$Series()` or replace `pl$Series(<values>, <name>)` by `as_polars_series(<values>, <name>)` to silence this warning.
microbenchmark(
  "Without telling col1 is sorted" = mydf$filter(pl$col("col1") < 100),
  "Telling col1 is sorted" = mydf$with_columns(pl$col("col1")$set_sorted())$filter(pl$col("col1") < 100)
  )
Unit: milliseconds
                           expr      min       lq     mean   median       uq
 Without telling col1 is sorted 6.557458 6.896630 7.235332 7.032378 7.248586
         Telling col1 is sorted 3.868157 4.142338 4.465664 4.343943 4.534148
      max neval
 12.53200   100
 10.22463   100

2.5 Select columns

2.5.1 Selecting by name

The first option for selecting columns of a DataFrame is to use square brackets [].
The second option is to use the select() method. In this case, it must be used with the col() method which allows to designate the columns to be selected with a character vector.

# 1st option : with square brackets syntax
pl$DataFrame(iris)[1:3,c("Petal.Length","Petal.Width")] 
shape: (3, 2)
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Petal.Length ┆ Petal.Width β”‚
β”‚ ---          ┆ ---         β”‚
β”‚ f64          ┆ f64         β”‚
β•žβ•β•β•β•β•β•β•β•β•β•β•β•β•β•β•ͺ═════════════║
β”‚ 1.4          ┆ 0.2         β”‚
β”‚ 1.4          ┆ 0.2         β”‚
β”‚ 1.3          ┆ 0.2         β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
# 2nd option : with select() method
pl$DataFrame(iris)$select(
  pl$col(c("Petal.Length","Petal.Width"))
)$head(3) # display the first 3 lines
shape: (3, 2)
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Petal.Length ┆ Petal.Width β”‚
β”‚ ---          ┆ ---         β”‚
β”‚ f64          ┆ f64         β”‚
β•žβ•β•β•β•β•β•β•β•β•β•β•β•β•β•β•ͺ═════════════║
β”‚ 1.4          ┆ 0.2         β”‚
β”‚ 1.4          ┆ 0.2         β”‚
β”‚ 1.3          ┆ 0.2         β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
iris[1:3,c("Petal.Length","Petal.Width")] 
  Petal.Length Petal.Width
1          1.4         0.2
2          1.4         0.2
3          1.3         0.2
iris |>
  select(Petal.Length,Petal.Width) |>
  head(3)
  Petal.Length Petal.Width
1          1.4         0.2
2          1.4         0.2
3          1.3         0.2
iris_dt[1:3,.(Petal.Length,Petal.Width)]
   Petal.Length Petal.Width
          <num>       <num>
1:          1.4         0.2
2:          1.4         0.2
3:          1.3         0.2

Keep in mind that when:
- Square brackets are used:
- Selecting only one column from a DataFrame, the output is a Series;
- Whereas if there is more than one column selected, the output is a DataFrame.
- select() method is used:
- The output is always a DataFrame rather than a Series even if one column is selected.
=> If you need a Series you can use the to_series() method. See here.

inherits(pl$DataFrame(iris)[,"Petal.Length"],"Series")
[1] FALSE
inherits(pl$DataFrame(iris)[,"Petal.Length"],"DataFrame")
[1] FALSE
inherits(pl$DataFrame(iris)[,c("Petal.Length","Petal.Width")],"DataFrame")
[1] FALSE
inherits(pl$DataFrame(iris)[,c("Petal.Length","Petal.Width")],"Series")
[1] FALSE

With Polars if you want to obtain a result in an R data.frame, you can simply add the method to_data_frame() at the end of the method chaining. See here for examples.

Tip

Beyond the minor differences discussed above, there are two major reasons why you should use the select() method over the syntax with square brackets:
- When you select and transform multiple columns with select() method, Polars will run these selections in parallel;
- Expressions used in select() method can be optimised in lazy mode by the query optimizer.

Finally, the select() method can also be used to re-order columns of a DataFrame.
For example, to re-order the columns in alphabetical order:

pl$DataFrame(iris)$select(
  pl$col(sort(pl$DataFrame(iris)$columns)))$
  head(3)
shape: (3, 5)
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Petal.Length ┆ Petal.Width ┆ Sepal.Length ┆ Sepal.Width ┆ Species β”‚
β”‚ ---          ┆ ---         ┆ ---          ┆ ---         ┆ ---     β”‚
β”‚ f64          ┆ f64         ┆ f64          ┆ f64         ┆ cat     β”‚
β•žβ•β•β•β•β•β•β•β•β•β•β•β•β•β•β•ͺ═════════════β•ͺ══════════════β•ͺ═════════════β•ͺ═════════║
β”‚ 1.4          ┆ 0.2         ┆ 5.1          ┆ 3.5         ┆ setosa  β”‚
β”‚ 1.4          ┆ 0.2         ┆ 4.9          ┆ 3.0         ┆ setosa  β”‚
β”‚ 1.3          ┆ 0.2         ┆ 4.7          ┆ 3.2         ┆ setosa  β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

2.5.2 Selecting by data type

To select columns by data type from a DataFrame, you can pass a data type to pl$col expression. For example:

# Select only Float64 columns
pl$DataFrame(iris)$select(pl$col(pl$Float64))$head(3)
shape: (3, 4)
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Sepal.Length ┆ Sepal.Width ┆ Petal.Length ┆ Petal.Width β”‚
β”‚ ---          ┆ ---         ┆ ---          ┆ ---         β”‚
β”‚ f64          ┆ f64         ┆ f64          ┆ f64         β”‚
β•žβ•β•β•β•β•β•β•β•β•β•β•β•β•β•β•ͺ═════════════β•ͺ══════════════β•ͺ═════════════║
β”‚ 5.1          ┆ 3.5         ┆ 1.4          ┆ 0.2         β”‚
β”‚ 4.9          ┆ 3.0         ┆ 1.4          ┆ 0.2         β”‚
β”‚ 4.7          ┆ 3.2         ┆ 1.3          ┆ 0.2         β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
# Select only numeric columns
pl$DataFrame(iris)$select(pl$col(pl$numeric_dtypes))$head(3) 
shape: (3, 4)
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Sepal.Length ┆ Sepal.Width ┆ Petal.Length ┆ Petal.Width β”‚
β”‚ ---          ┆ ---         ┆ ---          ┆ ---         β”‚
β”‚ f64          ┆ f64         ┆ f64          ┆ f64         β”‚
β•žβ•β•β•β•β•β•β•β•β•β•β•β•β•β•β•ͺ═════════════β•ͺ══════════════β•ͺ═════════════║
β”‚ 5.1          ┆ 3.5         ┆ 1.4          ┆ 0.2         β”‚
β”‚ 4.9          ┆ 3.0         ┆ 1.4          ┆ 0.2         β”‚
β”‚ 4.7          ┆ 3.2         ┆ 1.3          ┆ 0.2         β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
# Select only factor columns
as.data.frame(iris[1:3, sapply(iris, is.factor)])
  iris[1:3, sapply(iris, is.factor)]
1                             setosa
2                             setosa
3                             setosa
# Select only numeric columns
iris[1:3, sapply(iris, is.numeric)]
  Sepal.Length Sepal.Width Petal.Length Petal.Width
1          5.1         3.5          1.4         0.2
2          4.9         3.0          1.4         0.2
3          4.7         3.2          1.3         0.2
# Select only factor columns
iris %>% 
  select_if(is.factor)
       Species
1       setosa
2       setosa
3       setosa
4       setosa
5       setosa
6       setosa
7       setosa
8       setosa
9       setosa
10      setosa
11      setosa
12      setosa
13      setosa
14      setosa
15      setosa
16      setosa
17      setosa
18      setosa
19      setosa
20      setosa
21      setosa
22      setosa
23      setosa
24      setosa
25      setosa
26      setosa
27      setosa
28      setosa
29      setosa
30      setosa
31      setosa
32      setosa
33      setosa
34      setosa
35      setosa
36      setosa
37      setosa
38      setosa
39      setosa
40      setosa
41      setosa
42      setosa
43      setosa
44      setosa
45      setosa
46      setosa
47      setosa
48      setosa
49      setosa
50      setosa
51  versicolor
52  versicolor
53  versicolor
54  versicolor
55  versicolor
56  versicolor
57  versicolor
58  versicolor
59  versicolor
60  versicolor
61  versicolor
62  versicolor
63  versicolor
64  versicolor
65  versicolor
66  versicolor
67  versicolor
68  versicolor
69  versicolor
70  versicolor
71  versicolor
72  versicolor
73  versicolor
74  versicolor
75  versicolor
76  versicolor
77  versicolor
78  versicolor
79  versicolor
80  versicolor
81  versicolor
82  versicolor
83  versicolor
84  versicolor
85  versicolor
86  versicolor
87  versicolor
88  versicolor
89  versicolor
90  versicolor
91  versicolor
92  versicolor
93  versicolor
94  versicolor
95  versicolor
96  versicolor
97  versicolor
98  versicolor
99  versicolor
100 versicolor
101  virginica
102  virginica
103  virginica
104  virginica
105  virginica
106  virginica
107  virginica
108  virginica
109  virginica
110  virginica
111  virginica
112  virginica
113  virginica
114  virginica
115  virginica
116  virginica
117  virginica
118  virginica
119  virginica
120  virginica
121  virginica
122  virginica
123  virginica
124  virginica
125  virginica
126  virginica
127  virginica
128  virginica
129  virginica
130  virginica
131  virginica
132  virginica
133  virginica
134  virginica
135  virginica
136  virginica
137  virginica
138  virginica
139  virginica
140  virginica
141  virginica
142  virginica
143  virginica
144  virginica
145  virginica
146  virginica
147  virginica
148  virginica
149  virginica
150  virginica
# Select only numeric columns
iris %>% 
  select_if(is.numeric)
    Sepal.Length Sepal.Width Petal.Length Petal.Width
1            5.1         3.5          1.4         0.2
2            4.9         3.0          1.4         0.2
3            4.7         3.2          1.3         0.2
4            4.6         3.1          1.5         0.2
5            5.0         3.6          1.4         0.2
6            5.4         3.9          1.7         0.4
7            4.6         3.4          1.4         0.3
8            5.0         3.4          1.5         0.2
9            4.4         2.9          1.4         0.2
10           4.9         3.1          1.5         0.1
11           5.4         3.7          1.5         0.2
12           4.8         3.4          1.6         0.2
13           4.8         3.0          1.4         0.1
14           4.3         3.0          1.1         0.1
15           5.8         4.0          1.2         0.2
16           5.7         4.4          1.5         0.4
17           5.4         3.9          1.3         0.4
18           5.1         3.5          1.4         0.3
19           5.7         3.8          1.7         0.3
20           5.1         3.8          1.5         0.3
21           5.4         3.4          1.7         0.2
22           5.1         3.7          1.5         0.4
23           4.6         3.6          1.0         0.2
24           5.1         3.3          1.7         0.5
25           4.8         3.4          1.9         0.2
26           5.0         3.0          1.6         0.2
27           5.0         3.4          1.6         0.4
28           5.2         3.5          1.5         0.2
29           5.2         3.4          1.4         0.2
30           4.7         3.2          1.6         0.2
31           4.8         3.1          1.6         0.2
32           5.4         3.4          1.5         0.4
33           5.2         4.1          1.5         0.1
34           5.5         4.2          1.4         0.2
35           4.9         3.1          1.5         0.2
36           5.0         3.2          1.2         0.2
37           5.5         3.5          1.3         0.2
38           4.9         3.6          1.4         0.1
39           4.4         3.0          1.3         0.2
40           5.1         3.4          1.5         0.2
41           5.0         3.5          1.3         0.3
42           4.5         2.3          1.3         0.3
43           4.4         3.2          1.3         0.2
44           5.0         3.5          1.6         0.6
45           5.1         3.8          1.9         0.4
46           4.8         3.0          1.4         0.3
47           5.1         3.8          1.6         0.2
48           4.6         3.2          1.4         0.2
49           5.3         3.7          1.5         0.2
50           5.0         3.3          1.4         0.2
51           7.0         3.2          4.7         1.4
52           6.4         3.2          4.5         1.5
53           6.9         3.1          4.9         1.5
54           5.5         2.3          4.0         1.3
55           6.5         2.8          4.6         1.5
56           5.7         2.8          4.5         1.3
57           6.3         3.3          4.7         1.6
58           4.9         2.4          3.3         1.0
59           6.6         2.9          4.6         1.3
60           5.2         2.7          3.9         1.4
61           5.0         2.0          3.5         1.0
62           5.9         3.0          4.2         1.5
63           6.0         2.2          4.0         1.0
64           6.1         2.9          4.7         1.4
65           5.6         2.9          3.6         1.3
66           6.7         3.1          4.4         1.4
67           5.6         3.0          4.5         1.5
68           5.8         2.7          4.1         1.0
69           6.2         2.2          4.5         1.5
70           5.6         2.5          3.9         1.1
71           5.9         3.2          4.8         1.8
72           6.1         2.8          4.0         1.3
73           6.3         2.5          4.9         1.5
74           6.1         2.8          4.7         1.2
75           6.4         2.9          4.3         1.3
76           6.6         3.0          4.4         1.4
77           6.8         2.8          4.8         1.4
78           6.7         3.0          5.0         1.7
79           6.0         2.9          4.5         1.5
80           5.7         2.6          3.5         1.0
81           5.5         2.4          3.8         1.1
82           5.5         2.4          3.7         1.0
83           5.8         2.7          3.9         1.2
84           6.0         2.7          5.1         1.6
85           5.4         3.0          4.5         1.5
86           6.0         3.4          4.5         1.6
87           6.7         3.1          4.7         1.5
88           6.3         2.3          4.4         1.3
89           5.6         3.0          4.1         1.3
90           5.5         2.5          4.0         1.3
91           5.5         2.6          4.4         1.2
92           6.1         3.0          4.6         1.4
93           5.8         2.6          4.0         1.2
94           5.0         2.3          3.3         1.0
95           5.6         2.7          4.2         1.3
96           5.7         3.0          4.2         1.2
97           5.7         2.9          4.2         1.3
98           6.2         2.9          4.3         1.3
99           5.1         2.5          3.0         1.1
100          5.7         2.8          4.1         1.3
101          6.3         3.3          6.0         2.5
102          5.8         2.7          5.1         1.9
103          7.1         3.0          5.9         2.1
104          6.3         2.9          5.6         1.8
105          6.5         3.0          5.8         2.2
106          7.6         3.0          6.6         2.1
107          4.9         2.5          4.5         1.7
108          7.3         2.9          6.3         1.8
109          6.7         2.5          5.8         1.8
110          7.2         3.6          6.1         2.5
111          6.5         3.2          5.1         2.0
112          6.4         2.7          5.3         1.9
113          6.8         3.0          5.5         2.1
114          5.7         2.5          5.0         2.0
115          5.8         2.8          5.1         2.4
116          6.4         3.2          5.3         2.3
117          6.5         3.0          5.5         1.8
118          7.7         3.8          6.7         2.2
119          7.7         2.6          6.9         2.3
120          6.0         2.2          5.0         1.5
121          6.9         3.2          5.7         2.3
122          5.6         2.8          4.9         2.0
123          7.7         2.8          6.7         2.0
124          6.3         2.7          4.9         1.8
125          6.7         3.3          5.7         2.1
126          7.2         3.2          6.0         1.8
127          6.2         2.8          4.8         1.8
128          6.1         3.0          4.9         1.8
129          6.4         2.8          5.6         2.1
130          7.2         3.0          5.8         1.6
131          7.4         2.8          6.1         1.9
132          7.9         3.8          6.4         2.0
133          6.4         2.8          5.6         2.2
134          6.3         2.8          5.1         1.5
135          6.1         2.6          5.6         1.4
136          7.7         3.0          6.1         2.3
137          6.3         3.4          5.6         2.4
138          6.4         3.1          5.5         1.8
139          6.0         3.0          4.8         1.8
140          6.9         3.1          5.4         2.1
141          6.7         3.1          5.6         2.4
142          6.9         3.1          5.1         2.3
143          5.8         2.7          5.1         1.9
144          6.8         3.2          5.9         2.3
145          6.7         3.3          5.7         2.5
146          6.7         3.0          5.2         2.3
147          6.3         2.5          5.0         1.9
148          6.5         3.0          5.2         2.0
149          6.2         3.4          5.4         2.3
150          5.9         3.0          5.1         1.8
# Select only factor columns
iris_dt[, .SD, .SDcols = is.factor]
       Species
        <fctr>
  1:    setosa
  2:    setosa
  3:    setosa
  4:    setosa
  5:    setosa
 ---          
146: virginica
147: virginica
148: virginica
149: virginica
150: virginica
# Select only numeric columns
iris_dt[, .SD, .SDcols = is.numeric]
     Sepal.Length Sepal.Width Petal.Length Petal.Width
            <num>       <num>        <num>       <num>
  1:          5.1         3.5          1.4         0.2
  2:          4.9         3.0          1.4         0.2
  3:          4.7         3.2          1.3         0.2
  4:          4.6         3.1          1.5         0.2
  5:          5.0         3.6          1.4         0.2
 ---                                                  
146:          6.7         3.0          5.2         2.3
147:          6.3         2.5          5.0         1.9
148:          6.5         3.0          5.2         2.0
149:          6.2         3.4          5.4         2.3
150:          5.9         3.0          5.1         1.8
Note

You can also select columns from a DataFrame using a name pattern. See an example by removing a column

2.5.3 Selecting with a regex

To select columns with a regex from a DataFrame, you can pass it in pl$col expression. For example, to select all columns that starts with β€œSepal” in iris dataset:

pl$DataFrame(iris)$select(
  pl$col("^Sepal.*$")
)$head(3) # display the first 3 lines
shape: (3, 2)
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Sepal.Length ┆ Sepal.Width β”‚
β”‚ ---          ┆ ---         β”‚
β”‚ f64          ┆ f64         β”‚
β•žβ•β•β•β•β•β•β•β•β•β•β•β•β•β•β•ͺ═════════════║
β”‚ 5.1          ┆ 3.5         β”‚
β”‚ 4.9          ┆ 3.0         β”‚
β”‚ 4.7          ┆ 3.2         β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
iris[, grep("^Sepal.*$", colnames(iris))]
    Sepal.Length Sepal.Width
1            5.1         3.5
2            4.9         3.0
3            4.7         3.2
4            4.6         3.1
5            5.0         3.6
6            5.4         3.9
7            4.6         3.4
8            5.0         3.4
9            4.4         2.9
10           4.9         3.1
11           5.4         3.7
12           4.8         3.4
13           4.8         3.0
14           4.3         3.0
15           5.8         4.0
16           5.7         4.4
17           5.4         3.9
18           5.1         3.5
19           5.7         3.8
20           5.1         3.8
21           5.4         3.4
22           5.1         3.7
23           4.6         3.6
24           5.1         3.3
25           4.8         3.4
26           5.0         3.0
27           5.0         3.4
28           5.2         3.5
29           5.2         3.4
30           4.7         3.2
31           4.8         3.1
32           5.4         3.4
33           5.2         4.1
34           5.5         4.2
35           4.9         3.1
36           5.0         3.2
37           5.5         3.5
38           4.9         3.6
39           4.4         3.0
40           5.1         3.4
41           5.0         3.5
42           4.5         2.3
43           4.4         3.2
44           5.0         3.5
45           5.1         3.8
46           4.8         3.0
47           5.1         3.8
48           4.6         3.2
49           5.3         3.7
50           5.0         3.3
51           7.0         3.2
52           6.4         3.2
53           6.9         3.1
54           5.5         2.3
55           6.5         2.8
56           5.7         2.8
57           6.3         3.3
58           4.9         2.4
59           6.6         2.9
60           5.2         2.7
61           5.0         2.0
62           5.9         3.0
63           6.0         2.2
64           6.1         2.9
65           5.6         2.9
66           6.7         3.1
67           5.6         3.0
68           5.8         2.7
69           6.2         2.2
70           5.6         2.5
71           5.9         3.2
72           6.1         2.8
73           6.3         2.5
74           6.1         2.8
75           6.4         2.9
76           6.6         3.0
77           6.8         2.8
78           6.7         3.0
79           6.0         2.9
80           5.7         2.6
81           5.5         2.4
82           5.5         2.4
83           5.8         2.7
84           6.0         2.7
85           5.4         3.0
86           6.0         3.4
87           6.7         3.1
88           6.3         2.3
89           5.6         3.0
90           5.5         2.5
91           5.5         2.6
92           6.1         3.0
93           5.8         2.6
94           5.0         2.3
95           5.6         2.7
96           5.7         3.0
97           5.7         2.9
98           6.2         2.9
99           5.1         2.5
100          5.7         2.8
101          6.3         3.3
102          5.8         2.7
103          7.1         3.0
104          6.3         2.9
105          6.5         3.0
106          7.6         3.0
107          4.9         2.5
108          7.3         2.9
109          6.7         2.5
110          7.2         3.6
111          6.5         3.2
112          6.4         2.7
113          6.8         3.0
114          5.7         2.5
115          5.8         2.8
116          6.4         3.2
117          6.5         3.0
118          7.7         3.8
119          7.7         2.6
120          6.0         2.2
121          6.9         3.2
122          5.6         2.8
123          7.7         2.8
124          6.3         2.7
125          6.7         3.3
126          7.2         3.2
127          6.2         2.8
128          6.1         3.0
129          6.4         2.8
130          7.2         3.0
131          7.4         2.8
132          7.9         3.8
133          6.4         2.8
134          6.3         2.8
135          6.1         2.6
136          7.7         3.0
137          6.3         3.4
138          6.4         3.1
139          6.0         3.0
140          6.9         3.1
141          6.7         3.1
142          6.9         3.1
143          5.8         2.7
144          6.8         3.2
145          6.7         3.3
146          6.7         3.0
147          6.3         2.5
148          6.5         3.0
149          6.2         3.4
150          5.9         3.0
iris %>%
  select(starts_with("Sepal"))
    Sepal.Length Sepal.Width
1            5.1         3.5
2            4.9         3.0
3            4.7         3.2
4            4.6         3.1
5            5.0         3.6
6            5.4         3.9
7            4.6         3.4
8            5.0         3.4
9            4.4         2.9
10           4.9         3.1
11           5.4         3.7
12           4.8         3.4
13           4.8         3.0
14           4.3         3.0
15           5.8         4.0
16           5.7         4.4
17           5.4         3.9
18           5.1         3.5
19           5.7         3.8
20           5.1         3.8
21           5.4         3.4
22           5.1         3.7
23           4.6         3.6
24           5.1         3.3
25           4.8         3.4
26           5.0         3.0
27           5.0         3.4
28           5.2         3.5
29           5.2         3.4
30           4.7         3.2
31           4.8         3.1
32           5.4         3.4
33           5.2         4.1
34           5.5         4.2
35           4.9         3.1
36           5.0         3.2
37           5.5         3.5
38           4.9         3.6
39           4.4         3.0
40           5.1         3.4
41           5.0         3.5
42           4.5         2.3
43           4.4         3.2
44           5.0         3.5
45           5.1         3.8
46           4.8         3.0
47           5.1         3.8
48           4.6         3.2
49           5.3         3.7
50           5.0         3.3
51           7.0         3.2
52           6.4         3.2
53           6.9         3.1
54           5.5         2.3
55           6.5         2.8
56           5.7         2.8
57           6.3         3.3
58           4.9         2.4
59           6.6         2.9
60           5.2         2.7
61           5.0         2.0
62           5.9         3.0
63           6.0         2.2
64           6.1         2.9
65           5.6         2.9
66           6.7         3.1
67           5.6         3.0
68           5.8         2.7
69           6.2         2.2
70           5.6         2.5
71           5.9         3.2
72           6.1         2.8
73           6.3         2.5
74           6.1         2.8
75           6.4         2.9
76           6.6         3.0
77           6.8         2.8
78           6.7         3.0
79           6.0         2.9
80           5.7         2.6
81           5.5         2.4
82           5.5         2.4
83           5.8         2.7
84           6.0         2.7
85           5.4         3.0
86           6.0         3.4
87           6.7         3.1
88           6.3         2.3
89           5.6         3.0
90           5.5         2.5
91           5.5         2.6
92           6.1         3.0
93           5.8         2.6
94           5.0         2.3
95           5.6         2.7
96           5.7         3.0
97           5.7         2.9
98           6.2         2.9
99           5.1         2.5
100          5.7         2.8
101          6.3         3.3
102          5.8         2.7
103          7.1         3.0
104          6.3         2.9
105          6.5         3.0
106          7.6         3.0
107          4.9         2.5
108          7.3         2.9
109          6.7         2.5
110          7.2         3.6
111          6.5         3.2
112          6.4         2.7
113          6.8         3.0
114          5.7         2.5
115          5.8         2.8
116          6.4         3.2
117          6.5         3.0
118          7.7         3.8
119          7.7         2.6
120          6.0         2.2
121          6.9         3.2
122          5.6         2.8
123          7.7         2.8
124          6.3         2.7
125          6.7         3.3
126          7.2         3.2
127          6.2         2.8
128          6.1         3.0
129          6.4         2.8
130          7.2         3.0
131          7.4         2.8
132          7.9         3.8
133          6.4         2.8
134          6.3         2.8
135          6.1         2.6
136          7.7         3.0
137          6.3         3.4
138          6.4         3.1
139          6.0         3.0
140          6.9         3.1
141          6.7         3.1
142          6.9         3.1
143          5.8         2.7
144          6.8         3.2
145          6.7         3.3
146          6.7         3.0
147          6.3         2.5
148          6.5         3.0
149          6.2         3.4
150          5.9         3.0
iris_dt[, .SD, .SDcols = grep("^Sepal", colnames(iris_dt))]
     Sepal.Length Sepal.Width
            <num>       <num>
  1:          5.1         3.5
  2:          4.9         3.0
  3:          4.7         3.2
  4:          4.6         3.1
  5:          5.0         3.6
 ---                         
146:          6.7         3.0
147:          6.3         2.5
148:          6.5         3.0
149:          6.2         3.4
150:          5.9         3.0

2.5.4 Selecting using lists

It is also possible to select columns from a DataFrame using R lists which can be very practical depending on the case. Here’s an example:

data(iris)

l_expr = list(
  pl$col("Petal.Length"),
  pl$col("Species")
)

# Select only categorical columns
pl$DataFrame(iris)$select(l_expr)$head(3) 
shape: (3, 2)
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Petal.Length ┆ Species β”‚
β”‚ ---          ┆ ---     β”‚
β”‚ f64          ┆ cat     β”‚
β•žβ•β•β•β•β•β•β•β•β•β•β•β•β•β•β•ͺ═════════║
β”‚ 1.4          ┆ setosa  β”‚
β”‚ 1.4          ┆ setosa  β”‚
β”‚ 1.3          ┆ setosa  β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

This way of selecting columns also works with simple R lists:

data(iris)

l_expr = list(
  "Petal.Length",
  "Species",
  "Petal.Width"
)

# Select only categorical columns
pl$DataFrame(iris)$select(l_expr)$head(3) 
shape: (3, 3)
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Petal.Length ┆ Species ┆ Petal.Width β”‚
β”‚ ---          ┆ ---     ┆ ---         β”‚
β”‚ f64          ┆ cat     ┆ f64         β”‚
β•žβ•β•β•β•β•β•β•β•β•β•β•β•β•β•β•ͺ═════════β•ͺ═════════════║
β”‚ 1.4          ┆ setosa  ┆ 0.2         β”‚
β”‚ 1.4          ┆ setosa  ┆ 0.2         β”‚
β”‚ 1.3          ┆ setosa  ┆ 0.2         β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

2.5.5 Selecting - other ways

The selectors API of polars enables to use other methods to select columns.
Here’s a few examples selecting the first and last column of a DataFrame respectively with first() and last() methods:

# Select the first column
pl$DataFrame(iris)$select(
  pl$first()
)$head(3)
shape: (3, 1)
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Sepal.Length β”‚
β”‚ ---          β”‚
β”‚ f64          β”‚
β•žβ•β•β•β•β•β•β•β•β•β•β•β•β•β•β•‘
β”‚ 5.1          β”‚
β”‚ 4.9          β”‚
β”‚ 4.7          β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
# Select the last column
pl$DataFrame(iris)$select(
  pl$last()
)$head(3)
shape: (3, 1)
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Species β”‚
β”‚ ---     β”‚
β”‚ cat     β”‚
β•žβ•β•β•β•β•β•β•β•β•β•‘
β”‚ setosa  β”‚
β”‚ setosa  β”‚
β”‚ setosa  β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

2.6 Modify/Add columns

2.6.1 Modify existing column

Similar to the dplyr package, the select() method can also be used to modify existing data. However, the result will exclude any columns that were not specified in the expression.

For example, if we want to get in the data.frame iris the Petal.Length column rounded without decimals.

pl$DataFrame(iris)$select(
  pl$col("Petal.Length")$round(decimals = 0)
)$head(3) # display the first 3 lines
shape: (3, 1)
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Petal.Length β”‚
β”‚ ---          β”‚
β”‚ f64          β”‚
β•žβ•β•β•β•β•β•β•β•β•β•β•β•β•β•β•‘
β”‚ 1.0          β”‚
β”‚ 1.0          β”‚
β”‚ 1.0          β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

The problem here is that we would like to keep all the iris columns and not just Petal.Length.

Again, let’s look at the official documentation:

To modify or add some columnsβ€”whilst preserving all others in the datasetβ€”it is therefore better to use the with_columns() method.

Note that the with_columns() method only accepts expressions. Let’s use it in an example:

pl$DataFrame(iris)$with_columns(
  pl$col("Petal.Length")$round(decimals = 0)
)$head(3) # display the first 3 lines
shape: (3, 5)
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Sepal.Length ┆ Sepal.Width ┆ Petal.Length ┆ Petal.Width ┆ Species β”‚
β”‚ ---          ┆ ---         ┆ ---          ┆ ---         ┆ ---     β”‚
β”‚ f64          ┆ f64         ┆ f64          ┆ f64         ┆ cat     β”‚
β•žβ•β•β•β•β•β•β•β•β•β•β•β•β•β•β•ͺ═════════════β•ͺ══════════════β•ͺ═════════════β•ͺ═════════║
β”‚ 5.1          ┆ 3.5         ┆ 1.0          ┆ 0.2         ┆ setosa  β”‚
β”‚ 4.9          ┆ 3.0         ┆ 1.0          ┆ 0.2         ┆ setosa  β”‚
β”‚ 4.7          ┆ 3.2         ┆ 1.0          ┆ 0.2         ┆ setosa  β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
iris$Petal.Length <- round(iris$Petal.Length, digits = 0)
iris[1:3,]
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5            1         0.2  setosa
2          4.9         3.0            1         0.2  setosa
3          4.7         3.2            1         0.2  setosa
data(iris)
iris |>
  mutate(Petal.Length = round(Petal.Length,0)) |>
  head(3)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5            1         0.2  setosa
2          4.9         3.0            1         0.2  setosa
3          4.7         3.2            1         0.2  setosa
iris_dt[,Petal.Length := round(Petal.Length, digits = 0)]
iris_dt[1:3,]
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
          <num>       <num>        <num>       <num>  <fctr>
1:          5.1         3.5            1         0.2  setosa
2:          4.9         3.0            1         0.2  setosa
3:          4.7         3.2            1         0.2  setosa

2.6.2 Add a new column

If you want to add a column to a data.frame, you use the same syntax as above with with_columns(). Simply use the alias() method to specify the name of the newly created column.

pl$DataFrame(iris)$with_columns(
  pl$col("Petal.Length")$round(decimals = 0)$alias("Petal.Length.rounded")
)$head(3) # display the first 3 lines
shape: (3, 6)
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Sepal.Length ┆ Sepal.Width ┆ Petal.Length ┆ Petal.Width ┆ Species ┆ Petal.Length.rounded β”‚
β”‚ ---          ┆ ---         ┆ ---          ┆ ---         ┆ ---     ┆ ---                  β”‚
β”‚ f64          ┆ f64         ┆ f64          ┆ f64         ┆ cat     ┆ f64                  β”‚
β•žβ•β•β•β•β•β•β•β•β•β•β•β•β•β•β•ͺ═════════════β•ͺ══════════════β•ͺ═════════════β•ͺ═════════β•ͺ══════════════════════║
β”‚ 5.1          ┆ 3.5         ┆ 1.4          ┆ 0.2         ┆ setosa  ┆ 1.0                  β”‚
β”‚ 4.9          ┆ 3.0         ┆ 1.4          ┆ 0.2         ┆ setosa  ┆ 1.0                  β”‚
β”‚ 4.7          ┆ 3.2         ┆ 1.3          ┆ 0.2         ┆ setosa  ┆ 1.0                  β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

2.6.3 Add a new column with a constant value

If you need to create a new column with a constant value (i.e. the same value for all the rows in your DataFrame), you can use the literal lit() method. It works with the main types of Polars.

pl$DataFrame(iris)$with_columns(
  pl$lit("toto")$alias("mynewcolumn")
)$head(3) # display the first 3 lines
shape: (3, 6)
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Sepal.Length ┆ Sepal.Width ┆ Petal.Length ┆ Petal.Width ┆ Species ┆ mynewcolumn β”‚
β”‚ ---          ┆ ---         ┆ ---          ┆ ---         ┆ ---     ┆ ---         β”‚
β”‚ f64          ┆ f64         ┆ f64          ┆ f64         ┆ cat     ┆ str         β”‚
β•žβ•β•β•β•β•β•β•β•β•β•β•β•β•β•β•ͺ═════════════β•ͺ══════════════β•ͺ═════════════β•ͺ═════════β•ͺ═════════════║
β”‚ 5.1          ┆ 3.5         ┆ 1.4          ┆ 0.2         ┆ setosa  ┆ toto        β”‚
β”‚ 4.9          ┆ 3.0         ┆ 1.4          ┆ 0.2         ┆ setosa  ┆ toto        β”‚
β”‚ 4.7          ┆ 3.2         ┆ 1.3          ┆ 0.2         ┆ setosa  ┆ toto        β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

2.6.4 Add a new column based on conditions

To add new columns based on conditions, the when-then-otherwise expression must be used.

Here’s the syntax:

pl$when(**Boolean Expression**)$
  then(**Value if True**)$
  otherwise(**Value if False)$
  alias(**New Column Name**)

Here’s an example with equivalent syntax to help you understand:

pl$DataFrame(iris)$with_columns(
  pl$when(pl$col("Petal.Length") <= 2)$then(pl$lit("<=2"))$
    when(pl$col("Petal.Length") <= 5)$then(pl$lit("<=5"))$
    otherwise(pl$lit(">5"))$
    alias("mygroups")
# we only need to display 2 variables to check that it's OK
)$select(
  pl$col(c("Petal.Length","mygroups"))
)[c(1,2,59,150),]
shape: (4, 2)
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Petal.Length ┆ mygroups β”‚
β”‚ ---          ┆ ---      β”‚
β”‚ f64          ┆ str      β”‚
β•žβ•β•β•β•β•β•β•β•β•β•β•β•β•β•β•ͺ══════════║
β”‚ 1.4          ┆ <=2      β”‚
β”‚ 1.4          ┆ <=2      β”‚
β”‚ 4.6          ┆ <=5      β”‚
β”‚ 5.1          ┆ >5       β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
iris$mygroups <- ifelse(iris$Petal.Length <= 2, "<=2",
                        ifelse(iris$Petal.Length <= 5, "<=5", ">5"))
# we only need to display 2 variables to check that it's OK
iris[c(1,2,59,150), c("Petal.Length", "mygroups")]
    Petal.Length mygroups
1            1.4      <=2
2            1.4      <=2
59           4.6      <=5
150          5.1       >5
iris |>
  mutate(
    mygroups = case_when(
      Petal.Length <=2 ~ "<=2",
      Petal.Length <=5 ~ "<=5",
      .default = ">5")
  ) |>
  # we only need to display 2 variables  to check that it's OK
  select(Petal.Length,mygroups) |>
  slice(1,2,59,150)
  Petal.Length mygroups
1          1.4      <=2
2          1.4      <=2
3          4.6      <=5
4          5.1       >5
iris_dt[, mygroups := case_when(
  Petal.Length <= 2 ~ "<=2",
  Petal.Length <= 5 ~ "<=5",
  TRUE ~ ">5"
)]
# we only need to display 2 variables to check that it's OK
iris_dt[c(1,2,59,150), .(Petal.Length, mygroups)]
   Petal.Length mygroups
          <num>   <char>
1:            1      <=2
2:            1      <=2
3:            5      <=5
4:            5      <=5

2.6.5 Add a new column by group

To add new columns by group, the over expression must be used. This expression is similar to performing a groupby aggregation and joining the result back into the original dataframe. Here’s an example with equivalent syntax to help you understand:

df <- data.frame(
  name = c("X","X","Y","Y","Z","Z"),
  adress = c("A","B","C","D","E","F"),
  col2 = c(2L,4L,1L,3L,4L,2L),
  col3 = c(5L,19L,17L,12L,11L,15L)
)
df
  name adress col2 col3
1    X      A    2    5
2    X      B    4   19
3    Y      C    1   17
4    Y      D    3   12
5    Z      E    4   11
6    Z      F    2   15
pl$DataFrame(df)$with_columns(
  pl$col("col3")$max()$over("name")$name$suffix("_max")
)
shape: (6, 5)
β”Œβ”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ name ┆ adress ┆ col2 ┆ col3 ┆ col3_max β”‚
β”‚ ---  ┆ ---    ┆ ---  ┆ ---  ┆ ---      β”‚
β”‚ str  ┆ str    ┆ i32  ┆ i32  ┆ i32      β”‚
β•žβ•β•β•β•β•β•β•ͺ════════β•ͺ══════β•ͺ══════β•ͺ══════════║
β”‚ X    ┆ A      ┆ 2    ┆ 5    ┆ 19       β”‚
β”‚ X    ┆ B      ┆ 4    ┆ 19   ┆ 19       β”‚
β”‚ Y    ┆ C      ┆ 1    ┆ 17   ┆ 17       β”‚
β”‚ Y    ┆ D      ┆ 3    ┆ 12   ┆ 17       β”‚
β”‚ Z    ┆ E      ┆ 4    ┆ 11   ┆ 15       β”‚
β”‚ Z    ┆ F      ┆ 2    ┆ 15   ┆ 15       β”‚
β””β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
result <- aggregate(col3 ~ name, data = df, FUN = max)
colnames(result) <- c("name", "col3_max")
merge(df, result, by = "name", all.x = TRUE)
  name adress col2 col3 col3_max
1    X      A    2    5       19
2    X      B    4   19       19
3    Y      C    1   17       17
4    Y      D    3   12       17
5    Z      E    4   11       15
6    Z      F    2   15       15
df |>
  group_by(name) |>
  mutate(col3_max = max(col3))
# A tibble: 6 Γ— 5
# Groups:   name [3]
  name  adress  col2  col3 col3_max
  <chr> <chr>  <int> <int>    <int>
1 X     A          2     5       19
2 X     B          4    19       19
3 Y     C          1    17       17
4 Y     D          3    12       17
5 Z     E          4    11       15
6 Z     F          2    15       15
dt <- as.data.table(df)
dt[, col3_max := max(col3), by = name]
dt
     name adress  col2  col3 col3_max
   <char> <char> <int> <int>    <int>
1:      X      A     2     5       19
2:      X      B     4    19       19
3:      Y      C     1    17       17
4:      Y      D     3    12       17
5:      Z      E     4    11       15
6:      Z      F     2    15       15
Tip

If you need to pass multiple column names in the over expression, you can either list them like this over("name","adress") or - more conveniently - use a character vector over(c("name","adress")).

2.7 Rename columns

Similar to the dplyr package, the rename() method can also be used to rename existing column.
The renaming logic is identical to that of dplyr, and is performed as follows: new_name="old_name".

Note

Note the double quotes "" surrounding the name of the old variable to be renamed which does not exist with dplyr (see examples below).

data(iris)
pl$DataFrame(iris)$
  rename(
    sepal_length = "Sepal.Length", 
    sepal_width = "Sepal.Width",
    `length of petal` = "Petal.Length",
    `width of petal` = "Petal.Width",
    species = "Species"
  )$columns
[1] "sepal_length"    "sepal_width"     "length of petal" "width of petal" 
[5] "species"        
data(iris)
names(iris) <- c("sepal_length","sepal_width","length of petal","width of petal","species")
names(iris)
[1] "sepal_length"    "sepal_width"     "length of petal" "width of petal" 
[5] "species"        
data(iris)
iris |>
  rename(
    sepal_length = Sepal.Length, 
    sepal_width = Sepal.Width,
    `length of petal` = Petal.Length,
    `width of petal` = Petal.Width,
    species = Species
  ) |>
  names()
[1] "sepal_length"    "sepal_width"     "length of petal" "width of petal" 
[5] "species"        
setnames(iris_dt, 
         old = c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width", "Species"),
         new = c("sepal_length", "sepal_width", "length_of_petal", "width_of_petal", "species"))
names(iris_dt)
[1] "sepal_length"    "sepal_width"     "length_of_petal" "width_of_petal" 
[5] "species"         "mygroups"       

2.8 Remove columns

2.8.1 Removing by name

To remove columns from a DataFrame, the drop method must be used.
Here’s an example with equivalent syntax:

pl$DataFrame(iris)$
  drop(c("Petal.Width","Sepal.Length","Sepal.Width"))$
  head(3)
shape: (3, 2)
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Petal.Length ┆ Species β”‚
β”‚ ---          ┆ ---     β”‚
β”‚ f64          ┆ cat     β”‚
β•žβ•β•β•β•β•β•β•β•β•β•β•β•β•β•β•ͺ═════════║
β”‚ 1.4          ┆ setosa  β”‚
β”‚ 1.4          ┆ setosa  β”‚
β”‚ 1.3          ┆ setosa  β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
iris[1:3,!(names(iris) %in% c("Petal.Width","Sepal.Length","Sepal.Width"))]
  Petal.Length Species
1          1.4  setosa
2          1.4  setosa
3          1.3  setosa
iris |>
  select(-c(Petal.Width,Sepal.Length,Sepal.Width)) |>
  head(3)
  Petal.Length Species
1          1.4  setosa
2          1.4  setosa
3          1.3  setosa
iris_dt[, c("Petal.Width","Sepal.Length","Sepal.Width") := NULL][1:3,]
   Petal.Length Species
          <num>  <fctr>
1:          1.4  setosa
2:          1.4  setosa
3:          1.3  setosa

2.8.2 Removing with a regex

To remove columns with a regex from a DataFrame, the drop expression must be used.
Let’s see an example where you want to drop columns whose names starts with β€œPetal” in iris.

num_to_drop <- !grepl("^Petal",pl$DataFrame(iris)$columns)
col_to_drop <- pl$DataFrame(iris)$columns[num_to_drop]

pl$DataFrame(iris)$
  drop(col_to_drop)$
  head(3)
shape: (3, 2)
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Petal.Length ┆ Petal.Width β”‚
β”‚ ---          ┆ ---         β”‚
β”‚ f64          ┆ f64         β”‚
β•žβ•β•β•β•β•β•β•β•β•β•β•β•β•β•β•ͺ═════════════║
β”‚ 1.4          ┆ 0.2         β”‚
β”‚ 1.4          ┆ 0.2         β”‚
β”‚ 1.3          ┆ 0.2         β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
iris[1:3,!grepl("^Petal",names(iris))]
  Sepal.Length Sepal.Width Species
1          5.1         3.5  setosa
2          4.9         3.0  setosa
3          4.7         3.2  setosa
iris |>
  select(-starts_with("Petal")) |>
  head(3)
  Sepal.Length Sepal.Width Species
1          5.1         3.5  setosa
2          4.9         3.0  setosa
3          4.7         3.2  setosa
iris_dt[, which(grepl("^Petal$", names(iris_dt))):=NULL][1:3,]
   Petal.Length Species
          <num>  <fctr>
1:          1.4  setosa
2:          1.4  setosa
3:          1.3  setosa
Note

All the ways we’ve seen in the section about selecting columns from a DataFrame (by name, data type and with a list) also work with drop() method!

2.9 Aggregation by group

Another frequently used data manipulation is the aggregation of data by group. To do this, we indicate in the group_by() method which column will be used to group the data.frame. And the agg() method which specifies the expression to aggregate.

The methods available for the agg() method are (in each group):

  • first() get the first element
  • last() get the last element
  • n_unique() get the number of unique elements
  • count() get the number of elements
  • sum() sum the elements
  • min() get the smallest element
  • max() get the largest element
  • mean() get the average of elements
  • median() get the median
  • quantile() calculate quantiles

Here’s a minimal example with sum applied to 2 different columns:

pl$DataFrame(iris)$
  group_by("Species")$
  agg(pl$col(c("Petal.Length","Petal.Width"))$sum())
shape: (3, 3)
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Species    ┆ Petal.Length ┆ Petal.Width β”‚
β”‚ ---        ┆ ---          ┆ ---         β”‚
β”‚ cat        ┆ f64          ┆ f64         β”‚
β•žβ•β•β•β•β•β•β•β•β•β•β•β•β•ͺ══════════════β•ͺ═════════════║
β”‚ virginica  ┆ 277.6        ┆ 101.3       β”‚
β”‚ versicolor ┆ 213.0        ┆ 66.3        β”‚
β”‚ setosa     ┆ 73.1         ┆ 12.3        β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
aggregate(cbind(Petal.Length, Petal.Width) ~ Species, data = iris, FUN = sum)
     Species Petal.Length Petal.Width
1     setosa         73.1        12.3
2 versicolor        213.0        66.3
3  virginica        277.6       101.3
data(iris)
iris |>
  group_by(Species) |>
  summarise(across(c(Petal.Length, Petal.Width), sum))
# A tibble: 3 Γ— 3
  Species    Petal.Length Petal.Width
  <fct>             <dbl>       <dbl>
1 setosa             73.1        12.3
2 versicolor        213          66.3
3 virginica         278.        101. 
iris_dt <- as.data.table(iris)
iris_dt[, .(Petal.Length = sum(Petal.Length), Petal.Width = sum(Petal.Width)), by = Species]
      Species Petal.Length Petal.Width
       <fctr>        <num>       <num>
1:     setosa         73.1        12.3
2: versicolor        213.0        66.3
3:  virginica        277.6       101.3
Tip

Be careful! Calling multiple aggregations on the same column produces columns of the same name which generates an error with R.

You can use the alias() or suffix() method to ensure column names are unique. For example:

pl$DataFrame(iris)$
  group_by("Species")$
  agg(
    # With alias()
    pl$col(c("Petal.Length"))$sum()$alias("Petal.Length_Sum"),
    pl$col(c("Petal.Length"))$mean()$alias("Petal.Length_Mean"),
    # With suffix()
    pl$col(c("Petal.Width"))$sum()$name$suffix("_Sum"),
    pl$col(c("Petal.Width"))$mean()$name$suffix("_Mean")
  )
shape: (3, 5)
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Species    ┆ Petal.Length_Sum ┆ Petal.Length_Mean ┆ Petal.Width_Sum ┆ Petal.Width_Mean β”‚
β”‚ ---        ┆ ---              ┆ ---               ┆ ---             ┆ ---              β”‚
β”‚ cat        ┆ f64              ┆ f64               ┆ f64             ┆ f64              β”‚
β•žβ•β•β•β•β•β•β•β•β•β•β•β•β•ͺ══════════════════β•ͺ═══════════════════β•ͺ═════════════════β•ͺ══════════════════║
β”‚ versicolor ┆ 213.0            ┆ 4.26              ┆ 66.3            ┆ 1.326            β”‚
β”‚ setosa     ┆ 73.1             ┆ 1.462             ┆ 12.3            ┆ 0.246            β”‚
β”‚ virginica  ┆ 277.6            ┆ 5.552             ┆ 101.3           ┆ 2.026            β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

You can perform more advanced aggregations, for example calculate the sum of all the floating point columns in iris:

pl$DataFrame(iris)$
  group_by("Species")$
  agg(pl$col(pl$Float64)$sum())
shape: (3, 5)
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Species    ┆ Sepal.Length ┆ Sepal.Width ┆ Petal.Length ┆ Petal.Width β”‚
β”‚ ---        ┆ ---          ┆ ---         ┆ ---          ┆ ---         β”‚
β”‚ cat        ┆ f64          ┆ f64         ┆ f64          ┆ f64         β”‚
β•žβ•β•β•β•β•β•β•β•β•β•β•β•β•ͺ══════════════β•ͺ═════════════β•ͺ══════════════β•ͺ═════════════║
β”‚ versicolor ┆ 296.8        ┆ 138.5       ┆ 213.0        ┆ 66.3        β”‚
β”‚ virginica  ┆ 329.4        ┆ 148.7       ┆ 277.6        ┆ 101.3       β”‚
β”‚ setosa     ┆ 250.3        ┆ 171.4       ┆ 73.1         ┆ 12.3        β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
iris_sum_by_species_base <- aggregate(iris[, sapply(iris, is.numeric)], by = list(Species = iris$Species), FUN = sum)
iris_sum_by_species_base
     Species Sepal.Length Sepal.Width Petal.Length Petal.Width
1     setosa        250.3       171.4         73.1        12.3
2 versicolor        296.8       138.5        213.0        66.3
3  virginica        329.4       148.7        277.6       101.3
iris |>
  group_by(Species) %>%
  summarise(across(where(is.numeric), sum))
# A tibble: 3 Γ— 5
  Species    Sepal.Length Sepal.Width Petal.Length Petal.Width
  <fct>             <dbl>       <dbl>        <dbl>       <dbl>
1 setosa             250.        171.         73.1        12.3
2 versicolor         297.        138.        213          66.3
3 virginica          329.        149.        278.        101. 
iris_dt <- as.data.table(iris)
iris_sum_by_species_dt <- iris_dt[, lapply(.SD, sum), by = Species, .SDcols = sapply(iris_dt, is.numeric)]
iris_sum_by_species_dt
      Species Sepal.Length Sepal.Width Petal.Length Petal.Width
       <fctr>        <num>       <num>        <num>       <num>
1:     setosa        250.3       171.4         73.1        12.3
2: versicolor        296.8       138.5        213.0        66.3
3:  virginica        329.4       148.7        277.6       101.3

2.10 Sort a DataFrame

2.10.1 Simply sort a DataFrame

The sort() method with a character vector can be used to sort a DataFrame.

# Sort by one column
pl$DataFrame(iris)$
  sort("Species")$
  head(3)
shape: (3, 5)
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Sepal.Length ┆ Sepal.Width ┆ Petal.Length ┆ Petal.Width ┆ Species β”‚
β”‚ ---          ┆ ---         ┆ ---          ┆ ---         ┆ ---     β”‚
β”‚ f64          ┆ f64         ┆ f64          ┆ f64         ┆ cat     β”‚
β•žβ•β•β•β•β•β•β•β•β•β•β•β•β•β•β•ͺ═════════════β•ͺ══════════════β•ͺ═════════════β•ͺ═════════║
β”‚ 5.1          ┆ 3.5         ┆ 1.4          ┆ 0.2         ┆ setosa  β”‚
β”‚ 4.9          ┆ 3.0         ┆ 1.4          ┆ 0.2         ┆ setosa  β”‚
β”‚ 4.7          ┆ 3.2         ┆ 1.3          ┆ 0.2         ┆ setosa  β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
# Sort by two columns
pl$DataFrame(iris)$
  sort(c("Species","Petal.Length"))$
  head(3)
shape: (3, 5)
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Sepal.Length ┆ Sepal.Width ┆ Petal.Length ┆ Petal.Width ┆ Species β”‚
β”‚ ---          ┆ ---         ┆ ---          ┆ ---         ┆ ---     β”‚
β”‚ f64          ┆ f64         ┆ f64          ┆ f64         ┆ cat     β”‚
β•žβ•β•β•β•β•β•β•β•β•β•β•β•β•β•β•ͺ═════════════β•ͺ══════════════β•ͺ═════════════β•ͺ═════════║
β”‚ 4.6          ┆ 3.6         ┆ 1.0          ┆ 0.2         ┆ setosa  β”‚
β”‚ 4.3          ┆ 3.0         ┆ 1.1          ┆ 0.1         ┆ setosa  β”‚
β”‚ 5.0          ┆ 3.2         ┆ 1.2          ┆ 0.2         ┆ setosa  β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
# Sort by two columns one in a decreasing manner and the other in an increasing manner
pl$DataFrame(iris)$
  sort(c("Species","Petal.Length"), descending = c(TRUE,FALSE))$
  head(3)
shape: (3, 5)
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Sepal.Length ┆ Sepal.Width ┆ Petal.Length ┆ Petal.Width ┆ Species   β”‚
β”‚ ---          ┆ ---         ┆ ---          ┆ ---         ┆ ---       β”‚
β”‚ f64          ┆ f64         ┆ f64          ┆ f64         ┆ cat       β”‚
β•žβ•β•β•β•β•β•β•β•β•β•β•β•β•β•β•ͺ═════════════β•ͺ══════════════β•ͺ═════════════β•ͺ═══════════║
β”‚ 4.9          ┆ 2.5         ┆ 4.5          ┆ 1.7         ┆ virginica β”‚
β”‚ 6.0          ┆ 3.0         ┆ 4.8          ┆ 1.8         ┆ virginica β”‚
β”‚ 6.2          ┆ 2.8         ┆ 4.8          ┆ 1.8         ┆ virginica β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
# Sort by one column
iris[order(iris$Species),][1:3,]
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
# Sort by two columns
iris[order(iris$Species,iris$Petal.Length),][1:3,]
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
23          4.6         3.6          1.0         0.2  setosa
14          4.3         3.0          1.1         0.1  setosa
15          5.8         4.0          1.2         0.2  setosa
# Sort by two columns one in a decreasing manner and the other in an increasing manner
iris[order(rev(iris$Species),iris$Petal.Length),][1:3,]
    Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
107          4.9         2.5          4.5         1.7 virginica
127          6.2         2.8          4.8         1.8 virginica
139          6.0         3.0          4.8         1.8 virginica
# Sort by one column
iris |>
  arrange(Species) |>
  head(3)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
# Sort by two columns
iris |>
  arrange(Species, Petal.Length) |>
  head(3)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          4.6         3.6          1.0         0.2  setosa
2          4.3         3.0          1.1         0.1  setosa
3          5.8         4.0          1.2         0.2  setosa
# Sort by two columns one in a decreasing manner and the other in an increasing manner
iris |>
  arrange(desc(Species), Petal.Length) |>
  head(3)
  Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
1          4.9         2.5          4.5         1.7 virginica
2          6.2         2.8          4.8         1.8 virginica
3          6.0         3.0          4.8         1.8 virginica
# Sort by one column
iris_dt[order(Species)][1:3,]
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
          <num>       <num>        <num>       <num>  <fctr>
1:          5.1         3.5          1.4         0.2  setosa
2:          4.9         3.0          1.4         0.2  setosa
3:          4.7         3.2          1.3         0.2  setosa
# Sort by two columns
iris_dt[order(Species,Petal.Length)][1:3,]
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
          <num>       <num>        <num>       <num>  <fctr>
1:          4.6         3.6          1.0         0.2  setosa
2:          4.3         3.0          1.1         0.1  setosa
3:          5.8         4.0          1.2         0.2  setosa
# Sort by two columns one in a decreasing manner and the other in an increasing manner
iris_dt[order(-Species,Petal.Length)][1:3,]
   Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
          <num>       <num>        <num>       <num>    <fctr>
1:          4.9         2.5          4.5         1.7 virginica
2:          6.2         2.8          4.8         1.8 virginica
3:          6.0         3.0          4.8         1.8 virginica

2.10.2 Keep unique rows

If you want to keep unique/distinct rows from a DataFrame, you can use the unique() method:

# With one column
pl$DataFrame(iris)$unique(
  subset = "Species"
)
shape: (3, 5)
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Sepal.Length ┆ Sepal.Width ┆ Petal.Length ┆ Petal.Width ┆ Species    β”‚
β”‚ ---          ┆ ---         ┆ ---          ┆ ---         ┆ ---        β”‚
β”‚ f64          ┆ f64         ┆ f64          ┆ f64         ┆ cat        β”‚
β•žβ•β•β•β•β•β•β•β•β•β•β•β•β•β•β•ͺ═════════════β•ͺ══════════════β•ͺ═════════════β•ͺ════════════║
β”‚ 6.3          ┆ 3.3         ┆ 6.0          ┆ 2.5         ┆ virginica  β”‚
β”‚ 7.0          ┆ 3.2         ┆ 4.7          ┆ 1.4         ┆ versicolor β”‚
β”‚ 5.1          ┆ 3.5         ┆ 1.4          ┆ 0.2         ┆ setosa     β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
# With one column keeping last entry
pl$DataFrame(iris)$unique(
  subset = "Species",
  keep = "last"
)
shape: (3, 5)
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Sepal.Length ┆ Sepal.Width ┆ Petal.Length ┆ Petal.Width ┆ Species    β”‚
β”‚ ---          ┆ ---         ┆ ---          ┆ ---         ┆ ---        β”‚
β”‚ f64          ┆ f64         ┆ f64          ┆ f64         ┆ cat        β”‚
β•žβ•β•β•β•β•β•β•β•β•β•β•β•β•β•β•ͺ═════════════β•ͺ══════════════β•ͺ═════════════β•ͺ════════════║
β”‚ 5.0          ┆ 3.3         ┆ 1.4          ┆ 0.2         ┆ setosa     β”‚
β”‚ 5.9          ┆ 3.0         ┆ 5.1          ┆ 1.8         ┆ virginica  β”‚
β”‚ 5.7          ┆ 2.8         ┆ 4.1          ┆ 1.3         ┆ versicolor β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
# With two colums, keeping last entry and maintaining the same order
pl$DataFrame(
  x = c(1L, 1:3, 3L),
  y = c(1L, 1:3, 3L),
  z = c(1L, 1:3, 4L)
)$unique(
  subset = c("x","y"),
  keep = "last",
  maintain_order = TRUE
)
shape: (3, 3)
β”Œβ”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”
β”‚ x   ┆ y   ┆ z   β”‚
β”‚ --- ┆ --- ┆ --- β”‚
β”‚ i32 ┆ i32 ┆ i32 β”‚
β•žβ•β•β•β•β•β•ͺ═════β•ͺ═════║
β”‚ 1   ┆ 1   ┆ 1   β”‚
β”‚ 2   ┆ 2   ┆ 2   β”‚
β”‚ 3   ┆ 3   ┆ 4   β”‚
β””β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”˜
# With one column
aggregate(. ~ Species, data = iris, FUN = head, N = 1)
     Species Sepal.Length.1 Sepal.Length.2 Sepal.Length.3 Sepal.Length.4
1     setosa            5.1            4.9            4.7            4.6
2 versicolor            7.0            6.4            6.9            5.5
3  virginica            6.3            5.8            7.1            6.3
  Sepal.Length.5 Sepal.Length.6 Sepal.Width.1 Sepal.Width.2 Sepal.Width.3
1            5.0            5.4           3.5           3.0           3.2
2            6.5            5.7           3.2           3.2           3.1
3            6.5            7.6           3.3           2.7           3.0
  Sepal.Width.4 Sepal.Width.5 Sepal.Width.6 Petal.Length.1 Petal.Length.2
1           3.1           3.6           3.9            1.4            1.4
2           2.3           2.8           2.8            4.7            4.5
3           2.9           3.0           3.0            6.0            5.1
  Petal.Length.3 Petal.Length.4 Petal.Length.5 Petal.Length.6 Petal.Width.1
1            1.3            1.5            1.4            1.7           0.2
2            4.9            4.0            4.6            4.5           1.4
3            5.9            5.6            5.8            6.6           2.5
  Petal.Width.2 Petal.Width.3 Petal.Width.4 Petal.Width.5 Petal.Width.6
1           0.2           0.2           0.2           0.2           0.4
2           1.5           1.5           1.3           1.5           1.3
3           1.9           2.1           1.8           2.2           2.1
# With one column keeping last entry
aggregate(. ~ Species, data = iris, FUN = tail, n = 1)
     Species Sepal.Length Sepal.Width Petal.Length Petal.Width
1     setosa          5.0         3.3          1.4         0.2
2 versicolor          5.7         2.8          4.1         1.3
3  virginica          5.9         3.0          5.1         1.8
# With two colums, keeping last entry and maintaining the same order
mytest <- data.frame(
  x = c(1L, 1:3, 3L),
  y = c(1L, 1:3, 3L),
  z = c(1L, 1:3, 4L)
) 
aggregate(. ~ x + y, data = mytest, FUN = tail, n = 1)
  x y z
1 1 1 1
2 2 2 2
3 3 3 4
# With one column
iris |>
  distinct(Species, .keep_all = TRUE)
  Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
1          5.1         3.5          1.4         0.2     setosa
2          7.0         3.2          4.7         1.4 versicolor
3          6.3         3.3          6.0         2.5  virginica
# With one column keeping last entry
iris |>
  group_by(Species) |>
  slice_tail() |>
  ungroup()
# A tibble: 3 Γ— 5
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species   
         <dbl>       <dbl>        <dbl>       <dbl> <fct>     
1          5           3.3          1.4         0.2 setosa    
2          5.7         2.8          4.1         1.3 versicolor
3          5.9         3            5.1         1.8 virginica 
# With two colums, keeping last entry and maintaining the same order
data.frame(
  x = c(1L, 1:3, 3L),
  y = c(1L, 1:3, 3L),
  z = c(1L, 1:3, 4L)
) |>
  group_by(x,y) |>
  slice_tail() |>
  ungroup()
# A tibble: 3 Γ— 3
      x     y     z
  <int> <int> <int>
1     1     1     1
2     2     2     2
3     3     3     4
# With one column
unique(iris_dt, by = "Species")
   Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
          <num>       <num>        <num>       <num>     <fctr>
1:          5.1         3.5          1.4         0.2     setosa
2:          7.0         3.2          4.7         1.4 versicolor
3:          6.3         3.3          6.0         2.5  virginica
# With one column keeping last entry
unique(iris_dt, by = "Species", fromLast = TRUE)
   Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
          <num>       <num>        <num>       <num>     <fctr>
1:          5.0         3.3          1.4         0.2     setosa
2:          5.7         2.8          4.1         1.3 versicolor
3:          5.9         3.0          5.1         1.8  virginica
# With two colums, keeping last entry and maintaining the same order
mytest_dt <- data.table(
  x = c(1L, 1:3, 3L),
  y = c(1L, 1:3, 3L),
  z = c(1L, 1:3, 4L)
) 
unique(mytest_dt, by = c("x","y"), fromLast = TRUE)
       x     y     z
   <int> <int> <int>
1:     1     1     1
2:     2     2     2
3:     3     3     4

2.10.3 Keep some columns from sorted DataFrame

If you want to keep only some columns from a sorted DataFrame, you can use the sort_by() method with the select() method.
In details, sort_by() can sort a column by the ordering of another column, or multiple other columns.
It’s the equivalent of order() method of R base.

pl$DataFrame(iris)$
  select(pl$col("Petal.Length")$
  sort_by("Petal.Width"))
shape: (150, 1)
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Petal.Length β”‚
β”‚ ---          β”‚
β”‚ f64          β”‚
β•žβ•β•β•β•β•β•β•β•β•β•β•β•β•β•β•‘
β”‚ 1.5          β”‚
β”‚ 1.4          β”‚
β”‚ 1.1          β”‚
β”‚ 1.5          β”‚
β”‚ 1.4          β”‚
β”‚ …            β”‚
β”‚ 5.6          β”‚
β”‚ 5.6          β”‚
β”‚ 6.0          β”‚
β”‚ 6.1          β”‚
β”‚ 5.7          β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
data(iris)
iris[order(iris$Petal.Width), "Petal.Length", drop = FALSE]
    Petal.Length
10           1.5
13           1.4
14           1.1
33           1.5
38           1.4
1            1.4
2            1.4
3            1.3
4            1.5
5            1.4
8            1.5
9            1.4
11           1.5
12           1.6
15           1.2
21           1.7
23           1.0
25           1.9
26           1.6
28           1.5
29           1.4
30           1.6
31           1.6
34           1.4
35           1.5
36           1.2
37           1.3
39           1.3
40           1.5
43           1.3
47           1.6
48           1.4
49           1.5
50           1.4
7            1.4
18           1.4
19           1.7
20           1.5
41           1.3
42           1.3
46           1.4
6            1.7
16           1.5
17           1.3
22           1.5
27           1.6
32           1.5
45           1.9
24           1.7
44           1.6
58           3.3
61           3.5
63           4.0
68           4.1
80           3.5
82           3.7
94           3.3
70           3.9
81           3.8
99           3.0
74           4.7
83           3.9
91           4.4
93           4.0
96           4.2
54           4.0
56           4.5
59           4.6
65           3.6
72           4.0
75           4.3
88           4.4
89           4.1
90           4.0
95           4.2
97           4.2
98           4.3
100          4.1
51           4.7
60           3.9
64           4.7
66           4.4
76           4.4
77           4.8
92           4.6
135          5.6
52           4.5
53           4.9
55           4.6
62           4.2
67           4.5
69           4.5
73           4.9
79           4.5
85           4.5
87           4.7
120          5.0
134          5.1
57           4.7
84           5.1
86           4.5
130          5.8
78           5.0
107          4.5
71           4.8
104          5.6
108          6.3
109          5.8
117          5.5
124          4.9
126          6.0
127          4.8
128          4.9
138          5.5
139          4.8
150          5.1
102          5.1
112          5.3
131          6.1
143          5.1
147          5.0
111          5.1
114          5.0
122          4.9
123          6.7
132          6.4
148          5.2
103          5.9
106          6.6
113          5.5
125          5.7
129          5.6
140          5.4
105          5.8
118          6.7
133          5.6
116          5.3
119          6.9
121          5.7
136          6.1
142          5.1
144          5.9
146          5.2
149          5.4
115          5.1
137          5.6
141          5.6
101          6.0
110          6.1
145          5.7
data(iris)
iris |>
  arrange(Petal.Width) |>
  select(Petal.Length)
    Petal.Length
1            1.5
2            1.4
3            1.1
4            1.5
5            1.4
6            1.4
7            1.4
8            1.3
9            1.5
10           1.4
11           1.5
12           1.4
13           1.5
14           1.6
15           1.2
16           1.7
17           1.0
18           1.9
19           1.6
20           1.5
21           1.4
22           1.6
23           1.6
24           1.4
25           1.5
26           1.2
27           1.3
28           1.3
29           1.5
30           1.3
31           1.6
32           1.4
33           1.5
34           1.4
35           1.4
36           1.4
37           1.7
38           1.5
39           1.3
40           1.3
41           1.4
42           1.7
43           1.5
44           1.3
45           1.5
46           1.6
47           1.5
48           1.9
49           1.7
50           1.6
51           3.3
52           3.5
53           4.0
54           4.1
55           3.5
56           3.7
57           3.3
58           3.9
59           3.8
60           3.0
61           4.7
62           3.9
63           4.4
64           4.0
65           4.2
66           4.0
67           4.5
68           4.6
69           3.6
70           4.0
71           4.3
72           4.4
73           4.1
74           4.0
75           4.2
76           4.2
77           4.3
78           4.1
79           4.7
80           3.9
81           4.7
82           4.4
83           4.4
84           4.8
85           4.6
86           5.6
87           4.5
88           4.9
89           4.6
90           4.2
91           4.5
92           4.5
93           4.9
94           4.5
95           4.5
96           4.7
97           5.0
98           5.1
99           4.7
100          5.1
101          4.5
102          5.8
103          5.0
104          4.5
105          4.8
106          5.6
107          6.3
108          5.8
109          5.5
110          4.9
111          6.0
112          4.8
113          4.9
114          5.5
115          4.8
116          5.1
117          5.1
118          5.3
119          6.1
120          5.1
121          5.0
122          5.1
123          5.0
124          4.9
125          6.7
126          6.4
127          5.2
128          5.9
129          6.6
130          5.5
131          5.7
132          5.6
133          5.4
134          5.8
135          6.7
136          5.6
137          5.3
138          6.9
139          5.7
140          6.1
141          5.1
142          5.9
143          5.2
144          5.4
145          5.1
146          5.6
147          5.6
148          6.0
149          6.1
150          5.7
data(iris)
iris_dt <- as.data.table(iris)
iris_dt[order(Petal.Width)][,.(Petal.Length)]
     Petal.Length
            <num>
  1:          1.5
  2:          1.4
  3:          1.1
  4:          1.5
  5:          1.4
 ---             
146:          5.6
147:          5.6
148:          6.0
149:          6.1
150:          5.7
Tip

If you want to use multiple columns/expressions, you can pass it in a list like this for example sort_by(list("Petal.Width","Sepal.Width")) or sort_by(list("Petal.Width", pl$col("Sepal.Width")))

2.11 Join DataFrames

To perform joins, the join() method must be used.

Multiple strategies are available:

  • "inner": returns row with matching keys in both frames. Non-matching rows in either the left or right frame are discarded.
  • "left": returns all rows in the left dataframe, whether or not a match in the right-frame is found. Non-matching rows have their right columns null-filled.
  • "outer": returns all rows from both the left and right dataframe. If no match is found in one frame, columns from the other frame are null-filled.
  • "semi": returns all rows from the left frame in which the join key is also present in the right frame.
  • "anti": returns all rows from the left frame in which the join key is not present in the right frame.
  • "cross": returns the cartesian product of all rows from the left frame with all rows from the right frame. Duplicates rows are retained. The table length of A cross-joined with B is always len(A) Γ— len(B).

The main arguments are: - on: name(s) of the join columns in both DataFrames.
- how: join strategy.
- suffix: suffix to append to columns with a duplicate name.

Let’s see a simple example:

# First, creation of colors_species Polars DataFrame
colors <- pl$DataFrame(
  Surname = pl$Series(c("toto","titi","tata")),
  Color = pl$Series(c("blue","red","yellow"))
)
Warning in pl$Series(c("toto", "titi", "tata")): `pl$Series()` will handle unnamed arguments differently as of 0.17.0:
- until 0.17.0, the first argument corresponds to the values and the second argument to the name of the Series.
- as of 0.17.0, the first argument will correspond to the name and the second argument to the values.
Use named arguments in `pl$Series()` or replace `pl$Series(<values>, <name>)` by `as_polars_series(<values>, <name>)` to silence this warning.
Warning in pl$Series(c("blue", "red", "yellow")): `pl$Series()` will handle unnamed arguments differently as of 0.17.0:
- until 0.17.0, the first argument corresponds to the values and the second argument to the name of the Series.
- as of 0.17.0, the first argument will correspond to the name and the second argument to the values.
Use named arguments in `pl$Series()` or replace `pl$Series(<values>, <name>)` by `as_polars_series(<values>, <name>)` to silence this warning.
values <- pl$DataFrame(
  Surname = pl$Series(c("toto","titi","tata")),
  value = pl$Series(c(10,20,30))
)
Warning in pl$Series(c("toto", "titi", "tata")): `pl$Series()` will handle unnamed arguments differently as of 0.17.0:
- until 0.17.0, the first argument corresponds to the values and the second argument to the name of the Series.
- as of 0.17.0, the first argument will correspond to the name and the second argument to the values.
Use named arguments in `pl$Series()` or replace `pl$Series(<values>, <name>)` by `as_polars_series(<values>, <name>)` to silence this warning.
Warning in pl$Series(c(10, 20, 30)): `pl$Series()` will handle unnamed arguments differently as of 0.17.0:
- until 0.17.0, the first argument corresponds to the values and the second argument to the name of the Series.
- as of 0.17.0, the first argument will correspond to the name and the second argument to the values.
Use named arguments in `pl$Series()` or replace `pl$Series(<values>, <name>)` by `as_polars_series(<values>, <name>)` to silence this warning.
# Let's join !
colors$join(
  other = values,
  on = "Surname",
  how = "left"
)
shape: (3, 3)
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Surname ┆ Color  ┆ value β”‚
β”‚ ---     ┆ ---    ┆ ---   β”‚
β”‚ str     ┆ str    ┆ f64   β”‚
β•žβ•β•β•β•β•β•β•β•β•β•ͺ════════β•ͺ═══════║
β”‚ toto    ┆ blue   ┆ 10.0  β”‚
β”‚ titi    ┆ red    ┆ 20.0  β”‚
β”‚ tata    ┆ yellow ┆ 30.0  β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”˜
colors_df <- data.frame(
  Surname = c("toto","titi","tata"),
  Color = c("blue","red","yellow")
)
values_df <- data.frame(
  Surname = c("toto","titi","tata"),
  value = c(10,20,30)
)
merge(colors_df, values_df, by = "Surname", all.x = TRUE)
  Surname  Color value
1    tata yellow    30
2    titi    red    20
3    toto   blue    10
colors_df |>
  left_join(values_df,
            by = "Surname") 
  Surname  Color value
1    toto   blue    10
2    titi    red    20
3    tata yellow    30
merge(as.data.table(colors_df), 
      as.data.table(values_df), 
      by = "Surname", all.x = TRUE)
Key: <Surname>
   Surname  Color value
    <char> <char> <num>
1:    tata yellow    30
2:    titi    red    20
3:    toto   blue    10

If join columns have different names in both DataFrames, you can use arguments "left_on" and "right_on".
Here’s an example:

values2 <- values$rename(Surname2 = "Surname")

colors$join(
  other = values2,
  left_on = "Surname",
  right_on = "Surname2",
  how = "left"
)
shape: (3, 3)
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Surname ┆ Color  ┆ value β”‚
β”‚ ---     ┆ ---    ┆ ---   β”‚
β”‚ str     ┆ str    ┆ f64   β”‚
β•žβ•β•β•β•β•β•β•β•β•β•ͺ════════β•ͺ═══════║
β”‚ toto    ┆ blue   ┆ 10.0  β”‚
β”‚ titi    ┆ red    ┆ 20.0  β”‚
β”‚ tata    ┆ yellow ┆ 30.0  β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”˜
Tip

When we join on integer columns that are sorted polars uses a fast-track algorithm.

To use the fast-track algorithm polars needs to know the join columns are sorted. See the tip in this section for an example with filter() method.

Polars can’t use the fast-track algorithm for joining string columns as the algorithm works on integers.

=> To use the fast-track algorithm the string column must be cast to categorical dtype. See an example here.

2.12 Concatenate DataFrames

To perform concatenations, the concat() method must be used.
In this section, we will see vertical, horizontal and diagonal concatenation.

Before we need to create some datasets to run our examples. Click to expand it! πŸ‘‡

Code
dfup <- pl$DataFrame(
  col1 = pl$Series(c("a")),
  col2 = pl$Series(c("b")),
  col3 = pl$Series(c("c"))
)
Warning in pl$Series(c("a")): `pl$Series()` will handle unnamed arguments differently as of 0.17.0:
- until 0.17.0, the first argument corresponds to the values and the second argument to the name of the Series.
- as of 0.17.0, the first argument will correspond to the name and the second argument to the values.
Use named arguments in `pl$Series()` or replace `pl$Series(<values>, <name>)` by `as_polars_series(<values>, <name>)` to silence this warning.
Warning in pl$Series(c("b")): `pl$Series()` will handle unnamed arguments differently as of 0.17.0:
- until 0.17.0, the first argument corresponds to the values and the second argument to the name of the Series.
- as of 0.17.0, the first argument will correspond to the name and the second argument to the values.
Use named arguments in `pl$Series()` or replace `pl$Series(<values>, <name>)` by `as_polars_series(<values>, <name>)` to silence this warning.
Warning in pl$Series(c("c")): `pl$Series()` will handle unnamed arguments differently as of 0.17.0:
- until 0.17.0, the first argument corresponds to the values and the second argument to the name of the Series.
- as of 0.17.0, the first argument will correspond to the name and the second argument to the values.
Use named arguments in `pl$Series()` or replace `pl$Series(<values>, <name>)` by `as_polars_series(<values>, <name>)` to silence this warning.
Code
dfdown <- pl$DataFrame(
  col1 = pl$Series(c("x")),
  col2 = pl$Series(c("y")),
  col3 = pl$Series(c("z"))
)
Warning in pl$Series(c("x")): `pl$Series()` will handle unnamed arguments differently as of 0.17.0:
- until 0.17.0, the first argument corresponds to the values and the second argument to the name of the Series.
- as of 0.17.0, the first argument will correspond to the name and the second argument to the values.
Use named arguments in `pl$Series()` or replace `pl$Series(<values>, <name>)` by `as_polars_series(<values>, <name>)` to silence this warning.
Warning in pl$Series(c("y")): `pl$Series()` will handle unnamed arguments differently as of 0.17.0:
- until 0.17.0, the first argument corresponds to the values and the second argument to the name of the Series.
- as of 0.17.0, the first argument will correspond to the name and the second argument to the values.
Use named arguments in `pl$Series()` or replace `pl$Series(<values>, <name>)` by `as_polars_series(<values>, <name>)` to silence this warning.
Warning in pl$Series(c("z")): `pl$Series()` will handle unnamed arguments differently as of 0.17.0:
- until 0.17.0, the first argument corresponds to the values and the second argument to the name of the Series.
- as of 0.17.0, the first argument will correspond to the name and the second argument to the values.
Use named arguments in `pl$Series()` or replace `pl$Series(<values>, <name>)` by `as_polars_series(<values>, <name>)` to silence this warning.
Code
dfleft <- pl$DataFrame(col1 = pl$Series(c("a","b","c")))
Warning in pl$Series(c("a", "b", "c")): `pl$Series()` will handle unnamed arguments differently as of 0.17.0:
- until 0.17.0, the first argument corresponds to the values and the second argument to the name of the Series.
- as of 0.17.0, the first argument will correspond to the name and the second argument to the values.
Use named arguments in `pl$Series()` or replace `pl$Series(<values>, <name>)` by `as_polars_series(<values>, <name>)` to silence this warning.
Code
dfright <- pl$DataFrame(col2 = pl$Series(c("x","y","z")))
Warning in pl$Series(c("x", "y", "z")): `pl$Series()` will handle unnamed arguments differently as of 0.17.0:
- until 0.17.0, the first argument corresponds to the values and the second argument to the name of the Series.
- as of 0.17.0, the first argument will correspond to the name and the second argument to the values.
Use named arguments in `pl$Series()` or replace `pl$Series(<values>, <name>)` by `as_polars_series(<values>, <name>)` to silence this warning.
Code
dfup_df <- dfup$to_data_frame()
dfdown_df <- dfdown$to_data_frame()
dfleft_df <- dfleft$to_data_frame()
dfright_df <- dfright$to_data_frame()

dfup_dt <- as.data.table(dfup)
dfdown_dt <- as.data.table(dfdown)
dfleft_dt <- as.data.table(dfleft)
dfright_dt <- as.data.table(dfright)

2.12.1 Vertical concatenation

To concatenate multiple DataFrames vertically (=by row), you can use the concat() method and the argument how = "vertical".

pl$concat(list(dfup,dfdown),how = "vertical")
shape: (2, 3)
β”Œβ”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”
β”‚ col1 ┆ col2 ┆ col3 β”‚
β”‚ ---  ┆ ---  ┆ ---  β”‚
β”‚ str  ┆ str  ┆ str  β”‚
β•žβ•β•β•β•β•β•β•ͺ══════β•ͺ══════║
β”‚ a    ┆ b    ┆ c    β”‚
β”‚ x    ┆ y    ┆ z    β”‚
β””β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”˜
rbind(dfup_df,dfdown_df)
  col1 col2 col3
1    a    b    c
2    x    y    z
bind_rows(list(dfup_df,dfdown_df))
  col1 col2 col3
1    a    b    c
2    x    y    z
rbindlist(list(dfup_dt, dfdown_dt))
     col1   col2   col3
   <char> <char> <char>
1:      a      b      c
2:      x      y      z

2.12.2 Horizontal concatenation

To concatenate multiple DataFrames horizontally (=by col), you can use the concat() method and the argument how = "horizontal".

pl$concat(list(dfleft,dfright),how = "horizontal")
shape: (3, 2)
β”Œβ”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”
β”‚ col1 ┆ col2 β”‚
β”‚ ---  ┆ ---  β”‚
β”‚ str  ┆ str  β”‚
β•žβ•β•β•β•β•β•β•ͺ══════║
β”‚ a    ┆ x    β”‚
β”‚ b    ┆ y    β”‚
β”‚ c    ┆ z    β”‚
β””β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”˜
cbind(dfleft_df,dfright_df)
  col1 col2
1    a    x
2    b    y
3    c    z
cbind(list(dfleft_df,dfright_df))
     [,1]        
[1,] data.frame,1
[2,] data.frame,1
data.table(dfleft_dt, dfright_dt)
     col1   col2
   <char> <char>
1:      a      x
2:      b      y
3:      c      z

2.12.3 Diagonal concatenation

To concatenate multiple DataFrames diagonally, you can use the concat() method and the argument how = "diagonal".
Diagonal concatenation is useful when the column names are not identical in initial DataFrames.

dfup <- dfup$rename(col4="col3")
pl$concat(list(dfup,dfdown),how = "diagonal")
shape: (2, 4)
β”Œβ”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”
β”‚ col1 ┆ col2 ┆ col4 ┆ col3 β”‚
β”‚ ---  ┆ ---  ┆ ---  ┆ ---  β”‚
β”‚ str  ┆ str  ┆ str  ┆ str  β”‚
β•žβ•β•β•β•β•β•β•ͺ══════β•ͺ══════β•ͺ══════║
β”‚ a    ┆ b    ┆ c    ┆ null β”‚
β”‚ x    ┆ y    ┆ null ┆ z    β”‚
β””β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”˜
dfup_df <- dfup$to_data_frame()
dfup_df[setdiff(names(dfdown_df), names(dfup_df))] <- NA
dfdown_df[setdiff(names(dfup_df), names(dfdown_df))] <- NA
rbind(dfup_df,dfdown_df)
  col1 col2 col4 col3
1    a    b    c <NA>
2    x    y <NA>    z
dfup_df <- dfup$to_data_frame()
bind_rows(list(dfup_df,dfdown_df))
  col1 col2 col4 col3
1    a    b    c <NA>
2    x    y <NA>    z
dfup_dt <- as.data.table(dfup)
rbindlist(list(dfup_dt, dfdown_dt),fill = TRUE)
     col1   col2   col4   col3
   <char> <char> <char> <char>
1:      a      b      c   <NA>
2:      x      y   <NA>      z

2.13 Pivot a DataFrame

2.13.1 From long to wide

The pivot() method can be used to pivot a DataFrame from long to wide.

Let’s go for a first example :

df <- data.frame(
  country = c(rep("France",3),rep("Italy","3")),
  city = c("Paris", "Lille", "Nice", "Roma", "Milan", "Napoli"),
  location = c("North","North","South","South","North","South"),
  population = c(2.1, 0.2, 0.4, 2.8, 1.4, 3.0)
)
df
  country   city location population
1  France  Paris    North        2.1
2  France  Lille    North        0.2
3  France   Nice    South        0.4
4   Italy   Roma    South        2.8
5   Italy  Milan    North        1.4
6   Italy Napoli    South        3.0
pl$DataFrame(df)$pivot(
  index = "country", 
  columns = "city",
  values = "population", 
)
shape: (2, 7)
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ country ┆ Paris ┆ Lille ┆ Nice ┆ Roma ┆ Milan ┆ Napoli β”‚
β”‚ ---     ┆ ---   ┆ ---   ┆ ---  ┆ ---  ┆ ---   ┆ ---    β”‚
β”‚ str     ┆ f64   ┆ f64   ┆ f64  ┆ f64  ┆ f64   ┆ f64    β”‚
β•žβ•β•β•β•β•β•β•β•β•β•ͺ═══════β•ͺ═══════β•ͺ══════β•ͺ══════β•ͺ═══════β•ͺ════════║
β”‚ France  ┆ 2.1   ┆ 0.2   ┆ 0.4  ┆ null ┆ null  ┆ null   β”‚
β”‚ Italy   ┆ null  ┆ null  ┆ null ┆ 2.8  ┆ 1.4   ┆ 3.0    β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”˜
reshape(df[,-which(names(df) %in% c("location"))], 
        idvar = "country", 
        timevar = "city", 
        direction = "wide")
  country population.Paris population.Lille population.Nice population.Roma
1  France              2.1              0.2             0.4              NA
4   Italy               NA               NA              NA             2.8
  population.Milan population.Napoli
1               NA                NA
4              1.4                 3
df |>
  pivot_wider(
    id_cols = country,
    names_from = city,
    values_from = population
  )
# A tibble: 2 Γ— 7
  country Paris Lille  Nice  Roma Milan Napoli
  <chr>   <dbl> <dbl> <dbl> <dbl> <dbl>  <dbl>
1 France    2.1   0.2   0.4  NA    NA       NA
2 Italy    NA    NA    NA     2.8   1.4      3
df_dt <- as.data.table(df)
dcast(df_dt, country ~ city, value.var = "population")
Key: <country>
   country Lille Milan Napoli  Nice Paris  Roma
    <char> <num> <num>  <num> <num> <num> <num>
1:  France   0.2    NA     NA   0.4   2.1    NA
2:   Italy    NA   1.4      3    NA    NA   2.8

You can also aggregate the results using a function that you enter in the argument aggregate_function in pivot() method.

In this case, the aggregate_function argument of pivot() is the equivalent of values_fn of pivot_wider() from {tidyr} and fun.aggregate of dcast() from {data.table}.

pl$DataFrame(df)$pivot(
  index = "country", 
  columns = "location",
  values = "population",
  aggregate_function = "mean"
)
shape: (2, 3)
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”
β”‚ country ┆ North ┆ South β”‚
β”‚ ---     ┆ ---   ┆ ---   β”‚
β”‚ str     ┆ f64   ┆ f64   β”‚
β•žβ•β•β•β•β•β•β•β•β•β•ͺ═══════β•ͺ═══════║
β”‚ France  ┆ 1.15  ┆ 0.4   β”‚
β”‚ Italy   ┆ 1.4   ┆ 2.9   β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”˜
df_summary <- aggregate(population ~ country + location, data = df, FUN = mean)
df_final <- reshape(df_summary, idvar = "country", timevar = "location", direction = "wide")
colnames(df_final) <- c("country", "North", "South")
df_final
  country North South
1  France  1.15   0.4
2   Italy  1.40   2.9
df |>
  pivot_wider(id_cols = country,
              names_from = location, 
              values_from = population, 
              values_fn = mean)
# A tibble: 2 Γ— 3
  country North South
  <chr>   <dbl> <dbl>
1 France   1.15   0.4
2 Italy    1.4    2.9
df_dt <- as.data.table(df)
dcast(df_dt, country ~ location, value.var = "population", fun.aggregate = mean)
Key: <country>
   country North South
    <char> <num> <num>
1:  France  1.15   0.4
2:   Italy  1.40   2.9

However with {polars}, we can also run an expression as an aggregation function.
With {tidyr} and {data.table}, you need to calculate this in advance.
For example:

pl$DataFrame(df)$pivot(
  index = "country", 
  columns = "location",
  values = "population",
  aggregate_function = pl$element()$sum()$sqrt()
)
shape: (2, 3)
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ country ┆ North    ┆ South    β”‚
β”‚ ---     ┆ ---      ┆ ---      β”‚
β”‚ str     ┆ f64      ┆ f64      β”‚
β•žβ•β•β•β•β•β•β•β•β•β•ͺ══════════β•ͺ══════════║
β”‚ France  ┆ 1.516575 ┆ 0.632456 β”‚
β”‚ Italy   ┆ 1.183216 ┆ 2.408319 β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
df_summarized <- aggregate(population ~ country + location, df, FUN = function(x) sqrt(sum(x)))
df_final <- reshape(df_summarized, idvar = "country", timevar = "location", direction = "wide")
colnames(df_final) <- c("country", "North", "South")
df_final
  country    North     South
1  France 1.516575 0.6324555
2   Italy 1.183216 2.4083189
df |>
  group_by(country, location) |>
  summarise(population_sum = sqrt(sum(population))) |>
  pivot_wider(names_from = location, values_from = population_sum)
# A tibble: 2 Γ— 3
# Groups:   country [2]
  country North South
  <chr>   <dbl> <dbl>
1 France   1.52 0.632
2 Italy    1.18 2.41 
dt_final <- df_dt[, .(population_sum = sqrt(sum(population))), by = .(country, location)]
dcast(dt_final, country ~ location, value.var = "population_sum")
Key: <country>
   country    North     South
    <char>    <num>     <num>
1:  France 1.516575 0.6324555
2:   Italy 1.183216 2.4083189

2.13.2 From wide to long

The melt() method can be used to pivot a DataFrame from wide to long.

Let’s see with an example :

df <- data.frame(
  country = c("France","Italy"),
  North = c(1.1,1.4),
  South = c(0.4,2.9)
)
pl$DataFrame(df)$melt(
  id_vars = "country",
  value_vars = c("North","South")
)
shape: (4, 3)
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”
β”‚ country ┆ variable ┆ value β”‚
β”‚ ---     ┆ ---      ┆ ---   β”‚
β”‚ str     ┆ str      ┆ f64   β”‚
β•žβ•β•β•β•β•β•β•β•β•β•ͺ══════════β•ͺ═══════║
β”‚ France  ┆ North    ┆ 1.1   β”‚
β”‚ Italy   ┆ North    ┆ 1.4   β”‚
β”‚ France  ┆ South    ┆ 0.4   β”‚
β”‚ Italy   ┆ South    ┆ 2.9   β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”˜
melted_df  <- reshape(df, 
        varying = c("North", "South"), 
        v.names = "value", 
        idvar = "country", 
        times = c("North", "South"), 
        timevar = "variable", 
        direction = "long")
rownames(melted_df) <- NULL
melted_df
  country variable value
1  France    North   1.1
2   Italy    North   1.4
3  France    South   0.4
4   Italy    South   2.9
df |>
  pivot_longer(
    !country,
    names_to = "variable",
    values_to = "value"
  )
# A tibble: 4 Γ— 3
  country variable value
  <chr>   <chr>    <dbl>
1 France  North      1.1
2 France  South      0.4
3 Italy   North      1.4
4 Italy   South      2.9
df_dt <- as.data.table(df)
melt(df_dt, id.vars = "country", variable.name = "variable", value.name = "value")
   country variable value
    <char>   <fctr> <num>
1:  France    North   1.1
2:   Italy    North   1.4
3:  France    South   0.4
4:   Italy    South   2.9

2.14 Dealing with missing values

We have already introduced missing values in here. In this section, we will go further and understand how to deal with missing values with polars and R.

2.14.1 Check if Series has missing values

The is_null() and is_not_null() methods can be used to check if Series has missing values.
Theses methods are the equivalent of is.na() and !is.na() of R base.

Let’s see two examples combining thes methods with select() and filter() methods:

mydfNA <- pl$DataFrame(
  colA = pl$Series(c("a",NA,"c")),
  colB = pl$Series(c("d",NA,NA))
)
Warning in pl$Series(c("a", NA, "c")): `pl$Series()` will handle unnamed arguments differently as of 0.17.0:
- until 0.17.0, the first argument corresponds to the values and the second argument to the name of the Series.
- as of 0.17.0, the first argument will correspond to the name and the second argument to the values.
Use named arguments in `pl$Series()` or replace `pl$Series(<values>, <name>)` by `as_polars_series(<values>, <name>)` to silence this warning.
Warning in pl$Series(c("d", NA, NA)): `pl$Series()` will handle unnamed arguments differently as of 0.17.0:
- until 0.17.0, the first argument corresponds to the values and the second argument to the name of the Series.
- as of 0.17.0, the first argument will correspond to the name and the second argument to the values.
Use named arguments in `pl$Series()` or replace `pl$Series(<values>, <name>)` by `as_polars_series(<values>, <name>)` to silence this warning.
# Find null values
mydfNA$
  select(
    pl$col("colA"),
    pl$col("colA")$is_null()$alias("is_null"),
    pl$col("colA")$is_not_null()$alias("is_not_null")
  )
shape: (3, 3)
β”Œβ”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ colA ┆ is_null ┆ is_not_null β”‚
β”‚ ---  ┆ ---     ┆ ---         β”‚
β”‚ str  ┆ bool    ┆ bool        β”‚
β•žβ•β•β•β•β•β•β•ͺ═════════β•ͺ═════════════║
β”‚ a    ┆ false   ┆ true        β”‚
β”‚ null ┆ true    ┆ false       β”‚
β”‚ c    ┆ false   ┆ true        β”‚
β””β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
# Filter by null values
mydfNA$
  filter(
    pl$col("colA")$is_not_null() & pl$col("colB")$is_not_null()
  )
shape: (1, 2)
β”Œβ”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”
β”‚ colA ┆ colB β”‚
β”‚ ---  ┆ ---  β”‚
β”‚ str  ┆ str  β”‚
β•žβ•β•β•β•β•β•β•ͺ══════║
β”‚ a    ┆ d    β”‚
β””β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”˜
mydfNA2 <- mydfNA$to_data_frame()

# Find null values
data.frame(
  colA = mydfNA2$colA,
  is_null = is.na(mydfNA2$colA),
  is_not_null = !is.na(mydfNA2$colA)
)
  colA is_null is_not_null
1    a   FALSE        TRUE
2 <NA>    TRUE       FALSE
3    c   FALSE        TRUE
# Filter by null values
mydfNA2[!is.na(mydfNA2$colA) & !is.na(mydfNA2$colB), ]
  colA colB
1    a    d

2.14.2 Replace missing values with a constant

The fill_null() method can be used to replace missing values in a Series with a constant.

Here’s some examples where I replace missing values from all columns of a DataFrame:

mydfNA <- pl$DataFrame(
  colA = pl$Series(c("a",NA,"c")),
  colB = pl$Series(c("d",NA,NA))
)
Warning in pl$Series(c("a", NA, "c")): `pl$Series()` will handle unnamed arguments differently as of 0.17.0:
- until 0.17.0, the first argument corresponds to the values and the second argument to the name of the Series.
- as of 0.17.0, the first argument will correspond to the name and the second argument to the values.
Use named arguments in `pl$Series()` or replace `pl$Series(<values>, <name>)` by `as_polars_series(<values>, <name>)` to silence this warning.
Warning in pl$Series(c("d", NA, NA)): `pl$Series()` will handle unnamed arguments differently as of 0.17.0:
- until 0.17.0, the first argument corresponds to the values and the second argument to the name of the Series.
- as of 0.17.0, the first argument will correspond to the name and the second argument to the values.
Use named arguments in `pl$Series()` or replace `pl$Series(<values>, <name>)` by `as_polars_series(<values>, <name>)` to silence this warning.
# In the same columns
mydfNA$
  with_columns(
  pl$all()$fill_null("missing")
)
shape: (3, 2)
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ colA    ┆ colB    β”‚
β”‚ ---     ┆ ---     β”‚
β”‚ str     ┆ str     β”‚
β•žβ•β•β•β•β•β•β•β•β•β•ͺ═════════║
β”‚ a       ┆ d       β”‚
β”‚ missing ┆ missing β”‚
β”‚ c       ┆ missing β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
# In new columns suffixed by "_corrected"
mydfNA$
  with_columns(
  pl$all()$fill_null("missing")$name$suffix("_corrected")
)
shape: (3, 4)
β”Œβ”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ colA ┆ colB ┆ colA_corrected ┆ colB_corrected β”‚
β”‚ ---  ┆ ---  ┆ ---            ┆ ---            β”‚
β”‚ str  ┆ str  ┆ str            ┆ str            β”‚
β•žβ•β•β•β•β•β•β•ͺ══════β•ͺ════════════════β•ͺ════════════════║
β”‚ a    ┆ d    ┆ a              ┆ d              β”‚
β”‚ null ┆ null ┆ missing        ┆ missing        β”‚
β”‚ c    ┆ null ┆ c              ┆ missing        β”‚
β””β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
mydfNA2 <- mydfNA$to_data_frame()

# In the same columns
mydfNA2[is.na(mydfNA2)] <- "missing"
mydfNA2
     colA    colB
1       a       d
2 missing missing
3       c missing
# In new columns suffixed by "_corrected"
mydfNA2 <- mydfNA$to_data_frame()

transform(mydfNA2,
          colA_corrected = ifelse(is.na(colA), "missing", colA),
          colB_corrected = ifelse(is.na(colB), "missing", colB))
  colA colB colA_corrected colB_corrected
1    a    d              a              d
2 <NA> <NA>        missing        missing
3    c <NA>              c        missing
mydfNA2 <- mydfNA$to_data_frame()

# In the same columns
mydfNA2 %>%
  mutate(across(everything(), ~ifelse(is.na(.), "missing", .)))
     colA    colB
1       a       d
2 missing missing
3       c missing
# In new columns suffixed by "_corrected"
mydfNA2 %>%
  mutate(across(c(colA, colB), ~ifelse(is.na(.), "missing", .), .names = "{col}_corrected"))
  colA colB colA_corrected colB_corrected
1    a    d              a              d
2 <NA> <NA>        missing        missing
3    c <NA>              c        missing
mydfNA2 <- mydfNA$to_data_frame()
mydfNA2_dt <- as.data.table(mydfNA2)

# In the same columns
mydfNA2_dt[is.na(mydfNA2_dt)] <- "missing"
mydfNA2_dt
      colA    colB
    <char>  <char>
1:       a       d
2: missing missing
3:       c missing
# In new columns suffixed by "_corrected"
mydfNA2_dt[,
           c("colA_corrected", "colB_corrected") := lapply(.SD, function(x) ifelse(is.na(x), "missing", x)),
           .SDcols = c("colA", "colB")]
mydfNA2_dt
      colA    colB colA_corrected colB_corrected
    <char>  <char>         <char>         <char>
1:       a       d              a              d
2: missing missing        missing        missing
3:       c missing              c        missing
Important

Be careful, the fill_null() method can in some cases modify data types like cast(). This can happen, for example, when you’re working on an integer column and you want to replace the missing values with a string => the column will then has a string dtype!

2.14.3 Replace missing values with a strategy

The fill_null() method of polars has a strategy argument for replacing missing values:

  • forward: replace with the previous non-null value in the Series
  • backward: replace with the next non-null value in the Series
  • min: replace with the smallest value in the Series
  • max: replace with the largest value in the Series
  • mean: replace with the median value in the Series
  • zero: replace with 0
  • one: replace with 1
Note

We can set a limit on how many rows to fill-forward or backward with limit

Here’s some examples :

# In the same columns
mydfNA <- pl$DataFrame(
  colA = pl$Series(c("a",NA,"c")),
  colB = pl$Series(c("d",NA,NA)),
  colC = pl$Series(c(1,NA,3))
)
Warning in pl$Series(c("a", NA, "c")): `pl$Series()` will handle unnamed arguments differently as of 0.17.0:
- until 0.17.0, the first argument corresponds to the values and the second argument to the name of the Series.
- as of 0.17.0, the first argument will correspond to the name and the second argument to the values.
Use named arguments in `pl$Series()` or replace `pl$Series(<values>, <name>)` by `as_polars_series(<values>, <name>)` to silence this warning.
Warning in pl$Series(c("d", NA, NA)): `pl$Series()` will handle unnamed arguments differently as of 0.17.0:
- until 0.17.0, the first argument corresponds to the values and the second argument to the name of the Series.
- as of 0.17.0, the first argument will correspond to the name and the second argument to the values.
Use named arguments in `pl$Series()` or replace `pl$Series(<values>, <name>)` by `as_polars_series(<values>, <name>)` to silence this warning.
Warning in pl$Series(c(1, NA, 3)): `pl$Series()` will handle unnamed arguments differently as of 0.17.0:
- until 0.17.0, the first argument corresponds to the values and the second argument to the name of the Series.
- as of 0.17.0, the first argument will correspond to the name and the second argument to the values.
Use named arguments in `pl$Series()` or replace `pl$Series(<values>, <name>)` by `as_polars_series(<values>, <name>)` to silence this warning.
# With forward strategy
mydfNA$
  with_columns(
    pl$all()$fill_null(strategy = "forward")$name$suffix("_corrected")
  )
shape: (3, 6)
β”Œβ”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ colA ┆ colB ┆ colC ┆ colA_corrected ┆ colB_corrected ┆ colC_corrected β”‚
β”‚ ---  ┆ ---  ┆ ---  ┆ ---            ┆ ---            ┆ ---            β”‚
β”‚ str  ┆ str  ┆ f64  ┆ str            ┆ str            ┆ f64            β”‚
β•žβ•β•β•β•β•β•β•ͺ══════β•ͺ══════β•ͺ════════════════β•ͺ════════════════β•ͺ════════════════║
β”‚ a    ┆ d    ┆ 1.0  ┆ a              ┆ d              ┆ 1.0            β”‚
β”‚ null ┆ null ┆ null ┆ a              ┆ d              ┆ 1.0            β”‚
β”‚ c    ┆ null ┆ 3.0  ┆ c              ┆ d              ┆ 3.0            β”‚
β””β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
# With forward strategy and a limit
mydfNA$
  with_columns(
    pl$all()$fill_null(strategy = "forward", limit = 1)$name$suffix("_corrected")
  )
shape: (3, 6)
β”Œβ”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ colA ┆ colB ┆ colC ┆ colA_corrected ┆ colB_corrected ┆ colC_corrected β”‚
β”‚ ---  ┆ ---  ┆ ---  ┆ ---            ┆ ---            ┆ ---            β”‚
β”‚ str  ┆ str  ┆ f64  ┆ str            ┆ str            ┆ f64            β”‚
β•žβ•β•β•β•β•β•β•ͺ══════β•ͺ══════β•ͺ════════════════β•ͺ════════════════β•ͺ════════════════║
β”‚ a    ┆ d    ┆ 1.0  ┆ a              ┆ d              ┆ 1.0            β”‚
β”‚ null ┆ null ┆ null ┆ a              ┆ d              ┆ 1.0            β”‚
β”‚ c    ┆ null ┆ 3.0  ┆ c              ┆ null           ┆ 3.0            β”‚
β””β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
# With backward strategy
mydfNA$
  with_columns(
    pl$all()$fill_null(strategy = "backward")$name$suffix("_corrected")
  )
shape: (3, 6)
β”Œβ”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ colA ┆ colB ┆ colC ┆ colA_corrected ┆ colB_corrected ┆ colC_corrected β”‚
β”‚ ---  ┆ ---  ┆ ---  ┆ ---            ┆ ---            ┆ ---            β”‚
β”‚ str  ┆ str  ┆ f64  ┆ str            ┆ str            ┆ f64            β”‚
β•žβ•β•β•β•β•β•β•ͺ══════β•ͺ══════β•ͺ════════════════β•ͺ════════════════β•ͺ════════════════║
β”‚ a    ┆ d    ┆ 1.0  ┆ a              ┆ d              ┆ 1.0            β”‚
β”‚ null ┆ null ┆ null ┆ c              ┆ null           ┆ 3.0            β”‚
β”‚ c    ┆ null ┆ 3.0  ┆ c              ┆ null           ┆ 3.0            β”‚
β””β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
# With mean strategy only on "colC" column
mydfNA$
  with_columns(
    pl$col("colC")$fill_null(strategy = "mean")$name$suffix("_corrected")
  )
shape: (3, 4)
β”Œβ”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ colA ┆ colB ┆ colC ┆ colC_corrected β”‚
β”‚ ---  ┆ ---  ┆ ---  ┆ ---            β”‚
β”‚ str  ┆ str  ┆ f64  ┆ f64            β”‚
β•žβ•β•β•β•β•β•β•ͺ══════β•ͺ══════β•ͺ════════════════║
β”‚ a    ┆ d    ┆ 1.0  ┆ 1.0            β”‚
β”‚ null ┆ null ┆ null ┆ 2.0            β”‚
β”‚ c    ┆ null ┆ 3.0  ┆ 3.0            β”‚
β””β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

2.14.4 Replace missing values with an expression

Of course, you are not limited to the built-in strategies of polars => with fill_null() you can also use expressions to replace missing values. It works with expression from the same column and from another column.

Here’s some examples:

# Replace missing values with the mean of the non-null values for that column
mydfNA$
  with_columns(
    pl$col("colC")$fill_null(pl$mean("colC"))$name$suffix("_corrected")
  )
shape: (3, 4)
β”Œβ”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ colA ┆ colB ┆ colC ┆ colC_corrected β”‚
β”‚ ---  ┆ ---  ┆ ---  ┆ ---            β”‚
β”‚ str  ┆ str  ┆ f64  ┆ f64            β”‚
β•žβ•β•β•β•β•β•β•ͺ══════β•ͺ══════β•ͺ════════════════║
β”‚ a    ┆ d    ┆ 1.0  ┆ 1.0            β”‚
β”‚ null ┆ null ┆ null ┆ 2.0            β”‚
β”‚ c    ┆ null ┆ 3.0  ┆ 3.0            β”‚
β””β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
# Replace missing values with the values from another column
mydfNA$
  with_columns(
    pl$col("colB")$fill_null(pl$col("colA"))$name$suffix("_corrected")
  )
shape: (3, 4)
β”Œβ”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ colA ┆ colB ┆ colC ┆ colB_corrected β”‚
β”‚ ---  ┆ ---  ┆ ---  ┆ ---            β”‚
β”‚ str  ┆ str  ┆ f64  ┆ str            β”‚
β•žβ•β•β•β•β•β•β•ͺ══════β•ͺ══════β•ͺ════════════════║
β”‚ a    ┆ d    ┆ 1.0  ┆ d              β”‚
β”‚ null ┆ null ┆ null ┆ null           β”‚
β”‚ c    ┆ null ┆ 3.0  ┆ c              β”‚
β””β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
mydfNA2 <- mydfNA$to_data_frame()

# Replace missing values with the mean of the non-null values for that column
mydfNA2$colC_corrected <- ifelse(is.na(mydfNA2$colC), mean(mydfNA2$colC, na.rm = TRUE), mydfNA2$colC)

mydfNA2 <- mydfNA$to_data_frame()
# Replace missing values with the values from another column
mydfNA2$colB_corrected <- ifelse(is.na(mydfNA2$colB), mydfNA2$colA, mydfNA2$colB)
mydfNA2
  colA colB colC colB_corrected
1    a    d    1              d
2 <NA> <NA>   NA           <NA>
3    c <NA>    3              c
mydfNA2 <- mydfNA$to_data_frame()

# Replace missing values with the mean of the non-null values for that column
mydfNA2 %>%
  mutate(colC_corrected = ifelse(is.na(colC), mean(mydfNA2$colC, na.rm = TRUE), colC))
  colA colB colC colC_corrected
1    a    d    1              1
2 <NA> <NA>   NA              2
3    c <NA>    3              3
# Replace missing values with the values from another column
mydfNA2 %>%
  mutate(colB_corrected = ifelse(is.na(colB), colA, colB))
  colA colB colC colB_corrected
1    a    d    1              d
2 <NA> <NA>   NA           <NA>
3    c <NA>    3              c
mydfNA2 <- mydfNA$to_data_frame()
mydfNA2_dt <- as.data.table(mydfNA2)

# Replace missing values with the mean of the non-null values for that column
mydfNA2_dt[, colC_corrected := ifelse(is.na(colC), mean(mydfNA2_dt$colC, na.rm = TRUE), colC)]
mydfNA2_dt
     colA   colB  colC colC_corrected
   <char> <char> <num>          <num>
1:      a      d     1              1
2:   <NA>   <NA>    NA              2
3:      c   <NA>     3              3
# Replace missing values with the values from another column
mydfNA2_dt[, colB_corrected := ifelse(is.na(colB), colA, colB)]
mydfNA2_dt
     colA   colB  colC colC_corrected colB_corrected
   <char> <char> <num>          <num>         <char>
1:      a      d     1              1              d
2:   <NA>   <NA>    NA              2           <NA>
3:      c   <NA>     3              3              c

2.14.5 Replace missing values with a sequence of columns

The coalesce() method can be used to replace missing values based on a sequence of columns.

Here’s an example creating a new column β€œcol4” that has the first non-null value as we go through some columns (in order!):

mynewdfNA <- pl$DataFrame(
  col1 = pl$Series(c(NA,"y",NA)),
  col2 = pl$Series(c(NA,"v","w")),
  col3 = pl$Series(c("r","s",NA))
)
Warning in pl$Series(c(NA, "y", NA)): `pl$Series()` will handle unnamed arguments differently as of 0.17.0:
- until 0.17.0, the first argument corresponds to the values and the second argument to the name of the Series.
- as of 0.17.0, the first argument will correspond to the name and the second argument to the values.
Use named arguments in `pl$Series()` or replace `pl$Series(<values>, <name>)` by `as_polars_series(<values>, <name>)` to silence this warning.
Warning in pl$Series(c(NA, "v", "w")): `pl$Series()` will handle unnamed arguments differently as of 0.17.0:
- until 0.17.0, the first argument corresponds to the values and the second argument to the name of the Series.
- as of 0.17.0, the first argument will correspond to the name and the second argument to the values.
Use named arguments in `pl$Series()` or replace `pl$Series(<values>, <name>)` by `as_polars_series(<values>, <name>)` to silence this warning.
Warning in pl$Series(c("r", "s", NA)): `pl$Series()` will handle unnamed arguments differently as of 0.17.0:
- until 0.17.0, the first argument corresponds to the values and the second argument to the name of the Series.
- as of 0.17.0, the first argument will correspond to the name and the second argument to the values.
Use named arguments in `pl$Series()` or replace `pl$Series(<values>, <name>)` by `as_polars_series(<values>, <name>)` to silence this warning.
mynewdfNA$
  with_columns(
    pl$coalesce("col1","col2","col3")$alias("col4")
  )
shape: (3, 4)
β”Œβ”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”
β”‚ col1 ┆ col2 ┆ col3 ┆ col4 β”‚
β”‚ ---  ┆ ---  ┆ ---  ┆ ---  β”‚
β”‚ str  ┆ str  ┆ str  ┆ str  β”‚
β•žβ•β•β•β•β•β•β•ͺ══════β•ͺ══════β•ͺ══════║
β”‚ null ┆ null ┆ r    ┆ r    β”‚
β”‚ y    ┆ v    ┆ s    ┆ y    β”‚
β”‚ null ┆ w    ┆ null ┆ w    β”‚
β””β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”˜
mynewdfNA2 <-  mynewdfNA$to_data_frame()

mynewdfNA2$col4 <- coalesce(mynewdfNA2$col1, mynewdfNA2$col2, mynewdfNA2$col3)
mynewdfNA2
  col1 col2 col3 col4
1 <NA> <NA>    r    r
2    y    v    s    y
3 <NA>    w <NA>    w
mynewdfNA2 <-  mynewdfNA$to_data_frame()

mynewdfNA2 %>%
  mutate(col4 = coalesce(col1, col2, col3))
  col1 col2 col3 col4
1 <NA> <NA>    r    r
2    y    v    s    y
3 <NA>    w <NA>    w
mynewdfNA2 <-  mynewdfNA$to_data_frame()
mynewdfNA2_dt <- as.data.table(mynewdfNA2)

mynewdfNA2_dt[, col4 := fcoalesce(col1, col2, col3)]
mynewdfNA2_dt
     col1   col2   col3   col4
   <char> <char> <char> <char>
1:   <NA>   <NA>      r      r
2:      y      v      s      y
3:   <NA>      w   <NA>      w

2.15 Others useful methods

2.15.1 On Series

2.15.1.1 Change name of Series

The alias() method is very useful especially in method chaining operation.
With R base, the syntax is longer.

pl$Series(1:3, name = "toto")$alias("titi")
Warning in pl$Series(1:3, name = "toto"): `pl$Series()` will handle unnamed arguments differently as of 0.17.0:
- until 0.17.0, the first argument corresponds to the values and the second argument to the name of the Series.
- as of 0.17.0, the first argument will correspond to the name and the second argument to the values.
Use named arguments in `pl$Series()` or replace `pl$Series(<values>, <name>)` by `as_polars_series(<values>, <name>)` to silence this warning.
polars Series: shape: (3,)
Series: 'titi' [i32]
[
    1
    2
    3
]
toto <- 1:3
titi <- toto
rm(toto)

2.15.1.2 Reduce Boolean Series

The all() and any() methods can be used to check if all or any values in a vector evaluate to TRUE for some expression.

# all(pl$Series(c(TRUE,TRUE))) doesn't work
pl$Series(c(TRUE, TRUE, NA))$all()
Warning in pl$Series(c(TRUE, TRUE, NA)): `pl$Series()` will handle unnamed arguments differently as of 0.17.0:
- until 0.17.0, the first argument corresponds to the values and the second argument to the name of the Series.
- as of 0.17.0, the first argument will correspond to the name and the second argument to the values.
Use named arguments in `pl$Series()` or replace `pl$Series(<values>, <name>)` by `as_polars_series(<values>, <name>)` to silence this warning.
[1] FALSE
pl$Series(c(TRUE, TRUE, FALSE))$all()
Warning in pl$Series(c(TRUE, TRUE, FALSE)): `pl$Series()` will handle unnamed arguments differently as of 0.17.0:
- until 0.17.0, the first argument corresponds to the values and the second argument to the name of the Series.
- as of 0.17.0, the first argument will correspond to the name and the second argument to the values.
Use named arguments in `pl$Series()` or replace `pl$Series(<values>, <name>)` by `as_polars_series(<values>, <name>)` to silence this warning.
[1] FALSE
pl$Series(c(TRUE, TRUE, TRUE))$all()
Warning in pl$Series(c(TRUE, TRUE, TRUE)): `pl$Series()` will handle unnamed arguments differently as of 0.17.0:
- until 0.17.0, the first argument corresponds to the values and the second argument to the name of the Series.
- as of 0.17.0, the first argument will correspond to the name and the second argument to the values.
Use named arguments in `pl$Series()` or replace `pl$Series(<values>, <name>)` by `as_polars_series(<values>, <name>)` to silence this warning.
[1] TRUE
all(c(TRUE,TRUE,NA))
[1] NA
all(c(TRUE,TRUE,FALSE))
[1] FALSE
all(c(TRUE,TRUE,TRUE))
[1] TRUE

2.15.1.3 Get data type of Series

The dtype() method can be used to get data type of Series.

pl$Series(letters)$dtype
Warning in pl$Series(letters): `pl$Series()` will handle unnamed arguments differently as of 0.17.0:
- until 0.17.0, the first argument corresponds to the values and the second argument to the name of the Series.
- as of 0.17.0, the first argument will correspond to the name and the second argument to the values.
Use named arguments in `pl$Series()` or replace `pl$Series(<values>, <name>)` by `as_polars_series(<values>, <name>)` to silence this warning.
DataType: String
pl$Series(c(1, 2))$dtype
Warning in pl$Series(c(1, 2)): `pl$Series()` will handle unnamed arguments differently as of 0.17.0:
- until 0.17.0, the first argument corresponds to the values and the second argument to the name of the Series.
- as of 0.17.0, the first argument will correspond to the name and the second argument to the values.
Use named arguments in `pl$Series()` or replace `pl$Series(<values>, <name>)` by `as_polars_series(<values>, <name>)` to silence this warning.
DataType: Float64
infer_type(letters)
Utf8
string
infer_type(c(1, 2))
Float64
double
Tip

Polars is strongly typed. print(ls(pl$dtypes)) returns the full list of valid Polars types. Caution, some type names differ from what they are called in R base. See below!

pl$Series(c("x","y","z"))$dtype
Warning in pl$Series(c("x", "y", "z")): `pl$Series()` will handle unnamed arguments differently as of 0.17.0:
- until 0.17.0, the first argument corresponds to the values and the second argument to the name of the Series.
- as of 0.17.0, the first argument will correspond to the name and the second argument to the values.
Use named arguments in `pl$Series()` or replace `pl$Series(<values>, <name>)` by `as_polars_series(<values>, <name>)` to silence this warning.
DataType: String
pl$Series(c(1, 2, 3))$dtype
Warning in pl$Series(c(1, 2, 3)): `pl$Series()` will handle unnamed arguments differently as of 0.17.0:
- until 0.17.0, the first argument corresponds to the values and the second argument to the name of the Series.
- as of 0.17.0, the first argument will correspond to the name and the second argument to the values.
Use named arguments in `pl$Series()` or replace `pl$Series(<values>, <name>)` by `as_polars_series(<values>, <name>)` to silence this warning.
DataType: Float64
pl$Series(c(1:3))$dtype
Warning in pl$Series(c(1:3)): `pl$Series()` will handle unnamed arguments differently as of 0.17.0:
- until 0.17.0, the first argument corresponds to the values and the second argument to the name of the Series.
- as of 0.17.0, the first argument will correspond to the name and the second argument to the values.
Use named arguments in `pl$Series()` or replace `pl$Series(<values>, <name>)` by `as_polars_series(<values>, <name>)` to silence this warning.
DataType: Int32
pl$Series(c(TRUE,FALSE))$dtype
Warning in pl$Series(c(TRUE, FALSE)): `pl$Series()` will handle unnamed arguments differently as of 0.17.0:
- until 0.17.0, the first argument corresponds to the values and the second argument to the name of the Series.
- as of 0.17.0, the first argument will correspond to the name and the second argument to the values.
Use named arguments in `pl$Series()` or replace `pl$Series(<values>, <name>)` by `as_polars_series(<values>, <name>)` to silence this warning.
DataType: Boolean
pl$Series(factor(c("a","b","c")))$dtype
Warning in pl$Series(factor(c("a", "b", "c"))): `pl$Series()` will handle unnamed arguments differently as of 0.17.0:
- until 0.17.0, the first argument corresponds to the values and the second argument to the name of the Series.
- as of 0.17.0, the first argument will correspond to the name and the second argument to the values.
Use named arguments in `pl$Series()` or replace `pl$Series(<values>, <name>)` by `as_polars_series(<values>, <name>)` to silence this warning.
DataType: Categorical(
    Some(
        local,
    ),
    Physical,
)
pl$Series(Sys.Date())$dtype
Warning in pl$Series(Sys.Date()): `pl$Series()` will handle unnamed arguments differently as of 0.17.0:
- until 0.17.0, the first argument corresponds to the values and the second argument to the name of the Series.
- as of 0.17.0, the first argument will correspond to the name and the second argument to the values.
Use named arguments in `pl$Series()` or replace `pl$Series(<values>, <name>)` by `as_polars_series(<values>, <name>)` to silence this warning.
DataType: Date
pl$Series(c(0,1))$dtype
Warning in pl$Series(c(0, 1)): `pl$Series()` will handle unnamed arguments differently as of 0.17.0:
- until 0.17.0, the first argument corresponds to the values and the second argument to the name of the Series.
- as of 0.17.0, the first argument will correspond to the name and the second argument to the values.
Use named arguments in `pl$Series()` or replace `pl$Series(<values>, <name>)` by `as_polars_series(<values>, <name>)` to silence this warning.
DataType: Float64
typeof(c("x","y","z"))
[1] "character"
typeof(c(1, 2, 3))
[1] "double"
typeof(c(1:3))
[1] "integer"
typeof(c(TRUE,FALSE))
[1] "logical"
typeof(factor(c("a","b","c")))
[1] "integer"
typeof(Sys.Date())
[1] "double"

To summarise the main types between Polars and R:

Polars R Base
Utf8 character
Float64 double
Int32 integer
Boolean logical
Categorical Factor
Date Date

To learn more about Data types in Polars, see here.

2.15.1.4 Cast

The cast() method can be used to convert the data types of a column to a new one.

pl$DataFrame(iris)$with_columns(
  pl$col("Petal.Length")$cast(pl$Int8), # The "Petal.Length" column is converted into integers
  pl$col("Species")$cast(pl$Utf8) # The "Species" column is converted into strings
  )$schema
$Sepal.Length
DataType: Float64

$Sepal.Width
DataType: Float64

$Petal.Length
DataType: Int8

$Petal.Width
DataType: Float64

$Species
DataType: String
data(iris)
iris$Petal.Length <- as.integer(iris$Petal.Length)
iris$Species <- as.integer(iris$Species)
str(iris)
'data.frame':   150 obs. of  5 variables:
 $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
 $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
 $ Petal.Length: int  1 1 1 1 1 1 1 1 1 1 ...
 $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
 $ Species     : int  1 1 1 1 1 1 1 1 1 1 ...
data(iris)
iris |>
  mutate(
    Petal.Length = as.integer(Petal.Length),
    Species = as.character(Species)) |>
  str()
'data.frame':   150 obs. of  5 variables:
 $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
 $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
 $ Petal.Length: int  1 1 1 1 1 1 1 1 1 1 ...
 $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
 $ Species     : chr  "setosa" "setosa" "setosa" "setosa" ...
iris_dt[, `:=`(Petal.Length = as.integer(Petal.Length),
               Species = as.character(Species))]
str(iris_dt)
Classes 'data.table' and 'data.frame':  150 obs. of  5 variables:
 $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
 $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
 $ Petal.Length: int  1 1 1 1 1 1 1 1 1 1 ...
 $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
 $ Species     : chr  "setosa" "setosa" "setosa" "setosa" ...
 - attr(*, ".internal.selfref")=<externalptr> 
Tip

When working with very large tables we can Reduce the memory footprint by modifying the number of bits allocated to an element. ⚠️

For example, the example below illustrates how converting Float64 to Float8 reduces memory usage:

pl$DataFrame(iris)$estimated_size()
[1] 5425
#| label: reduce-footprint-polars
pl$DataFrame(iris)$with_columns(
  pl$col("Petal.Length")$cast(pl$Float32),
  pl$col("Petal.Width")$cast(pl$Float32),
  pl$col("Sepal.Length")$cast(pl$Float32),
  pl$col("Sepal.Width")$cast(pl$Float32)
  )$estimated_size()
[1] 3025

When performing downcasting, it is crucial to ensure that the chosen number of bits is sufficient to accommodate the largest and smallest numbers in the column.

A quick reminder:

Type Range Accuracy
Int8 -128 to +127
Int16 -32768 to +32767
Int32 -2147483648 to +2147483647
Int64 –2E63 to –2E63-1
Float32 -3.4E+38 to +3.4E+38 about 7 decimal digits
Float64 -1.7E+308 to +1.7E+308 about 16 decimal digits

2.15.1.5 Check if Series is numeric

The is_numeric() method can be used to check if Series is numeric.
Note that unlike R base, there is no method to check if a Series is character (in this case, its type is anyway Utf8).

pl$Series(1:4)$is_numeric()
Warning in pl$Series(1:4): `pl$Series()` will handle unnamed arguments differently as of 0.17.0:
- until 0.17.0, the first argument corresponds to the values and the second argument to the name of the Series.
- as of 0.17.0, the first argument will correspond to the name and the second argument to the values.
Use named arguments in `pl$Series()` or replace `pl$Series(<values>, <name>)` by `as_polars_series(<values>, <name>)` to silence this warning.
[1] TRUE
pl$Series(c("a", "b", "c"))$is_numeric()
Warning in pl$Series(c("a", "b", "c")): `pl$Series()` will handle unnamed arguments differently as of 0.17.0:
- until 0.17.0, the first argument corresponds to the values and the second argument to the name of the Series.
- as of 0.17.0, the first argument will correspond to the name and the second argument to the values.
Use named arguments in `pl$Series()` or replace `pl$Series(<values>, <name>)` by `as_polars_series(<values>, <name>)` to silence this warning.
[1] FALSE
is.numeric(1:4)
[1] TRUE
is.numeric(c("a","b","c"))
[1] FALSE

2.15.1.6 Check if Series is sorted

The is_sorted() method can be used to check if Series is sorted.
Note that R base provides is.unsorted() which returns the opposite boolean to is_sorted() of Polars.

pl$Series(1:4)$is_sorted()
Warning in pl$Series(1:4): `pl$Series()` will handle unnamed arguments differently as of 0.17.0:
- until 0.17.0, the first argument corresponds to the values and the second argument to the name of the Series.
- as of 0.17.0, the first argument will correspond to the name and the second argument to the values.
Use named arguments in `pl$Series()` or replace `pl$Series(<values>, <name>)` by `as_polars_series(<values>, <name>)` to silence this warning.
[1] TRUE
pl$Series(c(1,3,2))$is_sorted()
Warning in pl$Series(c(1, 3, 2)): `pl$Series()` will handle unnamed arguments differently as of 0.17.0:
- until 0.17.0, the first argument corresponds to the values and the second argument to the name of the Series.
- as of 0.17.0, the first argument will correspond to the name and the second argument to the values.
Use named arguments in `pl$Series()` or replace `pl$Series(<values>, <name>)` by `as_polars_series(<values>, <name>)` to silence this warning.
[1] FALSE
is.unsorted(1:4)
[1] FALSE
is.unsorted(c(1,3,2))
[1] TRUE

2.15.1.7 Get length of a Series

The len() method can be used to get the length of a Series.

pl$Series(1:4)$len()
Warning in pl$Series(1:4): `pl$Series()` will handle unnamed arguments differently as of 0.17.0:
- until 0.17.0, the first argument corresponds to the values and the second argument to the name of the Series.
- as of 0.17.0, the first argument will correspond to the name and the second argument to the values.
Use named arguments in `pl$Series()` or replace `pl$Series(<values>, <name>)` by `as_polars_series(<values>, <name>)` to silence this warning.
[1] 4
length(1:4)
[1] 4

2.15.1.8 Check if Series are equal

The series_equal() method can be used to check if a Series is equal with another Series.

Tip

Caution, if two series are identical but one is named and the other is not then series_equal() returns FALSE.

pl$Series(1:4)$equals(pl$Series(1:4))
Warning in pl$Series(1:4): `pl$Series()` will handle unnamed arguments differently as of 0.17.0:
- until 0.17.0, the first argument corresponds to the values and the second argument to the name of the Series.
- as of 0.17.0, the first argument will correspond to the name and the second argument to the values.
Use named arguments in `pl$Series()` or replace `pl$Series(<values>, <name>)` by `as_polars_series(<values>, <name>)` to silence this warning.
Warning in pl$Series(1:4): `pl$Series()` will handle unnamed arguments differently as of 0.17.0:
- until 0.17.0, the first argument corresponds to the values and the second argument to the name of the Series.
- as of 0.17.0, the first argument will correspond to the name and the second argument to the values.
Use named arguments in `pl$Series()` or replace `pl$Series(<values>, <name>)` by `as_polars_series(<values>, <name>)` to silence this warning.
[1] TRUE
pl$Series(1:4,name = "toto")$equals(pl$Series(1:4))
Warning in pl$Series(1:4, name = "toto"): `pl$Series()` will handle unnamed arguments differently as of 0.17.0:
- until 0.17.0, the first argument corresponds to the values and the second argument to the name of the Series.
- as of 0.17.0, the first argument will correspond to the name and the second argument to the values.
Use named arguments in `pl$Series()` or replace `pl$Series(<values>, <name>)` by `as_polars_series(<values>, <name>)` to silence this warning.
Warning in pl$Series(1:4, name = "toto"): `pl$Series()` will handle unnamed arguments differently as of 0.17.0:
- until 0.17.0, the first argument corresponds to the values and the second argument to the name of the Series.
- as of 0.17.0, the first argument will correspond to the name and the second argument to the values.
Use named arguments in `pl$Series()` or replace `pl$Series(<values>, <name>)` by `as_polars_series(<values>, <name>)` to silence this warning.
[1] FALSE
identical(1:4,1:4)
[1] TRUE

2.15.1.9 Convert Series to Polars DataFrame

The to_frame() method can be used to convert a Series to a DataFrame.
In this case, a DataFrame with only one column will be created. If the Series is initially named then the column of the DataFrame will be named as such.

pl$Series(1:3, "toto")$to_frame()
Warning in pl$Series(1:3, "toto"): `pl$Series()` will handle unnamed arguments differently as of 0.17.0:
- until 0.17.0, the first argument corresponds to the values and the second argument to the name of the Series.
- as of 0.17.0, the first argument will correspond to the name and the second argument to the values.
Use named arguments in `pl$Series()` or replace `pl$Series(<values>, <name>)` by `as_polars_series(<values>, <name>)` to silence this warning.
shape: (3, 1)
β”Œβ”€β”€β”€β”€β”€β”€β”
β”‚ toto β”‚
β”‚ ---  β”‚
β”‚ i32  β”‚
β•žβ•β•β•β•β•β•β•‘
β”‚ 1    β”‚
β”‚ 2    β”‚
β”‚ 3    β”‚
β””β”€β”€β”€β”€β”€β”€β”˜

2.15.1.10 Get value Counts of a Series

The value_counts() method can be used to get a value counts of a Series.

pl$Series(iris$Species)$value_counts()
Warning in pl$Series(iris$Species): `pl$Series()` will handle unnamed arguments differently as of 0.17.0:
- until 0.17.0, the first argument corresponds to the values and the second argument to the name of the Series.
- as of 0.17.0, the first argument will correspond to the name and the second argument to the values.
Use named arguments in `pl$Series()` or replace `pl$Series(<values>, <name>)` by `as_polars_series(<values>, <name>)` to silence this warning.
shape: (3, 2)
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”
β”‚            ┆ count β”‚
β”‚ ---        ┆ ---   β”‚
β”‚ cat        ┆ u32   β”‚
β•žβ•β•β•β•β•β•β•β•β•β•β•β•β•ͺ═══════║
β”‚ setosa     ┆ 50    β”‚
β”‚ versicolor ┆ 50    β”‚
β”‚ virginica  ┆ 50    β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”˜
table(iris$Species)

    setosa versicolor  virginica 
        50         50         50 
iris |>
  count(Species)
     Species  n
1     setosa 50
2 versicolor 50
3  virginica 50
iris_dt[, .N, by = Species]
      Species     N
       <char> <int>
1:     setosa    50
2: versicolor    50
3:  virginica    50

2.15.1.11 Sum across Series

The sum() method can be used to get a sum of a Series.

  • From a single Series:
pl$Series(1:3)$sum()
Warning in pl$Series(1:3): `pl$Series()` will handle unnamed arguments differently as of 0.17.0:
- until 0.17.0, the first argument corresponds to the values and the second argument to the name of the Series.
- as of 0.17.0, the first argument will correspond to the name and the second argument to the values.
Use named arguments in `pl$Series()` or replace `pl$Series(<values>, <name>)` by `as_polars_series(<values>, <name>)` to silence this warning.
[1] 6
sum(c(1:3))
[1] 6
  • From a DataFrame and a column as a string:
pl$DataFrame(iris)$select(pl$sum("Petal.Length"))
shape: (1, 1)
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Petal.Length β”‚
β”‚ ---          β”‚
β”‚ f64          β”‚
β•žβ•β•β•β•β•β•β•β•β•β•β•β•β•β•β•‘
β”‚ 563.7        β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
sum(iris$Petal.Length)
[1] 563.7
iris |> summarise(sum(Petal.Length))
  sum(Petal.Length)
1             563.7
sum(iris_dt[, Petal.Length])
[1] 496
  • From a DataFrame and a column as an expression:
pl$DataFrame(iris)$select(pl$sum("Petal.Width"))
shape: (1, 1)
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Petal.Width β”‚
β”‚ ---         β”‚
β”‚ f64         β”‚
β•žβ•β•β•β•β•β•β•β•β•β•β•β•β•β•‘
β”‚ 179.9       β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
  • From a DataFrame and a column as a list and sum horizontally:

In this case, use with_columns() method.

df_pl <- pl$DataFrame(col1 = c(10L,20L), col2= c(30L,40L), col3 = c(40L,50L))
df_pl$with_columns(pl$sum("col1", "col3"))
shape: (2, 3)
β”Œβ”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”
β”‚ col1 ┆ col2 ┆ col3 β”‚
β”‚ ---  ┆ ---  ┆ ---  β”‚
β”‚ i32  ┆ i32  ┆ i32  β”‚
β•žβ•β•β•β•β•β•β•ͺ══════β•ͺ══════║
β”‚ 30   ┆ 30   ┆ 90   β”‚
β”‚ 30   ┆ 40   ┆ 90   β”‚
β””β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”˜
df <- data.frame(col1 = c(10L,20L), col2= c(30L,40L), col3 = c(40L,50L))
mysum <- rowSums(df[, c("col1", "col3")])
cbind(df,mysum)
  col1 col2 col3 mysum
1   10   30   40    50
2   20   40   50    70
df |>
  rowwise() |>
  mutate(mysum = sum(col1,col3))
# A tibble: 2 Γ— 4
# Rowwise: 
   col1  col2  col3 mysum
  <int> <int> <int> <int>
1    10    30    40    50
2    20    40    50    70
df_dt <- as.data.table(df)
df_dt[, somme := rowSums(.SD), .SDcols = c("col1", "col3")]
print(df_dt)
    col1  col2  col3 somme
   <int> <int> <int> <num>
1:    10    30    40    50
2:    20    40    50    70
  • From a DataFrame and sum horizontally all columns:

In this case, use list(*).

df_pl$with_columns(pl$sum_horizontal(list("*")))
shape: (2, 4)
β”Œβ”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ col1 ┆ col2 ┆ col3 ┆           β”‚
β”‚ ---  ┆ ---  ┆ ---  ┆ ---       β”‚
β”‚ i32  ┆ i32  ┆ i32  ┆ list[str] β”‚
β•žβ•β•β•β•β•β•β•ͺ══════β•ͺ══════β•ͺ═══════════║
β”‚ 10   ┆ 30   ┆ 40   ┆ ["*"]     β”‚
β”‚ 20   ┆ 40   ┆ 50   ┆ ["*"]     β”‚
β””β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

With iris Dataframe, you must first select numerical variables:

pl$DataFrame(iris)$
  select(
  pl$col(c("Petal.Length","Petal.Width")))$
  with_columns(pl$sum_horizontal(list("*")))
shape: (150, 3)
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Petal.Length ┆ Petal.Width ┆           β”‚
β”‚ ---          ┆ ---         ┆ ---       β”‚
β”‚ f64          ┆ f64         ┆ list[str] β”‚
β•žβ•β•β•β•β•β•β•β•β•β•β•β•β•β•β•ͺ═════════════β•ͺ═══════════║
β”‚ 1.4          ┆ 0.2         ┆ ["*"]     β”‚
β”‚ 1.4          ┆ 0.2         ┆ ["*"]     β”‚
β”‚ 1.3          ┆ 0.2         ┆ ["*"]     β”‚
β”‚ 1.5          ┆ 0.2         ┆ ["*"]     β”‚
β”‚ 1.4          ┆ 0.2         ┆ ["*"]     β”‚
β”‚ …            ┆ …           ┆ …         β”‚
β”‚ 5.2          ┆ 2.3         ┆ ["*"]     β”‚
β”‚ 5.0          ┆ 1.9         ┆ ["*"]     β”‚
β”‚ 5.2          ┆ 2.0         ┆ ["*"]     β”‚
β”‚ 5.4          ┆ 2.3         ┆ ["*"]     β”‚
β”‚ 5.1          ┆ 1.8         ┆ ["*"]     β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

2.15.2 On DataFrames

2.15.2.1 Get Series from DataFrame

The to_series() method can be used to get one column from DataFrame as Series.

pl$DataFrame(iris)$select(pl$col("Petal.Length"))$to_series()
polars Series: shape: (150,)
Series: 'Petal.Length' [f64]
[
    1.4
    1.4
    1.3
    1.5
    1.4
    …
    5.2
    5.0
    5.2
    5.4
    5.1
]
iris$Petal.Length
  [1] 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 1.5 1.6 1.4 1.1 1.2 1.5 1.3 1.4
 [19] 1.7 1.5 1.7 1.5 1.0 1.7 1.9 1.6 1.6 1.5 1.4 1.6 1.6 1.5 1.5 1.4 1.5 1.2
 [37] 1.3 1.4 1.3 1.5 1.3 1.3 1.3 1.6 1.9 1.4 1.6 1.4 1.5 1.4 4.7 4.5 4.9 4.0
 [55] 4.6 4.5 4.7 3.3 4.6 3.9 3.5 4.2 4.0 4.7 3.6 4.4 4.5 4.1 4.5 3.9 4.8 4.0
 [73] 4.9 4.7 4.3 4.4 4.8 5.0 4.5 3.5 3.8 3.7 3.9 5.1 4.5 4.5 4.7 4.4 4.1 4.0
 [91] 4.4 4.6 4.0 3.3 4.2 4.2 4.2 4.3 3.0 4.1 6.0 5.1 5.9 5.6 5.8 6.6 4.5 6.3
[109] 5.8 6.1 5.1 5.3 5.5 5.0 5.1 5.3 5.5 6.7 6.9 5.0 5.7 4.9 6.7 4.9 5.7 6.0
[127] 4.8 4.9 5.6 5.8 6.1 6.4 5.6 5.1 5.6 6.1 5.6 5.5 4.8 5.4 5.6 5.1 5.1 5.9
[145] 5.7 5.2 5.0 5.2 5.4 5.1
iris |>
  pull(Petal.Length)
  [1] 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 1.5 1.6 1.4 1.1 1.2 1.5 1.3 1.4
 [19] 1.7 1.5 1.7 1.5 1.0 1.7 1.9 1.6 1.6 1.5 1.4 1.6 1.6 1.5 1.5 1.4 1.5 1.2
 [37] 1.3 1.4 1.3 1.5 1.3 1.3 1.3 1.6 1.9 1.4 1.6 1.4 1.5 1.4 4.7 4.5 4.9 4.0
 [55] 4.6 4.5 4.7 3.3 4.6 3.9 3.5 4.2 4.0 4.7 3.6 4.4 4.5 4.1 4.5 3.9 4.8 4.0
 [73] 4.9 4.7 4.3 4.4 4.8 5.0 4.5 3.5 3.8 3.7 3.9 5.1 4.5 4.5 4.7 4.4 4.1 4.0
 [91] 4.4 4.6 4.0 3.3 4.2 4.2 4.2 4.3 3.0 4.1 6.0 5.1 5.9 5.6 5.8 6.6 4.5 6.3
[109] 5.8 6.1 5.1 5.3 5.5 5.0 5.1 5.3 5.5 6.7 6.9 5.0 5.7 4.9 6.7 4.9 5.7 6.0
[127] 4.8 4.9 5.6 5.8 6.1 6.4 5.6 5.1 5.6 6.1 5.6 5.5 4.8 5.4 5.6 5.1 5.1 5.9
[145] 5.7 5.2 5.0 5.2 5.4 5.1
iris_dt[, c(Petal.Length)]
  [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 [38] 1 1 1 1 1 1 1 1 1 1 1 1 1 4 4 4 4 4 4 4 3 4 3 3 4 4 4 3 4 4 4 4 3 4 4 4 4
 [75] 4 4 4 5 4 3 3 3 3 5 4 4 4 4 4 4 4 4 4 3 4 4 4 4 3 4 6 5 5 5 5 6 4 6 5 6 5
[112] 5 5 5 5 5 5 6 6 5 5 4 6 4 5 6 4 4 5 5 6 6 5 5 5 6 5 5 4 5 5 5 5 5 5 5 5 5
[149] 5 5

2.15.2.2 Get a R List from DataFrame

The to_list() method can be used to get a R List from DataFrame.

mylist <- pl$DataFrame(iris)$to_list()
names(mylist)
[1] "Sepal.Length" "Sepal.Width"  "Petal.Length" "Petal.Width"  "Species"     
mylist <- as.list(iris)
names(mylist)
[1] "Sepal.Length" "Sepal.Width"  "Petal.Length" "Petal.Width"  "Species"     

2.15.2.3 Get a slice of a DataFrame

The slice() method can be used to get a slice of a DataFrame.

Important

With Polars, numeric default is 0! Thus the equivalent to slice(1,3) with Polars will be 1:4 in R Base and data.table and slice(1,4) with dplyr.

pl$DataFrame(iris)$slice(1,3)
shape: (3, 5)
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Sepal.Length ┆ Sepal.Width ┆ Petal.Length ┆ Petal.Width ┆ Species β”‚
β”‚ ---          ┆ ---         ┆ ---          ┆ ---         ┆ ---     β”‚
β”‚ f64          ┆ f64         ┆ f64          ┆ f64         ┆ cat     β”‚
β•žβ•β•β•β•β•β•β•β•β•β•β•β•β•β•β•ͺ═════════════β•ͺ══════════════β•ͺ═════════════β•ͺ═════════║
β”‚ 4.9          ┆ 3.0         ┆ 1.4          ┆ 0.2         ┆ setosa  β”‚
β”‚ 4.7          ┆ 3.2         ┆ 1.3          ┆ 0.2         ┆ setosa  β”‚
β”‚ 4.6          ┆ 3.1         ┆ 1.5          ┆ 0.2         ┆ setosa  β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
iris[2:4,]
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
iris |>
  slice(2:4)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          4.9         3.0          1.4         0.2  setosa
2          4.7         3.2          1.3         0.2  setosa
3          4.6         3.1          1.5         0.2  setosa
iris_dt[2:4,]
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
          <num>       <num>        <int>       <num>  <char>
1:          4.9         3.0            1         0.2  setosa
2:          4.7         3.2            1         0.2  setosa
3:          4.6         3.1            1         0.2  setosa

2.15.2.4 Get a structure from a DataFrame

The to_struct() method can be used to get a structure from a DataFrame.

mystruc <- pl$DataFrame(iris)$to_struct()
mystruc
polars Series: shape: (150,)
Series: '' [struct[5]]
[
    {5.1,3.5,1.4,0.2,"setosa"}
    {4.9,3.0,1.4,0.2,"setosa"}
    {4.7,3.2,1.3,0.2,"setosa"}
    {4.6,3.1,1.5,0.2,"setosa"}
    {5.0,3.6,1.4,0.2,"setosa"}
    …
    {6.7,3.0,5.2,2.3,"virginica"}
    {6.3,2.5,5.0,1.9,"virginica"}
    {6.5,3.0,5.2,2.0,"virginica"}
    {6.2,3.4,5.4,2.3,"virginica"}
    {5.9,3.0,5.1,1.8,"virginica"}
]

to_struct() returns a Series which can be converted to a R list with to_r method.

mylist <- mystruc$to_r()
str(mylist)
List of 5
 $ Sepal.Length: num [1:150] 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
 $ Sepal.Width : num [1:150] 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
 $ Petal.Length: num [1:150] 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
 $ Petal.Width : num [1:150] 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
 $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
 - attr(*, "is_struct")= logi TRUE

mylist is now a list where each element is a column of the initial DataFrame.

With to_frame() and unnest() methods, we can reconstruct the original DataFrame:

back_df <- mystruc$to_frame()$unnest()
back_df
shape: (150, 5)
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Sepal.Length ┆ Sepal.Width ┆ Petal.Length ┆ Petal.Width ┆ Species   β”‚
β”‚ ---          ┆ ---         ┆ ---          ┆ ---         ┆ ---       β”‚
β”‚ f64          ┆ f64         ┆ f64          ┆ f64         ┆ cat       β”‚
β•žβ•β•β•β•β•β•β•β•β•β•β•β•β•β•β•ͺ═════════════β•ͺ══════════════β•ͺ═════════════β•ͺ═══════════║
β”‚ 5.1          ┆ 3.5         ┆ 1.4          ┆ 0.2         ┆ setosa    β”‚
β”‚ 4.9          ┆ 3.0         ┆ 1.4          ┆ 0.2         ┆ setosa    β”‚
β”‚ 4.7          ┆ 3.2         ┆ 1.3          ┆ 0.2         ┆ setosa    β”‚
β”‚ 4.6          ┆ 3.1         ┆ 1.5          ┆ 0.2         ┆ setosa    β”‚
β”‚ 5.0          ┆ 3.6         ┆ 1.4          ┆ 0.2         ┆ setosa    β”‚
β”‚ …            ┆ …           ┆ …            ┆ …           ┆ …         β”‚
β”‚ 6.7          ┆ 3.0         ┆ 5.2          ┆ 2.3         ┆ virginica β”‚
β”‚ 6.3          ┆ 2.5         ┆ 5.0          ┆ 1.9         ┆ virginica β”‚
β”‚ 6.5          ┆ 3.0         ┆ 5.2          ┆ 2.0         ┆ virginica β”‚
β”‚ 6.2          ┆ 3.4         ┆ 5.4          ┆ 2.3         ┆ virginica β”‚
β”‚ 5.9          ┆ 3.0         ┆ 5.1          ┆ 1.8         ┆ virginica β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

2.15.2.5 Drop all rows that contain null values

The drop_nulls() method can be used to drop all rows that contain null values in a DataFrame.

By default, drop_nulls() use all columns to drop rows:

data_pl <- pl$DataFrame(
  col1 = pl$Series(c(NA,"b","c")),
  col2 = pl$Series(c(1,2,NA))
)
Warning in pl$Series(c(NA, "b", "c")): `pl$Series()` will handle unnamed arguments differently as of 0.17.0:
- until 0.17.0, the first argument corresponds to the values and the second argument to the name of the Series.
- as of 0.17.0, the first argument will correspond to the name and the second argument to the values.
Use named arguments in `pl$Series()` or replace `pl$Series(<values>, <name>)` by `as_polars_series(<values>, <name>)` to silence this warning.
Warning in pl$Series(c(1, 2, NA)): `pl$Series()` will handle unnamed arguments differently as of 0.17.0:
- until 0.17.0, the first argument corresponds to the values and the second argument to the name of the Series.
- as of 0.17.0, the first argument will correspond to the name and the second argument to the values.
Use named arguments in `pl$Series()` or replace `pl$Series(<values>, <name>)` by `as_polars_series(<values>, <name>)` to silence this warning.
data_pl$drop_nulls()
shape: (1, 2)
β”Œβ”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”
β”‚ col1 ┆ col2 β”‚
β”‚ ---  ┆ ---  β”‚
β”‚ str  ┆ f64  β”‚
β•žβ•β•β•β•β•β•β•ͺ══════║
β”‚ b    ┆ 2.0  β”‚
β””β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”˜
data_df <- data.frame(
  col1 = c(NA, "b", "c"),
  col2 = c(1, 2, NA)
)
data_df[complete.cases(data_df), ]
  col1 col2
2    b    2
data_df <- data.frame(
  col1 = c(NA, "b", "c"),
  col2 = c(1, 2, NA)
)
data_df %>%
  filter(complete.cases(.))
  col1 col2
1    b    2
data_dt <- data.table(
  col1 = c(NA, "b", "c"),
  col2 = c(1, 2, NA)
)
na.omit(data_dt)
     col1  col2
   <char> <num>
1:      b     2

If you want, you can specify a column (or multiple columns):

data_pl$drop_nulls("col1")
shape: (2, 2)
β”Œβ”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”
β”‚ col1 ┆ col2 β”‚
β”‚ ---  ┆ ---  β”‚
β”‚ str  ┆ f64  β”‚
β•žβ•β•β•β•β•β•β•ͺ══════║
β”‚ b    ┆ 2.0  β”‚
β”‚ c    ┆ null β”‚
β””β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”˜
data_df[!is.na(data_df$col1), ]
  col1 col2
2    b    2
3    c   NA
data_df |>
  filter(!is.na(col1))
  col1 col2
1    b    2
2    c   NA
data_dt[complete.cases(data_dt[, .SD, .SDcols = "col1"]), ]
     col1  col2
   <char> <num>
1:      b     2
2:      c    NA

2.16 Strings methods

In polars, a lot of strings methods are useful. Here is the list.
To use them, simply prefix them with str.

2.16.1 Get substrings

The str$slice() method can be used to create substrings of the string values of a Utf8 Series.

str$slice() does not work like R base’s substr() function for finding the substring of interest: - substr() takes two arguments: the first and last elements; - str$slice() takes two arguments: the first element and the extraction length.

Important

With Polars, numeric default is 0! Thus the equivalent to str$slice(0,3) with Polars will be substr(1,3).

Two further comments:

  • If the second argument length is not specified, the sub-character string of interest will default to the end of the character string. For example in a DataFrame if mycol is a string column of length 4, pl.col("mycol").str.slice(1) is equivalent to substr(mycol,2,4) in dplyr.

  • The first argument accepts negative values, which means that sub-strings can be considered starting from the end. For example in a DataFrame if mycol is a string column of length 4, pl.col("mycol").str.slice(-2) is equivalent to substr(mycol,3,4) in dplyr.

Let’s see an example:

mydf <- data.frame(
  col1 = 1:4,
  col2 = c("One_X","One_Y","Two_X","Two_Y")
)
pl$DataFrame(mydf)$with_columns(
  pl$col("col2")$str$slice(0,length=3)$alias("level"),
  pl$col("col2")$str$slice(-1)$alias("x_y")
)
shape: (4, 4)
β”Œβ”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”
β”‚ col1 ┆ col2  ┆ level ┆ x_y β”‚
β”‚ ---  ┆ ---   ┆ ---   ┆ --- β”‚
β”‚ i32  ┆ str   ┆ str   ┆ str β”‚
β•žβ•β•β•β•β•β•β•ͺ═══════β•ͺ═══════β•ͺ═════║
β”‚ 1    ┆ One_X ┆ One   ┆ X   β”‚
β”‚ 2    ┆ One_Y ┆ One   ┆ Y   β”‚
β”‚ 3    ┆ Two_X ┆ Two   ┆ X   β”‚
β”‚ 4    ┆ Two_Y ┆ Two   ┆ Y   β”‚
β””β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”˜
mydf$level <- substr(mydf$col2, 1, 3)
mydf$x_y <- substr(mydf$col2, nchar(mydf$col2), nchar(mydf$col2))
mydf
  col1  col2 level x_y
1    1 One_X   One   X
2    2 One_Y   One   Y
3    3 Two_X   Two   X
4    4 Two_Y   Two   Y
mydf |>
  mutate(level = substr(col2, 1, 3),
         x_y = substr(col2, nchar(col2), nchar(col2)))
  col1  col2 level x_y
1    1 One_X   One   X
2    2 One_Y   One   Y
3    3 Two_X   Two   X
4    4 Two_Y   Two   Y
mydt <- as.data.table(mydf) 
mydt[, c("level", "x_y") := .(substr(col2, 1, 3), substr(col2, nchar(col2), nchar(col2)))]
mydt
    col1   col2  level    x_y
   <int> <char> <char> <char>
1:     1  One_X    One      X
2:     2  One_Y    One      Y
3:     3  Two_X    Two      X
4:     4  Two_Y    Two      Y

2.16.2 Check if string values start with a substring

The str$starts_with() method can be used to check if string values start with a substring. It returns a Boolean.

Let’s see an example where we create new Boolean columns based on the start of a character string:

pl$DataFrame(mydf)$with_columns(
  pl$col("col2")$str$starts_with("One")$alias("is_one"),
  pl$col("col2")$str$starts_with("Two")$alias("is_two")
)
shape: (4, 6)
β”Œβ”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ col1 ┆ col2  ┆ level ┆ x_y ┆ is_one ┆ is_two β”‚
β”‚ ---  ┆ ---   ┆ ---   ┆ --- ┆ ---    ┆ ---    β”‚
β”‚ i32  ┆ str   ┆ str   ┆ str ┆ bool   ┆ bool   β”‚
β•žβ•β•β•β•β•β•β•ͺ═══════β•ͺ═══════β•ͺ═════β•ͺ════════β•ͺ════════║
β”‚ 1    ┆ One_X ┆ One   ┆ X   ┆ true   ┆ false  β”‚
β”‚ 2    ┆ One_Y ┆ One   ┆ Y   ┆ true   ┆ false  β”‚
β”‚ 3    ┆ Two_X ┆ Two   ┆ X   ┆ false  ┆ true   β”‚
β”‚ 4    ┆ Two_Y ┆ Two   ┆ Y   ┆ false  ┆ true   β”‚
β””β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”˜
mydf$is_one <- grepl("^One", mydf$col2)
mydf$is_two <- grepl("^Two", mydf$col2)
mydf
  col1  col2 level x_y is_one is_two
1    1 One_X   One   X   TRUE  FALSE
2    2 One_Y   One   Y   TRUE  FALSE
3    3 Two_X   Two   X  FALSE   TRUE
4    4 Two_Y   Two   Y  FALSE   TRUE
mydf |>
  mutate(level = substr(col2, 1, 3),
         x_y = substr(col2, nchar(col2), nchar(col2)))
  col1  col2 level x_y is_one is_two
1    1 One_X   One   X   TRUE  FALSE
2    2 One_Y   One   Y   TRUE  FALSE
3    3 Two_X   Two   X  FALSE   TRUE
4    4 Two_Y   Two   Y  FALSE   TRUE
mydt <- as.data.table(mydf) 
mydt[, c("is_one", "is_two") := .(grepl("^One", col2), grepl("^Two", col2))]
mydt
    col1   col2  level    x_y is_one is_two
   <int> <char> <char> <char> <lgcl> <lgcl>
1:     1  One_X    One      X   TRUE  FALSE
2:     2  One_Y    One      Y   TRUE  FALSE
3:     3  Two_X    Two      X  FALSE   TRUE
4:     4  Two_Y    Two      Y  FALSE   TRUE

2.16.3 Check if string values end with a substring

The str$ends_with() method can be used to check if string values start with a substring. It returns a Boolean.

Let’s see an example where we filter the lines of a DataFrame based on the start of a character string:

pl$DataFrame(mydf)$filter(
  pl$col("col2")$str$ends_with("X")
)
shape: (2, 6)
β”Œβ”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ col1 ┆ col2  ┆ level ┆ x_y ┆ is_one ┆ is_two β”‚
β”‚ ---  ┆ ---   ┆ ---   ┆ --- ┆ ---    ┆ ---    β”‚
β”‚ i32  ┆ str   ┆ str   ┆ str ┆ bool   ┆ bool   β”‚
β•žβ•β•β•β•β•β•β•ͺ═══════β•ͺ═══════β•ͺ═════β•ͺ════════β•ͺ════════║
β”‚ 1    ┆ One_X ┆ One   ┆ X   ┆ true   ┆ false  β”‚
β”‚ 3    ┆ Two_X ┆ Two   ┆ X   ┆ false  ┆ true   β”‚
β””β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”˜
mydf[substr(mydf$col2, nchar(mydf$col2), nchar(mydf$col2)) == "X", ]
  col1  col2 level x_y is_one is_two
1    1 One_X   One   X   TRUE  FALSE
3    3 Two_X   Two   X  FALSE   TRUE
mydf |>
  filter(endsWith(col2, "X"))
  col1  col2 level x_y is_one is_two
1    1 One_X   One   X   TRUE  FALSE
2    3 Two_X   Two   X  FALSE   TRUE
mydt <- as.data.table(mydf) 
mydt[substr(col2, nchar(col2), nchar(col2)) == "X"]
    col1   col2  level    x_y is_one is_two
   <int> <char> <char> <char> <lgcl> <lgcl>
1:     1  One_X    One      X   TRUE  FALSE
2:     3  Two_X    Two      X  FALSE   TRUE

2.17 Create your methods

With R you can create your own method/function with function().
Let’s try to create a R function to captue some DataFrame transformations.

Our simple function:
- Takes a DataFrame as an input (argument data)
- Convert Categorical columns into Strings
- Make all Strings columns uppercase
- And filter only the third first rows

fn_transformation <- function(data) {
  
  data$
    # Convert Integer columns into Float 
    with_columns(
      pl$col(pl$Int32)$cast(pl$Float64))$
    # Make all Strings columns uppercase
    with_columns(
      pl$col(pl$Utf8)$str$to_uppercase())$
    # Filter only the third first rows
    head(3)
  
}

Let’s apply our method to pl$DataFrame(iris):

fn_transformation(pl$DataFrame(mydf))
shape: (3, 6)
β”Œβ”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ col1 ┆ col2  ┆ level ┆ x_y ┆ is_one ┆ is_two β”‚
β”‚ ---  ┆ ---   ┆ ---   ┆ --- ┆ ---    ┆ ---    β”‚
β”‚ f64  ┆ str   ┆ str   ┆ str ┆ bool   ┆ bool   β”‚
β•žβ•β•β•β•β•β•β•ͺ═══════β•ͺ═══════β•ͺ═════β•ͺ════════β•ͺ════════║
β”‚ 1.0  ┆ ONE_X ┆ ONE   ┆ X   ┆ true   ┆ false  β”‚
β”‚ 2.0  ┆ ONE_Y ┆ ONE   ┆ Y   ┆ true   ┆ false  β”‚
β”‚ 3.0  ┆ TWO_X ┆ TWO   ┆ X   ┆ false  ┆ true   β”‚
β””β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Our method fn_transformation() can now be re-used multiple times for example on another datasets:

Newdf <- pl$DataFrame(
  col_categ = pl$Series(factor(c("a","b","c"))),
  col_str = pl$Series(c("x","y","z")),
  col_num = pl$Series(1:3)
)
Warning in pl$Series(factor(c("a", "b", "c"))): `pl$Series()` will handle unnamed arguments differently as of 0.17.0:
- until 0.17.0, the first argument corresponds to the values and the second argument to the name of the Series.
- as of 0.17.0, the first argument will correspond to the name and the second argument to the values.
Use named arguments in `pl$Series()` or replace `pl$Series(<values>, <name>)` by `as_polars_series(<values>, <name>)` to silence this warning.
Warning in pl$Series(c("x", "y", "z")): `pl$Series()` will handle unnamed arguments differently as of 0.17.0:
- until 0.17.0, the first argument corresponds to the values and the second argument to the name of the Series.
- as of 0.17.0, the first argument will correspond to the name and the second argument to the values.
Use named arguments in `pl$Series()` or replace `pl$Series(<values>, <name>)` by `as_polars_series(<values>, <name>)` to silence this warning.
Warning in pl$Series(1:3): `pl$Series()` will handle unnamed arguments differently as of 0.17.0:
- until 0.17.0, the first argument corresponds to the values and the second argument to the name of the Series.
- as of 0.17.0, the first argument will correspond to the name and the second argument to the values.
Use named arguments in `pl$Series()` or replace `pl$Series(<values>, <name>)` by `as_polars_series(<values>, <name>)` to silence this warning.
fn_transformation(Newdf)
shape: (3, 3)
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ col_categ ┆ col_str ┆ col_num β”‚
β”‚ ---       ┆ ---     ┆ ---     β”‚
β”‚ cat       ┆ str     ┆ f64     β”‚
β•žβ•β•β•β•β•β•β•β•β•β•β•β•ͺ═════════β•ͺ═════════║
β”‚ a         ┆ X       ┆ 1.0     β”‚
β”‚ b         ┆ Y       ┆ 2.0     β”‚
β”‚ c         ┆ Z       ┆ 3.0     β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Of course, in real life, we will create functions that are more complicated than our example.