2  Data manipulation

Below is the code used to create the fake data needed to run code in this document. Click to expand it! πŸ‘‡

Code
library(polars)
library(dplyr)
library(tidyr)
library(tidypolars)
library(data.table)
library(arrow)
library(microbenchmark)
iris_dt <- as.data.table(iris)

# Creation of Series
mynumbers_serie <- as_polars_series(1:3)
myletters_serie <- as_polars_series(c("a","b","c"))

# Creation of mydf DataFrame 
mydf <- pl$DataFrame(
  col1 = mynumbers_serie,
  col2 = myletters_serie
)

2.1 Introduction to methods

Important

Although some simple R functions work out of the box on polars objects, the added value of Polars consists in the methods. Thosepowerful methods are accessed using the $ operator.

There are numerous methods designed to accomplish various tasks.

  • One advantage of using methods is that many more operations are possible on Polars objects using methods than through base R functions.

  • A second advantage is Methods Chaining, a core part of the Polars workflow. If you are coming from one of the other popular data wrangling libraries in R, then you probably already have an innate sense of what this means. For instance:

    • In dplyr we use a pipe operator, e.g. dat |> filter(…) |> select(…)
      -In data.table we use its indexing syntax, e.g. DT[i, j, by][…]

In polars our method chaining syntax takes the form object$m1()$m2(), where object is our data object, and $m1() and $m2() are appropriate methods.

  • For Series, see this section for the methods available in {polars}.
  • For DataFrames, see this section for the methods available in {polars}.

Some examples with Series: Some examples with Series:

# To get a sum
mynumbers_serie$sum()
shape: (1, 1)
β”Œβ”€β”€β”€β”€β”€β”
β”‚     β”‚
β”‚ --- β”‚
β”‚ i32 β”‚
β•žβ•β•β•β•β•β•‘
β”‚ 6   β”‚
β””β”€β”€β”€β”€β”€β”˜
# To sort
mynumbers_serie$sort()
shape: (3, 1)
β”Œβ”€β”€β”€β”€β”€β”
β”‚     β”‚
β”‚ --- β”‚
β”‚ i32 β”‚
β•žβ•β•β•β•β•β•‘
β”‚ 1   β”‚
β”‚ 2   β”‚
β”‚ 3   β”‚
β””β”€β”€β”€β”€β”€β”˜

Some examples with DataFrame:

# To get a character vector of column names
mydf$columns
[1] "col1" "col2"
# To get dimensions of DataFrame
mydf$shape
[1] 3 2
# We can mix standard R functions and methods
length(mydf$columns)
[1] 2

Polars includes a very useful chaining method in data manipulation operations. From this point of view, Polars is more like dplyr and data.table. This is how the chaining method is defined in the official documentation:

In polars our method chaining syntax takes the form object$m1()$m2(), where object is our data object, and m1() and m2() are appropriate methods, like subsetting or aggregation expressions.

Let’s see an example with the iris dataset:

as_polars_df(iris)$
  group_by(
  "Species")$
  median()
shape: (3, 5)
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Species    ┆ Sepal.Length ┆ Sepal.Width ┆ Petal.Length ┆ Petal.Width β”‚
β”‚ ---        ┆ ---          ┆ ---         ┆ ---          ┆ ---         β”‚
β”‚ cat        ┆ f64          ┆ f64         ┆ f64          ┆ f64         β”‚
β•žβ•β•β•β•β•β•β•β•β•β•β•β•β•ͺ══════════════β•ͺ═════════════β•ͺ══════════════β•ͺ═════════════║
β”‚ versicolor ┆ 5.9          ┆ 2.8         ┆ 4.35         ┆ 1.3         β”‚
β”‚ setosa     ┆ 5.0          ┆ 3.4         ┆ 1.5          ┆ 0.2         β”‚
β”‚ virginica  ┆ 6.5          ┆ 3.0         ┆ 5.55         ┆ 2.0         β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
as_polars_df(iris) |>
  group_by(Species) |>
  summarise(across(everything(),median))
shape: (3, 5)
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Species    ┆ Sepal.Length ┆ Sepal.Width ┆ Petal.Length ┆ Petal.Width β”‚
β”‚ ---        ┆ ---          ┆ ---         ┆ ---          ┆ ---         β”‚
β”‚ cat        ┆ f64          ┆ f64         ┆ f64          ┆ f64         β”‚
β•žβ•β•β•β•β•β•β•β•β•β•β•β•β•ͺ══════════════β•ͺ═════════════β•ͺ══════════════β•ͺ═════════════║
β”‚ versicolor ┆ 5.9          ┆ 2.8         ┆ 4.35         ┆ 1.3         β”‚
β”‚ setosa     ┆ 5.0          ┆ 3.4         ┆ 1.5          ┆ 0.2         β”‚
β”‚ virginica  ┆ 6.5          ┆ 3.0         ┆ 5.55         ┆ 2.0         β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
aggregate(. ~ Species, iris, median)
     Species Sepal.Length Sepal.Width Petal.Length Petal.Width
1     setosa          5.0         3.4         1.50         0.2
2 versicolor          5.9         2.8         4.35         1.3
3  virginica          6.5         3.0         5.55         2.0
iris |>
  group_by(Species) |>
  summarise(across(everything(),median))
# A tibble: 3 Γ— 5
  Species    Sepal.Length Sepal.Width Petal.Length Petal.Width
  <fct>             <dbl>       <dbl>        <dbl>       <dbl>
1 setosa              5           3.4         1.5          0.2
2 versicolor          5.9         2.8         4.35         1.3
3 virginica           6.5         3           5.55         2  
iris_dt[, lapply(.SD, median), by = Species]
      Species Sepal.Length Sepal.Width Petal.Length Petal.Width
       <fctr>        <num>       <num>        <num>       <num>
1:     setosa          5.0         3.4         1.50         0.2
2: versicolor          5.9         2.8         4.35         1.3
3:  virginica          6.5         3.0         5.55         2.0
Tip

In the Polars code used above, you will notice that we have introduced line breaks. We could have written the whole code on the same line but for the sake of readability I prefer to separate the methods used by line breaks.

2.2 Conversion between Series/DataFrames and vector/data.frames

2.2.1 From vector/data.frames to Series/DataFrames

These conversions have already been seen earlier in this book.

# To convert vector to Polars Series
myvector <- as_polars_series(c("a","b","c"))
# To convert data.frames to DataFrames
iris_polars <- as_polars_df(iris)

2.2.2 From Series/DataFrames to vector/data.frames

Here, we can use as.vector() and as.data.frame() functions.

# To convert Polars Series to a vector
as.vector(myvector)
[1] "a" "b" "c"
# To convert Polars Series to a list
as.list(myvector)
[[1]]
[1] "a"

[[2]]
[1] "b"

[[3]]
[1] "c"
# To convert DataFrames to data.frames
as.data.frame(iris_polars$slice(1,5))
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          4.9         3.0          1.4         0.2  setosa
2          4.7         3.2          1.3         0.2  setosa
3          4.6         3.1          1.5         0.2  setosa
4          5.0         3.6          1.4         0.2  setosa
5          5.4         3.9          1.7         0.4  setosa

2.3 Initial informations on a DataFrame

Here is a list of instructions that I frequently use to quickly get information about a DataFrame.

2.3.1 Get the schema

A DataFrame has a schema attribute which is a named list of DataTypes.

mydf$schema
$col1
Int32

$col2
String

2.3.2 Get the dimensions

A DataFrame has a shape attribute which is a two length numeric vector of c(nrows,ncols).

mydf$shape
[1] 3 2

2.3.3 Get columns types

A DataFrame has a dtypes attribute which is a list of dtypes (for data type) for every column of the DataFrame.

Alternatively, the dtype_strings() method can be used to get columns types in a character/string vector.

# With dtypes attribute (with a "s" and without parentheses)
mydf$dtypes
[[1]]
Int32

[[2]]
String

2.3.4 Get a glimpse

You can access a dense preview of a DataFrame by using the glimpse() method. f The formatting is done one line per column, so wide DataFrame show nicely. Each line will show the column name, the dtypes attributes and the first few values.

as_polars_df(iris)$glimpse()
Rows: 150
Columns: 5
$ Sepal.Length <f64>: 5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9
$ Sepal.Width  <f64>: 3.5, 3.0, 3.2, 3.1, 3.6, 3.9, 3.4, 3.4, 2.9, 3.1
$ Petal.Length <f64>: 1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4, 1.5
$ Petal.Width  <f64>: 0.2, 0.2, 0.2, 0.2, 0.2, 0.4, 0.3, 0.2, 0.2, 0.1
$ Species      <cat>: setosa, setosa, setosa, setosa, setosa, setosa, setosa, setosa, setosa, setosa

2.3.5 Get summary statistics

You can access some summary statistics from a DataFrame by using the describe() method.

as_polars_df(iris)$describe()
shape: (9, 6)
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ statistic  ┆ Sepal.Length ┆ Sepal.Width ┆ Petal.Length ┆ Petal.Width ┆ Species   β”‚
β”‚ ---        ┆ ---          ┆ ---         ┆ ---          ┆ ---         ┆ ---       β”‚
β”‚ str        ┆ f64          ┆ f64         ┆ f64          ┆ f64         ┆ str       β”‚
β•žβ•β•β•β•β•β•β•β•β•β•β•β•β•ͺ══════════════β•ͺ═════════════β•ͺ══════════════β•ͺ═════════════β•ͺ═══════════║
β”‚ count      ┆ 150.0        ┆ 150.0       ┆ 150.0        ┆ 150.0       ┆ 150       β”‚
β”‚ null_count ┆ 0.0          ┆ 0.0         ┆ 0.0          ┆ 0.0         ┆ 0         β”‚
β”‚ mean       ┆ 5.843333     ┆ 3.057333    ┆ 3.758        ┆ 1.199333    ┆ null      β”‚
β”‚ std        ┆ 0.828066     ┆ 0.435866    ┆ 1.765298     ┆ 0.762238    ┆ null      β”‚
β”‚ min        ┆ 4.3          ┆ 2.0         ┆ 1.0          ┆ 0.1         ┆ setosa    β”‚
β”‚ 25%        ┆ 5.1          ┆ 2.8         ┆ 1.6          ┆ 0.3         ┆ null      β”‚
β”‚ 50%        ┆ 5.8          ┆ 3.0         ┆ 4.4          ┆ 1.3         ┆ null      β”‚
β”‚ 75%        ┆ 6.4          ┆ 3.3         ┆ 5.1          ┆ 1.8         ┆ null      β”‚
β”‚ max        ┆ 7.9          ┆ 4.4         ┆ 6.9          ┆ 2.5         ┆ virginica β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

2.3.6 Get the rows number

A DataFrame has a width attribute which returns the width of the DataFrame.

# With width attribute
mydf$width
[1] 2

2.3.7 Get the columns number

A DataFrame has a height attribute which returns the width of the DataFrame.

# With height attribute
mydf$height
[1] 3

2.3.8 Get the columns names

A DataFrame has a columns attribute which returns the columns names of the DataFrame in a character vector.
The names() method can be used to get the columns names.

mydf$columns
[1] "col1" "col2"
# With names() method
names(mydf)
[1] "col1" "col2"

2.3.9 Rename columns

A character vector can be passed to the columns attribute to rename the columns names of the DataFrame.

mydf$rename(col1 = "colA", col2 = "colB")
shape: (3, 2)
β”Œβ”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”
β”‚ colA ┆ colB β”‚
β”‚ ---  ┆ ---  β”‚
β”‚ i32  ┆ str  β”‚
β•žβ•β•β•β•β•β•β•ͺ══════║
β”‚ 1    ┆ a    β”‚
β”‚ 2    ┆ b    β”‚
β”‚ 3    ┆ c    β”‚
β””β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”˜

2.3.10 Get the first n rows

The head() method can be used to get the first n rows of a DataFrame.

# To get the 2 first rows
mydf$head(2)
shape: (2, 2)
β”Œβ”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”
β”‚ col1 ┆ col2 β”‚
β”‚ ---  ┆ ---  β”‚
β”‚ i32  ┆ str  β”‚
β•žβ•β•β•β•β•β•β•ͺ══════║
β”‚ 1    ┆ a    β”‚
β”‚ 2    ┆ b    β”‚
β””β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”˜

2.3.11 Count values in a DataFrame

The value_counts() method can be used to count the occurences of uniques values in a Series of a DataFrame. value_counts() works with a Series. It must therefore be supplied with the select() method. See here to learn about it.

as_polars_df(iris)$select(pl$col("Species")$value_counts())
shape: (3, 1)
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Species           β”‚
β”‚ ---               β”‚
β”‚ struct[2]         β”‚
β•žβ•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•‘
β”‚ {"versicolor",50} β”‚
β”‚ {"setosa",50}     β”‚
β”‚ {"virginica",50}  β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
as_polars_df(iris) |>
  count(Species)
shape: (3, 2)
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”
β”‚ Species    ┆ n   β”‚
β”‚ ---        ┆ --- β”‚
β”‚ cat        ┆ u32 β”‚
β•žβ•β•β•β•β•β•β•β•β•β•β•β•β•ͺ═════║
β”‚ setosa     ┆ 50  β”‚
β”‚ versicolor ┆ 50  β”‚
β”‚ virginica  ┆ 50  β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”˜
table(iris$Species)

    setosa versicolor  virginica 
        50         50         50 
iris |>
  count(Species)
     Species  n
1     setosa 50
2 versicolor 50
3  virginica 50
iris_dt[, .N, by = Species]
      Species     N
       <fctr> <int>
1:     setosa    50
2: versicolor    50
3:  virginica    50

2.3.12 Count NA over columns in a DataFrame

The null_count() method can be used to count NA values of a DataFrame.

mydfNA <- pl$DataFrame(
  colA = c("a",NA,"c"),
  colB = c("d",NA,NA))
mydfNA$select(pl$all()$null_count())
shape: (1, 2)
β”Œβ”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”
β”‚ colA ┆ colB β”‚
β”‚ ---  ┆ ---  β”‚
β”‚ u32  ┆ u32  β”‚
β•žβ•β•β•β•β•β•β•ͺ══════║
β”‚ 1    ┆ 2    β”‚
β””β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”˜
mydfNA |>
  summarise(across(everything(), ~sum(is.na(.x))))
shape: (1, 2)
β”Œβ”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”
β”‚ colA ┆ colB β”‚
β”‚ ---  ┆ ---  β”‚
β”‚ u32  ┆ u32  β”‚
β•žβ•β•β•β•β•β•β•ͺ══════║
β”‚ 1    ┆ 2    β”‚
β””β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”˜
mydfNA <- data.frame(
  colA = c("a",NA,"c"),
  colB = c("d",NA,NA))
sapply(mydfNA, function(x) sum(is.na(x)))
colA colB 
   1    2 
mydfNA |>
  summarise(across(everything(), ~sum(is.na(.x))))
  colA colB
1    1    2
mydfNA_dt <- as.data.table(mydfNA)
mydfNA_dt[, lapply(.SD, function(x) sum(is.na(x))), .SDcols = names(mydfNA_dt)]
    colA  colB
   <int> <int>
1:     1     2

2.4 Filter rows

The first option to filter rows of a DataFrame is to use square brackets [] indexing (with integer row number).

data(iris)
# The first four lines
as_polars_df(iris)[1:4,]
shape: (4, 5)
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Sepal.Length ┆ Sepal.Width ┆ Petal.Length ┆ Petal.Width ┆ Species β”‚
β”‚ ---          ┆ ---         ┆ ---          ┆ ---         ┆ ---     β”‚
β”‚ f64          ┆ f64         ┆ f64          ┆ f64         ┆ cat     β”‚
β•žβ•β•β•β•β•β•β•β•β•β•β•β•β•β•β•ͺ═════════════β•ͺ══════════════β•ͺ═════════════β•ͺ═════════║
β”‚ 5.1          ┆ 3.5         ┆ 1.4          ┆ 0.2         ┆ setosa  β”‚
β”‚ 4.9          ┆ 3.0         ┆ 1.4          ┆ 0.2         ┆ setosa  β”‚
β”‚ 4.7          ┆ 3.2         ┆ 1.3          ┆ 0.2         ┆ setosa  β”‚
β”‚ 4.6          ┆ 3.1         ┆ 1.5          ┆ 0.2         ┆ setosa  β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
# The lines 1, 3 and 5
as_polars_df(iris)[c(1,3,5),]
shape: (3, 5)
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Sepal.Length ┆ Sepal.Width ┆ Petal.Length ┆ Petal.Width ┆ Species β”‚
β”‚ ---          ┆ ---         ┆ ---          ┆ ---         ┆ ---     β”‚
β”‚ f64          ┆ f64         ┆ f64          ┆ f64         ┆ cat     β”‚
β•žβ•β•β•β•β•β•β•β•β•β•β•β•β•β•β•ͺ═════════════β•ͺ══════════════β•ͺ═════════════β•ͺ═════════║
β”‚ 5.1          ┆ 3.5         ┆ 1.4          ┆ 0.2         ┆ setosa  β”‚
β”‚ 4.7          ┆ 3.2         ┆ 1.3          ┆ 0.2         ┆ setosa  β”‚
β”‚ 5.0          ┆ 3.6         ┆ 1.4          ┆ 0.2         ┆ setosa  β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
Important

It’s convenient when you have to quickly inspect your data. But you’ll quickly be limited by the square brackets, as they don’t accept conditions with the expressions. For example as_polars_df(iris)[Petal.Length > 6] doesn’t work.

The second and best option is to use the filter() method. It must be used with the Polars expression, here the col() method which allows to designate the columns on which the filter condition will be applied.

Let’s see in details what’s inside a filter() method with an example:

  • pl$col("Petal.Length"): this expression selects the Petal.Length column from iris;
  • >6: applies a Boolean condition to this expression (for all Petals that have a length > 6).

In the example below, we will use & operator to apply multiple conditions in filter() method:

as_polars_df(iris)$filter(
  pl$col("Petal.Length") > 6 & pl$col("Petal.Width") < 2)
shape: (2, 5)
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Sepal.Length ┆ Sepal.Width ┆ Petal.Length ┆ Petal.Width ┆ Species   β”‚
β”‚ ---          ┆ ---         ┆ ---          ┆ ---         ┆ ---       β”‚
β”‚ f64          ┆ f64         ┆ f64          ┆ f64         ┆ cat       β”‚
β•žβ•β•β•β•β•β•β•β•β•β•β•β•β•β•β•ͺ═════════════β•ͺ══════════════β•ͺ═════════════β•ͺ═══════════║
β”‚ 7.3          ┆ 2.9         ┆ 6.3          ┆ 1.8         ┆ virginica β”‚
β”‚ 7.4          ┆ 2.8         ┆ 6.1          ┆ 1.9         ┆ virginica β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
as_polars_df(iris) |> 
  filter(Petal.Length > 6 & Petal.Width < 2)  
shape: (2, 5)
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Sepal.Length ┆ Sepal.Width ┆ Petal.Length ┆ Petal.Width ┆ Species   β”‚
β”‚ ---          ┆ ---         ┆ ---          ┆ ---         ┆ ---       β”‚
β”‚ f64          ┆ f64         ┆ f64          ┆ f64         ┆ cat       β”‚
β•žβ•β•β•β•β•β•β•β•β•β•β•β•β•β•β•ͺ═════════════β•ͺ══════════════β•ͺ═════════════β•ͺ═══════════║
β”‚ 7.3          ┆ 2.9         ┆ 6.3          ┆ 1.8         ┆ virginica β”‚
β”‚ 7.4          ┆ 2.8         ┆ 6.1          ┆ 1.9         ┆ virginica β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
iris[iris$Petal.Length > 6 & iris$Petal.Width < 2,] # here don't forget the comma
    Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
108          7.3         2.9          6.3         1.8 virginica
131          7.4         2.8          6.1         1.9 virginica
iris |>
  filter(Petal.Length > 6 & Petal.Width < 2) 
  Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
1          7.3         2.9          6.3         1.8 virginica
2          7.4         2.8          6.1         1.9 virginica
iris_dt[Petal.Length > 6 & Petal.Width < 2]
   Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
          <num>       <num>        <num>       <num>    <fctr>
1:          7.3         2.9          6.3         1.8 virginica
2:          7.4         2.8          6.1         1.9 virginica
Tip

The equivalent of %in% R operator is the is_in() method. It should be used in association with the lit() method.

pl$DataFrame(
  colA = c("a","b","c")
  )$filter(
  pl$col("colA")$is_in(pl$lit(c("a","b")))
  )
shape: (2, 1)
β”Œβ”€β”€β”€β”€β”€β”€β”
β”‚ colA β”‚
β”‚ ---  β”‚
β”‚ str  β”‚
β•žβ•β•β•β•β•β•β•‘
β”‚ a    β”‚
β”‚ b    β”‚
β””β”€β”€β”€β”€β”€β”€β”˜

Another reason for using the filter() method is that filter expressions can be optimised in lazy mode by the query optimiser. Square brackets [] can only be used in eager mode.

Tip

There is another way to speed up filter processing on rows: tell polars that the column(s) used to filter rows are already sorted! To do this, you can use the set_sorted() method.
Here’s an example:

mydf <- pl$DataFrame(
  col1 = sort(runif(10000000))
)

microbenchmark(
  "Without telling col1 is sorted" = mydf$filter(pl$col("col1") < 100),
  "Telling col1 is sorted" = mydf$with_columns(pl$col("col1")$set_sorted())$filter(pl$col("col1") < 100)
  )
Unit: milliseconds
                           expr      min       lq     mean   median       uq
 Without telling col1 is sorted 5.405537 5.514903 5.645413 5.573309 5.675186
         Telling col1 is sorted 2.395465 2.477669 2.677980 2.560529 2.632537
       max neval
 10.390910   100
  7.811401   100

2.5 Select columns

2.5.1 Selecting by name

The first option for selecting columns of a DataFrame is to use square brackets [].
The second option is to use the select() method. In this case, it must be used with the col() method which allows to designate the columns to be selected with a character vector.

# 1st option : with square brackets syntax
as_polars_df(iris)[1:3,c("Petal.Length","Petal.Width")] 
shape: (3, 2)
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Petal.Length ┆ Petal.Width β”‚
β”‚ ---          ┆ ---         β”‚
β”‚ f64          ┆ f64         β”‚
β•žβ•β•β•β•β•β•β•β•β•β•β•β•β•β•β•ͺ═════════════║
β”‚ 1.4          ┆ 0.2         β”‚
β”‚ 1.4          ┆ 0.2         β”‚
β”‚ 1.3          ┆ 0.2         β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
# 2nd option : with select() method
as_polars_df(iris)$select(
  pl$col("Petal.Length","Petal.Width")
)$head(3) # display the first 3 lines
shape: (3, 2)
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Petal.Length ┆ Petal.Width β”‚
β”‚ ---          ┆ ---         β”‚
β”‚ f64          ┆ f64         β”‚
β•žβ•β•β•β•β•β•β•β•β•β•β•β•β•β•β•ͺ═════════════║
β”‚ 1.4          ┆ 0.2         β”‚
β”‚ 1.4          ┆ 0.2         β”‚
β”‚ 1.3          ┆ 0.2         β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
as_polars_df(iris) |>
  select(Petal.Length,Petal.Width) |>
  head(3)
shape: (3, 2)
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Petal.Length ┆ Petal.Width β”‚
β”‚ ---          ┆ ---         β”‚
β”‚ f64          ┆ f64         β”‚
β•žβ•β•β•β•β•β•β•β•β•β•β•β•β•β•β•ͺ═════════════║
β”‚ 1.4          ┆ 0.2         β”‚
β”‚ 1.4          ┆ 0.2         β”‚
β”‚ 1.3          ┆ 0.2         β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
iris[1:3,c("Petal.Length","Petal.Width")] 
  Petal.Length Petal.Width
1          1.4         0.2
2          1.4         0.2
3          1.3         0.2
iris |>
  select(Petal.Length,Petal.Width) |>
  head(3)
  Petal.Length Petal.Width
1          1.4         0.2
2          1.4         0.2
3          1.3         0.2
iris_dt[1:3,.(Petal.Length,Petal.Width)]
   Petal.Length Petal.Width
          <num>       <num>
1:          1.4         0.2
2:          1.4         0.2
3:          1.3         0.2

There are two major reasons why you should use the select() method over the syntax with square brackets:
- When you select and transform multiple columns with select() method, Polars will run these selections in parallel;
- Expressions used in select() method can be optimised in lazy mode by the query optimizer.

2.5.2 Selecting by data type

To select columns by data type from a DataFrame, you can pass a data type to pl$col expression. For example:

# Select only Float64 columns
as_polars_df(iris)$select(pl$col(pl$Float64))$head(3)
shape: (3, 4)
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Sepal.Length ┆ Sepal.Width ┆ Petal.Length ┆ Petal.Width β”‚
β”‚ ---          ┆ ---         ┆ ---          ┆ ---         β”‚
β”‚ f64          ┆ f64         ┆ f64          ┆ f64         β”‚
β•žβ•β•β•β•β•β•β•β•β•β•β•β•β•β•β•ͺ═════════════β•ͺ══════════════β•ͺ═════════════║
β”‚ 5.1          ┆ 3.5         ┆ 1.4          ┆ 0.2         β”‚
β”‚ 4.9          ┆ 3.0         ┆ 1.4          ┆ 0.2         β”‚
β”‚ 4.7          ┆ 3.2         ┆ 1.3          ┆ 0.2         β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
# Select only numeric columns
as_polars_df(iris)$select(cs$numeric())$head(3) 
shape: (3, 4)
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Sepal.Length ┆ Sepal.Width ┆ Petal.Length ┆ Petal.Width β”‚
β”‚ ---          ┆ ---         ┆ ---          ┆ ---         β”‚
β”‚ f64          ┆ f64         ┆ f64          ┆ f64         β”‚
β•žβ•β•β•β•β•β•β•β•β•β•β•β•β•β•β•ͺ═════════════β•ͺ══════════════β•ͺ═════════════║
β”‚ 5.1          ┆ 3.5         ┆ 1.4          ┆ 0.2         β”‚
β”‚ 4.9          ┆ 3.0         ┆ 1.4          ┆ 0.2         β”‚
β”‚ 4.7          ┆ 3.2         ┆ 1.3          ┆ 0.2         β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
# Select only numeric columns
as_polars_df(iris) %>% 
  select(where(is.numeric)) %>% 
  head(3)
shape: (3, 4)
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Sepal.Length ┆ Sepal.Width ┆ Petal.Length ┆ Petal.Width β”‚
β”‚ ---          ┆ ---         ┆ ---          ┆ ---         β”‚
β”‚ f64          ┆ f64         ┆ f64          ┆ f64         β”‚
β•žβ•β•β•β•β•β•β•β•β•β•β•β•β•β•β•ͺ═════════════β•ͺ══════════════β•ͺ═════════════║
β”‚ 5.1          ┆ 3.5         ┆ 1.4          ┆ 0.2         β”‚
β”‚ 4.9          ┆ 3.0         ┆ 1.4          ┆ 0.2         β”‚
β”‚ 4.7          ┆ 3.2         ┆ 1.3          ┆ 0.2         β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
# Select only factor columns
as.data.frame(iris[1:3, sapply(iris, is.factor)])
  iris[1:3, sapply(iris, is.factor)]
1                             setosa
2                             setosa
3                             setosa
# Select only numeric columns
iris[1:3, sapply(iris, is.numeric)]
  Sepal.Length Sepal.Width Petal.Length Petal.Width
1          5.1         3.5          1.4         0.2
2          4.9         3.0          1.4         0.2
3          4.7         3.2          1.3         0.2
# Select only factor columns
iris %>% 
  select_if(is.factor) %>%
  head(3)
  Species
1  setosa
2  setosa
3  setosa
# Select only numeric columns
iris %>% 
  select_if(is.numeric) %>%
  head(3)
  Sepal.Length Sepal.Width Petal.Length Petal.Width
1          5.1         3.5          1.4         0.2
2          4.9         3.0          1.4         0.2
3          4.7         3.2          1.3         0.2
# Select only factor columns
iris_dt[, .SD, .SDcols = is.factor][1:3]
   Species
    <fctr>
1:  setosa
2:  setosa
3:  setosa
# Select only numeric columns
iris_dt[, .SD, .SDcols = is.numeric][1:3]
   Sepal.Length Sepal.Width Petal.Length Petal.Width
          <num>       <num>        <num>       <num>
1:          5.1         3.5          1.4         0.2
2:          4.9         3.0          1.4         0.2
3:          4.7         3.2          1.3         0.2
Note

You can also select columns from a DataFrame using a name pattern. See an example by removing a column

2.5.3 Selecting with a regex

To select columns with a regex from a DataFrame, you can pass it in pl$col expression. For example, to select all columns that starts with β€œSepal” in iris dataset:

as_polars_df(iris)$select(
  pl$col("^Sepal.*$")
)$head(3) # display the first 3 lines
shape: (3, 2)
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Sepal.Length ┆ Sepal.Width β”‚
β”‚ ---          ┆ ---         β”‚
β”‚ f64          ┆ f64         β”‚
β•žβ•β•β•β•β•β•β•β•β•β•β•β•β•β•β•ͺ═════════════║
β”‚ 5.1          ┆ 3.5         β”‚
β”‚ 4.9          ┆ 3.0         β”‚
β”‚ 4.7          ┆ 3.2         β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
as_polars_df(iris) %>%
  select(starts_with("Sepal")) %>%
  head(3)
shape: (3, 2)
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Sepal.Length ┆ Sepal.Width β”‚
β”‚ ---          ┆ ---         β”‚
β”‚ f64          ┆ f64         β”‚
β•žβ•β•β•β•β•β•β•β•β•β•β•β•β•β•β•ͺ═════════════║
β”‚ 5.1          ┆ 3.5         β”‚
β”‚ 4.9          ┆ 3.0         β”‚
β”‚ 4.7          ┆ 3.2         β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
iris[, grep("^Sepal.*$", colnames(iris))]
    Sepal.Length Sepal.Width
1            5.1         3.5
2            4.9         3.0
3            4.7         3.2
4            4.6         3.1
5            5.0         3.6
6            5.4         3.9
7            4.6         3.4
8            5.0         3.4
9            4.4         2.9
10           4.9         3.1
11           5.4         3.7
12           4.8         3.4
13           4.8         3.0
14           4.3         3.0
15           5.8         4.0
16           5.7         4.4
17           5.4         3.9
18           5.1         3.5
19           5.7         3.8
20           5.1         3.8
21           5.4         3.4
22           5.1         3.7
23           4.6         3.6
24           5.1         3.3
25           4.8         3.4
26           5.0         3.0
27           5.0         3.4
28           5.2         3.5
29           5.2         3.4
30           4.7         3.2
31           4.8         3.1
32           5.4         3.4
33           5.2         4.1
34           5.5         4.2
35           4.9         3.1
36           5.0         3.2
37           5.5         3.5
38           4.9         3.6
39           4.4         3.0
40           5.1         3.4
41           5.0         3.5
42           4.5         2.3
43           4.4         3.2
44           5.0         3.5
45           5.1         3.8
46           4.8         3.0
47           5.1         3.8
48           4.6         3.2
49           5.3         3.7
50           5.0         3.3
51           7.0         3.2
52           6.4         3.2
53           6.9         3.1
54           5.5         2.3
55           6.5         2.8
56           5.7         2.8
57           6.3         3.3
58           4.9         2.4
59           6.6         2.9
60           5.2         2.7
61           5.0         2.0
62           5.9         3.0
63           6.0         2.2
64           6.1         2.9
65           5.6         2.9
66           6.7         3.1
67           5.6         3.0
68           5.8         2.7
69           6.2         2.2
70           5.6         2.5
71           5.9         3.2
72           6.1         2.8
73           6.3         2.5
74           6.1         2.8
75           6.4         2.9
76           6.6         3.0
77           6.8         2.8
78           6.7         3.0
79           6.0         2.9
80           5.7         2.6
81           5.5         2.4
82           5.5         2.4
83           5.8         2.7
84           6.0         2.7
85           5.4         3.0
86           6.0         3.4
87           6.7         3.1
88           6.3         2.3
89           5.6         3.0
90           5.5         2.5
91           5.5         2.6
92           6.1         3.0
93           5.8         2.6
94           5.0         2.3
95           5.6         2.7
96           5.7         3.0
97           5.7         2.9
98           6.2         2.9
99           5.1         2.5
100          5.7         2.8
101          6.3         3.3
102          5.8         2.7
103          7.1         3.0
104          6.3         2.9
105          6.5         3.0
106          7.6         3.0
107          4.9         2.5
108          7.3         2.9
109          6.7         2.5
110          7.2         3.6
111          6.5         3.2
112          6.4         2.7
113          6.8         3.0
114          5.7         2.5
115          5.8         2.8
116          6.4         3.2
117          6.5         3.0
118          7.7         3.8
119          7.7         2.6
120          6.0         2.2
121          6.9         3.2
122          5.6         2.8
123          7.7         2.8
124          6.3         2.7
125          6.7         3.3
126          7.2         3.2
127          6.2         2.8
128          6.1         3.0
129          6.4         2.8
130          7.2         3.0
131          7.4         2.8
132          7.9         3.8
133          6.4         2.8
134          6.3         2.8
135          6.1         2.6
136          7.7         3.0
137          6.3         3.4
138          6.4         3.1
139          6.0         3.0
140          6.9         3.1
141          6.7         3.1
142          6.9         3.1
143          5.8         2.7
144          6.8         3.2
145          6.7         3.3
146          6.7         3.0
147          6.3         2.5
148          6.5         3.0
149          6.2         3.4
150          5.9         3.0
iris %>%
  select(starts_with("Sepal")) %>%
  head(3)
  Sepal.Length Sepal.Width
1          5.1         3.5
2          4.9         3.0
3          4.7         3.2
iris_dt[, .SD, .SDcols = grep("^Sepal", colnames(iris_dt))][1:3]
   Sepal.Length Sepal.Width
          <num>       <num>
1:          5.1         3.5
2:          4.9         3.0
3:          4.7         3.2

2.5.4 Selecting - other ways

The selectors API of polars enables to use other methods to select columns.
Here’s a few examples selecting the first and last column of a DataFrame respectively with first() and last() methods:

# Select the first column
as_polars_df(iris)$select(
  pl$first()
)$head(3)
shape: (3, 1)
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Sepal.Length β”‚
β”‚ ---          β”‚
β”‚ f64          β”‚
β•žβ•β•β•β•β•β•β•β•β•β•β•β•β•β•β•‘
β”‚ 5.1          β”‚
β”‚ 4.9          β”‚
β”‚ 4.7          β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
# Select the last column
as_polars_df(iris)$select(
  pl$last()
)$head(3)
shape: (3, 1)
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Species β”‚
β”‚ ---     β”‚
β”‚ cat     β”‚
β•žβ•β•β•β•β•β•β•β•β•β•‘
β”‚ setosa  β”‚
β”‚ setosa  β”‚
β”‚ setosa  β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

2.6 Modify/Add columns

2.6.1 Modify existing column

Similar to the dplyr package, the select() method can also be used to modify existing data. However, the result will exclude any columns that were not specified in the expression.

For example, if we want to get in the data.frame iris the Petal.Length column rounded without decimals.

as_polars_df(iris)$select(
  pl$col("Petal.Length")$round(decimals = 0)
)$head(3) # display the first 3 lines
shape: (3, 1)
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Petal.Length β”‚
β”‚ ---          β”‚
β”‚ f64          β”‚
β•žβ•β•β•β•β•β•β•β•β•β•β•β•β•β•β•‘
β”‚ 1.0          β”‚
β”‚ 1.0          β”‚
β”‚ 1.0          β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

The problem here is that we would like to keep all the iris columns and not just Petal.Length.

Again, let’s look at the official documentation:

To modify or add some columnsβ€”whilst preserving all others in the datasetβ€”it is therefore better to use the with_columns() method.

Note that the with_columns() method only accepts expressions. Let’s use it in an example:

as_polars_df(iris)$with_columns(
  pl$col("Petal.Length")$round(decimals = 0)
)$head(3) # display the first 3 lines
shape: (3, 5)
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Sepal.Length ┆ Sepal.Width ┆ Petal.Length ┆ Petal.Width ┆ Species β”‚
β”‚ ---          ┆ ---         ┆ ---          ┆ ---         ┆ ---     β”‚
β”‚ f64          ┆ f64         ┆ f64          ┆ f64         ┆ cat     β”‚
β•žβ•β•β•β•β•β•β•β•β•β•β•β•β•β•β•ͺ═════════════β•ͺ══════════════β•ͺ═════════════β•ͺ═════════║
β”‚ 5.1          ┆ 3.5         ┆ 1.0          ┆ 0.2         ┆ setosa  β”‚
β”‚ 4.9          ┆ 3.0         ┆ 1.0          ┆ 0.2         ┆ setosa  β”‚
β”‚ 4.7          ┆ 3.2         ┆ 1.0          ┆ 0.2         ┆ setosa  β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
as_polars_df(iris) |>
  mutate(Petal.Length = round(Petal.Length,0)) |>
  head(3)
shape: (3, 5)
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Sepal.Length ┆ Sepal.Width ┆ Petal.Length ┆ Petal.Width ┆ Species β”‚
β”‚ ---          ┆ ---         ┆ ---          ┆ ---         ┆ ---     β”‚
β”‚ f64          ┆ f64         ┆ f64          ┆ f64         ┆ cat     β”‚
β•žβ•β•β•β•β•β•β•β•β•β•β•β•β•β•β•ͺ═════════════β•ͺ══════════════β•ͺ═════════════β•ͺ═════════║
β”‚ 5.1          ┆ 3.5         ┆ 1.0          ┆ 0.2         ┆ setosa  β”‚
β”‚ 4.9          ┆ 3.0         ┆ 1.0          ┆ 0.2         ┆ setosa  β”‚
β”‚ 4.7          ┆ 3.2         ┆ 1.0          ┆ 0.2         ┆ setosa  β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
iris$Petal.Length <- round(iris$Petal.Length, digits = 0)
iris[1:3,]
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5            1         0.2  setosa
2          4.9         3.0            1         0.2  setosa
3          4.7         3.2            1         0.2  setosa
data(iris)
iris |>
  mutate(Petal.Length = round(Petal.Length,0)) |>
  head(3)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5            1         0.2  setosa
2          4.9         3.0            1         0.2  setosa
3          4.7         3.2            1         0.2  setosa
iris_dt[,Petal.Length := round(Petal.Length, digits = 0)]
iris_dt[1:3,]
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
          <num>       <num>        <num>       <num>  <fctr>
1:          5.1         3.5            1         0.2  setosa
2:          4.9         3.0            1         0.2  setosa
3:          4.7         3.2            1         0.2  setosa

2.6.2 Add a new column

If you want to add a column to a data.frame, you use the same syntax as above with with_columns(). Simply use the alias() method to specify the name of the newly created column.

as_polars_df(iris)$with_columns(
  pl$col("Petal.Length")$round(decimals = 0)$alias("Petal.Length.rounded")
)$head(3) # display the first 3 lines
shape: (3, 6)
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Sepal.Length ┆ Sepal.Width ┆ Petal.Length ┆ Petal.Width ┆ Species ┆ Petal.Length.rounded β”‚
β”‚ ---          ┆ ---         ┆ ---          ┆ ---         ┆ ---     ┆ ---                  β”‚
β”‚ f64          ┆ f64         ┆ f64          ┆ f64         ┆ cat     ┆ f64                  β”‚
β•žβ•β•β•β•β•β•β•β•β•β•β•β•β•β•β•ͺ═════════════β•ͺ══════════════β•ͺ═════════════β•ͺ═════════β•ͺ══════════════════════║
β”‚ 5.1          ┆ 3.5         ┆ 1.4          ┆ 0.2         ┆ setosa  ┆ 1.0                  β”‚
β”‚ 4.9          ┆ 3.0         ┆ 1.4          ┆ 0.2         ┆ setosa  ┆ 1.0                  β”‚
β”‚ 4.7          ┆ 3.2         ┆ 1.3          ┆ 0.2         ┆ setosa  ┆ 1.0                  β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

2.6.3 Add a new column with a constant value

If you need to create a new column with a constant value (i.e. the same value for all the rows in your DataFrame), you can use the literal lit() method. It works with the main types of Polars.

as_polars_df(iris)$with_columns(
  pl$lit("toto")$alias("mynewcolumn")
)$head(3) # display the first 3 lines
shape: (3, 6)
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Sepal.Length ┆ Sepal.Width ┆ Petal.Length ┆ Petal.Width ┆ Species ┆ mynewcolumn β”‚
β”‚ ---          ┆ ---         ┆ ---          ┆ ---         ┆ ---     ┆ ---         β”‚
β”‚ f64          ┆ f64         ┆ f64          ┆ f64         ┆ cat     ┆ str         β”‚
β•žβ•β•β•β•β•β•β•β•β•β•β•β•β•β•β•ͺ═════════════β•ͺ══════════════β•ͺ═════════════β•ͺ═════════β•ͺ═════════════║
β”‚ 5.1          ┆ 3.5         ┆ 1.4          ┆ 0.2         ┆ setosa  ┆ toto        β”‚
β”‚ 4.9          ┆ 3.0         ┆ 1.4          ┆ 0.2         ┆ setosa  ┆ toto        β”‚
β”‚ 4.7          ┆ 3.2         ┆ 1.3          ┆ 0.2         ┆ setosa  ┆ toto        β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

2.6.4 Add a new column based on conditions

To add new columns based on conditions, the when-then-otherwise expression must be used.

Here’s the syntax:

pl$when(**Boolean Expression**)$
  then(**Value if True**)$
  otherwise(**Value if False)$
  alias(**New Column Name**)

Here’s an example with equivalent syntax to help you understand:

as_polars_df(iris)$with_columns(
  pl$when(pl$col("Petal.Length") <= 2)$then(pl$lit("<=2"))$
    when(pl$col("Petal.Length") <= 5)$then(pl$lit("<=5"))$
    otherwise(pl$lit(">5"))$
    alias("mygroups"))$
  # we only need to display 2 variables to check that it's OK
  select(
    c("Petal.Length","mygroups"))[c(1,2,59,150),]
shape: (4, 2)
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Petal.Length ┆ mygroups β”‚
β”‚ ---          ┆ ---      β”‚
β”‚ f64          ┆ str      β”‚
β•žβ•β•β•β•β•β•β•β•β•β•β•β•β•β•β•ͺ══════════║
β”‚ 1.4          ┆ <=2      β”‚
β”‚ 1.4          ┆ <=2      β”‚
β”‚ 4.6          ┆ <=5      β”‚
β”‚ 5.1          ┆ >5       β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
iris$mygroups <- ifelse(iris$Petal.Length <= 2, "<=2",
                        ifelse(iris$Petal.Length <= 5, "<=5", ">5"))
# we only need to display 2 variables to check that it's OK
iris[c(1,2,59,150), c("Petal.Length", "mygroups")]
    Petal.Length mygroups
1            1.4      <=2
2            1.4      <=2
59           4.6      <=5
150          5.1       >5
as_polars_df(iris) %>%
  mutate(
    mygroups = case_when(
      Petal.Length <= 2 ~ "<=2",
      Petal.Length <= 5 ~ "<=5",
      TRUE ~ ">5"          # Γ©quivalent de .default
    )
  ) %>%
  select(Petal.Length, mygroups) %>%
  slice_head(n = 3)
shape: (3, 2)
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Petal.Length ┆ mygroups β”‚
β”‚ ---          ┆ ---      β”‚
β”‚ f64          ┆ str      β”‚
β•žβ•β•β•β•β•β•β•β•β•β•β•β•β•β•β•ͺ══════════║
β”‚ 1.4          ┆ <=2      β”‚
β”‚ 1.4          ┆ <=2      β”‚
β”‚ 1.3          ┆ <=2      β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
iris |>
  mutate(
    mygroups = case_when(
      Petal.Length <=2 ~ "<=2",
      Petal.Length <=5 ~ "<=5",
      .default = ">5")
  ) |>
  # we only need to display 2 variables  to check that it's OK
  select(Petal.Length,mygroups) |>
  slice(1,2,59,150)
  Petal.Length mygroups
1          1.4      <=2
2          1.4      <=2
3          4.6      <=5
4          5.1       >5
iris_dt[, mygroups := case_when(
  Petal.Length <= 2 ~ "<=2",
  Petal.Length <= 5 ~ "<=5",
  TRUE ~ ">5"
)]
# we only need to display 2 variables to check that it's OK
iris_dt[c(1,2,59,150), .(Petal.Length, mygroups)]
   Petal.Length mygroups
          <num>   <char>
1:            1      <=2
2:            1      <=2
3:            5      <=5
4:            5      <=5

2.6.5 Add a new column by group

To add new columns by group, the over expression must be used. This expression is similar to performing a groupby aggregation and joining the result back into the original dataframe. Here’s an example with equivalent syntax to help you understand:

df <- data.frame(
  name = c("X","X","Y","Y","Z","Z"),
  adress = c("A","B","C","D","E","F"),
  col2 = c(2L,4L,1L,3L,4L,2L),
  col3 = c(5L,19L,17L,12L,11L,15L)
)
df
  name adress col2 col3
1    X      A    2    5
2    X      B    4   19
3    Y      C    1   17
4    Y      D    3   12
5    Z      E    4   11
6    Z      F    2   15
as_polars_df(df)$with_columns(
  pl$col("col3")$max()$over("name")$name$suffix("_max")
)
shape: (6, 5)
β”Œβ”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ name ┆ adress ┆ col2 ┆ col3 ┆ col3_max β”‚
β”‚ ---  ┆ ---    ┆ ---  ┆ ---  ┆ ---      β”‚
β”‚ str  ┆ str    ┆ i32  ┆ i32  ┆ i32      β”‚
β•žβ•β•β•β•β•β•β•ͺ════════β•ͺ══════β•ͺ══════β•ͺ══════════║
β”‚ X    ┆ A      ┆ 2    ┆ 5    ┆ 19       β”‚
β”‚ X    ┆ B      ┆ 4    ┆ 19   ┆ 19       β”‚
β”‚ Y    ┆ C      ┆ 1    ┆ 17   ┆ 17       β”‚
β”‚ Y    ┆ D      ┆ 3    ┆ 12   ┆ 17       β”‚
β”‚ Z    ┆ E      ┆ 4    ┆ 11   ┆ 15       β”‚
β”‚ Z    ┆ F      ┆ 2    ┆ 15   ┆ 15       β”‚
β””β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
result <- aggregate(col3 ~ name, data = df, FUN = max)
colnames(result) <- c("name", "col3_max")
merge(df, result, by = "name", all.x = TRUE)
  name adress col2 col3 col3_max
1    X      A    2    5       19
2    X      B    4   19       19
3    Y      C    1   17       17
4    Y      D    3   12       17
5    Z      E    4   11       15
6    Z      F    2   15       15
df |>
  group_by(name) |>
  mutate(col3_max = max(col3))
# A tibble: 6 Γ— 5
# Groups:   name [3]
  name  adress  col2  col3 col3_max
  <chr> <chr>  <int> <int>    <int>
1 X     A          2     5       19
2 X     B          4    19       19
3 Y     C          1    17       17
4 Y     D          3    12       17
5 Z     E          4    11       15
6 Z     F          2    15       15
dt <- as.data.table(df)
dt[, col3_max := max(col3), by = name]
dt
     name adress  col2  col3 col3_max
   <char> <char> <int> <int>    <int>
1:      X      A     2     5       19
2:      X      B     4    19       19
3:      Y      C     1    17       17
4:      Y      D     3    12       17
5:      Z      E     4    11       15
6:      Z      F     2    15       15
Tip

If you need to pass multiple column names in the over expression, you can either list them like this over("name","adress") or - more conveniently - use a character vector over(c("name","adress")).

2.7 Rename columns

Similar to the dplyr package, the rename() method can also be used to rename existing column.
The renaming logic is NOT identical to that of dplyr, and is performed as follows: old_name="new_name".

Note

Note the double quotes "" surrounding the name of the old variable to be renamed which does not exist with dplyr (see examples below).

data(iris)
as_polars_df(iris)$
  rename(
    Sepal.Length = "sepal_length", 
    Sepal.Width = "sepal_width",
    Petal.Length = "length of petal",
    Petal.Width = "width of petal",
    Species = "species"
  )$columns
[1] "sepal_length"    "sepal_width"     "length of petal" "width of petal" 
[5] "species"        
data(iris)
names(iris) <- c("sepal_length","sepal_width","length of petal","width of petal","species")
names(iris)
[1] "sepal_length"    "sepal_width"     "length of petal" "width of petal" 
[5] "species"        
data(iris)
iris |>
  rename(
    sepal_length = Sepal.Length, 
    sepal_width = Sepal.Width,
    `length of petal` = Petal.Length,
    `width of petal` = Petal.Width,
    species = Species
  ) |>
  names()
[1] "sepal_length"    "sepal_width"     "length of petal" "width of petal" 
[5] "species"        
setnames(iris_dt, 
         old = c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width", "Species"),
         new = c("sepal_length", "sepal_width", "length_of_petal", "width_of_petal", "species"))
names(iris_dt)
[1] "sepal_length"    "sepal_width"     "length_of_petal" "width_of_petal" 
[5] "species"         "mygroups"       

2.8 Remove columns

2.8.1 Removing by name

To remove columns from a DataFrame, the drop method must be used.
Here’s an example with equivalent syntax:

as_polars_df(iris)$
  drop(c("Petal.Width","Sepal.Length","Sepal.Width"))$
  head(3)
shape: (3, 2)
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Petal.Length ┆ Species β”‚
β”‚ ---          ┆ ---     β”‚
β”‚ f64          ┆ cat     β”‚
β•žβ•β•β•β•β•β•β•β•β•β•β•β•β•β•β•ͺ═════════║
β”‚ 1.4          ┆ setosa  β”‚
β”‚ 1.4          ┆ setosa  β”‚
β”‚ 1.3          ┆ setosa  β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
iris[1:3,!(names(iris) %in% c("Petal.Width","Sepal.Length","Sepal.Width"))]
  Petal.Length Species
1          1.4  setosa
2          1.4  setosa
3          1.3  setosa
iris |>
  select(-c(Petal.Width,Sepal.Length,Sepal.Width)) |>
  head(3)
  Petal.Length Species
1          1.4  setosa
2          1.4  setosa
3          1.3  setosa
iris_dt[, c("Petal.Width","Sepal.Length","Sepal.Width") := NULL][1:3,]
   Petal.Length Species
          <num>  <fctr>
1:          1.4  setosa
2:          1.4  setosa
3:          1.3  setosa

2.8.2 Removing with a regex

To remove columns with a regex from a DataFrame, the drop expression must be used.
Let’s see an example where you want to drop columns whose names starts with β€œPetal” in iris.

num_to_drop <- !grepl("^Petal",as_polars_df(iris)$columns)
col_to_drop <- as_polars_df(iris)$columns[num_to_drop]

as_polars_df(iris)$
  drop(col_to_drop)$
  head(3)
shape: (3, 2)
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Petal.Length ┆ Petal.Width β”‚
β”‚ ---          ┆ ---         β”‚
β”‚ f64          ┆ f64         β”‚
β•žβ•β•β•β•β•β•β•β•β•β•β•β•β•β•β•ͺ═════════════║
β”‚ 1.4          ┆ 0.2         β”‚
β”‚ 1.4          ┆ 0.2         β”‚
β”‚ 1.3          ┆ 0.2         β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
iris[1:3,!grepl("^Petal",names(iris))]
  Sepal.Length Sepal.Width Species
1          5.1         3.5  setosa
2          4.9         3.0  setosa
3          4.7         3.2  setosa
iris |>
  select(-starts_with("Petal")) |>
  head(3)
  Sepal.Length Sepal.Width Species
1          5.1         3.5  setosa
2          4.9         3.0  setosa
3          4.7         3.2  setosa
iris_dt[, which(grepl("^Petal$", names(iris_dt))):=NULL][1:3,]
   Petal.Length Species
          <num>  <fctr>
1:          1.4  setosa
2:          1.4  setosa
3:          1.3  setosa
Note

All the ways we’ve seen in the section about selecting columns from a DataFrame (by name, data type and with a list) also work with drop() method!

2.9 Aggregation by group

Another frequently used data manipulation is the aggregation of data by group. To do this, we indicate in the group_by() method which column will be used to group the data.frame. And the agg() method which specifies the expression to aggregate.

The methods available for the agg() method are (in each group):

  • first() get the first element
  • last() get the last element
  • n_unique() get the number of unique elements
  • count() get the number of elements
  • sum() sum the elements
  • min() get the smallest element
  • max() get the largest element
  • mean() get the average of elements
  • median() get the median
  • quantile() calculate quantiles

Here’s a minimal example with sum and mean applied to 2 different columns:

as_polars_df(iris)$
  group_by("Species")$
  agg(pl$sum("Petal.Length"),pl$col("Petal.Width")$mean())
shape: (3, 3)
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Species    ┆ Petal.Length ┆ Petal.Width β”‚
β”‚ ---        ┆ ---          ┆ ---         β”‚
β”‚ cat        ┆ f64          ┆ f64         β”‚
β•žβ•β•β•β•β•β•β•β•β•β•β•β•β•ͺ══════════════β•ͺ═════════════║
β”‚ setosa     ┆ 73.1         ┆ 0.246       β”‚
β”‚ versicolor ┆ 213.0        ┆ 1.326       β”‚
β”‚ virginica  ┆ 277.6        ┆ 2.026       β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
aggregate(Petal.Length ~ Species, data = iris, FUN = sum)
     Species Petal.Length
1     setosa         73.1
2 versicolor        213.0
3  virginica        277.6
aggregate(Petal.Width ~ Species, data = iris, FUN = mean)
     Species Petal.Width
1     setosa       0.246
2 versicolor       1.326
3  virginica       2.026
data(iris)
iris %>%
  group_by(Species) %>%
  summarise(
    Petal.Length_sum = sum(Petal.Length),
    Petal.Width_mean = mean(Petal.Width)
  )
# A tibble: 3 Γ— 3
  Species    Petal.Length_sum Petal.Width_mean
  <fct>                 <dbl>            <dbl>
1 setosa                 73.1            0.246
2 versicolor            213              1.33 
3 virginica             278.             2.03 
iris_dt <- as.data.table(iris)
iris_dt[, .(Petal.Length = sum(Petal.Length), Petal.Width = mean(Petal.Width)), by = Species]
      Species Petal.Length Petal.Width
       <fctr>        <num>       <num>
1:     setosa         73.1       0.246
2: versicolor        213.0       1.326
3:  virginica        277.6       2.026
Tip

Be careful! Calling multiple aggregations on the same column produces columns of the same name which generates an error with R.

You can use the alias() or suffix() method to ensure column names are unique. For example:

as_polars_df(iris)$
  group_by("Species")$
  agg(
    # With alias()
    pl$col(c("Petal.Length"))$sum()$alias("Petal.Length_Sum"),
    pl$col(c("Petal.Length"))$mean()$alias("Petal.Length_Mean"),
    # With suffix()
    pl$col(c("Petal.Width"))$sum()$name$suffix("_Sum"),
    pl$col(c("Petal.Width"))$mean()$name$suffix("_Mean")
  )
shape: (3, 5)
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Species    ┆ Petal.Length_Sum ┆ Petal.Length_Mean ┆ Petal.Width_Sum ┆ Petal.Width_Mean β”‚
β”‚ ---        ┆ ---              ┆ ---               ┆ ---             ┆ ---              β”‚
β”‚ cat        ┆ f64              ┆ f64               ┆ f64             ┆ f64              β”‚
β•žβ•β•β•β•β•β•β•β•β•β•β•β•β•ͺ══════════════════β•ͺ═══════════════════β•ͺ═════════════════β•ͺ══════════════════║
β”‚ setosa     ┆ 73.1             ┆ 1.462             ┆ 12.3            ┆ 0.246            β”‚
β”‚ versicolor ┆ 213.0            ┆ 4.26              ┆ 66.3            ┆ 1.326            β”‚
β”‚ virginica  ┆ 277.6            ┆ 5.552             ┆ 101.3           ┆ 2.026            β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

You can perform more advanced aggregations, for example calculate the sum of all the floating point columns in iris:

as_polars_df(iris)$
  group_by("Species")$
  agg(pl$col(pl$Float64)$sum())
shape: (3, 5)
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Species    ┆ Sepal.Length ┆ Sepal.Width ┆ Petal.Length ┆ Petal.Width β”‚
β”‚ ---        ┆ ---          ┆ ---         ┆ ---          ┆ ---         β”‚
β”‚ cat        ┆ f64          ┆ f64         ┆ f64          ┆ f64         β”‚
β•žβ•β•β•β•β•β•β•β•β•β•β•β•β•ͺ══════════════β•ͺ═════════════β•ͺ══════════════β•ͺ═════════════║
β”‚ virginica  ┆ 329.4        ┆ 148.7       ┆ 277.6        ┆ 101.3       β”‚
β”‚ versicolor ┆ 296.8        ┆ 138.5       ┆ 213.0        ┆ 66.3        β”‚
β”‚ setosa     ┆ 250.3        ┆ 171.4       ┆ 73.1         ┆ 12.3        β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
iris_sum_by_species_base <- aggregate(iris[, sapply(iris, is.numeric)], by = list(Species = iris$Species), FUN = sum)
iris_sum_by_species_base
     Species Sepal.Length Sepal.Width Petal.Length Petal.Width
1     setosa        250.3       171.4         73.1        12.3
2 versicolor        296.8       138.5        213.0        66.3
3  virginica        329.4       148.7        277.6       101.3
iris |>
  group_by(Species) %>%
  summarise(across(where(is.numeric), sum))
# A tibble: 3 Γ— 5
  Species    Sepal.Length Sepal.Width Petal.Length Petal.Width
  <fct>             <dbl>       <dbl>        <dbl>       <dbl>
1 setosa             250.        171.         73.1        12.3
2 versicolor         297.        138.        213          66.3
3 virginica          329.        149.        278.        101. 
iris_dt <- as.data.table(iris)
iris_sum_by_species_dt <- iris_dt[, lapply(.SD, sum), by = Species, .SDcols = sapply(iris_dt, is.numeric)]
iris_sum_by_species_dt
      Species Sepal.Length Sepal.Width Petal.Length Petal.Width
       <fctr>        <num>       <num>        <num>       <num>
1:     setosa        250.3       171.4         73.1        12.3
2: versicolor        296.8       138.5        213.0        66.3
3:  virginica        329.4       148.7        277.6       101.3

2.10 Sort a DataFrame

2.10.1 Simply sort a DataFrame

The sort() method with a character vector can be used to sort a DataFrame.

# Sort by one column
as_polars_df(iris)$
  sort("Species")$
  head(3)
shape: (3, 5)
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Sepal.Length ┆ Sepal.Width ┆ Petal.Length ┆ Petal.Width ┆ Species β”‚
β”‚ ---          ┆ ---         ┆ ---          ┆ ---         ┆ ---     β”‚
β”‚ f64          ┆ f64         ┆ f64          ┆ f64         ┆ cat     β”‚
β•žβ•β•β•β•β•β•β•β•β•β•β•β•β•β•β•ͺ═════════════β•ͺ══════════════β•ͺ═════════════β•ͺ═════════║
β”‚ 5.1          ┆ 3.5         ┆ 1.4          ┆ 0.2         ┆ setosa  β”‚
β”‚ 4.9          ┆ 3.0         ┆ 1.4          ┆ 0.2         ┆ setosa  β”‚
β”‚ 4.7          ┆ 3.2         ┆ 1.3          ┆ 0.2         ┆ setosa  β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
# Sort by two columns
as_polars_df(iris)$
  sort(c("Species","Petal.Length"))$
  head(3)
shape: (3, 5)
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Sepal.Length ┆ Sepal.Width ┆ Petal.Length ┆ Petal.Width ┆ Species β”‚
β”‚ ---          ┆ ---         ┆ ---          ┆ ---         ┆ ---     β”‚
β”‚ f64          ┆ f64         ┆ f64          ┆ f64         ┆ cat     β”‚
β•žβ•β•β•β•β•β•β•β•β•β•β•β•β•β•β•ͺ═════════════β•ͺ══════════════β•ͺ═════════════β•ͺ═════════║
β”‚ 4.6          ┆ 3.6         ┆ 1.0          ┆ 0.2         ┆ setosa  β”‚
β”‚ 4.3          ┆ 3.0         ┆ 1.1          ┆ 0.1         ┆ setosa  β”‚
β”‚ 5.0          ┆ 3.2         ┆ 1.2          ┆ 0.2         ┆ setosa  β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
# Sort by two columns one in a decreasing manner and the other in an increasing manner
as_polars_df(iris)$
  sort(pl$col("Species")$sort(descending = TRUE),pl$col("Petal.Length")$sort(descending = FALSE))$
  head(3)
shape: (3, 5)
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Sepal.Length ┆ Sepal.Width ┆ Petal.Length ┆ Petal.Width ┆ Species   β”‚
β”‚ ---          ┆ ---         ┆ ---          ┆ ---         ┆ ---       β”‚
β”‚ f64          ┆ f64         ┆ f64          ┆ f64         ┆ cat       β”‚
β•žβ•β•β•β•β•β•β•β•β•β•β•β•β•β•β•ͺ═════════════β•ͺ══════════════β•ͺ═════════════β•ͺ═══════════║
β”‚ 6.3          ┆ 3.3         ┆ 6.0          ┆ 2.5         ┆ virginica β”‚
β”‚ 5.8          ┆ 2.7         ┆ 5.1          ┆ 1.9         ┆ virginica β”‚
β”‚ 7.1          ┆ 3.0         ┆ 5.9          ┆ 2.1         ┆ virginica β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
# Sort by one column
iris[order(iris$Species),][1:3,]
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
# Sort by two columns
iris[order(iris$Species,iris$Petal.Length),][1:3,]
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
23          4.6         3.6          1.0         0.2  setosa
14          4.3         3.0          1.1         0.1  setosa
15          5.8         4.0          1.2         0.2  setosa
# Sort by two columns one in a decreasing manner and the other in an increasing manner
iris[order(rev(iris$Species),iris$Petal.Length),][1:3,]
    Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
107          4.9         2.5          4.5         1.7 virginica
127          6.2         2.8          4.8         1.8 virginica
139          6.0         3.0          4.8         1.8 virginica
# Sort by one column
iris |>
  arrange(Species) |>
  head(3)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
# Sort by two columns
iris |>
  arrange(Species, Petal.Length) |>
  head(3)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          4.6         3.6          1.0         0.2  setosa
2          4.3         3.0          1.1         0.1  setosa
3          5.8         4.0          1.2         0.2  setosa
# Sort by two columns one in a decreasing manner and the other in an increasing manner
iris |>
  arrange(desc(Species), Petal.Length) |>
  head(3)
  Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
1          4.9         2.5          4.5         1.7 virginica
2          6.2         2.8          4.8         1.8 virginica
3          6.0         3.0          4.8         1.8 virginica
# Sort by one column
iris_dt[order(Species)][1:3,]
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
          <num>       <num>        <num>       <num>  <fctr>
1:          5.1         3.5          1.4         0.2  setosa
2:          4.9         3.0          1.4         0.2  setosa
3:          4.7         3.2          1.3         0.2  setosa
# Sort by two columns
iris_dt[order(Species,Petal.Length)][1:3,]
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
          <num>       <num>        <num>       <num>  <fctr>
1:          4.6         3.6          1.0         0.2  setosa
2:          4.3         3.0          1.1         0.1  setosa
3:          5.8         4.0          1.2         0.2  setosa
# Sort by two columns one in a decreasing manner and the other in an increasing manner
iris_dt[order(-Species,Petal.Length)][1:3,]
   Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
          <num>       <num>        <num>       <num>    <fctr>
1:          4.9         2.5          4.5         1.7 virginica
2:          6.2         2.8          4.8         1.8 virginica
3:          6.0         3.0          4.8         1.8 virginica

2.10.2 Keep unique rows

If you want to keep unique/distinct rows from a DataFrame, you can use the unique() method:

# With one column
as_polars_df(iris)$unique(
  subset = "Species"
)
Warning: ! The `subset` argument of `$unique()` is deprecated and replaced by `...` as
of polars 1.1.0.
This warning is displayed once every 8 hours.
shape: (3, 5)
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Sepal.Length ┆ Sepal.Width ┆ Petal.Length ┆ Petal.Width ┆ Species    β”‚
β”‚ ---          ┆ ---         ┆ ---          ┆ ---         ┆ ---        β”‚
β”‚ f64          ┆ f64         ┆ f64          ┆ f64         ┆ cat        β”‚
β•žβ•β•β•β•β•β•β•β•β•β•β•β•β•β•β•ͺ═════════════β•ͺ══════════════β•ͺ═════════════β•ͺ════════════║
β”‚ 7.0          ┆ 3.2         ┆ 4.7          ┆ 1.4         ┆ versicolor β”‚
β”‚ 5.1          ┆ 3.5         ┆ 1.4          ┆ 0.2         ┆ setosa     β”‚
β”‚ 6.3          ┆ 3.3         ┆ 6.0          ┆ 2.5         ┆ virginica  β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
# With one column keeping last entry
as_polars_df(iris)$unique(
  subset = "Species",
  keep = "last"
)
shape: (3, 5)
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Sepal.Length ┆ Sepal.Width ┆ Petal.Length ┆ Petal.Width ┆ Species    β”‚
β”‚ ---          ┆ ---         ┆ ---          ┆ ---         ┆ ---        β”‚
β”‚ f64          ┆ f64         ┆ f64          ┆ f64         ┆ cat        β”‚
β•žβ•β•β•β•β•β•β•β•β•β•β•β•β•β•β•ͺ═════════════β•ͺ══════════════β•ͺ═════════════β•ͺ════════════║
β”‚ 5.7          ┆ 2.8         ┆ 4.1          ┆ 1.3         ┆ versicolor β”‚
β”‚ 5.0          ┆ 3.3         ┆ 1.4          ┆ 0.2         ┆ setosa     β”‚
β”‚ 5.9          ┆ 3.0         ┆ 5.1          ┆ 1.8         ┆ virginica  β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
# With two colums, keeping last entry and maintaining the same order
pl$DataFrame(
  x = c(1L, 1:3, 3L),
  y = c(1L, 1:3, 3L),
  z = c(1L, 1:3, 4L)
)$unique(
  subset = c("x","y"),
  keep = "last",
  maintain_order = TRUE
)
shape: (3, 3)
β”Œβ”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”
β”‚ x   ┆ y   ┆ z   β”‚
β”‚ --- ┆ --- ┆ --- β”‚
β”‚ i32 ┆ i32 ┆ i32 β”‚
β•žβ•β•β•β•β•β•ͺ═════β•ͺ═════║
β”‚ 1   ┆ 1   ┆ 1   β”‚
β”‚ 2   ┆ 2   ┆ 2   β”‚
β”‚ 3   ┆ 3   ┆ 4   β”‚
β””β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”˜
# With one column
aggregate(. ~ Species, data = iris, FUN = head, N = 1)
     Species Sepal.Length.1 Sepal.Length.2 Sepal.Length.3 Sepal.Length.4
1     setosa            5.1            4.9            4.7            4.6
2 versicolor            7.0            6.4            6.9            5.5
3  virginica            6.3            5.8            7.1            6.3
  Sepal.Length.5 Sepal.Length.6 Sepal.Width.1 Sepal.Width.2 Sepal.Width.3
1            5.0            5.4           3.5           3.0           3.2
2            6.5            5.7           3.2           3.2           3.1
3            6.5            7.6           3.3           2.7           3.0
  Sepal.Width.4 Sepal.Width.5 Sepal.Width.6 Petal.Length.1 Petal.Length.2
1           3.1           3.6           3.9            1.4            1.4
2           2.3           2.8           2.8            4.7            4.5
3           2.9           3.0           3.0            6.0            5.1
  Petal.Length.3 Petal.Length.4 Petal.Length.5 Petal.Length.6 Petal.Width.1
1            1.3            1.5            1.4            1.7           0.2
2            4.9            4.0            4.6            4.5           1.4
3            5.9            5.6            5.8            6.6           2.5
  Petal.Width.2 Petal.Width.3 Petal.Width.4 Petal.Width.5 Petal.Width.6
1           0.2           0.2           0.2           0.2           0.4
2           1.5           1.5           1.3           1.5           1.3
3           1.9           2.1           1.8           2.2           2.1
# With one column keeping last entry
aggregate(. ~ Species, data = iris, FUN = tail, n = 1)
     Species Sepal.Length Sepal.Width Petal.Length Petal.Width
1     setosa          5.0         3.3          1.4         0.2
2 versicolor          5.7         2.8          4.1         1.3
3  virginica          5.9         3.0          5.1         1.8
# With two colums, keeping last entry and maintaining the same order
mytest <- data.frame(
  x = c(1L, 1:3, 3L),
  y = c(1L, 1:3, 3L),
  z = c(1L, 1:3, 4L)
) 
aggregate(. ~ x + y, data = mytest, FUN = tail, n = 1)
  x y z
1 1 1 1
2 2 2 2
3 3 3 4
# With one column
iris |>
  distinct(Species, .keep_all = TRUE)
  Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
1          5.1         3.5          1.4         0.2     setosa
2          7.0         3.2          4.7         1.4 versicolor
3          6.3         3.3          6.0         2.5  virginica
# With one column keeping last entry
iris |>
  group_by(Species) |>
  slice_tail() |>
  ungroup()
# A tibble: 3 Γ— 5
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species   
         <dbl>       <dbl>        <dbl>       <dbl> <fct>     
1          5           3.3          1.4         0.2 setosa    
2          5.7         2.8          4.1         1.3 versicolor
3          5.9         3            5.1         1.8 virginica 
# With two colums, keeping last entry and maintaining the same order
data.frame(
  x = c(1L, 1:3, 3L),
  y = c(1L, 1:3, 3L),
  z = c(1L, 1:3, 4L)
) |>
  group_by(x,y) |>
  slice_tail() |>
  ungroup()
# A tibble: 3 Γ— 3
      x     y     z
  <int> <int> <int>
1     1     1     1
2     2     2     2
3     3     3     4
# With one column
unique(iris_dt, by = "Species")
   Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
          <num>       <num>        <num>       <num>     <fctr>
1:          5.1         3.5          1.4         0.2     setosa
2:          7.0         3.2          4.7         1.4 versicolor
3:          6.3         3.3          6.0         2.5  virginica
# With one column keeping last entry
unique(iris_dt, by = "Species", fromLast = TRUE)
   Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
          <num>       <num>        <num>       <num>     <fctr>
1:          5.0         3.3          1.4         0.2     setosa
2:          5.7         2.8          4.1         1.3 versicolor
3:          5.9         3.0          5.1         1.8  virginica
# With two colums, keeping last entry and maintaining the same order
mytest_dt <- data.table(
  x = c(1L, 1:3, 3L),
  y = c(1L, 1:3, 3L),
  z = c(1L, 1:3, 4L)
) 
unique(mytest_dt, by = c("x","y"), fromLast = TRUE)
       x     y     z
   <int> <int> <int>
1:     1     1     1
2:     2     2     2
3:     3     3     4

2.10.3 Keep some columns from sorted DataFrame

If you want to keep only some columns from a sorted DataFrame, you can use the sort_by() method with the select() method.
In details, sort_by() can sort a column by the ordering of another column, or multiple other columns.
It’s the equivalent of order() method of R base.

as_polars_df(iris)$
  select(pl$col("Petal.Length")$
  sort_by("Petal.Width"))
shape: (150, 1)
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Petal.Length β”‚
β”‚ ---          β”‚
β”‚ f64          β”‚
β•žβ•β•β•β•β•β•β•β•β•β•β•β•β•β•β•‘
β”‚ 1.5          β”‚
β”‚ 1.4          β”‚
β”‚ 1.1          β”‚
β”‚ 1.5          β”‚
β”‚ 1.4          β”‚
β”‚ …            β”‚
β”‚ 5.6          β”‚
β”‚ 5.6          β”‚
β”‚ 6.0          β”‚
β”‚ 6.1          β”‚
β”‚ 5.7          β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
data(iris)
iris[order(iris$Petal.Width), "Petal.Length", drop = FALSE]
    Petal.Length
10           1.5
13           1.4
14           1.1
33           1.5
38           1.4
1            1.4
2            1.4
3            1.3
4            1.5
5            1.4
8            1.5
9            1.4
11           1.5
12           1.6
15           1.2
21           1.7
23           1.0
25           1.9
26           1.6
28           1.5
29           1.4
30           1.6
31           1.6
34           1.4
35           1.5
36           1.2
37           1.3
39           1.3
40           1.5
43           1.3
47           1.6
48           1.4
49           1.5
50           1.4
7            1.4
18           1.4
19           1.7
20           1.5
41           1.3
42           1.3
46           1.4
6            1.7
16           1.5
17           1.3
22           1.5
27           1.6
32           1.5
45           1.9
24           1.7
44           1.6
58           3.3
61           3.5
63           4.0
68           4.1
80           3.5
82           3.7
94           3.3
70           3.9
81           3.8
99           3.0
74           4.7
83           3.9
91           4.4
93           4.0
96           4.2
54           4.0
56           4.5
59           4.6
65           3.6
72           4.0
75           4.3
88           4.4
89           4.1
90           4.0
95           4.2
97           4.2
98           4.3
100          4.1
51           4.7
60           3.9
64           4.7
66           4.4
76           4.4
77           4.8
92           4.6
135          5.6
52           4.5
53           4.9
55           4.6
62           4.2
67           4.5
69           4.5
73           4.9
79           4.5
85           4.5
87           4.7
120          5.0
134          5.1
57           4.7
84           5.1
86           4.5
130          5.8
78           5.0
107          4.5
71           4.8
104          5.6
108          6.3
109          5.8
117          5.5
124          4.9
126          6.0
127          4.8
128          4.9
138          5.5
139          4.8
150          5.1
102          5.1
112          5.3
131          6.1
143          5.1
147          5.0
111          5.1
114          5.0
122          4.9
123          6.7
132          6.4
148          5.2
103          5.9
106          6.6
113          5.5
125          5.7
129          5.6
140          5.4
105          5.8
118          6.7
133          5.6
116          5.3
119          6.9
121          5.7
136          6.1
142          5.1
144          5.9
146          5.2
149          5.4
115          5.1
137          5.6
141          5.6
101          6.0
110          6.1
145          5.7
data(iris)
iris |>
  arrange(Petal.Width) |>
  select(Petal.Length)
    Petal.Length
1            1.5
2            1.4
3            1.1
4            1.5
5            1.4
6            1.4
7            1.4
8            1.3
9            1.5
10           1.4
11           1.5
12           1.4
13           1.5
14           1.6
15           1.2
16           1.7
17           1.0
18           1.9
19           1.6
20           1.5
21           1.4
22           1.6
23           1.6
24           1.4
25           1.5
26           1.2
27           1.3
28           1.3
29           1.5
30           1.3
31           1.6
32           1.4
33           1.5
34           1.4
35           1.4
36           1.4
37           1.7
38           1.5
39           1.3
40           1.3
41           1.4
42           1.7
43           1.5
44           1.3
45           1.5
46           1.6
47           1.5
48           1.9
49           1.7
50           1.6
51           3.3
52           3.5
53           4.0
54           4.1
55           3.5
56           3.7
57           3.3
58           3.9
59           3.8
60           3.0
61           4.7
62           3.9
63           4.4
64           4.0
65           4.2
66           4.0
67           4.5
68           4.6
69           3.6
70           4.0
71           4.3
72           4.4
73           4.1
74           4.0
75           4.2
76           4.2
77           4.3
78           4.1
79           4.7
80           3.9
81           4.7
82           4.4
83           4.4
84           4.8
85           4.6
86           5.6
87           4.5
88           4.9
89           4.6
90           4.2
91           4.5
92           4.5
93           4.9
94           4.5
95           4.5
96           4.7
97           5.0
98           5.1
99           4.7
100          5.1
101          4.5
102          5.8
103          5.0
104          4.5
105          4.8
106          5.6
107          6.3
108          5.8
109          5.5
110          4.9
111          6.0
112          4.8
113          4.9
114          5.5
115          4.8
116          5.1
117          5.1
118          5.3
119          6.1
120          5.1
121          5.0
122          5.1
123          5.0
124          4.9
125          6.7
126          6.4
127          5.2
128          5.9
129          6.6
130          5.5
131          5.7
132          5.6
133          5.4
134          5.8
135          6.7
136          5.6
137          5.3
138          6.9
139          5.7
140          6.1
141          5.1
142          5.9
143          5.2
144          5.4
145          5.1
146          5.6
147          5.6
148          6.0
149          6.1
150          5.7
data(iris)
iris_dt <- as.data.table(iris)
iris_dt[order(Petal.Width)][,.(Petal.Length)]
     Petal.Length
            <num>
  1:          1.5
  2:          1.4
  3:          1.1
  4:          1.5
  5:          1.4
 ---             
146:          5.6
147:          5.6
148:          6.0
149:          6.1
150:          5.7
Tip

If you want to use multiple columns/expressions, you can pass it in a list like this for example sort_by(list("Petal.Width","Sepal.Width")) or sort_by(list("Petal.Width", pl$col("Sepal.Width")))

2.11 Join DataFrames

To perform joins, the join() method must be used.

Multiple strategies are available:

  • "inner": returns row with matching keys in both frames. Non-matching rows in either the left or right frame are discarded.
  • "left": returns all rows in the left dataframe, whether or not a match in the right-frame is found. Non-matching rows have their right columns null-filled.
  • "outer": returns all rows from both the left and right dataframe. If no match is found in one frame, columns from the other frame are null-filled.
  • "semi": returns all rows from the left frame in which the join key is also present in the right frame.
  • "anti": returns all rows from the left frame in which the join key is not present in the right frame.
  • "cross": returns the cartesian product of all rows from the left frame with all rows from the right frame. Duplicates rows are retained. The table length of A cross-joined with B is always len(A) Γ— len(B).

The main arguments are: - on: name(s) of the join columns in both DataFrames.
- how: join strategy.
- suffix: suffix to append to columns with a duplicate name.

Let’s see a simple example:

# First, creation of colors_species Polars DataFrame
colors <- pl$DataFrame(
  Surname = c("toto","titi","tata"),
  Color = c("blue","red","yellow")
)
values <- pl$DataFrame(
  Surname = c("toto","titi","tata"),
  value = c(10,20,30)
)
# Let's join !
colors$join(
  other = values,
  on = "Surname",
  how = "left"
)
shape: (3, 3)
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Surname ┆ Color  ┆ value β”‚
β”‚ ---     ┆ ---    ┆ ---   β”‚
β”‚ str     ┆ str    ┆ f64   β”‚
β•žβ•β•β•β•β•β•β•β•β•β•ͺ════════β•ͺ═══════║
β”‚ toto    ┆ blue   ┆ 10.0  β”‚
β”‚ titi    ┆ red    ┆ 20.0  β”‚
β”‚ tata    ┆ yellow ┆ 30.0  β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”˜
colors_df <- data.frame(
  Surname = c("toto","titi","tata"),
  Color = c("blue","red","yellow")
)
values_df <- data.frame(
  Surname = c("toto","titi","tata"),
  value = c(10,20,30)
)
merge(colors_df, values_df, by = "Surname", all.x = TRUE)
  Surname  Color value
1    tata yellow    30
2    titi    red    20
3    toto   blue    10
colors_df |>
  left_join(values_df,
            by = "Surname") 
  Surname  Color value
1    toto   blue    10
2    titi    red    20
3    tata yellow    30
merge(as.data.table(colors_df), 
      as.data.table(values_df), 
      by = "Surname", all.x = TRUE)
Key: <Surname>
   Surname  Color value
    <char> <char> <num>
1:    tata yellow    30
2:    titi    red    20
3:    toto   blue    10

If join columns have different names in both DataFrames, you can use arguments "left_on" and "right_on".
Here’s an example:

values2 <- values$rename(Surname = "Surname2")

colors$join(
  other = values2,
  left_on = "Surname",
  right_on = "Surname2",
  how = "left"
)
shape: (3, 3)
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Surname ┆ Color  ┆ value β”‚
β”‚ ---     ┆ ---    ┆ ---   β”‚
β”‚ str     ┆ str    ┆ f64   β”‚
β•žβ•β•β•β•β•β•β•β•β•β•ͺ════════β•ͺ═══════║
β”‚ toto    ┆ blue   ┆ 10.0  β”‚
β”‚ titi    ┆ red    ┆ 20.0  β”‚
β”‚ tata    ┆ yellow ┆ 30.0  β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”˜
Tip

When we join on integer columns that are sorted polars uses a fast-track algorithm.

To use the fast-track algorithm polars needs to know the join columns are sorted. See the tip in this section for an example with filter() method.

Polars can’t use the fast-track algorithm for joining string columns as the algorithm works on integers.

=> To use the fast-track algorithm the string column must be cast to categorical dtype. See an example here.

2.12 Concatenate DataFrames

To perform concatenations, the concat() method must be used.
In this section, we will see vertical, horizontal and diagonal concatenation.

Before we need to create some datasets to run our examples. Click to expand it! πŸ‘‡

Code
dfup <- pl$DataFrame(
  col1 = c("a"),
  col2 = c("b"),
  col3 = c("c")
)
dfdown <- pl$DataFrame(
  col1 = c("x"),
  col2 = c("y"),
  col3 = c("z")
)
dfleft <- pl$DataFrame(col1 = c("a","b","c"))
dfright <- pl$DataFrame(col2 = c("x","y","z"))

dfup_df <- as.data.frame(dfup)
dfdown_df <- as.data.frame(dfdown)
dfleft_df <- as.data.frame(dfleft)
dfright_df <- as.data.frame(dfright)

dfup_dt <- as.data.table(dfup)
dfdown_dt <- as.data.table(dfdown)
dfleft_dt <- as.data.table(dfleft)
dfright_dt <- as.data.table(dfright)

2.12.1 Vertical concatenation

To concatenate multiple DataFrames vertically (=by row), you can use the concat() method and the argument how = "vertical".

pl$concat(dfup,dfdown,how = "vertical")
shape: (2, 3)
β”Œβ”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”
β”‚ col1 ┆ col2 ┆ col3 β”‚
β”‚ ---  ┆ ---  ┆ ---  β”‚
β”‚ str  ┆ str  ┆ str  β”‚
β•žβ•β•β•β•β•β•β•ͺ══════β•ͺ══════║
β”‚ a    ┆ b    ┆ c    β”‚
β”‚ x    ┆ y    ┆ z    β”‚
β””β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”˜
rbind(dfup_df,dfdown_df)
  col1 col2 col3
1    a    b    c
2    x    y    z
bind_rows(list(dfup_df,dfdown_df))
  col1 col2 col3
1    a    b    c
2    x    y    z
rbindlist(list(dfup_dt, dfdown_dt))
     col1   col2   col3
   <char> <char> <char>
1:      a      b      c
2:      x      y      z

2.12.2 Horizontal concatenation

To concatenate multiple DataFrames horizontally (=by col), you can use the concat() method and the argument how = "horizontal".

pl$concat(dfleft,dfright,how = "horizontal")
shape: (3, 2)
β”Œβ”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”
β”‚ col1 ┆ col2 β”‚
β”‚ ---  ┆ ---  β”‚
β”‚ str  ┆ str  β”‚
β•žβ•β•β•β•β•β•β•ͺ══════║
β”‚ a    ┆ x    β”‚
β”‚ b    ┆ y    β”‚
β”‚ c    ┆ z    β”‚
β””β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”˜
cbind(dfleft_df,dfright_df)
  col1 col2
1    a    x
2    b    y
3    c    z
cbind(list(dfleft_df,dfright_df))
     [,1]        
[1,] data.frame,1
[2,] data.frame,1
data.table(dfleft_dt, dfright_dt)
     col1   col2
   <char> <char>
1:      a      x
2:      b      y
3:      c      z

2.12.3 Diagonal concatenation

To concatenate multiple DataFrames diagonally, you can use the concat() method and the argument how = "diagonal".
Diagonal concatenation is useful when the column names are not identical in initial DataFrames.

dfup <- dfup$rename(col3="col4")
pl$concat(dfup,dfdown,how = "diagonal")
shape: (2, 4)
β”Œβ”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”
β”‚ col1 ┆ col2 ┆ col4 ┆ col3 β”‚
β”‚ ---  ┆ ---  ┆ ---  ┆ ---  β”‚
β”‚ str  ┆ str  ┆ str  ┆ str  β”‚
β•žβ•β•β•β•β•β•β•ͺ══════β•ͺ══════β•ͺ══════║
β”‚ a    ┆ b    ┆ c    ┆ null β”‚
β”‚ x    ┆ y    ┆ null ┆ z    β”‚
β””β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”˜
dfup_df <- as.data.frame(dfup)
dfup_df[setdiff(names(dfdown_df), names(dfup_df))] <- NA
dfdown_df[setdiff(names(dfup_df), names(dfdown_df))] <- NA
rbind(dfup_df,dfdown_df)
  col1 col2 col4 col3
1    a    b    c <NA>
2    x    y <NA>    z
dfup_df <- as.data.frame(dfup)
bind_rows(list(dfup_df,dfdown_df))
  col1 col2 col4 col3
1    a    b    c <NA>
2    x    y <NA>    z
dfup_dt <- as.data.table(dfup)
rbindlist(list(dfup_dt, dfdown_dt),fill = TRUE)
     col1   col2   col4   col3
   <char> <char> <char> <char>
1:      a      b      c   <NA>
2:      x      y   <NA>      z

2.13 Pivot a DataFrame

2.13.1 From long to wide

The pivot() method can be used to pivot a DataFrame from long to wide.

Let’s go for a first example :

df <- data.frame(
  country = c(rep("France",3),rep("Italy","3")),
  city = c("Paris", "Lille", "Nice", "Roma", "Milan", "Napoli"),
  location = c("North","North","South","South","North","South"),
  population = c(2.1, 0.2, 0.4, 2.8, 1.4, 3.0)
)
df
  country   city location population
1  France  Paris    North        2.1
2  France  Lille    North        0.2
3  France   Nice    South        0.4
4   Italy   Roma    South        2.8
5   Italy  Milan    North        1.4
6   Italy Napoli    South        3.0
as_polars_df(df)$pivot(
  on = "city",
  index = "country", 
  values = "population", 
)
shape: (2, 7)
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ country ┆ Paris ┆ Lille ┆ Nice ┆ Roma ┆ Milan ┆ Napoli β”‚
β”‚ ---     ┆ ---   ┆ ---   ┆ ---  ┆ ---  ┆ ---   ┆ ---    β”‚
β”‚ str     ┆ f64   ┆ f64   ┆ f64  ┆ f64  ┆ f64   ┆ f64    β”‚
β•žβ•β•β•β•β•β•β•β•β•β•ͺ═══════β•ͺ═══════β•ͺ══════β•ͺ══════β•ͺ═══════β•ͺ════════║
β”‚ France  ┆ 2.1   ┆ 0.2   ┆ 0.4  ┆ null ┆ null  ┆ null   β”‚
β”‚ Italy   ┆ null  ┆ null  ┆ null ┆ 2.8  ┆ 1.4   ┆ 3.0    β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”˜
reshape(df[,-which(names(df) %in% c("location"))], 
        idvar = "country", 
        timevar = "city", 
        direction = "wide")
  country population.Paris population.Lille population.Nice population.Roma
1  France              2.1              0.2             0.4              NA
4   Italy               NA               NA              NA             2.8
  population.Milan population.Napoli
1               NA                NA
4              1.4                 3
df |>
  pivot_wider(
    id_cols = country,
    names_from = city,
    values_from = population
  )
# A tibble: 2 Γ— 7
  country Paris Lille  Nice  Roma Milan Napoli
  <chr>   <dbl> <dbl> <dbl> <dbl> <dbl>  <dbl>
1 France    2.1   0.2   0.4  NA    NA       NA
2 Italy    NA    NA    NA     2.8   1.4      3
df_dt <- as.data.table(df)
dcast(df_dt, country ~ city, value.var = "population")
Key: <country>
   country Lille Milan Napoli  Nice Paris  Roma
    <char> <num> <num>  <num> <num> <num> <num>
1:  France   0.2    NA     NA   0.4   2.1    NA
2:   Italy    NA   1.4      3    NA    NA   2.8

You can also aggregate the results using a function that you enter in the argument aggregate_function in pivot() method.

In this case, the aggregate_function argument of pivot() is the equivalent of values_fn of pivot_wider() from {tidyr} and fun.aggregate of dcast() from {data.table}.

as_polars_df(df)$pivot(
  on = "location",
  index = "country", 
  values = "population",
  aggregate_function = "mean"
)
shape: (2, 3)
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”
β”‚ country ┆ North ┆ South β”‚
β”‚ ---     ┆ ---   ┆ ---   β”‚
β”‚ str     ┆ f64   ┆ f64   β”‚
β•žβ•β•β•β•β•β•β•β•β•β•ͺ═══════β•ͺ═══════║
β”‚ France  ┆ 1.15  ┆ 0.4   β”‚
β”‚ Italy   ┆ 1.4   ┆ 2.9   β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”˜
df_summary <- aggregate(population ~ country + location, data = df, FUN = mean)
df_final <- reshape(df_summary, idvar = "country", timevar = "location", direction = "wide")
colnames(df_final) <- c("country", "North", "South")
df_final
  country North South
1  France  1.15   0.4
2   Italy  1.40   2.9
df |>
  pivot_wider(id_cols = country,
              names_from = location, 
              values_from = population, 
              values_fn = mean)
# A tibble: 2 Γ— 3
  country North South
  <chr>   <dbl> <dbl>
1 France   1.15   0.4
2 Italy    1.4    2.9
df_dt <- as.data.table(df)
dcast(df_dt, country ~ location, value.var = "population", fun.aggregate = mean)
Key: <country>
   country North South
    <char> <num> <num>
1:  France  1.15   0.4
2:   Italy  1.40   2.9

However with {polars}, we can also run an expression as an aggregation function.
With {tidyr} and {data.table}, you need to calculate this in advance.
For example:

as_polars_df(df)$pivot(
  on = "location",
  index = "country", 
  values = "population",
  aggregate_function = pl$element()$sum()$sqrt()
)
shape: (2, 3)
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ country ┆ North    ┆ South    β”‚
β”‚ ---     ┆ ---      ┆ ---      β”‚
β”‚ str     ┆ f64      ┆ f64      β”‚
β•žβ•β•β•β•β•β•β•β•β•β•ͺ══════════β•ͺ══════════║
β”‚ France  ┆ 1.516575 ┆ 0.632456 β”‚
β”‚ Italy   ┆ 1.183216 ┆ 2.408319 β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
df_summarized <- aggregate(population ~ country + location, df, FUN = function(x) sqrt(sum(x)))
df_final <- reshape(df_summarized, idvar = "country", timevar = "location", direction = "wide")
colnames(df_final) <- c("country", "North", "South")
df_final
  country    North     South
1  France 1.516575 0.6324555
2   Italy 1.183216 2.4083189
df |>
  group_by(country, location) |>
  summarise(population_sum = sqrt(sum(population))) |>
  pivot_wider(names_from = location, values_from = population_sum)
# A tibble: 2 Γ— 3
# Groups:   country [2]
  country North South
  <chr>   <dbl> <dbl>
1 France   1.52 0.632
2 Italy    1.18 2.41 
dt_final <- df_dt[, .(population_sum = sqrt(sum(population))), by = .(country, location)]
dcast(dt_final, country ~ location, value.var = "population_sum")
Key: <country>
   country    North     South
    <char>    <num>     <num>
1:  France 1.516575 0.6324555
2:   Italy 1.183216 2.4083189

2.13.2 From wide to long

The unpivot() method can be used to pivot a DataFrame from wide to long.

Let’s see with an example :

df <- data.frame(
  country = c("France","Italy"),
  North = c(1.1,1.4),
  South = c(0.4,2.9)
)
as_polars_df(df)$unpivot(
  index = "country",
  on = c("North","South")
)
shape: (4, 3)
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”
β”‚ country ┆ variable ┆ value β”‚
β”‚ ---     ┆ ---      ┆ ---   β”‚
β”‚ str     ┆ str      ┆ f64   β”‚
β•žβ•β•β•β•β•β•β•β•β•β•ͺ══════════β•ͺ═══════║
β”‚ France  ┆ North    ┆ 1.1   β”‚
β”‚ Italy   ┆ North    ┆ 1.4   β”‚
β”‚ France  ┆ South    ┆ 0.4   β”‚
β”‚ Italy   ┆ South    ┆ 2.9   β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”˜
melted_df  <- reshape(df, 
        varying = c("North", "South"), 
        v.names = "value", 
        idvar = "country", 
        times = c("North", "South"), 
        timevar = "variable", 
        direction = "long")
rownames(melted_df) <- NULL
melted_df
  country variable value
1  France    North   1.1
2   Italy    North   1.4
3  France    South   0.4
4   Italy    South   2.9
df |>
  pivot_longer(
    !country,
    names_to = "variable",
    values_to = "value"
  )
# A tibble: 4 Γ— 3
  country variable value
  <chr>   <chr>    <dbl>
1 France  North      1.1
2 France  South      0.4
3 Italy   North      1.4
4 Italy   South      2.9
df_dt <- as.data.table(df)
melt(df_dt, id.vars = "country", variable.name = "variable", value.name = "value")
   country variable value
    <char>   <fctr> <num>
1:  France    North   1.1
2:   Italy    North   1.4
3:  France    South   0.4
4:   Italy    South   2.9

2.14 Dealing with missing values

We have already introduced missing values in here. In this section, we will go further and understand how to deal with missing values with polars and R.

2.14.1 Check if Series has missing values

The is_null() and is_not_null() methods can be used to check if Series has missing values.
Theses methods are the equivalent of is.na() and !is.na() of R base.

Let’s see two examples combining thes methods with select() and filter() methods:

mydfNA <- pl$DataFrame(
  colA = c("a",NA,"c"),
  colB = c("d",NA,NA)
)

# Find null values
mydfNA$
  select(
    pl$col("colA"),
    pl$col("colA")$is_null()$alias("is_null"),
    pl$col("colA")$is_not_null()$alias("is_not_null")
  )
shape: (3, 3)
β”Œβ”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ colA ┆ is_null ┆ is_not_null β”‚
β”‚ ---  ┆ ---     ┆ ---         β”‚
β”‚ str  ┆ bool    ┆ bool        β”‚
β•žβ•β•β•β•β•β•β•ͺ═════════β•ͺ═════════════║
β”‚ a    ┆ false   ┆ true        β”‚
β”‚ null ┆ true    ┆ false       β”‚
β”‚ c    ┆ false   ┆ true        β”‚
β””β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
# Filter by null values
mydfNA$
  filter(
    pl$col("colA")$is_not_null() & pl$col("colB")$is_not_null()
  )
shape: (1, 2)
β”Œβ”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”
β”‚ colA ┆ colB β”‚
β”‚ ---  ┆ ---  β”‚
β”‚ str  ┆ str  β”‚
β•žβ•β•β•β•β•β•β•ͺ══════║
β”‚ a    ┆ d    β”‚
β””β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”˜
mydfNA2 <- as.data.frame(mydfNA)

# Find null values
data.frame(
  colA = mydfNA2$colA,
  is_null = is.na(mydfNA2$colA),
  is_not_null = !is.na(mydfNA2$colA)
)
  colA is_null is_not_null
1    a   FALSE        TRUE
2 <NA>    TRUE       FALSE
3    c   FALSE        TRUE
# Filter by null values
mydfNA2[!is.na(mydfNA2$colA) & !is.na(mydfNA2$colB), ]
  colA colB
1    a    d

2.14.2 Replace missing values with a constant

The fill_null() method can be used to replace missing values in a Series with a constant.

Here’s some examples where I replace missing values from all columns of a DataFrame:

mydfNA <- pl$DataFrame(
  colA = c("a",NA,"c"),
  colB = c("d",NA,NA)
)

# In the same columns
mydfNA$
  with_columns(
  pl$all()$fill_null("missing")
)
shape: (3, 2)
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ colA    ┆ colB    β”‚
β”‚ ---     ┆ ---     β”‚
β”‚ str     ┆ str     β”‚
β•žβ•β•β•β•β•β•β•β•β•β•ͺ═════════║
β”‚ a       ┆ d       β”‚
β”‚ missing ┆ missing β”‚
β”‚ c       ┆ missing β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
# In new columns suffixed by "_corrected"
mydfNA$
  with_columns(
  pl$all()$fill_null("missing")$name$suffix("_corrected")
)
shape: (3, 4)
β”Œβ”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ colA ┆ colB ┆ colA_corrected ┆ colB_corrected β”‚
β”‚ ---  ┆ ---  ┆ ---            ┆ ---            β”‚
β”‚ str  ┆ str  ┆ str            ┆ str            β”‚
β•žβ•β•β•β•β•β•β•ͺ══════β•ͺ════════════════β•ͺ════════════════║
β”‚ a    ┆ d    ┆ a              ┆ d              β”‚
β”‚ null ┆ null ┆ missing        ┆ missing        β”‚
β”‚ c    ┆ null ┆ c              ┆ missing        β”‚
β””β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
mydfNA2 <- as.data.frame(mydfNA)

# In the same columns
mydfNA2[is.na(mydfNA2)] <- "missing"
mydfNA2
     colA    colB
1       a       d
2 missing missing
3       c missing
# In new columns suffixed by "_corrected"
mydfNA2 <- as.data.frame(mydfNA)

transform(mydfNA2,
          colA_corrected = ifelse(is.na(colA), "missing", colA),
          colB_corrected = ifelse(is.na(colB), "missing", colB))
  colA colB colA_corrected colB_corrected
1    a    d              a              d
2 <NA> <NA>        missing        missing
3    c <NA>              c        missing
mydfNA2 <- as.data.frame(mydfNA)

# In the same columns
mydfNA2 %>%
  mutate(across(everything(), ~ifelse(is.na(.), "missing", .)))
     colA    colB
1       a       d
2 missing missing
3       c missing
# In new columns suffixed by "_corrected"
mydfNA2 %>%
  mutate(across(c(colA, colB), ~ifelse(is.na(.), "missing", .), .names = "{col}_corrected"))
  colA colB colA_corrected colB_corrected
1    a    d              a              d
2 <NA> <NA>        missing        missing
3    c <NA>              c        missing
mydfNA2 <- as.data.frame(mydfNA)
mydfNA2_dt <- as.data.table(mydfNA2)

# In the same columns
mydfNA2_dt[is.na(mydfNA2_dt)] <- "missing"
mydfNA2_dt
      colA    colB
    <char>  <char>
1:       a       d
2: missing missing
3:       c missing
# In new columns suffixed by "_corrected"
mydfNA2_dt[,
           c("colA_corrected", "colB_corrected") := lapply(.SD, function(x) ifelse(is.na(x), "missing", x)),
           .SDcols = c("colA", "colB")]
mydfNA2_dt
      colA    colB colA_corrected colB_corrected
    <char>  <char>         <char>         <char>
1:       a       d              a              d
2: missing missing        missing        missing
3:       c missing              c        missing
Important

Be careful, the fill_null() method can in some cases modify data types like cast(). This can happen, for example, when you’re working on an integer column and you want to replace the missing values with a string => the column will then has a string dtype!

2.14.3 Replace missing values with a strategy

The fill_null() method of polars has a strategy argument for replacing missing values:

  • forward: replace with the previous non-null value in the Series
  • backward: replace with the next non-null value in the Series
  • min: replace with the smallest value in the Series
  • max: replace with the largest value in the Series
  • mean: replace with the median value in the Series
  • zero: replace with 0
  • one: replace with 1
Note

We can set a limit on how many rows to fill-forward or backward with limit

Here’s some examples :

# In the same columns
mydfNA <- pl$DataFrame(
  colA = c("a",NA,"c"),
  colB = c("d",NA,NA),
  colC = c(1,NA,3)
)
# With forward strategy
mydfNA$
  with_columns(
    pl$all()$fill_null(strategy = "forward")$name$suffix("_corrected")
  )
shape: (3, 6)
β”Œβ”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ colA ┆ colB ┆ colC ┆ colA_corrected ┆ colB_corrected ┆ colC_corrected β”‚
β”‚ ---  ┆ ---  ┆ ---  ┆ ---            ┆ ---            ┆ ---            β”‚
β”‚ str  ┆ str  ┆ f64  ┆ str            ┆ str            ┆ f64            β”‚
β•žβ•β•β•β•β•β•β•ͺ══════β•ͺ══════β•ͺ════════════════β•ͺ════════════════β•ͺ════════════════║
β”‚ a    ┆ d    ┆ 1.0  ┆ a              ┆ d              ┆ 1.0            β”‚
β”‚ null ┆ null ┆ null ┆ a              ┆ d              ┆ 1.0            β”‚
β”‚ c    ┆ null ┆ 3.0  ┆ c              ┆ d              ┆ 3.0            β”‚
β””β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
# With forward strategy and a limit
mydfNA$
  with_columns(
    pl$all()$fill_null(strategy = "forward", limit = 1)$name$suffix("_corrected")
  )
shape: (3, 6)
β”Œβ”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ colA ┆ colB ┆ colC ┆ colA_corrected ┆ colB_corrected ┆ colC_corrected β”‚
β”‚ ---  ┆ ---  ┆ ---  ┆ ---            ┆ ---            ┆ ---            β”‚
β”‚ str  ┆ str  ┆ f64  ┆ str            ┆ str            ┆ f64            β”‚
β•žβ•β•β•β•β•β•β•ͺ══════β•ͺ══════β•ͺ════════════════β•ͺ════════════════β•ͺ════════════════║
β”‚ a    ┆ d    ┆ 1.0  ┆ a              ┆ d              ┆ 1.0            β”‚
β”‚ null ┆ null ┆ null ┆ a              ┆ d              ┆ 1.0            β”‚
β”‚ c    ┆ null ┆ 3.0  ┆ c              ┆ null           ┆ 3.0            β”‚
β””β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
# With backward strategy
mydfNA$
  with_columns(
    pl$all()$fill_null(strategy = "backward")$name$suffix("_corrected")
  )
shape: (3, 6)
β”Œβ”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ colA ┆ colB ┆ colC ┆ colA_corrected ┆ colB_corrected ┆ colC_corrected β”‚
β”‚ ---  ┆ ---  ┆ ---  ┆ ---            ┆ ---            ┆ ---            β”‚
β”‚ str  ┆ str  ┆ f64  ┆ str            ┆ str            ┆ f64            β”‚
β•žβ•β•β•β•β•β•β•ͺ══════β•ͺ══════β•ͺ════════════════β•ͺ════════════════β•ͺ════════════════║
β”‚ a    ┆ d    ┆ 1.0  ┆ a              ┆ d              ┆ 1.0            β”‚
β”‚ null ┆ null ┆ null ┆ c              ┆ null           ┆ 3.0            β”‚
β”‚ c    ┆ null ┆ 3.0  ┆ c              ┆ null           ┆ 3.0            β”‚
β””β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
# With mean strategy only on "colC" column
mydfNA$
  with_columns(
    pl$col("colC")$fill_null(strategy = "mean")$name$suffix("_corrected")
  )
shape: (3, 4)
β”Œβ”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ colA ┆ colB ┆ colC ┆ colC_corrected β”‚
β”‚ ---  ┆ ---  ┆ ---  ┆ ---            β”‚
β”‚ str  ┆ str  ┆ f64  ┆ f64            β”‚
β•žβ•β•β•β•β•β•β•ͺ══════β•ͺ══════β•ͺ════════════════║
β”‚ a    ┆ d    ┆ 1.0  ┆ 1.0            β”‚
β”‚ null ┆ null ┆ null ┆ 2.0            β”‚
β”‚ c    ┆ null ┆ 3.0  ┆ 3.0            β”‚
β””β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

2.14.4 Replace missing values with an expression

Of course, you are not limited to the built-in strategies of polars => with fill_null() you can also use expressions to replace missing values. It works with expression from the same column and from another column.

Here’s some examples:

# Replace missing values with the mean of the non-null values for that column
mydfNA$
  with_columns(
    pl$col("colC")$fill_null(strategy = "mean")$name$suffix("_corrected")
  )
shape: (3, 4)
β”Œβ”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ colA ┆ colB ┆ colC ┆ colC_corrected β”‚
β”‚ ---  ┆ ---  ┆ ---  ┆ ---            β”‚
β”‚ str  ┆ str  ┆ f64  ┆ f64            β”‚
β•žβ•β•β•β•β•β•β•ͺ══════β•ͺ══════β•ͺ════════════════║
β”‚ a    ┆ d    ┆ 1.0  ┆ 1.0            β”‚
β”‚ null ┆ null ┆ null ┆ 2.0            β”‚
β”‚ c    ┆ null ┆ 3.0  ┆ 3.0            β”‚
β””β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
# Replace missing values with the values from another column
mydfNA$
  with_columns(
    pl$col("colB")$fill_null(pl$col("colA"))$name$suffix("_corrected")
  )
shape: (3, 4)
β”Œβ”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ colA ┆ colB ┆ colC ┆ colB_corrected β”‚
β”‚ ---  ┆ ---  ┆ ---  ┆ ---            β”‚
β”‚ str  ┆ str  ┆ f64  ┆ str            β”‚
β•žβ•β•β•β•β•β•β•ͺ══════β•ͺ══════β•ͺ════════════════║
β”‚ a    ┆ d    ┆ 1.0  ┆ d              β”‚
β”‚ null ┆ null ┆ null ┆ null           β”‚
β”‚ c    ┆ null ┆ 3.0  ┆ c              β”‚
β””β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
mydfNA2 <- as.data.frame(mydfNA)

# Replace missing values with the mean of the non-null values for that column
mydfNA2$colC_corrected <- ifelse(is.na(mydfNA2$colC), mean(mydfNA2$colC, na.rm = TRUE), mydfNA2$colC)

mydfNA2 <- as.data.frame(mydfNA)
# Replace missing values with the values from another column
mydfNA2$colB_corrected <- ifelse(is.na(mydfNA2$colB), mydfNA2$colA, mydfNA2$colB)
mydfNA2
  colA colB colC colB_corrected
1    a    d    1              d
2 <NA> <NA>   NA           <NA>
3    c <NA>    3              c
mydfNA2 <- as.data.frame(mydfNA)

# Replace missing values with the mean of the non-null values for that column
mydfNA2 %>%
  mutate(colC_corrected = ifelse(is.na(colC), mean(mydfNA2$colC, na.rm = TRUE), colC))
  colA colB colC colC_corrected
1    a    d    1              1
2 <NA> <NA>   NA              2
3    c <NA>    3              3
# Replace missing values with the values from another column
mydfNA2 %>%
  mutate(colB_corrected = ifelse(is.na(colB), colA, colB))
  colA colB colC colB_corrected
1    a    d    1              d
2 <NA> <NA>   NA           <NA>
3    c <NA>    3              c
mydfNA2 <- as.data.frame(mydfNA)
mydfNA2_dt <- as.data.table(mydfNA2)

# Replace missing values with the mean of the non-null values for that column
mydfNA2_dt[, colC_corrected := ifelse(is.na(colC), mean(mydfNA2_dt$colC, na.rm = TRUE), colC)]
mydfNA2_dt
     colA   colB  colC colC_corrected
   <char> <char> <num>          <num>
1:      a      d     1              1
2:   <NA>   <NA>    NA              2
3:      c   <NA>     3              3
# Replace missing values with the values from another column
mydfNA2_dt[, colB_corrected := ifelse(is.na(colB), colA, colB)]
mydfNA2_dt
     colA   colB  colC colC_corrected colB_corrected
   <char> <char> <num>          <num>         <char>
1:      a      d     1              1              d
2:   <NA>   <NA>    NA              2           <NA>
3:      c   <NA>     3              3              c

2.14.5 Replace missing values with a sequence of columns

The coalesce() method can be used to replace missing values based on a sequence of columns.

Here’s an example creating a new column β€œcol4” that has the first non-null value as we go through some columns (in order!):

mynewdfNA <- pl$DataFrame(
  col1 = c(NA,"y",NA),
  col2 = c(NA,"v","w"),
  col3 = c("r","s",NA)
)

mynewdfNA$
  with_columns(
    pl$coalesce("col1","col2","col3")$alias("col4")
  )
shape: (3, 4)
β”Œβ”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”
β”‚ col1 ┆ col2 ┆ col3 ┆ col4 β”‚
β”‚ ---  ┆ ---  ┆ ---  ┆ ---  β”‚
β”‚ str  ┆ str  ┆ str  ┆ str  β”‚
β•žβ•β•β•β•β•β•β•ͺ══════β•ͺ══════β•ͺ══════║
β”‚ null ┆ null ┆ r    ┆ r    β”‚
β”‚ y    ┆ v    ┆ s    ┆ y    β”‚
β”‚ null ┆ w    ┆ null ┆ w    β”‚
β””β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”˜
mynewdfNA2 <-  as.data.frame(mynewdfNA)

mynewdfNA2$col4 <- coalesce(mynewdfNA2$col1, mynewdfNA2$col2, mynewdfNA2$col3)
mynewdfNA2
  col1 col2 col3 col4
1 <NA> <NA>    r    r
2    y    v    s    y
3 <NA>    w <NA>    w
mynewdfNA2 <-  as.data.frame(mynewdfNA)

mynewdfNA2 %>%
  mutate(col4 = coalesce(col1, col2, col3))
  col1 col2 col3 col4
1 <NA> <NA>    r    r
2    y    v    s    y
3 <NA>    w <NA>    w
mynewdfNA2 <-   as.data.frame(mynewdfNA)
mynewdfNA2_dt <- as.data.table(mynewdfNA2)

mynewdfNA2_dt[, col4 := fcoalesce(col1, col2, col3)]
mynewdfNA2_dt
     col1   col2   col3   col4
   <char> <char> <char> <char>
1:   <NA>   <NA>      r      r
2:      y      v      s      y
3:   <NA>      w   <NA>      w

2.15 Others useful methods

2.15.1 On Series

2.15.1.1 Change name of Series

The alias() method is very useful especially in method chaining operation.
With R base, the syntax is longer.

as_polars_series(1:3, name = "toto")$alias("titi")
shape: (3, 1)
β”Œβ”€β”€β”€β”€β”€β”€β”
β”‚ titi β”‚
β”‚ ---  β”‚
β”‚ i32  β”‚
β•žβ•β•β•β•β•β•β•‘
β”‚ 1    β”‚
β”‚ 2    β”‚
β”‚ 3    β”‚
β””β”€β”€β”€β”€β”€β”€β”˜
toto <- 1:3
titi <- toto
rm(toto)

2.15.1.2 Reduce Boolean Series

The all() and any() methods can be used to check if all or any values in a vector evaluate to TRUE for some expression.

# all(as_polars_series(c(TRUE,TRUE))) doesn't work
as_polars_series(c(TRUE, TRUE, NA))$all()
shape: (1, 1)
β”Œβ”€β”€β”€β”€β”€β”€β”
β”‚      β”‚
β”‚ ---  β”‚
β”‚ bool β”‚
β•žβ•β•β•β•β•β•β•‘
β”‚ true β”‚
β””β”€β”€β”€β”€β”€β”€β”˜
as_polars_series(c(TRUE, TRUE, FALSE))$all()
shape: (1, 1)
β”Œβ”€β”€β”€β”€β”€β”€β”€β”
β”‚       β”‚
β”‚ ---   β”‚
β”‚ bool  β”‚
β•žβ•β•β•β•β•β•β•β•‘
β”‚ false β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”˜
as_polars_series(c(TRUE, TRUE, TRUE))$all()
shape: (1, 1)
β”Œβ”€β”€β”€β”€β”€β”€β”
β”‚      β”‚
β”‚ ---  β”‚
β”‚ bool β”‚
β•žβ•β•β•β•β•β•β•‘
β”‚ true β”‚
β””β”€β”€β”€β”€β”€β”€β”˜
all(c(TRUE,TRUE,NA))
[1] NA
all(c(TRUE,TRUE,FALSE))
[1] FALSE
all(c(TRUE,TRUE,TRUE))
[1] TRUE

2.15.1.3 Get data type of Series

The dtype() method can be used to get data type of Series.

as_polars_series(letters)$dtype
String
as_polars_series(c(1, 2))$dtype
Float64
infer_type(letters)
Utf8
string
infer_type(c(1, 2))
Float64
double
Tip

Polars is strongly typed. print(ls(pl$dtypes)) returns the full list of valid Polars types. Caution, some type names differ from what they are called in R base. See below!

as_polars_series(c("x","y","z"))$dtype
String
as_polars_series(c(1, 2, 3))$dtype
Float64
as_polars_series(c(1:3))$dtype
Int32
as_polars_series(c(TRUE,FALSE))$dtype
Boolean
as_polars_series(factor(c("a","b","c")))$dtype
Categorical(ordering='lexical')
as_polars_series(Sys.Date())$dtype
Date
as_polars_series(c(0,1))$dtype
Float64
typeof(c("x","y","z"))
[1] "character"
typeof(c(1, 2, 3))
[1] "double"
typeof(c(1:3))
[1] "integer"
typeof(c(TRUE,FALSE))
[1] "logical"
typeof(factor(c("a","b","c")))
[1] "integer"
typeof(Sys.Date())
[1] "double"

To summarise the main types between Polars and R:

Polars R Base
String character
Float64 double
Int32 integer
Boolean logical
Categorical Factor
Date Date

To learn more about Data types in Polars, see here.

2.15.1.4 Cast

The cast() method can be used to convert the data types of a column to a new one.
See here for the data types available with Polars.

as_polars_df(iris)$with_columns(
  pl$col("Petal.Length")$cast(pl$Int8), # The "Petal.Length" column is converted into integers
  pl$col("Species")$cast(pl$String) # The "Species" column is converted into strings
)$schema
$Sepal.Length
Float64

$Sepal.Width
Float64

$Petal.Length
Int8

$Petal.Width
Float64

$Species
String
data(iris)
iris$Petal.Length <- as.integer(iris$Petal.Length)
iris$Species <- as.integer(iris$Species)
str(iris)
'data.frame':   150 obs. of  5 variables:
 $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
 $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
 $ Petal.Length: int  1 1 1 1 1 1 1 1 1 1 ...
 $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
 $ Species     : int  1 1 1 1 1 1 1 1 1 1 ...
data(iris)
iris |>
  mutate(
    Petal.Length = as.integer(Petal.Length),
    Species = as.character(Species)) |>
  str()
'data.frame':   150 obs. of  5 variables:
 $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
 $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
 $ Petal.Length: int  1 1 1 1 1 1 1 1 1 1 ...
 $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
 $ Species     : chr  "setosa" "setosa" "setosa" "setosa" ...
iris_dt[, `:=`(Petal.Length = as.integer(Petal.Length),
               Species = as.character(Species))]
str(iris_dt)
Classes 'data.table' and 'data.frame':  150 obs. of  5 variables:
 $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
 $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
 $ Petal.Length: int  1 1 1 1 1 1 1 1 1 1 ...
 $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
 $ Species     : chr  "setosa" "setosa" "setosa" "setosa" ...
 - attr(*, ".internal.selfref")=<externalptr> 
Tip

When working with very large tables we can Reduce the memory footprint by modifying the number of bits allocated to an element. ⚠️

When performing downcasting, it is crucial to ensure that the chosen number of bits is sufficient to accommodate the largest and smallest numbers in the column.

A quick reminder:

Type Range Accuracy
Int8 -128 to +127
Int16 -32768 to +32767
Int32 -2147483648 to +2147483647
Int64 –2E63 to –2E63-1
Float32 -3.4E+38 to +3.4E+38 about 7 decimal digits
Float64 -1.7E+308 to +1.7E+308 about 16 decimal digits

2.15.1.5 Check if Series is numeric

The infer_polars_dtype() method can be used to find the DataType corresponding to an R object.

::: {.panel-tabset} ## polars

s <- as_polars_series(1:4)
infer_polars_dtype(s)
Int32

2.15.1.6 Get length of a Series

The len() method can be used to get the length of a Series.

as_polars_series(1:4)$len()
[1] 4
length(1:4)
[1] 4

2.15.1.7 Check if Series are equal

The series_equal() method can be used to check if a Series is equal with another Series.

Tip

Caution, if two series are identical but one is named and the other is not then series_equal() returns FALSE.

as_polars_series(1:4)$equals(as_polars_series(1:4))
[1] TRUE
as_polars_series(1:4,name = "toto")$equals(as_polars_series(1:4))
[1] TRUE
identical(1:4,1:4)
[1] TRUE

2.15.1.8 Convert Series to Polars DataFrame

The to_frame() method can be used to convert a Series to a DataFrame.
In this case, a DataFrame with only one column will be created. If the Series is initially named then the column of the DataFrame will be named as such.

as_polars_series(1:3, "toto")$to_frame()
shape: (3, 1)
β”Œβ”€β”€β”€β”€β”€β”€β”
β”‚ toto β”‚
β”‚ ---  β”‚
β”‚ i32  β”‚
β•žβ•β•β•β•β•β•β•‘
β”‚ 1    β”‚
β”‚ 2    β”‚
β”‚ 3    β”‚
β””β”€β”€β”€β”€β”€β”€β”˜

2.15.1.9 Get value Counts of a Series

The value_counts() method can be used to get a value counts of a Series.

as_polars_series(iris$Species)$value_counts()
shape: (3, 1)
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                   β”‚
β”‚ ---               β”‚
β”‚ struct[2]         β”‚
β•žβ•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•‘
β”‚ {"versicolor",50} β”‚
β”‚ {"setosa",50}     β”‚
β”‚ {"virginica",50}  β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
table(iris$Species)

    setosa versicolor  virginica 
        50         50         50 
iris |>
  count(Species)
     Species  n
1     setosa 50
2 versicolor 50
3  virginica 50
iris_dt[, .N, by = Species]
      Species     N
       <char> <int>
1:     setosa    50
2: versicolor    50
3:  virginica    50

2.15.1.10 Sum across Series

The sum() method can be used to get a sum of a Series.

  • From a single Series:
as_polars_series(1:3)$sum()
shape: (1, 1)
β”Œβ”€β”€β”€β”€β”€β”
β”‚     β”‚
β”‚ --- β”‚
β”‚ i32 β”‚
β•žβ•β•β•β•β•β•‘
β”‚ 6   β”‚
β””β”€β”€β”€β”€β”€β”˜
sum(c(1:3))
[1] 6
  • From a DataFrame and a column as a string:
as_polars_df(iris)$select(pl$sum("Petal.Length"))
shape: (1, 1)
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Petal.Length β”‚
β”‚ ---          β”‚
β”‚ f64          β”‚
β•žβ•β•β•β•β•β•β•β•β•β•β•β•β•β•β•‘
β”‚ 563.7        β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
sum(iris$Petal.Length)
[1] 563.7
iris |> summarise(sum(Petal.Length))
  sum(Petal.Length)
1             563.7
sum(iris_dt[, Petal.Length])
[1] 496
  • From a DataFrame and a column as an expression:
as_polars_df(iris)$select(pl$sum("Petal.Width"))
shape: (1, 1)
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Petal.Width β”‚
β”‚ ---         β”‚
β”‚ f64         β”‚
β•žβ•β•β•β•β•β•β•β•β•β•β•β•β•β•‘
β”‚ 179.9       β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
  • From a DataFrame and a column as a list and sum horizontally:

In this case, use with_columns() method.

df_pl <- pl$DataFrame(col1 = c(10L,20L), col2= c(30L,40L), col3 = c(40L,50L))
df_pl$with_columns(pl$sum("col1", "col3"))
shape: (2, 3)
β”Œβ”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”
β”‚ col1 ┆ col2 ┆ col3 β”‚
β”‚ ---  ┆ ---  ┆ ---  β”‚
β”‚ i32  ┆ i32  ┆ i32  β”‚
β•žβ•β•β•β•β•β•β•ͺ══════β•ͺ══════║
β”‚ 30   ┆ 30   ┆ 90   β”‚
β”‚ 30   ┆ 40   ┆ 90   β”‚
β””β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”˜
df <- data.frame(col1 = c(10L,20L), col2= c(30L,40L), col3 = c(40L,50L))
mysum <- rowSums(df[, c("col1", "col3")])
cbind(df,mysum)
  col1 col2 col3 mysum
1   10   30   40    50
2   20   40   50    70
df |>
  rowwise() |>
  mutate(mysum = sum(col1,col3))
# A tibble: 2 Γ— 4
# Rowwise: 
   col1  col2  col3 mysum
  <int> <int> <int> <int>
1    10    30    40    50
2    20    40    50    70
df_dt <- as.data.table(df)
df_dt[, somme := rowSums(.SD), .SDcols = c("col1", "col3")]
print(df_dt)
    col1  col2  col3 somme
   <int> <int> <int> <num>
1:    10    30    40    50
2:    20    40    50    70
  • From a DataFrame and sum horizontally some columns:

An example with iris Dataframe:

df_pl$with_columns(
  pl$sum_horizontal("col1", "col2", "col3")$alias("sum")
)
shape: (2, 4)
β”Œβ”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”
β”‚ col1 ┆ col2 ┆ col3 ┆ sum β”‚
β”‚ ---  ┆ ---  ┆ ---  ┆ --- β”‚
β”‚ i32  ┆ i32  ┆ i32  ┆ i32 β”‚
β•žβ•β•β•β•β•β•β•ͺ══════β•ͺ══════β•ͺ═════║
β”‚ 10   ┆ 30   ┆ 40   ┆ 80  β”‚
β”‚ 20   ┆ 40   ┆ 50   ┆ 110 β”‚
β””β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”˜

2.15.2 On DataFrames

2.15.2.1 Get Series from DataFrame

The to_series() method can be used to get one column from DataFrame as Series.

as_polars_df(iris)$select(pl$col("Petal.Length"))$to_series()
shape: (150, 1)
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Petal.Length β”‚
β”‚ ---          β”‚
β”‚ f64          β”‚
β•žβ•β•β•β•β•β•β•β•β•β•β•β•β•β•β•‘
β”‚ 1.4          β”‚
β”‚ 1.4          β”‚
β”‚ 1.3          β”‚
β”‚ 1.5          β”‚
β”‚ 1.4          β”‚
β”‚ …            β”‚
β”‚ 5.2          β”‚
β”‚ 5.0          β”‚
β”‚ 5.2          β”‚
β”‚ 5.4          β”‚
β”‚ 5.1          β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
iris$Petal.Length
  [1] 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 1.5 1.6 1.4 1.1 1.2 1.5 1.3 1.4
 [19] 1.7 1.5 1.7 1.5 1.0 1.7 1.9 1.6 1.6 1.5 1.4 1.6 1.6 1.5 1.5 1.4 1.5 1.2
 [37] 1.3 1.4 1.3 1.5 1.3 1.3 1.3 1.6 1.9 1.4 1.6 1.4 1.5 1.4 4.7 4.5 4.9 4.0
 [55] 4.6 4.5 4.7 3.3 4.6 3.9 3.5 4.2 4.0 4.7 3.6 4.4 4.5 4.1 4.5 3.9 4.8 4.0
 [73] 4.9 4.7 4.3 4.4 4.8 5.0 4.5 3.5 3.8 3.7 3.9 5.1 4.5 4.5 4.7 4.4 4.1 4.0
 [91] 4.4 4.6 4.0 3.3 4.2 4.2 4.2 4.3 3.0 4.1 6.0 5.1 5.9 5.6 5.8 6.6 4.5 6.3
[109] 5.8 6.1 5.1 5.3 5.5 5.0 5.1 5.3 5.5 6.7 6.9 5.0 5.7 4.9 6.7 4.9 5.7 6.0
[127] 4.8 4.9 5.6 5.8 6.1 6.4 5.6 5.1 5.6 6.1 5.6 5.5 4.8 5.4 5.6 5.1 5.1 5.9
[145] 5.7 5.2 5.0 5.2 5.4 5.1
iris |>
  pull(Petal.Length)
  [1] 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 1.5 1.6 1.4 1.1 1.2 1.5 1.3 1.4
 [19] 1.7 1.5 1.7 1.5 1.0 1.7 1.9 1.6 1.6 1.5 1.4 1.6 1.6 1.5 1.5 1.4 1.5 1.2
 [37] 1.3 1.4 1.3 1.5 1.3 1.3 1.3 1.6 1.9 1.4 1.6 1.4 1.5 1.4 4.7 4.5 4.9 4.0
 [55] 4.6 4.5 4.7 3.3 4.6 3.9 3.5 4.2 4.0 4.7 3.6 4.4 4.5 4.1 4.5 3.9 4.8 4.0
 [73] 4.9 4.7 4.3 4.4 4.8 5.0 4.5 3.5 3.8 3.7 3.9 5.1 4.5 4.5 4.7 4.4 4.1 4.0
 [91] 4.4 4.6 4.0 3.3 4.2 4.2 4.2 4.3 3.0 4.1 6.0 5.1 5.9 5.6 5.8 6.6 4.5 6.3
[109] 5.8 6.1 5.1 5.3 5.5 5.0 5.1 5.3 5.5 6.7 6.9 5.0 5.7 4.9 6.7 4.9 5.7 6.0
[127] 4.8 4.9 5.6 5.8 6.1 6.4 5.6 5.1 5.6 6.1 5.6 5.5 4.8 5.4 5.6 5.1 5.1 5.9
[145] 5.7 5.2 5.0 5.2 5.4 5.1
iris_dt[, c(Petal.Length)]
  [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 [38] 1 1 1 1 1 1 1 1 1 1 1 1 1 4 4 4 4 4 4 4 3 4 3 3 4 4 4 3 4 4 4 4 3 4 4 4 4
 [75] 4 4 4 5 4 3 3 3 3 5 4 4 4 4 4 4 4 4 4 3 4 4 4 4 3 4 6 5 5 5 5 6 4 6 5 6 5
[112] 5 5 5 5 5 5 6 6 5 5 4 6 4 5 6 4 4 5 5 6 6 5 5 5 6 5 5 4 5 5 5 5 5 5 5 5 5
[149] 5 5

2.15.2.2 Get a R List from DataFrame

The as.list() function can be used to get a R List from a Polars object.

mylist <- as.list(as_polars_df(iris))
names(mylist)
[1] "Sepal.Length" "Sepal.Width"  "Petal.Length" "Petal.Width"  "Species"     
mylist <- as.list(iris)
names(mylist)
[1] "Sepal.Length" "Sepal.Width"  "Petal.Length" "Petal.Width"  "Species"     

2.15.2.3 Get a slice of a DataFrame

The slice() method can be used to get a slice of a DataFrame.

Important

With Polars, numeric default is 0! Thus the equivalent to slice(1,3) with Polars will be 1:4 in R Base and data.table and slice(1,4) with dplyr.

as_polars_df(iris)$slice(1,3)
shape: (3, 5)
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Sepal.Length ┆ Sepal.Width ┆ Petal.Length ┆ Petal.Width ┆ Species β”‚
β”‚ ---          ┆ ---         ┆ ---          ┆ ---         ┆ ---     β”‚
β”‚ f64          ┆ f64         ┆ f64          ┆ f64         ┆ cat     β”‚
β•žβ•β•β•β•β•β•β•β•β•β•β•β•β•β•β•ͺ═════════════β•ͺ══════════════β•ͺ═════════════β•ͺ═════════║
β”‚ 4.9          ┆ 3.0         ┆ 1.4          ┆ 0.2         ┆ setosa  β”‚
β”‚ 4.7          ┆ 3.2         ┆ 1.3          ┆ 0.2         ┆ setosa  β”‚
β”‚ 4.6          ┆ 3.1         ┆ 1.5          ┆ 0.2         ┆ setosa  β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
iris[2:4,]
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
iris |>
  slice(2:4)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          4.9         3.0          1.4         0.2  setosa
2          4.7         3.2          1.3         0.2  setosa
3          4.6         3.1          1.5         0.2  setosa
iris_dt[2:4,]
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
          <num>       <num>        <int>       <num>  <char>
1:          4.9         3.0            1         0.2  setosa
2:          4.7         3.2            1         0.2  setosa
3:          4.6         3.1            1         0.2  setosa

2.15.2.4 Get a structure from a DataFrame

The to_struct() method can be used to get a structure from a DataFrame.

mystruc <- as_polars_df(iris)$to_struct()
mystruc
shape: (150, 1)
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                               β”‚
β”‚ ---                           β”‚
β”‚ struct[5]                     β”‚
β•žβ•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•‘
β”‚ {5.1,3.5,1.4,0.2,"setosa"}    β”‚
β”‚ {4.9,3.0,1.4,0.2,"setosa"}    β”‚
β”‚ {4.7,3.2,1.3,0.2,"setosa"}    β”‚
β”‚ {4.6,3.1,1.5,0.2,"setosa"}    β”‚
β”‚ {5.0,3.6,1.4,0.2,"setosa"}    β”‚
β”‚ …                             β”‚
β”‚ {6.7,3.0,5.2,2.3,"virginica"} β”‚
β”‚ {6.3,2.5,5.0,1.9,"virginica"} β”‚
β”‚ {6.5,3.0,5.2,2.0,"virginica"} β”‚
β”‚ {6.2,3.4,5.4,2.3,"virginica"} β”‚
β”‚ {5.9,3.0,5.1,1.8,"virginica"} β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

to_struct() returns a Series which can be converted to a R list with as.vector() function.

mylist <- as.vector(mystruc)
β„Ή `as.vector()` on a Polars Series of type struct[5] may drop some useful
  attributes.
β„Ή Use `$to_r_vector()` instead for finer control of the conversion from Polars
  to R.
str(mylist)
List of 5
 $ Sepal.Length: num [1:150] 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
 $ Sepal.Width : num [1:150] 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
 $ Petal.Length: num [1:150] 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
 $ Petal.Width : num [1:150] 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
 $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

mylist is now a list where each element is a column of the initial DataFrame.

With to_frame() and unnest() methods, we can reconstruct the original DataFrame:

back_df <- mystruc$to_frame()$unnest()
back_df
shape: (150, 1)
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                               β”‚
β”‚ ---                           β”‚
β”‚ struct[5]                     β”‚
β•žβ•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•‘
β”‚ {5.1,3.5,1.4,0.2,"setosa"}    β”‚
β”‚ {4.9,3.0,1.4,0.2,"setosa"}    β”‚
β”‚ {4.7,3.2,1.3,0.2,"setosa"}    β”‚
β”‚ {4.6,3.1,1.5,0.2,"setosa"}    β”‚
β”‚ {5.0,3.6,1.4,0.2,"setosa"}    β”‚
β”‚ …                             β”‚
β”‚ {6.7,3.0,5.2,2.3,"virginica"} β”‚
β”‚ {6.3,2.5,5.0,1.9,"virginica"} β”‚
β”‚ {6.5,3.0,5.2,2.0,"virginica"} β”‚
β”‚ {6.2,3.4,5.4,2.3,"virginica"} β”‚
β”‚ {5.9,3.0,5.1,1.8,"virginica"} β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

2.15.2.5 Drop all rows that contain null values

The drop_nulls() method can be used to drop all rows that contain null values in a DataFrame.

By default, drop_nulls() use all columns to drop rows:

data_pl <- pl$DataFrame(
  col1 = c(NA,"b","c"),
  col2 = c(1,2,NA)
)
data_pl$drop_nulls()
shape: (1, 2)
β”Œβ”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”
β”‚ col1 ┆ col2 β”‚
β”‚ ---  ┆ ---  β”‚
β”‚ str  ┆ f64  β”‚
β•žβ•β•β•β•β•β•β•ͺ══════║
β”‚ b    ┆ 2.0  β”‚
β””β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”˜
data_df <- data.frame(
  col1 = c(NA, "b", "c"),
  col2 = c(1, 2, NA)
)
data_df[complete.cases(data_df), ]
  col1 col2
2    b    2
data_df <- data.frame(
  col1 = c(NA, "b", "c"),
  col2 = c(1, 2, NA)
)
data_df %>%
  filter(complete.cases(.))
  col1 col2
1    b    2
data_dt <- data.table(
  col1 = c(NA, "b", "c"),
  col2 = c(1, 2, NA)
)
na.omit(data_dt)
     col1  col2
   <char> <num>
1:      b     2

If you want, you can specify a column (or multiple columns):

data_pl$drop_nulls("col1")
shape: (2, 2)
β”Œβ”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”
β”‚ col1 ┆ col2 β”‚
β”‚ ---  ┆ ---  β”‚
β”‚ str  ┆ f64  β”‚
β•žβ•β•β•β•β•β•β•ͺ══════║
β”‚ b    ┆ 2.0  β”‚
β”‚ c    ┆ null β”‚
β””β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”˜
data_df[!is.na(data_df$col1), ]
  col1 col2
2    b    2
3    c   NA
data_df |>
  filter(!is.na(col1))
  col1 col2
1    b    2
2    c   NA
data_dt[complete.cases(data_dt[, .SD, .SDcols = "col1"]), ]
     col1  col2
   <char> <num>
1:      b     2
2:      c    NA

2.16 Strings methods

In polars, a lot of strings methods are useful. Here is the list.
To use them, simply prefix them with str.

2.16.1 Get substrings

The str$slice() method can be used to create substrings of the string values of a Utf8 Series.

str$slice() does not work like R base’s substr() function for finding the substring of interest: - substr() takes two arguments: the first and last elements; - str$slice() takes two arguments: the first element and the extraction length.

Important

With Polars, numeric default is 0! Thus the equivalent to str$slice(0,3) with Polars will be substr(1,3).

Two further comments:

  • If the second argument length is not specified, the sub-character string of interest will default to the end of the character string. For example in a DataFrame if mycol is a string column of length 4, pl.col("mycol").str.slice(1) is equivalent to substr(mycol,2,4) in dplyr.

  • The first argument accepts negative values, which means that sub-strings can be considered starting from the end. For example in a DataFrame if mycol is a string column of length 4, pl.col("mycol").str.slice(-2) is equivalent to substr(mycol,3,4) in dplyr.

Let’s see an example:

mydf <- data.frame(
  col1 = 1:4,
  col2 = c("One_X","One_Y","Two_X","Two_Y")
)
as_polars_df(mydf)$with_columns(
  pl$col("col2")$str$slice(0, 3)$alias("level"),   # les 3 premiers caractères
  pl$col("col2")$str$slice(-1)$alias("x_y")        # le dernier caractère
)
shape: (4, 4)
β”Œβ”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”
β”‚ col1 ┆ col2  ┆ level ┆ x_y β”‚
β”‚ ---  ┆ ---   ┆ ---   ┆ --- β”‚
β”‚ i32  ┆ str   ┆ str   ┆ str β”‚
β•žβ•β•β•β•β•β•β•ͺ═══════β•ͺ═══════β•ͺ═════║
β”‚ 1    ┆ One_X ┆ One   ┆ X   β”‚
β”‚ 2    ┆ One_Y ┆ One   ┆ Y   β”‚
β”‚ 3    ┆ Two_X ┆ Two   ┆ X   β”‚
β”‚ 4    ┆ Two_Y ┆ Two   ┆ Y   β”‚
β””β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”˜
mydf$level <- substr(mydf$col2, 1, 3)
mydf$x_y <- substr(mydf$col2, nchar(mydf$col2), nchar(mydf$col2))
mydf
  col1  col2 level x_y
1    1 One_X   One   X
2    2 One_Y   One   Y
3    3 Two_X   Two   X
4    4 Two_Y   Two   Y
mydf |>
  mutate(level = substr(col2, 1, 3),
         x_y = substr(col2, nchar(col2), nchar(col2)))
  col1  col2 level x_y
1    1 One_X   One   X
2    2 One_Y   One   Y
3    3 Two_X   Two   X
4    4 Two_Y   Two   Y
mydt <- as.data.table(mydf) 
mydt[, c("level", "x_y") := .(substr(col2, 1, 3), substr(col2, nchar(col2), nchar(col2)))]
mydt
    col1   col2  level    x_y
   <int> <char> <char> <char>
1:     1  One_X    One      X
2:     2  One_Y    One      Y
3:     3  Two_X    Two      X
4:     4  Two_Y    Two      Y

2.16.2 Check if string values start with a substring

The str$starts_with() method can be used to check if string values start with a substring. It returns a Boolean.

Let’s see an example where we create new Boolean columns based on the start of a character string:

as_polars_df(mydf)$with_columns(
  pl$col("col2")$str$starts_with("One")$alias("is_one"),
  pl$col("col2")$str$starts_with("Two")$alias("is_two")
)
shape: (4, 6)
β”Œβ”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ col1 ┆ col2  ┆ level ┆ x_y ┆ is_one ┆ is_two β”‚
β”‚ ---  ┆ ---   ┆ ---   ┆ --- ┆ ---    ┆ ---    β”‚
β”‚ i32  ┆ str   ┆ str   ┆ str ┆ bool   ┆ bool   β”‚
β•žβ•β•β•β•β•β•β•ͺ═══════β•ͺ═══════β•ͺ═════β•ͺ════════β•ͺ════════║
β”‚ 1    ┆ One_X ┆ One   ┆ X   ┆ true   ┆ false  β”‚
β”‚ 2    ┆ One_Y ┆ One   ┆ Y   ┆ true   ┆ false  β”‚
β”‚ 3    ┆ Two_X ┆ Two   ┆ X   ┆ false  ┆ true   β”‚
β”‚ 4    ┆ Two_Y ┆ Two   ┆ Y   ┆ false  ┆ true   β”‚
β””β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”˜
mydf$is_one <- grepl("^One", mydf$col2)
mydf$is_two <- grepl("^Two", mydf$col2)
mydf
  col1  col2 level x_y is_one is_two
1    1 One_X   One   X   TRUE  FALSE
2    2 One_Y   One   Y   TRUE  FALSE
3    3 Two_X   Two   X  FALSE   TRUE
4    4 Two_Y   Two   Y  FALSE   TRUE
mydf |>
  mutate(level = substr(col2, 1, 3),
         x_y = substr(col2, nchar(col2), nchar(col2)))
  col1  col2 level x_y is_one is_two
1    1 One_X   One   X   TRUE  FALSE
2    2 One_Y   One   Y   TRUE  FALSE
3    3 Two_X   Two   X  FALSE   TRUE
4    4 Two_Y   Two   Y  FALSE   TRUE
mydt <- as.data.table(mydf) 
mydt[, c("is_one", "is_two") := .(grepl("^One", col2), grepl("^Two", col2))]
mydt
    col1   col2  level    x_y is_one is_two
   <int> <char> <char> <char> <lgcl> <lgcl>
1:     1  One_X    One      X   TRUE  FALSE
2:     2  One_Y    One      Y   TRUE  FALSE
3:     3  Two_X    Two      X  FALSE   TRUE
4:     4  Two_Y    Two      Y  FALSE   TRUE

2.16.3 Check if string values end with a substring

The str$ends_with() method can be used to check if string values start with a substring. It returns a Boolean.

Let’s see an example where we filter the lines of a DataFrame based on the start of a character string:

as_polars_df(mydf)$filter(
  pl$col("col2")$str$ends_with("X")
)
shape: (2, 6)
β”Œβ”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ col1 ┆ col2  ┆ level ┆ x_y ┆ is_one ┆ is_two β”‚
β”‚ ---  ┆ ---   ┆ ---   ┆ --- ┆ ---    ┆ ---    β”‚
β”‚ i32  ┆ str   ┆ str   ┆ str ┆ bool   ┆ bool   β”‚
β•žβ•β•β•β•β•β•β•ͺ═══════β•ͺ═══════β•ͺ═════β•ͺ════════β•ͺ════════║
β”‚ 1    ┆ One_X ┆ One   ┆ X   ┆ true   ┆ false  β”‚
β”‚ 3    ┆ Two_X ┆ Two   ┆ X   ┆ false  ┆ true   β”‚
β””β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”˜
mydf[substr(mydf$col2, nchar(mydf$col2), nchar(mydf$col2)) == "X", ]
  col1  col2 level x_y is_one is_two
1    1 One_X   One   X   TRUE  FALSE
3    3 Two_X   Two   X  FALSE   TRUE
mydf |>
  filter(endsWith(col2, "X"))
  col1  col2 level x_y is_one is_two
1    1 One_X   One   X   TRUE  FALSE
2    3 Two_X   Two   X  FALSE   TRUE
mydt <- as.data.table(mydf) 
mydt[substr(col2, nchar(col2), nchar(col2)) == "X"]
    col1   col2  level    x_y is_one is_two
   <int> <char> <char> <char> <lgcl> <lgcl>
1:     1  One_X    One      X   TRUE  FALSE
2:     3  Two_X    Two      X  FALSE   TRUE

2.17 Create your methods

With R you can create your own method/function with function().
Let’s try to create a R function to captue some DataFrame transformations.

Our simple function:
- Takes a DataFrame as an input (argument data)
- Convert Categorical columns into Strings
- Make all Strings columns uppercase
- And filter only the third first rows

fn_transformation <- function(data) {
  
  data$
    # Convert Integer columns into Float 
    with_columns(
      pl$col(pl$Int32)$cast(pl$Float64))$
    # Make all Strings columns uppercase
    with_columns(
      pl$col(pl$String)$str$to_uppercase())$
    # Filter only the third first rows
    head(3)
  
}

Let’s apply our method to as_polars_df(iris):

fn_transformation(as_polars_df(mydf))
shape: (3, 6)
β”Œβ”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ col1 ┆ col2  ┆ level ┆ x_y ┆ is_one ┆ is_two β”‚
β”‚ ---  ┆ ---   ┆ ---   ┆ --- ┆ ---    ┆ ---    β”‚
β”‚ f64  ┆ str   ┆ str   ┆ str ┆ bool   ┆ bool   β”‚
β•žβ•β•β•β•β•β•β•ͺ═══════β•ͺ═══════β•ͺ═════β•ͺ════════β•ͺ════════║
β”‚ 1.0  ┆ ONE_X ┆ ONE   ┆ X   ┆ true   ┆ false  β”‚
β”‚ 2.0  ┆ ONE_Y ┆ ONE   ┆ Y   ┆ true   ┆ false  β”‚
β”‚ 3.0  ┆ TWO_X ┆ TWO   ┆ X   ┆ false  ┆ true   β”‚
β””β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Our method fn_transformation() can now be re-used multiple times for example on another datasets:

Newdf <- pl$DataFrame(
  col_categ = factor(c("a","b","c")),
  col_str = c("x","y","z"),
  col_num = 1:3
)
  
fn_transformation(Newdf)
shape: (3, 3)
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ col_categ ┆ col_str ┆ col_num β”‚
β”‚ ---       ┆ ---     ┆ ---     β”‚
β”‚ cat       ┆ str     ┆ f64     β”‚
β•žβ•β•β•β•β•β•β•β•β•β•β•β•ͺ═════════β•ͺ═════════║
β”‚ a         ┆ X       ┆ 1.0     β”‚
β”‚ b         ┆ Y       ┆ 2.0     β”‚
β”‚ c         ┆ Z       ┆ 3.0     β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Of course, in real life, we will create functions that are more complicated than our example.