2 Data manipulation

Below is the code used to create the fake data needed to run code in this document. Click to expand it! 👇

Code

library(polars)
library(dplyr)
library(tidyr)
library(data.table)
library(arrow)
library(microbenchmark)
iris_dt <- as.data.table(iris)

# Creation of Series
mynumbers_serie <- as_polars_series(1:3)
myletters_serie <- as_polars_series(c("a","b","c"))

# Creation of mydf DataFrame 
mydf <- pl$DataFrame(
  col1 = mynumbers_serie,
  col2 = myletters_serie
)

2.1 Introduction to methods

Important

Although some simple R functions work out of the box on polars objects, the added value of Polars consists in the methods. Thosepowerful methods are accessed using the $ operator.

There are numerous methods designed to accomplish various tasks.

One advantage of using methods is that many more operations are possible on Polars objects using methods than through base R functions.
A second advantage is Methods Chaining, a core part of the Polars workflow. If you are coming from one of the other popular data wrangling libraries in R, then you probably already have an innate sense of what this means. For instance:
- In dplyr we use a pipe operator, e.g. dat |> filter(…) |> select(…)
  -In data.table we use its indexing syntax, e.g. DT[i, j, by][…]

In polars our method chaining syntax takes the form object$m1()$m2(), where object is our data object, and $m1() and $m2() are appropriate methods.

For Series, see this section for the methods available in {polars}.
For DataFrames, see this section for the methods available in {polars}.

Some examples with Series: Some examples with Series:

# To get a sum
mynumbers_serie$sum()

[1] 6

# To sort
mynumbers_serie$sort()

polars Series: shape: (3,)
Series: '' [i32]
[
    1
    2
    3
]

Some examples with DataFrame:

# To get a character vector of column names
mydf$columns

[1] "col1" "col2"

# To get dimensions of DataFrame
mydf$shape

[1] 3 2

# We can mix standard R functions and methods
length(mydf$columns)

[1] 2

Polars includes a very useful chaining method in data manipulation operations. From this point of view, Polars is more like dplyr and data.table. This is how the chaining method is defined in the official documentation:

In polars our method chaining syntax takes the form object$m1()$m2(), where object is our data object, and m1() and m2() are appropriate methods, like subsetting or aggregation expressions.

Let’s see an example with the iris dataset:

as_polars_df(iris)$
  group_by(
  "Species")$
  median()

shape: (3, 5)
┌────────────┬──────────────┬─────────────┬──────────────┬─────────────┐
│ Species    ┆ Sepal.Length ┆ Sepal.Width ┆ Petal.Length ┆ Petal.Width │
│ ---        ┆ ---          ┆ ---         ┆ ---          ┆ ---         │
│ cat        ┆ f64          ┆ f64         ┆ f64          ┆ f64         │
╞════════════╪══════════════╪═════════════╪══════════════╪═════════════╡
│ virginica  ┆ 6.5          ┆ 3.0         ┆ 5.55         ┆ 2.0         │
│ setosa     ┆ 5.0          ┆ 3.4         ┆ 1.5          ┆ 0.2         │
│ versicolor ┆ 5.9          ┆ 2.8         ┆ 4.35         ┆ 1.3         │
└────────────┴──────────────┴─────────────┴──────────────┴─────────────┘

aggregate(. ~ Species, iris, median)

     Species Sepal.Length Sepal.Width Petal.Length Petal.Width
1     setosa          5.0         3.4         1.50         0.2
2 versicolor          5.9         2.8         4.35         1.3
3  virginica          6.5         3.0         5.55         2.0

iris |>
  group_by(Species) |>
  summarise(across(everything(),median))

# A tibble: 3 × 5
  Species    Sepal.Length Sepal.Width Petal.Length Petal.Width
  <fct>             <dbl>       <dbl>        <dbl>       <dbl>
1 setosa              5           3.4         1.5          0.2
2 versicolor          5.9         2.8         4.35         1.3
3 virginica           6.5         3           5.55         2

iris_dt[, lapply(.SD, median), by = Species]

      Species Sepal.Length Sepal.Width Petal.Length Petal.Width
       <fctr>        <num>       <num>        <num>       <num>
1:     setosa          5.0         3.4         1.50         0.2
2: versicolor          5.9         2.8         4.35         1.3
3:  virginica          6.5         3.0         5.55         2.0

Tip

In the Polars code used above, you will notice that we have introduced line breaks. We could have written the whole code on the same line but for the sake of readability I prefer to separate the methods used by line breaks.

2.2 Conversion between Series/DataFrames and vector/data.frames

2.2.1 From vector/data.frames to Series/DataFrames

These conversions have already been seen earlier in this book.

# To convert vector to Polars Series
myvector <- as_polars_series(c("a","b","c"))
# To convert data.frames to DataFrames
iris_polars <- as_polars_df(iris)

2.2.2 From Series/DataFrames to vector/data.frames

Here, we can use to_r() and to_data_frame() methods.

# To convert Polars Series to a vector
myvector$to_r()

[1] "a" "b" "c"

# To convert Polars Series to a list
myvector$to_list()

[[1]]
[1] "a"

[[2]]
[1] "b"

[[3]]
[1] "c"

# To convert DataFrames to data.frames
iris_polars$to_data_frame()

    Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
1            5.1         3.5          1.4         0.2     setosa
2            4.9         3.0          1.4         0.2     setosa
3            4.7         3.2          1.3         0.2     setosa
4            4.6         3.1          1.5         0.2     setosa
5            5.0         3.6          1.4         0.2     setosa
6            5.4         3.9          1.7         0.4     setosa
7            4.6         3.4          1.4         0.3     setosa
8            5.0         3.4          1.5         0.2     setosa
9            4.4         2.9          1.4         0.2     setosa
10           4.9         3.1          1.5         0.1     setosa
11           5.4         3.7          1.5         0.2     setosa
12           4.8         3.4          1.6         0.2     setosa
13           4.8         3.0          1.4         0.1     setosa
14           4.3         3.0          1.1         0.1     setosa
15           5.8         4.0          1.2         0.2     setosa
16           5.7         4.4          1.5         0.4     setosa
17           5.4         3.9          1.3         0.4     setosa
18           5.1         3.5          1.4         0.3     setosa
19           5.7         3.8          1.7         0.3     setosa
20           5.1         3.8          1.5         0.3     setosa
21           5.4         3.4          1.7         0.2     setosa
22           5.1         3.7          1.5         0.4     setosa
23           4.6         3.6          1.0         0.2     setosa
24           5.1         3.3          1.7         0.5     setosa
25           4.8         3.4          1.9         0.2     setosa
26           5.0         3.0          1.6         0.2     setosa
27           5.0         3.4          1.6         0.4     setosa
28           5.2         3.5          1.5         0.2     setosa
29           5.2         3.4          1.4         0.2     setosa
30           4.7         3.2          1.6         0.2     setosa
31           4.8         3.1          1.6         0.2     setosa
32           5.4         3.4          1.5         0.4     setosa
33           5.2         4.1          1.5         0.1     setosa
34           5.5         4.2          1.4         0.2     setosa
35           4.9         3.1          1.5         0.2     setosa
36           5.0         3.2          1.2         0.2     setosa
37           5.5         3.5          1.3         0.2     setosa
38           4.9         3.6          1.4         0.1     setosa
39           4.4         3.0          1.3         0.2     setosa
40           5.1         3.4          1.5         0.2     setosa
41           5.0         3.5          1.3         0.3     setosa
42           4.5         2.3          1.3         0.3     setosa
43           4.4         3.2          1.3         0.2     setosa
44           5.0         3.5          1.6         0.6     setosa
45           5.1         3.8          1.9         0.4     setosa
46           4.8         3.0          1.4         0.3     setosa
47           5.1         3.8          1.6         0.2     setosa
48           4.6         3.2          1.4         0.2     setosa
49           5.3         3.7          1.5         0.2     setosa
50           5.0         3.3          1.4         0.2     setosa
51           7.0         3.2          4.7         1.4 versicolor
52           6.4         3.2          4.5         1.5 versicolor
53           6.9         3.1          4.9         1.5 versicolor
54           5.5         2.3          4.0         1.3 versicolor
55           6.5         2.8          4.6         1.5 versicolor
56           5.7         2.8          4.5         1.3 versicolor
57           6.3         3.3          4.7         1.6 versicolor
58           4.9         2.4          3.3         1.0 versicolor
59           6.6         2.9          4.6         1.3 versicolor
60           5.2         2.7          3.9         1.4 versicolor
61           5.0         2.0          3.5         1.0 versicolor
62           5.9         3.0          4.2         1.5 versicolor
63           6.0         2.2          4.0         1.0 versicolor
64           6.1         2.9          4.7         1.4 versicolor
65           5.6         2.9          3.6         1.3 versicolor
66           6.7         3.1          4.4         1.4 versicolor
67           5.6         3.0          4.5         1.5 versicolor
68           5.8         2.7          4.1         1.0 versicolor
69           6.2         2.2          4.5         1.5 versicolor
70           5.6         2.5          3.9         1.1 versicolor
71           5.9         3.2          4.8         1.8 versicolor
72           6.1         2.8          4.0         1.3 versicolor
73           6.3         2.5          4.9         1.5 versicolor
74           6.1         2.8          4.7         1.2 versicolor
75           6.4         2.9          4.3         1.3 versicolor
76           6.6         3.0          4.4         1.4 versicolor
77           6.8         2.8          4.8         1.4 versicolor
78           6.7         3.0          5.0         1.7 versicolor
79           6.0         2.9          4.5         1.5 versicolor
80           5.7         2.6          3.5         1.0 versicolor
81           5.5         2.4          3.8         1.1 versicolor
82           5.5         2.4          3.7         1.0 versicolor
83           5.8         2.7          3.9         1.2 versicolor
84           6.0         2.7          5.1         1.6 versicolor
85           5.4         3.0          4.5         1.5 versicolor
86           6.0         3.4          4.5         1.6 versicolor
87           6.7         3.1          4.7         1.5 versicolor
88           6.3         2.3          4.4         1.3 versicolor
89           5.6         3.0          4.1         1.3 versicolor
90           5.5         2.5          4.0         1.3 versicolor
91           5.5         2.6          4.4         1.2 versicolor
92           6.1         3.0          4.6         1.4 versicolor
93           5.8         2.6          4.0         1.2 versicolor
94           5.0         2.3          3.3         1.0 versicolor
95           5.6         2.7          4.2         1.3 versicolor
96           5.7         3.0          4.2         1.2 versicolor
97           5.7         2.9          4.2         1.3 versicolor
98           6.2         2.9          4.3         1.3 versicolor
99           5.1         2.5          3.0         1.1 versicolor
100          5.7         2.8          4.1         1.3 versicolor
101          6.3         3.3          6.0         2.5  virginica
102          5.8         2.7          5.1         1.9  virginica
103          7.1         3.0          5.9         2.1  virginica
104          6.3         2.9          5.6         1.8  virginica
105          6.5         3.0          5.8         2.2  virginica
106          7.6         3.0          6.6         2.1  virginica
107          4.9         2.5          4.5         1.7  virginica
108          7.3         2.9          6.3         1.8  virginica
109          6.7         2.5          5.8         1.8  virginica
110          7.2         3.6          6.1         2.5  virginica
111          6.5         3.2          5.1         2.0  virginica
112          6.4         2.7          5.3         1.9  virginica
113          6.8         3.0          5.5         2.1  virginica
114          5.7         2.5          5.0         2.0  virginica
115          5.8         2.8          5.1         2.4  virginica
116          6.4         3.2          5.3         2.3  virginica
117          6.5         3.0          5.5         1.8  virginica
118          7.7         3.8          6.7         2.2  virginica
119          7.7         2.6          6.9         2.3  virginica
120          6.0         2.2          5.0         1.5  virginica
121          6.9         3.2          5.7         2.3  virginica
122          5.6         2.8          4.9         2.0  virginica
123          7.7         2.8          6.7         2.0  virginica
124          6.3         2.7          4.9         1.8  virginica
125          6.7         3.3          5.7         2.1  virginica
126          7.2         3.2          6.0         1.8  virginica
127          6.2         2.8          4.8         1.8  virginica
128          6.1         3.0          4.9         1.8  virginica
129          6.4         2.8          5.6         2.1  virginica
130          7.2         3.0          5.8         1.6  virginica
131          7.4         2.8          6.1         1.9  virginica
132          7.9         3.8          6.4         2.0  virginica
133          6.4         2.8          5.6         2.2  virginica
134          6.3         2.8          5.1         1.5  virginica
135          6.1         2.6          5.6         1.4  virginica
136          7.7         3.0          6.1         2.3  virginica
137          6.3         3.4          5.6         2.4  virginica
138          6.4         3.1          5.5         1.8  virginica
139          6.0         3.0          4.8         1.8  virginica
140          6.9         3.1          5.4         2.1  virginica
141          6.7         3.1          5.6         2.4  virginica
142          6.9         3.1          5.1         2.3  virginica
143          5.8         2.7          5.1         1.9  virginica
144          6.8         3.2          5.9         2.3  virginica
145          6.7         3.3          5.7         2.5  virginica
146          6.7         3.0          5.2         2.3  virginica
147          6.3         2.5          5.0         1.9  virginica
148          6.5         3.0          5.2         2.0  virginica
149          6.2         3.4          5.4         2.3  virginica
150          5.9         3.0          5.1         1.8  virginica

2.3 Initial informations on a DataFrame

Here is a list of instructions that I frequently use to quickly get information about a DataFrame.

2.3.1 Get the schema

A DataFrame has a schema attribute which is a named list of DataTypes.

mydf$schema

$col1
DataType: Int32

$col2
DataType: String

# This works also on LazyFrame
mydf$lazy()$schema

$col1
DataType: Int32

$col2
DataType: String

2.3.2 Get the dimensions

A DataFrame has a shape attribute which is a two length numeric vector of c(nrows,ncols).

mydf$shape

[1] 3 2

2.3.3 Get columns types

A DataFrame has a dtypes attribute which is a list of dtypes (for data type) for every column of the DataFrame.

Alternatively, the dtype_strings() method can be used to get columns types in a character/string vector.

# With dtypes attribute (with a "s" and without parentheses)
mydf$dtypes

[[1]]
DataType: Int32

[[2]]
DataType: String

# This works also on LazyFrame
mydf$lazy()$dtypes

[[1]]
DataType: Int32

[[2]]
DataType: String

# With dtype_strings() method (wihout a "s" and with parentheses)
mydf$dtype_strings()

[1] "i32" "str"

2.3.4 Get a glimpse

You can access a dense preview of a DataFrame by using the glimpse() method. f The formatting is done one line per column, so wide DataFrame show nicely. Each line will show the column name, the dtypes attributes and the first few values.

as_polars_df(iris)$glimpse()

& Sepal.Length <f64> 5.1, 4.9, 4.7, 4.6, 5, 5.4, 4.6, 5, 4.4, 4.9
& Sepal.Width  <f64> 3.5, 3, 3.2, 3.1, 3.6, 3.9, 3.4, 3.4, 2.9, 3.1
& Petal.Length <f64> 1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4, 1.5
& Petal.Width  <f64> 0.2, 0.2, 0.2, 0.2, 0.2, 0.4, 0.3, 0.2, 0.2, 0.1
& Species      <cat> setosa, setosa, setosa, setosa, setosa, setosa, setosa, setosa, setosa, setosa

2.3.5 Get summary statistics

You can access some summary statistics from a DataFrame by using the describe() method.

as_polars_df(iris)$describe()

shape: (9, 6)
┌────────────┬──────────────────┬──────────────────┬─────────────────┬─────────────────┬───────────┐
│ statistic  ┆ Sepal.Length     ┆ Sepal.Width      ┆ Petal.Length    ┆ Petal.Width     ┆ Species   │
│ ---        ┆ ---              ┆ ---              ┆ ---             ┆ ---             ┆ ---       │
│ str        ┆ str              ┆ str              ┆ str             ┆ str             ┆ str       │
╞════════════╪══════════════════╪══════════════════╪═════════════════╪═════════════════╪═══════════╡
│ count      ┆ 150              ┆ 150              ┆ 150             ┆ 150             ┆ 150       │
│ null_count ┆ 0                ┆ 0                ┆ 0               ┆ 0               ┆ 0         │
│ mean       ┆ 5.84333333333333 ┆ 3.05733333333333 ┆ 3.7579999999999 ┆ 1.1993333333333 ┆ null      │
│            ┆ 4                ┆ 32               ┆ 996             ┆ 336             ┆           │
│ std        ┆ 0.82806612797786 ┆ 0.43586628493669 ┆ 1.7652982332594 ┆ 0.7622376689603 ┆ null      │
│            ┆ 28               ┆ 82               ┆ 667             ┆ 465             ┆           │
│ min        ┆ 4.3              ┆ 2.0              ┆ 1.0             ┆ 0.1             ┆ setosa    │
│ 25%        ┆ 5.1              ┆ 2.8              ┆ 1.6             ┆ 0.3             ┆ null      │
│ 50%        ┆ 5.8              ┆ 3.0              ┆ 4.4             ┆ 1.3             ┆ null      │
│ 75%        ┆ 6.4              ┆ 3.3              ┆ 5.1             ┆ 1.8             ┆ null      │
│ max        ┆ 7.9              ┆ 4.4              ┆ 6.9             ┆ 2.5             ┆ virginica │
└────────────┴──────────────────┴──────────────────┴─────────────────┴─────────────────┴───────────┘

2.3.6 Get the rows number

A DataFrame has a width attribute which returns the width of the DataFrame.

# With width attribute
mydf$width

[1] 2

2.3.7 Get the columns number

A DataFrame has a height attribute which returns the width of the DataFrame.

# With height attribute
mydf$height

[1] 3

2.3.8 Get the columns names

A DataFrame has a columns attribute which returns the columns names of the DataFrame in a character vector.

mydf$columns

[1] "col1" "col2"

# This works also on LazyFrame
mydf$lazy()$columns

[1] "col1" "col2"

2.3.9 Rename columns

A character vector can be passed to the columns attribute to rename the columns names of the DataFrame.

mydf$columns <- c("colA","colB")
mydf$columns

[1] "colA" "colB"

2.3.10 Get the size

The estimated_size() method can be used to get an estimation of the total allocated size (in Bytes) of a DataFrame.

mydf$estimated_size()

[1] 15

2.3.11 Get the first n rows

The head() method can be used to get the first n rows of a DataFrame.

# To get the 2 first rows
mydf$head(2)

shape: (2, 2)
┌──────┬──────┐
│ colA ┆ colB │
│ ---  ┆ ---  │
│ i32  ┆ str  │
╞══════╪══════╡
│ 1    ┆ a    │
│ 2    ┆ b    │
└──────┴──────┘

2.3.12 Count values in a DataFrame

The value_counts() method can be used to count values in a Series of a DataFrame. value_counts() works with a Series. It must therefore be supplied either with square brackets or with the select() method. See here to learn about it.

# 1st option with square brackets
as_polars_df(iris)[,c("Species")]$value_counts()

shape: (3, 2)
┌────────────┬───────┐
│ Species    ┆ count │
│ ---        ┆ ---   │
│ cat        ┆ u32   │
╞════════════╪═══════╡
│ setosa     ┆ 50    │
│ versicolor ┆ 50    │
│ virginica  ┆ 50    │
└────────────┴───────┘

# 2nd option with select() method
as_polars_df(iris)$select(pl$col("Species"))$to_series()$value_counts()

shape: (3, 2)
┌────────────┬───────┐
│ Species    ┆ count │
│ ---        ┆ ---   │
│ cat        ┆ u32   │
╞════════════╪═══════╡
│ setosa     ┆ 50    │
│ versicolor ┆ 50    │
│ virginica  ┆ 50    │
└────────────┴───────┘

table(iris$Species)


    setosa versicolor  virginica 
        50         50         50

iris |>
  count(Species)

     Species  n
1     setosa 50
2 versicolor 50
3  virginica 50

iris_dt[, .N, by = Species]

      Species     N
       <fctr> <int>
1:     setosa    50
2: versicolor    50
3:  virginica    50

2.3.13 Count NA over columns in a DataFrame

The null_count() method can be used to count NA values of a DataFrame.

mydfNA <- pl$DataFrame(
  colA = as_polars_series(c("a",NA,"c")),
  colB = as_polars_series(c("d",NA,NA)))
mydfNA$null_count()

shape: (1, 2)
┌──────┬──────┐
│ colA ┆ colB │
│ ---  ┆ ---  │
│ u32  ┆ u32  │
╞══════╪══════╡
│ 1    ┆ 2    │
└──────┴──────┘

mydfNA <- data.frame(
  colA = c("a",NA,"c"),
  colB = c("d",NA,NA))
sapply(mydfNA, function(x) sum(is.na(x)))

colA colB 
   1    2

mydfNA |>
  summarise(across(everything(), ~sum(is.na(.))))

  colA colB
1    1    2

mydfNA_dt <- as.data.table(mydfNA)
mydfNA_dt[, lapply(.SD, function(x) sum(is.na(x))), .SDcols = names(mydfNA_dt)]

    colA  colB
   <int> <int>
1:     1     2

2.4 Filter rows

The first option to filter rows of a DataFrame is to use square brackets [] indexing (with integer row number).

data(iris)
# The first four lines
as_polars_df(iris)[1:4,]

shape: (4, 5)
┌──────────────┬─────────────┬──────────────┬─────────────┬─────────┐
│ Sepal.Length ┆ Sepal.Width ┆ Petal.Length ┆ Petal.Width ┆ Species │
│ ---          ┆ ---         ┆ ---          ┆ ---         ┆ ---     │
│ f64          ┆ f64         ┆ f64          ┆ f64         ┆ cat     │
╞══════════════╪═════════════╪══════════════╪═════════════╪═════════╡
│ 5.1          ┆ 3.5         ┆ 1.4          ┆ 0.2         ┆ setosa  │
│ 4.9          ┆ 3.0         ┆ 1.4          ┆ 0.2         ┆ setosa  │
│ 4.7          ┆ 3.2         ┆ 1.3          ┆ 0.2         ┆ setosa  │
│ 4.6          ┆ 3.1         ┆ 1.5          ┆ 0.2         ┆ setosa  │
└──────────────┴─────────────┴──────────────┴─────────────┴─────────┘

# The lines 1, 3 and 5
as_polars_df(iris)[c(1,3,5),]

shape: (3, 5)
┌──────────────┬─────────────┬──────────────┬─────────────┬─────────┐
│ Sepal.Length ┆ Sepal.Width ┆ Petal.Length ┆ Petal.Width ┆ Species │
│ ---          ┆ ---         ┆ ---          ┆ ---         ┆ ---     │
│ f64          ┆ f64         ┆ f64          ┆ f64         ┆ cat     │
╞══════════════╪═════════════╪══════════════╪═════════════╪═════════╡
│ 5.1          ┆ 3.5         ┆ 1.4          ┆ 0.2         ┆ setosa  │
│ 4.7          ┆ 3.2         ┆ 1.3          ┆ 0.2         ┆ setosa  │
│ 5.0          ┆ 3.6         ┆ 1.4          ┆ 0.2         ┆ setosa  │
└──────────────┴─────────────┴──────────────┴─────────────┴─────────┘

Important

It’s convenient when you have to quickly inspect your data. But you’ll quickly be limited by the square brackets, as they don’t accept conditions with the expressions. For example as_polars_df(iris)[Petal.Length > 6] doesn’t work.

The second and best option is to use the filter() method. It must be used with the Polars expression, here the col() method which allows to designate the columns on which the filter condition will be applied.

Let’s see in details what’s inside a filter() method with an example:

pl$col("Petal.Length"): this expression selects the Petal.Length column from iris;
>6: applies a Boolean condition to this expression (for all Petals that have a length > 6).

In the example below, we will use & operator to apply multiple conditions in filter() method:

as_polars_df(iris)$filter(
  pl$col("Petal.Length") > 6 & pl$col("Petal.Width") < 2)

shape: (2, 5)
┌──────────────┬─────────────┬──────────────┬─────────────┬───────────┐
│ Sepal.Length ┆ Sepal.Width ┆ Petal.Length ┆ Petal.Width ┆ Species   │
│ ---          ┆ ---         ┆ ---          ┆ ---         ┆ ---       │
│ f64          ┆ f64         ┆ f64          ┆ f64         ┆ cat       │
╞══════════════╪═════════════╪══════════════╪═════════════╪═══════════╡
│ 7.3          ┆ 2.9         ┆ 6.3          ┆ 1.8         ┆ virginica │
│ 7.4          ┆ 2.8         ┆ 6.1          ┆ 1.9         ┆ virginica │
└──────────────┴─────────────┴──────────────┴─────────────┴───────────┘

iris[iris$Petal.Length > 6 & iris$Petal.Width < 2,] # here don't forget the comma

    Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
108          7.3         2.9          6.3         1.8 virginica
131          7.4         2.8          6.1         1.9 virginica

iris |>
  filter(Petal.Length > 6 & Petal.Width < 2)

  Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
1          7.3         2.9          6.3         1.8 virginica
2          7.4         2.8          6.1         1.9 virginica

iris_dt[Petal.Length > 6 & Petal.Width < 2]

   Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
          <num>       <num>        <num>       <num>    <fctr>
1:          7.3         2.9          6.3         1.8 virginica
2:          7.4         2.8          6.1         1.9 virginica

Tip

The equivalent of %in% R operator is the is_in() method. It should be used in association with the lit() method.

pl$DataFrame(
  colA = as_polars_series(c("a","b","c"))
  )$filter(
  pl$col("colA")$is_in(pl$lit(c("a","b")))
  )

shape: (2, 1)
┌──────┐
│ colA │
│ ---  │
│ str  │
╞══════╡
│ a    │
│ b    │
└──────┘

Another reason for using the filter() method is that filter expressions can be optimised in lazy mode by the query optimiser. Square brackets [] can only be used in eager mode.

Tip

There is another way to speed up filter processing on rows: tell polars that the column(s) used to filter rows are already sorted! To do this, you can use the set_sorted() method.
Here’s an example:

mydf <- pl$DataFrame(
  col1 = as_polars_series(sort(runif(10000000)))
)

microbenchmark(
  "Without telling col1 is sorted" = mydf$filter(pl$col("col1") < 100),
  "Telling col1 is sorted" = mydf$with_columns(pl$col("col1")$set_sorted())$filter(pl$col("col1") < 100)
  )

Unit: milliseconds
                           expr      min       lq     mean   median       uq
 Without telling col1 is sorted 6.520649 6.654393 6.792447 6.731507 6.829590
         Telling col1 is sorted 4.047393 4.150235 4.358348 4.219845 4.340194
       max neval
 10.735644   100
  8.765798   100

2.5 Select columns

2.5.1 Selecting by name

The first option for selecting columns of a DataFrame is to use square brackets [].
The second option is to use the select() method. In this case, it must be used with the col() method which allows to designate the columns to be selected with a character vector.

# 1st option : with square brackets syntax
as_polars_df(iris)[1:3,c("Petal.Length","Petal.Width")]

shape: (3, 2)
┌──────────────┬─────────────┐
│ Petal.Length ┆ Petal.Width │
│ ---          ┆ ---         │
│ f64          ┆ f64         │
╞══════════════╪═════════════╡
│ 1.4          ┆ 0.2         │
│ 1.4          ┆ 0.2         │
│ 1.3          ┆ 0.2         │
└──────────────┴─────────────┘

# 2nd option : with select() method
as_polars_df(iris)$select(
  pl$col(c("Petal.Length","Petal.Width"))
)$head(3) # display the first 3 lines

shape: (3, 2)
┌──────────────┬─────────────┐
│ Petal.Length ┆ Petal.Width │
│ ---          ┆ ---         │
│ f64          ┆ f64         │
╞══════════════╪═════════════╡
│ 1.4          ┆ 0.2         │
│ 1.4          ┆ 0.2         │
│ 1.3          ┆ 0.2         │
└──────────────┴─────────────┘

iris[1:3,c("Petal.Length","Petal.Width")]

  Petal.Length Petal.Width
1          1.4         0.2
2          1.4         0.2
3          1.3         0.2

iris |>
  select(Petal.Length,Petal.Width) |>
  head(3)

  Petal.Length Petal.Width
1          1.4         0.2
2          1.4         0.2
3          1.3         0.2

iris_dt[1:3,.(Petal.Length,Petal.Width)]

   Petal.Length Petal.Width
          <num>       <num>
1:          1.4         0.2
2:          1.4         0.2
3:          1.3         0.2

Keep in mind that when:
- Square brackets are used:
- Selecting only one column from a DataFrame, the output is a Series;
- Whereas if there is more than one column selected, the output is a DataFrame.
- select() method is used:
- The output is always a DataFrame rather than a Series even if one column is selected.
=> If you need a Series you can use the to_series() method. See here.

inherits(as_polars_df(iris)[,"Petal.Length"],"Series")

[1] FALSE

inherits(as_polars_df(iris)[,"Petal.Length"],"DataFrame")

[1] FALSE

inherits(as_polars_df(iris)[,c("Petal.Length","Petal.Width")],"DataFrame")

[1] FALSE

inherits(as_polars_df(iris)[,c("Petal.Length","Petal.Width")],"Series")

[1] FALSE

With Polars if you want to obtain a result in an R data.frame, you can simply add the method to_data_frame() at the end of the method chaining. See here for examples.

Tip

Beyond the minor differences discussed above, there are two major reasons why you should use the select() method over the syntax with square brackets:
- When you select and transform multiple columns with select() method, Polars will run these selections in parallel;
- Expressions used in select() method can be optimised in lazy mode by the query optimizer.

Finally, the select() method can also be used to re-order columns of a DataFrame.
For example, to re-order the columns in alphabetical order:

as_polars_df(iris)$select(
  pl$col(sort(as_polars_df(iris)$columns)))$
  head(3)

shape: (3, 5)
┌──────────────┬─────────────┬──────────────┬─────────────┬─────────┐
│ Petal.Length ┆ Petal.Width ┆ Sepal.Length ┆ Sepal.Width ┆ Species │
│ ---          ┆ ---         ┆ ---          ┆ ---         ┆ ---     │
│ f64          ┆ f64         ┆ f64          ┆ f64         ┆ cat     │
╞══════════════╪═════════════╪══════════════╪═════════════╪═════════╡
│ 1.4          ┆ 0.2         ┆ 5.1          ┆ 3.5         ┆ setosa  │
│ 1.4          ┆ 0.2         ┆ 4.9          ┆ 3.0         ┆ setosa  │
│ 1.3          ┆ 0.2         ┆ 4.7          ┆ 3.2         ┆ setosa  │
└──────────────┴─────────────┴──────────────┴─────────────┴─────────┘

2.5.2 Selecting by data type

To select columns by data type from a DataFrame, you can pass a data type to pl$col expression. For example:

# Select only Float64 columns
as_polars_df(iris)$select(pl$col(pl$Float64))$head(3)

shape: (3, 4)
┌──────────────┬─────────────┬──────────────┬─────────────┐
│ Sepal.Length ┆ Sepal.Width ┆ Petal.Length ┆ Petal.Width │
│ ---          ┆ ---         ┆ ---          ┆ ---         │
│ f64          ┆ f64         ┆ f64          ┆ f64         │
╞══════════════╪═════════════╪══════════════╪═════════════╡
│ 5.1          ┆ 3.5         ┆ 1.4          ┆ 0.2         │
│ 4.9          ┆ 3.0         ┆ 1.4          ┆ 0.2         │
│ 4.7          ┆ 3.2         ┆ 1.3          ┆ 0.2         │
└──────────────┴─────────────┴──────────────┴─────────────┘

# Select only numeric columns
as_polars_df(iris)$select(pl$col(pl$numeric_dtypes))$head(3)

shape: (3, 4)
┌──────────────┬─────────────┬──────────────┬─────────────┐
│ Sepal.Length ┆ Sepal.Width ┆ Petal.Length ┆ Petal.Width │
│ ---          ┆ ---         ┆ ---          ┆ ---         │
│ f64          ┆ f64         ┆ f64          ┆ f64         │
╞══════════════╪═════════════╪══════════════╪═════════════╡
│ 5.1          ┆ 3.5         ┆ 1.4          ┆ 0.2         │
│ 4.9          ┆ 3.0         ┆ 1.4          ┆ 0.2         │
│ 4.7          ┆ 3.2         ┆ 1.3          ┆ 0.2         │
└──────────────┴─────────────┴──────────────┴─────────────┘

# Select only factor columns
as.data.frame(iris[1:3, sapply(iris, is.factor)])

  iris[1:3, sapply(iris, is.factor)]
1                             setosa
2                             setosa
3                             setosa

# Select only numeric columns
iris[1:3, sapply(iris, is.numeric)]

  Sepal.Length Sepal.Width Petal.Length Petal.Width
1          5.1         3.5          1.4         0.2
2          4.9         3.0          1.4         0.2
3          4.7         3.2          1.3         0.2

# Select only factor columns
iris %>% 
  select_if(is.factor)

       Species
1       setosa
2       setosa
3       setosa
4       setosa
5       setosa
6       setosa
7       setosa
8       setosa
9       setosa
10      setosa
11      setosa
12      setosa
13      setosa
14      setosa
15      setosa
16      setosa
17      setosa
18      setosa
19      setosa
20      setosa
21      setosa
22      setosa
23      setosa
24      setosa
25      setosa
26      setosa
27      setosa
28      setosa
29      setosa
30      setosa
31      setosa
32      setosa
33      setosa
34      setosa
35      setosa
36      setosa
37      setosa
38      setosa
39      setosa
40      setosa
41      setosa
42      setosa
43      setosa
44      setosa
45      setosa
46      setosa
47      setosa
48      setosa
49      setosa
50      setosa
51  versicolor
52  versicolor
53  versicolor
54  versicolor
55  versicolor
56  versicolor
57  versicolor
58  versicolor
59  versicolor
60  versicolor
61  versicolor
62  versicolor
63  versicolor
64  versicolor
65  versicolor
66  versicolor
67  versicolor
68  versicolor
69  versicolor
70  versicolor
71  versicolor
72  versicolor
73  versicolor
74  versicolor
75  versicolor
76  versicolor
77  versicolor
78  versicolor
79  versicolor
80  versicolor
81  versicolor
82  versicolor
83  versicolor
84  versicolor
85  versicolor
86  versicolor
87  versicolor
88  versicolor
89  versicolor
90  versicolor
91  versicolor
92  versicolor
93  versicolor
94  versicolor
95  versicolor
96  versicolor
97  versicolor
98  versicolor
99  versicolor
100 versicolor
101  virginica
102  virginica
103  virginica
104  virginica
105  virginica
106  virginica
107  virginica
108  virginica
109  virginica
110  virginica
111  virginica
112  virginica
113  virginica
114  virginica
115  virginica
116  virginica
117  virginica
118  virginica
119  virginica
120  virginica
121  virginica
122  virginica
123  virginica
124  virginica
125  virginica
126  virginica
127  virginica
128  virginica
129  virginica
130  virginica
131  virginica
132  virginica
133  virginica
134  virginica
135  virginica
136  virginica
137  virginica
138  virginica
139  virginica
140  virginica
141  virginica
142  virginica
143  virginica
144  virginica
145  virginica
146  virginica
147  virginica
148  virginica
149  virginica
150  virginica

# Select only numeric columns
iris %>% 
  select_if(is.numeric)

    Sepal.Length Sepal.Width Petal.Length Petal.Width
1            5.1         3.5          1.4         0.2
2            4.9         3.0          1.4         0.2
3            4.7         3.2          1.3         0.2
4            4.6         3.1          1.5         0.2
5            5.0         3.6          1.4         0.2
6            5.4         3.9          1.7         0.4
7            4.6         3.4          1.4         0.3
8            5.0         3.4          1.5         0.2
9            4.4         2.9          1.4         0.2
10           4.9         3.1          1.5         0.1
11           5.4         3.7          1.5         0.2
12           4.8         3.4          1.6         0.2
13           4.8         3.0          1.4         0.1
14           4.3         3.0          1.1         0.1
15           5.8         4.0          1.2         0.2
16           5.7         4.4          1.5         0.4
17           5.4         3.9          1.3         0.4
18           5.1         3.5          1.4         0.3
19           5.7         3.8          1.7         0.3
20           5.1         3.8          1.5         0.3
21           5.4         3.4          1.7         0.2
22           5.1         3.7          1.5         0.4
23           4.6         3.6          1.0         0.2
24           5.1         3.3          1.7         0.5
25           4.8         3.4          1.9         0.2
26           5.0         3.0          1.6         0.2
27           5.0         3.4          1.6         0.4
28           5.2         3.5          1.5         0.2
29           5.2         3.4          1.4         0.2
30           4.7         3.2          1.6         0.2
31           4.8         3.1          1.6         0.2
32           5.4         3.4          1.5         0.4
33           5.2         4.1          1.5         0.1
34           5.5         4.2          1.4         0.2
35           4.9         3.1          1.5         0.2
36           5.0         3.2          1.2         0.2
37           5.5         3.5          1.3         0.2
38           4.9         3.6          1.4         0.1
39           4.4         3.0          1.3         0.2
40           5.1         3.4          1.5         0.2
41           5.0         3.5          1.3         0.3
42           4.5         2.3          1.3         0.3
43           4.4         3.2          1.3         0.2
44           5.0         3.5          1.6         0.6
45           5.1         3.8          1.9         0.4
46           4.8         3.0          1.4         0.3
47           5.1         3.8          1.6         0.2
48           4.6         3.2          1.4         0.2
49           5.3         3.7          1.5         0.2
50           5.0         3.3          1.4         0.2
51           7.0         3.2          4.7         1.4
52           6.4         3.2          4.5         1.5
53           6.9         3.1          4.9         1.5
54           5.5         2.3          4.0         1.3
55           6.5         2.8          4.6         1.5
56           5.7         2.8          4.5         1.3
57           6.3         3.3          4.7         1.6
58           4.9         2.4          3.3         1.0
59           6.6         2.9          4.6         1.3
60           5.2         2.7          3.9         1.4
61           5.0         2.0          3.5         1.0
62           5.9         3.0          4.2         1.5
63           6.0         2.2          4.0         1.0
64           6.1         2.9          4.7         1.4
65           5.6         2.9          3.6         1.3
66           6.7         3.1          4.4         1.4
67           5.6         3.0          4.5         1.5
68           5.8         2.7          4.1         1.0
69           6.2         2.2          4.5         1.5
70           5.6         2.5          3.9         1.1
71           5.9         3.2          4.8         1.8
72           6.1         2.8          4.0         1.3
73           6.3         2.5          4.9         1.5
74           6.1         2.8          4.7         1.2
75           6.4         2.9          4.3         1.3
76           6.6         3.0          4.4         1.4
77           6.8         2.8          4.8         1.4
78           6.7         3.0          5.0         1.7
79           6.0         2.9          4.5         1.5
80           5.7         2.6          3.5         1.0
81           5.5         2.4          3.8         1.1
82           5.5         2.4          3.7         1.0
83           5.8         2.7          3.9         1.2
84           6.0         2.7          5.1         1.6
85           5.4         3.0          4.5         1.5
86           6.0         3.4          4.5         1.6
87           6.7         3.1          4.7         1.5
88           6.3         2.3          4.4         1.3
89           5.6         3.0          4.1         1.3
90           5.5         2.5          4.0         1.3
91           5.5         2.6          4.4         1.2
92           6.1         3.0          4.6         1.4
93           5.8         2.6          4.0         1.2
94           5.0         2.3          3.3         1.0
95           5.6         2.7          4.2         1.3
96           5.7         3.0          4.2         1.2
97           5.7         2.9          4.2         1.3
98           6.2         2.9          4.3         1.3
99           5.1         2.5          3.0         1.1
100          5.7         2.8          4.1         1.3
101          6.3         3.3          6.0         2.5
102          5.8         2.7          5.1         1.9
103          7.1         3.0          5.9         2.1
104          6.3         2.9          5.6         1.8
105          6.5         3.0          5.8         2.2
106          7.6         3.0          6.6         2.1
107          4.9         2.5          4.5         1.7
108          7.3         2.9          6.3         1.8
109          6.7         2.5          5.8         1.8
110          7.2         3.6          6.1         2.5
111          6.5         3.2          5.1         2.0
112          6.4         2.7          5.3         1.9
113          6.8         3.0          5.5         2.1
114          5.7         2.5          5.0         2.0
115          5.8         2.8          5.1         2.4
116          6.4         3.2          5.3         2.3
117          6.5         3.0          5.5         1.8
118          7.7         3.8          6.7         2.2
119          7.7         2.6          6.9         2.3
120          6.0         2.2          5.0         1.5
121          6.9         3.2          5.7         2.3
122          5.6         2.8          4.9         2.0
123          7.7         2.8          6.7         2.0
124          6.3         2.7          4.9         1.8
125          6.7         3.3          5.7         2.1
126          7.2         3.2          6.0         1.8
127          6.2         2.8          4.8         1.8
128          6.1         3.0          4.9         1.8
129          6.4         2.8          5.6         2.1
130          7.2         3.0          5.8         1.6
131          7.4         2.8          6.1         1.9
132          7.9         3.8          6.4         2.0
133          6.4         2.8          5.6         2.2
134          6.3         2.8          5.1         1.5
135          6.1         2.6          5.6         1.4
136          7.7         3.0          6.1         2.3
137          6.3         3.4          5.6         2.4
138          6.4         3.1          5.5         1.8
139          6.0         3.0          4.8         1.8
140          6.9         3.1          5.4         2.1
141          6.7         3.1          5.6         2.4
142          6.9         3.1          5.1         2.3
143          5.8         2.7          5.1         1.9
144          6.8         3.2          5.9         2.3
145          6.7         3.3          5.7         2.5
146          6.7         3.0          5.2         2.3
147          6.3         2.5          5.0         1.9
148          6.5         3.0          5.2         2.0
149          6.2         3.4          5.4         2.3
150          5.9         3.0          5.1         1.8

# Select only factor columns
iris_dt[, .SD, .SDcols = is.factor]

       Species
        <fctr>
  1:    setosa
  2:    setosa
  3:    setosa
  4:    setosa
  5:    setosa
 ---          
146: virginica
147: virginica
148: virginica
149: virginica
150: virginica

# Select only numeric columns
iris_dt[, .SD, .SDcols = is.numeric]

     Sepal.Length Sepal.Width Petal.Length Petal.Width
            <num>       <num>        <num>       <num>
  1:          5.1         3.5          1.4         0.2
  2:          4.9         3.0          1.4         0.2
  3:          4.7         3.2          1.3         0.2
  4:          4.6         3.1          1.5         0.2
  5:          5.0         3.6          1.4         0.2
 ---                                                  
146:          6.7         3.0          5.2         2.3
147:          6.3         2.5          5.0         1.9
148:          6.5         3.0          5.2         2.0
149:          6.2         3.4          5.4         2.3
150:          5.9         3.0          5.1         1.8

Note

You can also select columns from a DataFrame using a name pattern. See an example by removing a column

2.5.3 Selecting with a regex

To select columns with a regex from a DataFrame, you can pass it in pl$col expression. For example, to select all columns that starts with “Sepal” in iris dataset:

as_polars_df(iris)$select(
  pl$col("^Sepal.*$")
)$head(3) # display the first 3 lines

shape: (3, 2)
┌──────────────┬─────────────┐
│ Sepal.Length ┆ Sepal.Width │
│ ---          ┆ ---         │
│ f64          ┆ f64         │
╞══════════════╪═════════════╡
│ 5.1          ┆ 3.5         │
│ 4.9          ┆ 3.0         │
│ 4.7          ┆ 3.2         │
└──────────────┴─────────────┘

iris[, grep("^Sepal.*$", colnames(iris))]

    Sepal.Length Sepal.Width
1            5.1         3.5
2            4.9         3.0
3            4.7         3.2
4            4.6         3.1
5            5.0         3.6
6            5.4         3.9
7            4.6         3.4
8            5.0         3.4
9            4.4         2.9
10           4.9         3.1
11           5.4         3.7
12           4.8         3.4
13           4.8         3.0
14           4.3         3.0
15           5.8         4.0
16           5.7         4.4
17           5.4         3.9
18           5.1         3.5
19           5.7         3.8
20           5.1         3.8
21           5.4         3.4
22           5.1         3.7
23           4.6         3.6
24           5.1         3.3
25           4.8         3.4
26           5.0         3.0
27           5.0         3.4
28           5.2         3.5
29           5.2         3.4
30           4.7         3.2
31           4.8         3.1
32           5.4         3.4
33           5.2         4.1
34           5.5         4.2
35           4.9         3.1
36           5.0         3.2
37           5.5         3.5
38           4.9         3.6
39           4.4         3.0
40           5.1         3.4
41           5.0         3.5
42           4.5         2.3
43           4.4         3.2
44           5.0         3.5
45           5.1         3.8
46           4.8         3.0
47           5.1         3.8
48           4.6         3.2
49           5.3         3.7
50           5.0         3.3
51           7.0         3.2
52           6.4         3.2
53           6.9         3.1
54           5.5         2.3
55           6.5         2.8
56           5.7         2.8
57           6.3         3.3
58           4.9         2.4
59           6.6         2.9
60           5.2         2.7
61           5.0         2.0
62           5.9         3.0
63           6.0         2.2
64           6.1         2.9
65           5.6         2.9
66           6.7         3.1
67           5.6         3.0
68           5.8         2.7
69           6.2         2.2
70           5.6         2.5
71           5.9         3.2
72           6.1         2.8
73           6.3         2.5
74           6.1         2.8
75           6.4         2.9
76           6.6         3.0
77           6.8         2.8
78           6.7         3.0
79           6.0         2.9
80           5.7         2.6
81           5.5         2.4
82           5.5         2.4
83           5.8         2.7
84           6.0         2.7
85           5.4         3.0
86           6.0         3.4
87           6.7         3.1
88           6.3         2.3
89           5.6         3.0
90           5.5         2.5
91           5.5         2.6
92           6.1         3.0
93           5.8         2.6
94           5.0         2.3
95           5.6         2.7
96           5.7         3.0
97           5.7         2.9
98           6.2         2.9
99           5.1         2.5
100          5.7         2.8
101          6.3         3.3
102          5.8         2.7
103          7.1         3.0
104          6.3         2.9
105          6.5         3.0
106          7.6         3.0
107          4.9         2.5
108          7.3         2.9
109          6.7         2.5
110          7.2         3.6
111          6.5         3.2
112          6.4         2.7
113          6.8         3.0
114          5.7         2.5
115          5.8         2.8
116          6.4         3.2
117          6.5         3.0
118          7.7         3.8
119          7.7         2.6
120          6.0         2.2
121          6.9         3.2
122          5.6         2.8
123          7.7         2.8
124          6.3         2.7
125          6.7         3.3
126          7.2         3.2
127          6.2         2.8
128          6.1         3.0
129          6.4         2.8
130          7.2         3.0
131          7.4         2.8
132          7.9         3.8
133          6.4         2.8
134          6.3         2.8
135          6.1         2.6
136          7.7         3.0
137          6.3         3.4
138          6.4         3.1
139          6.0         3.0
140          6.9         3.1
141          6.7         3.1
142          6.9         3.1
143          5.8         2.7
144          6.8         3.2
145          6.7         3.3
146          6.7         3.0
147          6.3         2.5
148          6.5         3.0
149          6.2         3.4
150          5.9         3.0

iris %>%
  select(starts_with("Sepal"))

    Sepal.Length Sepal.Width
1            5.1         3.5
2            4.9         3.0
3            4.7         3.2
4            4.6         3.1
5            5.0         3.6
6            5.4         3.9
7            4.6         3.4
8            5.0         3.4
9            4.4         2.9
10           4.9         3.1
11           5.4         3.7
12           4.8         3.4
13           4.8         3.0
14           4.3         3.0
15           5.8         4.0
16           5.7         4.4
17           5.4         3.9
18           5.1         3.5
19           5.7         3.8
20           5.1         3.8
21           5.4         3.4
22           5.1         3.7
23           4.6         3.6
24           5.1         3.3
25           4.8         3.4
26           5.0         3.0
27           5.0         3.4
28           5.2         3.5
29           5.2         3.4
30           4.7         3.2
31           4.8         3.1
32           5.4         3.4
33           5.2         4.1
34           5.5         4.2
35           4.9         3.1
36           5.0         3.2
37           5.5         3.5
38           4.9         3.6
39           4.4         3.0
40           5.1         3.4
41           5.0         3.5
42           4.5         2.3
43           4.4         3.2
44           5.0         3.5
45           5.1         3.8
46           4.8         3.0
47           5.1         3.8
48           4.6         3.2
49           5.3         3.7
50           5.0         3.3
51           7.0         3.2
52           6.4         3.2
53           6.9         3.1
54           5.5         2.3
55           6.5         2.8
56           5.7         2.8
57           6.3         3.3
58           4.9         2.4
59           6.6         2.9
60           5.2         2.7
61           5.0         2.0
62           5.9         3.0
63           6.0         2.2
64           6.1         2.9
65           5.6         2.9
66           6.7         3.1
67           5.6         3.0
68           5.8         2.7
69           6.2         2.2
70           5.6         2.5
71           5.9         3.2
72           6.1         2.8
73           6.3         2.5
74           6.1         2.8
75           6.4         2.9
76           6.6         3.0
77           6.8         2.8
78           6.7         3.0
79           6.0         2.9
80           5.7         2.6
81           5.5         2.4
82           5.5         2.4
83           5.8         2.7
84           6.0         2.7
85           5.4         3.0
86           6.0         3.4
87           6.7         3.1
88           6.3         2.3
89           5.6         3.0
90           5.5         2.5
91           5.5         2.6
92           6.1         3.0
93           5.8         2.6
94           5.0         2.3
95           5.6         2.7
96           5.7         3.0
97           5.7         2.9
98           6.2         2.9
99           5.1         2.5
100          5.7         2.8
101          6.3         3.3
102          5.8         2.7
103          7.1         3.0
104          6.3         2.9
105          6.5         3.0
106          7.6         3.0
107          4.9         2.5
108          7.3         2.9
109          6.7         2.5
110          7.2         3.6
111          6.5         3.2
112          6.4         2.7
113          6.8         3.0
114          5.7         2.5
115          5.8         2.8
116          6.4         3.2
117          6.5         3.0
118          7.7         3.8
119          7.7         2.6
120          6.0         2.2
121          6.9         3.2
122          5.6         2.8
123          7.7         2.8
124          6.3         2.7
125          6.7         3.3
126          7.2         3.2
127          6.2         2.8
128          6.1         3.0
129          6.4         2.8
130          7.2         3.0
131          7.4         2.8
132          7.9         3.8
133          6.4         2.8
134          6.3         2.8
135          6.1         2.6
136          7.7         3.0
137          6.3         3.4
138          6.4         3.1
139          6.0         3.0
140          6.9         3.1
141          6.7         3.1
142          6.9         3.1
143          5.8         2.7
144          6.8         3.2
145          6.7         3.3
146          6.7         3.0
147          6.3         2.5
148          6.5         3.0
149          6.2         3.4
150          5.9         3.0

iris_dt[, .SD, .SDcols = grep("^Sepal", colnames(iris_dt))]

     Sepal.Length Sepal.Width
            <num>       <num>
  1:          5.1         3.5
  2:          4.9         3.0
  3:          4.7         3.2
  4:          4.6         3.1
  5:          5.0         3.6
 ---                         
146:          6.7         3.0
147:          6.3         2.5
148:          6.5         3.0
149:          6.2         3.4
150:          5.9         3.0

2.5.4 Selecting using lists

It is also possible to select columns from a DataFrame using R lists which can be very practical depending on the case. Here’s an example:

data(iris)

l_expr = list(
  pl$col("Petal.Length"),
  pl$col("Species")
)

# Select only categorical columns
as_polars_df(iris)$select(l_expr)$head(3)

shape: (3, 2)
┌──────────────┬─────────┐
│ Petal.Length ┆ Species │
│ ---          ┆ ---     │
│ f64          ┆ cat     │
╞══════════════╪═════════╡
│ 1.4          ┆ setosa  │
│ 1.4          ┆ setosa  │
│ 1.3          ┆ setosa  │
└──────────────┴─────────┘

This way of selecting columns also works with simple R lists:

data(iris)

l_expr = list(
  "Petal.Length",
  "Species",
  "Petal.Width"
)

# Select only categorical columns
as_polars_df(iris)$select(l_expr)$head(3)

shape: (3, 3)
┌──────────────┬─────────┬─────────────┐
│ Petal.Length ┆ Species ┆ Petal.Width │
│ ---          ┆ ---     ┆ ---         │
│ f64          ┆ cat     ┆ f64         │
╞══════════════╪═════════╪═════════════╡
│ 1.4          ┆ setosa  ┆ 0.2         │
│ 1.4          ┆ setosa  ┆ 0.2         │
│ 1.3          ┆ setosa  ┆ 0.2         │
└──────────────┴─────────┴─────────────┘

2.5.5 Selecting - other ways

The selectors API of polars enables to use other methods to select columns.
Here’s a few examples selecting the first and last column of a DataFrame respectively with first() and last() methods:

# Select the first column
as_polars_df(iris)$select(
  pl$first()
)$head(3)

shape: (3, 1)
┌──────────────┐
│ Sepal.Length │
│ ---          │
│ f64          │
╞══════════════╡
│ 5.1          │
│ 4.9          │
│ 4.7          │
└──────────────┘

# Select the last column
as_polars_df(iris)$select(
  pl$last()
)$head(3)

shape: (3, 1)
┌─────────┐
│ Species │
│ ---     │
│ cat     │
╞═════════╡
│ setosa  │
│ setosa  │
│ setosa  │
└─────────┘

2.6 Modify/Add columns

2.6.1 Modify existing column

Similar to the dplyr package, the select() method can also be used to modify existing data. However, the result will exclude any columns that were not specified in the expression.

For example, if we want to get in the data.frame iris the Petal.Length column rounded without decimals.

as_polars_df(iris)$select(
  pl$col("Petal.Length")$round(decimals = 0)
)$head(3) # display the first 3 lines

shape: (3, 1)
┌──────────────┐
│ Petal.Length │
│ ---          │
│ f64          │
╞══════════════╡
│ 1.0          │
│ 1.0          │
│ 1.0          │
└──────────────┘

The problem here is that we would like to keep all the iris columns and not just Petal.Length.

Again, let’s look at the official documentation:

To modify or add some columns—whilst preserving all others in the dataset—it is therefore better to use the with_columns() method.

Note that the with_columns() method only accepts expressions. Let’s use it in an example:

as_polars_df(iris)$with_columns(
  pl$col("Petal.Length")$round(decimals = 0)
)$head(3) # display the first 3 lines

shape: (3, 5)
┌──────────────┬─────────────┬──────────────┬─────────────┬─────────┐
│ Sepal.Length ┆ Sepal.Width ┆ Petal.Length ┆ Petal.Width ┆ Species │
│ ---          ┆ ---         ┆ ---          ┆ ---         ┆ ---     │
│ f64          ┆ f64         ┆ f64          ┆ f64         ┆ cat     │
╞══════════════╪═════════════╪══════════════╪═════════════╪═════════╡
│ 5.1          ┆ 3.5         ┆ 1.0          ┆ 0.2         ┆ setosa  │
│ 4.9          ┆ 3.0         ┆ 1.0          ┆ 0.2         ┆ setosa  │
│ 4.7          ┆ 3.2         ┆ 1.0          ┆ 0.2         ┆ setosa  │
└──────────────┴─────────────┴──────────────┴─────────────┴─────────┘

iris$Petal.Length <- round(iris$Petal.Length, digits = 0)
iris[1:3,]

  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5            1         0.2  setosa
2          4.9         3.0            1         0.2  setosa
3          4.7         3.2            1         0.2  setosa

data(iris)
iris |>
  mutate(Petal.Length = round(Petal.Length,0)) |>
  head(3)

  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5            1         0.2  setosa
2          4.9         3.0            1         0.2  setosa
3          4.7         3.2            1         0.2  setosa

iris_dt[,Petal.Length := round(Petal.Length, digits = 0)]
iris_dt[1:3,]

   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
          <num>       <num>        <num>       <num>  <fctr>
1:          5.1         3.5            1         0.2  setosa
2:          4.9         3.0            1         0.2  setosa
3:          4.7         3.2            1         0.2  setosa

2.6.2 Add a new column

If you want to add a column to a data.frame, you use the same syntax as above with with_columns(). Simply use the alias() method to specify the name of the newly created column.

as_polars_df(iris)$with_columns(
  pl$col("Petal.Length")$round(decimals = 0)$alias("Petal.Length.rounded")
)$head(3) # display the first 3 lines

shape: (3, 6)
┌──────────────┬─────────────┬──────────────┬─────────────┬─────────┬──────────────────────┐
│ Sepal.Length ┆ Sepal.Width ┆ Petal.Length ┆ Petal.Width ┆ Species ┆ Petal.Length.rounded │
│ ---          ┆ ---         ┆ ---          ┆ ---         ┆ ---     ┆ ---                  │
│ f64          ┆ f64         ┆ f64          ┆ f64         ┆ cat     ┆ f64                  │
╞══════════════╪═════════════╪══════════════╪═════════════╪═════════╪══════════════════════╡
│ 5.1          ┆ 3.5         ┆ 1.4          ┆ 0.2         ┆ setosa  ┆ 1.0                  │
│ 4.9          ┆ 3.0         ┆ 1.4          ┆ 0.2         ┆ setosa  ┆ 1.0                  │
│ 4.7          ┆ 3.2         ┆ 1.3          ┆ 0.2         ┆ setosa  ┆ 1.0                  │
└──────────────┴─────────────┴──────────────┴─────────────┴─────────┴──────────────────────┘

2.6.3 Add a new column with a constant value

If you need to create a new column with a constant value (i.e. the same value for all the rows in your DataFrame), you can use the literal lit() method. It works with the main types of Polars.

as_polars_df(iris)$with_columns(
  pl$lit("toto")$alias("mynewcolumn")
)$head(3) # display the first 3 lines

shape: (3, 6)
┌──────────────┬─────────────┬──────────────┬─────────────┬─────────┬─────────────┐
│ Sepal.Length ┆ Sepal.Width ┆ Petal.Length ┆ Petal.Width ┆ Species ┆ mynewcolumn │
│ ---          ┆ ---         ┆ ---          ┆ ---         ┆ ---     ┆ ---         │
│ f64          ┆ f64         ┆ f64          ┆ f64         ┆ cat     ┆ str         │
╞══════════════╪═════════════╪══════════════╪═════════════╪═════════╪═════════════╡
│ 5.1          ┆ 3.5         ┆ 1.4          ┆ 0.2         ┆ setosa  ┆ toto        │
│ 4.9          ┆ 3.0         ┆ 1.4          ┆ 0.2         ┆ setosa  ┆ toto        │
│ 4.7          ┆ 3.2         ┆ 1.3          ┆ 0.2         ┆ setosa  ┆ toto        │
└──────────────┴─────────────┴──────────────┴─────────────┴─────────┴─────────────┘

2.6.4 Add a new column based on conditions

To add new columns based on conditions, the when-then-otherwise expression must be used.

Here’s the syntax:

pl$when(**Boolean Expression**)$
  then(**Value if True**)$
  otherwise(**Value if False)$
  alias(**New Column Name**)

Here’s an example with equivalent syntax to help you understand:

as_polars_df(iris)$with_columns(
  pl$when(pl$col("Petal.Length") <= 2)$then(pl$lit("<=2"))$
    when(pl$col("Petal.Length") <= 5)$then(pl$lit("<=5"))$
    otherwise(pl$lit(">5"))$
    alias("mygroups")
# we only need to display 2 variables to check that it's OK
)$select(
  pl$col(c("Petal.Length","mygroups"))
)[c(1,2,59,150),]

shape: (4, 2)
┌──────────────┬──────────┐
│ Petal.Length ┆ mygroups │
│ ---          ┆ ---      │
│ f64          ┆ str      │
╞══════════════╪══════════╡
│ 1.4          ┆ <=2      │
│ 1.4          ┆ <=2      │
│ 4.6          ┆ <=5      │
│ 5.1          ┆ >5       │
└──────────────┴──────────┘

iris$mygroups <- ifelse(iris$Petal.Length <= 2, "<=2",
                        ifelse(iris$Petal.Length <= 5, "<=5", ">5"))
# we only need to display 2 variables to check that it's OK
iris[c(1,2,59,150), c("Petal.Length", "mygroups")]

    Petal.Length mygroups
1            1.4      <=2
2            1.4      <=2
59           4.6      <=5
150          5.1       >5

iris |>
  mutate(
    mygroups = case_when(
      Petal.Length <=2 ~ "<=2",
      Petal.Length <=5 ~ "<=5",
      .default = ">5")
  ) |>
  # we only need to display 2 variables  to check that it's OK
  select(Petal.Length,mygroups) |>
  slice(1,2,59,150)

  Petal.Length mygroups
1          1.4      <=2
2          1.4      <=2
3          4.6      <=5
4          5.1       >5

iris_dt[, mygroups := case_when(
  Petal.Length <= 2 ~ "<=2",
  Petal.Length <= 5 ~ "<=5",
  TRUE ~ ">5"
)]
# we only need to display 2 variables to check that it's OK
iris_dt[c(1,2,59,150), .(Petal.Length, mygroups)]

   Petal.Length mygroups
          <num>   <char>
1:            1      <=2
2:            1      <=2
3:            5      <=5
4:            5      <=5

2.6.5 Add a new column by group

To add new columns by group, the over expression must be used. This expression is similar to performing a groupby aggregation and joining the result back into the original dataframe. Here’s an example with equivalent syntax to help you understand:

df <- data.frame(
  name = c("X","X","Y","Y","Z","Z"),
  adress = c("A","B","C","D","E","F"),
  col2 = c(2L,4L,1L,3L,4L,2L),
  col3 = c(5L,19L,17L,12L,11L,15L)
)
df

  name adress col2 col3
1    X      A    2    5
2    X      B    4   19
3    Y      C    1   17
4    Y      D    3   12
5    Z      E    4   11
6    Z      F    2   15

as_polars_df(df)$with_columns(
  pl$col("col3")$max()$over("name")$name$suffix("_max")
)

shape: (6, 5)
┌──────┬────────┬──────┬──────┬──────────┐
│ name ┆ adress ┆ col2 ┆ col3 ┆ col3_max │
│ ---  ┆ ---    ┆ ---  ┆ ---  ┆ ---      │
│ str  ┆ str    ┆ i32  ┆ i32  ┆ i32      │
╞══════╪════════╪══════╪══════╪══════════╡
│ X    ┆ A      ┆ 2    ┆ 5    ┆ 19       │
│ X    ┆ B      ┆ 4    ┆ 19   ┆ 19       │
│ Y    ┆ C      ┆ 1    ┆ 17   ┆ 17       │
│ Y    ┆ D      ┆ 3    ┆ 12   ┆ 17       │
│ Z    ┆ E      ┆ 4    ┆ 11   ┆ 15       │
│ Z    ┆ F      ┆ 2    ┆ 15   ┆ 15       │
└──────┴────────┴──────┴──────┴──────────┘

result <- aggregate(col3 ~ name, data = df, FUN = max)
colnames(result) <- c("name", "col3_max")
merge(df, result, by = "name", all.x = TRUE)

  name adress col2 col3 col3_max
1    X      A    2    5       19
2    X      B    4   19       19
3    Y      C    1   17       17
4    Y      D    3   12       17
5    Z      E    4   11       15
6    Z      F    2   15       15

df |>
  group_by(name) |>
  mutate(col3_max = max(col3))

# A tibble: 6 × 5
# Groups:   name [3]
  name  adress  col2  col3 col3_max
  <chr> <chr>  <int> <int>    <int>
1 X     A          2     5       19
2 X     B          4    19       19
3 Y     C          1    17       17
4 Y     D          3    12       17
5 Z     E          4    11       15
6 Z     F          2    15       15

dt <- as.data.table(df)
dt[, col3_max := max(col3), by = name]
dt

     name adress  col2  col3 col3_max
   <char> <char> <int> <int>    <int>
1:      X      A     2     5       19
2:      X      B     4    19       19
3:      Y      C     1    17       17
4:      Y      D     3    12       17
5:      Z      E     4    11       15
6:      Z      F     2    15       15

Tip

If you need to pass multiple column names in the over expression, you can either list them like this over("name","adress") or - more conveniently - use a character vector over(c("name","adress")).

2.7 Rename columns

Similar to the dplyr package, the rename() method can also be used to rename existing column.
The renaming logic is NOT identical to that of dplyr, and is performed as follows: old_name="new_name".

Note

Note the double quotes "" surrounding the name of the old variable to be renamed which does not exist with dplyr (see examples below).

data(iris)
as_polars_df(iris)$
  rename(
    Sepal.Length = "sepal_length", 
    Sepal.Width = "sepal_width",
    Petal.Length = "length of petal",
    Petal.Width = "width of petal",
    Species = "species"
  )$columns

[1] "sepal_length"    "sepal_width"     "length of petal" "width of petal" 
[5] "species"

data(iris)
names(iris) <- c("sepal_length","sepal_width","length of petal","width of petal","species")
names(iris)

[1] "sepal_length"    "sepal_width"     "length of petal" "width of petal" 
[5] "species"

data(iris)
iris |>
  rename(
    sepal_length = Sepal.Length, 
    sepal_width = Sepal.Width,
    `length of petal` = Petal.Length,
    `width of petal` = Petal.Width,
    species = Species
  ) |>
  names()

[1] "sepal_length"    "sepal_width"     "length of petal" "width of petal" 
[5] "species"

setnames(iris_dt, 
         old = c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width", "Species"),
         new = c("sepal_length", "sepal_width", "length_of_petal", "width_of_petal", "species"))
names(iris_dt)

[1] "sepal_length"    "sepal_width"     "length_of_petal" "width_of_petal" 
[5] "species"         "mygroups"

2.8 Remove columns

2.8.1 Removing by name

To remove columns from a DataFrame, the drop method must be used.
Here’s an example with equivalent syntax:

as_polars_df(iris)$
  drop(c("Petal.Width","Sepal.Length","Sepal.Width"))$
  head(3)

shape: (3, 2)
┌──────────────┬─────────┐
│ Petal.Length ┆ Species │
│ ---          ┆ ---     │
│ f64          ┆ cat     │
╞══════════════╪═════════╡
│ 1.4          ┆ setosa  │
│ 1.4          ┆ setosa  │
│ 1.3          ┆ setosa  │
└──────────────┴─────────┘

iris[1:3,!(names(iris) %in% c("Petal.Width","Sepal.Length","Sepal.Width"))]

  Petal.Length Species
1          1.4  setosa
2          1.4  setosa
3          1.3  setosa

iris |>
  select(-c(Petal.Width,Sepal.Length,Sepal.Width)) |>
  head(3)

  Petal.Length Species
1          1.4  setosa
2          1.4  setosa
3          1.3  setosa

iris_dt[, c("Petal.Width","Sepal.Length","Sepal.Width") := NULL][1:3,]

   Petal.Length Species
          <num>  <fctr>
1:          1.4  setosa
2:          1.4  setosa
3:          1.3  setosa

2.8.2 Removing with a regex

To remove columns with a regex from a DataFrame, the drop expression must be used.
Let’s see an example where you want to drop columns whose names starts with “Petal” in iris.

num_to_drop <- !grepl("^Petal",as_polars_df(iris)$columns)
col_to_drop <- as_polars_df(iris)$columns[num_to_drop]

as_polars_df(iris)$
  drop(col_to_drop)$
  head(3)

shape: (3, 2)
┌──────────────┬─────────────┐
│ Petal.Length ┆ Petal.Width │
│ ---          ┆ ---         │
│ f64          ┆ f64         │
╞══════════════╪═════════════╡
│ 1.4          ┆ 0.2         │
│ 1.4          ┆ 0.2         │
│ 1.3          ┆ 0.2         │
└──────────────┴─────────────┘

iris[1:3,!grepl("^Petal",names(iris))]

  Sepal.Length Sepal.Width Species
1          5.1         3.5  setosa
2          4.9         3.0  setosa
3          4.7         3.2  setosa

iris |>
  select(-starts_with("Petal")) |>
  head(3)

  Sepal.Length Sepal.Width Species
1          5.1         3.5  setosa
2          4.9         3.0  setosa
3          4.7         3.2  setosa

iris_dt[, which(grepl("^Petal$", names(iris_dt))):=NULL][1:3,]

   Petal.Length Species
          <num>  <fctr>
1:          1.4  setosa
2:          1.4  setosa
3:          1.3  setosa

Note

All the ways we’ve seen in the section about selecting columns from a DataFrame (by name, data type and with a list) also work with drop() method!

2.9 Aggregation by group

Another frequently used data manipulation is the aggregation of data by group. To do this, we indicate in the group_by() method which column will be used to group the data.frame. And the agg() method which specifies the expression to aggregate.

The methods available for the agg() method are (in each group):

first() get the first element
last() get the last element
n_unique() get the number of unique elements
count() get the number of elements
sum() sum the elements
min() get the smallest element
max() get the largest element
mean() get the average of elements
median() get the median
quantile() calculate quantiles

Here’s a minimal example with sum applied to 2 different columns:

as_polars_df(iris)$
  group_by("Species")$
  agg(pl$col(c("Petal.Length","Petal.Width"))$sum())

shape: (3, 3)
┌────────────┬──────────────┬─────────────┐
│ Species    ┆ Petal.Length ┆ Petal.Width │
│ ---        ┆ ---          ┆ ---         │
│ cat        ┆ f64          ┆ f64         │
╞════════════╪══════════════╪═════════════╡
│ setosa     ┆ 73.1         ┆ 12.3        │
│ versicolor ┆ 213.0        ┆ 66.3        │
│ virginica  ┆ 277.6        ┆ 101.3       │
└────────────┴──────────────┴─────────────┘

aggregate(cbind(Petal.Length, Petal.Width) ~ Species, data = iris, FUN = sum)

     Species Petal.Length Petal.Width
1     setosa         73.1        12.3
2 versicolor        213.0        66.3
3  virginica        277.6       101.3

data(iris)
iris |>
  group_by(Species) |>
  summarise(across(c(Petal.Length, Petal.Width), sum))

# A tibble: 3 × 3
  Species    Petal.Length Petal.Width
  <fct>             <dbl>       <dbl>
1 setosa             73.1        12.3
2 versicolor        213          66.3
3 virginica         278.        101.

iris_dt <- as.data.table(iris)
iris_dt[, .(Petal.Length = sum(Petal.Length), Petal.Width = sum(Petal.Width)), by = Species]

      Species Petal.Length Petal.Width
       <fctr>        <num>       <num>
1:     setosa         73.1        12.3
2: versicolor        213.0        66.3
3:  virginica        277.6       101.3

Tip

Be careful! Calling multiple aggregations on the same column produces columns of the same name which generates an error with R.

You can use the alias() or suffix() method to ensure column names are unique. For example:

as_polars_df(iris)$
  group_by("Species")$
  agg(
    # With alias()
    pl$col(c("Petal.Length"))$sum()$alias("Petal.Length_Sum"),
    pl$col(c("Petal.Length"))$mean()$alias("Petal.Length_Mean"),
    # With suffix()
    pl$col(c("Petal.Width"))$sum()$name$suffix("_Sum"),
    pl$col(c("Petal.Width"))$mean()$name$suffix("_Mean")
  )

shape: (3, 5)
┌────────────┬──────────────────┬───────────────────┬─────────────────┬──────────────────┐
│ Species    ┆ Petal.Length_Sum ┆ Petal.Length_Mean ┆ Petal.Width_Sum ┆ Petal.Width_Mean │
│ ---        ┆ ---              ┆ ---               ┆ ---             ┆ ---              │
│ cat        ┆ f64              ┆ f64               ┆ f64             ┆ f64              │
╞════════════╪══════════════════╪═══════════════════╪═════════════════╪══════════════════╡
│ virginica  ┆ 277.6            ┆ 5.552             ┆ 101.3           ┆ 2.026            │
│ versicolor ┆ 213.0            ┆ 4.26              ┆ 66.3            ┆ 1.326            │
│ setosa     ┆ 73.1             ┆ 1.462             ┆ 12.3            ┆ 0.246            │
└────────────┴──────────────────┴───────────────────┴─────────────────┴──────────────────┘

You can perform more advanced aggregations, for example calculate the sum of all the floating point columns in iris:

as_polars_df(iris)$
  group_by("Species")$
  agg(pl$col(pl$Float64)$sum())

shape: (3, 5)
┌────────────┬──────────────┬─────────────┬──────────────┬─────────────┐
│ Species    ┆ Sepal.Length ┆ Sepal.Width ┆ Petal.Length ┆ Petal.Width │
│ ---        ┆ ---          ┆ ---         ┆ ---          ┆ ---         │
│ cat        ┆ f64          ┆ f64         ┆ f64          ┆ f64         │
╞════════════╪══════════════╪═════════════╪══════════════╪═════════════╡
│ virginica  ┆ 329.4        ┆ 148.7       ┆ 277.6        ┆ 101.3       │
│ versicolor ┆ 296.8        ┆ 138.5       ┆ 213.0        ┆ 66.3        │
│ setosa     ┆ 250.3        ┆ 171.4       ┆ 73.1         ┆ 12.3        │
└────────────┴──────────────┴─────────────┴──────────────┴─────────────┘

iris_sum_by_species_base <- aggregate(iris[, sapply(iris, is.numeric)], by = list(Species = iris$Species), FUN = sum)
iris_sum_by_species_base

     Species Sepal.Length Sepal.Width Petal.Length Petal.Width
1     setosa        250.3       171.4         73.1        12.3
2 versicolor        296.8       138.5        213.0        66.3
3  virginica        329.4       148.7        277.6       101.3

iris |>
  group_by(Species) %>%
  summarise(across(where(is.numeric), sum))

# A tibble: 3 × 5
  Species    Sepal.Length Sepal.Width Petal.Length Petal.Width
  <fct>             <dbl>       <dbl>        <dbl>       <dbl>
1 setosa             250.        171.         73.1        12.3
2 versicolor         297.        138.        213          66.3
3 virginica          329.        149.        278.        101.

iris_dt <- as.data.table(iris)
iris_sum_by_species_dt <- iris_dt[, lapply(.SD, sum), by = Species, .SDcols = sapply(iris_dt, is.numeric)]
iris_sum_by_species_dt

      Species Sepal.Length Sepal.Width Petal.Length Petal.Width
       <fctr>        <num>       <num>        <num>       <num>
1:     setosa        250.3       171.4         73.1        12.3
2: versicolor        296.8       138.5        213.0        66.3
3:  virginica        329.4       148.7        277.6       101.3

2.10 Sort a DataFrame

2.10.1 Simply sort a DataFrame

The sort() method with a character vector can be used to sort a DataFrame.

# Sort by one column
as_polars_df(iris)$
  sort("Species")$
  head(3)

shape: (3, 5)
┌──────────────┬─────────────┬──────────────┬─────────────┬─────────┐
│ Sepal.Length ┆ Sepal.Width ┆ Petal.Length ┆ Petal.Width ┆ Species │
│ ---          ┆ ---         ┆ ---          ┆ ---         ┆ ---     │
│ f64          ┆ f64         ┆ f64          ┆ f64         ┆ cat     │
╞══════════════╪═════════════╪══════════════╪═════════════╪═════════╡
│ 5.1          ┆ 3.5         ┆ 1.4          ┆ 0.2         ┆ setosa  │
│ 4.9          ┆ 3.0         ┆ 1.4          ┆ 0.2         ┆ setosa  │
│ 4.7          ┆ 3.2         ┆ 1.3          ┆ 0.2         ┆ setosa  │
└──────────────┴─────────────┴──────────────┴─────────────┴─────────┘

# Sort by two columns
as_polars_df(iris)$
  sort(c("Species","Petal.Length"))$
  head(3)

shape: (3, 5)
┌──────────────┬─────────────┬──────────────┬─────────────┬─────────┐
│ Sepal.Length ┆ Sepal.Width ┆ Petal.Length ┆ Petal.Width ┆ Species │
│ ---          ┆ ---         ┆ ---          ┆ ---         ┆ ---     │
│ f64          ┆ f64         ┆ f64          ┆ f64         ┆ cat     │
╞══════════════╪═════════════╪══════════════╪═════════════╪═════════╡
│ 4.6          ┆ 3.6         ┆ 1.0          ┆ 0.2         ┆ setosa  │
│ 4.3          ┆ 3.0         ┆ 1.1          ┆ 0.1         ┆ setosa  │
│ 5.0          ┆ 3.2         ┆ 1.2          ┆ 0.2         ┆ setosa  │
└──────────────┴─────────────┴──────────────┴─────────────┴─────────┘

# Sort by two columns one in a decreasing manner and the other in an increasing manner
as_polars_df(iris)$
  sort(c("Species","Petal.Length"), descending = c(TRUE,FALSE))$
  head(3)

shape: (3, 5)
┌──────────────┬─────────────┬──────────────┬─────────────┬───────────┐
│ Sepal.Length ┆ Sepal.Width ┆ Petal.Length ┆ Petal.Width ┆ Species   │
│ ---          ┆ ---         ┆ ---          ┆ ---         ┆ ---       │
│ f64          ┆ f64         ┆ f64          ┆ f64         ┆ cat       │
╞══════════════╪═════════════╪══════════════╪═════════════╪═══════════╡
│ 4.9          ┆ 2.5         ┆ 4.5          ┆ 1.7         ┆ virginica │
│ 6.0          ┆ 3.0         ┆ 4.8          ┆ 1.8         ┆ virginica │
│ 6.2          ┆ 2.8         ┆ 4.8          ┆ 1.8         ┆ virginica │
└──────────────┴─────────────┴──────────────┴─────────────┴───────────┘

# Sort by one column
iris[order(iris$Species),][1:3,]

  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa

# Sort by two columns
iris[order(iris$Species,iris$Petal.Length),][1:3,]

   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
23          4.6         3.6          1.0         0.2  setosa
14          4.3         3.0          1.1         0.1  setosa
15          5.8         4.0          1.2         0.2  setosa

# Sort by two columns one in a decreasing manner and the other in an increasing manner
iris[order(rev(iris$Species),iris$Petal.Length),][1:3,]

    Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
107          4.9         2.5          4.5         1.7 virginica
127          6.2         2.8          4.8         1.8 virginica
139          6.0         3.0          4.8         1.8 virginica

# Sort by one column
iris |>
  arrange(Species) |>
  head(3)

  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa

# Sort by two columns
iris |>
  arrange(Species, Petal.Length) |>
  head(3)

  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          4.6         3.6          1.0         0.2  setosa
2          4.3         3.0          1.1         0.1  setosa
3          5.8         4.0          1.2         0.2  setosa

# Sort by two columns one in a decreasing manner and the other in an increasing manner
iris |>
  arrange(desc(Species), Petal.Length) |>
  head(3)

  Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
1          4.9         2.5          4.5         1.7 virginica
2          6.2         2.8          4.8         1.8 virginica
3          6.0         3.0          4.8         1.8 virginica

# Sort by one column
iris_dt[order(Species)][1:3,]

   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
          <num>       <num>        <num>       <num>  <fctr>
1:          5.1         3.5          1.4         0.2  setosa
2:          4.9         3.0          1.4         0.2  setosa
3:          4.7         3.2          1.3         0.2  setosa

# Sort by two columns
iris_dt[order(Species,Petal.Length)][1:3,]

   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
          <num>       <num>        <num>       <num>  <fctr>
1:          4.6         3.6          1.0         0.2  setosa
2:          4.3         3.0          1.1         0.1  setosa
3:          5.8         4.0          1.2         0.2  setosa

# Sort by two columns one in a decreasing manner and the other in an increasing manner
iris_dt[order(-Species,Petal.Length)][1:3,]

   Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
          <num>       <num>        <num>       <num>    <fctr>
1:          4.9         2.5          4.5         1.7 virginica
2:          6.2         2.8          4.8         1.8 virginica
3:          6.0         3.0          4.8         1.8 virginica

2.10.2 Keep unique rows

If you want to keep unique/distinct rows from a DataFrame, you can use the unique() method:

# With one column
as_polars_df(iris)$unique(
  subset = "Species"
)

shape: (3, 5)
┌──────────────┬─────────────┬──────────────┬─────────────┬────────────┐
│ Sepal.Length ┆ Sepal.Width ┆ Petal.Length ┆ Petal.Width ┆ Species    │
│ ---          ┆ ---         ┆ ---          ┆ ---         ┆ ---        │
│ f64          ┆ f64         ┆ f64          ┆ f64         ┆ cat        │
╞══════════════╪═════════════╪══════════════╪═════════════╪════════════╡
│ 5.1          ┆ 3.5         ┆ 1.4          ┆ 0.2         ┆ setosa     │
│ 6.3          ┆ 3.3         ┆ 6.0          ┆ 2.5         ┆ virginica  │
│ 7.0          ┆ 3.2         ┆ 4.7          ┆ 1.4         ┆ versicolor │
└──────────────┴─────────────┴──────────────┴─────────────┴────────────┘

# With one column keeping last entry
as_polars_df(iris)$unique(
  subset = "Species",
  keep = "last"
)

shape: (3, 5)
┌──────────────┬─────────────┬──────────────┬─────────────┬────────────┐
│ Sepal.Length ┆ Sepal.Width ┆ Petal.Length ┆ Petal.Width ┆ Species    │
│ ---          ┆ ---         ┆ ---          ┆ ---         ┆ ---        │
│ f64          ┆ f64         ┆ f64          ┆ f64         ┆ cat        │
╞══════════════╪═════════════╪══════════════╪═════════════╪════════════╡
│ 5.7          ┆ 2.8         ┆ 4.1          ┆ 1.3         ┆ versicolor │
│ 5.0          ┆ 3.3         ┆ 1.4          ┆ 0.2         ┆ setosa     │
│ 5.9          ┆ 3.0         ┆ 5.1          ┆ 1.8         ┆ virginica  │
└──────────────┴─────────────┴──────────────┴─────────────┴────────────┘

# With two colums, keeping last entry and maintaining the same order
pl$DataFrame(
  x = c(1L, 1:3, 3L),
  y = c(1L, 1:3, 3L),
  z = c(1L, 1:3, 4L)
)$unique(
  subset = c("x","y"),
  keep = "last",
  maintain_order = TRUE
)

shape: (3, 3)
┌─────┬─────┬─────┐
│ x   ┆ y   ┆ z   │
│ --- ┆ --- ┆ --- │
│ i32 ┆ i32 ┆ i32 │
╞═════╪═════╪═════╡
│ 1   ┆ 1   ┆ 1   │
│ 2   ┆ 2   ┆ 2   │
│ 3   ┆ 3   ┆ 4   │
└─────┴─────┴─────┘

# With one column
aggregate(. ~ Species, data = iris, FUN = head, N = 1)

     Species Sepal.Length.1 Sepal.Length.2 Sepal.Length.3 Sepal.Length.4
1     setosa            5.1            4.9            4.7            4.6
2 versicolor            7.0            6.4            6.9            5.5
3  virginica            6.3            5.8            7.1            6.3
  Sepal.Length.5 Sepal.Length.6 Sepal.Width.1 Sepal.Width.2 Sepal.Width.3
1            5.0            5.4           3.5           3.0           3.2
2            6.5            5.7           3.2           3.2           3.1
3            6.5            7.6           3.3           2.7           3.0
  Sepal.Width.4 Sepal.Width.5 Sepal.Width.6 Petal.Length.1 Petal.Length.2
1           3.1           3.6           3.9            1.4            1.4
2           2.3           2.8           2.8            4.7            4.5
3           2.9           3.0           3.0            6.0            5.1
  Petal.Length.3 Petal.Length.4 Petal.Length.5 Petal.Length.6 Petal.Width.1
1            1.3            1.5            1.4            1.7           0.2
2            4.9            4.0            4.6            4.5           1.4
3            5.9            5.6            5.8            6.6           2.5
  Petal.Width.2 Petal.Width.3 Petal.Width.4 Petal.Width.5 Petal.Width.6
1           0.2           0.2           0.2           0.2           0.4
2           1.5           1.5           1.3           1.5           1.3
3           1.9           2.1           1.8           2.2           2.1

# With one column keeping last entry
aggregate(. ~ Species, data = iris, FUN = tail, n = 1)

     Species Sepal.Length Sepal.Width Petal.Length Petal.Width
1     setosa          5.0         3.3          1.4         0.2
2 versicolor          5.7         2.8          4.1         1.3
3  virginica          5.9         3.0          5.1         1.8

# With two colums, keeping last entry and maintaining the same order
mytest <- data.frame(
  x = c(1L, 1:3, 3L),
  y = c(1L, 1:3, 3L),
  z = c(1L, 1:3, 4L)
) 
aggregate(. ~ x + y, data = mytest, FUN = tail, n = 1)

# With one column
iris |>
  distinct(Species, .keep_all = TRUE)

  Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
1          5.1         3.5          1.4         0.2     setosa
2          7.0         3.2          4.7         1.4 versicolor
3          6.3         3.3          6.0         2.5  virginica

# With one column keeping last entry
iris |>
  group_by(Species) |>
  slice_tail() |>
  ungroup()

# A tibble: 3 × 5
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species   
         <dbl>       <dbl>        <dbl>       <dbl> <fct>     
1          5           3.3          1.4         0.2 setosa    
2          5.7         2.8          4.1         1.3 versicolor
3          5.9         3            5.1         1.8 virginica

# With two colums, keeping last entry and maintaining the same order
data.frame(
  x = c(1L, 1:3, 3L),
  y = c(1L, 1:3, 3L),
  z = c(1L, 1:3, 4L)
) |>
  group_by(x,y) |>
  slice_tail() |>
  ungroup()

# A tibble: 3 × 3
      x     y     z
  <int> <int> <int>
1     1     1     1
2     2     2     2
3     3     3     4

# With one column
unique(iris_dt, by = "Species")

   Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
          <num>       <num>        <num>       <num>     <fctr>
1:          5.1         3.5          1.4         0.2     setosa
2:          7.0         3.2          4.7         1.4 versicolor
3:          6.3         3.3          6.0         2.5  virginica

# With one column keeping last entry
unique(iris_dt, by = "Species", fromLast = TRUE)

   Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
          <num>       <num>        <num>       <num>     <fctr>
1:          5.0         3.3          1.4         0.2     setosa
2:          5.7         2.8          4.1         1.3 versicolor
3:          5.9         3.0          5.1         1.8  virginica

# With two colums, keeping last entry and maintaining the same order
mytest_dt <- data.table(
  x = c(1L, 1:3, 3L),
  y = c(1L, 1:3, 3L),
  z = c(1L, 1:3, 4L)
) 
unique(mytest_dt, by = c("x","y"), fromLast = TRUE)

       x     y     z
   <int> <int> <int>
1:     1     1     1
2:     2     2     2
3:     3     3     4

2.10.3 Keep some columns from sorted DataFrame

If you want to keep only some columns from a sorted DataFrame, you can use the sort_by() method with the select() method.
In details, sort_by() can sort a column by the ordering of another column, or multiple other columns.
It’s the equivalent of order() method of R base.

as_polars_df(iris)$
  select(pl$col("Petal.Length")$
  sort_by("Petal.Width"))

shape: (150, 1)
┌──────────────┐
│ Petal.Length │
│ ---          │
│ f64          │
╞══════════════╡
│ 1.5          │
│ 1.4          │
│ 1.1          │
│ 1.5          │
│ 1.4          │
│ …            │
│ 5.6          │
│ 5.6          │
│ 6.0          │
│ 6.1          │
│ 5.7          │
└──────────────┘

data(iris)
iris[order(iris$Petal.Width), "Petal.Length", drop = FALSE]

    Petal.Length
10           1.5
13           1.4
14           1.1
33           1.5
38           1.4
1            1.4
2            1.4
3            1.3
4            1.5
5            1.4
8            1.5
9            1.4
11           1.5
12           1.6
15           1.2
21           1.7
23           1.0
25           1.9
26           1.6
28           1.5
29           1.4
30           1.6
31           1.6
34           1.4
35           1.5
36           1.2
37           1.3
39           1.3
40           1.5
43           1.3
47           1.6
48           1.4
49           1.5
50           1.4
7            1.4
18           1.4
19           1.7
20           1.5
41           1.3
42           1.3
46           1.4
6            1.7
16           1.5
17           1.3
22           1.5
27           1.6
32           1.5
45           1.9
24           1.7
44           1.6
58           3.3
61           3.5
63           4.0
68           4.1
80           3.5
82           3.7
94           3.3
70           3.9
81           3.8
99           3.0
74           4.7
83           3.9
91           4.4
93           4.0
96           4.2
54           4.0
56           4.5
59           4.6
65           3.6
72           4.0
75           4.3
88           4.4
89           4.1
90           4.0
95           4.2
97           4.2
98           4.3
100          4.1
51           4.7
60           3.9
64           4.7
66           4.4
76           4.4
77           4.8
92           4.6
135          5.6
52           4.5
53           4.9
55           4.6
62           4.2
67           4.5
69           4.5
73           4.9
79           4.5
85           4.5
87           4.7
120          5.0
134          5.1
57           4.7
84           5.1
86           4.5
130          5.8
78           5.0
107          4.5
71           4.8
104          5.6
108          6.3
109          5.8
117          5.5
124          4.9
126          6.0
127          4.8
128          4.9
138          5.5
139          4.8
150          5.1
102          5.1
112          5.3
131          6.1
143          5.1
147          5.0
111          5.1
114          5.0
122          4.9
123          6.7
132          6.4
148          5.2
103          5.9
106          6.6
113          5.5
125          5.7
129          5.6
140          5.4
105          5.8
118          6.7
133          5.6
116          5.3
119          6.9
121          5.7
136          6.1
142          5.1
144          5.9
146          5.2
149          5.4
115          5.1
137          5.6
141          5.6
101          6.0
110          6.1
145          5.7

data(iris)
iris |>
  arrange(Petal.Width) |>
  select(Petal.Length)

    Petal.Length
1            1.5
2            1.4
3            1.1
4            1.5
5            1.4
6            1.4
7            1.4
8            1.3
9            1.5
10           1.4
11           1.5
12           1.4
13           1.5
14           1.6
15           1.2
16           1.7
17           1.0
18           1.9
19           1.6
20           1.5
21           1.4
22           1.6
23           1.6
24           1.4
25           1.5
26           1.2
27           1.3
28           1.3
29           1.5
30           1.3
31           1.6
32           1.4
33           1.5
34           1.4
35           1.4
36           1.4
37           1.7
38           1.5
39           1.3
40           1.3
41           1.4
42           1.7
43           1.5
44           1.3
45           1.5
46           1.6
47           1.5
48           1.9
49           1.7
50           1.6
51           3.3
52           3.5
53           4.0
54           4.1
55           3.5
56           3.7
57           3.3
58           3.9
59           3.8
60           3.0
61           4.7
62           3.9
63           4.4
64           4.0
65           4.2
66           4.0
67           4.5
68           4.6
69           3.6
70           4.0
71           4.3
72           4.4
73           4.1
74           4.0
75           4.2
76           4.2
77           4.3
78           4.1
79           4.7
80           3.9
81           4.7
82           4.4
83           4.4
84           4.8
85           4.6
86           5.6
87           4.5
88           4.9
89           4.6
90           4.2
91           4.5
92           4.5
93           4.9
94           4.5
95           4.5
96           4.7
97           5.0
98           5.1
99           4.7
100          5.1
101          4.5
102          5.8
103          5.0
104          4.5
105          4.8
106          5.6
107          6.3
108          5.8
109          5.5
110          4.9
111          6.0
112          4.8
113          4.9
114          5.5
115          4.8
116          5.1
117          5.1
118          5.3
119          6.1
120          5.1
121          5.0
122          5.1
123          5.0
124          4.9
125          6.7
126          6.4
127          5.2
128          5.9
129          6.6
130          5.5
131          5.7
132          5.6
133          5.4
134          5.8
135          6.7
136          5.6
137          5.3
138          6.9
139          5.7
140          6.1
141          5.1
142          5.9
143          5.2
144          5.4
145          5.1
146          5.6
147          5.6
148          6.0
149          6.1
150          5.7

data(iris)
iris_dt <- as.data.table(iris)
iris_dt[order(Petal.Width)][,.(Petal.Length)]

     Petal.Length
            <num>
  1:          1.5
  2:          1.4
  3:          1.1
  4:          1.5
  5:          1.4
 ---             
146:          5.6
147:          5.6
148:          6.0
149:          6.1
150:          5.7

Tip

If you want to use multiple columns/expressions, you can pass it in a list like this for example sort_by(list("Petal.Width","Sepal.Width")) or sort_by(list("Petal.Width", pl$col("Sepal.Width")))

2.11 Join DataFrames

To perform joins, the join() method must be used.

Multiple strategies are available:

"inner": returns row with matching keys in both frames. Non-matching rows in either the left or right frame are discarded.
"left": returns all rows in the left dataframe, whether or not a match in the right-frame is found. Non-matching rows have their right columns null-filled.
"outer": returns all rows from both the left and right dataframe. If no match is found in one frame, columns from the other frame are null-filled.
"semi": returns all rows from the left frame in which the join key is also present in the right frame.
"anti": returns all rows from the left frame in which the join key is not present in the right frame.
"cross": returns the cartesian product of all rows from the left frame with all rows from the right frame. Duplicates rows are retained. The table length of A cross-joined with B is always len(A) × len(B).

The main arguments are: - on: name(s) of the join columns in both DataFrames.
- how: join strategy.
- suffix: suffix to append to columns with a duplicate name.

Let’s see a simple example:

# First, creation of colors_species Polars DataFrame
colors <- pl$DataFrame(
  Surname = as_polars_series(c("toto","titi","tata")),
  Color = as_polars_series(c("blue","red","yellow"))
)
values <- pl$DataFrame(
  Surname = as_polars_series(c("toto","titi","tata")),
  value = as_polars_series(c(10,20,30))
)
# Let's join !
colors$join(
  other = values,
  on = "Surname",
  how = "left"
)

shape: (3, 3)
┌─────────┬────────┬───────┐
│ Surname ┆ Color  ┆ value │
│ ---     ┆ ---    ┆ ---   │
│ str     ┆ str    ┆ f64   │
╞═════════╪════════╪═══════╡
│ toto    ┆ blue   ┆ 10.0  │
│ titi    ┆ red    ┆ 20.0  │
│ tata    ┆ yellow ┆ 30.0  │
└─────────┴────────┴───────┘

colors_df <- data.frame(
  Surname = c("toto","titi","tata"),
  Color = c("blue","red","yellow")
)
values_df <- data.frame(
  Surname = c("toto","titi","tata"),
  value = c(10,20,30)
)
merge(colors_df, values_df, by = "Surname", all.x = TRUE)

  Surname  Color value
1    tata yellow    30
2    titi    red    20
3    toto   blue    10

colors_df |>
  left_join(values_df,
            by = "Surname")

  Surname  Color value
1    toto   blue    10
2    titi    red    20
3    tata yellow    30

merge(as.data.table(colors_df), 
      as.data.table(values_df), 
      by = "Surname", all.x = TRUE)

Key: <Surname>
   Surname  Color value
    <char> <char> <num>
1:    tata yellow    30
2:    titi    red    20
3:    toto   blue    10

If join columns have different names in both DataFrames, you can use arguments "left_on" and "right_on".
Here’s an example:

values2 <- values$rename(Surname = "Surname2")

colors$join(
  other = values2,
  left_on = "Surname",
  right_on = "Surname2",
  how = "left"
)

shape: (3, 3)
┌─────────┬────────┬───────┐
│ Surname ┆ Color  ┆ value │
│ ---     ┆ ---    ┆ ---   │
│ str     ┆ str    ┆ f64   │
╞═════════╪════════╪═══════╡
│ toto    ┆ blue   ┆ 10.0  │
│ titi    ┆ red    ┆ 20.0  │
│ tata    ┆ yellow ┆ 30.0  │
└─────────┴────────┴───────┘

Tip

When we join on integer columns that are sorted polars uses a fast-track algorithm.

To use the fast-track algorithm polars needs to know the join columns are sorted. See the tip in this section for an example with filter() method.

Polars can’t use the fast-track algorithm for joining string columns as the algorithm works on integers.

=> To use the fast-track algorithm the string column must be cast to categorical dtype. See an example here.

2.12 Concatenate DataFrames

To perform concatenations, the concat() method must be used.
In this section, we will see vertical, horizontal and diagonal concatenation.

Before we need to create some datasets to run our examples. Click to expand it! 👇

Code

dfup <- pl$DataFrame(
  col1 = as_polars_series(c("a")),
  col2 = as_polars_series(c("b")),
  col3 = as_polars_series(c("c"))
)
dfdown <- pl$DataFrame(
  col1 = as_polars_series(c("x")),
  col2 = as_polars_series(c("y")),
  col3 = as_polars_series(c("z"))
)
dfleft <- pl$DataFrame(col1 = as_polars_series(c("a","b","c")))
dfright <- pl$DataFrame(col2 = as_polars_series(c("x","y","z")))

dfup_df <- dfup$to_data_frame()
dfdown_df <- dfdown$to_data_frame()
dfleft_df <- dfleft$to_data_frame()
dfright_df <- dfright$to_data_frame()

dfup_dt <- as.data.table(dfup)
dfdown_dt <- as.data.table(dfdown)
dfleft_dt <- as.data.table(dfleft)
dfright_dt <- as.data.table(dfright)

2.12.1 Vertical concatenation

To concatenate multiple DataFrames vertically (=by row), you can use the concat() method and the argument how = "vertical".

pl$concat(list(dfup,dfdown),how = "vertical")

shape: (2, 3)
┌──────┬──────┬──────┐
│ col1 ┆ col2 ┆ col3 │
│ ---  ┆ ---  ┆ ---  │
│ str  ┆ str  ┆ str  │
╞══════╪══════╪══════╡
│ a    ┆ b    ┆ c    │
│ x    ┆ y    ┆ z    │
└──────┴──────┴──────┘

rbind(dfup_df,dfdown_df)

  col1 col2 col3
1    a    b    c
2    x    y    z

bind_rows(list(dfup_df,dfdown_df))

  col1 col2 col3
1    a    b    c
2    x    y    z

rbindlist(list(dfup_dt, dfdown_dt))

     col1   col2   col3
   <char> <char> <char>
1:      a      b      c
2:      x      y      z

2.12.2 Horizontal concatenation

To concatenate multiple DataFrames horizontally (=by col), you can use the concat() method and the argument how = "horizontal".

pl$concat(list(dfleft,dfright),how = "horizontal")

shape: (3, 2)
┌──────┬──────┐
│ col1 ┆ col2 │
│ ---  ┆ ---  │
│ str  ┆ str  │
╞══════╪══════╡
│ a    ┆ x    │
│ b    ┆ y    │
│ c    ┆ z    │
└──────┴──────┘

cbind(dfleft_df,dfright_df)

  col1 col2
1    a    x
2    b    y
3    c    z

cbind(list(dfleft_df,dfright_df))

     [,1]        
[1,] data.frame,1
[2,] data.frame,1

data.table(dfleft_dt, dfright_dt)

     col1   col2
   <char> <char>
1:      a      x
2:      b      y
3:      c      z

2.12.3 Diagonal concatenation

To concatenate multiple DataFrames diagonally, you can use the concat() method and the argument how = "diagonal".
Diagonal concatenation is useful when the column names are not identical in initial DataFrames.

dfup <- dfup$rename(col3="col4")
pl$concat(list(dfup,dfdown),how = "diagonal")

shape: (2, 4)
┌──────┬──────┬──────┬──────┐
│ col1 ┆ col2 ┆ col4 ┆ col3 │
│ ---  ┆ ---  ┆ ---  ┆ ---  │
│ str  ┆ str  ┆ str  ┆ str  │
╞══════╪══════╪══════╪══════╡
│ a    ┆ b    ┆ c    ┆ null │
│ x    ┆ y    ┆ null ┆ z    │
└──────┴──────┴──────┴──────┘

dfup_df <- dfup$to_data_frame()
dfup_df[setdiff(names(dfdown_df), names(dfup_df))] <- NA
dfdown_df[setdiff(names(dfup_df), names(dfdown_df))] <- NA
rbind(dfup_df,dfdown_df)

  col1 col2 col4 col3
1    a    b    c <NA>
2    x    y <NA>    z

dfup_df <- dfup$to_data_frame()
bind_rows(list(dfup_df,dfdown_df))

  col1 col2 col4 col3
1    a    b    c <NA>
2    x    y <NA>    z

dfup_dt <- as.data.table(dfup)
rbindlist(list(dfup_dt, dfdown_dt),fill = TRUE)

     col1   col2   col4   col3
   <char> <char> <char> <char>
1:      a      b      c   <NA>
2:      x      y   <NA>      z

2.13 Pivot a DataFrame

2.13.1 From long to wide

The pivot() method can be used to pivot a DataFrame from long to wide.

Let’s go for a first example :

df <- data.frame(
  country = c(rep("France",3),rep("Italy","3")),
  city = c("Paris", "Lille", "Nice", "Roma", "Milan", "Napoli"),
  location = c("North","North","South","South","North","South"),
  population = c(2.1, 0.2, 0.4, 2.8, 1.4, 3.0)
)
df

  country   city location population
1  France  Paris    North        2.1
2  France  Lille    North        0.2
3  France   Nice    South        0.4
4   Italy   Roma    South        2.8
5   Italy  Milan    North        1.4
6   Italy Napoli    South        3.0

as_polars_df(df)$pivot(
  on = "city",
  index = "country", 
  values = "population", 
)

shape: (2, 7)
┌─────────┬───────┬───────┬──────┬──────┬───────┬────────┐
│ country ┆ Paris ┆ Lille ┆ Nice ┆ Roma ┆ Milan ┆ Napoli │
│ ---     ┆ ---   ┆ ---   ┆ ---  ┆ ---  ┆ ---   ┆ ---    │
│ str     ┆ f64   ┆ f64   ┆ f64  ┆ f64  ┆ f64   ┆ f64    │
╞═════════╪═══════╪═══════╪══════╪══════╪═══════╪════════╡
│ France  ┆ 2.1   ┆ 0.2   ┆ 0.4  ┆ null ┆ null  ┆ null   │
│ Italy   ┆ null  ┆ null  ┆ null ┆ 2.8  ┆ 1.4   ┆ 3.0    │
└─────────┴───────┴───────┴──────┴──────┴───────┴────────┘

reshape(df[,-which(names(df) %in% c("location"))], 
        idvar = "country", 
        timevar = "city", 
        direction = "wide")

  country population.Paris population.Lille population.Nice population.Roma
1  France              2.1              0.2             0.4              NA
4   Italy               NA               NA              NA             2.8
  population.Milan population.Napoli
1               NA                NA
4              1.4                 3

df |>
  pivot_wider(
    id_cols = country,
    names_from = city,
    values_from = population
  )

# A tibble: 2 × 7
  country Paris Lille  Nice  Roma Milan Napoli
  <chr>   <dbl> <dbl> <dbl> <dbl> <dbl>  <dbl>
1 France    2.1   0.2   0.4  NA    NA       NA
2 Italy    NA    NA    NA     2.8   1.4      3

df_dt <- as.data.table(df)
dcast(df_dt, country ~ city, value.var = "population")

Key: <country>
   country Lille Milan Napoli  Nice Paris  Roma
    <char> <num> <num>  <num> <num> <num> <num>
1:  France   0.2    NA     NA   0.4   2.1    NA
2:   Italy    NA   1.4      3    NA    NA   2.8

You can also aggregate the results using a function that you enter in the argument aggregate_function in pivot() method.

In this case, the aggregate_function argument of pivot() is the equivalent of values_fn of pivot_wider() from {tidyr} and fun.aggregate of dcast() from {data.table}.

as_polars_df(df)$pivot(
  on = "location",
  index = "country", 
  values = "population",
  aggregate_function = "mean"
)

shape: (2, 3)
┌─────────┬───────┬───────┐
│ country ┆ North ┆ South │
│ ---     ┆ ---   ┆ ---   │
│ str     ┆ f64   ┆ f64   │
╞═════════╪═══════╪═══════╡
│ France  ┆ 1.15  ┆ 0.4   │
│ Italy   ┆ 1.4   ┆ 2.9   │
└─────────┴───────┴───────┘

df_summary <- aggregate(population ~ country + location, data = df, FUN = mean)
df_final <- reshape(df_summary, idvar = "country", timevar = "location", direction = "wide")
colnames(df_final) <- c("country", "North", "South")
df_final

  country North South
1  France  1.15   0.4
2   Italy  1.40   2.9

df |>
  pivot_wider(id_cols = country,
              names_from = location, 
              values_from = population, 
              values_fn = mean)

# A tibble: 2 × 3
  country North South
  <chr>   <dbl> <dbl>
1 France   1.15   0.4
2 Italy    1.4    2.9

df_dt <- as.data.table(df)
dcast(df_dt, country ~ location, value.var = "population", fun.aggregate = mean)

Key: <country>
   country North South
    <char> <num> <num>
1:  France  1.15   0.4
2:   Italy  1.40   2.9

However with {polars}, we can also run an expression as an aggregation function.
With {tidyr} and {data.table}, you need to calculate this in advance.
For example:

as_polars_df(df)$pivot(
  on = "location",
  index = "country", 
  values = "population",
  aggregate_function = pl$element()$sum()$sqrt()
)

shape: (2, 3)
┌─────────┬──────────┬──────────┐
│ country ┆ North    ┆ South    │
│ ---     ┆ ---      ┆ ---      │
│ str     ┆ f64      ┆ f64      │
╞═════════╪══════════╪══════════╡
│ France  ┆ 1.516575 ┆ 0.632456 │
│ Italy   ┆ 1.183216 ┆ 2.408319 │
└─────────┴──────────┴──────────┘

df_summarized <- aggregate(population ~ country + location, df, FUN = function(x) sqrt(sum(x)))
df_final <- reshape(df_summarized, idvar = "country", timevar = "location", direction = "wide")
colnames(df_final) <- c("country", "North", "South")
df_final

  country    North     South
1  France 1.516575 0.6324555
2   Italy 1.183216 2.4083189

df |>
  group_by(country, location) |>
  summarise(population_sum = sqrt(sum(population))) |>
  pivot_wider(names_from = location, values_from = population_sum)

# A tibble: 2 × 3
# Groups:   country [2]
  country North South
  <chr>   <dbl> <dbl>
1 France   1.52 0.632
2 Italy    1.18 2.41

dt_final <- df_dt[, .(population_sum = sqrt(sum(population))), by = .(country, location)]
dcast(dt_final, country ~ location, value.var = "population_sum")

Key: <country>
   country    North     South
    <char>    <num>     <num>
1:  France 1.516575 0.6324555
2:   Italy 1.183216 2.4083189

2.13.2 From wide to long

The unpivot() method can be used to pivot a DataFrame from wide to long.

Let’s see with an example :

df <- data.frame(
  country = c("France","Italy"),
  North = c(1.1,1.4),
  South = c(0.4,2.9)
)

as_polars_df(df)$unpivot(
  index = "country",
  on = c("North","South")
)

shape: (4, 3)
┌─────────┬──────────┬───────┐
│ country ┆ variable ┆ value │
│ ---     ┆ ---      ┆ ---   │
│ str     ┆ str      ┆ f64   │
╞═════════╪══════════╪═══════╡
│ France  ┆ North    ┆ 1.1   │
│ Italy   ┆ North    ┆ 1.4   │
│ France  ┆ South    ┆ 0.4   │
│ Italy   ┆ South    ┆ 2.9   │
└─────────┴──────────┴───────┘

melted_df  <- reshape(df, 
        varying = c("North", "South"), 
        v.names = "value", 
        idvar = "country", 
        times = c("North", "South"), 
        timevar = "variable", 
        direction = "long")
rownames(melted_df) <- NULL
melted_df

  country variable value
1  France    North   1.1
2   Italy    North   1.4
3  France    South   0.4
4   Italy    South   2.9

df |>
  pivot_longer(
    !country,
    names_to = "variable",
    values_to = "value"
  )

# A tibble: 4 × 3
  country variable value
  <chr>   <chr>    <dbl>
1 France  North      1.1
2 France  South      0.4
3 Italy   North      1.4
4 Italy   South      2.9

df_dt <- as.data.table(df)
melt(df_dt, id.vars = "country", variable.name = "variable", value.name = "value")

   country variable value
    <char>   <fctr> <num>
1:  France    North   1.1
2:   Italy    North   1.4
3:  France    South   0.4
4:   Italy    South   2.9

2.14 Dealing with missing values

We have already introduced missing values in here. In this section, we will go further and understand how to deal with missing values with polars and R.

2.14.1 Check if Series has missing values

The is_null() and is_not_null() methods can be used to check if Series has missing values.
Theses methods are the equivalent of is.na() and !is.na() of R base.

Let’s see two examples combining thes methods with select() and filter() methods:

polars
R base

mydfNA <- pl$DataFrame(
  colA = as_polars_series(c("a",NA,"c")),
  colB = as_polars_series(c("d",NA,NA))
)

# Find null values
mydfNA$
  select(
    pl$col("colA"),
    pl$col("colA")$is_null()$alias("is_null"),
    pl$col("colA")$is_not_null()$alias("is_not_null")
  )

shape: (3, 3)
┌──────┬─────────┬─────────────┐
│ colA ┆ is_null ┆ is_not_null │
│ ---  ┆ ---     ┆ ---         │
│ str  ┆ bool    ┆ bool        │
╞══════╪═════════╪═════════════╡
│ a    ┆ false   ┆ true        │
│ null ┆ true    ┆ false       │
│ c    ┆ false   ┆ true        │
└──────┴─────────┴─────────────┘

# Filter by null values
mydfNA$
  filter(
    pl$col("colA")$is_not_null() & pl$col("colB")$is_not_null()
  )

shape: (1, 2)
┌──────┬──────┐
│ colA ┆ colB │
│ ---  ┆ ---  │
│ str  ┆ str  │
╞══════╪══════╡
│ a    ┆ d    │
└──────┴──────┘

mydfNA2 <- mydfNA$to_data_frame()

# Find null values
data.frame(
  colA = mydfNA2$colA,
  is_null = is.na(mydfNA2$colA),
  is_not_null = !is.na(mydfNA2$colA)
)

  colA is_null is_not_null
1    a   FALSE        TRUE
2 <NA>    TRUE       FALSE
3    c   FALSE        TRUE

# Filter by null values
mydfNA2[!is.na(mydfNA2$colA) & !is.na(mydfNA2$colB), ]

  colA colB
1    a    d

2.14.2 Replace missing values with a constant

The fill_null() method can be used to replace missing values in a Series with a constant.

Here’s some examples where I replace missing values from all columns of a DataFrame:

mydfNA <- pl$DataFrame(
  colA = as_polars_series(c("a",NA,"c")),
  colB = as_polars_series(c("d",NA,NA))
)

# In the same columns
mydfNA$
  with_columns(
  pl$all()$fill_null("missing")
)

shape: (3, 2)
┌─────────┬─────────┐
│ colA    ┆ colB    │
│ ---     ┆ ---     │
│ str     ┆ str     │
╞═════════╪═════════╡
│ a       ┆ d       │
│ missing ┆ missing │
│ c       ┆ missing │
└─────────┴─────────┘

# In new columns suffixed by "_corrected"
mydfNA$
  with_columns(
  pl$all()$fill_null("missing")$name$suffix("_corrected")
)

shape: (3, 4)
┌──────┬──────┬────────────────┬────────────────┐
│ colA ┆ colB ┆ colA_corrected ┆ colB_corrected │
│ ---  ┆ ---  ┆ ---            ┆ ---            │
│ str  ┆ str  ┆ str            ┆ str            │
╞══════╪══════╪════════════════╪════════════════╡
│ a    ┆ d    ┆ a              ┆ d              │
│ null ┆ null ┆ missing        ┆ missing        │
│ c    ┆ null ┆ c              ┆ missing        │
└──────┴──────┴────────────────┴────────────────┘

mydfNA2 <- mydfNA$to_data_frame()

# In the same columns
mydfNA2[is.na(mydfNA2)] <- "missing"
mydfNA2

     colA    colB
1       a       d
2 missing missing
3       c missing

# In new columns suffixed by "_corrected"
mydfNA2 <- mydfNA$to_data_frame()

transform(mydfNA2,
          colA_corrected = ifelse(is.na(colA), "missing", colA),
          colB_corrected = ifelse(is.na(colB), "missing", colB))

  colA colB colA_corrected colB_corrected
1    a    d              a              d
2 <NA> <NA>        missing        missing
3    c <NA>              c        missing

mydfNA2 <- mydfNA$to_data_frame()

# In the same columns
mydfNA2 %>%
  mutate(across(everything(), ~ifelse(is.na(.), "missing", .)))

     colA    colB
1       a       d
2 missing missing
3       c missing

# In new columns suffixed by "_corrected"
mydfNA2 %>%
  mutate(across(c(colA, colB), ~ifelse(is.na(.), "missing", .), .names = "{col}_corrected"))

  colA colB colA_corrected colB_corrected
1    a    d              a              d
2 <NA> <NA>        missing        missing
3    c <NA>              c        missing

mydfNA2 <- mydfNA$to_data_frame()
mydfNA2_dt <- as.data.table(mydfNA2)

# In the same columns
mydfNA2_dt[is.na(mydfNA2_dt)] <- "missing"
mydfNA2_dt

      colA    colB
    <char>  <char>
1:       a       d
2: missing missing
3:       c missing

# In new columns suffixed by "_corrected"
mydfNA2_dt[,
           c("colA_corrected", "colB_corrected") := lapply(.SD, function(x) ifelse(is.na(x), "missing", x)),
           .SDcols = c("colA", "colB")]
mydfNA2_dt

      colA    colB colA_corrected colB_corrected
    <char>  <char>         <char>         <char>
1:       a       d              a              d
2: missing missing        missing        missing
3:       c missing              c        missing

Important

Be careful, the fill_null() method can in some cases modify data types like cast(). This can happen, for example, when you’re working on an integer column and you want to replace the missing values with a string => the column will then has a string dtype!

2.14.3 Replace missing values with a strategy

The fill_null() method of polars has a strategy argument for replacing missing values:

forward: replace with the previous non-null value in the Series
backward: replace with the next non-null value in the Series
min: replace with the smallest value in the Series
max: replace with the largest value in the Series
mean: replace with the median value in the Series
zero: replace with 0
one: replace with 1

Note

We can set a limit on how many rows to fill-forward or backward with limit

Here’s some examples :

# In the same columns
mydfNA <- pl$DataFrame(
  colA = as_polars_series(c("a",NA,"c")),
  colB = as_polars_series(c("d",NA,NA)),
  colC = as_polars_series(c(1,NA,3))
)
# With forward strategy
mydfNA$
  with_columns(
    pl$all()$fill_null(strategy = "forward")$name$suffix("_corrected")
  )

shape: (3, 6)
┌──────┬──────┬──────┬────────────────┬────────────────┬────────────────┐
│ colA ┆ colB ┆ colC ┆ colA_corrected ┆ colB_corrected ┆ colC_corrected │
│ ---  ┆ ---  ┆ ---  ┆ ---            ┆ ---            ┆ ---            │
│ str  ┆ str  ┆ f64  ┆ str            ┆ str            ┆ f64            │
╞══════╪══════╪══════╪════════════════╪════════════════╪════════════════╡
│ a    ┆ d    ┆ 1.0  ┆ a              ┆ d              ┆ 1.0            │
│ null ┆ null ┆ null ┆ a              ┆ d              ┆ 1.0            │
│ c    ┆ null ┆ 3.0  ┆ c              ┆ d              ┆ 3.0            │
└──────┴──────┴──────┴────────────────┴────────────────┴────────────────┘

# With forward strategy and a limit
mydfNA$
  with_columns(
    pl$all()$fill_null(strategy = "forward", limit = 1)$name$suffix("_corrected")
  )

shape: (3, 6)
┌──────┬──────┬──────┬────────────────┬────────────────┬────────────────┐
│ colA ┆ colB ┆ colC ┆ colA_corrected ┆ colB_corrected ┆ colC_corrected │
│ ---  ┆ ---  ┆ ---  ┆ ---            ┆ ---            ┆ ---            │
│ str  ┆ str  ┆ f64  ┆ str            ┆ str            ┆ f64            │
╞══════╪══════╪══════╪════════════════╪════════════════╪════════════════╡
│ a    ┆ d    ┆ 1.0  ┆ a              ┆ d              ┆ 1.0            │
│ null ┆ null ┆ null ┆ a              ┆ d              ┆ 1.0            │
│ c    ┆ null ┆ 3.0  ┆ c              ┆ null           ┆ 3.0            │
└──────┴──────┴──────┴────────────────┴────────────────┴────────────────┘

# With backward strategy
mydfNA$
  with_columns(
    pl$all()$fill_null(strategy = "backward")$name$suffix("_corrected")
  )

shape: (3, 6)
┌──────┬──────┬──────┬────────────────┬────────────────┬────────────────┐
│ colA ┆ colB ┆ colC ┆ colA_corrected ┆ colB_corrected ┆ colC_corrected │
│ ---  ┆ ---  ┆ ---  ┆ ---            ┆ ---            ┆ ---            │
│ str  ┆ str  ┆ f64  ┆ str            ┆ str            ┆ f64            │
╞══════╪══════╪══════╪════════════════╪════════════════╪════════════════╡
│ a    ┆ d    ┆ 1.0  ┆ a              ┆ d              ┆ 1.0            │
│ null ┆ null ┆ null ┆ c              ┆ null           ┆ 3.0            │
│ c    ┆ null ┆ 3.0  ┆ c              ┆ null           ┆ 3.0            │
└──────┴──────┴──────┴────────────────┴────────────────┴────────────────┘

# With mean strategy only on "colC" column
mydfNA$
  with_columns(
    pl$col("colC")$fill_null(strategy = "mean")$name$suffix("_corrected")
  )

shape: (3, 4)
┌──────┬──────┬──────┬────────────────┐
│ colA ┆ colB ┆ colC ┆ colC_corrected │
│ ---  ┆ ---  ┆ ---  ┆ ---            │
│ str  ┆ str  ┆ f64  ┆ f64            │
╞══════╪══════╪══════╪════════════════╡
│ a    ┆ d    ┆ 1.0  ┆ 1.0            │
│ null ┆ null ┆ null ┆ 2.0            │
│ c    ┆ null ┆ 3.0  ┆ 3.0            │
└──────┴──────┴──────┴────────────────┘

2.14.4 Replace missing values with an expression

Of course, you are not limited to the built-in strategies of polars => with fill_null() you can also use expressions to replace missing values. It works with expression from the same column and from another column.

Here’s some examples:

# Replace missing values with the mean of the non-null values for that column
mydfNA$
  with_columns(
    pl$col("colC")$fill_null(pl$mean("colC"))$name$suffix("_corrected")
  )

shape: (3, 4)
┌──────┬──────┬──────┬────────────────┐
│ colA ┆ colB ┆ colC ┆ colC_corrected │
│ ---  ┆ ---  ┆ ---  ┆ ---            │
│ str  ┆ str  ┆ f64  ┆ f64            │
╞══════╪══════╪══════╪════════════════╡
│ a    ┆ d    ┆ 1.0  ┆ 1.0            │
│ null ┆ null ┆ null ┆ 2.0            │
│ c    ┆ null ┆ 3.0  ┆ 3.0            │
└──────┴──────┴──────┴────────────────┘

# Replace missing values with the values from another column
mydfNA$
  with_columns(
    pl$col("colB")$fill_null(pl$col("colA"))$name$suffix("_corrected")
  )

shape: (3, 4)
┌──────┬──────┬──────┬────────────────┐
│ colA ┆ colB ┆ colC ┆ colB_corrected │
│ ---  ┆ ---  ┆ ---  ┆ ---            │
│ str  ┆ str  ┆ f64  ┆ str            │
╞══════╪══════╪══════╪════════════════╡
│ a    ┆ d    ┆ 1.0  ┆ d              │
│ null ┆ null ┆ null ┆ null           │
│ c    ┆ null ┆ 3.0  ┆ c              │
└──────┴──────┴──────┴────────────────┘

mydfNA2 <- mydfNA$to_data_frame()

# Replace missing values with the mean of the non-null values for that column
mydfNA2$colC_corrected <- ifelse(is.na(mydfNA2$colC), mean(mydfNA2$colC, na.rm = TRUE), mydfNA2$colC)

mydfNA2 <- mydfNA$to_data_frame()
# Replace missing values with the values from another column
mydfNA2$colB_corrected <- ifelse(is.na(mydfNA2$colB), mydfNA2$colA, mydfNA2$colB)
mydfNA2

  colA colB colC colB_corrected
1    a    d    1              d
2 <NA> <NA>   NA           <NA>
3    c <NA>    3              c

mydfNA2 <- mydfNA$to_data_frame()

# Replace missing values with the mean of the non-null values for that column
mydfNA2 %>%
  mutate(colC_corrected = ifelse(is.na(colC), mean(mydfNA2$colC, na.rm = TRUE), colC))

  colA colB colC colC_corrected
1    a    d    1              1
2 <NA> <NA>   NA              2
3    c <NA>    3              3

# Replace missing values with the values from another column
mydfNA2 %>%
  mutate(colB_corrected = ifelse(is.na(colB), colA, colB))

  colA colB colC colB_corrected
1    a    d    1              d
2 <NA> <NA>   NA           <NA>
3    c <NA>    3              c

mydfNA2 <- mydfNA$to_data_frame()
mydfNA2_dt <- as.data.table(mydfNA2)

# Replace missing values with the mean of the non-null values for that column
mydfNA2_dt[, colC_corrected := ifelse(is.na(colC), mean(mydfNA2_dt$colC, na.rm = TRUE), colC)]
mydfNA2_dt

     colA   colB  colC colC_corrected
   <char> <char> <num>          <num>
1:      a      d     1              1
2:   <NA>   <NA>    NA              2
3:      c   <NA>     3              3

# Replace missing values with the values from another column
mydfNA2_dt[, colB_corrected := ifelse(is.na(colB), colA, colB)]
mydfNA2_dt

     colA   colB  colC colC_corrected colB_corrected
   <char> <char> <num>          <num>         <char>
1:      a      d     1              1              d
2:   <NA>   <NA>    NA              2           <NA>
3:      c   <NA>     3              3              c

2.14.5 Replace missing values with a sequence of columns

The coalesce() method can be used to replace missing values based on a sequence of columns.

Here’s an example creating a new column “col4” that has the first non-null value as we go through some columns (in order!):

mynewdfNA <- pl$DataFrame(
  col1 = as_polars_series(c(NA,"y",NA)),
  col2 = as_polars_series(c(NA,"v","w")),
  col3 = as_polars_series(c("r","s",NA))
)

mynewdfNA$
  with_columns(
    pl$coalesce("col1","col2","col3")$alias("col4")
  )

shape: (3, 4)
┌──────┬──────┬──────┬──────┐
│ col1 ┆ col2 ┆ col3 ┆ col4 │
│ ---  ┆ ---  ┆ ---  ┆ ---  │
│ str  ┆ str  ┆ str  ┆ str  │
╞══════╪══════╪══════╪══════╡
│ null ┆ null ┆ r    ┆ r    │
│ y    ┆ v    ┆ s    ┆ y    │
│ null ┆ w    ┆ null ┆ w    │
└──────┴──────┴──────┴──────┘

mynewdfNA2 <-  mynewdfNA$to_data_frame()

mynewdfNA2$col4 <- coalesce(mynewdfNA2$col1, mynewdfNA2$col2, mynewdfNA2$col3)
mynewdfNA2

  col1 col2 col3 col4
1 <NA> <NA>    r    r
2    y    v    s    y
3 <NA>    w <NA>    w

mynewdfNA2 <-  mynewdfNA$to_data_frame()

mynewdfNA2 %>%
  mutate(col4 = coalesce(col1, col2, col3))

  col1 col2 col3 col4
1 <NA> <NA>    r    r
2    y    v    s    y
3 <NA>    w <NA>    w

mynewdfNA2 <-  mynewdfNA$to_data_frame()
mynewdfNA2_dt <- as.data.table(mynewdfNA2)

mynewdfNA2_dt[, col4 := fcoalesce(col1, col2, col3)]
mynewdfNA2_dt

     col1   col2   col3   col4
   <char> <char> <char> <char>
1:   <NA>   <NA>      r      r
2:      y      v      s      y
3:   <NA>      w   <NA>      w

2.15 Others useful methods

2.15.1 On Series

2.15.1.1 Change name of Series

The alias() method is very useful especially in method chaining operation.
With R base, the syntax is longer.

polars
R base

as_polars_series(1:3, name = "toto")$alias("titi")

polars Series: shape: (3,)
Series: 'titi' [i32]
[
    1
    2
    3
]

toto <- 1:3
titi <- toto
rm(toto)

2.15.1.2 Reduce Boolean Series

The all() and any() methods can be used to check if all or any values in a vector evaluate to TRUE for some expression.

polars
R base

# all(as_polars_series(c(TRUE,TRUE))) doesn't work
as_polars_series(c(TRUE, TRUE, NA))$all()

[1] FALSE

as_polars_series(c(TRUE, TRUE, FALSE))$all()

[1] FALSE

as_polars_series(c(TRUE, TRUE, TRUE))$all()

[1] TRUE

all(c(TRUE,TRUE,NA))

[1] NA

all(c(TRUE,TRUE,FALSE))

[1] FALSE

all(c(TRUE,TRUE,TRUE))

[1] TRUE

2.15.1.3 Get data type of Series

The dtype() method can be used to get data type of Series.

polars
R base

as_polars_series(letters)$dtype

DataType: String

as_polars_series(c(1, 2))$dtype

DataType: Float64

infer_type(letters)

Utf8
string

infer_type(c(1, 2))

Float64
double

Tip

Polars is strongly typed. print(ls(pl$dtypes)) returns the full list of valid Polars types. Caution, some type names differ from what they are called in R base. See below!

polars
R base

as_polars_series(c("x","y","z"))$dtype

DataType: String

as_polars_series(c(1, 2, 3))$dtype

DataType: Float64

as_polars_series(c(1:3))$dtype

DataType: Int32

as_polars_series(c(TRUE,FALSE))$dtype

DataType: Boolean

as_polars_series(factor(c("a","b","c")))$dtype

DataType: Categorical(
    Some(
        local,
    ),
    Physical,
)

as_polars_series(Sys.Date())$dtype

DataType: Date

as_polars_series(c(0,1))$dtype

DataType: Float64

typeof(c("x","y","z"))

[1] "character"

typeof(c(1, 2, 3))

[1] "double"

typeof(c(1:3))

[1] "integer"

typeof(c(TRUE,FALSE))

[1] "logical"

typeof(factor(c("a","b","c")))

[1] "integer"

typeof(Sys.Date())

[1] "double"

To summarise the main types between Polars and R:

Polars	R Base
Utf8	character
Float64	double
Int32	integer
Boolean	logical
Categorical	Factor
Date	Date

To learn more about Data types in Polars, see here.

2.15.1.4 Cast

The cast() method can be used to convert the data types of a column to a new one.

as_polars_df(iris)$with_columns(
  pl$col("Petal.Length")$cast(pl$Int8), # The "Petal.Length" column is converted into integers
  pl$col("Species")$cast(pl$Utf8) # The "Species" column is converted into strings
  )$schema

$Sepal.Length
DataType: Float64

$Sepal.Width
DataType: Float64

$Petal.Length
DataType: Int8

$Petal.Width
DataType: Float64

$Species
DataType: String

data(iris)
iris$Petal.Length <- as.integer(iris$Petal.Length)
iris$Species <- as.integer(iris$Species)
str(iris)

'data.frame':   150 obs. of  5 variables:
 $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
 $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
 $ Petal.Length: int  1 1 1 1 1 1 1 1 1 1 ...
 $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
 $ Species     : int  1 1 1 1 1 1 1 1 1 1 ...

data(iris)
iris |>
  mutate(
    Petal.Length = as.integer(Petal.Length),
    Species = as.character(Species)) |>
  str()

'data.frame':   150 obs. of  5 variables:
 $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
 $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
 $ Petal.Length: int  1 1 1 1 1 1 1 1 1 1 ...
 $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
 $ Species     : chr  "setosa" "setosa" "setosa" "setosa" ...

iris_dt[, `:=`(Petal.Length = as.integer(Petal.Length),
               Species = as.character(Species))]
str(iris_dt)

Classes 'data.table' and 'data.frame':  150 obs. of  5 variables:
 $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
 $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
 $ Petal.Length: int  1 1 1 1 1 1 1 1 1 1 ...
 $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
 $ Species     : chr  "setosa" "setosa" "setosa" "setosa" ...
 - attr(*, ".internal.selfref")=<externalptr>

Tip

When working with very large tables we can Reduce the memory footprint by modifying the number of bits allocated to an element. ⚠️

For example, the example below illustrates how converting Float64 to Float8 reduces memory usage:

as_polars_df(iris)$estimated_size()

[1] 5425

#| label: reduce-footprint-polars
as_polars_df(iris)$with_columns(
  pl$col("Petal.Length")$cast(pl$Float32),
  pl$col("Petal.Width")$cast(pl$Float32),
  pl$col("Sepal.Length")$cast(pl$Float32),
  pl$col("Sepal.Width")$cast(pl$Float32)
  )$estimated_size()

[1] 3025

When performing downcasting, it is crucial to ensure that the chosen number of bits is sufficient to accommodate the largest and smallest numbers in the column.

A quick reminder:

Type	Range	Accuracy
Int8	-128 to +127
Int16	-32768 to +32767
Int32	-2147483648 to +2147483647
Int64	–2E63 to –2E63-1
Float32	-3.4E+38 to +3.4E+38	about 7 decimal digits
Float64	-1.7E+308 to +1.7E+308	about 16 decimal digits

2.15.1.5 Check if Series is numeric

The is_numeric() method can be used to check if Series is numeric.
Note that unlike R base, there is no method to check if a Series is character (in this case, its type is anyway Utf8).

polars
R base

as_polars_series(1:4)$is_numeric()

[1] TRUE

as_polars_series(c("a", "b", "c"))$is_numeric()

[1] FALSE

is.numeric(1:4)

[1] TRUE

is.numeric(c("a","b","c"))

[1] FALSE

2.15.1.6 Check if Series is sorted

The is_sorted() method can be used to check if Series is sorted.
Note that R base provides is.unsorted() which returns the opposite boolean to is_sorted() of Polars.

polars
R base

as_polars_series(1:4)$is_sorted()

[1] TRUE

as_polars_series(c(1,3,2))$is_sorted()

[1] FALSE

is.unsorted(1:4)

[1] FALSE

is.unsorted(c(1,3,2))

[1] TRUE

2.15.1.7 Get length of a Series

The len() method can be used to get the length of a Series.

polars
R base

as_polars_series(1:4)$len()

[1] 4

length(1:4)

[1] 4

2.15.1.8 Check if Series are equal

The series_equal() method can be used to check if a Series is equal with another Series.

Tip

Caution, if two series are identical but one is named and the other is not then series_equal() returns FALSE.

polars
R base

as_polars_series(1:4)$equals(as_polars_series(1:4))

[1] TRUE

as_polars_series(1:4,name = "toto")$equals(as_polars_series(1:4))

[1] TRUE

identical(1:4,1:4)

[1] TRUE

2.15.1.9 Convert Series to Polars DataFrame

The to_frame() method can be used to convert a Series to a DataFrame.
In this case, a DataFrame with only one column will be created. If the Series is initially named then the column of the DataFrame will be named as such.

as_polars_series(1:3, "toto")$to_frame()

shape: (3, 1)
┌──────┐
│ toto │
│ ---  │
│ i32  │
╞══════╡
│ 1    │
│ 2    │
│ 3    │
└──────┘

2.15.1.10 Get value Counts of a Series

The value_counts() method can be used to get a value counts of a Series.

as_polars_series(iris$Species)$value_counts()

shape: (3, 2)
┌────────────┬───────┐
│            ┆ count │
│ ---        ┆ ---   │
│ cat        ┆ u32   │
╞════════════╪═══════╡
│ setosa     ┆ 50    │
│ versicolor ┆ 50    │
│ virginica  ┆ 50    │
└────────────┴───────┘

table(iris$Species)


    setosa versicolor  virginica 
        50         50         50

iris |>
  count(Species)

     Species  n
1     setosa 50
2 versicolor 50
3  virginica 50

iris_dt[, .N, by = Species]

      Species     N
       <char> <int>
1:     setosa    50
2: versicolor    50
3:  virginica    50

2.15.1.11 Sum across `Series`

The sum() method can be used to get a sum of a Series.

From a single Series:

polars
R base

as_polars_series(1:3)$sum()

[1] 6

sum(c(1:3))

[1] 6

From a DataFrame and a column as a string:

as_polars_df(iris)$select(pl$sum("Petal.Length"))

shape: (1, 1)
┌──────────────┐
│ Petal.Length │
│ ---          │
│ f64          │
╞══════════════╡
│ 563.7        │
└──────────────┘

sum(iris$Petal.Length)

[1] 563.7

iris |> summarise(sum(Petal.Length))

  sum(Petal.Length)
1             563.7

sum(iris_dt[, Petal.Length])

[1] 496

From a DataFrame and a column as an expression:

as_polars_df(iris)$select(pl$sum("Petal.Width"))

shape: (1, 1)
┌─────────────┐
│ Petal.Width │
│ ---         │
│ f64         │
╞═════════════╡
│ 179.9       │
└─────────────┘

From a DataFrame and a column as a list and sum horizontally:

In this case, use with_columns() method.

df_pl <- pl$DataFrame(col1 = c(10L,20L), col2= c(30L,40L), col3 = c(40L,50L))
df_pl$with_columns(pl$sum("col1", "col3"))

shape: (2, 3)
┌──────┬──────┬──────┐
│ col1 ┆ col2 ┆ col3 │
│ ---  ┆ ---  ┆ ---  │
│ i32  ┆ i32  ┆ i32  │
╞══════╪══════╪══════╡
│ 30   ┆ 30   ┆ 90   │
│ 30   ┆ 40   ┆ 90   │
└──────┴──────┴──────┘

df <- data.frame(col1 = c(10L,20L), col2= c(30L,40L), col3 = c(40L,50L))
mysum <- rowSums(df[, c("col1", "col3")])
cbind(df,mysum)

  col1 col2 col3 mysum
1   10   30   40    50
2   20   40   50    70

df |>
  rowwise() |>
  mutate(mysum = sum(col1,col3))

# A tibble: 2 × 4
# Rowwise: 
   col1  col2  col3 mysum
  <int> <int> <int> <int>
1    10    30    40    50
2    20    40    50    70

df_dt <- as.data.table(df)
df_dt[, somme := rowSums(.SD), .SDcols = c("col1", "col3")]
print(df_dt)

    col1  col2  col3 somme
   <int> <int> <int> <num>
1:    10    30    40    50
2:    20    40    50    70

From a DataFrame and sum horizontally some columns:

An example with iris Dataframe:

df_pl$with_columns(
  pl$sum_horizontal("col1", "col2", "col3")$alias("sum")
)

shape: (2, 4)
┌──────┬──────┬──────┬─────┐
│ col1 ┆ col2 ┆ col3 ┆ sum │
│ ---  ┆ ---  ┆ ---  ┆ --- │
│ i32  ┆ i32  ┆ i32  ┆ i32 │
╞══════╪══════╪══════╪═════╡
│ 10   ┆ 30   ┆ 40   ┆ 80  │
│ 20   ┆ 40   ┆ 50   ┆ 110 │
└──────┴──────┴──────┴─────┘

2.15.2 On DataFrames

2.15.2.1 Get Series from DataFrame

The to_series() method can be used to get one column from DataFrame as Series.

as_polars_df(iris)$select(pl$col("Petal.Length"))$to_series()

polars Series: shape: (150,)
Series: 'Petal.Length' [f64]
[
    1.4
    1.4
    1.3
    1.5
    1.4
    …
    5.2
    5.0
    5.2
    5.4
    5.1
]

iris$Petal.Length

  [1] 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 1.5 1.6 1.4 1.1 1.2 1.5 1.3 1.4
 [19] 1.7 1.5 1.7 1.5 1.0 1.7 1.9 1.6 1.6 1.5 1.4 1.6 1.6 1.5 1.5 1.4 1.5 1.2
 [37] 1.3 1.4 1.3 1.5 1.3 1.3 1.3 1.6 1.9 1.4 1.6 1.4 1.5 1.4 4.7 4.5 4.9 4.0
 [55] 4.6 4.5 4.7 3.3 4.6 3.9 3.5 4.2 4.0 4.7 3.6 4.4 4.5 4.1 4.5 3.9 4.8 4.0
 [73] 4.9 4.7 4.3 4.4 4.8 5.0 4.5 3.5 3.8 3.7 3.9 5.1 4.5 4.5 4.7 4.4 4.1 4.0
 [91] 4.4 4.6 4.0 3.3 4.2 4.2 4.2 4.3 3.0 4.1 6.0 5.1 5.9 5.6 5.8 6.6 4.5 6.3
[109] 5.8 6.1 5.1 5.3 5.5 5.0 5.1 5.3 5.5 6.7 6.9 5.0 5.7 4.9 6.7 4.9 5.7 6.0
[127] 4.8 4.9 5.6 5.8 6.1 6.4 5.6 5.1 5.6 6.1 5.6 5.5 4.8 5.4 5.6 5.1 5.1 5.9
[145] 5.7 5.2 5.0 5.2 5.4 5.1

iris |>
  pull(Petal.Length)

  [1] 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 1.5 1.6 1.4 1.1 1.2 1.5 1.3 1.4
 [19] 1.7 1.5 1.7 1.5 1.0 1.7 1.9 1.6 1.6 1.5 1.4 1.6 1.6 1.5 1.5 1.4 1.5 1.2
 [37] 1.3 1.4 1.3 1.5 1.3 1.3 1.3 1.6 1.9 1.4 1.6 1.4 1.5 1.4 4.7 4.5 4.9 4.0
 [55] 4.6 4.5 4.7 3.3 4.6 3.9 3.5 4.2 4.0 4.7 3.6 4.4 4.5 4.1 4.5 3.9 4.8 4.0
 [73] 4.9 4.7 4.3 4.4 4.8 5.0 4.5 3.5 3.8 3.7 3.9 5.1 4.5 4.5 4.7 4.4 4.1 4.0
 [91] 4.4 4.6 4.0 3.3 4.2 4.2 4.2 4.3 3.0 4.1 6.0 5.1 5.9 5.6 5.8 6.6 4.5 6.3
[109] 5.8 6.1 5.1 5.3 5.5 5.0 5.1 5.3 5.5 6.7 6.9 5.0 5.7 4.9 6.7 4.9 5.7 6.0
[127] 4.8 4.9 5.6 5.8 6.1 6.4 5.6 5.1 5.6 6.1 5.6 5.5 4.8 5.4 5.6 5.1 5.1 5.9
[145] 5.7 5.2 5.0 5.2 5.4 5.1

iris_dt[, c(Petal.Length)]

  [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 [38] 1 1 1 1 1 1 1 1 1 1 1 1 1 4 4 4 4 4 4 4 3 4 3 3 4 4 4 3 4 4 4 4 3 4 4 4 4
 [75] 4 4 4 5 4 3 3 3 3 5 4 4 4 4 4 4 4 4 4 3 4 4 4 4 3 4 6 5 5 5 5 6 4 6 5 6 5
[112] 5 5 5 5 5 5 6 6 5 5 4 6 4 5 6 4 4 5 5 6 6 5 5 5 6 5 5 4 5 5 5 5 5 5 5 5 5
[149] 5 5

2.15.2.2 Get a R List from DataFrame

The to_list() method can be used to get a R List from DataFrame.

polars
R base

mylist <- as_polars_df(iris)$to_list()
names(mylist)

[1] "Sepal.Length" "Sepal.Width"  "Petal.Length" "Petal.Width"  "Species"

mylist <- as.list(iris)
names(mylist)

[1] "Sepal.Length" "Sepal.Width"  "Petal.Length" "Petal.Width"  "Species"

2.15.2.3 Get a slice of a DataFrame

The slice() method can be used to get a slice of a DataFrame.

Important

With Polars, numeric default is 0! Thus the equivalent to slice(1,3) with Polars will be 1:4 in R Base and data.table and slice(1,4) with dplyr.

as_polars_df(iris)$slice(1,3)

shape: (3, 5)
┌──────────────┬─────────────┬──────────────┬─────────────┬─────────┐
│ Sepal.Length ┆ Sepal.Width ┆ Petal.Length ┆ Petal.Width ┆ Species │
│ ---          ┆ ---         ┆ ---          ┆ ---         ┆ ---     │
│ f64          ┆ f64         ┆ f64          ┆ f64         ┆ cat     │
╞══════════════╪═════════════╪══════════════╪═════════════╪═════════╡
│ 4.9          ┆ 3.0         ┆ 1.4          ┆ 0.2         ┆ setosa  │
│ 4.7          ┆ 3.2         ┆ 1.3          ┆ 0.2         ┆ setosa  │
│ 4.6          ┆ 3.1         ┆ 1.5          ┆ 0.2         ┆ setosa  │
└──────────────┴─────────────┴──────────────┴─────────────┴─────────┘

iris[2:4,]

  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa

iris |>
  slice(2:4)

  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          4.9         3.0          1.4         0.2  setosa
2          4.7         3.2          1.3         0.2  setosa
3          4.6         3.1          1.5         0.2  setosa

iris_dt[2:4,]

   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
          <num>       <num>        <int>       <num>  <char>
1:          4.9         3.0            1         0.2  setosa
2:          4.7         3.2            1         0.2  setosa
3:          4.6         3.1            1         0.2  setosa

2.15.2.4 Get a structure from a DataFrame

The to_struct() method can be used to get a structure from a DataFrame.

mystruc <- as_polars_df(iris)$to_struct()
mystruc

polars Series: shape: (150,)
Series: '' [struct[5]]
[
    {5.1,3.5,1.4,0.2,"setosa"}
    {4.9,3.0,1.4,0.2,"setosa"}
    {4.7,3.2,1.3,0.2,"setosa"}
    {4.6,3.1,1.5,0.2,"setosa"}
    {5.0,3.6,1.4,0.2,"setosa"}
    …
    {6.7,3.0,5.2,2.3,"virginica"}
    {6.3,2.5,5.0,1.9,"virginica"}
    {6.5,3.0,5.2,2.0,"virginica"}
    {6.2,3.4,5.4,2.3,"virginica"}
    {5.9,3.0,5.1,1.8,"virginica"}
]

to_struct() returns a Series which can be converted to a R list with to_r method.

mylist <- mystruc$to_r()
str(mylist)

List of 5
 $ Sepal.Length: num [1:150] 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
 $ Sepal.Width : num [1:150] 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
 $ Petal.Length: num [1:150] 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
 $ Petal.Width : num [1:150] 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
 $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
 - attr(*, "is_struct")= logi TRUE

mylist is now a list where each element is a column of the initial DataFrame.

With to_frame() and unnest() methods, we can reconstruct the original DataFrame:

back_df <- mystruc$to_frame()$unnest()
back_df

shape: (150, 5)
┌──────────────┬─────────────┬──────────────┬─────────────┬───────────┐
│ Sepal.Length ┆ Sepal.Width ┆ Petal.Length ┆ Petal.Width ┆ Species   │
│ ---          ┆ ---         ┆ ---          ┆ ---         ┆ ---       │
│ f64          ┆ f64         ┆ f64          ┆ f64         ┆ cat       │
╞══════════════╪═════════════╪══════════════╪═════════════╪═══════════╡
│ 5.1          ┆ 3.5         ┆ 1.4          ┆ 0.2         ┆ setosa    │
│ 4.9          ┆ 3.0         ┆ 1.4          ┆ 0.2         ┆ setosa    │
│ 4.7          ┆ 3.2         ┆ 1.3          ┆ 0.2         ┆ setosa    │
│ 4.6          ┆ 3.1         ┆ 1.5          ┆ 0.2         ┆ setosa    │
│ 5.0          ┆ 3.6         ┆ 1.4          ┆ 0.2         ┆ setosa    │
│ …            ┆ …           ┆ …            ┆ …           ┆ …         │
│ 6.7          ┆ 3.0         ┆ 5.2          ┆ 2.3         ┆ virginica │
│ 6.3          ┆ 2.5         ┆ 5.0          ┆ 1.9         ┆ virginica │
│ 6.5          ┆ 3.0         ┆ 5.2          ┆ 2.0         ┆ virginica │
│ 6.2          ┆ 3.4         ┆ 5.4          ┆ 2.3         ┆ virginica │
│ 5.9          ┆ 3.0         ┆ 5.1          ┆ 1.8         ┆ virginica │
└──────────────┴─────────────┴──────────────┴─────────────┴───────────┘

2.15.2.5 Drop all rows that contain null values

The drop_nulls() method can be used to drop all rows that contain null values in a DataFrame.

By default, drop_nulls() use all columns to drop rows:

data_pl <- pl$DataFrame(
  col1 = as_polars_series(c(NA,"b","c")),
  col2 = as_polars_series(c(1,2,NA))
)
data_pl$drop_nulls()

shape: (1, 2)
┌──────┬──────┐
│ col1 ┆ col2 │
│ ---  ┆ ---  │
│ str  ┆ f64  │
╞══════╪══════╡
│ b    ┆ 2.0  │
└──────┴──────┘

data_df <- data.frame(
  col1 = c(NA, "b", "c"),
  col2 = c(1, 2, NA)
)
data_df[complete.cases(data_df), ]

  col1 col2
2    b    2

data_df <- data.frame(
  col1 = c(NA, "b", "c"),
  col2 = c(1, 2, NA)
)
data_df %>%
  filter(complete.cases(.))

  col1 col2
1    b    2

data_dt <- data.table(
  col1 = c(NA, "b", "c"),
  col2 = c(1, 2, NA)
)
na.omit(data_dt)

     col1  col2
   <char> <num>
1:      b     2

If you want, you can specify a column (or multiple columns):

data_pl$drop_nulls("col1")

shape: (2, 2)
┌──────┬──────┐
│ col1 ┆ col2 │
│ ---  ┆ ---  │
│ str  ┆ f64  │
╞══════╪══════╡
│ b    ┆ 2.0  │
│ c    ┆ null │
└──────┴──────┘

data_df[!is.na(data_df$col1), ]

  col1 col2
2    b    2
3    c   NA

data_df |>
  filter(!is.na(col1))

  col1 col2
1    b    2
2    c   NA

data_dt[complete.cases(data_dt[, .SD, .SDcols = "col1"]), ]

     col1  col2
   <char> <num>
1:      b     2
2:      c    NA

2.16 Strings methods

In polars, a lot of strings methods are useful. Here is the list.
To use them, simply prefix them with str.

2.16.1 Get substrings

The str$slice() method can be used to create substrings of the string values of a Utf8 Series.

str$slice() does not work like R base’s substr() function for finding the substring of interest: - substr() takes two arguments: the first and last elements; - str$slice() takes two arguments: the first element and the extraction length.

Important

With Polars, numeric default is 0! Thus the equivalent to str$slice(0,3) with Polars will be substr(1,3).

Two further comments:

If the second argument length is not specified, the sub-character string of interest will default to the end of the character string. For example in a DataFrame if mycol is a string column of length 4, pl.col("mycol").str.slice(1) is equivalent to substr(mycol,2,4) in dplyr.
The first argument accepts negative values, which means that sub-strings can be considered starting from the end. For example in a DataFrame if mycol is a string column of length 4, pl.col("mycol").str.slice(-2) is equivalent to substr(mycol,3,4) in dplyr.

Let’s see an example:

mydf <- data.frame(
  col1 = 1:4,
  col2 = c("One_X","One_Y","Two_X","Two_Y")
)

pl$DataFrame(mydf)$with_columns(
  pl$col("col2")$str$slice(0,length=3)$alias("level"),
  pl$col("col2")$str$slice(-1)$alias("x_y")
)

Warning in pl$DataFrame(mydf): Passing a `data.frame` or `RPolarsDataFrame` to
`pl$DataFrame()` is deprecated and will be removed in the future. Use
`as_polars_df()` instead.

shape: (4, 4)
┌──────┬───────┬───────┬─────┐
│ col1 ┆ col2  ┆ level ┆ x_y │
│ ---  ┆ ---   ┆ ---   ┆ --- │
│ i32  ┆ str   ┆ str   ┆ str │
╞══════╪═══════╪═══════╪═════╡
│ 1    ┆ One_X ┆ One   ┆ X   │
│ 2    ┆ One_Y ┆ One   ┆ Y   │
│ 3    ┆ Two_X ┆ Two   ┆ X   │
│ 4    ┆ Two_Y ┆ Two   ┆ Y   │
└──────┴───────┴───────┴─────┘

mydf$level <- substr(mydf$col2, 1, 3)
mydf$x_y <- substr(mydf$col2, nchar(mydf$col2), nchar(mydf$col2))
mydf

  col1  col2 level x_y
1    1 One_X   One   X
2    2 One_Y   One   Y
3    3 Two_X   Two   X
4    4 Two_Y   Two   Y

mydf |>
  mutate(level = substr(col2, 1, 3),
         x_y = substr(col2, nchar(col2), nchar(col2)))

  col1  col2 level x_y
1    1 One_X   One   X
2    2 One_Y   One   Y
3    3 Two_X   Two   X
4    4 Two_Y   Two   Y

mydt <- as.data.table(mydf) 
mydt[, c("level", "x_y") := .(substr(col2, 1, 3), substr(col2, nchar(col2), nchar(col2)))]
mydt

    col1   col2  level    x_y
   <int> <char> <char> <char>
1:     1  One_X    One      X
2:     2  One_Y    One      Y
3:     3  Two_X    Two      X
4:     4  Two_Y    Two      Y

2.16.2 Check if string values start with a substring

The str$starts_with() method can be used to check if string values start with a substring. It returns a Boolean.

Let’s see an example where we create new Boolean columns based on the start of a character string:

pl$DataFrame(mydf)$with_columns(
  pl$col("col2")$str$starts_with("One")$alias("is_one"),
  pl$col("col2")$str$starts_with("Two")$alias("is_two")
)

Warning in pl$DataFrame(mydf): Passing a `data.frame` or `RPolarsDataFrame` to
`pl$DataFrame()` is deprecated and will be removed in the future. Use
`as_polars_df()` instead.

shape: (4, 6)
┌──────┬───────┬───────┬─────┬────────┬────────┐
│ col1 ┆ col2  ┆ level ┆ x_y ┆ is_one ┆ is_two │
│ ---  ┆ ---   ┆ ---   ┆ --- ┆ ---    ┆ ---    │
│ i32  ┆ str   ┆ str   ┆ str ┆ bool   ┆ bool   │
╞══════╪═══════╪═══════╪═════╪════════╪════════╡
│ 1    ┆ One_X ┆ One   ┆ X   ┆ true   ┆ false  │
│ 2    ┆ One_Y ┆ One   ┆ Y   ┆ true   ┆ false  │
│ 3    ┆ Two_X ┆ Two   ┆ X   ┆ false  ┆ true   │
│ 4    ┆ Two_Y ┆ Two   ┆ Y   ┆ false  ┆ true   │
└──────┴───────┴───────┴─────┴────────┴────────┘

mydf$is_one <- grepl("^One", mydf$col2)
mydf$is_two <- grepl("^Two", mydf$col2)
mydf

  col1  col2 level x_y is_one is_two
1    1 One_X   One   X   TRUE  FALSE
2    2 One_Y   One   Y   TRUE  FALSE
3    3 Two_X   Two   X  FALSE   TRUE
4    4 Two_Y   Two   Y  FALSE   TRUE

mydf |>
  mutate(level = substr(col2, 1, 3),
         x_y = substr(col2, nchar(col2), nchar(col2)))

  col1  col2 level x_y is_one is_two
1    1 One_X   One   X   TRUE  FALSE
2    2 One_Y   One   Y   TRUE  FALSE
3    3 Two_X   Two   X  FALSE   TRUE
4    4 Two_Y   Two   Y  FALSE   TRUE

mydt <- as.data.table(mydf) 
mydt[, c("is_one", "is_two") := .(grepl("^One", col2), grepl("^Two", col2))]
mydt

    col1   col2  level    x_y is_one is_two
   <int> <char> <char> <char> <lgcl> <lgcl>
1:     1  One_X    One      X   TRUE  FALSE
2:     2  One_Y    One      Y   TRUE  FALSE
3:     3  Two_X    Two      X  FALSE   TRUE
4:     4  Two_Y    Two      Y  FALSE   TRUE

2.16.3 Check if string values end with a substring

The str$ends_with() method can be used to check if string values start with a substring. It returns a Boolean.

Let’s see an example where we filter the lines of a DataFrame based on the start of a character string:

pl$DataFrame(mydf)$filter(
  pl$col("col2")$str$ends_with("X")
)

Warning in pl$DataFrame(mydf): Passing a `data.frame` or `RPolarsDataFrame` to
`pl$DataFrame()` is deprecated and will be removed in the future. Use
`as_polars_df()` instead.

shape: (2, 6)
┌──────┬───────┬───────┬─────┬────────┬────────┐
│ col1 ┆ col2  ┆ level ┆ x_y ┆ is_one ┆ is_two │
│ ---  ┆ ---   ┆ ---   ┆ --- ┆ ---    ┆ ---    │
│ i32  ┆ str   ┆ str   ┆ str ┆ bool   ┆ bool   │
╞══════╪═══════╪═══════╪═════╪════════╪════════╡
│ 1    ┆ One_X ┆ One   ┆ X   ┆ true   ┆ false  │
│ 3    ┆ Two_X ┆ Two   ┆ X   ┆ false  ┆ true   │
└──────┴───────┴───────┴─────┴────────┴────────┘

mydf[substr(mydf$col2, nchar(mydf$col2), nchar(mydf$col2)) == "X", ]

  col1  col2 level x_y is_one is_two
1    1 One_X   One   X   TRUE  FALSE
3    3 Two_X   Two   X  FALSE   TRUE

mydf |>
  filter(endsWith(col2, "X"))

  col1  col2 level x_y is_one is_two
1    1 One_X   One   X   TRUE  FALSE
2    3 Two_X   Two   X  FALSE   TRUE

mydt <- as.data.table(mydf) 
mydt[substr(col2, nchar(col2), nchar(col2)) == "X"]

    col1   col2  level    x_y is_one is_two
   <int> <char> <char> <char> <lgcl> <lgcl>
1:     1  One_X    One      X   TRUE  FALSE
2:     3  Two_X    Two      X  FALSE   TRUE

2.17 Create your methods

With R you can create your own method/function with function().
Let’s try to create a R function to captue some DataFrame transformations.

Our simple function:
- Takes a DataFrame as an input (argument data)
- Convert Categorical columns into Strings
- Make all Strings columns uppercase
- And filter only the third first rows

fn_transformation <- function(data) {
  
  data$
    # Convert Integer columns into Float 
    with_columns(
      pl$col(pl$Int32)$cast(pl$Float64))$
    # Make all Strings columns uppercase
    with_columns(
      pl$col(pl$Utf8)$str$to_uppercase())$
    # Filter only the third first rows
    head(3)
  
}

Let’s apply our method to as_polars_df(iris):

fn_transformation(as_polars_df(mydf))

shape: (3, 6)
┌──────┬───────┬───────┬─────┬────────┬────────┐
│ col1 ┆ col2  ┆ level ┆ x_y ┆ is_one ┆ is_two │
│ ---  ┆ ---   ┆ ---   ┆ --- ┆ ---    ┆ ---    │
│ f64  ┆ str   ┆ str   ┆ str ┆ bool   ┆ bool   │
╞══════╪═══════╪═══════╪═════╪════════╪════════╡
│ 1.0  ┆ ONE_X ┆ ONE   ┆ X   ┆ true   ┆ false  │
│ 2.0  ┆ ONE_Y ┆ ONE   ┆ Y   ┆ true   ┆ false  │
│ 3.0  ┆ TWO_X ┆ TWO   ┆ X   ┆ false  ┆ true   │
└──────┴───────┴───────┴─────┴────────┴────────┘

Our method fn_transformation() can now be re-used multiple times for example on another datasets:

Newdf <- pl$DataFrame(
  col_categ = as_polars_series(factor(c("a","b","c"))),
  col_str = as_polars_series(c("x","y","z")),
  col_num = as_polars_series(1:3)
)
  
fn_transformation(Newdf)

shape: (3, 3)
┌───────────┬─────────┬─────────┐
│ col_categ ┆ col_str ┆ col_num │
│ ---       ┆ ---     ┆ ---     │
│ cat       ┆ str     ┆ f64     │
╞═══════════╪═════════╪═════════╡
│ a         ┆ X       ┆ 1.0     │
│ b         ┆ Y       ┆ 2.0     │
│ c         ┆ Z       ┆ 3.0     │
└───────────┴─────────┴─────────┘

Of course, in real life, we will create functions that are more complicated than our example.

2.1 Introduction to methods

2.2 Conversion between Series/DataFrames and vector/data.frames

2.2.1 From vector/data.frames to Series/DataFrames

2.2.2 From Series/DataFrames to vector/data.frames

2.3 Initial informations on a DataFrame

2.3.1 Get the schema

2.3.2 Get the dimensions

2.3.3 Get columns types

2.3.4 Get a glimpse

2.3.5 Get summary statistics

2.3.6 Get the rows number

2.3.7 Get the columns number

2.3.8 Get the columns names

2.3.9 Rename columns

2.3.10 Get the size

2.3.11 Get the first n rows

2.3.12 Count values in a DataFrame

2.3.13 Count NA over columns in a DataFrame

2.4 Filter rows

2.5 Select columns

2.5.1 Selecting by name

2.5.2 Selecting by data type

2.5.3 Selecting with a regex

2.5.4 Selecting using lists

2.5.5 Selecting - other ways

2.6 Modify/Add columns

2.6.1 Modify existing column

2.6.2 Add a new column

2.6.3 Add a new column with a constant value

2.6.4 Add a new column based on conditions

2.6.5 Add a new column by group

2.7 Rename columns

2.8 Remove columns

2.8.1 Removing by name

2.8.2 Removing with a regex

2.9 Aggregation by group

2.10 Sort a DataFrame

2.10.1 Simply sort a DataFrame

2.10.2 Keep unique rows

2.10.3 Keep some columns from sorted DataFrame

2.11 Join DataFrames

2.12 Concatenate DataFrames

2.12.1 Vertical concatenation

2.12.2 Horizontal concatenation

2.12.3 Diagonal concatenation

2.13 Pivot a DataFrame

2.13.1 From long to wide

2.13.2 From wide to long

2.14 Dealing with missing values

2.14.1 Check if Series has missing values

2.14.2 Replace missing values with a constant

2.14.3 Replace missing values with a strategy

2.14.4 Replace missing values with an expression

2.14.5 Replace missing values with a sequence of columns

2.15 Others useful methods

2.15.1 On Series

2.15.1.1 Change name of Series

2.15.1.2 Reduce Boolean Series

2.15.1.3 Get data type of Series

2.15.1.4 Cast

2.15.1.5 Check if Series is numeric

2.15.1.6 Check if Series is sorted

2.15.1.7 Get length of a Series

2.15.1.8 Check if Series are equal

2.15.1.9 Convert Series to Polars DataFrame

2.15.1.10 Get value Counts of a Series

2.15.1.11 Sum across Series

2.15.2 On DataFrames

2.15.2.1 Get Series from DataFrame

2.15.2.2 Get a R List from DataFrame

2.15.2.3 Get a slice of a DataFrame

2.15.2.4 Get a structure from a DataFrame

2.15.2.5 Drop all rows that contain null values

2.16 Strings methods

2.16.1 Get substrings

2.16.2 Check if string values start with a substring

2.16.3 Check if string values end with a substring

2.17 Create your methods

2.15.1.11 Sum across `Series`