# To get a character vector of column names
mydf$columns
[1] "col1" "col2"
# To get dimensions of DataFrame
mydf$shape
[1] 3 2
# We can mix standard R functions and methods
length(mydf$columns)
[1] 2
Polars supports method chaining in data manipulation operations, which is very useful. In this respect, Polars is closer to dplyr and data.table. This is how method chaining is defined in the official documentation:
In polars our method chaining syntax takes the form object$m1()$m2(), where object is our data object, and m1() and m2() are appropriate methods, like subsetting or aggregation expressions.
In the Polars code used above, you will notice that we have introduced line breaks. We could have written the whole code on the same line but for the sake of readability I prefer to separate the methods used by line breaks.
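As a minimal sketch of this layout, assuming the {polars} R package in the release used in this chapter (the one that emits the pre-0.17.0 warnings) and the built-in iris dataset, the chain below puts one method per line. The name first_rows is only illustrative.

```r
library(polars)

# One method per line: the trailing `$` tells R that the chain continues
first_rows <- pl$DataFrame(iris)$
  filter(pl$col("Petal.Length") > 1.4)$
  select(pl$col(c("Sepal.Length", "Petal.Length")))$
  head(3)

# Number of rows and columns of the result
first_rows$shape
```

Written on a single line, the same chain would read pl$DataFrame(iris)$filter(...)$select(...)$head(3).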
2.2 Conversion between Series/DataFrames and vector/data.frames
2.2.1 From vector/data.frames to Series/DataFrames
These conversions have already been seen earlier in this book.
# To convert vector to Polars Series
myvector <- pl$Series(c("a","b","c"))
Warning in pl$Series(c("a", "b", "c")): `pl$Series()` will handle unnamed arguments differently as of 0.17.0:
- until 0.17.0, the first argument corresponds to the values and the second argument to the name of the Series.
- as of 0.17.0, the first argument will correspond to the name and the second argument to the values.
Use named arguments in `pl$Series()` or replace `pl$Series(<values>, <name>)` by `as_polars_series(<values>, <name>)` to silence this warning.
# To convert data.frames to DataFrames
iris_polars <- pl$DataFrame(iris)
2.2.2 From Series/DataFrames to vector/data.frames
Here, we can use to_r() and to_data_frame() methods.
# To convert Polars Series to vector or list
myletters_serie$to_r()
[1] "a" "b" "c"
# To convert DataFrames to data.frames
mydf$to_data_frame()
col1 col2
1 1 a
2 2 b
3 3 c
2.3 Initial information on a DataFrame
Here is a list of instructions that I frequently use to quickly get information about a DataFrame.
2.3.1 Get the schema
A DataFrame has a schema attribute which is a named list of DataTypes.
mydf$schema
$col1
DataType: Int32
$col2
DataType: String
# This works also on LazyFrame
mydf$lazy()$schema
$col1
DataType: Int32
$col2
DataType: String
2.3.2 Get the dimensions
A DataFrame has a shape attribute, which is a numeric vector of length two: c(nrows, ncols).
mydf$shape
[1] 3 2
2.3.3 Get column types
A DataFrame has a dtypes attribute, which is a list of dtypes (data types), one for every column of the DataFrame.
Alternatively, the dtype_strings() method can be used to get the column types as a character vector.
# With dtypes attribute (with a "s" and without parentheses)
mydf$dtypes
[[1]]
DataType: Int32
[[2]]
DataType: String
# This works also on LazyFrame
mydf$lazy()$dtypes
[[1]]
DataType: Int32
[[2]]
DataType: String
# With dtype_strings() method (without a "s" and with parentheses)
mydf$dtype_strings()
[1] "i32" "str"
2.3.4 Get a glimpse
You can access a dense preview of a DataFrame by using the glimpse() method. The output is formatted with one line per column, so wide DataFrames display nicely. Each line shows the column name, the data type and the first few values.
The value_counts() method can be used to count values in a Series of a DataFrame. value_counts() works with a Series. It must therefore be supplied either with square brackets or with the select() method. See here to learn about it.
It's convenient when you have to quickly inspect your data. But you'll quickly be limited by the square brackets, as they don't accept conditions written as expressions. For example, pl$DataFrame(iris)[Petal.Length > 6] doesn't work.
The second and best option is to use the filter() method. It must be used with a Polars expression, here the col() method, which designates the columns on which the filter condition is applied.
Let's look in detail at what's inside a filter() method with an example:
pl$col("Petal.Length"): this expression selects the Petal.Length column from iris;
>6: applies a Boolean condition to this expression (for all Petals that have a length > 6).
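Put together, the two pieces described above could look like the following sketch. It assumes the {polars} R package is attached and uses the built-in iris dataset; the name long_petals is illustrative.

```r
library(polars)

# Keep only the rows whose Petal.Length is greater than 6
long_petals <- pl$DataFrame(iris)$filter(pl$col("Petal.Length") > 6)

# Number of rows and columns that satisfy the condition
long_petals$shape
```

The base R counterpart of this filter is iris[iris$Petal.Length > 6, ].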
In the example below, we use the & operator to apply multiple conditions in the filter() method:
shape: (2, 1)
┌──────┐
│ colA │
│ ---  │
│ str  │
╞══════╡
│ a    │
│ b    │
└──────┘
Another reason for using the filter() method is that filter expressions can be optimised in lazy mode by the query optimiser. Square brackets [] can only be used in eager mode.
Tip
There is another way to speed up filter processing on rows: tell polars that the column(s) used to filter rows are already sorted! To do this, you can use the set_sorted() method.
Here's an example:
microbenchmark(
  "Without telling col1 is sorted" = mydf$filter(pl$col("col1") < 100),
  "Telling col1 is sorted" = mydf$with_columns(pl$col("col1")$set_sorted())$filter(pl$col("col1") < 100)
)
Unit: milliseconds
                           expr      min       lq     mean   median       uq      max neval
 Without telling col1 is sorted 6.557458 6.896630 7.235332 7.032378 7.248586 12.53200   100
         Telling col1 is sorted 3.868157 4.142338 4.465664 4.343943 4.534148 10.22463   100
2.5 Select columns
2.5.1 Selecting by name
The first option for selecting columns of a DataFrame is to use square brackets [].
The second option is to use the select() method. In this case, it must be used with the col() method, which designates the columns to be selected with a character vector.
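A short sketch of the second option, assuming the {polars} R package is attached and using the built-in iris dataset (the name two_cols is illustrative):

```r
library(polars)

# Select two columns by name with select() and col()
two_cols <- pl$DataFrame(iris)$select(pl$col(c("Petal.Length", "Species")))

# The result is a DataFrame holding only the requested columns
two_cols$columns
```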
Keep in mind that:
- When square brackets are used:
  - selecting only one column from a DataFrame returns a Series;
  - selecting more than one column returns a DataFrame.
- When the select() method is used:
  - the output is always a DataFrame rather than a Series, even if only one column is selected.
=> If you need a Series you can use the to_series() method. See here.
With Polars, if you want to obtain the result as an R data.frame, you can simply add the to_data_frame() method at the end of the method chain. See here for examples.
Tip
Beyond the minor differences discussed above, there are two major reasons why you should use the select() method over the syntax with square brackets:
- When you select and transform multiple columns with select() method, Polars will run these selections in parallel;
- Expressions used in select() method can be optimised in lazy mode by the query optimizer.
Finally, the select() method can also be used to re-order columns of a DataFrame.
For example, to re-order the columns in alphabetical order:
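One possible way to do this, sketched under the assumption that the {polars} R package is attached: sort the column names with base R's sort() and pass the sorted character vector to pl$col(). The names iris_pl and reordered are illustrative.

```r
library(polars)

iris_pl <- pl$DataFrame(iris)

# sort() orders the column names alphabetically; select() applies that order
reordered <- iris_pl$select(pl$col(sort(iris_pl$columns)))

reordered$columns
```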
To select columns with a regex from a DataFrame, you can pass it to the pl$col expression. For example, to select all columns that start with "Sepal" in the iris dataset:
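A hedged sketch of such a regex selection, assuming the {polars} R package is attached. Polars treats a string passed to pl$col() as a regular expression when it starts with ^ and ends with $; the name sepal_cols is illustrative.

```r
library(polars)

# The pattern is anchored with ^ and $ so that Polars reads it as a regex
sepal_cols <- pl$DataFrame(iris)$select(pl$col("^Sepal.*$"))

sepal_cols$columns
```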
The selectors API of polars provides other methods to select columns.
Here are a few examples selecting the first and last column of a DataFrame with the first() and last() methods respectively:
# Select the first column
pl$DataFrame(iris)$select(pl$first())$head(3)
Similar to the dplyr package, the select() method can also be used to modify existing data. However, the result will exclude any columns that were not specified in the expression.
For example, to get the Petal.Length column of iris rounded to zero decimals:
pl$DataFrame(iris)$select(
  pl$col("Petal.Length")$round(decimals = 0)
)$head(3) # display the first 3 lines
If you want to add a column to a data.frame, you use the same syntax as above with with_columns(). Simply use the alias() method to specify the name of the newly created column.
pl$DataFrame(iris)$with_columns( pl$col("Petal.Length")$round(decimals =0)$alias("Petal.Length.rounded"))$head(3) # display the first 3 lines
If you need to create a new column with a constant value (i.e. the same value for all the rows in your DataFrame), you can use the literal lit() method. It works with the main types of Polars.
pl$DataFrame(iris)$with_columns( pl$lit("toto")$alias("mynewcolumn"))$head(3) # display the first 3 lines
pl$DataFrame(iris)$with_columns(
  pl$when(pl$col("Petal.Length") <= 2)$then(pl$lit("<=2"))$
    when(pl$col("Petal.Length") <= 5)$then(pl$lit("<=5"))$
    otherwise(pl$lit(">5"))$
    alias("mygroups")
  # we only need to display 2 variables to check that it's OK
)$select(
  pl$col(c("Petal.Length", "mygroups"))
)[c(1, 2, 59, 150), ]

iris$mygroups <- ifelse(iris$Petal.Length <= 2, "<=2",
  ifelse(iris$Petal.Length <= 5, "<=5", ">5"))
# we only need to display 2 variables to check that it's OK
iris[c(1, 2, 59, 150), c("Petal.Length", "mygroups")]

iris |>
  mutate(mygroups = case_when(
    Petal.Length <= 2 ~ "<=2",
    Petal.Length <= 5 ~ "<=5",
    .default = ">5"
  )) |>
  # we only need to display 2 variables to check that it's OK
  select(Petal.Length, mygroups) |>
  slice(1, 2, 59, 150)

iris_dt[, mygroups := case_when(
  Petal.Length <= 2 ~ "<=2",
  Petal.Length <= 5 ~ "<=5",
  TRUE ~ ">5"
)]
# we only need to display 2 variables to check that it's OK
iris_dt[c(1, 2, 59, 150), .(Petal.Length, mygroups)]
To add new columns by group, the over expression must be used. This expression is similar to performing a group-by aggregation and joining the result back into the original DataFrame. Here's an example with equivalent syntax to help you understand:
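The sketch below shows what such an over expression could look like. It assumes the {polars} R package is attached and a data.frame named df whose values are copied from the output printed in this section; the name with_max is illustrative.

```r
library(polars)

# Values taken from the output printed in this section
df <- data.frame(
  name = c("X", "X", "Y", "Y", "Z", "Z"),
  adress = c("A", "B", "C", "D", "E", "F"),
  col2 = c(2L, 4L, 1L, 3L, 4L, 2L),
  col3 = c(5L, 19L, 17L, 12L, 11L, 15L)
)

# Maximum of col3 within each name, repeated on every row of the group
with_max <- pl$DataFrame(df)$with_columns(
  pl$col("col3")$max()$over("name")$alias("col3_max")
)
with_max
```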
shape: (6, 5)
┌──────┬────────┬──────┬──────┬──────────┐
│ name ┆ adress ┆ col2 ┆ col3 ┆ col3_max │
│ ---  ┆ ---    ┆ ---  ┆ ---  ┆ ---      │
│ str  ┆ str    ┆ i32  ┆ i32  ┆ i32      │
╞══════╪════════╪══════╪══════╪══════════╡
│ X    ┆ A      ┆ 2    ┆ 5    ┆ 19       │
│ X    ┆ B      ┆ 4    ┆ 19   ┆ 19       │
│ Y    ┆ C      ┆ 1    ┆ 17   ┆ 17       │
│ Y    ┆ D      ┆ 3    ┆ 12   ┆ 17       │
│ Z    ┆ E      ┆ 4    ┆ 11   ┆ 15       │
│ Z    ┆ F      ┆ 2    ┆ 15   ┆ 15       │
└──────┴────────┴──────┴──────┴──────────┘
result <- aggregate(col3 ~ name, data = df, FUN = max)
colnames(result) <- c("name", "col3_max")
merge(df, result, by = "name", all.x = TRUE)
name adress col2 col3 col3_max
1 X A 2 5 19
2 X B 4 19 19
3 Y C 1 17 17
4 Y D 3 12 17
5 Z E 4 11 15
6 Z F 2 15 15
df |>
  group_by(name) |>
  mutate(col3_max = max(col3))
# A tibble: 6 × 5
# Groups: name [3]
name adress col2 col3 col3_max
<chr> <chr> <int> <int> <int>
1 X A 2 5 19
2 X B 4 19 19
3 Y C 1 17 17
4 Y D 3 12 17
5 Z E 4 11 15
6 Z F 2 15 15
dt <- as.data.table(df)
dt[, col3_max := max(col3), by = name]
dt
name adress col2 col3 col3_max
<char> <char> <int> <int> <int>
1: X A 2 5 19
2: X B 4 19 19
3: Y C 1 17 17
4: Y D 3 12 17
5: Z E 4 11 15
6: Z F 2 15 15
Tip
If you need to pass multiple column names in the over expression, you can either list them like this over("name","adress") or - more conveniently - use a character vector over(c("name","adress")).
2.7 Rename columns
Similar to the dplyr package, the rename() method can be used to rename existing columns.
The renaming logic is identical to that of dplyr, and is performed as follows: new_name="old_name".
Note
Note the double quotes "" surrounding the name of the old variable to be renamed, which are not needed with dplyr (see examples below).
data(iris)
pl$DataFrame(iris)$rename(
  sepal_length = "Sepal.Length",
  sepal_width = "Sepal.Width",
  `length of petal` = "Petal.Length",
  `width of petal` = "Petal.Width",
  species = "Species"
)$columns
[1] "sepal_length" "sepal_width" "length of petal" "width of petal"
[5] "species"
data(iris)
names(iris) <- c("sepal_length", "sepal_width", "length of petal", "width of petal", "species")
names(iris)
[1] "sepal_length" "sepal_width" "length of petal" "width of petal"
[5] "species"
data(iris)
iris |>
  rename(
    sepal_length = Sepal.Length,
    sepal_width = Sepal.Width,
    `length of petal` = Petal.Length,
    `width of petal` = Petal.Width,
    species = Species
  ) |>
  names()
[1] "sepal_length" "sepal_width" "length of petal" "width of petal"
[5] "species"
To remove columns from a DataFrame with a regex, use drop().
Let's see an example where you want to drop the columns whose names start with "Petal" in iris.
All the approaches we've seen in the section about selecting columns from a DataFrame (by name, by data type and with a list) also work with the drop() method!
2.9 Aggregation by group
Another frequently used data manipulation is the aggregation of data by group. To do this, the group_by() method indicates which column is used to group the DataFrame, and the agg() method specifies the expression to aggregate.
The methods available for the agg() method are (in each group):
first() get the first element
last() get the last element
n_unique() get the number of unique elements
count() get the number of elements
sum() sum the elements
min() get the smallest element
max() get the largest element
mean() get the average of elements
median() get the median
quantile() calculate quantiles
Here's a minimal example with sum applied to 2 different columns:
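A sketch of such an aggregation, assuming the {polars} R package is attached and using the built-in iris dataset (the name sums is illustrative). Passing a character vector to pl$col() applies sum() to both columns at once.

```r
library(polars)

# Sum of Petal.Length and Petal.Width within each Species
sums <- pl$DataFrame(iris)$group_by("Species")$agg(
  pl$col(c("Petal.Length", "Petal.Width"))$sum()
)
sums
```

The order of the groups in the output is not guaranteed unless you sort the result afterwards.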
Be careful! Calling multiple aggregations on the same column produces columns of the same name which generates an error with R.
You can use the alias() or suffix() method to ensure column names are unique. For example:
pl$DataFrame(iris)$group_by("Species")$agg(
  # With alias()
  pl$col(c("Petal.Length"))$sum()$alias("Petal.Length_Sum"),
  pl$col(c("Petal.Length"))$mean()$alias("Petal.Length_Mean"),
  # With suffix()
  pl$col(c("Petal.Width"))$sum()$name$suffix("_Sum"),
  pl$col(c("Petal.Width"))$mean()$name$suffix("_Mean")
)
# Sort by two columns, one in decreasing order and the other in increasing order
pl$DataFrame(iris)$sort(c("Species", "Petal.Length"), descending = c(TRUE, FALSE))$head(3)

# With two columns, keeping last entry and maintaining the same order
pl$DataFrame(
  x = c(1L, 1:3, 3L),
  y = c(1L, 1:3, 3L),
  z = c(1L, 1:3, 4L)
)$unique(
  subset = c("x", "y"),
  keep = "last",
  maintain_order = TRUE
)

# With two columns, keeping last entry and maintaining the same order
mytest <- data.frame(
  x = c(1L, 1:3, 3L),
  y = c(1L, 1:3, 3L),
  z = c(1L, 1:3, 4L)
)
aggregate(. ~ x + y, data = mytest, FUN = tail, n = 1)
x y z
1 1 1 1
2 2 2 2
3 3 3 4
# With one column
iris |>
  distinct(Species, .keep_all = TRUE)

# With two columns, keeping last entry and maintaining the same order
data.frame(
  x = c(1L, 1:3, 3L),
  y = c(1L, 1:3, 3L),
  z = c(1L, 1:3, 4L)
) |>
  group_by(x, y) |>
  slice_tail() |>
  ungroup()
# A tibble: 3 × 3
x y z
<int> <int> <int>
1 1 1 1
2 2 2 2
3 3 3 4
# With two columns, keeping last entry and maintaining the same order
mytest_dt <- data.table(
  x = c(1L, 1:3, 3L),
  y = c(1L, 1:3, 3L),
  z = c(1L, 1:3, 4L)
)
unique(mytest_dt, by = c("x", "y"), fromLast = TRUE)
x y z
<int> <int> <int>
1: 1 1 1
2: 2 2 2
3: 3 3 4
2.10.3 Keep some columns from sorted DataFrame
If you want to keep only some columns from a sorted DataFrame, you can use the sort_by() method with the select() method.
In detail, sort_by() can sort a column by the ordering of another column, or of multiple other columns.
It's the equivalent of the order() function in base R.
If you want to use multiple columns/expressions, you can pass them in a list, for example sort_by(list("Petal.Width","Sepal.Width")) or sort_by(list("Petal.Width", pl$col("Sepal.Width"))).
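As a small sketch of sort_by() inside select(), assuming the {polars} R package is attached: the scores data and the column names player and points are hypothetical, chosen without ties so that the ordering is unambiguous.

```r
library(polars)

# Hypothetical data: one row per player, no tied scores
scores_df <- data.frame(
  player = c("ann", "bob", "cy"),
  points = c(12, 30, 21)
)

# Keep only the player column, ordered by increasing points
ranked <- pl$DataFrame(scores_df)$select(
  pl$col("player")$sort_by("points")
)
ranked
```

The base R counterpart is scores_df$player[order(scores_df$points)].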
2.11 Join DataFrames
To perform joins, the join() method must be used.
Multiple strategies are available:
"inner": returns rows with matching keys in both frames. Non-matching rows in either the left or right frame are discarded.
"left": returns all rows in the left DataFrame, whether or not a match in the right frame is found. Non-matching rows have their right columns null-filled.
"outer": returns all rows from both the left and right DataFrame. If no match is found in one frame, columns from the other frame are null-filled.
"semi": returns all rows from the left frame in which the join key is also present in the right frame.
"anti": returns all rows from the left frame in which the join key is not present in the right frame.
"cross": returns the Cartesian product of all rows from the left frame with all rows from the right frame. Duplicate rows are retained. The table length of A cross-joined with B is always len(A) × len(B).
The main arguments are:
- on: name(s) of the join columns in both DataFrames.
- how: join strategy.
- suffix: suffix to append to columns with a duplicate name.
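The sketch below illustrates three of these strategies on two small frames, assuming the {polars} R package is attached. The frame names colors and values, the column names, and the key "tutu" are hypothetical and were chosen so that the strategies return different rows.

```r
library(polars)

# Two hypothetical frames sharing the join column `key`
colors <- pl$DataFrame(data.frame(
  key = c("toto", "titi", "tata"),
  color = c("blue", "red", "yellow")
))
values <- pl$DataFrame(data.frame(
  key = c("toto", "titi", "tutu"),
  value = c(10, 20, 30)
))

# Rows whose key is present in both frames
inner <- colors$join(values, on = "key", how = "inner")

# All rows of `colors`; `value` is null where no match exists
left <- colors$join(values, on = "key", how = "left")

# Rows of `colors` whose key is absent from `values`
anti <- colors$join(values, on = "key", how = "anti")
```

Here the inner join keeps "toto" and "titi", the left join keeps all three keys of colors, and the anti join keeps only "tata".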
shape: (3, 2)
┌──────┬──────┐
│ col1 ┆ col2 │
│ ---  ┆ ---  │
│ str  ┆ str  │
╞══════╪══════╡
│ a    ┆ x    │
│ b    ┆ y    │
│ c    ┆ z    │
└──────┴──────┘
cbind(dfleft_df,dfright_df)
col1 col2
1 a x
2 b y
3 c z
cbind(list(dfleft_df,dfright_df))
[,1]
[1,] data.frame,1
[2,] data.frame,1
data.table(dfleft_dt, dfright_dt)
col1 col2
<char> <char>
1: a x
2: b y
3: c z
2.12.3 Diagonal concatenation
To concatenate multiple DataFrames diagonally, you can use the concat() method and the argument how = "diagonal".
Diagonal concatenation is useful when the column names are not identical in initial DataFrames.
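A minimal sketch of a diagonal concatenation, assuming the {polars} R package is attached and that pl$concat() accepts the frames wrapped in a list, as in the release used in this chapter. The frames df_a and df_b and their column names are hypothetical.

```r
library(polars)

# Two hypothetical frames that share only the column col1
df_a <- pl$DataFrame(data.frame(col1 = c("a", "b"), col2 = c("x", "y")))
df_b <- pl$DataFrame(data.frame(col1 = "c", col3 = "z"))

# Rows are stacked; columns missing from one frame are filled with null
stacked <- pl$concat(list(df_a, df_b), how = "diagonal")
stacked
```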
country city location population
1 France Paris North 2.1
2 France Lille North 0.2
3 France Nice South 0.4
4 Italy Roma South 2.8
5 Italy Milan North 1.4
6 Italy Napoli South 3.0
reshape(df[,-which(names(df) %in%c("location"))], idvar ="country", timevar ="city", direction ="wide")
country population.Paris population.Lille population.Nice population.Roma
1 France 2.1 0.2 0.4 NA
4 Italy NA NA NA 2.8
population.Milan population.Napoli
1 NA NA
4 1.4 3
df |>
  pivot_wider(
    id_cols = country,
    names_from = city,
    values_from = population
  )
# A tibble: 2 × 7
country Paris Lille Nice Roma Milan Napoli
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 France 2.1 0.2 0.4 NA NA NA
2 Italy NA NA NA 2.8 1.4 3
df_dt <- as.data.table(df)
dcast(df_dt, country ~ city, value.var = "population")
Key: <country>
country Lille Milan Napoli Nice Paris Roma
<char> <num> <num> <num> <num> <num> <num>
1: France 0.2 NA NA 0.4 2.1 NA
2: Italy NA 1.4 3 NA NA 2.8
You can also aggregate the results using a function that you enter in the argument aggregate_function in pivot() method.
In this case, the aggregate_function argument of pivot() is the equivalent of values_fn of pivot_wider() from {tidyr} and fun.aggregate of dcast() from {data.table}.
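On the polars side, an aggregated pivot could be sketched as follows. This assumes the {polars} R package in the release used in this chapter, where pivot() takes the arguments values, index and columns (later releases may name them differently). The data.frame df reproduces the values printed in this section; the name wide is illustrative.

```r
library(polars)

# Values taken from the data.frame printed in this section
df <- data.frame(
  country = c("France", "France", "France", "Italy", "Italy", "Italy"),
  city = c("Paris", "Lille", "Nice", "Roma", "Milan", "Napoli"),
  location = c("North", "North", "South", "South", "North", "South"),
  population = c(2.1, 0.2, 0.4, 2.8, 1.4, 3.0)
)

# One row per country, one column per location, mean population in each cell
wide <- pl$DataFrame(df)$pivot(
  values = "population",
  index = "country",
  columns = "location",
  aggregate_function = "mean"
)
wide
```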
# A tibble: 2 × 3
country North South
<chr> <dbl> <dbl>
1 France 1.15 0.4
2 Italy 1.4 2.9
df_dt <- as.data.table(df)
dcast(df_dt, country ~ location, value.var = "population", fun.aggregate = mean)
Key: <country>
country North South
<char> <num> <num>
1: France 1.15 0.4
2: Italy 1.40 2.9
However with {polars}, we can also run an expression as an aggregation function.
With {tidyr} and {data.table}, you need to calculate this in advance.
For example:
country variable value
<char> <fctr> <num>
1: France North 1.1
2: Italy North 1.4
3: France South 0.4
4: Italy South 2.9
2.14 Dealing with missing values
We have already introduced missing values here. In this section, we will go further and understand how to deal with missing values with polars and R.
2.14.1 Check if Series has missing values
The is_null() and is_not_null() methods can be used to check whether a Series has missing values.
These methods are the equivalent of is.na() and !is.na() in base R.
Let's see two examples combining these methods with the select() and filter() methods:
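The two combinations could be sketched as below, assuming the {polars} R package is attached. The frame mydfNA is rebuilt here from a plain data.frame with the same two columns used in the rest of this section; flags and complete_a are illustrative names.

```r
library(polars)

# Two string columns containing missing values
mydfNA <- pl$DataFrame(data.frame(
  colA = c("a", NA, "c"),
  colB = c("d", NA, NA)
))

# select() + is_null(): TRUE where colA is missing
flags <- mydfNA$select(pl$col("colA")$is_null())

# filter() + is_not_null(): keep the rows where colA is present
complete_a <- mydfNA$filter(pl$col("colA")$is_not_null())
```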
# In the same columns
mydfNA$with_columns(
  pl$all()$fill_null("missing")
)
shape: (3, 2)
┌─────────┬─────────┐
│ colA    ┆ colB    │
│ ---     ┆ ---     │
│ str     ┆ str     │
╞═════════╪═════════╡
│ a       ┆ d       │
│ missing ┆ missing │
│ c       ┆ missing │
└─────────┴─────────┘
# In new columns suffixed by "_corrected"
mydfNA$with_columns(
  pl$all()$fill_null("missing")$name$suffix("_corrected")
)
shape: (3, 4)
┌──────┬──────┬────────────────┬────────────────┐
│ colA ┆ colB ┆ colA_corrected ┆ colB_corrected │
│ ---  ┆ ---  ┆ ---            ┆ ---            │
│ str  ┆ str  ┆ str            ┆ str            │
╞══════╪══════╪════════════════╪════════════════╡
│ a    ┆ d    ┆ a              ┆ d              │
│ null ┆ null ┆ missing        ┆ missing        │
│ c    ┆ null ┆ c              ┆ missing        │
└──────┴──────┴────────────────┴────────────────┘
mydfNA2 <- mydfNA$to_data_frame()
# In the same columns
mydfNA2[is.na(mydfNA2)] <- "missing"
mydfNA2
colA colB
1 a d
2 missing missing
3 c missing
# In new columns suffixed by "_corrected"
mydfNA2 <- mydfNA$to_data_frame()
transform(
  mydfNA2,
  colA_corrected = ifelse(is.na(colA), "missing", colA),
  colB_corrected = ifelse(is.na(colB), "missing", colB)
)
colA colB colA_corrected colB_corrected
1 a d a d
2 <NA> <NA> missing missing
3 c <NA> c missing
mydfNA2 <- mydfNA$to_data_frame()
# In the same columns
mydfNA2 %>%
  mutate(across(everything(), ~ ifelse(is.na(.), "missing", .)))
colA colB
1 a d
2 missing missing
3 c missing
# In new columns suffixed by "_corrected"
mydfNA2 %>%
  mutate(across(c(colA, colB), ~ ifelse(is.na(.), "missing", .), .names = "{col}_corrected"))
colA colB colA_corrected colB_corrected
1 a d a d
2 <NA> <NA> missing missing
3 c <NA> c missing
mydfNA2 <- mydfNA$to_data_frame()
mydfNA2_dt <- as.data.table(mydfNA2)
# In the same columns
mydfNA2_dt[is.na(mydfNA2_dt)] <- "missing"
mydfNA2_dt
colA colB
<char> <char>
1: a d
2: missing missing
3: c missing
# In new columns suffixed by "_corrected"
mydfNA2_dt[, c("colA_corrected", "colB_corrected") := lapply(.SD, function(x) ifelse(is.na(x), "missing", x)), .SDcols = c("colA", "colB")]
mydfNA2_dt
colA colB colA_corrected colB_corrected
<char> <char> <char> <char>
1: a d a d
2: missing missing missing missing
3: c missing c missing
Important
Be careful, the fill_null() method can in some cases modify data types, like cast(). This can happen, for example, when you're working on an integer column and you want to replace the missing values with a string => the column will then have a string dtype!
2.14.3 Replace missing values with a strategy
The fill_null() method of polars has a strategy argument for replacing missing values:
forward: replace with the previous non-null value in the Series
backward: replace with the next non-null value in the Series
min: replace with the smallest value in the Series
max: replace with the largest value in the Series
mean: replace with the mean value in the Series
zero: replace with 0
one: replace with 1
Note
We can set a limit on how many rows to fill forward or backward with the limit argument.
Here are some examples:
# In the same columns
mydfNA <- pl$DataFrame(
  colA = pl$Series(c("a", NA, "c")),
  colB = pl$Series(c("d", NA, NA)),
  colC = pl$Series(c(1, NA, 3))
)
# With forward strategy
mydfNA$with_columns(
  pl$all()$fill_null(strategy = "forward")$name$suffix("_corrected")
)
shape: (3, 6)
┌──────┬──────┬──────┬────────────────┬────────────────┬────────────────┐
│ colA ┆ colB ┆ colC ┆ colA_corrected ┆ colB_corrected ┆ colC_corrected │
│ ---  ┆ ---  ┆ ---  ┆ ---            ┆ ---            ┆ ---            │
│ str  ┆ str  ┆ f64  ┆ str            ┆ str            ┆ f64            │
╞══════╪══════╪══════╪════════════════╪════════════════╪════════════════╡
│ a    ┆ d    ┆ 1.0  ┆ a              ┆ d              ┆ 1.0            │
│ null ┆ null ┆ null ┆ a              ┆ d              ┆ 1.0            │
│ c    ┆ null ┆ 3.0  ┆ c              ┆ d              ┆ 3.0            │
└──────┴──────┴──────┴────────────────┴────────────────┴────────────────┘
# With forward strategy and a limit
mydfNA$with_columns(
  pl$all()$fill_null(strategy = "forward", limit = 1)$name$suffix("_corrected")
)
shape: (3, 6)
┌──────┬──────┬──────┬────────────────┬────────────────┬────────────────┐
│ colA ┆ colB ┆ colC ┆ colA_corrected ┆ colB_corrected ┆ colC_corrected │
│ ---  ┆ ---  ┆ ---  ┆ ---            ┆ ---            ┆ ---            │
│ str  ┆ str  ┆ f64  ┆ str            ┆ str            ┆ f64            │
╞══════╪══════╪══════╪════════════════╪════════════════╪════════════════╡
│ a    ┆ d    ┆ 1.0  ┆ a              ┆ d              ┆ 1.0            │
│ null ┆ null ┆ null ┆ a              ┆ d              ┆ 1.0            │
│ c    ┆ null ┆ 3.0  ┆ c              ┆ null           ┆ 3.0            │
└──────┴──────┴──────┴────────────────┴────────────────┴────────────────┘
# With backward strategy
mydfNA$with_columns(
  pl$all()$fill_null(strategy = "backward")$name$suffix("_corrected")
)
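For readers coming from R base, the forward strategy and its limit argument can be emulated with a small loop. This is only a sketch of the idea: fill_forward() is a hypothetical helper written for illustration, not a Polars or R base function.

```r
# Hypothetical helper: forward-fill NA values in a vector.
# `limit` caps the number of consecutive missing values that get filled.
fill_forward <- function(x, limit = Inf) {
  run <- 0  # consecutive fills since the last observed (non-missing) value
  for (i in seq_along(x)) {
    if (is.na(x[i])) {
      # Fill only if the previous element is known and the cap is not reached
      if (i > 1 && !is.na(x[i - 1]) && run < limit) {
        x[i] <- x[i - 1]
        run <- run + 1
      }
    } else {
      run <- 0
    }
  }
  x
}

colB <- c("d", NA, NA)
fill_forward(colB)             # "d" "d" "d"
fill_forward(colB, limit = 1)  # "d" "d" NA
```

With limit = 1 only the first missing value after "d" is filled, which matches the colB_corrected column obtained with Polars above. A leading missing value is left untouched because there is nothing to carry forward.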
Of course, you are not limited to the built-in strategies of polars: fill_null() also accepts expressions to replace missing values. The expression can refer to the same column or to another column.
# Replace missing values with the mean of the non-null values for that column
mydfNA$with_columns(
  pl$col("colC")$fill_null(pl$mean("colC"))$name$suffix("_corrected")
)
# Replace missing values with the values from another column
mydfNA$with_columns(
  pl$col("colB")$fill_null(pl$col("colA"))$name$suffix("_corrected")
)
shape: (3, 4)
┌──────┬──────┬──────┬────────────────┐
│ colA ┆ colB ┆ colC ┆ colB_corrected │
│ ---  ┆ ---  ┆ ---  ┆ ---            │
│ str  ┆ str  ┆ f64  ┆ str            │
╞══════╪══════╪══════╪════════════════╡
│ a    ┆ d    ┆ 1.0  ┆ d              │
│ null ┆ null ┆ null ┆ null           │
│ c    ┆ null ┆ 3.0  ┆ c              │
└──────┴──────┴──────┴────────────────┘
mydfNA2 <- mydfNA$to_data_frame()
# Replace missing values with the mean of the non-null values for that column
mydfNA2$colC_corrected <- ifelse(is.na(mydfNA2$colC), mean(mydfNA2$colC, na.rm = TRUE), mydfNA2$colC)
mydfNA2 <- mydfNA$to_data_frame()
# Replace missing values with the values from another column
mydfNA2$colB_corrected <- ifelse(is.na(mydfNA2$colB), mydfNA2$colA, mydfNA2$colB)
mydfNA2
colA colB colC colB_corrected
1 a d 1 d
2 <NA> <NA> NA <NA>
3 c <NA> 3 c
mydfNA2 <- mydfNA$to_data_frame()
# Replace missing values with the mean of the non-null values for that column
mydfNA2 %>%
  mutate(colC_corrected = ifelse(is.na(colC), mean(mydfNA2$colC, na.rm = TRUE), colC))
colA colB colC colC_corrected
1 a d 1 1
2 <NA> <NA> NA 2
3 c <NA> 3 3
# Replace missing values with the values from another column
mydfNA2 %>%
  mutate(colB_corrected = ifelse(is.na(colB), colA, colB))
colA colB colC colB_corrected
1 a d 1 d
2 <NA> <NA> NA <NA>
3 c <NA> 3 c
mydfNA2 <- mydfNA$to_data_frame()
mydfNA2_dt <- as.data.table(mydfNA2)
# Replace missing values with the mean of the non-null values for that column
mydfNA2_dt[, colC_corrected := ifelse(is.na(colC), mean(mydfNA2_dt$colC, na.rm = TRUE), colC)]
mydfNA2_dt
colA colB colC colC_corrected
<char> <char> <num> <num>
1: a d 1 1
2: <NA> <NA> NA 2
3: c <NA> 3 3
# Replace missing values with the values from another column
mydfNA2_dt[, colB_corrected := ifelse(is.na(colB), colA, colB)]
mydfNA2_dt
colA colB colC colC_corrected colB_corrected
<char> <char> <num> <num> <char>
1: a d 1 1 d
2: <NA> <NA> NA 2 <NA>
3: c <NA> 3 3 c
2.14.5 Replace missing values with a sequence of columns
The coalesce() method can be used to replace missing values based on a sequence of columns.
Here's an example creating a new column "col4" that has the first non-null value as we go through some columns (in order!):
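The "first non-null value, in order" rule can be sketched in R base. The data frame df, its columns col1 to col3 and the helper coalesce_cols() are assumptions made for illustration; this is not the Polars coalesce() call itself.

```r
# Hypothetical data: three character columns with missing values
df <- data.frame(
  col1 = c(NA, "y", NA),
  col2 = c(NA, "v", "w"),
  col3 = c("r", "s", NA),
  stringsAsFactors = FALSE
)

# Hypothetical helper: walk through the vectors in order and keep,
# for each position, the first value that is not missing
coalesce_cols <- function(...) {
  Reduce(function(acc, nxt) ifelse(is.na(acc), nxt, acc), list(...))
}

df$col4 <- coalesce_cols(df$col1, df$col2, df$col3)
df$col4  # "r" "y" "w"
```

The order of the arguments matters: in the second row both col1 and col2 hold a value, and col1 wins because it comes first.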
pl$Series(c(TRUE, TRUE, NA))$all()
[1] FALSE
pl$Series(c(TRUE, TRUE, FALSE))$all()
[1] FALSE
pl$Series(c(TRUE, TRUE, TRUE))$all()
[1] TRUE
all(c(TRUE,TRUE,NA))
[1] NA
all(c(TRUE,TRUE,FALSE))
[1] FALSE
all(c(TRUE,TRUE,TRUE))
[1] TRUE
2.15.1.3 Get data type of Series
The dtype attribute (accessed without parentheses) can be used to get the data type of a Series.
pl$Series(letters)$dtype
DataType: String
pl$Series(c(1, 2))$dtype
DataType: Float64
infer_type(letters)
Utf8
string
infer_type(c(1, 2))
Float64
double
Tip
Polars is strongly typed. print(ls(pl$dtypes)) returns the full list of valid Polars types. Caution: some type names differ from their R base names. See below!
pl$Series(c("x", "y", "z"))$dtype
DataType: String
pl$Series(c(1, 2, 3))$dtype
DataType: Float64
pl$Series(c(1:3))$dtype
DataType: Int32
pl$Series(c(TRUE,FALSE))$dtype
DataType: Boolean
pl$Series(factor(c("a","b","c")))$dtype
pl$Series(Sys.Date())$dtype
DataType: Date
pl$Series(c(0,1))$dtype
DataType: Float64
typeof(c("x","y","z"))
[1] "character"
typeof(c(1, 2, 3))
[1] "double"
typeof(c(1:3))
[1] "integer"
typeof(c(TRUE,FALSE))
[1] "logical"
typeof(factor(c("a","b","c")))
[1] "integer"
typeof(Sys.Date())
[1] "double"
To summarise the main types between Polars and R:
| Polars | R Base |
|---|---|
| Utf8 | character |
| Float64 | double |
| Int32 | integer |
| Boolean | logical |
| Categorical | Factor |
| Date | Date |
To learn more about data types in Polars, see the Polars documentation.
2.15.1.4 Cast
The cast() method can be used to convert the data types of a column to a new one.
pl$DataFrame(iris)$with_columns(
  pl$col("Petal.Length")$cast(pl$Int8), # The "Petal.Length" column is converted into integers
  pl$col("Species")$cast(pl$Utf8) # The "Species" column is converted into strings
)$schema
When performing downcasting, it is crucial to ensure that the chosen number of bits is sufficient to accommodate the largest and smallest numbers in the column.
A quick reminder:
| Type | Range | Accuracy |
|---|---|---|
| Int8 | -128 to +127 | |
| Int16 | -32768 to +32767 | |
| Int32 | -2147483648 to +2147483647 | |
| Int64 | -2^63 to +2^63-1 | |
| Float32 | -3.4E+38 to +3.4E+38 | about 7 decimal digits |
| Float64 | -1.7E+308 to +1.7E+308 | about 16 decimal digits |
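Before downcasting, it can help to compare the observed range of a column with the bounds of the target type. The sketch below does this in R base; int_ranges and fits_in() are hypothetical names introduced for illustration, and the bounds are copied from the reminder above.

```r
# Bounds of the smaller integer types, copied from the reminder above
int_ranges <- list(
  Int8  = c(-128, 127),
  Int16 = c(-32768, 32767),
  Int32 = c(-2147483648, 2147483647)
)

# Hypothetical helper: TRUE if every non-missing value of x fits in `type`
fits_in <- function(x, type) {
  bounds <- int_ranges[[type]]
  observed <- range(x, na.rm = TRUE)
  observed[1] >= bounds[1] && observed[2] <= bounds[2]
}

fits_in(iris$Petal.Length, "Int8")  # TRUE: values lie between 1 and 6.9
fits_in(c(1, 300), "Int8")          # FALSE: 300 exceeds +127
fits_in(c(1, 300), "Int16")         # TRUE
```

This only checks the range. Casting a floating-point column such as Petal.Length to an integer type still discards the decimal part.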
2.15.1.5 Check if Series is numeric
The is_numeric() method can be used to check if Series is numeric.
Note that unlike R base, there is no method to check whether a Series is a character Series (in that case, its type is Utf8 anyway).
pl$Series(1:4)$is_numeric()
[1] TRUE
pl$Series(c("a", "b", "c"))$is_numeric()
[1] FALSE
is.numeric(1:4)
[1] TRUE
is.numeric(c("a","b","c"))
[1] FALSE
2.15.1.6 Check if Series is sorted
The is_sorted() method can be used to check if Series is sorted.
Note that R base provides is.unsorted() which returns the opposite boolean to is_sorted() of Polars.
pl$Series(1:4)$is_sorted()
[1] TRUE
pl$Series(c(1,3,2))$is_sorted()
[1] FALSE
is.unsorted(1:4)
[1] FALSE
is.unsorted(c(1,3,2))
[1] TRUE
2.15.1.7 Get length of a Series
The len() method can be used to get the length of a Series.
pl$Series(1:4)$len()
[1] 4
length(1:4)
[1] 4
2.15.1.8 Check if Series are equal
The series_equal() method can be used to check if a Series is equal to another Series.
Tip
Caution, if two series are identical but one is named and the other is not then series_equal() returns FALSE.
[1] FALSE
identical(1:4,1:4)
[1] TRUE
2.15.1.9 Convert Series to Polars DataFrame
The to_frame() method can be used to convert a Series to a DataFrame.
In this case, a DataFrame with only one column will be created. If the Series is initially named then the column of the DataFrame will be named as such.
pl$Series(1:3, "toto")$to_frame()
Polars provides many useful string methods; the full list is in the Polars documentation.
To use them, simply prefix them with str, for example str$to_uppercase().
2.16.1 Get substrings
The str$slice() method can be used to create substrings of the string values of a Utf8 Series.
str$slice() does not work like R base's substr() function for finding the substring of interest:
- substr() takes two position arguments: the first and last elements;
- str$slice() takes two arguments: the first element and the extraction length.
Important
Polars uses zero-based indexing! Thus the R base equivalent of str$slice(0, 3) is substr(x, 1, 3).
Two further comments:
If the second argument length is not specified, the substring extends to the end of the string. For example, in a DataFrame where mycol is a string column whose values have 4 characters, pl$col("mycol")$str$slice(1) is equivalent to substr(mycol, 2, 4) in dplyr.
The first argument accepts negative values, which count from the end of the string. For example, with the same column, pl$col("mycol")$str$slice(-2) is equivalent to substr(mycol, 3, 4) in dplyr.
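The correspondence between the two conventions can be written down as a small R base function. str_slice() below is a hypothetical helper, not part of Polars or R base. It only models the cases described above (zero-based offset, optional length, negative offset counted from the end); how Polars handles offsets that go beyond the string is not modelled.

```r
# Hypothetical helper: emulate the (offset, length) convention with substr()
str_slice <- function(x, offset, len = NULL) {
  n <- nchar(x)
  offset <- rep_len(offset, length(x))
  # Zero-based offset -> one-based start; negative offsets count from the end
  start <- ifelse(offset >= 0, offset + 1, pmax(n + offset + 1, 1))
  # Without a length, extract up to the end of each string
  last <- if (is.null(len)) n else start + len - 1
  substr(x, start, last)
}

str_slice("abcd", 0, 3)  # "abc", same as substr("abcd", 1, 3)
str_slice("abcd", 1)     # "bcd", same as substr("abcd", 2, 4)
str_slice("abcd", -2)    # "cd",  same as substr("abcd", 3, 4)
```

The helper is vectorised over x, so it can also be applied to a whole character column of a data.frame.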
col1 col2 level x_y is_one is_two
<int> <char> <char> <char> <lgcl> <lgcl>
1: 1 One_X One X TRUE FALSE
2: 2 One_Y One Y TRUE FALSE
3: 3 Two_X Two X FALSE TRUE
4: 4 Two_Y Two Y FALSE TRUE
2.16.3 Check if string values end with a substring
The str$ends_with() method can be used to check if string values end with a substring. It returns a Boolean.
Let's see an example where we filter the rows of a DataFrame based on the end of a character string:
col1 col2 level x_y is_one is_two
<int> <char> <char> <char> <lgcl> <lgcl>
1: 1 One_X One X TRUE FALSE
2: 3 Two_X Two X FALSE TRUE
2.17 Create your methods
With R you can create your own method/function with function().
Let's try to create an R function to capture some DataFrame transformations.
Our simple function:
- Takes a DataFrame as an input (argument data)
- Converts Integer columns into Floats
- Makes all String columns uppercase
- And keeps only the first three rows
fn_transformation <- function(data) {
  data$
    # Convert Integer columns into Float
    with_columns(pl$col(pl$Int32)$cast(pl$Float64))$
    # Make all String columns uppercase
    with_columns(pl$col(pl$Utf8)$str$to_uppercase())$
    # Keep only the first three rows
    head(3)
}
Newdf <- pl$DataFrame(
  col_categ = pl$Series(factor(c("a", "b", "c"))),
  col_str = pl$Series(c("x", "y", "z")),
  col_num = pl$Series(1:3)
)
fn_transformation(Newdf)
shape: (3, 3)
┌───────────┬─────────┬─────────┐
│ col_categ ┆ col_str ┆ col_num │
│ ---       ┆ ---     ┆ ---     │
│ cat       ┆ str     ┆ f64     │
╞═══════════╪═════════╪═════════╡
│ a         ┆ X       ┆ 1.0     │
│ b         ┆ Y       ┆ 2.0     │
│ c         ┆ Z       ┆ 3.0     │
└───────────┴─────────┴─────────┘
Of course, in real life, we will create functions that are more complicated than our example.