Introduction to dataframes in Python with pandas and polars

Andreas Beger and Isaac Chung
Python Code Club x PyData Tallinn
27 November 2024

The source code for this notebook is github.com/andybega/dataframes-workshop.

Section 1: pandas

We will start with pandas, the original and most widely used dataframe and data analysis library in Python.

Getting Started

We first need to import pandas. We’ll also import another package, numpy, which implements data types for vectors and matrices. Pandas is built on top of numpy, but we only import it here to make it easier to generate an example data frame.

import numpy as np
import pandas as pd

# Create an example data frame
df = pd.DataFrame({
    "quarter": [1, 1, 2, 2, 3],
    "x": np.random.randn(5),
    "date": pd.date_range("2024-01-01", periods=5, freq="MS")
})

df
quarter x date
0 1 -0.640642 2024-01-01
1 1 -0.801308 2024-02-01
2 2 -0.382784 2024-03-01
3 2 -0.945216 2024-04-01
4 3 0.296990 2024-05-01

Components of a dataframe

Pandas DataFrames consist of three components:

  1. One or more Series, which are the columns in the DataFrame.
  2. The names for the series, i.e. column names of the dataframe.
  3. The row names for each row in the dataframe, which pandas calls the Index.

Series

df.x
0   -0.640642
1   -0.801308
2   -0.382784
3   -0.945216
4    0.296990
Name: x, dtype: float64

(Note how each series can have a different data type, unlike in a matrix or an array.)
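
We can check this with the dtypes attribute, which lists the data type of each column:

# Each column (Series) has its own data type; here we would expect to see
# int64 for 'quarter', float64 for 'x', and datetime64[ns] for 'date'
df.dtypes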

Columns

df.columns
Index(['quarter', 'x', 'date'], dtype='object')

Index

df.index
RangeIndex(start=0, stop=5, step=1)

Since we didn’t explicitly set an index when we created the dataframe, it’s just a sequence of numbers starting at 0. Indexes are actually a key concept in pandas and we’ll talk a little bit more about them later.

Input: reading data from elsewhere

Pandas can import dataframes from a variety of external sources like text files, JSON, Excel spreadsheets, APIs, and SQL databases. See the input/output documentation for more information.
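
For example, reading from some of these other sources uses functions analogous to the read_csv() call below. The file names and database connection here are just hypothetical placeholders, so the calls are shown commented out:

# df = pd.read_excel("data/some-file.xlsx")
# df = pd.read_json("data/some-file.json")
# df = pd.read_parquet("data/some-file.parquet")
# df = pd.read_sql("SELECT * FROM some_table", some_db_connection)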

We’re going to read data on Estonian vehicle accidents from a comma-separated values (CSV) file, one of the most common text file types for storing data.

(The accidents data are from the Estonian open data portal.)

accidents = pd.read_csv("data/estonia-traffic-accidents-clean.csv")

Inspecting

One of the first things we might want to do with a new dataset is to get our bearings on some basic characteristics of the data.

How many rows and columns are there?

accidents.shape
(14259, 8)

What are the column names?

accidents.columns
Index(['date', 'persons_involved', 'killed', 'injured', 'county',
       'pedestrian_involved', 'accident_type', 'light_conditions'],
      dtype='object')

What does the data look like?

accidents.head()
date persons_involved killed injured county pedestrian_involved accident_type light_conditions
0 2014-10-24 08:45:00 2 0 1 Harju maakond 1 Jalakäijaõnnetus Valge aeg
1 2014-10-24 13:45:00 2 0 1 Harju maakond 0 Kokkupõrge Valge aeg
2 2014-08-11 00:00:00 2 0 1 Harju maakond 0 Kokkupõrge Valge aeg
3 2014-11-17 17:32:00 2 0 2 Harju maakond 0 Kokkupõrge Pimeda aeg
4 2015-04-28 07:55:00 2 0 1 Harju maakond 0 Kokkupõrge Valge aeg

Do we have missing data? What are the data types?

accidents.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14259 entries, 0 to 14258
Data columns (total 8 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   date                 14259 non-null  object
 1   persons_involved     14259 non-null  int64 
 2   killed               14259 non-null  int64 
 3   injured              14259 non-null  int64 
 4   county               14259 non-null  object
 5   pedestrian_involved  14259 non-null  int64 
 6   accident_type        14259 non-null  object
 7   light_conditions     14259 non-null  object
dtypes: int64(4), object(4)
memory usage: 891.3+ KB

We could also have used accidents.dtypes to get the data types.
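
For example:

# Same data type information as in .info(), just without the non-null counts
accidents.dtypes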

If you compare the non-null counts to the number of rows we looked up before, you’ll notice that none of the columns have missing values, and that all but one of the data types look correct (we’ll fix the date column shortly). That’s because this is an already cleaned-up version of the data. Further below in this notebook you can also see what we did to clean it up.

Selecting columns

There are various ways to do this, including the <dataframe>.<column> notation we used above. One alternative is square bracket indexing, <dataframe>["column"]:

accidents["date"].head()
0    2014-10-24 08:45:00
1    2014-10-24 13:45:00
2    2014-08-11 00:00:00
3    2014-11-17 17:32:00
4    2015-04-28 07:55:00
Name: date, dtype: object

To select multiple columns we can use a list with the column names as the argument:

accidents[["date", "county"]].head()
date county
0 2014-10-24 08:45:00 Harju maakond
1 2014-10-24 13:45:00 Harju maakond
2 2014-08-11 00:00:00 Harju maakond
3 2014-11-17 17:32:00 Harju maakond
4 2015-04-28 07:55:00 Harju maakond

Mutating columns

The dates are currently stored as strings (object). We should fix that.

(pandas uses the object data type by default for strings, even though it can store arbitrary Python objects. This is for historical reasons related to pandas originally being built on top of numpy. More on pandas text data types.)
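
As a small aside, pandas does have a dedicated, nullable string dtype that you can opt into. A minimal sketch, converting one of our text columns:

# Convert the 'county' column from object to pandas' dedicated string dtype
accidents["county"].astype("string").dtype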

accidents["date"][0]
'2014-10-24 08:45:00'
type(accidents["date"][0])
str
accidents["date"] = pd.to_datetime(accidents["date"])
accidents["date"][0]
Timestamp('2014-10-24 08:45:00')

Mutate: add a new column

We’ve covered how to alter an existing column. We can actually use the same method to add a new column.

How many people were killed or injured, overall?

accidents["killed_or_injured"] = accidents["killed"] + accidents["injured"]

accidents[['killed', 'injured', 'killed_or_injured']].head()
killed injured killed_or_injured
0 0 1 1
1 0 1 1
2 0 1 1
3 0 2 2
4 0 1 1
accidents["killed_or_injured"].sum()
np.int64(18021)

Group and summarize

Oftentimes we want to summarize our data over some group that is defined by one of the variables. To do this we usually want to use a combination of groupby and agg.

How many people were harmed, by accident type?

# sum the number of people killed or injured, by accident type
by_type = accidents.groupby("accident_type").agg({"killed_or_injured": "sum"})
by_type
killed_or_injured
accident_type
Jalakäijaõnnetus 3548
Kokkupõrge 7951
Muu liiklusõnnetus 1436
Teadmata 70
Ühesõidukiõnnetus 5016

(Optional) More on Indexes

In the table above, you might have noticed that “accident_type” is now, for some reason, shown differently from the “killed_or_injured” variable. Indeed, if we check the columns, it’s not there anymore:

by_type.columns
Index(['killed_or_injured'], dtype='object')

What happened is that when we did the group by and agg, pandas moved “accident_type” to the Index.

by_type.index
Index(['Jalakäijaõnnetus', 'Kokkupõrge', 'Muu liiklusõnnetus', 'Teadmata',
       'Ühesõidukiõnnetus'],
      dtype='object', name='accident_type')

Since we hadn’t set an index when we imported the data from CSV, this previously was just an integer count from 0, which you can see above when we showed the first few rows of the data with head(). Now it’s “accident_type”.

pandas extensively uses indexes for various operations. There are even hierarchical MultiIndexes that consist of more than one variable.
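
For example, grouping by two columns produces a result with a two-level MultiIndex. A minimal sketch using the accidents data:

# Group by two columns; the result has a MultiIndex with two levels
multi = accidents.groupby(["county", "accident_type"]).agg({"killed": "sum"})
# The index now consists of 'county' and 'accident_type'
multi.index.names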

There are really only two important things to know about pandas Indexes.

First, there are two kinds of pandas users:

  1. Those who love indexes and use them extensively. Such index power users are rumored to exist, or so they say.
  2. People like me who don’t use them unless forced to.

Second, indexes are like variables, but moved to the row labels. You can move them back and forth with two methods (a small sketch follows this list):

  • set_index(<keys>): move columns to the index; can be a column name or list of names.
  • reset_index(): move the variables in the current index back to the dataframe as columns.
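
A small sketch using the by_type table from above:

# "accident_type" currently lives in the index of by_type
by_type.index
# Move it back to a regular column; a new 0, 1, ... index appears
by_type.reset_index()
# And move it into the index again
by_type.reset_index().set_index("accident_type")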

How many people were harmed, by year?

To further explore this, let’s look at the number of people harmed, by year.

by_year = (accidents
           .loc[:, ["date", "killed_or_injured", "persons_involved"]]
           .resample(rule="YE", on="date")
           .sum()
)

by_year.head()
killed_or_injured persons_involved
date
2011-12-31 533 722
2012-12-31 1713 2289
2013-12-31 1714 2271
2014-12-31 1758 2429
2015-12-31 1773 2859

This uses somewhat more complicated code that we won’t explain in more detail. .loc[] is one of the alternative select/filter methods. resample() is like groupby() but for time series. Because the code is quite long for one line, we do something called method chaining, where we put each new method call on a new line. This requires wrapping the whole statement in parentheses.
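
As a small standalone illustration of .loc[], here rows are picked with a boolean condition and columns by name:

# Accidents in which someone was killed, showing only two columns
accidents.loc[accidents["killed"] > 0, ["date", "killed"]].head()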

The date column has been moved to the index. Since we’ve aggregated the data to yearly, it would be nice, e.g. for plotting, if we just had the years in a column, not the misleading full date times.

# Move date back to a column; note how we get a new dummy 0,1,... index
by_year = by_year.reset_index()
by_year.head()
date killed_or_injured persons_involved
0 2011-12-31 533 722
1 2012-12-31 1713 2289
2 2013-12-31 1714 2271
3 2014-12-31 1758 2429
4 2015-12-31 1773 2859
# To extract the year from the date we can use this:
by_year["date"].dt.year.head()
0    2011
1    2012
2    2013
3    2014
4    2015
Name: date, dtype: int32
# Create a new column with the year
by_year["year"] = by_year["date"].dt.year
# Drop the date column; we could also do this by selecting all columns but
# the one we want to drop, but this is more explicit
by_year = by_year.drop("date", axis=1)
# bring year to the first position
by_year = by_year[["year", "persons_involved", "killed_or_injured"]]
by_year
year persons_involved killed_or_injured
0 2011 722 533
1 2012 2289 1713
2 2013 2271 1714
3 2014 2429 1758
4 2015 2859 1773
5 2016 3171 1874
6 2017 2906 1725
7 2018 3131 1886
8 2019 2923 1752
9 2020 2577 1592
10 2021 2735 1701

(Optional) Cleaning the accidents data

We mentioned above that the accidents data is already pretty clean. That’s because we did the following to clean it up. If you’re curious about this (great!), you can add more code cells and copy/paste each segment so you can see some intermediate output as well.

import dateparser  # for fixing the raw dates

accidents = pd.read_csv("data/estonia-traffic-accidents.csv")

accidents.head()

accidents.info()

# Let's only keep a couple of columns for the workshop
keep = ["Toimumisaeg", "Isikuid", "Hukkunuid", "Vigastatuid", "Maakond (PPA)",
        "Jalakäija osalusel", "Liiklusõnnetuse liik [1]", "Valgustus [1]"]
accidents = accidents[keep]

# Translate the column names to English
translate_columns = {"Toimumisaeg": "date", "Isikuid": "persons_involved", 
                     "Hukkunuid": "killed", "Vigastatuid": "injured", 
                     "Maakond (PPA)": "county", 
                     "Jalakäija osalusel": "pedestrian_involved", 
                     "Liiklusõnnetuse liik [1]": "accident_type", 
                     "Valgustus [1]": "light_conditions"}
accidents = accidents.rename(columns=translate_columns)

# The original dates have a mix of formats. We're going to use the dateparser
# library to help parse these. 
accidents["date"] = accidents["date"].apply(lambda x: dateparser.parse(x, languages=["en"]))

# We've got missing data in several columns. Let's drop those rows.
# Usually you'd want to investigate what is missing and why, but it's fine
# for the workshop since we are just illustrating, not trying to learn anything
# about traffic accidents. 
accidents.isna().sum()
accidents = accidents.dropna()

# Some of the columns are float but should be integer values.
accidents = accidents.astype({"persons_involved": "int64",
                              "pedestrian_involved": "int64"})

# write out the cleaned data. 
# accidents.to_csv("data/estonia-traffic-accidents-clean.csv", index=False)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15708 entries, 0 to 15707
Data columns (total 53 columns):
 #   Column                                         Non-Null Count  Dtype  
---  ------                                         --------------  -----  
 0   Juhtumi nr                                     15708 non-null  object 
 1   Toimumisaeg                                    15708 non-null  object 
 2   Isikuid                                        15633 non-null  float64
 3   Hukkunuid                                      15708 non-null  int64  
 4   Vigastatuid                                    15708 non-null  int64  
 5   Sõidukeid                                      15633 non-null  float64
 6   Aadress (PPA)                                  14601 non-null  object 
 7   Maja nr (PPA)                                  6686 non-null   object 
 8   Tänav (PPA)                                    14012 non-null  object 
 9   Ristuv tänav (PPA)                             2757 non-null   object 
 10  Maakond (PPA)                                  14676 non-null  object 
 11  Omavalitsus (PPA)                              14670 non-null  object 
 12  Asustus (PPA)                                  8822 non-null   object 
 13  Asula                                          15708 non-null  object 
 14  Liiklusõnnetuse liik [1]                       15708 non-null  object 
 15  Liiklusõnnetuse liik [3]                       15708 non-null  object 
 16  Kergliikurijuhi osalusel                       15563 non-null  float64
 17  Jalakäija osalusel                             15563 non-null  float64
 18  Kaassõitja osalusel                            15563 non-null  float64
 19  Maastikusõiduki juhi osalusel                  15563 non-null  float64
 20  Eaka (65+) mootorsõidukijuhi osalusel          15563 non-null  float64
 21  Bussijuhi osalusel                             15563 non-null  float64
 22  Veoautojuhi osalusel                           15563 non-null  float64
 23  Ühissõidukijuhi osalusel                       15563 non-null  float64
 24  Sõiduautojuhi osalusel                         15563 non-null  float64
 25  Mootorratturi osalusel                         15563 non-null  float64
 26  Mopeedijuhi osalusel                           15563 non-null  float64
 27  Jalgratturi osalusel                           15563 non-null  float64
 28  Alaealise osalusel                             15563 non-null  float64
 29  Turvavarustust mitte kasutanud isiku osalusel  15563 non-null  float64
 30  Esmase juhiloa omaniku osalusel                15563 non-null  float64
 31  Mootorsõidukijuhi osalusel                     15563 non-null  float64
 32  Tüüpskeemi nr                                  15245 non-null  float64
 33  Tüüpskeem [2]                                  15245 non-null  object 
 34  Tee tüüp [1]                                   15581 non-null  object 
 35  Tee tüüp [2]                                   15581 non-null  object 
 36  Tee element [1]                                15128 non-null  object 
 37  Tee element [2]                                15128 non-null  object 
 38  Tee objekt [2]                                 15169 non-null  object 
 39  Kurvilisus                                     15199 non-null  object 
 40  Tee tasasus                                    15265 non-null  object 
 41  Tee seisund                                    15310 non-null  object 
 42  Teekate                                        15406 non-null  object 
 43  Teekatte seisund [2]                           15356 non-null  object 
 44  Sõiduradade arv                                4182 non-null   object 
 45  Lubatud sõidukiirus (PPA)                      15164 non-null  float64
 46  Tee nr (PPA)                                   4749 non-null   float64
 47  Tee km (PPA)                                   4754 non-null   object 
 48  Ilmastik [1]                                   15357 non-null  object 
 49  Valgustus [1]                                  15396 non-null  object 
 50  Valgustus [2]                                  15396 non-null  object 
 51  GPS X                                          13464 non-null  float64
 52  GPS Y                                          13459 non-null  float64
dtypes: float64(23), int64(2), object(28)
memory usage: 6.4+ MB

Section 2: polars

Polars is a more recent dataframe library, written in Rust, which has been gaining a lot of traction. It uses a very different philosophy and style for its API. Both of those (being written in Rust and the API design) allow it to be faster and more memory efficient than pandas. It also works out of the box with data that is too large to fit into memory.

Getting started

Let’s redo some of the common dataframe operations we did above, using the accidents data.

import polars as pl

accidents = pl.read_csv("data/estonia-traffic-accidents-clean.csv")
accidents.head()
shape: (5, 8)
date persons_involved killed injured county pedestrian_involved accident_type light_conditions
str i64 i64 i64 str i64 str str
"2014-10-24 08:45:00" 2 0 1 "Harju maakond" 1 "Jalakäijaõnnetus" "Valge aeg"
"2014-10-24 13:45:00" 2 0 1 "Harju maakond" 0 "Kokkupõrge" "Valge aeg"
"2014-08-11 00:00:00" 2 0 1 "Harju maakond" 0 "Kokkupõrge" "Valge aeg"
"2014-11-17 17:32:00" 2 0 2 "Harju maakond" 0 "Kokkupõrge" "Pimeda aeg"
"2015-04-28 07:55:00" 2 0 1 "Harju maakond" 0 "Kokkupõrge" "Valge aeg"

Some basic differences from pandas

Like pandas, polars DataFrames are composed of Series. You’ll soon see that the API has a quite different style. Some other big differences:

  • No index.
  • The use of square brackets [] is discouraged; use methods instead (a small sketch follows this list).
  • Within reason, you want to use method chaining and do related things together, rather than splitting transformations line by line. This is so that the query optimizer can do its thing under the hood.
  • Expressions
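
To make the square bracket point concrete, here is a minimal sketch. Brackets do work, but the method-based style is the one to prefer:

# Works, but discouraged: returns the 'county' column as a Series
accidents["county"]
# Preferred: select() (returns a DataFrame) or get_column() (returns a Series)
accidents.select("county")
accidents.get_column("county")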

Polars also has a doc section for Coming from Pandas.

The biggest conceptual difference from pandas is expressions. So let’s talk about those, after a brief detour on how to select columns (since we will use that to illustrate expressions).

Convert from pandas to polars and vice versa

polars has functions for converting data frames back and forth:

import pyarrow

df = pl.DataFrame({"x": [1, 2, 3], "y": [4, 5, 6]})
# to make this a pandas dataframe
# (requires pyarrow)
df_pd = df.to_pandas()
# to convert it back to polars dataframe
df_pl = pl.DataFrame(df_pd)

Selecting columns

With the select() method:

accidents.select("date", "county").head()
shape: (5, 2)
date county
str str
"2014-10-24 08:45:00" "Harju maakond"
"2014-10-24 13:45:00" "Harju maakond"
"2014-08-11 00:00:00" "Harju maakond"
"2014-11-17 17:32:00" "Harju maakond"
"2015-04-28 07:55:00" "Harju maakond"

Expressions

Polars relies very heavily on expressions. These are transformations that abstractly define what we want to do with some data; they are executed within a context, which supplies the actual data that the transformations are applied to.

select() is one of the contexts in which an expression can be executed. One of the most basic expressions is pl.col(), which selects a column in the context provided. We can use this to select a column, like we did above.

accidents.select(pl.col("date")).head()
shape: (5, 1)
date
str
"2014-10-24 08:45:00"
"2014-10-24 13:45:00"
"2014-08-11 00:00:00"
"2014-11-17 17:32:00"
"2015-04-28 07:55:00"

Composing expressions

What’s the point of that? By itself, a more verbose way to select columns is not that useful. What is very useful is that we can compose expressions to do many things. For example:

How many people were injured or killed in the biggest accident in the data?

accidents.select(
    # select 'killed'
    pl.col("killed")
    # add 'injured'
    .add(pl.col("injured"))
    # give the result a new column name
    .alias("killed_or_injured")
    # identify the max value
    .max())
shape: (1, 1)
killed_or_injured
i64
23

Contexts

There are 4 contexts in which expressions can be used:

  • select()
  • filter()
  • with_columns() to mutate data frames
  • group_by() and aggregations

We’ll see all of those below.

For more on the concept of expressions in polars, see the Expressions and contexts section in the documentation.

For a list of available expressions, see this more conceptual overview in the user guide, or the Python API reference on expressions.

Filtering rows

Using filter():

accidents.filter(pl.col("county").eq("Harju maakond")).shape
(7000, 8)

(You can generally use operators like ==, +, and - as you’d expect. The only catch is that if you want to do something further with the result, you have to wrap the calculation in extra parentheses, like (pl.col("a") + pl.col("b")).max(), versus pl.col("a").add(pl.col("b")).max().)
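
For example, these two ways of writing the earlier “biggest accident” query are equivalent:

# Operator style; note the extra parentheses around the addition
accidents.select((pl.col("killed") + pl.col("injured")).alias("killed_or_injured").max())
# Method style, as used above
accidents.select(pl.col("killed").add(pl.col("injured")).alias("killed_or_injured").max())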

Another small example of how expressions can make life easier is below. Instead of filtering using the full county name “Harju maakond”, we can just look for strings that contain the substring “Harju”.

accidents.filter(pl.col("county").str.contains("Harju")).shape
(7000, 8)
accidents.filter(pl.col("county")=="Harju maakond").shape
(7000, 8)

Mutating dataframes

You use with_columns() and various expressions to mutate (add, change) columns in a dataframe. Here is how you would add a “killed_or_injured” column, like we did with pandas earlier.

accidents = accidents.with_columns(
    pl.col("killed").add(pl.col("injured")).alias("killed_or_injured")
)
accidents.head()
shape: (5, 9)
date persons_involved killed injured county pedestrian_involved accident_type light_conditions killed_or_injured
str i64 i64 i64 str i64 str str i64
"2014-10-24 08:45:00" 2 0 1 "Harju maakond" 1 "Jalakäijaõnnetus" "Valge aeg" 1
"2014-10-24 13:45:00" 2 0 1 "Harju maakond" 0 "Kokkupõrge" "Valge aeg" 1
"2014-08-11 00:00:00" 2 0 1 "Harju maakond" 0 "Kokkupõrge" "Valge aeg" 1
"2014-11-17 17:32:00" 2 0 2 "Harju maakond" 0 "Kokkupõrge" "Pimeda aeg" 2
"2015-04-28 07:55:00" 2 0 1 "Harju maakond" 0 "Kokkupõrge" "Valge aeg" 1

Group and aggregate

Instead of calculating the number of victims by accident type or year, as we did above, let’s do by county instead.

by_county = (accidents
             .group_by("county")
             .agg(pl.len().alias("accidents"),
                  pl.col("killed_or_injured").sum())
             .sort("killed_or_injured", descending=True)
)
by_county.head()
shape: (5, 3)
county accidents killed_or_injured
str u32 i64
"Harju maakond" 7000 8423
"Tartu maakond" 1591 1968
"Ida-Viru maakond" 991 1348
"Pärnu maakond" 1008 1293
"Lääne-Viru maakond" 645 883

(Optional) Joining dataframes

Often you have information in different dataframes (tables) that you want to join (or merge) together. For example, what if we want to know the accident rate per capita by county? For this we can first group and aggregate to get the county-level number of accident victims, then join a table with population figures for each county.

Since we conveniently already have county-level data from above, we just need to read and join in the county population data.

(Data are from Statistics Estonia)

county_pop = (pl.read_csv("data/county-pop.csv", skip_rows=2)
              .rename({"County": "county", "Age groups total": "population"})
              .select(["county", "population"])
              # this has "county" in the county names, not "maakond"
              .with_columns(pl.col("county").str.replace("county", "maakond"))
              )

by_county_w_pop = by_county.join(county_pop, on="county", how="left")
by_county_w_pop
shape: (15, 4)
county accidents killed_or_injured population
str u32 i64 i64
"Harju maakond" 7000 8423 598059
"Tartu maakond" 1591 1968 152977
"Ida-Viru maakond" 991 1348 136240
"Pärnu maakond" 1008 1293 85938
"Lääne-Viru maakond" 645 883 59325
"Saare maakond" 325 413 33108
"Valga maakond" 302 400 28370
"Põlva maakond" 285 383 25006
"Lääne maakond" 208 272 20507
"Hiiu maakond" 60 76 9387
by_county_w_pop.select(
    pl.col("county"), 
    pl.col("killed_or_injured"),
    pl.col("killed_or_injured").truediv(pl.col("population")).mul(1000).alias("rate/1000")
    ).head(3)
shape: (3, 3)
county killed_or_injured rate/1000
str i64 f64
"Harju maakond" 8423 14.083895
"Tartu maakond" 1968 12.864679
"Ida-Viru maakond" 1348 9.894304

Which 3 counties have the highest vehicle accident victim rates?

# Add your code here

Which 3 counties have the lowest vehicle accident victim rates?

# Add your code here

(Optional) Reshaping / pivoting dataframes

For this we will look at another dataset, on reflector usage. (🌃🎄 Tis’ the time of year…)

reflectors = (pl.read_csv("data/reflectors.csv", has_header=True, separator=";", skip_rows=2)
              .filter(pl.col("Sex").ne("Men and women"))
              .drop(["Type of data", "Year", "All age groups (16-64)"])
              .sort("Reflector use", "Sex")
)
reflectors.head()
shape: (5, 7)
Reflector use Sex 16-24 25-34 35-44 45-54 55-64
str str f64 f64 f64 f64 f64
"Nearly always" "Men" 34.3 40.5 52.2 58.6 55.9
"Nearly always" "Women" 58.4 64.9 71.4 78.4 74.9
"Never" "Men" 14.3 12.4 7.2 3.9 2.7
"Never" "Women" 8.8 5.0 4.6 2.0 2.5
"Never walk on dark streets, ro… "Men" 4.8 10.8 9.7 11.3 12.8

After some basic cleaning, we can see that this dataframe has a not uncommon pattern: each series runs across a row rather than down a column, with the identifying labels (the age groups) as the column names. That’s fine for presentation purposes when you look at the table, but for plotting, life will be easier if we reshape this data from the current wide format to a long format.

We want to create two new columns from all the “16-24” etc. columns:

  • One with the age group information.
  • The other with the reflector usage values.
reflectors = (reflectors
              .unpivot(index=["Reflector use", "Sex"], 
                       variable_name="age_group", 
                       value_name="percentage")
              .sort("Reflector use", "Sex", "age_group")
)
reflectors.head()
shape: (5, 4)
Reflector use Sex age_group percentage
str str str f64
"Nearly always" "Men" "16-24" 34.3
"Nearly always" "Men" "25-34" 40.5
"Nearly always" "Men" "35-44" 52.2
"Nearly always" "Men" "45-54" 58.6
"Nearly always" "Men" "55-64" 55.9
(reflectors
 .filter(pl.col("Reflector use").eq("Never"))
 .plot.line(x = "age_group", y = "percentage", color = "Sex")
 .properties(
    width=600,
    height=300
    )
)

Classic. Men don’t wear reflectors as much as women. Old people are less likely to say that they never wear reflectors.

Actually I’m not sure who in Estonia lives in a place where they can only walk on well-lit streets. Let’s combine that category with “Never”.

(reflectors
 .with_columns(
     pl.col("Reflector use").str.replace("Never walk on dark streets, roads", "Never")
 )
 .group_by(["Reflector use", "Sex", "age_group"])
 .agg(pl.col("percentage").sum())
 .filter(pl.col("Reflector use").eq("Never"))
 .sort(["age_group", "Sex"])
 .plot.line(x = "age_group", y = "percentage", color = "Sex")
 .properties(
    width=600,
    height=300
    )
)

Hmm. Maybe some people really don’t walk a lot, but it’s also likely that a lot of people, especially older people, don’t see the need to wear reflectors.

(Optional) LazyFrames and bigger than memory data

Aside from regular DataFrames, polars also has the concept of LazyFrames. These are abstract dataframes from some external source that are not read into memory (materialized) until you do something with them. And depending on what you do with them, this sometimes means that you can work with larger than memory datasets.

We’re not going to work with a 30GB dataset here, so let’s just illustrate the intuition with some code:

smaller_data = (pl.scan_csv("huge/datafile.csv")
                .select("id1", "x1", "x2")
                .filter(pl.col("id1") > 100)
                .collect())

What this does is read data from a CSV, select 3 columns, and filter rows based on “id1” being greater than 100.

There are two key bits here:

  • pl.scan_csv(): unlike read_csv(), this does not immediately read the entire data into memory, but rather streams it…and only once we tell it to.
  • collect() to indicate that we want to start executing (materializing) our query.

What will happen with the particular code above is that polars will analyze our query and then apply two memory-saving optimizations (a small sketch follows this list):

  • It will only consider data that is in the 3 columns we selected.
  • From those columns, it will only read rows (elements) that satisfy our filter condition.
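
A minimal sketch using the accidents file we already have on disk; explain() shows the optimized query plan before anything is materialized:

# Build a lazy query; nothing is read from disk yet
lazy_query = (pl.scan_csv("data/estonia-traffic-accidents-clean.csv")
              .select("county", "killed", "injured")
              .filter(pl.col("killed") > 0))
# Print the optimized query plan; the column selection and the filter should
# show up as being pushed down into the CSV scan itself
print(lazy_query.explain())
# Materialize the (much smaller) result in memory
fatal_accidents = lazy_query.collect()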

More on LazyFrames and the Lazy API at the user guide.

Pro-tip: always plot your data

Sometimes your job is to pipe data from A to B, and that’s fine.

But if you are doing data analysis or modeling (AI, ML, …), it often pays to be curious (and suspicious) about your data. Sometimes this helps you catch errors in what you are doing with the data. Other times it helps you reduce the “garbage in” part of the “garbage in, garbage out” equation.

Here’s a cute example some people created to make this point. We have two data sets, is there anything unusual about them?

df1 = pl.read_csv("data/dataset1.csv")
df2 = pl.read_csv("data/dataset2.csv")

The data appear to be similar. Same number of rows and columns.

df1.shape
(142, 2)
df2.shape
(142, 2)

They also have similar distributions:

df1.describe()
shape: (9, 3)
statistic x y
str f64 f64
"count" 142.0 142.0
"null_count" 0.0 0.0
"mean" 54.2661 47.834721
"std" 16.769825 26.939743
"min" 15.56075 0.015119
"25%" 39.706326 24.46783
"50%" 53.421463 48.398346
"75%" 69.359559 71.806616
"max" 91.639961 97.475771
df2.describe()
shape: (9, 3)
statistic x y
str f64 f64
"count" 142.0 142.0
"null_count" 0.0 0.0
"mean" 54.263273 47.832253
"std" 16.765142 26.935403
"min" 22.3077 2.9487
"25%" 44.1026 25.2564
"50%" 53.5897 46.0256
"75%" 64.8718 69.1026
"max" 98.2051 99.4872

As you can guess from the section heading, we should try plotting them. (If the “x” and “y” column names weren’t hint enough by themselves…)

df1.plot.point("x", "y")

Ok, all good, looks boring.

df2.plot.point("x", "y")

Ahaa! 🦖

Appendix

Datasaurus

The cute point data from the last example came from a dataset called the Datasaurus dozen, which itself was inspired by an earlier, famous statistical toy dataset, Anscombe’s quartet.

The other datasaurus datasets are also in the data; we just took out the two above to make life easier for the workshop.

# Split this file into the two datasets used above, just to make a point about also plotting data
datasaurus = pd.read_csv("data/datasaurus.csv")

df1 = datasaurus[datasaurus["dataset"] == "away"].drop("dataset", axis=1)
df2 = datasaurus[datasaurus["dataset"] == "dino"].drop("dataset", axis=1)

df1.to_csv("data/dataset1.csv", index=False)
df2.to_csv("data/dataset2.csv", index=False)