We will start with pandas, the original and most widely used dataframe and data analysis library in Python.
Getting Started
We first need to import pandas. We’ll also import another package, numpy, which implements data types for vectors and matrices. Pandas is built on top of numpy, but we only import it here to make it easier to generate an example data frame.
import numpy as np
import pandas as pd

# Create an example data frame
df = pd.DataFrame({
    "quarter": [1, 1, 2, 2, 3],
    "x": np.random.randn(5),
    "date": pd.date_range("2024-01-01", periods=5, freq="MS")
})
df
   quarter         x       date
0        1 -0.640642 2024-01-01
1        1 -0.801308 2024-02-01
2        2 -0.382784 2024-03-01
3        2 -0.945216 2024-04-01
4        3  0.296990 2024-05-01
Components of a dataframe
Pandas DataFrames consist of three components:
One or more Series, which are the columns in the DataFrame.
The names for the series, i.e. column names of the dataframe.
The row names for each row in the dataframe, which pandas calls the Index.
(Note how each series can have a different data type, unlike in a matrix or an array.)
Columns
df.columns
Index(['quarter', 'x', 'date'], dtype='object')
Index
df.index
RangeIndex(start=0, stop=5, step=1)
Since we didn’t explicitly set an index when we created the dataframe, it’s just a sequence of numbers starting at 0. Indexes are actually a key concept in pandas and we’ll talk a little bit more about them later.
Input: reading data from elsewhere
Pandas can import dataframes from a variety of external sources like text files, JSON, Excel spreadsheets, APIs, and SQL databases. See the input/output documentation for more information.
We’re going to read data on Estonian vehicle accidents from a comma-separated values (CSV) file, one of the most common text file formats for storing data.
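The read cell itself isn’t reproduced here; a minimal sketch of what it would look like, assuming the cleaned file name used at the end of this notebook:

# Sketch (assumed file name, matching the cleaning code at the bottom of this notebook)
accidents = pd.read_csv("data/estonia-traffic-accidents-clean.csv")

# Show the number of rows, the columns, their data types, and non-null counts
accidents.info()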
We could also have used accidents.dtypes to get the data types.
If you compare the non-null counts against the number of rows we looked up before, you’ll notice that none of the columns have missing values, and that all but one of the data types look correct. That’s because this is an already cleaned-up version of the data. At the bottom of this notebook you can see what we did to clean it up.
Selecting columns
There are various ways to do this, including the <dataframe>.<column> notation we used above. One alternative is the <dataframe>[] square brackets:
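For example (a minimal sketch, not necessarily the original cell):

# Select a single column with square brackets
accidents["county"].head()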
To select multiple columns we can use a list with the column names as the argument:
accidents[["date", "county"]].head()
                 date         county
0 2014-10-24 08:45:00  Harju maakond
1 2014-10-24 13:45:00  Harju maakond
2 2014-08-11 00:00:00  Harju maakond
3 2014-11-17 17:32:00  Harju maakond
4 2015-04-28 07:55:00  Harju maakond
Mutating columns
The dates are currently stored as strings (object). We should fix that.
(pandas uses the object data type by default for strings, even though it can store arbitrary Python objects. This is for historical reasons related to pandas originally being built on top of numpy. More on pandas text data types.)
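The conversion cell isn’t shown here; a sketch using pd.to_datetime, which also produces the first/last accident printout below:

# Sketch (assumed, not the original cell): parse the date strings into datetimes
accidents["date"] = pd.to_datetime(accidents["date"])

print(f"First accident: {accidents['date'].min()}")
print(f"Last accident: {accidents['date'].max()}")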
First accident: 2011-01-05 00:00:00
Last accident: 2021-12-31 23:45:00
What types of accidents were there?
The Series.value_counts() method is quite useful for tabulating categorical variables:
accidents["accident_type"].value_counts()
accident_type
Kokkupõrge 5605
Ühesõidukiõnnetus 3946
Jalakäijaõnnetus 3386
Muu liiklusõnnetus 1262
Teadmata 60
Name: count, dtype: int64
Filter rows
Oftentimes we don’t need all rows in a dataframe to answer a specific question. This is commonly called filtering or subsetting rows.
How many of the accidents were in Harju county?
We have a “county” column, so we can use that to look at only Harju county. As with selecting columns, there are multiple ways to do this in pandas. One of the basic ones is to again use the <dataframe>[] square brackets.
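The cell itself isn’t reproduced here; a sketch of the two steps (the comparison, then the filter):

# Sketch (assumed, not the original cell): a boolean vector, True for Harju county rows
is_harju = accidents["county"] == "Harju maakond"

# Pass the boolean vector to the square brackets to keep only those rows
accidents[is_harju].shape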
When we pass this boolean vector to the square brackets, pandas uses it to filter rows rather than to select columns, like it did with accidents["county"].
Yeah, it’s weird and inconsistent, which is one of the things people complain about with pandas.
What’s more, there are many more ways to select and filter. See the “Indexing and selecting data” documentation for all the various methods pandas has, both for column selecting and row filtering.
Mutate: add a new column
We’ve covered how to alter an existing column. We can actually use the same method to add a new column.
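A minimal sketch, assuming the original simply summed the two existing count columns (this killed_or_injured column is used in the aggregation below):

# Sketch (assumed, not the original cell): new column from two existing ones
accidents["killed_or_injured"] = accidents["killed"] + accidents["injured"]
accidents.head()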
Oftentimes we want to summarize our data over some group that is defined by one of the variables. To do this we usually want to use a combination of groupby and agg.
How many people were harmed, by accident type?
# Summarize total accident victims by accident type
by_type = accidents.groupby("accident_type").agg({"killed_or_injured": "sum"})
by_type
                    killed_or_injured
accident_type
Jalakäijaõnnetus                 3548
Kokkupõrge                       7951
Muu liiklusõnnetus               1436
Teadmata                           70
Ühesõidukiõnnetus                5016
(Optional) More on Indexes
In the table above, you might have noticed that “accident_type” is now, for some reason, shown differently from the “killed_or_injured” variable. Indeed, if we check the columns, it’s not there anymore:
by_type.columns
Index(['killed_or_injured'], dtype='object')
What happened is that when we did the group by and agg, pandas moved “accident_type” to the Index.
Since we hadn’t set an index when we imported the data from CSV, this previously was just an integer count from 0, which you can see above when we showed the first few rows of the data with head(). Now it’s “accident_type”.
pandas extensively uses indexes for various operations. There are even hierarchical MultiIndexes that consist of more than one variable.
There are really only two important things to know about pandas Indexes.
First, there are two kinds of pandas users:
Those who love indexes and use them extensively. Such index power users are rumored to exist, or so they say.
People like me who don’t use them unless forced to.
Second, indexes are like variables, but moved to the row labels. You can move them back and forth with two functions (a small round trip is sketched after this list):
set_index(<keys>): move columns to the index; can be a column name or list of names.
reset_index(): move the variables in the current index back to the dataframe as columns.
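As a minimal sketch (purely for illustration, not part of the original analysis):

# Hypothetical round trip: move "county" into the index and back out again
tmp = accidents.set_index("county")   # "county" becomes the row labels
tmp = tmp.reset_index()               # and it is a regular column again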
How many people were harmed, by year?
To further explore this, let’s look at the number of people harmed, by year.
This uses a bit more complicated code that we won’t explain in detail. .loc[] is one of the alternative select/filter methods. resample() is like groupby() but for time series. Because the code is quite long for one line, we do something called method chaining, where we put each new method call on a new line. This requires wrapping the whole statement in parentheses.
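The original cell isn’t reproduced here; a sketch of what it could look like (the exact resample frequency string depends on your pandas version, e.g. "A" or "YE" for year-end):

# Sketch (assumed, not the original cell): yearly totals via .loc and resample
by_year = (
    accidents
    .loc[:, ["date", "killed_or_injured", "persons_involved"]]  # keep only what we need
    .resample("YE", on="date")   # group by calendar year; use "A" on older pandas
    .sum()
)
by_year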
The date column has been moved to the index. Since we’ve aggregated the data to yearly, it would be nice, e.g. for plotting, if we just had the years in a column, not the misleading full date times.
# Move date back to a column; note how we get a new dummy 0, 1, ... index
by_year = by_year.reset_index()
by_year.head()
        date  killed_or_injured  persons_involved
0 2011-12-31                533               722
1 2012-12-31               1713              2289
2 2013-12-31               1714              2271
3 2014-12-31               1758              2429
4 2015-12-31               1773              2859
# To extract the year from the date we can use this:
by_year["date"].dt.year.head()
# Create a new column with the year
by_year["year"] = by_year["date"].dt.year

# Drop the date column; we could also do this by selecting all columns but
# the one we want to drop, but this is more explicit
by_year = by_year.drop("date", axis=1)

# Bring year to the first position
by_year = by_year[["year", "persons_involved", "killed_or_injured"]]
by_year
    year  persons_involved  killed_or_injured
0   2011               722                533
1   2012              2289               1713
2   2013              2271               1714
3   2014              2429               1758
4   2015              2859               1773
5   2016              3171               1874
6   2017              2906               1725
7   2018              3131               1886
8   2019              2923               1752
9   2020              2577               1592
10  2021              2735               1701
(Optional) Cleaning the accidents data
We mentioned above that the accidents data is already pretty clean. That’s because we did the following to clean it up. If you’re curious about this (great!), you can add more code cells and copy/paste each segment so you can see some intermediate output as well.
import dateparser  # for fixing the raw dates

accidents = pd.read_csv("data/estonia-traffic-accidents.csv")
accidents.head()
accidents.info()

# Let's only keep a couple of columns for the workshop
keep = ["Toimumisaeg", "Isikuid", "Hukkunuid", "Vigastatuid", "Maakond (PPA)",
        "Jalakäija osalusel", "Liiklusõnnetuse liik [1]", "Valgustus [1]"]
accidents = accidents[keep]

# Translate the column names to English
translate_columns = {
    "Toimumisaeg": "date",
    "Isikuid": "persons_involved",
    "Hukkunuid": "killed",
    "Vigastatuid": "injured",
    "Maakond (PPA)": "county",
    "Jalakäija osalusel": "pedestrian_involved",
    "Liiklusõnnetuse liik [1]": "accident_type",
    "Valgustus [1]": "light_conditions"
}
accidents = accidents.rename(columns=translate_columns)

# The original dates have a mix of formats. We're going to use the dateparser
# library to help parse these.
accidents["date"] = accidents["date"].apply(lambda x: dateparser.parse(x, languages=["en"]))

# We've got missing data in several columns. Let's drop those rows.
# Usually you'd want to investigate what is missing and why, but it's fine
# for the workshop since we are just illustrating, not trying to learn anything
# about traffic accidents.
accidents.isna().sum()
accidents = accidents.dropna()

# Some of the columns are float but should be integer values.
accidents = accidents.astype({
    "persons_involved": "int64",
    "pedestrian_involved": "int64"
})

# Write out the cleaned data.
# accidents.to_csv("data/estonia-traffic-accidents-clean.csv", index=False)
Polars is a more recent dataframe library, written in Rust, which has been gaining a lot of traction. It uses a very different philosophy and style for its API. Both of these (the Rust core and the API design) allow it to be faster and more memory-efficient than pandas. It also works out of the box with data that is too large to fit into memory.
Getting started
Let’s redo some of the common dataframe operations we did above, using the accidents data.
import polars as pl

accidents = pl.read_csv("data/estonia-traffic-accidents-clean.csv")
accidents.head()
shape: (5, 8)
date                   persons_involved  killed  injured  county           pedestrian_involved  accident_type       light_conditions
str                    i64               i64     i64      str              i64                  str                 str
"2014-10-24 08:45:00"  2                 0       1        "Harju maakond"  1                    "Jalakäijaõnnetus"  "Valge aeg"
"2014-10-24 13:45:00"  2                 0       1        "Harju maakond"  0                    "Kokkupõrge"        "Valge aeg"
"2014-08-11 00:00:00"  2                 0       1        "Harju maakond"  0                    "Kokkupõrge"        "Valge aeg"
"2014-11-17 17:32:00"  2                 0       2        "Harju maakond"  0                    "Kokkupõrge"        "Pimeda aeg"
"2015-04-28 07:55:00"  2                 0       1        "Harju maakond"  0                    "Kokkupõrge"        "Valge aeg"
Some basic differences from pandas
Like pandas, polars DataFrames are composed of Series. You’ll soon see that the API has a quite different style. Some other big differences:
No index.
The use of square brackets [] is discouraged, use methods instead.
Within reason, you want to use method chaining and do related things together, rather than splitting transformations line by line. This is so that the query optimizer can do its thing under the hood.
The biggest conceptual difference from pandas is expressions. So let’s talk about those, after a brief detour on how to select columns (since we will use column selection to illustrate expressions).
Convert from pandas to polars and vice versa
polars has functions for converting data frames back and forth:
import pyarrow

df = pl.DataFrame({"x": [1, 2, 3], "y": [4, 5, 6]})

# To make this a pandas dataframe
# (requires pyarrow)
df_pd = df.to_pandas()

# To convert it back to a polars dataframe
df_pl = pl.DataFrame(df_pd)
Selecting columns
With the select() method:
accidents.select("date", "county").head()
shape: (5, 2)
date                   county
str                    str
"2014-10-24 08:45:00"  "Harju maakond"
"2014-10-24 13:45:00"  "Harju maakond"
"2014-08-11 00:00:00"  "Harju maakond"
"2014-11-17 17:32:00"  "Harju maakond"
"2015-04-28 07:55:00"  "Harju maakond"
Expressions
Polars relies very heavily on expressions. These are data transformations that abstractly define what we want to do with some data; they only run within a context, which supplies the actual data the expression is applied to.
select() is one of the contexts in which an expression can be executed. One of the most basic expressions is pl.col(), which refers to a column in whatever data the context provides. We can use this to select a column, like we did above.
accidents.select(pl.col("date")).head()
shape: (5, 1)
date
str
"2014-10-24 08:45:00"
"2014-10-24 13:45:00"
"2014-08-11 00:00:00"
"2014-11-17 17:32:00"
"2015-04-28 07:55:00"
Composing expressions
What’s the point of that? By itself, a more verbose way to select columns is not that useful. What is very useful is that we can compose expressions to do many different things. For example:
How many people were injured or killed in the biggest accident in the data?
accidents.select(
    # select 'killed'
    pl.col("killed")
    # add 'injured'
    .add(pl.col("injured"))
    # give the result a new column name
    .alias("killed_or_injured")
    # identify the max value
    .max()
)
shape: (1, 1)
killed_or_injured
i64
23
Contexts
There are 4 contexts in which expressions can be used:
select()
filter()
with_columns() to mutate data frames
group_by() and aggregations
We’ll see all of those below.
For more on the concept of expressions in polars, see the Expressions and contexts section in the documentation.
(You can generally use operators like ==, +, and - as you’d expect them to work; the only thing is that if you want to do something further with the result, you have to wrap the calculation in extra parentheses, like (pl.col("a") + pl.col("b")).max(), versus pl.col("a").add(pl.col("b")).max().)
Another small example of how expressions can make life easier is below. Instead of filtering using the full county name “Harju maakond”, we can just look for strings that contain the substring “Harju”.
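A sketch of what that filter could look like (not necessarily the original cell):

# Sketch (assumed): substring match instead of spelling out the full county name
accidents.filter(pl.col("county").str.contains("Harju")).head()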
You use with_columns() and various expressions to mutate (add, change) columns in a dataframe. Here is how you would add a “killed_or_injured” column, like we did with pandas earlier.
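A minimal sketch of that step (not necessarily the original cell):

# Sketch (assumed): add the killed and injured counts together as a new column
accidents = accidents.with_columns(
    (pl.col("killed") + pl.col("injured")).alias("killed_or_injured")
)
accidents.head()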
Often you have information in different dataframes (tables) that you want to join (or merge) together. For example, what if we want to know the accident rate per capita by county? For this we can first group and aggregate to get the county-level number of accident victims, then join a table with population figures for each county.
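The grouped county-level frame isn’t reproduced in this section; a plausible sketch producing a by_county dataframe (hypothetical name) using the killed_or_injured column added above:

# Sketch (assumed, not the original cell): total victims per county
by_county = (
    accidents
    .group_by("county")
    .agg(pl.col("killed_or_injured").sum())
)
by_county.head()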
Since we conveniently already have county-level data from above, we just need to read and join in the county population data.
county_pop = (
    pl.read_csv("data/county-pop.csv", skip_rows=2)
    .rename({"County": "county", "Age groups total": "population"})
    .select(["county", "population"])
    # this has "county" in the county names, not "maakond"
    .with_columns(pl.col("county").str.replace("county", "maakond"))
)

by_county_w_pop = by_county.join(county_pop, on="county", how="left")
by_county_w_pop
Which 3 counties have the highest vehicle accident victim rates?
# Add your code here
Which 3 counties have the lowest vehicle accident victim rates?
# Add your code here
(Optional) Reshaping / pivoting dataframes
For this we will look at another dataset, on reflector usage. (🌃🎄 ’Tis the time of year…)
reflectors = (
    pl.read_csv("data/reflectors.csv", has_header=True, separator=";", skip_rows=2)
    .filter(pl.col("Sex").ne("Men and women"))
    .drop(["Type of data", "Year", "All age groups (16-64)"])
    .sort("Reflector use", "Sex")
)
reflectors.head()
shape: (5, 7)
Reflector use                      Sex      16-24  25-34  35-44  45-54  55-64
str                                str      f64    f64    f64    f64    f64
"Nearly always"                    "Men"    34.3   40.5   52.2   58.6   55.9
"Nearly always"                    "Women"  58.4   64.9   71.4   78.4   74.9
"Never"                            "Men"    14.3   12.4   7.2    3.9    2.7
"Never"                            "Women"  8.8    5.0    4.6    2.0    2.5
"Never walk on dark streets, ro…"  "Men"    4.8    10.8   9.7    11.3   12.8
After some basic cleaning, we can see that this dataframe has a not uncommon pattern: each series is spread across columns (one per age group), with the identifying label stored in the column name. That’s fine for presentation purposes when you look at the table, but for plotting, life will be easier if we reshape this data from the current wide format to a long format.
We want to create two new columns from all the “16-24” etc. columns: an age_group column holding the old column names, and a percentage column holding the values.
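A sketch of the reshape (not necessarily the original cell; unpivot() is the name in recent polars releases, older versions call it melt()):

# Sketch (assumed): unpivot the age group columns into age_group/percentage pairs
reflectors = reflectors.unpivot(
    index=["Reflector use", "Sex"],
    variable_name="age_group",
    value_name="percentage",
)
reflectors.head()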
(reflectors
    .filter(pl.col("Reflector use").eq("Never"))
    .plot.line(x="age_group", y="percentage", color="Sex")
    .properties(
        width=600,
        height=300
    )
)
Classic. Men don’t wear reflectors as much as women. Old people are less likely to say that they never wear reflectors.
Actually I’m not sure who in Estonia lives in a place where they can only walk on well-lit streets. Let’s combine that category with “Never”.
(reflectors
    .with_columns(
        pl.col("Reflector use").str.replace("Never walk on dark streets, roads", "Never")
    )
    .group_by(["Reflector use", "Sex", "age_group"])
    .agg(pl.col("percentage").sum())
    .filter(pl.col("Reflector use").eq("Never"))
    .sort(["age_group", "Sex"])
    .plot.line(x="age_group", y="percentage", color="Sex")
    .properties(
        width=600,
        height=300
    )
)
Hmm. Maybe some people really don’t walk a lot, but it’s also likely that a lot of people, especially older people, don’t see the need to wear reflectors.
Sidebar on terminology
There is a plethora of names to refer to the two basic directions of data reshaping.
For wide to long, like we did here, polars uses unpivot(), pandas calls it melt(), and R’s dplyr calls it…pivot_longer().
For the opposite, long to wide, polars and pandas both use pivot(), while R’s dplyr calls it pivot_wider(). It’s sometimes also referred to as “cast”.
To make things supremely confusing, spreadsheet software like Excel or Google Sheets also has the concept of a pivot table, which doesn’t just pivot but also summarizes data values, unlike what we are doing here, where we are only changing the shape of our data, not the values themselves.
(Optional) LazyFrames and bigger than memory data
Aside from regular DataFrames, polars also has the concept of LazyFrames. These are abstract dataframes from some external source that are not read into memory (materialized) until you do something with them. And depending on what you do with them, this sometimes means that you can work with larger than memory datasets.
We’re not going to work with a 30GB dataset here, so let’s just illustrate the intuition with some code:
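The cell isn’t reproduced here; a sketch of the idea, with a hypothetical file name and column names (only "id1" comes from the description below):

# Sketch (hypothetical file and columns): scan, select, filter, then collect
query = (
    pl.scan_csv("data/some-large-file.csv")
    .select(["id1", "id2", "v1"])
    .filter(pl.col("id1") > 100)
)
result = query.collect()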
What this does is read data from a CSV, select 3 columns, and filter rows based on “id1” being greater than 100.
There are two key bits here:
pl.scan_csv(): unlike read_csv(), this does not immediately read the entire data into memory, but rather streams it…and only once we tell it to.
collect() to indicate that we want to start executing (materializing) our query.
What will happen with the particular code we have above is that polars will analyze our query and then apply two memory-saving optimizations:
It will only consider data that is in the 3 columns we selected.
From those columns, it will only read rows (elements) that satisfy our filter condition.
More on LazyFrames and the Lazy API at the user guide.
Pro-tip: always plot your data
Sometimes your job is to pipe data from A to B, and that’s fine.
But if you are doing data analysis or modeling (AI, ML, …), it often pays to be curious (and suspicious) about your data. Sometimes this helps you catch errors in what you are doing with the data. Other times it helps you reduce the “garbage in” part of the “garbage in, garbage out” equation.
Here’s a cute example some people created to make this point. We have two datasets; is there anything unusual about them?
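A sketch of how we load them, assuming the two files written out in the appendix below:

# Sketch (assumed, not the original cell): read the two example datasets with polars
df1 = pl.read_csv("data/dataset1.csv")
df2 = pl.read_csv("data/dataset2.csv")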
The data appear to be similar. Same number of rows and columns.
df1.shape
(142, 2)
df2.shape
(142, 2)
They also have similar distributions:
df1.describe()
shape: (9, 3)
statistic     x          y
str           f64        f64
"count"       142.0      142.0
"null_count"  0.0        0.0
"mean"        54.2661    47.834721
"std"         16.769825  26.939743
"min"         15.56075   0.015119
"25%"         39.706326  24.46783
"50%"         53.421463  48.398346
"75%"         69.359559  71.806616
"max"         91.639961  97.475771
df2.describe()
shape: (9, 3)
statistic     x          y
str           f64        f64
"count"       142.0      142.0
"null_count"  0.0        0.0
"mean"        54.263273  47.832253
"std"         16.765142  26.935403
"min"         22.3077    2.9487
"25%"         44.1026    25.2564
"50%"         53.5897    46.0256
"75%"         64.8718    69.1026
"max"         98.2051    99.4872
As you can guess from the section heading, we should try plotting them. (If the “x” and “y” column names weren’t hint enough by themselves…)
df1.plot.point("x", "y")
Ok, all good, looks boring.
df2.plot.point("x", "y")
Ahaa! 🦖
Appendix
Datasaurus
The cute point data from the last example came from a dataset called the Datasaurus dozen, which itself was inspired by an earlier, famous statistical toy dataset, Anscombe’s quartet.
The other 10 datasaurus plots are also here in the data, we just took out the two above to make life easier for the workshop.
# Split this file but just use it to make a point about also plotting data
datasaurus = pd.read_csv("data/datasaurus.csv")
df1 = datasaurus[datasaurus["dataset"] == "away"].drop("dataset", axis=1)
df2 = datasaurus[datasaurus["dataset"] == "dino"].drop("dataset", axis=1)
df1.to_csv("data/dataset1.csv", index=False)
df2.to_csv("data/dataset2.csv", index=False)