We will start with pandas, the original and most widely used dataframe and data analysis library in Python.
Getting Started
We first need to import pandas. We’ll also import another package, numpy, which implements data types for vectors and matrices. Pandas is built on top of numpy, but we only import it here to make it easier to generate an example data frame.
import numpy as np
import pandas as pd

# Create an example data frame
df = pd.DataFrame({
    "quarter": [1, 1, 2, 2, 3],
    "x": np.random.randn(5),
    "date": pd.date_range("2024-01-01", periods=5, freq="MS")
})
df
   quarter         x       date
0        1 -0.640642 2024-01-01
1        1 -0.801308 2024-02-01
2        2 -0.382784 2024-03-01
3        2 -0.945216 2024-04-01
4        3  0.296990 2024-05-01
Components of a dataframe
Pandas DataFrames consist of three components:
One or more Series, which are the columns in the DataFrame.
The names for the series, i.e. column names of the dataframe.
The row names for each row in the dataframe, which pandas calls the Index.
(Note how each series can have a different data type, unlike in a matrix or an array.)
Columns
df.columns
Index(['quarter', 'x', 'date'], dtype='object')
Index
df.index
RangeIndex(start=0, stop=5, step=1)
Since we didn’t explicitly set an index when we created the dataframe, it’s just a sequence of numbers starting at 0. Indexes are actually a key concept in pandas and we’ll talk a little bit more about them later.
Input: reading data from elsewhere
Pandas can import dataframes from a variety of external sources like text files, JSON, Excel spreadsheets, APIs, and SQL databases. See the input/output documentation for more information.
We’re going to read data on Estonian vehicle accidents from a comma-separated values (CSV) file, one of the most common text file formats for storing data.
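The read cell itself isn’t reproduced here; a minimal sketch of what it would look like, assuming the cleaned file name used at the end of this notebook:

# Sketch (assumed file name, matching the cleaning code at the bottom of this notebook)
accidents = pd.read_csv("data/estonia-traffic-accidents-clean.csv")

# Show the number of rows, the columns, their data types, and non-null counts
accidents.info()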
We could also have used accidents.dtypes to get the data types.
If you compare the non-null counts against the number of rows we looked up before, you’ll notice that none of the columns have missing values, and that all but one of the data types look correct. That’s because this is an already cleaned-up version of the data. At the bottom of this notebook you can see what we did to clean it up.
Selecting columns
There are various ways to do this, including the <dataframe>.<column> notation we used above. One alternative is the <dataframe>[] square brackets:
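For example (a minimal sketch, not necessarily the original cell):

# Select a single column with square brackets
accidents["county"].head()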
To select multiple columns we can use a list with the column names as the argument:
accidents[["date", "county"]].head()
                 date         county
0 2014-10-24 08:45:00  Harju maakond
1 2014-10-24 13:45:00  Harju maakond
2 2014-08-11 00:00:00  Harju maakond
3 2014-11-17 17:32:00  Harju maakond
4 2015-04-28 07:55:00  Harju maakond
Mutating columns
The dates are currently stored as strings (object). We should fix that.
(pandas uses the object data type by default for strings, even though it can store arbitrary Python objects. This is for historical reasons related to pandas originally being built on top of numpy. More on pandas text data types.)
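The conversion cell isn’t shown here; a sketch using pd.to_datetime, which also produces the first/last accident printout below:

# Sketch (assumed, not the original cell): parse the date strings into datetimes
accidents["date"] = pd.to_datetime(accidents["date"])

print(f"First accident: {accidents['date'].min()}")
print(f"Last accident: {accidents['date'].max()}")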
First accident: 2011-01-05 00:00:00
Last accident: 2021-12-31 23:45:00
What types of accidents were there?
The Series.value_counts() method is quite useful for tabulating categorical variables:
accidents["accident_type"].value_counts()
accident_type
Kokkupõrge 5605
Ühesõidukiõnnetus 3946
Jalakäijaõnnetus 3386
Muu liiklusõnnetus 1262
Teadmata 60
Name: count, dtype: int64
Filter rows
Oftentimes we don’t need all rows in a dataframe to answer a specific question. This is commonly called filtering or subsetting rows.
How many of the accidents were in Harju county?
We have a “county” column, so we can use that to look at only Harju county. As with selecting columns, there are multiple ways to do this in pandas. One of the basic ones is to again use the <dataframe>[] square brackets.
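The cell itself isn’t reproduced here; a sketch of the two steps (the comparison, then the filter):

# Sketch (assumed, not the original cell): a boolean vector, True for Harju county rows
is_harju = accidents["county"] == "Harju maakond"

# Pass the boolean vector to the square brackets to keep only those rows
accidents[is_harju].shape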
When we pass this boolean vector to the square brackets, pandas uses it to filter rows rather than to select columns, like it did with accidents["county"].
Yeah, it’s weird and inconsistent, which is one of the things people complain about with pandas.
What’s more, there are many more ways to select and filter. See the “Indexing and selecting data” documentation for all the various methods pandas has, both for column selecting and row filtering.
Mutate: add a new column
We’ve covered how to alter an existing column. We can actually use the same method to add a new column.
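A minimal sketch, assuming the original simply summed the two existing count columns (this killed_or_injured column is used in the aggregation below):

# Sketch (assumed, not the original cell): new column from two existing ones
accidents["killed_or_injured"] = accidents["killed"] + accidents["injured"]
accidents.head()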
Oftentimes we want to summarize our data over some group that is defined by one of the variables. To do this we usually want to use a combination of groupby and agg.
How many people were harmed, by accident type?
# Summarize total accident victims by accident type
by_type = accidents.groupby("accident_type").agg({"killed_or_injured": "sum"})
by_type
                    killed_or_injured
accident_type
Jalakäijaõnnetus                 3548
Kokkupõrge                       7951
Muu liiklusõnnetus               1436
Teadmata                           70
Ühesõidukiõnnetus                5016
(Optional) More on Indexes
In the table above, you might have noticed that “accident_type” is now, for some reason, shown differently from the “killed_or_injured” variable. Indeed, if we check the columns, it’s not there anymore:
by_type.columns
Index(['killed_or_injured'], dtype='object')
What happened is that when we did the group by and agg, pandas moved “accident_type” to the Index.
Since we hadn’t set an index when we imported the data from CSV, this previously was just an integer count from 0, which you can see above when we showed the first few rows of the data with head(). Now it’s “accident_type”.
pandas extensively uses indexes for various operations. There are even hierarchical MultiIndexes that consist of more than one variable.
There are really only two important things to know about pandas Indexes.
First, there are two kinds of pandas users:
Those who love indexes and use them extensively. Such index power users are rumored to exist, or so they say.
People like me who don’t use them unless forced to.
Second, indexes are like variables, but moved to the row labels. You can move them back and forth with two functions (a small round trip is sketched after this list):
set_index(<keys>): move columns to the index; can be a column name or list of names.
reset_index(): move the variables in the current index back to the dataframe as columns.
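As a minimal sketch (purely for illustration, not part of the original analysis):

# Hypothetical round trip: move "county" into the index and back out again
tmp = accidents.set_index("county")   # "county" becomes the row labels
tmp = tmp.reset_index()               # and it is a regular column again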
How many people were harmed, by year?
To further explore this, let’s look at the number of people harmed, by year.
This uses a bit more complicated code that we won’t explain in detail. .loc[] is one of the alternative select/filter methods. resample() is like groupby() but for time series. Because the code is quite long for one line, we do something called method chaining, where we put each new method call on a new line. This requires wrapping the whole statement in parentheses.
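The original cell isn’t reproduced here; a sketch of what it could look like (the exact resample frequency string depends on your pandas version, e.g. "A" or "YE" for year-end):

# Sketch (assumed, not the original cell): yearly totals via .loc and resample
by_year = (
    accidents
    .loc[:, ["date", "killed_or_injured", "persons_involved"]]  # keep only what we need
    .resample("YE", on="date")   # group by calendar year; use "A" on older pandas
    .sum()
)
by_year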
The date column has been moved to the index. Since we’ve aggregated the data to yearly, it would be nice, e.g. for plotting, if we just had the years in a column, not the misleading full date times.
# Move date back to a column; note how we get a new dummy 0, 1, ... index
by_year = by_year.reset_index()
by_year.head()
        date  killed_or_injured  persons_involved
0 2011-12-31                533               722
1 2012-12-31               1713              2289
2 2013-12-31               1714              2271
3 2014-12-31               1758              2429
4 2015-12-31               1773              2859
# To extract the year from the date we can use this:
by_year["date"].dt.year.head()
# Create a new column with the year
by_year["year"] = by_year["date"].dt.year

# Drop the date column; we could also do this by selecting all columns but
# the one we want to drop, but this is more explicit
by_year = by_year.drop("date", axis=1)

# Bring year to the first position
by_year = by_year[["year", "persons_involved", "killed_or_injured"]]
by_year
    year  persons_involved  killed_or_injured
0   2011               722                533
1   2012              2289               1713
2   2013              2271               1714
3   2014              2429               1758
4   2015              2859               1773
5   2016              3171               1874
6   2017              2906               1725
7   2018              3131               1886
8   2019              2923               1752
9   2020              2577               1592
10  2021              2735               1701
(Optional) Cleaning the accidents data
We mentioned above that the accidents data is already pretty clean. That’s because we did the following to clean it up. If you’re curious about this (great!), you can add more code cells and copy/paste each segment so you can see some intermediate output as well.
import dateparser  # for fixing the raw dates

accidents = pd.read_csv("data/estonia-traffic-accidents.csv")
accidents.head()
accidents.info()

# Let's only keep a couple of columns for the workshop
keep = ["Toimumisaeg", "Isikuid", "Hukkunuid", "Vigastatuid", "Maakond (PPA)",
        "Jalakäija osalusel", "Liiklusõnnetuse liik [1]", "Valgustus [1]"]
accidents = accidents[keep]

# Translate the column names to English
translate_columns = {
    "Toimumisaeg": "date",
    "Isikuid": "persons_involved",
    "Hukkunuid": "killed",
    "Vigastatuid": "injured",
    "Maakond (PPA)": "county",
    "Jalakäija osalusel": "pedestrian_involved",
    "Liiklusõnnetuse liik [1]": "accident_type",
    "Valgustus [1]": "light_conditions"
}
accidents = accidents.rename(columns=translate_columns)

# The original dates have a mix of formats. We're going to use the dateparser
# library to help parse these.
accidents["date"] = accidents["date"].apply(lambda x: dateparser.parse(x, languages=["en"]))

# We've got missing data in several columns. Let's drop those rows.
# Usually you'd want to investigate what is missing and why, but it's fine
# for the workshop since we are just illustrating, not trying to learn anything
# about traffic accidents.
accidents.isna().sum()
accidents = accidents.dropna()

# Some of the columns are float but should be integer values.
accidents = accidents.astype({
    "persons_involved": "int64",
    "pedestrian_involved": "int64"
})

# Write out the cleaned data.
# accidents.to_csv("data/estonia-traffic-accidents-clean.csv", index=False)
Polars is a more recent dataframe library, written in Rust, which has been gaining a lot of traction. It uses a very different philosophy and style for its API. Both of these (the Rust core and the API design) allow it to be faster and more memory-efficient than pandas. It also works out of the box with data that is too large to fit into memory.
Getting started
Let’s redo some of the common dataframe operations we did above, using the accidents data.
import polars as pl

accidents = pl.read_csv("data/estonia-traffic-accidents-clean.csv")
accidents.head()
shape: (5, 8)
date                   persons_involved  killed  injured  county           pedestrian_involved  accident_type       light_conditions
str                    i64               i64     i64      str              i64                  str                 str
"2014-10-24 08:45:00"  2                 0       1        "Harju maakond"  1                    "Jalakäijaõnnetus"  "Valge aeg"
"2014-10-24 13:45:00"  2                 0       1        "Harju maakond"  0                    "Kokkupõrge"        "Valge aeg"
"2014-08-11 00:00:00"  2                 0       1        "Harju maakond"  0                    "Kokkupõrge"        "Valge aeg"
"2014-11-17 17:32:00"  2                 0       2        "Harju maakond"  0                    "Kokkupõrge"        "Pimeda aeg"
"2015-04-28 07:55:00"  2                 0       1        "Harju maakond"  0                    "Kokkupõrge"        "Valge aeg"
Some basic differences from pandas
Like pandas, polars DataFrames are composed of Series. You’ll soon see that the API has a quite different style. Some other big differences:
No index.
The use of square brackets [] is discouraged, use methods instead.
Within reason, you want to use method chaining and do related things together, rather than splitting transformations line by line. This is so that the query optimizer can do its thing under the hood.
The biggest conceptual difference from pandas is expressions. So let’s talk about those, after a brief detour on how to select columns (since we will use column selection to illustrate expressions).
Convert from pandas to polars and vice versa
polars has functions for converting data frames back and forth:
import pyarrow

df = pl.DataFrame({"x": [1, 2, 3], "y": [4, 5, 6]})

# To make this a pandas dataframe
# (requires pyarrow)
df_pd = df.to_pandas()

# To convert it back to a polars dataframe
df_pl = pl.DataFrame(df_pd)
Selecting columns
With the select() method:
accidents.select("date", "county").head()
shape: (5, 2)
date                   county
str                    str
"2014-10-24 08:45:00"  "Harju maakond"
"2014-10-24 13:45:00"  "Harju maakond"
"2014-08-11 00:00:00"  "Harju maakond"
"2014-11-17 17:32:00"  "Harju maakond"
"2015-04-28 07:55:00"  "Harju maakond"
Expressions
Polars relies very heavily on expressions. These are data transformations that abstractly define what we want to do with some data; they only run within a context, which supplies the actual data the expression is applied to.
select() is one of the contexts in which an expression can be executed. One of the most basic expressions is pl.col(), which refers to a column in whatever data the context provides. We can use this to select a column, like we did above.
accidents.select(pl.col("date")).head()
shape: (5, 1)
date
str
"2014-10-24 08:45:00"
"2014-10-24 13:45:00"
"2014-08-11 00:00:00"
"2014-11-17 17:32:00"
"2015-04-28 07:55:00"
Composing expressions
What’s the point of that? By itself, a more verbose way to select columns is not that useful. What is very useful is that we can compose expressions to do many different things. For example:
How many people were injured or killed in the biggest accident in the data?
accidents.select(
    # select 'killed'
    pl.col("killed")
    # add 'injured'
    .add(pl.col("injured"))
    # give the result a new column name
    .alias("killed_or_injured")
    # identify the max value
    .max()
)
shape: (1, 1)
killed_or_injured
i64
23
Contexts
There are 4 contexts in which expressions can be used:
select()
filter()
with_columns() to mutate data frames
group_by() and aggregations
We’ll see all of those below.
For more on the concept of expressions in polars, see the Expressions and contexts section in the documentation.
(You can generally use operators like ==, +, and - as you’d expect them to work; the only thing is that if you want to do something further with the result, you have to wrap the calculation in extra parentheses, like (pl.col("a") + pl.col("b")).max(), versus pl.col("a").add(pl.col("b")).max().)
Another small example of how expressions can make life easier is below. Instead of filtering using the full county name “Harju maakond”, we can just look for strings that contain the substring “Harju”.
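A sketch of what that filter could look like (not necessarily the original cell):

# Sketch (assumed): substring match instead of spelling out the full county name
accidents.filter(pl.col("county").str.contains("Harju")).head()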
You use with_columns() and various expressions to mutate (add, change) columns in a dataframe. Here is how you would add a “killed_or_injured” column, like we did with pandas earlier.
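A minimal sketch of that step (not necessarily the original cell):

# Sketch (assumed): add the killed and injured counts together as a new column
accidents = accidents.with_columns(
    (pl.col("killed") + pl.col("injured")).alias("killed_or_injured")
)
accidents.head()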
Often you have information in different dataframes (tables) that you want to join (or merge) together. For example, what if we want to know the accident rate per capita by county? For this we can first group and aggregate to get the county-level number of accident victims, then join a table with population figures for each county.
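The grouped county-level frame isn’t reproduced in this section; a plausible sketch producing a by_county dataframe (hypothetical name) using the killed_or_injured column added above:

# Sketch (assumed, not the original cell): total victims per county
by_county = (
    accidents
    .group_by("county")
    .agg(pl.col("killed_or_injured").sum())
)
by_county.head()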
Since we conveniently already have county-level data from above, we just need to read and join in the county population data.
county_pop = (
    pl.read_csv("data/county-pop.csv", skip_rows=2)
    .rename({"County": "county", "Age groups total": "population"})
    .select(["county", "population"])
    # this has "county" in the county names, not "maakond"
    .with_columns(pl.col("county").str.replace("county", "maakond"))
)

by_county_w_pop = by_county.join(county_pop, on="county", how="left")
by_county_w_pop
Which 3 counties have the highest vehicle accident victim rates?
# Add your code here
Which 3 counties have the lowest vehicle accident victim rates?
# Add your code here
(Optional) Reshaping / pivoting dataframes
For this we will look at another dataset, on reflector usage. (🌃🎄 ’Tis the time of year…)
reflectors = (
    pl.read_csv("data/reflectors.csv", has_header=True, separator=";", skip_rows=2)
    .filter(pl.col("Sex").ne("Men and women"))
    .drop(["Type of data", "Year", "All age groups (16-64)"])
    .sort("Reflector use", "Sex")
)
reflectors.head()
shape: (5, 7)
Reflector use                      Sex      16-24  25-34  35-44  45-54  55-64
str                                str      f64    f64    f64    f64    f64
"Nearly always"                    "Men"    34.3   40.5   52.2   58.6   55.9
"Nearly always"                    "Women"  58.4   64.9   71.4   78.4   74.9
"Never"                            "Men"    14.3   12.4   7.2    3.9    2.7
"Never"                            "Women"  8.8    5.0    4.6    2.0    2.5
"Never walk on dark streets, ro…"  "Men"    4.8    10.8   9.7    11.3   12.8
After some basic cleaning, we can see that this dataframe has a not uncommon pattern: each series is spread across columns (one per age group), with the identifying label stored in the column name. That’s fine for presentation purposes when you look at the table, but for plotting, life will be easier if we reshape this data from the current wide format to a long format.
We want to create two new columns from all the “16-24” etc. columns: an age_group column holding the old column names, and a percentage column holding the values.
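A sketch of the reshape (not necessarily the original cell; unpivot() is the name in recent polars releases, older versions call it melt()):

# Sketch (assumed): unpivot the age group columns into age_group/percentage pairs
reflectors = reflectors.unpivot(
    index=["Reflector use", "Sex"],
    variable_name="age_group",
    value_name="percentage",
)
reflectors.head()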
(reflectors
    .filter(pl.col("Reflector use").eq("Never"))
    .plot.line(x="age_group", y="percentage", color="Sex")
    .properties(
        width=600,
        height=300
    )
)
Classic. Men don’t wear reflectors as much as women. Old people are less likely to say that they never wear reflectors.
Actually I’m not sure who in Estonia lives in a place where they can only walk on well-lit streets. Let’s combine that category with “Never”.
(reflectors
    .with_columns(
        pl.col("Reflector use").str.replace("Never walk on dark streets, roads", "Never")
    )
    .group_by(["Reflector use", "Sex", "age_group"])
    .agg(pl.col("percentage").sum())
    .filter(pl.col("Reflector use").eq("Never"))
    .sort(["age_group", "Sex"])
    .plot.line(x="age_group", y="percentage", color="Sex")
    .properties(
        width=600,
        height=300
    )
)
Hmm. Maybe some people really don’t walk a lot, but it’s also likely that a lot of people, especially older people, don’t see the need to wear reflectors.
Sidebar on terminology
There is a plethora of names to refer to the two basic directions of data reshaping.
For wide to long, like we did here, polars uses unpivot(), pandas calls it melt(), and R’s dplyr calls it…pivot_longer().
For the opposite, long to wide, polars and pandas both use pivot(), while R’s dplyr calls it pivot_wider(). It’s sometimes also referred to as “cast”.
To make things supremely confusing, spreadsheet software like Excel or Google Sheets also has the concept of a pivot table, which doesn’t just pivot but also summarizes data values, unlike what we are doing here, where we are only changing the shape of our data, not the values themselves.
(Optional) LazyFrames and bigger than memory data
Aside from regular DataFrames, polars also has the concept of LazyFrames. These are abstract dataframes from some external source that are not read into memory (materialized) until you do something with them. And depending on what you do with them, this sometimes means that you can work with larger than memory datasets.
We’re not going to work with a 30GB dataset here, so let’s just illustrate the intuition with some code:
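The cell isn’t reproduced here; a sketch of the idea, with a hypothetical file name and column names (only "id1" comes from the description below):

# Sketch (hypothetical file and columns): scan, select, filter, then collect
query = (
    pl.scan_csv("data/some-large-file.csv")
    .select(["id1", "id2", "v1"])
    .filter(pl.col("id1") > 100)
)
result = query.collect()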
What this does is read data from a CSV, select 3 columns, and filter rows based on “id1” being greater than 100.
There are two key bits here:
pl.scan_csv(): unlike read_csv(), this does not immediately read the entire data into memory, but rather streams it…and only once we tell it to.
collect() to indicate that we want to start executing (materializing) our query.
What will happen with the particular code we have above is that polars will analyze our query and then apply two memory-saving optimizations:
It will only consider data that is in the 3 columns we selected.
From those columns, it will only read rows (elements) that satisfy our filter condition.
More on LazyFrames and the Lazy API at the user guide.
Pro-tip: always plot your data
Sometimes your job is to pipe data from A to B, and that’s fine.
But if you are doing data analysis or modeling (AI, ML, …), it often pays to be curious (and suspicious) about your data. Sometimes this helps you catch errors in what you are doing with the data. Other times it helps you reduce the “garbage in” part of the “garbage in, garbage out” equation.
Here’s a cute example some people created to make this point. We have two datasets; is there anything unusual about them?
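A sketch of how we load them, assuming the two files written out in the appendix below:

# Sketch (assumed, not the original cell): read the two example datasets with polars
df1 = pl.read_csv("data/dataset1.csv")
df2 = pl.read_csv("data/dataset2.csv")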
The data appear to be similar. Same number of rows and columns.
df1.shape
(142, 2)
df2.shape
(142, 2)
They also have similar distributions:
df1.describe()
shape: (9, 3)
statistic     x          y
str           f64        f64
"count"       142.0      142.0
"null_count"  0.0        0.0
"mean"        54.2661    47.834721
"std"         16.769825  26.939743
"min"         15.56075   0.015119
"25%"         39.706326  24.46783
"50%"         53.421463  48.398346
"75%"         69.359559  71.806616
"max"         91.639961  97.475771
df2.describe()
shape: (9, 3)
statistic     x          y
str           f64        f64
"count"       142.0      142.0
"null_count"  0.0        0.0
"mean"        54.263273  47.832253
"std"         16.765142  26.935403
"min"         22.3077    2.9487
"25%"         44.1026    25.2564
"50%"         53.5897    46.0256
"75%"         64.8718    69.1026
"max"         98.2051    99.4872
As you can guess from the section heading, we should try plotting them. (If the “x” and “y” column names weren’t hint enough by themselves…)
df1.plot.point("x", "y")
Ok, all good, looks boring.
df2.plot.point("x", "y")
Ahaa! 🦖
Appendix
Datasaurus
The cute point data from the last example came from a dataset called the Datasaurus dozen, which itself was inspired by an earlier, famous statistical toy dataset, Anscombe’s quartet.
The other 10 datasaurus plots are also here in the data, we just took out the two above to make life easier for the workshop.
# Split this file but just use it to make a point about also plotting data
datasaurus = pd.read_csv("data/datasaurus.csv")
df1 = datasaurus[datasaurus["dataset"] == "away"].drop("dataset", axis=1)
df2 = datasaurus[datasaurus["dataset"] == "dino"].drop("dataset", axis=1)
df1.to_csv("data/dataset1.csv", index=False)
df2.to_csv("data/dataset2.csv", index=False)