vignettes/speed-comparison-database-versus-file-based.Rmd
This vignette has some timings for database versus in-memory calculations using the entire ICEWS event data. Generally, in-memory calculations are much faster than calculations via SQL, but they come at the cost of loading several GB of data into memory, which on my laptop takes several minutes (see the timings below).
This vignette depends on the ICEWS event data I have locally on the laptop I'm using to compile it. The results are manually copied into the Rmd file, since the chunks cannot be evaluated at re-build on a server that does not have the local data present.
```r
tic()
events <- read_icews(find_db())
nrow(events)
format(object.size(events), "Gb")
toc()
#> [1] 18594331
#> [1] "3 Gb"
#> 282.18 sec elapsed
```
```r
tic()
events <- read_icews(find_raw())
nrow(events)
format(object.size(events), "Gb")
toc()
#> [1] 18594374
#> [1] "2.8 Gb"
#> 191.052 sec elapsed
```
The data from SQLite are slightly bigger because they include an extra column indicating the source file from which each set of events came. The command above reads directly from the raw source files, without recording which file an event came from.
The in-memory calculation is much faster than in the database, although slower overall if one also counts the overhead of reading the data into memory. A minimal sketch of the in-memory count by country and year, assuming dplyr and tictoc are loaded:
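```r
tic()
# count events for each country-year combination, entirely in memory
cy_total <- events %>%
  group_by(country, year) %>%
  summarize(total = n())
toc()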
#> 2.294 sec elapsed
```
And here is the same count, by querying the database:
```r
tic()
cy_total_db <- query_icews(
  "SELECT count(*) AS total FROM events GROUP BY country, year;")
toc()
#> 67.247 sec elapsed
```
Grouping first by year, then country:
```r
tic()
cy_total_db <- query_icews(
  "SELECT count(*) AS total FROM events GROUP BY year, country;")
toc()
#> 65.662 sec elapsed
```
This is an equivalent query, but using dplyr's behind-the-scenes SQL construction (via dbplyr):
```r
tic()
cy_total_db <- tbl(connect(), "events") %>%
  group_by(year, country) %>%
  summarize(events = n()) %>%
  collect()
toc()
#> 67.812 sec elapsed
```
SQL indices usually work better on columns that have high cardinality, i.e. many distinct values. Here is the number of unique values in each column of the events table:
```r
col_vals <- query_icews("
  select count(*) as rows,
         count(distinct(event_id)) as event_id,
         count(distinct(event_date)) as event_date,
         count(distinct(source_name)) as source_name,
         count(distinct(source_sectors)) as source_sectors,
         count(distinct(source_country)) as source_country,
         count(distinct(event_text)) as event_text,
         count(distinct(cameo_code)) as cameo_code,
         count(distinct(intensity)) as intensity,
         count(distinct(target_name)) as target_name,
         count(distinct(target_sectors)) as target_sectors,
         count(distinct(target_country)) as target_country,
         count(distinct(story_id)) as story_id,
         count(distinct(sentence_number)) as sentence_number,
         count(distinct(publisher)) as publisher,
         count(distinct(city)) as city,
         count(distinct(district)) as district,
         count(distinct(province)) as province,
         count(distinct(country)) as country,
         count(distinct(latitude)) as latitude,
         count(distinct(longitude)) as longitude,
         count(distinct(year)) as year,
         count(distinct(yearmonth)) as yearmonth,
         count(distinct(source_file)) as source_file
  from events;")
```
```r
col_vals <- col_vals %>%
  tidyr::pivot_longer(rows:source_file,
                      names_to = "Column",
                      values_to = "Unique_values") %>%
  arrange(Unique_values, Column)

col_vals %>%
  knitr::kable()
```
| Column          | Unique_values |
|:----------------|--------------:|
| sentence_number |             6 |
| year            |            26 |
| intensity       |            41 |
| source_file     |            48 |
| source_country  |           253 |
| target_country  |           253 |
| country         |           254 |
| cameo_code      |           273 |
| publisher       |           292 |
| event_text      |           304 |
| yearmonth       |           309 |
| province        |          5286 |
| event_date      |          9402 |
| district        |         10082 |
| target_name     |         75028 |
| source_name     |         77463 |
| latitude        |         88844 |
| city            |         91842 |
| longitude       |         92960 |
| target_sectors  |        272279 |
| source_sectors  |        337609 |
| story_id        |       9116259 |
| event_id        |      18449019 |
| rows            |      18594331 |
No indices are created by default, but if you prefer working with the events in a SQLite database and run certain queries frequently, it may make sense to create some. There are plenty of tutorials online on how to create good indices (for example [Use the Index, Luke!](https://use-the-index-luke.com/)), and you can always preface a query you are developing with `EXPLAIN QUERY PLAN [query]` (see the [SQLite documentation](https://www.sqlite.org/eqp.html) for details).
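As a sketch of what that might look like, assuming connect() returns a DBI connection to the events database as above (the index name here is made up):

```r
con <- connect()

# add an index covering the country-year grouping used above
# (hypothetical index name)
DBI::dbExecute(con, "CREATE INDEX IF NOT EXISTS idx_country_year
                     ON events (country, year);")

# ask SQLite how it plans to execute the grouped count; with the
# index in place it should scan the index instead of the full table
DBI::dbGetQuery(con, "EXPLAIN QUERY PLAN
                      SELECT count(*) AS total
                      FROM events
                      GROUP BY country, year;")
```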