This vignette has some timings for database versus in memory calculations using the entire ICEWS event data. Generally, in-memory calculations are much faster than calculations via SQL, but they come at the cost of loading several GB into memory, which on my laptop takes about 2 minutes.

This vignette depends on the ICEWS event data I have locally on the laptop I’m using to compile it. The results are manually copied into the Rmd file, since evaluating the chunks at re-build is not possible on a server that doesn’t have the local data present.

library(icews)
library(dplyr)
library(tidyr)
library(tictoc)

tic()
nrow(events)
format(object.size(events), "Gb")
toc()
#> [1] 18594331
#> "3 GB"
#> 282.18 sec elapsed
tic()
nrow(events)
format(object.size(events), "Gb")
toc()
#> [1] 18594374
#> [1] "2.8 Gb"
#> 191.052 sec elapsed

The data from SQLite are slightly bigger because there is an extra column indicating the source file from which a set of events came. The command above directly reads from the raw source files, without recording which file an event came from.

# Country-year event counts

The in-memory calculation is much faster than in the database, although longer if one counts the overhead of reading the data into memory as well.

tic()
cy_total_mem <- events %>%
group_by(country, year) %>%
summarize(events = n())
toc()
#> 2.294 sec elapsed

And here by querying the database:

tic()
cy_total_db <- query_icews(
"SELECT count(*) AS total FROM events GROUP BY country, year;")
toc()
#> 67.247 sec elapsed

Grouping first by year, then country:

tic()
cy_total_db <- query_icews(
"SELECT count() AS total FROM events GROUP BY year, country;")
toc()
#> 65.662 sec elapsed

This is an equivalent SQL query but using dplyr’s behind the scenes construction:

tic()
cy_total_db <- tbl(connect(), "events") %>%
group_by(year, country) %>%
summarize(events = n()) %>%
collect()
toc()
#> 67.812 sec elapsed

# Column cardinality

Usually SQL indices work better on columns that have high cardinality.

col_vals <- query_icews([1177 chars quoted with '"'])

col_vals <- col_vals %>%
tidyr::pivot_longer(rows:source_file,
names_to = "Column",
values_to = "Unique_values") %>%
arrange(Unique_values, Column)

col_vals %>%
knitr::kable()
Column Unique_values
sentence_number 6
year 26
intensity 41
source_file 48
source_country 253
target_country 253
country 254
cameo_code 273
publisher 292
event_text 304
yearmonth 309
province 5286
event_date 9402
district 10082
target_name 75028
source_name 77463
latitude 88844
city 91842
longitude 92960
target_sectors 272279
source_sectors 337609
story_id 9116259
event_id 18449019
rows 18594331

No indices are created by default, but if you prefer working with the events in a SQLite database and doing certain queries frequently then it may make sense to create some. There are plenty of tutorials online on how to create good indices (for example Use the index, Luke!), and you can always preface a query you are developing with EXPLAIN QUERY PLAN [query] (see here for details).