vignettes/speed-comparison-database-versus-file-based.Rmd
This vignette has some timings for database versus in-memory calculations using the entire ICEWS event data. Generally, in-memory calculations are much faster than calculations via SQL, but they come at the cost of loading several GB of data into memory, which on my laptop takes several minutes (see the timings below).
This vignette depends on the ICEWS event data I have locally on the laptop I'm using to compile it. The results are manually copied into the Rmd file, since the chunks cannot be evaluated at re-build on a server that does not have the local data present.
```r
tic()
events <- read_icews(find_db())
nrow(events)
format(object.size(events), "Gb")
toc()
#> [1] 18594331
#> [1] "3 Gb"
#> 282.18 sec elapsed
```
```r
tic()
events <- read_icews(find_raw())
nrow(events)
format(object.size(events), "Gb")
toc()
#> [1] 18594374
#> [1] "2.8 Gb"
#> 191.052 sec elapsed
```
The data from SQLite are slightly bigger because they include an extra column indicating the source file from which each set of events came. The command above reads directly from the raw source files, without recording which file an event came from.
The in-memory calculation is much faster than in the database, although slower overall if one also counts the overhead of reading the data into memory. A minimal sketch of the in-memory count by country and year, assuming dplyr and tictoc are loaded:
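```r
tic()
# count events for each country-year combination, entirely in memory
cy_total <- events %>%
  group_by(country, year) %>%
  summarize(total = n())
toc()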
#> 2.294 sec elapsed
```
And here is the same count, by querying the database:
```r
tic()
cy_total_db <- query_icews(
  "SELECT count(*) AS total FROM events GROUP BY country, year;")
toc()
#> 67.247 sec elapsed
```
Grouping first by year, then country:
```r
tic()
cy_total_db <- query_icews(
  "SELECT count(*) AS total FROM events GROUP BY year, country;")
toc()
#> 65.662 sec elapsed
```
This is an equivalent query, but using dplyr's behind-the-scenes SQL construction (via dbplyr):
```r
tic()
cy_total_db <- tbl(connect(), "events") %>%
  group_by(year, country) %>%
  summarize(events = n()) %>%
  collect()
toc()
#> 67.812 sec elapsed
```
SQL indices usually work better on columns that have high cardinality, i.e. many distinct values. Here is the number of unique values in each column of the events table:
```r
col_vals <- query_icews("
  select count(*) as rows,
         count(distinct(event_id)) as event_id,
         count(distinct(event_date)) as event_date,
         count(distinct(source_name)) as source_name,
         count(distinct(source_sectors)) as source_sectors,
         count(distinct(source_country)) as source_country,
         count(distinct(event_text)) as event_text,
         count(distinct(cameo_code)) as cameo_code,
         count(distinct(intensity)) as intensity,
         count(distinct(target_name)) as target_name,
         count(distinct(target_sectors)) as target_sectors,
         count(distinct(target_country)) as target_country,
         count(distinct(story_id)) as story_id,
         count(distinct(sentence_number)) as sentence_number,
         count(distinct(publisher)) as publisher,
         count(distinct(city)) as city,
         count(distinct(district)) as district,
         count(distinct(province)) as province,
         count(distinct(country)) as country,
         count(distinct(latitude)) as latitude,
         count(distinct(longitude)) as longitude,
         count(distinct(year)) as year,
         count(distinct(yearmonth)) as yearmonth,
         count(distinct(source_file)) as source_file
  from events;")
```
```r
col_vals <- col_vals %>%
  tidyr::pivot_longer(rows:source_file,
                      names_to = "Column",
                      values_to = "Unique_values") %>%
  arrange(Unique_values, Column)

col_vals %>%
  knitr::kable()
```
| Column          | Unique_values |
|:----------------|--------------:|
| sentence_number |             6 |
| year            |            26 |
| intensity       |            41 |
| source_file     |            48 |
| source_country  |           253 |
| target_country  |           253 |
| country         |           254 |
| cameo_code      |           273 |
| publisher       |           292 |
| event_text      |           304 |
| yearmonth       |           309 |
| province        |          5286 |
| event_date      |          9402 |
| district        |         10082 |
| target_name     |         75028 |
| source_name     |         77463 |
| latitude        |         88844 |
| city            |         91842 |
| longitude       |         92960 |
| target_sectors  |        272279 |
| source_sectors  |        337609 |
| story_id        |       9116259 |
| event_id        |      18449019 |
| rows            |      18594331 |
No indices are created by default, but if you prefer working with the events in a SQLite database and run certain queries frequently, it may make sense to create some. There are plenty of tutorials online on how to create good indices (for example [Use the Index, Luke!](https://use-the-index-luke.com/)), and you can always preface a query you are developing with `EXPLAIN QUERY PLAN [query]` (see the [SQLite documentation](https://www.sqlite.org/eqp.html) for details).
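As a sketch of what that might look like, assuming connect() returns a DBI connection to the events database as above (the index name here is made up):

```r
con <- connect()

# add an index covering the country-year grouping used above
# (hypothetical index name)
DBI::dbExecute(con, "CREATE INDEX IF NOT EXISTS idx_country_year
                     ON events (country, year);")

# ask SQLite how it plans to execute the grouped count; with the
# index in place it should scan the index instead of the full table
DBI::dbGetQuery(con, "EXPLAIN QUERY PLAN
                      SELECT count(*) AS total
                      FROM events
                      GROUP BY country, year;")
```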