This note describes the database internals. In addition to the main
events table, there are other tables that impact how the updating process works.
SQLite has a special `sqlite_master` table that describes the schema. This can be queried to find out what objects are present in the database, e.g.:

```r
query_icews("select * from sqlite_master;")
```
With the default setup, four tables should be present:

- `events`: contains the actual ICEWS events.
- `stats`: statistics for the events table; currently this only contains the total number of rows/events in the `events` table, and is used by `dr_icews()` when showing the total number of events in the database (if using the database option).
- `source_files` and `null_source_files`: these tables are related to the raw data files (.tab) that have been ingested into the database. If a raw source file only contains duplicate events[^1], no events are ingested into the database and the `events` table contains no reference to this file. To avoid an attempt to re-download and re-ingest the file at a future update, it is in that case instead added to the `null_source_files` table, which is taken into account when resolving the various states (dataverse, database, files). See the example queries after this list.
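These bookkeeping tables can be inspected with the same `query_icews()` helper used above; a quick sketch (the exact contents will depend on your local database):

```r
# Cached row count that dr_icews() reports
query_icews("SELECT * FROM stats;")

# Ingested source files, plus files that were skipped because they held only duplicates
query_icews("SELECT count(*) AS n_files FROM source_files;")
query_icews("SELECT * FROM null_source_files LIMIT 5;")
```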
Note that no indices are created by default for the `events` table. Usually it makes sense to add some to speed up common queries. For example, I have at some points locally added indices for `country`, `year`, and `yearmonth`.
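A minimal sketch of adding such indices, talking to the database file directly with DBI/RSQLite; the database path and the index names are placeholders (nothing the package sets up for you), so adjust them to your setup:

```r
library(DBI)

# Placeholder path; point this at your icews.sqlite3 database file
con <- dbConnect(RSQLite::SQLite(), "~/icews_data/db/icews.sqlite3")

# One single-column index per commonly filtered column
dbExecute(con, "CREATE INDEX IF NOT EXISTS idx_events_country   ON events (country);")
dbExecute(con, "CREATE INDEX IF NOT EXISTS idx_events_year      ON events (year);")
dbExecute(con, "CREATE INDEX IF NOT EXISTS idx_events_yearmonth ON events (yearmonth);")

dbDisconnect(con)
```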
The package installation includes SQL scripts that create these tables. These can be accessed and printed from R as shown below. (One can also get them from SQLite with something like `cat(query_icews("select sql from sqlite_master where name = 'events';")$sql)`.)
```r
cat(readLines(system.file("sql/events.sql", package = "icews")), sep = "\n")
```

```sql
CREATE TABLE IF NOT EXISTS events (
  event_id        INTEGER NOT NULL,
  event_date      INTEGER NOT NULL,
  source_name     TEXT,
  source_sectors  TEXT,
  source_country  TEXT,
  event_text      TEXT,
  cameo_code      TEXT,
  intensity       REAL,
  target_name     TEXT,
  target_sectors  TEXT,
  target_country  TEXT,
  story_id        INTEGER,
  sentence_number INTEGER,
  publisher       TEXT,
  city            TEXT,
  district        TEXT,
  province        TEXT,
  country         TEXT,
  latitude        REAL,
  longitude       REAL,
  year            INTEGER NOT NULL,
  yearmonth       INTEGER NOT NULL,
  source_file     TEXT NOT NULL,
  PRIMARY KEY (event_id, event_date)
);
```
```r
cat(readLines(system.file("sql/stats.sql", package = "icews")), sep = "\n")
```

```sql
/* This sets up a stats table, for now containing only the number of records
   in events (events_n), and triggers to UPDATE the row count when any records
   are added or deleted FROM events. Point is to make dr_icews faster when
   querying # of records */

CREATE TABLE IF NOT EXISTS stats (
  name  TEXT,
  value INTEGER
);

INSERT INTO stats VALUES ('events_n', NULL);

UPDATE stats
SET value = ( SELECT count(*) FROM events )
WHERE name=='events_n';

/*
CREATE trigger update_events_n_after_insert AFTER INSERT ON events
BEGIN
  UPDATE stats SET value = ( SELECT count(*) FROM events ) WHERE name=='events_n';
END;

CREATE trigger update_events_n_after_delete AFTER DELETE ON events
BEGIN
  UPDATE stats SET value = ( SELECT count(*) FROM events ) WHERE name=='events_n';
END;
*/
```
```r
cat(readLines(system.file("sql/source_files.sql", package = "icews")), sep = "\n")
```

```sql
/* Create a table containing the currently ingested source_files, and update
   it via triggers if anything changes in the events table. Point is to make
   get_db_state() and thus update_icews(), etc. faster */

CREATE TABLE source_files (
  name TEXT
);

INSERT INTO source_files (name)
SELECT DISTINCT(source_file) AS name FROM events;

/*
CREATE trigger update_source_files_after_insert AFTER INSERT ON events
BEGIN
  DELETE FROM source_files;
  INSERT INTO source_files (name)
  SELECT DISTINCT(source_file) AS name FROM events;
END;

CREATE trigger update_source_files_after_delete AFTER DELETE ON events
BEGIN
  DELETE FROM source_files;
  INSERT INTO source_files (name)
  SELECT DISTINCT(source_file) AS name FROM events;
END;

CREATE trigger update_source_files_after_update AFTER UPDATE ON events
BEGIN
  DELETE FROM source_files;
  INSERT INTO source_files (name)
  SELECT DISTINCT(source_file) AS name FROM events;
END;
*/

/* With the daily update repository, there can be source files from which no
   events are added to the DB because all events are already present. To avoid
   re-ingesting these files every time an update is run, because they are not
   in the stats source file column, keep track of them in a separate table. */

CREATE TABLE null_source_files (
  name TEXT PRIMARY KEY
);

CREATE trigger update_source_files_after_null_ingest AFTER INSERT ON null_source_files
BEGIN
  INSERT INTO source_files (name)
  SELECT name FROM null_source_files
  EXCEPT
  SELECT name FROM source_files;
END;

/* Also update source_files when the table is destroyed/recreated with update_stats() */
CREATE trigger add_null_ingests_to_source_files AFTER INSERT ON source_files
BEGIN
  INSERT INTO source_files (name)
  SELECT name FROM null_source_files
  EXCEPT
  SELECT name FROM source_files;
END;
```
If you are running similar queries repeatedly, it can make sense to add indices for the `events` table. These can dramatically increase query speed if set up correctly. However, they take a while to create and take up extra storage space in the database file. They also slow down data updates; what I did in the past was to remove the indices, update, and then re-add them afterwards (see the sketch below).
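A rough sketch of that remove/update/re-add cycle, again using DBI directly with a placeholder database path and assuming the index names from the earlier example:

```r
library(DBI)
library(icews)

db_path <- "~/icews_data/db/icews.sqlite3"  # placeholder; adjust to your setup

# Drop the indices so the ingest is not slowed down by index maintenance
con <- dbConnect(RSQLite::SQLite(), db_path)
for (idx in c("idx_events_country", "idx_events_year", "idx_events_yearmonth")) {
  dbExecute(con, paste0("DROP INDEX IF EXISTS ", idx, ";"))
}
dbDisconnect(con)

# Pull in new data (or however you normally run the update)
update_icews()

# Re-create the indices
con <- dbConnect(RSQLite::SQLite(), db_path)
dbExecute(con, "CREATE INDEX idx_events_country   ON events (country);")
dbExecute(con, "CREATE INDEX idx_events_year      ON events (year);")
dbExecute(con, "CREATE INDEX idx_events_yearmonth ON events (yearmonth);")
dbDisconnect(con)
```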
I found the general guide on indices at Use the Index, Luke! to be helpful, as well as the `EXPLAIN QUERY PLAN` command, which can be used to tell which indices, if any, will be used for a query.
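For example, something like this (the country filter is just an illustration) reports the plan SQLite would use:

```r
# Without an index on country this will report a full scan of the events table;
# with one it should instead mention searching events using that index
query_icews("EXPLAIN QUERY PLAN SELECT count(*) FROM events WHERE country = 'France';")
```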
[^1]: Specifically, records that share both event ID and event date.