This note describes the database internals. In addition to the main events table, there are several other tables that affect how the updating process works.

SQLite has a special sqlite_master table that describes the schema. This can be queried to find out what objects are present in the database, e.g.:

query_icews("select * from sqlite_master;")

With the default setup, four tables should be present:

  • events: contains the actual ICEWS events
  • stats: statistics for the events table; currently this contains only the total number of rows/events in the events table, and is used by dr_icews() when showing the total number of events in the database (if using the database option)
  • source_files and null_source_files: these tables track the raw data files (.tab) that have been ingested into the database. If a raw source file contains only duplicate events¹, no events from it are ingested and the events table contains no reference to this file. To avoid an attempt to re-download and re-ingest the file at a future update, it is in this case instead added to the null_source_files table, which is taken into account when resolving the various states (dataverse, database, files); see the example query after this list.
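
To see which files the resolver will treat as already handled, both tables can be queried directly, e.g.:

query_icews("select count(*) as n_ingested from source_files;")
query_icews("select count(*) as n_null from null_source_files;")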

Note that no indices are created by default for the events table. Usually it makes sense to add some to speed up common queries; for example, I have at times locally added indices for country, year, and yearmonth (see also the Indices section below).
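
In SQL, those indices can be added like below; this is a sketch that assumes a helper like connect() returning a DBI connection to the database file (connect() is an assumption here, adjust to however you open the connection).

library(DBI)

con <- connect()  # assumed helper returning a DBI connection to the database file
dbExecute(con, "CREATE INDEX IF NOT EXISTS idx_events_country   ON events (country);")
dbExecute(con, "CREATE INDEX IF NOT EXISTS idx_events_year      ON events (year);")
dbExecute(con, "CREATE INDEX IF NOT EXISTS idx_events_yearmonth ON events (yearmonth);")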

Create table statements

The package installation includes the SQL scripts that create these tables. They can be accessed and printed from R as shown below. (One can also retrieve the statements from SQLite itself with something like cat(query_icews("select sql from sqlite_master where name = 'events';")$sql).)

Events

cat(readLines(system.file("sql/events.sql", package = "icews")), sep = "\n")
CREATE TABLE IF NOT EXISTS events (
  event_id INTEGER NOT NULL,
  event_date INTEGER NOT NULL,
  source_name TEXT,
  source_sectors TEXT,
  source_country TEXT,
  event_text TEXT,
  cameo_code TEXT,
  intensity REAL,
  target_name TEXT,
  target_sectors TEXT,
  target_country TEXT,
  story_id INTEGER,
  sentence_number INTEGER,
  publisher TEXT,
  city TEXT,
  district TEXT,
  province TEXT,
  country TEXT,
  latitude REAL,
  longitude REAL,
  year INTEGER NOT NULL,
  yearmonth INTEGER NOT NULL,
  source_file TEXT NOT NULL,
  PRIMARY KEY (event_id, event_date)
);
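
Note that the primary key is the combination of event_id and event_date, so two records only count as duplicates if both fields match (this is the sense of "duplicate" in footnote 1). A quick way to check whether any event IDs occur under more than one date:

query_icews("
  select event_id, count(*) as n
  from events
  group by event_id
  having n > 1
  limit 5;")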

Stats table

cat(readLines(system.file("sql/stats.sql", package = "icews")), sep = "\n")
/*
   This sets up a stats table, for now containing only the number of records in
   events (events_n), and triggers to UPDATE the row count when any records
   are added or deleted FROM events.

   Point is to make dr_icews faster when querying # of records
*/

CREATE TABLE IF NOT EXISTS stats (
  name TEXT,
  value INTEGER
);

INSERT INTO stats VALUES ('events_n', NULL);

UPDATE stats SET value = ( SELECT count(*) FROM events )
WHERE name=='events_n';

/*
CREATE trigger update_events_n_after_insert AFTER INSERT ON events
BEGIN
  UPDATE stats SET value = ( SELECT count(*) FROM events )
  WHERE name=='events_n';
END;

CREATE trigger update_events_n_after_delete AFTER DELETE ON events
BEGIN
  UPDATE stats SET value = ( SELECT count(*) FROM events )
  WHERE name=='events_n';
END;
*/
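
Since the trigger definitions above are commented out, the row count is not maintained automatically; it is refreshed when the tables are rebuilt (see the update_stats() note further below). Reading the cached count is then a cheap lookup rather than a full scan of events:

query_icews("select value from stats where name = 'events_n';")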

Source and null source file tables

cat(readLines(system.file("sql/source_files.sql", package = "icews")), sep = "\n")
/*
   Create a table containing the currently ingested source_files, and update
   it via triggers if anything changes in the events table.

   Point is to make get_db_state() and thus update_icews(), etc. faster
*/

CREATE TABLE source_files (
  name TEXT
);

INSERT INTO source_files (name)
SELECT DISTINCT(source_file) AS name FROM events;

/*
CREATE trigger update_source_files_after_insert AFTER INSERT ON events
BEGIN
  DELETE FROM source_files;
  INSERT INTO source_files (name)
  SELECT DISTINCT(source_file) AS name FROM events;
END;

CREATE trigger update_source_files_after_delete AFTER DELETE ON events
BEGIN
  DELETE FROM source_files;
  INSERT INTO source_files (name)
  SELECT DISTINCT(source_file) AS name FROM events;
END;

CREATE trigger update_source_files_after_update AFTER UPDATE ON events
BEGIN
  DELETE FROM source_files;
  INSERT INTO source_files (name)
  SELECT DISTINCT(source_file) AS name FROM events;
END;
*/


/*
    With the daily update repository, there can be source files from which no
    events are added to the DB because all events are already present. To avoid
    re-ingesting these files every time an update is run (they leave no trace in
    the events table's source_file column), keep track of them in a separate table.
*/

CREATE TABLE null_source_files (
  name TEXT PRIMARY KEY
);

CREATE trigger update_source_files_after_null_ingest AFTER INSERT ON null_source_files
BEGIN
  INSERT INTO source_files (name)
  SELECT name FROM null_source_files EXCEPT SELECT name FROM source_files;
END;

/*
    Also update source_files when the table is destroyed/recreated with
    update_stats()
*/

CREATE trigger add_null_ingests_to_source_files AFTER INSERT ON source_files
BEGIN
  INSERT INTO source_files (name)
  SELECT name FROM null_source_files EXCEPT SELECT name FROM source_files;
END;
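
The effect of the null-ingest trigger can be checked against a throwaway in-memory database, without touching the real one; a sketch, with the table definitions trimmed to the relevant parts and a made-up file name:

library(DBI)
library(RSQLite)

con <- dbConnect(SQLite(), ":memory:")
dbExecute(con, "CREATE TABLE source_files (name TEXT);")
dbExecute(con, "CREATE TABLE null_source_files (name TEXT PRIMARY KEY);")
dbExecute(con, "
CREATE TRIGGER update_source_files_after_null_ingest AFTER INSERT ON null_source_files
BEGIN
  INSERT INTO source_files (name)
  SELECT name FROM null_source_files EXCEPT SELECT name FROM source_files;
END;")

# Ingesting a null source file also records it in source_files
dbExecute(con, "INSERT INTO null_source_files (name) VALUES ('20190101-icews-events.tab');")
dbGetQuery(con, "SELECT * FROM source_files;")

dbDisconnect(con)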

Indices

If you are running similar queries repeatedly, it can make sense to add indices to the events table. Set up correctly, they can dramatically speed up queries, but they take a while to create, use extra storage space in the database file, and slow down data updates. What I did in the past was to remove the indices, update, and then re-create them.
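
That workflow looks roughly like this (a sketch; connect() is again an assumed helper for obtaining a DBI connection, and the index names match the earlier example):

library(DBI)

con <- connect()  # assumed helper returning a DBI connection to the database file
dbExecute(con, "DROP INDEX IF EXISTS idx_events_country;")
dbExecute(con, "DROP INDEX IF EXISTS idx_events_year;")
dbExecute(con, "DROP INDEX IF EXISTS idx_events_yearmonth;")

update_icews(dryrun = FALSE)  # download and ingest any new source files

dbExecute(con, "CREATE INDEX idx_events_country   ON events (country);")
dbExecute(con, "CREATE INDEX idx_events_year      ON events (year);")
dbExecute(con, "CREATE INDEX idx_events_yearmonth ON events (yearmonth);")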

I found the general guide on indices at Use the index, Luke! to be helpful, as well as SQLite's EXPLAIN QUERY PLAN command, which shows which indices, if any, will be used for a query.
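
For example (the exact wording of the output depends on the SQLite version):

query_icews("explain query plan select count(*) from events where country = 'France';")

With an index on country, the plan should report something like "SEARCH TABLE events USING INDEX idx_events_country (country=?)"; without one, a full "SCAN TABLE events".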


  1. Specifically, records that share both event ID and event date.