This vignette gives an overview of the ICEWS event data: how each event record is structured, and some the relationship between different sets of information in the event records and columns. The latter is mainly for instances where superficial assumptions one might make, e.g. that “event_id” is a unique event ID, are wrong.
Note: the package is inherently unfriendly to a completely replicable workflow, as it is setup with the intention that one will use R options set in .Rprofile at each session start. And in any case, it is not trivial to throw around 5 GB of data.
One side effect is that several parts of this vignette incorporate results from the local data copy I have, but which is not included on GitHub or the package install. However, I have tried to make the code here as explicit and replicable as possible.
The data used cover the beginning in 1995 to a last event date of 2020-09-28.
Below are a couple of gotchas that might lead to wrongly aggregated data.
“Event ID” is not a unique ID, but together with “Event Date” it does appear to be unique. The table below shows the number of times event ID occur in the data. If it was unique, this would always be only once, but sometimes it occurs twice.
query_icews(" SELECT id_n, count(*) as instances FROM ( SELECT event_id, count(*) as id_n FROM events GROUP BY event_id ) GROUP BY id_n; ") %>% knitr::kable(format.args = list(big.mark = ','))
In all cases, the duplicate events have distinct event dates, and the combination of
[event_id, event_date] is a unique identifier for each row.
query_icews(" SELECT count(*) AS dupe_records, distinct_dates FROM ( SELECT event_id, Count(Distinct(event_date)) AS distinct_dates FROM events GROUP BY event_id HAVING Count(*) > 1 ) GROUP BY distinct_dates;") %>% knitr::kable(format.args = list(big.mark = ','))
Below are 5 pairs of records that have duplicate event IDs. These are clearly separate events and differ in details beyond the event date.
query_icews(" WITH dupes AS ( SELECT distinct(event_id) as event_id FROM events GROUP BY event_id HAVING count(*) > 1 LIMIT 5 ) SELECT event_id, event_date, source_name, source_country, event_text, target_name, target_country FROM events WHERE events.event_id IN dupes;") %>% knitr::kable()
|20718170||20131112||Japan||Japan||Host a visit||Jacob J. Lew||United States|
|20718170||20140101||Police (Australia)||Australia||Arrest, detain, or charge with legal action||Men (Australia)||Australia|
|20718171||20131112||Combatant (Syria)||Syria||Use unconventional violence||Military (Syria)||Syria|
|20718171||20140101||Police (Australia)||Australia||Arrest, detain, or charge with legal action||Children (Australia)||Australia|
|20718172||20131112||Itsunori Onodera||Japan||Make statement||Turkey||Turkey|
|20718172||20140101||Government Official (Democratic Republic of Congo)||Democratic Republic of Congo||Make statement||Attacker (Democratic Republic of Congo)||Democratic Republic of Congo|
|20718173||20131112||Refugee (Rwanda)||Rwanda||Accuse||Citizen (Rwanda)||Rwanda|
|20718173||20140101||Armed Rebel (South Sudan)||South Sudan||Use unconventional violence||Military (South Sudan)||South Sudan|
|20718174||20131112||Barry O’Farrell||Australia||Make statement||Australia||Australia|
|20718174||20140101||Military (South Sudan)||South Sudan||Use conventional military force||Armed Rebel (South Sudan)||South Sudan|
This issue only affects data through 2013 and does not occur in more recent data.
query_icews(" SELECT min(event_date), max(event_date) FROM ( SELECT event_date FROM events GROUP BY event_id HAVING Count(*) > 1 );") %>% knitr::kable()
Only one source file is responsible for the duplicate event IDs:
query_icews(" SELECT distinct(source_file), Count(*) as N_records FROM ( SELECT source_file FROM events GROUP BY event_id HAVING Count(*) > 1 );") %>% knitr::kable()
For example, “10 Downing Street” occurs multiple times in this query because it is coded as belonging to multiple sectors, but
|10 Downing Street||Executive,Executive Office,Government||United Kingdom||275|
|10 Downing Street||Executive Office,Government,Executive||United Kingdom||258|
|10 Downing Street||Government,Executive,Executive Office||United Kingdom||242|
|10 Downing Street||Executive,Government,Executive Office||United Kingdom||203|
|10 Downing Street||Government,Executive Office,Executive||United Kingdom||197|
|10 Downing Street||Executive Office,Executive,Government||United Kingdom||140|
|10 Downing Street||“Executive Office,Government,Executive”||United Kingdom||18|
|10 Downing Street||“Executive,Government,Executive Office”||United Kingdom||17|
|10 Downing Street||“Government,Executive Office,Executive”||United Kingdom||15|
|10 Downing Street||“Executive Office,Executive,Government”||United Kingdom||10|
If we resort the source sectors, it becomes apparent that there are only two distinct sets of source sectors:
foo %>% mutate(source_sectors = stringr::str_remove_all(source_sectors, "\""), record_id = 1:n()) %>% tidyr::separate_rows(source_sectors, sep = ",") %>% group_by(record_id) %>% summarize(source_sectors = paste0(sort(source_sectors), collapse = ", "), n = unique(n)) %>% group_by(source_sectors) %>% summarize(n = sum(n)) %>% knitr::kable()
|Executive, Executive Office, Government||1390|
|Executive, Executive Office, Government, International Government Organization, Regional, Regional Diplomatic IGOs||67|
Some actors also change source sectors over time, e.g. politicians who were in or not in government.
Each event only has one full CAMEO code, however the codes themselves reflect a hierarchical structure with three levels. At the top level (“level” 0 in the
cameo_codes data included in the package) are 20 categories for types of interactions:
|01||MAKE PUBLIC STATEMENT||0||1|
|03||EXPRESS INTENT TO COOPERATE||0||3|
|05||ENGAGE IN DIPLOMATIC COOPERATION||0||5|
|06||ENGAGE IN MATERIAL COOPERATION||0||6|
|15||EXHIBIT FORCE POSTURE||0||15|
|20||USE UNCONVENTIONAL MASS VIOLENCE||0||20|
The “level” variable in
cameo_codes refers to a CAMEO codes level in the hierarchy, ranging from 0 to 2, whereas “lvl0” refers to the “level” 0 top-level code, e.g. “01”, and “lvl1” refers to the intermediate-level code at “level” 1, e.g. “010”.
Each top level code includes up to 2 further levels down, e.g. for the top level code “20”:
|20||USE UNCONVENTIONAL MASS VIOLENCE||0||20||NA|
|200||Use unconventional mass violence||1||20||200|
|201||Engage in mass expulsion||1||20||201|
|202||Engage in mass killings||1||20||202|
|203||Engage in ethnic cleansing||1||20||203|
|204||Use weapons of mass destruction||1||20||204|
|2041||Use chemical, biological, or radiological weapons||2||20||204|
|2042||Detonate nuclear weapons||2||20||204|
This means that in practice some care needs to be taken to count all the correct events when aggregating to the top- or intermediate-level CAMEO codes. In SQL, one can do:
# Instead of: query_icews("SELECT count(*) FROM events WHERE cameo_code = '20';") # which will not return any records, do: query_icews("SELECT count(*) FROM events WHERE Substr(cameo_code, 1, 2) = '20';") # returns ~3,000 records
Some records are also missing a CAMEO code (NULL in SQL, NA in R).
The location information for the event (not actors) consists of country, province, district, city, latitude, longitude.
bind_cols( query_icews("select count(*) as n_unique_locations from (select distinct latitude, longitude, country, province, district, city from events);"), query_icews("select count(*) as n_unique_coords from (select distinct latitude, longitude from events);"), query_icews("select count(*) as n_unique_countries from (select distinct country from events);") ) %>% tidyr::gather(Measure, Value) %>% knitr::kable(format.args = list(big.mark = ','))
Coordinates are not sufficient to distinguish each unique location once we take city, district, etc. into account. Nothing short of the full set of columns [country, province, district, city, latitude, longitude] uniquely identifies all distinct locations.