This vignette gives an overview of the ICEWS event data: how each event record is structured, and some the relationship between different sets of information in the event records and columns. The latter is mainly for instances where superficial assumptions one might make, e.g. that “event_id” is a unique event ID, are wrong.

Note: the package is inherently unfriendly to a completely replicable workflow, as it is setup with the intention that one will use R options set in .Rprofile at each session start. And in any case, it is not trivial to throw around 5 GB of data.

One side effect is that several parts of this vignette incorporate results from the local data copy I have, but which is not included on GitHub or the package install. However, I have tried to make the code here as explicit and replicable as possible.

The data used cover the beginning in 1995 to a last event date of 2020-09-28.

Things to look out for

Below are a couple of gotchas that might lead to wrongly aggregated data.

Event ID is not unique

“Event ID” is not a unique ID, but together with “Event Date” it does appear to be unique. The table below shows the number of times event ID occur in the data. If it was unique, this would always be only once, but sometimes it occurs twice.

query_icews("
SELECT id_n, count(*) as instances
FROM ( 
      SELECT event_id, count(*) as id_n 
      FROM events
      GROUP BY event_id
     )
GROUP BY id_n;
") %>% knitr::kable(format.args = list(big.mark = ','))
id_n instances
1 18,303,707
2 145,312

In all cases, the duplicate events have distinct event dates, and the combination of [event_id, event_date] is a unique identifier for each row.

query_icews("
SELECT count(*) AS dupe_records, distinct_dates
FROM (
  SELECT   event_id, Count(Distinct(event_date)) AS distinct_dates
  FROM     events
  GROUP BY event_id
  HAVING   Count(*) > 1
)
GROUP BY distinct_dates;") %>%
  knitr::kable(format.args = list(big.mark = ','))
dupe_records distinct_dates
145,312 2

Below are 5 pairs of records that have duplicate event IDs. These are clearly separate events and differ in details beyond the event date.

query_icews("
WITH dupes AS (
  SELECT distinct(event_id) as event_id
  FROM   events
  GROUP BY event_id
  HAVING count(*) > 1
  LIMIT 5
)
SELECT event_id, event_date, source_name, source_country, event_text,
       target_name, target_country
FROM   events
WHERE  events.event_id IN dupes;") %>%
  knitr::kable()
event_id event_date source_name source_country event_text target_name target_country
20718170 20131112 Japan Japan Host a visit Jacob J. Lew United States
20718170 20140101 Police (Australia) Australia Arrest, detain, or charge with legal action Men (Australia) Australia
20718171 20131112 Combatant (Syria) Syria Use unconventional violence Military (Syria) Syria
20718171 20140101 Police (Australia) Australia Arrest, detain, or charge with legal action Children (Australia) Australia
20718172 20131112 Itsunori Onodera Japan Make statement Turkey Turkey
20718172 20140101 Government Official (Democratic Republic of Congo) Democratic Republic of Congo Make statement Attacker (Democratic Republic of Congo) Democratic Republic of Congo
20718173 20131112 Refugee (Rwanda) Rwanda Accuse Citizen (Rwanda) Rwanda
20718173 20140101 Armed Rebel (South Sudan) South Sudan Use unconventional violence Military (South Sudan) South Sudan
20718174 20131112 Barry O’Farrell Australia Make statement Australia Australia
20718174 20140101 Military (South Sudan) South Sudan Use conventional military force Armed Rebel (South Sudan) South Sudan

This issue only affects data through 2013 and does not occur in more recent data.

query_icews("
SELECT min(event_date), max(event_date)
FROM (
  SELECT   event_date
  FROM     events
  GROUP BY event_id
  HAVING   Count(*) > 1 
);") %>%
  knitr::kable()
min(event_date) max(event_date)
20070201 20131231

Only one source file is responsible for the duplicate event IDs:

query_icews("
SELECT distinct(source_file), Count(*) as N_records
FROM (
  SELECT   source_file
  FROM     events
  GROUP BY event_id
  HAVING   Count(*) > 1 
);") %>%
  knitr::kable()
source_file N_records
events.2013.20150313084929.tab 145312

Each actor can have multiple sectors, and they can change over time

For example, “10 Downing Street” occurs multiple times in this query because it is coded as belonging to multiple sectors, but

  • those sets are not always the same,
  • the order in which they appear in “source_sectors” is variable,
  • and sometimes the value is quoted and sometimes not.
foo <- query_icews("
SELECT source_name, source_sectors, source_country, Count(*) AS n 
FROM   events
WHERE  source_name = '10 Downing Street'
GROUP BY source_name, source_sectors, source_country")

foo %>% 
  arrange(desc(n)) %>%
  head(n=10) %>%
  knitr::kable()
source_name source_sectors source_country n
10 Downing Street Executive,Executive Office,Government United Kingdom 275
10 Downing Street Executive Office,Government,Executive United Kingdom 258
10 Downing Street Government,Executive,Executive Office United Kingdom 242
10 Downing Street Executive,Government,Executive Office United Kingdom 203
10 Downing Street Government,Executive Office,Executive United Kingdom 197
10 Downing Street Executive Office,Executive,Government United Kingdom 140
10 Downing Street “Executive Office,Government,Executive” United Kingdom 18
10 Downing Street “Executive,Government,Executive Office” United Kingdom 17
10 Downing Street “Government,Executive Office,Executive” United Kingdom 15
10 Downing Street “Executive Office,Executive,Government” United Kingdom 10

If we resort the source sectors, it becomes apparent that there are only two distinct sets of source sectors:

foo %>% 
  mutate(source_sectors = stringr::str_remove_all(source_sectors, "\""),
         record_id = 1:n()) %>% 
  tidyr::separate_rows(source_sectors, sep = ",") %>% 
  group_by(record_id) %>% 
  summarize(source_sectors = paste0(sort(source_sectors), collapse = ", "), 
            n = unique(n)) %>% 
  group_by(source_sectors) %>% 
  summarize(n = sum(n)) %>%
  knitr::kable()
source_sectors n
Executive, Executive Office, Government 1390
Executive, Executive Office, Government, International Government Organization, Regional, Regional Diplomatic IGOs 67

Some actors also change source sectors over time, e.g. politicians who were in or not in government.

CAMEO codes are unique but have a nested hierarchical structure

Each event only has one full CAMEO code, however the codes themselves reflect a hierarchical structure with three levels. At the top level (“level” 0 in the cameo_codes data included in the package) are 20 categories for types of interactions:

data(cameo_codes) 
cameo_codes %>% 
  filter(level==0) %>% 
  select(cameo_code, name, level, lvl0) %>%
  knitr::kable()
cameo_code name level lvl0
01 MAKE PUBLIC STATEMENT 0 1
02 APPEAL 0 2
03 EXPRESS INTENT TO COOPERATE 0 3
04 CONSULT 0 4
05 ENGAGE IN DIPLOMATIC COOPERATION 0 5
06 ENGAGE IN MATERIAL COOPERATION 0 6
07 PROVIDE AID 0 7
08 YIELD 0 8
09 INVESTIGATE 0 9
10 DEMAND 0 10
11 DISAPPROVE 0 11
12 REJECT 0 12
13 THREATEN 0 13
14 PROTEST 0 14
15 EXHIBIT FORCE POSTURE 0 15
16 REDUCE RELATIONS 0 16
17 COERCE 0 17
18 ASSAULT 0 18
19 FIGHT 0 19
20 USE UNCONVENTIONAL MASS VIOLENCE 0 20

The “level” variable in cameo_codes refers to a CAMEO codes level in the hierarchy, ranging from 0 to 2, whereas “lvl0” refers to the “level” 0 top-level code, e.g. “01”, and “lvl1” refers to the intermediate-level code at “level” 1, e.g. “010”.

Each top level code includes up to 2 further levels down, e.g. for the top level code “20”:

cameo_codes %>%
  filter(lvl0==20) %>%
  select(cameo_code, name, level, lvl0, lvl1) %>%
  knitr::kable()
cameo_code name level lvl0 lvl1
20 USE UNCONVENTIONAL MASS VIOLENCE 0 20 NA
200 Use unconventional mass violence 1 20 200
201 Engage in mass expulsion 1 20 201
202 Engage in mass killings 1 20 202
203 Engage in ethnic cleansing 1 20 203
204 Use weapons of mass destruction 1 20 204
2041 Use chemical, biological, or radiological weapons 2 20 204
2042 Detonate nuclear weapons 2 20 204

This means that in practice some care needs to be taken to count all the correct events when aggregating to the top- or intermediate-level CAMEO codes. In SQL, one can do:

# Instead of:
query_icews("SELECT count(*) FROM events WHERE cameo_code = '20';")
# which will not return any records, do:
query_icews("SELECT count(*) FROM events WHERE Substr(cameo_code, 1, 2) = '20';")
# returns ~3,000 records

Some records are also missing a CAMEO code (NULL in SQL, NA in R).

Summary statistics

Total number of records

query_icews("SELECT max(event_date) AS Last_Event_Date, 
                    Count(*) as Records 
             FROM events;") %>%
  mutate(Last_Event_Date = as.Date(as.character(Last_Event_Date), format = "%Y%m%d"),
         Today = Sys.Date()) %>%
  select(Today, Last_Event_Date, Records) %>%
  knitr::kable(format.args = list(big.mark = ','))
Today Last_Event_Date Records
2020-10-15 2020-09-28 18,594,331

Distinct locations

The location information for the event (not actors) consists of country, province, district, city, latitude, longitude.

bind_cols(
  query_icews("select count(*) as n_unique_locations from (select distinct latitude, longitude, country, province, district, city from events);"),
  query_icews("select count(*) as n_unique_coords from (select distinct latitude, longitude from events);"),
  query_icews("select count(*) as n_unique_countries from (select distinct country from events);")
) %>%
  tidyr::gather(Measure, Value) %>%
  knitr::kable(format.args = list(big.mark = ','))
Measure Value
n_unique_locations 128,126
n_unique_coords 114,035
n_unique_countries 255

Coordinates are not sufficient to distinguish each unique location once we take city, district, etc. into account. Nothing short of the full set of columns [country, province, district, city, latitude, longitude] uniquely identifies all distinct locations.