ICEWS event data

This vignette gives an overview of the ICEWS event data: how each event record is structured, and some the relationship between different sets of information in the event records and columns. The latter is mainly for instances where superficial assumptions one might make, e.g. that “event_id” is a unique event ID, are wrong.

Note: the package is inherently unfriendly to a completely replicable workflow, as it is setup with the intention that one will use R options set in .Rprofile at each session start. And in any case, it is not trivial to throw around 5 GB of data.

One side effect is that several parts of this vignette incorporate results from the local data copy I have, but which is not included on GitHub or the package install. However, I have tried to make the code here as explicit and replicable as possible.

The data used cover the beginning in 1995 to a last event date of 2020-09-28.

suppressPackageStartupMessages({
  library("icews")
  library("dplyr")
  library("tidyr")
  library("stringr")
})

Things to look out for

Below are a couple of gotchas that might lead to wrongly aggregated data.

Event ID is not unique

“Event ID” is not a unique ID, but together with “Event Date” it does appear to be unique. The table below shows the number of times event ID occur in the data. If it was unique, this would always be only once, but sometimes it occurs twice.

query_icews("
SELECT id_n, count(*) as instances
FROM ( 
      SELECT event_id, count(*) as id_n 
      FROM events
      GROUP BY event_id
     )
GROUP BY id_n;
") %>% knitr::kable(format.args = list(big.mark = ','))

id_n	instances
1	18,303,707
2	145,312

In all cases, the duplicate events have distinct event dates, and the combination of [event_id, event_date] is a unique identifier for each row.

query_icews("
SELECT count(*) AS dupe_records, distinct_dates
FROM (
  SELECT   event_id, Count(Distinct(event_date)) AS distinct_dates
  FROM     events
  GROUP BY event_id
  HAVING   Count(*) > 1
)
GROUP BY distinct_dates;") %>%
  knitr::kable(format.args = list(big.mark = ','))

dupe_records	distinct_dates
145,312	2

Below are 5 pairs of records that have duplicate event IDs. These are clearly separate events and differ in details beyond the event date.

query_icews("
WITH dupes AS (
  SELECT distinct(event_id) as event_id
  FROM   events
  GROUP BY event_id
  HAVING count(*) > 1
  LIMIT 5
)
SELECT event_id, event_date, source_name, source_country, event_text,
       target_name, target_country
FROM   events
WHERE  events.event_id IN dupes;") %>%
  knitr::kable()

event_id	event_date	source_name	source_country	event_text	target_name	target_country
20718170	20131112	Japan	Japan	Host a visit	Jacob J. Lew	United States
20718170	20140101	Police (Australia)	Australia	Arrest, detain, or charge with legal action	Men (Australia)	Australia
20718171	20131112	Combatant (Syria)	Syria	Use unconventional violence	Military (Syria)	Syria
20718171	20140101	Police (Australia)	Australia	Arrest, detain, or charge with legal action	Children (Australia)	Australia
20718172	20131112	Itsunori Onodera	Japan	Make statement	Turkey	Turkey
20718172	20140101	Government Official (Democratic Republic of Congo)	Democratic Republic of Congo	Make statement	Attacker (Democratic Republic of Congo)	Democratic Republic of Congo
20718173	20131112	Refugee (Rwanda)	Rwanda	Accuse	Citizen (Rwanda)	Rwanda
20718173	20140101	Armed Rebel (South Sudan)	South Sudan	Use unconventional violence	Military (South Sudan)	South Sudan
20718174	20131112	Barry O’Farrell	Australia	Make statement	Australia	Australia
20718174	20140101	Military (South Sudan)	South Sudan	Use conventional military force	Armed Rebel (South Sudan)	South Sudan

This issue only affects data through 2013 and does not occur in more recent data.

query_icews("
SELECT min(event_date), max(event_date)
FROM (
  SELECT   event_date
  FROM     events
  GROUP BY event_id
  HAVING   Count(*) > 1 
);") %>%
  knitr::kable()

min(event_date)	max(event_date)
20070201	20131231

Only one source file is responsible for the duplicate event IDs:

query_icews("
SELECT distinct(source_file), Count(*) as N_records
FROM (
  SELECT   source_file
  FROM     events
  GROUP BY event_id
  HAVING   Count(*) > 1 
);") %>%
  knitr::kable()

source_file	N_records
events.2013.20150313084929.tab	145312

Each actor can have multiple sectors, and they can change over time

For example, “10 Downing Street” occurs multiple times in this query because it is coded as belonging to multiple sectors, but

those sets are not always the same,
the order in which they appear in “source_sectors” is variable,
and sometimes the value is quoted and sometimes not.

foo <- query_icews("
SELECT source_name, source_sectors, source_country, Count(*) AS n 
FROM   events
WHERE  source_name = '10 Downing Street'
GROUP BY source_name, source_sectors, source_country")

foo %>% 
  arrange(desc(n)) %>%
  head(n=10) %>%
  knitr::kable()

source_name	source_sectors	source_country	n
10 Downing Street	Executive,Executive Office,Government	United Kingdom	275
10 Downing Street	Executive Office,Government,Executive	United Kingdom	258
10 Downing Street	Government,Executive,Executive Office	United Kingdom	242
10 Downing Street	Executive,Government,Executive Office	United Kingdom	203
10 Downing Street	Government,Executive Office,Executive	United Kingdom	197
10 Downing Street	Executive Office,Executive,Government	United Kingdom	140
10 Downing Street	“Executive Office,Government,Executive”	United Kingdom	18
10 Downing Street	“Executive,Government,Executive Office”	United Kingdom	17
10 Downing Street	“Government,Executive Office,Executive”	United Kingdom	15
10 Downing Street	“Executive Office,Executive,Government”	United Kingdom	10

If we resort the source sectors, it becomes apparent that there are only two distinct sets of source sectors:

foo %>% 
  mutate(source_sectors = stringr::str_remove_all(source_sectors, "\""),
         record_id = 1:n()) %>% 
  tidyr::separate_rows(source_sectors, sep = ",") %>% 
  group_by(record_id) %>% 
  summarize(source_sectors = paste0(sort(source_sectors), collapse = ", "), 
            n = unique(n)) %>% 
  group_by(source_sectors) %>% 
  summarize(n = sum(n)) %>%
  knitr::kable()

source_sectors	n
Executive, Executive Office, Government	1390
Executive, Executive Office, Government, International Government Organization, Regional, Regional Diplomatic IGOs	67

Some actors also change source sectors over time, e.g. politicians who were in or not in government.

CAMEO codes are unique but have a nested hierarchical structure

Each event only has one full CAMEO code, however the codes themselves reflect a hierarchical structure with three levels. At the top level (“level” 0 in the cameo_codes data included in the package) are 20 categories for types of interactions:

data(cameo_codes) 
cameo_codes %>% 
  filter(level==0) %>% 
  select(cameo_code, name, level, lvl0) %>%
  knitr::kable()

cameo_code	name	lvl0
01	MAKE PUBLIC STATEMENT	1
02	APPEAL	2
03	EXPRESS INTENT TO COOPERATE	3
04	CONSULT	4
05	ENGAGE IN DIPLOMATIC COOPERATION	5
06	ENGAGE IN MATERIAL COOPERATION	6
07	PROVIDE AID	7
08	YIELD	8
09	INVESTIGATE	9
10	DEMAND	10
11	DISAPPROVE	11
12	REJECT	12
13	THREATEN	13
14	PROTEST	14
15	EXHIBIT FORCE POSTURE	15
16	REDUCE RELATIONS	16
17	COERCE	17
18	ASSAULT	18
19	FIGHT	19
20	USE UNCONVENTIONAL MASS VIOLENCE	20

The “level” variable in cameo_codes refers to a CAMEO codes level in the hierarchy, ranging from 0 to 2, whereas “lvl0” refers to the “level” 0 top-level code, e.g. “01”, and “lvl1” refers to the intermediate-level code at “level” 1, e.g. “010”.

Each top level code includes up to 2 further levels down, e.g. for the top level code “20”:

cameo_codes %>%
  filter(lvl0==20) %>%
  select(cameo_code, name, level, lvl0, lvl1) %>%
  knitr::kable()

cameo_code	name	level	lvl0	lvl1
20	USE UNCONVENTIONAL MASS VIOLENCE	0	20	NA
200	Use unconventional mass violence	1	20	200
201	Engage in mass expulsion	1	20	201
202	Engage in mass killings	1	20	202
203	Engage in ethnic cleansing	1	20	203
204	Use weapons of mass destruction	1	20	204
2041	Use chemical, biological, or radiological weapons	2	20	204
2042	Detonate nuclear weapons	2	20	204

This means that in practice some care needs to be taken to count all the correct events when aggregating to the top- or intermediate-level CAMEO codes. In SQL, one can do:

# Instead of:
query_icews("SELECT count(*) FROM events WHERE cameo_code = '20';")
# which will not return any records, do:
query_icews("SELECT count(*) FROM events WHERE Substr(cameo_code, 1, 2) = '20';")
# returns ~3,000 records

Some records are also missing a CAMEO code (NULL in SQL, NA in R).

Summary statistics

Total number of records

query_icews("SELECT max(event_date) AS Last_Event_Date, 
                    Count(*) as Records 
             FROM events;") %>%
  mutate(Last_Event_Date = as.Date(as.character(Last_Event_Date), format = "%Y%m%d"),
         Today = Sys.Date()) %>%
  select(Today, Last_Event_Date, Records) %>%
  knitr::kable(format.args = list(big.mark = ','))

Today	Last_Event_Date	Records
2020-10-15	2020-09-28	18,594,331

Distinct locations

The location information for the event (not actors) consists of country, province, district, city, latitude, longitude.

bind_cols(
  query_icews("select count(*) as n_unique_locations from (select distinct latitude, longitude, country, province, district, city from events);"),
  query_icews("select count(*) as n_unique_coords from (select distinct latitude, longitude from events);"),
  query_icews("select count(*) as n_unique_countries from (select distinct country from events);")
) %>%
  tidyr::gather(Measure, Value) %>%
  knitr::kable(format.args = list(big.mark = ','))

Measure	Value
n_unique_locations	128,126
n_unique_coords	114,035
n_unique_countries	255

Coordinates are not sufficient to distinguish each unique location once we take city, district, etc. into account. Nothing short of the full set of columns [country, province, district, city, latitude, longitude] uniquely identifies all distinct locations.

Andreas Beger