Chapter 1 Introduction
Very often information about social entities is scattered across multiple databases. Combining that information into one database can result in enormous benefits for analysis, resulting in richer and more reliable conclusions. In most practical applications, however, analysts cannot simply link records across databases based on unique identifiers, such as social security numbers, either because they are not a part of some databases or are not available due to privacy concerns. In such cases, analysts need to use methods from statistical and computational science known as entity resolution (also called record linkage or de-duplication) to proceed with analysis. Entity resolution (ER) is not only a crucial task for social science and industrial applications, but is a challenging statistical and computational problem itself, because many databases contain errors (noise, lies, omissions, duplications, etc.), and the number of parameters to be estimated grows with the number of records (Ball, 2000; Bhattacharya & Getoor, 2006; Bilenko & Mooney, 2003; Christen, 2008, 2012; Cohen, Ravikumar, & Fienberg, 2003; Dai & Storkey, 2011; Gutman, Afendulis, & Zaslavsky, 2013; Hsu, Lee, Liu, & Ling, 2000; Jewell, Spagat, & Jewell, 2013; Larsen, 2002, 2005; Lum, Price, & Banks, 2013; McCallum & Wellner, 2004; Miller, Frawley, & Sayward, 2000; Murphy et al., 2007; Sadinle, 2014; Sariyar & Borg, 2010; Sariyar, Borg, & Pommerening, 2012). To meet present and near-future needs, entity resolution methods must be flexible and scalable to large databases; furthermore, they must be able to handle uncertainty and be easily integrated with post-linkage statistical analyses, such as logistic regression or capture recapture. All this must be done while maintaining accuracy and low error rates.
Turning to the context of an armed conflict, creating such models is incredibly challenging as typically grass roots movements, families, friends collect multiple reports on the same victims. This naturally causes duplications to occur in the data. In this paper, we study a real example from El Salvador, where a Truth Commission was formed by the United Nations in 1992. This Truth Commission collected data on killings that occurred during the Salvadoran civil war (1980 – 1991). Given the data collection process, a victim can be reported by different family and friends, and thus, one important aspect is to remove any duplications in the data in order to make it more reliable. In addition, removing such duplications allows one to estimate the number of documented identifiable deaths.
1.1 The United Nations Truth Commission for El Salvador
Between 1980 and 1991, the Republic of El Salvador witnessed a civil war between the central government, the left-wing guerrilla Farabundo Mart National Liberation Front (FMLN), and the right-wing para-military death squads. After the peace agreement in 1992, the United Nations created a Commission on the Truth (UNTC) for El Salvador, which invited members of Salvadoran society to report war-related human rights violations, which mainly focused on killings and disappearances. In order to collect such information the UNTC invited individuals through newspaper, radio, and television advertisements to come forward and testify. The UNTC opened offices through El Salvador where witnesses could provide their testimonials, and this resulted in a list of potential victims with names, date of death, and reported location.
Due to the fact that testimonials were provided to the UNTC many years after the civil war, it is expected that witnesses could not recall some of the details of the killings. In addition, some details regarding testimonials of the same individual, may contain conflicting or differing information. This is a natural characteristic of this data and leads to more noise, distortions, and missingness in the data. Furthermore, a victim can be reported multiple times, which leads to the an issue with duplication in the data. Finally, there is not unique identifiers for this data set that are thought to be reliable, which motivates the use of fully unsupervised Bayesian methods.
Our work builds off the seminal work of (Sadinle, 2014), where we use the same data set for consistency. We refer to (Sadinle, 2014) for complete details regarding the UNTC data set. The entire data set contains 5395 records. The fields (features) used in this paper are full name, date of death (year, month and day), municipality, and department. Table 1.1 provides an illustrative example of how duplicates can appears in the UNTC data set. Records 1, 2, and 3 in Table 1.1 represent an example of three duplicated records that most likely refer to the same person. This example is one example, where we may have non-coreferent decisions made by models or by humans in making decisions between pairs of records due to the nature of the data at hand. One advantage to our proposed approach is that we never look at pairwise comparisons of records. Turing to the second example in our table, records 3 and 4 agree on all the same information except on given name and family name. This illustrates potential issues that one faces regarding Hispanic names. For example, record 5 may refer to the same person in record 4. Record 5 could have typographical and missing information as it’s quite common for a given name of JULIAN ANDRES to drop a given name to JULIAN. Turning too the family name, It’s also quite common for one of the family names to be dropped. Thus, RAMOS ROJAS could be shortened to RAMOS. The typographical errors are quite common as the original data was scanned using OCR, and this is the most likely reason that such errors would appear. In short, declaring records 4 and 5 as co-referent depends highly we believe on the agreement or disagreement on the fields.
|Record||Given name||Family name||Year||Month||Day||Municipality|
|4.||JULIAN ANDRES||RAMOS ROJAS||1986||8||5||B|