Abstract

Entity resolution (also known as record linkage or de-duplication) is the process of removing duplicate entities from large, noisy databases. The task becomes more difficult when unique identifiers are absent and many of the observed records have missing values. Furthermore, entity resolution involves tradeoffs among assumptions about the data-generating process, error rates, and computational scalability, which make it difficult in real applications. This paper is motivated by a real data set from El Salvador, where a Truth Commission formed by the United Nations in 1992 collected data on killings that occurred during the Salvadoran civil war (1980-1991). Because of the data collection process, victims can be duplicated, as they may have been reported by different relatives, friends, or grassroots teams working in the area. Our goals are (1) to build flexible and robust models that are computationally fast, (2) to better understand which types of models are well suited for conflict data, and (3) to provide estimates and evaluations of the number of documented identifiable deaths for our motivating data set.

Keywords: record linkage, entity resolution, de-duplication, conflict data, Bayesian methods, El Salvador