Chapter 2 Current Approaches to Record Linkage

Many modern record linkage techniques can be viewed as an extension of the Fellegi-Sunter approach (FS), which computes pairwise probabilities of matching for all pairs of records using a likelihood ratio test (Fellegi & Sunter, 1969; Newcombe, Kennedy, Axford, & James, 1959). While modern FS methods are used today, such implementations assume that only two databases can be linked and that there are no duplicates within each database (Belin & Rubin, 1995; Gutman et al., 2013; Larsen & Rubin, 2001; Murray, 2016; Tancredi & Liseo, 2011). Furthermore, such approaches are known to be quite sensitive to the choice of the threshold that the likelihood ratio test is based upon. In short, these assumptions are inadequate for many record linkage tasks. Bayesian methods have been recently utilized in record linkage due to their flexibility and exact error propagation; however, they have been limited primarily to two-database matching, issues with scalability to large databases, and model misspecification (Copas & Hilton, 1990; Gutman et al., 2013; Sadinle, 2014, 2016; Tancredi & Liseo, 2011). These contributions, while valuable, do not easily generalize to multiple databases and to duplication within databases.

The most relevant work to our proposed methodology is that of (Sadinle, 2014), which deals with a special case of record linkage known as duplicate detection. Duplicate detection refers to removing duplicate entities within a data file (but not across and within data files). In (Sadinle, 2014), the authors propose a duplicate detection approach borrowing approaches from (Fellegi & Sunter, 1969; Newcombe et al., 1959), the Bayesian literature, and the blocking literature. Blocking (filtering or indexing) is a way of reducing down the entire space of records, such that one only must compare similar records. (Sadinle, 2014) first reduces down the space of all records using deterministic blocking rules. Next, the authors propose a Bayesian model based upon comparison data, which is an input from the blocking stage. One benefit of this is there is a computational cost from filtering records pairs, however, there is a tremendous drawback in that there is no way to propagate the uncertainty from the blocking mechanism in the duplicate detection step. The authors evaluate their proposed methodology for two municipalities, where ground truth is thought to be accurate. Further evaluations are performed in an entirely unsupervised fashion on the remaining municipalities.

In (Steorts, Hall, & Fienberg, 2016) a fully hierarchical-Bayesian approach to record linkage, using Dirichlet prior distributions over latent attributes and assuming a data distortion model. The authors derived an efficient hybrid (Metropolis-within-Gibbs) MCMC algorithm for fitting these models, SMERED. SMERED updates most of the latent variables and parameters using Gibbs sampling from conjugate conditional distributions. It updates the bipartite graph using a split-merge step, following (Jain & Neal, 2004). Thus, one has all the advantages of the Bayesian paradigm for both the latent entities and the linkage structure. Similar bipartite graph structures have been considered in the two-database scenario (Fortini, Liseo, Nuccitelli, & Scanu, 2001; Gutman et al., 2013; Larsen, 2002, 2005, 2012; Matsakis, 2010; Sadinle, 2014, 2016; Tancredi & Liseo, 2011). The attributes of the latent entities, the number of latent entities, the edges linking records to latents, etc., all have posterior distributions, and it is easy to sample from these distributions for uncertainty quantification or error propagation. More recently, (Steorts, 2015) extended these approaches to both categorical and noisy string data using an empirically motivated prior, , which is available on . The authors illustrated on real and simulated data that the EB method beat supervised methods (e.g., random forests, Bayesian Adaptive Regression Trees, logistic regression) when the training data is 10 percent (or less) of the total amount of data. While SMERED and the EB method work on moderately sized data sets, there are potential limitations with scaling to industrial-sized data sets. For a review on recent developments in Bayesian methods, see (Liseo & Tancredi, 2013).

2.1 Overview of the Article

In this section, we provide an overview of the article. In this paper, we provide five contributions to the literature. First, we propose an extension to the model for end-to-end empirical Bayesian entity resolution (Steorts, 2015). Specifically, we provide to our knowledge the first use of subjective priors on the linkage structure for generalized entity resolution. We consider two non-parametric priors on the linkage structure, which are the Pitman-Yor Process prior and the Dirichlet Process prior. Second, we do not require any dimension reduction (such as blocking) to be applied to the data, which means that the only sources of error in our inferential methods comes from the data and the entity resolution task. Third, our extension using generalized entity resolution propagates the error of the entity resolution task exactly into our inferential task. Fourth, our method considers an application the synthetic data, where we can understand and evaluate our methodology rigorously. Finally, our method looks at the case study to the UNTC data set from a fully unsupervised point of view.

In Section 3, we review prior methodology that is used in this paper, followed by a description of our proposed methodology. In Section 4, we apply our proposed model to a synthetic data set. In Section 5, we test our proposed methodology on our motivational data set from El Salvadoran Conflict, and then provide a discussion regarding our proposed work and directions for future research.