Chapter 6 Conclusion

In this paper, we have provided several novel contributions to the literature. First, we have introduced what are, to our knowledge, the first subjective priors (the Pitman-Yor and Dirichlet process priors) for entity resolution with both categorical and string-valued data. Second, we have incorporated missing values into our model, which makes it better suited to real data. Third, we have derived the conditional distributions and implemented a Gibbs sampler for our proposed model. Fourth, we have illustrated the strengths and weaknesses of our model on both synthetic and real data. On the synthetic data, our model outperforms the uniform prior, where performance is measured by standard entity resolution metrics. On the real data (the UNTC data set), our model does well with respect to inference about the underlying population; however, precision and recall suffer. Similarity metrics better adapted to how names appear in this data set may exist, but finding them seems to be a difficult task. A centroid or latent variable model may not be the best approach for such data; this question is still under exploration and is left for future work.