An unsupervised blocking technique for more efficient record linkage

K. O'Hare, Anna Jurek-Loughrey, Cassio de Campos

Research output: Contribution to journal › Article › Academic › peer-review

11 Citations (Scopus)
32 Downloads (Pure)

Abstract

Record linkage, also referred to as entity resolution, is the process of identifying pairs of records that represent the same real-world entity (for example, a person) within a dataset or across multiple datasets. This enables the integration of multi-source data, which in turn supports better knowledge discovery. To reduce the number of record comparisons, record linkage frameworks first perform a process commonly referred to as blocking, which separates records into blocks using a partition (or blocking) scheme. During the linkage process, comparisons are then restricted to records that belong to the same block. Existing blocking techniques often require some form of manual fine-tuning of parameter values for optimal performance. Optimal parameter values may be selected manually by a domain expert or learned automatically from labelled data. However, in many real-world situations no such labelled dataset is available. In this paper we propose a novel unsupervised blocking technique for structured datasets that requires neither labelled data nor manual fine-tuning of parameters. Experimental evaluations across a large number of datasets demonstrate that this approach often outperforms both supervised and unsupervised baseline techniques, often in less time.
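To make the general idea of blocking concrete, the following is a minimal Python sketch of plain key-based blocking. It is not the unsupervised technique proposed in the paper, whose details are not given in this abstract. The blocking key (first letter of the surname plus birth year), the field names, and the toy records are all assumptions chosen purely for illustration.

```python
from collections import defaultdict
from itertools import combinations


def blocking_key(record):
    # Hypothetical blocking scheme (an assumption, not the paper's method):
    # first letter of the surname, lower-cased, plus the birth year.
    return (record["surname"][:1].lower(), record["birth_year"])


def build_blocks(records, key_fn):
    # Partition record ids into blocks that share the same blocking key.
    blocks = defaultdict(list)
    for rec_id, record in records.items():
        blocks[key_fn(record)].append(rec_id)
    return blocks


def candidate_pairs(blocks):
    # Only records inside the same block are paired for comparison.
    pairs = set()
    for ids in blocks.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


# Toy records, invented for this example.
records = {
    1: {"surname": "Smith", "birth_year": 1980},
    2: {"surname": "Smyth", "birth_year": 1980},
    3: {"surname": "Jones", "birth_year": 1975},
    4: {"surname": "Johnson", "birth_year": 1975},
    5: {"surname": "Brown", "birth_year": 1990},
}

blocks = build_blocks(records, blocking_key)
pairs = candidate_pairs(blocks)

# Number of comparisons without blocking: every unordered pair of records.
total_pairs = len(records) * (len(records) - 1) // 2

print(f"{len(pairs)} candidate pairs instead of {total_pairs}")
```

In this toy case the scheme keeps the plausible match (records 1 and 2) as a candidate pair while discarding most of the other comparisons. It also pairs records 3 and 4, which are not obviously the same entity, and a poorly chosen key could separate true matches entirely. That sensitivity to the choice of scheme is why blocking techniques are typically tuned by hand or learned from labelled data, as the abstract notes.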

Original language: English
Pages (from-to): 181-195
Number of pages: 15
Journal: Data & Knowledge Engineering
Volume: 122
Publication status: Published - 1 Jul 2019

Keywords

  • Entity resolution
  • Record linkage
  • Unsupervised blocking
