30711 - RECORD LINKAGE

Academic Year 2025/2026

  • Instructor: Edoardo Redivo
  • Credits: 6
  • Scientific disciplinary sector (SSD): SECS-S/01
  • Teaching language: English
  • Teaching mode: Traditional - in-person lectures
  • Campus: Bologna
  • Degree programme: First cycle degree in Statistical Sciences (Laurea in Scienze statistiche, code 8873)

Learning outcomes

At the end of the course the student will know the methods for linking information that refers to the same statistical unit when that information is held in different archives and the unit is not identified by an error-free code. The student will be able to perform exact matching by means of deterministic and probabilistic record linkage, and to use the basic tools of statistical matching.

Course contents

  • The statistical formalisation of the record linkage problem
  • Deterministic record linkage
  • String similarity functions
  • Blocking
  • Fellegi-Sunter model and decision rule
  • Latent class model and its estimation via the EM algorithm
  • Record linkage as an assignment problem
  • Supervised classification for record linkage tasks
  • More recent developments and Bayesian models for record linkage
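
As a rough illustration of the Fellegi-Sunter decision rule listed above, the sketch below scores a pair of records by summing log2 likelihood ratios over compared fields and then applies two thresholds. It is written in Python purely for illustration (the course tutorials use R). The field names, the m- and u-probabilities, the thresholds, and the sample records are all hypothetical values chosen for the example, and the fields are assumed to be conditionally independent given match status.

```python
import math

# Hypothetical m- and u-probabilities per field (illustrative values only):
# m = P(field agrees | the two records are a true match)
# u = P(field agrees | the two records are a non-match)
M_U = {
    "surname": (0.95, 0.01),
    "birth_year": (0.90, 0.05),
    "city": (0.80, 0.20),
}

# Hypothetical thresholds on the total log2 weight.
UPPER, LOWER = 5.0, 0.0


def comparison_vector(rec_a, rec_b):
    """Binary agreement indicator for each compared field."""
    return {field: rec_a[field] == rec_b[field] for field in M_U}


def match_weight(gamma):
    """Sum of log2 likelihood ratios, assuming conditional independence."""
    total = 0.0
    for field, agrees in gamma.items():
        m, u = M_U[field]
        total += math.log2(m / u) if agrees else math.log2((1 - m) / (1 - u))
    return total


def classify(weight):
    """Three-way decision: link, possible link (clerical review), non-link."""
    if weight >= UPPER:
        return "link"
    if weight <= LOWER:
        return "non-link"
    return "possible link"


# Made-up records from two archives.
a = {"surname": "rossi", "birth_year": 1980, "city": "bologna"}
b = {"surname": "rossi", "birth_year": 1980, "city": "rimini"}
c = {"surname": "bianchi", "birth_year": 1975, "city": "bologna"}

w_ab = match_weight(comparison_vector(a, b))
w_ac = match_weight(comparison_vector(a, c))
print(classify(w_ab), classify(w_ac))
```

In practice the m- and u-probabilities are not known in advance; the latent class model fitted with the EM algorithm, also listed above, is one way to estimate them, and exact agreement can be replaced by a string similarity score.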

Readings/Bibliography

Suggested readings:

Data Matching: Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection, P. Christen, Springer, 2012. ISBN: 978-3-642-43001-5. Chapters 1, 2, 4, 5, 6.

An Introduction to Statistical Learning with Applications in R, G. James, D. Witten, T. Hastie, and R. Tibshirani, Springer, 2021, Sections 4.3, 4.4, 4.5, 8.1.

Data Quality and Record Linkage Techniques, T. N. Herzog, F. J. Scheuren, and W. E. Winkler, Springer, 2007. ISBN: 978-0-387-69502-0. Chapters 8 and 9.

(Almost) All of Entity Resolution, O. Binette and R. C. Steorts, Science Advances, vol. 8, no. 12, 2022.

The stringdist Package for Approximate String Matching, M. P. J. van der Loo, The R Journal, vol. 6, no. 1, 2014.

 

Other study material, including slides and R scripts from the tutorials, will be made available on Virtuale.

Teaching methods

Lectures and tutorials in R.

To participate in computer lab sessions, students must complete Modules 1 and 2 of health and safety training, available as online courses.

Assessment methods

A 2-hour exam using R, with exercises covering both practical and theoretical aspects of the course. The exam is designed to assess the ability to solve a practical record linkage task and the understanding of the underlying methods and models.

All work must be completed in an R script, with reproducible computations. Written answers should be provided as comments within the code. The final grade is based on the total points earned across all exercises.

Paper notes and printed resources are allowed, while electronic and online resources are not.

Teaching tools

Slides and notes will be made available on Virtuale.

Office hours

See the website of Edoardo Redivo.