30711 - Record Linkage

Academic Year 2022/2023

  • Teacher: Riccardo D'Alberto
  • Credits: 6
  • SSD: SECS-S/01
  • Language: English
  • Modules: Riccardo D'Alberto (Module 1), Riccardo D'Alberto (Module 2)
  • Teaching Mode: Traditional lectures (Module 1), Traditional lectures (Module 2)
  • Campus: Bologna
  • Course: First cycle degree programme (L) in Statistical Sciences (code 8873)

Learning outcomes

At the end of the course the student will know the methods for linking information that refers to the same statistical unit when this information is held in different archives and the unit is not identified by an error-free code. The student will be able to perform exact matching, by means of deterministic and probabilistic record linkage, and to use the basic tools of statistical matching.

Course contents

- The conditions for using a database for statistical purposes.

- Data quality properties and how to measure them.

- Improving data quality through editing, imputation, and record linkage.

- The question of merging lists.

- The problem of duplication.

- Conditional independence and statistical matching techniques.

- Automatic data editing and imputation.

- Non-random and probabilistic record linkage.

- Blocking techniques.

- The problem of disclosure and access to microdata.

- Examples in economics, health statistics, and Official Statistics.

- Examples with the use of the software R.
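
As a rough illustration of two items in the list above, blocking and probabilistic record linkage, the following minimal sketch scores candidate record pairs with Fellegi-Sunter style log-weights. It is written in Python purely for illustration (the course examples use R). The field names, the toy records, the m- and u-probabilities and the threshold are all hypothetical assumptions, not course material.

```python
import math

# Hypothetical m- and u-probabilities (illustrative values, not estimated from data):
#   m = P(field agrees | the two records refer to the same unit)
#   u = P(field agrees | the two records refer to different units)
M_U = {"surname": (0.95, 0.01), "birth_year": (0.90, 0.05)}


def match_weight(rec_a, rec_b):
    """Sum the log2 agreement/disagreement weights over the comparison fields."""
    total = 0.0
    for field, (m, u) in M_U.items():
        if rec_a[field] == rec_b[field]:
            total += math.log2(m / u)              # agreement weight (positive)
        else:
            total += math.log2((1 - m) / (1 - u))  # disagreement weight (negative)
    return total


def candidate_pairs(file_a, file_b, block_key):
    """Blocking: compare only records that share the same blocking-key value."""
    blocks = {}
    for rec in file_b:
        blocks.setdefault(rec[block_key], []).append(rec)
    for rec_a in file_a:
        for rec_b in blocks.get(rec_a[block_key], []):
            yield rec_a, rec_b


# Toy archives (made-up records).
file_a = [
    {"id": "a1", "surname": "rossi", "birth_year": 1980, "city": "bologna"},
    {"id": "a2", "surname": "bianchi", "birth_year": 1975, "city": "rimini"},
]
file_b = [
    {"id": "b1", "surname": "rossi", "birth_year": 1980, "city": "bologna"},
    {"id": "b2", "surname": "verdi", "birth_year": 1980, "city": "bologna"},
    {"id": "b3", "surname": "bianchi", "birth_year": 1975, "city": "forli"},
]

THRESHOLD = 5.0  # hypothetical cut-off above which a pair is declared a link

links = [
    (a["id"], b["id"])
    for a, b in candidate_pairs(file_a, file_b, block_key="city")
    if match_weight(a, b) > THRESHOLD
]
print(links)
```

In this toy setting only the pair a1/b1 is compared and exceeds the threshold. The pair a2/b3 agrees on both comparison fields but is never compared, because the two records fall into different blocks: blocking reduces the number of comparisons at the risk of missing true matches.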

Readings/Bibliography

Christen, P. (2012). Data Matching: Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection. Berlin: Springer, pp.270. ISBN: 978-3-642-43001-5.

D'Orazio, M., Di Zio, M., Scanu, M. (2006). Statistical Matching: Theory and Practice. Chichester: Wiley & Sons, pp.272. ISBN: 978-0-470-02353-2.

Herzog, T.N., Scheuren, F.J., Winkler, W.E. (2007). Data Quality and Record Linkage Techniques. New York: Springer, pp.227. ISBN: 978-0-387-69502-0.

Zhang, L.-C., Chambers, R.L. (eds.) (2019). Analysis of Integrated Data. Boca Raton: Chapman & Hall/CRC Press, pp.256. ISBN: 978-1-4987-2798-3.

Further bibliographical references, papers, technical reports, R scripts and data sets will be given during the course.

Teaching methods

Lectures and practical exercises with the software R.

Assessment methods

The final exam for this module of the course consists of a written essay AND an oral exam.

The written essay (a "take-home" assignment) will be based on a case study, data set(s) and/or scientific articles proposed by the student by the end of the lecture period and approved by the teacher. The essay must be sent to the teacher no later than 5 days before the oral exam. It will be discussed during the oral exam, which will also cover the theoretical and practical topics addressed during the lectures.

A final overall mark for the two modules of the course will be proposed to each student after the exams for BOTH modules have been taken.

Further details on the requirements of the written essay and the work to be done will be given during the course.

Teaching tools

Slides outlining the content of the lectures, as well as additional materials (e.g., scientific articles, technical reports, data sets, R scripts), will be available through Virtuale.

Office hours

See the website of Riccardo D'Alberto

SDGs

Quality education; Industry, innovation and infrastructure; Reduced inequalities; Partnerships for the goals

This teaching activity contributes to the achievement of the Sustainable Development Goals of the UN 2030 Agenda.