90483 - Modern Statistics And Big Data Analytics

Academic Year 2021/2022

  • Teaching Mode: Traditional lectures
  • Campus: Bologna
  • Degree programme: Second cycle degree programme (LM) in Statistical Sciences (code 9222)

Learning outcomes

By the end of the course, the student gains an understanding of the theory and computation of modern statistical methods, with particular emphasis on methods for analysing large amounts of data (big data). More specifically, the student acquires knowledge of the most important methods of statistical learning and prediction and the skills required to solve real-world and decision-making problems.

Course contents

Cluster analysis: k-means, construction of distances, hierarchical clustering, partitioning around medoids, average silhouette width, mixture models, with algorithms, R-coding, theory, applications and in-depth discussion; outlook on big data issues

Robust statistics: Influence function, breakdown point, robust estimation of univariate and multivariate location and scale and regression, with algorithms, R-coding, theory, applications and in-depth discussion

Advanced issues in dimension reduction and model selection, dealing with model selection bias, bootstrap for model assessment

Readings/Bibliography

Everitt, B. S., Landau, S., Leese, M., Cluster Analysis (fourth edition), E. Arnold 2001.

Hastie, T., Tibshirani, R., Friedman, J., The Elements of Statistical Learning (second edition), Springer 2009.

Hennig, C., Meila, M., Murtagh, F., and Rocci, R., Handbook of Cluster Analysis, Taylor & Francis 2016.

Maronna, R. A., Martin, R. D., Yohai, V. J., Salibián-Barrera, M., Robust Statistics: Theory and Methods (with R), 2nd Edition, Wiley 2019.

Teaching methods

Classroom lessons, tutorials, computer workshop

Assessment methods

The assessment has two components plus a bonus component. About 10/30 marks are assigned to a literature question to be completed at home. These marks are given for the ability to understand the scientific explanation of new methodology that builds on and is closely related to methodology introduced in the course, the understanding of which is thereby also assessed implicitly. About 20/30 marks are assigned to a three-hour exam comprising a theoretical exercise, a data analysis project, and an exercise that asks questions about interpreting the given output of another data analysis. The aspects examined here are understanding of the theoretical background and how it is derived (carrying a rather low percentage of the marks), the ability to apply methodology learnt in the course to a real dataset, and the ability to understand and draw practically relevant conclusions from the computer output of such methodology.

5/30 bonus marks are assigned for regular homework activity (homework needs to be done, but not necessarily correctly; students can work in groups if they choose to do so). They count on top of the marks achieved in the written exam and the literature question as long as these are below 30. A 30L is achieved either by having at least 31 marks including the homework bonus, or by having all 30 marks in the literature question and the exam alone.
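
The way the three components combine can be illustrated with a short sketch. This is only one reading of the rules stated above, not an official grading tool. The function name final_grade and its parameter names are hypothetical. The sketch also assumes that the homework bonus can take any value between 0 and 5, that the literature question and exam together carry at most 30 marks, and that no rounding is applied; none of these details is specified above.

```python
def final_grade(literature, exam, homework_bonus=0.0):
    """Combine the assessment components into a final mark.

    Hypothetical helper: the names and the validation rules are
    assumptions made for illustration, not part of the course rules.
    Returns "30L" or a numeric mark capped at 30.
    """
    # Assumption: the bonus is a value between 0 and 5 marks.
    if not 0 <= homework_bonus <= 5:
        raise ValueError("homework_bonus must be between 0 and 5")

    # Literature question (about 10 marks) plus exam (about 20 marks).
    base = literature + exam
    if not 0 <= base <= 30:
        raise ValueError("literature + exam must be between 0 and 30")

    total = base + homework_bonus

    # 30L: full marks without the bonus, or at least 31 with it.
    if base == 30 or total >= 31:
        return "30L"

    # Otherwise the bonus counts on top, but the mark cannot exceed 30.
    return min(total, 30)
```

Under these assumptions, final_grade(9, 18, 5) gives "30L" because the total of 32 is at least 31, whereas final_grade(8, 20, 2) gives 30 without honours because the total is exactly 30 and the marks without the bonus are below 30.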

Teaching tools

Lecture notes and supporting material, including datasets, provided on the Virtuale website.

Office hours

See the website of Christian Martin Hennig