23741 - Statistical Methods for Data Mining

Academic Year 2023/2024

  • Teacher: Matteo Farnè
  • Credits: 10
  • SSD: SECS-S/01
  • Language: Italian
  • Teaching Mode: Traditional lectures
  • Campus: Bologna
  • Course: Second cycle degree programme (LM) in Statistics, Economics and Business (code 8876)

Learning outcomes

This course presents established statistical methods for extracting knowledge from large databases, with special attention to techniques that help managers draw useful insights from data repositories by recognizing patterns and making predictions.

In particular, this course aims to enable the student:

- to correctly plan a data mining process

- to choose the most suitable methodology for the problem at hand

- to critically interpret the results

Course contents

Pre-requisites:

Elements of descriptive and inferential statistics. Elements of probability calculus. Multiple linear regression model.

Course content

Part I

- Introduction: Data Mining and Statistics.

- Data preparation: data discovery, data characterization, descriptive and exploratory statistics.

- Data cleaning: outliers and missing values.

- Variable transformations. Volume and dimension reduction techniques.

- Association rules.

- Introduction to statistical learning methods: regression and classification problems.

- Parametric prediction methods: linear models in regression problems; logistic regression.

- Clustering methods: hierarchical and partitioning methods.

Part II

- Nonparametric regression methods: smoothers, Generalized Additive Models. Nonparametric classifiers: k-nearest neighbours (k-NN) classifier, Naive Bayes classifier.

- Recursive partitioning methods and decision trees.

- Artificial neural networks: multilayer perceptrons; regularization techniques.

- Aggregation of prediction models.

- Model assessment criteria in regression and classification problems (ROC curve and LIFT curve).

Some additional computer laboratory sessions in RStudio are planned.

Readings/Bibliography

In addition to the teaching material provided by the lecturer (available on IOL), the following references are recommended as further reading:

Hastie T., Tibshirani R., Friedman J. The Elements of Statistical Learning: Data Mining, Inference and Prediction, Springer-Verlag, New York, 2008

Cerioli A., Zani M. Analisi dei dati e data mining per le decisioni aziendali, Giuffrè Editore, 2007

Giudici P. Data Mining: Modelli informatici, statistici e applicazioni, McGraw Hill, 2005

Azzalini A., Scarpa B. Data analysis and data mining. An introduction, Oxford University Press, 2012

Teaching methods

The course consists of lectures and computer laboratory activities in R: lectures cover the methodology behind the statistical tools listed in the course content, while computer laboratory sessions focus on applying data mining algorithms to specific case studies.

The laboratory exercises aim to strengthen the knowledge students acquire during the lectures, and to develop their skills in choosing the most appropriate methods for a given problem and in interpreting empirical results.

Assessment methods

Assessment is based on a single final written exam. It consists of open and multiple-choice questions on theoretical aspects, plus questions that require students to interpret and comment on the RStudio output of a data mining analysis.

The oral exam is optional and can be taken after passing the written exam, within the same exam session. The overall grade is expressed on a scale of 30 and takes into account the outcome of both the written and the oral test: the oral exam can raise or lower the mark obtained in the written test by no more than 3 points out of 30.

Teaching tools

Blackboard; PC; video projector; computer laboratory.

Office hours

See the website of Matteo Farnè