- Teacher: Michele Scagliarini
- Credits: 8
- SSD: SECS-S/01
- Language: Italian
- Modules: Michele Scagliarini (Module 1), Marilena Pillati (Module 2)
- Teaching Mode: Traditional lectures (Module 1), Traditional lectures (Module 2)
- Campus: Bologna
- Course: First cycle degree programme (L) in Statistical Sciences (code 8873)
- from Sep 20, 2024 to Oct 23, 2024
- from Nov 12, 2024 to Dec 13, 2024
Learning outcomes
By the end of the course the student gains an appreciation of the types of problems and questions that arise with multivariate data, and of the basic theory of survey sampling.
In particular, the student should be able:
- to apply and interpret methods of dimension reduction (including principal component analysis and factor analysis);
- to apply and interpret methods for cluster analysis and discrimination;
- to interpret the output of R procedures for multivariate statistics;
- to employ simple, stratified and probability sampling;
- to derive estimators of population parameters and their associated standard errors under the different sampling strategies;
- to correct estimation by the ratio principle;
- to understand the difference between observational and experimental studies.
Course contents
Module 1: Survey Sampling
Michele Scagliarini
1. Introduction, Population and Sample (3 hours).
- General introduction
- Sampling from finite populations
- Definition of population and descriptive statistics
- Definition of sample and descriptive statistics
2. Random sampling and estimators (3 hours).
- Definition of sampling design: extraction and inclusion probabilities
- Linear homogeneous estimators
- Exercises
3. Simple random sampling (4 hours).
- Simple random sampling with replacement
- Simple random sampling without replacement
- Exercises
4. Sampling with varying probability (6 hours).
- The Horvitz-Thompson estimator
- The Hansen-Hurwitz estimator
- Exercises
5. The use of auxiliary information variables in simple random sampling (4 hours).
- The ratio estimator
- The regression estimator
- Exercises
6. Stratified random sampling (4 hours)
- Optimal allocation
- Proportional allocation
- Exercises
7. Use of R software to implement and create the sampling designs examined in the course (3 hours)
8. Final exercises (3 hours)
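As an illustration of topics 3 and 7, the following is a minimal sketch of simple random sampling without replacement, with the usual estimate of the population mean and its standard error (including the finite population correction). The course itself uses R; Python is used here only for illustration, and the function name, the seed and the population values are hypothetical and not taken from the course materials.

```python
import math
import random
import statistics


def srswor_mean_estimate(population, n, seed=0):
    """Draw a simple random sample without replacement of size n and
    return (estimated mean, estimated standard error)."""
    rng = random.Random(seed)           # fixed seed so the draw is reproducible
    N = len(population)
    sample = rng.sample(population, n)  # each subset of size n is equally likely
    y_bar = statistics.mean(sample)     # sample mean estimates the population mean
    s2 = statistics.variance(sample)    # sample variance, n - 1 denominator
    var_hat = (1 - n / N) * s2 / n      # (1 - n/N) is the finite population correction
    return y_bar, math.sqrt(var_hat)


# Hypothetical finite population of N = 10 values.
population = [12, 15, 9, 20, 18, 11, 14, 16, 10, 13]
estimate, std_error = srswor_mean_estimate(population, n=4)
print(estimate, std_error)
```

When n equals N the sample is a census, so the estimate coincides with the population mean and the estimated standard error is zero.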
Module 2: Data Analysis
Gabriele Soffritti
- Data matrices and additional matrices useful for multivariate statistical analysis (6 hours)
- Cluster analysis (6 hours)
  - Hierarchical clustering
  - K-means clustering
- Principal component analysis (6 hours)
  - Geometrical concepts and mathematical details
  - Properties and practical considerations
- Factor analysis (6 hours)
  - The linear factor model: specification, identification and estimation
  - Rotation methods
  - Factor scores
- Discriminant analysis (6 hours)
  - Discrimination when the populations are known (maximum likelihood and Bayes discriminant rules)
  - Fisher's linear discriminant function
  - Estimation of the error rate
- Functions for performing principal component analysis, factor analysis, cluster analysis, discriminant analysis in the R environment (8 hours)
  - Syntax, functionalities and output
  - Illustrative examples carried out in R with comments on the obtained results
The estimated number of hours of lessons for each topic takes account of the additional practical lessons that will take place in a computer laboratory on a weekly basis starting from the second week (November 18-22, 2019; teacher: Dott. Gabriele Perrone).
Readings/Bibliography
Module 1: Survey Sampling
Downloadable lecture notes: Daniela Cocchi "Teoria dei Campioni (corso base)"
Sharon Lohr, "Sampling: Design and Analysis", Pacific Grove, Duxbury Press, 1999.
Some additional readings will be indicated during the course.
Additional useful readings
Module 2: Data Analysis
Compulsory readings
- S. Mignani, A. Montanari, Appunti di analisi statistica multivariata. Esculapio, Bologna, 1994. Chapters 3, 4, 5, 7.
- Teacher's lecture notes with examples of exercises from past written exams.
- Additional teaching material will be provided by the teacher during the lessons.
Teacher's lecture notes are available to all enrolled students on the platform "Insegnamenti online - Supporto online alla didattica" (IOL) (https://iol.unibo.it/). To access the platform, students must log in with their username and password. Additional materials useful for exam preparation will also be made available on IOL.
Additional useful readings
- B. Everitt, T. Hothorn, An introduction to applied multivariate analysis with R. Springer, 2011. Chapters 1, 3, 5, 6.
- W. K. Härdle, Z. Hlávka, Multivariate statistics, exercises and solutions. Second edition. Springer, 2015. Chapters 11, 12, 13, 14.
- R. Johnson, D. Wichern, Applied multivariate statistical analysis. Sixth edition. Pearson, 2014. Chapters 8, 9, 11, 12.
Teaching methods
Module 1: Survey Sampling
Lectures, exercises and computer exercises with R.
Module 2: Data Analysis
Theoretical lessons in a lecture hall and practical lessons in a computer laboratory.
Each multivariate statistical method listed in the course contents is first described in a lecture hall, starting from its main theoretical features and properties. The corresponding R functions and scripts are then presented in a computer laboratory, together with some illustrative examples.
Although attending lessons is not mandatory, it is strongly recommended.
Assessment methods
The assessment aims to evaluate the achievement of the following learning objectives:
- knowledge of the fundamental aspects of sampling from finite populations;
- ability to properly use the statistical tools for designing sampling plans;
- knowledge of the multivariate analysis methods explained in the lectures;
- ability to properly use the explained multivariate methods to analyse data matrices;
- knowledge of the main functions of R to implement the sampling designs studied in the course.
The exam is written. The overall evaluation is expressed as a grade out of 30 and is the average of the evaluations of the two units.
The exam for the Survey Sampling unit lasts one hour and takes place in a classroom. It is composed of exercises and some open questions. In the written test students are not asked to program in R, but to comment on and explain the R code given in the text of the exercise. Students may consult the manuals of the R packages examined in the lectures. A formula sheet (at most one protocol sheet) may be used during the exam, while textbooks and notes are not allowed. A pocket calculator is necessary.
The exam for the Data Analysis unit lasts one hour and takes place in a classroom. It is composed of four exercises with open questions: some concern the theoretical aspects of the multivariate statistical methods, while others focus mainly on the ability to use methods for data analysis and to interpret the results of their application. The latter questions require solving numerical exercises. In some cases, results obtained from the analysis of a real data set with the R packages illustrated during the lessons may be provided. Consulting textbooks or notes during the exam is not allowed. A pocket calculator is necessary. The maximum mark for each exercise is 8. The overall mark of the Data Analysis unit is the sum of the marks in the four exercises and is expressed on a scale of 30.
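The marking arithmetic described above can be sketched as follows. This is only an illustration in Python. The function names and the example marks are hypothetical, and a minimum of 0 per exercise is an assumption. The syllabus does not say how a Data Analysis total above 30 or a non-integer average is handled, so the sketch applies no capping and no rounding.

```python
def data_analysis_mark(exercise_marks):
    """Sum the marks of the four Data Analysis exercises (each worth at most 8)."""
    if len(exercise_marks) != 4:
        raise ValueError("the Data Analysis exam has four exercises")
    # Assumption: marks are non-negative; the syllabus only states the maximum of 8.
    if any(m < 0 or m > 8 for m in exercise_marks):
        raise ValueError("each exercise is marked between 0 and 8")
    return sum(exercise_marks)


def overall_mark(survey_sampling_mark, data_analysis_total):
    """Average the marks of the two units (no rounding rule is assumed)."""
    return (survey_sampling_mark + data_analysis_total) / 2


da = data_analysis_mark([7, 6, 8, 5])  # hypothetical exercise marks
final = overall_mark(27, da)           # hypothetical Survey Sampling mark
print(da, final)
```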
Further useful information about the exam
- To take the exam, students must register for it through the Almaesami platform. An identity card is required to take part in the exam.
- Exams can only be taken in the official exam sessions.
- Exams of the two units can be taken in different exam sessions.
Teaching tools
Slides.
Links to further information
https://corsi.unibo.it/laurea/ScienzeStatistiche
Office hours
See the website of Michele Scagliarini
See the website of Marilena Pillati
SDGs
This teaching activity contributes to the achievement of the Sustainable Development Goals of the UN 2030 Agenda.