85302 - Data Science

Academic Year 2022/2023

  • Modules: Laura Anderlucci (Module 1), Alex Mas Sandoval (Module 2)
  • Teaching Mode: Traditional lectures (Module 1), Traditional lectures (Module 2)
  • Campus: Bologna
  • Course: First cycle degree programme (L) in Genomics (cod. 9211)

Learning outcomes

The course provides students with the current methods and techniques of data science using modern computational methods and software with an emphasis on rigorous statistical thinking. At the end of the course students are able to represent and organise knowledge about large-scale data collections, and to turn data into actionable knowledge by using concepts of statistical learning and data mining combined with data visualization techniques and reproducible data analysis.

Course contents

Part 0: Introduction to Statistical Learning

Part I: Classification

  • Naïve Bayes
  • Logistic Regression
  • Linear Discriminant Analysis
  • k-Nearest Neighbors

Part II: Resampling Methods

  • Cross-Validation
  • The Bootstrap

Part III: Tree-Based Methods

  • Classification trees
  • Bagging; Random Forests; Boosting

Part IV: Unsupervised Learning

  • k-means
  • Hierarchical clustering

Part V: Overview of the main machine learning methods

  • Support Vector Machines
  • Neural Networks

Readings/Bibliography

The primary text for the course:

  • James, G., Witten, D., Hastie, T., & Tibshirani, R. (2021). An Introduction to Statistical Learning. Second Edition. New York: Springer. ISBN: 978-1-0716-1417-4. E-book ISBN: 978-1-0716-1418-1.

    The book is freely available here:
    https://hastie.su.domains/ISLR2/ISLRv2_website.pdf

In addition, we will use:

  • T. Hastie, R. Tibshirani, and J. Friedman (2001) The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer-Verlag.
    Freely available at: https://web.stanford.edu/~hastie/Papers/ESLII.pdf
  • J. Han and M. Kamber (2000) Data Mining: Concepts and Techniques. Morgan Kaufmann.
    Freely available at: http://myweb.sabanciuniv.edu/rdehkharghani/files/2016/02/The-Morgan-Kaufmann-Series-in-Data-Management-Systems-Jiawei-Han-Micheline-Kamber-Jian-Pei-Data-Mining.-Concepts-and-Techniques-3rd-Edition-Morgan-Kaufmann-2011.pdf

Teaching methods

Lectures complemented with practical sessions.

As concerns the teaching methods of this course unit, all students must attend Modules 1 and 2 on Health and Safety online [http://www.unibo.it/en/services-and-opportunities/health-and-assistance/health-and-safety/online-course-on-health-and-safety-in-study-and-internship-areas].


Assessment methods

The learning assessment consists of a written test lasting 110 minutes. The test assesses the student's ability to use the definitions, concepts and properties learned and to solve exercises. During the written exam, students may use only the cheat sheet provided on virtuale.unibo.it, which contains references to R packages and functions. Textbooks, personal notes, mobile phones, smartwatches and similar electronic data storage or communication devices are not allowed.

The written test consists of 7-10 questions, both multiple choice and open, some of which are to be solved in R. The final grade is out of thirty.

Students who have passed the exam but feel that the result does not reflect their preparation may request an additional (optional) oral exam, which can change the grade by +/-3 points.

Teaching tools

The following material will be provided: lecture slides, exercises with solutions, and a mock exam.


Office hours

See the website of Laura Anderlucci

See the website of Alex Mas Sandoval