90477 - Machine Learning Systems For Data Science

Academic Year 2019/2020

  • Teacher: Stefano Lodi
  • Credits: 10
  • SSD: ING-INF/05
  • Language: English

Learning outcomes

By the end of the course, the student knows:
• the fundamentals of the relational model of data, and how to query a relational database management system using the SQL language;
• the fundamentals of data warehousing, big data analysis tools and NoSQL databases;
• the fundamental supervised and unsupervised data mining algorithms.

Course contents

Relational databases
══════════════════════

The relational model of data
Integrity constraints


The SQL language
══════════════════

Database creation, querying and updating
Transaction and authentication management



Data warehousing and OLAP
═══════════════════════════

OLTP and OLAP
The multidimensional model of data: Facts, measures, dimensions,
hierarchies, cuboids
Star schema, snowflake schema, galaxy schema
Operations in the multidimensional model: roll-up, drill-down, slice
and dice, pivot, data cube
Data warehouse: Definition, design, architecture



Advanced analytics and Machine Learning
═════════════════════════════════════════


The PageRank algorithm
──────────────────────────


Association rule discovery
──────────────────────────────

Classification of association rules
Apriori algorithm


Data clustering
───────────────────

The leader-follower algorithm
The BIRCH algorithm
The K-means algorithm
The EM algorithm


Supervised classification
─────────────────────────────

K nearest neighbours
Naive Bayes
Classification trees: C4.5, CART
Support Vector Machines (SVM)
AdaBoost
Neural Networks


Tools for managing and analysing Big Data
─────────────────────────────────────────────

The Linux operating system

The MapReduce programming model
The Hadoop implementation of MapReduce
The Spark system
NoSQL databases
The Python language
Python, Hadoop, and Spark
Hadoop and the R language


Laboratory classes
══════════════════════════

Relational DBMS MySQL, SQL language
Integrated development environments for Python
Advanced analytics

Readings/Bibliography

Maier, D. (1983) The theory of relational databases. Rockville, MD: Computer Science Press.
Van der Lans, R. F. Introduction to SQL (any edition). Addison-Wesley.
Han, J., & Kamber, M. (2011). Data Mining. Concepts and Techniques. San Francisco, CA: Morgan Kaufmann.
Tan, P.-N., Steinbach, M., & Kumar, V. (2006). Introduction to Data Mining. Boston: Pearson.
Zhang, Y. (2015). An Introduction to Python and Computer Programming. Singapore: Springer. Available as an e-book in the Catalogo del Polo Bolognese SebinaYOU.

Teaching methods

The course is divided into:
• lectures in a lecture room
• laboratory sessions, each combining a lecture part with exercises on
the database design, query, and data analysis techniques presented in
that lecture part.

The topics of the course are split by lesson type:
• Lectures cover the theoretical and practical notions for designing
queries, managing a database, designing database and data warehouse
schemata, and performing advanced analytics.
• In laboratory sessions, students are encouraged to design increasingly
difficult SQL and multidimensional queries and to test their correctness
on the DBMS at hand, and to build advanced analytics and machine
learning models using both the tools integrated into the DBMSs and the
Python programming language.

Assessment methods

The examination is composed of two parts:
• Preliminary examination on the design of Python programs, held in the laboratory.
  • The student is given: a hard-copy or digital text describing a simple analysis problem.
  • The student must produce: a Python program solving the analysis problem described in the text.
  • Notes: the student may produce the solution on paper or as a digital document, and may test it on one or more of the database management systems used during the course lessons.
• Oral examination. The student must answer three questions, which may concern any part of the course contents. In particular, the student must show: mastery of the theoretical notions of the discipline and of the logical, set-theoretic, and mathematical formalism employed in it; knowledge of the elements of data warehousing and of the advanced analytics and machine learning techniques presented during lessons and implemented in the tools used during lessons, together with the ability to use such tools; knowledge of the Python language.

Computation of the final mark and constraints among the examinations.

Each of the two examinations is marked on a scale from zero to thirty, inclusive. The mark achieved in the preliminary examination remains valid until the end of the session in which that examination was taken. The overall outcome is assessed, and the final mark computed, at the end of the oral examination. The final mark is the weighted average of the marks achieved in the two examinations, using the most recent valid mark for the preliminary examination, or zero if no valid mark exists. The following weights are used:

• Preliminary examination on the design of Python programs in laboratory: 12/30
• Oral examination: 18/30
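
As an illustration only, the weighted average described above can be sketched in Python. The function name `final_mark` and its parameters are hypothetical and not part of the official regulations. The text does not say how the result is rounded, so this sketch returns the exact value and applies no rounding.

```python
from fractions import Fraction
from typing import Optional

# Weights stated in the syllabus; together they sum to 1.
PRELIMINARY_WEIGHT = Fraction(12, 30)
ORAL_WEIGHT = Fraction(18, 30)


def final_mark(oral: int, preliminary: Optional[int] = None) -> Fraction:
    """Weighted average of the two marks (hypothetical helper).

    `preliminary` is the most recent valid preliminary mark, or None
    when no valid mark exists, in which case zero is used.
    """
    prelim = 0 if preliminary is None else preliminary
    for mark in (oral, prelim):
        if not 0 <= mark <= 30:
            raise ValueError("marks must lie between 0 and 30 inclusive")
    # Assumption: no rounding rule is given, so the exact value is returned.
    return PRELIMINARY_WEIGHT * prelim + ORAL_WEIGHT * oral


# Example: 25 in the preliminary examination and 28 in the oral examination.
print(float(final_mark(oral=28, preliminary=25)))  # 26.8
```

Under these weights, a student with no valid preliminary mark can reach at most 18 out of 30, since the oral examination alone contributes 18/30 of the final mark.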

Teaching tools

Presentation of the course topics using an overhead projector
Laboratory with desktop PCs equipped with MySQL and PyCharm; teacher's PC connected to an overhead projector to guide laboratory exercises
Documents used in the presentations, distributed on the site http://iol.unibo.it. Access to the documents is allowed only to students of the course.

Office hours

See the website of Stefano Lodi

SDGs

Industry, innovation and infrastructure

This teaching activity contributes to the achievement of the Sustainable Development Goals of the UN 2030 Agenda.