85172 - Systems and Algorithms for Data Science

Academic Year 2017/2018

  • Teacher: Stefano Lodi
  • Credits: 10
  • SSD: ING-INF/05
  • Language: English

Learning outcomes

By the end of the course the student is able to design information systems and applications on database management systems (DBMS). In particular, the student is able to use the SQL language to query and update data, and to apply a database design methodology to user requirements expressed in natural language. The student also knows the fundamentals of data warehousing and of big data management tools, such as the MapReduce programming model, its Hadoop implementation, and NoSQL databases.

Course contents

1 Relational databases
══════════════════════

• The relational model of data
• Integrity constraints
• DBMS architecture


2 The SQL language
══════════════════

• Database creation, querying and updating
• Transaction and authentication management


3 Database design
═════════════════

• Relational database design
• Entity-Relationship conceptual design
• Logical design by mapping an ER schema to a relational schema


4 Data warehousing and OLAP
═══════════════════════════

• OLTP and OLAP
• The multidimensional model of data: Facts, measures, dimensions,
hierarchies, cuboids
• Star schema, snowflake schema, galaxy schema
• Operations in the multidimensional model: roll-up, drill-down, slice
and dice, pivot, data cube
• Data warehouse: Definition, design, architecture

 

5 Advanced analytics
════════════════════

5.1 The Knowledge Discovery process
───────────────────────────────────


5.2 The PageRank algorithm
──────────────────────────


5.3 Association rule discovery
──────────────────────────────

• Classification of association rules
• Apriori algorithm
• FP-growth algorithm


5.4 Data clustering
───────────────────

• The leader-follower algorithm
• The BIRCH algorithm
• The K-means algorithm
• The EM algorithm


5.5 Supervised classification
─────────────────────────────

• K nearest neighbours
• Naive Bayes
• Classification trees: C4.5, CART
• Support Vector Machines (SVM)
• AdaBoost


5.6 Tools for managing and analysing Big Data
─────────────────────────────────────────────

• The MapReduce programming model
• The Hadoop implementation of MapReduce
• The Spark system
• NoSQL databases
• The Python language
• Python, Hadoop, and Spark
• Hadoop and the R language


5.7 Laboratory classes
──────────────────────

• The Microsoft SQL Server relational DBMS and the SQL language
• Advanced analytics




Readings/Bibliography

Maier, D. (1983). The Theory of Relational Databases. Rockville, MD: Computer Science Press.
Van der Lans, R. F. Introduction to SQL (any edition). Addison-Wesley.
Han, J., & Kamber, M. (2011). Data Mining. Concepts and Techniques. San Francisco, CA: Morgan Kaufmann.
Tan, P.-N., Steinbach, M., & Kumar, V. (2006). Introduction to Data Mining. Boston: Pearson.
Zhang, Y. (2015). An Introduction to Python and Computer Programming. Volume 353 of Lecture Notes in Electrical Engineering. Heidelberg: Springer. Available as E-book (Search in Catalogo del Polo Bolognese SEBINA YOU)

Teaching methods

The lessons of the course are divided into
• lectures in a lecture room
• lessons in a laboratory, each comprising both a lecture part and
exercises on the techniques for the design of databases and the
solution of query and data analysis problems presented in the
lecture part.

The topics of the course will be divided by lesson type:
• The theoretical and practical notions for the design of queries, for
the management of a database, for the design of database and data
warehouse schemata, and for advanced analytics are explained in
lectures.
• In laboratory lessons, students are encouraged to design increasingly
difficult SQL and multidimensional queries and to test their correctness
on the DBMS at hand, and to design the generation of advanced
analytics models using both the tools integrated into the DBMSs and
programming languages.

Assessment methods

The examination is composed of two parts:
• Preliminary examination on the design of SQL queries and Python
programs in the laboratory.
  • The student is given a hard copy or digital text containing: the description of a relational database schema expressed as CREATE TABLE statements in the SQL language; queries expressed in natural language concerning the relations of the schema; and the description of a simple analysis problem.
  • The student must produce: queries written in the SQL language that retrieve the data required by the natural-language queries described in the text; and a Python program solving the analysis problem described in the text.
  • Notes: the student may produce the solution on paper or as a digital document, and may use one or more of the database management systems employed during the course lessons to test the solution.
• Oral examination. The student must answer three questions, which may concern any part of the contents of the course. In particular, the student must show: mastery of the theoretical notions of the discipline and of the logical, set-theoretic, and mathematical formalism employed in it; the ability to design portions of an ER schema corresponding to notable cases of design requirements expressed in natural language; mastery of the application of the logical design techniques that map ER schemata to relational schemata; knowledge of the elements of data warehousing and of the advanced analytics techniques which were presented during lessons and implemented in the tools used during lessons, and the ability to use such tools; knowledge of the Python language.

Computation of the final mark and constraints among the examinations.

Each of the two examinations is marked on a scale from zero to thirty, inclusive. The mark achieved in the preliminary examination is valid until the end of the session in which the preliminary examination was taken. The assessment of the overall outcome of the examination and the computation of the final mark take place at the end of the oral examination. The final mark is computed as a weighted average of the marks achieved in the two examinations, using the most recent valid mark for the preliminary examination, or zero if no valid mark exists. For the computation of the final mark, the following weights are used:

Preliminary examination on the design of SQL queries and Python programs in laboratory: 12/30

Oral examination: 18/30
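
The weighted average described above can be illustrated with a minimal Python sketch. The function name `final_mark` and its parameters are hypothetical, chosen only for this example, and no rounding is applied because the rules above do not state a rounding policy.

```python
from fractions import Fraction
from typing import Optional

# Weights stated in the assessment rules.
PRELIMINARY_WEIGHT = Fraction(12, 30)
ORAL_WEIGHT = Fraction(18, 30)


def final_mark(oral: int, preliminary: Optional[int] = None) -> float:
    """Weighted average of the two marks (hypothetical helper).

    `preliminary` is the most recent valid preliminary mark,
    or None if no valid mark exists (it then counts as zero).
    Rounding is deliberately not applied: it is not specified.
    """
    prelim = 0 if preliminary is None else preliminary
    for mark in (prelim, oral):
        if not 0 <= mark <= 30:
            raise ValueError("marks must lie between 0 and 30 inclusive")
    return float(PRELIMINARY_WEIGHT * prelim + ORAL_WEIGHT * oral)


# Example: 24 in the preliminary examination and 28 in the oral examination.
print(final_mark(oral=28, preliminary=24))  # 26.4
# Example: no valid preliminary mark and 30 in the oral examination.
print(final_mark(oral=30))  # 18.0
```

Under these weights, a student without a valid preliminary mark can reach at most 18 out of 30 in this sketch, since the preliminary part contributes zero.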

Teaching tools

  • Presentation of the course topics using an overhead projector
  • Laboratory with desktop PCs equipped with Microsoft SQL Server and Microsoft Access; teacher's PC connected to an overhead projector to guide laboratory exercises
  • Documents used in the presentations, distributed at the site http://campus.unibo.it/ . Access to the documents is allowed only to students of the course who subscribed to the course mailing list. Credentials to subscribe to the list are given in the first lesson of the course.

Office hours

See the website of Stefano Lodi