81932 - Big Data

Course Unit Page

Academic Year 2017/2018

Learning outcomes

At the end of the course, the student:

  • Knows the applications of Big Data technologies and the respective challenges
  • Knows the available hardware and software architectures to handle Big Data
  • Knows the techniques to store the data, and the programming languages and paradigms generally adopted in these kinds of systems
  • Knows the design methodologies for the different kinds of applications in the area of Big Data
  • Acquires practical expertise in using the different technologies through laboratory sessions and projects

In particular, the main technologies used in practical exercises will be NoSQL databases and the Hadoop platform: Hive, Spark, Tez, Dremel, Giraph, Storm, Mahout, and Open R.

Course contents

For real-time updates on the course's activities, please subscribe to the distribution list enrico.gallinucci.bigdata

1. Introduction to the course and to Big Data: what they are and how to use them

2. Cluster computing to handle Big Data

  • Parallel computing architectures
  • The Apache Hadoop framework and its modules (HDFS, Yarn)
  • Hadoop-specific data structures (Apache Parquet)

3. The MapReduce paradigm: basic principles, limitations, design of algorithms
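To fix ideas before the lectures, the three phases of the paradigm (map, shuffle, reduce) can be sketched on the classic word-count problem. This is a minimal illustration in plain Python, not the Hadoop API; function names are chosen here for clarity.

```python
from collections import defaultdict

def map_phase(records):
    """Map: emit a (word, 1) pair for every word in every input record."""
    for record in records:
        for word in record.split():
            yield (word, 1)

def shuffle_phase(pairs):
    """Shuffle: group all values by key, as the framework does
    between the map and reduce phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: aggregate the grouped values; here, sum the counts."""
    return {word: sum(counts) for word, counts in groups.items()}

records = ["big data big", "data platforms"]
result = reduce_phase(shuffle_phase(map_phase(records)))
print(result)  # {'big': 2, 'data': 2, 'platforms': 1}
```

On a real cluster the map and reduce phases run in parallel across nodes, and the shuffle moves data over the network; the design of MapReduce algorithms largely consists in minimizing that shuffle.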

4. The Apache Spark system

  • Architecture, data structures, basic principles
  • Data partitioning and shuffling
  • Optimization of the computation
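The partitioning and shuffling point above can be previewed with a toy sketch: a key-value pair is assigned to partition `hash(key) % num_partitions`, so that after a shuffle all occurrences of a key land in the same partition. This is plain Python for illustration, not the Spark API; Spark's default hash partitioner follows the same idea.

```python
def partition_by_key(pairs, num_partitions):
    """Hash-partition key-value pairs, as a shuffle does:
    every occurrence of a key ends up in the same partition."""
    partitions = [[] for _ in range(num_partitions)]
    for key, value in pairs:
        # A deterministic toy hash, so the example is reproducible
        # (Python's built-in hash() is randomized across runs).
        index = sum(ord(c) for c in key) % num_partitions
        partitions[index].append((key, value))
    return partitions

pairs = [("a", 1), ("b", 2), ("a", 3)]
print(partition_by_key(pairs, 2))  # [[('b', 2)], [('a', 1), ('a', 3)]]
```

Because moving pairs between partitions means moving data over the network, a large part of optimizing a Spark computation is choosing operations and partitioning schemes that avoid unnecessary shuffles.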

5. SQL on Big Data with Spark SQL

6. Data streaming

  • Architectures for handling streaming data
  • Approximate algorithms in the streaming context
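As a taste of the approximate-algorithms topic, a classic one-pass streaming algorithm is the Misra-Gries heavy-hitters summary, which finds frequent items using bounded memory instead of counting every distinct item. The sketch below is a generic illustration, not course material.

```python
def misra_gries(stream, k):
    """One-pass heavy-hitters summary keeping at most k-1 counters.
    Guarantee: any item occurring more than len(stream)/k times
    is still present in the returned summary."""
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # No free counter: decrement all, dropping those at zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters

stream = ["a"] * 6 + ["b", "c", "d"]
print(misra_gries(stream, 2))  # {'a': 3}
```

The returned counts are lower bounds, not exact frequencies; this trade of accuracy for constant memory is typical of algorithms that must process unbounded streams.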

7. Big Data Analysis: a complete case study

8. Taking a Data Mining problem to a Big Data platform

Readings/Bibliography

Tom White. Hadoop: The Definitive Guide (4th edition). O'Reilly, 2015

Matei Zaharia, Holden Karau, Andy Konwinski, Patrick Wendell. Learning Spark. O'Reilly, 2015

Teaching methods

Lessons and practical exercises

Assessment methods

Oral examination on all the covered topics and discussion of a project. The project must be arranged with the lecturer, and may consist of one of the following: analysis of a dataset using the learned techniques and tools; implementation of a data mining algorithm on a Big Data platform; experimental evaluation of a new tool within the Hadoop framework.

Teaching tools

Practical exercises will rely on a virtual cluster of 10 nodes, pre-configured with the Cloudera Express distribution. An SSH client will be used to connect to the cluster and interact with the available software tools (mainly Apache Hadoop and Apache Spark).

Office hours

See the website of Enrico Gallinucci

See the website of Andrea Mordenti