81932 - Big Data

Academic Year 2021/2022

                
Teacher: Enrico Gallinucci
Credits: 6
SSD: ING-INF/05
Language: Italian
Teaching Mode: In-person learning (entirely or partially)
Campus: Cesena
Course: Second cycle degree programme (LM) in Computer Science and Engineering (cod. 8614)

                            Teaching resources on Virtuale

Learning outcomes

At the end of the course, the student:
- Knows the applications of Big Data technologies and the respective challenges
- Knows the available hardware and software architectures to handle Big Data
- Knows the techniques to store the data, and the programming languages and paradigms generally adopted in this kind of system
- Knows the design methodologies for the different kinds of applications in the area of Big Data
- Acquires practical expertise in using the different technologies through laboratory sessions and projects. In particular, the main technologies used in practical exercises will be NoSQL databases and the Hadoop platform: Hive, Spark, Tez, Dremel, Giraph, Storm, Mahout, and Open R

Course contents

Requirements

Prior knowledge of relational databases, the Java and Scala programming languages, and Unix-like systems is required to attend the course. Attendance of the Business Intelligence and Data Mining courses is encouraged.

All lessons are given in Italian, but the teaching material is written in English. A good comprehension of English is thus required to use the material. Non-Italian-speaking students can study from the English material and take the exam in English.

Course Contents

For real-time updates on the course's activities, please subscribe to the distribution list enrico.gallinucci.bigdata22

1. Introduction to the course and to Big Data: what they are and how to use them

2. Cluster computing to handle Big Data

Hardware and software architectures
The Apache Hadoop framework and its modules (HDFS, YARN)
Hadoop-specific data structures (Apache Parquet)

3. The MapReduce paradigm: basic principles, limitations, design of algorithms

4. The Apache Spark system

Architecture, data structures, basic principles
Data partitioning and shuffling
Optimization of the computation

5. SQL on Big Data with Spark SQL

6. Data streaming

The architecture to handle data streaming
Approximate algorithms in the streaming context

7. NoSQL databases

8. Handling Big Data in the Cloud

On-premises clusters vs. clusters in the cloud
The technological stack in the cloud
Deployment of a real case study on a cloud provider

9. Deploying a Data Mining problem with Big Data logic

Readings/Bibliography

Slides

Teaching methods

Lessons and practical exercises.

Given the teaching methods of this course unit, all students must attend Modules 1 and 2 on Health and Safety online.

Assessment methods

The exam consists of an oral examination on all the covered topics and the discussion of a project.

The goal of the project (to be arranged with the lecturer) is to identify a big-enough dataset, define an application to analyze the data (using the techniques and tools learned throughout the course), and write a short report. Groups of up to 2 people can be formed. The project is worth 0 to 3 points, which are added to the grade obtained in the oral examination. Alternative projects (e.g., implementation of a data mining algorithm on a Big Data platform; experimental evaluation of a new tool within the Hadoop framework) can be discussed with the lecturer upon request.

Teaching tools

Practical exercises rely on a virtual cluster of 10 nodes, pre-configured with the Cloudera Express distribution. Each student is given a user account on one of the nodes, to be used to interact with the software tools installed in the cluster (mainly Apache Hadoop and Apache Spark). Students connect to the cluster through an SSH client.

In addition to the virtual cluster, alternative software solutions to interact with Big Data tools will be offered:

An individual virtual environment with the whole Cloudera Express distribution, to be used on the student's own computer or on the lab computers
Access to the Cloud services of Amazon Web Services and/or Google Cloud Platform via $50-$100 coupons

Office hours

See the website of Enrico Gallinucci