- Teacher: Enrico Gallinucci
- Credits: 6
- SSD: ING-INF/05
- Teaching language: English
- Modules: Enrico Gallinucci (Module 1); Enrico Gallinucci (Module 2)
- Teaching mode: Traditional - In-person lectures (Module 1); Traditional - In-person lectures (Module 2)
- Campus: Cesena
- Degree programme: Master's Degree in Computer Science and Engineering (Laurea Magistrale in Ingegneria e scienze informatiche, code 8614). Also valid for the Master's Degree in Digital Transformation Management (code 5815)
- Class timetable (Module 1): from 19/09/2023 to 25/10/2023
- Class timetable (Module 2): from 20/09/2023 to 12/12/2023
Learning outcomes
By the end of the course, the student:
- Knows the application areas in which Big Data technologies can be used, and the related issues
- Knows the hardware and software architectures that have been proposed for managing Big Data
- Knows the storage techniques, and uses the programming languages and paradigms adopted in this kind of system
- Knows the design methodologies for the different types of Big Data applications
The student also acquires practical skills in using the different technologies through laboratory and project activities. In particular, the main technologies used in the laboratory will be NoSQL DBMSs and the Hadoop platform: Hive, Spark, Tez, Dremel, Giraph, Storm, Mahout, and Open R
Contents
Requirements
Prior knowledge of relational databases, the Java and Scala programming languages, and Unix-like systems is required to attend the course. Attending the Business Intelligence and Data Mining courses is encouraged.
Classes and teaching material are in English. A good comprehension of English is thus required to use the material and interact during class. The exam can be taken in Italian.
Modules
The course is split into two modules.
The first module (20 hours) is mostly theoretical and is shared with the Master's Degree in Digital Transformation Management (where it corresponds to Module 1 of Big Data and Cloud Platforms [https://www.unibo.it/it/didattica/insegnamenti/insegnamento/2022/466768]).
The second module (30 hours) is mostly practical and is exclusive to the Master's Degree in Computer Science and Engineering.
Course Contents
1. Introduction to the course and to Big Data: what they are and how to use them
2. Cluster computing to handle Big Data
- Hardware and software architectures
- The Apache Hadoop framework and its modules (HDFS, YARN)
- Hadoop-specific data structures (Apache Parquet)
3. The MapReduce paradigm: basic principles, limitations, design of algorithms
4. The Apache Spark system
- Architecture, data structures, basic principles
- Data partitioning and shuffling
- Optimization of the computation
5. SQL on Big Data with Spark SQL
6. Data streaming
- The architecture to handle data streaming
- Approximate algorithms in the streaming context
7. NoSQL databases
8. Handling Big Data in the Cloud
- On-premises clusters vs. clusters in the cloud
- The technological stack in the cloud
- Deployment of a real case study on a cloud provider
9. Designing non-trivial problems under the MapReduce paradigm
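As an illustration of the MapReduce paradigm listed in the contents above, the following is a minimal single-machine sketch of its three phases (map, shuffle, reduce) applied to word counting. It is not Hadoop or Spark code: the function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `word_count`) and the sample input are assumptions made for this example, and a real framework would distribute each phase across the nodes of a cluster.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map: emit one (word, 1) pair for every word in the input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle: group together all values emitted for the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values of a single key into one result."""
    return key, sum(values)


def word_count(lines):
    """Run the three phases in sequence on an in-memory list of lines."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    # Hypothetical sample input, used only for illustration.
    sample = ["big data on a cluster", "big data in the cloud"]
    print(word_count(sample))
```

In a distributed setting the shuffle step is the one that moves data across the network, which is why it is typically the most expensive phase.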
Readings/Bibliography
- Slides
Recommended readings:
- Tom White. Hadoop - The Definitive Guide (4th edition). O'Reilly, 2015
- Jules S. Damji, Brooke Wenig, Tathagata Das, Denny Lee. Learning Spark (2nd edition). O'Reilly, 2020
- Andrew G. Psaltis. Streaming Data - Understanding the real-time pipeline. Manning, 2017
- Ian Foster, Dennis Gannon. Cloud Computing for Science and Engineering. MIT Press, 2017
Further readings will be mentioned during the course.
Teaching methods
Lessons and practical exercises.
Given the teaching methods of this course unit, all students must attend Modules 1 and 2 of the online course on Health and Safety [https://www.unibo.it/en/services-and-opportunities/health-and-assistance/health-and-safety/online-course-on-health-and-safety-in-study-and-internship-areas].
Assessment methods
The exam consists of an oral examination on all the covered topics and the discussion of a project.
Project details:
- Goal: identify a sufficiently large dataset and design a notebook application to process and analyze the data (using the techniques and tools learned throughout the course). Both the dataset and the workload must be validated by the teacher before the implementation begins.
- Groups of up to 2 people can be formed.
- The discussion takes place on the day of the oral examination.
- A well-done project can improve the grade obtained in the oral examination; a poor one can prevent admission to the oral examination.
- Alternative project proposals (e.g., in combination with projects of other courses or with the internship/thesis) are always welcome.
Teaching tools
Practical exercises will be carried out on the cloud-based virtual environment provided by the AWS Academy service. Students who register for AWS Academy will receive $100 to spend on AWS services, which is enough to carry out both the in-class exercises and the project. SSH tunnelling will be necessary to connect to the GUI of the cloud services (e.g., Spark).
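As a rough illustration of the SSH tunnelling mentioned above, the sketch below builds (without running it) an `ssh` command that forwards a local port to a web GUI on a remote cluster node, using the standard OpenSSH `-i`, `-N` and `-L` options. Everything specific here is an assumption made for the example: the helper name `build_tunnel_command`, the key file, the `hadoop` user, the host name, and port 18080 (the default port of the Spark History Server). The actual values depend on the cluster set up during the course.

```python
import shlex


def build_tunnel_command(key_path, user, host, local_port, remote_port):
    """Build an ssh command that forwards local_port to remote_port on the remote host."""
    return [
        "ssh",
        "-i", key_path,  # private key used to authenticate
        "-N",            # do not run a remote command, only forward ports
        "-L", f"{local_port}:localhost:{remote_port}",  # local port forwarding
        f"{user}@{host}",
    ]


if __name__ == "__main__":
    # Hypothetical values, used only for illustration.
    command = build_tunnel_command(
        "lab-key.pem", "hadoop", "ec2-host.example.com", 18080, 18080
    )
    print(shlex.join(command))
```

With such a tunnel open, the remote GUI would be reachable in a local browser at `http://localhost:18080`.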
Office hours
See the website of Enrico Gallinucci
SDGs
This course contributes to the pursuit of the Sustainable Development Goals of the UN 2030 Agenda.