B5788 - INTRODUCTION TO BIG DATA PROCESSING INFRASTRUCTURES

Academic Year 2025/2026

  • Modules: Daniele Cesini (Module 1), Alessandro Costantini (Module 2)
  • Teaching Mode: Traditional lectures (Module 1), Traditional lectures (Module 2)
  • Campus: Bologna
  • Course: Second cycle degree programme (LM) in Bioinformatics (code 6767)

Learning outcomes

At the end of the course, the student has basic theoretical and practical knowledge of infrastructures for scientific computing, distributed and parallel systems, batch systems and security technologies.

Course contents

Module 1

The course provides the basic concepts of infrastructures for processing Big Data and for running scientific applications, with a particular focus on the Infrastructure-as-a-Service (IaaS) Cloud paradigm. It starts with an introduction to Big Data and its relation to scientific applications, then describes the building blocks of modern data centers and how Cloud computing models abstract them. Students will be given a real-life computational challenge and, during the course, will build a Cloud-based computing model to solve it. To complete the exercises, students will be granted access to a limited set of Cloud resources and services. Containers, in particular Docker containers, will be introduced, as will the concept of High Performance Computing (HPC). The course concludes with notions about the emerging “Fog” and “Edge” computing paradigms and how they are linked to Cloud infrastructures.

Program:

1) Introduction to the course and the computational challenge

Big Data

- Big Data definition

- Big Data applications classification

- Big Data applications examples

- Big Data and scientific applications

- Presentation of the computational challenge that will accompany us during the course.

Hands on:

- Set up of connections and login

2) From your laptop to the datacenter - datacenter building blocks

CPU Farm

i. Batch system, queues, allocation policies, quotas, etc.

Storage

I. DAS vs NAS

II. SAN

III. TAN

IV. Parallel FS

V. Data lifecycle, QoS

- Migration, recall, ACL

Network: main protocols (Ethernet, InfiniBand, Fibre Channel)

Monitoring and Provisioning

Hands on: Job submission on a small cluster already available to students

3) Infrastructures for Parallel Computing

HTC vs HPC

HTC

- Distributed systems

- Grid Computing

HPC

- Shared memory vs distributed memory

- OpenMP/Open MPI

- Accelerators for parallel computing

- Hybrid and non-standard resources

Energy efficiency and Low-power computing

- Towards exascale computing

Hands on: Live demo - Creation of speedup curves for the NAMD STMV/APOA1 use cases. Computing on a GPU. Computing on low-power systems.
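
As a rough illustration of what building a speedup curve involves, the sketch below computes speedup and parallel efficiency from wall-clock timings taken at different core counts. The function name and the timings are hypothetical placeholders, not measurements from NAMD or from the course cluster.

```python
def speedup_curve(timings):
    """Return (cores, speedup, efficiency) tuples, sorted by core count.

    timings maps a core count to the wall-clock time (in seconds) of the
    same job run on that many cores. The smallest core count is the baseline.
    """
    base_cores = min(timings)
    base_time = timings[base_cores]
    rows = []
    for cores in sorted(timings):
        # Speedup: how many times faster than the baseline run.
        speedup = base_time / timings[cores]
        # Efficiency: speedup divided by the increase in resources used.
        efficiency = speedup / (cores / base_cores)
        rows.append((cores, speedup, efficiency))
    return rows


if __name__ == "__main__":
    # Placeholder timings, invented for illustration only.
    example = {1: 100.0, 2: 50.0, 4: 40.0}
    for cores, speedup, efficiency in speedup_curve(example):
        print(f"{cores:>3} cores  speedup {speedup:.2f}  efficiency {efficiency:.2f}")
```

An efficiency that drops well below 1 as cores are added suggests that communication or serial sections are starting to dominate the run time.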

5) Cloud Infrastructure

Cloud Computing: Introduction

Cloud Computing Dimensions - IaaS, PaaS, SaaS, service and isolation models

Cloud IaaS

i. Advantages and Disadvantages

ii. Application Porting to the Cloud

iii. AWS Usage

Cloud Storage - provisioning of block device and POSIX filesystems

Hands on:

  1. IaaS instantiation with AWS - create the infrastructure to run the course exercises
  2. Instantiation of multiple machines - experience on cloud elasticity - Create a mini-cluster - Run the course exercise on that cluster
  3. Create storage volumes on the Cloud and make them available to the cluster
  4. Hadoop cluster creation
  5. MapReduce introduction and exercise
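
To give an idea of the MapReduce model named in the last item, here is a minimal single-process word-count sketch in plain Python. It only mimics the map, shuffle and reduce phases in memory; it does not use Hadoop, and all function names are illustrative assumptions rather than part of the course exercise.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values of one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Run the three phases in sequence over an iterable of text lines."""
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["big data", "Big cluster"]))
```

On a real cluster the map and reduce functions run in parallel on different nodes and the framework performs the shuffle over the network; the logic of each phase stays the same.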

6) Introduction to Containers

- Basic concepts about containers
- Running and extending containers
- Docker Hub and Dockerfiles
- Connecting containers to file systems
- Exporting and importing containers
- Docker Compose
- Running docker containers in userspace with udocker

 

7) Computing Continuum

- Low Power devices

- Introduction to Edge Computing

- Introduction to Fog Computing

- The Computing Continuum for Big Data Infrastructures

- Energy efficiency and Low-power computing

- Towards exascale computing

For interested students, the course will include a visit to the INFN-CNAF datacenter in Bologna.

 

 

Module 2

The course "Infrastructures for Big Data Processing Module 2" (BDP Module 2) builds on the course "Introduction to Big Data Processing Infrastructures Module 1" (BDP Module 1). Before taking this course, students should have completed BDP Module 1, or at least be familiar with the topics covered there.

BDP Module 2 will first recap the foundations of Cloud computing and storage services beyond IaaS (PaaS and SaaS). It will then discuss how to exploit distributed infrastructures to deploy applications and process big data.

A distinctive feature of Module 2 is its substantial number of hands-on sessions, which connect directly to the theoretical parts. In this way, students immediately apply the concepts presented to real-world use cases. To benefit fully from this method, students are strongly advised to attend all lectures.

Introduction to BDP Module 2

  • Course introduction and objectives
  • Clouds beyond the IaaS: general points
  • How to use the Cloud infrastructure for this course.

Cloud Storage

  • File systems and POSIX storage
  • The Network File System (NFS)
  • Object storage, the REST architecture and the JSON format
  • Virtual file systems
  • Simple examples of local and remote data processing

Advanced Docker Containers

  • Recap of basic concepts about containers (from BDP1)
  • Networking in containers
  • Process management, logging and security
  • Repositories and versioning: Git, GitHub, Docker Hub
  • A complete application development workflow

Authentication and Authorization

  • Principles of Cloud authentication and authorization
  • X.500, LDAP, RADIUS, Kerberos
  • X.509 and public-key cryptography
  • SAML, eduGAIN, IDEM, SPID
  • OAuth and OpenID Connect
  • INDIGO IAM
  • Adapting an application to use INDIGO IAM

Cloud Automation

  • What is Cloud Automation
  • Microservices and monoliths
  • The DevOps concept
  • Container orchestration: Docker Swarm and Kubernetes, with extensive hands-on sessions on Kubernetes
  • Infrastructure as Code: serverless technologies
  • Template-based orchestration of applications
  • Cloud automation and service orchestration

Readings/Bibliography

Course material will be shared; external MOOCs and books will also be suggested during the course.


Teaching methods

Teaching combines theoretical foundations with practical considerations drawn from real infrastructures used for big data processing, and with hands-on sessions.

Given the type of activities and teaching methods, all students must complete the following e-learning safety modules before attending this course:

Module 1 – Safety General Training [https://elearning-sicurezza.unibo.it/course/view.php?id=23]

Module 2 – Safety Specific Training (part I) [https://elearning-sicurezza.unibo.it/course/view.php?id=43]

Assessment methods

Written exam with multiple-choice questions.

 

Students with learning disorders and/or temporary or permanent disabilities: please contact the office responsible (https://site.unibo.it/studenti-con-disabilita-e-dsa/en/for-students) as soon as possible so that they can propose acceptable adjustments. The request for adaptation must be submitted in advance (15 days before the exam date) to the lecturer, who will assess the appropriateness of the adjustments, taking into account the teaching objectives.

Teaching tools

Slides for the theory; real-world infrastructures for the hands-on sessions.

Office hours

See the website of Daniele Cesini

See the website of Alessandro Costantini