87809 - Introduction to Big Data Processing Infrastructures

Academic Year 2021/2022

  • Teaching Mode: Traditional lectures
  • Campus: Bologna
  • Degree programme: Second cycle degree programme (LM) in Bioinformatics (cod. 8020)

Learning outcomes

At the end of the course, the student has basic theoretical and practical knowledge of infrastructures for scientific computing, distributed and parallel systems, batch systems and security technologies.

Course contents

The course provides the basic concepts of infrastructures for processing Big Data and for running scientific applications, with a particular focus on the Infrastructure-as-a-Service (IaaS) Cloud paradigm. It starts with an introduction to Big Data and how it relates to scientific applications, then describes the building blocks of modern data centers and how they are abstracted by Cloud computing models. Students will be given a real-life computational challenge and, during the course, will create a Cloud-based computing model to solve it. Access to a limited set of Cloud resources and services will be granted to students so that they can complete the exercises. Containers, and Docker containers in particular, will be introduced, as will the concept of High Performance Computing (HPC). The course concludes with notions about the emerging “Fog” and “Edge” computing paradigms and how they are linked to Cloud infrastructures.

Program:

1) Introduction to the course and the computational challenge

Big Data

- Big Data definition

- Big Data applications classification

- Big Data applications examples

- Big Data and scientific applications

- Presentation of the computational challenge that will accompany us during the course.

Hands on:

- Set up of connections and login

2) From your laptop to the datacenter - datacenter building blocks

CPU Farm

i. Batch system, queues, allocation policies, quotas, etc.

Storage

I. DAS vs NAS

II. SAN

III. TAN

IV. Parallel FS

V. Data lifecycle, QoS

- Migration, recall, ACL

Network: main protocols (Ethernet, InfiniBand, Fibre Channel)

Monitoring and Provisioning

Hands on: Submission on a small cluster already available to students

3) Infrastructures for Parallel Computing

HTC vs HPC

HTC

- Distributed systems

- Grid Computing

HPC

- Shared memory vs distributed memory

- OpenMP/Open MPI

- Accelerators for parallel computing

- Hybrid and non-standard resources

Energy efficiency and Low-power computing

- Towards exascale computing

Hands on: Live demo - Creation of speedup curves for the NAMD STMV/APOA1 use cases. Computing on a GPU. Computing on low-power systems.
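
As a rough illustration of what building a speedup curve involves, the sketch below derives speedup and parallel efficiency from wall-clock times measured at different core counts. The function name and the timing values are hypothetical placeholders, not measurements from the NAMD use cases; speedup is taken relative to the run with the fewest cores.

```python
def speedup_curve(timings):
    """Return (cores, speedup, efficiency) tuples from measured run times.

    timings: dict mapping core count -> wall-clock time in seconds.
    Speedup is computed relative to the run with the fewest cores.
    """
    base_cores = min(timings)
    base_time = timings[base_cores]
    curve = []
    for cores in sorted(timings):
        speedup = base_time / timings[cores]
        # Efficiency: fraction of the ideal (linear) speedup actually achieved.
        efficiency = speedup * base_cores / cores
        curve.append((cores, speedup, efficiency))
    return curve


# Hypothetical timings, for illustration only (not real benchmark results).
example_timings = {1: 100.0, 2: 52.0, 4: 28.0, 8: 16.0}

for cores, speedup, efficiency in speedup_curve(example_timings):
    print(f"{cores:>3} cores  speedup {speedup:5.2f}  efficiency {efficiency:5.2f}")
```

Plotting speedup against core count, next to the ideal linear line, shows where adding cores stops paying off.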

5) Cloud Infrastructure

Cloud Computing: Introduction

Cloud Computing Dimensions - IaaS, PaaS, SaaS, service and isolation models

Cloud IaaS

i. Advantages and Disadvantages

ii. Application Porting to the Cloud

iii. AWS Usage

Cloud Storage - provisioning of block device and POSIX filesystems

Hands on:

  1. IaaS instantiation with AWS - create the infrastructure to run the course exercises
  2. Instantiation of multiple machines - experience on cloud elasticity - Create a mini-cluster - Run the course exercise on that cluster
  3. Create storage volumes on the Cloud and make them available to the cluster
  4. Hadoop cluster creation
  5. MapReduce introduction and exercise
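
As a minimal sketch of the MapReduce idea, the example below runs a word count in plain Python, with the map, shuffle and reduce phases written as separate functions. It does not use Hadoop: all function names are illustrative assumptions, and on a real cluster the framework would distribute these phases across machines.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values of one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Run the three phases sequentially on a list of text lines."""
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


print(word_count(["big data big clusters", "data centers"]))
```

Because each call to map_phase depends on a single line and each call to reduce_phase on a single key, both phases can in principle run in parallel on different nodes.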

6) Introduction to Containers

- Basic concepts about containers
- Running and extending containers
- Docker Hub and Dockerfiles
- Connecting containers to file systems
- Exporting and importing containers
- Docker Compose
- Running Docker containers in userspace with udocker


7) Computing Continuum

- Low Power devices

- Introduction to Edge Computing

- Introduction to Fog Computing

- The Computing Continuum for Big Data Infrastructures

- Energy efficiency and Low-power computing

- Towards exascale computing

 

For interested students, the course will include a visit to the INFN-CNAF datacenter in Bologna.

Readings/Bibliography

Course material will be shared; external MOOCs and books will also be suggested during the course.

Teaching methods

Teaching will build on theoretical foundations, complemented extensively by practical considerations on real infrastructures used for big data processing and by hands-on sessions.

Assessment methods

There will be an oral exam, focusing on the topics presented during the course.

Students will be asked to prepare a small project, which will be discussed during the exam.

Teaching tools

Slides for the theory, use of real-world infrastructures for the hands-on sessions

Office hours

See the website of Daniele Cesini