87809 - INTRODUCTION TO BIG DATA PROCESSING INFRASTRUCTURES

Academic Year 2018/2019

  • Teaching Mode: Traditional lectures
  • Campus: Bologna
  • Degree programme: Second cycle degree programme (LM) in Bioinformatics (cod. 8020)

Learning outcomes

At the end of the course, the student has basic theoretical and practical knowledge of infrastructures for scientific computing, distributed and parallel systems, batch systems and security technologies.

Course contents

The course provides the basic concepts of infrastructures for Big Data processing, including Cloud computing at the Infrastructure-as-a-Service level. It starts with a description of the building blocks of modern data centers and how they are abstracted by the Cloud paradigm. A real-life computational challenge will be presented, and during the course students will create a cloud-based computing model to solve it. A very brief introduction to High Performance Computing (HPC) will also be given. The course concludes with notions about the emerging “fog” and “edge” computing paradigms and how they are linked to Cloud infrastructures.

Program:

1) Introduction to the course and the computational challenge

- Introduction to Big Data

- Presentation of the computational challenge that will accompany us during the course.

Hands on:

- Set up of the testbed for the exercises

2) From your laptop to the datacenter - datacenter building blocks

- CPU Farm

i. Batch system, queues, allocation policies, quotas, etc.

- Storage

I. DAS vs NAS

II. SAN

III. TAN

IV. Parallel FS

V. Data lifecycle, QoS

- Migration, recall, ACL

- Network: main protocols (Ethernet, InfiniBand, Fibre Channel)

- Monitoring and Provisioning

Hands on: Job submission on a small cluster already available to students

3) Infrastructures for Parallel Computing

HTC vs HPC

HTC

- Distributed systems

- Grid Computing

HPC

- Shared memory vs distributed memory

- OpenMP/Open MPI

- Accelerators for parallel computing

- Hybrid and non-standard resources

Energy efficiency and Low-power computing

- Towards exascale computing

Hands on: Live demo - Creation of speedup curves for the NAMD STMV/APOA1 use cases. Computing on a GPU. Computing on low-power systems.

4) Cloud IaaS

Cloud Computing: Introduction

Cloud IaaS

i. Advantages and Disadvantages

ii. Application Porting to the Cloud

iii. Introduction to OpenStack

iv. Amazon vs OpenStack

Cloud Storage - provisioning of block devices and POSIX file systems

Hands on: IaaS instantiation with OpenStack - create the infrastructure to run the course exercises

Instantiation of multiple machines - experience with cloud elasticity - Create a mini-cluster - Run the course exercise on that cluster

Create storage volumes on the Cloud and make them available to the cluster

5) Creating a computing model in distributed infrastructures and multi-site Clouds

Job Submission strategies

i. Push vs pull

ii. Compute driven model

iii. Workload Management Systems

Data Management strategies

i. Replicas, QoS

ii. Data driven computing models

Failover and Disaster Recovery strategies

6) Computing Continuum

- Low Power devices

- Introduction to Edge Computing

- Introduction to Fog Computing

- The Computing Continuum for Big Data Infrastructures

The course will include, for interested students, a visit to the INFN-CNAF datacenter in Bologna.

 

Readings/Bibliography

Course material will be shared; external MOOCs and books will also be suggested during the course.

Teaching methods

Teaching will cover the theoretical foundations, complemented extensively by practical considerations on real infrastructures used for big data processing and by hands-on sessions.

Assessment methods

There will be an oral exam, focusing on the topics presented during the course.

Students will be requested to prepare a small project that will be discussed during the exam.

Teaching tools

Slides for the theory, use of real-world infrastructures for the hands-on sessions

Office hours

See the website of Daniele Cesini