88406 - INFRASTRUCTURES FOR BIG DATA PROCESSING

Course Unit Page

  • Teacher: Davide Salomoni

  • Credits: 4

  • SSD: FIS/01

  • Teaching Mode: Traditional lectures

  • Language: English

Academic Year 2019/2020

Learning outcomes

At the end of the course, the student has practical and theoretical knowledge of distributed computing and storage infrastructures, cloud computing and virtualization, and parallel computing, as well as of their application to Big Data analysis.

Course contents

The course on "Infrastructures for Big Data Processing" (BDP2) builds on the course "Introduction to Big Data Processing Infrastructures" (BDP1). Before taking this course, you should have already attended the BDP1 course, or at least be well acquainted with the topics covered there.

The BDP2 course will first recap the foundations of Cloud computing and storage services beyond IaaS (PaaS and SaaS). This will then lead us to understand how to exploit distributed infrastructures for Big Data processing.

 

Introduction to BDP2

  • Course introduction and objectives
  • Clouds beyond IaaS: general points
  • How to use a Cloud infrastructure to follow this course

Cloud Storage

  • File systems and POSIX storage
  • The Network File System (NFS)
  • Object storage, the REST architecture and the JSON format
  • Virtual file systems, the example of Onedata
  • Simple examples of local and remote data processing (see the sketch below)
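
To give a flavour of the last point, here is a minimal Python sketch that retrieves a JSON document from a REST endpoint and performs a trivial local processing step; the URL and the field name are placeholders for illustration, not a service actually used in the course:

    # Minimal sketch: fetch a JSON document via REST and process it locally.
    # The endpoint below is a hypothetical placeholder, not a course service.
    import json
    import urllib.request

    url = "https://objectstore.example.org/bucket/measurements.json"
    with urllib.request.urlopen(url) as response:
        records = json.loads(response.read().decode("utf-8"))

    # Trivial "remote data, local processing" step: average one assumed field.
    values = [record["value"] for record in records]
    print("Average value:", sum(values) / len(values))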

Containers

  • Basic concepts about containers
  • Running and extending containers (see the sketch after this list)
  • Docker Hub and dockerfiles
  • Connecting containers to file systems
  • Exporting and importing containers
  • Docker Compose
  • Running Docker containers in user space
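
As an illustration of running a container and connecting it to a host file system, the following minimal Python sketch uses the Docker SDK for Python (pip install docker). It assumes a local Docker daemon, the image and host path are placeholders, and it is not necessarily the exact tooling used in the hands-on sessions:

    # Minimal sketch with the Docker SDK for Python.
    # Assumes a running local Docker daemon; the host path is a placeholder.
    import docker

    client = docker.from_env()

    # Run a throw-away container with a read-only bind mount and capture its
    # output (roughly equivalent to: docker run --rm -v /tmp/data:/data:ro ...).
    output = client.containers.run(
        "python:3.11-slim",
        "python -c \"print('hello from a container')\"",
        volumes={"/tmp/data": {"bind": "/data", "mode": "ro"}},
        remove=True,
    )
    print(output.decode("utf-8"))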

Authentication and Authorization

  • Principles of Cloud authentication and authorization
  • X.500, LDAP, RADIUS, Kerberos
  • X.509 and public-key cryptography
  • SAML, eduGAIN, IDEM, SPID
  • OAuth and OpenID Connect (see the sketch after this list)
  • INDIGO IAM
  • Adapting an application to use INDIGO IAM
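
To illustrate the kind of token-based flows covered under OAuth and OpenID Connect, here is a minimal Python sketch of an OAuth2 client-credentials request followed by a call to a protected API. All endpoints, client IDs and secrets are placeholders, and the exact flows and parameters used with INDIGO IAM during the course may differ:

    # Minimal sketch of an OAuth2 client-credentials flow.
    # All URLs and credentials below are placeholders.
    import json
    import urllib.parse
    import urllib.request

    token_endpoint = "https://iam.example.org/token"
    form = urllib.parse.urlencode({
        "grant_type": "client_credentials",
        "client_id": "my-client-id",
        "client_secret": "my-client-secret",
    }).encode("ascii")

    with urllib.request.urlopen(
        urllib.request.Request(token_endpoint, data=form)
    ) as resp:
        access_token = json.loads(resp.read())["access_token"]

    # Present the bearer token to a protected (placeholder) API.
    request = urllib.request.Request(
        "https://api.example.org/protected/resource",
        headers={"Authorization": "Bearer " + access_token},
    )
    with urllib.request.urlopen(request) as resp:
        print(resp.read().decode("utf-8"))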

Cloud Automation

  • What is Cloud Automation
  • Microservices and monoliths
  • The DevOps concept
  • Container orchestration: Docker Swarm
  • Container orchestration: Kubernetes and Mesos (see the sketch after this list)
  • Infrastructure as Code: serverless technologies
  • Data ingestion, data processing and data querying
  • Template-based orchestration
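
As a small taste of container orchestration driven from code, the following sketch uses the official Kubernetes Python client (pip install kubernetes) to list the pods in a namespace. It assumes a reachable cluster and a local kubeconfig file; the orchestrators and tools actually used in the course may differ:

    # Minimal sketch with the official Kubernetes Python client.
    # Assumes a cluster is reachable via the default local kubeconfig.
    from kubernetes import client, config

    config.load_kube_config()      # read ~/.kube/config
    v1 = client.CoreV1Api()

    # List the pods running in the "default" namespace.
    for pod in v1.list_namespaced_pod(namespace="default").items:
        print(pod.metadata.name, pod.status.phase)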

 

Readings/Bibliography

Course material will be shared; in addition, external MOOCs and books will be suggested during the course.

Teaching methods

The teaching will be based on theoretical foundations, strongly complemented by practical considerations about real infrastructures used for Big Data processing and by hands-on sessions.

Assessment methods

The exam will be oral only, focusing on the topics presented during the course.

Teaching tools

Slides for the theory; real-world infrastructures for the hands-on sessions.

Note that a personal laptop (running Windows, Linux or macOS - no tablets) is required during the lectures to follow the presented material and the hands-on sessions.

Office hours

See the website of Davide Salomoni