- Teacher: Davide Salomoni
- Credits: 4
- SSD: FIS/01
- Language: English
- Teaching Mode: Traditional lectures
- Campus: Bologna
- Degree programme: Second cycle degree programme (LM) in Bioinformatics (cod. 8020)
Learning outcomes
At the end of the course, the student has practical and theoretical knowledge of distributed computing and storage infrastructures, cloud computing and virtualization, and parallel computing, and of their application to Big Data analysis.
Course contents
The course "Infrastructures for Big Data Processing" (BDP2) first discusses the foundations of Cloud computing and storage services beyond IaaS, namely PaaS and SaaS. It then shows how to exploit distributed infrastructures for Big Data processing. Before taking this course, you should have taken the course "Introduction to Big Data Processing Infrastructures" (BDP1), or be familiar with the topics covered there.
Introduction to BDP2
- Course introduction and objectives
- Clouds beyond the IaaS: general points
- A concrete example: the INDIGO-DataCloud architecture
Authentication and Authorization
- Principles of Cloud authentication and authorization
- X.509, SAML, OpenID Connect, LDAP, Kerberos, username/password, OAuth
- INDIGO IAM
Cloud PaaS
- PaaS, i.e. the programmable Cloud
- PaaS examples
- TOSCA
Non-POSIX Cloud Storage
- What is POSIX storage
- Object storage
- Ceph
- HadoopFS & MapReduce
Containers
- The origins of containers
- Docker and Dockerfiles
- Docker Swarm
- Security considerations
- Running Docker containers in user space
Resource orchestration
- Local orchestration of resources: Kubernetes, Mesos & Chronos
- Remote orchestration of resources: information systems, the INDIGO-DataCloud Orchestrator
Distributed filesystems
- Introduction to common distributed filesystems
- Storj and IPFS
- Onedata
Cloud automation in scientific workflows
- What can be automated and how?
- Extending TOSCA templates
The full chain for a cloud-based big data experiment
- Examples
- Projects for the exam
Readings/Bibliography
Course material will be shared; external MOOCs and books will also be suggested during the course.
Teaching methods
The course starts from theoretical foundations and complements them extensively with practical considerations on real infrastructures used for big data processing, as well as with hands-on sessions.
Assessment methods
There will be an oral exam focusing on the topics presented during the course. Students will be asked to prepare a small project, which will be discussed during the exam.
Teaching tools
Slides for the theory; real-world infrastructures for the hands-on sessions.
Note that a personal laptop (running Windows, Linux or macOS; no tablets) is required during the lectures to follow some of the material and exercises.
Office hours
See the website of Davide Salomoni