91259 - Architecture and Platforms for Artificial Intelligence

Academic Year 2025/2026

  • Modules: Moreno Marzolla (Module 1), Luca Benini (Module 2)
  • Teaching Mode: Traditional lectures (Module 1 and Module 2)
  • Campus: Bologna
  • Degree Programme: Second cycle degree programme (LM) in Artificial Intelligence (cod. 9063)

Learning outcomes

At the end of the course, the student has a deep understanding of the requirements that machine-learning workloads place on computing systems, of the main architectures for accelerating machine-learning workloads, of heterogeneous architectures for embedded machine learning, and of the most popular platforms that cloud providers offer to specifically support machine/deep learning applications.

Course contents

Module 1:

  1. Introduction to parallel programming.

  2. Parallel programming patterns: embarrassingly parallel, decomposition, master/worker, scan, reduce.

  3. Shared-memory programming with OpenMP. The OpenMP programming model: the “omp parallel” construct, data-scoping clauses, and work-sharing constructs (see the reduction sketch after this list).

  4. GPU programming with CUDA. CUDA architecture and memory hierarchy. The CUDA programming model: threads, blocks, grids. Synchronization primitives and shared memory (see the kernel sketch after this list).
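As a taste of the reduce pattern and the OpenMP constructs above, here is a minimal sketch; the array size and contents are illustrative placeholders, not course material:

    /* sum.c -- minimal OpenMP reduction sketch; compile with: gcc -fopenmp sum.c */
    #include <stdio.h>
    #include <omp.h>

    #define N 1000000

    static double x[N];

    int main(void)
    {
        double sum = 0.0;

        for (int i = 0; i < N; i++) x[i] = 1.0;   /* placeholder data */

        /* work-sharing construct with a reduction clause: each thread
           accumulates a private partial sum; the partial sums are
           combined at the end of the parallel region */
    #pragma omp parallel for reduction(+:sum)
        for (int i = 0; i < N; i++) {
            sum += x[i];
        }

        printf("sum = %f (max threads: %d)\n", sum, omp_get_max_threads());
        return 0;
    }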
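Likewise, a minimal CUDA C sketch of the thread/block/grid hierarchy (vector addition); unified memory is used only to keep the sketch short, and error checking is omitted:

    /* vadd.cu -- minimal CUDA C sketch; compile with: nvcc vadd.cu */
    #include <stdio.h>

    __global__ void vadd(const float *a, const float *b, float *c, int n)
    {
        /* global thread index = block index * block size + thread index */
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) c[i] = a[i] + b[i];
    }

    int main(void)
    {
        const int n = 1 << 20;
        float *a, *b, *c;
        /* unified memory keeps the sketch short; explicit
           cudaMalloc/cudaMemcpy transfers work as well */
        cudaMallocManaged(&a, n * sizeof(float));
        cudaMallocManaged(&b, n * sizeof(float));
        cudaMallocManaged(&c, n * sizeof(float));
        for (int i = 0; i < n; i++) { a[i] = 1.0f; b[i] = 2.0f; }

        const int block = 256;                     /* threads per block */
        const int grid = (n + block - 1) / block;  /* blocks in the grid */
        vadd<<<grid, block>>>(a, b, c, n);
        cudaDeviceSynchronize();                   /* wait for the kernel */

        printf("c[0] = %f\n", c[0]);
        cudaFree(a); cudaFree(b); cudaFree(c);
        return 0;
    }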

Prerequisites for Module 1: good knowledge of C programming in the Unix/Linux environment, and a basic understanding of computer architectures and concurrency theory.

Module 2:

From ML to DNNs - a computational perspective 

  • Introduction to key computational kernels (dot-product, matrix multiply...); a naive sketch follows this list.
  • Inference vs. training: workload analysis and characterization.
  • The NN computational zoo: DNNs, CNNs, RNNs, attention-based networks, and state-based networks.
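To make the first bullet concrete, here is a naive C sketch of the two kernels; production ML libraries use blocked, vectorized implementations instead:

    #include <stddef.h>

    /* dot-product: the innermost kernel of most NN layers */
    float dot(const float *x, const float *y, size_t n)
    {
        float acc = 0.0f;
        for (size_t i = 0; i < n; i++)
            acc += x[i] * y[i];
        return acc;
    }

    /* matrix multiply C = A*B (A is m x k, B is k x n, row-major):
       the O(m*n*k) multiply-accumulate operations dominate the cost
       of both inference and training */
    void matmul(const float *A, const float *B, float *C,
                size_t m, size_t k, size_t n)
    {
        for (size_t i = 0; i < m; i++)
            for (size_t j = 0; j < n; j++) {
                float acc = 0.0f;
                for (size_t p = 0; p < k; p++)
                    acc += A[i*k + p] * B[p*n + j];
                C[i*n + j] = acc;
            }
    }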

Running ML workloads on programmable processors 

  • Recap of processor instruction set architectures (ISAs), with a focus on data processing.
  • Improving processor ISAs for ML: RISC-V and ARM use cases.
  • Fundamentals of parallel processor architecture and parallelization of ML workloads; a parallelization sketch follows this list.
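As an illustration of the last bullet, the matmul sketch above can exploit both thread-level and data-level parallelism from portable C; ISA-specific versions would instead use vector instructions (e.g. the RISC-V "V" extension or ARM NEON/SVE) directly:

    /* OpenMP parallelization sketch: threads across output rows,
       SIMD hint (with a reduction) on the inner dot-product loop;
       compile with an OpenMP-enabled compiler, e.g. gcc -fopenmp */
    #include <stddef.h>

    void matmul_par(const float *A, const float *B, float *C,
                    size_t m, size_t k, size_t n)
    {
    #pragma omp parallel for
        for (size_t i = 0; i < m; i++)
            for (size_t j = 0; j < n; j++) {
                float acc = 0.0f;
    #pragma omp simd reduction(+:acc)
                for (size_t p = 0; p < k; p++)
                    acc += A[i*k + p] * B[p*n + j];
                C[i*n + j] = acc;
            }
    }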

Algorithmic optimizations for ML

  • Key bottlenecks and a taxonomy of optimization techniques.
  • Algorithmic techniques (e.g. Strassen, Winograd, FFT). 
  • Model distillation; efficient NN models: depthwise convolutions, inverted bottlenecks, optimized attention, and an introduction to Neural Architecture Search.
  • Quantization and sparsity: scalar, block, and vector; a scalar-quantization sketch follows this list.
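As a preview of scalar quantization, a minimal sketch, assuming symmetric per-tensor int8 quantization with the scale derived from the maximum absolute value; practical schemes may add zero-points, per-channel scales, or block/vector granularity:

    /* illustrative sketch, not a course-mandated scheme:
       q = clamp(round(x / scale), -127, 127), with scale = max|x| / 127 */
    #include <math.h>
    #include <stdint.h>
    #include <stddef.h>

    float quantize_int8(const float *x, int8_t *q, size_t n)
    {
        float amax = 0.0f;
        for (size_t i = 0; i < n; i++) {
            float a = fabsf(x[i]);
            if (a > amax) amax = a;
        }
        float scale = (amax > 0.0f) ? amax / 127.0f : 1.0f;
        for (size_t i = 0; i < n; i++) {
            float r = roundf(x[i] / scale);
            if (r >  127.0f) r =  127.0f;
            if (r < -127.0f) r = -127.0f;
            q[i] = (int8_t)r;
        }
        return scale;   /* dequantize with x ~= scale * q */
    }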

Prerequisites for Module 2: programming in C, basic computer architecture concepts, and basic linear algebra and vector calculus.


Readings/Bibliography

Suggested readings for Module 1: selected parts from the following books:

  • Peter Pacheco, Matthew Malensek, An Introduction to Parallel Programming [https://shop.elsevier.com/books/an-introduction-to-parallel-programming/pacheco/978-0-12-804605-0], 2nd ed., Morgan Kaufmann, 2021, ISBN 9780128046050.

  • CUDA C Programming Guide, NVIDIA Corporation, freely available at http://docs.nvidia.com/cuda/cuda-c-programming-guide/

Suggested readings for Module 2: selected parts from the following books:

  • Dive into Deep Learning (online: d2l.ai).

  • Efficient Processing of Deep Neural Networks (online: https://link.springer.com/book/10.1007/978-3-031-01766-7).

  • Machine Learning Systems (online: mlsysbook.ai).


Teaching methods

Traditional lectures for theory.

Both Module 1 and Module 2 include hands-on sessions requiring a student laptop.

Assessment methods

The exam consists of two independent parts, which can be taken in any order:

Module 1: Project work, consisting of the design and implementation of a parallel program and a written report. The project work must be done individually by each student; collaboration is not allowed. The program must implement a parallel algorithm according to specifications provided by the instructor, using C/OpenMP and/or CUDA/C; no other programming languages are allowed. The report, of at most 6 pages, must describe and motivate the parallelization strategies adopted and analyze the performance of the parallel program. The instructor may require a brief oral discussion of the project.

Module 2: Written exam with oral discussion: the written exam is compulsory and consists of solving problems and answering questions. The oral exam is optional and consists of in-depth questions on topics covered in class.

Final grade: to obtain a final grade, students must pass the exams of both modules with a mark of at least 18 out of 30; the final grade is the average of the two module grades, rounded to the nearest integer. Honors (“lode”) may be awarded for work of very high quality, at the instructors' discretion. Partial grades expire at the end of the academic year. Important note: partial grades cannot be registered; the final grade is recorded only after both modules have been passed.

Teaching tools

Lectures with projected slides provided by the instructors.

Hands-on sessions.

Office hours

See the website of Moreno Marzolla

See the website of Luca Benini