- Teacher: Gianluca Moro
- Credits: 6
- SSD: ING-INF/05
- Language: English
- Modules: Gianluca Moro (Module 1), Stefano Lodi (Module 2)
- Teaching Mode: Traditional lectures (Module 1), Traditional lectures (Module 2)
- Campus: Bologna
- Course: Second cycle degree programme (LM) in Artificial Intelligence (cod. 9063)
- from Sep 18, 2024 to Dec 18, 2024
- from Sep 20, 2024 to Dec 20, 2024
Learning outcomes
At the end of the course, the student understands how corpora and big data sets can be analyzed to derive strategic knowledge and support data-driven decisions. The student is able to address the main data and text analytics tasks, such as data selection, transformation, analysis and interpretation, and to leverage deep learning pipelines and advanced solutions for downstream text mining and natural language understanding goals.
Course contents
Module 1 - Text Mining and Large Language Models (Prof. Gianluca Moro)
The module focuses on the following state-of-the-art deep learning advances for tackling challenging natural language processing and understanding tasks that are in high demand in industry, using core open-source technologies:
- Neural Language Modeling Foundations
Transformer Architecture, Pre-Training and Fine-Tuning Techniques, Decoding Strategies, Evaluation (Natural Language Generation Metrics, LLM-as-a-judge).
- Large Language Models
Scaling Laws, Prompting vs Fine-Tuning, Instruction Fine-Tuning, Preference Alignment, Efficient Attention Mechanisms (e.g., BigBird, Performer, Key-Value Caching, Multi-Query Attention), Compression and Quantization Methods (GPTQ, AWQ, GGUF), Parameter-Efficient Fine-Tuning (e.g., LoRA, QLoRA, GaLore, Prompt Tuning), Mixture-of-Experts, Vision-Language Models.
- Knowledge-Enhanced Natural Language Processing
Retrieval-Augmented Generation (Chunking Strategies, Architectures) and VectorDBs, Agents and Tool Calling (Large Action Models - LAM), Graph Neural Networks, Graph Injection into Language Models for Trustworthiness.
- Labs on real-world case studies and enterprise settings with the core technologies Python, PyTorch, Hugging Face, LlamaIndex:
-- Prompting vs Full Fine-Tuning vs Parameter-Efficient Fine-Tuning for Dialogue Summarization; Foundational Multimodal Large Language Models for Visual Question Answering.
-- Efficient, Retrieval-Augmented Conversational Agents for Question Answering in the Legal Domain.
-- Graph Learning, Embedding and Injection into Language Models for Semantics-Aware Lay Summarization; GraphRAG and Structured Query Generation to Interact with DBMS and Knowledge Graphs in Multi-Source Question Answering.
-- Graph Verbalization into Large Language Model Prompts for Medical Question Answering.
The module also introduces recently published SOTA results from top-tier AI conferences, including results that we achieved on some of these topics.
Module 2 - Big Data Analytics (Prof. Stefano Lodi)
- Data Mining in Big Data Environments
- Maps and reductions in parallel programming. The MapReduce programming model.
- The Hadoop implementation of MapReduce.
- The Python API to the Spark system and examples of parallel programs.
- The Machine Learning Library (MLlib) of Spark.
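The map and reduce steps named in the Module 2 topics can be illustrated with a minimal, single-process word count. This is only a sketch in plain Python, not the Hadoop or Spark API: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the sample documents are hypothetical and chosen for illustration.

```python
from collections import defaultdict
from functools import reduce


def map_phase(document):
    """Map step: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Group all emitted values by key, as a framework would do between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce step: combine all values of one key into a single count."""
    return key, reduce(lambda a, b: a + b, values)


def word_count(documents):
    """Run map, shuffle and reduce sequentially over a list of documents."""
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Hypothetical sample input, for illustration only.
counts = word_count(["big data big", "data mining"])
print(counts)
```

In a real MapReduce system the map calls run in parallel on different data partitions and the shuffle moves data across machines; here everything runs in one process to keep the structure visible.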
Readings/Bibliography
Module 1 - Text Mining and Large Language Models (Prof. Gianluca Moro)
Readings: Slides, lab notebooks, datasets and recent papers will be supplied by the teacher, including from our repository.
Suggested readings: Natural Language Processing Recipes: Unlocking Text Data with Machine Learning and Deep Learning Using Python, Akshay Kulkarni and Adarsha Shivananda, 2024.
Module 2 - Big Data Analytics (Prof. Stefano Lodi)
- White, T. (2009). Hadoop: The definitive guide, 4th Edition. O'Reilly Media.
- Chambers, B., & Zaharia, M. (2018). Spark: The definitive guide: Big data processing made simple. O'Reilly Media, Inc.
Teaching methods
Module 1 - Text Mining and Large Language Models (Prof. Gianluca Moro)
Lessons and lab activities are also held online via Teams, using slides, Colab notebooks and our servers.
Module 2 - Big Data Analytics (Prof. Stefano Lodi)
The lessons of the course are held in a classroom. Examples are implemented in Python on Linux running in a virtual machine (VM). The VMs will be installed on the students’ own laptops. Full instructions on how to install the VMs will be given in the first lesson.
Assessment methods
The exam consists of discussing, in person or online via MS Teams, a single project agreed with one of the teachers and based on one or more topics chosen by the students from one or both modules. Students develop the project either individually or in teams of up to four members, with the assistance of the tutor if needed. Each member of a team is required to take part in the discussion individually. The project's scope should be approximately equivalent to that of a course lab and should take about a couple of weeks of work.
A non-exhaustive list of datasets and text collections for the project is available in the "Virtuale" teaching resources, but students can also propose a different data set.
Teaching tools
Module 1 - Text Mining and Large Language Models (Prof. Gianluca Moro)
The laboratory activities, which are carried out with Python/R using Colab, cover the following case studies:
- identification, in technical reports on air accidents, of the factors that contribute to serious accidents
- extractive and abstractive text summarisation based on state-of-the-art transformer models
- deep metric learning for cross-modal information retrieval of texts and images
- SOTA transformers for closed and open question answering, also in business and legal domains; for named entity recognition (NER); for relation extraction (RE) among biomedical entities, covering chemical-protein and disease-disease interactions; for natural language inference (NLI), to determine the validity of a hypothesis (i.e., true or false); and for document classification (DC), to assign a text document to a set of categories
Module 2 - Big Data Analytics (Prof. Stefano Lodi)
Presentation of the course topics using an overhead projector. Exercises in Bring Your Own Device mode; directions on how to install the required software will be given during the course.
Documents used in the presentations are distributed on the site [http://iol.unibo.it]. Access to the documents is allowed only to students of the course.
Office hours
See the website of Gianluca Moro
See the website of Stefano Lodi
SDGs
This teaching activity contributes to the achievement of the Sustainable Development Goals of the UN 2030 Agenda.