B2133 - Big Data Analytics and Text Mining

Academic Year 2024/2025

                
Teacher: Gianluca Moro
Credits: 6
SSD: ING-INF/05
Language: English
Modules: Gianluca Moro (Module 1); Stefano Lodi (Module 2)
Teaching Mode: In-person learning (entirely or partially) (Module 1); In-person learning (entirely or partially) (Module 2)
Campus: Bologna
Degree programme: Second cycle degree programme (LM) in Artificial Intelligence (cod. 9063)


Learning outcomes

At the end of the course, the student understands how corpora and big data sets can be analyzed to derive strategic knowledge and support data-driven decisions. The student is able to address the main data and text analytics tasks, such as data selection, transformation, analysis and interpretation, and to leverage deep learning pipelines and advanced solutions for downstream text mining and natural language understanding goals.

Course contents

Module 1 - Text Mining and Large Language Models (Prof. Gianluca Moro)

The module focuses on the following state-of-the-art deep learning advancements for tackling challenging natural language processing and understanding tasks that are in high demand in industry, using core open-source technologies:

Neural Language Modeling Foundations
Transformer Architecture, Pre-Training and Fine-Tuning Techniques, Decoding Strategies, Evaluation (Natural Language Generation Metrics, LLM-as-a-judge).
Large Language Models
Scaling Laws, Prompting vs Fine-Tuning, Instruction Fine-Tuning, Preference Alignment, Efficient Attention Mechanisms (e.g., BigBird, Performer, Key-Value Caching, Multi-Query Attention), Compression and Quantization Methods (GPTQ, AWQ, GGUF), Parameter-Efficient Fine-Tuning (e.g., LoRA, QLoRA, GaLore, Prompt Tuning), Mixture-of-Experts, Vision-Language Models.
Knowledge-Enhanced Natural Language Processing
Retrieval-Augmented Generation (Chunking Strategies, Architectures) and VectorDBs, Agents and Tool Calling (Large Action Models - LAM), Graph Neural Networks, Graph Injection into Language Models for Trustworthiness.
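
As an illustration of the chunking step in Retrieval-Augmented Generation listed among the topics above, here is a minimal sketch in plain Python. It is not the course's lab code: the function names `chunk_text` and `retrieve`, the word-level chunk size and overlap, and the toy lexical scoring are all assumptions made for this example. A real pipeline would typically use embeddings and a vector database instead of word overlap.

```python
def chunk_text(text: str, chunk_size: int = 5, overlap: int = 2) -> list[str]:
    """Split text into word-level chunks; consecutive chunks share `overlap` words.

    Hypothetical helper for illustration only.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once the current window already reaches the end of the text.
        if start + chunk_size >= len(words):
            break
    return chunks


def retrieve(query: str, chunks: list[str], top_k: int = 1) -> list[str]:
    """Rank chunks by the number of distinct words shared with the query.

    Toy lexical scoring (an assumption); it stands in for embedding similarity.
    """
    query_words = set(query.lower().split())

    def score(chunk: str) -> int:
        return len(query_words & set(chunk.lower().split()))

    return sorted(chunks, key=score, reverse=True)[:top_k]


if __name__ == "__main__":
    pieces = chunk_text("a b c d e f g h", chunk_size=4, overlap=1)
    print(pieces)
    print(retrieve("e f", pieces))
```

The overlap keeps a sentence that straddles a chunk boundary retrievable from at least one chunk, at the cost of storing some words twice.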

Labs on real-world case studies and enterprise settings, using the core technologies Python, PyTorch, HuggingFace and LlamaIndex:

-- Prompting vs Full Fine-Tuning vs Parameter-Efficient Fine-Tuning for Dialogue Summarization; Foundational Multimodal Large Language Models for Visual Question Answering.
-- Efficient, Retrieval-Augmented Conversational Agents for Question Answering in the Legal Domain.
-- Graph Learning, Embedding and Injection into Language Models for Semantics-Aware Lay Summarization; GraphRAG and Structured Query Generation to Interact with DBMS and Knowledge Graphs in Multi-Source Question Answering.
-- Graph Verbalization into Large Language Model Prompts for Medical Question Answering.

The module also presents recent state-of-the-art results published at top-tier AI conferences, including results that we have achieved on some of these topics.

Module 2 - Big Data Analytics (Prof. Stefano Lodi)

Data Mining in Big Data Environments
Maps and reductions in parallel programming. The MapReduce programming model.
The Hadoop implementation of MapReduce.
The Python API to the Spark system and examples of parallel programs.
The Machine Learning Library (MLlib) of Spark.
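
To make the MapReduce programming model listed above concrete, here is a minimal single-process sketch of a word count in plain Python. It is only an illustration of the map, shuffle and reduce phases, not Hadoop or Spark code; the function names `map_phase`, `shuffle`, `reduce_phase` and `word_count` are assumptions introduced for this example.

```python
from collections import defaultdict
from functools import reduce


def map_phase(document: str) -> list[tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs: list[tuple[str, int]]) -> dict[str, list[int]]:
    """Shuffle: group all emitted values by key, as the framework would do."""
    groups: dict[str, list[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key: str, values: list[int]) -> tuple[str, int]:
    """Reduce: combine the values of one key with an associative operation."""
    return key, reduce(lambda a, b: a + b, values)


def word_count(documents: list[str]) -> dict[str, int]:
    """Run the three phases sequentially over a small in-memory collection."""
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["big data", "Big text data"]))
```

In a real cluster the map and reduce calls run in parallel on different nodes and the shuffle moves data across the network. In PySpark the same computation is commonly expressed on an RDD with `flatMap`, `map` and `reduceByKey`.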

Readings/Bibliography

Module 1 - Text Mining and Large Language Models (Prof. Gianluca Moro)

Readings: slides, lab notebooks, datasets and recent papers will be supplied by the teacher, including material from our repository.

Suggested reading: Natural Language Processing Recipes: Unlocking Text Data with Machine Learning and Deep Learning Using Python, Akshay Kulkarni and Adarsha Shivananda, 2024.

Module 2 - Big Data Analytics (Prof. Stefano Lodi)

White, T. (2009). Hadoop: The definitive guide, 4th Edition. O'Reilly Media.
Chambers, B., & Zaharia, M. (2018). Spark: The definitive guide: Big data processing made simple. O'Reilly Media, Inc.

Teaching methods

Module 1 - Text Mining and Large Language Models (Prof. Gianluca Moro)

Lessons and lab activities are also held online via Teams, using slides, Colab notebooks and our servers.

Module 2 - Big Data Analytics (Prof. Stefano Lodi)

The lessons are held in a classroom. Examples are implemented in Python on Linux running in a virtual machine (VM). The VMs will be installed on the students’ own laptops. Full instructions on how to install the VMs will be given in the first lesson.

Assessment methods

The exam consists of discussing, also online via Microsoft Teams, a single project arranged with one of the teachers and based on one or more topics chosen by the students from one or both modules. Students develop the project either individually or in teams of up to four members, with the assistance of the tutor if needed. Each member of a team is required to take part in the discussion individually. The project's scope should be approximately equivalent to that of a course lab and should take about a couple of weeks of work.

A non-exhaustive list of datasets and text collections for the project is available in the teaching resources on Virtuale, but students can also propose a different dataset.

Teaching tools

Module 1 - Text Mining and Large Language Models (Prof. Gianluca Moro)

The laboratory activities, which are carried out with Python/R using Colab, cover the following case studies:

identification, in technical reports on air accidents, of the factors that contribute to serious accidents
extractive and abstractive text summarisation based on state-of-the-art transformer models
deep metric learning for cross-modal information retrieval of texts and images
SOTA transformers for closed and open question answering, also in business and legal domains; named entity recognition (NER); relation extraction (RE) among biomedical entities, covering chemical-protein and disease-disease interactions; natural language inference (NLI), to determine the validity of a hypothesis (i.e., true or false); and document classification (DC), to assign a text document to a set of categories

Module 2 - Big Data Analytics (Prof. Stefano Lodi)

Presentation of the course topics using an overhead projector. Exercises in Bring Your Own Device mode; directions on how to install the required software will be given during the course.

Documents used in the presentations are distributed on the site [http://iol.unibo.it]. Access to the documents is allowed only to students of the course.

Office hours

See the website of Gianluca Moro

See the website of Stefano Lodi

SDGs

This teaching activity contributes to the achievement of the Sustainable Development Goals of the UN 2030 Agenda.