91262 - Data Mining, Text Mining and Big Data Analytics

Academic Year 2022/2023

  • Teacher: Gianluca Moro
  • Credits: 6
  • SSD: ING-INF/05
  • Language: English
  • Modules: Gianluca Moro (Module 1) Claudio Sartori (Module 2) Stefano Lodi (Module 3)
  • Teaching Mode: Traditional lectures (Module 1) Traditional lectures (Module 2) Traditional lectures (Module 3)
  • Campus: Bologna
  • Course: Second cycle degree programme (LM) in Artificial Intelligence (cod. 9063)

Learning outcomes

At the end of the course, the student understands how a possibly very large set of data can be analyzed to derive strategic information and to support "data-driven" decisions. The student knows the main data-mining tasks, such as data selection, data transformation, analysis and interpretation, with specific reference to unstructured text data, and is familiar with the issues related to analysis in "big data" environments.

Course contents

Presentation slides of the course

 

Module 1 - Data Mining (Claudio Sartori)

See 75194 - DATA MINING M Module 2 only

 

Module 2 - Text Mining (Gianluca Moro)

The text mining module focuses on several forms of knowledge discovery from large corpora of unstructured text, also by means of the most important state-of-the-art deep learning advances.

These advances are increasingly fundamental for tackling challenging natural language processing and natural language understanding tasks that are in high business demand, such as text representation learning, semantic parsing, similarity search and efficient large-scale neural information retrieval, self-supervised methods for unlabeled data, deep metric learning in NLP, explaining behaviours and phenomena of interest from texts, knowledge graph learning from documents, open-domain question answering, generative summarisation, chatbots and digital assistants, and multi-modal information retrieval of texts and images, as well as some traditional tasks such as sentiment analysis and opinion mining, text indexing, classification and topic analysis.

The module - after a brief introduction to non-contextual language models based on word embeddings, such as GloVe and word2vec, and to memory-based neural networks that are particularly effective for textual data, such as the recurrent neural networks LSTM, GRU and BiLSTM - deals with the following new state-of-the-art deep learning advances:

  • dimensionality reduction and feature selection methods for textual data; lab on a novel latent semantic analysis method to discover underlying explanations of phenomena; lab: discovering explanations of aircraft accidents from unstructured flight textual reports
  • the best SOTA transformer language models for each prominent business area, which have largely outperformed BERT (published in 2018) with its quadratic attention complexity
  • representation learning with deep metric learning; lab: neural self-supervised information retrieval based on deep metric learning; transfer learning paradigm to exploit and fine-tune existing models in target domains semantically different from their training source domains, particularly useful to overcome the lack of labeled data in the target domain
  • multi-modal learning; lab: search engine development for large fashion corpora of texts and images with quadratic and linear attention
  • deep learning methods for low-resource regimes (i.e. models and training sets orders of magnitude smaller than the largest SOTA solutions); extractive and abstractive text summarization of long documents in low-resource settings; lab: achieving SOTA summarization of legal cases with training on a few dozen instances and small models
  • learning knowledge graphs from unstructured text with Graph Neural Networks and graph attention methods; lab: extracting n-ary relations from medical literature for modeling and explaining biological processes
  • memory-based deep neural networks for text mining, new retrieval-memory-based transformers, question answering and chatbots

The module also presents new SOTA results that we have achieved on some of these topics and published at top AI conferences.

Module 3 - Big Data Analytics (Stefano Lodi)

  • Maps and reductions in parallel programming. The MapReduce programming model.
  • The Hadoop implementation of MapReduce.
  • The Python API to the Spark system and examples of parallel programs.
  • The Machine Learning Library (MLlib) of Spark.
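
As a rough illustration of the first topic in the list above, the following is a minimal sketch of the map, shuffle and reduce steps of a word count, written in plain Python with the standard library only. It is not Hadoop or Spark code and is not taken from the course material; the function names (map_phase, shuffle, reduce_phase, word_count) are hypothetical and chosen only for this example.

```python
from collections import defaultdict
from functools import reduce


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: combine the values of each key with an associative sum."""
    return {key: reduce(lambda a, b: a + b, values)
            for key, values in groups.items()}


def word_count(documents):
    # Each document is mapped independently of the others, which is what
    # would allow a real framework to run this step in parallel.
    mapped = [pair for doc in documents for pair in map_phase(doc)]
    return reduce_phase(shuffle(mapped))


if __name__ == "__main__":
    print(word_count(["big data analytics", "Big data"]))
```

In this sketch everything runs sequentially in a single process; in a system such as Hadoop or Spark, the framework distributes the map and reduce steps across machines and performs the grouping by key itself.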

Readings/Bibliography

 

Module 1 - Data Mining (Claudio Sartori)

See 75194 - DATA MINING M Module 2 only

 

Module 2 - Text Mining (Gianluca Moro)

Readings: Slides, lab notebooks, datasets and papers will be supplied by the teacher.

 

Module 3 - Big Data Analytics (Stefano Lodi)

  • White, T. (2015). Hadoop: The definitive guide (4th ed.). O'Reilly Media.
  • Chambers, B., & Zaharia, M. (2018). Spark: The definitive guide: Big data processing made simple. O'Reilly Media, Inc.

Teaching methods

Module 1 - Data Mining (Claudio Sartori)

See 75194 - DATA MINING M Module 2 only

Module 2 - Text Mining (Gianluca Moro)

Lessons and lab activities are also held online via Teams, using slides and Colab notebooks respectively.

Module 3 - Big Data Analytics (Stefano Lodi)

The lessons of the course are held in a classroom. Examples are implemented in Python on Linux installed in a virtual machine (VM). The VMs will be installed on the students' own laptops. Full instructions on how to install the VMs will be given in the first lesson.

Assessment methods

The exam for the three modules is held in a single common sitting and consists of two parts:

Knowledge is verified through multiple-choice questions (administered with the IoL or virtuale system). To pass, the student must correctly answer at least 60% of the questions of each module.
The weight of this part is 67%.

Abilities are verified through the development of a lab project on one (or optionally more) of the three modules, selected by the student. The student proposes a project to the teacher(s) of the selected module(s) and discusses it with the teacher(s) involved. Prof. Sartori, due to his increased teaching load, has delegated the approval and evaluation of data mining projects to Prof. Moro.
The size of the project should be approximately equivalent to that of a lab of the selected module(s) and should take about one week of work.
A non-exhaustive list of datasets and text collections for the project is available in the "virtuale" teaching resources, but the student can also propose a different dataset.
The weight of this part is 33%.

 

Teaching tools

Module 1 - Data Mining (Claudio Sartori)

See 75194 - DATA MINING M Module 2 only

 

Module 2 - Text Mining (Gianluca Moro)

The laboratory activities, which are carried out with Python/R using Colab, cover the following case studies:

  • identification, from technical reports on air accidents, of the factors that contribute to serious accidents
  • extractive and abstractive text summarisation based on state-of-the-art transformer models
  • deep metric learning for cross-modal information retrieval of texts and images
  • SOTA transformers for closed and open question answering; named entity recognition (NER); relation extraction (RE) among biomedical entities, for chemical-protein and disease-disease interactions; natural language inference (NLI), to determine the validity of a hypothesis (i.e., true or false); and document classification (DC), to assign a text document to a set of categories

 

Module 3 - Big Data Analytics (Stefano Lodi)

Presentation of the course topics using an overhead projector. Exercises in Bring Your Own Device mode; directions on how to install the required software will be given during the course.

Documents used in the presentations are distributed on the site [http://iol.unibo.it]. Access to the documents is allowed only to students of the course.

Office hours

See the website of Gianluca Moro [https://www.unibo.it/sitoweb/gianluca.moro/en]

See the website of Claudio Sartori [https://www.unibo.it/sitoweb/claudio.sartori/en]

See the website of Stefano Lodi [https://www.unibo.it/sitoweb/stefano.lodi/en]
