91262 - Data Mining, Text Mining and Big Data Analytics

Academic Year 2021/2022

  • Teacher: Gianluca Moro
  • Credits: 6
  • SSD: ING-INF/05
  • Language: English
  • Modules: Claudio Sartori (Module 1), Gianluca Moro (Module 2), Stefano Lodi (Module 3)
  • Teaching Mode: Traditional lectures (Module 1), Traditional lectures (Module 2), Traditional lectures (Module 3)
  • Campus: Bologna
  • Degree programme: Second cycle degree programme (LM) in Artificial Intelligence (cod. 9063)

Learning outcomes

At the end of the course, the student understands how a possibly very large set of data can be analyzed to derive strategic information and to support "data-driven" decisions. The student knows the main data-mining tasks, such as data selection, data transformation, analysis and interpretation, with specific reference to unstructured text data, and is familiar with the issues related to analysis in "big data" environments.

Course contents

Presentation slides of the course 

 

Module 1 - Data Mining (Claudio Sartori)

See 75194 - DATA MINING M [https://www.unibo.it/en/teaching/course-unit-catalogue/course-unit/2021/391683] Module 2 only

 

Module 2 - Text Mining (Gianluca Moro)

The text mining module focuses on knowledge discovery from large corpora of unstructured text, which is fundamental to several natural language processing tasks, such as text representation, indexing and classification, topic analysis, semantic similarity search, explanation of behaviours and phenomena of interest (a.k.a. descriptive text mining), sentiment analysis and opinion mining, text summarisation, chatbot or digital assistant creation, and cross-modal information retrieval of texts and images.

The learning outcome of the module is the ability to define and implement text mining processes, from text pre-processing and representation, first with traditional approaches and then with novel neural language models, up to knowledge discovery with data science methods and machine and deep learning algorithms from several sources, such as tweets, Facebook posts, reviews, web pages, emails, loan requests, legal cases, news and documents in general.

The module introduces non-contextual language models based on word embeddings, such as GloVe and word2vec, and memory-based neural networks that are particularly effective for textual data, such as the recurrent neural networks LSTM, GRU and BiLSTM. It then covers the attention mechanism, the transformer and state-of-the-art contextual word embeddings based on BERT, as well as new linear transformer models for text summarization and deep metric learning for cross-modal information retrieval of texts and images.

Finally, the unit illustrates the transfer learning paradigm, which exploits and fine-tunes existing models in target domains that are semantically different from their training source domains; this is particularly useful for overcoming the lack of labeled data in the target domain.

 

Module 3 - Big Data Analytics (Stefano Lodi)

  • Maps and reductions in parallel programming. The MapReduce programming model.
  • The Hadoop implementation of MapReduce.
  • The Python API to the Spark system and examples of parallel programs.
  • The Machine Learning Library (MLlib) of Spark.
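
As an illustration of the first topic in the list above, the following is a minimal single-process sketch of the map, shuffle and reduce phases applied to word counting. It is not taken from the course materials and does not use Hadoop or Spark; all function names (map_phase, shuffle_phase, reduce_phase, word_count) are hypothetical and chosen only for this example.

```python
from collections import defaultdict
from functools import reduce


def map_phase(document):
    """Map: emit one (word, 1) pair for every word in a document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values of one key into a single count."""
    return key, reduce(lambda a, b: a + b, values)


def word_count(documents):
    """Run the three phases sequentially (a real system runs them in parallel)."""
    mapped = [pair for doc in documents for pair in map_phase(doc)]
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


counts = word_count(["big data analytics", "big data mining"])
print(counts)
```

In a distributed implementation, the map calls and the per-key reduce calls would run on different machines; here they run one after the other only to show the data flow.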

 

 

Readings/Bibliography


Module 1 - Data Mining (Claudio Sartori)

See 75194 - DATA MINING M [https://www.unibo.it/en/teaching/course-unit-catalogue/course-unit/2021/391683] Module 2 only

 

Module 2 - Text Mining (Gianluca Moro)

Readings: Slides, lab materials and papers will be supplied by the teacher.

Suggested Readings:

  • C. Manning, H. Schütze, P. Raghavan. Introduction to Information Retrieval. Cambridge University Press. Freely available from http://nlp.stanford.edu/IR-book/pdf/irbookonlinereading.pdf
  • B. Liu and L. Zhang. A survey of opinion mining and sentiment analysis. In Mining Text Data, Editors C. Aggarwal and C. Xiang Zhai. Springer. http://www.cs.uic.edu/~lzhang3/paper/opinion_survey.pdf
  • Lei Zhang, Shuai Wang, and Bing Liu. Deep learning for sentiment analysis: A survey. In Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 8(4). https://arxiv.org/abs/1801.07883
  • Tang D., Zhang M. Deep Learning in Sentiment Analysis. In Deep Learning in Natural Language Processing. Springer. https://www.dropbox.com/s/yzeheq8zuh0owmi/Optional_Deep_Learning_in_Sentiment_Analysis.pdf?dl=0

 

Module 3 - Big Data Analytics (Stefano Lodi)

  • White, T. (2009). Hadoop: The definitive guide, 4th Edition. O'Reilly Media.
  • Chambers, B., & Zaharia, M. (2018). Spark: The definitive guide: Big data processing made simple. O'Reilly Media, Inc.

 

 

Teaching methods

Module 1 - Data Mining (Claudio Sartori)

See 75194 - DATA MINING M [https://www.unibo.it/en/teaching/course-unit-catalogue/course-unit/2021/391683] Module 2 only

 

Module 2 - Text Mining (Gianluca Moro)

Lessons and lab activities are held online on Teams, using slides and Colab notebooks respectively.

 

Module 3 - Big Data Analytics (Stefano Lodi)

The lessons of the course are held in a classroom. Examples are implemented in Python on Linux running in a virtual machine (VM). The VMs will be installed on the students' own laptops. Full instructions on how to install the VMs will be given in the first lesson.

 

Assessment methods

Module 1 - Data Mining (Claudio Sartori)

See 75194 - DATA MINING M [https://www.unibo.it/en/teaching/course-unit-catalogue/course-unit/2021/391683] Module 2 only

 

Module 2 - Text Mining (Gianluca Moro)

The student proposes and prepares a project work in R, Python or WEKA on one or more topics of the text mining module. The project is then discussed in the oral exam, together with questions on the module contents.
The size of the project should be approximately equivalent to that of one lab of the module.
A non-exhaustive list of text data sets for the project is available in the "virtuale" teaching resources, but the student can also propose a different data set.

 

Module 3 - Big Data Analytics (Stefano Lodi)

  • Multiple choice test: the candidate must complete a given sentence with one of three possible completions, of which only one is correct. There will be 15 such sentences in the test. The mark will be given as a number from 0 to 30.
  • Oral examination: At most two questions. The mark will be given as a number from 0 to 30.
  • The scope of both the test and the oral examination is the entire presented content of the module. The contents of distributed slides which for any reason have not been presented in lectures will not be in the scope of the examination.
  • The final mark of the Big Data Analytics module is the average of the multiple choice test mark and the oral examination mark.

 

 

Teaching tools

Module 1 - Data Mining (Claudio Sartori)

Multiple choice test + open question(s) and optional oral exam

See 75194 - DATA MINING M [https://www.unibo.it/en/teaching/course-unit-catalogue/course-unit/2021/391683] Module 2 only

 

Module 2 - Text Mining (Gianluca Moro)

The laboratory activities, which are carried out with WEKA, R or Python, mainly using Google Colab, cover the following case studies:

  • identification, from technical reports on air accidents, of the factors that contribute to serious accidents
  • classification of documents by topic with several machine learning approaches
  • extractive and abstractive text summarisation based on state-of-the-art transformer models
  • deep metric learning for cross-modal information retrieval of texts and images

 

Module 3 - Big Data Analytics (Stefano Lodi)

Presentation of the course topics using an overhead projector. Exercises in Bring Your Own Device mode; directions on how to install the required software will be given during the course.

Documents used in the presentations are distributed on the site [http://iol.unibo.it]. Access to the documents is allowed only to students of the course.

Office hours

See the website of Gianluca Moro

See the website of Claudio Sartori

See the website of Stefano Lodi