91262 - Data Mining, Text Mining and Big Data Analytics

Course Unit Page

Academic Year 2020/2021

Learning outcomes

At the end of the course, the student understands how a possibly very large set of data can be analyzed to derive strategic information and to support "data-driven" decisions. The student knows the main data-mining tasks, such as data selection, data transformation, analysis and interpretation, with specific reference to unstructured text data, and is familiar with the issues related to analysis in "big data" environments.

Course contents

Module 1 - Data Mining (Claudio Sartori)

See 75194 - DATA MINING M Module 2 only

Module 2 - Big Data Analytics (Stefano Lodi)

  • Maps and reductions in parallel programming. The MapReduce programming model.
  • The Hadoop implementation of MapReduce.
  • The Python API to the Spark system and examples of parallel programs.
  • The Machine Learning Library (MLlib) of Spark.
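
As an informal illustration of the map and reduce steps listed above, the following is a minimal single-machine sketch in plain Python. It is not taken from the course materials and uses neither Hadoop nor Spark; the function names (map_phase, shuffle, reduce_phase, word_count) are hypothetical and chosen only for this example.

```python
from collections import defaultdict
from functools import reduce


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: combine the values of each key into a single count."""
    return {key: reduce(lambda a, b: a + b, values)
            for key, values in groups.items()}


def word_count(documents):
    """Run map, shuffle and reduce sequentially over a list of documents."""
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return reduce_phase(shuffle(pairs))


if __name__ == "__main__":
    docs = ["big data analytics", "big data mining"]
    print(word_count(docs))
```

In a real MapReduce system the map calls run in parallel on different nodes and the shuffle moves data across the network; here everything runs sequentially to keep the three phases visible.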

Module 3 - Text Mining (Gianluca Moro)

The text mining module focuses on knowledge discovery from large corpora of unstructured text, which is fundamental to several natural language processing tasks, such as text representation, indexing and classification, topic analysis, semantic similarity search, explanation of behaviours and phenomena of interest (a.k.a. descriptive text mining), sentiment analysis and opinion mining, text summarisation, and the creation of chatbots or digital assistants.

The learning outcomes of the module are the ability to define and implement text mining processes, from text pre-processing and representation with traditional approaches and then with novel neural language models, up to knowledge discovery with data science methods and machine and deep learning algorithms from several sources, such as tweets, Facebook posts, reviews, web pages, emails, loan requests, legal cases, news and documents in general.

The module introduces non-contextual language models based on word embeddings, such as GloVe and word2vec, and memory-based neural networks particularly effective for textual data, such as the recurrent networks LSTM, GRU and BiLSTM, up to the attention mechanism, the transformer and state-of-the-art contextual word embeddings based on BERT. Finally, the module illustrates the transfer learning paradigm, used to exploit and fine-tune existing models in target domains that are semantically different from their training source domains; this is particularly useful to overcome the lack of labeled data in the target domain.
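
To give an informal idea of how non-contextual word embeddings are used, the sketch below compares words by the cosine similarity of their vectors. It is only an illustration under stated assumptions: the three-dimensional vectors in toy_embeddings are invented for this example and do not come from GloVe, word2vec or the course materials.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention for zero vectors, which have no direction
    return dot / (norm_u * norm_v)


# Hypothetical toy vectors: real embeddings have hundreds of dimensions
# and are learned from large corpora.
toy_embeddings = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.85, 0.75, 0.2],
    "apple": [0.1, 0.2, 0.9],
}

if __name__ == "__main__":
    for other in ("queen", "apple"):
        score = cosine_similarity(toy_embeddings["king"], toy_embeddings[other])
        print(f"king vs {other}: {score:.3f}")
```

Because each word has a single fixed vector regardless of the sentence it appears in, such models are called non-contextual; contextual models such as BERT instead produce a different vector for each occurrence of a word.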

Readings/Bibliography

Module 1 - Data Mining (Claudio Sartori)

See 75194 - DATA MINING M Module 2 only

Module 2 - Big Data Analytics (Stefano Lodi)

  • White, T. (2009). Hadoop: The definitive guide, 4th Edition. O'Reilly Media.
  • Chambers, B., & Zaharia, M. (2018). Spark: The definitive guide: Big data processing made simple. O'Reilly Media, Inc.

Module 3 - Text Mining (Gianluca Moro)

Readings: Slides, lab materials and papers will be supplied by the teacher.

Suggested Readings:

  • C. Manning, H. Schütze, P. Raghavan. Introduction to Information Retrieval. Cambridge University Press. Freely available from http://nlp.stanford.edu/IR-book/pdf/irbookonlinereading.pdf
  • B. Liu and L. Zhang. A survey of opinion mining and sentiment analysis. In Mining Text Data, Editors C. Aggarwal and C. Xiang Zhai. Springer. http://www.cs.uic.edu/~lzhang3/paper/opinion_survey.pdf
  • Lei Zhang, Shuai Wang, and Bing Liu. Deep learning for sentiment analysis: A survey. In Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 8(4). https://arxiv.org/abs/1801.07883
  • Tang D., Zhang M. Deep Learning in Sentiment Analysis. In Deep Learning in Natural Language Processing. Springer. https://www.dropbox.com/s/yzeheq8zuh0owmi/Optional_Deep_Learning_in_Sentiment_Analysis.pdf?dl=0

Teaching methods

Module 1 - Data Mining (Claudio Sartori)

See 75194 - DATA MINING M Module 2 only

Module 2 - Big Data Analytics (Stefano Lodi)

Lessons are held in a laboratory; each comprises both lectures and exercises.

Module 3 - Text Mining (Gianluca Moro)

Lessons and lab activities

Assessment methods

Module 1 - Data Mining (Claudio Sartori)

See 75194 - DATA MINING M Module 2 only

Module 2 - Big Data Analytics (Stefano Lodi)

Multiple choice test and oral examination.

Module 3 - Text Mining (Gianluca Moro)

The student proposes and prepares a project in R, Python or WEKA on one or more topics of the text mining module; the project is then discussed in the oral exam, together with questions on the module contents.
The size of the project should be approximately equivalent to that of a lab of the module.
A non-exhaustive list of text data sets for the project is available in the "virtuale" teaching resources, but the student can also propose a different data set.

Teaching tools

Module 1 - Data Mining (Claudio Sartori)

Multiple choice test + open question(s) and optional oral exam

See 75194 - DATA MINING M Module 2 only

Module 2 - Big Data Analytics (Stefano Lodi)

Presentation of the course topics using an overhead projector. Exercises in Bring Your Own Device mode; directions on how to install the required software will be given during the course.

Documents used in the presentations are distributed on the site
[http://iol.unibo.it]. Access to the documents is allowed only to
students of the course.

Module 3 - Text Mining (Gianluca Moro)

The laboratory activities, which are carried out with WEKA, R or Python, mainly using Google Colab, cover the following case studies:

  • identification, in technical reports on air accidents, of the factors that contribute to serious accidents
  • classification of documents by topic with several machine learning algorithms
  • sentiment analysis and opinion mining of unlabeled text sets from Twitter and labeled text sets from TripAdvisor, Edmunds and Amazon
  • language models, deep neural networks and transfer learning in opinion mining
  • text summarisation of real legal cases with state-of-the-art deep learning solutions.

Office hours

See the website of Claudio Sartori

See the website of Gianluca Moro

See the website of Stefano Lodi