Academic Year 2019/2020

  • Teaching Mode: Traditional lectures
  • Campus: Bologna
  • Degree programme: First cycle degree programme (L) in Statistical Sciences (code 8873)

Learning outcomes

By the end of the course, students will have developed advanced expertise in analyzing real-world phenomena with statistical methods. They will be able to:

  • implement appropriate advanced statistical analyses using statistical software (SAS, R, or SPSS);
  • interpret the output of the procedures;
  • critically collate results and conclusions;
  • present the main results and conclusions in concise summaries;
  • work independently on practical data analysis problems.

Course contents

  • Text cleaning and text standardization (including stemming, lemmatization, and stopword removal).
  • Creating a document-term matrix with different weighting schemes.
  • Data wrangling in text mining.
  • Searching for relationships and patterns between words.
  • Visualization techniques for text mining analysis.
  • Unsupervised machine learning methods for text analysis (clustering, sentiment analysis, dimensionality reduction).
  • Supervised machine learning methods and simple feature engineering of text data (Naive Bayes, KNN, decision trees, SVM, random forest).
  • R software and R infrastructure for text mining analysis and machine learning (packages: tm, tidytext, quanteda, caret, mlr).

Readings/Bibliography

  • Bird, Steven, Ewan Klein, and Edward Loper. Natural language processing with Python: analyzing text with the natural language toolkit. O'Reilly Media, Inc., 2009.
  • Feldman, Ronen, and James Sanger. The text mining handbook: advanced approaches in analyzing unstructured data. Cambridge University Press, 2007.
  • Friedl, Jeffrey E. F. Mastering regular expressions. O'Reilly Media, Inc., 2006.
  • Kumar, Ashish, and Avinash Paul. Mastering Text Mining with R. Packt Publishing Ltd, 2016.
  • Kwartler, Ted. Text mining in practice with R. John Wiley & Sons, 2017.
  • Manning, Christopher D., and Hinrich Schütze. Foundations of statistical natural language processing. Cambridge, MA: MIT Press, 1999.
  • Feinerer, Ingo, Kurt Hornik, and David Meyer. "Text mining infrastructure in R." Journal of Statistical Software 25.5 (2008): 1-54.
  • Silge, Julia, and David Robinson. Text mining with R: A tidy approach. O'Reilly Media, Inc., 2017.
  • Weiss, Sholom M., et al. Text mining: predictive methods for analyzing unstructured information. Springer Science & Business Media, 2010.

Teaching methods

Lectures and lab tutorials

Assessment methods

Attendance and a take-home project.

Teaching tools

Lab tutorials & teaching notes

Office hours

See the website of Piotr Cwiakowski