69843 - Data Mining Processes and Techniques

Academic Year 2014/2015

  • Teacher: Gianluca Moro
  • Credits: 5
  • SSD: ING-INF/05
  • Language: Italian
  • Teaching Mode: In-person learning (entirely or partially)
  • Campus: Rimini
  • Degree programme: Second cycle degree (LM) in Statistical, Financial and Actuarial Sciences (code 8613)

Learning outcomes

At the end of the course, the student knows the main issues and techniques underlying automatic data analysis for discovering new knowledge useful to understand and forecast phenomena of interest. Moreover, the student learns the knowledge discovery process, which includes defining goals, collecting and selecting data, preparing observations (i.e. instances), and applying data mining techniques and algorithms together with methods for validating results. In particular, the student is able to define a knowledge discovery process in specific enterprise and financial application domains, to extract knowledge models by applying appropriate techniques and algorithms to solve a discovery problem, and to validate and interpret the results.

Course contents

Introduction to the knowledge discovery process and data mining techniques for both structured data and unstructured text (e.g. web pages, documents), following the Cross-Industry Standard Process for Data Mining (CRISP-DM):

  1. definition of goals; collection, understanding and reconciliation of data in data warehousing (DW)
    • OLTP and OLAP; introduction to DW: definition, architecture and design
    • multi-dimensional data model: facts, measures, dimensions, hierarchies, cuboids
    • star and snowflake schemas
    • operations in the multi-dimensional model: roll-up, drill-down, slice and dice, pivot, data cube
  2. selection and transformation of data into observations
  3. application of data mining techniques (classification with decision trees, association rules, data clustering), also to unstructured text for processing web pages, posts and, in general, documents
  4. validation of results (i.e. the efficacy of the discovered knowledge models)
  5. deployment and export of knowledge models in standard formats such as the Predictive Model Markup Language (PMML)
  • Case studies developed with the open-source tool WEKA and commercial software:

    • developing a data warehouse with Microsoft SQL Server and performing classification and clustering
    • predicting, in a financial context, the ability of customers to repay their loans, detecting insurance fraud and/or predicting the default of companies
    • exploiting unstructured text variables in the previous analyses in order to better predict or explain the phenomenon of interest
    • market basket analysis, e.g. discovering combinations of products/services that tend to be bought together

    Readings/Bibliography


    • online chapters 4, 6 and 8 of the book Introduction to Data Mining by Tan, Steinbach, Kumar, Addison-Wesley, 2005. ISBN: 0321321367
    • lecture notes supplied by the teacher

    Suggested readings:
    • the WEKA manual
    • chapters 1, 2 and 11 of the book Pro SQL Server 2008 Analysis Services by Philo Janus, Guy Fouche, 2010. ISBN: 9781430219958

    Teaching methods

    Theoretical lectures are followed by laboratory exercises in which students tackle and solve the problems posed during the lessons.

    Assessment methods

    Laboratory exercise

    Teaching tools

    Office hours

    See the website of Gianluca Moro