84423 - STATISTICS AND ARCHITECTURES FOR BIG DATA PROCESSING M

Course unit page

Academic Year 2018/2019

Learning outcomes

The course provides students with basic knowledge of the problems, and the corresponding solution techniques, raised by the ever-increasing amount and complexity of the data available for analysis and decision making, i.e., so-called Big Data (BD). These issues are tackled from multiple points of view: from the abstract characterization of the mathematical properties of BD to the hardware architectures needed to process them, and from the ad hoc algorithms developed to cope with the data deluge to the network issues raised by storing and communicating data collections that may be partitioned in space and time.

Course contents

MODULE 1

The two directions along which Big Data are big

High dimensionality:

  • geometric effects of high dimensionality
  • computational effects of high dimensionality
  • multiplication of large matrices
  • dimensionality reduction: JL lemma
  • dimensionality reduction: PCA
  • dimensionality reduction: compressed sensing (classical and adapted approaches)
  • interpolation in high-dimensional spaces
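
As an illustration of the JL lemma item above, the following is a minimal sketch of dimensionality reduction by random projection, assuming a dense Gaussian projection matrix with i.i.d. N(0, 1/k) entries. The function names `random_projection` and `distance` are hypothetical and chosen only for this example; the course may present a different construction.

```python
import math
import random


def random_projection(vectors, k, seed=0):
    """Project d-dimensional vectors to k dimensions with a Gaussian matrix."""
    rng = random.Random(seed)
    d = len(vectors[0])
    scale = 1.0 / math.sqrt(k)
    # k x d matrix with i.i.d. N(0, 1/k) entries (illustrative assumption).
    matrix = [[rng.gauss(0.0, 1.0) * scale for _ in range(d)] for _ in range(k)]
    # Each projected vector is the matrix-vector product matrix @ v.
    return [[sum(row[j] * v[j] for j in range(d)) for row in matrix]
            for v in vectors]


def distance(a, b):
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
```

With high probability the pairwise distances between the projected points stay close to the original ones, so comparing `distance` before and after projection gives a quick empirical check of the distortion.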

Streaming:

  • sampling data in streams
  • filtering data in streams
  • counting distinct elements in streams
  • estimations from streams: number of ones, distinct elements, most common element...
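
The "sampling data in streams" item can be illustrated with reservoir sampling (Algorithm R), one common way to keep a uniform sample of fixed size from a stream of unknown length. This is a sketch under that assumption, not necessarily the technique taught in the course; `reservoir_sample` is a hypothetical name.

```python
import random


def reservoir_sample(stream, k, seed=None):
    """Return a uniform random sample of up to k items from an iterable."""
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Keep the new item with probability k / (i + 1).
            j = rng.randint(0, i)  # both bounds inclusive
            if j < k:
                reservoir[j] = item
    return reservoir
```

The stream is read once and only k items are held in memory, which is the constraint that makes streaming algorithms different from batch ones.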

Prototype problems:

  • abstract summary of documents
  • Markov chains and pagerank-like algorithms
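
To make the "Markov chains and pagerank-like algorithms" item concrete, here is a minimal power-iteration sketch. It assumes a damping factor of 0.85, a graph given as a dictionary mapping each node to the list of nodes it links to, and that every link target also appears as a key; the name `pagerank` and these conventions are illustrative assumptions.

```python
def pagerank(links, damping=0.85, iterations=100):
    """Approximate the stationary distribution of a damped random walk."""
    nodes = list(links)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Teleportation term: every node receives (1 - damping) / n.
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node, targets in links.items():
            if targets:
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling node: spread its rank uniformly over all nodes.
                share = damping * rank[node] / n
                for target in nodes:
                    new_rank[target] += share
        rank = new_rank
    return rank
```

For example, `pagerank({"a": ["b"], "b": ["a"], "c": ["a"]})` ranks `a` highest, since it is the only node with two incoming links.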

 

MODULE 2

Introduction to data centers:

  • High-level architecture
  • Compute units, network and storage
  • Energy efficiency, techniques for improving PUE
  • Trends and directions: scale-up vs. scale-out

Introduction to big data workloads

  • Amdahl's law, strong and weak scaling
  • Map Reduce: Hadoop
  • NoSQL: Cassandra
  • In-memory computing: Spark
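
The "Amdahl's law, strong and weak scaling" item can be summarized numerically. The sketch below computes the strong-scaling speedup from Amdahl's law and, as an assumption about how weak scaling is modelled (the syllabus does not name it), the scaled speedup from Gustafson's law. The function and parameter names are hypothetical.

```python
def amdahl_speedup(parallel_fraction, workers):
    """Strong scaling: fixed problem size, speedup bounded by the serial part."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)


def gustafson_speedup(parallel_fraction, workers):
    """Weak scaling: problem size grows with the number of workers."""
    serial_fraction = 1.0 - parallel_fraction
    return serial_fraction + parallel_fraction * workers
```

With 90% parallel work, `amdahl_speedup` can never exceed 10 however many workers are added, whereas `gustafson_speedup` keeps growing linearly with the worker count.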

In-order CPU

  • Pipelining basics
  • Pipeline hazards
  • Memory hierarchy
  • Performance analysis techniques

Out-of-order CPU

  • ILP and instruction hazards
  • Removing false dependencies: renaming
  • Removing control hazards: branch prediction
  • Precise interrupts and speculation: reorder buffer

Multicore CPU

  • Message passing vs. shared memory
  • Parallel execution models, heterogeneous parallelism
  • Cache coherency
  • Synchronization

Architectural performance estimation and analysis

 

Office hours

See the website of Riccardo Rovatti

See the website of Luca Benini

See the website of Oreste Andrisano