84423 - Statistics and Architectures for Big Data Processing M

Academic Year 2018/2019

  • Modules: Riccardo Rovatti (Module 1), Luca Benini (Module 2), Oreste Andrisano (Module 3)
  • Teaching Mode: Traditional lectures (Module 1), Traditional lectures (Module 2), Traditional lectures (Module 3)
  • Campus: Bologna
  • Degree programme: Second cycle degree programme (LM) in Electronic Engineering (code 0934)

Learning outcomes

The course gives students a basic knowledge of the problems, and of the corresponding solution techniques, that arise from the ever-increasing amount and complexity of the data available for analysis and decision-making, i.e., so-called Big Data (BD). These issues are tackled from multiple points of view: from the abstract characterization of the mathematical properties of BD to the hardware architectures needed to process them, and from the ad hoc algorithms developed to cope with the data deluge to the network issues raised by storing and communicating data collections that may be partitioned in space and time.

Course contents

MODULE 1

The two directions along which Big Data are big

High dimensionality:

  • geometric effects of high dimensionality
  • computational effects of high dimensionality
  • multiplication of large matrices
  • dimensionality reduction: JL lemma
  • dimensionality reduction: PCA
  • dimensionality reduction: compressed sensing, classical and adapted approaches
  • interpolation in high-dimensional spaces
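
As an illustration of the JL-lemma item above (not part of the official course material), the following minimal sketch projects vectors onto a lower-dimensional space with a random Gaussian matrix and checks that a pairwise distance is roughly preserved. The function names `random_projection` and `distance`, the seeds, and the chosen dimensions are all hypothetical choices made for this example.

```python
import math
import random


def random_projection(vectors, k, seed=0):
    """Project each d-dimensional vector onto k dimensions.

    Uses a k x d matrix with i.i.d. Gaussian entries scaled by 1/sqrt(k),
    one common construction used to illustrate the JL lemma.
    """
    rng = random.Random(seed)
    d = len(vectors[0])
    scale = 1.0 / math.sqrt(k)
    matrix = [[rng.gauss(0.0, 1.0) * scale for _ in range(d)] for _ in range(k)]
    return [[sum(r * x for r, x in zip(row, v)) for row in matrix] for v in vectors]


def distance(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


if __name__ == "__main__":
    rng = random.Random(1)
    u = [rng.gauss(0.0, 1.0) for _ in range(500)]
    v = [rng.gauss(0.0, 1.0) for _ in range(500)]
    pu, pv = random_projection([u, v], k=200, seed=2)
    # The ratio should be close to 1 with high probability.
    print(distance(pu, pv) / distance(u, v))
```

The distortion shrinks as the target dimension k grows; the pure-Python loops are only meant for small demonstrations.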

Streaming:

  • sampling data in streams
  • filtering data in streams
  • counting distinct elements in streams
  • estimations from streams: number of ones, distinct elements, most common element...
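
As an illustration of sampling data in streams (not part of the official course material), the sketch below uses reservoir sampling, one standard technique for keeping a uniform random sample of fixed size from a stream whose length is not known in advance. The function name `reservoir_sample` and its parameters are hypothetical.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Return up to k items drawn uniformly at random from an iterable.

    The stream is read once and only k items are kept in memory.
    """
    rng = rng or random.Random()
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Keep the new item with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir


if __name__ == "__main__":
    print(reservoir_sample(range(1_000_000), 5, random.Random(42)))
```

If the stream holds fewer than k items, the function simply returns all of them.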

Prototype problems:

  • abstract summary of documents
  • Markov chains and pagerank-like algorithms
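
As an illustration of the link between Markov chains and PageRank-like algorithms (not part of the official course material), the sketch below runs a plain power iteration on a small link graph. The function name `pagerank`, the damping factor of 0.85, and the fixed iteration count are assumptions made for this example; the graph is assumed to list every node as a key.

```python
def pagerank(links, damping=0.85, iterations=100):
    """Approximate PageRank scores by power iteration.

    links maps each node to the list of nodes it points to. Every target
    must also appear as a key. Nodes without outgoing links spread their
    rank uniformly over all nodes.
    """
    nodes = list(links)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes with no outgoing links.
        dangling = sum(rank[node] for node in nodes if not links[node])
        new_rank = {
            node: (1.0 - damping) / n + damping * dangling / n for node in nodes
        }
        for node in nodes:
            for target in links[node]:
                new_rank[target] += damping * rank[node] / len(links[node])
        rank = new_rank
    return rank


if __name__ == "__main__":
    graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
    print(pagerank(graph))
```

The scores form the stationary distribution of a random walk that follows a link with probability `damping` and jumps to a uniformly chosen node otherwise.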

MODULE 2

Introduction to data centers:

  • High-level architecture
  • Compute units, network and storage
  • Energy efficiency, techniques for improving PUE
  • Trends and directions: scale-up vs. scale-out

Introduction to big data workloads

  • Amdahl's law, strong and weak scaling
  • MapReduce: Hadoop
  • NoSQL: Cassandra
  • In-memory computing: Spark
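
As an illustration of Amdahl's law and of strong versus weak scaling (not part of the official course material), the sketch below computes the speedup for a fixed problem size (Amdahl) and the scaled speedup when the problem grows with the number of workers (Gustafson's law, a common way to express weak scaling). The function names and parameters are hypothetical.

```python
def amdahl_speedup(parallel_fraction, n_workers):
    """Strong scaling: speedup of a fixed-size problem on n_workers.

    parallel_fraction is the share of the run time that can be parallelized.
    """
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be in [0, 1]")
    if n_workers < 1:
        raise ValueError("n_workers must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_workers)


def gustafson_speedup(parallel_fraction, n_workers):
    """Weak scaling: scaled speedup when the work grows with n_workers."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be in [0, 1]")
    if n_workers < 1:
        raise ValueError("n_workers must be at least 1")
    return (1.0 - parallel_fraction) + parallel_fraction * n_workers


if __name__ == "__main__":
    for n in (1, 2, 8, 64, 1024):
        print(n, amdahl_speedup(0.95, n), gustafson_speedup(0.95, n))
```

With a parallel fraction of 0.95, the Amdahl speedup can never exceed 20 however many workers are added, while the scaled speedup keeps growing with the number of workers.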

In-order CPU

  • Pipelining basics
  • Pipeline hazards
  • Memory hierarchy
  • Performance analysis techniques

Out-of-order CPU

  • ILP and instruction hazards
  • Removing false dependencies: renaming
  • Removing control hazards: branch prediction
  • Precise interrupts and speculation: reorder buffer

Multicore CPU

  • Message passing vs. shared memory
  • Parallel execution models, heterogeneous parallelism
  • Cache coherency
  • Synchronization

Architectural performance estimation and analysis

Office hours

See the website of Riccardo Rovatti

See the website of Luca Benini

See the website of Oreste Andrisano