84423 - Statistics and Architectures for Big Data Processing M

Academic Year 2018/2019

Teacher: Riccardo Rovatti
Credits: 9
SSD: ING-INF/01
Language: English

Modules: Riccardo Rovatti (Module 1), Luca Benini (Module 2), Oreste Andrisano (Module 3)
Teaching Mode: Traditional lectures (Module 1), Traditional lectures (Module 2), Traditional lectures (Module 3)
Campus: Bologna
Course: Second cycle degree programme (LM) in Electronic Engineering (code 0934)

Learning outcomes

The course provides students with basic knowledge of the problems, and of the corresponding solution techniques, raised by the ever-increasing amount and complexity of the data available for analysis and decision-making, i.e., so-called Big Data (BD). These issues are tackled from multiple points of view: from the abstract characterization of the mathematical properties of BD to the hardware architectures needed to process them, and from the ad hoc algorithms developed to cope with the data deluge to the network issues implied by the storage and communication of data collections that may be partitioned in space and time.

Course contents

MODULE 1

The two directions along which Big Data are big

High dimensionality:

geometric effects of high dimensionality
computational effects of high dimensionality
multiplication of large matrices
dimensionality reduction: JL lemma
dimensionality reduction: PCA
dimensionality reduction: compressed sensing classical and adapted approaches
interpolation in high-dimensional spaces

Streaming:

sampling data in streams
filtering data in streams
counting distinct elements in streams
estimations from streams: number of ones, distinct elements, most common element...

Prototype problems:

abstract summary of documents
Markov chains and pagerank-like algorithms

MODULE 2

Introduction to data centers:

High-level architecture
Compute units, network and storage
Energy efficiency, techniques for improving PUE
Trends and directions: scale-up vs. scale-out

Introduction to big data workloads

Amdahl's law, strong and weak scaling
MapReduce: Hadoop
NoSQL: Cassandra
In-memory computing: Spark

In-order CPU

Pipelining basics
Pipeline hazards
Memory hierarchy
Performance analysis techniques

Out-of-order CPU

ILP and instruction hazards
Removing false dependencies: renaming
Removing control hazards: branch prediction
Precise interrupts and speculation: reorder buffer

Multicore CPU

Message passing vs shared memory vs parallel execution models, heterogeneous parallelism
Cache coherency
Synchronization

Architectural performance estimation and analysis

Office hours

See the website of Riccardo Rovatti

See the website of Luca Benini

See the website of Oreste Andrisano