Algorithms for Data Streams, summer semester 21/22
Webpage for a course on streaming algorithms, led by Pavel Veselý. Formally, it's under subject Selected Topics in Algorithms II - NTIN111 (please enroll there if you want to get credits). It will be on Wednesdays at 10:40 on the 3rd floor in the corridor of KAM/IUUK that's opposite to room S3. If you have any questions, drop me an email at vesely (guess what) iuuk.mff.cuni.cz (you can also find me in office S323, which is in the same corridor).
Brief introduction: The streaming model of computation provides a mathematically clean and practically relevant model of processing massive amounts of data (so called “Big Data”). A streaming algorithm must process the input stream (i.e. a sequence of updates) with substantially limited memory, sublinear in the data size, ideally (poly)logarithmic. Most often, just a single pass over the input is allowed. Given these restrictions, the goal is to compute an approximate answer to a given problem, with as good accuracy-space trade-off as possible. The particular question underlying streaming computation is how to summarize a massive input into a small data structure, called sketch or summary, so that properties of the input are approximately preserved.
We will cover basic streaming problems (approximate counting, finding frequent elements, frequency moments, counting distinct elements, and estimating quantiles) as well as slightly more advanced topics such as geometric coresets, graph streams (e.g. testing connectivity, maximum matchings, triangle counting), clustering ... we will see what time allows. During the last few lectures, we will look into how to prove lower bounds for streaming problems via communication complexity.
Prerequisites: the lecture is primarily targeted at master and PhD students, however, bachelor students are also welcome. In most lectures, we will formally analyze a streaming algorithm for a particular problem, which will be typically randomized, and so we will need basic tools from the probability theory like random variables, Markov, Chebyshev, and Chernoff bounds, and hash functions. These tools will be reviewed (most often without proofs, which are covered by other courses).
Lectures
(Note: there was no class on 16th February.) Links to chapters/units in literature below.23 February | Intro to streaming algorithms; the (deterministic) Misra-Gries algorithm for finding frequent elements and frequency estimation [AC unit 1, SSBD 3.2]; the KMV sketch (k minimum values of hash function; without analysis). |
2 March | KMV sketch details: the median trick, hash functions, possible uses [SSBD 2.5]; counting distinct elements using the Flajolet-Martin idea and BJKST algorithm [AC 3-4] |
9 March | Morris algorithm for approximate counting in o(log m) space [AC 4, SSBD 2.1]. Linear sketches: intro, CountSketch and Count-Min Sketch for frequency estimation [AC 5, SSBD 3.4-3.5] |
16 March | Frequency moments estimation: the Tug-of-War sketch for F_2 / Euclidean norm / inner products (and its relation to CountSketch) [AC 7, SSBD 3.6], ell_p norms for p in (0, 2] using stable distributions, without detailed analysis [AC 8, SSBD 3.7] |
23 March | Sparse recovery [AC 9, SSBD 3.8], distinct sampling (a.k.a ell_0 sampling) [AC 10, SSBD 3.9], and finding connected components in dynamic graph streams [AC 16, SSBD 7.1]. |
30 March | Quantile/rank estimation in data streams (or: cummulative distribution function approximation) using the KLL sketch [SSBD 4.3]. |
6 April | Streaming lower bounds using communication complexity: the problems of Equality, Disjoitness, Index, and their deterministic and randomized (one-way) communication complexities [AC 18]. |
13 April | Lower bounds and multiple passes: Count-Distinct and F_0 from Disjointness [AC 19.2]. Graph streaming: gap cycle counting (a lower bound from Boolean Hidden (Hyper)-Matching) [paper by Verbin & Yu], spanners for distance estimation [AC 14.4, SSBD 7.2], and maximum weighted matching [AC 15]. |
20 April | Geometric streams: a simple streaming 2-approximation of diameter, coresets for minimum enclosing ball (MEB), which also work for the diameter [AC 12], k-center clustering by Guha's cascading algorithm [AC 13, SSBD 5.4]. |
27 April | Cancelled |
4 May | Cancelled due to Spring School |
11 May | Dimensionality reduction: Johnson-Lindenstrauss transform and its proof using the Hanson-Wright inequality (without proof). Overview of impossibility results for other norms than the Euclidean norm. Also an outline of Sparse JL transform. Based on lecture notes by Jelani Nelson (a part of Chapter 5). |
18 May | Student presentation of paper An Optimal Algorithm for Triangle Counting in the Stream by Rajesh Jayaram and John Kallaugher; brief discussion of linear algebraic summaries. |
Literature
- [AC]: Lecture Notes on Data Stream Algorithms by Amit Chakrabarti, Dartmouth College
- [SSBD]: Graham Cormode and Ke Yi: Small Summaries for Big Data, Cambridge University Press, 2020> – this book provides a nice overview of all algorithms that we will discuss (it's not lecture notes, though); a draft version is publicly available
- Lecture notes by Jelani Nelson