Streaming Algorithms for Big Data, winter semester 25/26
NTIN114, 2/0
Wednesdays at 12:20 in S321 (corridor opposite to S3)
Webpage for a course on streaming algorithms and data sketches, taught by Pavel Veselý. If this time is not suitable for you but you'd like to attend, or if you have any questions, drop me an email at vesely (guess what) iuuk.mff.cuni.cz (you can also find me in office S326). The content will be similar to that of 1.5 years ago or 3.5 years ago, but extended.
If you attended NDMI025 Randomized Algorithms by Prof. Sgall, you have already seen some basic streaming algorithms. This course will have only a small overlap with it (the median trick, CountSketch, the Morris counter), and more advanced/practical algorithms will be presented, e.g., for counting distinct elements or finding frequent items. If you didn't attend Randomized Algorithms, we will recap the necessary parts for you.
Brief introduction: The streaming model of computation provides a mathematically clean and practically relevant model of processing massive amounts of data (so-called “Big Data”). A streaming algorithm must process the input stream (i.e., a sequence of updates) with substantially limited memory, sublinear in the data size, ideally (poly)logarithmic. Most often, just a single pass over the input is allowed. Given these restrictions, the goal is to compute an approximate answer to a given problem, with as good an accuracy-space trade-off as possible. The central question underlying streaming computation is how to summarize a massive input into a small data structure, called a sketch or summary, so that properties of the input are approximately preserved.
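As a first illustration of such a summary, here is a toy Python version of the Morris counter (one of the basic algorithms mentioned below): it maintains an approximate count of stream items while storing only a single small exponent, i.e., roughly log log n bits. This is a minimal sketch for intuition, not the precise variant analyzed in the course.

```python
import random

class MorrisCounter:
    """Approximate counter storing only an exponent x; the count is estimated as 2^x - 1."""

    def __init__(self):
        self.x = 0

    def update(self):
        # Increment the stored exponent with probability 2^{-x}.
        if random.random() < 2.0 ** (-self.x):
            self.x += 1

    def estimate(self):
        return 2.0 ** self.x - 1

# Example: process one million stream items.
c = MorrisCounter()
for _ in range(1_000_000):
    c.update()
print(c.estimate())  # unbiased estimate of 1,000,000; a single counter has high
                     # variance, which averaging and the median trick reduce
```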
We will cover basic streaming problems (approximate counting, finding frequent elements, frequency moments, counting distinct elements, and estimating quantiles) as well as slightly more advanced topics such as geometric coresets, graph streams (e.g., testing connectivity, maximum matchings, triangle counting), clustering ... we'll see what time allows. We will show how to prove (unconditional) lower bounds on the space needed to solve a given problem, using tools from communication complexity. We will also look into recent trends in research on streaming algorithms, such as robustness to adversarial attacks or mergeable summaries for distributed datasets.
Prerequisites: the lecture is primarily targeted at master's and PhD students; however, bachelor students are also welcome. In most lectures, we will formally analyze a streaming algorithm for a particular problem, which will typically be randomized, so we will need basic tools from probability theory such as random variables, the Markov, Chebyshev, and Chernoff bounds, and hash functions. These tools will be reviewed (most often without proofs, as these are covered in other courses).
Credit requirements: Depending on the number of participants and their preference, the credit will be given for presenting a recent research paper or for passing an oral exam (details to be provided in the first lecture).
Lectures
Links point to chapters/units in the literature listed below.

| Date | Topics |
| --- | --- |
| 1 October | Intro to streaming algorithms [AC unit 1]; Reservoir sampling and its use for estimating frequencies [SSBD 2.2]; the (deterministic) Misra-Gries algorithm for finding frequent elements and frequency estimation [SSBD 3.2] (a toy Misra-Gries implementation is sketched below the table). |
| 8 October | Cancelled due to a research visit |
| 15 October | Cardinality estimation, that is, approximately counting the number of distinct elements: The KMV sketch (k minimum values of a hash function): analysis of the estimator according to notes by Lee Rhodes (unbiased and tight variance bound) and briefly also according to [SSBD 3.3] (simpler, using the Chebyshev inequality). Using KMV to estimate the Jaccard index as well as the sizes of the union and intersection of two sets. The Flajolet-Martin idea (loglog counting) and HyperLogLog (without analysis). (A toy KMV implementation is sketched below the table.) |
| 22 October | Linear sketches: intro, Count-Min Sketch and CountSketch for frequency estimation [AC 5, SSBD 3.4-3.5], estimating the second frequency moment F_2 using CountSketch with 4-wise independent hash functions (a.k.a. the Fast AMS sketch) [SSBD 3.6], without the analysis of the variance (the AMS sketch, which is CountSketch with single-cell rows, is described in [AC 7], including the analysis of the variance). (A toy Count-Min sketch is implemented below the table.) |
| 29 October | Plan: substitute lecture by Tomáš Domes: Quantile/rank estimation in data streams (or: cumulative distribution function approximation) using the KLL sketch [SSBD 4.3]. |
| 5 November | Finished the analysis of the Fast AMS sketch for the second frequency moment [SSBD 3.6]. A remark on the connection to dimension reduction and the Johnson-Lindenstrauss lemma. Indyk's algorithm for ell_p norms for p in (0, 2] using p-stable distributions, without a detailed analysis [AC 8, SSBD 3.7] (a toy ell_1 version is sketched below the table). A note about the case p > 2. |
| 12 November | We decided to have a lecture despite the Dean's sports day (with more exercises :-)). (If the corridor is busy due to MASO, we will meet in S10.) Sparse recovery [AC 9, SSBD 3.8] and distinct sampling (a.k.a. ell_0 sampling) [AC 10, SSBD 3.9]. Invertible Bloom Lookup Tables (IBLTs). (A toy 1-sparse recovery routine is sketched below the table.) |
| 19 November | Graph sketch: Finding connected components (and a spanning forest) in dynamic graph streams using distinct sampling [AC 16, SSBD 7.1]. Streaming lower bounds using communication complexity: the problems of Equality, Index, and their deterministic and randomized (one-way) communication complexities [AC 18]. |
| 26 November | Streaming lower bounds using communication complexity [AC 18-19, SSBD 10]: overview of the main communication problems (Equality, Index, Disjointness, Gap-Hamming, Boolean Hidden Matching, Augmented Index) and their main applications (a constant approximation of set cardinality requires randomness, lower bounds for frequency estimation, counting the number of components or triangles). Gap cycle counting (a lower bound from Boolean Hidden (Hyper)-Matching) [paper by Verbin & Yu]. Also a discussion of the need for discrete inputs in the dynamic setting. Spanners for distance estimation [AC 14.4, SSBD 7.2]. |
| 3 December | Geometric and metric streams: coresets for the diameter and the minimum enclosing ball (MEB): a sqrt(2)+epsilon approximation in high dimension based on a recent paper (also a sketch of a 1+epsilon approximation in low dimension). Clustering: main variants and a (2+epsilon)-coreset based on threshold testing for k-center [AC 13, SSBD 5.4]. Brief discussion of different settings: insertion-only vs. dynamic, low- vs. high-dimensional. |
| 10 December | Student presentations: J.K.: An Optimal Algorithm for Triangle Counting in the Stream by Rajesh Jayaram and John Kallaugher; J.K.: Simple & Optimal Quantile Sketch: Combining Greenwald-Khanna with Khanna-Greenwald by Elena Gribelyuk, Pachara Sawettamalya, Hongxun Wu, and Huacheng Yu. |
| 17 December | Dimensionality reduction: the Johnson-Lindenstrauss transform and a sketch of its proof by Dasgupta and Gupta. Also an outline of variants of the JL transform, including Sparse JL (similar to CountSketch). A simple logarithmic lower bound based on a volume argument and a simple application to a lower bound for estimating the diameter in high dimension. A brief discussion of metric embeddings, impossibility results, and the FRT tree embedding of any metric. Partly based on lecture notes by Jelani Nelson [JN] (a part of Chapter 5). (A toy JL projection is sketched below the table.) |
| 7 January | Plan: adversarially robust streaming algorithms. |
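The following toy implementations illustrate some of the sketches from the table above. They are minimal Python sketches written for intuition, not the exact variants analyzed in the lectures; in particular, they use Python's built-in hashing and random numbers instead of the limited-independence hash families assumed in the analyses.

A Misra-Gries summary with at most k-1 counters (1 October): every element occurring more than n/k times in a stream of length n is kept, and each counter underestimates the true frequency by at most n/k.

```python
def misra_gries(stream, k):
    """Misra-Gries frequent-elements summary with at most k-1 counters."""
    counters = {}
    for x in stream:
        if x in counters:
            counters[x] += 1
        elif len(counters) < k - 1:
            counters[x] = 1
        else:
            # Decrement all counters and drop those that reach zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters

print(misra_gries(["a", "b", "a", "c", "a", "b", "a"], k=3))
```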
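A toy KMV (k minimum values) estimator of the number of distinct elements (15 October): hash every element to [0, 1), keep the k smallest distinct hash values, and output (k - 1)/v_k, where v_k is the k-th smallest value. The "hash function" built from blake2b below is just a stand-in for illustration.

```python
import hashlib

def kmv_estimate(stream, k):
    """Toy KMV sketch; assumes the stream has more than k distinct elements."""
    def h(x):
        digest = hashlib.blake2b(str(x).encode(), digest_size=8).digest()
        return int.from_bytes(digest, "big") / 2**64  # pseudo-uniform value in [0, 1)

    smallest = set()
    for x in stream:
        smallest.add(h(x))                     # duplicates hash to the same value
        if len(smallest) > k:
            smallest.remove(max(smallest))     # keep only the k smallest hash values
    assert len(smallest) == k, "needs more than k distinct elements"
    return (k - 1) / max(smallest)

stream = [i % 1000 for i in range(100_000)]    # 1000 distinct elements
print(kmv_estimate(stream, k=64))              # roughly 1000
```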
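A toy Count-Min sketch for frequency estimation (22 October): d rows of w counters, each row with its own hash function; an estimate never undercounts and, in expectation, overcounts by at most (total count)/w per row, so the minimum over d rows is accurate with high probability. Python's salted built-in hash here only stands in for the pairwise-independent families used in the analysis.

```python
import random

class CountMinSketch:
    """Count-Min sketch with d rows of w counters."""

    def __init__(self, w, d, seed=0):
        self.w, self.d = w, d
        rng = random.Random(seed)
        self.salts = [rng.getrandbits(64) for _ in range(d)]
        self.table = [[0] * w for _ in range(d)]

    def _bucket(self, row, x):
        return hash((self.salts[row], x)) % self.w

    def update(self, x, count=1):
        for r in range(self.d):
            self.table[r][self._bucket(r, x)] += count

    def query(self, x):
        # Never an undercount; the minimum over rows limits the overcount.
        return min(self.table[r][self._bucket(r, x)] for r in range(self.d))

cms = CountMinSketch(w=200, d=5)
for i in range(10_000):
    cms.update(i % 100)          # each of 100 items appears 100 times
print(cms.query(7))              # at least 100, typically very close to 100
```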
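A toy version of Indyk's p-stable sketch for p = 1 (5 November): project the frequency vector onto i.i.d. standard Cauchy vectors (which are 1-stable) and take the median of the absolute values of the projections. Storing the dense projection matrix explicitly is only for illustration; a real sketch would generate its entries from hash functions.

```python
import numpy as np

def l1_norm_estimate(updates, d, k=200, seed=0):
    """Estimates the ell_1 norm of a frequency vector over [d] given as a
    stream of (index, delta) updates, via k Cauchy projections."""
    rng = np.random.default_rng(seed)
    C = rng.standard_cauchy((k, d))
    sketch = np.zeros(k)
    for i, delta in updates:
        sketch += delta * C[:, i]   # linear sketch: handles insertions and deletions
    # By 1-stability, each coordinate is distributed as ||f||_1 times a Cauchy
    # variable, and the median of |Cauchy| equals 1.
    return np.median(np.abs(sketch))

updates = [(i % 50, 1) for i in range(5000)] + [(3, -60)]
# Frequencies: 100 on each index except f_3 = 40, so ||f||_1 = 4940.
print(l1_norm_estimate(updates, d=50))
```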
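The simplest building block of sparse recovery (12 November): recovering the unique nonzero coordinate of a 1-sparse frequency vector from just two counters. (The fingerprint test for detecting that the vector is in fact not 1-sparse is omitted here.)

```python
def one_sparse_recovery(updates):
    """Recovers the single (index, count) pair from a dynamic stream of
    (index, delta) updates, assuming the final frequency vector is 1-sparse."""
    S = 0  # sum of deltas
    W = 0  # sum of index * delta
    for i, delta in updates:
        S += delta
        W += i * delta
    if S == 0:
        return None            # empty (or not 1-sparse) vector
    return (W // S, S)         # the surviving index and its count

# Insertions and deletions cancel out, leaving only index 42 with count 3.
updates = [(7, 2), (42, 5), (7, -2), (42, -2)]
print(one_sparse_recovery(updates))  # (42, 3)
```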
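A toy Johnson-Lindenstrauss projection (17 December): multiplication by a random Gaussian matrix scaled by 1/sqrt(k); with k on the order of eps^{-2} log n, all pairwise distances among n points are preserved up to a factor 1 +/- eps with high probability.

```python
import numpy as np

def jl_project(points, k, seed=0):
    """Projects rows of `points` from d dimensions down to k dimensions
    using a scaled random Gaussian matrix."""
    rng = np.random.default_rng(seed)
    d = points.shape[1]
    G = rng.standard_normal((d, k)) / np.sqrt(k)
    return points @ G

pts = np.random.default_rng(1).standard_normal((5, 10_000))
proj = jl_project(pts, k=500)
# Compare one pairwise distance before and after the projection.
print(np.linalg.norm(pts[0] - pts[1]), np.linalg.norm(proj[0] - proj[1]))
```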
Literature
- [AC]: Lecture Notes on Data Stream Algorithms by Amit Chakrabarti, Dartmouth College
- [SSBD]: Graham Cormode and Ke Yi: Small Summaries for Big Data, Cambridge University Press, 2020 – this book provides a nice overview of almost all algorithms that we will discuss (it's not lecture notes, though, so some of the theoretical analyses are missing); a draft version is publicly available and a couple of copies are available in the MFF library.
- [JN]: Lecture notes by Jelani Nelson