Streaming Algorithms for Big Data, winter semester 25/26
NTIN114, 2/0
Wednesdays at 12:20 in S321 (corridor opposite to S3)
Webpage for a course on streaming algorithms and data sketches, taught by Pavel Veselý. If this time is not suitable for you but you'd like to attend, or if you have any questions, drop me an email at vesely (guess what) iuuk.mff.cuni.cz (you can also find me in office S326). The content will be similar to that of 1.5 years ago or 3.5 years ago, but extended.
If you attended NDMI025 Randomized Algorithms by Prof. Sgall, you have already seen some basic streaming algorithms. This course will have only a small overlap with it (the median trick, CountSketch, the Morris counter), and more advanced/practical algorithms will be presented, e.g., for counting distinct elements or finding frequent items. If you didn't attend Randomized Algorithms, we will recap the necessary parts for you.
Brief introduction: The streaming model of computation provides a mathematically clean and practically relevant model of processing massive amounts of data (so-called “Big Data”). A streaming algorithm must process the input stream (i.e., a sequence of updates) with substantially limited memory, sublinear in the data size, ideally (poly)logarithmic. Most often, just a single pass over the input is allowed. Given these restrictions, the goal is to compute an approximate answer to a given problem, with as good an accuracy-space trade-off as possible. The central question underlying streaming computation is how to summarize a massive input into a small data structure, called a sketch or summary, so that properties of the input are approximately preserved.
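As a first illustration of such a summary, here is a toy Python version of the Morris counter (one of the basic algorithms mentioned below): it maintains an approximate count of stream items while storing only a single small exponent, i.e., roughly log log n bits. This is a minimal sketch for intuition, not the precise variant analyzed in the course.

```python
import random

class MorrisCounter:
    """Approximate counter storing only an exponent x; the count is estimated as 2^x - 1."""

    def __init__(self):
        self.x = 0

    def update(self):
        # Increment the stored exponent with probability 2^{-x}.
        if random.random() < 2.0 ** (-self.x):
            self.x += 1

    def estimate(self):
        return 2.0 ** self.x - 1

# Example: process one million stream items.
c = MorrisCounter()
for _ in range(1_000_000):
    c.update()
print(c.estimate())  # unbiased estimate of 1,000,000; a single counter has high
                     # variance, which averaging and the median trick reduce
```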
We will cover basic streaming problems (approximate counting, finding frequent elements, frequency moments, counting distinct elements, and estimating quantiles) as well as slightly more advanced topics such as geometric coresets, graph streams (e.g., testing connectivity, maximum matchings, triangle counting), clustering ... we'll see what time allows. We will show how to prove (unconditional) lower bounds on the space needed to solve a given problem, using tools from communication complexity. We will also look into recent trends in research on streaming algorithms, such as robustness to adversarial attacks or mergeable summaries for distributed datasets.
Prerequisites: the lecture is primarily targeted at master's and PhD students; however, bachelor students are also welcome. In most lectures, we will formally analyze a streaming algorithm for a particular problem, which will typically be randomized, so we will need basic tools from probability theory such as random variables, the Markov, Chebyshev, and Chernoff bounds, and hash functions. These tools will be reviewed (most often without proofs, as these are covered in other courses).
Credit requirements: Depending on the number of participants and their preference, the credit will be given for presenting a recent research paper or for passing an oral exam (details to be provided in the first lecture).
Lectures
Links point to chapters/units in the literature listed below.

| Date | Topics |
| --- | --- |
| 1 October | Intro to streaming algorithms [AC unit 1]; Reservoir sampling and its use for estimating frequencies [SSBD 2.2]; the (deterministic) Misra-Gries algorithm for finding frequent elements and frequency estimation [SSBD 3.2] (a toy Misra-Gries implementation is sketched below the table). |
| 8 October | Cancelled due to a research visit |
| 15 October | Cardinality estimation, that is, approximately counting the number of distinct elements: The KMV sketch (k minimum values of a hash function): analysis of the estimator according to notes by Lee Rhodes (unbiased and tight variance bound) and briefly also according to [SSBD 3.3] (simpler, using the Chebyshev inequality). Using KMV to estimate the Jaccard index as well as the sizes of the union and intersection of two sets. The Flajolet-Martin idea (loglog counting) and HyperLogLog (without analysis). (A toy KMV implementation is sketched below the table.) |
| 22 October | Linear sketches: intro, Count-Min Sketch and CountSketch for frequency estimation [AC 5, SSBD 3.4-3.5], estimating the second frequency moment F_2 using CountSketch with 4-wise independent hash functions (a.k.a. the Fast AMS sketch) [SSBD 3.6], without the analysis of the variance (the AMS sketch, which is CountSketch with single-cell rows, is described in [AC 7], including the analysis of the variance). (A toy Count-Min sketch is implemented below the table.) |
| 29 October | Plan: substitute lecture by Tomáš Domes: Quantile/rank estimation in data streams (or: cumulative distribution function approximation) using the KLL sketch [SSBD 4.3]. |
| 5 November | Finished the analysis of the Fast AMS sketch for the second frequency moment [SSBD 3.6]. A remark on the connection to dimension reduction and the Johnson-Lindenstrauss lemma. Indyk's algorithm for ell_p norms for p in (0, 2] using p-stable distributions, without a detailed analysis [AC 8, SSBD 3.7] (a toy ell_1 version is sketched below the table). A note about the case p > 2. |
| 12 November | We decided to have a lecture despite the Dean's sports day (with more exercises :-)). (If the corridor is busy due to MASO, we will meet in S10.) Sparse recovery [AC 9, SSBD 3.8] and distinct sampling (a.k.a. ell_0 sampling) [AC 10, SSBD 3.9]. Invertible Bloom Lookup Tables (IBLTs). (A toy 1-sparse recovery routine is sketched below the table.) |
| 19 November | Graph sketch: Finding connected components (and a spanning forest) in dynamic graph streams using distinct sampling [AC 16, SSBD 7.1]. Streaming lower bounds using communication complexity: the problems of Equality, Index, and their deterministic and randomized (one-way) communication complexities [AC 18]. |
| 26 November | Streaming lower bounds using communication complexity [AC 18-19, SSBD 10]: overview of the main communication problems (Equality, Index, Disjointness, Gap-Hamming, Boolean Hidden Matching, Augmented Index) and their main applications (a constant approximation of set cardinality requires randomness, lower bounds for frequency estimation, counting the number of components or triangles). Gap cycle counting (a lower bound from Boolean Hidden (Hyper)-Matching) [paper by Verbin & Yu]. Also a discussion of the need for discrete inputs in the dynamic setting. Spanners for distance estimation [AC 14.4, SSBD 7.2]. |
| 3 December | Geometric and metric streams: coresets for the diameter and the minimum enclosing ball (MEB): a sqrt(2)+epsilon approximation in high dimension based on a recent paper (also a sketch of a 1+epsilon approximation in low dimension). Clustering: main variants and a (2+epsilon)-coreset based on threshold testing for k-center [AC 13, SSBD 5.4]. Brief discussion of different settings: insertion-only vs. dynamic, low- vs. high-dimensional. |
| 10 December | Student presentations: J.K.: An Optimal Algorithm for Triangle Counting in the Stream by Rajesh Jayaram and John Kallaugher; J.K.: Simple & Optimal Quantile Sketch: Combining Greenwald-Khanna with Khanna-Greenwald by Elena Gribelyuk, Pachara Sawettamalya, Hongxun Wu, and Huacheng Yu. |
| 17 December | Dimensionality reduction: the Johnson-Lindenstrauss transform and a sketch of its proof by Dasgupta and Gupta. Also an outline of variants of the JL transform, including Sparse JL (similar to CountSketch). A simple logarithmic lower bound based on a volume argument and a simple application to a lower bound for estimating the diameter in high dimension. A brief discussion of metric embeddings, impossibility results, and the FRT tree embedding of any metric. Partly based on lecture notes by Jelani Nelson [JN] (a part of Chapter 5). (A toy JL projection is sketched below the table.) |
| 7 January | Plan: adversarially robust streaming algorithms. |
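The following toy implementations illustrate some of the sketches from the table above. They are minimal Python sketches written for intuition, not the exact variants analyzed in the lectures; in particular, they use Python's built-in hashing and random numbers instead of the limited-independence hash families assumed in the analyses.

A Misra-Gries summary with at most k-1 counters (1 October): every element occurring more than n/k times in a stream of length n is kept, and each counter underestimates the true frequency by at most n/k.

```python
def misra_gries(stream, k):
    """Misra-Gries frequent-elements summary with at most k-1 counters."""
    counters = {}
    for x in stream:
        if x in counters:
            counters[x] += 1
        elif len(counters) < k - 1:
            counters[x] = 1
        else:
            # Decrement all counters and drop those that reach zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters

print(misra_gries(["a", "b", "a", "c", "a", "b", "a"], k=3))
```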
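A toy KMV (k minimum values) estimator of the number of distinct elements (15 October): hash every element to [0, 1), keep the k smallest distinct hash values, and output (k - 1)/v_k, where v_k is the k-th smallest value. The "hash function" built from blake2b below is just a stand-in for illustration.

```python
import hashlib

def kmv_estimate(stream, k):
    """Toy KMV sketch; assumes the stream has more than k distinct elements."""
    def h(x):
        digest = hashlib.blake2b(str(x).encode(), digest_size=8).digest()
        return int.from_bytes(digest, "big") / 2**64  # pseudo-uniform value in [0, 1)

    smallest = set()
    for x in stream:
        smallest.add(h(x))                     # duplicates hash to the same value
        if len(smallest) > k:
            smallest.remove(max(smallest))     # keep only the k smallest hash values
    assert len(smallest) == k, "needs more than k distinct elements"
    return (k - 1) / max(smallest)

stream = [i % 1000 for i in range(100_000)]    # 1000 distinct elements
print(kmv_estimate(stream, k=64))              # roughly 1000
```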
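A toy Count-Min sketch for frequency estimation (22 October): d rows of w counters, each row with its own hash function; an estimate never undercounts and, in expectation, overcounts by at most (total count)/w per row, so the minimum over d rows is accurate with high probability. Python's salted built-in hash here only stands in for the pairwise-independent families used in the analysis.

```python
import random

class CountMinSketch:
    """Count-Min sketch with d rows of w counters."""

    def __init__(self, w, d, seed=0):
        self.w, self.d = w, d
        rng = random.Random(seed)
        self.salts = [rng.getrandbits(64) for _ in range(d)]
        self.table = [[0] * w for _ in range(d)]

    def _bucket(self, row, x):
        return hash((self.salts[row], x)) % self.w

    def update(self, x, count=1):
        for r in range(self.d):
            self.table[r][self._bucket(r, x)] += count

    def query(self, x):
        # Never an undercount; the minimum over rows limits the overcount.
        return min(self.table[r][self._bucket(r, x)] for r in range(self.d))

cms = CountMinSketch(w=200, d=5)
for i in range(10_000):
    cms.update(i % 100)          # each of 100 items appears 100 times
print(cms.query(7))              # at least 100, typically very close to 100
```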
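A toy version of Indyk's p-stable sketch for p = 1 (5 November): project the frequency vector onto i.i.d. standard Cauchy vectors (which are 1-stable) and take the median of the absolute values of the projections. Storing the dense projection matrix explicitly is only for illustration; a real sketch would generate its entries from hash functions.

```python
import numpy as np

def l1_norm_estimate(updates, d, k=200, seed=0):
    """Estimates the ell_1 norm of a frequency vector over [d] given as a
    stream of (index, delta) updates, via k Cauchy projections."""
    rng = np.random.default_rng(seed)
    C = rng.standard_cauchy((k, d))
    sketch = np.zeros(k)
    for i, delta in updates:
        sketch += delta * C[:, i]   # linear sketch: handles insertions and deletions
    # By 1-stability, each coordinate is distributed as ||f||_1 times a Cauchy
    # variable, and the median of |Cauchy| equals 1.
    return np.median(np.abs(sketch))

updates = [(i % 50, 1) for i in range(5000)] + [(3, -60)]
# Frequencies: 100 on each index except f_3 = 40, so ||f||_1 = 4940.
print(l1_norm_estimate(updates, d=50))
```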
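The simplest building block of sparse recovery (12 November): recovering the unique nonzero coordinate of a 1-sparse frequency vector from just two counters. (The fingerprint test for detecting that the vector is in fact not 1-sparse is omitted here.)

```python
def one_sparse_recovery(updates):
    """Recovers the single (index, count) pair from a dynamic stream of
    (index, delta) updates, assuming the final frequency vector is 1-sparse."""
    S = 0  # sum of deltas
    W = 0  # sum of index * delta
    for i, delta in updates:
        S += delta
        W += i * delta
    if S == 0:
        return None            # empty (or not 1-sparse) vector
    return (W // S, S)         # the surviving index and its count

# Insertions and deletions cancel out, leaving only index 42 with count 3.
updates = [(7, 2), (42, 5), (7, -2), (42, -2)]
print(one_sparse_recovery(updates))  # (42, 3)
```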
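A toy Johnson-Lindenstrauss projection (17 December): multiplication by a random Gaussian matrix scaled by 1/sqrt(k); with k on the order of eps^{-2} log n, all pairwise distances among n points are preserved up to a factor 1 +/- eps with high probability.

```python
import numpy as np

def jl_project(points, k, seed=0):
    """Projects rows of `points` from d dimensions down to k dimensions
    using a scaled random Gaussian matrix."""
    rng = np.random.default_rng(seed)
    d = points.shape[1]
    G = rng.standard_normal((d, k)) / np.sqrt(k)
    return points @ G

pts = np.random.default_rng(1).standard_normal((5, 10_000))
proj = jl_project(pts, k=500)
# Compare one pairwise distance before and after the projection.
print(np.linalg.norm(pts[0] - pts[1]), np.linalg.norm(proj[0] - proj[1]))
```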
Literature
- [AC]: Lecture Notes on Data Stream Algorithms by Amit Chakrabarti, Dartmouth College
- [SSBD]: Graham Cormode and Ke Yi: Small Summaries for Big Data, Cambridge University Press, 2020 – this book provides a nice overview of almost all algorithms that we will discuss (it's not lecture notes, though, so some of the theoretical analyses are missing); a draft version is publicly available and a couple of copies are available in the MFF library.
- [JN]: Lecture notes by Jelani Nelson