Streaming Algorithms for Big Data, summer semester 23/24

NTIN114, 2/0

Webpage for a course on streaming algorithms, led by Pavel Veselý. Scheduled on Mondays at 14:00 in S326 (inside corridor opposite to S3). The first lecture will take place on February 26. If this time is not suitable for you, but you'd like to attend, or if you have any questions, drop me an email at vesely (guess what) iuuk.mff.cuni.cz (you can also find me in office S326). Content will be similar (but not the same) as two years ago.

Brief introduction: The streaming model of computation provides a mathematically clean and practically relevant model of processing massive amounts of data (so called “Big Data”). A streaming algorithm must process the input stream (i.e. a sequence of updates) with substantially limited memory, sublinear in the data size, ideally (poly)logarithmic. Most often, just a single pass over the input is allowed. Given these restrictions, the goal is to compute an approximate answer to a given problem, with as good accuracy-space trade-off as possible. The particular question underlying streaming computation is how to summarize a massive input into a small data structure, called sketch or summary, so that properties of the input are approximately preserved.

We will cover basic streaming problems (approximate counting, finding frequent elements, frequency moments, counting distinct elements, and estimating quantiles) as well as slightly more advanced topics such as geometric coresets, graph streams (e.g. testing connectivity, maximum matchings, triangle counting), clustering ... we'll see what time allows. We will show how to prove (unconditional) lower bounds on the space needed to solve a given problem, using the tools from communication complexity. We will also look into recent trends in research on streaming algorithms, such as robustness to adversarial attacks. Finally, we

Prerequisites: the lecture is primarily targeted at master and PhD students, however, bachelor students are also welcome. In most lectures, we will formally analyze a streaming algorithm for a particular problem, which will be typically randomized, and so we will need basic tools from the probability theory like random variables, Markov, Chebyshev, and Chernoff bounds, and hash functions. These tools will be reviewed (most often without proofs, which are covered by other courses).

Lectures

Links to chapters/units in literature below.
26 February Intro to streaming algorithms [AC unit 1]; Reservoir sampling and its use for estimating frequencies [SSBD 2.2]; the (deterministic) Misra-Gries algorithm for finding frequent elements and frequency estimation [SSBD 3.2].
4 March Estimating the number of distinct elements: The KMV sketch (k minimum values of hash function), together with the median trick, 2-wise indepedent hash functions, and possible uses for computing union, intersection, etc. [SSBD 2.5]. Counting distinct elements using the Flajolet-Martin idea (loglog counting) and BJKST algorithm [AC 3-4].
11 March Morris algorithm for approximate counting in o(log m) space [AC 4, SSBD 2.1]. Linear sketches: intro, CountSketch and Count-Min Sketch for frequency estimation [AC 5, SSBD 3.4-3.5]
18 March Frequency moments estimation: the Tug-of-War sketch for F_2 / Euclidean norm / inner products and its relation to CountSketch and geometric interpretation as Dimensionality reduction [AC 7, SSBD 3.6], ell_p norms for p in (0, 2] using stable distributions, without detailed analysis [AC 8, SSBD 3.7]
25 March Sparse recovery [AC 9, SSBD 3.8], distinct sampling (a.k.a ell_0 sampling) [AC 10, SSBD 3.9], and finding connected components in dynamic graph streams [AC 16, SSBD 7.1].
8 April Quantile/rank estimation in data streams (or: cummulative distribution function approximation) using the KLL sketch [SSBD 4.3].

Literature