Streaming Algorithms for Big Data, summer semester 25/26
NTIN114, 2/0
Webpage for a course on streaming algorithms and data sketches, taught by Pavel Veselý. Time and place will be scheduled during "Úmluva" (please vote there once it starts). If this time is not suitable for you but you'd like to attend, or if you have any questions, drop me an email at vesely (guess what) iuuk.mff.cuni.cz (you can also find me in office S326). The content will be similar to that of 1.5 years ago or 3.5 years ago, but extended.
If you attended NDMI025 Randomized Algorithms by Prof. Sgall, you have already seen some basic streaming algorithms. This course will have only a small overlap with it (the median trick, CountSketch, the Morris counter), but we will present more advanced and practical algorithms, e.g., for counting distinct elements or finding frequent items. If you didn't attend Randomized Algorithms, we will recap the necessary parts for you.
Brief introduction: The streaming model of computation provides a mathematically clean and practically relevant model for processing massive amounts of data (so-called “Big Data”). A streaming algorithm must process the input stream (i.e., a sequence of updates) with substantially limited memory, sublinear in the data size, ideally (poly)logarithmic. Most often, just a single pass over the input is allowed. Given these restrictions, the goal is to compute an approximate answer to a given problem, with as good an accuracy-space trade-off as possible. The central question underlying streaming computation is how to summarize a massive input into a small data structure, called a sketch or summary, so that properties of the input are approximately preserved.
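To give a flavour of the model, here is a minimal illustrative Python sketch (not taken from the lecture notes) of the Morris counter mentioned above: it approximately counts stream updates while storing only a small exponent rather than the full count.

```python
import random

class MorrisCounter:
    """Approximate counter: stores only an exponent X and estimates
    the true count n as 2^X - 1 (E[2^X] = n + 1 after n updates)."""

    def __init__(self):
        self.x = 0  # the stored exponent

    def update(self):
        # Increment X with probability 2^(-X); this keeps the estimate unbiased.
        if random.random() < 2.0 ** (-self.x):
            self.x += 1

    def estimate(self):
        return 2 ** self.x - 1

# Example: process a stream of one million updates.
counter = MorrisCounter()
for _ in range(1_000_000):
    counter.update()
print(counter.estimate())  # roughly 10^6 in expectation, with constant relative variance
```

The lecture will show how to reduce the variance of such estimators, e.g., by averaging independent copies and applying the median trick.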
We will cover basic streaming problems (approximate counting, finding frequent elements, frequency moments, counting distinct elements, and estimating quantiles) as well as slightly more advanced topics such as geometric coresets, graph streams (e.g., testing connectivity, maximum matchings, triangle counting), and clustering (we'll see what time allows). We will show how to prove (unconditional) lower bounds on the space needed to solve a given problem, using tools from communication complexity. We will also look into recent trends in research on streaming algorithms, such as robustness to adversarial attacks or mergeable summaries for distributed datasets.
Prerequisites: The lecture is primarily targeted at master's and PhD students; however, bachelor students are also welcome. In most lectures, we will formally analyze a streaming algorithm for a particular problem. The algorithm will typically be randomized, so we will need basic tools from probability theory such as random variables, the Markov, Chebyshev, and Chernoff bounds, and hash functions. These tools will be reviewed (most often without proofs, which are covered in other courses).
Credit requirements: Depending on the number of participants and their preferences, the credit will be given either for presenting a recent research paper or for passing an oral exam (details will be provided in the first lecture).
Literature
- [AC]: Lecture Notes on Data Stream Algorithms by Amit Chakrabarti, Dartmouth College
- [SSBD]: Graham Cormode and Ke Yi: Small Summaries for Big Data, Cambridge University Press, 2020 – this book provides a nice overview of almost all algorithms that we will discuss (it is not lecture notes, though, so some of the theoretical analyses are missing); a draft version is publicly available, and a couple of copies are available in the MFF library.
- [JN]: Lecture notes by Jelani Nelson