Data compression and compressed data structures
Lecturers: Giovanni Manzini, Paolo Ferragina. Computer Science Dept. University of Pisa.
Tentative lesson plan:
- Statement of the Data compression problem. Main ideas beyond compression techniques: symbol substitution, phrase encoding, data transformations. Kraft inequality (16/5)
- Consequences of Kraft Inequality. Order-0 Entropy. Optimal codes, Huffman codes. Oder-k entropy PPM algorithm. LZ77 parsing, definition, bounds in terms of the order-k entropy (17/5)
- Burrows-Wheeler transform, relation with k-order entropy. Compression boosting (18/5)
- Beyond entropy: others measures of compressibility (25/5)
- Introduction to compressed indices and Wheeler Graphs (26/5)
- Compressed indices for highly repetitive collections (r-index) (27/5)
- Predecessor search: models of computation and various contexts of application. Binary Search. The case of bounded universe: x-fast trie and y-fast trie. The case of uniform key distributions: interpolation search with complexity evaluation. (30/5)
- Elias-fano coding with constant time access. Learned data structures (31/5)
All lectures are 9am-11am in Sala Seminari EST (Room 351). Streaming will be available on Google Meet: email me for the link. Lectures will not be recorded.
There are two possible exam formats:
-
students can present a problem/idea related to their research and show how it can take advantage of the techniques described in the course, possibly including a prototype of the suggested solution
-
students can present a related topic not covered in the course using materials provided by the instructors
Study material
-
Book: Cover, Thomas. Elements of Information Theory, Chapter 5 covers Kraft inequality and Huffman Coding.
-
Paper: LZ parsing and entropy. In-depth analysis of LZ77 and LZ78 algorithms, a little bit technical.
-
Slides and full paper on Compression boosting with the BWT
-
Paper on Repetitiveness Measures. Note that the proof of inequality $\gamma \leq b$ at page 19 is incorrect: to build an attractor you need to take the first position of each phrase.
-
Slides on BWT-based indices
-
Slides on Wheeler Graphs/Automata
-
Slides on x-fast and y-fast tries
-
Book on algorithm engineering. The relevant sections are: the Prologue, Sect. 9.2 (interpolation search), Sect. 11.6 (Elias-Fano).
Students’ presentations
Name | Topic | Date |
---|---|---|
Antonio Boffa | Compressing ultra-large source code datasets using locality-sensitive hashing | 29-07 |
Cosimo Rulli | Neural networks compression | 14-07 |
Daniele Atzeni, Francesco Tosoni | Neural networks compression for edge devices | 29-07 |
Federica Baccini | Graph Compression | 27-07 |
Lorenzo Catania | Compression of genomic datasets via INR | 26-07 |
Domenico Tortola | On the compression of blockchain | 27-07 |
Luca Corbucci, Rudy Semola | Neural network-based compression | 28-07 |
Alessandro Bocci | EEG compression | 28-07 |
Andrea Guerra | A heuristic for the packaging of source code in Software Heritage | 29-07 |
Domenico Tortorella |