Data compression and compressed data structures
Lecturers: Giovanni Manzini, Paolo Ferragina. Computer Science Dept. University of Pisa.
Tentative lesson plan:
- Statement of the Data compression problem. Main ideas beyond compression techniques: symbol substitution, phrase encoding, data transformations. Kraft inequality (16/5)
- Consequences of Kraft Inequality. Order-0 Entropy. Optimal codes, Huffman codes. Oder-k entropy PPM algorithm. LZ77 parsing, definition, bounds in terms of the order-k entropy (17/5)
- Burrows-Wheeler transform, relation with k-order entropy. Compression boosting (18/5)
- Beyond entropy: others measures of compressibility (25/5)
- Introduction to compressed indices and Wheeler Graphs (26/5)
- Compressed indices for highly repetitive collections (r-index) (27/5)
- Predecessor search: models of computation and various contexts of application. Binary Search. The case of bounded universe: x-fast trie and y-fast trie. The case of uniform key distributions: interpolation search with complexity evaluation. (30/5)
- Elias-fano coding with constant time access. Learned data structures (31/5)
All lectures are 9am-11am in Sala Seminari EST (Room 351). Streaming will be available on Google Meet: email me for the link. Lectures will not be recorded.
There are two possible exam formats:
students can present a problem/idea related to their research and show how it can take advantage of the techniques described in the course, possibly including a prototype of the suggested solution
students can present a related topic not covered in the course using materials provided by the instructors
Book: Cover, Thomas. Elements of Information Theory, Chapter 5 covers Kraft inequality and Huffman Coding.
Paper: LZ parsing and entropy. In-depth analysis of LZ77 and LZ78 algorithms, a little bit technical.
Slides and full paper on Compression boosting with the BWT
Paper on Repetitiveness Measures. Note that the proof of inequality $\gamma \leq b$ at page 19 is incorrect: to build an attractor you need to take the first position of each phrase.
Slides on BWT-based indices
Slides on Wheeler Graphs/Automata
Slides on x-fast and y-fast tries
Book on algorithm engineering. The relevant sections are: the Prologue, Sect. 9.2 (interpolation search), Sect. 11.6 (Elias-Fano).
|Antonio Boffa||Compressing ultra-large source code datasets using locality-sensitive hashing||29-07|
|Cosimo Rulli||Neural networks compression||14-07|
|Daniele Atzeni, Francesco Tosoni||Neural networks compression for edge devices||29-07|
|Federica Baccini||Graph Compression||27-07|
|Lorenzo Catania||Compression of genomic datasets via INR||26-07|
|Domenico Tortola||On the compression of blockchain||27-07|
|Luca Corbucci, Rudy Semola||Neural network-based compression||28-07|
|Alessandro Bocci||EEG compression||28-07|
|Andrea Guerra||A heuristic for the packaging of source code in Software Heritage||29-07|