Data compression and compressed data structures
Lecturers: Giovanni Manzini, Paolo Ferragina. Computer Science Dept. University of Pisa.
Tentative lesson plan:
 Statement of the Data compression problem. Main ideas beyond compression techniques: symbol substitution, phrase encoding, data transformations. Kraft inequality (16/5)
 Consequences of Kraft Inequality. Order0 Entropy. Optimal codes, Huffman codes. Oderk entropy PPM algorithm. LZ77 parsing, definition, bounds in terms of the orderk entropy (17/5)
 BurrowsWheeler transform, relation with korder entropy. Compression boosting (18/5)
 Beyond entropy: others measures of compressibility (25/5)
 Introduction to compressed indices and Wheeler Graphs (26/5)
 Compressed indices for highly repetitive collections (rindex) (27/5)
 Predecessor search: models of computation and various contexts of application. Binary Search. The case of bounded universe: xfast trie and yfast trie. The case of uniform key distributions: interpolation search with complexity evaluation. (30/5)
 Eliasfano coding with constant time access. Learned data structures (31/5)
All lectures are 9am11am in Sala Seminari EST (Room 351). Streaming will be available on Google Meet: email me for the link. Lectures will not be recorded.
There are two possible exam formats:

students can present a problem/idea related to their research and show how it can take advantage of the techniques described in the course, possibly including a prototype of the suggested solution

students can present a related topic not covered in the course using materials provided by the instructors
Study material

Book: Cover, Thomas. Elements of Information Theory, Chapter 5 covers Kraft inequality and Huffman Coding.

Paper: LZ parsing and entropy. Indepth analysis of LZ77 and LZ78 algorithms, a little bit technical.

Slides and full paper on Compression boosting with the BWT

Paper on Repetitiveness Measures. Note that the proof of inequality $\gamma \leq b$ at page 19 is incorrect: to build an attractor you need to take the first position of each phrase.

Slides on BWTbased indices

Slides on Wheeler Graphs/Automata

Slides on xfast and yfast tries

Book on algorithm engineering. The relevant sections are: the Prologue, Sect. 9.2 (interpolation search), Sect. 11.6 (EliasFano).
Students’ presentations
Name  Topic  Date 

Antonio Boffa  Compressing ultralarge source code datasets using localitysensitive hashing  2907 
Cosimo Rulli  Neural networks compression  1407 
Daniele Atzeni, Francesco Tosoni  Neural networks compression for edge devices  2907 
Federica Baccini  Graph Compression  2707 
Lorenzo Catania  Compression of genomic datasets via INR  2607 
Domenico Tortola  On the compression of blockchain  2707 
Luca Corbucci, Rudy Semola  Neural networkbased compression  2807 
Alessandro Bocci  EEG compression  2807 
Andrea Guerra  A heuristic for the packaging of source code in Software Heritage  2907 
Domenico Tortorella 