Data compression and compressed data structures

Lecturer: Giovanni Manzini, Computer Science Dept. University of Pisa.

Tentative lesson plan:

Statement of the lossless data compression problem. Kraft inequality. Order-0 Entropy. Optimal codes, Huffman codes and arithmetic coding (22/11).
Beyond order-0 Entropy: MTF + gamma codes. Order-k entropy. PPM algorithms (24/11).
LZ77 parsing: definition, bounds in terms of the order-k entropy. Introduction to the Burrows-Wheeler transform (27/11).
Compression bounds for the BWT. Introduction to compressed indices (29/11).
Wavelet Trees for efficient rank operations. Rank/Select operations on bitvectors (4/12).
Practical bitvector implementations. Compressed bitvectors (RRR representation). Finding the position of the occurrences in a BWT index: uniform sampling and the r-index (6/12).
Locate in the r-index (end). Subpath queries in Graph: conditional lower bound. (13/12).
Wheeler Graphs.(15/12)

All lectures are 9am-11am in Sala Seminari EST (Room 351). There will be no streaming or recording of the lectures.

There are two possible exam formats:

Students can present a problem/idea related to their research and show how it can take advantage of the techniques described in the course; this must include a working prototype of the suggested solution.
Students can present a related to the course using materials provided by the instructor.
All the exams should be given no later than 31/5/2024.

Study material

Book: Cover, Thomas. Elements of Information Theory, Chapter 5 covers Kraft inequality and Huffman Coding.
Paper on MTF encoding
Paper: LZ parsing and entropy. In-depth analysis of LZ77 and LZ78 algorithms, a little bit technical.
Slides: introduction to the BWT
Slides: searching using the BWT
Slides: introduction to Wavelet Trees
Paper: Optimized succinct data structures for massive data. Bitvector implementations from the authors of the sdsl library.
Paper: FM-index for dummies. Description and experiments of a simple BWT-index.
Slides: space efficient locate with the r-index
Slides: introduction to Wheeler Graphs
Paper: Wheeler Graphs
Paper: Subpath Queries on Compressed Graphs

Possible exam material

Some topics can be studied by two people that will prepare a joint presentation. Regardless of the topic, each student should give a 30 minute presentation describing: 1) the problem, 2) the previous state of the art, 3) the content of the new contribution. Tentative list of topics:

Asymmetric Numeral Systems: the evolution of arithmetic coding (MR + SM 7/2/24)
Analysis of a Suffix Array construction algorithm
Parametrized pattern matching: the latest solution and some background material (LL)
Applications of Wavelet Trees (AJ)
Wavelet matrix: an alternative layout for Wavelet Trees (LG 5/6/24)
Tunneling on BWTs and Wheeler Graphs (TW + MM)
Range Minumum and other queries (2 people)
An index based on LZ-parsing (LS+THGP 29/5/24)
Universal index based on the concept of attractors (2 people)
Grammar compression and indexing (GT+FA 1/3/24)
Succinct Trees (GC+FL 1/3/24)
Graph/automata indexing full study of the deterministic case (GB + LC + MO)

All papers should accessible using the department account: contact me if you have any problem.