Data compression and compressed data structures
Lecturer: Giovanni Manzini, Computer Science Dept. University of Pisa.
Tentative lesson plan:
- Statement of the lossless data compression problem. Kraft inequality. Order-0 Entropy. Optimal codes, Huffman codes and arithmetic coding (22/11).
- Beyond order-0 Entropy: MTF + gamma codes. Order-k entropy. PPM algorithms (24/11).
- LZ77 parsing: definition, bounds in terms of the order-k entropy. Introduction to the Burrows-Wheeler transform (27/11).
- Compression bounds for the BWT. Introduction to compressed indices (29/11).
- Wavelet Trees for efficient rank operations. Bitvectors supporting Rank/Select operations (4/12).
- From compressed bitvectors to compressed indices. Finding the position of the occurrences. Introduction to Wheeler graphs (6/12)
- (13/12)
- (15/12)
All lectures are 9am-11am in Sala Seminari EST (Room 351). There will be no streaming or recording of the lectures.
There are two possible exam formats:
-
Students can present a problem/idea related to their research and show how it can take advantage of the techniques described in the course; this must include a working prototype of the suggested solution.
-
Students can present a related to the course using materials provided by the instructor.
-
All the exams should be given no later than 31/5/2024.
Study material
-
Book: Cover, Thomas. Elements of Information Theory, Chapter 5 covers Kraft inequality and Huffman Coding.
-
Paper: LZ parsing and entropy. In-depth analysis of LZ77 and LZ78 algorithms, a little bit technical.
Possible exam material
Some topics can be studied by two people that will later do a joint presentation. Regardless of the topic, each student should do a 30 minute presentation describing: 1) the problem, 2) the previous state of the art, 3) the content of the new contribution. Tentative list of topics (paper links will be added soon):
- Asymmetric Numeral Systems
- Suffix Array Construction
- Parametrized pattern matching
- Wavelet matrix
- Tunneling on BWTs and Wheeler Graphs
- Range Minumum and related Queries (2 people)
- LZ-index (2 people)
- Grammar compression and indexing (2 people)