Professor Bill Smyth, McMaster University, Canada
Succinct data structures for strings and sequences
Professor Roberto Grossi, University of Pisa, Italy
The talk describes some problems that arise in industrial applications and how succinct and compressed data structures for strings and sequences can help using little space. The first part of the talk is devoted to semi-indexing semi-structured data, e.g. query log data in JSON format. The second part involves prefix searching on a set of strings which is helpful e.g. in search engines, and describes some succinct data structures for storing a set of strings. The third part of the talk considers a sequence of strings (not to be confused with a set of strings) and introduces the wavelet trie for storing textual data (e.g. URLs) in time order and supporting extra functionalities that exploit this order. The discussed results are joint works with Giuseppe Ottaviano presented at ACM CIKM 2011, ACM-SIAM ALENEX 2012, and ACM PODS 2012.
Algorithms on grammar-compressed strings
Professor Gad Landau, University of Haifa, Israel and NYU-Poly, USA
Grammar based compression, where one replaces a long string by a small context-free grammar that generates the string, is a simple and powerful paradigm that captures many of the popular compression schemes, including the Lempel-Ziv family, Run-Length Encoding, Byte-Pair Encoding, Sequitur and Re-Pair.
Let S be a string of length N given as a grammar G(S) of size n. The random access problem is to compactly represent G(S) while supporting fast random access queries. That is, given an index i, report S[i] without decompressing S. We will first present a linear space representation of G(S) that supports O (log N) random access time. This representation extends to efficiently support substring decompression. Namely, we can decompress any substring S[i]:::S[j] in the same complexity as a random access query and additional O(j- i) time.
Once we obtain an efficient substring decompression method, it can then serve as a basis for a compressed version of classical pattern matching. Namely, we can take any black-box (uncompressed) approximate pattern matching algorithm and turn it into a corresponding algorithm over grammar compressed strings. We will then focus on a specific algorithm for computing the edit distance of two grammar-compressed strings. This algorithm requires O(nN) time and uses the compression in a more complicated way (i.e., not through random access queries).