On searching compressed string collections cache-obliviously
Ferragina, P., Grossi, R., Gupta, A., Shah, R., & Vitter, J. S. Proceedings of the ACM Conference on Principles of Database Systems (PODS), Vancouver, Canada, May 2008.
Abstract: Current data structures for searching large string collections either
fail to achieve minimum space or cause too many cache misses. In this
paper we discuss some edge linearizations of the classic trie data
structure that are simultaneously cache-friendly and compressed. We
provide new insights on front coding [24], introduce other novel
linearizations, and study how close their space occupancy is to the
information-theoretic minimum. The moral is that they are not just
heuristics. Our second contribution is a novel dictionary encoding
scheme that builds upon such linearizations and achieves nearly optimal
space, offers competitive I/O-search time, and is also conscious of the
query distribution. Finally, we combine those data structures with
cache-oblivious tries [2, 5] and obtain a succinct variant whose space
is close to the information-theoretic minimum.
Source: Portal