DISK COMPRESSION OF K-MER SETS

Blog Article

Abstract: K-mer based methods have become prevalent in many areas of bioinformatics. In applications such as database search, they often work with large multi-terabyte-sized datasets. Storing such large datasets is a detriment to tool developers, tool users, and reproducibility efforts.

General-purpose compressors like gzip, or those designed for read data, are sub-optimal because they do not take into account the specific redundancy pattern in k-mer sets. In our earlier work (Rahman and Medvedev, RECOMB 2020), we presented an algorithm, UST-Compress, that uses a spectrum-preserving string set representation to compress a set of k-mers to disk. In this paper, we present two improved methods for disk compression of k-mer sets, called ESS-Compress and ESS-Tip-Compress.

They use a more relaxed notion of string set representation to further remove redundancy from the representation of UST-Compress. We explore their behavior both theoretically and on real data. We show that they improve the compression sizes achieved by UST-Compress by up to 27 percent, across a breadth of datasets.

We also derive lower bounds on how well this type of compression strategy can hope to do.
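
To make the idea of a spectrum-preserving string set more concrete, here is a minimal Python sketch. It assumes the usual definition (a set of strings whose k-mers are exactly the k-mers of the input set, each occurring once) and, as a simplification, ignores reverse complements. The helper names `kmers` and `is_spectrum_preserving` and the toy data are hypothetical and chosen for illustration only; this is not the UST-Compress or ESS-Compress algorithm, just a check of the property those representations build on.

```python
def kmers(s, k):
    """Return all length-k substrings of s, in order of occurrence."""
    return [s[i:i + k] for i in range(len(s) - k + 1)]


def is_spectrum_preserving(strings, kmer_set, k):
    """True if the strings contain exactly the k-mers in kmer_set, each once.

    Simplifying assumption: a k-mer and its reverse complement are treated
    as different k-mers.
    """
    seen = [km for s in strings for km in kmers(s, k)]
    return len(seen) == len(set(seen)) and set(seen) == set(kmer_set)


# Toy input: four 3-mers, three of which overlap by k-1 characters.
k = 3
kmer_set = {"ACG", "CGT", "GTA", "TTC"}

# Candidate representation: overlapping k-mers are merged into one string.
strings = ["ACGTA", "TTC"]

# Characters needed to list every k-mer separately vs. the merged strings.
naive_chars = sum(len(km) for km in kmer_set)
spss_chars = sum(len(s) for s in strings)

print(is_spectrum_preserving(strings, kmer_set, k))
print(naive_chars, spss_chars)
```

In this toy case the merged strings hold the same k-mers in fewer characters than a plain list, which is the kind of redundancy a general-purpose compressor does not exploit directly.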
