Compression¶
In this tutorial you will learn more about compression in TileDB. It is highly recommended that you read the tutorials on attributes, filtering, and tiling dense/sparse arrays first.
Basic concepts and definitions¶
Attribute-based compression
TileDB allows you to choose a different compression scheme for each attribute by using filter lists. Compression in TileDB always occurs as a separate filter in a filter list.
Compression tradeoff
There is no ideal compressor. Each compressor offers a tradeoff between compression ratio and compression/decompression speed, which highly depends on the array data of your application.
Compressing arrays¶
The data types and values of the array cells most likely vary across attributes. Therefore, it is desirable to allow defining a different compression scheme per attribute, since different compressors may be more effective for different types of data. Compression in TileDB always occurs as a separate filter in a filter list. You specify the compressor in the filter list when defining an attribute upon array creation.
C++
Context ctx;
ArraySchema schema(ctx, TILEDB_DENSE);
auto a = Attribute::create<int>(ctx, "a");
FilterList filters(ctx);
filters.add_filter({ctx, TILEDB_FILTER_GZIP});
a.filters(filters);
schema.add_attribute(a);
Python
attr = tiledb.Attr(name="a", dtype=np.int32,
filters=tiledb.FilterList([tiledb.GzipFilter()]))
schema = tiledb.ArraySchema(attrs=[attr], domain=dom),
sparse=False)
TileDB offers a variety of compressors to choose from (see below). Moreover, you
can adjust the compression level (by setting the TILEDB_COMPRESSION_LEVEL
option on the compression filter, where -1
means “default compression
level”). If a compressor filter is not added to an attribute, TileDB will not
compress that attribute.
You can also specify a different compressor for the coordinates tiles as follows (the default is ZSTD with default compression level):
C++
FilterList coords_filters(ctx);
coords_filters.add_filter({ctx, TILEDB_LZ4});
schema.set_coords_filter_list(coords_filters);
Python
schema = tiledb.ArraySchema(..., coords_filters=tiledb.FilterList([tiledb.LZ4Filter()]))
To maximize the effect of compression on coordinates, TileDB first
unzips the coordinates tuples and groups the coordinates along
the dimensions. For instance, coordinates (1,2)
, (1,3)
, (1,5)
will be laid out as 1 1 1 2 3 5
prior to passing them into
the compressor. This layout allows for the slowly varying coordinate
values to be grouped together, which leads to better compressibility
(as these values will likely be similar).
Recall that there are two data files created for a variable-length attribute; one that stores the actual cell values and one that stores the starting offsets of the cell values in the first file. TileDB allows you to even specify a compressor for the offsets data tiles (the default is ZSTD with default compression level):
C++
FilterList offsets_filters(ctx);
offsets_filters.add_filter({ctx, TILEDB_BZIP2});
schema.set_offsets_filter_list(offsets_filters);
Python
schema = tiledb.ArraySchema(..., offsets_filters=tiledb.FilterList([tiledb.Bzip2Filter()]))
Choosing a compressor¶
TileDB offers a variety of compressors to choose from:
TileDB implements its own version of double-delta compression. It is similar to the one presented in Facebook’s Gorilla system. The difference is that TileDB uses a fixed bitsize for all values (in contrast to Gorilla’s variable bitsize). This makes the implementation a bit simpler, but also allows computing directly on the compressed data (which we are exploring in the future).
TileDB utilizes a blocking technique that divides the data in blocks that are small enough to fit in L1 cache of modern processors and perform compression/decompression there. This reduces the activity on the memory bus and allows leveraging the SIMD capabilities of the processor, thus leading to a performance speed up. TileDB also allows you to apply a shuffle filter before compression, which can result in improved compression ratio.
Choosing the right compressor for your application is quite challenging, as the effectiveness of a compressor heavily depends on the data being compressed. Moreover, each compressor offers a tradeoff between compression ratio and compression/decompression speed. Here are a couple of benchmarks that demonstrate this tradeoff:
What we recommend is to ingest a subset of your data into an array, and test with various different compressors for each of your attributes, in order to determine what compression ratio and speed is satisfactory for your application.
Compression and performance¶
Compression greatly affects performance; compression/decompression impacts the writing/reading speed, whereas the compression ratio influences the read/write I/O time in addition of course to storage consumption. As stated above, the choice of compressor is important for performance, but there is always a tradeoff between compression ratio and speed, which you need to adjust based on your application. Luckily for you, TileDB parallelizes internally both compression and decompression (and filtering in general). However, parallelization takes effect when the data tile to be filtered is large enough. See Introduction to Performance for more information on TileDB performance and how to tune it.