Filters

In this tutorial we will learn how to use TileDB filters. It is recommended to first read the tutorials on sparse arrays, multi-attribute arrays and variable-length attributes.

Full programs

Program

Links

filters

filterscpp filterspy

Basic concepts and definitions

Filter

A filter is used to transform attribute data before it is written to an array. One filter performs one type of data transformation. An example of a filter is the bzip2 compression filter, which transforms data by compressing it with bzip2.

Filter list

A filter list is an ordered list of zero or more filters. Filter lists can be arbitrarily long, and are used to specify an ordered sequence of data transformations that should be performed on attribute data. TileDB allows you to set a different filter list for each attribute.

Tile chunk

Recall that data tiles are the atomic unit of I/O in TileDB. Before writing, data tiles are internally divided into multiple chunks by TileDB. The tile chunk is the atomic unit of filtering, i.e., each chunk of a data tile is filtered separately (and in parallel).

Filter lists

Filtering in TileDB allows you to specify an ordered list of data transformations, such as compression, that should be applied to attribute data before it is written to disk. Individual filters are added to filter lists, which are then set on attributes in the array schema.

C++

To start, create a simple sparse array schema.

Context ctx;

// The array will be 4x4 with dimensions "rows" and "cols", with domain [1,4].
Domain domain(ctx);
domain.add_dimension(Dimension::create<int>(ctx, "rows", {{1, 4}}, 4))
      .add_dimension(Dimension::create<int>(ctx, "cols", {{1, 4}}, 4));

// The array will be sparse.
ArraySchema schema(ctx, TILEDB_SPARSE);
schema.set_domain(domain).set_order({{TILEDB_ROW_MAJOR, TILEDB_ROW_MAJOR}});

Next create two fixed-length attributes a1 and a2:

auto a1 = Attribute::create<uint32_t>(ctx, "a1");
auto a2 = Attribute::create<int32_t>(ctx, "a2");

Attribute a1 will be filtered by bit width reduction followed by Zstd compression. First, create Filter objects for the two filtering operations:

Filter bit_width_reduction(ctx, TILEDB_FILTER_BIT_WIDTH_REDUCTION);
Filter compression_zstd(ctx, TILEDB_FILTER_ZSTD);

Next, create a FilterList object and add the two filters. Note that the order you add Filter objects to a FilterList is the order that the filtering will occur.

FilterList a1_filters(ctx);
a1_filters.add_filter(bit_width_reduction)
    .add_filter(compression_zstd);

Now set the filter list on attribute a1:

a1.set_filter_list(a1_filters);

Attribute a2 will be filtered just with a single gzip compression filter:

FilterList a2_filters(ctx);
a2_filters.add_filter({ctx, TILEDB_FILTER_GZIP});
a2.set_filter_list(a2_filters);

Note that Filter and FilterList objects can be reused. If instead you wanted to use the same filter list for a2 as was used in a1 you could simply do:

a1.set_filter_list(a1_filters);
a2.set_filter_list(a1_filters);

Either way, add the attributes to the array schema and create the array:

schema.add_attribute(a1).add_attribute(a2);
Array::create(array_name, schema);

TileDB also allows you to set filter lists to be used on the offsets data for variable-length attributes as well as the coordinates for sparse fragments. For example, to set a filter list for the offsets you could do the following:

FilterList offsets_filters(ctx);
offsets_filters.add_filter({ctx, TILEDB_FILTER_POSITIVE_DELTA})
    .add_filter(bit_width_reduction)
    .add_filter(compression_zstd);
schema.set_offsets_filter_list(offsets_filters);

Python

To start, create a simple sparse array domain.

# The array will be 4x4 with dimensions "rows" and "cols", with domain [1,4].
dom = tiledb.Domain(tiledb.Dim(name="rows", domain=(1, 4), tile=4, dtype=np.int32),
                    tiledb.Dim(name="cols", domain=(1, 4), tile=4, dtype=np.int32))

Attribute a1 will be filtered by bit width reduction followed by Zstd compression. First, create filter objects for the two filtering operations:

bit_width_reduction = tiledb.BitWidthReductionFilter()
compression_zstd = tiledb.ZstdFilter()

Next, create a FilterList object with the two filters. Note that the order you specify filter objects to a FilterList is the order that the filtering will occur.

a1_filters = tiledb.FilterList([bit_width_reduction, compression_zstd])

Attribute a2 will be filtered just with a single gzip compression filter:

a2_filters = tiledb.FilterList([tiledb.GzipFilter()])

Add the attributes to the array schema and create the array:

schema = tiledb.ArraySchema(domain=dom, sparse=True,
                            attrs=[tiledb.Attr(name="a1", dtype=np.uint32, filters=a1_filters),
                                   tiledb.Attr(name="a2", dtype=np.int32, filters=a2_filters)])
tiledb.SparseArray.create(array_name, schema)

TileDB also allows you to set filter lists to be used on the offsets data for variable-length attributes as well as the coordinates for sparse fragments. For example, to set a filter list for the offsets you could do the following:

offsets_filters = [tiledb.PositiveDeltaFilter(), tiledb.BitWidthReductionFilter(), tiledb.ZstdFilter()]
schema = tiledb.ArraySchema(domain=dom, sparse=True,
                            offsets_filters=offsets_filters,
                            attrs=[...]])

Now when data for attributes a1 and a2 is written to the array, the data will first be transformed by the filter lists you specified in the array schema.

When reading from the array, the filtered data on disk is “unfiltered” through the same list of filters in reverse, producing the original data.

When filtering the data tiles of an attribute, TileDB stores some necessary metadata in file __fragment_metadata.tdb, such as the starting location of each filtered tile and the original tile size in the case of variable-length attributes (recall that the original tile size has fixed size for fixed-length attributes in both dense and sparse arrays).

Filter options

Some filters have additional options that can be configured. For example, you can set the compression level as an option on the filters that perform compression.

C++

To set the compression level to level 5 on a bzip2 compression filter:

Filter compression_bzip2(ctx, TILEDB_FILTER_BZIP2);
int level = 5;
compression_bzip2.set_option(TILEDB_COMPRESSION_LEVEL, &level);

You can also retrieve option values from filters:

int level_get;
compression_bzip2.get_option(TILEDB_COMPRESSION_LEVEL, &level_get);
// Now level_get == 5

Python

To set the compression level to level 5 on a bzip2 compression filter:

compression_bzip2 = tiledb.Bzip2Filter(level=5)

You can also retrieve option values from filters:

level_get = compression_bzip2.level
# Now level_get == 5

The options supported by each filter are documented below.

Available filters

TileDB supports a number of filters, and more will continue to be added in the future.

Compression filters

There are several filters performing generic compression, which are the following:

  • TILEDB_FILTER_GZIP: Compresses with Gzip

  • TILEDB_FILTER_ZSTD: Compresses with Zstandard

  • TILEDB_FILTER_LZ4: Compresses with LZ4

  • TILEDB_FILTER_RLE: Compresses with run-length encoding

  • TILEDB_FILTER_BZIP2: Compresses with Bzip2

  • TILEDB_FILTER_DOUBLE_DELTA: Compresses with double-delta encoding

All of these filters support one filter option to set the compression level, although some compressors such as RLE currently ignore the setting. The filter option is:

  • TILEDB_COMPRESSION_LEVEL (type int32_t): The compression level to use. Default: -1 (compressor-specific default).

Byteshuffle

The filter TILEDB_FILTER_BYTESHUFFLE performs byte shuffling of data as a way to improve compression ratios. The byte shuffle implementation used by TileDB comes from the Blosc project.

The byte shuffling process rearranges the bytes of the input attribute cell values in a deterministic and reversible manner designed to result in long runs of similar bytes that can be compressed more effectively by a generic compressor than the original unshuffled elements. Typically this filter is not used on its own, but rather immediately followed by a compression filter in a filter list.

For example, consider three 32-bit unsigned integer values 1, 2, 3, which have the following little-endian representation when stored adjacent in memory:

0x01 0x00 0x00 0x00 0x02 0x00 0x00 0x00 0x03 0x00 0x00 0x00

The byte shuffle operation will rearrange the bytes of these integer elements in memory such that the resulting array of bytes will contain each element’s first byte, followed by each element’s second byte, etc. After shuffling the bytes would therefore be:

0x01 0x02 0x03 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00

Note the longer run of zero-valued bytes, which will compress more efficiently.

The byteshuffle filter does not support any options.

Bitshuffle

The filter TILEDB_FILTER_BITSHUFFLE performs bit shuffling of data as a way to improve compression ratios. The bitshuffle implementation used in TileDB comes from https://github.com/kiyo-masui/bitshuffle.

Bitshuffling is conceptually very similar to byteshuffling, but operates on the bit granularity rather than the byte granularity. Shuffling at the bit level can increase compression ratios even further than the byteshuffle filter, at the cost of increased computation to perform the shuffle.

Typically this filter is not used on its own, but rather immediately followed by a compression filter in a filter list.

Positive-delta encoding

The filter TILEDB_FILTER_POSITIVE_DELTA performs positive-delta encoding. Positive-delta encoding is a form of delta encoding that only works when the delta value is positive. Positive-delta encoding can result in better compression ratios on the encoded data. Typically this filter is not used on its own, but rather immediately followed by a compression filter in a filter list.

For example, if the data being filtered was the sequence of integers 100, 104, 108, 112, ..., then the resulting positive-encoded data would be 0, 4, 4, 4, .... This encoding is advantageous in that producing long runs of repeated values can result in better compression ratios, if a compression filter is added after positive-delta encoding.

The filter operates on a “window” of values at a time, which can help in some cases to produce longer runs of repeated delta values.

The positive-delta encoding filter supports one option:

  • TILEDB_POSITIVE_DELTA_MAX_WINDOW (type uint32_t): The window size in bytes to use. Default: 1024.

Note

Positive-delta encoding is particularly useful for the offsets of variable-length attribute data, which by definition will always have positive deltas. The above example of the form 100, 104, 108, 112 can easily arise in the offsets, if for example you have a variable-length attribute of 4-byte values with mostly single values per cell instead of a variable number.

Bit width reduction

The filter TILEDB_FILTER_BIT_WIDTH_REDUCTION performs bit-width reduction, which is a form of compression.

Bit-width reduction examines a window of attribute values, and determines if all of the values in the window can be represented by a datatype of smaller byte size. If so, the values in the window are rewritten as values of the smaller datatype, potentially saving several bytes per cell.

For example, consider an attribute with datatype uint64_t. Initially, each cell of data for that attribute requires 8 bytes of storage. However, if you know that the actual value of the attribute is often 255 or less, those cells can be stored using just a single byte in the form of a uint8_t, saving 7 bytes of storage per cell. The bit-width reduction filter performs this analysis and compression automatically over windows of attribute data.

Additionally, each cell value in a window is treated relative to the minimum value in that window. For example, if the window size was 3 cells, which had the values 300, 350, 400, the bit-width reduction filter would first determine that the minimum value in the window was 300, and the relative cell values were 0, 50, 100. These relative values are now less than 255 and can be represented by a uint8_t value.

If possible, it can be a good idea to apply positive-delta encoding before bit-width reduction, as the positive-delta encoding may further increase the opportunities to store windows of data with a narrower datatype.

The bit-width reduction filter supports one option:

  • TILEDB_BIT_WIDTH_MAX_WINDOW (type uint32_t): The window size in bytes to use. Default: 256.

Note

Bit-width reduction only works on integral datatypes.

Tile chunks

Before filtering each data tile of an attribute, TileDB internally divides the tile into disjoint chunks. These chunks are then filtered individually.

Chunking tiles before filtering allows for better cache behavior in terms of temporal locality, as the chunk size can be chosen to fit within the L1 cache of your processor cores. This helps especially with multi-stage filter lists, as the output from the previous filter is likely to still be in L1 when used as input for the current filter.

Chunking tiles also increases the amount of parallel compute that TileDB can make effective use of. By breaking a tile into individual chunks, each chunk can then be filtered in parallel, which can result in excellent CPU utilization when combined with the cache-friendly size of the chunks.

The default chunk size used by TileDB is 64KB, which is the size of many common processor L1 caches. You can control the chunk size by changing the option on a filter list:

C++

Context ctx;
FilterList filter_list(ctx);
// Use a max chunk size of 10,000 bytes for this filter list:
filter_list.set_max_chunk_size(10000);

Python

# Use a max chunk size of 10,000 bytes for this filter list:
filter_list = tiledb.FilterList([tiledb.GzipFilter()], chunksize=10000)