Filters¶
In this tutorial we will learn how to use TileDB filters. It is recommended to first read the tutorials on sparse arrays, multi-attribute arrays and variable-length attributes.
Program |
Links |
|
Basic concepts and definitions¶
Filter
A filter is used to transform attribute data before it is written to an array. One filter performs one type of data transformation. An example of a filter is the bzip2 compression filter, which transforms data by compressing it with bzip2.
Filter list
A filter list is an ordered list of zero or more filters. Filter lists can be arbitrarily long, and are used to specify an ordered sequence of data transformations that should be performed on attribute data. TileDB allows you to set a different filter list for each attribute.
Tile chunk
Recall that data tiles are the atomic unit of I/O in TileDB. Before writing, data tiles are internally divided into multiple chunks by TileDB. The tile chunk is the atomic unit of filtering, i.e., each chunk of a data tile is filtered separately (and in parallel).
Filter lists¶
Filtering in TileDB allows you to specify an ordered list of data transformations, such as compression, that should be applied to attribute data before it is written to disk. Individual filters are added to filter lists, which are then set on attributes in the array schema.
C++
To start, create a simple sparse array schema.
Context ctx;
// The array will be 4x4 with dimensions "rows" and "cols", with domain [1,4].
Domain domain(ctx);
domain.add_dimension(Dimension::create<int>(ctx, "rows", {{1, 4}}, 4))
.add_dimension(Dimension::create<int>(ctx, "cols", {{1, 4}}, 4));
// The array will be sparse.
ArraySchema schema(ctx, TILEDB_SPARSE);
schema.set_domain(domain).set_order({{TILEDB_ROW_MAJOR, TILEDB_ROW_MAJOR}});
Next create two fixed-length attributes a1
and a2
:
auto a1 = Attribute::create<uint32_t>(ctx, "a1");
auto a2 = Attribute::create<int32_t>(ctx, "a2");
Attribute a1
will be filtered by bit width reduction followed by Zstd
compression. First, create Filter
objects for the two filtering
operations:
Filter bit_width_reduction(ctx, TILEDB_FILTER_BIT_WIDTH_REDUCTION);
Filter compression_zstd(ctx, TILEDB_FILTER_ZSTD);
Next, create a FilterList
object and add the two filters. Note that
the order you add Filter
objects to a FilterList
is the order that
the filtering will occur.
FilterList a1_filters(ctx);
a1_filters.add_filter(bit_width_reduction)
.add_filter(compression_zstd);
Now set the filter list on attribute a1
:
a1.set_filter_list(a1_filters);
Attribute a2
will be filtered just with a single gzip compression
filter:
FilterList a2_filters(ctx);
a2_filters.add_filter({ctx, TILEDB_FILTER_GZIP});
a2.set_filter_list(a2_filters);
Note that Filter
and FilterList
objects can be reused. If instead
you wanted to use the same filter list for a2
as was used in a1
you could simply do:
a1.set_filter_list(a1_filters);
a2.set_filter_list(a1_filters);
Either way, add the attributes to the array schema and create the array:
schema.add_attribute(a1).add_attribute(a2);
Array::create(array_name, schema);
TileDB also allows you to set filter lists to be used on the offsets data for variable-length attributes as well as the coordinates for sparse fragments. For example, to set a filter list for the offsets you could do the following:
FilterList offsets_filters(ctx);
offsets_filters.add_filter({ctx, TILEDB_FILTER_POSITIVE_DELTA})
.add_filter(bit_width_reduction)
.add_filter(compression_zstd);
schema.set_offsets_filter_list(offsets_filters);
Python
To start, create a simple sparse array domain.
# The array will be 4x4 with dimensions "rows" and "cols", with domain [1,4].
dom = tiledb.Domain(tiledb.Dim(name="rows", domain=(1, 4), tile=4, dtype=np.int32),
tiledb.Dim(name="cols", domain=(1, 4), tile=4, dtype=np.int32))
Attribute a1
will be filtered by bit width reduction followed by Zstd
compression. First, create filter objects for the two filtering
operations:
bit_width_reduction = tiledb.BitWidthReductionFilter()
compression_zstd = tiledb.ZstdFilter()
Next, create a FilterList
object with the two filters. Note that
the order you specify filter objects to a FilterList
is the order that
the filtering will occur.
a1_filters = tiledb.FilterList([bit_width_reduction, compression_zstd])
Attribute a2
will be filtered just with a single gzip compression
filter:
a2_filters = tiledb.FilterList([tiledb.GzipFilter()])
Add the attributes to the array schema and create the array:
schema = tiledb.ArraySchema(domain=dom, sparse=True,
attrs=[tiledb.Attr(name="a1", dtype=np.uint32, filters=a1_filters),
tiledb.Attr(name="a2", dtype=np.int32, filters=a2_filters)])
tiledb.SparseArray.create(array_name, schema)
TileDB also allows you to set filter lists to be used on the offsets data for variable-length attributes as well as the coordinates for sparse fragments. For example, to set a filter list for the offsets you could do the following:
offsets_filters = [tiledb.PositiveDeltaFilter(), tiledb.BitWidthReductionFilter(), tiledb.ZstdFilter()]
schema = tiledb.ArraySchema(domain=dom, sparse=True,
offsets_filters=offsets_filters,
attrs=[...]])
Now when data for attributes a1
and a2
is written to the array, the data
will first be transformed by the filter lists you specified in the array schema.
When reading from the array, the filtered data on disk is “unfiltered” through the same list of filters in reverse, producing the original data.
When filtering the data tiles of an attribute, TileDB stores some
necessary metadata in file __fragment_metadata.tdb
, such as the
starting location of each filtered tile and the original tile size
in the case of variable-length attributes (recall that the original tile
size has fixed size for fixed-length attributes in both dense and
sparse arrays).
Filter options¶
Some filters have additional options that can be configured. For example, you can set the compression level as an option on the filters that perform compression.
C++
To set the compression level to level 5 on a bzip2 compression filter:
Filter compression_bzip2(ctx, TILEDB_FILTER_BZIP2);
int level = 5;
compression_bzip2.set_option(TILEDB_COMPRESSION_LEVEL, &level);
You can also retrieve option values from filters:
int level_get;
compression_bzip2.get_option(TILEDB_COMPRESSION_LEVEL, &level_get);
// Now level_get == 5
Python
To set the compression level to level 5 on a bzip2 compression filter:
compression_bzip2 = tiledb.Bzip2Filter(level=5)
You can also retrieve option values from filters:
level_get = compression_bzip2.level
# Now level_get == 5
The options supported by each filter are documented below.
Available filters¶
TileDB supports a number of filters, and more will continue to be added in the future.
Compression filters¶
There are several filters performing generic compression, which are the following:
TILEDB_FILTER_GZIP
: Compresses with GzipTILEDB_FILTER_ZSTD
: Compresses with ZstandardTILEDB_FILTER_LZ4
: Compresses with LZ4TILEDB_FILTER_RLE
: Compresses with run-length encodingTILEDB_FILTER_BZIP2
: Compresses with Bzip2TILEDB_FILTER_DOUBLE_DELTA
: Compresses with double-delta encoding
All of these filters support one filter option to set the compression level, although some compressors such as RLE currently ignore the setting. The filter option is:
TILEDB_COMPRESSION_LEVEL
(typeint32_t
): The compression level to use. Default: -1 (compressor-specific default).
Byteshuffle¶
The filter TILEDB_FILTER_BYTESHUFFLE
performs byte shuffling of data as a
way to improve compression ratios. The
byte shuffle implementation
used by TileDB comes from the Blosc project.
The byte shuffling process rearranges the bytes of the input attribute cell values in a deterministic and reversible manner designed to result in long runs of similar bytes that can be compressed more effectively by a generic compressor than the original unshuffled elements. Typically this filter is not used on its own, but rather immediately followed by a compression filter in a filter list.
For example, consider three 32-bit unsigned integer values 1, 2, 3
, which
have the following little-endian representation when stored adjacent in memory:
0x01 0x00 0x00 0x00 0x02 0x00 0x00 0x00 0x03 0x00 0x00 0x00
The byte shuffle operation will rearrange the bytes of these integer elements in memory such that the resulting array of bytes will contain each element’s first byte, followed by each element’s second byte, etc. After shuffling the bytes would therefore be:
0x01 0x02 0x03 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00
Note the longer run of zero-valued bytes, which will compress more efficiently.
The byteshuffle filter does not support any options.
Bitshuffle¶
The filter TILEDB_FILTER_BITSHUFFLE
performs
bit shuffling
of data as a way to improve compression ratios. The bitshuffle implementation
used in TileDB comes from https://github.com/kiyo-masui/bitshuffle.
Bitshuffling is conceptually very similar to byteshuffling, but operates on the bit granularity rather than the byte granularity. Shuffling at the bit level can increase compression ratios even further than the byteshuffle filter, at the cost of increased computation to perform the shuffle.
Typically this filter is not used on its own, but rather immediately followed by a compression filter in a filter list.
Positive-delta encoding¶
The filter TILEDB_FILTER_POSITIVE_DELTA
performs positive-delta encoding.
Positive-delta encoding is a form of delta encoding that only works when the
delta value is positive. Positive-delta encoding can result in better
compression ratios on the encoded data. Typically this filter is not used
on its own, but rather immediately followed by a compression filter in a filter
list.
For example, if the data being filtered was the sequence of integers 100, 104,
108, 112, ...
, then the resulting positive-encoded data would be 0, 4, 4,
4, ...
. This encoding is advantageous in that producing long runs of repeated
values can result in better compression ratios, if a compression filter is added
after positive-delta encoding.
The filter operates on a “window” of values at a time, which can help in some cases to produce longer runs of repeated delta values.
The positive-delta encoding filter supports one option:
TILEDB_POSITIVE_DELTA_MAX_WINDOW
(typeuint32_t
): The window size in bytes to use. Default: 1024.
Note
Positive-delta encoding is particularly useful for the offsets of
variable-length attribute data, which by definition will always have
positive deltas. The above example of the form 100, 104, 108, 112
can
easily arise in the offsets, if for example you have a variable-length
attribute of 4-byte values with mostly single values per cell instead of a
variable number.
Bit width reduction¶
The filter TILEDB_FILTER_BIT_WIDTH_REDUCTION
performs bit-width reduction,
which is a form of compression.
Bit-width reduction examines a window of attribute values, and determines if all of the values in the window can be represented by a datatype of smaller byte size. If so, the values in the window are rewritten as values of the smaller datatype, potentially saving several bytes per cell.
For example, consider an attribute with datatype uint64_t
. Initially, each
cell of data for that attribute requires 8 bytes of storage. However, if you
know that the actual value of the attribute is often 255 or less, those cells
can be stored using just a single byte in the form of a uint8_t
, saving 7
bytes of storage per cell. The bit-width reduction filter performs this analysis
and compression automatically over windows of attribute data.
Additionally, each cell value in a window is treated relative to the minimum
value in that window. For example, if the window size was 3 cells, which had the
values 300, 350, 400
, the bit-width reduction filter would first determine
that the minimum value in the window was 300
, and the relative cell values
were 0, 50, 100
. These relative values are now less than 255 and can be
represented by a uint8_t
value.
If possible, it can be a good idea to apply positive-delta encoding before bit-width reduction, as the positive-delta encoding may further increase the opportunities to store windows of data with a narrower datatype.
The bit-width reduction filter supports one option:
TILEDB_BIT_WIDTH_MAX_WINDOW
(typeuint32_t
): The window size in bytes to use. Default: 256.
Note
Bit-width reduction only works on integral datatypes.
Tile chunks¶
Before filtering each data tile of an attribute, TileDB internally divides the tile into disjoint chunks. These chunks are then filtered individually.
Chunking tiles before filtering allows for better cache behavior in terms of temporal locality, as the chunk size can be chosen to fit within the L1 cache of your processor cores. This helps especially with multi-stage filter lists, as the output from the previous filter is likely to still be in L1 when used as input for the current filter.
Chunking tiles also increases the amount of parallel compute that TileDB can make effective use of. By breaking a tile into individual chunks, each chunk can then be filtered in parallel, which can result in excellent CPU utilization when combined with the cache-friendly size of the chunks.
The default chunk size used by TileDB is 64KB, which is the size of many common processor L1 caches. You can control the chunk size by changing the option on a filter list:
C++
Context ctx;
FilterList filter_list(ctx);
// Use a max chunk size of 10,000 bytes for this filter list:
filter_list.set_max_chunk_size(10000);
Python
# Use a max chunk size of 10,000 bytes for this filter list:
filter_list = tiledb.FilterList([tiledb.GzipFilter()], chunksize=10000)