Format description
This document describes the persisted format of TileDB arrays. The first section describes the tile-based format shared by all files written by TileDB (including attribute data, array metadata, etc), as well as the metadata added to each tile for filtering. The second section describes the byte format of the tile data written in each file in a TileDB array.
The current TileDB format version number is 3 (`uint32_t`).
Note
All data written by TileDB and referenced in this document is little-endian.
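
As a minimal sketch of what the little-endian convention means in practice, the following Python snippet encodes the format version as a little-endian `uint32_t` and reads it back. The helper name `read_u32_le` is hypothetical and used only for illustration.

```python
import struct

# The format version (3) stored as a little-endian uint32_t.
FORMAT_VERSION = 3
encoded_version = struct.pack("<I", FORMAT_VERSION)


def read_u32_le(buf: bytes, offset: int = 0) -> int:
    """Decode a little-endian uint32 starting at `offset` (illustrative helper)."""
    return struct.unpack_from("<I", buf, offset)[0]


decoded_version = read_u32_le(encoded_version)
```

The least significant byte comes first, so the four encoded bytes are `03 00 00 00`.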
Tile-based format
File format
Every file written by TileDB, including attribute data, array schema and other metadata, is internally treated as a byte array divided into tiles. I/O operations typically occur at tile granularity. All tiles in TileDB can pass through a filter pipeline.
Any file written by TileDB therefore has the following on-disk format:
| Field | Type | Description |
|---|---|---|
| Tile 1 | | Array of bytes containing the first tile’s (filtered) data. |
| … | … | … |
| Tile N | | Array of bytes containing the nth tile’s (filtered) data. |
Every file is at least one tile. The current version of TileDB writes two types of `Tile`: attribute tiles and generic tiles. Attribute tiles contain attribute data, offsets, or coordinates. Generic tiles contain generic non-attribute data, of which there are two kinds: the array schema and the fragment metadata.
The fragment metadata and array schema files consist of a single generic tile. Attribute, offsets and coordinate files consist of one or more attribute tiles.
Each generic tile contains some additional metadata in a header structure. A generic `Tile` has the on-disk format:
| Field | Type | Description |
|---|---|---|
| Version number | | Format version number of the generic tile. |
| Persisted size | | Persisted (e.g. compressed) size of the tile. |
| Tile size | | In-memory (e.g. uncompressed) size of the tile. |
| Datatype | | Datatype of the tile. |
| Cell size | | Cell size of the tile. |
| Encryption type | | Type of encryption used in filtering the tile. |
| Filter pipeline size | | Number of bytes in the serialized filter pipeline. |
| Filter pipeline | | Filter pipeline used to filter the tile. |
| Tile data | | Array of filtered tile data bytes. |
Attribute tiles do not store extra metadata per tile (attribute metadata is instead stored in the array schema and fragment metadata files), so an attribute `Tile` has the on-disk format:
| Field | Type | Description |
|---|---|---|
| Tile data | | Array of filtered tile data bytes. |
A coordinate `Tile` is additionally processed by “splitting” coordinate tuples across dimensions. As an example, 3D coordinates are given by users in the form `[x1, y1, z1, x2, y2, z2, ...]`. Before being filtered, the coordinate values stored in the tile data are rearranged to be `[x1, x2, ..., xN, y1, y2, ..., yN, z1, z2, ..., zN]`.
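
The coordinate rearrangement described above can be sketched in a few lines of Python. This is an illustration of the reordering only, operating on a plain list of values; the function names `split_coords` and `unsplit_coords` are hypothetical and do not correspond to TileDB's internal API.

```python
def split_coords(interleaved, num_dims):
    """Rearrange [x1, y1, z1, x2, ...] into [x1, x2, ..., y1, y2, ..., z1, ...]."""
    if len(interleaved) % num_dims != 0:
        raise ValueError("coordinate count is not a multiple of num_dims")
    out = []
    for d in range(num_dims):
        # Every num_dims-th value, starting at d, belongs to dimension d.
        out.extend(interleaved[d::num_dims])
    return out


def unsplit_coords(split, num_dims):
    """Inverse of split_coords: restore the interleaved tuple order."""
    if len(split) % num_dims != 0:
        raise ValueError("coordinate count is not a multiple of num_dims")
    n = len(split) // num_dims  # number of coordinate tuples
    return [split[d * n + i] for i in range(n) for d in range(num_dims)]
```

Grouping the values of one dimension together tends to place similar values next to each other, which is generally helpful for the filters that run afterwards.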
To account for filtering, some additional metadata is prepended to the tile data bytes in each tile. This filter pipeline metadata informs TileDB how the following tile bytes should be treated (for example, how to decompress them when reading from disk).
The array of filtered tile data bytes (for any type of tile) has the on-disk format:
| Field | Type | Description |
|---|---|---|
| Number of chunks | | Number of chunks in the tile |
| Chunk 1 | | First chunk in the tile |
| … | … | … |
| Chunk N | | Nth chunk in the tile |
Internally, tile data is divided into “chunks.” Every tile is at least one chunk. A `TileChunk` has the following on-disk format:
| Field | Type | Description |
|---|---|---|
| Original length of chunk | | The original (unfiltered) number of bytes of chunk data. |
| Filtered length of chunk | | The serialized (filtered) number of bytes of chunk data. |
| Chunk metadata length | | Number of bytes in the chunk metadata |
| Chunk metadata | | Chunk metadata bytes |
| Chunk filtered data | | Filtered chunk bytes |
The metadata added to a chunk depends on the sequence of filters in the pipeline used to filter the containing tile.
If the pipeline used to filter tiles is empty (contains no filters), the tile is still divided into chunks and serialized according to the above format. In this case there are no chunk metadata bytes (since there are no filters to add metadata), and the filtered bytes are the same as the original bytes.
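
The chunked layout above can be sketched as a small serializer and parser. The field widths are assumptions made for this illustration (a `uint64` chunk count and `uint32` length fields), since the Type column is not available here; the function names are hypothetical as well.

```python
import struct


def serialize_chunks(chunks):
    """Serialize a list of (original_len, metadata_bytes, filtered_bytes) tuples.

    Assumed widths: uint64 chunk count, uint32 for each length field.
    """
    out = bytearray(struct.pack("<Q", len(chunks)))
    for original_len, metadata, filtered in chunks:
        out += struct.pack("<III", original_len, len(filtered), len(metadata))
        out += metadata
        out += filtered
    return bytes(out)


def parse_chunks(buf):
    """Parse bytes produced by serialize_chunks back into a list of tuples."""
    (num_chunks,) = struct.unpack_from("<Q", buf, 0)
    pos = 8
    chunks = []
    for _ in range(num_chunks):
        original_len, filtered_len, metadata_len = struct.unpack_from("<III", buf, pos)
        pos += 12
        metadata = buf[pos:pos + metadata_len]
        pos += metadata_len
        filtered = buf[pos:pos + filtered_len]
        pos += filtered_len
        chunks.append((original_len, metadata, filtered))
    return chunks
```

With an empty pipeline, each chunk would carry zero metadata bytes and a filtered length equal to its original length, for example `(5, b"", b"hello")`.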
Chunk metadata and filtered data
The “chunk metadata bytes” before the actual chunk data bytes depend on the particular sequence of filters in the pipeline. In the simple case, each filter will simply concatenate its metadata to the chunk metadata region. Because some filters in the pipeline may wish to filter the metadata of previous filters (e.g. compression, where it is beneficial to compress previous filters’ metadata in addition to the actual chunk data), the ordering of filters also impacts the metadata that is eventually written to disk.
The “chunk filtered data” bytes contain the final bytes of the chunk after being passed through the entire pipeline. When reading tiles from disk, these filtered bytes are passed through the filter pipeline in the reverse order.
Internally, any filter in a filter pipeline produces two arrays of data as output: a metadata byte array and a filtered data byte array. Additionally, these output byte arrays can be arbitrarily separated into “parts” by any filter. Typically, when a next filter receives the output of the previous filter as its input, it will filter each “part” independently.
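
As a rough sketch of the forward and reverse passes described above, the following Python code runs a chunk through a list of filters on write and through the same list in reverse order on read. It deliberately ignores per-filter metadata and "parts"; `XorFilter` and `ZlibFilter` are hypothetical stand-ins, not TileDB filters.

```python
import zlib


class XorFilter:
    """Toy stand-in filter: XOR every byte with a fixed key."""

    def __init__(self, key):
        self.key = key

    def forward(self, data):
        return bytes(b ^ self.key for b in data)

    def reverse(self, data):
        # XOR with the same key undoes the forward pass.
        return self.forward(data)


class ZlibFilter:
    """Stand-in compression filter based on zlib."""

    def forward(self, data):
        return zlib.compress(data)

    def reverse(self, data):
        return zlib.decompress(data)


def run_forward(pipeline, data):
    """Write path: apply filters first to last."""
    for f in pipeline:
        data = f.forward(data)
    return data


def run_reverse(pipeline, data):
    """Read path: undo filters last to first."""
    for f in reversed(pipeline):
        data = f.reverse(data)
    return data
```

Reversing the order matters: the bytes on disk are the output of the last filter, so that filter has to be undone first.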
First we will look at the output of the filters individually.
Byteshuffle filter
The byteshuffle filter does not filter input metadata, and the output data is guaranteed to be the same length as the input data.
The byteshuffle filter produces output metadata in the format:
| Field | Type | Description |
|---|---|---|
| Number of parts | | Number of data parts |
| Length of part 1 | | Number of bytes in data part 1 |
| … | … | … |
| Length of part N | | Number of bytes in data part N |
The byteshuffle filter produces output data in the format:
| Field | Type | Description |
|---|---|---|
| Part 1 | | Byteshuffled data part 1 |
| … | … | … |
| Part N | | Byteshuffled data part N |
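
To illustrate why the byteshuffled output has the same length as its input, here is a minimal sketch of a generic byte shuffle over fixed-size elements: all first bytes are grouped together, then all second bytes, and so on. This shows the general technique only; how TileDB treats input whose length is not a multiple of the element size is not specified here, so the sketch simply rejects it. The function names are hypothetical.

```python
import struct


def byteshuffle(data: bytes, elem_size: int) -> bytes:
    """Group byte j of every element together, for j = 0 .. elem_size - 1."""
    if len(data) % elem_size != 0:
        raise ValueError("data length must be a multiple of elem_size")
    return b"".join(data[j::elem_size] for j in range(elem_size))


def byteunshuffle(data: bytes, elem_size: int) -> bytes:
    """Inverse of byteshuffle."""
    if len(data) % elem_size != 0:
        raise ValueError("data length must be a multiple of elem_size")
    n = len(data) // elem_size  # number of elements
    return bytes(data[j * n + i] for i in range(n) for j in range(elem_size))


# Three little-endian uint32 values with small magnitudes.
sample = struct.pack("<3I", 1, 2, 3)
shuffled = byteshuffle(sample, 4)
```

For small integers the high-order bytes are mostly zero, so after shuffling they form long runs that a later compression filter can exploit.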
Bitshuffle filter
The bitshuffle filter does not filter input metadata.
The bitshuffle filter produces output metadata in the format:
| Field | Type | Description |
|---|---|---|
| Number of parts | | Number of data parts |
| Length of part 1 | | Number of bytes in data part 1 |
| … | … | … |
| Length of part N | | Number of bytes in data part N |
The bitshuffle filter produces output data in the format:
| Field | Type | Description |
|---|---|---|
| Part 1 | | Bitshuffled data part 1 |
| … | … | … |
| Part N | | Bitshuffled data part N |
Bit width reduction filter
The bit width reduction filter does not filter input metadata.
The bit width reduction filter produces output metadata in the format:
| Field | Type | Description |
|---|---|---|
| Length of input | | Original input number of bytes |
| Number of windows | | Number of windows in output |
| Window 1 metadata | | Metadata for window 1 |
| … | … | … |
| Window N metadata | | Metadata for window N |
The type `WindowMD` has the format:
| Field | Type | Description |
|---|---|---|
| Window value offset | | Offset applied to values in the output window. |
| Bit width of reduced type | | Number of bits in the new datatype of the values in the output window |
| Window length | | Number of bytes in output window data. |
The bit width reduction filter produces output data in the format:
| Field | Type | Description |
|---|---|---|
| Window 1 | | Window 1 data (possibly-reduced width elements) |
| … | … | … |
| Window N | | Window N data (possibly-reduced width elements) |
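
The three per-window metadata fields (value offset, reduced bit width, window length) suggest a scheme along the following lines. This is a sketch under stated assumptions, not TileDB's implementation: it assumes unsigned 32-bit input values, uses the window minimum as the offset, and picks the narrowest of 8, 16 or 32 bits that can hold the remaining range. The function names are hypothetical.

```python
import struct

# struct format characters for each supported output width (assumed set).
_FORMATS = {8: "B", 16: "H", 32: "I"}


def reduce_window(values):
    """Reduce one non-empty window of uint32 values.

    Returns (offset, bits, payload): the subtracted offset, the chosen
    bit width, and the little-endian payload of reduced values.
    """
    offset = min(values)
    span = max(values) - offset
    bits = next(b for b in (8, 16, 32) if span < (1 << b))
    fmt = f"<{len(values)}{_FORMATS[bits]}"
    payload = struct.pack(fmt, *(v - offset for v in values))
    return offset, bits, payload


def restore_window(offset, bits, payload):
    """Inverse of reduce_window."""
    count = len(payload) // (bits // 8)
    reduced = struct.unpack(f"<{count}{_FORMATS[bits]}", payload)
    return [v + offset for v in reduced]
```

In this sketch the window length stored in the metadata would be `len(payload)`, and a window whose values span a wide range simply keeps the full 32-bit width.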
Positive delta encoding filter
The positive-delta encoding filter does not filter input metadata.
The positive-delta encoding filter produces output metadata in the format:
| Field | Type | Description |
|---|---|---|
| Number of windows | | Number of windows in output |
| Window 1 metadata | | Metadata for window 1 |
| … | … | … |
| Window N metadata | | Metadata for window N |
The type `WindowMD` has the format:
| Field | Type | Description |
|---|---|---|
| Window value delta offset | | Offset applied to values in the output window. |
| Window length | | Number of bytes in output window data. |
The positive-delta encoding filter produces output data in the format:
| Field | Type | Description |
|---|---|---|
| Window 1 | | Window 1 delta-encoded data |
| … | … | … |
| Window N | | Window N delta-encoded data |
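
A minimal sketch of windowed positive-delta encoding follows. It assumes, for illustration only, that the window's delta offset is its first value and that every stored element is the non-negative difference from the preceding value; the exact layout TileDB uses may differ. The function names are hypothetical.

```python
def delta_encode_window(values):
    """Encode one window of non-decreasing integers.

    Returns (delta_offset, deltas), where delta_offset is the first value
    and deltas[i] is values[i] minus the previous value (0 for the first).
    """
    if not values:
        raise ValueError("window must not be empty")
    deltas = []
    prev = values[0]
    for v in values:
        if v < prev:
            raise ValueError("positive-delta encoding needs non-decreasing input")
        deltas.append(v - prev)
        prev = v
    return values[0], deltas


def delta_decode_window(delta_offset, deltas):
    """Inverse of delta_encode_window: accumulate deltas from the offset."""
    out = []
    current = delta_offset
    for d in deltas:
        current += d
        out.append(current)
    return out
```

Sorted data such as cell offsets or coordinates along one dimension produces small deltas, which pair well with a subsequent bit width reduction or compression filter.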
Compression filters
The compression filters do filter input metadata.
The compression filters produce output metadata in the format:
| Field | Type | Description |
|---|---|---|
| Number of metadata parts | | Number of input metadata parts that were compressed |
| Number of data parts | | Number of input data parts that were compressed |
| Metadata part 1 | | Metadata about the first metadata part |
| … | … | … |
| Metadata part N | | Metadata about the nth metadata part |
| Data part 1 | | Metadata about the first data part |
| … | … | … |
| Data part N | | Metadata about the nth data part |
The type `CompressedPartMD` has the format:
| Field | Type | Description |
|---|---|---|
| Part original length | | Input length of the part (before compression) |
| Part compressed length | | Compressed length of the part |
The compression filters then produce output data in the format:
| Field | Type | Description |
|---|---|---|
| Metadata part 1 compressed bytes | | Compressed bytes of the first metadata part |
| … | … | … |
| Metadata part N compressed bytes | | Compressed bytes of the nth metadata part |
| Data part 1 compressed bytes | | Compressed bytes of the first data part |
| … | … | … |
| Data part N compressed bytes | | Compressed bytes of the nth data part |
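
The compression filter layout can be sketched as follows: every input metadata part and data part is compressed independently, the output metadata records the original and compressed length of each part, and the output data is the concatenation of the compressed parts. The sketch uses zlib as a stand-in compressor and assumes `uint32` for all counts and lengths; both choices, and the function names, are assumptions made for illustration.

```python
import struct
import zlib


def compress_parts(metadata_parts, data_parts):
    """Return (output_metadata, output_data) for the given input parts."""
    header = struct.pack("<II", len(metadata_parts), len(data_parts))
    part_md = bytearray()
    out_data = bytearray()
    # Metadata parts come first, then data parts, in both output arrays.
    for part in list(metadata_parts) + list(data_parts):
        compressed = zlib.compress(part)
        part_md += struct.pack("<II", len(part), len(compressed))
        out_data += compressed
    return header + bytes(part_md), bytes(out_data)


def decompress_parts(output_metadata, output_data):
    """Inverse of compress_parts: returns (metadata_parts, data_parts)."""
    num_md, num_data = struct.unpack_from("<II", output_metadata, 0)
    md_pos, data_pos, parts = 8, 0, []
    for _ in range(num_md + num_data):
        original_len, compressed_len = struct.unpack_from("<II", output_metadata, md_pos)
        md_pos += 8
        part = zlib.decompress(output_data[data_pos:data_pos + compressed_len])
        if len(part) != original_len:
            raise ValueError("decompressed length does not match metadata")
        data_pos += compressed_len
        parts.append(part)
    return parts[:num_md], parts[num_md:]
```

Because the previous filters' metadata is passed in as `metadata_parts`, it ends up compressed alongside the data, which is the behaviour the section describes for compression filters.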
Internal formats
As mentioned, any file written by TileDB including attribute data, array schema, fragment metadata, coordinates or offsets, is treated as an array of bytes and broken up into separate tiles before writing. The previous section defined the on-disk format of files written by TileDB in terms of tiles and filter metadata.
This section describes the data contained in each file, independent of any tiling. In other words, the format structures defined here comprise unfiltered tile data, which is treated as an array of bytes, broken into `TileChunks`, filtered, and written to disk with the format described in the previous section. We refer to the byte format of unfiltered tile data as the “internal” format.
Array schema file
The file `__array_schema.tdb` has the internal format:
| Field | Type | Description |
|---|---|---|
| Array schema | | The serialized array schema. |
The type `ArraySchema` has the internal format:
| Field | Type | Description |
|---|---|---|
| Array version | | Format version number of the array schema |
| Array type | | Dense or sparse |
| Tile order | | Row or column major |
| Cell order | | Row or column major |
| Capacity | | For sparse arrays, the data tile capacity |
| Coords filters | | The filter pipeline used for coordinate tiles |
| Offsets filters | | The filter pipeline used for cell var-len offset tiles |
| Domain | | The array domain |
| Num attributes | | Number of attributes in the array |
| Attribute 1 | | First attribute |
| … | … | … |
| Attribute N | | Nth attribute |
The type `Domain` has the internal format:
| Field | Type | Description |
|---|---|---|
| Type | | Datatype of dimension values |
| Num dimensions | | Dimensionality/rank of the domain |
| Dimension 1 | | First dimension |
| … | … | … |
| Dimension N | | Nth dimension |
The type `Dimension` has the internal format:
| Field | Type | Description |
|---|---|---|
| Dimension name length | | Number of characters in dimension name (the following array) |
| Dimension name | | Dimension name character array |
| Domain | | Byte array storing the domain of the dimension |
| Null tile extent | | Indicates whether the tile extent is null |
| Tile extent | | Byte array storing the tile extent of the dimension |
The type `Attribute` has the internal format:
| Field | Type | Description |
|---|---|---|
| Attribute name length | | Number of characters in attribute name (the following array) |
| Attribute name | | Attribute name character array |
| Attribute datatype | | Datatype of the attribute values |
| Cell val num | | Number of attribute values per cell. For variable-length attributes, this is a reserved sentinel value. |
| Filters | | The filter pipeline used on attribute value tiles |
The type `FilterPipeline` has the internal format:
| Field | Type | Description |
|---|---|---|
| Max chunk size | | Maximum chunk size within a tile |
| Num filters | | Number of filters in pipeline |
| Filter 1 | | First filter |
| … | … | … |
| Filter N | | Nth filter |
The type `Filter` has the internal format:
| Field | Type | Description |
|---|---|---|
| Filter type | | Type of filter (e.g. `TILEDB_FILTER_BITSHUFFLE`) |
| Filter metadata size | | Number of bytes in filter metadata (the following array); may be 0. |
| Filter metadata | | Filter metadata, specific to each filter. E.g. compression level for compression filters. |
The filter metadata contains configuration parameters for the filters that do not change once the array schema has been created. For the compression filters (any of the filter types `TILEDB_FILTER_{GZIP,ZSTD,LZ4,RLE,BZIP2,DOUBLE_DELTA}`) the filter metadata has the internal format:
| Field | Type | Description |
|---|---|---|
| Compressor type | | Type of compression (e.g. GZIP or ZSTD) |
| Compression level | | Compression level used (ignored by some compressors). |
The filter metadata for `TILEDB_FILTER_BIT_WIDTH_REDUCTION` has the internal format:
| Field | Type | Description |
|---|---|---|
| Max window size | | Maximum window size in bytes |
The filter metadata for `TILEDB_FILTER_POSITIVE_DELTA` has the internal format:
| Field | Type | Description |
|---|---|---|
| Max window size | | Maximum window size in bytes |
The remaining filters (`TILEDB_FILTER_BITSHUFFLE` and `TILEDB_FILTER_BYTESHUFFLE`) do not serialize any metadata.
Array lock file
The file `__lock.tdb` is always an empty file on disk.
Fragment metadata file
The file `__fragment_metadata.tdb` has the internal format:
| Field | Type | Description |
|---|---|---|
| R-Tree | | The serialized R-Tree. |
| Tile offsets for attribute 1 | | The serialized tile offsets for attribute 1. |
| … | … | … |
| Tile offsets for attribute N | | The serialized tile offsets for attribute N. |
| Variable tile offsets for attribute 1 | | The serialized variable tile offsets for attribute 1. |
| … | … | … |
| Variable tile offsets for attribute N | | The serialized variable tile offsets for attribute N. |
| Variable tile sizes for attribute 1 | | The serialized variable tile sizes for attribute 1. |
| … | … | … |
| Variable tile sizes for attribute N | | The serialized variable tile sizes for attribute N. |
| Footer | | Basic metadata. |
The type `RTree` is a generic tile with the following internal format:
| Field | Type | Description |
|---|---|---|
| Dimensionality | | Number of dimensions. |
| Fanout | | The tree fanout. |
| Type | | The domain datatype. |
| Number of levels | | The number of levels in the tree. |
| Num MBRs at level 1 | | The number of MBRs at level 1. |
| MBR 1 at level 1 | | Byte array of two coordinate tuples storing the min/max coords of the first MBR at level 1 |
| … | … | … |
| MBR N at level 1 | | Byte array of two coordinate tuples storing the min/max coords of the Nth MBR at level 1 |
| … | … | … |
| Num MBRs at level L | | The number of MBRs at level L. |
| MBR 1 at level L | | Byte array of two coordinate tuples storing the min/max coords of the first MBR at level L |
| … | … | … |
| MBR N at level L | | Byte array of two coordinate tuples storing the min/max coords of the Nth MBR at level L |
The type `TileOffsets` is a generic tile with the following internal format:
| Field | Type | Description |
|---|---|---|
| Num tile offsets | | Number of tile offsets. |
| Tile offset 1 | | Offset 1. |
| … | … | … |
| Tile offset N | | Offset N. |
The type `TileSizes` is a generic tile with the following internal format:
| Field | Type | Description |
|---|---|---|
| Num tile sizes | | Number of tile sizes. |
| Tile size 1 | | Size 1. |
| … | … | … |
| Tile size N | | Size N. |
The type `Footer` has the following internal format:
| Field | Type | Description |
|---|---|---|
| Version number | | Format version number of the fragment. |
| Null non-empty domain | | Indicates whether the non-empty domain is null or not. |
| Non-empty domain | | Byte array of two coordinate tuples storing the min/max coords of a bounding box surrounding the non-empty domain of the fragment. |
| Number of sparse tiles | | Number of sparse tiles. |
| Last tile cell num | | For sparse arrays, the number of cells in the last tile in the fragment. |
| File sizes | | The size in bytes of each attribute file in the fragment, including coords. For var-length attributes, this is the size of the offsets file. |
| File var sizes | | The size in bytes of each var-length attribute file in the fragment. |
| R-Tree offset | | The offset to the generic tile storing the R-Tree in the metadata file. |
| Tile offset for attribute 1 | | The offset to the generic tile storing the tile offsets for attribute 1. |
| … | … | … |
| Tile offset for attribute N+1 | | The offset to the generic tile storing the tile offsets for attribute N+1 (N+1 stands for coordinates). |
| Tile var offset for attribute 1 | | The offset to the generic tile storing the variable tile offsets for attribute 1. |
| … | … | … |
| Tile var offset for attribute N | | The offset to the generic tile storing the variable tile offsets for attribute N. |
| Tile var sizes offset for attribute 1 | | The offset to the generic tile storing the variable tile sizes for attribute 1. |
| … | … | … |
| Tile var sizes offset for attribute N | | The offset to the generic tile storing the variable tile sizes for attribute N. |
Coords file
Within a sparse fragment, the file `__coords.tdb` has the following internal format:
| Field | Type | Description |
|---|---|---|
| Dim 1 coordinate values | | Array of the first dimension values for all coordinate tuples of cells in the fragment. |
| … | … | … |
| Dim N coordinate values | | Array of the nth dimension values for all coordinate tuples of cells in the fragment. |
Attribute file
Within a fragment, each fixed-length attribute named `<attr>` has a file `<attr>.tdb` with the internal format:
| Field | Type | Description |
|---|---|---|
| Attribute values | | Array of the attribute values for all cells. |
Each var-length attribute named `<attr>` has two files: `<attr>_var.tdb`, storing the variable-length data for the attribute in each cell, and `<attr>.tdb`, storing the offsets in `<attr>_var.tdb` for the data belonging to each cell. `<attr>.tdb` has the internal format:
| Field | Type | Description |
|---|---|---|
| Attribute offsets | | Array of the attribute value offsets in the corresponding file `<attr>_var.tdb`. |
And `<attr>_var.tdb` has the format:
| Field | Type | Description |
|---|---|---|
| Attribute values | | Array of the attribute values for all cells. |
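
The relationship between the offsets file and the values file for a var-length attribute can be sketched as below. The sketch assumes the offsets are `uint64` byte positions into the values buffer and that each cell extends to the next cell's offset (or to the end of the buffer for the last cell); these details and the function names are assumptions for illustration.

```python
import struct


def encode_var_cells(cells):
    """Build (offsets_bytes, values_bytes) from a list of per-cell byte strings."""
    offsets, values, pos = [], bytearray(), 0
    for cell in cells:
        offsets.append(pos)  # starting byte of this cell in the values buffer
        values += cell
        pos += len(cell)
    return struct.pack(f"<{len(offsets)}Q", *offsets), bytes(values)


def decode_var_cells(offsets_bytes, values_bytes):
    """Recover the per-cell byte strings from the two buffers."""
    count = len(offsets_bytes) // 8
    offsets = list(struct.unpack(f"<{count}Q", offsets_bytes))
    # Each cell ends where the next one starts; the last ends at the buffer end.
    ends = offsets[1:] + [len(values_bytes)]
    return [values_bytes[start:end] for start, end in zip(offsets, ends)]
```

Here `offsets_bytes` plays the role of the unfiltered contents of `<attr>.tdb` and `values_bytes` that of `<attr>_var.tdb`.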