Format description¶

This document describes the persisted format of TileDB arrays. The first section describes the tile-based format shared by all files written by TileDB (including attribute data, array metadata, etc), as well as the metadata added to each tile for filtering. The second section describes the byte format of the tile data written in each file in a TileDB array.

The current TileDB format version number is 3 (uint32_t).

Note

All data written by TileDB and referenced in this document is little-endian.

Tile-based format¶

File format¶

Every file written by TileDB, including attribute data, array schema and other metadata, is internally treated as a byte array divided into tiles. I/O operations typically occur at tile granularity. All tiles in TileDB can pass through a filter pipeline.

Any file written by TileDB therefore has the following on-disk format:

Field	Type	Description
Tile 1	`Tile`	Array of bytes containing the first tile’s (filtered) data.
…	…	…
Tile N	`Tile`	Array of bytes containing the nth tile’s (filtered) data.

Every file is at least one tile. The current version of TileDB writes two types of Tile: attribute tiles and generic tiles. Attribute tiles contain attribute data, offsets, or coordinates. Generic tiles contain generic non-attribute data, of which there are two kinds: the array schema and the fragment metadata.

The fragment metadata and array schema files consist of a single generic tile. Attribute, offsets and coordinate files consist of one or more attribute tiles.

Each generic tile contains some additional metadata in a header structure. A generic Tile has the on-disk format:

Field	Type	Description
Version number	`uint32_t`	Format version number of the generic tile.
Persisted size	`uint64_t`	Persisted (e.g. compressed) size of the tile.
Tile size	`uint64_t`	In-memory (e.g. uncompressed) size of the tile.
Datatype	`uint8_t`	Datatype of the tile.
Cell size	`uint64_t`	Cell size of the tile.
Encryption type	`uint8_t`	Type of encryption used in filtering the tile.
Filter pipeline size	`uint32_t`	Number of bytes in the serialized `FilterPipeline`.
Filter pipeline	`FilterPipeline`	Filter pipeline used to filter the tile.
Tile data	`uint8_t[]`	Array of filtered tile data bytes.

Attribute tiles do not store extra metadata per tile (attribute metadata is instead stored in the array schema and fragment metadata files), so an attribute Tile has the on-disk format:

Field	Type	Description
Tile data	`uint8_t[]`	Array of filtered tile data bytes.

A coordinate Tile is additionally processed by “splitting” coordinate tuples across dimensions. As an example, 3D coordinates are given by users in the form [x1, y1, z1, x2, y2, z2, ...]. Before being filtered, the coordinate values stored in the tile data are rearranged to be [x1, x2, ..., xN, y1, y2, ..., yN, z1, z2, ..., zN].

To account for filtering, some additional metadata is prepended in the tile data bytes in each tile. This filter pipeline metadata informs TileDB how the following tile bytes should be treated (for example, how to decompress it when reading from disk).

The array of filtered tile data bytes (for any type of tile) has the on-disk format:

Field	Type	Description
Number of chunks	`uint64_t`	Number of chunks in the Tile
Chunk 1	`TileChunk`	First chunk in the tile
…	…	…
Chunk N	`TileChunk`	Nth chunk in the tile

Internally tile data is divided into “chunks.” Every tile is at least one chunk. A TileChunk has the following on-disk format:

Field	Type	Description
Original length of chunk	`uint32_t`	The original (unfiltered) number of bytes of chunk data.
Filtered length of chunk	`uint32_t`	The serialized (filtered) number of bytes of chunk data.
Chunk metadata length	`uint32_t`	Number of bytes in the chunk metadata
Chunk metadata	`uint8_t[]`	Chunk metadata bytes
Chunk filtered data	`uint8_t[]`	Filtered chunk bytes

The metadata added to a chunk depends on the sequence of filters in the pipeline used to filter the containing tile.

If a pipeline used to filter tiles is empty (contains no filters), the tile is still divided into chunks and serialized according to the above format. In this case there are chunk metadata bytes (since there are no filters to add metadata), and the filtered bytes are the same as original bytes.

Chunk metadata and filtered data¶

The “chunk metadata bytes” before the actual chunk data bytes depend on the particular sequence of filters in the pipeline. In the simple case, each filter will simply concatenate its metadata to the chunk metadata region. Because some filters in the pipeline may wish to filter the metadata of previous filters (e.g. compression, where it is beneficial to compress previous filters’ metadata in addition to the actual chunk data), the ordering of filters also impacts the metadata that is eventually written to disk.

The “chunk filtered data” bytes contain the final bytes of the chunk after being passed through the entire pipeline. When reading tiles from disk, these filtered bytes are passed through the filter pipeline in the reverse order.

Internally, any filter in a filter pipeline produces two arrays of data as output: a metadata byte array and a filtered data byte array. Additionally, these output byte arrays can be arbitrarily separated into “parts” by any filter. Typically, when a next filter receives the output of the previous filter as its input, it will filter each “part” independently.

First we will look at the output of the filters individually.

Byteshuffle filter¶

The byteshuffle filter does not filter input metadata, and the output data is guaranteed to be the same length as the input data.

The byteshuffle filter produces output metadata in the format:

Field	Type	Description
Number of parts	`uint32_t`	Number of data parts
Length of part 1	`uint32_t`	Number of bytes in data part 1
…	…	…
Length of part N	`uint32_t`	Number of bytes in data part N

The byteshuffle filter produces output data in the format:

Field	Type	Description
Part 1	`uint8_t[]`	Byteshuffled data part 1
…	…	…
Part N	`uint8_t[]`	Byteshuffled data part N

Bitshuffle filter¶

The bitshuffle filter does not filter input metadata.

The bitshuffle filter produces output metadata in the format:

Field	Type	Description
Number of parts	`uint32_t`	Number of data parts
Length of part 1	`uint32_t`	Number of bytes in data part 1
…	…	…
Length of part N	`uint32_t`	Number of bytes in data part N

The bitshuffle filter produces output data in the format:

Field	Type	Description
Part 1	`uint8_t[]`	Bitshuffled data part 1
…	…	…
Part N	`uint8_t[]`	Bitshuffled data part N

Bit width reduction filter¶

The bit width reduction filter does not filter input metadata.

The bit width reduction filter produces output metadata in the format:

Field	Type	Description
Length of input	`uint32_t`	Original input number of bytes
Number of windows	`uint32_t`	Number of windows in output
Window 1 metadata	`WindowMD`	Metadata for window 1
…	…	…
Window N metadata	`WindowMD`	Metadata for window N

The type WindowMD has the format:

Field	Type	Description
Window value offset	`T`	Offset applied to values in the output window, where `T` is the original datatype of the tile values.
Bit width of reduced type	`uint8_t`	Number of bits in the new datatype of the values in the output window
Window length	`uint32_t`	Number of bytes in output window data.

The bit width reduction filter produces output data in the format:

Field	Type	Description
Window 1	`uint8_t[]`	Window 1 data (possibly-reduced width elements)
…	…	…
Window N	`uint8_t[]`	Window N data (possibly-reduced width elements)

Positive delta encoding filter¶

The positive-delta encoding filter does not filter input metadata.

The positive-delta encoding filter produces output metadata in the format:

Field	Type	Description
Number of windows	`uint32_t`	Number of windows in output
Window 1 metadata	`WindowMD`	Metadata for window 1
…	…	…
Window N metadata	`WindowMD`	Metadata for window N

The type WindowMD has the format:

Field	Type	Description
Window value delta offset	`T`	Offset applied to values in the output window, where `T` is the datatype of the tile values.
Window length	`uint32_t`	Number of bytes in output window data.

The positive-delta encoding filter produces output data in the format:

Field	Type	Description
Window 1	`T[]`	Window 1 delta-encoded data
…	…	…
Window N	`T[]`	Window N delta-encoded data

Compression filters¶

The compression filters do filter input metadata.

The compression filters produce output metadata in the format:

Field	Type	Description
Number of metadata parts	`uint32_t`	Number of input metadata parts that were compressed
Number of data parts	`uint32_t`	Number of input data parts that were compressed
Metadata part 1	`CompressedPartMD`	Metadata about the first metadata
…	…	…
Metadata part N	`CompressedPartMD`	Metadata about the nth metadata part
Data part 1	`CompressedPartMD`	Metadata about the first data part
…	…	…
Data part N	`CompressedPartMD`	Metadata about the nth data part

The type CompressedPartMD has the format:

Field	Type	Description
Part original length	`uint32_t`	Input length of the part (before compression)
Part compressed length	`uint32_t`	Compressed length of the part

The compression filters then produce output data in the format:

Field	Type	Description
Metadata part 0 compressed bytes	`uint8_t[]`	Compressed bytes of the first metadata part
…	…	…
Metadata part N compressed bytes	`uint8_t[]`	Compressed bytes of the nth metadata part
Data part 0 compressed bytes	`uint8_t[]`	Compressed bytes of the first data part
…	…	…
Data part N compressed bytes	`uint8_t[]`	Compressed bytes of the nth data part

Internal formats¶

As mentioned, any file written by TileDB including attribute data, array schema, fragment metadata, coordinates or offsets, is treated as an array of bytes and broken up into separate tiles before writing. The previous section defined the on-disk format of files written by TileDB in terms of tiles and filter metadata.

This section describes the data contained in each file, independent of any tiling. In other words, the format structures defined here comprise unfiltered tile data, which is treated as an array of bytes, broken into TileChunks, filtered, and written to disk with the format described in the previous section. We refer to the byte format of unfiltered tile data as the “internal” format.

Array schema file¶

The file __array_schema.tdb has the internal format:

Field	Type	Description
Array schema	`ArraySchema`	The serialized array schema.

The type ArraySchema has the internal format:

Field	Type	Description
Array version	`uint32_t`	Format version number of the array schema
Array type	`uint8_t`	Dense or sparse
Tile order	`uint8_t`	Row or column major
Cell order	`uint8_t`	Row or column major
Capacity	`uint64_t`	For sparse arrays, the data tile capacity
Coords filters	`FilterPipeline`	The filter pipeline used for coordinate tiles
Offsets filters	`FilterPipeline`	The filter pipeline used for cell var-len offset tiles
Domain	`Domain`	The array domain
Num attributes	`uint32_t`	Number of attributes in the array
Attribute 1	`Attribute`	First attribute
…	…	…
Attribute N	`Attribute`	Nth attribute

The type Domain has the internal format:

Field	Type	Description
Type	`uint8_t`	Datatype of dimension values (`TILEDB_INT32`, `TILEDB_FLOAT64`, etc).
Num dimensions	`uint32_t`	Dimensionality/rank of the domain
Dimension 1	`Dimension`	First dimension
…	…	…
Dimension N	`Dimension`	Nth dimension

The type Dimension has the internal format:

Field	Type	Description
Dimension name length	`uint32_t`	Number of characters in dimension name (the following array)
Dimension name	`char[]`	Dimension name character array
Domain	`uint8_t[]`	Byte array of length `2*sizeof(DimT)`, storing the min, max values of the dimension (of type `DimT`).
Null tile extent	`uint8_t`	`1` if the dimension has a null tile extent, else `0`.
Tile extent	`uint8_t[]`	Byte array of length `sizeof(DimT)`, storing the space tile extent of this dimension.

The type Attribute has the internal format:

Field	Type	Description
Attribute name length	`uint32_t`	Number of characters in attribute name (the following array)
Attribute name	`char[]`	Attribute name character array
Attribute datatype	`uint8_t`	Datatype of the attribute values
Cell val num	`uint32_t`	Number of attribute values per cell. For variable-length attributes, this is `std::numeric_limits<uint32_t>::max()`
Filters	`FilterPipeline`	The filter pipeline used on attribute value tiles

The type FilterPipeline has the internal format:

Field	Type	Description
Max chunk size	`uint32_t`	Maximum chunk size within a tile
Num filters	`uint32_t`	Number of filters in pipeline
Filter 1	`Filter`	First filter
…	…	…
Filter N	`Filter`	Nth filter

The type Filter has the internal format:

Field	Type	Description
Filter type	`uint8_t`	Type of filter (e.g. `TILEDB_FILTER_BZIP2`)
Filter metadata size	`uint32_t`	Number of bytes in filter metadata (the following array) — may be 0.
Filter metadata	`uint8_t[]`	Filter metadata, specific to each filter. E.g. compression level for compression filters.

The filter metadata contains configuration parameters for the filters that do not change once the array schema has been created. For the compression filters (any of the filter types TILEDB_FILTER_{GZIP,ZSTD,LZ4,RLE,BZIP2,DOUBLE_DELTA }) the filter metadata has the internal format:

Field	Type	Description
Compressor type	`uint8_t`	Type of compression (e.g. `TILEDB_BZIP2`)
Compression level	`int32_t`	Compression level used (ignored by some compressors).

The filter metadata for TILEDB_FILTER_BIT_WIDTH_REDUCTION has the internal format:

Field	Type	Description
Max window size	`uint32_t`	Maximum window size in bytes

The filter metadata for TILEDB_FILTER_POSITIVE_DELTA has the internal format:

Field	Type	Description
Max window size	`uint32_t`	Maximum window size in bytes

The remaining filters (TILEDB_FILTER_BITSHUFFLE and TILEDB_FILTER_BYTESHUFFLE) do not serialize any metadata.

Array lock file¶

The file __lock.tdb is always an empty file on disk.

Fragment metadata file¶

The file __fragment_metadata.tdb has the internal format:

Field	Type	Description
R-Tree	`RTree`	The serialized R-Tree.
Tile offsets for attribute 1	`TileOffsets`	The serialized tile offsets for attribute 1.
…	…	…
Tile offsets for attribute N	`TileOffsets`	The serialized tile offsets for attribute N.
Variable tile offsets for attribute 1	`TileOffsets`	The serialized variable tile offsets for attribute 1.
…	…	…
Variable tile offsets for attribute N	`TileOffsets`	The serialized variable tile offsets for attribute N.
Variable tile sizes for attribute 1	`TileSizes`	The serialized variable tile sizes for attribute 1.
…	…	…
Variable tile sizes for attribute N	`TileSizes`	The serialized variable tile sizes for attribute N.
Footer	`Footer`	Basic metadata.

The type RTree is a generic tile with the following internal format:

Field	Type	Description
Dimensionality	`uint32_t`	Number of dimensions.
Fanout	`uint32_t`	The tree fanout.
Type	`uint8_t`	The domain datatype.
Number of levels	`uint32_t`	The number of levels in the tree.
Num MBRs at level 1	`uint64_t`	The number of MBRs at level 1.
MBR 1 at level 1	`uint8_t[]`	Byte array of two coordinate tuples storing the min/max coords of the first MBR at level 1
…	…	…
MBR N at level 1	`uint8_t[]`	Byte array of two coordinate tuples storing the min/max coords of the Nth MBR at level 1
…	…	…
Num MBRs at level L	`uint64_t`	The number of MBRs at level L.
MBR 1 at level L	`uint8_t[]`	Byte array of two coordinate tuples storing the min/max coords of the first MBR at level L
…	…	…
MBR N at level L	`uint8_t[]`	Byte array of two coordinate tuples storing the min/max coords of the Nth MBR at level L

The type TileOffsets is a generic tile with the following internal format:

Field	Type	Description
Num tile offsets	`uint64_t`	Number of tile offsets.
Tile offset 1	`uint64_t`	Offset 1.
…	…	…
Tile offset N	`uint64_t`	Offset N.

The type TileSizes is a generic tile with the following internal format:

Field	Type	Description
Num tile sizes	`uint64_t`	Number of tile sizes.
Tile size 1	`uint64_t`	Size 1.
…	…	…
Tile size N	`uint64_t`	Size N.

The type Footer has the following internal format:

Field	Type	Description
Version number	`uint32_t`	Format version number of the fragment.
Null non-empty domain	`char`	Indicates whether the non-empty domain is null or not.
Non-empty domain	`uint8_t[]`	Byte array of two coordinate tuples storing the min/max coords of a bounding box surrounding the non-empty domain of the fragment.
Number of sparse tiles	`uint64_t`	Number of sparse tiles.
Last tile cell num	`uint64_t`	For sparse arrays, the number of cells in the last tile in the fragment.
File sizes	`uint64_t[]`	The size in bytes of each attribute file in the fragment, including coords. For var-length attributes, this is the size of the offsets file.
File var sizes	`uint64_t[]`	The size in bytes of each var-length attribute file in the fragment.
R-Tree offset	`uint64_t`	The offset to the generic tile storing the R-Tree in the metadata file.
Tile offset for attribute 1	`uint64_t`	The offset to the generic tile storing the tile offsets for attribute 1.
…	…	…
Tile offset for attribute N+1	`uint64_t`	The offset to the generic tile storing the tile offsets for attribute N+1 (N+1 stands for coordinates.
Tile var offset for attribute 1	`uint64_t`	The offset to the generic tile storing the variable tile offsets for attribute 1.
…	…	…
Tile var offset for attribute N	`uint64_t`	The offset to the generic tile storing the variable tile offsets for attribute N.
Tile var sizes offset for attribute 1	`uint64_t`	The offset to the generic tile storing the variable tile sizes for attribute 1.
…	…	…
Tile var sizes offset for attribute N	`uint64_t`	The offset to the generic tile storing the variable tile sizes for attribute N.

Coords file¶

Within a sparse fragment, the file __coords.tdb has the following internal format:

Field	Type	Description
Dim 1 coordinate values	`DimT[]`	Array of the first dimension values for all coordinate tuples of cells in the fragment.
…	…
Dim N coordinate values	`DimT[]`	Array of the nth dimension values for all coordinate tuples of cells in the fragment.

Attribute file¶

Within a fragment, each fixed-length attribute named <attr> has a file <attr>.tdb with the internal format:

Field	Type	Description
Attribute values	`AttrT[]`	Array of the attribute values for all cells.

Each var-length attribute named <attr> has two files, <attr>_var.tdb storing the variable-length data for the attribute in each cell, and <attr>.tdb storing the offsets in <attr_var>.tdb for the data belonging to each cell. <attr>.tdb has the internal format:

Field	Type	Description
Attribute offsets	`uint64_t`	Array of the attribute value offsets in corresponding file `<attr>_var.tdb`.

And <attr>_var.tdb has the format:

Field	Type	Description
Attribute values	`AttrT[]`	Array of the attribute values for all cells.