Dense Arrays

In this tutorial we will learn how to create, read, and write a simple dense array in TileDB.

Full programs

Program            Links
quickstart_dense   quickstartcpp, quickstartpy

Basic concepts and definitions

Cell, Dimension, Domain

An array in TileDB is an n-dimensional collection of cells, where each cell is uniquely identified by a coordinate tuple whose length equals the dimensionality of the array. For example, every cell in a 2D array is identified by a coordinate pair (i, j), whereas every cell in a 3D array is identified by a coordinate triple (i, j, k). Each dimension in the array has an associated domain, which defines the data type and the range (min, max) of coordinate values for that dimension. The dimension domain can be of type int8, uint8, int16, uint16, int32, uint32, int64, uint64, float32, or float64. Notice that TileDB supports negative as well as real dimension domains, but for now we will focus only on positive integer domains. The ordered set of dimensions comprises the array domain.
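To make these definitions concrete, here is a minimal sketch that models an array domain in plain Python. It does not use the TileDB API; the `domain` list and the `all_coordinates` helper are hypothetical names introduced only for illustration, and each dimension is assumed to be an inclusive (min, max) pair of integers.

```python
from itertools import product

# Hypothetical model (not TileDB API): one inclusive (min, max) pair per
# dimension, listed in dimension order.
domain = [(1, 4), (1, 4)]  # a 2D domain: rows 1..4, cols 1..4


def all_coordinates(domain):
    """Return every coordinate tuple in the domain, last dimension varying fastest."""
    ranges = [range(lo, hi + 1) for lo, hi in domain]
    return list(product(*ranges))


coords = all_coordinates(domain)
print(len(coords))   # number of cells in the domain
print(coords[:5])    # the first few coordinate pairs (i, j)
```

Each tuple has one entry per dimension, so adding a third (min, max) pair to `domain` would produce coordinate triples instead of pairs.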

Note

In TileDB, currently all dimension domains must have the same type.

Attribute

In TileDB, a cell is not limited to storing a single value. Each cell stores a tuple with a structure that is common to all cells. Each tuple element corresponds to a value on a named attribute of a certain type. The array cells can be perceived as rows in a table, where each column is an attribute and each row is uniquely identified by the cell coordinates. An attribute can specify a single value of type char, int8, uint8, int16, uint16, int32, uint32, int64, uint64, float32, or float64, or a fixed- or variable-sized vector of the above primitive types.
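The table analogy above can be sketched in plain Python. This is only an illustration, not TileDB code: the `Cell` type, the second attribute `label`, and the sample values are all hypothetical (the array built in this tutorial has a single attribute `a`).

```python
from collections import namedtuple

# Hypothetical cell structure with two attributes; every cell shares it.
Cell = namedtuple("Cell", ["a", "label"])

# Table view: each row is keyed by the cell coordinates,
# and each column corresponds to one attribute.
table = {
    (1, 1): Cell(a=1, label="x"),
    (1, 2): Cell(a=2, label="y"),
    (2, 1): Cell(a=5, label="z"),
}

for coords, cell in sorted(table.items()):
    print(coords, cell.a, cell.label)
```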

Array schema

The structure of the array, i.e., the number of dimensions and the type of their domains, the number and type of attributes, and a lot of other information covered in later tutorials, is defined in the array schema. The array schema is very similar to a table schema in a database.

Dense array

If every cell in the array has an associated value, such as a pixel in a 2D image, we call the array dense.

Array directory

An array is stored on persistent storage as a directory containing subdirectories and files. We will explain in later tutorials the benefits of such a physical organization, and how a “directory” translates to storage backends where directories are not treated in the same manner as in a local POSIX filesystem (e.g., the S3 object store).

Subarray

A subarray is a slice of the array domain, used in queries.

Creating a dense array

C++

The following snippet creates an empty array schema for a dense array:

Context ctx;
ArraySchema schema(ctx, TILEDB_DENSE);

Next, we define a 2D domain where the coordinates can be integer values from 1 to 4 (inclusive) along both dimensions. For now, you can ignore the last argument in the dimension constructor (tile extent).

Domain domain(ctx);
domain.add_dimension(Dimension::create<int>(ctx, "rows", {{1, 4}}, 4))
    .add_dimension(Dimension::create<int>(ctx, "cols", {{1, 4}}, 4));

Then, attach the domain to the schema, and configure a few other parameters (cell and tile ordering) that are explained in later tutorials:

schema.set_domain(domain).set_order({{TILEDB_ROW_MAJOR, TILEDB_ROW_MAJOR}});

Finally, create a single attribute named a for the array that will hold a single integer for each cell:

schema.add_attribute(Attribute::create<int>(ctx, "a"));

Python

First we define a 2D domain where the coordinates can be integer values from 1 to 4 (inclusive) along both dimensions. For now, you can ignore the tile argument in the dimension constructor (tile extent).

import numpy as np
import tiledb

dom = tiledb.Domain(tiledb.Dim(name="rows", domain=(1, 4), tile=4, dtype=np.int32),
                    tiledb.Dim(name="cols", domain=(1, 4), tile=4, dtype=np.int32))

Next we create the schema object, attaching the domain and a single attribute a that will hold a single integer for each cell:

schema = tiledb.ArraySchema(domain=dom, sparse=False,
                            attrs=[tiledb.Attr(name="a", dtype=np.int32)])

Note

The order of the dimensions (as added to the domain) is important later when specifying subarrays. For instance, in the above schema, subarray [1,2], [2,4] means slice the first two values in the rows dimension domain, and values 2,3,4 in the cols dimension domain.
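The effect of dimension order can be shown with a small plain-Python sketch. The helper `cells_in_subarray` is a hypothetical name, not part of TileDB; it assumes a 2D subarray given as one inclusive (min, max) pair per dimension, in the order the dimensions were added to the domain.

```python
# Hypothetical helper (not TileDB API): enumerate the cells covered by a
# 2D subarray whose bounds are inclusive on both ends.
def cells_in_subarray(subarray):
    (r_lo, r_hi), (c_lo, c_hi) = subarray
    return [(r, c)
            for r in range(r_lo, r_hi + 1)
            for c in range(c_lo, c_hi + 1)]


# [1,2] applies to rows and [2,4] to cols, because rows was added first.
rows_then_cols = cells_in_subarray([(1, 2), (2, 4)])

# Swapping the two ranges selects a different set of cells.
swapped = cells_in_subarray([(2, 4), (1, 2)])

print(rows_then_cols)
print(swapped)
```

Both selections contain six cells, but they cover different regions of the array, which is why the ranges must be listed in dimension order.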

All that is left to do is create the empty array on disk so that it can be written to. We specify the name of the array to create, and the schema to use. This command will essentially persist the array schema we just created on disk.

C++

std::string array_name("quickstart_dense_array");
Array::create(array_name, schema);

Python

array_name = "quickstart_dense"
tiledb.DenseArray.create(array_name, schema)

Note

The array name here will be used to create a data directory in the current working path (see On-disk structure below). The array name can also be a full URI, for example a path like file:///home/username/my_array or an S3 URI like s3://bucket-name/array-name.

Writing to the array

We will populate the array with values 1, 2, ..., 16. To start, prepare the data to be written:

C++

std::vector<int> data = {
    1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16};

Python

data = np.array(([1, 2, 3, 4],
                 [5, 6, 7, 8],
                 [9, 10, 11, 12],
                 [13, 14, 15, 16]))

C++

Next, open the array for writing, and create a query object:

Context ctx;
Array array(ctx, array_name, TILEDB_WRITE);
Query query(ctx, array);

Then, set up the query. We set the buffer for attribute a, and also set the layout of the cells in the buffer to row-major.

query.set_layout(TILEDB_ROW_MAJOR).set_buffer("a", data);

Finally, submit the query and close the array.

query.submit();
array.close();

Python

Next, open the array for writing and write the data to the array:

with tiledb.DenseArray(array_name, mode='w') as A:
    A[:] = data

By default, the Python API issues the write query in row-major layout.

The cell layout is covered thoroughly in later tutorials. For now, it is enough to know that you are telling TileDB that the cell values in your buffer will be written to the cells of the array in row-major order (i.e., 1 will be stored in cell (1,1), 2 in (1,2), etc.).
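As an illustration of what row-major order means for the write above, the following plain-Python sketch maps each position of a flat buffer to the cell that receives it. The function `row_major_cells` is a hypothetical helper, not a TileDB call, and it assumes a 2D region with inclusive bounds.

```python
# Hypothetical helper (not TileDB API): pair each buffer value with the
# cell it lands in when a 2D region is filled in row-major order.
def row_major_cells(values, row_range, col_range):
    r_lo, r_hi = row_range
    c_lo, c_hi = col_range
    n_cols = c_hi - c_lo + 1
    placed = {}
    for offset, value in enumerate(values):
        row = r_lo + offset // n_cols   # move to the next row every n_cols values
        col = c_lo + offset % n_cols    # position within the current row
        placed[(row, col)] = value
    return placed


buffer = list(range(1, 17))             # the values 1, 2, ..., 16
cells = row_major_cells(buffer, (1, 4), (1, 4))
print(cells[(1, 1)], cells[(1, 2)], cells[(2, 1)], cells[(4, 4)])
```

The fifth buffer value ends up in cell (2,1), since a new row starts after every four values.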

The array data is now stored on disk. The resulting array is depicted in the figure below.

[Figure: the resulting 4x4 array, with subarray [1,2], [2,4] highlighted in blue (quickstart_dense1.png)]

Reading from the array

We will next explain how to read the cell values in subarray [1,2], [2,4], i.e., in the blue rectangle shown in the figure above. The result values should be 2 3 4 6 7 8, reading in row-major order (i.e., first the three selected columns of row 1, then the three selected columns of row 2).

C++

Reading happens in much the same way as writing, except we must provide buffers sufficient to hold the data being read. First, open the array for reading:

Context ctx;
Array array(ctx, array_name, TILEDB_READ);

Next, specify the subarray in terms of (min, max) values on each dimension. Also define the buffer that will hold the result, making sure that it has enough space (six elements here, as the result of the subarray will be six integers). Proper result buffer allocation is an important topic that is covered in detail in later tutorials.

const std::vector<int> subarray = {1, 2, 2, 4};
std::vector<int> data(6);

Then, we set up and submit a query object, and close the array, similarly to writes.

Query query(ctx, array);
query.set_subarray(subarray)
     .set_layout(TILEDB_ROW_MAJOR)
     .set_buffer("a", data);
query.submit();
array.close();
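The result buffer in the C++ snippet above holds six elements because the subarray covers six cells. A rough way to derive that number is sketched below in plain Python. The helper `subarray_cell_count` is hypothetical, and the calculation assumes a dense array with a single fixed-size value per cell; buffer sizing in general is covered in later tutorials.

```python
# Hypothetical helper (not TileDB API): count the cells in a subarray given
# in the flat form [min0, max0, min1, max1, ...] with inclusive bounds.
def subarray_cell_count(flat_subarray):
    count = 1
    for lo, hi in zip(flat_subarray[0::2], flat_subarray[1::2]):
        count *= hi - lo + 1   # number of coordinates along this dimension
    return count


print(subarray_cell_count([1, 2, 2, 4]))  # 2 rows * 3 cols
```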

Python

Reading works in much the same way as writing; simply specify a different mode when opening the array:

with tiledb.DenseArray(array_name, mode='r') as A:
    # Slice only rows 1, 2 and cols 2, 3, 4.
    data = A[1:3, 2:5]
    print(data["a"])

Again, by default, the Python API issues the read query in row-major layout.

The row-major layout here means that the cells will be returned in row-major order within the subarray [1,2], [2,4] (more information on cell layouts is covered in later tutorials).
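Note that the subarray [1,2], [2,4] is written with inclusive bounds, while the Python slice `A[1:3, 2:5]` used above is half-open, as is usual in Python. The following sketch shows the conversion using a plain nested list in place of a TileDB array; `to_slices` and `grid` are hypothetical names used only for this illustration.

```python
# Hypothetical helper (not TileDB API): turn inclusive (min, max) pairs
# into half-open Python slices by adding one to each upper bound.
def to_slices(subarray):
    return tuple(slice(lo, hi + 1) for lo, hi in subarray)


# Plain nested list standing in for the 4x4 array. Index 0 is unused padding
# so that list positions line up with the tutorial's 1-based coordinates.
grid = [None] + [[None] + list(range(r * 4 + 1, r * 4 + 5)) for r in range(4)]

row_slice, col_slice = to_slices([(1, 2), (2, 4)])
result = [row[col_slice] for row in grid[row_slice]]
print(row_slice, col_slice)
print(result)
```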

Now data holds the result cell values on attribute a. If you compile and run the example of this tutorial as shown below, you should see the following output:

C++

$ g++ -std=c++11 quickstart_dense.cc -o quickstart_dense -ltiledb
$ ./quickstart_dense
2 3 4 6 7 8

Python

$ python quickstart_dense.py
[[2 3 4]
 [6 7 8]]

On-disk structure

A TileDB array is stored on disk as a directory with the name given at the time of array creation. If we look into the array on disk after it has been written to, we will see something like the following:

$ ls -l quickstart_dense_array/
total 8
drwx------  4 stavros  staff  128 Jun 25 15:18 __1561490302161_1561490302161_15bab0281e2e44f2a803eb6f3001ed00
-rwx------  1 stavros  staff  149 Jun 25 15:18 __array_schema.tdb
-rwx------  1 stavros  staff    0 Jun 25 15:18 __lock.tdb

The array directory and the files __array_schema.tdb and __lock.tdb were written upon array creation, whereas the subdirectory __1561490302161_1561490302161_15bab0281e2e44f2a803eb6f3001ed00 was created when the array was written. This subdirectory, called a fragment, contains the written cell values for attribute a in file a.tdb, along with associated metadata:

$ ls -l quickstart_dense_array/__1561490302161_1561490302161_15bab0281e2e44f2a803eb6f3001ed00/
total 16
-rwx------  1 stavros  staff  602 Jun 25 15:18 __fragment_metadata.tdb
-rwx------  1 stavros  staff   84 Jun 25 15:18 a.tdb
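The fragment directory name in the listing above looks like two millisecond timestamps followed by a unique identifier. That reading is an assumption about the naming scheme of this storage format version, not something this tutorial states, and the scheme may differ in other TileDB versions. Under that assumption, a hypothetical helper such as `parse_fragment_name` below could decode the name in plain Python.

```python
from datetime import datetime, timedelta, timezone


# Hypothetical helper. ASSUMPTION: the name has the form '__<t1>_<t2>_<uuid>',
# where t1 and t2 are milliseconds since the Unix epoch.
def parse_fragment_name(name):
    t1, t2, uuid = name.lstrip("_").split("_")

    def to_utc(ms_text):
        ms = int(ms_text)
        # Integer arithmetic avoids floating-point rounding of the milliseconds.
        return (datetime.fromtimestamp(ms // 1000, tz=timezone.utc)
                + timedelta(milliseconds=ms % 1000))

    return to_utc(t1), to_utc(t2), uuid


start, end, uuid = parse_fragment_name(
    "__1561490302161_1561490302161_15bab0281e2e44f2a803eb6f3001ed00")
print(start.isoformat(), end.isoformat(), uuid)
```

For the name shown here, both timestamps decode to the same instant on 25 June 2019, which is consistent with the modification date in the directory listing once the local time zone is taken into account.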

The TileDB array hierarchy on disk and more details about fragments are discussed in later tutorials.