Variable-length Attributes

In this tutorial we will learn how to use variable-length attributes. It is recommended to read the tutorial on dense arrays first.

Warning

Variable-length attributes are not yet supported in the Python API.

Full programs

Program

Links

variable_length

varlencpp

Basic concepts and definitions

Variable-length attributes

A variable-length attribute allows each cell to store a variable-length list of values of the same type. For example, each cell could store a variable-length list of characters (a string value).

Creating the array

This is similar to what we covered in the simple dense array example. The only difference is that our two attributes are variable-length. Specifically, a1 will store strings, whereas a2 will store a variable-length list of integers.

C++

schema.add_attribute(Attribute::create<std::string>(ctx, "a1"));
schema.add_attribute(Attribute::create<std::vector<int>>(ctx, "a2"));

Writing to the array

Since each cell may now store a variable number of values, when writing, you must somehow tell TileDB which values belong to which cell. In TileDB, you do this by providing two input buffers (instead of one as in the fixed-length attributes); one buffer holds the actual cell data, whereas another stores the starting offsets (in bytes) of each variable-length cell.

Suppose we wish to populate the array with cell values on a1 and a2 as shown in the figure below.

../_images/variable_length.png

Let’s take a look at the code below and focus on a1 first. Notice that we create buffer a1_data which stores the cell values in row-major order, e.g., cell (1,1) stores a, (1,2) stores bb, etc. If a cell stores more than one value, we just place its values in the buffer contiguously, such as bb for cell (1,2). The problem is that, only by looking at a1_data, TileDB has no way of discerning the cell value “limits”. Therefore, we construct an extra buffer a1_off which stores the offsets where the first value of each cell starts in a1_data, e.g., a1_off[0] = 0 means that a starts at offset 0 in a1_data (in bytes), a1_off[1] = 1 means bb starts at offset 1, a1_off[2] = 3 means ccc starts at offset 3, etc.

C++

// Prepare the buffers
std::string a1_data = "abbcccddeeefghhhijjjkklmnoop";
std::vector<uint64_t> a1_off = {
    0, 1, 3, 6, 8, 11, 12, 13, 16, 17, 20, 22, 23, 24, 25, 27};
std::vector<int> a2_data = {1, 1, 2, 2,  3,  4,  5,  6,  6,  7,  7,  8,  8,
                            8, 9, 9, 10, 11, 12, 12, 13, 14, 14, 14, 15, 16};
std::vector<uint64_t> a2_el_off = {
    0, 2, 4, 5, 6, 7, 9, 11, 14, 16, 17, 18, 20, 21, 24, 25};
std::vector<uint64_t> a2_off;
for (auto e : a2_el_off)
  a2_off.push_back(e * sizeof(int));

// Prepare and submit the query
Array array(ctx, array_name, TILEDB_WRITE);
Query query(ctx, array);
query.set_layout(TILEDB_ROW_MAJOR)
     .set_buffer("a1", a1_off, a1_data)
     .set_buffer("a2", a2_off, a2_data);
query.submit();
array.close();

Note that the offsets buffer stores offsets in bytes. That was easy for a1 where each character consumes 1 byte. The case of a2 is a little different, thus, for simplicity, we take two steps. In the first step we construct a buffer a2_el_off that records the starting offsets in terms of elements in a2_data. For instance, 2,2 of cell (1,2) starts at element 2 in a2_data. Next, we create another buffer a2_off that stores the actual buffer offsets by multiplying the element offsets by the size of an integer. In the previous example 2,2 of cell (1,2) starts at byte 2*sizeof(int)=8 in a2_data. Note that TileDB expects a2_off, not a2_el_off.

Finally, similar to the fixed-length case we use set_buffer to add the buffers to the query, but now we provide both (byte) offset and data buffers.

Reading from the array

We focus on subarray [1,2], [2,4]. Recall that, in order to read from a TileDB array, we must allocate space for the buffers that will hold the result. For the variable-length case, this is a challenging task, since we do not know how many values each cell may be storing. Fortunately, TileDB has an auxiliary function that gives you an upper bound on how many elements your buffers need to store the results (note that this is an approximation). You can prepare the buffers as follows. Once again, we need two buffers for each attribute, one for the data and one for the offsets.

C++

auto max_el_map = array.max_buffer_elements(subarray);
std::vector<uint64_t> a1_off(max_el_map["a1"].first);
std::string a1_data;
a1_data.resize(max_el_map["a1"].second);
std::vector<uint64_t> a2_off(max_el_map["a2"].first);
std::vector<int> a2_data(max_el_map["a2"].second);

Next, we perform the query as usual, but now we set both the data and offset buffers. After completion, a1_data and a2_data will hold the result cell values , whereas a1_off and a2_off will store the starting offsets (in bytes) of the cell values in a1_data and a2_data, respectively. More specifically, a1_data will contain bbcccddfghhh, a1_off will contain 0, 2, 5, 7, 8, 9, a2_data will contain 2, 2, 3, 4, 6, 6, 7, 7, 8, 8, 8 and a2_off will contain 0, 8, 12, 16, 24, 32 (see figure above).

C++

Query query(ctx, array);
query.set_subarray(subarray)
     .set_layout(TILEDB_ROW_MAJOR)
     .set_buffer("a1", a1_off, a1_data)
     .set_buffer("a2", a2_off, a2_data);
query.submit();
array.close();

Warning

For the case of variable-length attributes, you should always use the auxialiary max_buffer_elements function to calculate the appropriate buffer sizes that will hold the result, even if you know the result size a priori. This is because TileDB may overestimate the buffer sizes needed and, hence, process a part of the query upon query.submit(), yielding an incomplete status (checked with query.query_status()). For more information about incomplete queries, see Incomplete queries. Allocating buffers using the sizes output by max_buffer_elements guarantees that the query will be completed and the whole result will be returned.

Perhaps the most cumbersome task is parsing the cell values given the data and offset buffers. Here is what we do for the strings of a1. We first calculate the string sizes using the offsets buffer. Then, we create a vector of strings (one per result cell), so that we make it easy to print later.

C++

// Get the string sizes
auto result_el_map = query.result_buffer_elements();
auto result_el_a1_off = result_el_map["a1"].first;
std::vector<uint64_t> a1_str_sizes;
for (size_t i = 0; i < result_el_a1_off - 1; ++i)
  a1_str_sizes.push_back(a1_off[i + 1] - a1_off[i]);
auto result_a1_data_size = result_el_map["a1"].second * sizeof(char);
a1_str_sizes.push_back(result_a1_data_size - a1_off[result_el_a1_off - 1]);

// Get the strings
std::vector<std::string> a1_str;
for (size_t i = 0; i < result_el_a1_off; ++i)
  a1_str.push_back(std::string(&a1_data[a1_off[i]], a1_str_sizes[i]));

For the integers of a2, we first calculate the element offsets from the byte offsets in a2_off, and then we calculate the number of elements per result cell. Once again, this will simplify printing the result.

C++

// Get the element offsets
std::vector<uint64_t> a2_el_off;
auto result_el_a2_off = result_el_map["a2"].first;
for (size_t i = 0; i < result_el_a2_off; ++i)
  a2_el_off.push_back(a2_off[i] / sizeof(int));

// Get the number of elements per cell value
std::vector<uint64_t> a2_cell_el;
for (size_t i = 0; i < result_el_a2_off - 1; ++i)
  a2_cell_el.push_back(a2_el_off[i + 1] - a2_el_off[i]);
auto result_el_a2_data = result_el_map["a2"].second;
a2_cell_el.push_back(result_el_a2_data - a2_el_off.back());

Finally, we print the result as follows.

C++

for (size_t i = 0; i < result_el_a1_off; ++i) {
  std::cout << "a1: " << a1_str[i] << ", a2: ";
  for (size_t j = 0; j < a2_cell_el[i]; ++j)
    std::cout << a2_data[a2_el_off[i] + j] << " ";
  std::cout << "\n";
}

If you compile and run the example of this tutorial as shown below, you should see the following output:

$ g++ -std=c++11 variable_length.cc -o variable_length -ltiledb
$ ./variable_length
a1: bb, a2: 2 2
a1: ccc, a2: 3
a1: dd, a2: 4
a1: f, a2: 6 6
a1: g, a2: 7 7
a1: hhh, a2: 8 8 8

On-disk structure

Let us look at the contents of the array of this example on disk.

$ ls -l variable_length_array/
total 8
drwx------  7 stavros  staff  224 Jun 25 15:38 __1561491531226_1561491531226_3e56db7d25a447708a73d3e578622ab4
-rwx------  1 stavros  staff  155 Jun 25 15:38 __array_schema.tdb
-rwx------  1 stavros  staff    0 Jun 25 15:38 __lock.tdb

$ ls -l variable_length_array/__1561491531226_1561491531226_3e56db7d25a447708a73d3e578622ab4/
total 40
-rwx------  1 stavros  staff  945 Jun 25 15:38 __fragment_metadata.tdb
-rwx------  1 stavros  staff  100 Jun 25 15:38 a1.tdb
-rwx------  1 stavros  staff   48 Jun 25 15:38 a1_var.tdb
-rwx------  1 stavros  staff  100 Jun 25 15:38 a2.tdb
-rwx------  1 stavro  staff  124 Jun 25 15:38 a2_var.tdb

Observe that, contrary to the case of fixed-length attributes, TileDB stores two files for each variable-length attribute. Specifically, a1_var.tdb and a2_var.tdb store the actual cell values (which are of variable length), whereas a1.tdb and a2.tdb store the corresponding starting offsets (in bytes). In other words, TileDB adopts a “columnar” format by splitting the values from the offsets. The reason behind this choice is better compressibility (later tutorials explain this in more detail).