Arceus - Reading Notes on Data Blocks

Data Blocks: Hybrid OLTP and OLAP on Compressed Storage using both Vectorization and Compilation (SIGMOD 2016, TUM)

Data Blocks is a storage layer technology developed by the TUM database group to reduce the memory footprint of HyPer (OLTP + OLAP hybrid main memory database) without sacrificing existing performance.

Questions

To reduce the memory footprint, the data is divided into cold and hot parts, with the cold data being compressed to reduce memory pressure (and evicted out of memory) and the hot data not compressed. However, unlike some typical OLAP systems, HyPer does not use very high compression ratios, but rather restraint in the use of lightweight compression (byte-addressable compression formats), mainly for the sake of OLTP workoad performance (efficient point accesses). The SARGable scan restrictions (i.e., =, is, <, ≤, >, ≥, between) can be compared directly on this lightly compressed data. Query acceleration techniques such as vectorized execution for highly compressed columnar data are not suitable for OLTP, which generally does not require access to large amounts of data, but rather to locate data quickly.

As a remark, identifying hot and cold data is not the focus of this article; there are many off-the-shelf algorithms that solve this problem (e.g., Anti-Caching).

Abstract

This paper contributes to three main areas, the second of which is likely to be more widely understood.

Design of Data Blocks // designed for cold data, Chunk compressed into Data Blocks
Lightweight indexing (PSMA) // designed for Data Blocks
SIMD algorithm for accelerated predicate evaluation // done on Data Block

Design of Data Blocks

When a chunk is considered cold, it is compressed into a read-optimized immutable Data Block (not writable). If the data on the cold Data Block is to be updated, the data on the Data Block is first deleted (marked with a flag) and a new one is inserted on top of the hot data. This ensures that OLTP performance is maintained.

Lightweight indexing designed for Data Blocks

The Data Block layout is not much more than meta data + compressed data. What is special about the layout is that a lightweight index, PSMA (Positional SMAs), is stored for each column. Compared to traditional SMAs that only store an interval mix max, PSMAs further reduce the possible query intervals. That is, if a point look up falls within an SMA in this interval, the PSMA narrows down the data to be scanned instead of scanning the entire Data Block (for that query condition column) directly.

The PSMA query process is as follows.
First a number v is subtracted from the SMA min in the data block interval of the column to obtain a delta value
The highest non-zero byte value of this delta value is , and the remaining number of bytes is r then the index value
Query the look up table of the PSMA with i as the key, and get a range, which is the interval where the final result exists

Understanding the query principle, it is easy to construct a PSMA by scanning through the data and continuously updating each range.

Conclusions

Question: Can PSMA only be used to speed up point queries? If it is an OLAP range query, PSMA has to do multiple look up + union operations for each range, which is not necessarily cost effective.

Then there is the compression of Data Blocks. One of the main features is that each Data Block may have a different compression algorithm for each column, and they will choose the one that is best for their data (the one that minimises memory usage). As for specific compression algorithms, Data Blocks consider three algorithms that they believe balance compression ratio and query performance.

single value compression // if all values of an attribute in a block are equa, e.g., null
ordered dictionary compression // the relative order of keys before compression is the same as after compression
truncation // value - min, does not apply to double and string, string will always be compressed to integer

When a query comes in, first see if you can exclude the Data Block with SMA and PSMA or narrow the query, then use vectorized execution to compare directly on the compressed data.

However, the fact that each chunk is compressed differently poses a design challenge for the JIT tuple-at-a-time engine (just-in-time compilation of SQL queries directly into executable code). If the layout of each Data Block is different, the number of code paths to be generated by scan grows exponentially. To solve this problem, the authors took advantage of the vectorized scan remain interpreted and can be pre-compiled capability and combined it with JIT.

Resource

Harald Lang, Tobias Mühlbauer, Florian Funke, Peter A. Boncz, Thomas Neumann, and Alfons Kemper. 2016. Data Blocks: Hybrid OLTP and OLAP on Compressed Storage using both Vectorization and Compilation. In Proceedings of the 2016 International Conference on Management of Data (SIGMOD '16). Association for Computing Machinery, New York, NY, USA, 311–326. https://doi.org/10.1145/2882903.2882925