Database workloads#
OLTP#
OLTP: On-Line Transaction Processing
Fast operations that read and update a small range of data at a time.
OLAP#
OLAP: On-Line Analytical Processing
OLAP is used to support complex analytical operations and provide decision support.
HTAP#
Hybrid Transaction + Analytical Processing
A new workload that combines OLTP and OLAP.
Storage Models#
N-Ary Storage Model (NSM)#
In the n-ary storage model, the DBMS stores all attributes of a tuple contiguously in a single page.
This layout is ideal for OLTP workloads.
Advantages:
- Fast insertion, update, and deletion
- Efficient for queries that require the entire tuple
Disadvantages:
- Inefficient for queries that require scanning the entire table or only need one attribute
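To make the layout concrete, here is a minimal Python sketch of NSM, using a made-up three-attribute schema (not how a real DBMS formats pages; headers, slot arrays, and variable-length data are omitted):

```python
import struct

# Made-up schema: (id: int32, age: int32, salary: float64).
# NSM stores all attributes of one tuple contiguously.
TUPLE_FMT = struct.Struct("<iid")  # little-endian int32, int32, float64

page = bytearray()
for tup in [(1, 30, 55000.0), (2, 41, 72000.0), (3, 25, 48000.0)]:
    page += TUPLE_FMT.pack(*tup)   # append the whole tuple in one piece

def read_tuple(i):
    """A point lookup touches one contiguous slice -- cheap for OLTP."""
    return TUPLE_FMT.unpack_from(page, i * TUPLE_FMT.size)

print(read_tuple(1))  # (2, 41, 72000.0)
```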
Decomposition Storage Model (DSM)#
In the decomposition storage model, the DBMS stores the values of a single attribute for all tuples contiguously.
Also known as a "column store".
This is efficient for read-mostly OLAP operations, especially those that only scan a subset of the attributes.
Advantages:
- Reduces I/O waste
- Better for querying and compressing data
Disadvantages:
- Slow for point operations (single-tuple lookups, inserts, updates, and deletes), because each tuple must be split across or stitched together from multiple columns
To map the values in a column back to their tuples, there are two approaches:
- Fixed-length offsets: every value of an attribute is stored with the same fixed length, so the value belonging to tuple i is found directly from its offset in the column.
- Embedded tuple IDs: a less common approach that stores each value as a pair `(id, pos)`, indicating that the value for tuple `id` is stored at position `pos`.
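And a matching sketch of the fixed-length-offset approach (same made-up schema as above; again only a toy illustration): each attribute gets its own array, and position i in every array belongs to tuple i:

```python
import struct

# Same made-up schema, but each attribute lives in its own column.
ids, ages, salaries = bytearray(), bytearray(), bytearray()
for tid, age, salary in [(1, 30, 55000.0), (2, 41, 72000.0), (3, 25, 48000.0)]:
    ids += struct.pack("<i", tid)
    ages += struct.pack("<i", age)
    salaries += struct.pack("<d", salary)

# Fixed-length values mean tuple i's attribute sits at offset i * size.
def read_age(i):
    return struct.unpack_from("<i", ages, i * 4)[0]

# A scan touches only the one column it needs -- ideal for OLAP.
n = len(salaries) // 8
avg = sum(struct.unpack(f"<{n}d", salaries)) / n
print(read_age(2), avg)  # 25 58333.33...
```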
Database Compression#
I/O is slow and often the bottleneck of the entire system, so DBMSs make extensive use of compression to trade CPU time for reduced I/O.
Usually, we need to balance (de)compression speed against the compression ratio.
Compression Granularity#
- Block Level
- Tuple Level (NSM Only)
- Attribute Level
- Column Level (DSM Only)
Naive Compression#
One option is to use a general-purpose compression algorithm. However, once we do so, the DBMS knows nothing about the data it is operating on until the data has been decompressed.
Examples include LZO (1996), LZ4 (2011), Snappy (2011), Oracle OZIP (2014), and Zstd (2015).
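As a quick illustration using Python's built-in zlib (standing in for the faster codecs above): once a page is compressed this way, the DBMS must decompress all of it before it can read even a single value:

```python
import struct
import zlib

# A toy "page" holding 1000 int32 values.
page = b"".join(struct.pack("<i", i % 10) for i in range(1000))

blob = zlib.compress(page)              # opaque compressed bytes
print(len(page), "->", len(blob))       # 4000 -> far fewer bytes

# Drawback: to read even one value, the whole page must be
# decompressed first -- no predicate can run on the blob itself.
restored = zlib.decompress(blob)
print(struct.unpack_from("<i", restored, 42 * 4)[0])  # 2
```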
To do better, we need compression schemes that still allow the DBMS to access the data without fully decompressing it first.
Columnar Compression#
Run-Length Encoding (RLE)#
Runs of identical values in a single column can be compressed into triplets of the form `(value, pos, num)`, where:
- `value` is the repeated value
- `pos` is the starting position of the run in the column
- `num` is the number of times the value is repeated

However, RLE depends heavily on the ordering of the data: if equal values are scattered rather than clustered, the runs are short and little space is saved, so the DBMS ideally sorts the column before encoding.
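A minimal sketch of RLE with the `(value, pos, num)` triplets described above:

```python
def rle_encode(column):
    """Compress a column into (value, start_pos, run_length) triplets."""
    triplets = []
    for pos, value in enumerate(column):
        if triplets and triplets[-1][0] == value:
            v, p, n = triplets[-1]
            triplets[-1] = (v, p, n + 1)      # extend the current run
        else:
            triplets.append((value, pos, 1))  # start a new run
    return triplets

def rle_decode(triplets):
    column = []
    for value, pos, num in triplets:
        column.extend([value] * num)
    return column

col = ["F", "F", "F", "M", "M", "F"]
print(rle_encode(col))
# [('F', 0, 3), ('M', 3, 2), ('F', 5, 1)]
assert rle_decode(rle_encode(col)) == col
```

Note how the trailing `F` opens a fresh run: on unsorted data almost every value ends up in its own triplet, which is why sorting first matters.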
Bit-Packing Encoding#
Columns are often declared with a wider type than their values actually need, and bit-packing reduces this redundancy by storing each value in fewer bits.
For example, if every value in an `int64` column fits in 8 bits, converting it to `int8` cuts the required space by a factor of eight.
However, a few outlier values may not fit in the smaller type. A common workaround (the "mostly" encoding variant) keeps a special marker in the packed column and stores each oversized value in a separate patch table.
This only pays off when such outliers are rare, so the additional storage stays minimal.
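Here is a minimal sketch of that "mostly int8" idea (pure Python, with an arbitrarily chosen sentinel value; a real system would pack at the bit level):

```python
MARKER = 127  # sentinel meaning "look this value up in the patch table"

def mostly_int8_encode(values):
    """Pack values into one byte each; outliers go to a patch table."""
    packed, patches = [], {}
    for pos, v in enumerate(values):
        if -128 <= v < 127:          # fits in int8 (127 reserved as marker)
            packed.append(v)
        else:
            packed.append(MARKER)    # placeholder in the packed column
            patches[pos] = v         # full-width value stored separately
    return bytes(b & 0xFF for b in packed), patches

def mostly_int8_decode(packed, patches):
    out = []
    for pos, b in enumerate(packed):
        v = b - 256 if b >= 128 else b   # undo the two's-complement byte
        out.append(patches[pos] if v == MARKER else v)
    return out

values = [3, -7, 99, 1_000_000, 12]   # one value does not fit in int8
packed, patches = mostly_int8_encode(values)
print(len(packed), patches)           # 5 bytes + {3: 1000000}
assert mostly_int8_decode(packed, patches) == values
```

If outliers were common, the patch table would outgrow the savings, which is the caveat noted above.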
Bitmap Encoding#
When an attribute has only a few distinct values, we can use bitmap encoding: one bitmap per distinct value, where the i-th bit indicates whether tuple i has that value.
For example, for an attribute with only the two values `F` and `M`, we keep one bitmap for `F` and one for `M`, and each bit records a yes/no membership.
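A minimal sketch using Python integers as the bitmaps:

```python
def bitmap_encode(column):
    """Build one bitmap per distinct value; bit i marks tuple i."""
    bitmaps = {}
    for i, value in enumerate(column):
        bitmaps[value] = bitmaps.get(value, 0) | (1 << i)
    return bitmaps

column = ["F", "F", "M", "F", "M"]
bitmaps = bitmap_encode(column)
print({v: bin(b) for v, b in bitmaps.items()})
# {'F': '0b1011', 'M': '0b10100'}

# Predicates become bitwise operations: "WHERE sex = 'M'" is just the
# M bitmap, and combining predicates is AND/OR on integers.
is_m = bitmaps["M"]
print([i for i in range(len(column)) if is_m >> i & 1])  # [2, 4]
```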
Delta Encoding#
In many cases, consecutive values in a column differ only slightly; successive room-temperature readings, for instance, cluster in a narrow range.
Therefore, we can store the first value in full and record each subsequent value as a delta from its predecessor; the small deltas then compress further (e.g., with RLE or bit-packing).
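A minimal delta-encoding sketch:

```python
def delta_encode(values):
    """Store the first value in full, then successive differences."""
    if not values:
        return []
    deltas = [values[0]]
    for prev, cur in zip(values, values[1:]):
        deltas.append(cur - prev)
    return deltas

def delta_decode(deltas):
    values, acc = [], 0
    for d in deltas:
        acc += d
        values.append(acc)
    return values

temps = [99, 99, 100, 101, 100]        # readings in a narrow range
print(delta_encode(temps))             # [99, 0, 1, 1, -1] -- small numbers
assert delta_decode(delta_encode(temps)) == temps
```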
Incremental Encoding#
Incremental encoding (also called front coding) exploits the common prefixes or suffixes shared by consecutive values: each entry stores only the length of the shared prefix plus the differing remainder, which works best on sorted data.
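A minimal front-coding sketch for the shared-prefix case:

```python
import os

def incremental_encode(sorted_values):
    """Store (shared_prefix_len, suffix) relative to the previous value."""
    encoded, prev = [], ""
    for v in sorted_values:
        shared = len(os.path.commonprefix([prev, v]))
        encoded.append((shared, v[shared:]))
        prev = v
    return encoded

def incremental_decode(encoded):
    values, prev = [], ""
    for shared, suffix in encoded:
        prev = prev[:shared] + suffix
        values.append(prev)
    return values

words = ["rob", "robbed", "robbing", "robot"]
print(incremental_encode(words))
# [(0, 'rob'), (3, 'bed'), (4, 'ing'), (3, 'ot')]
assert incremental_decode(incremental_encode(words)) == words
```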
Dictionary Compression#
When an attribute contains many repeated values scattered throughout the column, we can replace each value with a compact fixed-width code and keep a dictionary that maps codes back to values.
This is the most commonly used compression method in DBMSs. Ideally the code assignment preserves the sort order of the original values, so that point and range predicates can be evaluated directly on the codes.
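A minimal sketch of an order-preserving dictionary:

```python
def dictionary_encode(column):
    """Map each distinct value to a code; sorting keeps codes order-preserving."""
    dictionary = {v: code for code, v in enumerate(sorted(set(column)))}
    codes = [dictionary[v] for v in column]
    return codes, dictionary

column = ["banana", "apple", "banana", "cherry", "apple"]
codes, dictionary = dictionary_encode(column)
print(codes, dictionary)
# [1, 0, 1, 2, 0] {'apple': 0, 'banana': 1, 'cherry': 2}

# Because the codes preserve order, a range predicate such as
# "WHERE fruit < 'cherry'" becomes a cheap integer comparison.
limit = dictionary["cherry"]
print([i for i, c in enumerate(codes) if c < limit])  # [0, 1, 2, 4]
```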