rknfish

rknfish

i need more love
github

Lecture #05: Storage Models & Compression

Please refer to the following slides for more information: Slide

Please refer to the following notes for more information: Note

Mermaid Loading...

Database workloads#

OLTP#

OLTP: On-Line Transaction Processing

Fast operations that read and update a small range of data at a time.

OLTP

OLAP#

OLAP: On-Line Analytical Processing

OLAP is used to support complex analytical operations and provide decision support.

image

HTAP#

Hybrid Transaction + Analytical Processing

A new workload that combines OLTP and OLAP.

OLTP OLAP HTAP


Storage Models#

N-Ary Storage Model (NSM)#

In the n-ary storage model, the DBMS stores all tuples containing all attributes continuously in a single page.

n-ary-model

This is ideal for OLTP.

Advantages:

  • Fast insertion, update, and deletion
  • Efficient for queries that require the entire tuple

Disadvantages:

  • Inefficient for queries that require scanning the entire table or only need one attribute

image

Decomposition Storage Model (DSM)#

In the decomposition storage model, each attribute of all tuples is stored separately.

Also known as "column store"

This is efficient for read-only OLAP operations, especially those that only need to scan certain attributes.

image

Advantages:

  • Reduces I/O waste
  • Better for querying and compressing data

Disadvantages:

  • Slower for single-point modification, query, and update operations

To implement this, there are two approaches:

  1. Set a fixed word length for each attribute, so we only need to get the offset to find the desired data accurately.
  2. A less common approach is to use a tuple in the form of (id: pos) to store values, indicating that the value of the id is stored at position pos.

image

Database Compression#

I/O is time-consuming and often the bottleneck of the entire database, so DBMS widely uses compression algorithms to improve performance.

Usually, we need to balance between speed and compression ratio.

Compression Granularity#

  • Block Level
  • Tuple Level (NSM Only)
  • Attribute Level
  • Columnar Level

Naive Compression#

Using general-purpose compression algorithms is one solution. However, once we use this method, the DBMS does not know the data we are operating on until it is decompressed.

Provided as input:
→ LZO (1996), LZ4 (2011), Snappy (2011),
Oracle OZIP (2014), Zstd (2015)

image

To improve speed, we need additional compression methods that allow us to access information even after compression.

image

Columnar Compression#

Run-Length Encoding (RLE)#

Values that appear consecutively in the same column can be compressed into triplets (value: pos: num).

Where:

  1. value represents the value
  2. pos represents the starting position of the value
  3. num represents the number of times the value is repeated

image

However, this method may have some drawbacks

image

After transformation

image

Bit-Packing Encoding#

Some data can be highly redundant, and we can reduce this redundancy by using bit-packing.

image

Converting int64 to int8 greatly reduces the required space.

However, this method has some drawbacks and may result in some information that does not fit in int8.

Therefore, we need to store it in the following way.

image

However, this method can only be used when there is minimal additional storage.

Bitmap Encoding#

When there are few values for each attribute, we can use bitmap encoding to store them.

For example, when there are only two values, F and M, we can represent them as 01 for yes or no.

image

Delta Encoding#

In many cases, such as room temperature, there may be dense values within a certain range in our statistics.

Therefore, after determining one value, all subsequent values can be stored as deltas.

image

Incremental Encoding#

We can also obtain the final result by taking prefixes or suffixes of the data.

image

Dictionary Compression#

When a table may have multiple values and these values are scattered in different places, we can use a dictionary to get the positions of these values.

image

This is also the most commonly used compression method.

Loading...
Ownership of this post data is guaranteed by blockchain and smart contracts to the creator alone.