The rise of LSM trees for Databases
B-trees (1970) vs LSM trees(1990)
PS: Note that there is a small difference between B-Trees and B+Trees.
B-trees store records even at intermediate or nested leaves whereas B+Trees do not store any records at the intermediate or nested leaves. But the imp thing to remember is that:
- B+Tree is faster for sorting ( in general, most old DBs use B+Trees)
- B-Tree is faster when we want to insert records in the middle
For thr purpose of this article, I refer to both B+Trees and B-trees as B-trees as the difference for the purpose of this comparison isnt significant. viz. both B-tree variants outperform LSM trees in terms of read performance.
B-trees were invented in 1970 and since then has been a staple data structure for all popular DBs during that time. At the time, HDDs were the main form of persistent storage replacing tapes and DB designers tried to work around HDDs weaknesses of rotational latency and disk seek time.
B-trees are essentially a form of self-balancing binary tree except that instead of just two child nodes, a B-tree can have many child nodes, hence having a large fan-out and hence saving on tree depth.
B-trees perform well when there are workflows that do few writes and many reads.
However, B-trees suffer from 2 main drawbacks : write amplification and read amplification.
write amplification as B-trees writing just a few bytes to disk could involve writing many blocks: right from the intermediate B-tree blocks down to the leaf block.
space amplification also affected B-trees as DB designers didnt want to allocate new B-tree block everytime something was written, most of the intermediate B-trees levels were left with buffer space or padding to accommodate any new writes, appends or concatenated data.
Around 1990, LSM trees was proposed but LSM trees really took off once the data explosion happened in the 1990s with things like BigTable by Google, followed by HBase, Cassandra, LevelDB, RocksDB and other DBs.
Around the same time, another technology shift happened. SSDs also started to be seen as a replacement to Disks, though they were 2x-3x more expensive to HDDs.
In an LSM tree new writes are stored in a in-mem table until a sufficient amount is written. Then all writes are packed tightly and written to storage. To make reads efficient, the data on LSM trees were sorted before being written to disk.
So say at time t1, LSM DB wrote a block that was say for employee records 1,3,5
At a later time, LSM tree also wrote changes that were say for 2, 5, 7.
So 2,5,7 are also written to disk and at some point the LSM DB tries to read, merge and compact these sorted keys to a new set of blocks.
i.e merge([1,3,5] and [2,5,7]) to a new sorted strings table[1,2,3,5,7].
there are 2 kinds of LSM tree compaction stategies: Write optimized / Lazy Writes/ Universal / Tiered(lesser effort spent in sorting and compaction) vs Read-optimized / Eager Writes / Leveled (more levels of sorting, merging and compaction)
Another optimization that LSM DBs did was use bloom filters. Bloom filters are probabilistic data structures that help answer 2 things : given a key, respond with : “its definitely not there” OR “it may be there” somewhere in this block range.
Bloom filters occupied some space on storage : say 1% but the speed up in performance they provided was worth it.
So, while LSM trees now did very well with large vol of writes, LSM trees suffered when doing small range queries. In other words LSM trees still sucked at search cost.
To solve range queries, two mechanisms were proposed at SIGMODDB conferences: Surf Tries(truncated set of Tries) and Rosetta(all key prefixes in a filter)
Note that LSM tree compaction also does create some write amplification but its still lesser than the write-amplification of B-Trees.
Summary
B-trees are still a force to reckon with if your workload has a lot of short range queries but for a lot of simple get queries + lots of writes LSM trees are still more optimal. So its always a question of navigating trade-offs.
Examples
DBs that use B-trees
SQL: Amost all of SQL RDBMS use B-trees
NOSQL: Default storage engine of Mongo(WiredTiger) uses B-trees
DBs that use LSM trees
NoSQL: Almost all NOSQL DBs use LSM viz. Cassandra, LevelDB, RocksDB, BigTable, ScyllaDB
New Advances
BW tree from Microsoft since 2013: make a B-tree more write optimized.
also buffers insertions from a physical construct to a logical construct. Need to read more on this bit.
References
- https://www.youtube.com/watch?v=2qZCLAQXSHE (The Rise of LSM Trees — Why now?)
- https://www.youtube.com/playlist?list=PLeKd45zvjcDHJxge6VtYUAbYnvd_VNQCx (Martin Kleppmans playlist series on Databases and Transactions
- The Bw-Tree: A B-tree for New Hardware Platforms
- https://medium.com/codex/understanding-log-structured-merge-lsm-trees-c4a0039f17a8