Read, write & space amplification - B-Tree vs LSM

This post compares a B-Tree and LSM for read, write and space amplification.
The comparison is done in theory and practice so expect some handwaving mixed with data from iostat and vmstat collected while running the Linkbench workload.
The comparison in practice provides values for read, write and space amplification on real workloads. The comparison in theory attempts to explain those values. For now I am only specific about the cache hit rate.
For the B-Tree I assume that all non-leaf levels are in cache. While an LSM with leveled compaction has more things to keep in the cache (bloom filters), it also benefits from a better compression rate, and its cache requirements are similar to a clustered B-Tree's.
Read amplification is different for point and range queries.
Disk read-amp for a short range scan is 1 or 2, assuming that 1 or 2 blocks from the B-Tree leaf level or the LSM max level are read. Note the impact of my assumption about cached data: while many files might be accessed for a short range query with an LSM, everything but the max-level data blocks is in cache.
The number of key comparisons can be used as the in-memory read-amp. For a B-Tree with 1M keys there are about 20 key comparisons on a point query.
For a range query with a B-Tree there is one additional comparison for each row fetched. It is harder to reason about the number of comparisons for an LSM.
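As a rough sketch, the comparison counts above can be computed directly. The function names and the binary-search cost model are mine, not from the post:

```python
import math

def btree_point_comparisons(n_keys: int) -> float:
    """Binary search over n sorted keys costs ~log2(n) comparisons."""
    return math.log2(n_keys)

def btree_range_comparisons(n_keys: int, rows_fetched: int) -> float:
    """A range scan pays the point-query cost to find the start,
    then ~1 additional comparison per row fetched."""
    return math.log2(n_keys) + rows_fetched

print(round(btree_point_comparisons(1_000_000)))        # 20
print(round(btree_range_comparisons(1_000_000, 100)))   # 120
```

This matches the ~20 comparisons quoted above for a point query on 1M keys.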
Bloom filters can be used for a point query to avoid index comparisons, but when there are too many files in level 0 there will be too many bloom filter checks.
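A simplified model of that check count, assuming each level-0 file must be probed separately (their key ranges overlap) while each deeper level needs one check; the function and its parameters are illustrative:

```python
def bloom_checks_per_point_query(l0_files: int, deeper_levels: int) -> int:
    # Level-0 files have overlapping key ranges, so every file's filter
    # is checked; levels 1..N have non-overlapping files, so one filter
    # check per level suffices.
    return l0_files + deeper_levels

print(bloom_checks_per_point_query(4, 5))   # healthy L0: 9 checks
print(bloom_checks_per_point_query(20, 5))  # backed-up L0: 25 checks
```

The check count grows linearly with the level-0 file count, which is why a backed-up level 0 hurts point queries.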
Bloom filters don't work for range queries (ignoring prefix bloom filters), so I will ignore them for now. If you want to maximize the ratio of database size to cache size while doing at most one disk read per point query, then an LSM with leveled compaction or a clustered B-Tree is the best choice.
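To make the sizing argument concrete, here is a rough estimate under assumptions of my own (16kb leaf blocks, 64-byte keys, non-leaf levels at ~1% of the database); it is a sketch, not a formula from the post:

```python
def cache_needed_for_one_read_per_point_query(
        db_bytes: int, leaf_block_bytes: int, key_bytes: int,
        nonleaf_fraction: float = 0.01) -> int:
    """Rough cache estimate for a clustered B-Tree: one key per leaf
    block plus all non-leaf levels (assumed ~1% of the database)."""
    n_leaf_blocks = db_bytes // leaf_block_bytes
    return n_leaf_blocks * key_bytes + int(db_bytes * nonleaf_fraction)

db = 1 << 40  # a 1 TB database
cache = cache_needed_for_one_read_per_point_query(db, 16 * 1024, 64)
print(round(cache / db, 3))  # 0.014 -> cache is ~1.4% of the database
```

Under these assumptions the cache only needs to hold a small percentage of the database, which is the point of the claim above.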
For a clustered B-Tree the things that must be in memory are one key per leaf block and all non-leaf levels of the index. An LSM with leveled compaction has similar requirements, although it also needs some of the bloom filters to be in memory. The cache requirement is much larger for an LSM with size-tiered compaction.
An LSM with size-tiered compaction keeps more old versions of key-value pairs, so space-amp is larger and more data needs to be in the cache. An unclustered B-Tree index also requires more memory to keep the important bits in cache.
The important bits are all keys, which is much more memory than one key per leaf block for a clustered B-Tree.

Write Amplification

For now I assume that flash storage is used so I can focus on bytes written and ignore disk seeks when explaining write-amp.
For a B-Tree a change is first recorded in the redo log and the page is eventually written back. The worst case occurs when the buffer pool is full of dirty pages and reading the to-be-modified page into the buffer pool forces a dirty page to be evicted and written back. In this case there is a redo log write and a page write back per row change.
The write-amp is reduced when there is more than one changed row on a page or when one row is changed many times before write back. For the LSM the redo log is written immediately on a row change. When the memtable is full and flushed to level 0, the row is written again.
The total write-amp is computed from the writes required to move a row change from the memtable to the max level. From the examples above the LSM has less write-amp than the B-Tree, though those examples were not meant to be directly compared. On flash storage, lower write-amp means the device will last longer.
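The write-amp arithmetic above can be sketched as follows. The constants are my assumptions (a per-level compaction fanout of 10 and the worst-case B-Tree behavior described earlier), not measurements from the post:

```python
def btree_worst_case_write_amp(page_bytes: int, row_bytes: int) -> float:
    # Worst case: each changed row forces a full dirty-page write back,
    # plus the redo log write for the row itself.
    return page_bytes / row_bytes + 1

def lsm_leveled_write_amp(levels: int, fanout: int = 10) -> float:
    # Rough model: redo log + memtable flush + up to `fanout` rewrites
    # per level as the row is compacted down to the max level.
    return 1 + 1 + levels * fanout

print(btree_worst_case_write_amp(16 * 1024, 128))  # 129.0
print(lsm_leveled_write_amp(levels=4))             # 42
```

Even the pessimistic leveled-compaction estimate is well below the B-Tree worst case when rows are much smaller than pages, which is the comparison being made above.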
Each stream writes files sequentially, but the writes from different streams can end up in the same logical erase block (logical because it is striped across many NAND chips). The level within the leveled-compaction LSM predicts the lifetime of the write.
Writes to level 0 have a short lifetime; writes to level 4 have a much longer one. This means that logical erase blocks will end up with a mix of long- and short-lived data, and the long-lived data will get copied out during flash garbage collection.
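A common back-of-the-envelope model for this effect (my simplification, not from the post): if a fraction f of each erase block holds long-lived data that GC must copy out before erasing, device-level write-amp is roughly 1/(1-f):

```python
def flash_gc_write_amp(long_lived_fraction: float) -> float:
    """Simplified greedy-GC model: GC must copy the long-lived fraction
    of each erase block before erasing it, so device write-amp is
    roughly 1 / (1 - f)."""
    assert 0 <= long_lived_fraction < 1
    return 1 / (1 - long_lived_fraction)

print(flash_gc_write_amp(0.0))            # 1.0 -> streams fully separated
print(round(flash_gc_write_amp(0.5), 1))  # 2.0 -> hot and cold data mixed
```

Keeping short- and long-lived writes in separate erase blocks drives f toward zero, which is the motivation for multi-stream devices mentioned next.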
Does your flash device use logical erase blocks? Flash devices that support multi-stream will help a lot.

Space Amplification

A B-Tree gets space-amp from fragmentation, per-row metadata and fixed page sizes on disk.
Finally, when compression is done for InnoDB there will be wasted space because page sizes are fixed on disk. When a 16kb in-memory page compresses to 5kb for a table that uses 8kb pages on disk, then 3kb of the 8kb page on disk is wasted.
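The padding arithmetic can be written out as a tiny sketch (the function is illustrative and assumes the compressed page fits in one on-disk page):

```python
def wasted_bytes_on_disk(compressed_bytes: int, disk_page_bytes: int) -> int:
    # Fixed-size on-disk pages: a compressed page is padded up to the
    # configured page size, and the padding is wasted space.
    return disk_page_bytes - compressed_bytes

# The 16kb -> 5kb example from the text, stored on 8kb disk pages:
print(wasted_bytes_on_disk(5 * 1024, 8 * 1024) // 1024)  # 3 (kb wasted)
```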