READ/WRITE PATH
Josh McKenzie
josh.mckenzie@datastax.com
CORE COMPONENTS
Memtable: data in memory (R/W)
CommitLog: data on disk (W/O)
SSTable: data on disk (immutable, R/O)
CacheService (Row Cache and Key Cache): in-memory caches
ColumnFamilyStore: logical grouping of table data
DataTracker and View: provide atomicity and grouping of memtable/sstable data
ColumnFamily: collection of Cells (sorted map of Columns)
Cell: Name, Value, TS
Tombstone: deletion marker indicating TS and deleted cell(s)
MemTable
In-memory data structure consisting of:
Memory pools (on-heap, off-heap)
Allocators for each pool
Size and limit tracking and CommitLog sentinels
Map of Key -> AtomicBTreeColumns
Atomic copy-on-write semantics for row-data
Flush to disk logic is triggered when pool passes ratio of usage relative
to user-configurable threshold
Memtable w/largest ratio of used space (either on or off heap) is flushed
to disk
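The flush-selection rule above can be sketched as follows. This is a minimal illustration rather than Cassandra's code: `FlushSelectionSketch`, `MemtableUsage`, and `pickToFlush` are hypothetical names, and the ratio is assumed to be the larger of a memtable's on-heap and off-heap usage as a fraction of the corresponding pool limit.

```java
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch: pick the memtable holding the largest share of either pool.
final class FlushSelectionSketch {
    // Bytes one memtable holds in each pool (illustrative type, not Cassandra's).
    static final class MemtableUsage {
        final String table;
        final long onHeapBytes;
        final long offHeapBytes;

        MemtableUsage(String table, long onHeapBytes, long offHeapBytes) {
            this.table = table;
            this.onHeapBytes = onHeapBytes;
            this.offHeapBytes = offHeapBytes;
        }
    }

    // The larger of the two usage ratios for one memtable.
    static double ratio(MemtableUsage m, long onHeapLimit, long offHeapLimit) {
        double onHeap = (double) m.onHeapBytes / onHeapLimit;
        double offHeap = (double) m.offHeapBytes / offHeapLimit;
        return Math.max(onHeap, offHeap);
    }

    // Returns the memtable with the largest ratio; throws if the list is empty.
    static MemtableUsage pickToFlush(List<MemtableUsage> memtables, long onHeapLimit, long offHeapLimit) {
        return memtables.stream()
                .max(Comparator.comparingDouble(m -> ratio(m, onHeapLimit, offHeapLimit)))
                .orElseThrow();
    }

    public static void main(String[] args) {
        List<MemtableUsage> live = List.of(
                new MemtableUsage("users", 40, 10),
                new MemtableUsage("events", 20, 90));
        System.out.println(pickToFlush(live, 100, 100).table);
    }
}
```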
CommitLog
Append-only file structure that provides interim durability for writes while
they're living in Memtables and haven't been flushed to sstables
Has sync logic to determine the level of durability to disk you want: either
PeriodicCommitLogService or BatchCommitLogService
Periodic (default): checks whether it has hit the window limit; if so, blocks and waits for sync to catch up
Batch: no ack until fsync to disk. Waits for a specific window before calling fsync, to coalesce writes
SSTable
Ordered-map of KVP
Immutable
Consists of 3 files:
Bloom Filter: optimization to determine if the Partition Key you're
looking for is (probably) in this sstable
Index file: contains offset into data file, generally memory mapped
Data file: contains data, generally compressed
Read by SSTableReader
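To make the "(probably) in this sstable" behaviour concrete, here is a small Bloom filter sketch. It is an assumption-laden illustration: `BloomFilterSketch` is a hypothetical class, and the two hash functions (`Arrays.hashCode` and FNV-1a, combined by double hashing) are stand-ins chosen for brevity, not the hashing Cassandra uses.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.BitSet;

// Hypothetical sketch of a Bloom filter: no false negatives, occasional false positives.
final class BloomFilterSketch {
    private final BitSet bits;
    private final int size;
    private final int hashCount;

    BloomFilterSketch(int size, int hashCount) {
        this.bits = new BitSet(size);
        this.size = size;
        this.hashCount = hashCount;
    }

    // FNV-1a, used here only as a simple second hash function.
    private static int fnv1a(byte[] data) {
        int h = 0x811c9dc5;
        for (byte b : data) {
            h ^= (b & 0xff);
            h *= 0x01000193;
        }
        return h;
    }

    // Double hashing: the i-th bit position is derived from two base hashes.
    private int index(byte[] key, int i) {
        int h1 = Arrays.hashCode(key);
        int h2 = fnv1a(key);
        return Math.floorMod(h1 + i * h2, size);
    }

    void add(String partitionKey) {
        byte[] key = partitionKey.getBytes(StandardCharsets.UTF_8);
        for (int i = 0; i < hashCount; i++) {
            bits.set(index(key, i));
        }
    }

    // false means "definitely absent"; true means "probably present".
    boolean mightContain(String partitionKey) {
        byte[] key = partitionKey.getBytes(StandardCharsets.UTF_8);
        for (int i = 0; i < hashCount; i++) {
            if (!bits.get(index(key, i))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        BloomFilterSketch filter = new BloomFilterSketch(1024, 3);
        filter.add("user:42");
        System.out.println(filter.mightContain("user:42"));
    }
}
```

A reader would consult the filter first and skip the index and data files whenever `mightContain` returns false.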
CacheService.java
In-memory caching service to optimize lookups of hot data
Contains three caches:
keyCache
rowCache
counterCache
See:
AutoSavingCache.java
InstrumentingCache.java
Tunable per table; limits set in cassandra.yaml (keys to cache and size in MB; rows and size in MB)
Defaults to keys only, can enable row cache via CQL
ColumnFamilyStore.java
Contains logic for a table
Holds DataTracker
Creating and removing sstables on disk
Writing / reading data
Cache initialization
Secondary index(es)
Flushing memtables to sstables
Snapshots
And much more (just short of 3k LoC)
ColumnFamily.java
A sorted map of columns
Abstract class, extended by:
ArrayBackedSortedColumns
Array backed
Non-thread-safe
Good for iteration, adding cells (especially if in sorted order)
[Read path diagram: Coordinator, ColumnFamilyStore, Check Row Cache (hit: return results; miss: CollationController), Memtable read, SSTables, Key Cache (hit: seek to cached position; miss: binary scan index, update cache), merge, ColumnFamily]
Resources are ref counted: see RefCounted.java, the Tidy interface, and the various private classes in
SSTableReader (SSTR) that implement Tidy to clean up resources
Provides methods to retrieve an SSTableScanner, which gives you access to OnDiskAtoms
via iterators and holds RandomAccessReaders on the raw files on disk
OnDiskAtom -> Cell -> *Cell
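The reference-counting idea can be sketched in a few lines. This is a simplified illustration and not the contents of RefCounted.java: `RefCountSketch` and its nested `Tidy` interface are hypothetical, and the only behaviour assumed is that the cleanup hook runs once, when the last reference is released.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch of a ref-counted resource with a cleanup hook.
final class RefCountSketch {
    // Cleanup callback, loosely modelled on the idea of a Tidy interface (illustrative only).
    interface Tidy {
        void tidy();
    }

    private final AtomicInteger refs = new AtomicInteger(1); // the creator holds the first reference
    private final Tidy onRelease;

    RefCountSketch(Tidy onRelease) {
        this.onRelease = onRelease;
    }

    // Take a reference; fails once the count has already dropped to zero.
    boolean tryRef() {
        while (true) {
            int n = refs.get();
            if (n <= 0) {
                return false;
            }
            if (refs.compareAndSet(n, n + 1)) {
                return true;
            }
        }
    }

    // Drop a reference; the last release runs the cleanup exactly once.
    void release() {
        int n = refs.decrementAndGet();
        if (n == 0) {
            onRelease.tidy();
        } else if (n < 0) {
            throw new IllegalStateException("released more times than referenced");
        }
    }

    public static void main(String[] args) {
        RefCountSketch reader = new RefCountSketch(() -> System.out.println("closing files"));
        reader.tryRef();   // a scanner takes a reference
        reader.release();  // the scanner is done
        reader.release();  // the owner lets go, so cleanup runs here
    }
}
```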
SliceByNamesReadCommand
ArrayBackedSortedColumns, AtomicBTreeSortedColumns
RowCache
CFS.getThroughCache(UUID cfId, QueryFilter filter)
After retrieving our CFS, the first thing we check is our Row Cache to see if the row is
already merged, in memory, and ready to go
If we get a cache hit on the key, we'll:
Confirm it's not just a sentinel of someone else in flight. If so, we query w/out caching
If the data for the key is valid, we filter it down to the query we have in flight and return
those results, as it'll have >= the count of Cells we're looking for
On cache miss:
Eventually cache all top level columns for the key queried if configured to do so (after
Collation)
Cache results of user query if it satisfies the cache config params
Extend the results of the query to satisfy the caching requirements of the system
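The hit, sentinel, and miss cases above can be sketched as a read-through cache. This is a simplified model under stated assumptions: `RowCacheSketch`, its `Sentinel` marker, and the `loader` function (standing in for collation) are hypothetical, rows are plain sorted maps of column name to value, and the query is a column-name range.

```java
import java.util.SortedMap;
import java.util.TreeMap;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

// Hypothetical sketch of a row cache that uses a sentinel while a row is being loaded.
final class RowCacheSketch {
    // Marker stored in the cache while some reader is still loading the row.
    static final class Sentinel {}

    private final ConcurrentHashMap<String, Object> cache = new ConcurrentHashMap<>();
    private final Function<String, SortedMap<String, String>> loader; // stands in for collation
    final AtomicInteger loads = new AtomicInteger();

    RowCacheSketch(Function<String, SortedMap<String, String>> loader) {
        this.loader = loader;
    }

    private SortedMap<String, String> load(String key) {
        loads.incrementAndGet();
        return loader.apply(key);
    }

    // Filter the full row down to the requested column range [from, to).
    private static SortedMap<String, String> slice(SortedMap<String, String> row, String from, String to) {
        return new TreeMap<>(row.subMap(from, to));
    }

    SortedMap<String, String> read(String key, String from, String to) {
        Object cached = cache.get(key);
        if (cached instanceof Sentinel) {
            return slice(load(key), from, to); // someone else is in flight: query without caching
        }
        if (cached != null) {
            @SuppressWarnings("unchecked")
            SortedMap<String, String> row = (SortedMap<String, String>) cached;
            return slice(row, from, to);       // hit: filter the cached row down to this query
        }
        Sentinel mine = new Sentinel();
        if (cache.putIfAbsent(key, mine) != null) {
            return slice(load(key), from, to); // lost the race: query without caching
        }
        SortedMap<String, String> row;
        try {
            row = load(key);
        } catch (RuntimeException e) {
            cache.remove(key, mine);           // do not leave our sentinel behind on failure
            throw e;
        }
        cache.replace(key, mine, row);         // publish only if our sentinel is still in place
        return slice(row, from, to);
    }

    public static void main(String[] args) {
        TreeMap<String, String> stored = new TreeMap<>();
        stored.put("a", "1");
        stored.put("b", "2");
        stored.put("c", "3");
        RowCacheSketch cache = new RowCacheSketch(k -> stored);
        System.out.println(cache.read("k1", "a", "c"));
        System.out.println(cache.read("k1", "b", "d"));
        System.out.println("loads=" + cache.loads.get());
    }
}
```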
CollationController.collect*Data()
The data we're looking for may be in a Memtable, an SSTable, multiple of either, or a
combination of all of them.
The logic to query this data and merge our results exists in CollationController.java:
collectAllData
collectTimeOrderedData
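The merge itself can be illustrated with a small sketch that combines cells from several sources. It is not the code in CollationController.java: `CollationSketch` and its `Cell` class are hypothetical, a tombstone is modelled as a cell with a null value, the newest timestamp wins per cell name, and timestamp ties simply keep the cell seen first. It covers only the merge-everything idea and does not attempt a time-ordered variant.

```java
import java.util.Arrays;
import java.util.List;
import java.util.TreeMap;

// Hypothetical sketch: merge cells from memtables and sstables, newest timestamp wins.
final class CollationSketch {
    static final class Cell {
        final String name;
        final String value;   // null marks a tombstone in this sketch
        final long timestamp;

        Cell(String name, String value, long timestamp) {
            this.name = name;
            this.value = value;
            this.timestamp = timestamp;
        }

        boolean isTombstone() {
            return value == null;
        }
    }

    // Returns the live cells (name -> value) after reconciling every source.
    static TreeMap<String, String> collect(List<List<Cell>> sources) {
        TreeMap<String, Cell> newest = new TreeMap<>();
        for (List<Cell> source : sources) {
            for (Cell c : source) {
                // Keep whichever version carries the larger timestamp.
                newest.merge(c.name, c, (old, cur) -> cur.timestamp > old.timestamp ? cur : old);
            }
        }
        TreeMap<String, String> live = new TreeMap<>();
        for (Cell c : newest.values()) {
            if (!c.isTombstone()) {
                live.put(c.name, c.value); // tombstones shadow older data and are not returned
            }
        }
        return live;
    }

    public static void main(String[] args) {
        List<Cell> memtable = Arrays.asList(new Cell("a", "1", 10), new Cell("b", null, 20));
        List<Cell> sstable = Arrays.asList(new Cell("a", "0", 5), new Cell("b", "x", 15), new Cell("c", "3", 1));
        System.out.println(collect(Arrays.asList(memtable, sstable)));
    }
}
```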
[Write path diagram: decision (No: skip; Yes: Write CommitLog), Write to Memtable, SecondaryIndexManager.Updater, Invalidate Row Cache]
MutationVerbHandler, Mutation.apply
Contains Keyspace name
DecoratedKey
Map of cfId to ColumnFamily of modifications to perform
MutationVerbHandler -> Mutation.apply() -> Keyspace.apply() ->
ColumnFamilyStore.apply()
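The ordering of the write path (commit log, then memtable, then row cache invalidation) can be sketched with plain collections. This is a single-threaded toy model under stated assumptions: `WritePathSketch`, its `Mutation` class, and `cacheKey` are hypothetical, the commit log is just a list of strings, and the secondary index update is left out.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.UUID;

// Hypothetical sketch of the write path ordering; none of these types are Cassandra's own.
final class WritePathSketch {
    // A mutation names its keyspace and partition key, and maps table id -> cells to write.
    static final class Mutation {
        final String keyspace;
        final String key;
        final Map<UUID, Map<String, String>> modifications;

        Mutation(String keyspace, String key, Map<UUID, Map<String, String>> modifications) {
            this.keyspace = keyspace;
            this.key = key;
            this.modifications = modifications;
        }
    }

    final List<String> commitLog = new ArrayList<>();
    final Map<UUID, TreeMap<String, TreeMap<String, String>>> memtables = new HashMap<>();
    final Map<String, Map<String, String>> rowCache = new HashMap<>();

    static String cacheKey(UUID cfId, String key) {
        return cfId + ":" + key;
    }

    void apply(Mutation m) {
        // 1. Record the write in the commit log first, for durability.
        commitLog.add(m.keyspace + "/" + m.key);
        for (Map.Entry<UUID, Map<String, String>> e : m.modifications.entrySet()) {
            // 2. Apply the cells to the memtable of each affected table.
            memtables.computeIfAbsent(e.getKey(), id -> new TreeMap<>())
                     .computeIfAbsent(m.key, k -> new TreeMap<>())
                     .putAll(e.getValue());
            // 3. Drop any cached copy of the row so later reads see the new data.
            rowCache.remove(cacheKey(e.getKey(), m.key));
        }
    }

    public static void main(String[] args) {
        UUID users = UUID.randomUUID();
        WritePathSketch node = new WritePathSketch();
        node.apply(new Mutation("ks", "k1", Map.of(users, Map.of("name", "josh"))));
        System.out.println(node.commitLog + " " + node.memtables.get(users));
    }
}
```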
Memtable writes
We attempt to get the partition for the given key if it exists
If not, we allocate space for a new key and put an empty entry in the memtable for it,
backing that out if we race and someone else got there first on allocation
Once we have space allocated, we call addAllWithSizeDelta
Add the record to a new BTree and CAS it into the existing Holder
Updates secondary indexes
Finalize some heap tracking in the ColumnUpdater used by the BTree to perform updates
Further reading:
AtomicBTreeColumns.java (specifically addAllWithSizeDelta)
BTree.java
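The "build a new tree and CAS it into the Holder" step can be sketched with an `AtomicReference`. This is an illustration only, not the code in AtomicBTreeColumns.java: `CopyOnWriteRowSketch` and its `Holder` are hypothetical, a `TreeMap` copy stands in for the BTree (which, as far as this sketch is concerned, is just some immutable sorted structure), and the returned delta is a cell count rather than a byte size.

```java
import java.util.Collections;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical sketch of copy-on-write row updates published with compare-and-set.
final class CopyOnWriteRowSketch {
    // Immutable snapshot of the row; replaced as a whole, never modified in place.
    static final class Holder {
        final SortedMap<String, String> cells;

        Holder(TreeMap<String, String> cells) {
            this.cells = Collections.unmodifiableSortedMap(cells);
        }
    }

    private final AtomicReference<Holder> ref = new AtomicReference<>(new Holder(new TreeMap<>()));

    // Merges the update into a fresh copy and retries until the CAS succeeds.
    // Returns how many cells the row grew by (a stand-in for the size delta).
    int addAll(Map<String, String> update) {
        while (true) {
            Holder current = ref.get();
            TreeMap<String, String> copy = new TreeMap<>(current.cells);
            copy.putAll(update);
            if (ref.compareAndSet(current, new Holder(copy))) {
                return copy.size() - current.cells.size();
            }
            // Another writer won the race; loop and rebuild from the newer snapshot.
        }
    }

    // Readers always see a complete, consistent snapshot.
    SortedMap<String, String> snapshot() {
        return ref.get().cells;
    }

    public static void main(String[] args) {
        CopyOnWriteRowSketch row = new CopyOnWriteRowSketch();
        System.out.println(row.addAll(Map.of("a", "1", "b", "2")));
        System.out.println(row.snapshot());
    }
}
```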
MemtablePool
Single MEMORY_POOL instance across entire DB
Get an allocator to the memory pool during construction of a memtable
Interface covering management of an on-heap and off-heap pool via SubPool
HeapPool: On heap ByteBuffer allocations and release, subject to GC w/object overhead
NativePool: Blend of on and off heap based on limits passed in
SlabPool: Blend of on and off heap based on limits passed in (used for memtables, reduced fragmentation)
MemtablePool.SubPool / SubAllocator:
Contains various atomically updated longs tracking:
Limits on allocation
Spin and CAS for updates on the above on allocator calls in addAllWithSizeDelta
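The spin-and-CAS accounting can be sketched with a single `AtomicLong`. This is a minimal illustration, assuming one counter and one fixed limit; `SubPoolSketch`, `tryAllocate`, and `release` are hypothetical names and not the MemtablePool API.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch of limit tracking with an atomically updated counter.
final class SubPoolSketch {
    private final long limit;
    private final AtomicLong allocated = new AtomicLong();

    SubPoolSketch(long limit) {
        this.limit = limit;
    }

    // Spin until the counter is updated by CAS, or refuse if the limit would be exceeded.
    boolean tryAllocate(long size) {
        while (true) {
            long current = allocated.get();
            long next = current + size;
            if (next > limit) {
                return false;
            }
            if (allocated.compareAndSet(current, next)) {
                return true;
            }
            // Another thread changed the counter between get() and CAS; retry.
        }
    }

    void release(long size) {
        allocated.addAndGet(-size);
    }

    // Fraction of the limit currently in use; a flush trigger could compare this to a threshold.
    double usedRatio() {
        return (double) allocated.get() / limit;
    }

    public static void main(String[] args) {
        SubPoolSketch pool = new SubPoolSketch(100);
        System.out.println(pool.tryAllocate(60));
        System.out.println(pool.tryAllocate(60));
        System.out.println(pool.usedRatio());
    }
}
```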
On Write Path:
SecondaryIndex updater passed down through to ColumnUpdater ctor
On ColumnUpdater.apply(), insert for secondary index is called
Essentially amounts to a 2nd write on another table
FLUSHING MEMTABLES
[Flush diagram: ColumnFamilyStore, Memtable, SSTableWriter, SSTable, SSTableReader; CLSM.activeSegments with CLS 1, CLS 2, and CLS Active (actively allocating, skip); CommitLog.discardCompletedSegments(cfId, lastReplayPosition)]
Memtable Flushing
Reference ColumnFamilyStore$Flush
First, switch out memtables in CFS.DataTracker.View so new ops go to the new memtable
Sets lifecycle in memtable to discarding
Runs the FlushRunnable in the Memtable
Memtable.writeSortedContents
Uses SSTableWriter to write sorted contents to disk
Returns SSTableReader created by SSTableWriter.closeAndOpenReader
Memtable.setDiscarded() -> MemtableAllocator.setDiscarded()
Lifecycle to Discarded
Free up all memory from the allocator for this memtable
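The flush sequence above (switch, mark discarding, write sorted contents, mark discarded and free) can be sketched as follows. This is a toy model under stated assumptions: `FlushSketch`, its `Memtable`, and `Lifecycle` are hypothetical, an unmodifiable sorted map stands in for the on-disk sstable, and the coordination that keeps in-flight writers away from a memtable being flushed is not modelled.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.SortedMap;
import java.util.TreeMap;
import java.util.concurrent.ConcurrentSkipListMap;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical sketch of the memtable flush sequence; not Cassandra's implementation.
final class FlushSketch {
    enum Lifecycle { LIVE, DISCARDING, DISCARDED }

    static final class Memtable {
        final ConcurrentSkipListMap<String, String> rows = new ConcurrentSkipListMap<>();
        volatile Lifecycle state = Lifecycle.LIVE;
    }

    private final AtomicReference<Memtable> current = new AtomicReference<>(new Memtable());
    final List<SortedMap<String, String>> sstables = new ArrayList<>();

    void write(String key, String value) {
        current.get().rows.put(key, value);
    }

    int liveRowCount() {
        return current.get().rows.size();
    }

    Memtable flush() {
        // 1. Switch memtables so new writes land in a fresh one.
        Memtable old = current.getAndSet(new Memtable());
        // 2. Mark the old memtable as being discarded.
        old.state = Lifecycle.DISCARDING;
        // 3. Write its contents out in sorted order as an immutable "sstable".
        TreeMap<String, String> sorted = new TreeMap<>(old.rows);
        sstables.add(Collections.unmodifiableSortedMap(sorted));
        // 4. Free the memtable's memory and finish its lifecycle.
        old.rows.clear();
        old.state = Lifecycle.DISCARDED;
        return old;
    }

    public static void main(String[] args) {
        FlushSketch cfs = new FlushSketch();
        cfs.write("b", "2");
        cfs.write("a", "1");
        cfs.flush();
        System.out.println(cfs.sstables.get(0));
    }
}
```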
3.0