
CASSANDRA

READ/WRITE PATH
Josh McKenzie
josh.mckenzie@datastax.com

CORE COMPONENTS

Core Components
Memtable: data in memory (R/W)
CommitLog: data on disk (W/O)
SSTable: data on disk (immutable, R/O)
CacheService (Row Cache and Key Cache): in-memory caches
ColumnFamilyStore: logical grouping of table data
DataTracker and View: atomicity and grouping of memtable/sstable data
ColumnFamily: collection of Cells (sorted map of columns)
Cell: name, value, timestamp (TS)
Tombstone: deletion marker indicating TS and deleted cell(s)

MemTable
In-memory data structure consisting of:
Memory pools (on-heap, off-heap)
Allocators for each pool
Size and limit tracking and CommitLog sentinels
Map of Key → AtomicBTreeColumns
Atomic copy-on-write semantics for row-data
Flush to disk logic is triggered when pool passes ratio of usage relative
to user-configurable threshold
Memtable w/largest ratio of used space (either on or off heap) is flushed
to disk
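
A minimal sketch of that selection step in plain Java (illustrative names only, not the actual ColumnFamilyStore code):

import java.util.List;

// Sketch: pick the memtable whose on- or off-heap ownership ratio is largest and flush it.
class FlushSelector {
    interface MemtableUsage {
        double onHeapRatio();   // owned on-heap bytes / on-heap limit
        double offHeapRatio();  // owned off-heap bytes / off-heap limit
        void flush();
    }

    static void flushLargest(List<MemtableUsage> memtables) {
        MemtableUsage largest = null;
        double largestRatio = 0;
        for (MemtableUsage m : memtables) {
            // take the larger of on-heap and off-heap usage for each memtable
            double ratio = Math.max(m.onHeapRatio(), m.offHeapRatio());
            if (ratio > largestRatio) {
                largestRatio = ratio;
                largest = m;
            }
        }
        if (largest != null)
            largest.flush();
    }
}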

On heap vs. Off heap Memtables: an overview


http://www.datastax.com/dev/blog/off-heap-memtables-in-cassandra-2-1
https://issues.apache.org/jira/browse/CASSANDRA-6689
https://issues.apache.org/jira/browse/CASSANDRA-6694
memtable_allocation_type
offheap_buffers moves the cell name and value to DirectBuffer objects. The values are still
live Java buffers. This mode only reduces heap significantly when you are storing large
strings or blobs
offheap_objects moves the entire cell off heap, leaving only the NativeCell reference
containing a pointer to the native (off-heap) data. This makes it effective for small values
like ints or uuids as well, at the cost of having to copy it back on-heap temporarily when
reading from it.

Default in 2.1 is heap buffers
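
For reference, the difference between heap and direct buffers in plain Java (standard NIO, not Cassandra code):

import java.nio.ByteBuffer;

public class BufferDemo {
    public static void main(String[] args) {
        // Heap buffer: backed by a byte[] on the Java heap, fully tracked by GC
        ByteBuffer heap = ByteBuffer.allocate(1024);

        // Direct buffer: memory allocated outside the heap; only the small
        // DirectByteBuffer wrapper object lives on-heap
        ByteBuffer offHeap = ByteBuffer.allocateDirect(1024);

        System.out.println(heap.isDirect());    // false
        System.out.println(offHeap.isDirect()); // true
    }
}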

On heap vs. Off heap: continued


Why?
Reduces the size of objects in memory: no more ByteBuffer overhead
More data fitting in memory == better performance
Code changes that support it:
MemtablePools allow on vs. off-heap allocation (and Slab, for that matter)
MemtableAllocators to allow differentiating between on-heap and off-heap
allocation
DecoratedKey and *Cells changed to interfaces to have different allocation
implementations based on native vs. heap

CommitLog
Append-only file structure that provides interim durability for writes while
they're living in Memtables and haven't been flushed to SSTables
Has sync logic to determine the level of durability to disk you want - either
PeriodicCommitLogService or BatchCommitLogService
Periodic (default): checks whether the sync window limit has been hit; if so, blocks and waits for sync to catch up
Batch: no ack until fsync to disk; waits for a short window before issuing the fsync so writes coalesce
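
A rough sketch of the two waiting strategies in plain Java (illustrative names and a hypothetical window value, not the real AbstractCommitLogService):

// Batch waits for the fsync that covers its write; periodic only blocks if the
// syncer has fallen too far behind.
class SyncPolicySketch {
    volatile long lastSyncedAt = 0;          // millis of last completed fsync
    final long periodicWindowMillis = 10000; // hypothetical window limit

    // Batch: do not acknowledge the write until an fsync covering it completes.
    void awaitBatchSync(long writeTimestamp) throws InterruptedException {
        synchronized (this) {
            while (lastSyncedAt < writeTimestamp)
                wait();
        }
    }

    // Periodic: only block when the sync thread is more than one window behind.
    void maybeAwaitPeriodicSync(long now) throws InterruptedException {
        synchronized (this) {
            while (now - lastSyncedAt > periodicWindowMillis)
                wait();
        }
    }

    // Called by the sync thread after each fsync.
    synchronized void onSyncCompleted(long syncTime) {
        lastSyncedAt = syncTime;
        notifyAll();
    }
}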

Singleton façade for commit log operations


Consists of multiple components
CommitLog.java: interface to subsystem
CommitLogSegment.java: files on disk
CommitLogSegmentManager.java: segment allocation and management
CommitLogArchiver.java: user-defined commands pre/post flush
CommitLogMetrics.java

SSTable
Ordered map of key/value pairs
Immutable
Consists of 3 files:
Bloom Filter: optimization to determine whether the Partition Key you're
looking for is (probably) in this sstable
Index file: contains offset into data file, generally memory mapped
Data file: contains data, generally compressed
Read by SSTableReader
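
A simplified sketch of that lookup order in plain Java (illustrative types, not the real SSTableReader):

import java.util.NavigableMap;
import java.util.TreeMap;
import java.util.function.Predicate;

// Consult the bloom filter first, then the index for an offset, then the data file.
class SSTableLookupSketch {
    final Predicate<String> bloomFilter;                      // "might contain" test; false positives possible
    final NavigableMap<String, Long> index = new TreeMap<>(); // key -> offset in the data file

    SSTableLookupSketch(Predicate<String> bloomFilter) {
        this.bloomFilter = bloomFilter;
    }

    Long findDataOffset(String partitionKey) {
        if (!bloomFilter.test(partitionKey))
            return null;                // definitely not in this sstable
        return index.get(partitionKey); // offset to seek to in the data file, or null
    }
}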

CacheService.java
In-memory caching service to optimize lookups of hot data
Contains three caches:
keyCache
rowCache
counterCache

See:
AutoSavingCache.java
InstrumentingCache.java

Tunable per table; global limits in cassandra.yaml (number of keys to save and cache size in MB, for both the key and row caches)
Defaults to caching keys only; the row cache can be enabled per table via CQL

ColumnFamilyStore.java
Contains logic for a table
Holds DataTracker
Creating and removing sstables on disk
Writing / reading data
Cache initialization
Secondary index(es)
Flushing memtables to sstables
Snapshots
And much more (just short of 3k LoC)

CFS: DataTracker and View


DataTracker allows for atomic operations on a view of a Table (ColumnFamilyStore)
Contains various logic surrounding Memtables and flushing, SSTables and
compaction, and notification for subscribers on changes to SSTableReaders
1 DataTracker per CFS, 1 AtomicReference<View> per DataTracker
View consists of current Memtable, Memtables pending flush, SSTables for the CFS,
and SSTables being actively compacted
Currently active Memtable is atomically switched out in:
DataTracker.switchMemtable(boolean truncating)
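
A simplified sketch of the copy-on-write View switch in plain Java (illustrative classes, not the real DataTracker):

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.atomic.AtomicReference;

// Build a new immutable View and CAS it in, retrying if another thread won the race.
class ViewSwitchSketch {
    static final class View {
        final Object currentMemtable;
        final List<Object> memtablesPendingFlush;

        View(Object currentMemtable, List<Object> pendingFlush) {
            this.currentMemtable = currentMemtable;
            this.memtablesPendingFlush = Collections.unmodifiableList(pendingFlush);
        }
    }

    final AtomicReference<View> view =
            new AtomicReference<>(new View(new Object(), new ArrayList<>()));

    void switchMemtable(Object newMemtable) {
        while (true) {
            View current = view.get();
            List<Object> pending = new ArrayList<>(current.memtablesPendingFlush);
            pending.add(current.currentMemtable);      // old memtable now awaits flush
            View next = new View(newMemtable, pending);
            if (view.compareAndSet(current, next))
                return;                                // readers see either the old or the new view, never a mix
        }
    }
}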

ColumnFamily.java
A sorted map of columns
Abstract class, extended by:
ArrayBackedSortedColumns
Array backed
Non-thread-safe
Good for iteration, adding cells (especially if in sorted order)

AtomicBTreeColumns (memtable only)


Btree backed
Thread-safe w/atomic CAS
Logarithmic complexity on operations

Logic to add / retrieve columns, counters, tombstones, atoms

THE READ PATH

Read Path: Very High Level

Overview: the Read Path

[Diagram: a read arrives from the Coordinator via the MessagingService and flows through Keyspace to the ColumnFamilyStore. The Row Cache is checked first: on a hit, results are returned immediately; on a miss, the CollationController reads from the Memtable and from the SSTables (consulting the Key Cache: a hit seeks straight to the cached position, a miss binary-scans the index and updates the cache), merges the results into a ColumnFamily, updates the Row Cache, and returns the results.]

Read-specific primitive: QueryFilter


Contains IDiskAtomFilter
IDiskAtomFilter: used to get columns from Memtable, SSTable, or SuperColumn
IdentityQueryFilter, NamesQueryFilter, SliceQueryFilter

Contains a variety of iterators to collate on-disk contents, gather tombstones, and reduce
(merge) Cells with the same name, etc.
See:
collateColumns()
gatherTombstones()
getReducer(final Comparator<Cell> comparator)

Read-specific class: SSTableReader


Has 2 SegmentedFiles, ifile and dfile, for index and data respectively
Contains a Key Cache, caching positions of keys in the SSTR
Contains an IndexSummary w/sampling of the keys that are in the table
Binary search is used to narrow down the location in the file via the IndexSummary:
getIndexScanPosition(RowPosition key) (see the sketch at the end of this slide)

Short running operations guarded by ColumnFamilyStore.readOrdering


See OpOrder.java: a producer/consumer synchronization primitive used to coordinate readers with
flush operations

Resources are ref-counted: see RefCounted.java, the Tidy interface, and the various private classes in
SSTR that implement Tidy to clean up resources
Provides methods to retrieve an SSTableScanner which gives you access to OnDiskAtoms
via iterators and holds RandomAccessReaders on the raw files on disk
OnDiskAtom → Cell → *Cell
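
A simplified sketch of the IndexSummary binary search mentioned above, in plain Java (illustrative fields, not the real IndexSummary):

import java.util.Arrays;

// Binary-search the sampled keys and return the position in the index file
// at which to start a linear scan for the requested key.
class IndexSummarySketch {
    final String[] sampledKeys;   // every Nth key in the index, sorted
    final long[] indexOffsets;    // offset into the index file for each sample

    IndexSummarySketch(String[] sampledKeys, long[] indexOffsets) {
        this.sampledKeys = sampledKeys;
        this.indexOffsets = indexOffsets;
    }

    // Returns the index-file offset to begin scanning from, or -1 if the key
    // sorts before all samples (i.e. it cannot be in this sstable's index range).
    long indexScanPosition(String key) {
        int i = Arrays.binarySearch(sampledKeys, key);
        if (i < 0)
            i = -i - 2;          // greatest sample <= key
        return i < 0 ? -1 : indexOffsets[i];
    }
}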

Overview: the Read Path (diagram repeated)

ReadVerbHandler and ReadCommands


Messages are received by the MessagingService and passed to the ReadVerbHandler for appropriate
verbs
ReadCommands:
SliceFromReadCommand

Relies on SliceQueryFilter, uses a range of columns defined by a ColumnSlice

SliceByNamesReadCommand

Relies on NamesQueryFilter, uses a column name to retrieve a single column

Both diverge in calls and converge back into implementers of ColumnFamily

ArrayBackedSortedColumns, AtomicBTreeColumns

// Keyspace.java
public Row getRow(QueryFilter filter)
{
    ColumnFamilyStore cfStore = getColumnFamilyStore(filter.getColumnFamilyName());
    ColumnFamily columnFamily = cfStore.getColumnFamily(filter);
    return new Row(filter.key, columnFamily);
}

Overview: the Read Path (diagram repeated)

RowCache
CFS.getThroughCache(UUID cfId, QueryFilter filter)
After retrieving our CFS, the first thing we check is our Row Cache to see if the row is
already merged, in memory, and ready to go
If we get a cache hit on the key, we'll:
Confirm it's not just a sentinel for another read in flight. If it is, we query without caching
If the data for the key is valid, we filter it down to the query we have in flight and return
those results, as it'll have >= the count of Cells we're looking for

On cache miss:
Eventually cache all top level columns for the key queried if configured to do so (after
Collation)
Cache results of user query if it satisfies the cache config params
Extend the results of the query to satisfy the caching requirements of the system

Overview: the Read Path (diagram repeated)

CollationController.collect*Data()
The data we're looking for may be in a Memtable, an SSTable, multiple of either, or a
combination of all of them.
The logic to query this data and merge our results exists in CollationController.java:
collectAllData
collectTimeOrderedData

High level flow:


1. Get data from memtables for the QueryFilter we're processing
2. Get data from sstables for the QueryFilter we're processing
3. Merge all the data together, keeping the most recent
4. If we iterated across enough sstables, hoist the now-defragmented data back up into a memtable,
bypassing the CommitLog and index updates (collectTimeOrderedData only)
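
A simplified sketch of step 3, the merge-keeping-most-recent logic, in plain Java (illustrative types; the real code streams via iterators and a reducer):

import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Merge cells from every source, keeping the version with the highest timestamp per cell name.
class CollateSketch {
    static class Cell {
        final String name;
        final String value;
        final long timestamp;
        Cell(String name, String value, long timestamp) {
            this.name = name; this.value = value; this.timestamp = timestamp;
        }
    }

    static Map<String, Cell> collate(List<List<Cell>> memtableAndSSTableCells) {
        Map<String, Cell> result = new HashMap<>();
        for (List<Cell> source : memtableAndSSTableCells)
            for (Cell cell : source)
                result.merge(cell.name, cell,
                        (existing, incoming) -> incoming.timestamp > existing.timestamp ? incoming : existing);
        return result;
    }
}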

Overview: the Read Path (diagram repeated)

CollationController merging: memtables


Fairly straightforward operations on memtables in the view:
Check all memtables to see if they have a ColumnFamily that matches our filter.key
Add all columns to our result ColumnFamily that match
Keep a running tally of the mostRecentRowTombstone for use in the next step.

Overview: the Read Path (diagram repeated)

CollationController merging: sstables


We have a few optimizations available for merging in data from sstables:
Sort the collection of SSTables by the max timestamp present
Iterate across the SSTables
Skipping any that are older than the most recent tombstone we've seen
Create a reduced name filter by removing columns from our filter where we
have fresher data than the SSTR's max timestamp
Get iterator from SSTR for Atoms matching that reduced name filter
Add any matching OnDiskAtoms to our result set (BloomFilter excludes via
iterator with SSTR.getPosition() call)
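
A simplified sketch of the sort-and-skip optimization in plain Java (illustrative interface, not the real CollationController):

import java.util.Comparator;
import java.util.List;

// Visit sstables newest-first and stop once their newest data cannot beat the tombstone we already hold.
class SSTableSkipSketch {
    interface SSTableLike {
        long maxTimestamp();                 // newest cell timestamp in this sstable
        void collectMatching(Object filter); // add matching atoms to the result
    }

    static void collect(List<SSTableLike> sstables, Object filter, long mostRecentRowTombstone) {
        // newest data first, so we can bail out early
        sstables.sort(Comparator.comparingLong(SSTableLike::maxTimestamp).reversed());
        for (SSTableLike sstable : sstables) {
            if (sstable.maxTimestamp() < mostRecentRowTombstone)
                break;                       // everything here (and in older sstables) is shadowed
            sstable.collectMatching(filter);
        }
    }
}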

Overview: the Read Path (diagram repeated)

THE WRITE PATH

Write Path: Very High Level

Overview: the Write Path

[Diagram: a mutation arrives via the MessagingService and flows through Keyspace. Is the CommitLog enabled for this mutation? If no, skip it; if yes, write the CommitLog entry. Then write to the Memtable (SecondaryIndexManager.Updater applies any index updates) and invalidate the Row Cache.]

MutationVerbHandler, Mutation.apply
Contains Keyspace name
DecoratedKey
Map of cfId to ColumnFamily of modifications to perform
MutationVerbHandler → Mutation.apply() → Keyspace.apply() → ColumnFamilyStore.apply()
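
A simplified sketch of that structure in plain Java (illustrative field types, not the real Mutation class):

import java.nio.ByteBuffer;
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

// The keyspace it targets, the partition key, and the per-table modifications to apply.
class MutationSketch {
    final String keyspaceName;
    final ByteBuffer partitionKey;                            // stand-in for DecoratedKey
    final Map<UUID, Object> modifications = new HashMap<>();  // cfId -> ColumnFamily of changes

    MutationSketch(String keyspaceName, ByteBuffer partitionKey) {
        this.keyspaceName = keyspaceName;
        this.partitionKey = partitionKey;
    }

    void add(UUID cfId, Object columnFamily) {
        modifications.put(cfId, columnFamily);
    }
    // apply() would hand this to Keyspace.apply(), which writes the CommitLog
    // entry (if enabled) and then calls ColumnFamilyStore.apply() per table.
}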

Overview: the Write Path (diagram repeated)

The CommitLog ecosystem


CommitLogSegment: file on disk
CommitLogSegmentManager: allocation and recycling of CommitLogSegments
CommitLogArchiver: allows user-defined archive and restore commands to be run
Reference conf/commitlog_archiving.properties
An AbstractCommitLogService, one of either:
BatchCommitLogService: the writer waits for the sync to complete before returning
PeriodicCommitLogService: checks whether sync has fallen behind; if so, registers with a signal and
blocks until lastSyncedAt catches up

CommitLogSegmentManager (CLSM): overview


Contains 2 collections of CommitLogSegments
availableSegments: Segments ready to be used
activeSegments: Segments that are active and contain unflushed data

Only 1 active CommitLogSegment is in use at any given time


Manager thread is responsible for maintaining active vs. available
CommitLogSegments and can be woken up by other contexts when maintenance is
needed

CLSM: allocation on the write path


During CommitLog.add(), a writer asks for allocated space for their mutation from
the CommitLogSegmentManager
This is passed to the active CommitLogSegment's allocate() method
CommitLogSegmentManager.allocate(int size) spins non-blocking until the space in
the segment is allocated, at which time it marks the segment dirty
If the CLS.allocate() call returns null indicating we need a new segment:
CommitLogSegment.advanceAllocatingFrom(CommitLogSegment old)
Goal is to move CLS from available to active segments so we have more CLS to work with
If it fails to get an available segment, the manager thread is woken back up to do some
maintenance, be it recycling or allocating a new CLS
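
A simplified sketch of the non-blocking allocation spin in plain Java (illustrative fields only, not the real CommitLogSegment):

import java.util.concurrent.atomic.AtomicInteger;

// CAS the tail pointer forward; a null return means the segment is full and the
// caller must advance to a new segment.
class SegmentAllocationSketch {
    final int segmentSize;
    final AtomicInteger tail = new AtomicInteger(0);

    SegmentAllocationSketch(int segmentSize) {
        this.segmentSize = segmentSize;
    }

    // Returns the start position reserved for the mutation, or null if it doesn't fit.
    Integer allocate(int mutationSize) {
        while (true) {
            int current = tail.get();
            int next = current + mutationSize;
            if (next > segmentSize)
                return null;                       // segment exhausted: advance to a new segment
            if (tail.compareAndSet(current, next))
                return current;                    // space reserved; mark the segment dirty here
        }
    }
}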

CLSM: manager thread, new segments, recycling


Constructor creates a runnable that blocks on segmentManagementTasks
Task can either be null, indicating we're out of space (allocate path), or a segment that's
flushed and ready for recycle
If there are no available segments, we create new CommitLogSegments and add them to
availableSegments
The hasAvailableSegments WaitQueue is then signaled to wake any blocked writers waiting for
allocation

When our CommitLog usage is approaching the allowable limit:
If our total used size exceeds the size allowed:
CommitLogSegmentManager.flushDataFrom on a list of activeSegments
Force flush on any CFS that's dirty
Which switches Memtables and flushes to SSTables (more on this later)

Overview: the Write Path (diagram repeated)

Memtable writes
We attempt to get the partition for the given key if it exists
If not, we allocate space for a new key and put an empty entry in the memtable for it,
backing that out if we race and someone else got there first on allocation
Once we have space allocated, we call addAllWithSizeDelta
Add the record to a new BTree and CAS it into the existing Holder
Updates secondary indexes
Finalize some heap tracking in the ColumnUpdater used by the BTree to perform updates

Further reading:
AtomicBTreeColumns.java (specifically addAllWithSizeDelta)
BTree.java

MemtablePool
Single MEMORY_POOL instance across entire DB
Get an allocator to the memory pool during construction of a memtable
Interface covering management of an on-heap and off-heap pool via SubPool
HeapPool: On heap ByteBuffer allocations and release, subject to GC w/object overhead
NativePool: Blend of on and off heap based on limits passed in

Off heap allocations and release through NativeAllocator, calls to Unsafe

SlabPool: Blend of on and off heap based on limits passed in (used for memtables, reduced fragmentation)

Allocated in large chunks by SlabAllocator (1024*1024)

MemtablePool.SubPool / SubAllocator:
Contains various atomically updated longs tracking:

Limits on allocation

Currently allocated amounts

Currently reclaiming amounts

Threshold for when to run Cleaner thread

Spin and CAS for updates on the above on allocator calls in addAllWithSizeDelta
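
A simplified sketch of that accounting in plain Java (illustrative names, not the real SubPool/SubAllocator):

import java.util.concurrent.atomic.AtomicLong;

// Track bytes allocated and bytes being reclaimed against a fixed limit,
// using CAS so writers never take a lock.
class SubPoolSketch {
    final long limitBytes;
    final AtomicLong allocated = new AtomicLong();
    final AtomicLong reclaiming = new AtomicLong();

    SubPoolSketch(long limitBytes) {
        this.limitBytes = limitBytes;
    }

    boolean tryAllocate(long bytes) {
        while (true) {
            long current = allocated.get();
            if (current + bytes > limitBytes)
                return false;                      // over limit: caller must wait or trigger a flush
            if (allocated.compareAndSet(current, current + bytes))
                return true;
        }
    }

    void markReclaiming(long bytes) { reclaiming.addAndGet(bytes); }

    void released(long bytes) {                    // a flushed memtable's memory has been freed
        allocated.addAndGet(-bytes);
        reclaiming.addAndGet(-bytes);
    }
}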

Overview: the Write Path (diagram repeated)

Secondary Indexes: an overview


Essentially a separate table stored on disk / in memtable
Contains a ConcurrentNavigableMap of ByteBuffer → SecondaryIndex
There are quite a few SecondaryIndex implementations in the code base, ex:
PerRowSecondaryIndex
PerColumnSecondaryIndex
KeysIndex

On Write Path:
SecondaryIndex updater passed down through to ColumnUpdater ctor
On ColumnUpdater.apply(), insert for secondary index is called
Essentially amounts to a 2nd write on another table

Overview: the Write Path (diagram repeated)

FLUSHING MEMTABLES

Flushing Memtables
[Diagram: the ColumnFamilyStore stops the Memtable at the flush position and writes it out via an SSTableWriter, producing an SSTable and an SSTableReader. CommitLog.discardCompletedSegments(cfId, lastReplayPosition) then walks CLSM.activeSegments: the segment actively being allocated from is skipped, a segment that still has other dirty cfs just has the flushed cfId removed, and a segment whose last dirty cfId was removed is recycled.]

MemtableCleanerThread: starting a flush


When a MemtableAllocator adjusts the size of the data it has acquired, the
MemtablePool checks whether we need to flush to free up space in memory
If our used memory exceeds the total reclaiming memory plus the limit * the ratio defined
by memtable_cleanup_threshold in the config, a memtable needs to be cleaned
The cleaner task is currently ColumnFamilyStore.FlushLargestColumnFamily
We find the memtable with the largest ownership ratio, as determined by the currently
owned memory vs. the limit, taking the max of either on- or off-heap
Signals the CommitLog to discard completed segments in the PostFlush stage of the flush
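
A simplified sketch of that trigger condition in plain Java (illustrative, assuming the check reads as described above):

// A flush is needed once used memory exceeds the reclaiming amount plus the
// pool limit scaled by memtable_cleanup_threshold.
class CleanupTriggerSketch {
    static boolean needsCleaning(long usedBytes, long reclaimingBytes,
                                 long limitBytes, double cleanupThreshold) {
        return usedBytes > reclaimingBytes + (long) (limitBytes * cleanupThreshold);
    }
}

// e.g. needsCleaning(1000, 100, 2048, 0.4) == true, since 1000 > 100 + 819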

Memtable Flushing
Reference ColumnFamilyStore$Flush
1st, switch out memtables in CFS.DataTracker.View so new ops go to new memtable
Sets lifecycle in memtable to discarding
Runs the FlushRunnable in the Memtable
Memtable.writeSortedContents
Uses SSTableWriter to write sorted contents to disk
Returns SSTableReader created by SSTableWriter.closeAndOpenReader

Memtable.setDiscarded() → MemtableAllocator.setDiscarded()
Lifecycle to Discarded
Free up all memory from the allocator for this memtable

Memtable Flushing: the commit log


ColumnFamilyStore$PostFlush
All relative to a timestamp of the most recent data in the flushed memtable
Record a sentinel for when this cf was cleaned (to be used later if it was active and we
couldn't purge at time of flush)
Walk through the CommitLogSegments and remove the dirty cfId
Unless it's actively being allocated from
If the CLS is no longer in use:
Remove it from our activeSegments
Queue a task for Management thread to wake up and recycle the segment

Switching out memtables


CFS.switchMemtableIfCurrent / CFS.switchMemtable
There are some complex non-blocking write-barrier operations on
Keyspace.writeOrder that allow us to wait for in-flight writes to finish in this context before
swapping in new memtables, regardless of dirty status
Reference: OpOrder.java, OpOrder.Barrier

Write sorted contents to disk: Memtable.FlushRunnable.runWith(File sstableDirectory)
cfs.replaceFlushed, swapping the memtable with the new SSTableReader returned
from writeSortedContents

3.0

A couple of interesting things


Sylvain has a herculean effort underway to refactor and modernize the storage
engine, removing quite a bit of complexity from the querying process
Various components renamed (names are in flux atm)
https://issues.apache.org/jira/browse/CASSANDRA-8099

CommitLogSegment recycling is pretty complex without much payoff, so there are
plans to remove it:
https://issues.apache.org/jira/browse/CASSANDRA-8771
