Getting Started With RocksDb (Part 3) | Iterations

Sunpriya Kaur
Nov 15, 2021

Introduction

All data in RocksDb is logically stored in sorted order. The Iterator API allows a range scan over the database: an iterator can seek to a particular key, scan forward one key at a time, and also iterate in reverse. A consistent point-in-time view of the database is created whenever an iterator is created.

1. Seek

seekToFirst() positions the iterator at the first record.
seekToLast() positions the iterator at the last record.

To get all the keys from the database, call seekToFirst() and then itr.next() in a loop for as long as itr.isValid() returns true.

seek(byte[] key) positions the iterator at the first key greater than or equal to the given key. From there we can move forward with itr.next() or backward with itr.prev().
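
To make the seek semantics concrete, here is a minimal sketch that uses a java.util.TreeMap as a stand-in for the sorted keyspace. It is a toy model, not the RocksJava API: the class SeekModel and its String keys are assumptions made for illustration (RocksDB keys are byte arrays, compared bytewise by default).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Toy model only: a TreeMap stands in for RocksDB's sorted keyspace.
class SeekModel {
    private final TreeMap<String, String> data = new TreeMap<>();
    private String current; // null means the cursor is not valid

    void put(String key, String value) { data.put(key, value); }

    void seekToFirst() { current = data.isEmpty() ? null : data.firstKey(); }
    void seekToLast()  { current = data.isEmpty() ? null : data.lastKey(); }

    // seek: position at the first key >= target (invalid if there is none).
    void seek(String target) { current = data.ceilingKey(target); }

    // next/prev move one key forward or backward; they become invalid at the ends.
    void next() { if (current != null) current = data.higherKey(current); }
    void prev() { if (current != null) current = data.lowerKey(current); }

    boolean isValid() { return current != null; }
    String key() { return current; }

    // Full forward scan: seekToFirst(), then next() while the cursor is valid.
    List<String> allKeys() {
        List<String> keys = new ArrayList<>();
        for (seekToFirst(); isValid(); next()) {
            keys.add(key());
        }
        return keys;
    }
}
```

With the keys "apple", "banana" and "cherry" stored, seek("b") lands on "banana"; one prev() moves to "apple", and a second prev() makes the cursor invalid.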

2. Error Handling

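Various errors can occur while iterating, such as I/O errors, checksum mismatches, unsupported operations, or internal errors.
itr.isValid() tells us whether the iterator is positioned at a valid entry and it is safe to read from it. When it returns false, there are two possible reasons:
1. The iterator has come to the end of the data.
2. There was an internal error, i.e. the status is not OK.
Since isValid() alone cannot tell these two cases apart, check itr.status() once the loop ends; in RocksJava it throws a RocksDBException if the iterator stopped because of an error.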

3. Resource Pinning

Iterators by themselves do not consume much memory, but they can prevent some resources from being released. These include:
1. Memtables and SST files pinned at the time the iterator was created. Even if a memtable is flushed or a file is deleted, it is preserved for as long as the iterator pins it.
2. Data blocks for the current iterating position, which are kept in the block cache or on the heap.
Best practice is to keep iterators short-lived and to close them with itr.close() as soon as you are done.
You can also Refresh the iterator so that it represents the most recent state; the stale pinned resources are then released.

4. Read Ahead

RocksDB does automatic readahead and prefetches data on noticing more than 2 IOs for the same table file during iteration. The readahead size starts with 8KB and is exponentially increased on each additional sequential IO, up to a max of BlockBasedTableOptions.max_auto_readahead_size (default 256 KB). This helps in cutting down the number of IOs needed to complete the range scan. This automatic readahead is enabled only when ReadOptions.readahead_size = 0 (default value).
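
As a rough illustration of that growth, the sketch below lists the readahead size used for each successive sequential IO once automatic readahead has kicked in. It assumes that "exponentially increased" means doubling per IO, which the text above does not state explicitly; ReadaheadSchedule and sizesKb are hypothetical names, not part of RocksDB.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative arithmetic only; not RocksDB code.
class ReadaheadSchedule {
    // Returns the readahead size (in KB) for each of the given number of
    // sequential IOs, starting at startKb and capped at maxKb.
    // Assumption: the size doubles on every additional sequential IO.
    static List<Integer> sizesKb(int startKb, int maxKb, int sequentialIos) {
        List<Integer> sizes = new ArrayList<>();
        int size = startKb;
        for (int i = 0; i < sequentialIos; i++) {
            sizes.add(size);
            size = Math.min(size * 2, maxKb);
        }
        return sizes;
    }
}
```

Under that assumption, ReadaheadSchedule.sizesKb(8, 256, 7) yields 8, 16, 32, 64, 128, 256, 256, so the 256 KB cap is reached on the sixth sequential IO.
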
Setting ReadOptions.readahead_size explicitly provides read-ahead support for a very limited set of use cases. The limitation is that, once it is turned on, the constant cost of the iterator is much higher. So you should only use it if you iterate over a very large range of data and cannot work around it with other approaches. A typical use case is remote storage with very long latency, where the OS page cache is not available and a large amount of data will be scanned.

5. Prefix Seek

Prefix seek is a feature for reducing the I/O cost of an iterator seek, which can otherwise be large. The basic idea is that, if users know the iteration will stay within one key prefix, the common prefix can be used to reduce costs. A common technique for prefix seek is a prefix bloom filter. If a sorted run doesn't contain any entry for this prefix, the bloom filter rules it out, and the I/O and CPU for that sorted run are saved.
The bloom filter is configured through the table options, together with a prefix extractor; the repository linked at the end of this article has the complete configuration.

In earlier versions, prefix seek had to be enabled through extra read options on every iterator. In newer versions, once a prefix_extractor is configured, seek() performs a prefix seek by default, and setting total_order_seek=true on the read options switches the iterator back to total-order iteration.
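
To show why a prefix bloom filter lets a seek skip whole sorted runs, here is a toy Bloom filter over key prefixes. It is only a sketch of the idea, not RocksDB's filter implementation or its configuration API: the class PrefixBloom, the double-hashing scheme, and the sizes used are all assumptions chosen for illustration.

```java
import java.util.BitSet;

// Toy Bloom filter over key prefixes; not RocksDB's implementation.
class PrefixBloom {
    private final BitSet bits;
    private final int size;   // number of bits in the filter
    private final int hashes; // number of probes per prefix

    PrefixBloom(int size, int hashes) {
        this.bits = new BitSet(size);
        this.size = size;
        this.hashes = hashes;
    }

    // Derive the i-th bit position from two base hashes (double hashing).
    private int position(String prefix, int i) {
        int h1 = prefix.hashCode();
        int h2 = (h1 >>> 16) | 1; // odd step for the probe sequence
        return Math.floorMod(h1 + i * h2, size);
    }

    // Record that this sorted run contains at least one key with the prefix.
    void add(String prefix) {
        for (int i = 0; i < hashes; i++) {
            bits.set(position(prefix, i));
        }
    }

    // false => the prefix is definitely absent, so the sorted run can be skipped.
    // true  => the prefix may be present (false positives are possible).
    boolean mightContain(String prefix) {
        for (int i = 0; i < hashes; i++) {
            if (!bits.get(position(prefix, i))) {
                return false;
            }
        }
        return true;
    }
}
```

In this model each sorted run would keep its own filter, built from the prefixes of the keys it holds. A seek for a given prefix consults the filter first and only reads from the runs where mightContain returns true.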

6. SeekForPrev

SeekForPrev() positions the iterator at the last key less than or equal to the given key.
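
The "last key less than or equal to" rule is the same one that TreeMap.floorKey follows, so it can be sketched with the standard library. This is a model of the semantics only, with a hypothetical helper name and String keys standing in for RocksDB's byte-array keys.

```java
import java.util.TreeMap;

// Toy model of SeekForPrev semantics; not the RocksJava API.
class SeekForPrevModel {
    // Returns the last key <= target, or null when every key is greater
    // (the case in which a real iterator would become invalid).
    static String seekForPrev(TreeMap<String, String> data, String target) {
        return data.floorKey(target);
    }
}
```

With the keys "a", "c" and "e" stored, a target of "c" returns "c" (exact match), a target of "d" returns "c" (closest smaller key), and a target of "0" returns null because no stored key is less than or equal to it.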

7. Tailing iterator

A tailing iterator is used to read data as soon as it is added to the database. It doesn't take a snapshot when it is created and is optimized for sequential reads. Currently a tailing iterator only supports forward iteration. A normal iterator takes a snapshot of the data at the current sequence number, so new data is not included because it has a larger sequence number. A tailing iterator uses MaxSequenceNumber instead, which includes any newly added data.

Not all new data is guaranteed to be available to a tailing iterator. Seek() or SeekToFirst() on a tailing iterator can be thought of as creating an implicit snapshot: anything written after it may, but is not guaranteed to, be seen.
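
The difference in visibility comes down to which sequence numbers a reader accepts, and a small model shows it. This is a simplification made for illustration: SequenceModel is a hypothetical class, entries are returned in write order rather than key order, and the implicit-snapshot caveat above is ignored by treating a tailing read as one that accepts every sequence number.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of sequence-number visibility; not RocksDB code.
class SequenceModel {
    private static final class Entry {
        final long seq;
        final String key;
        Entry(long seq, String key) { this.seq = seq; this.key = key; }
    }

    private final List<Entry> log = new ArrayList<>();
    private long lastSeq = 0;

    // Every write gets the next, larger sequence number.
    void put(String key) { log.add(new Entry(++lastSeq, key)); }

    // The sequence number a normal iterator would capture when it is created.
    long currentSequence() { return lastSeq; }

    // Keys visible to a reader that only accepts entries with seq <= maxSeq.
    List<String> visibleUpTo(long maxSeq) {
        List<String> keys = new ArrayList<>();
        for (Entry e : log) {
            if (e.seq <= maxSeq) {
                keys.add(e.key);
            }
        }
        return keys;
    }
}
```

If we write "a" and "b", capture currentSequence(), and then write "c", a read with visibleUpTo(capturedSeq) still returns only "a" and "b", like a normal iterator. A read with visibleUpTo(Long.MAX_VALUE) also returns "c", which is the behaviour the tailing iterator aims for.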

Please refer to the following repository for the complete code: https://github.com/sunpriya/RocksDb-Recipe
