Leveldb 元信息相关的数据结构解析1
LevelDB 元信息相关的数据结构解析1
- 在开始具体的内容之前先抛出一些问题:
- 在leveldb中各个level中有那么多的sst文件,它是如何确定哪些sst文件是在哪一层的level上的呢?
- 在读写请求的时候如何快速确定某条记录在哪个sst文件上?
- 在系统重启之后,如何恢复到之前的状态呢?
- leveldb的元信息是如何管理的呢?
- 其实这些问题就是涉及到本篇要讨论的一些问题,都是元信息的管理,这里可以先概括下元信息管理的的作用:
- 标记Compaction相关的信息,保证适时触发Compaction操作;
- 记录sst文件的层次信息,记录一些序列号和sst产生的序号等,为leveldb的读、写、Compaction等操作提供数据结构支持;
- 持久化元信息,为重启恢复提供依据;
- 以版本的方式维护元信息,使得用户通过快照的方式读写文件和数据。
0.如何实现的
- leveldb 的元信息管理主要涉及到3个数据结构,Version、VersionEdit 和 VersionSet,下面就通过代码来简单分析下这些数据结构。
0.1 Version
- leveldb 用 Version表示一个版本的元信息,其中一个嵌套的重要的数据结构就是FileMetaData指针的二维数组,它记录了每一层level 包含的 sst文件信息。除此之外,Version中还记录了触发Compaction的条件信息,这些信息会在读写请求或Compaction过程中被更新。通过Compaction过程会产生和删除sst文件。每当这个时候都会有一个新的对应的Version生成,并插入VersionSet链表头部。下边是具体的代码,我加了一些注释:
class Version {
public:
// Lookup the value for key. If found, store it in *val and
// return OK. Else return a non-OK status. Fills *stats.
// REQUIRES: lock is not held
struct GetStats {
FileMetaData* seek_file;
int seek_file_level;
};
// Append to *iters a sequence of iterators that will
// yield the contents of this Version when merged together.
// REQUIRES: This version has been saved (see VersionSet::SaveTo)
void AddIterators(const ReadOptions&, std::vector<Iterator*>* iters);
Status Get(const ReadOptions&, const LookupKey& key, std::string* val,
GetStats* stats);
// Adds "stats" into the current state. Returns true if a new
// compaction may need to be triggered, false otherwise.
// REQUIRES: lock is held
bool UpdateStats(const GetStats& stats);
// Record a sample of bytes read at the specified internal key.
// Samples are taken approximately once every config::kReadBytesPeriod
// bytes. Returns true if a new compaction may need to be triggered.
// REQUIRES: lock is held
bool RecordReadSample(Slice key);
// Reference count management (so Versions do not disappear out from
// under live iterators)
void Ref();
void Unref();
void GetOverlappingInputs(
int level,
const InternalKey* begin, // nullptr means before all keys
const InternalKey* end, // nullptr means after all keys
std::vector<FileMetaData*>* inputs);
// Returns true iff some file in the specified level overlaps
// some part of [*smallest_user_key,*largest_user_key].
// smallest_user_key==nullptr represents a key smaller than all the DB's keys.
// largest_user_key==nullptr represents a key largest than all the DB's keys.
bool OverlapInLevel(int level, const Slice* smallest_user_key,
const Slice* largest_user_key);
// Return the level at which we should place a new memtable compaction
// result that covers the range [smallest_user_key,largest_user_key].
// 用来选择将从MemTable dump出的sstable file放入第几层
int PickLevelForMemTableOutput(const Slice& smallest_user_key,
const Slice& largest_user_key);
// 判断某层level的文件个数
int NumFiles(int level) const { return files_[level].size(); }
// Return a human readable string that describes this version's contents.
std::string DebugString() const;
private:
friend class Compaction;
friend class VersionSet;
class LevelFileNumIterator;
explicit Version(VersionSet* vset)
: vset_(vset),
next_(this),
prev_(this),
refs_(0),
file_to_compact_(nullptr),
file_to_compact_level_(-1),
compaction_score_(-1),
compaction_level_(-1) {}
Version(const Version&) = delete;
Version& operator=(const Version&) = delete;
~Version();
Iterator* NewConcatenatingIterator(const ReadOptions&, int level) const;
// Call func(arg, level, f) for every file that overlaps user_key in
// order from newest to oldest. If an invocation of func returns
// false, makes no more calls.
//
// REQUIRES: user portion of internal_key == user_key.
void ForEachOverlapping(Slice user_key, Slice internal_key, void* arg,
bool (*func)(void*, int, FileMetaData*));
VersionSet* vset_; // VersionSet to which this Version belongs 所有的version都属于一个集合:VersionSet
// VersionSet是Version组成的双链表,这里Version记录前一个和后一个Version两个指针
Version* next_; // Next version in linked list
Version* prev_; // Previous version in linked list
int refs_; // Number of live refs to this version
// List of files per level
// 这个二维数组,分层记录了所有的SST文件信息,每个SST文件由FileMetaData表示
std::vector<FileMetaData*> files_[config::kNumLevels];
// Next file to compact based on seek stats.
FileMetaData* file_to_compact_;
int file_to_compact_level_;
// Level that should be compacted next and its compaction score.
// Score < 1 means compaction is not strictly needed. These fields
// are initialized by Finalize().
// Compaction需要用compaction_score_来判断是否需要发起major compaction,这部分逻辑与某level所有SSTable file的大小有关系
double compaction_score_;
int compaction_level_;
};
- 下边我们再顺便看下sst的数据结构,FileMetaData数据结构比较简单,用来维护一个sst文件的元信息,包括文件大小,文件编号(也就是sst文件的名字是,例如12345.sst这种格式),最大最小key,引用计数等,这个引用计数记录了被不同的Version引用的次数。leveldb会触发Compaction,会对一些文件进行清理操作,让数据更加有序,清理后的数据放到新的版本里面,而老的数据作为原始的素材,最终是要清理掉的,但是如果有读事务位于旧的文件,那么暂时就不能删除。因此利用引用计数,只要一个Verison还活着,就不允许删除该Verison管理的所有文件。当一个Version生命周期结束,它管理的所有文件的引用计数减1.
struct FileMetaData {
FileMetaData() : refs(0), allowed_seeks(1 << 30), file_size(0) {}
int refs;
int allowed_seeks; // Seeks allowed until compaction
uint64_t number;
uint64_t file_size; // File size in bytes
InternalKey smallest; // Smallest internal key served by table
InternalKey largest; // Largest internal key served by table
};
Version::~Version() {
assert(refs_ == 0);
/* 从VersionSet中注销 */
prev_->next_ = next_;
next_->prev_ = prev_;
/*本Version下所有的文件,引用计数减1*/
for (int level = 0; level < config::kNumLevels; level++) {
for (size_t i = 0; i < files_[level].size(); i++) {
FileMetaData* f = files_[level][i];
assert(f->refs > 0);
f->refs--;
if (f->refs <= 0) {
delete f;
}
}
}
}
- 这里我们还可以在拓展一下,再抛出一个问题:对于一条记录,如果读和写同一时间发生,可能读到不一致的数据或者是修改了一半的数据。对于这种情况,要如何处理呢?目前有最最常见的3中做法:
- 悲观锁:最简单的处理方式,就是加锁保护,写的时候不许读,读的时候不许写。学些效率受到锁的制约。
- 乐观锁:假设多个并发的事务在处理数据时不会彼此影响,各事务只处理影响各自的数据,没有数据的重叠,然后在提交数据更新之前,每个事务会先检查在该事务读取数据后,有没有其他事务又修改了该数据。如果不幸有其他事务也进行了跟新,那么正在提交的事务会进行回滚;这样做不会有锁竞争,也不会产生死锁。但如果处理热点数据,数据竞争的概率较高,事务的成功率会大大下降。
- MVCC:Multiversion concurrency control,字面意思是多版本并发控制。具体来说,每一个执行操作操作的用户,看到的都是数据库特定时刻的的一个状态,叫做快照(snapshot), 未完成的修改(write)不会被其他的用户所看到。需要对数据进行更新的时候并是不直接覆盖原有数据,而是先进行标记, 然后在其他地方添加新的数据,从而形成一个新版本,此时新的读取操作看到的就是最新的版本了。所以这种策略本质是维护了多个版本的数据的,但只有一个是最新的。
- 所以可以理解为:sst级别的MVCC就是利用Version实现的。只有一个current version,持有最新的sst集合。
0.2 VersionSet
- VersionSet就是一个Version构成的双向链表,这些Version按时间顺序先后产生,链表头指向当前最新的Version,同时维护了每个Version的引用计数,被引用中的Version不会被删除,其对应的sst文件也因此得以保留(这里要注意下有两个引用计数,一个是Version的,记录有哪些事务持有这些Version,另外是sst文件的引用计数,这个是标记有多少个Version持有sst文件),通过这种方式,使得leveldb可以在一个稳定的快照视图上访问文件。VersionSet中除了Version的双向链表外还会记录一些如LogNumber,Sequence,下一个sst文件编号的状态信息。
class VersionSet {
public:
VersionSet(const std::string& dbname, const Options* options,
TableCache* table_cache, const InternalKeyComparator*);
VersionSet(const VersionSet&) = delete;
VersionSet& operator=(const VersionSet&) = delete;
~VersionSet();
// Apply *edit to the current version to form a new descriptor that
// is both saved to persistent state and installed as the new
// current version. Will release *mu while actually writing to the file.
// REQUIRES: *mu is held on entry.
// REQUIRES: no other thread concurrently calls LogAndApply()
Status LogAndApply(VersionEdit* edit, port::Mutex* mu)
EXCLUSIVE_LOCKS_REQUIRED(mu);
// Recover the last saved descriptor from persistent storage.
Status Recover(bool* save_manifest);
// Return the current version.
Version* current() const { return current_; }
// Return the current manifest file number
uint64_t ManifestFileNumber() const { return manifest_file_number_; }
// Allocate and return a new file number
uint64_t NewFileNumber() { return next_file_number_++; }
// Arrange to reuse "file_number" unless a newer file number has
// already been allocated.
// REQUIRES: "file_number" was returned by a call to NewFileNumber().
void ReuseFileNumber(uint64_t file_number) {
if (next_file_number_ == file_number + 1) {
next_file_number_ = file_number;
}
}
// Return the number of Table files at the specified level.
int NumLevelFiles(int level) const;
// Return the combined file size of all files at the specified level.
int64_t NumLevelBytes(int level) const;
// Return the last sequence number.
uint64_t LastSequence() const { return last_sequence_; }
// Set the last sequence number to s.
void SetLastSequence(uint64_t s) {
assert(s >= last_sequence_);
last_sequence_ = s;
}
// Mark the specified file number as used.
void MarkFileNumberUsed(uint64_t number);
// Return the current log file number.
uint64_t LogNumber() const { return log_number_; }
// Return the log file number for the log file that is currently
// being compacted, or zero if there is no such log file.
uint64_t PrevLogNumber() const { return prev_log_number_; }
// Pick level and inputs for a new compaction.
// Returns nullptr if there is no compaction to be done.
// Otherwise returns a pointer to a heap-allocated object that
// describes the compaction. Caller should delete the result.
Compaction* PickCompaction();
// Return a compaction object for compacting the range [begin,end] in
// the specified level. Returns nullptr if there is nothing in that
// level that overlaps the specified range. Caller should delete
// the result.
Compaction* CompactRange(int level, const InternalKey* begin,
const InternalKey* end);
// Return the maximum overlapping data (in bytes) at next level for any
// file at a level >= 1.
int64_t MaxNextLevelOverlappingBytes();
// Create an iterator that reads over the compaction inputs for "*c".
// The caller should delete the iterator when no longer needed.
Iterator* MakeInputIterator(Compaction* c);
// Returns true iff some level needs a compaction.
bool NeedsCompaction() const {
Version* v = current_;
return (v->compaction_score_ >= 1) || (v->file_to_compact_ != nullptr);
}
// Add all files listed in any live version to *live.
// May also mutate some internal state.
void AddLiveFiles(std::set<uint64_t>* live);
// Return the approximate offset in the database of the data for
// "key" as of version "v".
uint64_t ApproximateOffsetOf(Version* v, const InternalKey& key);
// Return a human-readable short (single-line) summary of the number
// of files per level. Uses *scratch as backing store.
struct LevelSummaryStorage {
char buffer[100];
};
const char* LevelSummary(LevelSummaryStorage* scratch) const;
private:
class Builder;
friend class Compaction;
friend class Version;
bool ReuseManifest(const std::string& dscname, const std::string& dscbase);
void Finalize(Version* v);
void GetRange(const std::vector<FileMetaData*>& inputs, InternalKey* smallest,
InternalKey* largest);
void GetRange2(const std::vector<FileMetaData*>& inputs1,
const std::vector<FileMetaData*>& inputs2,
InternalKey* smallest, InternalKey* largest);
void SetupOtherInputs(Compaction* c);
// Save current contents to *log
Status WriteSnapshot(log::Writer* log);
void AppendVersion(Version* v);
Env* const env_;
const std::string dbname_;
const Options* const options_;
TableCache* const table_cache_;
const InternalKeyComparator icmp_;
uint64_t next_file_number_;
uint64_t manifest_file_number_;
uint64_t last_sequence_;
uint64_t log_number_;
uint64_t prev_log_number_; // 0 or backing store for memtable being compacted
// Opened lazily
WritableFile* descriptor_file_;
log::Writer* descriptor_log_;
Version dummy_versions_; // Head of circular doubly-linked list of versions.
Version* current_; // == dummy_versions_.prev_
// Per-level key at which the next compaction at that level should start.
// Either an empty string, or a valid InternalKey.
std::string compact_pointer_[config::kNumLevels];
};
0.3VersionEdit
- 到这里你可能要问了,一个Version到另外一个Version是如何过渡产生的呢?VersionEdit就是来负责这个事情。其实相邻的Version可以通过如下表示得到:
Version(N) + VersionEdit = Version(N+1)
```c++ class VersionEdit { public: VersionEdit() { Clear(); } ~VersionEdit() = default;
void Clear();
void SetComparatorName(const Slice& name) { has_comparator_ = true; comparator_ = name.ToString(); } void SetLogNumber(uint64t num) { has_log_number = true; log_number_ = num; } void SetPrevLogNumber(uint64t num) { has_prev_log_number = true; prev_log_number_ = num; } void SetNextFile(uint64t num) { has_next_file_number = true; next_file_number_ = num; } void SetLastSequence(SequenceNumber seq) { has_last_sequence_ = true; last_sequence_ = seq; } void SetCompactPointer(int level, const InternalKey& key) { compact_pointers_.push_back(std::make_pair(level, key)); }
// Add the specified file at the specified number. // REQUIRES: This version has not been saved (see VersionSet::SaveTo) // REQUIRES: “smallest” and “largest” are smallest and largest keys in file void AddFile(int level, uint64t file, uint64_t file_size, const InternalKey& smallest, const InternalKey& largest) { FileMetaData f; f.number = file; f.file_size = file_size; f.smallest = smallest; f.largest = largest; new_files.push_back(std::make_pair(level, f)); }
// Delete the specified “file” from the specified “level”. void RemoveFile(int level, uint64t file) { deleted_files.insert(std::make_pair(level, file)); }
void EncodeTo(std::string* dst) const; Status DecodeFrom(const Slice& src);
std::string DebugString() const;
private: friend class VersionSet;
typedef std::set<std::pair<int, uint64_t» DeletedFileSet;
std::string comparator_; uint64t log_number; uint64t prev_log_number; uint64t next_file_number; SequenceNumber last_sequence_; bool has_comparator_; bool has_log_number_; bool has_prev_log_number_; bool has_next_file_number_; bool has_last_sequence_;
std::vector<std::pair<int, InternalKey» compact_pointers_; DeletedFileSet deleted_files_; std::vector<std::pair<int, FileMetaData» new_files_; }; ```
- VersionEdit中最最重要的3个成员变量是:
std::vector< std::pair<int, InternalKey> > compact_pointers_; DeletedFileSet deleted_files_; std::vector< std::pair<int, FileMetaData> > new_files_;
看这个变量名就可以清楚他的作用,分别为:用于记录compact的指针、要删掉的sst文件几个、新产生的sst文件。compact相关的我们在后边再来解释。
- 为了避免leveldb进程崩溃导致的数据丢失,leveldb会将元信息持久化到磁盘,存储到Manifest文件中。每当有新的Version产生就会更新Manifest,所以新增数据正好对应于VersionEdit内容,也就是说Manifest文件记录的是一组VersionEdit值。一个Manifest文件中,包含了多条Session Record。一个SessionRecord记录了从上一个版本至该版本的变化情况。一个Session Record可能包含以下字段:
- Comparer的名称;
- 最新的journal文件编号;
- 下一个可以使用的文件编号;
- 数据库已经持久化数据项中最大的sequence number;
- 新增的文件信息;
- 删除的文件信息;
- compaction记录信息;
- 这里可以引出一个问题,可以先看下下边的这个操作:
Version0 + VersionEdit0 -> Version1 + VersionEdit1 -> Version2 + VersionEdit2 -> ... -> VersionN
可以看出恢复元信息的过程也变成了依次应用VersionEdit的过程,这个过程中有大量的中间Version产生,但这些并不是我们所需要的。所以leveldb引入VersionSet::Builder来避免这种中间变量,先将所有的VersoinEdit内容整理到VersionBuilder中,然后一次应用产生最终的Version,操作就变成了:
Version0 + VersionEdit0 + VersionEdit1 + VersionEdit2 -> ... -> VersionN
- 本篇先简单介绍了leveldb中涉及元信息管理的一些数据结构,下一篇将介绍涉及这些数据结构的一下操作。