$25
B+ Tree
• B+ Tree source code (originated http://www.amittai.com/prose/bpt.c) • Download the basic source code uploaded in piazza (bpt.zip)
• Unzip the file.
B+Tree
• B+Tree soruce code (originated http://www.amittai.com/prose/bpt.c)
• Compile the source file using ‘Makefile’
• If you don’t have make util, type ‘sudo apt-get install make’ • Don’t change the makefile unless you add another source file.
• If Makefile doesn’t make libbpt.a library file at the exact path, you’ll get zero score. (your_git_repo/project2/lib/libbpt.a)
• Your project hierarchy will be like this.
• Your_git_repo
• project2
• include /
• lib /
• Makefile
• src /
Hanyang University
Basic trial
• Execute it and try some basic commands.
$ ./main
…
Enter any of the following commands after the prompt : i <k -- Insert <k (an integer) as both key and value).
f <k -- Find the value under key <k.
p <k -- Print the path from the root to key k and its associated value.
…
i 1
1 |
i 2 you can type commands
1 2 | like this and see the results!
i 3
1 2 3 |
i 4
3 |
1 2 | 3 4 |
4
Basic trial
• You can adjust order by giving argument. (see the usage() functions)
• You’d better understand the code fully before implementing the project.
• You can get some help from https://www.cs.usfca.edu/~galles/visualization/BPlusTree.html
Disk-based B+tree
• Note that current design only considers in-memory b+tree.
• Our goal
1. Implement disk-based b+ tree. (like below example)
2. Delayed merge (details in the next slide)
➢ Implement 4 commands : open / insert / find / delete
➢ There should be an appropriate data file in your system (you can call it a very simple database), maintaining disk-based b+ tree after serving those commands. 1. open <pathname
• Open existing data file using ‘pathname’ or create one if not existed.
• All other 3 commands below should be handled after open data file.
2. insert <key <value
• Insert input ‘key/value’ (record) to data file at the right place. A “record” means
• Same key should not be inserted (no duplicate). a <key/value pair
3. find <key
• Find the record containing input ‘key’ and return matching ‘value’.
4. delete <key
• Find the matching record and delete it if found.
➢Your library (libbpt.a) should provide those API services.
1. int open_table (char *pathname);
• Open existing data file using ‘pathname’ or create one if not existed.
• If success, return the unique table id, which represents the own table in this database. Otherwise, return negative value. (This table id will be used for future assignment.)
2. int db_insert (int64_t key, char * value);
• Insert input ‘key/value’ (record) to data file at the right place.
• If success, return 0. Otherwise, return non-zero value.
3. int db_find (int64_t key, char * ret_val);
• Find the record containing input ‘key’.
• If found matching ‘key’, store matched ‘value’ string in ret_val and return 0. Otherwise, return non-zero value.
• Memory allocation for record structure(ret_val) should occur in caller function.
4. int db_delete (int64_t key);
• Find the matching record and delete it if found.
• If success, return 0. Otherwise, return non-zero value.
➢All update operation (insert/delete) should be applied to your data file as an operation unit. That means one update operation should change the data file layout correctly.
➢Note that your code must be worked as other students’ data file. That means, your code should handle open(), insert(), find() and delete() API with other students’ data file as well.
➢So follow the data file layout described from next slides.
➢We fixed the on-disk page size with 4096 Bytes.
➢ We fixed the record (key + value) size with 128 (8 + 120) Bytes.
• type : key = integer & value = string
➢ There are 4 types of page. (detail next slides..)
1. Header page (special, containing metadata)
2. Free page (maintained by free page list)
3. Leaf page (containing records)
4. Internal page (indexing internal/leaf page)
Header Page (Special)
Free Page Number
Root Page Number
Number of pages
(Reserved)
Page 1
Page 2
…
➢Header page is the first page (offset 0-4095) 0 of a data file, and contains metadata.
8
➢When we open the data file at first, 16 initializing disk-based b+tree should be done 24 using this header page.
• Free page number: [0-7]
- points the first free page (head of free page list)
- 0, if there is no free page left. 4096
• Root page number: [8-15] 8192
- pointing the root page within the data file.
- 0, if there is no root page.
• Number of pages: [16-23]
End Of
- how many pages exist in this data file now. File
Free Page
Free Page Layout
Next Free Page Number
(Not used)
➢In previous slide, header page contains the 0 position of the first free page.
8
➢Free pages are linked and allocation is managed by the free page list.
• Next free page Number: [0-7]
- points the next free page.
- 0, if end of the free page list.
Leaf Page
Leaf Page Layout
Page Header
Key (8) + Value (120)
Key (8) + Value (120)
…
Key (8) + Value (120)
➢Leaf page contains the key/value records. 0 ➢Keys are sorted in the page.
➢One record size is 128 bytes and we contain maximum 31 records per one data page.
➢First 128 bytes will be used as a page header 128 for other types of pages. (see next slides) 256 ➢Branching factor (order) = 32
4096
Page Header
Page Header Layout
Parent Page Number
Is Leaf
Number of Keys
(Reserved)
➢Internal/Leaf page have first 128 bytes as a 0 page header. ➢Leaf/Internal page should contain those data
8
(see the node structure in include/bpt.h)
12
• Parent page Number [0-7]: If internal/leaf page, this
field points the position of parent page. 16
Set 0 if it is the root page.
• Is Leaf [8-11] : 0 is internal page, 1 is leaf page.
• Number of keys [12-15] : the number of keys within
this page. 128
Leaf Page (Cont.)
Leaf Page Layout
➢We can say that the order of leaf page in 0 disk-based b+tree is 32, but there is a minor problem.
Next Free Page Number or Parent Page Number
Is Leaf
Number of Keys
(Reserved)
Right Sibling Page Number
Key (8) + Value (120)
Key (8) + Value (120)
…
Key (8) + Value (120)
8
➢There should be one more page number added to store right sibling page number 12
for leaf page. (see the comments of node 16
structure in include/bpt.h)
➢So we define one special page number at
the end of page header. 120
128
➢If rightmost leaf page, right sibling page number field is 0.
Leaf Page (Cont.)
Leaf Page Leaf Page Layout
Next Free Page Number or Parent Page Number
Is Leaf
Number of Keys
(Reserved)
Right Sibling Page Number
Key 𝒌𝟏 (8) + Value 𝒗𝟏 (120)
Key 𝒌𝟐 (8) + Value 𝒗𝟐 (120)
…
Key 𝒌𝟑𝟏(8) + Value 𝒗𝟑𝟏(120)
0
8
12
16
120
128
Internal Page
(Reserved)
One more page Number
Key (8) + Page number (8)
Key (8) + Page number (8)
…
Key (8) + Page number (8)
➢Internal page is similar to leaf page, but 0 Internal Page Layout instead of containing 120 bytes of values, it contains 8 bytes of another page (internal or 16 leaf) number.
➢Internal page also needs one more page number to interpret key ranges and we use the field which is specially defined in the leaf page for indicating right sibling.
120
➢Branch factor (order) = 249
128
• Internal page can have maximum 248 entries, because ‘key + page number’ (8+8 bytes) can cover up to whole page (except page header) with the number of 248.
• (4096 – 128) / (8+8) = 248
Internal Page
Internal Page Layout
(Reserved)
Page A Number
Key𝒌𝟏 (8) + Page B Num (8)
Key𝒌𝟐 (8) + Page C Num (8)
…
0
16
Internal Page
120
128
Disk-based B+tree Example
Hanyang University
// file.h
typedef uint64_t pagenum_t; struct page_t {
// in-memory page structure
};
// Allocate an on-disk page from the free page list pagenum_t file_alloc_page();
// Free an on-disk page to the free page list void file_free_page(pagenum_t pagenum);
// Read an on-disk page into the in-memory page structure(dest) void file_read_page(pagenum_t pagenum, page_t* dest);
// Write an in-memory page(src) to the on-disk page void file_write_page(pagenum_t pagenum, const page_t* src);
// file.c or file.cpp
// Allocate an on-disk page from the free page list pagenum_t file_alloc_page(){ }
// Free an on-disk page to the free page list void file_free_page(pagenum_t pagenum){ }
// Read an on-disk page into the in-memory page structure(dest) void file_read_page(pagenum_t pagenum, page_t* dest){ }
// Write an in-memory page(src) to the on-disk page
void file_write_page(pagenum_t pagenum, const page_t* src){ }
➢We strongly recommand you to implement those APIs for managing a database file and synchronizing between in-memory pages and on-disk pages.
If you violate this layered architecture you will get zero score in this project.
➢When performing a write operation to disk, it is not immediately written to disk. This is because of the method used by the OS to reduce I/O which has a big negative impact on performance.
➢However, in this project, the page on disk and the page in memory should be synchronized.
➢And for this, you make the disk write reflected on disk immediately.
➢So, you need to call a function like fsync() or sync() after calling the fwrite() or write() function.
➢Or, you need to add O_SYNC and O_DIRECT as flags when calling the open() function.
➢Synchronization between disk and memory pages is also included in the scoring scope.
Delayed Merge
➢Structure modification(Split, Merge) incurs heavy disk I/O, resulting in performance degradation.
➢One way to reduce it (and you should implement): Delayed Merge
➢Delayed Merge
• Do not merge a page(leaf, internal) until all keys in the page have deleted, regardless of the branching factor.
Milestone & DEADLINE
➢ Milestone 1
➢ Analyze the given b+ tree code and submit a report to the hconnect Wiki. ➢ Your report should includes
1. Possible call path of the insert/delete operation
2. Detail flow of the structure modification (split, merge)
3. (Naïve) designs or required changes for building on-disk b+ tree