Project 2: B+ Trees

Project Structure Diagram

项目结构图

Task1: LeafNode::fromBytes

You should first implement the fromBytes in LeafNode. This method reads a LeafNode from a page. For information on how a leaf node is serialized, see LeafNode::toBytes. For an example on how to read a node from disk, see InnerNode::fromBytes. Your code should be similar to the inner node version but should account for the differences between how inner nodes and leaf nodes are serialized. You may find the documentation in ByteBuffer.java helpful. Once you have implemented fromBytes you should be passing TestLeafNode::testToAndFromBytes.

toBytesfromBytes分别是用来将节点序列化为字节流和反序列化的代码。而 LeafNode 与 InnerNode 大致相同,照着写即可。唯一需要注意的是rightSibling,我一开始没有处理右指针为-1 的情况。具体实现为:

public static LeafNode fromBytes(BPlusTreeMetadata metadata, BufferManager bufferManager,
                                     LockContext treeContext, long pageNum) {
        // TODO(proj2): implement
        // Note: LeafNode has two constructors. To implement fromBytes be sure to
        // use the constructor that reuses an existing page instead of fetching a
        // brand new one.
        Page page = bufferManager.fetchPage(treeContext, pageNum);
        Buffer buf = page.getBuffer();

        byte nodeType = buf.get();
        assert(nodeType == (byte) 1);

        long rightSibling = buf.getLong();
        Optional<Long> rightSiblingOpt = (rightSibling == -1L) ? Optional.empty() : Optional.of(rightSibling);

        List<DataBox> keys = new ArrayList<>();
        List<RecordId> rids = new ArrayList<>();

        int n = buf.getInt();
        for (int i = 0; i < n; ++i) {
            keys.add(DataBox.fromBytes(buf, metadata.getKeySchema()));
            rids.add(RecordId.fromBytes(buf));
        }

        return new LeafNode(metadata, bufferManager, page, keys, rids, rightSiblingOpt, treeContext);
    }

运行测试,通过:

Task 2: get, getLeftmostLeaf, put, remove

After implementing fromBytes, you will need to implement the following methods in LeafNode, InnerNode, and BPlusTree:

  • get

  • getLeftmostLeaf (LeafNode and InnerNode only)

  • put

  • remove

For more information on what these methods should do refer to the comments in BPlusTree and BPlusNode.

Each of these methods, although split into three different classes, can be viewed as one recursive action each - the BPlusTree method starts the call, the InnerNode method is the recursive case, and the LeafNode method is the base case. It's suggested that you work on one method at a time (over all three classes).

We've provided a sync() method in LeafNode and InnerNode. The purpose of sync() is to ensure that the representation of a node in our buffers is up-to-date with the representation of the node in program memory.

Do not forget to call sync() when implementing the two mutating methods (put and remove); it's easy to forget.

1. get函数

  • BPlusTreeget的入口层,负责控制整体流程。首先从根节点开始导航,经过递归查找到叶子节点,并查找实际的键值;

  • InnerNode为路由层,通过二分查找与递归返回叶子节点;

  • LeafNode为终端层,负责查找实际键值。

这里我从InnerNodeget开始:

public LeafNode get(DataBox key) {
        int index = InnerNode.numLessThanEqual(key, keys);
        BPlusNode child = getChild(index);
        return child.get(key);
    }

numLessThanEqual方法直接找到小于等于key对应的索引,然后进行递归。这里我们把child的类型定义为BPlusNode抽象类,child.get(key)会自动处理递归,根据child的类型分别调用两个类里面的get方法。

注意写完以后直接运行TestInnerNode中的testGet测试并不会通过,因为递归的终止情况是child的类型为LeafNode然后调用对应的get方法,目前这里返回的是null,需要完成LeafNode中的get方法:

public LeafNode get(DataBox key) {
        return this;
    }

再分别运行TestLeafNode中的testGetLTestInnerNode中的testGet,就可以通过啦。

再去写BPlusTree中的get方法:

public Optional<RecordId> get(DataBox key) {
        typecheck(key);
        // TODO(proj4_integration): Update the following line
        LockUtil.ensureSufficientLockHeld(lockContext, LockType.NL);

        // TODO(proj2): implement
        LeafNode leaf = root.get(key);

        return leaf.getKey(key);
    }

2. getLeftmostLeaf函数

跟上面的思路其实差不多,先实现LeafNode里的:

public LeafNode getLeftmostLeaf() {
        return this;
    }

因为n.getLeftmostLeaf要求是返回以n为根节点的子树的最左节点,叶节点在最底层直接返回自己即可。测试通过:

然后来写InnerNode的,很简单:

public LeafNode getLeftmostLeaf() {
        assert(children.size() > 0);
        BPlusNode child = getChild(0);
        return child.getLeftmostLeaf();
    }

运行测试通过:

3. put函数

整体思路大致如下:

首先写LeafNode中的put函数,这个函数需要检查是否有重复键,并将其插入到正确位置,如果出现溢出,再分裂叶子节点,并返回中间键和新节点的指针。具体实现如下:

public Optional<Pair<DataBox, Long>> put(DataBox key, RecordId rid) {
        // TODO(proj2): implement

        // 检查重复键
        if (keys.contains(key)) {
            throw new BPlusTreeException("Leaf already has the key");
        }

        // 利用numLessThan函数寻找插入位置并执行
        int index = InnerNode.numLessThan(key, keys);
        keys.add(index, key);
        rids.add(index, rid);

        // 最大键数
        int maxKeys = 2 * metadata.getOrder();

        // 如果没有溢出
        if (keys.size() <= maxKeys) {
            sync();
            return Optional.empty();
        }

        // 溢出情况
        // 获取中间键
        int splitIndex = maxKeys / 2;
        DataBox splitKey = keys.get(splitIndex);

        // 创建新的叶子节点
        List<DataBox> rightKeys = new ArrayList<>(keys.subList(splitIndex, keys.size()));
        List<RecordId> rightRids = new ArrayList<>(rids.subList(splitIndex, rids.size()));
        LeafNode rightSibling = new LeafNode(metadata, bufferManager, rightKeys, rightRids, this.rightSibling, treeContext);

        // 更新旧的叶子节点
        keys = new ArrayList<>(keys.subList(0, splitIndex));
        rids = new ArrayList<>(rids.subList(0, splitIndex));
        this.rightSibling = Optional.of(rightSibling.getPage().getPageNum());
        sync();

        return Optional.of(new Pair<>(splitKey, rightSibling.getPage().getPageNum()));
    }

运行测试验证(共三个测试):

均通过🥰🥰🥰。

然后来写InnerNode部分,按照思路设计,内部节点需要分裂时循环调用的逻辑放在BPlusTree中实现,在这部分中我们只实现一次的情况即可。

public Optional<Pair<DataBox, Long>> put(DataBox key, RecordId rid) {
    // 找到合适的子节点索引
    int index = InnerNode.numLessThanEqual(key, keys);
    BPlusNode child = getChild(index);

    // 递归插入到子节点
    Optional<Pair<DataBox, Long>> result = child.put(key, rid);

    if (result.isPresent()) {
        // 子节点分裂,插入中间键和新节点指针
        Pair<DataBox, Long> splitResult = result.get();
        keys.add(index, splitResult.getFirst());
        children.add(index + 1, splitResult.getSecond());

        // 检查当前节点是否溢出
        if (keys.size() > 2 * metadata.getOrder()) {
            // 分裂当前节点
            int mid = metadata.getOrder();
            DataBox splitKey = keys.get(mid);

            // 创建新节点
            List<DataBox> rightKeys = keys.subList(mid + 1, keys.size());
            List<Long> rightChildren = children.subList(mid + 1, children.size());
            InnerNode right = new InnerNode(metadata, bufferManager, rightKeys, rightChildren, treeContext);

            // 更新当前节点
            keys = keys.subList(0, mid);
            children = children.subList(0, mid + 1);
            sync();

            return Optional.of(new Pair<>(splitKey, right.getPage().getPageNum()));
        }
        sync();
    }
    return Optional.empty();
}

运行测试通过:

最后写BPlusTreeput部分,其中处理根节点分裂的逻辑:

public void put(DataBox key, RecordId rid) {
        typecheck(key);
        // TODO(proj4_integration): Update the following line
        LockUtil.ensureSufficientLockHeld(lockContext, LockType.NL);

        // TODO(proj2): implement
        // Note: You should NOT update the root variable directly.
        // Use the provided updateRoot() helper method to change
        // the tree's root if the old root splits.
        Optional<Pair<DataBox, Long>> result = root.put(key, rid);

        if (result.isPresent()) {
            List<DataBox> keys = new ArrayList<>();
            keys.add(result.get().getFirst());

            List<Long> children = new ArrayList<>();
            children.add(root.getPage().getPageNum());
            children.add(result.get().getSecond());

            InnerNode newRoot = new InnerNode(metadata, bufferManager, keys, children, lockContext);
            updateRoot(newRoot);
        }
    }

但由于scanAll函数尚未完成,无法通过testRandomPuts()测试。

remove函数

依然是先写LeafNode部分,比较简单:

public void remove(DataBox key) {
        // TODO(proj2): implement
        int index = keys.indexOf(key);
        if (index == -1) {
            throw new BPlusTreeException("Leaf does not have the key");
        }

        keys.remove(index);
        rids.remove(index);
    }

但我在写InnerNode部分遇到问题,测试一直失败,应该是没有sync的问题,修改LeafNode代码:

public void remove(DataBox key) {
        int index = Collections.binarySearch(keys, key);
        if (index == -1) {
            throw new BPlusTreeException("Leaf does not have the key");
        }

        keys.remove(index);
        rids.remove(index);

        sync();
    }

再写InnerNode部分代码:

public void remove(DataBox key) {
        // TODO(proj2): implement
        int index = InnerNode.numLessThanEqual(key, keys);
        BPlusNode child = getChild(index);
        child.remove(key);
        sync();
    }

测试通过

再完成BPlusTreeremove部分:

public void remove(DataBox key) {
        typecheck(key);
        // TODO(proj4_integration): Update the following line
        LockUtil.ensureSufficientLockHeld(lockContext, LockType.NL);

        // TODO(proj2): implement
        root.remove(key);
    }

那么Task2就完成了。

Task 3: Scans

You will need to implement the following methods in BPlusTree:

  • scanAll

  • scanGreaterEqual

In order to implement these, you will have to complete the BPlusTreeIterator inner class in BPlusTree.javato complete these two methods.

After completing this task, you should be passing TestBPlusTree::testRandomPuts

Your implementation does not have to account for the tree being modified during a scan. For the time being you can think of this as there being a lock that prevents scanning and mutation from overlapping, and that the behavior of iterators created before a modification is undefined (you can handle any problems with these iterators however you like, or not at all).

首先看scanAllscanGreaterEqual的注释提示,大致的调用关系如下:

直接写迭代器:

private class BPlusTreeIterator implements Iterator<RecordId> {
        // TODO(proj2): Add whatever fields and constructors you want here.
        private LeafNode currentLeafNode;
        private int currentIndex;

        public BPlusTreeIterator(LeafNode startLeafNode, int startIndex) {
            this.currentLeafNode = startLeafNode;
            this.currentIndex = startIndex;
        }

        @Override
        public boolean hasNext() {
            // TODO(proj2): implement
            if (currentLeafNode == null) {
                return false;
            }
            // 处理叶节点遍历完需要跨叶
            if (currentIndex >= currentLeafNode.getKeys().size()) {
                currentLeafNode = currentLeafNode.getRightSibling().orElse(null);
                currentIndex = 0;
            }
            return currentLeafNode != null && currentIndex < currentLeafNode.getKeys().size();
        }

        @Override
        public RecordId next() {
            // TODO(proj2): implement
            if (!hasNext()) {
                throw new NoSuchElementException();
            }

            RecordId rid = currentLeafNode.getRids().get(currentIndex);
            currentIndex++;

            return rid;
        }
    }

回去写一下那两个函数,通过传入构造函数的参数不同来实现:

首先是scanAll函数:

public Iterator<RecordId> scanAll() {
        // TODO(proj4_integration): Update the following line
        LockUtil.ensureSufficientLockHeld(lockContext, LockType.NL);

        // TODO(proj2): Return a BPlusTreeIterator.
        return new BPlusTreeIterator(root.getLeftmostLeaf(), 0);
    }

然后是scanGreaterEqual函数:

public Iterator<RecordId> scanGreaterEqual(DataBox key) {
        typecheck(key);
        // TODO(proj4_integration): Update the following line
        LockUtil.ensureSufficientLockHeld(lockContext, LockType.NL);

        // TODO(proj2): Return a BPlusTreeIterator.
        LeafNode startLeaf = root.get(key);
        if (startLeaf == null) {
            return Collections.emptyIterator();
        }
        int index = startLeaf.getKeys().indexOf(key);
        return new BPlusTreeIterator(startLeaf, index);
    }

运行之前没通过的测试:

成功,Task3结束🤩!

Task 4: Bulk Load

Much like the methods from the Task 2 you'll need to implement bulkLoad within all three of LeafNode, InnerNode, and BPlusTree. Since bulk loading is a mutating operation you will need to call sync(). Be sure to read the instructions in BPluNode::bulkLoad carefully to ensure you split your nodes properly. We've provided a visualization of bulk loading for an order 2 tree with fill factor 0.75 (powerpoint slides here).

After this, you should pass all the Project 2 tests we have provided to you (and any you add yourselves)! These are all the provided tests in database.index.*.

先写LeafNode部分:这部分要注意的就是分裂时要保留maxSize,移出1个元素到新节点,其余部分跟put的实现有些类似:

public Optional<Pair<DataBox, Long>> bulkLoad(Iterator<Pair<DataBox, RecordId>> data,
            float fillFactor) {
        // TODO(proj2): implement
        // 计算填充阈值,向上取整
        int maxSize = (int)Math.ceil(2 * metadata.getOrder() * fillFactor);

        while (data.hasNext() && keys.size() <= maxSize) {
            Pair<DataBox, RecordId> entry = data.next();
            keys.add(entry.getFirst());
            rids.add(entry.getSecond());
        }

        if (!data.hasNext()) {
            sync();
            return Optional.empty();
        }

        // 分裂
        List<DataBox> rightKeys = keys.subList(maxSize, keys.size());
        List<RecordId> rightRids = rids.subList(maxSize, rids.size());

        LeafNode rightNode = new LeafNode(metadata, bufferManager, rightKeys, rightRids, this.rightSibling, treeContext);

        keys = new ArrayList<>(keys.subList(0, maxSize));
        rids = new ArrayList<>(rids.subList(0, maxSize));
        this.rightSibling = Optional.of(rightNode.getPage().getPageNum());
        sync();

        return Optional.of(new Pair<>(rightNode.getKeys().get(0), rightNode.getPage().getPageNum()));
    }

测试:

全部通过。

然后写InnerNode的部分,也很类似。

public Optional<Pair<DataBox, Long>> bulkLoad(Iterator<Pair<DataBox, RecordId>> data,
            float fillFactor) {
        // TODO(proj2): implement
        int maxSize = (int)Math.ceil(2 * metadata.getOrder());

        while (data.hasNext() && keys.size() < maxSize) {
            BPlusNode rightMostChild = getChild(children.size() - 1);
            Optional<Pair<DataBox, Long>> result = rightMostChild.bulkLoad(data, fillFactor);

            if (result.isPresent()) {
                Pair<DataBox, Long> splitResult = result.get();
                keys.add(splitResult.getFirst());
                children.add(splitResult.getSecond());
            }
        }

        if (!data.hasNext()) {
            sync();
            return Optional.empty();
        }

        int mid = metadata.getOrder();
        DataBox splitKey = keys.get(mid);
        List<DataBox> rightKeys = new ArrayList<>(keys.subList(mid + 1, keys.size()));
        List<Long> rightChildren = children.subList(mid + 1, children.size());
        InnerNode right = new InnerNode(metadata, bufferManager, rightKeys, rightChildren, treeContext);

        keys = new ArrayList<>(keys.subList(0, mid + 1));
        children = new ArrayList<>(children.subList(0, mid + 1));

        sync();
        return Optional.of(new Pair<>(splitKey, right.getPage().getPageNum()));
    }

BPlusTree调用InnerNode的函数以及新建节点的思路基本与put一样。put写出来以后这个还是比较简单的:

public void bulkLoad(Iterator<Pair<DataBox, RecordId>> data, float fillFactor) {
        // TODO(proj4_integration): Update the following line
        LockUtil.ensureSufficientLockHeld(lockContext, LockType.NL);

        // TODO(proj2): implement
        // Note: You should NOT update the root variable directly.
        // Use the provided updateRoot() helper method to change
        // the tree's root if the old root splits.
        if (scanAll().hasNext()) {
            throw new RuntimeException("Tree is not empty");
        }

        while (data.hasNext()) {
            Optional<Pair<DataBox, Long>> result = root.bulkLoad(data, fillFactor);
            if (result.isPresent()) {
                Pair<DataBox, Long> pair = result.get();
                List<DataBox> keys = new ArrayList<>();
                List<Long> children = new ArrayList<>();
                keys.add(pair.getFirst());
                children.add(root.getPage().getPageNum());
                children.add(pair.getSecond());
                InnerNode newRoot = new InnerNode(metadata, bufferManager, keys, children, lockContext);
                updateRoot(newRoot);
            }
        }
        return;
    }

测试通过:

最后,按照教程运行CommandLineInterface代码启动CLI,这时候就可以使用数据库命令来调用我们刚写的代码了:

"C:\Program Files\Java\jdk-22\bin\java.exe" "-javaagent:C:\Users\ASUS\AppData\Local\Programs\IntelliJ IDEA Ultimate 2024.3\lib\idea_rt.jar=61448" -Dfile.encoding=UTF-8 -Dsun.stdout.encoding=UTF-8 -Dsun.stderr.encoding=UTF-8 -classpath F:\cs186\cs186-sp25-rookiedb\target\classes edu.berkeley.cs186.database.cli.CommandLineInterface

\|/  ___------___
 \__|--o______o--|
    |  berklee   |
     ---______---

Welcome to RookieDB (v1.8.6-fa24)
=> SELECT * FROM Students AS s WHERE s.sid = 1;
 sid | name              | major     | gpa
-----+-------------------+-----------+-----------
   1 | Augustina Mazzoni | Chemistry | 1.0054202
(1 row)
=> CREATE INDEX on Students(sid);
CREATE INDEX ON Students (sid)
=> exit
exit
Bye!

进程已结束,退出代码为 0

一切顺利,那么到这里Project2就算彻底结束了,只不过遗憾的是没法上传到Gradescope上看看得分情况,希望后面运行的时候不要爆雷啊😰

总结

这次的作业关键还是在于熟悉项目结构以及体会B+树的存储设计,再结合一些CS61B的基础,难度适中,Task2可能需要更多的阅读代码理解各个函数的功能并调用。设计层面的内容也跟CS61B挺像的,不过写CS61B的项目时候一直有点晕晕的,还有点靠AI没有好好体会,这次也算是弥补一下。

Last updated