Indexing Exercise

due: October 10, 2003

The B+-Tree is the most commonly used index structure in databases.The B+-Tree is an index over an attribute in a relation. The B+-Tree consists of nodes, which contain {i>pointers and values. If there are k pointers in a node, then there are at most k-1 values in the node.

There are two types of nodes, leaf and non-leaf. A leaf node contains pointers and values. The values are sorted from least to greatest. If the attribute is unique, then the pointers are tupe identifiers (TIDs) of the tuple that contains the value.If the attribute is non-unique, then the pointer points to a bucket containing all of the TIDs containing the value. The last pointer in a leaf node points to the "next" leaf node (i.e., the next leaf is the one with the least value greater than any value in the current leaf node).

In a non-leaf node, the pointer to the left of a value indicates the sub-tree where all values are less than the value. The pointer to the right of a value indicates the sub-tree where all values are greater than or equal to the value in the node.

A B+-Tree maintains the property that the path from the root to any leaf is the same length. This is achieved through splitting and coalescing nodes. In this exercise, you will not have to coalesce nodes.

Create a B+-Tree for the input file product.txt. The file consists of a key, and item number, a list of keywords associated with the item, a price and an inventory. The B+-Tree will be on the non-unique keyword field. Every keyword should appear as a value in a leaf in a tree.

The nodes of the B+-Tree may have between 50 and 200 pointers. The number of pointers will be a parameter at B+-Tree creation time. Each node must be its own file. The pointers will be file names. The bucket files will consist of the keys from the input file that have the desired keywords.

After B+-Tree creation, you must have a query program. The program should accept either one or two command line arguments. The first argument is required and is the lowest range element in the search. The second argument is optional and is the highest range element in the search. If the second argument is missing, the query program should perform an exact match. The output of the query should be all keys of lines containing the keywords.

The node files may be either binary or ascii. If they are binary, there must exist a program to allow me to read the files.

The program must again be designed to run as an executable on the Linux machines. Either C++ or Java is acceptable.