This demo works best on a large screen, preferably fullscreen.
Click the arrow in the bottom right or press the right arrow on your keyboard to start the demo.
This demo is a softer introduction to the data structure explored in more depth in my paper @ validark.github.io/DynSDT.
This demo heavily references the “Completion Trie”, as created by:
Hsu, Bo-June & Ottaviano, Giuseppe. (2013). Space-efficient Data Structures for Top-k Completion. WWW 2013 - Proceedings of the 22nd International Conference on World Wide Web. 583-594. 10.1145/2488388.2488440.
https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/TopKCompletion.pdf
This demo solves the problem of scored prefix completion, which is the problem of finding the top-k (usually 10 or so) highest-scored completions to a given prefix string p in a string corpus where each string term has a numeric score denoting its relevance or popularity. As an example, if a user types “r” into an autocompleted search bar, they should receive a list of the highest scored terms beginning with “r”:
One solution to this problem is to use a compacted trie. To complete a prefix p, successive characters in p can be matched along the corresponding path of edges descending from the root, terminating when p is fully exhausted, arriving upon the locus node. The locus node is the highest node with p as a prefix and it contains all completions to p in its subtree. To enumerate the top-k highest-scored completions to p, a (depth-first) traversal over this entire subtree must be performed, and an online partial sort of the top-k elements must be performed. I.e. during the depth-first search, the top-k highest-scored completions found at each point can be stored in a min-heap constrained to size k. This algorithm takes a time in O(N + n log k), where n is the number of completions to p (i.e. the number of scores to insert into the min-heap) and N is the total length of the n completions (i.e. the upper bound on the number of nodes to traverse).
For example, if the given prefix p to complete is “li”, then the edge containing “l” from the root is followed, and then the edge containing “i” is followed. Under the locus node representing “li” is all completions to “li”.
In the diagram above, the edges representing “lis”, “lin”, “lit”, and “lif” extend from “li”, and each of those subtrees must be traversed in their entirety and each score must be considered as a candidate for the list of the top-k scored completions. Note that this diagram is merely an abbreviated slice of the dataset containing the word frequencies of every word on Wikipedia, which contains over six million terms and thus this algorithm would be incredibly slow.
To improve top-k enumeration time, internal nodes in the trie can be given a score equal to the maximum score in its subtree. The top-k search algorithm then becomes a variation of the A* search algorithm with the scores serving as an exact heuristic function (Hsu & Ottaviano, 2013).
The algorithm proceeds as follows: The path to the first completion is found by following the first child on each layer with a score equal to the score in the locus node. All other children are added to a bounded DEPQ constrained to size k - 1 (which discards the minimum element when an insertion would make its size exceed k). When a completion is found, k is decremented and the highest-scored node in the DEPQ is extracted and the same process is repeated until k completions are found or the DEPQ becomes empty.
To illustrate, this example will find the top-5 completions to the empty string (p = “”, k = 5). The locus node is the root, which has a score of 1220297, representing the maximum score in the trie. All children of the root are added to the DEPQ constrained to size 4, except the node representing “w”, because it has the same score as the locus node. (f, 50468) has a score lower than the 4th highest score found thus far, and therefore it is discarded from the DEPQ.
Note: You can mouse-over any node to see its string:score. You can also click and drag the screen around and scroll to zoom.
Next, the process is repeated for the children of the node representing “w”. The node representing “wi” will be followed next, because it too has a score of 1220297. The other nodes are added to the DEPQ, but because they all have scores lower than the lowest in the DEPQ they are all discarded. Specifically, (w$, 8393), (we, 17837), (wo, 30978) all have scores lower than (i, 62504).
The node representing “wik” matches the target score of 1220297, so it will be followed downwards in the next step. All other children are pushed to the DEPQ but because they all have scores lower than (i, 62504) they are all discarded (not shown in the DEPQ table above).
The node representing “wiki” matches the target score of 1220297, so it will be followed downwards in the next step. All other children of the node representing “wik” are pushed to the DEPQ, but because they all have scores lower than (i, 62504), it remains unchanged.
The node representing “wikip” matches the target score of 1220297, so it will be followed downwards in the next step. All other children of the node representing “wiki” are pushed to the DEPQ, but because they all have scores lower than (i, 62504), it remains unchanged.
The node representing “wikipedi” matches the target score of 1220297, so it will be followed downwards in the next step. All other children of the node representing “wikip” are pushed to the DEPQ, but because they all have scores lower than (i, 62504), it remains unchanged.
The node representing “wikipedia” matches the target score of 1220297, so it will be followed downwards in the next step. All other children of the node representing “wikipedi” are pushed to the DEPQ, but because they all have scores lower than (i, 62504), it remains unchanged.
The path to “wikipedia$” will be taken next. All other children of the node representing “wikipedia” are pushed to the DEPQ, but because they all have scores lower than (i, 62504), it remains unchanged.
The node representing “wikipedia$” is a leaf, thus it is added to the set of completions and k is decremented. The maximum node is then extracted from the DEPQ, which yields the highest-scored node of all of the candidate nodes highlighted in this color. In this case, it happens to be (l, 101139). Therefore, the path from the node representing “l” will be followed next, and the score of the next completion is 101139. The DEPQ is also bounded to size 3 now.
From here on out this demo will pick up the pace and show one completion per frame.
Following the path of nodes with scores equalling 101139 from the node representing “l” leads to the leaf representing “list$”, now added to the set of completions. All of the new candidate nodes are pushed to the DEPQ, but only (list_, 100625) makes it in. During insertion, the lowest value in the DEPQ (i, 62504) is bumped out.
The next completion is found by extracting the maximum from the DEPQ, which is (list_, 100625). k is decremented again and the DEPQ is now bounded to size 2. The path formed by nodes with scores equalling 100625 which starts at the node representing “list_” is followed to its corresponding leaf, which represents “list_of$”, which gets pushed to the completions list. None of the children of the nodes representing “list_”, “list_o”, or “list_of” have scores large enough to penetrate the DEPQ, so it remains unchanged.
The next completion is found by extracting the maximum from the DEPQ, which is (o, 98750). k is decremented again and the DEPQ is now bounded to size 1. The path formed by nodes with scores equalling 98750 which starts at the node representing “o” is followed to its corresponding leaf, which represents “of$”, which gets pushed to the completions list. Again none of the nodes along the path have scores large enough to penetrate the DEPQ, so it remains unchanged.
The next completion is found by extracting the maximum from the DEPQ, which is (t, 66985). k is decremented again and the DEPQ is now bounded to size 0. The path formed by nodes with scores equalling 66985 which starts at the node representing “t” is followed to its corresponding leaf, which represents “the$”, which gets pushed to the completions list. Since the DEPQ is constrained to size 0, none of the nodes along the path are inserted into the DEPQ.
The top-k search over this structure (not including the time to find the locus node) takes Theta(bdk log k) time in the worst-case, where d is the average depth of each of the k leaf nodes sought by this algorithm, and b is the average breadth of each visited level. This results from Theta(bdk) nodes being inserted into the DEPQ (and k of those being extracted), which takes worst-case Theta(log k) time for each (even though many of the nodes could be rejected with a single comparison against the minimum element in the DEPQ).
To improve this search time, Hsu & Ottaviano (2013) sort each horizontal layer during preprocessing. Rather than inserting into the DEPQ every child of each node iteratively extracted, only the second highest scored node from each level along the path to the next completion is added to the DEPQ as the first-child nodes are iteratively followed to the leaf representing the next completion. Each node in the DEPQ thus acts like a forward iterator which inserts (among others) the next element on its level when extracted (Hsu & Ottaviano, 2013).
When the trie is drawn in the left-child right sibling representation, finding the top-5 completions to the empty string (p = “”, k = 5) looks exactly like the k-way merge algorithm.
Because nodes are sorted on each level by their score, the first completion is the leaf at the end of the chain of first-child nodes starting at the locus node. In other words, the first completion is either the locus node itself, or its first child, or that node's first child, and so on. The nodes adjacent to the path to the first completion are added to the DEPQ which is constrained to size 4.
Next, k is decremented and the maximum node from the DEPQ is extracted, which is (l, 101139). The extracted node's first-child path is followed down to the leaf, and all nodes adjacent to the path are added to the DEPQ which is now constrained to size 3. The net effect on the DEPQ is that (o, 98750) and (list_, 100625) supplant (wikt, 36) and (wil, 27706). The nodes (le, 22168), (lin, 6574) and (lisa, 1342) are ultimately discarded as well.
Next, k is decremented and the maximum node from the DEPQ is extracted, which is (list_, 100625). The extracted node's first-child path is followed down to the leaf, and all nodes adjacent to the path are added to the DEPQ which is now constrained to size 2. The nodes along the path ((listi, 2974), (list_a, 50) and (list ob, 1)) have lower scores than (wo, 30978) and therefore have no effect on the DEPQ.
Next, k is decremented and the maximum node from the DEPQ is extracted, which is (o, 98750). The extracted node's first-child path is followed down to the leaf, and all nodes adjacent to the path are added to the DEPQ which is now constrained to size 1. The net effect on the DEPQ is that (t, 66985) supplants (wo, 30978). The nodes (op, 13563) and (of_, 46771) are pushed to the DEPQ but are discarded as well.
Lastly, k is decremented and the maximum node from the DEPQ is extracted, which is (t, 66985). The extracted node's first-child path is followed down to the leaf, yielding the last completion. The DEPQ is constrained to size 0 so nodes along the path cannot be inserted into the DEPQ.
The top-k search over this structure (not including the time to find the locus node) takes Theta(dk log k) time in the worst-case, where d is the average depth of each of the k leaf nodes sought by this algorithm. This results from Theta(dk) nodes being inserted into the DEPQ (and k of those being extracted), which takes worst-case Theta(log k) time for each (even though many of the nodes could be rejected with a single comparison against the minimum element in the DEPQ).
Intuitively, since sorting horizontally removed the breadth factor b, it stands to reason that sorting vertically on top of that will remove the depth factor d. Then, instead of inserting all the nodes along the path to the leaf, only the (next) highest-scored node from each vertical list needs to be in the priority queue at any given time.
Data Structure | Top-k Search Time |
---|---|
Unsorted Completion Trie | Theta(bdk log k) |
Horizontally Sorted Completion Trie | Theta(dk log k) |
Horizontally and Vertically Sorted Trie? | Theta(k log k) |
The previous trie is decomposed by compressing paths to maximum completions and labeling the branch points along each compressed path with their longest common prefix length (LCP) with the maximum completion. E.g. The path to “wikipedia$” is compressed into the root node, and the first-sibling along the path representing “wo” was compressed into “world”, and those strings have an LCP of 1 because because they have the prefix “w” in common (which has a length of 1).
Now that LCP values are explicitly stored, the structure can be sorted vertically. Since this structure was already sorted horizontally, it doubles as a max-heap as well as a trie.
When this structure is drawn in the left-child right sibling representation, it becomes apparent that the top-k search algorithm over this data structure is equivalent to finding the top-k nodes from an immutable max heap.
The first completion is the locus node, which in this case is the root (wikipedia, 1220297) because p is the empty string. The second completion is its first branch point*, in this case (list, 101139). Since the first two completions have already been found, k is decremented twice. Next, the branch point succeeding (list, 101139) in its vertical list and its first branch point are pushed to a DEPQ constrained to size k (which is 3 after being decremented twice).
* Simplification, will be elaborated upon later
The highest-scored completion is then extracted from the DEPQ, which in this case is (list of, 100625). It is pushed to the output list and k is decremented, constraining the DEPQ to size 2. Next the successor branch point and first branch point within the node (list of, 100625) are pushed to the DEPQ. (listings, 2974) gets discarded, but (of, 98750) becomes the new maximum, so it will be selected next.
(of, 98750) is extracted from the DEPQ and pushed to the output list, decrementing k, thereby constraining the DEPQ to size 1. (league, 22168) and (the, 66985) are then pushed to the DEPQ, which ultimately retains only (the, 66985).
The last value is extracted from the DEPQ and pushed to the output list. k is decremented to 0, so no further nodes are added to the DEPQ.
The top-k search demonstrated here takes Theta(k log k) time in the worst-case. This results from Theta(k) nodes being inserted into and extracted from the DEPQ, which takes worst-case Theta(log k) time for each (even though many of the nodes could be rejected with a single comparison against the minimum element in the DEPQ).
However, because the structure was queried with the empty string, the algorithm was simplified slightly. Next, the structure will be queried with (p = “li”, k = 10) to demonstrate a more realistic query.
To find the locus node, the algorithm starts at the root (0, wikipedia) and computes the LCP with between its string and p, in this case LCP(“wikipedia”, “li”) = 0. Thus, the algorithm moves to the LCP matching 0 in the current node's list of branch points, which happens to be the first node in this case: (0, “list”). The same process is repeated until the LCP = |p| = 2. In this case, LCP(“list”, “li”) = 2, so (0, “list”) is the locus node. The locus node is the first completion, so it is pushed to the output list.
The second completion will be discussed in the next step.
The second completion to “li” is the first branch point contained in the locus with an LCP ≥ |p| (i.e. 2). In this case, it does not change the second completion, but it does change which vertical successor of the second completion is inserted into the DEPQ. In this case, (0, “of”) and (1, “league”) are not valid completions to p, both with LCP < |p|.
The rest of the algorithm continues the same way as already described.
As demonstrated, it is possible that the top-k search algorithm may need to skip all nodes contained directly within the locus node with an LCP < |p|, of which there are at most |p|. Accounting for this, the time complexity for this algorithm is in O(|p| + k log k).
Assuming the locus node can be found in Theta(|p|) time, the total time it takes to find the locus node and perform the top-k search is Theta(|p| + k log k).
Thank you for making it this far! If you appreciated this, you should read the paper @ validark.github.io/DynSDT!