DynSDT demo

Welcome!

This demo works best on a large screen, preferably fullscreen.
Click the arrow in the bottom right or press the right arrow on your keyboard to start the demo.
This demo is a softer introduction to the data structure explored in more depth in my paper at validark.github.io/DynSDT.

This demo heavily references the “Completion Trie”, as created by:

Hsu, Bo-June & Ottaviano, Giuseppe. (2013). Space-efficient Data Structures for Top-k Completion. WWW 2013 - Proceedings of the 22nd International Conference on World Wide Web. 583-594. 10.1145/2488388.2488440.
https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/TopKCompletion.pdf

The Problem

This demo solves the problem of scored prefix completion: finding the top-k (usually 10 or so) highest-scored completions to a given prefix string p in a string corpus in which each term has a numeric score denoting its relevance or popularity. As an example, if a user types “r” into an autocompleting search bar, they should receive a list of the highest-scored terms beginning with “r”:

rubber ducky
rare fidget toys
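
As a reference point, the problem statement can be written as a brute-force scan over the whole corpus. This is only a sketch of the definition, not the approach this demo builds towards. The function name top_k_completions, the dict-based corpus, and all scores below are made up for illustration.

```python
import heapq

def top_k_completions(corpus, prefix, k):
    """Return up to k (term, score) pairs that start with prefix, best first."""
    matches = ((term, score) for term, score in corpus.items()
               if term.startswith(prefix))
    # nlargest keeps only k candidates at a time and returns them in
    # descending score order.
    return heapq.nlargest(k, matches, key=lambda pair: pair[1])

# Hypothetical scores, chosen only for this example.
corpus = {
    "rubber ducky": 90,
    "rare fidget toys": 70,
    "red balloon": 40,
    "green tea": 95,
}
print(top_k_completions(corpus, "r", 2))
```

Every query here touches every term in the corpus, which is the cost the trie-based structures in the following sections try to avoid.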

The Data Structure

One solution to this problem is to use a compacted trie. To complete a prefix p, successive characters in p are matched along the corresponding path of edges descending from the root until p is exhausted, arriving at the locus node. The locus node is the highest node with p as a prefix, and its subtree contains all completions to p. To enumerate the top-k highest-scored completions to p, a depth-first traversal of this entire subtree must be performed together with an online partial sort of the top-k elements. That is, during the depth-first search, the k highest-scored completions found so far are kept in a min-heap constrained to size k. This algorithm runs in O(N + n log k) time, where n is the number of completions to p (i.e. the number of scores to insert into the min-heap) and N is the total length of the n completions (i.e. the upper bound on the number of nodes to traverse).
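
The traversal described above can be sketched roughly as follows. To keep it short, this sketch assumes a plain character-per-edge trie rather than a compacted one. The names Node, insert, find_locus and top_k, and all scores, are hypothetical.

```python
import heapq

class Node:
    def __init__(self):
        self.children = {}  # edge character -> Node
        self.score = None   # set only on nodes where a term ends

def insert(root, term, score):
    node = root
    for ch in term:
        node = node.children.setdefault(ch, Node())
    node.score = score

def find_locus(root, prefix):
    """Follow prefix from the root; return None if it leaves the trie."""
    node = root
    for ch in prefix:
        node = node.children.get(ch)
        if node is None:
            return None
    return node

def top_k(root, prefix, k):
    locus = find_locus(root, prefix)
    if locus is None or k <= 0:
        return []
    heap = []                  # min-heap of (score, term), at most k entries
    stack = [(locus, prefix)]  # explicit stack for the depth-first search
    while stack:
        node, term = stack.pop()
        if node.score is not None:
            if len(heap) < k:
                heapq.heappush(heap, (node.score, term))
            elif node.score > heap[0][0]:
                # Better than the worst kept completion: swap it in.
                heapq.heapreplace(heap, (node.score, term))
        for ch, child in node.children.items():
            stack.append((child, term + ch))
    return [(term, score) for score, term in sorted(heap, reverse=True)]

# Hypothetical scores, chosen only for this example.
root = Node()
for term, score in [("list", 50), ("line", 80), ("lit", 30),
                    ("life", 60), ("low", 99)]:
    insert(root, term, score)
print(top_k(root, "li", 2))
```

Note that every node under the locus is visited no matter how small k is. Only the heap is bounded, not the traversal.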

Example Traversal

For example, if the given prefix p to complete is “li”, then the edge containing “l” from the root is followed, and then the edge containing “i” is followed. Under the locus node representing “li” are all completions to “li”.

In the diagram above, the edges representing “lis”, “lin”, “lit”, and “lif” extend from “li”. Each of those subtrees must be traversed in its entirety, and each score must be considered as a candidate for the list of the top-k scored completions. Note that this diagram is merely an abbreviated slice of the dataset containing the word frequencies of every word on Wikipedia. The full dataset contains over six million terms, so this algorithm would be incredibly slow on it.

The Completion Trie

To improve top-k enumeration time, each internal node in the trie can be given a score equal to the maximum score in its subtree. The top-k search algorithm then becomes a variation of the A* search algorithm with the scores serving as an exact heuristic function (Hsu & Ottaviano, 2013).

The algorithm proceeds as follows: The path to the first completion is found by following the first child on each layer with a score equal to the score in the locus node. All other children are added to a bounded double-ended priority queue (DEPQ) constrained to size k - 1, which discards the minimum element when an insertion would make its size exceed k - 1. When a completion is found, k is decremented, the highest-scored node in the DEPQ is extracted, and the same process is repeated from that node until k completions are found or the DEPQ becomes empty.
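
The steps above can be sketched as follows. This is a simplified illustration under several assumptions, not the implementation from Hsu & Ottaviano or from my paper: the trie is uncompacted, each term ends in a “$” leaf (so terms must not contain “$”), and the bounded DEPQ is faked with a sorted Python list. All names and scores are hypothetical.

```python
import bisect

END = "$"  # assumed terminator edge; terms must not contain it

class Node:
    def __init__(self):
        self.children = {}          # edge character -> Node
        self.score = float("-inf")  # maximum score in this node's subtree

def insert(root, term, score):
    node = root
    node.score = max(node.score, score)
    for ch in term + END:
        node = node.children.setdefault(ch, Node())
        node.score = max(node.score, score)

def push_bounded(depq, entry, bound):
    """Insert into the ascending list; drop the minimum if over the bound."""
    bisect.insort(depq, entry)
    if len(depq) > bound:
        depq.pop(0)

def top_k(root, prefix, k):
    node, path = root, prefix
    for ch in prefix:  # walk down to the locus node
        node = node.children.get(ch)
        if node is None:
            return []
    if not node.children:
        return []
    completions, depq = [], []
    while k > 0:
        while node.children:  # descend to the leaf that carries node.score
            followed = None
            for ch, child in node.children.items():
                if followed is None and child.score == node.score:
                    followed = (child, path + ch)
                else:
                    # Paths are unique, so tuple comparison never reaches
                    # the Node object itself.
                    push_bounded(depq, (child.score, path + ch, child), k - 1)
            node, path = followed
        completions.append((path[:-1], node.score))  # strip the END marker
        k -= 1
        if not depq:
            break
        _, path, node = depq.pop()  # extract the highest-scored node
    return completions

# Hypothetical scores, chosen only for this example.
root = Node()
for term, score in [("wiki", 100), ("with", 90), ("was", 80),
                    ("list", 70), ("line", 60), ("of", 50)]:
    insert(root, term, score)
print(top_k(root, "", 3))
```

Unlike the plain depth-first search, this only descends into subtrees that can still contribute one of the k best completions. A sorted list makes insertion linear in the queue size, which is acceptable for small k. A real DEPQ (for example a min-max heap) would avoid that.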

Top-k Search Example

completions = {}
DEPQ = { (f, 50468), (i, 62504), (t, 66985), (o, 98750), (l, 101139) }
k = 5

To illustrate, this example will find the top-5 completions to the empty string (p = “”, k = 5). The locus node is the root, which has a score of 1220297, representing the maximum score in the trie. All children of the root are added to the DEPQ constrained to size 4, except the node representing “w”, because it has the same score as the locus node. (f, 50468) has a score lower than the 4th highest score found thus far, and therefore it is discarded from the DEPQ.

Note: You can mouse-over any node to see its string:score. You can also click and drag the screen around and scroll to zoom.

Traversing downwards

completions = {}
DEPQ = { (w$, 8393), (we, 17837), (wo, 30978), (i, 62504), (t, 66985), (o, 98750), (l, 101139) }
k = 5

Next, the process is repeated for the children of the node representing “w”. The node representing “wi” will be followed next, because it too has a score of 1220297. The other nodes are added to the DEPQ, but because they all have scores lower than the lowest in the DEPQ, they are all discarded. Specifically, (w$, 8393), (we, 17837), and (wo, 30978) all have scores lower than (i, 62504).
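
The two steps so far can be replayed in a few lines using the scores shown above. The bounded queue here is a stand-in built on a sorted list, and the helper name offer is made up for this sketch. The order in which children are offered is an assumption, but the final queue contents do not depend on it.

```python
import bisect

def offer(depq, entry, bound):
    """Insert (score, label) in ascending order.

    Returns the entry that fell off the bottom, or None if nothing did.
    """
    bisect.insort(depq, entry)
    return depq.pop(0) if len(depq) > bound else None

k = 5
depq = []       # ascending by score, limited to k - 1 = 4 entries
discarded = []

# Step 1: children of the root, except "w", which is followed directly.
root_children = [("f", 50468), ("i", 62504), ("t", 66985),
                 ("o", 98750), ("l", 101139)]
# Step 2: children of "w", except "wi", which is followed directly.
w_children = [("w$", 8393), ("we", 17837), ("wo", 30978)]

for label, score in root_children + w_children:
    dropped = offer(depq, (score, label), k - 1)
    if dropped is not None:
        discarded.append(dropped[1])

print([label for _, label in depq])  # nodes still waiting in the DEPQ
print(discarded)                     # nodes that fell off the bottom
```

After both steps the queue still holds i, t, o and l, while f, w$, we and wo have been dropped, matching the description above.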