Heap-like Dynamic Score-Decomposed Tries for Top-`k` Autocomplete

Abstract

Query autocompletion, also known as type-ahead search, is a critical feature for a wide range of services: mobile keyboards, web stores, social media sites, and virtually every modern system connected to a database. This paper improves scored prefix completion by introducing a simple, pointer-based data structure called the Dynamic Score-Decomposed Trie that has the properties of both a decomposed trie and a binary max-heap, enabling the top-k highest scored completions to a prefix p to be computed in a time in O(|p|+klogk) after the node representing p is located (for which several options are given).

1. Introduction

Autocomplete is a critical component of every modern search service. For users, it reduces the keystrokes and effort required to express intent. For platforms, it provides an opportunity to suggest content or products to users. Since query autocompletion can occur with every keystroke, it is the most interactive component of search services and thus receives the most traffic. This necessitates the development of extraordinarily time-efficient algorithms to service these requests in real-time.

This paper solves the scored prefix completion problem, which is the problem of finding the top-k highest scored completions which begin with a given prefix string p, where the score is a numeric rank denoting each string's relevance relative to every other string in a string corpus. This is the foundation of other autocomplete problems such as error-correcting autocompletion which may be viewed as an extension of this problem.

Tries (i.e. prefix trees) are the natural choice for prefix completion queries for their obvious structural and performance advantages in this problem space. However, most tries are not amenable to scored prefix completion. While any off-the-shelf trie can be augmented with scores, without a specialized data structure every completion to p must be considered a candidate for the top-k because scores could occur in any order in the trie. When the trie is large and p is short, so much of the trie must be traversed that it precludes being performed on-demand in real-time. Many real-world trie-based autocomplete services explicitly disallow (or cache) prefix queries with fewer than three characters for this very reason [27], even though it is possible to experience the same performance issues anywhere, so long as the candidate set of completions to p is massive. Although some or all of the top-k completions could be precomputed and cached, this strategy requires a preordained k and often too much space in practice. This paper instead improves query times by traversing smaller amounts of a specialized trie data structure tailored specifically for the autocomplete problem.

2. Related work

The scored prefix completion problem has been solved many times in the query autocompletion (QAC) literature, often with different constraints and goals for different solutions. Only the most similar works are mentioned here, so refer to surveys [25, 26] for a broader summary of the query autocompletion research which goes beyond the problem addressed in this paper.

A popular solution is to precompute the top-k answers to all possible queries and store them in each node of a trie [18] or in a (distributed [29]) hash table [24]. However, precomputing takes a large amount of space and is rendered unnecessary by faster algorithms.

Another solution is augmenting a sorted string data structure with a Segment Tree [21] or some other Range-Minimum-Query structure [19, 22, 28]. The main advantage of an RMQ strategy is in its simplicity and modularity, as it works with existing dictionary structures without much modification [22]. However, it is much slower in practice than data structures that are specialized for the scored prefix completion problem [22].

In [22], Hsu & Ottaviano introduced the Completion Trie, which serves as the baseline for this paper and is discussed at length in the Background section. Note that this paper presents their algorithms with a few improvements: namely using a double-ended priority queue bounded to size k rather than an unbounded max-heap, and not inserting nodes into the priority queue that are guaranteed to be extracted immediately after.

Also in [22], Hsu & Ottaviano introduced the Score-Decomposed Trie [22, 23] as a succinct representation of the Completion Trie. Basically, tries can be succinctly encoded by traversing the tree and encoding its shape in a bitmap with roughly 2n bits [20]. One technique is to perform a depth-first traversal and encode a 1 each time the algorithm descends and a 0 each time it ascends [20]. All traversal operations over the original structure can then be simulated on the resulting bitmap, so long as the data held in each succinct “node” can be mapped to by some auxiliary structure(s). The main advantage of succinct data structures is in their space efficiency, with the Score-Decomposed Trie achieving space competitive with gzip on massive datasets [22]. The downside is that succinct data structures are typically (but not always [17]) static, generally rendering them unsuitable for highly dynamic settings. Despite bearing a similar name, the structure presented in this paper is not succinct and has a stricter invariant than the Score-Decomposed Trie.

Surprisingly, the structure presented in this paper is one of the first non-succinct path-decomposed tries, behind the Dynamic Path-Decomposed Trie of Kanda et al. [17, 16], although their structure is not specialized for the same problem.

3. Preliminaries

A trie [4, 5], also known as a prefix tree, is a digital search tree that encodes a set of strings, organized such that every string is encoded as a root-to-node path and each edge extending from a parent node to a child node represents subsequent characters (or bit(s)) in the string. This is useful for autocompletion because all strings under (i.e. in the subtree of) the node representing p are valid completions to p.

A compacted trie [11], also known as a radix tree, is a trie that allows multiple characters to be associated with a single edge. This eliminates nodes that would otherwise have only one child, thereby reducing memory consumption.

A left-child right-sibling binary trie [6], abbreviated here as “LCRS trie”, is a trie where every node only holds two references to other nodes: its first child and next sibling. Any trie can be represented as an LCRS trie by transforming each list of children into a linked list and letting each parent point only to the first node in the linked list and its sibling.

A priority queue, also known as a heap [8], is an abstract data type similar to a queue except that elements have a “priority” which determines the order in which they are handled. Min-heaps support the operations HeapPush, which adds an element, and HeapPopMin, which removes and returns the minimum priority element. Max-heaps are the opposite, supporting HeapPopMax instead, which removes and returns the maximum priority element. The aforementioned operations are assumed to take Θ(logx) time, where x is the maximum size to which the heap grows. Note that this paper does not delay the construction of heaps until just before the first HeapPop operation is needed, which could save a few comparisons via Floyd's linear-time heap construction algorithm [7].

A double-ended priority queue, abbreviated as DEPQ, is a priority queue that simultaneously supports the operations of a min-heap and a max-heap [10].

A bounded priority queue or bounded heap is a priority queue constrained to a certain capacity. When an element is pushed to a full priority queue, the lowest priority element is discarded. Whether the element to be discarded happens to be the incoming element can be determined via a single comparison with the lowest priority element in the heap. Although this may garner performance gains in practice, this paper assumes the worst case: that each pushed element requires a logarithmic-time HeapPush.

Insertion sort is a simple, comparison-based, in-place sorting algorithm that sorts each successive item in a list by performing a linear scan on the items that precede it, shifting each higher ordered element, and inserting it immediately after the first element with a lower order is found [15]. In this paper it is used in online algorithms to sort a single element into an otherwise sorted list. The InsertionSort functions (Down/Up) take as input a list and an index into it where there is an element which may need to be shifted to maintain the list's sorted order, and returns the index into which it was shifted. MergeSortedSublists takes the same parameters, but the index instead points to the first element that is not properly sorted with the elements before it, but is sorted with the elements after it. MergeSortedSublists performs InsertionSortUp on each successive element, starting at the given index, terminates once an element is encountered that need not be shifted, and returns the final index of the element that originally occurred at the given index. InsertionSortIntoList takes a list and an element, performs AppendToList which appends the element to the back of the list, insertion sorts it upwards/leftwards, and returns its new index in the list. RemoveIndexFromList is a similar function that takes a list and an index and shifts all the elements after the index to the left by one, effectively shrinking the list.

4. Background

The data structure serving as the baseline for this paper was introduced in Hsu & Ottaviano's 2013 paper Space-Efficient Data Structures for Top-k Completion [22]. It introduced the Completion Trie: a scored compacted trie where each internal node is given a score equal to the maximum score in its subtree. This conveniently allows internal nodes and leaf nodes (which hold their own score) to have the same shape, but more importantly, it enables the algorithm which searches for the top-k scores in a given subtree to know at every step which path leads to the next highest scored completion.

After ordinary trie traversal yields the locus node, i.e. the highest node representing a completion string with the given prefix p, the top-k search algorithm over the Completion Trie proceeds as follows: First, each child of the locus node is inserted into a bounded double-ended priority queue (DEPQ) constrained to size k - 1, except for the node with the same score as the locus node, for which the same process is repeated. (If there are multiple nodes with the target score, pick one.) In effect, the locus node's score is followed downwards to the corresponding leaf with the same score, and all other children along the path from the locus to the leaf are inserted into the DEPQ. Once the leaf is reached, it is pushed to the output list, k is decremented, and the maximum node is extracted from the DEPQ (again constraining its size to k - 1). The algorithm repeats for the extracted node and only terminates once k is 0 or the DEPQ is empty.

This top-k search algorithm is a variation of the A* search algorithm [9] with the scores serving as an exact heuristic function [22]. This algorithm takes Θ(bdklogk) time in the worst case, where d denotes the average depth of each leaf node corresponding to the top-k completions to p, and b denotes the average breadth of each visited level. Since there are Θ(dk) visited levels of size b, there are a total of Θ(bdk) nodes pushed to the DEPQ, each taking Θ(logk) worst-case time. Note that b is at most the size of the alphabet and d is at most the average length of the completions to p.

Hsu & Ottaviano further improve the top-k search time by sorting each node's children by score, so that only one node from each level needs to be pushed to the DEPQ as the path of first-child nodes is iteratively followed to each completion's leaf. Each node in the DEPQ then effectively acts like a forward iterator which inserts the next element on its level when extracted (before its first-child path is followed to the leaf). This improved top-k search is a variation of the k-way merge algorithm [15] and takes Θ(dklogk) time.

Figure 1. The first step for the empty string query (p = “”) over a Completion Trie is depicted. The algorithm starts at the locus node, which for the empty string is the root node, and traverses downward until reaching a leaf node, in this case the node representing “wikipedia$”. Because this structure is sorted horizontally by score, the next highest completion starts at the node with the maximum score among the first-sibling nodes. A live demo of this algorithm is available at https://validark.github.io/DynSDT/demo.

Intuitively, since sorting horizontally removed the breadth factor b, it stands to reason that sorting vertically would remove the depth factor d. In other words, if the highlighted sibling nodes in Figure 1 (adjacent to the path of “wikipedia”) were sorted, then only the (next) highest of those nodes would need to be in the priority queue at any given time. However, the trie structure needs to be decomposed to support vertical sorting. This is the intuition and motivation behind the structure introduced in the next section. In summary:

Data Structure	Top-`k` Search Time
Unsorted Completion Trie	Θ(`bdk`log`k`)
Horizontally Sorted Completion Trie	Θ(`dk`log`k`)
Horizontally and Vertically Sorted Trie?	Θ(`k`log`k`)

Note: The time it takes to find the locus node is considered separately from the top-k search time, and is thus not included in the above table. To find the locus node corresponding to a given prefix string of length |p|, standard tries must traverse Θ(|p|) levels in the worst case. Hence, to find the locus node in Θ(|p|) time the trie must support constant time traversals, both horizontally and vertically, even for tries that maintain a sorted order (by score) horizontally and/or vertically. If only the sorted order is maintained, horizontal sorting pessimizes lookup to Θ(|p|b) and vertical sorting on top of that pessimizes prefix lookup to Θ(|p|bd) in the worst case (query and input). However, the breadth and/or depth factors can be obviated by augmenting each node with HashMaps for constant time horizontal jumps and/or using a Linked HashMap (sorted by score) for constant time vertical jumps. Alternatively, a separate HashMap structure could map prefix strings to locus nodes.

5. Dynamic Score-Decomposed Tries

The Dynamic Score-Decomposed Trie is a (non-succinct, pointer-based) data structure based on the path decomposition of conventional tries. When constructing from a conventional trie, the path to the first leaf is compressed into a single decomposed node by concatenating the substrings and keeping a list of all the first-sibling nodes encountered along the path. Each encountered first-sibling node is likewise decomposed and stored inside a branch point which also contains the longest common prefix length (LCP) between the string it represents and its parent's.

Figure 2. The Completion Trie of Figure 1 after decomposition. Tuples are rendered in the form (LCP, node.key). E.g. the node representing “wikipedia” has 8 outgoing edges with unique branch points, and one of those is (2, “william”). The 2 denotes that “wi” is shared between the aforementioned strings. This is still equivalent to the Score-Decomposed Trie of [22, 23].

Alternatively, this structure could be viewed as a derivative of the LCRS trie, as can the original trie structure of [4, 5]. The original trie can be derived from the LCRS trie by moving each horizontal linked list (of sibling nodes) into the parent node. By the same token this structure could be derived from the LCRS trie by moving the vertical linked lists (of descendant nodes) into the parent node (after advancing horizontally by one on each level and letting the parent node hold the string representing the concatenation of the nodes that were advanced over).

5.1 Structural Properties

Each node contains a string key, its corresponding numeric score, and a list of nodes with unique branch points which differ from key at different positions. By construction, branch points in each node are greater than or equal to the branch point that led to that node (or 0 for the root) and less than or equal to the length of its key. This gives the resulting structure the trie property, meaning that all nodes have a unique path to them from the root, which can be found by matching successive characters, and that the subtree under any given node contains only completions to the string that was matched along the path to it. To find the locus node representing (at least) p, the algorithm starts at the root node and iteratively computes the longest common prefix (LCP) between p and the current node's key, where the next current node is at the branch point matching the LCP length, and continues until all characters in p have been matched or until the target branch point does not exist within the current node.

Algorithm 1 Find the locus node for a prefix string p (without augmenting the structure)

Input: T, a Dynamic Score-Decomposed Trie and p, a prefix string

Output: n, the locus node, i.e. the highest node in T which represents at least p

procedure FindLocusForPrefix(T, p)

lcp ← 0

n ← T.root // the current node

while n ≠ null do

while lcp < Min(|p|, |n.key|) and p[lcp] ⩵ n.key[lcp] do // compute LCP

lcp ← lcp + 1

if lcp ⩵ |p| then break

n ← FindNodeForLCP(n.branch-points, lcp)

return n

procedure FindNodeForLCP(bp, lcp)

for i ← 0 to |bp| - 1 do

if bp[i].LCP ⩵ lcp then

return bp[i].node, i

return null, NaN

Since LCP lengths are stored explicitly, this structure is amenable to being sorted both horizontally and vertically. To start, the root node holds the maximum scored completion in the data set. To ensure horizontal sorted order, each branch point holds the highest-scored completion that matches its LCP with the containing node. To ensure vertical sorted order, each list of branch points is sorted by score. This structure thus satisfies the heap property, both horizontally and vertically, making this structure a variant of the binary max-heap (except that it has no constraint of being complete or nearly complete). Hence, the top-k search algorithm over this structure is equivalent to the algorithm for finding the top-k nodes from a binary max-heap without mutating it [12].

Figure 3. The Dynamic Score-Decomposed Trie of Figure 2 after sorting by score vertically, drawn in the LCRS representation. Tuples are rendered in the form (LCP, node.key, node.score).

This structure also has the benefit of using only n nodes for n strings, as opposed to conventional tries, which require far more intermediary nodes and therefore more allocations, memory usage, and cache misses.

5.2 Top-`k` Completion Search

The top-k search algorithm over this structure proceeds as follows: First, the key of the locus node and its highest-scored branch point with LCP ≥ |p| are pushed to the output list and k is decremented twice. Next, the two candidate nodes directly succeeding the latter node, one horizontally and one vertically (like a binary max-heap drawn with right angles), are pushed to a bounded DEPQ constrained to size k. Iteratively, the maximum node is extracted from the DEPQ, its key is pushed to the output list, k is decremented, and (up to) two nodes (one horizontal and one vertical) are pushed to the DEPQ. This continues until k reaches 0 or until the DEPQ is empty. More specifically, the vertical node to be pushed to the DEPQ is the next branch point after the current node (in the containing node) with branching point ≥ |p| and the horizontal node is the first branch point contained within the current node's list of branch points. Some vertical successors can be skipped over because branch points in the locus node can have an LCP as low as the one that led to it (which is less than |p|, by definition). Every other list of branch points encountered by the algorithm is found by following a branching point ≥ |p|, and therefore no check is necessary outside the locus node's list of branch points.

Algorithm 2 Top-k Completions to p

Input: the structure T, a string p, a number k > 0, and c, the output list of completions to p

L ← FindLocusForPrefix(T, p) orelse return

AppendToList(c, L.key)

if --k ⩵ 0 then return

bp ← L.branch-points // the current list of branch points

i ← 0 // the current index in bp (0-indexed)

while (i < |bp| or return) and bp[i].LCP < |p| do ++i // find first bp[i] with LCP ≥ |p|

AppendToList(c, bp[i].node.key)

Q ← new DEPQ of capacity k // When full, HeapPush internally calls HeapPopMin to constrain size to k

while --k > 0 do

if |bp[i].node.branch-points| > 0 then

HeapPush(Q, { bp: bp[i].node.branch-points, i: 0 }) // horizontal candidate

while ++i < |bp| do

if bp[i].LCP ≥ |p| then // this check is always true when bp ≠ L.branch-points

HeapPush(Q, { bp, i }) // vertical candidate

break

if |Q| ⩵ 0 then return

bp, i ← HeapPopMax(Q) // The size of Q is now constrained to the new value of k

AppendToList(c, bp[i].node.key)

In the worst case, this top-k search algorithm inserts (k - 2) * 2 nodes into the DEPQ and extracts k - 2 nodes, contributing a Θ(klogk) term to the time complexity. The algorithm also skips over a number of nodes in the list of branch points contained within the locus node which is at most |p| minus the locus node's branch point in its containing node, which is an integer in the range [0, |p|), contributing a O(|p|) term. Therefore, the top-k search (after the locus node is located) takes a time in O(|p|+klogk).

As previously mentioned, the locus node is found in worst-case Θ(|p|) time when augmented with HashMaps and Θ(|p|bd) time otherwise. Altogether, the total times to both find the locus node and perform the top-k search in its subtree are as follows:

	Augmented	Unaugmented
Dynamic Score-Decomposed Trie	Θ(\|`p`\|+`k`log`k`)	Θ(\|`p`\|`bd`+`k`log`k`)
Completion Trie	Θ(\|`p`\|+`dk`log`k`)	Θ(\|`p`\|`b`+`dk`log`k`)

5.2.1 DEPQ Capacity Optimization

As in the Heapsort algorithm [8], the DEPQ and the output list can be backed by the same underlying array. When that strategy is not used, one observation that can improve performance in high-level languages that always heap-allocate is that the maximum capacity of the DEPQ is actually only ⌊0.5k⌋. The reason for this is that each iteration has the net effect of incrementing the DEPQ size (after inserting 2 and extracting 1) and decrementing k. Since the size and k approach each other at the same rate (in the worst case), they converge in the middle. Note that the two values meet in the middle after inserting 2 and extracting 1, meaning the size overshoots the midpoint, briefly reaching size ⌊0.5k + 1⌋ before the extract operation. However, because k is decremented twice in the step before the DEPQ is used, the maximum size of the DEPQ is actually given by ⌊0.5(k - 2) + 1⌋, reducing to ⌊0.5k⌋. The DEPQ can also be implemented as an insertion-sorted array for optimal performance when ⌊0.5k⌋ is low and fallback on a Min-Max Heap [10, 14] for higher values of k to maintain logarithmic asymptotic complexities.

5.2.2 Eliminating logarithmic factors

When it is acceptable for the top-k completions to be returned in arbitrary order, the logarithmic factors in the time complexities above can be eliminated. If the score of the kth highest completion was available before the start of the top-k search, then a depth-first search could retrieve the top-k completions in non-sorted order in Θ(k) time [12]. For a predetermined k, precomputing the kth highest score for all values of p with at least k completions enables this strategy. For arbitrary values of k, the ideas of [12] or something similar might be adaptable to this structure such that the kth highest score for a completion to p could be computed in a time in O(k). While such strategies were not explored with implementations, it is conceivable that these ideas could bring the total query complexity down to the optimal Θ(|p|+k) time by eliminating the need for a DEPQ. Another potential small performance benefit is that whether a branch point has an LCP < |p| would not be unnecessarily checked for nodes that are not under the locus (as the previous algorithm requires unless two separate DEPQ's are used), since the codepath for nodes that are not under the locus node could be separate from the ones under the locus.

Algorithm 3 Top-k Completions to p, DFS version

Input: the structure T, a string p, a number k > 0, and c, the output list of completions to p

L ← FindLocusForPrefix(T, p) orelse return

s ← KthHighestScore(L, p, k)

AppendToList(c, L.key)

for each (LCP, node) in L.branch-points do

if node.score < s then break

if LCP ≥ |p| then

DFS-Helper(c, node, s)

procedure DFS-Helper(c, n, s)

AppendToList(c, n.key)

for each (_, node) in n.branch-points do

if node.score < s then break

DFS-Helper(c, node, s)

5.3 String Compression

Optionally, this structure can omit the prefix of each key which is implied by its path from the root, greatly reducing space usage when the uncompressed strings are not needed in main memory by any other process. This does, however, result in a performance penalty during top-k enumeration because each completion must then be reconstructed on-demand to answer each query. Fortunately, the compressed version of this structure requires only k substring and concatenation operations to reconstruct k completions to a given prefix string p (specifically, this is executed k times:str₁.substring(0, len).concat(str₂)). This is still an improvement over the Completion Trie, which requires Θ(dk) concatenations to reconstruct k completions. However, if the uncompressed strings are going to be stored in main memory anyway, or if multiple Dynamic Score-Decomposed Tries share the same set of completions (but with different scores) on a single machine, then it is quite advantageous that this structure does not need to divide strings into substrings as regular non-decomposed tries do. To aid understanding, the diagrams in this paper depict each node with the complete string key it represents, with the redundant prefix underlined and bolded.

5.4 Numeric Compression

A few techniques employed by others for numeric compression are also applicable to this structure: Firstly, longest common prefix lengths (i.e. LCP values) could be made relative to their containing node's LCP, which makes the numbers smaller and therefore require less space, as in [22, 23, 17, 16]. E.g., if a node's LCP is 8 and its containing node's LCP is 3, then the contained node only needs to store that its LCP is 5 more than its container. Relative LCP's must be in the range [0, |x|] where |x| is the length of the containing node's compressed key (which has the string implied by the path from the root omitted), and hence can be stored in log₂(|x| + 1) bits. Secondly, because scores tend to exhibit a skewed power law distribution, variable-byte encoding schemes have been shown to reduce the space usage of scores [22, 23]. These techniques are otherwise omitted from this paper for ease of understanding, but greatly reduce space usage in practice [22, 23].

5.5 Construction

The simplest way to construct a Dynamic Score-Decomposed Trie is with a list of scored, unique completions sorted in descending order by score. To start, the first scored completion in the list becomes the root node. Each subsequent scored completion is inserted by iteratively computing the longest common prefix (LCP) with the current node's key (starting with the root) and jumping to the current node's branch point which corresponds to that LCP and making that the new current node. When no branch point is found, the scored completion is inserted at the end of the current node's branch points. Because the input list is sorted, the trie produced by this algorithm is also properly sorted, both horizontally and vertically.

Algorithm 4 Simple Construction

Input: c, a non-empty list of unique completions sorted in descending order by score

Output: T, a Dynamic Score-Decomposed Trie made from c

T ← a new Dynamic Score-Decomposed Trie

T.root ← { key: c[0].term, score: c[0].score, branch-points: new List }

for (term, score) in c[1..] do // start loop at index 1 in c

n ← T.root

lcp ← 0

loop

while lcp < Min(|term|, |n.key|) and term[lcp] ⩵ n.key[lcp] do // compute LCP

lcp ← lcp + 1

n, j ← FindNodeForLCP(n.branch-points, lcp) orelse break

AppendToList(n.branch-points, (lcp, { key: term, score, branch-points: new List }))

return T

To construct the trie incrementally (i.e. online), or to change an existing structure, an algorithm is needed which does not assume that successive scores are lower or that every completion is not already in the trie. This Set algorithm is given as input a string term and a numeric score to associate with it and proceeds as follows: If the trie is empty, the new scored completion becomes the root. Otherwise, the trie is traversed in the same way as the previous algorithm: Starting at the root, the LCP of the new completion with the current node's key is iteratively computed, and the next current node becomes the one at the branch point corresponding to that LCP. The traversal terminates when either a node is found whose key exactly matches term (5.5.1), when the current node's score is lower than the given score (5.5.2), or when there is no corresponding branch point for the computed LCP (the only case in the previous algorithm).

Algorithm 5 Set

Input: the structure T, a string term and a numeric score to associate with it

if T.root ⩵ null then

T.root ← { key: term, score, branch-points: new List }

return

n ← T.root // T.root is an alias for T.root-branch-points[0].node

bp ← T.root-branch-points // the list of branch points that contains n

i ← 0 // the index in bp (0-indexed), such that n = bp[i]

lcp ← 0

loop

while lcp < Min(|term|, |n.key|) and term[lcp] ⩵ n.key[lcp] do // compute LCP

lcp ← lcp + 1

if lcp ⩵ |term| and lcp ⩵ |n.key| then

return Set-ExactMatchFound(score, n, bp, i) // 5.5.1

if score > n.score then

return Set-ScoreLocationFound(term, score, lcp, bp, i) // 5.5.2

bp ← n.branch-points

n, i ← FindNodeForLCP(bp, lcp) orelse // if there is no branch point for lcp, push to bp

return AppendToList(bp, (lcp, { key: term, score, branch-points: new List }))

5.5.1 Exact match found (demotion)

If the original traversal terminated because the current node's key exactly matches term (i.e. when LCP equals term length), the algorithm proceeds as follows: First, the current node's score is updated to the given score. If score is greater than or equal to all the scores in the current node's subtree (determined by checking its first branch point) then the current node is simply insertion sorted (by score) in its containing node, at which point Set is done.

If score is not the greatest of the current node's subtree, then the current node swaps places with its first (highest-scored) branch point. Since both slots now have lower scores than before, they each must be insertion sorted downwards to maintain the sorted order of both lists. The list of branch points into which the current node was demoted then becomes a queue of nodes to (non-recursively) reinsert into the subtree of the promoted node. The promoted node adopts the LCP of its new position and the demoted node is given an LCP equal to its key length. Also note that the demoted node should have an empty list of branch points.

Figure 4. An example of the previous step for Set(“tennis championships”, 63) called on the Wikipedia dataset [3] with tuples in the format (LCP, key, score). Since 63 < 845, “tennis championships” and “tennis at” switch places, then both are insertion sorted downwards. In this case, only “tennis championships” moves down after swapping.

To reinsert each node from the queue, the naive algorithm starts at the promoted node and iteratively computes the LCP and follows the corresponding branch points until one does not exist or until the dequeued node's score is higher than the current branch point's. If no branch point matches the target LCP, the dequeued node is insertion sorted into the current list of branch points. Otherwise, if the proper location for the dequeued node's score was found, the node formerly occupying that location is supplanted and insertion sorted directly into the dequeued node's branch points. The dequeued node is then inserted in the proper location and insertion sorted upwards. (Note that the supplanted node's LCP is guaranteed to not be present in the dequeued node. The reason for that is more apparent with the improved algorithm.)

Unfortunately, the naive algorithm is quite wasteful as it often performs the same tree traversal for each dequeued node and unnecessarily recalculates LCP's. Two observations improve this algorithm: Firstly, the LCP of any two nodes in a list of branch points is equal to the minimum of their LCP's (with the containing node). Secondly, nodes from the queue are always reinserted into one of the previous nodes in the queue or the promoted node due to the trie property.

With these observations, a better algorithm emerges that employs the use of a running maximums list, which holds a reference to each node from the queue that had the maximum LCP when it was encountered. The algorithm starts by pushing the promoted node to the running maximums list with its original LCP. Each successive node in the queue is inserted into the first element in the running maximums list that has an LCP greater than or equal to it. If the current node from the queue has the highest LCP encountered thus far, it is inserted into the maximum element from the running maximums list and is pushed to the running maximums list itself. Since the running maximums list is, by construction, sorted in ascending order of LCP, it is binary searchable in O(logd) time.

Figure 5. A continuation of Figure 4. Arrows indicate which nodes from the queue are inserted into which nodes in the running maximums list.

If LCP's occur in a completely random order, then the running maximums list is expected to be of size Θ(logd), in which case the binary search takes only Θ(loglogd) time. Of course, when assuming random distribution, a linear search over the running-maximums list would have an expected Θ(1) cost on average, because each successive running maximum would be expected to roughly bisect the dataset. In other words, there would be a 50% chance of the first running maximum being the answer, a 25% chance of the second, a 12.5% chance of the third, and so on. However, anecdotally, the Wikipedia dataset [3] seems to have these probabilities flipped the other way around since the highest-scored branch point under a given node seems likely to have the lowest possible LCP it can have. E.g. “wikipedia” is the highest-scored node and the second highest is quite likely to not start with “w” at all, and in this case, it happens to be “list”. For that reason, the lowest LCP's are likely to occur at the beginning of branch point lists (which is beneficial during trie traversal), although a significantly higher LCP typically occurs around the third slot, due to terms that are extensions of the same word. E.g. the three highest branch points contained within “national” are (0, “river”), (1, “new”), and (6, “nations”), with “nations” more than bisecting the remainder of the available LCP's. If the above algorithm was applied to these examples, most branch points would be reinserted into the second or third term. As a compromise, the reference implementation checks if the last running maximum has the first LCP higher than the current node before falling back on a binary search.

Inserting into a node works as before, except now the trie property guarantees that the target LCP cannot change from the minimum of the LCP's between the two elements from the running maximums list and the queue. The chain of LCP's in the running maximums list is followed until the dequeued node's score is higher than the current branch point, supplanting it, and insertion sorting it directly into the dequeued node's branch points. If the current list of branch points contains no node with the target LCP, then the dequeued node is insertion sorted directly into it. Note that when the dequeued node has a lower LCP than the current maximum LCP, the trie property guarantees that the node in the running maximums list into which it is inserted cannot contain a branch point with the same LCP as the dequeued node. Also note that the LCP given to reinserted elements in the trie does not change its LCP in the queue or running maximums list. Once all the nodes in the queue have been reinserted, Set is finished.

Algorithm 6 Set Helper: Exact match found (demotion)

Input: the new score of a node n, branch points bp, and the index i in bp where n occurs

procedure Set-ExactMatchFound(score, n, bp, i)

n.score ← score

Q ← n.branch-points

if |Q| ⩵ 0 or score ≥ Q[0].node.score then

return InsertionSort(bp, i)

R ← new List // the running maximums list

bp[i].node ← Q[0].node

i ← InsertionSortDown(bp, i)

AppendToList(R, { LCP: Q[0].LCP, bp, i })

Q[0] ← (|n.key|, { key: n.key, score, branch-points: new List })

InsertionSortDown(Q, 0) // Sorting Q ensures that no bp[i] in R can become invalidated.

for each (LCP, node) in Q do

max-LCP, bp, i ← R[|R| - 1]

if LCP ≥ max-LCP then // find where node belongs in the chain of LCP's equal to max-LCP

loop

bp ← bp[i].node.branch-points

n, i ← FindNodeForLCP(bp, max-LCP) orelse

i ← InsertionSortIntoList(bp, (max-LCP, node))

break

if node.score ≥ n.score then

InsertionSortIntoList(node.branch-points, bp[i])

bp[i] ← (max-LCP, node)

i ← InsertionSortUp(bp, i)

break

if LCP > max-LCP and /* not last iteration */ node ≠ Q[|Q| - 1].node then

AppendToList(R, { LCP, bp, i })

else // LCP < max-LCP

l ← 0

r ← |R| - 2 // |R| - 1 was already checked

while l ≤ r do // binary search to find the first l for which LCP < R[l] is true

m ← ⌊(l + r) ÷ 2⌋ // Watch out for numeric overflow in real-world applications!

if LCP < R[m].LCP then

r ← m - 1

else

l ← m + 1

_, bp, i ← R[l]

InsertionSortIntoList(bp[i].node.branch-points, (LCP, node))

5.5.2 Found location for score (promotion)

If the original traversal terminated because the new score is greater than the current branch point, then the algorithm needs to traverse deeper in the trie to find the branch points for the inserted node, as well as delete the old node which represented term, if one exists. To start, the node in the current branch point is supplanted by a new node with the given score that represents term and is insertion sorted upwards. The supplanted node is then placed as the first element in the new node's list of branch points (with an LCP equal to the current LCP from the traversal). Next, all branch points in the supplanted node's branch points with an LCP lower than the current LCP (from the traversal) are moved to the new list of branch points.

While the current LCP does not equal the length of term, the current LCP is followed, starting in the supplanted node's branch points, until a node is found whose key matches at least one more character of term. If no node is found, then Set is finished. If a node is found, it is replaced by its branch point with the same LCP as the current LCP, if one exists. Next, the current LCP is increased by matching successive characters of term to the replaced node's key. If the replaced node does not represent term exactly, it is assigned the current LCP and insertion sorted upwards in the new list of branch points, and then all of the replaced node's branch points with an LCP lower than its new LCP are moved to the new list of branch points and insertion sorted upwards. This algorithm repeats inside the replaced node until the current LCP equals the length of term.

While the current node does not exactly represent term (i.e. if it is a superstring of term), the chain of branch points with the current LCP is followed down to the node which exactly represents term. If it exists, the node which exactly represents term is replaced by its branch point with the current LCP (equal to the length of term) and the remainder of its branch points are insertion sorted upwards into the new list of branch points. Set is then finished.

Algorithm 7 Set Helper: Found location for score (promotion)

procedure Set-ScoreLocationFound(term, score, LCP, bp, i)

BP ← new List // the new branch points for our new node

n ← bp[i].node

AppendToList(BP, (LCP, n))

bp[i].node ← { key: term, score, branch-points: BP }

i ← InsertionSortUp(bp, i)

n.branch-points ← ExtractLCPsBelowThreshold(n.branch-points, LCP, BP)

bp ← BP

i ← 0

while LCP ≠ |term| do // find all peers with LCP ∈ (LCP, |term|]

do // find n such that n.key matches at least one more character

bp ← bp[i].node.branch-points

n, i ← FindNodeForLCP(bp, LCP) orelse return

while LCP ⩵ |n.key| or term[LCP] ≠ n.key[LCP]

SupplantNodeFromParent(bp, i, n, LCP)

do ++LCP while LCP < Min(|term|, |n.key|) and term[LCP] ⩵ n.key[LCP]

if LCP ≠ |term| or LCP ≠ |n.key| then

bp ← BP

i ← |bp|

AppendToList(bp, (LCP, n)) // write to bp[i]

n.branch-points ← ExtractLCPsBelowThreshold(n.branch-points, LCP, bp)

i ← MergeSortedSublists(bp, i)

if LCP ≠ |n.key| then // if n is not the node that exactly represents term, go look for the one that does

bp ← bp[i].node.branch-points

n, i ← FindNodeForLCP(bp, LCP) orelse return

while LCP ≠ |n.key|

SupplantNodeFromParent(bp, i, n, LCP)

j ← |BP|

for each (LCP, node) in n.branch-points do AppendToList(BP, (LCP, node))

MergeSortedSublists(BP, j)

procedure SupplantNodeFromParent(bp, i, n, LCP)

c, j ← FindNodeForLCP(n.branch-points, LCP) orelse // find next link in the horizontal list

return RemoveIndexFromList(bp, i) // if there is no next link, just remove node's old spot

bp[i] ← n.branch-points[j] // move next link (LCP, c) into n's old spot and sort

InsertionSortDown(bp, i)

RemoveIndexFromList(n.branch-points, j)

procedure ExtractLCPsBelowThreshold(src, lcp, dst)

L ← new List // the new branch points to replace src

for each (LCP, node) in src do

AppendToList(if lcp > LCP then dst else L, (LCP, node))

return L

Figure 6. The aforementioned process for Set(“tennis academy”, 9001). nodes are the new branch points for (“tennis academy”, 9001). nodes are promoted to fill the old locations of replaced nodes.

Figure 7. The structure of Figure 6 after Set(“tennis academy”, 9001) finishes. Nodes have the same color as before.

5.6 Deletion

The Delete algorithm is nearly a subset of the Set algorithm. It takes as input a term and traverses the tree as previously described until a node is found that exactly represents term. If such a node exists, it is removed from the tree and all of the nodes it contained are reinserted, as in 5.5.1.

6. Discussions

Presumably, the stricter max-heap-like invariant of the Dynamic Score-Decomposed Trie can be back-ported to the Score-Decomposed Trie of [22, 23], yielding a succinct (and static) structure with the same asymptotic time complexity improvements over the Completion Trie. Also like the Completion Trie, any fuzzy completion algorithm applicable to a trie is also applicable to the Dynamic Score-Decomposed Trie because it, too, supports the same traversal operations as conventional tries [22].

7. Conclusion

This paper introduced the Dynamic Score-Decomposed Trie and the algorithms to construct it (both offline and online) and search it to enumerate the top-k completions to a prefix p. Top-k enumerations can be performed in Θ(|p|+klogk) time when the structure is augmented with HashMaps for constant time horizontal/vertical traversals, and Θ(|p|bd+klogk) time otherwise. A live demo [1] and reference implementations [2] are provided.

8. References

[1] Live demo of the Completion Trie and the Dynamic Score-Decomposed Trie. https://validark.github.io/DynSDT/demo.

[2] Reference implementations. https://github.com/Validark/DynSDT/.

[3] Wikipedia word frequency data. https://github.com/wolfgarbe/PruningRadixTrie/raw/master/PruningRadixTrie.Benchmark/terms.zip.

Heap-like Dynamic Score-Decomposed Tries for Top-k Autocomplete

Heap-like Dynamic Score-Decomposed Tries for Top-`k` Autocomplete