Heap-like Dynamic Score-Decomposed Tries for Top-k Autocomplete
Abstract
Query autocompletion, also known as type-ahead search, is a critical feature for a wide range of services: mobile keyboards, web stores, social media sites, and virtually every modern system connected to a database. This paper improves scored prefix completion by introducing a simple, pointer-based data structure called the Dynamic Score-Decomposed Trie. It has the properties of both a decomposed trie and a binary max-heap, enabling the top-k highest-scored completions of a prefix p to be computed in O(|p| + k log k) time after the node representing p is located (for which several options are given).
1. Introduction
Autocomplete is a critical component of every modern search service. For users, it reduces the keystrokes and effort required to express intent. For platforms, it provides an opportunity to suggest content or products to users. Since query autocompletion can occur with every keystroke, it is the most interactive component of search services and thus receives the most traffic. This necessitates the development of extraordinarily time-efficient algorithms to service these requests in real-time.
This paper solves the scored prefix completion problem, which is the problem of finding the top-k highest scored completions which begin with a given prefix string p, where the score is a numeric rank denoting each string's relevance relative to every other string in a string corpus. This is the foundation of other autocomplete problems such as error-correcting autocompletion which may be viewed as an extension of this problem.
Tries (i.e. prefix trees) are the natural choice for prefix completion queries for their obvious structural and performance advantages in this problem space. However, most tries are not amenable to scored prefix completion. While any off-the-shelf trie can be augmented with scores, without a specialized data structure every completion to p must be considered a candidate for the top-k because scores could occur in any order in the trie. When the trie is large and p is short, so much of the trie must be traversed that the query cannot be served on-demand in real time. Many real-world trie-based autocomplete services explicitly disallow (or cache) prefix queries with fewer than three characters for this very reason [27], though the same performance issues can arise for any prefix whose candidate set of completions is massive. Although some or all of the top-k completions could be precomputed and cached, this strategy requires a preordained k and often too much space in practice. This paper instead improves query times by traversing smaller amounts of a specialized trie data structure tailored specifically for the autocomplete problem.
3. Preliminaries
A trie [4, 5], also known as a prefix tree, is a digital search tree that encodes a set of strings, organized such that every string is encoded as a root-to-node path and each edge extending from a parent node to a child node represents subsequent characters (or bit(s)) in the string. This is useful for autocompletion because all strings under (i.e. in the subtree of) the node representing p are valid completions to p.
A compacted trie [11], also known as a radix tree, is a trie that allows multiple characters to be associated with a single edge. This eliminates nodes that would otherwise have only one child, thereby reducing memory consumption.
A left-child right-sibling binary trie [6], abbreviated here as “LCRS trie”, is a trie where every node only holds two references to other nodes: its first child and next sibling. Any trie can be represented as an LCRS trie by transforming each list of children into a linked list and letting each parent point only to the first node in the linked list and its sibling.
A priority queue, also known as a heap [8], is an abstract data type similar to a queue except that elements have a “priority” which determines the order in which they are handled. Min-heaps support the operations HeapPush, which adds an element, and HeapPopMin, which removes and returns the minimum priority element. Max-heaps are the opposite, supporting HeapPopMax instead, which removes and returns the maximum priority element. The aforementioned operations are assumed to take Θ(log x) time, where x is the maximum size to which the heap grows. Note that this paper does not delay the construction of heaps until just before the first HeapPop operation is needed, which could save a few comparisons via Floyd's linear-time heap construction algorithm [7].
A double-ended priority queue, abbreviated as DEPQ, is a priority queue that simultaneously supports the operations of a min-heap and a max-heap [10].
A bounded priority queue or bounded heap is a priority queue constrained to a certain capacity. When an element is pushed to a full priority queue, the lowest priority element is discarded. Whether the element to be discarded happens to be the incoming element can be determined via a single comparison with the lowest priority element in the heap. Although this may garner performance gains in practice, this paper assumes the worst case: that each pushed element requires a logarithmic-time HeapPush.
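To make this concrete, the following Python sketch (the function name is ours; this is an illustration, not code from any cited work) keeps the k highest priorities in an ordinary binary min-heap, whose root is the lowest surviving priority; a single comparison against it decides whether the incoming element is the one discarded.

```python
import heapq

def bounded_push(heap, priority, capacity):
    """Retain only the `capacity` highest priorities seen so far.

    `heap` is a plain min-heap (a Python list managed by heapq),
    so its root heap[0] is the lowest surviving priority.
    """
    if capacity <= 0:
        return
    if len(heap) < capacity:
        heapq.heappush(heap, priority)
    elif priority > heap[0]:
        # one comparison with the minimum decides which element is discarded
        heapq.heapreplace(heap, priority)
```
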
Insertion sort is a simple, comparison-based, in-place sorting algorithm that sorts each successive item in a list by performing a linear scan on the items that precede it, shifting each higher ordered element, and inserting it immediately after the first element with a lower order is found [15]. In this paper it is used in online algorithms to sort a single element into an otherwise sorted list. The InsertionSort functions (Down/Up) take as input a list and an index into it where there is an element which may need to be shifted to maintain the list's sorted order, and returns the index into which it was shifted. MergeSortedSublists takes the same parameters, but the index instead points to the first element that is not properly sorted with the elements before it, but is sorted with the elements after it. MergeSortedSublists performs InsertionSortUp on each successive element, starting at the given index, terminates once an element is encountered that need not be shifted, and returns the final index of the element that originally occurred at the given index. InsertionSortIntoList takes a list and an element, performs AppendToList which appends the element to the back of the list, insertion sorts it upwards/leftwards, and returns its new index in the list. RemoveIndexFromList is a similar function that takes a list and an index and shifts all the elements after the index to the left by one, effectively shrinking the list.
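A minimal Python rendering of a few of these helpers may aid understanding. The lists here are sorted in descending order of score, matching their use later in this paper; the names mirror the functions above but the code itself is the editor's illustration, not the reference implementation.

```python
def insertion_sort_up(lst, i, key=lambda x: x):
    """Shift lst[i] upwards/leftwards until descending order holds;
    return the index into which the element was shifted."""
    item = lst[i]
    while i > 0 and key(lst[i - 1]) < key(item):
        lst[i] = lst[i - 1]   # shift each lower-ordered element one slot right
        i -= 1
    lst[i] = item
    return i

def insertion_sort_into_list(lst, item, key=lambda x: x):
    """AppendToList followed by an upwards insertion sort;
    return the element's new index."""
    lst.append(item)
    return insertion_sort_up(lst, len(lst) - 1, key)

def remove_index_from_list(lst, i):
    """Shift every element after index i left by one, shrinking the list."""
    del lst[i]
```
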
4. Background
The data structure serving as the baseline for this paper was introduced in Hsu & Ottaviano's 2013 paper Space-Efficient Data Structures for Top-k Completion [22]. It introduced the Completion Trie: a scored compacted trie where each internal node is given a score equal to the maximum score in its subtree. This conveniently allows internal nodes and leaf nodes (which hold their own score) to have the same shape, but more importantly, it enables the algorithm which searches for the top-k scores in a given subtree to know at every step which path leads to the next highest scored completion.
After ordinary trie traversal yields the locus node, i.e. the highest node representing a completion string with the given prefix p, the top-k search algorithm over the Completion Trie proceeds as follows: First, each child of the locus node is inserted into a bounded double-ended priority queue (DEPQ) constrained to size k - 1, except for the node with the same score as the locus node, for which the same process is repeated. (If there are multiple nodes with the target score, pick one.) In effect, the locus node's score is followed downwards to the corresponding leaf with the same score, and all other children along the path from the locus to the leaf are inserted into the DEPQ. Once the leaf is reached, it is pushed to the output list, k is decremented, and the maximum node is extracted from the DEPQ (again constraining its size to k - 1). The algorithm repeats for the extracted node and only terminates once k is 0 or the DEPQ is empty.
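Because every internal node's score equals the maximum score in its subtree (an exact heuristic), even a plain best-first search pops leaves in descending score order. The Python sketch below strips the algorithm down to that core idea, using a single unbounded max-heap and plain dict nodes; the bounded DEPQ and the equal-score path-following described above are omitted for brevity, so this is an illustrative simplification rather than the Completion Trie algorithm itself.

```python
import heapq
from itertools import count

def top_k_best_first(root, k):
    """root: {"score": int, "term": str, "children": [...]}, where each
    internal node's score is the maximum score in its subtree."""
    tick = count()                                # tiebreaker for equal scores
    heap = [(-root["score"], next(tick), root)]   # max-heap via negated scores
    out = []
    while heap and len(out) < k:
        _, _, node = heapq.heappop(heap)
        if not node["children"]:                  # leaf: emit a completion
            out.append((node["score"], node["term"]))
        else:                                     # internal node: expand children
            for child in node["children"]:
                heapq.heappush(heap, (-child["score"], next(tick), child))
    return out
```
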
This top-k search algorithm is a variation of the A* search algorithm [9] with the scores serving as an exact heuristic function [22]. This algorithm takes Θ(bdk log k) time in the worst case, where d denotes the average depth of each leaf node corresponding to the top-k completions to p, and b denotes the average breadth of each visited level. Since there are Θ(dk) visited levels of size b, there are a total of Θ(bdk) nodes pushed to the DEPQ, each taking Θ(log k) worst-case time. Note that b is at most the size of the alphabet and d is at most the average length of the completions to p.
Hsu & Ottaviano further improve the top-k search time by sorting each node's children by score, so that only one node from each level needs to be pushed to the DEPQ as the path of first-child nodes is iteratively followed to each completion's leaf. Each node in the DEPQ then effectively acts like a forward iterator which inserts the next element on its level when extracted (before its first-child path is followed to the leaf). This improved top-k search is a variation of the k-way merge algorithm [15] and takes Θ(dk log k) time.
Intuitively, since sorting horizontally removed the breadth factor b, it stands to reason that sorting vertically would remove the depth factor d. In other words, if the highlighted sibling nodes in Figure 1 (adjacent to the path of “wikipedia”) were sorted, then only the (next) highest of those nodes would need to be in the priority queue at any given time. However, the trie structure needs to be decomposed to support vertical sorting. This is the intuition and motivation behind the structure introduced in the next section. In summary:
Data Structure | Top-k Search Time
---|---
Unsorted Completion Trie | Θ(bdk log k)
Horizontally Sorted Completion Trie | Θ(dk log k)
Horizontally and Vertically Sorted Trie? | Θ(k log k)
5. Dynamic Score-Decomposed Tries
The Dynamic Score-Decomposed Trie is a (non-succinct, pointer-based) data structure based on the path decomposition of conventional tries. When constructing from a conventional trie, the path to the first leaf is compressed into a single decomposed node by concatenating the substrings and keeping a list of all the first-sibling nodes encountered along the path. Each encountered first-sibling node is likewise decomposed and stored inside a branch point which also contains the longest common prefix length (LCP) between the string it represents and its parent's.
Alternatively, this structure could be viewed as a derivative of the LCRS trie, as can the original trie structure of [4, 5]. The original trie can be derived from the LCRS trie by moving each horizontal linked list (of sibling nodes) into the parent node. By the same token this structure could be derived from the LCRS trie by moving the vertical linked lists (of descendant nodes) into the parent node (after advancing horizontally by one on each level and letting the parent node hold the string representing the concatenation of the nodes that were advanced over).
5.1 Structural Properties
Each node contains a string key, its corresponding numeric score, and a list of nodes with unique branch points which differ from key at different positions. By construction, branch points in each node are greater than or equal to the branch point that led to that node (or 0 for the root) and less than or equal to the length of its key. This gives the resulting structure the trie property, meaning that all nodes have a unique path to them from the root, which can be found by matching successive characters, and that the subtree under any given node contains only completions to the string that was matched along the path to it. To find the locus node representing (at least) p, the algorithm starts at the root node and iteratively computes the longest common prefix (LCP) between p and the current node's key, where the next current node is at the branch point matching the LCP length, and continues until all characters in p have been matched or until the target branch point does not exist within the current node.
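Under one possible pointer-based representation, where each node holds a key, a score, and a list of (LCP, node) branch points, the locus search reads as follows. Both the representation and the Python below are illustrative assumptions of this sketch, not the reference implementation.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    key: str
    score: int
    branches: list = field(default_factory=list)  # (lcp, Node) pairs

def lcp_len(a, b):
    """Length of the longest common prefix of strings a and b."""
    i = 0
    while i < min(len(a), len(b)) and a[i] == b[i]:
        i += 1
    return i

def find_locus(root, p):
    node = root
    while node is not None:
        l = lcp_len(node.key, p)
        if l == len(p):
            return node   # every character of p matched; node is the locus
        # follow the branch point whose LCP equals l and whose key
        # continues with the next character of p, if one exists
        node = next((child for lcp, child in node.branches
                     if lcp == l and child.key[l:l + 1] == p[l:l + 1]), None)
    return None
```
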
Since LCP lengths are stored explicitly, this structure is amenable to being sorted both horizontally and vertically. To start, the root node holds the maximum scored completion in the data set. To ensure horizontal sorted order, each branch point holds the highest-scored completion that matches its LCP with the containing node. To ensure vertical sorted order, each list of branch points is sorted by score. This structure thus satisfies the heap property, both horizontally and vertically, making this structure a variant of the binary max-heap (except that it has no constraint of being complete or nearly complete). Hence, the top-k search algorithm over this structure is equivalent to the algorithm for finding the top-k nodes from a binary max-heap without mutating it [12].
This structure also has the benefit of using only n nodes for n strings, as opposed to conventional tries, which require far more intermediary nodes and therefore more allocations, memory usage, and cache misses.
5.2 Top-k Completion Search
The top-k search algorithm over this structure proceeds as follows: First, the key of the locus node and its highest-scored branch point with LCP ≥ |p| are pushed to the output list and k is decremented twice. Next, the two candidate nodes directly succeeding the latter node, one horizontally and one vertically (like a binary max-heap drawn with right angles), are pushed to a bounded DEPQ constrained to size k. Iteratively, the maximum node is extracted from the DEPQ, its key is pushed to the output list, k is decremented, and (up to) two nodes (one horizontal and one vertical) are pushed to the DEPQ. This continues until k reaches 0 or until the DEPQ is empty. More specifically, the vertical node to be pushed to the DEPQ is the next branch point after the current node (in the containing node) with an LCP ≥ |p|, and the horizontal node is the first branch point contained within the current node's list of branch points. Some vertical successors can be skipped over because branch points in the locus node can have an LCP as low as the one that led to it (which is less than |p|, by definition). Every other list of branch points encountered by the algorithm is found by following a branch point with an LCP ≥ |p|, and therefore no such check is necessary outside the locus node's list of branch points.
In the worst case, this top-k search algorithm inserts 2(k - 2) nodes into the DEPQ and extracts k - 2 nodes, contributing a Θ(k log k) term to the time complexity. The algorithm also skips over a number of nodes in the list of branch points contained within the locus node which is at most |p| minus the locus node's branch point in its containing node, an integer in the range [0, |p|), contributing an O(|p|) term. Therefore, the top-k search (after the locus node is located) takes O(|p| + k log k) time.
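Assuming a pointer-based representation in which each node holds a key, a score, and a score-sorted list of (LCP, node) branch points, the search can be sketched in Python as below. For brevity, this sketch takes the already-located locus node as input and drives both successors through one max-heap (heapq with negated scores) instead of a bounded DEPQ; that changes constants, not the output.

```python
import heapq
from dataclasses import dataclass, field
from itertools import count

@dataclass
class Node:
    key: str
    score: int
    branches: list = field(default_factory=list)  # (lcp, Node), descending score

def top_k(locus, p_len, k):
    """Top-k completions under `locus`, the node matching a prefix of length p_len."""
    out = [(locus.score, locus.key)]
    tick = count()                                # tiebreaker for equal scores
    heap = []                                     # max-heap via negated scores

    def push(branches, i, in_locus):
        # only the locus's own list can contain branch points with LCP < |p|
        while in_locus and i < len(branches) and branches[i][0] < p_len:
            i += 1
        if i < len(branches):
            node = branches[i][1]
            heapq.heappush(heap, (-node.score, next(tick), branches, i, in_locus))

    push(locus.branches, 0, True)
    while heap and len(out) < k:
        _, _, branches, i, in_locus = heapq.heappop(heap)
        node = branches[i][1]
        out.append((node.score, node.key))
        push(branches, i + 1, in_locus)   # vertical successor: next in the same list
        push(node.branches, 0, False)     # horizontal successor: first branch point
    return out
```
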
As previously mentioned, the locus node is found in worst-case Θ(|p|) time when augmented with HashMaps and Θ(|p|bd) time otherwise. Altogether, the total times to both find the locus node and perform the top-k search in its subtree are as follows:
 | Augmented | Unaugmented
---|---|---
Dynamic Score-Decomposed Trie | Θ(|p| + k log k) | Θ(|p|bd + k log k)
Completion Trie | Θ(|p| + dk log k) | Θ(|p|b + dk log k)
5.2.1 DEPQ Capacity Optimization
As in the Heapsort algorithm [8], the DEPQ and the output list can be backed by the same underlying array. When that strategy is not used, one observation that can improve performance in high-level languages that always heap-allocate is that the maximum capacity of the DEPQ is actually only ⌊0.5k⌋. The reason for this is that each iteration has the net effect of incrementing the DEPQ size (after inserting 2 and extracting 1) and decrementing k. Since the size and k approach each other at the same rate (in the worst case), they converge in the middle. The size briefly overshoots the midpoint, reaching ⌊0.5k + 1⌋ after the two insertions but before the extraction. However, because k is decremented twice in the step before the DEPQ is used, the maximum size of the DEPQ is actually given by ⌊0.5(k - 2) + 1⌋, which reduces to ⌊0.5k⌋. The DEPQ can also be implemented as an insertion-sorted array for optimal performance when ⌊0.5k⌋ is low, falling back on a Min-Max Heap [10, 14] for higher values of k to maintain logarithmic asymptotic complexities.
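This bound can be checked with a short simulation of the worst case described above (each iteration extracts one node and pushes two, and pushes beyond the number of still-needed completions are discarded); the peak occupancy comes out to exactly ⌊0.5k⌋. The simulation is the editor's own illustration.

```python
def peak_depq_size(k):
    """Worst-case DEPQ occupancy while enumerating the top-k completions."""
    remaining = k - 2         # locus key and its first branch point output up front
    size = min(2, remaining)  # the two seed candidates (discarded if not needed)
    peak = size
    while remaining > 0 and size > 0:
        size -= 1                         # HeapPopMax emits the next completion
        remaining -= 1
        size = min(size + 2, remaining)   # push two successors, bounded to remaining
        peak = max(peak, size)
    return peak
```
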
5.2.2 Eliminating Logarithmic Factors
When it is acceptable for the top-k completions to be returned in arbitrary order, the logarithmic factors in the time complexities above can be eliminated. If the score of the kth highest completion were available before the start of the top-k search, then a depth-first search could retrieve the top-k completions in non-sorted order in Θ(k) time [12]. For a predetermined k, precomputing the kth highest score for all values of p with at least k completions enables this strategy. For arbitrary values of k, the ideas of [12] or something similar might be adaptable to this structure such that the kth highest score for a completion to p could be computed in O(k) time. While such strategies were not explored with implementations, it is conceivable that these ideas could bring the total query complexity down to the optimal Θ(|p| + k) time by eliminating the need for a DEPQ. Another potential small performance benefit is that nodes outside the locus node's list of branch points would no longer need to be checked for an LCP < |p| (a check the previous algorithm requires unless two separate DEPQ's are used), since the code path for those nodes could be kept separate from the one for the locus node's own list.
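To illustrate the threshold idea, the sketch below performs the depth-first collection step assuming the kth highest score is already known; the heap property lets it stop the moment a score falls below the threshold, so only reported nodes (plus their immediate rejected siblings) are touched. The node representation and the omission of locus-list LCP filtering are simplifying assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    key: str
    score: int
    branches: list = field(default_factory=list)  # (lcp, Node), descending score

def collect_at_least(node, threshold, out):
    """Append every completion under `node` scoring >= threshold, unordered."""
    if node.score < threshold:
        return
    out.append((node.score, node.key))
    for _, child in node.branches:
        if child.score < threshold:
            break              # the list is score-sorted: nothing further qualifies
        collect_at_least(child, threshold, out)
```
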
5.3 String Compression
Optionally, this structure can omit the prefix of each key which is implied by its path from the root, greatly reducing space usage when the uncompressed strings are not needed in main memory by any other process. This does, however, result in a performance penalty during top-k enumeration, because each completion must then be reconstructed on-demand to answer each query. Fortunately, the compressed version of this structure requires only k substring and concatenation operations to reconstruct k completions to a given prefix string p (specifically, str1.substring(0, len).concat(str2) is executed k times). This is still an improvement over the Completion Trie, which requires Θ(dk) concatenations to reconstruct k completions.
However, if the uncompressed strings are going to be stored in main memory anyway, or if multiple Dynamic Score-Decomposed Tries share the same set of completions (but with different scores) on a single machine, then it is quite advantageous that this structure does not need to divide strings into substrings as regular non-decomposed tries do. To aid understanding, the diagrams in this paper depict each node with the complete string key it represents, with the redundant prefix underlined and bolded.
5.4 Numeric Compression
A few techniques employed by others for numeric compression are also applicable to this structure: Firstly, longest common prefix lengths (i.e. LCP values) could be made relative to their containing node's LCP, which makes the numbers smaller and therefore require less space, as in [22, 23, 17, 16]. E.g., if a node's LCP is 8 and its containing node's LCP is 3, then the contained node only needs to store that its LCP is 5 more than its container. Relative LCP's must be in the range [0, |x|] where |x| is the length of the containing node's compressed key (which has the string implied by the path from the root omitted), and hence can be stored in log2(|x| + 1) bits. Secondly, because scores tend to exhibit a skewed power law distribution, variable-byte encoding schemes have been shown to reduce the space usage of scores [22, 23]. These techniques are otherwise omitted from this paper for ease of understanding, but greatly reduce space usage in practice [22, 23].
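As an illustration of the second technique, one common variable-byte convention packs seven payload bits per byte and sets the high bit on the final byte. This particular convention is a generic example, not necessarily the exact encoding used in [22, 23].

```python
def vbyte_encode(n):
    """Encode a non-negative integer, 7 bits per byte, high bit marks the last byte."""
    out = bytearray()
    while n >= 128:
        out.append(n & 0x7F)   # low 7 bits; a clear high bit means "more bytes follow"
        n >>= 7
    out.append(n | 0x80)       # final byte marked with the high bit
    return bytes(out)

def vbyte_decode(buf, i=0):
    """Decode one integer starting at offset i; return (value, next offset)."""
    n, shift = 0, 0
    while True:
        b = buf[i]
        i += 1
        n |= (b & 0x7F) << shift
        if b & 0x80:
            return n, i
        shift += 7
```

Small scores thus occupy a single byte, which suits the skewed distributions described above.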
5.5 Construction
The simplest way to construct a Dynamic Score-Decomposed Trie is with a list of scored, unique completions sorted in descending order by score. To start, the first scored completion in the list becomes the root node. Each subsequent scored completion is inserted by iteratively computing the longest common prefix (LCP) with the current node's key (starting with the root) and jumping to the current node's branch point which corresponds to that LCP and making that the new current node. When no branch point is found, the scored completion is inserted at the end of the current node's branch points. Because the input list is sorted, the trie produced by this algorithm is also properly sorted, both horizontally and vertically.
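A Python sketch of this offline construction, under a hypothetical representation where each node holds a key, a score, and a list of (LCP, node) branch points appended in encounter order (which the descending input scores keep sorted):

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    key: str
    score: int
    branches: list = field(default_factory=list)  # (lcp, Node), descending score

def lcp_len(a, b):
    """Length of the longest common prefix of strings a and b."""
    i = 0
    while i < min(len(a), len(b)) and a[i] == b[i]:
        i += 1
    return i

def build(scored_terms):
    """scored_terms: unique (score, term) pairs sorted descending by score."""
    (score, term), *rest = scored_terms
    root = Node(term, score)
    for score, term in rest:
        node = root
        while True:
            l = lcp_len(node.key, term)
            nxt = next((child for lcp, child in node.branches
                        if lcp == l and child.key[l:l + 1] == term[l:l + 1]), None)
            if nxt is None:
                # no matching branch point: append at the end, which keeps the
                # list score-sorted because the input scores are descending
                node.branches.append((l, Node(term, score)))
                break
            node = nxt
    return root
```
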
To construct the trie incrementally (i.e. online), or to change an existing structure, an algorithm is needed which assumes neither that successive scores are lower nor that the given completion is absent from the trie. This Set algorithm is given as input a string term and a numeric score to associate with it and proceeds as follows: If the trie is empty, the new scored completion becomes the root. Otherwise, the trie is traversed in the same way as the previous algorithm: Starting at the root, the LCP of the new completion with the current node's key is iteratively computed, and the next current node becomes the one at the branch point corresponding to that LCP. The traversal terminates when a node is found whose key exactly matches term (5.5.1), when the current node's score is lower than the given score (5.5.2), or when there is no corresponding branch point for the computed LCP (the only case in the previous algorithm).
5.5.1 Exact match found (demotion)
If the original traversal terminated because the current node's key exactly matches term (i.e. when LCP equals term length), the algorithm proceeds as follows: First, the current node's score is updated to the given score. If score is greater than or equal to all the scores in the current node's subtree (determined by checking its first branch point) then the current node is simply insertion sorted (by score) in its containing node, at which point Set is done.
If score is not the greatest of the current node's subtree, then the current node swaps places with its first (highest-scored) branch point. Since both slots now have lower scores than before, they each must be insertion sorted downwards to maintain the sorted order of both lists. The list of branch points into which the current node was demoted then becomes a queue of nodes to (non-recursively) reinsert into the subtree of the promoted node. The promoted node adopts the LCP of its new position and the demoted node is given an LCP equal to its key length. Also note that the demoted node should have an empty list of branch points.
To reinsert each node from the queue, the naive algorithm starts at the promoted node and iteratively computes the LCP and follows the corresponding branch points until one does not exist or until the dequeued node's score is higher than the current branch point's. If no branch point matches the target LCP, the dequeued node is insertion sorted into the current list of branch points. Otherwise, if the proper location for the dequeued node's score was found, the node formerly occupying that location is supplanted and insertion sorted directly into the dequeued node's branch points. The dequeued node is then inserted in the proper location and insertion sorted upwards. (Note that the supplanted node's LCP is guaranteed to not be present in the dequeued node. The reason for that is more apparent with the improved algorithm.)
Unfortunately, the naive algorithm is quite wasteful as it often performs the same tree traversal for each dequeued node and unnecessarily recalculates LCP's. Two observations improve this algorithm: Firstly, the LCP of any two nodes in a list of branch points is equal to the minimum of their LCP's (with the containing node). Secondly, nodes from the queue are always reinserted into one of the previous nodes in the queue or the promoted node due to the trie property.
With these observations, a better algorithm emerges that employs the use of a running maximums list, which holds a reference to each node from the queue that had the maximum LCP when it was encountered. The algorithm starts by pushing the promoted node to the running maximums list with its original LCP. Each successive node in the queue is inserted into the first element in the running maximums list that has an LCP greater than or equal to it. If the current node from the queue has the highest LCP encountered thus far, it is inserted into the maximum element from the running maximums list and is pushed to the running maximums list itself. Since the running maximums list is, by construction, sorted in ascending order of LCP, it is binary searchable in O(logd) time.
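Since the running maximums list is kept in ascending LCP order, locating the insertion target is an ordinary binary search; in Python it reduces to bisect_left over the LCP values (the function name is ours).

```python
from bisect import bisect_left

def first_lcp_at_least(running_max_lcps, target_lcp):
    """Index of the first running-maximums entry whose LCP is >= target_lcp.

    `running_max_lcps` is the ascending list of LCP values of the nodes
    in the running maximums list.
    """
    return bisect_left(running_max_lcps, target_lcp)
```
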
Inserting into a node works as before, except now the trie property guarantees that the target LCP cannot change from the minimum of the LCP's between the two elements from the running maximums list and the queue. The chain of LCP's in the running maximums list is followed until the dequeued node's score is higher than the current branch point, supplanting it, and insertion sorting it directly into the dequeued node's branch points. If the current list of branch points contains no node with the target LCP, then the dequeued node is insertion sorted directly into it. Note that when the dequeued node has a lower LCP than the current maximum LCP, the trie property guarantees that the node in the running maximums list into which it is inserted cannot contain a branch point with the same LCP as the dequeued node. Also note that the LCP given to reinserted elements in the trie does not change its LCP in the queue or running maximums list. Once all the nodes in the queue have been reinserted, Set is finished.
5.5.2 Found location for score (promotion)
If the original traversal terminated because the new score is greater than the current branch point's score, then the algorithm needs to traverse deeper in the trie to find the branch points for the inserted node, as well as delete the old node representing term, if one exists. To start, the node in the current branch point is supplanted by a new node with the given score that represents term and is insertion sorted upwards. The supplanted node is then placed as the first element in the new node's list of branch points (with an LCP equal to the current LCP from the traversal). Next, all branch points in the supplanted node's list with an LCP lower than the current LCP (from the traversal) are moved to the new list of branch points.
While the current LCP does not equal the length of term, the current LCP is followed, starting in the supplanted node's branch points, until a node is found whose key matches at least one more character of term. If no node is found, then Set is finished. If a node is found, it is replaced by its branch point with the same LCP as the current LCP, if one exists. Next, the current LCP is increased by matching successive characters of term to the replaced node's key. If the replaced node does not represent term exactly, it is assigned the current LCP and insertion sorted upwards in the new list of branch points, and then all of the replaced node's branch points with an LCP lower than its new LCP are moved to the new list of branch points and insertion sorted upwards. This algorithm repeats inside the replaced node until the current LCP equals the length of term.
While the current node does not exactly represent term (i.e. if it is a superstring of term), the chain of branch points with the current LCP is followed down to the node which exactly represents term. If it exists, the node which exactly represents term is replaced by its branch point with the current LCP (equal to the length of term) and the remainder of its branch points are insertion sorted upwards into the new list of branch points. Set is then finished.
5.6 Deletion
The Delete algorithm is nearly a subset of the Set algorithm. It takes as input a term and traverses the trie as previously described until a node is found that exactly represents term. If such a node exists, it is removed from the trie and all of the nodes it contained are reinserted, as in 5.5.1.
6. Discussion
Presumably, the stricter max-heap-like invariant of the Dynamic Score-Decomposed Trie can be back-ported to the Score-Decomposed Trie of [22, 23], yielding a succinct (and static) structure with the same asymptotic time complexity improvements over the Completion Trie. Also like the Completion Trie, any fuzzy completion algorithm applicable to a trie is also applicable to the Dynamic Score-Decomposed Trie because it, too, supports the same traversal operations as conventional tries [22].
7. Conclusion
This paper introduced the Dynamic Score-Decomposed Trie and the algorithms to construct it (both offline and online) and search it to enumerate the top-k completions to a prefix p. Top-k enumerations can be performed in Θ(|p| + k log k) time when the structure is augmented with HashMaps for constant-time horizontal/vertical traversals, and Θ(|p|bd + k log k) time otherwise. A live demo [1] and reference implementations [2] are provided.
8. References
[2] Reference implementations. https://github.com/Validark/DynSDT/.
8.1 Preliminary data structures
[5] Fredkin, Edward (1960). Trie memory. Communications of the ACM 3, 9 (1960), 490–499.
[7] Floyd, Robert W. (1964). Algorithm 245: Treesort 3. Communications of the ACM 7, 701.
[8] Williams, J. W. J. (1964). Algorithm 232: Heapsort. Communications of the ACM 7, 347–348.
[15] Bentley, Jon Louis (2000). Programming Pearls (2nd ed.). ACM Press / Addison Wesley. pp. 115–116, 147–162. ISBN 0201657880.