A trie (also known as a prefix tree, and sometimes loosely as a radix tree) is an ordered tree data structure whose keys are usually strings. Tries have a wide range of applications, from dictionaries to
3. CREDITS
Michael T. Goodrich and Roberto Tamassia, Data Structures and Algorithms in Java (4th edition), John Wiley & Sons, Inc., ISBN: 0-471-73884-0
Haim Kaplan, Tel Aviv University
Jörg Liebeherr, University of Toronto
L6 - Tries CS 6213 - Advanced Data Structures - Arora
4. MOTIVATION
Naïve brute-force search of a text of size n for a pattern of size m requires O(nm) time.
Preprocessing the pattern speeds up pattern matching queries. E.g., the KMP algorithm performs pattern matching in time proportional to the text size: O(n).
If the text is large, immutable, and searched often (e.g., the works of Shakespeare), we may want to preprocess the text itself, so that each search takes O(m) time.
5. MOTIVATION (CONT.)
A trie is a compact data structure for representing a set of strings, such as all the words in a text. A trie supports pattern matching queries in time proportional to the pattern size: O(m).
7. STANDARD TRIES
The standard trie for a set of strings S is an ordered tree such that:
Each node but the root is labeled with a character
The children of a node are alphabetically ordered
The paths from the root to the leaves yield the strings of S
Example: set of strings S = { bear, bell, bid, bull, buy, sell, stock, stop }
[Figure: standard trie for S. The root has children b and s; the root-to-leaf paths spell bear, bell, bid, bull, buy, sell, stock, and stop.]
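The definition above can be sketched in a few lines of Python. This is a minimal illustration, not the course's implementation: the names `TrieNode`, `Trie`, `insert`, `contains`, and `words` are hypothetical, children are kept in a dictionary and sorted only when listing, and an `is_word` flag marks the end of a string so that the sketch also works when one string is a prefix of another.

```python
class TrieNode:
    def __init__(self):
        self.children = {}    # character -> TrieNode
        self.is_word = False  # True if the root-to-here path spells a string of S


class Trie:
    def __init__(self, words=()):
        self.root = TrieNode()
        for word in words:
            self.insert(word)

    def insert(self, word):
        """Walk down from the root, creating missing nodes along the way."""
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True

    def contains(self, word):
        """Follow one edge per character; fail as soon as an edge is missing."""
        node = self.root
        for ch in word:
            node = node.children.get(ch)
            if node is None:
                return False
        return node.is_word

    def words(self, node=None, prefix=""):
        """Yield the stored strings by visiting children in alphabetical order."""
        if node is None:
            node = self.root
        if node.is_word:
            yield prefix
        for ch in sorted(node.children):
            yield from self.words(node.children[ch], prefix + ch)


S = ["bear", "bell", "bid", "bull", "buy", "sell", "stock", "stop"]
trie = Trie(S)
print(trie.contains("bell"), trie.contains("be"))
print(list(trie.words()))
```

Visiting the children in alphabetical order is what makes the tree "ordered": a preorder walk lists the strings of S in sorted order.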
8. ANALYSIS OF STANDARD TRIES
A standard trie uses O(n) space and supports searches, insertions, and deletions in time O(dm), where:
n = total size of the strings in S
m = size of the string parameter of the operation
d = size of the alphabet
The factor d accounts for finding the right child among up to d children at each of the m nodes visited.
[Figure: the standard trie for S = { bear, bell, bid, bull, buy, sell, stock, stop }, repeated.]
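Deletion is the least obvious of the three operations, so here is a hedged sketch of it. The helper names (`Node`, `insert`, `contains`, `delete`) are assumptions for illustration. Children are stored in a Python dictionary, so the child lookup is expected constant time and the factor d from the bound above does not show up here; it would with an ordered list of children.

```python
class Node:
    def __init__(self):
        self.children = {}    # character -> Node
        self.is_word = False  # marks the end of a stored string


def insert(root, word):
    node = root
    for ch in word:
        node = node.children.setdefault(ch, Node())
    node.is_word = True


def contains(root, word):
    node = root
    for ch in word:
        node = node.children.get(ch)
        if node is None:
            return False
    return node.is_word


def delete(root, word):
    """Remove word; return False if it was not stored. Visits O(m) nodes."""
    path = [root]
    for ch in word:
        node = path[-1].children.get(ch)
        if node is None:
            return False
        path.append(node)
    if not path[-1].is_word:
        return False
    path[-1].is_word = False
    # Prune nodes that no longer lead to any stored string, bottom-up.
    for i in range(len(word) - 1, -1, -1):
        node = path[i + 1]
        if node.children or node.is_word:
            break
        del path[i].children[word[i]]
    return True


root = Node()
for w in ["bear", "bell", "bid"]:
    insert(root, w)
delete(root, "bear")
print(contains(root, "bear"), contains(root, "bell"))
```

After deleting "bear", the nodes for "a" and "r" are pruned, while the shared path "be" stays because "bell" still needs it.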
9. WORD MATCHING WITH A TRIE
We insert the words of the text into a trie.
Each leaf stores the occurrences of the associated word in the text.
Text (character positions 0 to 88):
see a bear? sell stock! see a bull? buy stock! bid stock! bid stock! hear the bell? stop!
[Figure: trie of the words in the text; each leaf lists the starting positions of its word.]
see: 0, 24
bear: 6
sell: 12
stock: 17, 40, 51, 62
bull: 30
buy: 36
bid: 47, 58
hear: 69
bell: 78
stop: 84
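The word index from this slide can be reproduced with a short sketch. It is an illustration under assumptions: the trie is a nested dictionary, the key `OCC` is a made-up marker under which a node keeps its list of starting positions, and every word is indexed, including "a" and "the", which the figure leaves out.

```python
import re

TEXT = ("see a bear? sell stock! see a bull? buy stock! "
        "bid stock! bid stock! hear the bell? stop!")
OCC = "#"  # hypothetical key holding the occurrence list of a word


def build_word_trie(text):
    """Insert every word of the text, recording where each one starts."""
    root = {}
    for match in re.finditer(r"[a-z]+", text):
        node = root
        for ch in match.group():
            node = node.setdefault(ch, {})
        node.setdefault(OCC, []).append(match.start())
    return root


def occurrences(root, word):
    """Return the starting positions of word, in O(m) steps for |word| = m."""
    node = root
    for ch in word:
        node = node.get(ch)
        if node is None:
            return []
    return node.get(OCC, [])


index = build_word_trie(TEXT)
for w in ["see", "stock", "bid", "hear", "bear"]:
    print(w, occurrences(index, w))
```

The lookup cost depends only on the length of the query word, not on the length of the text, which is the point of preprocessing the text.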
10. COMPRESSED TRIES
A compressed trie has internal nodes of degree at least two.
It is obtained from the standard trie by compressing chains of “redundant” nodes, i.e., nodes with a single child.
[Figure: compressed trie for S, shown next to the standard trie. Root -> b, s; b -> e, id, u; e -> ar, ll; u -> ll, y; s -> ell, to; to -> ck, p.]
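One way to picture the compression step is to build the standard trie first and then merge every chain of single-child nodes into one edge label. The sketch below does that on nested dictionaries; `END`, `build_trie`, and `compress` are hypothetical names, and a production implementation would normally build the compressed form directly rather than in two passes.

```python
END = "$"  # hypothetical end-of-string marker, assumed absent from the strings


def build_trie(words):
    """Standard trie as nested dicts: one level per character."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = {}
    return root


def compress(node):
    """Merge each chain of single-child nodes into one multi-character label."""
    out = {}
    for label, child in node.items():
        if label == END:
            out[END] = {}
            continue
        # Extend the label while the child is a redundant node: exactly one
        # child, and no string ends there.
        while len(child) == 1 and END not in child:
            (next_label, grandchild), = child.items()
            label += next_label
            child = grandchild
        out[label] = compress(child)
    return out


S = ["bear", "bell", "bid", "bull", "buy", "sell", "stock", "stop"]
compressed = compress(build_trie(S))
print(sorted(compressed))            # labels below the root
print(sorted(compressed["b"]))       # labels below b
print(sorted(compressed["s"]["to"])) # labels below s -> to
```

For the example set S this yields the same edge labels as the figure: e, id, u under b, and ell, to under s.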
11. COMPACT REPRESENTATION
Compact representation of a compressed trie for an array of strings:
Stores at the nodes ranges of indices instead of substrings
Uses O(s) space, where s is the number of strings in the array
Serves as an auxiliary index structure
Array of strings:
S[0] = see
S[1] = bear
S[2] = sell
S[3] = stock
S[4] = bull
S[5] = buy
S[6] = bid
S[7] = hear
S[8] = bell
S[9] = stop
[Figure: the compressed trie for S with each node label replaced by a triplet (i, j, k) denoting the substring S[i][j..k]; e.g., (1, 0, 0) = “b”, (6, 1, 2) = “id”, (7, 0, 3) = “hear”.]
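To make the triplet notation concrete, the following sketch decodes a few of the triplets from the figure back into their substrings. The function name `label` is an assumption; the array S and the triplets are the ones shown on the slide, with k taken as inclusive.

```python
S = ["see", "bear", "sell", "stock", "bull",
     "buy", "bid", "hear", "bell", "stop"]


def label(triplet):
    """Decode (i, j, k) into the substring S[i][j..k], with k inclusive."""
    i, j, k = triplet
    return S[i][j:k + 1]


# Triplets as they appear at some nodes of the figure.
for t in [(1, 0, 0), (0, 0, 0), (6, 1, 2), (1, 2, 3),
          (8, 2, 3), (3, 1, 2), (3, 3, 4), (7, 0, 3)]:
    print(t, "->", label(t))
```

Each node stores three integers regardless of how long its label is, which is why the whole structure needs only O(s) space on top of the string array itself.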
12. STRING SEARCHES
Begins with: where name like 'x%'
Ends with: where name like '%x'
Substring: where name like '%x%'
13. SUFFIX TRIE
The suffix trie of a string X is the compressed trie of all the suffixes of X.
Example: X = minimize (character positions 0 to 7)
[Figure: suffix trie of minimize. Root -> e, i, mi, nimize, ze; i -> mize, nimize, ze; mi -> nimize, ze.]
14. ANALYSIS OF SUFFIX TRIES
Compact representation of the suffix trie for a string X of size n from an alphabet of size d:
Uses O(n) space
Supports arbitrary pattern matching queries in X in O(dm) time, where m is the size of the pattern
Can be constructed in O(n) time
[Figure: compact suffix trie of minimize; each node stores a pair (i, j) denoting the substring X[i..j], e.g., (7, 7) = “e”, (0, 1) = “mi”, (2, 7) = “nimize”.]
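The reason a suffix trie answers substring queries is that every substring of X is a prefix of some suffix of X. The sketch below shows this with a deliberately naive, uncompressed suffix trie; it takes O(n^2) time and space to build, so it is not the compact O(n) structure analysed above, and the function names are hypothetical.

```python
def build_suffix_trie(x):
    """Insert every suffix of x into a nested-dict trie (naive, O(n^2))."""
    root = {}
    for start in range(len(x)):
        node = root
        for ch in x[start:]:
            node = node.setdefault(ch, {})
    return root


def occurs(root, pattern):
    """A pattern occurs in x iff it labels a path starting at the root."""
    node = root
    for ch in pattern:
        if ch not in node:
            return False
        node = node[ch]
    return True


suffix_trie = build_suffix_trie("minimize")
for p in ["nimi", "mize", "zim"]:
    print(p, occurs(suffix_trie, p))
```

The query loop runs once per pattern character, independent of the length of X.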
15. APPLICATIONS OF TRIES
Autocomplete: the user types “Rob” and is offered all words that begin with Rob, or all contacts that begin with Rob, etc.
Sequence assembly of genetic sequences
Sorting of large sets of strings: Burstsort
Big Data: see the “TeraSort.java” source code
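As an illustration of the autocomplete use case, here is a minimal sketch: walk down the trie along the typed prefix, then collect every stored word below that node. The contact names, the `END` marker, and the function names are made up for the example.

```python
END = ""  # hypothetical end-of-word key; never collides with a real character


def build_trie(words):
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = {}
    return root


def complete(root, prefix):
    """Return all stored words that begin with prefix, in sorted order."""
    node = root
    for ch in prefix:
        if ch not in node:
            return []
        node = node[ch]
    results = []

    def collect(current, word):
        for key in sorted(current):
            if key == END:
                results.append(word)
            else:
                collect(current[key], word + key)

    collect(node, prefix)
    return results


contacts = build_trie(["Robin", "Rob", "Roger", "Robert", "Alice"])
print(complete(contacts, "Rob"))
```

Reaching the prefix node costs one step per typed character; the remaining work is proportional to the size of the subtree that holds the suggestions.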
16. SAMPLE APPLICATION – IP ROUTING
Packets of Fun
18. BORDER GATEWAY PROTOCOL (BGP)
A standardized exterior gateway protocol designed to exchange routing and reachability information between autonomous systems (AS) on the Internet.
Makes routing decisions based on paths, network policies and/or rule-sets configured by a network administrator.
Plays a key role in the overall operation of the Internet and is involved in making core routing decisions.
[BGP itself uses TCP to exchange its own data.]
20. ROUTING TABLE LOOKUP: LONGEST PREFIX MATCH
Destination address    Next hop
10.0.0.0/8             R1
128.143.0.0/16         R2
128.143.64.0/20        R3
128.143.192.0/20       R3
128.143.71.0/24        R4
128.143.71.55/32       R3
Default                R5
With CIDR, there can be multiple matches for a destination address in the routing table.
Longest Prefix Match: search for the routing table entry that has the longest match with the prefix of the destination IP address (the most specific route):
1. Search for a match on all 32 bits
2. Search for a match on 31 bits
…
33. Search for a match on 0 bits
Needed: a data structure that supports a FAST longest prefix match lookup!
Example: the longest prefix match for 128.143.71.21 is with 128.143.71.0/24, so the datagram will be sent to R4.
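Before turning to tries, it may help to see the longest-prefix rule applied by brute force to the table on this slide. The sketch uses Python's standard `ipaddress` module and scans every entry, which is exactly the slow approach a trie is meant to replace. The names `ROUTES`, `TABLE`, and `next_hop` are assumptions, and the default route is written as 0.0.0.0/0.

```python
import ipaddress

# Routing table from the slide; "Default" is modelled as 0.0.0.0/0.
ROUTES = [
    ("10.0.0.0/8", "R1"),
    ("128.143.0.0/16", "R2"),
    ("128.143.64.0/20", "R3"),
    ("128.143.192.0/20", "R3"),
    ("128.143.71.0/24", "R4"),
    ("128.143.71.55/32", "R3"),
    ("0.0.0.0/0", "R5"),
]
TABLE = [(ipaddress.ip_network(prefix), hop) for prefix, hop in ROUTES]


def next_hop(destination):
    """Linear scan: among all matching entries, pick the longest prefix."""
    address = ipaddress.ip_address(destination)
    matches = [(net, hop) for net, hop in TABLE if address in net]
    best_net, best_hop = max(matches, key=lambda entry: entry[0].prefixlen)
    return best_hop


for dest in ["128.143.71.21", "128.143.71.55", "10.1.2.3", "8.8.8.8"]:
    print(dest, "->", next_hop(dest))
```

For 128.143.71.21 three entries match (/16, /20, and /24, plus the default), and the /24 entry wins. The scan costs time proportional to the table size per packet, which motivates the trie-based schemes that follow.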
21. IP ADDRESS LOOKUP ALGORITHMS
The following algorithms are suitable for Longest Prefix Match routing table lookups:
Tries
Path-compressed tries
Disjoint-prefix binary tries
Multibit tries
Binary search on prefix
Prefix range search
22. SLIGHTLY DIFFERENT VERSION OF TRIE
A trie is a tree-based data structure for storing strings:
There is one node for every common prefix
The strings are stored in extra leaf nodes
Prefixes are not only stored at leaf nodes but also at internal nodes
[Figure: trie for the strings ten, tea, top, and pot, with internal nodes for the prefixes t, p, te, to, and po.]
23. BINARY TRIE
Structure:
  Each leaf contains a possible address
  Prefixes in the table are marked (dark)
Search:
  Traverse the tree according to the destination address
  The most recently visited marked node is the current longest prefix
  Search ends when a leaf node is reached
24. BINARY TRIE
Update:
  Search for the new entry
  Search ends when a leaf node is reached
  If there is no branch to take, insert new node(s)
[Figure: inserting the new prefix z = 1010* into the binary trie.]
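The search and update procedures for the binary trie can be sketched together as follows. The prefixes a = 0*, b = 01000*, d = 1*, and z = 1010* are taken from the slides; c = 011* is an assumed 3-bit prefix, since the figure that defines c is not available. Addresses are given as bit strings for readability.

```python
class Node:
    def __init__(self):
        self.child = [None, None]  # child[0] for bit 0, child[1] for bit 1
        self.entry = None          # set on "marked" nodes that hold a table prefix


def insert(root, prefix, entry):
    """Follow the prefix bits; where there is no branch, insert new nodes."""
    node = root
    for bit in prefix:
        b = int(bit)
        if node.child[b] is None:
            node.child[b] = Node()
        node = node.child[b]
    node.entry = entry


def longest_prefix_match(root, address):
    """Walk along the address bits, remembering the last marked node seen."""
    node, best = root, root.entry
    for bit in address:
        node = node.child[int(bit)]
        if node is None:
            break
        if node.entry is not None:
            best = node.entry
    return best


root = Node()
for prefix, name in [("0", "a"), ("01000", "b"), ("011", "c"), ("1", "d")]:
    insert(root, prefix, name)
print(longest_prefix_match(root, "010110"))  # only a's prefix 0* matches
insert(root, "1010", "z")                    # the update from the slide
print(longest_prefix_match(root, "101011"))
```

Both operations visit at most one node per address bit, which gives the O(W) bounds for an address of W bits.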
25. COMPRESSED BINARY TRIE
Goal: eliminate long sequences of 1-child nodes
Path compression collapses 1-child branches
Path compression:
  Requires storing additional information with nodes: a bit number field is added to each node
  Bit strings of prefixes must be explicitly stored at nodes
  A comparison must be made when searching the tree
26. COMPRESSED BINARY TRIE
Search: “010110”
Root node: inspect 1st bit and move left
“a” node:
  Check with prefix of a (“0*”) and find a match
  Inspect 3rd bit and move left
“b” node:
  Check with prefix of b (“01000*”) and determine that there is no match
Search stops. Longest prefix match is with a
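The search walk-through above can be traced in code. The sketch below hand-builds only the fragment of the tree that the walk-through mentions (root, a, b, plus d as the root's right child, which is an assumption); the rest of the figure is not reproduced, and the `Node` fields are hypothetical names for the bit number and stored prefix described on the previous slide.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    bit: Optional[int] = None     # 1-based bit number to inspect; None for a leaf
    prefix: Optional[str] = None  # prefix bits stored explicitly at this node
    name: Optional[str] = None
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def lookup(root, address):
    """Compare against each stored prefix; stop at the first mismatch."""
    best, node = None, root
    while node is not None:
        if node.prefix is not None:
            if not address.startswith(node.prefix):
                break             # skipped bits did not match: search stops
            best = node.name
        if node.bit is None or node.bit > len(address):
            break
        node = node.left if address[node.bit - 1] == "0" else node.right
    return best


# Fragment only: b hangs below a on the 0-branch of bit 3.
b = Node(prefix="01000", name="b")
a = Node(bit=3, prefix="0", name="a", left=b)
d = Node(prefix="1", name="d")
root = Node(bit=1, left=a, right=d)

print(lookup(root, "010110"))
```

Because bits between the inspected positions are skipped, reaching a node does not prove that its prefix matches; this is why the explicit comparison at each node is needed.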
27. DISJOINT-PREFIX BINARY TRIE
Multiple matches in the longest prefix rule require backtracking of the search
Goal: transform the tree so as to avoid multiple matches
Disjoint prefix:
  Nodes are split so that there is only one match for each prefix (“leaf pushing”)
  Consequence: internal nodes do not match with prefixes
Results:
  a (0*) is split into: a1 (00*), a3 (010*), a2 (01001*)
  d (1*) is represented as d1 (101*)
28. VARIABLE-STRIDE MULTIBIT TRIE
Goal: accelerate lookup by inspecting more than one bit at a time
“Stride”: number of bits inspected at one time
With a k-bit stride, a node has up to 2^k child nodes
2-bit stride:
  1-bit prefix for a (0*) is split into 00* and 01*
  1-bit prefix for d (1*) is split into 10* and 11*
  3-bit prefix for c has been expanded to two nodes
  Why are the prefixes for b and e not expanded?
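The splitting of prefixes described above is usually called prefix expansion. A small sketch of it follows, under the simplifying assumption that the same stride k is used at every level (the slide's variable-stride trie may choose a different stride per node). The function name `expand` is hypothetical, and c = 011* is an assumed 3-bit prefix.

```python
from itertools import product


def expand(prefix, k):
    """Pad a prefix to the next multiple of k bits in every possible way."""
    pad = (-len(prefix)) % k  # number of missing bits, 0 if already aligned
    return [prefix + "".join(bits) for bits in product("01", repeat=pad)]


print(expand("0", 2))    # prefix a = 0*
print(expand("1", 2))    # prefix d = 1*
print(expand("011", 2))  # an assumed 3-bit prefix such as c
```

With a 2-bit stride, a 1-bit prefix becomes two 2-bit prefixes and a 3-bit prefix becomes two 4-bit prefixes, while a prefix whose length is already a multiple of the stride is left unchanged.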
29. COMPLEXITY OF THE LOOKUP
Scheme                    Lookup    Update          Memory
Binary trie               O(W)      O(W)            O(NW)
Path-compressed trie      O(W)      O(W)            O(NW)
k-stride multibit trie    O(W/k)    O(W/k + 2^k)    O(2^k NW/k)
Bounds are expressed for:
  Look-up time: What is the longest lookup time?
  Update time: How long does it take to change an entry?
  Memory: How much memory is required to store the data structure?
W: length of the address (32 bits)
N: number of prefixes in the routing table
30. CONCLUSIONS: TRIES
Excellent data structure for managing strings
Supports prefix and suffix lookups
Extremely fast: after the trie has been built, the search time is O(m), where m is the size of the pattern
Can be used to build indexes
Various applications in areas that use strings (literature/dictionary/content, as well as networks and bioinformatics)