Lecture 6: Data-Intensive Computing for Text Analysis (Fall 2011)
1. Data-Intensive Computing for Text Analysis
CS395T / INF385T / LIN386M
University of Texas at Austin, Fall 2011
Lecture 6
September 29, 2011
Jason Baldridge, Department of Linguistics, University of Texas at Austin (jasonbaldridge at gmail dot com)
Matt Lease, School of Information, University of Texas at Austin (ml at ischool dot utexas dot edu)
2. Acknowledgments
Course design and slides based on
Jimmy Lin’s cloud computing courses at
the University of Maryland, College Park
Some figures courtesy of the following
excellent Hadoop books (order yours today!)
• Chuck Lam’s Hadoop In Action (2010)
• Tom White’s Hadoop: The Definitive Guide,
2nd Edition (2010)
3. Today’s Agenda
• Automatic Spelling Correction
– Review: Information Retrieval (IR)
• Boolean Search
• Vector Space Modeling
• Inverted Indexing in MapReduce
– Probabilistic modeling via noisy channel
• Index Compression
– Order inversion in MapReduce
• In-class exercise
• Hadoop: Pipelined & Chained jobs
5. Automatic Spelling Correction
Three main stages
Error detection
Candidate generation
Candidate ranking / choose best candidate
Usage cases
Flagging possible misspellings / spell checker
Suggesting possible corrections
Automatically correcting (inferred) misspellings
• “as you type” correction
• web queries
• real-time closed captioning
• …
6. Types of spelling errors
Unknown words: “She is their favorite acress in town.”
Can be identified using a dictionary…
…but could be a valid word not in the dictionary
Dictionary could be automatically constructed from large corpora
• Filter out rare words (misspellings, or valid but unlikely)…
• Why filter out rare words that are valid?
Unknown words violating phonotactics:
e.g. “There isn’t enough room in this tonw for the both of us.”
Given dictionary, could automatically construct “n-gram dictionary”
of all character n-grams known in the language
• e.g. English words don’t end with “nw”, so flag tonw
Incorrect homophone: “She drove their.”
Valid word, wrong usage; infer appropriateness from context
Typing errors reflecting kayout of leyboard
7. Candidate generation
How to generate possible corrections for acress?
Inspiration: how do people do it?
People may suggest words like actress, across, access, acres,
caress, and cress – what do these have in common?
What about “blam” and “zigzag”?
Two standard strategies for candidate generation
Minimum edit distance
• Generate all candidates within 1+ edit step(s)
• Possible edit operations: insertion, deletion, substitution, transposition, …
• Filter through a dictionary
• See Peter Norvig’s post: http://norvig.com/spell-correct.html
Character ngrams: see next slide…
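The edit-distance strategy above can be sketched in a few lines, in the spirit of Norvig's essay; the function names (`edits1`, `candidates`) and the toy dictionary are illustrative, not part of the lecture code.

```python
# Generate all candidates within 1 edit step, then filter through a dictionary.
def edits1(word):
    """All strings one insertion, deletion, substitution, or transposition away."""
    letters = "abcdefghijklmnopqrstuvwxyz"
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    substitutions = [a + c + b[1:] for a, b in splits if b for c in letters]
    inserts = [a + c + b for a, b in splits for c in letters]
    return set(deletes + transposes + substitutions + inserts)

def candidates(word, dictionary):
    """Filter the edit-1 neighborhood through a dictionary of valid words."""
    return edits1(word) & dictionary

# e.g. candidates("acress", {"actress", "across", "access", "acres", "caress", "cress"})
```

All six suggestions people make for "acress" are in fact within one edit step, so this single pass already recovers them.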
8. Character ngram Spelling Correction
Information Retrieval (IR) model
Query=typo word
Document collection = dictionary (i.e. set of valid words)
Representation: word is set of character ngrams
Let’s use n=3 (trigram), with # to mark word start/end
Examples
across: [#ac, acr, cro, ros, oss, ss#]
acress: [#ac, acr, cre, res, ess, ss#]
actress: [#ac, act, ctr, tre, res, ess, ss#]
blam: [#bl, bla, lam, am#]
mississippi: [#mi, mis, iss, ssi, sis, sip, ipp, ppi, pi#]
Uhm, IR model???
Review…
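The boundary-marked trigram representation above is easy to sketch; `char_trigrams` is an illustrative name, not lecture code.

```python
# Extract character n-grams, with "#" marking word start/end (n=3 by default).
def char_trigrams(word, n=3):
    padded = "#" + word + "#"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

# char_trigrams("acress") -> ['#ac', 'acr', 'cre', 'res', 'ess', 'ss#']
# char_trigrams("blam")   -> ['#bl', 'bla', 'lam', 'am#']
```

For the set representation, wrap the result in `set(...)`; for the vector representation used later, count duplicates (e.g. with `collections.Counter`).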
9. Abstract IR Architecture
Query Documents
online offline
Representation Representation
Function Function
Query Representation Document Representation
Comparison
Function Index
Results
10. Document Boolean Representation
“Bag of Words”: McDonalds, fat, fries, new, french, Company, Said, nutrition, …
Source document:
McDonald's slims down spuds
Fast-food chain to reduce certain types of fat in its french fries with new cooking oil.
NEW YORK (CNN/Money) - McDonald's Corp. is cutting the amount of "bad" fat in its french fries nearly in half, the fast-food chain said Tuesday as it moves to make all its fried menu items healthier.
But does that mean the popular shoestring fries won't taste the same? The company says no. "It's a win-win for our customers because they are getting the same great french-fry taste along with an even healthier nutrition profile," said Mike Roberts, president of McDonald's USA.
But others are not so sure. McDonald's will not specifically discuss the kind of oil it plans to use, but at least one nutrition expert says playing with the formula could mean a different taste.
Shares of Oak Brook, Ill.-based McDonald's (MCD: down $0.54 to $23.22, Research, Estimates) were lower Tuesday afternoon. It was unclear Tuesday whether competitors Burger King and Wendy's International (WEN: down $0.80 to $34.91, Research, Estimates) would follow suit. Neither company could immediately be reached for comment.
…
12. Inverted Index: Boolean Retrieval
Doc 1 Doc 2 Doc 3 Doc 4
one fish, two fish red fish, blue fish cat in the hat green eggs and ham
Postings (term → sorted docIDs):
blue → [2]
cat → [3]
egg → [4]
fish → [1, 2]
green → [4]
ham → [4]
hat → [3]
one → [1]
red → [2]
two → [1]
13. Inverted Indexing via MapReduce
Doc 1 Doc 2 Doc 3
one fish, two fish red fish, blue fish cat in the hat
Map output:
Doc 1 → (one, 1), (two, 1), (fish, 1)
Doc 2 → (red, 2), (blue, 2), (fish, 2)
Doc 3 → (cat, 3), (hat, 3)
Shuffle and Sort: aggregate values by keys
Reduce output:
blue → [2]; cat → [3]; fish → [1, 2]; hat → [3]; one → [1]; red → [2]; two → [1]
14. Inverted Indexing in MapReduce
1: class Mapper
2: procedure Map(docid n; doc d)
3: H = new Set
4: for all term t in doc d do
5: H.add(t)
6: for all term t in H do
7: Emit(term t, n)
1: class Reducer
2: procedure Reduce(term t; Iterator<integer> docids [n1, n2, …])
3: List P = docids.values()
4: Emit(term t; P)
15. Scalability Bottleneck
Desired output format: <term, [doc1, doc2, …]>
Just emitting each <term, docID> pair won’t produce this
How to produce this without buffering?
Side-effect: write directly to HDFS instead of emitting
Complications?
• Persistent data must be cleaned up if reducer restarted…
16. Using the Inverted Index
Boolean Retrieval: to execute a Boolean query
Build query syntax tree, e.g. for ( blue AND fish ) OR ham:
OR( ham, AND( blue, fish ) )
For each clause, look up postings:
blue → [2]
fish → [1, 2]
Traverse postings and apply Boolean operator
Efficiency analysis
Start with shortest posting first
Postings traversal is linear (if postings are sorted)
• Oops… we didn’t actually do this in building our index…
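The linear postings traversal described above can be sketched as a merge over sorted docID lists; the toy index and function names are illustrative, not lecture code.

```python
# AND = intersection of two sorted postings lists, in linear time.
def postings_and(p1, p2):
    i = j = 0
    out = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            out.append(p1[i]); i += 1; j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return out

# OR = union of two sorted postings lists, keeping the result sorted.
def postings_or(p1, p2):
    return sorted(set(p1) | set(p2))

# Query ( blue AND fish ) OR ham over the toy index:
index = {"blue": [2], "fish": [1, 2], "ham": [4]}
# postings_or(postings_and(index["blue"], index["fish"]), index["ham"]) -> [2, 4]
```

Note the linear merge in `postings_and` only works because the postings are sorted, which is exactly the property the naive indexer fails to guarantee.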
17. Inverted Indexing in MapReduce
1: class Mapper
2: procedure Map(docid n; doc d)
3: H = new Set
4: for all term t in doc d do
5: H.add(t)
6: for all term t in H do
7: Emit(term t, n)
1: class Reducer
2: procedure Reduce(term t; Iterator<integer> docids [n1, n2, …])
3: List P = docids.values()
4: Emit(term t; P)
18. Inverted Indexing in MapReduce: try 2
1: class Mapper
2: procedure Map(docid n; doc d)
3: H = new Set
4: for all term t in doc d do
5: H.add(t)
6: for all term t in H do
7: Emit(term t, n)
1: class Reducer
2: procedure Reduce(term t; Iterator<integer> docids [n1, n2, …])
3: List P = docids.values()
4: Sort(P)
5: Emit(term t; P)
19. (Another) Scalability Bottleneck
Reducer buffers all docIDs associated with a term (in order to sort them)
What if term occurs in many documents?
Secondary sorting
Use composite key
Partition function
Key Comparator
Side-effect: write directly to HDFS as before…
20. Inverted index for spelling correction
Like search, spelling correction must be fast
How can we quickly identify candidate corrections?
II: Map each character ngram → list of all words containing it
#ac -> { act, across, actress, acquire, … }
acr -> { across, acrimony, macro, … }
cre -> { crest, acre, acres, … }
res -> { arrest, rest, rescue, restaurant, … }
ess -> { less, lesson, necessary, actress, … }
ss# -> { less, mess, moss, across, actress, … }
How do we build the inverted index in MapReduce?
21. Exercise
Write a MapReduce algorithm for creating an inverted
index for trigram spelling correction, given a corpus
22. Exercise
Write a MapReduce algorithm for creating an inverted
index for trigram spelling correction, given a corpus
Map(String docid, String text):
for each word w in text:
for each trigram t in w:
Emit(t, w)
Reduce(String trigram, Iterator<Text> values):
Emit(trigram, values.toSet)
Also other alternatives, e.g. in-mapper combining, pairs
Is MapReduce even necessary for this?
Dictionary vs. token frequency
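The Map/Reduce pseudocode above can be simulated locally in a few lines; `build_index` collapses map, shuffle, and reduce into one in-memory pass and is only an illustration, not Hadoop code.

```python
from collections import defaultdict

def trigrams(word):
    padded = "#" + word + "#"
    return [padded[i:i + 3] for i in range(len(padded) - 2)]

def build_index(corpus):
    grouped = defaultdict(set)           # stand-in for shuffle-and-sort
    for text in corpus:                  # one "map" call per document
        for word in text.split():
            for t in trigrams(word):
                grouped[t].add(word)     # "reduce": collapse values to a set
    return dict(grouped)

index = build_index(["the cat sat", "the mat"])
# index["#ca"] == {"cat"}; "at#" maps to every word ending in "at"
```

That the whole thing fits in memory is the point of the "Is MapReduce even necessary?" question: a dictionary-sized input is tiny, though building it from token counts over a large corpus is not.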
23. Spelling correction as Boolean search
Given inverted index, how to find set of possible corrections?
Compute union of all words indexed by any of its character ngrams
= Boolean search
• Query “acress” → “#ac OR acr OR cre OR res OR ess OR ss#”
Are all corrections equally likely / good?
24. Ranked Information Retrieval
Order documents by probability of relevance
Estimate relevance of each document to the query
Rank documents by relevance
How do we estimate relevance?
Vector space paradigm
Approximate relevance by vector similarity (e.g. cosine)
Represent queries and documents as vectors
Rank documents by vector similarity to the query
25. Vector Space Model
[Figure: documents d1–d5 plotted as vectors over terms t1, t2, t3; θ and φ are angles between document vectors]
Assumption: Documents that are “close” in vector space
“talk about” the same things
Retrieve documents based on how close the document
vector is to the query vector (i.e., similarity ~ “closeness”)
26. Similarity Metric
Use the “angle” between the vectors:
cos(θ) = ( d_j · d_k ) / ( |d_j| |d_k| )
sim(d_j, d_k) = ( d_j · d_k ) / ( |d_j| |d_k| ) = Σ_{i=1}^{n} w_{i,j} w_{i,k} / ( √(Σ_{i=1}^{n} w_{i,j}²) √(Σ_{i=1}^{n} w_{i,k}²) )
Given pre-normalized vectors, just compute the inner product:
sim(d_j, d_k) = d_j · d_k = Σ_{i=1}^{n} w_{i,j} w_{i,k}
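The cosine formula on this slide can be sketched over sparse vectors; the dict-of-weights representation and function name are assumptions for illustration.

```python
import math

# Cosine similarity between two sparse vectors (term -> weight dicts).
def cosine(v1, v2):
    dot = sum(w * v2.get(t, 0) for t, w in v1.items())
    norm1 = math.sqrt(sum(w * w for w in v1.values()))
    norm2 = math.sqrt(sum(w * w for w in v2.values()))
    return dot / (norm1 * norm2) if norm1 and norm2 else 0.0

# Identical vectors score ~1.0; vectors with no shared components score 0.0.
```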
27. Boolean Character ngram correction
Boolean Information Retrieval (IR) model
Query=typo word
Document collection = dictionary (i.e. set of valid words)
Representation: word is set of character ngrams
Let’s use n=3 (trigram), with # to mark word start/end
Examples
across: [#ac, acr, cro, ros, oss, ss#]
acress: [#ac, acr, cre, res, ess, ss#]
actress: [#ac, act, ctr, tre, res, ess, ss#]
blam: [#bl, bla, lam, am#]
mississippi: [#mi, mis, iss, ssi, sis, sip, ipp, ppi, pi#]
28. Ranked Character ngram correction
Vector space Information Retrieval (IR) model
Query=typo word
Document collection = dictionary (i.e. set of valid words)
Representation: word is vector of character ngram value
Rank candidate corrections according to vector similarity (cosine)
Trigram Examples
across: [#ac, acr, cro, ros, oss, ss#]
acress: [#ac, acr, cre, res, ess, ss#]
actress: [#ac, act, ctr, tre, res, ess, ss#]
blam: [#bl, bla, lam, am#]
mississippi: [#mi, mis, (iss, 2), (ssi, 2), sis, sip, ipp, ppi, pi#]
29. Spelling Correction in Vector Space
[Figure: words d1–d5 as vectors over character ngrams t1, t2, t3; θ and φ are angles between word vectors]
Assumption: Words that are “close together” in ngram
vector space have similar orthography
Therefore, retrieve words in the dictionary based on how
close the word is to the typo (i.e., similarity ~ “closeness”)
30. Ranked Character ngram correction
Vector space Information Retrieval (IR) model
Query=typo word
Document collection = dictionary (i.e. set of valid words)
Representation: word is vector of character ngram value
Rank candidate corrections according to vector similarity (cosine)
Trigram Examples
across: [#ac, acr, cro, ros, oss, ss#]
acress: [#ac, acr, cre, res, ess, ss#]
actress: [#ac, act, ctr, tre, res, ess, ss#]
blam: [#bl, bla, lam, am#]
mississippi: [#mi, mis, (iss, 2), (ssi, 2), sis, sip, ipp, ppi, pi#]
“value” here expresses relative importance of different
vector components for the similarity comparison
We use a simple count here; what else might we do?
31. IR Term Weighting
Term weights consist of two components
Local: how important is the term in this document?
Global: how important is the term in the collection?
Here’s the intuition:
Terms that appear often in a document should get high weights
Terms that appear in many documents should get low weights
How do we capture this mathematically?
Term frequency (local)
Inverse document frequency (global)
32. TF.IDF Term Weighting
w_{i,j} = tf_{i,j} × log( N / n_i )
w_{i,j} = weight assigned to term i in document j
tf_{i,j} = number of occurrences of term i in document j
N = number of documents in the entire collection
n_i = number of documents containing term i
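The tf.idf weight on this slide in code form (a sketch; the function name and argument names are illustrative):

```python
import math

def tfidf(tf, df, N):
    """tf: occurrences of the term in this document; df: number of documents
    containing the term; N: total documents in the collection."""
    return tf * math.log(N / df)

# A term appearing in every document gets weight 0 (log(N/N) = 0),
# no matter how frequent it is locally:
# tfidf(tf=10, df=4, N=4) -> 0.0
```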
33. Inverted Index: TF.IDF
Doc 1 Doc 2 Doc 3 Doc 4
one fish, two fish red fish, blue fish cat in the hat green eggs and ham
Term frequencies (tf), document frequencies (df), and postings as (docID, tf):
blue: df 1 → [(2, 1)]
cat: df 1 → [(3, 1)]
egg: df 1 → [(4, 1)]
fish: df 2 → [(1, 2), (2, 2)]
green: df 1 → [(4, 1)]
ham: df 1 → [(4, 1)]
hat: df 1 → [(3, 1)]
one: df 1 → [(1, 1)]
red: df 1 → [(2, 1)]
two: df 1 → [(1, 1)]
34. Inverted Indexing via MapReduce
Doc 1 Doc 2 Doc 3
one fish, two fish red fish, blue fish cat in the hat
Map output:
Doc 1 → (one, 1), (two, 1), (fish, 1)
Doc 2 → (red, 2), (blue, 2), (fish, 2)
Doc 3 → (cat, 3), (hat, 3)
Shuffle and Sort: aggregate values by keys
Reduce output:
blue → [2]; cat → [3]; fish → [1, 2]; hat → [3]; one → [1]; red → [2]; two → [1]
35. Inverted Indexing via MapReduce (2)
Doc 1 Doc 2 Doc 3
one fish, two fish red fish, blue fish cat in the hat
Map output, now as (term, (docID, tf)):
Doc 1 → (one, (1, 1)), (two, (1, 1)), (fish, (1, 2))
Doc 2 → (red, (2, 1)), (blue, (2, 1)), (fish, (2, 2))
Doc 3 → (cat, (3, 1)), (hat, (3, 1))
Shuffle and Sort: aggregate values by keys
Reduce output:
blue → [(2, 1)]; cat → [(3, 1)]; fish → [(1, 2), (2, 2)]; hat → [(3, 1)]; one → [(1, 1)]; two → [(1, 1)]; red → [(2, 1)]
37. Ranked Character ngram correction
Vector space Information Retrieval (IR) model
Query=typo word
Document collection = dictionary (i.e. set of valid words)
Representation: word is vector of character ngram value
Rank candidate corrections according to vector similarity (cosine)
Trigram Examples
across: [#ac, acr, cro, ros, oss, ss#]
acress: [#ac, acr, cre, res, ess, ss#]
actress: [#ac, act, ctr, tre, res, ess, ss#]
blam: [#bl, bla, lam, am#]
mississippi: [#mi, mis, (iss, 2), (ssi, 2), sis, sip, ipp, ppi, pi#]
“value” here expresses relative importance of different
vector components for the similarity comparison
What else might we do? TF.IDF for character n-grams?
38. TF.IDF for character n-grams
Think about what makes an ngram more discriminating
e.g. in acquire, acq and cqu are more indicative than qui and ire.
Schematically, we want something like:
• acquire: [ #ac, acq, cqu, qui, uir, ire, re# ]
Possible solution: TF-IDF, where
TF is the frequency of the ngram in the word
IDF is the inverse of the number of vocabulary words the ngram occurs in
39. Correction Beyond Orthography
So far we’ve focused on orthography alone
The context of a typo also tells us a great deal
How can we compare contexts?
40. Correction Beyond Orthography
So far we’ve focused on orthography alone
The context of a typo also tells us a great deal
How can we compare contexts?
Idea: use the co-occurrence matrices built during HW2
We have a vector of co-occurrence counts for each word
Extract a similar vector for the typo given its immediate context
• “She is their favorite acress in town.”
acress: [ she:1, is:1, their:1, favorite:1, in:1, town:1 ]
Possible enhancement: make vectors sensitive to word order
41. Combining evidence
We have orthographic similarity and contextual similarity
We can do a simple weighted combination of the two, e.g.:
simCombined(d_j, d_k) = λ · simOrth(d_j, d_k) + (1 − λ) · simContext(d_j, d_k)
How to do this more efficiently?
Compute top candidates based on simOrth
Take top k for consideration with simContext
…or other way around…
The combined model might also be expressed by a similar
probabilistic model…
42. Paradigm: Noisy-Channel Modeling
ŝ = argmax_S P( S | O ) = argmax_S P( S ) P( O | S )
Want to recover most likely latent (correct) source
word underlying the observed (misspelled) word
P(S): language model gives probability distribution
over possible (candidate) source words
P(O|S): channel model gives probability of each
candidate source word being “corrupted” into the
observed typo
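A toy decoder matching the argmax above: pick the source word S maximizing P(S)·P(O|S). The probability tables here are invented for illustration, not real estimates from any corpus.

```python
# language_model: P(S); channel_model[(observed, source)]: P(O|S).
def correct(observed, language_model, channel_model):
    return max(
        language_model,
        key=lambda s: language_model[s] * channel_model.get((observed, s), 0.0),
    )

p_source = {"actress": 0.0002, "across": 0.0006, "acres": 0.0003}
p_channel = {("acress", "actress"): 0.02, ("acress", "across"): 0.001,
             ("acress", "acres"): 0.01}
# correct("acress", p_source, p_channel) -> "actress"
```

With these made-up numbers the channel model overrides the language model: "across" is the more probable word, but "actress" is the far more probable corruption source for "acress".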
44. Probabilistic vs. vector space model
Both measure orthographic & contextual “fit” of the
candidate given the typo and its usage context
Noisy channel:
P( cand | typo, context ) ∝ P( typo | cand ) · P( cand | context ), scored in log space as log P( typo | cand ) + log P( cand | context )
IR approach:
simCombined(d_j, d_k) = λ · simOrth(d_j, d_k) + (1 − λ) · simContext(d_j, d_k)
Both can benefit from “big” data (i.e. bigger samples)
Better estimates of probabilities and population frequencies
Usual probabilistic vs. non-probabilistic tradeoffs
Principled theory and methodology for modeling and estimation
How to extend the feature space to include additional information?
• Typing haptics (key proximity)? Cognitive errors (e.g. homonyms)?
46. Postings Encoding
Conceptually:
fish 1 2 9 1 21 3 34 1 35 2 80 3 …
In Practice:
• Instead of document IDs, encode deltas (or d-gaps)
• But it’s not obvious that this saves space…
fish 1 2 8 1 12 3 13 1 1 2 45 3 …
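The d-gap idea above can be sketched as follows (illustrative names; tf values are handled separately):

```python
# Encode a sorted docID list as the first ID followed by gaps between IDs.
def to_dgaps(docids):
    return [docids[0]] + [b - a for a, b in zip(docids, docids[1:])]

# Decode by accumulating a running total over the gaps.
def from_dgaps(gaps):
    out, total = [], 0
    for g in gaps:
        total += g
        out.append(total)
    return out

# to_dgaps([1, 9, 21, 34, 35, 80]) -> [1, 8, 12, 13, 1, 45]
```

The gaps are smaller numbers than the raw docIDs, which is what the variable-length codes on the next slides exploit; with fixed-width integers, nothing is saved.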
47. Overview of Index Compression
Byte-aligned vs. bit-aligned
Non-parameterized bit-aligned
Unary codes
γ (gamma) codes
δ (delta) codes
Parameterized bit-aligned
Golomb codes
Want more detail? Read Managing Gigabytes by Witten, Moffat, and Bell!
48. But First... General Data Compression
Run Length Encoding
7 7 7 8 8 9 = (7, 3), (8,2), (9,1)
Binary Equivalent
0 0 0 0 0 0 1 0 0 0 1 1 0 0 0 = 6, 1, 3, 2, 3
Good with sparse binary data
Huffman Coding
Optimal when data is distributed by negative powers of two
e.g. P(a)= ½, P(b) = ¼, P(c)=1/8, P(d)=1/8
• a = 0, b = 10, c= 110, d=111
Prefix codes: no codeword is the prefix of another codeword
• If we read 0, we know it’s an “a”; the following bits are a new codeword
• Similarly 10 is a b (no other codeword starts with 10), etc.
• Prefix is 1* (i.e. path to internal nodes is all 1s, output on leaves)
49. Unary Codes
Encode number as a run of 1s, specifically…
x ≥ 1 coded as x−1 1s, followed by a zero-bit terminator
1=0
2 = 10
3 = 110
4 = 1110
...
Great for small numbers… horrible for large numbers
Overly-biased for very small gaps
50. γ codes
x ≥ 1 is coded in two parts: unary length : offset
Start with x in binary; removing the highest-order bit gives the offset
Length is the number of binary digits, encoded in unary
Concatenate length + offset codes
Example: 9 in binary is 1001
Offset = 001
Length = 4, in unary code = 1110
γ code = 1110:001
Another example: 7 (111 in binary)
• offset = 11, length = 3 (110 in unary) → γ code = 110:11
Analysis
Offset = ⌊log x⌋ bits
Length = ⌊log x⌋ + 1 bits
Total = 2⌊log x⌋ + 1 bits
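The γ code described on this slide can be sketched directly (illustrative names; this follows the slide's unary convention of x−1 ones plus a terminating zero):

```python
# Unary code: x-1 ones followed by a zero terminator, e.g. 4 -> "1110".
def unary(x):
    return "1" * (x - 1) + "0"

# Gamma code: unary length of the binary representation, then the offset
# (binary representation with its highest-order bit removed).
def gamma(x):
    assert x >= 1
    binary = bin(x)[2:]                  # e.g. 9 -> "1001"
    offset = binary[1:]                  # drop highest-order bit -> "001"
    return unary(len(binary)) + offset   # length 4 in unary = "1110"

# gamma(9) -> "1110001" (1110:001 as on the slide); gamma(7) -> "11011"
```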
51. δ codes
As with γ codes, two parts: unary length & offset
Offset is same as before
Length is encoded by its γ code
Example: 9 (=1001 in binary)
Offset = 001
Length = 4 (100 in binary): offset = 00, length 3 = 110 in unary
• γ code of 4 = 110:00
δ code = 110:00:001
Comparison
γ codes better for smaller numbers
δ codes better for larger numbers
52. Golomb Codes
x ≥ 1, parameter b
x encoded in two parts
Part 1: q = ⌊( x − 1 ) / b⌋, code q + 1 in unary
Part 2: remainder r < b, r = x − qb − 1, coded in truncated binary
Truncated binary defines a prefix code
if b is a power of 2
• easy case: truncated binary = regular binary
else
• First 2^(⌊log b⌋ + 1) − b values encoded in ⌊log b⌋ bits
• Remaining values encoded in ⌊log b⌋ + 1 bits
Let’s see some examples
53. Golomb Code Examples
b = 3, r = [0:2]
First 2^(⌊log 3⌋ + 1) − 3 = 2^2 − 3 = 1 value, in ⌊log 3⌋ = 1 bit
First 1 value in 1 bit: 0
Remaining 3 − 1 = 2 values in 1+1 = 2 bits with prefix 1: 10, 11
b = 5, r = [0:4]
First 2^(⌊log 5⌋ + 1) − 5 = 2^3 − 5 = 3 values, in ⌊log 5⌋ = 2 bits
First 3 values in 2 bits: 00, 01, 10
Remaining 5 − 3 = 2 values in 2+1 = 3 bits with prefix 11: 110, 111
• Two prefix bits needed since single leading 1 already used in “10”
b = 6, r = [0:5]
First 2^(⌊log 6⌋ + 1) − 6 = 2^3 − 6 = 2 values, in ⌊log 6⌋ = 2 bits
First 2 values in 2 bits: 00, 01
Remaining 6 − 2 = 4 values in 2+1 = 3 bits with prefix 1: 100, 101, 110, 111
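The Golomb scheme from the previous slide, with the truncated-binary remainders worked out above, can be sketched as (illustrative names, a sketch rather than a production coder):

```python
import math

# Truncated binary: with b possible remainders, the first 2^(k+1)-b values
# (k = floor(log2 b)) get k bits; the rest get k+1 bits, shifted to keep
# the whole thing a prefix code.
def truncated_binary(r, b):
    k = math.floor(math.log2(b))
    short = (1 << (k + 1)) - b        # how many values get the short form
    if r < short:
        return format(r, "b").zfill(k) if k else ""
    return format(r + short, "b").zfill(k + 1)

# Golomb code for x >= 1 with parameter b: quotient q+1 in unary
# (q ones then a zero), then remainder r in truncated binary.
def golomb(x, b):
    assert x >= 1
    q = (x - 1) // b
    r = x - q * b - 1
    return "1" * q + "0" + truncated_binary(r, b)

# b=5: remainders 0,1,2 -> 00,01,10; remainders 3,4 -> 110,111 (as above)
```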
55. Index Compression: Performance
Comparison of Index Size (bits per pointer)
        Bible   TREC
Unary   262     1918
Binary  15      20
γ       6.51    6.63
δ       6.23    6.38
Golomb  6.09    5.84
Use Golomb for d-gaps, γ codes for term frequencies
Optimal b ≈ 0.69 (N/df): different b for every term!
Bible: King James version of the Bible; 31,101 verses (4.3 MB)
TREC: TREC disks 1+2; 741,856 docs (2070 MB)
Witten, Moffat, Bell, Managing Gigabytes (1999)
56. Where are we without compression?
Without secondary sorting, the reducer receives postings for key “fish” in arbitrary order:
(1, [2,4]), (34, [23]), (21, [1,8,22]), (35, [8,41]), (80, [2,9,76]), (9, [9])
With composite keys, the reducer receives them sorted:
(fish 1, [2,4]), (fish 9, [9]), (fish 21, [1,8,22]), (fish 34, [23]), (fish 35, [8,41]), (fish 80, [2,9,76])
How is this different?
• Let the framework do the sorting
• Directly write postings to disk
• Term frequency implicitly stored
57. Index Compression in MapReduce
Need df to compress posting for each term
How do we compute df?
Count the # of postings in reduce(), then compress
Problem?
58. Order Inversion Pattern
In the mapper:
Emit “special” key-value pairs to keep track of df
In the reducer:
Make sure “special” key-value pairs come first: process them to
determine df
Remember: proper partitioning!
59. Getting the df: Modified Mapper
Doc 1
one fish, two fish Input document…
(key) (value)
fish 1 [2,4] Emit normal key-value pairs…
one 1 [1]
two 1 [3]
fish [1] Emit “special” key-value pairs to keep track of df…
one [1]
two [1]
60. Getting the df: Modified Reducer
(key) (value)
First, compute the df by summing contributions from all “special” key-value pairs:
fish * → [63] [82] [27] …
Then compress postings incrementally as they arrive:
fish 1 → [2,4]
fish 9 → [9]
fish 21 → [1,8,22]
fish 34 → [23]
fish 35 → [8,41]
fish 80 → [2,9,76]
…
Important: properly define sort order to make sure “special” key-value pairs come first!
Write postings directly to disk
Where have we seen this before?
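The order-inversion pattern above can be simulated locally; here a sentinel docID that sorts before every real docID plays the role of the "special" key. This is an in-memory illustration, not Hadoop code, and the input pairs are made up.

```python
SPECIAL = -1   # sentinel "docID" that sorts before every real docID

def reduce_with_order_inversion(pairs):
    """pairs: list of ((term, docid_or_SPECIAL), payload) as the shuffle
    would deliver them. Sorting puts special pairs first, so df is complete
    before the first real posting for that term is seen."""
    df = {}
    postings = {}
    for (term, doc), payload in sorted(pairs):
        if doc == SPECIAL:
            df[term] = df.get(term, 0) + payload          # sum partial df counts
        else:
            postings.setdefault(term, []).append((doc, payload))
    return df, postings

pairs = [(("fish", 1), [2, 4]), (("fish", SPECIAL), 2), (("fish", 9), [9]),
         (("fish", SPECIAL), 1)]
df, postings = reduce_with_order_inversion(pairs)
# df["fish"] is known (3) before any "fish" posting is processed, so a
# df-dependent Golomb parameter could be fixed before compressing.
```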
62. Exercise: where have all the ngrams gone?
For each observed (word) trigram in collection,
output its observed (docID, wordIndex) locations
Input
Doc 1 Doc 2 Doc 3
one fish two fish one fish two salmon two fish two fish
Output:
one fish two → [(1,1),(2,1)]
fish two fish → [(1,2),(3,2)]
fish two salmon → [(2,2)]
two fish two → [(3,1)]
Possible Tools:
* pairs/stripes?
* combining?
* secondary sorting?
* order inversion?
* side effects?
63. Exercise: shingling
Given observed (docID, wordIndex) ngram locations
For each document, for each of its ngrams (in order),
give a list of the ngram locations for that ngram
Input:
one fish two → [(1,1),(2,1)]
fish two fish → [(1,2),(3,2)]
fish two salmon → [(2,2)]
two fish two → [(3,1)]
Output:
Doc 1 → [ [(1,1),(2,1)], [(1,2),(3,2)] ]
Doc 2 → [ [(1,1),(2,1)], [(2,2)] ]
Doc 3 → [ [(3,1)], [(1,2),(3,2)] ]
Possible Tools:
* pairs/stripes?
* combining?
* secondary sorting?
* order inversion?
* side effects?
64. Exercise: shingling (2)
How can we recognize when longer ngrams are
aligned across documents?
Example
doc 1: a b c d e
doc 2: a b c d f
doc 3: e b c d f
doc 4: a b c d e
Find “a b c d” in docs 1, 2, and 4;
“b c d f” in 2 and 3;
“a b c d e” in 1 and 4
65. class Alignment
int index // start position in this document
int length // sequence length in ngrams
int otherID // ID of other document
int otherIndex // start position in other document
typedef Pair<int docID, int position> Ngram;
class NgramExtender
Set<Alignment> alignments = empty set
index = 0;
NgramExtender(int docID) { _docID = docID }
close() { foreach Alignment a, emit(_docID, a) }
AlignNgrams(List<Ngram> ngrams) // call this function iteratively in order of ngrams observed in this document
...
@inproceedings{Kolak:2008,
author = {Kolak, Okan and Schilit, Bill N.},
title = {Generating links by mining quotations},
booktitle = {19th ACM conference on Hypertext and
hypermedia},
year = {2008},
pages = {117--126}
}
66. class Alignment
int index // start position in this document
int length // sequence length in ngrams
int otherID // ID of other document
int otherIndex // start position in other document
typedef Pair<int docID, int position> Ngram;
class NgramExtender
Set<Alignment> alignments = empty set
index = 0;
NgramExtender(int docID) { _docID = docID }
close() { foreach Alignment a, emit(_docID, a) }
AlignNgrams(List<Ngram> ngrams) // call this function iteratively in order of ngrams observed in this document
++index;
foreach Alignment a in alignments
Ngram next = new Ngram(a.otherID, a.otherIndex + a.length)
if (ngrams.contains(next)) // extend alignment
a.length += 1; ngrams.remove(next)
else // terminate alignment
emit(_docID, a); alignments.remove(a)
foreach ngram in ngrams
alignments.add( new Alignment( index, 1, ngram.docID, ngram.position ) )
68. Building more complex MR algorithms
Monolithic single Map + single Reduce
What we’ve done so far
Fitting all computation to this model can be difficult and ugly
We generally strive for modularization when possible
What else can we do?
Pipeline: [Map → Reduce] [Map → Reduce] … (multiple sequential jobs)
Chaining: [Map+ → Reduce → Map*]
• 1 or more Mappers
• 1 reducer
• 0 or more Mappers
Pipelined Chain: [Map+ → Reduce → Map*] [Map+ → Reduce → Map*] …
Express arbitrary dependencies between jobs
69. Modularization and WordCount
General benefits of modularization
Re-use for easier/faster development
Consistent behavior across applications
Easier/faster to maintain/extend for benefit of many applications
Even basic word count can be broken down
Pre-processing
• How will we tokenize? Perform stemming? Remove stopwords?
Main computation: count tokens and group counts by word
Post-processing
• Transform the values? (e.g. log-damping)
Let’s separate tokenization into its own module
Many other tasks can likely benefit
First approach: pipeline…
71. Pipeline WordCount in Hadoop
Two distinct jobs: tokenize and count
Data sharing between jobs via persistent output
Can use combiners and partitioners as usual (won’t bother here)
Let’s use SequenceFileOutputFormat rather than TextOutputFormat
sequence of binary key-value pairs; faster / smaller
tokenization output will stick around unless we delete it
Tokenize job
Just a mapper, no reducer: conf.setNumReduceTasks(0) or IdentityReducer
Output goes to directory we specify
Files will be read back in by the counting job
Output is array of tokens
We need to make a suitable Writable for String arrays
Count job
Input types defined by the input SequenceFile (don’t need to be specified)
Mapper is trivial
observes tokens from incoming data
Key: (docid) & Value: (Array of Strings, encoded as a Writable)
72. Pipeline WordCount (old Hadoop API)
Configuration conf = new Configuration();
String tmpDir1to2 = "/tmp/intermediate1to2";
// Tokenize job
JobConf tokenizationJob = new JobConf(conf);
tokenizationJob.setJarByClass(PipelineWordCount.class);
FileInputFormat.setInputPaths(tokenizationJob, new Path(inputPath));
FileOutputFormat.setOutputPath(tokenizationJob, new Path(tmpDir1to2));
tokenizationJob.setOutputFormat(SequenceFileOutputFormat.class);
tokenizationJob.setMapperClass(AggressiveTokenizerMapper.class);
tokenizationJob.setOutputKeyClass(LongWritable.class);
tokenizationJob.setOutputValueClass(TextArrayWritable.class);
tokenizationJob.setNumReduceTasks(0);
// Count job
JobConf countingJob = new JobConf(conf);
countingJob.setJarByClass(PipelineWordCount.class);
countingJob.setInputFormat(SequenceFileInputFormat.class);
FileInputFormat.setInputPaths(countingJob, new Path(tmpDir1to2));
FileOutputFormat.setOutputPath(countingJob, new Path(outputPath));
countingJob.setMapperClass(TrivialWordObserver.class);
countingJob.setReducerClass(MapRedIntSumReducer.class);
countingJob.setOutputKeyClass(Text.class);
countingJob.setOutputValueClass(IntWritable.class);
countingJob.setNumReduceTasks(reduceTasks);
JobClient.runJob(tokenizationJob);
JobClient.runJob(countingJob);
73. Pipeline jobs in Hadoop
Old API
JobClient.runJob(..) does not return until job finishes
New API
Use Job rather than JobConf
Use job.waitForCompletion instead of JobClient.runJob
Why Old API?
In 0.20.2, chaining only possible under old API
We want to re-use the same components for chaining (next…)
74. Chaining in Hadoop
[Figure: two side-by-side chained jobs, each flowing Mapper 1 → Intermediates → Mapper 2 → Reducer → Mapper 3 → Persistent Output]
Map+ Reduce Map*
1 or more Mappers
• Can use IdentityMapper
1 reducer
• No reducers: conf.setNumReduceTasks(0)?
0 or more Mappers
Usual combiners and partitioners
By default, data passed between Mappers by usual writing of intermediate data to disk
Can always use side-effects…
There is a better, built-in way to bypass this and pass (Key,Value) pairs by reference instead
• Requires different Mapper semantics!
75. Hadoop: ChainMapper & ChainReducer
Uses JobConf objects (deprecated in Hadoop 0.20.2)
No undeprecated replacement exists in 0.20.2…
Examples here work for later versions with small changes
Configuration conf = new Configuration();
JobConf job = new JobConf(conf);
...
boolean passByRef = false; // pass output (Key,Value) pairs to next Mapper by reference?
JobConf map1Conf = new JobConf(false);
ChainMapper.addMapper(job, Map1.class, Map1InputKey.class, Map1InputValue.class,
Map1OutputKey.class, Map1OutputValue.class, passByRef, map1Conf);
JobConf map2Conf = new JobConf(false);
ChainMapper.addMapper(job, Map2.class, Map1OutputKey.class, Map1OutputValue.class,
Map2OutputKey.class, Map2OutputValue.class, passByRef, map2Conf);
JobConf reduceConf = new JobConf(false);
ChainReducer.setReducer(job, Reducer.class, Map2OutputKey.class, Map2OutputValue.class,
ReducerOutputKey.class, ReducerOutputValue.class, passByRef, reduceConf);
JobConf map3Conf = new JobConf(false);
ChainReducer.addMapper(job, Map3.class, ReducerOutputKey.class, ReducerOutputValue.class,
Map3OutputKey.class, Map3OutputValue.class, passByRef, map3Conf);
JobClient.runJob(job);
76. Chaining in Hadoop
Let’s continue our running example:
Mapper 1: Tokenize
Mapper 2: Observe (count) words
Reducer: same IntSum reducer as always
Mapper 3 Log-dampen counts
• We didn’t have this in our pipeline example but we’ll add here…
77. Chained Tokenizer + WordCount
// Set up configuration and intermediate directory location
Configuration conf = new Configuration();
JobConf chainJob = new JobConf(conf);
chainJob.setJobName("Chain job");
chainJob.setJarByClass(ChainWordCount.class); // single jar for all Mappers and Reducers…
chainJob.setNumReduceTasks(reduceTasks);
FileInputFormat.setInputPaths(chainJob, new Path(inputPath));
FileOutputFormat.setOutputPath(chainJob, new Path(outputPath));
// pass output (Key,Value) pairs to next Mapper by reference?
boolean passByRef = false;
JobConf map1 = new JobConf(false); // tokenization
ChainMapper.addMapper(chainJob, AggressiveTokenizerMapper.class,
LongWritable.class, Text.class,
LongWritable.class, TextArrayWritable.class, passByRef, map1);
JobConf map2 = new JobConf(false); // Add token observer job
ChainMapper.addMapper(chainJob, TrivialWordObserver.class,
LongWritable.class, TextArrayWritable.class,
Text.class, LongWritable.class, passByRef, map2);
JobConf reduce = new JobConf(false); // Set the int sum reducer
ChainReducer.setReducer(chainJob, LongSumReducer.class, Text.class, LongWritable.class,
Text.class, LongWritable.class, passByRef, reduce);
JobConf map3 = new JobConf(false); // log-scaling of counts
ChainReducer.addMapper(chainJob, ComputeLogMapper.class, Text.class, LongWritable.class,
Text.class, FloatWritable.class, passByRef, map3);
JobClient.runJob(chainJob);
78. Hadoop Chaining: Pass by Reference
Chaining allows possible optimization
Chained mappers run in same JVM thread, so opportunity to avoid
serialization to/from disk with pipelined jobs
Also lesser benefit of avoiding extra object destruction / construction
Gotchas
OutputCollector.collect(K k, V v) promises not to alter the content of k and v
But if Map1 passes (k, v) by reference to Map2 via collect(),
Map2 may alter (k, v) and thereby violate the contract
What to do?
Option 1: Honor the contract – don’t alter input (k,v) in Map2
Option 2: Re-negotiate terms – don’t re-use (k,v) in Map1 after collect()
Document carefully to avoid later changes silently breaking this…
79. Setting Dependencies Between Jobs
JobControl and Job provide the mechanism
// create jobconf1 and jobconf2 as appropriate
// …
Job job1 = new Job(jobconf1);
Job job2 = new Job(jobconf2);
job2.addDependingJob(job1);
JobControl jbcntrl = new JobControl("jbcntrl");
jbcntrl.addJob(job1);
jbcntrl.addJob(job2);
jbcntrl.run();
New API: no JobConf, create Job from Configuration, …
80. Higher Level Abstractions
Pig: language and execution environment for expressing
MapReduce data flows. (pretty much the standard)
See White, Chapter 11
Cascading: another environment with a higher level of
abstraction for composing complex data flows
See White, Chapter 16, pp 539-552
Cascalog: query language based on Cascading that uses
Clojure (a JVM-based LISP variant)
Word count in Cascalog
Certainly more concise – though you need to grok the syntax.
(?<- (stdout) [?word ?count] (sentence ?s) (split ?s :> ?word) (c/count ?count))