5. Requirements
• 50 billion URLs to crawl.
• Problem of revisiting a URL.
• We have to keep track of the URLs we have visited before.
• Avoid disk access for performance.
• Trade-off between memory and speed.
6. Requirements
• Average size of a URL:
• Internet Explorer supports at most 2083 characters.
• Plenty of URL shorteners around.
• We will assume average size of 80 characters (80 bytes).
◦ For 50 billion URLs: $80 \times 50 \times 10^{9} = 4 \times 10^{12}$ bytes = 4 TB
◦ A lookup table of size 4 TB indexed with URLs.
◦ Too big to fit into the main memory.
◦ Disk has enough capacity, but disk I/O is very slow.
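The size estimate above can be checked with a few lines of Python. This is only a sketch of the slide's arithmetic; the names AVG_URL_BYTES and NUM_URLS are introduced here for illustration, and 1 TB is taken as 10^12 bytes.

```python
# Assumption from the slide: an average URL is 80 characters, 1 byte each.
AVG_URL_BYTES = 80
# 50 billion URLs to crawl.
NUM_URLS = 50 * 10**9

# Raw size of the URL keys alone, ignoring any hash table overhead.
table_bytes = AVG_URL_BYTES * NUM_URLS
table_tb = table_bytes / 10**12  # decimal terabytes

print(f"{table_bytes} bytes = {table_tb} TB")
```

A real lookup table would need more than this, since the estimate counts only the keys and no pointers, buckets, or load factor slack.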
7. Bloom Filters
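• Place a Bloom filter on the stream of URLs.
• The Bloom filter decides whether a URL has been visited before.
• It will use much less memory.
• Queries run in O(1) time for a fixed number of hash functions.
• It might occasionally lie: it can report an unvisited URL as visited (a false positive).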
8. How?
• Start with a bit array initialized with 0s.
• N = 18 bits (the quantity called m in the false positive slides).
• Predefined length.
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
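A minimal sketch of how such a bit array could be used, assuming k = 3 hash positions per URL. The class name BloomFilter, the choice of k, and the double-hashing scheme built on hashlib.sha256 are assumptions for illustration and are not taken from the slides. SHA-256 is used here only because it is deterministic and in the standard library; a production crawler would more likely use a fast non-cryptographic hash.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter: m bits, k positions per item (illustrative only)."""

    def __init__(self, m: int, k: int) -> None:
        self.m = m
        self.k = k
        self.bits = [0] * m  # bit array initialized with 0s

    def _positions(self, item: str) -> list[int]:
        # Derive k positions from two base hashes (double hashing).
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force an odd step
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item: str) -> None:
        # Insertion sets every derived position to 1.
        for pos in self._positions(item):
            self.bits[pos] = 1

    def __contains__(self, item: str) -> bool:
        # "Possibly in set" only if all k positions are 1;
        # a single 0 means "definitely not in set".
        return all(self.bits[pos] for pos in self._positions(item))


seen = BloomFilter(m=18, k=3)
seen.add("http://example.com/a")
print(seen.bits)
print("http://example.com/a" in seen)
```

An inserted URL is always reported as present, so false negatives cannot occur. With only 18 bits, false positives would become frequent after a handful of insertions.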
11. Rate of False Positives
• m: total number of bits.
• k: number of hash functions.
• n: number of insertions.
• Probability that a bit is not set to 1 by a certain hash function: $1 - \frac{1}{m}$
• For k hash functions: $\left(1 - \frac{1}{m}\right)^{k}$
• For n elements: $\left(1 - \frac{1}{m}\right)^{kn}$
12. Rate of False Positives
• m: total number of bits.
• k: number of hash functions.
• n: number of insertions.
• Probability that the bit is 1: $1 - \left(1 - \frac{1}{m}\right)^{kn}$
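• A false positive occurs when all k bits of an item that is not in the set are 1, which happens with probability approximately $\left(1 - \left(1 - \frac{1}{m}\right)^{kn}\right)^{k}$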
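The derivation on these two slides can be turned into a short calculation. The sketch below raises the probability that a single bit is 1 to the k-th power, which treats the k bits as independent (the usual approximation). The function names are introduced here for illustration, and the second function uses the common approximation $(1 - 1/m)^{kn} \approx e^{-kn/m}$.

```python
import math


def false_positive_rate(m: int, k: int, n: int) -> float:
    """Chance that all k bits of an unseen item are already 1."""
    p_bit_zero = (1 - 1 / m) ** (k * n)  # bit untouched after n insertions
    return (1 - p_bit_zero) ** k


def approx_false_positive_rate(m: int, k: int, n: int) -> float:
    """Same quantity using the exponential approximation."""
    return (1 - math.exp(-k * n / m)) ** k


# Example values chosen for illustration: 1000 bits, 3 hashes, 100 insertions.
print(false_positive_rate(1000, 3, 100))
print(approx_false_positive_rate(1000, 3, 100))
```

For a fixed m and k the rate grows with every insertion, which is why the filter has to be sized for the expected number of URLs up front.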
14. Crawler with Bloom Filter
LOOKUP TABLE / HASH TABLE
• 4 TB required space.
• Mandatory disk access.
BLOOM FILTER
• 139.48 GB (0.00001 fpp), k = 17
• 111.58 GB (0.0001 fpp), k = 13
• No disk access required.
• O(k) insert time.
• O(k) query time.
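The Bloom filter sizes on this slide can be reproduced with the standard sizing formulas $m = -\frac{n \ln p}{(\ln 2)^2}$ and $k = \frac{m}{n} \ln 2$, where p is the target false positive probability. The sketch below assumes n = 50 billion URLs and reads GB as 2^30 bytes, which is the convention under which the slide's numbers come out. The function name bloom_parameters is introduced here for illustration. Under these formulas, 139.48 GB with k = 17 corresponds to a false positive probability of 0.00001.

```python
import math


def bloom_parameters(n: int, fpp: float) -> tuple[int, int]:
    """Return (bits m, hash count k) for n items and target rate fpp."""
    m = math.ceil(-n * math.log(fpp) / (math.log(2) ** 2))
    k = round(m / n * math.log(2))
    return m, k


NUM_URLS = 50 * 10**9  # 50 billion URLs

for fpp in (0.0001, 0.00001):
    m, k = bloom_parameters(NUM_URLS, fpp)
    size_gib = m / 8 / 2**30  # bits -> bytes -> units of 2^30 bytes
    print(f"fpp={fpp}: {size_gib:.2f} GB, k={k}")
```

Dropping the false positive rate by a factor of ten costs roughly 28 GB more memory here, which is the memory versus accuracy trade-off in concrete numbers.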
18. Big Data Analytics for Time-Critical Mobility Applications
IoT Applications
Small problems will turn into big problems very fast.
Editor's Notes
SEO – Robots.txt
Only false positives are possible; false negatives are not.
"possibly in set" or "definitely not in set"
Hash collisions – don't use crypto hashes
Too many hash functions -> Too many 1s
Too few -> Just 1 collision is enough for a false positive
One Hit Wonders
Weather forecasts – Interpreting sensor data – Privacy preserving mobility mining – 3V of big data
Bigtable – HBase -> Check if a row exists before doing disk access
Medium -> Recommendation
Chrome -> Malicious URLs