Duplicate and Near Duplicate
Detection at Scale
Tim Allison, Ph.D.
Data Scientist/Relevance Engineer
Artificial Intelligence, Analytics and Innovative
Development Organization
© 2020 California Institute of Technology. Government sponsorship
acknowledged.
Reference herein to any specific commercial product, process,
or service by trade name, trademark, manufacturer, or
otherwise, does not constitute or imply its endorsement by the
United States Government or the Jet Propulsion Laboratory,
California Institute of Technology.
jpl.nasa.gov
About me
• Data scientist (files and search), Jet Propulsion Laboratory, California Institute of Technology
• Chair/V.P., Apache Tika
• Committer: Apache PDFBox, POI, Lucene/Solr, OpenNLP
• Member, Apache Software Foundation
10/22/20
Outline
• Search system assessments, an overview of options
• Plug for text extraction assessment
• Duplicates and near duplicates – Case Study
• Exploration: Near duplicates with minhash
• Conclusion
Search System Assessment: 20,000 ft view
• Offline
  • Ground truth queries and expected docs
• Online
  • User behavior
  • User feedback
• Surveys, interviews
• Technical review
  • System
  • Data
System Assessment
Udo Kruschwitz and Charlie Hull, “Searching the Enterprise”, Foundations and Trends® in Information Retrieval, 11(1):1-142, July 2017, p. 16.
System Assessment
• Crawler configurations
• Text extraction configurations
• Schema and field configuration
• Query Parser configuration
  • Default Boolean operator
  • Fields, field boosts
  • …
Data Assessment
• File types, parser coverage
• Quality of text extraction (…languages)
• Quality of metadata – dates, duplicate titles/metadata
• Liveness of documents/URLs/URL redirects
• Duplicates and near duplicates
Plug for Text Extraction Assessment
Out of vocabulary (OOV) – Same file, different extractors
                     Tika 1.14            Tika 1.15-SNAPSHOT
Unique Tokens        786                  156
Total Tokens         1603                 272
LangId               zh-ch                de
Common Words         0                    116
Alphabetic Tokens    1603                 250
OOV%                 1-(0/1603) = 100%    1-(116/250) = 54%
Top N Tokens (Tika 1.14): ten tokens, nine of them mis-decoded CJK character pairs (not recoverable in this transcript) with counts 18, 14, 14, 11, 11, 11, 10, 9, 9, plus m: 11
Top N Tokens (Tika 1.15-SNAPSHOT): die: 11 | und: 8 | von: 8 | deutschen: 7 | deutsche: 6 | 1: 5 | das: 5 | der: 5 | finanzministerium: 5 | oder: 5
Fixed encoding detection between 1.14 and 1.15
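The OOV percentages above follow directly from the token counts. A minimal sketch of that arithmetic, assuming OOV% is one minus the share of alphabetic tokens found in a common-words list (the function name is illustrative and is not tika-eval's API):

```python
def oov_ratio(common_words: int, alphabetic_tokens: int) -> float:
    """Return the out-of-vocabulary share: 1 - common / alphabetic."""
    if alphabetic_tokens == 0:
        # Assumption: a document with no alphabetic tokens counts as fully OOV.
        return 1.0
    return 1.0 - common_words / alphabetic_tokens


# Counts taken from the two extractor runs compared on the slide.
print(f"Tika 1.14: {oov_ratio(0, 1603):.0%}")
print(f"Tika 1.15-SNAPSHOT: {oov_ratio(116, 250):.0%}")
```

A high OOV share alongside an unexpected language id is the signal that the extracted text is probably garbage.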
Quality of text extraction, an example
https://voyager.jpl.nasa.gov/pdf/sfos2003pdf/03_10_02-03_10_19.sfos.pdf
Language Id: Nepali (Out of Vocabulary 99%)
Unexplained Garbage at Beginning of File(???)
This unexplained garbage at the beginning of a file also occurs in several other PDF files identified as Nepali.
From analytics to action
https://aviris.jpl.nasa.gov/proceedings/workshops/02_docs/2002_Ogura_1_web.pdf
Stored text vs. Optical Character Recognition
Text As Stored in File
!"#%$& (') *,+-).' / 0 1,23 *. 457698;:;<>=75?&@78;ACB
D(B7E;FHGJICBK5MLNBKOPBKF;B DJD Q R S.TVU9WNXMY[ZT]^W_S `badc
5KICedFgfh5 cji :;edF;A^5KEk<>Imln:;e[<>EnloedACICe a
lo<p57Eg5Kqsr;E;<jloe[E 8;O 6hedA5Kq adc 57ItedFk:;B c qsICf;B a
Text from Tesseract OCR
Constrained Least Squares Linear Spectral Unmixture by the Hybrid Steepest
Descent Method
Nobuhiko Ogura’ and Isao Yamada”
1 Introduction
A closed polyhedron is the intersection of finite number of closed half
spaces, i.e., the setof points satisfying finite number of lincar
incqualitics, and is widely used as a constraint in various application, for
example specifications or constraints in signal processing or estimation
problems, resource restrictions in financial applications and feasible sets of
Duplicates and Near Duplicates
Experimental Setup
• Development “web_index” (~12.5 million documents)
  • Slightly out of date compared with production, but close enough
  • Covers internal web, but not other “document-heavy” indices
• Safer to avoid heavy computation on production cluster
• Small enough to reindex with different field settings on dev cluster
• Use existing tools/metrics – no contrib modules/hand-coded algorithms
Duplicates!
How big of a problem are duplicates?
First “lesson learned” in Oleksiy Kovyrin’s recent “Sprinting to a crawl: Building an effective web crawler” at ElasticON Global 2020
Google has several patents for (near)duplicate detection
https://patents.google.com/?q=%22duplicate+documents%22&assignee=Google%2c+Llc&num=100&oq=assignee:(Google%2c+Llc)+%22duplicate+documents%22&sort=new
Google’s Guidance for Duplicates and Search Engine Optimization (SEO)
https://support.google.com/webmasters/answer/66359?hl=en
File Types – Top 10 file types in web_index
File Type             Count
text/html             8,894,038
image/gif             1,870,136
image/jpeg            1,094,937
image/png             319,710
text/plain            109,516
application/pdf       105,081
application/x-hdf     64,194
image/x-ms-bmp        26,377
application/xml       8,734
application/msword    7,414
Duplicates, near duplicates
• Digests
  • Literal bytes of a file are the same
• Text Digests
  • Extracted text from a document is the same
• Text Profile Digest (see next slide)
  • Require all words
  • Drop the rarer words in a document (default)
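The difference between a digest and a text digest can be sketched as follows. This is an illustration only: the choice of SHA-256, the whitespace normalization, and the regex-based `strip_tags` stand-in for a real text extractor (such as Tika) are all assumptions, not the setup used in the talk.

```python
import hashlib
import re


def digest(raw_bytes: bytes) -> str:
    """Digest over the literal bytes of a file."""
    return hashlib.sha256(raw_bytes).hexdigest()


def text_digest(extracted_text: str) -> str:
    """Digest over extracted text, after collapsing runs of whitespace."""
    normalized = " ".join(extracted_text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def strip_tags(raw: bytes) -> str:
    """Crude stand-in for text extraction: replace tags with spaces."""
    return re.sub(r"<[^>]+>", " ", raw.decode("utf-8"))


# Two files with different bytes but the same visible text.
html_a = b"<html><body><p>Mars Odyssey</p></body></html>"
html_b = b"<html><body>\n  <p>Mars Odyssey</p>\n</body></html>"

print(digest(html_a) == digest(html_b))
print(text_digest(strip_tags(html_a)) == text_digest(strip_tags(html_b)))
```

The byte digests differ while the text digests match, which is the case a plain file digest cannot catch.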
Nutch’s TextProfile
Data Search | JPL's Earth Science Airborne Program Jump to navigation
Earth Science Airborne Program JPL's Suborbital Earth Science
Instruments & Measurements Home › All Products › Instrument: Fourier
Transform Infrared Spectrometer (FTS) › Product Type: FTS_L2QR ›
Platform: C-23 Sherpa › Parameter: Atmospheric Chemistry › Platform
Type: Airborne › Campaign: Carbon in Arctic Reservoirs Vulnerability
Experiment (CARVE) Data Search Show Advanced search Temporal
Search Start Date Stop Date Free Text Search Enter search text Spatial
Search (Hold Shift to draw bounding box) + - Perform Search Sort By
Popularity (All Time) Popularity (This Month) Popularity (Users) Long
Name (A-Z) Short Name (A-Z) Grid Spatial Resolution Satellite Spatial
Resolution Start Date Stop Date Found 0 matching products(s). Browse
Products Campaign Any campaign Carbon in Arctic Reservoirs
Vulnerability Experiment (CARVE) (261) Parameter Any parameter
Atmospheric Chemistry (261) Instrument Any instrument Fourier
Transform Infrared Spectrometer (FTS) (261) Platform Any platform C-23
Sherpa (261) Platform Type Any platform type Airborne (261) Product
Type Any product type FTS_L2QR (261)
Term        Quantized Count
search      8
261         6
any         6
platform    6
type        6
airborne    4
date        4
Text Profile: “search 261 any platform type airborne date…”
Quantize counts, sort by descending order of frequency, drop quantized counts below a threshold
https://airbornescience.jpl.nasa.gov/data
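The three steps named above (quantize, sort, drop) can be sketched in a few lines. This is a simplified illustration and not Nutch's actual implementation: the tokenizer, the quantization step `quant`, the threshold `min_count`, and the SHA-256 hash are all assumptions chosen for readability.

```python
import hashlib
import re
from collections import Counter


def text_profile(text: str, quant: int = 2, min_count: int = 2) -> str:
    """Build a profile string from the frequent terms of a document."""
    tokens = re.findall(r"[a-z0-9_]+", text.lower())
    counts = Counter(tokens)
    # Quantize: round each count down to a multiple of `quant`.
    quantized = {term: (count // quant) * quant for term, count in counts.items()}
    # Drop terms whose quantized count falls below the threshold.
    kept = [(term, count) for term, count in quantized.items() if count >= min_count]
    # Sort by descending quantized count, then alphabetically for stability.
    kept.sort(key=lambda item: (-item[1], item[0]))
    return " ".join(term for term, _ in kept)


def text_profile_digest(text: str) -> str:
    """Hash the profile so documents can be grouped by it."""
    return hashlib.sha256(text_profile(text).encode("utf-8")).hexdigest()


page_a = "search search platform platform type type airborne"
page_b = "search search platform platform type type carve"
print(text_profile(page_a))
print(text_profile_digest(page_a) == text_profile_digest(page_b))
```

Because the rare words are dropped before hashing, two pages that differ only in an infrequent term end up with the same profile digest.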
Different Digest, Different Text Digest, Same Text Profile
Digests vs Text Digests vs Text Profile Digests in non-image documents
• Total non-image documents: 9.2 million
• Distinct digests: 8.6 million
• Distinct text digests: 5.2 million
• Distinct text profile (keep all words): 5.1 million
• Distinct text profile (drop infrequent words): 2.7 million
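To put these counts in perspective, the share of documents that are redundant under each notion of "same" is one minus distinct over total. A rough calculation from the rounded figures above (so the percentages are approximate):

```python
TOTAL_NON_IMAGE_DOCS = 9.2e6

# Distinct counts as reported on the slide (rounded to 0.1 million).
DISTINCT_COUNTS = {
    "digest": 8.6e6,
    "text digest": 5.2e6,
    "text profile (keep all words)": 5.1e6,
    "text profile (drop infrequent words)": 2.7e6,
}


def redundant_share(total: float, distinct: float) -> float:
    """Fraction of documents that duplicate some other document."""
    return 1.0 - distinct / total


for name, distinct in DISTINCT_COUNTS.items():
    print(f"{name}: {redundant_share(TOTAL_NON_IMAGE_DOCS, distinct):.1%} redundant")
```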
Number of Non-Image Documents with a Distinct Digest
            Digest    Text Digest    Text Profile Digest
digest1     27,874    2,810,868      2,810,868
digest2     10,089    73,203         489,821
digest3     1,565     27,874         73,225
digest4     1,170     10,089         63,818
digest5     1,166     7,926          58,271
digest6     1,128     2,589          27,874
digest7     1,072     2,557          25,311
digest8     990       1,911          12,222
digest9     933       1,616          11,973
digest10    841       1,573          10,089
2.8 million?!
Yes! On development index.
In production, there are ONLY 880k!
Error page. The Web Server encountered an unknown runtime error. Cannot display page…
Initial Takeaway
• Some easy fixes
Exploration: Near Duplicates with MinHash
Experiments with MinHash
• Earlier proof-of-concept implemented by intern
• Filter available in Elasticsearch to allow for fuzzy hashing/near duplicate detection
• Default settings – digest 5-grams (see next slide), summarize digests into 512 tokens (buckets)
• Run a “MoreLikeThis” query – there is a more efficient algorithm, but not built into ES yet*
Reference: https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-minhash-tokenfilter.html
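For orientation, an analyzer along these lines could be declared roughly as below. This is a sketch based on the token filter documentation referenced above, not the configuration used in the experiment: the filter and analyzer names are placeholders, and the parameter values should be checked against the Elasticsearch version in use.

```python
import json

# Hypothetical index settings: 5-word shingles fed into a min_hash filter
# with 512 buckets. All names below are placeholders.
MINHASH_INDEX_SETTINGS = {
    "settings": {
        "analysis": {
            "filter": {
                "five_gram_shingles": {
                    "type": "shingle",
                    "min_shingle_size": 5,
                    "max_shingle_size": 5,
                    "output_unigrams": False,
                },
                "minhash_512": {
                    "type": "min_hash",
                    "hash_count": 1,
                    "bucket_count": 512,
                    "hash_set_size": 1,
                    "with_rotation": True,
                },
            },
            "analyzer": {
                "near_duplicate_fingerprint": {
                    "tokenizer": "standard",
                    "filter": ["lowercase", "five_gram_shingles", "minhash_512"],
                }
            },
        }
    }
}

# The JSON body one would send when creating the index.
print(json.dumps(MINHASH_INDEX_SETTINGS, indent=2))
```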
What’s a 5-gram
• “the quick brown fox jumped over the lazy dog”
  • “the quick brown fox jumped”
  • “quick brown fox jumped over”
  • …
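The sliding window above, and the way MinHash summarizes a set of 5-grams, can be sketched as follows. This uses the textbook variant with one minimum per seeded hash function; it is not the bucketed implementation in Lucene/Elasticsearch, and the hash function, seed scheme, and `num_hashes` value are assumptions.

```python
import hashlib


def word_ngrams(text: str, n: int = 5) -> list:
    """Return all runs of n consecutive words (word n-grams)."""
    words = text.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def minhash_signature(shingles: list, num_hashes: int = 64) -> list:
    """For each seeded hash function, keep the minimum hash over all shingles.

    Expects at least one shingle (texts shorter than n words have none).
    """
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(8, "big")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for s in shingles
        ))
    return signature


def estimated_jaccard(sig_a: list, sig_b: list) -> float:
    """Share of signature positions that agree; estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


doc_a = "the quick brown fox jumped over the lazy dog"
doc_b = "the quick brown fox jumped over the lazy cat"
sig_a = minhash_signature(word_ngrams(doc_a))
sig_b = minhash_signature(word_ngrams(doc_b))
print(word_ngrams(doc_a)[:2])
print(estimated_jaccard(sig_a, sig_b))
```

The two sample documents share four of six distinct 5-grams, so the estimate should land near that ratio, with the spread depending on `num_hashes`.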
Experiments with MinHash: Findings
• Worked really well on a toy set of synthetic documents
• Performance is prohibitive on full web_index (even with stored termvectors) – estimate ~1 year to query every document in the index
• Note: speed was greatly improved by programmatically retrieving termvectors and creating own terms query, but still not acceptable
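As a back-of-envelope check on what that estimate implies per query, assuming one query per document, run serially, and using the rounded figures from the slides (~1 year, ~12.5 million documents):

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60
DOCS_IN_INDEX = 12_500_000  # approximate size of web_index


def seconds_per_query(total_seconds: float, num_queries: int) -> float:
    """Average time budget per query if the queries run one after another."""
    return total_seconds / num_queries


print(f"{seconds_per_query(SECONDS_PER_YEAR, DOCS_IN_INDEX):.2f} s per document")
```

A couple of seconds per query sounds tolerable for one search, but it is far too slow when every document in the index has to be used as a query.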
Experiments with MinHash: Conclusion
• There may be ways of improving performance with more shards, multithreading, smarter processing, or a different algorithm
• At this point, however, the problems with exact duplicates and/or text duplicates are large enough that further investigation of near duplicates via minhash is not yet warranted
But why, why was MinHash SO slow?! Some ideas…
• Elasticsearch is optimized for queries of a few words, not 512 “words”
• Aside from exact duplicates, how much duplication do we have in 5-grams?
Index 5-grams
• Intuition: in plagiarism detection, a single shared 5-gram is indicative of duplication…so shared 5-grams should be extremely rare
• Finding: NOT AT ALL RARE on web_index
  • The 10,000th most common appears in 12k files!
  • Most common: “an unknown runtime error cannot” (2.6 million files)
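Counting how many documents each 5-gram appears in is a document-frequency calculation. The talk did this by indexing 5-grams in the search engine; the in-memory version below is only a small-scale illustration with made-up sample documents.

```python
from collections import Counter


def word_ngrams(text: str, n: int = 5) -> list:
    """Return all runs of n consecutive words."""
    words = text.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def ngram_document_frequency(docs: list, n: int = 5) -> Counter:
    """Count, for each n-gram, the number of documents that contain it."""
    doc_freq = Counter()
    for text in docs:
        # set() so a 5-gram repeated within one document counts once.
        doc_freq.update(set(word_ngrams(text, n)))
    return doc_freq


sample_docs = [
    "error page the web server encountered an unknown runtime error cannot display page",
    "error page the web server encountered an unknown runtime error cannot display page",
    "the quick brown fox jumped over the lazy dog",
]
doc_freq = ngram_document_frequency(sample_docs)
print(doc_freq.most_common(3))
```

Sorting by document frequency surfaces error pages, boilerplate, and machine-generated text at the top of the list.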
Shared 5-grams – Some Categories of Causes
• Actual duplication or near duplication
• Boilerplate
  • Web-page based (navigation, etc)
  • Legal (copyright, branding)
• Machine generated logs
Actual duplication or near duplication
Boilerplate
• Webpage/Navigational
  • “science technology launch vehicle spacecraft” 1.4 million files for Mars Odyssey pages
  • “content announcements events opportunities people” 500k on techconnect pages
• Legal
  • “research and development center staffed” 640k
Example of Indexed Boilerplate
“science technology launch vehicle spacecraft”
1.4 million files!!!
https://mars.nasa.gov/odyssey/mission/timeline/communicationsrelay/
Pause for relevance check
• If science, technology, “launch vehicle” and spacecraft appear in 1.4 million documents, how important will those words be in a user query?!
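One way to quantify the concern is inverse document frequency. The sketch below uses the BM25 idf form found in Lucene-based engines, with the assumption that each boilerplate term occurs in roughly 1.4 million of the ~12.5 million documents; the rare-term frequency of 100 is an arbitrary comparison point.

```python
import math


def bm25_idf(doc_freq: int, num_docs: int) -> float:
    """BM25 inverse document frequency: rarer terms get higher weight."""
    return math.log(1 + (num_docs - doc_freq + 0.5) / (doc_freq + 0.5))


NUM_DOCS = 12_500_000

boilerplate_idf = bm25_idf(1_400_000, NUM_DOCS)  # term repeated in navigation
rare_idf = bm25_idf(100, NUM_DOCS)               # hypothetical rare term

print(f"boilerplate term idf: {boilerplate_idf:.2f}")
print(f"rare term idf: {rare_idf:.2f}")
```

A term inflated by navigation boilerplate contributes only a fraction of the weight of a genuinely rare term, so queries containing it discriminate poorly between pages.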
Boilerpipe output
Demo: https://boilerpipe-web.appspot.com/
Available as a handler in Tika: BoilerpipeHandler
Available as a python library: https://pypi.org/project/boilerpy3/
Google is removing boilerplate
Machine Generated Logs
“downlink monitor block has completed”
14k documents
Takeaways from MinHash and 5gram
• We have enough to work with for now with digests, text digests and text profile digests
• We can use 5grams to identify:
  • Boilerplate content that we should remove if boilerpipe isn’t sufficient
  • Content that we might want to demote in relevance or remove from the index (machine generated logs?!)
Categories/causes of (near) duplication
• Exact duplicates
  • Same document, different URL
  • Documents with little or no text
• Near duplicates
  • Different formats: PDF vs HTML of same content
  • Versioning
  • Documents with little text
  • Asymmetric duplicates (A is contained entirely within B, but B is larger), e.g. email included in reply
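Asymmetric duplicates are a case where a symmetric measure such as Jaccard similarity understates the overlap, while a containment measure does not. A small sketch, using word 3-grams only because the sample texts are short (the sample email and reply are invented):

```python
def word_ngrams(text: str, n: int = 3) -> set:
    """Return the set of n-word shingles in a text."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a: set, b: set) -> float:
    """Symmetric overlap: shared shingles over all shingles."""
    return len(a & b) / len(a | b) if (a or b) else 0.0


def containment(a: set, b: set) -> float:
    """Asymmetric overlap: share of a's shingles that also occur in b."""
    return len(a & b) / len(a) if a else 0.0


email = "please review the attached downlink report before friday"
reply = "will do thanks " + email  # the reply quotes the whole email

a, b = word_ngrams(email), word_ngrams(reply)
print(f"jaccard: {jaccard(a, b):.2f}")
print(f"containment of email in reply: {containment(a, b):.2f}")
```

The email is fully contained in the reply, yet the Jaccard score stays well below 1 because the reply adds shingles of its own.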
Removal of (near) duplicates is problematic if…
• “Duplicate” documents differ in other key features (same text, but different images)
• Users need to find all versions of a versioned document
• Small difference in text is important or main point of page is non-textual (see next slide)
Slightly different photo metadata
Recommendations, step 1
• Experiment with boilerpipe handler vs. top n 5-grams. Confirm that this doesn’t remove desired text; or identify triggers for boilerpipe handler
• Index token count, lang id, digest and text digest along with documents
• Add major sources of malignant duplicates to “skip list” at crawling stage
Recommendations, step 2…some options
• Remove duplicates or prevent from insertion
• Add a duplicate identification process and
  • Group by duplicate digest in search results
  • Demote duplicates in search results
  • Allow users to select “include duplicates”
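The "group by duplicate digest" option can be illustrated client-side, assuming each hit already carries a digest field from indexing. The field names and sample hits are hypothetical, and search engines such as Elasticsearch and Solr offer their own collapsing and grouping features that would normally do this server-side.

```python
def collapse_by_digest(hits: list, key: str = "text_digest") -> list:
    """Group relevance-ordered hits by digest, keeping the best hit on top."""
    groups = {}
    for hit in hits:
        # dicts preserve insertion order, so groups stay in relevance order.
        groups.setdefault(hit[key], []).append(hit)
    return [{"top": group[0], "duplicates": group[1:]} for group in groups.values()]


# Hypothetical hits, already sorted by descending relevance.
hits = [
    {"url": "https://example.org/a", "text_digest": "d1"},
    {"url": "https://example.org/a?print=1", "text_digest": "d1"},
    {"url": "https://example.org/b", "text_digest": "d2"},
]

for group in collapse_by_digest(hits):
    print(group["top"]["url"], f"(+{len(group['duplicates'])} duplicates)")
```

Keeping the duplicates attached to the top hit leaves room for an “include duplicates” toggle instead of discarding them outright.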
Tools
• Quaerite (https://github.com/tballison/quaerite)
  • Copy indices Solr->ES and vice versa
  • List top n tokens (Solr only): TopNTokens
• tika-eval (https://cwiki.apache.org/confluence/display/TIKA/TikaEval)
  • Token counts
  • Language identification
  • Out of vocabulary %
  • Digest, Text digest, Text profile
Conclusion
• It depends™
• There is no easy button, but this analysis and discovery reveal critical areas for improvement and get us closer to solutions
Some References
• Manku, G., Jain, A. and Das Sarma, A. “Detecting near-duplicates for web crawling.” WWW’07 https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/33026.pdf
• Early patented work at Google: https://www.cs.umd.edu/~pugh/google/Duplicates.pdf
• LSH at Uber for fraudulent trip detection: https://eng.uber.com/lsh/
• Minhash vs. SimHash: http://proceedings.mlr.press/v33/shrivastava14.pdf
Some Other References
• KNN and LSH in Elasticsearch: https://blog.insightdatascience.com/elastik-nearest-neighbors-4b1f6821bd62
• Minhash in Lucene: https://medium.com/@xingzeng/understanding-minhash-in-lucene-elasticsearch-e6799b78c0d7
• ssdeep and elastic: https://www.intezer.com/blog/intezer-analyze-community/intezer-community-tip-ssdeep-comparisons-with-elasticsearch/
10/22/20 53Ā© 2020 California Institute of Technology. Government sponsorship acknowledged.

More Related Content

What's hot

FlinkML - Big data application meetup
FlinkML - Big data application meetupFlinkML - Big data application meetup
FlinkML - Big data application meetupTheodoros Vasiloudis
Ā 
Apache Flink Adoption at Shopify
Apache Flink Adoption at ShopifyApache Flink Adoption at Shopify
Apache Flink Adoption at ShopifyYaroslav Tkachenko
Ā 
Presto Summit 2018 - 04 - Netflix Containers
Presto Summit 2018 - 04 - Netflix ContainersPresto Summit 2018 - 04 - Netflix Containers
Presto Summit 2018 - 04 - Netflix Containerskbajda
Ā 
It's Time To Stop Using Lambda Architecture | Yaroslav Tkachenko, Shopify
It's Time To Stop Using Lambda Architecture | Yaroslav Tkachenko, ShopifyIt's Time To Stop Using Lambda Architecture | Yaroslav Tkachenko, Shopify
It's Time To Stop Using Lambda Architecture | Yaroslav Tkachenko, ShopifyHostedbyConfluent
Ā 
Data Policies for the Kafka-API with WebAssembly | Alexander Gallego, Vectorized
Data Policies for the Kafka-API with WebAssembly | Alexander Gallego, VectorizedData Policies for the Kafka-API with WebAssembly | Alexander Gallego, Vectorized
Data Policies for the Kafka-API with WebAssembly | Alexander Gallego, VectorizedHostedbyConfluent
Ā 
What Your Tech Lead Thinks You Know (But Didn't Teach You)
What Your Tech Lead Thinks You Know (But Didn't Teach You)What Your Tech Lead Thinks You Know (But Didn't Teach You)
What Your Tech Lead Thinks You Know (But Didn't Teach You)Chris Riccomini
Ā 
Storing State Forever: Why It Can Be Good For Your Analytics
Storing State Forever: Why It Can Be Good For Your AnalyticsStoring State Forever: Why It Can Be Good For Your Analytics
Storing State Forever: Why It Can Be Good For Your AnalyticsYaroslav Tkachenko
Ā 
University program - writing an apache apex application
University program  - writing an apache apex applicationUniversity program  - writing an apache apex application
University program - writing an apache apex applicationAkshay Gore
Ā 
Modern Stream Processing With Apache Flink @ GOTO Berlin 2017
Modern Stream Processing With Apache Flink @ GOTO Berlin 2017Modern Stream Processing With Apache Flink @ GOTO Berlin 2017
Modern Stream Processing With Apache Flink @ GOTO Berlin 2017Till Rohrmann
Ā 
Iceberg: a fast table format for S3
Iceberg: a fast table format for S3Iceberg: a fast table format for S3
Iceberg: a fast table format for S3DataWorks Summit
Ā 
Fabian Hueske - Stream Analytics with SQL on Apache Flink
Fabian Hueske - Stream Analytics with SQL on Apache FlinkFabian Hueske - Stream Analytics with SQL on Apache Flink
Fabian Hueske - Stream Analytics with SQL on Apache FlinkVerverica
Ā 
Case study- Real-time OLAP Cubes
Case study- Real-time OLAP Cubes Case study- Real-time OLAP Cubes
Case study- Real-time OLAP Cubes Ziemowit Jankowski
Ā 
Apache flink
Apache flinkApache flink
Apache flinkpranay kumar
Ā 
SingleStore & Kafka: Better Together to Power Modern Real-Time Data Architect...
SingleStore & Kafka: Better Together to Power Modern Real-Time Data Architect...SingleStore & Kafka: Better Together to Power Modern Real-Time Data Architect...
SingleStore & Kafka: Better Together to Power Modern Real-Time Data Architect...HostedbyConfluent
Ā 
Measure your app internals with InfluxDB and Symfony2
Measure your app internals with InfluxDB and Symfony2Measure your app internals with InfluxDB and Symfony2
Measure your app internals with InfluxDB and Symfony2Corley S.r.l.
Ā 
Introduction to the Processor API
Introduction to the Processor APIIntroduction to the Processor API
Introduction to the Processor APIconfluent
Ā 
C* Summit 2013: Optimizing the Public Cloud for Cost and Scalability with Cas...
C* Summit 2013: Optimizing the Public Cloud for Cost and Scalability with Cas...C* Summit 2013: Optimizing the Public Cloud for Cost and Scalability with Cas...
C* Summit 2013: Optimizing the Public Cloud for Cost and Scalability with Cas...DataStax Academy
Ā 
Presto Summit 2018 - 03 - Starburst CBO
Presto Summit 2018  - 03 - Starburst CBOPresto Summit 2018  - 03 - Starburst CBO
Presto Summit 2018 - 03 - Starburst CBOkbajda
Ā 
Introduction to Data Engineer and Data Pipeline at Credit OK
Introduction to Data Engineer and Data Pipeline at Credit OKIntroduction to Data Engineer and Data Pipeline at Credit OK
Introduction to Data Engineer and Data Pipeline at Credit OKKriangkrai Chaonithi
Ā 

What's hot (20)

FlinkML - Big data application meetup
FlinkML - Big data application meetupFlinkML - Big data application meetup
FlinkML - Big data application meetup
Ā 
Apache Flink Adoption at Shopify
Apache Flink Adoption at ShopifyApache Flink Adoption at Shopify
Apache Flink Adoption at Shopify
Ā 
Presto Summit 2018 - 04 - Netflix Containers
Presto Summit 2018 - 04 - Netflix ContainersPresto Summit 2018 - 04 - Netflix Containers
Presto Summit 2018 - 04 - Netflix Containers
Ā 
It's Time To Stop Using Lambda Architecture | Yaroslav Tkachenko, Shopify
It's Time To Stop Using Lambda Architecture | Yaroslav Tkachenko, ShopifyIt's Time To Stop Using Lambda Architecture | Yaroslav Tkachenko, Shopify
It's Time To Stop Using Lambda Architecture | Yaroslav Tkachenko, Shopify
Ā 
Data Policies for the Kafka-API with WebAssembly | Alexander Gallego, Vectorized
Data Policies for the Kafka-API with WebAssembly | Alexander Gallego, VectorizedData Policies for the Kafka-API with WebAssembly | Alexander Gallego, Vectorized
Data Policies for the Kafka-API with WebAssembly | Alexander Gallego, Vectorized
Ā 
What Your Tech Lead Thinks You Know (But Didn't Teach You)
What Your Tech Lead Thinks You Know (But Didn't Teach You)What Your Tech Lead Thinks You Know (But Didn't Teach You)
What Your Tech Lead Thinks You Know (But Didn't Teach You)
Ā 
Storing State Forever: Why It Can Be Good For Your Analytics
Storing State Forever: Why It Can Be Good For Your AnalyticsStoring State Forever: Why It Can Be Good For Your Analytics
Storing State Forever: Why It Can Be Good For Your Analytics
Ā 
University program - writing an apache apex application
University program  - writing an apache apex applicationUniversity program  - writing an apache apex application
University program - writing an apache apex application
Ā 
Modern Stream Processing With Apache Flink @ GOTO Berlin 2017
Modern Stream Processing With Apache Flink @ GOTO Berlin 2017Modern Stream Processing With Apache Flink @ GOTO Berlin 2017
Modern Stream Processing With Apache Flink @ GOTO Berlin 2017
Ā 
Iceberg: a fast table format for S3
Iceberg: a fast table format for S3Iceberg: a fast table format for S3
Iceberg: a fast table format for S3
Ā 
Fabian Hueske - Stream Analytics with SQL on Apache Flink
Fabian Hueske - Stream Analytics with SQL on Apache FlinkFabian Hueske - Stream Analytics with SQL on Apache Flink
Fabian Hueske - Stream Analytics with SQL on Apache Flink
Ā 
Case study- Real-time OLAP Cubes
Case study- Real-time OLAP Cubes Case study- Real-time OLAP Cubes
Case study- Real-time OLAP Cubes
Ā 
Apache flink
Apache flinkApache flink
Apache flink
Ā 
SingleStore & Kafka: Better Together to Power Modern Real-Time Data Architect...
SingleStore & Kafka: Better Together to Power Modern Real-Time Data Architect...SingleStore & Kafka: Better Together to Power Modern Real-Time Data Architect...
SingleStore & Kafka: Better Together to Power Modern Real-Time Data Architect...
Ā 
Zurich Flink Meetup
Zurich Flink MeetupZurich Flink Meetup
Zurich Flink Meetup
Ā 
Measure your app internals with InfluxDB and Symfony2
Measure your app internals with InfluxDB and Symfony2Measure your app internals with InfluxDB and Symfony2
Measure your app internals with InfluxDB and Symfony2
Ā 
Introduction to the Processor API
Introduction to the Processor APIIntroduction to the Processor API
Introduction to the Processor API
Ā 
C* Summit 2013: Optimizing the Public Cloud for Cost and Scalability with Cas...
C* Summit 2013: Optimizing the Public Cloud for Cost and Scalability with Cas...C* Summit 2013: Optimizing the Public Cloud for Cost and Scalability with Cas...
C* Summit 2013: Optimizing the Public Cloud for Cost and Scalability with Cas...
Ā 
Presto Summit 2018 - 03 - Starburst CBO
Presto Summit 2018  - 03 - Starburst CBOPresto Summit 2018  - 03 - Starburst CBO
Presto Summit 2018 - 03 - Starburst CBO
Ā 
Introduction to Data Engineer and Data Pipeline at Credit OK
Introduction to Data Engineer and Data Pipeline at Credit OKIntroduction to Data Engineer and Data Pipeline at Credit OK
Introduction to Data Engineer and Data Pipeline at Credit OK
Ā 

Similar to Haystack Live tallison_202010_v2

Evaluating Text Extraction at Scale: A case study from Apache Tika
Evaluating Text Extraction at Scale: A case study from Apache TikaEvaluating Text Extraction at Scale: A case study from Apache Tika
Evaluating Text Extraction at Scale: A case study from Apache TikaTim Allison
Ā 
Australian Open government and research data pilot survey 2017
Australian Open government and research data pilot survey 2017Australian Open government and research data pilot survey 2017
Australian Open government and research data pilot survey 2017Jonathan Yu
Ā 
How to valuate and determine standard essential patents
How to valuate and determine standard essential patentsHow to valuate and determine standard essential patents
How to valuate and determine standard essential patentsMIPLM
Ā 
Visualising the Australian open data and research data landscape
Visualising the Australian open data and research data landscapeVisualising the Australian open data and research data landscape
Visualising the Australian open data and research data landscapeJonathan Yu
Ā 
"Building a File Observatory: Making Sense of PDFs in the Wild"
"Building a File Observatory: Making Sense of PDFs in the Wild""Building a File Observatory: Making Sense of PDFs in the Wild"
"Building a File Observatory: Making Sense of PDFs in the Wild"Tim Allison
Ā 
BioIT Europe 2010 - BioCatalogue
BioIT Europe 2010 - BioCatalogueBioIT Europe 2010 - BioCatalogue
BioIT Europe 2010 - BioCatalogueBioCatalogue
Ā 
Louise McCluskey, Kx Engineer at Kx Systems
Louise McCluskey, Kx Engineer at Kx SystemsLouise McCluskey, Kx Engineer at Kx Systems
Louise McCluskey, Kx Engineer at Kx SystemsDataconomy Media
Ā 
Text and Data Mining explained at FTDM
Text and Data Mining explained at FTDMText and Data Mining explained at FTDM
Text and Data Mining explained at FTDMpetermurrayrust
Ā 
Content Mining of Science and Medicine
Content Mining of Science and MedicineContent Mining of Science and Medicine
Content Mining of Science and MedicineTheContentMine
Ā 
MPLS/SDN 2013 Intercloud Standardization and Testbeds - Sill
MPLS/SDN 2013 Intercloud Standardization and Testbeds - SillMPLS/SDN 2013 Intercloud Standardization and Testbeds - Sill
MPLS/SDN 2013 Intercloud Standardization and Testbeds - SillAlan Sill
Ā 
500 languages to English Machine Translation Model
500 languages to English Machine Translation Model500 languages to English Machine Translation Model
500 languages to English Machine Translation ModelThamme Gowda
Ā 
Revolutionizing Laboratory Instrument Data for the Pharmaceutical Industry:...
Revolutionizing Laboratory  Instrument Data for the  Pharmaceutical Industry:...Revolutionizing Laboratory  Instrument Data for the  Pharmaceutical Industry:...
Revolutionizing Laboratory Instrument Data for the Pharmaceutical Industry:...OSTHUS
Ā 
iMicrobe_ASLO_2015
iMicrobe_ASLO_2015iMicrobe_ASLO_2015
iMicrobe_ASLO_2015Bonnie Hurwitz
Ā 
Introduction to Big Data Analytics: Batch, Real-Time, and the Best of Both Wo...
Introduction to Big Data Analytics: Batch, Real-Time, and the Best of Both Wo...Introduction to Big Data Analytics: Batch, Real-Time, and the Best of Both Wo...
Introduction to Big Data Analytics: Batch, Real-Time, and the Best of Both Wo...WSO2
Ā 
Getting Access to ALCF Resources and Services
Getting Access to ALCF Resources and ServicesGetting Access to ALCF Resources and Services
Getting Access to ALCF Resources and Servicesdavidemartin
Ā 
Grid Projects In The US July 2008
Grid Projects In The US July 2008Grid Projects In The US July 2008
Grid Projects In The US July 2008Ian Foster
Ā 
The Nature of Information
The Nature of InformationThe Nature of Information
The Nature of InformationAdrian Paschke
Ā 
So Long Computer Overlords
So Long Computer OverlordsSo Long Computer Overlords
So Long Computer OverlordsIan Foster
Ā 


Haystack Live tallison_202010_v2

  • 1. Duplicate and Near Duplicate Detection at Scale
      Tim Allison, Ph.D., Data Scientist/Relevance Engineer, Artificial Intelligence, Analytics and Innovative Development Organization
      © 2020 California Institute of Technology. Government sponsorship acknowledged.
      Reference herein to any specific commercial product, process, or service by trade name, trademark, manufacturer, or otherwise, does not constitute or imply its endorsement by the United States Government or the Jet Propulsion Laboratory, California Institute of Technology.
  • 2. About me
      • Data scientist (files and search), Jet Propulsion Laboratory, California Institute of Technology
      • Chair/V.P., Apache Tika
      • Committer, Apache PDFBox, POI, Lucene/Solr, OpenNLP
      • Member, Apache Software Foundation
      jpl.nasa.gov, 10/22/20
  • 3. Outline
      • Search system assessments, an overview of options
      • Plug for text extraction assessment
      • Duplicates and near duplicates – Case Study
      • Exploration: Near duplicates with minhash
      • Conclusion
  • 4. Search System Assessment: 20,000 ft view
      • Offline
      • Ground truth queries and expected docs
      • Online
      • User behavior
      • User feedback
      • Surveys, interviews
      • Technical review
      • System
      • Data
  • 5. System Assessment
      Udo Kruschwitz and Charlie Hull, “Searching the Enterprise”, Foundations and Trends® in Information Retrieval, 11(1):1-142, July 2017, p. 16.
  • 6. System Assessment
      • Crawler configurations
      • Text extraction configurations
      • Schema and field configuration
      • Query Parser configuration
      • Default Boolean operator
      • Fields, field boosts
      • …
  • 7. Data Assessment
      • File types, parser coverage
      • Quality of text extraction (…languages)
      • Quality of metadata – dates, duplicate titles/metadata
      • Liveness of documents/URLs/URL redirects
      • Duplicates and near duplicates
  • 8. Plug for Text Extraction Assessment
  • 9. Out of vocabulary (OOV) – Same file, different extractors
                              Tika 1.14             Tika 1.15-SNAPSHOT
      Unique Tokens           786                   156
      Total Tokens            1603                  272
      LangId                  zh-ch                 de
      Common Words            0                     116
      Alphabetic Tokens       1603                  250
      OOV%                    1-(0/1603) = 100%     1-(116/250) = 54%
      Top N Tokens (Tika 1.14): ten tokens, nine of them mis-decoded CJK characters that are not recoverable from this transcript, with counts 18 | 14 | 14 | m: 11 | 11 | 11 | 11 | 10 | 9 | 9
      Top N Tokens (Tika 1.15-SNAPSHOT): die: 11 | und: 8 | von: 8 | deutschen: 7 | deutsche: 6 | 1: 5 | das: 5 | der: 5 | finanzministerium: 5 | oder: 5
      Fixed encoding detection between 1.14 and 1.15
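The OOV% row on slide 9 follows the formula shown there, 1 - (common words / alphabetic tokens). A minimal sketch of that arithmetic follows; the function name `oov_percent` is hypothetical and is not taken from tika-eval.

```python
def oov_percent(common_words: int, alphabetic_tokens: int) -> float:
    """Out-of-vocabulary share as a percentage: 100 * (1 - common / alphabetic)."""
    if alphabetic_tokens <= 0:
        raise ValueError("need at least one alphabetic token")
    return 100.0 * (1.0 - common_words / alphabetic_tokens)


# Values from the two columns on the slide.
print(round(oov_percent(0, 1603)))   # Tika 1.14: 100
print(round(oov_percent(116, 250)))  # Tika 1.15-SNAPSHOT: 54
```

A high value suggests that few extracted tokens are recognized words in the detected language, which is the signal the slide uses to flag the encoding problem.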
  • 10. Quality of text extraction, an example
      https://voyager.jpl.nasa.gov/pdf/sfos2003pdf/03_10_02-03_10_19.sfos.pdf
      Language Id: Nepali (Out of Vocabulary 99%)
  • 11. Unexplained Garbage at Beginning of File(???)
      This unexplained garbage at the beginning of a file also occurs in several other PDF files identified as Nepali
  • 12. From analytics to action
      https://aviris.jpl.nasa.gov/proceedings/workshops/02_docs/2002_Ogura_1_web.pdf
  • 13. Stored text vs. Optical Character Recognition
      Text As Stored in File: !"#%$& (') *,+-).' / 0 1,23 *. 457698;:;<>=75?&@78;ACB D(B7E;FHGJICBK5MLNBKOPBKF;B DJD Q R S.TVU9WNXMY[ZT]^W_S `badc 5KICedFgfh5 cji :;edF;A^5KEk<>Imln:;e[<>EnloedACICe a lo<p57Eg5Kqsr;E;<jloe[E 8;O 6hedA5Kq adc 57ItedFk:;B c qsICf;B a
      Text from Tesseract OCR: Constrained Least Squares Linear Spectral Unmixture by the Hybrid Steepest Descent Method Nobuhiko Ogura’ and Isao Yamada” 1 Introduction A closed polyhedron is the intersection of finite number of closed half spaces, i.e., the setof points satisfying finite number of lincar incqualitics, and is widely used as a constraint in various application, for example specifications or constraints in signal processing or estimation problems, resource restrictions in financial applications and feasible sets of
  • 14. Duplicates and Near Duplicates
  • 15. Experimental Setup
      • Development “web_index” (~12.5 million documents)
      • Slightly out of date compared with production, but close enough
      • Covers internal web, but not other “document-heavy” indices
      • Safer to avoid heavy computation on production cluster
      • Small enough to reindex with different field settings on dev cluster
      • Use existing tools/metrics – no contrib modules/hand-coded algorithms
  • 16. Duplicates!
  • 17. How big of a problem are duplicates?
      First “lesson learned” in Oleksiy Kovyrin’s recent “Sprinting to a crawl: Building an effective web crawler” at ElasticON Global 2020
  • 18. Google has several patents for (near)duplicate detection
      https://patents.google.com/?q=%22duplicate+documents%22&assignee=Google%2c+Llc&num=100&oq=assignee:(Google%2c+Llc)+%22duplicate+documents%22&sort=new
  • 19. Google’s Guidance for Duplicates and Search Engine Optimization (SEO)
      https://support.google.com/webmasters/answer/66359?hl=en
  • 20. File Types – Top 10 file types in web_index
      File Type             Count
      text/html             8,894,038
      image/gif             1,870,136
      image/jpeg            1,094,937
      image/png             319,710
      text/plain            109,516
      application/pdf       105,081
      application/x-hdf     64,194
      image/x-ms-bmp        26,377
      application/xml       8,734
      application/msword    7,414
  • 21. Duplicates, near duplicates
      • Digests: literal bytes of a file are the same
      • Text Digests: extracted text from a document is the same
      • Text Profile Digest (see next slide): either require all words, or drop the rarer words in a document (default)
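The difference between a digest of the raw bytes and a digest of the extracted text can be illustrated with a small sketch. Everything here is an assumption for illustration: SHA-256 stands in for whatever digest the index uses, and the standard-library `HTMLParser` stands in for a real extractor such as Tika. The helper names are hypothetical.

```python
import hashlib
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collects character data and ignores markup (a stand-in for a real extractor)."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def extract_text(raw: bytes) -> str:
    parser = _TextCollector()
    parser.feed(raw.decode("utf-8", errors="replace"))
    parser.close()
    # Collapse runs of whitespace so layout differences do not change the text.
    return " ".join(" ".join(parser.parts).split())


def byte_digest(raw: bytes) -> str:
    """Digest of the literal bytes of the file."""
    return hashlib.sha256(raw).hexdigest()


def text_digest(raw: bytes) -> str:
    """Digest of the extracted text only."""
    return hashlib.sha256(extract_text(raw).encode("utf-8")).hexdigest()


page_a = b"<p>Error page.</p>"
page_b = b"<div><p>Error   page.</p></div>"
print(byte_digest(page_a) == byte_digest(page_b))  # different bytes
print(text_digest(page_a) == text_digest(page_b))  # same extracted text
```

Two pages with different markup but identical visible text get different byte digests and the same text digest, which is why the text digest groups more documents together.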
  • 22. Nutch’s TextProfile
      Example page (https://airbornescience.jpl.nasa.gov/data): Data Search | JPL's Earth Science Airborne Program Jump to navigation Earth Science Airborne Program JPL's Suborbital Earth Science Instruments & Measurements Home › All Products › Instrument: Fourier Transform Infrared Spectrometer (FTS) › Product Type: FTS_L2QR › Platform: C-23 Sherpa › Parameter: Atmospheric Chemistry › Platform Type: Airborne › Campaign: Carbon in Arctic Reservoirs Vulnerability Experiment (CARVE) Data Search Show Advanced search Temporal Search Start Date Stop Date Free Text Search Enter search text Spatial Search (Hold Shift to draw bounding box) + - Perform Search Sort By Popularity (All Time) Popularity (This Month) Popularity (Users) Long Name (A-Z) Short Name (A-Z) Grid Spatial Resolution Satellite Spatial Resolution Start Date Stop Date Found 0 matching products(s). Browse Products Campaign Any campaign Carbon in Arctic Reservoirs Vulnerability Experiment (CARVE) (261) Parameter Any parameter Atmospheric Chemistry (261) Instrument Any instrument Fourier Transform Infrared Spectrometer (FTS) (261) Platform Any platform C-23 Sherpa (261) Platform Type Any platform type Airborne (261) Product Type Any product type FTS_L2QR (261)
      Term        Quantized Count
      search      8
      261         6
      any         6
      platform    6
      type        6
      airborne    4
      date        4
      Text Profile: “search 261 any platform type airborne date…”
      Quantize counts, sort by descending order of frequency, drop terms whose quantized count falls below a threshold
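The three steps named on slide 22 (quantize counts, sort by descending frequency, drop low counts) can be sketched as follows. This is a loose illustration, not Nutch's implementation: the tokenizer, the fixed quantization step `quant=2`, the threshold `min_count=2`, the alphabetical tie-break, and SHA-256 are all assumptions chosen for the example.

```python
import hashlib
import re
from collections import Counter


def text_profile(text: str, quant: int = 2, min_count: int = 2) -> str:
    """Build a profile string from the frequent terms of a document."""
    tokens = re.findall(r"\w+", text.lower())
    counts = Counter(tokens)
    # Quantize: round each count down to a multiple of `quant`.
    quantized = {term: (count // quant) * quant for term, count in counts.items()}
    # Drop terms whose quantized count falls below the threshold.
    kept = [(term, q) for term, q in quantized.items() if q >= min_count]
    # Sort by descending quantized count; ties are broken alphabetically (assumption).
    kept.sort(key=lambda pair: (-pair[1], pair[0]))
    return " ".join(term for term, _ in kept)


def text_profile_digest(text: str, quant: int = 2, min_count: int = 2) -> str:
    profile = text_profile(text, quant=quant, min_count=min_count)
    return hashlib.sha256(profile.encode("utf-8")).hexdigest()


doc_a = "search search search search platform platform date"
doc_b = doc_a + " odyssey"  # differs only in one rare word
print(text_profile(doc_a))
print(text_profile_digest(doc_a) == text_profile_digest(doc_b))
```

Because rare words are dropped before hashing, documents that differ only in a few infrequent terms collapse to the same profile digest.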
  • 23. Different Digest, Different Text Digest, Same Text Profile
  • 24. Digests vs Text Digests vs Text Profile Digests in non-image documents
      • Total non-image documents: 9.2 million
      • Distinct digests: 8.6 million
      • Distinct text digests: 5.2 million
      • Distinct text profile (keep all words): 5.1 million
      • Distinct text profile (drop infrequent words): 2.7 million
  • 25. Number of Non-Image Documents with a Distinct Digest
                    Digest     Text Digest    Text Profile Digest
      digest1       27,874     2,810,868      2,810,868
      digest2       10,089     73,203         489,821
      digest3       1,565      27,874         73,225
      digest4       1,170      10,089         63,818
      digest5       1,166      7,926          58,271
      digest6       1,128      2,589          27,874
      digest7       1,072      2,557          25,311
      digest8       990        1,911          12,222
      digest9       933        1,616          11,973
      digest10      841        1,573          10,089
  • 26. 2.8 million?!
      Yes! On development index. In production, there are ONLY 880k!
      Error page. The Web Server encountered an unknown runtime error. Cannot display page…
  • 27. Initial Takeaway
      • Some easy fixes
  • 28. Exploration: Near Duplicates with MinHash
  • 29. Experiments with MinHash
      • Earlier proof-of-concept implemented by intern
      • Filter available in Elasticsearch to allow for fuzzy hashing/near duplicate detection
      • Default settings – digest 5-grams (see next slide), summarize digests into 512 tokens (buckets)
      • Run a “MoreLikeThis” query – there is a more efficient algorithm, but not built into ES yet*
      Reference: https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-minhash-tokenfilter.html
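For readers unfamiliar with the idea behind MinHash, the sketch below shows the textbook variant: keep the minimum hash value under several hash functions and estimate Jaccard similarity from the share of matching positions. It is a simplification under stated assumptions and does not reproduce the bucketed scheme of the Elasticsearch filter; the function names, the salted SHA-256 hashing, and `num_hashes=128` are all hypothetical choices.

```python
import hashlib


def _hash64(seed: int, item: str) -> int:
    """A seeded 64-bit hash built from SHA-256 (an illustrative choice)."""
    digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def minhash_signature(items, num_hashes: int = 128):
    """For each seeded hash function, keep the minimum hash over the set."""
    items = set(items)
    if not items:
        raise ValueError("cannot sketch an empty set")
    return [min(_hash64(seed, item) for item in items) for seed in range(num_hashes)]


def estimate_jaccard(sig_a, sig_b) -> float:
    """The share of matching signature positions estimates Jaccard similarity."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


# Two sets of stand-in shingles that share 80 of 120 distinct items (Jaccard = 0.67).
set_a = {f"shingle-{i}" for i in range(100)}
set_b = {f"shingle-{i}" for i in range(20, 120)}
print(estimate_jaccard(minhash_signature(set_a), minhash_signature(set_b)))
```

In practice the items would be the 5-gram shingles of a document, and the fixed-length signature is what gets indexed and compared in place of the full text.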
  • 30. What’s a 5-gram
      • “the quick brown fox jumped over the lazy dog”
      • “the quick brown fox jumped”
      • “quick brown fox jumped over”
      • ….
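The 5-grams on slide 30 are overlapping runs of five consecutive words. A minimal sketch, assuming whitespace tokenization and lowercasing (the function name `word_ngrams` is hypothetical):

```python
def word_ngrams(text: str, n: int = 5):
    """Return every run of n consecutive words, in order of appearance."""
    words = text.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


for gram in word_ngrams("the quick brown fox jumped over the lazy dog"):
    print(gram)
```

A nine-word sentence yields five 5-grams; a text shorter than five words yields none.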
  • 31. Experiments with MinHash: Findings
      • Worked really well on a toy set of synthetic documents
      • Performance is prohibitive on full web_index (even with stored termvectors) – estimate ~1 year to query every document in the index
      • Note: speed was greatly improved by programmatically retrieving termvectors and creating own terms query, but still not acceptable
  • 32. Experiments with MinHash: Conclusion
      • There may be ways of improving performance with more shards, multithreading, smarter processing, different algorithm
      • At this point, however, the problems with exact duplicates and/or text duplicates are sufficient so as not to warrant further investigation of near duplicates via minhash
  • 33. But why, why was MinHash SO slow?! Some ideas…
      • Elasticsearch is optimized for queries of a few words, not 512 “words”
      • Aside from exact duplicates, how much duplication do we have in 5-grams?
  • 34. Index 5-grams
      • Intuition: in plagiarism detection, a single 5-gram is indicative of duplication…should be extremely rare
      • Finding: NOT AT ALL RARE on web_index
      • The 10,000th most common appears in 12k files!
      • Most common: “an unknown runtime error cannot”, 2.6 million files
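Counting how many documents each 5-gram appears in, as slide 34 describes, can be sketched in a few lines. The talk did this inside the search index; the in-memory version below is only an illustration, and the function name and the sample documents are hypothetical.

```python
from collections import Counter


def five_gram_document_frequency(docs, n: int = 5) -> Counter:
    """Map each word n-gram to the number of documents that contain it."""
    df = Counter()
    for text in docs:
        words = text.lower().split()
        # Use a set so a gram repeated within one document is counted once.
        grams = {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}
        df.update(grams)
    return df


docs = [
    "the web server encountered an unknown runtime error cannot display page",
    "error page the web server encountered an unknown runtime error cannot display page",
    "constrained least squares linear spectral unmixture by the hybrid steepest descent method",
]
df = five_gram_document_frequency(docs)
print(df["an unknown runtime error cannot"])
```

Grams with a very high document frequency are candidates for boilerplate or error pages rather than evidence of copied content.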
  • 35. Shared 5-grams – Some Categories of Causes
      • Actual duplication or near duplication
      • Boilerplate
      • Web-page based (navigation, etc)
      • Legal (copyright, branding)
      • Machine generated logs
  • 36. Actual duplication or near duplication
  • 37. Boilerplate
      • Webpage/Navigational
      • “science technology launch vehicle” 1.4 million files for Mars Odyssey pages
      • “content announcements events opportunities people” 500k on techconnect pages
      • Legal
      • “research and development center staffed” 640k
  • 38. Example of Indexed Boilerplate
      “science technology launch vehicle spacecraft” 1.4 million files!!!
      https://mars.nasa.gov/odyssey/mission/timeline/communicationsrelay/
  • 39. Pause for relevance check
      • If science, technology, “launch vehicle” and spacecraft appear in 1.4 million documents, how important will those words be in a user query?!
  • 40. Boilerpipe output
      Demo: https://boilerpipe-web.appspot.com/
      Available as a handler in Tika: BoilerpipeHandler
      Available as a python library: https://pypi.org/project/boilerpy3/
  • 41. Google is removing boilerplate
  • 42. Machine Generated Logs
      “downlink monitor block has completed” 14k documents
  • 43. Takeaways from MinHash and 5gram
      • We have enough to work with for now with digests, text digests and text profile digests
      • We can use 5grams to identify:
      • Boilerplate content that we should remove if boilerpipe isn’t sufficient
      • Content that we might want to demote in relevance or remove from the index (machine generated logs?!)
  • 44. Categories/causes of (near) duplication
      • Exact duplicates
      • Same document, different URL
      • Documents with little or no text
      • Near duplicates
      • Different formats: PDF vs HTML of same content
      • Versioning
      • Documents with little text
      • Asymmetric duplicates (A is contained entirely within B, but B is larger), e.g. email included in reply
  • 45. Removal of (near) duplicates is problematic if…
      • “Duplicate” documents differ in other key features (same text, but different images)
      • Users need to find all versions of a versioned document
      • Small difference in text is important or main point of page is non-textual (see next slide)
  • 46. Slightly different photo metadata
  • 47. Recommendations, step 1
      • Experiment with boilerpipe handler vs. top n 5-grams. Confirm that this doesn’t remove desired text; or identify triggers for boilerpipe handler
      • Index token count, lang id, digest and text digest along with documents
      • Add major sources of malignant duplicates to “skip list” at crawling stage
  • 48. Recommendations, step 2…some options
      • Remove duplicates or prevent from insertion
      • Add a duplicate identification process and
      • Group by duplicate digest in search results
      • Demote duplicates in search results
      • Allow users to select “include duplicates”
  • 49. Tools
      • Quaerite (https://github.com/tballison/quaerite)
      • Copy indices Solr->ES and vice versa
      • List top n tokens (Solr only): TopNTokens
      • tika-eval (https://cwiki.apache.org/confluence/display/TIKA/TikaEval)
      • Token counts
      • Language identification
      • Out of vocabulary %
      • Digest, Text digest, Text profile
  • 50. Conclusion
      • It depends™
      • There is no easy button, but this analysis and discovery reveal critical areas for improvement and get us closer to solutions
  • 51. Some References
      • Manku, G., Jain, A. and Dash, A. “Detecting near-duplicates for web crawling.” WWW’07 https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/33026.pdf
      • Early patented work at Google: https://www.cs.umd.edu/~pugh/google/Duplicates.pdf
      • LSH at Uber for fraudulent trip detection: https://eng.uber.com/lsh/
      • Minhash vs. SimHash: http://proceedings.mlr.press/v33/shrivastava14.pdf
  • 52. Some Other References
      • KNN and LSH in Elasticsearch: https://blog.insightdatascience.com/elastik-nearest-neighbors-4b1f6821bd62
      • Minhash in Lucene: https://medium.com/@xingzeng/understanding-minhash-in-lucene-elasticsearch-e6799b78c0d7
      • ssdeep and elastic: https://www.intezer.com/blog/intezer-analyze-community/intezer-community-tip-ssdeep-comparisons-with-elasticsearch/