SlideShare a Scribd company logo
1 of 54
Download to read offline
Apache
Nutch – ApacheCon US '09




                           Web-scale search engine toolkit

                                Today and tomorrow

                                      Andrzej Białecki
                                      ab@apache.org
Agenda
                               About the project
                               Web crawling in general
                               Nutch architecture overview
                               Nutch workflow:
                           •   Setup
Nutch – ApacheCon US '09




                           •   Crawling
                           •   Searching
                               Challenges (and some solutions)
                               Nutch present and future
                               Questions and answers


                                                                 2
Apache Nutch project
                               Founded in 2003 by Doug Cutting, the Lucene
                               creator, and Mike Cafarella
                               Apache project since 2004 (sub-project of Lucene)
                               Spin-offs:
                               Map-Reduce and distributed FS → Hadoop
Nutch – ApacheCon US '09




                           •

                           •   Content type detection and parsing → Tika
                               Many installations in operation, mostly vertical
                               search
                               Collections typically 1 mln - 200 mln documents


                                                                                   3
Nutch – ApacheCon US '09




4
Web as a directed graph
                           Nodes (vertices): URL-s as unique identifiers
                           Edges (links): hyperlinks like <a href=”targetUrl”/>
                           Edge labels: <a href=”..”>anchor text</a>
                           Often represented as adjacency (neighbor) lists
                           Traversal: follow the edges, breadth-first, depth-
Nutch – ApacheCon US '09




                           first, random
                               8

                                                   7   9
                                       3
                           2                               1 → 2, 3, 4, 5, 6
                                   1           4           5 → 6, 9
                                                           7 → 3, 4, 8, 9
                           6               5

                                                                                  5
What's in a search engine?
                           … a few things that may surprise you! 
Nutch – ApacheCon US '09




                                                                     6
Search engine building blocks

                           Injector    Scheduler             Crawler     Searcher



                                                                                    Indexer
Nutch – ApacheCon US '09




                              Web graph            Updater         Content
                              -page info
                              -links (in/out)
                                                                  repository
                                                                                    Parser




                                                   Crawling frontier controls



                                                                                              7
Nutch features at a glance
                           Page database and link database (web graph)
                           Plugin-based, highly modular:
                             −   Most behavior can be changed via plugins
                           Multi-protocol, multi-threaded, distributed crawler
                           Plugin-based content processing (parsing, filtering)
Nutch – ApacheCon US '09




                           Robust crawling frontier controls
                           Scalable data processing framework
                             −   Map-reduce processing
                           Full-text indexer & search engine
                             −   Using Lucene or Solr
                             −   Support for distributed search
                           Robust API and integration options

                                                                              8
Hadoop foundation
                               File system abstraction
                           •   Local FS, or
                           •   Distributed FS
                                − also Amazon S3, Kosmos and other FS impl.


                               Map-reduce processing
Nutch – ApacheCon US '09




                           •   Currently central to Nutch algorithms
                           •   Processing tasks are executed as one or more map-
                                reduce jobs
                           •   Data maintained as Hadoop MapFile-s / SequenceFile-s
                           •   Massive updates very efficient
                           •   Small updates costly
                                 −   Hadoop data is immutable, once created
                                 −   Updates are really merge & replace               9
Search engine building blocks

                           Injector    Scheduler             Crawler     Searcher



                                                                                    Indexer
Nutch – ApacheCon US '09




                              Web graph            Updater         Content
                              -page info
                              -links (in/out)
                                                                  repository
                                                                                    Parser




                                                   Crawling frontier controls



                                                                                              10
Nutch building blocks

                           Injector   Generator              Fetcher      Searcher



                                                                                       Indexer
Nutch – ApacheCon US '09




                               CrawlDB            Updater
                                                                   Shards
                                                                 (segments)
                                                    Link                               Parser
                                LinkDB            inverter


                              URL filters & normalizers, parsing/indexing filters, scoring plugins



                                                                                                     11
Nutch data

                                              Maintains info on all known URL-s:
                           Injector Generator ● Fetch schedule
                                                       Fetcher       Searcher
                                              ● Fetch status

                                              ● Page signature

                                              ● Metadata

                                                                              Indexer
Nutch – ApacheCon US '09




                                CrawlDB      Updater
                                                                   Shards
                                                                 (segments)
                                                   Link                                Parser
                               LinkDB            inverter


                              URL filters & normalizers, parsing/indexing filters, scoring plugins



                                                                                                     12
Nutch data

                           Injector  Generator For each Fetcher
                                                        target URL keeps info on
                                                                         Searcher
                                               incoming links, i.e. list of source
                                               URL-s and their associated anchor
                                               text
                                                                                   Indexer
Nutch – ApacheCon US '09




                               CrawlDB        Updater
                                                                   Shards
                                                                 (segments)
                                                   Link                                Parser
                                LinkDB           inverter


                              URL filters & normalizers, parsing/indexing filters, scoring plugins



                                                                                                     13
Nutch data
                           Shards (“segments”) keep:
                           ● Raw page content

                           Injector content + discovered
                           ● Parsed Generator            Fetcher          Searcher
                           metadata + outlinks
                           ● Plain text for indexing and

                           snippets
                                                                                       Indexer
Nutch – ApacheCon US '09




                           ● Optional per-shard indexes
                                CrawlDB          Updater
                                                                   Shards
                                                                 (segments)
                                                   Link                                Parser
                                LinkDB           inverter


                              URL filters & normalizers, parsing/indexing filters, scoring plugins



                                                                                                     14
Nutch shards (a.k.a. “segments”)
                           Unit of work (batch) – easier to process massive datasets
                           Convenience placeholder, using predefined directory names
                           Unit of deployment to the search infrastructure
                           May include per-shard Lucene indexes
                           Once completed they are basically unmodifiable
                              −   No in-place updates of content, or replacing of obsolete content
                           Periodically phased-out
Nutch – ApacheCon US '09




                                                         200904301234/
                                                         2009043012345
                                                        2009043012345
                                    Generator
                                                         crawl_generate/
                                                         crawl_generate
                                                        crawl_generate
                                                         crawl_fetch/
                                      Fetcher            crawl_fetch
                                                        crawl_fetch
                                                         content/
                                                         content
                                                         crawl_parse/            “cached” view
                                                        content
                                      Parser             crawl_parse
                                                         parse_data/
                                                        crawl_parse
                                                         parse_text/
                                                         parse_data
                                                        parse_data
                                                         indexes/                snippets
                                      Indexer            parse_text
                                                        parse_text
                                                                                                     15
URL life-cycle and shards
                                                                                             observed changes   time
                                                                    real page changes
                               A time-lapse view of reality                                             injected /
                                                                                                      discovered
                               Goals:                                                                  scheduled
                                                                                                      for fetching
                           •   Fresh index → sync with the real                                 S1
                                 rate of changes
                           •   Minimize re-fetching → don't fetch                                       fetched /
                                 unmodified pages                                                      up-to-date
                               Each page may need its
Nutch – ApacheCon US '09




                               own crawl schedule                                                      scheduled
                                                                                                      for fetching
                               Shard management:
                           •   Page may be present in many                                              fetched /
                                 shards, but only the most recent                                      up-to-date
                                 record is valid
                                                                                                       scheduled
                           •   Inconvenient to update in-place,                         S2            for fetching
                                 just mark as deleted
                           •   Phase-out old shards and force re-
                                                                                                        fetched /
                                 fetch of the remaining pages                  S3
                                                                                                       up-to-date
                                                                                                                     16
Crawling frontier
                           No authoritative catalog of web pages
                           Search engines discover their view of web universe
                             −   Start from “seed list”
                             −   Follow (walk) some (useful? interesting?) outlinks
                           Many dangers of simply wandering around
                                 explosion or collapse of the frontier
Nutch – ApacheCon US '09




                             −

                             −   collecting unwanted content (spam, junk, offensive)


                                        I could use
                                      some guidance




                                                                                       17
Controlling the crawling frontier
                           URL filter plugins
                             −   White-list, black-list, regex
                             −   May use external resources
                                  (DB-s, services ...)
                           URL normalizer plugins
Nutch – ApacheCon US '09




                             −   Resolving relative path
                                   elements                         seed
                             −   “Equivalent” URLs                     i=1

                                                                             i=2
                           Additional controls using                               i=3

                           scoring plugins
                             −   priority, metadata select/block
                           Crawler traps – difficult!
                             −   Domain / host / path-level stats                        18
Wide vs. focused crawling
                               Differences:
                           •   Little technical difference in configuration
                           •   Big difference in operations, maintenance and quality
                               Wide crawling:
                                 −   (Almost) Unlimited crawling frontier
                                     High risk of spamming and junk content
Nutch – ApacheCon US '09




                                 −

                                 −   “Politeness” a very important limiting factor
                                 −   Bandwidth & DNS considerations
                               Focused (vertical or enterprise) crawling:
                                 −   Limited crawling frontier
                                 −   Bandwidth or politeness is often not an issue
                                 −   Low risk of spamming and junk content



                                                                                       19
Vertical & enterprise search
                               Vertical search
                           •   Range of selected “reference” sites
                           •   Robust control of the crawling frontier
                           •   Extensive content post-processing
                           •   Business-driven decisions about ranking
Nutch – ApacheCon US '09




                               Enterprise search
                           •   Variety of data sources and data formats
                           •   Well-defined and limited crawling frontier
                           •   Integration with in-house data sources
                           •   Little danger of spam
                           •   PageRank-like scoring usually works poorly
                                                                            20
Face to face with Nutch


                           ?
Nutch – ApacheCon US '09




                                                     21
Simple set-up and running
                               You already have Java 5+ , right?
                               Get a 1.0 release or a nightly build (pretty stable)
                               Simple search setup uses Tomcat for search
                               web app
                               Command-line bash script: bin/nutch
Nutch – ApacheCon US '09




                           •   Windows users: get Cygwin
                           •   Early version of a web-based UI console
                           •   Hadoop web-based monitoring




                                                                                      22
Configuration: files
                               Edit configuration in conf/nutch-site.xml
                           •   Check nutch-default.xml for defaults and docs
                           •   You MUST at least fill the name of your agent
                               Active plugins configuration
                               Per-plugin properties
Nutch – ApacheCon US '09




                               External configuration files
                           •   regex-urlfilter.xml
                           •   regex-normalize.xml
                           •   parse-plugins.xml: mapping of MIME type to plugin



                                                                                   23
Nutch plugins
                               Plugin-based extensions for:
                           •   Crawl scheduling
                           •   URL filtering and normalization
                           •   Protocol for getting the content
                           •   Content parsing
Nutch – ApacheCon US '09




                           •   Text analysis (tokenization)
                           •   Page signature (to detect near-duplicates)
                           •   Indexing filters (index fields & metadata)
                           •   Snippet generation and highlighting
                           •   Scoring and ranking
                           •   Query translation and expansion (user → Lucene)

                                                                             24
Main Nutch workflow
                                                                                Command-line:
                                                                                bin/nutch
                                          Inject: initial creation of CrawlDB    inject
                                      •   Insert seed URLs
                                      •   Initial LinkDB is empty
Nutch – ApacheCon US '09




                                          Generate new shard's fetchlist         generate
                                          Fetch raw content                      fetch
                                          Parse content (discovers outlinks)     parse
                                          Update CrawlDB from shards             updatedb
                                          Update LinkDB from shards              updatelinkdb
                                          Index shards                           index /
                                                                                    solrindex
                           (repeat)                                                             25
Injecting new URL-s

                           Injector   Generator              Fetcher      Searcher



                                                                                       Indexer
Nutch – ApacheCon US '09




                               CrawlDB            Updater
                                                                   Shards
                                                                 (segments)
                                                    Link                               Parser
                                LinkDB            inverter


                              URL filters & normalizers, parsing/indexing filters, scoring plugins



                                                                                                     26
Generating fetchlists

                           Injector   Generator              Fetcher      Searcher



                                                                                       Indexer
Nutch – ApacheCon US '09




                               CrawlDB            Updater
                                                                   Shards
                                                                 (segments)
                                                    Link                               Parser
                                LinkDB            inverter


                              URL filters & normalizers, parsing/indexing filters, scoring plugins



                                                                                                     27
Workflow: generating fetchlists
                               What to fetch next?
                           •   Breadth-first – important due to “politeness” limits
                           •   Expired (longer than fetchTime + fetchInterval)
                           •   Highly ranking (PageRank)
                           •   Newly added
                               Fetchlist generation:
Nutch – ApacheCon US '09




                           •   “topN” - select best candidates
                                     −   Priority based on many factors, pluggable
                               Adaptive fetch schedule
                           •   Detect rate of changes & time of change
                           •   Detect unmodified content
                                 −   Hmm, and how to recognize this? → approximate page
                                      signatures (near-duplicate detection)

                                                                                          28
Fetching content

                           Injector   Generator              Fetcher      Searcher



                                                                                       Indexer
Nutch – ApacheCon US '09




                               CrawlDB            Updater
                                                                   Shards
                                                                 (segments)
                                                    Link                               Parser
                                LinkDB            inverter


                              URL filters & normalizers, parsing/indexing filters, scoring plugins



                                                                                                     29
Workflow: fetching

                               Multi-protocol: HTTP(s), FTP, file://, etc ...
                               Coordinates multiple threads accessing the same
                               host
                               •   “politeness” issues vs. crawling efficiency
                               •   Host: an IP or DNS name?
Nutch – ApacheCon US '09




                               •   Redirections: should follow immediately? What kind?
                               Other netiquette issues:
                           •   robots.txt: disallowed paths, Crawl-Delay
                               Preparation for parsing:
                               •   Content type detection issues
                               •   Parsing usually executed as a separate step
                                   (resource hungry and sometimes unstable)
                                                                                         30
Content processing

                           Injector   Generator              Fetcher      Searcher



                                                                                       Indexer
Nutch – ApacheCon US '09




                               CrawlDB            Updater
                                                                   Shards
                                                                 (segments)
                                                    Link                               Parser
                                LinkDB            inverter


                              URL filters & normalizers, parsing/indexing filters, scoring plugins



                                                                                                     31
Workflow: content processing
                           Protocol plugins retrieve content as plain bytes +
                           protocol-level metadata (e.g. HTTP headers)
                           Parse plugins
                           •   Content is parsed by MIME-type specific parsers
                           •   Content is parsed into parseData (title, outlinks, other
Nutch – ApacheCon US '09




                               metadata) and parseText (plain text content)
                           Nutch supports many popular file formats




                                                                                          32
Link inversion

                           Injector   Generator              Fetcher      Searcher



                                                                                       Indexer
Nutch – ApacheCon US '09




                               CrawlDB            Updater
                                                                   Shards
                                                                 (segments)
                                                    Link                               Parser
                                LinkDB            inverter


                              URL filters & normalizers, parsing/indexing filters, scoring plugins



                                                                                                     33
Workflow: link inversion
                           Pages have outgoing links (outlinks)
                                … I know where I'm pointing to
                           Question: who points to me?
                                … I don't know, there is no catalog of pages
                                … NOBODY knows for sure either!
Nutch – ApacheCon US '09




                           In-degree indicates importance of the page
                           Anchor text provides important semantic info
                           Partial answer: invert the outlinks that I know
                           about, and group by target
                                      tgt 2                        src 2
                            tgt 1                         src 1

                                    src 1      tgt 3              tgt 1     src 3

                            tgt 5      tgt 4              src 5     src 4           34
Page importance - scoring

                           Injector   Generator              Fetcher      Searcher



                                                                                       Indexer
Nutch – ApacheCon US '09




                               CrawlDB            Updater
                                                                   Shards
                                                                 (segments)
                                                    Link                               Parser
                                LinkDB            inverter


                              URL filters & normalizers, parsing/indexing filters, scoring plugins



                                                                                                     35
Workflow: page scoring
                               Query-independent importance factor
                           •   Affects search ranking
                           •   Affects crawl prioritization
                               May include arbitrary decision of the operator
                               Currently two systems (plugins + tools)
Nutch – ApacheCon US '09




                               OPIC scoring
                           •   Doesn't require explicit steps - “online”
                           •   Difficult to stabilize
                               PageRank scoring with loop detection
                           •   Periodically run to update CrawlDb scores
                           •   Computationally intensive, esp. loop detection

                                                                                36
Indexing

                           Injector   Generator              Fetcher      Searcher



                                                                                       Indexer
Nutch – ApacheCon US '09




                               CrawlDB            Updater
                                                                   Shards
                                                                 (segments)
                                                    Link                               Parser
                                LinkDB            inverter


                              URL filters & normalizers, parsing/indexing filters, scoring plugins



                                                                                                     37
Workflow: indexing
                               Indexing plugins
                           •   Create full-text search index from segment data
                           •   Apply adjustments to score per document
                           •   May further post-process the parsed content (e.g.
                                language identification) to facilitate advanced search
Nutch – ApacheCon US '09




                               Indexers
                           •   Lucene indexer – builds indexes to be served by Nutch
                                search servers
                           •   Solr indexer – submits documents to a Solr instance




                                                                                         38
De-duplication

                           Injector   Generator              Fetcher      Searcher



                                                                                       Indexer
Nutch – ApacheCon US '09




                               CrawlDB            Updater
                                                                   Shards
                                                                 (segments)
                                                    Link                               Parser
                                LinkDB            inverter


                              URL filters & normalizers, parsing/indexing filters, scoring plugins



                                                                                                     39
Workflow: de-duplication
                               The same page may be present in many shards
                           •   Obsolete versions in older shards
                           •   Mirrored pages, or equivalent (a.com → www.a.com)
                               Many other pages are almost identical
                           •   Template-related differences (banners, current date)
                               Font / layout changes, minor re-wording
Nutch – ApacheCon US '09




                           •


                               Hmm … what is a significant change ???
                           •   Tricky issue! Hint: what is the page content?

                               Near-duplicate detection and removal
                           •   Nutch uses approximate page signatures (fingerprints)
                           •   Duplicates are only marked as deleted
                                                                                       40
Cycles may overlap

                           Injector   Generator              Fetcher      Searcher



                                                                                       Indexer
Nutch – ApacheCon US '09




                               CrawlDB            Updater
                                                                   Shards
                                                                 (segments)
                                                    Link                               Parser
                                LinkDB            inverter


                              URL filters & normalizers, parsing/indexing filters, scoring plugins



                                                                                                     41
Nutch distributed search
                                                 Search front-end(s)
                                                                    Web app
                           Search server                              API
                           (Nutch or Solr)        NutchBean
                                                                      XML
                                                                     JSON
Nutch – ApacheCon US '09




                            indexed shards
                                                 Multiple search servers
                                                 Search front-end dispatches
                                                 queries and collects search
                                                 results
                                             •   multiple access methods (API,
                                                  OpenSearch XML, JSON)
                                                 Page servers build snippets
                                                 Fault-tolerant *
                                                            * with degraded quality   42
Search configuration
                               Nutch syntax is limited – on purpose!
                           •   Some queries are costly, e.g. leading wildcard, very long
                           •   Some queries may need implicit (hidden) expansion
                               Query plugins
                           •   From user query to Lucene/Solr query
                               User query: web search
Nutch – ApacheCon US '09




                               +(url:web^4.0 anchor:web^2.0 content:web title:web^1.5 host:web^2.0)
                               +(url:search^4.0 anchor:search^2.0 content:search title:search^1.5
                               host:search^2.0) url:"web search"~10^4.0 anchor:"web search"~4^2.0
                               content:"web search"~10 title:"web search"~10^1.5
                               host:"web search"~10^2.0

                               Search server configuration
                           •   Single search vs. distributed
                           •   Using Nutch searcher, or Solr, or a mix
                           •   Currently no global IDF calculation
                                                                                                      43
Deployment: single-server
                           The easiest but the most limited deployment
                           Centralized storage, centralized processing
                           Hadoop LocalFS and LocalJobTracker
                           Drop nutch.war in Tomcat/webapps and point to shards
                              Search
                               Search                               Fetcher
                             front-end                               Fetcher
                              front-end
Nutch – ApacheCon US '09




                                                 LocalFS
                                                  LocalFS           Indexer
                                                 Storage             Indexer
                                                  Storage

                                                                   DB mgmt
                                                                    DB mgmt




                                                                                  44
Deployment: distributed search
                           Local storage on search servers preferred (perf.)


                               Search
                                Search
                             front-end
                              front-end                           Fetcher
                                                                   Fetcher
Nutch – ApacheCon US '09




                                                LocalFS
                                                 LocalFS          Indexer
                                                Storage            Indexer
                                                 Storage
                           Search server
                            Search server                        DB mgmt
                              Search server                      DB mgmt




                                                                               45
Deployment: map-reduce
                             processing & distrib. search
                           Fully distributed storage and processing
                           Local storage on search servers preferred (perf.)
                                                                  (Job Tracker)
                               Search
                                Search
                                                                   (Job Tracker)
                                                                    Fetcher
                                              DFS Name Node
                                               DFS Name Node         Fetcher
                             front-end
                              front-end                             Indexer
                                                                     Indexer
Nutch – ApacheCon US '09




                                                                   DB mgmt
                                                                    DB mgmt




                           Search server      DFS Data Node      Task Tracker
                            Search server      DFS Data Node       Task Tracker
                              Search server      DFS Data Node       Task Tracker




                                                                                    46
Nutch on a Hadoop cluster
                           Assumes an up & running Hadoop cluster
                                 … which is covered elsewhere
                           Build nutch.job jar and use it as usual:

                           bin/hadoop jar nutch.job <className> <args...>
Nutch – ApacheCon US '09




                           Note: Nutch configuration is inside nutch.job
                             −   When you change it you need to rebuild job jar


                           Searching is not a map-reduce job – often on a
                           separate group of machines

                                                                                  47
Nutch – ApacheCon US '09




48
Nutch – ApacheCon US '09



                                Nutch index




49
Nutch – ApacheCon US '09



                                … and press
                                Search




50
Conclusions
                                (This overview is a tip of the iceberg)

                                               Nutch
                           Implements all core search engine components
                           Scales well
Nutch – ApacheCon US '09




                           Extremely configurable and modular
                           It's a complete solution – and a toolkit




                                                                          51
Future of Nutch
                               Avoid code duplication
                               Parsing → Tika
                           •   Almost total overlap
                           •   Still some missing functionality in Tika → contribute
                               Plugins → OSGI
                           •   Home-grown plugin system has some deficiencies
Nutch – ApacheCon US '09




                           •   Initial port available
                               Indexing & Search → Solr, Zookeeper
                           •   Distributed and replicated search is difficult
                           •   Initial integration needs significant improvement
                           •   Shard management - Katta?
                               Web-graph & page repository → HBase
                           •   Combine CrawlDB, LinkDB and shard storage
                           •   Avoid tedious shard management
                           •   Initial port available                                  52
Future of Nutch (2)
                               What's left then?
                           •   Crawling frontier management, discovery
                           •   Re-crawl algorithms
                           •   Spider trap handling
                           •   Fetcher
                               Ranking: enterprise-specific, user-feedback
Nutch – ApacheCon US '09




                           •
                           •   Duplicate detection, URL aliasing (mirror detection)
                           •   Template detection and cleanup, pagelet-level crawling
                           •   Spam & junk control
                               Share code with other crawler projects →
                               crawler-commons
                               Vision: á la carte toolkit, scalable from
                               1-1000s nodes
                                                                                        53
Summary

                           Q&A

                           Further information:
                           •   http://lucene.apache.org/nutch
Nutch – ApacheCon US '09




                                                                54

More Related Content

What's hot

Large Scale Crawling with Apache Nutch and Friends
Large Scale Crawling with Apache Nutch and FriendsLarge Scale Crawling with Apache Nutch and Friends
Large Scale Crawling with Apache Nutch and Friendslucenerevolution
 
StormCrawler at Bristech
StormCrawler at BristechStormCrawler at Bristech
StormCrawler at BristechJulien Nioche
 
StormCrawler in the wild
StormCrawler in the wildStormCrawler in the wild
StormCrawler in the wildJulien Nioche
 
Building a Scalable Web Crawler with Hadoop
Building a Scalable Web Crawler with HadoopBuilding a Scalable Web Crawler with Hadoop
Building a Scalable Web Crawler with HadoopHadoop User Group
 
Low latency scalable web crawling on Apache Storm
Low latency scalable web crawling on Apache StormLow latency scalable web crawling on Apache Storm
Low latency scalable web crawling on Apache StormJulien Nioche
 
Case study of Rujhaan.com (A social news app )
Case study of Rujhaan.com (A social news app )Case study of Rujhaan.com (A social news app )
Case study of Rujhaan.com (A social news app )Rahul Jain
 
Accessing external hadoop data sources using pivotal e xtension framework (px...
Accessing external hadoop data sources using pivotal e xtension framework (px...Accessing external hadoop data sources using pivotal e xtension framework (px...
Accessing external hadoop data sources using pivotal e xtension framework (px...Sameer Tiwari
 
Building a Large Scale SEO/SEM Application with Apache Solr
Building a Large Scale SEO/SEM Application with Apache SolrBuilding a Large Scale SEO/SEM Application with Apache Solr
Building a Large Scale SEO/SEM Application with Apache SolrRahul Jain
 
quick intro to elastic search
quick intro to elastic search quick intro to elastic search
quick intro to elastic search medcl
 
Ingesting hdfs intosolrusingsparktrimmed
Ingesting hdfs intosolrusingsparktrimmedIngesting hdfs intosolrusingsparktrimmed
Ingesting hdfs intosolrusingsparktrimmedwhoschek
 
Tajo: A Distributed Data Warehouse System for Hadoop
Tajo: A Distributed Data Warehouse System for HadoopTajo: A Distributed Data Warehouse System for Hadoop
Tajo: A Distributed Data Warehouse System for HadoopHyunsik Choi
 
Hive Quick Start Tutorial
Hive Quick Start TutorialHive Quick Start Tutorial
Hive Quick Start TutorialCarl Steinbach
 
Centralized log-management-with-elastic-stack
Centralized log-management-with-elastic-stackCentralized log-management-with-elastic-stack
Centralized log-management-with-elastic-stackRich Lee
 
Solr + Hadoop = Big Data Search
Solr + Hadoop = Big Data SearchSolr + Hadoop = Big Data Search
Solr + Hadoop = Big Data SearchMark Miller
 
Introduction to elasticsearch
Introduction to elasticsearchIntroduction to elasticsearch
Introduction to elasticsearchpmanvi
 
Apache Tajo - Bay Area HUG Nov. 2013 LinkedIn Special Event
Apache Tajo - Bay Area HUG Nov. 2013 LinkedIn Special EventApache Tajo - Bay Area HUG Nov. 2013 LinkedIn Special Event
Apache Tajo - Bay Area HUG Nov. 2013 LinkedIn Special EventGruter
 
On-premise Spark as a Service with YARN
On-premise Spark as a Service with YARN On-premise Spark as a Service with YARN
On-premise Spark as a Service with YARN Jim Dowling
 
Intro to elasticsearch
Intro to elasticsearchIntro to elasticsearch
Intro to elasticsearchJoey Wen
 
AWS Customer Presentation: Freie Univerisitat - Berlin Summit 2012
AWS Customer Presentation: Freie Univerisitat - Berlin Summit 2012AWS Customer Presentation: Freie Univerisitat - Berlin Summit 2012
AWS Customer Presentation: Freie Univerisitat - Berlin Summit 2012Amazon Web Services
 

What's hot (20)

Large Scale Crawling with Apache Nutch and Friends
Large Scale Crawling with Apache Nutch and FriendsLarge Scale Crawling with Apache Nutch and Friends
Large Scale Crawling with Apache Nutch and Friends
 
StormCrawler at Bristech
StormCrawler at BristechStormCrawler at Bristech
StormCrawler at Bristech
 
StormCrawler in the wild
StormCrawler in the wildStormCrawler in the wild
StormCrawler in the wild
 
Building a Scalable Web Crawler with Hadoop
Building a Scalable Web Crawler with HadoopBuilding a Scalable Web Crawler with Hadoop
Building a Scalable Web Crawler with Hadoop
 
Low latency scalable web crawling on Apache Storm
Low latency scalable web crawling on Apache StormLow latency scalable web crawling on Apache Storm
Low latency scalable web crawling on Apache Storm
 
Case study of Rujhaan.com (A social news app )
Case study of Rujhaan.com (A social news app )Case study of Rujhaan.com (A social news app )
Case study of Rujhaan.com (A social news app )
 
Accessing external hadoop data sources using pivotal e xtension framework (px...
Accessing external hadoop data sources using pivotal e xtension framework (px...Accessing external hadoop data sources using pivotal e xtension framework (px...
Accessing external hadoop data sources using pivotal e xtension framework (px...
 
Building a Large Scale SEO/SEM Application with Apache Solr
Building a Large Scale SEO/SEM Application with Apache SolrBuilding a Large Scale SEO/SEM Application with Apache Solr
Building a Large Scale SEO/SEM Application with Apache Solr
 
quick intro to elastic search
quick intro to elastic search quick intro to elastic search
quick intro to elastic search
 
Ingesting hdfs intosolrusingsparktrimmed
Ingesting hdfs intosolrusingsparktrimmedIngesting hdfs intosolrusingsparktrimmed
Ingesting hdfs intosolrusingsparktrimmed
 
Tajo: A Distributed Data Warehouse System for Hadoop
Tajo: A Distributed Data Warehouse System for HadoopTajo: A Distributed Data Warehouse System for Hadoop
Tajo: A Distributed Data Warehouse System for Hadoop
 
Hive Quick Start Tutorial
Hive Quick Start TutorialHive Quick Start Tutorial
Hive Quick Start Tutorial
 
Centralized log-management-with-elastic-stack
Centralized log-management-with-elastic-stackCentralized log-management-with-elastic-stack
Centralized log-management-with-elastic-stack
 
Solr + Hadoop = Big Data Search
Solr + Hadoop = Big Data SearchSolr + Hadoop = Big Data Search
Solr + Hadoop = Big Data Search
 
Introduction to elasticsearch
Introduction to elasticsearchIntroduction to elasticsearch
Introduction to elasticsearch
 
Apache Tajo - Bay Area HUG Nov. 2013 LinkedIn Special Event
Apache Tajo - Bay Area HUG Nov. 2013 LinkedIn Special EventApache Tajo - Bay Area HUG Nov. 2013 LinkedIn Special Event
Apache Tajo - Bay Area HUG Nov. 2013 LinkedIn Special Event
 
On-premise Spark as a Service with YARN
On-premise Spark as a Service with YARN On-premise Spark as a Service with YARN
On-premise Spark as a Service with YARN
 
Intro to elasticsearch
Intro to elasticsearchIntro to elasticsearch
Intro to elasticsearch
 
AWS Customer Presentation: Freie Univerisitat - Berlin Summit 2012
AWS Customer Presentation: Freie Univerisitat - Berlin Summit 2012AWS Customer Presentation: Freie Univerisitat - Berlin Summit 2012
AWS Customer Presentation: Freie Univerisitat - Berlin Summit 2012
 
Elasticsearch Introduction
Elasticsearch IntroductionElasticsearch Introduction
Elasticsearch Introduction
 

Similar to Nutch - web-scale search engine toolkit

Pittaro open stackloganalysis_20130416
Pittaro open stackloganalysis_20130416Pittaro open stackloganalysis_20130416
Pittaro open stackloganalysis_20130416OpenStack Foundation
 
OpenNaaS Overview Complete
OpenNaaS Overview CompleteOpenNaaS Overview Complete
OpenNaaS Overview CompleteJoan Garcia
 
Net flowhadoop flocon2013_yhlee_final
Net flowhadoop flocon2013_yhlee_finalNet flowhadoop flocon2013_yhlee_final
Net flowhadoop flocon2013_yhlee_finalYeounhee Lee
 
Stream your Operational Data with Apache Spark & Kafka into Hadoop using Couc...
Stream your Operational Data with Apache Spark & Kafka into Hadoop using Couc...Stream your Operational Data with Apache Spark & Kafka into Hadoop using Couc...
Stream your Operational Data with Apache Spark & Kafka into Hadoop using Couc...Data Con LA
 
Storage and-compute-hdfs-map reduce
Storage and-compute-hdfs-map reduceStorage and-compute-hdfs-map reduce
Storage and-compute-hdfs-map reduceChris Nauroth
 
Architecting the Future of Big Data & Search - Eric Baldeschwieler
Architecting the Future of Big Data & Search - Eric BaldeschwielerArchitecting the Future of Big Data & Search - Eric Baldeschwieler
Architecting the Future of Big Data & Search - Eric Baldeschwielerlucenerevolution
 
Deploying Grid Services Using Hadoop
Deploying Grid Services Using HadoopDeploying Grid Services Using Hadoop
Deploying Grid Services Using HadoopGeorge Ang
 
OpenNaas overview
OpenNaas overviewOpenNaas overview
OpenNaas overviewPauMinoves
 
The Impact of Columnar File Formats on SQL-on-Hadoop Engine Performance: A St...
The Impact of Columnar File Formats on SQL-on-Hadoop Engine Performance: A St...The Impact of Columnar File Formats on SQL-on-Hadoop Engine Performance: A St...
The Impact of Columnar File Formats on SQL-on-Hadoop Engine Performance: A St...t_ivanov
 
Difference between apache spark and apache nifi
Difference between apache spark and apache nifiDifference between apache spark and apache nifi
Difference between apache spark and apache nifiGaneshJoshi47
 
Apache frameworks for Big and Fast Data
Apache frameworks for Big and Fast DataApache frameworks for Big and Fast Data
Apache frameworks for Big and Fast DataNaveen Korakoppa
 
Storm crawler apachecon_na_2015
Storm crawler apachecon_na_2015Storm crawler apachecon_na_2015
Storm crawler apachecon_na_2015ontopic
 
Mapreduce over snapshots
Mapreduce over snapshotsMapreduce over snapshots
Mapreduce over snapshotsenissoz
 
The Evolution of the Hadoop Ecosystem
The Evolution of the Hadoop EcosystemThe Evolution of the Hadoop Ecosystem
The Evolution of the Hadoop EcosystemCloudera, Inc.
 
Intro to Big Data Analytics using Apache Spark and Apache Zeppelin
Intro to Big Data Analytics using Apache Spark and Apache ZeppelinIntro to Big Data Analytics using Apache Spark and Apache Zeppelin
Intro to Big Data Analytics using Apache Spark and Apache ZeppelinAlex Zeltov
 

Similar to Nutch - web-scale search engine toolkit (20)

Hadoop For OpenStack Log Analysis
Hadoop For OpenStack Log AnalysisHadoop For OpenStack Log Analysis
Hadoop For OpenStack Log Analysis
 
Pittaro open stackloganalysis_20130416
Pittaro open stackloganalysis_20130416Pittaro open stackloganalysis_20130416
Pittaro open stackloganalysis_20130416
 
Common crawlpresentation
Common crawlpresentationCommon crawlpresentation
Common crawlpresentation
 
OpenNaaS Overview Complete
OpenNaaS Overview CompleteOpenNaaS Overview Complete
OpenNaaS Overview Complete
 
Net flowhadoop flocon2013_yhlee_final
Net flowhadoop flocon2013_yhlee_finalNet flowhadoop flocon2013_yhlee_final
Net flowhadoop flocon2013_yhlee_final
 
Stream your Operational Data with Apache Spark & Kafka into Hadoop using Couc...
Stream your Operational Data with Apache Spark & Kafka into Hadoop using Couc...Stream your Operational Data with Apache Spark & Kafka into Hadoop using Couc...
Stream your Operational Data with Apache Spark & Kafka into Hadoop using Couc...
 
Storage and-compute-hdfs-map reduce
Storage and-compute-hdfs-map reduceStorage and-compute-hdfs-map reduce
Storage and-compute-hdfs-map reduce
 
Architecting the Future of Big Data & Search - Eric Baldeschwieler
Architecting the Future of Big Data & Search - Eric BaldeschwielerArchitecting the Future of Big Data & Search - Eric Baldeschwieler
Architecting the Future of Big Data & Search - Eric Baldeschwieler
 
Deploying Grid Services Using Hadoop
Deploying Grid Services Using HadoopDeploying Grid Services Using Hadoop
Deploying Grid Services Using Hadoop
 
Data streaming
Data streamingData streaming
Data streaming
 
OpenNaas overview
OpenNaas overviewOpenNaas overview
OpenNaas overview
 
The Impact of Columnar File Formats on SQL-on-Hadoop Engine Performance: A St...
The Impact of Columnar File Formats on SQL-on-Hadoop Engine Performance: A St...The Impact of Columnar File Formats on SQL-on-Hadoop Engine Performance: A St...
The Impact of Columnar File Formats on SQL-on-Hadoop Engine Performance: A St...
 
Difference between apache spark and apache nifi
Difference between apache spark and apache nifiDifference between apache spark and apache nifi
Difference between apache spark and apache nifi
 
Apache frameworks for Big and Fast Data
Apache frameworks for Big and Fast DataApache frameworks for Big and Fast Data
Apache frameworks for Big and Fast Data
 
Storm crawler apachecon_na_2015
Storm crawler apachecon_na_2015Storm crawler apachecon_na_2015
Storm crawler apachecon_na_2015
 
Mapreduce over snapshots
Mapreduce over snapshotsMapreduce over snapshots
Mapreduce over snapshots
 
The Evolution of the Hadoop Ecosystem
The Evolution of the Hadoop EcosystemThe Evolution of the Hadoop Ecosystem
The Evolution of the Hadoop Ecosystem
 
October 2013 HUG: HBase 0.96
October 2013 HUG: HBase 0.96October 2013 HUG: HBase 0.96
October 2013 HUG: HBase 0.96
 
Intro to Big Data Analytics using Apache Spark and Apache Zeppelin
Intro to Big Data Analytics using Apache Spark and Apache ZeppelinIntro to Big Data Analytics using Apache Spark and Apache Zeppelin
Intro to Big Data Analytics using Apache Spark and Apache Zeppelin
 
Hadoop
HadoopHadoop
Hadoop
 

Recently uploaded

DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningLars Bell
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersRaghuram Pandurangan
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxhariprasad279825
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxLoriGlavin3
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
unit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxunit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxBkGupta21
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESSALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESmohitsingh558521
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfMounikaPolabathina
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxLoriGlavin3
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxLoriGlavin3
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 

Recently uploaded (20)

DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine Tuning
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information Developers
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptx
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
unit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxunit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptx
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESSALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdf
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 

Nutch - web-scale search engine toolkit

  • 1. Apache Nutch – ApacheCon US '09 Web-scale search engine toolkit Today and tomorrow Andrzej Białecki ab@apache.org
  • 2. Agenda About the project Web crawling in general Nutch architecture overview Nutch workflow: • Setup Nutch – ApacheCon US '09 • Crawling • Searching Challenges (and some solutions) Nutch present and future Questions and answers 2
  • 3. Apache Nutch project Founded in 2003 by Doug Cutting, the Lucene creator, and Mike Cafarella Apache project since 2004 (sub-project of Lucene) Spin-offs: Map-Reduce and distributed FS → Hadoop Nutch – ApacheCon US '09 • • Content type detection and parsing → Tika Many installations in operation, mostly vertical search Collections typically 1 mln - 200 mln documents 3
  • 5. Web as a directed graph Nodes (vertices): URL-s as unique identifiers Edges (links): hyperlinks like <a href=”targetUrl”/> Edge labels: <a href=”..”>anchor text</a> Often represented as adjacency (neighbor) lists Traversal: follow the edges, breadth-first, depth- Nutch – ApacheCon US '09 first, random 8 7 9 3 2 1 → 2, 3, 4, 5, 6 1 4 5 → 6, 9 7 → 3, 4, 8, 9 6 5 5
  • 6. What's in a search engine? … a few things that may surprise you!  Nutch – ApacheCon US '09 6
  • 7. Search engine building blocks Injector Scheduler Crawler Searcher Indexer Nutch – ApacheCon US '09 Web graph Updater Content -page info -links (in/out) repository Parser Crawling frontier controls 7
  • 8. Nutch features at a glance Page database and link database (web graph) Plugin-based, highly modular: − Most behavior can be changed via plugins Multi-protocol, multi-threaded, distributed crawler Plugin-based content processing (parsing, filtering) Nutch – ApacheCon US '09 Robust crawling frontier controls Scalable data processing framework − Map-reduce processing Full-text indexer & search engine − Using Lucene or Solr − Support for distributed search Robust API and integration options 8
  • 9. Hadoop foundation File system abstraction • Local FS, or • Distributed FS − also Amazon S3, Kosmos and other FS impl. Map-reduce processing Nutch – ApacheCon US '09 • Currently central to Nutch algorithms • Processing tasks are executed as one or more map- reduce jobs • Data maintained as Hadoop MapFile-s / SequenceFile-s • Massive updates very efficient • Small updates costly − Hadoop data is immutable, once created − Updates are really merge & replace 9
  • 10. Search engine building blocks Injector Scheduler Crawler Searcher Indexer Nutch – ApacheCon US '09 Web graph Updater Content -page info -links (in/out) repository Parser Crawling frontier controls 10
  • 11. Nutch building blocks Injector Generator Fetcher Searcher Indexer Nutch – ApacheCon US '09 CrawlDB Updater Shards (segments) Link Parser LinkDB inverter URL filters & normalizers, parsing/indexing filters, scoring plugins 11
  • 12. Nutch data Maintains info on all known URL-s: Injector Generator ● Fetch schedule Fetcher Searcher ● Fetch status ● Page signature ● Metadata Indexer Nutch – ApacheCon US '09 CrawlDB Updater Shards (segments) Link Parser LinkDB inverter URL filters & normalizers, parsing/indexing filters, scoring plugins 12
  • 13. Nutch data Injector Generator For each Fetcher target URL keeps info on Searcher incoming links, i.e. list of source URL-s and their associated anchor text Indexer Nutch – ApacheCon US '09 CrawlDB Updater Shards (segments) Link Parser LinkDB inverter URL filters & normalizers, parsing/indexing filters, scoring plugins 13
  • 14. Nutch data Shards (“segments”) keep: ● Raw page content Injector content + discovered ● Parsed Generator Fetcher Searcher metadata + outlinks ● Plain text for indexing and snippets Indexer Nutch – ApacheCon US '09 ● Optional per-shard indexes CrawlDB Updater Shards (segments) Link Parser LinkDB inverter URL filters & normalizers, parsing/indexing filters, scoring plugins 14
  • 15. Nutch shards (a.k.a. “segments”) Unit of work (batch) – easier to process massive datasets Convenience placeholder, using predefined directory names Unit of deployment to the search infrastructure May include per-shard Lucene indexes Once completed they are basically unmodifiable − No in-place updates of content, or replacing of obsolete content Periodically phased-out Nutch – ApacheCon US '09 200904301234/ 2009043012345 2009043012345 Generator crawl_generate/ crawl_generate crawl_generate crawl_fetch/ Fetcher crawl_fetch crawl_fetch content/ content crawl_parse/ “cached” view content Parser crawl_parse parse_data/ crawl_parse parse_text/ parse_data parse_data indexes/ snippets Indexer parse_text parse_text 15
  • 16. URL life-cycle and shards observed changes time real page changes A time-lapse view of reality injected / discovered Goals: scheduled for fetching • Fresh index → sync with the real S1 rate of changes • Minimize re-fetching → don't fetch fetched / unmodified pages up-to-date Each page may need its Nutch – ApacheCon US '09 own crawl schedule scheduled for fetching Shard management: • Page may be present in many fetched / shards, but only the most recent up-to-date record is valid scheduled • Inconvenient to update in-place, S2 for fetching just mark as deleted • Phase-out old shards and force re- fetched / fetch of the remaining pages S3 up-to-date 16
  • 17. Crawling frontier No authoritative catalog of web pages Search engines discover their view of web universe − Start from “seed list” − Follow (walk) some (useful? interesting?) outlinks Many dangers of simply wandering around explosion or collapse of the frontier Nutch – ApacheCon US '09 − − collecting unwanted content (spam, junk, offensive) I could use some guidance 17
  • 18. Controlling the crawling frontier URL filter plugins − White-list, black-list, regex − May use external resources (DB-s, services ...) URL normalizer plugins Nutch – ApacheCon US '09 − Resolving relative path elements seed − “Equivalent” URLs i=1 i=2 Additional controls using i=3 scoring plugins − priority, metadata select/block Crawler traps – difficult! − Domain / host / path-level stats 18
  • 19. Wide vs. focused crawling Differences: • Little technical difference in configuration • Big difference in operations, maintenance and quality Wide crawling: − (Almost) Unlimited crawling frontier High risk of spamming and junk content Nutch – ApacheCon US '09 − − “Politeness” a very important limiting factor − Bandwidth & DNS considerations Focused (vertical or enterprise) crawling: − Limited crawling frontier − Bandwidth or politeness is often not an issue − Low risk of spamming and junk content 19
  • 20. Vertical & enterprise search Vertical search • Range of selected “reference” sites • Robust control of the crawling frontier • Extensive content post-processing • Business-driven decisions about ranking Nutch – ApacheCon US '09 Enterprise search • Variety of data sources and data formats • Well-defined and limited crawling frontier • Integration with in-house data sources • Little danger of spam • PageRank-like scoring usually works poorly 20
  • 21. Face to face with Nutch ? Nutch – ApacheCon US '09 21
  • 22. Simple set-up and running You already have Java 5+ , right? Get a 1.0 release or a nightly build (pretty stable) Simple search setup uses Tomcat for search web app Command-line bash script: bin/nutch Nutch – ApacheCon US '09 • Windows users: get Cygwin • Early version of a web-based UI console • Hadoop web-based monitoring 22
  • 23. Configuration: files Edit configuration in conf/nutch-site.xml • Check nutch-default.xml for defaults and docs • You MUST at least fill the name of your agent Active plugins configuration Per-plugin properties Nutch – ApacheCon US '09 External configuration files • regex-urlfilter.xml • regex-normalize.xml • parse-plugins.xml: mapping of MIME type to plugin 23
  • 24. Nutch plugins Plugin-based extensions for: • Crawl scheduling • URL filtering and normalization • Protocol for getting the content • Content parsing Nutch – ApacheCon US '09 • Text analysis (tokenization) • Page signature (to detect near-duplicates) • Indexing filters (index fields & metadata) • Snippet generation and highlighting • Scoring and ranking • Query translation and expansion (user → Lucene) 24
  • 25. Main Nutch workflow Command-line: bin/nutch Inject: initial creation of CrawlDB inject • Insert seed URLs • Initial LinkDB is empty Nutch – ApacheCon US '09 Generate new shard's fetchlist generate Fetch raw content fetch Parse content (discovers outlinks) parse Update CrawlDB from shards updatedb Update LinkDB from shards updatelinkdb Index shards index / solrindex (repeat) 25
  • 26. Injecting new URL-s Injector Generator Fetcher Searcher Indexer Nutch – ApacheCon US '09 CrawlDB Updater Shards (segments) Link Parser LinkDB inverter URL filters & normalizers, parsing/indexing filters, scoring plugins 26
  • 27. Generating fetchlists Injector Generator Fetcher Searcher Indexer Nutch – ApacheCon US '09 CrawlDB Updater Shards (segments) Link Parser LinkDB inverter URL filters & normalizers, parsing/indexing filters, scoring plugins 27
  • 28. Workflow: generating fetchlists What to fetch next? • Breadth-first – important due to “politeness” limits • Expired (longer than fetchTime + fetchInterval) • Highly ranking (PageRank) • Newly added Fetchlist generation: Nutch – ApacheCon US '09 • “topN” - select best candidates − Priority based on many factors, pluggable Adaptive fetch schedule • Detect rate of changes & time of change • Detect unmodified content − Hmm, and how to recognize this? → approximate page signatures (near-duplicate detection) 28
  • 29. Fetching content Injector Generator Fetcher Searcher Indexer Nutch – ApacheCon US '09 CrawlDB Updater Shards (segments) Link Parser LinkDB inverter URL filters & normalizers, parsing/indexing filters, scoring plugins 29
  • 30. Workflow: fetching Multi-protocol: HTTP(s), FTP, file://, etc ... Coordinates multiple threads accessing the same host • “politeness” issues vs. crawling efficiency • Host: an IP or DNS name? Nutch – ApacheCon US '09 • Redirections: should follow immediately? What kind? Other netiquette issues: • robots.txt: disallowed paths, Crawl-Delay Preparation for parsing: • Content type detection issues • Parsing usually executed as a separate step (resource hungry and sometimes unstable) 30
  • 31. Content processing Injector Generator Fetcher Searcher Indexer Nutch – ApacheCon US '09 CrawlDB Updater Shards (segments) Link Parser LinkDB inverter URL filters & normalizers, parsing/indexing filters, scoring plugins 31
  • 32. Workflow: content processing Protocol plugins retrieve content as plain bytes + protocol-level metadata (e.g. HTTP headers) Parse plugins • Content is parsed by MIME-type specific parsers • Content is parsed into parseData (title, outlinks, other Nutch – ApacheCon US '09 metadata) and parseText (plain text content) Nutch supports many popular file formats 32
  • 33. Link inversion Injector Generator Fetcher Searcher Indexer Nutch – ApacheCon US '09 CrawlDB Updater Shards (segments) Link Parser LinkDB inverter URL filters & normalizers, parsing/indexing filters, scoring plugins 33
  • 34. Workflow: link inversion Pages have outgoing links (outlinks) … I know where I'm pointing to Question: who points to me? … I don't know, there is no catalog of pages … NOBODY knows for sure either! Nutch – ApacheCon US '09 In-degree indicates importance of the page Anchor text provides important semantic info Partial answer: invert the outlinks that I know about, and group by target tgt 2 src 2 tgt 1 src 1 src 1 tgt 3 tgt 1 src 3 tgt 5 tgt 4 src 5 src 4 34
  • 35. Page importance - scoring Injector Generator Fetcher Searcher Indexer Nutch – ApacheCon US '09 CrawlDB Updater Shards (segments) Link Parser LinkDB inverter URL filters & normalizers, parsing/indexing filters, scoring plugins 35
  • 36. Workflow: page scoring Query-independent importance factor • Affects search ranking • Affects crawl prioritization May include arbitrary decision of the operator Currently two systems (plugins + tools) Nutch – ApacheCon US '09 OPIC scoring • Doesn't require explicit steps - “online” • Difficult to stabilize PageRank scoring with loop detection • Periodically run to update CrawlDb scores • Computationally intensive, esp. loop detection 36
  • 37. Indexing Injector Generator Fetcher Searcher Indexer Nutch – ApacheCon US '09 CrawlDB Updater Shards (segments) Link Parser LinkDB inverter URL filters & normalizers, parsing/indexing filters, scoring plugins 37
  • 38. Workflow: indexing Indexing plugins • Create full-text search index from segment data • Apply adjustments to score per document • May further post-process the parsed content (e.g. language identification) to facilitate advanced search Nutch – ApacheCon US '09 Indexers • Lucene indexer – builds indexes to be served by Nutch search servers • Solr indexer – submits documents to a Solr instance 38
  • 39. De-duplication Injector Generator Fetcher Searcher Indexer Nutch – ApacheCon US '09 CrawlDB Updater Shards (segments) Link Parser LinkDB inverter URL filters & normalizers, parsing/indexing filters, scoring plugins 39
  • 40. Workflow: de-duplication The same page may be present in many shards • Obsolete versions in older shards • Mirrored pages, or equivalent (a.com → www.a.com) Many other pages are almost identical • Template-related differences (banners, current date) Font / layout changes, minor re-wording Nutch – ApacheCon US '09 • Hmm … what is a significant change ??? • Tricky issue! Hint: what is the page content? Near-duplicate detection and removal • Nutch uses approximate page signatures (fingerprints) • Duplicates are only marked as deleted 40
  • 41. Cycles may overlap Injector Generator Fetcher Searcher Indexer Nutch – ApacheCon US '09 CrawlDB Updater Shards (segments) Link Parser LinkDB inverter URL filters & normalizers, parsing/indexing filters, scoring plugins 41
  • 42. Nutch distributed search Search front-end(s) Web app Search server API (Nutch or Solr) NutchBean XML JSON Nutch – ApacheCon US '09 indexed shards Multiple search servers Search front-end dispatches queries and collects search results • multiple access methods (API, OpenSearch XML, JSON) Page servers build snippets Fault-tolerant * * with degraded quality 42
  • 43. Search configuration Nutch syntax is limited – on purpose! • Some queries are costly, e.g. leading wildcard, very long • Some queries may need implicit (hidden) expansion Query plugins • From user query to Lucene/Solr query User query: web search Nutch – ApacheCon US '09 +(url:web^4.0 anchor:web^2.0 content:web title:web^1.5 host:web^2.0) +(url:search^4.0 anchor:search^2.0 content:search title:search^1.5 host:search^2.0) url:"web search"~10^4.0 anchor:"web search"~4^2.0 content:"web search"~10 title:"web search"~10^1.5 host:"web search"~10^2.0 Search server configuration • Single search vs. distributed • Using Nutch searcher, or Solr, or a mix • Currently no global IDF calculation 43
  • 44. Deployment: single-server The easiest but the most limited deployment Centralized storage, centralized processing Hadoop LocalFS and LocalJobTracker Drop nutch.war in Tomcat/webapps and point to shards Search Search Fetcher front-end Fetcher front-end Nutch – ApacheCon US '09 LocalFS LocalFS Indexer Storage Indexer Storage DB mgmt DB mgmt 44
  • 45. Deployment: distributed search Local storage on search servers preferred (perf.) Search Search front-end front-end Fetcher Fetcher Nutch – ApacheCon US '09 LocalFS LocalFS Indexer Storage Indexer Storage Search server Search server DB mgmt Search server DB mgmt 45
  • 46. Deployment: map-reduce processing & distrib. search Fully distributed storage and processing Local storage on search servers preferred (perf.) (Job Tracker) Search Search (Job Tracker) Fetcher DFS Name Node DFS Name Node Fetcher front-end front-end Indexer Indexer Nutch – ApacheCon US '09 DB mgmt DB mgmt Search server DFS Data Node Task Tracker Search server DFS Data Node Task Tracker Search server DFS Data Node Task Tracker 46
  • 47. Nutch on a Hadoop cluster Assumes an up & running Hadoop cluster … which is covered elsewhere Build nutch.job jar and use it as usual: bin/hadoop jar nutch.job <className> <args...> Nutch – ApacheCon US '09 Note: Nutch configuration is inside nutch.job − When you change it you need to rebuild job jar Searching is not a map-reduce job – often on a separate group of machines 47
  • 48. Nutch – ApacheCon US '09 48
  • 49. Nutch – ApacheCon US '09 Nutch index 49
  • 50. Nutch – ApacheCon US '09 … and press Search 50
  • 51. Conclusions (This overview is a tip of the iceberg) Nutch Implements all core search engine components Scales well Nutch – ApacheCon US '09 Extremely configurable and modular It's a complete solution – and a toolkit 51
  • 52. Future of Nutch Avoid code duplication Parsing → Tika • Almost total overlap • Still some missing functionality in Tika → contribute Plugins → OSGI • Home-grown plugin system has some deficiencies Nutch – ApacheCon US '09 • Initial port available Indexing & Search → Solr, Zookeeper • Distributed and replicated search is difficult • Initial integration needs significant improvement • Shard management - Katta? Web-graph & page repository → HBase • Combine CrawlDB, LinkDB and shard storage • Avoid tedious shard management • Initial port available 52
  • 53. Future of Nutch (2) What's left then? • Crawling frontier management, discovery • Re-crawl algorithms • Spider trap handling • Fetcher Ranking: enterprise-specific, user-feedback Nutch – ApacheCon US '09 • • Duplicate detection, URL aliasing (mirror detection) • Template detection and cleanup, pagelet-level crawling • Spam & junk control Share code with other crawler projects → crawler-commons Vision: á la carte toolkit, scalable from 1-1000s nodes 53
  • 54. Summary Q&A Further information: • http://lucene.apache.org/nutch Nutch – ApacheCon US '09 54