Lucene in the Cloud: Leveraging the Power of Search and Big Data to Shed Light on Government Spending

Seshubabu Simhadri
Chief Technology Officer, GCE

Confidential, Do Not Disclose. Property of Global Computer Enterprises, Inc.

Overview

Background

What is USASpending.gov?

Moving to Our Big Data cloud

Some of the design decisions
   Tool Selection
   Cluster Design
   Hardware Design

Limitations and enhancements

What is USASpending.gov?

U.S. Government Spending vs. Other Entities

Distribution of U.S. Government Spending

What can users do on the site?

• Analytics
   •  Stats
   •  Top-K

• Free Text Search (with auto-suggestions)

• Large Data Feeds

• APIs

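The analytics features listed above map naturally onto standard Solr request parameters. The following is a minimal sketch, not the site's actual queries: it builds a request that asks for summary statistics on one numeric field (Solr's StatsComponent) and a top-K list for another field (field faceting, ranked by document count). The field names `obligated_amount` and `agency_name` are hypothetical.

```python
from urllib.parse import parse_qs, urlencode

def analytics_params(amount_field, group_field, top_k=10, text=None):
    """Build Solr query parameters for summary stats plus a top-K facet."""
    params = [
        ("q", text if text else "*:*"),  # free-text query, or match everything
        ("rows", 0),                     # only the aggregates are wanted
        ("stats", "true"),               # StatsComponent: min, max, sum, mean, ...
        ("stats.field", amount_field),
        ("facet", "true"),               # field faceting supplies the top-K list
        ("facet.field", group_field),
        ("facet.limit", top_k),
        ("facet.sort", "count"),         # rank facet values by document count
        ("wt", "json"),
    ]
    return urlencode(params)

# Hypothetical field names, for illustration only.
query = analytics_params("obligated_amount", "agency_name", top_k=5)
decoded = parse_qs(query)  # round-trip to inspect what would be sent
```

Note that plain field faceting ranks values by how many documents match, not by the sum of a numeric field; ranking by total amount needs additional aggregation.
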
Who are the users of the site?

• Public

• Media

• Congress

• Value Added Resellers

GCE Big Data and Analytics Cloud

Leveraging the industry-leading open source platform to deliver cost savings and scalability within a cloud computing model

What’s Inside the GCE Cloud?

Start by Looking at the Usual Suspects

•  Hadoop
   − For indexing and downloads

•  Distributed Solr
   − Analytics
   − Free text search

•  Drupal static content

•  Visualization

Solr Node Sizing

The greatest challenge is how to optimally design a node: which combination of CPUs, memory, and shard size delivers the desired performance?

Solr Node Sizing

Multiple index types
       Different types of spending
       Varying sizes

Break the complete dataset into shards as small as required to meet the response times
      Choose shard size based on response times

Single Solr instance with multiple cores, or multiple Solr instances each with a single core?

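Choosing a shard size from response times can be sketched as a simple lookup over benchmark results. The code below is an illustration under stated assumptions, not the procedure GCE used: the benchmark figures, the 500 ms target, and the 130 million document total are all made up.

```python
import math

def choose_shard_size(latency_by_size_ms, target_ms):
    """Return the largest benchmarked shard size (in documents) that meets the target."""
    acceptable = [size for size, ms in latency_by_size_ms.items() if ms <= target_ms]
    if not acceptable:
        raise ValueError("no benchmarked shard size meets the target latency")
    return max(acceptable)

def shards_needed(total_docs, shard_size):
    """Number of shards required to hold total_docs at the chosen shard size."""
    return math.ceil(total_docs / shard_size)

# Hypothetical benchmark: documents per shard -> measured query latency in ms.
benchmark = {5_000_000: 120, 10_000_000: 240, 20_000_000: 510, 40_000_000: 1100}

size = choose_shard_size(benchmark, target_ms=500)
count = shards_needed(130_000_000, size)
```

Because the slide mentions multiple index types of varying sizes, a calculation like this would be repeated per index type, each with its own benchmark.
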
Solr Cluster Design

How do you design the cluster: which ones are individual nodes and which ones are aggregators?

Solr Cluster Design

Should all shards be treated equally?

User → Aggregator Nodes → Shards

Different requirements for nodes collecting the data and nodes serving a specific dataset

Aggregator Nodes 1, 2, 3, …, m
  Large Solr Instances, No local index

Shard Nodes 1, 2, 3, …, 100, …, n
  Small Solr Instance with index

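The User → Aggregator → Shards flow corresponds to Solr's distributed search, where the node receiving a request is given a `shards` parameter listing the shard endpoints to fan out to. The sketch below only builds such a request URL. The host names, the `contracts` core, and the `/solr/select` path on the aggregator are hypothetical placeholders.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

def distributed_query_url(aggregator, shard_cores, query, rows=10):
    """Build a URL asking an aggregator node to fan a query out to the given shards."""
    params = {
        "q": query,
        "rows": rows,
        # Comma-separated host:port/path entries, written without a scheme.
        "shards": ",".join(shard_cores),
        "wt": "json",
    }
    return f"http://{aggregator}/solr/select?{urlencode(params)}"

# Hypothetical shard and aggregator addresses.
shards = [f"shard{i:03d}.example.internal:8983/solr/contracts" for i in range(1, 4)]
url = distributed_query_url("agg1.example.internal:8983", shards, "recipient_name:acme")

# Decode the URL again to confirm the shard list survives encoding.
shards_sent = parse_qs(urlsplit(url).query)["shards"][0].split(",")
```

The aggregator merges the per-shard results, which is why the slide gives it a large Solr instance with no local index, while each shard node stays small.
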
What configuration did we choose?

Separate Solr instances

Multiple hard drives per server

Solid state disks

InfiniBand

Solr Enhancements

Enhanced Faceting: Enabling aggregation by more than one field

Will be contributed to Solr project

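Aggregation by more than one field means grouping documents by a combination of field values and computing a total per combination. The slides do not show how the enhancement was implemented inside Solr, so the following is only a plain-Python illustration of the intended result, with made-up records and field names.

```python
from collections import defaultdict

def aggregate_by(docs, group_fields, value_field):
    """Sum value_field for every distinct combination of group_fields."""
    totals = defaultdict(float)
    for doc in docs:
        key = tuple(doc[field] for field in group_fields)
        totals[key] += doc[value_field]
    return dict(totals)

# Made-up records for illustration.
docs = [
    {"agency": "DOD", "state": "VA", "amount": 100.0},
    {"agency": "DOD", "state": "VA", "amount": 50.0},
    {"agency": "DOD", "state": "MD", "amount": 25.0},
    {"agency": "HHS", "state": "MD", "amount": 10.0},
]

totals = aggregate_by(docs, ["agency", "state"], "amount")
```

In a sharded deployment each shard would compute partial totals like these, and the aggregator would add them up per key.
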
Solr Data Importer: Why Not?

As the number of shards increases, managing the SQL queries inside Solr becomes a challenge

External Data Importer Using Hadoop

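An external importer has to decide which shard each record belongs to, and every worker must make the same decision. The slides do not describe the routing scheme, so this sketch assumes simple hash partitioning on a document id, similar in spirit to how a Hadoop job partitions keys across reducers. The record ids and shard count are hypothetical.

```python
import hashlib
from collections import defaultdict

def shard_for(doc_id, num_shards):
    """Return a stable shard index for a document id (identical on every worker)."""
    digest = hashlib.md5(doc_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards

def partition(records, num_shards):
    """Group records by target shard, as a map and shuffle phase would."""
    buckets = defaultdict(list)
    for record in records:
        buckets[shard_for(record["id"], num_shards)].append(record)
    return buckets

# Hypothetical records.
records = [{"id": f"award-{n}"} for n in range(1000)]
buckets = partition(records, num_shards=8)
```

A cryptographic digest is used here instead of Python's built-in `hash()` because string hashing is randomized per process, which would send the same id to different shards on different workers.
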
Utilizing Large Commodity Servers

Solr in the Cloud required building a cost-effective and high-performance infrastructure

Small vs. large commodity servers

Disadvantages of higher capacity servers

Failure of one node results in failure of multiple shards, so careful design is required

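The slides do not say how the design guards against a server failure taking several shards offline. One common approach, assumed here purely for illustration, is to keep more than one copy of each shard and never place two copies on the same server. The server names and counts below are hypothetical.

```python
def place_replicas(num_shards, servers, replicas=2):
    """Assign each shard's replicas to distinct servers in round-robin order."""
    if replicas > len(servers):
        raise ValueError("need at least as many servers as replicas")
    placement = {}
    for shard in range(num_shards):
        placement[shard] = [
            servers[(shard + r) % len(servers)] for r in range(replicas)
        ]
    return placement

def shards_lost(placement, failed_server):
    """Return shards that have no surviving replica after one server fails."""
    return [
        shard
        for shard, hosts in placement.items()
        if all(host == failed_server for host in hosts)
    ]

# Hypothetical servers, each hosting several small Solr instances.
servers = ["srv-a", "srv-b", "srv-c"]
placement = place_replicas(num_shards=12, servers=servers, replicas=2)
unreplicated = place_replicas(num_shards=12, servers=servers, replicas=1)
```

With two replicas per shard, losing any single server leaves every shard reachable; with one replica, losing `srv-a` takes a third of the shards down with it.
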
Summary

Sharded architecture

Multiple Solr instances per server each handling small datasets

Aggregator nodes + shards

Hadoop for data indexing and data feeds

Large Commodity Servers
   •  48-core
   •  256GB RAM
   •  SSD
   •  InfiniBand

We’re hiring!

Come build the future of Big Data

GCECloud.com

Questions?
 ssimhadri at GCECloud.com

Visit us at www.GCECloud.com

More Related Content

What's hot

HDFS - What's New and Future
HDFS - What's New and FutureHDFS - What's New and Future
HDFS - What's New and FutureDataWorks Summit
 
Larry Smarr - Making Sense of Information Through Planetary Scale Computing
Larry Smarr - Making Sense of Information Through Planetary Scale ComputingLarry Smarr - Making Sense of Information Through Planetary Scale Computing
Larry Smarr - Making Sense of Information Through Planetary Scale ComputingDiamond Exchange
 
CH07-Types of Storage
CH07-Types of StorageCH07-Types of Storage
CH07-Types of StorageSukanya Ben
 
Dell PowerEdge M620 blade server solutions for virtual desktop infrastructures
Dell PowerEdge M620 blade server solutions for virtual desktop infrastructuresDell PowerEdge M620 blade server solutions for virtual desktop infrastructures
Dell PowerEdge M620 blade server solutions for virtual desktop infrastructuresPrincipled Technologies
 
Sun sparc enterprise t2 systems customer presentation
Sun sparc enterprise t2 systems customer presentationSun sparc enterprise t2 systems customer presentation
Sun sparc enterprise t2 systems customer presentationxKinAnx
 
Sun sparc enterprise t5140 and t5240 servers customer presentation
Sun sparc enterprise t5140 and t5240 servers customer presentationSun sparc enterprise t5140 and t5240 servers customer presentation
Sun sparc enterprise t5140 and t5240 servers customer presentationxKinAnx
 
TCA/TCO Benefits of Consolidating Databases and x86 Servers on IBM Enterprise...
TCA/TCO Benefits of Consolidating Databases and x86 Servers on IBM Enterprise...TCA/TCO Benefits of Consolidating Databases and x86 Servers on IBM Enterprise...
TCA/TCO Benefits of Consolidating Databases and x86 Servers on IBM Enterprise...IBM India Smarter Computing
 
OMG DDS Tutorial - Part I
OMG DDS Tutorial - Part IOMG DDS Tutorial - Part I
OMG DDS Tutorial - Part IAngelo Corsaro
 
Prince Building Tech Talk 12102012
Prince Building Tech Talk 12102012Prince Building Tech Talk 12102012
Prince Building Tech Talk 12102012Andy Parsons
 
Webinar: eFolder Expert Series: BDR Pain Relief with Lloyd Wolf
Webinar: eFolder Expert Series: BDR Pain Relief with Lloyd WolfWebinar: eFolder Expert Series: BDR Pain Relief with Lloyd Wolf
Webinar: eFolder Expert Series: BDR Pain Relief with Lloyd WolfDropbox
 
The CIOs Guide to NoSQL 2012
The CIOs Guide to NoSQL 2012The CIOs Guide to NoSQL 2012
The CIOs Guide to NoSQL 2012DATAVERSITY
 
Dds interop demo_washington_dds_2011_03_01
Dds interop demo_washington_dds_2011_03_01Dds interop demo_washington_dds_2011_03_01
Dds interop demo_washington_dds_2011_03_01Gerardo Pardo-Castellote
 
EvoApp - Bermuda Real-Time Analytics Platform
EvoApp - Bermuda Real-Time Analytics PlatformEvoApp - Bermuda Real-Time Analytics Platform
EvoApp - Bermuda Real-Time Analytics PlatformSergei Dolukhanov
 
EFL Munich - February 2013 - "Conversational Big Data with Erlang"
EFL Munich - February 2013 - "Conversational Big Data with Erlang"EFL Munich - February 2013 - "Conversational Big Data with Erlang"
EFL Munich - February 2013 - "Conversational Big Data with Erlang"darach
 
Power Optimization Through Manycore Multiprocessing
Power Optimization Through Manycore MultiprocessingPower Optimization Through Manycore Multiprocessing
Power Optimization Through Manycore Multiprocessingchiportal
 

What's hot (19)

HDFS - What's New and Future
HDFS - What's New and FutureHDFS - What's New and Future
HDFS - What's New and Future
 
Larry Smarr - Making Sense of Information Through Planetary Scale Computing
Larry Smarr - Making Sense of Information Through Planetary Scale ComputingLarry Smarr - Making Sense of Information Through Planetary Scale Computing
Larry Smarr - Making Sense of Information Through Planetary Scale Computing
 
CH07-Types of Storage
CH07-Types of StorageCH07-Types of Storage
CH07-Types of Storage
 
Dell PowerEdge M620 blade server solutions for virtual desktop infrastructures
Dell PowerEdge M620 blade server solutions for virtual desktop infrastructuresDell PowerEdge M620 blade server solutions for virtual desktop infrastructures
Dell PowerEdge M620 blade server solutions for virtual desktop infrastructures
 
Sun sparc enterprise t2 systems customer presentation
Sun sparc enterprise t2 systems customer presentationSun sparc enterprise t2 systems customer presentation
Sun sparc enterprise t2 systems customer presentation
 
Sun sparc enterprise t5140 and t5240 servers customer presentation
Sun sparc enterprise t5140 and t5240 servers customer presentationSun sparc enterprise t5140 and t5240 servers customer presentation
Sun sparc enterprise t5140 and t5240 servers customer presentation
 
TCA/TCO Benefits of Consolidating Databases and x86 Servers on IBM Enterprise...
TCA/TCO Benefits of Consolidating Databases and x86 Servers on IBM Enterprise...TCA/TCO Benefits of Consolidating Databases and x86 Servers on IBM Enterprise...
TCA/TCO Benefits of Consolidating Databases and x86 Servers on IBM Enterprise...
 
OMG DDS Tutorial - Part I
OMG DDS Tutorial - Part IOMG DDS Tutorial - Part I
OMG DDS Tutorial - Part I
 
Prince Building Tech Talk 12102012
Prince Building Tech Talk 12102012Prince Building Tech Talk 12102012
Prince Building Tech Talk 12102012
 
Webinar: eFolder Expert Series: BDR Pain Relief with Lloyd Wolf
Webinar: eFolder Expert Series: BDR Pain Relief with Lloyd WolfWebinar: eFolder Expert Series: BDR Pain Relief with Lloyd Wolf
Webinar: eFolder Expert Series: BDR Pain Relief with Lloyd Wolf
 
The CIOs Guide to NoSQL 2012
The CIOs Guide to NoSQL 2012The CIOs Guide to NoSQL 2012
The CIOs Guide to NoSQL 2012
 
Dds interop demo_washington_dds_2011_03_01
Dds interop demo_washington_dds_2011_03_01Dds interop demo_washington_dds_2011_03_01
Dds interop demo_washington_dds_2011_03_01
 
ieee title
ieee titleieee title
ieee title
 
Netmagic Cloud Computing Services
Netmagic Cloud Computing ServicesNetmagic Cloud Computing Services
Netmagic Cloud Computing Services
 
Migrate
MigrateMigrate
Migrate
 
EvoApp - Bermuda Real-Time Analytics Platform
EvoApp - Bermuda Real-Time Analytics PlatformEvoApp - Bermuda Real-Time Analytics Platform
EvoApp - Bermuda Real-Time Analytics Platform
 
Juju
JujuJuju
Juju
 
EFL Munich - February 2013 - "Conversational Big Data with Erlang"
EFL Munich - February 2013 - "Conversational Big Data with Erlang"EFL Munich - February 2013 - "Conversational Big Data with Erlang"
EFL Munich - February 2013 - "Conversational Big Data with Erlang"
 
Power Optimization Through Manycore Multiprocessing
Power Optimization Through Manycore MultiprocessingPower Optimization Through Manycore Multiprocessing
Power Optimization Through Manycore Multiprocessing
 

Viewers also liked

GCX Cloud X Launch Presentation (October 14th, 2014)
GCX Cloud X Launch Presentation (October 14th, 2014)GCX Cloud X Launch Presentation (October 14th, 2014)
GCX Cloud X Launch Presentation (October 14th, 2014)Ahmed Abdel-Latif
 
Global Cloud Xchange - Newsletter-Q4 2015
Global Cloud Xchange - Newsletter-Q4 2015Global Cloud Xchange - Newsletter-Q4 2015
Global Cloud Xchange - Newsletter-Q4 2015Michael Agterberg
 
ECI - ElastiCLOUD™ - For Data Center & Cloud Solutions
ECI - ElastiCLOUD™ - For Data Center & Cloud SolutionsECI - ElastiCLOUD™ - For Data Center & Cloud Solutions
ECI - ElastiCLOUD™ - For Data Center & Cloud SolutionsECI – THE ELASTIC NETWORK™
 
GCP - GCE, Cloud SQL, Cloud Storage, BigQuery Basic Training
GCP - GCE, Cloud SQL, Cloud Storage, BigQuery Basic TrainingGCP - GCE, Cloud SQL, Cloud Storage, BigQuery Basic Training
GCP - GCE, Cloud SQL, Cloud Storage, BigQuery Basic TrainingSimon Su
 

Viewers also liked (7)

GCX Cloud X Launch Presentation (October 14th, 2014)
GCX Cloud X Launch Presentation (October 14th, 2014)GCX Cloud X Launch Presentation (October 14th, 2014)
GCX Cloud X Launch Presentation (October 14th, 2014)
 
Global Cloud Xchange - Newsletter-Q4 2015
Global Cloud Xchange - Newsletter-Q4 2015Global Cloud Xchange - Newsletter-Q4 2015
Global Cloud Xchange - Newsletter-Q4 2015
 
ECI - ElastiCLOUD™ - For Data Center & Cloud Solutions
ECI - ElastiCLOUD™ - For Data Center & Cloud SolutionsECI - ElastiCLOUD™ - For Data Center & Cloud Solutions
ECI - ElastiCLOUD™ - For Data Center & Cloud Solutions
 
ECI - ElastiNET™ - For Service Providers & NRENS
ECI - ElastiNET™ - For Service Providers & NRENSECI - ElastiNET™ - For Service Providers & NRENS
ECI - ElastiNET™ - For Service Providers & NRENS
 
ECI Driving Standards from Code -ECI Work with ONOS
ECI Driving Standards from Code -ECI Work with ONOSECI Driving Standards from Code -ECI Work with ONOS
ECI Driving Standards from Code -ECI Work with ONOS
 
ECI - The Elastic Network - winds of change
ECI - The Elastic Network - winds of changeECI - The Elastic Network - winds of change
ECI - The Elastic Network - winds of change
 
GCP - GCE, Cloud SQL, Cloud Storage, BigQuery Basic Training
GCP - GCE, Cloud SQL, Cloud Storage, BigQuery Basic TrainingGCP - GCE, Cloud SQL, Cloud Storage, BigQuery Basic Training
GCP - GCE, Cloud SQL, Cloud Storage, BigQuery Basic Training
 

Similar to Lucene in the Cloud: Learn how GCE leveraged the power of search and Big Data to shed light on government spending

Cloud computing basics
Cloud computing basicsCloud computing basics
Cloud computing basicsAkshay Guleria
 
Cloudcomputing Nivo Consultancy 26 Mei 2009 Versie 1
Cloudcomputing Nivo Consultancy 26 Mei 2009 Versie 1Cloudcomputing Nivo Consultancy 26 Mei 2009 Versie 1
Cloudcomputing Nivo Consultancy 26 Mei 2009 Versie 1Ruud Ramakers
 
Big Data on Cloud Native Platform
Big Data on Cloud Native PlatformBig Data on Cloud Native Platform
Big Data on Cloud Native PlatformSunil Govindan
 
Big Data on Cloud Native Platform
Big Data on Cloud Native PlatformBig Data on Cloud Native Platform
Big Data on Cloud Native PlatformSunil Govindan
 
Massive Data Analytics and the Cloud
Massive Data Analytics and the CloudMassive Data Analytics and the Cloud
Massive Data Analytics and the CloudBooz Allen Hamilton
 
Dr. Michael Valivullah, NASS/USDA - Cloud Computing
Dr. Michael Valivullah, NASS/USDA - Cloud ComputingDr. Michael Valivullah, NASS/USDA - Cloud Computing
Dr. Michael Valivullah, NASS/USDA - Cloud Computingikanow
 
Cloud computing and libraries sndt
Cloud computing and libraries sndtCloud computing and libraries sndt
Cloud computing and libraries sndtVishwas Taralekar
 
ONS content extraction
ONS content extractionONS content extraction
ONS content extractionKellyCheah
 
Zsl cloud-application migration-8_phased_approach
Zsl cloud-application migration-8_phased_approachZsl cloud-application migration-8_phased_approach
Zsl cloud-application migration-8_phased_approachzslmarketing
 
The elephantintheroom bigdataanalyticsinthecloud
The elephantintheroom bigdataanalyticsinthecloudThe elephantintheroom bigdataanalyticsinthecloud
The elephantintheroom bigdataanalyticsinthecloudKhazret Sapenov
 
NCOIC Enterprise Cloud Computing - Kevin Jackson
NCOIC Enterprise Cloud Computing - Kevin JacksonNCOIC Enterprise Cloud Computing - Kevin Jackson
NCOIC Enterprise Cloud Computing - Kevin JacksonGovCloud Network
 
Hive solutions cloudviews 2010 presentation
Hive solutions cloudviews 2010 presentationHive solutions cloudviews 2010 presentation
Hive solutions cloudviews 2010 presentationEuroCloud
 
colony framework & omni platform
colony framework & omni platformcolony framework & omni platform
colony framework & omni platformHive Solutions
 
Comparing Ruby on Rails Public vs. Private Cloud Options
Comparing Ruby on Rails Public vs. Private Cloud OptionsComparing Ruby on Rails Public vs. Private Cloud Options
Comparing Ruby on Rails Public vs. Private Cloud OptionsAltoros
 
The who, what, why, when & how of cloud computing
The who, what, why, when & how of cloud computingThe who, what, why, when & how of cloud computing
The who, what, why, when & how of cloud computingNari Kannan
 
When where why cloud
When where why cloudWhen where why cloud
When where why cloudreshmaroberts
 
When Where Why Cloud
When Where Why CloudWhen Where Why Cloud
When Where Why Cloudreshmaroberts
 

Similar to Lucene in the Cloud: Learn how GCE leveraged the power of search and Big Data to shed light on government spending (20)

The Cloud Changing the Game
The Cloud Changing the GameThe Cloud Changing the Game
The Cloud Changing the Game
 
Cloud computing basics
Cloud computing basicsCloud computing basics
Cloud computing basics
 
Cloudcomputing Nivo Consultancy 26 Mei 2009 Versie 1
Cloudcomputing Nivo Consultancy 26 Mei 2009 Versie 1Cloudcomputing Nivo Consultancy 26 Mei 2009 Versie 1
Cloudcomputing Nivo Consultancy 26 Mei 2009 Versie 1
 
Big Data on Cloud Native Platform
Big Data on Cloud Native PlatformBig Data on Cloud Native Platform
Big Data on Cloud Native Platform
 
Big Data on Cloud Native Platform
Big Data on Cloud Native PlatformBig Data on Cloud Native Platform
Big Data on Cloud Native Platform
 
Massive Data Analytics and the Cloud
Massive Data Analytics and the CloudMassive Data Analytics and the Cloud
Massive Data Analytics and the Cloud
 
Dr. Michael Valivullah, NASS/USDA - Cloud Computing
Dr. Michael Valivullah, NASS/USDA - Cloud ComputingDr. Michael Valivullah, NASS/USDA - Cloud Computing
Dr. Michael Valivullah, NASS/USDA - Cloud Computing
 
The Sun Cloud
The Sun CloudThe Sun Cloud
The Sun Cloud
 
Cloud computing and libraries sndt
Cloud computing and libraries sndtCloud computing and libraries sndt
Cloud computing and libraries sndt
 
ONS content extraction
ONS content extractionONS content extraction
ONS content extraction
 
Zsl cloud-application migration-8_phased_approach
Zsl cloud-application migration-8_phased_approachZsl cloud-application migration-8_phased_approach
Zsl cloud-application migration-8_phased_approach
 
The elephantintheroom bigdataanalyticsinthecloud
The elephantintheroom bigdataanalyticsinthecloudThe elephantintheroom bigdataanalyticsinthecloud
The elephantintheroom bigdataanalyticsinthecloud
 
NCOIC Enterprise Cloud Computing - Kevin Jackson
NCOIC Enterprise Cloud Computing - Kevin JacksonNCOIC Enterprise Cloud Computing - Kevin Jackson
NCOIC Enterprise Cloud Computing - Kevin Jackson
 
Hive solutions cloudviews 2010 presentation
Hive solutions cloudviews 2010 presentationHive solutions cloudviews 2010 presentation
Hive solutions cloudviews 2010 presentation
 
colony framework & omni platform
colony framework & omni platformcolony framework & omni platform
colony framework & omni platform
 
Comparing Ruby on Rails Public vs. Private Cloud Options
Comparing Ruby on Rails Public vs. Private Cloud OptionsComparing Ruby on Rails Public vs. Private Cloud Options
Comparing Ruby on Rails Public vs. Private Cloud Options
 
The who, what, why, when & how of cloud computing
The who, what, why, when & how of cloud computingThe who, what, why, when & how of cloud computing
The who, what, why, when & how of cloud computing
 
Vr storm cips_03nov2010
Vr storm cips_03nov2010Vr storm cips_03nov2010
Vr storm cips_03nov2010
 
When where why cloud
When where why cloudWhen where why cloud
When where why cloud
 
When Where Why Cloud
When Where Why CloudWhen Where Why Cloud
When Where Why Cloud
 

More from lucenerevolution

Text Classification Powered by Apache Mahout and Lucene
Text Classification Powered by Apache Mahout and LuceneText Classification Powered by Apache Mahout and Lucene
Text Classification Powered by Apache Mahout and Lucenelucenerevolution
 
State of the Art Logging. Kibana4Solr is Here!
State of the Art Logging. Kibana4Solr is Here! State of the Art Logging. Kibana4Solr is Here!
State of the Art Logging. Kibana4Solr is Here! lucenerevolution
 
Building Client-side Search Applications with Solr
Building Client-side Search Applications with SolrBuilding Client-side Search Applications with Solr
Building Client-side Search Applications with Solrlucenerevolution
 
Integrate Solr with real-time stream processing applications
Integrate Solr with real-time stream processing applicationsIntegrate Solr with real-time stream processing applications
Integrate Solr with real-time stream processing applicationslucenerevolution
 
Scaling Solr with SolrCloud
Scaling Solr with SolrCloudScaling Solr with SolrCloud
Scaling Solr with SolrCloudlucenerevolution
 
Administering and Monitoring SolrCloud Clusters
Administering and Monitoring SolrCloud ClustersAdministering and Monitoring SolrCloud Clusters
Administering and Monitoring SolrCloud Clusterslucenerevolution
 
Implementing a Custom Search Syntax using Solr, Lucene, and Parboiled
Implementing a Custom Search Syntax using Solr, Lucene, and ParboiledImplementing a Custom Search Syntax using Solr, Lucene, and Parboiled
Implementing a Custom Search Syntax using Solr, Lucene, and Parboiledlucenerevolution
 
Using Solr to Search and Analyze Logs
Using Solr to Search and Analyze Logs Using Solr to Search and Analyze Logs
Using Solr to Search and Analyze Logs lucenerevolution
 
Enhancing relevancy through personalization & semantic search
Enhancing relevancy through personalization & semantic searchEnhancing relevancy through personalization & semantic search
Enhancing relevancy through personalization & semantic searchlucenerevolution
 
Real-time Inverted Search in the Cloud Using Lucene and Storm
Real-time Inverted Search in the Cloud Using Lucene and StormReal-time Inverted Search in the Cloud Using Lucene and Storm
Real-time Inverted Search in the Cloud Using Lucene and Stormlucenerevolution
 
Solr's Admin UI - Where does the data come from?
Solr's Admin UI - Where does the data come from?Solr's Admin UI - Where does the data come from?
Solr's Admin UI - Where does the data come from?lucenerevolution
 
Schemaless Solr and the Solr Schema REST API
Schemaless Solr and the Solr Schema REST APISchemaless Solr and the Solr Schema REST API
Schemaless Solr and the Solr Schema REST APIlucenerevolution
 
High Performance JSON Search and Relational Faceted Browsing with Lucene
High Performance JSON Search and Relational Faceted Browsing with LuceneHigh Performance JSON Search and Relational Faceted Browsing with Lucene
High Performance JSON Search and Relational Faceted Browsing with Lucenelucenerevolution
 
Text Classification with Lucene/Solr, Apache Hadoop and LibSVM
Text Classification with Lucene/Solr, Apache Hadoop and LibSVMText Classification with Lucene/Solr, Apache Hadoop and LibSVM
Text Classification with Lucene/Solr, Apache Hadoop and LibSVMlucenerevolution
 
Faceted Search with Lucene
Faceted Search with LuceneFaceted Search with Lucene
Faceted Search with Lucenelucenerevolution
 
Recent Additions to Lucene Arsenal
Recent Additions to Lucene ArsenalRecent Additions to Lucene Arsenal
Recent Additions to Lucene Arsenallucenerevolution
 
Turning search upside down
Turning search upside downTurning search upside down
Turning search upside downlucenerevolution
 
Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...
Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...
Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...lucenerevolution
 
Shrinking the haystack wes caldwell - final
Shrinking the haystack   wes caldwell - finalShrinking the haystack   wes caldwell - final
Shrinking the haystack wes caldwell - finallucenerevolution
 

More from lucenerevolution (20)

Text Classification Powered by Apache Mahout and Lucene
Text Classification Powered by Apache Mahout and LuceneText Classification Powered by Apache Mahout and Lucene
Text Classification Powered by Apache Mahout and Lucene
 
State of the Art Logging. Kibana4Solr is Here!
State of the Art Logging. Kibana4Solr is Here! State of the Art Logging. Kibana4Solr is Here!
State of the Art Logging. Kibana4Solr is Here!
 
Search at Twitter
Search at TwitterSearch at Twitter
Search at Twitter
 
Building Client-side Search Applications with Solr
Building Client-side Search Applications with SolrBuilding Client-side Search Applications with Solr
Building Client-side Search Applications with Solr
 
Integrate Solr with real-time stream processing applications
Integrate Solr with real-time stream processing applicationsIntegrate Solr with real-time stream processing applications
Integrate Solr with real-time stream processing applications
 
Scaling Solr with SolrCloud
Scaling Solr with SolrCloudScaling Solr with SolrCloud
Scaling Solr with SolrCloud
 
Administering and Monitoring SolrCloud Clusters
Lucene in the Cloud: Learn how GCE leveraged the power of search and Big Data to shed light on government spending

  • 1. Lucene in the Cloud: Leveraging the Power of Search and Big Data to Shed Light on Government Spending. Seshubabu Simhadri, Chief Technology Officer, GCE. Confidential, Do Not Disclose. Property of Global Computer Enterprises, Inc.
  • 2. Overview: Background; What is USASpending.gov?; Moving to our Big Data cloud; Some of the design decisions (tool selection, cluster design, hardware design); Limitations and enhancements.
  • 3. What is USASpending.gov?
  • 4. U.S. Government Spending vs. Other Entities
  • 5. Distribution of U.S. Government Spending
  • 6. What can users do on the site? Analytics (stats, top-K); free text search (with auto-suggestions); large data feeds; APIs.
  • 7. Who are the users of the site? Public; media; Congress; value-added resellers.
  • 8. GCE Big Data and Analytics Cloud: Leveraging the industry-leading open source platform to deliver cost savings and scalability within a cloud computing model.
  • 9. What’s Inside the GCE Cloud? Start by looking at the usual suspects: Hadoop (for indexing and downloads); distributed Solr (analytics, free text search); Drupal static content; visualization.
  • 10. Solr Node Sizing: The greatest challenge is how to optimally design a node. Which combination of CPUs, memory, and shard size delivers the desired performance?
  • 11. Solr Node Sizing: Multiple index types (different types of spending, varying sizes). Break the complete dataset into shards as small as required to meet the response times; choose shard size based on response times. A single Solr instance with multiple cores, or multiple Solr instances each with a single core?
  • 12. Solr Cluster Design: How do you design the cluster? Which nodes are individual nodes and which are aggregators?
  • 13. Solr Cluster Design: Should all shards be treated equally? User → Aggregator Nodes → Shards. Nodes collecting the data and nodes serving a specific dataset have different requirements. Aggregator nodes 1, 2, 3, …, m: large Solr instances, no local index. Shard nodes 1, 2, 3, …, 100, …, n: small Solr instances with an index.
  • 14. What configuration did we choose? Separate Solr instances; multiple hard drives per server; solid state disks; InfiniBand.
  • 15. Solr Enhancements: Enhanced faceting, enabling aggregation by more than one field. Will be contributed to the Solr project.
  • 16. Solr Data Importer: Why Not? As the number of shards increases, managing the SQL queries inside Solr becomes a challenge. The alternative: an external data importer using Hadoop.
  • 17. Utilizing Large Commodity Servers: Solr in the cloud required building a cost-effective, high-performance infrastructure. Small vs. large commodity servers.
  • 18. Disadvantages of higher-capacity servers: Failure of one node results in failure of multiple shards, so careful design is required.
  • 19. Summary: Sharded architecture; multiple Solr instances per server, each handling small datasets; aggregator nodes + shards; Hadoop for data indexing and data feeds; large commodity servers (48-core, 256GB RAM, SSD, InfiniBand).
  • 20. We’re hiring! Come build the future of Big Data. GCECloud.com
  • 21. Questions? ssimhadri at GCECloud.com Visit us at www.GCECloud.com
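
The aggregator-and-shard layout from slide 13 (User → Aggregator Nodes → Shards) can be illustrated with Solr's distributed search, where a request carrying a `shards` parameter is fanned out to the listed shard cores and the results are merged by the node that received it. The sketch below only builds such a request URL. The host names, the port, and the core name `spending` are hypothetical placeholders, and nothing here is taken from GCE's actual configuration.

```python
from urllib.parse import urlencode


def build_distributed_query(aggregator, shard_hosts, query, core="spending", rows=10):
    """Build a Solr select URL that asks one aggregator to query every shard.

    `aggregator` and `shard_hosts` are "host:port" strings; `core` is a
    hypothetical core name assumed to exist on every node.
    """
    # Solr expects a comma-separated list of shard addresses without the scheme.
    shards = ",".join(f"{host}/solr/{core}" for host in shard_hosts)
    params = {"q": query, "rows": rows, "wt": "json", "shards": shards}
    return f"http://{aggregator}/solr/{core}/select?{urlencode(params)}"


if __name__ == "__main__":
    # Hypothetical hosts: one aggregator in front of three shard instances.
    print(build_distributed_query(
        "agg1:8983",
        ["shard1:8983", "shard2:8983", "shard3:8983"],
        "agency:DOD",
    ))
```

Because the aggregator only merges results, it needs CPU and memory but no local index, while each shard instance stays small enough to meet the response-time target, which matches the split described in the slides.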