SlideShare a Scribd company logo
1 of 35
Copyright Elasticsearch 2014 Copying, publishing and/or distributing without written permission is strictly prohibited
Real-time Analytics &
Anomaly detection
using Hadoop, Elasticsearch and Storm
Costin Leau
@costinl
Copyright Elasticsearch 2014. Copying, publishing and/or distributing without written permission is strictly prohibited
Copyright Elasticsearch 2014. Copying, publishing and/or distributing without written permission is strictly prohibited
Interesting != Common
Datasets tend to have hot / common entities
Monopolize the data set
Create too much noise
Cannot be easily avoided
Common = frequent
Interesting = frequently different
Copyright Elasticsearch 2014. Copying, publishing and/or distributing without written permission is strictly prohibited
Finding the uncommon
Background vs foreground == things that stand out
Example:
Background: “flu”
“H5N1” appears in 5 / 10M docs
H5N1
flu
Copyright Elasticsearch 2014. Copying, publishing and/or distributing without written permission is strictly prohibited
Finding the uncommon
Background vs foreground == things that stand out
Example:
Background: “flu”
“H5N1” appears in 5 / 10M docs
Foreground: “bird flu”
“H5N1” appears in 4 / 100 docs
H5N1
flu
H5N1
bird flu
Copyright Elasticsearch 2014. Copying, publishing and/or distributing without written permission is strictly prohibited
Finding the uncommon - Challenges
Deal with big data sets
• Hadoop
Perform the analysis
• Elasticsearch
Keep the data fresh
• Storm
Copyright Elasticsearch 2014. Copying, publishing and/or distributing without written permission is strictly prohibited
Hadoop
De-facto platform for big data
HDFS - Used for storing and performing ETL at scale
Map/Reduce - Excellent for iterating, thorough analysis
YARN – Job scheduling and resource management
Copyright Elasticsearch 2014. Copying, publishing and/or distributing without written permission is strictly prohibited
Elasticsearch
Open-source real-time search and analytics engine
• Fully-featured search
Relevance-ranked text search
Scalable search
High-performance geo, temporal, range and key lookup
Highlighting
Support for complex / nested document types *
Spelling suggestions
Powerful query DSL *
“Standing” queries *
Real-time results *
Extensible via plugins *
• Powerful faceting/analysis
Summarize large sets by any combinations of
time, geo, category and more. *
“Kibana” visualization tool *
* Features we see as differentiators
• Management
Simple and robust deployments *
REST APIs for handling all aspects of administration/monitoring *
“Marvel” console for monitoring and administering clusters *
Special features to manage the life cycle of content *
• Integration
Hadoop (Map/Red,Hive,Pig,Cascading..)*
Client libraries (Python, Java, Ruby, javascript…)
Data connectors (Twitter, JMS…)
Logstash ETL framework *
• Support
Development and Production support with tiered levels
Support staff are the core developers of the product *
Copyright Elasticsearch 2014. Copying, publishing and/or distributing without written permission is strictly prohibited
Elasticsearch
Open-source real-time search and analytics engine
Copyright Elasticsearch 2014. Copying, publishing and/or distributing without written permission is strictly prohibited
Elasticsearch Hadoop
Use Elasticsearch natively in Hadoop
‣ Map/Reduce – Input/OutputFormat
‣ Apache Pig – Storage
‣ Apache Hive – External Table
‣ Cascading – Tap/Sink
‣ Storm (in development) – Spout / Bolt
All operations (reads/writes) are parallelized (Map/Reduce)
Copyright Elasticsearch 2014. Copying, publishing and/or distributing without written permission is strictly prohibited
Storm
Distributed, fault-tolerant, real-time computation system
Perform on-the-fly queries
React to live data
Prevention
Routing
Copyright Elasticsearch 2014 Copying, publishing and/or distributing without written permission is strictly prohibited
Discovering
the relevant
Copyright Elasticsearch 2014. Copying, publishing and/or distributing without written permission is strictly prohibited
Inverted index
Inverting Shakespeare
‣ Take all the plays and break them down word by word
‣ For each word, store the ids of the documents that contain it
‣ Sort all tokens (words)
token doc freq. postings (doc ids)
Anthony 2 1, 2
Brutus 1 5
Caesar 2 2, 3
Calpurnia 2 4, 5
Copyright Elasticsearch 2014. Copying, publishing and/or distributing without written permission is strictly prohibited
Relevancy
How well does a document match a query?
step query d1 d2
The text brown fox The quick brown fox likes
brown nuts
The red fox
The terms (brown, fox) (brown, brown, fox, likes, nuts,
quick)
(red, fox)
A frequency vector (1, 1) (2, 1) (0, 1)
Relevancy - 2? 1?
Copyright Elasticsearch 2014. Copying, publishing and/or distributing without written permission is strictly prohibited
Relevancy - Vector Space Model
• How well q matches d1 and d2?
‣ The coordinates in the vector represent weights per term
‣ The simple (1, 0) vector we discussed defines these weights based on the
frequency of each term
‣ But to generalize:
.
2
1
1
tf: brown
tf: fox
q: (brown, fox)
d1: (brown, brown, fox)
d2: (fox)
Copyright Elasticsearch 2014. Copying, publishing and/or distributing without written permission is strictly prohibited
Relevancy TF-IDF
Term frequency / Inverse Document Frequency
TF = the more a token appears in a doc, the
more important it is
IDF = the more documents containing the term,
the less important it is
Copyright Elasticsearch 2014. Copying, publishing and/or distributing without written permission is strictly prohibited
Ranking Formula
Called Lucene Similarity
Can be ignored (was an
attempt to make query
scores comparable across
indices, it’s there for
backward compatibility)
Core TF/IDF weight
Score of a
document for a
given query
Normalized doc length,
shorter docs are more
likely to be relevant
than longer docs
Boost of
query term t
Copyright Elasticsearch 2014 Copying, publishing and/or distributing without written permission is strictly prohibited
Discovering
the interesting
Copyright Elasticsearch 2014. Copying, publishing and/or distributing without written permission is strictly prohibited
Frequency differentiator
TF-IDF by-itself is not enough
need to compare the DF in foreground vs background
Precision vs Recall balance
Copyright Elasticsearch 2014. Copying, publishing and/or distributing without written permission is strictly prohibited
Single-set analysis
A C F H I K
A B C D E … X Y Z W
Query results
Dataset
Copyright Elasticsearch 2014. Copying, publishing and/or distributing without written permission is strictly prohibited
Single-set analysis example
crimes
bicycle theft
crimes
bicycle theft
British Police Force British Transport Police
Copyright Elasticsearch 2014. Copying, publishing and/or distributing without written permission is strictly prohibited
Multi-set analysis
A B C D E … X Y Z W
A C F H I K M Q R
…
Query results
Dataset
A B C D .. J L M N O .. U
Aggregate
Copyright Elasticsearch 2014. Copying, publishing and/or distributing without written permission is strictly prohibited
Background (geo-aggregation)
Copyright Elasticsearch 2014. Copying, publishing and/or distributing without written permission is strictly prohibited
Foreground (geo-aggregation)
Copyright Elasticsearch 2014. Copying, publishing and/or distributing without written permission is strictly prohibited
Hadoop
Off-line / slow learning
‣ In-depth analysis
‣ Break down data into hot spots
‣ Eliminate noise
‣ Build multiple models
Copyright Elasticsearch 2014. Copying, publishing and/or distributing without written permission is strictly prohibited
Elasticsearch
Search features
‣ Scoring, TF-IDF
‣ Significant terms (multi-set analysis)
Aggregations
‣ Buckets & Metrics
Copyright Elasticsearch 2014 Copying, publishing and/or distributing without written permission is strictly prohibited
Reacting
to data
Copyright Elasticsearch 2014. Copying, publishing and/or distributing without written permission is strictly prohibited
Reacting to data
Prevent
execute queries as data flows in  build a model
Route
place suspicious data into a dedicate pipeline
Copyright Elasticsearch 2014. Copying, publishing and/or distributing without written permission is strictly prohibited
Reacting to data
spout bolt
bolt
bolt
bolt
bolt bolt
bolt
bolt
bolt
Copyright Elasticsearch 2014. Copying, publishing and/or distributing without written permission is strictly prohibited
Live loop
Data keeps changing
‣ Adapt the set of rules
Improves reaction time
‣ Build a model for fast decision making
Keeps the prevention rate high
‣ Categorize data on the fly
bolt
Copyright Elasticsearch 2014 Copying, publishing and/or distributing without written permission is strictly prohibited
Putting it all together
Copyright Elasticsearch 2014. Copying, publishing and/or distributing without written permission is strictly prohibited
The Big Picture
HDFS
Slow, in-depth
learning
Fast, real-time
learning
ETL
Copyright Elasticsearch 2014. Copying, publishing and/or distributing without written permission is strictly prohibited
Usages
Recommendation
‣ Find similar movies based on user feedback
‣ Use Storm to optimize the returned results
Card Fraud
‣ Use Storm to prevent suspicious transactions from executing
‣ Route possible frauds to a dedicated analysis queue
Copyright Elasticsearch 2014. Copying, publishing and/or distributing without written permission is strictly prohibited
Copyright Elasticsearch 2014 Copying, publishing and/or distributing without written permission is strictly prohibited
Q&A
Thank you!
@costinl

More Related Content

What's hot

Streaming Analytics with Spark, Kafka, Cassandra and Akka by Helena Edelson
Streaming Analytics with Spark, Kafka, Cassandra and Akka by Helena EdelsonStreaming Analytics with Spark, Kafka, Cassandra and Akka by Helena Edelson
Streaming Analytics with Spark, Kafka, Cassandra and Akka by Helena EdelsonSpark Summit
 
Building a Large Scale SEO/SEM Application with Apache Solr
Building a Large Scale SEO/SEM Application with Apache SolrBuilding a Large Scale SEO/SEM Application with Apache Solr
Building a Large Scale SEO/SEM Application with Apache SolrRahul Jain
 
OLAP Battle - SolrCloud vs. HBase: Presented by Dragan Milosevic, Zanox AG
OLAP Battle - SolrCloud vs. HBase: Presented by Dragan Milosevic, Zanox AGOLAP Battle - SolrCloud vs. HBase: Presented by Dragan Milosevic, Zanox AG
OLAP Battle - SolrCloud vs. HBase: Presented by Dragan Milosevic, Zanox AGLucidworks
 
Ahsay Backup Software v7 - Datasheet
Ahsay Backup Software v7 - DatasheetAhsay Backup Software v7 - Datasheet
Ahsay Backup Software v7 - DatasheetRonnie Chan
 
Big Data Anti-Patterns: Lessons From the Front LIne
Big Data Anti-Patterns: Lessons From the Front LIneBig Data Anti-Patterns: Lessons From the Front LIne
Big Data Anti-Patterns: Lessons From the Front LIneDouglas Moore
 
Solr + Hadoop = Big Data Search
Solr + Hadoop = Big Data SearchSolr + Hadoop = Big Data Search
Solr + Hadoop = Big Data SearchMark Miller
 
Demystifying Data Engineering
Demystifying Data EngineeringDemystifying Data Engineering
Demystifying Data Engineeringnathanmarz
 
ElasticSearch Basic Introduction
ElasticSearch Basic IntroductionElasticSearch Basic Introduction
ElasticSearch Basic IntroductionMayur Rathod
 
Application Architectures with Hadoop
Application Architectures with HadoopApplication Architectures with Hadoop
Application Architectures with Hadoophadooparchbook
 
Introduction to elasticsearch
Introduction to elasticsearchIntroduction to elasticsearch
Introduction to elasticsearchhypto
 
Scala and jvm_languages_praveen_technologist
Scala and jvm_languages_praveen_technologistScala and jvm_languages_praveen_technologist
Scala and jvm_languages_praveen_technologistpmanvi
 
MaaS (Model as a Service): Modern Streaming Data Science with Apache Metron (...
MaaS (Model as a Service): Modern Streaming Data Science with Apache Metron (...MaaS (Model as a Service): Modern Streaming Data Science with Apache Metron (...
MaaS (Model as a Service): Modern Streaming Data Science with Apache Metron (...DataWorks Summit
 
Building data pipelines with kite
Building data pipelines with kiteBuilding data pipelines with kite
Building data pipelines with kiteJoey Echeverria
 
Building a Virtual Data Lake with Apache Arrow
Building a Virtual Data Lake with Apache ArrowBuilding a Virtual Data Lake with Apache Arrow
Building a Virtual Data Lake with Apache ArrowDremio Corporation
 
Elasticsearch quick Intro (English)
Elasticsearch quick Intro (English)Elasticsearch quick Intro (English)
Elasticsearch quick Intro (English)Federico Panini
 
August 2016 HUG: Recent development in Apache Oozie
August 2016 HUG: Recent development in Apache OozieAugust 2016 HUG: Recent development in Apache Oozie
August 2016 HUG: Recent development in Apache OozieYahoo Developer Network
 
What's new in Elasticsearch v5
What's new in Elasticsearch v5What's new in Elasticsearch v5
What's new in Elasticsearch v5Idan Tohami
 
Data infrastructure architecture for medium size organization: tips for colle...
Data infrastructure architecture for medium size organization: tips for colle...Data infrastructure architecture for medium size organization: tips for colle...
Data infrastructure architecture for medium size organization: tips for colle...DataWorks Summit/Hadoop Summit
 

What's hot (20)

Streaming Analytics with Spark, Kafka, Cassandra and Akka by Helena Edelson
Streaming Analytics with Spark, Kafka, Cassandra and Akka by Helena EdelsonStreaming Analytics with Spark, Kafka, Cassandra and Akka by Helena Edelson
Streaming Analytics with Spark, Kafka, Cassandra and Akka by Helena Edelson
 
Building a Large Scale SEO/SEM Application with Apache Solr
Building a Large Scale SEO/SEM Application with Apache SolrBuilding a Large Scale SEO/SEM Application with Apache Solr
Building a Large Scale SEO/SEM Application with Apache Solr
 
Elasticsearch
ElasticsearchElasticsearch
Elasticsearch
 
OLAP Battle - SolrCloud vs. HBase: Presented by Dragan Milosevic, Zanox AG
OLAP Battle - SolrCloud vs. HBase: Presented by Dragan Milosevic, Zanox AGOLAP Battle - SolrCloud vs. HBase: Presented by Dragan Milosevic, Zanox AG
OLAP Battle - SolrCloud vs. HBase: Presented by Dragan Milosevic, Zanox AG
 
Ahsay Backup Software v7 - Datasheet
Ahsay Backup Software v7 - DatasheetAhsay Backup Software v7 - Datasheet
Ahsay Backup Software v7 - Datasheet
 
Big Data Anti-Patterns: Lessons From the Front LIne
Big Data Anti-Patterns: Lessons From the Front LIneBig Data Anti-Patterns: Lessons From the Front LIne
Big Data Anti-Patterns: Lessons From the Front LIne
 
Solr + Hadoop = Big Data Search
Solr + Hadoop = Big Data SearchSolr + Hadoop = Big Data Search
Solr + Hadoop = Big Data Search
 
Demystifying Data Engineering
Demystifying Data EngineeringDemystifying Data Engineering
Demystifying Data Engineering
 
ElasticSearch Basic Introduction
ElasticSearch Basic IntroductionElasticSearch Basic Introduction
ElasticSearch Basic Introduction
 
Application Architectures with Hadoop
Application Architectures with HadoopApplication Architectures with Hadoop
Application Architectures with Hadoop
 
Introduction to elasticsearch
Introduction to elasticsearchIntroduction to elasticsearch
Introduction to elasticsearch
 
Scala and jvm_languages_praveen_technologist
Scala and jvm_languages_praveen_technologistScala and jvm_languages_praveen_technologist
Scala and jvm_languages_praveen_technologist
 
MaaS (Model as a Service): Modern Streaming Data Science with Apache Metron (...
MaaS (Model as a Service): Modern Streaming Data Science with Apache Metron (...MaaS (Model as a Service): Modern Streaming Data Science with Apache Metron (...
MaaS (Model as a Service): Modern Streaming Data Science with Apache Metron (...
 
Building data pipelines with kite
Building data pipelines with kiteBuilding data pipelines with kite
Building data pipelines with kite
 
Building a Virtual Data Lake with Apache Arrow
Building a Virtual Data Lake with Apache ArrowBuilding a Virtual Data Lake with Apache Arrow
Building a Virtual Data Lake with Apache Arrow
 
AHUG Presentation: Fun with Hadoop File Systems
AHUG Presentation: Fun with Hadoop File SystemsAHUG Presentation: Fun with Hadoop File Systems
AHUG Presentation: Fun with Hadoop File Systems
 
Elasticsearch quick Intro (English)
Elasticsearch quick Intro (English)Elasticsearch quick Intro (English)
Elasticsearch quick Intro (English)
 
August 2016 HUG: Recent development in Apache Oozie
August 2016 HUG: Recent development in Apache OozieAugust 2016 HUG: Recent development in Apache Oozie
August 2016 HUG: Recent development in Apache Oozie
 
What's new in Elasticsearch v5
What's new in Elasticsearch v5What's new in Elasticsearch v5
What's new in Elasticsearch v5
 
Data infrastructure architecture for medium size organization: tips for colle...
Data infrastructure architecture for medium size organization: tips for colle...Data infrastructure architecture for medium size organization: tips for colle...
Data infrastructure architecture for medium size organization: tips for colle...
 

Similar to Realtime Analytics and Anomalities Detection using Elasticsearch, Hadoop and Storm

Bridging Batch and Real-time Systems for Anomaly Detection
Bridging Batch and Real-time Systems for Anomaly DetectionBridging Batch and Real-time Systems for Anomaly Detection
Bridging Batch and Real-time Systems for Anomaly DetectionDataWorks Summit
 
Leaving the Ivory Tower: Research in the Real World
Leaving the Ivory Tower: Research in the Real WorldLeaving the Ivory Tower: Research in the Real World
Leaving the Ivory Tower: Research in the Real WorldArmonDadgar
 
OSMC 2014 | Using Elasticsearch, Logstash & Kibana in system administration b...
OSMC 2014 | Using Elasticsearch, Logstash & Kibana in system administration b...OSMC 2014 | Using Elasticsearch, Logstash & Kibana in system administration b...
OSMC 2014 | Using Elasticsearch, Logstash & Kibana in system administration b...NETWAYS
 
Intro to Elasticsearch
Intro to ElasticsearchIntro to Elasticsearch
Intro to ElasticsearchClifford James
 
FHIR intro and background at HL7 Germany 2014
FHIR intro and background at HL7 Germany 2014FHIR intro and background at HL7 Germany 2014
FHIR intro and background at HL7 Germany 2014Ewout Kramer
 
Towards a rebirth of data science (by Data Fellas)
Towards a rebirth of data science (by Data Fellas)Towards a rebirth of data science (by Data Fellas)
Towards a rebirth of data science (by Data Fellas)Andy Petrella
 
Leveraging NLP and Deep Learning for Document Recommendations in the Cloud
Leveraging NLP and Deep Learning for Document Recommendations in the CloudLeveraging NLP and Deep Learning for Document Recommendations in the Cloud
Leveraging NLP and Deep Learning for Document Recommendations in the CloudDatabricks
 
Hadoop crash course workshop at Hadoop Summit
Hadoop crash course workshop at Hadoop SummitHadoop crash course workshop at Hadoop Summit
Hadoop crash course workshop at Hadoop SummitDataWorks Summit
 
Making sense of your data to give new insight - Elasticsearch at Findability ...
Making sense of your data to give new insight - Elasticsearch at Findability ...Making sense of your data to give new insight - Elasticsearch at Findability ...
Making sense of your data to give new insight - Elasticsearch at Findability ...Findwise
 
Big Data - JAX2011 (Pavlo Baron)
Big Data - JAX2011 (Pavlo Baron)Big Data - JAX2011 (Pavlo Baron)
Big Data - JAX2011 (Pavlo Baron)Pavlo Baron
 
Predictive Analytics and Machine Learning …with SAS and Apache Hadoop
Predictive Analytics and Machine Learning…with SAS and Apache HadoopPredictive Analytics and Machine Learning…with SAS and Apache Hadoop
Predictive Analytics and Machine Learning …with SAS and Apache HadoopHortonworks
 
Enroller Colloquium: Sulman Sarwar
Enroller Colloquium: Sulman SarwarEnroller Colloquium: Sulman Sarwar
Enroller Colloquium: Sulman SarwarJohanna Green
 
Haystack 2018 - Algorithmic Extraction of Keywords Concepts and Vocabularies
Haystack 2018 - Algorithmic Extraction of Keywords Concepts and VocabulariesHaystack 2018 - Algorithmic Extraction of Keywords Concepts and Vocabularies
Haystack 2018 - Algorithmic Extraction of Keywords Concepts and VocabulariesMax Irwin
 
API Design Antipatterns - APICon SF
API Design Antipatterns - APICon SFAPI Design Antipatterns - APICon SF
API Design Antipatterns - APICon SFManish Pandit
 
The Open Source and Cloud Part of Oracle Big Data Cloud Service for Beginners
The Open Source and Cloud Part of Oracle Big Data Cloud Service for BeginnersThe Open Source and Cloud Part of Oracle Big Data Cloud Service for Beginners
The Open Source and Cloud Part of Oracle Big Data Cloud Service for BeginnersEdelweiss Kammermann
 
Data FAIRport Skunkworks: Common Repository Access Via Meta-Metadata Descript...
Data FAIRport Skunkworks: Common Repository Access Via Meta-Metadata Descript...Data FAIRport Skunkworks: Common Repository Access Via Meta-Metadata Descript...
Data FAIRport Skunkworks: Common Repository Access Via Meta-Metadata Descript...datascienceiqss
 
(BAC307) The Cold Data Playbook: Building the Ultimate Archive Solution in Am...
(BAC307) The Cold Data Playbook: Building the Ultimate Archive Solution in Am...(BAC307) The Cold Data Playbook: Building the Ultimate Archive Solution in Am...
(BAC307) The Cold Data Playbook: Building the Ultimate Archive Solution in Am...Amazon Web Services
 
OSMC 2014: Using elasticsearch, logstash & kibana in system administration | ...
OSMC 2014: Using elasticsearch, logstash & kibana in system administration | ...OSMC 2014: Using elasticsearch, logstash & kibana in system administration | ...
OSMC 2014: Using elasticsearch, logstash & kibana in system administration | ...NETWAYS
 
Finding knowledge, data and answers on the Semantic Web
Finding knowledge, data and answers on the Semantic WebFinding knowledge, data and answers on the Semantic Web
Finding knowledge, data and answers on the Semantic Webebiquity
 

Similar to Realtime Analytics and Anomalities Detection using Elasticsearch, Hadoop and Storm (20)

Bridging Batch and Real-time Systems for Anomaly Detection
Bridging Batch and Real-time Systems for Anomaly DetectionBridging Batch and Real-time Systems for Anomaly Detection
Bridging Batch and Real-time Systems for Anomaly Detection
 
Leaving the Ivory Tower: Research in the Real World
Leaving the Ivory Tower: Research in the Real WorldLeaving the Ivory Tower: Research in the Real World
Leaving the Ivory Tower: Research in the Real World
 
OSMC 2014 | Using Elasticsearch, Logstash & Kibana in system administration b...
OSMC 2014 | Using Elasticsearch, Logstash & Kibana in system administration b...OSMC 2014 | Using Elasticsearch, Logstash & Kibana in system administration b...
OSMC 2014 | Using Elasticsearch, Logstash & Kibana in system administration b...
 
Intro to Elasticsearch
Intro to ElasticsearchIntro to Elasticsearch
Intro to Elasticsearch
 
FHIR intro and background at HL7 Germany 2014
FHIR intro and background at HL7 Germany 2014FHIR intro and background at HL7 Germany 2014
FHIR intro and background at HL7 Germany 2014
 
Big Data
Big DataBig Data
Big Data
 
Towards a rebirth of data science (by Data Fellas)
Towards a rebirth of data science (by Data Fellas)Towards a rebirth of data science (by Data Fellas)
Towards a rebirth of data science (by Data Fellas)
 
Leveraging NLP and Deep Learning for Document Recommendations in the Cloud
Leveraging NLP and Deep Learning for Document Recommendations in the CloudLeveraging NLP and Deep Learning for Document Recommendations in the Cloud
Leveraging NLP and Deep Learning for Document Recommendations in the Cloud
 
Hadoop crash course workshop at Hadoop Summit
Hadoop crash course workshop at Hadoop SummitHadoop crash course workshop at Hadoop Summit
Hadoop crash course workshop at Hadoop Summit
 
Making sense of your data to give new insight - Elasticsearch at Findability ...
Making sense of your data to give new insight - Elasticsearch at Findability ...Making sense of your data to give new insight - Elasticsearch at Findability ...
Making sense of your data to give new insight - Elasticsearch at Findability ...
 
Big Data - JAX2011 (Pavlo Baron)
Big Data - JAX2011 (Pavlo Baron)Big Data - JAX2011 (Pavlo Baron)
Big Data - JAX2011 (Pavlo Baron)
 
Predictive Analytics and Machine Learning …with SAS and Apache Hadoop
Predictive Analytics and Machine Learning…with SAS and Apache HadoopPredictive Analytics and Machine Learning…with SAS and Apache Hadoop
Predictive Analytics and Machine Learning …with SAS and Apache Hadoop
 
Enroller Colloquium: Sulman Sarwar
Enroller Colloquium: Sulman SarwarEnroller Colloquium: Sulman Sarwar
Enroller Colloquium: Sulman Sarwar
 
Haystack 2018 - Algorithmic Extraction of Keywords Concepts and Vocabularies
Haystack 2018 - Algorithmic Extraction of Keywords Concepts and VocabulariesHaystack 2018 - Algorithmic Extraction of Keywords Concepts and Vocabularies
Haystack 2018 - Algorithmic Extraction of Keywords Concepts and Vocabularies
 
API Design Antipatterns - APICon SF
API Design Antipatterns - APICon SFAPI Design Antipatterns - APICon SF
API Design Antipatterns - APICon SF
 
The Open Source and Cloud Part of Oracle Big Data Cloud Service for Beginners
The Open Source and Cloud Part of Oracle Big Data Cloud Service for BeginnersThe Open Source and Cloud Part of Oracle Big Data Cloud Service for Beginners
The Open Source and Cloud Part of Oracle Big Data Cloud Service for Beginners
 
Data FAIRport Skunkworks: Common Repository Access Via Meta-Metadata Descript...
Data FAIRport Skunkworks: Common Repository Access Via Meta-Metadata Descript...Data FAIRport Skunkworks: Common Repository Access Via Meta-Metadata Descript...
Data FAIRport Skunkworks: Common Repository Access Via Meta-Metadata Descript...
 
(BAC307) The Cold Data Playbook: Building the Ultimate Archive Solution in Am...
(BAC307) The Cold Data Playbook: Building the Ultimate Archive Solution in Am...(BAC307) The Cold Data Playbook: Building the Ultimate Archive Solution in Am...
(BAC307) The Cold Data Playbook: Building the Ultimate Archive Solution in Am...
 
OSMC 2014: Using elasticsearch, logstash & kibana in system administration | ...
OSMC 2014: Using elasticsearch, logstash & kibana in system administration | ...OSMC 2014: Using elasticsearch, logstash & kibana in system administration | ...
OSMC 2014: Using elasticsearch, logstash & kibana in system administration | ...
 
Finding knowledge, data and answers on the Semantic Web
Finding knowledge, data and answers on the Semantic WebFinding knowledge, data and answers on the Semantic Web
Finding knowledge, data and answers on the Semantic Web
 

More from DataWorks Summit

Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisDataWorks Summit
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiDataWorks Summit
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...DataWorks Summit
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...DataWorks Summit
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal SystemDataWorks Summit
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExampleDataWorks Summit
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberDataWorks Summit
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixDataWorks Summit
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiDataWorks Summit
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsDataWorks Summit
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureDataWorks Summit
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EngineDataWorks Summit
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...DataWorks Summit
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudDataWorks Summit
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiDataWorks Summit
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerDataWorks Summit
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...DataWorks Summit
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouDataWorks Summit
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkDataWorks Summit
 

More from DataWorks Summit (20)

Data Science Crash Course
Data Science Crash CourseData Science Crash Course
Data Science Crash Course
 
Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache Ratis
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal System
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist Example
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at Uber
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability Improvements
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant Architecture
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything Engine
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google Cloud
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near You
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
 

Recently uploaded

DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsMiki Katsuragi
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DayH2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DaySri Ambati
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxhariprasad279825
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Manik S Magar
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxNavinnSomaal
 

Recently uploaded (20)

DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DayH2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptx
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 

Realtime Analytics and Anomalities Detection using Elasticsearch, Hadoop and Storm

  • 1. Copyright Elasticsearch 2014 Copying, publishing and/or distributing without written permission is strictly prohibited Real-time Analytics & Anomaly detection using Hadoop, Elasticsearch and Storm Costin Leau @costinl
  • 2. Copyright Elasticsearch 2014. Copying, publishing and/or distributing without written permission is strictly prohibited
  • 3. Copyright Elasticsearch 2014. Copying, publishing and/or distributing without written permission is strictly prohibited Interesting != Common Datasets tend to have hot / common entities Monopolize the data set Create too much noise Cannot be easily avoided Common = frequent Interesting = frequently different
  • 4. Copyright Elasticsearch 2014. Copying, publishing and/or distributing without written permission is strictly prohibited Finding the uncommon Background vs foreground == things that stand out Example: Background: “flu” “H5N1” appears in 5 / 10M docs H5N1 flu
  • 5. Copyright Elasticsearch 2014. Copying, publishing and/or distributing without written permission is strictly prohibited Finding the uncommon Background vs foreground == things that stand out Example: Background: “flu” “H5N1” appears in 5 / 10M docs Foreground: “bird flu” “H5N1” appears in 4 / 100 docs H5N1 flu H5N1 bird flu
  • 6. Copyright Elasticsearch 2014. Copying, publishing and/or distributing without written permission is strictly prohibited Finding the uncommon - Challenges Deal with big data sets • Hadoop Perform the analysis • Elasticsearch Keep the data fresh • Storm
  • 7. Copyright Elasticsearch 2014. Copying, publishing and/or distributing without written permission is strictly prohibited Hadoop De-facto platform for big data HDFS - Used for storing and performing ETL at scale Map/Reduce - Excellent for iterating, thorough analysis YARN – Job scheduling and resource management
  • 8. Copyright Elasticsearch 2014. Copying, publishing and/or distributing without written permission is strictly prohibited Elasticsearch Open-source real-time search and analytics engine • Fully-featured search Relevance-ranked text search Scalable search High-performance geo, temporal, range and key lookup Highlighting Support for complex / nested document types * Spelling suggestions Powerful query DSL * “Standing” queries * Real-time results * Extensible via plugins * • Powerful faceting/analysis Summarize large sets by any combinations of time, geo, category and more. * “Kibana” visualization tool * * Features we see as differentiators • Management Simple and robust deployments * REST APIs for handling all aspects of administration/monitoring * “Marvel” console for monitoring and administering clusters * Special features to manage the life cycle of content * • Integration Hadoop (Map/Red,Hive,Pig,Cascading..)* Client libraries (Python, Java, Ruby, javascript…) Data connectors (Twitter, JMS…) Logstash ETL framework * • Support Development and Production support with tiered levels Support staff are the core developers of the product *
  • 9. Copyright Elasticsearch 2014. Copying, publishing and/or distributing without written permission is strictly prohibited Elasticsearch Open-source real-time search and analytics engine
  • 10. Copyright Elasticsearch 2014. Copying, publishing and/or distributing without written permission is strictly prohibited Elasticsearch Hadoop Use Elasticsearch natively in Hadoop ‣ Map/Reduce – Input/OutputFormat ‣ Apache Pig – Storage ‣ Apache Hive – External Table ‣ Cascading – Tap/Sink ‣ Storm (in development) – Spout / Bolt All operations (reads/writes) are parallelized (Map/Reduce)
  • 11. Copyright Elasticsearch 2014. Copying, publishing and/or distributing without written permission is strictly prohibited Storm Distributed, fault-tolerant, real-time computation system Perform on-the-fly queries React to live data Prevention Routing
  • 12. Copyright Elasticsearch 2014 Copying, publishing and/or distributing without written permission is strictly prohibited Discovering the relevant
  • 13. Copyright Elasticsearch 2014. Copying, publishing and/or distributing without written permission is strictly prohibited Inverted index Inverting Shakespeare ‣ Take all the plays and break them down word by word ‣ For each word, store the ids of the documents that contain it ‣ Sort all tokens (words) token doc freq. postings (doc ids) Anthony 2 1, 2 Brutus 1 5 Caesar 2 2, 3 Calpurnia 2 4, 5
  • 14. Copyright Elasticsearch 2014. Copying, publishing and/or distributing without written permission is strictly prohibited Relevancy How well does a document match a query? step query d1 d2 The text brown fox The quick brown fox likes brown nuts The red fox The terms (brown, fox) (brown, brown, fox, likes, nuts, quick) (red, fox) A frequency vector (1, 1) (2, 1) (0, 1) Relevancy - 2? 1?
  • 15. Copyright Elasticsearch 2014. Copying, publishing and/or distributing without written permission is strictly prohibited Relevancy - Vector Space Model • How well q matches d1 and d2? ‣ The coordinates in the vector represent weights per term ‣ The simple (1, 0) vector we discussed defines these weights based on the frequency of each term ‣ But to generalize: . 2 1 1 tf: brown tf: fox q: (brown, fox) d1: (brown, brown, fox) d2: (fox)
  • 16. Copyright Elasticsearch 2014. Copying, publishing and/or distributing without written permission is strictly prohibited Relevancy TF-IDF Term frequency / Inverse Document Frequency TF = the more a token appears in a doc, the more important it is IDF = the more documents containing the term, the less important it is
  • 17. Copyright Elasticsearch 2014. Copying, publishing and/or distributing without written permission is strictly prohibited Ranking Formula Called Lucene Similarity Can be ignored (was an attempt to make query scores comparable across indices, it’s there for backward compatibility) Core TF/IDF weight Score of a document for a given query Normalized doc length, shorter docs are more likely to be relevant than longer docs Boost of query term t
  • 18. Copyright Elasticsearch 2014 Copying, publishing and/or distributing without written permission is strictly prohibited Discovering the interesting
  • 19. Copyright Elasticsearch 2014. Copying, publishing and/or distributing without written permission is strictly prohibited Frequency differentiator TF-IDF by-itself is not enough need to compare the DF in foreground vs background Precision vs Recall balance
  • 20. Copyright Elasticsearch 2014. Copying, publishing and/or distributing without written permission is strictly prohibited Single-set analysis A C F H I K A B C D E … X Y Z W Query results Dataset
  • 21. Copyright Elasticsearch 2014. Copying, publishing and/or distributing without written permission is strictly prohibited Single-set analysis example crimes bicycle theft crimes bicycle theft British Police Force British Transport Police
  • 22. Copyright Elasticsearch 2014. Copying, publishing and/or distributing without written permission is strictly prohibited Multi-set analysis A B C D E … X Y Z W A C F H I K M Q R … Query results Dataset A B C D .. J L M N O .. U Aggregate
  • 23. Copyright Elasticsearch 2014. Copying, publishing and/or distributing without written permission is strictly prohibited Background (geo-aggregation)
  • 24. Copyright Elasticsearch 2014. Copying, publishing and/or distributing without written permission is strictly prohibited Foreground (geo-aggregation)
  • 25. Copyright Elasticsearch 2014. Copying, publishing and/or distributing without written permission is strictly prohibited Hadoop Off-line / slow learning ‣ In-depth analysis ‣ Break down data into hot spots ‣ Eliminate noise ‣ Build multiple models
  • 26. Copyright Elasticsearch 2014. Copying, publishing and/or distributing without written permission is strictly prohibited Elasticsearch Search features ‣ Scoring, TF-IDF ‣ Significant terms (multi-set analysis) Aggregations ‣ Buckets & Metrics
  • 27. Copyright Elasticsearch 2014 Copying, publishing and/or distributing without written permission is strictly prohibited Reacting to data
  • 28. Copyright Elasticsearch 2014. Copying, publishing and/or distributing without written permission is strictly prohibited Reacting to data Prevent execute queries as data flows in  build a model Route place suspicious data into a dedicate pipeline
  • 29. Copyright Elasticsearch 2014. Copying, publishing and/or distributing without written permission is strictly prohibited Reacting to data spout bolt bolt bolt bolt bolt bolt bolt bolt bolt
  • 30. Copyright Elasticsearch 2014. Copying, publishing and/or distributing without written permission is strictly prohibited Live loop Data keeps changing ‣ Adapt the set of rules Improves reaction time ‣ Build a model for fast decision making Keeps the prevention rate high ‣ Categorize data on the fly bolt
  • 31. Copyright Elasticsearch 2014 Copying, publishing and/or distributing without written permission is strictly prohibited Putting it all together
  • 32. Copyright Elasticsearch 2014. Copying, publishing and/or distributing without written permission is strictly prohibited The Big Picture HDFS Slow, in-depth learning Fast, real-time learning ETL
  • 33. Copyright Elasticsearch 2014. Copying, publishing and/or distributing without written permission is strictly prohibited Usages Recommendation ‣ Find similar movies based on user feedback ‣ Use Storm to optimize the returned results Card Fraud ‣ Use Storm to prevent suspicious transactions from executing ‣ Route possible frauds to a dedicated analysis queue
  • 34. Copyright Elasticsearch 2014. Copying, publishing and/or distributing without written permission is strictly prohibited
  • 35. Copyright Elasticsearch 2014 Copying, publishing and/or distributing without written permission is strictly prohibited Q&A Thank you! @costinl

Editor's Notes

  1. Zipf distribution, power-law, long tail
  2. Full-text search Highlight Search-as-you-type Did-you-mean More-like-this Geolocation Events, logs,
  3. Precision = how many of the retrieved documents are relevant Recall = how many of the relevant documents are retrieved
  4. Term aggregation
  5. Significant terms