SlideShare a Scribd company logo
1 of 13
Download to read offline
Using Spark and Shark for
Fast Cycle Analysis on Diverse Data
Vaibhav Nivargi
12.2.13
clearstorydata.com
About ClearStory Data

clearstorydata.com
Analysis in the New Data Landscape
New use cases seen in all industries.
• Live situational analysis requiring fast-cycle
analysis across internal data and sources of
external data

• Multi-source analysis with data refreshing on
new insights, as data from sources evolves

• Large-scale analysis of structured and
unstructured data combined in integrated
insights
clearstorydata.com
Example: Interactive Multi-source Analysis
More data and more people change the analysis.

News
Coverage
Online, Print,
Television

Data Intelligence

Donations
New Members,
Donations
Website Traffic
Traffic,
Referrals,
Content

Facebook
Shares, Likes,
Comments

Twitter
Followers,
Tweets, Retweets

Interactive analysis on diverse
internal & external data

Corporate Sponsors
Corporate
Engagement, New
Inquiries
clearstorydata.com
Today’s Need is Speed, Scale & Ad Hoc Flexibility
With more sources, more data and more people.

?

?

clearstorydata.com

?

?
Why Spark and Shark ?
• RDDs
– Low latency & scale
– Iterative and Interactive computation

• Lineage and fault tolerance
– Able to re-derive data

• Expressive power of Scala and SQL
– Operations beyond aggregations, joins, and statistical operators
– Advanced: ML, data mining, segmentation, approximate queries,
graphs …

• Support for structured and semi-structured data
• BDAS Stack & AMPLab
– Tachyon, MLBase, BlinkDB, GraphX …

• Community and adoption

clearstorydata.com
The ClearStory Solution

Data Sources

ClearStory Platform

ClearStory Application

Harmonization

Data Inference & Profiling

In-Memory
Data Units

Visualization

Collaboration

clearstorydata.com
Where do Spark & Shark fit ?
User Application
ClearStory API
Harmonization Engine and Blended Data Processing
Spark Cluster + ClearStory IP

Data Access, Inference and Lineage
Data Source API

RDBMS
clearstorydata.com

Hadoop

Files

Public

Web

Premium
How we leverage Spark & Shark
• User intent captured and translated to custom API
• Harmonization-as-a-Service
• Manages Spark and Shark query execution

• Read cached data from HDFS
• RESTful
• Merges datasets (RDDs) on the fly – on user request
• Support conversion of user actions to backend queries
• Query optimizations

• Performance optimizations
• Mixed-mode execution (sql2rdd & spark native)
• Caching
• Pre-computation
clearstorydata.com
How we leverage Spark & Shark
• Query results returned to the application for
scalable visualization and ClearStory-specific viz
techniques
• RDDs cached/un-cached and materialized at
strategic points based on usage patterns and
signals
• Data updates automatically processed as source
data changes
• ClearStory’s own deployment, packaging, and
integrated monitoring for operations at scale

clearstorydata.com
Spark Developments – What We Like
• Query cancellation, progress indication (0.8.1 and
beyond)
• More performance breakthroughs
• Workload Management
• BlinkDB

• MLBase
• Tachyon
• GraphX

clearstorydata.com
We’re Hiring!
• Working with the community, giving back
• Lots of exciting new developments
• This is like the early days of Hadoop – massive
momentum gathering

The First Spark Summit!
More Meet-ups!

clearstorydata.com
clearstorydata.com

More Related Content

What's hot

Strata+Hadoop World NY 2016 - Avinash Ramineni
Strata+Hadoop World NY 2016 - Avinash RamineniStrata+Hadoop World NY 2016 - Avinash Ramineni
Strata+Hadoop World NY 2016 - Avinash RamineniAvinash Ramineni
 
Webinar: Transforming Customer Experience Through an Always-On Data Platform
Webinar: Transforming Customer Experience Through an Always-On Data PlatformWebinar: Transforming Customer Experience Through an Always-On Data Platform
Webinar: Transforming Customer Experience Through an Always-On Data PlatformDataStax
 
Creating a Modern Data Architecture for Digital Transformation
Creating a Modern Data Architecture for Digital TransformationCreating a Modern Data Architecture for Digital Transformation
Creating a Modern Data Architecture for Digital TransformationMongoDB
 
Azure data catalog your data your way eugene polonichko dataconf 21 04 18
Azure data catalog your data your way eugene polonichko dataconf 21 04 18Azure data catalog your data your way eugene polonichko dataconf 21 04 18
Azure data catalog your data your way eugene polonichko dataconf 21 04 18Olga Zinkevych
 
Big Data at Geisinger Health System: Big Wins in a Short Time
Big Data at Geisinger Health System: Big Wins in a Short TimeBig Data at Geisinger Health System: Big Wins in a Short Time
Big Data at Geisinger Health System: Big Wins in a Short TimeDataWorks Summit
 
Building big data solutions on azure
Building big data solutions on azureBuilding big data solutions on azure
Building big data solutions on azureEyal Ben Ivri
 
Pentaho Big Data Analytics with Vertica and Hadoop
Pentaho Big Data Analytics with Vertica and HadoopPentaho Big Data Analytics with Vertica and Hadoop
Pentaho Big Data Analytics with Vertica and HadoopMark Kromer
 
Hybrid Transactional/Analytics Processing: Beyond the Big Database Hype
Hybrid Transactional/Analytics Processing: Beyond the Big Database HypeHybrid Transactional/Analytics Processing: Beyond the Big Database Hype
Hybrid Transactional/Analytics Processing: Beyond the Big Database HypeAli Hodroj
 
Webinar | Real-time Analytics for Healthcare: How Amara Turned Big Data into ...
Webinar | Real-time Analytics for Healthcare: How Amara Turned Big Data into ...Webinar | Real-time Analytics for Healthcare: How Amara Turned Big Data into ...
Webinar | Real-time Analytics for Healthcare: How Amara Turned Big Data into ...DataStax
 
Big data – can it deliver speed and accuracy v1
Big data – can it deliver speed and accuracy v1Big data – can it deliver speed and accuracy v1
Big data – can it deliver speed and accuracy v1GurinderG
 
How OpenTable uses Big Data to impact growth by Raman Marya
How OpenTable uses Big Data to impact growth by Raman MaryaHow OpenTable uses Big Data to impact growth by Raman Marya
How OpenTable uses Big Data to impact growth by Raman MaryaData Con LA
 
Big Data Business Wins: Real-time Inventory Tracking with Hadoop
Big Data Business Wins: Real-time Inventory Tracking with HadoopBig Data Business Wins: Real-time Inventory Tracking with Hadoop
Big Data Business Wins: Real-time Inventory Tracking with HadoopDataWorks Summit
 
Entity Resolution Service - Bringing Petabytes of Data Online for Instant Access
Entity Resolution Service - Bringing Petabytes of Data Online for Instant AccessEntity Resolution Service - Bringing Petabytes of Data Online for Instant Access
Entity Resolution Service - Bringing Petabytes of Data Online for Instant AccessDataWorks Summit
 
Scaling Data overview
Scaling Data overviewScaling Data overview
Scaling Data overviewWade Malone
 
DesignMind Data Analytics Consulting
DesignMind Data Analytics Consulting DesignMind Data Analytics Consulting
DesignMind Data Analytics Consulting DesignMind
 
Azure cafe marketplace with looker data analytics
Azure cafe marketplace with looker data analyticsAzure cafe marketplace with looker data analytics
Azure cafe marketplace with looker data analyticsMark Kromer
 
Denodo DataFest 2016: Comparing and Contrasting Data Virtualization With Data...
Denodo DataFest 2016: Comparing and Contrasting Data Virtualization With Data...Denodo DataFest 2016: Comparing and Contrasting Data Virtualization With Data...
Denodo DataFest 2016: Comparing and Contrasting Data Virtualization With Data...Denodo
 
Big Data Analytics in the Cloud with Microsoft Azure
Big Data Analytics in the Cloud with Microsoft AzureBig Data Analytics in the Cloud with Microsoft Azure
Big Data Analytics in the Cloud with Microsoft AzureMark Kromer
 

What's hot (20)

Strata+Hadoop World NY 2016 - Avinash Ramineni
Strata+Hadoop World NY 2016 - Avinash RamineniStrata+Hadoop World NY 2016 - Avinash Ramineni
Strata+Hadoop World NY 2016 - Avinash Ramineni
 
Data Preparation of Data Science
Data Preparation of Data ScienceData Preparation of Data Science
Data Preparation of Data Science
 
Webinar: Transforming Customer Experience Through an Always-On Data Platform
Webinar: Transforming Customer Experience Through an Always-On Data PlatformWebinar: Transforming Customer Experience Through an Always-On Data Platform
Webinar: Transforming Customer Experience Through an Always-On Data Platform
 
Creating a Modern Data Architecture for Digital Transformation
Creating a Modern Data Architecture for Digital TransformationCreating a Modern Data Architecture for Digital Transformation
Creating a Modern Data Architecture for Digital Transformation
 
Azure data catalog your data your way eugene polonichko dataconf 21 04 18
Azure data catalog your data your way eugene polonichko dataconf 21 04 18Azure data catalog your data your way eugene polonichko dataconf 21 04 18
Azure data catalog your data your way eugene polonichko dataconf 21 04 18
 
Big Data at Geisinger Health System: Big Wins in a Short Time
Big Data at Geisinger Health System: Big Wins in a Short TimeBig Data at Geisinger Health System: Big Wins in a Short Time
Big Data at Geisinger Health System: Big Wins in a Short Time
 
Building big data solutions on azure
Building big data solutions on azureBuilding big data solutions on azure
Building big data solutions on azure
 
Pentaho Big Data Analytics with Vertica and Hadoop
Pentaho Big Data Analytics with Vertica and HadoopPentaho Big Data Analytics with Vertica and Hadoop
Pentaho Big Data Analytics with Vertica and Hadoop
 
Hybrid Transactional/Analytics Processing: Beyond the Big Database Hype
Hybrid Transactional/Analytics Processing: Beyond the Big Database HypeHybrid Transactional/Analytics Processing: Beyond the Big Database Hype
Hybrid Transactional/Analytics Processing: Beyond the Big Database Hype
 
Azure synapse by usama whaba khan
Azure synapse by usama whaba khanAzure synapse by usama whaba khan
Azure synapse by usama whaba khan
 
Webinar | Real-time Analytics for Healthcare: How Amara Turned Big Data into ...
Webinar | Real-time Analytics for Healthcare: How Amara Turned Big Data into ...Webinar | Real-time Analytics for Healthcare: How Amara Turned Big Data into ...
Webinar | Real-time Analytics for Healthcare: How Amara Turned Big Data into ...
 
Big data – can it deliver speed and accuracy v1
Big data – can it deliver speed and accuracy v1Big data – can it deliver speed and accuracy v1
Big data – can it deliver speed and accuracy v1
 
How OpenTable uses Big Data to impact growth by Raman Marya
How OpenTable uses Big Data to impact growth by Raman MaryaHow OpenTable uses Big Data to impact growth by Raman Marya
How OpenTable uses Big Data to impact growth by Raman Marya
 
Big Data Business Wins: Real-time Inventory Tracking with Hadoop
Big Data Business Wins: Real-time Inventory Tracking with HadoopBig Data Business Wins: Real-time Inventory Tracking with Hadoop
Big Data Business Wins: Real-time Inventory Tracking with Hadoop
 
Entity Resolution Service - Bringing Petabytes of Data Online for Instant Access
Entity Resolution Service - Bringing Petabytes of Data Online for Instant AccessEntity Resolution Service - Bringing Petabytes of Data Online for Instant Access
Entity Resolution Service - Bringing Petabytes of Data Online for Instant Access
 
Scaling Data overview
Scaling Data overviewScaling Data overview
Scaling Data overview
 
DesignMind Data Analytics Consulting
DesignMind Data Analytics Consulting DesignMind Data Analytics Consulting
DesignMind Data Analytics Consulting
 
Azure cafe marketplace with looker data analytics
Azure cafe marketplace with looker data analyticsAzure cafe marketplace with looker data analytics
Azure cafe marketplace with looker data analytics
 
Denodo DataFest 2016: Comparing and Contrasting Data Virtualization With Data...
Denodo DataFest 2016: Comparing and Contrasting Data Virtualization With Data...Denodo DataFest 2016: Comparing and Contrasting Data Virtualization With Data...
Denodo DataFest 2016: Comparing and Contrasting Data Virtualization With Data...
 
Big Data Analytics in the Cloud with Microsoft Azure
Big Data Analytics in the Cloud with Microsoft AzureBig Data Analytics in the Cloud with Microsoft Azure
Big Data Analytics in the Cloud with Microsoft Azure
 

Similar to Clear story _spark_

Data lake-itweekend-sharif university-vahid amiry
Data lake-itweekend-sharif university-vahid amiryData lake-itweekend-sharif university-vahid amiry
Data lake-itweekend-sharif university-vahid amirydatastack
 
DataMinds 2022 Azure Purview Erwin de Kreuk
DataMinds 2022 Azure Purview Erwin de KreukDataMinds 2022 Azure Purview Erwin de Kreuk
DataMinds 2022 Azure Purview Erwin de KreukErwin de Kreuk
 
Presto: Fast SQL-on-Anything (including Delta Lake, Snowflake, Elasticsearch ...
Presto: Fast SQL-on-Anything (including Delta Lake, Snowflake, Elasticsearch ...Presto: Fast SQL-on-Anything (including Delta Lake, Snowflake, Elasticsearch ...
Presto: Fast SQL-on-Anything (including Delta Lake, Snowflake, Elasticsearch ...Databricks
 
Enabling Next Gen Analytics with Azure Data Lake and StreamSets
Enabling Next Gen Analytics with Azure Data Lake and StreamSetsEnabling Next Gen Analytics with Azure Data Lake and StreamSets
Enabling Next Gen Analytics with Azure Data Lake and StreamSetsStreamsets Inc.
 
Fast Cycle, Multi-Terabyte Data Analysis with Amazon Redshift and ClearStory ...
Fast Cycle, Multi-Terabyte Data Analysis with Amazon Redshift and ClearStory ...Fast Cycle, Multi-Terabyte Data Analysis with Amazon Redshift and ClearStory ...
Fast Cycle, Multi-Terabyte Data Analysis with Amazon Redshift and ClearStory ...ClearStory Data
 
Whither the Hadoop Developer Experience, June Hadoop Meetup, Nitin Motgi
Whither the Hadoop Developer Experience, June Hadoop Meetup, Nitin MotgiWhither the Hadoop Developer Experience, June Hadoop Meetup, Nitin Motgi
Whither the Hadoop Developer Experience, June Hadoop Meetup, Nitin MotgiFelicia Haggarty
 
Azure Purview Data Toboggan Erwin de Kreuk
Azure Purview Data Toboggan Erwin de KreukAzure Purview Data Toboggan Erwin de Kreuk
Azure Purview Data Toboggan Erwin de KreukErwin de Kreuk
 
Accelerating Data Lakes and Streams with Real-time Analytics
Accelerating Data Lakes and Streams with Real-time AnalyticsAccelerating Data Lakes and Streams with Real-time Analytics
Accelerating Data Lakes and Streams with Real-time AnalyticsArcadia Data
 
Data weekender4.2 azure purview erwin de kreuk
Data weekender4.2  azure purview erwin de kreukData weekender4.2  azure purview erwin de kreuk
Data weekender4.2 azure purview erwin de kreukErwin de Kreuk
 
ADV Slides: Building and Growing Organizational Analytics with Data Lakes
ADV Slides: Building and Growing Organizational Analytics with Data LakesADV Slides: Building and Growing Organizational Analytics with Data Lakes
ADV Slides: Building and Growing Organizational Analytics with Data LakesDATAVERSITY
 
Powering Interactive BI Analytics with Presto and Delta Lake
Powering Interactive BI Analytics with Presto and Delta LakePowering Interactive BI Analytics with Presto and Delta Lake
Powering Interactive BI Analytics with Presto and Delta LakeDatabricks
 
Harness the power of Data in a Big Data Lake
Harness the power of Data in a Big Data LakeHarness the power of Data in a Big Data Lake
Harness the power of Data in a Big Data LakeSaurabh K. Gupta
 
When and How Data Lakes Fit into a Modern Data Architecture
When and How Data Lakes Fit into a Modern Data ArchitectureWhen and How Data Lakes Fit into a Modern Data Architecture
When and How Data Lakes Fit into a Modern Data ArchitectureDATAVERSITY
 
Get Started with Cloudera’s Cyber Solution
Get Started with Cloudera’s Cyber SolutionGet Started with Cloudera’s Cyber Solution
Get Started with Cloudera’s Cyber SolutionCloudera, Inc.
 
Data Governance for Data Lakes
Data Governance for Data LakesData Governance for Data Lakes
Data Governance for Data LakesKiran Kamreddy
 
ADV Slides: When and How Data Lakes Fit into a Modern Data Architecture
ADV Slides: When and How Data Lakes Fit into a Modern Data ArchitectureADV Slides: When and How Data Lakes Fit into a Modern Data Architecture
ADV Slides: When and How Data Lakes Fit into a Modern Data ArchitectureDATAVERSITY
 
Hadoop in 2015: Keys to Achieving Operational Excellence for the Real-Time En...
Hadoop in 2015: Keys to Achieving Operational Excellence for the Real-Time En...Hadoop in 2015: Keys to Achieving Operational Excellence for the Real-Time En...
Hadoop in 2015: Keys to Achieving Operational Excellence for the Real-Time En...MapR Technologies
 
Gab Genai Cloudera - Going Beyond Traditional Analytic
Gab Genai Cloudera - Going Beyond Traditional Analytic Gab Genai Cloudera - Going Beyond Traditional Analytic
Gab Genai Cloudera - Going Beyond Traditional Analytic IntelAPAC
 
Simplifying Real-Time Architectures for IoT with Apache Kudu
Simplifying Real-Time Architectures for IoT with Apache KuduSimplifying Real-Time Architectures for IoT with Apache Kudu
Simplifying Real-Time Architectures for IoT with Apache KuduCloudera, Inc.
 
Which data should you move to Hadoop?
Which data should you move to Hadoop?Which data should you move to Hadoop?
Which data should you move to Hadoop?Attunity
 

Similar to Clear story _spark_ (20)

Data lake-itweekend-sharif university-vahid amiry
Data lake-itweekend-sharif university-vahid amiryData lake-itweekend-sharif university-vahid amiry
Data lake-itweekend-sharif university-vahid amiry
 
DataMinds 2022 Azure Purview Erwin de Kreuk
DataMinds 2022 Azure Purview Erwin de KreukDataMinds 2022 Azure Purview Erwin de Kreuk
DataMinds 2022 Azure Purview Erwin de Kreuk
 
Presto: Fast SQL-on-Anything (including Delta Lake, Snowflake, Elasticsearch ...
Presto: Fast SQL-on-Anything (including Delta Lake, Snowflake, Elasticsearch ...Presto: Fast SQL-on-Anything (including Delta Lake, Snowflake, Elasticsearch ...
Presto: Fast SQL-on-Anything (including Delta Lake, Snowflake, Elasticsearch ...
 
Enabling Next Gen Analytics with Azure Data Lake and StreamSets
Enabling Next Gen Analytics with Azure Data Lake and StreamSetsEnabling Next Gen Analytics with Azure Data Lake and StreamSets
Enabling Next Gen Analytics with Azure Data Lake and StreamSets
 
Fast Cycle, Multi-Terabyte Data Analysis with Amazon Redshift and ClearStory ...
Fast Cycle, Multi-Terabyte Data Analysis with Amazon Redshift and ClearStory ...Fast Cycle, Multi-Terabyte Data Analysis with Amazon Redshift and ClearStory ...
Fast Cycle, Multi-Terabyte Data Analysis with Amazon Redshift and ClearStory ...
 
Whither the Hadoop Developer Experience, June Hadoop Meetup, Nitin Motgi
Whither the Hadoop Developer Experience, June Hadoop Meetup, Nitin MotgiWhither the Hadoop Developer Experience, June Hadoop Meetup, Nitin Motgi
Whither the Hadoop Developer Experience, June Hadoop Meetup, Nitin Motgi
 
Azure Purview Data Toboggan Erwin de Kreuk
Azure Purview Data Toboggan Erwin de KreukAzure Purview Data Toboggan Erwin de Kreuk
Azure Purview Data Toboggan Erwin de Kreuk
 
Accelerating Data Lakes and Streams with Real-time Analytics
Accelerating Data Lakes and Streams with Real-time AnalyticsAccelerating Data Lakes and Streams with Real-time Analytics
Accelerating Data Lakes and Streams with Real-time Analytics
 
Data weekender4.2 azure purview erwin de kreuk
Data weekender4.2  azure purview erwin de kreukData weekender4.2  azure purview erwin de kreuk
Data weekender4.2 azure purview erwin de kreuk
 
ADV Slides: Building and Growing Organizational Analytics with Data Lakes
ADV Slides: Building and Growing Organizational Analytics with Data LakesADV Slides: Building and Growing Organizational Analytics with Data Lakes
ADV Slides: Building and Growing Organizational Analytics with Data Lakes
 
Powering Interactive BI Analytics with Presto and Delta Lake
Powering Interactive BI Analytics with Presto and Delta LakePowering Interactive BI Analytics with Presto and Delta Lake
Powering Interactive BI Analytics with Presto and Delta Lake
 
Harness the power of Data in a Big Data Lake
Harness the power of Data in a Big Data LakeHarness the power of Data in a Big Data Lake
Harness the power of Data in a Big Data Lake
 
When and How Data Lakes Fit into a Modern Data Architecture
When and How Data Lakes Fit into a Modern Data ArchitectureWhen and How Data Lakes Fit into a Modern Data Architecture
When and How Data Lakes Fit into a Modern Data Architecture
 
Get Started with Cloudera’s Cyber Solution
Get Started with Cloudera’s Cyber SolutionGet Started with Cloudera’s Cyber Solution
Get Started with Cloudera’s Cyber Solution
 
Data Governance for Data Lakes
Data Governance for Data LakesData Governance for Data Lakes
Data Governance for Data Lakes
 
ADV Slides: When and How Data Lakes Fit into a Modern Data Architecture
ADV Slides: When and How Data Lakes Fit into a Modern Data ArchitectureADV Slides: When and How Data Lakes Fit into a Modern Data Architecture
ADV Slides: When and How Data Lakes Fit into a Modern Data Architecture
 
Hadoop in 2015: Keys to Achieving Operational Excellence for the Real-Time En...
Hadoop in 2015: Keys to Achieving Operational Excellence for the Real-Time En...Hadoop in 2015: Keys to Achieving Operational Excellence for the Real-Time En...
Hadoop in 2015: Keys to Achieving Operational Excellence for the Real-Time En...
 
Gab Genai Cloudera - Going Beyond Traditional Analytic
Gab Genai Cloudera - Going Beyond Traditional Analytic Gab Genai Cloudera - Going Beyond Traditional Analytic
Gab Genai Cloudera - Going Beyond Traditional Analytic
 
Simplifying Real-Time Architectures for IoT with Apache Kudu
Simplifying Real-Time Architectures for IoT with Apache KuduSimplifying Real-Time Architectures for IoT with Apache Kudu
Simplifying Real-Time Architectures for IoT with Apache Kudu
 
Which data should you move to Hadoop?
Which data should you move to Hadoop?Which data should you move to Hadoop?
Which data should you move to Hadoop?
 

Recently uploaded

HTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesHTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesBoston Institute of Analytics
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
Tech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfTech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfhans926745
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsJoaquim Jorge
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Enterprise Knowledge
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
Developing An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilDeveloping An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilV3cube
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUK Journal
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoffsammart93
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobeapidays
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdflior mazor
 

Recently uploaded (20)

HTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesHTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation Strategies
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Tech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfTech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdf
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Developing An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilDeveloping An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of Brazil
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 

Clear story _spark_

  • 1. Using Spark and Shark for Fast Cycle Analysis on Diverse Data Vaibhav Nivargi 12.2.13 clearstorydata.com
  • 3. Analysis in the New Data Landscape New use cases seen in all industries. • Live situational analysis requiring fast-cycle analysis across internal data and sources of external data • Multi-source analysis with data refreshing on new insights, as data from sources evolves • Large-scale analysis of structured and unstructured data combined in integrated insights clearstorydata.com
  • 4. Example: Interactive Multi-source Analysis More data and more people change the analysis. News Coverage Online, Print, Television Data Intelligence Donations New Members, Donations Website Traffic Traffic, Referrals, Content Facebook Shares, Likes, Comments Twitter Followers, Tweets, Retweets Interactive analysis on diverse internal & external data Corporate Sponsors Corporate Engagement, New Inquiries clearstorydata.com
  • 5. Today’s Need is Speed, Scale & Ad Hoc Flexibility With more sources, more data and more people. ? ? clearstorydata.com ? ?
  • 6. Why Spark and Shark ? • RDDs – Low latency & scale – Iterative and Interactive computation • Lineage and fault tolerance – Able to re-derive data • Expressive power of Scala and SQL – Operations beyond aggregations, joins, and statistical operators – Advanced: ML, data mining, segmentation, approximate queries, graphs … • Support for structured and semi-structured data • BDAS Stack & AMPLab – Tachyon, MLBase, BlinkDB, GraphX … • Community and adoption clearstorydata.com
  • 7. The ClearStory Solution Data Sources ClearStory Platform ClearStory Application Harmonization Data Inference & Profiling In-Memory Data Units Visualization Collaboration clearstorydata.com
  • 8. Where do Spark & Shark fit ? User Application ClearStory API Harmonization Engine and Blended Data Processing Spark Cluster + ClearStory IP Data Access, Inference and Lineage Data Source API RDBMS clearstorydata.com Hadoop Files Public Web Premium
  • 9. How we leverage Spark & Shark • User intent captured and translated to custom API • Harmonization-as-a-Service • Manages Spark and Shark query execution • Read cached data from HDFS • RESTful • Merges datasets (RDDs) on the fly – on user request • Support conversion of user actions to backend queries • Query optimizations • Performance optimizations • Mixed-mode execution (sql2rdd & spark native) • Caching • Pre-computation clearstorydata.com
  • 10. How we leverage Spark & Shark • Query results returned to the application for scalable visualization and ClearStory-specific viz techniques • RDDs cached/un-cached and materialized at strategic points based on usage patterns and signals • Data updates automatically processed as source data changes • ClearStory’s own deployment, packaging, and integrated monitoring for operations at scale clearstorydata.com
  • 11. Spark Developments – What We Like • Query cancellation, progress indication (0.8.1 and beyond) • More performance breakthroughs • Workload Management • BlinkDB • MLBase • Tachyon • GraphX clearstorydata.com
  • 12. We’re Hiring! • Working with the community, giving back • Lots of exciting new developments • This is like the early days of Hadoop – massive momentum gathering The First Spark Summit! More Meet-ups! clearstorydata.com