SlideShare a Scribd company logo
1 of 18
Download to read offline
© 2014 IBM Corporation
Advanced Data Retrieval and Analytics
with Apache Spark and Openstack Swift
Gil Vernik
IBM Research - Haifa
© 2014 IBM Corporation
Topics Covered in This Talk
§ Openstack Swift
§ Apache Spark
§ Basic integration between Spark and Swift
§ Advanced integration between Spark and Swift by
utilizing the Storlets technology.
© 2014 IBM Corporation
Digital Universe
More than 1.8 zettabytes
(1.8 trillion gigabytes)
Grows rapidly
80% owned by enterprises
75% generated by individuals
According IDC iView "Extracting Value from Chaos,"
© 2014 IBM Corporation
Map-Reduce, Databases, etc..
Data needs to be replicated, Time, Cost, etc..
© 2014 IBM Corporation
Can we do it better?
© 2014 IBM Corporation
Openstack Swift
§ A massively scalable object store
§ Known to work with thousands of
servers, stores petabytes of data.
§ Exposes REST API
§ Features:
– Storage polices
– Erasure codes
– Data replication
– ….
PUTProxy Nodes
Storage Nodes
© 2014 IBM Corporation
Apache Spark
§ Apache Spark™ is a fast and general engine for
large-scale data processing
– Up to 100x faster than Hadoop Map
Reduce in-memory, 10x faster on disk
§ Combines SQL, streaming, and complex analytics
§ Can read existing Hadoop data
§ Most active project in Apache today
© 2014 IBM Corporation
Swift enablement for data retrieval in Spark
§ Apache Spark implements Hadoop interfaces and can use
HDFS or Amazon S3 as a data source.
Swift
Network
§ IBM research enabled Spark to access data stored in
Openstack Swift.
© 2014 IBM Corporation
What do we analyze?
Swift
Network
Stored Data Input to Analytics
Images EXIF metadata
PDF Hidden metadata
LOGs Only ‘ERROR’ records
…. ….
© 2014 IBM Corporation
Yes! We can do it better.
© 2014 IBM Corporation
Storlets: Flexibly extend for Swift
Advanced Data processing inside Swift
§ Storlets is a way to ‘extend’
cloud computational capabilities
§ Storlet is compiled code,
deployed to Swift and when
triggered is executed by Storlet
Engine directly on storage
nodes.
§ Storlet engine - responsible to
execute every storlet in a secure
environment
§ Storlet is a standard Java code
© 2014 IBM Corporation
Storlets extend an object store by
moving computation to the data –
filtering, transforming, analyzing –
instead of bringing the data to the
computation
© 2014 IBM Corporation
Swift Storlets: How do they benefit Spark?
Swift Storlet
Network
Objects
Filter
Data processing+
© 2014 IBM Corporation
Storlets Enable Extending the Functionality of Spark
Example: analyzing EXIF metadata from photos
§ Object store is a
natural repository for
photos
§ Photos contain rich
capture metadata
§ Analyzing this
metadata for a set of
photos can show how
the camera is used
© 2014 IBM Corporation
Example: Analyzing EXIF metadata
Storlets can extract metadata, returning as JSON
(rather than of processing the binary data directly by Spark)
10MB 1KB
© 2014 IBM Corporation
Example: Analyzing EXIF metadata.
•  Spark accesses images via storlet
•  No change to Spark, only changes the URI
•  JSON file returned by storlet defines schema
•  SQL from Spark processes metadata
© 2014 IBM Corporation
Example: Analyzing EXIF metadata.
© 2014 IBM Corporation
Summary
§ Openstack Swift is the most popular open source
object store
§ Apache Spark is the next big thing in data analytics
§ Spark and Swift can be integrated
§ Storlets in Swift provide clear benefits for analytics
use cases.
Thank you!
More information
Gil Vernik, IBM Research -Haifa
gilv@il.ibm.com

More Related Content

What's hot

Open stack journey from folsom to grizzly
Open stack journey from folsom to grizzlyOpen stack journey from folsom to grizzly
Open stack journey from folsom to grizzly
openstackindia
 
Nova for Physicalization and Virtualization compute models
Nova for Physicalization and Virtualization compute modelsNova for Physicalization and Virtualization compute models
Nova for Physicalization and Virtualization compute models
openstackindia
 

What's hot (20)

Juniper Network Automation for KrDAG
Juniper Network Automation for KrDAGJuniper Network Automation for KrDAG
Juniper Network Automation for KrDAG
 
OpenStack Networking and Automation
OpenStack Networking and AutomationOpenStack Networking and Automation
OpenStack Networking and Automation
 
Scaling OpenStack Networking Beyond 4000 Nodes with Dragonflow - Eshed Gal-Or...
Scaling OpenStack Networking Beyond 4000 Nodes with Dragonflow - Eshed Gal-Or...Scaling OpenStack Networking Beyond 4000 Nodes with Dragonflow - Eshed Gal-Or...
Scaling OpenStack Networking Beyond 4000 Nodes with Dragonflow - Eshed Gal-Or...
 
Open stack journey from folsom to grizzly
Open stack journey from folsom to grizzlyOpen stack journey from folsom to grizzly
Open stack journey from folsom to grizzly
 
Nova net-or-neutron-atlanta2014.pptx
Nova net-or-neutron-atlanta2014.pptxNova net-or-neutron-atlanta2014.pptx
Nova net-or-neutron-atlanta2014.pptx
 
OpenContrail Cloudwatt Feedback
OpenContrail Cloudwatt FeedbackOpenContrail Cloudwatt Feedback
OpenContrail Cloudwatt Feedback
 
Can the Open vSwitch (OVS) bottleneck be resolved? - Erez Cohen - OpenStack D...
Can the Open vSwitch (OVS) bottleneck be resolved? - Erez Cohen - OpenStack D...Can the Open vSwitch (OVS) bottleneck be resolved? - Erez Cohen - OpenStack D...
Can the Open vSwitch (OVS) bottleneck be resolved? - Erez Cohen - OpenStack D...
 
L4-L7 services for SDN and NVF by Youcef Laribi
L4-L7 services for SDN and NVF by Youcef LaribiL4-L7 services for SDN and NVF by Youcef Laribi
L4-L7 services for SDN and NVF by Youcef Laribi
 
Open stack networking_101_update_2014
Open stack networking_101_update_2014Open stack networking_101_update_2014
Open stack networking_101_update_2014
 
OpenStack Neutron behind the Scenes
OpenStack Neutron behind the ScenesOpenStack Neutron behind the Scenes
OpenStack Neutron behind the Scenes
 
Open stack korea_uni2u_pdf
Open stack korea_uni2u_pdfOpen stack korea_uni2u_pdf
Open stack korea_uni2u_pdf
 
OpenDaylight Netvirt and Neutron - Mike Kolesnik, Josh Hershberg - OpenStack ...
OpenDaylight Netvirt and Neutron - Mike Kolesnik, Josh Hershberg - OpenStack ...OpenDaylight Netvirt and Neutron - Mike Kolesnik, Josh Hershberg - OpenStack ...
OpenDaylight Netvirt and Neutron - Mike Kolesnik, Josh Hershberg - OpenStack ...
 
Contrail Basics
Contrail BasicsContrail Basics
Contrail Basics
 
BRKDCT-2445 Agile OpenStack Networking with Cisco Solutions - Cisco Live! US ...
BRKDCT-2445 Agile OpenStack Networking with Cisco Solutions - Cisco Live! US ...BRKDCT-2445 Agile OpenStack Networking with Cisco Solutions - Cisco Live! US ...
BRKDCT-2445 Agile OpenStack Networking with Cisco Solutions - Cisco Live! US ...
 
Nova for Physicalization and Virtualization compute models
Nova for Physicalization and Virtualization compute modelsNova for Physicalization and Virtualization compute models
Nova for Physicalization and Virtualization compute models
 
OpenStack Israel Meetup - Project Kuryr: Bringing Container Networking to Neu...
OpenStack Israel Meetup - Project Kuryr: Bringing Container Networking to Neu...OpenStack Israel Meetup - Project Kuryr: Bringing Container Networking to Neu...
OpenStack Israel Meetup - Project Kuryr: Bringing Container Networking to Neu...
 
Agile Networking with OpenStack
Agile Networking with OpenStack Agile Networking with OpenStack
Agile Networking with OpenStack
 
Osdc2014 openstack networking yves_fauser
Osdc2014 openstack networking yves_fauserOsdc2014 openstack networking yves_fauser
Osdc2014 openstack networking yves_fauser
 
Network and Service Virtualization tutorial at ONUG Spring 2015
Network and Service Virtualization tutorial at ONUG Spring 2015Network and Service Virtualization tutorial at ONUG Spring 2015
Network and Service Virtualization tutorial at ONUG Spring 2015
 
Cloud Networking - Leaving the Physical Behind - Omer Anson - OpenStack Day I...
Cloud Networking - Leaving the Physical Behind - Omer Anson - OpenStack Day I...Cloud Networking - Leaving the Physical Behind - Omer Anson - OpenStack Day I...
Cloud Networking - Leaving the Physical Behind - Omer Anson - OpenStack Day I...
 

Viewers also liked

Viewers also liked (20)

Tachyon-2014-11-21-amp-camp5
Tachyon-2014-11-21-amp-camp5Tachyon-2014-11-21-amp-camp5
Tachyon-2014-11-21-amp-camp5
 
The Little Warehouse That Couldn't Or: How We Learned to Stop Worrying and Mo...
The Little Warehouse That Couldn't Or: How We Learned to Stop Worrying and Mo...The Little Warehouse That Couldn't Or: How We Learned to Stop Worrying and Mo...
The Little Warehouse That Couldn't Or: How We Learned to Stop Worrying and Mo...
 
Open Stack Cheat Sheet V1
Open Stack Cheat Sheet V1Open Stack Cheat Sheet V1
Open Stack Cheat Sheet V1
 
Linux Filesystems, RAID, and more
Linux Filesystems, RAID, and moreLinux Filesystems, RAID, and more
Linux Filesystems, RAID, and more
 
Lessons Learned with Spark at the US Patent & Trademark Office-(Christopher B...
Lessons Learned with Spark at the US Patent & Trademark Office-(Christopher B...Lessons Learned with Spark at the US Patent & Trademark Office-(Christopher B...
Lessons Learned with Spark at the US Patent & Trademark Office-(Christopher B...
 
The Hot Rod Protocol in Infinispan
The Hot Rod Protocol in InfinispanThe Hot Rod Protocol in Infinispan
The Hot Rod Protocol in Infinispan
 
Accelerating Cassandra Workloads on Ceph with All-Flash PCIE SSDS
Accelerating Cassandra Workloads on Ceph with All-Flash PCIE SSDSAccelerating Cassandra Workloads on Ceph with All-Flash PCIE SSDS
Accelerating Cassandra Workloads on Ceph with All-Flash PCIE SSDS
 
Scaling up genomic analysis with ADAM
Scaling up genomic analysis with ADAMScaling up genomic analysis with ADAM
Scaling up genomic analysis with ADAM
 
ELC-E 2010: The Right Approach to Minimal Boot Times
ELC-E 2010: The Right Approach to Minimal Boot TimesELC-E 2010: The Right Approach to Minimal Boot Times
ELC-E 2010: The Right Approach to Minimal Boot Times
 
Velox: Models in Action
Velox: Models in ActionVelox: Models in Action
Velox: Models in Action
 
Naïveté vs. Experience
Naïveté vs. ExperienceNaïveté vs. Experience
Naïveté vs. Experience
 
SparkR: Enabling Interactive Data Science at Scale
SparkR: Enabling Interactive Data Science at ScaleSparkR: Enabling Interactive Data Science at Scale
SparkR: Enabling Interactive Data Science at Scale
 
SampleClean: Bringing Data Cleaning into the BDAS Stack
SampleClean: Bringing Data Cleaning into the BDAS StackSampleClean: Bringing Data Cleaning into the BDAS Stack
SampleClean: Bringing Data Cleaning into the BDAS Stack
 
Lab 5: Interconnecting a Datacenter using Mininet
Lab 5: Interconnecting a Datacenter using MininetLab 5: Interconnecting a Datacenter using Mininet
Lab 5: Interconnecting a Datacenter using Mininet
 
OpenStack Cheat Sheet V2
OpenStack Cheat Sheet V2OpenStack Cheat Sheet V2
OpenStack Cheat Sheet V2
 
A Curious Course on Coroutines and Concurrency
A Curious Course on Coroutines and ConcurrencyA Curious Course on Coroutines and Concurrency
A Curious Course on Coroutines and Concurrency
 
Spark on Mesos-A Deep Dive-(Dean Wampler and Tim Chen, Typesafe and Mesosphere)
Spark on Mesos-A Deep Dive-(Dean Wampler and Tim Chen, Typesafe and Mesosphere)Spark on Mesos-A Deep Dive-(Dean Wampler and Tim Chen, Typesafe and Mesosphere)
Spark on Mesos-A Deep Dive-(Dean Wampler and Tim Chen, Typesafe and Mesosphere)
 
Best Practices for Virtualizing Apache Hadoop
Best Practices for Virtualizing Apache HadoopBest Practices for Virtualizing Apache Hadoop
Best Practices for Virtualizing Apache Hadoop
 
In Search of the Perfect Global Interpreter Lock
In Search of the Perfect Global Interpreter LockIn Search of the Perfect Global Interpreter Lock
In Search of the Perfect Global Interpreter Lock
 
Python in Action (Part 2)
Python in Action (Part 2)Python in Action (Part 2)
Python in Action (Part 2)
 

Similar to Advanced Data Retrieval and Analytics with Apache Spark and Openstack Swift

Similar to Advanced Data Retrieval and Analytics with Apache Spark and Openstack Swift (20)

AWS (Hadoop) Meetup 30.04.09
AWS (Hadoop) Meetup 30.04.09AWS (Hadoop) Meetup 30.04.09
AWS (Hadoop) Meetup 30.04.09
 
Cloud-based Data Lake for Analytics and AI
Cloud-based Data Lake for Analytics and AICloud-based Data Lake for Analytics and AI
Cloud-based Data Lake for Analytics and AI
 
IBM Cloud Day January 2021 - A well architected data lake
IBM Cloud Day January 2021 - A well architected data lakeIBM Cloud Day January 2021 - A well architected data lake
IBM Cloud Day January 2021 - A well architected data lake
 
A Brave new object store world
A Brave new object store worldA Brave new object store world
A Brave new object store world
 
Se training storage grid webscale technical overview
Se training   storage grid webscale technical overviewSe training   storage grid webscale technical overview
Se training storage grid webscale technical overview
 
IDE.pptx
IDE.pptxIDE.pptx
IDE.pptx
 
Spark introduction and architecture
Spark introduction and architectureSpark introduction and architecture
Spark introduction and architecture
 
Spark introduction and architecture
Spark introduction and architectureSpark introduction and architecture
Spark introduction and architecture
 
NetApp Se training storage grid webscale technical overview
NetApp Se training   storage grid webscale technical overviewNetApp Se training   storage grid webscale technical overview
NetApp Se training storage grid webscale technical overview
 
IBM Cloud Day January 2021 Data Lake Deep Dive
IBM Cloud Day January 2021 Data Lake Deep DiveIBM Cloud Day January 2021 Data Lake Deep Dive
IBM Cloud Day January 2021 Data Lake Deep Dive
 
Serverless data lake architecture
Serverless data lake architectureServerless data lake architecture
Serverless data lake architecture
 
S cv3179 spectrum-integration-openstack-edge2015-v5
S cv3179 spectrum-integration-openstack-edge2015-v5S cv3179 spectrum-integration-openstack-edge2015-v5
S cv3179 spectrum-integration-openstack-edge2015-v5
 
Webinar Nebula&Scalr : Increasing Business Agility with Real-time Processing ...
Webinar Nebula&Scalr : Increasing Business Agility with Real-time Processing ...Webinar Nebula&Scalr : Increasing Business Agility with Real-time Processing ...
Webinar Nebula&Scalr : Increasing Business Agility with Real-time Processing ...
 
Webinar: Increasing Business Agility with Real-time Processing with Apache Ha...
Webinar: Increasing Business Agility with Real-time Processing with Apache Ha...Webinar: Increasing Business Agility with Real-time Processing with Apache Ha...
Webinar: Increasing Business Agility with Real-time Processing with Apache Ha...
 
Oracle Cloud Infrastructure Data Science 概要資料(20200406)
Oracle Cloud Infrastructure Data Science 概要資料(20200406)Oracle Cloud Infrastructure Data Science 概要資料(20200406)
Oracle Cloud Infrastructure Data Science 概要資料(20200406)
 
Big data cloud architecture
Big data cloud architectureBig data cloud architecture
Big data cloud architecture
 
Azure + DataStax Enterprise (DSE) Powers Office365 Per User Store
Azure + DataStax Enterprise (DSE) Powers Office365 Per User StoreAzure + DataStax Enterprise (DSE) Powers Office365 Per User Store
Azure + DataStax Enterprise (DSE) Powers Office365 Per User Store
 
Stsg17 speaker yousunjeong
Stsg17 speaker yousunjeongStsg17 speaker yousunjeong
Stsg17 speaker yousunjeong
 
How Spark Enables the Internet of Things: Efficient Integration of Multiple ...
How Spark Enables the Internet of Things: Efficient Integration of Multiple ...How Spark Enables the Internet of Things: Efficient Integration of Multiple ...
How Spark Enables the Internet of Things: Efficient Integration of Multiple ...
 
Building an ETL pipeline for Elasticsearch using Spark
Building an ETL pipeline for Elasticsearch using SparkBuilding an ETL pipeline for Elasticsearch using Spark
Building an ETL pipeline for Elasticsearch using Spark
 

More from Daniel Krook

More from Daniel Krook (20)

Commit to the Cause, Push for Change: Contributing to Call for Code Open Sour...
Commit to the Cause, Push for Change: Contributing to Call for Code Open Sour...Commit to the Cause, Push for Change: Contributing to Call for Code Open Sour...
Commit to the Cause, Push for Change: Contributing to Call for Code Open Sour...
 
Engaging Open Source Developers to Develop Tech for Good through Code and Res...
Engaging Open Source Developers to Develop Tech for Good through Code and Res...Engaging Open Source Developers to Develop Tech for Good through Code and Res...
Engaging Open Source Developers to Develop Tech for Good through Code and Res...
 
COVID-19 and Climate Change Action Through Open Source Technology
COVID-19 and Climate Change Action Through Open Source TechnologyCOVID-19 and Climate Change Action Through Open Source Technology
COVID-19 and Climate Change Action Through Open Source Technology
 
Serverless APIs with Apache OpenWhisk
Serverless APIs with Apache OpenWhiskServerless APIs with Apache OpenWhisk
Serverless APIs with Apache OpenWhisk
 
Workshop: Develop Serverless Applications with IBM Cloud Functions
Workshop: Develop Serverless Applications with IBM Cloud FunctionsWorkshop: Develop Serverless Applications with IBM Cloud Functions
Workshop: Develop Serverless Applications with IBM Cloud Functions
 
Event specifications, state of the serverless landscape, and other news from ...
Event specifications, state of the serverless landscape, and other news from ...Event specifications, state of the serverless landscape, and other news from ...
Event specifications, state of the serverless landscape, and other news from ...
 
Serverless Architectures in Banking: OpenWhisk on IBM Bluemix at Santander
Serverless Architectures in Banking: OpenWhisk on IBM Bluemix at SantanderServerless Architectures in Banking: OpenWhisk on IBM Bluemix at Santander
Serverless Architectures in Banking: OpenWhisk on IBM Bluemix at Santander
 
The CNCF on Serverless
The CNCF on ServerlessThe CNCF on Serverless
The CNCF on Serverless
 
Building serverless applications with Apache OpenWhisk and IBM Cloud Functions
Building serverless applications with Apache OpenWhisk and IBM Cloud FunctionsBuilding serverless applications with Apache OpenWhisk and IBM Cloud Functions
Building serverless applications with Apache OpenWhisk and IBM Cloud Functions
 
Building serverless applications with Apache OpenWhisk
Building serverless applications with Apache OpenWhiskBuilding serverless applications with Apache OpenWhisk
Building serverless applications with Apache OpenWhisk
 
Containers vs serverless - Navigating application deployment options
Containers vs serverless - Navigating application deployment optionsContainers vs serverless - Navigating application deployment options
Containers vs serverless - Navigating application deployment options
 
Serverless architectures built on an open source platform
Serverless architectures built on an open source platformServerless architectures built on an open source platform
Serverless architectures built on an open source platform
 
Build a cloud native app with OpenWhisk
Build a cloud native app with OpenWhiskBuild a cloud native app with OpenWhisk
Build a cloud native app with OpenWhisk
 
Cloud Native Architectures with an Open Source, Event Driven, Serverless Plat...
Cloud Native Architectures with an Open Source, Event Driven, Serverless Plat...Cloud Native Architectures with an Open Source, Event Driven, Serverless Plat...
Cloud Native Architectures with an Open Source, Event Driven, Serverless Plat...
 
Open Container Technologies and OpenStack - Sorting Through Kubernetes, the O...
Open Container Technologies and OpenStack - Sorting Through Kubernetes, the O...Open Container Technologies and OpenStack - Sorting Through Kubernetes, the O...
Open Container Technologies and OpenStack - Sorting Through Kubernetes, the O...
 
Serverless apps with OpenWhisk
Serverless apps with OpenWhiskServerless apps with OpenWhisk
Serverless apps with OpenWhisk
 
OpenWhisk - A platform for cloud native, serverless, event driven apps
OpenWhisk - A platform for cloud native, serverless, event driven appsOpenWhisk - A platform for cloud native, serverless, event driven apps
OpenWhisk - A platform for cloud native, serverless, event driven apps
 
Containers, OCI, CNCF, Magnum, Kuryr, and You!
Containers, OCI, CNCF, Magnum, Kuryr, and You!Containers, OCI, CNCF, Magnum, Kuryr, and You!
Containers, OCI, CNCF, Magnum, Kuryr, and You!
 
Taking the Next Hot Mobile Game Live with Docker and IBM SoftLayer
Taking the Next Hot Mobile Game Live with Docker and IBM SoftLayerTaking the Next Hot Mobile Game Live with Docker and IBM SoftLayer
Taking the Next Hot Mobile Game Live with Docker and IBM SoftLayer
 
CAPS: What's best for deploying and managing OpenStack? Chef vs. Ansible vs. ...
CAPS: What's best for deploying and managing OpenStack? Chef vs. Ansible vs. ...CAPS: What's best for deploying and managing OpenStack? Chef vs. Ansible vs. ...
CAPS: What's best for deploying and managing OpenStack? Chef vs. Ansible vs. ...
 

Recently uploaded

+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
?#DUbAI#??##{{(☎️+971_581248768%)**%*]'#abortion pills for sale in dubai@
 

Recently uploaded (20)

Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
HTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesHTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation Strategies
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
Developing An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilDeveloping An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of Brazil
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 

Advanced Data Retrieval and Analytics with Apache Spark and Openstack Swift

  • 1. © 2014 IBM Corporation Advanced Data Retrieval and Analytics with Apache Spark and Openstack Swift Gil Vernik IBM Research - Haifa
  • 2. © 2014 IBM Corporation Topics Covered in This Talk § Openstack Swift § Apache Spark § Basic integration between Spark and Swift § Advanced integration between Spark and Swift by utilizing the Storlets technology.
  • 3. © 2014 IBM Corporation Digital Universe More than 1.8 zettabytes (1.8 trillion gigabytes) Grows rapidly 80% owned by enterprises 75% generated by individuals According IDC iView "Extracting Value from Chaos,"
  • 4. © 2014 IBM Corporation Map-Reduce, Databases, etc.. Data needs to be replicated, Time, Cost, etc..
  • 5. © 2014 IBM Corporation Can we do it better?
  • 6. © 2014 IBM Corporation Openstack Swift § A massively scalable object store § Known to work with thousands of servers, stores petabytes of data. § Exposes REST API § Features: – Storage polices – Erasure codes – Data replication – …. PUTProxy Nodes Storage Nodes
  • 7. © 2014 IBM Corporation Apache Spark § Apache Spark™ is a fast and general engine for large-scale data processing – Up to 100x faster than Hadoop Map Reduce in-memory, 10x faster on disk § Combines SQL, streaming, and complex analytics § Can read existing Hadoop data § Most active project in Apache today
  • 8. © 2014 IBM Corporation Swift enablement for data retrieval in Spark § Apache Spark implements Hadoop interfaces and can use HDFS or Amazon S3 as a data source. Swift Network § IBM research enabled Spark to access data stored in Openstack Swift.
  • 9. © 2014 IBM Corporation What do we analyze? Swift Network Stored Data Input to Analytics Images EXIF metadata PDF Hidden metadata LOGs Only ‘ERROR’ records …. ….
  • 10. © 2014 IBM Corporation Yes! We can do it better.
  • 11. © 2014 IBM Corporation Storlets: Flexibly extend for Swift Advanced Data processing inside Swift § Storlets is a way to ‘extend’ cloud computational capabilities § Storlet is compiled code, deployed to Swift and when triggered is executed by Storlet Engine directly on storage nodes. § Storlet engine - responsible to execute every storlet in a secure environment § Storlet is a standard Java code
  • 12. © 2014 IBM Corporation Storlets extend an object store by moving computation to the data – filtering, transforming, analyzing – instead of bringing the data to the computation
  • 13. © 2014 IBM Corporation Swift Storlets: How do they benefit Spark? Swift Storlet Network Objects Filter Data processing+
  • 14. © 2014 IBM Corporation Storlets Enable Extending the Functionality of Spark Example: analyzing EXIF metadata from photos § Object store is a natural repository for photos § Photos contain rich capture metadata § Analyzing this metadata for a set of photos can show how the camera is used
  • 15. © 2014 IBM Corporation Example: Analyzing EXIF metadata Storlets can extract metadata, returning as JSON (rather than of processing the binary data directly by Spark) 10MB 1KB
  • 16. © 2014 IBM Corporation Example: Analyzing EXIF metadata. •  Spark accesses images via storlet •  No change to Spark, only changes the URI •  JSON file returned by storlet defines schema •  SQL from Spark processes metadata
  • 17. © 2014 IBM Corporation Example: Analyzing EXIF metadata.
  • 18. © 2014 IBM Corporation Summary § Openstack Swift is the most popular open source object store § Apache Spark is the next big thing in data analytics § Spark and Swift can be integrated § Storlets in Swift provide clear benefits for analytics use cases. Thank you! More information Gil Vernik, IBM Research -Haifa gilv@il.ibm.com