1. © 2014 IBM Corporation
Advanced Data Retrieval and Analytics
with Apache Spark and OpenStack Swift
Gil Vernik
IBM Research - Haifa
2. © 2014 IBM Corporation
Topics Covered in This Talk
§ OpenStack Swift
§ Apache Spark
§ Basic integration between Spark and Swift
§ Advanced integration between Spark and Swift using
Storlets technology
3. © 2014 IBM Corporation
Digital Universe
More than 1.8 zettabytes
(1.8 trillion gigabytes)
Grows rapidly
80% owned by enterprises
75% generated by individuals
According to IDC iView, "Extracting Value from Chaos"
4. © 2014 IBM Corporation
MapReduce, databases, etc.
Data needs to be replicated; time, cost, etc.
6. © 2014 IBM Corporation
OpenStack Swift
§ A massively scalable object store
§ Known to run on thousands of
servers and store petabytes of data
§ Exposes REST API
§ Features:
– Storage policies
– Erasure codes
– Data replication
– ….
[Diagram: PUT request, Proxy Nodes, Storage Nodes]
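
As a rough illustration of the REST API mentioned on this slide, the sketch below builds (but does not send) an object upload request. The `/v1/{account}/{container}/{object}` path and the `X-Auth-Token` header follow the Swift object storage API; the endpoint, account, container, object name, and token are placeholders invented for the example, and `build_put_request` is a hypothetical helper.

```python
from urllib.parse import quote
from urllib.request import Request


def build_put_request(endpoint, account, container, obj, data, token):
    """Build (without sending) a Swift object PUT request."""
    # Swift addresses objects as /v1/{account}/{container}/{object}.
    path = "/v1/{}/{}/{}".format(quote(account), quote(container), quote(obj))
    return Request(
        endpoint.rstrip("/") + path,
        data=data,
        method="PUT",
        headers={"X-Auth-Token": token},
    )


# All values below are hypothetical placeholders.
req = build_put_request(
    "http://swift.example.com:8080",
    "AUTH_demo",
    "photos",
    "2014/beach.jpg",
    b"...image bytes...",
    "example-token",
)
print(req.get_method(), req.full_url)
```

In a real deployment the request would arrive at a proxy node, which forwards the object to the storage nodes.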
7. © 2014 IBM Corporation
Apache Spark
§ Apache Spark™ is a fast and general engine for
large-scale data processing
– Up to 100x faster than Hadoop MapReduce
in memory, or 10x faster on disk
§ Combines SQL, streaming, and complex analytics
§ Can read existing Hadoop data
§ Most active project in Apache today
8. © 2014 IBM Corporation
Swift enablement for data retrieval in Spark
§ Apache Spark implements Hadoop interfaces and can use
HDFS or Amazon S3 as a data source.
[Diagram: Swift, Network]
§ IBM Research enabled Spark to access data stored in
OpenStack Swift.
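
The talk does not show what this access looks like in practice, so the following is only a sketch. It assumes the `swift://container.provider/path` URI form and the `fs.swift.service.<provider>.*` property names used by the Hadoop OpenStack connector described in the Spark documentation. The helper names, provider name, and credentials are hypothetical.

```python
def swift_uri(container, provider, path):
    """Build a URI of the assumed form swift://container.provider/path."""
    return "swift://{}.{}/{}".format(container, provider, path.lstrip("/"))


def swift_hadoop_conf(provider, auth_url, tenant, username, password):
    """Collect the assumed Hadoop properties for one Swift provider."""
    prefix = "fs.swift.service.{}.".format(provider)
    return {
        prefix + "auth.url": auth_url,
        prefix + "tenant": tenant,
        prefix + "username": username,
        prefix + "password": password,
    }


# Hypothetical provider name and credentials, for illustration only.
uri = swift_uri("logs", "demo", "/2014/11/app.log")
conf = swift_hadoop_conf(
    "demo", "http://keystone.example.com:5000/v2.0/tokens",
    "analytics", "spark", "secret",
)
print(uri)
```

With such a URI, a Spark job would read the object much as it reads from HDFS, for example by passing the URI to `SparkContext.textFile`.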
9. © 2014 IBM Corporation
What do we analyze?
[Diagram: Swift, Network]
Stored Data    Input to Analytics
Images         EXIF metadata
PDF            Hidden metadata
LOGs           Only ‘ERROR’ records
….             ….
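
To make the log row of the table concrete, here is a minimal sketch of how little of a stored log the analytics job may actually need. The log lines and the `error_records` helper are invented for the example.

```python
def error_records(lines):
    """Keep only the lines that contain the 'ERROR' marker."""
    return [line for line in lines if "ERROR" in line]


# Hypothetical log content.
log = [
    "2014-11-03 10:00:01 INFO service started",
    "2014-11-03 10:00:02 ERROR disk /dev/sdb unreachable",
    "2014-11-03 10:00:03 INFO request served",
    "2014-11-03 10:00:04 ERROR timeout contacting node 7",
]

selected = error_records(log)

# Bytes stored versus bytes the analysis needs (+1 per line for the newline).
stored_bytes = sum(len(line) + 1 for line in log)
needed_bytes = sum(len(line) + 1 for line in selected)
print(len(selected), "of", len(log), "records;",
      needed_bytes, "of", stored_bytes, "bytes")
```

If the filter runs only after the whole object has crossed the network, the unneeded bytes are transferred anyway.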
10. © 2014 IBM Corporation
Yes! We can do it better.
11. © 2014 IBM Corporation
Storlets: Flexibly extending Swift
Advanced Data processing inside Swift
§ Storlets are a way to ‘extend’
cloud computational capabilities
§ A storlet is compiled code that is
deployed to Swift; when triggered,
it is executed by the Storlet
Engine directly on the storage
nodes
§ The Storlet Engine is responsible for
executing every storlet in a secure
environment
§ A storlet is standard Java code
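
The execution model described above can be sketched as a function that reads an object as a stream and writes a (usually much smaller) result stream. This is not the Storlet Engine API: real storlets are Java, and `summary_storlet` and `run_storlet` are hypothetical names used only to show the shape of the idea.

```python
import io
import json


def summary_storlet(in_stream, out_stream):
    """Read an object as a stream and emit a small JSON summary."""
    lines = 0
    size = 0
    for line in in_stream:  # iterating a binary stream yields lines
        lines += 1
        size += len(line)
    summary = {"lines": lines, "bytes": size}
    out_stream.write(json.dumps(summary).encode("utf-8"))


def run_storlet(storlet, object_bytes):
    """Hypothetical stand-in for an engine running next to the data."""
    in_stream = io.BytesIO(object_bytes)
    out_stream = io.BytesIO()
    storlet(in_stream, out_stream)
    return out_stream.getvalue()


result = run_storlet(summary_storlet, b"alpha\nbeta\ngamma\n")
print(result.decode("utf-8"))
```

Only the bytes written to the output stream would leave the storage node, which is what makes the approach attractive for analytics.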
12. © 2014 IBM Corporation
Storlets extend an object store by
moving computation to the data –
filtering, transforming, analyzing –
instead of bringing the data to the
computation
13. © 2014 IBM Corporation
Swift Storlets: How do they benefit Spark?
[Diagram: Swift Storlet, Network, Objects, Filter, Data processing]
14. © 2014 IBM Corporation
Storlets Enable Extending the Functionality of Spark
Example: analyzing EXIF metadata from photos
§ Object store is a
natural repository for
photos
§ Photos contain rich
capture metadata
§ Analyzing this
metadata for a set of
photos can show how
the camera is used
15. © 2014 IBM Corporation
Example: Analyzing EXIF metadata
Storlets can extract the metadata and return it as JSON
(rather than having Spark process the binary data directly)
10MB → 1KB
16. © 2014 IBM Corporation
Example: Analyzing EXIF metadata.
• Spark accesses images via storlet
• No change to Spark; only the URI changes
• JSON file returned by storlet defines schema
• SQL from Spark processes metadata
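
The last two bullets can be sketched without a Spark cluster. The example below uses Python's built-in `sqlite3` module as a stand-in for Spark SQL, so it shows the idea (JSON keys become columns, SQL runs over the metadata) rather than the talk's actual code. The JSON records and their values are hypothetical.

```python
import json
import sqlite3

# Hypothetical JSON records, one per photo, as a storlet might return them.
records = [
    '{"Model": "X100", "ISOSpeedRatings": 200, "FocalLength": 35.0}',
    '{"Model": "X100", "ISOSpeedRatings": 800, "FocalLength": 35.0}',
    '{"Model": "Z5", "ISOSpeedRatings": 100, "FocalLength": 50.0}',
]
rows = [json.loads(record) for record in records]

# The JSON keys define the table columns, i.e. the schema.
columns = sorted(rows[0])
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE photos ({})".format(", ".join(columns)))
conn.executemany(
    "INSERT INTO photos VALUES ({})".format(", ".join("?" for _ in columns)),
    [tuple(row[column] for column in columns) for row in rows],
)

# SQL over the metadata: how many photos were taken with each camera model?
per_model = conn.execute(
    "SELECT Model, COUNT(*) FROM photos GROUP BY Model ORDER BY Model"
).fetchall()
print(per_model)
```

In Spark the same steps would be handled by Spark SQL's JSON support, with the storlet output read through the modified URI.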
17. © 2014 IBM Corporation
Example: Analyzing EXIF metadata.
18. © 2014 IBM Corporation
Summary
§ OpenStack Swift is the most popular open source
object store
§ Apache Spark is the next big thing in data analytics
§ Spark and Swift can be integrated
§ Storlets in Swift provide clear benefits for analytics
use cases.
Thank you!
More information
Gil Vernik, IBM Research - Haifa
gilv@il.ibm.com