1. How to Crunch Petabytes with Hadoop and Big Data using InfoSphere BigInsights and Streams
Tom Deutsch, IBM
Vladimir Bacvanski, Founder, SciSpike
vladimir.bacvanski@scispike.com
Stephen Brodsky, Technical Executive and Distinguished Engineer, IBM
sbrodsky@us.ibm.com
August 24, 2011 © 2011 IBM Corporation & SciSpike
2. Who are we?
Dr. Vladimir Bacvanski
– Consultant, trainer, and mentor focusing on making clients successful in
adopting new data and software approaches
– Over 20 years of experience
– Founder of SciSpike – a training and consulting firm specializing in
advanced software and data technologies
Stephen Brodsky, Ph.D.
– Distinguished Engineer and Technical Executive for IBM Big Data
initiatives at the IBM Silicon Valley Laboratory
– Previously led the architecture for the Optim Data Studio product line
and pureQuery and was a member of the architecture team for DB2
pureXML, Rational Application Developer (RAD), and WebSphere.
3. Agenda
The “Big Data” challenge: smarter analytics for a smarter planet
How to do it?
– The big data challenge
– Foundations of Big Data approaches
– MapReduce and Hadoop
– Real-time data and stream processing
– Integration with existing systems
4. The “Big Data” Challenge
5. The World is Changing and Becoming More…
INSTRUMENTED
INTERCONNECTED
INTELLIGENT
The resulting explosion of information creates a need for
a new kind of intelligence
…to help build a Smarter Planet
6. Information is Growing at a Phenomenal Rate . . . .
44x as much data and content over the coming decade
80% of the world’s data is unstructured
2009: 800,000 petabytes
2020: 35 zettabytes (35 billion terabytes)
Source: IDC, The Digital Universe Decade – Are You Ready?, May 2010
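The 44x figure can be checked from the two data points on this slide. The short sketch below assumes decimal prefixes (1 zettabyte = 1,000,000 petabytes = 1,000,000,000 terabytes); the variable names are illustrative only.

```python
# Decimal (SI) prefixes are assumed: 1 ZB = 10^6 PB = 10^9 TB.
PB_PER_ZB = 1_000_000
TB_PER_ZB = 1_000_000_000

data_2009_pb = 800_000  # 2009 figure from the slide, in petabytes
data_2020_zb = 35       # 2020 projection from the slide, in zettabytes

# Express both figures in petabytes before dividing.
data_2020_pb = data_2020_zb * PB_PER_ZB
growth_factor = data_2020_pb / data_2009_pb

# The parenthetical on the slide: 35 ZB expressed in terabytes.
data_2020_tb = data_2020_zb * TB_PER_ZB

print(growth_factor)  # 43.75, which the slide rounds to 44x
print(data_2020_tb)   # 35000000000, i.e. 35 billion terabytes
```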
7. The BIG Data Challenge
• Manage and benefit from massive and growing amounts of data
• Handle varied data formats (structured, unstructured, semi-structured) and
increased data velocity
• Exploit BIG Data in a timely and cost effective fashion
Collect, Manage, Integrate, Analyze
8. What clients are saying . . .
Lots of potentially valuable data is dormant or discarded due to size/performance considerations
Large volume of unstructured or semi-structured data is not worth integrating fully (e.g. Tweets, logs, . . .)
Not clear what should be analyzed (exploratory, iterative)
Information distributed across multiple systems and/or Internet
Some information has a short useful lifespan
Volumes can be extremely high
Analysis needed in the context of existing information (not stand-alone)
9. Big Data Presents Big Opportunities
Extract insight from a high volume, variety and velocity of data
in a timely and cost-effective manner
Variety: Manage and benefit from
diverse data types and data
structures
Velocity: Analyze streaming data and
large volumes of persistent
data
Volume: Scale from terabytes to
zettabytes
10. Streams and Oceans of Information . . . .
Information streams
– High speed information flowing in real-time, often transient
– Information from sensors, instruments, etc.
– Information flowing from real-time logs and activity monitors
– Streaming content like audio and video
– High speed transactions like tickers, trades, or traffic systems
Information oceans
– Information stored outside conventional systems. Data may originate from the Web or different internal systems
– Collection of what has streamed
– Information from social media, logs, click streams, emails, etc.
– Unstructured or mixed schema documents like claims, forms, desktop applications, etc.
– Structured data from disparate systems
11. Applications for Big Data Analytics
Smarter Healthcare Multi-channel sales Finance
Homeland security Traffic Control Telecom
Manufacturing Trading Analytics
Many more!
12. Use Case Example: Energy Company
Business scenario
Analyze large volumes of public and
private weather data for alternative
energy business
Existing high-performance computing hardware, limited staff
Technical challenges
High data volume: 2+ PB
Range of query types
- Avg temp in given location? (Small result)
- Geo pts where ice may form on wind turbines? (Large result, derived values – icing determined by humidity + temp.)
Run on system with non-Hadoop apps
13. Use Case Example: Global Media Firm
Business scenario
Identify unauthorized content
streaming in digital media (piracy)
- Quantify annual revenue loss
- Analyze trends
Monitor social media sites to identify
dissemination of pirated content. Time
sensitive!
Technical challenges
High variety of unstructured and semi-structured data.
Initial focus: text analytics over 1 year’s
worth of social media data. Look for live
streaming URLs, sentiment, event info, etc.
Complex rules to qualify & classify info
Future potential for video analysis
14. IBM Watson
IBM Watson is a breakthrough in analytic innovation, but it is only successful
because of the quality of the information from which it is working.
15. Big Data and Watson
Big Data technology is used to build Watson’s knowledge base
– Watson uses the Apache Hadoop open framework to distribute the workload for loading information into memory.
– Approx. 200M pages of text (To compete on Jeopardy!)
– Diagram labels: InfoSphere BigInsights, Watson’s Memory
Watson technology offers great potential for advanced business analytics
– CRM Data, POS Data, Social Media
– Advanced search and analysis
– Distilled Insight: Spending habits, Social relationships, Buying trends
16. Customer Engagements
Use patterns
• Customer sentiment analysis (cross-sell, up-sell, campaign management)
• Integrated retail and web customer behavior modeling
• Predictive modeling (credit card fraud)
• System log analytics (reduce operational risk)
Common requirements
• Extract business insight from large volumes of raw data (often outside operational systems)
• Integrate with other existing software
• Ready for enterprise use
Diagram labels: Consumer Insight; Text, Blog, Weblog; Click streams; Multi-channel sales; Log & transactions; Next Gen Text Analytics; Biological Sequences; Fraud Models; Operational system & streams data sources; New Business Development; Statistical Model Building
17. The approach to crunching big data
18. How to approach Big Data analytics?
InfoSphere BigInsights and InfoSphere Streams
• Analytics for data in-motion and at-rest
• Platform for processing large volumes of diverse data
• Complements and integrates with existing software solutions
19. Addressing the Key Requirements
1. Platform for V3 – Variety, Velocity, Volume
Variety - manage data & content “As Is”
Handle any velocity - low-latency streams and large volume batch
Volume - huge volumes of at-rest or streaming data
2. Analytics for V3
Analyze Sources in their native format - text, data, rich content
Analyze all of the data - not just a subset
Dynamic analytics - automatic adjustments and actions
3. Ease of Use for Developers and Users
Developer UIs, common languages & automatic optimization
End-user UIs & visualization
4. Enterprise Class
Failure tolerance, Security and Privacy
Scale Economically
5. Extensive Integration Capabilities
Integrate wide variety of sources
Leverage enterprise integration technologies
20. Big Data Initiative
InfoSphere BigInsights
– Volumes of diverse, persistent data
– Analytic applications for “Big Data”
Warehouse
– Traditional warehouse applications
InfoSphere Streams
– Real-time streaming data
21. BigInsights Summary
BigInsights = analytical platform for persistent “Big Data”
– Based on open source & IBM technologies
Distinguishing characteristics
– Built-in analytics . . . . Enhances business knowledge
– Enterprise software integration . . . . Complements and extends
existing capabilities
– Production-ready platform . . . . Speeds time-to-value; simplifies
development and maintenance
22. Big Data Platform Vision
Bringing Big Data to the Enterprise
Big Data Solutions
Big Data User Environments: Developers, End Users, Administrators
Big Data Enterprise Engines: Streaming Analytics, Internet Scale Analytics
Open Source Foundational Components
Agents
Integration: Data Warehouse, Information Integration, Master Data Mgmt, Database, Content Analytics, Business Analytics, Marketing, Data Growth Management
23. InfoSphere BigInsights v 1.1
Platform for volume, variety, velocity -- V3
– Hadoop foundation
Analytics for V3
– Text analytics & tooling
Usability
– Web administrative console
– Integrated install
– Spreadsheet-style analytic tool
Enterprise Class
– Storage, security, cluster management
Integration
– Connectivity to DB2, Netezza
Editions (chart axes: enterprise class, breadth of capabilities)
– Basic Edition (free download): integrated install, Apache Hadoop
– Enterprise Edition (licensed): Web admin console, LDAP authentication; RDBMS, warehouse connectivity; text analytics; spreadsheet-style analytic tool; flexible job scheduler; 24 x 7 Web support
24. BigInsights Platform: Key Ideas
Flexible, enterprise-class support for processing large
volumes of data
– Based on Google’s MapReduce technology
– Inspired by Apache Hadoop; compatible with its ecosystem and
distribution
– Well-suited to batch-oriented, read-intensive applications
– Supports wide variety of data
Enables applications to work with thousands of nodes and
petabytes of data in a highly parallel, cost effective manner
– CPU + disks = “node”
– Nodes can be combined into clusters
– New nodes can be added as needed without changing
• Data formats
• How data is loaded
• How jobs are written
25. The MapReduce Programming Model
"Map" step:
– Input split into pieces
– Worker nodes process individual pieces in parallel (under global control of the Job Tracker node)
– Each worker node stores its result in its local file system where a reducer is able to access it
"Reduce" step:
– Data is aggregated (“reduced” from the map steps) by worker nodes (under control of the Job Tracker)
– Multiple reduce tasks can parallelize the aggregation
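The two steps can be sketched in a few lines of Python. This is a toy, single-process illustration and not Hadoop code: threads stand in for worker nodes, in-memory lists stand in for each worker's local file system, and no Job Tracker is modeled. The names run_mapreduce, emit_readings, and max_temperature, as well as the sample readings, are hypothetical.

```python
import zlib
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def run_mapreduce(splits, map_fn, reduce_fn, num_reducers=2):
    """Toy stand-in for the Map and Reduce steps (not a Hadoop API)."""
    with ThreadPoolExecutor() as pool:
        # "Map" step: every input split is processed independently, in parallel.
        map_outputs = list(pool.map(lambda split: list(map_fn(split)), splits))

        # Route each intermediate (key, value) pair to one reduce task.
        # A stable checksum keeps all values for a key in the same partition.
        partitions = [defaultdict(list) for _ in range(num_reducers)]
        for output in map_outputs:
            for key, value in output:
                index = zlib.crc32(str(key).encode("utf-8")) % num_reducers
                partitions[index][key].append(value)

        # "Reduce" step: each partition is aggregated by its own task.
        def reduce_partition(partition):
            return {key: reduce_fn(key, values) for key, values in partition.items()}

        result = {}
        for partial in pool.map(reduce_partition, partitions):
            result.update(partial)
    return result


def emit_readings(split):
    # Map function: pass each (station, temperature) pair through as key/value.
    for station, temperature in split:
        yield station, temperature


def max_temperature(station, temperatures):
    # Reduce function: collapse all readings for one station into one value.
    return max(temperatures)


# Hypothetical sample input, already split into two pieces.
splits = [[("A", 3.0), ("B", -1.5)], [("A", 5.5), ("B", 0.0)]]
hottest = run_mapreduce(splits, emit_readings, max_temperature)
print(hottest)
```

Because all values for a given key land in the same partition, the reduce tasks never need to coordinate with each other, which is what lets the aggregation run in parallel.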
26. What is Hadoop?
Apache Hadoop = free, open source framework for data-
intensive applications
– Inspired by Google technologies (MapReduce, GFS)
– Well-suited to batch-oriented, read-intensive applications
– Originally built to address scalability problems of Nutch, an open source
Web search technology
Enables applications to work with thousands of nodes and
petabytes of data in a highly parallel, cost effective manner
– CPU + disks of commodity box = Hadoop “node”
– Boxes can be combined into clusters
– New nodes can be added as needed without changing
• Data formats
• How data is loaded
• How jobs are written
27. Two Key Aspects of Hadoop
MapReduce framework
– How Hadoop understands and assigns work to the nodes
(machines)
Hadoop Distributed File System = HDFS
– Where Hadoop stores data
– A file system that spans all the nodes in a Hadoop cluster
– It links together the file systems on many local nodes to
make them into one big file system
28. Logical MapReduce Example: Word Count
map(String key, String value):
  // key: document name
  // value: document contents
  for each word w in value:
    EmitIntermediate(w, "1");

reduce(String key, Iterator values):
  // key: a word
  // values: a list of counts
  int result = 0;
  for each v in values:
    result += ParseInt(v);
  Emit(AsString(result));

Content of Input Documents
  Document 1: Hello World Bye World
  Document 2: Hello IBM
Map 1 emits:
  <Hello, 1> <World, 1> <Bye, 1> <World, 1>
Map 2 emits:
  <Hello, 1> <IBM, 1>
Reduce (final output):
  <Bye, 1> <IBM, 1> <Hello, 2> <World, 2>
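For readers who want to run the example, here is a minimal Python rendering of the same logic on the two input documents. It is a sketch for illustration, not Hadoop code; the document names doc1 and doc2 and the function names are assumptions, and the grouping loop stands in for the framework's shuffle between map and reduce.

```python
from collections import defaultdict


def map_words(doc_name, contents):
    # Mirrors map(): emit the pair (word, "1") for every word in the document.
    for word in contents.split():
        yield word, "1"


def reduce_counts(word, values):
    # Mirrors reduce(): parse and sum the counts, return the total as a string.
    result = 0
    for v in values:
        result += int(v)
    return str(result)


# The two input documents from the slide (the names are hypothetical).
documents = {"doc1": "Hello World Bye World", "doc2": "Hello IBM"}

# Group intermediate pairs by key, as the framework would between the steps.
intermediate = defaultdict(list)
for name, text in documents.items():
    for word, count in map_words(name, text):
        intermediate[word].append(count)

final_output = {
    word: reduce_counts(word, counts)
    for word, counts in sorted(intermediate.items())
}
print(final_output)  # {'Bye': '1', 'Hello': '2', 'IBM': '1', 'World': '2'}
```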
29. How To Create MapReduce Jobs
MapReduce development in Java
– Low level, very flexible
– Time consuming development
Hive
– Open source language / Apache sub-project
– Provides a SQL-like interface to Hadoop
Pig
– Data flow language / Apache sub-project
Jaql
– A query language for JSON
– Useful for loosely structured data
30. Management Tools: Web Console
Graphically manage cluster, jobs, HDFS
Sample administration tasks
– Start/Stop Servers
– Add/Remove Servers
– Server Status Details (Log)
31. Spreadsheet-like Analysis Tool
BigSheets
Web-based analysis and visualization tool
Spreadsheet-like interface
– Define and manage long running data collection jobs
– Analyze content of the text on the pages that have been retrieved
32. Text Analytics
• Distill structured info from unstructured data
  • Sentiment analysis
  • Consumer behavior
  • Illegal or suspicious activities
  • ...
• Pre-built library of text annotators for common business entities
• Rich language and tooling to build custom annotators
• Support for Western languages (English, Dutch/Flemish, French, German, Italian, Portuguese, or Spanish) plus select Asian languages (Japanese, Chinese)
Annotators shown: "Acquisition", "Address", "Alliance", "AnalystEarningsEstimate", "City", "CompanyEarningsAnnouncement", "CompanyEarningsGuidance", "Continent", "Country", "County", "DateTime", "EmailAddress", "JointVenture", "Location", "Merger", "NotesEmailAddress", "Organization", "Person", "PhoneNumber", "StateOrProvince", "URL", "ZipCode"
34. So What Does This Result In?
Easy To Scale
Fault Tolerant and Self-Healing
Data Agnostic
Extremely Flexible
35. Working with streaming data: a new paradigm
Conventional processing: static data
Queries → Data → Results
Real-time processing: streaming data
Data → Queries → Results
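The contrast can be sketched in Python. In the conventional case the data sits still and a query visits it once; in the streaming case the query stays in place, keeps a small amount of state, and emits results as each item arrives. The function names, the sliding-window average, and the sample readings are assumptions made for illustration; this is not InfoSphere Streams code.

```python
from collections import deque


def static_query(stored_readings, threshold):
    # Conventional: the data is at rest; run the query once over all of it.
    return [r for r in stored_readings if r > threshold]


def standing_query(stream, threshold, window=3):
    # Streaming: the query is fixed and each arriving item flows through it.
    # Only the last `window` items are kept; older data is discarded.
    recent = deque(maxlen=window)
    for reading in stream:
        recent.append(reading)
        average = sum(recent) / len(recent)
        if average > threshold:
            yield reading, average


readings = [10, 20, 30, 40, 5]  # hypothetical sensor values

at_rest_result = static_query(readings, threshold=20)
in_motion_result = list(standing_query(iter(readings), threshold=20))

print(at_rest_result)    # [30, 40]
print(in_motion_result)  # [(40, 30.0), (5, 25.0)]
```

The standing query never needs the full data set in memory, which is why the approach suits transient, high-speed data.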
36. Real-Time Data with InfoSphere Streams
Streaming analytic applications
– Multiple input streams
– Advanced streaming analytics
Eclipse based IDE
– Define sources, apply operators, define intermediary and final output sinks
– User defined operators in Java or C++
Optimizing compiler automates deployment and connections
– Extremely low latency
– Cluster of up to 125 nodes
Diagram labels: Source Adapters; Operator Repository; Sink Adapters; InfoSphere Streams Studio (IDE for Streams Processing Language); Automated, Optimized Deploy and Management (Scheduler)
37. Scalable stream processing
InfoSphere Streams provides
– A programming model and IDE for defining data sources and
software analytic modules called operators that are fused into
process execution units (PEs)
– infrastructure to support the composition of scalable stream
processing applications from these components
– deployment and operation of these applications across distributed
x86 processing nodes, when scaled processing is required
– stream connectivity between data sources and PEs of a stream
processing application
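The idea of small operators fused into one execution unit can be illustrated with Python generators. This is a conceptual sketch only: InfoSphere Streams applications are written in its own Streams Processing Language, and the fuse helper, the operator names, and the sample ticker lines below are all hypothetical.

```python
from functools import reduce


def parse(lines):
    # Operator: turn "SYMBOL,price" text into (symbol, price) tuples.
    for line in lines:
        symbol, price = line.split(",")
        yield symbol, float(price)


def only_symbol(wanted):
    # Operator factory: keep only the tuples for one symbol.
    def operator(tuples):
        for symbol, price in tuples:
            if symbol == wanted:
                yield symbol, price
    return operator


def running_max(tuples):
    # Operator: emit a tuple each time a new maximum price is seen.
    best = None
    for symbol, price in tuples:
        if best is None or price > best:
            best = price
            yield symbol, best


def fuse(*operators):
    # Chain operators into one callable, a stand-in for a fused PE:
    # tuples pass between operators in-process, with no transport hop.
    def processing_element(stream):
        return reduce(lambda upstream, op: op(upstream), operators, stream)
    return processing_element


pe = fuse(parse, only_symbol("IBM"), running_max)
source = ["IBM,160.0", "XYZ,12.5", "IBM,158.0", "IBM,171.5"]  # hypothetical ticker
sink = list(pe(source))
print(sink)  # [('IBM', 160.0), ('IBM', 171.5)]
```

Keeping operators small and composable leaves the decision of what to fuse, and where to run it, to the deployment step rather than to the application author.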
38. Merging the Traditional and Big Data Approaches
Traditional Approach: Structured & Repeatable Analysis
– Business Users: Determine what question to ask
– IT: Structures the data to answer that question
– Examples: Monthly sales reports, Profitability analysis, Customer surveys
Big Data Approach: Iterative & Exploratory Analysis
– IT: Delivers a platform to enable creative discovery
– Business: Explores what questions could be asked
– Examples: Brand sentiment, Product strategy, Maximum asset utilization
39. BigInsights and the data warehouse: filtering and summarizing “Big Data”
BigInsights → Data warehouse
• Broader analytic coverage
• Exploits IT investments while minimizing burden
40. BigInsights as a “queryable archive” for growing data warehouses
Data warehouse → BigInsights
• Offload “cold” or dated warehouse info but maintain access for further exploration
• Keep warehouse size manageable and focused on well-known business analytic needs
41. Trends and directions
Enterprise software integration
– Data warehouses, RDBMSs
– ETL platforms
– Business intelligence tools
– Applications
– ...
Diverse range of analytics
– Text
– Image / video (e.g., content-based user profiling)
– Predictive modeling (e.g., ranking and classification based on
machine learning)
– ...
Sophisticated, scalable infrastructure for processing
massive data volumes
– High-performance file system with full POSIX compliance, granular
security
– Fully recoverable and restartable workflows
– Parallel, distributed indexing for text (“BigIndex”)
– Read-optimized column store
– Tooling for administrators, programmers, analysts
– ...
42. Integrating Relational, Streams, and BigInsights
Traditional / Relational Data Sources
– Database & Warehouse: at-rest data analytics
– Traditional Warehouse Results
Non-Traditional / Non-Relational Data Sources
– Streams: In-Motion Analytics
– Ultra Low Latency Results
Varied data formats (semi-structured, unstructured...)
– InfoSphere BigInsights: batch oriented data analytics
– Massive Scale Big Data Results
43. Typical Strategy for Analytics
Sources → ETL (Extract, Transform/subset, Load) → Data warehouse / marts → SQL Analytics, Mining
44. Emerging requirements for analytics
Diagram labels:
– Structured Sources; ETL, ELT (Extract, Transform/subset, Load); Warehouses / marts; SQL Analytics, Mining
– Other Sources; BigInsights Source Repository (Transform, Analyze); MR BI, Mining
Explore large volumes of “raw” or diverse data.
Discover, analyze new insights with BigInsights
45. Conclusions
– Scale out to crunch petabytes
– We need a mix of technologies
• Data at rest: MapReduce, Hadoop and beyond
• Data in motion: stream processing
– To be successful, integrate with conventional
technologies
46. Getting in touch
Stephen Brodsky – IBM
– Email: sbrodsky@us.ibm.com
InfoSphere BigInsights
– http://www-01.ibm.com/software/data/infosphere/biginsights/
InfoSphere Streams
– http://www-01.ibm.com/software/data/infosphere/streams/
Vladimir Bacvanski - SciSpike
– Email: vladimir.bacvanski@scispike.com
– Blog: http://www.OnBuildingSoftware.com/
– Twitter: http://twitter.com/OnSoftware
– LinkedIn: http://www.linkedin.com/in/VladimirBacvanski