Big Data and OSS at IBM
1. Open Source SW @ IBM Big Data
Boulder Java User Group
06/11/13
Ivan Portilla
ivanp@us.ibm.com
portilla@gmail.com
Ryan DeJana
rdejana@us.ibm.com
2. Disclaimer
✓ This presentation represents the view of the authors
and does not represent the view of IBM.
✓ All opinions expressed in this presentation are strictly those of
the speakers, and do NOT represent those of IBM, IBM
management, or anyone else.
✓ IBM and IBM (logo) are trademarks or registered
trademarks of International Business Machines
Corporation in the United States and/or other countries.
✓ Many thanks to Rafael Coss & Paul Zikopoulos for the
materials used in this presentation.
8. Big Data Includes Any of the Following Characteristics:
Extracting insight in context, beyond what was previously possible.
Variety: Manage the complexity of multiple relational and non-relational data types and schemas
Velocity: Streaming data and large volume data movement
Volume: Scale from terabytes to zettabytes
10. Big Data Has New Opportunities But Needs New Analytics
[Figure: Data Scale (Kilo, Mega, Giga, Tera, Peta, Exa) plotted against Decision Frequency (Occasional, Frequent, Real-time; yr, mo, wk, day, hr, min, sec, ms, µs), spanning Data at Rest and Data in Motion. Compared with the Traditional Data Warehouse and Business Intelligence region, the workloads shown are up to 10,000 times larger and up to 10,000 times faster.]
Telco Promotions: 100,000 records/sec, 6B/day; 10 ms/decision; 270TB for Deep Analytics
DeepQA: 100s GB for Deep Analytics; 3 sec/decision
Smart Traffic: 250K GPS probes/sec; 630K segments/sec; 2 ms/decision, 4K vehicles
Homeland Security: 600,000 records/sec, 50B/day; 1-2 ms/decision; 320TB for Deep Analytics
11. Applications for Big Data Analytics
Homeland Security
Finance
Smarter Healthcare
Multi-channel sales
Telecom
Manufacturing
Traffic Control
Trading Analytics
Fraud and Risk
Log Analysis
Search Quality
Retail: Churn, NBO
12. Most requested use cases of Big Data
Utilities
§ Weather impact analysis on power generation
§ Transmission monitoring
§ Smart grid management
Retail
§ 360° View of the Customer
§ Click-stream analysis
§ Real-time promotions
Law Enforcement
§ Real-time multimodal surveillance
§ Situational awareness
§ Cyber security detection
Transportation
§ Weather and traffic impact on logistics and fuel consumption
§ Traffic congestion
Financial Services
§ Fraud detection
§ Risk management
§ 360° View of the Customer
IT
§ System log analysis
§ Cybersecurity
Telecommunications
§ CDR processing
§ Churn prediction
§ Geomapping / marketing
§ Network monitoring
Health & Life Sciences
§ Epidemic early warning
§ ICU monitoring
§ Remote healthcare monitoring
Follow this link for details on Industry Big Data use cases
13.
§ Public wind data is available on 284 km x 284 km grids (2.5° LAT/LONG)
§ More data means more accurate and richer models (adding hundreds of variables)
- Vestas wind library at 2.5 PB: to grow to over 6 PB in the near-term
- Granularity 27km x 27km grids: driving to 9x9, 3x3 to 10m x 10m simulations
§ Reduced turbine placement identification from weeks to hours
§ Perspective: The Vestas Wind library, as HD TV, would take 70 years to watch
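To give a feel for what the finer grid granularity on this slide implies, here is a rough back-of-the-envelope sketch. It assumes uniform square cells covering the same area, and it reads "9x9" and "3x3" as kilometres; both are assumptions, not statements from the slide. The helper name `cell_multiplier` is hypothetical.

```python
def cell_multiplier(coarse_m: float, fine_m: float) -> float:
    """How many fine cells replace one coarse cell over the same square area."""
    return (coarse_m / fine_m) ** 2


BASE_M = 27_000  # the 27 km grid edge from the slide, in metres

# Assumed reading of the slide: 9 km, 3 km, then 10 m cells.
for label, fine_m in [("9 km", 9_000), ("3 km", 3_000), ("10 m", 10)]:
    factor = cell_multiplier(BASE_M, fine_m)
    print(f"27 km -> {label}: {factor:,.0f}x more cells")
```

Under these assumptions the cell count grows by 9x, 81x, and 7,290,000x respectively, which suggests why the data volume climbs so quickly as resolution improves.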
14. Big Data Analytics in Smarter Hospitals
IBM Data Baby
youtube.com
Big Data enabled doctors from University of Ontario to apply neonatal infant monitoring to predict infection in ICU 24 hours in advance
http://www.youtube.com/watch?v=0lt0hTNtjrY&feature=results_main&playnext=1&list=PL783389D2F81FFAB5
15. IBM Watson is a breakthrough in analytic innovation, but it is only successful
because of the quality of the information from which it is working.
16. Big Data and Watson
InfoSphere BigInsights
POS Data
CRM Data
Social Media
Distilled Insight
- Spending habits
- Social relationships
- Buying trends
Advanced
search and
analysis
Watson can consume insights from
Big Data for advanced analysis
Big Data technology is used to build
Watson’s knowledge base
Watson uses the Apache Hadoop
open framework to distribute the
workload for loading information into
memory.
Approx. 200M pages of text
(To compete on Jeopardy!)
Watson’s
Memory
17. IBM is committed to Open Source
► Decade of lineage and contributions to
the open source community
– Apache Hadoop and Jaql, Apache
Derby, Apache Geronimo, Apache
Jakarta, +++
– Eclipse: founded by IBM
– Significant Lucene contributions via IBM
Lucene Extension Library (ILEL)
– DRDA, XQuery, SQL, XML4J, XERCES,
HTTP, Java, Linux, +++
► IBM products built on open source
– WebSphere: Apache
– Rational: Eclipse and Apache
– InfoSphere: Eclipse and Apache, +++
► IBM’s BigInsights (Hadoop) is 100%
open source compatible with
no forks
18. Introducing MapReduce
► In 2003 and 2004 Google released two papers that provide insight
into their success
– The Google File System
– MapReduce: Simplified Data Processing on Large Clusters
► Introduced an approach to large scale data processing known as
MapReduce
Global TLE Framework
19. MapReduce
► A programming model
– Inspired by functional programming
– Allows expressing distributed computations on large amounts of data
► Execution framework
– Designed for large-scale data processing
– Designed to run on clusters of commodity hardware
20. MapReduce, the programming model
► Process key-value records
► Map function:
(K_in, V_in) → list(K_inter, V_inter)
► Barrier between map and reduce phases
– Shuffle and sort phase moves and groups like keys
► Reduce function:
(K_inter, list(V_inter)) → list(K_out, V_out)
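The map, shuffle/sort, and reduce steps listed on this slide can be sketched in a few lines of single-process Python. This is only an illustration of the programming model, not how Hadoop is implemented; the names `run_mapreduce`, `word_count_mapper`, and `word_count_reducer` are hypothetical, and the punctuation stripping is an assumption made so the result lines up with the word-count slides.

```python
from collections import defaultdict


def run_mapreduce(records, mapper, reducer):
    """Run map, shuffle/sort, and reduce over in-memory (key, value) records."""
    # Map phase: each input record yields zero or more intermediate pairs.
    intermediate = []
    for key, value in records:
        intermediate.extend(mapper(key, value))

    # Barrier: shuffle and sort groups all values that share a key.
    groups = defaultdict(list)
    for inter_key, inter_value in intermediate:
        groups[inter_key].append(inter_value)

    # Reduce phase: each key and its list of values yields output pairs.
    output = []
    for inter_key in sorted(groups):
        output.extend(reducer(inter_key, groups[inter_key]))
    return output


def word_count_mapper(_line_id, text):
    # Assumption: lowercase and strip basic punctuation before counting.
    for raw in text.lower().split():
        word = raw.strip(".,!?")
        if word:
            yield word, 1


def word_count_reducer(word, counts):
    yield word, sum(counts)


lines = [("line1", "Hello there."), ("line2", "Why, hello.")]
print(run_mapreduce(lines, word_count_mapper, word_count_reducer))
```

With the two sample lines used in the word-count slides, this prints `[('hello', 2), ('there', 1), ('why', 1)]`.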
21. Map phase, word-count example
(line1, “Hello there.”)
(line2, “Why, hello.”)
(“hello”,1)
(“there”,1)
(“why”,1)
(“hello”,1)
22. Sort phase, word-count example
(“hello”, 1)
(“hello”, 1)
(“there”, 1)
(“why”, 1)
23. Reduce phase, word-count example
(“hello”, 1)
(“hello”, 1)
(“there”, 1)
(“why”, 1)
(“hello”, 2)
(“there”, 1)
(“why”, 1)
25. Pseudocode for word-count
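def mapper(line):
    # Emit a (word, 1) pair for every whitespace-separated word in the line.
    for word in line.split():
        yield word, 1

def reducer(key, values):
    # Emit the key together with the sum of all counts collected for it.
    yield key, sum(values)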
Same code can be applied to thousands of lines,
even the whole web!
Google processes over 20PBs a day, much of it in
MapReduce programs.
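One common way to run word-count logic like this on Hadoop from a scripting language is Hadoop Streaming, where the mapper and reducer exchange tab-separated key/value text lines and the reducer receives its input sorted by key. The sketch below imitates that line protocol locally; the function names and the `simulate` helper are hypothetical, and a real job would put the mapper and reducer in separate scripts reading standard input.

```python
from itertools import groupby


def map_lines(lines):
    """Mapper: emit one 'word<TAB>1' line per word."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reduce_lines(sorted_lines):
    """Reducer: input lines arrive sorted by key, so equal keys are adjacent."""
    pairs = (line.split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"


def simulate(lines):
    """Local stand-in for the framework: map, sort by key, then reduce."""
    return list(reduce_lines(sorted(map_lines(lines))))


print(simulate(["hello there", "why hello"]))
```

The sort between the two functions plays the role of the shuffle and sort phase; without it, `groupby` would not see equal keys next to each other.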
26. But what about the data!
Compute Nodes
NAS
SAN
27. Distributed file system enables processing to
be moved to the data!
(key1, value1)
(key2, value2)
…
(key1, value1)
(key2, value2)
…
Processing is done local to the data
Key-value pairs are processed independently and in parallel!
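The idea of processing each block where it lives, independently and in parallel, can be approximated on one machine. In this sketch each inner list stands in for a data block stored on a different node, and a thread pool stands in for the cluster; both are simplifying assumptions, and the names `count_block` and `count_all` are hypothetical.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def count_block(block):
    """Work done 'local to the data': count words in one block only."""
    counts = Counter()
    for line in block:
        counts.update(line.lower().split())
    return counts


def count_all(blocks):
    """Process every block independently, then merge the partial results."""
    with ThreadPoolExecutor(max_workers=max(1, len(blocks))) as pool:
        partials = list(pool.map(count_block, blocks))
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


blocks = [["hello there"], ["why hello"]]  # one list per simulated node
print(count_all(blocks))
```

Only the small per-block counts travel to the merge step, which is the point of moving the processing to the data rather than the data to the processing.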
28. Hadoop – A M/R Framework
► Apache open source software framework for reliable, scalable,
distributed computing on massive amounts of data
§ Hides underlying system details and complexities from user
§ Developed in Java
► Core sub projects:
− MapReduce
− Hadoop Distributed File System a.k.a. HDFS
− Hadoop Common
► Supported by several Hadoop-related projects
§ HBase
§ Zookeeper
§ Avro
§ Etc.
► Meant for heterogeneous commodity hardware
31. Hadoop Open Source Projects
► Hadoop is supplemented by an ecosystem of open source projects
Jaql
Oozie
32. The IBM Big Data Platform
InfoSphere BigInsights
Hadoop-based low latency
analytics for variety and volume
Data-At-Rest
Netezza High
Capacity Appliance
Queryable Archive for
Structured Data
Netezza 1000
BI+Ad Hoc Analytics on
Structured Data
Smart Analytics System
Operational Analytics on
Structured Data
Informix Timeseries
Time-structured analytics
InfoSphere Warehouse
Large volume structured data
analytics
InfoSphere Streams
Low Latency Analytics for
streaming data
Velocity, Variety & Volume
Data-In-Motion
MPP Data Warehouse
Stream Computing
Information Integration
Hadoop
InfoSphere Information
Server
High volume data integration
and transformation
Apache Hadoop:
open source framework
for the distributed processing
of large data sets across
clusters of computers using a
simple programming model
33. The IBM Big Data Platform
Integrate and manage the full variety, velocity and volume of data
Apply advanced analytics to information in its native form
Visualize all available data for ad-hoc analysis
Development environment for building new analytic applications
Workload optimization and scheduling
Security and Governance
34. BigInsights Brings Hadoop to the Enterprise
► BigInsights = analytical platform for
persistent Big Data
– Based on open source & IBM technologies
– Managed like a start-up . . . . Emphasis on
deep customer engagements, product plan
flexibility
► Distinguishing characteristics
– Built-in analytics . . . . Enhances business
knowledge
– Enterprise software integration . . . .
Complements and extends existing
capabilities
– Production-ready platform with tooling for
analysts, developers, and
administrators. . . . Speeds time-to-value;
simplifies development and maintenance
► IBM advantage
– Combination of software, hardware, services
and advanced research
Hadoop
System
35. InfoSphere BigInsights
Platform for volume, variety,
velocity
► Enhanced Hadoop
foundation
Analytics
► Text analytics & tooling
► Application accelerators
Usability
► Web console
► Spreadsheet-style tool
► Ready-made “apps”
Enterprise Class
► Storage, security, cluster
management
Integration
► Connectivity to Netezza,
DB2, JDBC databases, etc
Apache
Hadoop
Basic Edition
Enterprise Edition
Licensed
Application accelerators
Pre-built applications
Text analytics
Spreadsheet-style tool
RDBMS, warehouse connectivity
Administrative tools, security
Eclipse development tools
Performance enhancements
. . .
Free download
Integrated install
Online InfoCenter
BigData Univ.
Breadth of capabilities
Enterprise class
36. BigInsights Basic Edition
Connectivity and integration
JDBC
Flume
Infrastructure Jaql
Hive
Pig
HBase
MapReduce
HDFS
ZooKeeper
Lucene
Oozie
Open Source IBM
Integrated
installer
Sqoop
HCatalog
37. BigInsights Enterprise Edition
Connectivity and Integration Streams
Netezza
Text
processing
engine and
library
JDBC
Flume
Infrastructure Jaql
Hive
Pig
HBase
MapReduce
HDFS
ZooKeeper
Indexing Lucene
Adaptive
MapReduce
Oozie
Text compression
Enhanced
security
Flexible
scheduler
Optional
IBM and
partner
offerings
Analytics and discovery “Apps”
DB2
BigSheets
Web Crawler
Distrib file copy
DB export
Boardreader
DB import
Ad hoc query
Machine
learning
Data
processing
. . .
Administrative and
development tools
Web console
• Monitor cluster health, jobs, etc.
• Add / remove nodes
• Start / stop services
• Inspect job status
• Inspect workflow status
• Deploy applications
• Launch apps / jobs
• Work with distrib file system
• Work with spreadsheet interface
• Support REST-based API
• . . .
R
Eclipse tools
• Text analytics
• MapReduce programming
• Jaql, Hive, Pig development
• BigSheets plug-in development
• Oozie workflow generation
Integrated
installer
Open Source IBM
IBM
Cognos BI
GPFS (EAP)
Accelerator for
machine data
analysis
Accelerator for
social data
analysis
Guardium DataStage Data Explorer
Sqoop
HCatalog
38. Open Source Components Across Distributions
Component  BigInsights 2.0  HortonWorks HDP 1.2  MapR 2.0  Greenplum HD 1.2  Cloudera CDH3u5  Cloudera CDH4*
Hadoop     1.0.3            1.1.2                0.20.2    1.0.3             0.20.2           2.0.0*
HBase      0.94.0           0.94.2               0.92.1    0.92.1            0.90.6           0.92.1
Hive       0.9.0            0.10.0               0.9.0     0.8.1             0.7.1            0.8.1
Pig        0.10.1           0.10.1               0.10.0    0.9.2             0.8.1            0.9.2
Zookeeper  3.4.3            3.4.5                X         3.3.5             3.3.5            3.4.3
Oozie      3.2.0            3.2.0                3.1.0     X                 2.3.2            3.1.3
Avro       1.6.3            X                    X         X                 X                X
Flume      0.9.4            1.3.0                1.2.0     X                 0.9.4            1.1.0
Sqoop      1.4.1            1.4.2                1.4.1     X                 1.3.0            1.4.1
HCatalog   0.4.0            0.5.0                0.4.0     X                 X                X
BigInsights continues to offer the most proven, stable versions of Apache Hadoop components
*Cloudera CDH4 Hadoop 2.0 includes Map Reduce 2.0 which Cloudera states “not yet considered stable”
41. BigInsights Content (cont’d)
Function  Basic Edition  Enterprise Edition
Integration with R (Jaql module to invoke R statistical capabilities from BigInsights) n/a Inc
Integration with Netezza, DB2 LUW with DPF from Jaql n/a Inc
LDAP authentication, Guardium support, etc. n/a Inc
Integrated Web Console n/a Inc
Business process accelerators (social data, machine data analytics) n/a Inc
Platform performance enhancements (Adaptive MapReduce, large scale indexing, efficient processing of compressed text files, flexible job scheduler, etc.) n/a Inc
Text analytics n/a Inc
Eclipse tools for text analytic development, Jaql, Hive, Java n/a Inc
Applications for data import/export, Web crawl, machine learning, etc. n/a Inc
Web-based application catalog n/a Inc
Spreadsheet-like analytical tool n/a Inc
IBM support Opt Inc
Streams, Data Explorer, Cognos BI (limited use licenses) n/a Inc
Unlimited storage n/a Inc
42. BigInsights: Value Beyond Open Source
Enterprise Capabilities
Administration & Security
Workload Optimization
Connectors
Open source
components
Advanced Engines
Visualization & Exploration
Development Tools
IBM-certified
Apache Hadoop or or …
Key differentiators
• Built-in analytics
• Text engine, annotators, Eclipse tooling
• Interface to project R (statistical platform)
• Enterprise software integration
• Spreadsheet-style analysis
• Integrated installation of supported open source and other components
• Web Console for admin and application access
• Platform enrichment: additional security, performance features, . . .
• World-class support
• Full open source compatibility
Business benefits
• Quicker time-to-value due to IBM technology and support
• Reduced operational risk
• Enhanced business knowledge with flexible analytical platform
• Leverages and complements existing software
44. Big Data Application Ecosystem
Eclipse
App library
MapReduce, …
Text Analytics
Query
App Development
• Code application program, and generate
associated App
• Deploy Apps to Enterprise Manager
App Development
Publish
Data integration scenario: Pre-defined work flows simplify loading data from various sources
• Work flows can be configured, deployed, executed and scheduled
Development tooling:
• Text analytics
• MapReduce
• Query languages
• . . .
Application scenarios (web log, email, social media, …):
• Samples provide starting point, speed time to value
Big Data Web Console
45. Web Console
• Manage BigInsights
Inspect / monitor system health
Add / drop nodes
Start / stop services
Run / monitor jobs (applications)
Explore / modify file system
Create custom dashboards
. . .
• Launch applications
Spreadsheet-like analysis tool
Pre-built applications (IBM
supplied or user developed)
• Publish applications
• Monitor cluster, applications,
data, etc.
46. Running Applications from the Web Console
• Import & Export Data
• Database & Files
• Web and Social
• Analyze and Query
• Predictive Analytics
• Text Analytics
• SQL/Hive, Jaql, Pig, HBase
47. Spreadsheet-style Analysis
• Web-based analysis and
visualization
• Spreadsheet-like
interface
Define and manage long
running data collection
jobs
Analyze content of the text
on the pages that have
been retrieved
48. Get started with BigInsights
• In the Cloud
Via RightScale, or directly on Amazon, Rackspace, IBM Smart Enterprise
Cloud, or on private clouds.
Pay only for the resources used.
• In the Classroom
Via IBM Education
Online at www.bigdatauniversity.com
• On Your Cluster
Download Basic Edition from ibm.com.
• With the BigInsights Community
– Technical portal @ http://tinyurl.com/biginsights
– BigData on DW @ http://ibm.co/bigdatadev
Links to demos, papers, forum, downloads, etc.
• Stay connected with IBM Big Data
– http://ibmbigdatahub.com
49. BigDataUniversity.com
Learn Big Data Technologies
• Flexible on-line delivery
allows learning @your place
and @your pace
§ Free courses, free study
materials.
§ Cloud-based sandbox
for exercises – zero setup
§ 66666 registered students.
§ Robust Course
Management System and
Content Distribution
infrastructure
55. BigInsights and Text Analytics
• Distills structured info from
unstructured text
Sentiment analysis
Consumer behavior
Illegal or suspicious activities
…
• Parses text and detects meaning
with annotators
• Understands the context in which
the text is analyzed
• Features pre-built extractors for
names, addresses, phone numbers,
etc.
• Built-in support for English,
Spanish, French, German,
Portuguese, Dutch, Japanese,
Chinese
Football World Cup 2010, one team
distinguished themselves well, losing to the
eventual champions 1-0 in the Final. Early in
the second half, Netherlands’ striker, Arjen
Robben, had a breakaway, but the keeper for
Spain, Iker Casillas made the save. Winger
Andres Iniesta scored for Spain for the win.
Unstructured text (document, email, etc)
Classification and Insight
56. Example Analysis : Extraction from Twitter
messages
Extract intent, interests, life events and micro segmentation attributes
I'm at Mickey's Irish Pub Downtown (206 3rd St, Court Ave, Des Moines) w/ 2 others
http://4sq.com/gbsaYR
@silliesylvia good!!! U shouldnt! Think about the important stuff, like ur birthday ;)
btw happy birthday Sylvia ;)
@rakonturmiami im moving to miami in 3 months. i look foward to the new lifestyle
I had an iphone, but it's dead @JoaoVianaa. (I've no idea where it's) !Want a blackberry
now !!!
Monetizable Intent
Relocation
Location
Name, Birth Day
Subtle Spam,
Advertising
Sarcasm,
Wishful Thinking
While accounting for less relevant messages
I think that @justinbieber deserves his 2 AMAZING songs in top ten!!! Buy them on itunes
http://Cell-Pones.com Looking to buy a phone? WiFi Cell Phones, Windows Mobile
@purplepleather Gotta do more research my Versace term paper 2day. Before I die, I
want a versace purple diamond tiara. Im just sayin>lol
had so much fun today! I want to buy a million dollar house with a wrap around
porch ... ... wading river on the long island sound, ha i wish!
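As a very rough illustration of what rule-based extraction from messages like these involves, the sketch below tags a message using a few regular expressions. It is a toy under stated assumptions and not the BigInsights text analytics engine or its annotators: the rule names, the patterns, and the `tag` function are all hypothetical.

```python
import re

# Hypothetical toy rules, one pattern per label from the slide.
RULES = {
    "relocation": re.compile(r"\bmoving to\s+(\w+)", re.IGNORECASE),
    "monetizable_intent": re.compile(r"\bwant an?\s+(\w+)", re.IGNORECASE),
    "location": re.compile(r"\bI'm at\s+(.+?)\s*\(", re.IGNORECASE),
}


def tag(message):
    """Return {label: captured text} for every rule that matches."""
    tags = {}
    for label, pattern in RULES.items():
        match = pattern.search(message)
        if match:
            tags[label] = match.group(1)
    return tags


print(tag("@rakonturmiami im moving to miami in 3 months."))
print(tag("I had an iphone, but it's dead. !Want a blackberry now !!!"))
print(tag("I'm at Mickey's Irish Pub Downtown (206 3rd St, Des Moines)"))
```

Rules this simple also fire on the wishful-thinking example ("I want a versace purple diamond tiara"), which hints at why the slide stresses accounting for sarcasm, spam, and other less relevant messages.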