Open Source SW @ IBM Big Data
Boulder Java User Group
06/11/13
Ivan Portilla
ivanp@us.ibm.com
portilla@gmail.com
Ryan DeJana
rdejana@us.ibm.com
Disclaimer
✓  This presentation represents the view of the authors
and does not represent the view of IBM.
✓  All opinions expressed in this presentation are strictly of
the speakers, and do NOT represent those of IBM, IBM
management, or anyone else.
✓  IBM and IBM (logo) are trademarks or registered
trademarks of International Business Machines
Corporation in the United States and/or other countries.
✓  Many Thanks to Rafael Coss & Paul Zikopoulos for the
materials used in this presentation.
Agenda
✓  Big Data
✓  OSS in IBM Big Data platform
✓  Demo
Big Data
Size Equivalence
Name | Value | RAMAC | IPOD
1 Giga (GB) | 10^9 | 200 |
1 Tera (TB) | 10^12 | 200K | 200
1 Peta (PB) | 10^15 | 200M | 200K
1 Exa (EB) | 10^18 | 200B | 200M
1 Zetta (ZB) | 10^21 | 200T | 200B
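The device counts in the size-equivalence table follow from simple division. A minimal sketch of that arithmetic is below. The per-device capacities (about 5 MB per RAMAC and 5 GB per iPod) are assumptions inferred from the table's ratios and are not stated on the slide; the names `RAMAC_BYTES`, `IPOD_BYTES` and `device_equivalents` are illustrative only.

```python
# Assumed capacities, inferred from the table's ratios (not stated on the slide).
RAMAC_BYTES = 5 * 10**6   # assumption: roughly 5 MB per IBM RAMAC
IPOD_BYTES = 5 * 10**9    # assumption: roughly 5 GB per iPod

# Decimal unit sizes, as in the table's "Value" column.
UNITS = {"GB": 10**9, "TB": 10**12, "PB": 10**15, "EB": 10**18, "ZB": 10**21}


def device_equivalents(unit):
    """Return (ramacs, ipods) needed to hold one `unit` of data."""
    size = UNITS[unit]
    return size // RAMAC_BYTES, size / IPOD_BYTES


for unit in UNITS:
    ramacs, ipods = device_equivalents(unit)
    print(f"1 {unit} = {ramacs:,} RAMACs = {ipods:,.1f} iPods")
```

Under these assumptions one gigabyte is a fraction of a single iPod, which would explain why the first row has no iPod entry.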
  
Why Didn’t We Use All of the Big Data Before?
Big Data Includes Any of the following Characteristics:
Extracting insight in context, beyond what was previously possible.
Variety: Manage the complexity of multiple relational and non-relational data types and schemas
Velocity: Streaming data and large volume data movement
Volume: Scale from terabytes to zettabytes
Veracity
[Figure: data scale (Kilo, Mega, Giga, Tera, Peta, Exa) plotted against decision frequency (occasional, frequent, real-time; from years down to microseconds). Traditional Data Warehouse and Business Intelligence covers data at rest; data in motion extends to workloads up to 10,000 times larger and up to 10,000 times faster.]
Big Data Has New Opportunities But Needs New Analytics
Telco Promotions: 100,000 records/sec, 6B/day; 10 ms/decision; 270TB for Deep Analytics
DeepQA: 100s GB for Deep Analytics; 3 sec/decision
Smart Traffic: 250K GPS probes/sec; 630K segments/sec; 2 ms/decision, 4K vehicles
Homeland Security: 600,000 records/sec, 50B/day; 1-2 ms/decision; 320TB for Deep Analytics
Applications for Big Data Analytics
Homeland Security
Finance
Smarter Healthcare
Multi-channel sales
Telecom
Manufacturing
Traffic Control
Trading Analytics
Fraud and Risk
Log Analysis
Search Quality
Retail: Churn, NBO
Most requested use cases of Big Data
Utilities
§  Weather impact analysis on power generation
§  Transmission monitoring
§  Smart grid management
Retail
§  360° View of the Customer
§  Click-stream analysis
§  Real-time promotions
Law Enforcement
§  Real-time multimodal surveillance
§  Situational awareness
§  Cyber security detection
Transportation
§  Weather and traffic impact on logistics and fuel consumption
§  Traffic congestion
Financial Services
§  Fraud detection
§  Risk management
§  360° View of the Customer
IT
§  System log analysis
§  Cybersecurity
Telecommunications
§  CDR processing
§  Churn prediction
§  Geomapping / marketing
§  Network monitoring
Health & Life Sciences
§  Epidemic early warning
§  ICU monitoring
§  Remote healthcare monitoring
Follow this link for details on Industry Big Data use cases
§ Public wind data is available on 284km x 284km grids (2.5° LAT/LONG)
§ More data means more accurate and richer models (adding hundreds of variables)
-  Vestas wind library at 2.5 PB: to grow to over 6 PB in the near-term
-  Granularity 27km x 27km grids: driving to 9x9, 3x3 to 10m x 10m simulations
§ Reduced turbine placement identification from weeks to hours
§ Perspective: The Vestas Wind library, as HD TV, would take 70 years to watch
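The "70 years of HD TV" comparison can be sanity-checked with a little arithmetic. The sketch below assumes decimal petabytes (10^15 bytes) and continuous playback, neither of which is stated on the slide; `implied_bitrate_mbps` is a hypothetical helper name.

```python
# Assumption: a year of continuous playback, using the average Julian year.
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def implied_bitrate_mbps(library_bytes, years):
    """Video bitrate (Mbit/s) at which `library_bytes` lasts `years` of playback."""
    bits = library_bytes * 8
    return bits / (years * SECONDS_PER_YEAR) / 1e6


# 2.5 PB library (assumed decimal petabytes) watched over 70 years.
rate = implied_bitrate_mbps(2.5e15, 70)
print(f"{rate:.1f} Mbit/s")
```

The result is on the order of 9 Mbit/s, which is in the range commonly used for HD video, so the comparison is at least plausible under these assumptions.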
  
Big Data Analytics in Smarter Hospitals
IBM Data Baby
youtube.com
Big Data enabled doctors from University of Ontario to apply neonatal infant monitoring to predict infection in ICU 24 hours in advance
http://www.youtube.com/watch?v=0lt0hTNtjrY&feature=results_main&playnext=1&list=PL783389D2F81FFAB5
IBM Watson is a breakthrough in analytic innovation, but it is only successful
because of the quality of the information from which it is working.
Big Data and Watson
InfoSphere BigInsights
POS Data
CRM Data
Social Media
Distilled Insight
-  Spending habits
-  Social relationships
-  Buying trends
Advanced
search and
analysis
Watson can consume insights from Big Data for advanced analysis
Big Data technology is used to build Watson’s knowledge base
Watson uses the Apache Hadoop open framework to distribute the
workload for loading information into memory.
Approx. 200M pages of text
(To compete on Jeopardy!)
Watson’s
Memory
IBM is committed to Open Source
►  Decade of lineage and contributions to
the open source community
– Apache Hadoop and Jaql, Apache
Derby, Apache Geronimo, Apache
Jakarta, +++
– Eclipse: founded by IBM
– Significant Lucene contributions via IBM
Lucene Extension Library (ILEL)
– DRDA, XQuery, SQL, XML4J, XERCES,
HTTP, Java, Linux, +++
►  IBM products built on open source
– WebSphere: Apache
– Rational: Eclipse and Apache
– InfoSphere: Eclipse and Apache, +++
►  IBM’s BigInsights (Hadoop) is 100%
open source compatible with
no forks
Introducing MapReduce
►  In 2003 and 2004 Google releases two papers that provide insight
into their success
– The Google File System
– MapReduce: Simplified Data Processing on Large Clusters
►  Introduced an approach to large scale data processing known as
MapReduce
Global TLE Framework
MapReduce
►  A programming model
– Inspired by functional programming
– Allows expressing distributed computations on large amounts of data
►  Execution framework
– Designed for large-scale data processing
– Designed to run on clusters of commodity hardware
MapReduce, the programming model
►  Process key-value records
►  Map function:
(Kin, Vin) → list(Kinter, Vinter)
►  Barrier between map and reduce phases
– Shuffle and sort phase moves and groups like keys
►  Reduce function:
(Kinter, list(Vinter)) → list(Kout, Vout)
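To make the two signatures concrete, here is a minimal single-process sketch of the model in Python. It is not how Hadoop or Google's framework is implemented; `run_mapreduce`, `map_fn`, `reduce_fn` and the sensor records are hypothetical names and data chosen for illustration.

```python
from collections import defaultdict


def run_mapreduce(records, map_fn, reduce_fn):
    """Apply map_fn, group intermediate pairs by key, then apply reduce_fn."""
    # Map phase: each (K_in, V_in) record yields a list of (K_inter, V_inter).
    intermediate = []
    for key, value in records:
        intermediate.extend(map_fn(key, value))

    # Barrier: shuffle and sort, grouping values that share a key.
    groups = defaultdict(list)
    for inter_key, inter_value in intermediate:
        groups[inter_key].append(inter_value)

    # Reduce phase: each (K_inter, list(V_inter)) yields a list of (K_out, V_out).
    output = []
    for inter_key in sorted(groups):
        output.extend(reduce_fn(inter_key, groups[inter_key]))
    return output


def map_fn(_line_no, line):
    # Hypothetical record format: "sensor,reading"
    sensor, reading = line.split(",")
    return [(sensor, float(reading))]


def reduce_fn(sensor, readings):
    # Keep the largest reading seen for each sensor.
    return [(sensor, max(readings))]


records = [(1, "a,3.5"), (2, "b,1.0"), (3, "a,7.25")]
result = run_mapreduce(records, map_fn, reduce_fn)
print(result)
```

The user supplies only the two functions; grouping by key between the phases is the framework's job, which is what the barrier on the slide refers to.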
Map phase, word-count example
(line1, “Hello there.”)
(line2, “Why, hello.”)
(“hello”, 1)
(“there”, 1)
(“why”, 1)
(“hello”, 1)
Sort phase, word-count example
(“hello”, 1)
(“hello”, 1)
(“there”, 1)
(“why”, 1)
Reduce phase, word-count example
(“hello”, 1)
(“hello”, 1)
(“there”, 1)
(“why”, 1)
(“hello”, 2)
(“there”, 1)
(“why”, 1)
MapReduce, end to end
Pseudocode for word-count
def mapper(line):
    # Emit a (word, 1) pair for each whitespace-separated word
    for word in line.split():
        output(word, 1)

def reducer(key, values):
    # Emit the total count for this word
    output(key, sum(values))
Same code can be applied to thousands of lines,
even the whole web!
Google processes over 20PBs a day, much of it in
MapReduce programs.
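The word-count pseudocode can be exercised end to end on a single machine. The sketch below is not Hadoop code; it imitates the map, sort and reduce phases in plain Python on the two example lines. Lower-casing and stripping punctuation with a regular expression are assumptions made here so that the output matches the pairs shown on the earlier slides.

```python
import re
from itertools import groupby
from operator import itemgetter


def mapper(line):
    """Map phase: emit (word, 1) for every word in the line."""
    # Assumption: words are lower-cased and punctuation is dropped.
    for word in re.findall(r"[a-z']+", line.lower()):
        yield (word, 1)


def reducer(key, values):
    """Reduce phase: sum the counts collected for one word."""
    return (key, sum(values))


lines = ["Hello there.", "Why, hello."]

# Map: every input line is processed independently.
mapped = [pair for line in lines for pair in mapper(line)]

# Sort: bring identical keys next to each other.
shuffled = sorted(mapped, key=itemgetter(0))

# Reduce: one call per distinct key.
counts = [
    reducer(word, [count for _, count in group])
    for word, group in groupby(shuffled, key=itemgetter(0))
]
print(counts)
```

Because each line is mapped independently and each key is reduced independently, both phases can be spread across many machines without changing the two functions.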
But what about the data!
Compute Nodes
NAS
SAN
Distributed file system enables processing to
be moved to the data!
(key1, value1)
(key2, value2)
…
(key1, value1)
(key2, value2)
…
Processing is done local to the data
Key-value pairs are processed independently and in parallel!
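A rough way to picture "processing local to the data" is to give each block of key-value pairs its own worker and merge only the small partial results. In the sketch below, threads stand in for cluster nodes, and the `blocks` layout and the function names are hypothetical; a real distributed file system schedules work on the machines that hold each block.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Hypothetical data blocks, each imagined to live on a different node.
blocks = {
    "node1": [("key1", 1), ("key2", 2)],
    "node2": [("key1", 3), ("key2", 4)],
}


def process_block(pairs):
    """Aggregate one block using only the pairs stored in that block."""
    local = Counter()
    for key, value in pairs:
        local[key] += value
    return local


def process_all(all_blocks):
    """Process every block in parallel, then merge the partial results."""
    with ThreadPoolExecutor(max_workers=len(all_blocks)) as pool:
        partials = list(pool.map(process_block, all_blocks.values()))
    total = Counter()
    for partial in partials:
        total.update(partial)  # Counter.update adds counts together
    return dict(total)


totals = process_all(blocks)
print(totals)
```

Only the per-block summaries cross the (simulated) network, which is the point of moving the computation to the data rather than the data to the computation.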
Hadoop – A M/R Framework
►  Apache open source software framework for reliable, scalable,
distributed computing on massive amounts of data
§ Hides underlying system details and complexities from user
§ Developed in Java
►  Core sub projects:
− MapReduce
− Hadoop Distributed File System a.k.a. HDFS
− Hadoop Common
►  Supported by several Hadoop-related projects
§ HBase
§ Zookeeper
§ Avro
§ Etc.
►  Meant for heterogeneous commodity hardware
Hadoop Architecture
Who uses Hadoop?
Hadoop Open Source Projects
►  Hadoop is supplemented by an ecosystem of open source projects
Jaql
Oozie
The IBM Big Data Platform
InfoSphere BigInsights
Hadoop-based low latency
analytics for variety and volume
Data-At-Rest
Netezza High
Capacity Appliance
Queryable Archive for
Structured Data
Netezza 1000
BI+Ad Hoc Analytics on
Structured Data
Smart Analytics System
Operational Analytics on
Structured Data
Informix Timeseries
Time-structured analytics
InfoSphere Warehouse
Large volume structured data
analytics
InfoSphere Streams
Low Latency Analytics for
streaming data
Velocity, Variety & Volume
Data-In-Motion
MPP Data Warehouse
Stream Computing
Information Integration
Hadoop
InfoSphere Information
Server
High volume data integration
and transformation
Apache Hadoop:
open source framework
for the distributed processing
of large data sets across
clusters of computers using a
simple programming model
The IBM Big Data Platform
Integrate and manage the full variety, velocity and volume of data
Apply advanced analytics to information in its native form
Visualize all available data for ad-hoc analysis
Development environment for building new analytic applications
Workload optimization and scheduling
Security and Governance
BigInsights Brings Hadoop to the Enterprise
►  BigInsights = analytical platform for
persistent Big Data
–  Based on open source & IBM technologies
–  Managed like a start-up . . . . Emphasis on
deep customer engagements, product plan
flexibility
►  Distinguishing characteristics
– Built-in analytics . . . . Enhances business
knowledge
– Enterprise software integration . . . .
Complements and extends existing
capabilities
– Production-ready platform with tooling for
analysts, developers, and
administrators. . . . Speeds time-to-value;
simplifies development and maintenance
►  IBM advantage
– Combination of software, hardware, services
and advanced research
Hadoop
System
InfoSphere BigInsights
Platform for volume, variety,
velocity
►  Enhanced Hadoop
foundation
Analytics
►  Text analytics & tooling
►  Application accelerators
Usability
►  Web console
►  Spreadsheet-style tool
►  Ready-made “apps”
Enterprise Class
►  Storage, security, cluster
management
Integration
►  Connectivity to Netezza,
DB2, JDBC databases, etc
Apache
Hadoop
Basic Edition
Enterprise Edition
Licensed
Application accelerators
Pre-built applications
Text analytics
Spreadsheet-style tool
RDBMS, warehouse connectivity
Administrative tools, security
Eclipse development tools
Performance enhancements
. . . .
Free download
Integrated install
Online InfoCenter
BigData Univ.
Breadth of capabilities
Enterprise class
BigInsights Basic Edition
Connectivity and integration
JDBC
Flume
Infrastructure Jaql
Hive
Pig
HBase
MapReduce
HDFS
ZooKeeper
Lucene
Oozie
Open Source IBM
Integrated
installer
Sqoop
HCatalog
BigInsights Enterprise Edition
Connectivity and Integration Streams
Netezza
Text
processing
engine and
library
JDBC
Flume
Infrastructure Jaql
Hive
Pig
HBase
MapReduce
HDFS
ZooKeeper
Indexing Lucene
Adaptive
MapReduce
Oozie
Text compression
Enhanced
security
Flexible
scheduler
Optional
IBM and
partner
offerings
Analytics and discovery “Apps”
DB2
BigSheets
Web Crawler
Distrib file copy
DB export
Boardreader
DB import
Ad hoc query
Machine
learning
Data
processing
. . .
Administrative and
development tools
Web console
• Monitor cluster health, jobs, etc.
• Add / remove nodes
• Start / stop services
• Inspect job status
• Inspect workflow status
• Deploy applications
• Launch apps / jobs
• Work with distrib file system
• Work with spreadsheet interface
• Support REST-based API
• . . .
R
Eclipse tools
• Text analytics
• MapReduce programming
• Jaql, Hive, Pig development
• BigSheets plug-in development
• Oozie workflow generation
Integrated
installer
Open Source IBM
Cognos BI
GPFS (EAP)
Accelerator for
machine data
analysis
Accelerator for
social data
analysis
Guardium, DataStage, Data Explorer
Sqoop
HCatalog
Open Source Components Across Distributions
Component | BigInsights 2.0 | HortonWorks HDP 1.2 | MapR 2.0 | Greenplum HD 1.2 | Cloudera CDH3u5 | Cloudera CDH4*
Hadoop | 1.0.3 | 1.1.2 | 0.20.2 | 1.0.3 | 0.20.2 | 2.0.0*
HBase | 0.94.0 | 0.94.2 | 0.92.1 | 0.92.1 | 0.90.6 | 0.92.1
Hive | 0.9.0 | 0.10.0 | 0.9.0 | 0.8.1 | 0.7.1 | 0.8.1
Pig | 0.10.1 | 0.10.1 | 0.10.0 | 0.9.2 | 0.8.1 | 0.9.2
Zookeeper | 3.4.3 | 3.4.5 | X | 3.3.5 | 3.3.5 | 3.4.3
Oozie | 3.2.0 | 3.2.0 | 3.1.0 | X | 2.3.2 | 3.1.3
Avro | 1.6.3 | X | X | X | X | X
Flume | 0.9.4 | 1.3.0 | 1.2.0 | X | 0.9.4 | 1.1.0
Sqoop | 1.4.1 | 1.4.2 | 1.4.1 | X | 1.3.0 | 1.4.1
HCatalog | 0.4.0 | 0.5.0 | 0.4.0 | X | X | X
BigInsights continues to offer the most proven, stable versions of Apache Hadoop components
*Cloudera CDH4 Hadoop 2.0 includes Map Reduce 2.0 which Cloudera states “not yet considered stable”
Hadoop Systems
HDFS
Map/Reduce
Hive, Pig & Jaql
Sqoop
Zookeeper
Avro (Serialization)
HBase
ETL Tools
BI Reporting
RDBMS
BigInsights Content
Function | Version | Basic Edition | Enterprise Edition
Integrated Install | | Inc | Inc
Hadoop (including common utilities, HDFS, MapReduce framework) | 1.0.3 | Inc | Inc
Jaql (programming / query language) | 0.5.2 | Inc | Inc
Pig (programming / query language) | 0.10.0 | Inc | Inc
Flume (data collection/aggregation) | 0.9.4 | Inc | Inc
Hive (data summarization/querying) | 0.9.0 | Inc | Inc
Lucene (text search)* | 3.3.0 | Inc | Inc
Zookeeper (process coordination) | 3.4.3 | Inc | Inc
Avro (data serialization) | 1.6.3 | Inc | Inc
HBase (real time read/write) | 0.94.0 | Inc | Inc
HCatalog (table and storage management service) | 0.4.0 | Inc | Inc
Sqoop (RDBMS bulk data transfer) | 1.4.1 | Inc | Inc
Oozie (workflow / job orchestration) | 3.2.0 | Inc | Inc
Online documentation | | Inc | Inc
Integration with JDBC sources through general-purpose Jaql module | | Inc | Inc
Integration with DB2 (sample functions to submit jobs, read data) | | Inc | Inc
BigInsights Content (cont’d)
Function | Basic Edition | Enterprise Edition
Integration with R (Jaql module to invoke R statistical capabilities from BigInsights) | n/a | Inc
Integration with Netezza, DB2 LUW with DPF from Jaql | n/a | Inc
LDAP authentication, Guardium support, etc. | n/a | Inc
Integrated Web Console | n/a | Inc
Business process accelerators (social data, machine data analytics) | n/a | Inc
Platform performance enhancements (Adaptive MapReduce, large scale indexing, efficient processing of compressed text files, flexible job scheduler, etc.) | n/a | Inc
Text analytics | n/a | Inc
Eclipse tools for text analytic development, Jaql, Hive, Java | n/a | Inc
Applications for data import/export, Web crawl, machine learning, etc. | n/a | Inc
Web-based application catalog | n/a | Inc
Spreadsheet-like analytical tool | n/a | Inc
IBM support | Opt | Inc
Streams, Data Explorer, Cognos BI (limited use licenses) | n/a | Inc
Unlimited storage | n/a | Inc
BigInsights: Value Beyond Open Source
Enterprise Capabilities
Administration & Security
Workload Optimization
Connectors
Open source
components
Advanced Engines
Visualization & Exploration
Development Tools
IBM-certified
Apache Hadoop or or …
Key differentiators
•  Built-in analytics
•  Text engine, annotators, Eclipse tooling
•  Interface to project R (statistical platform)
•  Enterprise software integration
•  Spreadsheet-style analysis
•  Integrated installation of supported open source and other components
•  Web Console for admin and application access
•  Platform enrichment: additional security, performance features, . . .
•  World-class support
•  Full open source compatibility
Business benefits
•  Quicker time-to-value due to IBM technology and support
•  Reduced operational risk
•  Enhanced business knowledge with flexible analytical platform
•  Leverages and complements existing software
Big Insights - Demo
Big Data Application Ecosystem
Eclipse
App library
MapReduce, …
Text Analytics
Query
App Development
• Code application program, and generate associated App
• Deploy Apps to Enterprise Manager
Publish
Data integration scenario:
Pre-defined work flows simplify loading data from various sources
• Work flows can be configured, deployed, executed and scheduled
Development tooling:
• Text analytics
• MapReduce
• Query languages
• . . .
Application scenarios (web log, email, social media, …):
• Samples provide starting point, speed time to value
Big Data Web Console
Web Console
• Manage BigInsights
Inspect /monitor system health
Add / drop nodes
Start / stop services
Run / monitor jobs (applications)
Explore / modify file system
Create custom dashboards
. . .
• Launch applications
Spreadsheet-like analysis tool
Pre-built applications (IBM
supplied or user developed)
• Publish applications
• Monitor cluster, applications,
data, etc.
Running Applications from the Web Console
•  Import & Export Data
•  Database & Files
•  Web and Social
•  Analyze and Query
•  Predictive Analytics
•  Text Analytics
•  SQL/Hive, Jaql, Pig, HBase
Spreadsheet-style Analysis
•  Web-based analysis and
visualization
•  Spreadsheet-like
interface
Define and manage long
running data collection
jobs
Analyze content of the text
on the pages that have
been retrieved
Get started with BigInsights
•  In the Cloud
Via RightScale, or directly on Amazon, Rackspace, IBM Smart Enterprise
Cloud, or on private clouds.
Pay only for the resources used.
•  In the Classroom
Via IBM Education
Online at www.bigdatauniversity.com
•  On Your Cluster
Download Basic Edition from ibm.com.
•  With the BigInsights Community
– Technical portal @ http://tinyurl.com/biginsights
– BigData on DW @ http://ibm.co/bigdatadev
Links to demos, papers, forum, downloads, etc.
• Stay connected with IBM Big Data
– http://ibmbigdatahub.com
BigDataUniversity.com
Learn Big Data Technologies
• Flexible on-line delivery
allows learning @your place
and @your pace
§ Free courses, free study
materials.
§ Cloud-based sandbox
for exercises – zero setup
§ 66666 registered students.
§ Robust Course
Management System and
Content Distribution
infrastructure
Big Data is ripe for innovation
Backup slides
OSS in IBM Big Data Platform
Hadoop - http://hadoop.apache.org/
HDFS - http://hadoop.apache.org/docs/r1.0.4/hdfs_design.html
Hive - http://hive.apache.org/
Hbase - http://hbase.apache.org/
Flume - http://flume.apache.org/
Jaql - http://code.google.com/p/jaql/wiki/Running
Oozie - http://oozie.apache.org/
Sqoop - http://sqoop.apache.org/
Avro - http://avro.apache.org/
Lucene - http://lucene.apache.org/
Pigserver - http://pig.apache.org/
Zookeeper - http://zookeeper.apache.org/
Top - http://bigtop.apache.org/
Build a Big Data Program – MapReduce example
Eclipse tools
For Jaql, Hive, Pig, Java MapReduce, BigSheets
plug-ins, text analytics, etc.
BigInsights Text Analytics Development
BigInsights and Text Analytics
• Distills structured info from
unstructured text
Sentiment analysis
Consumer behavior
Illegal or suspicious activities
…
• Parses text and detects meaning
with annotators
• Understands the context in which
the text is analyzed
• Features pre-built extractors for
names, addresses, phone numbers,
etc.
• Built-in support for English,
Spanish, French, German,
Portuguese, Dutch, Japanese,
Chinese
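As a rough illustration of what an extractor does, the sketch below pulls phone numbers and e-mail addresses out of free text with regular expressions. This is not the BigInsights text analytics engine or its annotators; the patterns, the `extract` function and the sample sentence are hypothetical and far simpler than a production extractor.

```python
import re

# Illustrative patterns only; real extractors handle many more formats.
PHONE = re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b")
EMAIL = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")


def extract(text):
    """Return the structured fields found in a piece of unstructured text."""
    return {
        "phones": PHONE.findall(text),
        "emails": EMAIL.findall(text),
    }


# Hypothetical input text, not taken from the slides.
sample = "Call 303-555-0100 or write to jane.doe@example.com for details."
fields = extract(sample)
print(fields)
```

The output is a small structured record distilled from unstructured text, which is the same shape of result the slide describes, even though a real annotator also takes context and language into account.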
Football World Cup 2010, one team
distinguished themselves well, losing to the
eventual champions 1-0 in the Final. Early in
the second half, Netherlands’ striker, Arjen
Robben, had a breakaway, but the keeper for
Spain, Iker Casillas made the save. Winger
Andres Iniesta scored for Spain for the win.
Unstructured text (document, email, etc)
Classification and Insight
Example Analysis : Extraction from Twitter
messages
Extract intent, interests, life events and micro segmentation attributes
I'm at Mickey's Irish Pub Downtown (206 3rd St, Court Ave, Des Moines) w/ 2 others
http://4sq.com/gbsaYR
 @silliesylvia good!!! U shouldnt! Think about the important stuff, like ur birthday ;)
btw happy birthday Sylvia ;)
@rakonturmiami im moving to miami in 3 months. i look foward to the new lifestyle
I had an iphone, but it's dead @JoaoVianaa. (I've no idea where it's) !Want a blackberry
now !!!
Monetizable Intent
Relocation
Location
Name, Birth Day
Subtle Spam,
Advertising
Sarcasm,
Wishful Thinking
While accounting for less relevant messages
I think that @justinbieber deserves his 2 AMAZING songs in top ten!!! Buy them on itunes
http://Cell-Pones.com Looking to buy a phone? WiFi Cell Phones, Windows Mobile
@purplepleather Gotta do more research my Versace term paper 2day. Before I die, I
want a versace purple diamond tiara. Im just sayin>lol
had so much fun today! I want to buy a million dollar house with a wrap around
porch ... ... wading river on the long island sound, ha i wish!

High Performance Spatial-Temporal Trajectory Analysis with Spark High Performance Spatial-Temporal Trajectory Analysis with Spark
High Performance Spatial-Temporal Trajectory Analysis with Spark DataWorks Summit/Hadoop Summit
 

Similar to Big Data and OSS at IBM (20)

Presentation architecting virtualized infrastructure for big data
Presentation   architecting virtualized infrastructure for big dataPresentation   architecting virtualized infrastructure for big data
Presentation architecting virtualized infrastructure for big data
 
Presentation architecting virtualized infrastructure for big data
Presentation   architecting virtualized infrastructure for big dataPresentation   architecting virtualized infrastructure for big data
Presentation architecting virtualized infrastructure for big data
 
Webinar: The Modern Streaming Data Stack with Kinetica & StreamSets
Webinar: The Modern Streaming Data Stack with Kinetica & StreamSetsWebinar: The Modern Streaming Data Stack with Kinetica & StreamSets
Webinar: The Modern Streaming Data Stack with Kinetica & StreamSets
 
Using real time big data analytics for competitive advantage
 Using real time big data analytics for competitive advantage Using real time big data analytics for competitive advantage
Using real time big data analytics for competitive advantage
 
Vortrag ralph behrens_ibm-data
Vortrag ralph behrens_ibm-dataVortrag ralph behrens_ibm-data
Vortrag ralph behrens_ibm-data
 
Big Data PPT by Rohit Dubey
Big Data PPT by Rohit DubeyBig Data PPT by Rohit Dubey
Big Data PPT by Rohit Dubey
 
Architecting virtualized infrastructure for big data presentation
Architecting virtualized infrastructure for big data presentationArchitecting virtualized infrastructure for big data presentation
Architecting virtualized infrastructure for big data presentation
 
Inroduction to Big Data
Inroduction to Big DataInroduction to Big Data
Inroduction to Big Data
 
Big Data Session 1.pptx
Big Data Session 1.pptxBig Data Session 1.pptx
Big Data Session 1.pptx
 
Is your cloud ready for Big Data? Strata NY 2013
Is your cloud ready for Big Data? Strata NY 2013Is your cloud ready for Big Data? Strata NY 2013
Is your cloud ready for Big Data? Strata NY 2013
 
Exploring the Wider World of Big Data
Exploring the Wider World of Big DataExploring the Wider World of Big Data
Exploring the Wider World of Big Data
 
Big Data - A Real Life Revolution
Big Data - A Real Life RevolutionBig Data - A Real Life Revolution
Big Data - A Real Life Revolution
 
Keynote Address at 2013 CloudCon: Future of Big Data by Richard McDougall (In...
Keynote Address at 2013 CloudCon: Future of Big Data by Richard McDougall (In...Keynote Address at 2013 CloudCon: Future of Big Data by Richard McDougall (In...
Keynote Address at 2013 CloudCon: Future of Big Data by Richard McDougall (In...
 
Automating Big Data with the Automic Hadoop Agent
Automating Big Data with the Automic Hadoop AgentAutomating Big Data with the Automic Hadoop Agent
Automating Big Data with the Automic Hadoop Agent
 
Activeeon - Scale Beyond Limits
Activeeon - Scale Beyond LimitsActiveeon - Scale Beyond Limits
Activeeon - Scale Beyond Limits
 
MapR and Cisco Make IT Better
MapR and Cisco Make IT BetterMapR and Cisco Make IT Better
MapR and Cisco Make IT Better
 
Audax Group: CIO Perspectives - Managing The Copy Data Explosion
Audax Group: CIO Perspectives - Managing The Copy Data ExplosionAudax Group: CIO Perspectives - Managing The Copy Data Explosion
Audax Group: CIO Perspectives - Managing The Copy Data Explosion
 
Big data for SAS programmers
Big data for SAS programmersBig data for SAS programmers
Big data for SAS programmers
 
BIG DATA
BIG DATABIG DATA
BIG DATA
 
High Performance Spatial-Temporal Trajectory Analysis with Spark
High Performance Spatial-Temporal Trajectory Analysis with Spark High Performance Spatial-Temporal Trajectory Analysis with Spark
High Performance Spatial-Temporal Trajectory Analysis with Spark
 

More from Boulder Java User's Group (8)

Spring insight what just happened
Spring insight   what just happenedSpring insight   what just happened
Spring insight what just happened
 
Introduction To Pentaho Kettle
Introduction To Pentaho KettleIntroduction To Pentaho Kettle
Introduction To Pentaho Kettle
 
Json at work overview and ecosystem-v2.0
Json at work   overview and ecosystem-v2.0Json at work   overview and ecosystem-v2.0
Json at work overview and ecosystem-v2.0
 
Restful design at work v2.0
Restful design at work v2.0Restful design at work v2.0
Restful design at work v2.0
 
Introduction To JavaFX 2.0
Introduction To JavaFX 2.0Introduction To JavaFX 2.0
Introduction To JavaFX 2.0
 
55 New Features in Java 7
55 New Features in Java 755 New Features in Java 7
55 New Features in Java 7
 
Watson and Open Source Tools
Watson and Open Source ToolsWatson and Open Source Tools
Watson and Open Source Tools
 
Intro to Redis
Intro to RedisIntro to Redis
Intro to Redis
 

Recently uploaded

TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DayH2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DaySri Ambati
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsMiki Katsuragi
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxNavinnSomaal
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Manik S Magar
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Enterprise Knowledge
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionDilum Bandara
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostZilliz
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clashcharlottematthew16
 

Recently uploaded (20)

TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DayH2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An Introduction
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clash
 

Big Data and OSS at IBM

  • 1. Open Source SW @ IBM Big Data. Boulder Java User Group, 06/11/13. Ivan Portilla (ivanp@us.ibm.com, portilla@gmail.com), Ryan DeJana (rdejana@us.ibm.com)
  • 2. Disclaimer
      - This presentation represents the view of the authors and does not represent the view of IBM.
      - All opinions expressed in this presentation are strictly of the speakers, and do NOT represent those of IBM, IBM management, or anyone else.
      - IBM and IBM (logo) are trademarks or registered trademarks of International Business Machines Corporation in the United States and/or other countries.
      - Many Thanks to Rafael Coss & Paul Zikopoulos for the materials used in this presentation.
  • 3. Agenda
      - Big Data
      - OSS in IBM Big Data platform
      - Demo
  • 6. Big Data Size Equivalence
      Name           Value   RAMAC   iPod
      1 Giga (GB)    10^9    200
      1 Tera (TB)    10^12   200K    200
      1 Peta (PB)    10^15   200M    200K
      1 Exa (EB)     10^18   200B    200M
      1 Zetta (ZB)   10^21   200T    200B
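The RAMAC and iPod columns follow from simple division. A minimal sketch, assuming the per-device capacities implied by the table (about 5 MB per RAMAC and 5 GB per iPod, since 1 GB is listed as 200 RAMACs and 1 TB as 200 iPods); these capacities are inferred from the rows, not stated on the slide:

```python
# Capacities inferred from the table rows (assumptions, not from the slide):
# 1 GB = 200 RAMACs  ->  ~5 MB per RAMAC
# 1 TB = 200 iPods   ->  ~5 GB per iPod
RAMAC_BYTES = 5 * 10**6
IPOD_BYTES = 5 * 10**9

SIZES = {"GB": 10**9, "TB": 10**12, "PB": 10**15, "EB": 10**18, "ZB": 10**21}


def equivalents(unit):
    """Return (RAMAC count, iPod count) for one unit of the given size."""
    size = SIZES[unit]
    return size // RAMAC_BYTES, size // IPOD_BYTES


for unit in SIZES:
    ramacs, ipods = equivalents(unit)
    print(f"1 {unit} = {ramacs:,} RAMACs = {ipods:,} iPods")
```

For 1 GB the iPod count comes out as zero under this assumption, which matches the empty cell in the table.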
  • 7. Why Didn’t We Use All of the Big Data Before?
  • 8. Big Data Includes Any of the following Characteristics: Extracting insight in context, beyond what was previously possible.
      - Variety: Manage the complexity of multiple relational and non-relational data types and schemas
      - Velocity: Streaming data and large volume data movement
      - Volume: Scale from terabytes to zettabytes
  • 10. Big Data Has New Opportunities But Needs New Analytics
      [Chart: Data Scale (Kilo, Mega, Giga, Tera, Peta, Exa) against Decision Frequency (yr, mo, wk, day, hr, min, sec, ms, µs; Occasional, Frequent, Real-time). Regions: Traditional Data Warehouse and Business Intelligence; Data at Rest; Data in Motion. Annotations: up to 10,000 times larger; up to 10,000 times faster.]
      - Telco Promotions: 100,000 records/sec, 6B/day; 10 ms/decision; 270TB for Deep Analytics
      - DeepQA: 100s GB for Deep Analytics; 3 sec/decision
      - Smart Traffic: 250K GPS probes/sec; 630K segments/sec; 2 ms/decision, 4K vehicles
      - Homeland Security: 600,000 records/sec, 50B/day; 1-2 ms/decision; 320TB for Deep Analytics
  • 11. Applications for Big Data Analytics: Homeland Security; Finance; Smarter Healthcare; Multi-channel sales; Telecom; Manufacturing; Traffic Control; Trading Analytics; Fraud and Risk; Log Analysis; Search Quality; Retail: Churn, NBO
  • 12. Most requested use cases of Big Data
      - Utilities: Weather impact analysis on power generation; Transmission monitoring; Smart grid management
      - Retail: 360° View of the Customer; Click-stream analysis; Real-time promotions
      - Law Enforcement: Real-time multimodal surveillance; Situational awareness; Cyber security detection
      - Transportation: Weather and traffic impact on logistics and fuel consumption; Traffic congestion
      - Financial Services: Fraud detection; Risk management; 360° View of the Customer
      - IT: System log analysis; Cybersecurity
      - Telecommunications: CDR processing; Churn prediction; Geomapping / marketing; Network monitoring
      - Health & Life Sciences: Epidemic early warning; ICU monitoring; Remote healthcare monitoring
  • 13.
      - Public wind data is available on 284 km x 284 km grids (2.5° LAT/LONG)
      - More data means more accurate and richer models (adding hundreds of variables)
          - Vestas wind library at 2.5 PB: to grow to over 6 PB in the near-term
          - Granularity 27 km x 27 km grids: driving to 9x9, 3x3 to 10 m x 10 m simulations
      - Reduced turbine placement identification from weeks to hours
      - Perspective: the Vestas wind library, viewed as HD TV, would take 70 years to watch
  • 14. Big Data Analytics in Smarter Hospitals
      Big Data enabled doctors from University of Ontario to apply neonatal infant monitoring to predict infection in ICU 24 hours in advance
      Video: IBM Data Baby (youtube.com) http://www.youtube.com/watch?v=0lt0hTNtjrY&feature=results_main&playnext=1&list=PL783389D2F81FFAB5
  • 15. IBM Watson is a breakthrough in analytic innovation, but it is only successful because of the quality of the information from which it is working.
  • 16. Big Data and Watson
      [Diagram labels: POS Data, CRM Data, Social Media; InfoSphere BigInsights; Distilled Insight (spending habits, social relationships, buying trends); Advanced search and analysis; Watson's Memory: approx. 200M pages of text (to compete on Jeopardy!)]
      - Watson can consume insights from Big Data for advanced analysis
      - Big Data technology is used to build Watson's knowledge base
      - Watson uses the Apache Hadoop open framework to distribute the workload for loading information into memory
  • 17. IBM is committed to Open Source
      ► Decade of lineage and contributions to the open source community
          – Apache Hadoop and Jaql, Apache Derby, Apache Geronimo, Apache Jakarta, +++
          – Eclipse: founded by IBM
          – Significant Lucene contributions via IBM Lucene Extension Library (ILEL)
          – DRDA, XQuery, SQL, XML4J, XERCES, HTTP, Java, Linux, +++
      ► IBM products built on open source
          – WebSphere: Apache
          – Rational: Eclipse and Apache
          – InfoSphere: Eclipse and Apache, +++
      ► IBM's BigInsights (Hadoop) is 100% open source compatible with no forks
  • 18. Introducing MapReduce
      ► In 2003 and 2004 Google released two papers that provide insight into their success
          – The Google File System
          – MapReduce: Simplified Data Processing on Large Clusters
      ► Introduced an approach to large scale data processing known as MapReduce
      Global TLE Framework
  • 19. MapReduce
      ► A programming model
          – Inspired by functional programming
          – Allows expressing distributed computations on large amounts of data
      ► Execution framework
          – Designed for large-scale data processing
          – Designed to run on clusters of commodity hardware
  • 20. MapReduce, the programming model
      ► Process key-value records
      ► Map function: (K_in, V_in) → list(K_inter, V_inter)
      ► Barrier between map and reduce phases
          – Shuffle and sort phase moves and groups like keys
      ► Reduce function: (K_inter, list(V_inter)) → list(K_out, V_out)
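The two signatures and the barrier between them can be sketched as a small single-process driver. This is an illustration under stated assumptions, not Hadoop's API: `run_mapreduce`, the order records, and the customer names are all hypothetical, and a real framework would spread the same steps across many machines.

```python
from collections import defaultdict


def run_mapreduce(records, mapper, reducer):
    """Apply mapper to every (K_in, V_in), group by K_inter, then reduce."""
    grouped = defaultdict(list)
    # Map phase: each input record may emit any number of intermediate pairs.
    for k_in, v_in in records:
        for k_inter, v_inter in mapper(k_in, v_in):
            # Shuffle: values with like keys end up in the same list.
            grouped[k_inter].append(v_inter)
    # Barrier: reduce starts only after all map output has been grouped.
    output = []
    for k_inter in sorted(grouped):
        output.extend(reducer(k_inter, grouped[k_inter]))
    return output


# Hypothetical example: total order amount per customer.
def mapper(order_id, order):
    customer, amount = order
    yield customer, amount


def reducer(customer, amounts):
    yield customer, sum(amounts)


orders = [(1, ("ann", 10)), (2, ("bob", 5)), (3, ("ann", 7))]
totals = run_mapreduce(orders, mapper, reducer)
print(totals)  # [('ann', 17), ('bob', 5)]
```

Only `mapper` and `reducer` change from one job to the next; the grouping step in between stays the same.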
  • 21. Map phase, word-count example
      Input: (line1, “Hello there.”), (line2, “Why, hello.”)
      Output: (“hello”, 1), (“there”, 1), (“why”, 1), (“hello”, 1)
  • 22. Sort phase, word-count example
      (“hello”, 1), (“hello”, 1), (“there”, 1), (“why”, 1)
  • 23. Reduce phase, word-count example
      Input: (“hello”, 1), (“hello”, 1), (“there”, 1), (“why”, 1)
      Output: (“hello”, 2), (“there”, 1), (“why”, 1)
  • 24. MapReduce, end to end
  • 25. Pseudocode for word-count
      def mapper(line):
          for word in line.split():
              output(word, 1)

      def reducer(key, values):
          output(key, sum(values))
      Same code can be applied to thousands of lines, even the whole web! Google processes over 20 PB a day, much of it in MapReduce programs.
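The pseudocode can be made runnable in plain Python to trace the map, sort, and reduce phases from the word-count slides. A minimal sketch: the `word_count` driver is a hypothetical single-process stand-in for the framework, and the lowercasing and punctuation stripping are assumptions added so the result matches the slide example (a bare `line.split()` would keep "Hello" and "hello." as distinct words).

```python
import re
from itertools import groupby
from operator import itemgetter


def mapper(line):
    """Emit (word, 1) for each word, lowercased and without punctuation."""
    for word in re.findall(r"[a-z']+", line.lower()):
        yield word, 1


def reducer(key, values):
    """Sum the counts collected for one word."""
    yield key, sum(values)


def word_count(lines):
    # Map phase: every line is processed independently.
    mapped = [pair for line in lines for pair in mapper(line)]
    # Sort phase: bring like keys next to each other.
    mapped.sort(key=itemgetter(0))
    # Reduce phase: one reducer call per distinct key.
    result = []
    for key, group in groupby(mapped, key=itemgetter(0)):
        result.extend(reducer(key, (count for _, count in group)))
    return result


print(word_count(["Hello there.", "Why, hello."]))
# [('hello', 2), ('there', 1), ('why', 1)]
```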
  • 26. But what about the data! [Diagram: Compute Nodes, NAS, SAN]
  • 27. Distributed file system enables processing to be moved to the data!
      [Diagram: each node holds its own (key1, value1), (key2, value2), … pairs]
      - Processing is done local to the data
      - Key-value pairs are processed independently and in parallel!
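The idea that each block of data is processed where it lives, independently of the others, can be imitated on one machine. A rough sketch, assuming threads stand in for cluster nodes and in-memory lists stand in for file system blocks; `count_block`, `count_all`, and the sample blocks are hypothetical names and data, not part of Hadoop:

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def count_block(block):
    """Count words in one block of lines; needs no data from other blocks."""
    counts = Counter()
    for line in block:
        counts.update(line.lower().split())
    return counts


def count_all(blocks, workers=4):
    """Process every block in parallel, then merge the partial counts."""
    total = Counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for partial in pool.map(count_block, blocks):
            total.update(partial)
    return total


blocks = [["big data", "open source"], ["big insights"], ["data in motion"]]
totals = count_all(blocks)
print(totals["big"], totals["data"], totals["open"])  # 2 2 1
```

Only the small per-block counts travel to the merge step, while the lines themselves stay put, which is the point the slide makes about moving processing to the data.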
  • 28. Hadoop – A M/R Framework
      ► Apache open source software framework for reliable, scalable, distributed computing of massive amounts of data
          - Hides underlying system details and complexities from user
          - Developed in Java
      ► Core sub projects:
          - MapReduce
          - Hadoop Distributed File System a.k.a. HDFS
          - Hadoop Common
      ► Supported by several Hadoop-related projects
          - HBase
          - Zookeeper
          - Avro
          - Etc.
      ► Meant for heterogeneous commodity hardware
  • 31. Hadoop Open Source Projects
      ► Hadoop is supplemented by an ecosystem of open source projects
      [Diagram labels: Jaql, Oozie]
  • 32. The IBM Big Data Platform
      - InfoSphere BigInsights: Hadoop-based low latency analytics for variety and volume
      - Netezza High Capacity Appliance: Queryable Archive for Structured Data
      - Netezza 1000: BI+Ad Hoc Analytics on Structured Data
      - Smart Analytics System: Operational Analytics on Structured Data
      - Informix Timeseries: Time-structured analytics
      - InfoSphere Warehouse: Large volume structured data analytics
      - InfoSphere Streams: Low Latency Analytics for streaming data
      - InfoSphere Information Server: High volume data integration and transformation
      [Diagram labels: Data-At-Rest; Data-In-Motion; Velocity, Variety & Volume; MPP Data Warehouse; Stream Computing; Information Integration; Hadoop]
      Apache Hadoop: open source framework for the distributed processing of large data sets across clusters of computers using a simple programming model
  • 33. The IBM Big Data Platform
      - Integrate and manage the full variety, velocity and volume of data
      - Apply advanced analytics to information in its native form
      - Visualize all available data for ad-hoc analysis
      - Development environment for building new analytic applications
      - Workload optimization and scheduling
      - Security and Governance
  • 34. BigInsights Brings Hadoop to the Enterprise
      ► BigInsights = analytical platform for persistent Big Data
          – Based on open source & IBM technologies
          – Managed like a start-up: emphasis on deep customer engagements, product plan flexibility
      ► Distinguishing characteristics
          – Built-in analytics: enhances business knowledge
          – Enterprise software integration: complements and extends existing capabilities
          – Production-ready platform with tooling for analysts, developers, and administrators: speeds time-to-value; simplifies development and maintenance
      ► IBM advantage
          – Combination of software, hardware, services and advanced research
      [Diagram label: Hadoop System]
  • 35. InfoSphere BigInsights: Platform for volume, variety, velocity
      ► Enhanced Hadoop foundation
      Analytics
      ► Text analytics & tooling
      ► Application accelerators
      Usability
      ► Web console
      ► Spreadsheet-style tool
      ► Ready-made “apps”
      Enterprise Class
      ► Storage, security, cluster management
      Integration
      ► Connectivity to Netezza, DB2, JDBC databases, etc
      [Chart: Apache Hadoop, Basic Edition, Enterprise Edition plotted by breadth of capabilities and enterprise class]
      - Licensed: Application accelerators; Pre-built applications; Text analytics; Spreadsheet-style tool; RDBMS, warehouse connectivity; Administrative tools, security; Eclipse development tools; Performance enhancements; . . .
      - Free download; Integrated install; Online InfoCenter; BigData Univ.
  • 36. BigInsights Basic Edition [component diagram; legend: Open Source, IBM]
      - Connectivity and integration: JDBC, Flume
      - Infrastructure: Jaql, Hive, Pig, HBase, MapReduce, HDFS, ZooKeeper, Lucene, Oozie
      - Other components shown: Integrated installer, Sqoop, HCatalog
  • 37. BigInsights Enterprise Edition [component diagram; legend: Open Source, IBM; labels listed in extraction order]
      - Connectivity and Integration: Streams, Netezza, Text processing engine and library, JDBC, Flume
      - Infrastructure: Jaql, Hive, Pig, HBase, MapReduce, HDFS, ZooKeeper, Indexing, Lucene, Adaptive MapReduce, Oozie, Text compression, Enhanced security, Flexible scheduler
      - Optional IBM and partner offerings
      - Analytics and discovery “Apps”: DB2, BigSheets, Web Crawler, Distrib file copy, DB export, Boardreader, DB import, Ad hoc query, Machine learning, Data processing, . . .
      - Administrative and development tools
          - Web console: Monitor cluster health, jobs, etc.; Add / remove nodes; Start / stop services; Inspect job status; Inspect workflow status; Deploy applications; Launch apps / jobs; Work with distrib file system; Work with spreadsheet interface; Support REST-based API; . . .
          - R
          - Eclipse tools: Text analytics; MapReduce programming; Jaql, Hive, Pig development; BigSheets plug-in development; Oozie workflow generation
      - Other components shown: Integrated installer, IBM Cognos BI, GPFS (EAP), Accelerator for machine data analysis, Accelerator for social data analysis, Guardium, DataStage, Data Explorer, Sqoop, HCatalog
  • 38. Open Source Components Across Distributions
      Component  BigInsights 2.0  Hortonworks HDP 1.2  MapR 2.0  Greenplum HD 1.2  Cloudera CDH3u5  Cloudera CDH4*
      Hadoop     1.0.3            1.1.2                0.20.2    1.0.3             0.20.2           2.0.0*
      HBase      0.94.0           0.94.2               0.92.1    0.92.1            0.90.6           0.92.1
      Hive       0.9.0            0.10.0               0.9.0     0.8.1             0.7.1            0.8.1
      Pig        0.10.1           0.10.1               0.10.0    0.9.2             0.8.1            0.9.2
      Zookeeper  3.4.3            3.4.5                X         3.3.5             3.3.5            3.4.3
      Oozie      3.2.0            3.2.0                3.1.0     X                 2.3.2            3.1.3
      Avro       1.6.3            X                    X         X                 X                X
      Flume      0.9.4            1.3.0                1.2.0     X                 0.9.4            1.1.0
      Sqoop      1.4.1            1.4.2                1.4.1     X                 1.3.0            1.4.1
      HCatalog   0.4.0            0.5.0                0.4.0     X                 X                X
      BigInsights continues to offer the most proven, stable versions of Apache Hadoop components
      *Cloudera CDH4 Hadoop 2.0 includes Map Reduce 2.0, which Cloudera states is “not yet considered stable”
  • 39. Hadoop Systems [Diagram labels: HDFS; Map/Reduce; Hive, Pig & Jaql; Sqoop; Zookeeper; Avro (Serialization); HBase; ETL Tools; BI Reporting; RDBMS]
  • 40. BigInsights Content
      Version  Basic Edition  Enterprise Edition  Function
               Inc            Inc                 Integrated Install
      1.0.3    Inc            Inc                 Hadoop (including common utilities, HDFS, MapReduce framework)
      0.5.2    Inc            Inc                 Jaql (programming / query language)
      0.10.0   Inc            Inc                 Pig (programming / query language)
      0.9.4    Inc            Inc                 Flume (data collection/aggregation)
      0.9.0    Inc            Inc                 Hive (data summarization/querying)
      3.3.0    Inc            Inc                 Lucene (text search)*
      3.4.3    Inc            Inc                 Zookeeper (process coordination)
      1.6.3    Inc            Inc                 Avro (data serialization)
      0.94.0   Inc            Inc                 HBase (real time read/write)
      0.4.0    Inc            Inc                 HCatalog (table and storage management service)
      1.4.1    Inc            Inc                 Sqoop (RDBMS bulk data transfer)
      3.2.0    Inc            Inc                 Oozie (workflow / job orchestration)
               Inc            Inc                 Online documentation
               Inc            Inc                 Integration with JDBC sources through general-purpose Jaql module
               Inc            Inc                 Integration with DB2 (sample functions to submit jobs, read data)
  • 41. BigInsights Content (cont’d)
      Basic Edition  Enterprise Edition  Function
      n/a            Inc                 Integration with R (Jaql module to invoke R statistical capabilities from BigInsights)
      n/a            Inc                 Integration with Netezza, DB2 LUW with DPF from Jaql
      n/a            Inc                 LDAP authentication, Guardium support, etc.
      n/a            Inc                 Integrated Web Console
      n/a            Inc                 Business process accelerators (social data, machine data analytics)
      n/a            Inc                 Platform performance enhancements (Adaptive MapReduce, large scale indexing, efficient processing of compressed text files, flexible job scheduler, etc.)
      n/a            Inc                 Text analytics
      n/a            Inc                 Eclipse tools for text analytic development, Jaql, Hive, Java
      n/a            Inc                 Applications for data import/export, Web crawl, machine learning, etc.
      n/a            Inc                 Web-based application catalog
      n/a            Inc                 Spreadsheet-like analytical tool
      Opt            Inc                 IBM support
      n/a            Inc                 Streams, Data Explorer, Cognos BI (limited use licenses)
      n/a            Inc                 Unlimited storage
  • 42. BigInsights: Value Beyond Open Source
      [Diagram labels: Enterprise Capabilities; Administration & Security; Workload Optimization; Connectors; Open source components; Advanced Engines; Visualization & Exploration; Development Tools; IBM-certified Apache Hadoop or or …]
      Key differentiators
      • Built-in analytics
          • Text engine, annotators, Eclipse tooling
          • Interface to project R (statistical platform)
      • Enterprise software integration
      • Spreadsheet-style analysis
      • Integrated installation of supported open source and other components
      • Web Console for admin and application access
      • Platform enrichment: additional security, performance features, . . .
      • World-class support
      • Full open source compatibility
      Business benefits
      • Quicker time-to-value due to IBM technology and support
      • Reduced operational risk
      • Enhanced business knowledge with flexible analytical platform
      • Leverages and complements existing software
  • 43. Big Insights - Demo
  • 44. Big Data Application Ecosystem
      [Diagram labels: Eclipse; App library; MapReduce, …; Text Analytics; Query; App Development; Publish; Big Data Web Console]
      App Development
      • Code application program, and generate associated App
      • Deploy Apps to Enterprise Manager
      Data integration scenario: Pre-defined work flows simplify loading data from various sources
      • Work flows can be configured, deployed, executed and scheduled
      Development tooling:
      • Text analytics
      • MapReduce
      • Query languages
      • . . .
      Application scenarios (web log, email, social media, …):
      • Samples provide starting point, speed time to value
  • 45. Web Console
    • Manage BigInsights
      Inspect / monitor system health
      Add / drop nodes
      Start / stop services
      Run / monitor jobs (applications)
      Explore / modify file system
      Create custom dashboards
      . . .
    • Launch applications
      Spreadsheet-like analysis tool
      Pre-built applications (IBM supplied or user developed)
    • Publish applications
    • Monitor cluster, applications, data, etc.
  • 46. Running Applications from the Web Console
    • Import & Export Data
    • Database & Files
    • Web and Social
    • Analyze and Query
    • Predictive Analytics
    • Text Analytics
    • SQL/Hive, Jaql, Pig, HBase
  • 47. Spreadsheet-style Analysis
    • Web-based analysis and visualization
    • Spreadsheet-like interface
    Screenshot captions:
    Define and manage long-running data collection jobs
    Analyze content of the text on the pages that have been retrieved
  • 48. Get started with BigInsights
    • In the Cloud: Via RightScale, or directly on Amazon, Rackspace, IBM Smart Enterprise Cloud, or on private clouds. Pay only for the resources used.
    • In the Classroom: Via IBM Education Online at www.bigdatauniversity.com
    • On Your Cluster: Download Basic Edition from ibm.com.
    • With the BigInsights Community (links to demos, papers, forum, downloads, etc.)
      Technical portal @ http://tinyurl.com/biginsights
      BigData on DW @ http://ibm.co/bigdatadev
    • Stay connected with IBM Big Data: http://ibmbigdatahub.com
  • 49. BigDataUniversity.com: Learn Big Data Technologies
    • Flexible on-line delivery allows learning @your place and @your pace
      Free courses, free study materials.
      Cloud-based sandbox for exercises, zero setup.
      66666 registered students.
      Robust Course Management System and Content Distribution infrastructure.
  • 50. Big Data is ripe for innovation
  • 52. OSS in IBM Big Data Platform
    Hadoop - http://hadoop.apache.org/
    HDFS - http://hadoop.apache.org/docs/r1.0.4/hdfs_design.html
    Hive - http://hive.apache.org/
    HBase - http://hbase.apache.org/
    Flume - http://flume.apache.org/
    Jaql - http://code.google.com/p/jaql/wiki/Running
    Oozie - http://oozie.apache.org/
    Sqoop - http://sqoop.apache.org/
    Avro - http://avro.apache.org/
    Lucene - http://lucene.apache.org/
    Pig - http://pig.apache.org/
    Zookeeper - http://zookeeper.apache.org/
    Bigtop - http://bigtop.apache.org/
  • 53. Build a Big Data Program – MapReduce example
    Eclipse tools for Jaql, Hive, Pig, Java MapReduce, BigSheets plug-ins, text analytics, etc.
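The slide names a MapReduce example but the transcript does not carry its code. As a rough illustration of the programming model only, here is a minimal word-count sketch in plain Java. It is an assumption-laden stand-in: it runs in a single JVM, uses no Hadoop or BigInsights APIs, and the class and method names (`WordCountSketch`, `map`, `reduce`, `wordCount`) are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Single-process illustration of the map -> shuffle -> reduce flow.
// Not Hadoop code: all names here are hypothetical.
class WordCountSketch {

    // Map phase: emit a (word, 1) pair for every word in one input line.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String token : line.toLowerCase().split("\\W+")) {
            if (!token.isEmpty()) {
                pairs.add(Map.entry(token, 1));
            }
        }
        return pairs;
    }

    // Reduce phase: sum all counts that were emitted for one word.
    static int reduce(List<Integer> counts) {
        int sum = 0;
        for (int c : counts) {
            sum += c;
        }
        return sum;
    }

    // Driver: map every line, group values by key (the "shuffle"), then reduce each group.
    static Map<String, Integer> wordCount(List<String> lines) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : lines) {
            for (Map.Entry<String, Integer> pair : map(line)) {
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                       .add(pair.getValue());
            }
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            result.put(e.getKey(), reduce(e.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> lines = List.of(
            "Big Data is ripe for innovation",
            "big data, big insights");
        System.out.println(wordCount(lines));
    }
}
```

On a real cluster the map and reduce steps would run as distributed tasks and the framework would perform the grouping; the sketch only shows how the three stages fit together.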
  • 55. BigInsights and Text Analytics
    • Distills structured info from unstructured text
      Sentiment analysis
      Consumer behavior
      Illegal or suspicious activities
      …
    • Parses text and detects meaning with annotators
    • Understands the context in which the text is analyzed
    • Features pre-built extractors for names, addresses, phone numbers, etc.
    • Built-in support for English, Spanish, French, German, Portuguese, Dutch, Japanese, Chinese
    Diagram labels: Unstructured text (document, email, etc); Classification and Insight
    Sample text: Football World Cup 2010, one team distinguished themselves well, losing to the eventual champions 1-0 in the Final. Early in the second half, Netherlands’ striker, Arjen Robben, had a breakaway, but the keeper for Spain, Iker Casillas made the save. Winger Andres Iniesta scored for Spain for the win.
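To make the idea of an extractor concrete, here is a minimal sketch of one that pulls phone numbers out of free text and returns them as structured annotations with character offsets. It is an illustration only: it uses plain `java.util.regex` rather than the BigInsights text engine, the pattern covers just a couple of North American formats, and the names `PhoneExtractorSketch`, `Annotation`, and `extractPhones` are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative regex-based extractor; not the BigInsights text analytics API.
class PhoneExtractorSketch {

    // Assumed formats: 303-555-0100, 303.555.0100, 303 555 0100, (303) 555-0100.
    private static final Pattern PHONE = Pattern.compile(
        "(?:\\(\\d{3}\\)\\s?|\\b\\d{3}[-.\\s])\\d{3}[-.\\s]\\d{4}\\b");

    // One extracted span: a type label, the matched text, and its offsets in the input.
    static final class Annotation {
        final String type;
        final String text;
        final int begin;
        final int end;

        Annotation(String type, String text, int begin, int end) {
            this.type = type;
            this.text = text;
            this.begin = begin;
            this.end = end;
        }

        @Override
        public String toString() {
            return type + "[" + begin + "," + end + "): " + text;
        }
    }

    // Scan the document once and record every match as an annotation.
    static List<Annotation> extractPhones(String document) {
        List<Annotation> found = new ArrayList<>();
        Matcher m = PHONE.matcher(document);
        while (m.find()) {
            found.add(new Annotation("PhoneNumber", m.group(), m.start(), m.end()));
        }
        return found;
    }

    public static void main(String[] args) {
        // Made-up input string for demonstration.
        String email = "Call me at 303-555-0100 or (720) 555-0199 after the meetup.";
        for (Annotation a : extractPhones(email)) {
            System.out.println(a);
        }
    }
}
```

Keeping the offsets alongside the matched text is what lets later steps combine annotations, for example pairing a name with the phone number that follows it.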
  • 56. Example Analysis: Extraction from Twitter messages
    Extract intent, interests, life events and micro segmentation attributes
    Example messages:
      I'm at Mickey's Irish Pub Downtown (206 3rd St, Court Ave, Des Moines) w/ 2 others http://4sq.com/gbsaYR
      @silliesylvia good!!! U shouldnt! Think about the important stuff, like ur birthday ;) btw happy birthday Sylvia ;)
      @rakonturmiami im moving to miami in 3 months. i look foward to the new lifestyle
      I had an iphone, but it's dead @JoaoVianaa. (I've no idea where it's) !Want a blackberry now !!!
    Extracted attributes: Monetizable Intent; Relocation; Location; Name, Birth Day
    While accounting for less relevant messages (Subtle Spam, Advertising; Sarcasm, Wishful Thinking):
      I think that @justinbieber deserves his 2 AMAZING songs in top ten!!! Buy them on itunes
      http://Cell-Pones.com Looking to buy a phone? WiFi Cell Phones, Windows Mobile
      @purplepleather Gotta do more research my Versace term paper 2day. Before I die, I want a versace purple diamond tiara. Im just sayin>lol
      had so much fun today! I want to buy a million dollar house with a wrap around porch ... ... wading river on the long island sound, ha i wish!