ae nv/sa
Interleuvenlaan 27b, B-3001 Heverlee
T +32 16 39 30 60 - F +32 16 39 30 70
www.ae.be
Bram Vanschoenwinkel
Principal Consultant BI & Analytics
@bvschoen
R & Hadoop
The perfect marriage for your analytics?
WELCOME
R & Hadoop
The perfect marriage for your analytics?
By Michael Degrez
Sales Director - AE
19/02 Mobile by design
How to design, build, run for mobile first
23/04 R & Hadoop
The perfect marriage for your analytics?
18/06 From private cloud to hybrid cloud
How to benefit from a successful implementation
01/10 Prepare for the digital enterprise
Business driven enterprise architecture
26/11 Multi-device front-end engineering
How businesses benefit from applying this technical skill
Agenda
1. It’s a (R)evolution
2. Intelligent Decision Support in the Digital Age
3. The R Project for Statistical Computing
4. The World of Hadoop
5. Case: A Customer Intelligence Platform
6. Conclusions
It’s a (R)evolution
[Figure: data volume plotted against time (2000, 2010, 2015); the majority is unstructured data.]
Abundance of Data
- ERP: purchase detail, production, payment detail, planning
- CRM: contact information, leads, offers, segmentation, prospects
- Web: click stream data, web shops, social media, video, images, text, online services, audio, open data
- Beyond: mobile devices, Internet of Things, RFID, GPS, sensors, user generated content, smart devices, remote monitoring, cloud, medical, wearables
Opportunities
- Operational excellence
- Innovative business models
- Insights, strategy and policy
Challenges
- Short lifespan of the data
- Fast moving data
- Fast data processing
- High variety of data
Intelligent Decision Support in the Digital Age
What we see:
- Abundance of heterogeneous data
- The way we interact with the world has changed
Opportunities:
- Operational excellence
- Better decision support
- Innovating business models
Challenges:
- Analysis gap
- Volume, variety, velocity
- Competences
Decision Support in the Digital Age
Facing the challenges and realizing the opportunities:
- Business Analytics
- Big Data
Elements of a Holistic Information Management Framework
- Data Sources
- Internal & External
- From Data to Information
- Improving data quality
- Integrality of data
- From Information to Knowledge
Intelligent Decision Support:
- Reporting
- Business Analytics
- From Knowledge to Intelligence
Data → Information → Knowledge → Intelligence → Wisdom/Insight
Decision Support in the Digital Age
“Business Analytics is the nontrivial extraction of
implicit, previously unknown, and potentially useful
information from data.”
Business Analytics vs Business Intelligence
New Insights
[Figure: eight groups, each labelled with a count of "stoppen" (Dutch: "stop" or "quit"): 8, 132, 10, 53, 64, 14, 4, 11.]
Innovating Business Models
[Diagram labels: Front-end Application(s); Security; Analytics (on Hadoop); Web Click Streaming; Social Media; Connectivity; External Application Integration; Operational Data Processing on Hadoop]
From Analytics…
[Diagram labels: Statistics, Algorithms, Biology, Psychology, Databases]
…to Business Analytics
Analytics Approach
Two Phase Approach
1. Analytics
   - Incremental and iterative
   - Think big, act small
   - Proof-of-Concept
   - Open source tools
2. Architecture & Deployment
   - (Non-)functional requirements
   - Information Architecture
   - Technology
   - Embedded into operations
Analytics Churn Prediction Example
Sources: Invoicing, CRM, Call Center Application
John Doe – 43 years – Antwerp – Man – 7 calls – 3 weeks – 30% down invoicing
Jane Dan – 32 years – Brussels – Woman – 2 calls – 12 weeks – 10% up invoicing
…
[Diagram labels: Operations; Churn scores; Region; Product; Management dashboard; Operations data dump; Analytics Engine; Data Warehouse]
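As a rough illustration of how an analytics engine could turn combined invoicing, CRM and call-center attributes into a churn score, here is a minimal Python sketch. The logistic form, the weights, the reading of "weeks" as the period over which calls were counted, and the names `Customer` and `churn_score` are all hypothetical and chosen only for illustration; a real model would be fitted on historical data.

```python
import math
from dataclasses import dataclass


@dataclass
class Customer:
    name: str
    calls: int             # calls to the call center
    weeks: int             # assumed: period in weeks over which the calls were counted
    invoice_change: float  # relative change in invoicing, e.g. -0.30 for "30% down"


# Hypothetical weights, for illustration only (not fitted on any data).
BIAS = -1.0
W_CALLS_PER_WEEK = 0.8
W_INVOICE_CHANGE = -3.0


def churn_score(c: Customer) -> float:
    """Return a score between 0 and 1; higher means more likely to churn."""
    calls_per_week = c.calls / c.weeks
    z = BIAS + W_CALLS_PER_WEEK * calls_per_week + W_INVOICE_CHANGE * c.invoice_change
    return 1.0 / (1.0 + math.exp(-z))  # logistic function


john = Customer("John Doe", calls=7, weeks=3, invoice_change=-0.30)
jane = Customer("Jane Dan", calls=2, weeks=12, invoice_change=0.10)

for customer in (john, jane):
    print(f"{customer.name}: {churn_score(customer):.2f}")
```

With these made-up weights, frequent calls and falling invoices push the score up, so the first record scores higher than the second.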
 23
Big Data
“Big data is high-volume, high-velocity, high-complexity and
high-variety information assets that demand cost-effective,
innovative forms of information processing for enhanced insight
and decision making.” (Gartner)
Four V’s and a C
- Volume alone does not make big data big; it is about three V’s, plus a fourth:
  - High Volume, Variety, Velocity
  - High Value!
- In addition, the data is very complex in nature (the C) and often unstructured:
  - Text documents, emails, images and videos, etc.
  - Click stream data, social media feed data, etc.
Innovative Forms of Information Processing
- Traditional methods don’t suffice anymore.
- New forms of information processing have emerged.
[Diagram labels: distribute data storage and computation; NoSQL data stores]
The R Project for Statistical Computing
- R is a dialect of the S language
- S was developed by John Chambers and others at Bell Labs
- S was initiated in 1976
- Now owned by TIBCO and sold under the name S-PLUS
Interactive, not programming → gradually moving into → programming, when system aspects become important
Advantages of R
- Most widely used data analysis software
  - Created and used by 2M+ data scientists, statisticians and analysts
- Most powerful statistical programming language
  - Flexible, extensible & comprehensive for productivity, 4,800+ packages
- Create beautiful and unique data visualizations
  - As seen in New York Times, Twitter and Flowing Data
- Thriving open-source community
  - Leading edge of analytics research
- Fills the talent gap
  - New graduates prefer R
Drawbacks of R
- Steep learning curve → Vibrant community to help you
- Objects must be stored in physical memory, little thought to memory management → Recent advancements to deal with this
- Functionality is based on consumer demand and user contributions → If a package is useful to many people, it will quickly evolve into a robust product
- Documentation is sometimes patchy and terse, and impenetrable to the non-statistician → Vibrant community to help you
Exploding Growth and Demand for R
- R is the highest paid IT skill (Dice.com, Jan 2014)
- R is the most-used data science language after SQL (O’Reilly, Jan 2014)
- R is used by 70% of data miners (Rexer, Sep 2013)
- R is #15 of all programming languages (RedMonk, Jan 2014)
- R is growing faster than any other data science language (KDnuggets, Aug 2013)
- More than 2 million users worldwide
Great Adoption of R by Many Companies
- Commercial vendors offer general support and develop specific R-based products, e.g. Oracle, Revolution Analytics.
- Companies use R for advanced statistics and analytics, e.g. Thomas Cook, Google, Twitter.
- In the AE customer base, too, we see companies looking into R as an alternative or complement to the traditional tools.
Example Packages
- twitteR: provides an interface to the Twitter web API.
- tm: provides text mining functionality such as word stemming, stopword removal, etc.
- wordcloud: provides methods for producing word clouds in different forms, shapes and colors.
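To make the text-mining steps concrete, the sketch below removes stopwords and counts word frequencies, which is the kind of input a word cloud is sized by. It is plain Python rather than the R packages named above; the stopword list, the sample texts and the function name `word_frequencies` are hypothetical.

```python
import re
from collections import Counter

# Tiny illustrative stopword list; real text-mining tools ship much longer ones.
STOPWORDS = {"the", "a", "an", "and", "of", "for", "is", "to", "in"}


def word_frequencies(texts):
    """Lowercase each text, split it into words, drop stopwords and count the rest."""
    counts = Counter()
    for text in texts:
        tokens = re.findall(r"[a-z']+", text.lower())
        counts.update(token for token in tokens if token not in STOPWORDS)
    return counts


texts = [
    "R and Hadoop: the perfect marriage for analytics?",
    "Analytics in R is fun",
]
freqs = word_frequencies(texts)
print(freqs.most_common(3))
```

Stemming is left out here to keep the sketch short; it would be an extra normalisation step applied to each token before counting.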
 33
Apache Hadoop
- Open-source software framework.
- Storage and large-scale processing of data on clusters of commodity hardware.
- Apache top-level project built and used by a global community.
- Two core components:
  1. Hadoop Distributed File System (HDFS)
  2. MapReduce
Apache Hadoop
- MapReduce and HDFS are based on Google's MapReduce and Google File System.
- Other components are:
  - Hadoop Common – libraries and utilities needed by other Hadoop modules
  - Hadoop YARN – a resource-management platform
- The entire Apache Hadoop “platform” is now commonly considered to include a number of related projects as well: Pig, Hive, HBase,…
- Created by Doug Cutting and Mike Cafarella at Yahoo in 2005, originally to support distribution for the Apache Nutch search engine project.
All the modules in Hadoop are designed with a fundamental
assumption that hardware failures (of individual machines, or
racks of machines) are common and thus should be
automatically handled in software by the framework.
The World of Hadoop
Key Properties of Apache Hadoop
- Transforms commodity hardware into a service that:
  - Stores petabytes of data reliably.
  - Allows huge distributed computations.
- Key properties:
  - Designed for batch processing.
  - Write-once-read-many access model for files.
  - Extremely powerful.
  - Scalability:
    - Scales linearly with cores and disks.
    - Machines can be added and removed from the cluster.
    - Write code once, same program runs on 1, 1000, 4000 machines.
  - Reliable and fault-tolerant:
    - Failed tasks/data transfers are automatically retried.
    - Data replication, redundancy.
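The automatic retry of failed tasks can be illustrated with a small scheduler loop. This is a plain-Python simulation of the idea, not Hadoop's implementation; `run_with_retries`, `flaky` and the `max_attempts` default are hypothetical names and values.

```python
def run_with_retries(tasks, max_attempts=4):
    """Run each task; a task that raises is run again, up to max_attempts times."""
    results = {}
    attempts = {}
    for task_id, task in tasks.items():
        for attempt in range(1, max_attempts + 1):
            attempts[task_id] = attempt
            try:
                results[task_id] = task()
                break  # task succeeded, move on to the next one
            except RuntimeError:
                continue  # simulated failure: try the same task again
        else:
            # The loop finished without a successful attempt.
            raise RuntimeError(f"task {task_id} failed {max_attempts} times")
    return results, attempts


def flaky(fail_times, value):
    """Build a task that fails `fail_times` times before returning `value`."""
    state = {"calls": 0}

    def task():
        state["calls"] += 1
        if state["calls"] <= fail_times:
            raise RuntimeError("simulated node failure")
        return value

    return task


tasks = {"split-0": flaky(0, 10), "split-1": flaky(2, 20)}
results, attempts = run_with_retries(tasks)
print(results, attempts)
```

The caller only sees the final results; the two failed attempts of the second task are absorbed by the framework-side loop, which is the point of the slide.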
 37
A Typical Hadoop Cluster
[Diagram: a client, a master node (Name Node, Job Tracker) and slave nodes spread over Rack 1, Rack 2 and Rack 3; each slave node runs a Data Node and a Task Tracker with Map/Reduce tasks. Arrow labels: job assignment, task assignment, data assignment to nodes, data read, data write, metadata for block info.]
1. Client
2. Master Node
   - Name Node
   - Job Tracker
3. Slave Nodes
   - Data Nodes
   - Task Trackers
   - Map / Reduce
Hadoop File System (HDFS)
1. Client consults Name Node
2. Client writes block to Data Node
3. Data Node replicates block
4. Cycle repeats for next blocks
[Diagram: a client writes a file via the Name Node to Data Nodes 1 to 9, spread over Rack 1, Rack 2 and Rack 3. Arrow labels: data assignment to nodes, data read, data write, metadata for block info.]
Rack 1: Data Node 1, Data Node 2, …
Rack 2: Data Node 3, …
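The four-step write cycle can be sketched as a small simulation. This is not HDFS code: the rack layout, the block size, the three-replica pipeline and the placement rule (one replica on one rack, two on a neighbouring rack) are simplifying assumptions, and `NameNode`, `write_file` and `read_file` are hypothetical names.

```python
# Hypothetical cluster layout: three racks with three data nodes each.
RACKS = {
    "rack1": ["dn1", "dn2", "dn3"],
    "rack2": ["dn4", "dn5", "dn6"],
    "rack3": ["dn7", "dn8", "dn9"],
}


class NameNode:
    """Holds metadata only: which data nodes store a replica of which block."""

    def __init__(self, racks):
        self.racks = racks
        self.block_map = {}  # block id -> list of data nodes holding a replica

    def allocate(self, block_id, index):
        """Pick three distinct data nodes on two racks for a new block."""
        rack_names = sorted(self.racks)
        local = self.racks[rack_names[index % len(rack_names)]]
        remote = self.racks[rack_names[(index + 1) % len(rack_names)]]
        pipeline = [
            local[index % len(local)],
            remote[index % len(remote)],
            remote[(index + 1) % len(remote)],
        ]
        self.block_map[block_id] = pipeline
        return pipeline


def write_file(name_node, storage, data, block_size):
    """Split data into blocks and store each block on its replica pipeline."""
    for index, offset in enumerate(range(0, len(data), block_size)):
        block_id = f"blk_{index}"
        block = data[offset:offset + block_size]
        pipeline = name_node.allocate(block_id, index)  # 1. client consults Name Node
        storage[pipeline[0]][block_id] = block          # 2. client writes block to Data Node
        for source, target in zip(pipeline, pipeline[1:]):
            storage[target][block_id] = storage[source][block_id]  # 3. Data Node replicates
        # 4. the loop repeats for the next block


def read_file(name_node, storage):
    """Reassemble the file from the first replica of every block."""
    block_ids = sorted(name_node.block_map, key=lambda b: int(b.split("_")[1]))
    return b"".join(storage[name_node.block_map[b][0]][b] for b in block_ids)


name_node = NameNode(RACKS)
storage = {node: {} for nodes in RACKS.values() for node in nodes}
data = b"abcdefghij"
write_file(name_node, storage, data, block_size=4)
print(name_node.block_map)
```

Note that the Name Node in this sketch never touches the block contents; it only records where replicas live, which mirrors the "metadata for block info" arrow in the diagram.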
 39
MapReduce
Stages: Input → Splitting → Map → Shuffle/Sort → Reduce → Output
Map output, per input split:
- (the, 1) (quick, 1) (brown, 1) (fox, 1)
- (the, 1) (fox, 1) (ate, 1) (the, 1) (mouse, 1)
- (how, 1) (now, 1) (brown, 1) (cow, 1)
After shuffle/sort, grouped by key:
- (the, 1) (the, 1) (the, 1)
- (fox, 1) (fox, 1)
- (quick, 1)
- (brown, 1) (brown, 1)
- (ate, 1)
- (mouse, 1)
- (how, 1)
- (now, 1)
- (cow, 1)
Reduce and output:
the, 3; fox, 2; quick, 1; brown, 2; ate, 1; mouse, 1; how, 1; now, 1; cow, 1
The Map function processes one line at a time, splits it into tokens separated by whitespace, and emits a key-value pair <word, 1> for each token.
The Reduce function simply sums the values, which are the occurrence counts for each key (i.e. each word in this example).
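The map, shuffle and reduce steps described above can be simulated in a single process. This is a plain-Python sketch, not Hadoop or RHadoop code; the three input lines are an assumption reconstructed from the key-value pairs on the slide, and the function names are hypothetical.

```python
from collections import defaultdict


def map_line(line):
    """Map: split a line on whitespace and emit a (word, 1) pair per token."""
    return [(word, 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle/sort: collect all emitted values under their key, sorted by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(sorted(groups.items()))


def reduce_counts(key, values):
    """Reduce: sum the occurrence counts for one key."""
    return key, sum(values)


# Assumed input, inferred from the map output shown on the slide.
lines = ["the quick brown fox", "the fox ate the mouse", "how now brown cow"]

mapped = [pair for line in lines for pair in map_line(line)]
grouped = shuffle(mapped)
counts = dict(reduce_counts(key, values) for key, values in grouped.items())
print(counts)
```

On a cluster the map calls would run in parallel on different input splits and the reduce calls on different keys; only the shuffle needs to move data between machines.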
 40
Hadoop Distributions
- Fully equipped, scalable and flexible cloud solutions.
- Various on-premise solutions are offered as well.
- The choice depends on specific requirements:
  - Data Privacy, Scalability, Security, Data Mastership, Configuration, Flexibility, Price-Performance Ratio, Automation,…
- How to get started?
  - Free to download!
  - The business model is based on training, consulting, support and additional “tooling” (Enterprise Editions).
  - Many free trial cloud versions are available to play around with.
  - Many tutorials, trainings, blogs, user groups, etc.
RHadoop
- A collection of four R packages that allow users to manage and analyze data with Hadoop:
  - rmr: Hadoop MapReduce functionality in R
  - rhdfs: file management of the HDFS from within R
  - rhbase: database management for the HBase distributed database
  - plyrmr: a recently released package providing a familiar interface while hiding many of the MapReduce details (like Hive, Pig and Mahout)
- R and all RHadoop packages should be installed on all nodes in the Hadoop cluster.
Combining the advantages of R with the power of Hadoop.
MapReduce Wordcount Example in R
[Code listing not preserved; its callouts were:]
- Map function.
- Reduce function.
- Reading the input from HDFS: from.dfs().
- Writing the results back to HDFS: to.dfs().
Case
- Under NDA.
Conclusions
- The Digital Age brings many opportunities but also challenges.
- Big Data and Analytics can address the challenges and realize the opportunities.
- It is within anyone’s grasp; do it incrementally and iteratively.
- R and Hadoop:
  - Open source software, active user groups and support.
  - A great way to start exploring!
  - Combined power gives you the advantage of 1 + 1 = 3.
  - Sometimes alternatives are better.
Conclusions
- You don’t always need Big Data to do Analytics; it depends on the requirements.
- Hadoop cloud solutions are scalable, flexible and cost-efficient, but sometimes limited in functionality (or not standardized).
- There are many differences between Hadoop distributions, which are constantly evolving (and getting better).
- Good Data Scientists are needed in a team of mixed competences to make the right choices.
What’s next?
- Ask yourself the following questions:
  - What opportunities do I see for myself?
  - What strategic and competitive advantages can I realize?
  - Is Analytics the right solution for me? Do I need Big Data?
  - What about my Data Warehouse environment?
  - And what about the quality of my operational data?
  - Do I have the right infrastructure in place?
  - Do I have the right competences in house?
- Now you should know what’s in it for you, but also the challenges you will most probably be facing.
What’s next?
- You have a case you would like to discuss…?
- You have any questions…?
- Please feel free to contact me:
  - Bram Vanschoenwinkel
  - Bram.Vanschoenwinkel@ae.be
  - +32(0)478741738
  - @bvschoen
  - be.linkedin.com/in/bramvanschoenwinkel/
23 April 2014: R and Hadoop - The perfect marriage for your analytics?
18 June 2014: From Private Cloud to Hybrid Cloud
1 October 2014: Digital Enterprise Architecture
26 November 2014: Multi-device front-end engineering
Thank you!
@bvschoen / @ae_nv
www.ae.be

More Related Content

More from AE - architects for business and ict

AE Foyer: Soa Integration Architecture and Api Management
AE Foyer: Soa Integration Architecture and Api ManagementAE Foyer: Soa Integration Architecture and Api Management
AE Foyer: Soa Integration Architecture and Api Management
AE - architects for business and ict
 
Building the digital enterprise for the age of the customer (part 2)
Building the digital enterprise for the age of the customer (part 2)Building the digital enterprise for the age of the customer (part 2)
Building the digital enterprise for the age of the customer (part 2)
AE - architects for business and ict
 

More from AE - architects for business and ict (13)

AE Foyer - Value Driven Transformation
AE Foyer - Value Driven TransformationAE Foyer - Value Driven Transformation
AE Foyer - Value Driven Transformation
 
AE Foyer: Soa Integration Architecture and Api Management
AE Foyer: Soa Integration Architecture and Api ManagementAE Foyer: Soa Integration Architecture and Api Management
AE Foyer: Soa Integration Architecture and Api Management
 
AE Foyer: Information Management in the Digital Enterprise
AE Foyer: Information Management in the Digital EnterpriseAE Foyer: Information Management in the Digital Enterprise
AE Foyer: Information Management in the Digital Enterprise
 
AE Foyer: Embrace your customer get digital (handouts 18052015)
AE Foyer: Embrace your customer get digital (handouts 18052015)AE Foyer: Embrace your customer get digital (handouts 18052015)
AE Foyer: Embrace your customer get digital (handouts 18052015)
 
Trends in front end engineering_handouts
Trends in front end engineering_handoutsTrends in front end engineering_handouts
Trends in front end engineering_handouts
 
Embrace your customer, get digital!
Embrace your customer, get digital!Embrace your customer, get digital!
Embrace your customer, get digital!
 
Building the digital enterprise for the age of the customer (part 2)
Building the digital enterprise for the age of the customer (part 2)Building the digital enterprise for the age of the customer (part 2)
Building the digital enterprise for the age of the customer (part 2)
 
Building the digital enterprise for the age of the customer handouts
Building the digital enterprise for the age of the customer   handoutsBuilding the digital enterprise for the age of the customer   handouts
Building the digital enterprise for the age of the customer handouts
 
AE foyer: From Server Virtualization to Hybrid Cloud
AE foyer: From Server Virtualization to Hybrid CloudAE foyer: From Server Virtualization to Hybrid Cloud
AE foyer: From Server Virtualization to Hybrid Cloud
 
AE foyer on Mobile by Design 19/02/2014
AE foyer on Mobile by Design 19/02/2014AE foyer on Mobile by Design 19/02/2014
AE foyer on Mobile by Design 19/02/2014
 
AE Spot'On - Chris Potts - Enterprise investment: Combining EA and Investment...
AE Spot'On - Chris Potts - Enterprise investment: Combining EA and Investment...AE Spot'On - Chris Potts - Enterprise investment: Combining EA and Investment...
AE Spot'On - Chris Potts - Enterprise investment: Combining EA and Investment...
 
Process Mining in Package Delivery (Logistics) - AE nv
Process Mining in Package Delivery (Logistics) - AE nvProcess Mining in Package Delivery (Logistics) - AE nv
Process Mining in Package Delivery (Logistics) - AE nv
 
AngularJS in large applications - AE NV
AngularJS in large applications - AE NVAngularJS in large applications - AE NV
AngularJS in large applications - AE NV
 

Recently uploaded

valsad Escorts Service ☎️ 6378878445 ( Sakshi Sinha ) High Profile Call Girls...
valsad Escorts Service ☎️ 6378878445 ( Sakshi Sinha ) High Profile Call Girls...valsad Escorts Service ☎️ 6378878445 ( Sakshi Sinha ) High Profile Call Girls...
valsad Escorts Service ☎️ 6378878445 ( Sakshi Sinha ) High Profile Call Girls...
Call Girls In Delhi Whatsup 9873940964 Enjoy Unlimited Pleasure
 
VIP Call Girls Himatnagar 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Himatnagar 7001035870 Whatsapp Number, 24/07 BookingVIP Call Girls Himatnagar 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Himatnagar 7001035870 Whatsapp Number, 24/07 Booking
dharasingh5698
 
VIP Call Girls Pollachi 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Pollachi 7001035870 Whatsapp Number, 24/07 BookingVIP Call Girls Pollachi 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Pollachi 7001035870 Whatsapp Number, 24/07 Booking
dharasingh5698
 
6.High Profile Call Girls In Punjab +919053900678 Punjab Call GirlHigh Profil...
6.High Profile Call Girls In Punjab +919053900678 Punjab Call GirlHigh Profil...6.High Profile Call Girls In Punjab +919053900678 Punjab Call GirlHigh Profil...
6.High Profile Call Girls In Punjab +919053900678 Punjab Call GirlHigh Profil...
@Chandigarh #call #Girls 9053900678 @Call #Girls in @Punjab 9053900678
 

Recently uploaded (20)

All Time Service Available Call Girls Mg Road 👌 ⏭️ 6378878445
All Time Service Available Call Girls Mg Road 👌 ⏭️ 6378878445All Time Service Available Call Girls Mg Road 👌 ⏭️ 6378878445
All Time Service Available Call Girls Mg Road 👌 ⏭️ 6378878445
 
𓀤Call On 7877925207 𓀤 Ahmedguda Call Girls Hot Model With Sexy Bhabi Ready Fo...
𓀤Call On 7877925207 𓀤 Ahmedguda Call Girls Hot Model With Sexy Bhabi Ready Fo...𓀤Call On 7877925207 𓀤 Ahmedguda Call Girls Hot Model With Sexy Bhabi Ready Fo...
𓀤Call On 7877925207 𓀤 Ahmedguda Call Girls Hot Model With Sexy Bhabi Ready Fo...
 
VVIP Pune Call Girls Mohammadwadi WhatSapp Number 8005736733 With Elite Staff...
VVIP Pune Call Girls Mohammadwadi WhatSapp Number 8005736733 With Elite Staff...VVIP Pune Call Girls Mohammadwadi WhatSapp Number 8005736733 With Elite Staff...
VVIP Pune Call Girls Mohammadwadi WhatSapp Number 8005736733 With Elite Staff...
 
Sarola * Female Escorts Service in Pune | 8005736733 Independent Escorts & Da...
Sarola * Female Escorts Service in Pune | 8005736733 Independent Escorts & Da...Sarola * Female Escorts Service in Pune | 8005736733 Independent Escorts & Da...
Sarola * Female Escorts Service in Pune | 8005736733 Independent Escorts & Da...
 
Call Girls Sangvi Call Me 7737669865 Budget Friendly No Advance BookingCall G...
Call Girls Sangvi Call Me 7737669865 Budget Friendly No Advance BookingCall G...Call Girls Sangvi Call Me 7737669865 Budget Friendly No Advance BookingCall G...
Call Girls Sangvi Call Me 7737669865 Budget Friendly No Advance BookingCall G...
 
Katraj ( Call Girls ) Pune 6297143586 Hot Model With Sexy Bhabi Ready For S...
Katraj ( Call Girls ) Pune  6297143586  Hot Model With Sexy Bhabi Ready For S...Katraj ( Call Girls ) Pune  6297143586  Hot Model With Sexy Bhabi Ready For S...
Katraj ( Call Girls ) Pune 6297143586 Hot Model With Sexy Bhabi Ready For S...
 
valsad Escorts Service ☎️ 6378878445 ( Sakshi Sinha ) High Profile Call Girls...
valsad Escorts Service ☎️ 6378878445 ( Sakshi Sinha ) High Profile Call Girls...valsad Escorts Service ☎️ 6378878445 ( Sakshi Sinha ) High Profile Call Girls...
valsad Escorts Service ☎️ 6378878445 ( Sakshi Sinha ) High Profile Call Girls...
 
Russian Call Girls Pune (Adult Only) 8005736733 Escort Service 24x7 Cash Pay...
Russian Call Girls Pune  (Adult Only) 8005736733 Escort Service 24x7 Cash Pay...Russian Call Girls Pune  (Adult Only) 8005736733 Escort Service 24x7 Cash Pay...
Russian Call Girls Pune (Adult Only) 8005736733 Escort Service 24x7 Cash Pay...
 
Nanded City ( Call Girls ) Pune 6297143586 Hot Model With Sexy Bhabi Ready ...
Nanded City ( Call Girls ) Pune  6297143586  Hot Model With Sexy Bhabi Ready ...Nanded City ( Call Girls ) Pune  6297143586  Hot Model With Sexy Bhabi Ready ...
Nanded City ( Call Girls ) Pune 6297143586 Hot Model With Sexy Bhabi Ready ...
 
Enjoy Night⚡Call Girls Dlf City Phase 3 Gurgaon >༒8448380779 Escort Service
Enjoy Night⚡Call Girls Dlf City Phase 3 Gurgaon >༒8448380779 Escort ServiceEnjoy Night⚡Call Girls Dlf City Phase 3 Gurgaon >༒8448380779 Escort Service
Enjoy Night⚡Call Girls Dlf City Phase 3 Gurgaon >༒8448380779 Escort Service
 
Real Men Wear Diapers T Shirts sweatshirt
Real Men Wear Diapers T Shirts sweatshirtReal Men Wear Diapers T Shirts sweatshirt
Real Men Wear Diapers T Shirts sweatshirt
 
Russian Call girl in Ajman +971563133746 Ajman Call girl Service
Russian Call girl in Ajman +971563133746 Ajman Call girl ServiceRussian Call girl in Ajman +971563133746 Ajman Call girl Service
Russian Call girl in Ajman +971563133746 Ajman Call girl Service
 
Call Now ☎ 8264348440 !! Call Girls in Rani Bagh Escort Service Delhi N.C.R.
Call Now ☎ 8264348440 !! Call Girls in Rani Bagh Escort Service Delhi N.C.R.Call Now ☎ 8264348440 !! Call Girls in Rani Bagh Escort Service Delhi N.C.R.
Call Now ☎ 8264348440 !! Call Girls in Rani Bagh Escort Service Delhi N.C.R.
 
Hire↠Young Call Girls in Tilak nagar (Delhi) ☎️ 9205541914 ☎️ Independent Esc...
Hire↠Young Call Girls in Tilak nagar (Delhi) ☎️ 9205541914 ☎️ Independent Esc...Hire↠Young Call Girls in Tilak nagar (Delhi) ☎️ 9205541914 ☎️ Independent Esc...
Hire↠Young Call Girls in Tilak nagar (Delhi) ☎️ 9205541914 ☎️ Independent Esc...
 
VIP Call Girls Himatnagar 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Himatnagar 7001035870 Whatsapp Number, 24/07 BookingVIP Call Girls Himatnagar 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Himatnagar 7001035870 Whatsapp Number, 24/07 Booking
 
VIP Call Girls Pollachi 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Pollachi 7001035870 Whatsapp Number, 24/07 BookingVIP Call Girls Pollachi 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Pollachi 7001035870 Whatsapp Number, 24/07 Booking
 
6.High Profile Call Girls In Punjab +919053900678 Punjab Call GirlHigh Profil...
6.High Profile Call Girls In Punjab +919053900678 Punjab Call GirlHigh Profil...6.High Profile Call Girls In Punjab +919053900678 Punjab Call GirlHigh Profil...
6.High Profile Call Girls In Punjab +919053900678 Punjab Call GirlHigh Profil...
 
VVVIP Call Girls In Connaught Place ➡️ Delhi ➡️ 9999965857 🚀 No Advance 24HRS...
VVVIP Call Girls In Connaught Place ➡️ Delhi ➡️ 9999965857 🚀 No Advance 24HRS...VVVIP Call Girls In Connaught Place ➡️ Delhi ➡️ 9999965857 🚀 No Advance 24HRS...
VVVIP Call Girls In Connaught Place ➡️ Delhi ➡️ 9999965857 🚀 No Advance 24HRS...
 
Trump Diapers Over Dems t shirts Sweatshirt
Trump Diapers Over Dems t shirts SweatshirtTrump Diapers Over Dems t shirts Sweatshirt
Trump Diapers Over Dems t shirts Sweatshirt
 
Busty Desi⚡Call Girls in Vasundhara Ghaziabad >༒8448380779 Escort Service
Busty Desi⚡Call Girls in Vasundhara Ghaziabad >༒8448380779 Escort ServiceBusty Desi⚡Call Girls in Vasundhara Ghaziabad >༒8448380779 Escort Service
Busty Desi⚡Call Girls in Vasundhara Ghaziabad >༒8448380779 Escort Service
 

AE foyer: R and Hadoop, the perfect marriage for your analytics?

  • 1. ae nv/sa Interleuvenlaan 27b, B-3001 Heverlee T +32 16 39 30 60 - F +32 16 39 30 70 www.ae.be Bram Vanschoenwinkel Principal Consultant BI & Analytics @bvschoen R & Hadoop The perfect marriage for your analytics?
  • 2. ae nv/sa Interleuvenlaan 27b, B-3001 Heverlee T +32 16 39 30 60 - F +32 16 39 30 70 www.ae.be WELCOME R & Hadoop The perfect marriage for your analytics? By Michael Degrez Sales Director - AE
  • 3. ae nv/sa Interleuvenlaan 27b, B-3001 Heverlee T +32 16 39 30 60 - F +32 16 39 30 70 www.ae.be 19/02 Mobilebydesign Howtodesign,build,runformobilefirst 23/04 R&Hadoop Theperfectmarriageforyouranalytics? 18/06 Fromprivatecloudtohybridcloud Howtobenefitfromasuccessfulimplementation 01/10 Prepareforthedigitalenterprise Businessdrivenenterprisearchitecture 26/11 Multi-devicefront-endengineering Howbusinessesbenefitfromapplyingthistechnicalskill
  • 4. ae nv/sa Interleuvenlaan 27b, B-3001 Heverlee T +32 16 39 30 60 - F +32 16 39 30 70 www.ae.be Bram Vanschoenwinkel Principal Consultant BI & Analytics @bvschoen R & Hadoop The perfect marriage for your analytics?
  • 5.  7 Agenda 1. It’s a ( R )evolution 2. Intelligent Decision Support in the Digital Age 3. The R Project for Statistical Computing 4. The World of Hadoop 5. Case: A Customer Intelligence Platform 6. Conclusions
  • 6.  8 It’s a (R)evolution 2000 2010 2015 DATA VOLUME TIME MAJORITY UNSTRUCTUREDDATA
  • 7.  9 Abundance of Data BEYOND WEB CRM ERP PURCHASE DETAIL PRODUCTION PAYMENT DETAIL PLANNING CONTACT INFORMATION LEADS OFFERS SEGMENTATION PROSPECTS CLICK STREAM DATA WEB SHOPS SOCIAL MEDIA VIDEO IMAGES TEXT ONLINE SERVICES AUDIO OPEN DATA MOBILE DEVICES INTERNET OF THINGS RFID GPS SENSORS USER GENERATED CONTENT SMART DEVICES SENSORS REMOTE MONITORING CLOUD MEDICAL WARABLES
  • 9.  11 SHORT LIFESPAN OF THE DATA FASTMOVINGDATA FASTDATAPROCESSING HIGH VARIETY OF DATA Challenges
  • 10.  12 intelligent decision support in the digital age WHAT WE SEE ABUNDANCE OF HETEROGENOUS DATA THE WAY WE INTERACT WITH THE WORLD HAS CHANGED OPPORTUNITIES OPERATIONAL EXCELLENCE BETTER DECISION SUPPORT CHALLENGES ANALYSIS GAP VOLUME, VARIETY, VELOCITY INNOVATING BUSINESS MODELS COMPETENCES
  • 11.  13 Decision Support in the Digital Age Facing the Challenges and realizing the Opportunities Business Analytics Big Data
  • 12.  14 Elements of a Holistic Information Management Framework - Data Sources - Internal & External - From Data to Information - Improving data quality - Integrality of data - From Information to Knowledge Intelligent Decision Support: - Reporting - Business Analytics - From Knowledge to Intelligence DATAInformation Knowledge Intelligence Wisdom/Insight
  • 13.  15 Decision Support in the Digital Age “Business Analytics is the nontrivial extraction of implicit, previously unknown, and potentially useful information from data.”
  • 14.  16 Business Analytics vs Business Intelligence
  • 15.  17 New Insights 8 stoppen 132 stoppen 10 stoppen 53 stoppen 64 stoppen 14 stoppen 4 stoppen 11 stoppen
  • 16.  18 Innovating Business Models Front-end Application(s) Security Analytics (on Hadoop) Web Click StreamingSocial Media Connectivity External Application Integration Operational Data Processing on Hadoop
  • 17.  19 From Analytics… Statistics Algorithms Biology Psychology Databases
  • 19.  21 Analytics Approach  Analytics  Incremental and iterative  Think big act small  Proof-of-Concept  Open source tools  Architecture & Deployment  (Non-)funtional requirements  Information Architecture  Technology  Embedded into operations Two Phase Approach Analytics Architecture Deployment
  • 20.  22 Analytics Churn Prediction Example Invoicing CRM Call Center Application John Doe – 43years – Antwerp – Man – 7calls – 3weeks – 30%down invoicing Jane Dan – 32years – Brussels – Woman – 2calls – 12weeks – 10%up invoicing … Operations CHURN SCORES REGION PRODUCT CHURN SCORES MANAGEMENT DASHBOARD OPERATIONS DATA DUMP Analytics Engine Data Warehouse
  • 21.  23 Big Data “Big data is high-volume, high-velocity, high-complexity and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making.” (Gartner)
  • 22.  24 Four V’s and a C  Not only volume makes big data big, it’s all about the three V’s:  High Volume, Variety, Velocity  High Value!  In addition the data is very complex in nature, often unstructured:  Text documents, emails, images and videos, etc.  Click stream data, social media feed data, etc.
  • 23.  25 Innovative Forms of Information Processing  Traditional methods don’t suffice anymore.  New forms of information processing have emerged. DISTRIBUTE DATA STORAGE COMPUTATION NoSQL DATA STORES
  • 24.  26 Innovative Forms of Information Processing
  • 25.  27 The R Project for Statistical Computing  R is a dialect of the S language  S was developed by John Chambers and others at Bell Labs  S was initiated in 1976  Now owned by TIBCO and sold under the name S-PLUS INTERACTIVE NOT PROGRAMMING PROGRAMMING WHEN SYSTEM ASPECTS BECOME IMPORTANT GRADUALLY MOVING INTO
  • 26.  28 Advantages of R  Most widely used data analysis software  Created and used by 2M+ data scientists, statisticians and analysts  Most powerful statistical programming language  Flexible, extensible & comprehensive for productivity, +4800 packages  Create beautiful and unique data visualizations  As seen in New York Times, Twitter and Flowing Data  Thriving open-source community  Leading edge of analytics research  Fills the talent gap  New graduates prefer R
  • 27.  29 Drawbacks of R Steep learning curve Objects must be stored in physical memory, little thought to memory management Functionality is based on consumer demand and user contributions Documentation is sometimes patchy and terse, and impenetrable to the non-statistician Vibrant community to help you Recent advancements to deal with this If a package is useful to many people, it will quickly evolve into a robust product Vibrant community to help you
  • 28.  30 Exploding growth and Demand for R  R is the highest paid IT skill  – Dice.com, Jan 2014  R most-used data science language after SQL  – O’Reilly, Jan 2014  R is used by 70% of data miners  – Rexer, Sep 2013  R is #15 of all programming languages  – RedMonk, Jan 2014  R growing faster than any other data science language  – KDnuggets, Aug 2013  More than 2 million users worldwide
  • 29.  31 Great Adoption of R by Many Companies  Commercial vendors offering general support and developing specific R based products, e.g.: Oracle, RevolutionAnalytics.  Companies using R for advanced statistics and analytics, e.g.: Thomas Cook, Google, Twitter.  Also in the AE customer base we see different companies looking into R as an alternative or complement to the traditional tools.
  • 30.  32 Example Packages  twitteR: Provides an interface to the Twitter web API.  tm: Provides Text Mining functionalities like word stemming, stopword removal, etc.  wordcloud: Provides methods for producing wordclouds in different forms, shapes and colors.
  • 31.  33 Apache Hadoop  Open-source software framework.  Storage and large-scale processing of data on clusters of commodity hardware.  Apache top-level project built and used by a global community.  Two core components: 1. Hadoop Distributed File System (HDFS) 2. MapReduce
  • 32.  34 Apache Hadoop  MapReduce/HDFS based on Google's MapReduce and Google File System.  Other components are:  Hadoop Common – libraries and utilities needed by other Hadoop modules  Hadoop YARN – a resource-management platform  The entire Apache Hadoop “platform” is now commonly considered to consist of a number of related projects as well: Pig, Hive, Hbase,…  Created by Doug Cutting and Mike Cafarella at Yahoo in 2005 originally to support distribution for the Apache Nutch search engine project. All the modules in Hadoop are designed with a fundamental assumption that hardware failures (of individual machines, or racks of machines) are common and thus should be automatically handled in software by the framework.
  • 33.  35 The World of Hadoop
  • 34.  36 Key Properties Apache Hadoop  Transforms commodity hardware into a service that:  Stores petabytes of data reliably.  Allows huge distributed computations.  Key Properties:  Designed for batch processing.  Write-once-read-many access model for files.  Extremely powerful.  Scalability: • Scales linearly with cores and disks. • Machines can be added and removed from the cluster. • Write code once, same program runs on 1, 1000, 4000 machines.  Reliable and fault-tolerant: • Failed tasks/data transfers are automatically retried. • Data replication, redundancy.
  • 35.  37 A Typical Hadoop Cluster  [Diagram: a Client and a Master Node (Name Node and Job Tracker) in front of Slave Nodes spread over Rack 1, Rack 2 and Rack 3; each slave node runs a Data Node and a Task Tracker with Map and Reduce tasks. Arrow labels: data assignment to nodes, data read, data write, metadata for block info, job assignment, task assignment.]  1. Client 2. Master Node  Name Node  Job Tracker 3. Slave Nodes  Data Nodes  Task Trackers  Map / Reduce
  • 36.  38 Hadoop Distributed File System (HDFS)  1. Client consults Name Node 2. Client writes block to Data Node 3. Data Node replicates block 4. Cycle repeats for next blocks  [Diagram: a Client writing a FILE, a Name Node, and Data Node 1 to Data Node 9 spread over Rack 1, Rack 2 and Rack 3. Arrow labels: data assignment to nodes, data read, data write, metadata for block info. Name Node metadata shown as: Rack 1: Data Node 1 Data Node 2 … Rack 2: Data Node 3 …]
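The four write steps on this slide can be sketched as a small in-memory simulation. This is a toy model under stated assumptions, not HDFS code: the class names `DataNode` and `NameNode`, the `write_file` helper, the rotating placement rule, the block size of 4 bytes and the assignment of three data nodes per rack are all hypothetical. The replication factor of 3 matches the HDFS default. The rack is recorded on each node but not used for placement here, whereas the real Name Node is rack aware.

```python
from dataclasses import dataclass, field


@dataclass
class DataNode:
    name: str
    rack: str  # recorded only; this toy placement rule ignores racks
    blocks: dict = field(default_factory=dict)

    def store(self, block_id, data, pipeline):
        """Store the block, then forward it to the next node in the pipeline."""
        self.blocks[block_id] = data
        if pipeline:
            pipeline[0].store(block_id, data, pipeline[1:])


class NameNode:
    """Holds metadata only: which data nodes hold which block."""

    def __init__(self, data_nodes, replication=3):
        self.data_nodes = data_nodes
        self.replication = replication
        self.block_map = {}

    def allocate(self, block_id):
        # Hypothetical placement: rotate the starting node for each new block.
        start = len(self.block_map) % len(self.data_nodes)
        rotated = self.data_nodes[start:] + self.data_nodes[:start]
        targets = rotated[: self.replication]
        self.block_map[block_id] = [node.name for node in targets]
        return targets


def write_file(name_node, file_name, data, block_size):
    # Step 4: the cycle repeats for every block of the file.
    for offset in range(0, len(data), block_size):
        block_id = f"{file_name}#{offset // block_size}"
        # Step 1: the client consults the Name Node for target data nodes.
        targets = name_node.allocate(block_id)
        # Step 2: the client writes the block to the first data node.
        # Step 3: that data node replicates it down the pipeline.
        targets[0].store(block_id, data[offset : offset + block_size], targets[1:])


# Nine data nodes, three per rack (an assumed layout).
nodes = [DataNode(f"dn{i}", f"rack{(i - 1) // 3 + 1}") for i in range(1, 10)]
name_node = NameNode(nodes)
write_file(name_node, "file.txt", b"0123456789", block_size=4)
print(name_node.block_map)
```

The Name Node never sees the file content in this sketch; it only hands out targets and remembers block locations, which is the division of labour the diagram labels "metadata for block info".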
  • 37.  39 MapReduce  Stages: Input, Splitting, Map, Shuffle, Sort, Reduce, Output.  Map output: (the, 1) (quick, 1) (brown, 1) (fox, 1) (the, 1) (fox, 1) (ate, 1) (the, 1) (mouse, 1) (how, 1) (now, 1) (brown, 1) (cow, 1).  After shuffle and sort: (the, 1) (the, 1) (the, 1) (fox, 1) (fox, 1) (quick, 1) (brown, 1) (brown, 1) (ate, 1) (mouse, 1) (how, 1) (now, 1) (cow, 1).  Reduce result and final output: (the, 3) (fox, 2) (quick, 1) (brown, 2) (ate, 1) (mouse, 1) (how, 1) (now, 1) (cow, 1).  The Map function processes one line at a time, splits it into tokens separated by whitespace and emits a key-value pair <word, 1> for each token. The Reduce function just sums up the values, which are the occurrence counts for each key (i.e. the words in this example).
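The word count flow on this slide can be reproduced in a few lines of single-process Python. This is only a sketch of the map, shuffle/sort and reduce steps, not Hadoop code: the function names are hypothetical, and the three input lines are an assumption, reconstructed from the key-value pairs shown on the slide.

```python
from collections import defaultdict
from itertools import chain


def map_line(line):
    """Map: split one line on whitespace and emit a (word, 1) pair per token."""
    return [(word, 1) for word in line.split()]


def shuffle_sort(pairs):
    """Shuffle/sort: collect all values per key, with keys in sorted order."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(sorted(groups.items()))


def reduce_counts(key, values):
    """Reduce: sum the occurrence counts for one key."""
    return key, sum(values)


def word_count(lines):
    mapped = chain.from_iterable(map_line(line) for line in lines)
    grouped = shuffle_sort(mapped)
    return dict(reduce_counts(key, values) for key, values in grouped.items())


# Assumed input lines, inferred from the pairs on the slide.
lines = ["the quick brown fox", "the fox ate the mouse", "how now brown cow"]
counts = word_count(lines)
print(counts)
```

On a real cluster the map calls run in parallel on the nodes that hold the input splits, and the shuffle moves each key to the reducer responsible for it; here everything runs in one process so the data flow stays visible.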
  • 38.  40 Hadoop Distributions  Fully equipped, scalable and flexible cloud solutions.  Various on-premise solutions are offered as well.  The choice depends on specific requirements:  Data Privacy, Scalability, Security, Data Mastership, Configuration, Flexibility, Price-Performance Ratio, Automation, …  How to get started?  Free to download!  The business model is based on training, consulting, support and additional “tooling” (Enterprise Editions).  Many free trial cloud versions are available to play around with.  Many tutorials, trainings, blogs, user groups, etc.
  • 39.  41 RHadoop  A collection of four R packages that allow users to manage and analyze data with Hadoop:  rmr: Hadoop MapReduce functionality in R  rhdfs: file management of the HDFS from within R  rhbase: database management for the HBase distributed database  plyrmr: a recently released package that provides a familiar interface while hiding many of the MapReduce details (like Hive, Pig and Mahout).  R and all RHadoop packages should be installed on all nodes in the Hadoop cluster.  Combining the advantages of R with the power of Hadoop.
  • 40.  42 MapReduce Wordcount Example in R  [Code screenshot with four annotated parts: the Map function; the Reduce function; reading the input from HDFS with from.dfs(); writing the results back to HDFS with to.dfs().]
  • 42.  44 Conclusions  The Digital Age brings many opportunities but also challenges.  Big Data and Analytics can help you meet the challenges and realize the opportunities.  It is within anyone’s grasp: do it incrementally and iteratively.  R and Hadoop:  Open source software, active user groups and support.  A great way to start exploring!  Combined power gives you the advantage of 1 + 1 = 3.  Sometimes alternatives are better.
  • 43.  45 Conclusions  You don’t always need Big Data to do Analytics; it depends on the requirements.  Hadoop cloud solutions are scalable, flexible and cost-efficient, but sometimes limited in functionality (or not standardized).  There are many differences between Hadoop distributions, and they are constantly evolving (and getting better).  You need good Data Scientists in a team with mixed competences to make the right choices.
  • 44.  46 What’s next?  Ask yourself the following questions:  What opportunities do I see for myself?  What strategic and competitive advantages can I realize?  Is Analytics the right solution for me? Do I need Big Data?  What about my Data Warehouse environment?  And what about the quality of my operational data?  Do I have the right infrastructure in place?  Do I have the right competences in house?  Now you should know what’s in it for you, but also the challenges you will most probably be facing.
  • 45.  47 What’s next?  You have a case you would like to discuss…?  You have any questions…?  Please feel free to contact me:  Bram Vanschoenwinkel  Bram.Vanschoenwinkel@ae.be  +32(0)478741738 @bvschoen be.linkedin.com/in/bramvanschoenwinkel/
  • 46.  48 23 April 2014 R and Hadoop - The perfect marriage for your analytics? 18 June 2014 From Private Cloud to Hybrid Cloud 1 October 2014 Digital Enterprise Architecture 26 November 2014 Multi-device front-end engineering ? Thank you!

Editor's Notes

  1. Steamrolling the audience with information: an educational training session, which we do make concrete with examples and cases.
  2. Resources = people with the right competences → analysis gap.
  3. The way we interact with the world has changed. Web: social media, web shops, online services, … Beyond: mobile, devices, sensors, …
  4. Introduction of the 3 V’s: Velocity, Variety, Volume. The way we interact with the world has changed: social media, mobile, devices, …
  5. General platform for different sectors; the example here comes from the energy sector. So it is a product for different sectors, for customer profiling and churn prediction. Integration with social media: Twitter, Facebook, YouTube (to bring in profile data such as name, place of residence, favourite pages, tweets, and so on). Click streaming: which things on the website is the user interested in (green vs. grey products, passive house, and so on).
  6. NoSQL is a broad range of database management systems that differ significantly from the classic relational database management system. These data systems do not always require fixed database schemas, so they usually avoid heavy JOIN operations and also scale better for large amounts of data. Non-relational, distributed, open-source & horizontally scalable (across multiple machines). Document based, column based (aggregation), graph based (relationships in the data are more easily represented by a graph). MapReduce is a method for distributed computing (developed by Google). Its best-known implementation is that of Apache Hadoop (YARN, MapReduce v2).
  7. R is a nice entry-level option, but it can also offer an alternative to the “big players”.
  8. Bullet 1: this is what Apache claims; in reality, certainly for complex, professional applications, we often see high-end hardware rather than commodity hardware. Largest cluster = up to 25 petabytes and 4500 machines. BRING THE COMPUTATION TO THE DATA RATHER THAN THE DATA TO THE COMPUTATION!
  9. The Master Node is rack aware, i.e. it knows where in the network topology a node sits and uses that information to distribute files optimally (likewise for computation). For example, intra-rack communication is fast, inter-rack communication is slower.
  10. The choice between on premise and in the cloud is the most important one. Cloud solutions offer many advantages, but also have some drawbacks. Weigh them carefully. On premise is hard to set up, despite Ambari. Scaling up or down also requires configuration work: buying and configuring hardware, fitting it into the network topology in the most optimal way, configuring the master node (where does the “new” node sit, because the master node must be rack aware), … In the cloud it is easy: click & done. But not all distributions offer all modules (Mahout, HBase, R, …). HBase, for example, is a difficult one. So is R, which is not part of Apache Hadoop. Amazon Elastic MapReduce was the most flexible for us in this respect: it supports both HBase and R (at the expense of standard/automatic configuration?). There are also “Enterprise Editions” with additional modules and optimisations, such as Amazon S3 and Microsoft Blob Storage for long-term storage, because HDFS is “short” term: if you shut down the cluster, is everything gone? You can, however, start with a standard “simple” one from the cloud. HBase: Use Apache HBase when you need random, realtime read/write access to your Big Data. This project's goal is the hosting of very large tables -- billions of rows X millions of columns -- atop clusters of commodity hardware. Apache HBase is an open-source, distributed, versioned, non-relational database.
  11. DATA DUMP: MongoDB is a document data store that stores information in JSON format and also provides indexing of the content of the JSON files. MongoDB is not part of standard Hadoop, so it needs its own cluster and brings extra cost and maintenance. VALIDATED DATA REPOSITORY: Scalability: must be able to handle enormous amounts of data, without degradation in performance. RDBMS technology doesn’t suit this requirement very well, so we need to consider NoSQL technologies. Flexibility: should be able to handle a mix of structured (e.g. ERP data typically coming from relational data stores) and semi-structured data (e.g. tweets). The repository must be easily tweakable to the specific needs of the HybridCube3 platform. Within the range of NoSQL database technologies, document and wide-column storage work best in these situations. AGGREGATES REPOSITORIES: HBase is column-based and works well for aggregations.