Engineering Patterns for Implementing Data Science Models on Big Data Platforms
Hisham Arafat
Digital Transformation Lead Consultant
Solutions Architect, Technology Strategist & Researcher
Riyadh, KSA – 31 January 2017
Big Data…Practical Definition!
• Big Data is the challenge, not the solution
• Big Data technologies address that challenge
• Practically:
  • Massive Streams
  • Unstructured
  • Complex Processing
Source: http://www.visualcapitalist.com/what-happens-internet-minute-2016/
Let’s Have a Use Case…Social Marketing
Social Marketing…Looks Simple!
Ingest Social Feeds → Build Corpus Metrics → Design Text Mining Model → Deploy All to a Big Data Platform → Application for Marketing Users
What are people saying about our new brand “LemaTea”?
It’s NOT as Easy as It Looks!
Not Only Building an Appropriate Model, but Also Designing a Solution…Engineering Factors
• Interfacing with sources: REST APIs, source HTML,… (text is assumed)
• Parsing to extract: queries, Regular Expressions,…
• Crawling frequency: every 1 minute, 1 hour, on event,…
• Document structure: post, post + comments, #, Reach, Retweets,…
• Metadata: time, date, source, tags, authoritativeness,…
• Transformations: canonicalization, weights, tokenization,…
- Size: average of 2 KB / doc
- Initial load: 1.5B docs
- Frequency: every 5 minutes
- Throughput: 2 KB * 60,000 docs = 120 MB / load
- Growth per day: ~34 GB
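The ingestion figures above can be reproduced with a few lines of arithmetic. This is a minimal sketch using only the numbers from the slide; the decimal unit conversion (1 MB = 1,000 KB) is an assumption chosen because it matches the slide's rounding.

```python
# Back-of-the-envelope ingestion sizing with the slide's figures.
# Assumption: decimal units (1 MB = 1,000 KB, 1 GB = 1,000 MB).

AVG_DOC_KB = 2            # average document size
DOCS_PER_LOAD = 60_000    # documents fetched per load
LOAD_INTERVAL_MIN = 5     # one load every 5 minutes

loads_per_day = 24 * 60 // LOAD_INTERVAL_MIN
mb_per_load = AVG_DOC_KB * DOCS_PER_LOAD / 1_000
gb_per_day = mb_per_load * loads_per_day / 1_000
docs_per_day = DOCS_PER_LOAD * loads_per_day

print(f"loads per day:    {loads_per_day}")
print(f"volume per load:  {mb_per_load:.0f} MB")
print(f"daily growth:     {gb_per_day:.2f} GB")
print(f"new docs per day: {docs_per_day:,}")
```

The daily document count this produces (about 17M) is the growth figure used in the corpus sizing that follows.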
Engineering Factors
• Input format: text, encoded text,…
• Document representation: bag of words, ontology…
• Corpus structures: indexes, reverse indexes,…
• Corpus metrics: doc frequency, inverse doc frequency,…
• Preprocessing: annotation, tagging,…
• Files structure: tables, text files, files-day,…
- No. of docs: 1.5B + 17M / day
- Processing window: 60K docs per 3 min
- Processing rate: 20K docs per min
- Final doc size: 2 KB * 5 ≈ 10 KB
- Scan rate: 20K docs/min * 10 KB ≈ 200 MB/min
- Many overheads need to be added
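The scan-rate estimate above can be written as a small function so the inputs can be varied. This is a sketch only: the `expansion` factor is the slide's "* 5", and the `overhead` multiplier is a hypothetical parameter added here to stand in for the unquantified overheads the slide mentions.

```python
def scan_rate_mb_per_min(docs_per_window=60_000, window_min=3,
                         raw_doc_kb=2, expansion=5, overhead=1.0):
    """Estimate corpus scan rate in MB/min (decimal units).

    overhead is a hypothetical multiplier, not a figure from the slides.
    """
    docs_per_min = docs_per_window / window_min   # 60K docs per 3 min
    final_doc_kb = raw_doc_kb * expansion         # doc size after preprocessing
    return docs_per_min * final_doc_kb / 1_000 * overhead


base_rate = scan_rate_mb_per_min()
padded_rate = scan_rate_mb_per_min(overhead=1.5)  # illustrative 50% overhead
print(f"base scan rate:   {base_rate:.0f} MB/min")
print(f"with 1.5x margin: {padded_rate:.0f} MB/min")
```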
Engineering Factors
• Dimensionality reduction: stemming, lemmatization, noisy words…
• Type of applications: search/retrieval, sentiment analysis…
• Modeling methods: classifiers, topic modeling, relevance…
• Model efficiency: confusion matrix, precision, recall…
• Overheads: intermediate processing, pre-aggregation,…
• Files structure: tables, text files, files-day,…
- No of docs: 1.5B + 17M / day
- Search for “LemaTea sweet taste”
- No of tf to calculate ~ 1.5B * 3 ~ 4.5B
- No of idf to calculate ~ 1.5B
- Total calculations for 1 search ~ 6 B
- Consider daily growth
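To make the search estimate concrete, here is a minimal sketch of TF-IDF scoring on a toy corpus, together with the slide's calculation count. The tokenizer, the scoring formula (sum of tf * log(N/df)) and the sample documents are illustrative assumptions, not the model used in the talk.

```python
import math
from collections import Counter


def tokenize(text):
    # Naive whitespace tokenizer; a real pipeline would also stem and drop noisy words.
    return text.lower().split()


def tf_idf_scores(docs, query):
    """Score every document against the query with a plain TF-IDF sum."""
    terms = tokenize(query)
    counts_per_doc = [Counter(tokenize(d)) for d in docs]
    n_docs = len(docs)
    scores = []
    for counts in counts_per_doc:
        total = sum(counts.values()) or 1
        score = 0.0
        for term in terms:
            df = sum(1 for c in counts_per_doc if term in c)
            if df == 0:
                continue  # term never appears in the corpus
            score += (counts[term] / total) * math.log(n_docs / df)
        scores.append(score)
    return scores


def estimated_calculations(n_docs, n_terms):
    """The slide's estimate: one tf per (doc, term) pair plus ~n_docs for idf."""
    return n_docs * n_terms + n_docs


docs = [
    "LemaTea has a sweet taste",
    "I prefer coffee in the morning",
    "sweet taste but too much sugar in LemaTea",
]
scores = tf_idf_scores(docs, "LemaTea sweet taste")
best = max(range(len(docs)), key=scores.__getitem__)
print("best match:", docs[best])
print("calculations for 1.5B docs, 3 terms:", estimated_calculations(1_500_000_000, 3))
```

Even this toy version touches every document for every term, which is why the count grows with both corpus size and query length.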
Engineering Factors
• File structure: tables, text files, files-day,…
• File formats and storage: HDFS, Parquet, Avro…
• Platform technology: Hadoop/YARN, Spark, Greenplum, Flink,…
• Model deployment: Java/Scala, Mahout, MLlib, MADlib, PL/R, FlinkML…
• Data ingestion: Spring XD, Flume, Sqoop, Google Cloud Dataflow, Kafka/Streaming…
• Ingestion pattern: real-time, micro batches,…
- Overall Storage
- Processing capacity per node
- No of nodes
- Tables → Hive, HBase, Greenplum
- Individual files → Spark, Flink
- Files-day → Hadoop HDFS
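Overall storage and number of nodes can be roughed out from the earlier ingestion figures. The sketch below is storage-only and rests on assumptions that do not come from the slides: a replication factor of 3, 10 TB of usable disk per node, and a one-year horizon. Processing capacity per node would need its own estimate.

```python
import math


def nodes_for_storage(initial_docs, doc_kb, daily_growth_gb, days,
                      replication=3, usable_tb_per_node=10.0):
    """Return (total TB including replicas, node count) for a storage-only sizing.

    replication and usable_tb_per_node are hypothetical defaults.
    """
    initial_tb = initial_docs * doc_kb / 1e9     # KB -> TB (decimal)
    growth_tb = daily_growth_gb * days / 1e3     # GB -> TB
    total_tb = (initial_tb + growth_tb) * replication
    return total_tb, math.ceil(total_tb / usable_tb_per_node)


# 1.5B docs of 2 KB each, growing ~34.56 GB/day, sized for one year.
total_tb, nodes = nodes_for_storage(1_500_000_000, 2, 34.56, 365)
print(f"storage incl. replicas: {total_tb:.1f} TB -> {nodes} data nodes")
```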
Engineering Factors
• Workload: no of requests, request size,…
• Application performance: response time, concurrent requests…
• Application interfacing: REST APIs, native, messaging,…
• Application implementation: integration, model scoring,…
• Security model: application level, platform level,…
- For 3 search terms ~ 6B calculations
- For 5 search terms ~ 9B calculations
- For 10 concurrent requests ~ 75B
- Resource queuing / prioritization
- Search options like date range
- Access control model
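The workload figures above follow from the same per-search formula used earlier (one tf per document and term, plus roughly one idf pass). A minimal sketch, assuming that formula:

```python
N_DOCS = 1_500_000_000


def calculations_per_search(n_terms, n_docs=N_DOCS):
    """tf for every (doc, term) pair plus ~n_docs idf calculations."""
    return n_docs * n_terms + n_docs


three_terms = calculations_per_search(3)
five_terms = calculations_per_search(5)

# Ten concurrent requests, bounded by all-3-term and all-5-term searches.
concurrent = 10
low, high = concurrent * three_terms, concurrent * five_terms
midpoint = (low + high) // 2

print(f"3 terms: {three_terms / 1e9:.0f}B, 5 terms: {five_terms / 1e9:.0f}B")
print(f"10 concurrent: {low / 1e9:.0f}B to {high / 1e9:.0f}B, midpoint {midpoint / 1e9:.0f}B")
```

The slide's ~75B for 10 concurrent requests matches the midpoint of these two bounds, which would be consistent with a mix of 3- and 5-term searches; that reading is an interpretation, not something the slides state.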
Engineering Factors
Ongoing Process…Growing Requirements
What if?
• New sources are included
• Wider parsing criteria
• Advanced modeling: POS, Word Co-occurrence, Co-referencing, Named Entity, Relationship Extraction,…
• Better response time is needed
• More frequent ingestion
Dynamic Platform: Ingestion, Corpus Processing, Model Processing, Requests Processing
• Larger number of docs
• Increased processing requirements
• Platform expansion
• Overall architecture reconsidered
Some Building Blocks
What is a Data Science Model?
• Type & format of input data
• Data ingestion
• Transformations and feature engineering
• Modeling methods and algorithms
• Model evaluation and scoring
• Application implementation considerations
• In-Memory vs. In-Database
Key Challenges for Data Science Models
Volume → Growth → Scale out Performance
Stationary → Streams → Data Flow Engines
Batches → Real-time → Event Processing
Structured → Unstructured → Complex Formats
Insights → Responsive → Perspective / Deep Models
Traditional Data Management Systems
• Shared I/O
• Shared Processing
• Limited Scalability
• Service Bottlenecks
• High Cost Factor
[Diagram: Database Cluster with a Database Service, Shared Buffers and Data Files, all sharing I/O and Network]
Abstraction of Big Data Platforms
• Parallel Processing
• Shared Nothing
• Linear Scalability
• Distributed Services
• Lower Cost Factor
[Diagram: Master Nodes (Metadata) and Data Nodes 1, 2, 3, …, n (User data / Replicas), each with its own I/O, linked by a Network Interconnect]
In a Nutshell
Source: http://dataconomy.com/2014/06/understanding-big-data-ecosystem/
• Very huge.
• Overlaps.
• Overloading.
• You need to start with a use case to be able to get your solutions well engineered.
Engineered Systems
• Packaged: Hortonworks – Pivotal – Cloudera
• Appliances: EMC DCA – Dell DSSD – Dell VxRack
• Cloud offerings: Azure – AWS – IBM – Google Cloud
Engineering Patterns in Implementation
Lambda Architecture…Social Marketing
• Generic, scalable and fault-tolerant data processing architecture.
• Keeps a master immutable dataset while serving low latency requests.
• Aims at providing linear scalability.
Source: http://lambda-architecture.net/
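The three ideas above (an immutable master dataset, a precomputed batch view, and a low-latency view merged at query time) can be sketched in a few lines. This is a single-process illustration for the brand-mention use case; the class and method names are hypothetical, and a real deployment would spread these layers across systems such as the ones listed in this deck.

```python
from collections import Counter


class LambdaSketch:
    """Toy Lambda Architecture: batch layer, speed layer, serving layer."""

    def __init__(self):
        self.master = []             # append-only master dataset
        self.batch_view = Counter()  # recomputed from the full master dataset
        self.speed_view = Counter()  # incremental view of data since the last batch

    @staticmethod
    def _mentions(post):
        # Count lower-cased words with surrounding punctuation stripped.
        return Counter(w.strip('.,!?"').lower() for w in post.split())

    def ingest(self, post):
        """New data is appended to the master dataset and fed to the speed layer."""
        self.master.append(post)
        self.speed_view.update(self._mentions(post))

    def run_batch(self):
        """Batch layer: rebuild the view from all data, then reset the speed view."""
        view = Counter()
        for post in self.master:
            view.update(self._mentions(post))
        self.batch_view = view
        self.speed_view = Counter()

    def query(self, term):
        """Serving layer: merge the batch view with the real-time view."""
        term = term.lower()
        return self.batch_view[term] + self.speed_view[term]


pipeline = LambdaSketch()
pipeline.ingest("Loving LemaTea!")
pipeline.ingest("LemaTea is too sweet")
before_batch = pipeline.query("LemaTea")   # answered by the speed view only
pipeline.run_batch()
pipeline.ingest("Trying lematea today")
after_batch = pipeline.query("LemaTea")    # batch view plus one recent post
print(before_batch, after_batch)
```

Because the batch view is always rebuilt from the untouched master dataset, a bug in the counting logic can be fixed and the view recomputed without losing data.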
Social Marketing…Revisited
Ingest Social Feeds → Build Corpus Metrics → Design Text Mining Model → Deploy All to a Big Data Platform → Application for Marketing Users
What are people saying about our new brand “LemaTea”?
Lambda Architecture (cont.)
Source: https://speakerdeck.com/mhausenblas/lambda-architecture-with-apache-spark
Sequence Files
Apache Spark / MLlib
• In-memory distributed processing
• Scala, Python, Java and R
• Resilient Distributed Dataset (RDD)
• MLlib – Machine Learning Algorithms
• SQL and Data Frames / Pipelines
• Streaming
• Big Graph analytics
Spark Cluster Mesos HDFS/YARN
Apache Spark
• Supports different types of cluster managers and storage systems
• HDFS / YARN, Mesos, Amazon S3, Standalone, HBase, Cassandra…
• Interactive vs. Application Mode
• Memory Optimization
Source: https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-architecture.html
Apache Spark
Apache Spark MLlib
Apache Spark…The Big Picture
Source: https://www.datanami.com/2015/11/30/spark-streaming-what-is-it-and-whos-using-it/
Greenplum / MADlib
• Massively Parallel Processing
• Shared Nothing
• Table distribution
  • By Key
  • By Round Robin
• Massively Parallel Data Loading
• Integration with Hadoop
• Native MapReduce
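The two table distribution options can be illustrated with a small simulation of how rows land on segments. This is a conceptual sketch only: `zlib.crc32` stands in for whatever hash function the database actually uses, and the row layout and function names are hypothetical.

```python
import zlib
from itertools import cycle


def distribute_by_key(rows, key, n_segments):
    """Hash distribution: rows with the same key value land on the same segment."""
    segments = [[] for _ in range(n_segments)]
    for row in rows:
        seg = zlib.crc32(str(row[key]).encode("utf-8")) % n_segments
        segments[seg].append(row)
    return segments


def distribute_round_robin(rows, n_segments):
    """Round robin: rows are dealt out evenly, regardless of their content."""
    segments = [[] for _ in range(n_segments)]
    for seg, row in zip(cycle(range(n_segments)), rows):
        segments[seg].append(row)
    return segments


sources = ["twitter", "facebook", "twitter", "blog", "twitter", "facebook"]
rows = [{"doc_id": i, "source": s} for i, s in enumerate(sources)]

by_key = distribute_by_key(rows, "source", 3)
round_robin = distribute_round_robin(rows, 3)

print("by key:     ", [len(seg) for seg in by_key])
print("round robin:", [len(seg) for seg in round_robin])
```

Distribution by key keeps related rows together, which helps joins and aggregations on that key but can skew segment sizes; round robin balances storage evenly at the cost of co-location.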
Apache MADlib
Image Processing…Unusual Way
Massively Parallel, In-Database Image Processing
Source: https://content.pivotal.io/blog/data-science-how-to-massively-parallel-in-database-image-processing-part-1
Takeaways
• Data science is not just the algorithms; it includes an end-to-end solution.
• The implementation should consider engineering factors and quantify them so appropriate components can be selected.
• The Big Data technology landscape is huge and growing – start with a solid use case to identify potential components.
• Abstracting a specific technology helps you identify its pros and cons.
• Be creative in solution design and technology selection, case by case.
• Lambda Architecture, Spark, Spark MLlib, Spark Streaming, Spark SQL, Kafka, Hadoop / YARN, Greenplum, MADlib.
Q & A
Email: hiarafat@hotmail.com
Skype: hichawy
LinkedIn: https://eg.linkedin.com/in/hisham-arafat-a7a69230
Thank You

Engineering patterns for implementing data science models on big data platforms

  • 1. Engineering Patterns for Implementing Data Science Models on Big Data Platforms. Hisham Arafat, Digital Transformation Lead Consultant; Solutions Architect, Technology Strategist & Researcher. Riyadh, KSA – 31 January 2017
  • 2. http://www.visualcapitalist.com/what-happens-internet-minute-2016/ Big Data…Practical Definition! • Big Data is the challenge not the solution • Big Data technologies address that challenge • Practically: • Massive Streams • Unstructured • Complex Processing
  • 3. Let’s Have a Use Case…Social Marketing
  • 4. Social Marketing…Looks Simple! Ingest Social Feeds Build Corpus Metrics Design Text Mining Model Deploy All to a Big Data Platform Application for Marketing Users What people are saying about our new brand “LemaTea”?
  • 5. Ingest Social Feeds Build Corpus Metrics Design Text Mining Model Deploy All to a Big Data Platform Application for Marketing Users
  • 6. It’s NOT as Easy as It Looks!
  • 7. Not Only Building an Appropriate Model, but More About Designing a Solution…Engineering Factors
  • 8. Engineering Factors
      • Interfacing with sources: REST APIs, source HTML,… (text is assumed)
      • Parsing to extract: queries, Regular Expressions,…
      • Crawling frequency: every 1 minute, 1 hour, on event,…
      • Document structure: post, post + comments, #, Reach, Retweets,…
      • Metadata: time, date, source, tags, authoritativeness,…
      • Transformations: canonicalization, weights, tokenization,…
      - Size: average size of 2 KB / doc
      - Initial load: 1.5B doc
      - Frequency: every 5 minutes
      - Throughput: 2 KB * 60,000 doc = 120 MB / load
      - Grows per day ~ 34 GB
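The ingestion figures on this slide can be reproduced with a few lines of arithmetic. The sketch below is only a back-of-the-envelope check: the function name and parameters are hypothetical, and decimal units (1 MB = 1,000 KB) are assumed because that is what makes 2 KB * 60,000 come out to 120 MB.

```python
def ingestion_sizing(doc_kb=2, docs_per_load=60_000, load_interval_min=5):
    """Return (MB per load, loads per day, GB per day, docs per day).

    Decimal units are assumed (1 MB = 1,000 KB, 1 GB = 1,000 MB).
    """
    mb_per_load = doc_kb * docs_per_load / 1000
    loads_per_day = 24 * 60 // load_interval_min
    gb_per_day = mb_per_load * loads_per_day / 1000
    docs_per_day = docs_per_load * loads_per_day
    return mb_per_load, loads_per_day, gb_per_day, docs_per_day


mb, loads, gb, docs = ingestion_sizing()
print(f"{mb:.0f} MB per load, {loads} loads/day, {gb:.2f} GB/day, {docs:,} docs/day")
```

With the slide's inputs this gives 288 loads per day, about 34.56 GB per day and about 17.3M documents per day, which is consistent with the "~ 34 GB" and "17M / day" figures quoted in the deck.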
  • 9. Engineering Factors
      • Input format: text, encoded text,…
      • Document representation: bag of words, ontology…
      • Corpus structures: indexes, reverse indexes,…
      • Corpus metrics: doc frequency, inverse doc frequency,…
      • Preprocessing: annotation, tagging,…
      • File structure: tables, text files, files-day,…
      - No of docs: 1.5B + 17M / day
      - Processing window: 60K docs per 3 mins
      - Processing rate: 20K docs per min
      - Final doc size = 2 KB * 5 ~ 10 KB
      - Scan rate: 20K docs/min * 10 KB ~ 200 MB/min
      - Many overheads need to be added
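To make the corpus structures and metrics above concrete, here is a minimal sketch of an index that maps each token to the documents containing it (commonly called an inverted index, which is presumably what the slide means by "reverse indexes"), with document frequency derived from it. The toy documents and the whitespace tokenizer are assumptions for illustration; a real corpus would need proper tokenization and a distributed store.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each lowercase token to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return index


# Hypothetical toy corpus; the real one would hold ~1.5B documents.
docs = {
    1: "LemaTea has a sweet taste",
    2: "the taste of LemaTea is too sweet",
    3: "new tea brand launched today",
}

index = build_inverted_index(docs)
# Document frequency: in how many documents each term appears.
doc_freq = {term: len(ids) for term, ids in index.items()}

print(sorted(index["lematea"]))            # [1, 2]
print(doc_freq["taste"], doc_freq["tea"])  # 2 1
```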
  • 10. Engineering Factors
      • Dimensionality reduction: stemming, lemmatization, noise words…
      • Type of applications: search/retrieval, sentiment analysis…
      • Modeling methods: classifiers, topic modeling, relevance…
      • Model efficiency: confusion matrix, precision, recall…
      • Overheads: intermediate processing, pre-aggregation,…
      • File structure: tables, text files, files-day,…
      - No of docs: 1.5B + 17M / day
      - Search for “LemaTea sweet taste”
      - No of tf to calculate ~ 1.5B * 3 ~ 4.5B
      - No of idf to calculate ~ 1.5B
      - Total calculations for 1 search ~ 6B
      - Consider daily growth
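The search example on this slide can be sketched as a plain tf-idf relevance score. This is one common formulation (raw term count times log(N / df)), not necessarily the weighting the author had in mind; the toy corpus and function name are assumptions. Note that a term frequency is looked up for every (query term, document) pair, which is what drives the slide's 1.5B * 3 estimate.

```python
import math
from collections import Counter


def tfidf_scores(query, docs):
    """Score each document against the query using tf * log(N / df)."""
    tokenized = {doc_id: Counter(text.lower().split()) for doc_id, text in docs.items()}
    n_docs = len(docs)
    scores = dict.fromkeys(docs, 0.0)
    for term in query.lower().split():
        # Document frequency: number of documents containing the term.
        df = sum(1 for counts in tokenized.values() if term in counts)
        if df == 0:
            continue  # term absent from the corpus contributes nothing
        idf = math.log(n_docs / df)
        for doc_id, counts in tokenized.items():
            scores[doc_id] += counts[term] * idf  # one tf lookup per (term, doc)
    return scores


# Hypothetical toy corpus.
docs = {
    1: "LemaTea has a sweet taste",
    2: "LemaTea is refreshing",
    3: "sweet offers on new phones",
    4: "traffic is slow today",
}

scores = tfidf_scores("LemaTea sweet taste", docs)
best = max(scores, key=scores.get)
print(best, scores)
```

Document 1 matches all three query terms and ranks first; document 4 matches none and scores zero.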
  • 11. Engineering Factors
      • File structure: tables, text files, files-day,…
      • File formats: HDFS, Parquet, Avro…
      • Platform technology: Hadoop/YARN, Spark, Greenplum, Flink,…
      • Model deployment: Java/Scala, Mahout, MLlib, MADlib, PL/R, FlinkML…
      • Data ingestion: Spring XD, Flume, Sqoop, G. Data Flow, Kafka/Streaming…
      • Ingestion pattern: real-time, micro batches,…
      - Overall storage
      - Processing capacity per node
      - No of nodes
      - Tables → Hive, HBase, Greenplum
      - Individual files → Spark, Flink
      - Files-day → Hadoop HDFS
  • 12. Engineering Factors
      • Workload: no of requests, request size,…
      • Application performance: response time, concurrent requests…
      • Application interfacing: REST APIs, native, messaging,…
      • Application implementation: integration, model scoring,…
      • Security model: application level, platform level,…
      - For 3 search terms ~ 6B calculations
      - For 5 search terms ~ 9B calculations
      - For 10 concurrent requests ~ 75B
      - Resource queuing / prioritization
      - Search options like date range
      - Access control model
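The workload figures on this slide follow from the per-search count used earlier in the deck: one tf per (search term, document) pair plus one idf-related calculation per document. The sketch below reproduces them. How the deck arrives at 75B for 10 concurrent requests is not stated; averaging the 3-term and 5-term costs is an assumption that happens to reproduce the figure.

```python
N_DOCS = 1_500_000_000  # initial corpus size from the slides


def calcs_per_search(n_docs, n_terms):
    """tf for every (term, doc) pair plus one idf-related calculation per doc,
    following the counting used in the slides."""
    return n_docs * n_terms + n_docs


three_terms = calcs_per_search(N_DOCS, 3)
five_terms = calcs_per_search(N_DOCS, 5)

# Assumption: the concurrent requests are a mix averaging the two cases above.
ten_concurrent = 10 * (three_terms + five_terms) // 2

for label, value in [("3 terms", three_terms),
                     ("5 terms", five_terms),
                     ("10 concurrent", ten_concurrent)]:
    print(f"{label}: {value / 1e9:.1f}B calculations")
```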
  • 13. Ongoing Process…Growing Requirements What if? • New sources are included • Wider parsing criteria • Advanced modeling: POS, Word Co-occurrence, Co-referencing, Named Entity, Relationship Extraction,… • Better response time is needed • More frequent ingestion Dynamic Platform Ingestion Corpus Processing Model Processing Requests Processing • Larger number of docs • Increased processing requirements • Platform expansion • Overall architecture reconsidered
  • 15. What is a Data Science Model? • Type & format of input data • Data ingestion • Transformations and feature engineering • Modeling methods and algorithms • Model evaluation and scoring • Application implementation considerations • In-Memory vs. In-Database
  • 16. Key Challenges for Data Science Models Volume Stationary Batches Structured Insights Growth Streams Real-time Unstructured Responsive Scale out Performance Data Flow Engines Event Processing Complex Formats Prescriptive / Deep Models
  • 17. Traditional Data Management Systems • Shared I/O • Shared Processing • Limited Scalability • Service Bottlenecks • High Cost Factor Shared Buffers Data Files Database Cluster I/O I/O I/O Network Database Service
  • 18. Abstraction of Big Data Platforms Data Nodes Master Nodes I/O Network Interconnect • Parallel Processing • Shared Nothing • Linear Scalability • Distributed Services • Lower Cost Factor I/O I/O I/O … Metadata 1 2 3 n Metadata User data / Replicas User data / Replicas User data / Replicas User data / Replicas
  • 19. In a Nutshell Source: http://dataconomy.com/2014/06/understanding-big-data-ecosystem/ • Very huge. • Overlaps. • Overloading. • You need to start with a use case to be able to get your solutions well engineered.
  • 20. Engineered Systems • Packaged: Hortonworks – Pivotal – Cloudera • Appliances: EMC DCA – Dell DSSD – Dell VxRack • Cloud offerings: Azure – AWS – IBM – Google Cloud
  • 22. Lambda Architecture…Social Marketing • Generic, scalable and fault-tolerant data processing architecture. • Keeps a master immutable dataset while serving low latency requests. • Aims at providing linear scalability. Source: http://lambda-architecture.net/
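As a rough illustration of the idea on this slide, the sketch below keeps an append-only master dataset, recomputes a batch view from it, and answers queries by merging that batch view with a real-time view of posts that arrived since the last batch run. The class and method names are hypothetical, everything runs in one process, and timing issues between the layers are ignored; in the deck's context the layers would map to components such as Kafka, Hadoop and Spark.

```python
from collections import Counter


class LambdaMentions:
    """Toy single-process sketch of batch, speed and serving layers."""

    def __init__(self):
        self.master = []                 # append-only master dataset
        self.batch_view = Counter()      # recomputed from scratch by the batch layer
        self.realtime_view = Counter()   # incremental counts since the last batch run

    @staticmethod
    def _tokens(post):
        return Counter(post.lower().split())

    def ingest(self, post):
        """New data goes to both the master dataset and the speed layer."""
        self.master.append(post)
        self.realtime_view.update(self._tokens(post))

    def run_batch(self):
        """Recompute the batch view over the whole master dataset."""
        view = Counter()
        for post in self.master:
            view.update(self._tokens(post))
        self.batch_view = view
        self.realtime_view = Counter()   # now covered by the batch view

    def query(self, term):
        """Serving layer: merge batch and real-time views."""
        term = term.lower()
        return self.batch_view[term] + self.realtime_view[term]


store = LambdaMentions()
store.ingest("LemaTea tastes great")
store.ingest("trying LemaTea today")
store.run_batch()
store.ingest("LemaTea again")
print(store.query("LemaTea"))  # 3 (2 from the batch view, 1 from the real-time view)
```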
  • 23. Social Marketing…Revisited Ingest Social Feeds Build Corpus Metrics Design Text Mining Model Deploy All to a Big Data Platform Application for Marketing Users What people are saying about our new brand “LemaTea”?
  • 24. Lambda Architecture (cont.) Source: https://speakerdeck.com/mhausenblas/lambda-architecture-with-apache-spark
  • 25. Lambda Architecture (cont.) Source: https://speakerdeck.com/mhausenblas/lambda-architecture-with-apache-spark
  • 26. Lambda Architecture (cont.) Source: https://speakerdeck.com/mhausenblas/lambda-architecture-with-apache-spark Sequence Files
  • 27. Apache Spark / MLlib • In-memory distributed processing • Scala, Python, Java and R • Resilient Distributed Dataset (RDD) • MLlib – Machine Learning Algorithms • SQL and Data Frames / Pipelines • Streaming • Big Graph analytics Spark Cluster Mesos HDFS/YARN
  • 28. Apache Spark • Supports different types of cluster managers and storage systems • HDFS / YARN, Mesos, Amazon S3, Standalone, HBase, Cassandra… • Interactive vs Application Mode • Memory Optimization Source: https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-architecture.html
  • 31. Apache Spark…The Big Picture Source: https://www.datanami.com/2015/11/30/spark-streaming-what-is-it-and-whos-using-it/
  • 32. Greenplum / MADlib • Massively Parallel Processing • Shared Nothing • Table distribution • By Key • By Round Robin • Massively Parallel Data Loading • Integration with Hadoop • Native MapReduce
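The two table distribution options on this slide can be illustrated with a small simulation. In Greenplum the choice is made per table with the `DISTRIBUTED BY (column)` or `DISTRIBUTED RANDOMLY` clause; the sketch below only mimics the effect on row placement. The use of `zlib.crc32` as the hash, the row layout and the function names are assumptions and do not reflect Greenplum's internal hash function.

```python
import zlib
from collections import Counter
from itertools import cycle


def distribute_by_key(rows, key, n_segments):
    """Hash distribution: rows sharing a key value land on the same segment."""
    return [zlib.crc32(str(row[key]).encode()) % n_segments for row in rows]


def distribute_round_robin(rows, n_segments):
    """Round-robin distribution: even spread, no co-location by key."""
    segments = cycle(range(n_segments))
    return [next(segments) for _ in rows]


# Hypothetical rows: one per ingested document.
sources = ["twitter", "facebook", "twitter", "blog", "twitter", "facebook"]
rows = [{"doc_id": i, "source": s} for i, s in enumerate(sources)]

by_key = distribute_by_key(rows, "source", n_segments=3)
by_rr = distribute_round_robin(rows, n_segments=3)

print("by key:     ", by_key)
print("round robin:", by_rr, Counter(by_rr))
```

Distribution by key co-locates all rows with the same `source`, which helps joins and aggregations on that column but can skew segment sizes; round robin spreads rows evenly regardless of content.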
  • 34. Image Processing…Unusual Way Massively Parallel, In-Database Image Processing Source: https://content.pivotal.io/blog/data-science-how-to-massively-parallel-in-database-image-processing-part-1
  • 35. Image Processing…Unusual Way Massively Parallel, In-Database Image Processing Source: https://content.pivotal.io/blog/data-science-how-to-massively-parallel-in-database-image-processing-part-1
  • 36. Image Processing…Unusual Way Massively Parallel, In-Database Image Processing Source: https://content.pivotal.io/blog/data-science-how-to-massively-parallel-in-database-image-processing-part-1
  • 37. Takeaways • Data science is not just the algorithms; it includes an end-to-end solution. • The implementation should consider engineering factors and quantify them so appropriate components can be selected. • The Big Data technology landscape is huge and growing; start with a solid use case to identify potential components. • Abstracting a specific technology helps you identify its pros and cons. • Be creative in solution design and select technology case by case. • Lambda Architecture, Spark, Spark MLlib, Spark Streaming, Spark SQL, Kafka, Hadoop / YARN, Greenplum, MADlib.
  • 38. Q & A
  • 39. Email: hiarafat@hotmail.com Skype: hichawy LinkedIn: https://eg.linkedin.com/in/hisham-arafat-a7a69230 Thank You