Big Data for Managers: From hadoop to streaming and beyond
DataWorks Summit/Hadoop Summit
1.
Big Data for Managers: From Hadoop to Streaming and Beyond
Dr. Vladimir Bacvanski
vladimir.bacvanski@scispike.com
@OnSoftware
2.
www.scispike.com Copyright © SciSpike 2016
Dr. Vladimir Bacvanski
• Founder of SciSpike, a development, consulting, and training firm
• Passionate about software and data
• PhD in computer science, RWTH Aachen, Germany
• Architect, consultant, mentor
• Custom development: scalable Web and IoT systems
• Training and mentoring in Big Data, Scala, node.js, software architecture
@OnSoftware https://www.linkedin.com/in/vladimirbacvanski
3.
Problems with Relational Stores
• Data that does not naturally fit into tables → impedance mismatch
• Development time often too long
• Dealing with unstructured data
• Performance problems
• Difficult to run on clusters
• Cost
4.
Structured and Unstructured Data Sources
Structured Data Sources
• Existing databases
• ERP/CRM/BI systems
• Inventory
• Supply chain
Unstructured Data Sources
• Server logs
• Search engine logs
• Browsing logs
• E-Commerce records
• Social media
• Voice
• Video
• Sensor data
5.
NoSQL Impact
[Diagram labels: Disks, Processors; Cost / Performance; volume scale 1M, 1B, 1T, 1Q, ... HUGE!!! (x1000 per step); Relational Database; Big Data + NoSQL]
• Tomorrow: volume is out of reach
• Today: doable, but expensive and slow
• Stabilize cost and increase performance
• Enable unlimited volume growth
6.
Scale Up vs. Scale Out
[Two charts with axes Capability and Cost: one for Scale Up, one for Scale Out]
7.
A Common Pattern for Processing Large Data
• Load a large set of records onto a set of machines
• Extract something interesting from each record ("Map")
• Shuffle and sort intermediate results
• Aggregate intermediate results ("Reduce")
• Store end result
[Diagram label: Key/Value pairs]
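The pattern above can be sketched in a few lines of single-machine Python. This is only an illustration of the data flow, not Hadoop code: the helper name `map_reduce`, the `sales` records, and the region names are all hypothetical and do not come from the slides.

```python
from collections import defaultdict


def map_reduce(records, map_fn, reduce_fn):
    """Toy, single-process version of the load / map / shuffle / reduce pattern."""
    # "Map": extract zero or more (key, value) pairs from each record.
    intermediate = []
    for record in records:
        intermediate.extend(map_fn(record))

    # Shuffle and sort: bring all values for the same key together.
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)

    # "Reduce": aggregate the values of each key into one result.
    return {key: reduce_fn(key, groups[key]) for key in sorted(groups)}


# Hypothetical input records: (region, sale amount).
sales = [("east", 10), ("west", 5), ("east", 7), ("west", 1), ("north", 3)]

totals = map_reduce(
    sales,
    map_fn=lambda record: [(record[0], record[1])],
    reduce_fn=lambda key, values: sum(values),
)
print(totals)  # {'east': 17, 'north': 3, 'west': 6}
```

The programmer supplies only `map_fn` and `reduce_fn`; everything else in the helper stands in for work that a framework would do across many machines.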
8.
Two Key Aspects of Hadoop
• MapReduce framework
  – How Hadoop understands and assigns work to the nodes (machines)
• Hadoop Distributed File System = HDFS
  – Where Hadoop stores data
  – A file system that spans all the nodes in a Hadoop cluster
  – It links together the file systems on many local nodes to make them into one big file system
9.
MapReduce Example: Word Count
• WordCount is the "Hello World" of Big Data
  – You will see various technologies implementing it
  – A good first step to compare the expressiveness of Big Data tools
Input (pets.txt): dog cat bird dog cat bird dog dog cat
Map: (dog, 1) (cat, 1) (bird, 1) (dog, 1) (cat, 1) (bird, 1) (dog, 1) (dog, 1) (cat, 1)
Shuffle: (dog, 1) (dog, 1) (dog, 1) (dog, 1) (cat, 1) (cat, 1) (cat, 1) (bird, 1) (bird, 1)
Reduce: (dog, 4) (cat, 3) (bird, 2)
Output (pet_freq.txt): dog, 4 cat, 3 bird, 2
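The word count on this slide can be traced step by step in plain Python. The sketch below is not tied to any Big Data tool; it assumes the nine words of pets.txt are spread over three lines, which the slide does not specify.

```python
from collections import defaultdict

# Assumed layout of pets.txt: the slide shows the words but not the line breaks.
pets_txt = ["dog cat bird", "dog cat bird", "dog dog cat"]

# Map: emit a (word, 1) pair for every word in every line.
mapped = [(word, 1) for line in pets_txt for word in line.split()]

# Shuffle: group the emitted 1s by word.
shuffled = defaultdict(list)
for word, one in mapped:
    shuffled[word].append(one)

# Reduce: sum the 1s of each word to get its frequency.
pet_freq = {word: sum(ones) for word, ones in shuffled.items()}

for word, count in sorted(pet_freq.items(), key=lambda item: -item[1]):
    print(f"{word}, {count}")
# dog, 4
# cat, 3
# bird, 2
```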
10.
The MapReduce Programming Model
§ "Map" step:
  – Input split into pieces
  – Worker nodes process individual pieces in parallel (under global control of the Job Tracker node)
  – Each worker node stores its result in its local file system where a reducer is able to access it
§ "Reduce" step:
  – Data is aggregated ("reduced" from the map steps) by worker nodes (under control of the Job Tracker)
  – Multiple reduce tasks can parallelize the aggregation
11.
Separation of Work
Programmers
• Map
• Reduce
Framework
• Deals with fault tolerance
• Assigns workers to map and reduce tasks
• Moves processes to data
• Shuffles and sorts intermediate data
• Deals with errors
12.
How To Create MapReduce Jobs
§ Java API
  – Low level, very flexible
  – Time consuming development
§ Streaming API
  – A simple, productive model for Python and Ruby
§ Hive
  – Open source language / Apache sub-project
  – Provides a SQL-like interface to Hadoop
§ Pig
  – Data flow language / Apache sub-project
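To give a feel for the Streaming API model, here is a minimal Python sketch. Hadoop Streaming runs a mapper and a reducer as separate programs that read lines from standard input and write tab-separated key/value lines to standard output, with the framework sorting by key in between. The sketch below keeps both as functions and simulates the sort locally so it can run on its own; the function names and the sample input are assumptions for illustration.

```python
import io
from itertools import groupby

def mapper(lines):
    """Map step: emit one "word<TAB>1" line per word."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reducer(sorted_lines):
    """Reduce step: sum the counts of consecutive lines that share a key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"

# Local stand-in for a streaming job: map, sort by key, then reduce.
# In a real job the framework performs the sort between the two programs.
sample_input = io.StringIO("dog cat bird\ndog cat bird\ndog dog cat\n")
streaming_output = list(reducer(sorted(mapper(sample_input))))
print(streaming_output)
```

In an actual job each function would sit in its own script, reading `sys.stdin` and printing to `sys.stdout`.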
13.
The Big Picture: NoSQL + Hadoop in Applications
[Figure: applications drawing on several stores side by side]
• Columnar: price updates, logs
• Document: product info
• Graph: customer agent relationships
• RDB: XA data
• Hadoop: operational analytics, price analytics
• Key/Value: session data
14.
Streaming: A New Paradigm
§ Conventional processing: static data
  [Diagram labels: Data, Queries, Results]
§ Real-time processing: streaming data
  [Diagram labels: Queries, Data, Results]
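The contrast between the two modes can be sketched in a few lines of Python. This is a toy illustration, not code for any streaming product; the function names and the sensor readings are made up.

```python
def batch_total(stored_readings):
    """Conventional processing: the data is at rest and a query runs over all of it."""
    return sum(stored_readings)

def streaming_totals(reading_stream):
    """Streaming processing: a standing query emits a result for each arriving item."""
    total = 0
    for reading in reading_stream:
        total += reading
        yield total

readings = [3, 5, 2, 7]  # hypothetical sensor readings

batch_result = batch_total(readings)                     # one result, after all data is stored
stream_results = list(streaming_totals(iter(readings)))  # one result per arriving reading
print(batch_result, stream_results)  # 17 [3, 8, 10, 17]
```

The streaming version never needs the full data set in hand: it keeps a small running state and produces an up-to-date answer as each item arrives.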
15.
Common Streaming Applications
§ Personalization
§ Search
§ Revenue optimization
§ User events
§ Content feeds
§ Log processing
§ Monitoring
§ Recommendations
§ Ads
§ Notable users:
  – Twitter
  – Yahoo
  – Spotify
  – Cisco
  – Flickr
  – Weather Channel
16.
Beyond Hadoop: Spark & Flink
[Figure: MapReduce, Tez, Spark, Flink]
17.
Apache Spark
§ Important features:
  – In-memory data
  – Resilient Distributed Datasets (RDDs)
    • Datasets can rebuild themselves if failure occurs
  – Rich set of operators
§ Efficient:
  – Up to 10x (on disk) to 100x (in memory) faster than Hadoop MR
  – 2 to 5 times less code (rich APIs in Scala/Java/Python)
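The idea that a dataset "can rebuild itself" rests on remembering how it was derived. The following is a toy model in plain Python, not the Spark API: the class `ToyDataset` and its methods are invented for illustration and only mimic the principle of recomputing lost data from its lineage.

```python
class ToyDataset:
    """Toy stand-in for an RDD: remembers how it was derived, not only its data."""

    def __init__(self, source=None, parent=None, transform=None):
        self.source = source        # base data (stands in for a file in HDFS)
        self.parent = parent        # dataset this one was derived from
        self.transform = transform  # function applied to the parent's items
        self.cache = None           # in-memory copy; may be lost on failure

    def map(self, fn):
        return ToyDataset(parent=self, transform=lambda items: [fn(x) for x in items])

    def filter(self, pred):
        return ToyDataset(parent=self, transform=lambda items: [x for x in items if pred(x)])

    def collect(self):
        if self.cache is None:
            # Rebuild from lineage: reload the source or recompute from the parent.
            if self.parent is None:
                self.cache = list(self.source)
            else:
                self.cache = self.transform(self.parent.collect())
        return self.cache


base = ToyDataset(source=[1, 2, 3, 4, 5, 6])
evens_squared = base.filter(lambda x: x % 2 == 0).map(lambda x: x * x)

first = evens_squared.collect()
evens_squared.cache = None        # simulate losing the in-memory copy
rebuilt = evens_squared.collect()
print(first, rebuilt)  # [4, 16, 36] [4, 16, 36]
```

Keeping results in memory gives the speed; keeping the recipe gives the resilience, since nothing has to be replicated up front to survive a failure.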
18.
Spark Architecture
§ A powerful set of tools
§ Beyond traditional Hadoop
Source: http://spark.apache.org
19.
Data Sharing in Apache Spark
[Figure: data from HDFS feeds Iteration 1 and Iteration 2; Result 1 and Result 2 are each held in cluster memory; Query 1 and Query 2]
20.
Apache Flink
§ Execution:
  – Programs compiled into an execution plan
  – Plan is optimized
  – Executed
§ Design goals:
  – High performance
  – Hybrid batch and streaming runtime
  – Simplicity for the developer
  – Rich libraries
  – Integration with many systems
21.
Apache Flink Components
§ Integration with Hadoop YARN, MapReduce, HBase, Cassandra, Kafka, …
§ Execution engine for Apache Beam (Google Dataflow)
22.
Flink Optimization and Execution
§ Optimizer selects an execution plan
§ Similar to what we have in relational databases
§ Optimal plan depends on the size of the input files
§ Run as standalone or on top of Hadoop
§ Integration with many Hadoop technologies
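Why the best plan depends on input size can be shown with a toy rule for joining two inputs. This is not Flink's optimizer, which works from cost estimates; the function name, the returned labels, and the 64 MB threshold are all assumptions made for the sketch.

```python
def choose_join_plan(left_bytes, right_bytes, broadcast_limit=64 * 1024 * 1024):
    """Pick a join strategy from input sizes (toy rule, not Flink's optimizer)."""
    smaller = min(left_bytes, right_bytes)
    if smaller <= broadcast_limit:
        # Cheap to copy the small input to every worker; the large input stays in place.
        return "broadcast the smaller input"
    # Both inputs are large: partition both by join key across the workers.
    return "repartition both inputs by key"

small_dimension_plan = choose_join_plan(10 * 1024**2, 500 * 1024**3)
two_large_inputs_plan = choose_join_plan(200 * 1024**3, 500 * 1024**3)
print(small_dimension_plan)   # broadcast the smaller input
print(two_large_inputs_plan)  # repartition both inputs by key
```

The same program can therefore run under different plans as the data grows, which is the behavior relational database users already expect from a query optimizer.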
23.
Flink & Spark: The Advantages and Outlook
§ Less IO overhead than conventional Hadoop
§ Caching
§ Iterative algorithms
§ Unifying batch and stream computing
§ Scala as a natural, expressive language for Big Data
  – Other languages: Python, Java, R
§ Beware of less mature components
24.
Typical NoSQL Systems
§ Non-relational
§ Distributed
§ Horizontally scalable
§ No need for a fixed schema
§ Several established players
§ Systems are specialized
25.
NoSQL Stores and Their Categories
§ Choose a store that is the best match for your application
§ It is fine to use several different stores
  – "Polyglot persistence"
[Figure: store categories: Key-Value (k, v), Column-Family, Document-Oriented, Graph DB]
26.
NoSQL Stores: Scale vs. Complexity of Data
[Figure: Key-Value, Column-Family, Document-Oriented, and Graph DB stores placed on axes of scalability and complexity, with a marker for the needs of most applications]
27.
Key-Value Stores
§ Key → Value mapping
§ Large, persistent Map ("hashtable")
  – Values could be lists and hashes
§ Easy to use
§ Scale very well
§ Data model may be too simple for most applications
§ Systems:
  – Redis, Riak, Memcached, Amazon DynamoDB, Aerospike, FoundationDB
§ Use when data model is very simple and scalability essential
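The key-to-value model is small enough to sketch with a dictionary. This is an in-memory toy, not the client API of any of the systems listed; the class name, the key format `session:42`, and the session contents are made up.

```python
import json

class ToyKeyValueStore:
    """In-memory stand-in for a key-value store: opaque values looked up by key."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key, default=None):
        return self._data.get(key, default)

store = ToyKeyValueStore()

# The store does not interpret the value; here it happens to be JSON text.
store.put("session:42", json.dumps({"user": "alice", "cart": ["dog food", "cat toy"]}))

session = json.loads(store.get("session:42"))
missing = store.get("session:99")
print(session["user"], missing)  # alice None
```

The only access path is the key. That restriction is what makes these stores easy to spread across many machines, and also why the model is too simple for data that must be queried by content.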
28.
Typical Use Cases
§ The data model is very simple!
  – Actual data can be JSON
§ Session data
§ User preferences and profiles
§ Shopping cart
§ If another NoSQL store is good enough, you may want to skip the key-value store and let a column-family or document store handle this data
29.
Column-Family
§ "Column-family": similar to a table
  – Table is sparse
§ Key → (Column:Value)*
§ Columns have names
§ Can be indexed
§ Can store complex data
  – Denormalize!
§ Systems:
  – Google BigTable, HBase, Cassandra, Amazon SimpleDB, Hypertable
§ Use when scalability is essential
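The "Key → (Column:Value)*" shape can be pictured as a dictionary of dictionaries. The sketch below is a plain Python illustration, not the data model of any specific system; the row keys, column names, and values are invented.

```python
# Row key -> {column name: value}. Each row stores only the columns it has.
price_updates = {
    "sku-1001": {"price": 9.99, "currency": "USD", "updated": "2016-06-01"},
    "sku-1002": {"price": 4.50},                       # sparse: no currency or date
    "sku-1003": {"price": 12.00, "promo": "summer"},   # a column other rows lack
}

def read_column(table, row_key, column, default=None):
    """Look up one named column of one row."""
    return table.get(row_key, {}).get(column, default)

all_columns = sorted({column for row in price_updates.values() for column in row})
promo_1003 = read_column(price_updates, "sku-1003", "promo")
promo_1001 = read_column(price_updates, "sku-1001", "promo")
print(all_columns, promo_1003, promo_1001)
```

Missing columns cost nothing to store, which is what "table is sparse" means in practice, and new columns can appear on any row without a schema change.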
30.
Typical Use Cases
§ High insert volume: logging
§ Real-time updates
§ Content management
§ Expiring content
§ Cross-datacenter replication
§ MapReduce analytics over stored data
§ You don't need conventional (ACID) transactions
31.
Document Stores
§ JSON, BSON, XML
§ No schema
§ Indexes improve performance
§ Easy transition from RDBMS
§ Systems:
  – MongoDB, CouchDB, CouchBase
§ Use when data is in semi-structured form
§ Often seen in new Web applications
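Two of the points above, no fixed schema and indexes for performance, can be sketched with plain Python dictionaries. This is a toy illustration and not the query API of MongoDB or any other system; the helper names `find` and `build_index` and the product documents are made up.

```python
from collections import defaultdict

# Documents in one collection need not share the same fields.
products = [
    {"_id": 1, "name": "Leash", "category": "dog", "length_m": 2},
    {"_id": 2, "name": "Scratching post", "category": "cat"},
    {"_id": 3, "name": "Chew toy", "category": "dog", "colors": ["red", "blue"]},
]

def find(collection, **criteria):
    """Full scan: return documents whose fields equal all the given criteria."""
    return [doc for doc in collection
            if all(doc.get(field) == value for field, value in criteria.items())]

def build_index(collection, field):
    """Secondary index: field value -> documents, so lookups avoid a full scan."""
    index = defaultdict(list)
    for doc in collection:
        if field in doc:
            index[doc[field]].append(doc)
    return index

by_category = build_index(products, "category")
dog_names_scan = [doc["name"] for doc in find(products, category="dog")]
dog_names_index = [doc["name"] for doc in by_category["dog"]]
print(dog_names_scan, dog_names_index)
```

Both lookups return the same documents; the index trades some extra storage and write work for reads that no longer touch every document.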
32.
Typical Use Cases
§ Logging
  – Especially with variable content
§ Product information
§ Customer information
§ Content management
§ Data to be stored has a format that varies over time
  – Flexible schema
§ Web analytics
33.
Graph Databases
§ Nodes with properties
§ Nodes connected through relationships
§ Can model very complex graph data
  – Social networks
§ Systems:
  – Neo4J, Infinite Graph, TitanDB, OrientDB
§ Use when data is a (complex) graph
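Nodes, properties, and relationships can be sketched with basic Python structures. This is a toy model, not the API or query language of any graph database; the people, cities, and the `FRIEND` relationship type are invented for the example.

```python
# Nodes with properties.
people = {
    "alice": {"city": "Austin"},
    "bob": {"city": "Boston"},
    "carol": {"city": "Chicago"},
    "dave": {"city": "Denver"},
}

# Relationships stored as directed (source, type, target) triples.
relationships = [
    ("alice", "FRIEND", "bob"),
    ("bob", "FRIEND", "carol"),
    ("bob", "FRIEND", "dave"),
    ("carol", "FRIEND", "alice"),
]

def neighbors(node, rel_type):
    """Nodes reachable from `node` over one relationship of the given type."""
    return {target for source, kind, target in relationships
            if source == node and kind == rel_type}

def friends_of_friends(node):
    """Two-hop traversal, excluding the start node and its direct friends."""
    direct = neighbors(node, "FRIEND")
    two_hops = {fof for friend in direct for fof in neighbors(friend, "FRIEND")}
    return two_hops - direct - {node}

suggestions = sorted(friends_of_friends("alice"))
suggestion_cities = [people[name]["city"] for name in suggestions]
print(suggestions, suggestion_cities)  # ['carol', 'dave'] ['Chicago', 'Denver']
```

A graph database makes this kind of multi-hop traversal a first-class operation, where a relational design would need a self-join per hop.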
34.
Typical Use Cases
§ Highly interconnected data
§ Social graphs
§ Party relationships in an enterprise
§ Location based services
§ Purchasing analytics and recommendations
§ Often combined with other systems to store the bulk of data
  – Graph database can focus on relationships
35.
Integrating Relational, Streams, and Hadoop
[Figure: integration of traditional and Big Data systems. Labels: Streams; Data + Big Data; Traditional Warehouse; In-Motion Analytics; Data analytics; Results; Database & Warehouse; At-rest data analytics; Results; Ultra Low Latency Results; Traditional / Relational Data Sources; Non-Traditional / Non-Relational Data Sources (varied data formats: semi-structured, unstructured...); Event System; NoSQL]
36.
Lambda Architecture
[Figure: Incoming Data feeds both the Batch Layer (Master Dataset, Batch Update) and the Event (Speed) Layer (Real Time Data, Real Time Update, Rolling Values); the Serving Layer holds the Batch View; Queries merge results from both]
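The layers in the figure can be sketched in a few lines of Python. This is a single-process toy, assuming page-view counting as the workload; the event names and the `query` function are made up, and a real deployment would use separate batch and streaming systems for the two layers.

```python
from collections import Counter

# Master dataset: immutable, append-only record of events (hypothetical page views).
master_dataset = ["home", "cart", "home", "search"]

# Batch layer: periodically recompute the batch view from the whole master dataset.
batch_view = Counter(master_dataset)

# Speed layer: rolling values for events that arrived after the last batch run.
realtime_view = Counter()
for event in ["home", "checkout"]:
    realtime_view[event] += 1

def query(page):
    """Serving side: merge the batch view with the real-time view."""
    return batch_view[page] + realtime_view[page]

home_views = query("home")
checkout_views = query("checkout")
search_views = query("search")
print(home_views, checkout_views, search_views)  # 3 1 1
```

Once the next batch run has absorbed the recent events into the master dataset, the speed layer's rolling values for them can be discarded.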
37.
Master Data Management and Governance
§ Big Data and NoSQL stores can easily become a bigger mess than relational stores
§ Introduce a practical plan
  – Avoid lengthy and cumbersome governance
  – Actual use should be the driving force
  – Start slow
§ Be ready for change
  – The technologies change rapidly
§ Focus on business outcomes
38.
Succeeding with Big Data and NoSQL
  1. Actively look for solutions where the right store can ease the pain
  2. Make sure you deliver tangible value to clients
  3. After you get your first apps to work: create a Big Data introduction and governance plan
  4. Prioritize: do the most useful thing for the business first
  5. Integrate with existing IT
  6. Make sure you hire or grow your Big Data champions
  7. Field is immature: look out for new tools and techniques
39.
Conclusions
– Hadoop and NoSQL address the weak points of relational systems:
  • Scale
  • Performance
  • Unstructured and semistructured data
– Streaming addresses the processing of data in real time
– Integrate with conventional technologies!
– Spark and Flink: the next generation Big Data systems
40.
Questions?