SlideShare a Scribd company logo
1 of 34
Download to read offline
Small intro to Big Data
Michał Matłoka @mmatloka
Outline
1. What is Big Data?
2. Storing
3. Batch & Streams processing
4. Resource Managers
5. Machine Learning
6. Analysis & Visualization
7. Other
What is Big Data?
● Volume
● Velocity
● Variety
How big is Big?
ThoughtWorks: Big Data envy
Storing
CAP theorem
(Brewer’s theorem)
In distributed system you can only
have two of three guarantees:
● Consistency
● Availability
● Partition Tolerance
Relational scaling
(horizontal)
Example limitations:
● Max 48 nodes
● Read-only nodes
● Cross-shard joins…
● Auto-increments
● Distributed transactions,
possible, but…
It can work!
You don’t always need ACID
BASE might be enough
NoSQL
(Not only SQL)
● Key-value (Redis, Dynamo, ...)
● Column (Cassandra, HBase, ...)
● Document (MongoDB, … )
● Graph (Neo4J, …
● Multi-model (OrientDB, …)
Apple - 115k Cassandra nodes with
over 10PB of data!
Source: http://blog.nahurst.com/visual-guide-to-nosql-systems
Batch Processing
● Processes data from 1 or more
sources from bigger period of
time (e.g. day, month)
● Source: db, Apache Parquet, ...
● Not real-time
● Can take hours or more
Apache Hadoop
● Based on Google paper
● First release in 2006!
● Map -> (shuffle) -> Reduce
● Was the beginning of many
projectsMapReduce
public static class TokenizerMapper
extends Mapper<Object, Text, Text, IntWritable>{
private final static IntWritable one = new IntWritable(1);
private Text word = new Text();
public void map(Object key, Text value, Context context
) throws IOException, InterruptedException {
StringTokenizer itr = new StringTokenizer(value.toString());
while (itr.hasMoreTokens()) {
word.set(itr.nextToken());
context.write(word, one);
}
}
}
Hadoop Wordcount - part I
Source:
https://hadoop.apache.org/docs/stable/hadoop-mapreduce-client/hadoop-mapreduce-client-
core/MapReduceTutorial.html
public static class IntSumReducer
extends Reducer<Text,IntWritable,Text,IntWritable> {
private IntWritable result = new IntWritable();
public void reduce(Text key, Iterable<IntWritable> values,
Context context
) throws IOException, InterruptedException {
int sum = 0;
for (IntWritable val : values) {
sum += val.get();
}
result.set(sum);
context.write(key, result);
}
}
Hadoop Wordcount - part II
Source:
https://hadoop.apache.org/docs/stable/hadoop-mapreduce-client/hadoop-mapreduce-client-
core/MapReduceTutorial.html
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "word count");
job.setJarByClass(WordCount.class);
job.setMapperClass(TokenizerMapper.class);
job.setCombinerClass(IntSumReducer.class);
job.setReducerClass(IntSumReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
Hadoop Wordcount - part III
Source:
https://hadoop.apache.org/docs/stable/hadoop-mapreduce-client/hadoop-mapreduce-client-
core/MapReduceTutorial.html
Apache Spark
RDD (Resilient Distributed
Dataset)
DAG (Directed acyclic graph)
● RDD - map, filter, count etc
● Spark SQL
● MLib
● GraphX
● Spark Streaming
● API: Scala, Java, Python, R*
val textFile = sc.textFile("hdfs://...")
val counts = textFile
.flatMap(line => line.split(" "))
.map(word => (word, 1))
.reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://...")
Spark Wordcount
Source: http://spark.apache.org/examples.html
Apache Flink
DAGs with iterations
● Batch & native streaming
● FlinkML
● Table API & SQL (Beta)
● Gelly - graph analytics
● FlinkCEP - detect patterns in
data streams
● Compatible with Apache
Hadoop and Apache Storm APIs
● API: Scala, Java, Python*
Stream
processing
● Near real-time/real time
● Processing (usually) does not
end
● Source: files, Apache Kafka,
Socket, Akka Actors, Twitter,
RabbitMQ etc
● Event time vs processing time
● Windows - fixed, sliding, session
● Watermarks
● State
Stream
processing
● Native/micro-batch
● Latency
● Throughput
● Delivery guarantees
● Resources managers
● API - compositional/declarative
● Maturity
Differences
Stream processing
● Apache Storm
● Apache Storm Trident
● Alibaba JStorm
● Twitter Heron
● Apache Spark Streaming
● Apache Flink
● Apache Beam
● Apache Kafka Streams
● Apache Samza
● Apache Gearpump
● Apache Apex
● Apache Ignite Streaming
● Apache S4
● ...
Resource
management
● Apache YARN
● Apache Mesos (1.0.1!)
● Apache Slider - deploy existing
apps on YARN
● Apache Myriad - YARN on
Mesos
● DC/OS
Source:
https://docs.mesosphere.com/wp-content/uploads/2016/04/dashb
oard-ee-600x395@2x.gif
Analysis
SQL Engines & Querying
● Apache Hive
● Apache Pig
● Apache HAWQ
● Apache Impala
● Apache Phoenix
● Apache Spark SQL
● Apache Drill
● Facebook Presto
● ...
Machine Learning
● Apache Mahout
● Apache Samoa
● Spark MLib
● FlinkML
● H2O
● TensorFlow
Notebooks
● IPython
● Jupyter
● Apache Zeppelin
Source: https://zeppelin.apache.org/assets/themes/zeppelin/img/notebook.png
Hadoop-related
● Apache Sqoop
● Apache Flume
● Apache Oozie
● Hue
● Apache HDFS
● Apache Ambari
● Apache Knox
● Apache ZooKeeper
Awesome Big Data
https://github.com/onurakpolat
/awesome-bigdata
Conclusions
● There is a lot of it!
● https://pixelastic.github.io/pokem
onorbigdata/
● If you want to learn, start with
SMACK stack (Spark, Mesos,
Akka, Cassandra, Kafka)
Articles & references
● https://databaseline.wordpress.com/2016/03/12/an-overview-of-apache-streaming-t
echnologies/
● http://www.cakesolutions.net/teamblogs/comparison-of-apache-stream-processing-
frameworks-part-1
● https://dcos.io/
● https://www.oreilly.com/ideas/a-tale-of-two-clusters-mesos-and-yarn
● http://spark.apache.org/
● https://flink.apache.org/
● http://www.51zero.com/blog/2015/12/13/why-apache-flink-is-the-4th-generation-of-
big-data-analytics-frameworks
● http://www.slideshare.net/AndyPiper1/reactconf-2014-event-stream-processing
● https://www.infoq.com/articles/cap-twelve-years-later-how-the-rules-have-changed
● https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101
● https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-102
Thank you, Q&A?
@mmatloka
http://www.slideshare.net/softwaremill
https://softwaremill.com/blog/

More Related Content

What's hot

What's hot (20)

Cascalog
CascalogCascalog
Cascalog
 
Bigdata : Big picture
Bigdata : Big pictureBigdata : Big picture
Bigdata : Big picture
 
Open source data ingestion
Open source data ingestionOpen source data ingestion
Open source data ingestion
 
presto-at-netflix-hadoop-summit-15
presto-at-netflix-hadoop-summit-15presto-at-netflix-hadoop-summit-15
presto-at-netflix-hadoop-summit-15
 
Exploring MongoDB & Elasticsearch: Better Together
Exploring MongoDB & Elasticsearch: Better TogetherExploring MongoDB & Elasticsearch: Better Together
Exploring MongoDB & Elasticsearch: Better Together
 
Elastic Stack Introduction
Elastic Stack IntroductionElastic Stack Introduction
Elastic Stack Introduction
 
Programmatic Bidding Data Streams & Druid
Programmatic Bidding Data Streams & DruidProgrammatic Bidding Data Streams & Druid
Programmatic Bidding Data Streams & Druid
 
Engineering fast indexes
Engineering fast indexesEngineering fast indexes
Engineering fast indexes
 
Amazon aws big data demystified | Introduction to streaming and messaging flu...
Amazon aws big data demystified | Introduction to streaming and messaging flu...Amazon aws big data demystified | Introduction to streaming and messaging flu...
Amazon aws big data demystified | Introduction to streaming and messaging flu...
 
NATE-Central-Log
NATE-Central-LogNATE-Central-Log
NATE-Central-Log
 
Apache Spark part of Eindhoven Java Meetup
Apache Spark part of Eindhoven Java MeetupApache Spark part of Eindhoven Java Meetup
Apache Spark part of Eindhoven Java Meetup
 
Cascalog at May Bay Area Hadoop User Group
Cascalog at May Bay Area Hadoop User GroupCascalog at May Bay Area Hadoop User Group
Cascalog at May Bay Area Hadoop User Group
 
Presto Summit 2018 - 09 - Netflix Iceberg
Presto Summit 2018  - 09 - Netflix IcebergPresto Summit 2018  - 09 - Netflix Iceberg
Presto Summit 2018 - 09 - Netflix Iceberg
 
Argus Production Monitoring at Salesforce
Argus Production Monitoring at SalesforceArgus Production Monitoring at Salesforce
Argus Production Monitoring at Salesforce
 
Unifying Events and Logs into the Cloud
Unifying Events and Logs into the CloudUnifying Events and Logs into the Cloud
Unifying Events and Logs into the Cloud
 
Real-time Analytics with Apache Flink and Druid
Real-time Analytics with Apache Flink and DruidReal-time Analytics with Apache Flink and Druid
Real-time Analytics with Apache Flink and Druid
 
Chronix Time Series Database - The New Time Series Kid on the Block
Chronix Time Series Database - The New Time Series Kid on the BlockChronix Time Series Database - The New Time Series Kid on the Block
Chronix Time Series Database - The New Time Series Kid on the Block
 
Why Your MongoDB Needs Redis
Why Your MongoDB Needs RedisWhy Your MongoDB Needs Redis
Why Your MongoDB Needs Redis
 
Aggregated queries with Druid on terrabytes and petabytes of data
Aggregated queries with Druid on terrabytes and petabytes of dataAggregated queries with Druid on terrabytes and petabytes of data
Aggregated queries with Druid on terrabytes and petabytes of data
 
Pinot: Enabling Real-time Analytics Applications @ LinkedIn's Scale
Pinot: Enabling Real-time Analytics Applications @ LinkedIn's ScalePinot: Enabling Real-time Analytics Applications @ LinkedIn's Scale
Pinot: Enabling Real-time Analytics Applications @ LinkedIn's Scale
 

Viewers also liked

Viewers also liked (20)

Origins of free
Origins of freeOrigins of free
Origins of free
 
Machine learning by example
Machine learning by exampleMachine learning by example
Machine learning by example
 
3 kroki do sukcesu płaskiej i zdalnej firmy | SoftwareMill
3 kroki do sukcesu płaskiej i zdalnej firmy | SoftwareMill3 kroki do sukcesu płaskiej i zdalnej firmy | SoftwareMill
3 kroki do sukcesu płaskiej i zdalnej firmy | SoftwareMill
 
Jednorożce to kobiety a nie firmy. O √kobiecym w STEM
Jednorożce to kobiety a nie firmy. O √kobiecym w STEMJednorożce to kobiety a nie firmy. O √kobiecym w STEM
Jednorożce to kobiety a nie firmy. O √kobiecym w STEM
 
Scalatra - Scalar Mini
Scalatra  - Scalar MiniScalatra  - Scalar Mini
Scalatra - Scalar Mini
 
An Introduction to Akka
An Introduction to AkkaAn Introduction to Akka
An Introduction to Akka
 
IMCSummit 2015 - Day 1 IT Business Track - Designing a Big Data Analytics Pla...
IMCSummit 2015 - Day 1 IT Business Track - Designing a Big Data Analytics Pla...IMCSummit 2015 - Day 1 IT Business Track - Designing a Big Data Analytics Pla...
IMCSummit 2015 - Day 1 IT Business Track - Designing a Big Data Analytics Pla...
 
Spark Tuning for Enterprise System Administrators
Spark Tuning for Enterprise System AdministratorsSpark Tuning for Enterprise System Administrators
Spark Tuning for Enterprise System Administrators
 
Apache Spark Machine Learning Decision Trees
Apache Spark Machine Learning Decision TreesApache Spark Machine Learning Decision Trees
Apache Spark Machine Learning Decision Trees
 
Apache Spark Machine Learning
Apache Spark Machine LearningApache Spark Machine Learning
Apache Spark Machine Learning
 
Apache Gearpump - Lightweight Real-time Streaming Engine
Apache Gearpump - Lightweight Real-time Streaming EngineApache Gearpump - Lightweight Real-time Streaming Engine
Apache Gearpump - Lightweight Real-time Streaming Engine
 
Open Source Security Tools for Big Data
Open Source Security Tools for Big DataOpen Source Security Tools for Big Data
Open Source Security Tools for Big Data
 
From spaghetti with no `src/test` to green CI and well-sleeping developers
From spaghetti with no `src/test` to green CI and well-sleeping developersFrom spaghetti with no `src/test` to green CI and well-sleeping developers
From spaghetti with no `src/test` to green CI and well-sleeping developers
 
How secure your web framework is?
How secure your web framework is?How secure your web framework is?
How secure your web framework is?
 
Code reviews - Human Factor
Code reviews - Human FactorCode reviews - Human Factor
Code reviews - Human Factor
 
Emancypacja pracowników. Dlaczego spaliliśmy karty zakładowe?
Emancypacja pracowników. Dlaczego spaliliśmy karty zakładowe?Emancypacja pracowników. Dlaczego spaliliśmy karty zakładowe?
Emancypacja pracowników. Dlaczego spaliliśmy karty zakładowe?
 
Proste REST API z użyciem play i slick
Proste REST API z użyciem play i slickProste REST API z użyciem play i slick
Proste REST API z użyciem play i slick
 
What is most important 
in cooperation with external software developers? Par...
What is most important 
in cooperation with external software developers? Par...What is most important 
in cooperation with external software developers? Par...
What is most important 
in cooperation with external software developers? Par...
 
Machine learning with Spark
Machine learning with SparkMachine learning with Spark
Machine learning with Spark
 
MLlib: Spark's Machine Learning Library
MLlib: Spark's Machine Learning LibraryMLlib: Spark's Machine Learning Library
MLlib: Spark's Machine Learning Library
 

Similar to Small intro to Big Data - Old version

Scrap Your MapReduce - Apache Spark
 Scrap Your MapReduce - Apache Spark Scrap Your MapReduce - Apache Spark
Scrap Your MapReduce - Apache Spark
IndicThreads
 

Similar to Small intro to Big Data - Old version (20)

Big Data Analytics with Hadoop, MongoDB and SQL Server
Big Data Analytics with Hadoop, MongoDB and SQL ServerBig Data Analytics with Hadoop, MongoDB and SQL Server
Big Data Analytics with Hadoop, MongoDB and SQL Server
 
Boston Apache Spark User Group (the Spahk group) - Introduction to Spark - 15...
Boston Apache Spark User Group (the Spahk group) - Introduction to Spark - 15...Boston Apache Spark User Group (the Spahk group) - Introduction to Spark - 15...
Boston Apache Spark User Group (the Spahk group) - Introduction to Spark - 15...
 
Ingesting Over Four Million Rows Per Second With QuestDB Timeseries Database ...
Ingesting Over Four Million Rows Per Second With QuestDB Timeseries Database ...Ingesting Over Four Million Rows Per Second With QuestDB Timeseries Database ...
Ingesting Over Four Million Rows Per Second With QuestDB Timeseries Database ...
 
Apache Spark Overview @ ferret
Apache Spark Overview @ ferretApache Spark Overview @ ferret
Apache Spark Overview @ ferret
 
Apache Spark & MLlib
Apache Spark & MLlibApache Spark & MLlib
Apache Spark & MLlib
 
OCF.tw's talk about "Introduction to spark"
OCF.tw's talk about "Introduction to spark"OCF.tw's talk about "Introduction to spark"
OCF.tw's talk about "Introduction to spark"
 
Big data vahidamiri-tabriz-13960226-datastack.ir
Big data vahidamiri-tabriz-13960226-datastack.irBig data vahidamiri-tabriz-13960226-datastack.ir
Big data vahidamiri-tabriz-13960226-datastack.ir
 
Introduction To Spark - Durham LUG 20150916
Introduction To Spark - Durham LUG 20150916Introduction To Spark - Durham LUG 20150916
Introduction To Spark - Durham LUG 20150916
 
Spark & Cassandra - DevFest Córdoba
Spark & Cassandra - DevFest CórdobaSpark & Cassandra - DevFest Córdoba
Spark & Cassandra - DevFest Córdoba
 
JPoint'15 Mom, I so wish Hibernate for my NoSQL database...
JPoint'15 Mom, I so wish Hibernate for my NoSQL database...JPoint'15 Mom, I so wish Hibernate for my NoSQL database...
JPoint'15 Mom, I so wish Hibernate for my NoSQL database...
 
Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)
 
Big data distributed processing: Spark introduction
Big data distributed processing: Spark introductionBig data distributed processing: Spark introduction
Big data distributed processing: Spark introduction
 
Big Data Analytics and Ubiquitous computing
Big Data Analytics and Ubiquitous computingBig Data Analytics and Ubiquitous computing
Big Data Analytics and Ubiquitous computing
 
Apache Spark - Las Vegas Big Data Meetup Dec 3rd 2014
Apache Spark - Las Vegas Big Data Meetup Dec 3rd 2014Apache Spark - Las Vegas Big Data Meetup Dec 3rd 2014
Apache Spark - Las Vegas Big Data Meetup Dec 3rd 2014
 
Scrap Your MapReduce - Apache Spark
 Scrap Your MapReduce - Apache Spark Scrap Your MapReduce - Apache Spark
Scrap Your MapReduce - Apache Spark
 
Web-scale data processing: practical approaches for low-latency and batch
Web-scale data processing: practical approaches for low-latency and batchWeb-scale data processing: practical approaches for low-latency and batch
Web-scale data processing: practical approaches for low-latency and batch
 
Spark 计算模型
Spark 计算模型Spark 计算模型
Spark 计算模型
 
Gluent New World #02 - SQL-on-Hadoop : A bit of History, Current State-of-the...
Gluent New World #02 - SQL-on-Hadoop : A bit of History, Current State-of-the...Gluent New World #02 - SQL-on-Hadoop : A bit of History, Current State-of-the...
Gluent New World #02 - SQL-on-Hadoop : A bit of History, Current State-of-the...
 
Apache Hive for modern DBAs
Apache Hive for modern DBAsApache Hive for modern DBAs
Apache Hive for modern DBAs
 
Big Data in the Real World
Big Data in the Real WorldBig Data in the Real World
Big Data in the Real World
 

More from SoftwareMill

More from SoftwareMill (18)

Growing Oxen: channel operators and retries
Growing Oxen: channel operators and retriesGrowing Oxen: channel operators and retries
Growing Oxen: channel operators and retries
 
How To Survive a Live-Coding Session
How To Survive a Live-Coding SessionHow To Survive a Live-Coding Session
How To Survive a Live-Coding Session
 
Goryle i ser szwajcarski. Czego medycyna ratunkowa może Cię nauczyć o tworzen...
Goryle i ser szwajcarski. Czego medycyna ratunkowa może Cię nauczyć o tworzen...Goryle i ser szwajcarski. Czego medycyna ratunkowa może Cię nauczyć o tworzen...
Goryle i ser szwajcarski. Czego medycyna ratunkowa może Cię nauczyć o tworzen...
 
Have you ever wondered about code review?
Have you ever wondered about code review?Have you ever wondered about code review?
Have you ever wondered about code review?
 
Reactive Integration with Akka Streams and Alpakka
Reactive Integration with Akka Streams and AlpakkaReactive Integration with Akka Streams and Alpakka
Reactive Integration with Akka Streams and Alpakka
 
W świecie botów czyli po co nam SI
W świecie botów czyli po co nam SIW świecie botów czyli po co nam SI
W świecie botów czyli po co nam SI
 
Small intro to Big Data
Small intro to Big DataSmall intro to Big Data
Small intro to Big Data
 
Out-of-the-box Reactive Streams with Java 9
Out-of-the-box Reactive Streams with Java 9Out-of-the-box Reactive Streams with Java 9
Out-of-the-box Reactive Streams with Java 9
 
Hiring, Bots and Beer. (Hiring in the IT industry)
Hiring, Bots and Beer. (Hiring in the IT industry) Hiring, Bots and Beer. (Hiring in the IT industry)
Hiring, Bots and Beer. (Hiring in the IT industry)
 
Teal Is The New Black
Teal Is The New BlackTeal Is The New Black
Teal Is The New Black
 
Windowing data in big data streams
Windowing data in big data streamsWindowing data in big data streams
Windowing data in big data streams
 
Kafka as a message queue
Kafka as a message queueKafka as a message queue
Kafka as a message queue
 
Introduction to Cassandra
Introduction to CassandraIntroduction to Cassandra
Introduction to Cassandra
 
Origins of Free
Origins of FreeOrigins of Free
Origins of Free
 
Cassandra - how to fail?
Cassandra - how to fail?Cassandra - how to fail?
Cassandra - how to fail?
 
How to manage in a flat organized, remote and transparent company
How to manage in a flat organized, remote and transparent companyHow to manage in a flat organized, remote and transparent company
How to manage in a flat organized, remote and transparent company
 
Performance tests with gatling
Performance tests with gatlingPerformance tests with gatling
Performance tests with gatling
 
Projekt z punktu widzenia UX designera
Projekt z punktu widzenia UX designeraProjekt z punktu widzenia UX designera
Projekt z punktu widzenia UX designera
 

Recently uploaded

AI Mastery 201: Elevating Your Workflow with Advanced LLM Techniques
AI Mastery 201: Elevating Your Workflow with Advanced LLM TechniquesAI Mastery 201: Elevating Your Workflow with Advanced LLM Techniques
AI Mastery 201: Elevating Your Workflow with Advanced LLM Techniques
VictorSzoltysek
 
The title is not connected to what is inside
The title is not connected to what is insideThe title is not connected to what is inside
The title is not connected to what is inside
shinachiaurasa2
 
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
masabamasaba
 
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
Health
 
%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...
%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...
%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...
masabamasaba
 
TECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providerTECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service provider
mohitmore19
 
%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...
%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...
%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...
masabamasaba
 

Recently uploaded (20)

Direct Style Effect Systems - The Print[A] Example - A Comprehension Aid
Direct Style Effect Systems -The Print[A] Example- A Comprehension AidDirect Style Effect Systems -The Print[A] Example- A Comprehension Aid
Direct Style Effect Systems - The Print[A] Example - A Comprehension Aid
 
%in Hazyview+277-882-255-28 abortion pills for sale in Hazyview
%in Hazyview+277-882-255-28 abortion pills for sale in Hazyview%in Hazyview+277-882-255-28 abortion pills for sale in Hazyview
%in Hazyview+277-882-255-28 abortion pills for sale in Hazyview
 
%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain
%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain
%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain
 
AI Mastery 201: Elevating Your Workflow with Advanced LLM Techniques
AI Mastery 201: Elevating Your Workflow with Advanced LLM TechniquesAI Mastery 201: Elevating Your Workflow with Advanced LLM Techniques
AI Mastery 201: Elevating Your Workflow with Advanced LLM Techniques
 
Chinsurah Escorts ☎️8617697112 Starting From 5K to 15K High Profile Escorts ...
Chinsurah Escorts ☎️8617697112  Starting From 5K to 15K High Profile Escorts ...Chinsurah Escorts ☎️8617697112  Starting From 5K to 15K High Profile Escorts ...
Chinsurah Escorts ☎️8617697112 Starting From 5K to 15K High Profile Escorts ...
 
%in tembisa+277-882-255-28 abortion pills for sale in tembisa
%in tembisa+277-882-255-28 abortion pills for sale in tembisa%in tembisa+277-882-255-28 abortion pills for sale in tembisa
%in tembisa+277-882-255-28 abortion pills for sale in tembisa
 
Right Money Management App For Your Financial Goals
Right Money Management App For Your Financial GoalsRight Money Management App For Your Financial Goals
Right Money Management App For Your Financial Goals
 
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdfThe Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
 
Unlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language ModelsUnlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language Models
 
The title is not connected to what is inside
The title is not connected to what is insideThe title is not connected to what is inside
The title is not connected to what is inside
 
Software Quality Assurance Interview Questions
Software Quality Assurance Interview QuestionsSoftware Quality Assurance Interview Questions
Software Quality Assurance Interview Questions
 
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
 
Microsoft AI Transformation Partner Playbook.pdf
Microsoft AI Transformation Partner Playbook.pdfMicrosoft AI Transformation Partner Playbook.pdf
Microsoft AI Transformation Partner Playbook.pdf
 
The Top App Development Trends Shaping the Industry in 2024-25 .pdf
The Top App Development Trends Shaping the Industry in 2024-25 .pdfThe Top App Development Trends Shaping the Industry in 2024-25 .pdf
The Top App Development Trends Shaping the Industry in 2024-25 .pdf
 
Exploring the Best Video Editing App.pdf
Exploring the Best Video Editing App.pdfExploring the Best Video Editing App.pdf
Exploring the Best Video Editing App.pdf
 
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
 
%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...
%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...
%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...
 
TECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providerTECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service provider
 
%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...
%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...
%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...
 
%in Midrand+277-882-255-28 abortion pills for sale in midrand
%in Midrand+277-882-255-28 abortion pills for sale in midrand%in Midrand+277-882-255-28 abortion pills for sale in midrand
%in Midrand+277-882-255-28 abortion pills for sale in midrand
 

Small intro to Big Data - Old version

  • 1. Small intro to Big Data Michał Matłoka @mmatloka
  • 2. Outline 1. What is Big Data? 2. Storing 3. Batch & Streams processing 4. Resource Managers 5. Machine Learning 6. Analysis & Visualization 7. Other
  • 3. What is Big Data? ● Volume ● Velocity ● Variety
  • 4. How big is Big?
  • 7. CAP theorem (Brewer’s theorem) In distributed system you can only have two of three guarantees: ● Consistency ● Availability ● Partition Tolerance
  • 8. Relational scaling (horizontal) Example limitations: ● Max 48 nodes ● Read-only nodes ● Cross-shard joins… ● Auto-increments ● Distributed transactions, possible, but… It can work!
  • 9. You don’t always need ACID
  • 10. BASE might be enough
  • 11. NoSQL (Not only SQL) ● Key-value (Redis, Dynamo, ...) ● Column (Cassandra, HBase, ...) ● Document (MongoDB, … ) ● Graph (Neo4J, … ● Multi-model (OrientDB, …) Apple - 115k Cassandra nodes with over 10PB of data!
  • 13. Batch Processing ● Processes data from 1 or more sources from bigger period of time (e.g. day, month) ● Source: db, Apache Parquet, ... ● Not real-time ● Can take hours or more
  • 14. Apache Hadoop ● Based on Google paper ● First release in 2006! ● Map -> (shuffle) -> Reduce ● Was the beginning of many projectsMapReduce
  • 15. public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable>{ private final static IntWritable one = new IntWritable(1); private Text word = new Text(); public void map(Object key, Text value, Context context ) throws IOException, InterruptedException { StringTokenizer itr = new StringTokenizer(value.toString()); while (itr.hasMoreTokens()) { word.set(itr.nextToken()); context.write(word, one); } } } Hadoop Wordcount - part I Source: https://hadoop.apache.org/docs/stable/hadoop-mapreduce-client/hadoop-mapreduce-client- core/MapReduceTutorial.html
  • 16. public static class IntSumReducer extends Reducer<Text,IntWritable,Text,IntWritable> { private IntWritable result = new IntWritable(); public void reduce(Text key, Iterable<IntWritable> values, Context context ) throws IOException, InterruptedException { int sum = 0; for (IntWritable val : values) { sum += val.get(); } result.set(sum); context.write(key, result); } } Hadoop Wordcount - part II Source: https://hadoop.apache.org/docs/stable/hadoop-mapreduce-client/hadoop-mapreduce-client- core/MapReduceTutorial.html
  • 17. public static void main(String[] args) throws Exception { Configuration conf = new Configuration(); Job job = Job.getInstance(conf, "word count"); job.setJarByClass(WordCount.class); job.setMapperClass(TokenizerMapper.class); job.setCombinerClass(IntSumReducer.class); job.setReducerClass(IntSumReducer.class); job.setOutputKeyClass(Text.class); job.setOutputValueClass(IntWritable.class); FileInputFormat.addInputPath(job, new Path(args[0])); FileOutputFormat.setOutputPath(job, new Path(args[1])); System.exit(job.waitForCompletion(true) ? 0 : 1); } Hadoop Wordcount - part III Source: https://hadoop.apache.org/docs/stable/hadoop-mapreduce-client/hadoop-mapreduce-client- core/MapReduceTutorial.html
  • 18. Apache Spark RDD (Resilient Distributed Dataset) DAG (Directed acyclic graph) ● RDD - map, filter, count etc ● Spark SQL ● MLib ● GraphX ● Spark Streaming ● API: Scala, Java, Python, R*
  • 19. val textFile = sc.textFile("hdfs://...") val counts = textFile .flatMap(line => line.split(" ")) .map(word => (word, 1)) .reduceByKey(_ + _) counts.saveAsTextFile("hdfs://...") Spark Wordcount Source: http://spark.apache.org/examples.html
  • 20. Apache Flink DAGs with iterations ● Batch & native streaming ● FlinkML ● Table API & SQL (Beta) ● Gelly - graph analytics ● FlinkCEP - detect patterns in data streams ● Compatible with Apache Hadoop and Apache Storm APIs ● API: Scala, Java, Python*
  • 21. Stream processing ● Near real-time/real time ● Processing (usually) does not end ● Source: files, Apache Kafka, Socket, Akka Actors, Twitter, RabbitMQ etc ● Event time vs processing time ● Windows - fixed, sliding, session ● Watermarks ● State
  • 22. Stream processing ● Native/micro-batch ● Latency ● Throughput ● Delivery guarantees ● Resources managers ● API - compositional/declarative ● Maturity Differences
  • 23. Stream processing ● Apache Storm ● Apache Storm Trident ● Alibaba JStorm ● Twitter Heron ● Apache Spark Streaming ● Apache Flink ● Apache Beam ● Apache Kafka Streams ● Apache Samza ● Apache Gearpump ● Apache Apex ● Apache Ignite Streaming ● Apache S4 ● ...
  • 24. Resource management ● Apache YARN ● Apache Mesos (1.0.1!) ● Apache Slider - deploy existing apps on YARN ● Apache Myriad - YARN on Mesos ● DC/OS
  • 26. Analysis SQL Engines & Querying ● Apache Hive ● Apache Pig ● Apache HAWQ ● Apache Impala ● Apache Phoenix ● Apache Spark SQL ● Apache Drill ● Facebook Presto ● ...
  • 27. Machine Learning ● Apache Mahout ● Apache Samoa ● Spark MLib ● FlinkML ● H2O ● TensorFlow
  • 30. Hadoop-related ● Apache Sqoop ● Apache Flume ● Apache Oozie ● Hue ● Apache HDFS ● Apache Ambari ● Apache Knox ● Apache ZooKeeper
  • 32. Conclusions ● There is a lot of it! ● https://pixelastic.github.io/pokem onorbigdata/ ● If you want to learn, start with SMACK stack (Spark, Mesos, Akka, Cassandra, Kafka)
  • 33. Articles & references ● https://databaseline.wordpress.com/2016/03/12/an-overview-of-apache-streaming-t echnologies/ ● http://www.cakesolutions.net/teamblogs/comparison-of-apache-stream-processing- frameworks-part-1 ● https://dcos.io/ ● https://www.oreilly.com/ideas/a-tale-of-two-clusters-mesos-and-yarn ● http://spark.apache.org/ ● https://flink.apache.org/ ● http://www.51zero.com/blog/2015/12/13/why-apache-flink-is-the-4th-generation-of- big-data-analytics-frameworks ● http://www.slideshare.net/AndyPiper1/reactconf-2014-event-stream-processing ● https://www.infoq.com/articles/cap-twelve-years-later-how-the-rules-have-changed ● https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101 ● https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-102