葉祐欣 (Evans Ye)
Big Data Conference 2015
Trend Micro Big Data Platform and Apache Bigtop
Who am I
• Apache Bigtop PMC member
• Apache Big Data Europe 2015 Speaker
• Software Engineer @ Trend Micro
• Develop big data apps & infra
• Has some experience in Hadoop, HBase, Pig,
Spark, Kafka, Fluentd, Akka, and Docker
Outline
• Quick Intro to Bigtop
• Trend Micro Big Data Platform
• Mission-specific Platform
• Big Data Landscape
• Bigtop 1.1 Release
Quick Intro to Bigtop
Linux Distributions
Hadoop Distributions
We’re fully open sourced !
How do I add patches?
From source code to packages
Bigtop Packaging
Bigtop feature set
Packaging, Testing, Deployment, Virtualization
for you to easily build your own Big Data Stack
Supported components
• $ git clone https://github.com/apache/bigtop.git
• $ docker run --rm --volume `pwd`/bigtop:/bigtop --workdir /bigtop bigtop/slaves:trunk-centos-7 bash -l -c './gradlew rpm'
One click to build packages
• $ ./gradlew tasks
Easy to do CI
ci.bigtop.apache.org
RPM/DEB packages
www.apache.org/dist/bigtop
One click Hadoop provisioning
./docker-hadoop.sh -c 3
bigtop/deploy image on Docker hub
puppet apply (shown once for each of the 3 containers)
Just google bigtop provisioner
Should I use Bigtop?
If you want to build your own customised Big Data Stack
Curves ahead…
Pros & cons
• Bigtop
• You need a talented Hadoop team
• Self-service: troubleshoot, find solutions, develop patches
• Add any patch at any time you want (additional efforts)
• Choose any version of component you want (additional efforts)
• Vendors (Hortonworks, Cloudera, etc)
• Better support, since they’re the people who write the code!
• $
Trend Micro Big Data Platform
• Use Bigtop as the basis for our internal custom
distribution of Hadoop
• Apply community, private patches to upstream
projects for business and operational need
• Newest TMH7 is based on Bigtop 1.0 SNAPSHOT
Trend Micro Hadoop (TMH)
Working with community made our life easier
• Knowing community status made the TMH7 release based on Bigtop 1.0 SNAPSHOT possible
• Contribute Bigtop Provisioner, packaging code, puppet recipes, bugfixes, CI infra, anything!
• Leverage Bigtop smoke tests and integration tests with Bigtop Provisioner to evaluate TMH7
• Contribute feedback, evaluation, use case through production-level adoption
Trend Micro Big Data Stack
Powered by Bigtop
[Diagram: layered stack. Layers: Resource Management, Storage, Processing Engine, APIs and Interfaces, In-house Apps, Deployment. Components shown: Hadoop YARN, Hadoop HDFS, Mapreduce, Pig, HBase, Oozie, Wuji, Solr Cloud, Ad-hoc Query, UDFs, Kerberos, App A, App B, App C, App D. Deployment: Hadooppet (prod), Hadoocker (dev).]
Hadooppet
• Puppet recipes to deploy and manage TMH Big Data Platform
• HDFS, YARN, HA auto-configured
• Kerberos, LDAP auto-configured
• Kerberos cross realm authentication auto-configured (for distcp to run across secured clusters)
• A DevOps toolkit for Hadoop app developers to develop and test their code on
• Big Data Stack preloaded images
-> dev & test env w/o deployment
-> support end-to-end CI test
• A Hadoop env for apps to test against a new Hadoop distribution
• https://github.com/evans-ye/hadoocker
Hadoocker
Docker based dev & test env
[Diagram labels: ./execute.sh, internal Docker registry, TMH7, Hadoop server, Hadoop client, Hadoop app with RESTful APIs, data, sample data loaded with hadoop fs put, dependency services Solr and Oozie (Wuji).]
Mission-specific Platform
Use case
• Real-time streaming data flows in
• Lookup external info when data flows in
• Detect threat/malicious activities on streaming data
• Correlate with other historical data (batch query) to gather
more info
• Can also run batch detections by specifying arbitrary start
time and end time
• Support Investigation down to raw log level
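The use case above implies that one piece of detection logic has to serve both live events and an arbitrary historical time range. A minimal Python sketch of that idea follows. The event fields (`ts`, `src`, `bytes`), the `lookup_ext_info` enrichment, and the threshold rule are all hypothetical and are not taken from the talk.

```python
from typing import Dict, Iterable, Iterator, List

# Hypothetical external info table (stands in for a real lookup service).
EXT_INFO = {"10.0.0.9": "known-bad"}

def lookup_ext_info(event: Dict) -> Dict:
    """Enrich an event with external info as it flows in."""
    return {**event, "reputation": EXT_INFO.get(event["src"], "unknown")}

def is_threat(event: Dict) -> bool:
    """Made-up detection rule, used only for illustration."""
    return event["reputation"] == "known-bad" or event["bytes"] > 1_000_000

def detect_stream(events: Iterable[Dict]) -> Iterator[Dict]:
    """Streaming path: enrich and test each event as it arrives."""
    for event in events:
        enriched = lookup_ext_info(event)
        if is_threat(enriched):
            yield enriched

def detect_batch(history: List[Dict], start: int, end: int) -> List[Dict]:
    """Batch path: rerun the same logic on stored events in [start, end)."""
    in_range = (e for e in history if start <= e["ts"] < end)
    return list(detect_stream(in_range))

history = [
    {"ts": 100, "src": "10.0.0.1", "bytes": 500},
    {"ts": 200, "src": "10.0.0.9", "bytes": 700},
    {"ts": 300, "src": "10.0.0.2", "bytes": 2_000_000},
]
stream_hits = list(detect_stream(history))
batch_hits = detect_batch(history, start=150, end=250)
```

Because the batch path calls the streaming path, a change to the rule applies to both without duplication.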
Lambda Architecture
[Diagram, built up over several slides: receiver -> buffer -> transformation, lookup ext info -> batch / streaming]
Kafka
• High-throughput, distributed publish-subscribe messaging system
• Supports multiple consumers attached to a topic
• Configurable partition (shard) count and replication factor
• Load-balances within the same consumer group
• Each message is consumed only once per consumer group
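How load balancing inside a consumer group interacts with fan-out across groups can be sketched in a few lines of Python. This is a toy round-robin assignment written for illustration. It is not Kafka's actual assignor, and the consumer names are made up.

```python
from collections import defaultdict
from typing import Dict, List

def assign_partitions(partitions: List[int], consumers: List[str]) -> Dict[str, List[int]]:
    """Round-robin the partitions of a topic over the consumers of ONE group."""
    assignment: Dict[str, List[int]] = defaultdict(list)
    for i, partition in enumerate(sorted(partitions)):
        assignment[consumers[i % len(consumers)]].append(partition)
    return dict(assignment)

partitions = [0, 1, 2, 3, 4, 5]

# Group A has three consumers, so each one owns two partitions.
group_a = assign_partitions(partitions, ["a1", "a2", "a3"])

# Group B has a single consumer, which owns every partition.
group_b = assign_partitions(partitions, ["b1"])
```

Each partition has exactly one owner inside a group, so a message is handled once per group, while every group still sees all partitions.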
Cassandra
• Distributed NoSQL key-value storage, no SPOF
• Very fast writes, suitable for data that keeps coming in
• Decent read performance, if you design it right
• Build data model around your queries
• Spark Cassandra Connector
• Configurable CA (CAP theorem)
• Choose A over C for availability, and vice versa
Dynamo: Amazon’s Highly Available Key-value Store
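The trade-off between consistency and availability is usually tuned per request through the number of replicas that must acknowledge a write (W) and a read (R). A small Python sketch of the commonly cited rule that reads see the latest write when R + W > N, with N the replication factor, is given below. The function names are illustrative and are not part of any Cassandra driver.

```python
def quorum(replication_factor: int) -> int:
    """Smallest majority of the replicas."""
    return replication_factor // 2 + 1

def is_strongly_consistent(replication_factor: int,
                           write_replicas: int,
                           read_replicas: int) -> bool:
    """Read and write replica sets must overlap: R + W > N."""
    return read_replicas + write_replicas > replication_factor

rf = 3
# QUORUM writes and QUORUM reads: 2 + 2 > 3, favouring consistency.
quorum_both = is_strongly_consistent(rf, quorum(rf), quorum(rf))
# ONE write and ONE read: 1 + 1 <= 3, favouring availability and latency.
one_one = is_strongly_consistent(rf, 1, 1)
```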
Spark
• Fast, distributed, in-memory processing engine
• One system for streaming and batch workloads
• Spark Streaming
Akka
• High performance concurrency framework for Java and Scala
• Actor model for message-driven processing
• Asynchronous by design to achieve high throughput
• Each message is handled in a single-threaded context (no locks or synchronization needed)
• Let-it-crash model for fault tolerance and auto-healing system
• Clustering mechanism to scale out
The Road to Akka Cluster, and Beyond
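The "single-threaded context" point can be illustrated without Akka. The Python sketch below is a toy actor with one mailbox and one thread. It only mimics the idea and says nothing about how Akka schedules actors. The class and method names are invented for the example.

```python
import queue
import threading

class CounterActor:
    """Toy actor: one mailbox, one thread, state touched only by that thread."""

    _STOP = object()

    def __init__(self) -> None:
        self._mailbox: queue.Queue = queue.Queue()
        self.count = 0
        self._thread = threading.Thread(target=self._run)
        self._thread.start()

    def tell(self, message) -> None:
        """Asynchronous, fire-and-forget send."""
        self._mailbox.put(message)

    def _run(self) -> None:
        while True:
            message = self._mailbox.get()
            if message is self._STOP:
                break
            # No lock needed: only the actor's own thread mutates the state.
            self.count += message

    def stop(self) -> None:
        """Queue a stop marker behind all pending messages and wait."""
        self.tell(self._STOP)
        self._thread.join()

def send_many(actor: CounterActor, n: int) -> None:
    for _ in range(n):
        actor.tell(1)

actor = CounterActor()
senders = [threading.Thread(target=send_many, args=(actor, 1000)) for _ in range(4)]
for s in senders:
    s.start()
for s in senders:
    s.join()
actor.stop()
total = actor.count
```

Four threads send concurrently, yet the count stays correct because messages are processed one at a time from the mailbox.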
Akka Streams
• Akka Streams is a DSL library for streaming computation on Akka
• Materializer to transform each step into Actor
• Back-pressure enabled by default
Source Flow Sink
The Reactive Manifesto
No back-pressure
[Diagram: a fast Source pushes into a slow Sink, and unprocessed elements pile up at the Sink.]
With back-pressure
[Diagram: the slow Sink sends "request 3" upstream, and the fast Source emits only as many elements as requested.]
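The demand signal behind back-pressure can be sketched in plain Python. This is a synchronous simulation of the "request n" idea only. Real Akka Streams stages run asynchronously, and the `Source` and `SlowSink` classes here are invented for the example.

```python
from typing import Iterator, List

class Source:
    """Emits elements only when downstream has signalled demand."""

    def __init__(self, elements: Iterator[int]) -> None:
        self._elements = elements

    def request(self, n: int) -> List[int]:
        """Hand over at most n elements, never more than was asked for."""
        batch: List[int] = []
        for _ in range(n):
            try:
                batch.append(next(self._elements))
            except StopIteration:
                break
        return batch

class SlowSink:
    """Pulls a bounded batch, processes it, then asks for more."""

    def __init__(self, source: Source, batch_size: int = 3) -> None:
        self._source = source
        self._batch_size = batch_size
        self.received: List[int] = []
        self.max_buffered = 0

    def run(self) -> None:
        while True:
            buffer = self._source.request(self._batch_size)
            if not buffer:
                break
            self.max_buffered = max(self.max_buffered, len(buffer))
            self.received.extend(buffer)  # "slow" processing happens here

sink = SlowSink(Source(iter(range(10))), batch_size=3)
sink.run()
```

However fast the source could produce, the sink never holds more than `batch_size` elements at once.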
Data pipeline with Akka Streams
• Scale up using balance and merge
[Diagram: balance -> three workers -> merge]
source: http://doc.akka.io/docs/akka-stream-and-http-experimental/1.0/scala/stream-cookbook.html#working-with-flows
• Scale out using docker
$ docker-compose scale pipeline=3
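The balance and merge pattern fans elements out to a pool of identical workers and collects their results into one stream. A rough Python analogue using a thread pool is sketched below. It is not Akka Streams code, and `slow_square` is a placeholder for a real processing step.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List

def balance_and_merge(items: List[int],
                      worker: Callable[[int], int],
                      parallelism: int = 3) -> List[int]:
    """Fan items out to `parallelism` workers, then merge the results."""
    with ThreadPoolExecutor(max_workers=parallelism) as pool:
        # map() gives each item to a free worker thread (balance)
        # and yields the results as a single sequence (merge).
        return list(pool.map(worker, items))

def slow_square(x: int) -> int:
    """Placeholder for an expensive per-element step."""
    return x * x

results = balance_and_merge([1, 2, 3, 4, 5, 6], slow_square)
```

One difference worth noting: `pool.map` returns results in input order, whereas a merge stage in a streaming graph generally emits elements as workers finish.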
Reactive Kafka
• Akka Streams wrapper for Kafka
• Commit processed offset back into Kafka
• Provide at-least-once delivery guarantee
https://github.com/softwaremill/reactive-kafka
Message delivery guarantee
• Actor Model: at-most-once
• Akka Persistence: at-least-once
• Persist log to external storage (like WAL)
• Reactive Kafka: at-least-once + back-pressure
• Write offset back into Kafka
• At-least-once + Idempotent writes = exactly-once
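Why at-least-once delivery plus idempotent writes behaves like exactly-once can be shown with a small Python sketch. The message ids and the two stores are made up for illustration. The duplicate stands in for a message replayed after a crash that happened before its offset was committed.

```python
from typing import Dict, List, Tuple

Message = Tuple[str, int]  # (message id, value)

def append_write(store: List[int], msg: Message) -> None:
    """Non-idempotent write: a redelivered message is stored twice."""
    store.append(msg[1])

def idempotent_write(store: Dict[str, int], msg: Message) -> None:
    """Idempotent upsert keyed by message id: a redelivery overwrites itself."""
    store[msg[0]] = msg[1]

# At-least-once delivery: "m2" arrives twice.
delivered: List[Message] = [("m1", 10), ("m2", 20), ("m2", 20), ("m3", 30)]

log: List[int] = []
table: Dict[str, int] = {}
for msg in delivered:
    append_write(log, msg)
    idempotent_write(table, msg)

naive_total = sum(log)             # counts the duplicate
exact_total = sum(table.values())  # duplicate has no effect
```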
• Spark: both streaming and batch analytics
• Docker: resource management (fine for one app)
• Akka: fine-grained, elastic data pipelines
• Cassandra: batch queries
• Kafka: durable buffer, fan-out to multiple consumers
Recap: SDACK Stack
Your mileage may vary
we’re still evolving
Remember this:
The SMACK Stack
Toolbox for a wide variety of data processing scenarios
SMACK Stack
• Spark: fast and general engine for large-scale data
processing
• Mesos: cluster resource management system
• Akka: toolkit and runtime for building highly concurrent,
distributed, and resilient message-driven applications
• Cassandra: distributed, highly available database designed
to handle large amounts of data across datacenters
• Kafka: high-throughput, low-latency distributed pub-sub
messaging system for real-time data feeds
Source: http://www.slideshare.net/akirillov/data-processing-platforms-architectures-with-spark-mesos-akka-cassandra-and-kafka
Reference
• Spark Summit Europe 2015
• Streaming Analytics with Spark, Kafka,
Cassandra, and Akka (Helena Edelson)
• Big Data AW Meetup
• SMACK Architectures (Anton Kirillov)
Big Data Landscape
• Memory is faster than SSD/disk, and is getting cheaper
• In Memory Computing & Fast Data
• Spark : In memory batch/streaming engine
• Flink : In memory streaming/batch engine
• Ignite : In memory data fabric
• Geode (incubating) : In memory database
Big Data moving trend
• Off-heap storage is JVM process memory outside of the heap, allocated and managed using native calls.
• Its size is not limited by the JVM heap settings (it is limited by physical memory)
• It is not subject to GC, which essentially removes long GC pauses
• Project Tungsten, Flink, Ignite, Geode, HBase
Off-Heap, Off-Heap, Off-Heap
(Some) Apache Big Data Components
[Diagram: layers Resource Management, Storage, Processing Engine, APIs and Interfaces. Components shown: Hadoop YARN, Hadoop HDFS, Mesos, Slider, Tez, Pig, Hive, Phoenix, HBase, Trafodion, Flink (Flink ML, Gelly), Spark (Streaming, MLlib, GraphX), Kafka (messaging system), Ignite and Geode (in memory data grid), Solr (search engine), Cassandra (NoSQL), Bigtop (Hadoop Distribution), Ambari (Hadoop Management).]
Bigtop 1.1 Release
Jan, 2016, I expect…
• Hadoop 2.7.1
• Spark 1.5.1
• Hive 1.2.1
• Pig 0.15.0
• Oozie 4.2.0
• Flume 1.6.0
• Zeppelin 0.5.5
• Ignite Hadoop 1.5.0
• Phoenix 4.6.0
• Hue 3.8.1
• Crunch 0.12
• …, 24 components included!
Hadoop 2.6
• Heterogeneous Storages
• SSD + hard drive
• Placement policy (all_ssd, hot, warm, cold)
• Archival Storage (cost saving)
• HDFS-7285 (Hadoop 3.0)
• Erasure coding to cut storage overhead from 3X to 1.5X
http://www.slideshare.net/Hadoop_Summit/reduce-storage-costs-by-5x-using-the-new-hdfs-tiered-storage-feature
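The 3X and 1.5X figures follow from simple arithmetic. The Python sketch below assumes a Reed-Solomon layout with 6 data blocks and 3 parity blocks. The slide does not name the layout, so treat it as an assumption chosen because it reproduces the 1.5X figure.

```python
def replication_overhead(replicas: int) -> float:
    """Raw bytes stored per byte of user data with plain replication."""
    return float(replicas)

def erasure_coding_overhead(data_blocks: int, parity_blocks: int) -> float:
    """Raw bytes stored per byte of user data with erasure coding."""
    return (data_blocks + parity_blocks) / data_blocks

three_way = replication_overhead(3)        # three full copies
rs_6_3 = erasure_coding_overhead(6, 3)     # 9 blocks stored for 6 blocks of data
saving = 1 - rs_6_3 / three_way            # fraction of raw storage saved
```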
Hadoop 2.7
• Transparent encryption (encryption zone)
• Available in 2.6
• Known issue: Encryption is sometimes done
incorrectly (HADOOP-11343)
• Fixed in 2.7
http://events.linuxfoundation.org/sites/events/files/slides/HDFS2015_Past_present_future.pdf
Rising star: Flink
• Streaming dataflow engine
• Treat batch computing as fixed length streaming
• Exactly-once by distributed snapshotting
• Event time handling by watermarks
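Event-time handling with watermarks can be sketched in Python as follows. This is a simplified model and not Flink's API. The watermark is taken as the highest timestamp seen minus a fixed allowed lateness, and a tumbling window fires once the watermark reaches its end. All names and numbers are illustrative.

```python
from typing import Dict, List, Tuple

def tumbling_window_counts(
    events: List[int], window_size: int, max_out_of_orderness: int
) -> Tuple[List[Tuple[int, int]], List[int], Dict[int, int]]:
    """Count events per event-time window, firing windows on watermark progress."""
    open_windows: Dict[int, int] = {}   # window start -> count so far
    fired: List[Tuple[int, int]] = []   # (window start, final count)
    dropped: List[int] = []             # events that arrived after their window fired
    watermark = float("-inf")
    for ts in events:
        start = ts - ts % window_size
        if start + window_size <= watermark:
            dropped.append(ts)
            continue
        open_windows[start] = open_windows.get(start, 0) + 1
        watermark = max(watermark, ts - max_out_of_orderness)
        ready = sorted(s for s in open_windows if s + window_size <= watermark)
        for s in ready:
            fired.append((s, open_windows.pop(s)))
    return fired, dropped, open_windows

# Timestamps arrive out of order: 3 after 4, 8 after 12, and 2 very late.
fired, dropped, still_open = tumbling_window_counts(
    [1, 4, 3, 12, 8, 15, 2, 25], window_size=10, max_out_of_orderness=5
)
```

The event with timestamp 8 arrives after 12 but is still counted in the first window, while the event with timestamp 2 arrives after that window has fired and is dropped.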
• Integrate and package Apache Flink
• Re-implement Bigtop Provisioner using docker-machine, compose, swarm
• Deploy containers on multiple hosts
• Support any kind of base image for deployment
Bigtop Roadmap
Wrap up
• Hadoop Distribution
• Choose Bigtop if you want more control
• The SMACK Stack
• Toolbox for a variety of data processing scenarios
• Big Data Landscape
• In-memory, off-heap solutions are hot
Questions ?
Thank you !

Active Directory Penetration Testing, cionsystems.com.pdfCionsystems
 
A Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docxA Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docxComplianceQuest1
 
Hand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptxHand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptxbodapatigopi8531
 
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...Steffen Staab
 
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...ICS
 
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...harshavardhanraghave
 
Optimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTVOptimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTVshikhaohhpro
 
Clustering techniques data mining book ....
Clustering techniques data mining book ....Clustering techniques data mining book ....
Clustering techniques data mining book ....ShaimaaMohamedGalal
 
Test Automation Strategy for Frontend and Backend
Test Automation Strategy for Frontend and BackendTest Automation Strategy for Frontend and Backend
Test Automation Strategy for Frontend and BackendArshad QA
 
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdfThe Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdfkalichargn70th171
 
Diamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with PrecisionDiamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with PrecisionSolGuruz
 
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdfLearn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdfkalichargn70th171
 
Unlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language ModelsUnlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language Modelsaagamshah0812
 
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online ☂️
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online  ☂️CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online  ☂️
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online ☂️anilsa9823
 
Right Money Management App For Your Financial Goals
Right Money Management App For Your Financial GoalsRight Money Management App For Your Financial Goals
Right Money Management App For Your Financial GoalsJhone kinadey
 
DNT_Corporate presentation know about us
DNT_Corporate presentation know about usDNT_Corporate presentation know about us
DNT_Corporate presentation know about usDynamic Netsoft
 

Recently uploaded (20)

HR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.comHR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.com
 
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
 
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
 
Active Directory Penetration Testing, cionsystems.com.pdf
Active Directory Penetration Testing, cionsystems.com.pdfActive Directory Penetration Testing, cionsystems.com.pdf
Active Directory Penetration Testing, cionsystems.com.pdf
 
A Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docxA Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docx
 
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
Hand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptxHand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptx
 
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
 
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
 
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
 
Optimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTVOptimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTV
 
Clustering techniques data mining book ....
Clustering techniques data mining book ....Clustering techniques data mining book ....
Clustering techniques data mining book ....
 
Test Automation Strategy for Frontend and Backend
Test Automation Strategy for Frontend and BackendTest Automation Strategy for Frontend and Backend
Test Automation Strategy for Frontend and Backend
 
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdfThe Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
 
Diamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with PrecisionDiamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with Precision
 
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdfLearn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
 
Unlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language ModelsUnlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language Models
 
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online ☂️
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online  ☂️CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online  ☂️
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online ☂️
 
Right Money Management App For Your Financial Goals
Right Money Management App For Your Financial GoalsRight Money Management App For Your Financial Goals
Right Money Management App For Your Financial Goals
 
DNT_Corporate presentation know about us
DNT_Corporate presentation know about usDNT_Corporate presentation know about us
DNT_Corporate presentation know about us
 

Trend Micro Big Data Platform and Apache Bigtop

  • 1. 葉祐欣 (Evans Ye) Big Data Conference 2015 Trend Micro Big Data Platform and Apache Bigtop
  • 2. Who am I • Apache Bigtop PMC member • Apache Big Data Europe 2015 Speaker • Software Engineer @ Trend Micro • Develop big data apps & infra • Has some experience in Hadoop, HBase, Pig, Spark, Kafka, Fluentd, Akka, and Docker
  • 3. Outline • Quick Intro to Bigtop • Trend Micro Big Data Platform • Mission-specific Platform • Big Data Landscape (3p) • Bigtop 1.1 Release (6p)
  • 4. Quick Intro to Bigtop
  • 8. How do I add patches?
  • 10. From source code to packages • Bigtop Packaging
  • 11. Bigtop feature set Packaging Testing Deployment Virtualization for you to easily build your own Big Data Stack
  • 13. One click to build packages • $ git clone https://github.com/apache/bigtop.git • $ docker run --rm --volume `pwd`/bigtop:/bigtop --workdir /bigtop bigtop/slaves:trunk-centos-7 bash -l -c './gradlew rpm'
  • 14. • $ ./gradlew tasks
  • 15. Easy to do CI ci.bigtop.apache.org
  • 17. One click Hadoop provisioning ./docker-hadoop.sh -c 3
  • 18. One click Hadoop provisioning • bigtop/deploy image on Docker Hub • ./docker-hadoop.sh -c 3
  • 19. One click Hadoop provisioning • bigtop/deploy image on Docker Hub • ./docker-hadoop.sh -c 3 • puppet apply (x3) • Just google "bigtop provisioner"
  • 20. Should I use Bigtop?
  • 21. If you want to build your own customised Big Data Stack
  • 23. Pros & cons • Bigtop • You need a talented Hadoop team • Self-service: troubleshoot, find solutions, develop patches • Add any patch at any time you want (additional effort) • Choose any version of any component you want (additional effort) • Vendors (Hortonworks, Cloudera, etc.) • Better support, since they're the ones who write the code! • $
  • 24. Trend Micro Big Data Platform
  • 25. Trend Micro Hadoop (TMH) • Use Bigtop as the basis for our internal custom distribution of Hadoop • Apply community and private patches to upstream projects for business and operational needs • Newest TMH7 is based on Bigtop 1.0 SNAPSHOT
  • 27. Working with community made our life easier • Contribute Bigtop Provisioner, packaging code, puppet recipes, bugfixes, CI infra, anything! • Knowing the community's status made the TMH7 release, based on Bigtop 1.0 SNAPSHOT, possible
  • 29. Working with community made our life easier • Contribute feedback, evaluations, and use cases through production-level adoption • Leverage Bigtop smoke tests and integration tests with Bigtop Provisioner to evaluate TMH7
  • 30. Trend Micro Big Data Stack, Powered by Bigtop (architecture diagram) • Layers: Storage, Resource Management, Processing Engine, APIs and Interfaces, In-house Apps, Deployment • Components: Hadoop HDFS, Hadoop YARN, MapReduce, HBase, Solr Cloud, Pig, Oozie, Wuji, Ad-hoc Query, UDFs, Kerberos, App A, App B, App C, App D • Deployment: Hadooppet (prod), Hadoocker (dev)
  • 31. Hadooppet • Puppet recipes to deploy and manage TMH Big Data Platform • HDFS, YARN, HA auto-configured • Kerberos, LDAP auto-configured • Kerberos cross-realm authentication auto-configured (for distcp to run across secured clusters)
  • 33. Hadoocker • A DevOps toolkit for Hadoop app developers to develop and test their code on • Big Data Stack preloaded images -> dev & test env w/o deployment -> support end-to-end CI test • A Hadoop env for apps to test against a new Hadoop distribution • https://github.com/evans-ye/hadoocker
  • 34. Docker based dev & test env (diagram) • internal Docker registry • ./execute.sh • Hadoop server (TMH7) • Hadoop client • Hadoop app • RESTful APIs • sample data • data • hadoop fs put
  • 35. Docker based dev & test env (diagram, continued) • Same components as the previous slide, plus dependency services: Solr, Oozie (Wuji)
  • 37. Use case • Real-time streaming data flows in • Lookup external info when data flows in • Detect threat/malicious activities on streaming data • Correlate with other historical data (batch query) to gather more info • Can also run batch detections by specifying arbitrary start time and end time • Support Investigation down to raw log level
  • 44. Kafka • High-throughput, distributed publish-subscribe messaging system • Supports multiple consumers attached to a topic • Configurable partition (shard) count and replication factor • Load-balances within the same consumer group • Each message is consumed only once within a group
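The consumer-group behaviour on this slide can be illustrated with a toy partition assignment. This is only a sketch under stated assumptions: `assign_partitions`, the group names, and the round-robin rule are hypothetical and are not Kafka's actual assignor or client API. It shows the idea that each partition is owned by exactly one consumer inside a group, while separate groups each see every partition.

```python
def assign_partitions(partitions, consumers):
    """Spread partitions over the consumers of ONE group, round-robin.

    Hypothetical helper for illustration; real Kafka clients use
    pluggable assignors coordinated by the broker.
    """
    assignment = {consumer: [] for consumer in consumers}
    for index, partition in enumerate(sorted(partitions)):
        owner = consumers[index % len(consumers)]
        assignment[owner].append(partition)
    return assignment


partitions = [0, 1, 2, 3, 4, 5]

# Two independent (hypothetical) consumer groups attached to the same topic.
groups = {
    "detector": ["d1", "d2", "d3"],
    "archiver": ["a1", "a2"],
}

assignments = {
    group: assign_partitions(partitions, members)
    for group, members in groups.items()
}

for group, assignment in assignments.items():
    print(group, assignment)
```

Within a group the work is load-balanced and no partition has two owners, so a message is handled once per group; adding a second group fans the same data out again.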
  • 45. Cassandra • Distributed NoSQL key-value storage, no SPOF • Very fast writes, suitable for data that keeps coming in • Decent read performance if the data model is designed right • Build the data model around your queries • Spark Cassandra Connector • Configurable C/A trade-off (CAP theorem) • Choose A over C for availability, and vice versa • Reference: Dynamo: Amazon's Highly Available Key-value Store
  • 46. Spark • Fast, distributed, in-memory processing engine • One system for streaming and batch workloads • Spark Streaming
  • 47. Akka • High performance concurrency framework for Java and Scala • Actor model for message-driven processing • Asynchronous by design to achieve high throughput • Each message is handled in a single-threaded context (no locks or synchronization needed) • Let-it-crash model for fault tolerance and a self-healing system • Clustering mechanism to scale out • Reference: The Road to Akka Cluster, and Beyond
  • 48. Akka Streams • Akka Streams is a DSL library for streaming computation on Akka • Materializer to transform each step into an Actor • Back-pressure enabled by default • Source -> Flow -> Sink • Reference: The Reactive Manifesto
  • 49. No back-pressure • Source: Fast!!! • Sink: Slow… • (>﹏<) v( ̄︶ ̄)y
  • 52. With back-pressure • Source: Fast!!! • Sink: Slow… • Demand is signalled upstream: request 3
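The effect of back-pressure can be sketched with a bounded buffer between a fast source and a slow sink. This is an analogy only, not Akka Streams: it uses a blocking `queue.Queue` rather than the non-blocking demand signalling ("request 3") of Reactive Streams, and `run_pipeline`, `buffer_size`, and `sink_delay` are names made up for the illustration.

```python
import queue
import threading
import time


def run_pipeline(num_items, buffer_size, sink_delay=0.001):
    """Send num_items from a fast source to a slow sink through a bounded buffer.

    Returns the items the sink received and the deepest the buffer ever got.
    """
    buffer = queue.Queue(maxsize=buffer_size)  # put() blocks once the buffer is full
    done = object()  # sentinel that tells the sink to stop
    received = []
    depth = {"max": 0}

    def source():
        for item in range(num_items):
            buffer.put(item)  # the fast source is paused here while the sink lags
            depth["max"] = max(depth["max"], buffer.qsize())
        buffer.put(done)

    def sink():
        while True:
            item = buffer.get()
            if item is done:
                return
            time.sleep(sink_delay)  # simulate slow processing
            received.append(item)

    threads = [threading.Thread(target=source), threading.Thread(target=sink)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return received, depth["max"]


received, max_depth = run_pipeline(num_items=20, buffer_size=3)
print(len(received), max_depth)
```

Because the buffer is bounded, the source can never run more than `buffer_size` items ahead of the sink, so memory stays flat however slow the sink is. With an unbounded buffer the backlog would instead grow without limit, which is the "no back-pressure" situation.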
  • 53. Data pipeline with Akka Streams • Scale up using balance and merge (balance -> worker x 3 -> merge) • Source: http://doc.akka.io/docs/akka-stream-and-http-experimental/1.0/scala/stream-cookbook.html#working-with-flows
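The balance-and-merge shape can be approximated outside Akka with a thread pool: submitting to the pool plays the role of balance, and collecting results as they finish plays the role of merge. This is a rough sketch, not the Akka Streams API; `enrich` and `balance_and_merge` are hypothetical names, and, as with a merge stage, output order is not guaranteed.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def enrich(event):
    """Hypothetical worker stage: stands in for any per-message processing."""
    return event.upper()


def balance_and_merge(events, worker, parallelism=3):
    """Fan events out to `parallelism` workers, then merge the results.

    Results are collected in completion order, so the output order
    may differ from the input order.
    """
    with ThreadPoolExecutor(max_workers=parallelism) as pool:
        futures = [pool.submit(worker, event) for event in events]  # balance
        return [future.result() for future in as_completed(futures)]  # merge


results = balance_and_merge(["a", "b", "c", "d"], enrich)
print(sorted(results))
```

Raising `parallelism` scales the stage up within one process, which is the scale-up half of the story; scaling out across containers is a separate step.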
  • 54. Data pipeline with Akka Streams • Scale out using docker • $ docker-compose scale pipeline=3
  • 55. Reactive Kafka • Akka Streams wrapper for Kafka • Commit processed offset back into Kafka • Provide at-least-once delivery guarantee https://github.com/softwaremill/reactive-kafka
  • 56. Message delivery guarantee • Actor Model: at-most-once • Akka Persistence: at-least-once • Persist log to external storage (like WAL) • Reactive Kafka: at-least-once + back-pressure • Write offset back into Kafka • At-least-once + Idempotent writes = exactly-once
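The last point, at-least-once plus idempotent writes giving exactly-once results, can be shown with a small simulation. It is a sketch under stated assumptions: `deliver_at_least_once`, `IdempotentStore`, and `AppendOnlyStore` are hypothetical stand-ins, and keying writes by message offset is just one possible way to make them idempotent.

```python
def deliver_at_least_once(messages, redeliver_offsets):
    """Yield (offset, payload) pairs, repeating some offsets.

    The repeats mimic a consumer that processed a message but crashed
    before committing its offset, so the message is delivered again.
    """
    for offset, payload in enumerate(messages):
        yield offset, payload
        if offset in redeliver_offsets:
            yield offset, payload


class IdempotentStore:
    """Writes keyed by offset: replaying a message overwrites the same row."""

    def __init__(self):
        self.rows = {}

    def upsert(self, key, value):
        self.rows[key] = value


class AppendOnlyStore:
    """Non-idempotent writes: every delivery adds a new row."""

    def __init__(self):
        self.rows = []

    def append(self, value):
        self.rows.append(value)


messages = ["login", "scan", "alert"]
idempotent = IdempotentStore()
naive = AppendOnlyStore()

for offset, payload in deliver_at_least_once(messages, redeliver_offsets={1}):
    idempotent.upsert(offset, payload)
    naive.append(payload)

print(idempotent.rows)
print(naive.rows)
```

The append-only store ends up with a duplicate of the redelivered message, while the keyed store holds each message once, so the observable result matches exactly-once processing even though delivery was only at-least-once.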
  • 57. Recap: SDACK Stack • Spark: both streaming and batch analytics • Docker: resource management (fine for one app) • Akka: fine-grained, elastic data pipelines • Cassandra: batch queries • Kafka: durable buffer, fan-out to multiple consumers
  • 61. The SMACK Stack • Toolbox for a wide variety of data processing scenarios
  • 62. SMACK Stack • Spark: fast and general engine for large-scale data processing • Mesos: cluster resource management system • Akka: toolkit and runtime for building highly concurrent, distributed, and resilient message-driven applications • Cassandra: distributed, highly available database designed to handle large amounts of data across datacenters • Kafka: high-throughput, low-latency distributed pub-sub messaging system for real-time data feeds Source: http://www.slideshare.net/akirillov/data-processing-platforms-architectures-with-spark-mesos-akka-cassandra-and-kafka
  • 63. Reference • Spark Summit Europe 2015 • Streaming Analytics with Spark, Kafka, Cassandra, and Akka (Helena Edelson) • Big Data AW Meetup • SMACK Architectures (Anton Kirillov)
  • 65. Big Data moving trend • Memory is faster than SSD/disk, and is getting cheaper • In-Memory Computing & Fast Data • Spark: in-memory batch/streaming engine • Flink: in-memory streaming/batch engine • Ignite: in-memory data fabric • Geode (incubating): in-memory database
  • 66. Off-Heap, Off-Heap, Off-Heap • Off-heap storage is JVM process memory outside of the heap, allocated and managed using native calls • Its size is not limited by the JVM heap (only by physical memory) • It is not subject to GC, which essentially removes long GC pauses • Project Tungsten, Flink, Ignite, Geode, HBase
  • 67. (Some) Apache Big Data Components (diagram) • Layers: Storage, Resource Management, Processing Engine, APIs and Interfaces • Categories: messaging system, in-memory data grid, search engine, NoSQL, Hadoop Distribution, Hadoop Management • Components: Hadoop HDFS, Hadoop YARN, Mesos, Slider, Tez, Spark (Streaming, MLlib, GraphX), Flink (Flink ML, Gelly), Pig, Hive, Phoenix, Trafodion, Kafka, HBase, Cassandra, Ignite, Geode, Solr, Bigtop, Ambari
  • 68. Bigtop 1.1 Release Jan, 2016, I expect…
  • 69. Bigtop 1.1 Release • Hadoop 2.7.1 • Spark 1.5.1 • Hive 1.2.1 • Pig 0.15.0 • Oozie 4.2.0 • Flume 1.6.0 • Zeppelin 0.5.5 • Ignite Hadoop 1.5.0 • Phoenix 4.6.0 • Hue 3.8.1 • Crunch 0.12 • …, 24 components included!
  • 71. Hadoop 2.6 • Heterogeneous storage • SSD + hard drive • Placement policies (all_ssd, hot, warm, cold) • Archival storage (cost saving) • HDFS-7285 (Hadoop 3.0) • Erasure coding to cut storage overhead from 3X to 1.5X • Source: http://www.slideshare.net/Hadoop_Summit/reduce-storage-costs-by-5x-using-the-new-hdfs-tiered-storage-feature
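The 3X and 1.5X figures follow from simple arithmetic. The sketch below assumes a Reed-Solomon layout with 6 data blocks and 3 parity blocks, a scheme commonly associated with HDFS erasure coding; the slide itself does not name the scheme, and the function names and the 100 TB figure are illustrative.

```python
def replication_overhead(replicas):
    """Raw bytes stored per logical byte when every block is copied `replicas` times."""
    return float(replicas)


def erasure_coding_overhead(data_units, parity_units):
    """Raw bytes stored per logical byte for a (data, parity) erasure-coding layout."""
    return (data_units + parity_units) / data_units


def raw_storage_tb(logical_tb, overhead):
    """Raw capacity needed to hold `logical_tb` of user data."""
    return logical_tb * overhead


three_way = replication_overhead(3)        # classic HDFS 3x replication
rs_6_3 = erasure_coding_overhead(6, 3)     # assumed RS(6, 3) layout

print(three_way, rs_6_3)
print(raw_storage_tb(100, three_way), raw_storage_tb(100, rs_6_3))
```

Under these assumptions, 100 TB of user data needs 300 TB of raw capacity with 3x replication and 150 TB with RS(6, 3), while still tolerating the loss of any three blocks in a stripe.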
  • 72. Hadoop 2.7 • Transparent encryption (encryption zone) • Available in 2.6 • Known issue: encryption is sometimes done incorrectly (HADOOP-11343) • Fixed in 2.7 • Source: http://events.linuxfoundation.org/sites/events/files/slides/HDFS2015_Past_present_future.pdf
  • 73. Rising star: Flink • Streaming dataflow engine • Treat batch computing as fixed length streaming • Exactly-once by distributed snapshotting • Event time handling by watermarks
  • 74. Bigtop Roadmap • Integrate and package Apache Flink • Re-implement Bigtop Provisioner using docker-machine, compose, swarm • Deploy containers on multiple hosts • Support any kind of base image for deployment
  • 76. Wrap up • Hadoop Distribution • Choose Bigtop if you want more control • The SMACK Stack • Toolbox for a variety of data processing scenarios • Big Data Landscape • In-memory, off-heap solutions are hot