Big Data and Containers

Thinking through how containers should change our thinking in big data.

  1. Big Data and Containers. Charles Smith, @charles_s_smith
  2. Who am I? Netflix / Lead the big data platform architecture team. Spend my time / Thinking how to make it easy/efficient to work with big data. University of Florida / PhD in Computer Science.
  3. “It is important that we know where we come from, because if you do not know where you come from, then you don't know where you are, and if you don't know where you are, you don't know where you're going. And if you don't know where you're going, you're probably going wrong.” (Terry Pratchett)
  4. Database, Distributed Database, Distributed Storage, Distributed Processing, ???
  5. Why do we care about containers?
  6. Containers ~= Virtual Machines. Virtual Machines ~= Servers.
  7. Lightweight (fast to start, memory use). Secure (process isolation, data isolation). Portable. Composable. Reproducible. Everything old is new.
  8. Microservices and large architectures
  9. Data storage (Cassandra, MySQL, MongoDB, etc.)
  10. Operational (Mesos, Kubernetes, etc.)
  11. Discovery/Routing
  12. What’s different about big data?
  13. Data at rest. Data in motion.
  14. Customer Facing: minimize latency, maximize reliability
  15. Data Analytics: minimize I/O, maximize processing
  16. Ship computation to data
  17. The questions you can answer aren’t predefined
  18. Hive/Pig/MR, Presto, Metacat, Hive Metastore
  19. That doesn’t look very container-y (or microservice-y for that matter)
  20. Data storage - HDFS (or in our case S3)
  21. Operational - YARN
  22. Containers - JVM
  23. So what happens when you want to do something else?
  24. But is that really the way we want to approach containers?
  25. What’s different about big data?
  26. Running many different short-lived processes
  27. Running many different short-lived processes: efficient container construction, allocation, and movement
  28. Groups of processes having meaning
  29. Groups of processes having meaning: how we observe processes needs to be holistic
  30. Processes need to be scheduled by data locality (and not just data locality for data at rest)
  31. Processes need to be scheduled by data locality (and not just data locality for data at rest). A special case of affinity (although possibly over time) but...
  32. We do need a data discovery service. (kind of… maybe… a namenode?)
  33. SELECT t.title_id, t.title_desc, SUM(v.view_secs) FROM view_history AS v JOIN title_d AS t ON v.title_id = t.title_id WHERE v.view_dateint > 20150101 GROUP BY 1, 2; Steps: LOAD, LOAD, JOIN, GROUP
  34. Data Discovery Query Compiler Query Planner Metadata DAG Watcher
  35. Bottom line: Containers provide process-level security. The goal should be to minimize monoliths. This isn’t different from what we are doing already. Our languages are abstractions of composable distributed processing. Different big data projects should share services. No matter what we do, joining is going to be a big problem.
  36. Questions?
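
Slide 33 reduces the example query to four steps: LOAD, LOAD, JOIN, GROUP. As a rough illustration only (not the tooling described in the talk), those steps can be written as a small dependency graph and grouped into waves that could run side by side, for instance one short-lived container per step. The step names below are hypothetical; the slide gives only the four labels. The sketch assumes Python 3.9+ for the standard-library `graphlib` module.

```python
from graphlib import TopologicalSorter

# Hypothetical step names; the slide only labels them LOAD, LOAD, JOIN, GROUP.
# Each key maps to the set of steps that must finish before it can start.
plan = {
    "load_view_history": set(),
    "load_title_d": set(),
    "join_on_title_id": {"load_view_history", "load_title_d"},
    "group_and_sum": {"join_on_title_id"},
}


def execution_waves(dag):
    """Return a list of waves; steps in one wave do not depend on each other."""
    sorter = TopologicalSorter(dag)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted only for stable output
        waves.append(ready)
        sorter.done(*ready)  # mark the wave finished so dependents unlock
    return waves


waves = execution_waves(plan)
for number, wave in enumerate(waves):
    print(number, wave)
```

The two loads land in the same wave because neither depends on the other, while the join has to wait for both, which is one way to read the closing remark that joining is going to be a big problem.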
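
Slides 30 and 31 ask for processes to be scheduled by data locality. The following is a minimal greedy sketch of that idea, under assumptions that are not in the slides: every task reads one named partition, a hypothetical `replicas` map says which nodes currently hold that partition (at rest, or as output just produced by an upstream process), and every node has a fixed number of free container slots. It does not reflect how YARN, Mesos, or Kubernetes actually place work.

```python
def schedule(tasks, replicas, free_slots):
    """Assign each task to a node, preferring nodes that hold its input.

    tasks: dict of task name -> input partition name
    replicas: dict of partition name -> list of nodes holding a copy
    free_slots: dict of node -> number of free container slots
    Returns a dict of task name -> (node, is_local); tasks that cannot be
    placed anywhere are left out of the result.
    """
    slots = dict(free_slots)  # copy so the caller's dict is not mutated
    placement = {}
    for task, partition in tasks.items():
        # Nodes that hold the data and still have room.
        local = [n for n in replicas.get(partition, []) if slots.get(n, 0) > 0]
        # Fall back to any node with room if no local node is available.
        candidates = local or [n for n, free in slots.items() if free > 0]
        if not candidates:
            continue
        # Among the candidates, pick the node with the most free slots.
        node = max(candidates, key=lambda n: slots[n])
        slots[node] -= 1
        placement[task] = (node, bool(local))
    return placement


# Hypothetical cluster: partition p1 lives on node a, p2 on node b.
tasks = {"t1": "p1", "t2": "p1", "t3": "p2"}
replicas = {"p1": ["a"], "p2": ["b"]}
free_slots = {"a": 1, "b": 1, "c": 2}

placement = schedule(tasks, replicas, free_slots)
for task, (node, is_local) in placement.items():
    print(task, node, "local" if is_local else "remote")
```

In this example `t2` runs remotely because node `a` is already full. A greedy pass like this can also take a slot that a later task needed for local access, which is part of why the slide frames locality as a special case of affinity that may change over time.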
