Big data and containers

This document discusses how containers can be used for big data workloads. It notes that containers provide lightweight virtualization similar to virtual machines. The document outlines how containers can help with distributed processing and storage of big data. It discusses using containers to ship computation to data and schedule processes based on data locality. Overall, the document argues that containers are well-suited for big data applications by allowing distributed, short-lived processes to be run efficiently near related data.

Data & Analytics

Big Data and Containers
Charles Smith
@charles_s_smith

Who am I?
Netflix / Lead the big data platform architecture team
Spend my time / Thinking about how to make it easy and efficient to work with big data
University of Florida / PhD in Computer Science

“It is important that we know where we come from, because
if you do not know where you come from, then you don't
know where you are, and if you don't know where you are,
you don't know where you're going. And if you don't know
where you're going, you're probably going wrong.”
Terry Pratchett

Database
Distributed Database
Distributed Storage
Distributed Processing
???

Containers ~= Virtual Machines
Virtual Machines ~= Servers

Lightweight: fast to start, memory use
Secure: process isolation, data isolation
Portable
Composable
Reproducible
Everything old is new

Data storage
(Cassandra, MySQL, MongoDB, etc.)

Customer Facing
Minimize latency
Maximize reliability

Data Analytics
Minimize I/O
Maximize processing

The questions you can answer aren’t predefined

Hive/Pig/MR
Presto
Metacat
Hive Metastore

That doesn’t look very container-y
(or microservice-y for that matter)

So what happens when you want to do something else?

But is that really the way we want to approach containers?

Running many different short-lived processes

Running many different short-lived processes
Efficient container construction, allocation, and movement

Groups of processes having meaning
How we observe processes needs to be holistic
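
One way to read the slide above: the containers behind a single query only mean something as a group, so their status is better rolled up per group than reported per process. The following is a minimal Python sketch under stated assumptions. The names `ProcessStatus`, `group_state`, and `summarize`, and the three-state model, are hypothetical and do not come from the talk.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class ProcessStatus:
    group_id: str  # hypothetical: the query or job this process belongs to
    name: str
    state: str     # assumed states: "running", "succeeded", "failed"


def group_state(statuses):
    """Roll the states of one group's processes up into a single state."""
    counts = Counter(s.state for s in statuses)
    if counts["failed"]:
        return "failed"     # any failed member fails the whole group
    if counts["running"]:
        return "running"    # at least one member is still in flight
    return "succeeded"      # every member finished cleanly


def summarize(statuses):
    """Map each group_id to its rolled-up state."""
    groups = {}
    for s in statuses:
        groups.setdefault(s.group_id, []).append(s)
    return {gid: group_state(members) for gid, members in groups.items()}
```

Treating any single failure as a group failure is a design choice made for the sketch; a real system might retry individual processes before failing the group.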

Processes need to be scheduled by data locality
(And not just data locality for data at rest)

Processes need to be scheduled by data locality
(And not just data locality for data at rest)
A special case of affinity (although possibly over time)
but...

We do need a data discovery service.
(kind of… maybe… a namenode?)
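
To make the last two slides concrete, here is a rough Python sketch of locality-aware placement backed by a discovery lookup. Everything in it is an assumption for illustration: the `Node` type, the `pick_node` function, and the `locations` mapping that stands in for the data discovery service are hypothetical, and the talk does not describe an implementation.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    free_slots: int


def pick_node(dataset, nodes, locations):
    """Choose a node for a process that reads `dataset`.

    `locations` plays the role of the data discovery service: a hypothetical
    mapping from dataset name to the set of node names holding a copy.
    """
    candidates = [n for n in nodes if n.free_slots > 0]
    if not candidates:
        return None  # nothing can run right now
    holders = locations.get(dataset, set())
    # Prefer nodes that hold the data; break ties by most free slots.
    best = max(candidates, key=lambda n: (n.name in holders, n.free_slots))
    best.free_slots -= 1
    return best.name
```

When no node holding the data has capacity, the sketch falls back to a remote node rather than waiting. Whether to wait for locality or run remotely is exactly the kind of trade-off a real scheduler would need to tune.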

SELECT
  t.title_id,
  t.title_desc,
  SUM(v.view_secs)
FROM view_history AS v
JOIN title_d AS t
  ON v.title_id = t.title_id
WHERE v.view_dateint > 20150101
GROUP BY 1, 2;

Query plan stages: LOAD, LOAD, JOIN, GROUP
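
The query above breaks down into two loads, a join, and a group. A small Python sketch, assuming that the two loads are independent and feed the join, shows how such stages could be ordered as a DAG. The stage names and the `waves` helper are illustrative and not taken from the talk; `graphlib.TopologicalSorter` is part of the Python standard library (3.9+).

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Each stage maps to the stages it depends on. Stage names are illustrative.
STAGES = {
    "load_view_history": set(),
    "load_title_d": set(),
    "join": {"load_view_history", "load_title_d"},
    "group": {"join"},
}


def waves(stages):
    """Return lists of stages that can run concurrently, in execution order."""
    sorter = TopologicalSorter(stages)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    result = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted only for stable output
        result.append(ready)
        sorter.done(*ready)
    return result
```

Each inner list is a set of stages with no unmet dependencies, so each could in principle be launched as its own short-lived container near the data it reads.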

Data Discovery
Query Compiler
Query Planner
Metadata
DAG
Watcher

Bottom line
Containers provide process-level security
The goal should be to minimize monoliths
This isn’t different from what we are doing already
Our languages are abstractions of composable, distributed processing
Different big data projects should share services
No matter what we do, joining is going to be a big problem

Big data and containers

1. Big Data and Containers Charles Smith @charles_s_smith

2. Who am I? Netflix / Lead the big data platform architecture team. Spend my time / Thinking about how to make it easy and efficient to work with big data. University of Florida / PhD in Computer Science.

3. “It is important that we know where we come from, because if you do not know where you come from, then you don't know where you are, and if you don't know where you are, you don't know where you're going. And if you don't know where you're going, you're probably going wrong.” Terry Pratchett

4. Database Distributed Database Distributed Storage Distributed Processing ???

5. Why do we care about containers?

6. Containers ~= Virtual Machines Virtual Machines ~= Servers

7. Lightweight fast to start memory use Secure Process isolation Data isolation Portable Composable Reproducible Everything old is new

8. Microservices and large architectures

9. Data storage (Cassandra, MySQL, MongoDB, etc.)

10. Operational (Mesos, Kubernetes, etc.)

11. Discovery/Routing

12. What’s different about big data?

13. Data at rest Data in motion

14. Customer Facing Minimize latency Maximize reliability

15. Data Analytics Minimize I/O Maximize processing

16. Ship computation to data

17. The questions you can answer aren’t predefined

18. Hive/Pig/MR Presto Metacat Hive Metastore

19. That doesn’t look very container-y (or microservice-y for that matter)

20. Data storage - HDFS (Or in our case S3)

21. Operational - YARN

22. Containers - JVM

23. So what happens when you want to do something else?

24.

25. But is that really the way we want to approach containers?

26. What’s different about big data?

27. Running many different short-lived processes

28. Running many different short-lived processes Efficient container construction, allocation, and movement

29. Groups of processes having meaning

30. Groups of processes having meaning How we observe processes needs to be holistic

31. Processes need to be scheduled by data locality (And not just data locality for data at rest)

32. Processes need to be scheduled by data locality (And not just data locality for data at rest) A special case of affinity (although possibly over time) but...

33. We do need a data discovery service. (kind of… maybe… a namenode?)

34. SELECT t.title_id, t.title_desc, SUM(v.view_secs) FROM view_history AS v JOIN title_d AS t ON v.title_id = t.title_id WHERE v.view_dateint > 20150101 GROUP BY 1, 2; Query plan stages: LOAD, LOAD, JOIN, GROUP

35. Data Discovery Query Compiler Query Planner Metadata DAG Watcher

36. Bottom line: Containers provide process-level security. The goal should be to minimize monoliths. This isn’t different from what we are doing already. Our languages are abstractions of composable, distributed processing. Different big data projects should share services. No matter what we do, joining is going to be a big problem.

37. Questions?

Editor's Notes

This is a good thing!
Something that is ingrained at Netflix
Decentralized
Basically: how do I deploy and get resources?
Think of it this way: our content is data at rest, a bunch of encodings sitting on an Open Connect server somewhere. When someone wants to view something, the data is streamed to them: data in motion (for a huge chunk of the downstream bandwidth). And the actual viewing of the content is the visualization of the data. You can extend this pattern to other services. Don’t go overboard, but it is a useful way to think about data, especially when the data starts to get big.
But that isn’t really what we do.
As a result, the allocations need to be fast and scalable.
