2. Hello!
I am Riyaz A Shaikh
Full Stack Architect
You can find me at:
@rizAShaikh
Riyaz A Shaikh
www.riyazshaikh.com
3. Requirement
We need to set up an analytics and alerting system on data produced by 10,000 servers. Assume 10 million events are generated per day across all servers, amounting to roughly 50 GB of data per day.
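As a quick sanity check on these figures, the implied per-server rate and average event size can be derived directly. A minimal sketch; all inputs come from the requirement above:

```python
# Back-of-envelope check of the stated requirement:
# 10,000 servers, 10 million events/day, 50 GB/day.
servers = 10_000
events_per_day = 10_000_000
daily_gb = 50

events_per_server = events_per_day / servers            # 1,000 events per server per day
avg_event_kb = daily_gb * 1024 * 1024 / events_per_day  # ~5.24 KB per event

print(f"{events_per_server:.0f} events/server/day, ~{avg_event_kb:.2f} KB/event")
```

At roughly 5 KB per event, the 50 GB/day figure is consistent with the stated event count.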
4. Big Data Cluster
We consider the Hortonworks Hadoop distribution for the cluster setup, with the following systems:
- HDFS for data backup in compressed format
- Spark for data computation and transformation
- Apache Kafka as the messaging service, for data completeness
- Flume for data capture
- Elasticsearch for analytical data storage and search
- Kibana for data visualization
5. Kafka cluster capacity

Assumption | Size in GB | Rationale
Daily average raw data ingest rate | 50 |
Kafka retention period of 2 days | 100 | Raw data × retention period
Kafka replication factor of 3 | 300 | Raw data × retention period × replication factor
Storage per day | 300 GB |
Storage per month | n/a | Kafka is a staging layer; data is auto-purged after the retention period, so no monthly calculation is required.
Table 1
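The Kafka figure is a single multiplication over the assumptions in Table 1; sketched below for clarity, with all inputs taken from the table:

```python
# Kafka staging capacity from Table 1.
raw_gb_per_day = 50     # daily average raw ingest
retention_days = 2      # Kafka retention period
replication = 3         # Kafka replication factor

kafka_storage_gb = raw_gb_per_day * retention_days * replication
print(kafka_storage_gb)  # 300 GB
```

Because older segments are auto-purged after the retention period, this 300 GB is a fixed ceiling rather than a figure that grows month by month.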
6. Elasticsearch cluster capacity

Assumption | Size in GB | Rationale | Remarks
Daily average raw data ingest rate | 50 | |
Elasticsearch 3 shards | 50 | | Shards are index splits; no extra space required.
Elasticsearch 3 replicas | 150 | Raw data × replicas | Each shard is replicated 3 times.
Storage per day | 150 GB | |
Storage per month | 4500 GB | Per day × 30 | 4.5 TB per month
Table 2
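The Elasticsearch sizing follows the same pattern. Note that Table 2 counts three copies of the data in total (shards themselves add no space), and that convention is reproduced in this sketch:

```python
# Elasticsearch capacity from Table 2.
raw_gb_per_day = 50
copies = 3             # total copies kept (the table's "3 replicas" row)
days_per_month = 30

es_daily_gb = raw_gb_per_day * copies         # 150 GB per day
es_monthly_gb = es_daily_gb * days_per_month  # 4500 GB = 4.5 TB per month
print(es_daily_gb, es_monthly_gb)
```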
7. HDFS to back up Elasticsearch data

Assumption | Size in GB | Rationale | Remarks
Daily average raw data ingest rate | 50 | |
HDFS replication factor of 3 | 150 | Raw data × replication factor |
70% compression | 45 | 150 − (150 × 70/100) | LZO compression
Storage per day | 45 GB | |
Storage per month | 1350 GB | Per day × 30 | 1.35 TB per month
Table 3
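The HDFS backup sizing applies replication first, then the assumed 70% LZO compression ratio. A sketch of Table 3's arithmetic:

```python
# HDFS backup capacity from Table 3.
raw_gb_per_day = 50
replication = 3
compression_pct = 70   # assumed LZO compression ratio

replicated_gb = raw_gb_per_day * replication                    # 150 GB
hdfs_daily_gb = replicated_gb * (100 - compression_pct) // 100  # 45 GB
hdfs_monthly_gb = hdfs_daily_gb * 30                            # 1350 GB = 1.35 TB
print(hdfs_daily_gb, hdfs_monthly_gb)
```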
8. Typical node structure

Node structure | Capacity | Remarks
Typical per-data-node storage capacity | 4 TB | 2 × 2 TB HDD
Temp space for processing by Spark, MapReduce, etc. | 1 TB | 25% of the data node
Data node usable storage | 3 TB | Raw storage − Spark reserve
Table 4

Considering the storage capacity from Tables 1, 2 and 3 above, the total storage required per month is 300 GB + 4500 GB + 1350 GB = 6150 GB (approx. 6.15 TB).
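Putting the three subsystems together (note that Kafka's figure is a fixed staging footprint, while the Elasticsearch and HDFS figures accrue monthly):

```python
# Monthly total from Tables 1-3.
kafka_gb = 300    # Table 1: fixed staging ceiling
es_gb = 4500      # Table 2: per month
hdfs_gb = 1350    # Table 3: per month

total_gb = kafka_gb + es_gb + hdfs_gb            # 6150 GB
print(total_gb, "GB =", total_gb / 1000, "TB")   # approx. 6.15 TB
```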
9. Data growth assumptions
Assume 10% data growth per quarter and, further, 15% year-on-year growth in data volume. Table 5 below indicates the capacity required as data grows year-on-year.
10. Capacity growth year-on-year

10% quarterly data growth (data in TB)
Quarter | Year 1 | Year 2 | Year 3 | Year 4 | Year 5
Q1 | 6.15 | 9.4 | 12.5 | 16.7 | 22.2
Q2 | 6.8 | 9.9 | 13.2 | 17.5 | 23.3
Q3 | 7.4 | 10.4 | 13.8 | 18.4 | 24.5
Q4 | 8.2 | 10.9 | 14.5 | 19.3 | 25.7
Yearly storage | 28.5 | 40.6 | 54.0 | 71.9 | 95.7
Data nodes required (yearly storage / data node usable storage) | 10 | 14 | 18 | 24 | 32
Table 5
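Table 5's figures can be reproduced with a small projection. The growth model here is inferred from the table itself (not stated on the slide): roughly 10% quarterly compounding in year 1, about 5% quarterly in later years, and a 15% jump into each new year's Q1; yearly totals are rounded to 0.1 TB before dividing by the 3 TB usable storage per node from Table 4. A sketch of the implied arithmetic:

```python
import math

USABLE_TB_PER_NODE = 3   # data node usable storage (Table 4)
q1 = 6.15                # starting monthly storage (Tables 1-3)

summary = []  # (yearly storage in TB, data nodes required) per year
for year in range(1, 6):
    growth = 0.10 if year == 1 else 0.05   # quarterly growth implied by Table 5
    quarters = [q1]
    for _ in range(3):
        quarters.append(quarters[-1] * (1 + growth))
    yearly = round(sum(quarters), 1)       # round to 0.1 TB, as the table does
    nodes = math.ceil(yearly / USABLE_TB_PER_NODE)
    summary.append((yearly, nodes))
    q1 = quarters[-1] * 1.15               # 15% year-on-year step into next Q1

for year, (tb, nodes) in enumerate(summary, 1):
    print(f"Year {year}: {tb} TB -> {nodes} data nodes")
```

Rounding up the node count means capacity is always provisioned ahead of the projected storage demand.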
11. Hardware Specs
Considering one year of storage on ten data nodes, with one NameNode and one standby NameNode. Tables 6 and 7 show the hardware configuration of each machine.
12. Typical worker node hardware configurations

Midline configuration (data node)
CPU | 2 × 8-core 2.9 GHz
Memory | 64 GB DDR3-1600 ECC
Disk controller | SAS 6 Gb/s
Disks | 5 × 1 TB LFF SATA II 7200 RPM (1 TB reserved for the OS)
Network controller | 2 × 1 Gb Ethernet
Notes: CPU features such as Intel's Hyper-Threading and QPI are desirable. Allocate memory to take advantage of triple- or quad-channel memory configurations.
Table 6
13. Typical NameNode hardware configurations

NameNode configuration
CPU | 2 × 8-core 2.9 GHz
Memory | 128 GB
Disk controller | RAID 1
Disks | 4 × 1 TB (1 for the OS, 2 for the FS image, and 1 for the JournalNode)
Network controller | 2 × 1 Gb Ethernet
Notes: CPU features such as Intel's Hyper-Threading and QPI are desirable. Allocate memory to take advantage of triple- or quad-channel memory configurations.
Table 7