Adoption of Apache Spark in the enterprise is increasing rapidly: it has become one of the fastest-growing and most popular technologies in the Big Data ecosystem.
However, implementing an enterprise-ready, on-premises Spark deployment can be complex, and it requires expertise that many organizations do not have in-house.
BlueData makes it easier to deploy Apache Spark on-premises. With BlueData, you can spin up virtual Spark clusters within minutes – providing secure, self-service, on-demand access to Big Data analytics and infrastructure. You can deploy Spark in standalone mode or with Hadoop / YARN. You can also build analytical pipelines and create Spark clusters using our RESTful APIs, and use web-based Zeppelin notebooks for interactive data analytics.
BlueData’s software platform leverages virtualization and Docker containers – combined with our own patent-pending innovations – to make it faster and more cost-effective for enterprises to get up and running with a multi-tenant, on-premises Spark deployment.
Learn more at www.bluedata.com
How to deploy Apache Spark in a multi-tenant, on-premises environment
1. HOW TO DEPLOY APACHE SPARK
IN A MULTI-TENANT, ON-PREMISES ENVIRONMENT
2. Adoption of Apache Spark is accelerating
• Spark adoption is growing rapidly
– The number of contributors and end users is increasing at a substantial rate
• Spark is expanding beyond Hadoop
– Spark is an integral component of new big data platforms - with support for pipelines,
streaming and statistical analysis, SQL, and more
• A variety of use cases are being implemented
– Use cases include recommendation systems, data warehousing, log processing, and more
• Programming paradigm is expanding
– Supported languages include Java, Scala, Python, SQL, R, and more
Source: Spark Survey Report, 2015 (Databricks)
3. Top roles using Spark in the enterprise
DATA ENGINEERS
41%
DATA SCIENTISTS
22.2%
ARCHITECTS
17.2%
MANAGEMENT
10.6%
ACADEMIA
6.2%
OTHER
2.4%
Source: Spark Survey Report, 2015 (Databricks)
4. Spark infrastructure patterns
• Individual developers or data scientists who build their own
infrastructure from VMs or bare metal machines
• A one-size-fits-all approach where everyone gets the same
infrastructure/platform, irrespective of their skill level or use case
5. Developers / data scientists and Spark
• Mostly self-starters who identify a use case
• They build their own systems on laptops, VMs, or servers
• The complexity soon overwhelms them and restricts adoption
• They need help to scale deployment beyond the initial use case
6. Rigid on-premises infrastructure
• Infrastructure is often built by IT for generic use cases
• Flexibility to cater to different usage scenarios is lost
• Spark users’ needs are constantly changing
• Upgrades become a challenge
7. Common Deployment Patterns
• Standalone mode – 48%
• YARN – 40%
• Mesos – 11%
Most Common Spark Deployment Environments (Cluster Managers)
Source: Spark Survey Report, 2015 (Databricks)
8. Scalable, self-service infrastructure
• IT controls machines, network, storage, and security
• Users create their own tenants and Spark clusters
• Teams can upgrade and scale their clusters independently
9. Big Data New Realities
Traditional assumptions → New realities → New benefits and value:
• Bare-metal → Containers and VMs → Big-Data-as-a-Service
• Data locality → Compute and storage separation → Agility and cost savings
• HDFS on local disks → In-place access on remote data stores (e.g. NFS, Object) → Faster time-to-insights
10. BlueData EPIC Software Platform
• ElasticPlane™ – Self-service, multi-tenant clusters
• IOBoost™ – Extreme performance and scalability
• DataTap™ – In-place access to enterprise data stores
[Architecture diagram: tenants such as Marketing, R&D, Sales, Manufacturing, and Support run BI/analytics tools on virtual clusters backed by local HDFS, with in-place access to remote data stores including NFS, Gluster, object storage, remote HDFS, and Ceph]
11. Deployment flexibility for Spark
• Physical machines or VMs as hosts
• Docker containers as nodes
• Networking and security enabled
• Standalone or YARN-based deployment
12. Support for all types of Spark users
• Integrated web-based notebook support for data analysts
• Command line support for data engineers and data scientists
• API support for building custom pipelines
• Multiple language and library support, including SQL, R, and Spark Streaming
• JDBC support for business intelligence tools
14. Instant Spark analysis and visualization
• Web-based notebook with an integrated Spark cluster
• Support for multiple languages and Zeppelin interpreters
• Fully provisioned Hadoop Distributed File System (HDFS)
• Support for persistent tables
• Iterative analysis and visualization