Adoption of Apache Spark in the enterprise is increasing rapidly: it has become one of the fastest-growing and most popular technologies in the Big Data ecosystem.
However, implementing an enterprise-ready, on-premises Spark deployment can be complex, and it requires expertise that many organizations do not have in-house.
BlueData makes it easier to deploy Apache Spark on-premises. With BlueData, you can spin up virtual Spark clusters within minutes – providing secure, self-service, on-demand access to Big Data analytics and infrastructure. You can deploy Spark in standalone mode or with Hadoop / YARN. You can also build analytical pipelines and create Spark clusters using our RESTful APIs, and use web-based Zeppelin notebooks for interactive data analytics.
BlueData’s software platform leverages virtualization and Docker containers – combined with our own patent-pending innovations – to make it faster and more cost-effective for enterprises to get up and running with a multi-tenant, on-premises Spark deployment.
Learn more at www.bluedata.com
How to deploy Apache Spark in a multi-tenant, on-premises environment
1. HOW TO DEPLOY APACHE SPARK
IN A MULTI-TENANT, ON-PREMISES ENVIRONMENT
2. Adoption of Apache Spark is accelerating
• Spark adoption is growing rapidly
– The number of contributors and end users is increasing at a substantial rate
• Spark is expanding beyond Hadoop
– Spark is an integral component of new big data platforms - with support for pipelines,
streaming and statistical analysis, SQL, and more
• A variety of use cases are being implemented
– Use cases include recommendation systems, data warehousing, log processing, and more
• Programming paradigm is expanding
– Supported languages include Java, Scala, Python, SQL, R, and more
Source: Spark Survey Report, 2015 (Databricks)
3. Top roles using Spark in the enterprise
DATA ENGINEERS
41%
DATA SCIENTISTS
22.2%
ARCHITECTS
17.2%
MANAGEMENT
10.6%
ACADEMIA
6.2%
OTHER
2.4%
Source: Spark Survey Report, 2015 (Databricks)
4. Spark infrastructure patterns
• Individual developers or data scientists who build their own
infrastructure from VMs or bare metal machines
• A one-size-fits-all approach where everyone gets the same
infrastructure/platform, irrespective of their skill level or use case
5. Developers / data scientists and Spark
• Mostly self-starters who identify a use case
• They build their own systems on laptops, VMs, or servers
• The complexity soon overwhelms them and restricts adoption
• They need help to scale deployment beyond the initial use case
6. Rigid on-premises infrastructure
• Infrastructure is often built by IT for generic use cases
• Flexibility to cater to different usage scenarios is lost
• Spark users’ needs are constantly changing
• Upgrades become a challenge
7. Common Deployment Patterns
• Standalone mode – 48%
• YARN – 40%
• Mesos – 11%
Most Common Spark Deployment Environments (Cluster Managers)
Source: Spark Survey Report, 2015 (Databricks)
8. Scalable, self-service infrastructure
• IT controls machines, network, storage, and security
• Users create their own tenants and Spark clusters
• Teams can upgrade and scale their clusters independently
9. Big Data New Realities
Traditional assumptions → New realities → New benefits and value:
• Bare-metal → Containers and VMs → Big-Data-as-a-Service
• Data locality → Compute and storage separation → Agility and cost savings
• HDFS on local disks → In-place access on remote data stores (e.g. NFS, Object) → Faster time-to-insights
10. BlueData EPIC Software Platform
• ElasticPlane™ – Self-service, multi-tenant clusters
• IOBoost™ – Extreme performance and scalability
• DataTap™ – In-place access to enterprise data stores
[Architecture diagram: tenants such as Marketing, R&D, Sales, Manufacturing, and Support run BI/analytics tools on virtual clusters backed by local HDFS, with in-place access to remote data stores including NFS, Gluster, object storage, remote HDFS, and Ceph]
11. Deployment flexibility for Spark
• Physical machines or VMs as hosts
• Docker containers as nodes
• Networking and security enabled
• Standalone or YARN-based deployment
12. Support for all types of Spark users
• Integrated web-based notebook support for data analysts
• Command line support for data engineers and data scientists
• API support for building custom pipelines
• Multiple language and library support, including SQL, R, and Spark Streaming
• JDBC support for business intelligence tools
14. Instant Spark analysis and visualization
• Web-based notebook with an integrated Spark cluster
• Support for multiple languages and Zeppelin interpreters
• Fully provisioned Hadoop Distributed File System (HDFS)
• Support for persistent tables
• Iterative analysis and visualization