Leverage the Splunk architecture to provide the best possible performance. Whether you deploy on premises, in the cloud, or on Splunk Cloud, this session walks through scenarios that will help you get the best from each of these options. The agenda also covers how to plan your searches and reporting to deliver the best results for your end users.
14. Forwarding Tier
[Architecture diagram] The forwarding tier: Windows and Linux universal forwarders, a Windows SharePoint server, and a Windows AD server send data alongside a pair of Syslog-NG servers (receiving syslog on 514/tcp and 514/udp behind load balancers). A Linux heavy forwarder runs TA-McAfee (DBConnect, reading the ePO database) and TA-Checkpoint (reading the Checkpoint server). All forwarders use Splunk AutoLB to send to the indexers at splkidx.internal.door2door.com:9997 and fetch apps from the deployment server at splkds.internal.door2door.com:8089, with traffic passing through a physical router, load balancers, and a firewall.
16. Data Distribution Imbalance
Even data distribution is crucial in parallel computing
Ways to improve data distribution:
• Enable parallel ingestion pipelines on heavy forwarders (in server.conf)
• Route directly from universal forwarders where possible
• Make the following changes to the forwarders' outputs.conf:
• forceTimebasedAutoLB = true
• autoLBFrequency = x
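A minimal outputs.conf sketch showing these settings on a forwarder (the indexer host names and the frequency value are illustrative, not from the deck):

```
# outputs.conf on a forwarder - illustrative sketch; server names are
# placeholders, and autoLBFrequency should be tuned to your search windows
[tcpout:primary_indexers]
server = idx1.example.com:9997, idx2.example.com:9997
forceTimebasedAutoLB = true
autoLBFrequency = 10
```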
Examine your saved searches' time windows. In the example below, many searches run over a 5-minute window and some over a 1-minute window. The full load-balancing rotation (autoLBFrequency × number of indexers) should divide evenly into 5 minutes, or into 1 minute if possible, so that every indexer receives an equal share of data within each search window.
| tstats summariesonly=t count WHERE index="*" by splunk_server _time | timechart span=5m sum(count) by splunk_server
6 indexers; autoLBFrequency = 30: uneven distribution of workload over 5-minute periods and unpredictable workload variation - a full rotation takes 30 × 6 = 180 seconds, which does not divide evenly into the 300-second window.
6 indexers; autoLBFrequency = 15: better distribution over 5 minutes. autoLBFrequency = 10 would be even better with 6 indexers, since 10 × 6 = 60 seconds divides exactly into the 300-second window.
17. Data Imbalance - Troubleshoot
Troubleshooting:
• Validate firewall rules are in place
• Check that all forwarders have the correct outputs
• Ensure indexers are all listening on the proper port
• Does splunkd.log have anything to say?
• Use the Indexing Overview and Configuration Overview (btool saves the day)
Other Causes:
• Simple misconfiguration
• Data processing queues filling up and forwarders timing out and jumping to next indexer
• Check Distributed Indexing Performance in the DMC for queue filling - a typical sign of disk performance issues
• Indexer affinity - forwarders get stuck on one indexer because EOF is never reached on a long-lived stream
• forceTimebasedAutoLB can help! http://blogs.splunk.com/2014/03/18/time-based-load-balancing/
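One way to spot filling queues from _internal - a sketch using the standard metrics.log queue fields (the 5-minute span is an arbitrary choice):

```
index=_internal source=*metrics.log* group=queue
| eval fill_pct = round(current_size_kb / max_size_kb * 100, 1)
| timechart span=5m avg(fill_pct) by name
```

Sustained high fill percentages on indexqueue or typingqueue are the pattern to look for when disk performance is suspected.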
38. Distributed Deployment – Common Components
Search Heads: 3 X Cisco UCS C220-M4 rack servers, each with:
▫ CPU: 2 X E5-2680 v3 (24 cores)
▫ Memory: 256 GB
▫ Cisco 12Gbps SAS modular RAID controller (2GB FBWC cache)
▫ Cisco VIC 1227
▫ 2 X 600GB 15K SFF SAS drives (RAID1)
Admin/Master Nodes: 2 X Cisco UCS C220-M4 rack servers, each with:
▫ CPU: 2 X E5-2620 v3 (12 cores)
▫ Memory: 256 GB
▫ Cisco 12Gbps SAS modular RAID controller (2GB FBWC cache)
▫ Cisco VIC 1227
▫ 2 X 600GB 15K SFF SAS drives (RAID1)
Network Fabric: 2 X Cisco UCS 6248UP 48-Port Fabric Interconnects
39. Distributed Deployment – Retention vs. Performance
Distributed Deployment with High Capacity
Indexers: 16 X C240-M4 rack servers, each with:
▫ CPU: 2 X E5-2680 v3 (24 cores)
▫ Memory: 256GB
▫ Cisco 12Gbps SAS modular RAID controller (2GB FBWC cache)
▫ Cisco VIC 1227
▫ 24 X 1.2TB 10K SAS in RAID10
▫ 2 X 120GB SSD in RAID1 for OS
Retention Capability: >1 TB/day with 1 year+ retention
Indexing Capacity: 4 TB/day (2 TB/day with replication)
Raw Index Capacity: 236 TB (472 TB expected data capacity at 2:1 compression)
Key Use Cases: enterprises requiring larger data retention
Server Count: 21 (37 RU)
Scalability: additional search head(s); 1 to 16 additional indexers (refer to the High Capacity indexer configuration)

Distributed Deployment with High Performance
Indexers: 16 X C220-M4 rack servers, each with:
▫ CPU: 2 X E5-2680 v3 (24 cores)
▫ Memory: 256GB
▫ Cisco 12Gbps SAS modular RAID controller (2GB FBWC cache)
▫ Cisco VIC 1227
▫ 6 X 800GB SSD-EP in RAID5
▫ 2 X 600GB 10K SFF SAS HDD in RAID1 for OS
Retention Capability: >1.25 TB/day with 90-day retention
Indexing Capacity: 8 TB/day (4 TB/day with replication)
Raw Index Capacity: 64 TB (128 TB expected data capacity at 2:1 compression)
Key Use Cases: ability to support a large number of concurrent users requiring faster response times
Server Count: 21 (21 RU)
Scalability: additional search head(s); 1 to 16 additional indexers (refer to the High Performance indexer configuration)
40. Cloud Deployments
Cloud Considerations
• Authentication restrictions
• Data transfer costs
• Security – SSL Tunnel
• Zones
• Hybrid deployments
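For the SSL tunnel consideration, a forwarder-side outputs.conf sketch for encrypting traffic to cloud indexers (the host name and certificate paths are hypothetical examples, not from the deck):

```
# outputs.conf - encrypt forwarder-to-indexer traffic crossing the internet
# (illustrative host name and certificate paths)
[tcpout:cloud_indexers]
server = splkidx.example.com:9997
sslCertPath = $SPLUNK_HOME/etc/auth/client.pem
sslRootCAPath = $SPLUNK_HOME/etc/auth/cacert.pem
sslPassword = <certificate password>
sslVerifyServerCert = true
```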
VMware http://www.splunk.com/web_assets/pdfs/secure/Splunk_and_VMware_VMs_Tech_Brief.pdf
AWS https://www.splunk.com/pdfs/technical-briefs/deploying-splunk-enterprise-on-amazon-web-services-technical-brief.pdf
Azure http://www.splunk.com/pdfs/technical-briefs/deploying-splunk-enterprise-on-microsoft-azure.pdf
46. Splunk Cloud - Built for 100% Uptime
• High availability across indexers and search heads
• Multiple AWS availability zones
• Dedicated cloud environments - secure, with 10x bursting
• Splunk Cloud fully monitored using Splunk Enterprise
51. More Is Better?
CPUs
• 8, 12, 16, 24, 32, etc….
• Pipelines - New 6.3 feature for parallelization!
• Indexing can handle higher bursts with multiple index pipeline sets
• Certain searches can be improved with multiple search pipeline sets
• Historical batch – return the data without worrying about time order ( … | stats count)
• Indexers still need to do the heavy lifting (search exists on indexer AND search head)
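The pipeline-set feature mentioned above is configured in server.conf (the value 2 here is just an example; size it against spare CPU cores):

```
# server.conf - enable two ingestion pipeline sets (illustrative value)
[general]
parallelIngestionPipelines = 2
```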
Memory
• Good for search heads and indexers (16+ GB)
• Benefits from extra RAM used by OS for caching
Disks
• Faster is better - 10k – 15k rpm strongly recommended, SSD preferred
• More disks in RAID 1+0 = Faster
• RAID 5+1 or 6 can be good for Cold buckets
• SSDs can also provide benefit for rare term searches and many concurrent jobs
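The hot/warm vs. cold split described above can be expressed in indexes.conf by pointing bucket paths at different storage tiers (the paths are hypothetical examples):

```
# indexes.conf - hot/warm buckets on fast RAID 10 storage, cold buckets
# on larger, cheaper RAID 6 storage (example paths)
[main]
homePath   = /fast_raid10/splunk/defaultdb/db
coldPath   = /slow_raid6/splunk/defaultdb/colddb
thawedPath = /slow_raid6/splunk/defaultdb/thaweddb
```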
55. How are things, overall?
High level environment status – quick view of what’s up/down/not reporting:
• Forwarder health - finding forwarders that we haven’t seen for awhile
• Data source health - how are our data feeds doing?
• REST endpoints (| rest /services/server/info) - looking at system information, possibly under provisioned ones
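A sketch of the forwarder-health check using the metadata command's recentTime field (the one-hour threshold is an arbitrary example):

```
| metadata type=hosts index=*
| eval minutes_since_seen = round((now() - recentTime) / 60, 0)
| where minutes_since_seen > 60
| sort - minutes_since_seen
```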
Spotting warnings and errors within Splunk _internal:
• index=_internal sourcetype=splunkd (log_level=ERROR OR log_level=WARN) | cluster showcount=t | table cluster_count host log_level message | sort - cluster_count | rename cluster_count AS count, log_level AS level
• index=_internal sourcetype=splunkd log_level!=INFO | timechart count by component
Track resource usage:
• Say hello to _introspection (Splunk 6.1+)
• Captures disk and other resource metrics (by default on full installs)
• http://docs.splunk.com/Documentation/Splunk/latest/Troubleshooting/Abouttheplatforminstrumentationframework
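A sketch of pulling disk usage out of _introspection (field names follow the platform instrumentation framework's Partitions component; the span is arbitrary):

```
index=_introspection sourcetype=splunk_disk_objects component=Partitions
| rename data.mount_point AS mount_point
| eval pct_used = round(('data.capacity' - 'data.available') / 'data.capacity' * 100, 1)
| timechart span=10m max(pct_used) by mount_point
```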
Dashboards to help save the day:
• Health Status - Splunk Health Overview
• Instance - Distributed Management Console
• Indexing Performance - Distributed Management Console
• Resource Usage - Splunk Health Overview
• License Usage - Splunk Health Overview