Leverage the Splunk architecture to provide the best possible performance. Whether you deploy on premises, in the cloud, or on Splunk Cloud, this session walks through scenarios that will help you get the best from each of these options. The agenda also covers how to plan your searches and reporting to deliver the best results for your end users.
14. Forwarding Tier
[Architecture diagram] The forwarding tier: Windows and Linux universal forwarders, a Windows SharePoint server, and a Windows AD server send data alongside a pair of Syslog-NG servers (receiving syslog on 514/tcp and 514/udp behind load balancers). A Linux heavy forwarder runs TA-McAfee (DBConnect, reading the ePO database) and TA-Checkpoint (reading the Checkpoint server). All forwarders use Splunk AutoLB to send to the indexers at splkidx.internal.door2door.com:9997 and fetch apps from the deployment server at splkds.internal.door2door.com:8089, with traffic passing through a physical router, load balancers, and a firewall.
16. Data Distribution Imbalance
Even data distribution is crucial in parallel computing
Ways to improve data distribution:
• Enable parallel ingestion pipelines on heavy forwarders (in server.conf)
• Route directly from universal forwarders where possible
• Make the following changes to the forwarders' outputs.conf:
• forceTimebasedAutoLB = true
• autoLBFrequency = x
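A minimal outputs.conf sketch showing these settings on a forwarder (the indexer host names and the frequency value are illustrative, not from the deck):

```
# outputs.conf on a forwarder - illustrative sketch; server names are
# placeholders, and autoLBFrequency should be tuned to your search windows
[tcpout:primary_indexers]
server = idx1.example.com:9997, idx2.example.com:9997
forceTimebasedAutoLB = true
autoLBFrequency = 10
```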
Examine your saved searches' time windows. In the example below, many searches run over a 5-minute window and some over a 1-minute window. The full load-balancing rotation (autoLBFrequency × number of indexers) should divide evenly into 5 minutes, or into 1 minute if possible, so that every indexer receives an equal share of data within each search window.
| tstats summariesonly=t count WHERE index="*" by splunk_server _time | timechart span=5m sum(count) by splunk_server
6 indexers; autoLBFrequency = 30: uneven distribution of workload over 5-minute periods and unpredictable workload variation - a full rotation takes 30 × 6 = 180 seconds, which does not divide evenly into the 300-second window.
6 indexers; autoLBFrequency = 15: better distribution over 5 minutes. autoLBFrequency = 10 would be even better with 6 indexers, since 10 × 6 = 60 seconds divides exactly into the 300-second window.
17. Data Imbalance - Troubleshoot
Troubleshooting:
• Validate firewall rules are in place
• Check that all forwarders have the correct outputs
• Ensure indexers are all listening on the proper port
• Does splunkd.log have anything to say?
• Use the Indexing Overview and Configuration Overview (btool saves the day)
Other Causes:
• Simple misconfiguration
• Data processing queues filling up and forwarders timing out and jumping to next indexer
• Check Distributed Indexing Performance in the DMC for queue filling - a typical sign of disk performance issues
• Indexer affinity - forwarders get stuck on one indexer because EOF is never reached on a long-lived stream
• forceTimebasedAutoLB can help! http://blogs.splunk.com/2014/03/18/time-based-load-balancing/
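One way to spot filling queues from _internal - a sketch using the standard metrics.log queue fields (the 5-minute span is an arbitrary choice):

```
index=_internal source=*metrics.log* group=queue
| eval fill_pct = round(current_size_kb / max_size_kb * 100, 1)
| timechart span=5m avg(fill_pct) by name
```

Sustained high fill percentages on indexqueue or typingqueue are the pattern to look for when disk performance is suspected.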
38. Distributed Deployment – Common Components
Search Heads: 3 X Cisco UCS C220-M4 rack servers, each with:
▫ CPU: 2 X E5-2680 v3 (24 cores)
▫ Memory: 256 GB
▫ Cisco 12Gbps SAS modular RAID controller (2GB FBWC cache)
▫ Cisco VIC 1227
▫ 2 X 600GB 15K SFF SAS drives (RAID1)
Admin/Master Nodes: 2 X Cisco UCS C220-M4 rack servers, each with:
▫ CPU: 2 X E5-2620 v3 (12 cores)
▫ Memory: 256 GB
▫ Cisco 12Gbps SAS modular RAID controller (2GB FBWC cache)
▫ Cisco VIC 1227
▫ 2 X 600GB 15K SFF SAS drives (RAID1)
Network Fabric: 2 X Cisco UCS 6248UP 48-Port Fabric Interconnects
39. Distributed Deployment – Retention vs. Performance
Distributed Deployment with High Capacity
Indexers: 16 X C240-M4 rack servers, each with:
▫ CPU: 2 X E5-2680 v3 (24 cores)
▫ Memory: 256GB
▫ Cisco 12Gbps SAS modular RAID controller (2GB FBWC cache)
▫ Cisco VIC 1227
▫ 24 X 1.2TB 10K SAS in RAID10
▫ 2 X 120GB SSD in RAID1 for OS
Retention Capability: >1 TB/day with 1 year+ retention
Indexing Capacity: 4 TB/day (2 TB/day with replication)
Raw Index Capacity: 236 TB (472 TB expected data capacity at 2:1 compression)
Key Use Cases: enterprises requiring larger data retention
Server Count: 21 (37 RU)
Scalability: additional search head(s); 1 to 16 additional indexers (refer to the High Capacity indexer configuration)

Distributed Deployment with High Performance
Indexers: 16 X C220-M4 rack servers, each with:
▫ CPU: 2 X E5-2680 v3 (24 cores)
▫ Memory: 256GB
▫ Cisco 12Gbps SAS modular RAID controller (2GB FBWC cache)
▫ Cisco VIC 1227
▫ 6 X 800GB SSD-EP in RAID5
▫ 2 X 600GB 10K SFF SAS HDD in RAID1 for OS
Retention Capability: >1.25 TB/day with 90-day retention
Indexing Capacity: 8 TB/day (4 TB/day with replication)
Raw Index Capacity: 64 TB (128 TB expected data capacity at 2:1 compression)
Key Use Cases: ability to support a large number of concurrent users requiring faster response times
Server Count: 21 (21 RU)
Scalability: additional search head(s); 1 to 16 additional indexers (refer to the High Performance indexer configuration)
40. Cloud Deployments
Cloud Considerations
• Authentication restrictions
• Data transfer costs
• Security – SSL Tunnel
• Zones
• Hybrid deployments
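For the SSL tunnel consideration, a forwarder-side outputs.conf sketch for encrypting traffic to cloud indexers (the host name and certificate paths are hypothetical examples, not from the deck):

```
# outputs.conf - encrypt forwarder-to-indexer traffic crossing the internet
# (illustrative host name and certificate paths)
[tcpout:cloud_indexers]
server = splkidx.example.com:9997
sslCertPath = $SPLUNK_HOME/etc/auth/client.pem
sslRootCAPath = $SPLUNK_HOME/etc/auth/cacert.pem
sslPassword = <certificate password>
sslVerifyServerCert = true
```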
VMware http://www.splunk.com/web_assets/pdfs/secure/Splunk_and_VMware_VMs_Tech_Brief.pdf
AWS https://www.splunk.com/pdfs/technical-briefs/deploying-splunk-enterprise-on-amazon-web-services-technical-brief.pdf
Azure http://www.splunk.com/pdfs/technical-briefs/deploying-splunk-enterprise-on-microsoft-azure.pdf
46. Splunk Cloud - Built for 100% Uptime
• High availability across indexers and search heads
• Multiple AWS availability zones
• Dedicated cloud environments - secure, with 10x bursting
• Splunk Cloud fully monitored using Splunk Enterprise
51. More Is Better?
CPUs
• 8, 12, 16, 24, 32, etc….
• Pipelines - New 6.3 feature for parallelization!
• Indexing can handle higher bursts with multiple index pipeline sets
• Certain searches can be improved with multiple search pipeline sets
• Historical batch – return the data without worrying about time order ( … | stats count)
• Indexers still need to do the heavy lifting (search exists on indexer AND search head)
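The pipeline-set feature mentioned above is configured in server.conf (the value 2 here is just an example; size it against spare CPU cores):

```
# server.conf - enable two ingestion pipeline sets (illustrative value)
[general]
parallelIngestionPipelines = 2
```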
Memory
• Good for search heads and indexers (16+ GB)
• Benefits from extra RAM used by OS for caching
Disks
• Faster is better - 10k – 15k rpm strongly recommended, SSD preferred
• More disks in RAID 1+0 = Faster
• RAID 5+1 or 6 can be good for Cold buckets
• SSDs can also provide benefit for rare term searches and many concurrent jobs
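The hot/warm vs. cold split described above can be expressed in indexes.conf by pointing bucket paths at different storage tiers (the paths are hypothetical examples):

```
# indexes.conf - hot/warm buckets on fast RAID 10 storage, cold buckets
# on larger, cheaper RAID 6 storage (example paths)
[main]
homePath   = /fast_raid10/splunk/defaultdb/db
coldPath   = /slow_raid6/splunk/defaultdb/colddb
thawedPath = /slow_raid6/splunk/defaultdb/thaweddb
```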
55. How are things, overall?
High level environment status – quick view of what’s up/down/not reporting:
• Forwarder health - finding forwarders that we haven’t seen for awhile
• Data source health - how are our data feeds doing?
• REST endpoints (| rest /services/server/info) - looking at system information, possibly under provisioned ones
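A sketch of the forwarder-health check using the metadata command's recentTime field (the one-hour threshold is an arbitrary example):

```
| metadata type=hosts index=*
| eval minutes_since_seen = round((now() - recentTime) / 60, 0)
| where minutes_since_seen > 60
| sort - minutes_since_seen
```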
Spotting warnings and errors within Splunk _internal:
• index=_internal sourcetype=splunkd (log_level=ERROR OR log_level=WARN) | cluster showcount=t | table cluster_count host log_level message | sort - cluster_count | rename cluster_count AS count, log_level AS level
• index=_internal sourcetype=splunkd log_level!=INFO | timechart count by component
Track resource usage:
• Say hello to _introspection (Splunk 6.1+)
• Captures disk and other resource metrics (by default on full installs)
• http://docs.splunk.com/Documentation/Splunk/latest/Troubleshooting/Abouttheplatforminstrumentationframework
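A sketch of pulling disk usage out of _introspection (field names follow the platform instrumentation framework's Partitions component; the span is arbitrary):

```
index=_introspection sourcetype=splunk_disk_objects component=Partitions
| rename data.mount_point AS mount_point
| eval pct_used = round(('data.capacity' - 'data.available') / 'data.capacity' * 100, 1)
| timechart span=10m max(pct_used) by mount_point
```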
Dashboards to help save the day:
• Health Status - Splunk Health Overview
• Instance - Distributed Management Console
• Indexing Performance - Distributed Management Console
• Resource Usage - Splunk Health Overview
• License Usage - Splunk Health Overview