Splunk is a powerful platform that can harness your machine data and turn it into valuable information, enabling your business to make informed decisions and taking your organization from reactive to proactive. Like any other platform, Splunk is only as powerful as the data it has access to, so in this session we will walk through how to successfully onboard data, with samples ranging from simple to complex. We will also look at how to use common TAs to bring valuable data into Splunk. This session is designed to give you a better understanding of how to onboard data into Splunk, enabling you to unlock the power of your data.
2. • Major components involved in data indexing
• What happens to data within Splunk
• What the data pipeline is & how to influence it
• Shaping data understanding via props.conf
• Configuring data inputs via inputs.conf
• What goes where
• Heavy Forwarders vs. Universal Forwarders
• How to get your data into Splunk (mostly correctly)
~60 minutes from now...
3. What is the Data Onboarding Process?
• Systematic way to bring new data sources into Splunk
• Make sure that new data is instantly usable & has maximum value for users
• Goes hand-in-hand with the User Onboarding process (sold separately)
4. Machine Data > Business Value
[Diagram: index untapped data from any source, type, or volume – online services, web services, servers, security, GPS location, storage, desktops, networks, packaged applications, custom applications, messaging, telecoms, online shopping carts, web clickstreams, databases, energy meters, call detail records, smartphones and devices, RFID – whether on-premises, in a private cloud, or in a public cloud. Ask any question across application delivery; security, compliance and fraud; IT operations; business analytics; and industrial data and the Internet of Things.]
5. Flavors of Machine Data
[Diagram: sample event formats – order processing, Twitter, care IVR, middleware error.]
6. Getting Data Into Splunk
Agent and agent-less approaches for flexibility:
• Agent-less data input
  – syslog (TCP/UDP) from syslog-compatible hosts and network devices
  – WMI (event logs, performance, Active Directory) from Windows hosts
  – Mounted file systems (hostname/mount) from Unix, Linux and Windows hosts
  – Custom apps and scripted API connections (perf, shell, code)
• Splunk Forwarder
  – Local file monitoring: log files, config files, dumps and trace files
  – Windows inputs: event logs, performance counters, registry monitoring, Active Directory monitoring, virtual hosts (Windows hosts)
  – Scripted inputs: shell scripts, custom parsers, batch loading
7. Splunk Data Ingest
[Diagram: several UFs and an HF forwarding to an indexer (IDX), with a search head (SH); the UFs run the Splunk Universal Forwarder, everything else runs Splunk Enterprise (with optional configs).]
Summary: when it comes to "core" Splunk, there are two distinct products: the Splunk Universal Forwarder and Splunk Enterprise. "Everything else" – Indexer, Search Head, License Server, Deployment Server, Cluster Master, Deployer, Heavy Forwarder, etc. – are all instances of Splunk Enterprise with varying configs.
12. Inputs – Where it all starts
• Input processors: Monitor, FIFO, UDP, TCP, Scripted
• No events yet – just a stream of bytes
• Break data stream into 64KB blocks
• Annotate stream with metadata keys (host, source, sourcetype, index, etc.)
• Can happen on UF, HF or indexer
13. Parsing
• Check character set
• Break lines
• Process headers
• Can happen on HF or indexer
14. Aggregation/Merging
• Merge lines for multi-line events
• Identify events (finally!)
• Extract timestamps
• Exclude events based on timestamp (MAX_DAYS_AGO, ..)
• Can happen on HF or indexer
15. Typing
• Do regex replacement (field extraction, punctuation extraction, event routing, host/source/sourcetype overrides)
• Annotate events with metadata keys (host, source, sourcetype, ..)
• Can happen on HF or indexer
16. Indexing
• Output processors: TCP, syslog, HTTP
• indexAndForward
• Sign blocks
• Calculate license volume and throughput metrics
• Index
• [Write to disk] / [forward elsewhere] / ...
• Can happen on HF or indexer
22. Splunk Data Ingest
[Diagram: same topology as before; the UFs are marked "not parsing", the HF and indexer "parsing".]
Note: the data is parsed at the first component that has a parsing engine – and not again. This affects where you put certain props.conf and transforms.conf files (a.k.a. sometimes they go on the forwarder).
24. On-boarding Process
• Identify the specific sourcetype(s) – onboard each separately
• Check for a pre-existing app/TA on splunk.com – don't reinvent the wheel!
• Gather info
  – Where does this data originate/reside? How will Splunk collect it?
  – Which users/groups will need access to this data? Access controls?
  – Determine the indexing volume and data retention requirements
  – Will this data need to drive existing dashboards (ES, PCI, etc.)?
  – Who is the SME for this data?
• Map it out
  – Get a "big enough" sample of the event data
  – Identify and map out fields
  – Assign sourcetype and TA names according to CIM conventions
25. On-boarding Process
1. Dev
  • Create (or use) an app
  • Props / inputs definition
  • Sourcetype definition
  • Use data import wizard
  • Import, tweak, repeat
  • Oneshot
  • [hook up monitor]
2. Test
  • Deploy app
  • Oneshot
  • Validate
  • Hook up monitor
  • Validate
3. Prod
  • Deploy app
  • Validate
  • Monitor
26. Good Hygiene
• General:
  – Use apps for configs
  – Use TAs / add-ons from Splunk if possible
  – Use dev, test, prod (dev can be a laptop; test can be ephemeral)
• UF when possible
• HF only if filtering / transforming is required in foreign land
• Unique sourcetype per event stream
• Don't send data through Search Heads
• Don't send data direct to Indexers
27. Good Hygiene
• inputs.conf
  – As specific as possible
  – Set sourcetype, if possible
  – Don't let Splunk auto-sourcetype (no ...too_small)
  – Specify index if possible
• props.conf
  – Set: TIME_PREFIX, TIME_FORMAT, MAX_TIMESTAMP_LOOKAHEAD
  – Optimally: SHOULD_LINEMERGE = false, LINE_BREAKER, TRUNCATE
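Put together, the hygiene above might look like the following sketch (the path, sourcetype, and index names are hypothetical):

```ini
# inputs.conf – be specific: one file, explicit sourcetype and index
[monitor:///var/log/acme/app.log]
sourcetype = acme:app
index = acme_app

# props.conf – explicit timestamping and line breaking
[acme:app]
# Timestamp sits at the start of each event
TIME_PREFIX = ^
TIME_FORMAT = %Y-%m-%d %H:%M:%S
MAX_TIMESTAMP_LOOKAHEAD = 25
# Break on newlines; don't run the line-merging pass
SHOULD_LINEMERGE = false
LINE_BREAKER = ([\r\n]+)
TRUNCATE = 10000
```

Spelling all of this out avoids auto-sourcetyping, wrong timestamps, and the expensive line-merge code path.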
29. Pre-Board
• Identify the specific sourcetype(s) – onboard each separately
• Check for a pre-existing app/TA on splunk.com – don't reinvent the wheel!
• Gather info
  – Where does this data originate/reside? How will Splunk collect it?
  – Which users/groups will need access to this data? Access controls?
  – Determine the indexing volume and data retention requirements
  – Will this data need to drive existing dashboards (ES, PCI, etc.)?
  – Who is the SME for this data?
• Map it out
  – Get a "big enough" sample of the event data
  – Identify and map out fields
  – Assign sourcetype and TA names according to CIM conventions
30. Tangent: What is the CIM and why should I care?
• The Common Information Model (CIM) defines relationships in the underlying data, while leaving the raw machine data intact
• A naming convention for fields, eventtypes & tags
• More advanced reporting and correlation requires that the data be normalized, categorized, and parsed
• CIM-compliant data sources can drive CIM-based dashboards (ES, PCI, others)
31. Build the index-time configs
• Identify necessary configs (inputs, props and transforms) to properly handle:
  – timestamp extraction, timezone, event breaking, sourcetype/host/source assignments
• Do events contain sensitive data (i.e., PII, PAN, etc.)? Create masking transforms if necessary
• Package all index-time configs into the TA
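One common way to mask sensitive values at index time is a SEDCMD in props.conf. A sketch, assuming a hypothetical sourcetype carrying 16-digit card numbers:

```ini
# props.conf – index-time masking; must live where the data is parsed
# "acme:payments" is a made-up sourcetype for illustration.
[acme:payments]
# Replace the first 12 digits of a 16-digit PAN with X's,
# keeping the last 4 for correlation
SEDCMD-mask_pan = s/\d{12}(\d{4})/XXXXXXXXXXXX\1/g
```

Because this runs at parse time, the raw data on disk never contains the full PAN, which is usually what the compliance requirement actually demands.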
32. Tangent: Best & Worst Practices
• Assign sourcetype according to event format; events with similar format should have the same sourcetype
• When do I need a separate index?
  – When the data volume will be very large, or when it will be searched exclusively a lot
  – When access to the data needs to be controlled
  – When the data requires a specific data retention policy
• Resist the temptation to create lots of indexes
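When a separate index is justified (access control or a distinct retention policy), the definition itself is small. A sketch with hypothetical names and a 90-day retention:

```ini
# indexes.conf – deploy to the indexers; "acme_app" is a made-up name
[acme_app]
homePath   = $SPLUNK_DB/acme_app/db
coldPath   = $SPLUNK_DB/acme_app/colddb
thawedPath = $SPLUNK_DB/acme_app/thaweddb
# Retention: freeze (delete or archive) events older than 90 days
frozenTimePeriodInSecs = 7776000
```

Access control then happens by granting roles search access to this index, which is the clean reason to split it out in the first place.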
33. Best & Worst Practices – [monitor]
• Always specify a sourcetype and index
• Be as specific as possible: use /var/log/fubar.log, not /var/log/
• Arrange your monitored filesystems to minimize unnecessary monitored logfiles
• Use a scratch index while testing new inputs
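For example (the sourcetype and index names are illustrative), monitor the one file you care about rather than the whole directory, and point the test input at a scratch index:

```ini
# inputs.conf – specific file, explicit sourcetype, scratch index for testing
[monitor:///var/log/fubar.log]
sourcetype = fubar
index = scratch
```

Once event breaking and timestamps validate cleanly in the scratch index, switch the stanza to the production index.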
34. Best & Worst Practices – [monitor]
• Look out for inadvertent, runaway monitor clauses
• Don't monitor thousands of files unnecessarily – that's the NSA's job
• From the CLI: splunk show monitor
• From your browser: https://your_splunkd:8089/services/admin/inputstatus/TailingProcessor:FileStatus
35. Another Tangent! Your friend, the Data Previewer
• Find & fix index-time problems BEFORE polluting your index
• A try-it-before-you-fry-it interface for figuring out
  – Event breaking
  – Timestamp recognition
  – Timezone assignment
• Provides the necessary props.conf parameter settings
37. Build the search-time configs: eventtypes & tags
• Identify "interesting" events which should be tagged with an existing CIM tag (http://docs.splunk.com/Documentation/CIM/latest/User/Alerts)
• Get a list of all current tags:
  | rest splunk_server=local /services/admin/tags | rename tag_name as tag, field_name_value AS definition, eai:acl.app AS app | eval definition_and_app=definition . " (" . app . ")" | stats values(definition_and_app) as "definitions (app)" by tag | sort +tag
• Get a list of all eventtypes (with associated tags):
  | rest splunk_server=local /services/admin/eventtypes | rename title as eventtype, search AS definition, eai:acl.app AS app | table eventtype definition app tags | sort +eventtype
• Examine the current list of CIM tags. For each "interesting" event, identify which tags should be applied to each. A particular event may have multiple tags.
• Are there new tags which should be created, beyond those in the current CIM tag library? If so, add them to the CIM library
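Once the interesting events and their CIM tags are identified, the wiring is two small files. A sketch, with a hypothetical sourcetype and eventtype name:

```ini
# eventtypes.conf – define the eventtype by a search
[acme_auth_failure]
search = sourcetype=acme:app action=failure

# tags.conf – attach CIM tags to that eventtype
[eventtype=acme_auth_failure]
authentication = enabled
failure = enabled
```

With these in the TA, any CIM-based dashboard that searches `tag=authentication tag=failure` picks up the new source automatically.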
38. Build the search-time configs: extractions & lookups
• Extract "interesting" fields
  – If already in your CIM library, name or alias appropriately
  – If not already in your CIM library, name according to CIM conventions
• Add lookups for missing/desirable fields
  – Lookups may be required to supply CIM-compliant fields/field values (for example, to convert 'sev=42' to 'severity=medium')
  – Make the values more readable for humans
• Put everything into the TA package
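The 'sev=42' to 'severity=medium' conversion above can be done with an automatic lookup. A sketch, with hypothetical file, sourcetype, and field names:

```ini
# transforms.conf – define the lookup table
# severity_lookup.csv has columns: sev,severity (e.g. "42,medium")
[severity_lookup]
filename = severity_lookup.csv

# props.conf – apply it automatically at search time
[acme:app]
LOOKUP-severity = severity_lookup sev OUTPUT severity
```

Because this is search-time config, it belongs in the TA on the search heads and costs nothing at index time.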
39. Keep Going
• Create data models. What will be interesting for end users?
• Document! (Especially the fields, eventtypes & tags)
• Test
  – Does this data drive relevant existing dashboards correctly?
  – Do the data models work properly / produce correct results?
  – Is the TA packaged properly?
• Check with originating user/group; is it OK?
40. Get ready to deploy
• Determine additional Splunk infrastructure required; can existing infrastructure & license support this?
• Will new forwarders be required? If so, initiate CR process(es)
• Will firewall changes be required? If so, initiate CR process(es)
• Will new Splunk roles be required? Create & map to AD roles
• Will new app contexts be required? Create app(s) as necessary
• Will new users be added? Create the accounts
41. Bring it!
• Deploy new search heads & indexers as needed
• Install new forwarders as needed
• Deploy new app & TA to search heads & indexers
• Deploy new TA to relevant forwarders
42. Test & Validate
• All sources reporting?
• Event breaking, timestamp, timezone, host, source, sourcetype?
• Field extractions, aliases, lookups?
• Eventtypes, tags?
• Data model(s)?
• User access?
• Confirm with original requesting user/group: looks OK?
44. Gee, this seems like a lot of work…
• Bring new data sources in correctly the first time
• Reduce the amount of "bad" data in your indexes – and the time spent dealing with it
• Make the new data immediately useful to ALL users – not just the ones who originally requested it
• Allow the data to drive all sorts of dashboards without extra modifications
45. Reference
• What Splunk can monitor: http://docs.splunk.com/Documentation/Splunk/latest/Data/WhatSplunkcanmonitor
• How data moves through Splunk: http://docs.splunk.com/Documentation/Splunk/latest/Deploy/Datapipeline
• Components of the data pipeline: http://docs.splunk.com/Documentation/Splunk/latest/Deploy/Componentsofadistributedenvironment
• Common Information Model app: https://splunkbase.splunk.com/app/1621
• Common Information Model docs: http://docs.splunk.com/Documentation/CIM/latest/User/Overview
• Where do I put configs: http://wiki.splunk.com/Where_do_I_configure_my_Splunk_settings