This document discusses how to collect big data into Hadoop using Apache Flume and Fluentd. It describes the problems with a poor man's approach to data collection and explains how the principles of divide and conquer and streaming make collection more efficient. It then gives an overview of how Apache Flume and Fluentd work, covering their network topologies, configurations, and plugin systems. Examples show how Fluentd is used at Treasure Data to collect and analyze REST API logs, backend logs, and Hadoop logs. The document concludes with a discussion of developing plugins for Fluentd.
10. Poor man’s data collection
1. Copy files from servers using rsync
2. Create a RegExp to parse the files
3. Parse the files and generate a 10GB CSV file
4. Put it into HDFS
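Steps 2–3 usually turn into a one-off script; a minimal Ruby sketch of the parse-to-CSV step (the log line and regexp are hypothetical examples, not from the deck):

```ruby
require 'csv'

# Hypothetical regexp for an Apache-style access log line
LINE_RE = /^(?<host>\S+) \S+ \S+ \[(?<time>[^\]]+)\] "(?<req>[^"]*)" (?<code>\d+) (?<size>\d+)/

line = '127.0.0.1 - - [10/Feb/2013:12:00:00 +0900] "GET /api/v1 HTTP/1.1" 200 512'

if (m = LINE_RE.match(line))
  # One CSV row per log line; broken lines would need explicit error handling
  puts [m[:host], m[:time], m[:req], m[:code], m[:size]].to_csv
end
```

Every unmatched or broken line falls through silently here, which is exactly the error-handling burden the next slides describe.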
11. Problems to collect “big data”
> Includes broken values
> needs error handling & retrying
> Time-series data are changing and unclear
> parse logs before storing
> Takes time to read/write
> tools have to be optimized and parallelized
> Takes time for trial & error
> Causes network traffic spikes
12. Problem of poor man’s data collection
> Wastes time to implement error handling
> Wastes time to maintain a parser
> Wastes time to debug the tool
> Not reliable
> Not efficient
27. Fluentd - configuration
[diagram: many fluentd nodes, each holding its own configuration]
Use chef, puppet, etc. for configuration
(they do things better)
No central node - keep things simple
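Because there is no central node, every server can receive the same small config from chef or puppet; a hypothetical per-node sketch (the host name and tag pattern are made up):

```
<source>
  type forward
  port 24224
</source>

<match myapp.**>
  type forward
  host aggregator.example.com
  port 24224
</match>
```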
33. Concept of Fluentd
Customization is essential
> small core + many plugins
Fluentd core helps to implement plugins
> common features are already implemented
43. Fluentd at Treasure Data - REST API logs
[diagram: Rails apps on the API servers post logs through fluent-logger-ruby to a local fluentd (in_forward); each fluentd relays events via out_forward to the watch-server fluentd]
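The fluent-logger-ruby arrows above boil down to posting a (tag, time, record) tuple to the local fluentd. A minimal Ruby sketch of that event shape (the tag, record, and timestamp are hypothetical; the real gem serializes the tuple with MessagePack and sends it to in_forward over TCP):

```ruby
require 'json'

# Build the event tuple that fluent-logger-ruby conceptually hands to fluentd.
# The real gem serializes this as MessagePack and writes it to the local
# fluentd's in_forward TCP port (24224 by default).
def build_event(tag, record, time = Time.now.to_i)
  [tag, time, record]
end

# Hypothetical tag, record, and timestamp
event = build_event("myapp.access", {"user" => 1, "size" => 42}, 1_360_000_000)
puts event.to_json
# => ["myapp.access",1360000000,{"user":1,"size":42}]
```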
44. Fluentd at Treasure Data - backend logs
[diagram: Rails apps on the API servers and Ruby apps on the worker servers post logs through fluent-logger-ruby to their local fluentd (in_forward); all fluentd instances relay via out_forward to the watch-server fluentd]
45. Fluentd at Treasure Data - monitoring
[diagram: API servers (Rails apps) hand work to worker servers (Ruby apps) through PerfectQueue; each server's local fluentd (in_forward) relays logs via out_forward to the watch-server fluentd, which also runs a monitoring script through in_exec]
46. Fluentd at Treasure Data - Hadoop logs
✓ resource consumption statistics for each user
✓ capacity monitoring
[diagram: a script calls the Hadoop JobTracker's Thrift API; the watch-server fluentd collects the script's output via in_exec]
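The in_exec arrow corresponds to a source like the following sketch; the command path, keys, and tag are hypothetical:

```
<source>
  type exec
  command ruby /opt/td/jobtracker_stats.rb
  format tsv
  keys user,map_slots,reduce_slots
  tag hadoop.jobtracker
  run_interval 60s
</source>
```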
47. Fluentd at Treasure Data - store & analyze
[diagram: the watch-server fluentd fans out to two outputs: out_tdlog → Treasure Data (for historical analysis) and out_metricsense → Librato Metrics (streaming aggregation, for realtime analysis)]
51. class SomeInput < Fluent::Input
  Fluent::Plugin.register_input('myin', self)
  config_param :tag, :string

  def start
    Thread.new {
      while true
        time = Engine.now
        record = {"user"=>1, "size"=>1}
        Engine.emit(@tag, time, record)
      end
    }
  end

  def shutdown
    ...
  end
end

<source>
  type myin
  tag myapp.api.heartbeat
</source>
52. class SomeOutput < Fluent::BufferedOutput
  Fluent::Plugin.register_output('myout', self)
  config_param :myparam, :string

  def format(tag, time, record)
    [tag, time, record].to_json + "\n"
  end

  def write(chunk)
    puts chunk.read
  end
end

<match **>
  type myout
  myparam foobar
</match>
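To see what ends up in a buffer chunk, here is the format logic applied to a single event; the tag and timestamp are hypothetical:

```ruby
require 'json'

# What SomeOutput#format appends to the buffer for one event
tag    = "myapp.api.heartbeat"   # hypothetical tag
time   = 1_360_000_000           # hypothetical unix timestamp
record = {"user" => 1, "size" => 1}

line = [tag, time, record].to_json + "\n"
print line
# => ["myapp.api.heartbeat",1360000000,{"user":1,"size":1}]
```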
53. class MyTailInput < Fluent::TailInput
  Fluent::Plugin.register_input('mytail', self)

  def configure_parser(conf)
    ...
  end

  def parse_line(line)
    array = line.split("\t")
    time = Engine.now
    record = {"user"=>array[0], "item"=>array[1]}
    return time, record
  end
end

<source>
  type mytail
</source>
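For a concrete tab-separated line, the parse_line logic above produces a record like this (the sample input is hypothetical):

```ruby
require 'json'

# Same splitting logic as MyTailInput#parse_line
line = "alice\titem42"           # hypothetical tab-separated log line
array = line.split("\t")
record = {"user" => array[0], "item" => array[1]}
puts record.to_json
# => {"user":"alice","item":"item42"}
```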