Case Study: Retail WiFi Log-file 
Analysis with Hadoop and Impala
Agenda 
 Before getting into projects 
 General planning considerations 
 Most necessary things 
 Data life cycle in Hadoop 
 Retail WiFi Log-file Analysis with Hadoop – A Use Case
Before getting into projects 
 Understand the problem, the requirements, and feasibility. 
 Come up with a document, a model, and a design. 
 Study the existing model, then modify and adopt it. 
 Resources, cost, and building expertise matter most. 
 Define stages and a development path. 
 Map the existing model to the new model – define the corresponding components. 
 Consider Hadoop design principles.
Before getting into projects, continued… 
This will provide the basis for choosing the right Hadoop implementation and creating 
an infrastructure. 
 What data will be stored and analyzed? What questions do you want to answer? 
 What methods of analysis do you plan to use? 
 What data do you want to bring into Hadoop? 
 How will Hadoop fit into my environment? 
 Is Hadoop/NoSQL actually required? 
 Which of the various big data technologies resolves the problem in the most 
effective manner? 
 Does adopting these technologies address your business needs? 
 How will your use of Hadoop expand across users, applications, and so on? 
 What is the scope of applications to be supported and what are their requirements 
(disaster recovery, availability, and so on)? 
 What existing, new tools and infrastructure do you want to integrate with Hadoop? 
 Will your administrators need management tools or will administrators be hired or 
trained for managing Hadoop? 
Get your answers
Hadoop – are things complicated? 
When it comes to Hadoop, however, things become a little 
bit more complicated. 
Hadoop encompasses a multiplicity of tools that are 
designed and implemented to work together. 
(See the speaker notes.)
General Planning Considerations 
There is a lot to consider when deploying, configuring, managing, and 
scaling Hadoop clusters in a way that optimizes performance and 
resource utilization. 
Operating system: Using a 64-bit operating system helps to avoid 
constraining the amount of memory that can be used on worker 
nodes. 
Computation V/S Data: Computational (or processing) capacity is 
determined by the aggregate number of Map/Reduce slots 
available across all nodes in a cluster. Map/Reduce slots are 
configured on a per-server basis. I/O performance issues can arise 
from sub-optimal disk-to-core ratios (too many slots and too few 
disks). HyperThreading improves process scheduling, allowing 
you to configure more Map/Reduce slots.
General Planning Considerations continued… 
Memory: Depending on the application, your system’s 
memory requirements will vary. 
Storage: A Hadoop platform that’s designed to achieve 
performance and scalability by moving the compute activity 
to the data is preferable. Data storage requirements for the 
worker nodes may be best met by direct attached storage 
(DAS) in a Just a Bunch of Disks (JBOD) configuration and 
not as DAS with RAID or Network Attached Storage (NAS).
General Planning Considerations continued… 
Capacity: The number of disks and their corresponding 
storage capacity determines the total amount of the 
FileServer storage capacity for your cluster. The more disks 
you have, the less likely it is that you will have multiple tasks 
accessing a given disk at the same time. 
Network: 
 Hadoop is very bandwidth-intensive; use dedicated switches 
and enable rack awareness. 
 Beware of oversubscription in top-of-rack and core switches. 
 Consider bonded Ethernet to mitigate against link failure.
Most necessary things in the Hadoop ecosystem 
Automated deployment 
 Automated deployment of both operating systems and the Hadoop 
software ensures consistent, streamlined deployments. 
 Documented configurations are simpler to deploy than traditional 
manual IT deployment, testing, and validation strategies. 
Configuration management 
 A configuration management tool for all Hadoop environments handles 
configuration changes, manages the installation of the Hadoop 
software, and provides a single interface to the Hadoop 
environment for updates and configuration changes. 
Monitoring and alerting 
 Hardware monitoring and alerting is an important part of all 
dynamic IT environments. Successful monitoring and alerting 
ensures that problems are caught as soon as possible and 
administrators are alerted so the problems can be corrected before 
users are impacted.
A Framework for Considering Hadoop 
Distributions 
Core distribution: All vendors use the Apache Hadoop core and 
package it for enterprise use. 
 Management capabilities: Some vendors provide an additional layer 
of management software that helps administrators configure, 
monitor, and tune Hadoop. 
 Enterprise reliability and integration: A third group of vendors offers 
a more robust package, including a management layer augmented 
with connectors to existing enterprise systems 
and engineered to provide the same high level 
of availability, scalability, and reliability as 
other enterprise systems, along with support.
Data life cycle In Hadoop. 
Data input(collection and load) 
Data storage 
Data analysis and processing. 
Data product
Data life cycle in Hadoop continued… 
Data collection 
Getting the data into a Hadoop cluster is the first step in any 
Big Data deployment 
This raw data is required by many downstream models. 
1. Flume: Apache Flume is a distributed, reliable, and available 
service for efficiently collecting, aggregating, and moving 
large amounts of log data. 
2. Sqoop: Apache Sqoop(TM) is a tool designed for 
efficiently transferring bulk data between Apache Hadoop 
and structured datastores such as relational databases 
(a sketch of a typical invocation follows this list). 
3. Writing your own collector is also an option.
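As an illustration of the Sqoop option (Sqoop is not used in the case study that follows; the connection string, credentials, and table name below are placeholders), a typical import of a relational table into HDFS looks roughly like: 
sqoop import \
  --connect jdbc:mysql://dbhost/retaildb \
  --username retail_user -P \
  --table store_visits \
  --target-dir /user/sqoop/store_visits \
  -m 4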
Data life cycle in Hadoop continued… 
Data storage 
HDFS 
HBase 
Hive 
Use an ETL model before storing the data.
Data life cycle in Hadoop continued… 
Data processing 
1. MapReduce is the primary approach (including Hadoop 
Streaming; a sketch follows below). 
2. Is MapReduce too difficult? Use HBase, Hive, or Pig.
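For reference, a minimal Hadoop Streaming invocation looks like the sketch below; the mapper and reducer scripts and the HDFS paths are hypothetical, and the location of the streaming jar varies by distribution. 
hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-*.jar \
  -input /user/flume/wifilogs \
  -output /user/streaming/visit_counts \
  -mapper parse_logs.py \
  -reducer count_visits.py \
  -file parse_logs.py -file count_visits.py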
Data life cycle in Hadoop continued… 
Data product 
1. Integration via a REST architecture 
2. Application-level computation 
3. Integration with application-level RDBMSs
A typical Hadoop System
Hadoop Ecosystem
Retail WiFi Log-file Analysis 
with Hadoop and Impala
Tracking user visits 
Bringing online-style user tracking to retail stores. 
Data source: 
logs from WiFi access points, generated when visitors' WiFi-enabled 
mobile phones connect.
The collected data should answer the 
following questions for a particular store: 
How many people visited the store (unique visits)? 
How many visits did we have in total? 
What is the average visit duration? 
How many people are new vs. returning? 
In which situations do we get the most visitors? 
Security-related questions
Business Architecture 
Columns: Data | Logic & Rules | Applications (fed by the data sources) 
Data: 
• Receive and process messages 
• Store in flat files 
• Store in operational & reporting databases 
Logic & Rules: 
• Apply business rules – event- and condition-specific 
• Consistent rules and logic used by multiple applications and tools 
• Store in operational & reporting databases 
Applications: 
• General analytics 
• Service and offers subscription system 
• Security
Set up 
 WiFi access points simulating two different stores, with OpenWRT, a 
Linux-based firmware for routers, installed * 
 A virtual machine acting as a central syslog daemon collecting all log 
messages from the WiFi routers 
 Flume to move all log messages to HDFS without any manual 
intervention (no transformation, no filtering) 
 A CDH4 cluster installed and monitored with Cloudera 
Manager 
 Pentaho Data Integration’s graphical designer for data transformation, 
parsing, filtering, and loading into the warehouse (Hive) 
 Hive as the data warehouse system on top of Hadoop to project structure 
onto the data 
 Impala for querying data from Hive in real time 
 Microsoft Excel to visualize results **
Data collection 
Configure the WiFi access points to send all their local syslog 
messages, over UDP or TCP, to a central syslog server backed by 
shared storage. 
This is done through OpenWRT’s Unified Configuration Interface, simply 
called UCI. Assuming your syslog server listens on address 192.168.0.1 
over UDP/TCP and you want more detailed log output, the configuration 
looks like the sketch below.
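The original slide does not reproduce the configuration, so the following is only a rough sketch using OpenWRT's standard system options (log_ip, log_port, log_proto); the concrete values are assumptions. In /etc/config/system: 
config system
        option log_ip '192.168.0.1'
        option log_port '514'
        option log_proto 'udp'
or, equivalently, from the shell: 
uci set system.@system[0].log_ip='192.168.0.1'
uci set system.@system[0].log_port='514'
uci set system.@system[0].log_proto='udp'
uci commit system
# then restart the logging service or reboot the router for the change to take effect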
Data collection 
Logs are sent on a periodic basis, for example at 4-hour or 
12-hour intervals, and some are sent in real time. 
The shared storage server is configured to accept 
messages from remote hosts over HTTP/FTP and is 
instructed where to write them, using a defined naming 
scheme such as per time window or per WiFi access point. They are 
written as plain text files on disk. 
The logs are stored in shared storage like: 
/var/logs/wifilogs/20130612/bang-mainbranch.log 
/var/logs/wifilogs/20130612/bang-branch1.log 
/var/logs/wifilogs/20130612/bang-branch2.log 
/var/logs/wifilogs/20130612/realtime.logs
Data collection (architecture) 
[Diagram] WiFi access points send syslog over TCP/UDP to data collection 
servers, which store the logs on their local disks; HTTP/FTP clients on the 
collection servers then push the files to an HTTP/FTP server backed by the 
shared storage.
Data sample 
The exported log file looks like: 
2013-01-21T13:39:51+01:00 buffalo hostapd: wlan0: STA 10:68:3f:40:xx:yy WPA: start authentication 
2013-01-21T13:39:51+01:00 buffalo hostapd: wlan0: STA 10:68:3f:40:xx:yy IEEE 802.1X: unauthorizing port 
2013-01-21T13:39:51+01:00 buffalo hostapd: wlan0: STA 10:68:3f:40:xx:yy WPA: sending 1/4 msg of 4-Way Handshake 
2013-01-21T13:39:51+01:00 buffalo hostapd: wlan0: STA 10:68:3f:40:xx:yy WPA: received EAPOL-Key frame (2/4 Pairwise) 
2013-01-21T13:39:51+01:00 buffalo hostapd: wlan0: STA 10:68:3f:40:xx:yy WPA: sending 3/4 msg of 4-Way Handshake 
2013-01-21T13:39:51+01:00 buffalo hostapd: wlan0: STA 10:68:3f:40:xx:yy WPA: received EAPOL-Key frame (4/4 Pairwise) 
2013-01-21T13:39:51+01:00 buffalo hostapd: wlan0: STA 10:68:3f:40:xx:yy IEEE 802.1X: authorizing port 
2013-01-21T13:39:51+01:00 buffalo hostapd: wlan0: STA 10:68:3f:40:xx:yy WPA: pairwise key handshake completed (RSN) 
2013-01-21T13:41:25+01:00 fonera hostapd: wlan0: STA 24:ab:81:91:xx:yy IEEE 802.11: authentication OK (open system) 
2013-01-21T13:41:25+01:00 fonera hostapd: wlan0: STA 24:ab:81:91:xx:yy MLME: MLME-AUTHENTICATE. 
indication(24:ab:81:91:c8:62, OPEN_SYSTEM) 
2013-01-21T13:41:25+01:00 fonera hostapd: wlan0: STA 24:ab:81:91:xx:yy MLME: MLME-DELETEKEYS.request(24:ab:81:91:c8:62) 
2013-01-21T13:41:25+01:00 fonera hostapd: wlan0: STA 24:ab:81:91:xx:yy IEEE 802.11: authenticated 
2013-01-21T13:41:25+01:00 fonera hostapd: wlan0: STA 24:ab:81:91:xx:yy IEEE 802.11: association OK (aid 1) 
2013-01-21T13:41:25+01:00 fonera hostapd: wlan0: STA 24:ab:81:91:xx:yy IEEE 802.11: associated (aid 1)
Blueprint for a Data Management 
System with Hadoop
Logical Architecture 
Ingest: transportation and storage (HTTP, Flume) 
Parse: sectioning and record formation (Flume, PDI) 
Transform: object creation (PDI) 
Publish: real time, batch mode, integration patterns (RDBMS) 
View: reporting
Data ingestion to HDFS 
Since we have a log file as the data source, we set up Flume to stream 
the incoming content to HDFS. In Flume terminology we have the 
following components: 
 the data from the log file as the source 
 the HDFS folder /user/flume/bang-mainbranch as the sink 
(we preferred a flat directory layout to simplify 
access to and processing of the files later on) 
 a channel, c1, to connect the source to the sink
FLUME : 
 Flume is a distributed, reliable, 
and available service for 
efficiently collecting, 
aggregating, and moving large 
amounts of log data. It has a 
simple and flexible architecture 
based on streaming data flows. 
It is robust and fault tolerant 
with tunable reliability 
mechanisms and many failover 
and recovery mechanisms. It 
uses a simple extensible data 
model that allows for online 
analytic application.
Data ingestion to HDFS continued… 
Flume configuration 
a1.sources = r1 
a1.sinks = k1 
a1.channels = c1 
a1.channels.c1.type = memory 
a1.channels.c1.capacity = 1000 
a1.channels.c1.transactionCapacity = 100 
# get data from exec command 
a1.sources.r1.type = exec 
a1.sources.r1.command = tail -F /var/logs/wifilogs/20130612/realtime.logs 
a1.sources.r1.interceptors = i1 i2 
a1.sources.r1.interceptors.i1.type = org.apache.flume.interceptor.HostInterceptor$Builder 
a1.sources.r1.interceptors.i1.preserveExisting = false 
a1.sources.r1.interceptors.i1.hostHeader = hostname 
a1.sources.r1.interceptors.i2.type = org.apache.flume.interceptor.TimestampInterceptor$Builder 
# define hdfs sink 
a1.sinks.k1.type = hdfs 
a1.sinks.k1.hdfs.path = hdfs://cdh-master.cdh-cluster:9000/user/flume/wifilogs/20130612/realtime.logs 
a1.sinks.k1.hdfs.rollInterval = 120 
a1.sinks.k1.hdfs.rollCount = 100 
a1.sinks.k1.hdfs.rollSize = 0 
a1.sinks.k1.hdfs.fileType = DataStream 
# bind source and sink to channel 
a1.sources.r1.channels = c1 
a1.sinks.k1.channel = c1
Data ingestion to HDFS continued… 
Because Flume can collect data from various sources, it is also 
possible to configure the Flume agent itself as the collection server, 
with the WiFi access points sending their log messages directly 
to the Flume agent. 
The agent is started as follows: 
flume-ng agent --conf-file ./flume-datastream.conf --name a1 -Dflume.root.logger=INFO,console
Sample data as it appears in HDFS, 
under the directory structure /user/flume/20120612/datastream: 
2013-01-17T15:50:41+01:00 192.168.201.197 dropbear[1172]: Child connection from 192.168.201.99:55001 
2013-01-17T15:50:46+01:00 192.168.201.197 dropbear[1172]: Password auth succeeded for 'root' from 
192.168.201.99:55001 
2013-01-17T15:50:52+01:00 192.168.201.197 dropbear[1172]: Exit (root): Disconnect received 
2013-01-17T15:52:14+01:00 fonera hostapd: wlan0: STA 8c:64:22:3a:74:1f IEEE 802.11: disassociated due to 
inactivity 
2013-01-17T15:52:14+01:00 fonera hostapd: wlan0: STA 8c:64:22:3a:74:1f MLME: MLME-DISASSOCIATE. 
indication(8c:64:22:3a:74:1f, 4) 
2013-01-17T15:52:14+01:00 fonera hostapd: wlan0: STA 8c:64:22:3a:74:1f MLME: MLME-DELETEKEYS. 
request(8c:64:22:3a:74:1f)
Parse and Transform 
To load the raw data into the Hive data warehouse, it has to be parsed 
into a comma-separated format that matches the Hive schema. 
There are quite a few open-source BI/ETL tools on the market for 
this: Palo, SpagoBI, Pentaho, Talend and many more. 
We used Pentaho Data Integration (PDI).
Pentaho: 
 Pentaho Data Integration is the 
ETL Server technology that will 
be used to facilitate movement 
of data between the new back-end 
Hadoop environment and 
downstream RDBMS systems.
Parse and Transform continued… 
Pentaho for ETL 
The WiFi router logs collected with Flume are stored in HDFS; PDI 
is used for transformation, parsing, filtering and finally 
loading into Hive’s data warehouse. 
To import the raw data into the Hive data warehouse we need to 
parse it into a comma-separated format 
according to the Hive schema. 
PDI enabled us to design a MapReduce job for distributed 
processing of this task across multiple nodes without any 
programming.
Parse and Transform continued… 
Pentaho Data Integration’s Graphical Designer
Parse and Transform continued… 
MapReduce 
The map phase reads all the raw log files collected by 
Flume on HDFS. 
The input is interpreted as TextInputFormat, so 
every line goes through a regex evaluation during the map 
phase. 
Filtering and transformation use the “Regex Evaluation” step: 
 For filtering, only lines that match the pattern are selected. 
 For transformation, only a few columns are taken from each 
selected line.
Parse and Transform continued… 
In the map phase: 
Transformation 
 Match each line against a regular expression. 
 Split the matched line into fields that will be used as 
columns. 
This is the regex for the transformation (backslashes restored; see the 
Java sketch after it): 
^((\d{4})-(\d{2})-(\d{2})\w(\d{2}):(\d{2}):(\d{2})([+-]\d{2}:\d{2})) ([.a-zA-Z_0-9]*?) (.*?): (.*?): \w*? ([\w+:]{0,18}) (.*?): (.*)$
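To make the field mapping concrete, here is a minimal, self-contained Java sketch (not part of the original PDI job) that applies the regex to one of the sample log lines; the group numbers correspond to the column list on the next slide. 

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LogLineParser {
    // The slide's regex, with backslashes escaped for a Java string literal.
    private static final Pattern LOG_LINE = Pattern.compile(
        "^((\\d{4})-(\\d{2})-(\\d{2})\\w(\\d{2}):(\\d{2}):(\\d{2})([+-]\\d{2}:\\d{2})) "
        + "([.a-zA-Z_0-9]*?) (.*?): (.*?): \\w*? ([\\w+:]{0,18}) (.*?): (.*)$");

    public static void main(String[] args) {
        String line = "2013-01-21T13:41:25+01:00 fonera hostapd: wlan0: STA 24:ab:81:91:xx:yy"
                + " IEEE 802.11: authentication OK (open system)";
        Matcher m = LOG_LINE.matcher(line);
        if (m.matches()) {
            // Groups 2-8 hold year..timezone; groups 9-14 hold host..message.
            System.out.println("host        = " + m.group(9));   // fonera
            System.out.println("facility    = " + m.group(10));  // hostapd
            System.out.println("service     = " + m.group(11));  // wlan0
            System.out.println("mac_address = " + m.group(12));  // 24:ab:81:91:xx:yy
            System.out.println("protocol    = " + m.group(13));  // IEEE 802.11
            System.out.println("message     = " + m.group(14));  // authentication OK (open system)
        }
    }
}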
Parse and Transform continued… 
The matched lines from the regex are divided into different columns 
like this: 
year Integer 
month Integer 
day Integer 
hour Integer 
minute Integer 
second Integer 
timezone String 
host String 
facility_level String 
service_level String 
mac_address String 
protocol String 
message String
Parse and Transform continued… 
Filtering 
 All lines that do not match the regular expression are filtered out, 
such as error lines. 
 Those lines are discarded because they carry information (e.g. 
“Disconnect received”) that is useless for this use case. 
 A ‘Filter Rows’ step removes empty lines, ensuring there are no 
empty lines in the output after matching against the regular expression. 
Example lines: 
2013-01-17T15:50:41+01:00 192.168.201.197 dropbear[1172]: Child connection from 192.168.201.99:55001 
2013-01-17T15:50:46+01:00 192.168.201.197 dropbear[1172]: Password auth succeeded for 'root' from 
192.168.201.99:55001 
2013-01-17T15:50:52+01:00 192.168.201.197 dropbear[1172]: Exit (root):
Parse and Transform continued… 
To produce comma-separated lines: 
 We used a ‘User Defined Java Expression’ step and concatenated the emitted 
fields, delimited by ‘,’. 
 At the beginning of each line we performed a further transformation, 
converting the ISO 8601 string to a Unix timestamp. 
 To answer time-related questions, e.g. average visit duration, we need 
values to calculate with; the Unix timestamp is suitable for this. 
 The output is stored in /user/20130612/routerlogs/parsed/ 
Here is the ‘User Defined Java Expression’: 
(javax.xml.bind.DatatypeConverter.parseDateTime(iso_8601).getTimeInMillis()/1000) + "," + year 
+ "," + month + "," + day + "," + hour + "," + minute + "," + second + "," + timezone + "," + host + 
"," + facility_level + "," + service_level + "," + mac_address + "," + protocol + "," + message
Parse and Transform continued… 
 Each output line contains comma-separated fields. 
 We configured our Pentaho MapReduce job to clean the output path before 
execution. 
 The output is stored in /user/20130612/routerlogs/parsed/ 
$hadoop fs –ls /user/20130612/routerlogs/parsed/ 
drwxrwxrwx - hadoopuser hadoopuser 0 2013-01-21 15:24 /user/20130612/routerlogs/parse/_logs 
-rw-r--r-- 3 hadoopuser hadoopuser 118963 2013-01-21 15:25 /user/20130612/routerlogs/parse/part-00000 
-rw-r--r-- 3 hadoopuser hadoopuser 100500 2013-01-21 15:25 /user/20130612/routerlogs/parse/part-00001 
-rw-r--r-- 3 hadoopuser hadoopuser 11826 2013-01-21 15:25 /user/20130612/routerlogs/parse/part-00002
Load 
At the very end, once the MapReduce job has finished, the transformed 
and parsed raw data lands in HDFS: 
1358765267,2013,1,21,11,47,47,+01:00,buffalo,hostapd,wlan0,10:68:3f:40:20:2d,IEEE 802.1X,authorizing port 
1358765267,2013,1,21,11,47,47,+01:00,buffalo,hostapd,wlan0,10:68:3f:40:20:2d,WPA,pairwise key handshake 
completed (RSN) 
Now we have parsed and transformed log files in HDFS. 
We used Pentaho Data Integration once again to import the 
data into Hive’s warehouse.
HIVE 
 Hive is a data warehouse 
system for Hadoop that 
facilitates easy data 
summarization, ad-hoc 
queries, and the analysis of 
large datasets stored in 
Hadoop compatible file 
systems. Hive provides a 
mechanism to project 
structure onto this data and 
query the data using a SQL-like 
language called HiveQL.
Load… 
We created a Hive table that matches the previously defined 
schema using the query editor (any query editor will do); a sketch follows below.
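The slide does not reproduce the DDL, so the following is only a sketch, assuming the column list shown earlier plus the Unix timestamp prepended by the Java expression; if any column name clashes with a reserved word in your Hive version, rename it. 
create table routerlogs ( ts int, 
year int, 
month int, 
day int, 
hour int, 
minute int, 
second int, 
timezone string, 
host string, 
facility_level string, 
service_level string, 
mac_address string, 
protocol string, 
message string) row format delimited 
fields terminated by ',';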
Load… 
Loading data into the Hive table is basically done by copying 
files on HDFS from /user/20130612/routerlogs/parsed to 
/user/hive/warehouse/routerlogs.20130612. 
The MapReduce job can be automated on a schedule, e.g. 
with Oozie. 
Incremental updates on the Hive table can be ensured by using 
a partitioned table or unique, date-based output file names 
(see the sketch below).
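A rough sketch of the partitioned-table variant; the partition column and table name are assumptions, not taken from the slides. 
-- hypothetical partitioned variant of the routerlogs table
create table routerlogs_part (
  ts int, year int, month int, day int, hour int, minute int, second int,
  timezone string, host string, facility_level string, service_level string,
  mac_address string, protocol string, message string)
partitioned by (log_date string)
row format delimited fields terminated by ',';

-- load one day's parsed output into its own partition
load data inpath '/user/20130612/routerlogs/parsed'
into table routerlogs_part partition (log_date = '20130612');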
Load… 
Querying the data with Impala: 
Query the data from the Hive table. Sample data in the Hive 
table: 
Sudhakara@localhost ~]# hadoop fs -cat /user/hive/warehouse/routerlogs 
1358756939,2013,1,21,9,28,59,+01:00,buffalo,hostapd,wlan0,98:0c:82:dc:8b:15,MLME,MLME-AUTHENTICATE. 
indication(98:0c:82:dc:8b:15, OPEN_SYSTEM) 
1358756939,2013,1,21,9,28,59,+01:00,buffalo,hostapd,wlan0,98:0c:82:dc:8b:15,MLME,MLME-DELETEKEYS. 
request(98:0c:82:dc:8b:15) 
1358756939,2013,1,21,9,28,59,+01:00,buffalo,hostapd,wlan0,98:0c:82:dc:8b:15,IEEE 802.11,authenticated 
1358756939,2013,1,21,9,28,59,+01:00,buffalo,hostapd,wlan0,98:0c:82:dc:8b:15,IEEE 802.11,association OK (aid 2) 
1358756939,2013,1,21,9,28,59,+01:00,buffalo,hostapd,wlan0,98:0c:82:dc:8b:15,IEEE 802.11,associated (aid 2) 
1358756939,2013,1,21,9,28,59,+01:00,buffalo,hostapd,wlan0,98:0c:82:dc:8b:15,MLME,MLME-ASSOCIATE. 
indication(98:0c:82:dc:8b:15) 
1358756939,2013,1,21,9,28,59,+01:00,buffalo,hostapd,wlan0,98:0c:82:dc:8b:15,MLME,MLME-DELETEKEYS. 
request(98:0c:82:dc:8b:15) 
1358757010,2013,1,21,9,30,10,+01:00,buffalo,hostapd,wlan0,98:0c:82:dc:8b:15,IEEE 802.11,deauthenticated
Analysis and Report 
As you can see, a line containing “authentication OK” represents a user 
entering the WiFi access area, i.e. a login. 
A line containing “deauthenticated” represents a user 
exiting the WiFi access area, i.e. a logout. 
After querying the data through Impala, the application 
calculates the visit duration.
Impala 
 With Impala, you can query data, 
whether stored in HDFS or Apache 
HBase – including SELECT, JOIN, and 
aggregate functions – in real time. 
Furthermore, it uses the same metadata, 
SQL syntax (Hive SQL), ODBC driver and 
user interface (Hue Beeswax) as Apache 
Hive, providing a familiar and unified 
platform for batch-oriented or real-time 
queries. (For that reason, Hive users can 
utilize Impala with little setup overhead.) 
The first beta drop includes support for 
text files and SequenceFiles; 
SequenceFiles can be compressed as 
Snappy, GZIP, and BZIP (with Snappy 
recommended for maximum 
performance)
Analysis and Report… 
To calculate the visit duration per visitor, the Hive query 
looks like: 
SELECT A.ts, MIN(B.ts - A.ts), A.host, A.mac_address FROM routerlogs A, 
routerlogs B WHERE A.host = B.host AND A.mac_address = B.mac_address AND 
A.ts <= B.ts AND A.message LIKE '%authentication OK%' AND B.message LIKE 
'%deauthenticated%' GROUP BY A.host, A.mac_address, A.ts; 
We created a new Hive table called ‘visit_duration’ and loaded 
the resulting CSV file into it (a direct-insert alternative follows below): 
create table visit_duration ( ts int, 
duration_in_seconds int, 
router string, 
mac_address string) row format delimited 
fields terminated by ',';
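As an alternative (an assumption on our part, not what the slides describe), the intermediate CSV export could be skipped by populating the table directly from the duration query: 
insert overwrite table visit_duration
select A.ts, min(B.ts - A.ts), A.host, A.mac_address
from routerlogs A join routerlogs B
  on (A.host = B.host and A.mac_address = B.mac_address)
where A.ts <= B.ts
  and A.message like '%authentication OK%'
  and B.message like '%deauthenticated%'
group by A.host, A.mac_address, A.ts;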
Analysis and Report… 
Counting the visits for retail store number one: 
(Build version: Impala v0.3 (3cb725b) built on Fri Nov 23 13:51:59 PST 2012) 
[localhost:21000] > SELECT COUNT(*) FROM visit_duration WHERE router = 
"buffalo"; 
135 
Collecting the number of unique visitors is even simpler as 
we have the mac addresses of visitors that make them 
unique: 
(Build version: Impala v0.3 (3cb725b) built on Fri Nov 23 13:51:59 PST 2012) 
[localhost:21000] > SELECT COUNT(DISTINCT(mac_address)) FROM visit_duration 
WHERE router = "buffalo";
Analysis and Report… 
The plot (figure 1) indicates that about 85% of the visits were 
detected in store number one and about 15% in store 
number two. One might draw the conclusion that store 
number one is in a much better location with more 
occasional customers. But let’s gain more insights by 
analyzing the number of unique visitors.
Analysis and Report… 
The average visit duration in store number one: 
[localhost:21000] > SELECT AVG(duration_in_seconds) FROM visit_duration 
WHERE router = "buffalo"; 
976.6666666666666 
Each user’s visit duration per store is also available.
Many more… 
How many people visited the store (unique visitors)? 
Note: unlike a traditional customer frequency counter at the 
doors, we have MAC addresses in the log files that are unique per 
mobile phone. Assuming people do not change their mobile 
phones, we can recognize unique visitors and not just visits. 
How many visits did we have? 
What is the average visit duration? 
What is the peak hour for visitors? (A query sketch for this one follows the list.) 
How many people are new vs. returning? 
Which location gets the most visitors? 
Which branch has the most regular customers? 
What is the average length of time between two visits?
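As an example, a minimal sketch for the peak-hour question, assuming the routerlogs schema shown earlier and approximating "visitors per hour" by the number of distinct devices that authenticated in each hour: 
SELECT hour, COUNT(DISTINCT mac_address) AS visitors
FROM routerlogs
WHERE message LIKE '%authentication OK%'
GROUP BY hour
ORDER BY visitors DESC;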
Functional Architecture view and summary 
[Diagram] A pipeline of Ingest → Parse → Transform → Publish → View, 
spanning System of Record → Single Source of Truth → Consumers. 
Labelled components include: Internet, collectors, the file system and Flume 
(ingest); business rules and PDI (parse/transform); real-time and batch 
paths into an integrated data store built on Hive (publish); and analysis, 
data services, security, the service and offers subscription system, and 
general analytics (view).
Conclusion 
Analysing WiFi router log files could be done with a 
traditional RDBMS approach as well, but one of 
the main benefits of this architecture is the ability to store a 
variety of semi-structured files. 
Other benefits: 
Easy adoption of existing BI/analysis and reporting tools 
through Big Data platform integration. 
Easy modification and evaluation. 
Sharing the data between many applications. 
Etc…

Editor's Notes

  1. When architects and developers discuss software, they typically immediately qualify a software tool for its specific usage. For example, they may say that Apache Tomcat is a web server and that MySQL is a database. When it comes to Hadoop, however, things become a little bit more complicated. Hadoop encompasses a multiplicity of tools that are designed and implemented to work together. As a result, Hadoop can be used for many things, and, consequently, people often define it based on the way they are using it. Because Hadoop provides such a wide array of capabilities that can be adapted to solve many problems, many consider it to be a basic framework. Certainly, Hadoop provides all of these capabilities, but Hadoop should be classified as an ecosystem comprised of many components that range from data storage, to data integration, to data processing, to specialized tools for data analysts.
  2. Memory: Depending on the application, your system’s memory requirements will vary. They differ between the management services and the worker services. For the worker services, sufficient memory is needed to manage the TaskTracker and FileServer services in addition to the sum of all the memory assigned to each of the Map/Reduce slots. If you have a memory-bound Map/Reduce Job, you may need to increase the amount of memory on all the nodes running worker services. Storage: A Hadoop platform that’s designed to achieve performance and scalability by moving the compute activity to the data is preferable. Data storage requirements for the worker nodes may be best met by direct attached storage (DAS) in a Just a Bunch of Disks (JBOD) configuration and not as DAS with RAID or Network Attached Storage (NAS).
  3. Show the rules as more of a filter. Look at using the other slide. -Show license management, how do we automate that. ASUP Ecosystem Receive & process messages Store in flat files Store in Databases 2. Rules Ecosystem Business rules for processing based on ASUP messages Automate Bug identification and attach rate to cases - signature (pattern) detection Consistent rules and Logic used by multiple applications and tools 3. UI SAP: Auto Case Creation and Auto Parts Dispatch MASUP: Health Checks/At Risk Systems, Storage Trending, Storage Efficiency Unified Tools Portal: Central Place to Access Support Tools, Multiple Tools using consistent data and business rules Support Site: Install Base Info, License Management eBI: Up sell & cross sell, Solution Stickiness, Aggregated Account and Segment Views Service Automation for Secure Site (SASS)