Case Study: Retail WiFi Log-file 
Analysis with Hadoop and Impala
Agenda 
 Before getting into projects 
 General planning considerations 
 Most necessary things 
 Data life cycle in Hadoop 
 Retail WiFi Log-file Analysis with Hadoop – A Use Case
Before getting into projects 
 Understand the problem, the requirements, and feasibility. 
 Come up with a document, a model, and a design. 
 Study the existing model, then modify and adopt it. 
 Resources, cost, and building expertise matter most. 
 Define stages and a development path. 
 Map the existing model to the new model – define the corresponding components. 
 Consider Hadoop design principles.
Before getting into projects, continued… 
This will provide the basis for choosing the right Hadoop implementation and creating 
an infrastructure. 
 What data will be stored and analyzed? What questions do you want to answer? 
 What methods of analysis do you plan to use? 
 What data do you want to bring into Hadoop? 
 How will Hadoop fit into my environment? 
 Is Hadoop/NoSQL actually required? 
 Which of the various big data technologies resolves the problem in the most 
effective manner? 
 Does adopting these technologies address your business needs? 
 How will your use of Hadoop expand across users, applications, and so on? 
 What is the scope of applications to be supported and what are their requirements 
(disaster recovery, availability, and so on)? 
 What existing, new tools and infrastructure do you want to integrate with Hadoop? 
 Will your administrators need management tools or will administrators be hired or 
trained for managing Hadoop? 
Get your answers
Hadoop – are things complicated? 
When it comes to Hadoop, however, things become a little 
bit more complicated. 
Hadoop encompasses a multiplicity of tools that are 
designed and implemented to work together. 
(See the speaker notes.)
General Planning Considerations 
There is a lot to consider when deploying, configuring, managing, and 
scaling Hadoop clusters in a way that optimizes performance and 
resource utilization. 
Operating system: Using a 64-bit operating system helps to avoid 
constraining the amount of memory that can be used on worker 
nodes. 
Computation V/S Data: Computational (or processing) capacity is 
determined by the aggregate number of Map/Reduce slots 
available across all nodes in a cluster. Map/Reduce slots are 
configured on a per-server basis. I/O performance issues can arise 
from sub-optimal disk-to-core ratios (too many slots and too few 
disks). HyperThreading improves process scheduling, allowing 
you to configure more Map/Reduce slots.
General Planning Considerations continued… 
Memory: Depending on the application, your system’s 
memory requirements will vary. 
Storage: A Hadoop platform that’s designed to achieve 
performance and scalability by moving the compute activity 
to the data is preferable. Data storage requirements for the 
worker nodes may be best met by direct attached storage 
(DAS) in a Just a Bunch of Disks (JBOD) configuration and 
not as DAS with RAID or Network Attached Storage (NAS).
General Planning Considerations continued… 
Capacity: The number of disks and their corresponding 
storage capacity determines the total amount of the 
FileServer storage capacity for your cluster. The more disks 
you have, the less likely it is that you will have multiple tasks 
accessing a given disk at the same time. 
Network: 
 Hadoop is very bandwidth-intensive; use dedicated switches 
and enable rack awareness. 
 Beware of oversubscription in top-of-rack and core switches. 
 Consider bonded Ethernet to mitigate against link failure.
Most necessary things in the Hadoop ecosystem 
Automated deployment 
 Automated deployment of both operating systems and the Hadoop 
software ensures consistent, streamlined deployments. 
 Documented configurations are simpler to deploy than traditional 
manual IT deployment, testing, and validation strategies. 
Configuration management 
 A configuration management tool for all Hadoop environments handles 
configuration changes, manages the installation of the Hadoop 
software, and provides a single interface to the Hadoop 
environment for updates and configuration changes. 
Monitoring and alerting 
 Hardware monitoring and alerting is an important part of all 
dynamic IT environments. Successful monitoring and alerting 
ensures that problems are caught as soon as possible and 
administrators are alerted so the problems can be corrected before 
users are impacted.
A Framework for Considering Hadoop 
Distributions 
Core distribution: All vendors use the Apache Hadoop core and 
package it for enterprise use. 
 Management capabilities: Some vendors provide an additional layer 
of management software that helps administrators configure, 
monitor, and tune Hadoop. 
 Enterprise reliability and integration: A third group of vendors offers 
a more robust package, including a management layer augmented 
with connectors to existing enterprise systems 
and engineered to provide the same high level 
of availability, scalability, and reliability as 
other enterprise systems, along with support.
Data life cycle In Hadoop. 
Data input(collection and load) 
Data storage 
Data analysis and processing. 
Data product
Data life cycle in Hadoop continued… 
Data collection 
Getting the data into a Hadoop cluster is the first step in any 
Big Data deployment 
This raw data is required by many downstream models. 
1. Flume: Apache Flume is a distributed, reliable, and available 
service for efficiently collecting, aggregating, and moving 
large amounts of log data. 
2. Sqoop: Apache Sqoop(TM) is a tool designed for 
efficiently transferring bulk data between Apache Hadoop 
and structured datastores such as relational databases 
(a sketch of a typical invocation follows this list). 
3. Writing your own collector is also an option.
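As an illustration of the Sqoop option (Sqoop is not used in the case study that follows; the connection string, credentials, and table name below are placeholders), a typical import of a relational table into HDFS looks roughly like: 
sqoop import \
  --connect jdbc:mysql://dbhost/retaildb \
  --username retail_user -P \
  --table store_visits \
  --target-dir /user/sqoop/store_visits \
  -m 4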
Data life cycle in Hadoop continued… 
Data storage 
HDFS 
HBase 
Hive 
Use an ETL model before storing the data.
Data life cycle in Hadoop continued… 
Data processing 
1. MapReduce is the primary approach (including Hadoop 
Streaming; a sketch follows below). 
2. Is MapReduce too difficult? Use HBase, Hive, or Pig.
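For reference, a minimal Hadoop Streaming invocation looks like the sketch below; the mapper and reducer scripts and the HDFS paths are hypothetical, and the location of the streaming jar varies by distribution. 
hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-*.jar \
  -input /user/flume/wifilogs \
  -output /user/streaming/visit_counts \
  -mapper parse_logs.py \
  -reducer count_visits.py \
  -file parse_logs.py -file count_visits.py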
Data life cycle in Hadoop continued… 
Data product 
1. Integration via a REST architecture 
2. Application-level computation 
3. Integration with application-level RDBMSs
A typical Hadoop System
Hadoop Ecosystem
Retail WiFi Log-file Analysis 
with Hadoop and Impala
Tracking user visits 
Bringing online-style user tracking to retail stores. 
Data source: 
logs from WiFi access points, generated when visitors' WiFi-enabled 
mobile phones connect.
The collected data should answer the 
following questions for a particular store: 
How many people visited the store (unique visits)? 
How many visits did we have in total? 
What is the average visit duration? 
How many people are new vs. returning? 
In which situations do we get the most visitors? 
Security-related questions
Business Architecture 
Columns: Data | Logic & Rules | Applications (fed by the data sources) 
Data: 
• Receive and process messages 
• Store in flat files 
• Store in operational & reporting databases 
Logic & Rules: 
• Apply business rules – event- and condition-specific 
• Consistent rules and logic used by multiple applications and tools 
• Store in operational & reporting databases 
Applications: 
• General analytics 
• Service and offers subscription system 
• Security
Set up 
 WiFi access points simulating two different stores, with OpenWRT, a 
Linux-based firmware for routers, installed * 
 A virtual machine acting as a central syslog daemon collecting all log 
messages from the WiFi routers 
 Flume to move all log messages to HDFS without any manual 
intervention (no transformation, no filtering) 
 A CDH4 cluster installed and monitored with Cloudera 
Manager 
 Pentaho Data Integration’s graphical designer for data transformation, 
parsing, filtering, and loading into the warehouse (Hive) 
 Hive as the data warehouse system on top of Hadoop to project structure 
onto the data 
 Impala for querying data from Hive in real time 
 Microsoft Excel to visualize results **
Data collection 
Configure the WiFi access points to send all their local syslog 
messages, over UDP or TCP, to a central syslog server backed by 
shared storage. 
This is done through OpenWRT’s Unified Configuration Interface, simply 
called UCI. Assuming your syslog server listens on address 192.168.0.1 
over UDP/TCP and you want more detailed log output, the configuration 
looks like the sketch below.
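The original slide does not reproduce the configuration, so the following is only a rough sketch using OpenWRT's standard system options (log_ip, log_port, log_proto); the concrete values are assumptions. In /etc/config/system: 
config system
        option log_ip '192.168.0.1'
        option log_port '514'
        option log_proto 'udp'
or, equivalently, from the shell: 
uci set system.@system[0].log_ip='192.168.0.1'
uci set system.@system[0].log_port='514'
uci set system.@system[0].log_proto='udp'
uci commit system
# then restart the logging service or reboot the router for the change to take effect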
Data collection 
Logs are sent on a periodic basis, for example at 4-hour or 
12-hour intervals, and some are sent in real time. 
The shared storage server is configured to accept 
messages from remote hosts over HTTP/FTP and is 
instructed where to write them, using a defined naming 
scheme such as per time window or per WiFi access point. They are 
written as plain text files on disk. 
The logs are stored in shared storage like: 
/var/logs/wifilogs/20130612/bang-mainbranch.log 
/var/logs/wifilogs/20130612/bang-branch1.log 
/var/logs/wifilogs/20130612/bang-branch2.log 
/var/logs/wifilogs/20130612/realtime.logs
Data collection (architecture) 
[Diagram] WiFi access points send syslog over TCP/UDP to data collection 
servers, which store the logs on their local disks; HTTP/FTP clients on the 
collection servers then push the files to an HTTP/FTP server backed by the 
shared storage.
Data sample 
The exported log file looks like: 
2013-01-21T13:39:51+01:00 buffalo hostapd: wlan0: STA 10:68:3f:40:xx:yy WPA: start authentication 
2013-01-21T13:39:51+01:00 buffalo hostapd: wlan0: STA 10:68:3f:40:xx:yy IEEE 802.1X: unauthorizing port 
2013-01-21T13:39:51+01:00 buffalo hostapd: wlan0: STA 10:68:3f:40:xx:yy WPA: sending 1/4 msg of 4-Way Handshake 
2013-01-21T13:39:51+01:00 buffalo hostapd: wlan0: STA 10:68:3f:40:xx:yy WPA: received EAPOL-Key frame (2/4 Pairwise) 
2013-01-21T13:39:51+01:00 buffalo hostapd: wlan0: STA 10:68:3f:40:xx:yy WPA: sending 3/4 msg of 4-Way Handshake 
2013-01-21T13:39:51+01:00 buffalo hostapd: wlan0: STA 10:68:3f:40:xx:yy WPA: received EAPOL-Key frame (4/4 Pairwise) 
2013-01-21T13:39:51+01:00 buffalo hostapd: wlan0: STA 10:68:3f:40:xx:yy IEEE 802.1X: authorizing port 
2013-01-21T13:39:51+01:00 buffalo hostapd: wlan0: STA 10:68:3f:40:xx:yy WPA: pairwise key handshake completed (RSN) 
2013-01-21T13:41:25+01:00 fonera hostapd: wlan0: STA 24:ab:81:91:xx:yy IEEE 802.11: authentication OK (open system) 
2013-01-21T13:41:25+01:00 fonera hostapd: wlan0: STA 24:ab:81:91:xx:yy MLME: MLME-AUTHENTICATE. 
indication(24:ab:81:91:c8:62, OPEN_SYSTEM) 
2013-01-21T13:41:25+01:00 fonera hostapd: wlan0: STA 24:ab:81:91:xx:yy MLME: MLME-DELETEKEYS.request(24:ab:81:91:c8:62) 
2013-01-21T13:41:25+01:00 fonera hostapd: wlan0: STA 24:ab:81:91:xx:yy IEEE 802.11: authenticated 
2013-01-21T13:41:25+01:00 fonera hostapd: wlan0: STA 24:ab:81:91:xx:yy IEEE 802.11: association OK (aid 1) 
2013-01-21T13:41:25+01:00 fonera hostapd: wlan0: STA 24:ab:81:91:xx:yy IEEE 802.11: associated (aid 1)
Blueprint for a Data Management 
System with Hadoop
Logical Architecture 
Ingest: transportation and storage (HTTP, Flume) 
Parse: sectioning and record formation (Flume, PDI) 
Transform: object creation (PDI) 
Publish: real time, batch mode, integration patterns (RDBMS) 
View: reporting
Data ingestion to HDFS 
Since we have a log file as the data source, we set up Flume to stream 
the incoming content to HDFS. In Flume terminology we have the 
following components: 
 the data from the log file as the source 
 the HDFS folder /user/flume/bang-mainbranch as the sink 
(we preferred a flat directory layout to simplify 
access to and processing of the files later on) 
 a channel, c1, to connect the source to the sink
FLUME : 
 Flume is a distributed, reliable, 
and available service for 
efficiently collecting, 
aggregating, and moving large 
amounts of log data. It has a 
simple and flexible architecture 
based on streaming data flows. 
It is robust and fault tolerant 
with tunable reliability 
mechanisms and many failover 
and recovery mechanisms. It 
uses a simple extensible data 
model that allows for online 
analytic application.
Data ingestion to HDFS continued… 
Flume configuration 
a1.sources = r1 
a1.sinks = k1 
a1.channels = c1 
a1.channels.c1.type = memory 
a1.channels.c1.capacity = 1000 
a1.channels.c1.transactionCapacity = 100 
# get data from exec command 
a1.sources.r1.type = exec 
a1.sources.r1.command = tail -F /var/logs/wifilogs/20130612/realtime.logs 
a1.sources.r1.interceptors = i1 i2 
a1.sources.r1.interceptors.i1.type = org.apache.flume.interceptor.HostInterceptor$Builder 
a1.sources.r1.interceptors.i1.preserveExisting = false 
a1.sources.r1.interceptors.i1.hostHeader = hostname 
a1.sources.r1.interceptors.i2.type = org.apache.flume.interceptor.TimestampInterceptor$Builder 
# define hdfs sink 
a1.sinks.k1.type = hdfs 
a1.sinks.k1.hdfs.path = hdfs://cdh-master.cdh-cluster:9000/user/flume/wifilogs/20130612/realtime.logs 
a1.sinks.k1.hdfs.rollInterval = 120 
a1.sinks.k1.hdfs.rollCount = 100 
a1.sinks.k1.hdfs.rollSize = 0 
a1.sinks.k1.hdfs.fileType = DataStream 
# bind source and sink to channel 
a1.sources.r1.channels = c1 
a1.sinks.k1.channel = c1
Data ingestion to HDFS continued… 
Because Flume can collect data from various sources, it is also 
possible to configure the Flume agent itself as the collection server, 
with the WiFi access points sending their log messages directly 
to the Flume agent. 
The agent is started as follows: 
flume-ng agent --conf-file ./flume-datastream.conf --name a1 -Dflume.root.logger=INFO,console
Sample data as it appears in HDFS, 
under the directory structure /user/flume/20120612/datastream: 
2013-01-17T15:50:41+01:00 192.168.201.197 dropbear[1172]: Child connection from 192.168.201.99:55001 
2013-01-17T15:50:46+01:00 192.168.201.197 dropbear[1172]: Password auth succeeded for 'root' from 
192.168.201.99:55001 
2013-01-17T15:50:52+01:00 192.168.201.197 dropbear[1172]: Exit (root): Disconnect received 
2013-01-17T15:52:14+01:00 fonera hostapd: wlan0: STA 8c:64:22:3a:74:1f IEEE 802.11: disassociated due to 
inactivity 
2013-01-17T15:52:14+01:00 fonera hostapd: wlan0: STA 8c:64:22:3a:74:1f MLME: MLME-DISASSOCIATE. 
indication(8c:64:22:3a:74:1f, 4) 
2013-01-17T15:52:14+01:00 fonera hostapd: wlan0: STA 8c:64:22:3a:74:1f MLME: MLME-DELETEKEYS. 
request(8c:64:22:3a:74:1f)
Parse and Transform 
To load the raw data into the Hive data warehouse, it has to be parsed 
into a comma-separated format that matches the Hive schema. 
There are quite a few open-source BI/ETL tools on the market for 
this: Palo, SpagoBI, Pentaho, Talend and many more. 
We used Pentaho Data Integration (PDI).
Pentaho: 
 Pentaho Data Integration is the 
ETL Server technology that will 
be used to facilitate movement 
of data between the new back-end 
Hadoop environment and 
downstream RDBMS systems.
Parse and Transform continued… 
Pentaho for ETL 
The WiFi router logs collected with Flume are stored in HDFS; PDI 
is used for transformation, parsing, filtering and finally 
loading into Hive’s data warehouse. 
To import the raw data into the Hive data warehouse we need to 
parse it into a comma-separated format 
according to the Hive schema. 
PDI enabled us to design a MapReduce job for distributed 
processing of this task across multiple nodes without any 
programming.
Parse and Transform continued… 
Pentaho Data Integration’s Graphical Designer
Parse and Transform continued… 
MapReduce 
The map phase reads all the raw log files collected by 
Flume on HDFS. 
The input is interpreted as TextInputFormat, so 
every line goes through a regex evaluation during the map 
phase. 
Filtering and transformation use the “Regex Evaluation” step: 
 For filtering, only lines that match the pattern are selected. 
 For transformation, only a few columns are taken from each 
selected line.
Parse and Transform continued… 
In the map phase: 
Transformation 
 Match each line against a regular expression. 
 Split the matched line into fields that will be used as 
columns. 
This is the regex for the transformation (backslashes restored; see the 
Java sketch after it): 
^((\d{4})-(\d{2})-(\d{2})\w(\d{2}):(\d{2}):(\d{2})([+-]\d{2}:\d{2})) ([.a-zA-Z_0-9]*?) (.*?): (.*?): \w*? ([\w+:]{0,18}) (.*?): (.*)$
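To make the field mapping concrete, here is a minimal, self-contained Java sketch (not part of the original PDI job) that applies the regex to one of the sample log lines; the group numbers correspond to the column list on the next slide. 

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LogLineParser {
    // The slide's regex, with backslashes escaped for a Java string literal.
    private static final Pattern LOG_LINE = Pattern.compile(
        "^((\\d{4})-(\\d{2})-(\\d{2})\\w(\\d{2}):(\\d{2}):(\\d{2})([+-]\\d{2}:\\d{2})) "
        + "([.a-zA-Z_0-9]*?) (.*?): (.*?): \\w*? ([\\w+:]{0,18}) (.*?): (.*)$");

    public static void main(String[] args) {
        String line = "2013-01-21T13:41:25+01:00 fonera hostapd: wlan0: STA 24:ab:81:91:xx:yy"
                + " IEEE 802.11: authentication OK (open system)";
        Matcher m = LOG_LINE.matcher(line);
        if (m.matches()) {
            // Groups 2-8 hold year..timezone; groups 9-14 hold host..message.
            System.out.println("host        = " + m.group(9));   // fonera
            System.out.println("facility    = " + m.group(10));  // hostapd
            System.out.println("service     = " + m.group(11));  // wlan0
            System.out.println("mac_address = " + m.group(12));  // 24:ab:81:91:xx:yy
            System.out.println("protocol    = " + m.group(13));  // IEEE 802.11
            System.out.println("message     = " + m.group(14));  // authentication OK (open system)
        }
    }
}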
Parse and Transform continued… 
The matched lines from the regex are divided into different columns 
like this: 
year Integer 
month Integer 
day Integer 
hour Integer 
minute Integer 
second Integer 
timezone String 
host String 
facility_level String 
service_level String 
mac_address String 
protocol String 
message String
Parse and Transform continued… 
Filtering 
 All lines that do not match the regular expression are filtered out, 
such as error lines. 
 Those lines are discarded because they carry information (e.g. 
“Disconnect received”) that is useless for this use case. 
 A ‘Filter Rows’ step removes empty lines, ensuring there are no 
empty lines in the output after matching against the regular expression. 
Example lines: 
2013-01-17T15:50:41+01:00 192.168.201.197 dropbear[1172]: Child connection from 192.168.201.99:55001 
2013-01-17T15:50:46+01:00 192.168.201.197 dropbear[1172]: Password auth succeeded for 'root' from 
192.168.201.99:55001 
2013-01-17T15:50:52+01:00 192.168.201.197 dropbear[1172]: Exit (root):
Parse and Transform continued… 
To produce comma-separated lines: 
 We used a ‘User Defined Java Expression’ step and concatenated the emitted 
fields, delimited by ‘,’. 
 At the beginning of each line we performed a further transformation, 
converting the ISO 8601 string to a Unix timestamp. 
 To answer time-related questions, e.g. average visit duration, we need 
values to calculate with; the Unix timestamp is suitable for this. 
 The output is stored in /user/20130612/routerlogs/parsed/ 
Here is the ‘User Defined Java Expression’: 
(javax.xml.bind.DatatypeConverter.parseDateTime(iso_8601).getTimeInMillis()/1000) + "," + year 
+ "," + month + "," + day + "," + hour + "," + minute + "," + second + "," + timezone + "," + host + 
"," + facility_level + "," + service_level + "," + mac_address + "," + protocol + "," + message
Parse and Transform continued… 
 Each output line contains comma-separated fields. 
 We configured our Pentaho MapReduce job to clean the output path before 
execution. 
 The output is stored in /user/20130612/routerlogs/parsed/ 
$hadoop fs –ls /user/20130612/routerlogs/parsed/ 
drwxrwxrwx - hadoopuser hadoopuser 0 2013-01-21 15:24 /user/20130612/routerlogs/parse/_logs 
-rw-r--r-- 3 hadoopuser hadoopuser 118963 2013-01-21 15:25 /user/20130612/routerlogs/parse/part-00000 
-rw-r--r-- 3 hadoopuser hadoopuser 100500 2013-01-21 15:25 /user/20130612/routerlogs/parse/part-00001 
-rw-r--r-- 3 hadoopuser hadoopuser 11826 2013-01-21 15:25 /user/20130612/routerlogs/parse/part-00002
Load 
At the very end, once the MapReduce job has finished, the transformed 
and parsed raw data lands in HDFS: 
1358765267,2013,1,21,11,47,47,+01:00,buffalo,hostapd,wlan0,10:68:3f:40:20:2d,IEEE 802.1X,authorizing port 
1358765267,2013,1,21,11,47,47,+01:00,buffalo,hostapd,wlan0,10:68:3f:40:20:2d,WPA,pairwise key handshake 
completed (RSN) 
Now we have parsed and transformed log files in HDFS. 
We used Pentaho Data Integration once again to import the 
data into Hive’s warehouse.
HIVE 
 Hive is a data warehouse 
system for Hadoop that 
facilitates easy data 
summarization, ad-hoc 
queries, and the analysis of 
large datasets stored in 
Hadoop compatible file 
systems. Hive provides a 
mechanism to project 
structure onto this data and 
query the data using a SQL-like 
language called HiveQL.
Load… 
We created a Hive table that matches the previously defined 
schema using the query editor (any query editor will do); a sketch follows below.
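The slide does not reproduce the DDL, so the following is only a sketch, assuming the column list shown earlier plus the Unix timestamp prepended by the Java expression; if any column name clashes with a reserved word in your Hive version, rename it. 
create table routerlogs ( ts int, 
year int, 
month int, 
day int, 
hour int, 
minute int, 
second int, 
timezone string, 
host string, 
facility_level string, 
service_level string, 
mac_address string, 
protocol string, 
message string) row format delimited 
fields terminated by ',';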
Load… 
Loading data into the Hive table is basically done by copying 
files on HDFS from /user/20130612/routerlogs/parsed to 
/user/hive/warehouse/routerlogs.20130612. 
The MapReduce job can be automated on a schedule, e.g. 
with Oozie. 
Incremental updates on the Hive table can be ensured by using 
a partitioned table or unique, date-based output file names 
(see the sketch below).
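A rough sketch of the partitioned-table variant; the partition column and table name are assumptions, not taken from the slides. 
-- hypothetical partitioned variant of the routerlogs table
create table routerlogs_part (
  ts int, year int, month int, day int, hour int, minute int, second int,
  timezone string, host string, facility_level string, service_level string,
  mac_address string, protocol string, message string)
partitioned by (log_date string)
row format delimited fields terminated by ',';

-- load one day's parsed output into its own partition
load data inpath '/user/20130612/routerlogs/parsed'
into table routerlogs_part partition (log_date = '20130612');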
Load… 
Querying the data with Impala: 
Query the data from the Hive table. Sample data in the Hive 
table: 
Sudhakara@localhost ~]# hadoop fs -cat /user/hive/warehouse/routerlogs 
1358756939,2013,1,21,9,28,59,+01:00,buffalo,hostapd,wlan0,98:0c:82:dc:8b:15,MLME,MLME-AUTHENTICATE. 
indication(98:0c:82:dc:8b:15, OPEN_SYSTEM) 
1358756939,2013,1,21,9,28,59,+01:00,buffalo,hostapd,wlan0,98:0c:82:dc:8b:15,MLME,MLME-DELETEKEYS. 
request(98:0c:82:dc:8b:15) 
1358756939,2013,1,21,9,28,59,+01:00,buffalo,hostapd,wlan0,98:0c:82:dc:8b:15,IEEE 802.11,authenticated 
1358756939,2013,1,21,9,28,59,+01:00,buffalo,hostapd,wlan0,98:0c:82:dc:8b:15,IEEE 802.11,association OK (aid 2) 
1358756939,2013,1,21,9,28,59,+01:00,buffalo,hostapd,wlan0,98:0c:82:dc:8b:15,IEEE 802.11,associated (aid 2) 
1358756939,2013,1,21,9,28,59,+01:00,buffalo,hostapd,wlan0,98:0c:82:dc:8b:15,MLME,MLME-ASSOCIATE. 
indication(98:0c:82:dc:8b:15) 
1358756939,2013,1,21,9,28,59,+01:00,buffalo,hostapd,wlan0,98:0c:82:dc:8b:15,MLME,MLME-DELETEKEYS. 
request(98:0c:82:dc:8b:15) 
1358757010,2013,1,21,9,30,10,+01:00,buffalo,hostapd,wlan0,98:0c:82:dc:8b:15,IEEE 802.11,deauthenticated
Analysis and Report 
As you can see, a line containing “authentication OK” represents a user 
entering the WiFi access area, i.e. a login. 
A line containing “deauthenticated” represents a user 
exiting the WiFi access area, i.e. a logout. 
After querying the data through Impala, the application 
calculates the visit duration.
Impala 
 With Impala, you can query data, 
whether stored in HDFS or Apache 
HBase – including SELECT, JOIN, and 
aggregate functions – in real time. 
Furthermore, it uses the same metadata, 
SQL syntax (Hive SQL), ODBC driver and 
user interface (Hue Beeswax) as Apache 
Hive, providing a familiar and unified 
platform for batch-oriented or real-time 
queries. (For that reason, Hive users can 
utilize Impala with little setup overhead.) 
The first beta drop includes support for 
text files and SequenceFiles; 
SequenceFiles can be compressed as 
Snappy, GZIP, and BZIP (with Snappy 
recommended for maximum 
performance)
Analysis and Report… 
To calculate the visit duration per visitor, the Hive query 
looks like: 
SELECT A.ts, MIN(B.ts - A.ts), A.host, A.mac_address FROM routerlogs A, 
routerlogs B WHERE A.host = B.host AND A.mac_address = B.mac_address AND 
A.ts <= B.ts AND A.message LIKE '%authentication OK%' AND B.message LIKE 
'%deauthenticated%' GROUP BY A.host, A.mac_address, A.ts; 
We created a new Hive table called ‘visit_duration’ and loaded 
the resulting CSV file into it (a direct-insert alternative follows below): 
create table visit_duration ( ts int, 
duration_in_seconds int, 
router string, 
mac_address string) row format delimited 
fields terminated by ',';
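As an alternative (an assumption on our part, not what the slides describe), the intermediate CSV export could be skipped by populating the table directly from the duration query: 
insert overwrite table visit_duration
select A.ts, min(B.ts - A.ts), A.host, A.mac_address
from routerlogs A join routerlogs B
  on (A.host = B.host and A.mac_address = B.mac_address)
where A.ts <= B.ts
  and A.message like '%authentication OK%'
  and B.message like '%deauthenticated%'
group by A.host, A.mac_address, A.ts;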
Analysis and Report… 
Counting the visits for retail store number one: 
(Build version: Impala v0.3 (3cb725b) built on Fri Nov 23 13:51:59 PST 2012) 
[localhost:21000] > SELECT COUNT(*) FROM visit_duration WHERE router = 
"buffalo"; 
135 
Collecting the number of unique visitors is even simpler as 
we have the mac addresses of visitors that make them 
unique: 
(Build version: Impala v0.3 (3cb725b) built on Fri Nov 23 13:51:59 PST 2012) 
[localhost:21000] > SELECT COUNT(DISTINCT(mac_address)) FROM visit_duration 
WHERE router = "buffalo";
Analysis and Report… 
The plot (figure 1) indicates that about 85% of the visits were 
detected in store number one and about 15% in store 
number two. One might draw the conclusion that store 
number one is in a much better location with more 
occasional customers. But let’s gain more insights by 
analyzing the number of unique visitors.
Analysis and Report… 
The average visit duration in store number one: 
[localhost:21000] > SELECT AVG(duration_in_seconds) FROM visit_duration 
WHERE router = "buffalo"; 
976.6666666666666 
Each user’s visit duration per store is also available.
Many more… 
How many people visited the store (unique visitors)? 
Note: unlike a traditional customer frequency counter at the 
doors, we have MAC addresses in the log files that are unique per 
mobile phone. Assuming people do not change their mobile 
phones, we can recognize unique visitors and not just visits. 
How many visits did we have? 
What is the average visit duration? 
What is the peak hour for visitors? (A query sketch for this one follows the list.) 
How many people are new vs. returning? 
Which location gets the most visitors? 
Which branch has the most regular customers? 
What is the average length of time between two visits?
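As an example, a minimal sketch for the peak-hour question, assuming the routerlogs schema shown earlier and approximating "visitors per hour" by the number of distinct devices that authenticated in each hour: 
SELECT hour, COUNT(DISTINCT mac_address) AS visitors
FROM routerlogs
WHERE message LIKE '%authentication OK%'
GROUP BY hour
ORDER BY visitors DESC;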
Functional Architecture view and summary 
[Diagram] A pipeline of Ingest → Parse → Transform → Publish → View, 
spanning System of Record → Single Source of Truth → Consumers. 
Labelled components include: Internet, collectors, the file system and Flume 
(ingest); business rules and PDI (parse/transform); real-time and batch 
paths into an integrated data store built on Hive (publish); and analysis, 
data services, security, the service and offers subscription system, and 
general analytics (view).
Conclusion 
Analysing WiFi router log files could be done with a 
traditional RDBMS approach as well, but one of 
the main benefits of this architecture is the ability to store a 
variety of semi-structured files. 
Other benefits: 
Easy adoption of existing BI/analysis and reporting tools 
through Big Data platform integration. 
Easy modification and evaluation. 
Sharing the data between many applications. 
Etc…

Editor's Notes

  1. When architects and developers discuss software, they typically immediately qualify a software tool for its specific usage. For example, they may say that Apache Tomcat is a web server and that MySQL is a database. When it comes to Hadoop, however, things become a little bit more complicated. Hadoop encompasses a multiplicity of tools that are designed and implemented to work together. As a result, Hadoop can be used for many things, and, consequently, people often define it based on the way they are using it. Because Hadoop provides such a wide array of capabilities that can be adapted to solve many problems, many consider it to be a basic framework. Certainly, Hadoop provides all of these capabilities, but Hadoop should be classified as an ecosystem comprised of many components that range from data storage, to data integration, to data processing, to specialized tools for data analysts.
  2. Memory: Depending on the application, your system’s memory requirements will vary. They differ between the management services and the worker services. For the worker services, sufficient memory is needed to manage the TaskTracker and FileServer services in addition to the sum of all the memory assigned to each of the Map/Reduce slots. If you have a memory-bound Map/Reduce Job, you may need to increase the amount of memory on all the nodes running worker services. Storage: A Hadoop platform that’s designed to achieve performance and scalability by moving the compute activity to the data is preferable. Data storage requirements for the worker nodes may be best met by direct attached storage (DAS) in a Just a Bunch of Disks (JBOD) configuration and not as DAS with RAID or Network Attached Storage (NAS).
  3. Show the rules as more of a filter. Look at using the other slide. -Show license management, how do we automate that. ASUP Ecosystem Receive & process messages Store in flat files Store in Databases 2. Rules Ecosystem Business rules for processing based on ASUP messages Automate Bug identification and attach rate to cases - signature (pattern) detection Consistent rules and Logic used by multiple applications and tools 3. UI SAP: Auto Case Creation and Auto Parts Dispatch MASUP: Health Checks/At Risk Systems, Storage Trending, Storage Efficiency Unified Tools Portal: Central Place to Access Support Tools, Multiple Tools using consistent data and business rules Support Site: Install Base Info, License Management eBI: Up sell & cross sell, Solution Stickiness, Aggregated Account and Segment Views Service Automation for Secure Site (SASS)