International Journal of Computer Science & Engineering Survey (IJCSES) Vol.5, No.1, February 2014
DOI : 10.5121/ijcses.2014.5101
A Comparative Survey Based on Processing
Network Traffic Data Using Hadoop Pig and
Typical Mapreduce
Anjali P P and Binu A
Department of Information Technology, Rajagiri School of Engineering and Technology,
Kochi. M G University, Kerala
ABSTRACT
Big data analysis has become an integral part of many computational and statistical departments, and
the analysis of petabyte-scale data has taken on added importance in the present-day scenario. Big data
manipulation is now considered a key area of research in the field of data analytics, and novel
techniques evolve day by day. Websites related to e-commerce, shopping carts and online banking
process thousands of transaction requests every minute. Hence the need for network traffic and
weblog analysis, for which Hadoop is a suggested solution: it can efficiently process the NetFlow
data collected at fixed intervals from routers, switches or even website access logs.
KEYWORDS
Big Data, MapReduce, Hadoop, Pig, netflow, network traffic.
1. INTRODUCTION
NetFlow is the network protocol designed by Cisco Systems for assembling network traffic
information and has become an industry standard for monitoring network traffic. NetFlow is
widely supported on various platforms, and Cisco's standard NetFlow version 5 defines a network
flow as a unidirectional sequence of packets identified by seven key fields:
• Ingress interface (SNMP ifIndex)
• Source IP address
• Destination IP address
• IP protocol
• Source port for UDP or TCP, 0 for other protocols
• Destination port for UDP or TCP, type and code for ICMP, or 0 for other protocols
• IP Type of Service
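For illustration, the seven key fields above can be modelled as a flow key: packet records that agree on all seven values belong to the same unidirectional flow. A minimal Python sketch (the field and function names are ours, not Cisco's):

```python
from collections import namedtuple

# The seven NetFlow v5 key fields, grouped into one flow key.
FlowKey = namedtuple("FlowKey", [
    "ingress_if",   # SNMP ifIndex of the ingress interface
    "src_ip",
    "dst_ip",
    "protocol",     # IP protocol number (6 = TCP, 17 = UDP, 1 = ICMP)
    "src_port",     # 0 for protocols without ports
    "dst_port",     # ICMP type/code for ICMP, 0 for other protocols
    "tos",          # IP Type of Service byte
])

def flow_key(pkt):
    """Build the flow key for a parsed packet record (a dict here)."""
    return FlowKey(pkt["ifindex"], pkt["src"], pkt["dst"], pkt["proto"],
                   pkt.get("sport", 0), pkt.get("dport", 0), pkt["tos"])
```

Two packets whose records produce equal `FlowKey` values would be accounted to the same flow by a collector.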
Netflow is normally enabled on a per-interface basis for limiting the amount of exported flow
records and the load on the router components. By default, it will capture all packets received by
an ingress interface and IP filters can be used to decide if a packet can be captured by NetFlow.
Once accurately configured, it can even allow the observation and recording of IP packets on the
egress interface. Custom settings can also be done on the Netflow implementation according to
the requirements in a variety of practical scenarios. NetFlow data from dedicated probes can be
efficiently utilized for the observation of critical links, whereas the data from routers can provide
a network-wide view of the traffic, which is of great use for capacity planning,
accounting, performance monitoring, and security.
This paper comprises a survey on network traffic analysis using the Apache Hadoop MapReduce
framework, together with a comparative study of typical MapReduce against Hadoop Pig, a platform
for analyzing large data sets that also offers ease of programming, optimization opportunities
and enhanced extensibility.
2. NETFLOW ANALYSIS USING TYPICAL HADOOP MAPREDUCE
FRAMEWORK
NetFlow processing requires sufficient network flow data collected at a particular time or at
regular intervals. By analyzing the incoming and outgoing network traffic of various hosts,
the required precautions can be taken regarding server security, firewall setup, packet
filtering and so on; the significance of network flow analysis therefore cannot be overlooked.
NetFlow data can be collected from switches or routers using router-console commands and is then
filtered on the basis of parameters such as source and destination IP addresses and the
corresponding port numbers. Similarly, every website generates access logs, and these files
can be collected from the web server for further analysis. The collected logs contain the
accessed URLs, times of access, total number of hits and so on, which are invaluable to a
website manager for the proactive detection of problems like DDoS attacks and IP
address spoofing.
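As a toy illustration of such weblog analysis (the log format, field positions and threshold below are assumptions, not part of the paper's setup), hits per client IP can be counted from a Common Log Format access log and unusually active sources flagged:

```python
import re
from collections import Counter

# Matches the client IP, timestamp, method and URL of a Common Log Format line.
LOG_LINE = re.compile(r'^(\S+) \S+ \S+ \[([^\]]+)\] "(\S+) (\S+)[^"]*"')

def hits_per_ip(lines):
    """Count requests per client IP address."""
    counts = Counter()
    for line in lines:
        m = LOG_LINE.match(line)
        if m:
            counts[m.group(1)] += 1
    return counts

def suspicious(counts, threshold=1000):
    """Flag IPs whose hit count reaches the (illustrative) threshold."""
    return [ip for ip, n in counts.items() if n >= threshold]
```

In practice the threshold would be tuned per site, and a real analysis would also bucket hits by time window.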
In a distributed environment, Hadoop can be considered an efficient mechanism for big data
manipulation. Hadoop was developed by the Apache foundation and is well known for its scalability
as well as efficiency. It works on a computational paradigm called MapReduce, which is
characterized by a Map and a Reduce phase. The Map phase divides the computation into many
fragments and distributes them among the cluster nodes. The partial results produced at the end
of the Map phase are eventually combined and reduced to a single output in the Reduce phase,
producing the result in a very short span of time.
The log files generated every minute by a web server or router accumulate into a large file that
takes a great deal of time to process. In such instances, the Hadoop MapReduce paradigm can be
put to use: the big data file is fed as the input to a MapReduce algorithm, which returns an
efficiently processed output file. Every map function transforms the input data into a number of
key/value pairs, which are then sorted by key, and a reduce function combines the results that
share a key into a single output.
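The two phases can be sketched with a toy in-memory simulation (our own Python illustration, not the Hadoop API): map emits key/value pairs, the pairs are grouped by key, and reduce folds each group to a single value.

```python
from itertools import groupby
from operator import itemgetter

def map_phase(records, mapper):
    # Apply the mapper to every input record, emitting (key, value) pairs.
    pairs = []
    for rec in records:
        pairs.extend(mapper(rec))
    return pairs

def reduce_phase(pairs, reducer):
    # Sort by key (the "shuffle"), then fold each key group to one output.
    pairs.sort(key=itemgetter(0))
    return {k: reducer(v for _, v in grp)
            for k, grp in groupby(pairs, key=itemgetter(0))}

# Word count, the canonical MapReduce example:
mapper = lambda line: [(w, 1) for w in line.split()]
counts = reduce_phase(map_phase(["a b a", "b c"], mapper), sum)
# counts == {"a": 2, "b": 2, "c": 1}
```

In real Hadoop the two phases run on different machines and the shuffle moves data over the network; this sketch only mirrors the data flow.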
Hadoop stores data in its distributed file system, popularly known as HDFS, and makes the
requested data easily available to users. HDFS is highly scalable and portable. It enhances
reliability by replicating data across multiple nodes in the cluster, thereby eliminating the
need for RAID storage on the cluster nodes. High availability can also be incorporated into
Hadoop clusters through a backup provision for the master node (namenode), generally known
as the secondary namenode. The common limitations of HDFS are that it cannot be mounted
directly by an existing operating system, and that data must be moved in and out of HDFS
before and after the execution of a job.
Once Hadoop is deployed on a cluster, instances of these processes are started: Namenode,
Datanode, Jobtracker and Tasktracker. Hadoop can process large amounts of data such as website
logs and network traffic logs obtained from switches. The namenode, otherwise known as the
master node, is the main component of HDFS: it stores the directory tree of the file system
and keeps track of the data during execution. The datanode, as its name suggests, stores the
data in HDFS. The jobtracker delegates MapReduce tasks to the different cluster nodes, and
the tasktracker spawns and executes those tasks on the member nodes.
A key limitation of Hadoop MapReduce is its coding complexity: effective MapReduce programs can
be written only by programmers with highly developed skills. The MapReduce programming
model is highly restrictive, and Hadoop cluster management also becomes difficult with respect
to aspects like log collection, debugging and software distribution. To be
precise, the effort that must be invested in MapReduce programming restricts its acceptability
to a considerable extent. Here arises the need for a platform that provides ease of programming
along with enhanced extensibility, and Apache Pig reveals itself as a novel but realistic
approach to data analytics. The major features and applications of Apache Pig deployed over
the Hadoop framework are described in Section 3.
3. USING APACHE PIG TO PROCESS NETWORK FLOW DATA
Apache Pig is a platform that supports substantial parallelization, enabling the processing of
huge data sets. The structure of Pig can be described in two layers: an infrastructure layer
with an in-built compiler that can spawn MapReduce programs, and a language layer consisting of
a data-processing language called "Pig Latin". Pig makes data analysis and programming easier
and understandable even to novices in the Hadoop framework. It uses the Pig Latin language,
which can be considered a parallel data-flow language. The Pig setup also provides for
further optimizations and user-defined functions. Pig Latin queries are compiled into
MapReduce jobs and executed in distributed Hadoop cluster environments. Pig Latin resembles
SQL [Structured Query Language], but it is designed for the Hadoop data-processing environment,
playing a role analogous to that of SQL in an RDBMS environment. Another member of the Hadoop
ecosystem, Hive, offers an SQL-like query language but needs data to be loaded into tables
before processing. Unlike Hive, Pig Latin can work in schema-less or inconsistent environments
and can operate on the input data as soon as it is loaded into HDFS. Pig closely
resembles scripting languages like Perl and Python in aspects such as flexible syntax and
dynamic variable definitions. Hence, Pig Latin can be considered efficient enough to be the
native parallel-processing language for distributed systems such as Hadoop.
Figure 1. Data processing model in PIG
It is clear that Pig is gaining popularity in many computational and statistical
departments. Apache Pig is also widely used for processing weblogs and wiping
corrupted data from records. It can build behaviour-prediction models based on users'
interactions with a website, a speciality that can be applied to displaying ads and news
stories that keep the user engaged while visiting a website. It also supports an iterative
processing model and can keep track of every new update by joining the behavioural model with
the user data; this feature is widely used in social-networking sites like Facebook and
micro-blogging sites like Twitter. Pig, however, is not as effective as raw MapReduce when it
comes to processing small data or scanning multiple records in random order. The data
processing model framed by Hadoop Pig is depicted in Figure 1.
In this paper, we are exploring the scope of Pig Latin for processing the netflow data obtained
from Cisco routers. The proposed method for using Pig for netflow analysis is described in
Section 4.
4. IMPLEMENTATION ASPECTS
Here we present NetFlow analysis using Hadoop, which is widely used for managing large
volumes of data and, through parallel processing, delivers the required output very quickly.
A MapReduce algorithm must be deployed to obtain the required result, and coding such
MapReduce programs for analyzing this huge amount of data is time consuming. Here
Apache Pig helps us save a good deal of time and also makes deployment in a real-time
scenario possible.
We set up our system on the Java platform; version 1.6 or above is required. The implemented
system is completely open source, and the Java Development Kit can be downloaded and
installed freely. The key point to take care of is setting the proper environment variables
for Java so that the base framework can work properly.
Hadoop is a system that makes use of the Java framework for its functioning, and we deploy
Hadoop on top of Java. The Hadoop downloadable packages are also open source and need to be
extracted and configured properly. The following files need to be tweaked for setting the required
clustering as well as replication parameters for the whole system:
• hadoop-env.sh: Set the java directory path.
• core-site.xml: Regarding hadoop name node, port and temporary directories.
• hdfs-site.xml: Regarding parameters like replication values.
• mapred-site.xml: Regarding the host and port of mapreduce job tracker.
• masters: Details of a secondary name node, if any. (Here, it is not needed).
• slaves: Details of the slave nodes, if any.
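As an illustration, minimal single-node entries for two of these files might look as follows (the property names are the standard Hadoop 1.x ones; the host, port and replication values are examples for a stand-alone setup, not prescriptions):

```xml
<!-- core-site.xml: the name node URI -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

<!-- hdfs-site.xml: replication factor (1 on a single node) -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
```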
Figure 2. System setup
The Hadoop processes should be started after deployment, and we can check them with the 'jps'
command, which lists all the Hadoop processes on the system. It should list a Namenode,
Datanode, Jobtracker and Tasktracker, confirming that the processes are all working and the
Hadoop deployment is fine. Testing can be done by creating and deleting directories as
well as moving files in and out of HDFS.
Now we need the Pig system for our test setup, which we deploy by downloading the
open-source Pig tarballs and properly setting its environment variables. The structure of our
system setup is given in Figure 2.
We need to analyze traffic at a particular time, which is where Pig plays the lead role. For
the data analysis, buckets are created with the pairs (SrcIF, SrcIPaddress),
(DstIF, DstIPaddress) and (Protocol, FlowperSec). These are then joined and categorized as
SrcIF with FlowperSec and SrcIPaddress with FlowperSec. The NetFlow data is collected from a
busy Cisco switch using tools such as nfdump, and the output contains three basic sections:
an IP packet section, a source/destination packets section and a protocol flow section.
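The bucketing and aggregation just described can be mimicked in plain Python (a simulation with made-up records and field positions, standing in for what the Pig script computes):

```python
from collections import defaultdict

# Toy flow records as (interface, ip, protocol, flows_per_sec) tuples.
flows = [
    ("eth0", "10.0.0.1", "tcp", 120),
    ("eth0", "10.0.0.2", "udp", 30),
    ("eth1", "10.0.0.1", "tcp", 45),
]

def flows_by(key_index, records):
    # Aggregate flows-per-second under the chosen key column.
    totals = defaultdict(int)
    for rec in records:
        totals[rec[key_index]] += rec[3]
    return dict(totals)

per_interface = flows_by(0, flows)   # SrcIF        -> FlowperSec
per_address   = flows_by(1, flows)   # SrcIPaddress -> FlowperSec
# per_interface == {"eth0": 150, "eth1": 45}
# per_address   == {"10.0.0.1": 165, "10.0.0.2": 30}
```

In the Pig version, the same grouping is expressed declaratively with GROUP/JOIN, and the MapReduce jobs it compiles to perform the aggregation in parallel.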
Figure 3. Netflow Analysis using Hadoop Pig
These three sections are loaded as three different schemas, since Pig needs the NetFlow
datasets in an arranged format, which is then converted to a SQL-like schema. A sample of the
Pig Latin code is given as follows:
Protocol = LOAD 'NetFlow-Data1' AS (protocol:chararray, flow:int, …);
Source = LOAD 'NetFlow-Data2' AS (SrcIF:chararray, SrcIPAddress:chararray, …);
Destination = LOAD 'NetFlow-Data3' AS (DstIF:chararray, DstIPaddress:chararray, …);
As shown in Figure 3, the network flow data is deduced into schemas by Pig and later these
schemas are combined for analysis to deliver the traffic flow per node as the resultant output.
5. ADVANTAGES OF USING HADOOP PIG OVER MAPREDUCE
The member of the Apache Hadoop ecosystem popularly known as Pig can be considered a
procedural version of SQL. The most attractive feature and advantage of Pig is its simplicity:
it does not demand highly skilled Java professionals for its implementation. Flexible syntax
and dynamic variable allocation make it much more user-friendly, and the provisions for
user-defined functions add extensibility to the framework. Pig's ability to process
unstructured data contributes greatly to its parallel data-processing capability. A remarkable
difference appears when computational complexity is considered alongside these aspects:
a typical MapReduce program runs to a large number of lines of Java code, whereas a similar
task can be accomplished in far fewer lines of a Pig Latin script. Huge network-flow input
files can be processed by writing simple programs in Pig Latin, which closely resembles
scripting languages like Perl and Python. In other words, the computational complexity of
Pig Latin scripts is much lower than that of similar MapReduce programs written in Java, and
this property makes Pig suitable for the implementation of this project. The procedures and
experiments that lead to this conclusion are described in Section 6.
6. EXPERIMENTS AND RESULTS
Apache Pig was selected as the programming language for Netflow analysis after conducting a
short test on sample input files with different sizes.
6.1. Experimental Setup
Apache Hadoop was first installed and configured on a stand-alone Ubuntu node, and Hadoop
Pig was deployed on top of this system. SSH key-based authentication was set up on the machine
for convenience, as Hadoop prompts for passwords while performing operations. Hadoop
instances such as the namenode, datanode, jobtracker and tasktracker are started before running
Pig, which presents a "grunt" console upon starting. The grunt console is the command prompt
specific to Pig and is used for data and file-system manipulation. Pig can work in either
local mode or MapReduce mode, selectable as per the user's requirement.
A sample Hadoop MapReduce word-count program written in Java was used for the
comparison. The typical MapReduce program used to process a sample input file of sufficient
size had 130 lines of code, and its execution time was noted. The same operation coded in
Pig Latin took only five lines of code, and the corresponding running time of the Pig Latin
script was also noted for framing the results.
6.2. Experiment Results
For each run, the input file size is kept the same for MapReduce and Pig, while the lines of
code are many for the former and few for the latter. For the sake of a good comparison, the
input file size was varied in the range of 13 KB to 208 KB, and the corresponding execution
times were recorded as in Figure 4.
Figure 4. Hadoop Pig vs. MapReduce
Figure 5. Input file-size vs. execution time for MapReduce
From this we can derive the following observation: as the input file size increases in
multiples of x, the execution time of typical MapReduce increases proportionally, whereas
Hadoop Pig maintains a nearly constant time, at least for x up to 5. Graphs plotted between
input file size and execution time for both cases are given as Figure 5 and Figure 6.
Figure 6. Input file-size vs. execution time for Pig
7. CONCLUSION
In this paper, a detailed survey of the widespread acceptance of the Apache Pig platform was
conducted. The major features of the network traffic flows are extracted for data analysis,
using Hadoop Pig for the implementation. Pig was tested and proved advantageous in this
respect, with very low computational complexity. The main advantage is that network flow
capture and flow analysis are done with easily available open-source tools. With a Pig-based
flow analyzer, network traffic flow analysis can be performed with increased convenience and
ease, even without highly developed programming expertise. It greatly reduces the time spent
on the complex control loops and programming constructs needed to implement a typical
MapReduce job in Java, while still enjoying the advantages of MapReduce task division by
means of the layered architecture and built-in compiler of Apache Pig.
ACKNOWLEDGEMENT
We are greatly indebted to the college management and the faculty members for providing
necessary hardware and other facilities along with timely guidance and suggestions for
implementing this work.
Authors
Anjali P P is an MTech student in the Information Technology Department at Rajagiri
School of Engineering and Technology under Mahatma Gandhi University, specializing in
Network Engineering. She received her BTech degree in Computer Science and Engineering
from Cochin University of Science and Technology. She has around two years of industrial
experience in systems engineering, networking and server administration, and presently
works as a Systems Engineer. Her research interests include networking and big data
analysis using Hadoop.
Binu A has been working as an Assistant Professor in the Department of Information Technology
since May 2004. He has one year of industrial experience in web and systems administration.
He graduated (M.Tech. Software Engineering) from the Department of Computer Science, CUSAT,
and took his B.Tech. in Computer Science and Engineering from the School of Engineering,
CUSAT. At present he is pursuing a PhD in the Department of Computer Science, CUSAT. His
areas of interest include theoretical computer science, compilers and architecture.
 
Comparison among rdbms, hadoop and spark
Comparison among rdbms, hadoop and sparkComparison among rdbms, hadoop and spark
Comparison among rdbms, hadoop and sparkAgnihotriGhosh2
 
Big Data: RDBMS vs. Hadoop vs. Spark
Big Data: RDBMS vs. Hadoop vs. SparkBig Data: RDBMS vs. Hadoop vs. Spark
Big Data: RDBMS vs. Hadoop vs. SparkGraisy Biswal
 

Similar to A comparative survey based on processing network traffic data using hadoop pig and typical mapreduce (20)

NETWORK TRAFFIC ANALYSIS: HADOOP PIG VS TYPICAL MAPREDUCE
NETWORK TRAFFIC ANALYSIS: HADOOP PIG VS TYPICAL MAPREDUCENETWORK TRAFFIC ANALYSIS: HADOOP PIG VS TYPICAL MAPREDUCE
NETWORK TRAFFIC ANALYSIS: HADOOP PIG VS TYPICAL MAPREDUCE
 
NETWORK TRAFFIC ANALYSIS: HADOOP PIG VS TYPICAL MAPREDUCE
NETWORK TRAFFIC ANALYSIS: HADOOP PIG VS TYPICAL MAPREDUCENETWORK TRAFFIC ANALYSIS: HADOOP PIG VS TYPICAL MAPREDUCE
NETWORK TRAFFIC ANALYSIS: HADOOP PIG VS TYPICAL MAPREDUCE
 
Hadoop a Natural Choice for Data Intensive Log Processing
Hadoop a Natural Choice for Data Intensive Log ProcessingHadoop a Natural Choice for Data Intensive Log Processing
Hadoop a Natural Choice for Data Intensive Log Processing
 
Performance evaluation and estimation model using regression method for hadoo...
Performance evaluation and estimation model using regression method for hadoo...Performance evaluation and estimation model using regression method for hadoo...
Performance evaluation and estimation model using regression method for hadoo...
 
Overview of big data & hadoop v1
Overview of big data & hadoop   v1Overview of big data & hadoop   v1
Overview of big data & hadoop v1
 
Survey Paper on Big Data and Hadoop
Survey Paper on Big Data and HadoopSurvey Paper on Big Data and Hadoop
Survey Paper on Big Data and Hadoop
 
Unstructured Datasets Analysis: Thesaurus Model
Unstructured Datasets Analysis: Thesaurus ModelUnstructured Datasets Analysis: Thesaurus Model
Unstructured Datasets Analysis: Thesaurus Model
 
IJSRED-V2I3P43
IJSRED-V2I3P43IJSRED-V2I3P43
IJSRED-V2I3P43
 
Big Data Analysis and Its Scheduling Policy – Hadoop
Big Data Analysis and Its Scheduling Policy – HadoopBig Data Analysis and Its Scheduling Policy – Hadoop
Big Data Analysis and Its Scheduling Policy – Hadoop
 
G017143640
G017143640G017143640
G017143640
 
Managing Big data with Hadoop
Managing Big data with HadoopManaging Big data with Hadoop
Managing Big data with Hadoop
 
IJARCCE_49
IJARCCE_49IJARCCE_49
IJARCCE_49
 
Design architecture based on web
Design architecture based on webDesign architecture based on web
Design architecture based on web
 
DESIGN ARCHITECTURE-BASED ON WEB SERVER AND APPLICATION CLUSTER IN CLOUD ENVI...
DESIGN ARCHITECTURE-BASED ON WEB SERVER AND APPLICATION CLUSTER IN CLOUD ENVI...DESIGN ARCHITECTURE-BASED ON WEB SERVER AND APPLICATION CLUSTER IN CLOUD ENVI...
DESIGN ARCHITECTURE-BASED ON WEB SERVER AND APPLICATION CLUSTER IN CLOUD ENVI...
 
HAWQ: a massively parallel processing SQL engine in hadoop
HAWQ: a massively parallel processing SQL engine in hadoopHAWQ: a massively parallel processing SQL engine in hadoop
HAWQ: a massively parallel processing SQL engine in hadoop
 
Big Data and Hadoop Guide
Big Data and Hadoop GuideBig Data and Hadoop Guide
Big Data and Hadoop Guide
 
Harnessing Hadoop: Understanding the Big Data Processing Options for Optimizi...
Harnessing Hadoop: Understanding the Big Data Processing Options for Optimizi...Harnessing Hadoop: Understanding the Big Data Processing Options for Optimizi...
Harnessing Hadoop: Understanding the Big Data Processing Options for Optimizi...
 
Performance Enhancement using Appropriate File Formats in Big Data Hadoop Eco...
Performance Enhancement using Appropriate File Formats in Big Data Hadoop Eco...Performance Enhancement using Appropriate File Formats in Big Data Hadoop Eco...
Performance Enhancement using Appropriate File Formats in Big Data Hadoop Eco...
 
Comparison among rdbms, hadoop and spark
Comparison among rdbms, hadoop and sparkComparison among rdbms, hadoop and spark
Comparison among rdbms, hadoop and spark
 
Big Data: RDBMS vs. Hadoop vs. Spark
Big Data: RDBMS vs. Hadoop vs. SparkBig Data: RDBMS vs. Hadoop vs. Spark
Big Data: RDBMS vs. Hadoop vs. Spark
 

More from ijcses

Vision based entomology a survey
Vision based entomology  a surveyVision based entomology  a survey
Vision based entomology a surveyijcses
 
A survey on cost effective survivable network design in wireless access network
A survey on cost effective survivable network design in wireless access networkA survey on cost effective survivable network design in wireless access network
A survey on cost effective survivable network design in wireless access networkijcses
 
ADAPTIVE MAP FOR SIMPLIFYING BOOLEAN EXPRESSIONS
ADAPTIVE MAP FOR SIMPLIFYING BOOLEAN EXPRESSIONSADAPTIVE MAP FOR SIMPLIFYING BOOLEAN EXPRESSIONS
ADAPTIVE MAP FOR SIMPLIFYING BOOLEAN EXPRESSIONSijcses
 
A Study of Various Steganographic Techniques Used for Information Hiding
A Study of Various Steganographic Techniques Used for Information HidingA Study of Various Steganographic Techniques Used for Information Hiding
A Study of Various Steganographic Techniques Used for Information Hidingijcses
 
Neural network based numerical digits recognization using nnt in matlab
Neural network based numerical digits recognization using nnt in matlabNeural network based numerical digits recognization using nnt in matlab
Neural network based numerical digits recognization using nnt in matlabijcses
 
Congestion control in packet switched wide area networks using a feedback model
Congestion control in packet switched wide area networks using a feedback modelCongestion control in packet switched wide area networks using a feedback model
Congestion control in packet switched wide area networks using a feedback modelijcses
 
Ijcses(most cited articles)
Ijcses(most cited articles)Ijcses(most cited articles)
Ijcses(most cited articles)ijcses
 
MULTIDIMENSIONAL ANALYSIS FOR QOS IN WIRELESS SENSOR NETWORKS
MULTIDIMENSIONAL ANALYSIS FOR QOS IN WIRELESS SENSOR NETWORKSMULTIDIMENSIONAL ANALYSIS FOR QOS IN WIRELESS SENSOR NETWORKS
MULTIDIMENSIONAL ANALYSIS FOR QOS IN WIRELESS SENSOR NETWORKSijcses
 
A survey of real-time routing protocols For wireless sensor networks
A survey of real-time routing protocols For wireless sensor networksA survey of real-time routing protocols For wireless sensor networks
A survey of real-time routing protocols For wireless sensor networksijcses
 
An Approach to Detect Packets Using Packet Sniffing
An Approach to Detect Packets Using Packet SniffingAn Approach to Detect Packets Using Packet Sniffing
An Approach to Detect Packets Using Packet Sniffingijcses
 
Artificial intelligence markup language: a Brief tutorial
Artificial intelligence markup language: a Brief tutorialArtificial intelligence markup language: a Brief tutorial
Artificial intelligence markup language: a Brief tutorialijcses
 
A NOBEL HYBRID APPROACH FOR EDGE DETECTION
A NOBEL HYBRID APPROACH FOR EDGE  DETECTIONA NOBEL HYBRID APPROACH FOR EDGE  DETECTION
A NOBEL HYBRID APPROACH FOR EDGE DETECTIONijcses
 
RESEARCH ON DISTRIBUTED SOFTWARE TESTING PLATFORM BASED ON CLOUD RESOURCE
RESEARCH ON DISTRIBUTED SOFTWARE TESTING  PLATFORM BASED ON CLOUD RESOURCERESEARCH ON DISTRIBUTED SOFTWARE TESTING  PLATFORM BASED ON CLOUD RESOURCE
RESEARCH ON DISTRIBUTED SOFTWARE TESTING PLATFORM BASED ON CLOUD RESOURCEijcses
 
PHYSICAL MODELING AND SIMULATION OF THERMAL HEATING IN VERTICAL INTEGRATED ...
PHYSICAL MODELING AND SIMULATION  OF THERMAL HEATING IN VERTICAL  INTEGRATED ...PHYSICAL MODELING AND SIMULATION  OF THERMAL HEATING IN VERTICAL  INTEGRATED ...
PHYSICAL MODELING AND SIMULATION OF THERMAL HEATING IN VERTICAL INTEGRATED ...ijcses
 

More from ijcses (14)

Vision based entomology a survey
Vision based entomology  a surveyVision based entomology  a survey
Vision based entomology a survey
 
A survey on cost effective survivable network design in wireless access network
A survey on cost effective survivable network design in wireless access networkA survey on cost effective survivable network design in wireless access network
A survey on cost effective survivable network design in wireless access network
 
ADAPTIVE MAP FOR SIMPLIFYING BOOLEAN EXPRESSIONS
ADAPTIVE MAP FOR SIMPLIFYING BOOLEAN EXPRESSIONSADAPTIVE MAP FOR SIMPLIFYING BOOLEAN EXPRESSIONS
ADAPTIVE MAP FOR SIMPLIFYING BOOLEAN EXPRESSIONS
 
A Study of Various Steganographic Techniques Used for Information Hiding
A Study of Various Steganographic Techniques Used for Information HidingA Study of Various Steganographic Techniques Used for Information Hiding
A Study of Various Steganographic Techniques Used for Information Hiding
 
Neural network based numerical digits recognization using nnt in matlab
Neural network based numerical digits recognization using nnt in matlabNeural network based numerical digits recognization using nnt in matlab
Neural network based numerical digits recognization using nnt in matlab
 
Congestion control in packet switched wide area networks using a feedback model
Congestion control in packet switched wide area networks using a feedback modelCongestion control in packet switched wide area networks using a feedback model
Congestion control in packet switched wide area networks using a feedback model
 
Ijcses(most cited articles)
Ijcses(most cited articles)Ijcses(most cited articles)
Ijcses(most cited articles)
 
MULTIDIMENSIONAL ANALYSIS FOR QOS IN WIRELESS SENSOR NETWORKS
MULTIDIMENSIONAL ANALYSIS FOR QOS IN WIRELESS SENSOR NETWORKSMULTIDIMENSIONAL ANALYSIS FOR QOS IN WIRELESS SENSOR NETWORKS
MULTIDIMENSIONAL ANALYSIS FOR QOS IN WIRELESS SENSOR NETWORKS
 
A survey of real-time routing protocols For wireless sensor networks
A survey of real-time routing protocols For wireless sensor networksA survey of real-time routing protocols For wireless sensor networks
A survey of real-time routing protocols For wireless sensor networks
 
An Approach to Detect Packets Using Packet Sniffing
An Approach to Detect Packets Using Packet SniffingAn Approach to Detect Packets Using Packet Sniffing
An Approach to Detect Packets Using Packet Sniffing
 
Artificial intelligence markup language: a Brief tutorial
Artificial intelligence markup language: a Brief tutorialArtificial intelligence markup language: a Brief tutorial
Artificial intelligence markup language: a Brief tutorial
 
A NOBEL HYBRID APPROACH FOR EDGE DETECTION
A NOBEL HYBRID APPROACH FOR EDGE  DETECTIONA NOBEL HYBRID APPROACH FOR EDGE  DETECTION
A NOBEL HYBRID APPROACH FOR EDGE DETECTION
 
RESEARCH ON DISTRIBUTED SOFTWARE TESTING PLATFORM BASED ON CLOUD RESOURCE
RESEARCH ON DISTRIBUTED SOFTWARE TESTING  PLATFORM BASED ON CLOUD RESOURCERESEARCH ON DISTRIBUTED SOFTWARE TESTING  PLATFORM BASED ON CLOUD RESOURCE
RESEARCH ON DISTRIBUTED SOFTWARE TESTING PLATFORM BASED ON CLOUD RESOURCE
 
PHYSICAL MODELING AND SIMULATION OF THERMAL HEATING IN VERTICAL INTEGRATED ...
PHYSICAL MODELING AND SIMULATION  OF THERMAL HEATING IN VERTICAL  INTEGRATED ...PHYSICAL MODELING AND SIMULATION  OF THERMAL HEATING IN VERTICAL  INTEGRATED ...
PHYSICAL MODELING AND SIMULATION OF THERMAL HEATING IN VERTICAL INTEGRATED ...
 

Recently uploaded

Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESSALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESmohitsingh558521
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
unit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxunit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxBkGupta21
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxLoriGlavin3
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionDilum Bandara
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 

Recently uploaded (20)

Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESSALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
unit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxunit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptx
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An Introduction
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 

A comparative survey based on processing network traffic data using hadoop pig and typical mapreduce

By default, NetFlow captures all packets received by an ingress interface, and IP filters can be used to decide whether a packet is captured. Once properly configured, it can even record IP packets on the egress interface, and the NetFlow implementation can be customized to suit the requirements of a variety of practical scenarios. NetFlow data from dedicated probes is well suited to observing critical links, whereas data from routers provides a network-wide view of the traffic that is of great use for capacity planning, accounting, performance monitoring, and security.

This paper comprises a survey on network traffic analysis using the Apache Hadoop MapReduce framework and a comparative study of typical MapReduce against Hadoop Pig, a platform for analyzing large data sets that also offers ease of programming, optimization opportunities and extensibility.

2. NETFLOW ANALYSIS USING TYPICAL HADOOP MAPREDUCE FRAMEWORK

NetFlow processing requires sufficient network flow data collected at a particular time or at regular intervals. By analyzing the incoming and outgoing traffic of various hosts, precautions can be taken regarding server security, firewall setup, packet filtering and so on; the significance of network flow analysis can therefore hardly be set aside. NetFlow data can be collected from switches or routers using router-console commands and is then filtered on the basis of parameters such as source and destination IP addresses and the corresponding port numbers. Similarly, every website generates access logs, and these files can be collected from the web server for further analysis. Such logs contain the accessed URLs, the time of access, the total number of hits and other details that are indispensable to a website manager for the proactive detection and handling of problems like DDoS attacks and IP address spoofing. In a distributed environment, Hadoop can be considered an efficient mechanism for big data manipulation.
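The filtering step described above can be sketched in a few lines. This is a minimal single-process illustration, not part of the surveyed system: the CSV field layout (srcaddr, dstaddr, proto, srcport, dstport, bytes) and the sample addresses are assumptions made for the example, not a fixed NetFlow export format.

```python
# Sketch: filtering exported flow records by protocol and source IP.
# The CSV layout and sample data below are illustrative assumptions.
import csv
from io import StringIO

RAW = """\
10.0.0.5,192.168.1.9,6,51234,80,1200
10.0.0.7,192.168.1.9,17,5353,5353,300
10.0.0.5,192.168.1.20,6,51235,443,5400
"""

def parse_flows(text):
    """Yield one dict per exported flow record."""
    fields = ["srcaddr", "dstaddr", "proto", "srcport", "dstport", "bytes"]
    for row in csv.reader(StringIO(text)):
        rec = dict(zip(fields, row))
        rec["bytes"] = int(rec["bytes"])
        yield rec

# Keep only TCP flows (IP protocol 6) originating from one host.
tcp_from_host = [r for r in parse_flows(RAW)
                 if r["proto"] == "6" and r["srcaddr"] == "10.0.0.5"]
print(len(tcp_from_host))                      # matching flow count
print(sum(r["bytes"] for r in tcp_from_host))  # total bytes carried
```

The same predicate (protocol, source address, port) is what a router-side filter or a later MapReduce job would apply, only at a much larger scale.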
Hadoop was developed by the Apache Software Foundation and is well known for its scalability as well as its efficiency. It works on a computational paradigm called MapReduce, characterized by a Map and a Reduce phase. The Map phase divides the computation into many fragments and distributes them among the cluster nodes. The individual results produced at the end of the Map phase are eventually combined and reduced to a single output in the Reduce phase, thereby producing the result in a very short span of time.

The log files generated every minute by a web server or router accumulate into a large file that needs a great deal of time to be processed. In such an instance, the advantage of the Hadoop MapReduce paradigm can be exploited: the big data file is fed as input to a MapReduce algorithm, which returns an efficiently processed output file. Each map function transforms the input data into a number of key/value pairs, which are then sorted by key, while a reduce function combines the results with the same key into a single output.

Hadoop stores data in its distributed file system, popularly known as HDFS, and makes the requested data easily available to users. HDFS is highly scalable and portable. It enhances reliability by replicating data across multiple nodes in the cluster, thereby eliminating the need for RAID storage on the cluster nodes. High availability can also be incorporated in Hadoop clusters by means of a backup provision for the master node (namenode), generally known as the secondary namenode. The common limitations of HDFS are that it cannot be mounted directly by an existing operating system and that moving data in and out of HDFS before and after the execution of a job is inconvenient.
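The two phases described above can be mimicked in a single process. The sketch below is a Python analogue of the MapReduce dataflow — emit key/value pairs, shuffle by key, reduce each group — not the actual Hadoop Java API; the record layout (source IP, destination IP, byte count) is an assumption for illustration.

```python
# Single-process analogue of the Map and Reduce phases: map emits
# key/value pairs, a shuffle groups them by key, and reduce combines
# each group into one output. Record layout is illustrative.
from collections import defaultdict

records = [
    "10.0.0.5 192.168.1.9 1200",
    "10.0.0.7 192.168.1.9 300",
    "10.0.0.5 192.168.1.20 5400",
]

def map_phase(line):
    """Emit (source IP, byte count) for one flow record."""
    src, _dst, nbytes = line.split()
    yield src, int(nbytes)

def reduce_phase(key, values):
    """Combine all values seen for one key into a single total."""
    return key, sum(values)

# Shuffle/sort: group intermediate pairs by key, as the framework would.
groups = defaultdict(list)
for line in records:
    for k, v in map_phase(line):
        groups[k].append(v)

result = dict(reduce_phase(k, vs) for k, vs in sorted(groups.items()))
print(result)
```

In a real Hadoop job the same map and reduce logic would run in parallel across the cluster, with HDFS holding the input and output files and the framework performing the shuffle.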
Once Hadoop is deployed on a cluster, instances of these processes are started: namenode, datanode, jobtracker and tasktracker. Hadoop can then process large amounts of data such as website logs and network traffic logs obtained from switches. The namenode, otherwise known as the master node, is the main component of HDFS; it stores the directory tree of the file system and keeps track of the data during execution. The datanode, as its name suggests, stores the data in HDFS. The jobtracker delegates MapReduce tasks to the cluster nodes, and the tasktracker spawns and executes jobs on the member nodes.

A key drawback of Hadoop MapReduce is its coding complexity: jobs can be written only by programmers with well-developed programming skills. The MapReduce programming model is highly restrictive, and Hadoop cluster management also becomes difficult with respect to aspects like log collection, debugging and software distribution. To be precise, the effort that must be invested in MapReduce programming restricts its acceptability to a considerable extent. Here arises the need for a platform that provides ease of programming along with enhanced extensibility, and Apache Pig reveals itself as a novel but realistic approach to data analytics. The major features and applications of Apache Pig deployed over the Hadoop framework are described in Section 3.

3. USING APACHE PIG TO PROCESS NETWORK FLOW DATA

Apache Pig is a platform that supports substantial parallelization, enabling the processing of huge data sets. The structure of Pig can be described in two layers: an infrastructure layer with a built-in compiler that can spawn MapReduce programs, and a language layer consisting of a text-processing language called "Pig Latin".
Pig makes data analysis and programming easier and understandable even to novices in the Hadoop framework. It uses the Pig Latin language, which can be considered a parallel data flow language, and the Pig setup also has provisions for further optimizations and user-defined functions. Pig Latin queries are compiled into MapReduce jobs and executed in distributed Hadoop cluster environments. Pig Latin looks similar to SQL (Structured Query Language) but is designed for the Hadoop data processing environment, playing a role analogous to that of SQL in the RDBMS environment. Another member of the Hadoop ecosystem, Hive, offers an SQL-like query language but needs data to be loaded into tables before processing. Unlike Hive, Pig Latin can work in schema-less or inconsistent environments and can operate on the input data as soon as it is loaded into HDFS. Pig closely resembles scripting languages like Perl and Python in aspects such as syntactic flexibility and dynamic variable definitions. Hence, Pig Latin can be considered efficient enough to be the native parallel processing language for distributed systems such as Hadoop.
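To give a flavour of this dataflow style, the comments below sketch how a flow file might be grouped and counted in hypothetical Pig Latin statements, while the Python code mirrors that dataflow step by step in a single process. The field names (srcaddr, dstport) and the sample tuples are assumptions for the example, not taken from the surveyed system.

```python
# The Pig Latin statements in the comments are a hypothetical sketch;
# the Python below mirrors the same group-and-count dataflow locally.
from collections import Counter

# flows = LOAD 'netflow.txt' AS (srcaddr, dstaddr, srcport, dstport);
flows = [
    ("10.0.0.5", "192.168.1.9", 51234, 80),
    ("10.0.0.7", "192.168.1.9", 5353, 5353),
    ("10.0.0.5", "192.168.1.20", 51235, 443),
]

# grouped = GROUP flows BY srcaddr;
# counts  = FOREACH grouped GENERATE group, COUNT(flows);
counts = Counter(src for src, _dst, _sp, _dp in flows)

# DUMP counts;
for src, n in sorted(counts.items()):
    print(src, n)
```

Written as an actual Pig Latin script, the same three statements would compile into a MapReduce job and run across the cluster without the author ever writing map or reduce functions by hand.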
Figure 1. Data processing model in Pig

It is clear that Pig is gaining popularity in different zones of computational and statistical departments. Along with that, Apache Pig is widely used for processing weblogs and wiping corrupted data off records. It can build behaviour-prediction models based on the users' interactions with a website, and this speciality can be applied to displaying ads and news stories that keep the user engaged while visiting a website. It also supports an iterative processing model and can keep track of every new update by joining the behavioural model with the user data. This feature is widely used in social-networking sites like Facebook and micro-blogging sites like Twitter. Pig cannot be considered as effective as MapReduce when it comes to processing small data or scanning multiple records in random order. The data processing model framed by Hadoop Pig is depicted in Figure 1. In this paper, we explore the scope of Pig Latin for processing the netflow data obtained from Cisco routers. The proposed method for using Pig for netflow analysis is described in Section 4.

4. IMPLEMENTATION ASPECTS

Here we present NetFlow analysis using Hadoop, which is widely used in managing large volumes of data, employs parallel processing, and delivers the required output quickly. A MapReduce algorithm needs to be deployed to obtain the required result, and coding such MapReduce programs for analyzing such a huge amount of data is a time-consuming task. Here Apache Pig helps us save a good deal of time and also makes it possible to deploy the system in a real-time scenario. We set our system up on the Java platform; the version required is 1.6 or above. The system implemented is completely open source, and the Java Development Kit can be downloaded and installed freely.
The key factor to be taken care of is setting the proper environment variables for Java so that the base framework can work properly. Hadoop makes use of the Java framework for its functioning, and we deploy Hadoop on top of Java. The Hadoop downloadable packages are also open source and need to be
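For instance, the Java environment variables can be exported as below; the JDK path shown is an assumption and must be adjusted to the actual install location:

```shell
# Hypothetical JDK location; point JAVA_HOME at the JDK actually installed.
export JAVA_HOME=/usr/lib/jvm/java-6-openjdk
export PATH=$JAVA_HOME/bin:$PATH
```

Placing these lines in the shell profile (and mirroring JAVA_HOME in hadoop-env.sh) ensures the base framework can find the Java runtime.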
extracted and configured properly. The following files need to be tweaked for setting the required clustering as well as replication parameters for the whole system:

• hadoop-env.sh: Sets the Java directory path.
• core-site.xml: Concerns the Hadoop name node, port and temporary directories.
• hdfs-site.xml: Concerns parameters like replication values.
• mapred-site.xml: Concerns the host and port of the MapReduce job tracker.
• masters: Details of a secondary name node, if any. (Here, it is not needed.)
• slaves: Details of the slave nodes, if any.

Figure 2. System setup

The Hadoop processes should be started after the deployment, and we can test them by using a command called 'jps', which lists all the Hadoop processes in the system. It should list a Namenode, Datanode, Jobtracker and Tasktracker to ensure that the processes are all working and the Hadoop deployment is sound. Further testing can be done by creating and deleting directories as well as moving files in and out of HDFS. Next we need the Pig system for our test setup, which we deploy by downloading the open-source Pig tarballs and properly setting the corresponding environment variables. The structure of our system setup is given in Figure 2. We need to analyze the traffic at a particular time, and this is where Pig plays the lead role. For the data analysis, buckets are created with the pairs (SrcIF, SrcIPaddress), (DstIF, DstIPaddress) and (Protocol, FlowperSec). After that, they are joined and categorized as SrcIF and FlowperSec; SrcIPaddress and FlowperSec. Netflow data is collected from a busy Cisco switch using tools such as nfdump. The output contains three basic sections: an IP Packet section, a Source-Destination Packets section and a Protocol Flow section.
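A typical collection pipeline might look like the sketch below; the flags, file names and HDFS paths are assumptions for illustration, not taken from the paper.

```shell
# Assumed capture file name; nfdump reads files written by an nfcapd collector.
nfdump -r nfcapd.current -o csv > netflow.csv

# Stage the exported flow records in HDFS so that Pig can load them.
hadoop fs -mkdir /netflow
hadoop fs -put netflow.csv /netflow/
```

These commands presuppose a running flow collector and Hadoop cluster, so they are shown only as an environment sketch.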
Figure 3. Netflow Analysis using Hadoop Pig

These three sections are created as three different schemas, since Pig needs the netflow datasets in an arranged format, which is then converted into SQL-like schemas. Hence, a sample of the Pig Latin code is given as follows:

Protocol = LOAD 'NetFlow-Data1' AS (protocol:chararray, flow:int, …)
Source = LOAD 'NetFlow-Data2' AS (SrcIF:chararray, SrcIPAddress:chararray, …)
Destination = LOAD 'NetFlow-Data3' AS (DstIF:chararray, DstIPaddress:chararray, …)

As shown in Figure 3, the network flow data is deduced into schemas by Pig, and later these schemas are combined for analysis to deliver the traffic flow per node as the resultant output.

5. ADVANTAGES OF USING HADOOP PIG OVER MAPREDUCE

The member of the Apache Hadoop ecosystem popularly known as Pig can be considered a procedural version of SQL. The most attractive feature and advantage of Pig is its simplicity itself. It doesn't demand highly skilled Java professionals for its implementation. The flexibility of its syntax as well as dynamic variable allocation makes it much more user-friendly. Added extensibility for the framework is provided by the provisions for user-defined functions. The ability of Pig to process unstructured data contributes greatly to its parallel data processing capability. A remarkable difference emerges when the computational complexity is considered along with these aspects. Typical MapReduce programs will have a large number of Java code lines, but a similar task can be accomplished using Pig Latin scripts in far fewer lines of code. Huge network flow data input files can be processed by writing simple programs in Pig Latin, which closely resembles scripting languages like Perl and Python.
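Following the bucket pairs described in Section 4, the loaded relations can then be grouped and counted. The sketch below (relation names and the exact aggregation are illustrative, not the authors' verbatim script) derives the flows per source interface and per source address:

```
-- Group the Source relation by interface and by address, then count
-- the flows in each bucket, as described for the analysis phase.
by_if = GROUP Source BY SrcIF;
if_flows = FOREACH by_if GENERATE group AS SrcIF, COUNT(Source) AS FlowperSec;
by_ip = GROUP Source BY SrcIPAddress;
ip_flows = FOREACH by_ip GENERATE group AS SrcIPAddress, COUNT(Source) AS FlowperSec;
STORE if_flows INTO 'flows-per-interface';
STORE ip_flows INTO 'flows-per-address';
```

The same GROUP/COUNT pattern applies to the Destination and Protocol relations to complete the per-node traffic picture.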
In other words, the computational complexity of Pig Latin scripts is much lower than that of similar MapReduce programs written in Java, and this particular property makes Pig suitable for the implementation of this project. The procedures and experiments followed for arriving at such a conclusion are described in Section 6.

6. EXPERIMENTS AND RESULTS

Apache Pig was selected as the programming language for Netflow analysis after conducting a short test on sample input files of different sizes.
6.1. Experimental Setup

Apache Hadoop was installed and configured on a stand-alone Ubuntu node initially, and Hadoop Pig was deployed on top of this system. SSH key-based authentication was set up on the machine for convenience, as Hadoop otherwise prompts for passwords while performing operations. Hadoop instances such as the namenode, datanode, jobtracker and tasktracker are started before running Pig, which shows a “grunt” console upon starting. The grunt console is the command prompt specific to Pig and is used for data and file system manipulation. Pig can work both in local mode and in MapReduce mode, selectable as per the user's requirement. A sample Hadoop MapReduce word-count program written in Java was used for the comparison. The typical MapReduce program used to process a sample input file of sufficient size had one hundred and thirty lines of code, and its time of execution was noted. The same operation coded in Pig Latin contained only five lines of code, and the corresponding running time for the Pig Latin code was also noted for framing the results.

6.2. Experiment Results

At a particular instant, the size of the input file is held constant for both MapReduce and Pig, whereas the lines of code are many for the former and few for the latter. For the sake of a good comparison, the input file size was varied in the range of 13 KB to 208 KB, and the corresponding execution times were recorded as in Figure 4.

Figure 4. Hadoop Pig vs. MapReduce

Figure 5. Input file-size vs. execution time for MapReduce
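For reference, a word count of the kind used in this comparison can indeed be written in roughly five lines of Pig Latin, for example (file names are placeholders):

```
lines = LOAD 'sample-input.txt' AS (line:chararray);
words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
grouped = GROUP words BY word;
counts = FOREACH grouped GENERATE group AS word, COUNT(words) AS total;
STORE counts INTO 'wordcount-output';
```

The equivalent Java MapReduce program must spell out mapper and reducer classes, key/value types and job configuration, which accounts for the difference in code size.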
From this we can derive the following observation: as the input file size increases in multiples of x, the execution time for typical MapReduce also increases proportionally, whereas Hadoop Pig maintains a nearly constant time, at least for x up to 5. Graphs plotting the input file size against the execution time for both cases are given as Figure 5 and Figure 6.

Figure 6. Input file-size vs. execution time for Pig

7. CONCLUSION

In this paper, a detailed survey regarding the widespread acceptance of the Apache Pig platform is conducted. The major features of the network traffic flows are extracted for data analysis using Hadoop Pig for the implementation. Pig was tested and proved to be advantageous in this aspect, with a very low computational complexity. The main advantage is that the network flow capture and flow analysis are done with the help of easily available open-source tools. Upon the implementation of a Pig-based flow analyzer, network traffic flow analysis can be done with increased convenience and ease, even without highly developed programming expertise. It can greatly reduce the time consumed in researching the complex control loops and programming constructs needed for implementing a typical MapReduce paradigm in Java, while still enjoying the advantages of the MapReduce task-division technique by means of the layered architecture and in-built compilers of Apache Pig.

ACKNOWLEDGEMENT

We are greatly indebted to the college management and the faculty members for providing the necessary hardware and other facilities, along with timely guidance and suggestions for implementing this work.
Authors

Anjali P P is an MTech student in the Information Technology Department of Rajagiri School of Engineering and Technology under Mahatma Gandhi University, specializing in Network Engineering. She received her BTech degree in Computer Science and Engineering from Cochin University of Science and Technology. Her industrial experience spans around two years in systems engineering, networking and server administration. Presently, she is working as a Systems Engineer. Her research interests include networking and big data analysis using Hadoop.

Binu A has been working as an Assistant Professor in the Department of Information Technology since May 2004. He has one year of industrial experience in web and systems administration. He graduated (M.Tech. Software Engg.) from the Dept. of Computer Science, CUSAT, and took his B.Tech. in Computer Science and Engg. from the School of Engineering, CUSAT. At present he is doing a PhD in the Dept. of Computer Science, CUSAT. His areas of interest include Theoretical Computer Science, Compiler & Architecture.