SURVEY ON LOAD BALANCING AND DATA SKEW
MITIGATION IN MAPREDUCE APPLICATIONS
Vishal A. Nawale¹, Prof. Priya Deshpande²
¹Information Technology, MAEER’s MITCOE, Kothrud, Pune, India
²Information Technology, MAEER’s MITCOE, Kothrud, Pune, India
ABSTRACT
In recent years, the MapReduce programming model has shown great success in processing
huge amounts of data. MapReduce is a framework for data-intensive distributed computing of batch
jobs. Data-intensive processing creates skew in the MapReduce framework and degrades
performance considerably, leading to greatly varying execution times for MapReduce jobs. These
varying execution times result in low resource utilization and high overall execution time, since a
new MapReduce cycle can start only after all reducers have completed. In this paper, we study
various methodologies and techniques used to mitigate data skew and partition skew, and describe
the advantages and limitations of these approaches.
Keywords: Data Skew, Hadoop, MapReduce, Partition Skew.
I. INTRODUCTION
Big Data has become a common term in the Information Technology industry. A great deal
of data is available in the IT industry, but before big data tooling came into the picture little of it
could be made useful: traditional systems were not powerful enough to analyze such huge volumes
of data. When we talk about big data, the first name that comes to mind is Hadoop. Doug Cutting
helped create Apache Hadoop as the data generated by the web exploded and grew far beyond the
handling capacity of conventional systems. With Hadoop, no data is too big. In today's world, where
tremendous amounts of data are generated daily, Hadoop lets various industries find value in data
that was until recently considered useless.
Hadoop uses the Hadoop Distributed File System (HDFS) as its primary distributed storage.
HDFS is a well-known distributed, scalable, and portable file system that runs on commodity
hardware [8].
The Name Node (master node) holds metadata about the whole system, such as the data
stored on the data nodes, free space, active nodes, passive nodes, the job tracker and task trackers,
and configuration settings such as the data replication factor [8].
A Data Node is a slave node used to store data; the task tracker running on each data node
tracks the ongoing jobs assigned to it by the name node [8].
Figure 1: Hadoop Architecture [14]
A small Hadoop cluster includes one master node and multiple slave nodes. A slave node
acts as a Data Node in the HDFS architecture and as a Task Tracker in the MapReduce framework.
Such single-master clusters are normally used only in non-standard applications. A large cluster, in
contrast, includes a dedicated Name Node, which manages the entire file system index (metadata),
and a secondary Name Node, which periodically generates snapshots of the Name Node's state to
help recover from Name Node failure and reduce data loss. Similarly, in the MapReduce framework
a standalone Job Tracker server manages job scheduling, while several task trackers run on the Data
Nodes (slaves) [11], [8].
Over the last few years, the MapReduce framework has become a paradigm for bulk
workloads and a programming model for parallel data processing. MapReduce programs can be
written in many languages, such as Java, Ruby, Python, and C++. Hadoop and the MapReduce
framework are used by big market players such as Facebook, Google, Yahoo, and Amazon.
MapReduce divides processing into two phases, a map phase and a reduce phase. Both phases take
(key, value) pairs as input and produce (key, value) pairs as output [8].
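As a concrete illustration of this (key, value) model, the following is a minimal sketch of the classic word-count job written against the standard Hadoop MapReduce API; it is a textbook example for orientation, not code from any of the surveyed systems.

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map phase: (offset, line) -> (word, 1) for every word in the line.
class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE);          // emit an intermediate (key, value) pair
        }
    }
}

// Reduce phase: (word, [1, 1, ...]) -> (word, total count).
class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum)); // one output pair per distinct word
    }
}
```

The map phase emits one (word, 1) pair per token, and the reduce phase sums the values for each distinct word; the partitioner in between decides which reducer sees which keys, which is where skew enters.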
Figure 2: Multi Node Hadoop Cluster [11]
To maximize MapReduce performance, ideally all tasks should finish at around the same
time, because the job completion time in MapReduce is determined by the slowest running task in
the job. While MapReduce is a popular data processing tool, it still has several important limitations.
In particular, skew is a significant challenge in many applications executed on this platform. When
skew arises, some tasks in a job take significantly longer to process their input data than the
remaining tasks, which slows down the entire computation.
1. Causes of Skew
Data skew can occur in both the map phase and the reduce phase. Map skew occurs when
some input data are more difficult to process than others, but it is rare and can be easily addressed by
simply splitting map tasks. Lin [9] provided an application-specific solution that splits large,
expensive records into smaller ones. Data skew in the reduce phase (also called reduce skew or
partitioning skew) is much more challenging.
Skew has many well-known causes. The original MapReduce paper considered the problem
of skew caused by resource contention and hardware malfunction. To deal with this type of skew,
MapReduce and Hadoop include a mechanism in which the last few tasks of a job are speculatively
replicated on a different machine, and the job completes when the fastest replicas of these final tasks
complete. There are two main causes of skew. The first is uneven distribution of data: the input data
provided to map tasks is unevenly distributed, whether in the size of the data, in its complexity, or in
both.
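This first cause is easy to reproduce with Hadoop's default partitioning rule: HashPartitioner assigns each key to a reducer purely by its hash, so every record carrying the same hot key lands on one reduce task regardless of volume. The sketch below mirrors that logic on a made-up key distribution; the key data is invented for the example.

```java
// Simplified version of Hadoop's default HashPartitioner logic: every
// occurrence of the same key goes to the same reducer, so one very
// frequent key overloads a single reduce task.
public class SkewDemo {
    static int partition(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        int numReduceTasks = 4;
        // Hypothetical skewed intermediate data: "the" dominates.
        String[] keys = {"the", "the", "the", "the", "the", "zebra", "quark", "idle"};
        int[] load = new int[numReduceTasks];
        for (String k : keys) {
            load[partition(k, numReduceTasks)]++;
        }
        for (int r = 0; r < numReduceTasks; r++) {
            System.out.println("reducer " + r + " receives " + load[r] + " records");
        }
    }
}
```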
Figure 3: Uneven distribution of data [13]
The second cause of skew relates to straggler nodes: a straggler is a node that takes more time
to process data than the other nodes [13].
Figure 4: Skew caused by straggler node [13]
II. LITERATURE SURVEY
Qi Chen et al. [1] present LIBRA, a sampling and partitioning algorithm that balances the
load on reduce tasks. The architecture of the LIBRA system is shown in Figure 5. Data skew
mitigation in LIBRA consists of the following steps:
A small percentage of the map tasks is selected as sample tasks, and these sample tasks are
issued first whenever free slots are available in the system. Ordinary map tasks are issued only after
all sample tasks have completed. The sample tasks collect statistics on the intermediate data during
normal map task processing and send the collected data distribution to the master after they
complete. The master node collects all the sample information, derives an estimate of the data
distribution, makes the partition decision, and then notifies the slave nodes. When the slave nodes
receive the partition decision, they partition the data accordingly; the normal map tasks that have
already been issued also partition their intermediate data. Reduce tasks do not need to wait for the
completion of all issued map tasks; they can be issued as soon as the partitioning decisions are ready.
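The details of LIBRA's sampling and range-partitioning algorithm are given in [1]; the sketch below only illustrates the general idea of the master's step, turning a sampled key-frequency histogram into reducer boundaries so that each reducer receives roughly the same number of records. The method name, interface, and sample data are all assumptions for the illustration, not LIBRA's actual code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Illustrative only: derive range-partition boundaries from sampled
// key frequencies so each reducer gets roughly equal load.
public class SamplePartitioner {
    static List<String> boundaries(TreeMap<String, Long> sampledFreq, int reducers) {
        long total = sampledFreq.values().stream().mapToLong(Long::longValue).sum();
        long target = total / reducers;       // ideal load per reducer
        List<String> cuts = new ArrayList<>();
        long acc = 0;
        for (var e : sampledFreq.entrySet()) {
            acc += e.getValue();
            if (acc >= target && cuts.size() < reducers - 1) {
                cuts.add(e.getKey());         // close the current key range here
                acc = 0;
            }
        }
        return cuts;                          // keys <= cuts[i] go to reducer i
    }

    public static void main(String[] args) {
        TreeMap<String, Long> freq = new TreeMap<>();
        freq.put("a", 50L); freq.put("b", 5L); freq.put("c", 5L);
        freq.put("d", 20L); freq.put("e", 20L);
        // Hot key "a" alone fills reducer 0; the rest go to reducer 1.
        System.out.println(boundaries(freq, 2));
    }
}
```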
Figure 5: LIBRA system Architecture [1]
The sampling method of LIBRA achieves a good estimate of the distribution of the whole
original data set. LIBRA significantly reduces the unpredictability of job execution time by
partitioning the intermediate data more evenly across reduce tasks. LIBRA can be used in various
applications and can yield up to a 4X performance improvement. It works in both homogeneous and
heterogeneous environments, and its overhead is minimal and negligible even in the absence of
skew.
Q. Chen et al. [2] propose a new speculative execution strategy named Maximum Cost
Performance (MCP). A machine that takes an unusually long time to complete a task is called a
straggler machine; stragglers increase the job execution time and decrease the cluster throughput
drastically. One common approach to the straggler problem is speculative execution, which simply
backs up a straggler's task on an alternative node.
Various speculative execution strategies improve performance in heterogeneous and
homogeneous environments, but they have drawbacks that degrade performance. For the cases
where the existing strategies do not work well, the new MCP strategy extends the usefulness of
speculative execution. Here, cost refers to the computing resources occupied by tasks, while
performance refers to the reduction in job execution time and the boost in cluster throughput. MCP
aims at selecting straggler tasks accurately and promptly and backing them up on proper worker
nodes. It is scalable and gives good performance in both small and large clusters.
To identify stragglers accurately and promptly, MCP provides three methods:
1. It uses the progress rate and the process bandwidth within a phase to detect slow tasks.
2. It uses EWMA (Exponentially Weighted Moving Average) to predict the process speed and to
estimate a task's remaining time (illustrated in the sketch after this list).
3. It uses a cost-benefit model that takes the cluster load into account to determine which task to
back up on another node.
To pick an appropriate worker node for backup tasks, MCP takes both data skew and data locality
into consideration.
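The exact MCP formulas are in [2]; the following sketch only illustrates method 2 of the list above, smoothing the observed progress speed with an EWMA and using it to estimate a task's remaining time. The smoothing factor and sample values are assumptions for the example, not MCP's tuned parameters.

```java
// Illustration of EWMA-based speed smoothing and remaining-time
// estimation, in the spirit of MCP's method 2 (not its exact formula).
public class EwmaEstimator {
    private final double alpha;   // smoothing factor, e.g. 0.2 (assumed)
    private double speed = -1;    // smoothed progress per second

    EwmaEstimator(double alpha) { this.alpha = alpha; }

    void observe(double progressDelta, double seconds) {
        double sample = progressDelta / seconds;
        speed = (speed < 0) ? sample : alpha * sample + (1 - alpha) * speed;
    }

    // Estimated seconds until the task reaches progress 1.0.
    double remainingTime(double progress) {
        return (1.0 - progress) / speed;
    }

    public static void main(String[] args) {
        EwmaEstimator est = new EwmaEstimator(0.2);
        est.observe(0.10, 10);   // 1% of the input per second at first
        est.observe(0.02, 10);   // the task slows down to 0.2% per second
        System.out.printf("remaining ~%.0f s%n", est.remainingTime(0.12));
        // A task whose remaining time far exceeds its peers' is a
        // speculation candidate, subject to MCP's cost-benefit check.
    }
}
```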
Y. Kwon et al. [3] designed SkewTune, which is API-compatible with Hadoop and provides
the same parallel job execution environment while adding the capability to detect and mitigate skew.
SkewTune is designed for MapReduce-type systems in which each operator reads data from disk and
writes data to disk and is thus decoupled from the other operators in the data flow. SkewTune
continuously monitors the execution of a UDO (User-Defined Operation) and detects the current
bottleneck task that dominates and delays job completion. When such a task is identified, its
processing is stopped and its remaining unprocessed input data is repartitioned among idle nodes.
Repartitioning occurs only when an idle node is present or when new nodes are dynamically added
to accelerate the job; thus, if no task experiences considerable data skew, SkewTune imposes no
overhead. If a node is idle and at least one task is expected to take a long time to complete,
SkewTune stops the task with the longest expected remaining time, repartitions its remaining input
data across the idle nodes, and parallelizes the rest of the work, taking into consideration the
expected future availability of all nodes in the cluster. SkewTune continuously detects and removes
bottleneck tasks until the job completes.
When repartitioning a task, SkewTune takes into account the predicted availability of all
nodes in the cluster to minimize the job completion time. The availability is estimated from the
progress of running tasks. The unprocessed remaining input data of a task is repartitioned across the
nodes that are currently available, utilizing them fully, and the nodes that are expected to become
available after some time. SkewTune then concatenates all the repartitioned data to construct a final
output exactly the same as if no repartitioning had happened.
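SkewTune's actual repartitioning planner is described in [3]; the fragment below is only a schematic of the selection step, picking the task with the greatest expected remaining time once an idle node appears. The task representation, the linear time estimate, and the threshold are invented for the illustration.

```java
import java.util.Comparator;
import java.util.List;

// Schematic of SkewTune's selection rule: when a node becomes idle,
// stop the task with the longest expected remaining time and hand its
// unprocessed input to the repartitioner. Task fields are illustrative.
public class BottleneckPicker {
    record RunningTask(String id, double progress, double elapsedSec) {
        // Naive linear estimate: time so far scaled by the work left.
        double expectedRemaining() {
            return elapsedSec * (1.0 - progress) / Math.max(progress, 1e-6);
        }
    }

    static RunningTask pick(List<RunningTask> tasks, double minWorthSec) {
        return tasks.stream()
                .filter(t -> t.expectedRemaining() > minWorthSec) // skip near-done tasks
                .max(Comparator.comparingDouble(RunningTask::expectedRemaining))
                .orElse(null); // null -> no skew worth mitigating, zero overhead
    }

    public static void main(String[] args) {
        List<RunningTask> tasks = List.of(
                new RunningTask("r1", 0.95, 100),
                new RunningTask("r2", 0.20, 100));  // the straggler
        System.out.println(pick(tasks, 30).id());    // prints r2
    }
}
```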
Y. Kwon et al. [4] developed SkewReduce, a static optimizer that takes a User-Defined
Operation (UDO) as input and outputs a data partition plan that balances load among all processing
nodes. SkewReduce balances not simply the amount of data assigned to each operator partition, but
the expected processing time for the assigned data. While a static optimizer could be applied to
different types of UDOs, SkewReduce was designed for feature-extracting applications, in which
emergent features are extracted from data items embedded in a metric space. These applications are
captured by two UDOs: one that extracts features from sub-spaces and another that merges features
that cross partition boundaries.
The key idea behind SkewReduce is to ask the user for extra information about their UDOs
(the feature extraction and merge functions). The user writes additional cost functions that
characterize the UDOs' runtimes. A cost function takes as input a sample of the input data, the
sampling rate α that this sample corresponds to, and the bounding hypercube the sample was taken
from. The function returns a real number representing an estimate of the processing time for the
data. The sampling rate α allows the cost function to properly scale the estimate up to the overall
dataset. The best cost function returns values proportional to the actual cost of processing the data
partition bounded by the hypercube. The quality of SkewReduce's plans depends on the quality of
the cost functions; however, the authors found that even inaccurate cost functions could help reduce
runtimes compared to no optimization at all. For domain experts such as statisticians and scientists,
the authors assume that writing a cost model is easier than debugging and optimizing distributed
programs. The cost models are specific to the implementation and can be reused across many data
sets for the same UDOs.
Given the cost models, a sample of the input data, and the cluster configuration (e.g., the
number of nodes and the scheduling algorithm), SkewReduce searches for a good partition plan for
the input data by (a) applying fine-grained data partitioning where considerable data skew is
expected, (b) applying coarse-grained data partitioning where skew is not expected, and
(c) balancing the partition granularity in a way that minimizes the expected job completion time
under the given cluster configuration. More specifically, the algorithm proceeds as follows. Starting
from a single partition that corresponds to the entire hypercube bounding the input data (and thus
also the data sample), the optimizer greedily splits the most expensive leaf partition in
the current partition plan. The optimizer stops splitting partitions when two conditions are met:
(a) all partitions can be processed and merged without running out of memory, and (b) no further
leaf-node split improves the runtime, i.e., further splitting a node increases the expected runtime
compared to the current plan. To check the second condition, the SkewReduce optimizer first checks
whether the original execution time of the partition is greater than the expected execution time of the
two sub-partitions running in parallel followed by the newly added merge function that reconciles
features found in the individual sub-partitions. If this condition is satisfied, SkewReduce further
verifies that the resulting tasks can also be scheduled in the cluster in a manner that reduces the
runtime. It proceeds with the split only when both conditions are met.
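SkewReduce's full optimizer appears in [4]; the loop below is only a schematic of the greedy search it describes, splitting the most expensive leaf while a user-supplied cost function says the split pays off. The one-dimensional partition type, the cost-function interface, the fixed merge cost, and the stopping test are all simplified assumptions.

```java
import java.util.PriorityQueue;
import java.util.function.ToDoubleFunction;

// Schematic of SkewReduce-style greedy partitioning: repeatedly split
// the most expensive leaf while the user's cost model predicts a win.
public class GreedyPlanner {
    record Partition(double lo, double hi) {}   // 1-D "hypercube" for brevity

    static void plan(Partition root, ToDoubleFunction<Partition> cost,
                     double mergeCost, int maxLeaves) {
        PriorityQueue<Partition> leaves = new PriorityQueue<>(
                (a, b) -> Double.compare(cost.applyAsDouble(b), cost.applyAsDouble(a)));
        leaves.add(root);
        while (leaves.size() < maxLeaves) {
            Partition worst = leaves.peek();    // most expensive leaf
            double mid = (worst.lo() + worst.hi()) / 2;
            Partition left = new Partition(worst.lo(), mid);
            Partition right = new Partition(mid, worst.hi());
            // Split only if running the halves in parallel plus the merge
            // beats processing the partition whole (condition (b) above).
            double parallel = Math.max(cost.applyAsDouble(left),
                                       cost.applyAsDouble(right)) + mergeCost;
            if (parallel >= cost.applyAsDouble(worst)) break;
            leaves.poll();
            leaves.add(left);
            leaves.add(right);
        }
        leaves.forEach(p -> System.out.println(p + " cost=" + cost.applyAsDouble(p)));
    }

    public static void main(String[] args) {
        // Hypothetical cost model: data is denser near 0, so the plan
        // ends up fine-grained near 0 and coarse-grained elsewhere.
        ToDoubleFunction<Partition> cost = p -> 100 * (p.hi() - p.lo()) / (p.lo() + 0.1);
        plan(new Partition(0, 1), cost, 2.0, 8);
    }
}
```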
Wei Yan et al. [5] presented a scalable solution for load-balanced record linkage over the
MapReduce framework. The solution comprises two low-memory load balancing algorithms that
work with sketch-based approximate data profiles, and it is divided into two parts. First, a
sketch-based profiling method that is (1) scalable in the number of blocks and the size of the input
data and (2) efficient to construct, with each update taking constant time. Second, the proposed Cell
Block division and Cell Range division algorithms, which can efficiently break up expensive blocks
without losing any record pairs that need to be compared. A sketch is a two-dimensional array of
cells, with each row indexed by its own pairwise-independent hash function.
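As background for the sketch structure just described, the fragment below shows a generic update to a two-dimensional array of cells, where each row uses its own pairwise-independent hash to pick a column; the actual FastAGMS sketch used in [5] additionally keeps signed counters for join-size estimation. The hash family here is a simple illustrative choice, not the one from the paper.

```java
import java.util.Random;

// Generic 2-D sketch update: each of d rows hashes the key with its own
// pairwise-independent function to choose one of w cells to increment.
public class CellSketch {
    private final long[][] cells;
    private final int[] seedA, seedB;            // per-row hash coefficients
    private static final int PRIME = 2147483647; // 2^31 - 1

    CellSketch(int rows, int width, long seed) {
        cells = new long[rows][width];
        seedA = new int[rows];
        seedB = new int[rows];
        Random rnd = new Random(seed);
        for (int i = 0; i < rows; i++) {
            seedA[i] = rnd.nextInt(PRIME - 1) + 1;
            seedB[i] = rnd.nextInt(PRIME);
        }
    }

    void update(String key, long count) {
        int x = key.hashCode() & Integer.MAX_VALUE;
        for (int i = 0; i < cells.length; i++) {
            int col = (int) (((long) seedA[i] * x + seedB[i]) % PRIME) % cells[i].length;
            cells[i][col] += count;  // constant work per row => O(d) per update
        }
    }
}
```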
Figure 6: The workflow of MapReduce-based record linkage facilitated by sketch [5]
Figure 6 shows the overall design of the system, depicted as linking two data sets R and S.
The design is based on two rounds of MapReduce jobs. The profiling job analyzes the two input data
sets and produces a load estimation in the form of a sketch. This load estimation is passed to the map
tasks of the second MapReduce job (the comparison job), where the load balancing strategies are
applied to perform the actual record linkage.
For sketch-based data profiling, the FastAGMS sketch is used because it provides the most
accurate estimation of join sizes regardless of data skew. Two FastAGMS sketches are maintained to
estimate the workload, in terms of record-pair comparisons within each block, for data sets R and S.
The map tasks of the profiling job build local sketches from the input data of the two data sets. After
the map tasks complete, the local sketches are sent to one reducer, where they are combined into the
final sketch by entry-wise addition.
In the Cell Block division algorithm, the final sketch from the profiling job provides an
estimation of the record comparison workload, where each cell carries the estimated workload of
multiple blocks. A simple idea is to pack these cells into n partitions and assign each reducer its own
partition. However, some cells may have a workload larger than the average, so a division procedure
is required: large cells are divided into subcells, while the other cells are kept as single subcells.
In the Cell Range division algorithm, even though the Cell Block division algorithm divides
larger cells, variation in subcell sizes may still leave the reducers' workloads imbalanced. To
overcome this, a more sophisticated pair-based load balancing strategy is proposed that strives to
assign a uniform number of pairs to all reduce tasks. Each map task processes the sketch and can
therefore enumerate the workload per cell; each record pair in the sketch is labeled with a global
index, and all record pairs are divided into n equal-length ranges (see the example below).
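The precise Cell Range algorithm is in [5]; the example below only illustrates the stated idea of indexing the estimated record pairs globally via a prefix sum over per-cell workloads and cutting them into n equal-length ranges. The per-cell counts are hypothetical.

```java
// Illustration of the range idea: give every estimated record pair a
// global index via a prefix sum over cell workloads, then hand reducer
// r the pairs with index in [r*total/n, (r+1)*total/n).
public class CellRangeDemo {
    public static void main(String[] args) {
        long[] cellPairs = {40, 3, 3, 50, 4};   // hypothetical pairs per cell
        int n = 3;                               // number of reduce tasks
        long total = 0;
        for (long c : cellPairs) total += c;     // total = 100

        long start = 0;
        for (int i = 0; i < cellPairs.length; i++) {
            long end = start + cellPairs[i];     // cell i covers indexes [start, end)
            for (int r = 0; r < n; r++) {
                long rLo = r * total / n, rHi = (r + 1) * total / n;
                long lo = Math.max(start, rLo), hi = Math.min(end, rHi);
                if (lo < hi)                     // overlap => (sub)range of cell i to r
                    System.out.println("cell " + i + " pairs [" + lo + "," + hi
                            + ") -> reducer " + r);
            }
            start = end;
        }
        // Large cells are split across reducers; each reducer gets ~total/n pairs.
    }
}
```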
Benjamin Gufler et al. [6] present TopCluster, a sophisticated distributed monitoring
approach for load balancing in MapReduce systems. TopCluster requires a cluster threshold, which
controls the size of the local statistics that each mapper sends to the controller. The result is a global
histogram of (key, cardinality) pairs that approximates the cardinalities of the clusters with the most
frequent keys. This global histogram is used to estimate the partition cost; since it contains the
largest clusters, data skew is taken into account in the cost estimation.
The TopCluster algorithm computes an approximation of the global histogram for each
partition in three steps: (1) local histogram (mapper): a local histogram is maintained on each
mapper for every partition; (2) communication: when a mapper completes, it sends to the controller
(a) a presence indicator for all local clusters and (b) the histogram of the largest local clusters;
(3) global histogram (controller): the controller approximates the global histogram by (a) summing
the heads of the local histograms from all mappers and (b) computing a cardinality estimate for the
clusters that are not in the local histograms.
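The TopCluster approximation is specified in [6]; the sketch below mimics step 3 in a simplified form: sum the local top-k histograms, and for clusters a mapper reported as present but did not include in its histogram, add a crude cardinality estimate (here, half the mapper's smallest reported count; the paper's estimator differs). All names and data are illustrative.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Simplified controller-side merge in the spirit of TopCluster step 3:
// sum the local top-k histograms and estimate the missing clusters.
public class GlobalHistogram {
    record LocalStats(Set<String> present, Map<String, Long> topK) {}

    static Map<String, Long> merge(List<LocalStats> mappers) {
        Map<String, Long> global = new HashMap<>();
        for (LocalStats m : mappers) {
            // A cluster missing from the top-k is no larger than the
            // smallest reported one; use half of it as a rough estimate.
            long floor = m.topK().values().stream().min(Long::compare).orElse(0L);
            for (String key : m.present()) {
                long contribution = m.topK().getOrDefault(key, floor / 2);
                global.merge(key, contribution, Long::sum);
            }
        }
        return global;   // approximate (key, cardinality) pairs per partition
    }

    public static void main(String[] args) {
        LocalStats m1 = new LocalStats(Set.of("a", "b", "c"), Map.of("a", 90L, "b", 10L));
        LocalStats m2 = new LocalStats(Set.of("a", "c"), Map.of("a", 80L, "c", 12L));
        System.out.println(merge(List.of(m1, m2))); // 'a' dominates the estimate
    }
}
```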
1. Challenging Issues
Skew is a major obstacle to improving the performance of MapReduce applications, and it is
present in both the map phase and the reduce phase. Reduce skew (partition skew) is much more
challenging, as it requires an appropriate distribution of clusters to reduce tasks. An efficient and
scalable cluster partitioning scheme is required to overcome reduce skew and improve the
throughput of MapReduce applications.
2. Summary
MapReduce applications suffer degraded performance in the presence of skew in the input
data. Various authors have proposed solutions to this problem; we analyze them as follows:
LIBRA [1] supports the splitting of large key clusters. Performance evaluation on both
synthetic and real workloads demonstrates a significant improvement. LIBRA adjusts to
heterogeneous environments, and its overhead is minimal and negligible in the absence of skew.
MCP [2] shows good results: it can run jobs up to 39% faster and improve cluster throughput
by up to 44% compared to Hadoop release 0.21. It succeeds in minimizing skew by detecting
straggler nodes and also increases cluster throughput. However, MCP only considers speculative
execution at the end of a stage; addressing this is a possible direction for future work.
SkewTune [3] shows an improvement by a factor of 4 over Hadoop, and it is broadly
applicable since it makes no assumptions about the cause of skew. SkewTune preserves the order
and partitioning properties of the output. Its limitations are as follows:
1. It occupies more task slots than usual in order to run the skew mitigation tasks.
2. It cannot split individual large keys, because it does not sample any key frequency information;
therefore, as long as large keys are being gathered and processed, the system cannot rearrange
them.
3. It mitigates skew in the reduce phase but cannot improve the copy and sort phases, which may be
the performance bottleneck for some applications.
SkewReduce [4] improves execution speed by a factor of up to 8 over default Hadoop, and
runtime improves even when the user-provided cost functions are not perfectly accurate but are good
enough to identify the expensive regions of the data. The limitation of SkewReduce is the extra
burden it puts on the user, who must provide the cost functions.
Wei Yan et al. [5] show that their load balancing algorithms can decrease the overall job
completion time by 71.56% and 70.73% relative to the default settings in Hadoop, and that both
proposed algorithms deliver the same load balancing performance while requiring much less
memory.
TopCluster [6] has the advantage of being tailored to the characteristics of MapReduce
frameworks, especially in its communication behavior, and it provides a good basis for partition cost
estimation, which in turn is required for effective load balancing.
III. CONCLUSION AND FUTURE WORK
In this paper, we have surveyed common causes of skew in MapReduce applications and
various techniques to tackle or minimize skew so that jobs can be processed faster. We have also
summarized current gaps in the research. In future work, we will focus on solving skewed data
distributions on the mappers and consider analyzing more sophisticated statistics of the data
distribution in order to estimate the workload per partition more accurately.
REFERENCES
1. Q. Chen, J. Yao, and Z. Xiao, “LIBRA: Lightweight Data Skew Mitigation in MapReduce,”
IEEE Transactions on Parallel and Distributed Systems, vol. PP, no. 99, 2014.
2. Q. Chen, C. Liu, and Z. Xiao, “Improving mapreduce performance using smart speculative
execution strategy,” IEEE Transactions on Computers (TC), vol. 63, no. 4, 2014.
3. Y. Kwon, M. Balazinska, B. Howe, and J. Rolia, “Skewtune: Mitigating skew in mapreduce
applications,” in Proc. of the ACM SIGMOD International Conference on Management of
Data, 2012.
4. Y. Kwon, M. Balazinska, B. Howe, and J. Rolia. “Skew-resistant parallel processing of
feature-extracting scientific user-defined functions”. In Proc. of the First SOCC Conf., June
2010.
5. W. Yan, Y. Xue, and B. Malin, “Scalable load balancing for mapreduce-based record
linkage,” in Proc. IEEE 32nd International Performance Computing and Communications
Conference (IPCCC), 2013, pp. 1–10.
6. B. Gufler, N. Augsten, A. Reiser, and A. Kemper. “Load balancing in mapreduce based on
scalable cardinality estimates”. Proc. of ICDE'12, 2012.
7. J. Dean and S. Ghemawat, “Mapreduce: simplified data processing on large clusters,”
Commun. ACM, vol. 51, January 2008.
8. Y. Kwon, M. Balazinska, and B. Howe, “A study of skew in mapreduce applications,” in
Proc. of the Open Cirrus Summit, 2011.
9. J. Lin, “The curse of zipf and limits to parallelization: A look at the stragglers problem in
mapreduce,” in 7th Workshop on Large-Scale Distributed Systems for Information Retrieval
(LSDS-IR), 2009.
10. Tom White, Hadoop – The Definitive Guide, 2nd edition, O’Reilly, 2011.
11. Apache Hadoop. [Online]. Available: http://hadoop.apache.org/.
12. Apache Hadoop, Wikipedia. [Online]. Available: http://en.wikipedia.org/wiki/Apache_Hadoop.
13. Y. Kwon, M. Balazinska, B. Howe, and J. Rolia, “Skewtune: Mitigating skew in mapreduce
applications”. [Online]. Available: http://idb.snu.ac.kr/images/9/95/Seminar2014_SkewTune_Mitigating_Skew_in_MapReduce_Applications.pptx.
14. Rare Mile Technologies, “Hadoop Demystified,” blog.raremile.com. [Online]. Available:
http://blog.raremile.com/hadoop-demystified.
15. Nilay Narlawar and Ila Naresh Patil, “A Speedy Approach: User-Based Collaborative
Filtering with Mapreduce,” International Journal of Computer Engineering & Technology
(IJCET), Volume 5, Issue 5, 2014, pp. 32-39, ISSN Print: 0976-6367, ISSN Online: 0976-6375.
16. Suhas V. Ambade and Prof. Priya Deshpande, “Hadoop Block Placement Policy For
Different File Formats,” International Journal of Computer Engineering & Technology
(IJCET), Volume 5, Issue 12, 2014, pp. 249-256, ISSN Print: 0976-6367, ISSN Online: 0976-6375.
17. Gandhali Upadhye and Asst. Prof. Trupti Dange, “Nephele: Efficient Data Processing Using
Hadoop,” International Journal of Computer Engineering & Technology (IJCET), Volume 5,
Issue 7, 2014, pp. 11-16, ISSN Print: 0976-6367, ISSN Online: 0976-6375.

ATTRITION IN THE IT INDUSTRY DURING COVID-19 PANDEMIC: LINKING EMOTIONAL INTE...
 
INFLUENCE OF TALENT MANAGEMENT PRACTICES ON ORGANIZATIONAL PERFORMANCE A STUD...
INFLUENCE OF TALENT MANAGEMENT PRACTICES ON ORGANIZATIONAL PERFORMANCE A STUD...INFLUENCE OF TALENT MANAGEMENT PRACTICES ON ORGANIZATIONAL PERFORMANCE A STUD...
INFLUENCE OF TALENT MANAGEMENT PRACTICES ON ORGANIZATIONAL PERFORMANCE A STUD...
 
A STUDY OF VARIOUS TYPES OF LOANS OF SELECTED PUBLIC AND PRIVATE SECTOR BANKS...
A STUDY OF VARIOUS TYPES OF LOANS OF SELECTED PUBLIC AND PRIVATE SECTOR BANKS...A STUDY OF VARIOUS TYPES OF LOANS OF SELECTED PUBLIC AND PRIVATE SECTOR BANKS...
A STUDY OF VARIOUS TYPES OF LOANS OF SELECTED PUBLIC AND PRIVATE SECTOR BANKS...
 
EXPERIMENTAL STUDY OF MECHANICAL AND TRIBOLOGICAL RELATION OF NYLON/BaSO4 POL...
EXPERIMENTAL STUDY OF MECHANICAL AND TRIBOLOGICAL RELATION OF NYLON/BaSO4 POL...EXPERIMENTAL STUDY OF MECHANICAL AND TRIBOLOGICAL RELATION OF NYLON/BaSO4 POL...
EXPERIMENTAL STUDY OF MECHANICAL AND TRIBOLOGICAL RELATION OF NYLON/BaSO4 POL...
 
ROLE OF SOCIAL ENTREPRENEURSHIP IN RURAL DEVELOPMENT OF INDIA - PROBLEMS AND ...
ROLE OF SOCIAL ENTREPRENEURSHIP IN RURAL DEVELOPMENT OF INDIA - PROBLEMS AND ...ROLE OF SOCIAL ENTREPRENEURSHIP IN RURAL DEVELOPMENT OF INDIA - PROBLEMS AND ...
ROLE OF SOCIAL ENTREPRENEURSHIP IN RURAL DEVELOPMENT OF INDIA - PROBLEMS AND ...
 
OPTIMAL RECONFIGURATION OF POWER DISTRIBUTION RADIAL NETWORK USING HYBRID MET...
OPTIMAL RECONFIGURATION OF POWER DISTRIBUTION RADIAL NETWORK USING HYBRID MET...OPTIMAL RECONFIGURATION OF POWER DISTRIBUTION RADIAL NETWORK USING HYBRID MET...
OPTIMAL RECONFIGURATION OF POWER DISTRIBUTION RADIAL NETWORK USING HYBRID MET...
 
APPLICATION OF FRUGAL APPROACH FOR PRODUCTIVITY IMPROVEMENT - A CASE STUDY OF...
APPLICATION OF FRUGAL APPROACH FOR PRODUCTIVITY IMPROVEMENT - A CASE STUDY OF...APPLICATION OF FRUGAL APPROACH FOR PRODUCTIVITY IMPROVEMENT - A CASE STUDY OF...
APPLICATION OF FRUGAL APPROACH FOR PRODUCTIVITY IMPROVEMENT - A CASE STUDY OF...
 
A MULTIPLE – CHANNEL QUEUING MODELS ON FUZZY ENVIRONMENT
A MULTIPLE – CHANNEL QUEUING MODELS ON FUZZY ENVIRONMENTA MULTIPLE – CHANNEL QUEUING MODELS ON FUZZY ENVIRONMENT
A MULTIPLE – CHANNEL QUEUING MODELS ON FUZZY ENVIRONMENT
 

Recently uploaded

DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningLars Bell
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxLoriGlavin3
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
What is Artificial Intelligence?????????
What is Artificial Intelligence?????????What is Artificial Intelligence?????????
What is Artificial Intelligence?????????blackmambaettijean
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxLoriGlavin3
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxLoriGlavin3
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxLoriGlavin3
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
unit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxunit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxBkGupta21
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionDilum Bandara
 
Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rick Flair
 

Recently uploaded (20)

DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine Tuning
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptx
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
What is Artificial Intelligence?????????
What is Artificial Intelligence?????????What is Artificial Intelligence?????????
What is Artificial Intelligence?????????
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
unit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxunit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptx
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An Introduction
 
Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...
 

IJCET Survey on Load Balancing and Data Skew in MapReduce

A large cluster, by contrast, includes a dedicated Name Node, which manages the entire file-system namespace and metadata, and a secondary Name Node, which periodically takes snapshots of the Name Node state so that the cluster can recover from a Name Node failure with minimal loss of data. Similarly, in the MapReduce framework a standalone Job Tracker server manages job scheduling, while several Task Trackers run on the Data Nodes (slaves) [11], [8].

Over the last few years, MapReduce has become the standard paradigm for bulk workloads and a popular programming model for parallel data processing. MapReduce programs can be written in many languages, including Java, Ruby, Python and C++, and the Hadoop/MapReduce stack is used by major players such as Facebook, Google, Yahoo and Amazon. MapReduce divides processing into two phases, a map phase and a reduce phase; each phase takes (key, value) pairs as input and produces (key, value) pairs as output [8].
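As a concrete illustration of this (key, value) contract, the following is a minimal word-count sketch in Python. The function names and the local driver are illustrative only (Hadoop's real API is Java's Mapper/Reducer classes); the point is that the map phase emits intermediate (key, value) pairs and the reduce phase receives each key together with all of its values.

    from collections import defaultdict

    # Input and output of each phase are (key, value) pairs.
    def map_phase(offset, line):            # input pair: (byte offset, line)
        for word in line.split():
            yield (word, 1)                 # intermediate pair: (word, 1)

    def reduce_phase(word, counts):         # one key with all of its values
        yield (word, sum(counts))           # output pair: (word, total)

    def run(lines):                         # local driver mimicking the shuffle
        groups = defaultdict(list)
        for offset, line in enumerate(lines):
            for key, value in map_phase(offset, line):
                groups[key].append(value)   # shuffle: group values by key
        return [pair for key in groups for pair in reduce_phase(key, groups[key])]

    print(run(["to be or not to be"]))      # [('to', 2), ('be', 2), ('or', 1), ('not', 1)]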
Figure 2: Multi Node Hadoop Cluster [11]

To maximize MapReduce performance, we would ideally like all tasks to finish at around the same time, because the completion time of a MapReduce job is determined by its slowest running task. While MapReduce is a popular data processing tool, it still has several important limitations. In particular, skew is a significant challenge in many applications executed on this platform. When skew arises, some tasks in a job take significantly longer to process their input data than the remaining tasks, which slows down the entire computation.

1. Causes of Skew

Data skew can occur in both the map phase and the reduce phase. Map skew occurs when some input records are more difficult to process than others; it is rare and can easily be addressed by splitting map tasks. Lin [9] provides an application-specific solution that splits large, expensive records into smaller ones. Data skew in the reduce phase (also called reduce skew or partitioning skew) is much more challenging. Skew has many causes. The original MapReduce paper [7] considered skew caused by resource contention and hardware malfunction; to deal with this type of skew, MapReduce and Hadoop include a mechanism in which the last few tasks of a job are speculatively replicated on different machines, and the job completes when the fastest replicas of these final tasks complete.

There are two main causes of skew. The first is uneven distribution of data: the input provided to the map tasks, and the intermediate data derived from it, may be unevenly distributed in size and/or complexity.
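To see how an uneven key distribution turns into partition skew on the reduce side, consider Hadoop's default hash partitioning, which assigns a key to reducer hash(key) mod R. The sketch below uses hypothetical, Zipf-like data, with CRC32 standing in for the key's hash code.

    import zlib
    from collections import Counter

    R = 4                                   # number of reduce tasks
    # A Zipf-like key distribution: one hot key dominates (hypothetical data).
    keys = ["the"] * 900 + ["map"] * 60 + ["skew"] * 30 + ["libra"] * 10

    def partition(key, num_reducers):       # default-style hash partitioning
        return zlib.crc32(key.encode()) % num_reducers

    load = Counter(partition(k, R) for k in keys)
    print(dict(load))                       # one reducer gets at least 900 of 1,000 records
    # The job finishes only when that overloaded reducer finishes.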
Figure 3: Uneven distribution of data [13]

The second cause of skew is the presence of straggler nodes: a straggler is a node that takes considerably more time to process data than the other nodes [13].

Figure 4: Skew caused by a straggler node [13]

II. LITERATURE SURVEY

Qi Chen et al. [1] present LIBRA, a sampling and partitioning algorithm that balances the load on reduce tasks. The architecture of the LIBRA system is shown in Figure 5. Data skew mitigation in LIBRA consists of the following steps.

A small percentage of the map tasks is selected as sample tasks. Sample tasks are issued first, whenever free slots are available in the system; ordinary map tasks are issued only after all sample tasks have completed. The sample tasks collect statistics on the intermediate data during normal map processing and send the collected distribution to the master when they complete. The master aggregates all the sample information, derives an estimate of the overall data distribution, makes the partition decision, and notifies the slave nodes. When the slave nodes receive the partition decision, they partition the intermediate data accordingly, and map tasks that have already been issued partition their intermediate data in the same way. Reduce tasks need not wait for the completion of all map tasks: they can be issued as soon as the partition decision is ready.
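The following toy sketch conveys the idea behind sample-based partitioning, not LIBRA's actual algorithm: estimate the key-frequency distribution from a sample, then choose range boundaries over the sorted key space so that each reducer receives roughly equal estimated load. The data and reducer count are hypothetical.

    from collections import Counter

    def choose_boundaries(sample_keys, num_reducers):
        freq = Counter(sample_keys)
        ordered = sorted(freq)                       # keys in sort order
        total = sum(freq.values())
        target = total / num_reducers                # ideal load per reducer
        boundaries, acc = [], 0
        for key in ordered:
            acc += freq[key]
            if acc >= target and len(boundaries) < num_reducers - 1:
                boundaries.append(key)               # close the current range here
                acc = 0
        return boundaries                            # reducer i handles one key range

    sample = ["a"] * 50 + ["b"] * 5 + ["c"] * 5 + ["d"] * 40
    print(choose_boundaries(sample, 2))              # ['a']: the hot key alone is ~half the load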
Figure 5: LIBRA system architecture [1]

LIBRA's sampling method achieves a good estimate of the distribution of the whole original data set. By partitioning the intermediate data more evenly across reduce tasks, LIBRA significantly reduces the variability of job execution time; across various applications it yields up to a 4X performance improvement. LIBRA can be used in both homogeneous and heterogeneous environments, and its extra operating cost is minimal and negligible even in the absence of skew.

Q. Chen et al. [2] propose a new speculative execution strategy named Maximum Cost Performance (MCP). A machine that takes an unusually long time to complete a task is called a straggler; stragglers increase job execution time and drastically decrease cluster throughput. The common countermeasure, speculative execution, simply backs up a straggling task on an alternative node, and various speculative execution strategies improve performance in heterogeneous and homogeneous environments. These strategies, however, have drawbacks that degrade performance in cases where they do not work well, and MCP was introduced to extend the usefulness of speculative execution. In this strategy, cost refers to the computing resources occupied by tasks, while performance refers to the reduction in job execution time and the increase in cluster throughput. MCP aims at selecting straggler tasks accurately and rapidly and backing them up on appropriate worker nodes; it is scalable and performs well in both small and large clusters.

To identify stragglers accurately and promptly, MCP uses three methods (a toy sketch of the EWMA estimate in the second method follows this list):

1. It uses the progress rate and the process bandwidth within a phase to detect slow tasks.
2. It uses an exponentially weighted moving average (EWMA) to predict process speed and to estimate a task's remaining time.
3. It uses a cost-benefit model that takes the cluster load into account to decide which task to back up on another node.

To pick an appropriate worker node for the backup tasks, MCP takes both data skew and data locality into consideration.
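A minimal sketch of the EWMA-based remaining-time estimate is shown below; the smoothing factor and the progress-rate traces are hypothetical, and MCP's real predictor is more involved.

    # EWMA smoothing of a task's observed progress rates, used to predict
    # how long the task still needs. alpha and the traces are hypothetical.
    def ewma(rates, alpha=0.5):
        smoothed = rates[0]
        for r in rates[1:]:
            smoothed = alpha * r + (1 - alpha) * smoothed
        return smoothed

    def remaining_time(progress, observed_rates):
        rate = ewma(observed_rates)          # predicted progress per second
        return (1.0 - progress) / rate

    # A steady task vs. a task whose rate keeps degrading:
    print(remaining_time(0.5, [0.010, 0.011, 0.0100]))   # ~49 s left
    print(remaining_time(0.5, [0.010, 0.002, 0.0005]))   # ~154 s: a straggler worth backing up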
Y. Kwon et al. [3] designed SkewTune, which is API-compatible with Hadoop: it provides the same parallel job execution environment while adding the capability to detect and mitigate skew. SkewTune is designed for MapReduce-style systems in which each operator reads data from disk and writes data to disk and is thus decoupled from the other operators in the data flow. SkewTune continuously monitors the execution of a user-defined operation (UDO) and detects the current bottleneck task, the one that dominates and delays job completion. When such a task is identified, its processing is stopped and its remaining unprocessed input data is repartitioned among idle nodes. Repartitioning occurs only when an idle node is present or when new nodes are dynamically added to accelerate the job; thus, if no task experiences considerable data skew, SkewTune imposes no overhead.

If a node is idle and at least one task is expected to take a long time to complete, SkewTune stops the task with the longest expected remaining time. It repartitions that task's remaining input data across the idle nodes, taking into account the expected future availability of all nodes in the cluster, where availability is estimated from the progress of the running tasks. SkewTune continues to detect and remove bottleneck tasks until the job completes, and it concatenates the repartitioned outputs so that the final output is exactly the same as if no repartitioning had happened.
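The decision logic can be pictured with the following toy sketch (not SkewTune's actual code): when a node goes idle, stop the task with the longest expected remaining time, provided the remainder is worth repartitioning, and split its unprocessed input across the available nodes. The threshold and the equal-share split are simplifications; SkewTune weights the split by each node's predicted availability.

    # Toy SkewTune-style decision: pick the dominating task and split its
    # remaining input. Task IDs, times, and sizes below are hypothetical.
    def pick_task_to_split(tasks, min_remaining=30.0):
        # tasks: list of (task_id, expected_remaining_seconds)
        task_id, remaining = max(tasks, key=lambda t: t[1])
        return task_id if remaining >= min_remaining else None   # skip tiny tails

    def split_remaining(unprocessed_bytes, idle_nodes):
        share = unprocessed_bytes // len(idle_nodes)             # equal shares for simplicity
        return {node: share for node in idle_nodes}

    tasks = [("r1", 12.0), ("r2", 480.0), ("r3", 9.0)]
    victim = pick_task_to_split(tasks)                           # 'r2' dominates the job
    print(victim, split_remaining(600_000_000, ["n4", "n5", "n6"]))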
Y. Kwon et al. [4] developed SkewReduce, a static optimizer that takes a user-defined operation (UDO) as input and outputs a data partition plan that balances load among all processing nodes. SkewReduce balances not simply the amount of data assigned to each operator partition, but the expected processing time for the assigned data. While a static optimizer could be applied to different types of UDOs, SkewReduce was designed for feature-extraction applications, in which emergent features are extracted from data items embedded in a metric space. Such applications are captured by two UDOs: one that extracts features from subspaces and another that merges features that cross partition boundaries.

The key idea behind SkewReduce is to ask the user for extra information about their UDOs (the feature-extraction and merge functions): the user supplies cost functions that characterize the UDOs' runtimes. A cost function takes as input a sample of the input data, the sampling rate α that the sample corresponds to, and the bounding hypercube from which the sample was taken; it returns a real number that estimates the processing time for that data, and the sampling rate α lets it scale the estimate up to the overall dataset. The best cost function returns values proportional to the actual cost of processing the data partition bounded by the hypercube, and the quality of SkewReduce's plans depends on the quality of the cost functions. However, the authors found that even inaccurate cost functions help reduce runtimes compared to no optimization at all, and they assume that, for domain experts such as statisticians and scientists, writing a cost model is easier than debugging and optimizing distributed programs. The cost models are specific to the implementation and can be reused across many data sets for the same UDOs.

Given the cost models, a sample of the input data, and the cluster configuration (e.g., the number of nodes and the scheduling algorithm), SkewReduce searches for a good partition plan by (a) applying fine-grained data partitioning where considerable skew is expected, (b) applying coarse-grained partitioning where it is not, and (c) balancing the partition granularity so as to minimize the expected job completion time under the given cluster configuration. More specifically, the algorithm proceeds as follows. Starting from a single partition that corresponds to the entire hypercube bounding the input data, and thus also the data sample, the optimizer greedily splits the most expensive leaf partition in the current plan. It stops splitting when two conditions are met: (a) all partitions can be processed and merged without running out of memory, and (b) no further leaf split improves the runtime, i.e. further splitting a node would increase the expected runtime compared to the current plan. To check the second condition, the optimizer first verifies that the execution time of the original partition is greater than the expected execution time of the two sub-partitions running in parallel, followed by the newly added merge step that reconciles features found in the individual sub-partitions; it then verifies that the resulting tasks can actually be scheduled in the cluster in a manner that reduces the runtime, and it proceeds with the split only when both checks succeed.
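A toy version of this greedy loop is sketched below, with a hypothetical cost model standing in for the user's cost functions: keep splitting the most expensive partition while its two halves running in parallel, plus a merge step, are still cheaper than the unsplit partition.

    # Toy SkewReduce-style greedy planner (hypothetical cost model, not the
    # paper's optimizer): split the costliest partition while it pays off.
    import heapq

    def plan(root_cost, split, merge_cost, max_parts=64):
        # split(cost) -> (left_cost, right_cost): user-supplied estimate
        heap = [-root_cost]                       # max-heap of partition costs
        while len(heap) < max_parts:
            worst = -heap[0]
            left, right = split(worst)
            if max(left, right) + merge_cost >= worst:
                break                             # no further split helps
            heapq.heapreplace(heap, -left)        # replace worst with its halves
            heapq.heappush(heap, -right)
        return sorted(-c for c in heap)

    # Skewed data: one half of every split stays expensive (a dense region),
    # so that region keeps being subdivided until a split no longer pays off.
    print(plan(100.0, lambda c: (0.7 * c, 0.3 * c), merge_cost=5.0))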
Wei Yan et al. [5] present a scalable solution for load-balanced record linkage over the MapReduce framework. Their solution contains two low-memory load-balancing algorithms that work with sketch-based approximate data profiles, and it has two parts. The first part is a sketch-based profiling method that is (1) scalable in the number of blocks and the size of the input data and (2) efficient to construct, with each update taking constant time. The second part consists of the proposed Cell Block division and Cell Range division algorithms, which efficiently break up expensive blocks without losing any record pairs that need to be compared. A sketch here is a two-dimensional array of cells, each row indexed by a pairwise-independent hash function.

Figure 6: The workflow of MapReduce-based record linkage facilitated by sketch [5]

Figure 6 shows the overall design of the system, illustrated by the linkage of two data sets R and S. The design is based on two rounds of MapReduce jobs. The profiling job analyzes the two input data sets and produces a load estimate in the form of a sketch. This load-estimation information is passed by the map tasks into the second MapReduce job (comparison), where the load-balancing strategies are applied to perform the actual record linkage. For sketch-based data profiling, the FastAGMS sketch is used because it provides the most accurate estimate of join size regardless of data skew. Two FastAGMS sketches are maintained to estimate the workload, in terms of record-pair comparisons within each block, for data sets R and S. The map tasks of the profiling job build local sketches from the input data of the two data sets; after the map tasks complete, the local sketches are sent to a single reducer, which combines them into the final sketch by entry-wise addition.
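For illustration, the sketch below implements the same constant-time-update, entry-wise-combinable profiling idea using a count-min sketch rather than FastAGMS (FastAGMS additionally supports accurate join-size estimates, which this simpler structure does not). The class, parameters, and data are all hypothetical.

    # Minimal count-min sketch for approximate block-size profiling, shown
    # in place of FastAGMS for simplicity. Each update costs O(depth).
    import zlib

    class CountMin:
        def __init__(self, depth=4, width=1024):
            self.table = [[0] * width for _ in range(depth)]

        def _cells(self, key):
            for row in range(len(self.table)):
                # one independent-style hash per row (salted CRC)
                yield row, zlib.crc32(f"{row}:{key}".encode()) % len(self.table[0])

        def add(self, key, count=1):                 # constant-time update
            for row, col in self._cells(key):
                self.table[row][col] += count

        def estimate(self, key):                     # never underestimates
            return min(self.table[row][col] for row, col in self._cells(key))

    cm = CountMin()
    for blocking_key in ["smith"] * 5000 + ["patel"] * 40:
        cm.add(blocking_key)
    print(cm.estimate("smith"), cm.estimate("patel"))   # ~5000, ~40

Two such sketches built on different mappers can be combined by entry-wise addition of their tables, which is exactly the combining step the profiling reducer performs.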
In the Cell Block division algorithm, the final sketch produced by the profiling job provides an estimate of the record-comparison workload, where each cell carries the estimated workload of multiple blocks. A simple idea is to pack these cells into n partitions and assign each reducer its own partition. However, some cells may carry a workload larger than the average, so a division procedure splits such large cells into subcells, while every other cell is kept as a single subcell.

In the Cell Range division algorithm, the starting observation is that even though the Cell Block division algorithm divides large cells, reducer workloads may still be imbalanced because the subcells vary in size. To overcome this, a more sophisticated pair-based load-balancing strategy strives to generate a uniform number of pairs for all reduce tasks: each map task processes the sketch and can therefore enumerate the workload per cell, so every record pair in the sketch is labeled with a global index and all record pairs are divided into n equal-length ranges.

Benjamin Gufler et al. [6] present TopCluster, a sophisticated distributed monitoring approach for load balancing in MapReduce systems. TopCluster requires a cluster threshold, which controls the size of the local statistics sent from each mapper to the controller. The result is a global histogram of (key, cardinality) pairs that approximates the cardinalities of the clusters with the most frequent keys. This global histogram is used to estimate the partition cost; because it contains the largest clusters, data skew is taken into account in the cost estimation. The TopCluster algorithm computes an approximation of the global histogram for each partition in three steps (a toy rendition follows this list):

1. Local histogram (mapper): a local histogram is maintained on each mapper for each partition.
2. Communication: when a mapper completes, it sends to the controller (a) a presence indicator for all local clusters and (b) the histogram of the largest local clusters.
3. Global histogram (controller): the controller approximates the global histogram by (a) summing the heads of the local histograms from all mappers and (b) computing a cardinality estimate for all clusters that are not in the local histograms.
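A toy rendition of the three steps is given below; the local-histogram size k, the presence-based default estimate, and the data are all hypothetical simplifications of TopCluster's actual estimator.

    # Toy global-histogram approximation in the spirit of TopCluster: each
    # mapper reports its largest local clusters plus a presence indicator;
    # the controller sums the reported heads and substitutes a small default
    # for clusters that are present somewhere but fell below the threshold.
    from collections import Counter

    def local_stats(histogram, k=2):
        head = dict(Counter(histogram).most_common(k))     # largest local clusters
        return head, set(histogram)                        # head + presence indicator

    def global_histogram(all_stats, default=1):
        total, seen = Counter(), set()
        for head, present in all_stats:
            total.update(head)
            seen |= present
        for key in seen - set(total):                      # present but unreported
            total[key] = default                           # crude cardinality estimate
        return total

    m1 = local_stats({"a": 900, "b": 40, "c": 3})
    m2 = local_stats({"a": 700, "d": 55, "c": 2})
    print(global_histogram([m1, m2]))    # cluster 'a' (~1600) dominates the partition cost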
1. Challenging Issues

In MapReduce systems, skew is a major obstacle to application performance, and it arises in both the map phase and the reduce phase. Reduce skew (partition skew) is the more challenging of the two, since it requires an appropriate distribution of clusters to the reduce tasks. An efficient and scalable cluster-partitioning scheme is needed to overcome reduce skew and to improve the throughput of MapReduce applications.

2. Summary

Skew in the input data degrades the performance of MapReduce applications. Various authors have proposed solutions to this problem, which we analyze as follows.

LIBRA [1] supports large cluster splitting. Performance evaluation on both synthetic and real workloads demonstrates a significant improvement. LIBRA also adjusts for heterogeneous environments, and its overhead is minimal and negligible in the absence of skew.

MCP [2] shows good results: it can run jobs up to 39% faster and improve cluster throughput by up to 44% compared to Hadoop release 0.21. It succeeds in minimizing skew by detecting straggler nodes and increases cluster throughput. However, MCP only considers speculative execution at the end of a stage, which remains a direction for future work.

SkewTune [3] shows an improvement by a factor of 4 over stock Hadoop and is broadly applicable because it makes no assumptions about the cause of skew; it also preserves the order and partitioning properties of the output. Its limitations are as follows:
1. It occupies more task slots than usual in order to run the skew-mitigation tasks.
2. It cannot split individual large keys, because it does not sample any key-frequency information; as long as large keys are being gathered and processed, the system cannot redistribute them.
3. It mitigates skew in the reduce phase but cannot improve the copy and sort phases, which may be the performance bottleneck for some applications.

SkewReduce [4] improves execution speed by a factor of up to 8 over default Hadoop, and runtime improves even when the user-provided cost functions are not perfectly accurate, as long as they are good enough to identify the expensive regions of the data. Its limitation is the extra burden it places on the user, who must provide the cost functions.

The load-balancing algorithms of Wei Yan et al. [5] can decrease the overall job completion time by 71.56% and 70.73% relative to the default settings in Hadoop, and both algorithms achieve the same load-balancing performance while requiring much less memory.

TopCluster [6] is tailored to the characteristics of MapReduce frameworks, especially in its communication behavior, and provides a good basis for partition cost estimation, which in turn is required for effective load balancing.

III. CONCLUSION AND FUTURE WORK

In this paper, we have surveyed the common causes of skew in MapReduce applications and various techniques to mitigate or minimize that skew so that jobs can be processed faster, and we have summarized the current gaps in the research. In future work, we will focus on solving skewed data distributions on the mappers and consider analyzing more sophisticated statistics of the data distribution in order to estimate the workload per partition more accurately.

REFERENCES

1. Q. Chen, J. Yao, and Z. Xiao, "LIBRA: Lightweight Data Skew Mitigation in MapReduce," IEEE Transactions on Parallel and Distributed Systems, vol. PP, no. 99, 2014.
2. Q. Chen, C. Liu, and Z. Xiao, "Improving MapReduce performance using smart speculative execution strategy," IEEE Transactions on Computers, vol. 63, no. 4, 2014.
3. Y. Kwon, M. Balazinska, B. Howe, and J. Rolia, "SkewTune: Mitigating skew in MapReduce applications," in Proc. of the ACM SIGMOD International Conference on Management of Data, 2012.
4. Y. Kwon, M. Balazinska, B. Howe, and J. Rolia, "Skew-resistant parallel processing of feature-extracting scientific user-defined functions," in Proc. of the First ACM Symposium on Cloud Computing (SoCC), June 2010.
5. W. Yan, Y. Xue, and B. Malin, "Scalable load balancing for MapReduce-based record linkage," in Proc. of the 32nd IEEE International Performance Computing and Communications Conference (IPCCC), 2013, pp. 1-10.
6. B. Gufler, N. Augsten, A. Reiser, and A. Kemper, "Load balancing in MapReduce based on scalable cardinality estimates," in Proc. of ICDE, 2012.
7. J. Dean and S. Ghemawat, "MapReduce: simplified data processing on large clusters," Communications of the ACM, vol. 51, January 2008.
8. Y. Kwon, M. Balazinska, and B. Howe, "A study of skew in MapReduce applications," in Proc. of the Open Cirrus Summit, 2011.
9. J. Lin, "The curse of Zipf and limits to parallelization: A look at the stragglers problem in MapReduce," in 7th Workshop on Large-Scale Distributed Systems for Information Retrieval (LSDS-IR), 2009.
10. T. White, Hadoop: The Definitive Guide, 2nd edition, O'Reilly, 2011.
11. Apache Hadoop. [Online]. Available: http://hadoop.apache.org/
12. Apache Hadoop, Wikipedia. [Online]. Available: http://en.wikipedia.org/wiki/Apache_Hadoop
13. Y. Kwon, M. Balazinska, B. Howe, and J. Rolia, "SkewTune: Mitigating skew in MapReduce applications," presentation slides. [Online]. Available: http://idb.snu.ac.kr/images/9/95/Seminar2014_SkewTune_Mitigating_Skew_in_MapReduce_Applications.pptx
14. Rare Mile Technologies, "Hadoop Demystified," blog.raremile.com. [Online]. Available: http://blog.raremile.com/hadoop-demystified
15. N. Narlawar and I. N. Patil, "A Speedy Approach: User-Based Collaborative Filtering with MapReduce," International Journal of Computer Engineering & Technology (IJCET), vol. 5, no. 5, 2014, pp. 32-39.
16. S. V. Ambade and P. Deshpande, "Hadoop Block Placement Policy for Different File Formats," International Journal of Computer Engineering & Technology (IJCET), vol. 5, no. 12, 2014, pp. 249-256.
17. G. Upadhye and T. Dange, "Nephele: Efficient Data Processing Using Hadoop," International Journal of Computer Engineering & Technology (IJCET), vol. 5, no. 7, 2014, pp. 11-16.