Community Experience Distilled
Harness big data to provide meaningful insights, analytics,
and business intelligence for your financial institution
Hadoop for Finance Essentials
Rajiv Tiwari
With the exponential growth of data and many enterprises
crunching more and more data every day, Hadoop as a
data platform has gained a lot of popularity. Financial
businesses want to minimize risks and maximize
opportunities, and Hadoop, largely dominating the big data
market, plays a major role.
This book will get you started with the fundamentals of
big data and Hadoop, enabling you to get to grips with
solutions to many top financial big data use cases including
regulatory projects and fraud detection. It is packed with
industry references and code templates, and is designed
to walk you through a wide range of Hadoop components.
By the end of the book, you'll understand a few industry-leading
architecture patterns, big data governance, tips,
best practices, and standards to successfully develop
your own Hadoop-based solution.
Who this book is written for
This book is perfect for developers, analysts, architects, or
managers who would like to perform big data analytics with
Hadoop for the financial sector. It is also helpful for
technology professionals from other industry sectors
who have recently switched, or would like to switch,
to the financial domain. Familiarity with big data, Java
programming, databases and data warehouses, and business
intelligence would be beneficial.
$ 29.99 US
£ 19.99 UK
Prices do not include
local sales tax or VAT
where applicable
Rajiv Tiwari
What you will learn from this book
• Learn about big data and Hadoop fundamentals including practical finance use cases
• Walk through Hadoop-based finance projects with explanations of solutions, big data governance, and how to sustain Hadoop momentum
• Develop a range of solutions for small to large-scale data projects on the Hadoop platform
• Learn how to process big data in the cloud
• Present practical business cases to management to scale up existing platforms at enterprise level
Visit www.PacktPub.com for books, eBooks,
code, downloads, and PacktLib.
In this package, you will find:
• The author biography
• A preview chapter from the book, Chapter 3 'Hadoop in the Cloud'
• A synopsis of the book's content
• More information on Hadoop for Finance Essentials
About the Author
Rajiv Tiwari is a hands-on freelance big data architect with over 15 years of experience
across big data, data analytics, data governance, data architecture, data cleansing / data
integration, data warehousing, and business intelligence for banks and other financial
organizations.
He is an electronics engineering graduate from IIT, Varanasi, and has been working in
England for the past 10 years, mostly in the financial city of London, UK.
He has been using Hadoop since 2010, when Hadoop was in its infancy with regards to
the banking sector.
He is currently helping a tier 1 investment bank implement a large risk analytics project
on the Hadoop platform.
Rajiv can be contacted through his website or on Twitter.
Hadoop for Finance Essentials
Data has been increasing at an exponential rate, and organizations are either struggling to
cope or rushing to take advantage by analyzing it. Hadoop is an excellent open source
framework that addresses this big data problem.
I have used Hadoop within the financial sector for the last few years but could not find
any resource or book that explains the usage of Hadoop for finance use cases. The best
books I could find were, again, on Hadoop, Hive, or MapReduce patterns in general, with
examples that count words or Twitter messages in every possible way.
I have written this book with the objective of explaining the basic usage of Hadoop and
other products to tackle big data for finance use cases. I have touched on the majority
of use cases, taking a very practical approach.
What This Book Covers
Chapter 1, Big Data Overview, provides an overview of big data, its landscape, and its
technology evolution. It also introduces the Hadoop architecture, its components,
and its distributions. If you already know Hadoop, just skim through this chapter.
Chapter 2, Big Data in Financial Services, extends the big data overview from the
perspective of a financial organization. It will explain the story of the evolution of big
data in the financial sector, typical implementation challenges, and different finance use
cases with the help of relevant tools and technologies.
Chapter 3, Hadoop in the Cloud, provides an overview of big data in the cloud and a sample
portfolio risk simulation project with end-to-end data processing.
Chapter 4, Data Migration Using Hadoop, talks about the most popular project of
migrating historical trade data from traditional data sources to Hadoop.
Chapter 5, Getting Started, covers the implementation project of a very large enterprise
data platform to support various risk and regulatory requirements.
Chapter 6, Getting Experienced, gives an overview of real-time analytics and a sample
project to detect fraudulent transactions.
Chapter 7, Scale It Up, covers topics for scaling up the usage of Hadoop within your
organization, such as the enterprise data lake, lambda architecture, and data governance. It
also touches on a few more financial use cases with brief solutions.
Chapter 8, Sustain the Momentum, talks about the Hadoop distribution upgrade cycle
and wraps up the book with best practices and standards.
Hadoop in the Cloud
Hadoop in the cloud can be implemented with very low initial investment and
is well suited for proof of concepts and data systems with variable IT resource
requirements. In this chapter, I will discuss the story of Hadoop in the cloud
and how Hadoop can be implemented in the cloud for banks.
I will cover the full data life cycle of a risk simulation project using Hadoop
in the cloud:
• Data collection—ingesting the data into the cloud
• Data transformation—iterating simulations with the given algorithms
• Data analysis—analyzing the output results
I recommend you refer to your Hadoop cloud provider documentation if you need
to dive deeper.
The big data cloud story
In the last few years, cloud computing has grown significantly within banks as they
strive to improve the performance of their applications, increase agility, and most
importantly reduce their IT costs. As moving applications into the cloud reduces
the operational cost and IT complexity, it helps banks to focus on their core business
instead of spending resources on technology support.
The Hadoop-based big data platform is just like any other cloud computing
platform and a few financial organizations have implemented projects with
Hadoop in the cloud.
The why
As far as banks are concerned, especially investment banks, business fluctuates a lot
and is driven by the market. Fluctuating business means fluctuating trade volume
and variable IT resource requirements. As shown in the following figure, traditional
on-premise implementations will have a fixed number of servers for peak IT
capacity, but the actual IT capacity needs are variable:
(Figure: IT capacity versus time, contrasting fixed traditional IT capacity with variable actual IT needs.)
As shown in the following figure, if a bank plans to have more IT capacity than
maximum usage (a must for banks), there will be wastage, but if it plans for
IT capacity at the average of the required fluctuations, this will lead to processing
queues and customer dissatisfaction:
(Figure: capacity patterns, including On and Off, Fast Growth, Variable peaks, and Predictable peaks, showing waste when capacity is over-provisioned and customer dissatisfaction when it is under-provisioned.)
With cloud computing, financial organizations only pay for the IT capacity they use,
and that is the number-one reason for using Hadoop in the cloud: elastic capacity and
thus elastic pricing.
The second reason is proof of concept. For every financial institution, before the
adoption of Hadoop technologies, the big dilemma was, "Is it really worth it?" or
"Should I really spend on Hadoop hardware and software as it is still not completely
mature?" You can simply create Hadoop clusters within minutes, do a small proof
of concept, and validate the benefits. Then, either scale up your cloud with more use
cases or go on-premise if that is what you prefer.
The when
Have a look at the following questions. If you answer yes to any of these for your
big data problem, Hadoop in the cloud could be the way forward:
• Is your data operation very intensive but unpredictable?
• Do you want to do a small proof of concept without buying the hardware
and software up front?
• Do you want your operational expense to be very low or managed by
external vendors?
What's the catch?
If the cloud solves all big data problems, why isn't every bank implementing it?
• The biggest concern is—and will remain for the foreseeable future—the
security of the data in the cloud, especially customers' private data. The
moment senior managers think of security, they want to play safe and drop
the idea of implementing it on the cloud.
• Performance is still not as good as that of an on-premise installation. Disk
I/O is a bottleneck in virtual machine environments. Especially with mixed
workloads such as MapReduce and Spark running on the same cluster with several
concurrent users, you will feel a big performance impact.
• Once the data is in the cloud, vendors manage the day-to-day administrative
tasks, including operations. Implementing Hadoop in the cloud will therefore
merge the development and operations roles, which runs slightly against the
norm for how banks organize their departments.
In the next section, I will pick up one of the most popular use cases: implementing
Hadoop in the cloud for the risk division of a bank.
Project details – risk simulations in the cloud
Value at Risk (VaR) is a very effective method to calculate the financial risk of a
portfolio. Monte Carlo simulation is one of the methods used to estimate it, generating
the financial risk across a large number of computer-generated scenarios. The effectiveness
of this method depends on running as many scenarios as possible.
Currently, a bank runs a credit-risk Monte Carlo simulation to calculate the VaR,
using complex algorithms to simulate diverse risk scenarios and evaluate the
risk metrics of its clients. The simulation requires high computational power for
the millions of computer-generated scenarios; even with high-end computers, the
application takes 20–30 hours to run, which is both time-consuming and expensive.
Solution
For our illustration, I will use Amazon Web Services (AWS) with Elastic
MapReduce (EMR) and parallelize the Monte Carlo simulation using a MapReduce
model. Note, however, that it can be implemented on any Hadoop cloud platform.
The bank will upload the client portfolio data into cloud storage (S3), develop
MapReduce jobs from the existing algorithms, use additional on-demand EMR
nodes to execute the MapReduce jobs in parallel, write the results back to S3, and
release the EMR resources.
HDFS is automatically spread over data nodes. If you decommission
the nodes, the HDFS data on them will be lost. So always put your
persistent data on S3, not HDFS.
The current world
The bank loads the client portfolio data into the high-end risk data platform and
applies programming iterations in parallel for the configured number of iterations.
For each portfolio and iteration, they take the current asset price and apply the
following function with a variety of random variables:
$$\Delta S_t = S_t\left(\mu\,\Delta t + \sigma\,\varepsilon\,\sqrt{\Delta t}\right), \qquad \Delta S_t = S_{t+1} - S_t$$

where:
• $S_t$ is the asset price at time $t$
• $S_{t+1}$ is the asset price at time $t+1$
• $\mu$ is the mean of return on assets
• $\sigma$ is the volatility of the asset returns
• $\varepsilon$ is a random drawing from a standard normal distribution
The asset price will fluctuate for each iteration. The following is an example with 15
iterations when the starting price is 10€:
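As a complement, here is a minimal, self-contained Java sketch that generates such a 15-iteration path from a starting price of 10€. The drift, volatility, and time step values are illustrative assumptions, not figures from this project:

import java.util.Random;

public class PricePathDemo {
    public static void main(String[] args) {
        Random random = new Random();
        double price = 10.0;       // starting price of 10 euros, as in the example
        double mu = 0.05;          // assumed mean return on assets
        double sigma = 0.20;       // assumed volatility of returns
        double deltaT = 1.0 / 300; // assumed time step: one day of a 300-day horizon

        for (int i = 1; i <= 15; i++) {
            double epsilon = random.nextGaussian(); // random drawing from a standard normal
            // delta S_t = S_t * (mu * delta t + sigma * epsilon * sqrt(delta t))
            price += price * (mu * deltaT + sigma * epsilon * Math.sqrt(deltaT));
            System.out.printf("Iteration %2d: price = %.4f%n", i, price);
        }
    }
}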
For a large number of iterations, the asset price will follow a normal pattern. As
shown in the following figure, the value at risk at 99 percent is 0.409€, which is
defined as a 1 percent probability that the asset price will fall by more than 0.409€ after
300 days. So, if a client holds 100 units of the asset in their portfolio, the VaR for the
portfolio is 40.9€.
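As a sketch of how the 99 percent VaR can be read off such a simulation, the following Java snippet generates many terminal prices, sorts the losses, and takes the 99th percentile. All parameter values are illustrative assumptions:

import java.util.Arrays;
import java.util.Random;

public class VarPercentileDemo {
    public static void main(String[] args) {
        int iterations = 100_000;          // assumed number of Monte Carlo iterations
        double start = 10.0;               // starting price, as in the example
        double mu = 0.05, sigma = 0.20;    // assumed drift and volatility
        int days = 300;                    // 300-day horizon, as in the example
        double dt = 1.0 / days;
        Random random = new Random();

        double[] losses = new double[iterations];
        for (int i = 0; i < iterations; i++) {
            double price = start;
            for (int d = 0; d < days; d++) {
                price += price * (mu * dt + sigma * random.nextGaussian() * Math.sqrt(dt));
            }
            losses[i] = start - price; // positive value means the asset lost money
        }
        Arrays.sort(losses);
        // 99 percent VaR: the loss exceeded in only 1 percent of scenarios
        double var99 = losses[(int) (0.99 * iterations)];
        System.out.printf("99%% VaR per unit of the asset: %.4f%n", var99);
    }
}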
The results are only an estimate, and their accuracy improves with the square root of the
number of iterations, so running 100 times as many iterations makes the estimate 10 times
more accurate. The iterations could number anywhere from hundreds of thousands to millions,
and even with powerful and expensive computers, they could take more than 20 hours
to complete.
The target world
In summary, they will parallelize the processing using MapReduce and reduce the
processing time to less than an hour.
First, they will have to upload the client portfolio data into Amazon S3. Then they
will apply the same algorithm, but using MapReduce programs and with a very
large number of parallel iterations using Amazon EMR, and write back the results
to S3.
It is a classic example of elastic capacity—the customer data can be partitioned and
each partition can be processed independently. The execution time will drop almost
linearly with the number of parallel executions. They will spawn hundreds of nodes
to accommodate hundreds of iterations in parallel and release resources as soon as
the execution is complete.
The following diagram is courtesy of the AWS website. I recommend you visit
http://aws.amazon.com/elasticmapreduce/ for more details.
(Diagram: input data in Amazon S3 flows into an EMR cluster of EC2 nodes, and the results are written back to Amazon S3.)
Data collection
The data storage for this project is Amazon S3 (where S3 stands for Simple Storage
Service). It can store anything, has unlimited scalability, and has 99.999999999
percent durability.
If you have a little more money and want better performance, go for storage on:
• Amazon DynamoDB: This is a NoSQL database with unlimited scalability
and very low latency.
• Amazon Redshift: This is a relational parallel data warehouse that scales to
petabytes of data and should be used if performance is your top priority.
It will be even more expensive in comparison to DynamoDB, on the order
of $1,000/TB/year.
Configuring the Hadoop cluster
Please visit http://docs.aws.amazon.com/ElasticMapReduce/latest/
DeveloperGuide/emr-what-is-emr.html for full documentation and relevant
screenshots.
An Amazon Elastic Compute Cloud (EC2) instance is a single data processing node. Amazon
Elastic MapReduce (EMR) is a fully managed cluster of EC2 processing nodes that uses
the Hadoop framework. Basically, the configuration steps are:
1. Sign up for an account with Amazon.
2. Create a Hadoop cluster with the default Amazon distribution.
3. Configure the EC2 nodes with high memory and CPU configuration,
as the risk simulations will be very memory-intensive operations.
4. Configure your user role and the security associated with it.
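If you prefer code to the console, the following is a minimal sketch of cluster creation using the AWS SDK for Java (v1). The release label, instance type, instance count, and role names are assumptions you would adapt to your own account:

import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduce;
import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduceClientBuilder;
import com.amazonaws.services.elasticmapreduce.model.Application;
import com.amazonaws.services.elasticmapreduce.model.JobFlowInstancesConfig;
import com.amazonaws.services.elasticmapreduce.model.RunJobFlowRequest;
import com.amazonaws.services.elasticmapreduce.model.RunJobFlowResult;

public class CreateEmrCluster {
    public static void main(String[] args) {
        AmazonElasticMapReduce emr = AmazonElasticMapReduceClientBuilder.defaultClient();
        RunJobFlowRequest request = new RunJobFlowRequest()
            .withName("var-risk-simulation")        // hypothetical cluster name
            .withReleaseLabel("emr-5.36.0")         // assumed EMR release
            .withApplications(new Application().withName("Hadoop"))
            .withServiceRole("EMR_DefaultRole")     // assumed default EMR roles
            .withJobFlowRole("EMR_EC2_DefaultRole")
            .withInstances(new JobFlowInstancesConfig()
                .withMasterInstanceType("m5.xlarge") // high memory/CPU, as the simulations demand
                .withSlaveInstanceType("m5.xlarge")
                .withInstanceCount(10)               // master plus nine core/task nodes
                .withKeepJobFlowAliveWhenNoSteps(true));
        RunJobFlowResult result = emr.runJobFlow(request);
        System.out.println("Cluster ID: " + result.getJobFlowId());
    }
}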
Data upload
Now you have to upload the client portfolio and parameter data into Amazon S3
as follows:
1. Create an input bucket on Amazon S3, which is like a directory and must have
a unique name, something like <organization name + project name + input>.
2. Upload the source files over a secure corporate Internet connection.
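For a scripted upload, a minimal sketch with the AWS SDK for Java (v1) follows; the bucket and file names are hypothetical placeholders for the naming convention above:

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import java.io.File;

public class UploadPortfolio {
    public static void main(String[] args) {
        AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();
        String bucket = "mybank-varproject-input"; // hypothetical <organization + project + input> name
        s3.createBucket(bucket);                   // bucket names must be globally unique
        s3.putObject(bucket, "portfolio/portfolio.csv", new File("portfolio.csv"));
        s3.putObject(bucket, "portfolio/parameters.csv", new File("parameters.csv"));
    }
}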
I recommend you use one of the two Amazon data transfer services, AWS
Import/Export and AWS Direct Connect, if there is any opportunity to do so.
The AWS Import/Export service includes:
• Export the data, in the format Amazon specifies, onto a portable storage device (hard
disk, CD, and so on) and ship it to Amazon.
• Amazon imports the data into S3 using its high-speed internal network and
sends the portable storage device back to you.
• The process takes 5–6 days and is recommended only for an initial large data
load, not an incremental load.
• The guideline is simple: calculate your data size and network bandwidth.
If uploading over the network would take on the order of weeks or months,
you are better off using this service instead.
Chapter 3
[ 55 ]
The AWS Direct Connect service includes:
• Establish a dedicated network connection from your on-premise data center
to AWS, at anything from 1 Gbps to 10 Gbps
• Use this service if you need to import/export large volumes of data in and
out of the Amazon cloud on a day-to-day basis
Data transformation
Rewrite the existing simulation programs into Map and Reduce programs and
upload them into S3. The functional logic will remain the same; you just need
to rewrite the code using the MapReduce framework, as shown in the following
template, and compile it as MapReduce-0.0.1-VarRiskSimulationAWS.jar.
The mapper logic splits the client portfolio data into partitions and applies the iterative
simulations to each partition. The reducer logic aggregates the mapper results into the
portfolio value and risk.
package com.hadoop.Var.MonteCarlo;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class VarMonteCarlo {
    public static void main(String[] args) throws Exception {
        if (args.length < 2) {
            System.err.println("Usage: VarMonteCarlo <input path> <output path>");
            System.exit(-1);
        }
        Configuration conf = new Configuration();
        Job job = new Job(conf, "VaR calculation");
        job.setJarByClass(VarMonteCarlo.class);
        job.setMapperClass(VarMonteCarloMapper.class);
        job.setReducerClass(VarMonteCarloReducer.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);
        job.setOutputKeyClass(Text.class);
        // RiskArray is a custom Writable that holds the aggregated simulation results
        job.setOutputValueClass(RiskArray.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        job.waitForCompletion(true);
    }

    public static class VarMonteCarloMapper
            extends Mapper<LongWritable, Text, Text, Text> {
        // Implement your algorithm here: split the client portfolio data into
        // partitions and apply the iterative simulations to each partition
    }

    public static class VarMonteCarloReducer
            extends Reducer<Text, Text, Text, RiskArray> {
        // Implement your algorithm here: aggregate the mapper results into
        // the portfolio value and risk
    }
}
Once the Map and Reduce code is developed, please follow these steps:
1. Create an output bucket on Amazon S3, which is like a directory and
must have a unique name, something like <organization name + project
name + results>.
2. Create a new job workflow using the following parameters:
    • Input Location: the S3 bucket directory with the client portfolio data files
    • Output Location: the S3 bucket directory to write the simulation results to
    • Mapper: set the textbox to java -classpath MapReduce-0.0.1-VarRiskSimulationAWS.jar com.hadoop.Var.MonteCarlo.VarMonteCarloMapper
    • Reducer: set the textbox to java -classpath MapReduce-0.0.1-VarRiskSimulationAWS.jar com.hadoop.Var.MonteCarlo.VarMonteCarloReducer
    • Master EC2 instance: select a larger instance type
    • Core EC2 instances: select larger instance types with a lower count
    • Task EC2 instances: select larger instance types with a very high count, in line with the number of risk simulation iterations
3. Execute the job workflow and monitor the progress.
4. The job is expected to complete much faster and should be done in less
than an hour.
5. The simulation results are written to the output S3 bucket.
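The same workflow can also be submitted programmatically. The following is a minimal sketch using the AWS SDK for Java (v1) that adds the compiled JAR as a custom step; the cluster ID and bucket names are hypothetical:

import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduce;
import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduceClientBuilder;
import com.amazonaws.services.elasticmapreduce.model.AddJobFlowStepsRequest;
import com.amazonaws.services.elasticmapreduce.model.HadoopJarStepConfig;
import com.amazonaws.services.elasticmapreduce.model.StepConfig;

public class SubmitVarStep {
    public static void main(String[] args) {
        AmazonElasticMapReduce emr = AmazonElasticMapReduceClientBuilder.defaultClient();
        HadoopJarStepConfig jarStep = new HadoopJarStepConfig()
            .withJar("s3://mybank-varproject-input/MapReduce-0.0.1-VarRiskSimulationAWS.jar")
            .withMainClass("com.hadoop.Var.MonteCarlo.VarMonteCarlo")
            .withArgs("s3://mybank-varproject-input/portfolio/",  // input location
                      "s3://mybank-varproject-results/output/");  // output location
        AddJobFlowStepsRequest request = new AddJobFlowStepsRequest()
            .withJobFlowId("j-XXXXXXXXXXXXX")                     // the cluster ID returned at creation
            .withSteps(new StepConfig("var-simulation", jarStep)
                .withActionOnFailure("CONTINUE"));
        emr.addJobFlowSteps(request);
    }
}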
Data analysis
You can download the simulation results from the Amazon S3 output bucket for
further analysis with local tools.
In this case, you should be able to simply download the data locally, as the result
volume may be relatively low.
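A minimal sketch of that download with the AWS SDK for Java (v1), assuming the hypothetical output bucket and key prefix used earlier:

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.GetObjectRequest;
import com.amazonaws.services.s3.model.S3ObjectSummary;
import java.io.File;

public class DownloadResults {
    public static void main(String[] args) {
        AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();
        String bucket = "mybank-varproject-results"; // hypothetical output bucket
        // listObjects returns up to 1,000 keys per call, enough for a small result set
        for (S3ObjectSummary summary : s3.listObjects(bucket, "output/").getObjectSummaries()) {
            File target = new File("results/" + new File(summary.getKey()).getName());
            target.getParentFile().mkdirs();
            s3.getObject(new GetObjectRequest(bucket, summary.getKey()), target);
        }
    }
}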
Summary
In this chapter, we learned how and when big data can be processed in the cloud,
right from configuration, collection, and transformation to the analysis of data.
Currently, Hadoop in the cloud is not used much in banks due to a few concerns
about data security and performance. However, that is debatable.
For the rest of this book, I will discuss projects using on-premise Hadoop
implementations only.
In the next chapter, I will pick up a medium-scale on-premise Hadoop project
and see it in a little more detail.
Where to buy this book
You can buy Hadoop for Finance Essentials from the Packt Publishing website.
Alternatively, you can buy the book from Amazon, BN.com, Computer Manuals, and most Internet
book retailers.
Click here for ordering and shipping details.
www.PacktPub.com
Stay Connected:
Get more information on Hadoop for Finance Essentials

More Related Content

What's hot

The principles of the business data lake
The principles of the business data lakeThe principles of the business data lake
The principles of the business data lakeCapgemini
 
2013 International Conference on Knowledge, Innovation and Enterprise Presen...
2013  International Conference on Knowledge, Innovation and Enterprise Presen...2013  International Conference on Knowledge, Innovation and Enterprise Presen...
2013 International Conference on Knowledge, Innovation and Enterprise Presen...oj08
 
Traditional BI vs. Business Data Lake – A Comparison
Traditional BI vs. Business Data Lake – A ComparisonTraditional BI vs. Business Data Lake – A Comparison
Traditional BI vs. Business Data Lake – A ComparisonCapgemini
 
Manipulating data with Talend. Learn how?
Manipulating data with Talend. Learn how?Manipulating data with Talend. Learn how?
Manipulating data with Talend. Learn how?Edureka!
 
"Industrializing Machine Learning – How to Integrate ML in Existing Businesse...
"Industrializing Machine Learning – How to Integrate ML in Existing Businesse..."Industrializing Machine Learning – How to Integrate ML in Existing Businesse...
"Industrializing Machine Learning – How to Integrate ML in Existing Businesse...Dataconomy Media
 
Data science by john d. kelleher, brendan tierney (z lib.org)
Data science by john d. kelleher, brendan tierney (z lib.org)Data science by john d. kelleher, brendan tierney (z lib.org)
Data science by john d. kelleher, brendan tierney (z lib.org)Tayab Memon
 
Bitkom Cray presentation - on HPC affecting big data analytics in FS
Bitkom Cray presentation - on HPC affecting big data analytics in FSBitkom Cray presentation - on HPC affecting big data analytics in FS
Bitkom Cray presentation - on HPC affecting big data analytics in FSPhilip Filleul
 
Big data analysis concepts and references by Cloud Security Alliance
Big data analysis concepts and references by Cloud Security AllianceBig data analysis concepts and references by Cloud Security Alliance
Big data analysis concepts and references by Cloud Security AllianceInformation Security Awareness Group
 
Architecting for Big Data - Gartner Innovation Peer Forum Sept 2011
Architecting for Big Data - Gartner Innovation Peer Forum Sept 2011Architecting for Big Data - Gartner Innovation Peer Forum Sept 2011
Architecting for Big Data - Gartner Innovation Peer Forum Sept 2011Jonathan Seidman
 
Big data: Descoberta de conhecimento em ambientes de big data e computação na...
Big data: Descoberta de conhecimento em ambientes de big data e computação na...Big data: Descoberta de conhecimento em ambientes de big data e computação na...
Big data: Descoberta de conhecimento em ambientes de big data e computação na...Rio Info
 
Big Data and Enterprise Data - Oracle -1663869
Big Data and Enterprise Data - Oracle -1663869Big Data and Enterprise Data - Oracle -1663869
Big Data and Enterprise Data - Oracle -1663869Edgar Alejandro Villegas
 
Big Data & Analytics Architecture
Big Data & Analytics ArchitectureBig Data & Analytics Architecture
Big Data & Analytics ArchitectureArvind Sathi
 
Unlocking data science in the enterprise - with Oracle and Cloudera
Unlocking data science in the enterprise - with Oracle and ClouderaUnlocking data science in the enterprise - with Oracle and Cloudera
Unlocking data science in the enterprise - with Oracle and ClouderaCloudera, Inc.
 
Sudhir hadoop and Data warehousing resume
Sudhir hadoop and Data warehousing resume Sudhir hadoop and Data warehousing resume
Sudhir hadoop and Data warehousing resume Sudhir Saxena
 
Protecting data privacy in analytics and machine learning ISACA London UK
Protecting data privacy in analytics and machine learning ISACA London UKProtecting data privacy in analytics and machine learning ISACA London UK
Protecting data privacy in analytics and machine learning ISACA London UKUlf Mattsson
 
Business intelligence 3.0 and the data lake
Business intelligence 3.0 and the data lakeBusiness intelligence 3.0 and the data lake
Business intelligence 3.0 and the data lakeData Science Thailand
 
Dedup with hadoop
Dedup with hadoopDedup with hadoop
Dedup with hadoopNeeta Pande
 
Introduction to Machine Learning with Azure & Databricks
Introduction to Machine Learning with Azure & DatabricksIntroduction to Machine Learning with Azure & Databricks
Introduction to Machine Learning with Azure & DatabricksCCG
 

What's hot (20)

The principles of the business data lake
The principles of the business data lakeThe principles of the business data lake
The principles of the business data lake
 
2013 International Conference on Knowledge, Innovation and Enterprise Presen...
2013  International Conference on Knowledge, Innovation and Enterprise Presen...2013  International Conference on Knowledge, Innovation and Enterprise Presen...
2013 International Conference on Knowledge, Innovation and Enterprise Presen...
 
Traditional BI vs. Business Data Lake – A Comparison
Traditional BI vs. Business Data Lake – A ComparisonTraditional BI vs. Business Data Lake – A Comparison
Traditional BI vs. Business Data Lake – A Comparison
 
Manipulating data with Talend. Learn how?
Manipulating data with Talend. Learn how?Manipulating data with Talend. Learn how?
Manipulating data with Talend. Learn how?
 
"Industrializing Machine Learning – How to Integrate ML in Existing Businesse...
"Industrializing Machine Learning – How to Integrate ML in Existing Businesse..."Industrializing Machine Learning – How to Integrate ML in Existing Businesse...
"Industrializing Machine Learning – How to Integrate ML in Existing Businesse...
 
Hadoop(Term Paper)
Hadoop(Term Paper)Hadoop(Term Paper)
Hadoop(Term Paper)
 
Data science by john d. kelleher, brendan tierney (z lib.org)
Data science by john d. kelleher, brendan tierney (z lib.org)Data science by john d. kelleher, brendan tierney (z lib.org)
Data science by john d. kelleher, brendan tierney (z lib.org)
 
Bitkom Cray presentation - on HPC affecting big data analytics in FS
Bitkom Cray presentation - on HPC affecting big data analytics in FSBitkom Cray presentation - on HPC affecting big data analytics in FS
Bitkom Cray presentation - on HPC affecting big data analytics in FS
 
Big data analysis concepts and references by Cloud Security Alliance
Big data analysis concepts and references by Cloud Security AllianceBig data analysis concepts and references by Cloud Security Alliance
Big data analysis concepts and references by Cloud Security Alliance
 
Architecting for Big Data - Gartner Innovation Peer Forum Sept 2011
Architecting for Big Data - Gartner Innovation Peer Forum Sept 2011Architecting for Big Data - Gartner Innovation Peer Forum Sept 2011
Architecting for Big Data - Gartner Innovation Peer Forum Sept 2011
 
Big data: Descoberta de conhecimento em ambientes de big data e computação na...
Big data: Descoberta de conhecimento em ambientes de big data e computação na...Big data: Descoberta de conhecimento em ambientes de big data e computação na...
Big data: Descoberta de conhecimento em ambientes de big data e computação na...
 
Big Data and Enterprise Data - Oracle -1663869
Big Data and Enterprise Data - Oracle -1663869Big Data and Enterprise Data - Oracle -1663869
Big Data and Enterprise Data - Oracle -1663869
 
Big Data & Analytics Architecture
Big Data & Analytics ArchitectureBig Data & Analytics Architecture
Big Data & Analytics Architecture
 
Unlocking data science in the enterprise - with Oracle and Cloudera
Unlocking data science in the enterprise - with Oracle and ClouderaUnlocking data science in the enterprise - with Oracle and Cloudera
Unlocking data science in the enterprise - with Oracle and Cloudera
 
Sudhir hadoop and Data warehousing resume
Sudhir hadoop and Data warehousing resume Sudhir hadoop and Data warehousing resume
Sudhir hadoop and Data warehousing resume
 
Protecting data privacy in analytics and machine learning ISACA London UK
Protecting data privacy in analytics and machine learning ISACA London UKProtecting data privacy in analytics and machine learning ISACA London UK
Protecting data privacy in analytics and machine learning ISACA London UK
 
Business intelligence 3.0 and the data lake
Business intelligence 3.0 and the data lakeBusiness intelligence 3.0 and the data lake
Business intelligence 3.0 and the data lake
 
Hadoop in the Cloud
Hadoop in the CloudHadoop in the Cloud
Hadoop in the Cloud
 
Dedup with hadoop
Dedup with hadoopDedup with hadoop
Dedup with hadoop
 
Introduction to Machine Learning with Azure & Databricks
Introduction to Machine Learning with Azure & DatabricksIntroduction to Machine Learning with Azure & Databricks
Introduction to Machine Learning with Azure & Databricks
 

Similar to Hadoop for Finance - sample chapter

The Big Picture on Big Data and Cognos
The Big Picture on Big Data and CognosThe Big Picture on Big Data and Cognos
The Big Picture on Big Data and CognosSenturus
 
Big data an elephant business opportunities
Big data an elephant   business opportunitiesBig data an elephant   business opportunities
Big data an elephant business opportunitiesBigdata Meetup Kochi
 
Learn About Big Data and Hadoop The Most Significant Resource
Learn About Big Data and Hadoop The Most Significant ResourceLearn About Big Data and Hadoop The Most Significant Resource
Learn About Big Data and Hadoop The Most Significant ResourceAssignment Help
 
Hadoop: Data Storage Locker or Agile Analytics Platform? It’s Up to You.
Hadoop: Data Storage Locker or Agile Analytics Platform? It’s Up to You.Hadoop: Data Storage Locker or Agile Analytics Platform? It’s Up to You.
Hadoop: Data Storage Locker or Agile Analytics Platform? It’s Up to You.Jennifer Walker
 
Rajesh Angadi Brochure
Rajesh Angadi Brochure Rajesh Angadi Brochure
Rajesh Angadi Brochure Rajesh Angadi
 
Big data and you
Big data and you Big data and you
Big data and you IBM
 
Building a Big Data Analytics Platform- Impetus White Paper
Building a Big Data Analytics Platform- Impetus White PaperBuilding a Big Data Analytics Platform- Impetus White Paper
Building a Big Data Analytics Platform- Impetus White PaperImpetus Technologies
 
Bigdata and hadoop
Bigdata and hadoopBigdata and hadoop
Bigdata and hadoopRamyaG50
 
Bigdata and hadoop
Bigdata and hadoopBigdata and hadoop
Bigdata and hadoopRamyaG50
 
Big Data Tools: A Deep Dive into Essential Tools
Big Data Tools: A Deep Dive into Essential ToolsBig Data Tools: A Deep Dive into Essential Tools
Big Data Tools: A Deep Dive into Essential ToolsFredReynolds2
 
Hadoop Twelve Predictions for 2012
Hadoop Twelve Predictions for 2012Hadoop Twelve Predictions for 2012
Hadoop Twelve Predictions for 2012Cloudera, Inc.
 
Non geeks-big-data-playbook-106947
Non geeks-big-data-playbook-106947Non geeks-big-data-playbook-106947
Non geeks-big-data-playbook-106947CMR WORLD TECH
 
Non-geek's big data playbook - Hadoop & EDW - SAS Best Practices
Non-geek's big data playbook - Hadoop & EDW - SAS Best PracticesNon-geek's big data playbook - Hadoop & EDW - SAS Best Practices
Non-geek's big data playbook - Hadoop & EDW - SAS Best PracticesJyrki Määttä
 
Open Source Ecosystem Future of Enterprise IT
Open Source Ecosystem Future of Enterprise ITOpen Source Ecosystem Future of Enterprise IT
Open Source Ecosystem Future of Enterprise ITandreas kuncoro
 
BIG Data & Hadoop Applications in Social Media
BIG Data & Hadoop Applications in Social MediaBIG Data & Hadoop Applications in Social Media
BIG Data & Hadoop Applications in Social MediaSkillspeed
 
Introduction-to-Big-Data-and-Hadoop.pptx
Introduction-to-Big-Data-and-Hadoop.pptxIntroduction-to-Big-Data-and-Hadoop.pptx
Introduction-to-Big-Data-and-Hadoop.pptxPratimakumari213460
 

Similar to Hadoop for Finance - sample chapter (20)

The Big Picture on Big Data and Cognos
The Big Picture on Big Data and CognosThe Big Picture on Big Data and Cognos
The Big Picture on Big Data and Cognos
 
Big data an elephant business opportunities
Big data an elephant   business opportunitiesBig data an elephant   business opportunities
Big data an elephant business opportunities
 
Learn About Big Data and Hadoop The Most Significant Resource
Learn About Big Data and Hadoop The Most Significant ResourceLearn About Big Data and Hadoop The Most Significant Resource
Learn About Big Data and Hadoop The Most Significant Resource
 
Hadoop Overview
Hadoop OverviewHadoop Overview
Hadoop Overview
 
Hadoop: Data Storage Locker or Agile Analytics Platform? It’s Up to You.
Hadoop: Data Storage Locker or Agile Analytics Platform? It’s Up to You.Hadoop: Data Storage Locker or Agile Analytics Platform? It’s Up to You.
Hadoop: Data Storage Locker or Agile Analytics Platform? It’s Up to You.
 
Rajesh Angadi Brochure
Rajesh Angadi Brochure Rajesh Angadi Brochure
Rajesh Angadi Brochure
 
Big data and you
Big data and you Big data and you
Big data and you
 
Big Data
Big DataBig Data
Big Data
 
Building a Big Data Analytics Platform- Impetus White Paper
Building a Big Data Analytics Platform- Impetus White PaperBuilding a Big Data Analytics Platform- Impetus White Paper
Building a Big Data Analytics Platform- Impetus White Paper
 
Bigdata and hadoop
Bigdata and hadoopBigdata and hadoop
Bigdata and hadoop
 
Bigdata and hadoop
Bigdata and hadoopBigdata and hadoop
Bigdata and hadoop
 
Big Data Tools: A Deep Dive into Essential Tools
Big Data Tools: A Deep Dive into Essential ToolsBig Data Tools: A Deep Dive into Essential Tools
Big Data Tools: A Deep Dive into Essential Tools
 
Hadoop Twelve Predictions for 2012
Hadoop Twelve Predictions for 2012Hadoop Twelve Predictions for 2012
Hadoop Twelve Predictions for 2012
 
Non geeks-big-data-playbook-106947
Non geeks-big-data-playbook-106947Non geeks-big-data-playbook-106947
Non geeks-big-data-playbook-106947
 
Non-geek's big data playbook - Hadoop & EDW - SAS Best Practices
Non-geek's big data playbook - Hadoop & EDW - SAS Best PracticesNon-geek's big data playbook - Hadoop & EDW - SAS Best Practices
Non-geek's big data playbook - Hadoop & EDW - SAS Best Practices
 
Open Source Ecosystem Future of Enterprise IT
Open Source Ecosystem Future of Enterprise ITOpen Source Ecosystem Future of Enterprise IT
Open Source Ecosystem Future of Enterprise IT
 
BIG Data & Hadoop Applications in Social Media
BIG Data & Hadoop Applications in Social MediaBIG Data & Hadoop Applications in Social Media
BIG Data & Hadoop Applications in Social Media
 
Big Data
Big DataBig Data
Big Data
 
View on big data technologies
View on big data technologiesView on big data technologies
View on big data technologies
 
Introduction-to-Big-Data-and-Hadoop.pptx
Introduction-to-Big-Data-and-Hadoop.pptxIntroduction-to-Big-Data-and-Hadoop.pptx
Introduction-to-Big-Data-and-Hadoop.pptx
 

Recently uploaded

Semantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxSemantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxMike Bennett
 
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...ssuserf63bd7
 
Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Colleen Farrelly
 
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...Florian Roscheck
 
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理e4aez8ss
 
Defining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryDefining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryJeremy Anderson
 
Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...Seán Kennedy
 
Top 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In QueensTop 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In Queensdataanalyticsqueen03
 
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...Amil Baba Dawood bangali
 
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改yuu sss
 
RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.natarajan8993
 
DBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfDBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfJohn Sterrett
 
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样vhwb25kk
 
Heart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectHeart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectBoston Institute of Analytics
 
RadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfRadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfgstagge
 
Learn How Data Science Changes Our World
Learn How Data Science Changes Our WorldLearn How Data Science Changes Our World
Learn How Data Science Changes Our WorldEduminds Learning
 
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...Biometric Authentication: The Evolution, Applications, Benefits and Challenge...
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...GQ Research
 
Identifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanIdentifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanMYRABACSAFRA2
 
Multiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdfMultiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdfchwongval
 

Recently uploaded (20)

Semantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxSemantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptx
 
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
 
Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024
 
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
 
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
 
Defining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryDefining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data Story
 
Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...
 
Top 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In QueensTop 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In Queens
 
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
 
Call Girls in Saket 99530🔝 56974 Escort Service
Call Girls in Saket 99530🔝 56974 Escort ServiceCall Girls in Saket 99530🔝 56974 Escort Service
Call Girls in Saket 99530🔝 56974 Escort Service
 
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
 
RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.
 
DBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfDBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdf
 
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
 
Heart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectHeart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis Project
 
RadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfRadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdf
 
Learn How Data Science Changes Our World
Learn How Data Science Changes Our WorldLearn How Data Science Changes Our World
Learn How Data Science Changes Our World
 
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...Biometric Authentication: The Evolution, Applications, Benefits and Challenge...
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...
 
Identifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanIdentifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population Mean
 
Multiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdfMultiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdf
 

Hadoop for Finance - sample chapter

  • 1. C o m m u n i t y E x p e r i e n c e D i s t i l l e d Harness big data to provide meaningful insights, analytics, and business intelligence for your financial institution Hadoop for Finance Essentials RajivTiwari Hadoop for Finance Essentials With the exponential growth of data and many enterprises crunching more and more data every day, Hadoop as a data platform has gained a lot of popularity. Financial businesses want to minimize risks and maximize opportunities, and Hadoop, largely dominating the big data market, plays a major role. This book will get you started with the fundamentals of big data and Hadoop, enabling you to get to grips with solutions to many top financial big data use cases including regulatory projects and fraud detection. It is packed with industry references and code templates, and is designed to walk you through a wide range of Hadoop components. By the end of the book, you'll understand a few industry leading architecture patterns, big data governance, tips, best practices, and standards to successfully develop your own Hadoop based solution. Who this book is written for This book is perfect for developers, analysts, architects or managers who would like to perform big data analytics with Hadoop for the financial sector. This book is also helpful for technology professionals from other industry sectors who have recently switched or like to switch their business domain to financial sector. Familiarity with big data, Java programming, database and data warehouse, and business intelligence would be beneficial. $ 29.99 US £ 19.99 UK Prices do not include local sales tax or VAT where applicable Rajiv Tiwari What you will learn from this book  Learn about big data and Hadoop fundamentals including practical finance use cases  Walk through Hadoop-based finance projects with explanations of solutions, big data governance, and how to sustain Hadoop momentum  Develop a range of solutions for small to large-scale data projects on the Hadoop platform  Learn how to process big data in the cloud  Present practical business cases to management to scale up existing platforms at enterprise level HadoopforFinanceEssentials P U B L I S H I N GP U B L I S H I N G community experience distilled Visit www.PacktPub.com for books, eBooks, code, downloads, and PacktLib. Free Sam ple
  • 2. In this package, you will find:  The author biography  A preview chapter from the book, Chapter 3 'Hadoop in the Cloud'  A synopsis of the book’s content  More information on Hadoop for Finance Essentials About the Author Rajiv Tiwari is a hands-on freelance big data architect with over 15 years of experience across big data, data analytics, data governance, data architecture, data cleansing / data integration, data warehousing, and business intelligence for banks and other financial organizations. He is an electronics engineering graduate from IIT, Varanasi, and has been working in England for the past 10 years, mostly in the financial city of London, UK. He has been using Hadoop since 2010, when Hadoop was in its infancy with regards to the banking sector. He is currently helping a tier 1 investment bank implement a large risk analytics project on the Hadoop platform. Rajiv can be contacted on his website at or on Twitter at .
  • 3. Hadoop for Finance Essentials Data has been increasing at an exponential rate and organizations are either struggling to cope up or rushing to take advantage by analyzing it. Hadoop is an excellent open source framework, which addresses this big data problem. I have used Hadoop within the financial sector for the last few years but could not find any resource or book that explains the usage of Hadoop for finance use cases. The best books I have ever found are again on Hadoop, Hive, or some MapReduce patterns, with examples on counting words or Twitter messages in all possible ways. I have written this book with the objective of explaining the basic usage of Hadoop and other products to tackle big data for finance use cases. I have touched base on the majority of use cases, providing a very practical approach. What This Book Covers Chapter 1, Big Data Overview, covers the overview of big data, its landscape, and technology evolution. It also touches base with the Hadoop architecture, its components, and distributions. If you know Hadoop already, just skim through this chapter. Chapter 2, Big Data in Financial Services, extends the big data overview from the perspective of a financial organization. It will explain the story of the evolution of big data in the financial sector, typical implementation challenges, and different finance use cases with the help of relevant tools and technologies. Chapter 3, Hadoop in the Cloud, covers the overview of big data in cloud and a sample portfolio risk simulation project with end-to-end data processing. Chapter 4, Data Migration Using Hadoop, talks about the most popular project of migrating historical trade data from traditional data sources to Hadoop. Chapter 5, Getting Started, covers the implementation project of a very large enterprise data platform to support various risk and regulatory requirements. Chapter 6, Getting Experienced, gives an overview of real-time analytics and a sample project to detect fraudulent transactions. Chapter 7, Scale It Up, covers the topics to scale up the usage of Hadoop within your organization, such as enterprise data lake, lambda architecture, and data governance. It also touches base with few more financial use cases with brief solutions. Chapter 8, Sustain the Momentum, talks about the Hadoop distribution upgrade cycle and wraps up the book with best practices and standards.
  • 4. [ 47 ] Hadoop in the Cloud Hadoop in the cloud can be implemented with very low initial investment and is well suited for proof of concepts and data systems with variable IT resource requirements. In this chapter, I will discuss the story of Hadoop in the cloud and how Hadoop can be implemented in the cloud for banks. I will cover the full data life cycle of a risk simulation project using Hadoop in the cloud. • Data collection—ingesting the data into the cloud • Data transformation—iterating simulations with the given algorithms • Data analysis—analyzing the output results I recommend you refer to your Hadoop cloud provider documentation if you need to dive deeper. The big data cloud story In the last few years, cloud computing has grown significantly within banks as they strive to improve the performance of their applications, increase agility, and most importantly reduce their IT costs. As moving applications into the cloud reduces the operational cost and IT complexity, it helps banks to focus on their core business instead of spending resources on technology support. The Hadoop-based big data platform is just like any other cloud computing platform and a few financial organizations have implemented projects with Hadoop in the cloud.
  • 5. Hadoop in the Cloud [ 48 ] The why As far as banks are concerned, especially investment banks, business fluctuates a lot and is driven by the market. Fluctuating business means fluctuating trade volume and variable IT resource requirements. As shown in the following figure, traditional on-premise implementations will have a fixed number of servers for peak IT capacity, but the actual IT capacity needs are variable: Your IT needs Time Traditional IT capacity Capacity As shown in the following figure, if a bank plans to have more IT capacity than maximum usage (a must for banks), there will be wastage, but if they plan to have IT capacity that is the average of required fluctuations, it will be lead to processing queues and customer dissatisfaction: On and Off WASTE Fast Growth Variable peaks Predictable peaks CUSTOMER DISSATISFACTION
  • 6. Chapter 3 [ 49 ] With cloud computing, financial organizations only pay for the IT capacity they use and it is the number-one reason for using Hadoop in the cloud–elastic capacity and thus elastic pricing. The second reason is proof of concept. For every financial institution, before the adoption of Hadoop technologies, the big dilemma was, "Is it really worth it?" or "Should I really spend on Hadoop hardware and software as it is still not completely mature?" You can simply create Hadoop clusters within minutes, do a small proof of concept, and validate the benefits. Then, either scale up your cloud with more use cases or go on-premise if that is what you prefer. The when Have a look at the following questions. If you answer yes to any of these for your big data problem, Hadoop in the cloud could be the way forward: • Is your data operation very intensive but unpredictable? • Do you want to do a small proof of concept without buying the hardware and software up front? • Do you want your operational expense to be very low or managed by external vendors? What's the catch? If the cloud solves all big data problems, why isn't every bank implementing it? • The biggest concern is—and will remain for the foreseeable future—the security of the data in the cloud, especially customers' private data. The moment senior managers think of security, they want to play safe and drop the idea of implementing it on the cloud. • Performance is still not as good as that on an on-premise installation. Disk I/O is a bottleneck in virtual machine environments. Especially with mixed tasks such as MapReduce, Spark, and so on, on the same cluster with several concurrent users you will feel a big performance impact. • Once the data is in the cloud, vendors manage the day-to-day administrative tasks, including operations. The implementation of Hadoop in the cloud will lead to the development and operation roles merging, which is slightly against the norm in terms of departmental functions of banks. In the next section, I will pick up one of the most popular use cases: implementing Hadoop in the cloud for the risk division of a bank.
  • 7. Hadoop in the Cloud [ 50 ] Project details – risk simulations in the cloud Value at Risk (VaR) is a very effective method to calculate the financial risk of a portfolio. Monte Carlo is one of the methods used to generate the financial risk for a number of computer-generated scenarios. The effectiveness of this method depends on running as many scenarios as possible. Currently, a bank runs the credit-risk Monte Carlo simulation to calculate the VaR with complex algorithms to simulate diverse risk scenarios in order to evaluate the risk metrics of its clients. The simulation requires high computational power with millions of computer-generated simulations; even with high-end computers, it takes 20–30 hours to run the application, which is both time consuming and expensive. Solution For our illustration, I will use Amazon Web Services (AWS) with Elastic MapReduce (EMR) and parallelize the Monte Carlo simulation using a MapReduce model. Note, however, that it can be implemented on any Hadoop cloud platform. The bank will upload the client portfolio data into cloud storage (S3); develop MapReduce using the existing algorithms; and use EMR on-demand additional nodes to execute the MapReduce in parallel, write back the results to S3, and release EMR resources. HDFS is automatically spread over data nodes. If you decommission the nodes, the HDFS data on them will be lost. So always put your persistent data on S3, not HDFS. The current world The bank loads the client portfolio data into the high-end risk data platform and applies programming iterations in parallel for the configured number of iterations. For each portfolio and iteration, they take the current Asset price and apply the following function for a variety of random variables:
  • 8. Chapter 3 [ 51 ] ( )1 1 1 1 Where: is theAsset priceat time ; is theAsset priceat time 1 is themean of return on assets; t t t t t t t S S t t S S S S t S t μ σε μ + + + + Δ = Δ + Δ Δ = − + The asset price will fluctuate for each iteration. The following is an example with 15 iterations when the starting price is 10€:
  • 9. Hadoop in the Cloud [ 52 ] For a large number of iterations, the asset price will follow a normal pattern. As shown in the following figure, the value at risk at 99 percent is 0.409€, which is defined as a 1 percent probability that the asset price will fall more than 0.409€ after 300 days. So, if a client holds 100 units of the asset price in his portfolio, the VaR is 40.9€ for his portfolio. The results are only an estimate, and their accuracy is the square root of the number of iterations, which means 1,000 iterations will make it 10 times more accurate. The iterations could be anywhere from the hundreds of thousands to millions, and even with powerful and expensive computers, the iterations could take more than 20 hours to complete. The target world In summary, they will parallelize the processing using MapReduce and reduce the processing time to less than an hour. First, they will have to upload the client portfolio data into Amazon S3. Then they will apply the same algorithm, but using MapReduce programs and with a very large number of parallel iterations using Amazon EMR, and write back the results to S3.
It is a classic example of elastic capacity: the client portfolio data can be partitioned and each partition processed independently. The execution time will drop almost linearly with the number of parallel executions. They will spawn hundreds of nodes to accommodate hundreds of iterations in parallel and release the resources as soon as the execution is complete. The following diagram is courtesy of the AWS website; I recommend you visit http://aws.amazon.com/elasticmapreduce/ for more details.

[Diagram: client portfolio data in Amazon S3 is distributed across an EMR cluster of EC2 nodes, and the results are written back to Amazon S3]

Data collection

The data storage for this project is Amazon S3 (where S3 stands for Simple Storage Service). It can store any kind of object, scales without practical limits, and offers 99.999999999 percent durability. If you have a little more money and want better performance, consider storage on:
• Amazon DynamoDB: This is a NoSQL database with unlimited scalability and very low latency.
• Amazon Redshift: This is a relational, parallel data warehouse that scales to petabytes of data; use it if performance is your top priority. It is even more expensive than DynamoDB, on the order of $1,000/TB/year.
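Since every subsequent step in this project reads from or writes to S3, it is worth seeing how little code the bucket setup and upload take. The data upload section below walks through the same task in the console; here is a minimal sketch using the AWS SDK for Java, where the bucket name and file path are hypothetical placeholders and credentials and region are assumed to come from the standard AWS credential and configuration chain:

import java.io.File;

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;

// Minimal sketch: create the input bucket and upload one portfolio file.
public class PortfolioUpload {
    public static void main(String[] args) {
        AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();
        String bucket = "mybank-varsim-input"; // hypothetical; must be globally unique
        if (!s3.doesBucketExistV2(bucket)) {
            s3.createBucket(bucket);
        }
        // Key prefixes act like directories within the bucket.
        s3.putObject(bucket, "portfolio/clients.csv",
                new File("/data/portfolio/clients.csv")); // hypothetical local path
        System.out.println("Upload complete.");
    }
}

The same client can later pull the simulation results back down with s3.getObject, which is all the data analysis step at the end of this project requires.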
Configuring the Hadoop cluster

Please visit http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-what-is-emr.html for the full documentation and relevant screenshots.

Amazon Elastic Compute Cloud (EC2) provides the individual data processing nodes. Amazon Elastic MapReduce (EMR) is a fully managed cluster of EC2 processing nodes that runs the Hadoop framework. Basically, the configuration steps are:
1. Sign up for an account with Amazon.
2. Create a Hadoop cluster with the default Amazon distribution.
3. Configure the EC2 nodes with a high-memory, high-CPU configuration, as the risk simulations will be very memory-intensive.
4. Configure your user role and the security associated with it.

Data upload

Now upload the client portfolio and parameter data into Amazon S3 as follows:
1. Create an input bucket on Amazon S3. A bucket is like a directory and must have a globally unique name, something like <organization name + project name + input>.
2. Upload the source files over a secure corporate Internet connection. I recommend using one of the two Amazon data transfer services, AWS Import/Export or AWS Direct Connect, whenever there is an opportunity to do so.

The AWS Import/Export service works as follows:
• You export the data in Amazon's format onto a portable storage device (hard disk, CD, and so on) and ship it to Amazon.
• Amazon imports the data into S3 over its high-speed internal network and ships the portable storage device back to you.
• The process takes 5–6 days and is recommended only for an initial large data load, not for incremental loads.
• The guideline is simple: calculate your data size and network bandwidth. If uploading over the network would take weeks or months, you are better off shipping the data with Import/Export instead.
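To make that guideline concrete, here is a rough, hypothetical calculation in which the data volume and link speed are assumed figures: a 10 TB initial load is roughly 8 × 10^13 bits, which over a dedicated 100 Mbps connection takes about 8 × 10^5 seconds, or nine days of continuous transfer. At that point, the 5–6 day Import/Export turnaround is already competitive, and any slower link clearly favors shipping the data.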
The AWS Direct Connect service works as follows:
• You establish a dedicated network connection, anywhere from 1 Gbps to 10 Gbps, from your on-premise data center to AWS.
• Use this service if you need to move large volumes of data in and out of the Amazon cloud on a day-to-day basis.

Data transformation

Rewrite the existing simulation programs as Map and Reduce programs and upload them to S3. The functional logic remains the same; you just need to recast the code in the MapReduce framework, as shown in the following template, and compile it as MapReduce-0.0.1-VarRiskSimulationAWS.jar. The mapper splits the client portfolio data into partitions and applies the iterative simulations to each partition. The reducer aggregates the mapper results into the value-at-risk output.

package com.hadoop.Var.MonteCarlo;

// Standard Hadoop MapReduce imports.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class VarMonteCarlo {

  public static void main(String[] args) throws Exception {
    if (args.length < 2) {
      System.err.println("Usage: VarMonteCarlo <input path> <output path>");
      System.exit(-1);
    }
    Configuration conf = new Configuration();
    Job job = new Job(conf, "VaR calculation");
    job.setJarByClass(VarMonteCarlo.class);
    job.setMapperClass(VarMonteCarloMapper.class);
    job.setReducerClass(VarMonteCarloReducer.class);
    job.setMapOutputKeyClass(Text.class);
    job.setMapOutputValueClass(Text.class);
    job.setOutputKeyClass(Text.class);
    // RiskArray is the template's custom Writable for the simulation results.
    job.setOutputValueClass(RiskArray.class);
    // args[0] is the input path and args[1] the output path.
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    job.waitForCompletion(true);
  }

  public static class VarMonteCarloMapper
      extends Mapper<LongWritable, Text, Text, Text> {
    // Implement the algorithm here: split the client portfolio into
    // partitions and run the configured Monte Carlo iterations on each.
  }

  public static class VarMonteCarloReducer
      extends Reducer<Text, Text, Text, RiskArray> {
    // Implement the aggregation here: combine the per-partition
    // simulation results into the value-at-risk output.
  }
}

Once the Map and Reduce code is developed, please follow these steps:
1. Create an output bucket on Amazon S3. Like the input bucket, it must have a globally unique name, something like <organization name + project name + results>.
2. Create a new job workflow with the following parameters:
 Input Location: the S3 bucket directory containing the client portfolio data files
 Output Location: the S3 bucket directory where the simulation results will be written
 Mapper: set the textbox to java -classpath MapReduce-0.0.1-VarRiskSimulationAWS.jar com.hadoop.Var.MonteCarlo.VarMonteCarloMapper
 Reducer: set the textbox to java -classpath MapReduce-0.0.1-VarRiskSimulationAWS.jar com.hadoop.Var.MonteCarlo.VarMonteCarloReducer
 Master EC2 instance: select a larger instance type
 Core EC2 instances: select larger instance types with a low count
 Task EC2 instances: select larger instance types with a very high count, in line with the number of risk simulation iterations
3. Execute the job workflow and monitor its progress.
4. The job should complete much faster than the on-premise run, in less than an hour.
5. The simulation results are written to the output S3 bucket.
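The job workflow above is created through the EMR console, but the same launch can be scripted. Below is a minimal sketch using the AWS SDK for Java that submits the compiled JAR as a custom JAR step; the release label, instance types, counts, and bucket names are illustrative assumptions, not recommendations:

import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduce;
import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduceClientBuilder;
import com.amazonaws.services.elasticmapreduce.model.*;

// Sketch: launch an EMR cluster that runs the simulation JAR as a single
// step and terminates when done. Names and sizes are assumed placeholders.
public class RunVarSimulation {
    public static void main(String[] args) {
        AmazonElasticMapReduce emr = AmazonElasticMapReduceClientBuilder.defaultClient();

        HadoopJarStepConfig jarStep = new HadoopJarStepConfig()
                .withJar("s3://mybank-varsim-input/MapReduce-0.0.1-VarRiskSimulationAWS.jar")
                .withMainClass("com.hadoop.Var.MonteCarlo.VarMonteCarlo")
                .withArgs("s3://mybank-varsim-input/portfolio/",   // input path (args[0])
                          "s3://mybank-varsim-results/output/");   // output path (args[1])

        StepConfig step = new StepConfig()
                .withName("VaR Monte Carlo simulation")
                .withHadoopJarStep(jarStep)
                .withActionOnFailure("TERMINATE_CLUSTER");

        RunJobFlowRequest request = new RunJobFlowRequest()
                .withName("var-simulation")
                .withReleaseLabel("emr-5.30.0")            // assumed release label
                .withServiceRole("EMR_DefaultRole")
                .withJobFlowRole("EMR_EC2_DefaultRole")
                .withInstances(new JobFlowInstancesConfig()
                        .withMasterInstanceType("m4.xlarge")   // assumed instance types
                        .withSlaveInstanceType("m4.2xlarge")
                        .withInstanceCount(20)                 // scale with iteration load
                        .withKeepJobFlowAliveWhenNoSteps(false))
                .withSteps(step);

        RunJobFlowResult result = emr.runJobFlow(request);
        System.out.println("Cluster started: " + result.getJobFlowId());
    }
}

Setting withKeepJobFlowAliveWhenNoSteps(false) makes the cluster terminate itself once the step finishes, which is what keeps the elastic pricing elastic: you stop paying the moment the simulation completes.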
Data analysis

You can download the simulation results from the Amazon S3 output bucket for further analysis with local tools. In this case, a simple download is enough, as the result volume is relatively low.

Summary

In this chapter, we learned how and when big data can be processed in the cloud, from configuration and data collection through the transformation and analysis of the data. Currently, Hadoop in the cloud is not used much in banks due to concerns about data security and performance, although both concerns are debatable. For the rest of this book, I will discuss projects using on-premise Hadoop implementations only.

In the next chapter, I will pick up a medium-scale on-premise Hadoop project and look at it in a little more detail.