Big Data
The story of big data starts with Google. Back in the early 2000s, Google was a young company growing rapidly along with the internet itself.
Data was growing fast with the birth of social media and Web 2.0 sites, and it was growing faster than Google could keep up with. Expensive data warehouses were no longer a viable option.
Google engineers had the idea of using commodity (cheap) servers as a replacement, and they created a way to store and process data in parallel across those servers.
They published these insights in research papers on the Google File System and MapReduce.
So what is Hadoop?
Interesting tidbit: Hadoop was developed by Doug Cutting while he was working at Yahoo, and it is named after his son’s toy elephant. Apache Hadoop’s MapReduce and HDFS components were inspired by those Google papers.
Apache Hadoop is an open-source software framework for distributed storage and distributed processing of very large data sets on clusters built from commodity hardware. At Hadoop’s core are three main components: HDFS, YARN, and MapReduce.
HDFS (the Hadoop Distributed File System) provides the storage for Hadoop.
MapReduce provides the processing for Hadoop.
YARN (Yet Another Resource Negotiator) coordinates the processing and schedules work across all the nodes.
HDFS was designed to be fault-tolerant and to run on commodity hardware, so blocks are replicated a number of times to ensure high data availability. By default, Hadoop’s replication factor is set to 3, meaning there is one original block and two replicas. This can be adjusted.
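As a minimal sketch of how that default is typically changed, using the standard dfs.replication property in hdfs-site.xml (the value shown is just an example):

    <!-- hdfs-site.xml: cluster-wide default block replication -->
    <property>
      <name>dfs.replication</name>
      <value>3</value>  <!-- one original block plus two replicas -->
    </property>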
The true power of the Hadoop architecture lies in its distribution: the ability to spread work across many nodes in parallel is what lets Hadoop scale to large infrastructures and, likewise, to the processing of large amounts of data.
Hadoop was designed primarily for batch processing. MapReduce runs in sequential stages and relies on disk to store intermediate results between them.
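To make the batch, disk-based model concrete, here is a minimal word-count job written in Python for Hadoop Streaming (a standard part of Hadoop; the file names are illustrative). Hadoop sorts and shuffles the mapper’s output on disk before the reducer runs, which is exactly the intermediate-results step described above.

    #!/usr/bin/env python
    # mapper.py: emit (word, 1) for every word read from stdin
    import sys

    for line in sys.stdin:
        for word in line.strip().split():
            print("%s\t%d" % (word, 1))

    #!/usr/bin/env python
    # reducer.py: input arrives sorted by word, so counts can be summed per key
    import sys

    current_word, count = None, 0
    for line in sys.stdin:
        word, value = line.rstrip("\n").split("\t")
        if word != current_word:
            if current_word is not None:
                print("%s\t%d" % (current_word, count))
            current_word, count = word, 0
        count += int(value)
    if current_word is not None:
        print("%s\t%d" % (current_word, count))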
Hadoop was driven by the need to capture and process the large volumes of data generated by the Big Data trends we talked about on the previous slides.
These data storage and processing needs include:
The ability to store all types of data. Sources may include sensors, social media feeds, log files, or even traditional databases, so the storage can’t be limited to a particular format.
The processing capability to analyze these large volumes in full: not a sampling, but a review of every record as part of the analysis.
The ability to scale from a single server to thousands of machines.
A lower-cost alternative to traditional data warehouses, since at these volumes the hardware, storage, and software costs of a warehouse would make the use cases impractical.
Sometimes one minute is too late. How do you quickly process, analyze, and act on data? What opportunity are you missing?
The challenge clients face when trying to capture real-time value is the cost of storing these high volumes of data for analysis. Once the data is stored, it must be inspected and analyzed to separate the signal from the noise and determine what should be acted on. That takes storage and analysis time, and by then the insight is no longer relevant because the opportunity has passed.
Take as an example a website that offers real-time personalization by presenting each visitor with an offer appropriate to what they have been viewing. To accomplish this, the website must understand the visitor’s clickstream data in real time in order to serve up a relevant offer during the visit. There is no time to store and analyze the data first; by then, the visitor has left the website. These clients need Streams to ingest the clickstream data, analyze it on the fly, and present the offer to the web visitor.
Spark Streaming
Process live streams of data (IoT, Twitter, Kafka, etc.) with the Spark engine to drive some action, or output results in batches to various data stores.
Implement near-real-time stream event processing (e.g. fraud or security detection), as in the sketch below.
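A minimal sketch using Spark’s Structured Streaming API (the host and port are illustrative; a Kafka source could be swapped in):

    # Count words arriving on a network socket, micro-batch by micro-batch.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import explode, split

    spark = SparkSession.builder.appName("StreamingWordCount").getOrCreate()

    lines = (spark.readStream
             .format("socket")
             .option("host", "localhost")
             .option("port", 9999)
             .load())

    counts = (lines.select(explode(split(lines.value, " ")).alias("word"))
              .groupBy("word")
              .count())

    # Emit the running counts to the console as each micro-batch completes.
    query = counts.writeStream.outputMode("complete").format("console").start()
    query.awaitTermination()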
MLlib – Machine Learning
Process machine learning algorithms in areas such as clustering and classification, with applicability in sentiment analysis, predictive intelligence, segmentation, modeling, etc.
Build and deploy rich analytics models (e.g. risk metrics); see the sketch below.
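A minimal MLlib sketch of the clustering use case, segmenting customers with k-means (the feature values are made up purely for illustration):

    from pyspark.sql import SparkSession
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.clustering import KMeans

    spark = SparkSession.builder.appName("Segmentation").getOrCreate()

    # Hypothetical per-customer features: visit frequency and basket size.
    df = spark.createDataFrame(
        [(0, 12.0, 3.0), (1, 11.5, 2.5), (2, 80.0, 40.0), (3, 78.0, 43.0)],
        ["customer_id", "visits_per_month", "avg_basket_usd"])

    features = VectorAssembler(
        inputCols=["visits_per_month", "avg_basket_usd"],
        outputCol="features").transform(df)

    # Fit two clusters and assign each customer to a segment.
    model = KMeans(k=2, seed=42).fit(features)
    model.transform(features).select("customer_id", "prediction").show()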
Spark SQL – Interactive Analytics
Query your structured data sets with SQL or the DataFrame API.
Use BI tools to connect and query via JDBC or ODBC.
Interactive querying of very large data sets is one of the most important value adds Spark offers over Hadoop. The more interactive or iterative the workload, the greater the performance improvement, as in the sketch below.
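A minimal Spark SQL sketch (the file path and column names are hypothetical):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("InteractiveSQL").getOrCreate()

    # Register a data set as a temporary view, then query it with plain SQL.
    sales = spark.read.json("/data/sales.json")
    sales.createOrReplaceTempView("sales")

    spark.sql("""
        SELECT region, SUM(amount) AS revenue
        FROM sales
        GROUP BY region
        ORDER BY revenue DESC
        LIMIT 10
    """).show()

The same query could also be written with the DataFrame API, e.g. sales.groupBy("region").sum("amount").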
GraphX (graph)
Model and analyze systems made up of nodes and the interconnections between them – transportation networks, relationships between people, etc.
Allows you to perform operations on the graph to uncover relationships, e.g. behavior propensity, churn, and fraud detection (see the sketch below).
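GraphX itself exposes a Scala/Java API, so to stay consistent with the Python examples, here is an equivalent sketch using the separate GraphFrames Spark package rather than core GraphX (the vertices and edges are made up):

    from pyspark.sql import SparkSession
    from graphframes import GraphFrame  # requires the graphframes package

    spark = SparkSession.builder.appName("GraphExample").getOrCreate()

    vertices = spark.createDataFrame(
        [("a", "Alice"), ("b", "Bob"), ("c", "Carol")], ["id", "name"])
    edges = spark.createDataFrame(
        [("a", "b", "calls"), ("b", "c", "calls"), ("c", "a", "pays")],
        ["src", "dst", "relationship"])

    g = GraphFrame(vertices, edges)

    # How connected is each node? (one possible input to churn/fraud scoring)
    g.inDegrees.show()

    # PageRank as a simple influence measure across the network.
    g.pageRank(resetProbability=0.15, maxIter=10).vertices.show()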
Data Processing and Integration
Existing data processing workloads run much faster.
Simplified coding: e.g. 3 lines of code instead of 6 pages in traditional programming (see the sketch below).
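As an illustration of that compression, the word count that took a separate mapper and reducer script in the Hadoop section earlier fits in a few lines of PySpark (the input and output paths are illustrative):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("WordCount").getOrCreate()

    # The whole job: read, split into words, count, write results.
    counts = (spark.sparkContext.textFile("/data/input.txt")
              .flatMap(lambda line: line.split())
              .map(lambda word: (word, 1))
              .reduceByKey(lambda a, b: a + b))
    counts.saveAsTextFile("/data/output")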
Client Name
SmarterData
Company Background
Based in San Ramon, California, SmarterData, Inc. leverages advanced data science technologies – predictive and prescriptive analytics – to help companies achieve relevance with their customers both online and in a retail environment, and manage the demands of digital-age business challenges.
Business Challenge
To help its retail clients navigate the uncertainties of the digital-age industry, SmarterData wanted to find new ways to provide relevant, actionable, data-driven insights into consumer behavior.
The Benefit
SmarterData’s clients can now perform real-time analysis, utilizing everything from point-of-sale data to weather data, empowering in-store employees to take immediate action on the shop floor.
Pull Quote
“Using IBM Analytics for Apache Spark, we can now give in-store teams valuable insight in seconds.”
—Ram Himmatraopet, Founder & CEO, SmarterData
Solution Components
IBM® Analytics for Apache Spark
IBM Bluemix®
Case Study Link
http://www.ibm.com/common/ssi/cgi-bin/ssialias?subtype=AB&infotype=PM&htmlfid=YTC04066USEN&attachment=YTC04066USEN.PDF
Customers can run their IT and development in one of four options.
With traditional IT, applications either run at the customer’s location (on premises) or are hosted at a third-party location.
The good news in this option is that the customer has the capability, and full control: they investigate the right solutions, source and buy them, integrate them, and run and manage the entire stack.
The bad news is that the customer has to spend the time and money to research, test, integrate, run, and manage that entire stack.
This can be incredibly expensive and time-consuming, and it does not add value to the customer’s business.
For IaaS, or Infrastructure as a Service, the customer is responsible for running and managing everything from the operating system on up.
The service provider manages the layers below.
SoftLayer, AWS and Azure are IaaS solutions.
For Platform as a Service, which is what Bluemix is, the service provider manages the infrastructure, and the customer’s developers focus 100% on their application code and data.
For Software as a Service, the service provider hosts 100% of the data, logic, and infrastructure; the customer only needs a browser.
Examples include Salesforce.com, Microsoft Office 365, Facebook, eBay, LinkedIn, and Concur.
Basically, all that is needed is a browser and a printer.
Let’s look at the three major cloud deployment models: private, public, and hybrid clouds. On the far left of the graphic, you see the enterprise data center. Most clients will continue to maintain a traditional data center for some IT services, and in this deployment model the client owns and operates all of the hardware and software in their enterprise data center.
The next box, the Private Cloud deployment model, is a private cloud inside the client’s data center. The client owns and operates the infrastructure and software.
The next type of Private Cloud is the Managed Private Cloud. In this deployment model the cloud is located in the client’s data center but IBM is operating and managing the cloud for the client.
The next Private Cloud deployment model is the Hosted Private Cloud. This Private Cloud resides in an IBM data center; it is still owned by the client, but IBM performs all of the operational and management support. Note that as we move further and further to the right, the client gives up more and more control to a third party.
To Cloud Data Services sellers, the differences between Private Cloud types are not as important as they are to IBM GTS or GBS sellers. Sales will be monthly or perpetual licenses, and someone else is selling the infrastructure and labor. The only major thing to watch out for is that Hosted Private may require additional selling of data security and data movement technology, since the client’s data is moving off premises.
To the far right, you see the Public Cloud deployment model. Here a service provider makes resources such as applications and storage available to consumers over the internet, and the client pays for the resources they consume. SoftLayer and Amazon are examples of a Public Cloud.
The final deployment model is the hybrid cloud, shown at the bottom of the page. The hybrid cloud is an integrated cloud, which may be cloud-to-enterprise or cloud-to-cloud integration, giving clients the benefit of a seamless IT system. Many enterprise clients are moving to a hybrid cloud model.