An organization’s information is spread across multiple repositories, on-premises and in the cloud, with limited ability to correlate information and derive insights. The Smart Content Hub solution from HP and Hortonworks enables a shared content infrastructure that transparently synchronizes information with existing systems and offers an open standards-based platform for deep analysis and data monetization.
- Leverage 100% of your data: Text, images, audio, video, and many more data types can be automatically consumed and enriched using HP Haven (powered by HP IDOL and HP Vertica), making it possible to integrate this valuable content and insights into various line of business applications.
- Democratize and enable multi-dimensional content analysis: Empower your analysts, business users, and data scientists to search and analyze Hadoop data with ease, using the 100% open source Hortonworks Data Platform.
- Extend the enterprise data warehouse: Synchronize and manage content from content management systems, and crack open the files in whatever format they happen to be in.
- Dramatically reduce complexity with an enterprise-ready SQL engine: Tap into the richest analytics that support JOINs, complex data types, and other capabilities only available with HP Vertica SQL on the Hortonworks Data Platform.
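As a rough illustration of what that looks like in practice, a single standard SQL JOIN can combine warehouse records with content managed in HDP. This is a sketch rather than a reference implementation; the connection details and table names are hypothetical, and it assumes the vertica_python client against a Vertica SQL on Hadoop deployment:

```python
# Illustrative sketch only: host, credentials, and table names are hypothetical.
# Assumes the vertica_python client and a Vertica SQL on Hadoop deployment.
import vertica_python

conn_info = {
    'host': 'vertica.example.com',   # hypothetical host
    'port': 5433,
    'user': 'analyst',
    'password': '...',
    'database': 'analytics',
}

# Join structured warehouse records against content synchronized into HDP,
# using standard SQL rather than hand-written MapReduce.
QUERY = """
    SELECT c.customer_id, c.region, COUNT(d.doc_id) AS doc_count
    FROM warehouse.customers AS c
    JOIN hdp_content.documents AS d ON d.customer_id = c.customer_id
    GROUP BY c.customer_id, c.region
    ORDER BY doc_count DESC
    LIMIT 10;
"""

conn = vertica_python.connect(**conn_info)
try:
    cur = conn.cursor()
    cur.execute(QUERY)
    for row in cur.fetchall():
        print(row)
finally:
    conn.close()
```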
Speakers:
- Ajay Singh, Director, Technical Channels, Hortonworks
- Will Gardella, Product Management, HP Big Data
Before we dive into Hadoop and its role within the modern data architecture, let’s set the context for why Hadoop has become important.
Existing approaches for data management have become both technically and commercially impractical.
Technically – these systems were never designed to store or process vast quantities of data.
Commercially – the licensing structures of the traditional approach are no longer feasible.
These two challenges, combined with the rate at which data is being produced, created the need for a new approach to data systems. If we fast-forward another 3 to 5 years, more than half of the data under management within the enterprise will be from these new data sources.
Enter Hadoop.
Faced with this challenge, the team at Yahoo conceived and created Apache Hadoop. They were convinced that contributing the platform to an open community would speed innovation, so they open sourced the technology within the governance of the Apache Software Foundation (ASF). This introduced two significant advantages.
Not only could they manage new data types at scale, but they now had a commercially feasible approach.
However, there were still significant challenges. The first generation of Hadoop was:
- designed and optimized for batch-only workloads,
- it required dedicated clusters for each application, and,
- it didn’t integrate easily with many of the existing technologies present in the data center.
Also, like any emerging technology, Hadoop still had to reach the level of readiness the enterprise requires.
After running Hadoop at scale at Yahoo, the team spun out to form Hortonworks with the intent to address these challenges and make Hadoop enterprise ready.
Hortonworks has a singular focus: enabling Apache Hadoop as an enterprise data platform for any app and any data type.
We were founded in 2011 by 24 developers from Yahoo, where Hadoop was conceived to address data challenges at internet scale. What we now know of as Hadoop really started in 2005, when a team at Yahoo was directed to build out a large-scale data storage and processing technology that would allow them to improve their most critical application, Search.
Their challenge was essentially two-fold. First, they needed to capture and archive the contents of the internet, and then process the data so that users could search through it effectively and efficiently. Clearly, traditional approaches were both technically (due to the size of the data) and commercially (due to the cost) impractical. The result was the Apache Hadoop project, which delivered large-scale storage (HDFS) and processing (MapReduce).
Today we are over 600 employees and have partnered with over 900 companies that are leaders in the data center.
We have also been fortunate to achieve very significant customer adoption, with over 230 customers as of Q3 2014, spanning nearly every vertical.
Hortonworks was founded with the sole intent to make Hadoop an enterprise data platform. With YARN as its foundation, HDP delivers a centralized architecture with true multi-tenancy for data processing and shared services for Security, Governance, and Operations to satisfy enterprise requirements, all deeply integrated and certified with leading datacenter technologies.
We are uniquely focused on this transformation of Hadoop, and we do our work completely in open source. This is all predicated on our leadership in the community, which enables us not only to best support users of the platform but also to uniquely represent customer requirements within this open, thriving community.
Our product, the Hortonworks Data Platform (or HDP for short) is a completely open source, enterprise-grade data platform that’s comprised of dozens of Apache open source projects including Apache Hadoop and YARN at its center.
We have a comprehensive engineering, testing, and certification process that integrates and packages all of these components into a cohesive platform that the enterprise can consume and deploy at scale. And our model enables us to proactively incorporate new innovations and new open source projects into HDP as they emerge.
To ensure the highest quality, we have a test suite, unique to Hortonworks, comprised of tens of thousands of system and integration tests that we run at scale on a regular basis, including on the world’s largest Hadoop clusters at Yahoo! as part of our co-development relationship.
While our pure-play competitors focus on proprietary components for security, operations, and governance, we invest in new open source projects that address these areas.
For example, earlier in 2014 we acquired a small company called XA Secure that provided a comprehensive security and administration product. We then contributed the technology wholesale to open source as Apache Ranger.
Since our security, operations and governance technologies are open source projects, our partners are able to work with us on those projects to ensure deep integration within our joint solution architectures.
As the information era continues to generate massive volumes of data in different formats, organizations are looking for more efficient means of storing and analyzing that data in a standardized way across the different lines of business. Many are turning to Hadoop, which lends itself nicely to this problem by offering efficiencies that other platforms don’t. There are still problems, though.
There are multiple dimensions of complexity when trying to get insights from data stored in a Hadoop system that’s leveraged at scale.
1) There are, of course, the types of analysis that need to be done, each with their own set of requirements and subtle complexities. Does this department or business need a predictive engine? A prescriptive one? Does the data and data model support the kinds of questions I need to ask of the data? There’s also not a whole lot of analytics that can be used or enabled without significant effort. For the most part, Hadoop lets you store the data as is. There are some open source engines and data stores on top of Hadoop that help you ask the hard questions, but they all use a different set of tools and APIs.
2) Then there’s the processing and delivery of results. What is the delivery/consumption model that works best for the problems I’m looking to solve? Does it align with the types of analysis I want to perform?
3) Most importantly, there are data considerations. The many data types being used in the wild require fundamentally different methods to access and manage the information inside. Machine data, human information, and structured data typically require fundamentally different analytical approaches, which in turn require separate analytics engines. So data is really everything: all analytics decisions hinge on whether we can access what’s inside and what we can do with it.
4) Coupled with the skill set of the business users and the batch-oriented processing of Hadoop, this leaves most organizations with a model that forces them to innovate slowly, use case by use case, rather than dealing with the root of the issue: finding a way to access and collectively analyze all the data efficiently, through standardized, real-time, self-service procedures that are uniform across all the data.
So the issues are now on the table. Hadoop is a powerful toolset that gives enterprises a means to an end when it comes to understanding and acting on their data. The question is: how do I give it that extra edge? The short answer is IDOL. Now let’s talk about how IDOL helps to fill that void.
Key Messages:
One of the largest challenges in getting value from Hadoop investments is the disconnect between business users and data scientists.
They speak different languages
Hard to collaborate on the same data due to lack of tools
Business users often have subject matter expertise, but don’t know technical data science concepts.
Data scientists know how to manipulate the data and extract value, but don’t know the nuances of the business
IDOL enables both Business Users and Data Scientists with:
Interactive Exploration of the Data
Non-SQL graphical navigation of data
Collaboration Features to share insights
Powerful customer examples to lead off
Market/industry landscape/trends (what is today’s reality?)
What problems does this cause for the customer?
What do you need to do to fix the problem? (here are 3-5 requirements)
What are some issues with traditional solutions? (talk about challenges of human information, keyword search, etc)
What is the answer? Our solution, powered by IDOL (what is HP enterprise search, what is IDOL)
Why do people like you choose our solution? (powered by IDOL, Gartner MQ, KPIs, features, IDOL is key to HP’s strategy)
Illustrate today vs tomorrow with our technology
Summary slide
So in the end, using MapReduce, we’re able to translate the fetch/indexing activities into a discrete set of concurrent tasks within the racks running Hadoop. This accomplishes three things (a conceptual sketch follows this list):
1) It translates the HP connectivity processing into a Hadoop best practice by harnessing MapReduce. Instead of the point-and-shoot architecture of most connectors, we’re building a native plugin that can be used to process and analyze your data within the Hadoop ecosystem. IDOL becomes a native plugin for Hadoop. It also turns what can be exhausting and complex code to write into configuration-driven analytics processing.
2) By conforming to Hadoop best practices, we’re able to create a faster and more efficient means of processing the data so that it can be sent off to IDOL for analysis the same way it has been in the past. We’re not just using IDOL’s distribution anymore; we’re also leveraging the MPP capabilities of Hadoop to do the heavy lifting for us.
3) IDOL is able to incorporate its industry-leading analytics capabilities into out-of-the-box functions that can be turned on via configuration rather than through complex programmatic integration. We all know IDOL’s ingestion pipeline can do many things, but being able to leverage those functions in a streamlined, configuration-driven manner has huge advantages over the more brute-force programming methodologies employed by other vendors. Many people don’t want to have to code their way through these issues; just enable the features and that’s it.
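To make the pattern concrete, here is a minimal, hypothetical sketch of the idea described above: a Hadoop Streaming mapper that fans document preparation out across the cluster and emits records a downstream step could hand to IDOL for enrichment and indexing. The record layout and field names are illustrative assumptions, not the actual HP connector code.

```python
#!/usr/bin/env python
# mapper.py - illustrative Hadoop Streaming mapper, NOT the actual HP/IDOL connector.
# Assumes each input record is "doc_id <TAB> raw_text"; every mapper task processes
# its share of documents in parallel across the cluster.
import json
import sys

for line in sys.stdin:
    line = line.rstrip('\n')
    if not line:
        continue
    doc_id, _, raw_text = line.partition('\t')
    record = {
        'reference': doc_id,
        'word_count': len(raw_text.split()),  # trivial stand-in for real enrichment
        'content': raw_text,
    }
    # Emit one prepared record per document for a downstream indexing step.
    sys.stdout.write('%s\t%s\n' % (doc_id, json.dumps(record)))
```

Submitted through the standard streaming jar (for example, hadoop jar hadoop-streaming.jar -input /docs -output /prepared -mapper mapper.py -file mapper.py), the cluster does the heavy lifting rather than a single connector host, and the connector’s job reduces to configuring which enrichments run.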
There’s a lot of value there, and that’s just the connectors. But the connector is just part of the story, i.e. the data processing (ETL) and preparation before the data is finally loaded into IDOL – an important job, but just one part of the architecture. Once the data is in IDOL, that’s when the really interesting things happen, because that’s when we start to expose the powerful functions and capabilities of the platform. Stateful functions like retrieval, classification, clustering, and many more become available to both explore and analyze your data in real time. Let’s look at the big picture now…
Key Messages:
Unlike other technologies that simply read HDFS as a file-system, IDOL is integrated deeply into the Hadoop architecture
Takes advantage of MPP compute power of Hadoop
Deals with multi-tenancy and data with different security rights and privileges
Advanced analytics for all data-types
So now we’ll take a look at a use case that is becoming more and more common as different organizations adopt Hadoop and look to streamline data storage and analysis across the different lines of business that IT needs to support.
Key Messages:
Large diversified healthcare company, acting as both a payer and a provider
Claims are the lifeblood of their operations; they used traditional data warehouse, BI, and statistical tools
Challenges:
Business SMEs have knowledge of payment processes but are not data scientists
Report generation took a long time: 30–45 days
Did not speak the same language
Constant pressure to reduce Fraud, Waste, and Abuse
Payment Integrity was an early user of analytics – identified as a high-ROI target for Hadoop and analytics
Challenging because patterns of providers and fraud are constantly changing
Changes in regulations and contracts, plus errors in data entry and process, can result in incorrect payments
The government estimates that $50B of the $500B spent on Medicare is lost to FWA; private health insurers are also affected
IDOL solved this problem by providing self-service analytics to business users and data scientists.
Hadoop is being used to scale out to all payment systems
New data sources and use-cases being added constantly
Enabling a wide variety of lines-of-business
Has potential for very big impact on the organization
So the issues are now on the table. Hadoop on its own isn’t enough. How do I create a real-time, efficient, all-encompassing, and multitenant environment to glean all the valuable insights contained within Hadoop? By pairing IDOL alongside Hadoop, you can leverage IDOL to:
Supercharge your analytics: Instead of writing complicated and time-consuming MapReduce or YARN scripts that are mostly batch-oriented, use the real-time advanced analytics techniques built directly into IDOL.
Democratize data and analysis: IDOL also offers something very unique for Hadoop. By reducing the complexities of data processing to configuration and offering a common analytics API, analysis and data management become self-service functions through a standardized RESTful API that is simple and easy to use (see the sketch after this list). Business intelligence is enabled across a wider set of content.
Leverage 100% of the data for analysis: By ingesting data into IDOL, you’re not just able to execute the analytics faster, you’re able to expand the scope of your analytics to cover more data types beyond the most common. Later I’ll show you how we can take the standard keyword-counter example using Hadoop and turn it on its head by simply asking IDOL or leveraging some of its core libraries.
Reduce costs and complexity: Think about even the easiest problems to solve with Hadoop. Give me your best Hadoop technician and I’ll show you someone who needs a few hours, if not a couple of days, to write scripts that work, and with batch-oriented processing nothing ever works the first time. IDOL, by contrast, enables you to ask complex questions and get real-time answers. Getting answers faster saves time and money that your people can spend making decisions against data they now fully understand.
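To make the keyword-counter contrast above concrete, here is a rough sketch of what the real-time, self-service side looks like once the content is in IDOL. The host, port, action name, and parameters are hypothetical stand-ins for an IDOL-style HTTP query interface, not documented API calls; the batch alternative would be a MapReduce job written, scheduled, and debugged for each new question.

```python
# Illustrative sketch only: endpoint, action name, and parameters are hypothetical
# stand-ins for an IDOL-style HTTP query interface, not documented API calls.
import requests

IDOL_HOST = 'http://idol.example.com:9000'  # hypothetical content engine

def keyword_hits(term, max_results=10):
    """Ask the engine, in real time, which documents mention a term --
    the question a batch keyword-count MapReduce job would answer offline."""
    response = requests.get(
        IDOL_HOST + '/action=Query',
        params={'text': term, 'maxresults': max_results, 'responseformat': 'json'},
        timeout=30,
    )
    response.raise_for_status()
    return response.json()

if __name__ == '__main__':
    print(keyword_hits('payment integrity'))
```

The point of the sketch is the shape of the interaction: one self-service REST call against content that has already been ingested and enriched, rather than a new script per question.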