Organizations that have vast amounts of data in legacy applications often experience difficulties delivering that data to business unit end-users. Register to learn how Eliza Corporation and Scholastic overcame this challenge by leveraging a Data Lake solution from NorthBay on AWS to optimize data analytics and provide greater visibility. AWS and NorthBay will give you an in-depth overview of how you can use a Data Lake in conjunction with your existing on-premises or cloud-based Data Warehouse. NorthBay helps organizations scale their ETL and data warehousing workloads using Amazon EMR and Amazon Redshift. Join us to learn:
• Best practices for using a Data Lake in conjunction with your existing data warehouse
• The key aspects of introducing agile and scrum methodologies into an enterprise
• The most impactful cost-savings levers that are addressed via a cloud data warehouse migration
Who should attend: Heads of Analytics, Heads of BI, Analytics Managers, BI Teams, Senior Analysts
Develop a Custom Data Solution Architecture with NorthBay
1. “Teaching Old Data New Tricks™”
Brian Barker • CEO • NorthBay Solutions
John Puopolo • SVP • Engineering • Eliza Corporation
Ali Khan • Director, Business Intelligence and Analytics • Scholastic
Sai Reddy Thangirala • Solutions Architect • Amazon Web Services
2. Agenda
• Big Data on AWS
• NorthBay
• Eliza Corporation Case Study
• Challenges Eliza Faced
• Strategic Goals
• Why a Data Lake Approach was Chosen
• Outcomes & Benefits Eliza Achieved
• Scholastic Case Study
• Challenges
• Goals
• The AWS/NorthBay Decision
• How the Initiative Unfolded
• Key Learnings
3. Data is Growing
1.7 MB of new data will be created every second for every human being on the planet by 2020
http://www.whizpr.be/upload/medialab/21/company/Media_Presentation_2012_DigiUniverseFINAL1.pdf
58% compound annual growth rate forecasted for the Hadoop market, surpassing $1 billion by 2020
http://www.ap-institute.com/big-data-articles/big-data-what-is-hadoop-%E2%80%93-an-explanation-for-absolutely-anyone.aspx
http://www.marketanalysis.com/?p=279
<0.5% of all data is ever analyzed and used at the moment
http://www.technologyreview.com/news/514346/the-data-made-me-do-it/
4. Big Data Is for Everyone
The market for Big Data technologies is growing more than six times faster than the
information technology market as a whole….
…and the companies that use their data well win.
5. Why AWS for Big Data?
Immediately Available • Broad and Deep Capabilities • Trusted and Secure • Scalable
6. AWS Provides the Most Complete Platform for Big Data
It’s easy to get data to AWS, store it securely, and analyze it with the engine of your choice, without any long-term commitment or vendor lock-in.
• Collect: Import/Export, Snowball, Direct Connect, VM Import/Export
• Store: Amazon S3, EMR, Amazon Glacier, Amazon Redshift, DynamoDB, Aurora
• Analyze: Amazon Kinesis, Lambda, EMR, EC2
7. What Can You Do With Big Data on AWS?
Big Data Repositories • Clickstream Analysis • ETL Offload • Machine Learning • Online Ad Serving • BI Applications
9. “Teaching Old Data New Tricks™”
Untapped wealth - Companies gain
tremendous leverage when
“Teaching Old Data New Tricks™”
• So what does that mean?
• You’ll hear 2 exciting Customer
Examples/Use Cases
presented today
Building a HIPAA-compliant Data Lake
Re-tooling old on-premises technology on the fly
Customer Examples/Use Cases
10. Scholastic Preview of Coming Attractions
• How did an old-school, $1.5B, 100-year-old company re-invent its
IBM- and Microsoft-based big data and analytics systems on the fly?
• What was their starting point?
• What factors did they consider when making their decision?
• What did they decide on for technology and partners and why?
• How did they implement?
• What were the results?
• Lessons learned?
11. AWS & NorthBay Background
Global Provider of Big Data Solutions
250+
Full-time professionals
145+
Clients
200+
Solutions launched
13. Eliza Preview of Coming Attractions
• How does a high-flying healthcare services company re-platform its
Enterprise Data Platform while processing millions of
'interactions' every day?
• Why the need to change?
• What strategic goals had to be achieved?
• What is so tough about "name-value pairs"?
• Why a Data Lake and why NorthBay?
• Which AWS services did they choose to leverage?
• What did they decide on for technology and partners, and why?
• How did it turn out?
• What did they learn?
15. About Eliza Corporation
• Founded 2000
• Leader in Health Engagement Management
(HEM) outreach services
• Hundreds of millions of outreaches requiring
intensive operational and analytics processing
• High-volume semi-structured data, complex
business flow of data
• Variety of analytics/consumption needs
ranging from portal for customers to ML
workloads
16. Challenges Eliza Faced
Eliza Corporation
analyzes more than
300 million interactions
per year
Outreach questions and responses form a decision tree, and each question and response are captured as a pair, e.g.: <question, response> = <“Did you visit your physician in the last 30 days?”, “Yes”>
Diverse downstream
consumption
requirements
Challenging to process
and analyze data
17. Strategic Goals
Create next generation
data architecture
Decouple Storage and
Compute
Ability to process old &
new data streams
Achieve HIPAA
compliance
Ingest & store original
datasets
Allow both real-time &
batch processing
Enable access through
entitlements and
governance
Increase self-service for
end-users
18. Conceptual Data Lake Architecture
[Architecture diagram] Three layers — Data Sources & Ingestion, Processing & Storage, Consumption & Analytics — with monitoring, auditing, management, and alerting plus data system analytics (lineage, profiling) spanning the platform.
• Data Sources & Ingestion: streaming data sources and batch data sources, passing through Data Quality & Governance
• Processing & Storage: Data Lake storage, Data Lake archive, catalog & search & data discovery, ETL into the EDW
• Consumption & Analytics: API & UI with Entitlements & Authorizations, real-time analytics, BI tools, Hadoop (shared services), business units’ BI UIs, Hadoop/SAS (business-unit dedicated)
19. Benefits of the New Enterprise Data
Platform Architecture
• Hub & spoke model for one original copy of all enterprise
analytics data
• Quality layer for consistent transformations and cleansing of data
• Governance layer for entitlements and security management
• Enable multiple consumption patterns called projections
• A purpose-designed schema for an Enterprise Data Warehouse
(Redshift) for efficient reporting of known queries
• Streamlined and automated ingestion of source batch and streaming
data, reducing human/manual touch points
21. Major AWS Services Used
Aurora • Kinesis + Kinesis Streams • Amazon Redshift • DynamoDB • Hive, Presto, Spark on EMR • CloudSearch • EC2
22. Benefits of a New Enterprise Data Platform
• Streamlined the data load process by enabling schema-on-read
• Improved business agility by allowing schema-on-read
• Improved the ability to manage costs by separating compute and storage costs
• Provided the ability to scale resources on demand
• Reduced end-to-end client analytics time
23. Key Learnings
• The nature of our data is name-value. We
were doing too many transformations due
to our original storage formats.
• Using mini-PoCs to form hypotheses and
prove/disprove them led to an emergent
architecture, which pointed us towards a
data lake
• A data lake architecture fits our core
business and growth plans extremely well
25. About Scholastic
$1.6B in annual revenue • The world’s largest publisher and distributor of children’s books • #1 website for U.S. elementary school teachers • 8,400+ employees globally • 165+ countries • 45+ languages • A leader in comprehensive educational solutions
26. Existing Platform & Challenges
• We taught old data new tricks
• IBM AS/400 was primary data warehouse platform, supplemented by Microsoft SQL
Server to enable business intelligence
• 5,500+ AS/400 workloads, 350+ SQL Server workloads
• Inflexible architecture – slow time to market
• Unable to meet internal SLAs due to performance of daily ETL processes
• Scalability limitations with SQL Server Analysis Services (SSAS) for
dashboards/reports
• Limited ability to perform self-service business intelligence
27. Project Goals
Improve performance, scalability,
availability, logging, security
Enable self-service business
intelligence
Integrate with existing
technology stack
Align with the tech strategy
(DevOps model, Cloud First)
Leverage the skill set of current
team (SQL/relational)
Team up with an experienced
partner
28. The Decision
• AWS was chosen because of agility, scalability, elasticity, security and alignment with corporate strategy
• Redshift was chosen to replace the AS/400 and SQL Server for its relational-style, high-performance data store
• NorthBay was chosen for their expertise in Big Data and Amazon Redshift migrations
29. Pilot Plans
Migrate a functional area in a key
business unit during a
3-month pilot
Demonstrate immediate
business value
Stand up the AWS environment
to allow IT to gain competence
with AWS
30. Pilot Outcomes
Create core framework for
migration
Implement ELT
architecture and perform
validation
Establish
visualization/self-service
capability through Tableau
32. Core Framework
• Jobs and job groups are defined as metadata in DynamoDB
• Control-M Scheduler, Custom Application and Data Pipeline for
Orchestration
• ELT Process with EMR/Sqoop for Extraction, Redshift Load and Transform
the data through SQL scripts
• Core Framework allows for
• Restart capability from point of failure
• Capturing of operational statistics ( # of rows updated)
• Audit capability (which feed caused the fact to change)
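As a rough illustration of the core framework’s metadata-driven approach (the table name, attribute names, and job fields below are hypothetical, not Scholastic’s actual schema), a single job definition registered in DynamoDB might look like this:

```python
import boto3

# Hypothetical job-metadata table; names and attributes are illustrative only.
dynamodb = boto3.resource("dynamodb", region_name="us-east-1")
jobs = dynamodb.Table("etl_job_metadata")

# Register one extraction/load job as metadata. The orchestration layer
# (Control-M + custom application + Data Pipeline) would read entries like
# this to build a pipeline per table.
jobs.put_item(
    Item={
        "job_group": "book_clubs_daily",       # logical grouping of jobs
        "job_id": "orders_extract",            # unique job within the group
        "source_schema": "ORDERLIB",           # AS/400 library (illustrative)
        "source_table": "ORDERS",
        "target_s3_prefix": "s3://example-staging-bucket/orders/",
        "redshift_table": "staging.orders",
        "transform_sql_key": "sql/orders_fact_load.sql",  # SQL script stored in S3
        "num_mappers": 16,                     # Sqoop parallelism
        "enabled": True,
    }
)
```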
33. Data Visualization Through Tableau
• Business users have access to facts/dimensions for standard reports through Tableau
• Power users have access to Staging tables for Ad-Hoc queries through Tableau
• Data Scientists have access to Files in S3 (from all extracts serving as Data Archive)
using Hive and/or Presto
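For the data-scientist access path above (files in S3 queried with Hive and/or Presto), a query through Presto on EMR could look roughly like this sketch; the cluster endpoint, schema, and table names are invented, and PyHive is just one common client, not necessarily the one used here:

```python
# Illustrative only: assumes a Presto coordinator on an EMR master node and a
# Hive external table ("raw_extracts") defined over the S3 archive files.
from pyhive import presto

conn = presto.connect(
    host="emr-master.example.internal",  # hypothetical EMR master node
    port=8889,                           # default Presto port on EMR
    catalog="hive",
    schema="data_archive",
)

cur = conn.cursor()
cur.execute("""
    SELECT business_unit, COUNT(*) AS extract_rows
    FROM raw_extracts
    WHERE load_date = DATE '2016-06-01'
    GROUP BY business_unit
""")
for row in cur.fetchall():
    print(row)
```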
34. Accelerating the Program Timeline
• CTO moved budget forward to:
• Reduce project timeline by 50%
• Eliminate overhead of 2 platforms
• Parallel work streams (swim lanes) utilized the same core
framework for migrating data for other business units
• NorthBay partners with each of those work streams to
accelerate migration
• Users wanted to be on the new platform sooner
35. Lessons Learned - Technology
• Keep the core framework isolated, with project-specific code repositories
• Make appropriate schema changes when migrating to the new platform
• Customize the framework for gathering operational stats (e.g., # of rows loaded)
• Start with test automation tools and Acceptance Test Driven Development (ATDD) earlier in the project
36. Lessons Learned – Program Execution
• Creating new data platforms and migrating data into them is easy, especially with AWS; decommissioning existing data platforms is hard!
• A “Data Champion” / “Data Guide” partnership is absolutely critical for successful adoption of new platforms and working models
• Strong Agile coaches are important while scaling out Agile teams
37. Questions & Answers
Brian Barker • CEO • NorthBay Solutions brian.barker@northbaysolutions.com
John Puopolo • SVP • Engineering • Eliza Corporation
Ali Khan • Director, Business Intelligence and Analytics • Scholastic
Sai Reddy Thangirala • Solutions Architect • Amazon Web Services
www.northbaysolutions.com info@northbaysolutions.com
Editor's Notes
Thanks - I am Brian Barker – CEO NorthBay
Sometimes in the middle of a seemingly routine project something great happens. It happened to me when one of our customers – in fact, one you will hear from today – said, “It was our same data – but we get so much more out of it.”
In saying that, he exposed a universal truth: virtually every company on this webcast, and in fact every company in America, has a vast pool of untapped wealth in its old data. The key is unlocking, re-energizing, re-using, and re-combining it, and making it available to users who can in turn re-energize, re-use, and re-combine it.
We realized that if we can help our customers teach old data new tricks, they reap enormous rewards – when Amazon heard this, they got excited too – and that is why we are here today.
So what does that all mean? 2 great NorthBay customers will share their story.
What we are really
MODERN DATA LAKE (1. Catalog & metadata, 2. Storage & 3. Access control)
The conceptual architecture for a Data Lake-centered platform has:
Both [stream oriented] and [batch oriented] data sources
That are {ingested} going through the [Quality & Governance Layer] for cleansing of the data (CLEANSING AND STANDARDIZING DATA)
These data sets are stored in the [Data Lake] on S3 with the [Catalog and metadata] updated allowing for later search and discovery
The Data Lake has the [Archival] available for management of life cycle of the data inside the lake
{Processing} of the data from the data lake is done through [ETL] into [AWS Redshift]
The entire Data Lake and its services are encapsulated via an [API] that provides for [Entitlements & Authorization]
Multiple other consumption needs for the data residing in the data lake are met by various [BI Tools], [Hadoop clusters] and so on using it
Hello, I’m John Puopolo, Senior Vice President of Engineering at Eliza.
Since 1998, Eliza Corporation has developed healthcare consumer engagement solutions to address some of the industry’s greatest challenges – from adherence, to prevention, to condition management, to brand loyalty and retention.
"Pay-for-performance" in healthcare incentivizes payers and providers to keep a population under their care healthier.
The Pay-for-performance arrangements provide financial incentives to hospitals, physicians, and other healthcare providers to carry out such improvements and help achieve optimal outcomes for patients. This is a departure from fee-for-service, where payments are for each service used.
Eliza focuses on Health Engagement Management, and acts on behalf of healthcare organizations (e.g. hospitals, clinics, pharmacies, insurance companies, etc.) in order to engage people at the right time, with the right message, and in the right channel to capture relevant metrics to analyze the overall value provided by Healthcare.
We process the healthcare data for over 55M Americans every year
This translates to 100s of millions of interactions per year
Outreach results yield 2M-5M data points per day
A handful of our key tables have approximately 1 trillion rows
Eliza Corporation analyzes more than 300 million outreaches per year, primarily through outbound phone calls with Interactive Voice Response (IVR) technology, but other channels such as SMS, email, and in-bound IVR are growing quite rapidly.
For Eliza, interactions are healthcare questions. Each question results in an answer. The questions and answers are implemented as a decision tree, and we capture each unique question-answer pair as a tuple:
<question, response> = <“Did you visit your physician in the last 30 days?” , “Yes”>
Post-outreach, these question-answer pairs need to be analyzed, sorted, aggregated, etc., and very often we need to process them differently for different customers.
This means that keeping the data in raw form is important, as it makes analysis and reporting as flexible as possible. Imposing a schema-on-write can limit our ability to do what-if analysis.
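To make that schema-on-read point concrete, here is a toy sketch (field names invented): the raw <question, response> pairs are kept exactly as captured, and each analysis applies its own interpretation at read time rather than forcing a fixed schema at write time.

```python
import json

# Raw outreach results kept as captured: one <question, response> pair per
# record, plus minimal context. Field names here are invented.
raw_lines = [
    '{"outreach_id": "o-1", "question": "Did you visit your physician in the last 30 days?", "response": "Yes"}',
    '{"outreach_id": "o-1", "question": "Are you taking your medication as prescribed?", "response": "No"}',
    '{"outreach_id": "o-2", "question": "Did you visit your physician in the last 30 days?", "response": "No"}',
]

# Schema-on-read: this particular analysis only cares about physician visits,
# but a different consumer could reinterpret the same raw records differently.
records = [json.loads(line) for line in raw_lines]
visit_rate = sum(
    1 for r in records
    if r["question"].startswith("Did you visit your physician") and r["response"] == "Yes"
) / len({r["outreach_id"] for r in records})

print(f"Physician-visit rate: {visit_rate:.0%}")
```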
Our strategic goals were…
To create a next generation data architecture to support continued growth and functionality, allowing for maximum flexibility in analysis and reporting
Have the ability to accept, transform and process old & new data streams
Allow for both real-time and batch processing with little human intervention
Provide metadata, catalog and data discovery for content in the data lake
Enable access to data through entitlements and governance
Increase the level of self-service enablement for end-users
MODERN DATA LAKE (1. Catalog & metadata, 2. Storage & 3. Access control)
The conceptual architecture for a Data Lake-centered platform has 3 main layers:
An ingress layer that accepts a variety of data formats along both batch and streaming pathways
A conditioning layer where data quality and governance rules can be applied, keeping corrupted or inaccurate data from entering the lake. At this layer, we can also generate and add metadata to any and all data streams. This metadata supports later search and discovery scenarios.
A consumption layer, where downstream clients can access and subsequently analyze the data
In addition to these primary layers, this canonical architecture readily supports data access control and data life-cycle management.
And…the entire Data Lake and its services are accessed through an API that provides for streamlined querying and reads.
The high-level technical architecture for the platform is based on the conceptual one we discussed a few slides back.
---
In our implementation, outreach results, e.g., call dispositions, SMS response, etc. flow into the system via AWS Kinesis Streams on a continuous, real-time basis. We perform near real-time analytics and queries on the outreach results using Kinesis Analytics, and we radiate activity and volume statistics to our Network Operations Center.
Other data, such as customer member profiles, come in through FTP and land in a “raw” S3 bucket.
The system takes the “raw” data and passes it through a conditioning layer. Here, we apply data quality rules, generate metadata for downstream data access and client consumption, and build a metadata index for rapid search and retrieval. We move the data through the system at scale using Spark jobs on EMR clusters, and store some of the by-products in DynamoDB.
After passing through the conditioning layer (which includes a HIPAA Obfuscation module not shown in the diagram for Dev & Test), the data is stored in S3 buckets that use randomized keys to achieve an operationally effective distribution of the data across partitions.
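Two of those steps can be sketched as follows (stream name, bucket, key layout, and field names are assumptions, not Eliza’s actual configuration): pushing an outreach disposition into a Kinesis stream, and writing a conditioned record to S3 under a randomized key prefix so objects spread evenly across partitions.

```python
import json
import uuid

import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")
s3 = boto3.client("s3", region_name="us-east-1")

# 1) Real-time ingestion: an outreach disposition flows into a Kinesis stream.
#    Stream and field names are illustrative only.
disposition = {
    "outreach_id": "o-12345",
    "channel": "IVR",
    "question": "Did you visit your physician in the last 30 days?",
    "response": "Yes",
}
kinesis.put_record(
    StreamName="outreach-dispositions",
    Data=json.dumps(disposition).encode("utf-8"),
    PartitionKey=disposition["outreach_id"],
)

# 2) After the conditioning layer, store the record in S3 under a randomized
#    key prefix so objects are distributed evenly across partitions.
random_prefix = uuid.uuid4().hex[:8]
s3.put_object(
    Bucket="example-conditioned-data",             # hypothetical bucket
    Key=f"{random_prefix}/outreach/o-12345.json",  # randomized prefix + logical path
    Body=json.dumps(disposition).encode("utf-8"),
)
```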
Moving to the storage layer, where the data is now “cooked”, we use DynamoDB tables to store the catalog and metadata, which provides a rapidly accessible map of the data space. IAM policies control access to the data.
Down the line, we will implement data life-cycle management policies to move data from S3 to Glacier according to customers’ and HIPAA requirements.
On the data access and consumption side, the lake serves a variety of clients. Two examples:
We make ad-hoc querying available to our analytics and data science teams using Hive and Presto.
We feed the Enterprise Data Warehouse hosted on [AWS Redshift].
Internal clients can also access data from the read-only API.
One key consumption pattern was the orchestration of data through AWS Data Pipeline, ETL’d into Eliza’s AWS Redshift-based Enterprise Data Warehouse
Orchestration tool to ETL data into Redshift (EDW)
Has proper DB schema, etc.
DERT – legacy extracts (Tableau ad-hoc questions…)... Data Extraction and Retrieval Tool
Lambda – telling Data Pipeline that something is available
S3 COPY command
Data Pipeline – orchestrates ETL and populating Redshift
2-7 GB per day is processed
Eliza – GB per day added….
Robust architecture to support data at scale
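A rough sketch of the “Lambda – telling Data Pipeline that something is available” step above (the pipeline ID, event wiring, and bucket layout are assumptions, not Eliza’s actual implementation): a Lambda function triggered by an S3 put activates the Data Pipeline that orchestrates the COPY into the Redshift EDW.

```python
import boto3

datapipeline = boto3.client("datapipeline")

# Hypothetical pipeline ID for the EDW load; in practice this could come from
# an environment variable or a configuration store.
EDW_LOAD_PIPELINE_ID = "df-EXAMPLE1234567890"

def handler(event, context):
    """Invoked by an S3 put event when a new extract lands in the raw bucket.

    It simply signals Data Pipeline that there is work to do; the pipeline
    itself orchestrates the S3 COPY into the Redshift-based EDW.
    """
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        print(f"New object s3://{bucket}/{key}; activating EDW load pipeline")

    datapipeline.activate_pipeline(pipelineId=EDW_LOAD_PIPELINE_ID)
    return {"status": "activated"}
```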
We improved…
Business agility, by allowing schema-on-read vs. traditional shaping of data in the ETL process
Our ability to manage costs, by allowing separation of the costs associated with compute and storage resources
Flexibility, by providing elasticity and enabling resources to dynamically scale up or down based on demand
Reduced data transformations and touchpoints, resulting in elimination of about 30% of operational labor costs
Cycle time for end-to-end client analytics was reduced by 50%
Customer Portal gets near real-time updates
Ad-hoc reporting tools are self-service with Presto
In the future, we are considering applying Machine Learning to data sets to determine if and when a member falls into a given set of clinical categories, helping our customers segment their populations in new and meaningful ways.
A Minimum Viable Platform (MVP) with one thread of processing end-to-end is very helpful in deriving business value soon
About
Challenges
Goals
Scholastic ran its business on old IBM AS/400 technologies (initially launched in 1988) and Microsoft SQL Server. Prior to their engagement with NorthBay, Scholastic had over 5,500 AS/400 workloads and more than 350 SQL Server workloads performing their Big Data and Analytics work. These systems grew over a 20-year period, and like most systems of that vintage, were rife with problems. They were expensive to run, unable to keep up with the business needs, and inflexible.
Scholastic knew they needed to evolve, and decided to evaluate how they could use Amazon Web Services (AWS) to evolve their Big Data and Analytics workloads. They evaluated multiple AWS Partner Network (APN) Consulting Partners, and chose to engage with NorthBay to help them meet their objectives.
Create a next generation platform that could provide stability, performance & scalability
Have the ability to accept, transform and process old and new data streams
Be able to handle real-time streaming and all of the old warehouse workloads
Raise the level of self-service enablement for end-users
Redshift was chosen to replace the AS/400 for its relational-style, high-performance data store. It is also a managed service with cost-optimization models, elastic and scalable
Redshift as the Enterprise Data Warehouse platform
S3 as the location for data migration, archival and analysis
NorthBay was chosen as the implementation partner, after talking to other vendors recommended by Amazon, for their expertise in Big Data and Redshift migrations from traditional warehouses
Data Pipeline was chosen as the orchestration service for its flexibility and scalability
EMR with Sqoop was chosen as the ingestion process for its scalability and ability to parallelize the processes
The DynamoDB NoSQL store was chosen to store metadata about the various processes. It is cost-effective and offers flexibility with schema changes
Key Management Service is used for credentials encryption for source and target data stores
SNS and Lambda are used to trigger success/failure notifications and handle them appropriately for their processes
They did not invest in a CDC/ETL tool for migrating iSeries AS/400 data
After the success of a 3 month pilot:
Project timeline accelerated from 3 years to 18 months
Handful of workloads were transformed into the new platform
Parallel work streams started utilizing same core framework
AWS skills transferred to Scholastic’s technical team
IT team understood and gained comfort with AWS
Phase 1: 3-month pilot to:
transform a handful of their workloads into the new platform (AWS/Redshift/S3)
Demonstrate to the business users funding the project that they will get better information/ knowledge to run their business with the new platform
Stand up the AWS environment to allow IT to understand and gain comfort with AWS and the Cloud
Transfer AWS skills to the publisher’s technical team
Prove that a 1 Team approach will work (Client’s team and NorthBay)
Due to the HUGE pilot success, the CEO moved budget forward to accelerate the project timeline from 3 years to 18 months (saving 18 mos. of cost on AS/400 and SQL Server)
Parallel work streams (swim lanes) were started utilizing the same core framework for migrating data for other business units
NorthBay partnered with each of those work streams to accelerate the migration
The team who developed the framework (NorthBay/Customer IT) is helping other initiatives at the customer by training, offering best practices and lessons learned around AWS, CI/CD, and running projects in an Agile manner
+ Operational Systems (POS, Customer system, etc.)
+ Control-M is their scheduler… on-prem jobs run... Manages job dependencies.. Execute job X... Migrate to Redshift
+ Custom Python framework – reads DynamoDB and creates a Data Pipeline for each table
Core Framework
Jobs and Job Groups are defined as metadata in DynamoDB
Control-M scheduler, Custom Application and Data Pipeline for Orchestration
ELT Process with EMR/Sqoop for Extraction, Redshift Load and Transform the data through SQL scripts
Core Framework allows for
Restart Capability from point of failure
Capturing of operational statistics ( # of rows updated etc.)
Audit capability (which feed caused the Fact to change etc.)
+ Data Pipeline is the job orchestration process
+ Each job creates its own pipeline .. Jobs are bundled into groups
+ DynamoDB stores the metadata (which schema, which job, job group, data source)
+ Sqoop…. Parallel reads from the mainframe ... to S3.. .. Limit of AS/400 is 16-20 parallel connections
+ Facts & Dimensions in Redshift are SQL scripts saved in S3
After the success of the 3-month pilot, the timeline was re-visited
Savings on AS/400 and SQL Server - Cost of being on 2 platforms eliminated
Consolidating logging solution across S3, Redshift, DynamoDB etc. was a challenge
Send 5 pre-scripted q&a questions to Lo, Sai & Angela for review