The “Big Data era” has ushered in an avalanche of new technologies and approaches for delivering information and insights to business users. What is the role of the cloud in your analytical environment? How can you make your migration as seamless as possible? This closing keynote, delivered by Joe Caserta, a prominent consultant who has helped many global enterprises adopt Big Data, provided the audience with the inside scoop needed to supplement data warehousing environments with data intelligence—the amalgamation of Big Data and business intelligence.
This presentation was given as the closing keynote at DBTA's annual Data Summit in NYC.
3. @joe_Caserta#DataSummit
About Joe Caserta
Launched Big Data practice
Co-author, with Ralph Kimball, The Data Warehouse ETL Toolkit (Wiley)
Data Analysis, Data Warehousing and Business Intelligence since 1996
Began consulting in database programming and data modeling
25+ years hands-on experience building database solutions
Founded Caserta Concepts in NYC
Web log analytics solution published in Intelligent Enterprise magazine
Launched Data Science, Data Interaction and Cloud practices
Laser focus on extending Data Analytics with Big Data solutions
Dedicated to Data Governance Techniques on Big Data (Innovation)
Awarded Top 20 Big Data Companies 2016
Top 20 Most Powerful Big Data consulting firms
Launched Big Data Warehousing (BDW) Meetup NYC: 2,000+ Members
Awarded Fastest Growing Big Data Companies 2016
Established best practices for big data ecosystem implementations
Timeline years shown on the slide: 1986, 1996, 2001, 2004, 2009, 2012, 2013, 2014
About Caserta Concepts
– Consulting Data Innovation
– Award-winning company
– Internationally recognized work force
– Strategy, Architecture, Implementation, Governance
– Innovation Partner
– Strategic Consulting
– Advanced Architecture
– Build & Deploy
– Leader in Enterprise Data Solutions
– Big Data Analytics
– Data Warehousing
– Business Intelligence
– Data Science
– Cloud Computing
– Data Governance
Why is Data so Important?
1500s: Printing Press
1840s: Penny Post
1850s: Telegraph
1850s: Rural Free Post
1890s: Telephone
1900s: Radio
1950s: TV
1970s: PCs
1980s: Internet
1990s: Web
2000s: Social Media, Mobile, Big Data, Cloud
Every 60 seconds:
98,000+ Tweets
695,000 Status Updates
11 million instant messages
698,445 Google searches
168 million+ emails sent
1,829 TB of data created
217 new mobile web users
Understanding the Customer
Customer journey stages: Awareness, Consideration, Purchase, Service, Loyalty, Expansion
Physical and digital touchpoints (in slide order): PR, Radio, TV, Print, Outdoor, Word of Mouth, Direct Mail, Customer Service, Search, Paid Content, email, Website/Landing Pages, Social Media, Community, Chat, Social Media, Call Center, Offers, Mailings, Survey, Loyalty Programs, email, Agents, Partners, Ads, Website, Mobile, 3rd Party Sites, Offers, Web self-service
Life As We Know It
Business: “I need to analyze some new data”
IT collects requirements
Creates normalized and/or dimensional data models
Profiles and conforms the data
Sophisticated ETL programs and quality standards
Loads it into data models
Builds a BI semantic layer
Creates dashboards and reports
IT: “You can access your data in 3-6 months to see if it has value!”
– Onboarding new data is difficult!
– Rigid Structures and Data Governance
– Disconnected/removed from business
The Problem: Shadow IT = Data Sprawl
• There is one application for every 5-10 employees generating copies of
the same files leading to massive amounts of duplicate idle data strewn
all across the enterprise. - Michael Vizard, ITBusinessEdge.com
• Employees spend 35% of their work time searching for information...
finding what they seek 50% of the time or less.
- “The High Cost of Not Finding Information,” IDC
The New Data Paradigm
OLD WAY:
• Structure Data → Ingest Data → Analyze Data
• Fully Governed
• Monolith
NEW WAY:
• Ingest Data → Analyze Data → Structure Data
• Just Enough Governance
• Dynamic
RECIPE:
• Data Officer & Data Organization
• Enterprise Data Lake
• Corporate Data Pyramid
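The "new way" ordering (ingest, analyze, then structure) can be sketched in a few lines. This is only an illustration, assuming newline-delimited JSON as the raw format; the sample records, field names, and the `ingest_raw`, `explore`, and `read_with_schema` helpers are hypothetical and do not come from the talk. Raw records are landed untouched, explored as they are, and a structure is applied only at read time.

```python
import json
from collections import Counter

# Hypothetical raw events; field names and values are illustrative only.
RAW_LINES = [
    '{"user": "a1", "event": "click", "ts": "2000-01-01T10:00:00"}',
    '{"user": "b2", "event": "view"}',
    '{"user": "a1", "event": "click", "amount": "19.99"}',
]


def ingest_raw(lines):
    """Ingest: land every record as-is, with no upfront data model."""
    return [json.loads(line) for line in lines]


def explore(records):
    """Analyze: count which fields actually occur in the raw data."""
    return Counter(key for rec in records for key in rec)


def read_with_schema(records, schema):
    """Structure: apply a schema late, at read time.

    `schema` maps a field name to a casting function; a missing
    field becomes None instead of causing the record to be rejected.
    """
    return [
        {name: (cast(rec[name]) if name in rec else None)
         for name, cast in schema.items()}
        for rec in records
    ]


lake = ingest_raw(RAW_LINES)
field_counts = explore(lake)
structured = read_with_schema(lake, {"user": str, "amount": float})
```

In the "old way", the equivalent of `read_with_schema` would run before anything is stored, so the record without an `amount` would have to be modeled or rejected before anyone could look at it.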
Big Data Analysis: The Ecosystem of the Future
Cloud-based Data Lake
Ingest: Data Integration, Identity Resolution, Data Quality
Analyze: Discovery, Exploration, Machine Learning, Models Development
Persist: Structured Data, Unstructured Data (SQL, NoSQL, Object Store)
Deploy: Reports / Dashboards, Applications, APIs
Find, Share, Collaborate
Roles: Data Engineer, Data Scientist, Business Analyst, App Developer
Business Value: Provides innovative, industry-leading technologies that can be applied rapidly to the business without having to manage compatibility and data complexity.
Technical Value: Provides an open framework that reduces the number of integration points and testing environments needed to deliver business solutions.
Corporate Data Pyramid (CDP)
Usage Pattern: Ingest Raw Data; Organize, Define, Complete; Munging, Blending, Machine Learning; Arbitrary/Ad-hoc Queries and Reporting
Data Governance: Metadata, ILM, Security; Data Catalog; Data Quality and Monitoring; Data Integration; Fully Governed (trusted)
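One way to read the pyramid is as a ladder of governance requirements that a dataset climbs as more controls are applied. The sketch below is a loose illustration, not the speaker's implementation: the tier names follow the slides, but which controls are attached to which tier, the control labels, and the `Dataset` and `highest_tier` names are all assumptions made for the example.

```python
from dataclasses import dataclass, field

# Tier names follow the slides; the controls required at each tier
# are an assumed, illustrative reading of the pyramid.
TIERS = [
    ("Landing Area", {"metadata", "ilm", "security"}),
    ("Data Lake", {"metadata", "ilm", "security", "catalog"}),
    ("Data Science Workspace",
     {"metadata", "ilm", "security", "catalog", "quality_monitoring"}),
    ("Big Data Warehouse",
     {"metadata", "ilm", "security", "catalog", "quality_monitoring",
      "integration"}),
]


@dataclass
class Dataset:
    name: str
    controls: set = field(default_factory=set)


def highest_tier(dataset):
    """Return the highest tier whose required controls are all met.

    Tiers are checked bottom-up and the climb stops at the first tier
    whose requirements are not satisfied. Returns None when even the
    Landing Area requirements are unmet.
    """
    reached = None
    for tier, required in TIERS:
        if required <= dataset.controls:
            reached = tier
        else:
            break
    return reached


raw_feed = Dataset("weblogs", {"metadata", "ilm", "security"})
curated = Dataset("sales", set(TIERS[-1][1]))
```

Here `highest_tier(raw_feed)` stays at the Landing Area, while `curated` qualifies for the fully governed top tier. Every tier carries some governance, but only the top tier requires all of it.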
The Data Lake on the Cloud
Cloud Component | AWS | Google | Microsoft
Scalable distributed storage | S3 | GCS | Azure Storage
Pluggable fit-for-purpose processing | EMR | Dataproc | HDInsight
Compute Services | EC2 | GCE | VMs
Consistent extensible framework | Spark | Spark | Spark
Dimensional MPP Data Warehouse | Redshift | BigQuery | Azure SQL Data Warehouse
Data Streaming | Kinesis | Pub/Sub | Azure Stream
Common Interface | Jupyter | Datalab | Azure Notebook
• Remove barriers between data ingestion and analysis
• Democratize data with Just Enough Data Governance (JEDG)
The Clouds Coalesce
Percent of organizations with AWS as primary, also uses GCP
Percent of organizations with AWS as primary, also uses Azure
Percent of organizations with GCP as primary, also uses AWS
Values shown on the chart: 41%, 32%, 31%
Source: Clutch, 2016
Why Spark?
• Development is identical whether local or distributed
• Beautiful high-level APIs
• Full universe of Python modules
• Open source and free
• Blazing fast!
Spark has become our default processing engine for data engineering & science
Analytics Development Lifecycle
• Data science is performed in ephemeral workspaces
• The work products of data science are promoted from “insights” to real applications
• Rigorous Data Governance is applied
• Processes must be hardened, repeatable, and performant
Tiers shown: Big Data Warehouse; Data Science Workspace; Data Lake – Integrated Sandbox; Landing Area – Source Data in “Full Fidelity”
Diagram labels: New Data, New Insights, Governance, Refinery
Moving the Status Quo
Forces for Change:
• Global economics
• Intensity of competition
• Reduce costs
• Move to cross-functional teams
• New executive leadership
• Speed of technical change
• Social trends and changes
Forces Resisting Change:
• Period of time in present role
• Status & perks of office/dept under threat
• No apparent reasons for proposed changes
• Lack of understanding of proposed changes
• Fear of inability to cope with new technology
• Concern over job security
Source: http://www.change-management-coach.com/force-field-analysis.html
Introducing the Chief Data Officer
• Evangelize a data vision for the organization
• Support & enforce data governance policies via outreach, training & tools
• Monitor and enforce data quality in collaboration with data owners
• Monitor and enforce data security along with Legal/Security/Compliance
• Work with IT to develop/maintain an enterprise repository of strategic data
• Set standards for analytical reporting and generate data insights
• Provide a single point of accountability for data initiatives and issues
• Innovate ways to use existing data
• Enrich and augment data by combining internal and external sources
• Support efficient and agile analytics through training and templates
The CDO: The Whole Brain Challenge
Front
Back
Analytics Oriented
• Data Science
• Research
Process Oriented
• Data Governance
• Compliance
Operations Oriented
• Shared Services
• Data Engineering
Revenue Oriented
• Revenue Goals
• Monetizing Data
Agile Data Teams
Chief Data Organization (Oversight)
Vertical Business Area [Sales/Finance/Marketing/Operations/Customer Svc]: Product Owner, SCRUM Master, Agile Development Team
Agile Development Team skills: Business Subject Matter Expertise, Data Librarian/Data Stewardship, Data Science/Statistical Skills, Data Engineering/Architecture, Presentation/BI Report Development Skills, Data Quality Assurance, DevOps
IT Organization (Oversight): Enterprise Data Architect, Solution Engineers, Data Integration Practice, User Experience Practice, QA Practice, Operations Practice
Advanced Analytics: Business Analysts, Data Analysts, Data Scientists, Statisticians, Data Engineers
Planning Organization: Project Managers
Data Organization: Data Gov Coordinator, Data Librarians, Data Stewards
Caution: Assembly Required
Some of the most hopeful tools are brand new or in incubation
Enterprise big data implementations typically combine products with custom built components
The Buildout
People, Processes and Business commitment are still critical!
Data Integration & Quality Data Catalog & Governance Emerging Solutions
What the Future Holds
• DevOps for Analytics
• Search-Based BI (NLP)
• Artificial Intelligence (AI)
• Virtual Reality BI (VR)
• Virtual Assistant BI (Voice)
• Reporting/Predictions Converge
• Citizen Data Scientists Emerge
Capture, analyze, influence, and maximize every touchpoint, online and offline
Ask DG effectiveness questions.
Recent article - Oct 21, 2015
80% of all businesses are doing something
The paradigm shift is in the way we onboard and process data:
Formerly, we structured data before we would ingest and analyze it. Now, we ingest and analyze data, and then structure it.
This allows immediate access for both analysts and data scientists
Streamlines the path to cash register
We have also moved from fixed capacity to on-demand infrastructure
Large datasets and new datasets are being added at a rapid rate
They could grow or shrink on demand; many of the providers are startups
This minimizes the cost of operation
From Monolith to Ecosystem
No one set of tools will solve everything
Use a diverse set of technologies, and let them evolve over time
Solve for this using a combination of three concepts:
Cloud Computing, Data lake, and the Polyglot Warehouse.
Data has a different audience and usage pattern in each tier.
All tiers work cohesively to comprise the Big Data Ecosystem
All tiers are governed. Only the top tier is fully governed
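Use late binding: decide when to structure the data on a case-by-case basis.
Seven components of governance: Organization, Metadata, Privacy/Security, Data Quality, Business Process Integration, Master Data Management (MDM), Information Lifecycle Management (ILM)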
Organization: This is the ‘people’ part: establishing an Enterprise Data Council, Data Stewards, etc.
Metadata: Definitions, lineage (where does this data come from), business definitions, technical metadata
Privacy/Security: Identify and control sensitive data, regulatory compliance
Data Quality and Monitoring: Data must be complete and correct. Measure, improve, certify
Business Process Integration: Policies around data frequency, source availability, etc.
Master Data Management: Ensure consistent business-critical data, e.g. Members, Providers, Agents, etc.
Information Lifecycle Management (ILM): Data retention, purge schedule, storage/archiving
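The "measure, improve, certify" loop under Data Quality and Monitoring can be made concrete with a simple completeness metric. This is a minimal sketch under stated assumptions: the member records, the `completeness` and `certify` helpers, and the 0.95 threshold are hypothetical, and real monitoring would also cover correctness, not only completeness.

```python
def completeness(records, required_fields):
    """Measure: share of records with a non-empty value, per field."""
    total = len(records)
    if total == 0:
        return {name: 0.0 for name in required_fields}
    return {
        name: sum(1 for rec in records
                  if rec.get(name) not in (None, "")) / total
        for name in required_fields
    }


def certify(scores, threshold=0.95):
    """Certify: pass only when every field meets the threshold.

    The 0.95 default is an arbitrary illustrative value.
    """
    return all(score >= threshold for score in scores.values())


# Hypothetical member records; two of four are missing a zip code.
members = [
    {"member_id": "m1", "zip": "10001"},
    {"member_id": "m2", "zip": ""},
    {"member_id": "m3", "zip": "10003"},
    {"member_id": "m4", "zip": None},
]
scores = completeness(members, ["member_id", "zip"])
```

With these records, `member_id` is fully populated while `zip` is only half complete, so `certify(scores)` fails until the missing values are fixed at the source and the measurement is rerun.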
“Big Box” tools vs ROI?
Prohibitively expensive, limited by licensing costs
Typically limited to the scalability of a single server
Cascading, Zementis
I’ve been doing it this way for 15 years. It works, don’t mess with it! People must learn: Evolution is inevitable. Evolve or die.
Kurt Lewin’s Force Field analysis
Data Governance
Data Insight
Generate Revenue
Reduce Risk
Over the course of my 30-year career, more change has occurred in the last three years than in the previous 27 combined. This has been the most disruptive period in data science that I’ve seen.