This document discusses the shifting landscape of data integration. It opens with an introduction by William McKnight, described by Onalytica as the "#1 Global Influencer in Data Warehousing," and explains how data integration challenges are shifting from the traditional 3Vs (volume, velocity, and variety) to what IDC calls the 3Ds: dynamic, distributed, and diverse data in the cloud. The remainder of the document discusses Matillion, a vendor that provides a modern solution for cloud data integration challenges.
The Shifting Landscape of Data Integration
1. The Shifting Landscape of Data Integration
Presented by: William McKnight
“#1 Global Influencer in Data Warehousing” (Onalytica)
President, McKnight Consulting Group (an Inc 5000 company)
@williammcknight
www.mcknightcg.com
(214) 514-1444
16. Performance
• Performance is a critical point of interest when selecting an analytics platform.
• To measure data warehouse performance, we use similarly priced configurations across competing data warehouses.
• Usually when people say they care about performance, what they ultimately care about is the price/performance metric (see the sketch below).
• The realities of creating fair tests can be overwhelming to many shops, and the task is usually underestimated.
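To make the price/performance point concrete, here is a minimal sketch assuming hypothetical hourly rates and workload runtimes; it is the cost to complete a fixed workload, not raw speed or raw price, that decides the comparison.

```python
# Minimal sketch: price/performance is the cost to complete a fixed
# workload. Rates and runtimes below are hypothetical, not benchmark data.
platforms = {
    "Platform A": {"price_per_hour": 16.00, "workload_hours": 2.0},
    "Platform B": {"price_per_hour": 8.00, "workload_hours": 4.5},
}

for name, p in platforms.items():
    cost = p["price_per_hour"] * p["workload_hours"]
    print(f"{name}: ${cost:.2f} to complete the workload")

# Platform A charges more per hour but completes the workload for less
# ($32.00 vs. $36.00), which is why price/performance is the metric
# that matters.
```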
17. Cost Predictability and Transparency
• The cost profile options for cloud databases are straightforward if you accept the defaults for simple workloads or proof-of-concept (POC) environments.
• Low initial entry costs and inadequately scoped environments can artificially lower expectations of the true costs of jumping into a cloud data warehouse environment.
• On some platforms, you pay for compute resources as a function of time, and you choose the hourly rate based on the enterprise features you need.
• On other platforms, you pay for bytes processed and the underlying architecture is unknown; the environment scales automatically without affecting price. There are also flat cost-per-hour rates, where you must estimate how long your queries take to run to completion in order to predict costs.
• Customers need to analyze current workloads, performance, and concurrency, and project those into realistic pricing on alternative platforms (a simple sketch follows).
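As a rough illustration of why these pricing models must be projected against real workloads, here is a hedged sketch comparing the three models described above; all rates and usage figures are hypothetical.

```python
# Minimal sketch: estimating monthly cost under the three pricing models
# described above. All rates and usage figures are hypothetical.
hours_per_month = 200        # compute-hours the workload actually consumes
tb_scanned_per_month = 50    # terabytes scanned by queries

def per_hour_cost(rate_per_hour):
    # Pay for compute as a function of time; the hourly rate depends on
    # the enterprise features selected.
    return rate_per_hour * hours_per_month

def per_byte_cost(rate_per_tb):
    # Pay for bytes processed; the underlying architecture is unknown
    # and automatic scaling does not change the price.
    return rate_per_tb * tb_scanned_per_month

def flat_rate_cost(rate_per_hour, estimated_runtime_hours):
    # Flat cost-per-hour: predicting cost requires estimating how long
    # the queries take to run to completion.
    return rate_per_hour * estimated_runtime_hours

print(f"time-based:  ${per_hour_cost(32.00):,.2f}")
print(f"bytes-based: ${per_byte_cost(5.00):,.2f}")
print(f"flat-rate:   ${flat_rate_cost(40.00, 150):,.2f}")
```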
18. Administration
• Overall costs, time, and storage and compute resources are all affected by how simple the platform is to configure and use.
• The platform should have embraced a self-sufficiency model for its customers and be well into the process of automating repetitive tasks.
• Easy administration starts with a setup process that asks for basic information and provides helpful guidance for selecting storage and node configurations.
• The data store should support mission-critical business applications with minimal downtime.
19. Optimizer
• You may need:
• Conditional parallelism, with visibility into what causes variations in the parallelism deployed.
• Dynamic and controllable prioritization of resources for queries.
• The time and requirements for keeping queries optimal, such as building indexes or updating statistics.
• Workload isolation capabilities.
20. Concurrency Scaling
• If the database has concurrency limitations, designing around them is difficult at best and limits effective data usage.
• If the data store automatically scales up to overcome concurrency limitations, this may be costly if the data warehouse charges by compute node.
• If the data store charges per user, costs will also increase as the data warehouse is put to more use in the company.
• The workload may need linear scaling in overall query workload performance as concurrent users are added.
21. Resource Elasticity
• You may need the ability to scale up and down and
take advantage of the elastic compute and storage
capabilities in the cloud, public or private, without
disruption or delay.
• The more the customer needs to be involved in
resource determination and provisioning, the less
elastic, and less modern, the solution is.
• One thing to watch for in elastic scaling is whether the amount of money spent remains under the customer's control.
22. Machine Learning
• Today, some data query languages need to be extended to include machine learning, or firms may find the programming required too challenging to keep pace.
• Data stores today may need to weave machine learning into their data processing workflows.
• Vendors must accommodate and extend SQL to include machine learning functions and algorithms to expand the capabilities of those tools and users (see the sketch after this list).
• If your database does not include machine learning, there are many extra things to be concerned with.
• Other components will be needed to complete the toolbox and get the job done.
• Ideally, security for machine learning will be the same as database security.
• The data warehouse also needs to be able to operate at scale, beyond sampling.
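As one concrete example of SQL extended with machine learning, here is a sketch using BigQuery ML-style syntax; the dataset, table, and column names are hypothetical, and other vendors expose similar but not identical extensions.

```python
# Minimal sketch: machine learning invoked from SQL, using BigQuery ML-style
# syntax as one example of the vendor SQL extensions described above.
# Dataset, table, and column names are hypothetical.
from google.cloud import bigquery

client = bigquery.Client()  # assumes credentials are already configured

# Train a model with an extended SQL statement; security and access
# controls are the same as for any other query.
client.query("""
    CREATE OR REPLACE MODEL demo.churn_model
    OPTIONS (model_type = 'logistic_reg') AS
    SELECT tenure_months, monthly_spend, churned AS label
    FROM demo.customers
""").result()

# Score new rows in place, at full scale rather than on a sample.
rows = client.query("""
    SELECT customer_id, predicted_label
    FROM ML.PREDICT(MODEL demo.churn_model,
                    (SELECT * FROM demo.new_customers))
""").result()
```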
23. Data Storage Format Alternatives
• Cloud object storage is relatively inexpensive, making data storage at high scale affordable.
• On-premises, specialized private cloud storage options such as Pure Storage FlashBlade tend to offer similar data type storage flexibility.
• To take full advantage of the elasticity of the cloud without driving up costs, data warehouse compute and storage need to be scaled separately.
• To take full advantage of the many types of data available, such as Apache ORC, Apache Parquet, JSON, and Apache Avro, modern data warehouses need to be able to analyze that data without moving or altering it.
• A unified analytics warehouse that supports these various data formats gives you the benefit of querying them directly, without first expanding hierarchical complex data types into a standard tabular structure for analysis (sketched below).
• You should also be able to import data directly from these formats.
• The ability to join data for analysis across the various internal and external data formats provides the highest level of analytic flexibility.
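As a small illustration of querying open formats in place, here is a sketch using DuckDB as a stand-in for a unified analytics warehouse; the file and column names are hypothetical.

```python
# Minimal sketch: querying open storage formats in place and joining
# across them, without converting to a tabular staging copy first.
# File and column names are hypothetical.
import duckdb

result = duckdb.sql("""
    SELECT o.customer_id, c.region, SUM(o.amount) AS total
    FROM 'orders.parquet' AS o                 -- Parquet scanned directly
    JOIN read_json_auto('customers.json') AS c -- JSON parsed on the fly
      ON o.customer_id = c.customer_id
    GROUP BY o.customer_id, c.region
""").fetchall()

# Direct import from the same formats is equally simple:
duckdb.sql("CREATE TABLE orders AS SELECT * FROM 'orders.parquet'")
```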
25. Data Integration Product #Goals
• Cloud Native
• Intelligently driven automation that suggests and generates new data pipelines between source and target without manual mapping or design
• Data Orchestration
– Managing the ebb and flow of data throughout the ecosystem
– Determining what data is analyzed/stored at the origin
– Determining what data is moved upstream, and at what granularity and state
– Moving, integrating, and updating data, metadata, master data, and machine learning models as evolution happens at the core
• Trust created through transparency and understanding
– Modern data management and integration platforms give organizations holistic views of their enterprise data and deep, thorough lineage paths showing how critical data traces back to a trusted, primary source
• Able to dynamically, and even automatically, scale to meet the increasing complexity and concurrency demands of query executions
27. Data Lineage Stakeholders and Use Cases
• Legislative Bodies and Compliance
• Data Discovery Initiatives
• Impact Analysis
• Audit Functions
• Data Quality Programs
• Project Sponsors and Managers
• Data Architects
• Training and Onboarding Staff
• Root Cause Analysis
28. Data Lineage Requirements
• Graphical Representation
• Impact Analysis
• Root Cause Analysis
• Extend to Non-Standard or Custom Sources
• General Accessibility to Lineage and Metadata
29. Data Integration Requirements
• Comprehensive Native Connectivity
• Multi-Latency Data Ingestion
• Data Integration Patterns
• Data Quality and Data Governance
• Data Cataloging and Metadata Management
• Enterprise Trust
• Artificial Intelligence and Automation
• Ecosystem and Multi-Cloud
30. Data Prep Requirements
• Designed as a low-setup, no-configuration, no-maintenance data pipeline to lift data from operational sources and deliver it to a modern cloud data warehouse
• Well suited for popular cloud applications like Salesforce, Stripe, and Marketo
• Transformations are SQL-based rules written by business users and set to run on a schedule or as new data becomes available (see the sketch after this list)
• Look for
– A migrate function to move or clone business logic and resources, change data capture, queue messaging, and pub-sub
– Dynamic ETL (such as job variables and iterators) plus data orchestration and staging mechanisms (such as functions to make real-time adjustments if jobs run long or fail)
– Automation in pipeline building
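Here is a minimal sketch of a business-user-authored, SQL-based transformation rule set to run on a schedule; the rule text and table names are hypothetical, and the `schedule` PyPI library stands in for the pipeline's own scheduler.

```python
# Minimal sketch: a SQL-based transformation rule run on a schedule.
# Table names are hypothetical; `schedule` (PyPI) stands in for the
# data pipeline's own scheduler.
import time
import schedule

RULE = """
    INSERT INTO analytics.daily_revenue
    SELECT order_date, SUM(amount)
    FROM staging.orders
    GROUP BY order_date
"""

def run_rule():
    # In a real pipeline this submits RULE to the warehouse client,
    # with retries and alerting if the job runs long or fails.
    print("executing:", RULE)

# Run nightly, or trigger as new data becomes available.
schedule.every().day.at("02:00").do(run_rule)

while True:
    schedule.run_pending()
    time.sleep(60)
```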
31. Application Programming Interfaces
• A ubiquitous method and de facto standard of communication among modern information technologies.
• APIs have begun to replace older, more cumbersome methods of information sharing with lightweight endpoints.
• Due to the popularity and proliferation of APIs and microservices, the need has arisen to manage the multitude of services a company relies on, both internal and external.
– APIs themselves vary greatly in protocols, methods, authorization/authentication schemes, and usage patterns.
– Additionally, IT needs greater control over their hosted APIs, such as rate limiting (sketched after this list), quotas, policy enforcement, and user identification, to ensure high availability and prevent abuse and security breaches.
– Exposing APIs opens the door to many partners who can co-create and expand the core platform without knowing anything about the underlying technology.
• Organizations depend on these services to be properly managed, with high performance and availability.
– Many companies experience workloads of more than 1,000 transactions per second on their API endpoints.
– For these organizations, the need for performance is as pressing as the need for management, because they rely on these API transaction rates to keep up with the speed of their business.
– Accordingly, many companies are looking for a solution to load balance across redundant API endpoints and enable high transaction volumes.
– For a financial institution, 1,000 transactions per second translates to 86.4 million API calls in a single 24-hour day (1,000 × 86,400 seconds).
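To ground the gateway controls mentioned above, here is a minimal token-bucket rate limiter sketch; the thresholds are illustrative, and a production gateway would enforce this per user or per API key.

```python
# Minimal sketch: token-bucket rate limiting, one of the API-gateway
# controls (rate limits, quotas) described above. Thresholds are
# illustrative.
import time

class TokenBucket:
    def __init__(self, rate_per_sec: float, burst: int):
        self.rate = rate_per_sec      # tokens replenished per second
        self.capacity = burst         # maximum burst size
        self.tokens = float(burst)
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

limiter = TokenBucket(rate_per_sec=1000, burst=100)  # ~1,000 TPS allowed

def handle_request(request_id: str) -> int:
    if not limiter.allow():
        return 429  # Too Many Requests
    return 200
```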
32. API & Microservices Ecosystem
[Diagram: the API ecosystem spans public APIs (over 20,000 listed at https://www.programmableweb.com/apis/directory), private external APIs exposed to partners, and private internal APIs connecting apps and data.]
34. API Requirements
• Performance: Good for high-performance workloads (>1,000 TPS)
• Reliability: All workloads completed with 100% message completion
• Complexity: Multiple plugins enabled
36. ETL is Insufficient for this Combination
• Data platforms operating at an enterprise-wide scale
• A high variety of data sources
• Real-time/streaming data
• ETL forces a choice between real-time loading that does not scale and scalable batch loading
– Data produced from numerous sources is a torrent of flowing information that needs to be timestamped, dispatched, and even duplicated (to protect against data loss)
– A "postman" is needed to distribute data from message senders to receivers, at the right place at the right time.
37. Real-Time Data
• A.k.a. messaging, live feeds, real-time, event-driven
• Comes in continuously and often quickly, so we call it
streaming data.
• Needs special attention and can be of immense value, but
only if we are alerted in time.
• Foundation for Artificial Intelligence
– Streaming data forms the core of the data used for artificial intelligence
38. Enter Message-Oriented Middleware, a.k.a. Streaming and Message Queuing Technology
• Messages can be any kind of data wrapped in a neat package
with a very simple header as a bow on top.
• Messages are sent by “producers”—systems, sensors, or devices
that generate the messages—toward a “broker.”
• A broker does not process the messages, but instead routes
them into queues according to the information enclosed in the
message header or its own routing process.
• Then “consumers” retrieve the messages from the queues to
which they subscribe (although sometimes messages are pushed
to consumers rather than pulled).
• The consumers open the messages and perform some kind of
action on them.
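Here is a minimal in-process sketch of the producer, broker, queue, and consumer roles just described; real middleware adds persistence, distribution, and delivery guarantees on top of this shape.

```python
# Minimal sketch of the producer -> broker -> queue -> consumer flow
# described above, using in-process queues. Real middleware adds
# persistence, distribution, and delivery guarantees.
import queue

class Broker:
    def __init__(self):
        self.queues = {}  # routing key -> queue of messages

    def route(self, message: dict):
        # The broker does not process the message; it routes it to a
        # queue based on information in the header.
        key = message["header"]["routing_key"]
        self.queues.setdefault(key, queue.Queue()).put(message)

    def consume(self, key: str) -> dict:
        # Consumers pull from the queues they subscribe to.
        return self.queues[key].get()

broker = Broker()

# A producer (system, sensor, or device) wraps data in a simple header.
broker.route({"header": {"routing_key": "sensors"}, "body": {"temp_c": 21.4}})

# A consumer retrieves the message and performs some action on it.
msg = broker.consume("sensors")
print("consumed:", msg["body"])
```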
39. Performance and Scalability in Streaming
• Storage: the ability to retain varying volumes of messages for varying lengths of time
• Throughput: a high, sustainable rate of message processing
• Latency: fast, consistent responsiveness for publishing and consumption
• Operations: minimizing the operational burden of scaling, tuning, and monitoring
41. Apache Kafka
• Open source streaming platform originally developed at LinkedIn
• A distributed publish-subscribe messaging system that maintains feeds of messages called topics
– Publishers write data to topics and subscribers read from topics
– Kafka topics are partitioned and replicated across multiple broker nodes in the Kafka cluster
• Enables “source to sink” data pipelines
• Kafka messages are simple byte arrays that can store objects in virtually any format, with a key attached to each message; JSON is common
• E&L in ETL through the Kafka Connect API
• T in ETL through the Kafka Streams API
• Fault-tolerant
• DIY
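A minimal sketch of Kafka's publish-subscribe flow using the kafka-python client library; the broker address and topic name are hypothetical.

```python
# Minimal sketch of Kafka's publish-subscribe model using the
# kafka-python client. Broker address and topic name are hypothetical.
import json
from kafka import KafkaProducer, KafkaConsumer

# Producer: write JSON-encoded messages (byte arrays on the wire) to a topic.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send("events", {"sensor": "s1", "temp_c": 21.4})
producer.flush()

# Consumer: subscribe to the topic and read from its partitions.
consumer = KafkaConsumer(
    "events",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)
for record in consumer:
    print(record.value)
    break
```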
42. Apache Pulsar
• Originally developed at Yahoo
• Began its incubation at Apache in late 2016
• Had been in production at Yahoo for 3 years prior, utilized in popular services and applications like Yahoo! Mail, Finance, Sports, Flickr, Gemini Ads, and Sherpa
• Follows the publisher-subscriber (pub-sub) model, with the same producers, topics, and consumers as some of the aforementioned technologies
• Uses built-in multi-datacenter replication
• Architected for multi-tenancy, using the concepts of properties and namespaces
– A property could represent all the messaging for a particular team, application, product vertical, etc.
– A namespace is the administrative unit where security, replication, configurations, and subscriptions are managed
– At the topic level, messages are partitioned and managed by brokers using a user-defined routing policy (such as single, round robin, hash, or custom), granting further transparency and control in the process
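A minimal sketch using the pulsar-client Python library; the service URL and the property/namespace/topic path are hypothetical.

```python
# Minimal sketch of Pulsar's pub-sub model using the pulsar-client
# library. Service URL and topic path are hypothetical.
import pulsar

client = pulsar.Client("pulsar://localhost:6650")

# Topics live under a property (tenant) and namespace, the administrative
# unit for security, replication, and configuration.
topic = "persistent://my-property/my-namespace/events"

producer = client.create_producer(topic)
producer.send(b"hello streaming")

consumer = client.subscribe(topic, subscription_name="analytics")
msg = consumer.receive()
print(msg.data())
consumer.acknowledge(msg)

client.close()
```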
43. Workloads are Distinguished by
• The number of topics
• The size of the messages being produced and consumed
• The number of subscriptions per topic
• The number of producers per topic
• The rate at which producers produce messages (per
second)
• The size of the consumer’s backlog (in gigabytes)
• The total duration of the test (in minutes)
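These dimensions map naturally onto a benchmark workload definition; here is a hedged sketch in the spirit of OpenMessaging-style workload files, with illustrative values.

```python
# Minimal sketch: the workload dimensions listed above captured as a
# benchmark definition. All values are illustrative.
workload = {
    "topics": 10,
    "message_size_bytes": 1024,
    "subscriptions_per_topic": 2,
    "producers_per_topic": 4,
    "producer_rate_per_sec": 50_000,
    "consumer_backlog_gb": 0,      # 0 = consumers keep up in real time
    "test_duration_minutes": 15,
}

# Aggregate message volume implied by the definition:
total_msgs = (workload["topics"] * workload["producers_per_topic"]
              * workload["producer_rate_per_sec"]
              * workload["test_duration_minutes"] * 60)
print(f"{total_msgs:,} messages produced over the test")
```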
44. Key Takeaways & Application to the Enterprise
• An enterprise has many different types of data stores as well as many data stores of the same type
– This extends to clouds
• The reasons for this are many
– Performance
– Cost Predictability and Transparency
– Administration
– Optimizer
– Concurrency
– Resource Elasticity
– Machine Learning Capabilities
– Data Storage Format Alternatives
• It is fully expected that enterprise environments will be heterogeneous, with multiple vendors represented
• APIs have begun to replace older, more cumbersome methods of information sharing with lightweight endpoints.
• Streaming and message queuing have lasting value to organizations.
• Streaming and messaging will be able to meet the real-time data volume, variety, and timing
requirements of the coming years.