This document discusses operational data integration and recommends using open source tools rather than hand-coding integration. It outlines common data integration models like consolidation, propagation, and federation. Hand-coding integration is not recommended due to high maintenance costs, while proprietary tools can be costly to license. Open source tools provide flexibility, lower costs, and avoid vendor lock-in compared to hand-coding or proprietary options for operational data integration projects.
How to Use the Right Tools for Operational Data Integration
1. How to Use the Right Tools for Operational Data Integration
Mark R. Madsen – March, 2009
http://ThirdNature.net
Attribution-NonCommercial-No Derivative Works
http://creativecommons.org/licenses/by-nc-nd/3.0/us/
2. What We’re Asked For
(simulation)
Slide 2
March 2009 Mark R. Madsen
3. How It Makes Us Feel
4. How We Want to Feel
5. Spending Priorities in IT
In 2007 and 2008 this is where the money went…
but you can’t do most of these without data integration.
Source: CIO Insight
6. Technology Priorities in IT
Data integration moved up to the #3 spot for CIOs in 2008
Source: CIO Insight
7. The Cost Problem Management Reacts To
Source: IDC
8. Where We Often Are Today: Point to Point
Typical scenario:
• Disparate data
• Heterogeneous sources
• Point integration
• Minimal reuse
• No tools
Source environments: databases, documents, flat files, XML, services, ERP, applications
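A rough illustration of why point integration with minimal reuse scales poorly. The arithmetic is not from the slides: it assumes the worst case, where every pair of systems needs its own interface, and compares that with one connection per system to a shared integration layer. The function names are hypothetical.

```python
def point_to_point_interfaces(n_systems: int) -> int:
    """Worst case: every pair of systems gets its own hand-built interface."""
    return n_systems * (n_systems - 1) // 2


def hub_interfaces(n_systems: int) -> int:
    """Each system connects once to a shared integration layer."""
    return n_systems


if __name__ == "__main__":
    # Treat the seven source types named on the slide as seven systems:
    # 21 pairwise interfaces versus 7 connections to a shared layer.
    for n in (3, 7, 20):
        print(n, point_to_point_interfaces(n), hub_interfaces(n))
```

Real environments rarely connect every pair, so the pairwise figure is an upper bound, but the growth rate is what makes reuse matter.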
9. The Desired Future State
“Data as a platform” provides:
• Standards-based interfaces
• Single views of disparate source data
• Single point of access / integration
• Reuse of data
…but you can’t achieve this by writing more application code
[Diagram: a data platform layered over the source environments: databases, documents, flat files, XML, services, ERP, applications]
10. Application versus Data Integration
Application Integration                            | Data Integration
Managing the flow of events                        | Managing the flow of data and access
Standardizes the transaction or service            | Standardizes the data
Tools abstract the transport and system endpoints  | Tools abstract the transport, system, representation and manipulation
Must write code at endpoints to manipulate data    | Data structure, format and manipulation is abstracted
Focus on code - data as a byproduct                | Focus on data - data as the product
Reusable functions, not data                       | Reusable data, not functions
11. Analytic versus Operational Data Integration
Analytic                                                    | Operational
Most of a BI project’s effort is spent on data integration  | Most of an application project is focused on features, not DI
Many disparate sources                                      | One or a few sources
Generally unidirectional                                    | One-way or bidirectional
Large data volumes                                          | Large data volume for some, small volume for others
Usually loaded daily                                        | Often loaded more often, varies based on project type
Low concurrency                                             | Low to high concurrency
High latency                                                | Low to high latency
12. Architectural Models for Data Integration
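[Figure: two-axis chart. One axis is the data access model, ranging between physical and virtual. The other axis is control, ranging between distributed and centralized.]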
13. Consolidation
Common operational DI scenarios where this model is appropriate:
• Migrations
• Upgrades
• Consolidations
• Managing master / reference data
Characteristics:
• Large data volumes to move or access
• One-time data movement
• Usually unidirectional
• Transformation or cleansing required
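A minimal sketch of the consolidation characteristics above: a one-time, one-way load from several sources into one target, with cleansing on the way in. It uses Python's standard sqlite3 module with in-memory databases. The `customer` table, its columns, the cleansing rules, and the "first source wins" policy are all assumptions made for illustration, not something the slides prescribe.

```python
import sqlite3


def make_source(rows):
    """Build a throwaway in-memory source database for the demo."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE customer (name TEXT, email TEXT)")
    db.executemany("INSERT INTO customer VALUES (?, ?)", rows)
    return db


def consolidate(sources, target):
    """One-time, one-way load of customer rows from several sources into one target."""
    target.execute(
        "CREATE TABLE IF NOT EXISTS customer (email TEXT PRIMARY KEY, name TEXT)"
    )
    for src in sources:
        for name, email in src.execute("SELECT name, email FROM customer"):
            # Cleansing: trim whitespace and normalize case so duplicates match.
            clean_email = email.strip().lower()
            clean_name = " ".join(name.split()).title()
            # The first source to supply an email wins; later duplicates are skipped.
            target.execute(
                "INSERT OR IGNORE INTO customer (email, name) VALUES (?, ?)",
                (clean_email, clean_name),
            )
    target.commit()


if __name__ == "__main__":
    legacy = make_source([("ada  lovelace", " Ada@Example.com ")])
    crm = make_source([("Ada Lovelace", "ada@example.com"),
                       ("grace hopper", "grace@example.com")])
    target = sqlite3.connect(":memory:")
    consolidate([legacy, crm], target)
    print(target.execute("SELECT email, name FROM customer ORDER BY email").fetchall())
```

Even this small case needs a matching rule and a survivorship rule, which is the kind of logic that tends to pile up in hand-written consolidation scripts.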
14. Propagation
Common scenarios:
• Copying data that can’t be accessed directly / remotely
• Synchronizing data
• Data cross-referencing
• Infrequent / one-time extracts
Characteristics:
• Can be one-way or bi-directional
• Often repetitive data movement
• Medium to large data volume (but not always)
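A minimal sketch of repetitive one-way propagation, assuming each source row carries a monotonically increasing `version` number so that each run copies only what changed since the previous run. The `item` table, the `version` column, and the `propagate` function are hypothetical; the sketch uses Python's standard sqlite3 module and does not handle deletes or the bi-directional case.

```python
import sqlite3

SCHEMA = "CREATE TABLE item (id INTEGER PRIMARY KEY, qty INTEGER, version INTEGER)"


def propagate(source, target, last_version):
    """Copy rows changed since last_version from source to target.

    Returns the new high-water mark to pass to the next run.
    """
    changed = source.execute(
        "SELECT id, qty, version FROM item WHERE version > ? ORDER BY version",
        (last_version,),
    ).fetchall()
    for row in changed:
        # Upsert: insert new ids, overwrite rows that already exist in the target.
        target.execute(
            "INSERT OR REPLACE INTO item (id, qty, version) VALUES (?, ?, ?)", row
        )
    target.commit()
    return changed[-1][2] if changed else last_version


if __name__ == "__main__":
    src = sqlite3.connect(":memory:")
    tgt = sqlite3.connect(":memory:")
    src.execute(SCHEMA)
    tgt.execute(SCHEMA)
    src.executemany("INSERT INTO item VALUES (?, ?, ?)", [(1, 10, 1), (2, 5, 2)])
    mark = propagate(src, tgt, 0)        # first run copies both rows
    src.execute("UPDATE item SET qty = 7, version = 3 WHERE id = 1")
    mark = propagate(src, tgt, mark)     # second run copies only the changed row
    print(mark, tgt.execute("SELECT id, qty FROM item ORDER BY id").fetchall())
```

Storing the returned mark between runs is what makes the movement repeatable; a bi-directional version would also need a conflict rule, which this sketch leaves out.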
15. Federation
Common scenarios:
• Real-time / low latency data access
• Security / regulatory requirements that prevent copying data
• Impractical to create a central database (e.g. number of sources, latency)
• Centralized data services
Characteristics:
• One-way
• Lower data volumes
• Higher concurrency
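A minimal sketch of the federation idea: the data stays in its sources and a single view is assembled at request time, with nothing copied or persisted. Two in-memory sqlite3 databases stand in for separate systems. The `customer` and `invoice` tables and the `customer_view` function are hypothetical, and a real federation layer would add query pushdown, caching, and security that this sketch ignores.

```python
import sqlite3


def customer_view(customer_id, crm_db, billing_db):
    """Assemble a single view of one customer by querying both sources on demand."""
    person = crm_db.execute(
        "SELECT name FROM customer WHERE id = ?", (customer_id,)
    ).fetchone()
    if person is None:
        return None
    balance = billing_db.execute(
        "SELECT COALESCE(SUM(amount), 0) FROM invoice WHERE customer_id = ?",
        (customer_id,),
    ).fetchone()[0]
    # Nothing is persisted: the combined record exists only for this request.
    return {"id": customer_id, "name": person[0], "open_balance": balance}


if __name__ == "__main__":
    crm = sqlite3.connect(":memory:")
    crm.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT)")
    crm.execute("INSERT INTO customer VALUES (1, 'Ada')")

    billing = sqlite3.connect(":memory:")
    billing.execute("CREATE TABLE invoice (customer_id INTEGER, amount INTEGER)")
    billing.executemany("INSERT INTO invoice VALUES (?, ?)", [(1, 40), (1, 60)])

    print(customer_view(1, crm, billing))
```

Every call hits both sources, which is why this model suits lower data volumes and makes performance depend on the slowest source.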
16. Choosing Models
There are some basic criteria and tradeoffs to consider:
• Data currency vs. latency
• Diversity of data sources
• Data cleansing & transformation
• Predictability of performance
• Access to the same data is needed via different interfaces
• Non-relational sources
• Frequency of access
• Data volumes
• And more…
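One way to turn a few of these criteria into a first-pass suggestion is sketched below. This is a hypothetical heuristic, not the author's selection method: it only encodes characteristics stated for each model on the preceding slides (federation when data cannot be copied or must be read live, propagation when changes flow both ways or repeat, consolidation for one-time moves). The `Requirement` fields and the ordering of the checks are assumptions.

```python
from dataclasses import dataclass


@dataclass
class Requirement:
    copying_allowed: bool   # False when security or regulation forbids copies
    needs_live_data: bool   # real-time / low-latency access to current data
    one_time_move: bool     # migration, upgrade, or similar single event
    bidirectional: bool     # changes must flow both ways


def suggest_model(req: Requirement) -> str:
    """Return a starting-point model name; a hypothetical heuristic, not a rule."""
    if not req.copying_allowed:
        return "federation"      # data must stay where it is
    if req.bidirectional:
        return "propagation"     # the only model described as bi-directional
    if req.needs_live_data:
        return "federation"      # read from the sources at request time
    if req.one_time_move:
        return "consolidation"   # single large, usually one-way movement
    return "propagation"         # repetitive copies as the default


if __name__ == "__main__":
    migration = Requirement(copying_allowed=True, needs_live_data=False,
                            one_time_move=True, bidirectional=False)
    print(suggest_model(migration))
```

Criteria such as data volume, cleansing needs, and predictability of performance are left out here and would change the answer in practice.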
17. A Handy Comparison Chart
Criteria compared for the physical and virtual models (ratings appear only in the original chart):
Data currency
Query performance / latency
Frequency of access
Diversity of data sources
Diversity of data types
Non-relational data sources
Transformation and cleansing
Predictability of performance
Multiple interfaces to same data
Large query / data volume
Need for history / aggregation
18. Three Implementation Choices
• Write code! It’s fun! It’s easy! At first.
• Buy proprietary data integration tools
• Use available open source tools
19. Hand-coded Integration
Why is this so common?
• DI is an afterthought on application projects
• It’s just data
• It’s hard to justify expensive tools for operational data integration (ODI)
• Developers and DBAs don’t talk
The market is changing:
• Lower tolerance for the high cost of custom DI development and maintenance
• External data challenges
• Bad fit for consolidation projects
Products get better over time. Hand-written code gets worse.
20. Buying Data Integration Tools
Buying is the usual alternative, mostly ETL tools.
• ETL vendors are branching out
• Many companies have ETL for BI
But…
• Poor fit for propagation and synchronization tasks
• Centralized servers
• Licensing costs / problems for consolidation tasks or broad use
Integration code is single-purpose; tools are multi-purpose. You should always go with tools – when you can afford them.
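To make the single-purpose versus multi-purpose distinction concrete, here is a small sketch in which the field mapping is data and one generic function applies it, so a second feed needs a new mapping and no new code. This is not how any particular product works. The function `apply_mapping`, both mappings, and all field names are hypothetical.

```python
def apply_mapping(record, mapping):
    """Build a target record from a source record using a declarative mapping.

    mapping: {target_field: (source_field, transform_function)}
    """
    return {
        target: transform(record[source])
        for target, (source, transform) in mapping.items()
    }


# Two hypothetical feeds handled by the same function; only the mappings differ.
ORDER_MAPPING = {
    "order_id": ("OrderNo", int),
    "customer": ("CustName", str.strip),
}
SHIPMENT_MAPPING = {
    "order_id": ("ref", int),
    "weight_kg": ("weight_g", lambda grams: int(grams) / 1000),
}

if __name__ == "__main__":
    print(apply_mapping({"OrderNo": "42", "CustName": " Acme "}, ORDER_MAPPING))
    print(apply_mapping({"ref": "42", "weight_g": "1500"}, SHIPMENT_MAPPING))
```

A hand-coded script would typically bake each mapping into its own loop; keeping the mapping separate is the part that gets reused.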
21. Use of Tools vs. Hand Coding
[Chart: share of respondents reporting high, medium, low, or no use of ETL, EDR, EII, and EAI tools; vertical axis from 0% to 60%]
Source: TDWI, 2006
22. Open Source: End of Buy vs. Build
Open source avoids the pitfalls of coding and gains the advantages of using tools.
• Tools can be distributed with little to no license restrictions
• Application projects budget for features, not glue
• Even basic tools have obvious operational advantages over hand-coding
Why build custom code when there are comparable tools available?
23. Benefits Reported
After your organization adopted open source software, what was the primary benefit of its use?
Flexibility 31%
Lower cost 31%
Reduced dependence on vendors 15%
Performance 10%
Reliability 7%
Security 4%
Other 3%
Source: The 451 Group
24. A Side Benefit of Flexibility
Comparison of time taken to evaluate tools
Source: Yankee Group
25. Recommendations
1. Differentiate between analytic data integration and operational data integration
2. Stop hand-coding unless the problem really is trivial. This applies to table replication and DBA SQL scripts as well
3. Use the right data integration model for the problem
4. Augment existing data integration infrastructure with open source
5. Make open source the default option for data integration tools
26. Creative Commons
Thanks to the people who made their images available via creative commons:
red pill blue pill - http://www.flickr.com/photos/rcrowley/2540057217/
red pill blue pill2 - http://www.flickr.com/photos/thomasthomas/258931782/
happy dog jumping in meadow - http://flickr.com/photos/cenz/16128560/
Writing code – http://flickr.com/photos/cdm/72250667/
Woodworking – http://flickr.com/photos/rigoletto/126367565/
Febo – http://flickr.com/photos/jshyun/1573065713/
open_air_market_bologn - http://flickr.com/photos/pattchi/181259150/
28. Creative Commons
This work is licensed under the Creative Commons Attribution-Noncommercial-No Derivative Works 3.0 United States License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-nd/3.0/us/ or send a letter to Creative Commons, 543 Howard Street, 5th Floor, San Francisco, California, 94105, USA.