2. • Why? – change, old methods, useful
• What? – new, FCC, not all data is equal
• How? – new fabric - TCSV
• SCV
• Open
• Speed to value (tactical & strategic)
• No compromise on quality/integrity
• Complementary (to kit & thinking)
19. Business domain – EDW
[Diagram: a traditional EDW pipeline. Sources (Legacy CRM, Agency Media Plan, REST, SOAP, ODBC, Email) feed acquisition ETL in the source domain, which loads a warehouse and marts; business-domain queries couple to the conformed schema. Every stage is a chain of schema-coupled operations (Select, Calculate, Join, Aggregate, Split, Sort, Quality Control): a change in one leads to changes in many others.]
20. Business domain – Hadoop
[Diagram: the same pipeline on Hadoop. Sources (Legacy CRM, Agency Media Plan, REST, SOAP, Sqoop, Email) are acquired into HDFS storage in the source domain; business-domain queries couple to the conformed schema via MapReduce jobs. The same chain of schema-coupled operations applies (Select, Calculate, Join, Aggregate, Split, Sort, Quality Control): changes in one lead to changes in many others.]
21. TCSV
[Diagram: the TCSV pipeline. Sources (Legacy CRM, Agency Media Plan, REST, SOAP, ODBC, Email) are acquired into storage in the source domain and parsed into TCSV. In the data domain, Quality Control, Enrichment, Data Acceptance and Calculation are pure TCSV operations. Only Parse on the way in, and the business domain’s Queries/Unparse on the way out, are schema-coupled operations.]
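The TCSV shape the pipeline above operates on can be sketched as a simple record type. This is a hypothetical illustration only; the field names and types are assumptions, not DataShaka’s actual schema.

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass(frozen=True)
class TcsvPoint:
    """One atomic data point: Time, Context, Signal, Value."""
    time: datetime   # T: when the observation applies
    context: tuple   # C: key/value pairs locating the point
    signal: str      # S: what is being measured
    value: str       # V: the observed value, kept as-is

# A raw CRM row parsed into a pure TCSV point, one point per measure:
point = TcsvPoint(
    time=datetime(2014, 6, 1),
    context=(("source", "legacy_crm"), ("customer", "42")),
    signal="spend",
    value="19.99",
)
```

Because every operation in the data domain consumes and produces only this one shape, the operations stay schema-decoupled from the sources.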
27. 2. Cloud – Encrypted Data Harvest
[Diagram: sources (IRI, Kantar, Millward Brown, Mindshare, Nielsen, Dispatches, Litmus, Kantar Ireland) plus Finance data behind the client firewall, harvested via an encryption agent and delivered to the client team with decryption* on delivery.]
Data from sensitive sources inside (or outside) the client environment is encrypted at the value level (TCSV’s V) and decrypted on delivery of the data. Everything else is handled normally. This is a hybrid solution because there are agents in the client environment that would need to be managed. Data is encrypted at all times when outside of the client environment.
* Decryption can be provided in Excel- and browser-based systems, or through an agent that decrypts data files for use by a third-party application.
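Value-level encryption as described above touches only the V of each point, leaving T, C and S usable in the clear. A toy sketch follows; the XOR keystream is a deliberately simple stand-in for a real cipher (a production agent would use something like AES), and the key name is invented.

```python
import base64
from itertools import cycle

def _xor(data: bytes, key: bytes) -> bytes:
    # Stand-in for a real cipher: repeating-key XOR (NOT secure).
    return bytes(b ^ k for b, k in zip(data, cycle(key)))

def encrypt_value(value: str, key: bytes) -> str:
    """Encrypt only the V of a TCSV point; T, C and S stay in the clear."""
    return base64.b64encode(_xor(value.encode(), key)).decode()

def decrypt_value(token: str, key: bytes) -> str:
    return _xor(base64.b64decode(token), key).decode()

key = b"client-agent-key"
enc = encrypt_value("19.99", key)
assert enc != "19.99"                      # opaque outside the firewall
assert decrypt_value(enc, key) == "19.99"  # recovered on delivery
```

Because only V is opaque, quality control, enrichment and querying on T, C and S still work on the encrypted set.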
28. 3. Hybrid – Cloud VM in VPN (Private Cloud)
[Diagram: the same sources (IRI, Kantar, Millward Brown, Mindshare, Nielsen, Dispatches, Litmus, Kantar Ireland) plus Finance data behind the client firewall, with a set of managed VMs inside the client VPN.]
DataShaka produce a managed set of VMs within the client VPN. This utilises the platform as-is for ‘normal’ data but hosts finance data in a secure environment. The DISQ registry is used to present a single interface, in a secure way, across both sets. This could be extended to an entire private-cloud version of the DataShaka platform.
33. O2 ~ DataShaka and Security: Integrity, Privacy, Security and Availability (IPSA)
Columns: Stack (Methodology) | Tools | Infrastructure

1. Integrity (accuracy & consistency)
• Methodology: Consilience allows for data in different places while retaining one single unified conceptual set. CAMO describes the methodology for consilience, and TCSV is a CAMO. All of this is built to support integrity and availability.
• Tools: There are specific tools for checking data taxonomy and for missing data. We provide tools for Data Acceptance Testing (DAT).
• Infrastructure: As embodied on Windows Azure, the DataShaka tools are constructed to respect the integrity of the underlying methodology. Each data operation is recorded to provide full provenance.

2. Privacy (of client data)
• Methodology: n/a
• Tools: TCSV tools are designed to work with TCSV in a content-agnostic way. As mentioned under methodology, privacy is a content-specific concern. Tools for processing TCSV can be used to perform operations supportive of privacy, such as removal of PII.
• Infrastructure: The DataShaka platform is fully tenanted by client, with no cross-pollination of data.

3. Security (right people + right data)
• Methodology: In TCSV each point is uniquely identifiable by its signature of T, C, S & V, and sub-sets are similarly identifiable. As such, TCSV is ideally suited to embodiment within a system of individual point-level security/access control and above.
• Tools: n/a
• Infrastructure: The DataShaka platform takes advantage of the built-in security of Windows Azure. We use Azure in a tenanted manner, preventing cross-pollination or action between accounts.

4. Availability (to SLA)
• Methodology: As with integrity, the ‘unification’ methodology is built for full availability of the unified set. As a mutable set, enrichment is non-destructive, giving full availability to pre- and post-enrichment queries.
• Tools: n/a
• Infrastructure: Reliant on infrastructure SLAs.
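The point-level identifiability claimed under Security above can be illustrated by hashing a point’s T, C, S & V signature. This is a sketch: the hash choice and canonical encoding are assumptions, not the platform’s actual scheme.

```python
import hashlib

def point_signature(t: str, c: tuple, s: str, v: str) -> str:
    """Derive a unique id for a TCSV point from its T, C, S & V signature."""
    canonical = "|".join([t, repr(sorted(c)), s, v])
    return hashlib.sha256(canonical.encode()).hexdigest()

a = point_signature("2014-06-01", (("customer", "42"),), "spend", "19.99")
b = point_signature("2014-06-01", (("customer", "42"),), "spend", "20.00")
assert a != b        # any change in T, C, S or V yields a new identity
assert len(a) == 64  # sha256 hex digest
```

Access-control rules can then be attached to individual signatures, or to any sub-set selected by T, C or S.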
35. Content Agnosticism alongside quality and matching
TCSV Tools – Content Agnostic
• Enrichment
• Taxonomy Rules
• Missing Data Rules
• DAT
• Query
• Combine
Mutable Chaordic TCSV Set
External Tools – Content Specific
• Matching
• Statistical Models
• Machine Learning
• Content-agnostic tools work on a ‘100% match’ basis.
• They use configuration files to make queries and apply rules to TCSV.
• TCSV has Natural Relationships and Natural Connections built in. The tools help with interpretive connections.
• External tools use content-specific techniques to establish matches and rules:
• Text Mining
• Statistical Modelling
• Fuzzy Logic
• Machine Learning
• These can be more traditional MDM tools:
• Deceased Suppressions
• Address/Person Matching
• Fuzzy Matching
• External tools generate rules and new TCSV to enrich and manipulate the TCSV set.
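As an example of the content-specific matching listed above, an external tool might use fuzzy string similarity to propose person matches. This is a minimal sketch using Python’s standard difflib; the threshold value is an assumption.

```python
from difflib import SequenceMatcher

def fuzzy_match(a: str, b: str, threshold: float = 0.85) -> bool:
    """Propose a match when two names are similar enough."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

assert fuzzy_match("Jonathan Smith", "Jonathon Smith")   # likely same person
assert not fuzzy_match("Jonathan Smith", "Mary Jones")   # clearly different
```

A proposed match like this would be emitted back into the set as a new rule or new TCSV enrichment, keeping the core tools content agnostic.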
37. Raw Data – TCSV and Hadoop
IM post on one way of doing it: http://www.datashaka.com/blog/techie/2014/02/how-do-you-get-an-elephant-to-speak-tcsv-hdinsight
That approach uses Hive, a technology that allows SQL-like queries against Hadoop.
Another option on vanilla Hadoop: when thinking of HDFS, one can think of TCSV in terms of files. Using parsers to turn raw data into TCSV, you remove the unhelpful differences and semi-structure the data. This allows you to take advantage of the consilience of TCSV while maintaining the massive parallelism of Hadoop. TCSV can, of course, be stored outside HDFS and accessed via API or DISQ.
[Diagram: Query and MapReduce as alternatives over the stored TCSV, surfaced via API or DISQ.]
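The “parsers to turn raw data into TCSV” step above can be sketched as a map-side function over raw file lines. The field layout here is invented for illustration; it is not the actual DataShaka parser.

```python
import csv
from io import StringIO

def parse_line(line: str):
    """Map step: turn one raw CSV row into TCSV 4-tuples, one per measure."""
    date, customer, spend, channel = next(csv.reader(StringIO(line)))
    context = (("customer", customer), ("channel", channel))
    yield (date, context, "spend", spend)

points = list(parse_line("2014-06-01,42,19.99,email"))
assert points == [
    ("2014-06-01", (("customer", "42"), ("channel", "email")), "spend", "19.99")
]
```

Because each line parses independently, the function fits naturally into a massively parallel map phase over HDFS files.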
39. A customer exists in the ‘real world’.
In data, a customer is represented by a set of identifying features.
These features include location, device, and many other useful things.
These features change over time for any individual customer.
Because it is content agnostic and connectionist, TCSV captures a customer (indeed, any discrete entity) and all of its features as they change over time.
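Retrieving an entity’s features as they change over time amounts to a filter on C and a sort on T. A minimal sketch, with points held as assumed (T, C, S, V) tuples:

```python
points = [
    ("2014-01-01", (("customer", "42"),), "device", "iPhone 4"),
    ("2014-06-01", (("customer", "42"),), "device", "iPhone 5"),
    ("2014-06-01", (("customer", "43"),), "device", "Nexus 5"),
]

def feature_history(points, entity, signal):
    """All values of one signal for one entity, ordered by time."""
    hits = [(t, v) for t, c, s, v in points
            if ("customer", entity) in c and s == signal]
    return sorted(hits)

assert feature_history(points, "42", "device") == [
    ("2014-01-01", "iPhone 4"), ("2014-06-01", "iPhone 5")]
```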
40. One point in time, 4 sources
Identifying features across the sources: Twitter Handle, Name, Device, Mobile Number, User id.
100% matches are automatically connected, as ‘C’ is held uniquely in one unified set. Interpretive connections can be made using TCSV interpretation.
These sources share ‘id’, ‘mobile’ number and Device. As such, connections can be added.
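The ‘100% match’ connection described above amounts to grouping points whose C shares an exact identifier value. A sketch follows; the source and field names are invented.

```python
from collections import defaultdict

points = [
    ("t1", (("source", "twitter"), ("user_id", "42")), "handle", "@jo"),
    ("t1", (("source", "crm"), ("user_id", "42")), "name", "Jo Smith"),
    ("t1", (("source", "app"), ("user_id", "99")), "device", "Nexus 5"),
]

def connect_on(points, key):
    """Group points from different sources that share an exact C value."""
    groups = defaultdict(list)
    for p in points:
        c = dict(p[1])
        if key in c:
            groups[c[key]].append(p)
    return groups

groups = connect_on(points, "user_id")
assert len(groups["42"]) == 2  # twitter and crm records auto-connected
assert len(groups["99"]) == 1
```

Connections that need interpretation (e.g. fuzzy name matches) would be added by the external, content-specific tools instead.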
52. • World changing fast (obvs)
• Old methods are not fit for purpose (to become a digital player)
• Time to think different (to coin a phrase)
• Why we made the decisions we made
• Data as FCC
• All data is not equal
• Exploiting information for value
• Data as fuel not a brake to an organisation (useful)
• Data as a service
• The data supply chain problem
• Flow of clean, curated, useful data
• Conformity (first) not move crap around
• SCV ‘story’
• Reducing costs
• Driving revenue (through better personalisation/enhanced provision)
• How to continue to be relevant to new markets
• Sensor networks- IoT/M2M
• Deliver & exploit faster
• Power your transformation AND Drive value quickly/Quick wins (Rapid POC)
• No need to trade integrity (incl Quality) for agility (false compromise)
• Complementary to existing infrastructure & partners – TD, HW, Trillium etc (don’t slag off Hadoop)
• Plug ins (security)
• Want to be more than phones – a platform to sell other stuff
• Potential low cost architecture to leverage (Linux)
• Open agenda
Not to be presented
53. Efficiency and learning through data
• Efficiency through tooling and automation
• Handles ever-increasing and ever-changing data
• Comic Relief data team provide data products
• Self Serve
• Single Source Of Truth
• ‘Every’ team can use and learn from data, e.g. Marketing/Campaign, including self-serve query
• Better informed marketing and campaigns drive better charitable actions and more donations
• Flexible and quality-controlled data acquisition for ever-changing sources
• Easy access to quality data
• Controlled, rational, easy-to-maintain ‘Data Lake’
55. [Diagram: the DataShaka pipeline.]
Harvest – “Everything is a source...”: http, file, FTP, email, API, marketplace, secure server.
Store – across store types: Unstructured, Relational, Graph, In Memory, Document Store, File System, Big Table.
Unify – via DISQ.
Deliver – to the Enterprise Data Store.
Unified data is held as TCSV: Time (T), Context (C), Signal (S), Value (V).
Editor's Notes
A21 story
Well, sorry to pop that balloon, I’m going to talk about the elephant in the room of Big Data.
The fact that extracting the value from BD is not a simple switch flick.
It is non-trivial and it is difficult.
Why?
People talk idly about bringing all different kinds of data together to generate new insight but the truth is that this is hard to do.
Hard to do at scale, at speed, at low cost and with agility (all of which are critical, in our humble opinion).
Wave RFP
This is what we do! (and more)
Repeat key for TCSV
People talk idly about bringing all different kinds of data together to generate new insight but the truth is that this is hard to do.
And self serve
Too broad, ‘all things…’
H&P are buying this
Is the market big enough?
Can we take a big enough % (lack of competition etc)
Where do we have an unfair advantage?
Focus down
Into TCSV as quickly as possible
ETL for unstructured data