1
Big Data Patterns
Ron Bodkin
Founder and President, Think Big
2
About Me
Ron Bodkin
Founder and President, Think Big
I have 9 years’ experience working with Big Data and Hadoop. In
2010, I founded Think Big to help companies realize measurable value
from Big Data. Our expertise spans all facets of data science and
data engineering and helps our customers drive maximum value from
their Big Data initiatives.
The patterns in this talk come from large-scale deployments in
high-tech manufacturing and digital marketing.
Follow me at @ronbodkin
3
Agenda
•  Context
•  Patterns
•  Conclusions
10/4/15 © 2015 Think Big, a Teradata Company
4
Big Data: The Key is Variety
Definition: Datasets so complex and large that they are
awkward to work with using standard tools and techniques
Location, Social, Images, Weblogs, Videos, Text, Audio, Sensor
Size is not what is most important—it’s variety
5
How is Information Management Changing?
•  Schema on Read?
–  Yes… as step one
–  But data still has underlying structure
–  It’s more like agile modeling – reflect as much structure as needed
•  Loosely coupled schemas come without platform guarantees but enable more
application flexibility
•  Data Modeling isn’t dead!
•  Metadata is more important than ever
•  Data Warehouses embracing Big Data principles (e.g., elasticity, JSON…)
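As a rough illustration of schema on read as "step one", the sketch below keeps raw JSON lines as ingested and applies only as much structure as a given reader needs. Anything not yet modeled stays in an extension field. The field names (`actor_id`, `event_time`, `ext`) and the `read_events` helper are hypothetical and not taken from the talk.

```python
import json
from datetime import datetime

# Hypothetical raw feed: one JSON document per line, stored as ingested.
RAW_LINES = [
    '{"actor_id": "uid1", "event_time": "2015-01-01T13:16:11", "TstA": 1}',
    '{"actor_id": "uid2", "event_time": "2015-01-01T13:16:14", "TstB": 1, "page": "/home"}',
]

# The structure this reader needs today; it can grow as modeling matures.
MODELED_FIELDS = {"actor_id": str, "event_time": datetime.fromisoformat}


def read_events(lines, modeled=MODELED_FIELDS):
    """Apply the schema at read time; keep unmodeled fields in 'ext'."""
    for line in lines:
        raw = json.loads(line)
        record = {name: parse(raw[name]) for name, parse in modeled.items()}
        record["ext"] = {k: v for k, v in raw.items() if k not in modeled}
        yield record


events = list(read_events(RAW_LINES))
```

Promoting a field from `ext` to a modeled column is then a change to `MODELED_FIELDS` rather than a rewrite of stored data, which is one way to read "reflect as much structure as needed".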
6
Changes in the Platform
•  Entry Level Hadoop cluster circa 2015 (20 nodes)
–  240 cores
–  1 PB spinning disk
–  10 TB RAM
–  10-40 GbE
–  Low software cost
•  Disk transfer times increasing => many disks => DAS (2005-2020)
•  Distributed RAM is increasingly important to expedite computation,
although data volumes are increasing faster
•  The network will be the computer (really!) => you can distribute disks
separately across high bandwidth fabrics (2020+)
•  Changes many assumptions in traditional physical modeling
7
Changes in Logical Modeling
•  JSON-like structures
–  Complex collections of relations, arrays, maps of items
•  Graphs
–  Storing complex, dynamically changing (not static) relationships
•  Binary/CLOB/specialized data
–  Ability to execute specialized programs to interpret and process
8
Changes in Physical Modeling
•  Big Data “unpacks” the database metaphor
–  Data distribution: key design, sharding/distribution, file formats
–  Multiple computational algorithms, e.g., MapReduce, Computational Graph
(Spark, Tez), data flow, streaming, graph engines
–  Integrity is an application concern
•  Storage is cheap
–  Denormalization and materialized views common
•  Yet compression is popular, often for I/O savings
•  Summarization is orders of magnitude more powerful
•  Index lookups are increasingly costly
•  Distributed systems impose eventual consistency, reconciliation
demands
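To make the "key design, sharding/distribution" point concrete, here is a minimal sketch of hash-based shard assignment. The shard count, the `shard_for` and `salted_key` helpers, and the idea of salting a key to spread a very active actor are assumptions chosen for illustration, not details from the talk.

```python
import hashlib
from collections import Counter

NUM_SHARDS = 8  # assumption for this example


def shard_for(key, num_shards=NUM_SHARDS):
    """Stable shard assignment; built-in hash() is salted per process for strings."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def salted_key(actor_id, event_id, salt_buckets=4):
    """Key design choice: spread one very active actor over several keys."""
    return f"{actor_id}#{event_id % salt_buckets}"


# How evenly do 10,000 hypothetical actor ids spread over the shards?
load = Counter(shard_for(f"uid{i}") for i in range(10_000))
```

The trade-off is the usual one: salting evens out load but means a read for one actor has to touch several keys.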
9
Leading Financial Asset Manager
Challenge
•  Siloed consumer analytics
•  Lack of agility in analysis
•  Slow ETL
Solution
•  Scalable ETL
•  Discovery analytics tech & process
•  Cross-channel data science models
•  Cloudera Enterprise, HBase, Greenplum
Results
•  Scalable Processing
•  Extracted customer behavior signals from raw
data for existing and new behavior models
•  Faster time to insight
Financial Services
Photo courtesy of Flickr. Creative Commons
10
Leading Enterprise Tech Component Vendor
Challenge
•  Data search parties waste engineers’ time
•  Excess scrap waste, slow time to market
•  Reactive analytics model
Solution
•  Scalable data lake
•  Search and deep analytic queries
•  Integrated assembly insights for data science models
•  Hive, Impala, Redshift, Elasticsearch
•  Big data training and “hackathons”
Results
•  Supply chain “line of sight” from R&D, manufacturing, to servicing at
customer sites
•  “End-to-end” proactive analytics: reduced development time,
improved manufacturing yield, increased customer satisfaction
•  Proactive analytics at scale led to better engineering theory
High Tech Manufacturing
Photo courtesy of Flickr. Creative Commons
11
Patterns
12
Important New Patterns
•  Denormalized Fact
•  Profile
•  Event History
•  Timeline
•  Assembly
•  Distributed Sources
•  Late Data
•  Deep Aggregates
•  Recovery
•  Multiple Active Clusters
13
Event History
Event id | Actor id | Time            | Event col’s | Dim id’s | Dim col’s | Ext. Data
123      | uid1     | 1/1/15 13:16:11 | …           | …        | …         | { “TstA” : 1 …}
456      | uid2     | 1/1/15 13:16:14 | …           | …        | …         | { “TstB” : 1 …}
•  Fact table about common events to allow, e.g., cross-channel analytics
in context
–  E.g., clickstream, posts, purchases, content consumption, device activity
•  Stored in columnar format (e.g., Parquet, ORCfile)
•  Join the “as-was” value of slowly changing dimensions
•  Often an “extension” column of unparsed/not modeled JSON-like data
•  Partitioned by event time buckets, perhaps also by other dimension(s)
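A minimal sketch of two ideas from the Event History slide: joining the "as-was" value of a slowly changing dimension, and partitioning by event-time bucket. The dimension history, the hourly bucket, and the helper names (`as_was`, `partition_path`, `denormalize`) are hypothetical; a real deployment would do this in the query engine over columnar files.

```python
from bisect import bisect_right
from datetime import datetime

# Hypothetical slowly changing dimension: per actor, (valid_from, value) sorted by time.
SEGMENT_HISTORY = {
    "uid1": [(datetime(2014, 6, 1), "trial"), (datetime(2015, 1, 1, 12, 0), "paid")],
    "uid2": [(datetime(2014, 9, 1), "trial")],
}


def as_was(history, when):
    """Return the dimension value in effect at time `when` (None if before any)."""
    idx = bisect_right([valid_from for valid_from, _ in history], when)
    return history[idx - 1][1] if idx else None


def partition_path(event_time, bucket="%Y/%m/%d/%H"):
    """Directory-style partition name for an hourly event-time bucket."""
    return "event_time=" + event_time.strftime(bucket)


def denormalize(event):
    """Attach the as-was dimension value and the partition to one event."""
    row = dict(event)
    row["segment"] = as_was(SEGMENT_HISTORY[event["actor_id"]], event["time"])
    row["partition"] = partition_path(event["time"])
    return row


row = denormalize({"event_id": 123, "actor_id": "uid1",
                   "time": datetime(2015, 1, 1, 13, 16, 11), "ext": {"TstA": 1}})
```

Storing the as-was value on the fact row means later changes to the dimension do not rewrite history, at the cost of some redundancy.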
14
Timeline
Actor id | Segments  | Ev1:id | Ev1:fact        | Ev1:dims | Ev1:ext         | Ev2:id | …
uid1     | [1, 3, 7] | 123    | 1/1/15 13:16:11 | …        | { “TstA” : 1 …} | 789    | …
uid2     | [2, 3]    | 456    | 1/1/15 13:16:14 | …        | { “TstB” : 1 …} | 0ab    | …
•  Pivot on event history: table of actors with events over time
–  Customer journey, device history
–  Enable support/analysis on specific items, long-lived analysis
•  May have hierarchy of actors (e.g., household, individual, device)
•  May be array of events, many columns or subsorted (cluster key)
•  Also stored in columnar format, may be partitioned
•  May be updated in near real-time AND batch
•  Often holds cached algorithm values (combined Profile)
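The pivot from Event History to Timeline can be sketched as a group-by on actor with events kept in time order. The sample rows and the `build_timelines` / `append_event` helpers are hypothetical; this uses the "array of events" layout rather than many columns.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical Event History rows (event id, actor id, time).
EVENTS = [
    {"event_id": "456", "actor_id": "uid2", "time": datetime(2015, 1, 1, 13, 16, 14)},
    {"event_id": "789", "actor_id": "uid1", "time": datetime(2015, 1, 2, 9, 0, 0)},
    {"event_id": "123", "actor_id": "uid1", "time": datetime(2015, 1, 1, 13, 16, 11)},
]


def build_timelines(events):
    """Batch path: pivot events into one record per actor, in time order."""
    by_actor = defaultdict(list)
    for event in events:
        by_actor[event["actor_id"]].append(event)
    return {
        actor: {"actor_id": actor, "events": sorted(rows, key=lambda e: e["time"])}
        for actor, rows in by_actor.items()
    }


def append_event(timelines, event):
    """Near-real-time path: add one event, tolerating out-of-order arrival."""
    actor = event["actor_id"]
    timeline = timelines.setdefault(actor, {"actor_id": actor, "events": []})
    timeline["events"].append(event)
    timeline["events"].sort(key=lambda e: e["time"])


timelines = build_timelines(EVENTS)
```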
15
Event Analytics
•  Propensity/segmentation
–  May be scored in real-time using Timeline/Profile
–  May be hybrid scored batch using Event History
–  Trained from timeline
•  Attribution
–  Score impact of past events on new event (e.g., purchase, churn)
–  Algorithms range from simple rules to Shapley value
–  Natural in timeline
•  Reporting, exploration
–  Often via Deep Aggregates, using HyperLogLog
•  Discovery
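For the attribution bullet, the sketch below computes exact Shapley values for a toy case with two channels. The channel names and the conversion value of each channel combination are made-up numbers for illustration only, and enumerating every ordering is practical only for a handful of players.

```python
from itertools import permutations

# Hypothetical conversion value for each set of channels an actor was exposed to.
COALITION_VALUE = {
    frozenset(): 0.0,
    frozenset({"email"}): 0.10,
    frozenset({"search"}): 0.20,
    frozenset({"email", "search"}): 0.40,
}


def shapley_values(players, value):
    """Average each player's marginal contribution over all orderings."""
    totals = {p: 0.0 for p in players}
    orderings = list(permutations(players))
    for order in orderings:
        seen = frozenset()
        for player in order:
            with_player = seen | {player}
            totals[player] += value[with_player] - value[seen]
            seen = with_player
    return {p: total / len(orderings) for p, total in totals.items()}


credit = shapley_values(["email", "search"], COALITION_VALUE)
```

Here `credit` assigns 0.15 to email and 0.25 to search, which sums to the 0.40 value of the full combination. A simple rule such as last-touch would instead give all the credit to whichever channel came last.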
16
Event Data Management
•  Identity merge
–  Discovery of new identities (e.g., cookie logs in, Facebook connect)
–  Indirection or rewrites
–  Requires rescoring
•  Expiration/archival
–  Efficiency, policy requirements
•  Governance
–  Lineage & security
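One way to handle identity merge by indirection is a union-find map from each identifier to a canonical actor. This is a sketch under assumptions: the `IdentityMap` class and the identifier formats are hypothetical, and `merge` returns the identifiers whose canonical actor changed, as a stand-in for "requires rescoring".

```python
class IdentityMap:
    """Union-find over identifiers: merged ids resolve to one canonical actor."""

    def __init__(self):
        self._parent = {}

    def resolve(self, ident):
        """Return the canonical id, compressing the path along the way."""
        self._parent.setdefault(ident, ident)
        root = ident
        while self._parent[root] != root:
            root = self._parent[root]
        while self._parent[ident] != root:
            self._parent[ident], ident = root, self._parent[ident]
        return root

    def merge(self, keep, absorb):
        """Record that two ids are the same actor; return ids that need rescoring."""
        root_keep, root_absorb = self.resolve(keep), self.resolve(absorb)
        if root_keep == root_absorb:
            return set()
        moved = {i for i in list(self._parent) if self.resolve(i) == root_absorb}
        self._parent[root_absorb] = root_keep
        return moved


ids = IdentityMap()
to_rescore = ids.merge("user:42", "cookie:abc")  # e.g. a cookie logs in
```

With indirection, stored events keep their original identifier and are resolved at read time. The alternative named on the slide, rewrites, would update the stored rows instead.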
17
Network
•  Ongoing status of configuration
–  Parts in assembly
–  Related items (versions)
–  Social groups
•  Can be people, devices etc.
•  Maintain links in graph structure
–  May be current or historical
•  Use links to pull full context from Event History or Timeline
•  Search -> simple query -> complex analytics
–  E.g., transitive closure, impact analysis
•  Technologies
–  Giraph, GraphX
–  TitanDB, Neo4j
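Impact analysis over "parts in assembly" links can be sketched as a transitive closure computed by breadth-first search. The `CONTAINS` graph and part names are invented for illustration; at scale this would run in a graph engine such as those listed on the slide rather than in a single process.

```python
from collections import deque

# Hypothetical "contains" links: assembly -> parts directly inside it.
CONTAINS = {
    "server-1": ["board-A", "psu-X"],
    "board-A": ["chip-7", "chip-8"],
    "server-2": ["board-B"],
    "board-B": ["chip-7"],
}


def reverse_links(links):
    """Invert the graph: part -> assemblies that directly contain it."""
    parents = {}
    for parent, children in links.items():
        for child in children:
            parents.setdefault(child, []).append(parent)
    return parents


def impacted_by(part, links=CONTAINS):
    """Every assembly that directly or indirectly contains `part`."""
    parents = reverse_links(links)
    seen, queue = set(), deque([part])
    while queue:
        for parent in parents.get(queue.popleft(), []):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen


impacted = impacted_by("chip-7")
```

The resulting ids can then be used as keys to pull full context from Event History or Timeline.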
18
Distributed Sources
•  Unlike simple “all or nothing” feeds…
•  May have many distributed sources feeding data
•  It’s critical to know whether all (or enough) data has arrived
•  Goals
–  Only produce analytic results when sufficient data has arrived
–  Provide provenance – timeliness & completeness statistics
•  Need
–  SLAs about timeliness and required fraction of data
–  Control totals
–  Metadata about process (expected lineage)
–  Heartbeats/configuration
•  Root cause of complexity of ingestion
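The gating logic described above might look like the following sketch: compare received rows with each source's control total, check heartbeats, and only release results when the agreed thresholds are met. The `SourceStatus` fields, the source names, and the 95% / 90% thresholds are assumptions for the example.

```python
from dataclasses import dataclass


@dataclass
class SourceStatus:
    source: str
    expected_rows: int   # control total reported by the source
    received_rows: int   # rows that actually landed
    heartbeat_ok: bool   # did the source check in this period?


def completeness(statuses):
    """Fraction of expected rows that have arrived across all sources."""
    expected = sum(s.expected_rows for s in statuses)
    received = sum(min(s.received_rows, s.expected_rows) for s in statuses)
    return received / expected if expected else 0.0


def ready_to_publish(statuses, required_fraction=0.95, required_sources=0.9):
    """Gate analytic output on SLA thresholds; also return provenance statistics."""
    if not statuses:
        return False, {"row_fraction": 0.0, "sources_reporting": 0.0}
    reporting = sum(s.heartbeat_ok for s in statuses) / len(statuses)
    fraction = completeness(statuses)
    stats = {"row_fraction": fraction, "sources_reporting": reporting}
    return fraction >= required_fraction and reporting >= required_sources, stats


statuses = [
    SourceStatus("plant-a", 1000, 1000, True),
    SourceStatus("plant-b", 1000, 900, True),
    SourceStatus("plant-c", 500, 0, False),
]
ok, stats = ready_to_publish(statuses)
```

In this example `ok` is false because only 76% of expected rows and two of three sources have reported. Publishing `stats` alongside any result gives consumers the timeliness and completeness context the slide calls for.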
19
Late Data
•  Data may be delayed due to
–  Upstream system failures (server down esp. with unreliable delivery,
network outage)
–  Offline/disconnected devices (endemic with mobile & IoT)
•  Metadata to track lineage is critical
•  Define a delay time after which, with high confidence, sufficient data
has arrived
•  Process “authoritative” derived data after that time
–  May process incremental/incomplete data earlier (a la economic
statistics)
–  May re-process in emergency (restatement)
–  May include changed data in later period
•  Report on how much data has arrived late
•  Implementation: bucket on event time, secondary on delay
epoch (partitions for late data)
[Figure: Zipfian Distribution]
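The "bucket on event time, secondary on delay epoch" implementation can be sketched as a two-part partition key plus a late-arrival report. The hourly bucket, the six-hour cutoff, and the function names are assumptions for illustration.

```python
from collections import Counter
from datetime import datetime, timedelta

BUCKET = timedelta(hours=1)  # event-time bucket size (assumption)
CUTOFF = timedelta(hours=6)  # delay after which a bucket is treated as authoritative (assumption)


def partition_key(event_time, arrival_time):
    """(event-time bucket, delay epoch): epoch 0 is on time, 1+ counts cutoffs missed."""
    bucket = event_time.replace(minute=0, second=0, microsecond=0)
    delay = max(arrival_time - (bucket + BUCKET), timedelta(0))
    return bucket, delay // CUTOFF


def late_report(records):
    """Share of records per event-time bucket that arrived after the cutoff."""
    total, late = Counter(), Counter()
    for event_time, arrival_time in records:
        bucket, epoch = partition_key(event_time, arrival_time)
        total[bucket] += 1
        if epoch > 0:
            late[bucket] += 1
    return {bucket: late[bucket] / total[bucket] for bucket in total}


t = datetime(2015, 1, 1, 13, 16, 11)
records = [
    (t, t + timedelta(seconds=5)),
    (t, t + timedelta(minutes=10)),
    (t, t + timedelta(hours=2)),
    (t, t + timedelta(days=1)),  # e.g. a device that was offline
]
report = late_report(records)
```

Because late rows land in their own delay-epoch partition, the authoritative epoch-0 data is never rewritten, and a restatement can pick up exactly the partitions that arrived afterward.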
20
Conclusions
21
Probabilistic Data Structures
•  Increasingly valuable as an optimization technique, e.g.,
•  Bloom filters
–  Hashed key values for array
–  Check key to see if may be present
–  indexing/filtering sparse reads
•  HyperLogLog, sketch sets
–  Multiple hashes used to estimate count of unique items
–  Far more compact (KBs to count billions of items, +/- 2%)
–  Can be composed (unlike exact unique counts) – e.g., across time,
categories
•  MinHash
–  Least hashed value in common between two sets
–  Used to identify duplicates, estimate overlap in arbitrary sets
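The three structures above can be sketched compactly in plain Python. These are teaching-sized versions under assumptions (SHA-256 as the hash, 4096 bits and 3 hashes for the Bloom filter, 1024 registers for HyperLogLog, 128 hashes for MinHash), not tuned or production implementations.

```python
import hashlib
import math


def hash64(item, seed=0):
    """Stable 64-bit hash of str(item) under a seed."""
    data = f"{seed}:{item}".encode("utf-8")
    return int.from_bytes(hashlib.sha256(data).digest()[:8], "big")


class BloomFilter:
    """Set membership with no false negatives and some false positives."""

    def __init__(self, num_bits=4096, num_hashes=3):
        self.num_bits, self.num_hashes, self.bits = num_bits, num_hashes, 0

    def _positions(self, item):
        return [hash64(item, s) % self.num_bits for s in range(self.num_hashes)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def might_contain(self, item):
        return all((self.bits >> pos) & 1 for pos in self._positions(item))


class HyperLogLog:
    """Approximate distinct count; sketches merge by register-wise max."""

    def __init__(self, p=10):
        self.p, self.m = p, 1 << p
        self.registers = [0] * self.m

    def add(self, item):
        h = hash64(item)
        index = h >> (64 - self.p)                    # first p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)         # remaining bits
        rank = (64 - self.p) - rest.bit_length() + 1  # position of the first 1 bit
        self.registers[index] = max(self.registers[index], rank)

    def merge(self, other):
        merged = HyperLogLog(self.p)
        merged.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]
        return merged

    def estimate(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)  # bias constant, valid for m >= 128
        raw = alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:      # small-range correction
            return self.m * math.log(self.m / zeros)
        return raw


def minhash_signature(items, num_hashes=128):
    """One minimum hash value per seed."""
    return [min(hash64(i, s) for i in items) for s in range(num_hashes)]


def estimate_jaccard(sig_a, sig_b):
    """Fraction of matching minima approximates the overlap of the two sets."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


monday, tuesday = HyperLogLog(), HyperLogLog()
for i in range(6000):
    monday.add(f"uid{i}")
for i in range(4000, 10000):
    tuesday.add(f"uid{i}")
both_days = monday.merge(tuesday).estimate()  # approximate distinct actors over both days
```

The merge step is what makes HyperLogLog useful in Deep Aggregates: daily sketches can be combined into weekly or per-category unique counts without rescanning the events, which exact distinct counts do not allow.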
22
Anti-Patterns
•  3rd Normal Form, Star Schema, Snowflake Schema
•  Index lookups slow in general
–  Focus on partitioned reads not disk seeks
•  Poor results in practice
•  Not natural representations for repeating events, nested structure
•  Use of SSD, maturing optimizers, platform updates (Kudu?) are slowly
improving this… and the industry would love this to happen
•  Expect data marts to work in Big Data before data warehouses do
23
Conclusions
•  Much of Big Data today is trade-craft
–  Learned lore & derived from first principles
•  As we scale data lakes & analytics, it is critical to have a common
vocabulary and shared understanding
•  I’d love your input on common patterns & practices
•  Look for blogs with more depth on each pattern at
http://thinkbig.teradata.com/author/rbodkin/
•  Reach me at @ronbodkin, ron.bodkin@thinkbiganalytics.com

NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...Boston Institute of Analytics
 
Predicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdfPredicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdfBoston Institute of Analytics
 
RadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfRadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfgstagge
 
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default  Presentation : Data Analysis Project PPTPredictive Analysis for Loan Default  Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPTBoston Institute of Analytics
 
Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 217djon017
 
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一fhwihughh
 
20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdfHuman37
 
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSINGmarianagonzalez07
 
DBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfDBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfJohn Sterrett
 
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...limedy534
 
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort servicejennyeacort
 
Top 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In QueensTop 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In Queensdataanalyticsqueen03
 
Identifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanIdentifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanMYRABACSAFRA2
 
NLP Project PPT: Flipkart Product Reviews through NLP Data Science.pptx
NLP Project PPT: Flipkart Product Reviews through NLP Data Science.pptxNLP Project PPT: Flipkart Product Reviews through NLP Data Science.pptx
NLP Project PPT: Flipkart Product Reviews through NLP Data Science.pptxBoston Institute of Analytics
 
Learn How Data Science Changes Our World
Learn How Data Science Changes Our WorldLearn How Data Science Changes Our World
Learn How Data Science Changes Our WorldEduminds Learning
 

Recently uploaded (20)

办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
 
While-For-loop in python used in college
While-For-loop in python used in collegeWhile-For-loop in python used in college
While-For-loop in python used in college
 
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
 
Advanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsAdvanced Machine Learning for Business Professionals
Advanced Machine Learning for Business Professionals
 
ASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel CanterASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel Canter
 
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
 
Predicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdfPredicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdf
 
RadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfRadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdf
 
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default  Presentation : Data Analysis Project PPTPredictive Analysis for Loan Default  Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
 
Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2
 
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
 
20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf
 
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING
 
DBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfDBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdf
 
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
 
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
 
Top 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In QueensTop 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In Queens
 
Identifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanIdentifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population Mean
 
NLP Project PPT: Flipkart Product Reviews through NLP Data Science.pptx
NLP Project PPT: Flipkart Product Reviews through NLP Data Science.pptxNLP Project PPT: Flipkart Product Reviews through NLP Data Science.pptx
NLP Project PPT: Flipkart Product Reviews through NLP Data Science.pptx
 
Learn How Data Science Changes Our World
Learn How Data Science Changes Our WorldLearn How Data Science Changes Our World
Learn How Data Science Changes Our World
 

Big Data Modeling and Analytic Patterns – Beyond Schema on Read

  • 1. 1 Big Data Patterns Ron Bodkin Founder and President, Think Big
  • 2. 2 About Me Ron Bodkin Founder and President, Think Big I have 9 years’ experience working with Big Data and Hadoop. In 2010, I founded Think Big to help companies realize measurable value from Big Data. Our expertise spans all facets of data science and data engineering and helps our customers drive maximum value from their Big Data initiatives. Patterns in this talk from large-scale deployments in high tech manufacturing & digital marketing. Follow me at @ronbodkin
  • 3. 3 Agenda •  Context •  Patterns •  Conclusions 10/4/15© 2015 Think Big, a Teradata Company
  • 4. 4 Big Data: The Key is Variety Definition: Datasets so complex and large that they are awkward to work with using standard tools and techniques Location Social Images Weblogs Videos Text Audio Sensor Size is not what is most important—it’s variety
  • 5. 5 How is Information Management Changing? •  Schema on Read? –  Yes… as step one –  But data still has underlying structure –  It’s more like agile modeling – reflect as much structure as needed •  Loosely coupled schemas lack platform guarantees but enable more application flexibility •  Data Modeling isn’t dead! •  Metadata is more important than ever •  Data Warehouses are embracing Big Data principles (e.g., elasticity, JSON…)
  • 6. 6 Changes in the Platform •  Entry Level Hadoop cluster circa 2015 (20 nodes) –  240 cores –  1 PB spinning disk –  10 TB RAM –  10-40 GbE –  Low software cost •  Disk transfer times increasing => many disks => DAS (2005-2020) •  Distributed RAM increasingly important to expedite computation although data volumes increasing faster •  The network will be the computer (really!) => you can distribute disks separately across high bandwidth fabrics (2020+) •  Changes many assumptions in traditional physical modeling
  • 7. 7 Changes in Logical Modeling •  JSON-like structures –  Complex collections of relations, arrays, maps of items •  Graphs –  Storing complex, dynamically changing (not static) relationships •  Binary/CLOB/specialized data –  Ability to execute specialized programs to interpret and process
  • 8. 8 Changes in Physical Modeling •  Big Data “unpacks” the database metaphor –  Data distribution: key design, sharding/distribution, file formats –  Multiple computational algorithms, e.g., MapReduce, Computational Graph (Spark, Tez), data flow, streaming, graph engines –  Integrity is an application concern •  Storage is cheap –  Denormalization and materialized views are common •  Yet compression is popular, often for I/O savings •  Summarization is orders of magnitude more powerful •  Index lookups are increasingly costly •  Distributed systems impose eventual consistency and reconciliation demands
  • 9. 9 Leading Financial Asset Manager (Financial Services) Challenge •  Siloed consumer analytics •  Lack of agility in analysis •  Slow ETL Solution •  Scalable ETL •  Discovery analytics tech & process •  Cross-channel data science models •  Cloudera Enterprise, HBase, Greenplum Results •  Scalable Processing •  Extracted customer behavior signals from raw data for existing and new behavior models •  Faster time to insight Photo courtesy of Flickr. Creative Commons
  • 10. 10 Leading Enterprise Tech Component Vendor (High Tech Manufacturing) Challenge •  Data search parties waste engineers’ time •  Excess scrap waste, slow time to market •  Reactive analytics model Solution •  Scalable data lake •  Search and deep analytic queries •  Integrated assembly insights for data science models •  Hive, Impala, Redshift, Elasticsearch •  Big data training and “hackathons” Results •  Supply chain “line of sight” from R&D, manufacturing, to servicing at customer sites •  “End-to-end” proactive analytics: reduced development time, improved manufacturing yield, increased customer satisfaction •  Proactive analytics at scale led to better engineering theory Photo courtesy of Flickr. Creative Commons
  • 12. 12 Important New Patterns •  Denormalized Fact •  Profile •  Event History •  Timeline •  Assembly •  Distributed Sources •  Late Data •  Deep Aggregates •  Recovery •  Multiple Active Clusters
  • 13. 13 Event History. Example table columns: Event id | Actor id | Time | Event col’s | Dim id’s | Dim col’s | Ext. Data. Example rows: 123 | uid1 | 1/1/15 13:16:11 | … | … | … | { “TstA” : 1 …}; 456 | uid2 | 1/1/15 13:16:14 | … | … | … | { “TstB” : 1 …} •  Fact table about common events to allow e.g., cross-channel analytics in context –  E.g., clickstream, posts, purchases, content consumption, device activity •  Stored in columnar format (e.g., Parquet, ORCfile) •  Join as was value of slowly changing dimensions •  Often “extension” column of unparsed/not modeled JSON-like data •  Partitioned by event time buckets, perhaps also by other dimension(s)
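The time-bucket partitioning mentioned on the Event History slide can be sketched roughly as follows. This is a minimal illustration, not the layout used in the deployments described: the `event_history/dt=…/hour=…` path convention, the field names, and the `partition_path` helper are all assumptions made for the example.

```python
from datetime import datetime

def partition_path(event: dict, bucket_hours: int = 1) -> str:
    """Return a hypothetical directory path for the event's time bucket."""
    ts = datetime.fromisoformat(event["time"])
    # Round the hour down to the start of its bucket.
    bucket_hour = (ts.hour // bucket_hours) * bucket_hours
    return f"event_history/dt={ts:%Y-%m-%d}/hour={bucket_hour:02d}"

# Two events shaped like the rows on the slide; "ext" holds unmodeled JSON-like data.
events = [
    {"event_id": "123", "actor_id": "uid1", "time": "2015-01-01T13:16:11", "ext": {"TstA": 1}},
    {"event_id": "456", "actor_id": "uid2", "time": "2015-01-01T13:16:14", "ext": {"TstB": 1}},
]
paths = [partition_path(e) for e in events]
```

Both sample events fall in the same hourly bucket, so a query restricted to that hour would read one partition rather than scanning the whole table.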
  • 14. 14 Timeline. Example table columns: Actor id | Segments | Ev1:id | Ev1:fact | Ev1:dims | Ev1:ext | Ev2:id | …. Example rows: uid1 | [1, 3, 7] | 123 | 1/1/15 13:16:11 | … | { “TstA” : 1 …} | 789 | …; uid2 | [2, 3] | 456 | 1/1/15 13:16:14 | … | { “TstB” : 1 …} | 0ab | … •  Pivot on event history: table of actors with events over time –  Customer journey, device history –  Enable support/analysis on specific items, long-lived analysis •  May have hierarchy of actors (e.g., household, individual, device) •  May be array of events, many columns or subsorted (cluster key) •  Also stored in columnar format, may be partitioned •  May be updated in near real-time AND batch •  Often holds cached algorithm values (combined Profile)
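The pivot from Event History to Timeline could look something like the sketch below. It assumes events are plain dictionaries with `actor_id` and ISO-formatted `time` fields (both hypothetical names) and keeps each actor's events as a time-ordered array, which is only one of the layouts the slide lists.

```python
from collections import defaultdict

def build_timeline(events):
    """Group events by actor, ordered by event time within each actor."""
    timeline = defaultdict(list)
    # ISO-8601 strings of the same format sort chronologically as text.
    for event in sorted(events, key=lambda e: e["time"]):
        timeline[event["actor_id"]].append(event)
    return dict(timeline)

events = [
    {"event_id": "789", "actor_id": "uid1", "time": "2015-01-02T09:00:00"},
    {"event_id": "123", "actor_id": "uid1", "time": "2015-01-01T13:16:11"},
    {"event_id": "456", "actor_id": "uid2", "time": "2015-01-01T13:16:14"},
]
timeline = build_timeline(events)
uid1_journey = [e["event_id"] for e in timeline["uid1"]]
```

Here `uid1_journey` lists that actor's events in time order, which is the shape a customer-journey or device-history analysis would read directly.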
  • 15. 15 Event Analytics •  Propensity/segmentation –  May be scored in real-time using Timeline/Profile –  May be hybrid scored batch using Event History –  Trained from timeline •  Attribution –  Score impact of past events on new event (e.g., purchase, churn) –  Algorithms range from simple rules to Shapley value –  Natural in timeline •  Reporting, exploration –  Often via Deep Aggregates, using HyperLogLog •  Discovery
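As a rough illustration of the Shapley-value end of the attribution range, the sketch below computes exact Shapley credit for a tiny set of channels. The channel names and conversion rates are made-up illustrative numbers, and enumerating every ordering is only practical for a handful of channels; it is not presented as the method used in the talk.

```python
from itertools import permutations

def shapley_values(players, value):
    """Average each player's marginal contribution over all orderings."""
    totals = {p: 0.0 for p in players}
    orderings = list(permutations(players))
    for order in orderings:
        seen = frozenset()
        for p in order:
            with_p = seen | {p}
            totals[p] += value(with_p) - value(seen)
            seen = with_p
    return {p: t / len(orderings) for p, t in totals.items()}

# Hypothetical conversion rate observed for each combination of touches.
conversion_rate = {
    frozenset(): 0.0,
    frozenset({"email"}): 0.10,
    frozenset({"search"}): 0.20,
    frozenset({"email", "search"}): 0.40,
}
credit = shapley_values(["email", "search"], conversion_rate.__getitem__)
```

With these numbers the credits come to roughly 0.15 for email and 0.25 for search, and they sum to the 0.40 rate of the full combination, which is the property that makes Shapley values attractive for attribution.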
  • 16. 16 Event Data Management •  Identity merge –  Discovery of new identities (e.g., cookie logs in, Facebook connect) –  Indirection or rewrites –  Requires rescoring •  Expiration/archival –  Efficiency, policy requirements •  Governance –  Lineage & security
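The "indirection" option for identity merge can be sketched with a union-find style lookup table, so that stored events keep their original ids and are resolved to a canonical identity at read time. The `IdentityMap` class and the id formats below are hypothetical, and the sketch leaves out the rescoring step the slide calls for.

```python
class IdentityMap:
    """Maps each known identity to a canonical one (union-find with path compression)."""

    def __init__(self):
        self._parent = {}

    def canonical(self, identity):
        self._parent.setdefault(identity, identity)
        root = identity
        while self._parent[root] != root:
            root = self._parent[root]
        # Path compression: point every visited id straight at the root.
        while self._parent[identity] != root:
            self._parent[identity], identity = root, self._parent[identity]
        return root

    def merge(self, a, b):
        """Record that a and b are the same actor; a's canonical id wins."""
        root_a, root_b = self.canonical(a), self.canonical(b)
        if root_a != root_b:
            self._parent[root_b] = root_a
        return root_a

ids = IdentityMap()
ids.merge("user:42", "cookie:abc")     # e.g. a cookie logs in
ids.merge("cookie:abc", "device:xyz")  # the same cookie is later seen on a device
resolved = ids.canonical("device:xyz")
```

The trade-off against rewriting is that reads pay for a lookup, while rewrites pay once to update every stored event that carries the old id.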
  • 17. 17 Network •  Ongoing status of configuration –  Parts in assembly –  Related items (versions) –  Social groups •  Can be people, devices etc. •  Maintain links in graph structure –  May be current or historical •  Use links to pull full context from Event History or Timeline •  Search -> simple query -> complex analytics –  E.g., transitive closure, impact analysis •  Technologies –  Giraph, GraphX –  TitanDB, Neo4j
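A small in-memory sketch of the transitive-closure style of impact analysis follows. It assumes the assembly links fit in a dictionary mapping each part to the assemblies that contain it (the part names are invented); at scale this traversal would run in a graph engine such as those the slide names.

```python
from collections import deque

def impacted(links, start):
    """Return every node reachable from start by following links (breadth-first)."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for parent in links.get(node, ()):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    seen.discard(start)
    return seen

# Hypothetical "is contained in" links between parts and assemblies.
contained_in = {
    "chip-7": ["board-2"],
    "board-2": ["server-A", "server-B"],
    "fan-1": ["server-A"],
}
affected = impacted(contained_in, "chip-7")
```

The resulting ids can then be used as keys to pull the full context for each affected assembly from Event History or Timeline.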
  • 18. 18 Distributed Sources •  Unlike simple “all or nothing” feeds… •  May have many distributed sources feeding data •  It’s critical to know whether all (or enough) data has arrived •  Goals –  only produce analytic results when sufficient –  provide provenance – timeliness & completeness statistics •  Need –  SLAs about timeliness and required fraction of data –  Control totals –  Metadata about process (expected lineage) –  Heartbeats/configuration •  Root cause of complexity of ingestion
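One way to picture the control-total check is the sketch below, which compares expected record counts per source against what has arrived and reports whether a required fraction has been met. The source names, counts, and the 95% threshold are illustrative assumptions, not values from the talk.

```python
def completeness(expected, received, required_fraction=0.95):
    """Compare received record counts with control totals for each source."""
    total_expected = sum(expected.values())
    # Cap each source at its expected total so duplicates cannot mask gaps elsewhere.
    total_received = sum(min(received.get(src, 0), n) for src, n in expected.items())
    fraction = total_received / total_expected if total_expected else 1.0
    missing = sorted(src for src in expected if received.get(src, 0) == 0)
    return {
        "fraction": fraction,
        "sufficient": fraction >= required_fraction,
        "missing_sources": missing,
    }

expected = {"plant-1": 1000, "plant-2": 500, "plant-3": 500}  # control totals
received = {"plant-1": 1000, "plant-2": 480}                  # counts seen so far
report = completeness(expected, received)
```

A downstream job would publish analytic results only when `report["sufficient"]` is true, and could attach the fraction and the missing sources as provenance.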
  • 19. 19 Late Data •  Data may be delayed due to –  Upstream system failures (server down esp. with unreliable delivery, network outage) –  Offline/disconnected devices (endemic with mobile & IoT) •  Metadata to track lineage is critical •  Define a delay time after which, with high confidence, sufficient data has arrived •  Process “authoritative” derived data after that time –  May process incremental/incomplete data earlier (a la economic statistics) –  May re-process in emergency (restatement) –  May include changed data in later period •  Report on how much data has arrived late •  Implementation: bucket on event time, secondary on delay epoch (partitions for late data) [Figure: Zipfian Distribution]
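The "bucket on event time, secondary on delay epoch" idea might be sketched as below. The daily event bucket, the six-hour cutoff, and the `late_partition` helper are assumptions chosen for the example; delay epoch 0 holds on-time data and higher epochs hold progressively later arrivals.

```python
from datetime import datetime, timedelta

def late_partition(event_time, arrival_time, cutoff=timedelta(hours=6)):
    """Return (event-day bucket, delay epoch) for a record."""
    delay = arrival_time - event_time
    # Dividing two timedeltas with // gives a whole number of cutoff periods;
    # clamp at 0 in case clock skew makes the delay negative.
    delay_epoch = max(0, delay // cutoff)
    return (event_time.strftime("%Y-%m-%d"), delay_epoch)

event_time = datetime(2015, 1, 1, 13, 16, 11)
on_time = late_partition(event_time, datetime(2015, 1, 1, 13, 20, 0))
late = late_partition(event_time, datetime(2015, 1, 2, 9, 0, 0))
```

Because late records land in their own partitions under the original event-time bucket, a restatement only has to read the extra partitions, and the share of rows in epochs above 0 gives the "how much arrived late" report.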
  • 21. 21 Probabilistic Data Structures •  Increasingly valuable as an optimization technique, e.g., •  Bloom filters –  Hashed key values set bits in an array –  Check key to see if it may be present –  Indexing/filtering sparse reads •  HyperLogLog, sketch sets –  Multiple hashes used to estimate count of unique items –  Far more compact (KBs to count billions of items +/- 2%) –  Can be composed (unlike exact unique counts) – e.g., across time, categories •  MinHash –  Least hashed value in common between two sets –  Used to identify duplicates, estimate overlap in arbitrary sets
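Two of the structures on this slide are small enough to sketch directly. The code below is a teaching-sized Bloom filter and MinHash built on SHA-256 with a seed prefix; the hash construction, bit-array size, and signature length are arbitrary choices for illustration, and production systems would typically use a tuned library instead.

```python
import hashlib

def _hash(value, seed):
    """64-bit hash of value, varied by seed (a simple way to get a hash family)."""
    digest = hashlib.sha256(f"{seed}:{value}".encode()).digest()
    return int.from_bytes(digest[:8], "big")

class BloomFilter:
    """Answers 'definitely absent' or 'possibly present' for a key."""

    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def add(self, key):
        for seed in range(self.num_hashes):
            self.bits |= 1 << (_hash(key, seed) % self.size)

    def might_contain(self, key):
        return all(
            (self.bits >> (_hash(key, seed) % self.size)) & 1
            for seed in range(self.num_hashes)
        )

def minhash_signature(items, num_hashes=128):
    """Keep the smallest hash of the set under each of num_hashes hash functions."""
    return [min(_hash(x, seed) for x in items) for seed in range(num_hashes)]

def estimate_jaccard(sig_a, sig_b):
    """The fraction of matching minimums estimates the sets' Jaccard overlap."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

bloom = BloomFilter()
for key in ["uid1", "uid2"]:
    bloom.add(key)

sig_a = minhash_signature({"a", "b", "c", "d"})
sig_b = minhash_signature({"c", "d", "e", "f"})
overlap = estimate_jaccard(sig_a, sig_b)
```

A Bloom filter never reports an added key as absent, so it is safe for skipping partitions or files that cannot contain a key; the MinHash estimate is approximate and tightens as the signature length grows.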
  • 22. 22 Anti-Patterns •  3rd Normal Form, Star Schema, Snowflake Schema •  Index lookups slow in general –  Focus on partitioned reads not disk seeks •  Poor results in practice •  Not natural representations for repeating events, nested structure •  Use of SSD, maturing optimizers, platform updates (Kudu?) are slowly improving… an industry would love this to happen •  Expect data marts to work in Big Data before data warehouses do
  • 23. 23 Conclusions •  Much of Big Data today is trade-craft –  Learned lore & derived from first principles •  As we scale data lakes & analytics, critical to have common vocabulary, shared understandings •  I’d love your input on common patterns & practices •  Look for blogs with more depth on each pattern at http://thinkbig.teradata.com/author/rbodkin/ •  Reach me at @ronbodkin, ron.bodkin@thinkbiganalytics.com