Recent research has pointed out the complementary nature of Hadoop and other data management solutions and the importance of leveraging existing systems, SQL, engineering, and operational skills, as well as incorporating novel uses of MapReduce to improve analytic processing. Come to this session to learn how companies optimize the use of Hadoop with other enterprise systems to improve overall analytical throughput and build new data-driven products. This session covers: ways to achieve high-performance integration between Hadoop and relational-based systems; Hadoop+NoSQL vs Hadoop+SQL architectures; high-speed, massively parallel data transfer to analytical platforms that can aggregate web log data with granular fact data; and strategies for freeing up capacity for more explorative, iterative analytics and ad hoc queries.
2. What We’re Covering Today
• Data Science in Enterprise (vs the Valley)
• Quick Overview of Teradata Aster’s Technology
• Hybrid Hadoop Architectures
• Connecting Hadoop to Other Systems
• MapReduce Enterprise Use Cases
2 Teradata Confidential and Proprietary
3. About Aster Data
• Aster has been a Big Data & Big Analytics pioneer since 2005 by developing an MPP SQL+MapReduce platform
• Aster Data acquisition completed on April 6, 2011
• Opportunity for Teradata to expand its business in the Big Data analytics market to include multi-structured data and new analytical capabilities
• Intense Focus on the Enterprise
4. The Nature of Data Scientist Analytics in the Enterprise
5. What is Data Science?
[Figure: Data Scientists combine Curiosity/Cleverness, Technical Expertise, and Business Acumen]
6. Data Science is Exploding
7. What is Making Data Science Popular?
1. Proliferation of Data-Driven Products & Businesses
2. Consumer Interactions with Web & Social Channels
3. Breadth of Tools Available
4. Wealth of Machine-Generated Data
8. A Day in the Life of a Data Scientist: “Investigative Analytics”
Integrate
Investigate
Implement
9. Data Scientists in the Enterprise are Not Only Developers
• SQL Analysts
• SAS/R Analysts
• DBMS Power Users
• Java Coders
• …
[Figure: Curiosity/Cleverness, Technical Expertise, Business Acumen]
10. Data Scientists Have Different Skills
Combination of:
- Analysts
- Coders
- Sys admins / EngOps
Hard to find & expensive
[Figure labels: Enterprises, Web Startups]
12. A Brief History of MapReduce & Hadoop
• 2004: Google publishes MapReduce paper at OSDI Conference
• 2006: Hadoop becomes the first open-source implementation of MapReduce
• 2008: Aster Data becomes the first vendor to incorporate MapReduce
- Aster Data tightly coupled: embedded MapReduce with SQL to bring MapReduce to enterprises (SQL-MapReduce®)
• 2009-2011: Follow-on DBMS vendors announce connectors to Hadoop
- Hadoop MapReduce distributions/platforms emerge: Amazon, Cloudera, Hortonworks, DataStax, MapR, …
13. MapReduce is the SQL of Big Analytics
• MapReduce is a parallel programming framework
- “J2EE for Big Data Analytics”
• MapReduce provides
- Automatic parallelization
- Fault tolerance
- Monitoring & status updates
• Hadoop
- Open source MapReduce
• Aster
- Commercial implementation of MapReduce + SQL
[Figure labels: Map Function, Scheduler, map, shuffle, reduce, Results]
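The map, shuffle, and reduce stages named on this slide can be illustrated with a small in-memory sketch. This is only a single-process illustration of the programming model, not how Hadoop or Aster implement it; the function names (`map_words`, `shuffle`, `reduce_counts`, `run_mapreduce`) and the word-count task are assumptions chosen for the example.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Iterator, List, Tuple


def map_words(record: str) -> Iterator[Tuple[str, int]]:
    """Map function: emit a (word, 1) pair for every word in one record."""
    for word in record.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group every emitted value under its key."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)


def reduce_counts(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce function: collapse all values for one key into a total."""
    return key, sum(values)


def run_mapreduce(
    records: Iterable[str],
    map_fn: Callable[[str], Iterator[Tuple[str, int]]],
    reduce_fn: Callable[[str, List[int]], Tuple[str, int]],
) -> Dict[str, int]:
    """Run map, shuffle, and reduce in sequence (hypothetical helper).

    A real framework would run map_fn and reduce_fn on many workers in
    parallel and handle scheduling and fault tolerance; here everything
    runs in one process so the data flow is easy to follow.
    """
    mapped = (pair for record in records for pair in map_fn(record))
    grouped = shuffle(mapped)
    return dict(reduce_fn(key, values) for key, values in grouped.items())


word_counts = run_mapreduce(
    ["big data big analytics", "big enterprise"], map_words, reduce_counts
)
```

The developer writes only the map and reduce functions; the framework owns the shuffle and the parallel execution, which is what the "automatic parallelization" bullet refers to.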
15. The Technology Gap
SQL-MR
• Analyst-friendly
• Iterative & Fast
• Integrates well with BI/Viz Tools
Hadoop-MR
• Developer-friendly
• Batch-oriented
• Requires lots of coding
But what if you need both?
17. Filling the Gap: SQL-MapReduce
18. Enabling Analysis of Diverse Data
Aster capabilities for processing and analyzing multi-structured, raw data
[Figure: multi-structured raw data and structured data (DW, DBMS) enter the Aster Analytic Platform, where SQL-MapReduce functions (tokenize, unpack, sessionize, …) produce structured output (Col1, Col2, Col3, Col4)]
Integrate Data
• Load raw data directly into Aster Database
• Bypass complex ETL pipeline via ELT
Process and Explore
• Use SQL-MapReduce functions to interpret & analyze raw data
• Leverage flexible, dynamically-created schema at runtime
Leverage Results
• Structured output of SQL-MapReduce processing available for further use or output to data warehouse
19. SQL-MapReduce for Big Data Analytics
Example: Pattern Matching, Time Series Analysis
Discover patterns in rows of sequential data
• Weblogs {user, page, time}: Click 1, Click 2, Click 3, Click 4
• Smart Meters {device, value, time}: Reading 1, Reading 2, Reading 3, Reading 4
• Sales Transactions {user, product, time}: Purchase 1, Purchase 2, Purchase 3, Purchase 4
• Stock Tick Data {stock, price, time}: Tick 1, Tick 2, Tick 3, Tick 4
• Call Data Records {user, number, time}: Call 1, Call 2, Call 3, Call 4
Aster SQL-MapReduce Approach
• Single-pass of data
• Linked list sequential analysis
• Gap recognition
Traditional SQL Approach
• Full Table Scans
• Self-Joins for sequencing
• Limited operators for ordered data
eBusiness Telecomm Financial Federal
>Sessionization >Calling Patterns >Trade Sequences >Pattern Detection
>Click Analysis >Signal Processing >Pairs Trading >Fuzzy Matching
>Golden Path >Forecasting >Fraud Detection >Inference Analysis
>Rev Attribution >Inexact linking
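As a rough illustration of the single-pass, gap-recognition idea on weblog rows of {user, page, time}, the sketch below assigns session ids by walking the ordered clicks once. It is not Aster's sessionization function; the name `sessionize`, the 1800-second timeout, and the integer timestamps are assumptions made for the example.

```python
from typing import Iterable, List, Tuple

# One weblog row: (user, page, time in seconds). The layout is assumed.
Click = Tuple[str, str, int]


def sessionize(
    clicks: Iterable[Click], timeout: int = 1800
) -> List[Tuple[str, str, int, int]]:
    """Append a per-user session id to each click in one ordered pass.

    A new session starts whenever the gap since the same user's previous
    click exceeds `timeout` seconds (a hypothetical default of 30 minutes).
    """
    # Ordering by (user, time) plays the role of partitioning and sorting.
    ordered = sorted(clicks, key=lambda click: (click[0], click[2]))

    result: List[Tuple[str, str, int, int]] = []
    prev_user = None
    prev_time = 0
    session_id = 0
    for user, page, ts in ordered:
        if user != prev_user:
            session_id = 0  # first click seen for this user
        elif ts - prev_time > timeout:
            session_id += 1  # gap recognized: start a new session
        result.append((user, page, ts, session_id))
        prev_user, prev_time = user, ts
    return result


sessions = sessionize(
    [
        ("u1", "home", 0),
        ("u1", "search", 60),
        ("u2", "home", 10),
        ("u1", "cart", 4000),
    ]
)
```

Each row is compared only with the row before it, so no self-join is needed to line up consecutive events.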
20. Sample SQL-MapReduce Packaged Functions
Modules and their SQL-MapReduce Analytic Functions (* = New)
Path Analysis: Discover patterns in rows of sequential data
• nPath: complex sequential analysis for time series and behavioral patterns
• nPath Extensions: count entrants, track exit paths, count children, and generate subsequences
• Sessionization: identifies sessions from time series data in a single pass
Graph and Relational Analysis: Analyze patterns across rows of data
• Graph analysis: finds shortest path from distinct node to all other nodes in graph
• nTree: new function for performing operations on tree hierarchies *
• Other: triangle finding, square finding, clustering coefficient *
Text Analysis: Derive patterns in textual data
• Sentiment Analysis: classifies content as positive or negative (for product review, customer feedback) *
• Text Categorization: used to label content as spam/not spam *
• Entity Extraction/Rules Engine: identifies addresses, phone numbers, names from textual data *
• Text Processing: counts occurrences of words, identifies roots, & tracks relative positions of words & multi-word phrases
• nGram: splits an input stream of text into individual words and phrases
• Levenshtein Distance: computes the distance between two words
Data Transformation: Transform data for more advanced analysis
• Pivot: converts columns to rows or rows to columns *
• Log parser: generalized tool for parsing Apache logs *
• Unpack: extracts nested data for further analysis
• Pack: compresses multi-column data into a single column
• Antiselect: returns all columns except for specified column
• Multicase: case statement that supports row match for multiple cases
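To make one of the listed text functions concrete, here is a minimal sketch of Levenshtein distance, the number of single-character insertions, deletions, and substitutions needed to turn one word into another. This is the standard dynamic-programming formulation written for illustration; it says nothing about how the packaged Aster function is implemented or invoked.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the edit distance between two words.

    Keeps only the previous row of the dynamic-programming table, so
    memory use is proportional to the shorter word.
    """
    if len(a) < len(b):
        a, b = b, a  # make b the shorter word

    # Distance from the empty prefix of a to each prefix of b.
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty prefix of b
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(
                min(
                    previous[j] + 1,         # delete ch_a
                    current[j - 1] + 1,      # insert ch_b
                    previous[j - 1] + cost,  # substitute, or match at no cost
                )
            )
        previous = current
    return previous[-1]


distance = levenshtein("kitten", "sitting")
```

A small distance between two strings is one common signal for fuzzy matching of names or search terms.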
22. You Need Hybrid Architectures
Hadoop (Batch): Ingest, Transform, Archive
Engineers, 5-10 concurrent users
• Fast data loading
• ELT/ETL
• Image processing
• Online archival
Aster (Interactive): Discover and explore
Data Scientists, 50+ concurrent users
• Path & pattern analysis
• Graph analysis
• Fraud detection
• Text analysis
Teradata (Active): Analyze and Report
Business Analysts, 5000+ concurrent users
• Operational analysis
• Transactional analysis
• High volume ad-hoc
• Elastic data marts
23. Complementary and Overlapping Use Cases
Use cases (BATCH PROCESSING)
• Data preprocessing
• Image processing
• Search indexes
• Web crawling
Use Cases
• Web log analysis
• Text processing
• Genomic, Astronomical, Geo-Spatial, scientific
Use Cases (FAST/INTERACTIVE)
• Pattern matching
• Visitor behavior
• Graph & relationship analysis
• Investigative analytics
24. An Example of an Enterprise Hybrid Architecture
[Figure: architecture diagram. Users: Data Scientists, Data Analysts, BI, Business Apps. Platforms: Teradata | Aster, Hadoop, Teradata | EDW. Data: Multi-Structured Data, Structured Data.]
Labels recovered from the diagram:
• Batch Processing, Data Archival, Data Transformations
• Weblogs, Machine data, Customer Interaction data, Call center text data
• Financial data; SAP, ERP, …; Address, phones, …
• Customer addresses, phones, etc; Integration with financial, operational data
26. 3 Ways to Connect Hadoop to Databases
[Figure: the three approaches arranged by Ease of Use; the figure also carries the label Ad-Hoc]
• Batch HDFS Scripts
• Hadoop Front-End (Pig/Hive)
• Purpose-Built Connectors
27. Using Aster Data and Hadoop Together
Aster Data for rich, ultra-fast analytics
[Figure: diverse data sources (web data, NetFlow data, log files, text files) feed Hadoop (Map/Reduce over HDFS); an HDFS Connector moves data into Aster Database (SQL + SQL/MR)]
1. Non-relational data loaded into Hadoop cluster
2. Hadoop processes data transformation
3. Data from HDFS loaded into Aster using HDFS connector
4. Data used for interactive analytics inside Aster Database
28. The Aster-Hadoop Data Connector
Enable users to analyze data where it makes the most sense
• Why Is It Needed?
- Hadoop can be used for batch ETL and batch data processing
- Aster for fast, interactive analysis
- Challenge: slow, tedious manual operations required to transfer data from Hadoop into Aster Database
• What Is It?
- A set of 2 SQL-MapReduce functions developed by Aster Data
- LoadFromHadoop: Parallel data loading from HDFS to Aster nCluster
- LoadToHadoop: Parallel data loading from Aster nCluster to HDFS
- Advantages: Parallel performance, Seamless (SQL), Consistency (ACID)
Example:
    insert into mytable
    select *
    from load_from_hadoop(
        on mytable
        host('10.10.3.22')
        port(9000)
        delimiter(',')
        nullstring('')
        files('hdfs_input_filepaths.txt')
    );
30. Example #1: SQL-MapReduce for Data Scientist Investigative Analytics
Data Scientist Discovery of Bot Detection Algos
• Business Goal:
- Update bot detection algos with new markers of suspect traffic for potential fraud or spam attacks
• Aster Data Differentiated Solution:
- Investigative analysis to identify new attributes that increase the predictive accuracy of bot detection
- Correlate data within/across sessions from complex URLs
- Use nPath to quickly identify and iteratively explore site activity patterns
• Business Impact:
- Site integrity: identify bot traffic which can degrade performance and security of www.book.com (B&N)
- Improved customer experience: detect and prevent spam and other automated nuisances to B&N members
“We’ve always wanted to examine search sub-sessions to really understand what behaviors come from specific searches… All of this requires cursors and external programming in Oracle, but can be easily parallelized in Aster Data even with non-programmers.”
Michael Wexler, VP of Analytics, Barnes & Noble
Other Aster Data Applications at Barnes & Noble:
• Online marketing attribution - across search, device, features
• Customer personalized recommendations - ever-changing
31. Example #2: Enabling Creation of Data-Driven Products
“Cards that fit you”
• Personalized recommendations of credit cards that would provide best fit for customer
• Uses clickstream analysis + text analysis to process data about customer interests and spending patterns
• Business Impact: delivers referral revenue related to click-throughs on specific card offers
32. Example #3: Better Visibility to Marketing Impact
“Aster gives us the analytic capability to provide best-in-class digital marketing optimization for our clients, enabling more accurate marketing attribution. With Aster, we can help our clients understand every marketing interaction with consumers over time and across their entire online market ecosystem, knowing the impact of every marketing dollar spent.”
Sunil Kavi, Director of Technology, Razorfish
33. Visualization Example: Aster Data Tableau Integration with SQL-MapReduce®
34. Summary - MapReduce for the Rest of Us
1. Data Science is Growing Fast but Big Enterprise is not Facebook
2. There is a Gap Between Existing Enterprise Skills and Technology Capabilities
3. To Solve this Problem Look at Utilizing the Right Technology for the Right Problem
35. Thank You! ... Questions?
Learn More About SQL-MapReduce
• MapReduce Resource Center: www.asterdata.com/mapreduce
• Aster Developer Express IDE trial: www.asterdata.com/ide
• Download white paper at www.asterdata.com
See it in action tonight!! – Aster & Tableau Happy Hour
Eventi Hotel
851 Avenue of the Americas (6th Avenue)
New York, NY 10001
7-9PM