Testing of Hadoop, NoSQL and Data Warehouses Visually
-----------------------------------------------------------------------------
We just made automated data testing really easy. Automate your Big Data testing visually, with no programming needed.
See how to automate Hadoop, NoSQL and Data Warehouse testing visually, without writing any SQL or HQL. See how QuerySurge, the leading Big Data testing solution, gives novices and non-technical team members a fast & easy way to be productive immediately, while speeding up testing for team members skilled in SQL/HQL.
This webinar is geared towards:
- Big Data & Data Warehouse Architects, ETL Developers
- ETL Testers, Big Data Testers
- Data Analysts
- Operations teams
- Business Intelligence (BI) Architects
- Data Management Officers & Directors
You will learn how to:
• Improve your Data Quality
• Accelerate your data testing cycles
• Reduce your costs & risks
• Realize a huge ROI
Big Data Testing: Automate the Testing of Hadoop, NoSQL & DWH without Writing Code
1. built by
QuerySurge™
Automated
Big Data Testing
without Writing Code
Testing of Hadoop and Data Warehouses Visually
Bill Hayduk
CEO/President
RTTS
Jeff Bocarsly, PhD
Chief Architect
QuerySurge /RTTS
3. About RTTS
FACTS
Founded: 1996
Headquarters: New York
Customer profile:
• Fortune 1000
• 600+ customers
Strategic Partners: IBM, Microsoft, HP, Oracle, Teradata, Hortonworks, Cloudera, Amazon Web Services
Software: QuerySurge
RTTS is the leading provider of software & data quality for critical business systems
4. Data Quality Issues
“70% of enterprises have either deployed or are planning to deploy big data projects and programs this year”
- analyst firm IDG
“46% of companies cite data quality as a barrier for adopting Business Intelligence products.”
- InformationWeek
“Poor data quality is a primary reason for 40% of all business initiatives failing to achieve their targeted benefits.”
- analyst firm Gartner
5. The Executive Office & Critical Data
CxOs are using Business Intelligence & Analytics to make critical business decisions, with the assumption that the underlying data is fine.
“The average organization loses $14.2 million annually through poor Data Quality.”
- Gartner
Diagram: Data Architecture feeding Business Intelligence (BI) software, with Flat Files and ETL marked as potential problem areas
7. Data Warehouse: the Marketplace
“The data warehousing market will see a compound annual growth rate of 11.5% …to reach a total of $13.2 billion in revenue.”
- consulting specialist The 451 Group
Data Warehouse software vendors
Figure: Analyst firm Gartner’s Magic Quadrant for Data Warehouse Database Management Systems (Leaders and Challengers quadrants)
9. Testing the Data Warehouse: Test Entry Points
Recommended functional test strategy: Test every entry point in the system (feeds, databases, internal messaging, front-end transactions).
The goal: provide rapid localization of data issues between points
Diagram: Source Data (Legacy DB, CRM/ERP DB, Finance DB) -> ETL Process -> Target DW -> ETL Process -> Data Mart -> Business Intelligence software, with test entry points marked along the flow
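The idea of localizing a data issue between entry points can be sketched in a few lines of Python. This is only an illustration, assuming each entry point yields one value that should stay the same along the flow (for example, a row count on a load with no filtering or aggregation). The function name `first_failing_leg` and the sample numbers are invented for this sketch and are not part of QuerySurge.

```python
def first_failing_leg(stage_checks):
    """Return the first pair of adjacent stages whose check values differ.

    stage_checks is an ordered list of (stage_name, check_value) tuples.
    Returns None when every adjacent pair agrees.
    """
    for (prev_name, prev_value), (next_name, next_value) in zip(
        stage_checks, stage_checks[1:]
    ):
        if prev_value != next_value:
            return prev_name, next_name
    return None


# Invented sample values: one check per test entry point, in flow order.
stage_checks = [
    ("Source Data", 1_000_000),
    ("Target DW", 1_000_000),
    ("Data Mart", 998_750),
]

leg = first_failing_leg(stage_checks)
if leg is None:
    print("All entry points agree")
else:
    print(f"Data issue localized between {leg[0]} and {leg[1]}")
```

Testing only the two ends of the flow would show that something is wrong; checking every entry point shows which ETL leg to investigate.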
11. Big Data Vendors
Big Data technology & services market will grow at a 26.4% CAGR to $41.5 billion
through 2018, or about 6x the growth rate of the overall IT market.
- Analyst firm IDC
12. Basic Hadoop Architecture
MapReduce: processing part that manages the programming jobs. (a.k.a. Task Tracker)
HDFS (Hadoop Distributed File System): stores data on the machines. (a.k.a. Data Node)
Machine: each machine in the cluster runs a Task Tracker (MapReduce) and a Data Node (HDFS).
Cluster: Add more machines for scaling, from 1 to 100 to 1,000
Name Node: Coordination for HDFS. Inserts and extraction are communicated through the Name Node.
Job Tracker: accepts jobs, assigns tasks, identifies failed machines
13. Hive
Hive: a data warehouse infrastructure built on top of Hadoop for providing data summarization, query, and analysis.
Hive provides a mechanism to query the data using a SQL-like language called HiveQL that interacts with the HDFS files
• create
• insert
• update
• delete
• select
Diagram: HiveQL queries running on top of MapReduce (Task Tracker) and HDFS (Data Node)
15. Use Case #1: Data Warehouse & Hadoop
Recommended functional test strategy: Test every entry point in the system (feeds, databases, internal messaging, front-end transactions).
The goal: provide rapid localization of data issues between points
Diagram: Source Data, Source Hadoop, ETL Process, Target DWH and Business Intelligence software, with test entry points marked along the flow
16. Use Case #2: MongoDB, Hadoop, Data Warehouse
Diagram: stages labeled Source Data, Ingestion, Relational DB & Data Warehousing, and BI, Analytics & Reporting, with five test entry points marked
17. 2 Prevalent Data Testing Strategies
1) Stare & Compare
(also known as sampling)
2) Minus Queries
18. Strategy #1: Stare & Compare
• Review Mapping Document (business rules, data flow mapping, data movement requirements)
• Write Tests in SQL editor
• Execute 2 Tests: 1 at Source & 1 at Target
• Dump results to 2 Excel files
• Compare results by eye (‘Stare & Compare’ or ‘sampling’)
Issue with Stare & Compare:
Impossible to visually compare billions of data sets.
Result: usually less than 1% of data is compared
Example:
Current QuerySurge customer has:
• a single test with 100 million rows & 200 columns
• = 20 billion data sets
• the client has > 7,000 total tests
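The scale in this example can be checked with simple arithmetic. The sketch below uses only the figures from the slide; the variable names are assumptions, and the 1% figure is treated as an upper bound because the slide says "less than 1%".

```python
# Figures taken from the customer example above.
rows = 100_000_000   # rows in the single test
columns = 200        # columns in the same test

# Every row/column intersection is one value that should be verified.
data_sets = rows * columns

# "Less than 1%" of the data is compared, so 1% is an upper bound.
compared_at_most = data_sets // 100
unverified_at_least = data_sets - compared_at_most

print(f"{data_sets:,} data sets in one test")
print(f"at most {compared_at_most:,} compared by eye")
print(f"at least {unverified_at_least:,} never checked")
```

Even at the 1% upper bound, roughly 19.8 billion values in this one test are never looked at, and the client has more than 7,000 tests.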
19. Data Testing Strategy #2: Minus Queries
MINUS QUERIES subtract one result set from another result set to show the difference
• Write 2 MINUS queries in SQL editor
• Execute MINUS queries 2x
Minus Query #1: Table_1 MINUS Table_2 (Result Set #1)
Minus Query #2: Table_2 MINUS Table_1 (Result Set #2)
Comment: MINUS QUERIES need to be executed 2x (Source MINUS Target; Target MINUS Source)
ISSUES with MINUS QUERIES
• Result sets may not be accurate when dealing with duplicate rows of data
• No historical data from past testing (audit and regulatory issues)
• Processing of minus queries puts pressure on the servers
• Double execution means 2x testing time and resource utilization
• Potential for false positives (bad data could exist on both sides of an ETL leg)
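The two-query pattern and the duplicate-row issue can be reproduced with a small, self-contained sketch. It uses Python's built-in sqlite3 module purely for illustration; SQLite spells the operator EXCEPT rather than MINUS (the Oracle keyword), and the table contents are invented sample data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table_1 (id INTEGER, name TEXT);  -- stands in for the source
    CREATE TABLE table_2 (id INTEGER, name TEXT);  -- stands in for the target
    INSERT INTO table_1 VALUES (1, 'ann'), (2, 'bob'), (2, 'bob'), (3, 'cat');
    INSERT INTO table_2 VALUES (1, 'ann'), (2, 'bob'), (3, 'kat');
""")

# Minus Query #1: rows in table_1 that are not in table_2.
result_set_1 = conn.execute(
    "SELECT id, name FROM table_1 EXCEPT SELECT id, name FROM table_2"
).fetchall()

# Minus Query #2: rows in table_2 that are not in table_1.
result_set_2 = conn.execute(
    "SELECT id, name FROM table_2 EXCEPT SELECT id, name FROM table_1"
).fetchall()

print("Result Set #1:", result_set_1)
print("Result Set #2:", result_set_2)

# The operator is set-based, so the duplicated (2, 'bob') row in table_1
# appears in neither result set even though the row counts differ.
count_1 = conn.execute("SELECT COUNT(*) FROM table_1").fetchone()[0]
count_2 = conn.execute("SELECT COUNT(*) FROM table_2").fetchone()[0]
print("Row counts:", count_1, "vs", count_2)
```

Both directions are needed to see the 'cat' / 'kat' mismatch from each side, and neither direction reports the extra duplicate row.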
22. What is QuerySurge™?
The collaborative Big Data Testing solution that finds bad data & provides a holistic view of your data’s health
23. the QuerySurge advantage
Automate the entire testing cycle
Automate kickoff, tests, comparison, auto-emailed results
Create Tests easily with no programming
ensures minimal time & effort to create tests / obtain results
Test across different platforms
data warehouse, Hadoop, NoSQL, database, flat file, XML
Collaborate with team
Data Health dashboard, shared tests & auto-emailed reports
Verify more data & do it quickly
verifies up to 100% of all data, up to 1,000x faster
Integrate for Continuous Delivery
Integrates with most Build, ETL & QA management software
24. Collaboration
Testers
- functional testing
- regression testing
- result analysis
Developers / DBAs
- unit testing
- result analysis
Data Analysts
- review, analyze data
- verify mapping failures
Operations teams
- monitoring
- result analysis
Managers
- oversight
- result analysis
Share information on the
26. How QuerySurge Works
QS pulls data from data sources (SQL / HQL)
QS pulls data from target data store (SQL / HQL)
QS compares data quickly
QS generates reports, audit trails
Reports, Data Health Dashboard, auto emails
Source Data Target Data
Data Stores
• Databases
• Data Warehouses
• Data Marts
Flat Files
• Fixed Width
• Delimited
• Excel
Big Data stores
• Hadoop
• NoSQL
Data Warehouses
XML
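The pull, compare and report flow described on this slide can be sketched as follows. This is not QuerySurge's implementation: `run_query_pair` is a hypothetical helper, two in-memory SQLite databases stand in for the source and target data stores, and the table names and rows are invented.

```python
import sqlite3
from collections import Counter


def run_query_pair(source_conn, source_sql, target_conn, target_sql):
    """Hypothetical helper: pull both result sets and report the differences.

    Counter keeps duplicate rows, so an extra copy of a row is reported too.
    """
    source_rows = Counter(source_conn.execute(source_sql).fetchall())
    target_rows = Counter(target_conn.execute(target_sql).fetchall())
    return {
        "missing_in_target": sorted((source_rows - target_rows).elements()),
        "unexpected_in_target": sorted((target_rows - source_rows).elements()),
    }


# Stand-ins for a source data store and a target data warehouse.
source = sqlite3.connect(":memory:")
target = sqlite3.connect(":memory:")
source.executescript("""
    CREATE TABLE orders (id INTEGER, amount INTEGER);
    INSERT INTO orders VALUES (1, 100), (2, 250), (3, 75);
""")
target.executescript("""
    CREATE TABLE dw_orders (order_id INTEGER, amount INTEGER);
    INSERT INTO dw_orders VALUES (1, 100), (2, 205);
""")

report = run_query_pair(
    source, "SELECT id, amount FROM orders",
    target, "SELECT order_id, amount FROM dw_orders",
)
passed = not report["missing_in_target"] and not report["unexpected_in_target"]
print("PASS" if passed else "FAIL", report)
```

The source and target queries together form one Query Pair; a real run would also store the outcome so that reports and audit trails can be produced later.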
28. QuerySurge™ Modules
Design Library
• Create Query Pairs (source & target SQLs)
• Great for team members skilled with SQL
Scheduling
• Build groups of Query Pairs
• Schedule Test Runs
29. QuerySurge™ Modules
Deep-Dive Reporting
• Examine and automatically email test results
Run Dashboard
• View real-time execution
• Analyze real-time results
30. QuerySurge Test Management Connectors
Integration with leading Test Management Solutions:
• HP ALM (Quality Center)
• Microsoft Team Foundation Server
• IBM Rational Quality Manager
Drive QuerySurge execution from your Test Management Solution
Outcome results (Pass/Fail/etc.) are returned from QuerySurge to your Test Management Solution
Results are linked in your Test Management Solution so that you can click directly into detailed QuerySurge results
31. QuerySurge & DevOps: Continuous Delivery & Integration
Automated Launch, Automated Testing, Automated Reporting
Diagram: QuerySurge™ connected to three groups of tools, each producing an emailed report:
• Automated Build solutions
• Data Integration/ETL solutions (and many others…)
• Test Management solutions (and many others…)
32. Testing Big Data Visually
Introducing the new
We just made data testing REALLY EASY!
No programming needed
33. Recent Survey: Data Experts
From a recent poll (1) of:
• Big Data Experts
• Data Warehouse Architects
• Solution Architects
• ETL Architects
Our Question: What % of columns in your projects have no transformations at all?
Consensus Answer: 80% of data columns have no transformation at all
Why is this important?
(1) Poll conducted by RTTS on targeted LinkedIn groups
34. QuerySurge™ Modules
Fast and Easy. No programming needed.
Compare by Table, Column & Row
• Perform 80% of all data tests
• Automatically generates SQL & HQL code
• Opens up testing to novice & non-technical team members
• Speeds up testing for skilled SQL coders
• Provides a huge Return-On-Investment
35. QuerySurge™ Modules
3 Types of Data Comparison Wizards:
Column-Level Comparison:
This is great for Big Data stores and Data Warehouses.
Table-Level Comparison:
This comparator is great for Data Migrations and Database Upgrades.
Row Count Comparison:
Great for all: Big Data stores, Data Warehouses, Data Migrations and Database Upgrades.
They also provide you with automated features for:
o filtering (‘Where’ clause) and
o sorting (‘Order By’ clause)
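To give a feel for what generating a comparison query from a few selections might involve, here is a minimal sketch. It is an assumption-laden illustration, not the wizards' actual output: `build_comparison_sql`, the table names and the column names are all hypothetical, and no identifier validation or quoting is attempted.

```python
def build_comparison_sql(table, columns, where=None, order_by=None):
    """Hypothetical generator: build the SELECT for one side of a comparison.

    where is an optional filter expression ('Where' clause) and order_by is
    an optional list of columns to sort on ('Order By' clause).
    """
    sql = f"SELECT {', '.join(columns)} FROM {table}"
    if where:
        sql += f" WHERE {where}"
    if order_by:
        sql += f" ORDER BY {', '.join(order_by)}"
    return sql


# Invented names: untransformed columns mapped from a source to a target.
source_sql = build_comparison_sql(
    "staging.customers", ["id", "email"], where="active = 1", order_by=["id"]
)
target_sql = build_comparison_sql(
    "dw.dim_customer", ["customer_id", "email"],
    where="active = 1", order_by=["customer_id"],
)

print(source_sql)
print(target_sql)
```

Because untransformed columns need no custom logic, the source and target queries differ only in table and column names, which is what makes this kind of test a candidate for generation rather than hand coding.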
36. Column-Level Comparison
Uses:
Tests the columns that have no transformations, which means it tests approximately 80% of your data store without you writing any SQL code
Tests:
Big Data, Data Warehouses
Value added:
• novice or non-technical: no coding needed, productive immediately
• experienced user: saves time
38. Table-Level Comparison
Uses:
Verifies data loads when no transformation occurs
Tests:
data migrations, upgrades
Value added:
• novice or non-technical: no coding needed
• experienced user: saves time
39. Row Count Comparison
Use:
Verify that the number of rows from the source matches the number from the target
Tests:
Big data, data warehouse, data migration, database upgrades, data interfaces
Value added:
• novice: no coding needed
• experienced user: saves time
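A row count comparison is the simplest of the three checks. The sketch below shows the idea using Python's built-in sqlite3 module; the table names and rows are invented, and `row_count` is a hypothetical helper rather than anything QuerySurge exposes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE source_sales (id INTEGER, amount INTEGER);
    CREATE TABLE target_sales (id INTEGER, amount INTEGER);
    INSERT INTO source_sales VALUES (1, 10), (2, 20), (3, 30);
    INSERT INTO target_sales VALUES (1, 10), (2, 20);
""")


def row_count(connection, table):
    """Hypothetical helper: count the rows in one table.

    The table name is interpolated directly, so pass trusted constants only.
    """
    return connection.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]


source_count = row_count(conn, "source_sales")
target_count = row_count(conn, "target_sales")
difference = source_count - target_count

status = "PASS" if difference == 0 else "FAIL"
print(f"{status}: source={source_count}, target={target_count}, difference={difference}")
```

A matching count does not prove the values are correct, which is why it sits alongside the column-level and table-level comparisons rather than replacing them.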
40. 10/15/2015
Training Courses
Data Warehouse Testing
• Data Warehouse & ETL Testing Fundamentals (1 day)
• Fundamentals of QuerySurge (1 day)
• Introduction to SQL for QuerySurge (1 day)
• Advanced SQL techniques for QuerySurge (1 day)
Big Data Testing
• Big Data And ETL Testing Fundamentals
• Introduction To Big Data Testing Using Hive And HQL
Consulting
RTTS, the software quality experts (and developer of QuerySurge), provides consulting
solutions to the challenges of Big Data & Data Warehouse / ETL Testing
• Jumpstart 2-week program – combines training courses, mentoring, consulting
• Staff Augmentation – add additional RTTS resources to your team
• Outsourcing - RTTS can perform all testing, including planning, design, execution
41. QuerySurge™ Free Trials
(1) Trial in the Cloud of QuerySurge™, including self-learning tutorial that works with sample data for 3 days
(2) Downloaded Trial of QuerySurge™, including self-learning tutorial with sample data or your data for 15 days
(3) Proof of Concept of QuerySurge™ includes our team of experts assisting you for 30 days
For more information on (1), (2) and (3), go to http://www.querysurge.com/compare-trial-options
Informatica’s software is the premier product used for ETL, but it was not mentioned in Gartner’s report because Informatica does not have DW software.
QuerySurge provides insight into the health of your data throughout your organization through BI dashboards and reporting at your fingertips. It is a collaborative tool that allows for distributed use of the tool throughout your organization and provides for a sharable, holistic view of your data’s health and your organization’s level of maturity of your data management.
QuerySurge can be utilized by active practitioners such as testers & developers to create and launch tests, or by managers, analysts and operations to view data test results and the overall health of the data. QuerySurge facilitates this by providing 2 types of licenses: (1) full user & (2) participant user.
(1) Full User – This type of user has unlimited access to create QueryPairs, Suites, and Scenarios. This user can also schedule and run tests, see results, run and export reports, and export data. Perfect for anyone creating and/or running data tests while performing analysis of results.
(2) Participant User – This user cannot create or run tests, but has access to all other information - including viewing all query pairs, results, and reports, receiving email notifications, and exporting test results and reports. Perfect for managers, analysts, architects, DBAs, developers, and operations users who need to know the health of their data.
Your distributed team from around the world can use any of these web browsers: Internet Explorer, Chrome, Firefox and Safari.
Installs on operating systems: Windows & Linux.
QS connects to any JDBC-compliant data source, even if it is not listed here.
QuerySurge finds bad data by natively connecting to:
any data source, whether it is any type of database, flat file or XML, and
any data target, whether it is a database, file, XML, data warehouse or Hadoop implementation.
QuerySurge pulls data from the source and the target and compares them very quickly (typically in a few minutes) and then produces reports that show every data difference, even if there are millions of rows and hundreds of columns in the test. These reports can be automatically emailed to your team.
You can pick from a multitude of reports or export the results so that you can build your own reports.