THE MECHANICS OF
TESTING
LARGE DATA PIPELINES
MATHIEU BASTIAN
Head of Data Engineering, GetYourGuide
@mathieubastian
www.linkedin.com/in/mathieubastian
QCon London 2016
Outline
▸ Motivating example
▸ Challenges
▸ Testing strategies
▸ Validation strategies
▸ Tools
[Diagram: Architecture, Unit Test, Integration Tests]
Data Pipelines often start simple
[Diagram: users, e-commerce website, Search App, Views logs in HDFS, offline Search Metrics job, Dashboard]
They have one use-case and one developer
But there are many other use-cases
Recommender Systems
Anomaly Detection
Search Ranking
A/B Testing
Spam Detection
Sentiment Analysis
Topic Detection
Trending Tags
Query Expansion
Customer Churn Prediction
Related searches
Fraud Prediction
Bidding Prediction
Machine Translation
Signal Processing
Content Curation
Image recognition
Optimal pricing
Location normalization
Standardization
Funnel Analysis
Developers add additional events and logs
[Diagram: same pipeline, with Clicks logged next to Views and stored in HDFS]
Developers add third-party data
[Diagram: same pipeline, with third-party Mobile Analytics and A/B Logs stored in HDFS next to Clicks and Views]
Developers add search ranking prediction
[Diagram: same pipeline, with a Features transformation over Clicks, Views and A/B Logs, Training data, a Training & validation step and a Model]
Developers add personalized user features
[Diagram: same pipeline, with Profiles from the User Database as an extra input to the Features transformation]
Developers add query extension
[Diagram: same pipeline, with a Filter queries job, a Query extension job and an RDBMS]
Developers add recommender system
[Diagram: same pipeline, with Features, a Compute recommendations job and a NoSQL store]
Data Pipelines can grow very large
That is a lot of code and data
Code contains bugs
Industry Average: about 15 - 50 errors per 1000 lines of delivered code.
Data will change
Industry Average: ?
Embrace automated
testing of code
validation of data
Because it delivers
▸ Testing
▸ Tested code has fewer bugs
▸ Gives the confidence to iterate quickly
▸ Scales well to multiple developers
▸ Validation
▸ Reduce manual testing
▸ Avoid catastrophic failures
But it’s challenging
▸ Testing
▸ Need data to test "realistically"
▸ Not running locally, can be expensive
▸ Tooling weaknesses
▸ Validation
▸ Data sources out of our control
▸ Difficult to test machine learning models
Reality check
Source: @SteveGodwin, QCon London 2016
Manual testing
[Diagram: cycle of Code, Upload, Run workflow, Look at logs]
▸ Time spent: waiting, coding, looking at logs
Testing strategies
Prepare environment
▸ Care about tests from the start of your project
▸ All jobs should be functions (output only depends on input)
▸ Safe to re-run the job
▸ Does the input data still exist?
▸ Would it push partial results?
▸ Centralize configurations and no hard-coded paths
▸ Version code and timestamp data
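A minimal sketch of what these rules can look like in practice, assuming a job that reads one dated input file and writes one dated output file. The class name `RerunnableJob`, the directory layout and the date format are hypothetical and only serve the illustration.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical job: upper-cases every line of a dated input partition.
final class RerunnableJob {

    // Pure transformation: the output depends only on the input lines.
    static List<String> transform(List<String> lines) {
        return lines.stream().map(String::toUpperCase).collect(Collectors.toList());
    }

    // All paths derive from one base directory and a date stamp (assumed layout),
    // so nothing is hard-coded inside the job logic.
    static Path run(Path baseDir, String date) throws IOException {
        Path input = baseDir.resolve("input").resolve(date + ".txt");
        Path outputDir = baseDir.resolve("output");
        Files.createDirectories(outputDir);
        Path output = outputDir.resolve(date + ".txt");

        List<String> result = transform(Files.readAllLines(input, StandardCharsets.UTF_8));

        // Write to a temporary file first, then move it into place, so a failure
        // while writing does not leave a partial file under the final name.
        Path tmp = Files.createTempFile(outputDir, date, ".tmp");
        Files.write(tmp, result, StandardCharsets.UTF_8);
        Files.move(tmp, output, StandardCopyOption.REPLACE_EXISTING);
        return output;
    }

    public static void main(String[] args) throws IOException {
        Path base = Files.createTempDirectory("pipeline");
        Files.createDirectories(base.resolve("input"));
        Files.write(base.resolve("input").resolve("2016-03-07.txt"),
                Arrays.asList("banana", "pear"), StandardCharsets.UTF_8);
        Path out = run(base, "2016-03-07");
        run(base, "2016-03-07"); // re-running simply replaces the same output
        System.out.println(Files.readAllLines(out, StandardCharsets.UTF_8));
    }
}
```

Keeping `transform` free of I/O is the design choice that matters here: it can be unit tested without any file system, while `run` stays a thin wrapper around paths.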
Unit test locally
▸ Test locally each individual job
▸ Test the expected behavior on good input
▸ Test expected failures
▸ Need to overcome challenges with fake data creation
▸ Complex structures and numerous data sources
▸ Too small to be meaningful
▸ Need to specify a different configuration
Build from schemas
Fake data creation based on schemas. Compare:
Customer c = Customer.newBuilder()
    .setId(42)
    .setInterests(Arrays.asList(
        Interest.newBuilder().setId(0).setName("Ping-Pong").build(),
        Interest.newBuilder().setId(1).setName("Pizza").build()))
    .build();
vs
Map<String, Object> c = new HashMap<>();
c.put("id", 42);
Map<String, Object> i1 = new HashMap<>();
i1.put("id", 0);
i1.put("name", "Ping-Pong");
Map<String, Object> i2 = new HashMap<>();
i2.put("id", 1);
i2.put("name", "Pizza");
c.put("interests", Arrays.asList(i1, i2));
Build from schemas
Avro Schema example
{
  "type": "record",
  "name": "Customer",
  "fields": [{
    "name": "id",
    "type": "int"
  }, {
    "name": "interests",
    "type": {
      "type": "array",
      "items": {
        "name": "Interest",
        "type": "record",
        "fields": [{
          "name": "id",
          "type": "int"
        }, {
          "name": "name",
          "type": ["string", "null"]
        }]
      }
    }
  }]
}
The union type ["string", "null"] makes "name" a nullable field.
Complex generators
▸ Developed in the field of property-based testing
//Small Even Number Generator
val smallEvenInteger = Gen.choose(0,200) suchThat (_ % 2 == 0)
▸ Goal is to simulate, not sample real data
▸ Define complex random generators that match properties (e.g. frequency)
▸ Can go beyond unit-testing and generate complex domain models
▸ https://www.scalacheck.org/ for Scala/Java is a good starting point for examples
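The same idea can be sketched in plain Java without a property-based testing library. This is not the ScalaCheck API: the class `FakeDataGenerators`, the country codes and their weights are all assumptions made for the illustration, and the even-number generator builds even values directly instead of filtering like `suchThat`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Hypothetical generators in the spirit of property-based testing.
final class FakeDataGenerators {

    // An even value in [0, 200], built directly rather than filtered.
    static int smallEvenInteger(Random random) {
        return 2 * random.nextInt(101);
    }

    // Picks values[i] with probability weights[i] / sum(weights).
    static String weightedChoice(Random random, String[] values, double[] weights) {
        double total = 0;
        for (double w : weights) {
            total += w;
        }
        double r = random.nextDouble() * total;
        for (int i = 0; i < values.length; i++) {
            r -= weights[i];
            if (r < 0) {
                return values[i];
            }
        }
        return values[values.length - 1]; // guards against rounding at the upper edge
    }

    // Generates n fake country codes following an assumed frequency distribution.
    static List<String> countries(long seed, int n) {
        Random random = new Random(seed); // a fixed seed keeps the test data reproducible
        String[] values = {"DE", "FR", "US"};
        double[] weights = {0.6, 0.3, 0.1};
        List<String> result = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            result.add(weightedChoice(random, values, weights));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(smallEvenInteger(new Random(1L)));
        System.out.println(countries(42L, 10));
    }
}
```

Seeding the generator is what makes a failing test repeatable: the same seed produces the same fake dataset on every run.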
Integration test on sample data
▸ Integration test the entire workflow
▸ File paths
▸ Configuration
▸ Evaluate performance
▸ Sample data
▸ Large enough to be meaningful
▸ Small enough to speed-up testing
[Diagram: workflow made of JOB A, JOB B, JOB C and JOB D]
Validation strategies
Where it fails
[Chart: failure causes placed by control and difficulty: model biases, bugs, noisy data, schema changes, missing data]
Input and output validation
Make the pipeline robust by validating inputs and outputs
[Diagram: inputs go through validation before the workflow, and the workflow output goes through validation before production]
Input Validation
Input data validation
Input data validation is a key component of pipeline robustness.
The goal is to test the entry points of our system for data quality.
[Diagram: ETL, RDBMS, NoSQL, events and Twitter as entry points of the data pipeline]
Why it matters
▸ Bad input data will most likely degrade the output
▸ It will likely fail silently
▸ Because data will change
▸ Data migrations: maintenance, cluster update, new infrastructure
▸ Events change due to product evolution
▸ Data dependencies updated
Input data validation
▸ Validation code should
▸ Detect pathological data and fail early
▸ Deal with expected data variability
▸ Example issues:
▸ Missing values, encoding issues, etc.
▸ Schema changes
▸ Duplicate rows
▸ Data order changes
Pathological data
▸ Value
▸ Validity depends on a single, independent value.
▸ Easy to validate on streams of data
▸ Dataset
▸ Validity depends on the entire dataset
▸ More difficult to validate as it needs a window of data
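The difference between the two kinds of check can be sketched as follows. The field names (a price and a record id) and the class `PathologicalDataChecks` are hypothetical; the point is only that the first check needs one value while the second needs the whole window.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical checks contrasting value-level and dataset-level validity.
final class PathologicalDataChecks {

    // Value level: looks at one value in isolation, so it can run on a stream.
    static boolean isValidPrice(double price) {
        return !Double.isNaN(price) && price >= 0.0;
    }

    // Dataset level: duplicates can only be detected by remembering
    // every id seen so far in the window.
    static boolean hasDuplicateIds(List<Integer> idsInWindow) {
        Set<Integer> seen = new HashSet<>();
        for (Integer id : idsInWindow) {
            if (!seen.add(id)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(isValidPrice(9.99));
        System.out.println(hasDuplicateIds(Arrays.asList(1, 2, 1)));
    }
}
```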
Metadata validation
Analyzing metadata is the quickest way to validate input data
▸ Number of records and file sizes
▸ Hadoop/Spark counters
▸ Number of map/reduce records, size
▸ Record-level custom counters
▸ Average text length
▸ Task-level custom counters
▸ Min/Max/Median values
Hadoop/Spark counters
Results can be accessed programmatically and checked
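Once counter values have been read back from a job, checking them can be as simple as comparing against a reference run. The sketch below works on plain maps rather than on the Hadoop or Spark counter APIs; the counter names, the tolerance and the class `CounterValidation` are assumptions.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical metadata check on counters already extracted from a job run.
final class CounterValidation {

    // True when the current value deviates from the reference by at most
    // the given relative tolerance (0.1 means 10%).
    static boolean withinTolerance(long current, long reference, double tolerance) {
        if (reference == 0) {
            return current == 0;
        }
        return Math.abs(current - reference) / (double) reference <= tolerance;
    }

    // Names of reference counters that are missing or out of tolerance, sorted by name.
    static List<String> failedCounters(Map<String, Long> current,
                                       Map<String, Long> reference,
                                       double tolerance) {
        Map<String, Long> sorted = new TreeMap<>(reference);
        List<String> failed = new ArrayList<>();
        for (Map.Entry<String, Long> entry : sorted.entrySet()) {
            Long value = current.get(entry.getKey());
            if (value == null || !withinTolerance(value, entry.getValue(), tolerance)) {
                failed.add(entry.getKey());
            }
        }
        return failed;
    }

    public static void main(String[] args) {
        Map<String, Long> yesterday = new HashMap<>();
        yesterday.put("input_records", 1020L);
        yesterday.put("output_records", 980L);
        Map<String, Long> today = new HashMap<>();
        today.put("input_records", 1000L);
        today.put("output_records", 400L);
        System.out.println(failedCounters(today, yesterday, 0.1));
    }
}
```

A non-empty result is a reasonable signal to fail the workflow early, before the suspicious data reaches downstream jobs.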
Control inputs with Schemas
▸ CSVs aren’t robust to change, use Schemas
▸ Makes expected data explicit and easy to test against
▸ Gives basic validation for free with binary serialization (e.g. Avro, Thrift, Protocol Buffers)
▸ Typed (integer, boolean, lists etc.)
▸ Specify if value is optional
▸ Schemas can be evolved without breaking compatibility
Output Validation
Why it matters
▸ Humans make mistakes; we need a safeguard
▸ Rolling back data is often complex
▸ Bad output propagates to downstream systems
Example with a recommender system
// One recommendation set per user
{
  "userId": 42,
  "recommendations": [{
    "itemId": 1456,
    "score": 0.9
  }, {
    "itemId": 4232,
    "score": 0.1
  }],
  "model": "test01"
}
Check for anomalies
Simple strategies similar to input data validation
▸ Record level (e.g. values within bounds)
▸ Dataset level (e.g. counts, order)
Challenges around relevance evaluation
▸ When supervised, use a validation dataset and threshold accuracy
▸ Introduce hypothetical examples
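For the recommender example, the two simple strategies might be sketched like this. The score bounds of [0, 1], the coverage threshold and the class `RecommendationChecks` are assumptions, not something the recommendation format itself guarantees.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical output checks for recommendation sets.
final class RecommendationChecks {

    static final class Recommendation {
        final int itemId;
        final double score;

        Recommendation(int itemId, double score) {
            this.itemId = itemId;
            this.score = score;
        }
    }

    // Record level: a set must be non-empty and every score must lie in [0, 1].
    static boolean isValidSet(List<Recommendation> recommendations) {
        if (recommendations.isEmpty()) {
            return false;
        }
        for (Recommendation r : recommendations) {
            if (Double.isNaN(r.score) || r.score < 0.0 || r.score > 1.0) {
                return false;
            }
        }
        return true;
    }

    // Dataset level: a large enough share of active users must have received a set.
    static boolean hasEnoughCoverage(long usersWithRecommendations,
                                     long activeUsers,
                                     double minCoverage) {
        if (activeUsers <= 0) {
            return false;
        }
        return (double) usersWithRecommendations / activeUsers >= minCoverage;
    }

    public static void main(String[] args) {
        List<Recommendation> set = Arrays.asList(
                new Recommendation(1456, 0.9), new Recommendation(4232, 0.1));
        System.out.println(isValidSet(set));
        System.out.println(hasEnoughCoverage(95, 100, 0.9));
    }
}
```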
Incremental update as validation
Join with the previous "best" output
▸ Allows fine comparisons
▸ Incremental framework can be extended to
▸ Only recompute recommendations that have changed
▸ Produce a variation metric between different models
[Diagram: daily recommendations computed from HDFS are joined with yesterday's recommendations]
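A join with the previous output can be reduced to a small comparison function. In this sketch each run is a map from userId to the ordered list of recommended item ids; that representation, the threshold and the class `IncrementalValidation` are assumptions for the example.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical comparison of today's recommendations with yesterday's.
final class IncrementalValidation {

    // Joins the two runs on userId and returns the share of joined users whose
    // recommended items differ. Returns 0.0 when the runs share no user.
    static double changedFraction(Map<Integer, List<Integer>> yesterday,
                                  Map<Integer, List<Integer>> today) {
        int joined = 0;
        int changed = 0;
        for (Map.Entry<Integer, List<Integer>> entry : today.entrySet()) {
            List<Integer> previous = yesterday.get(entry.getKey());
            if (previous == null) {
                continue; // new user, nothing to compare with
            }
            joined++;
            if (!previous.equals(entry.getValue())) {
                changed++;
            }
        }
        return joined == 0 ? 0.0 : (double) changed / joined;
    }

    // Blocks publication when too many users changed between the two runs.
    static boolean isSafeToPublish(Map<Integer, List<Integer>> yesterday,
                                   Map<Integer, List<Integer>> today,
                                   double maxChangedFraction) {
        return changedFraction(yesterday, today) <= maxChangedFraction;
    }

    public static void main(String[] args) {
        Map<Integer, List<Integer>> yesterday = new HashMap<>();
        yesterday.put(42, Arrays.asList(1456, 4232));
        yesterday.put(7, Arrays.asList(10));
        Map<Integer, List<Integer>> today = new HashMap<>();
        today.put(42, Arrays.asList(1456, 4232));
        today.put(7, Arrays.asList(11));
        today.put(9, Arrays.asList(3));
        System.out.println(changedFraction(yesterday, today));
    }
}
```

The same join also yields the set of users whose recommendations changed, which is what an incremental framework would need in order to recompute only those.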
External validation
Even in an automated environment, it is possible to validate with humans
▸ Example: Search ranking evaluation
▸ Solution: Crowdsourcing
▸ Complex validation that requires training
▸ Can be automated through APIs
Mitigate risk with A/B testing
Gradually rolling out data product improvements reduces the need for complex output validation
▸ Experiment can be controlled online or offline
▸ Online: Push multiple sets of recommendations (1 per model)
userId -> [{
  "model": "test01",
  "recommendations": [{...}]
}, {
  "model": "test02",
  "recommendations": [{...}]
}]
▸ Offline: Split users and push a unique set of recommendations
userId -> {
  "model": "test01",
  "recommendations": [{...}]
}
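One common way to split users offline is to hash the user id into a bucket. The sketch below is an assumption about how such a split could be done, not the method used in the talk; `ModelAssignment` and the experiment name are hypothetical.

```java
import java.util.Objects;

// Hypothetical deterministic assignment of users to model variants.
final class ModelAssignment {

    // Maps a user to one of the models. Mixing in the experiment name means
    // the same user can land in different buckets across experiments.
    static String modelFor(String experiment, int userId, String[] models) {
        int bucket = Math.floorMod(Objects.hash(experiment, userId), models.length);
        return models[bucket];
    }

    public static void main(String[] args) {
        String[] models = {"test01", "test02"};
        System.out.println(modelFor("ranking-v2", 42, models));
    }
}
```

Because the assignment is a pure function of the experiment name and the user id, the batch job and any later log analysis can recompute which model a user was given without storing the split.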
Mitigate risk with A/B testing
Important
▸ Log model variation downstream in logs
▸ Encapsulate model logic
[Diagram: MODEL A and MODEL B each wrapping FEATURE 1 and FEATURE 2, compared with A/B branches spread inside each feature (FEATURE 1-A, 1-B, 2-A, 2-B)]
Appendix
Hadoop Testing
Two ways to test Hadoop jobs
▸ MRUnit
▸ Java library to test MapReduce jobs in a simulated environment
▸ Last release June 2014
▸ MiniCluster
▸ Utility to locally run a fully-functional Hadoop cluster in a test environment
▸ Ships with Hadoop itself
MiniMRCluster
▸ Advantages
▸ Behaves like a real cluster, including setup and configuration
▸ Can be used to test multiple jobs (integration testing)
▸ Disadvantages
▸ Very slow compared to unit testing Java code
MRUnit
▸ Advantages
▸ Faster
▸ Less boilerplate code
▸ Disadvantages
▸ Need to replicate job configuration
▸ Only built to test map and reduce functions
▸ Difficult to make it work with custom input formats (e.g. Avro)
MiniMRCluster setup*
Setup MR cluster and obtain FileSystem
@BeforeClass
public void setup() throws IOException {
    // Start a single-node HDFS cluster under ./target/hdfs/
    Configuration dfsConf = new Configuration();
    dfsConf.set(MiniDFSCluster.HDFS_MINIDFS_BASEDIR, new File("./target/hdfs/").getAbsolutePath());
    _dfsCluster = new MiniDFSCluster.Builder(dfsConf).numDataNodes(1).build();
    _dfsCluster.waitClusterUp();
    _fileSystem = _dfsCluster.getFileSystem();

    // Configure a small YARN cluster that uses the mini HDFS as default file system
    YarnConfiguration yarnConf = new YarnConfiguration();
    yarnConf.setFloat(YarnConfiguration.NM_MAX_PER_DISK_UTILIZATION_PERCENTAGE, 99.0f);
    yarnConf.setInt(YarnConfiguration.RM_SCHEDULER_MINIMUM_ALLOCATION_MB, 64);
    yarnConf.setClass(YarnConfiguration.RM_SCHEDULER, FifoScheduler.class, ResourceScheduler.class);
    yarnConf.set("fs.defaultFS", _fileSystem.getUri().toString());

    // taskTrackers (number of node managers) is assumed to be a field of the test class
    _mrCluster = new MiniMRYarnCluster(getClass().getName(), taskTrackers);
    _mrCluster.init(yarnConf);
    _mrCluster.start();
}
* Hadoop version used 2.7.2
Keep the test file clean of boilerplate code
Best is to wrap the start/stop code into a TestBase class
/**
 * Default constructor with one task tracker and one node.
 */
public TestBase() { ... }

@BeforeClass
public void startCluster() throws IOException { ... }

@AfterClass
public void stopCluster() throws IOException { ... }

/**
 * Returns the Filesystem in use.
 *
 * @return the filesystem used by Hadoop.
 */
protected FileSystem getFileSystem() {
    return _fileSystem;
}
Initialize and clean HDFS before/after each test
Clean up and initialize file system before each test
private final Path _inputPath = new Path("/input");
private final Path _cachePath = new Path("/cache");
private final Path _outputPath = new Path("/output");

@BeforeMethod
public void beforeMethod(Method method) throws IOException {
    getFileSystem().delete(_inputPath, true);
    getFileSystem().mkdirs(_inputPath);
    getFileSystem().delete(_cachePath, true);
}

@AfterMethod
public void afterMethod(Method method) throws IOException {
    getFileSystem().delete(_inputPath, true);
    getFileSystem().delete(_cachePath, true);
    getFileSystem().delete(_outputPath, true);
}
Run MiniCluster Test
Write the input, configure and run the job, then check the output
@Test
public void testBasicWordCountJob() throws IOException, InterruptedException, ClassNotFoundException {
    writeWordCountInput();
    configureAndRunJob(new BasicWordCountJob(), "BasicWordCountJob", _inputPath, _outputPath);
    checkWordCountOutput();
}

private void configureAndRunJob(AbstractJob job, String name, Path inputPath, Path outputPath)
        throws IOException, ClassNotFoundException, InterruptedException {
    // Input and output locations are passed as properties, not hard-coded in the job
    Properties _props = new Properties();
    _props.setProperty("input.path", inputPath.toString());
    _props.setProperty("output.path", outputPath.toString());
    job.setProperties(_props);
    job.setName(name);
    job.run();
}
MRUnit setup
Setup MapDriver and ReduceDriver
BasicWordCountJob.Map mapper;
BasicWordCountJob.Reduce reducer;
MapDriver<LongWritable, Text, Text, IntWritable> mapDriver;
ReduceDriver<Text, IntWritable, Text, IntWritable> reduceDriver;

@BeforeClass
public void setup() {
    mapper = new BasicWordCountJob.Map();
    mapDriver = MapDriver.newMapDriver(mapper);
    reducer = new BasicWordCountJob.Reduce();
    reduceDriver = ReduceDriver.newReduceDriver(reducer);
}
Run MRUnit test
Set Input/Output and run Test
@Test
public void testMapper() throws IOException {
    mapDriver.withInput(new LongWritable(0), new Text("banana pear banana"));
    mapDriver.withOutput(new Text("banana"), new IntWritable(1));
    mapDriver.withOutput(new Text("pear"), new IntWritable(1));
    mapDriver.withOutput(new Text("banana"), new IntWritable(1));
    mapDriver.runTest();
}

@Test
public void testReducer() throws IOException {
    reduceDriver.withInput(new Text("banana"), Arrays.asList(new IntWritable(1), new IntWritable(1)));
    reduceDriver.withInput(new Text("pear"), Arrays.asList(new IntWritable(1)));
    reduceDriver.withOutput(new Text("banana"), new IntWritable(2));
    reduceDriver.withOutput(new Text("pear"), new IntWritable(1));
    reduceDriver.runTest();
}
Most common pitfall
▸ With both MiniMRCluster and MRUnit, one spends most of the time
▸ Creating fake input data
▸ Verifying output data
▸ Solutions
▸ Use rich data structure formats (e.g. Avro, Thrift)
▸ Use automated Java class generation
Other common pitfalls
▸ MiniMRCluster
▸ Enable Hadoop INFO logging so you can see real job failure causes
▸ Beware of partitioning or sorting issues that stay hidden when testing with too few rows or nodes
▸ The API has changed over the years, so up-to-date examples are difficult to find
▸ MRUnit
▸ Custom serialization issues (e.g. Avro, Thrift)
Pig Testing
Introducing PigUnit
▸ PigUnit
▸ Official library to unit test Pig scripts
▸ Ships with Pig (latest version 0.15.0)
▸ The principle is easy
1. Generate test data
2. Run script with PigUnit
3. Verify output
▸ Runs locally but can be run on a cluster too
Script example
WordCount example
‣ Input and output are standard formats
‣ Uses variables $input and $output
text = LOAD '$input' USING TextLoader();

flattened = FOREACH text GENERATE flatten(TOKENIZE((chararray)$0)) as word;
grouped = GROUP flattened by word;
result = FOREACH grouped GENERATE group, (int)COUNT($1) AS cnt;
sorted = ORDER result BY cnt DESC;

STORE sorted INTO '$output' USING PigStorage('\t');
PigTestBase
Create PigTest object
protected final FileSystem _fileSystem;

protected PigTestBase() throws IOException {
    // Packages searched when resolving UDF names used in the scripts
    System.setProperty("udf.import.list", StringUtils.join(Arrays.asList("oink.", "org.apache.pig.piggybank."), ":"));
    _fileSystem = FileSystem.get(new Configuration());
}

/**
 * Creates a new <em>PigTest</em> instance ready to be used.
 *
 * @param scriptFile the path to the Pig script file
 * @param inputs the Pig arguments
 * @return new PigTest instance
 */
protected PigTest newPigTest(String scriptFile, String[] inputs) throws IOException {
    PigServer pigServer = new PigServer(ExecType.LOCAL);
    Cluster pigCluster = new Cluster(pigServer.getPigContext());
    return new PigTest(scriptFile, inputs, pigServer, pigCluster);
}
Test using aliases
getAlias() gives access to the data behind any alias in the script
@Test
public void testWordCountAlias() throws IOException, ParseException {
    // Write input data
    BufferedWriter writer = new BufferedWriter(new FileWriter(new File("input.txt")));
    writer.write("banana pear banana");
    writer.close();

    PigTest t = newPigTest("pig/src/main/pig/wordcount_text.pig", new String[] {"input=input.txt", "output=result.csv"});

    // Rows of "sorted" come back in descending count order
    Iterator<Tuple> tuples = t.getAlias("sorted");
    Assert.assertTrue(tuples.hasNext());
    Tuple tuple = tuples.next();
    Assert.assertEquals(tuple.get(0), "banana");
    Assert.assertEquals(tuple.get(1), 2);
    Assert.assertTrue(tuples.hasNext());
    tuple = tuples.next();
    Assert.assertEquals(tuple.get(0), "pear");
    Assert.assertEquals(tuple.get(1), 1);
}
Test using mock and assert
▸ mockAlias allows substituting input data
▸ assertOutput allows comparing String output data
@Test
public void testWordCountMock() throws IOException, ParseException {
    // Write input data
    BufferedWriter writer = new BufferedWriter(new FileWriter(new File("input.txt")));
    writer.write("banana pear banana");
    writer.close();

    PigTest t = newPigTest("pig/src/main/pig/wordcount_text.pig", new String[] {"input=input.txt", "output=null"});
    t.runScript();
    t.assertOutputAnyOrder("sorted", new String[]{"(banana,2)", "(pear,1)"});
}
Both of these tools have limitations
▸ Built around standard input and output (Text, CSVs etc.)
▸ Realistically most of our data is in other formats (e.g. Avro, Thrift, JSON)
▸ Does not test the STORE function (e.g. schema errors)
▸ getAlias() is especially difficult to use
▸ Need to remember field position: tuple.get(0)
▸ assertOutput() only allows String comparison
▸ Cumbersome to write complex structures (e.g. bags of bags)
Example with Avro input/output
▸ Focus on testing script’s output
▸ Difficulty is to generate dummy Avro data and compare result
text = LOAD '$input' USING AvroStorage();

flattened = FOREACH text GENERATE flatten(TOKENIZE(body)) as word;
grouped = GROUP flattened by word;
result = FOREACH grouped GENERATE group AS word, (int)COUNT($1) AS cnt;
sorted = ORDER result BY cnt DESC;

STORE sorted INTO '$output' USING AvroStorage();
▸ By default, PigUnit doesn’t execute the STORE, but it can be overridden
pigTest.unoverride("STORE");
Simple utility classes for Avro
▸ BasicAvroWriter
▸ Writes Avro file on disk based on a schema
▸ Supports GenericRecord and SpecificRecord
▸ BasicAvroReader
▸ Reads an Avro file; the schema is stored in the file header
▸ Also supports GenericRecord and SpecificRecord
Test with Avro GenericRecord
▸ Create Schema with SchemaBuilder, write data, run script, read result and compare
@Test
public void testWordCountGenericRecord() throws IOException, ParseException {
    Schema schema = SchemaBuilder.builder().record("record").fields()
        .name("text").type().stringType().noDefault().endRecord();
    GenericRecord genericRecord = new GenericData.Record(schema);
    genericRecord.put("text", "banana apple banana");

    BasicAvroWriter writer = new BasicAvroWriter(new Path(new File("input.avro").getAbsolutePath()), schema, getFileSystem());
    writer.append(genericRecord);

    PigTest t = newPigTest("pig/src/main/pig/wordcount_avro.pig", new String[] {"input=input.avro", "output=sorted.avro"});
    t.unoverride("STORE");
    t.runScript();

    // Check output
    BasicAvroReader reader = new BasicAvroReader(new Path(new File("sorted.avro").getAbsolutePath()), getFileSystem());
    Map<Utf8, GenericRecord> result = reader.readAndMapAll("word");
    Assert.assertEquals(result.size(), 2);
    Assert.assertEquals(result.get(new Utf8("banana")).get("cnt"), 2);
    Assert.assertEquals(result.get(new Utf8("apple")).get("cnt"), 1);
}
Test with Avro SpecificRecord
▸ Use InputRecord and OutputRecord generated Java classes, write data, run script, read result and compare
@Test
public void testWordCountSpecificRecord() throws IOException, ParseException {
    InputRecord input = InputRecord.newBuilder().setText("banana apple banana").build();

    BasicAvroWriter<InputRecord> writer = new BasicAvroWriter<InputRecord>(new Path(new File("input.avro").getAbsolutePath()), input.getSchema(), getFileSystem());
    writer.writeAll(input);

    PigTest t = newPigTest("pig/src/main/pig/wordcount_avro.pig", new String[] {"input=input.avro", "output=sorted.avro"});
    t.unoverride("STORE");
    t.runScript();

    // Check output
    BasicAvroReader<OutputRecord> reader = new BasicAvroReader<OutputRecord>(new Path(new File("sorted.avro").getAbsolutePath()), getFileSystem());
    List<OutputRecord> result = reader.readAll();
    Assert.assertEquals(result.size(), 2);
    Assert.assertEquals(result.get(0), OutputRecord.newBuilder().setWord("banana").setCount(2).build());
    Assert.assertEquals(result.get(1), OutputRecord.newBuilder().setWord("apple").setCount(1).build());
}
Common pitfalls
▸ PigUnit
▸ Mocking capabilities are very limited
▸ Overhead of 1-5 seconds per script
▸ Sometimes cryptic error messages (e.g. NullPointerException)
▸ Pig UDFs
▸ Can be tested independently
Spark Testing
Spark Testing Base
Base classes to use when writing tests with Spark
▸ https://github.com/holdenk/spark-testing-base
▸ Functionalities
▸ Provides SparkContext
▸ Utilities to compare RDDs and DataFrames
▸ Simulate how Streaming works
▸ Includes RDD and DataFrame generators
Thank You!
We are hiring!
http://careers.getyourguide.com/
Extra Resources
▸ https://github.com/miguno/avro-hadoop-starter
▸ http://www.michael-noll.com/blog/2013/07/04/using-avro-in-mapreduce-jobs-with-hadoop-pig-hive/
▸ http://blog.cloudera.com/blog/2015/09/making-apache-spark-testing-easy-with-spark-testing-base/
▸ http://www.slideshare.net/hkarau/effective-testing-for-spark-programs-strata-ny-2015
▸ http://avro.apache.org/docs/current/
▸ http://www.confluent.io/blog/schema-registry-kafka-stream-processing-yes-virginia-you-really-need-one
▸ http://mkuthan.github.io/blog/2015/03/01/spark-unit-testing/
▸ https://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying

StarWest 2019 - End to end testing: Stupid or Legit?mabl
 
Big Data Testing: Ensuring MongoDB Data Quality
Big Data Testing: Ensuring MongoDB Data QualityBig Data Testing: Ensuring MongoDB Data Quality
Big Data Testing: Ensuring MongoDB Data QualityRTTS
 
Why You Need to STOP Using Spreadsheets for Audit Analysis
Why You Need to STOP Using Spreadsheets for Audit AnalysisWhy You Need to STOP Using Spreadsheets for Audit Analysis
Why You Need to STOP Using Spreadsheets for Audit AnalysisCaseWare IDEA
 
How to Automate your Enterprise Application / ERP Testing
How to Automate your  Enterprise Application / ERP TestingHow to Automate your  Enterprise Application / ERP Testing
How to Automate your Enterprise Application / ERP TestingRTTS
 
Data Collection Process And Integrity
Data Collection Process And IntegrityData Collection Process And Integrity
Data Collection Process And IntegrityGerrit Klaschke, CSM
 
Techniques for effective test data management in test automation.pptx
Techniques for effective test data management in test automation.pptxTechniques for effective test data management in test automation.pptx
Techniques for effective test data management in test automation.pptxKnoldus Inc.
 
Knowledge discovery claudiad amato
Knowledge discovery claudiad amatoKnowledge discovery claudiad amato
Knowledge discovery claudiad amatoSSSW
 
Data Driven Testing Is More Than an Excel File
Data Driven Testing Is More Than an Excel FileData Driven Testing Is More Than an Excel File
Data Driven Testing Is More Than an Excel FileMehmet Gök
 
Tutorial Knowledge Discovery
Tutorial Knowledge DiscoveryTutorial Knowledge Discovery
Tutorial Knowledge DiscoverySSSW
 
All You Need To Know About Big Data Testing - Bahaa Al Zubaidi.pdf
All You Need To Know About Big Data Testing - Bahaa Al Zubaidi.pdfAll You Need To Know About Big Data Testing - Bahaa Al Zubaidi.pdf
All You Need To Know About Big Data Testing - Bahaa Al Zubaidi.pdfBahaa Al Zubaidi
 
Automate data warehouse etl testing and migration testing the agile way
Automate data warehouse etl testing and migration testing the agile wayAutomate data warehouse etl testing and migration testing the agile way
Automate data warehouse etl testing and migration testing the agile wayTorana, Inc.
 
Presentation Title
Presentation TitlePresentation Title
Presentation Titlebutest
 
Data Warehouse Testing in the Pharmaceutical Industry
Data Warehouse Testing in the Pharmaceutical IndustryData Warehouse Testing in the Pharmaceutical Industry
Data Warehouse Testing in the Pharmaceutical IndustryRTTS
 
Automated Testing with Databases
Automated Testing with DatabasesAutomated Testing with Databases
Automated Testing with Databaseselliando dias
 
End to-end test automation at scale
End to-end test automation at scaleEnd to-end test automation at scale
End to-end test automation at scalemabl
 

Similar to The Mechanics of Testing Large Data Pipelines (QCon London 2016) (20)

The Mechanics of Testing Large Data Pipelines
The Mechanics of Testing Large Data PipelinesThe Mechanics of Testing Large Data Pipelines
The Mechanics of Testing Large Data Pipelines
 
Big Data Testing : Automate theTesting of Hadoop, NoSQL & DWH without Writing...
Big Data Testing : Automate theTesting of Hadoop, NoSQL & DWH without Writing...Big Data Testing : Automate theTesting of Hadoop, NoSQL & DWH without Writing...
Big Data Testing : Automate theTesting of Hadoop, NoSQL & DWH without Writing...
 
Test Automation Best Practices (with SOA test approach)
Test Automation Best Practices (with SOA test approach)Test Automation Best Practices (with SOA test approach)
Test Automation Best Practices (with SOA test approach)
 
From Relational Database Management to Big Data: Solutions for Data Migration...
From Relational Database Management to Big Data: Solutions for Data Migration...From Relational Database Management to Big Data: Solutions for Data Migration...
From Relational Database Management to Big Data: Solutions for Data Migration...
 
Query Wizards - data testing made easy - no programming
Query Wizards - data testing made easy - no programmingQuery Wizards - data testing made easy - no programming
Query Wizards - data testing made easy - no programming
 
StarWest 2019 - End to end testing: Stupid or Legit?
StarWest 2019 - End to end testing: Stupid or Legit?StarWest 2019 - End to end testing: Stupid or Legit?
StarWest 2019 - End to end testing: Stupid or Legit?
 
Big Data Testing: Ensuring MongoDB Data Quality
Big Data Testing: Ensuring MongoDB Data QualityBig Data Testing: Ensuring MongoDB Data Quality
Big Data Testing: Ensuring MongoDB Data Quality
 
Why You Need to STOP Using Spreadsheets for Audit Analysis
Why You Need to STOP Using Spreadsheets for Audit AnalysisWhy You Need to STOP Using Spreadsheets for Audit Analysis
Why You Need to STOP Using Spreadsheets for Audit Analysis
 
How to Automate your Enterprise Application / ERP Testing
How to Automate your  Enterprise Application / ERP TestingHow to Automate your  Enterprise Application / ERP Testing
How to Automate your Enterprise Application / ERP Testing
 
Data Collection Process And Integrity
Data Collection Process And IntegrityData Collection Process And Integrity
Data Collection Process And Integrity
 
Techniques for effective test data management in test automation.pptx
Techniques for effective test data management in test automation.pptxTechniques for effective test data management in test automation.pptx
Techniques for effective test data management in test automation.pptx
 
Knowledge discovery claudiad amato
Knowledge discovery claudiad amatoKnowledge discovery claudiad amato
Knowledge discovery claudiad amato
 
Data Driven Testing Is More Than an Excel File
Data Driven Testing Is More Than an Excel FileData Driven Testing Is More Than an Excel File
Data Driven Testing Is More Than an Excel File
 
Tutorial Knowledge Discovery
Tutorial Knowledge DiscoveryTutorial Knowledge Discovery
Tutorial Knowledge Discovery
 
All You Need To Know About Big Data Testing - Bahaa Al Zubaidi.pdf
All You Need To Know About Big Data Testing - Bahaa Al Zubaidi.pdfAll You Need To Know About Big Data Testing - Bahaa Al Zubaidi.pdf
All You Need To Know About Big Data Testing - Bahaa Al Zubaidi.pdf
 
Automate data warehouse etl testing and migration testing the agile way
Automate data warehouse etl testing and migration testing the agile wayAutomate data warehouse etl testing and migration testing the agile way
Automate data warehouse etl testing and migration testing the agile way
 
Presentation Title
Presentation TitlePresentation Title
Presentation Title
 
Data Warehouse Testing in the Pharmaceutical Industry
Data Warehouse Testing in the Pharmaceutical IndustryData Warehouse Testing in the Pharmaceutical Industry
Data Warehouse Testing in the Pharmaceutical Industry
 
Automated Testing with Databases
Automated Testing with DatabasesAutomated Testing with Databases
Automated Testing with Databases
 
End to-end test automation at scale
End to-end test automation at scaleEnd to-end test automation at scale
End to-end test automation at scale
 

Recently uploaded

Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningLars Bell
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostZilliz
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piececharlottematthew16
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Enterprise Knowledge
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 

Recently uploaded (20)

Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine Tuning
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piece
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 

The Mechanics of Testing Large Data Pipelines (QCon London 2016)

  • 1. THE MECHANICS OF TESTING LARGE DATA PIPELINES MATHIEU BASTIAN Head of Data Engineering, GetYourGuide @mathieubastian www.linkedin.com/in/mathieubastian QCon London 2015
  • 2. Outline ▸ Motivating example ▸ Challenges ▸ Testing strategies ▸ Validation strategies ▸ Tools (diagram labels: Unit Test, Integration Tests, Architecture)
  • 3. Data Pipelines often start simple
  • 4. They have one use-case and one developer (diagram: Users, E-commerce website, Search App, Views, HDFS, Offline Dashboard, Search Metrics)
  • 5. But there are many other use-cases: Recommender Systems, Anomaly Detection, Search Ranking, A/B Testing, Spam Detection, Sentiment Analysis, Topic Detection, Trending Tags, Query Expansion, Customer Churn Prediction, Related Searches, Fraud Prediction, Bidding Prediction, Machine Translation, Signal Processing, Content Curation, Image Recognition, Optimal Pricing, Location Normalization, Standardization, Funnel Analysis
  • 6. Developers add additional events and logs (diagram: Users, E-commerce website, Search App, Clicks, Views, HDFS, Offline Dashboard, Search Metrics)
  • 7. Developers add third-party data (diagram: Users, E-commerce website, 3rd parties, Search App, Mobile Analytics, Clicks, Views, A/B Logs, HDFS, Offline Dashboard, Search Metrics)
  • 8. Developers add search ranking prediction (diagram: Users, E-commerce website, 3rd parties, Search App, Mobile Analytics, Clicks, Views, A/B Logs, HDFS, Features transformation, Training data, Training & validation, Model, Offline Dashboard, Search Metrics)
  • 9. Developers add personalized user features (diagram: Users, E-commerce website, 3rd parties, Search App, User Database, Mobile Analytics, Clicks, Views, Profiles, A/B Logs, HDFS, Features transformation, Training data, Training & validation, Model, Offline Dashboard, Search Metrics)
  • 10. Developers add query extension (diagram: Users, E-commerce website, 3rd parties, Search App, User Database, Mobile Analytics, Clicks, Views, Profiles, A/B Logs, HDFS, Features transformation, Training data, Training & validation, Model, Filter queries, Query extension, RDBMS, Offline Dashboard, Search Metrics)
  • 11. Developers add recommender system (diagram: Users, E-commerce website, 3rd parties, Search App, User Database, Mobile Analytics, Clicks, Views, Profiles, A/B Logs, HDFS, Features transformation, Features, Training data, Training & validation, Model, Compute recommendations, NoSQL, Filter queries, Query extension, RDBMS, Offline Dashboard, Search Metrics)
  • 12. Data Pipelines can grow very large
  • 13. That is a lot of code and data
  • 14. Code contains bugs Industry average: about 15-50 errors per 1,000 lines of delivered code.
  • 16. Embrace automated testing of code and validation of data
  • 17. Because it delivers ▸ Testing ▸ Tested code has fewer bugs ▸ Gives the confidence to iterate quickly ▸ Scales well to multiple developers ▸ Validation ▸ Reduces manual testing ▸ Avoids catastrophic failures
  • 18. But it’s challenging ▸ Testing ▸ Need data to test "realistically" ▸ Not running locally, can be expensive ▸ Tooling weaknesses ▸ Validation ▸ Data sources out of our control ▸ Difficult to test machine learning models
  • 20. Manual testing ▸ Cycle: Code, Upload, Run workflow, Look at logs ▸ Time spent (chart): Waiting, Coding, Looking at logs
  • 22. Prepare environment ▸ Care about tests from the start of your project ▸ All jobs should be functions (output depends only on input) ▸ Make it safe to re-run the job ▸ Does the input data still exist? ▸ Would it push partial results? ▸ Centralize configuration and avoid hard-coded paths ▸ Version code and timestamp data
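The "jobs as functions" and "safe to re-run" points can be sketched with the Java standard library alone. This is a minimal illustration on a local file system, not the speaker's implementation: the class `WordCountJob`, its method names, and the write-to-temporary-file-then-move approach are all assumptions made for the example.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical job: the output depends only on the input, and the result is
// moved into place at the end so a failed run does not publish partial output.
class WordCountJob {

    // Pure function: the same input lines always give the same counts.
    static Map<String, Integer> count(List<String> lines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            for (String word : line.trim().split("\\s+")) {
                if (!word.isEmpty()) {
                    counts.merge(word, 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    // Writes to a temporary file next to the target, then moves it into place.
    static void run(Path input, Path output) throws IOException {
        Map<String, Integer> counts = count(Files.readAllLines(input));
        StringBuilder sb = new StringBuilder();
        counts.forEach((word, n) -> sb.append(word).append('\t').append(n).append('\n'));
        Path tmp = Files.createTempFile(output.toAbsolutePath().getParent(), "part-", ".tmp");
        Files.write(tmp, sb.toString().getBytes(StandardCharsets.UTF_8));
        Files.move(tmp, output, StandardCopyOption.REPLACE_EXISTING);
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("job");
        Path in = dir.resolve("input.txt");
        Path out = dir.resolve("output.tsv");
        Files.write(in, Arrays.asList("banana pear banana"));
        run(in, out);
        run(in, out); // re-running replaces the previous output
        System.out.println(Files.readAllLines(out));
    }
}
```

Keeping `count` free of I/O means it can be unit tested without a cluster or a file system; only `run` touches storage.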
  • 23. Unit test locally ▸ Test each individual job locally ▸ Test the expected (good) code path ▸ Test expected failures ▸ Need to overcome challenges with fake data creation ▸ Complex structures and numerous data sources ▸ Too small to be meaningful ▸ Need to specify a different configuration
  • 24. Build from schemas Fake data creation based on schemas. Compare:
 // With builder classes generated from the schema
 Customer c = Customer.newBuilder()
 .setId(42)
 .setInterests(Arrays.asList(
 Interest.newBuilder().setId(0).setName("Ping-Pong").build(),
 Interest.newBuilder().setId(1).setName("Pizza").build()))
 .build();
 vs
 // With untyped maps
 Map<String, Object> c = new HashMap<>();
 c.put("id", 42);
 Map<String, Object> i1 = new HashMap<>();
 i1.put("id", 0);
 i1.put("name", "Ping-Pong");
 Map<String, Object> i2 = new HashMap<>();
 i2.put("id", 1);
 i2.put("name", "Pizza");
 c.put("interests", Arrays.asList(i1, i2));
  • 25. Build from schemas Avro Schema example (the "name" field of Interest is a nullable field)
 {
 "type": "record",
 "name": "Customer",
 "fields": [
 { "name": "id", "type": "int" },
 { "name": "interests", "type": { "type": "array", "items": {
 "name": "Interest", "type": "record", "fields": [
 { "name": "id", "type": "int" },
 { "name": "name", "type": ["string", "null"] }
 ] } } }
 ]
 }
  • 26. Complex generators ▸ Developed in the field of property-based testing
 // Small even number generator
 val smallEvenInteger = Gen.choose(0, 200) suchThat (_ % 2 == 0)
 ▸ Goal is to simulate, not sample real data ▸ Define complex random generators that match properties (e.g. frequency) ▸ Can go beyond unit-testing and generate complex domain models ▸ https://www.scalacheck.org/ for Scala/Java is a good starting point for examples
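The same idea can be sketched in plain Java without a property-based testing library. This is only an illustration: the class `CustomerGenerator`, the interest names beyond those on the earlier slide, and the 60/30/10 frequencies are assumptions, not values from a real data set.

```java
import java.util.Random;

// Hypothetical generator of fake values that follow chosen properties.
class CustomerGenerator {
    private static final String[] INTERESTS = {"Ping-Pong", "Pizza", "Hiking"};
    // Assumed relative frequency of each interest: 60%, 30%, 10%.
    private static final double[] WEIGHTS = {0.6, 0.3, 0.1};

    private final Random random;

    CustomerGenerator(long seed) {
        this.random = new Random(seed); // a fixed seed keeps test data reproducible
    }

    // Picks an interest according to WEIGHTS.
    String nextInterest() {
        double r = random.nextDouble();
        double cumulative = 0.0;
        for (int i = 0; i < INTERESTS.length; i++) {
            cumulative += WEIGHTS[i];
            if (r < cumulative) {
                return INTERESTS[i];
            }
        }
        return INTERESTS[INTERESTS.length - 1]; // guards against rounding
    }

    // Even integer in [0, 200], mirroring the ScalaCheck example.
    int nextSmallEven() {
        return 2 * random.nextInt(101);
    }

    public static void main(String[] args) {
        CustomerGenerator gen = new CustomerGenerator(42L);
        for (int i = 0; i < 5; i++) {
            System.out.println(gen.nextSmallEven() + "\t" + gen.nextInterest());
        }
    }
}
```

Unlike the `suchThat` filter, which discards odd candidates, this sketch constructs even numbers directly; both satisfy the same property.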
  • 27. Integration test on sample data ▸ Integration test the entire workflow ▸ File paths ▸ Configuration ▸ Evaluate performance ▸ Sample data ▸ Large enough to be meaningful ▸ Small enough to speed up testing (diagram: JOB A, JOB B, JOB C, JOB D)
  • 29. Where it fails (chart axes: Control, Difficulty; items: Model biases, Bug, Noisy data, Schema changes, Missing data)
  • 30. Input and output validation Make the pipeline robust by validating inputs and outputs (diagram: Input, Input, Input, Validation, Workflow, Validation, Production)
  • 32. Input data validation Input data validation is a key component of pipeline robustness. The goal is to test the entry points of our system for data quality. (diagram: ETL, RDBMS, NoSQL, Events, Twitter, Data Pipeline)
  • 33. Why it matters ▸ Bad input data will most likely degrade the output ▸ It will likely fail silently ▸ Because data will change ▸ Data migrations: maintenance, cluster update, new infrastructure ▸ Events change due to product evolution ▸ Data dependencies get updated
  • 34. Input data validation ▸ Validation code should ▸ Detect pathological data and fail early ▸ Deal with expected data variability ▸ Example issues: ▸ Missing values, encoding issues, etc. ▸ Schema changes ▸ Duplicate rows ▸ Data order changes
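A minimal sketch of validation code that fails early while tolerating some variability might look as follows. The `Event` record, the three rules, and the `maxInvalidRatio` threshold are hypothetical, chosen only to illustrate the idea.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical record type and rules, for illustration only.
class InputValidator {

    static final class Event {
        final String userId;
        final String query;
        final long timestampMillis;

        Event(String userId, String query, long timestampMillis) {
            this.userId = userId;
            this.query = query;
            this.timestampMillis = timestampMillis;
        }
    }

    // Returns the problems found in one record; an empty list means valid.
    static List<String> check(Event e) {
        List<String> problems = new ArrayList<>();
        if (e.userId == null || e.userId.isEmpty()) {
            problems.add("missing userId");
        }
        // U+FFFD is the replacement character left behind by a bad decode.
        if (e.query != null && e.query.indexOf('\uFFFD') >= 0) {
            problems.add("encoding issue in query");
        }
        if (e.timestampMillis <= 0) {
            problems.add("invalid timestamp");
        }
        return problems;
    }

    // Fails early when the share of invalid records exceeds the tolerance.
    static void validate(List<Event> events, double maxInvalidRatio) {
        long invalid = events.stream().filter(e -> !check(e).isEmpty()).count();
        if (!events.isEmpty() && (double) invalid / events.size() > maxInvalidRatio) {
            throw new IllegalStateException(invalid + " of " + events.size() + " records are invalid");
        }
    }

    public static void main(String[] args) {
        List<Event> events = Arrays.asList(
            new Event("u1", "berlin tours", 1457000000000L),
            new Event("", "rome", 1457000000001L));
        System.out.println(check(events.get(1)));
        validate(events, 0.5); // half invalid is still within this tolerance
    }
}
```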
  • 35. Pathological data ▸ Value ▸ Validity depends on a single, independent value. ▸ Easy to validate on streams of data ▸ Dataset ▸ Validity depends on the entire dataset ▸ More difficult to validate as it needs a window of data
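Dataset-level checks need a window of records rather than a single value. The sketch below shows two such checks, duplicates and ordering; the class name `DatasetChecks` and the choice of checks are assumptions for illustration.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical checks whose result depends on the whole window of data.
class DatasetChecks {

    // Counts records whose id was already seen earlier in the window.
    static int countDuplicates(List<String> ids) {
        Set<String> seen = new HashSet<>();
        int duplicates = 0;
        for (String id : ids) {
            if (!seen.add(id)) {
                duplicates++;
            }
        }
        return duplicates;
    }

    // True when timestamps never decrease from one record to the next.
    static boolean isOrdered(List<Long> timestamps) {
        for (int i = 1; i < timestamps.size(); i++) {
            if (timestamps.get(i) < timestamps.get(i - 1)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(countDuplicates(Arrays.asList("a", "b", "a", "a")));
        System.out.println(isOrdered(Arrays.asList(1L, 2L, 2L, 5L)));
    }
}
```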
  • 36. Metadata validation Analyzing metadata is the quickest way to validate input data ▸ Number of records and file sizes ▸ Hadoop/Spark counters ▸ Number of map/reduce records, size ▸ Record-level custom counters ▸ Average text length ▸ Task-level custom counters ▸ Min/Max/Median values
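Metadata checks like these reduce to comparing a few numbers. The sketch below assumes the record count of the previous run is available for comparison, which the slide does not state; the class `MetadataChecks` and the tolerance value are likewise illustrative.

```java
import java.util.Arrays;

// Hypothetical metadata checks on counts and summary statistics.
class MetadataChecks {

    // True when the current count is within `tolerance` (0.2 means 20%)
    // of the previous run's count.
    static boolean countWithinTolerance(long previous, long current, double tolerance) {
        if (previous == 0) {
            return current == 0;
        }
        return Math.abs(current - previous) / (double) previous <= tolerance;
    }

    // Median of a non-empty array; the input array is not modified.
    static double median(long[] values) {
        if (values.length == 0) {
            throw new IllegalArgumentException("no values");
        }
        long[] sorted = values.clone();
        Arrays.sort(sorted);
        int n = sorted.length;
        return n % 2 == 1 ? sorted[n / 2] : (sorted[n / 2 - 1] + sorted[n / 2]) / 2.0;
    }

    public static void main(String[] args) {
        System.out.println(countWithinTolerance(1000, 1100, 0.2));
        System.out.println(median(new long[] {12, 40, 7, 19}));
    }
}
```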
  • 37. Hadoop/Spark counters Results can be accessed programmatically and checked
  • 38. Control inputs with Schemas ▸ CSVs aren’t robust to change, use Schemas ▸ Makes expected data explicit and easy to test against ▸ Gives basic validation for free with binary serialization (e.g. Avro, Thrift, Protocol Buffers) ▸ Typed (integer, boolean, lists etc.) ▸ Specify if a value is optional ▸ Schemas can be evolved without breaking compatibility
  • 40. Why it matters ▸ Humans make mistakes, we need a safeguard ▸ Rolling back data is often complex ▸ Bad output propagates to downstream systems Example with a recommender system
 // One recommendation set per user
 { "userId": 42, "recommendations": [{ "itemId": 1456, "score": 0.9 }, { "itemId": 4232, "score": 0.1 }], "model": "test01" }
  • 41. Check for anomalies Simple strategies similar to input data validation ▸ Record level (e.g. values within bounds) ▸ Dataset level (e.g. counts, order) Challenges around relevance evaluation ▸ When supervised, use a validation dataset and threshold accuracy ▸ Introduce hypothetical examples
  • 42. Incremental update as validation Join with the previous “best" output ▸ Allows fine comparisons ▸ Incremental framework can be extended to ▸ Only recompute recommendations that have changed ▸ Produce variation metrics between different models (diagram: Compute daily recommendations, HDFS, Daily Recommendations, Yesterday Recommendations, Join with previous result)
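One possible variation metric for the join with the previous output is the overlap between yesterday's and today's recommended items per user. The sketch below uses Jaccard similarity; that choice, the class `RecommendationDiff`, and the in-memory maps standing in for the joined data sets are assumptions, not the metric used in the talk.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical comparison of two recommendation outputs keyed by userId.
class RecommendationDiff {

    // Jaccard similarity: shared items divided by all distinct items.
    // Two empty lists are treated as identical (1.0).
    static double jaccard(List<Integer> a, List<Integer> b) {
        Set<Integer> union = new HashSet<>(a);
        union.addAll(b);
        if (union.isEmpty()) {
            return 1.0;
        }
        Set<Integer> intersection = new HashSet<>(a);
        intersection.retainAll(new HashSet<>(b));
        return (double) intersection.size() / union.size();
    }

    // Joins today's output with yesterday's by userId and averages the
    // similarity over users present in both; 0.0 when no user matches.
    static double averageSimilarity(Map<Integer, List<Integer>> yesterday,
                                    Map<Integer, List<Integer>> today) {
        double sum = 0.0;
        int joined = 0;
        for (Map.Entry<Integer, List<Integer>> e : today.entrySet()) {
            List<Integer> previous = yesterday.get(e.getKey());
            if (previous != null) {
                sum += jaccard(previous, e.getValue());
                joined++;
            }
        }
        return joined == 0 ? 0.0 : sum / joined;
    }

    public static void main(String[] args) {
        Map<Integer, List<Integer>> yesterday = new HashMap<>();
        yesterday.put(42, Arrays.asList(1456, 4232));
        Map<Integer, List<Integer>> today = new HashMap<>();
        today.put(42, Arrays.asList(1456, 9999));
        System.out.println(averageSimilarity(yesterday, today));
    }
}
```

A sudden drop in the average between two daily runs would be a signal to hold back the new output before it reaches downstream systems.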
  • 43. External validation Even in an automated environment it is possible to validate with humans ▸ Example: Search ranking evaluation ▸ Solution: Crowdsourcing ▸ Complex validation that requires training ▸ Can be automated through APIs
  • 44. Mitigate risk with A/B testing Gradually rolling out data product improvements reduces the need for complex output validation ▸ Experiment can be controlled online or offline ▸ Online: Push multiple sets of recommendations (1 per model) ▸ Offline: Split users and push a unique set of recommendations
 // Online: one entry per model (A and B)
 userId -> [{
 "model": "test01",
 "recommendations": [{...}]
 }, {
 "model": "test02",
 "recommendations": [{...}]
 }]
 // Offline: a single model per user
 userId -> {
 "model": "test01",
 "recommendations": [{...}]
 }
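For the offline case, users have to be split between models in a repeatable way. A minimal sketch, assuming a hash of the user id into 100 buckets: the class `AbSplitter` and the `percentB` parameter are hypothetical, while the model names are the ones used on the slide.

```java
// Hypothetical deterministic split of users between two models.
class AbSplitter {

    // Assigns a user to a model so that the same user always gets the same
    // variant. `percentB` is the share of buckets (0 to 100) sent to test02.
    static String assign(long userId, int percentB) {
        int bucket = Math.floorMod(Long.hashCode(userId), 100);
        return bucket < percentB ? "test02" : "test01";
    }

    public static void main(String[] args) {
        for (long userId = 40; userId < 45; userId++) {
            System.out.println(userId + " -> " + assign(userId, 10));
        }
    }
}
```

Raising `percentB` step by step moves more users to the new model without reassigning those already in it, which fits the gradual rollout described above.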
  • 45. Mitigate risk with A/B testing Important ▸ Log model variation downstream in logs ▸ Encapsulate model logic (diagram labels: FEATURE 1, FEATURE 2, MODEL A, MODEL B, FEATURE 1-A, FEATURE 1-B, FEATURE 2-A, FEATURE 2-B)
  • 48. Two ways to test Hadoop jobs ▸ MRUnit ▸ Java library to test MapReduce jobs in a simulated environment ▸ Last release June 2014 ▸ MiniCluster ▸ Utility to locally run a fully-functional Hadoop cluster in a test environment ▸ Ships with Hadoop itself
  • 49. MiniMRCluster ▸ Advantages ▸ Behaves like a real cluster, including setup and configuration ▸ Can be used to test multiple jobs (integration testing) ▸ Disadvantages ▸ Very slow compared to unit testing Java code
  • 50. MRUnit ▸ Advantages ▸ Faster ▸ Less boilerplate code ▸ Disadvantages ▸ Need to replicate job configuration ▸ Only built to test map and reduce functions ▸ Difficult to make it work with custom input formats (e.g. Avro)
  • 51. MiniMRCluster setup* Setup MR cluster and obtain FileSystem
 @BeforeClass
 public void setup() throws IOException {
 // Start a single-datanode HDFS cluster that stores its data under ./target/hdfs/
 Configuration dfsConf = new Configuration();
 dfsConf.set(MiniDFSCluster.HDFS_MINIDFS_BASEDIR, new File("./target/hdfs/").getAbsolutePath());
 _dfsCluster = new MiniDFSCluster.Builder(dfsConf).numDataNodes(1).build();
 _dfsCluster.waitClusterUp();
 _fileSystem = _dfsCluster.getFileSystem();
 
 // Configure YARN with small allocations so it fits on a developer machine
 YarnConfiguration yarnConf = new YarnConfiguration();
 yarnConf.setFloat(YarnConfiguration.NM_MAX_PER_DISK_UTILIZATION_PERCENTAGE, 99.0f);
 yarnConf.setInt(YarnConfiguration.RM_SCHEDULER_MINIMUM_ALLOCATION_MB, 64);
 yarnConf.setClass(YarnConfiguration.RM_SCHEDULER, FifoScheduler.class, ResourceScheduler.class);
 // taskTrackers (the number of node managers) is assumed to be a field of the test class
 _mrCluster = new MiniMRYarnCluster(getClass().getName(), taskTrackers);
 // Point the MR cluster at the mini HDFS started above
 yarnConf.set("fs.defaultFS", _fileSystem.getUri().toString());
 _mrCluster.init(yarnConf);
 _mrCluster.start();
 }
 * Hadoop version used 2.7.2
  • 52. Keep the test file clean of boilerplate code It is best to wrap the start/stop code into a TestBase class
 /**
 * Default constructor with one task tracker and one node.
 */
 public TestBase() { ... }
 
 @BeforeClass
 public void startCluster() throws IOException { ... }
 
 @AfterClass
 public void stopCluster() throws IOException { ... }
 
 /**
 * Returns the Filesystem in use.
 *
 * @return the filesystem used by Hadoop.
 */
 protected FileSystem getFileSystem() {
 return _fileSystem;
 }
  • 53. Initialize and clean HDFS before/after each test Clean up and initialize file system before each test
 private final Path _inputPath = new Path("/input");
 private final Path _cachePath = new Path("/cache");
 private final Path _outputPath = new Path("/output");
 
 @BeforeMethod
 public void beforeMethod(Method method) throws IOException {
 getFileSystem().delete(_inputPath, true);
 getFileSystem().mkdirs(_inputPath);
 getFileSystem().delete(_cachePath, true);
 }
 
 @AfterMethod
 public void afterMethod(Method method) throws IOException {
 getFileSystem().delete(_inputPath, true);
 getFileSystem().delete(_cachePath, true);
 getFileSystem().delete(_outputPath, true);
 }
  • 54. Run MiniCluster Test Clean up and initialize file system before each test
 @Test
 public void testBasicWordCountJob() throws IOException, InterruptedException, ClassNotFoundException {
 writeWordCountInput();
 configureAndRunJob(new BasicWordCountJob(), "BasicWordCountJob", _inputPath, _outputPath);
 checkWordCountOutput();
 }
 
 private void configureAndRunJob(AbstractJob job, String name, Path inputPath, Path outputPath) throws IOException, ClassNotFoundException, InterruptedException {
 Properties _props = new Properties();
 _props.setProperty("input.path", inputPath.toString());
 _props.setProperty("output.path", outputPath.toString());
 job.setProperties(_props);
 job.setName(name);
 job.run();
 }
  • 55. MRUnit setup Setup MapDriver and ReduceDriver
 BasicWordCountJob.Map mapper;
 BasicWordCountJob.Reduce reducer;
 MapDriver<LongWritable, Text, Text, IntWritable> mapDriver;
 ReduceDriver<Text, IntWritable, Text, IntWritable> reduceDriver;
 
 @BeforeClass
 public void setup() {
 mapper = new BasicWordCountJob.Map();
 mapDriver = MapDriver.newMapDriver(mapper);
 reducer = new BasicWordCountJob.Reduce();
 reduceDriver = ReduceDriver.newReduceDriver(reducer);
 }
 • 56. Run MRUnit test
 Set Input/Output and run Test

 @Test
 public void testMapper() throws IOException {
     mapDriver.withInput(new LongWritable(0), new Text("banana pear banana"));
     mapDriver.withOutput(new Text("banana"), new IntWritable(1));
     mapDriver.withOutput(new Text("pear"), new IntWritable(1));
     mapDriver.withOutput(new Text("banana"), new IntWritable(1));
     mapDriver.runTest();
 }

 @Test
 public void testReducer() throws IOException {
     reduceDriver.withInput(new Text("banana"), Arrays.asList(new IntWritable(1), new IntWritable(1)));
     reduceDriver.withInput(new Text("pear"), Arrays.asList(new IntWritable(1)));
     reduceDriver.withOutput(new Text("banana"), new IntWritable(2));
     reduceDriver.withOutput(new Text("pear"), new IntWritable(1));
     reduceDriver.runTest();
 }
 • 57. Most common pitfall
 ▸ With both MiniMRCluster and MRUnit, one spends most of the time
   ▸ Creating fake input data
   ▸ Verifying output data
 ▸ Solutions
   ▸ Use rich data structure formats (e.g. Avro, Thrift)
   ▸ Use automatically generated Java classes
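 To illustrate why generated classes help, here is a minimal plain-Java sketch. WordCountRecord is a hypothetical, hand-written stand-in for a class that an Avro or Thrift code generator would produce, and WordCountFixture is a toy replacement for the job under test; neither comes from the talk. The point is only that a builder plus value equality turns both "create fake input" and "verify output" into one-liners.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Objects;

/** Hypothetical stand-in for a class generated from an Avro or Thrift schema. */
final class WordCountRecord {
    private final String word;
    private final int count;

    private WordCountRecord(String word, int count) {
        this.word = word;
        this.count = count;
    }

    static Builder newBuilder() {
        return new Builder();
    }

    // Value equality lets a test compare whole records instead of field by field.
    @Override
    public boolean equals(Object other) {
        if (!(other instanceof WordCountRecord)) {
            return false;
        }
        WordCountRecord that = (WordCountRecord) other;
        return count == that.count && word.equals(that.word);
    }

    @Override
    public int hashCode() {
        return Objects.hash(word, count);
    }

    // A readable toString makes assertion failures easier to diagnose.
    @Override
    public String toString() {
        return "(" + word + "," + count + ")";
    }

    static final class Builder {
        private String word;
        private int count;

        Builder setWord(String word) {
            this.word = word;
            return this;
        }

        Builder setCount(int count) {
            this.count = count;
            return this;
        }

        WordCountRecord build() {
            return new WordCountRecord(Objects.requireNonNull(word, "word"), count);
        }
    }
}

/** Toy stand-in for the job under test: counts words, keeping first-seen order. */
final class WordCountFixture {
    static List<WordCountRecord> countWords(String line) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String token : line.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        List<WordCountRecord> records = new ArrayList<>();
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            records.add(WordCountRecord.newBuilder()
                    .setWord(entry.getKey())
                    .setCount(entry.getValue())
                    .build());
        }
        return records;
    }
}
```

 A test can then compare the result of countWords("banana pear banana") against a list of two records built with newBuilder(), with no positional field access or string parsing.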
 • 58. Other common pitfalls
 ▸ MiniMRCluster
   ▸ Enable Hadoop INFO logging so you can see the real causes of job failures
   ▸ Beware of partitioning or sorting issues that stay hidden when testing with too few rows or nodes
   ▸ The API has changed over the years, so examples are difficult to find
 ▸ MRUnit
   ▸ Custom serialization issues (e.g. Avro, Thrift)
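 The partitioning and sorting pitfall can be reproduced without a cluster. The following is a simplified plain-Java simulation, not Hadoop code: PartitionSketch and its methods are hypothetical names, and only the partition formula mirrors Hadoop's default HashPartitioner. Each partition is sorted on its own, as each reducer's input would be, and the outputs are concatenated in partition order.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Simplified simulation of a "sort" job whose output is only sorted per partition. */
final class PartitionSketch {

    /** Same formula as Hadoop's default HashPartitioner. */
    static int partitionFor(Object key, int numPartitions) {
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    /** Routes keys to partitions, sorts each partition, concatenates the results. */
    static List<Integer> runSortJob(List<Integer> keys, int numPartitions) {
        List<List<Integer>> partitions = new ArrayList<>();
        for (int i = 0; i < numPartitions; i++) {
            partitions.add(new ArrayList<>());
        }
        for (Integer key : keys) {
            partitions.get(partitionFor(key, numPartitions)).add(key);
        }
        List<Integer> output = new ArrayList<>();
        for (List<Integer> partition : partitions) {
            Collections.sort(partition); // sorted within the partition only
            output.addAll(partition);
        }
        return output;
    }

    /** Returns true if the values are in non-decreasing order. */
    static boolean isSorted(List<Integer> values) {
        for (int i = 1; i < values.size(); i++) {
            if (values.get(i - 1) > values.get(i)) {
                return false;
            }
        }
        return true;
    }
}
```

 With a single partition the keys 5, 3, 0, 4, 1, 2 come out globally sorted, so a test that relies on global order passes. With three partitions the same job yields 0, 3, 1, 4, 2, 5, which is the kind of failure that only shows up with more reducers.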
 • 60. Introducing PigUnit
 ▸ PigUnit
   ▸ Official library for unit testing Pig scripts
   ▸ Ships with Pig (latest version 0.15.0)
 ▸ The principle is easy
   1. Generate test data
   2. Run the script with PigUnit
   3. Verify the output
 ▸ Runs locally but can be run on a cluster too
 • 61. Script example
 WordCount example
 ‣ Input and output are standard formats
 ‣ Uses variables $input and $output

 text = LOAD '$input' USING TextLoader();

 flattened = FOREACH text GENERATE flatten(TOKENIZE((chararray)$0)) as word;
 grouped = GROUP flattened by word;
 result = FOREACH grouped GENERATE group, (int)COUNT($1) AS cnt;
 sorted = ORDER result BY cnt DESC;

 STORE sorted INTO '$output' USING PigStorage('\t');
 • 62. PigTestBase
 Create PigTest object

 protected final FileSystem _fileSystem;

 protected PigTestBase() throws IOException {
     System.setProperty("udf.import.list", StringUtils.join(Arrays.asList("oink.", "org.apache.pig.piggybank."), ":"));
     _fileSystem = FileSystem.get(new Configuration());
 }

 /**
  * Creates a new <em>PigTest</em> instance ready to be used.
  *
  * @param scriptFile the path to the Pig script file
  * @param inputs the Pig arguments
  * @return new PigTest instance
  */
 protected PigTest newPigTest(String scriptFile, String[] inputs) throws IOException {
     // Run Pig in local mode, so no cluster is needed
     PigServer pigServer = new PigServer(ExecType.LOCAL);
     Cluster pigCluster = new Cluster(pigServer.getPigContext());
     return new PigTest(scriptFile, inputs, pigServer, pigCluster);
 }
 • 63. Test using aliases
 getAlias() returns the data of any alias in the script

 @Test
 public void testWordCountAlias() throws IOException, ParseException {
     // Write input data
     BufferedWriter writer = new BufferedWriter(new FileWriter(new File("input.txt")));
     writer.write("banana pear banana");
     writer.close();

     PigTest t = newPigTest("pig/src/main/pig/wordcount_text.pig", new String[] {"input=input.txt", "output=result.csv"});

     Iterator<Tuple> tuples = t.getAlias("sorted");
     Assert.assertTrue(tuples.hasNext());
     Tuple tuple = tuples.next();
     Assert.assertEquals(tuple.get(0), "banana");
     Assert.assertEquals(tuple.get(1), 2);
     Assert.assertTrue(tuples.hasNext());
     tuple = tuples.next();
     Assert.assertEquals(tuple.get(0), "pear");
     Assert.assertEquals(tuple.get(1), 1);
 }
 • 64. Test using mock and assert
 ▸ mockAlias substitutes input data
 ▸ assertOutput compares output data as Strings

 @Test
 public void testWordCountMock() throws IOException, ParseException {
     // Write input data
     BufferedWriter writer = new BufferedWriter(new FileWriter(new File("input.txt")));
     writer.write("banana pear banana");
     writer.close();

     PigTest t = newPigTest("pig/src/main/pig/wordcount_text.pig", new String[] {"input=input.txt", "output=null"});
     t.runScript();
     t.assertOutputAnyOrder("sorted", new String[]{"(banana,2)", "(pear,1)"});
 }
 • 65. Both of these tools have limitations
 ▸ Built around standard input and output (Text, CSVs etc.)
   ▸ Realistically most of our data is in other formats (e.g. Avro, Thrift, JSON)
 ▸ Do not test the STORE function (e.g. schema errors)
 ▸ getAlias() is especially difficult to use
   ▸ Need to remember field position: tuple.get(0)
 ▸ assertOutput() only allows String comparison
   ▸ Cumbersome to write complex structures (e.g. bags of bags)
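 One way to soften the String-comparison limitation is a pair of small helpers that build the expected strings. This is a plain-Java sketch with hypothetical names (PigLiterals, tuple, bag); it is not part of PigUnit. It assumes Pig's usual text rendering of tuples as (a,b), bags as {(a),(b)} and null fields as empty strings, which should be checked against the Pig version in use.

```java
import java.util.StringJoiner;

/** Hypothetical helpers to build expected strings for String-based output assertions. */
final class PigLiterals {

    /** Renders fields as a tuple literal, e.g. tuple("banana", 2) gives "(banana,2)". */
    static String tuple(Object... fields) {
        StringJoiner joiner = new StringJoiner(",", "(", ")");
        for (Object field : fields) {
            // Assumption: a null field is rendered as an empty string
            joiner.add(field == null ? "" : String.valueOf(field));
        }
        return joiner.toString();
    }

    /** Renders already formatted tuples as a bag literal, e.g. "{(a),(b)}". */
    static String bag(String... tuples) {
        return "{" + String.join(",", tuples) + "}";
    }
}
```

 Nested structures then compose: tuple("fruits", bag(tuple("banana"), tuple("pear"))) yields "(fruits,{(banana),(pear)})", which is less error-prone than typing the braces and parentheses by hand.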
 • 66. Example with Avro input/output
 ▸ Focus on testing the script’s output
 ▸ The difficulty is generating dummy Avro data and comparing the result

 text = LOAD '$input' USING AvroStorage();

 flattened = FOREACH text GENERATE flatten(TOKENIZE(body)) as word;
 grouped = GROUP flattened by word;
 result = FOREACH grouped GENERATE group AS word, (int)COUNT($1) AS cnt;
 sorted = ORDER result BY cnt DESC;

 STORE result INTO '$output' USING AvroStorage();

 ▸ By default, PigUnit doesn’t execute the STORE, but this can be overridden

 pigTest.unoverride("STORE");
 • 67. Simple utility classes for Avro
 ▸ BasicAvroWriter
   ▸ Writes an Avro file to disk based on a schema
   ▸ Supports GenericRecord and SpecificRecord
 ▸ BasicAvroReader
   ▸ Reads an Avro file; the schema is stored in the file header
   ▸ Also supports GenericRecord and SpecificRecord
 • 68. Test with Avro GenericRecord
 ▸ Create Schema with SchemaBuilder, write data, run script, read result and compare

 @Test
 public void testWordCountGenericRecord() throws IOException, ParseException {
     Schema schema = SchemaBuilder.builder().record("record").fields()
             .name("text").type().stringType().noDefault().endRecord();
     GenericRecord genericRecord = new GenericData.Record(schema);
     genericRecord.put("text", "banana apple banana");

     BasicAvroWriter writer = new BasicAvroWriter(new Path(new File("input.avro").getAbsolutePath()), schema, getFileSystem());
     writer.append(genericRecord);

     PigTest t = newPigTest("pig/src/main/pig/wordcount_avro.pig", new String[] {"input=input.avro", "output=sorted.avro"});
     t.unoverride("STORE");
     t.runScript();

     // Check output
     BasicAvroReader reader = new BasicAvroReader(new Path(new File("sorted.avro").getAbsolutePath()), getFileSystem());
     Map<Utf8, GenericRecord> result = reader.readAndMapAll("word");
     Assert.assertEquals(result.size(), 2);
     Assert.assertEquals(result.get(new Utf8("banana")).get("cnt"), 2);
     Assert.assertEquals(result.get(new Utf8("apple")).get("cnt"), 1);
 }
 • 69. Test with Avro SpecificRecord
 ▸ Use InputRecord and OutputRecord generated Java classes, write data, run script, read result and compare

 @Test
 public void testWordCountSpecificRecord() throws IOException, ParseException {
     InputRecord input = InputRecord.newBuilder().setText("banana apple banana").build();

     BasicAvroWriter<InputRecord> writer = new BasicAvroWriter<InputRecord>(
             new Path(new File("input.avro").getAbsolutePath()), input.getSchema(), getFileSystem());
     writer.writeAll(input);

     PigTest t = newPigTest("pig/src/main/pig/wordcount_avro.pig", new String[] {"input=input.avro", "output=sorted.avro"});
     t.unoverride("STORE");
     t.runScript();

     // Check output
     BasicAvroReader<OutputRecord> reader = new BasicAvroReader<OutputRecord>(
             new Path(new File("sorted.avro").getAbsolutePath()), getFileSystem());
     List<OutputRecord> result = reader.readAll();
     Assert.assertEquals(result.size(), 2);
     Assert.assertEquals(result.get(0), OutputRecord.newBuilder().setWord("banana").setCount(2).build());
     Assert.assertEquals(result.get(1), OutputRecord.newBuilder().setWord("apple").setCount(1).build());
 }
 • 70. Common pitfalls
 ▸ PigUnit
   ▸ Mocking capabilities are very limited
   ▸ Overhead of 1-5 seconds per script
   ▸ Error messages are sometimes cryptic (e.g. NullPointerException)
 ▸ Pig UDFs
   ▸ Can be tested independently
 • 72. Spark Testing Base
 Base classes to use when writing tests with Spark
 ▸ https://github.com/holdenk/spark-testing-base
 ▸ Features
   ▸ Provides a SparkContext
   ▸ Utilities to compare RDDs and DataFrames
   ▸ Simulates how Streaming works
   ▸ Includes RDD and DataFrame generators
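 Comparing distributed datasets usually has to ignore element order, because partitions may be collected in any order. The sketch below shows that idea on plain Java collections; UnorderedCompare is a hypothetical helper and not the spark-testing-base API. It compares two collections as multisets, so duplicates still count.

```java
import java.util.Collection;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper: order-insensitive, duplicate-aware comparison of collections. */
final class UnorderedCompare {

    /** Counts how many times each element occurs. */
    static <T> Map<T, Integer> counts(Collection<T> items) {
        Map<T, Integer> result = new HashMap<>();
        for (T item : items) {
            result.merge(item, 1, Integer::sum);
        }
        return result;
    }

    /** Returns true if both collections hold the same elements the same number of times. */
    static <T> boolean sameElements(Collection<T> expected, Collection<T> actual) {
        return counts(expected).equals(counts(actual));
    }
}
```

 In a real Spark test the two collections would typically be the collected contents of the expected and the actual dataset; that only works for data small enough to fit on the driver.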
  • 73. Thank You! We are hiring! http://careers.getyourguide.com/
 • 74. Extra Resources
 ▸ https://github.com/miguno/avro-hadoop-starter
 ▸ http://www.michael-noll.com/blog/2013/07/04/using-avro-in-mapreduce-jobs-with-hadoop-pig-hive/
 ▸ http://blog.cloudera.com/blog/2015/09/making-apache-spark-testing-easy-with-spark-testing-base/
 ▸ http://www.slideshare.net/hkarau/effective-testing-for-spark-programs-strata-ny-2015
 ▸ http://avro.apache.org/docs/current/
 ▸ http://www.confluent.io/blog/schema-registry-kafka-stream-processing-yes-virginia-you-really-need-one
 ▸ http://mkuthan.github.io/blog/2015/03/01/spark-unit-testing/
 ▸ https://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying