DiffTool (difftool.org) is a UI-based data comparison tool that works across data sources such as RDBMS, Hadoop (Hive), and text files (CSV and JSON). DiffTool compares datasets based on key(s). Its features include:
Controlling data volume using custom filters,
Transforming columns using SQL expressions,
Scaling through a distributed architecture,
Analyzing results intuitively with rich visualizations.
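To make the idea of a key-based comparison concrete, here is a minimal Python sketch. It is not DiffTool's implementation or API: the function names (`index_by_key`, `compare_by_key`), the row format (plain dictionaries), and the use of Python callables in place of SQL filters and SQL expressions are all assumptions made for illustration.

```python
from typing import Callable, Dict, Iterable, List, Optional, Tuple

Row = Dict[str, object]
Key = Tuple[object, ...]


def index_by_key(rows: Iterable[Row], keys: List[str],
                 row_filter: Optional[Callable[[Row], bool]] = None,
                 transform: Optional[Callable[[Row], Row]] = None) -> Dict[Key, Row]:
    """Filter rows, transform the survivors, and index them by the key columns."""
    indexed: Dict[Key, Row] = {}
    for row in rows:
        if row_filter is not None and not row_filter(row):
            continue  # a custom filter keeps the compared volume small
        if transform is not None:
            row = transform(row)  # stand-in for a SQL column expression
        indexed[tuple(row[k] for k in keys)] = row
    return indexed


def compare_by_key(source: Iterable[Row], target: Iterable[Row], keys: List[str],
                   row_filter=None, transform=None) -> dict:
    """Report keys missing on either side and the columns that differ per shared key."""
    src = index_by_key(source, keys, row_filter, transform)
    tgt = index_by_key(target, keys, row_filter, transform)
    mismatched = {}
    for key in src.keys() & tgt.keys():
        columns = src[key].keys() | tgt[key].keys()
        diff = sorted(c for c in columns if src[key].get(c) != tgt[key].get(c))
        if diff:
            mismatched[key] = diff
    return {
        "only_in_source": sorted(k for k in src if k not in tgt),
        "only_in_target": sorted(k for k in tgt if k not in src),
        "mismatched": mismatched,
    }


# Hypothetical sample data: the same table loaded from two different sources.
source_rows = [
    {"id": 1, "name": "Ann ", "amount": 10},
    {"id": 2, "name": "Bob", "amount": 20},
    {"id": 3, "name": "Cy", "amount": 30},
]
target_rows = [
    {"id": 1, "name": "Ann", "amount": 10},
    {"id": 2, "name": "Bob", "amount": 25},
    {"id": 4, "name": "Di", "amount": 40},
]


def trim_name(row: Row) -> Row:
    """Normalize the name column before comparing (like TRIM(name) in SQL)."""
    return {**row, "name": str(row["name"]).strip()}


result = compare_by_key(source_rows, target_rows, keys=["id"], transform=trim_name)
print(result)
```

With the transformation applied, the trailing space in "Ann " no longer counts as a difference, so only the changed `amount` for id 2 and the two unmatched keys are reported.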
2. Why DiffTool?
• Validate data for ETL offloading, currently the number one use case in big data.
• Validate data when implementing improvements, optimizations, and enhancements to existing processes.
• Validate data loads/transformations in the data pipeline.
3. Common “Data Comparison” Requirements
• Gather stats to analyze the differences.
• Compare a random sample instead of the full data.
• Apply Transformations before comparing.
5. How to use DiffTool?
• Using the UI
• Integrating the simple REST API into the build process
6. Technical Challenges
• Automate testing with various data sources (Hive, JDBC for popular databases, CSV, Parquet, ORC), thanks to Docker.
• Even a 30-35 second wait is not acceptable for interactive applications.
Hello everybody. My name is Dhiraj Peechara, and I am the cofounder of DiffTool. The website is difftool.org.
My goal is to make data validation on Hadoop an easy task.
“Data Validation” is a very broad term and it can mean many things. I started with the “Data Comparison” module and am now building the other modules. In this presentation I will only talk about “Data Comparison”. If time permits, I will talk about how machine learning can help with some of the other modules of data validation.
1 min.
Currently the number one use case for big data is ETL offloading. An ETL offloading project can take anywhere from months to multiple years. During this development process, Hadoop developers try to produce the same output as the existing ETL jobs and compare the data.
Even after a data pipeline is set up in Hadoop, optimization, improvement, and enhancement work never stops. A simple and efficient data comparison tool can help both developers and testers.
The amount of effort testing teams spend validating a-b transformations is significant, especially when a and b are different sources. This is a two-step process:
first perform the a-btesting transformation, and then compare b and btesting.
2 min.
Once we have found discrepancies in a data comparison, what is the next step? We try to analyze the results and gather stats. All of this takes considerable effort.
Sometimes we want to compare a random sample of the data instead of the full data, for example because the volume is huge.
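As a rough illustration of sampling and gathering stats together, the Python sketch below draws one random sample of keys and compares those same keys on both sides, then counts mismatches per column. Independent samples from each side would rarely contain the same rows, which is why the keys are sampled once. The function names, the `fraction` and `seed` parameters, and the dictionary-based data layout are assumptions for this example, not DiffTool's actual behavior.

```python
import random
from collections import Counter
from typing import Dict, Hashable, Iterable, List

Row = Dict[str, object]


def sample_keys(keys: Iterable[Hashable], fraction: float, seed: int = 0) -> List[Hashable]:
    """Pick a reproducible random subset of keys (at least one key if any exist)."""
    ordered = sorted(keys)  # sort first so the same seed gives the same sample
    if not ordered:
        return []
    size = min(len(ordered), max(1, round(len(ordered) * fraction)))
    return random.Random(seed).sample(ordered, size)


def sampled_diff_stats(source: Dict[Hashable, Row], target: Dict[Hashable, Row],
                       fraction: float, seed: int = 0) -> dict:
    """Compare the same sampled keys on both sides and count mismatches per column."""
    sampled = sample_keys(source.keys() & target.keys(), fraction, seed)
    column_mismatches: Counter = Counter()
    rows_with_diff = 0
    for key in sampled:
        columns = source[key].keys() | target[key].keys()
        diff = [c for c in columns if source[key].get(c) != target[key].get(c)]
        column_mismatches.update(diff)
        if diff:
            rows_with_diff += 1
    return {
        "sampled_rows": len(sampled),
        "rows_with_diff": rows_with_diff,
        "column_mismatches": dict(column_mismatches),
    }


# Hypothetical data: 100 rows keyed by id; every tenth row has a wrong amount in the target.
source_data = {i: {"amount": i, "status": "ok"} for i in range(100)}
target_data = {i: {"amount": i + 1 if i % 10 == 0 else i, "status": "ok"} for i in range(100)}

print(sampled_diff_stats(source_data, target_data, fraction=0.2, seed=42))
```

A fixed seed keeps the sample reproducible, so a discrepancy found in one run can be investigated again in the next.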
Home page: The application is a zip file, typically downloaded to a gateway node. Installation is very simple: just unzip and start the server. The server has very low resource requirements; it needs less than 500MB of RAM.
DiffTool can be used through the UI or by integrating the simple REST API into the existing build process.