2. AGENDA (PART 1)
• Background about Totango and our data architecture
• Spark in the Totango Architecture
• Quality: Testing Spark code in production
8. Totango Data Architecture
• ‘Lambda Architecture’
• Hosted on AWS
• AWS and Open-source technologies
• Java with a dash of Python
[Diagram: Pixel, 3rd Party (SFDC) and CSV sources feed a Collection layer, which fans out into Real-time processing and Batch processing paths before a Serving Layer]
9. Totango Data Architecture
• Hosted on AWS
• ‘Lambda Architecture’
• AWS and Open-source technologies
• Java with a dash of Python
[Diagram: the same architecture with the AWS building blocks called out: an ELB in front of collection, Kinesis streams feeding the real-time path, and S3 feeding the batch path]
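To make the collection path concrete, an event can be pushed onto a Kinesis stream roughly like this; the stream name, region, and payload below are illustrative assumptions, not Totango’s actual code:

import json

import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")

# Hypothetical usage event arriving from the tracking pixel
event = {"service_id": "service-a", "account_id": "acct-42", "action": "login"}

kinesis.put_record(
    StreamName="example-usage-events",       # assumed stream name
    Data=json.dumps(event).encode("utf-8"),
    PartitionKey=event["account_id"],        # keeps one account's events ordered
)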
11. Batch Processing
• Executed once a day (midnight in each customer’s local time)
• Each task calculates a set of account metrics (e.g. Health, Change)
• One Spark cluster runs all tasks for all customers
• Pipeline executed by Pipeline Runner, using Spotify Luigi (sketched below)
[Diagram: raw events feed parallel “calc metrics” tasks and a dependent computation; a merge step combines the results into the final account documents]
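As a rough sketch of how such a Luigi pipeline can be wired together (the task names, metric logic, and S3 paths are hypothetical, not Totango’s actual code):

import luigi
from luigi.contrib.s3 import S3Target


class CalcSomeMetrics(luigi.Task):
    """Hypothetical task: compute one set of account metrics for a service."""
    service_id = luigi.Parameter()
    date = luigi.DateParameter()

    def output(self):
        # Intermediate results land on S3, keyed by service and day
        return S3Target(f"s3://example-bucket/metrics/{self.service_id}/{self.date}/some.csv")

    def run(self):
        with self.output().open("w") as out:
            out.write("account_id,health\n")  # placeholder metric calculation


class MergeResults(luigi.Task):
    """Hypothetical final step: merge metric outputs into account documents."""
    service_id = luigi.Parameter()
    date = luigi.DateParameter()

    def requires(self):
        # Luigi derives the dependency graph from requires(); upstream runs first
        return [CalcSomeMetrics(self.service_id, self.date)]

    def output(self):
        return S3Target(f"s3://example-bucket/documents/{self.service_id}/{self.date}.json")

    def run(self):
        # Combine upstream outputs into the final document (placeholder logic)
        with self.output().open("w") as out:
            for target in self.input():
                with target.open("r") as f:
                    out.write(f.read())

Luigi then runs only the tasks whose outputs are missing, which is what makes daily re-runs of the whole graph cheap.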
12. Environment
• Multi-tenant: shared infrastructure for all Totango customers (Services), fanned out per tenant as sketched below
• Daily, hourly and on-demand schedules
• Standalone Spark cluster on AWS EC2 instances
• Input and output on S3; final results also indexed in Elasticsearch
[Diagram: the same metric pipeline repeated per tenant, from Service A through Service XYZ, each reading raw events and writing account documents]
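One way to express the per-tenant fan-out in the same Luigi style is a wrapper task; the service list and task names here are assumptions for illustration:

import luigi


class ServicePipeline(luigi.Task):
    """Stand-in for the full per-tenant metric pipeline sketched above."""
    service_id = luigi.Parameter()
    date = luigi.DateParameter()

    def output(self):
        return luigi.LocalTarget(f"out/{self.service_id}/{self.date}.done")

    def run(self):
        with self.output().open("w") as out:
            out.write("done\n")


class AllServices(luigi.WrapperTask):
    """Hypothetical fan-out: run the pipeline once per tenant (Service)."""
    date = luigi.DateParameter()

    def requires(self):
        # In production the tenant list would come from a registry, not a literal
        return [ServicePipeline(service_id=s, date=self.date)
                for s in ("service-a", "service-b", "service-xyz")]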
14. Requirements from infrastructure:
• Reliability: Calculate metrics accurately at all times
• Velocity: Frequent release of new data processing code
Challenge:
High quality and highly automated regression testing
[Diagram: the same pipeline with one “calc metrics” task replaced by a NEW VERSION]
How do we make sure the new version didn’t break anything?
15. Testing In Production: How
• Before deployment, run the release candidate ‘side by side’ with the older version
• The new version runs in Shadow mode and does not propagate its results
• Compare old and new version results; output unexpected diffs (see the sketch below)
• Deploy to production only if there are no diffs across all customer data sets
[Diagram: the OLD VERSION serves production while the NEW VERSION runs as a SHADOW; the CSV outputs of both are compared]
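The comparison step could look roughly like this in PySpark; the paths, the header layout, and the use of subtract() are assumptions for illustration, not the actual comparison code:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("shadow-diff").getOrCreate()

# Hypothetical S3 locations of the two runs' CSV outputs
old = spark.read.csv("s3://example-bucket/run-old/", header=True)
new = spark.read.csv("s3://example-bucket/run-new/", header=True)

# Rows present in one output but not the other are unexpected diffs
diffs = old.subtract(new).union(new.subtract(old))

if diffs.count() > 0:
    diffs.show(truncate=False)  # surface the differences for inspection
    raise SystemExit("Shadow run diverged from the old version; blocking deploy")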
16. Deployment Flow
1. Unit testing
2. Test environment: Integration testing
3. Side-by-side testing in production of new code
4. New code rolled out, old version kept side by side as backup
5. Rollout complete!
• We know the new version works correctly
• We do not need to think of all the corner test cases
• We do not need to write lots of regression tests