Summary of goals, progress, and next steps for three aspects of the Materials Project (materialsproject.org) infrastructure:
* Validation: constantly guard against bugs in core data and imported data
* Provenance: know how data came to be
* Sandboxes: combine public and non-public data; "good fences make good neighbors"
Presenter: Dan Gunter, LBNL
2. Goals
• Validation
– constantly guard against bugs in core data and imported data
• Provenance
– know how data came to be
• Sandboxes
– Combine public and non-public data; "good fences make good neighbors"
4. Validation runs all the time
• Rules with "constraints" for every database (and sandbox)
• Test constraints against the entire DB every night; email reports
• Validation engine, etc. are all open-source software in pymatgen-db
[Diagram labels: Remote server; Validation engine; Rules; MP Databases; Reports (email, web pages, ..)]
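As an illustration of the idea of rules with constraints, the following is a minimal sketch of a nightly check. It does not use the pymatgen-db API: the rule format `(field, operator, value)`, the function names, and the example documents are all assumptions made up for this sketch.

```python
import operator

# Hypothetical rule format: (field, operator symbol, expected value).
OPS = {
    "<": operator.lt, "<=": operator.le,
    ">": operator.gt, ">=": operator.ge,
    "==": operator.eq, "!=": operator.ne,
}

def violations(doc, rules):
    """Return the rules that a single document fails."""
    failed = []
    for field, op, value in rules:
        # In this sketch a missing field counts as a violation.
        if field not in doc or not OPS[op](doc[field], value):
            failed.append((field, op, value))
    return failed

def validate(docs, rules):
    """Map document id -> failed rules, only for documents with failures."""
    report = {}
    for doc in docs:
        failed = violations(doc, rules)
        if failed:
            report[doc["task_id"]] = failed
    return report

# Made-up documents and rules, for illustration only.
docs = [
    {"task_id": 1, "energy_per_atom": -3.2, "nsites": 4},
    {"task_id": 2, "energy_per_atom": 12.0, "nsites": 0},
]
rules = [("energy_per_atom", "<", 0), ("nsites", ">", 0)]
report = validate(docs, rules)
```

The resulting `report` dictionary is the kind of structure that could then be rendered into an email or web page.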
6. Validation summary
Easy-to-use, integrated, efficient tools to report errors
Next steps
– Record all check results in DB
– More sophisticated checks (Map/Reduce)
– Make it easier to add new checks internally
– Make it easier for anyone to add new checks
• per-sandbox or even per-user ("MP Alerts")
8. Types of provenance in the system
1) Calculation workflows
– FireWorks records calculation inputs, .. results in great detail
2) External datasets
– Structure Notation Language standardizes the naming of data sources and publications
3) Post-calculation data transformations
– New "builders" provide a framework for tracking the creation of final database products
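The third type, tracking post-calculation transformations, can be sketched as a wrapper that stores a provenance record next to each derived product. This is not the actual "builders" framework; `build_with_provenance`, the record fields, and the sample data are hypothetical.

```python
from datetime import datetime, timezone

def build_with_provenance(sources, transform, builder_name):
    """Apply `transform` to the source records and note how the result was made."""
    return {
        "data": transform(sources),
        "provenance": {
            "builder": builder_name,
            # Ids of the inputs, so the product can be traced back.
            "source_ids": [s["id"] for s in sources],
            "created_at": datetime.now(timezone.utc).isoformat(),
        },
    }

def lowest_energy(records):
    """Example transformation: pick the lowest energy among the inputs."""
    return min(r["energy"] for r in records)

# Made-up source records, for illustration only.
sources = [
    {"id": "task-1", "energy": -1.5},
    {"id": "task-2", "energy": -2.0},
]
product = build_with_provenance(sources, lowest_energy, "lowest_energy_builder")
```

Keeping the source ids with the product is what lets a later query answer "how did this value come to be" without re-running the transformation.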
11. Future work: unified view of provenance
[Diagram labels: ICSD; VASP result (three instances); Computation; Data import; Post-processing; Material properties; e.g., Defects]
14. Sandboxes = Database + Apps
[Diagram: Non-JCESR users see Core data; JCESR users see Core data + multivalent materials]
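The "core data plus sandbox data" view can be sketched as a function that returns core records to everyone and adds sandbox records only for members. The data layout, the user names, and `visible_materials` are assumptions for this sketch, not the Materials Project implementation.

```python
# Made-up core and sandbox contents, for illustration only.
CORE = [{"material_id": 1, "formula": "Fe2O3"}]
SANDBOXES = {
    "jcesr": {
        "members": {"alice"},
        "data": [{"material_id": 1001, "formula": "MgMn2O4"}],
    },
}

def visible_materials(user):
    """Core data for everyone, plus data from sandboxes the user belongs to."""
    # Tag each record with its origin so the two sources stay distinguishable.
    docs = [dict(d, sandbox=None) for d in CORE]
    for name, sandbox in SANDBOXES.items():
        if user in sandbox["members"]:
            docs.extend(dict(d, sandbox=name) for d in sandbox["data"])
    return docs
```

Here a sandbox member gets both records, while any other user gets only the core record.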
15. Technical challenges
• Pre-process data for real-time search
• Interfaces for per-user access control
– https://materialsproject.org/materials/1234?sandbox=jcesr
– Web UI elements
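A URL of the form shown above can be handled with the standard library alone. The sketch below extracts the `sandbox` query parameter and makes an access decision; the membership table and the rule that requests without a sandbox need no membership are assumptions, not the site's actual logic.

```python
from urllib.parse import urlparse, parse_qs

# Hypothetical membership table: sandbox name -> set of user names.
SANDBOX_MEMBERS = {"jcesr": {"alice"}}

def requested_sandbox(url):
    """Return the value of the `sandbox` query parameter, or None if absent."""
    query = parse_qs(urlparse(url).query)
    values = query.get("sandbox")
    return values[0] if values else None

def may_access(user, url):
    """Decide whether `user` may view the resource named by `url`."""
    sandbox = requested_sandbox(url)
    if sandbox is None:
        # Assumption: requests without a sandbox target public core data.
        return True
    return user in SANDBOX_MEMBERS.get(sandbox, set())

url = "https://materialsproject.org/materials/1234?sandbox=jcesr"
```

An unknown sandbox name falls through to an empty member set, so access is denied by default.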
16. Future: dynamic sandbox creation
Current:
– Large & significant additional data / apps
• e.g., JCESR
– Longer-term connections to MP data
• e.g., porous materials
– Companies
• e.g., VW/Stanford
Future:
– small collab.
– per-user?
– CoD?
17. Summary
• Validation
– guard against bugs by checking all data daily and at data import/creation time
• Provenance
– universal standard for annotating data provenance
• Sandboxes
– unified view of distinct databases
– onramp for new collaborations and data
Editor's Notes
Picture of 1915 Heinrich Campendonk painting, "Landscape with horses". Steve Martin paid $850K for a forged version of the painting, from a reputable art house in Paris, in 2004. He sold it at a loss of $250K before discovering it was a forgery. The forgery was performed by Wolfgang Beltracchi.
Sandboxes are a way to share preliminary data in the context of MP data and tools.