15. Jupyter Notebooks support the ML Lifecycle
1. Collect Data
- Retrieve Files
- Query SQL Databases
- Call Web Services
- “Scrape” Web Pages
2. Prepare Data
- Explore Data
- Validate Data
- Clean Data
- Features / Data
3. Train Model
- Prepare Training Set
- Experiment
- Test Model
- Visualize
4. Evaluate Model
- Test Performance
- Compare Models
- Validate Model
- Visualize
5. Deploy Model
- Export Model File
- Prepare Job
- Deploy Container
- Re-package Model
Notebooks support every phase:
- Execute code blocks: Python, R… code, SQL queries, shell commands
- Write documentation: Markdown language
- Visualize data: viz tools…
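The five phases above can be walked through in a single notebook cell. Here is a minimal pure-Python sketch; the synthetic data, the nearest-centroid "model", and every name in it are illustrative stand-ins for real collect/prepare/train/evaluate/deploy tooling:

```python
import json
import os
import random
import tempfile

# 1. Collect: synthesize a small two-class dataset
# (stand-in for files, SQL queries, web services, or scraping)
random.seed(0)
points = [([random.gauss(0, 1), random.gauss(0, 1)], 0) for _ in range(50)]
points += [([random.gauss(3, 1), random.gauss(3, 1)], 1) for _ in range(50)]

# 2. Prepare: shuffle and split into train / test sets
random.shuffle(points)
train, test = points[:80], points[80:]

# 3. Train: a nearest-centroid classifier (one mean vector per class)
def centroid(rows):
    return [sum(col) / len(rows) for col in zip(*rows)]

model = {label: centroid([x for x, y in train if y == label]) for label in (0, 1)}

# 4. Evaluate: accuracy on the held-out test set
def predict(model, x):
    dist = lambda a, b: sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return min(model, key=lambda label: dist(model[label], x))

accuracy = sum(predict(model, x) == y for x, y in test) / len(test)
print(f"test accuracy: {accuracy:.2f}")

# 5. Deploy: export the trained model to a file for re-packaging
path = os.path.join(tempfile.gettempdir(), "model.json")
with open(path, "w") as f:
    json.dump(model, f)
```

In a notebook, each numbered step would typically sit in its own cell with Markdown documentation in between.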
19. Mathematica evolved…
Jupyter Notebook
- Market leader
- Started for single-user work
- Academic community
- GitHub integration
- Added JupyterHub for collaboration
Zeppelin Notebook
- Started for collaboration
- Enterprise
- Security
Vendor Notebooks
- Databricks for Apache Spark
- Jupyter-like, but proprietary format
@lynnlangit
24. Bioinformatics | Denis C. Bauer | @allPowerde
Genomic Research Tools: Two Examples
- GT-Scan2: How can genome engineering be made more effective?
- VariantSpark: How to find disease genes in population-size cohorts?
25. Transformational Bioinformatics | Denis C. Bauer | @allPowerde
Machine learning… on 1.7 trillion data points
https://www.projectmine.com/about/
26. VariantSpark - Parallelize Random Forest for scalability
- Spark ML’s RF was designed for ‘Big’ low-dimensional data.
- The full genome-wide profile does NOT fit into the executors’ memory.
“Cursed” Big Data, e.g. genomics: a moderate number of samples with many features; the feature set is too large to be handled by a single executor.
27. VariantSpark - Parallelize RF to scale with features
Flip the matrix: partition by column
Firas Abuzaid (Spark Summit 2016), “Yggdrasil: Faster Decision Trees Using Column Partitioning in Spark”
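Outside Spark, the column-partitioning idea behind Yggdrasil can be sketched in plain Python (toy data, hypothetical helper names): with the matrix stored column-wise, the best-split search for a feature touches only that feature's vector plus the small label vector, so a very wide feature set can be spread across executors by column and combined with one cheap reduce.

```python
# Row layout: each record carries ALL features, so a split search must ship
# every (wide) row. Column layout: feature j is one contiguous vector, so one
# executor can evaluate its splits seeing only that column plus the labels.

rows = [
    [0.1, 3.2, 5.0],
    [0.4, 2.9, 1.0],
    [0.9, 3.8, 4.5],
    [0.7, 2.1, 0.5],
]
labels = [0, 0, 1, 1]

# "Flip the matrix": partition by column instead of by row
columns = list(map(list, zip(*rows)))

def gini(ys):
    """Gini impurity of a binary label list."""
    if not ys:
        return 0.0
    p = sum(ys) / len(ys)
    return 2 * p * (1 - p)

def best_split(col, ys):
    """Best (impurity, threshold) for one feature, using only its column."""
    best = (float("inf"), None)
    for t in sorted(set(col)):
        left = [y for v, y in zip(col, ys) if v <= t]
        right = [y for v, y in zip(col, ys) if v > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(ys)
        best = min(best, (score, t))
    return best

# Each column partition computes its candidate independently; a cheap reduce
# then picks the global winner (feature index, impurity, threshold).
candidates = [(j, *best_split(col, labels)) for j, col in enumerate(columns)]
feature, score, threshold = min(candidates, key=lambda c: c[1])
print(f"split on feature {feature} at {threshold} (impurity {score:.3f})")
```

In the real VariantSpark/Yggdrasil setting the per-column work runs on distributed executors; here the list comprehension plays that role sequentially.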
28. Wide RF scales with both features and samples
29. New -- Python API for VariantSpark
# set up context and input parameters
spark = SparkSession(sc)
vc = VariantsContext(spark)
label = vc.load_label('dius/data/chr22-labels.csv', 'col_name')
features = vc.import_vcf('dius/data/chr22_1000.vcf')
# instantiate analysis (parameters are type-checked)
imp_analysis = features.importance_analysis(label)
# get significant factors as both a tuple list and a dataframe
imp_vars = imp_analysis.important_variables(20)
most_imp_var = imp_vars[0][0]
imp_df = imp_analysis.variable_importance()
oob_error = imp_analysis.oob_error()
# convert to work with common Python tools
pandas_imp_df = imp_df.toPandas()
35. Tools for Jupyter
• Binder for GitHub
• Point to your GitHub repo containing:
  • Jupyter Notebooks
  • requirements.txt
• It builds a Docker image
• You can run your Notebooks in the browser
@lynnlangit
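For context, a repo typically becomes Binder-launchable by adding a requirements.txt beside the notebooks; a hypothetical minimal layout (file and package names illustrative):

```text
my-analysis-repo/
├── notebooks/
│   └── exploration.ipynb
└── requirements.txt   # one package per line, e.g. pandas==1.5.3
```

Pasting the repo URL at https://mybinder.org triggers the Docker image build and then serves the notebooks.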
37. Future of Jupyter for Research
Academic Institutions and Research Labs:
UC Berkeley, Davis, San Diego
Cal Poly San Luis Obispo
Clemson University
CU Boulder
U of Illinois, Minnesota, Missouri, Rochester, Texas
MIT
Michigan State U
Texas A&M
@lynnlangit
History talk from Cristian Prieto (NDC Oslo 2016) -- https://vimeo.com/223984769
http://blog.fperez.org/2012/01/ipython-notebook-historical.html
Local install
- pip install ipython[all] -OR- use Anaconda, which installs Jupyter Notebooks by default
- pip install jupyter[all]; the R kernel (IRkernel) is installed via R, not pip
- You can use Docker (a 2.1 GB image contains all libraries), or use Azure Notebooks or AWS SageMaker Notebooks
- Only Python 2 is installed by default; you can install other runtimes
- Start and run in a local browser (no database; uses local .json files)
- jupyter notebook -> localhost:8888/tree
- Uses GitHub-flavored Markdown (by default)
https://dwhsys.com/2017/03/25/apache-zeppelin-vs-jupyter-notebook/
https://www.gt-scan.net/ -AND- AMA with Dr. Bauer -- https://www.reddit.com/r/science/comments/5fiicm/science_ama_series_im_denis_bauer_a_team_leader/
https://medium.com/@lynnlangit/aws-sagemaker-for-bioinformatics-b8e8a96479d8
Jupyter on GCE VM -- https://towardsdatascience.com/running-jupyter-notebook-in-google-cloud-platform-in-15-min-61e16da34d52
https://mybinder.org/ -ALSO-
https://nbviewer.jupyter.org/ - allows you to run notebooks stored in GitHub