SlideShare a Scribd company logo
1 of 24
Download to read offline
Anahita Bhiwandiwalla – Intel Corporation
Karthik Vadla – Intel Corporation
A Predictive Analytics
Workflow on DICOM
Images using Apache
Spark
Who we are?
Agenda
• Use-case overview
• Challenges
• What is our data?
• Deriving Insights
• Machine Learning
• Tools
• Spark-tk
• Demo
• spark-tensorflow-connector
• Performance
Use-case overview
• Vast amount of medical image data available
• Meta-data accompanying the image
– Patient information
– Status of condition
– Method of image capture
– Time to events(if part of a study)
Create distributed technology needed to efficiently
store, scan and analyze data in healthcare.
Challenges
• Getting data for experimentation is hard!
• Combine data from multiple public sources for simulation
• Multiple tools – image processing libraries, visualization
tools, distributed engines, ML libraries
• Compatibility across different tools
How do we make Machine Learning easy to use?
Data: DICOM
- Digital Imaging and
COMmunications in Medicine
- International standard format to
store, exchange, and transmit
medical images
- Standards for imaging modalities -
radiography, ultrasonography,
computed tomography (CT),
magnetic resonance imaging (MRI),
and radiation therapy.
- Protocols for image exchange, image
compression, 3-D visualization,
image presentation etc.
Data: Meta-data
• Image related: storage, date, relevant parts of
the image etc.
• Patient related: age, weight, height, DOB,
medical history
• Device related: manufacturer information,
operator, method of capture
Deriving insights
• Can we develop smart filtering techniques based on
image meta-data?
• Can we conclude and predict certain conditions based
on patient meta-data?
• Can we classify image capture techniques and suggest
improvements based on device meta-data?
• Can we combine image & patient meta-data to conclude
and predict?
• Can analytics help automate/improve image capture
techniques?
Machine Learning
• Patient data
– Train classifier model on patient data to predict presence
of a condition
• Logistic Regression, Random Forest, Naïve Bayes
– Train regressor model on patient data to predict probability
of a condition
• Linear Regression, Random Forest
– Train survival analysis model on patient data to predict
time to event, compare risk of populations
• Accelerate Failure Time Model, Cox Proportional Hazards Model
Machine Learning
• Image data
– Compute the Eigen decomposition of the image
– Compute the Covariance Matrix of the image
– Compute the Principal Component Analysis of the
image pixel data
Machine Learning
• Device data:
– Predict trends in image capture techniques
• Classification/Regression techniques
– Predict device time to failure
• Cox Proportional Hazards Model, Accelerated Failure Time
– Recommendations to device capture methods
• Collaborative filtering
Tools
• pydicom
– Python package for working with DICOM files such as medical images, reports, and radiotherapy
– Single threaded
• dcm4che
– Collection of open source applications and utilities for the healthcare enterprise.
– Developed in the Java programming language for performance and portability, supporting
deployment on JDK 1.6 and up
• matplotlib
– Python 2D plotting library which produces publication quality figures in a variety of hardcopy formats
and interactive environments across platforms.
– Can be used in Python scripts, Python and IPython shell, jupyter notebook, web application servers
• Machine Learning libraries
– Classifiers, regressors, failure analysis analysis
– Eigen decomposition
Spark-tk Introduction
• Open source library enhancing the Spark experience
• APIs for feature engineering, ETL, graph construction,
machine learning, image processing, scoring
• Abstraction level familiar to data scientists; removes
complexity of parallel processing
• Lower level Spark APIs seamlessly exposed
• Easy to build apps plugging in other tools/libraries
https://github.com/trustedanalytics/spark-tk
http://trustedanalytics.github.io/spark-tk/
Spark-tk components
• Frames
– Scalable data frame representation
– More intuitive than low level HDFS file and Spark
RDD/DataFrame/DataSet formats; schema inference
– APIs to manipulate the data frames for feature engineering and
exploration, such as joins and aggregations
– Run user-defined transformations and filters using distributed
processing
– Input to our models
– Easy to convert to/from Spark-tk Frames - Spark representations
Spark-tk components
• Graphs
– Scalable graph representation based on Frames for
vertices and edges
– In house distributed algorithms for graph analytics
using GraphX
– Supports importing/exporting to OrientDB’s scalable
graph databases – visualize, real-time graph querying
– Store massive graphs
Spark-tk components
• Machine Learning & Streaming
– Time series analysis
– Recommender systems
– Topic Modeling
– Clustering
– Classification/Regression
– Image processing
– Scikit Models
– Streaming via Scoring engine
How Spark-tk helps
• Brings in support for image analytics to Spark
– Support for ingesting and processing DICOM images in a
distributed environment
– Queries, filters, and analytics on image collections
– Integrated & distributed dcm4che3 via. Spark
– Tested against pydicom
– Used matplotlib for visualization
• Machine learning for image and health records
– Created distributed Cox Survival Proportional Hazards Model to
compare sets of populations
– Recommend algorithm and hyper-parameters
Live Demo
spark-tensorflow-connector
• Developed library for loading and storing
TensorFlow records with Apache Spark
• Implements data import from the standard
TensorFlow record format (TFRecords) into
Spark SQL DataFrames, and data export from
DataFrames to TensorFlow records
https://github.com/tensorflow/ecosystem/tree/master/spark/spark-tensorflow-
connector
Performance
0
5
10
15
20
25
30
0.25 0.5 1 2 4 8 16 32
Data(GB) Vs. Time(Min)
Processing time increases with data(Constant #Partitions=1000)
Performance
24.5
25
25.5
26
26.5
27
27.5
28
28.5
29
29.5
300 450 675 1012 2278 3147 5125 7688
#Partitions Vs. Time(Min)
Increasing partitions reduced the time and then increased time(overhead)
Performance
0
10
20
30
40
50
60
70
80
90
100
7 11 24 35 56
#Executors Vs. Time(Min)
Increasing number of executors reduces the time significantly
Performance
0
5
10
15
20
25
30
35
1 2 3 4 5
#Executor Cores Vs. Time(Min)
Increasing executor cores did not significantly change time
Thank You.
https://github.com/trustedanalytics/spark-tk
http://trustedanalytics.github.io/spark-tk/

More Related Content

What's hot

[찾아가는세미나] 클라우드 데이터 가상화솔루션
[찾아가는세미나] 클라우드 데이터 가상화솔루션[찾아가는세미나] 클라우드 데이터 가상화솔루션
[찾아가는세미나] 클라우드 데이터 가상화솔루션해은 최
 
Azure Blob Storage API for Scala and Spark
Azure Blob Storage API for Scala and SparkAzure Blob Storage API for Scala and Spark
Azure Blob Storage API for Scala and SparkBraja Krishna Das
 
Apache kafka performance(throughput) - without data loss and guaranteeing dat...
Apache kafka performance(throughput) - without data loss and guaranteeing dat...Apache kafka performance(throughput) - without data loss and guaranteeing dat...
Apache kafka performance(throughput) - without data loss and guaranteeing dat...SANG WON PARK
 
Getting Started with Apache Spark on Kubernetes
Getting Started with Apache Spark on KubernetesGetting Started with Apache Spark on Kubernetes
Getting Started with Apache Spark on KubernetesDatabricks
 
Introduction to RAG (Retrieval Augmented Generation) and its application
Introduction to RAG (Retrieval Augmented Generation) and its applicationIntroduction to RAG (Retrieval Augmented Generation) and its application
Introduction to RAG (Retrieval Augmented Generation) and its applicationKnoldus Inc.
 
Docker Networking Overview
Docker Networking OverviewDocker Networking Overview
Docker Networking OverviewSreenivas Makam
 
Cloud computing Risk management
Cloud computing Risk management  Cloud computing Risk management
Cloud computing Risk management Padma Jella
 
Kafka Tutorial - Introduction to Apache Kafka (Part 1)
Kafka Tutorial - Introduction to Apache Kafka (Part 1)Kafka Tutorial - Introduction to Apache Kafka (Part 1)
Kafka Tutorial - Introduction to Apache Kafka (Part 1)Jean-Paul Azar
 
Spark 의 핵심은 무엇인가? RDD! (RDD paper review)
Spark 의 핵심은 무엇인가? RDD! (RDD paper review)Spark 의 핵심은 무엇인가? RDD! (RDD paper review)
Spark 의 핵심은 무엇인가? RDD! (RDD paper review)Yongho Ha
 
Impala presentation
Impala presentationImpala presentation
Impala presentationtrihug
 
Spark Summit East 2015 Advanced Devops Student Slides
Spark Summit East 2015 Advanced Devops Student SlidesSpark Summit East 2015 Advanced Devops Student Slides
Spark Summit East 2015 Advanced Devops Student SlidesDatabricks
 
Google cloud computing
Google cloud computingGoogle cloud computing
Google cloud computingBrian Pichman
 
KFServing - Serverless Model Inferencing
KFServing - Serverless Model InferencingKFServing - Serverless Model Inferencing
KFServing - Serverless Model InferencingAnimesh Singh
 

What's hot (20)

[찾아가는세미나] 클라우드 데이터 가상화솔루션
[찾아가는세미나] 클라우드 데이터 가상화솔루션[찾아가는세미나] 클라우드 데이터 가상화솔루션
[찾아가는세미나] 클라우드 데이터 가상화솔루션
 
Azure Blob Storage API for Scala and Spark
Azure Blob Storage API for Scala and SparkAzure Blob Storage API for Scala and Spark
Azure Blob Storage API for Scala and Spark
 
Apache kafka performance(throughput) - without data loss and guaranteeing dat...
Apache kafka performance(throughput) - without data loss and guaranteeing dat...Apache kafka performance(throughput) - without data loss and guaranteeing dat...
Apache kafka performance(throughput) - without data loss and guaranteeing dat...
 
Getting Started with Apache Spark on Kubernetes
Getting Started with Apache Spark on KubernetesGetting Started with Apache Spark on Kubernetes
Getting Started with Apache Spark on Kubernetes
 
Introduction to RAG (Retrieval Augmented Generation) and its application
Introduction to RAG (Retrieval Augmented Generation) and its applicationIntroduction to RAG (Retrieval Augmented Generation) and its application
Introduction to RAG (Retrieval Augmented Generation) and its application
 
Docker Networking Overview
Docker Networking OverviewDocker Networking Overview
Docker Networking Overview
 
Cloud computing Risk management
Cloud computing Risk management  Cloud computing Risk management
Cloud computing Risk management
 
cloud computing architecture.pptx
cloud computing architecture.pptxcloud computing architecture.pptx
cloud computing architecture.pptx
 
Kafka Tutorial - Introduction to Apache Kafka (Part 1)
Kafka Tutorial - Introduction to Apache Kafka (Part 1)Kafka Tutorial - Introduction to Apache Kafka (Part 1)
Kafka Tutorial - Introduction to Apache Kafka (Part 1)
 
Spark 의 핵심은 무엇인가? RDD! (RDD paper review)
Spark 의 핵심은 무엇인가? RDD! (RDD paper review)Spark 의 핵심은 무엇인가? RDD! (RDD paper review)
Spark 의 핵심은 무엇인가? RDD! (RDD paper review)
 
Exploiting GPUs in Spark
Exploiting GPUs in SparkExploiting GPUs in Spark
Exploiting GPUs in Spark
 
Cloud storage
Cloud storageCloud storage
Cloud storage
 
Big data frameworks
Big data frameworksBig data frameworks
Big data frameworks
 
Impala presentation
Impala presentationImpala presentation
Impala presentation
 
Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to Hadoop
 
Cloud security
Cloud security Cloud security
Cloud security
 
Spark Summit East 2015 Advanced Devops Student Slides
Spark Summit East 2015 Advanced Devops Student SlidesSpark Summit East 2015 Advanced Devops Student Slides
Spark Summit East 2015 Advanced Devops Student Slides
 
Google cloud computing
Google cloud computingGoogle cloud computing
Google cloud computing
 
KFServing - Serverless Model Inferencing
KFServing - Serverless Model InferencingKFServing - Serverless Model Inferencing
KFServing - Serverless Model Inferencing
 
systemd
systemdsystemd
systemd
 

Similar to A Predictive Analytics Workflow on DICOM Images using Apache Spark with Anahita Bhiwandiwalla and Karthik Vadla

AI IN PATH final PPT.pptx
AI IN PATH final PPT.pptxAI IN PATH final PPT.pptx
AI IN PATH final PPT.pptxDivyaGaurav4
 
Building your big data solution
Building your big data solution Building your big data solution
Building your big data solution WSO2
 
Makine Öğrenmesi, Yapay Zeka ve Veri Bilimi Süreçlerinin Otomatikleştirilmesi...
Makine Öğrenmesi, Yapay Zeka ve Veri Bilimi Süreçlerinin Otomatikleştirilmesi...Makine Öğrenmesi, Yapay Zeka ve Veri Bilimi Süreçlerinin Otomatikleştirilmesi...
Makine Öğrenmesi, Yapay Zeka ve Veri Bilimi Süreçlerinin Otomatikleştirilmesi...Ali Alkan
 
Im symposium presentation - OCR and Text analytics for Medical Chart Review ...
Im symposium presentation -  OCR and Text analytics for Medical Chart Review ...Im symposium presentation -  OCR and Text analytics for Medical Chart Review ...
Im symposium presentation - OCR and Text analytics for Medical Chart Review ...Alex Zeltov
 
Scaling up with Cisco Big Data: Data + Science = Data Science
Scaling up with Cisco Big Data: Data + Science = Data ScienceScaling up with Cisco Big Data: Data + Science = Data Science
Scaling up with Cisco Big Data: Data + Science = Data ScienceeRic Choo
 
Data Science and Deep Learning on Spark with 1/10th of the Code with Roope As...
Data Science and Deep Learning on Spark with 1/10th of the Code with Roope As...Data Science and Deep Learning on Spark with 1/10th of the Code with Roope As...
Data Science and Deep Learning on Spark with 1/10th of the Code with Roope As...Databricks
 
Introduction to Machine learning
Introduction to Machine learningIntroduction to Machine learning
Introduction to Machine learningNEEVEE Technologies
 
The Analytics Frontier of the Hadoop Eco-System
The Analytics Frontier of the Hadoop Eco-SystemThe Analytics Frontier of the Hadoop Eco-System
The Analytics Frontier of the Hadoop Eco-Systeminside-BigData.com
 
Application and Challenges of Streaming Analytics and Machine Learning on Mu...
 Application and Challenges of Streaming Analytics and Machine Learning on Mu... Application and Challenges of Streaming Analytics and Machine Learning on Mu...
Application and Challenges of Streaming Analytics and Machine Learning on Mu...Databricks
 
Using Deep Learning on Apache Spark to Diagnose Thoracic Pathology from Chest...
Using Deep Learning on Apache Spark to Diagnose Thoracic Pathology from Chest...Using Deep Learning on Apache Spark to Diagnose Thoracic Pathology from Chest...
Using Deep Learning on Apache Spark to Diagnose Thoracic Pathology from Chest...Databricks
 
Makine Öğrenmesi ile Görüntü Tanıma | Image Recognition using Machine Learning
Makine Öğrenmesi ile Görüntü Tanıma | Image Recognition using Machine LearningMakine Öğrenmesi ile Görüntü Tanıma | Image Recognition using Machine Learning
Makine Öğrenmesi ile Görüntü Tanıma | Image Recognition using Machine LearningAli Alkan
 
FPGA Hardware Accelerator for Machine Learning
FPGA Hardware Accelerator for Machine Learning FPGA Hardware Accelerator for Machine Learning
FPGA Hardware Accelerator for Machine Learning Dr. Swaminathan Kathirvel
 
Using Crowdsourced Images to Create Image Recognition Models with Analytics Z...
Using Crowdsourced Images to Create Image Recognition Models with Analytics Z...Using Crowdsourced Images to Create Image Recognition Models with Analytics Z...
Using Crowdsourced Images to Create Image Recognition Models with Analytics Z...Maurice Nsabimana
 
Data Science.pptx NEW COURICUUMN IN DATA
Data Science.pptx NEW COURICUUMN IN DATAData Science.pptx NEW COURICUUMN IN DATA
Data Science.pptx NEW COURICUUMN IN DATAjaved75
 
A machine learning and data science pipeline for real companies
A machine learning and data science pipeline for real companiesA machine learning and data science pipeline for real companies
A machine learning and data science pipeline for real companiesDataWorks Summit
 
Role of python in hpc
Role of python in hpcRole of python in hpc
Role of python in hpcDr Reeja S R
 
A Maturing Role of Workflows in the Presence of Heterogenous Computing Archit...
A Maturing Role of Workflows in the Presence of Heterogenous Computing Archit...A Maturing Role of Workflows in the Presence of Heterogenous Computing Archit...
A Maturing Role of Workflows in the Presence of Heterogenous Computing Archit...Ilkay Altintas, Ph.D.
 

Similar to A Predictive Analytics Workflow on DICOM Images using Apache Spark with Anahita Bhiwandiwalla and Karthik Vadla (20)

AI IN PATH final PPT.pptx
AI IN PATH final PPT.pptxAI IN PATH final PPT.pptx
AI IN PATH final PPT.pptx
 
Building your big data solution
Building your big data solution Building your big data solution
Building your big data solution
 
unit 1 big data.pptx
unit 1 big data.pptxunit 1 big data.pptx
unit 1 big data.pptx
 
Makine Öğrenmesi, Yapay Zeka ve Veri Bilimi Süreçlerinin Otomatikleştirilmesi...
Makine Öğrenmesi, Yapay Zeka ve Veri Bilimi Süreçlerinin Otomatikleştirilmesi...Makine Öğrenmesi, Yapay Zeka ve Veri Bilimi Süreçlerinin Otomatikleştirilmesi...
Makine Öğrenmesi, Yapay Zeka ve Veri Bilimi Süreçlerinin Otomatikleştirilmesi...
 
Im symposium presentation - OCR and Text analytics for Medical Chart Review ...
Im symposium presentation -  OCR and Text analytics for Medical Chart Review ...Im symposium presentation -  OCR and Text analytics for Medical Chart Review ...
Im symposium presentation - OCR and Text analytics for Medical Chart Review ...
 
Scaling up with Cisco Big Data: Data + Science = Data Science
Scaling up with Cisco Big Data: Data + Science = Data ScienceScaling up with Cisco Big Data: Data + Science = Data Science
Scaling up with Cisco Big Data: Data + Science = Data Science
 
Data Science and Deep Learning on Spark with 1/10th of the Code with Roope As...
Data Science and Deep Learning on Spark with 1/10th of the Code with Roope As...Data Science and Deep Learning on Spark with 1/10th of the Code with Roope As...
Data Science and Deep Learning on Spark with 1/10th of the Code with Roope As...
 
Introduction to Machine learning
Introduction to Machine learningIntroduction to Machine learning
Introduction to Machine learning
 
The Analytics Frontier of the Hadoop Eco-System
The Analytics Frontier of the Hadoop Eco-SystemThe Analytics Frontier of the Hadoop Eco-System
The Analytics Frontier of the Hadoop Eco-System
 
Analytics&IoT
Analytics&IoTAnalytics&IoT
Analytics&IoT
 
Application and Challenges of Streaming Analytics and Machine Learning on Mu...
 Application and Challenges of Streaming Analytics and Machine Learning on Mu... Application and Challenges of Streaming Analytics and Machine Learning on Mu...
Application and Challenges of Streaming Analytics and Machine Learning on Mu...
 
Shikha fdp 62_14july2017
Shikha fdp 62_14july2017Shikha fdp 62_14july2017
Shikha fdp 62_14july2017
 
Using Deep Learning on Apache Spark to Diagnose Thoracic Pathology from Chest...
Using Deep Learning on Apache Spark to Diagnose Thoracic Pathology from Chest...Using Deep Learning on Apache Spark to Diagnose Thoracic Pathology from Chest...
Using Deep Learning on Apache Spark to Diagnose Thoracic Pathology from Chest...
 
Makine Öğrenmesi ile Görüntü Tanıma | Image Recognition using Machine Learning
Makine Öğrenmesi ile Görüntü Tanıma | Image Recognition using Machine LearningMakine Öğrenmesi ile Görüntü Tanıma | Image Recognition using Machine Learning
Makine Öğrenmesi ile Görüntü Tanıma | Image Recognition using Machine Learning
 
FPGA Hardware Accelerator for Machine Learning
FPGA Hardware Accelerator for Machine Learning FPGA Hardware Accelerator for Machine Learning
FPGA Hardware Accelerator for Machine Learning
 
Using Crowdsourced Images to Create Image Recognition Models with Analytics Z...
Using Crowdsourced Images to Create Image Recognition Models with Analytics Z...Using Crowdsourced Images to Create Image Recognition Models with Analytics Z...
Using Crowdsourced Images to Create Image Recognition Models with Analytics Z...
 
Data Science.pptx NEW COURICUUMN IN DATA
Data Science.pptx NEW COURICUUMN IN DATAData Science.pptx NEW COURICUUMN IN DATA
Data Science.pptx NEW COURICUUMN IN DATA
 
A machine learning and data science pipeline for real companies
A machine learning and data science pipeline for real companiesA machine learning and data science pipeline for real companies
A machine learning and data science pipeline for real companies
 
Role of python in hpc
Role of python in hpcRole of python in hpc
Role of python in hpc
 
A Maturing Role of Workflows in the Presence of Heterogenous Computing Archit...
A Maturing Role of Workflows in the Presence of Heterogenous Computing Archit...A Maturing Role of Workflows in the Presence of Heterogenous Computing Archit...
A Maturing Role of Workflows in the Presence of Heterogenous Computing Archit...
 

More from Databricks

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDatabricks
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Databricks
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Databricks
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Databricks
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Databricks
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of HadoopDatabricks
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDatabricks
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceDatabricks
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringDatabricks
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixDatabricks
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationDatabricks
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchDatabricks
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesDatabricks
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesDatabricks
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsDatabricks
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkDatabricks
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkDatabricks
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesDatabricks
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkDatabricks
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeDatabricks
 

More from Databricks (20)

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
 

Recently uploaded

NLP Project PPT: Flipkart Product Reviews through NLP Data Science.pptx
NLP Project PPT: Flipkart Product Reviews through NLP Data Science.pptxNLP Project PPT: Flipkart Product Reviews through NLP Data Science.pptx
NLP Project PPT: Flipkart Product Reviews through NLP Data Science.pptxBoston Institute of Analytics
 
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改yuu sss
 
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSINGmarianagonzalez07
 
Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Seán Kennedy
 
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...Biometric Authentication: The Evolution, Applications, Benefits and Challenge...
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...GQ Research
 
RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.natarajan8993
 
Advanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsAdvanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsVICTOR MAESTRE RAMIREZ
 
How we prevented account sharing with MFA
How we prevented account sharing with MFAHow we prevented account sharing with MFA
How we prevented account sharing with MFAAndrei Kaleshka
 
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort servicejennyeacort
 
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...dajasot375
 
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...limedy534
 
Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Colleen Farrelly
 
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDINTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDRafezzaman
 
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...Boston Institute of Analytics
 
Semantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxSemantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxMike Bennett
 
Identifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanIdentifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanMYRABACSAFRA2
 
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样vhwb25kk
 
GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]📊 Markus Baersch
 
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Jack DiGiovanna
 

Recently uploaded (20)

Call Girls in Saket 99530🔝 56974 Escort Service
Call Girls in Saket 99530🔝 56974 Escort ServiceCall Girls in Saket 99530🔝 56974 Escort Service
Call Girls in Saket 99530🔝 56974 Escort Service
 
NLP Project PPT: Flipkart Product Reviews through NLP Data Science.pptx
NLP Project PPT: Flipkart Product Reviews through NLP Data Science.pptxNLP Project PPT: Flipkart Product Reviews through NLP Data Science.pptx
NLP Project PPT: Flipkart Product Reviews through NLP Data Science.pptx
 
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
 
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING
 
Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...
 
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...Biometric Authentication: The Evolution, Applications, Benefits and Challenge...
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...
 
RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.
 
Advanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsAdvanced Machine Learning for Business Professionals
Advanced Machine Learning for Business Professionals
 
How we prevented account sharing with MFA
How we prevented account sharing with MFAHow we prevented account sharing with MFA
How we prevented account sharing with MFA
 
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
 
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
 
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
 
Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024
 
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDINTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
 
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
 
Semantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxSemantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptx
 
Identifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanIdentifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population Mean
 
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
 
GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]
 
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
 

A Predictive Analytics Workflow on DICOM Images using Apache Spark with Anahita Bhiwandiwalla and Karthik Vadla

  • 1. Anahita Bhiwandiwalla – Intel Corporation Karthik Vadla – Intel Corporation A Predictive Analytics Workflow on DICOM Images using Apache Spark
  • 3. Agenda • Use-case overview • Challenges • What is our data? • Deriving Insights • Machine Learning • Tools • Spark-tk • Demo • spark-tensorflow-connector • Performance
  • 4. Use-case overview • Vast amount of medical image data available • Meta-data accompanying the image – Patient information – Status of condition – Method of image capture – Time to events(if part of a study) Create distributed technology needed to efficiently store, scan and analyze data in healthcare.
  • 5. Challenges • Getting data for experimentation is hard! • Combine data from multiple public sources for simulation • Multiple tools – image processing libraries, visualization tools, distributed engines, ML libraries • Compatibility across different tools How do we make Machine Learning easy to use?
  • 6. Data: DICOM - Digital Imaging and COMmunications in Medicine - International standard format to store, exchange, and transmit medical images - Standards for imaging modalities - radiography, ultrasonography, computed tomography (CT), magnetic resonance imaging (MRI), and radiation therapy. - Protocols for image exchange, image compression, 3-D visualization, image presentation etc.
  • 7. Data: Meta-data • Image related: storage, date, relevant parts of the image etc. • Patient related: age, weight, height, DOB, medical history • Device related: manufacturer information, operator, method of capture
  • 8. Deriving insights • Can we develop smart filtering techniques based on image meta-data? • Can we conclude and predict certain conditions based on patient meta-data? • Can we classify image capture techniques and suggest improvements based on device meta-data? • Can we combine image & patient meta-data to conclude and predict? • Can analytics help automate/improve image capture techniques?
  • 9. Machine Learning • Patient data – Train classifier model on patient data to predict presence of a condition • Logistic Regression, Random Forest, Naïve Bayes – Train regressor model on patient data to predict probability of a condition • Linear Regression, Random Forest – Train survival analysis model on patient data to predict time to event, compare risk of populations • Accelerate Failure Time Model, Cox Proportional Hazards Model
  • 10. Machine Learning • Image data – Compute the Eigen decomposition of the image – Compute the Covariance Matrix of the image – Compute the Principal Component Analysis of the image pixel data
  • 11. Machine Learning • Device data: – Predict trends in image capture techniques • Classification/Regression techniques – Predict device time to failure • Cox Proportional Hazards Model, Accelerated Failure Time – Recommendations to device capture methods • Collaborative filtering
  • 12. Tools • pydicom – Python package for working with DICOM files such as medical images, reports, and radiotherapy – Single threaded • dcm4che – Collection of open source applications and utilities for the healthcare enterprise. – Developed in the Java programming language for performance and portability, supporting deployment on JDK 1.6 and up • matplotlib – Python 2D plotting library which produces publication quality figures in a variety of hardcopy formats and interactive environments across platforms. – Can be used in Python scripts, Python and IPython shell, jupyter notebook, web application servers • Machine Learning libraries – Classifiers, regressors, failure analysis analysis – Eigen decomposition
  • 13. Spark-tk Introduction • Open source library enhancing the Spark experience • APIs for feature engineering, ETL, graph construction, machine learning, image processing, scoring • Abstraction level familiar to data scientists; removes complexity of parallel processing • Lower level Spark APIs seamlessly exposed • Easy to build apps plugging in other tools/libraries https://github.com/trustedanalytics/spark-tk http://trustedanalytics.github.io/spark-tk/
  • 14. Spark-tk components • Frames – Scalable data frame representation – More intuitive than low level HDFS file and Spark RDD/DataFrame/DataSet formats; schema inference – APIs to manipulate the data frames for feature engineering and exploration, such as joins and aggregations – Run user-defined transformations and filters using distributed processing – Input to our models – Easy to convert to/from Spark-tk Frames - Spark representations
  • 15. Spark-tk components • Graphs – Scalable graph representation based on Frames for vertices and edges – In house distributed algorithms for graph analytics using GraphX – Supports importing/exporting to OrientDB’s scalable graph databases – visualize, real-time graph querying – Store massive graphs
  • 16. Spark-tk components • Machine Learning & Streaming – Time series analysis – Recommender systems – Topic Modeling – Clustering – Classification/Regression – Image processing – Scikit Models – Streaming via Scoring engine
  • 17. How Spark-tk helps • Brings in support for image analytics to Spark – Support for ingesting and processing DICOM images in a distributed environment – Queries, filters, and analytics on image collections – Integrated & distributed dcm4che3 via. Spark – Tested against pydicom – Used matplotlib for visualization • Machine learning for image and health records – Created distributed Cox Survival Proportional Hazards Model to compare sets of populations – Recommend algorithm and hyper-parameters
  • 19. spark-tensorflow-connector • Developed library for loading and storing TensorFlow records with Apache Spark • Implements data import from the standard TensorFlow record format (TFRecords) into Spark SQL DataFrames, and data export from DataFrames to TensorFlow records https://github.com/tensorflow/ecosystem/tree/master/spark/spark-tensorflow- connector
  • 20. Performance 0 5 10 15 20 25 30 0.25 0.5 1 2 4 8 16 32 Data(GB) Vs. Time(Min) Processing time increases with data(Constant #Partitions=1000)
  • 21. Performance 24.5 25 25.5 26 26.5 27 27.5 28 28.5 29 29.5 300 450 675 1012 2278 3147 5125 7688 #Partitions Vs. Time(Min) Increasing partitions reduced the time and then increased time(overhead)
  • 22. Performance 0 10 20 30 40 50 60 70 80 90 100 7 11 24 35 56 #Executors Vs. Time(Min) Increasing number of executors reduces the time significantly
  • 23. Performance 0 5 10 15 20 25 30 35 1 2 3 4 5 #Executor Cores Vs. Time(Min) Increasing executor cores did not significantly change time