Visualizing and Clustering Life Science
Applications in Parallel
HiCOMB 2015 14th IEEE International Workshop on
High Performance Computational Biology at IPDPS 2015
Hyderabad, India
May 25 2015
Geoffrey Fox
gcf@indiana.edu
http://www.infomall.org
School of Informatics and Computing
Digital Science Center
Indiana University Bloomington
Deterministic Annealing
Algorithms
Some Motivation
• Big Data requires high performance – achieve with parallel
computing
• Big Data sometimes requires robust algorithms, as there is more opportunity to make mistakes
• Deterministic annealing (DA) is one of the better approaches to robust optimization and is broadly applicable
– Started as the "Elastic Net" by Durbin for the Travelling Salesman Problem (TSP)
– Tends to remove local optima
– Addresses overfitting
– Much faster than simulated annealing
• Physics systems find the true lowest energy state if you anneal, i.e. you equilibrate at each temperature as you cool
• Uses the mean field approximation, which is also used in "Variational Bayes" and "Variational Inference"
(Deterministic) Annealing
• Find the minimum at high temperature, when it is trivial
• Make small changes, avoiding local minima, as you lower the temperature
• Typically gets better answers than standard libraries such as R and Mahout
• And can be parallelized and put on GPUs etc.
General Features of DA
• In many problems, decreasing temperature is classic
multiscale – finer resolution (√T is “just” distance scale)
• In clustering √T is distance in space of points (and
centroids), for MDS scale in mapped Euclidean space
• At T = ∞, all points are in the same place – the center of the universe
• For MDS all Euclidean points are at the center and distances are zero; for clustering, there is one cluster
• As the temperature is lowered there are phase transitions in clustering, where clusters split
– The algorithm determines whether a split is needed by checking when the second derivative matrix becomes singular
• Note DA has features similar to hierarchical methods, and you do not have to specify the number of clusters; you need to specify a final distance scale
Math of Deterministic Annealing
• H(Φ) is the objective function to be minimized as a function of parameters Φ (as in the Stress formula given earlier for MDS)
• Gibbs Distribution at Temperature T:
P(Φ) = exp(−H(Φ)/T) / ∫ dΦ exp(−H(Φ)/T)
• Or P(Φ) = exp(−H(Φ)/T + F/T)
• Use the Free Energy combining Objective Function and Entropy:
F = <H − T S(P)> = ∫ dΦ {P(Φ) H(Φ) + T P(Φ) ln P(Φ)}
• Simulated annealing performs these integrals by Monte Carlo
• Deterministic annealing corresponds to doing the integrals analytically (by the mean field approximation) and is much, much faster
• Need to introduce a modified Hamiltonian for some cases so that the integrals are tractable; introduce extra parameters to be varied so that the modified Hamiltonian matches the original
• In each case the temperature is lowered slowly – say by a factor of 0.95 to 0.9999 at each iteration
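The E and M steps implied by this free energy are easy to write down for vector clustering. Below is a minimal, sequential sketch of deterministic annealing clustering for vectors, assuming squared Euclidean distances, equal cluster weights, a fixed cooling factor, and a small symmetry-breaking jitter; the class and parameter names are illustrative, not the production DAVS/DA2D code.

```java
import java.util.Arrays;
import java.util.Random;

// Minimal deterministic annealing clustering for vectors (illustrative sketch, not production code).
public class DAClustering {

    // points[i][d]; returns K cluster centers Y[k][d]
    static double[][] cluster(double[][] points, int K, double tStart, double tFinal, double cool) {
        int N = points.length, D = points[0].length;
        Random rng = new Random(1234);

        // Start every center at the overall centroid (the T = infinity solution)
        double[][] Y = new double[K][D];
        for (int k = 0; k < K; k++)
            for (int d = 0; d < D; d++)
                for (double[] p : points) Y[k][d] += p[d] / N;

        double[][] M = new double[N][K];   // <M_i(k)>: soft assignment of point i to cluster k
        for (double T = tStart; T > tFinal; T *= cool) {
            // E step: M_i(k) = exp(-(x_i - y_k)^2 / T) / Z_i
            for (int i = 0; i < N; i++) {
                double[] d2 = new double[K];
                double dMin = Double.MAX_VALUE;
                for (int k = 0; k < K; k++) { d2[k] = sqDist(points[i], Y[k]); dMin = Math.min(dMin, d2[k]); }
                double z = 0;
                // subtract the smallest distance before exponentiating for numerical stability
                for (int k = 0; k < K; k++) { M[i][k] = Math.exp(-(d2[k] - dMin) / T); z += M[i][k]; }
                for (int k = 0; k < K; k++) M[i][k] /= z;
            }
            // M step: y_k = sum_i M_i(k) x_i / C(k), with C(k) = sum_i M_i(k)
            for (int k = 0; k < K; k++) {
                double C = 0;
                double[] num = new double[D];
                for (int i = 0; i < N; i++) {
                    C += M[i][k];
                    for (int d = 0; d < D; d++) num[d] += M[i][k] * points[i][d];
                }
                // tiny jitter so coincident centers can separate at phase transitions
                // (real implementations test the second-derivative condition instead)
                for (int d = 0; d < D; d++) Y[k][d] = num[d] / C + 1e-4 * rng.nextGaussian();
            }
        }
        return Y;
    }

    static double sqDist(double[] a, double[] b) {
        double s = 0;
        for (int d = 0; d < a.length; d++) s += (a[d] - b[d]) * (a[d] - b[d]);
        return s;
    }

    public static void main(String[] args) {
        double[][] pts = {{0, 0}, {0.1, 0.2}, {5, 5}, {5.2, 4.9}};
        // cool from T = 100 to T = 0.01; the two centers separate at the phase transition
        System.out.println(Arrays.deepToString(cluster(pts, 2, 100.0, 0.01, 0.95)));
    }
}
```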
Some Uses of Deterministic Annealing DA
• Clustering improved K-means
– Vectors: Rose (Gurewitz and Fox 1990 – 486 citations encouraged me to revisit)
– Clusters with fixed sizes and no tails (Proteomics team at Broad)
– No Vectors: Hofmann and Buhmann (Just use pairwise distances)
• Dimension Reduction for visualization and analysis
– Vectors: GTM Generative Topographic Mapping
– No Vectors: SMACOF (Multidimensional Scaling, MDS) – just use pairwise distances
• Can apply to HMM & general mixture models (less studied)
– Gaussian Mixture Models
– Probabilistic Latent Semantic Analysis with Deterministic Annealing DA-PLSA
as alternative to Latent Dirichlet Allocation for finding “hidden factors”
• Have scalable parallel versions of much of above – mainly Java
• Many clustering methods – not clear what is best, although DA is pretty good and improves K-means at increased computing cost, which is not always worthwhile
• DA clearly improves MDS, which is perhaps the only "reliable" dimension reduction?
Clusters v. Regions
• In Lymphocytes, clusters are distinct; DA is useful
• In Pathology, clusters divide space into regions and
sophisticated methods like deterministic annealing are
probably unnecessary
[Figures: Lymphocytes (4D) and Pathology (54D) samples.]
Some Problems we are working on
• Analysis of Mass Spectrometry data to find peptides by
clustering peaks (Broad Institute/Hyderabad)
– ~0.5 million points in 2 dimensions (one collection) -- ~ 50,000
clusters summed over charges
• Metagenomics – 0.5 million (increasing rapidly) points NOT
in a vector space – hundreds of clusters per sample
– Apply MDS to Phylogenetic trees
• Pathology Images >50 Dimensions
• Social image analysis is in a fairly high-dimensional vector space
– 10-50 million images; 1000 features per image; million clusters
• Finding communities from network graphs coming from
Social media contacts etc.
MDS and
Clustering on
~60K EMR
Colored by sample – not clustered. Distances are from vectors in word space; 128 4-mers for ~200K sequences.
Examples of current DA
algorithms: LCMS
Background on LC-MS
• Remarks of collaborators – Broad Institute/Hyderabad
• Abundance of peaks in “label-free” LC-MS enables large-scale comparison of
peptides among groups of samples.
• In fact when a group of samples in a cohort is analyzed together, not only is it
possible to “align” robustly or cluster the corresponding peaks across samples,
but it is also possible to search for patterns or fingerprints of disease states
which may not be detectable in individual samples.
• This property of the data lends itself naturally to big data analytics for
biomarker discovery and is especially useful for population-level studies with
large cohorts, as in the case of infectious diseases and epidemics.
• With increasingly large-scale studies, the need for fast yet precise cohort-wide
clustering of large numbers of peaks assumes technical importance.
• In particular, a scalable parallel implementation of a cohort-wide peak
clustering algorithm for LC-MS-based proteomic data can prove to be a critically
important tool in clinical pipelines for responding to global epidemics of
infectious diseases like tuberculosis, influenza, etc.
Proteomics 2D DA Clustering T= 25000
with 60 Clusters (will be 30,000 at T=0.025)
The brownish triangles are "sponge" (soaks up trash) peaks outside any cluster.
The colored hexagons are peaks inside clusters, with the white hexagons being the determined cluster centers.
Fragment of 30,000 Clusters
241605 Points
Continuous Clustering
• This is a very useful subtlety introduced by Ken Rose but not widely known, although it greatly improves the algorithm
• Take a cluster k to be split into 2 with centers Y(k)A and Y(k)B, with initial values Y(k)A = Y(k)B at the original center Y(k)
• Then typically, if you make this change and perturb Y(k)A and Y(k)B, they will return to the starting position, as F is at a stable minimum (positive eigenvalue)
• But an instability (the negative eigenvalue) can develop, and one finds the split shown in the figure below
• Implement by adding an arbitrary number p(k) of centers for each cluster:
Zi = Σk=1..K p(k) exp(−εi(k)/T), and the M step gives p(k) = C(k)/N
• Halve p(k) at splits; one cannot split easily in the standard case p(k) = 1
• The weighting in sums like Zi is now equipoint rather than equicluster, as p(k) is proportional to the number of points C(k) in the cluster
[Figure: free energy F plotted against Y(k)A + Y(k)B and against Y(k)A − Y(k)B, showing the stable minimum and the instability that develops at a split.]
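A sketch of the bookkeeping that continuous clustering adds, under the same illustrative conventions as the earlier clustering sketch (εi(k) is the cost of point i in cluster k); the helper names are hypothetical, not Rose's or the talk's implementation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative continuous-clustering bookkeeping: cluster weights p(k) and splitting (not production code).
public class ContinuousClustering {

    // Z_i = sum_k p(k) exp(-eps_i(k)/T), where eps_i(k) is the cost of point i in cluster k
    static double partition(double[] epsI, List<Double> p, double T) {
        double z = 0;
        for (int k = 0; k < p.size(); k++) z += p.get(k) * Math.exp(-epsI[k] / T);
        return z;
    }

    // M step for the weights: p(k) = C(k)/N, with C(k) the soft count of cluster k
    static void updateWeights(double[] C, int N, List<Double> p) {
        for (int k = 0; k < p.size(); k++) p.set(k, C[k] / N);
    }

    // Split cluster k into two children, halving p(k) so the total weight is preserved;
    // the caller also duplicates and perturbs the corresponding center Y(k).
    static void split(int k, List<Double> p) {
        double half = p.get(k) / 2;
        p.set(k, half);
        p.add(half);
    }

    public static void main(String[] args) {
        List<Double> p = new ArrayList<>(Arrays.asList(1.0));
        split(0, p);
        System.out.println(p);                                       // [0.5, 0.5]
        System.out.println(partition(new double[]{0.1, 2.0}, p, 1.0));
    }
}
```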
Trimmed Clustering
• Clustering with position-specific constraints on variance: Applying
redescending M-estimators to label-free LC-MS data analysis (Rudolf
Frühwirth, D R Mani and Saumyadipta Pyne) BMC
Bioinformatics 2011, 12:358
• HTCC = Σk=0..K Σi=1..N Mi(k) f(i,k)
– f(i,k) = (X(i) − Y(k))² / 2σ(k)²  for k > 0
– f(i,0) = c² / 2  for k = 0
• The 0'th cluster captures (at zero temperature) all points outside clusters (background)
• Clusters are trimmed:
(X(i) − Y(k))² / 2σ(k)² < c² / 2
• Relevant when there are well-defined errors
[Figure: cluster membership cost as a function of distance from the cluster center at T ≈ 0, T = 1, and T = 5.]
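A minimal sketch of the zero-temperature assignment rule for trimmed clustering with a sponge cluster, assuming a per-cluster width σ(k) and cutoff c as in the formulas above; names are illustrative, not the Frühwirth/Mani/Pyne code.

```java
// Illustrative zero-temperature assignment for trimmed clustering with a "sponge" cluster 0.
// f(i,k) = (x_i - y_k)^2 / (2 sigma_k^2) for k > 0, and f(i,0) = c^2 / 2 for the sponge,
// so any point farther than c (in units of sigma_k) from every center falls into the sponge.
public class TrimmedAssignment {
    static int assign(double[] x, double[][] Y, double[] sigma, double c) {
        int best = 0;                     // cluster 0 is the sponge
        double bestCost = c * c / 2.0;    // constant sponge cost
        for (int k = 0; k < Y.length; k++) {
            double d2 = 0;
            for (int d = 0; d < x.length; d++) d2 += (x[d] - Y[k][d]) * (x[d] - Y[k][d]);
            double cost = d2 / (2.0 * sigma[k] * sigma[k]);
            if (cost < bestCost) { bestCost = cost; best = k + 1; }  // shift by 1: real clusters are 1..K
        }
        return best;
    }

    public static void main(String[] args) {
        double[][] Y = {{0, 0}, {10, 10}};
        double[] sigma = {1, 1};
        System.out.println(assign(new double[]{0.5, 0.5}, Y, sigma, 3));  // -> 1 (first real cluster)
        System.out.println(assign(new double[]{5, 5}, Y, sigma, 3));      // -> 0 (sponge)
    }
}
```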
Cluster Count v. Temperature for 2 Runs
[Figure: cluster count (0 to 60,000) vs. temperature (10⁻³ to 10⁶) for the DAVS(2) and DA2D runs, with annotations marking the start of the DAVS(2) sponge, the addition of the close-cluster check, and the sponge reaching its final value.]
• All start with one cluster at the far left
• T = 1 is special, as measurement errors are divided out
• DA2D counts clusters with 1 member as clusters; DAVS(2) does not
Simple Parallelism as in k-means
• Decompose points i over processors
• Equations are either pleasingly parallel "maps" over i
• Or "All-Reductions" summing over i for each cluster
• Parallel Algorithm:
– Each process holds all clusters and calculates contributions to clusters from the points in its node
– e.g. Y(k) = Σi=1..N <Mi(k)> Xi / C(k)
• Runs well in MPI or MapReduce (a sketch of this step follows below)
– See all the MapReduce k-means papers
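A minimal sketch of this "map over points, all-reduce over clusters" structure, using Java parallel streams as a stand-in for MPI or MapReduce; the partial accumulators per worker play the role of the per-node contributions that an MPI Allreduce would combine. Names are illustrative.

```java
import java.util.stream.IntStream;

// Illustrative "decompose over points, reduce per cluster" step for DA/k-means style clustering.
// Each worker accumulates partial sums of M_i(k) * X_i and C(k) = sum_i M_i(k); the partials are
// then combined, which is what MPI Allreduce (or a MapReduce reduce phase) would do across nodes.
public class ParallelCentroidStep {
    static double[][] updateCenters(double[][] X, double[][] M, int K) {
        int N = X.length, D = X[0].length;
        // map + reduce: per-thread partial accumulators combined associatively
        double[][] acc = IntStream.range(0, N).parallel()
            .collect(() -> new double[K][D + 1],                 // last column holds C(k)
                     (a, i) -> {
                         for (int k = 0; k < K; k++) {
                             for (int d = 0; d < D; d++) a[k][d] += M[i][k] * X[i][d];
                             a[k][D] += M[i][k];
                         }
                     },
                     (a, b) -> {                                  // combine partials (the "all-reduce")
                         for (int k = 0; k < K; k++)
                             for (int d = 0; d <= D; d++) a[k][d] += b[k][d];
                     });
        double[][] Y = new double[K][D];
        for (int k = 0; k < K; k++)
            for (int d = 0; d < D; d++) Y[k][d] = acc[k][d] / acc[k][D];   // Y(k) = sum M_i(k) X_i / C(k)
        return Y;
    }

    public static void main(String[] args) {
        double[][] X = {{0, 0}, {1, 1}, {10, 10}, {11, 11}};
        double[][] M = {{1, 0}, {1, 0}, {0, 1}, {0, 1}};          // hard assignments for the demo
        double[][] Y = updateCenters(X, M, 2);
        System.out.println(Y[0][0] + "," + Y[0][1] + " | " + Y[1][0] + "," + Y[1][1]);
    }
}
```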
Better Parallelism
• The previous model is correct at the start, but each point does not really contribute to every cluster, as contributions are damped exponentially by exp(−(Xi − Y(k))²/T)
• For the Proteomics problem, on average only 6.45 clusters are needed per point if we require (Xi − Y(k))²/T ≤ ~40 (as exp(−40) is small)
• So we only need to keep the nearby clusters for each point (sketched below)
• As the average number of clusters is ~20,000, this gives a factor of ~3000 improvement
• Further, communication is no longer all global; it has nearest-neighbor components and is handled by parallelism over clusters, which can be done in parallel if the clusters are separated
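A sketch of the pruning idea: for each point, keep only the clusters whose scaled distance (Xi − Y(k))²/T is below a cutoff (~40 on this slide), so the E/M steps and the communication touch a short list rather than all ~20,000 clusters. The list has to be rebuilt as T and the centers change; names are illustrative.

```java
import java.util.ArrayList;
import java.util.List;

// For each point keep only "nearby" clusters with (x_i - y_k)^2 / T <= cutoff (exp(-cutoff) is negligible).
// The E and M steps then loop over these short lists instead of all K clusters.
public class NearbyClusterLists {
    static List<List<Integer>> build(double[][] X, double[][] Y, double T, double cutoff) {
        List<List<Integer>> nearby = new ArrayList<>();
        for (double[] x : X) {
            List<Integer> list = new ArrayList<>();
            for (int k = 0; k < Y.length; k++) {
                double d2 = 0;
                for (int d = 0; d < x.length; d++) d2 += (x[d] - Y[k][d]) * (x[d] - Y[k][d]);
                if (d2 / T <= cutoff) list.add(k);   // contribution exp(-d2/T) is non-negligible
            }
            nearby.add(list);
        }
        return nearby;
    }

    public static void main(String[] args) {
        double[][] X = {{0, 0}, {100, 100}};
        double[][] Y = {{1, 1}, {99, 99}, {50, 50}};
        System.out.println(build(X, Y, 1.0, 40.0));   // each point keeps only its nearby cluster
    }
}
```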
[Figure: speedups for several runs on Tempest from 8-way through 384-way MPI parallelism with one thread per process, comparing MPI processes placed inside nodes with processes on separate nodes.]
[Figure: parallelism within a single node of the Madrid cluster. A set of runs on the 241,605-peak data on a single 16-core node, with parallelism given by either the number of threads or the number of MPI processes.]
METAGENOMICS -- SEQUENCE
CLUSTERING
Non-metric Spaces
O(N²) Algorithms – Illustrate Phase Transitions
• Start at T = ∞ with 1 cluster
• Decrease T; clusters emerge at instabilities
[Figure: 446K sequences clustered into ~100 clusters.]
METAGENOMICS -- SEQUENCE
CLUSTERING
Non-metric Spaces
O(N²) Algorithms – Compare Other Methods
“Divergent” Data
Sample
23 True Clusters
29
CDhit
UClust
Divergent Data Set: DA-PWC vs. UClust (cuts 0.65 / 0.75 / 0.85 / 0.95)
Total # of clusters: DA-PWC 23; UClust 4 / 10 / 36 / 91
Total # of clusters uniquely identified (one original cluster goes to 1 UClust cluster): DA-PWC 23; UClust 0 / 0 / 13 / 16
Total # of shared clusters with significant sharing (one UClust cluster goes to >1 real cluster): DA-PWC 0; UClust 4 / 10 / 5 / 0
Total # of UClust clusters that are just part of a real cluster (numbers in brackets have only one member): DA-PWC 0; UClust 4 / 10 / 17 (11) / 72 (62)
Total # of real clusters that are 1 UClust cluster, but the UClust cluster is spread over multiple real clusters: DA-PWC 0; UClust 14 / 9 / 5 / 0
Total # of real clusters that have significant contribution from >1 UClust cluster: DA-PWC 0; UClust 9 / 14 / 5 / 7
PROTEOMICS
No clear clusters
Protein Universe Browser for COG Sequences with a
few illustrative biologically identified clusters
Heatmap of biology distance (Needleman-
Wunsch) vs 3D Euclidean Distances
If d is a distance, so is f(d) for any monotonic f. Optimize the choice of f.
O(N²) ALGORITHMS?
Algorithm Challenges
• See NRC Massive Data Analysis report
• O(N) algorithms for O(N²) problems
• Parallelizing Stochastic Gradient Descent
• Streaming data algorithms – balance and interplay between
batch methods (most time consuming) and interpolative
streaming methods
• Graph algorithms – need shared memory?
• Machine Learning Community uses parameter servers;
Parallel Computing (MPI) would not recommend this?
– Is classic distributed model for “parameter service” better?
• Apply best of parallel computing – communication and load
balancing – to Giraph/Hadoop/Spark
• Are data analytics sparse? Many cases are full matrices
• BTW we need Java Grande – some C++, but Java is most popular in ABDS, with Python, Erlang, Go, Scala (compiles to JVM), ...
O(N²) interactions between the green and purple clusters should be representable by centroids, as in Barnes-Hut. This is hard, as there is no Gauss theorem and no multipole expansion, and the points really live in a 1000-dimensional space, since they were clustered before the 3D projection.
O(N²) green-green and purple-purple interactions have value, but the green-purple ones are "wasted".
"Clean" sample of 446K
Use the Barnes-Hut OctTree, originally developed to make O(N²) astrophysics O(N log N), to give similar speedups in machine learning
OctTree for 100K
sample of Fungi
We use OctTree for logarithmic
interpolation (streaming data)
Fungi Analysis
• Multiple Species from multiple places
• Several sources of sequences starting with 446K and
eventually boiled down to ~10K curated sequences with 61
species
• Original sample – clustering and MDS
• Final sample – MDS and other clustering methods
• Note MSA and SWG give similar results
• Some species are clearly split
• Some species are diffuse, others compact, making a fixed distance cut unreliable
– Easy for humans!
• MDS is very clear on structure and clustering artifacts
• Why not do "high-value" clustering as an interactive iteration driven by MDS?
Fungi -- 4 Classic Clustering Methods
Same Species
Same Species Different Locations
Parallel Data Mining
Parallel Data Analytics
• Streaming algorithms have interesting differences but
• “Batch” Data analytics is “just classic parallel computing” with
usual features such as SPMD and BSP
• Expect similar systematics to simulations where
• Static Regular problems are straightforward but
• Dynamic Irregular Problems are technically hard and high level
approaches fail (see High Performance Fortran HPF)
– Regular meshes worked well but
– Adaptive dynamic meshes did not, although "real people with MPI" could parallelize them
• However using libraries is successful at either
– Lowest: communication level
– Higher: “core analytics” level
• Data analytics does not yet have “good regular parallel libraries”
– Graph analytics has most attention
Remarks on Parallelism I
• Maximum Likelihood or χ² both lead to objective functions like
• Minimize Σitems i=1..N (positive nonlinear function of unknown parameters for item i)
• Typically decompose items i and parallelize over both i and
parameters to be determined
• Solve iteratively with (clever) first or second order
approximation to shift in objective function
– Sometimes steepest descent direction; sometimes Newton
– Have classic Expectation Maximization structure
– Steepest descent shift is sum over shift calculated from each point
• Classic method – take all (millions) of items in data set and
move full distance
– Stochastic Gradient Descent SGD – take randomly a few hundred of
items in data set and calculate shifts over these and move a tiny
distance
– SGD cannot parallelize over items
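A toy contrast between the two update styles, with fi(θ) = (θ − xi)² standing in for the "positive nonlinear function of unknown parameters for item i"; the full-batch shift sums over all items (and so parallelizes over them), while each SGD step uses a small random batch and a tiny step. Illustrative only.

```java
import java.util.Random;

// Illustrative contrast between a full-batch gradient step and a stochastic (mini-batch) step
// for an objective of the form sum_i f_i(theta).  Here f_i(theta) = (theta - x_i)^2 as a toy case.
public class BatchVsSGD {
    static double fullBatchStep(double theta, double[] x, double step) {
        double grad = 0;
        for (double xi : x) grad += 2 * (theta - xi);    // shift accumulated over ALL items (then averaged)
        return theta - step * grad / x.length;
    }

    static double sgdStep(double theta, double[] x, int miniBatch, double tinyStep, Random rng) {
        double grad = 0;
        for (int b = 0; b < miniBatch; b++) {            // shift from a few randomly chosen items
            double xi = x[rng.nextInt(x.length)];
            grad += 2 * (theta - xi);
        }
        return theta - tinyStep * grad / miniBatch;      // and a much smaller step
    }

    public static void main(String[] args) {
        Random rng = new Random(7);
        double[] x = new double[100000];
        for (int i = 0; i < x.length; i++) x[i] = 3.0 + rng.nextGaussian();
        double thetaBatch = 0, thetaSgd = 0;
        for (int it = 0; it < 100; it++) thetaBatch = fullBatchStep(thetaBatch, x, 0.4);
        for (int it = 0; it < 5000; it++) thetaSgd = sgdStep(thetaSgd, x, 256, 0.05, rng);
        System.out.println(thetaBatch + " " + thetaSgd);  // both approach the mean (~3.0)
    }
}
```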
Remarks on Parallelism II
• Need to cover non-vector semimetric and vector spaces for clustering and dimension reduction (N points in space)
• Semimetric spaces just have pairwise distances δ(i, j) defined between points in the space
• MDS minimizes Stress and illustrates this (a sketch of evaluating it follows below):
σ(X) = Σi<j weight(i,j) (δ(i, j) − d(Xi, Xj))²
• Vector spaces have Euclidean distances and scalar products
– Algorithms can be O(N), and these are best for clustering, but for MDS O(N) methods may not be best, as the obvious objective function is O(N²)
– Important new algorithms are needed to define O(N) versions of the current O(N²) ones – they "must" work intuitively and be shown in principle
• Note matrix solvers often use conjugate gradient, which converges in 5-100 iterations – a big gain for a matrix with a million rows; this removes a factor of N in time complexity
• The ratio of #clusters to #points is important; new clustering ideas are needed if the ratio is >~ 0.1
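A minimal sketch of evaluating this stress; the O(N²) double loop over pairs is exactly what makes O(N) shortcuts attractive. Names are illustrative.

```java
// Illustrative evaluation of the MDS stress
//   sigma(X) = sum_{i<j} weight(i,j) * (delta(i,j) - d(X_i, X_j))^2
// where delta(i,j) are the given pairwise (semimetric) distances and d is Euclidean in the mapped space.
public class MdsStress {
    static double stress(double[][] delta, double[][] weight, double[][] X) {
        int N = delta.length;
        double s = 0;
        for (int i = 0; i < N; i++) {
            for (int j = i + 1; j < N; j++) {
                double d2 = 0;
                for (int d = 0; d < X[i].length; d++) d2 += (X[i][d] - X[j][d]) * (X[i][d] - X[j][d]);
                double diff = delta[i][j] - Math.sqrt(d2);
                s += weight[i][j] * diff * diff;
            }
        }
        return s;
    }

    public static void main(String[] args) {
        double[][] delta = {{0, 1, 2}, {1, 0, 1}, {2, 1, 0}};
        double[][] w = {{0, 1, 1}, {1, 0, 1}, {1, 1, 0}};
        double[][] X = {{0, 0}, {1, 0}, {2, 0}};               // a perfect 1D embedding: stress = 0
        System.out.println(stress(delta, w, X));
    }
}
```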
Problem Structure
• Note learning networks have a huge number of parameters (11 billion in the Stanford work), so it is inconceivable to look at the second derivative
• Clustering and MDS have lots of parameters but can be practical to
look at second derivative and use Newton’s method to minimize
• Parameters are determined in distributed fashion but are typically
needed globally
– MPI use broadcast and “AllCollectives” implying Map-Collective is a useful
programming model
– AI community: use parameter server and access as needed. Non-optimal?
[Diagram: six problem architectures – (1) Map Only (input → map → output); (2) Classic MapReduce (input → map → reduce); (3) Iterative MapReduce or Map-Collective (iterations over map and reduce); (4) Point-to-Point or Map-Communication (local graph); (5) Map-Streaming (events → brokers → maps); (6) Shared Memory (Map & Communicate in shared memory).]
MDS in more detail
WDA-SMACOF “Best” MDS
• Semimetric spaces just have pairwise distances δ(i, j) defined between points in the space
• MDS minimizes Stress with pairwise distances δ(i, j):
σ(X) = Σi<j weight(i,j) (δ(i, j) − d(Xi, Xj))²
• SMACOF is a clever Expectation Maximization method that chooses a good steepest descent
• Improved by Deterministic Annealing reducing the distance scale; DA does not impact compute time much and gives DA-SMACOF
• Classic SMACOF is O(N²) for uniform weights and O(N³) for nontrivial weights, but nonuniform weights come from
– The preferred Sammon method, weight(i,j) = 1/δ(i, j), or
– Missing distances put in as weight(i,j) = 0
• Use conjugate gradient (sketched below) – converges in 5-100 iterations – a big gain for a matrix with a million rows. This removes the factor of N in time complexity and gives WDA-SMACOF
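The conjugate gradient kernel referred to here is the standard iteration for a symmetric positive-definite system Ax = b, sketched below with a dense matrix for clarity; the real WDA-SMACOF solver works with the weighted SMACOF matrices rather than an explicit dense A.

```java
// Standard conjugate gradient for a symmetric positive-definite system A x = b,
// the kernel that lets WDA-SMACOF avoid an O(N^3) direct solve (dense A here for clarity).
public class ConjugateGradient {
    static double[] solve(double[][] A, double[] b, int maxIter, double tol) {
        int n = b.length;
        double[] x = new double[n];
        double[] r = b.clone();                 // r = b - A x, with x = 0
        double[] p = r.clone();
        double rsOld = dot(r, r);
        for (int it = 0; it < maxIter && Math.sqrt(rsOld) > tol; it++) {
            double[] Ap = multiply(A, p);
            double alpha = rsOld / dot(p, Ap);
            for (int i = 0; i < n; i++) { x[i] += alpha * p[i]; r[i] -= alpha * Ap[i]; }
            double rsNew = dot(r, r);
            for (int i = 0; i < n; i++) p[i] = r[i] + (rsNew / rsOld) * p[i];
            rsOld = rsNew;
        }
        return x;
    }

    static double dot(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) s += a[i] * b[i];
        return s;
    }

    static double[] multiply(double[][] A, double[] v) {
        double[] y = new double[v.length];
        for (int i = 0; i < A.length; i++)
            for (int j = 0; j < v.length; j++) y[i] += A[i][j] * v[j];
        return y;
    }

    public static void main(String[] args) {
        double[][] A = {{4, 1}, {1, 3}};
        double[] b = {1, 2};
        double[] x = solve(A, b, 100, 1e-10);
        System.out.println(x[0] + " " + x[1]);   // ~0.0909, ~0.6364
    }
}
```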
Timing of WDA SMACOF
• 20k to 100k AM Fungal sequences on 600 cores
[Figure: time cost (seconds, log scale) vs. data size (20k to 100k) comparing WDA-SMACOF with equal weights against Sammon's mapping.]
WDA-SMACOF Timing
• Input Data: 100k to 400k AM Fungal sequences
• Environment: 32 nodes (1024 cores) to 128 nodes (4096 cores) on
BigRed2.
• Using the Harp plug-in for Hadoop (MPI performance)
[Figures: (left) time cost (seconds) of WDA-SMACOF vs. data size (100k to 400k) at 512-, 1024-, 2048-, and 4096-way parallelism; (right) parallel efficiency of WDA-SMACOF (Harp) vs. number of processors (512 to 4096).]
Spherical Phylogram
• Take a set of sequences mapped to nD with MDS (WDA-
SMACOF) (n=3 or ~20)
– N=20 captures ~all features of dataset?
• Consider a phylogenetic tree and use neighbor joining
formulae to calculate distances of nodes to sequences (or
later other nodes) starting at bottom of tree
• Do a new MDS fixing mapping of sequences noting that
sequences + nodes have defined distances
• Use RAxML or Neighbor Joining (N=20?) to find tree
• Random note: do not need Multiple Sequence Alignment;
pairwise tools are easier to use and give reliably good results
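A sketch of the standard neighbor-joining distance reduction that gives internal tree nodes distances to the sequences (and later to other nodes), which the fixed-sequence MDS step can then consume; this is the textbook formula, not necessarily the exact variant used in the talk.

```java
// Standard neighbor-joining distance reduction: when leaves i and j are joined by a new
// internal node u, the distance from u to any other node k is
//   d(u,k) = ( d(i,k) + d(j,k) - d(i,j) ) / 2
// Applying this bottom-up gives every internal tree node a distance to the sequences,
// which is what the "fix the sequences, map the nodes" MDS step needs.
public class NeighborJoiningDistance {
    static double toNewNode(double dik, double djk, double dij) {
        return 0.5 * (dik + djk - dij);
    }

    public static void main(String[] args) {
        // toy example: i and j are close to each other, k is far from both
        System.out.println(toNewNode(10.0, 10.0, 2.0));   // 9.0: the joined node sits "between" i and j
    }
}
```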
RAxML result visualized in FigTree.
Spherical Phylogram visualized in PlotViz
for MSA or SWG distances
Spherical Phylograms
[Figures: spherical phylograms for MSA and SWG distances.]
Quality of 3D Phylogenetic Tree
• EM-SMACOF is basic SMACOF
• LMA was the previous best method, using a Levenberg-Marquardt nonlinear χ² solver
• WDA-SMACOF finds best result
• 3 different distance measures
Sum of branch lengths of the Spherical Phylogram generated in 3D space
on two datasets
[Figures: sum of branch lengths on the 599nts and 999nts datasets for MSA, SWG, and NW distances, comparing WDA-SMACOF, LMA, and EM-SMACOF.]
Summary
• Always run MDS. Gives insight into data and performance
of machine learning
– Leads to a data browser as GIS gives for spatial data
– 3D better than 2D
– ~20D better than MSA?
• Clustering Observations
– Do you care about quality, or are you just cutting up space into parts?
– Deterministic annealing always makes clustering more robust
– Continuous clustering enables hierarchy
– Trimmed clustering cuts off tails
– Distinct O(N) and O(N²) algorithms
• Use Conjugate Gradient
Java Grande
• We once tried to encourage use of Java in HPC with Java Grande
Forum but Fortran, C and C++ remain central HPC languages.
– Not helped by .com and Sun collapse in 2000-2005
• The pure Java CartaBlanca, a 2005 R&D100 award-winning
project, was an early successful example of HPC use of Java in a
simulation tool for non-linear physics on unstructured grids.
• Of course Java is a major language in ABDS and, as data analysis and simulation are naturally linked, we should consider broader use of Java
• Using Habanero Java (from Rice University) for Threads and
mpiJava or FastMPJ for MPI, gathering collection of high
performance parallel Java analytics
– Converted from C#; sequential Java is faster than sequential C#
• So we will have either Hadoop+Harp or classic Threads/MPI versions in a Java Grande version of Mahout
Performance of MPI Kernel Operations
[Figures: average time (µs) vs. message size (bytes) for MPI send/receive and MPI allreduce operations, comparing MPI.NET C# on Tempest, FastMPJ Java on FG, OMPI-nightly Java on FG, OMPI-trunk Java on FG, and OMPI-trunk C on FG, plus OMPI-trunk C and Java on Madrid, over Infiniband and Ethernet.]
Pure Java as in FastMPJ is slower than Java interfacing to the C version of MPI.
Java Grande and C# on 40K point DAPWC Clustering
[Figure: total run time for Java and C# at 64-, 128-, and 256-way parallelism on TXP nodes; results are very sensitive to the threads vs. MPI choice. The C# hardware has about 0.7 of the performance of the Java hardware.]
Java and C# on 12.6K point DAPWC Clustering
[Figure: run time (hours) for Java and C# vs. #threads x #processes per node (1x1, 1x2, 2x1, 1x4, 2x2, 4x1, 1x8, 2x4, 4x2, 8x1) and number of nodes / total parallelism. The C# hardware has about 0.7 of the performance of the Java hardware.]
Data Analytics in SPIDAL
Analytics and the DIKW Pipeline
• Data goes through a pipeline
Raw data → Data → Information → Knowledge → Wisdom → Decisions
• Each link enabled by a filter which is “business logic” or “analytics”
• We are interested in filters that involve “sophisticated analytics”
which require non trivial parallel algorithms
– Improve state of art in both algorithm quality and (parallel) performance
• Design and Build SPIDAL (Scalable Parallel Interoperable Data
Analytics Library)
[Diagram: Data → (Analytics) → Information → (More Analytics) → Knowledge.]
Strategy to Build SPIDAL
• Analyze Big Data applications to identify analytics
needed and generate benchmark applications
• Analyze existing analytics libraries (in practice limit to
some application domains) – catalog library members
available and performance
– Mahout low performance, R largely sequential and missing
key algorithms, MLlib just starting
• Identify big data computer architectures
• Identify software model to allow interoperability and
performance
• Design or identify new or existing algorithm including
parallel implementation
• Collaborate with application scientists and the computer systems and statistics/algorithms communities
Machine Learning in Network Science, Imaging in Computer Vision, Pathology, Polar Science, Biomolecular Simulations

Algorithm | Applications | Features | Status | Parallelism
Graph Analytics
Community detection | Social networks, webgraph | Graph | P-DM | GML-GrC
Subgraph/motif finding | Webgraph, biological/social networks | Graph | P-DM | GML-GrB
Finding diameter | Social networks, webgraph | Graph | P-DM | GML-GrB
Clustering coefficient | Social networks | Graph | P-DM | GML-GrC
Page rank | Webgraph | Graph | P-DM | GML-GrC
Maximal cliques | Social networks, webgraph | Graph | P-DM | GML-GrB
Connected component | Social networks, webgraph | Graph | P-DM | GML-GrB
Betweenness centrality | Social networks | Graph, non-metric, static | P-Shm | GML-GrA
Shortest path | Social networks, webgraph | Graph, non-metric, static | P-Shm | GML-GrA
Spatial Queries and Analytics
Spatial relationship based queries | GIS/social networks/pathology informatics | Geometric | P-DM | PP
Distance based queries | GIS/social networks/pathology informatics | Geometric | P-DM | PP
Spatial clustering | GIS/social networks/pathology informatics | Geometric | Seq | GML
Spatial modeling | GIS/social networks/pathology informatics | Geometric | Seq | PP

GML Global (parallel) ML; GrA Static; GrB Runtime partitioning
Some specialized data analytics in SPIDAL

Algorithm | Applications | Features | Status | Parallelism
Core Image Processing
Image preprocessing | Computer vision/pathology informatics | Metric space point sets, neighborhood sets & image features | P-DM | PP
Object detection & segmentation | Computer vision/pathology informatics | Metric space point sets, neighborhood sets & image features | P-DM | PP
Image/object feature computation | Computer vision/pathology informatics | Metric space point sets, neighborhood sets & image features | P-DM | PP
3D image registration | Computer vision/pathology informatics | Metric space point sets, neighborhood sets & image features | Seq | PP
Object matching | Computer vision/pathology informatics | Geometric | Todo | PP
3D feature extraction | Computer vision/pathology informatics | Geometric | Todo | PP
Deep Learning
Learning Network, Stochastic Gradient Descent | Image understanding, language translation, voice recognition, car driving | Connections in artificial neural net | P-DM | GML

PP Pleasingly Parallel (Local ML); Seq Sequential available; GRA Good distributed algorithm needed; Todo No prototype available; P-DM Distributed memory available; P-Shm Shared memory available
Some Core Machine Learning Building Blocks

Algorithm | Applications | Features | Status | //ism
DA Vector Clustering | Accurate Clusters | Vectors | P-DM | GML
DA Non-metric Clustering | Accurate Clusters, Biology, Web | Non-metric, O(N²) | P-DM | GML
K-means: Basic, Fuzzy and Elkan | Fast Clustering | Vectors | P-DM | GML
Levenberg-Marquardt Optimization | Non-linear Gauss-Newton, use in MDS | Least Squares | P-DM | GML
SMACOF Dimension Reduction | DA-MDS with general weights | Least Squares, O(N²) | P-DM | GML
Vector Dimension Reduction | DA-GTM and Others | Vectors | P-DM | GML
TFIDF Search | Find nearest neighbors in document corpus | Bag of "words" (image features) | P-DM | PP
All-pairs similarity search | Find pairs of documents with TFIDF distance below a threshold | Bag of "words" | Todo | GML
Support Vector Machine SVM | Learn and Classify | Vectors | Seq | GML
Random Forest | Learn and Classify | Vectors | P-DM | PP
Gibbs sampling (MCMC) | Solve global inference problems | Graph | Todo | GML
Latent Dirichlet Allocation LDA with Gibbs sampling or Var. Bayes | Topic models (latent factors) | Bag of "words" | P-DM | GML
Singular Value Decomposition SVD | Dimension Reduction and PCA | Vectors | Seq | GML
Hidden Markov Models (HMM) | Global inference on sequence models | Vectors | Seq | PP & GML
Some Futures
• Always run MDS. Gives insight into data
– Leads to a data browser as GIS gives for spatial data
• Claim is algorithm change gave as much performance
increase as hardware change in simulations. Will this
happen in analytics?
– Today is like parallel computing 30 years ago with regular meshes. We will learn how to adapt methods automatically to give "multigrid" and "fast multipole" like algorithms
• Need to start developing the libraries that support Big Data
– Understand architecture issues
– Have coupled batch and streaming versions
– Develop much better algorithms
• Please join the SPIDAL (Scalable Parallel Interoperable Data Analytics Library) community
React JS; all concepts. Contains React Features, JSX, functional & Class comp...
 
Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdf
 
Decarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityDecarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a reality
 
Bridging Between CAD & GIS: 6 Ways to Automate Your Data Integration
Bridging Between CAD & GIS:  6 Ways to Automate Your Data IntegrationBridging Between CAD & GIS:  6 Ways to Automate Your Data Integration
Bridging Between CAD & GIS: 6 Ways to Automate Your Data Integration
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 
A Framework for Development in the AI Age
A Framework for Development in the AI AgeA Framework for Development in the AI Age
A Framework for Development in the AI Age
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptx
 
Scale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterScale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL Router
 
React Native vs Ionic - The Best Mobile App Framework
React Native vs Ionic - The Best Mobile App FrameworkReact Native vs Ionic - The Best Mobile App Framework
React Native vs Ionic - The Best Mobile App Framework
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
 
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
 
A Glance At The Java Performance Toolbox
A Glance At The Java Performance ToolboxA Glance At The Java Performance Toolbox
A Glance At The Java Performance Toolbox
 
Zeshan Sattar- Assessing the skill requirements and industry expectations for...
Zeshan Sattar- Assessing the skill requirements and industry expectations for...Zeshan Sattar- Assessing the skill requirements and industry expectations for...
Zeshan Sattar- Assessing the skill requirements and industry expectations for...
 
Top 10 Hubspot Development Companies in 2024
Top 10 Hubspot Development Companies in 2024Top 10 Hubspot Development Companies in 2024
Top 10 Hubspot Development Companies in 2024
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
 
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
 
Abdul Kader Baba- Managing Cybersecurity Risks and Compliance Requirements i...
Abdul Kader Baba- Managing Cybersecurity Risks  and Compliance Requirements i...Abdul Kader Baba- Managing Cybersecurity Risks  and Compliance Requirements i...
Abdul Kader Baba- Managing Cybersecurity Risks and Compliance Requirements i...
 
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotesMuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
 

Visualizing and Clustering Life Science Applications in Parallel 

• 7. Some Uses of Deterministic Annealing (DA)
  • Clustering – improved K-means
    – Vectors: Rose (Gurewitz and Fox 1990 – 486 citations encouraged me to revisit)
    – Clusters with fixed sizes and no tails (Proteomics team at Broad)
    – No vectors: Hofmann and Buhmann (just use pairwise distances)
  • Dimension reduction for visualization and analysis
    – Vectors: GTM (Generative Topographic Mapping)
    – No vectors: SMACOF (Multidimensional Scaling, MDS) – just use pairwise distances
  • Can apply to HMM and general mixture models (less studied)
    – Gaussian Mixture Models
    – Probabilistic Latent Semantic Analysis with Deterministic Annealing (DA-PLSA) as an alternative to Latent Dirichlet Allocation for finding "hidden factors"
  • Have scalable parallel versions of much of the above – mainly Java
  • Many clustering methods – not clear which is best, although DA is pretty good and improves K-means at increased computing cost, which is not always worthwhile
  • DA clearly improves MDS, which is arguably the only "reliable" dimension reduction?
• 8. Clusters v. Regions
  • In Lymphocytes (4D data) clusters are distinct; DA is useful
  • In Pathology (54D data), clusters divide space into regions and sophisticated methods like deterministic annealing are probably unnecessary
• 9. Some Problems we are working on
  • Analysis of mass spectrometry data to find peptides by clustering peaks (Broad Institute/Hyderabad)
    – ~0.5 million points in 2 dimensions (one collection); ~50,000 clusters summed over charges
  • Metagenomics – 0.5 million (increasing rapidly) points NOT in a vector space; hundreds of clusters per sample
    – Apply MDS to phylogenetic trees
  • Pathology images: >50 dimensions
  • Social image analysis is in a fairly high-dimensional vector space
    – 10–50 million images; 1000 features per image; a million clusters
  • Finding communities from network graphs coming from social media contacts etc.
• 11. [Figure] Colored by sample – not clustered. Distances from vectors in word space. This uses 128 4-mers for ~200K sequences.
• 12. Examples of current DA algorithms: LC-MS
• 13. Background on LC-MS
  • Remarks of collaborators – Broad Institute/Hyderabad
  • Abundance of peaks in "label-free" LC-MS enables large-scale comparison of peptides among groups of samples.
  • In fact, when a group of samples in a cohort is analyzed together, not only is it possible to "align" robustly or cluster the corresponding peaks across samples, but it is also possible to search for patterns or fingerprints of disease states which may not be detectable in individual samples.
  • This property of the data lends itself naturally to big data analytics for biomarker discovery and is especially useful for population-level studies with large cohorts, as in the case of infectious diseases and epidemics.
  • With increasingly large-scale studies, the need for fast yet precise cohort-wide clustering of large numbers of peaks assumes technical importance.
  • In particular, a scalable parallel implementation of a cohort-wide peak clustering algorithm for LC-MS-based proteomic data can prove to be a critically important tool in clinical pipelines for responding to global epidemics of infectious diseases like tuberculosis, influenza, etc.
• 14. [Figure] Proteomics 2D DA clustering at T = 25,000 with 60 clusters (will be 30,000 clusters at T = 0.025)
• 15. [Figure] Fragment of 30,000 clusters, 241,605 points. The brownish triangles are "sponge" peaks (the sponge soaks up trash) outside any cluster. The colored hexagons are peaks inside clusters, with the white hexagons being the determined cluster centers.
• 16. Continuous Clustering
  • This is a very useful subtlety introduced by Ken Rose; not widely known, although it greatly improves the algorithm
  • Take a cluster k to be split into 2 with centers Y(k)A and Y(k)B, with initial values Y(k)A = Y(k)B at the original center Y(k)
  • Then typically, if you make this change and perturb Y(k)A and Y(k)B, they will return to the starting position, as F is at a stable minimum (positive eigenvalue)
  • But instability (a negative eigenvalue) can develop, and one finds the behavior sketched in the figure
  • Implement by adding an arbitrary number p(k) of centers for each cluster: Z_i = Σ_{k=1}^{K} p(k) exp(−ε_i(k)/T), and the M step gives p(k) = C(k)/N
  • Halve p(k) at splits; one cannot split easily in the standard case p(k) = 1
  • The weighting in sums like Z_i is now equipoint, not equicluster, as p(k) is proportional to the number of points C(k) in the cluster
  [Figure: Free Energy F plotted against Y(k)A + Y(k)B and against Y(k)A − Y(k)B, showing the instability that signals a split]
• 17. Trimmed Clustering
  • Clustering with position-specific constraints on variance: Applying redescending M-estimators to label-free LC-MS data analysis (Rudolf Frühwirth, D. R. Mani and Saumyadipta Pyne), BMC Bioinformatics 2011, 12:358
  • H_TCC = Σ_{k=0}^{K} Σ_{i=1}^{N} M_i(k) f(i,k)
    – f(i,k) = (X(i) − Y(k))² / 2σ(k)²  for k > 0
    – f(i,0) = c² / 2  for k = 0
  • The 0'th cluster captures (at zero temperature) all points outside clusters (background)
  • Clusters are trimmed: (X(i) − Y(k))² / 2σ(k)² < c² / 2  (see the sketch below)
  • Relevant when there are well defined errors
  [Figure: distance from cluster center for T ~ 0, T = 1, T = 5]
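To make the trimming rule concrete, here is a minimal Java sketch (not the authors' code) of the zero-temperature assignment implied by the formulas above: a point joins its best cluster only if its scaled squared distance beats the sponge cost c²/2, otherwise it stays in the k = 0 sponge cluster. The names (assignTrimmed, sigma, cutoffC) are hypothetical.

```java
// Zero-temperature trimmed assignment: a point joins its nearest cluster only if
// (x - y_k)^2 / (2 sigma_k^2) < c^2 / 2, otherwise it falls into the k = 0
// "sponge" (background) cluster.
static int assignTrimmed(double[] x, double[][] centers, double[] sigma, double cutoffC) {
    int best = 0;                                    // index 0 is the sponge cluster
    double bestCost = cutoffC * cutoffC / 2.0;       // f(i,0) = c^2 / 2
    for (int k = 0; k < centers.length; k++) {
        double d2 = 0.0;
        for (int d = 0; d < x.length; d++) {
            double diff = x[d] - centers[k][d];
            d2 += diff * diff;
        }
        double cost = d2 / (2.0 * sigma[k] * sigma[k]);   // f(i,k) for k > 0
        if (cost < bestCost) { bestCost = cost; best = k + 1; }  // real clusters are 1..K
    }
    return best;   // 0 means "sponge"
}
```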
• 18. Cluster Count v. Temperature for 2 Runs
  [Figure: cluster count (0–60,000) v. temperature (1.00E+06 down to 1.00E-03, log scale) for DAVS(2) and DA2D; annotations mark where the sponge is started and checked, where close clusters are added, and where DAVS(2) reaches its final value]
  • All runs start with one cluster at the far left
  • T = 1 is special, as measurement errors are divided out
  • DA2D counts clusters with 1 member as clusters; DAVS(2) does not
• 19. Simple Parallelism as in k-means
  • Decompose points i over processors
  • Equations are either pleasingly parallel "maps" over i
  • Or "all-reductions" summing over i for each cluster
  • Parallel algorithm (see the sketch below):
    – Each process holds all clusters and calculates contributions to clusters from the points in its node
    – e.g. Y(k) = Σ_{i=1}^{N} ⟨M_i(k)⟩ X_i / C(k)
  • Runs well in MPI or MapReduce
    – See all the MapReduce k-means papers
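A minimal Java sketch of the "local sums plus reduction" pattern described above, assuming vector data and soft assignments ⟨M_i(k)⟩ already computed. In an MPI run the partial sums would be combined with an Allreduce, in MapReduce by the shuffle; the names here are illustrative, not from any of the codes mentioned in the slides.

```java
// "Local contributions" pattern for the M step above: each worker accumulates
// sum_i <M_i(k)> X_i and C(k) = sum_i <M_i(k)> for its own points; a reduction
// combines the partial sums, and then Y(k) = numerator(k) / C(k).
static double[][] updateCenters(double[][] points, double[][] prob /* <M_i(k)> */, int K) {
    int dim = points[0].length;
    double[][] numer = new double[K][dim];
    double[] count = new double[K];
    for (int i = 0; i < points.length; i++) {        // "map" over this worker's points
        for (int k = 0; k < K; k++) {
            count[k] += prob[i][k];
            for (int d = 0; d < dim; d++) numer[k][d] += prob[i][k] * points[i][d];
        }
    }
    // ... in a multi-process run, an Allreduce over numer and count would go here ...
    double[][] centers = new double[K][dim];
    for (int k = 0; k < K; k++)
        for (int d = 0; d < dim; d++) centers[k][d] = numer[k][d] / Math.max(count[k], 1e-300);
    return centers;
}
```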
• 20. Better Parallelism
  • The previous model is correct at the start, but each point does not really contribute to each cluster, as contributions are damped exponentially by exp(−(X_i − Y(k))²/T)
  • For the Proteomics problem, on average only 6.45 clusters are needed per point if we require (X_i − Y(k))²/T ≤ ~40 (as exp(−40) is small)
  • So we only need to keep nearby clusters for each point (see the sketch below)
  • As the average number of clusters is ~20,000, this gives a factor of ~3000 improvement
  • Further, communication is no longer all global; it has nearest-neighbor components and is calculated by parallelism over clusters, which can be done in parallel if they are separated
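A hedged sketch of the pruning idea: build, once per point, the list of clusters whose exponential weight is non-negligible and use only those in the E and M steps. The cutoff of ~40 is the value quoted in the slide; everything else (names, data layout) is illustrative.

```java
import java.util.ArrayList;
import java.util.List;

// Keep, for each point, only the clusters whose weight exp(-(x - y_k)^2 / T)
// is non-negligible, i.e. (x - y_k)^2 / T <= cutoff (~40 in the slides).
static List<List<Integer>> nearbyClusters(double[][] points, double[][] centers,
                                          double T, double cutoff) {
    List<List<Integer>> nearby = new ArrayList<>();
    for (double[] x : points) {
        List<Integer> keep = new ArrayList<>();
        for (int k = 0; k < centers.length; k++) {
            double d2 = 0.0;
            for (int d = 0; d < x.length; d++) {
                double diff = x[d] - centers[k][d];
                d2 += diff * diff;
            }
            if (d2 / T <= cutoff) keep.add(k);   // otherwise exp(-d2/T) is effectively zero
        }
        nearby.add(keep);
    }
    return nearby;
}
```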
• 21. [Figure] Speedups for several runs on Tempest from 8-way through 384-way MPI parallelism with one thread per process. We look at different choices for MPI processes, which are either inside nodes or on separate nodes.
• 22. [Figure] Parallelism within a single node of the Madrid cluster. A set of runs on the 241,605-peak data on a single node with 16 cores, with either threads or MPI giving the parallelism; the horizontal axis is the number of threads or the number of MPI processes.
• 23. METAGENOMICS – SEQUENCE CLUSTERING: non-metric spaces, O(N²) algorithms – Illustrate Phase Transitions
• 24. [Figure]
  • Start at T = ∞ with 1 cluster
  • Decrease T; clusters emerge at instabilities (see the cooling-loop sketch below)
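The cooling loop behind this picture can be outlined as below. This is only a skeleton under the assumptions stated in the earlier slides (cooling factor 0.95 to 0.9999, splits triggered by a second-derivative instability); DAClustering, emStep, unstable and split are hypothetical placeholders, not real library calls.

```java
// Placeholder interface for whatever concrete DA clustering state is used.
interface DAClustering {
    boolean emStep(double T);            // one mean-field EM sweep; true when converged at this T
    int numClusters();
    boolean unstable(int k, double T);   // second-derivative (Hessian) test for cluster k
    void split(int k);
}

// Skeleton of the deterministic-annealing loop: start hot with one cluster,
// converge at each temperature, cool slowly, split at phase transitions.
class AnnealingDriver {
    static void anneal(DAClustering da, double tStart, double tStop, double coolFactor) {
        for (double T = tStart; T > tStop; T *= coolFactor) {   // coolFactor e.g. 0.95 to 0.9999
            while (!da.emStep(T)) { /* iterate EM to convergence at this temperature */ }
            for (int k = da.numClusters() - 1; k >= 0; k--) {
                if (da.unstable(k, T)) da.split(k);             // cluster emerges at instability
            }
        }
    }
}
```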
• 28. METAGENOMICS – SEQUENCE CLUSTERING: non-metric spaces, O(N²) algorithms – Compare Other Methods
• 29. "Divergent" Data Sample – 23 True Clusters (DA-PWC compared with UClust at cuts 0.65–0.95; CDhit also shown in the figure)
  Measure | DA-PWC | UClust 0.65 | UClust 0.75 | UClust 0.85 | UClust 0.95
  Total # of clusters | 23 | 4 | 10 | 36 | 91
  Total # of clusters uniquely identified (one original cluster goes to 1 uclust cluster) | 23 | 0 | 0 | 13 | 16
  Total # of shared clusters with significant sharing (one uclust cluster goes to > 1 real cluster) | 0 | 4 | 10 | 5 | 0
  Total # of uclust clusters that are just part of a real cluster (bracketed numbers have only one member) | 0 | 4 | 10 | 17 (11) | 72 (62)
  Total # of real clusters that are in 1 uclust cluster, but that uclust cluster is spread over multiple real clusters | 0 | 14 | 9 | 5 | 0
  Total # of real clusters that have significant contribution from > 1 uclust cluster | 0 | 9 | 14 | 5 | 7
• 31. [Figure] Protein Universe Browser for COG sequences, with a few illustrative biologically identified clusters
• 32. [Figure] Heatmap of biology distance (Needleman-Wunsch) vs. 3D Euclidean distances
  • If d is a distance, so is f(d) for any monotonic f. Optimize the choice of f (see the sketch below).
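As a small illustration of the remark that any monotonic f(d) is again a distance, the sketch below applies one hypothetical one-parameter family, f(d) = d^α, to a pairwise distance matrix before MDS; α is the quantity one would tune, for example against the heatmap comparison above. The method name is illustrative.

```java
// Apply a monotonic transform f(d) = d^alpha to a pairwise distance matrix;
// for alpha > 0 and d >= 0 this preserves the ordering of distances, so the
// transformed matrix is still a valid dissimilarity input for MDS.
static double[][] transformDistances(double[][] dist, double alpha) {
    int n = dist.length;
    double[][] out = new double[n][n];
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            out[i][j] = Math.pow(dist[i][j], alpha);
    return out;
}
```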
• 34. Algorithm Challenges
  • See the NRC Massive Data Analysis report
  • O(N) algorithms for O(N²) problems
  • Parallelizing Stochastic Gradient Descent
  • Streaming data algorithms – balance and interplay between batch methods (most time consuming) and interpolative streaming methods
  • Graph algorithms – need shared memory?
  • The Machine Learning community uses parameter servers; Parallel Computing (MPI) would not recommend this?
    – Is the classic distributed model for a "parameter service" better?
  • Apply the best of parallel computing – communication and load balancing – to Giraph/Hadoop/Spark
  • Are data analytics sparse? Many cases are full matrices
  • BTW, need Java Grande – some C++, but Java is most popular in ABDS, with Python, Erlang, Go, Scala (compiles to JVM), …
• 35. [Figure] O(N²) interactions between the green and purple clusters should be representable by centroids, as in Barnes-Hut. This is hard, as there is no Gauss theorem and no multipole expansion, and the points are really in a 1000-dimensional space, clustered before the 3D projection. The O(N²) green–green and purple–purple interactions have value, but the green–purple ones are "wasted". "Clean" sample of 446K points.
• 36. [Figure] Use a Barnes-Hut OctTree, originally developed to make O(N²) astrophysics O(N log N), to give similar speedups in machine learning (see the sketch below)
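A minimal sketch of the Barnes-Hut acceptance idea transplanted to the cluster setting of the previous slide: replace a group of points by its centroid when the group subtends a small enough "angle" (radius over distance below θ), otherwise do the exact sum. This is generic illustration code, not the SPIDAL implementation; the kernel and the grouping are placeholders, and real octree code would recurse over tree nodes.

```java
// Barnes-Hut style acceptance test for accumulating O(N^2) interactions:
// far groups are summarized by their centroid, near groups are summed exactly.
static double interaction(double[] x, double[][] members, double[] centroid,
                          double radius, double theta) {
    double dist = euclidean(x, centroid);
    if (dist > 0 && radius / dist < theta) {
        return members.length * kernel(dist);            // far field: centroid approximation
    }
    double sum = 0.0;                                    // near field: exact O(m) sum
    for (double[] m : members) sum += kernel(euclidean(x, m));
    return sum;
}

static double euclidean(double[] a, double[] b) {
    double s = 0.0;
    for (int d = 0; d < a.length; d++) { double diff = a[d] - b[d]; s += diff * diff; }
    return Math.sqrt(s);
}

static double kernel(double d) { return Math.exp(-d * d); }  // placeholder interaction kernel
```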
• 37. [Figure] OctTree for a 100K sample of Fungi. We use the OctTree for logarithmic interpolation (streaming data).
• 39. Fungi Analysis
  • Multiple species from multiple places
  • Several sources of sequences, starting with 446K and eventually boiled down to ~10K curated sequences with 61 species
  • Original sample – clustering and MDS
  • Final sample – MDS and other clustering methods
  • Note MSA and SWG give similar results
  • Some species are clearly split
  • Some species are diffuse; others compact, making a fixed distance cut unreliable
    – Easy for humans!
  • MDS is very clear on structure and clustering artifacts
  • Why not do "high-value" clustering as an interactive iteration driven by MDS?
• 40. [Figure] Fungi – 4 classic clustering methods
• 44. [Figure] Same species, different locations
• 46. Parallel Data Analytics
  • Streaming algorithms have interesting differences, but
  • "Batch" data analytics is "just classic parallel computing", with the usual features such as SPMD and BSP
  • Expect systematics similar to simulations, where
    – Static regular problems are straightforward, but
    – Dynamic irregular problems are technically hard and high-level approaches fail (see High Performance Fortran, HPF)
    – Regular meshes worked well, but adaptive dynamic meshes did not, although "real people with MPI" could parallelize them
  • However, using libraries is successful at either
    – Lowest: communication level
    – Higher: "core analytics" level
  • Data analytics does not yet have "good regular parallel libraries"
    – Graph analytics has the most attention
• 47. Remarks on Parallelism I
  • Maximum Likelihood or χ² both lead to objective functions like
  • Minimize Σ_{items i=1}^{N} (positive nonlinear function of the unknown parameters for item i)
  • Typically decompose the items i and parallelize over both i and the parameters to be determined
  • Solve iteratively with a (clever) first- or second-order approximation to the shift in the objective function
    – Sometimes the steepest descent direction; sometimes Newton
    – Have classic Expectation Maximization structure
    – The steepest descent shift is a sum over shifts calculated from each point
  • Classic method – take all (millions) of the items in the data set and move the full distance
  • Stochastic Gradient Descent (SGD) – take randomly a few hundred items from the data set, calculate the shift over these, and move a tiny distance (see the sketch below)
    – SGD cannot parallelize over items
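The contrast between the two update styles can be sketched in a few lines of Java. ItemGradient, the step sizes and the batch size are placeholders; the point is only the structure: the batch step sums over all N items (and so parallelizes over i), while the SGD step touches a few random items and moves a tiny distance.

```java
import java.util.Random;

// Placeholder for the per-item gradient of the objective sum_i f_i(theta).
interface ItemGradient {
    double[] gradient(int item, double[] theta);
}

class GradientSteps {
    // Full-batch step: sum the shift over all N items, then move once.
    static void batchStep(ItemGradient g, double[] theta, int n, double step) {
        double[] shift = new double[theta.length];
        for (int i = 0; i < n; i++) {                 // parallelizes naturally over items i
            double[] gi = g.gradient(i, theta);
            for (int d = 0; d < theta.length; d++) shift[d] += gi[d];
        }
        for (int d = 0; d < theta.length; d++) theta[d] -= step * shift[d];
    }

    // SGD step: a few randomly chosen items, a tiny move; hard to parallelize over items.
    static void sgdStep(ItemGradient g, double[] theta, int n, int batch,
                        double tinyStep, Random rnd) {
        double[] shift = new double[theta.length];
        for (int b = 0; b < batch; b++) {
            double[] gi = g.gradient(rnd.nextInt(n), theta);
            for (int d = 0; d < theta.length; d++) shift[d] += gi[d];
        }
        for (int d = 0; d < theta.length; d++) theta[d] -= tinyStep * shift[d];
    }
}
```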
• 48. Remarks on Parallelism II
  • Need to cover non-vector semimetric and vector spaces for clustering and dimension reduction (N points in space)
  • Semimetric spaces just have pairwise distances δ(i, j) defined between points in the space
  • MDS minimizes Stress and illustrates this (see the sketch below): σ(X) = Σ_{i<j=1}^{N} weight(i,j) (δ(i,j) − d(X_i, X_j))²
  • Vector spaces have Euclidean distances and scalar products
    – Algorithms can be O(N), and these are best for clustering, but for MDS O(N) methods may not be best, as the obvious objective function is O(N²)
    – Important new algorithms are needed to define O(N) versions of the current O(N²) ones – they "must" work intuitively and be shown to work in principle
  • Note matrix solvers often use conjugate gradient, which converges in 5–100 iterations – a big gain for a matrix with a million rows. This removes a factor of N in time complexity
  • The ratio of #clusters to #points is important; new clustering ideas are needed if the ratio is >~ 0.1
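For reference, a direct O(N²) evaluation of the weighted Stress defined above; this is a straightforward transcription of the formula, not the SMACOF/DA-MDS production code.

```java
// Direct O(N^2) evaluation of the weighted MDS Stress
//   sigma(X) = sum_{i<j} weight(i,j) * (delta(i,j) - d(X_i, X_j))^2
// delta holds the given pairwise dissimilarities, X the current embedding.
static double stress(double[][] delta, double[][] weight, double[][] X) {
    int n = delta.length;
    double s = 0.0;
    for (int i = 0; i < n; i++) {
        for (int j = i + 1; j < n; j++) {
            double dij = 0.0;
            for (int d = 0; d < X[i].length; d++) {
                double diff = X[i][d] - X[j][d];
                dij += diff * diff;
            }
            double r = delta[i][j] - Math.sqrt(dij);
            s += weight[i][j] * r * r;
        }
    }
    return s;
}
```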
• 49. Problem Structure
  • Note learning networks have a huge number of parameters (11 billion in the Stanford work), so it is inconceivable to look at the second derivative
  • Clustering and MDS have lots of parameters, but it can be practical to look at the second derivative and use Newton's method to minimize
  • Parameters are determined in distributed fashion but are typically needed globally
    – MPI uses broadcast and "AllCollectives", implying Map-Collective is a useful programming model
    – AI community: use a parameter server and access it as needed. Non-optimal?
  [Figure: six problem architectures – (1) Map Only, (2) Classic MapReduce, (3) Iterative MapReduce or Map-Collective, (4) Point to Point or Map-Communication, (5) Map-Streaming (maps, brokers, events), (6) Shared Memory (Map & Communicate)]
• 50. MDS in more detail
• 51. WDA-SMACOF: "Best" MDS
  • Semimetric spaces just have pairwise distances δ(i, j) defined between points in the space
  • MDS minimizes Stress with pairwise distances δ(i, j): σ(X) = Σ_{i<j=1}^{N} weight(i,j) (δ(i,j) − d(X_i, X_j))²
  • SMACOF is a clever Expectation Maximization method that chooses a good steepest descent (see the algebra below)
  • Improved by Deterministic Annealing reducing the distance scale; DA does not impact compute time much and gives DA-SMACOF
  • Classic SMACOF is O(N²) for uniform weights and O(N³) for non-trivial weights, but non-uniform weights arise from
    – The preferred Sammon method, weight(i,j) = 1/δ(i,j), or
    – Missing distances put in as weight(i,j) = 0
  • Use conjugate gradient, which converges in 5–100 iterations – a big gain for a matrix with a million rows. This removes a factor of N in time complexity and gives WDA-SMACOF
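For readers who want the algebra, the textbook SMACOF majorization (Guttman transform) that the DA and WDA variants build on is written below. With uniform weights V is trivial, while Sammon or missing-distance weights make the solve with V the expensive part unless, as described above, conjugate gradient is used. This is the standard form, not the exact WDA-SMACOF equations.

```latex
% Standard SMACOF majorization step (Guttman transform) with weights w_{ij}.
\begin{align*}
\sigma(X) &= \sum_{i<j} w_{ij}\left(\delta_{ij} - d_{ij}(X)\right)^2, \\
X^{(t+1)} &= V^{+}\, B\!\left(X^{(t)}\right) X^{(t)}, \qquad
V = \sum_{i<j} w_{ij}\,(e_i-e_j)(e_i-e_j)^{\mathsf{T}}, \\
B(Z)_{ij} &= -\frac{w_{ij}\,\delta_{ij}}{d_{ij}(Z)} \quad (i \neq j,\ d_{ij}(Z)\neq 0),
\qquad B(Z)_{ii} = -\sum_{j\neq i} B(Z)_{ij}.
\end{align*}
```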
• 52. Timing of WDA-SMACOF
  • 20k to 100k AM Fungal sequences on 600 cores
  [Figure: time cost comparison between WDA-SMACOF with equal weights and Sammon's mapping; seconds (log scale) vs. data size 20k–100k]
• 53. WDA-SMACOF Timing
  • Input data: 100k to 400k AM Fungal sequences
  • Environment: 32 nodes (1024 cores) to 128 nodes (4096 cores) on BigRed2
  • Using the Harp plug-in for Hadoop (MPI performance)
  [Figures: time cost of WDA-SMACOF over increasing data size (100k–400k) at 512- to 4096-way parallelism; parallel efficiency of WDA-SMACOF (Harp) over increasing numbers of processors (512–4096)]
• 54. Spherical Phylogram
  • Take a set of sequences mapped to nD with MDS (WDA-SMACOF), with n = 3 or ~20
    – n = 20 captures ~all features of the dataset?
  • Consider a phylogenetic tree and use neighbor-joining formulae (see below) to calculate distances of nodes to sequences (or later to other nodes), starting at the bottom of the tree
  • Do a new MDS, fixing the mapping of the sequences, noting that sequences + nodes have defined distances
  • Use RAxML or Neighbor Joining (n = 20?) to find the tree
  • Random note: you do not need Multiple Sequence Alignment; pairwise tools are easier to use and give reliably good results
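The neighbor-joining reduction referred to above is, in its standard textbook form (one common choice; the slide does not spell out the exact variant used): when a new internal node u joins children a and b, its distance to any other sequence or node i is

```latex
% Standard neighbor-joining distance reduction for a new internal node u
% joining children a and b.
d(u, i) \;=\; \tfrac{1}{2}\bigl( d(a, i) + d(b, i) - d(a, b) \bigr).
```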
• 55. [Figure] Spherical Phylograms (MSA and SWG): RAxML result visualized in FigTree; Spherical Phylogram visualized in PlotViz for MSA or SWG distances
• 56. Quality of 3D Phylogenetic Tree
  • EM-SMACOF is basic SMACOF
  • LMA was the previous best method, using a Levenberg-Marquardt nonlinear χ² solver
  • WDA-SMACOF finds the best result
  • 3 different distance measures (MSA, SWG, NW)
  [Figures: sum of branch lengths of the Spherical Phylogram generated in 3D space on two datasets (599nts and 999nts) for WDA-SMACOF, LMA and EM-SMACOF]
• 57. Summary
  • Always run MDS. It gives insight into the data and into the performance of machine learning
    – Leads to a data browser, as GIS gives for spatial data
    – 3D better than 2D
    – ~20D better than MSA?
  • Clustering observations
    – Do you care about quality, or are you just cutting up space into parts?
    – Deterministic clustering always makes results more robust
    – Continuous clustering enables hierarchy
    – Trimmed clustering cuts off tails
    – Distinct O(N) and O(N²) algorithms
  • Use Conjugate Gradient
• 59. Java Grande
  • We once tried to encourage the use of Java in HPC with the Java Grande Forum, but Fortran, C and C++ remain the central HPC languages
    – Not helped by the .com and Sun collapse in 2000–2005
  • The pure-Java CartaBlanca, a 2005 R&D100 award-winning project, was an early successful example of HPC use of Java in a simulation tool for non-linear physics on unstructured grids
  • Of course Java is a major language in ABDS, and as data analysis and simulation are naturally linked, we should consider broader use of Java
  • Using Habanero Java (from Rice University) for threads and mpiJava or FastMPJ for MPI, gathering a collection of high-performance parallel Java analytics (see the sketch below)
    – Converted from C#; sequential Java is faster than sequential C#
  • So we will have either Hadoop+Harp or classic Threads/MPI versions in a Java Grande version of Mahout
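The "threads inside a node, MPI between nodes" pattern can be illustrated with plain java.util.concurrent, without committing to the Habanero Java or mpiJava/FastMPJ APIs, which are not shown in the slides. This hypothetical helper computes per-thread partial sums over slices of the points; a multi-node run would then pass the combined result to an MPI Allreduce.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Intra-node thread parallelism over points: each thread sums its own slice,
// then the partial results are combined on the master thread.
class NodeParallelSum {
    static double[] parallelPartialSums(double[][] points, int nThreads)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(nThreads);
        int chunk = (points.length + nThreads - 1) / nThreads;
        List<Future<double[]>> futures = new ArrayList<>();
        for (int t = 0; t < nThreads; t++) {
            final int lo = t * chunk;
            final int hi = Math.min(points.length, lo + chunk);
            futures.add(pool.submit(() -> {
                double[] local = new double[points[0].length];
                for (int i = lo; i < hi; i++)
                    for (int d = 0; d < local.length; d++) local[d] += points[i][d];
                return local;
            }));
        }
        double[] total = new double[points[0].length];
        for (Future<double[]> f : futures) {
            double[] local = f.get();
            for (int d = 0; d < total.length; d++) total[d] += local[d];
        }
        pool.shutdown();
        return total;   // in a multi-node run this would feed an MPI Allreduce
    }
}
```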
• 60. Performance of MPI Kernel Operations
  [Figures: average time (µs) vs. message size for MPI send/receive and allreduce operations, comparing MPI.NET C# on Tempest, FastMPJ Java on FG, OMPI-nightly and OMPI-trunk Java on FG, and OMPI-trunk C on FG, plus OMPI-trunk C and Java on Madrid and FG, on Infiniband and Ethernet]
  • Pure Java, as in FastMPJ, is slower than Java interfacing to the C version of MPI
• 61. [Figure] Java Grande and C# on 40K-point DAPWC clustering; very sensitive to threads v. MPI. Runs at 64-, 128- and 256-way parallelism on TXP nodes; the C# hardware has ~0.7 the performance of the Java hardware.
• 62. [Figure] Java and C# on 12.6K-point DAPWC clustering; time (hours) vs. #threads × #processes per node (1x1 through 8x1) and number of nodes / total parallelism. The C# hardware has ~0.7 the performance of the Java hardware.
• 63. Data Analytics in SPIDAL
• 64. Analytics and the DIKW Pipeline
  • Data goes through a pipeline: Raw data → Data → Information → Knowledge → Wisdom → Decisions
  • Each link is enabled by a filter which is "business logic" or "analytics"
  • We are interested in filters that involve "sophisticated analytics", which require non-trivial parallel algorithms
    – Improve the state of the art in both algorithm quality and (parallel) performance
  • Design and build SPIDAL (Scalable Parallel Interoperable Data Analytics Library)
  [Figure: Data → (Analytics) → Information → (More Analytics) → Knowledge]
• 65. Strategy to Build SPIDAL
  • Analyze Big Data applications to identify the analytics needed and generate benchmark applications
  • Analyze existing analytics libraries (in practice, limited to some application domains) – catalog the library members available and their performance
    – Mahout has low performance, R is largely sequential and missing key algorithms, MLlib is just starting
  • Identify big data computer architectures
  • Identify a software model that allows interoperability and performance
  • Design or identify new or existing algorithms, including parallel implementations
  • Collaborate with application scientists and the computer systems and statistics/algorithms communities
• 66. Machine Learning in Network Science, Imaging in Computer Vision, Pathology, Polar Science, Biomolecular Simulations
  Algorithm | Applications | Features | Status | Parallelism
  Graph Analytics
    Community detection | Social networks, webgraph | Graph | P-DM | GML-GrC
    Subgraph/motif finding | Webgraph, biological/social networks | – | P-DM | GML-GrB
    Finding diameter | Social networks, webgraph | – | P-DM | GML-GrB
    Clustering coefficient | Social networks | – | P-DM | GML-GrC
    Page rank | Webgraph | – | P-DM | GML-GrC
    Maximal cliques | Social networks, webgraph | – | P-DM | GML-GrB
    Connected component | Social networks, webgraph | – | P-DM | GML-GrB
    Betweenness centrality | Social networks | Graph, non-metric, static | P-Shm | GML-GrA
    Shortest path | Social networks, webgraph | – | P-Shm | –
  Spatial Queries and Analytics
    Spatial relationship based queries | GIS/social networks/pathology informatics | Geometric | P-DM | PP
    Distance based queries | – | – | P-DM | PP
    Spatial clustering | – | – | Seq | GML
    Spatial modeling | – | – | Seq | PP
  Legend: GML = Global (parallel) ML; GrA = Static; GrB = Runtime partitioning
• 67. Some specialized data analytics in SPIDAL
  Algorithm | Applications | Features | Status | Parallelism
  Core Image Processing
    Image preprocessing | Computer vision/pathology informatics | Metric space point sets, neighborhood sets & image features | P-DM | PP
    Object detection & segmentation | – | – | P-DM | PP
    Image/object feature computation | – | – | P-DM | PP
    3D image registration | – | – | Seq | PP
    Object matching | – | Geometric | Todo | PP
    3D feature extraction | – | – | Todo | PP
  Deep Learning
    Learning network, Stochastic Gradient Descent | Image understanding, language translation, voice recognition, car driving | Connections in artificial neural net | P-DM | GML
  Legend: PP = Pleasingly Parallel (Local ML); Seq = Sequential, available; GRA = Good distributed algorithm needed; Todo = No prototype available; P-DM = Distributed memory, available; P-Shm = Shared memory, available
• 68. Some Core Machine Learning Building Blocks
  Algorithm | Applications | Features | Status | Parallelism
  DA Vector Clustering | Accurate clusters | Vectors | P-DM | GML
  DA Non-metric Clustering | Accurate clusters, Biology, Web | Non-metric, O(N²) | P-DM | GML
  K-means: Basic, Fuzzy and Elkan | Fast clustering | Vectors | P-DM | GML
  Levenberg-Marquardt Optimization | Non-linear Gauss-Newton, use in MDS | Least squares | P-DM | GML
  SMACOF Dimension Reduction | DA-MDS with general weights | Least squares, O(N²) | P-DM | GML
  Vector Dimension Reduction | DA-GTM and others | Vectors | P-DM | GML
  TFIDF Search | Find nearest neighbors in document corpus | Bag of "words" (image features) | P-DM | PP
  All-pairs similarity search | Find pairs of documents with TFIDF distance below a threshold | – | Todo | GML
  Support Vector Machine (SVM) | Learn and classify | Vectors | Seq | GML
  Random Forest | Learn and classify | Vectors | P-DM | PP
  Gibbs sampling (MCMC) | Solve global inference problems | Graph | Todo | GML
  Latent Dirichlet Allocation (LDA) with Gibbs sampling or Var. Bayes | Topic models (latent factors) | Bag of "words" | P-DM | GML
  Singular Value Decomposition (SVD) | Dimension reduction and PCA | Vectors | Seq | GML
  Hidden Markov Models (HMM) | Global inference on sequence models | Vectors | Seq | PP & GML
• 69. Some Futures
  • Always run MDS. It gives insight into the data
    – Leads to a data browser, as GIS gives for spatial data
  • The claim is that algorithm change gave as much performance increase as hardware change in simulations. Will this happen in analytics?
    – Today is like parallel computing 30 years ago with regular meshes. We will learn how to adapt methods automatically to give "multigrid"- and "fast multipole"-like algorithms
  • Need to start developing the libraries that support Big Data
    – Understand architecture issues
    – Have coupled batch and streaming versions
    – Develop much better algorithms
  • Please join the SPIDAL (Scalable Parallel Interoperable Data Analytics Library) community