1. Dealing with the Data Deluge:
What can the Robotics
Community Teach us?
Making our pipelines organic,
adaptable, and scalable
Darin London
2. Part I. The Challenges of
NextGen Sequencing Data
3. Datasets
50+ Cell Lines
Each sequenced with up to 2 different technologies
(DNaseHS and FAIRE) and 3 different ChIP-Seq antibodies
(CTCF, PolII, c-Myc), as well as a Control (Input) for
comparison
Most involved multiple biological replicates, and some
biological replicates were sequenced multiple times to
create technical replicates of the same biological sample
1.3 GB of zipped raw data per Cell_line-Technology-Replicate, on average
351 GB of zipped raw sequence data analyzed (and counting...)
5. Some characteristics of NextGen
Sequencing Data
heterogeneous in time:
comes in batches by lane and sample
the order in which samples are submitted does not determine the
order in which their data arrive
heterogeneous in size:
some samples will produce more data than others
size affects timing of most computational tasks
heterogeneous in quality:
some data will not merit being run through the entire
pipeline
some data may merit extra analysis
9. Meet Shakey
http://www.ai.sri.com/movies/Shakey.ram
The first fully autonomous robot able to reason about its
surroundings
Pioneered many algorithms for modeling data from multiple
sensors into a central world map, applying one or more plans
of action, and determining the appropriate actions to achieve
those plans
If science is the 'Art of the Soluble' then Shakey
demonstrated the solubility of autonomous robotics to the
world.
10. That being said...
The autonomous systems roving
on mars, fighting in Afghanistan,
and cleaning our floors do not
share much in common with
Shakey.
11. These systems descend from more practical approaches
pioneered in the 1980s by Rodney Brooks and others
In 1986, he introduced the world
to Allen, a Behavior-based robot
based on the Subsumption
Architecture
12. Behavior-based Robots
Attempt to mimic biological actions, rather than human
cognition
Built out of many small modules
Modules act autonomously, continuously sensing the
environment for specific signals and immediately
performing a specific action based on that sensory input
Modules are arranged hierarchically, with higher-layer
modules able to mask (subsume) the input or output
of lower-layer modules (lower-layer modules are not
aware that they are being subsumed)
There is no central planning module
The intelligence of the system is distributed entirely
across the smaller subsystems, each designed to accomplish
certain parts of the overall task list opportunistically,
whenever the environment becomes favorable to the action
it is designed to perform (see the sketch below)
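A minimal Perl sketch of the idea follows; the sensor stub, behaviors, and signal names are hypothetical and not taken from any real robot. The highest-layer module whose trigger is present acts, and its action masks everything below it.

#!/usr/bin/perl
# Minimal sketch of a subsumption-style control loop; the sensor stub,
# behaviors, and signal names are hypothetical.
use strict;
use warnings;

sub sense_environment { return ( obstacle => (rand() < 0.3) ) }  # stub sensor
sub perform { print "action: $_[0]\n" }                          # stub actuator

# Behavior modules, highest layer first. Each senses the environment and
# returns an action only when its own trigger is present.
my @modules = (
    sub { my $env = shift; return $env->{obstacle} ? 'turn_away' : undef },  # avoid obstacles
    sub { return 'move_forward' },                                           # wander by default
);

for (1 .. 5) {   # a few cycles stand in for the robot's endless sense-act loop
    my %env = sense_environment();
    for my $module (@modules) {
        my $action = $module->(\%env) or next;
        perform($action);
        last;    # a higher layer that acts subsumes (masks) the layers below it
    }
}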
13. Cost-Benefit Analysis
Benefits of Behavior-based robots over AI:
Easier and cheaper to build
Scale better with existing technology
More easily adaptable: new behaviors emerge with the
addition of modules, with little or no change to other
modules
More fault tolerant: partial behaviors tend to persist even
when many modules fail to act
Deficiencies of Behavior-based robots:
'Higher order' reasoning and logic functions are too
complex to implement
No capacity to learn from mistakes, except through the
modification, addition, or removal of modules
14. Part III. Making our
Bioinformatics Pipelines
Organic and Adaptable
15. Many Bioinformatics Pipelines
resemble Shakey by:
Involving centralized controller systems which control every
aspect of pipeline behavior
Mixing the logic for selecting tasks from a list together with
the logic for performing these tasks
16. Except that, unlike Shakey, many
pipelines:
Have little or no knowledge of their computing environment
Have no, or very little, capacity to:
perform tasks in different orders, opportunistically
temporarily re-focus their work on smaller subsets of the
total task list
run tasks in parallel
etc.
Lack intelligent points for human agent inclusion
Are subject to human will at every level
18. They are Modular
Much like Object Oriented Programming
Failure is easy to diagnose and fix
Failure in one module does not (necessarily) impact other
module actions
Failure in one module does not (necessarily) require other
modules to be rerun, or require complex skipping logic in
the pipeline code
19. They are Adaptable
New analyses should simply require plugging in a new
module, with minimal or no 'rewiring' of other modules
Reanalyses should simply require the removal of certain
outputs, and possibly a reset of the completion state of a
particular task; all downstream tasks should then either
react to the presence of the new data, or require only
minimal state manipulation to rerun themselves
Modules can be augmented, or replaced as needed, with
little or no change to other modules, as long as their original
functionality is maintained or assumed by another module
20. They are Scalable
Modules can be deployed onto as many different machines
as are available (servers, nodes on a cluster, nodes in a
cloud) to expand throughput
Modules with high resource requirements can be deployed
onto separate machines from those with low resource
requirements
Modules can be grouped together on different machines, or
sets of machines, according to functionality, or data
proximity
21. They act Autonomously
Individual modules can 'react' to data to produce information
as soon as the data is made available in the 'environment'
Datasets can be moved through the pipeline at different
rates
Modules do not require humans to manage them, but,
instead, react and respond to different human inputs at
many different places
Humans are really just another intelligent agent in the
system
22. They can act Opportunistically
Modules can be tied into multiple task-management
systems
overall dataset-task list
priority dataset-task list
machine specific dataset-task lists
manual intervention
The priority system can be set to take precedence over the
overall system; if the priority datasets become backlogged, the
system can still opportunistically process items from the overall
system until the backlog clears, at which point the priority
system regains the focus of some or all machines in the
system (see the sketch below)
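A rough sketch of that fallback, reusing the worksheet calls shown in the runner agent later in this deck; the 'priority' page name, the ready/complete fields, and the process_rows dispatcher are illustrative assumptions, not part of the actual pipeline.

#!/usr/bin/perl
# Sketch: work the 'priority' worksheet first, and fall back to the overall
# 'all' worksheet only when nothing on the priority list is runnable.
use strict;
use warnings;
use Google::Spreadsheet::Agent;

my $agent = Google::Spreadsheet::Agent->new(
    agent_name      => 'opportunistic_runner',
    page_name       => 'all',
    bind_key_fields => { cellline => 'all', technology => 'all', replicate => 'all' },
);

foreach my $page_name ('priority', 'all') {
    my @runnable = grep {
        $_->content->{ready} && !$_->content->{complete}
    } $agent->google_db->worksheet({ title => $page_name })->rows;

    next unless @runnable;       # nothing runnable here, so try the next list
    process_rows(@runnable);     # stand-in for dispatching processing agents
    last;                        # runnable priority work keeps the focus here
}

sub process_rows { print "would dispatch ", scalar(@_), " rows\n" }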
23. They are sensitive to their computing
environment, and knowledgeable of the
resources they need to work
Modules should know how much memory, file system
space, etc. they need
Modules should know about other modules that would
compete with them for scarce resources
This may run counter to the ethos of platform neutrality, but,
for instance (if you are running on RedHat/CentOS), you can
parse /proc/meminfo for memory information (my $meminfo
= YAML::LoadFile('/proc/meminfo')), ps for information on
other processes running in the environment, df for
filesystem information, etc. (see the sketch below)
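For instance, a processing agent could simply refuse to start while a known resource hog is running. The sketch below assumes a Linux ps; the competing program name 'bwa' is only an example of a competitor to avoid.

# Sketch: skip this run if a competing, resource-hungry program is active.
use strict;
use warnings;

sub competing_process_running {
    my $competitor = shift;
    open(my $ps_in, '-|', 'ps', '-eo', 'comm') or do {
        print STDERR "Couldn't run ps: $!\n";
        return;
    };
    while (my $comm = <$ps_in>) {
        chomp $comm;
        if ($comm eq $competitor) {
            close $ps_in;
            return 1;
        }
    }
    close $ps_in;
    return 0;
}

exit if competing_process_running('bwa');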
24. These systems have other advantages:
They make it easy to get up and running with 1-2 modules
tested on a small dataset, which can then be applied to all
other datasets available, and those yet to come
They allow for 'partial solutions': some data will always
be produced even if the entire pipeline is not finished (what
pipeline is ever 'finished', anyway?), or if one or more parts of
the pipeline are discovered to have bugs
New modules can be created, tested against 1 or more
datasets, and then 'released to the wild' so that they can
autonomously fill in the gaps for all previously received data,
and then analyze all data received in the future
Buggy modules can be pulled out of the pipeline, fixed and
tested in the same way
26. Pipeline designed to generate data for
the Encyclopedia of DNA Elements
(ENCODE)
http://www.genome.gov/10005107
For both ENCODE and non-ENCODE cell_lines and treatments:
Automates movement of data from sequencing staging to IGSP server
Aligns raw sequence files to hg19 using bwa (previously hg18 using
maq); see the alignment sketch after this list
Generates feature density distributions of whole-genome sequence data
aligned to hg19
Generates visual tracks of data in the IGSP internal UCSC Genome
Browser
Generates submission tarballs of bam, peaks, parzen bigWigs, and
base count bigWigs to be submitted to UCSC
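As a sketch of what the alignment step boils down to, an agent might shell out to bwa roughly as below; the reference and read paths are hypothetical placeholders, and the pipeline's actual options and paired-end handling are not shown.

# Sketch of a single-end bwa alignment step; all paths are hypothetical.
use strict;
use warnings;

my $reference = '/data/genomes/hg19/hg19.fa';           # hypothetical hg19 index
my $fastq     = '/data/incoming/sample_sequence.txt';   # hypothetical raw reads
my $sai       = 'sample.sai';
my $sam       = 'sample.sam';

system("bwa aln $reference $fastq > $sai") == 0
    or die "bwa aln failed: $?\n";
system("bwa samse $reference $sai $fastq > $sam") == 0
    or die "bwa samse failed: $?\n";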
27. Compute Infrastructure
4 CentOS compute nodes: 8 cores (2.50 GHz, dual quad-core
procs), 32 GB 1066 MHz RAM, primary 120 GB HDD,
secondary 250 GB HDD
Duke Shared Cluster Resource: 19 high-priority ENCODE
nodes, each with 8 cores and 16 GB RAM
Compute nodes connected to the DSCR via an NFS-mounted
volume provided by a NetApp NAS array of 42 15k 450 GB
FC disks exported through a 10G Fibre-E link
Raw data and analytical output stored on two NFS-mounted
volumes provided by a NetApp NAS array of 14
7.2k SATA disks, 1 TB and 750 GB in size
Each compute node contains its own locally mounted 230 GB
scratch directory to minimize NFS read-write concurrency
issues
28. Pipeline composed of many different
agents, each falling into one of three
categories:
Runner Agents: These simply read through a list of datasets and
tasks to be done on each dataset, and launch the necessary
processing agents required to accomplish each task on the
dataset. They do not care whether it is possible for the agent to
accomplish the task on the dataset
Processing Agents: These are small programs designed to perform
a specific processing task on a given dataset. In addition, they are
designed to know when it is possible to perform the task (based on
prerequisites), whether the resources (memory, storage space,
etc) required for it to run are available, and whether other
programs which are running on the system will compete with it in
ways which adversely affect its performance
29. Main Task List
Composed of a set of worksheets in a Google
Spreadsheet. This has a number of advantages:
Allows people all over the world to keep track of what has
been done, and what remains to be done
Since the Google Spreadsheet API is also available to
agents on any internet connected computer, it can be used
by runner and processing agents on any number of servers
30. The third type of agent in this system
is the human
The Google Spreadsheet model makes it very easy to plug
humans into the overall logic of the system:
arguments, variables, and state switches can be
communicated to an agent using meta-fields on the
worksheet. The values for these fields can be filled in by
humans or by other computer agents
processing agents can be coded to require prerequisite
meta-fields that a human must switch on before they
run (see the sketch below)
processing agents can write data to information fields upon
completion, failure, or both. This might include changing the
state of prerequisite fields required by other agents
processes requiring human intervention can be replaced by
computational logic over time, as the logic becomes
formalized into one or more agents
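For example, a processing agent can list a human-controlled column among its prerequisites, so it only runs once someone has switched that field on. The 'humanapproved' field and 'dnase' page below are hypothetical examples; the constructor arguments follow the agent shown on the next slides.

#!/usr/bin/perl
# Sketch of a processing agent gated on a human-controlled prerequisite field;
# the 'humanapproved' column and 'dnase' page are hypothetical examples.
use strict;
use warnings;
use Google::Spreadsheet::Agent;

my ($cell_line, $technology, $replicate) = @ARGV;

my $google_agent = Google::Spreadsheet::Agent->new(
    agent_name      => 'submit',
    page_name       => 'dnase',
    bind_key_fields => { cellline => $cell_line, technology => $technology, replicate => $replicate },
    prerequisites   => ['humanapproved'],   # a human must switch this field on first
);

# run_my only fires the code once the prerequisite field has been set.
$google_agent->run_my(sub {
    print "submitting $cell_line $technology $replicate\n";
    return 1;   # the deck's agents return 1 on success
});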
31. Part V.
Google::Spreadsheet::Agent
http://search.cpan.org/~dmlond/Google-Spreadsheet-Agent-0.01
32. #!/usr/bin/perl
use strict;
use Getopt::Std;
use File::Basename;
use Google::Spreadsheet::Agent;
# usually other modules are used
my $goal = basename($0);
$goal =~ s/_agent\.pl$//;
my $cell_line = shift or die "cell_line\n";
my $technology = shift or die "technology\n";
my $replicate = shift or die "replicate\n";
my $google_page = ($replicate =~ m/.*_TP.*/) ? 'combined' : $technology;
my %opts;
getopts('dr:P:', \%opts);
my $debug = $opts{d};
my $data_root;
$data_root = $opts{r} if ($opts{r});
$google_page = $opts{P} if ($opts{P});
my $prerequisites = [];
$prerequisites->[0] = ($replicate =~ m/.*_TP.*/) ? 'combined' : 'aligned';
my $google_agent = Google::Spreadsheet::Agent->new(
  agent_name => $goal, page_name => $google_page, debug => $debug,
  max_selves => 3,
  bind_key_fields => { cellline => $cell_line, technology => $technology, replicate => $replicate },
  prerequisites => $prerequisites
);
$google_agent->run_my(\&agent_code);
exit;
33. my $min_gigs = 18; # start with an 18G /scratch2 availability requirement
my $gigs_avail = &get_scratch_availability or exit(1);
exit if ($gigs_avail < $min_gigs);
sub get_scratch_availability {
  my $opened = open(my $df_in, '-|', 'df', '-h', '/scratch2');
  unless ($opened) {
    print STDERR "Couldn't check scratch2 usage: $!\n";
    return;
  }
  my $in = <$df_in>; # skip the header line
  $in = <$df_in>;
  chomp $in;
  close $df_in;
  my $gigs_avail = (split /\s+/, $in)[3]; # the 'Avail' column
  $gigs_avail =~ s/\D+$//;                # strip the trailing unit letter
  return $gigs_avail;
}

use YAML::Any qw/LoadFile/;
my $min_mem = 16; # requires about 16-18G memory to run
exit if (&get_available_memory <= $min_mem);
sub get_available_memory {
  my $info = LoadFile('/proc/meminfo') or die "Couldn't load meminfo: $!\n";
  my $free_mem = $info->{MemFree};
  $free_mem =~ s/\D+$//;
  my $buffers = $info->{Buffers};
  my $cached = $info->{Cached};
  $buffers =~ s/\D+$//;
  $cached =~ s/\D+$//;
  $free_mem += $buffers + $cached;
  $free_mem /= (1024*1024); # kB to GB
  return $free_mem;
}
34. sub agent_code {
  my $entry = shift;
  my $build = $entry->{build};
  my $replicate_root = join('/', $data_root, $cell_line, $technology, 'sequence_'.$replicate);
  my $db_name = getDBName($replicate_root);
  my $scratch_root = $replicate_root;
  $scratch_root =~ s{$data_root}{/scratch2};
  my $helper_command = join(' ', join('/', $generic_apps_dir, 'parzen_fseq_helper.pl'),
    $replicate_root, join('/', $replicate_root, 'bwa_'.$build, 'sequence.final.bed'),
    $cell_line, $technology, $entry->{sex}, $build, $db_name
  );
  print STDERR "Running ${helper_command}\n";
  `$helper_command`;
  if ($?) {
    print STDERR "Problem running parzen_helper: $!\n";
    return;
  }
  my $parzen_track_name = $db_name . "_parzen";
  my $scratch_parzen_dir = join('/', $scratch_root, 'parzen_'.$build);
  my $parzen_dir = join('/', $replicate_root, 'parzen_'.$build);
  $parzen_dir =~ s/sata2/sata4/;
  my $wiggle_helper = join(' ', join('/', $generic_apps_dir, 'parzen_wiggle_helper.pl'),
    $build, $parzen_track_name, $parzen_dir, $scratch_parzen_dir
  );
  print STDERR "Running ${wiggle_helper}\n";
  `$wiggle_helper`;
  if ($?) {
    print STDERR "Problem running wiggle_helper: $!\n";
    return;
  }
  return 1;
}
35. #!/usr/bin/perl
use strict;
use FindBin;
use Google::Spreadsheet::Agent;
my $google_agent = Google::Spreadsheet::Agent->new(
  agent_name => 'agent_runner',
  page_name => 'all',
  bind_key_fields => { cellline => 'all', technology => 'all', replicate => 'all' }
);
# iterate through each page on the database, get runnable rows, and run each runnable agent on the row
foreach my $page_name ( map { $_->title } $google_agent->google_db->worksheets ) {
  foreach my $runnable_row (
    grep {
      $_->content->{ready} && !$_->content->{complete}
    } $google_agent->google_db->worksheet({ title => $page_name })->rows
  ){
    my $row_content = $runnable_row->content;
    foreach my $goal (keys %{$row_content}) {
      next if ($row_content->{$goal}); # r, 1, F cause it to skip
      # some of these will skip because they are fields without agents
      my $goal_agent = $FindBin::Bin.'/../agent_bin/'.$goal.'_agent.pl';
      next unless (-x $goal_agent);
      my @cmd = ($goal_agent);
      foreach my $query_field ( sort {
        $google_agent->config->{key_fields}->{$a}->{rank} <=> $google_agent->config->{key_fields}->{$b}->{rank}
      } keys %{$google_agent->config->{key_fields}} ) {
        next unless ($row_content->{$query_field});
        push @cmd, $row_content->{$query_field};
      }
      system( join(' ', @cmd).' &');
      sleep 5;
    }
  }
}
exit;
36. Future Plans
1. Making inter-lab communication more concrete and automatic
2. Each server can have its own 'task' view of a particular
Google Spreadsheet worksheet, in that it can have its own
unique set of executable agent_bin scripts tied to a set of
fields that systems on other servers would ignore
3. Put some of the runner code and requirements-checking
routines into Google::Spreadsheet::Agent for version 1.1
37. Acknowledgements
The Institute for Genome Sciences and Policy (IGSP)
The ENCODE Consortium
Terry Furey
Alan Boyle
Greg Crawford
Mark DeLong
Rob Wagner
Peyton Vaughn
Darrin Mann
Alan Cowles