Parallel Programming in Fortran
            with Coarrays
      John Reid, ISO Fortran Convener,
            JKR Associates and
       Rutherford Appleton Laboratory


  Fortran 2008 is now in FDIS ballot: only
  ‘typos’ permitted at this stage.
  The biggest change from Fortran 2003 is
  the addition of coarrays.
  This talk will introduce coarrays and
  explain why we believe that they will lead
  to easier development of parallel programs,
  faster execution times, and better
  maintainability.


       Advanced Programming Specialist Group,
                   BCS, London, 6 May 2010.
Design objectives
Coarrays are the brain-child of Bob Numrich
(Minnesota Supercomputing Institute, formerly
Cray).
The original design objectives were for
  A simple extension to Fortran
  Small demands on the implementors
  Retain optimization between synchronizations
  Make remote references apparent
  Provide scope for optimization of
  communication
A subset has been implemented by Cray for some
ten years.
Coarrays have been added to the g95 compiler,
are being added to gfortran, and for Intel ‘are at
the top of our development list’.


Summary of coarray model
  SPMD – Single Program, Multiple Data
  Replicated to a number of images (probably as
  executables)
  Number of images fixed during execution
  Each image has its own set of variables
  Coarrays are like ordinary variables but have a
  second set of subscripts in [ ] for access
  between images
  Images mostly execute asynchronously
  Synchronization: sync all, sync images,
  lock, unlock, allocate, deallocate,
  critical construct
  Intrinsics: this_image, num_images,
  image_index
Full summary: WG5 N1824 (Google WG5)


Examples of coarray syntax
real :: r[*], s[0:*] ! Scalar coarrays
real,save :: x(n)[*] ! Array coarray
type(u),save :: u2(m,n)[np,*]
! Coarrays always have assumed
! cosize (equal to number of images)

real :: t               ! Local
integer p, q, index(n) ! variables
     :
t = s[p]
x(:) = x(:)[p]
! Reference without [] is to local part
x(:)[p] = x(:)
u2(i,j)%b(:) = u2(i,j)[p,q]%b(:)
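
The fragments above can be put together into a small complete program. This is only a sketch: the program name, the size n = 4, and the choice of the "left" neighbour are assumptions made for illustration.

```fortran
program coarray_syntax_demo
  implicit none
  integer, parameter :: n = 4   ! hypothetical array size
  real :: x(n)[*]               ! array coarray (implicitly saved in a main program)
  real :: t                     ! ordinary local variable
  integer :: me, left

  me = this_image()
  ! Neighbour to the "left", wrapping round to the last image.
  left = merge(num_images(), me - 1, me == 1)

  x(:) = real(me)               ! no [ ]: the local part of x
  sync all                      ! every image has defined x before anyone reads it

  t = x(1)[left]                ! [ ]: element 1 of x on image left
  if (t /= real(left)) error stop 'unexpected remote value'
  print '(a,i0,a,i0,a,f0.1)', 'image ', me, ' read from image ', left, ': ', t
end program coarray_syntax_demo
```

The sync all is needed because the remote read of x on image left must be ordered after the definition of x on that image.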




Implementation model
Usually, each image resides on one core.
However, several images may share a core (e.g.
for debugging) and one image may execute on a
cluster (e.g. with OpenMP).
A coarray has the same set of bounds on all
images, so the compiler may arrange that it
occupies the same set of addresses within each
image (known as symmetric memory).
On a shared-memory machine, a coarray might be
implemented as a single large array.
On any machine, a coarray may be implemented
so that each image can calculate the memory
address of an element on another image.




Synchronization

With a few exceptions, the images execute
asynchronously. If syncs are needed, the user
supplies them explicitly.

Barrier on all images
       sync all

Wait for others
       sync images(image-set)

Limit execution to one image at a time
    critical
        :
    end critical

Limit execution in a more flexible way
    lock(lock_var[6])
        p[6] = p[6] + 1
    unlock(lock_var[6])

These are known as image control statements.
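
As an illustration of the critical construct, here is a minimal sketch in which every image increments a counter held on image 1. The names critical_counter and hits are hypothetical.

```fortran
program critical_counter
  implicit none
  integer :: hits[*] = 0     ! hypothetical counter; only the copy on image 1 is used

  critical
     hits[1] = hits[1] + 1   ! one image at a time performs this update
  end critical

  sync all                   ! all increments are complete before image 1 reads
  if (this_image() == 1) then
     if (hits /= num_images()) error stop 'lost update'
     print '(a,i0)', 'hits = ', hits
  end if
end program critical_counter
```

Without the critical construct, two images could read the same old value of hits[1] and one increment would be lost.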


The sync images statement
Ex 1: make the other images wait for image 1:
if (this_image() == 1) then
   ! Set up coarray data for other images
   sync images(*)
else
   sync images(1)
   ! Use the data set up by image 1
end if
Ex 2: impose the fixed order 1, 2, ... on images:
real :: a, asum[*]
integer :: me,ne
me = this_image()
ne = num_images()
if(me==1) then
     asum = a
else
   sync images( me-1 )
   asum = asum[me-1] + a
end if
if(me<ne) sync images( me+1 )
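
Ex 2 can be made into a complete program as sketched below. The local value a = me is an assumption chosen so that the result can be checked: image me should end with 1 + 2 + ... + me.

```fortran
program ordered_sum
  implicit none
  real :: a, asum[*]
  integer :: me, ne

  me = this_image()
  ne = num_images()
  a = real(me)                     ! hypothetical local contribution

  if (me == 1) then
     asum = a
  else
     sync images(me-1)             ! wait until image me-1 has set its asum
     asum = asum[me-1] + a
  end if
  if (me < ne) sync images(me+1)   ! release image me+1

  ! asum on image me should now hold me*(me+1)/2
  if (abs(asum - 0.5*me*(me+1)) > 1.0e-5) error stop 'wrong partial sum'
  if (me == ne) print '(a,f0.1)', 'total = ', asum
end program ordered_sum
```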




Execution segments
On an image, the sequence of statements
executed before the first image control statement
or between two of them is known as a segment.
For example, this code reads a value on image 1
and broadcasts it.
real :: p[*]
     :                            ! Segment 1
sync all
if (this_image()==1) then ! Segment 2
    read (*,*) p                  !   :
    do i = 2, num_images() !          :
       p[i] = p                   !   :
    end do                        !   :
end if                            ! Segment 2
sync all
     :                            ! Segment 3




Execution segments (cont)
Here we show three segments.
On any image, these are executed in order,
Segment 1, Segment 2, Segment 3.
The sync all statements ensure that Segment 1
on any image precedes Segment 2 on any other
image and similarly for Segments 2 and 3.
However, two segments 1 on different images are
unordered.
Overall, we have a partial ordering.
Important rule: if a variable is defined in a
segment, it must not be referenced, defined, or
become undefined in another segment unless
the segments are ordered.




Dynamic coarrays
Only dynamic form: the allocatable coarray.
real, allocatable :: a(:)[:], s[:,:]
    :
allocate ( a(n)[*], s[-1:p,0:*] )
All images synchronize at an allocate or
deallocate statement so that they can all
perform their allocations and deallocations in the
same order. The bounds, cobounds, and length
parameters must not vary between images.
An allocatable coarray may be a component of a
structure provided the structure and all its
ancestors are scalars that are neither pointers nor
coarrays.
A coarray is not allowed to be a pointer.
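
A minimal sketch of an allocatable coarray in use, assuming a hypothetical size n = 3 that is the same on every image:

```fortran
program alloc_coarray
  implicit none
  real, allocatable :: a(:)[:]
  integer :: n, me, next

  me = this_image()
  n = 3                          ! must have the same value on every image
  allocate ( a(n)[*] )           ! all images synchronize here

  a = real(me)
  sync all                       ! a is defined everywhere before remote reads

  ! Neighbour to the "right", wrapping round to image 1.
  next = merge(1, me + 1, me == num_images())
  if (a(n)[next] /= real(next)) error stop 'unexpected remote value'

  deallocate ( a )               ! all images synchronize here too
end program alloc_coarray
```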




Non-coarray dummy arguments
A coarray may be associated as an actual
argument with a non-coarray dummy argument
(nothing special about this).
A coindexed object (with square brackets) may be
associated as an actual argument with a
non-coarray dummy argument. Copy-in copy-out is
to be expected.
These properties are very important for using
existing code.




Coarray dummy arguments
A dummy argument may be a coarray. It may be
of explicit shape, assumed size, assumed shape,
or allocatable:
subroutine subr(n,w,x,y,z)
   integer :: n
   real :: w(n)[n,*] ! Explicit shape
   real :: x(n,*)[*] ! Assumed size
   real :: y(:,:)[*] ! Assumed shape
   real, allocatable :: z(:)[:,:]
Where the bounds or cobounds are declared, there
is no requirement for consistency between
images. The local values are used to interpret a
remote reference. Different images may be
working independently.
There are rules to ensure that copy-in copy-out of
a coarray is never needed.




Coarrays and SAVE
Unless allocatable or a dummy argument, a
coarray must be given the SAVE attribute.
This is to avoid the need for synchronization
when coarrays go out of scope on return from a
procedure.
Similarly, automatic-array coarrays
subroutine subr (n)
   integer :: n
   real :: w(n)[*]
and array-valued functions
function fun (n)
   integer :: n
   real :: fun(n)[*]
are not permitted, since they would require
synchronization.




Structure components
A coarray may be of a derived type with
allocatable or pointer components.
Pointers must have targets in their own image:
 q => z[i]%p      ! Not allowed
 allocate(z[i]%p) ! Not allowed
Provides a simple but powerful mechanism for
cases where the size varies from image to image,
avoiding loss of optimization.
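
A minimal sketch of the varying-size case, assuming a hypothetical type chunk whose component v is given a different size on each image:

```fortran
program ragged_data
  implicit none
  type :: chunk
     real, allocatable :: v(:)   ! allocatable component, not itself a coarray
  end type chunk
  type(chunk) :: z[*]
  integer :: me, last

  me = this_image()
  allocate ( z%v(me) )           ! local allocation: size me on image me
  z%v = real(me)
  sync all                       ! components are set up before remote access

  last = num_images()
  ! The component on image last has size last, so element last exists there.
  if (z[last]%v(last) /= real(last)) error stop 'unexpected remote value'
end program ragged_data
```

The allocation of z%v involves only the executing image; it is the later remote reference that needs the sync all.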




Program termination
The aim is for an image that terminates normally
(stop or end program) to remain active so that
its data is available to other executing images,
while an error condition leads to quick
termination of all images.
Normal termination occurs in three steps:
initiation, synchronization, and completion. Data
on an image is available to the others until they
all reach the synchronization.
Error termination occurs if any image hits an
error condition or executes an error stop
statement. All other images that have not initiated
error termination do so as soon as possible.




Input/output
Default input (*) is available on image 1 only.
Default output (*) and error output are available
on every image. The files are separate, but their
records will be merged into a single stream, or into
one stream for the output files and one for the error
files. To order the writes from different images,
synchronization and the flush statement are needed.
The open statement connects a file to a unit on
the executing image only.
Whether a file on one image is the same as a file
with the same name on another image is
processor dependent.
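
One way to order writes to default output is to let the images take turns, each flushing before the next is released. This is a sketch only; how the merged stream finally appears remains up to the processor.

```fortran
program ordered_output
  use, intrinsic :: iso_fortran_env, only: output_unit
  implicit none
  integer :: me, i

  me = this_image()
  do i = 1, num_images()
     if (i == me) then
        write (output_unit, '(a,i0)') 'hello from image ', me
        flush (output_unit)      ! push the record out before the barrier
     end if
     sync all                    ! image i+1 writes only after image i has flushed
  end do
end program ordered_output
```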




Optimization
Most of the time, the compiler can optimize as if
the image is on its own, using its temporary
storage such as cache, registers, etc.
There is no coherency requirement while
unordered segments are executing. The
programmer is required to follow the rule: if a
variable is defined in a segment, it must not be
referenced, defined, or become undefined in
another segment unless the segments are ordered.
The compiler also has scope to optimize
communication.




Planned extensions
The following features were part of the proposal
but have moved into a planned Technical Report
on ‘Enhanced Parallel Computing Facilities’:
1.    The collective intrinsic subroutines.
2.    Teams and features that require teams.
3.    The notify and query statements.
4.    File connected on more than one image,
     unless default output or default error.




A comparison with MPI

A colleague (Ashby, 2008) recently converted
most of a large code, SBLI, a finite-difference
formulation of Direct Numerical Simulation
(DNS) of turbulence, from MPI to coarrays using
a small Cray X1E (64 processors).

Since MPI and coarrays can be mixed, he was
able to do this gradually, and he left the solution
writing and the restart facilities in MPI.

Most of the time was taken in halo exchanges and
the code parallelizes well with this number of
processors. The speeds were very similar.

The code clarity (and maintainability) was much
improved. The code for halo exchanges,
excluding comments, was reduced from 176 lines
to 105 and the code to broadcast global
parameters from 230 to 117.


Advantages of coarrays
Easy to write code – the compiler looks
after the communication
References to local data are obvious as
such.
Easy to maintain code – more concise than
MPI and easy to see what is happening
Integrated with Fortran – type checking,
type conversion on assignment, ...
The compiler can optimize communication
Local optimizations still available
Does not make severe demands on the
compiler, e.g. for coherency.




References
Ashby, J.V. and Reid, J.K. (2008). Migrating a
scientific application from MPI to coarrays. CUG
2008 Proceedings. RAL-TR-2008-015, see
http://www.numerical.rl.ac.uk/reports/reports.shtml
Reid, John (2010). Coarrays in the next Fortran
Standard. ISO/IEC/JTC1/SC22/WG5 N1824, see
ftp://ftp.nag.co.uk/sc22wg5/N1801-N1850
Reid, John (2010). The new features of Fortran
2008. ISO/IEC/JTC1/SC22/WG5 N1828, see
ftp://ftp.nag.co.uk/sc22wg5/N1801-N1850
WG5 (2010). FDIS revision of the Fortran
Standard. ISO/IEC/JTC1/SC22/WG5 N1826, see
ftp://ftp.nag.co.uk/sc22wg5/N1801-N1850




Atomic subroutines

  call atomic_define(atom[p],value)
  call atomic_ref(value,atom[p])

The effect of executing an atomic subroutine is as
if the action occurs instantaneously, and thus does
not overlap with other atomic actions that might
occur asynchronously.

It acts on a scalar variable of type
integer(atomic_int_kind) or
logical(atomic_logical_kind).

The kinds are defined in an intrinsic module.

The variable must be a coarray or a coindexed
object.

Atomics are allowed to break the rule that if a
variable is defined in a segment, it must not be
referenced, defined, or become undefined in
another segment unless the segments are ordered.


Sync memory
The sync memory statement is an image control
statement that defines a boundary on an image
between two segments, each of which can be
ordered in some user-defined way with respect to
a segment on another image.
The compiler will almost certainly treat this as a
barrier for code motion, flush all data that might
be accessed by another image to memory, and
subsequently reload such data from memory.




Spin-wait loop

Atomics and sync memory allow the spin-wait
loop, e.g.
  use, intrinsic :: iso_fortran_env
  logical(atomic_logical_kind) :: &
                           locked[*]=.true.
  logical :: val
  integer :: iam, p, q
     :
  iam = this_image()
  if (iam == p) then
     sync memory
     call atomic_define(locked[q],.false.)
  else if (iam == q) then
     val = .true.
     do while (val)
         call atomic_ref(val,locked)
     end do
     sync memory
  end if
Here, segment 1 on p and segment 2 on q are
unordered but locked is atomic so it is OK.
Image q will emerge from its spin when it sees
that locked has become false.


Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxNavinnSomaal
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsMemoori
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 

Recently uploaded (20)

Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piece
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clash
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024
 
The Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdfThe Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdf
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial Buildings
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 

Bcs apsg 2010-05-06_presentation

  • 1. Parallel Programming in Fortran with Coarrays
    John Reid, ISO Fortran Convener, JKR Associates and Rutherford Appleton Laboratory
    Fortran 2008 is now in FDIS ballot: only ‘typos’ permitted at this stage. The biggest change from Fortran 2003 is the addition of coarrays. This talk will introduce coarrays and explain why we believe that they will lead to easier development of parallel programs, faster execution times, and better maintainability.
    Advanced Programming Specialist Group, BCS, London, 6 May 2010.
  • 2. Design objectives
    Coarrays are the brain-child of Bob Numrich (Minnesota Supercomputing Institute, formerly Cray).
    The original design objectives were for:
      A simple extension to Fortran
      Small demands on the implementors
      Retain optimization between synchronizations
      Make remote references apparent
      Provide scope for optimization of communication
    A subset has been implemented by Cray for some ten years. Coarrays have been added to the g95 compiler, are being added to gfortran, and for Intel ‘are at the top of our development list’.
  • 3. Summary of coarray model
    SPMD – Single Program, Multiple Data
    Replicated to a number of images (probably as executables)
    Number of images fixed during execution
    Each image has its own set of variables
    Coarrays are like ordinary variables but have a second set of subscripts in [ ] for access between images
    Images mostly execute asynchronously
    Synchronization: sync all, sync images, lock, unlock, allocate, deallocate, critical construct
    Intrinsics: this_image, num_images, image_index
    Full summary: WG5 N1824 (Google WG5)
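As a minimal sketch of the model summarized on this slide, the program below has every image store its own index in a scalar coarray and then read a neighbour's copy through the [ ] cosubscript. The names hello_images, me, and left are illustrative and do not come from the talk; a compiler with Fortran 2008 coarray support is assumed.

```fortran
program hello_images
  implicit none
  integer :: me[*]        ! scalar coarray: one copy on every image
  integer :: left

  me = this_image()       ! each image stores its own index locally
  sync all                ! every image's value is defined before any remote read

  ! Read the left neighbour's copy through the [ ] cosubscript (wrapping round).
  left = this_image() - 1
  if (left == 0) left = num_images()
  if (me[left] /= left) error stop 'unexpected remote value'
  print *, 'image', this_image(), 'of', num_images(), 'sees', me[left]
end program hello_images
```

The same executable runs on every image; only the value returned by this_image() differs, which is what makes the program SPMD.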
  • 4. Examples of coarray syntax
      real :: r[*], s[0:*]             ! Scalar coarrays
      real, save :: x(n)[*]            ! Array coarray
      type(u), save :: u2(m,n)[np,*]
      ! Coarrays always have assumed
      ! cosize (equal to number of images)
      real :: t                        ! Local
      integer p, q, index(n)           ! variables
        :
      t = s[p]
      x(:) = x(:)[p]
      ! Reference without [] is to local part
      x(:)[p] = x(:)
      u2(i,j)%b(:) = u2(i,j)[p,q]%b(:)
  • 5. Implementation model
    Usually, each image resides on one core. However, several images may share a core (e.g. for debugging) and one image may execute on a cluster (e.g. with OpenMP).
    A coarray has the same set of bounds on all images, so the compiler may arrange that it occupies the same set of addresses within each image (known as symmetric memory).
    On a shared-memory machine, a coarray might be implemented as a single large array.
    On any machine, a coarray may be implemented so that each image can calculate the memory address of an element on another image.
  • 6. Synchronization
    With a few exceptions, the images execute asynchronously. If syncs are needed, the user supplies them explicitly.
    Barrier on all images:
      sync all
    Wait for others:
      sync images(image-set)
    Limit execution to one image at a time:
      critical
        :
      end critical
    Limit execution in a more flexible way:
      lock(lock_var[6])
      p[6] = p[6] + 1
      unlock(lock_var[6])
    These are known as image control statements.
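The lock fragment on this slide can be fleshed out into a small self-checking program. This is only a sketch: the program name lock_counter is invented here, and the counter lives on image 1 rather than image 6 so that the example runs with any number of images. The type lock_type comes from the intrinsic module iso_fortran_env.

```fortran
program lock_counter
  use, intrinsic :: iso_fortran_env, only: lock_type
  implicit none
  type(lock_type) :: lock_var[*]   ! a lock variable must be a coarray
  integer :: p[*]                  ! counter; only the copy on image 1 is updated

  p = 0
  sync all                         ! counter is initialized before anyone updates it

  lock(lock_var[1])                ! acquire the lock that lives on image 1
  p[1] = p[1] + 1                  ! update protected by the lock
  unlock(lock_var[1])

  sync all                         ! all increments are complete
  if (this_image() == 1) then
    if (p /= num_images()) error stop 'lost update'
    print *, 'counter =', p, 'images =', num_images()
  end if
end program lock_counter
```

Without the lock, two images could read the same old value of p[1] and one increment would be lost; the final check on image 1 would then fail.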
  • 7. The sync images statement
    Ex 1: make the other images wait for image 1:
      if (this_image() == 1) then
        ! Set up coarray data for other images
        sync images(*)
      else
        sync images(1)
        ! Use the data set up by image 1
      end if
    Ex 2: impose the fixed order 1, 2, ... on images:
      real :: a, asum[*]
      integer :: me, ne
      me = this_image()
      ne = num_images()
      if (me==1) then
        asum = a
      else
        sync images( me-1 )
        asum = asum[me-1] + a
      end if
      if (me<ne) sync images( me+1 )
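Ex 2 can be turned into a complete program that checks its own result. In this sketch the local contribution a is simply set to the image index (an assumption made for illustration), so that image me should end up holding 1 + 2 + ... + me.

```fortran
program ordered_sum
  implicit none
  real :: a, asum[*]
  integer :: me, ne

  me = this_image()
  ne = num_images()
  a = real(me)                       ! illustrative local contribution

  if (me == 1) then
    asum = a
  else
    sync images(me-1)                ! wait until image me-1 has defined its asum
    asum = asum[me-1] + a
  end if
  if (me < ne) sync images(me+1)     ! release image me+1

  ! Image me now holds 1 + 2 + ... + me = me*(me+1)/2.
  if (abs(asum - real(me*(me+1))/2.0) > 1.0e-3) error stop 'wrong running sum'
  if (me == ne) print *, 'total =', asum
end program ordered_sum
```

Each sync images(me-1) on image me pairs with the sync images(me+1) executed by image me-1, so the images pass through the update one at a time in index order.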
  • 8. Execution segments
    On an image, the sequence of statements executed before the first image control statement or between two of them is known as a segment. For example, this code reads a value on image 1 and broadcasts it.
      real :: p[*]
        :                             ! Segment 1
      sync all
      if (this_image()==1) then       ! Segment 2
        read (*,*) p                  !    :
        do i = 2, num_images()        !    :
          p[i] = p                    !    :
        end do                        !    :
      end if                          ! Segment 2
      sync all
        :                             ! Segment 3
  • 9. Execution segments (cont)
    Here we show three segments. On any image, these are executed in order: Segment 1, Segment 2, Segment 3.
    The sync all statements ensure that Segment 1 on any image precedes Segment 2 on any other image, and similarly for Segments 2 and 3. However, Segment 1 on one image and Segment 1 on another are unordered. Overall, we have a partial ordering.
    Important rule: if a variable is defined in a segment, it must not be referenced, defined, or become undefined in another segment unless the segments are ordered.
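The three-segment broadcast can be written as a runnable sketch. To avoid needing input, the read statement is replaced here by an assignment of the illustrative value 3.5; the program name broadcast is also an assumption of this example.

```fortran
program broadcast
  implicit none
  real :: p[*]
  integer :: i

  p = 0.0                              ! Segment 1: purely local work
  sync all
  if (this_image() == 1) then          ! Segment 2: only image 1 defines p
    p = 3.5                            ! stand-in for read (*,*) p
    do i = 2, num_images()
      p[i] = p
    end do
  end if
  sync all
  ! Segment 3 is ordered after Segment 2 on every image,
  ! so each image may now reference its own p.
  if (p /= 3.5) error stop 'broadcast failed'
  print *, 'image', this_image(), 'has p =', p
end program broadcast
```

Removing the second sync all would leave the reference to p in Segment 3 unordered with respect to its definition by image 1, which breaks the rule stated on the slide.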
  • 10. Dynamic coarrays
    Only dynamic form: the allocatable coarray.
      real, allocatable :: a(:)[:], s[:,:]
        :
      allocate ( a(n)[*], s[-1:p,0:*] )
    All images synchronize at an allocate or deallocate statement so that they can all perform their allocations and deallocations in the same order. The bounds, cobounds, and length parameters must not vary between images.
    An allocatable coarray may be a component of a structure provided the structure and all its ancestors are scalars that are neither pointers nor coarrays.
    A coarray is not allowed to be a pointer.
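A short sketch of an allocatable coarray in use follows. The size n = 4 and the names dynamic_coarray and right are chosen for illustration only; the point is that every image executes the same allocate with the same bounds.

```fortran
program dynamic_coarray
  implicit none
  real, allocatable :: a(:)[:]
  integer :: n, right

  n = 4                                ! must be the same on every image
  allocate (a(n)[*])                   ! all images synchronize here
  a = real(this_image())
  sync all                             ! values are defined before remote reads

  right = this_image() + 1
  if (right > num_images()) right = 1
  if (any(a(:)[right] /= real(right))) error stop 'unexpected neighbour data'
  print *, 'image', this_image(), 'read', a(1)[right], 'from image', right

  deallocate (a)                       ! also an image control statement
end program dynamic_coarray
```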
  • 11. Non-coarray dummy arguments
    A coarray may be associated as an actual argument with a non-coarray dummy argument (nothing special about this).
    A coindexed object (with square brackets) may be associated as an actual argument with a non-coarray dummy argument. Copy-in copy-out is to be expected.
    These properties are very important for using existing code.
  • 12. Coarray dummy arguments
    A dummy argument may be a coarray. It may be of explicit shape, assumed size, assumed shape, or allocatable:
      subroutine subr(n,w,x,y,z)
        integer :: n
        real :: w(n)[n,*]               ! Explicit shape
        real :: x(n,*)[*]               ! Assumed size
        real :: y(:,:)[*]               ! Assumed shape
        real, allocatable :: z(:)[:,:]
    Where the bounds or cobounds are declared, there is no requirement for consistency between images. The local values are used to interpret a remote reference. Different images may be working independently.
    There are rules to ensure that copy-in copy-out of a coarray is never needed.
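To show a coarray dummy argument at work, here is a hedged sketch in which a module procedure reads a neighbouring image's data through an explicit-shape coarray dummy. The module shift_mod, the procedure fetch_left, and the array size are all invented for this example.

```fortran
module shift_mod
  implicit none
contains
  ! Copy the left neighbour's x into the local array y.
  subroutine fetch_left(n, x, y)
    integer, intent(in) :: n
    real, intent(in)    :: x(n)[*]     ! explicit-shape coarray dummy argument
    real, intent(out)   :: y(n)
    integer :: left
    left = this_image() - 1
    if (left == 0) left = num_images()
    y = x(:)[left]                     ! remote reference through the dummy
  end subroutine fetch_left
end module shift_mod

program dummy_demo
  use shift_mod
  implicit none
  integer, parameter :: n = 3
  real, save :: x(n)[*]                ! the actual argument is itself a coarray
  real :: y(n)
  integer :: left

  x = real(this_image())
  sync all                             ! x is defined on every image before the call
  call fetch_left(n, x, y)

  left = this_image() - 1
  if (left == 0) left = num_images()
  if (any(y /= real(left))) error stop 'unexpected data from neighbour'
  print *, 'image', this_image(), 'fetched', y(1), 'from image', left
end program dummy_demo
```

The procedure contains no synchronization of its own; the caller is responsible for ordering the segments, here with the sync all before the call.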
  • 13. Coarrays and SAVE
    Unless allocatable or a dummy argument, a coarray must be given the SAVE attribute. This is to avoid the need for synchronization when coarrays go out of scope on return from a procedure.
    Similarly, automatic-array coarrays
      subroutine subr (n)
        integer :: n
        real :: w(n)[*]
    and array-valued functions
      function fun (n)
        integer :: n
        real :: fun(n)[*]
    are not permitted, since they would require synchronization.
  • 14. Structure components
    A coarray may be of a derived type with allocatable or pointer components. Pointers must have targets in their own image:
      q => z[i]%p         ! Not allowed
      allocate(z[i]%p)    ! Not allowed
    This provides a simple but powerful mechanism for cases where the size varies from image to image, avoiding loss of optimization.
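The following sketch illustrates sizes that vary from image to image: image me allocates me elements in an allocatable component of a coarray of derived type. The type chunk and the other names are hypothetical, and remote access to allocatable components is assumed to be supported by the compiler in use, which may not hold for every implementation.

```fortran
program ragged
  implicit none
  type :: chunk
    real, allocatable :: v(:)          ! length may differ from image to image
  end type chunk
  type(chunk) :: z[*]
  integer :: me, left

  me = this_image()
  allocate (z%v(me))                   ! local allocation: image me holds me elements
  z%v = 1.0
  sync all                             ! every component is allocated and defined

  left = me - 1
  if (left == 0) left = num_images()
  ! Image left is known to hold left elements, each equal to 1.0.
  if (nint(sum(z[left]%v(1:left))) /= left) error stop 'unexpected remote data'
  print *, 'image', me, 'summed', left, 'elements held on image', left
end program ragged
```

The allocate statement acts on the local component only, so it needs no synchronization; the coarray z itself has the same shape on every image.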
  • 15. Program termination
    The aim is for an image that terminates normally (stop or end program) to remain active so that its data is available to other executing images, while an error condition leads to quick termination of all images.
    Normal termination occurs in three steps: initiation, synchronization, and completion. Data on an image is available to the others until they all reach the synchronization.
    Error termination occurs if any image hits an error condition or executes an error stop statement. All other images that have not initiated error termination do so as soon as possible.
  • 16. Input/output
    Default input (*) is available on image 1 only.
    Default output (*) and error output are available on every image. The files are separate, but their records will be merged either into a single stream, or into one stream for the output files and one for the error files.
    To order the writes from different images, synchronization and the flush statement are needed.
    The open statement connects a file to a unit on the executing image only. Whether a file on one image is the same as a file with the same name on another image is processor dependent.
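One way to combine synchronization with the flush statement so that images write in index order is sketched below. The program name ordered_output is illustrative, and how the processor finally merges the per-image output streams remains processor dependent.

```fortran
program ordered_output
  use, intrinsic :: iso_fortran_env, only: output_unit
  implicit none
  integer :: me

  me = this_image()
  if (me > 1) sync images(me-1)        ! wait for the previous image's write
  write (output_unit, '(a,i0)') 'hello from image ', me
  flush (output_unit)                  ! push the record out before releasing the next image
  if (me < num_images()) sync images(me+1)
end program ordered_output
```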
  • 17. Optimization
    Most of the time, the compiler can optimize as if the image is on its own, using its temporary storage such as cache, registers, etc.
    There is no coherency requirement while unordered segments are executing. The programmer is required to follow the rule: if a variable is defined in a segment, it must not be referenced, defined, or become undefined in another segment unless the segments are ordered.
    The compiler also has scope to optimize communication.
  • 18. Planned extensions
    The following features were part of the proposal but have moved into a planned Technical Report on ‘Enhanced Parallel Computing Facilities’:
      1. The collective intrinsic subroutines.
      2. Teams and features that require teams.
      3. The notify and query statements.
      4. File connected on more than one image, unless default output or default error.
  • 19. A comparison with MPI
    A colleague (Ashby, 2008) recently converted most of a large code, SBLI, a finite-difference formulation of Direct Numerical Simulation (DNS) of turbulence, from MPI to coarrays using a small Cray X1E (64 processors).
    Since MPI and coarrays can be mixed, he was able to do this gradually, and he left the solution writing and the restart facilities in MPI.
    Most of the time was taken in halo exchanges and the code parallelizes well with this number of processors. The speeds were very similar.
    The code clarity (and maintainability) was much improved. The code for halo exchanges, excluding comments, was reduced from 176 lines to 105 and the code to broadcast global parameters from 230 to 117.
  • 20. Advantages of coarrays
    Easy to write code – the compiler looks after the communication
    References to local data are obvious as such
    Easy to maintain code – more concise than MPI and easy to see what is happening
    Integrated with Fortran – type checking, type conversion on assignment, ...
    The compiler can optimize communication
    Local optimizations still available
    Does not make severe demands on the compiler, e.g. for coherency
  • 21. References
    Ashby, J.V. and Reid, J.K. (2008). Migrating a scientific application from MPI to coarrays. CUG 2008 Proceedings. RAL-TR-2008-015, see http://www.numerical.rl.ac.uk/reports/reports.shtml
    Reid, John (2010). Coarrays in the next Fortran Standard. ISO/IEC/JTC1/SC22/WG5 N1824, see ftp://ftp.nag.co.uk/sc22wg5/N1801-N1850
    Reid, John (2010). The new features of Fortran 2008. ISO/IEC/JTC1/SC22/WG5 N1828, see ftp://ftp.nag.co.uk/sc22wg5/N1801-N1850
    WG5 (2010). FDIS revision of the Fortran Standard. ISO/IEC/JTC1/SC22/WG5 N1826, see ftp://ftp.nag.co.uk/sc22wg5/N1801-N1850
  • 22. Atomic subroutines
      call atomic_define(atom[p],value)
      call atomic_ref(value,atom[p])
    The effect of executing an atomic subroutine is as if the action occurs instantaneously, and thus does not overlap with other atomic actions that might occur asynchronously.
    It acts on a scalar variable of type integer(atomic_int_kind) or logical(atomic_logical_kind). The kinds are defined in an intrinsic module. The variable must be a coarray or a coindexed object.
    Atomics are allowed to break the rule that if a variable is defined in a segment, it must not be referenced, defined, or become undefined in another segment unless the segments are ordered.
  • 23. Sync memory
    The sync memory statement is an image control statement that defines a boundary on an image between two segments, each of which can be ordered in some user-defined way with respect to a segment on another image.
    The compiler will almost certainly treat this as a barrier for code motion, flush all data that might be accessed by another image to memory, and subsequently reload such data from memory.
  • 24. Spin-wait loop
    Atomics and sync memory allow the spin-wait loop, e.g.
      use, intrinsic :: iso_fortran_env
      logical(atomic_logical_kind) :: &
          locked[*]=.true.
      logical :: val
      integer :: iam, p, q
        :
      iam = this_image()
      if (iam == p) then
        sync memory
        call atomic_define(locked[q],.false.)
      else if (iam == q) then
        val = .true.
        do while (val)
          call atomic_ref(val,locked)
        end do
        sync memory
      end if
    Here, segment 1 on p and segment 2 on q are unordered but locked is atomic so it is OK. Image q will emerge from its spin when it sees that locked has become false.
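The spin-wait fragment leaves p and q unspecified. A complete sketch, under the assumption that image 1 does the releasing and the last image does the waiting, might look as follows; with a single image there is nothing to wait for, so the program simply stops.

```fortran
program spin_wait
  use, intrinsic :: iso_fortran_env, only: atomic_logical_kind
  implicit none
  logical(atomic_logical_kind) :: locked[*] = .true.
  logical :: val
  integer :: iam, p, q

  iam = this_image()
  p = 1                                ! releasing image (illustrative choice)
  q = num_images()                     ! waiting image (illustrative choice)
  if (p == q) stop 'needs at least two images'

  if (iam == p) then
    ! Work that image q must wait for would go here.
    sync memory                        ! complete that work before the release
    call atomic_define(locked[q], .false.)
  else if (iam == q) then
    val = .true.
    do while (val)                     ! spin until image p clears the flag
      call atomic_ref(val, locked)
    end do
    sync memory                        ! later references see image p's work
    print *, 'image', q, 'released by image', p
  end if
end program spin_wait
```

Images other than p and q skip both branches and terminate normally, which does not disturb the two images taking part.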