ThreadModel rev 1.4

Threading and Concurrency
in
Proton Core™
A reusable C++ base model
for concurrent and asynchronous computing
(logo)
Revision 1.4

ii
Copyright © 2015 Thread Concepts, LLC.
All rights reserved.
Revision 1.4, September 2015
ISBN: 978-0-9908699-1-7
Author: Creator of Proton Core™, Christopher Cochran
Contact: chris.cochran.protoncore@gmail.com
Product names appearing in this manual are for identification purposes only, any trademarks, product names
or brand names appearing in this document are the property of their respective owners. Any function
names and other programmatic identifiers that happen match trademarks or brand names are purely
coincidental, and relate only to the local concepts described herein. Proton Core™ is a trademark of Thread
Concepts, LLC. Any instances of the name “Proton” found in this and related documentation and source
code, all refer to “Proton Core™”.
Any specifications contained within this document are subject to change without notice, and do not
represent any commitment by the manufacturer to be complete, correct, finalized, or necessarily suitable for
any particular purpose.
Thread Concepts, LLC
Digitally signed by Thread Concepts, LLC
DN: serialNumber=2zlmm5pwjvz5ghg0, c=US,
st=California, l=Fairfax, o=Thread Concepts, LLC,
cn=Thread Concepts, LLC
Date: 2015.09.26 12:48:09 -07'00'

1
Threading and Concurrency in Proton Core™
Background
From even before the introduction of Visual Basic, Java and Visual Studio, software development has
typically been based on top of some unified supporting API, or Universal Library with broad coverage
across computation, storage, networking and i/o services. Microsoft has done an exceptionally good job
at this, producing APIs like Win32, MFC, ActiveX, Visual Basic, CLR, C#, .Net and others. The trouble
is, relying on such vast software systems comes with some unexpected costs, risks and ramifications,
including:
•• Today’s .net and java systems lock you into a garbage collection programming model, an approach with
known uneven performance (denied by proponents), for the most critical, essential computational resource
of all—operational memory. Combine that with a pseudo code virtual machine, and you have a system
guaranteed to produce mediocre performance now, and on into the future (despite additional denials). Such
systems give up higher performance in order to achieve internet portability, a useful and common tradeoff.
•• Without a reusable higher-level threading and shared memory strategy, multithreaded solutions are often
newly reinvented and built-up from lower-level constructs for each application. Although this can work
well, it can also lead to unexpected complexities, disappointing concurrent processor utilization and creeping
development schedules.
•• Many foundation APIs can be replaced with better methods, resulting in higher performance from software
depending on them, including memory management, string processing, sorting, data conversion and
transformation and others.
•• Reusing software you have already built in future developments is necessary to controlling development
costs and schedules. But this process is prematurely and regularly disrupted by the push to move into the
“new wave”, rendering years of good work artificially useless when it is not immediately portable to the “new
system”. You are often left to your own devices for any bridges back to your valued past work, if it is feasible.
•• The compatibility between C and C++ led over the years to a “race to the bottom”, with most libraries and
frameworks developing in the lowest common denominator, in C. This has slowed the adoption of class-
based object-oriented methods available from C++, foregoing its superior reusability, unlimited logical
stratification and simpler management of large applications, compared with C. In C, your thinking is
dominated by the implementation, in C++, your thinking is dominated by the problem domain at hand.
•• Due to historical sequencing, treating strings as character arrays is firmly entrenched in many string
processing systems, including standard strings of C and C++. The trouble is, that approach only works when
characters are the same width, while Unicode characters are variable width. This is a serious flaw, a collision
with the past that causes a variety of bugs in many installed software bases world-wide to this day. Unicode
strings in Proton avoids this problem, with true character-oriented strings and processing services.
•• High-profile, commonly used application models, from Microsoft and others, are studied by hackers for
vulnerabilities to attack or exploit. Products based upon independently developed logical systems are not as
commonly or quickly assimilated.

2
Because of these and other characteristics, some vendors have developed smaller, faster, more focused
and domain-specific, application-supporting infrastructures. Some systems concentrate more on the
fundamentals, use non-proprietary languages, support composition and specialization, and play well with
others. Proton over C++ provides this kind of application support.
Proton is a C++ framework that arose from the end of the “free lunch” era, when clock rates stopped
doubling every two years. Proton was developed to advance performance still further, by combining
the most critical elements together for tighter interaction: memory management, multithreading and
event processing. Over that substructure, Proton provides mid-level data processing services, including
comprehensive treatment of strings, arrays, vectors and file access. Expressions of aggregate data perform
iterative computation without loops or subscripting. Strings and arrays dynamically manage themselves to
automatically provide memory space based on present needs. These services are designed as an everyday
programming model with more concise logic and shorter code. Move constructor technology eliminates
the temporary value problem, making expressions of aggregate values practical and highly efficient.
Preface
This document provides a summary of the asynchronous programming services provided by the Proton
framework. While the product manual goes into more depth and detail across all topics using Proton, this
discussion focuses on the architecture and motivation for the Proton concurrency solutions now running.
Proton presents essentially the same programming models and APIs under Windows and Linux. Proton
applications compile and run under both operating systems, and provide all the same services. Proton uses
and relies on the Windows API over MS Windows, and on Winelib over Linux.
Proton makes space for both the logical flexibility of multithreading and the opportunities for
concurrency it brings. One goal is to be able to write application logic in C++ that easily scales in
performance with the number of processors available to run them. This goal might be easy and off
the shelf, if turning a pile of threads loose in your application were all there was to it. But it’s not that
simple—as soon as different threads begin interacting with the same data, and each other—synchronizing
data transactions becomes necessary, opening the door to delays, deadlocks, performance degradation,
and interlocking complexities. The locking methods that work so well to control transaction access, can
deadlock, don’t scale well to large numbers of concurrent transactions, and their use in finer-grained
synchronization can severely degrade transaction performance.
The multithreaded event processing model in Proton is built from a thread reactor pool combined with
active objects, virtual blocking, dynamic resource management, lock-free transactions, RAII orientation,
and other well-founded supporting models and design patterns. The resulting Actor model effectively
unifies multithreading with event processing in C++, supporting it with custom per-thread allocation
services designed for this work, and scales up to the available processors and across active threads for as far
as virtual memory will stretch. For applications large and small, Proton runs best in 64-bit, where virtual
memory is practically limitless and 64-bit logic can be much faster.
Proton applications consist of some set of independent threads, each with their own independent
roles to play, that often collaborate on work initiated by one another. An application begins with a
main thread and a Windows message pump thread, with additional threads started as needed for the
application. Threads have names and can start or find each other, and use native non-blocking event-
based communication among themselves. Passing tasks to others is an effective way to spread work
among processors, while maintaining asynchronous operation throughout.

3
C++ object model
Threads, tasks and events are objects with virtual behavior that you define. Objects give you something
to create and lifetimes to manage, with member functions local to the object and out of the global
space. Threads start up when you post events to them, stay alive with AddRef ( ) calls, and retire upon
final Release ( ). All of the handles, identifiers, operating system calls and other details required are
encapsulated within these objects, making them simple and low-noise to use, portable into other
environments and bullet proof.
Proton base classes encapsulate the layered complexities of multithreading, event processing and dynamic
allocation services, so that client application logic can focus on its own specific activities. Virtual
functions are defined on objects everywhere to support custom operations and choices. A variety of
Proton debugging hooks and procedures are provided to make all of these pieces practical to use in the
face of rough and tumble development and imperfect programming. Defaults exist everywhere to do
something reasonable until you choose to do something more.
Free-running thread roles
The model for managing multithreaded complexities starts with a set of threads acting out their own
independent roles,roles of your choice that you program. Threads in Proton are defined by the data and
behavior ( ) that you put into them,as specified in the thread classes you define over Proton thread base classes.
Proton provides a thread-local free-running environment for unimpeded thread execution,free from interference
from other threads,each computing generally in their own logical data space atomic-free—nearly all of the time.
When thread roles communicate by posting non-blocking events and tasks among themselves,you have the
Actor Model—a strategy that first appeared in Errlang,a language from the 1980s that was far ahead of its time.
Posting tasks and events
When one role needs something to happen in another role, it posts events or tasks to that role, or to
an object managed by that role. These tasks are serviced by the receiving thread and run in the order
posted. Tasks and events can do anything in the receiving thread context, including reposting themselves
or posting additional tasks to other recipients, and finally auto-releasing after use. Tasks in Proton are
defined by the data and action ( ) that you put into them, as specified in the task classes you build over
Proton task base classes. Custody of task data changes hands from one thread to the next, and is normally
assumed to be unaccessed by others, to provide unfettered non-atomic access to thread, task and event data,
wherever it goes. Tasks can finish after one use, or live on to be posted to another thread, or go back to the
thread that posted it. Many degrees of freedom exist here.
Task and event processing
In order to be responsive to events from other thread roles, threads must process their events from
time to time. This is performed implicitly while blocking for completions or time-outs, or explicitly
whenever needed, by calling a DoEvents ( ) function to immediately get up-to-date. Each call services
all accumulated work available for the calling thread. Each thread controls its own event latency by how
often it is able to process the events and tasks it receives. It also chooses where in its logic to service
things, to tightly control order and placement of local completions.

4
Busy threads choose their moment to service their own events and tasks, and service them in the order
received. Blocking becomes just another excuse to service thread-local events and tasks. Threads only
block when they have nothing better to do, but otherwise are free to go about their own independent
business. In this way, all threads become event processors, an arrangement that is fast and flexible, that
supports a wide domain of multithreaded application models, and that can scale to hundreds of threads in
32-bit model (more in 64-bit), across any number of processors.
Concurrent central workheap
The concurrency available from this multithreaded event model can sometimes be more coincidental than
the kind of focused high-utilization concurrency we would like to see. So for that, threads may also post
tasks to the application-wide workheap, where they are immediately processed by all available processors,
as directed by processing load. This capability is similar to the ExecutorService in Java, and comes
preassembled and ready to accept new tasks, with immediate and aggressive performance, low-overhead
and fully integrated into the Proton programming model.
The Proton workheap is always-on, and its automatic processing doesn’t wait—it starts the moment
it goes non-empty. Proton manages this and makes it practical and robust, within an overall discrete
workflow approach that may best be described by considering more of the elements involved:
Task and event workflow
•• threads package up events and tasks with the data necessary for others to process
•• client threads post unordered events and tasks, service threads process them
•• tasks go into light-weight lock-free queues with low internal latency
•• tasks usually don’t need critical sections unless required by application data
•• task and event-based work flow passes between and among threads
•• a task is an action ( ) with the data it needs, that you hand off to do in another thread
•• event response is defined by its recipient, task action ( ) is defined by its sender
•• you can post tasks anytime and they can start executing immediately upon arrival
•• a completion task may run after completing a group of tasks posted to different threads
•• threads normally carry on independently, block, compute, perform i/o and other duties
•• work tasks are buffered, submitted and processed in single-cacheline units called worksets
•• posting work tasks is atomic-free, worksets are internally posted wait-free and unbounded
•• event and task objects are faster, more manageable and generic than message passing
Concurrent processing opportunities
•• multiple threads acting in specific roles while servicing and posting tasks between them
•• multi-stream tasks to a global workheap for automatic multi-processor servicing
•• large tasks can recursively split into smaller pieces and repost to other processors
•• event-driven scheduling and completion of concurrent external processes
•• use completion tasks to post and orchestrate the next set of concurrent tasks
•• multiple completion groups may proceed concurrently

5
•• event producers provide hints (size, ordering, priority, etc.) for better service routing
•• work may be posted in phases, such that phase (i) is finished before phase (i+1) is started
•• phase barriers may limit the extent to which concurrency can be achieved
•• larger work phases allow for more concurrency than smaller phases
•• number of processors servicing the workheap follows the workload
Virtual blocking
•• virtual blocking state puts waiting threads to work servicing events and tasks
•• blocking threads maintain a pool of themselves, for use as concurrent service threads
•• hard blocked threads are wakeable for multiprocessing service
•• threads (hard) block only when they have nothing else to do,when work really has run out
•• open files in ASYNC mode continue servicing events while blocking during file i/o
Notifiers
•• notifiers allow posting threads to define actions later triggered by service completions
•• a notifier is for client-specific actions that necessarily originate in the service thread
•• notifiers can be defined by any thread, for other threads to activate like signals
•• servicing threads can notify posting threads when their local work completes
•• invoking undefined notifiers is harmless and does nothing, redefinition replaces them
•• unlike event responses that run in the target thread, notifiers run in the notifying thread
•• notifiers are logically similar to events but more direct and immediate
•• notifiers can post events or directly take any action as required
•• multi-threaded precautions are often necessary in notifier responses
Containment and Tasks
Much of the Proton approach is designed to promote containment for managing most things, to give
your code normal non-atomic access to objects and data, nearly all the time. Containment just means
that objects have current thread affinity, and only one thread may access them at any given time. If I have
such an object, I can modify it all I want, then give it to you, then you can access and modify it all you
want—correct, fast and simple. This is how tasks work, and posting a task to another thread moves the
task object from one thread containment to another. Task code needs no further synchronization unless it
is introduced by special needs of its application logic. An effective containment policy generally reduces
complexity and improves performance.
Well-written Proton applications take advantage of the containment model as a base approach, helping
to keep multithreaded hard transactions relatively rare. Proton memory management supports this, and
allows a memory aquisition and later release to occur in different threads as needed. Proton supports
multi-threaded transactions most often by encapsulating their complexity within objects that are cleaner
to use in everyday programming. All the Proton atomics and related structures and APIs are also
available for use in pursuit of your goals.

6
More on posting events and tasks
Threads post events and tasks for other threads to perform. An event is a message block where the
recipient defines the response, a task is one where the sender defines the response. These transactions
are serviced synchronously, at a moment chosen by the receiving thread, by calling its thread-local
DoEvents ( ) function. This call is made implicitly when blocking for timeouts or other wait conditions.
Threads may explicitly call DoEvents ( ) to process all events and tasks that have queued up since the
last such call. Threads may choose to put off calling DoEvents ( ), until they need to bring thread-local
data state up-to-date. When thread-local work dries up, DoEvents ( ) processes available work in the
concurrent workheap, along with other threads so inclined, up to the available processors. Inattentive
threads can call DoEvents ( ) interspersed within any logical process to improve event processing latency,
as needed. Each call processes all pending activity and returns quickly when there is nothing to do (under
100 machine cycles).
An event is a spontaneous occurrence that requires someone’s attention and often a response, e.g. a key is
struck on the keyboard, or a request arrives from a remote procedure call. Events are handled differently
depending on the type of event, who sends it and who receives it. A task is an event that contains a
specific response to be played-upon-opening by its servicing thread. You create tasks, in order to post
them to other threads to perform, and as such, they may contain any data necessary to their completion.
Tasks have less overhead than events and remain efficient with small payloads down to a few hundred
machine cycles. Events are normally processed in the order they were posted; tasks may or may not have
ordering requirements, depending on the tasks and their context. Some additional related points include:
•• Events are data objects you send to recipients, who expect them and provide their own response
•• Tasks are like events, but define their own response for the recipient to run upon opening
•• Concurrent data transactions in events/tasks must provide their own synchronization as needed
•• Transactions performed completely within specific threads don’t need synchronization
•• Tasks require at least 250 machine cycles to create, initialize and post for servicing
Events and tasks are posted to threads and stored in queues for subsequent servicing in the order they
arrived. Windows, threads and other active objects each have their own event queues, so there is no global
queue being maintained anywhere. Events are handled by responses defined by the active objects to which
those events were posted. Responses are typically defined for a complete set of events, which is created
and selected into the active object involved. DoEvents ( ) applies the correct response for each event it
services, and executes each task directly in the order they arrive. On completion, calling Release ( ) on the
task/event finishes the transaction. Tasks can be defined easily and directly:
class SimpleTask : public TaskEvent { // Tasks are simple and straightforward to define
Integer value;
String string; // define whatever data your task needs to run
public:
SimpleTask (Integer i, String s) : value(i), string(s) { your_init ( ); }
virtual void action ( ) { do_something ( ); } // this is what you want the task to perform
};

7
Later, you can post a new task in this manner: thread‑>postevent (new SimpleTask (666,“test”)); any
number of these may be created and posted like this. Each task is run by the receiving thread, followed
by task‑>Release ( ) on completion. Tasks generally find their operational data in, or referred to in the
task object, as setup by its own constructor. Task objects can live on for repeated posting, by calling
task‑>AddRef ( ) for each additional cycle, before reposting them (normally with further set up).
Ordered tasks and events may be posted to active objects and threads, by any thread at any time. Each
thread only attends to servicing its own queue(s), allowing others to do the same. There are no particular
limits on the number of threads that may participate in this manner, nor limits on event queue capacity,
other than the available memory resources and enough processors to move things along.
Event coordinators
Posting tasks to specific threads is simple, fast and and direct. But sometimes this can be too direct, as it
expects posting threads to know what specific threads to use. This is fine for simple applications, but can
introduce unwanted task management dependencies in more complex situations. The solution to this is
the EventCoordinator abstraction, for indirectly posting tasks which are then (or later) routed to specific
destination threads. Participating end points register themselves with known coordinator objects, that
are programmed to distribute incoming tasks to appropriate threads or other end point objects—without
having to know the specific threads ahead of time.
There are many interesting ways to use this abstraction. A coordinator can choose an under-utilized end
point over those that are saturated, to balance loads and increase concurrency. A coordinator can hide
logical distances between threads, routing some over networks and others to local threads or processes.
Internal task routing policies can be modified within coordinators without changing the participating
client thread logic. Client threads can be isolated from details having security or proprietary sensitivities,
while allowing them to run freely, without complications. Like many objects in Proton, coordinator
lifetimes are reference count managed, to operate for as long as they are being used, and automatically
freed when not.
Task interthreading
Normally, tasks live to execute their action ( ) just once, followed by Release ( ) and they are gone. But it
can be useful in a task’s action ( ) to be able to call a switch_to (thread) function, to allow specific threads
to contribute their own pieces to a larger task assembly. Task interthreading is simpler and faster than
generating a new task for every thread switch along the way, because there is only one task object passed
around for the duration.
This is possible because tasks provide functional closure, designed to be posted to, and run in, other
threads. Until the task action ( ) itself finally returns, your code can switch from thread to thread, running
different code in each thread it visits, along an arbitrary finite state machine. Such a sequence can itself be
considered a virtual thread, and arbitrarily many such threads can be active simultaneously, over no more
actual threads than the available processors permit.

8
TaskGroups and completion tasks
A TaskGroup is a temporary object that associates a set of tasks, of arbitrary number, with a common
completion task, posted to the thread of your choice after all dependent tasks have been completed.
The included tasks are those posted from the calling thread, to other threads for servicing, while the
TaskGroup object is active. The posting logic and the tasks themselves require no change to be used in
this manner, and no knowledge of the group is required by the included tasks to function properly. Hence
you can group the tasks of arbitrary existing task logic behind a common completion task--from the
outside. Multiple groups of completing tasks can run like this independently and concurrently.
A likely use for task groups, is to group multiple tasks sent to the Proton workheap with other tasks,
behind a single completion task that safely acts once those tasks have completed. For example, computing
a screen graphic using multiple task threads of the workheap normally requires a completion task to
invalidate the window rectangle--but only after the results all stablize. The completion task is ultimately
executed by a chosen target thread. Completion tasks can start up other sequences with their own
completion tasks, and march on indefinitely over the prevailing thread activity.
Virtual blocking
Each thread carries on its own process, independent of other threads. When a thread blocks for input, time
delays, or other application-level wait conditions, its uses a virtualized blocking state. To software, this state
appears like normal blocking, except during this period the thread may continue to run, staying productive
performing other activities while waiting for the logical block to release. Such activities include:
•• servicing thread-local events and tasks arriving in its queues
•• thread-centric housekeeping, resource management, maintaining runtime statistics
•• servicing the application concurrent workheap, alongside other processing threads
•• performing potential deadlock detection and other run-time integrity testing
•• really blocking when there is nothing else to do, but wakeable for workheap service
In this environment, threads stay productive longer, more fully utilizing their time-slice, and support
concurrent tasks, up to the number of available processors. Porting single-threaded software into multi-
threaded designs can benefit from this approach. The request-block-continue style prevalent in single-
threaded designs can often remain a sensible logical model, as long as its blocking state can be made
productive and stay out of (deadlock) trouble.
Proton makes no attempt to “hook into” the blocking mechanisms provided by the operating system
to provide virtual blocking capabilities. Rather, you explicitly substitute Proton blocking methods
for the standard waiting functions, wherever you desire virtual blocking services. Proton provides
WaitForObject ( ) and WaitForAnyObject ( ) functions to use for blocking with time-outs on waitable
objects, like threads, events, signals, processes, etc. Virtual blocking works with any object on which you
would normally call WaitForSingleObject ( ) and other related functions, with similar argument structure
and return values, so using them in existing C++ code is pretty painless.
Combining SignalReactors with the DoEvents ( ) architecture adds significant flexibility to Proton
event handling. These are tasks that you define and post to install a prepared signal and response into
any thread’s DoEvents ( ) handler. It is designed to enable custom DoEvents ( ) response to some set of
signals that become active while a thread processes its regular task and event traffic. It can be used for any

9
thread-specific application signals whose responses fit this model. It services its signal until either you
remove ( ) it or its thread exits.
Activity monitor
Pressing ctl-alt- into a running Proton application window, brings up a side-window into the immediate
activity and statistics across its running threads. This includes thread names, event latency, atomic retry
contention, critical section collisions, exceptions thrown, memory allocated, timers active, number of
windows and controls, etc. It helps you identify threads that appear to be non-responsive, modal, stuck in a
loop or creeping recursion, leaking memory, threads with latency to spare, and those barely awake. It shows
unprocessed work levels, events posted and processed, event latency and relative thread loads. Putting the
mouse over any of its data columns elicits a short description of it, making it largely self-explanatory.
The monitor lets you view your multiprocessing operation in action, to see that key indicators are
occurring as they should, in real-time. Its display runs continuously until you terminate it, and may be
viewed in any Proton application that allows it to be viewed, enabled by default. The monitor state at
application exit is restored when the application starts up the next time. It becomes invaluable during the
debug, shakeout and testing phases of multi-threaded software development, showing significant internal
run-time characteristics not available from debuggers or external process viewers. Its event-driven
operation incurs a negligible 0.1% machine load when running, and acts as a continuously running Proton
self-test, to demonstrate core integrity during any running Proton application session.
Multiprocessing unordered tasks
The application workheap holds batches of tasks to be multiprocessed at the highest achievable processing
rates. Proton puts idle application threads to work processing the earliest work phases, up to the number
of available processors. When there are not enough available and willing threads to service the workload,
additional threads are started automatically, which run until the workload is consumed, then exit after a
few moments of quiescence.
But you can’t just stuff anything you like into the multiprocessing workheap. Events and tasks that logically
must follow the completion of others, must all be posted together, to a specific thread, to guarantee the
servicing order they require. The multiprocessing workheap is intended only for tasks that can be processed
independently, in any order within a work phase, that make little or no use of critical sections. See the
workheap->reserve(n) function for a method to preserve fine-grained task order in the workheap.
Normally, events and tasks are posted for single threaded servicing in the order posted, by the receiving
thread. But when you mark an event “unordered” and post it, postevent ( ) routes it to the multiprocessing
workheap, where it goes into the current work phase. Calling the function workheap‑>postevent(task)
also puts a task into the calling thread’s active work phase.
The act of posting unordered events and tasks, advertises a thread’s work for discovery by active service
threads, looking to find and process any available workload. Other threads will often be similarly engaged.
The service threads are merely those that happen to call DoEvents ( ), or those that were blocking and
reawakened to be put into service. Neither the posting threads, nor the service threads need any particular
awareness of this process—it just happens. Tasks are initialized as “unordered” or not, so any thread
posting them can implicitly participate in the process.

10
Processor considerations
Proton applications can operate on just one or two processors, but things get more interesting with more.
With four processors well utilized, you can expect a throughput resembling 3.8 effective processors, and
higher when memory contention stays low. Similar ratios for 6 and 12 processors should hold. Proton
however will use all the physical CPUs that are made available to it, excluding hyperthreads.
Hyperthreading is supported, doubling the number of schedulable processors when HT is active. But
since each pair of HT processors share one physical core, they don’t typically add further performance, and
can run a bit slower with twice the threads doing half the work. Because of this, the workheap limits itself
to available physical processors, even when HT is active. The workheap‑>exclude (n) function lets you
reduce the effective processor count even more, to keep other things running the way you want. Specific
threads that cannot tolerate any extra delays can mark themselves with thread->is (RealTime) to keep
from being conscripted into service by the workheap.
When Proton detects that adding further processors finds them under-achieving, the added processors
can automatically backoff and try again later. This reduces the occasion of system overload and
degradation of real-time response.
Splitting up work
In Proton, individual tasks are not processed by multiple threads, rather, tasks are usually single-threaded
and many tasks are distributed among multiple threads. To more effectively use the multiprocessing
engine, the work you run though it can be self-splitting, or be already split into some number of smaller
chunks. This characterization is intentionally vague, because there is a bit of latitude here, but there is also
a range. Generally dividing the work into hundreds of subtasks will work well, particularly when their
sizes are roughly the same. Larger tasks that are marked Immediate also divide well among concurrent
threads, but enough of them still must be posted before all processors will ramp up.
Statistical load balancing becomes effective when the total task count is much higher than the thread count.
If the chunks are few, or vastly different sizes, concurrency tuning may show gains by “gear shifting”. A
workset with one large task can represent the same load as a workset with many small tasks. Adjusting for
that (when viable) helps to balance the work load among processors. There are several control points:
•• Mark large tasks as Immediate in their constructors (who ought to know). Such tasks go out
to the work heap immediately upon posting them, along with any others present in the current
workset.
•• Call workheap‑>postevent (0) to flush the task buffer immediately, increasing the number of
worksets to divide among hungry service threads.
•• Set the cutoff point in the current thread for workset posting, by calling workheap‑>dwell (n),
so that when the workset task count reaches n, it is moved to the multiprocessing workheap.
You set n low for large tasks, high for small tasks with a default of 1.
•• Tasks can divide their work by reposting portions of themselves as new tasks to the workheap
for others to similarly act on. This cascades across the available workheap processors and
quickly ramps up processing to full capacity until completion.
It takes worksets lining up before a second processor is added, the load must keep building to add a third,
and so on. If you generate too many tasks, that is ok, the work just queues up and service threads crank
through it until the backlog subsides.

11
When tasks are hefty enough by themselves, setting task‑>is (Immediate) before the posting sequence,
will immediately put them into the active work phase, by themselves if necessary, without waiting for more
tasks to stack up. This exposes more opportunities for concurrent processing with large task sequences.
Avoid using Immediate on small tasks, allowing them to buffer normally for higher bandwidth.
Not splitting up work
Sometimes you want to ensure that a short sequence of tasks will be scheduled together, as a unit later
processed in order, by one thread, but still by the workheap (e.g. you might have many such sequences to
work off). Calling workheap‑>reserve (n) will do just that, by checking the room for n tasks per workset
in the calling thread, and flushing it out for a fresh one if necessary.
This does not obligate you to actually post anything, and not doing so makes calling reserve (n) another
way to flush the task buffer. Setting the threshold too high will be limited by a lower dwell (n) setting, or
by Immediate tasks that arrive. Grouping tasks in this manner is a practical way of serializing several tasks,
at any time, in the midst of unordered multiprocessing chaos. The posting side is fiercely single-threaded,
so these things are performed with the convenience of making your own non-atomic synchronous choices.
Multi-phase workheap strategy
A thread’s workheap can be partitioned into two or more work phases that are sequentially serviced, for
ordering at larger scales without giving up any concurrency within each work phase. Work phases permit
unordered computation in one phase to depend on and wait for results from completed earlier phases.
Individual phases are unbounded and dynamically managed. Each thread always posts into its own
current work phase, but may advance at any time to a new phase. Prior phases are multiprocessed in order,
with phase i completed before phase i+1 begins. Servicing proceeds through each phase and finishes with
the latest phase.
A thread does not have to divide its work into phases, but may choose instead to post and process the
current phase continuously, without ever advancing it. Work phases are independent by thread and exist
for the convenience and benefit of their respective thread. Work phases in separate threads are temporally
unrelated, but can expect continuous attention from multiprocessing services. Only the thread posting the
work generally knows whether any such dependencies might exist.
Threads post unordered tasks into an internal thread-local workset, consisting of a light-weight single
cacheline buffer of task pointers. When the workset fills or is submitted, it is appended to its current
work phase, which may itself be in a state of being serviced. Worksets spread the transaction costs across
multiple tasks, particularly useful for fine-grain tasks. Worksets hold tasks until they are submitted to the
workheap for servicing, but Proton performs that submission automatically at various opportunities.
The first post into an empty and idle per-thread workheap, attempts to awaken a blocked thread for duty.
If there are none, a new thread may be started if there are more processors to bring into play. Each thread
looks at its own situation and asks for more help (i.e. from other threads) when it seems necessary.
Threads that do not opt out of workheap service, enter workheap service during DoEvents ( ) calls and
during its blocking state. Only the oldest unfinished work phases are ever processed at any given moment,
followed by each phase after that, if any. A servicing thread begins by choosing an active subscriber at
random, then working off its earliest work phase, choosing again, and so on.

12
Work distribution across threads
The workheap consists of a set of thread-local queues assigned across participating threads. Each thread
posts its work only to its own private queue within the workheap, without interference to/from other
threads. Posting to just one of these queues is still serviced by all participating threads, but that can also
create a hotspot, with too many threads going after one queue. A better work distribution puts more work
in front of more processors more quickly, for a faster completion.
A cascading work distribution approach, posts large chunks of work, which are picked up by other threads
and reposted in smaller pieces, quickly spreading the work across all participating threads. This results in
many servers with many clients, rather than one server with all the clients, scaling easily to any number of
processors that rapidly converge to completion.
Such a work task simply processes a small chunk, or divides a larger chunk in half, reposting one of them
back to the workheap, and repeating again with the remainder. This rapidly breaks down work while
minimizing added transactions into the local workheap queue. Different scenarios and types of work
define their own tasks for custom task breakdown and servicing.
This method has been tested and demonstrated in the fractal application, and shown to be highly effective
in rapidly and evenly distributing large requests across all participating threads. Such work distribution
activity can be viewed and evaluated in real-time, from the Proton thread activity monitor. See the
fractal.cpp source code for further details about fractal.exe implementation.
Concurrent service threads
Global workheap structures maintain a list of subscribing threads, their active state, how many processors
are servicing each and in total. Additional servicing threads use this information to choose which
subscriber to service next and when to request additional processor assistance. With insufficient threads
available for service, the workheap makes more threads available by starting EventHandler threads (shown
as “zombies”), which do nothing but block—waiting for events to process. After a few moments without
any new material to process, these supporting threads time-out and exit automatically.
On phase complete, service threads randomly choose another active subscriber or exit the workheap.
Similarly, prospective service threads that find the earliest phase in an empty state, cannot make any
progress and randomly choose another active subscriber or exit the workheap. The final processor to find
the service phase empty, advances the service phase to the next phase (but not past the posting phase).
This guarantees all earlier phases are complete before processing the next one.
Calling workheap‑>service ( ) performs a service cycle in the calling thread, but the function returns
harmlessly, doing nothing if already entered, disabled, nothing to do, etc. If the workheap draws
too much processor attention away from other application activities, you can exclude one or more
processors from workheap service with the workheap‑>exclude (n) function, that acts globally to limit
processor participation (0 or 1 are the best values). This can also help overall system responsiveness
under high loads.
Any thread may opt out of being called into service by calling thread->is (RealTime), which indicates the
thread cannot tolerate much delay or latency. For example, the thread monitor and the winpump threads
do not service the workheap. This keeps the message flow moving and the monitoring display current,
even when all processor effort is servicing the workheap. Similar accomodation may be necessary in a

13
variety of multithreaded application scenarios, specifically for threads involved in activities considered
time-critical. Within each thread, this setting can be changed back and forth as often as needed. You can
view the task latency for any system of Proton threads through the activity monitor.
Service workchannel architecture
To manage the multi-threaded operation of workheap activities, the workheap maintains a fixed set of
internal “workchannels” that are assigned to threads for posting worksets into the workheap. Among
other things, the workchannel ensures that puts ( ) and gets ( ) always access the proper work phase
queues, queues are allocated, initialized and freed at the right times, new threads are started when needed,
and available work is presented to prospective threads for servicing. Workchannels are internal to the
workheap, so there is no API for them, but their existence can be useful to know about.
Workchannels remain connected to their thread until released by their thread. They are never taken away
by another thread in need of a work channel. Instead, empty and inactive channels are returned to the
workheap for reassignment to another posting sequence. This occurs at the end of a work sequence, after
its servicing has completed.
The channel array is pre-allocated in the workheap, to avoid problems with stale pointers and other
multithreaded complexities. Its defined channel capacity should be set sufficiently high to meet the
largest possible instantaneous thread demand. The channel internally allocates and frees the queues it uses
to manage its own multi-phase work load operation.
Workchannel assignment comes out of a wait-free allocation bit table. This map is quickly searched and
modified by any number of threads. It is read often, but not so much written. Each channel knows its
thread owner, and each thread knows its channel. Thread ownership over a workchannel is obtained and
surrendered by the posting thread, the later happening after work is complete with nothing further posted.
This promptly recycles the unused workchannels for immediate availability to threads when needed,
cycling once per continuous sequence completion.
Thread notifiers
Notifiers are objects a thread holds, that implement notification from other threads. Completing the
local workheap can activate a thread notifier, if defined in the client thread ahead of time. Notifiers are
run in the servicing thread, not the client thread, so its client virtual activate ( ) call should be written
with that in mind. Applications may use a notifier to invalidate and update the display on completion of
any work in its local workheap. It could do anything else that is sensible there as well, i.e. multi-threaded
sensibilities. Notifiers don’t have the open-ended flexibility of completion tasks, but they can handle
situations that cannot tolerate the task latency.
Scalability
Workheap capacity ramps up automatically at multiple scales. First, workset capacity expands to
accomodate the current dwell ( ) setting. The filled worksets are posted to bounded queues, but when
a queue fills up, a new queue is created and linked in to allow unobstructed posting. Multiple queues
provide more opportunities for keeping processors busy with low contention, helping to isolate posting
from processing and processing from processing, increasing bandwidth at the very moment when the load
is getting larger.

14
This arrangement is spread across all the threads that post work to the workheap, with all available
processors consuming it everywhere as quickly as possible. The ramifications of all this can be viewed in
the Proton activity monitor in real-time, to assist application debugging and balancing. Putting unlimited
processors on one service channel is feasible because multiple servers are spread across multiple queues to
avoid contention. Each client (posting) thread maintains its own part of the workheap, so multiple clients
posting work increase the available servicing bandwidth, when there are more processors to run them.
One way to spread the work across all participating threads, is to post larger tasks that are picked up by
other service threads and reposted as many smaller tasks. This seeds the distributed workheaps to multiply
concurrent servicing possibilities. You cannot post tasks directly into the work phases of other threads,
but they do post into their own channels, as part of their concurrent servicing operation of tasks you post.
This method is used by the fractal sample application, whose tasks repeatedly repost half the task down to
the tile level. See the code in fractal.cpp for implementation details.
Workchannels are limited to 64, meaning “only”that many threads can be actively posting tasks into
the workheap simultaneously. But since just one posting thread can often easily overwhelm all available
processors with task overload, it is questionable whether that many posting threads are even servicable,
short of putting sufficient processors on the job. However, many threads streaming tasks to the workheap at
sustainable rates, will all see the gain in performance and reduced latency they expect. Which work channel
to process next is selected at random, by incoming service threads, from those eligible to run. Overload is
not usually a problem however, because tasks simply queue up in dynamic queues until servicing arrives.
Once serviced to empty and quiescent, channels are returned to the workheap for reassignment to another
thread that may start posting to the workheap at any moment. This keeps all unused channels available.
If a thread cannot obtain a channel, it services the workheap until a channel becomes available, making
progress in either case.
Future Proton releases may process work phases differently for more effective scheduling (as internal
ordering is undefined), but the present architecture can efficiently utilize at least 64 processors when
many threads are both posting and servicing. The workheap actively avoids employing any more service
threads than the actual number of processors made available to the running process. You can hold back
some processors from the work heap by calling workheap‑>exclude (n), to keep other threads responsive
in specific scenarios. Excluding at least one processor from situations involving continous high-load
processing, is beneficial allowing the rest of the system to breathe and process normally when your
application is taking over everything.
By monitoring their own workloads, service threads can drop out of service if they see utilization below
25% for some time interval. Low utilization indicates thread is not doing much now and can leave as long
as other threads are present. Such threads often hold resources useful to the other threads that remain,
and releasing them will recycle those resources.
Hyperthreading the workheap
Because Proton workheap scheduling tends to even out the load among multiple service threads,
for computation-bound material, hyperthreaded workheap servicing provides no real advantage.
Hyperthreading works better with uneven loads among threads, where under-utilization in one thread
increases availability for another thread. However such cooperative match-ups are highly dependent
on the application loads present. Therefore Proton effectively hides HT from consideration, with the
workheap depending only upon the physical processor count, even when HT is enabled.

15
I/O bound material, like that involving file processing and network access, may expose more opportunities
to enable modest gains using HT. Hyperthreading is a system-wide setting, and usually chosen to benefit
system-wide performance. Individual applications should be able to accomodate this situation as it arises,
either way. Hyperthreading is also helpful for testing multi-threaded software behavior across a wider
variety of multi-processor configurations. Processor limits in Proton have nothing to say about how many
threads you choose to run, where HT can provide its full bounty.
Without HT, it is sometimes useful to reserve at least one processor, leaving it under-utilized, to
avoid starving the rest of the system of processor attention. Hyperthreading makes this less an issue,
because employing all physical processors still leaves hyperthreads available to keep the rest of the
system advancing.
Multiprocessing resources in Proton
class ActiveRole generic thread class in Proton from which you derive your thread classes,
you define their virtual behavior ( ), add your data, and start them
with postevent (0), virtual functions define initialize/teardown,
blocking/unblocking, etc
thread‑>behavior ( ); virtual function where you define and implement your thread logic,
it is called by thread startup services in Proton, not by the application
thread‑>postevent (0); brief thread wakeup, or start thread if suspended or never started,
starts only after construction, never from inside a base class constructor
thread‑>is (RealTime); set to indicate a thread with no tolerance for extra delays,
clear this state on threads willing to service the workheap (default),
each thread manages its own real-time setting
thread‑>notify (obj); set thread notifier to activate when local workheap tasks are complete
thread‑>notify (enum); attempt to trigger a specific notifier defined (or not) in the thread

16
y y y
class TaskEvent base class for tasks, you define its virtual action ( ) and data to suit
thread‑>action ( ); virtual function where you define and implement your task logic,
it is called by event processing services in Proton, not by the application
task‑>is (Unordered); marks task as unordered (often set in the task constructor)
task‑>is (Immediate); marks tasks that immediately post, potentially with others, to the workheap
task‑>is (LowPriority); marks tasks that are serviced after (regular) high priority tasks
thread‑>postevent (task); posts ordered tasks to a thread, or unordered tasks to the workheap
task‑>postback ( ); posts a task back to the thread it came from after being received
TaskGroup z (task, target); associates a completion task to the tasks posted while z remains in scope
DoEvents ( ); processes all thread-local events, services the workheap as needed, no
waiting
DoEvents (n); processes all thread-local events/workheap, waiting up to n msecs for work
WaitSecs (period); calls DoEvents ( ) for a time period, with accurate timing, blocking as
needed, both WaitSecs ( ) and DoEvents ( ) return 0=exiting, 1=idle,
2=active
y y y
WorkHeap *workheap; application global workheap, always defined throughout app session
workheap‑>postevent (tsk); post a task to the calling thread’s work channel into the workheap
workheap‑>postevent (0); post any buffered tasks from the calling thread out to the workheap
workheap‑>dwell (lim); set how many tasks accumulate before being put into the active
work phase
workheap‑>service ( ); services the workheap in the calling thread, as needed and enabled,
fast out when no work, has enough processors, is disabled or is unneeded,
willing threads call-in automatically to ensure fast workheap response
workheap‑>newphase ( ); begin a new local work phase, so that subsequent posted work is not
started until earlier phases finish.
workheap‑>reserve (nt); keeps the next nt posts in the same workset buffer (up to dwell)
workheap‑>exclude (np); excludes np processors from service that would normally be available
y y y y y y

17
A workheap rather than work-stealing?
Work-stealing is a proactive approach to concurrent task scheduling that has similarities with
Proton workheap services. It is however but one concept of many that go into building a practical
multiprocessing system. Here are some other considerations:
•• Work-stealing is a pull-technology, Proton’s workheap is more of a push-pull work-
distribution technology, where threads know about and cooperate with one another.
•• Rather than chasing volatile thread objects and stealing work from their queues, the workheap
centralizes multiprocessing choices, making available work more directly accessible to interested
threads, with less contention. Still volatile, but much more contained, with fewer cache-line
load implications.
•• Once you let work-stealing threads pull tasks out of your thread queues, processing your
task queue in posted-order can no longer be guaranteed. Since support for posted-order is
mandatory for many things, work-stealing by itself is inadequate.
•• Proton threads support both ordered tasks, processed by specific threads, and unordered tasks,
multiprocessed by many threads. Workheap multiprocessing activities are independent from
the ordered tasks normally posted to specific threads, so ordered and unordered tasks really
represent separate workflows.
•• Each thread posts unordered work in one or more phases, to be multiprocessed in that order,
one by one. Multiple processors concentrate on each phase and complete it before starting
the next phase. The current phase may be posted and processed concurrently, prior phases are
processed until empty and released.
•• The Proton posting side to things is fast, lock-free with few atomics, and distributed across
participating threads. The servicing side is wait-free within and across all work phases.
•• The workheap knows how to wake up blocked threads and put them to work as unordered
work piles up awaiting service. This wakeup occurs when threads try to post things.
Tasks posted and on their way to a thread queue, are diverted to the workheap when marked as
“unordered tasks”. This indicator can be set on task creation, or marked somewhere along the way. Those
responsible for constructing and posting tasks can be expected to know which of their tasks have thread
affinity, which tasks do not, and to mark them accordingly before posting them. By default, tasks and
events are normally ordered, and processed by their target thread.

18
Fractal explorer application
This sample application illustrates some of Proton’s standard multiprocessing services brought to bear.
Fractal.exe was an older single threaded application with crushing computational needs, requiring the
highest CPU performance and a tiny display to make it at all interesting and responsive. It was never
designed for multi-processing, and earlier attempts to make it so, found achieving it overly complicated,
with mixed and disappointing results. This made fractal.exe a perfect test candidate for applying Proton
multi-processing technology. All source code for fractal.exe is included as a sample Proton project, with
code you can reshape and derive your own work from as needed.
Originating from user-initiated changes in view, the required graphics are split into hundreds of graphic
tile regions (under 1200 pixels each) that are posted to the workheap for computation. The finished tiles
are then posted to the winpump for direct screen rendering, which side-steps having massive update
of critical sections (a big cost) by instead using single threaded direct rendering (a tiny cost, in this
case). Window updates occur when all events are (fleetingly) complete and the display has been marked
modified. Since individual tiles require a relatively large computational effort, just one task per workset
can be used for this application.
Realistically, running fractal.exe requires a 16Ghz uniprocessor, but any quad-core CPU clocked over
3Ghz will do. Even dual-core CPUs are not quite enough to keep things interesting. Fractal runs twice
as fast on a 4Ghz-6 processor (Gulftown), as compared with a 3Ghz-4 processor (Bloomfield). The
thread monitor shows the multi-threaded blow-by-blow action (brought up with ctl-alt-). Larger
window sizes have more pixels to compute, so you get faster display response with smaller windows. With
a large enough image size to render and animate, Fractal.exe can still bring any single CPU to its knees,
no matter how many processors it has, no matter how high its clock rate (i.e. for now).
You can change the display size, drag the fractal image around with the mouse, and zoom in/out with the
mouse wheel, or with a 1st/2nd mouse button click. The mouse wheel is more fun when you keep the
window size smaller, but you can expand back when you arrive some place interesting. Its windows apply
different sizing responses to form-resize, depending on the shift key state. The application remembers
where you left it on the screen at the last exit, and comes up with the same content the next time. Full
screen and multi-screen image sizes are supported.
Fractal.exe is particularly interesting and instructive when you have the thread activity monitor up
(ctl‑alt-), while you explore the fractal. The effects from variable loads quickly lead to resource changes
and action in more or fewer service threads, as they compete for the work available, and show up to you in
real-time samples.
Fractal.exe has been designed and used as a test program to exercise and hammer the Proton multi-
processing architecture, to watch how it survives variable loads, and to learn things that go back into
making Proton services better. As such, a number of performance opportunities have been foregone in
fractal.exe, as its sledgehammer characteristics remain important for Proton quality-assurance testing.
Future versions of fractal.exe may go beyond the present one, when time permits, with some of the cool
ideas already documented in its source code.

ThreadModel rev 1.4

Recommended

Recommended

More Related Content

What's hot

What's hot (18)

Viewers also liked

Viewers also liked (20)

Similar to ThreadModel rev 1.4

Similar to ThreadModel rev 1.4 (20)

ThreadModel rev 1.4