Compare and Swap

I came across an interesting introductory codeproject tutorial on Compare and Swap (CAS) instructions* for lock free parallelism. Parallel programming frequently makes use of locking constructs, such as mutexes, semaphores, and critical sections to suspend other threads while the lock is in place. Such suspensions typically consume a great deal of time and thus adversely affect system performance. Lock free and wait free parallelism in contrast uses atomic instructions such as CAS to guarantee that data is not corrupted during writes, and to avoid suspension of other threads. A good deal of theoretical background can be found by following the links from this Wikipedia article.

Valve has published a GDC 2007 presentation called Dragged Kicking and Screaming: Source Multicore that talks about how they migrated to a lock-free threading system; some parts of their implementation are also wait-free. They structured their game architecture around what they call a hybrid threading model, where algorithms are broken up according to appropriate task models. They support high level constructs of function threading, where work is parceled out to other threads, but appears to complete in line, array parallelism, where work is described in array slices, and queued execution. To that end, they provide some templated functors and functional interfaces that abstract the implementation details from the application programmers.

For comparison with another approach, please see this analysis of multi-threaded rendering in Capcom’s MT2 engine. The MT2 engine has to deal with further complexity to support parallelism on the PS3 in that every SPU task has to have a corresponding PPU thread to marshal it.

I definitely have a lot to think about before I approach my next parallelization task!

—————————-

* CAS takes three arguments; an address, the expected old value, and the new value. If the expected old value and the data at the address agree, then the new value is written to the address. If the address was written, success is reported. This allows an algorithm to read a value from memory, modify it, and write it to memory only if no other process has written to it in the mean time.

In a typical implementation, a lock free algorithm using CAS re-attempts its computation if a value was written under it. This algorithm would not be wait-free. The tutorial on codeproject implements a lock-free queue implementation that waits. The linked list implementation in Valve’s presentation also waits. The algorithms are structured around atomic operations such as rewriting the head or tail of the list.

Post to Twitter Post to Delicious Post to Facebook

  • admin
    Thanks Bjoern and Angelo. There's a lot to think about there. As I read your comments, I think it's safe to say that a general theme is to avoid locking as much as possible through mechanisms such as deferred queries and task queues.
  • Valve used lockfree algorithms in a very few, simple parallel data structures as far as I know. In general, lockfree algorithms are really hard to design, pose same composibility problems that explicit lock have, and can be even slower than locks, when adding many barriers or if the threads have to poll a lot before succeeding to publish a CAS or if the CAS fail entails a huge update in order to try again. CAS & lockfree algorithms are a useful tool, but not a general solution, at all.
  • Bjoern Knafla
    Valve's approach is interesting because of their hybrid threading model combining:
    - function level parallelism: distributing functions to threads (sound, app-client-part, app-server-part)
    - task level parallelism: using task pools that distribute tasks automatically to idle (task pool controlled) threads
    - data parallelism: putting as many threads to work on arrays of data as possible (think OpenMP or GPGPU/stream processing)

    They describe a common theme in parallelization: to make computations as independent as possible and therefore to minimize the need for synchronization, duplicate data and use thread or task local data. This train of thought resembles distributed programming or actor based parallelization which aims for side-effect free computations.
    Access to shared resources is made more and more explicit and decoupled by using double buffering (for example by interleaving rendering and simulation).
    This decoupling can also be described by optimizing the architecture for deferred or asynchronous computations like in the discussion about optimizing the input-output latency in games ( http://meshula.net/wordpress/?p=109 ).
    Kelly Brocks detailed his approach to parallelism via deferred calculations on the sweng-gamedev mailinglist in December 2007 where he described an explicit division of his code into read-, computation-, and write-/modification-phases. Additionally, if for example an AI agent wants to know if another agent is in its line of sight it creates a query which isn't answered immediately but deferred (perhaps in the next frame).

    Valve's slides also describe the idea of some kind of phases. In the sequential programming days you often find an architecture where every entity in your game is processed one after the other, each one doing all calculations for the step (absorb input, decide what to do based on the current situation/context, simulate, draw). The advent of middle-ware physics engines concentrated the physical simulation for al entities in one phase, while the entities walked through all their other phases, entity for entity for all the other computations.
    Valve shows how they started to slice though the entities (for the rendering) and collect equal phases of all entities into one step/one slice (bone setup, animation, etc.). Now they are going through slice for slice and simulate the slice-aspect of each entity during this slice-phase. This allows for task-level or even data-parallelism (array-parallelism) and should scale quite well with growing CPU core counts.

    Insomniac (Terrance Cohen) describes a comparable approach ( http://www.insomniacgames.com/tech/articles/0807/a_gameplay_architecture_for_performance_clean.php ) and combines it with the way game entities are represented. A game entity combines many components (Insomniac calls them aspects) that describe its data and in the way the components interact its behavior). Instead of storing all aspects in one object Insomniac distributes them over so called systems. Each system contains its own data - one aspect of entities - and can organize it as optimal as possible (think cache-usage), for example to allow data-parallelism or to put as many computations as possible on PS3 SPUs. Systems can run independently from each other but the flow of data through them and the order of systems seems to be modelled statically on the code level or more dynamically by a subscriber-observer pattern.

    To maximize the use of the PS3 SPUs, each system offers so called shaders (or computation kernels) - slots where the user of a system can inject SPU code that adheres to a system defined slot interface. Such shaders are called during the execution of the system and only get access to data and functions as described by the fixed interface the system offers ( Mike Acton: http://www.insomniacgames.com/tech/articles/0907/spu_shaders_introduction.php , http://www.insomniacgames.com/tech/articles/0108/files/spu_shaders.pdf ).

    So, some core idea of parallelization are to decouple computations and to use deferred/async queries the moment system-boundaries are crossed. Task- and data-level parallelism is used to allow for scaling with the CPU core count.
    Valve also states that they introduce lock-free data structures where access to global or shared resources can't be prevented (memory allocation in their case, logging or debugging infrastructure in other cases, or even access to threads (which can be seen as a global resource)).

    Ok, sorry for the incoherent writing but your blog entry triggered my urge to comment and to add some other links that I found very interesting.

    Cheers,
    Bjoern
blog comments powered by Disqus

Bad Behavior has blocked 877 access attempts in the last 7 days.