Transactional Memory: History and Development

Definition

Parallel programming is hard. On shared-memory systems there is no way around synchronizing the access of concurrent processes and threads to the shared resource, memory. The usual tools are:

  • locks (mutexes);
  • lock-free algorithms;
  • transactional memory.

Transactional memory is a technique for synchronizing concurrent threads. It simplifies parallel programming by grouping instructions into atomic transactions. Concurrent threads run in parallel until they start modifying the same piece of memory. For example, insertions of nodes into a red-black tree can proceed in parallel in several threads.

/* Move an item from one list to another atomically */
void move(list *from, list *to) {
    __transaction_atomic {
        node *n = pop(from);
        push(to, n);
    }
}

The concurrency-control approach behind transactional memory is called optimistic: we assume that threads mostly operate independently of one another and rarely change the same data, so most transactions commit successfully. Lock-based approaches, by contrast, are pessimistic: we assume that threads will always conflict, so we never let two of them enter a critical section at the same time.
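
For contrast, a pessimistic, lock-based version of the move operation from the example above might look like this. This is only a sketch: the List type and the use of std::scoped_lock are illustrative assumptions, not part of the original code.

```cpp
#include <deque>
#include <mutex>

// A sketch of the pessimistic, lock-based counterpart of move().
// The List type and its fields are illustrative assumptions.
struct List {
    std::mutex m;
    std::deque<int> items;
};

// Lock both lists before touching either. std::scoped_lock acquires
// the two mutexes with a deadlock-avoidance algorithm; with naive
// lock()/unlock() calls, two threads running move_item(a, b) and
// move_item(b, a) at the same time could deadlock.
void move_item(List& from, List& to) {
    std::scoped_lock lock(from.m, to.m);
    if (from.items.empty()) return;
    int n = from.items.back();
    from.items.pop_back();
    to.items.push_back(n);
}
```

Note how the lock-based version forces the programmer to think about lock ordering; in the transactional version that burden falls on the TM runtime.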

If a data conflict does occur, the transaction is aborted. Aborting a transaction rolls back all the actions the thread performed inside it. After that the transaction is usually rerun, or a previously designated "emergency exit" function is called, typically a fallback path that takes locks.
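
Conceptually, a TM runtime drives each transaction through a retry loop with a fallback. Here is a minimal sketch of that control flow; run_transaction, try_commit, and fallback are illustrative names, not a real TM API.

```cpp
#include <functional>

// Conceptual sketch of how a TM runtime drives a transaction:
// run the body and try to commit; on a conflict, roll back and
// retry, and after too many aborts fall back to a pessimistic
// path (e.g. take a global lock and run the body under it).
int run_transaction(const std::function<bool()>& try_commit,
                    const std::function<void()>& fallback,
                    int max_retries) {
    for (int attempt = 0; attempt < max_retries; ++attempt) {
        if (try_commit())      // body executed and committed atomically
            return attempt;    // number of aborts before success
    }
    fallback();                // the "emergency exit"
    return max_retries;
}
```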

Advantages of transactional memory:

  • relatively easy to use (wrap a whole method in a transaction block);
  • no locks, and therefore no deadlocks;
  • a higher degree of parallelism, and hence better performance.

Transactional memory is no silver bullet. It has disadvantages too:

  • used carelessly, it can hurt performance or even behave incorrectly;
  • its applicability is limited: you cannot execute an operation inside a transaction if that operation cannot be undone;
  • debugging is hard: you cannot place a breakpoint inside a transaction.

Creating the Theory

The idea of carrying transactions over from the database world into parallel programming first appeared several decades ago, in the 1980s. The technology was developed and popularized by Maurice Herlihy, Ravi Rajwar, and Nir Shavit. The earliest papers proposed hardware implementations of transactional memory, which were not feasible to build at the time. The first software implementations of the technology appeared in the 1990s, and hardware implementations followed in the 2000s.

Software Implementations (Software Transactional Memory, STM)

Out of the many software transactional memory implementations I would like to highlight four. The examples are available on GitHub: JIghtuse/tm-experiments.

Clojure

Clojure is the only language whose core supports transactional memory. Its basic STM constructs are ref (a reference to data; inside a transaction, data is changed only through such references) and dosync (a transaction block).

Clojure's approach to STM is called multiversion concurrency control (MVCC): it keeps multiple logical versions of the data used in a transaction. Throughout a transaction, a thread sees a snapshot of the data as of the moment the transaction started.

A diagram of ref versions in Clojure transactions:

Transactions 1 and 2 start at the same time and receive the same copy of the ref, version v0. Inside the transactions the data is processed, changing the ref's value. Transaction 1 commits first and wins the race to update the ref with its new value. Transaction 2 then tries to commit, and its attempt to update the ref fails (the red arrow in the diagram), because the ref's version is not the one it expected. The transaction is rerun against a copy of the new version of the ref; since no other transaction is now trying to change the ref, this second attempt commits successfully. The computation in transaction 3 does not change the ref's value, so no restart is triggered and it commits successfully.
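
The version check at the heart of MVCC can be sketched in a few lines. This is a toy model to make the "expected version" race concrete; Clojure's actual STM is considerably more sophisticated.

```cpp
#include <atomic>

// Toy sketch of the MVCC idea: a value is paired with a version
// counter. A transaction snapshots the version when it starts, and
// its commit succeeds only if nobody bumped the version meanwhile.
struct VersionedRef {
    std::atomic<long> version{0};
    std::atomic<int>  value{0};

    // Returns false if another transaction committed first (the
    // "red arrow" case); the caller re-reads and retries.
    // Note: in this toy, version and value are not updated as one
    // atomic unit; a real STM commits them together.
    bool try_commit(long seen_version, int new_value) {
        long expected = seen_version;
        if (!version.compare_exchange_strong(expected, seen_version + 1))
            return false;       // someone else won the race
        value.store(new_value);
        return true;
    }
};

// The retry loop of a transaction: snapshot, compute, try to commit.
int add_to_ref(VersionedRef& r, int delta) {
    for (;;) {
        long seen = r.version.load();
        int snapshot = r.value.load();
        if (r.try_commit(seen, snapshot + delta))
            return snapshot + delta;
    }
}
```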

Let's consider an example of transferring funds between bank accounts:

The program runs in a single thread, but in a thread-safe way.

(def account1 (ref 100))
(def account2 (ref 0))
; to read the current value of a ref, (deref refname) is used:
(deref account1)
100
; @refname is equivalent to (deref refname)
@account2
0

(defn transfer [amount from to]
    (dosync
       (alter from - amount)
       (alter to   + amount)))

(transfer 100 account1 account2)
100

There are cases where the runtime can be allowed to "relax" the ordering of concurrent updates to gain extra performance. Imagine you keep a log of the day's transactions. The order of entries in the log does not matter as long as the balance comes out right: if you receive two payments of USD 100 and USD 50, it does not matter in which order they are appended. The deposits from the two transactions are commutative, and Clojure provides the concurrent operation commute for exactly such cases.

(defn log-deposit [account amount]
     (dosync
        (println "Depositing $" amount " into account, balance now: "
            (commute account + amount))))

(def myaccount (ref 0))

(log-deposit myaccount 100)
(log-deposit myaccount 50)

; (as good as) equivalent to 

(log-deposit myaccount 50)
(log-deposit myaccount 100)

Haskell

In Haskell, transactional memory is implemented by the STM library, which is part of the Haskell Platform. The type system ensures that transactions are used correctly: violations cause errors at compile time.

Haskell has a powerful type system. The language distinguishes functions with side effects from pure functions. Any value of type IO t is called an action. To execute an action atomically, it is prefixed with the word atomically. Calling atomically with an action as its argument gives two guarantees:

  • atomicity — the effects of atomically act become visible to other threads all at once. No other thread can observe an intermediate state inside the atomic action, only the final state.
  • isolation — during a call to atomically act, the action act runs completely independently of other threads: it takes a snapshot of the state of the world at the start and executes against that snapshot.

atomically has the type atomically :: STM a -> IO a. Its argument is an action of type STM a. Like an IO action, an STM action can have side effects, but their range is much narrower: typically, an STM action reads or writes a transactional variable of type TVar a.

  • readTVar :: TVar a -> STM a
  • writeTVar :: TVar a -> a -> STM ()

Let's consider an example of transferring funds between two bank accounts.

import System.IO
import Control.Concurrent.STM

-- Type system protects TVar type variable from read/write beyond the transaction
type Account = TVar Int

withdraw :: Account -> Int -> STM ()
withdraw acc amount = do
    bal <- readTVar acc
    writeTVar acc (bal - amount)

deposit :: Account -> Int -> STM ()
deposit acc amount = withdraw acc (- amount)

transfer :: Account -> Account -> Int -> IO ()
-- Transfer 'amount' from account 'from' to account 'to'
transfer from to amount 
    = atomically (do deposit to amount
                     withdraw from amount)

showAccount :: Account -> IO Int
showAccount acc = atomically (readTVar acc)

main = do
    from <- atomically (newTVar 200)
    to   <- atomically (newTVar 100)
    transfer from to 50
    v1 <- showAccount from
    v2 <- showAccount to
    putStrLn $ (show v1) ++ ", " ++ (show v2)

-- The program will print "150, 150"

Scala

The STM implementation for Scala (ScalaSTM) was influenced by the Haskell and Clojure implementations. Besides Scala, ScalaSTM can be called from Java and Clojure. The implementation is used in the popular Akka framework for writing parallel programs. ScalaSTM provides a Ref cell that can be changed only inside a transaction. Data structures built from immutable objects and Refs can safely be used by many threads. Let's consider an implementation of a thread-safe doubly linked list using transactional memory.

package scala.concurrent.stm.examples

import scala.concurrent.stm._

class ConcurrentIntList {
  private class Node(val elem: Int, prev0: Node, next0: Node) {
    val isHeader = prev0 == null
    val prev = Ref(if (isHeader) this else prev0)
    val next = Ref(if (isHeader) this else next0)
  }

  private val header = new Node(-1, null, null)

  def addLast(elem: Int) {
    atomic { implicit txn =>
      val p = header.prev()
      val newNode = new Node(elem, p, header)
      p.next() = newNode
      header.prev() = newNode
    }
  }

  def addLast(e1: Int, e2: Int, elems: Int*) {
    atomic { implicit txn =>
      addLast(e1)
      addLast(e2)
      elems foreach { addLast(_) }
    }
  }

  //def isEmpty = atomic { implicit t => header.next() == header }
  def isEmpty = header.next.single() == header

  def removeFirst(): Int = atomic { implicit txn =>
    val n = header.next()
    if (n == header)
      retry
    val nn = n.next()
    header.next() = nn
    nn.prev() = header
    n.elem
  }

  def maybeRemoveFirst(): Option[Int] = {
    atomic { implicit txn =>
      Some(removeFirst())
    } orAtomic { implicit txn =>
      None
    }
  }

  override def toString: String = {
    atomic { implicit txn =>
      val buf = new StringBuilder("ConcurrentIntList(")
      var n = header.next()
      while (n != header) {
        buf ++= n.elem.toString
        n = n.next()
        if (n != header) buf ++= ","
      }
      buf ++= ")" toString
    }
  }
}

object ConcurrentIntList {
  def select(stacks: ConcurrentIntList*): (ConcurrentIntList, Int) = {
    atomic { implicit txn =>
      for (s <- stacks) {
        s.maybeRemoveFirst() match {
          case Some(e) => return (s, e)
          case None =>
        }
      }
      retry
    }
  }
}

To implement the shared structure, the pointers to the next and previous nodes are made thread-safe. Whenever one thread may write a variable at the same time as another thread accesses it (for reading or writing), a Ref is used. We define a class for a list node and initialize the head. The list is circular: when it is created, the head's pointers point back at the head itself, and these pointers are never null.

class ConcurrentIntList {
  private class Node(val elem: Int, prev0: Node, next0: Node) {
    val isHeader = prev0 == null
    val prev = Ref(if (isHeader) this else prev0)
    val next = Ref(if (isHeader) this else next0)
  }

  private val header = new Node(-1, null, null)

If x is a Ref, then x() reads the value stored in x, and x() = v sets it to the value v. Refs can be read and written only inside an atomic block (a transaction); this is verified at compile time. A transaction in use:

def addLast(elem: Int) {
    atomic { implicit txn =>
      val p = header.prev()
      val newNode = new Node(elem, p, header)
      p.next() = newNode
      header.prev() = newNode
    }
  }

Scala combines ideas from many programming paradigms, so it is no surprise that the language has original takes on transactional memory. The Akka framework mentioned above combines the facilities of actors and transactional memory into agents and transactors, which will blow your mind.

C/C++ (GCC 4.7+)

Starting with version 4.7, GCC supports transactional memory. The implementation is the libitm runtime library; the -fgnu-tm flag enables it at compile time (-mrtm and -mhle enable the hardware instructions). The library was developed against the draft of transactional constructs for C++ (proposed for inclusion in the language standard). Most hardware implementations of transactional memory are best-effort, so practical implementations combine hardware and software transactional memory. Such systems are called "hybrid transactional memory", and GCC's implementation is among them.

The following keywords are added to the language:

  • __transaction_atomic { … } — marks a code block as a transaction;
  • __transaction_relaxed { … } — states that the unsafe code inside the block causes no side effects;
  • __transaction_cancel — explicitly cancels a transaction;
  • __attribute__((transaction_safe)) — marks a function as transaction-safe;
  • __attribute__((transaction_pure)) — marks a function as free of side effects.

To demonstrate the use of transactional memory in C, we will build a histogram of image data in concurrent threads.

A complete implementation is available at github: github.com/JIghtuse/tm-experiments/blob/master/histogram/src/hist.c

One transaction block covers one pass of the histogram-update loop. The runtime library (or the hardware support) decides when and which transactions need to be rerun.

#ifdef _USE_TSX
    __transaction_atomic {
#endif
        for (i = 0; i < d->sz; ++i) {
            struct pixel p = d->pixels[i];

            unsigned int luminance = rY * p.red + gY * p.green + bY * p.blue;

#if defined _USE_TSX
            ++histogram[luminance/BORDER];
#elif defined _USE_MUTEX
            pthread_mutex_lock(&mut);
            ++histogram[luminance/BORDER];
            pthread_mutex_unlock(&mut);
#endif
        }
#ifdef _USE_TSX
    }
#endif

Hardware Implementations (Hardware Transactional Memory, HTM)

The first hardware implementations of transactional memory appeared only recently.

Sun Rock (SPARC v9) 2007-2008

The Rock microprocessor from Sun Microsystems was the first microprocessor with hardware support for transactional memory. This 64-bit processor of the SPARC v9 architecture had 16 cores. Sun announced support for the technology in 2007. Two new instructions, chkpt and commit, and one new status register, cps, were added to make transactional memory work. The chkpt instruction started a transaction and commit committed it. When a transaction was aborted, control jumped to a fail address, at which point the cps register could be examined to determine the reason for the abort. Transactions were aborted on data conflicts, TLB misses, interrupts, and complex instructions. Nevertheless, in many cases transactional memory in the Rock processor delivered real synchronization benefits.

In 2008 Sun engineers presented the transactional memory interface and the Adaptive Transactional Memory Test Platform simulator. After Sun was acquired by Oracle, the project was closed. "This processor had two astonishing features: it was incredibly slow and it consumed vast amounts of energy. It was so hot that they had to hang 12 inches of cooling fans on it so the processor wouldn't overheat. It would have been insane to continue this project."

IBM BlueGene/Q (PowerPC A2) 2011

The Sequoia supercomputer, built on the BlueGene/Q architecture, became the first commercial system with hardware support for transactional memory. The technology operates in the 32 MB second-level cache of the PowerPC A2 (PowerPC BQC 16C) processor. Data in the cache carry a version tag, and the cache can store several versions of the same data.

A program notifies the processor that a transaction is starting, does its work, and then signals that the transaction should be committed. If other threads have changed the same data, creating new versions, the cache rejects the transaction and the thread retries it. If no other versions of the data have appeared, the data is modified and the transaction commits successfully.

The version-tag technology in the PowerPC A2 is also used for speculative execution. Instead of waiting for a new version of the data, a thread can compute with the current data, speculatively doing useful work. If the data turns out to be unchanged after the update, the thread commits its work from the cache; performance is higher, because the thread has already computed the final value. Otherwise the results of the speculative work are discarded and the thread redoes the computation with the correct values.

Transactional memory support is, in a sense, a logical extension of a facility PowerPC processors have had for a long time: load-link/store-conditional, or LL/SC. LL/SC is a primitive on which all thread-safe constructs can be built. The first part, load-link, is used by a program to fetch data from memory. The thread then changes the data and writes it back with store-conditional. The store succeeds only if the data has not changed in the meantime; otherwise the program goes back and repeats its work on the data from the beginning.

Transactional memory is LL/SC on steroids: threads perform LL operations on many different memory locations, and the commit operation atomically either updates all of them or aborts the transaction (like a failed SC).
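
The single-location LL/SC pattern can be sketched with a compare-and-swap loop; on architectures without LL/SC, such as x86, this is how the same read-modify-retry idiom is expressed.

```cpp
#include <atomic>

// The LL/SC pattern expressed with compare-and-swap: read a value
// ("load-link"), compute a new one, then write it back only if the
// location still holds what we read ("store-conditional"). On
// failure, retry with the freshly observed value. Transactional
// memory generalizes this loop to many locations at once.
int fetch_increment(std::atomic<int>& counter) {
    int seen = counter.load();
    while (!counter.compare_exchange_weak(seen, seen + 1)) {
        // compare_exchange_weak refreshed 'seen' with the current
        // value on failure, so we simply loop and try again
    }
    return seen;  // the value before our increment
}
```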

Ruud Haring, who presented IBM's transactional memory work at Hot Chips, admitted that the company faced a number of non-trivial problems in the implementation. The implementation has its limits: it does not support transactions that span multiple processors. That is not an issue for Sequoia, and it still makes it possible to estimate the gains from using transactional memory.

IBM zEnterprise EC12 (Z/Architecture) 2012

This is the first commercial server with transactional memory instructions. Using transactional memory, IBM measured a 45% performance increase over the z196 machine on the DB2 database and on virtualized server workloads.

IBM Power8 (Power) 2013

Power8 memory controllers support transactional memory. Support for the technology landed in the Linux kernel in version 3.9. I have not found much information about HTM in Power8; I hope more will become available soon.

Intel Haswell (x86) 2013

In 2012 Intel announced the Transactional Synchronization Extensions (TSX) to the x86 instruction set. The extensions let programmers mark selected code regions as transactions. In 2013, with the launch of the Haswell generation of processors, hardware transactional memory became available to consumers.

Haswell tracks read and write sets at cache-line granularity by monitoring addresses in the L1 data cache. Conflicts are detected through the cache-coherence protocol, which is only logical: detecting transaction conflicts and maintaining cache coherence are closely related problems. If a value has changed in one thread but not in the others, something is off.

TSX consists of two parts. Hardware Lock Elision (HLE) offers a simple conversion of lock-based programs into transactional ones, and the resulting programs remain backward compatible with existing processors. Restricted Transactional Memory (RTM) is a more complete implementation of transactional memory.

Hardware Lock Elision

HLE slightly modifies the instructions that change a memory location. The technology adds prefixes (instructions that perform no action themselves but change the interpretation of the following instruction) to modify the lock acquire and release instructions: XACQUIRE and XRELEASE.

Between the acquire and the release of a lock, the processor tracks the memory locations that the threads read and write. On a conflict (two threads modify the same data, or one thread reads data that another writes) the processor aborts the transaction when the lock is released. Otherwise execution continues normally. HLE thus lets lock-based code run optimistically: each thread believes it holds the lock, while the other threads can work concurrently. As long as this is safe, transactions are not aborted.

The technology is backward compatible with processors that lack HTM support. The lock operations remain, but with a special prefix. A Haswell processor honors the prefix and uses transactional execution instead of manipulating the lock; any other processor ignores the prefix and simply manages the lock, falling back to traditional lock-based behavior. The XACQUIRE and XRELEASE encodings already exist but have no meaning until they are applied to the specific instructions. HLE makes it possible to write programs and operating systems that use transactions on Haswell, raising the level of parallelism without locks, while the same code still runs correctly on current processors. Adopting the new facility is therefore safe and simple.

HLE by a simple example

Operating systems implement locks in the kernel as memory locations holding a processor-native integer type. To acquire a lock, a thread performs some operation on this location, for example changing its value from 0 to 1. To release the lock, the inverse operation is applied (a decrement from 1 back to 0). The change is visible to every core and every thread in the system, so other threads can see that the lock is taken and must wait until it is released (until the value is 0 again).
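
The lock described above can be sketched as a one-word spinlock. This is an illustrative sketch, not how any particular OS implements its locks.

```cpp
#include <atomic>

// A one-word spinlock: 0 = free, 1 = held. This is the kind of
// lock HLE elides: under XACQUIRE the write of 1 stays local to
// the eliding core instead of becoming globally visible.
struct SpinLock {
    std::atomic<int> word{0};

    void acquire() {
        int expected = 0;
        // Atomically flip 0 -> 1; spin while another thread holds it.
        while (!word.compare_exchange_weak(expected, 1)) {
            expected = 0;  // reset after a failed attempt and retry
        }
    }

    void release() {
        word.store(0);  // back to 0: other threads may now acquire
    }
};
```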

With the XACQUIRE/XRELEASE prefixes, an attempt to acquire the lock succeeds and the processor assumes the lock is held and keeps going, but no globally visible change to the lock's value occurs. The thread holding the lock sees its value as 1, while other threads see 0. That lets other threads acquire the lock at the same time, and the processor lies to them in the same way: each thread sees the lock as taken by itself while the others do not. This behavior is what gives elision its name: instead of changing the value from 0 to 1 and back, the value simply stays 0. The "unnecessary" write is elided.

Restricted Transactional Memory

RTM requires more effort. It gives up backward compatibility and introduces three new instructions. Where HLE uses transactions implicitly, letting lock-based code run in parallel, RTM makes the start, commit, and abort of transactions explicit. A thread begins a transaction with the XBEGIN instruction, supplying a fallback routine that runs if the transaction aborts. The transaction is committed with XEND. The processor either commits the transaction, if there were no conflicts, or aborts it and jumps to the fallback routine. Transactions can also be aborted explicitly in the program with XABORT. Thanks to the explicit boundaries and the fallback "emergency exit", RTM gives more control over transactions than HLE, and in the long run it should make transactional facilities easier to implement.

Summary

Transactional memory is a viable alternative to today's synchronization methods, and it simplifies parallel programming.

Footnotes
  • Rob Pike explained the difference between "concurrency" and "parallelism" in his talk "Concurrency is not Parallelism": "Concurrency is about dealing with lots of things at once; parallelism is about doing lots of things at once."
  • Yep, exactly. Besides a number of software projects (OpenSolaris, MySQL, OpenOffice), Oracle gave up a promising hardware technology as well.

What do you guys think? Do you want more posts about transactional memory?
