
Part I
Running tests with litmus

Traditionally, a litmus test is a small parallel program designed to exercise the memory model of a parallel, shared-memory computer. Given a litmus test in assembler (X86, Power or ARM), litmus runs the test.

Using litmus thus requires a parallel machine, which must additionally feature gcc and the pthreads library. At the moment, litmus is a prototype and has numerous limitations (recognised instructions, limited porting). Nevertheless, litmus should accept all tests produced by the companion diy tool and has been successfully used on Linux, MacOS and AIX.

The authors of litmus are Luc Maranget and Susmit Sarkar. The present litmus is inspired by a prototype by Thomas Braibant and Francesco Zappa Nardelli.

1  A tour of litmus

1.1  A simple run

Consider the following (rather classical) store buffering litmus test for X86, SB.litmus:

X86 SB
"Fre PodWR Fre PodWR"
{ x=0; y=0; }
 P0          | P1          ;
 MOV [x],$1  | MOV [y],$1  ;
 MOV EAX,[y] | MOV EAX,[x] ;
exists (0:EAX=0 /\ 1:EAX=0)

A litmus test source has three main sections:

  1. The initial state defines the initial values of registers and memory locations. Initialisation to zero may be omitted.
  2. The code section defines the code to be run concurrently — above there are two threads. Yes we know, our X86 assembler syntax is a mistake.
  3. The final condition applies to the final values of registers and memory locations.

Run the test by (complete run log):

% litmus SB.litmus
%%%%%%%%%%%%%%%%%%%%%%%%%
% Results for SB.litmus %
%%%%%%%%%%%%%%%%%%%%%%%%%
X86 SB
"Fre PodWR Fre PodWR"

{x=0; y=0;}

 P0          | P1          ;
 MOV [x],$1  | MOV [y],$1  ;
 MOV EAX,[y] | MOV EAX,[x] ;

exists (0:EAX=0 /\ 1:EAX=0)
Generated assembler
#START _litmus_P1
 movl $1,(%r10)
 movl (%r9),%eax
#START _litmus_P0
 movl $1,(%r9)
 movl (%r10),%eax

Test SB Allowed
Histogram (4 states)
40    *>0:EAX=0; 1:EAX=0;
499923:>0:EAX=1; 1:EAX=0;
500009:>0:EAX=0; 1:EAX=1;
28    :>0:EAX=1; 1:EAX=1;
Ok

Witnesses
Positive: 40, Negative: 999960
Condition exists (0:EAX=0 /\ 1:EAX=0) is validated
Hash=7dbd6b8e6dd4abc2ef3d48b0376fb2e3
Observation SB Sometimes 40 999960
Time SB 0.44
...

The litmus test is first recalled, followed by the actual assembler — the machine is a 64-bit one, in-line address references have disappeared, registers may change, and the assembler syntax is now more familiar. The test has run one million times, producing one million final states, or outcomes, for the EAX registers of threads P0 and P1. The run validates the condition, with 40 positive witnesses.

1.2  Cross compilation

With option -o <name.tar>, litmus does not run the test. Instead, it produces a tar archive that contains the C sources for the test.

Consider SB-PPC.litmus, a Power version of the previous test:

PPC SB-PPC
"Fre PodWR Fre PodWR"
{
0:r2=x; 0:r4=y;
1:r2=y; 1:r4=x;
}
 P0           | P1           ;
 li r1,1      | li r1,1      ;
 stw r1,0(r2) | stw r1,0(r2) ;
 lwz r3,0(r4) | lwz r3,0(r4) ;
exists (0:r3=0 /\ 1:r3=0)

Our target machine (ppc) runs MacOS, which we specify with the -os option:

% litmus -o /tmp/a.tar -os mac SB-PPC.litmus
% scp /tmp/a.tar ppc:/tmp

Then, on the remote machine ppc:

ppc% mkdir SB && cd SB
ppc% tar xf /tmp/a.tar
ppc% ls
comp.sh  Makefile  outs.c  outs.h  README.txt  run.sh  SB-PPC.c  show.awk  utils.c  utils.h

The test is compiled by the shell script comp.sh (or by (Gnu) make, at the user’s choice) and run by the shell script run.sh:

ppc% sh comp.sh
ppc% sh run.sh
...
Test SB-PPC Allowed
Histogram (3 states)
1784  *>0:r3=0; 1:r3=0;
498564:>0:r3=1; 1:r3=0;
499652:>0:r3=0; 1:r3=1;
Ok

Witnesses
Positive: 1784, Negative: 998216
Condition exists (0:r3=0 /\ 1:r3=0) is validated
Hash=4edecf6abc507611612efaecc1c4a9bc
Observation SB-PPC Sometimes 1784 998216
Time SB-PPC 0.55
...

(Complete run log.) As we see, the condition also validates on Power. Notice that compilation produces an executable file, SB-PPC.exe, which can be run directly for less verbose output.

1.3  Running several tests at once

Consider the additional test STFW-PPC.litmus:

PPC STFW-PPC
"Rfi PodRR Fre Rfi PodRR Fre"
{
0:r2=x; 0:r5=y;
1:r2=y; 1:r5=x;
}
 P0           | P1           ;
 li r1,1      | li r1,1      ;
 stw r1,0(r2) | stw r1,0(r2) ;
 lwz r3,0(r2) | lwz r3,0(r2) ;
 lwz r4,0(r5) | lwz r4,0(r5) ;
exists
(0:r3=1 /\ 0:r4=0 /\ 1:r3=1 /\ 1:r4=0)

To compile the two tests together, we can give two file names as arguments to litmus:

$ litmus -o /tmp/a.tar -os mac SB-PPC.litmus STFW-PPC.litmus 

Or, more conveniently, list the litmus sources in a file whose name starts with @:

$ cat @ppc
SB-PPC.litmus
STFW-PPC.litmus
$ litmus -o /tmp/a.tar -os mac @ppc

To run the tests on the remote ppc machine, the same sequence of commands as in the one-test case applies:

ppc% tar xf /tmp/a.tar && make && sh run.sh
...
Test SB-PPC Allowed
Histogram (3 states)
1765  *>0:r3=0; 1:r3=0;
498741:>0:r3=1; 1:r3=0;
499494:>0:r3=0; 1:r3=1;
Ok

Witnesses
Positive: 1765, Negative: 998235
Condition exists (0:r3=0 /\ 1:r3=0) is validated
Hash=4edecf6abc507611612efaecc1c4a9bc
Observation SB-PPC Sometimes 1765 998235
Time SB-PPC 0.57
...
Test STFW-PPC Allowed
Histogram (4 states)
480   *>0:r3=1; 0:r4=0; 1:r3=1; 1:r4=0;
499560:>0:r3=1; 0:r4=1; 1:r3=1; 1:r4=0;
499827:>0:r3=1; 0:r4=0; 1:r3=1; 1:r4=1;
133   :>0:r3=1; 0:r4=1; 1:r3=1; 1:r4=1;
Ok

Witnesses
Positive: 480, Negative: 999520
Condition exists (0:r3=1 /\ 0:r4=0 /\ 1:r3=1 /\ 1:r4=0) is validated
Hash=92b2c3f6332309325000656d0632131e
Observation STFW-PPC Sometimes 480 999520
Time STFW-PPC 0.56
...

(Complete run log.) Now, the output of run.sh shows the results of the two tests.

2  Controlling test parameters

Users can control some of the testing conditions. These impact both efficiency and outcome variability.

Sometimes one looks for a particular outcome — for instance, one may seek to get the outcome 0:r3=1; 1:r3=1; that is missing in the previous experiment for test SB-PPC. To that end, varying test conditions may help.

2.1  Architecture of tests

Consider a test a.litmus designed to run on t threads P0,…, Pt−1. The structure of the executable a.exe that performs the experiment is as follows:

Hence, running a.exe produces n × r × s outcomes. Parameters n, a, r and s can first be set directly while invoking a.exe, using the appropriate command line options. For instance, assuming t=2, ./a.exe -a 201 -r 10k -s 1 and ./a.exe -n 1 -r 1 -s 1M will both produce one million outcomes, but the latter is probably more efficient. If our machine has 8 cores, ./a.exe -a 8 -r 1 -s 1M will yield 4 million outcomes, in a time that we hope will not exceed by much the one experienced with ./a.exe -n 1. Also observe that the memory allocated is roughly proportional to n × s, while the number of Tk threads created will be t × n × r (t × n in cache mode). The run.sh shell script transmits its command line to all the executable (.exe) files it invokes, thereby providing a convenient means to control testing conditions for several tests. Satisfactory test parameters are found by experimenting, and the command line control of executable files is designed for that purpose.

Once satisfactory parameters are found, it is a nuisance to repeat them for every experiment. Thus, parameters a, r and s can also be set while invoking litmus, with the same command line options. In fact, those settings define the default values of the corresponding controls of the .exe files. Additionally, the synchronisation technique for iterations, the memory mode, and several other compile-time parameters can be selected by appropriate litmus command line options. Finally, users can record frequently used parameters in configuration files.
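For instance, one could record such parameters in a configuration file and select it with the litmus option -mach (the syntax and lookup rules of configuration files are detailed in Sec. 4). A hypothetical sketch, with an invented file name mybox.cfg and arbitrary values:

% cat mybox.cfg
# hypothetical settings
size_of_test = 1000
number_of_run = 1000
avail = 4

Such a file would then be selected with litmus -mach mybox.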

2.2  Affinity

We view affinity as a scheduler property that binds a (software, POSIX) thread to a given (hardware) logical processor. In the simplest situation, a logical processor is a core. However, in the presence of hyper-threading (x86) or simultaneous multi-threading (SMT, Power), a given core can host several logical processors.

2.2.1  Introduction to affinity

In our experience, binding the threads of test programs to selected logical processors yields significant speedups and, more importantly, greater outcome variety. We illustrate the issue by means of an example.

We consider the test ppc-iriw-lwsync.litmus:

PPC ppc-iriw-lwsync
{
0:r2=x; 1:r2=x; 1:r4=y;
2:r4=y; 3:r2=x; 3:r4=y;
}
 P0           | P1           | P2           | P3           ;
 li r1,1      | lwz r1,0(r2) | li r1,1      | lwz r1,0(r4) ;
 stw r1,0(r2) | lwsync       | stw r1,0(r4) | lwsync       ;
              | lwz r3,0(r4) |              | lwz r3,0(r2) ;
exists (1:r1=1 /\ 1:r3=0 /\ 3:r1=1 /\ 3:r3=0)

The test consists of four threads. There are two writers (P0 and P2) that write the value one into two different locations (x and y), and two readers that read the contents of x and y in different orders — P1 reads x first, while P3 reads y first. The load instructions lwz in the reader threads are separated by a lightweight barrier instruction lwsync. The final condition exists (1:r1=1 /\ 1:r3=0 /\ 3:r1=1 /\ 3:r3=0) characterises the situation where the reader threads see the writes by P0 and P2 in opposite orders. The corresponding outcome 1:r1=1; 1:r3=0; 3:r1=1; 3:r3=0; is the only possible non-sequentially consistent (non-SC, see Part II) outcome. By any reasonable memory model for Power, one expects the condition to validate, i.e. the non-SC outcome to show up.

The tested machine vargas is a Power 6 featuring 32 cores (i.e. 64 logical processors, since SMT is enabled) and running AIX in 64-bit mode. So as not to disturb other users, we run only one instance of the test, thus specifying four available processors. The litmus tool is absent on vargas. All these conditions dictate the following invocation of litmus, performed on our local machine:

$ litmus -r 1000 -s 1000 -a 4 -os aix -ws w64 ppc-iriw-lwsync.litmus -o ppc.tar
$ scp ppc.tar vargas:/var/tmp

On vargas we unpack the archive and compile the test:

vargas% tar xf /var/tmp/ppc.tar && sh comp.sh

Then we run the test:

vargas% ./ppc-iriw-lwsync.exe
Test ppc-iriw-lwsync Allowed
Histogram (15 states)
163674:>1:r1=0; 1:r3=0; 3:r1=0; 3:r3=0;
34045 :>1:r1=1; 1:r3=0; 3:r1=0; 3:r3=0;
40283 :>1:r1=0; 1:r3=1; 3:r1=0; 3:r3=0;
95079 :>1:r1=1; 1:r3=1; 3:r1=0; 3:r3=0;
33848 :>1:r1=0; 1:r3=0; 3:r1=1; 3:r3=0;
72201 :>1:r1=0; 1:r3=1; 3:r1=1; 3:r3=0;
32452 :>1:r1=1; 1:r3=1; 3:r1=1; 3:r3=0;
43031 :>1:r1=0; 1:r3=0; 3:r1=0; 3:r3=1;
73052 :>1:r1=1; 1:r3=0; 3:r1=0; 3:r3=1;
1     :>1:r1=0; 1:r3=1; 3:r1=0; 3:r3=1;
42482 :>1:r1=1; 1:r3=1; 3:r1=0; 3:r3=1;
90470 :>1:r1=0; 1:r3=0; 3:r1=1; 3:r3=1;
30306 :>1:r1=1; 1:r3=0; 3:r1=1; 3:r3=1;
43239 :>1:r1=0; 1:r3=1; 3:r1=1; 3:r3=1;
205837:>1:r1=1; 1:r3=1; 3:r1=1; 3:r3=1;
No

Witnesses
Positive: 0, Negative: 1000000
Condition exists (1:r1=1 /\ 1:r3=0 /\ 3:r1=1 /\ 3:r3=0) is NOT validated
Hash=4fbfaafa51f6784d699e9bdaf5ba047d
Observation ppc-iriw-lwsync Never 0 1000000
Time ppc-iriw-lwsync 1.32

The non-SC outcome does not show up.

Altering parameters may yield this outcome. In particular, we may try using all the available logical processors with option -a 64. Affinity control offers an alternative, which is enabled at compilation time with litmus option -affinity:

$ litmus ... -affinity incr1 ppc-iriw-lwsync.litmus -o ppc.tar
$ scp ppc.tar vargas:/var/tmp

Option -affinity takes one argument (incr1 above) that specifies the increment used while allocating logical processors to test threads. Here, the (POSIX) threads created by the test (named T0, T1, T2 and T3 in Sec. 2.1) will get bound to logical processors 0, 1, 2, and 3, respectively.

Namely, by default, the logical processors are ordered as the sequence 0, 1, …, A−1 — where A is the number of available logical processors, which is inferred by the test executable². Furthermore, logical processors are allocated to threads by applying the affinity increment while scanning the logical processor sequence. Observe that since the launch mode is changing (the default), threads Tk correspond to different test threads Pi at each run. The unpack, compile and run sequence on vargas now yields the non-SC outcome, better outcome variety and a lower running time:

vargas% tar xf /var/tmp/ppc.tar && sh comp.sh
vargas% ./ppc-iriw-lwsync.exe
Test ppc-iriw-lwsync Allowed
Histogram (16 states)
180600:>1:r1=0; 1:r3=0; 3:r1=0; 3:r3=0;
3656  :>1:r1=1; 1:r3=0; 3:r1=0; 3:r3=0;
18812 :>1:r1=0; 1:r3=1; 3:r1=0; 3:r3=0;
77692 :>1:r1=1; 1:r3=1; 3:r1=0; 3:r3=0;
2973  :>1:r1=0; 1:r3=0; 3:r1=1; 3:r3=0;
9     *>1:r1=1; 1:r3=0; 3:r1=1; 3:r3=0;
28881 :>1:r1=0; 1:r3=1; 3:r1=1; 3:r3=0;
75126 :>1:r1=1; 1:r3=1; 3:r1=1; 3:r3=0;
20939 :>1:r1=0; 1:r3=0; 3:r1=0; 3:r3=1;
30498 :>1:r1=1; 1:r3=0; 3:r1=0; 3:r3=1;
1234  :>1:r1=0; 1:r3=1; 3:r1=0; 3:r3=1;
89993 :>1:r1=1; 1:r3=1; 3:r1=0; 3:r3=1;
75769 :>1:r1=0; 1:r3=0; 3:r1=1; 3:r3=1;
76361 :>1:r1=1; 1:r3=0; 3:r1=1; 3:r3=1;
87864 :>1:r1=0; 1:r3=1; 3:r1=1; 3:r3=1;
229593:>1:r1=1; 1:r3=1; 3:r1=1; 3:r3=1;
Ok

Witnesses
Positive: 9, Negative: 999991
Condition exists (1:r1=1 /\ 1:r3=0 /\ 3:r1=1 /\ 3:r3=0) is validated
Hash=4fbfaafa51f6784d699e9bdaf5ba047d
Observation ppc-iriw-lwsync Sometimes 9 999991
Time ppc-iriw-lwsync 0.68

One may change the affinity increment with the command line option -i of executable files. For instance, one binds the test threads to logical processors 0, 2, 4 and 6 as follows:

vargas% ./ppc-iriw-lwsync.exe -i 2 
Test ppc-iriw-lwsync Allowed
Histogram (15 states)
160629:>1:r1=0; 1:r3=0; 3:r1=0; 3:r3=0;
33389 :>1:r1=1; 1:r3=0; 3:r1=0; 3:r3=0;
43725 :>1:r1=0; 1:r3=1; 3:r1=0; 3:r3=0;
93114 :>1:r1=1; 1:r3=1; 3:r1=0; 3:r3=0;
33556 :>1:r1=0; 1:r3=0; 3:r1=1; 3:r3=0;
64875 :>1:r1=0; 1:r3=1; 3:r1=1; 3:r3=0;
34908 :>1:r1=1; 1:r3=1; 3:r1=1; 3:r3=0;
43770 :>1:r1=0; 1:r3=0; 3:r1=0; 3:r3=1;
64544 :>1:r1=1; 1:r3=0; 3:r1=0; 3:r3=1;
4     :>1:r1=0; 1:r3=1; 3:r1=0; 3:r3=1;
54633 :>1:r1=1; 1:r3=1; 3:r1=0; 3:r3=1;
92617 :>1:r1=0; 1:r3=0; 3:r1=1; 3:r3=1;
34754 :>1:r1=1; 1:r3=0; 3:r1=1; 3:r3=1;
54027 :>1:r1=0; 1:r3=1; 3:r1=1; 3:r3=1;
191455:>1:r1=1; 1:r3=1; 3:r1=1; 3:r3=1;
No

Witnesses
Positive: 0, Negative: 1000000
Condition exists (1:r1=1 /\ 1:r3=0 /\ 3:r1=1 /\ 3:r3=0) is NOT validated
Hash=4fbfaafa51f6784d699e9bdaf5ba047d
Observation ppc-iriw-lwsync Never 0 1000000
Time ppc-iriw-lwsync 0.92

One observes that the non-SC outcome does not show up with the new affinity setting.

One may also bind test threads to logical processors randomly, with the executable option +ra:

vargas% ./ppc-iriw-lwsync.exe +ra
Test ppc-iriw-lwsync Allowed
Histogram (15 states)
...
No

Witnesses
Positive: 0, Negative: 1000000
Condition exists (1:r1=1 /\ 1:r3=0 /\ 3:r1=1 /\ 3:r3=0) is NOT validated
Hash=4fbfaafa51f6784d699e9bdaf5ba047d
Observation ppc-iriw-lwsync Never 0 1000000
Time ppc-iriw-lwsync 1.85

As we see, the condition does not validate either with random affinity. As a matter of fact, logical processors are taken at random in the sequence 0, 1, …, 63; while the successful run with -i 1 took them in the sequence 0, 1, 2, 3. One can limit the sequence of logical processors with option -p, which takes a list of logical processor numbers as argument:

vargas% ./ppc-iriw-lwsync.exe +ra -p 0,1,2,3
Test ppc-iriw-lwsync Allowed
Histogram (16 states)
...
8     *>1:r1=1; 1:r3=0; 3:r1=1; 3:r3=0;
...
Ok

Witnesses
Positive: 8, Negative: 999992
Condition exists (1:r1=1 /\ 1:r3=0 /\ 3:r1=1 /\ 3:r3=0) is validated
Hash=4fbfaafa51f6784d699e9bdaf5ba047d
Observation ppc-iriw-lwsync Sometimes 8 999992
Time ppc-iriw-lwsync 0.70

The condition now validates.

2.2.2  Study of affinity

As illustrated by the previous example, both the running time and the outcomes of a test are sensitive to affinity settings. We measured running time for increasing values of the affinity increment from 0 (which disables affinity control) to 20, producing the following figure:

As regards outcome variety, we get all of the 16 possible outcomes only for an affinity increment of 1.

The differences in running times can be explained by reference to the mapping of logical processors to hardware. The machine vargas consists of four MCMs (Multi-Chip Modules); each MCM consists of four “chips”, each chip consists of two cores, and each core may support two logical processors. As far as we can tell from querying vargas with the AIX commands lsattr, bindprocessor and llstat, the MCMs hold the logical processors 0–15, 16–31, 32–47 and 48–63; each chip holds the logical processors 4k, 4k+1, 4k+2, 4k+3; and each core holds the logical processors 2k, 2k+1.

The measurement of running times for varying increments reveals two noticeable slowdowns: from an increment of 1 to an increment of 2, and from 5 to 6. The gap between 1 and 2 reveals the benefits of SMT for our testing application: an increment of 1 yields both the greatest outcome variety and the minimal running time. The other gap may perhaps be explained by reference to MCMs: for a value of 5 the test runs on the logical processors 0, 5, 10, 15, all belonging to the same MCM, while the next increment of 6 results in running the test on two different MCMs (0, 6, 12 on the one hand and 18 on the other).

As a conclusion, affinity control provides users with a certain level of control over thread placement, which is likely to yield faster tests when threads are constrained to run on logical processors that are “close” to one another. The best results are obtained when SMT is effectively enforced. However, affinity control is no panacea, and the memory system may be stressed by other means, for instance by allocating large chunks of memory (option -s).

2.2.3  Advanced control

For specific experiments, the technique of allocating logical processors sequentially by following a fixed increment may be too rigid. litmus offers finer control over affinity by allowing users to supply the logical processor sequence. Notice that most users will probably not need this advanced feature.

Anyhow, so as to confirm that testing ppc-iriw-lwsync benefits from not crossing chip boundaries, one may wish to confine its four threads to logical processors 16 to 19, that is, to the first chip of the second MCM. This can be done by overriding the default logical processor sequence with a user-supplied one, given as an argument to the command-line option -p:

vargas% ./ppc-iriw-lwsync.exe -p 16,17,18,19 -i 1
Test ppc-iriw-lwsync Allowed
Histogram (16 states)
169420:>1:r1=0; 1:r3=0; 3:r1=0; 3:r3=0;
1287  :>1:r1=1; 1:r3=0; 3:r1=0; 3:r3=0;
17344 :>1:r1=0; 1:r3=1; 3:r1=0; 3:r3=0;
85329 :>1:r1=1; 1:r3=1; 3:r1=0; 3:r3=0;
1548  :>1:r1=0; 1:r3=0; 3:r1=1; 3:r3=0;
3     *>1:r1=1; 1:r3=0; 3:r1=1; 3:r3=0;
27014 :>1:r1=0; 1:r3=1; 3:r1=1; 3:r3=0;
75160 :>1:r1=1; 1:r3=1; 3:r1=1; 3:r3=0;
19828 :>1:r1=0; 1:r3=0; 3:r1=0; 3:r3=1;
29521 :>1:r1=1; 1:r3=0; 3:r1=0; 3:r3=1;
441   :>1:r1=0; 1:r3=1; 3:r1=0; 3:r3=1;
93878 :>1:r1=1; 1:r3=1; 3:r1=0; 3:r3=1;
81081 :>1:r1=0; 1:r3=0; 3:r1=1; 3:r3=1;
76701 :>1:r1=1; 1:r3=0; 3:r1=1; 3:r3=1;
93623 :>1:r1=0; 1:r3=1; 3:r1=1; 3:r3=1;
227822:>1:r1=1; 1:r3=1; 3:r1=1; 3:r3=1;
Ok

Witnesses
Positive: 3, Negative: 999997
Condition exists (1:r1=1 /\ 1:r3=0 /\ 3:r1=1 /\ 3:r3=0) is validated
Hash=4fbfaafa51f6784d699e9bdaf5ba047d
Observation ppc-iriw-lwsync Sometimes 3 999997
Time ppc-iriw-lwsync 0.63

Thus we get results similar to the previous experiment on logical processors 0 to 3 (option -i 1 alone).

We may also run four simultaneous instances (-n 4, parameter n of section 2.1) of the test on the four available MCMs:

vargas% ./ppc-iriw-lwsync.exe -p 0,1,2,3,16,17,18,19,32,33,34,35,48,49,50,51 -n 4 -i 1
Test ppc-iriw-lwsync Allowed
Histogram (16 states)
...
57    *>1:r1=1; 1:r3=0; 3:r1=1; 3:r3=0;
...
Ok

Witnesses
Positive: 57, Negative: 3999943
Condition exists (1:r1=1 /\ 1:r3=0 /\ 3:r1=1 /\ 3:r3=0) is validated
Hash=4fbfaafa51f6784d699e9bdaf5ba047d
Observation ppc-iriw-lwsync Sometimes 57 3999943
Time ppc-iriw-lwsync 0.75

Observe that, for a negligible penalty in running time, the number of non-SC outcomes increases significantly.

By contrast, binding the threads of a given instance of the test to different MCMs results in a poor running time and no non-SC outcomes:

vargas% ./ppc-iriw-lwsync.exe -p 0,1,2,3,16,17,18,19,32,33,34,35,48,49,50,51 -n 4 -i 4
Test ppc-iriw-lwsync Allowed
Histogram (15 states)
...
Witnesses
Positive: 0, Negative: 4000000
Condition exists (0:r1=1 /\ 0:r3=0 /\ 2:r1=1 /\ 2:r3=0) is NOT validated
Time ppc-iriw-lwsync 1.48

In the experiment above, the increment is 4, hence the logical processors allocated to the first instance of the test are 0, 16, 32, 48, whose indices in the logical processor sequence are 0, 4, 8, 12, respectively. The next index to be allocated in the sequence is 12+4 = 16. However, the sequence has 16 items, so wrapping around yields index 0, which happens to be the same as the starting index. Then, so as to allocate fresh processors, the starting index is incremented by one, resulting in processors 1, 17, 33, 49 (indices 1, 5, 9, 13) being allocated to the second instance — see section 2.3 for the full story. Similarly, the third and fourth instances will get processors 2, 18, 34, 50 and 3, 19, 35, 51, respectively. Attentive readers may have noticed that the same experiment can be performed with option -i 16 and no -p option.

Finally, users should probably be aware that at least some versions of Linux for x86 feature a less obvious mapping of logical processors to hardware. On a bi-processor, dual-core, 2-way hyper-threading Linux AMD64 machine, we have checked that the logical processors residing on the same core are k and k+4, where k is a core number ranging from 0 to 3. As a result, a proper choice for favouring effective hyper-threading on such a machine is -i 4 (or -p 0,4,1,5,2,6,3,7 -i 1). Perhaps more worth noticing, the straightforward choice -i 1 disfavours effective hyper-threading…

2.2.4  Custom control

Most tests run by litmus are produced by the litmus test generators described in Part II. Those tests include meta-information that may direct affinity control. For instance, we generate one test with the diyone tool (see Sec. 5.2). More specifically, we generate IRIW+lwsyncs for Power (ppc-iriw-lwsync of the previous section) as follows:

% diyone -arch PPC -name IRIW+lwsyncs Rfe LwSyncdRR Fre Rfe LwSyncdRR Fre

We get the new source file IRIW+lwsyncs.litmus:

PPC IRIW+lwsyncs
"Rfe LwSyncdRR Fre Rfe LwSyncdRR Fre"
Prefetch=0:x=T,1:x=F,1:y=T,2:y=T,3:y=F,3:x=T
Com=Rf Fr Rf Fr
Orig=Rfe LwSyncdRR Fre Rfe LwSyncdRR Fre
{
0:r2=x;
1:r2=x; 1:r4=y;
2:r2=y;
3:r2=y; 3:r4=x;
}
 P0           | P1           | P2           | P3           ;
 li r1,1      | lwz r1,0(r2) | li r1,1      | lwz r1,0(r2) ;
 stw r1,0(r2) | lwsync       | stw r1,0(r2) | lwsync       ;
              | lwz r3,0(r4) |              | lwz r3,0(r4) ;
exists
(1:r1=1 /\ 1:r3=0 /\ 3:r1=1 /\ 3:r3=0)

The relevant meta-information is the “Com” line that describes how test threads are related — for instance, thread 0 stores a value to memory that is read by thread 1, written “Rf” (see Part II for more details). Custom affinity control will tend to run threads related by “Rf” on “close” logical processors, where we can for instance consider that close logical processors belong to the same physical core (SMT for Power). This minimal logical processor topology is described by two litmus command-line options: -smt <n>, which specifies n-way SMT, and -smt_mode (seq|end), which specifies how logical processors of the same core are numbered. For an 8-core, 4-way SMT Power7 machine we invoke litmus as follows:

% litmus -mem direct -smt 4 -smt_mode seq -affinity custom -o a.tar IRIW+lwsyncs.litmus

Notice that the memory mode is direct and that the number of available logical processors is left unspecified, resulting in running one instance of the test. More importantly, notice that affinity control is enabled with -affinity custom, which additionally selects custom affinity mode.

We then upload the archive a.tar to our Power7 machine, unpack, compile and run the test:

power7% tar xmf a.tar
power7% make
...
power7% ./IRIW+lwsyncs.exe -v
./IRIW+lwsyncs.exe  -v
IRIW+lwsyncs: n=1, r=1000, s=1000, +rm, +ca, p='0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31'
thread allocation: 
[23,22,3,2] {5,5,0,0}

Option -v instructs the executable to show the settings of the test harness: we see that one instance of the test is run (n=1), the size parameters are recalled (r=1000, s=1000), and shuffling of indirect memory mode is performed (+rm). Affinity settings are also given: the mode is custom (+ca) and the inferred logical processor sequence is shown (p='0,1,…,31'). Additionally, the allocation of test threads to logical processors is shown, as […], as well as the allocation of test threads to physical cores, as {…}.

Here is the run output proper:

Test IRIW+lwsyncs Allowed
Histogram (15 states)
2700  :>1:r1=0; 1:r3=0; 3:r1=0; 3:r3=0;
142   :>1:r1=1; 1:r3=0; 3:r1=0; 3:r3=0;
37110 :>1:r1=0; 1:r3=1; 3:r1=0; 3:r3=0;
181257:>1:r1=1; 1:r3=1; 3:r1=0; 3:r3=0;
78    :>1:r1=0; 1:r3=0; 3:r1=1; 3:r3=0;
15    *>1:r1=1; 1:r3=0; 3:r1=1; 3:r3=0;
103459:>1:r1=0; 1:r3=1; 3:r1=1; 3:r3=0;
149486:>1:r1=1; 1:r3=1; 3:r1=1; 3:r3=0;
30820 :>1:r1=0; 1:r3=0; 3:r1=0; 3:r3=1;
9837  :>1:r1=1; 1:r3=0; 3:r1=0; 3:r3=1;
2399  :>1:r1=1; 1:r3=1; 3:r1=0; 3:r3=1;
204629:>1:r1=0; 1:r3=0; 3:r1=1; 3:r3=1;
214700:>1:r1=1; 1:r3=0; 3:r1=1; 3:r3=1;
5186  :>1:r1=0; 1:r3=1; 3:r1=1; 3:r3=1;
58182 :>1:r1=1; 1:r3=1; 3:r1=1; 3:r3=1;
Ok

Witnesses
Positive: 15, Negative: 999985
Condition exists (1:r1=1 /\ 1:r3=0 /\ 3:r1=1 /\ 3:r3=0) is validated
Hash=836eb3085132d3cb06973469a08098df
Com=Rf Fr Rf Fr
Orig=Rfe LwSyncdRR Fre Rfe LwSyncdRR Fre
Affinity=[2, 3] [0, 1] ; (1,2) (3,0)
Observation IRIW+lwsyncs Sometimes 15 999985
Time IRIW+lwsyncs 0.70

As we see, the test validates: we observe the non-SC behaviour of IRIW in spite of the presence of two lwsync barriers. We may also notice, in the executable output, some meta-information related to affinity: it reads that threads 2 and 3 on the one hand, and threads 0 and 1 on the other, are considered “close” (i.e. will run on the same physical core); while threads 1 and 2 on the one hand, and threads 3 and 0 on the other, are considered “far” (i.e. will run on different cores).

Custom affinity can be disabled by selecting another affinity mode. For instance, with -i 0 we specify an affinity increment of zero; that is, affinity control is disabled altogether:

power7% ./IRIW+lwsyncs.exe -i 0 -v
./IRIW+lwsyncs.exe -i 0  -v
IRIW+lwsyncs: n=1, r=1000, s=1000, +rm, i=0, p='0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31'
Test IRIW+lwsyncs Allowed
Histogram (15 states)
...
No

Witnesses
Positive: 0, Negative: 1000000
Condition exists (1:r1=1 /\ 1:r3=0 /\ 3:r1=1 /\ 3:r3=0) is NOT validated
Hash=836eb3085132d3cb06973469a08098df
Com=Rf Fr Rf Fr
Orig=Rfe LwSyncdRR Fre Rfe LwSyncdRR Fre
Observation IRIW+lwsyncs Never 0 1000000
Time IRIW+lwsyncs 0.90

As we see, the test does not validate under those conditions.

Notice that section 18 describes a complete experiment on affinity control.

2.3  Controlling executable files

Test conditions

Any executable file produced by litmus accepts the following command line options.

-v
Be verbose, can be repeated to increase verbosity. Specifying -v is a convenient way to look at the defaults of the options.
-q
Be quiet.
-a <n>
Run the maximal number of tests concurrently for n available logical processors — parameter a in Sec. 2.1. Notice that if affinity control is enabled (see below), -a 0 will set parameter a to the number of logical processors effectively available.
-n <n>
Run n tests concurrently — parameter n in Sec. 2.1.
-r <n>
Perform n runs — parameter r in Sec. 2.1.
-fr <f>
Multiply r by f (f is a floating point number).
-s <n>
Size of a run — parameter s in Sec. 2.1.
-fs <f>
Multiply s by f.
-f <f>
Multiply s by f and divide r by f.

Notice that options -s and -r accept a generalised syntax for their integer argument: when suffixed by k (resp. M) the integer gets multiplied by 10³ (resp. 10⁶).
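For instance, assuming a test executable a.exe compiled with the default values r=10 and s=100k (cf. Sec. 4), the following two invocations both produce one million outcomes per test instance:

% ./a.exe -r 10k -s 100
% ./a.exe -f 10

the first by setting r and s explicitly with the suffixed syntax, the second by multiplying s by 10 while dividing r by 10.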

The following options are accepted only for tests compiled in indirect memory mode (see Sec. 2.1):

-rm
Do not shuffle pointer arrays, resulting in a behaviour similar to direct mode, without recompilation.
+rm
Shuffle pointer arrays, provided for regularity.

The following option is accepted only for tests compiled with a specified stride value (see Sec. 2.1).

-st <n>
Change stride to <n>. The default stride is specified at compile time by litmus option -stride.

The following option is accepted when enabled at compile time:

-l <n>
Insert the assembly code of each thread in a loop of size <n>.
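For instance (a hypothetical invocation, assuming a test executable SB.exe compiled with this feature enabled), one may combine a large loop size with minimal run parameters, in the spirit of the timing measurements mentioned for the litmus option -l in Sec. 4:

% ./SB.exe -l 10000 -s 1 -r 1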

Affinity

If affinity control has been enabled at compilation time (for instance, by supplying option -affinity incr1 to litmus), the executable file produced by litmus accepts the following command line options.

-p <ns>
Logical processor sequence. The sequence <ns> is a comma-separated list of integers. The default sequence is inferred by the executable as 0,1,…,A−1, where A is the number of logical processors featured by the tested machine, or is the sequence specified at compile time with the litmus option -p.
-i <n>
Increment for allocating logical processors to threads. Default is specified at compile time by litmus option -affinity incr<n>. Notice that -i 0 disables affinity and that .exe files reject the -i option when affinity control has not been enabled at compile time.
+ra
Perform random allocation of affinity at each test round.
+ca
Perform custom affinity.

Notice that when custom affinity is not available — whether because the test source lacks meta-information or because the logical processor topology was not specified at compile time — +ca behaves as +ra.

Logical processors are allocated test instance by test instance (parameter n of Sec. 2.1) and then thread by thread, scanning the logical processor sequence left-to-right by steps of the given increment. More precisely, assume a logical processor sequence P = p0, p1, …, pA−1 and an increment i. The first processor allocated is p0, then pi, then p2i, etc. Indices in the sequence P are reduced modulo A so as to wrap around. The starting index of the allocation sequence (initially 0) is recorded, and coincidence with the index of the next processor to be allocated is checked. When coincidence occurs, a new index is computed, as the previous starting index plus one, which also becomes the new starting index. Allocation then proceeds from this new starting index. That way, all the processors in the sequence will get allocated to different threads naturally, provided of course that fewer than A threads are scheduled to run. See section 2.2.3 for an example with A=16 and i=4.
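As an illustration, here is a minimal C sketch of the allocation scheme just described — it is not litmus source code, and the names are ours. Run on the sequence and increment of the example of Sec. 2.2.3 (A=16, i=4), it reproduces the allocation described there:

/* Sketch of logical processor allocation: sequence p[] of length A,
   increment inc, nthreads <= A threads to bind. */
#include <stdio.h>

static void allocate(const int *p, int A, int inc, int nthreads, int *out) {
  int start = 0;                    /* starting index, initially 0 */
  int idx = 0;                      /* index of the next processor to allocate */
  for (int t = 0; t < nthreads; t++) {
    out[t] = p[idx];
    idx = (idx + inc) % A;          /* step by the increment, wrapping around */
    if (idx == start) {             /* wrapped back onto the starting index:  */
      start = start + 1;            /* shift the starting index by one...     */
      idx = start;                  /* ...and resume allocation from there    */
    }
  }
}

int main(void) {
  int p[16] = {0,1,2,3,16,17,18,19,32,33,34,35,48,49,50,51};
  int out[16];
  allocate(p, 16, 4, 16, out);      /* the A=16, i=4 example of Sec. 2.2.3 */
  for (int t = 0; t < 16; t++)
    printf("%d ", out[t]);          /* 0 16 32 48 1 17 33 49 2 18 34 50 3 19 35 51 */
  printf("\n");
  return 0;
}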

3  Advanced control of test parameters

3.1  Timebase synchronisation mode

Timebase synchronisation of the testing loop iterations (see Sec. 2.1) is selected by the litmus command line option -barrier timebase. In that mode, test threads will first synchronise using polling synchronisation barrier code, agree on a target timebase value, and then loop reading the timebase until it exceeds the target value. Some tests demonstrate that timebase synchronisation is more precise than user synchronisation (-barrier user, the default).

For instance, consider the x86 test 6.SB, a 6-thread analog of the SB test:

X86 6.SB
"Fre PodWR Fre PodWR Fre PodWR Fre PodWR Fre PodWR Fre PodWR"
{
}
 P0          | P1          | P2          | P3          | P4          | P5          ;
 MOV [x],$1  | MOV [y],$1  | MOV [z],$1  | MOV [a],$1  | MOV [b],$1  | MOV [c],$1  ;
 MOV EAX,[y] | MOV EAX,[z] | MOV EAX,[a] | MOV EAX,[b] | MOV EAX,[c] | MOV EAX,[x] ;
exists
(0:EAX=0 /\ 1:EAX=0 /\ 2:EAX=0 /\ 3:EAX=0 /\ 4:EAX=0 /\ 5:EAX=0)

As for SB, the final condition of 6.SB identifies executions where each thread loads the initial value 0 of a location that is written to by another thread.

We first compile the test in user synchronisation mode, saving litmus output files into the directory R:

% mkdir -p R
% litmus -barrier user -vb true -o R 6.SB.litmus
% cd R
% make

The additional command line option -vb true activates the printing of some timing information on synchronisations.

We then directly run the test executable 6.SB.exe:

% ./6.SB.exe
Test 6.SB Allowed
Histogram (62 states)
7569  :>0:EAX=1; 1:EAX=0; 2:EAX=0; 3:EAX=0; 4:EAX=0; 5:EAX=0;
8672  :>0:EAX=0; 1:EAX=1; 2:EAX=0; 3:EAX=0; 4:EAX=0; 5:EAX=0;
...
326   :>0:EAX=1; 1:EAX=0; 2:EAX=1; 3:EAX=1; 4:EAX=1; 5:EAX=1;
907   :>0:EAX=0; 1:EAX=1; 2:EAX=1; 3:EAX=1; 4:EAX=1; 5:EAX=1;
No

Witnesses
Positive: 0, Negative: 1000000
Condition exists (0:EAX=0 /\ 1:EAX=0 /\ 2:EAX=0 /\ 3:EAX=0 /\ 4:EAX=0 /\ 5:EAX=0) is NOT validated
Hash=107f1303932972b3abace3ee4027408e
Observation 6.SB Never 0 1000000
Time 6.SB 0.85

The targeted outcome — reading zero in the EAX registers of the 6 threads — is not observed. We can observe synchronisation times for all test runs with the executable command line option +vb:

% ./6.SB.exe  +vb
99999: 162768 420978 564546   -894 669468
99998:    -93      3     81   -174   -651
99997:   -975    -30    -33     93   -192
99996:    990   1098    852   1176    774
...

We see five columns of numbers that list, for each test run, the starting delays of P1, P2 etc. with respect to P0, expressed in timebase ticks. Obviously, synchronisation is rather loose: there are always two threads whose starting delays differ by about 1000 ticks.

We now compile the same test in timebase synchronisation mode, saving litmus output files into the directory RT:

% mkdir -p RT
% litmus -barrier timebase -vb true -o RT 6.SB.litmus
% cd RT
% make

And we run the test directly (option -vb disables the printing of any synchronisation timing information):

% ./6.SB.exe -vb
Test 6.SB Allowed
Histogram (64 states)
60922 *>0:EAX=0; 1:EAX=0; 2:EAX=0; 3:EAX=0; 4:EAX=0; 5:EAX=0;
38299 :>0:EAX=1; 1:EAX=0; 2:EAX=0; 3:EAX=0; 4:EAX=0; 5:EAX=0;
...
598   :>0:EAX=0; 1:EAX=1; 2:EAX=1; 3:EAX=1; 4:EAX=1; 5:EAX=1;
142   :>0:EAX=1; 1:EAX=1; 2:EAX=1; 3:EAX=1; 4:EAX=1; 5:EAX=1;
Ok

Witnesses
Positive: 60922, Negative: 939078
Condition exists (0:EAX=0 /\ 1:EAX=0 /\ 2:EAX=0 /\ 3:EAX=0 /\ 4:EAX=0 /\ 5:EAX=0) is validated
Hash=107f1303932972b3abace3ee4027408e
Observation 6.SB Sometimes 60922 939078
Time 6.SB 1.62

We now see that the test validates. Moreover, all of the 64 possible outcomes are observed.

Timebase synchronisation works as follows: at every iteration,

  1. one of the threads reads timebase T;
  2. all threads synchronise by the means of a polling synchronisation barrier;
  3. each thread computes Ti = T + δi, where δi is the timebase delay, a thread specific constant;
  4. each thread loops, reading the timebase until the read value exceeds Ti.

By default, the timebase delay δi is 2¹¹ = 2048 for all threads.
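The following C sketch summarises those four steps. It is not the actual litmus harness: read_timebase and barrier_wait stand for the architecture-specific timebase read (e.g. mftb on Power, rdtsc on x86) and for the polling synchronisation barrier, respectively.

typedef unsigned long long tb_t;

extern tb_t read_timebase(void);       /* assumed architecture-specific primitive */
extern void barrier_wait(int thread);  /* assumed polling synchronisation barrier */

static volatile tb_t target;           /* timebase value T, shared by all threads */

void timebase_sync(int thread, tb_t delta /* thread-specific delay (δ) */) {
  if (thread == 0)
    target = read_timebase();          /* 1. one thread reads the timebase T */
  barrier_wait(thread);                /* 2. all threads synchronise */
  tb_t mine = target + delta;          /* 3. compute T + δ */
  while (read_timebase() <= mine)      /* 4. loop until the timebase exceeds it */
    ;
}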

The precision of timebase synchronisation can be illustrated by enabling the printing of all synchronisation timings:

% ./6.SB.exe +vb
99999: 672294[1] 671973[1] 672375[1] 672144[1] 672303[1] 672222[1]
99998:  4524[1]  4332[1]  4446[1]  2052[65]  2064[73]  4095[1]
...
99983:  4314[1]  3036[1]  3141[1]  2769[1]  4551[1]  3243[1]
99982:* 2061[36]  2064[33]  2067[11]  2079[12]  2064[14]  2064[24]
99981:  2121[1]  2382[1]  2586[1]  2643[1]  2502[1]  2592[1]
...

For each test iteration and each thread, two numbers are shown: (1) the last timebase value read by the thread and (2), in brackets [], how many iterations of the loop of step 4 were performed. Additionally, a star “*” indicates the occurrence of the targeted outcome. Here we see that a nearly perfect synchronisation can be achieved (cf. line 99982: above).

Once timebase synchronisation has been selected (litmus option -barrier timebase), test executable behaviour can be altered by the following two command line options:

-ta <n>
Change the timebase delay δi of all threads.
-tb <0:n0;1:n1;>
Change the timebase delay δi of individual threads.
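For instance (hypothetical invocations, with arbitrary delay values; the quotes protect the semicolons from the shell):

% ./6.SB.exe -ta 4096
% ./6.SB.exe -tb '0:1024;3:4096;'

The first doubles the default delay of all threads; the second sets the delays of threads 0 and 3 individually, the other threads presumably keeping their current delay.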

The litmus command line option -vb true (verbose barrier) governs the printing of synchronisation timings. It comes in handy when choosing values for the -ta and -tb options. When it is set, the executable shows synchronisation timings for the outcomes that validate the test final condition. This default behaviour can be altered with the following two command line options:

-vb
Do not show synchronisation timings.
+vb
Show synchronisation timings for all outcomes.

Synchronisation timings are expressed in timebase ticks. The format depends on the synchronisation mode (litmus option -barrier). This section just gave two examples: for user mode (timings are shown as differences from thread P0), and for timebase mode (timings are shown as differences from a timebase value commonly agreed on by all threads). Notice that, when affinity control is enabled, the logical processors the threads run on are also shown.

3.2  Advanced prefetch control

Supplying the tags custom, static, static1 or static2 to the litmus command line option -preload commands the insertion of cache prefetch or flush instructions before every test instance.

In custom mode the execution of such cache management instructions is under total user control; the other, “static”, modes offer less control to the user, for the sake of not altering the test code proper.

3.2.1  Custom prefetch

Custom prefetch mode offers complete control over cache management instructions. Users enable this mode by supplying the command line option -preload custom to litmus. For instance one may compile the x86 test 6.SB.litmus as follows:

% mkdir -p R
% litmus -mem indirect -preload custom -o R 6.SB.litmus
% cd R
% make

Notice that the test is compiled in indirect memory mode, in order to reduce false sharing effects.

The executable 6.SB.exe accepts two new command line options: -prf and -pra. Those options take arguments that describe cache management instructions. The option -pra takes one letter that stands for a cache management instruction, as described here:

I: do nothing, F: cache flush, T: cache touch, W: cache touch for a write.

Not all architectures provide all of those cache management instructions; when an instruction is missing, the letters behave as follows:

F: do nothing, T: do nothing, W: behave as T.

With -pra X the commanded action applies to all threads and all variables, for instance:

% ./6.SB.exe -pra T

will perform a run where every test thread touches the test locations that it refers to (i.e. x and y for Thread 0, y and z for Thread 1, etc.) before executing the test code proper. Although one may achieve interesting results by using this -pra option, the more selective -prf option should prove more useful. The -prf option takes a comma-separated list of cache management directives. A cache management directive is n:loc=X, where n is a thread number, loc is a program variable, and X is a cache management control letter. For instance, -prf 0:y=T instructs thread 0 to touch location y. More generally, having each thread of the test 6.SB touch the memory location it reads with its second instruction would favour reading the initial value of these locations, and thus validating the final condition of the test “(0:EAX=0 /\ 1:EAX=0 /\ 2:EAX=0 /\ 3:EAX=0 /\ 4:EAX=0 /\ 5:EAX=0)”.

Notice that those locations can be found by looking at the test code or at the diagram of the target execution. Let us have a try:

./6.SB.exe -prf 0:y=T,1:z=T,2:a=T,3:b=T,4:c=T,5:x=T               
Test 6.SB Allowed
Histogram (63 states)
10    *>0:EAX=0; 1:EAX=0; 2:EAX=0; 3:EAX=0; 4:EAX=0; 5:EAX=0;
...
Witnesses
Positive: 10, Negative: 999990
...
Prefetch=0:y=T,1:z=T,2:a=T,3:b=T,4:c=T,5:x=T
...

As can be seen, the final condition is validated. Also notice that the prefetch directives used during the run are recalled in the output. If given several times, -prf options accumulate, the rightmost directives taking precedence in case of ambiguity. As a consequence, one may achieve the same prefetching effect as above with:

% ./6.SB.exe -prf 0:y=T -prf 1:z=T -prf 2:a=T -prf 3:b=T -prf 4:c=T -prf 5:x=T

3.2.2  Prefetch metadata

The source code of tests may include prefetch directives as metadata prefixed with “Prefetch=”. In particular, the generators of the diy suite (see Part II) produce such metadata. For instance, in the case of the 6.SB test (generated source 6.SB+Prefetch.litmus), this metadata reads:

Prefetch=0:x=F,0:y=T,1:y=F,1:z=T,2:z=F,2:a=T,3:a=F,3:b=T,4:b=F,4:c=T,5:c=F,5:x=T

That is, each thread flushes the location it stores to and touches each location it reads from. Notice that each thread starts with a memory location access (here a store) and ends with another (here a load). The idea is simply to accelerate the exit access (with a cache touch) while delaying the entry access (with a cache flush).

When prefetch metadata is available, it acts as the default set of prefetch directives:

% litmus -mem indirect -preload custom -o R 6.SB+Prefetch.litmus
% cd R
% make

Then we run the test by:

% ./6.SB+Prefetch.exe
Test 6.SB Allowed
Histogram (63 states)
674   *>0:EAX=0; 1:EAX=0; 2:EAX=0; 3:EAX=0; 4:EAX=0; 5:EAX=0;
...
Witnesses
Positive: 674, Negative: 999326
...
Prefetch=0:x=F,0:y=T,1:y=F,1:z=T,2:a=T,2:z=F,3:a=F,3:b=T,4:b=F,4:c=T,5:c=F,5:x=T
...

One may notice that the prefetch directives from the source file metadata found their way into the test executable.

As with any kind of metadata, one can change the prefetch metadata by editing the litmus source file, or, better, by using the -hints command line option, which takes a filename as argument. This file is a mapping that associates new metadata to test names. As an example, we reverse the diy scheme for cache management directives, accelerating entry accesses and delaying exit accesses:

% cat map.txt
6.SB Prefetch=0:x=W,0:y=F,1:y=W,1:z=F,2:a=F,2:z=W,3:a=W,3:b=F,4:b=W,4:c=F,5:c=W,5:x=F
% litmus -mem indirect -preload custom -hints map.txt -o R 6.SB.litmus
% cd R
% make 
...
% ./6.SB.exe 
Test 6.SB Allowed
Histogram (63 states)
24    *>0:EAX=0; 1:EAX=0; 2:EAX=0; 3:EAX=0; 4:EAX=0; 5:EAX=0;
...
Prefetch=0:x=W,0:y=F,1:y=W,1:z=F,2:a=F,2:z=W,3:a=W,3:b=F,4:b=W,4:c=F,5:c=W,5:x=F
...

As we see above, the final condition validates. It does so in spite of the apparently unfavourable cache management directives.

We can experiment further without recompilation, by using the -pra and -prf command line options of the test executable. Those are parsed left-to-right, so that we can (1) cancel any default cache management directive with -pra I and (2) enable cache touch for the stores:

% ./6.SB.exe -pra I -prf 0:x=W -prf 1:y=W -prf 2:z=W -prf 3:a=W -prf 4:b=W -prf 5:c=W
Test 6.SB Allowed
...
Witnesses
Positive: 0, Negative: 1000000
...
Prefetch=0:x=W,1:y=W,2:z=W,3:a=W,4:b=W,5:c=W

As we see, the final condition does not validate.

By contrast, flushing or touching the locations that the threads load allows validation to be achieved repeatedly:

chi% ./6.SB.exe -pra I -prf 0:y=F -prf 1:z=F -prf 2:a=F -prf 3:b=F -prf 4:c=F -prf 5:x=F
Test 6.SB Allowed
Histogram (63 states)
211   *>0:EAX=0; 1:EAX=0; 2:EAX=0; 3:EAX=0; 4:EAX=0; 5:EAX=0;
...
% ./6.SB.exe -pra I -prf 0:y=T -prf 1:z=T -prf 2:a=T -prf 3:b=T -prf 4:c=T -prf 5:x=T
Test 6.SB Allowed
Histogram (63 states)
10    *>0:EAX=0; 1:EAX=0; 2:EAX=0; 3:EAX=0; 4:EAX=0; 5:EAX=0;
...

In conclusion, interpreting the impact of cache management directives is not easy. However, custom preload mode (litmus command line option -preload custom) and the test executable options -pra and -prf allow experimentation on specific tests.

3.2.3  “Static” prefetch control

Custom prefetch mode comes in handy when one wants to tailor cache management directives for a particular test. In practice, we run batches of tests using the source metadata for prefetch directives. In such a setting, the code that interprets the prefetch directives is useless, as we do not use the -prf option of the test executables. As this code gets executed before each test thread's code, it may impact test results. It is desirable to suppress this code from test executables while still performing cache management instructions. To that end, litmus provides the “static” preload modes, enabled with command line options -preload static, -preload static1 and -preload static2.

In the first of these modes, -preload static, and without any further user intervention, each test thread executes the cache management instructions commanded by the Prefetch metadata:

% mkdir -p S
% litmus -mem indirect -preload static -o S 6.SB+Prefetch.litmus
% make -C S
% S/6.SB+Prefetch.exe
Test 6.SB Allowed
Histogram (63 states)
804   *>0:EAX=0; 1:EAX=0; 2:EAX=0; 3:EAX=0; 4:EAX=0; 5:EAX=0;
...
Observation 6.SB Sometimes 804 999196
...

As we can see above, the effect of the cache management instructions looks more favourable than in custom preload mode.

Users still have limited control over the execution of cache management instructions: the produced executables accept a new -prs <n> option, which takes a non-negative integer as argument. Each test thread then executes the cache management instructions commanded by the source metadata with probability 1/n, the special value n=0 disabling prefetch altogether. The default for the -prs option is “1” (always execute the cache management instructions). Let us try:

% S/6.SB+Prefetch.exe -prs 0 | grep Observation
Observation 6.SB Never 0 1000000
% S/6.SB+Prefetch.exe -prs 1 | grep Observation
Observation 6.SB Sometimes 901 999099
% S/6.SB+Prefetch.exe -prs 2 | grep Observation
Observation 6.SB Sometimes 29 999971
% S/6.SB+Prefetch.exe -prs 3 | grep Observation
Observation 6.SB Sometimes 16 999984

In those experiments we show the “Observation” field of the litmus output: this field gives the count of outcomes that validate the final condition, followed by the count of outcomes that do not. The above counts confirm that the cache management instructions favour validation.

The remaining preload modes static1 and static2 are similar, except that they produce executable files that do not accept the -prs option. Furthermore, in the former mode, -preload static1, cache management instructions are always executed, while in the latter mode, -preload static2, they are executed with probability 1/2. Those modes thus act as the pure static mode (litmus option -preload static) with runtime options -prs 1 and -prs 2, respectively. Moreover, as the test scaffold includes no code to interpret the -prs <n> switch, the test code is less perturbed. In practice, and for the 6.SB example, there is little difference:

% mkdir -p S1 S2
% litmus -mem indirect -preload static1 -o S1 6.SB+Prefetch.litmus
% litmus -mem indirect -preload static2 -o S2 6.SB+Prefetch.litmus
% make -C S1 && make -C S2
...
% S1/6.SB+Prefetch.exe | grep Observation
Observation 6.SB Sometimes 1119 998881
% S2/6.SB+Prefetch.exe | grep Observation
Observation 6.SB Sometimes 16 999984

4  Usage of litmus

Arguments

litmus takes file names as command line arguments. Those files are either single litmus tests, when they have extension .litmus, or lists of file names, when prefixed by @. Of course, the file names in @files can themselves be @files.

Options

There are many command line options. We describe the most useful ones:

General behaviour

-version
Show version number and exit.
-libdir
Show installation directory and exit.
-v
Be verbose, can be repeated to increase verbosity.
-mach <name>
Read configuration file name.cfg. See the next section for the syntax of configuration files.
-o <dest>
Save the C sources of the tests into <dest> instead of running them. If argument <dest> is an archive (extension .tar) or a compressed archive (extension .tgz), litmus builds an archive: this is the “cross compilation feature” demonstrated in Sec. 1.2. Otherwise, <dest> is interpreted as the name of an existing directory and tests are saved in it.
-driver (shell|C|XCode)
Choose the driver that will run the tests. In the “shell” (default) mode, each test is compiled into an executable, and a dedicated shell script run.sh launches the test executables. In the “C” mode, one executable run.exe is produced, which launches the tests (see Sec. 20.1.2 for an example). Finally, the XCode mode is for inclusion of the tests into a dedicated iOS App, which we do not distribute at the moment.
-crossrun <(user@)?host(:port)?|adb>
When the shell driver is used (-driver shell above), instruct the run.sh script to run individual tests on a remote machine. The remote host can be contacted by means of ssh or of the Android Debug Bridge (adb).
ssh
user is a login name on the remote host, host is the name of the remote host, and port is a port number which can be omitted when standard (22).
adb
Tests will be run in the remote directory /data/tmp.
This option may be useful when the tested machine has little disk space or a crippled installation. Default is disabled — i.e. tests run on the machine where the run.sh script runs. A sketch of typical invocations appears after this list of options.
-index <@name>
Save the source names of compiled files in index file @name.
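As announced in the description of -crossrun above, here is a sketch of typical invocations; the user, host name and port are invented for the example:

% litmus -o /tmp/a.tar -crossrun me@target-board:2222 SB.litmus
% litmus -o /tmp/a.tar -crossrun adb SB.litmus

In both cases, the run.sh script packed into the archive is in charge of running each test on the remote device.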

Test conditions

The following options set the default values of the options of the executable files produced:

-a <n>
Run the maximal number of tests concurrently for n available logical processors — sets the default value for -a of Sec. 2.3. Default is 1 (run one test). When affinity control is enabled, the value 0 has the special meaning of letting executables set the number of available logical processors according to how many are actually present.
-limit <bool>
Do not process tests with more than n threads, where n is the number of available cores defined above. Default is true.
-r <n>
Perform n runs — sets the default value for option -r of Sec. 2.3. The option accepts the generalised syntax for integers; default is 10.
-s <n>
Size of a run — sets the default value for option -s of Sec. 2.3. The option accepts the generalised syntax for integers; default is 100000 (or 100k).

The following additional options control the various modes described in Sec. 2.1, and more. Those cannot be changed without running litmus again:

-barrier (user|userfence|pthread|none|timebase)
Set synchronisation mode, default user. Synchronisation modes are described in Sec. 2.1.
-launch (changing|fixed)
Set launch mode, default changing.
-mem (indirect|direct)
Set memory mode, default indirect. It is possible to instruct executables compiled in indirect mode to behave almost as if compiled in direct mode, see Sec. 2.3.
-stride <n>
Specify a stride value of <n> — sets the default value for option -st of Sec. 2.3. See Sec. 2.1 for details on the stride parameter. If <n> is negative or zero, the default is restored, i.e. the stride feature is disabled.
-st <n>
Alias for -stride <n>.
-para (self|shell)
Perform several tests concurrently, either by forking POSIX threads (as described in Sec. 2.1), or by forking Unix processes. Only applies to cross compilation. Default is self.
-alloc (dynamic|static|before)
Set memory allocation mode. In “dynamic” and “before” modes, the memory used by test threads is allocated with malloc — in “before” mode, memory is allocated before forking test instances. In “static” mode, the memory is pre-allocated as static arrays. In the latter case, the size of the allocated arrays depends upon parameters defined at compile time: the number of available logical processors (see option -a <n>) and the size of a run (see option -s <n>). It remains possible to change those at execution time, provided the resulting memory size does not exceed the compile-time value. Default is dynamic.
-preload (no|random|custom|static|static1|static2)
Specify preload mode (see Sec. 2.1), default is random. Starting from version 5.0 we provide additional “custom” and “static” modes for finer control of the prefetching and flushing of some memory locations by some threads. See Sec. 3.2.
-safer (no|all|write)
Specify safer mode, default is write. When instructed to do so, executable files perform some consistency checks. Those are intended both for debugging and for dynamically checking some assumptions on POSIX threads that we rely upon. More specifically, the test harness checks for the stabilisation of memory locations after a test round in the “all” and “write” modes, while the initial values of memory locations are checked in “all” mode.
-speedcheck (no|some|all)
Quick condition check mode, default is “no”. In mode “some”, the test executable will stop as soon as its condition is settled. In mode “all”, the run.sh script will additionally skip the test if it is invoked again later.

The following option commands affinity control:

-affinity (none|incr<n>|random|custom)
Enable (or disable with tag none) affinity control, specifying the default affinity mode of executables. Default is none, i.e. executables do not include affinity control code. The various tags are interpreted as follows:
  1. incr<n>: integer <n> is the increment for allocating logical processors to threads — see Sec. 2.2. Notice that with -affinity incr0 the produced code features affinity control, which executable files do not exercise by default.
  2. random: executables perform random allocation of test threads to logical processors.
  3. custom: executables perform custom allocation of test threads to logical processors.
Notice that the default for executables can be overridden using options -i, +ra and +ca of Sec. 2.3.
-i <n>
Alias for -affinity incr<n>.

Notice that affinity control is not implemented for MacOS.

The following options are significant when affinity control is enabled. Otherwise they are silent no-ops.

-p <ns>
Specify the sequence of logical processors. The notation <ns> stands for a comma-separated list of integers. Sets the default value for option -p of Sec. 2.3. The default for this -p option lets executable files compute the logical processor sequence themselves.
-force_affinity <bool>
Code that sets affinity will spin until all the specified processors (as given with option -avail <n>) are up. This option is necessary on devices that let cores sleep when the computing load is low. Default is false.

Custom affinity control (see Sec. 2.2.4) is enabled, first by enabling affinity control (e.g. with -affinity …), and then by specifying a logical processor topology with options -smt and -smt_mode.

-smt <n>
Specify that logical processors are close in groups of n; default is 1.
-smt_mode (none|seq|end)
Specify how “close” logical processors are numbered, default is none. In mode “end”, logical processors of the same core are numbered as c, c+Ac, etc., where c is a physical core number and Ac is the number of available physical cores. In mode “seq”, logical processors of the same core are numbered in sequence.
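For instance, on a machine with 8 physical cores and 4-way SMT (-smt 4), the definitions above yield the following numberings (a worked example, not a litmus output):

-smt_mode seq : core 0 holds logical processors 0,1,2,3; core 1 holds 4,5,6,7; etc.
-smt_mode end : core 0 holds logical processors 0,8,16,24; core 1 holds 1,9,17,25; etc.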

Notice that custom affinity works only for those tests that include the proper meta-information. Otherwise, custom affinity silently degrades to random affinity.

Finally, a few miscellaneous options are documented:

-l <n>
Insert the assembly code of each thread in a loop of size <n>. Accepts the generalised integer syntax; disabled by default. Sets the default value for option -l of Sec. 2.3.

This feature, in combination with options -s 1 -r 1, may prove useful for measuring running times that are not too perturbed by the test harness.

-vb <bool>
Disable/enable the printing of synchronisation timings, default is false.

This feature may prove useful for analysing the synchronisation behaviour of a specific test, see Sec. 3.1.

-ccopts <flags>
Set gcc compilation flags (defaults: X86="-fomit-frame-pointer -O2", PPC/ARM="-O2").
-gcc <name>
Change the name of the C compiler; default gcc.
-linkopt <flags>
Set gcc linking flags (default: empty).
-gas <bool>
Emit Gnu as extensions (default: Linux/Mac=true, AIX=false).

Target architecture description

The litmus compilation chain may vary slightly depending on the following parameters:

-os (linux|mac|aix)
Set target operating system. This parameter mostly impacts some gcc options. Default is linux.
-ws (w32|w64)
Set the word size. This option first selects gcc 32- or 64-bit mode, by providing it with the appropriate option (-m32 or -m64 on Linux, -maix32 or -maix64 on AIX). It also slightly impacts code generation in the corner case where memory locations hold other memory locations. The default is a bit contrived: it acts as w32 as regards code generation, while providing no 32/64-bit mode selection option to gcc.

Change input

Some items in the source of tests can be changed at the very last moment. The new items are defined in mapping files whose names are arguments to the appropriate command line options. Mapping files simply are lists of pairs: each line starts with a test name, and the rest of the line defines the changed item. The changed item may also span several lines: in that case it should be enclosed in double quotes "…".

-names <file>
Run litmus only on tests whose names are listed in <file>.
-rename <file>
Change test names.
-kinds <file>
Change test kinds. This amounts to changing the quantifier of final conditions, with kind Allow being exists, kind Forbid being ~exists, and kind Require being forall. (A sketch of such a mapping file appears after this list.)
-conds <file>
Change the final condition of tests.
-hints <file>
Change meta-data, or hints. Hints command advanced features such as custom affinity (option -affinity custom and Sec. 2.2.4) and prefetch control (option -preload custom and Sec. 3.2).
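As a small illustration of the mapping file format for -kinds (a hypothetical sketch; it reuses the SB+SC name introduced below):

% cat kinds.txt
SB+SC Forbid

With -kinds kinds.txt, the final condition of SB+SC would be requantified with ~exists.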

Observe that the rename mapping is applied first. As a result, kind or condition changes must refer to the new names. For instance, we can highlight that an X86 machine is not sequentially consistent by first renaming SB into SB+SC, and then changing the final condition. The new condition expresses that the first instruction (a store) of one of the threads must come first:

rename.txt:
SB SB+SC

cond.txt:
SB+SC "forall (0:EAX=1 \/ 1:EAX=1)"

Then, we run litmus:

% litmus -mach x86 -rename rename.txt -conds cond.txt SB.litmus
%%%%%%%%%%%%%%%%%%%%%%%%%
% Results for SB.litmus %
%%%%%%%%%%%%%%%%%%%%%%%%%
X86 SB+SC
"Fre PodWR Fre PodWR"

{x=0; y=0;}

 P0          | P1          ;
 MOV [x],$1  | MOV [y],$1  ;
 MOV EAX,[y] | MOV EAX,[x] ;

forall (0:EAX=1 \/ 1:EAX=1)
Generated assembler
#START _litmus_P1
        movl $1,(%r8,%rdx)
        movl (%rdx),%eax
#START _litmus_P0
        movl $1,(%rdx)
        movl (%r8,%rdx),%eax

Test SB+SC Required
Histogram (4 states)
39954 *>0:EAX=0; 1:EAX=0;
3979407:>0:EAX=1; 1:EAX=0;
3980444:>0:EAX=0; 1:EAX=1;
195   :>0:EAX=1; 1:EAX=1;
No

Witnesses
Positive: 7960046, Negative: 39954
Condition forall (0:EAX=1 \/ 1:EAX=1) is NOT validated
Hash=7dbd6b8e6dd4abc2ef3d48b0376fb2e3
Observation SB+SC Sometimes 7960046 39954
Time SB+SC 0.48

One sees that the test name and final condition have changed.

Configuration files

The syntax of configuration files is minimal: lines “key = arg” are interpreted as setting the value of parameter key to arg. Each parameter has a corresponding option, usually -key, except for single-letter options:

 option  key            arg
 -a      avail          integer
 -s      size_of_test   integer
 -r      number_of_run  integer
 -p      procs          list of integers
 -l      loop           integer

Notice that litmus in fact accepts long versions of options (e.g. -avail for -a).

As command line options are processed left-to-right, settings from a configuration file (option -mach) can be overridden by a later command line option. Some configuration files for the machines we have tested are present in the distribution. As an example, here is the configuration file hpcx.cfg:

size_of_test = 2000
number_of_run = 20000
os = AIX
ws = W32
# A node has 16 cores X2 (SMT)
avail = 32

Lines introduced by # are comments and are thus ignored.

Configuration files are searched first in the current directory; then in any directory specified by setting the shell environment variable LITMUSDIR; and then in the litmus installation directory, which is defined when litmus is compiled.


1. Power and x86-based systems provide a user-accessible timebase counter that should provide consistent times to all cores and processors.
2. Parameter A is not to be confused with a of section 2.1. The former serves to allocate logical processors to threads, while the latter governs the number of tests that run simultaneously. However, parameter a will be set to A when affinity control is enabled and a's value is 0.
