Part I
Traditionally, a litmus test is a small parallel program designed to exercise the memory model of a parallel, shared-memory computer. Given a litmus test written in assembler (X86 or Power), litmus runs the test.
Using litmus thus requires a parallel machine, which must additionally feature gcc and the pthreads library. At the moment, litmus is a prototype and has numerous limitations (in the instructions recognised and in porting). Nevertheless, litmus should accept all tests produced by the companion diy tool, and it has been used successfully on Linux, MacOS and AIX.
The authors of litmus are Luc Maranget and Susmit Sarkar. The present litmus is inspired by a prototype written by Thomas Braibant (INRIA Rhône-Alpes) and Francesco Zappa Nardelli (INRIA Paris-Rocquencourt).
Consider the following (rather classical) litmus test for X86, classic.litmus:
X86 classic
"Fre PodWR Fre PodWR"
{ x=0; y=0; }
 P0          | P1          ;
 MOV [y],$1  | MOV [x],$1  ;
 MOV EAX,[x] | MOV EAX,[y] ;
exists (0:EAX=0 /\ 1:EAX=0)
A litmus test source has three main sections:

- The initial state defines the initial values of registers and memory locations (here, the locations x and y hold 0).
- The code section defines the code to be run concurrently, one column per thread (here P0 and P1).
- The final condition is a proposition on the final values of registers and memory locations (here, exists (0:EAX=0 /\ 1:EAX=0)).

Run the test:
$ litmus classic.litmus
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Results for classic.litmus %
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
X86 classic
"Fre PodWR Fre PodWR"
{ x=0; y=0; }
 P0          | P1          ;
 MOV [y],$1  | MOV [x],$1  ;
 MOV EAX,[x] | MOV EAX,[y] ;
exists (0:EAX=0 /\ 1:EAX=0)
Generated assembler
_litmus_P0_0_: movl $1,(%rcx)
_litmus_P0_1_: movl (%rsi),%eax
_litmus_P1_0_: movl $1,(%rsi)
_litmus_P1_1_: movl (%rcx),%eax
Test classic Allowed
Histogram (4 states)
34    :>0:EAX=0; 1:EAX=0;
499911:>0:EAX=1; 1:EAX=0;
499805:>0:EAX=0; 1:EAX=1;
250   :>0:EAX=1; 1:EAX=1;
Ok
Witnesses
Positive: 34, Negative: 999966
Condition exists (0:EAX=0 /\ 1:EAX=0) is validated
Hash=eb447b2ffe44de821f49c40caa8e9757
Time classic 0.60
...
The litmus test is first recalled, followed by the actual assembler: the machine is an AMD64, so inline address references have disappeared, registers may differ, and the assembler syntax is now the more familiar one. The test ran one million times, producing one million final states, or outcomes, for the registers EAX of threads P0 and P1. The run validates the condition, with 34 positive witnesses.
With option -o <name.tar>, litmus does not run the test. Instead, it produces a tar archive that contains the C sources for the test.
Consider ppc-classic.litmus, a Power version of the previous test:
PPC ppc-classic
"Fre PodWR Fre PodWR"
{ 0:r2=y; 0:r4=x;
  1:r2=x; 1:r4=y; }
 P0           | P1           ;
 li r1,1      | li r1,1      ;
 stw r1,0(r2) | stw r1,0(r2) ;
 lwz r3,0(r4) | lwz r3,0(r4) ;
exists (0:r3=0 /\ 1:r3=0)
Our target machine (ppc) runs MacOS, which we specify with the -os option:
$ litmus -o /tmp/a.tar -os mac ppc-classic.litmus
$ scp /tmp/a.tar ppc:/tmp
Then, on the remote machine ppc:
ppc$ mkdir classic && cd classic
ppc$ tar xf /tmp/a.tar
ppc$ ls
Makefile comp.sh run.sh ppc-classic.c outs.c utils.c
The test is compiled by the shell script comp.sh (or by (GNU) make, at the user's choice) and run by the shell script run.sh:
$ sh comp.sh
$ sh run.sh
...
Test ppc-classic Allowed
Histogram (3 states)
3947  :>0:r3=0; 1:r3=0;
499357:>0:r3=1; 1:r3=0;
496696:>0:r3=0; 1:r3=1;
Ok
Witnesses
Positive: 3947, Negative: 996053
Condition exists (0:r3=0 /\ 1:r3=0) is validated
...
As we see, the condition also validates on Power. Notice that compilation produces an executable file, ppc-classic.exe, which can be run directly for a less verbose output.
Consider the additional test ppc-storefwd.litmus:
PPC ppc-storefwd
"DpdR Fre Rfi DpdR Fre Rfi"
{ 0:r2=x; 0:r6=y;
  1:r2=y; 1:r6=x; }
 P0            | P1            ;
 li r1,1       | li r1,1       ;
 stw r1,0(r2)  | stw r1,0(r2)  ;
 lwz r3,0(r2)  | lwz r3,0(r2)  ;
 xor r4,r3,r3  | xor r4,r3,r3  ;
 lwzx r5,r4,r6 | lwzx r5,r4,r6 ;
exists (0:r3=1 /\ 0:r5=0 /\ 1:r3=1 /\ 1:r5=0)
To compile the two tests together, we can give two file names as arguments to litmus:
$ litmus -o /tmp/a.tar -os mac ppc-classic.litmus ppc-storefwd.litmus
Or, more conveniently, list the litmus sources in a file whose name starts with @:
$ cat @ppc
ppc-classic.litmus
ppc-storefwd.litmus
$ litmus -o /tmp/a.tar -os mac @ppc
To run the tests on the remote ppc machine, the same sequence of commands as in the one-test case applies:
ppc$ tar xf /tmp/a.tar && make && sh run.sh
...
Test ppc-classic Allowed
Histogram (3 states)
4167  :>0:r3=0; 1:r3=0;
499399:>0:r3=1; 1:r3=0;
496434:>0:r3=0; 1:r3=1;
Ok
Witnesses
Positive: 4167, Negative: 995833
Condition exists (0:r3=0 /\ 1:r3=0) is validated
...
Test ppc-storefwd Allowed
Histogram (4 states)
37    :>0:r3=1; 0:r5=0; 1:r3=1; 1:r5=0;
499837:>0:r3=1; 0:r5=1; 1:r3=1; 1:r5=0;
499912:>0:r3=1; 0:r5=0; 1:r3=1; 1:r5=1;
214   :>0:r3=1; 0:r5=1; 1:r3=1; 1:r5=1;
Ok
Witnesses
Positive: 37, Negative: 999963
Condition exists (0:r3=1 /\ 0:r5=0 /\ 1:r3=1 /\ 1:r5=0) is validated
...
Now, the output of run.sh shows the result of two tests.
Users can control some of the testing conditions. These impact both the efficiency of the test and the variability of its outcomes.
Sometimes one looks for a particular outcome: for instance, one may seek to get the outcome 0:r3=1; 1:r3=1; that is missing in the previous experiment for test ppc-classic. To that aim, varying the test conditions may help.
Consider a test a.litmus designed to run on t threads P0, …, Pt−1. The structure of the executable a.exe that performs the experiment is as follows:

- a.exe runs n instances of the test in parallel; by default, n is computed from the number a of available logical processors, as n = a/t.
- Each instance repeats the test r times; each of these runs forks t POSIX threads T0, …, Tt−1, thread Tk running the code of test thread Pk.
- Each thread Tk executes its code s times in a loop, iteration number i accessing, for every location x of the test, its own array cell x[i].
In cache mode, the Tk threads are re-used from one run to the next; as a consequence, only t threads are forked per instance, i.e. t × n in total.
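As an illustration, here is a minimal runnable sketch of this structure in C with pthreads. It reflects our reading of the description above (the names run_thread and arg_t, and the parameter values, are ours); the code litmus actually generates is considerably more elaborate.

#include <pthread.h>
#include <stdlib.h>

enum { t = 2 };                        /* number of test threads P0..P_{t-1} */
static int n = 1, r = 10, s = 1000;    /* instances, runs, iterations */

typedef struct { int k; int *x; } arg_t;

/* Thread Tk: execute the code of Pk s times, iteration i using cell x[i]. */
static void *run_thread(void *p) {
    arg_t *a = p;
    for (int i = 0; i < s; i++) {
        /* ... code of test thread P_k acting on a->x[i] goes here,
           recording one outcome per iteration ... */
        a->x[i] = a->k;
    }
    return NULL;
}

int main(void) {
    for (int inst = 0; inst < n; inst++) {   /* n instances (run in parallel by litmus) */
        int *x = calloc(s, sizeof *x);       /* one array cell per iteration */
        for (int run = 0; run < r; run++) {  /* r runs per instance */
            pthread_t th[t];
            arg_t a[t];
            for (int k = 0; k < t; k++) {    /* fork threads T0..T_{t-1} */
                a[k] = (arg_t){ k, x };
                pthread_create(&th[k], NULL, run_thread, &a[k]);
            }
            for (int k = 0; k < t; k++)
                pthread_join(th[k], NULL);
        }
        free(x);
    }
    return 0;   /* n * r * s outcomes in total */
}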
How an array cell x[i] is accessed depends upon the memory mode. In direct mode the array cell is accessed directly, as x[i]; as a result, cells are accessed sequentially and false sharing effects are likely. In indirect mode the array cell is accessed by means of a shuffled array of pointers; as a result, we have observed a much greater variability of outcomes.
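The difference between the two modes can be pictured as follows; this is a sketch under our own naming (run_direct, run_indirect), with a single store standing for the test code:

/* direct mode: iteration i accesses cell x[i] directly; accesses are
   sequential, and false sharing between neighbouring cells is likely */
void run_direct(int *x, int s) {
    for (int i = 0; i < s; i++)
        x[i] = 1;
}

/* indirect mode: accesses go through a pre-shuffled array of pointers
   into x, scattering them over memory and increasing outcome variability */
void run_indirect(int **xp, int s) {
    for (int i = 0; i < s; i++)
        *xp[i] = 1;        /* xp[i] points to some cell x[j] */
}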
If the preload mode is enabled, a preliminary loop of size s reads a random subset of the memory locations accessed by Pk. Preloading has a noticeable effect on outcomes.
The iterations performed by the different threads Tk may be unsynchronised, exactly synchronised by a pthread-based barrier, or approximately synchronised by specific code. Absence of synchronisation may be interesting when t exceeds a: in that situation, any kind of synchronisation leads to prohibitive running times; yet, for a large value of the parameter s and a small t, we have observed spontaneous concurrent execution of some iterations amongst many. Pthread-based barriers are exact, but they are slow and in fact offer poor synchronisation for short code sequences. The approximate synchronisation is thus the preferred technique.
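For illustration, here is a minimal busy-wait barrier of the sense-reversing kind on which such specific code can rely. It is a sketch under our own assumptions (C11 atomics, names barrier_t and barrier_wait), not the synchronisation code that litmus actually emits:

#include <stdatomic.h>

typedef struct {
    atomic_int count;    /* threads arrived in the current round */
    atomic_int sense;    /* flipped once per round */
    int nthreads;
} barrier_t;

/* Each thread keeps its own local_sense, initially 0. No system call is
   involved, so threads tend to leave the barrier almost simultaneously:
   cheap, but the synchronisation is only approximate. */
void barrier_wait(barrier_t *b, int *local_sense) {
    *local_sense = !*local_sense;
    if (atomic_fetch_add(&b->count, 1) == b->nthreads - 1) {
        atomic_store(&b->count, 0);             /* last arrival resets... */
        atomic_store(&b->sense, *local_sense);  /* ...and releases the others */
    } else {
        while (atomic_load(&b->sense) != *local_sense)
            ;                                   /* spin */
    }
}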
Overall, running a.exe produces n × r × s outcomes.
Parameters n, a, r and s can first be set directly when invoking a.exe, using the appropriate command line options. For instance, assuming t=2, the two invocations

./a.exe -a 201 -r 10000 -s 1
./a.exe -n 1 -r 1 -s 1000000

will both produce one million outcomes (with a=201 and t=2, the first forks n=100 instances, and 100 × 10000 × 1 = 1,000,000), but the latter is probably more efficient. If our machine has 8 cores, then

./a.exe -a 8 -r 1 -s 1000000

will yield 4 million outcomes (n=4 instances of 1,000,000 each), in a time that, we hope, does not exceed by much the one experienced with ./a.exe -n 1.
Also observe that the memory allocated is roughly proportional
to n × s, while the number of Tk threads created will be
t × n × r (t × n in cache mode).
The run.sh shell script transmits its command line to all the executable (.exe) files it invokes, thereby providing a convenient means to control the testing conditions of several tests.
Satisfactory test parameters are found by experimenting, and the control of executable files by command line options is designed for that purpose.
Once satisfactory parameters are found, it is a nuisance to repeat them for every experiment. Thus, parameters a, r and s can also be set while invoking litmus, with the same command line options; those settings define the default values of the corresponding .exe file controls. Additionally, the synchronisation technique for iterations, the memory mode, and several other compile-time parameters can be selected by appropriate litmus command line options. Finally, users can record frequently used parameters in configuration files.
We view affinity as a scheduler property that binds a (software, POSIX) thread to a given (hardware) logical processor. In the simplest situation, a logical processor is a core; however, in the presence of hyperthreading (x86) or simultaneous multi-threading (SMT, Power), a given core can host several logical processors.
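On Linux, such a binding can be expressed with pthread_setaffinity_np, as in the following minimal sketch (the helper name bind_to is ours; AIX offers the bindprocessor facility for the same purpose, and litmus's actual code is OS-specific):

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

/* Bind the calling thread to the given logical processor. */
static int bind_to(int cpu) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);   /* allow exactly one logical processor */
    return pthread_setaffinity_np(pthread_self(), sizeof set, &set);
}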
In our experience, binding the threads of test programs to selected logical processors yields significant speedups and, more importantly, greater outcome variety. We illustrate the issue by means of an example.
We consider the test ppc-iriw-lwsync.litmus:
PPC ppc-iriw-lwsync
{ 1:r2=x; 3:r2=y;
  0:r2=y; 0:r4=x;
  2:r2=x; 2:r4=y; }
 P0           | P1           | P2           | P3           ;
 lwz r1,0(r2) | li r1,1      | lwz r1,0(r2) | li r1,1      ;
 lwsync       | stw r1,0(r2) | lwsync       | stw r1,0(r2) ;
 lwz r3,0(r4) |              | lwz r3,0(r4) |              ;
exists (0:r1=1 /\ 0:r3=0 /\ 2:r1=1 /\ 2:r3=0)
The test consists of four threads.
There are two writers (P1 and P3) that write the value
one into two different locations (x and y),
and two readers that read the contents of x and y
in different orders — P0 reads y first, while P2 reads
x first.
The load instructions lwz in reader threads are separated
by a lightweight barrier instruction lwsync.
The final condition exists (0:r1=1 /\ 0:r3=0 /\ 2:r1=1 /\ 2:r3=0)
characterises the situation where the reader threads see the writes
by P1 and P3 in opposite order.
The corresponding outcome, 0:r1=1; 0:r3=0; 2:r1=1; 2:r3=0;, is the only possible outcome that is not sequentially consistent (non-SC, see Part II).
By any reasonable memory model for Power, one expects the condition
to validate,
i.e. the non-SC outcome to show up.
The tested machine, vargas, is a Power 6 featuring 32 cores (i.e. 64 logical processors, since SMT is enabled) and running AIX in 64-bit mode. So as not to disturb other users, we run only one instance of the test, thus specifying four available processors. The litmus tool is absent on vargas. All these conditions dictate the following invocation of litmus, performed on our local machine:
$ litmus -r 1000 -s 1000 -a 4 -os aix -ws w64 ppc-iriw-lwsync.litmus -o ppc.tar
$ scp ppc.tar vargas:/var/tmp
On vargas we unpack the archive and compile the test:
vargas$ tar xf /var/tmp/ppc.tar && sh comp.sh
Then we run the test:
vargas$ ./ppc-iriw-lwsync.exe -v
Test ppc-iriw-lwsync Allowed
Histogram (15 states)
152885:>0:r1=0; 0:r3=0; 2:r1=0; 2:r3=0;
35214 :>0:r1=1; 0:r3=0; 2:r1=0; 2:r3=0;
42419 :>0:r1=0; 0:r3=1; 2:r1=0; 2:r3=0;
95457 :>0:r1=1; 0:r3=1; 2:r1=0; 2:r3=0;
35899 :>0:r1=0; 0:r3=0; 2:r1=1; 2:r3=0;
70460 :>0:r1=0; 0:r3=1; 2:r1=1; 2:r3=0;
30449 :>0:r1=1; 0:r3=1; 2:r1=1; 2:r3=0;
42885 :>0:r1=0; 0:r3=0; 2:r1=0; 2:r3=1;
70068 :>0:r1=1; 0:r3=0; 2:r1=0; 2:r3=1;
1     :>0:r1=0; 0:r3=1; 2:r1=0; 2:r3=1;
41722 :>0:r1=1; 0:r3=1; 2:r1=0; 2:r3=1;
95857 :>0:r1=0; 0:r3=0; 2:r1=1; 2:r3=1;
30916 :>0:r1=1; 0:r3=0; 2:r1=1; 2:r3=1;
40818 :>0:r1=0; 0:r3=1; 2:r1=1; 2:r3=1;
214950:>0:r1=1; 0:r3=1; 2:r1=1; 2:r3=1;
No
Witnesses
Positive: 0, Negative: 1000000
Condition exists (0:r1=1 /\ 0:r3=0 /\ 2:r1=1 /\ 2:r3=0) is NOT validated
Hash=8ce05c9f86d49b2adfd5546bd471aa44
Time ppc-iriw-lwsync 1.33
The non-SC outcome does not show up.
Altering parameters may yield this outcome. In particular, we may try using all the available logical processors with option -a 64. Affinity control offers an alternative, which is enabled at compilation time with litmus option -affinity:
$ litmus ... -affinity incr1 ppc-iriw-lwsync.litmus -o ppc.tar
$ scp ppc.tar vargas:/var/tmp
Option -affinity takes one argument (incr1 above) that specifies the increment used while allocating logical processors to test threads. Here, the (POSIX) threads created by the test (named T0, T1, T2 and T3 in Sec. 2.1) will get bound to logical processors 0, 1, 2, and 3, respectively.
Namely, by default, the logical processors are ordered as the sequence 0, 1, …, A−1, where A is the number of available logical processors, as inferred by the test executable. Furthermore, logical processors are allocated to threads by applying the affinity increment while scanning this sequence. Observe that, since the launch mode is changing (the default), the threads Tk correspond to different test threads Pi at each run. The unpack, compile and run sequence on vargas now yields the non-SC outcome, a better outcome variety, and a lower running time:
vargas$ tar xf /var/tmp/ppc.tar && sh comp.sh
vargas$ ./ppc-iriw-lwsync.exe
Test ppc-iriw-lwsync Allowed
Histogram (16 states)
166595:>0:r1=0; 0:r3=0; 2:r1=0; 2:r3=0;
2841  :>0:r1=1; 0:r3=0; 2:r1=0; 2:r3=0;
19581 :>0:r1=0; 0:r3=1; 2:r1=0; 2:r3=0;
86307 :>0:r1=1; 0:r3=1; 2:r1=0; 2:r3=0;
3268  :>0:r1=0; 0:r3=0; 2:r1=1; 2:r3=0;
9     :>0:r1=1; 0:r3=0; 2:r1=1; 2:r3=0;
21876 :>0:r1=0; 0:r3=1; 2:r1=1; 2:r3=0;
79354 :>0:r1=1; 0:r3=1; 2:r1=1; 2:r3=0;
21406 :>0:r1=0; 0:r3=0; 2:r1=0; 2:r3=1;
26808 :>0:r1=1; 0:r3=0; 2:r1=0; 2:r3=1;
1762  :>0:r1=0; 0:r3=1; 2:r1=0; 2:r3=1;
100381:>0:r1=1; 0:r3=1; 2:r1=0; 2:r3=1;
83005 :>0:r1=0; 0:r3=0; 2:r1=1; 2:r3=1;
72241 :>0:r1=1; 0:r3=0; 2:r1=1; 2:r3=1;
98047 :>0:r1=0; 0:r3=1; 2:r1=1; 2:r3=1;
216519:>0:r1=1; 0:r3=1; 2:r1=1; 2:r3=1;
Ok
Witnesses
Positive: 9, Negative: 999991
Condition exists (0:r1=1 /\ 0:r3=0 /\ 2:r1=1 /\ 2:r3=0) is validated
Hash=8ce05c9f86d49b2adfd5546bd471aa44
Time ppc-iriw-lwsync 0.67
One may change the affinity increment with the command line option -i of executable files. For instance, one binds the test threads to logical processors 0, 2, 4 and 6 as follows:
vargas$ ./ppc-iriw-lwsync.exe -i 2
Test ppc-iriw-lwsync Allowed
Histogram (15 states)
163114:>0:r1=0; 0:r3=0; 2:r1=0; 2:r3=0;
38867 :>0:r1=1; 0:r3=0; 2:r1=0; 2:r3=0;
48395 :>0:r1=0; 0:r3=1; 2:r1=0; 2:r3=0;
81191 :>0:r1=1; 0:r3=1; 2:r1=0; 2:r3=0;
38912 :>0:r1=0; 0:r3=0; 2:r1=1; 2:r3=0;
70574 :>0:r1=0; 0:r3=1; 2:r1=1; 2:r3=0;
30918 :>0:r1=1; 0:r3=1; 2:r1=1; 2:r3=0;
47846 :>0:r1=0; 0:r3=0; 2:r1=0; 2:r3=1;
69048 :>0:r1=1; 0:r3=0; 2:r1=0; 2:r3=1;
5     :>0:r1=0; 0:r3=1; 2:r1=0; 2:r3=1;
42675 :>0:r1=1; 0:r3=1; 2:r1=0; 2:r3=1;
82308 :>0:r1=0; 0:r3=0; 2:r1=1; 2:r3=1;
30264 :>0:r1=1; 0:r3=0; 2:r1=1; 2:r3=1;
43796 :>0:r1=0; 0:r3=1; 2:r1=1; 2:r3=1;
212087:>0:r1=1; 0:r3=1; 2:r1=1; 2:r3=1;
No
Witnesses
Positive: 0, Negative: 1000000
Condition exists (0:r1=1 /\ 0:r3=0 /\ 2:r1=1 /\ 2:r3=0) is NOT validated
Hash=8ce05c9f86d49b2adfd5546bd471aa44
Time ppc-iriw-lwsync 0.89
Observe that the non-SC outcome does not show up with the new affinity setting.
As illustrated by the previous example, both the running time and the outcomes of a test are sensitive to affinity settings. We measured running times for increasing values of the affinity increment, from 0 (which disables affinity control) to 20, producing the following figure:

[Figure: running time of ppc-iriw-lwsync.exe as a function of the affinity increment, from 0 to 20]
As regards outcome variety, we get all of the 16 possible outcomes only for an affinity increment of 1.
The differences in running times can be explained by reference to the mapping of logical processors to hardware. The machine vargas consists of four MCMs (Multi-Chip Modules); each MCM consists of four chips, each chip consists of two cores, and each core may support two logical processors. As far as we know, from querying vargas with the AIX commands lsattr, bindprocessor and llstat, the MCMs hold the logical processors 0–15, 16–31, 32–47 and 48–63, each chip holds the logical processors 4k, 4k+1, 4k+2 and 4k+3, and each core holds the logical processors 2k and 2k+1.
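In other words, the placement of a logical processor on vargas can be computed from its number. The following helpers (our own names) restate that mapping:

/* Topology of vargas, as described above: logical processor cpu
   belongs to core cpu/2, chip cpu/4 and MCM cpu/16. */
int core_of(int cpu) { return cpu / 2;  }   /* cores hold 2k, 2k+1         */
int chip_of(int cpu) { return cpu / 4;  }   /* chips hold 4k .. 4k+3       */
int mcm_of(int cpu)  { return cpu / 16; }   /* MCMs hold 0-15, 16-31, ...  */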
The measured running times reveal two noticeable slowdowns: from an increment of 1 to an increment of 2, and from 5 to 6. The gap between 1 and 2 reveals the benefits of SMT for our testing application: an increment of 1 yields both the greatest outcome variety and the minimal running time. The other gap may be explained by reference to MCMs: with an increment of 5, the test runs on the logical processors 0, 5, 10 and 15, all belonging to the same MCM, while the next increment, 6, results in running the test on two different MCMs (0, 6 and 12 on the one hand, 18 on the other).
In conclusion, affinity control provides users with a certain level of control over thread placement, which is likely to yield faster tests when threads are constrained to run on logical processors that are "close" to one another. The best results are obtained when SMT is effectively enforced. However, affinity control is no panacea, and the memory system may be stressed by other means, for instance by allocating large chunks of memory (option -s).
For specific experiments, the technique of allocating logical processors sequentially by a fixed increment may be too rigid. litmus offers finer control over affinity by allowing users to supply the logical processor sequence themselves. Notice that most users will probably not need this advanced feature.
To confirm that testing ppc-iriw-lwsync benefits from not crossing chip boundaries, one may wish to confine its four threads to logical processors 16 to 19, that is, to the first chip of the second MCM. This can be done by overriding the default logical processor sequence with a user-supplied one, given as an argument to the command-line option -p:
vargas$ ./ppc-iriw-lwsync.exe -p 16,17,18,19 -i 1
Test ppc-iriw-lwsync Allowed
Histogram (16 states)
186125:>0:r1=0; 0:r3=0; 2:r1=0; 2:r3=0;
1333  :>0:r1=1; 0:r3=0; 2:r1=0; 2:r3=0;
16334 :>0:r1=0; 0:r3=1; 2:r1=0; 2:r3=0;
83954 :>0:r1=1; 0:r3=1; 2:r1=0; 2:r3=0;
1573  :>0:r1=0; 0:r3=0; 2:r1=1; 2:r3=0;
9     :>0:r1=1; 0:r3=0; 2:r1=1; 2:r3=0;
19822 :>0:r1=0; 0:r3=1; 2:r1=1; 2:r3=0;
72876 :>0:r1=1; 0:r3=1; 2:r1=1; 2:r3=0;
20526 :>0:r1=0; 0:r3=0; 2:r1=0; 2:r3=1;
24835 :>0:r1=1; 0:r3=0; 2:r1=0; 2:r3=1;
1323  :>0:r1=0; 0:r3=1; 2:r1=0; 2:r3=1;
97756 :>0:r1=1; 0:r3=1; 2:r1=0; 2:r3=1;
78809 :>0:r1=0; 0:r3=0; 2:r1=1; 2:r3=1;
67206 :>0:r1=1; 0:r3=0; 2:r1=1; 2:r3=1;
94934 :>0:r1=0; 0:r3=1; 2:r1=1; 2:r3=1;
232585:>0:r1=1; 0:r3=1; 2:r1=1; 2:r3=1;
Ok
Witnesses
Positive: 9, Negative: 999991
Condition exists (0:r1=1 /\ 0:r3=0 /\ 2:r1=1 /\ 2:r3=0) is validated
Hash=8ce05c9f86d49b2adfd5546bd471aa44
Time ppc-iriw-lwsync 0.66
Thus we get results similar to the previous experiment on logical processors 0 to 3 (option -i 1 alone).
We may also run four simultaneous instances (-n 4, parameter n of section 2.1) of the test on the four available MCMs:
vargas$ ./ppc-iriw-lwsync.exe -p 0,1,2,3,16,17,18,19,32,33,34,35,48,49,50,51 -n 4 -i 1
Test ppc-iriw-lwsync Allowed
Histogram (16 states)
...
Witnesses
Positive: 80, Negative: 3999920
Condition exists (0:r1=1 /\ 0:r3=0 /\ 2:r1=1 /\ 2:r3=0) is validated
Time ppc-iriw-lwsync 0.74
Observe that, for a negligible penalty in running time, the number of non-SC outcomes increases significantly.
By contrast, binding the threads of a given instance of the test to different MCMs results in a poor running time and no non-SC outcome:
vargas$ ./ppc-iriw-lwsync.exe -p 0,1,2,3,16,17,18,19,32,33,34,35,48,49,50,51 -n 4 -i 4
Test ppc-iriw-lwsync Allowed
Histogram (15 states)
...
Witnesses
Positive: 0, Negative: 4000000
Condition exists (0:r1=1 /\ 0:r3=0 /\ 2:r1=1 /\ 2:r3=0) is NOT validated
Time ppc-iriw-lwsync 1.48
In the experiment above, the increment is 4, hence the logical processors allocated to the first instance of the test are 0, 16, 32 and 48, whose indices in the logical processor sequence are 0, 4, 8 and 12, respectively. The next allocated index in the sequence is 12+4 = 16; however, the sequence has 16 items, so wrapping around yields index 0, which happens to be the same as the starting index. Then, so as to allocate fresh processors, the starting index is incremented by one, resulting in allocating processors 1, 17, 33 and 49 (indices 1, 5, 9 and 13) to the second instance; see section 2.3 for the full story. Similarly, the third and fourth instances get processors 2, 18, 34, 50 and 3, 19, 35, 51, respectively. Attentive readers may have noticed that the same experiment can be performed with option -i 16 and no -p option.
Finally, users should probably be aware that at least some versions of Linux for x86 feature a less obvious mapping of logical processors to hardware. On a bi-processor, dual-core, 2-way hyperthreading AMD64 machine running Linux, we have checked that the logical processors residing on the same core are k and k+4, where k is a core number ranging from 0 to 3. As a result, a proper choice for favouring effective hyperthreading on such a machine is -i 4 (or -p 0,4,1,5,2,6,3,7 -i 1). Perhaps more worthy of notice, the straightforward choice -i 1 disfavours effective hyperthreading.
Any executable file produced by litmus accepts the following command line options.
If affinity control has been enabled at compilation time (by supplying option -affinity incr1 to litmus, for instance), the executable file produced by litmus accepts the following two command line options.
Logical processors are allocated test instance by test instance (parameter n of Sec. 2.1), and then thread by thread, scanning the logical processor sequence left-to-right by steps of the given increment. More precisely, assume a logical processor sequence P = p0, p1, …, pA−1 and an increment i. The first processor allocated is p0, then pi, then p2i, etc. Indices in the sequence P are reduced modulo A so as to wrap around. The starting index of the allocation sequence (initially 0) is recorded, and coincidence with the index of the next processor to be allocated is checked. When coincidence occurs, a new index is computed as the previous starting index plus one, which also becomes the new starting index; allocation then proceeds from this new starting index. That way, all the processors in the sequence get allocated to different threads naturally, provided of course that fewer than A threads are scheduled to run. See section 2.2.3 for an example with A=16 and i=4.
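The following runnable sketch restates this allocation algorithm in C (our own reading, with hypothetical names such as allocate; litmus's actual implementation may differ in detail). The main function replays the A=16, i=4 example:

#include <stdio.h>

/* Allocate nthreads logical processors from the sequence p of length A,
   scanning by steps of incr and starting a new round on wrap-around. */
void allocate(const int *p, int A, int incr, int nthreads, int *out) {
    int start = 0;                       /* starting index of the current round */
    int next = 0;                        /* index of the next processor */
    for (int k = 0; k < nthreads; k++) {
        out[k] = p[next];
        next = (next + incr) % A;        /* wrap around modulo A */
        if (next == start) {             /* coincidence with the starting index */
            start = (start + 1) % A;     /* begin a new round one step further */
            next = start;
        }
    }
}

int main(void) {
    int p[16], out[16];
    for (int i = 0; i < 16; i++) p[i] = i;
    allocate(p, 16, 4, 16, out);         /* the A=16, i=4 example above */
    for (int k = 0; k < 16; k++)
        printf("%d ", out[k]);           /* 0 4 8 12 1 5 9 13 2 6 10 14 3 7 11 15 */
    printf("\n");
    return 0;
}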
litmus takes file names as command line arguments. Those files are either a single litmus test, when they have extension .litmus, or a list of file names, when prefixed by @. Of course, the file names in @files can themselves be @files.
There are many command line options. We describe the more useful ones:
The following options set the default values of the options of the executable files produced:
The following two options enable affinity control. Affinity control is not implemented for MacOS.
By default (when the -p option is not given), executable files compute the logical processor sequence themselves.
The following additional options control the various modes described in Sec. 2.1. Those cannot be changed without running litmus again:
The litmus compilation chain may vary slightly, depending on the following parameters:
The syntax of configuration files is minimal: lines “key = arg” are interpreted as setting the value of parameter key to arg. Each parameter has a corresponding option, usually -key, except for single-letter options:
option | key            | arg
-a     | avail          | integer
-s     | size_of_test   | integer
-r     | number_of_run  | integer
-p     | procs          | list of integers
As command line options are processed left-to-right, settings from a configuration file (option -mach) can be overridden by a later command line option. Some configuration files for the machines we have tested are present in the distribution. As an example, here is the configuration file hpcx.cfg:
size_of_test = 2000
number_of_run = 20000
os = AIX
ws = W32
# A node has 16 cores X2 (SMT)
avail = 32
Lines introduced by #
are comments and are thus ignored.
Configuration files are searched first in the current directory; then in any directory specified by setting the shell environment variable LITMUSDIR; and then in litmus installation directory, which is defined while compiling litmus.