Part II
Generating tests

7 Preamble

We wrote diy7 as part of our empirical approach to studying relaxed memory models: developing in tandem testing tools and models of multiprocessor behaviour. In this tutorial, we attempt an independent tool presentation. Readers interested by the companion formalism are invited to refer to our CAV 2010 publication [1].

The distribution includes additional test generators: diyone7 for generating one test and diycross7 for generating simple variations on one test.

7.1 Relaxation of Sequential Consistency

Relaxation is one of the key concepts behind simple analysis of weak memory models. We define a candidate relaxation (candidate for short) by reference to the most natural model of parallel execution in shared memory: Sequential Consistency (SC), as defined by L. Lamport [3]. A parallel program running on a sequentially consistent machine behaves as an interleaving of its sequential threads.

Consider once more the example SB.litmus:

X86 SB
"Fre PodWR Fre PodWR"
{ x=0; y=0; }
 P0          | P1          ;
 MOV [y],$1  | MOV [x],$1  ; #(a)Wy1  | (c)Wx1
 MOV EAX,[x] | MOV EAX,[y] ; #(b)Rx0  | (d)Ry0
exists (0:EAX=0 /\ 1:EAX=0)

To focus on interaction through shared memory, let us consider memory accesses, or memory events. A memory event will hold a direction (write, written W, or read, written R), a memory location (written x, y) a value and a unique label. In any run of the simple example above, four memory events occur: two writes (c) Wx1 and (a) Wy1 and two reads (b) Rxv₁ with a certain value v₁ and (d) Ryv₂ with a certain value v₂.

If the program’s behaviour is modelled by the interleaving of its events, the first event must be a write of value 1 to location x or y and at least one of the loads must see a 1. Thus, a SC machine would exhibit only three possible outcomes for this test:

Allowed: 0:EAX = 0 ∧ 1:EAX = 1

Allowed: 0:EAX = 1 ∧ 1:EAX = 0

Allowed: 0:EAX = 1 ∧ 1:EAX = 1

However, running (see Sec. 1.1) this test on a x86 machine yields an additional result:

Allowed: 0:EAX = 0 ∧ 1:EAX = 0

And indeed, x86 allows each write-read pair on both processors to be reordered [2]: thus the write-read pair in program order is relaxed on each of these architectures. We cannot use SC as an accurate memory model for modern architectures. Instead we analyse memory models as relaxing the ordering constraints of the SC memory model.

7.2 Introduction to candidate relaxations

Consider again our classical example, from a SC perspective. We briefly argued that the outcome “0:EAX = 0 ∧ 1:EAX = 0” is forbidden by SC. We now present a more complete reasoning:

From the condition on outcome, we get the values in read events: (b) Rx0 and (d) Ry0.
Because of these values, (b) Rx0 must precede the write (c) Wx1 in the final interleaving of SC. Similarly, (d) Ry0 must precede the write (a) Wy1. This we note (b) →^fr (c) and (d) →^fr (a).
Because of sequential execution order on one single processor (a.k.a. program order), (a) Wy1 must precede (b) Rx0 (first processor); while (c) Wx1 must precede (d) Ry0 (second processor). This we note (a) →^po (b) and (c) →^po (d).
We synthesise the four constraints above as the following graph:

Constraint arrows or global arrows are shown in brown colour. As the graph contains a cycle of brown arrows, the events cannot be ordered. Hence the execution presented is not allowed by SC.

The key idea of diy7 resides in producing programs from similar cycles. To that aim, the edges in cycles must convey additional information:

For →^po edges, we consider whether the locations of the events on both sides of the edge are the same or not (’s’ or ’d’); and the direction of these events (W or R). For instance the two →^po edges in the example are PodWR (program order edge between a write and a read whose locations are different).
For →^fr edges, we consider whether the processor of the events on both sides of the edge are the same or not (’i’ for internal, or ’e’ for external). For instance the two →^fr edges in the example are Fre.

So far so good, but our x86 machine produced the outcome 0:EAX = 0 ∧ 1:EAX = 0. The Intel Memory Ordering White Paper [2] specifies: “Loads may be reordered with older stores to different locations”, which we rephrase as: PodWR is relaxed. Considering Fre to be safe, we have the graph:

And the brown sub-graph becomes acyclic.

We shall see later why we choose to relax PodWR and not Fre. At the moment, we observe that we can assume PodWR to be relaxed and Fre not to be (i.e. to be safe) and test our assumptions, by producing and running more litmus tests. The diy7 suite precisely provides tools for this approach.

As a first example, SB.litmus can be created as follows:

% diyone7 -arch X86 -name SB Fre PodWR Fre PodWR

As a second example, we can produce several similar tests as follows:

% diy7 -arch X86 -safe Fre -relax PodWR -name SB
Generator produced 2 tests
Relaxations tested: {PodWR}

diy7 produces two litmus tests, SB000.litmus and SB001.litmus, plus one index file @all. One of the litmus tests generated is the same as above, while the new test is:

% cat SB001.litmus
X86 SB001
"Fre PodWR Fre PodWR Fre PodWR"
Cycle=Fre PodWR Fre PodWR Fre PodWR
Relax=PodWR
Safe=Fre
{ }
 P0          | P1          | P2          ;
 MOV [z],$1  | MOV [x],$1  | MOV [y],$1  ;
 MOV EAX,[x] | MOV EAX,[y] | MOV EAX,[z] ;
exists (0:EAX=0 /\ 1:EAX=0 /\ 2:EAX=0)
% cat @all
# diy -arch X86 -safe Fre -relax PodWR -name SB
# Revision: 3333
SB000.litmus
SB001.litmus

diy7 first generates cycles from the candidate relaxations given as arguments, up to a limited size, and then generates litmus tests from these cycles.

7.3 More candidate relaxations

We assume the memory to be coherent. Coherence implies that, in a given execution, the writes to a given location are performed by following a sequence, or coherence order, and that all processors see the same sequence.

In diy7, the coherence orders are specified indirectly. For instance, the candidate relaxation Coe (resp. Coi) specifies two writes, performed by different processors (resp. the same processor), to the same location ℓ, the first write preceding the second in the coherence order of ℓ. The condition of the produced test then selects the specified coherence orders. Consider for instance:

% diyone7 -arch X86 -name x86-2+2W Coe PodWW Coe PodWW

The cycle that reveals a violation of the SC memory model is:

So the coherence order is 0 (initial store, not depicted), 1, 2 for both locations x and y. While the produced test is:

X86 x86-2+2W
"Wse PodWW Wse PodWW"
Prefetch=0:x=F,0:y=W,1:y=F,1:x=W
Com=Ws Ws
Orig=Wse PodWW Wse PodWW
{
}
 P0         | P1         ;
 MOV [x],$2 | MOV [y],$2 ;
 MOV [y],$1 | MOV [x],$1 ;
exists
(x=2 /\ y=2)

By the coherence hypothesis, checking the final value of locations suffices to characterise those two coherence orders, as expressed by the final condition of x86-2+2W:

exists (x=2 /\ y=2)

See Sec. 11 for alternative means to identify coherence orders.

Candidate relaxations Rfe and Rfi relate writes to reads that load their value. We are now equipped to generate the famous iriw test (independent reads of independent writes):

% diyone7 -arch X86 Rfe PodRR Fre Rfe PodRR Fre -name iriw

We generate its internal variation (i.e. where all Rfe are replaced by Rfi) as easily:

% diyone7 -arch X86 Rfi PodRR Fre Rfi PodRR Fre -name iriw-internal

We get the cycles of Fig. 1,

Figure 1: Cycles for iriw and iriw-internal

and the litmus tests of Fig. 2.

Figure 2: Litmus tests iriw and iriw-internal

X86 iriw
"Rfe PodRR Fre Rfe PodRR Fre"
{ }
 P0          | P1         | P2          | P3         ;
 MOV EAX,[y] | MOV [x],$1 | MOV EAX,[x] | MOV [y],$1 ;
 MOV EBX,[x] |            | MOV EBX,[y] |            ;
exists (0:EAX=1 /\ 0:EBX=0 /\ 2:EAX=1 /\ 2:EBX=0)

X86 iriw-internal
"Rfi PodRR Fre Rfi PodRR Fre"
{ }
 P0          | P1          ;
 MOV [x],$1  | MOV [y],$1  ;
 MOV EAX,[x] | MOV EAX,[y] ;
 MOV EBX,[y] | MOV EBX,[x] ;
exists
(0:EAX=1 /\ 0:EBX=0 /\
 1:EAX=1 /\ 1:EBX=0)

Candidate relaxations given as arguments really are a “concise specification”. As an example, we get iriw for Power, simply by changing -arch X86 into -arch PPC.

% diyone7 -arch PPC Rfe PodRR Fre Rfe PodRR Fre
PPC A
"Rfe PodRR Fre Rfe PodRR Fre"
{
0:r2=y; 0:r4=x;
1:r2=x;
2:r2=x; 2:r4=y;
3:r2=y;
}
 P0           | P1           | P2           | P3           ;
 lwz r1,0(r2) | li r1,1      | lwz r1,0(r2) | li r1,1      ;
 lwz r3,0(r4) | stw r1,0(r2) | lwz r3,0(r4) | stw r1,0(r2) ;
exists (0:r1=1 /\ 0:r3=0 /\ 2:r1=1 /\ 2:r3=0)

Also notice that without the -name option, diyone7 writes its result to standard output.

7.4 Summary of simple candidate relaxations

We summarise the candidate relaxations available on all architectures.

7.4.1 Communication candidate relaxations

We call communication candidate relaxations the relations between two events communicating through memory, though they could belong to the same processor. Thus, these events operate on the same memory location.

diy7 syntax	Source	Target	Processor	Additional property
Rfi	W	R	Same	Target reads its value from source
Rfe	W	R	Different	Target reads its value from source
Coi	W	W	Same	Source precedes target in coherence order
Coe	W	W	Different	Source precedes target in coherence order
Fri	R	W	Same	Source reads a value from a write that precedes target in coherence order
Fre	R	W	Different	Source reads a value from a write that precedes target in coherence order

Notice that “Ws” is a deprecated synonym of “Co”

7.4.2 Program order candidate relaxations

We call program order candidate relaxations each relation between two events in the program order. These events are on the same processor, since they are in program order. As regards code output, diy7 interprets a program order candidate relaxation by generating two memory instructions (load or store) following one another.

Program order candidate relaxations have the following syntax:

Po(s|d)(R|W)(R|W)

where:

s (resp. d) indicates that the two events are to the same (resp. different) location(s);
R (resp. W) indicates an event to be a read (resp. a write);

In practice, we have:

diy7 syntax	Source	Target	Location
PosRR	R	R	Same
PodRR	R	R	Diff
PosRW	R	W	Same
PodRW	R	W	Diff
PosWW	W	W	Same
PodWW	W	W	Diff
PosWR	W	R	Same
PodWR	W	R	Diff

It is to be noticed that PosWR, PosWW and PosRW are similar to Rfi, Coi and Fri, respectively. More precisely, diy7 is unable to consider a PosWR (or PosWW, or PosRW) candidate relaxation as not being also a Rfi (or Coi, or Fri) candidate relaxation. However, litmus tests conditions may be more informative in the case of Rfi and Fri.

7.4.3 Fence candidate relaxations

Relaxed architectures provide specific instructions, namely barriers or fences, to enforce order of memory accesses. In diy7 the presence of a fence instruction is specified with fence candidate relaxations, similar to program order candidate relaxations, except that a fence instruction is inserted. Hence we have FencesRR, FencedRR. etc. The inserted fence is the strongest fence provided by the architecture — that is, mfence for x86 and sync for Power.

Fences can also be specified by using specific names. For instance, we have MFence for x86; while on PPC we have Sync, LwSync, Eieio and ISync. Hence, to yield two reads to different locations and separated by the lightweight PPC barrier lwsync, we specify LwSyncdRR.

The table in figure 3 lists all fence prefixes for all supported architectures:

Figure 3: Fence prefixes per architecture

Arch Fence prefixes

X86 MFence

PPC Sync LwSync Eieio ISync

ARM DSB DMB DMB.ST DSB.ST ISB

AArch64¹ DMB.SY DMB.ST DMB.LD ISB

MIPS Fence

RISCV Fence.rw.rw Fence.rw.r Fence.rw.w Fence.r.rw Fence.r.r Fence.r.w Fence.w.rw Fence.w.r Fence.w.w

C FenceSc FenceAR FenceAcq FenceRel FenceCons FenceRlx

LISA² FenceAn…

1
More fences available through command line option -moreedges true
2
Fence annotations an are defined in the bell file as instructions F[…,’an,…].

Also notice that, for a given architecture A, recognized fence prefixes can be listed with the following command: “diyone7 -arch A -show fences”.

7.5 Annotations

Most architectures, and more specifically LISA, support annotations that decorate events. The most common such annotation is “X” (or “A” depending on the architecture) for “atomic” accesses that will generate atomic accesses, i.e. load-reserve/store-conditional pairs (lr/sc) or read-modify-write instructions.

As annotations apply to events, they are pseudo-candidates that appear in-between actual candidate relaxations. For instance, consider the AArch64 architecture that features primitive store release (annotation “L”) and load acquire (annotation “A”) instructions. One may specify the well known message-passing release-acquire idiom as follows:

% diyone7 -arch AArch64 PodWW L Rfe A PodRR Fre
AArch64 A
...
{
0:X1=x; 0:X3=y;
1:X1=y; 1:X3=x;
}
 P0           | P1           ;
 MOV W0,#1    | LDAR W0,[X1] ;
 STR W0,[X1]  | LDR W2,[X3]  ;
 MOV W2,#1    |              ;
 STLR W2,[X3] |              ;
exists (1:X0=1 /\ 1:X2=0)

In the ARMv8 assembler code above, one notices the STLR (store release) and LDAR (load acquire) instructions.

In AArch64, the “atomic” annotation reads “X”, as “A” stands for “acquire”. This applies to both stores and loads and generates code snippets that implement “‘atomic” accesses using lr/sc pairs.

% diyone7 -arch AArch64 PodWW XL Rfe XA PodRR Fre
AArch64 A
...
{
0:X1=x; 0:X2=y;
1:X0=y; 1:X4=x;
}
 P0               | P1              ;
 MOV W0,#1        | Loop01:         ;
 STR W0,[X1]      | LDAXR W1,[X0]   ;
 MOV W3,#1        | STXR W2,W1,[X0] ;
 Loop00:          | CBNZ W2,Loop01  ;
 LDXR W4,[X2]     | LDR W3,[X4]     ;
 STLXR W5,W3,[X2] |                 ;
 CBNZ W5,Loop00   |                 ;
exists
(y=1 /\ 0:X4=0 /\ 1:X1=1 /\ 1:X3=0)

Observe that the “L” and “A” annotations can be combined, yielding “atomic release” store and “atomic acquire” load operations.

Loop unrolling of lr/sc idioms is controlled through the -ua <n> (abbreviation of -unrollatonic <n>) command line option. Of some interest is the setting -ua 0 that commands unrolling the loops once and testing store conditional success in the final condition:

% diyone7 -ua 0 -arch AArch64 PodWW XL Rfe XA PodRR Fre
AArch64 A
"PodWWPXL RfeXLXA PodRRXAP Fre"
Generator=diyone7 (version 7.49+01(dev))
Prefetch=0:x=F,0:y=W,1:y=F,1:x=T
Com=Rf Fr
Orig=PodWWPXL RfeXLXA PodRRXAP Fre
{
0:X1=x; 0:X2=y;
1:X0=y; 1:X4=x;
}
 P0               | P1              ;
 MOV W0,#1        | LDAXR W1,[X0]   ;
 STR W0,[X1]      | STXR W2,W1,[X0] ;
 MOV W3,#1        | LDR W3,[X4]     ;
 LDXR W4,[X2]     |                 ;
 STLXR W5,W3,[X2] |                 ;
exists
(y=1 /\ 0:X5=0 /\ 0:X4=0 /\ 1:X2=0 /\ 1:X1=1 /\ 1:X3=0)

As another example of atomic accesses, the X86 architecture use atomic exchange instructions to implement atomic store operations:

% diyone7 -arch X86  A PodWR Fre A PodWR Fre
X86 A
...
{
}
 P0           | P1           ;
 MOV EAX,$1   | MOV EAX,$1   ;
 XCHG [x],EAX | XCHG [y],EAX ;
 MOV EBX,[y]  | MOV EBX,[x]  ;
exists
(x=1 /\ y=1 /\ 0:EBX=0 /\ 1:EBX=0)

The table in figure 4 lists all annotations for all supported architectures. Notice that all annotations do not apply to all accesses. For instance, the X86 “A” applies to stores only, while the PPC “R” (Reserve) applies to loads only, etc.

Figure 4: Annotation pseudo-candidates

Arch Pseudo-candidates

X86 A

PPC R A

ARM R A

AArch64¹ A Q L X XL XA XAL

MIPS R A

RISCV² XARAR XARRl XARAq XARP XRlAR XRlRl XRlAq XRlP XAqAR XAqRl XAqAq XAqP XPAR XPRl XPAq X AR Rl Aq

C Con Rlx Sc AR Rel Acq

LISA³ An…

1
Annotation “A” is acquire, annotation “X” is atomic
2
Some annotations are experimental and are not included in ISA
3
Fence annotations An are defined in the bell file (option -bell <bell file>) as instructions R[…,’an,…] and instructions W[…,’an,…].

Also notice that, for a given architecture A, recognized annotation pseudo relaxations can be listed with the following command: “diyone7 -arch A -show annot.

8 Testing candidate relaxations with diy7

The tool diy7 can probably be used in various, creative ways; but the tool first stems from our technique for testing relaxed memory models. The -safe and -relax options are crucial here. We describe our technique by the means of an example: X86-TSO.

8.1 Principle

Before engaging in testing it is important to categorise candidate relaxations as safe or relaxed.

This can be done by interpretation of vendor’s documentation. For instance, the iriw test of Sec. 7.3 is the example 7.7 of [2] “Stores Are Seen in a Consistent Order by Other Processors”, with a Forbid specification. Hence we deduce that Fre, Rfe and PodRR are safe. Then, from test iriw-internal of Sec. 7.3, which is Intel’s test 7.5 “Intra-Processor Forwarding Is Allowed” with an allow specification, we deduce that Rfi is relaxed. Namely, the cycle of iriw-internal is “Fre Rfi PodRR Fre Rfi PodRR”. Therefore, the only possibility is for Rfi to be relaxed.

Overall, we deduce:

Candidate relaxations PosWR (Rfi) and PodWR are relaxed
The remaining candidate relaxations PosRR, PodRR, PosWW (Coi), PodWW, PosRW (Fri), Fre and Coe are safe. Fence relaxations FencedsWR and FenceddWR are also safe and worth testing.

Of course these remain assumptions to be tested. To do so, we perform one series of tests per relaxed candidate relaxation, and one series of tests for confirming safe candidate relaxations as much as possible. Let S be all safe candidate relaxations.

Let r be a relaxed candidate relaxation. We produce tests for confirming r being relaxed by diy -relax r -safe S. We run these tests with litmus7. If one of the tests yields Ok, then r is confirmed to be relaxed, provided the experiments on S below do not fail.
For confirming the safe set, we produce tests by diy -safe S. We run these tests as much as possible and expect never to see Ok.

Namely, diy7 builds cycles as follows:

diy -relax r -safe S build cycles with at least one r taking other candidate relaxations from S.
diy -safe S build cycles from the candidate relaxations in S.

For the purpose of confirming relaxed candidate relaxations, S can be replaced by a subset.

8.2 Testing x86

Repeating command line options is painful and error prone. Besides, configuration parameters may get lost. Thus, we regroup those in configuration files that simply list the options to be passed to diy7, one option per line. For instance here is the configuration file for testing the safe relaxations of x86, x86-safe.conf.

#safe x86 conf file
-arch X86
#Generate tests on four processors or less
-nprocs 4
#From cycles of size at most six
-size 6
#With names safe000, safe0001,...
-name safe
#List  of safe relaxations
-safe PosR* PodR* PodWW PosWW Rfe Wse Fre FencesWR FencedWR

Observe that the syntax of candidate relaxations allows one shortcut: the wildcard * stands for W and R. Thus PodR* gets expanded to the two candidate relaxations PodRR and PodRW.

We get safe tests by issuing the following command, preferably in a specific directory, say safe.

% diy7 -conf x86-safe.conf
Generator produced 38 tests
Relaxations tested: {}

Here are the configuration files for confirming that Rfi and PodWR are relaxed, x86-rfi.conf and x86-podwr.conf.

#rfi x86 conf file
-arch X86
-nprocs 4
-size 6
-name rfi
-safe PosR* PodR* PodWW PosWW Rfe Wse Fre FencesWR FencedWR
-relax Rfi
#At most three "instructions" per thread.
-ins 3

#podrw x86 conf file
-arch X86
-nprocs 4
-size 6
-name podwr
-safe Fre
-relax PodWR

Notice that we used the complete safe set in x86-rfi.conf and a reduced set in x86-podwr.conf. Also notice that the thread length (or number of instructions) is limited in the former case. Tests are to be generated in specific directories. To that aim, we provide a convenient archive x86.tar.

% cd rfi
% diy7 -conf x86-rfi.conf
Generator produced 28 tests
Relaxations tested: {Rfi}
% cd ../podwr
% diy7 -conf x86-podwr.conf
Generator produced 2 tests
Relaxations tested: {PodWR}
% cd ..

Now, let us run all tests at once, with the parameters of machine saumur (4 physical cores with hyper-threading):

% litmus7 -mach saumur rfi/@all > rfi/saumur.rfi.00
% litmus7 -mach saumur podwr/@all > podwr/saumur.podwr.00
% litmus7 -mach saumur safe/@all > safe/saumur.safe.00

If your machine has 2 cores only, try litmus -a 2 -limit true…

We now look for the tests that have validated their condition in the result files of litmus7. A simple tool, readRelax7, does the job:

% readRelax7 rfi/saumur.rfi.00 podwr/saumur.podwr.00 safe/saumur.safe.00
   .
   .
   .
** Relaxation summary **
{Rfi} With {Coe, Fre, PodRR, PodRW} {Coe, Fre, PodRR, PodRW, Rfe}\
 {Coe, Fre, PodRR, PodWW} {Fre, PodRR} {Fre, PodRR, PodWW}\
 {Fre, PodRR, Rfe}
{PodWR} With {Fre}

The tool readRelax7 first lists the result of all tests (which is omitted above), and then dumps a summary of the relaxations it found. The sets of the candidate relaxations that need to be safe for the tests to indeed reveal a relaxed candidate relaxation are also given. Here, Rfi and PodWR are confirmed to be relaxed, while no candidate relaxation in the safe set is found to be relaxed. Had it been the case, a line {} With {...} would have occurred in the relaxation summary. The safe tests need to be run a lot of times, to increase our confidence in the safe set.

9 Additional relaxations

We introduce some additional candidate relaxations that are specific to the Power architecture. We shall not detail here our experiments on Power machines. See our experience report http://diy.inria.fr/phat/ for more details.

9.1 Intra-processor dependencies

In a very relaxed architecture such as Power, intra-processor dependencies becomes significant. Roughly, intra-processor dependencies fall into two categories:

Data dependencies: occur when a memory access instruction reads a register whose contents depends upon a previous (in program order) load. In diy7 we specify such a dependency as:
Dp(s|d)(R|W)
where, as usual, s (resp. d) indicates that the source and target events are to the same (resp. different) location(s); and R (resp. W) indicates that the target event is a read (resp. a write). As a matter of fact, we do not need to specify the direction of the source event, since it always is a read.
Finally, one may control the nature of the dependency: address dependency (DpAddr(s|d)(R|W) or data dependency (DpData(s|d)W).
Control dependencies: occur when the execution of a memory access is conditioned by the contents of a previous load. Their syntax is similar to the one of Dp relaxations, with a Ctrl tag:
Ctrl(s|d)(R|W)
This default syntax expands to control dependencies as guaranteed by the Power documentation. For read to write, conditioning execution is enough (expanded syntax, DpCtrl(s|d)W). But for read to read, an extra instruction, isync, is needed (expanded syntax DpCtrlIsync(s|d)R, see below). The syntax DpCtrl(s|d)R also exists, it expresses the conditional execution of a load instruction and does not create ordering.
ARM has similar candidate relaxations, Isync being replaced by ISB.

In the produced code, diy7 expresses a data dependency by a false dependency (or dummy dependency) that operates on the address of the target memory access. For instance:

% diyone7 DpdW Rfe DpdW Rfe
PPC A
"DpAddrdW Rfe DpAddrdW Rfe"
{
0:r2=y; 0:r5=x;
1:r2=x; 1:r5=y;
}
 P0            | P1            ;
 lwz r1,0(r2)  | lwz r1,0(r2)  ;
 xor r3,r1,r1  | xor r3,r1,r1  ;
 li r4,1       | li r4,1       ;
 stwx r4,r3,r5 | stwx r4,r3,r5 ;
exists (0:r1=1 /\ 1:r1=1)

On P₀, the effective address of the indexed store stwx r4,r3,r5 depends on the contents of the index register r3, which itself depends on the contents of r1. The dependency is a “false” one, since the contents of r3 always is zero, regardless of the contents of r1. One may observe that DpdW is changed into DpAddrdW in the comment field of the test. As a matter of fact, DpdW is a macro for the address dependency DpAddrW. We could have specified data dependency instead:

% diyone7 DpDatadW Rfe DpAddrdW Rfe
PPC A
"DpDatadW Rfe DpAddrdW Rfe"
{
0:r2=y; 0:r4=x;
1:r2=x; 1:r5=y;
}
 P0           | P1            ;
 lwz r1,0(r2) | lwz r1,0(r2)  ;
 xor r3,r1,r1 | xor r3,r1,r1  ;
 addi r3,r3,1 | li r4,1       ;
 stw r3,0(r4) | stwx r4,r3,r5 ;
exists
(0:r1=1 /\ 1:r1=1)

On P₀, the value stored by the last (store) instruction stw r3,0(r4) is now computed from the value read by the first (load) instruction lwz r1,0(r2). Again, this is a “false” dependency.

A control dependency is implemented by the means of an useless compare and branch sequence, plus the isync instruction when the target event is a load. For instance

% diyone7 CtrldR Fre SyncdWW Rfe
PPC A
"DpCtrlIsyncdR Fre SyncdWW Rfe"
{
0:r2=y; 0:r4=x;
1:r2=x; 1:r4=y;
}
 P0           | P1           ;
 lwz r1,0(r2) | li r1,1      ;
 cmpw r1,r1   | stw r1,0(r2) ;
 beq  LC00    | sync         ;
 LC00:        | li r3,1      ;
 isync        | stw r3,0(r4) ;
 lwz r3,0(r4) |              ;
exists
(0:r1=1 /\ 0:r3=0)

Also notice that CtrldR is interpreted as DpCtrlIsyncR in the comment field of the test.

Of course, in all cases, we assume that “false” dependencies are not “optimised out” by the assembler or the hardware.

9.2 Composite relaxations and cumulativity

Users may specify a small sequence of single candidate relaxations as behaving as a single candidate relaxation to diy7. The syntax is:

[r1, r2, …]

The main usage of the feature is to specify cumulativity candidate relaxations, that is, the sequence of Rfe and of a fence candidate relaxation (A-cumulativity), the sequence of a fence candidate relaxation and of Rfe (B-cumulativity), or both (AB-cumulativity).

Cumulativity candidate relaxations are best expressed by the following syntactical shortcuts: let r be a fence candidate relaxation, then ACr stands for [Rfe,r], BCr stands for [r,Rfe], while ABCr stands for [Rfe,r,Rfe],

Hence, a simple way to generate iriw-like (see Sec. 7.3) litmus tests with lwsync is as follows:

% diy7 -name iriw-lwsync -nprocs 8 -size 8 -relax ACLwSyncdRR -safe Fre
Generator produced 3 tests
Relaxations tested: {ACLwSyncdRR}

where we have for instance:

% cat iriw-lwsync001.litmus
PPC iriw-lwsync001
"Fre Rfe LwSyncdRR Fre Rfe LwSyncdRR Fre Rfe LwSyncdRR"
Cycle=Fre Rfe LwSyncdRR Fre Rfe LwSyncdRR Fre Rfe LwSyncdRR
Relax=ACLwSyncdRR
Safe=Fre
{
0:r2=z; 0:r4=x; 1:r2=x;
2:r2=x; 2:r4=y; 3:r2=y;
4:r2=y; 4:r4=z; 5:r2=z;
}
 P0           | P1           | P2           | P3           | P4           | P5           ;
 lwz r1,0(r2) | li r1,1      | lwz r1,0(r2) | li r1,1      | lwz r1,0(r2) | li r1,1      ;
 lwsync       | stw r1,0(r2) | lwsync       | stw r1,0(r2) | lwsync       | stw r1,0(r2) ;
 lwz r3,0(r4) |              | lwz r3,0(r4) |              | lwz r3,0(r4) |              ;
exists (0:r1=1 /\ 0:r3=0 /\ 2:r1=1 /\ 2:r3=0 /\ 4:r1=1 /\ 4:r3=0)

9.3 Detour candidate relaxations

Detours combine a Pos candidate relaxation and a sequence of two external communication candidate relaxations. More precisely detours are some constrained Pos candidate relaxations: the source and target events must be related by a sequence of two communication candidate relaxations, whose target and source are a common event whose processor is new.

diy7 syntax	Source	Target	Detour
DetourR	R	R	Fre; Rfe
DetourW	W	R	Coe; Rfe
DetourRW	R	W	Fre;Coe
DetourWW	W	W	Coe;Coe

DetourRR and DetourWR are accepted as synonyms for DetourR and DetourW respectively.

Graphically, we have:

Finally notice that “internal” detours need no special treatement as they can be expressed by the sequences “Fri; Rfi”, “Coi;Rfi”, etc.

10 Test variations with diycross7

The tool diycross7 has an interface similar to diyone7, except it accepts list of candidate relaxations where diyone7 accepts single candidate relaxations. The new tool produces the test resulting by “cross producing” the lists. For instance, one can generate all variations on the IRIW test (see Sec. 7.3) that involve data dependencies and the lightweight barrier lwsync as follows:

% diycross7 -arch PPC -name IRIW Rfe DpdR,LwSyncdRR Fre Rfe DpdR,LwSyncdRR Fre
Generator produced 3 tests
% ls
@all  IRIW+addrs.litmus  IRIW+lwsync+addr.litmus  IRIW+lwsyncs.litmus

diycross7 outputs the index file @all that lists the test source files, and three tests, with names we believe to be self-explanatory:

% cat IRIW+lwsync+addr.litmus
PPC IRIW+lwsync+addr
"Rfe LwSyncdRR Fre Rfe DpAddrdR Fre"
Cycle=Rfe LwSyncdRR Fre Rfe DpAddrdR Fre
{
0:r2=y;
1:r2=y; 1:r4=x;
2:r2=x;
3:r2=x; 3:r5=y;
}
 P0           | P1           | P2           | P3            ;
 li r1,1      | lwz r1,0(r2) | li r1,1      | lwz r1,0(r2)  ;
 stw r1,0(r2) | lwsync       | stw r1,0(r2) | xor r3,r1,r1  ;
              | lwz r3,0(r4) |              | lwzx r4,r3,r5 ;
exists (1:r1=1 /\ 1:r3=0 /\ 3:r1=1 /\ 3:r4=0)

Users may use the special keywords allRR, allRW, allWR and allWW to specify the set of all existing program order candidate relaxations between the specified “R” or “W”. For instance, we get the complete variations on IRIW by:

% diycross7 -arch PPC -name IRIW Rfe allRR Fre Rfe allRR Fre
Generator produced 28 tests
% ls
@all
IRIW.litmus
IRIW+addr+po.litmus
IRIW+lwsync+addr.litmus
...
IRIW+isyncs.litmus

11 Identifying coherence orders with observers

We first produce the “four writes” test 2+2W for Power:

% diyone7 -name 2+2W -arch PPC PodWW Coe PodWW Coe
% cat 2+2W.litmus
PPC 2+2W
"PodWW Coe PodWW Coe"
{ 0:r2=x; 0:r4=y; 1:r2=y; 1:r4=x; }
 P0           | P1           ;
 li r1,2      | li r1,2      ;
 stw r1,0(r2) | stw r1,0(r2) ;
 li r3,1      | li r3,1      ;
 stw r3,0(r4) | stw r3,0(r4) ;
exists (x=2 /\ y=2)

Test 2+2W is the Power version of the x86 test x86-2+2W of Sec. 7.3. In that section, we argued that the final condition exists (x=2 /\ y=2) suffices to identify the coherence orders 0, 1, 2 for locations x and y. As a consequence, a positive final condition reveals the occurrence of the specified cycle: Coe PodWW Coe PodWW.

11.1 Simple observers

Observers provide an alternative, perhaps more intuitive, means to identify coherence orders: an observer simply is an additional thread that performs several loads from the same location in sequence. Here, loading value 1 and then value 2 from location x identifies the coherence order 0, 1, 2. The command line switch -obs force commands the production of observers (test 2+2WObs):

% diyone7 -name 2+2WObs -obs force -obstype straight -arch PPC PodWW Coe PodWW Coe
% cat 2+2WObs.litmus
PPC 2+2WObs
"PodWW Coe PodWW Coe"
{ 0:r2=x; 1:r2=y; 2:r2=x; 2:r4=y; 3:r2=y; 3:r4=x; }
 P0           | P1           | P2           | P3           ;
 lwz r1,0(r2) | lwz r1,0(r2) | li r1,2      | li r1,2      ;
 lwz r3,0(r2) | lwz r3,0(r2) | stw r1,0(r2) | stw r1,0(r2) ;
              |              | li r3,1      | li r3,1      ;
              |              | stw r3,0(r4) | stw r3,0(r4) ;
exists (0:r1=1 /\ 0:r3=2 /\ 1:r1=1 /\ 1:r3=2)

Thread P0 observes location x, while thread P1 observes location y. With respect to 2+2W, final condition has changed, the direct observation of the final contents of locations x and y being replaced by two successive observations of the contents of x and y.

It should first be noticed that the reasoning above assumes that having the same thread to read 1 from say x and then 2 implies that 1 takes place before 2 in the coherence order of x. This may not be the case in general — although it holds for Power. Moreover, running 2+2W and 2+2WObs yields contrasted results. While a positive conclusion is immediate for 2+2W, we were not able to reach a similar conclusion for 2+2WObs. As a matter of fact, 2+2WObs yielding Ok stems from the still-to-be-observed coincidence of several events: both observers threads must run at the right pace to observe the change from 1 to 2, while the cycle must indeed occur.

11.2 More observers

11.2.1 Fences and loops in observers

A simple observer consisting of loads performed in sequence is a straight observer. We define two additional sorts of observers: fenced observers, where loads are separated by the strongest fence available, and loop observers, which poll on location contents change. Those are selected by the homonymous tags given as arguments to the command line switch -obstype. For instance, we get the test 2+2WObsFenced by:

% diyone7 -name 2+2WObsFenced -obs force -obstype fenced -arch PPC PodWW Coe PodWW Coe
% cat 2+2WObsFenced.litmus
PPC 2+2WObsFenced
"PodWW Coe PodWW Coe"
{ 0:r2=x; 1:r2=y; 2:r2=x; 2:r4=y; 3:r2=y; 3:r4=x; }
 P0           | P1           | P2           | P3           ;
 lwz r1,0(r2) | lwz r1,0(r2) | li r1,2      | li r1,2      ;
 sync         | sync         | stw r1,0(r2) | stw r1,0(r2) ;
 lwz r3,0(r2) | lwz r3,0(r2) | li r3,1      | li r3,1      ;
              |              | stw r3,0(r4) | stw r3,0(r4) ;
exists (0:r1=1 /\ 0:r3=2 /\ 1:r1=1 /\ 1:r3=2)

Invoking diyone7 as “diyone -obs force -obstype loop ...” yields the additional test 2+2WObsLoop:

PPC 2+2WObsLoop
"PodWW Wse PodWW Wse"
{ 0:r2=x; 1:r2=y;
  2:r2=x; 2:r4=y; 3:r2=y; 3:r4=x; }
 P0            | P1            | P2           | P3           ;
 L00:          | L03:          | li r1,2      | li r1,2      ;
 lwz r1,0(r2)  | lwz r1,0(r2)  | stw r1,0(r2) | stw r1,0(r2) ;
 cmpwi r1,0    | cmpwi r1,0    | li r3,1      | li r3,1      ;
 beq  L00      | beq  L03      | stw r3,0(r4) | stw r3,0(r4) ;
 li r4,200     | li r4,200     |              |              ;
 L01:          | L04:          |              |              ;
 lwz r3,0(r2)  | lwz r3,0(r2)  |              |              ;
 cmpw r3,r1    | cmpw r3,r1    |              |              ;
 bne  L02      | bne  L05      |              |              ;
 addi r4,r4,-1 | addi r4,r4,-1 |              |              ;
 cmpwi r4,0    | cmpwi r4,0    |              |              ;
 bne  L01      | bne  L04      |              |              ;
 L02:          | L05:          |              |              ;
exists (0:r1=1 /\ 0:r3=2 /\ 1:r1=1 /\ 1:r3=2)

A loop observer first busily waits for the observed location not to hold its initial contents 0, and then busily waits for another change of location contents. The second loop is performed at most a finite number of times (here 200), in order to ensure termination.

11.2.2 Local observers

With local observers, coherence order is observed by the test threads. This implies changing the tests, and some care must be exercised when interpreting results.

The idea is as follows: when two threads are connected by a Coe candidate relaxation, meaning that the first thread ends by writing v to some location ℓ and that the second threads starts by writing v+1 to the same location ℓ, we add an observing read of location ℓ at the end of the first thread. Then, reading v+1 means that the write by the first thread precedes the write by the second thread in ℓ coherence order. More concretely, we instruct diy7 generators to emit such local observers with option -obs local:

% diyone7 -name 2+2WLocal -obs local -obstype straight -arch PPC PodWW Coe PodWW Coe
% cat 2+2WLocal.litmus
PPC 2+2WLocal
"PodWW Coe PodWW Coe"
{
0:r2=x; 0:r4=y;
1:r2=y; 1:r4=x;
}
 P0           | P1           ;
 li r1,2      | li r1,2      ;
 stw r1,0(r2) | stw r1,0(r2) ;
 li r3,1      | li r3,1      ;
 stw r3,0(r4) | stw r3,0(r4) ;
 lwz r5,0(r4) | lwz r5,0(r4) ;
exists
(0:r5=2 /\ 1:r5=2)

With respect to 2+2W, final condition has changed, the direct observation of the final contents of locations y and x being replaced local observation of y by thread 0 and local observation of x by thread 1.

Based for instance on the test execution witness, whose only SC-violation cycle is the same as as for 2+2W,

one may argue that tests 2+2W and 2+2WLocal are equivalent, in the sense that both are allowed or both are forbidden by a model or machine.

Local observers can also be fenced or looping. For instance, one produces 2+2WLocalFenced, the fenced local observer version of 2+2W as follows:

% diyone7 -name 2+2WLocalFenced -obs local -obstype fenced -arch PPC PodWW Coe PodWW Coe
% cat 2+2WLocalFenced.litmus
PPC 2+2WLocalFenced
"PodWW Coe PodWW Coe"
{
0:r2=x; 0:r4=y;
1:r2=y; 1:r4=x;
}
 P0           | P1           ;
 li r1,2      | li r1,2      ;
 stw r1,0(r2) | stw r1,0(r2) ;
 li r3,1      | li r3,1      ;
 stw r3,0(r4) | stw r3,0(r4) ;
 sync         | sync         ;
 lwz r5,0(r4) | lwz r5,0(r4) ;
exists
(0:r5=2 /\ 1:r5=2)

While one produces 2+2WLocalLoop, the looping local observer version of 2+2W as follows:

% diyone7 -name 2+2WLocalLoop -obs local -obstype loop -arch PPC PodWW Coe PodWW Coe
% cat 2+2WLocalLoop.litmus
PPC 2+2WLocalLoop
"PodWW Coe PodWW Coe"
{
0:r2=x; 0:r4=y;
1:r2=y; 1:r4=x;
}
 P0            | P1            ;
 li r1,2       | li r1,2       ;
 stw r1,0(r2)  | stw r1,0(r2)  ;
 li r3,1       | li r3,1       ;
 stw r3,0(r4)  | stw r3,0(r4)  ;
 li r6,200     | li r6,200     ;
 L00:          | L02:          ;
 lwz r5,0(r4)  | lwz r5,0(r4)  ;
 cmpwi r5,1    | cmpwi r5,1    ;
 bne  L01      | bne  L03      ;
 addi r6,r6,-1 | addi r6,r6,-1 ;
 cmpwi r6,0    | cmpwi r6,0    ;
 bne  L00      | bne  L02      ;
 L01:          | L03:          ;
exists (0:r5=2 /\ 1:r5=2)

In the code above, observing loads are attempted at most 200 time or until a value different from 1 is read.

11.2.3 Performance of observers

As an indication of the performance of the various sorts of observers, the following table summarises a litmus7 experiment performed on a 8-cores 4-ways SMT Power7 machine machine— complete litmus7 log.

	2+2W	2+2WObs	2+2WObsFenced	2+2WObsLoop	2+2WLocal	2+2WLocalFenced	2+2WLocalLoop
Positive	2.2M/160M	0/80M	326/80M	25k/80M	2/160M	34k/160M	111k/160M

The row “Positive” shows the number of observed positive outcomes/total number of outcomes produced. For instance, in the case of 2+2W, we observed the positive outcome x=2 /\ y=2 more than 2 millions times out of a total of 160 millions outcomes. As a conclusion, all techniques achieve decent results, except straight observers.

11.3 Three stores or more

In test 2+2W the coherence orders sequence two writes. If there are three writes or more to the same location, it is no longer possible to identify a coherence order by observing the final contents of the memory location involved. In other words, observers are mandatory.

The argument to the -obs switch commands the production of observers. It can take four values:

accept: Produce observers when absolutely needed. More precisely, given memory location x, no equality on x appears in the final condition for zero or one write to x, one such appears for two writes, and observers are produced for three writes or more.
avoid: Never produce observers, i.e. fail when there are three writes to the same location.
force: Produce observers for two writes or more.
local: Always produce local observers.

With diyone7, one easily build a three writes test as for instance the following W5:

% diyone7 -obs accept -obstype fenced -arch PPC -name W5 Coe Coe PodWW Coe PodWW
% cat W5.litmus
PPC W5
"Coe Coe PodWW Coe PodWW"
{ 0:r2=y; 1:r2=y; 1:r4=x; 2:r2=x; 2:r4=y; 3:r2=y; }
 P0           | P1           | P2           | P3           ;
 lwz r1,0(r2) | li r1,3      | li r1,2      | li r1,2      ;
 sync         | stw r1,0(r2) | stw r1,0(r2) | stw r1,0(r2) ;
 lwz r3,0(r2) | li r3,1      | li r3,1      |              ;
 sync         | stw r3,0(r4) | stw r3,0(r4) |              ;
 lwz r4,0(r2) |              |              |              ;
exists (x=2 /\ 0:r1=1 /\ 0:r3=2 /\ 0:r4=3)

As apparent from the code above, we have a fenced observer thread on y (P0), while the final state of x is observed directly (x=2). The command line switch -obs force would yield two observers, while -obs avoid would lead to failure.

With command line switch -obs local we get three local observations of coherence, which suffice to reconstruct the complete coherence orders:

% diyone7 -obs local -obstype fenced -arch PPC -name W5Local Coe Coe PodWW Coe PodWW
chi% cat W5Local.litmus
PPC W5Local
"Coe Coe PodWW Coe PodWW"
{
0:r2=x; 0:r4=y;
1:r2=y; 1:r4=x;
2:r2=x;
}
 P0           | P1           | P2           ;
 li r1,3      | li r1,2      | li r1,2      ;
 stw r1,0(r2) | stw r1,0(r2) | stw r1,0(r2) ;
 li r3,1      | li r3,1      | sync         ;
 stw r3,0(r4) | stw r3,0(r4) | lwz r3,0(r2) ;
 sync         | sync         |              ;
 lwz r5,0(r4) | lwz r5,0(r4) |              ;
exists (0:r5=2 /\ 1:r5=2 /\ 2:r3=3)

12 Command usage

The diy7 suite consists in four main tools:

diyone7: generates one litmus test from the specification of a violation of the sequential consistency memory model as a cycle — see Sec. 7.2.
diycross7: generates variations of diyone7 style tests — see Sec. 10.
diy7: generates several tests, aimed at confirming that candidate relaxations are relaxed or safe—see Sec. 8.
readRelax7: Extract relevant information from the results of tests—see Sec. 8.2.

12.1 A note on test names

We have designed a simple naming scheme for tests. A normalised test name decomposes first as a family name, and second as a description of program order (or internal) candidate relaxations.

12.1.1 Family names

Cycles (and thus tests) are first grouped by families. Family names describe test structure, based upon external communication candidate relaxations. More specifically, external communication candidate relaxations suffice to settle the directions (W or R) of first and last events of threads, considering the case when those two events are the same. For instance, consider the cycle “PodWW Rfe PodRR Fre”: there are two threads in the corresponding test (as there are two external communication candidate relaxations), one thread starts and ends with a write (written WW), while the other thread starts and ends with a read (written RR). The family name is thus WW+RR, (or RR+WW, but we choose the former). For reference, a normalised family name is the minimal amongst the representations of a given cycle, following the lexical order derived from the order W < WW < RR < RW < WR < R.

The most common families have nicknames, which are defined by this document. For instance, consider the test whose cycle is “PodWR Fre PodWR Fre”. The family name is WR+WR, as this is a two-thread test, both threads starting with a write and ending with a read. The nickname for this family is, as we already know, SB (store-buffering). Here is the list of nicknames and family names for two thread tests:

2+2W	WW+WW	PodWW Coe PodWW Coe
LB	RW+RW	PodRW Rfe PodRW Rfe
MP	WW+RR	PodWW Rfe PodRR Fre
R	WW+WR	PodWW Coe PodWR Fre
S	WW+RW	PodWW Rfe PodRW Coe
SB	WR+WR	PodWR Fre PodWR Fre

Isolated writes (and reads) originate from the combinations of communication relaxations, for instance [Fre,Rfe]. They appear as “W” (and R) in family names. For instance, “Rfe PodRR Fre Rfe PodRR Fre” contains two such isolated writes, its name is thus W+RR+W+RR and its nickname is, as we know, IRIW (Independent reads of independent writes). The test “Rfe PodRW Rfe PodRR Fre” contains one isolated write, as apparent from this diagram:

The family name is thus W+RW+RR and the nickname is WRC (Write to Read Causality).

12.1.2 Descriptive names for variants

Every family has a prototype, homonymous test where every thread code consists in one (for W or R) or two memory accesses to different locations (for WW, WR etc.). For instance, the MP test is derived from the cycle “PodWW Rfe PodRR Fre”. Variants are described by tags that illustrate the various program order relaxations: they appear after the family name, still with “+” as a separation. For instance the test derived from “LwSyncdWW Rfe DpAddrdR Fre” is named MP+lwsync+addr.

When all threads have the same tag tag, the test name is abbreviated as Family+tags. For instance, the test MP+lwsync+lwsync (“LwSyncdWW Rfe LwSyncdRR Fre”) is in fact MP+lwsyncs. Additionally, the tag pos (all po’s) is omitted, in order to yield family names for the prototype tests — cf. MP whose name would have been MP+pos otherwise.

For the sake of terseness, tags do not describe program order relaxations completely. For instance both DpAddrdR and DpAddrdW (address dependency to read and write, respectively) have the same tag, addr. It does not harm for simple tests, as the missing direction can be inferred from the family name. Consider for instance MP+lwsync+addr and LB+lwsync+addr.

The naming scheme extends to cycles with consecutive program order relaxations, by separating tags with “-” when they follow one another: for instance “LwSyncdWW Rfe DpAddrdR PodRR Fre” is named MP+lwsync+addr-po. Unfortunately, the current naming scheme falls short in supplying non-ambiguous names to all tests. For instance, “LwSyncdWW Rfe DpAddrdW PodWR Fre” is also named MP+lwsync+addr-po. In that situation tools will either fail or silently add a numeric suffix, depending on the boolean -addnum option.

% diycross7 -addnum false LwSyncdWW Rfe [DpAddrdR,PodRR],[DpAddrdW,PodWR] Fre
Fatal error: Duplicate name MP+lwsync+addr-po
% diycross7 -addnum true LwSyncdWW Rfe [DpAddrdR,PodRR],[DpAddrdW,PodWR] Fre
Generator produced 2 tests
% cat @all
# diycross7 -addnum true LwSyncdWW Rfe [DpAddrdR,PodRR],[DpAddrdW,PodWR] Fre
MP+lwsync+addr-po.litmus
MP+lwsync+addr-po001.litmus

As a result, we get the two tests: MP+lwsync+addr-po and MP+lwsync+addr-po001.

Future versions of diy7 may solve this issue in a more satisfying manner. At the moment, users are advised not to rely too much on the automatic naming scheme. Users may name tests in a non-ambiguous fashion by (1) specifying an explicit family name (-name name) and (2) selecting the numeric scheme (-num true):

% diycross7 -name MP+X -num true LwSyncdWW Rfe [DpAddrdR,PodRR],[DpAddrdW,PodWR] Fre
Generator produced 2 tests

The diycross7 generator outputs the same tests as above, with names MP+X000 and MP+X001.

12.2 Common options

All test generators accept the following documented command-line options:

-v: Be verbose, repeat to increase verbosity.
-version: Show version number and exit.
-arch (X86|PPC|ARM|AArch64|RISCV|LISA|C): Set architecture. Default is PPC.
-bell <name>: Read bell file <name>, implies -arch LISA.
-o <dest>: Redirect output to <dest>. This option applies when tools generate a set of tests and an index file @all, .i.e. in all situations except for diyone7 simplest operating mode.
If argument <dest> is an archive (extension .tar) or a compressed archive (extension .tgz), the tool builds an archive. Otherwise, <dest> is interpreted as the name of an existing directory. Default is “.”, that is tool output goes into the current directory.
-cycleonly <bool>: When true, do not generate the tests themseleves, output their cycles instead. Default is false.
-obs (accept|avoid|force|local): Management of observers, default is avoid. See Sec. 11.3.
-obstype (fenced|loop|straight): Style of observers, default is fenced. See Sec. 11.2.
-cond (cycle|uni|observe): Control final condition of tests, default is cycle. In mode cycle, the final condition identifies executions that correspond to the generating cycle. In mode unicond, the final condition identifies executions that are valid w.r.t. the uniproc model (see Sec. 14.2). In mode observe there is no final condition: the litmus7 and herd7 tools will simply list the final values of locations.
-unrollatomic <int>: Unroll atomic pairs (i.e. lr/sc idioms) according to the given number. Zero has the special meaning to unroll once and to test the success of teh store conditional in the test final condition. See section 7.5 for an example.
-ua <int>: Shorthand for -unrollatomic <int>.
-oneloc: Do not reject tests that acts on unique location. Notice that conditions on such tests usually are not significant and should probaly be edited by hand. See also options -cond unicond and -cond observe above.
-optcond: Optimise conditions by disregarding the values of loads that are neither the target of Rf, nor the source of Fr. This is the default.
-nooptcond: Do not optimise conditions.
-optcoherence: Optimise conditions assuming that the tested system (at least) follows the uniproc model (see Sec. 14.2).
-nooptcoherence: Do not optimise conditions assuming that the tested system (at least) follows the uniproc model. This is the default.
-neg <bool>: Negate final condition, default is false.
-c <bool>: Avoid equivalent cycles. Default is true. Setting -c true is intended for debug.

The naming of tests is controlled by the following options:

-name <name>: Use name for naming tests, the exact consequences depend on the generator. By default the generator has no name available.
-num <bool>: Use numeric names, i.e. from a base name <base> the generator will name tests as <base>000, <base>001 etc. The default depends upon the generator.
-addnum <bool>: If true, when faced with tests whose name <name> has already been given, use names <name>001, <name>002, etc. Otherwise fail in the same situation. The default depends upon the generator.
-fmt <n>: Size of numerical suffixes, default is 3.

The following option permits the listing for valid candidate relaxations, given the selected architecture.

-show (edges|fences|annotations): List all candidate relaxations, all fence prefixes or all annotations, respectively.

12.3 Usage of diyone7

The tool diyone7 has two operating modes. The selected mode depends on the presence of command-line arguments,

In the first operating mode, diyone7 takes a non-empty list of candidate relaxations as arguments and outputs a litmus test. Note that diyone7 may fail to produce the test, with a message that briefly details the failure.

% diyone7 Rfe Rfe PodRR
Test a [Rfe Rfe PodRR] failed:
Impossible direction PodRR Rfe

In this mode, -name <name> sets the name of the test to <name> and output it into file <name>.litmus. If absent, the test name is A and output goes to standard output.

Otherwise, i.e. when there are no command-line arguments, diyone7 reads the standard input and generates the tests described by the lines it reads. Each line starts with a test name name, followed by “:”, followed by a list of candidate relaxations RS. Then, diyone7 acts as if invoked as diyone opts -name name RS.

The tool diyone7 accepts the following documented option:

-norm: Normalise tests and give them normalised names. In the first operating mode (when a cycle is explicitly given) the test will be named with a family name and a descriptive name. In the second operating mode, numeric names are used, base being either given explicitly (with option -name <base>) or being a normalised family name.

12.4 Usage of diycross7

diycross7 produces several tests by “cross producing” lists of candidate relaxations given as arguments, see Sec 10. diycross7 also produces an index file @all that lists all produced litmus source files.

If option -name <name> is given, it sets the family name of generated tests, otherwise standard family names are used (cf. Sec. 12.1). By default descriptive names are used (i.e. -num false) and diycross7 will fail if two different tests have the same name (i.e. -addnum false):

% diycross7 PodWW Rfe [DpAddrdR,PodRR],[DpAddrdW,PodWR] Fre
Fatal error: Duplicate name MP+po+addr-po

Should this happen users can resort either to numeric names,

%diycross7 -num true PodWW Rfe [DpAddrdR,PodRR],[DpAddrdW,PodWR] Fre
Generator produced 2 tests
con% ls
@all  MP000.litmus  MP001.litmus

or to disambiguating numeric suffixes.

%diycross7 -addnum true PodWW Rfe [DpAddrdR,PodRR],[DpAddrdW,PodWR] Fre
Generator produced 2 tests
con% ls
@all  MP+po+addr-po001.litmus  MP+po+addr-po.litmus

12.5 Usage of diy7

As diycross7, diy7 produce several files, hence naming issues are critical. By default, diy7 uses family names and the numeric naming scheme (-num true). Users can specify a family name family for all tests with -name family, or attempt using the descriptive names of Sec 12.1 with -num false. Moreover, diy7 produces an index file @all that lists the file names of all tests produced.

The tool diy7 also accepts the following, additional, documented options.

-conf <file>: Read configuration file <file>. A configuration file consists in a list of options, one option per line. Lines introduced by # are comments and are thus ignored.
-size <n>: Set the maximal size of cycles. Default is 6.
-exact: Produce cycles of size exactly <n>, in place of size up to <n>.
-nprocs <n>: Reject tests with more than <n> threads. Default is 4.
-eprocs: Produce tests with exactly <n> threads, where <n> is set above.
-prefix <relax-list>: Specify a prefix for cycles: all cycles will (start with (in fact include) the given sequence of relaxations. The option can be repeated so as to specify several prefixes.
-ins <n>: Reject tests as soon as the code of one thread originates from <n> edges or more. Default is 4.
-relax <relax-list>: Add candidates to the relax set. This option can be repeated, with cumulative effect. The syntax of <relax-list> is a comma (or space) separated list of candidate relaxations.
-mix <bool>: Mix the elements of the relax list (see below), default false.
-maxrelax <n>: In mix mode, upper bound on the number of different candidate relaxations tested together. Default is 100
-minrelax <n>: In mix mode, lower bound on the number of different candidate relaxations tested together. Default is 1. Bounds are included: ‘-minrelax 3 -maxrelax 3‘ produces tests from cycles with exactly three different candidate relaxations. Notice that each of the different candidate relaxations can be repeated.
-safe <relax-list>: Add candidates to the safe set. This option can ne repeated, with cumulative effect.
-relaxlist <relax-list>: Alias for -safe <relax-list>.
-mode (default|sc|thin|uni|critical|free|ppo|total|mixed): Control generation of cycles, default is default. Those tags command the activation of some constraints over cycle generation, see below.
-cumul <bool>: Permit implicit cumulativity, i.e. authorise building up the sequence Rfe followed by a fence, or the reverse. Default is true.

The relax and safe sets command the generation of cycles as follows:

When the relax set is empty, cycles are built from the candidate relaxations of the safe set.
When the relax set is of size 1, cycles are built from its single element r and from the elements of the safe set. Additionally, the cycle produced contains r at least once.
When the relax set is of size n, with n > 1, the behaviour of diy7 depends on the mix mode:
1. By default (-mix false), diy7 generates n independent sets of cycles, each set being built with one relaxation from the relax set and all the relaxations in the safe set. In other words, diy7 on a relax set of size n behaves similarly to n runs of diy7 on each candidate relaxation in the set.
2. Otherwise (-mix true), diy7 generates cycles that contains at least one element from the relax set, including some cycles that contain different relaxations from the relax set. The cycles will contain at most m different elements from the relax set, where m is specified with option “-maxrelax m”.

Generally speaking, diy7 generates “some” cycles and does not generate “all” cycles up to a certain size. Modes constraint cycle generation.

Regardless of mode, diy7 is guaranteed not to generate redundant communication candidates in the following sense: let us call Com the union of Co, Rf and Fr (the e|i specification is irrelevant here). Co being transitive and by definition of Fr, one easily shows that the transitive closure Com+ of Com is the union of Com plus [Co,Rf] (Co followed by Rf) plus [Fr,Rf]. As a consequence, maximal sub-sequences of communication relaxations in diy7 cycles are limited to single relaxations (i.e. Co, Rf and Fr) and to the above mentioned two sequences (i.e. [Co,Rf] and [Fr,Rf]). For instance, [Co,Co] and [Fr,Co] should never appear in diy7 generated cycles. However, such sub-sequences can be generated on an individual basis with diyone7, see the example of W5 in Sec 11.3, or by introducing them as explicit composite candidates – see Section 9.2.

In default mode (default ), diy7 performs some optimisations. By optimisation, we mean that some sequences of two candidates are rejected. Generally, the sequence of two internal candidates is rejected. However some of those compositions are accepted: internal communication (i.e. Rfi, Coi and Fri) with Po or Dp candidates with specification “d”; Rfi with Po with specification “s”; and Rmw candidates.

The sc mode (-mode sc) is similar to the default mode. In some sense this mode is more tolerant and may accept more sequences. In particular, it attempts to sequence Po candidates. However, this more relaxed behaviour is balanced by a specific restriction: when some Po candidates are members of the safe set, then sequences that would otherwise be allowed are rejected when their write or read endpoints and their “s” or “d” status are the same as the ones of a safe Po candidate. For instance, if PosWR is part of the safe set, then the sequence of Rfi and PosRR is rejected.

In critical mode (-mode critical), cycles are strictly specified as follows:

Communication candidate relaxations sequences are limited to Rf,Fr,Co,[Co,Rf] and [Fr,Rf], as in all other modes.
No two internal⁵ candidates follow one another.
If the option -cumul false is specified, diy7 will not construct the sequence of Rfe followed by a fence (or A-cumulativity) candidate relaxation or of a fence candidate relaxation followed by Rfe (B-cumulativity).
Cycles that access one single memory location are rejected.
None of the rules above applies to the internal sequences of composite candidate relaxations. For instance, if [Rfi,PodRR] is given as a candidate relaxation, the sequence “Rfi,PodRR” appears in cycles.

In the absence of composite candidates, the cycles described above are the critical cycles of [5].

In free mode (-mode free), cycles are strictly specified as follows:

Communication candidate relaxations sequences are limited to Rf,Fr,Co,[Co,Rf] and [Fr,Rf]. However, arbitrary sequences of communication candidates are accepted when they are internal and external or external and internal.
Cycles that access one single memory location are rejected.

Finally, the uni mode enforces the following constraints on cycles:

Sequences of communication candidate relaxations are restricted in the same manner as for free mode (see above).
Sequences of Po candidate relaxation are rejected.

The transitive mode rejects internal sequences that match candidates from the safe set. More specifically the sequence is reduced according to some rules and rejected when the result is a member in the safe set. The effect is similar to the sc mode when Po candidates are safe. Reduction rules also consider fences: when a fence is present in the sequence it is compacted into a single fence candidate operating over the endpoints of the sequence, some reductions also apply to Dp candidates for instance a DpCtrl candidate followed by any (internal) candidate reduces to a DpCrl candidate operating over the endpoints of the sequence. This mode is experimental and is mostly left unspecified.

The remaining modes are unspecified.

12.6 Usage of readRelax7

readRelax7 is a simple tool to extract relevant information out of litmus7 run logs of tests produced by the diy7 generator. For a given run of a given litmus test, the relevant information is:

Whether the test yielded Ok or not,
An optional candidate relaxation, which is the one given as argument to diy7 option -relax at test build time, or none.
The safe set relevant to the given test, i.e. the safe candidate relaxations that appear in the tested cycle.

See Sec. 8.2 for an example.

The tool readRelax7 takes file names as arguments. If no argument is present, it reads a list of file names on standard input, one name per line.

13 Additional tools: extracting cycles and classification

When non-standard family names or numeric names are used, it proves convenient to rename tests with the standard naming scheme. We provide two tools to do so: mcycle7 that extracts cycles from litmus source files and classify7 that normalises and renames cycles.

For instance, one can use diy7 to generate all simple, critical, tests up to three threads for X86 with the following configuration file X.conf

-arch X86
-name X
-nprocs 3
-size 6
-safe Pod**,Fre,Rfe,Wse
-mode critical

% diy7 -conf X.conf
Generator produced 23 tests
% ls
@all         X003.litmus  X007.litmus  X011.litmus  X015.litmus  X019.litmus  X.conf
X000.litmus  X004.litmus  X008.litmus  X012.litmus  X016.litmus  X020.litmus
X001.litmus  X005.litmus  X009.litmus  X013.litmus  X017.litmus  X021.litmus
X002.litmus  X006.litmus  X010.litmus  X014.litmus  X018.litmus  X022.litmus

Cycles are extracted with mcycle7, which takes the index file @all as argument:

% mcycles7 @all
X000: Coe PodWR Fre PodWR Fre PodWW
X001: Rfe PodRR Fre PodWR Fre PodWW
X002: Coe PodWR Fre PodWW
X003: Coe PodWW Coe PodWR Fre PodWW
X004: Rfe PodRW Coe PodWR Fre PodWW
X005: Rfe PodRR Fre PodWW
X006: Coe PodWW Rfe PodRR Fre PodWW
X007: Rfe PodRW Rfe PodRR Fre PodWW
X008: Coe Rfe PodRR Fre PodWW
X009: Coe PodWW Coe PodWW
...

The output of mcycle7 can be piped into classify7 for family classification:

% mcycles7 @all | classify7 -arch X86
2+2W
  X009 -> 2+2W : PodWW Coe PodWW Coe
3.2W
  X010 -> 3.2W : PodWW Coe PodWW Coe PodWW Coe
3.LB
  X020 -> 3.LB : PodRW Rfe PodRW Rfe PodRW Rfe
3.SB
  X016 -> 3.SB : PodWR Fre PodWR Fre PodWR Fre
ISA2
  X007 -> ISA2 : PodWW Rfe PodRW Rfe PodRR Fre
LB
  X019 -> LB : PodRW Rfe PodRW Rfe
MP
  X005 -> MP : PodWW Rfe PodRR Fre
...

Notice that classify7 accepts the arch option, as it needs to parse cycles.

Finally, one can normalise tests, using normalised names by piping mcycle7 output into diyone7 with options -norm -num false:

% mkdir src
% mcycles7 @all | diyone7 -arch X86 -norm -num false -o src
Generator produced 23 tests
% ls src
2+2W.litmus  @all         R.litmus    WRC.litmus     WRW+WR.litmus  Z6.2.litmus
3.2W.litmus  ISA2.litmus  RWC.litmus  WRR+2W.litmus  WWC.litmus     Z6.3.litmus
3.LB.litmus  LB.litmus    SB.litmus   WRW+2W.litmus  Z6.0.litmus    Z6.4.litmus
3.SB.litmus  MP.litmus    S.litmus    W+RWC.litmus   Z6.1.litmus    Z6.5.litmus

Alternatively, one may instruct classify7 to produce output for diyone7. In that case one should pass option -diyone to classify7 so as to instruct it to produce output that is parsable by diyone7:

% rm -rf src && mkdir src
% mcycles7 @all | classify7 -arch X86 -diyone | diyone7 -arch X86 -o src
Generator produced 23 tests
% ls src
2+2W.litmus  @all         R.litmus    WRC.litmus     WRW+WR.litmus  Z6.2.litmus
3.2W.litmus  ISA2.litmus  RWC.litmus  WRR+2W.litmus  WWC.litmus     Z6.3.litmus
3.LB.litmus  LB.litmus    SB.litmus   WRW+2W.litmus  Z6.0.litmus    Z6.4.litmus
3.SB.litmus  MP.litmus    S.litmus    W+RWC.litmus   Z6.1.litmus    Z6.5.litmus

13.1 Usage of mcycle7

The tool mcycle7 has no options and takes litmus source files or index files as arguments. It outputs a list of lines to standard output. Each line starts with a test name, suffixed by “:”, then the cycle of the named test. Notice that this format is the input format to diyone7 in its second operating mode — see Sec. 12.3.

It is important to notice that, for mcycle7 to extract cycles, those must be present as meta-information in source files. In practice, this means that mcycle7 operates normally on sources produced by diyone7, diycross7 and diy7. Moreover only one instance of a given cycle will be output.

13.2 Usageof classify7

The tool classify7 reads its standard input, interpreting is as a list of cycles in the output format of mcycle7. It normalises and classifies those cycles. The tool classify7 accepts the following documented options:

-arch (X86|PPC|ARM|AArch64|RISCV|LISA|C): Set architecture. Default is PPC.
-bell <name>: Read bell file <name>, implies -arch LISA.
-lowercase <bool>: Use lowercase familly names, default false.
-u: Instruct classify7 to fail when two tests have the same normalised name. Otherwise classify7 will output one line per test, regardless of duplicate names.
-diyone: Output a normalised list of names and cycles, which is legal input for diyone7.

13.3 Usage of norm7

The tool norm7 is a simplified classify7 tool: it takes a cycle on the command line and outputs a normalized cycles, with name. This output can serve as standard input to diyone7.

-arch (X86|PPC|ARM|AArch64|RISCV|LISA|C): Set architecture. Default is PPC.
-bell <name>: Read bell file <name>, implies -arch LISA.
-lowercase <bool>: Use lowercase familly names, default false.

For instance, here is how one can find the systematic name of a variation on the store-buffer “SB” test:

% norm7 -arch AArch64 Rfi DMB.LDdRR Fre Rfi DpAddrdR Fre
SB+rfi-dmb.ld+rfi-addr: Rfi DMB.LDdRR Fre Rfi DpAddrdR Fre

Notice that the option -arch is mandatory (or defaults to PPC), as candidate relaxations depend on the selected architecture, as illustrated by the fence candidate DMB.LD… above.

5: That is, the source and target accesses are by the same processor.

Arch	Fence prefixes

X86	MFence
PPC	Sync LwSync Eieio ISync
ARM	DSB DMB DMB.ST DSB.ST ISB
AArch64¹	DMB.SY DMB.ST DMB.LD ISB
MIPS	Fence
RISCV	Fence.rw.rw Fence.rw.r Fence.rw.w Fence.r.rw Fence.r.r Fence.r.w Fence.w.rw Fence.w.r Fence.w.w
C	FenceSc FenceAR FenceAcq FenceRel FenceCons FenceRlx
LISA²	FenceAn…

Arch	Pseudo-candidates

X86	A
PPC	R A
ARM	R A
AArch64¹	A Q L X XL XA XAL
MIPS	R A
RISCV²	XARAR XARRl XARAq XARP XRlAR XRlRl XRlAq XRlP XAqAR XAqRl XAqAq XAqP XPAR XPRl XPAq X AR Rl Aq
C	Con Rlx Sc AR Rel Acq
LISA³	An…

Part II Generating tests