Arm Memory Effects

Arm Architecture Formal Team

1 Overall

The Armv8 memory model gives ordering requirements over memory accesses, or more precisely, the Effects generated by AArch64 instructions. The External visibility requirement intuitively states which local orderings must be respected on a given processor, or thread.

Formally, those constraints forbid certain shapes in program executions. For example the External visibility requirement forbids cycles in the Ordered-before relation, which consists of local orderings (Locally-ordered-before relation) and interactions between threads (Observed-by relation).

Which Effects an instruction can generate, or in other words, the semantics of that instruction, depends on the level of abstraction considered. For example, at user-level, we only need to consider that a load instruction generates a memory read without any consideration related to page table or instruction fetch.

For each level of abstraction we will have a look at which Effects can be generated for a small number of instructions. Furthermore we will see if there are any Intrinsic dependencies between those Effects, or in other words whether the Effects of a given instruction are required to be ordered or not.

Note that determining which Effects constitute the semantics of a given instruction, as well as the Intrinsic dependencies or lack thereof between them, is out of scope of this tutorial. Here we will assume those semantics are given as shown.

We will also, for each level of abstraction, examine a number of litmus tests which illustrate the level under scrutiny.

2 User-level

2.1 LDR instruction semantics

Consider the test LDR:

AArch64 LDR
{
0:X2=x;
0:X1=0;
int64_t x=1;
}
 P0         ;
LDR X1,[X2] ;
exists(0:X1=1)

At user-level, the Effects generated by the LDR instruction are as follows:

In words, at user-level, the instruction LDR X1,[X2] generates:

a Register Read Effect c: R0:X2q=x, which reads the contents of register X2 on P0 and finds the address of memory location x;
a Memory Read Effect a: R[x]q=1, which reads the contents of memory location x and finds the value 1;
a Register Write Effect d: W0:X1q=1, which writes the value obtained from Memory Read Effect a, viz, 1, into register X1 on P0.

Furthermore, those Effects are ordered with respect to one another, as follows:

there is an Intrinsic data dependency (arrow iico_data) from Register Read Effect c: R0:X2q=x to Memory Read Effect a: R[x]q=1: this is because the Effect a needs the data obtained by the Effect c (viz, memory location x);
there is an Intrinsic data dependency (arrow iico_data) from Memory Read Effect a: R[x]q=1 to Register Write Effect d: W0:X1q=1: this is because the Effect d needs the data obtained by Effect a (viz, the value 1).

This can be seen using the following command (Unix)

% herd7 -conf userlevel.cfg -showevents all -show prop -gv LDR.litmus

Or (MacOS):

% herd7 -conf userlevel.cfg -showevents all -show prop -preview LDR.litmus

2.2 STR instruction semantics

Consider the test STR:

AArch64 STR
{
0:X1=0;
0:X2=x;
}
P0         ;
MOV X1,#1  ;
STR X1,[X2];
exists(x=1)

At user-level, the Effects generated by the STR instruction are as follows:

In words, at user-level, the instruction STR X1,[X2] generates:

a Register Read Effect d: R0:X2q=x, which reads the contents of register X2 on P0 and finds the address of memory location x;
a Register Read Effect e: R0:X1q=1, which reads the contents of register X1 on P0 and finds the value 1;
a Memory Write Effect a: W[x]q=1, which writes the value obtained by Effect e into the address obtained by Effect d, viz, writes the value 1 into memory location x.

Furthermore, those Effects are ordered with respect to one another, as follows:

there is an Intrinsic data dependency (arrow iico_data) from Register Read Effect c: R0:X2q=x to Memory Write Effect a: W[x]q=1: this is because the Effect a needs the data obtained by the Effect d (viz, memory location x);
there is an Intrinsic data dependency (arrow iico_data) from Register Read Effect e: R0:X1q=1 to Memory Write Effect a: W[x]q=1: this is because the Effect a needs the data obtained by the Effect e (viz, the value 1).

Notably, there is no required ordering between Register Read Effects d and e and they can therefore occur in parallel.

This can be seen using the following command (Unix)

% herd7 -conf userlevel.cfg -showevents all -show prop -gv STR.litmus

Or (MacOS):

% herd7 -conf userlevel.cfg -showevents all -show prop -preview STR.litmus

2.3 Applying the user-level Arm memory model to the MP test

Consider the test MP:

AArch64 MP
{
int x=0;
int y=0;
0:X1=x; 0:X3=y;
1:X0=y; 1:X2=x;
}
 P0          | P1          ;
 MOV W0,#1   | LDR W1,[X0] ;
 STR W0,[X1] | LDR W3,[X2] ;
 MOV W2,#1   |             ;
 STR W2,[X3] |             ;
exists (1:X1=1 /\ 1:X3=0)

The Arm memory model allows the MP test; this can be seen using herd7 as follows:

% herd7 MP.litmus

Here are all possible executions for MP:

To forbid this behaviour, we can use barriers as in the following test MP+DMB.ST+DMB.LD:

AArch64 MP+DMB.ST+DMB.LD
{
0:X1=x; 0:X3=y;
1:X1=x; 1:X3=y;
}
P0           | P1          ;
MOV W0,#1    | LDR W0,[X3] ;
STR W0,[X1]  | DMB LD      ;
DMB ST       | LDR W2,[X1] ;
MOV W2,#1    |             ;
STR W2,[X3]  |             ;
exists(1:X0=1 /\ 1:X2=0)

Try running herd7 again to see what it says.

We can also use Acquire and Release instructions, like in the following test MP+rel+acq:

AArch64 MP+rel+acq
{
0:X1=x; 0:X3=y;
1:X0=y; 1:X2=x;
}
 P0           | P1           ;
 MOV W0,#1    | LDAR W1,[X0] ;
 STR W0,[X1]  | LDR W3,[X2]  ;
 MOV W2,#1    |              ;
 STLR W2,[X3] |              ;
exists (1:X1=1 /\ 1:X3=0)

Try running herd7 on this test too to see what it says.

3 With Instruction Fetch

Instructions, prior to their execution, get fetched. This constitutes accesses to memory, which are subject to relaxed ordering guarantees due to various buffers (notably, instruction cache) that are not necessarily coherent with the rest of the memory. The relaxed ordering may become apparent when litmus tests attempt modifying instructions they execute.

3.1 LDR instruction semantics

Consider the following LDR-ifetch test illustrating the instruction semantics for LDR.

AArch64 LDR-ifetch
Variant=ifetch
{
0:X1=x;
0:X9=P0:L0;
}
P0           ;
L0:          ;
 LDR W0,[X1] ;

This test simply executes the LDR instruction, and it is interested in accounting for the fetching of the LDR. This is made possible through two deliberate actions. Firstly, the enablement of the "ifetch" extension of the memory model and tools, as indicated by Variant=ifetch. Secondly, the presence of a value of the address of the LDR, as achieved by using a label L0 and initialising 0:X9=P0:L0.

This test has one execution, which we can display using the following command (Unix):

% herd7 -conf ifetch.cfg showevents all -show all -gv LDR-ifetch.litmus

(macOS):

% herd7 -conf ifetch.cfg showevents all -show all -preview LDR-ifetch.litmus

On the execution graph, the Effect (a) is Implicit Instruction Read Effect representing an instruction fetch. The Effect (e) is a Branching Effect representing the decision made at the decode stage: whether the decoded instruction is valid or not. The other effects represent the normal execution of the LDR.

3.2 STR instruction semantics

Consider the following STR-ifetch test illustrating the instruction semantics for STR.

AArch64 STR-ifetch
Variant=ifetch
{
0:X1=x;
0:X9=P0:L0;
}
P0           ;
L0:          ;
 STR W0,[X1] ;

This test simply executes the STR instruction, and it is interested in accounting for the fetching of the STR. This is made possible through two deliberate actions. Firstly, the enablement of the "ifetch" extension of the memory model and tools, as indicated by Variant=ifetch. Secondly, the presence of a value of the address of the STR, as achieved by using a label L0 and initialising 0:X9=P0:L0.

This test has one execution, which we can display using the following command (Unix):

% herd7 -conf ifetch.cfg showevents all -show all -gv STR-ifetch.litmus

(macOS):

% herd7 -conf ifetch.cfg showevents all -show all -preview STR-ifetch.litmus

3.3 A simple self-modifying code example

Consider the following SM+po test, illustrating simple self-modifying code.

AArch64 SM+po
Variant=ifetch
{ 0:X0=instr:"NOP";
  0:X1=P0:Lself00;
}
P0           ;
  STR W0,[X1];
 Lself00:    ;
  B L00      ;
  MOV X9,#1  ;
 L00:        ;
exists 0:X9=0

The initial state of this test stores a NOP instruction in the register X0 and the address of the label Lself00 in X1. The code executes STR W0,[X1] to overwrite the immediate branch at label Lself00 with NOP. Notice the pattern we use in litmus tests often: when the branch is not overwritten, it skips over MOV X9,#1, but if it is overwritten with NOP, the MOV executes. Recall that registers not initialized in the precondition hold zeroes. Therefore, since the register X9 is initialised to 0, at the end of this litmus test the value held in X9 indicates which instruction ended up being fetched and executed: the branch or NOP.

The exists clause of the test asks if it is possible for B L00 to execute despite the preceding store.

This test has two executions. The first can be produced as follows (Unix):

% herd7 -conf ifetch.cfg showevents all -show neg -gv SM+po.litmus

(macOS):

% herd7 -conf ifetch.cfg showevents all -show neg -preview SM+po.litmus

This first execution shows the expected result: fetching and executing the most recently stored instruction, NOP.

The second execution can be produced as follows (Unix):

% herd7 -conf ifetch.cfg showevents all -show prop -gv SM+po.litmus

(macOS):

% herd7 -conf ifetch.cfg -show prop -preview SM+po.litmus

This second execution shows the result satisfying the exists clause: fetching and executing B L00.

Outside of litmus-test bodies, we use the syntax with offsets: B .+12 is a branch to the third next instruction.

3.4 A self-modifying code example with cache maintenance

Cache maintenance sequence is required to ensure the most recent instruction value is read. This is illustrated by the following SM+cachesyncisb test.

AArch64 SM+cachesyncisb
Variant=ifetch
{
  0:X0=instr:"NOP";
  0:X1=P1:Lself00;
}
P0            ;
  STR W0,[X1] ;
  DC CVAU,X1  ;
  DSB ISH     ;
  IC IVAU,X1  ;
  DSB ISH     ;
  ISB         ;
Lself00:      ;
  B L00       ;
  MOV X9,#1   ;
L00:          ;
exists 0:X9=0

The role of the sequence of DC CVAU,X1, DSB SY, IC IVAU,X1 and DSB SY is to ensure STR W0,[X1] is visible to the Instruction Fetch observer. In other words, those instructions clean the relevant data-cache entry and invalidate the relevant instruction-cache entry, so that the subsequent instruction fetch had to observe the most recent instruction.

The role of DSB SY and ISB together is to ensure the aforementioned visibility takes place before the instruction fetch at label Lself00: generally speaking, in absence of ISB memory accesses due to instruction fetches may execute ahead of program order.

The exists clause of the test is now not satisfied. To produce a (now forbidden) execution nevertheless, we can execute:

% herd7 -conf ifetch.cfg -show prop -skipcheck external -gv SM+cachesyncisb.litmus

(macOS):

% herd7 -conf ifetch.cfg -show prop -skipcheck external -preview SM+cachesyncisb.litmus

3.5 A self-modifying code example with an Instruction Abort

Previously, we mentioned that Branching Effects associated with instruction fetch may result in exceptional behaviour. The following SM.udf+cachesyncisb test illustrates this.

AArch64 SM.udf+cachesyncisb
Variant=ifetch,fatal
{ 0:X0=instr:"UDF #0";
  0:X1=P0:Lself00;
}
P0            ;
  STR W0,[X1] ;
  DC CVAU,X1  ;
  DSB ISH     ;
  IC IVAU,X1  ;
  DSB ISH     ;     
  ISB         ;
Lself00:      ;
  B L00       ;
  MOV X9,#1   ;
L00:          ;
exists 0:X9=0

Unlike the previous tests, this one overwrites the immediate branch at the label Lself00 with an undefined instruction. As a result, the only permitted execution of the test ends with a fault, an Instruction Abort.

% herd7 -conf ifetch.cfg -show prop -gv SM.udf+cachesyncisb.litmus

(macOS):

% herd7 -conf ifetch.cfg -show prop -preview SM.udf+cachesyncisb.litmus

3.6 Repeated fetching

Consider the coFF test. There P0 overwrites an instruction on P1. In turn, P1 executes the instruction twice—however, is it the same instruction both times?

AArch64 coFF
{
  0:X0=NOP; 0:X1=P1:L0;
}
P0          | P1         ;
STR W0,[X1] |  BL L0     ;
            |  MOV W1,W9 ;
            |  BL L0     ;
            |  MOV W2,W9 ;
            |  B END     ;
            |L0:         ;
            |  B L1      ;
            |  MOV W9,#1 ;
            |  RET       ;
            |L1:         ;
            |  MOV W9,#0 ;
            |  RET       ;
            |END:        ;
exists (1:X1=1 /\ 1:X2=0)

The code on P1 is organised using a subroutine starting at label L0, which is also the label of the instruction that might be modified. If the instruction is not overwritten, P1 jumps again to label L1, where it puts 0 into the register W9 and returns from the subroutine; otherwise it skips to the next instruction, putting 1 into the register W9 and returning from the subroutine. Overall, P1 uses a branch-and-link instruction BL to jump to L0 twice, each time recording which instruction was executed at L0.

The exists clause of this test asks if it is possible to have the final state where the up-to-date instruction was fetched at L0 the first time, but not the second time. Such behaviour is allowed by the Arm Memory Model. To produce the forbidden execution nevertheless, we can execute on Unix:

% herd7 -conf ifetch.cfg -show prop -gv coFF.litmus

or on macOS:

% herd7 -conf ifetch.cfg -show prop -preview coFF.litmus

3.7 Message passing with instruction fetch

Consider the following MP.FR+dmb.st+po test, which is a variation of the message passing shape.

AArch64 MP.FR+dmb.st+po
{
  0:X0=NOP;
  0:X1=P1:Lself; 1:X1=P1:Lself;
  0:X3=x; 1:X3=x;
}
P0          | P1           ;
MOV W2,#1   | Lself:       ;
STR W2,[X3] |  B L00       ;
DMB ST      |  MOV W9,#1   ;
STR W0,[X1] | L00:         ;
            |  LDR W2,[X3] ;
exists 1:X9=1 /\ 1:X2=0

In this test, P0 first prepares a payload by storing 1 into x and then, after a DMB ST barrier, overwrites an instruction on P1. In turn, P1 fetches the instruction that may have been modified, and then proceeds to reading the payload x.

The exists clause of this test asks if it is possible to have the final state where the up-to-date instruction is fetched and out-of-date payload is read.

The Arm Memory Model forbids executions satisfying the exists clause. To produce the forbidden execution nevertheless, we can execute on Unix:

% herd7 -conf ifetch.cfg -show prop -gv -skipcheck external MP.FR+dmb.st+po.litmus

or on macOS:

% herd7 -conf ifetch.cfg -show prop -preview -skipcheck external MP.FR+dmb.st+po.litmus

An important ordering principle illustated by this test has to do with the ordering guaranteed for P1: the Implicit Instruction Read Effect (TODO) is ordered-before the Explicit Read Effect (TODO). Intuitively, given two instructions in program order, the former is always fetched before the latter is executed.

Consider now a different message-passing test, MP.RF+cachesync+dsb.sy-isb.

AArch64 MP.RF+cachesync+dsb.sy-isb
{
  0:X0=NOP;         0:X1=P1:L0;
  0:X2=1; 1:X2=0;   0:X3=z; 1:X3=z;
}
P0          | P1           ;
STR W0,[X1] |  LDR W2,[X3] ;
DC CVAU,X1  |  DSB SY     ;
DSB SY      |  ISB         ;
IC IVAU,X1  | L0:         ;
DSB SY      |  B L1       ;
STR W2,[X3] |  MOV W9,#1   ;
            | L1:         ;
exists 1:X2=1 /\ 1:X9=0

In this test, P0 first prepares a payload by overwriting an instruction on P1 and then, after cache maintenance with appropriate barriers, sets a flag. In turn, P1 reads the flag and, after a DSB SY and ISB barriers, fetches the instruction that may have been modified.

The exists clause of this test asks if it is possible to have the final state where the up-to-date flag is read and out-of-date instruction is fetched.

The Arm Memory Model forbids executions satisfying the exists clause. To produce the forbidden execution nevertheless, we can execute on Unix:

% herd7 -conf ifetch.cfg -show prop -gv -skipcheck external MP.RF+cachesync+dsb.sy-isb.litmus -doshow CSE-ob

or on macOS:

% herd7 -conf ifetch.cfg -show prop -preview -skipcheck external MP.RF+cachesync+dsb.sy-isb.litmus -doshow CSE-ob

The DSB SY and ISB barriers on P1 are essential, as they ensure that Effect (c) is ordered-before Effect (d).

4 With Memory Tagging

The Memory Tagging Extension (MTE) adds a new memory type, Normal Tagged Memory, to the Arm Architecture. Loads and stores to Normal Tagged Memory are checked; a tag which is part of the address of the access is compared against an associated allocation tag which is stored in memory. A mismatch between the tag in the address and the allocation tag can be configured to cause a synchronous exception or to be asynchronously reported. MTE is designed to assist the developer in detecting memory safety violations.

In a system with synchronous MTE, it is possible to precisely determine which load or store instruction caused the tag mismatch. When the system is configured to report asynchronously, it is only possible to isolate the mismatch to a particular thread of execution.

MTE adds new instructions to the Arm Architecture. For the purposes of this tutorial we will use:

LDG: Loads the allocation tag associated with an address x.
STG: Stores the specified tag to allocation tag associated with an address x.

In herdtools7, tags are represented as colours (e.g., green, red, yellow, …). For example, x:green is an address of the symbolic location x with the symbolic address tag green. By default, all memory locations are assumed to be associated with the allocation tag green, unless they are explicit initialized or their allocation tag is modified.

4.1 LDR instruction semantics

4.1.1 In synchronous MTE

Consider the test LDRgreen-sync where P0 performs a LDR with the address tag green (tag match):

AArch64 LDRgreen-sync
Variant=memtag,sync
{
 [x]=1;
 0:X0=x:green;
}
 P0          ;
L0:          ;
 LDR W1,[X0] ;
exists(0:X1=1)

The Effects generated by the LDR instruction are as follows:

This can be seen using the following command (Unix)

% herd7 -conf mte.cfg -show prop -gv LDR-green-sync.litmus

Or (MacOS):

% herd7 -conf mte.cfg -show prop -preview LDRgreen-sync.litmus

Consider the test LDRred-sync where P0 performs a LDR with the address tag red (tag match):

AArch64 LDRred-sync
Variant=memtag,sync
{
 [x]=1;
 0:X0=x:red;
}
 P0          ;
L0:          ;
 LDR W1,[X0] ;
exists(fault(P0:L0,x,TagCheck) /\ 0:X1=0)

The Effects generated by the LDR instruction are as follows:

This can be seen using the following command (Unix)

% herd7 -conf mte.cfg -show prop prop -gv LDR-red-sync.litmus

Or (MacOS):

% herd7 -conf mte.cfg -show prop -preview LDRred-sync.litmus

4.1.2 In asynchronous MTE

Consider the test LDRgreen-async where P0 performs a LDR with the address tag green (tag match):

AArch64 LDRgreen-async
Variant=memtag,async
{
 [x]=1;
 0:X0=x:green;
}
 P0          ;
L0:          ;
 LDR W1,[X0] ;
exists(0:X1=1)

The Effects generated by the LDR instruction are as follows:

This can be seen using the following command (Unix)

% herd7 -conf mte.cfg -show prop -gv LDR-green-async.litmus

Or (MacOS):

% herd7 -conf mte.cfg -show prop -preview LDRgreen-async.litmus

Consider the test LDRred-async where P0 performs a LDR with the address tag red (tag match):

AArch64 LDRred-async
Variant=memtag,async
{
 [x]=1;
 0:X0=x:red;
}
 P0          ;
L0:          ;
 LDR W1,[X0] ;
exists(0:TFSR_ELx=1 /\ 0:X1=1)

The Effects generated by the LDR instruction are as follows:

This can be seen using the following command (Unix)

% herd7 -conf mte.cfg -show prop -gv LDR-red-async.litmus

Or (MacOS):

% herd7 -conf mte.cfg -show prop -preview LDRred-async.litmus

4.2 STR instruction semantics

The instruction semantics of a STR are very similar to the instruction semantics of a LDR.

4.2.1 In synchronous MTE

Consider the test STRgreen-sync where P0 performs a STR with the address tag green (tag match):

AArch64 STRgreen-sync
Variant=memtag,sync
{
 0:X0=x:green;
 0:X1=1;
}
 P0          ;
L0:          ;
 STR W1,[X0] ;
exists([x]=1)

The Effects generated by the STR instruction are as follows:

This can be seen using the following command (Unix)

% herd7 -conf mte.cfg -show prop -gv STR-green-sync.litmus

Or (MacOS):

% herd7 -conf mte.cfg -show prop -preview STRgreen-sync.litmus

Consider the test STRred-sync where P0 performs a STR with the address tag red (tag match):

AArch64 STRred-sync
Variant=memtag,sync
{
 0:X0=x:red;
 0:X1=1;
}
 P0          ;
L0:          ;
 STR W1,[X0] ;
exists(fault(P0:L0,x,TagCheck) /\ [x]=0)

The Effects generated by the STR instruction are as follows:

This can be seen using the following command (Unix)

% herd7 -conf mte.cfg -show prop prop -gv STR-red-sync.litmus

Or (MacOS):

% herd7 -conf mte.cfg -show prop -preview STRred-sync.litmus

4.2.2 In asynchronous MTE

Consider the test STRgreen-async where P0 performs a STR with the address tag green (tag match):

AArch64 STRgreen-async
Variant=memtag,async
{
 0:X0=x:green;
 0:X1=1;
}
 P0          ;
L0:          ;
 STR W1,[X0] ;
exists([x]=1)

The Effects generated by the STR instruction are as follows:

This can be seen using the following command (Unix)

% herd7 -conf mte.cfg -show prop -gv STR-green-async.litmus

Or (MacOS):

% herd7 -conf mte.cfg -show prop -preview STRgreen-async.litmus

Consider the test STRred-async where P0 performs a STR with the address tag red (tag match):

AArch64 STRred-async
Variant=memtag,async
{
 0:X0=x:red;
 0:X1=1;
}
 P0          ;
L0:          ;
 STR W1,[X0] ;
exists(0:TFSR_ELx=1 /\ [x]=1)

The Effects generated by the STR instruction are as follows:

This can be seen using the following command (Unix)

% herd7 -conf mte.cfg -show prop -gv STR-red-async.litmus

Or (MacOS):

% herd7 -conf mte.cfg -show prop -preview STRred-async.litmus

4.3 A free-after-use example with asynchronous MTE

Consider the test LB+TFSR_ELx-dmb.ld+dmb.ld-stg:

AArch64 LB+TFSR_ELx-dmb.ld+dmb.ld-stg
(* Buffer use and free *)
Variant=memtag,async
{
 0:X1=x:green; 1:X1=x:green;
 0:X2=y:green; 1:X2=y:green;
 1:X3=x:red;
}
 P0           | P1          ;
 LDR W3,[X1]  | LDR W4,[X2] ;
 DMB LD       | DMB LD      ;
 MOV W0,#1    | STG X3,[X1] ;
 STR W0,[X2]  |             ;
exists(0:TFSR_ELx=1 /\ 1:X4=1)

The behaviour illustated in LB+TFSR_ELx-dmb.ld+dmb.ld-stg is forbidden when run in a system that implements asynchronous MTE. This is a case where a memory location x is used by P0 (loads from x) which then signals to P1 that it is done with its use of x. P1 can then free the memory location x. As part of the free operation, it changes its allocation tag. In such an execution, it is architecturally forbidden for the LDR to run into a tag mismatch and set the system register TFSR_ELx which indicates that the execution of P0 run into a tag mismatch.

% herd7 LB+TFSR_ELx-dmb.ld+dmb.ld-stg.litmus

An illustration of the forbidden execution is shown here:

4.4 A use-after-free example with synchronous MTE

Consider the test MP+stg-dmb.st+dmb.ld:

AArch64 MP+stg-dmb.st+dmb.ld
(* Buffer use after free *)
Variant=memtag,sync
{
 0:X1=x:green; 1:X1=x:green;
 0:X0=x:red;
 0:X3=y:green; 1:X3=y:green;
}
 P0           | P1          ;
 STG X0,[X1]  | LDR W2,[X3] ;
 DMB ST       | DMB LD      ;
 MOV W2,#1    |L1:          ;
 STR W2,[X3]  | LDR W0,[X1] ;
exists(1:X2=1 /\ ~fault(P1:L1,x,TagCheck))

The behaviour illustated in MP+stg-dmb.st+dmb.ld.litmus is forbidden when run in a system that implements synchronous MTE. This is a case where P0 free/deallocates a buffer; in this example represented by the memory location x. As part of this process, it changes the allocation tag of x to red. In doing so, it expects that all futures uses of x with the old address tag green will raise a TagCheck fault. P1 performs a load from location x using the tag green. If the execution of the load from x in P1 is ordered after the change of tag by P0 then it will cause a TagCheck fault.

% herd7 MP+stg-dmb.st+dmb.ld.litmus

An illustration of the forbidden execution is shown here:

5 With Page Table Entries

In the previous examples showed, we assumed that x and y were physical memory locations.

But usually one interacts with virtual memory addresses rather than physical memory directly.

5.1 Virtual and Physical Memory

Virtual memory gives an abstract view of the physical memory resources, which amongst other benefits:

frees software developers from having to manage a shared memory space, and
affords them more memory than might be physically available.

The mapping from virtual addresses to physical locations resides in a collection of page tables.

Page tables are used to translate the virtual addresses seen by software into physical locations used by the hardware.

Each page table entry (PTE) holds a flag called the valid bit, indicating whether the entry is valid or not:

If it is valid, reading the PTE gives the physical location corresponding to a virtual address;
If it is not valid, accessing a virtual address via that PTE raises a fault.

5.2 LDR instruction semantics

5.2.1 When the valid bit is 1

Consider the test LDRv1:

AArch64 LDRv1
Variant=vmsa
{
0:X2=x;
0:X1=0;
int64_t x=1;
pte_x=(valid:1);
}
 P0         ;
LDR X1,[X2] ;
exists(0:X1=1)

At this level, the Effects generated by the LDR instruction are as follows:

This can be seen using the following command (Unix)

% herd7 -conf vmsa.cfg -showevents all -show prop -gv LDRv1.litmus

Or (MacOS):

% herd7 -conf vmsa.cfg -showevents all -show prop -preview LDRv1.litmus

5.2.2 When the valid bit is 0

Consider the test LDRv0:

AArch64 LDRv0
Variant=vmsa,fatal
{
0:X2=x;
0:X1=0;
int64_t x=1;
PTE(x)=(valid:0);
}
P0          ;
LDR X1,[X2] ;
exists(0:X1=0)

At this level, the Effects generated by the LDR instruction are as follows:

This can be seen using the following command (Unix)

% herd7 -conf vmsa.cfg -showevents all -show prop -gv LDRv0.litmus

Or (MacOS):

% herd7 -conf vmsa.cfg -showevents all -show prop -preview LDRv0.litmus

5.3 STR instruction semantics

5.3.1 When the valid bit is 1

Consider the test STRv1:

AArch64 STRv1
Variant=vmsa
{
0:X1=0;
0:X2=x;
pte_x=(valid:1);
}
P0         ;
MOV X1,#1  ;
STR X1,[X2];
exists(x=1)

At this level, the Effects generated by the STR instruction are as follows:

This can be seen using the following command (Unix)

% herd7 -conf vmsa.cfg -showevents all -show prop -gv STRv1.litmus

Or (MacOS):

% herd7 -conf vmsa.cfg -showevents all -show prop -preview STRv1.litmus

5.3.2 When the valid bit is 0

Consider the test STRv0:

AArch64 STRv0
Variant=vmsa,fatal
{
0:X1=0;
0:X2=x;
pte_x=(valid:0);
}
P0         ;
MOV X1,#1  ;
STR X1,[X2];
exists x=0

At this level, the Effects generated by the STR instruction are as follows:

This can be seen using the following command (Unix)

% herd7 -conf vmsa.cfg -showevents all -show prop -gv STRv0.litmus

Or (MacOS):

% herd7 -conf vmsa.cfg -showevents all -show prop -preview STRv0.litmus

5.4 Changing the output address of PTE(x)

Consider the test MP-pte+dmb.st+dmb.ld:

AArch64 MP-pte+dmb.st+dmb.ld
Variant=vmsa,handled
{
 [PTE(x)]=(oa:PA(x),valid:1);
 [x]=0; [y]=1;
 0:X2=(oa:PA(y),valid:1); 0:X1=PTE(x);
 0:X7=x; 1:X7=x;
 0:X5=z; 1:X5=z;
}
P0              | P1           ;
STR X2,[X1]     | LDR W4,[X5]  ;
DMB ST          | DMB LD       ;
MOV W4,#1       | LDR W6,[X7]  ;
STR W4,[X5]     |              ;
exists(1:X4=1 /\ 1:X6=0)

The behaviour illustated in MP-pte+dmb.st+dmb.ld is architecturally allowed. P0 changes the output address of PTE(x) from PA(x) to PA(y), expecting that future uses of the virtual address x will be using the physical location PA(y). However, due the caching effects of the Translation Lookaside Buffer (TLB), the change of the oa field of the PTE(x) might not become visible to future uses of the virtual address of x.

P1 performs a load to the symbolic virtual address x. If the load is ordered after P0 set the oa of PTE(x), then it has to read from the new physical location PA(y).

The Arm memory model forbids the MP-pte+dmb.st+dmb.ld test; this can be seen using herd7 as follows:

% herd7 MP-pte+dmb.st+dmb.ld.litmus

Here is the execution of interest:

5.5 Changing the output address of PTE(x) using break-before-make

Consider the test MP+bbm-dmb.st+dmb.ld:

AArch64 MP+bbm-dmb.st+dmb.ld
Variant=vmsa,handled
{
 [PTE(x)]=(oa:PA(x),valid:1);
 [x]=0; [y]=1;
 0:X0=(valid:0); 0:X1=PTE(x);
 0:X2=(oa:PA(y),valid:1);
 0:X7=x; 1:X7=x;
 0:X5=z; 1:X5=z;
}
P0              | P1           ;
STR X0,[X1]     | LDR W4,[X5]  ;
DSB ISH         | DMB LD       ;
LSR X9,X7,12    | LDR W6,[X7]  ;
TLBI VAAE1IS,X9 |              ;
DSB ISH         |              ;
STR X2,[X1]     |              ;
DMB ST          |              ;
MOV W4,#1       |              ;
STR W4,[X5]     |              ;
exists(1:X4=1 /\ 1:X6=0)

The behaviour illustated in MP+bbm-dmb.st+dmb.ld is forbidden. P0 changes the output address of PTE(x) from PA(x) to PA(y), expecting that future uses of the virtual address x will be using the physical location PA(y). Due the caching effects of the Translation Lookaside Buffer (TLB), to change the oa field of the PTE(x), it has to do a break-before-make sequence; first, it sets textttvalid:0, then, it executes a TLBI instruction to invalidate old copies of PTE(x) in the TLB, and, finally, it updates PTE(x) with the new oa and set valid:1.

P1 performs a load to the symbolic virtual address x. If the load is ordered after P0 set the oa of PTE(x), then it has to read from the new physical location PA(y).

The Arm memory model forbids the MP+bbm-dmb.st+dmb.ld test; this can be seen using herd7 as follows:

% herd7 MP+bbm-dmb.st+dmb.ld.litmus

Here is the execution of interest:

5.6 Copy-On-Write example

Consider the test copy-on-write:

AArch64 copy-on-write
Variant=vmsa,handled
{
 int x=1;
 0:X2=x; 1:X2=x; 2:X2=x;

 [PTE(z)]=(oa:PA(x));
 2:X3=(valid:0); 2:X4=PTE(x);
 2:X8=(oa:PA(y),valid:1);
 2:X6=z; 2:X7=y;
}
 P0          | P1          | P2                           ;
 LDR W0,[X2] | MOV W0,#2   | (* break PTE(x) *)           ;
 LDR W1,[X2] | STR W0,[X2] | STR X3,[X4]                  ;
             |             | LSR X9,X2,#12                ;
             |             | DSB ISH                      ;
             |             | TLBI VAAE1IS,X9              ;
             |             | DSB ISH                      ;
             |             | (* copy PA(x) to PA(y) *)    ;
             |             | LDR W5,[X6]                  ;
             |             | STR W5,[X7]                  ;
             |             | DMB ISH                      ;
             |             | (* make PTE(x) - oa:PA(y) *) ;
             |             | STR X8,[X4]                  ;
exists(0:X0=2 /\ 0:X1=1)

The behaviour illustated in copy-on-write is forbidden. P0 loads from x twice; if the first load in P0 happens to read the value P1 wrote (2) then the second loads cannot read the old value. In the meantime, P2 performs a break-before-make sequence to remap x to PA(y) and copies y to a writeable alias of x (z) much like a Copy-On-Write sequence would do.

The Arm memory model forbids the copy-on-write test; this can be seen using herd7 as follows:

% herd7 copy-on-write.litmus

Here is the execution of interest:

5.7 Page Table Entry Values

In the general case, a page table entry value PTE(x) is a tuple (oa, valid, af, db, dbm), where:

oa is the “output address”;
valid is the valid bit;
af is the Access Flag;
db is the Dirty Bit;
dbm is the Dirty Bit Management bit.

This document was translated from L^AT_EX by H^EV^EA.