A Deeper Look at ARM Assembly Language
Let's continue our blog series on ARM assembly language by drilling down into some of the basic ARM machine language instructions.
Instruction Formats and Addressing Modes
ARM instructions accept from zero to three (and occasionally more) operands. An optional S suffix can be added to indicate that the result should update the flags in the status register. For most data-processing instructions, the last source operand can be either a register or immediate data, and the destination register can usually be the same as a source register.
The most basic instruction is MOV (for move) and takes the form MOV dest, src. Here are some examples:
MOV R1,R2 Copy the contents of R2 to R1
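MOV R2,#1234 Move immediate value 1234 into R2 (on older cores this value does not fit the immediate field and has to be loaded another way, e.g. LDR R2,=1234)
MVN R1,R2 Move Not; copy the 1's complement (bitwise NOT) of R2 into R1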
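One detail about immediate data is worth a short illustration. In the classic 32-bit ARM encoding, an immediate operand is stored as an 8-bit value rotated right by an even number of bit positions, so not every 32-bit constant can appear directly in an instruction. The C sketch below models that rule; the function names are made up for this example and are not part of any toolchain.

```c
#include <stdbool.h>
#include <stdint.h>

/* Rotate a 32-bit value left by n bits (n is reduced to 0..31). */
static uint32_t rotl32(uint32_t v, unsigned n)
{
    n &= 31u;
    return n == 0 ? v : (v << n) | (v >> (32u - n));
}

/*
 * Hypothetical helper: returns true if v can be expressed as an 8-bit
 * value rotated right by an even amount (0, 2, ..., 30), which is the
 * classic ARM data-processing immediate format.
 */
bool is_arm_immediate(uint32_t v)
{
    for (unsigned rot = 0; rot < 32; rot += 2) {
        /* Rotating left by rot undoes a rotate right by rot. */
        if (rotl32(v, rot) <= 0xFFu)
            return true;
    }
    return false;
}
```

With this model, 255 and 0xFF000000 are encodable, while 1234 (0x4D2) is not, because its set bits span more than eight positions. Newer cores and assemblers have other ways to load such constants, so treat this only as a model of the basic encoding.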
Math Functions
All of the common arithmetic functions are provided. Below are some examples:
Add:
ADD R0,R1,R2 R0 ← R1 + R2 Add
ADC R0,R1,R2 R0 ← R1 + R2 + C Add with carry
ADDS R0,R1,R2 R0 ← R1 + R2 Add, setting flags
Subtract:
SUB R0,R1,R2 R0 ← R1-R2 Subtract
RSB R0,R1,R2 R0 ← R2-R1 Reverse subtract
Multiply, Multiply and Accumulate:
MUL R0,R1,R2 R0 ← R1*R2
MLA R0,R1,R2,R3 R0 ← (R1*R2)+R3
There is also (at least on some ARM platforms) a "long multiply" that produces a 64-bit result in a pair of registers (UMULL for unsigned values, SMULL for signed values).
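To make the arithmetic above concrete, here is a minimal C sketch that models a few of these instructions on 32-bit values. The function names are invented for illustration; the point is how an add that sets flags followed by an add with carry combine to add 64-bit numbers on a 32-bit core, and what reverse subtract, multiply-accumulate, and a long multiply compute.

```c
#include <stdint.h>

/* Models an ADDS on the low words followed by an ADC on the high words. */
uint64_t add64_via_halves(uint64_t a, uint64_t b)
{
    uint32_t a_lo = (uint32_t)a, a_hi = (uint32_t)(a >> 32);
    uint32_t b_lo = (uint32_t)b, b_hi = (uint32_t)(b >> 32);

    uint32_t lo = a_lo + b_lo;                /* ADDS: wraps at 32 bits     */
    uint32_t carry = (lo < a_lo) ? 1u : 0u;   /* carry flag from the ADDS   */
    uint32_t hi = a_hi + b_hi + carry;        /* ADC: adds the carry flag   */

    return ((uint64_t)hi << 32) | lo;
}

/* Models RSB Rd,Rn,Op2: the operands are subtracted in reverse order. */
uint32_t rsb(uint32_t rn, uint32_t op2)
{
    return op2 - rn;
}

/* Models MLA Rd,Rm,Rs,Rn: multiply, then accumulate, truncated to 32 bits. */
uint32_t mla(uint32_t rm, uint32_t rs, uint32_t rn)
{
    return rm * rs + rn;
}

/* Models an unsigned long multiply: 32 x 32 bits giving a 64-bit result. */
uint64_t long_multiply(uint32_t rm, uint32_t rs)
{
    return (uint64_t)rm * rs;
}
```

For example, add64_via_halves(0xFFFFFFFF, 1) carries out of the low word and returns 0x100000000.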
Compare and Logical Instructions
Instructions for comparing values are provided, e.g.
CMP R1,R2 Compare R1 with R2 (computes R1 - R2), set flags
CMP R3,#0 Compare R3 with zero, set flags
CMN R4,#0 Compare negative; compare R4 with the negated operand (computes R4 + 0), set flags
Also provided are logical operations such as AND, ORR (or), and EOR (exclusive or), e.g.
AND R1,R2,R3 R1 ← R2 AND R3
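The compare instructions do not store a result; they only update the N, Z, C, and V flags. As a rough model, assuming the usual ARM convention that the carry flag is set when a subtraction does not borrow, the flag values can be sketched in C like this (the struct and function names are hypothetical):

```c
#include <stdint.h>

/* The four condition flags held in the status register. */
struct flags {
    unsigned n; /* negative: bit 31 of the result   */
    unsigned z; /* zero: result is zero             */
    unsigned c; /* carry: unsigned carry, no borrow */
    unsigned v; /* overflow: signed overflow        */
};

/* Models a compare: flags from rn - op2, result discarded. */
struct flags compare_flags(uint32_t rn, uint32_t op2)
{
    uint32_t result = rn - op2;
    struct flags f;
    f.n = result >> 31;
    f.z = (result == 0);
    f.c = (rn >= op2);                          /* set when there is no borrow */
    f.v = ((rn ^ op2) & (rn ^ result)) >> 31;   /* signs differ and sign flips */
    return f;
}

/* Models a compare negative: flags from rn + op2, result discarded. */
struct flags compare_negative_flags(uint32_t rn, uint32_t op2)
{
    uint32_t result = rn + op2;
    struct flags f;
    f.n = result >> 31;
    f.z = (result == 0);
    f.c = (result < rn);                        /* unsigned carry out          */
    f.v = (~(rn ^ op2) & (rn ^ result)) >> 31;  /* same signs and sign flips   */
    return f;
}
```

Comparing 5 with 5 sets Z and C; comparing 1 with 2 sets N and clears C, which is what the conditional instructions in the next section test.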
Branching and Conditions
Branching works as it does on most processors, e.g.
BEQ label Branch to label if Z flag is set
BNE label Branch to label if Z is not set
In the more general case, most instructions can be made conditional simply by adding a suffix with the condition, e.g.
MOVCS R0,R1 Move if carry flag set
Note that with the GNU assembler the condition must be attached directly to the mnemonic; a form with a space, such as MOV CS R0,R1, is not accepted.
The conditions supported are the following:
EQ/NE | Equal/Not equal (Z set/Z clear) |
VS/VC | Overflow set/Overflow clear |
AL | Always (the default when no condition is given) |
NV | Never (deprecated; do not use) |
HI | Higher (unsigned) |
LS | Lower or same (unsigned) |
PL | Plus (N clear) |
MI | Minus (N set) |
CS/HS | Carry set (unsigned higher or same) |
CC/LO | Carry clear (unsigned lower) |
GE | Greater than or equal (signed) |
LT | Less than (signed) |
GT | Greater than (signed) |
LE | Less than or equal (signed) |
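Each condition is just a test on the N, Z, C, and V flags. The following C sketch spells out the standard flag tests; the enum and function names are chosen for this example only, and the deprecated NV condition is left out.

```c
#include <stdbool.h>

/* Condition flags as left by a flag-setting instruction such as CMP. */
struct flags {
    bool n, z, c, v;
};

enum cond {
    COND_EQ, COND_NE, COND_CS, COND_CC, COND_MI, COND_PL, COND_VS,
    COND_VC, COND_HI, COND_LS, COND_GE, COND_LT, COND_GT, COND_LE, COND_AL
};

/* Returns true if an instruction with condition cc would execute. */
bool cond_holds(enum cond cc, struct flags f)
{
    switch (cc) {
    case COND_EQ: return f.z;
    case COND_NE: return !f.z;
    case COND_CS: return f.c;                 /* also written HS */
    case COND_CC: return !f.c;                /* also written LO */
    case COND_MI: return f.n;
    case COND_PL: return !f.n;
    case COND_VS: return f.v;
    case COND_VC: return !f.v;
    case COND_HI: return f.c && !f.z;         /* unsigned higher        */
    case COND_LS: return !f.c || f.z;         /* unsigned lower or same */
    case COND_GE: return f.n == f.v;          /* signed >=              */
    case COND_LT: return f.n != f.v;          /* signed <               */
    case COND_GT: return !f.z && f.n == f.v;  /* signed >               */
    case COND_LE: return f.z || f.n != f.v;   /* signed <=              */
    case COND_AL: return true;
    }
    return false;
}
```

As a usage example, comparing 0xFFFFFFFF with 1 leaves N=1, Z=0, C=1, V=0. With those flags both HI and LT hold: the first operand is higher when treated as unsigned, but less than 1 when treated as the signed value -1.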
Shifts and Rotates
The ARM CPU has a barrel shifter that can shift or rotate a result by up to 32 bit positions at once. Shifts and rotates are only done as part of other instructions and not explicitly with shift or rotate instructions (however, the assembler will accept them as instructions and convert them to a MOV).
The shift or rotate operation is added as an optional modifier on the last source operand. This is supported by instructions for move, add, subtract, compare, AND, ORR, EOR, test, and others. The operations are LSL, LSR, ASL (a synonym for LSL), ASR, ROR, and RRX (rotate right through the extend/carry bit). There is no rotate left; rotating right by 32 - n bits has the same effect as rotating left by n. Logical shifts shift in a zero. Arithmetic right shifts maintain the sign of the value.
Here are some examples:
MOV R0, R1, LSL #1 R0 ← R1 shifted left by 1 bit position
MOV R0, R1, ROR #4 R0 ← R1 rotated right by 4 bit positions
MOVSCS R0, R1, ASR #2 If carry bit is set, R0 ← R1 shifted right by two positions, maintaining sign and setting flags (unified syntax; the older syntax writes this as MOVCSS)
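For readers who think more easily in C, the shift and rotate operations can be modelled as below. This is a sketch with made-up function names, restricted to shift amounts of 0 to 31; the last function illustrates how an add with a shifted operand multiplies by a small constant in one instruction.

```c
#include <stdint.h>

/* Logical shift left: zeros are shifted in on the right (n in 0..31). */
uint32_t lsl(uint32_t x, unsigned n)
{
    return x << n;
}

/* Logical shift right: zeros are shifted in on the left (n in 0..31). */
uint32_t lsr(uint32_t x, unsigned n)
{
    return x >> n;
}

/* Arithmetic shift right: copies of the sign bit are shifted in (n in 0..31). */
uint32_t asr(uint32_t x, unsigned n)
{
    uint32_t shifted = x >> n;
    if (x & 0x80000000u)
        shifted |= ~(0xFFFFFFFFu >> n);   /* fill the vacated top bits with 1s */
    return shifted;
}

/* Rotate right: bits leaving the right end re-enter on the left. */
uint32_t ror(uint32_t x, unsigned n)
{
    n &= 31u;
    return n == 0 ? x : (x >> n) | (x << (32u - n));
}

/* Models ADD R0, R1, R1, LSL #2, which computes R1 + R1*4 = R1*5. */
uint32_t times_five(uint32_t r1)
{
    return r1 + lsl(r1, 2);
}
```

For instance, ror(0x12345678, 4) gives 0x81234567 and asr(0x80000000, 2) gives 0xE0000000.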
Assembler Output From C or C++ Compiler
I mentioned the use case of wanting to examine the assembler output of the C or C++ compiler. This can be useful for optimizing code or debugging suspected compiler issues. Let's use this small example which illustrates some typical C code but doesn't do anything meaningful:
int main()
{
int j, k;
for (int i = 0; i < 100; i++) {
j = i * i;
if (j % 2) {
k = j;
} else {
k = 2 * j;
}
}
return 0;
}
If we compile this with the GNU C compiler and the -S and -fverbose-asm options, we can see the assembler output, e.g.:
gcc -S -fverbose-asm example.c
The output can be enlightening, even if you don't know all the details of assembler programming. It is quite long though, so let's just look at a few highlights. Each corresponding line of C code is shown as a comment in the assembler output. Here is the line starting the for loop:
@ example.c:5: for (int i = 0; i < 100; i++) {
mov r3, #0 @ tmp114,
str r3, [fp, #-8] @ tmp114, i
Here we can see R3 set to zero for the loop variable i. Local variables are stored on the stack and accessed relative to the frame pointer, fp or R11. It looks like variable i is at an offset of -8 from the frame pointer, and the initialized value in R3 gets stored there.
@ example.c:6: j = i * i;
ldr r3, [fp, #-8] @ tmp116, i
ldr r2, [fp, #-8] @ tmp117, i
mul r3, r2, r3 @ tmp115, tmp117, tmp116
str r3, [fp, #-12] @ tmp115, j
In the code above we are getting the value of i, again at offset -8 from the frame pointer and putting it in r3. Another copy goes in r2. Then we multiply r2 and r3 and store the result back in r3. Finally, r3 is stored at -12 from the frame pointer, which must correspond to the variable j.
@ example.c:8: k = j;
ldr r3, [fp, #-12] @ tmp118, j
str r3, [fp, #-16] @ tmp118, k
Here we see a simple assignment, getting variable j from the stack frame and writing it to variable k on the stack. Due to the load/store architecture, we need to do this via a register.
@ example.c:10: k = 2 * j;
ldr r3, [fp, #-12] @ tmp120, j
lsl r3, r3, #1 @ tmp119, tmp120,
str r3, [fp, #-16] @ tmp119, k
In this code, we get variable j from the stack and put it in register r3. The compiler is smart and implements a multiply by 2 as a shift left instruction. The result is stored via the stack frame into variable k.
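The substitution is safe because shifting left by one bit and multiplying by 2 give the same result for the values in this program. A small C check over the same loop range, using unsigned arithmetic and a function name invented for this sketch, confirms it:

```c
#include <stdint.h>

/*
 * Counts how many loop iterations give different results for
 * 2 * j and j << 1, using the same values of j as the example.
 */
int count_shift_mismatches(void)
{
    int mismatches = 0;
    for (uint32_t i = 0; i < 100; i++) {
        uint32_t j = i * i;
        if (2u * j != (j << 1))
            mismatches++;
    }
    return mismatches;
}
```

The function returns 0, since the two expressions agree for every value of j the loop produces.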
To see a case where the compiler does use a multiply instruction, look again at the code generated from line 6 of the example:
@ example.c:6: j = i * i;
ldr r3, [fp, #-8] @ tmp116, i
ldr r2, [fp, #-8] @ tmp117, i
mul r1, r2, r3 @ tmp115, tmp117, tmp116
str r1, [fp, #-12] @ tmp115, j
Advanced and Miscellaneous Topics
There are many more ARM instructions and features that we could cover, but not in a reasonable length for a blog post. I would like to mention some topics that you might want to explore in more detail.
Addressing modes: We didn't cover the supported instruction addressing modes. The compiler output showed the use of indirect addressing using the frame pointer. ARM supports indirect and pre and post-indexed addressing modes, PC-relative addressing for position independent code, and more.
Is ARM big-endian or little-endian? It can be either: on cores that support it, data endianness is selected by the E bit in the CPSR. The default is little-endian and most OSes use it in this mode (which is the same as Intel platforms), but in certain cases, for example when handling a lot of big-endian data, you could choose to use big-endian for increased efficiency.
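If you are unsure which byte order your program is running under, it is easy to check from C. The sketch below uses two hypothetical helper names: one inspects the first byte of a known value in memory, and the other reverses the byte order of a 32-bit word, which is the conversion you need when exchanging data with a system of the opposite endianness.

```c
#include <stdint.h>
#include <string.h>

/* Returns 1 if the least significant byte is stored first in memory. */
int is_little_endian(void)
{
    uint32_t probe = 1;
    unsigned char first_byte;
    memcpy(&first_byte, &probe, 1);   /* look at the lowest-addressed byte */
    return first_byte == 1;
}

/* Reverses the order of the four bytes in a 32-bit word. */
uint32_t swap32(uint32_t x)
{
    return (x >> 24)
         | ((x >> 8) & 0x0000FF00u)
         | ((x << 8) & 0x00FF0000u)
         | (x << 24);
}
```

On a typical Raspberry Pi OS installation is_little_endian() would be expected to return 1, and swap32(0x11223344) gives 0x44332211 regardless of the platform.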
Debugging: The GNU gdb debugger can show or step by machine code instructions, set breakpoints, and has other features to support debugging at the machine language level. These features are available from most IDEs that use gdb, including Qt Creator.
C/C++ in-line assembler: GCC supports putting assembler code into C or C++ code. This is often the best solution for small routines or optimizing where you just want to add a few lines of code. You have access to variable names and can use conditional compilation to use different code depending on the platform.
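As a minimal sketch of what this looks like, the function below uses GCC's extended asm syntax for a single ADD instruction when compiled for 32-bit ARM, and falls back to plain C on any other platform. The function name is hypothetical, and the choice of the __arm__ macro to detect the platform is an assumption that fits GCC-compatible compilers.

```c
#include <stdint.h>

/* Adds two values, using an ARM ADD instruction where available. */
int32_t add_values(int32_t a, int32_t b)
{
#if defined(__GNUC__) && defined(__arm__)
    int32_t result;
    /* %0 is the output register; %1 and %2 are the input registers. */
    __asm__ ("add %0, %1, %2"
             : "=r" (result)
             : "r" (a), "r" (b));
    return result;
#else
    /* Portable fallback for non-ARM platforms. */
    return a + b;
#endif
}
```

The compiler picks the registers and substitutes them for the placeholders, which is what gives the assembler code access to C variables. A real use would normally wrap something the compiler cannot easily generate on its own rather than a simple add.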
Floating Point: Most ARM application processors, including those used on the Raspberry Pi, have an onboard Vector Floating-Point unit (VFP). Depending on the VFP version, this provides 16 or 32 64-bit IEEE standard floating-point registers with support for math functions as well as some vector operations. On platforms where it is not present, the operating system or the compiler's run-time library can transparently implement it in software (at reduced performance).
Thumb Mode: Thumb is an alternative mode where the processor implements a subset of ARM instructions. The instructions are encoded into 16 bits rather than 32, making the code more compact. You can switch modes at run time and make calls between ARM and Thumb code.
GPIO: Some SOMs (e.g. Raspberry Pi) have registers to control onboard GPIO functions. If you want to run GPIO at the maximum possible speed you can directly access the registers. It is highly hardware-dependent (even varying across models of Raspberry Pi, for example) but extremely fast.
64-bit Support: Some ARM processors have a 64-bit mode (e.g. those in the Raspberry Pi 3 and 4). In this mode, you have 64-bit registers and can use 64-bit addresses. 32-bit addresses are limited to 4GB, so this allows addressing more memory and working more efficiently with 64-bit values. The downside is the larger code size. A 64-bit version of the Raspberry Pi OS was recently made available as a supported option for that platform.
Raspberry Pi Pico: If you want to explore ARM-based microcontrollers, the Raspberry Pi Pico is a $4 board built around the RP2040, a dual-core Cortex-M0+ microcontroller with 264KB RAM, paired with 2MB of flash. Other features include 26 GPIO pins, 3 analog inputs, 2 UARTs, 2 SPI interfaces, 2 I2C interfaces, USB, and 8 Programmable I/O (PIO) state machines. This is a good option for learning bare-metal microcontroller programming.
References
Here are a few references I have found useful for learning ARM assembly language programming:
- Raspberry Pi Assembly Language, Bruce Smith https://www.brucesmith.info
- ARM A32 Assembly Language, Bruce Smith https://www.brucesmith.info
- ARM and Thumb-2 Instruction Set Quick Reference Card https://developer.arm.com/documentation/qrc0001/m/
- Linux system call table for multiple architectures, Marcin Juszkiewicz https://marcin.juszkiewicz.com.pl/download/tables/syscalls.html
- ARM architecture, Wikipedia https://en.wikipedia.org/wiki/ARM_architecture
Summary
I hope you found this blog series on ARM assembly language programming interesting and useful. If you want to learn more, I encourage you to try writing and running some small programs of your own on an ARM-based platform such as a Raspberry Pi. If you missed part 1 in our series, read it here.