ARM Development
This post contains some notes on ARM development resulting from participating in a 3-day training course in Austin, Texas.
ARM Architecture
About ARM
-
Founded in 1990, has about 1000 employees
-
ARM licenses RISC processor cores and other IP
-
ARM does not fabricate processors
-
Provides support to partners and third-parties
-
Sells development software, boards, debugging hardware, etc
ARM based SoC
-
System on Chip
-
Is composed of
-
A deeply embedded processor core debuggable externally using JTAG
-
Internal and external memory
-
Interrupt controller (core supports two interrupts)
-
Advanced microcontroller bus architecture (AMBA)
-
Other peripherals
-
Processor terminology – with cache
-
MPU
-
Memory Protection Unit
-
Controls memory access permissions
-
Controls cacheable and bufferable attributes
-
-
MMU
-
Memory Management Unit
-
All features of an MPU
-
Virtual to physical address translation
-
-
Cache
-
TCM – Tightly coupled memory
-
Write buffer
ARM architecture evolution
-
4
-
Initial version
-
4T – Thumb instruction set
-
ARM7TDMI and ARM9TDMI families
-
-
5TE
-
Improved ARM / Thumb interworking, CLZ instruction, Saturated arithmetic, DSP multiply-accumulate instructions…
-
ARM9E and ARM10E families
-
5TEJ – Jazelle (java byte code execution)
-
-
6
-
SIMD instructions, multi-processing, unaligned data support…
-
6T2 – Thumb-2 instruction set
-
6Z – TrustZone extensions
-
ARM11 family
-
-
Same architecture can have different implementations
-
Von Neumann core – 3 stage pipeline with single instruction and data bus
-
Harvard core – 5 stage pipeline with separate instruction and data buses
-
Data size and instruction sets
-
ARM is a RISC architecture
-
One cycle per instruction
-
Pipelining to overlap the execution of several instructions, so throughput approaches one instruction per cycle
-
Large number of registers to reduce interactions with memory
-
-
32 bit load / store architecture
- Instructions and data
-
Memory sizes
-
Byte (always 8-bit)
-
Halfword – 16-bits (two bytes)
-
Word – 32-bits (four bytes)
-
Doubleword – 64-bits (eight bytes)
-
-
Two instruction sets
-
32-bit ARM
-
16-bit Thumb
-
Registers and processor modes
-
37 registers, each 32-bits long
-
Privileged modes
-
Supervisor (SVC)
- R13 (sp), R14 (lr), spsr accessible
-
High-priority interrupt (FIQ)
- R8, R9, R10, R11, R12, R13 (sp), R14 (lr), spsr accessible
-
Normal interrupt (IRQ)
- R13 (sp), R14 (lr), spsr accessible
-
Abort
-
R13 (sp), R14 (lr), spsr accessible (banked for this mode, as in the other exception modes; R0 to R12 and R15 (pc) are shared with User mode)
-
Undef
- R13 (sp), R14 (lr), spsr accessible
-
System
- R13 (sp) and R14 (lr) accessible
-
-
-
Unprivileged mode
-
User
- R13 (sp) and R14 (lr) accessible
-
Program status register (PSR)
-
Condition code flags [31:28]
-
N [31] – Negative result from ALU
-
Z [30] – Zero result from ALU
-
C [29] – ALU operation carried out
-
V [28] – ALU operation overflowed
-
-
Sticky overflow flag – Q flag [27]
-
5TE and later
-
Indicates if saturation has occurred
-
-
J bit [24]
-
5TEJ and later
-
1 indicates processor in Jazelle state
-
-
Interrupt disable bits
-
I [7] = 1 disables IRQ
-
F [6] = 1 disables FIQ
-
-
T Bit [5]
-
T = 0 means processor in ARM state
-
T = 1 means Thumb state
-
Introduced in 4T
-
-
Mode bits [4:0]
- Specify processor mode
-
New bits in V6
-
GE[3:0] (register bits [19:16]) used by some SIMD instructions
-
E bit [9] controls load / store endianness
-
A bit [8] disables imprecise data aborts
-
IT [15:10] IF THEN conditional execution of Thumb2 instruction groups
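The bit positions listed above can be turned into masks for inspecting a saved status register value. This is a minimal sketch in C: the macro and function names are made up for illustration, and the value is assumed to have been read from the CPSR or an SPSR by some other means.

```c
#include <stdint.h>

/* Bit masks taken from the bit positions in the notes above. */
#define PSR_N         (1u << 31)  /* negative result */
#define PSR_Z         (1u << 30)  /* zero result */
#define PSR_C         (1u << 29)  /* carry out */
#define PSR_V         (1u << 28)  /* overflow */
#define PSR_Q         (1u << 27)  /* sticky saturation flag (5TE and later) */
#define PSR_J         (1u << 24)  /* Jazelle state (5TEJ and later) */
#define PSR_I         (1u << 7)   /* 1 = IRQ disabled */
#define PSR_F         (1u << 6)   /* 1 = FIQ disabled */
#define PSR_T         (1u << 5)   /* 1 = Thumb state */
#define PSR_MODE_MASK 0x1Fu       /* mode bits [4:0] */

/* Hypothetical helpers that operate on a copy of a status register. */
uint32_t psr_mode(uint32_t psr)           { return psr & PSR_MODE_MASK; }
int      psr_in_thumb_state(uint32_t psr) { return (psr & PSR_T) != 0; }
int      psr_irq_enabled(uint32_t psr)    { return (psr & PSR_I) == 0; }
int      psr_fiq_enabled(uint32_t psr)    { return (psr & PSR_F) == 0; }
```

For example, the value 0x600000D3 has Z and C set, both interrupt types disabled, the T bit clear (ARM state) and mode bits 0x13.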
-
Program counter (PC)
-
ARM state
-
All instructions 32 bits wide
-
All instructions word aligned
-
PC value is stored in bits [31:2] with bits [1:0] undefined
-
-
Thumb state
-
All instructions 16 bits
-
All instructions must be halfword aligned
-
PC value is stored in bits [31:1] with bit [0] undefined
-
Data alignment
-
Prior to architecture v6 data accesses must be properly aligned or else unexpected results will be produced
-
Byte access should be byte aligned
-
Halfword access should be halfword aligned
-
Word access should be word aligned
-
-
Invalid data accesses can be used to produce an Abort exception using external logic or MMU
- Instruction fetches may appear unaligned, be careful
-
Unaligned data can be accessed using multiple aligned accesses combined with shifting and masking
-
Architecture v6 adds hardware support for unaligned access
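The "multiple aligned accesses combined with shifting and masking" approach can be sketched in C. The sketch models memory as an array of 32-bit words with little-endian byte numbering (byte 0 is the least significant byte of word 0); `load_unaligned_word` is a hypothetical name, not a library function.

```c
#include <stdint.h>

/*
 * Read a 32-bit value starting at an arbitrary byte address using only
 * aligned word reads. Memory is modelled as an array of words, with
 * little-endian byte numbering inside each word (an assumption).
 */
uint32_t load_unaligned_word(const uint32_t *words, uint32_t byte_addr)
{
    uint32_t index = byte_addr >> 2;         /* aligned word holding the first byte */
    uint32_t shift = (byte_addr & 3u) * 8u;  /* bit offset of the first byte */
    uint32_t low   = words[index];

    if (shift == 0) {
        return low;                          /* already aligned, one access is enough */
    }

    /* Low part comes from the first word, high part from the next one. */
    return (low >> shift) | (words[index + 1] << (32u - shift));
}
```

With the words 0x44332211 and 0x88776655 in memory, a read at byte address 1 combines both words and yields 0x55443322.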
Endianness
-
Endianness determines how the contents of registers are recovered from memory
-
ARM registers are word width
-
ARM addresses memory as a sequence of bytes
-
-
ARM processors are little-endian but can be configured to access big-endian memory systems
-
Little-Endian implies that the least significant byte (LSB) is at the lowest address
-
Three models of endianness are supported
-
Little-Endian (LE)
-
Word invariant Big-Endian (BE-32)
-
Byte invariant Big-Endian (BE-8 – introduced in v6)
-
-
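To make the byte ordering concrete, the following C sketch writes the same 32-bit register value into a byte array in little-endian and in big-endian order. The function names are hypothetical, and the code does not depend on the endianness of the machine it runs on.

```c
#include <stdint.h>

/* Store a word so that the least significant byte is at the lowest address. */
void store_le32(uint8_t *dst, uint32_t value)
{
    dst[0] = (uint8_t)(value & 0xFFu);
    dst[1] = (uint8_t)((value >> 8) & 0xFFu);
    dst[2] = (uint8_t)((value >> 16) & 0xFFu);
    dst[3] = (uint8_t)((value >> 24) & 0xFFu);
}

/* Store a word so that the most significant byte is at the lowest address. */
void store_be32(uint8_t *dst, uint32_t value)
{
    dst[0] = (uint8_t)((value >> 24) & 0xFFu);
    dst[1] = (uint8_t)((value >> 16) & 0xFFu);
    dst[2] = (uint8_t)((value >> 8) & 0xFFu);
    dst[3] = (uint8_t)(value & 0xFFu);
}
```

Storing 0x11223344 gives the byte sequence 44 33 22 11 in little-endian order and 11 22 33 44 in big-endian order.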
Exception handling
-
When an exception occurs the core
-
Copies CPSR into SPSR_<mode>
-
Sets appropriate CPSR bits
-
Change to ARM state
-
Change to exception mode
-
Disable interrupts (if appropriate)
-
-
Store the return address in LR_<mode>
-
Sets PC to vector address
-
-
To return, the exception handler needs to
-
Restore CPSR from SPSR_<mode>
-
Restore PC from LR_<mode>
-
Can only be done in ARM state
-
Vector table or jump table
0x1C – FIQ
0x18 – IRQ
0x14 – Reserved
0x10 – Data abort
0x0C – Prefetch abort
0x08 – Software interrupt
0x04 – Undefined instruction
0x00 – Reset
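The entry and return steps listed above can be modelled in a few lines of C. This is only a sketch of the sequence for an IRQ: the `core_model` structure and function names are hypothetical, the IRQ mode encoding 0x12 is not given in these notes, and the return address is passed in already adjusted, so the link register correction done by a real handler is left out.

```c
#include <stdint.h>

#define PSR_I_BIT     (1u << 7)  /* 1 = IRQ disabled */
#define PSR_T_BIT     (1u << 5)  /* 1 = Thumb state */
#define PSR_MODE_MASK 0x1Fu      /* mode bits [4:0] */
#define MODE_IRQ      0x12u      /* IRQ mode encoding (not listed in the notes) */
#define VECTOR_IRQ    0x18u      /* IRQ entry in the vector table */

/* Hypothetical model holding only the state touched on IRQ entry. */
typedef struct {
    uint32_t cpsr;
    uint32_t pc;
    uint32_t spsr_irq;
    uint32_t lr_irq;
} core_model;

/* What the core does when an IRQ is taken. */
void take_irq(core_model *core, uint32_t return_address)
{
    core->spsr_irq = core->cpsr;                   /* copy CPSR into SPSR_irq */
    core->cpsr &= ~(PSR_T_BIT | PSR_MODE_MASK);    /* change to ARM state */
    core->cpsr |= MODE_IRQ | PSR_I_BIT;            /* IRQ mode, IRQs disabled */
    core->lr_irq = return_address;                 /* store return address in LR_irq */
    core->pc = VECTOR_IRQ;                         /* set PC to the vector address */
}

/* What the handler must achieve when it returns. */
void return_from_irq(core_model *core)
{
    core->cpsr = core->spsr_irq;                   /* restore CPSR from SPSR_irq */
    core->pc = core->lr_irq;                       /* restore PC from LR_irq */
}
```

Because the saved CPSR is restored as a whole, the interrupted code resumes in its original mode and state, including Thumb state if that is where it was running.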
ARM instruction set
-
All instructions 32 bits long
-
Most execute in a single cycle
-
Load / store architecture
-
Has conditionally executed instructions
-
Some day Thumb-2 may replace the ARM instruction set
Example data processing instructions
SUB r0, r1, #5
- r0 = r1 - 5
ADD r2, r3, r3, LSL #2
- r2 = r3 + (r3 * 4)
ANDS r4, r4, #0x20
- r4 = r4 AND 0x20 (set flags)
ADDEQ r5, r5, r6
- IF EQ condition true r5 = r5 + r6
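For readers more used to C, the data processing examples above correspond roughly to the functions below. This is a sketch with hypothetical function names: registers are modelled as plain `uint32_t` values and only the Z flag is tracked, which is the flag that the EQ condition tests.

```c
#include <stdint.h>

/* SUB r0, r1, #5 */
uint32_t sub_5(uint32_t r1)
{
    return r1 - 5u;
}

/* ADD r2, r3, r3, LSL #2 : the shift is applied to the second operand
 * as part of the same instruction, giving r3 * 5. */
uint32_t add_shifted(uint32_t r3)
{
    return r3 + (r3 << 2);
}

/* ANDS r4, r4, #0x20 : the S suffix updates the flags; only Z is modelled. */
uint32_t ands_0x20(uint32_t r4, int *z_flag)
{
    uint32_t result = r4 & 0x20u;
    *z_flag = (result == 0);
    return result;
}

/* ADDEQ r5, r5, r6 : executes only when the Z flag is set. */
uint32_t add_if_eq(int z_flag, uint32_t r5, uint32_t r6)
{
    return z_flag ? r5 + r6 : r5;
}
```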
Example branching instruction
B <label>
- Branches forward or backwards relative to PC (+/- 32MB range)
Example memory access instructions
LDR r0, [r1]
- Load word at address in r1 to r0
STRNEB r2, [r3, r4]
- IF NE condition true, store bottom byte of r2 to address r3 + r4
STMFD sp!, {r4 - r8, lr}
- Store registers r4 to r8 and lr on stack, then update stack pointer.
Thumb instruction set
-
16-bit instruction set
-
Optimized for code density from C code
-
Improved performance from narrow memory
-
Subset of the functionality of the ARM instruction set
-
-
For most Thumb instructions
-
Flags are always set and conditional execution is not used
-
Source and destination registers identical
-
Only low registers used (R0 to R7)
-
Constants are of limited size (8 bits)
-
Inline barrel shifter not used
-
-
Switch between ARM and Thumb state using BX instruction
-
Thumb is not a regular instruction set and is targeted at compiler generation, not hand-coding
- Thumb-2 can be hand-coded
ARM7TDMI processor core
-
v4T architecture
-
Von Neumann architecture with shared instruction and data bus
-
3 stage pipeline
-
In the ideal case each instruction takes three cycles to pass through the pipeline (fetch, decode, execute), but because the stages overlap the effective cycles per instruction (CPI) is one
-
Read or write to memory can stall the pipeline
-
Branching breaks the pipeline
-
PC always points to the instruction being fetched
-
Average CPI is approximately 1.9
-
-
ARM720T is an ARM7TDMI with cache, MMU and write buffer
ARM9TDMI
-
v4T architecture
-
5 stage pipeline
-
Fetch, Decode, Execute, Memory, Write
-
Average CPI is 1.5
-
Improved maximum clock frequency
-
Memory load instruction can cause interlock if immediately followed by an instruction that uses the loaded data
-
-
Harvard architecture
- Simultaneous access to instruction and data memory possible
-
Normally supplied with caches
-
ARM9E processor cores are derived from ARM9TDMI and support the v5TE architecture
Other families
-
ARM10E
-
v5TE architecture
-
CPI of approximately 1.3
-
6 stage pipeline
-
Static branch prediction
-
-
ARM11
-
ARM v6 architecture
-
8 stage pipeline
-
Static and dynamic branch prediction and Return stack
-
-
Intel XScale
-
Intel implementation of ARM v5TE architecture
-
7 to 8 stage pipeline
-
Data and instruction caches
-
Development Tools
RVDS
-
Compilation tools
-
ISO C/C++ compiler (armcc)
-
ABI compliant
-
Allows binary compatibility between tool chains
-
Linker feedback mechanism to inform compiler of unused functions which can be eliminated in subsequent builds
-
ARM / Thumb assembler (armasm)
-
Linker (armlink)
-
Format converter (fromelf)
-
Librarian (armar)
-
C and C++ libraries
-
-
CodeWarrior IDE
-
RVD debugger and legacy AXD debugger
-
Instruction set simulators (RVISS, ADS ARMulator)
-
Previous tools suite (ADS)
Compiler optimization options
-
Optimization levels
-
-O0 – Best debug view, restricted optimization
-
-O1 – Most optimizations, good debug view
-
-O2 – Full optimization (the default), limited debug view
-
-O3 – Higher optimization (added in RVCT 2.1)
-
-
-Otime OR -Ospace
- Optimize for execution time or reduced space
-
Specify processor
-
Using architecture number, e.g. --cpu 5TE
-
Specific processor, e.g. --cpu ARM7TDMI
-
To see a list of options try armcc --cpu list
-
-
Interleaved C and Assembler listing
- With -S -fs OR --asm --interleave
ATPCS
-
ARM-Thumb Procedure Call Standard
- Useful for mixing C / C++ and assembly
-
Register usage
-
Arguments to function, return values, corruptible: r0, r1, r2, r3
-
Register variables which must be preserved: r4, r5, r6, r7, r8, r9, r10, r11
-
r9 is used as sb (static base) if RWPI option selected
-
r10 is used as sl (stack limit) if software stack checking selected
-
-
Scratch register (corruptible): r12
-
Libraries and Semihosting
-
RVDS includes ANSI C and C++ libraries
-
File handling, math, etc
-
Linker automatically links in the correct library variant for an application depending on endianness, floating point usage, position independence, etc
-
-
Semihosting is used to access host debug facilities
- Provides implementation of standard input and output facilities by invoking a software interrupt trapped by debug tools
-
User can provide replacement implementation of specific functions for embedded use (retargeting)
- No need to rebuild whole library
C/C++ Hints and Tips
-
Background on compiling and linking
-
Compiler performs only optimizations that are safe; it does not reorder instructions if that would change behavior
-
RVCT tools produce and consume ELF objects and images
-
ELF files contain one or more sections; each section can be either code or data, but not both
-
The code section usually contains constant data values in literal pools
-
A section can be moved independently of other sections
-
-
The compiler has no visibility outside the compilation unit it is compiling
- No knowledge of absolute addresses, relative addresses between sections (e.g. between code and data) or the source or object code of other files
-
The only knowledge shared across source files is in the included headers, which contain the layout of structures and classes, and function prototypes
-
The linker assigns addresses and lays out sections to form a final image based on a scatter file
-
Linker uses compiler generated relocations to patch the object code to take into account the final relationships between sections
-
Linker does not look at the source code
-
Cannot subdivide or insert extra information into sections
-
Can remove a section if it is not required by any part of the program
-
Inlining of functions
-
Inlining can improve performance, at the expense of a larger image, by incorporating the body of the inlined function directly into the calling code
- Only possible when the caller and callee are in the same compilation unit
-
The compiler can choose which functions to inline automatically
-
At -O0 and -O1 it chooses only functions marked with the __inline keyword
-
At -O2 and -O3 the compiler considers all functions
-
Inlining can be disabled by using the --no_inline option
-
A function can be marked with __forceinline to force it to be inlined in all cases
-
A non-static (extern) function that is auto-inlined also has a normal out-of-line version generated, because other compilation units may still call it
-
Parameter passing
-
The first four word sized arguments passed to a function are transferred in registers r0-r3
-
Arguments smaller than a word will use the entire register anyway
-
Arguments larger than a word are passed in multiple registers
-
-
If more than 4 arguments are passed then the stack is used (slower)
-
Therefore, always try to limit arguments to 4 words or fewer
-
C++ uses the first argument to pass the this pointer
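One common way to stay within four argument words is to group related values in a structure and pass a pointer to it. The sketch below uses hypothetical names (`samples`, `sum5`, `sum5_ptr`). Whether this is actually faster depends on whether the values already live in memory, so treat it as an illustration of the calling convention rather than a guaranteed optimization.

```c
/* Hypothetical group of five word-sized values. */
typedef struct {
    int a;
    int b;
    int c;
    int d;
    int e;
} samples;

/* Five word-sized arguments: under the convention described above,
 * a to d travel in r0-r3 and e is passed on the stack. */
int sum5(int a, int b, int c, int d, int e)
{
    return a + b + c + d + e;
}

/* One pointer argument: fits in r0, the callee loads the members itself. */
int sum5_ptr(const samples *s)
{
    return s->a + s->b + s->c + s->d + s->e;
}
```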
Loops
-
Subtract and compare to zero can be done in one instruction (SUBS) but must use an unsigned int counter or test not equal to zero, rather than greater than or equal to zero
Replace
for (loop = 1; loop <= total; loop++)
With
for (loop = total; loop != 0; loop--)
- Loop limit (total) used only once so the register used for the limit can be reused by the compiler
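As a complete example of the two loop forms, the functions below both add up the integers from 1 to total. The function names are hypothetical. The count-down version is the form that lets the compiler fold the decrement and the test into a single flag-setting subtract, although the code a given compiler actually generates is not shown here.

```c
/* Count-up form: needs a compare against total on every iteration. */
unsigned int sum_count_up(unsigned int total)
{
    unsigned int loop;
    unsigned int sum = 0;

    for (loop = 1; loop <= total; loop++) {
        sum += loop;
    }
    return sum;
}

/* Count-down form: the decrement itself produces the "reached zero" test. */
unsigned int sum_count_down(unsigned int total)
{
    unsigned int loop;
    unsigned int sum = 0;

    for (loop = total; loop != 0; loop--) {
        sum += loop;
    }
    return sum;
}
```

Both functions return the same result; the rewrite is only valid when the loop body does not depend on the order of iteration.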
Division
-
ARM core contains no division hardware
- Typically implemented by a run-time library
-
Compiler will try to optimize division
- Division by a constant power of two (e.g. two) uses a right shift operation
-
Same problem with remainder (modulo) operations
- Use if statement instead of modulo operation when checking range
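A typical case is advancing an index that wraps around at the end of a buffer. The sketch below shows both forms with hypothetical function names. The if version assumes the index is already within range (index < size) on entry, which is what makes the replacement valid.

```c
/* Wrap-around using modulo: may call a run-time library division routine. */
unsigned int next_index_mod(unsigned int index, unsigned int size)
{
    return (index + 1u) % size;
}

/* Wrap-around using a range check: an add, a compare and a conditional
 * reset. Assumes index < size on entry. */
unsigned int next_index_if(unsigned int index, unsigned int size)
{
    index++;
    if (index >= size) {
        index = 0;
    }
    return index;
}
```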
Floating point
-
Software floating point library called fplib
- Use compile time option (default) --fpu softvfp
-
VFP floating point coprocessor
-
Use option --fpu vfp
- Specifying a --cpu that includes VFP also selects this option implicitly
-
Available for ARM9E, ARM10 and ARM11 cores
-
Actually a mix of hardware coprocessor and emulation; it requires VFP support code (provided with RVDS) for unusual cases
-
-
Use --fpu softvfp+vfp when Thumb code uses floating point
-
Thumb does not have coprocessor instructions
-
Compiler calls an ARM state library or compiles the function using floating point as ARM
-
Variable types
-
Global and static variables are held in RAM (in RW/ZI sections)
-
Requires load / store to memory (slow)
-
External global variables also require extra level of indirection because compiler has to load a pointer to the variable first
-
-
Local variables are held in registers for fast processing but if the compiler runs out of registers it will use the stack
-
Prefer int size local variables instead of byte or short because this avoids additional shifts and masks
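One way to apply these points is to work on a local copy of a global inside a loop and write it back once at the end. The sketch below uses a hypothetical global `error_count` and hypothetical function names. A compiler may perform this transformation on its own when it can prove it is safe, so the benefit is not guaranteed.

```c
/* Hypothetical global, held in RAM. */
unsigned int error_count;

/* Updates the global directly: potentially a load and a store per error. */
void count_errors_global(const int *status, unsigned int n)
{
    unsigned int i;

    for (i = 0; i < n; i++) {
        if (status[i] != 0) {
            error_count++;
        }
    }
}

/* Works on an int-sized local that can stay in a register,
 * then writes the global back once. */
void count_errors_local(const int *status, unsigned int n)
{
    unsigned int i;
    unsigned int count = error_count;

    for (i = 0; i < n; i++) {
        if (status[i] != 0) {
            count++;
        }
    }
    error_count = count;
}
```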
Stack issues
-
C/C++ code uses stack extensively
-
The stack is used to hold
-
Function return addresses
-
Caller’s registers that must be preserved according to the procedure call standard (ATPCS)
-
“Spilled” local variables
-
Local arrays, structures and classes (in C++)
-
-
Things to consider
-
Keep functions small with fewer variables
-
Avoid large local structures, classes or arrays, use the heap, especially for Thumb
-
Beware of recursion
-
-
Measure stack usage
-
Link with --callgraph option to see static stack usage information
-
Compile with software stack checking --apcs/swst
-
Unaligned accesses
-
ARM hardware requires access to memory to be on natural boundaries
-
Compiler will reorder layout of global data in a module unless --O no_data_reorder option is used
-
Compiler cannot reorder structure members, so it adds padding instead; you can rearrange the members yourself so that padding is minimized (smaller members first)
-
-
If unaligned access is required warn the compiler by using the __packed type qualifier
-
Required for network protocols, reusing legacy code
-
Requires additional instructions when loading and storing data
-
Using __packed on a structure removes all padding; it may be more efficient to specify __packed on an individual member and not the entire structure
-
-
Beware when using pointers to unaligned data
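The effect of member ordering on padding can be checked with sizeof. The sketch below uses two hypothetical structures with the same members in a different order. It deliberately uses only standard C, so the armcc-specific __packed qualifier is not shown. The exact sizes depend on the target ABI; with a 4-byte int and natural alignment the first layout is typically 12 bytes and the second 8.

```c
#include <stddef.h>

/* Small and large members interleaved: padding is inserted after a and c
 * so that b and d land on their natural boundaries. */
struct scattered {
    char  a;
    int   b;
    char  c;
    short d;
};

/* Same members, smaller ones first: the two chars and the short share
 * one word, so less padding is needed. */
struct grouped {
    char  a;
    char  c;
    short d;
    int   b;
};

/* Number of bytes saved by the reordering on the current target. */
size_t padding_saved(void)
{
    return sizeof(struct scattered) - sizeof(struct grouped);
}
```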
Multifile compilation
-
Default at level -O3 if multiple source files are specified
-
armcc --multifile -c file1.c file2.c …
-
Compiles multiple source files into a single object file
-
Behaves as if all the source files are one big source file
-
-
Benefits
-
The compiler can make more observations, because it sees all the source files at once
-
Inline more often
-
Can share code segments and literal pool data
-
Improves global data access (same base pointer)
-
Fewer license checkouts
-
Cross file type checking
-
-
Use carefully
-
All code placed in single ELF section
-
All data placed in another ELF section
-
Potential scatter loading or undefined reference type problems
-
Increased compilation time
-
-
Works best with small groups of related source files
Useful references
Application Notes and Articles
-
Application Notes
-
App Note 34: Writing Efficient C for ARM
-
App Note 36: Using C Global Data
-
App Note 61: Big and Little Endian Byte Addressing
-
-
Manuals
-
Application Binary Interface (ABI) for the ARM Architecture
-
ARM ELF Specification
-
Assembler Guide
-
Compilers and Libraries Guide
-
Linker and Utilities Guide
-
Books
-
ARM Architecture Reference Manual (ARM ARM), second edition – David Seal
-
ARM system-on-chip architecture, second edition – Steve Furber