Freaky Assembly?

After a looong time, I was debugging some embedded C code and thought I found something freaky:

C code

for (i = 0; i < 1000000; i++);

ARM code disassembly (as generated by GNU ARM gcc)

0x0000019c <main+196>: mov r3, #0 ; 0x0
0x000001a0 <main+200>: str r3, [r11, #-16]
0x000001a4 <main+204>: b 0x1b4 <main+220>
0x000001a8 <main+208>: ldr r3, [r11, #-16]
0x000001ac <main+212>: add r3, r3, #1 ; 0x1
0x000001b0 <main+216>: str r3, [r11, #-16]
0x000001b4 <main+220>: ldr r2, [r11, #-16]
0x000001b8 <main+224>: mov r3, #999424 ; 0xf4000
0x000001bc <main+228>: add r3, r3, #572 ; 0x23c
0x000001c0 <main+232>: add r3, r3, #3 ; 0x3

0x000001c4 <main+236>: cmp r2, r3
0x000001c8 <main+240>: bls 0x1a8 <main+208>

The three instructions at 0x1b8 through 0x1c0 (highlighted in the original) in effect initialize r3 with 999999: the first loads r3 with 999424, the second adds 572 to it, and the third adds 3.

What puzzled me was why couldn’t it do that directly (mov r3, #999999)?

After some head-scratching and plowing through the ARM book: ARM instructions are 32 bits wide, of which Operand 2 gets only 12 bits. In addition (from the ARM book):

– Of these 12 bits, 8-bits are for data, and 4-bits are used for ROR.
– The ROR bits are in turn multiplied by 2 before being applied on the 8-bits.

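Doubling the rotate count lets the 4-bit field cover every even rotation from 0 to 30, which greatly extends the range beyond 8 bits. The assembler works out the encoding for you when it sees an operand wider than 8 bits; a constant that no single rotated 8-bit value can express (like 999999) has to be built up over several instructions, as the compiler did above.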

This can be a great (but wicked) interview question (I’d never do that to anyone ;-)).

Do verify, here’s the math…

999424 + 572 + 3 is a triple of values that adds up to 999999 where each value fits the 12-bit encoding (8 data bits, with the ROR count multiplied by 2).

Just for verification, here are the instructions from memory:

1b8: 3D39A0E3 ; 0xE3A0393D
1bc: 8F3F83E2 ; 0xE2833F8F
1c0: 033083E2 ; 0xE2833003

To get 999424 (0x0F4000):
0x0000003D ROR 18 (0x9 x 2) = 0x000F4000 (ROR 18 = LSL 14)
As confirmed by the instruction: E3A03 93D

To get 572 (0x023C):
0x0000008F ROR 30 (0xF x 2) = 0x0000023C (ROR 30 = LSL 2)
As confirmed by the instruction: E2833 F8F

To get 3 (0x0003):
0x00000003 ROR 00 (0x0 x 2) = 0x00000003 (ROR 00 = LSL 0)
As confirmed by the instruction: E2833 003

Note: the LSL equivalents are just for convenience; a ROR by n equals an LSL by (32 - n) only when the data has enough zero bits on the left that nothing is shifted out (at least enough to cover the LSL).
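The encoding rule worked through above can be sketched in C. This is only an illustration of the 8-bit-data plus 4-bit-doubled-rotate scheme; the function names (ror32, arm_imm_decode, arm_imm_encode) are made up for this example and are not part of any toolchain.

```c
#include <stdint.h>
#include <stdbool.h>

/* Rotate a 32-bit value right by n bits (n is taken modulo 32). */
static uint32_t ror32(uint32_t v, unsigned n)
{
    n &= 31u;
    return n ? (v >> n) | (v << (32u - n)) : v;
}

/* Decode a 12-bit Operand 2 immediate field: bits [11:8] hold the
 * rotate count (applied doubled), bits [7:0] hold the data byte. */
uint32_t arm_imm_decode(uint32_t operand2)
{
    uint32_t rot  = (operand2 >> 8) & 0xFu;
    uint32_t imm8 = operand2 & 0xFFu;
    return ror32(imm8, 2u * rot);
}

/* Try to express value as a rotated 8-bit immediate. On success, store
 * the 12-bit field in *operand2 and return true. Return false when no
 * even rotation leaves a value that fits in 8 bits. */
bool arm_imm_encode(uint32_t value, uint32_t *operand2)
{
    for (unsigned rot = 0; rot < 16; rot++) {
        /* Rotating right by (32 - 2*rot) undoes a ROR by 2*rot. */
        uint32_t imm8 = ror32(value, 32u - 2u * rot);
        if (imm8 <= 0xFFu) {
            *operand2 = (rot << 8) | imm8;
            return true;
        }
    }
    return false;
}
```

Decoding the low 12 bits of the three instruction words (0x93D, 0xF8F, 0x003) gives 999424, 572 and 3, while arm_imm_encode reports that 999999 itself has no single-instruction encoding.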

ARM Assembler

My ARM assembler cheat sheet.


  1. Load-and-Store Architecture
  2. Von Neumann Architecture


T – Thumb architecture extension

  • ARM Instructions are all 32 bit
  • Thumb instructions are all 16 bit
  • Two execution states to select which instruction set to execute

D – Core has debug extensions
M – Core has enhanced multiplier
I – Core has Embedded ICE Macrocell
S – Fully synthesizable

Word = 32-bits
Half-word = 16-bits

Program Counter

The program counter is two instructions ahead of the instruction being executed. An instruction is 4 bytes, so we’re talking 8 bytes ahead: reading the PC while an instruction executes yields that instruction’s address + 8. So, the net result is that the program counter is pointing to the instruction being fetched, not the instruction being executed. The instruction being executed is at PC-8.

Fetch – PC
Decode – PC-4
Execute – PC-8

Interrupt Vector Table

Reset – 0x00000000
Undefined Instruction – 0x00000004
Software Interrupt – 0x00000008
Prefetch Abort – 0x0000000C
Data Abort – 0x00000010
Reserved – 0x00000014
IRQ – 0x00000018
FIQ – 0x0000001C

The entries in the Interrupt Vector Table are not the addresses of the ISRs. Each entry is an instruction, and it typically loads the PC from another table, the VSR table (Vector Service Routine), which contains the addresses of the ISRs. Why not branch to the ISR directly from the Interrupt Vector Table? Because a branch instruction is limited in range to a 26-bit offset (64MB). So, instead the IVT entry has the instruction: LDR pc, [pc,#-0xFF0]. This essentially replaces PC with the value from the VSR.

Example: Any IRQ causes a jump to IRQ vector (0x18)
0x18    LDR pc, [pc,#-0xFF0]  ; Loads PC with the address from VICVectAddr (0xFFFFF030) register.

In effect it does this:  LDR pc, [addr]   ; addr = (address of this instruction) + 8 + offset

In 32-bit two's complement, -0xFF0 =  -0x00000FF0 = 0xFFFFF00F+1 = 0xFFFFF010

addr = 0x18 + 8 + -0x0FF0   ; the 8 is because PC is always 8 bytes ahead (i.e. two instructions ahead)
= 0x20 + 0xFFFFF010
= 0xFFFFF030
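The same address arithmetic can be checked with a few lines of C. This is a minimal sketch: pc_relative_target is a hypothetical helper name, and it models only the "instruction address + 8 + offset" rule described above, with unsigned 32-bit wraparound standing in for the two's-complement addition.

```c
#include <stdint.h>

/* Address accessed by "LDR pc, [pc, #offset]" located at instr_addr.
 * In ARM state the PC reads as the instruction's own address + 8.
 * Unsigned arithmetic wraps modulo 2^32, which matches the
 * two's-complement addition done by hand above. */
uint32_t pc_relative_target(uint32_t instr_addr, int32_t offset)
{
    return instr_addr + 8u + (uint32_t)offset;
}
```

Calling pc_relative_target(0x18, -0xFF0) yields 0xFFFFF030, the VICVectAddr address mentioned above.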

Exception Handling:

When an exception occurs, the core:

  1. Copies CPSR to SPSR_<mode>
  2. Sets the appropriate CPSR bits: Mode field bits (to enter IRQ mode). Set IRQ disable flag. FIQ is kept enabled to allow for nesting of FIQ over IRQ.
  3. Maps in banked registers.
  4. Stores the return address, i.e. next instruction to be executed (PC+4) in LR_<mode>
  5. Sets the PC to vector address.
  6. The instruction at the vector address is essentially an instruction that loads the exception handler’s address into the PC. The exception handler address is itself fetched from an offset. That is, the 32 byte interrupt vector block (8 interrupt vectors * 4 bytes each) is often followed immediately by a 32 byte address lookup table.

Note: In step 6, one could have the instruction to directly branch to the exception handler’s address (instead of loading the exception handler’s address into the PC), but the branch instructions support an offset of only 26 bits (64MB address range).

To return, the exception handler needs to:

  1. Restore CPSR from SPSR_<mode>
  2. Restore PC from LR_<mode>

Now step 2. is tricky:

  1. In the case of FIQ or IRQ, when an exception occurs the current instruction is discarded. So, when we return from interrupt, we don’t just restore PC from LR, but PC = LR-4, so that the discarded instruction gets re-executed. This is done by:
    SUBS R15, R14, #4    ; PC = LR-4; the S bit also restores CPSR from SPSR, which switches back to the interrupted mode.
  2. In the case of an SWI interrupt, the current instruction is not discarded, so we just simply restore PC from LR. This is done by:
    MOVS R15, R14        ; Restores the PC from LR
  3. In the case of DAbt interrupt (Data Abort), the exception occurs after the execution of the current instruction (which is the one that caused the exception), thus causing the next instruction to be discarded. So, when we return from the interrupt, we need to re-execute the instruction that caused the exception. Since the LR contained the PC+4 (i.e. the next instruction), we have to roll back to discarded instruction, plus roll back again to the instruction that caused the exception. This is done by:
    SUBS R15, R14, #8

Note in the above, special instructions (SUBS, MOVS,… – i.e. data processing instructions with S-bit set) are used to restore the PC and change the mode at the same time (when the mode changes the CPSR gets restored from SPSR). This is because if the PC is restored before the CPSR is restored (i.e. CPSR still contains the IRQ handler’s state), it will screw things up. If the CPSR is restored (i.e. operating mode is changed) before the PC is restored then the banked LR which contains the PC will be inaccessible.
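The three return cases above boil down to a different adjustment of LR per exception type. A small C sketch of just that arithmetic follows; the enum and function names are invented for illustration, and only the exception types discussed above are covered.

```c
#include <stdint.h>

/* Exception types covered in the notes above (hypothetical names). */
enum exception_type { EXC_IRQ, EXC_FIQ, EXC_SWI, EXC_DATA_ABORT };

/* Value the handler must load into PC on return, given the banked LR.
 * This models only the address adjustment, not the CPSR restore. */
uint32_t exception_return_pc(enum exception_type type, uint32_t lr)
{
    switch (type) {
    case EXC_IRQ:
    case EXC_FIQ:
        return lr - 4u;   /* SUBS R15, R14, #4 */
    case EXC_SWI:
        return lr;        /* MOVS R15, R14 */
    case EXC_DATA_ABORT:
        return lr - 8u;   /* SUBS R15, R14, #8 */
    }
    return lr;            /* unreachable for the values above */
}
```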

Exception Handling (according to Freescale)

  1. Finish current instruction
  2. LR_irq := return link
  3. SPSR_irq := CPSR
  4. CPSR[4:0] := 0b10010   ; Enter IRQ mode (0x12)
  5. CPSR[5] := 0    ; Put the processor in ARM state
  6. CPSR[7] := 1    ; Disable further interrupts
  7. PC := 0x0018    ; Jump to interrupt vector


CPSR bits:

  1. CPSR[31:28] – NZCV (Negative, Zero, Carry, Overflow)
  2. CPSR[7] – IRQ disable (0=enable/1=disable)
  3. CPSR[6] – FIQ disable (0=enable/1=disable)
  4. CPSR[5] – Thumb Mode (you should not set/unset this bit directly)
  5. CPSR[4:0] – operating mode (FIQ, IRQ, System, User, Undefined Instruction)
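The bit layout above, together with the CPSR updates in the IRQ entry sequence, can be written out as C masks. This is a sketch for reasoning about the bits on a host machine, not code that touches a real CPSR; the macro and function names are assumptions for this example. The IRQ mode value is binary 10010 (0x12).

```c
#include <stdint.h>

#define CPSR_MODE_MASK 0x1Fu        /* CPSR[4:0] operating mode */
#define CPSR_T_BIT     (1u << 5)    /* CPSR[5]   Thumb state    */
#define CPSR_F_BIT     (1u << 6)    /* CPSR[6]   FIQ disable    */
#define CPSR_I_BIT     (1u << 7)    /* CPSR[7]   IRQ disable    */

#define CPSR_MODE_USR  0x10u        /* 0b10000 */
#define CPSR_MODE_IRQ  0x12u        /* 0b10010 */

/* Extract the operating mode field from a CPSR value. */
uint32_t cpsr_mode(uint32_t cpsr)
{
    return cpsr & CPSR_MODE_MASK;
}

/* Apply the CPSR changes made on IRQ entry to a copy of the CPSR:
 * enter IRQ mode, switch to ARM state, disable further IRQs.
 * The flags (NZCV) and the FIQ disable bit are left untouched. */
uint32_t cpsr_enter_irq(uint32_t cpsr)
{
    cpsr = (cpsr & ~CPSR_MODE_MASK) | CPSR_MODE_IRQ;
    cpsr &= ~CPSR_T_BIT;
    cpsr |= CPSR_I_BIT;
    return cpsr;
}
```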

R13: Stack Pointer (SP)
R14: Link Register (LR)
R15: Program Counter (PC)


Disable IRQs (set the I bit, CPSR[7]):

mrs r0, cpsr
orr r0,r0,#0x80
msr cpsr_c,r0
mov r0,#1
bx lr

Enable IRQs (clear the I bit, CPSR[7]):

mrs r0, cpsr
bic r0,r0,#0x80
msr cpsr_c,r0
bx lr

Subroutine Link Register

The LR (R14) stores the return address when Branch with Link operations are performed, calculated from the PC. Thus, to return from a linked branch, use either of these (they are the same instruction, since r15 is pc and r14 is lr):
• MOV r15,r14
• MOV pc,lr

Stack Pointer

BL does not push the return address onto the stack; it places it in LR.
A function that calls other functions pushes LR onto the stack on entry.
On return it pops that saved address from the stack (usually straight into PC).

APCS - ARM Procedure Call Standard
    Name    Register    APCS Role

    a1      0           argument 1 / integer result / scratch register
    a2      1           argument 2 / scratch register
    a3      2           argument 3 / scratch register
    a4      3           argument 4 / scratch register

    v1      4           register variable
    v2      5           register variable
    v3      6           register variable
    v4      7           register variable
    v5      8           register variable

    sb/v6   9           static base / register variable
    sl/v7   10          stack limit / stack chunk handle / reg. variable
    fp      11          frame pointer
    ip      12          scratch register / new-sb in inter-link-unit calls
    sp      13          lower end of current stack frame
    lr      14          link address / scratch register
    pc      15          program counter

Types of Stacks

In an Empty stack, the stack pointer points to the next free (empty) location on the stack, i.e. the place where the next item to be pushed onto the stack will be stored.

In a Full stack, the stack pointer points to the topmost item in the stack, i.e. the location of the last item to be pushed onto the stack.

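ARM compiler: push    {fp, ip, lr, pc}
is the same as:  STMFD sp!, {fp, ip, lr, pc}

On this full descending stack the registers end up as if pushed in the order: pc, lr, ip, fp  (i.e. PC is pushed in first, at the highest address, and FP last, at the lowest address, where SP now points). The order in which the registers are written in the list does not matter: the lowest-numbered register always goes to the lowest address.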
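To make the resulting layout concrete, here is a small host-side simulation of a full descending multi-register store in C. It is a sketch under simplifying assumptions: memory is a plain word array, SP is an index into it, and the value stored for PC is simply whatever regs[15] holds (what real hardware stores for PC is not modeled). The function name stmfd_sim is made up for this example.

```c
#include <stdint.h>

/* Simulate "STMFD sp!, {reglist}" on a word-indexed memory array.
 * reg_mask has bit n set when Rn is in the list; regs[] holds R0..R15.
 * sp_index is the word index SP currently points at.
 * Returns the new SP index (the lowest word written). */
unsigned stmfd_sim(uint32_t *mem, unsigned sp_index,
                   uint16_t reg_mask, const uint32_t regs[16])
{
    unsigned count = 0;
    for (int r = 0; r < 16; r++) {
        if (reg_mask & (1u << r))
            count++;
    }

    /* Full descending: SP moves down first, then the block is filled. */
    unsigned new_sp = sp_index - count;
    unsigned slot = new_sp;
    for (int r = 0; r < 16; r++) {
        if (reg_mask & (1u << r))
            mem[slot++] = regs[r];   /* lowest register, lowest address */
    }
    return new_sp;
}
```

With the mask for {fp, ip, lr, pc} (bits 11, 12, 14, 15) and SP starting at index 8, the new SP is 4, fp lands at index 4 and pc at index 7.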

ARM Toolchain – Crosstool

Was able to get an arm-elf toolchain built and working fine, but not so much luck in building an arm-elf-linux toolchain. It cross-compiled programs without errors, but the compiled executable crapped out at runtime. So googling for answers… I came across Dan Kegel’s crosstool – a really cool GNU toolchain builder. It downloads all the correct gcc, glibc, binutils, etc. and builds your toolchain. I built two toolchains, arm-unknown-linux and arm-xscale-linux. The toolchain built with it works great.

On Ubuntu, /bin/sh is not bash by default! Instead it is linked to something called dash. Just make sure you relink /bin/sh to bash instead of dash. No idea when they did this, but I found that out after encountering this maddening error, pointing to some header files during the build:

missing terminating " character.

Using it:

Example (for kernel compilation makefile):

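export ARM_TOOLCHAIN=/opt2/crosstool/arm-unknown-linux-gnu/bin
# assumed: the toolchain must be on PATH so make can find arm-unknown-linux-gnu-gcc
export PATH=$ARM_TOOLCHAIN:$PATH

make ARCH=arm CROSS_COMPILE=arm-unknown-linux-gnu-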

ARM Toolchain

Updated (April 26, 2008)

Here’s my cheat-sheet for building a 64-bit GNU ARM toolchain (cross-compiler x64 to ARM). So far this has been working well for me on an LPC2148 (ARM7TDMI-S), i.e. gcc and gdb via OpenOCD JTAG.


  1. Some builds (like binutils-2.18 and newlib-1.15) needed the setting MAKEINFO=/usr/bin/makeinfo to be passed to the make (binutils-2.17 and newlib-1.16 didn’t need this).
  2. Update: some systems (like Ubuntu 8.10) have strict checking turned on, where warnings are treated as errors. You may need to disable this for the build of binutils and gdb using the --disable-werror configuration option.

Here are the steps:

environment (needed only for build):

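# BINUTILS_VERSION must also be exported (it is used below); its value is not given here
export GCC_VERSION=4.2.3
export NEWLIB_VERSION=1.16.0
export GDB_VERSION=6.8
export DIST=/opt1/gnuarm.dist    # tars will be downloaded here
export WORKDIR=/opt4/gnuarm.tmp    # tars will be unzipped and built here
export GNUARM_HOME=/opt/gnuarm   # Resulting binaries will be installed here
export PREFIX=$GNUARM_HOME       # assumed: PREFIX is used below but never set; this is the stated install location
export SRC=$WORKDIR/src
export BUILD=$WORKDIR/build
export TARGET=arm-elf
mkdir -p $DIST
mkdir -p $SRC
mkdir -p $BUILD
sudo mkdir -p $PREFIX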


cd $DIST


cd $SRC
tar jxf $DIST/binutils-${BINUTILS_VERSION}.tar.bz2
tar jxf $DIST/gcc-${GCC_VERSION}.tar.bz2
tar jxf $DIST/gdb-${GDB_VERSION}.tar.bz2
tar zxf $DIST/newlib-${NEWLIB_VERSION}.tar.gz
cp $DIST/t-arm-elf gcc-${GCC_VERSION}/gcc/config/arm/t-arm-elf


binutils:

mkdir -p $BUILD/binutils
cd $BUILD/binutils
$SRC/binutils-${BINUTILS_VERSION}/configure --target=$TARGET --prefix=$PREFIX --enable-interwork --enable-multilib
make all install 2>&1 | tee make.out

gcc core:

mkdir -p $BUILD/gcc
cd $BUILD/gcc
$SRC/gcc-${GCC_VERSION}/configure --target=$TARGET --prefix=$PREFIX \
--enable-interwork --enable-multilib --enable-languages="c,c++" --with-newlib --with-headers=$SRC/newlib-${NEWLIB_VERSION}/newlib/libc/include

make all-gcc install-gcc 2>&1 | tee make.out


newlib:

mkdir -p $BUILD/newlib
cd $BUILD/newlib
$SRC/newlib-${NEWLIB_VERSION}/configure --target=$TARGET --prefix=$PREFIX --enable-interwork --enable-multilib

make all install 2>&1 | tee make.out

gcc (phase two):

cd $BUILD/gcc
make all install 2>&1 | tee make.out


gdb:

mkdir -p $BUILD/gdb
cd $BUILD/gdb
$SRC/gdb-${GDB_VERSION}/configure --target=$TARGET --prefix=$PREFIX --enable-interwork --enable-multilib

make all install 2>&1 | tee make.out


echo 'export GNUARM_HOME=/opt/gnuarm' >> ~/.profile
echo 'export PATH=$GNUARM_HOME/bin:$PATH' >> ~/.profile

The scripts can be downloaded from here.

ARM boards

Just got back from a trip to Chicago – Oak Park, a renovated “gentrified” neighbourhood, west of Chicago. Just spent the night researching a good ARM board to buy. Narrowed it down to these (all these boards have USB and SD card reader):

ARM7TDMI-S boards:

  • Olimex SAM7-P256 (SAM7-Pxxx Rev. E) – Atmel AT91SAM7S256, 256K Flash, 64K RAM, 60MHz, 18MHz crystal. $87.
  • Olimex LPC-P2148 – NXP LPC2148, 512K Flash, 32K+8K RAM, 60MHz, 12MHz crystal. $77.

ARM7TDMI-S boards w/Ethernet:

  • Olimex SAM7-LA2 – Atmel AT91SAM7A2, 1MB Flash, 4MB SRAM, 30MHz, 6MHz crystal. $140.
  • Olimex SAM7-EX256 – Atmel AT91SAM7X256, 256K Flash, 64K RAM, 55MHz, 18MHz crystal. This board is loaded with stuff. $120.

ARM920T boards w/MMU (for running embedded Linux):

  • Olimex CS-E9302 – Cirrus Logic EP9302, 16MB Flash, 32MB SDRAM, 200MHz. Cirrus Logic has a very good linux forum. $180.
  • Olimex SAM9-L9260 – Atmel AT91SAM9260, 512MB NAND Flash, 64MB SDRAM, 180MHz, 18MHz crystal. $217.
  • … or just get another Slug – XScale IXP420 (ARMv5TE), 8MB Flash, 32MB SDRAM, 266MHz. $80.

Which RTOS?

After scouring through a number of MCUs I’ve decided to go with the Atmel AVR, in particular the ATmega168.

But then, my recent adventure into the Slug’s hardware got me thinking ARM. I settled for Atmel’s AT91SAM7 series of ARM MCUs.

My next thought was how I can run Debian/ARM on it. But unfortunately that was quickly ruled out – as the largest of the 7S or 7X series has only 512K Flash and 128K SRAM. Note that the 7SE series has an external memory bus.

So what can I squeeze into 256K Flash and 64K RAM? My hunt came up with these (in terms of the number of microcontrollers each has been ported to):

  1. FreeRTOS (free)
  2. uC/OS-II (commercial)
  3. eCos (pdf book: Embedded Software Development with eCos)
  4. TNKernel (free)

And these bootloaders:

  1. u-Boot
  2. Redboot
  3. Apex

Choosing a Microcontroller

As I went about choosing a good microcontroller to get started on, here’s my shortlist:

8/16 bit: