
Writing mostly about computers and math.


I got an Arduino Uno for Christmas this year and I've been playing around with it a little bit over the last few days. While trying to get a simple PWM program to work, I noticed that it starts acting weird at higher frequencies. The square wave actually produced by the microcontroller drifts farther and farther away from the expected signal the higher the frequency gets. Here's some oscilloscope output to show you what I mean. All of the signals should be 5 Vp-p and have a 50% duty cycle. The scope is set to 5V/div in every image.

PWM Output

[Figure: using digitalWrite at 1 kHz, 500 µs/division]

[Figure: using digitalWrite at 10 kHz, 50 µs/division]

[Figure: using digitalWrite at 100 kHz, 10 µs/division]

Even using my terrible oscilloscope, it's pretty clear that something is going on that throws the timing way off at higher frequencies. One period of the 100 kHz signal should take up one division at that timebase, but instead it takes four (the Gibbs phenomenon is probably just due to the low bandwidth of my scope). So, what could be going on that makes the Arduino spend 20 µs on each half-cycle instead of 5 µs?

First of all, here's the code I used to generate these figures:
/*
Very, very, very basic PWM test code
*/

int pwm_pin = 4; // Pin to output the signal on
long int pwm_freq = 100e3; // PWM frequency in Hz
int period = int(1e6/float(pwm_freq)); // Period of the PWM signal in microseconds
int half_period = period/2; // Half the PWM period in microseconds

// Set up our pin
void setup() {
  // Output mode
  pinMode(pwm_pin, OUTPUT);
}

// Do this forever
void loop() {
  digitalWrite(pwm_pin, HIGH); // Output high
  delayMicroseconds(half_period); // Wait half a period
  digitalWrite(pwm_pin, LOW); // Output low
  delayMicroseconds(half_period); // Wait half a period
}

You can see it's pretty basic; it just uses digitalWrite to switch pin 4 from high to low halfway through each period. At first I thought that maybe the problem was in delayMicroseconds, but the Arduino documentation says this should be accurate down to 3 µs, so it shouldn't have a problem at 100 kHz (5 µs between transitions). The only other function I call on every loop iteration is digitalWrite, so the problem must lie there. I Googled around and, sure enough, lots of other people have encountered this behavior.

There is another technique we can use to write to the digital pins on the Arduino board. Instead of digitalWrite, I tried using the port registers directly. This is much more complicated and brittle, but there's no other way that I could find to get high-speed signals to come out of the digital I/O pins. The AVR microcontrollers used on the Arduino boards have three 8-bit registers called PORTB (digital pins 8-13), PORTC (analog pins 0-5), and PORTD (digital pins 0-7) that correspond to the physical analog and digital pins on the board. I made a diagram to explain:

ATmega168 PORTB-D registers

Each of the eight bits in the register corresponds to a pin. For example, in the register that contains digital pin 4, PORTD, the lowest bit corresponds to pin 0, the next bit corresponds to pin 1, etc. Here are some more diagrams to help explain:

The port registers and the pins that correspond to each bit

So, for the equivalent of digitalWrite(4, HIGH);, we need to set pin 4 to 1 and for digitalWrite(4, LOW);, we need to set it to 0. The PORTD variable contains all eight bits in the register, so we'll have to make sure not to upset the other bits when we assign a value to the register. We can do this by using C's bitwise operators on the PORTD variable, like this:
PORTD |= B00010000;
PORTD &= B11101111;

You can see that we used two different numbers to manipulate the PORTD variable. In the first line, we OR every bit except number 4 with 0, leaving their values unchanged. We OR bit 4 with 1, making its value 1, or HIGH. In the second line, we do basically the opposite, ANDing every bit except 4 with 1, which doesn't change their values, and ANDing bit 4 with 0, setting its value to 0 or LOW.

The nice thing about this way of manipulating the pins is that you can change more than one pin at once. For example, if I wanted to set pins 3, 5, and 7 to high, I could write: PORTD |= B10101000; Or, if I want to set pins 2, 3, and 6 high, I can do PORTD |= B01001100; I can also switch every pin off at once by writing PORTD &= B00000000; or even PORTD = 0;

Here, then, is what my loop() function looked like using PORTD instead of digitalWrite. Note that I used hex instead of binary in my code, but the values are the same.
void loop() {
  PORTD |= 0x10; // Set pin 4 to 1
  delayMicroseconds(half_period); // Wait half a period
  PORTD &= 0xef; // Set pin 4 to 0
  delayMicroseconds(half_period); // Wait half a period
}

And here's what the output signal looks like:

[Figure: using a port register at 100 kHz, 5 µs/division]

That's more like it. Each half of the square wave takes exactly 5 µs, which is what we expect for a 100 kHz signal with a 50% duty cycle. Obviously, using the port register is much, much faster than using digitalWrite, but why?

Disassembling Some Binaries

To find out, I thought I might try to disassemble the binaries that get put on the Arduino. I did this on Ubuntu (apt-get install arduino arduino-core), but all the same tools should come with the Windows and Mac versions of the Arduino IDE.

On Linux, the Arduino IDE stores binaries generated by the verify button in /tmp/build*.tmp/. In this folder, there should be an ELF file that contains the object code. The Arduino package includes special AVR versions of a bunch of binutils, so I used avr-objdump to get a look at the assembly code. Specifically, avr-objdump -d <YOUR FILE>.cpp.elf will give you the assembly from the object file.

Virtually all of the code is the same in both versions of the program, so I'll focus on the parts that are significantly different: the loop function and the digitalWrite function itself. First, here is the assembly for the loop function in both versions of the code: the digitalWrite version first, followed by the port register version.

00000118 <loop>:
 118: cf 93        push r28
 11a: df 93        push r29
 11c: c4 e0        ldi r28, 0x04 ; 4
 11e: d1 e0        ldi r29, 0x01 ; 1
 120: 61 e0        ldi r22, 0x01 ; 1
 122: 88 81        ld r24, Y
 124: 0e 94 2c 01  call 0x258 ; 0x258 <digitalWrite>
 128: 80 91 06 01  lds r24, 0x0106
 12c: 90 91 07 01  lds r25, 0x0107
 130: 0e 94 ac 01  call 0x358 ; 0x358 <delayMicroseconds>
 134: 60 e0        ldi r22, 0x00 ; 0
 136: 88 81        ld r24, Y
 138: 0e 94 2c 01  call 0x258 ; 0x258 <digitalWrite>
 13c: 80 91 06 01  lds r24, 0x0106
 140: 90 91 07 01  lds r25, 0x0107
 144: df 91        pop r29
 146: cf 91        pop r28
 148: 0c 94 ac 01  jmp 0x358 ; 0x358 <delayMicroseconds>

00000104 <loop>:
 104: 5c 9a        sbi 0x0b, 4 ; 11
 106: 80 91 06 01  lds r24, 0x0106
 10a: 90 91 07 01  lds r25, 0x0107
 10e: 0e 94 37 01  call 0x26e ; 0x26e <delayMicroseconds>
 112: 5c 98        cbi 0x0b, 4 ; 11
 114: 80 91 06 01  lds r24, 0x0106
 118: 90 91 07 01  lds r25, 0x0107
 11c: 0c 94 37 01  jmp 0x26e ; 0x26e <delayMicroseconds>

Well. I was surprised at how much shorter the code is when you take out those digitalWrites, but it makes a lot of sense when you break it down. First I'll briefly explain what the code does without digitalWrite since it's shorter.

Using the Port Registers

The first line, sbi 0x0b, 4, is exactly equivalent to the line PORTD |= 0x10 in the C source. The instruction sbi A,b sets bit b in I/O register A to 1, so sbi 0x0b, 4 sets bit 4 in I/O register 11 (the PORTD register) to 1. Easy.

The next three lines set up and make our call to delayMicroseconds. The lds Rd, k instruction is a load instruction; it copies the byte in SRAM at address k into register Rd. In this code, we're loading the values at 0x0106 and 0x0107 (the low and high bytes of half_period) into registers 24 and 25, respectively. Then, we call the function at 0x26e, which is delayMicroseconds.

After that, we see the instruction cbi 0x0b, 4. This should look familiar; it's the opposite of the sbi instruction. This particular instruction sets the value of bit 4 in I/O register 11 to 0.

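After that, we have another call to delayMicroseconds. This one uses jmp instead of call: since it's the last thing loop does, delayMicroseconds can return straight to whoever called loop, and the loop repeats.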

Using digitalWrite

The digitalWrite version of the code is a little more than twice as long as the port register version, so let's take a look at the parts that are different.

Right off the bat, we have two push instructions. These instructions push the named registers onto the stack, in this case registers 28 and 29. For the unfamiliar, the stack is basically a structure that allows values to be stored in the much larger SRAM rather than in the comparatively small registers. The push instructions here put the values in registers 28 and 29 into memory somewhere so that they can be retrieved later (for example, the two pop instructions at the end of the function do this). Elsewhere in the program these values were initialized to 0, so those values are added to the stack at this point.

So, following these two push instructions, we have two ldi instructions. These are load instructions just like the lds we saw earlier, but they load a constant value onto a register rather than loading a value from SRAM. Here, we're loading the values 4 and 1 into registers 28 and 29, which we'll talk about next. After this, we load the value 1 into register 22.

The next line, ld r24, Y, is a little complicated. It's a load instruction, so it puts a value from memory into the named register. The interesting thing about this instruction is the special value Y. There are three of these values, X, Y, and Z. Each one refers to a pair of registers: r26 and r27 for X, r28 and r29 for Y, and r30 and r31 for Z. Each pair holds a 16-bit memory address with the high byte in the higher register and the low byte in the lower one. In this case, r29 holds 0x01 and r28 holds 0x04, so we're loading the value at address 0x0104 into register r24. That address holds pwm_pin, so r24 ends up containing the pin number. This value is used by digitalWrite to determine which port register it needs to write to, but we'll get to that in a bit.

After that, things are largely the same as in the code that uses port registers. We call digitalWrite at 0x258, then load the values at 0x0106 and 0x0107 into r24 and r25 and call delayMicroseconds. After that, we load 0 into r22, repeat the indirect load into r24, and call digitalWrite again before restoring r28 and r29 and jumping to delayMicroseconds for a final time.

What Does digitalWrite Do?

This is the complicated part. The digitalWrite function looks like this:
00000258 <digitalWrite>:
 258: 0f 93        push r16
 25a: 1f 93        push r17
 25c: cf 93        push r28
 25e: df 93        push r29
 260: 1f 92        push r1
 262: cd b7        in r28, 0x3d ; 61
 264: de b7        in r29, 0x3e ; 62
 266: 28 2f        mov r18, r24
 268: 30 e0        ldi r19, 0x00 ; 0
 26a: f9 01        movw r30, r18
 26c: e8 59        subi r30, 0x98 ; 152
 26e: ff 4f        sbci r31, 0xFF ; 255
 270: 84 91        lpm r24, Z
 272: f9 01        movw r30, r18
 274: e4 58        subi r30, 0x84 ; 132
 276: ff 4f        sbci r31, 0xFF ; 255
 278: 14 91        lpm r17, Z
 27a: f9 01        movw r30, r18
 27c: e0 57        subi r30, 0x70 ; 112
 27e: ff 4f        sbci r31, 0xFF ; 255
 280: 04 91        lpm r16, Z
 282: 00 23        and r16, r16
 284: c9 f0        breq .+50      ; 0x2b8 <digitalWrite+0x60>
 286: 88 23        and r24, r24
 288: 21 f0        breq .+8       ; 0x292 <digitalWrite+0x3a>
 28a: 69 83        std Y+1, r22 ; 0x01
 28c: 0e 94 ca 00  call 0x194 ; 0x194 <turnOffPWM>
 290: 69 81        ldd r22, Y+1 ; 0x01
 292: e0 2f        mov r30, r16
 294: f0 e0        ldi r31, 0x00 ; 0
 296: ee 0f        add r30, r30
 298: ff 1f        adc r31, r31
 29a: ec 55        subi r30, 0x5C ; 92
 29c: ff 4f        sbci r31, 0xFF ; 255
 29e: a5 91        lpm r26, Z+
 2a0: b4 91        lpm r27, Z
 2a2: 9f b7        in r25, 0x3f ; 63
 2a4: f8 94        cli
 2a6: 8c 91        ld r24, X
 2a8: 61 11        cpse r22, r1
 2aa: 03 c0        rjmp .+6       ; 0x2b2 <digitalWrite+0x5a>
 2ac: 10 95        com r17
 2ae: 81 23        and r24, r17
 2b0: 01 c0        rjmp .+2       ; 0x2b4 <digitalWrite+0x5c>
 2b2: 81 2b        or r24, r17
 2b4: 8c 93        st X, r24
 2b6: 9f bf        out 0x3f, r25 ; 63
 2b8: 0f 90        pop r0
 2ba: df 91        pop r29
 2bc: cf 91        pop r28
 2be: 1f 91        pop r17
 2c0: 0f 91        pop r16
 2c2: 08 95        ret

Ugh. This one is 54 lines, which is a bit more than I care to analyze a line at a time. Briefly, digitalWrite starts by checking whether a valid pin number was provided. It then determines whether PWM is enabled on the pin and disables it if it is. Next, it determines which port register contains the given pin (register 24) and sets its value to HIGH or LOW (register 22). This ends up requiring 54 lines of assembly, not including the 35 lines required for the potential call to turnOffPWM and the extra 10 lines of assembly added to our main loop. This is compared to 1 line of assembly if we want to write directly to the port register ourselves.

Some Actual Numbers

So, the ATmega328P on the Arduino Uno board is clocked at 16 MHz, which means that (generously assuming an average of 1 cycle per instruction) each instruction will take 62.5 ns. This means that we should expect a call to digitalWrite to take somewhere between 3 µs and 6 µs, depending on whether or not turnOffPWM is called. Writing to the port register, on the other hand, should take about 125 ns since the sbi and cbi instructions each take two clock cycles. This is enough to account for the timing issues we saw at 100 kHz, but I thought I'd measure it anyway to see what the difference actually is. Here's a table I made by timing and averaging 1,000,000 calls to digitalWrite and 1,000,000 port register manipulations:

Method          Time (ns)
digitalWrite    6005
Port Register   440

There's some overhead from the for loop I used, but the conclusion is still the same: writing directly to the port register is a little more than an order of magnitude faster than using digitalWrite. You sacrifice a lot of flexibility, but if you need to switch the digital pins faster than once every few microseconds it's really the only option.
