Network performance tuning on PC Engines APU2

I have a pair of PC Engines APU2 devices running OpenBSD as firewalls. They're getting a bit long in the tooth, but they work well, and since my internet uplink is 100Mbit/s they are still more than fast enough to support that. They have, however, become the limiting factor for copying around the VM images that I use for OpenSSH testing, capping out at around 300Mbit/s on local copies. This annoyed me enough that I wanted to do something about it.

Since the platform is pushing ten years of age, the obvious thing to do would be to update the hardware. Unfortunately there is no obvious successor to the APU2 series in the "long term supported, low power, serial console, multi-network amd64" hardware category. Besides, the APU2 supports jumbo frames, and testing with those showed it could already send at close to a gigabit, so it seemed like it should be able to receive at that rate too.

I knew from previous experience that a large amount of the cost is per-packet overhead, and since the hosts, the firewall and the switch all supported jumbo frames, that was the obvious thing to try. Unfortunately the results were disappointing: jumbo frames did not significantly improve performance. This turned out to be due to a couple of things: a previously unknown bug in the em(4) driver on specific chips that caused ethernet frames of very specific sizes around mbuf boundaries to be truncated, which invalidated some tests, and poor hardware performance seemingly caused by PCIe power management.
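For reproducing this kind of test, tcpbench(1) from the OpenBSD base system is convenient. A minimal sketch, assuming em1 is the interface under test and that both hosts and the switch accept jumbo frames (the interface and host names here are placeholders):

# enable jumbo frames for the duration of the test; add "mtu 9000" to
# /etc/hostname.em1 to make it persistent
$ doas ifconfig em1 mtu 9000

# on the receiving host
$ tcpbench -s

# on the sending host; prints throughput once per second until interrupted
$ tcpbench receiver.example.org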

When testing, there are quite a few variables to be aware of and, if possible, control for:

BIOS updates

Update the BIOS to coreboot v4.9.0.2 or newer to get the Core Performance Boost feature. I'm using v4.19.0.1.
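The coreboot version that's actually running shows up on the bios0 lines of the OpenBSD dmesg, so there's no need to reboot into the BIOS just to check it; the output will look something like:

$ dmesg | grep '^bios0'
bios0: vendor coreboot version "v4.19.0.1" date ...
bios0: PC Engines apu2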

Enter the BIOS settings via F10 during boot, then make sure you have the following:

Enabled: Core Performance Boost
Disabled: PCIe power management features

For reference, the full setup menu, where "l" and "y" are the relevant options:

r Restore boot order defaults
n Network/PXE boot - Currently Disabled
u USB boot - Currently Enabled
t Serial console - Currently Enabled
k Redirect console output to COM2 - Currently Disabled
o UART C - Currently Enabled - Toggle UART C / GPIO
p UART D - Currently Enabled - Toggle UART D / GPIO
m Force mPCIe2 slot CLK (GPP3 PCIe) - Currently Disabled
h EHCI0 controller - Currently Disabled
l Core Performance Boost - Currently Enabled
i Watchdog - Currently Disabled
j SD 3.0 mode - Currently Disabled
g Reverse order of PCI addresses - Currently Disabled
v IOMMU - Currently Disabled
y PCIe power management features - Currently Disabled
w Enable BIOS write protect - Currently Disabled
z Clock menu
x Exit setup without save
s Save configuration and exit
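Another variable in the results below is OpenBSD's hw.setperf sysctl, which sets the CPU performance level. For testing it can be pinned by hand, or held at maximum by running apmd in high-performance mode:

# one-off, for a test run
$ doas sysctl hw.setperf=100
hw.setperf: 0 -> 100

# persistent: have apmd keep hw.setperf at 100
$ doas rcctl set apmd flags -H
$ doas rcctl restart apmd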

Results

The raw data rate of gigabit ethernet works out to 1e9/8/1024/1024 = 119.2MByte/s. There's a per-frame overhead of about 38 bytes: preamble (8), source and destination MACs (12), ethertype (2), CRC (4) and interframe gap (12). If you're using 802.1q VLANs, the tag adds another 4. The TCP and IP headers add another 20 bytes each. This gives a theoretical maximum throughput for a TCP stream over gigabit ethernet at 1500 and 9000 byte MTUs of:

1500 MTU: (1500-40) / (1500+38) * 119.2MByte/s ≈ 0.95 * 119.2MByte/s ≈ 113.2MByte/s

9000 MTU: (9000-40) / (9000+38) * 119.2MByte/s ≈ 0.99 * 119.2MByte/s ≈ 118.2MByte/s
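The same arithmetic for any MTU, as a bc(1) one-liner (40 being the TCP and IP headers, 38 the per-frame overhead):

$ mtu=1500; echo "scale=2; ($mtu - 40) * 119.2 / ($mtu + 38)" | bc
113.15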

MTU   hw.setperf  CPB       PCIe power mgmt  pf                    MByte/s
1500  0           disabled  enabled          pass in quick on em1     32.6
1500  0           disabled  enabled          set skip on em1          32.5
1500  100         disabled  enabled          pass in quick on em1     33.5
1500  100         disabled  enabled          set skip on em1          36.5
1500  0           enabled   disabled         pass in quick on em1     30.0
1500  0           enabled   disabled         set skip on em1          38.9
1500  100         enabled   disabled         pass in quick on em1     46.8
1500  100         enabled   disabled         set skip on em1          55.7
9000  0           enabled   disabled         pass in quick on em1    107.0
9000  0           enabled   disabled         set skip on em1         109.0
9000  100         enabled   disabled         pass in quick on em1    111.0
9000  100         enabled   disabled         set skip on em1         111.0
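For reference, the two pf configurations in the table correspond to ruleset lines like these, with the rest of the ruleset omitted; "set skip" excludes the interface from pf processing entirely, which is why it generally benchmarks slightly faster:

# filtered: a minimal stateful pass rule
pass in quick on em1

# unfiltered: take pf out of the path for em1
set skip on em1

Reload with pfctl -f /etc/pf.conf after switching between them.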
