Category: TECHNICAL

Linux AER errors on NVMe – ACPI Platform Error Interface (APEI)

I’ve recently experienced a number of errors on my Proxmox server related to the NVMe drives.

{1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 514
{1}[Hardware Error]: It has been corrected by h/w and requires no further action
{1}[Hardware Error]: event severity: corrected
{1}[Hardware Error]:  Error 0, type: corrected
{1}[Hardware Error]:  fru_text: PcieError
{1}[Hardware Error]:   section_type: PCIe error
{1}[Hardware Error]:   port_type: 0, PCIe end point
{1}[Hardware Error]:   version: 0.2
{1}[Hardware Error]:   command: 0x0406, status: 0x0010
{1}[Hardware Error]:   device_id: 0000:a1:00.0
{1}[Hardware Error]:   slot: 0
{1}[Hardware Error]:   secondary_bus: 0x00
{1}[Hardware Error]:   vendor_id: 0x2646, device_id: 0x5013
{1}[Hardware Error]:   class_code: 010802
{1}[Hardware Error]:   bridge: secondary_status: 0x0000, control: 0x0000
nvme 0000:a1:00.0: AER: aer_status: 0x00002001, aer_mask: 0x00000000
nvme 0000:a1:00.0:    [ 0] RxErr                  (First)
nvme 0000:a1:00.0:    [13] NonFatalErr           
nvme 0000:a1:00.0: AER: aer_layer=Physical Layer, aer_agent=Receiver ID
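The aer_status word in that log can be decoded by hand. As a sketch, the two reported errors correspond to bits 0 and 13 of the value 0x00002001; the bit positions follow the PCIe correctable-error status register layout, and the names follow the kernel’s own AER strings:

```shell
# Decode the correctable AER status register from the log above.
# Bit layout per the PCIe correctable-error status register;
# names match the Linux kernel's AER strings.
status=0x00002001   # aer_status value from the dmesg output
decoded=""
for entry in 0:RxErr 6:BadTLP 7:BadDLLP 8:Rollover 12:Timeout 13:NonFatalErr; do
  pos=${entry%%:*}
  name=${entry#*:}
  if [ $(( (status >> pos) & 1 )) -eq 1 ]; then
    decoded="$decoded $name"
  fi
done
echo "set bits:$decoded"   # prints: set bits: RxErr NonFatalErr
```

On a live system, rasdaemon can log and decode these events continuously, which beats eyeballing dmesg.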

After looking into this, I found a series of articles and posts about the issue:

Sadly, none of these posts seems to have a solid answer. Even worse, most replies just tell people to silence the errors, or even to disable the error-recovery features entirely (absolutely bonkers).

Follow up

For my server, I’ve found this issue is almost entirely related to the NVMe drives overheating. I’ve installed some thin heatsinks, and the issue has almost completely resolved itself.
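If you want to check whether heat is the culprit on your own drives, the controller temperature is reported in the NVMe SMART log (via nvme smart-log /dev/nvme0 or smartctl -a /dev/nvme0). Here is a sketch of pulling the temperature field out; the excerpt is a hypothetical smart-log output for illustration, not a capture from my drives:

```shell
# Hypothetical `nvme smart-log` excerpt; on a real system replace this with:
#   smart_log=$(nvme smart-log /dev/nvme0)
smart_log='critical_warning                        : 0
temperature                             : 52 C
available_spare                         : 100%'

# Extract the temperature value, trimming surrounding whitespace.
temp=$(printf '%s\n' "$smart_log" | awk -F':' '/^temperature/ {gsub(/^ +| +$/, "", $2); print $2}')
echo "controller temperature: $temp"   # prints: controller temperature: 52 C
```

Anything creeping toward the drive’s warning threshold under sustained load is a good hint that a heatsink will help.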

I cannot confirm whether this is true for the others, but I’ve got a feeling it’s the root cause for my server, considering how hot these NVMe storage devices get at Gen 4 / Gen 5 speeds.

Welcome to techno hell. Everything needs a heatsink now.

PCIe 4.0 vs. PCIe 3.0 SSDs Benchmarked | TechSpot

Investigating NVME LBA sizes and formatting for performance

I grabbed four 1TB Kingston KC3000 NVMe disks, and as I plan on using them in a Proxmox server, I’ve investigated their LBA (Logical Block Addressing) format support to gain better performance by using the larger native block size.

Bonus fact: Some NVMe drives also have LBA formats that support on-disk metadata; this can be left unused for ZFS or similar filesystems that maintain their own metadata.

Prerequisite

Install nvme-cli on Debian/Ubuntu (optional)

apt install nvme-cli

nvme-cli (with human readable output)

nvme id-ns -H /dev/nvme0n1

The LBA Format section is the important part for now.

Notice: there are two LBA Formats (512 and 4096). 512 is the format currently in use, but 4096 is marked as giving better relative performance.
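To pull just those lines out and pick the largest native block size programmatically, something like this works. The sample text is a hedged reconstruction of what nvme id-ns -H prints for this drive (two formats, 512 in use, 4096 rated better), not a verbatim capture:

```shell
# Hedged sample of the LBA format lines from `nvme id-ns -H /dev/nvme0n1`;
# on a real system: lbafs=$(nvme id-ns -H /dev/nvme0n1 | grep 'LBA Format')
lbafs='LBA Format  0 : Metadata Size: 0   bytes - Data Size: 512 bytes - Relative Performance: 0x2 Good (in use)
LBA Format  1 : Metadata Size: 0   bytes - Data Size: 4096 bytes - Relative Performance: 0x1 Better'

# Pick the format index with the largest data size ($3 = index, $12 = data size).
best=$(printf '%s\n' "$lbafs" | awk '$12+0 > max { max = $12+0; idx = $3 } END { print idx }')
echo "largest-block LBA format index: $best"   # prints: largest-block LBA format index: 1
```

That index is what gets passed to nvme format as --lbaf below.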

smartctl (with all information)

smartctl -a /dev/nvme0

Again, the supported LBA sizes are the important part.

Format the NVMe disk to use a different LBA

Warning: nvme format is destructive and wipes the namespace, so back up any data first. --lbaf=1 selects the second LBA format from the list above (4096 bytes here).

nvme format --lbaf=1 /dev/nvme0n1

After that, re-run the commands above and confirm the new LBA format is marked as in use.
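Checking which format is now flagged (in use) can be scripted the same way; again, the excerpt here is a hypothetical post-format output rather than a capture from my drives:

```shell
# Hedged sample of `nvme id-ns -H /dev/nvme0n1` output after formatting;
# on a real system: line=$(nvme id-ns -H /dev/nvme0n1 | grep 'in use')
line='LBA Format  1 : Metadata Size: 0   bytes - Data Size: 4096 bytes - Relative Performance: 0x1 Better (in use)'

# $3 is the LBA format index on these lines.
in_use=$(printf '%s\n' "$line" | awk '/in use/ {print $3}')
echo "LBA format in use: $in_use"   # prints: LBA format in use: 1
```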

Additional Resources

https://wiki.archlinux.org/title/Advanced_Format

TPM 2.0 Firmware Upgrade

I’ve recently started building a 2U rack server, and went to install the 20-pin TPM module that I had spare. While the unit worked fine, I noticed it was running the original (oldest possible) firmware version.


Finding the latest firmware releases

The Trusted Computing Group has a list of verified firmware releases for all of the known TPM chips.

For example, my chip is an Infineon SLB 9665, and that shows the latest firmware is 5.63.


Infineon OPTIGA™ SLB 9665 TPM2.0

Data Sheet: https://www.infineon.com/dgdl/Infineon-data-sheet-SLB9665_2.0_Rev1.2-DS-v01_02-EN.pdf?fileId=5546d462689a790c016929d1d3054feb

I was able to find both 5.62.3126.2 and 5.63.3353.0, but the upgrade path from 5.0.1089.2 is not perfect: my firmware version ends in `.2`, while the latest upgrade image targets `.0`, and I have no idea if that’s a problem.

TPM20_5.0.1089.2_to_TPM20_5.62.3126.2.BIN then in theory TPM20_5.62.3126.0_to_TPM20_5.63.3353.0.BIN?


Firmware Bundle

Infineon TPM Firmware Update Tools release version is 01.01.2481.00.
TPMFactoryUpd in this release is version 01.01.2212.00.
IFXTPMUpdate.efi in this release is version 01.01.2212.00.
TVicPort.sys in this release is version 4.0.

This file contains what I believe is the (unmodified) Infineon TPM Firmware Update Tools package, but I cannot give any guarantee of this. (I’ve added additional firmware images to this 7z.)

If you have a more recent version of this package, please forward it to me.


Updating the SLB 9665

Part 1: Error 0xE0295507

I tried to update the TPM unit I had from the oldest firmware to the latest, and kept hitting this `platformAuth is not the Empty Buffer` error.

This seems to happen because the TPM (even after it’s been cleared) has been accessed by the computer you’ve attached it to, and you apparently have to clear and then flash the TPM chip before the BIOS/kernel/OS gets its grubby little fingers into it…

There is a forum post about people disabling the TPM in the BIOS before they can flash, but this didn’t help me on my Supermicro server motherboard; it just makes the TPMFactoryUpd tool fail with a missing-TPM error.

Untested: I’ve heard Windows has a PowerShell command, Disable-TpmAutoProvisioning, to stop Windows from activating any TPM it sees. (I have no idea if that works to resolve the issue above.)

Part 2: Success

<TO BE CONTINUED>



© 2024 Shlee
