Memory corruption problem? Running ubuntu-bionic-4.14.52
I'm running ubuntu-bionic-4.14.52-00491-g367dd8955e03-dirty-desktop-mali-2018-07-01.img on a LePotato and running into a kernel panic within a few minutes of activity.
The panic occurs at random, but always within a few minutes of startup. I can trigger it sooner by running Firefox and browsing a few websites, but I think that's only because Firefox uses a lot of memory.
Example error messages from a console connected via USB UART:
[ 505.655708] Unable to handle kernel paging request at virtual address ffffdf7f709d543b
[ 505.656378] Internal error: Oops - SP/PC alignment exception: 8a000000 [#1] SMP
[ 505.656381] Unable to handle kernel paging request at virtual address ffffa0b91039eb0c
I will attach the complete sequence of console messages if I can.
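For anyone comparing captured console logs, here is a minimal Python sketch that pulls the timestamp and faulting address out of each "Unable to handle kernel paging request" line. The function name find_paging_faults and the regular expression are my own illustration, based only on the message format shown above; other kernel versions may format these lines differently.

```python
import re

# Matches kernel log lines in the format shown above, e.g.
# [ 505.655708] Unable to handle kernel paging request at virtual address ffffdf7f709d543b
# The pattern is an assumption derived from those sample lines only.
PAGING_FAULT = re.compile(
    r"\[\s*(?P<time>\d+\.\d+)\]\s+Unable to handle kernel paging request "
    r"at virtual address (?P<addr>[0-9a-fA-F]+)"
)


def find_paging_faults(log_text):
    """Return a list of (seconds_since_boot, faulting_address) tuples."""
    faults = []
    for line in log_text.splitlines():
        match = PAGING_FAULT.search(line)
        if match:
            seconds = float(match.group("time"))
            address = int(match.group("addr"), 16)
            faults.append((seconds, address))
    return faults


if __name__ == "__main__":
    sample = (
        "[ 505.655708] Unable to handle kernel paging request at virtual address ffffdf7f709d543b\n"
        "[ 505.656378] Internal error: Oops - SP/PC alignment exception: 8a000000 [#1] SMP\n"
        "[ 505.656381] Unable to handle kernel paging request at virtual address ffffa0b91039eb0c\n"
    )
    for seconds, address in find_paging_faults(sample):
        print(f"{seconds:.6f}s  0x{address:016x}")
```

If the faulting addresses differ widely from one panic to the next, as they do in the two lines above, that is at least consistent with random corruption rather than one repeatable software bug, though it does not prove it.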
Questions:
- is anyone else running into this?
- does this system need a DDR calibration sequence?
- do I have a lemon LePotato board?
Thanks
Comments
Here are the console messages. Every runtime session of ubuntu-bionic-4.14.52 ends this way:
https://gist.github.com/ormike/b8b5c3ff846d721bab741b49d4bcaf84
I don't really think I have a lemon LePotato, because the same hardware can run CoreELEC-LePotato.arm-8.90.4.img with continuous video playback from a playlist on repeat for 24+ hours.
Update: I have run the latest Ubuntu 18.04 image on the LePotato without the memory corruption problem, after installing the heat sink.
@loverpi said in another post: "You need a heatsink in any production environment due to chip hotspots. There is no integrated heatspreader and the packaging isn't sufficient to distribute localized hot spots."
I'm also curious about what @adamg said in another post and I'm wondering if it is related: "(there) is actually a hardware issue, the XTAL_IN is hooked incorrectly on LePotato, a SoC PLL has been used rather than an external crystal, probably as a cost saving measure (5 cents?). This could be fixed with some PHY calibration and the proper schematic which I don't have. A simple way to test is to increase CPU load during transfer, ethernet will die because PLL will wane."
I'm not entirely reassured, since the memory corruption was seen when the temperature wasn't even getting above 37 C with a 3.3 V fan and no heat sink. With the fan and heat sink it sits at 32 C.
$ cat /sys/class/hwmon/hwmon0/temp1_input
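The hwmon sysfs temp*_input files report millidegrees Celsius, so a raw reading of 37000 means 37 C. Below is a small Python sketch for converting the value. The helper names are my own, and the default path is just the one from the command above; the hwmon index can differ between kernels and boards.

```python
from pathlib import Path

# Path taken from the command above; the hwmon index may differ on other setups.
DEFAULT_SENSOR = "/sys/class/hwmon/hwmon0/temp1_input"


def millidegrees_to_celsius(raw):
    """Convert a raw hwmon reading (millidegrees Celsius, as text) to degrees C."""
    return int(raw.strip()) / 1000.0


def read_temp_celsius(path=DEFAULT_SENSOR):
    """Read the sensor file and return the temperature in degrees C."""
    return millidegrees_to_celsius(Path(path).read_text())


if __name__ == "__main__":
    if Path(DEFAULT_SENSOR).exists():
        print(f"{read_temp_celsius():.1f} C")
    else:
        print(f"sensor file not found: {DEFAULT_SENSOR}")
```

Keep in mind this is the value at the sensor location only, so it says nothing about hot spots elsewhere on the die.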
This is the heat sink I used: https://www.loverpi.com/collections/libre-computer-project/products/libre-computer-board-heatsink-for-aml-s905x-cc-and-all-h3-cc
The issue @adamg mentioned should be related to the measuring tool used and not the board itself. If XTAL_IN was off by that much, the entire board would not even boot.
Transistors usually start working unreliably above 100C. The advertised maximum transistor junction temperature is 125C. The SoC die itself is significantly smaller than the package (the black plastic that hides the wires going to the pads) and is prone to getting extremely hot. Without a heatsink, localized temperatures can easily be multiple tens of degrees higher than at the sensor.