Thermal runaway - monitor temperature variance #23373
Thanks John for quick response to this safety issue. |
Hey @pillopaolo,
|
I can't help but think that this issue is caused by a hardware bug. The only conditions that can effectively lead to the issue you (and others) are experiencing, is that:
That's why I asked what kind of sensor you're using, since only thermistor-based sensors are updated by the ISR. If you look at the first post in #20749 by @zenturacp, all temp sensors stop updating, not just the active hotend. This most probably means that update_raw_temperatures() (temperature.cpp@2943) is not being invoked anymore. I can't find a code path that may lead to this situation, unless the 2 conditions I mentioned above are true. Since hw serial communication seems to trigger the issue (Octoprint, TFTs), a hw interrupt problem sounds probable. Especially since all reports so far are about SKR boards. |
Yeah, sure, it can be a hardware bug. The odd thing was that mine was both, as you mentioned. The weird thing was that it was the exact same GCODE file every time. I have never seen it since. Now I'm running Klipper and have never tried to reproduce the issue there, not even with the same GCODE file, so something tells me it could be software, or Klipper handling the hardware differently. |
What about using counter that always increase when Temperature::isr() called? We will know if the Temp Timer stop by compare previous counter vs current counter. |
Yeah, an ISR counter (or even better, millis() since last invocation) is most probably an important debug info to log to the serial once we detect the problem. I don't want to assume what the cause is and what debug info would be useful to log, I'll leave that to @thinkyhead or someone more involved than I am. |
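A minimal sketch of the counter / "millis() since last invocation" idea, for illustration only. `IsrHeartbeat`, `tick()` and `stalled()` are hypothetical names, not Marlin code, and the timestamp is passed in as a parameter so the logic can be exercised off-target. On real hardware `now_ms` would come from `millis()`, and this assumes 32-bit reads are atomic on the MCU.

```cpp
#include <cstdint>

// Hypothetical helper (not part of Marlin): records when the temperature ISR last ran.
struct IsrHeartbeat {
  volatile uint32_t count = 0;    // incremented on every ISR invocation
  volatile uint32_t last_ms = 0;  // timestamp of the most recent invocation

  // Call at the top of the ISR.
  void tick(uint32_t now_ms) {
    count = count + 1;
    last_ms = now_ms;
  }

  // Call from the main loop. Unsigned subtraction stays correct across
  // a rollover of the millisecond counter.
  bool stalled(uint32_t now_ms, uint32_t timeout_ms) const {
    return static_cast<uint32_t>(now_ms - last_ms) > timeout_ms;
  }
};
```

Logging `count` and the elapsed time to serial when a problem is detected would show whether the timer stopped or kept running.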
What if ISR is called, but not properly executed? Or the HW fails to report the temperature for some reasons (issue with temperature sensor, digital input, etc.). |
I did add an external monitor that has a connection to Octoprint and monitors the deviation in temperature: if it is 0 over 1 or 2 minutes, something is really not being updated, since no PID is 100% stable. That way my watchdog does not rely on anything at all. |
@zenturacp that's actually a nice idea for an Octoprint plugin, which would do firmware-independent and detailed cross-check of various vitals. |
If the ISR is running but not executing properly, then the problem may be in analogRead(). |
We can use both methods. Check temperature variance and isr counter at once. |
ISR is a time-sensitive routine, I'd rather leave this decision to @thinkyhead. |
My point is:
I am just trying to say that, no matter what the cause is (ISR, analogRead(), temperature capped by the thermistor table as in my previous incident #17327, etc.), the "freeze" test would reliably detect the issue. This is what I mean with being a ROBUST solution to this and other potential future issues. |
I believe it was originally intended for the thermistor tables to have "crazy" values at the start and end so that any readings outside the expected range would produce an obvious error. (See |
As far as I remember, no matter how the table was made, temperature.cpp in past Marlin versions (e.g. Marlin 1.1.8 Jun-2018, as described in #17327) used to do an extrapolation above the max temperature value found in the table. Then in later versions (e.g. Marlin 2.0 Mar-2020) this extrapolation was removed and the temperature value capped = max temperature in table, very dangerous. Not sure how temperature.cpp is now, I did not check. In the past 2 years I simply modified thermistor1.h and added a "crazy" value as in thermistor11.h. |
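To make the difference between capping and flagging concrete, here is a small sketch of a table lookup that reports an out-of-range reading instead of silently returning the table's maximum temperature. It is not how temperature.cpp works; `TableEntry`, `kTable` and `raw_to_celsius()` are hypothetical, and the table values are invented for illustration.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <optional>

struct TableEntry { uint16_t raw; float celsius; };

// Illustrative values only, not a real thermistor table. Sorted by ascending
// raw value; with an NTC thermistor and pull-up, lower raw means hotter.
constexpr std::array<TableEntry, 4> kTable{{
  {  100, 300.0f },
  {  400, 200.0f },
  {  800, 100.0f },
  { 1000,   0.0f },
}};

// Returns the interpolated temperature, or std::nullopt when the raw value
// lies outside the table, so the caller can raise an error instead of
// trusting a capped temperature.
std::optional<float> raw_to_celsius(uint16_t raw) {
  if (raw < kTable.front().raw || raw > kTable.back().raw) return std::nullopt;
  for (std::size_t i = 1; i < kTable.size(); ++i) {
    if (raw <= kTable[i].raw) {
      const TableEntry &lo = kTable[i - 1];
      const TableEntry &hi = kTable[i];
      const float t = float(raw - lo.raw) / float(hi.raw - lo.raw);
      return lo.celsius + t * (hi.celsius - lo.celsius);
    }
  }
  return std::nullopt;  // not reached: the range check above covers all cases
}
```

Adding a "crazy" value at the end of a table, as in thermistor11.h, achieves a similar effect without changing the lookup code.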
@pillopaolo after your recent incident report, I felt unsafe too (you know, long, unattended prints...). Also, had some free time to kill. So yeah, it was a pretty selfish act ;) Have a happy and prosperous new year, everyone! |
I understand that. Thank you for taking the time (and risk) to test it, let's hope it'll lead us somewhere. |
I don't know if I can add anything here, but I have just jumped from 2.0.9.2 (which is working perfectly) on an MKS Sgen-L v1 LPC1768 board up to the latest 2.0.9.3, and I get thermal errors every time I try to print. PID tune on he0 and he1 both work as they should, and I can set the bed to 80c and it will stay within 0.5c either side of the set temp for hour upon hour of testing. As soon as I ask the machine to print anything, up to 1 min into the print it errors out with The only way to get this to stop is by commenting out #define THERMAL_PROTECTION_VARIANCE_MONITOR or to bump up I have a few partial logs if they might be of any use to anyone. In the logs you can see the temps are stable and moving a few .x degrees up and down, so no readings seem to be stuck as per some of the above posts. And just as I type, it's gone and thrown the error out mid print... back to 2.0.9.2 again. |
@CBDesignS your temperatures are exceptionally stable (that's amazing btw, what kind of bed / sensor are you using?). Either increase |
@zeleps I see a potential problem: if the temperature variation is smaller than the ADC resolution, the VARIANCE test might give a false positive, especially with older boards (10 bit = 1024 points). |
@pillopaolo you can't auto-tune a PID system so well (as to have such temperature stability that remains invariant for a long time) on such a coarse temperature resolution. If you look at @CBDesignS's logs, you'll notice that the bed has a resolution of 0.04°C, which is quite high res (probably a 15bit ADC). The temperature seems extremely stable, but there are still minor fluctuations; it's just that there aren't any in a given 40'' window. I can't imagine how a hotbed can keep the temperature stable to within 0.04°C for 120''. There should probably be some more details in the comments of THERMAL_PROTECTION_VARIANCE_MONITOR about how it works and how to tune it. @CBDesignS's case is clearly a false positive, but I think it can work if the periods are bumped up sufficiently (within a safe range, of course). Also, if the problem occurs even after setting higher values for the thermal protection periods, THERMAL_PROTECTION_VARIANCE_MONITOR can always be disabled. |
@zeleps The heated bed is a standard 12v Creality 310x310 bed with a 4mm Creality glass build plate, and as far as I know it's a 100k NTC thermistor set as (1 : 100kΩ EPCOS - Best choice for EPCOS thermistors). It is powered by its own 30A 12v PSU, and I removed the cheap Creality foam insulation from the bottom and replaced it with a custom made "sheep's wool insulation pad" that holds the temperature. Up until this PR I have never had to change any thermal settings, as the defaults worked out of the box and still do work in 2.0.9.2. I will change THERMAL_PROTECTION_PERIOD to 60 and THERMAL_PROTECTION_BED_PERIOD to 120 as you suggested to see if that helps. |
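For reference, the settings being discussed would look roughly like the fragment below. It uses only the names and values quoted in this thread; I am assuming they sit in Configuration_adv.h next to the other thermal protection options, and suitable periods will differ per machine.

```cpp
// Values quoted in this thread, not recommended defaults.
#define THERMAL_PROTECTION_VARIANCE_MONITOR   // comment out to disable the variance check
#define THERMAL_PROTECTION_PERIOD      60     // hotend detection window, in seconds
#define THERMAL_PROTECTION_BED_PERIOD 120     // bed detection window, in seconds
```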
Since this is a false positive issue, it proves an important point: nothing is obvious. Your case is a bit extreme (absolutely stable bed temp for 40''? give me a break...), but it shows that this new feature should be pre-configured better if it's going to be turned on by default, otherwise it's destabilizing the system. I'll come up with something and return. Just out of curiosity, did you try preheating the hotend / bed (without a print) for a while? If yes, how come it didn't crash? Thermal protection is running regardless of printing, and just having heaters on for a while should have the same result. |
I never preheated the bed or extruder, as I was used to just asking it to print and off it went. I could set a temperature on the bed and it would quite happily get up to temp and sit there for hours. The longest PID tune of the bed I used was 9 cycles and it passed every time. Setting THERMAL_PROTECTION_PERIOD 60 and THERMAL_PROTECTION_BED_PERIOD 120 has let me slow print a benchy in 1hr 16mins with the bed at 80c and extruder at 230c. I also agree that this has to be stable if turned on by default, because every Tom, Dick and Harry will start flooding git, fb, reddit etc. screaming that Marlin has broken their printer and wanting it fixed because they need to print this week's random tiktok sponsored articulating toy (and to save ellensp or thinkyhead having to close shed loads of tickets). Thank you devs and knowledgeable people for the help getting this going in the right direction, because the more protection available the better. |
First of all you don't have to downgrade, just comment out |
@zeleps Your quote was "But my opinion is that you won't be seeing the issue if you managed to print for more than an hour with 60/120''." So since you were confident I would not have any more problems, I thought why not give it a try. |
I'm sorry about your print, I know how frustrating it is. I'm thinking of some tweaks that would allow you to avoid disabling the feature completely, and I'll come back with a new PR for these, but for the moment it's probably best if you comment out Please be aware that this covers a - rather rare - issue where temperatures stop updating but heaters remain powered. So far this issue has occurred in SKR1.4 boards afaik, so it probably doesn't affect you, and in any way you're not in a worse place than you were with 2.0.9.2. If you want to know more, please check the reports in issue #20749. |
@zeleps No worries. I know the risks, and I had planned on staying beside the printer for the whole of the print today just in case anything did go wrong. Today's print was a first prototype part printed in cheap, nasty PETG, so nothing was lost, as I was doing some general tidying up beside the printer. I have an SKR 1.4 Turbo board in another printer, but I have not updated its firmware for years as it is rock solid. I ask it to print and off it goes. The SKR 1.4 uses an LPC1769, which is a very similar chip to the LPC1768 on this MKS Sgen L v1. Could they share architecture and bugs between them? |
I really don't know that. I don't want to assume things. We need more data on the issue, and hopefully this variance monitor will bring more cases to light. |
Hi @zeleps, as just discussed, the problem reappeared yesterday. I had already started and partially aborted prints several times. Thermal runaway detections also occurred before, so I had to reset the printer manually. However, this did not happen during a print, but only between two prints. By the way, Octoprint was running continuously, so I only reset the printer. The actual undetected thermal runaway happened this time during the warm-up phase. So there was no print (yet). Please find the attached log files, as requested. |
Considering every log entry shows changes, it does appear to be getting a temperature update from that sensor, which is curious. I've seen a poor connection cause similar results.
But then from here : to here : Almost 10 minutes without a change on the bed temp reading. I see you have TMC drivers in UART mode. I'm curious whether the UART comms are taking priority over the temp ISR, as it can be a long-running blocking operation. Does anyone else reporting this have UART or SPI stepper drivers? |
Ok, two things:
@InsanityAutomation, this is interesting, I'm using a UART configuration and had this issue a long time ago, but it hasn't occurred for months. The commit I'm posting here will clearly show if there are any delays in the ISR, since the counter values will be correlated to log timestamps, so let's wait and see what comes up. |
We had a fruitful session with @MakerMeik last night, where the problem manifested itself in all its glory. Here are the serial logs, which show temperatures stuck at 2022-03-07 21:29:18,861, without any heating taking place. Please note that the version running is this, which displays some debug info every couple of seconds. Findings:
No heaters were on at the time, so this went undetected. But this time, I had some int32 incremental counters added at 3 different points, one at the beginning of the temp ISR, one after bed ADC measuring and one at every watchdog refresh. All three counters continued increasing normally while the temperatures remained stuck, so the only possible explanation is that ADC was returning the same values for each sensor. I'm not familiar with LPC1769's ADC reading mechanism, I'll look into it later, but this is strong evidence towards a HAL issue, which might well explain why the issue occurs only on SKR boards so far.
If someone's more experienced with LPC1769, it would be a great help if they could provide some insight, now that we have more data on what's working and whatnot (and we know it's not the ISR), as well as towards what to debug next. I'm pretty positive that we're close to identifying and fixing this long-standing issue. |
I made a graph of the first derivative of the counters over time, and this is the result: (I scaled the values to have the graphs separated vertically) As you can see, the ISR (blue) and bed measuring (orange) occur at very stable intervals and have some minor periodic fluctuations, while watchdog refreshes (gray) are more irregular (expected). The watchdog spike is where homing is issued (also normal). The big spike is the restart. |
Some new findings came up today. I gave @MakerMeik a version that prints out the whole LPC1768 ADC block, and this is what came up: Normal operation:
Readings get stuck:
As you can see, channels 1 and 2 are 'on' (Pins 0.24, 0.25), and there is no change to the ADCR register. ADC1 and 2 contain a normal reading, but the DONE flag is not set, and they don't update anymore from that point on. Thermal malfunction occurred normally after 35''. Here are the full logs, issue occurred at 18:05:35. This seems to be an ADC issue. Whether it's a hardware malfunction that occurs randomly or it's triggered by something else, I can't really tell. If there's someone more experienced with LPC MCUs, it would be interesting to hear their opinion on the issue. Also, please note that the Arduino LPC library does not check "doneness" of the values on , so the fact that the DONE bit (31) is 0 goes undetected by Marlin. IMHO this should be changed, and the fact that values are not ready to read for a long time should be a hard error. Thoughts? |
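A rough sketch of what checking "doneness" could look like, written as pure functions on a raw register value so it can be tested off-target. The DONE bit position (31) is the one mentioned above; the 12-bit result in bits 15:4 is my reading of the LPC17xx user manual and should be treated as an assumption. `adc_result_if_done()` and `AdcStaleGuard` are hypothetical names, not part of the Arduino LPC library or Marlin.

```cpp
#include <cstdint>
#include <optional>

// Assumed layout of an LPC176x ADC data register value:
// bit 31 = DONE, bits 15:4 = 12-bit conversion result.
constexpr uint32_t kAdcDoneMask = uint32_t{1} << 31;

// Returns the 12-bit result only if the DONE flag is set.
std::optional<uint16_t> adc_result_if_done(uint32_t reg) {
  if ((reg & kAdcDoneMask) == 0) return std::nullopt;
  return static_cast<uint16_t>((reg >> 4) & 0x0FFFu);
}

// Counts consecutive reads without DONE so the caller can raise a hard error.
struct AdcStaleGuard {
  uint16_t misses = 0;

  // Returns true once more than max_misses consecutive reads lacked DONE.
  bool update(bool done, uint16_t max_misses) {
    misses = done ? uint16_t{0} : static_cast<uint16_t>(misses + 1);
    return misses > max_misses;
  }
};
```

The threshold for `max_misses` would need to be chosen against the ISR rate, which is not covered here.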
There have been a number of reports on an issue (#20749) regarding temperature readings that get stuck while heaters remain on. I've noticed the same issue twice on my rig in the past, but I've moved on to a MAX31865 sensor and the issue has not appeared since.
It seems that the issue appears on SKR boards, and it might be related to interrupts being disabled at some point, which is the only obvious explanation to why thermistor values stop updating. Probably affects watchdog as well, since it never fires up. Since it hasn't occurred on my printer after I started using the MAX31865, it sounds like it has to do with the ISR (MAX31865 readings are independent of the ISR). This is a difficult bug to reproduce, and also a serious fire hazard, because heaters continue heating indefinitely.
I came up with this solution, after a request by @pillopaolo to implement a mechanism that would trigger a thermal runaway kill. The feature is built inside the thermal protection mechanism, and it's only active during the stable heating phase on all thermally protected heaters, which ensures that there will be a temperature fluctuation due to PID / bangbang. Also, it seemed ok to use the same detection window as TEMP_PERIOD, which is an acceptable time window for thermal runaway defined for each heater. Takes BOGUS_TEMPERATURE_GRACE_PERIOD under consideration, so _temp_error() persists until temperature starts varying or kill() occurs.
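As a simplified illustration of the idea (not the code in this PR), the check can be thought of as: while a heater is in its stable heating phase, remember the last time the reading changed, and flag an error if it has not changed for longer than the detection window. `VarianceMonitor` and its members are hypothetical names, the grace period handling is left out, and time is passed in as a parameter so the sketch can run off-target.

```cpp
#include <cstdint>

// Hypothetical sketch of a "temperature stopped varying" detector.
class VarianceMonitor {
 public:
  explicit VarianceMonitor(uint32_t period_ms) : period_ms_(period_ms) {}

  // Feed one reading. Returns true when the reading has stayed identical
  // for longer than the period while the heater is in stable heating.
  bool update(float celsius, uint32_t now_ms, bool heating_stable) {
    // Exact comparison is intentional: identical raw ADC values convert to
    // identical temperatures, so any real update changes the value.
    if (!heating_stable || !primed_ || celsius != last_) {
      last_ = celsius;
      last_change_ms_ = now_ms;
      primed_ = heating_stable;
      return false;
    }
    return static_cast<uint32_t>(now_ms - last_change_ms_) > period_ms_;
  }

 private:
  uint32_t period_ms_;
  uint32_t last_change_ms_ = 0;
  float last_ = 0.0f;
  bool primed_ = false;  // false until a reading is taken in the stable phase
};
```

A longer period makes false positives on very stable beds less likely, at the cost of slower detection.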
I've tested this with timed disablement of the temperature raw value updates (and it works fine), but I can't reproduce the original issue, so it remains to be seen if my solution will work when it occurs.
I strongly suggest that it is enabled by default, better safe than sorry.
Requirements
Thermal protection must be enabled.
Benefits
Detection of potential fire hazard. Will probably help to pinpoint the original issue, especially if some related state data is written to the serial prior to killing (I leave this to someone more engaged with the project).
Configurations
Any THERMAL_PROTECTION setting will do.
Related Issues
#20749