Incident Postmortem: BSD.am home server @ 3-4 July 2023
Incident Information
Between the hours of Mon Jul 3 03:05:59 2023 and Tue Jul 4 01:10:15 2023 the home server named BSD.am (also known as pingvinashen.am) was completely down.
The event was triggered by a battery issue due to high temperature in the apartment where the home server resides.
The swollen battery released more heat than normal into the system, which caused the computer to shut down.
The event was detected by the monitoring system at mon.bsd.am which notified the operators using email and chat systems (XMPP).
This incident affected 100% of the users of the following services:
Multiple community members contacted the operator (yours truly) asking for an ETA.
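For the record, the length of the outage follows directly from the two timestamps quoted above. A minimal Python sketch, assuming both timestamps are in the same time zone (the post does not state one) and that the English day and month names are parsed under an English or C locale:

```python
from datetime import datetime

# Timestamps as quoted in the incident information. The format string
# matches the classic ctime-style layout used throughout this post.
FMT = "%a %b %d %H:%M:%S %Y"

down_at = datetime.strptime("Mon Jul 3 03:05:59 2023", FMT)
up_at = datetime.strptime("Tue Jul 4 01:10:15 2023", FMT)

# Subtracting two datetimes yields a timedelta.
downtime = up_at - down_at
print(downtime)  # 22:04:16
```

That is a little over 22 hours of total downtime.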
Response
After receiving an email at Mon Jul 3 03:06:49 2023, the Chief Debugging Officer (yours truly) started analyzing the possible issue. According to Monit (mon.bsd.am) all the services were unavailable and the server was not reachable by IP (based on ICMP).
The usual possibility, network failure at the ISP level, was ruled out, as the second home server (arnet.am) was functioning properly.
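The reasoning above (primary host down, neighbour on the same network still up, therefore not the ISP) can be sketched in a few lines of Python. This is only an illustration, not what Monit does: the function names are hypothetical, and it probes a TCP port (443 is an assumption) instead of ICMP, because raw ICMP sockets usually need elevated privileges.

```python
import socket


def tcp_reachable(host: str, port: int = 443, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections and timeouts alike.
        return False


def classify_outage(primary_up: bool, neighbour_up: bool) -> str:
    """Rough triage based on a second host that shares the same uplink."""
    if primary_up:
        return "no outage"
    if neighbour_up:
        # The shared network still works, so the fault is on the host itself.
        return "host-level problem"
    return "possible network/ISP problem"
```

Calling `classify_outage(tcp_reachable("bsd.am"), tcp_reachable("arnet.am"))` during this incident would have pointed at a host-level problem, which matches what the operator concluded by hand.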
The person physically closest to the server was the operator’s sibling (lucy.vartanian.am); however, she did not have a background in Unix system administration or in hardware maintenance. Also, she was asleep.
Hours later the siblings (yours truly) organized a FaceTime call to debug the issues remotely.
The system did boot the kernel properly; however, it would shut down before the services could complete their startup.
Clearly, the machine needed to be shipped to the operator (yours truly) to be debugged on the spot.
So that’s what the team did.
Precise addresses are removed for privacy
Recovery
At the operator’s (yours truly) location, the BIOS logs showed that the system had suffered an ASF2 Force Off. This usually means a thermal problem.
The operator (yours truly) disassembled the laptop, hoping the system only needed a little dust clean-up and fresh thermal paste.
Turns out the problem was actually a swollen battery.
After removing the battery, the system booted fine. Just to be sure that the swollen battery was the root cause, a complete system stress test was run. No issues were detected (well, except “Missing Battery”).
The system was returned to its residence and connected to the internet, and all services were accessible again.
Precise addresses are removed for privacy
Next Steps
If you’re new here, then first of all I’d like to thank you for reading this IR Postmortem article.
Yes, this was an IR Postmortem of a home server of a tiny community in a tiny country. This was not about Amazon, Google, Netflix, etc.
I wrote this for two reasons.
First, I wanted to show you how awesome the actual internet is. You see, when Amazon dies, everything dies with it. Your startup infra, your website, your hobby projects, everything.
When my server dies, only my server dies. And that’s the beauty of the internet. If you can, please, keep that beauty going.
Second, I run a small security company, illuria, Inc., where we help companies harden their environment and recover from incidents. It’s been years since I wrote an IR postmortem personally (my team members who do that are way smarter than me!), and I thought it would be a nice exercise to write it all by myself.
I hope you liked this.
That’s all folks…
https://weblog.antranigv.am/posts/2023/07/incident-postmortem-bsd-am-home-server-3-4-july-2023/
@antranigv Enjoyed this one, thanks for sharing. I’m always curious how others are dealing with things like these. Have you now got anything in place to monitor the temperature of that box?
@hadret @antranigv@weblog.antranigv.am Thank you for your response! Currently I do not have a temp monitor in the room. However, we are planning to move both servers (bsd.am and arnet.am) to a new apartment, where it WILL have a temp and humidity sensor which will be parsable using SNMP and the servers will act accordingly.
And for the machines themselves, my laptop-server does report CPU temp but nothing more, while @inky's server should be able to report the motherboard's temp.
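As a rough illustration of using the CPU temperature mentioned above, here is a minimal Python sketch for FreeBSD. It assumes the `coretemp(4)` or `amdtemp(4)` kernel module is loaded so that the `dev.cpu.N.temperature` sysctl exists and prints a value such as `45.0C`. The function names and the 80 °C limit are hypothetical and not taken from the actual setup.

```python
import subprocess


def parse_temp(raw: str) -> float:
    """Convert sysctl output such as '45.0C' to degrees Celsius."""
    return float(raw.strip().rstrip("C"))


def read_cpu_temp(cpu: int = 0) -> float:
    """Read one core's temperature via sysctl (FreeBSD only)."""
    out = subprocess.run(
        ["sysctl", "-n", f"dev.cpu.{cpu}.temperature"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_temp(out)


def too_hot(temp_c: float, limit_c: float = 80.0) -> bool:
    """Return True once the reading reaches the (assumed) alert limit."""
    return temp_c >= limit_c
```

A periodic job could call `too_hot(read_cpu_temp())` and send an alert before the firmware forces a power-off.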
@antranigv@sigin.fo @antranigv@weblog.antranigv.am @inky Ah, makes sense. Depending on some factors, you might also be able to monitor temperature of your drives — these are the most important things to me running on my NAS and #FreeBSD can monitor them no problem, some details here: https://chabik.com/nuc-temperature-monitoring-w-prometheus-on-freebsd/
I’m looking forward to some follow up posts in the future!