On February 11, at around 7pm, we became aware that the Secure Shell (SSH) service on our Raspberry Pi 4 server had lost its RSA security keys during an update at around 9pm the previous day.
During this period, the website remained available and we were able to work through the web interface on the site pages themselves. Only remote access to the server’s OS was unavailable.
Scheduled maintenance was then planned, as per our internal procedures, to reactivate the SSH service through direct interaction with the server. We expected this to take less than 1 hour, so we allocated 1 hour of downtime to allow ample time to correct the problem. We also planned to use this time to correct a minor issue with the access privileges of the system users, as a precaution in case our server should ever be accessed without authorisation.
During this scheduled downtime, a cascade failure occurred. Firstly, the system logs reveal a split-second disconnection of our encrypted USB storage. This appears to have corrupted files belonging to the database that holds the content and settings of our website, custom scripts we’d written for the administration of the server, and some other minor files.
Secondly, our backup software wouldn’t accept the password we had recorded for decrypting the backup, so we were unable to restore the database to its most recent working state.
And finally, our domain name registrar, Dynu, had an outage in the fibre-optic infrastructure of one of its partners, which left our domain name refusing connections.
The final outage lasted from around 11pm until around 5am. Some residual issues reported after that were resolved via simple means, such as clearing the server cache. Full service restoration was confirmed at 2pm the following day (February 12).
What we’re doing to prevent a recurrence
- We’re changing our server backup solution to prevent us from getting locked out of our backups again
- We’re in talks with Dynu to figure out a fix for the lack of DNS / nameserver redundancy
- We’ve changed the way our encrypted storage is handled by the system to prevent further disconnections
- We’ve instructed the package manager on our system to place a “hold” on the SSH package, so it cannot be upgraded in future without our manual intervention
- We’re continuing to monitor the site and server for signs of a recurrence
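The package “hold” mentioned in the list above could be sketched roughly as follows. This is an illustration only, under assumptions this report does not confirm: a Debian-based system managed with apt, and an SSH server package named openssh-server. The helper names (hold_command, parse_held, is_held) are hypothetical.

```python
import subprocess

# Assumption: a Debian-based system where the SSH server package is called
# "openssh-server". The report names neither the package manager nor the package.
SSH_PACKAGE = "openssh-server"


def hold_command(package: str) -> list[str]:
    """Build the apt-mark command that places a hold on a package."""
    return ["apt-mark", "hold", package]


def parse_held(showhold_output: str) -> set[str]:
    """Parse `apt-mark showhold` output, which lists one held package per line."""
    return {line.strip() for line in showhold_output.splitlines() if line.strip()}


def is_held(package: str) -> bool:
    """Return True if apt-mark currently reports the package as held."""
    result = subprocess.run(
        ["apt-mark", "showhold"], capture_output=True, text=True, check=True
    )
    return package in parse_held(result.stdout)
```

Placing the hold itself needs root privileges, for example by passing hold_command(SSH_PACKAGE) to subprocess.run with check=True. Running apt-mark unhold on the same package releases the hold when a manual upgrade is intended.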
Data loss statement
In this incident, no user data was lost. The database was repaired, not restored to a previous state.
The only data lost as a result of this incident were some internal-use utilities we had written for server management, which can be replaced easily.
We’d like to extend our sincere apologies to our visitors for this incident. We’d also like to extend our gratitude for the patience afforded to us while we investigated and worked on the problems.
We’re always working to improve our services and will use this experience and the knowledge gained from it to help prevent future problems, and resolve them more efficiently when any do arise.
Thank you for reading!