This Article Has A Follow-Up
This article has been followed up recently; some additional information may be available in the follow-up post.
Over the last few months, we’ve noticed numerous small and bizarre problems with our server, most of them internal. Here’s a list of the most noteworthy ones, although there were many, many more.
1. Refit of June 5
Working with our server during the refit on June 5th highlighted some small problems we hadn’t noticed before. Logging into the server with a physical mouse & keyboard failed twice using a username and password that was known & verified to be correct. On the third attempt we were able to log in, but the command line loaded notably and worryingly slowly.
We also noticed that the server would sometimes only shut down when a reboot command was issued, that command recall (pressing the up & down arrows to scroll through previous commands) didn’t work, and that changes made to files sometimes weren’t acknowledged immediately.
2. Issues with our internal access control system
We use an internal system we call Universal Client Credentials for access control. This system gives each client (currently ourselves & PRSPXCTVS) a single username and password for website dashboards, file sharing & email client portals.
This system worked well for just over a month, but slowly it began to deny more & more permissions that should have been enabled for each user. This confused us because we hadn’t changed the system in any way: usernames, passwords and user rights were left alone.
3. Filesystem slowdowns
We noticed that the filesystem was becoming slower & slower, to the degree that an SSD connected via USB 3.1 Gen 1 (a.k.a. USB 3.0) was unusably slow, if it worked at all. Bear in mind that this is in comparison to a microSD card read over USB 2.0.
4. “The straw that broke the horse’s back”
The major issue that occurred and caused us to take much more note of these random problems (and triggered us to put our site into Quarantine mode) was with the database software, MySQL.
MySQL worked as expected when used via the websites. Each website’s system user could only access its own website’s database, and required a password to do so.
The problem came when interacting directly with MySQL via the command line. Attempting to connect as any SQL user from the unprivileged Administrator account yielded ‘Access Denied’ errors – even when using the correct passwords.
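Concretely, the failure looked roughly like the session below. The user and database names here are illustrative, not our real ones; the error code and message format are standard MySQL output for a rejected login.

```shell
$ mysql -u sitename_user -p sitename_db
Enter password:
ERROR 1045 (28000): Access denied for user 'sitename_user'@'localhost' (using password: YES)
```

Seeing this error even with a known-correct password was what made us suspect the problem lay with the server itself rather than with our credentials.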
Because we’d started having issues with the database that contains users’ personal data, we determined it was time to put our website into Quarantine mode before the issue could expand to allow improper access. We informed PRSPXCTVS at this point so that they could take any action they felt appropriate.
At this stage, we were giving much more credence to these seemingly random issues and discovered that many more were present than we initially thought. This prompted us to change our plans on June 28th: we decided to swap out our server entirely.
So that’s what we’ve done. We’re now running on a more conventional Dell server, which we tested extensively before & after its installation. We couldn’t find any of these bizarre issues on the new Dell server, so we’ve determined it’s safe to release our website from Quarantine mode.
We’re going to continue monitoring for these bizarre issues recurring, but we have no reason to believe they will return.
We’re not sure what could cause problems like this, and we can’t investigate further as the previous server’s storage was fully erased for security.
We did, however, download the server’s logs and thoroughly audit them for signs of data exposure; we found nothing.
Loss of Data
All data from the original database was transferred to the temporary server at the time of the switchover. The database was then transferred again, at the same time as website service, from the temporary server to the new permanent one, leaving no opportunity for data loss.
Security of Data
All data, with the exception of logs and backups, has been securely erased from both the old and temporary servers. The data was not simply deleted in the same way a user might delete an old photo, as that would leave an opportunity for recovery.
The drives from the temporary and old permanent servers were instead formatted to erase all data, then filled to capacity with random data and formatted again. This procedure was carried out twice per drive to remove any opportunity for recovery.
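The overwrite pass described above can be sketched roughly as follows. This is a sketch only: TARGET here is a small scratch file so the example is safe to run, whereas on the real drives the target was the block device itself (e.g. /dev/sdX), and each pass was followed by a reformat (e.g. mkfs), which requires root and is omitted here.

```shell
#!/bin/sh
# Sketch of the two-pass random-fill used in the erasure procedure.
# TARGET and SIZE_MB are illustrative values for this demo only; on the
# real drives the target was the block device and each pass was followed
# by a reformat, neither of which is reproduced here.
TARGET=scratch.img
SIZE_MB=4

for pass in 1 2; do
  # Fill the target to capacity with random data and flush it to disk.
  dd if=/dev/urandom of="$TARGET" bs=1M count="$SIZE_MB" conv=fsync 2>/dev/null
  echo "pass $pass complete"
done

ls -l "$TARGET"
```

Overwriting with random data (rather than zeros) and repeating the pass is what removes the opportunity for recovery that a simple delete or single format would leave behind.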
The logs and backups from the old server have been stored offline on a USB thumb drive for analysis. The logs contain no personal data and are system-level only, whilst the backups of the database are double-encrypted, as they’ve always been.
The logs and old backups will be kept on the thumb drive until Monday 13th July at the latest, when the thumb drive will undergo the same secure-erasure procedure as the drives from the old and temporary servers.
Our investigation of these logs so far indicates that there was no improper intrusion, nor any indication that one could’ve been possible at any point.
We’ve already been through much of our response. We’ve transferred to a new server and we’ve kept direct data transfer to a minimum. The database was the only thing transferred directly from server to server. The software was all freshly installed and all configuration was rebuilt from the defaults. This was all in an effort to prevent whatever caused these ‘gremlins’ to appear in our server in the first place from transferring to our new server and causing similar problems.
Now that we’ve determined our new server is good to go, after over 12 hours of post-transfer testing alone, we’ll continue to monitor it for similar signs, but our focus will shift to auditing the remaining logs and backups from the original server to find out what the problem actually was.
There are some changes we want to make going forward, however. Firstly, we’re going to change our policy with regards to backups. From today, backups of our server are to be kept for no longer than 14 days.
The old server is NOT to be used again in any capacity by S-City Tech for official business.
Also, software marked as ‘beta’, ‘pre-release’ or anything other than stable is to have no place on S-City Tech’s critical hardware (such as but not limited to the new server) in future.
Finally, any server-related issues are to be given more credence by the S-City Tech team, however minor they may appear. We were able to notice & stop the progression of these issues before they had damaging effects; we may not be so lucky if this were to happen again.
We appreciate all our visitors’ and customers’ patience whilst we work out the kinks in our hardware. We’re still new to self-hosting and we’re always looking to do better.
All users will be notified of this issue, including the fact that no data loss occurred. They can then take action as they see fit.