On Saturday 8 July 2023, user accounts started disappearing from the Vivaldi Social Mastodon instance. What was going on, how did this happen, and what were the consequences?
This is a very long blog post, but to be fair, this was also to be a very long weekend.
If you want to skip to the conclusion, there’s a TL;DR (too long; didn’t read) section at the end.
Something’s not right
It was around 17:25 Oslo time (CEST) on the Saturday that I first noticed something was wrong. I’d just got home from a bike ride, and when I happened to check my Vivaldi Social tab, it suddenly asked me to log in again. “Unusual”, I thought, but I wasn’t immediately alarmed by it. But then when I did log in, I saw that my home timeline was now completely empty. I quickly reached out to my colleagues.
“doing anything with mastodon? my home timeline is suddenly empty”Me, to my fellow sysadmins – Saturday 8 July 17:26 CEST
My fellow sysadmin Hlini very quickly got back to me. No work was ongoing, and his account was showing the same symptoms as mine. He offered to start heading to a computer so he could help me, an offer which I gratefully accepted.
By 17:32, another colleague outside of the sysadmin team had also noticed the same issue. I started to look into the database to see what was going on.
Something bad has happened
Looking at the database I could see that the affected accounts had apparently been deleted, and then recreated as a completely new account when the user logged back in.
Immediately, I started looking to see what database backups were available. As expected, we had a nightly backup from 23:00 UTC on Friday night. I started copying the file to somewhere I could make use of it.
While I was waiting for the backup file to copy, I started checking the database for other users that might be affected. Jón von Tetzchner’s account and another one that I checked had also been deleted, but had not yet been recreated, likely because those users had not tried to log back into their accounts yet.
By this time, Hlini had arrived at a computer and started looking into things with me.
I started checking the web server logs for account deletion requests, but nothing matching the account deletions showed up; and then I realized something else was odd about these deletions.
Normally when an account is deleted in Mastodon, the username is permanently reserved as unusable. If you were to try to create a new account with the same name as a deleted account, it would not allow it (since, due to the nature of the Fediverse, having a new account with the same address as an old one would not be a good thing).
But in the case of these deletions, we were getting reassigned the exact same usernames, so these could not be not normal deletions.
By 18:39, Hlini had figured out the pattern: all accounts with an ID lower than 142 (ie. the oldest accounts) were missing from the database.
We hadn’t seen any discussion from other Mastodon server admins about anything like this, and we wondered if this could be something unique to our setup – after all, Vivaldi Social uses vivaldi.net accounts for logins (thanks to Mastodon’s OAuth support) instead of the normal signup and login system of Mastodon. We started considering asking the Mastodon developers for help, and we also started discussing strategies for restoring the lost data from the backup.
Something bad is happening right now
At 19:10, I checked the database again, and I saw that all accounts with an ID lower than 217 were now missing from the database, and that number was increasing. This meant that accounts were still being actively deleted from the database.
By this point we both agreed that we needed more help, so at 19:18 we contacted the Mastodon developers. We immediately got a reply from Renaud, and he pinged Claire and Eugen to enlist their help.
Stemming the flood
At 19:20, Hlini restarted all of the docker instances in our Mastodon setup. The deletions seemed to stop the moment he did this. The lowest ID in the database was now 236.
Fortunately it turned out that it would stay that way.
The investigation begins
198 accounts in total had been deleted during the course of this incident, and over the next few hours, together with the Mastodon devs, we started looking into what could be going on. On Eugen’s suggestion, we looked into the possibility of it being the
UserCleanupScheduler deleting accounts that were “unconfirmed”, but this was eventually ruled out, as the deleted users could never have matched the query that it operated on.
Since we had upgraded to Mastodon 4.1.3 just 48 hours before the incident occurred, the Mastodon devs looked into all the code changes between v4.1.2 and v4.1.3 to see if anything there could be related. They even (and I cannot credit them enough for this) went the extra mile and looked through our published changes to see if any of the changes we had made could possibly lead to this. The conclusion though was that none of the changes could have triggered anything like this.
At the suggestion of Renaud and Eugen, we checked the filesystem to see if the deletions were being done directly in the database, or if they were being triggered by Mastodon itself. We could see that the avatar and header images for the deleted accounts had themselves also been deleted. This meant that the deletions had to be coming from the Mastodon application itself.
We also started looking for signs of system intrusion, since it was certainly a possibility that this was some kind of deliberate attack. I spent some time checking the various logs that we had available to us, but I didn’t find anything (though in these cases, the absence of evidence can never rule out the possibility).
Because Mastodon v4.1.3 included a security fix, the devs also looked into the possibility of a related exploit, for which we combed through the logs, and examined the filesystem for evidence of such an attack. Again though, nothing was found.
We debated whether we should take Vivaldi Social offline altogether while we continued the investigation. The Mastodon devs gave arguments in both directions:
- In favour of taking it offline: if we have to roll back the database to the backup, then more content will be lost the longer we keep it up.
- In favour of keeping it running: if it is an attack, and it resumes, it might give us more opportunity to investigate how it’s being done.
We ultimately decided to keep it running. In truth what swung the decision that way was probably not the balance between the above arguments, but just a simple fact of us being sysadmins…
Humans need sleep
The hours were passing by, and soon it was starting to get late on Saturday night. Claire wrote and provided for us a patch that we could deploy to add some logging for account deletion actions. I added the patch to our local repo and Hlini started the build process going.
I let the others know that I was going to at least attempt to get some sleep, sentiments which the others also shared. At 00:15 I went off to bed. At 00:29 Hlini reported that he had deployed the patched version and also went to get some rest after checking that everything was still working.
So ended the first day.
Dawn of the second day
Sleep did not come easily, since so much was going through my head, but I eventually managed to get maybe a few hours of actual sleep, before getting back to work somewhere around 09:30 on Sunday.
The Mastodon devs were already awake and had asked us some more follow-up questions, so I set about answering those, checking the admin actions log, detailing our API usage, doing more searching of the log files and doing my best to provide details about our production environment.
By this point, although they were certainly keeping an open mind about the possibility of this being a software bug, the devs felt it more likely that this problem was being caused by some kind of attack. Because of this, I made sure that our internal security team at Vivaldi was up-to-date with everything that had occurred and been discovered so far. This would actually turn out later to lead to a lucky break in our investigation…
Some time after midday I also started to think about our options for data restoration. At 12:26 I asked the devs if they had any advice for how to do a selective restoration, and in particular how to deal with accounts that had been recreated. I then started looking at the database relations myself to figure out which tables would likely need to be recovered. At 12:54 I began the first steps of developing a script to restore the lost data.
The lucky break
Roughly an hour into my development at 13:56, Yngve (one of our security experts at Vivaldi) replied to my earlier message about the current state of the investigation. He reported that his profile page in Vivaldi Social was not loading properly. A series of HTTP 500 errors were being reported when he tried to view it. I checked the database, and could see that Yngve’s account was not one of the 198 accounts that had been deleted, so this was very curious. I checked his profile page myself, and also saw the same errors.
I looked into the Mastodon logs to see if I could find any sign of these errors, and found indeed that there were error messages. I forwarded these findings on to the Mastodon devs. Claire replied, asking for the full stacktraces for the log entries, which I was able to also extract from the logs.
At the same time, another of my sysadmin colleagues, Ísak, started looking in to the logs with me. He discovered that there were now a lot of errors like these in the logs, and there was a common element to all of them. Each of the errors referenced an account on another Mastodon instance – the same account and instance in every single case. For the purposes of this retelling of the story, we’ll call the other instance
Based on this information, Claire started directing me to return the results from various specific database queries. One of the queries that we ran was to select all statuses (aka. posts or toots) from the database belonging to the mystery account on
That query returned 17600 rows.
Either this was an incredibly prolific Mastodon user, or something else very strange had happened.
At 14:43, via querying the database and comparing to the backup I confirmed that all of the statuses for all of the deleted accounts had been reassigned to this 1 user on
Initially Claire suggested that maybe we were seeing an incredibly rare case of PostgreSQL database corruption, and I was absolutely willing to entertain any possibility at this point, but just a few minutes later, realization struck.
At 15:00, having started developing a theory, Claire asked me to check for
AccountMergingWorker entries in the Mastodon worker logs. Several more database queries followed, and also some commands run via the Rails console. By this point she was pretty sure that the account merge worker had for some reason kicked in and started merging all accounts into the 1 remote account, but she had not yet pinned down what exactly that reason could be.
Ísak dug back in to the web server logs for the day before, looking for the first sign of any contact from the
social.example.com instance. He found that the instance had contacted Vivaldi Social at 17:15:33 on Saturday. This was just 10 minutes before I had first noticed problems with my account at the start of this saga.
Claire now started asking questions about our database setup. Specifically, did we have any kind of PostgreSQL replication configured. I was able to confirm that yes, we did indeed have a 2 server replication setup.
This turned out to be the final piece of the puzzle.
At 17:28, Claire put forward her theory as to what had happened:
- Vivaldi Social was notified by
social.example.comabout a username change for one of their accounts.
- The new account was created in our database with a
nullvalue in the
- The new account then got its
URIset to the correct value for the remote account.
- A run of the
AccountMergingWorkerwas scheduled (via Redis), to merge data from the old account to the new.
- Because of delays in the database replication, steps 3 and 4 effectively happened in the wrong order, purely due to bad luck with the timing.
- The AccountMergingWorker dutifully started merging all accounts with a matching
URIvalue into the new account.
And it just so happens that all local accounts in a Mastodon instance have a
null value in their
URI field, so they all matched.
Claire theorized that such a thing would be more likely to occur if the database were under heavy load that caused extended replication delays.
“What are the next steps?”
With a very probable cause now identified, Jón asked what the next steps should be. We agreed that the best course of action was for me to return to focusing on recovery efforts, while Claire would write a patch that we could apply to prevent this reoccuring.
Also with Renaud’s help, I was able to provide further details about the replication setup, which confirmed that the worker processes were configured in a way that they could perform database reads against the standby server (using a tool called Makara). This information led all 3 of the devs to agree that we had almost certainly now found the root cause.
Within just a few minutes, Claire provided a patch for us, which I pushed to our repository.
The recovery process was possibly even more of a rollercoaster ride than the problem itself.
At 16:09, Jón and I had a discussion about the best course of action for recovery. Given the amount of perceived data corruption that had occurred, I was now leaning towards doing a full rollback of the database, and Jón agreed that was probably the way to go. After helping out the Mastodon devs with the replication investigation, I took a much needed break. At 17:03, the sysadmin team had a quick discussion and together we agreed to proceed with the full database restore.
A full restore was actually not going to be as simple as we hoped. We knew already from the earlier restore of the backup database that there was one part of the process that would take a very long time, due to a known performance issue that affected our instance. So instead of restoring directly, we planned to convert the .dump file into a .sql file, and edit the .sql file to replace the offending query with a more performant one.
The conversion of the .dump file finished at 17:23, then the problem was how to perform the edits. Editing a 54GB text file comes with its own set of problems. Most text editors will struggle with that kind of size.
At the same time as I was pondering the best strategy for editing the file, Claire started suggesting a strategy for doing a selective database recovery. I explained that we had decided to do a full restore, but when we discussed the reasoning for this (the difficulty in knowing what data had been corrupted), I eventually became convinced that perhaps a selective restore would be possible after all. She agreed to help out by providing full details of what database tables the account merge operation will have touched.
In the sysadmin team we decided to split our efforts. I would work on the selective restoration efforts, while Hlini would carry on working on the full restore procedure. I continued working on the script that I had started on earlier in the day.
Patches and configuration changes
Before he could work on the restoration process, Hlini also had the task of applying the patch that we’d been given, and also changing our database configuration so that we weren’t using this (no-longer-recommended) replication setup. At 17:58, he deployed the changes, and something went wrong. We had our first (and only) full downtime of Vivaldi Social in that weekend. By 18:18, Vivaldi Social was back up and running, and by 18:44 the changes were deployed again, this time successfully.
So, we were now reasonably confident that we were protected against this incident happening again. Now the only thing we needed to focus on was getting the data back.
Recovery work continues
At 19:25, I told the other sysadmins that I had hit a wall in my scripting process. I was having technical issues with the PDO module that I was using to interact with the database, and I needed another pair of eyes to look at it. I showed my work so far to Ísak to see if he could spot my obvious mistake, but he didn’t find anything immediately obvious. By 19:55, I decided to give up completely on the script.
Instead I decided to help out Hlini with his task. He was now in the process of editing the 54GB .sql file.
His first attempt was to edit the entire file with
vim. I checked the changes that he was applying, and they seemed to look ok.
I tried running the import of the resulting .sql file, and it quickly returned an error. It turned out that the file that
vim ended up saving was for some reason truncated at 629MB, so we decided to try another approach. I split the file into 2 chunks to make editing it easier:
head -c50000 dump.sql > split
tail -c+50001 dump.sql > split2
and Hlini applied his changes again. I then merged the files together (after removing the trailing newline that vim had added):
head -c-1 split > split3
cat split3 split2 > dump-patched.sql
and when that was done, attempted another import.
Unfortunately that attempt also failed.
For the third attempt, I split the file precisely around the section to be edited, so it was split into 3 pieces. This attempt also failed in the same way as the previous one, but by doing the split in this way I was able to generate a sensible diff of the edited section, and that allowed me to spot the error we had made (a case of mismatched paratheses).
At 21:26, on the fourth attempt, we finally started running the import of the .sql file in preparation for the full restore.
While the import was running, I asked Ísak to have another go at trying to figure out what was wrong with my code, and he started digging into it.
The bug in my code
At 22:50, Ísak got back to me. He’d figured out what I’d done wrong. Like most bugs it was a stupid mistake on my part, related to how I had bound the parameters of one of the database queries (by reference instead of by value). In my defence, I was very tired by this point.
Now that the problem had been found, I decided once again to start working on the selective restoration process (the database import was still running).
By 23:04, I had managed to finish the first part of the script, which would fix the user, account and identity records of the 198 affected users. I used the script to restore mine and Ísak’s profiles first. We saw that the avatar and header images were now missing (as they had been deleted from the filesystem), but otherwise the profiles looked good. With advice from Ísak, I adjusted the script to force Mastodon to refetch the avatar image when the user next logs in to Vivaldi Social.
By 23:55, I had finally finished my work on the script. In theory it would now be able to return everyone’s statuses, follows, followers and all other relationship data that referenced the 198 users to the state it was in before the incident.
Selective restoration begins
I soon realized that due to database relationship constraints we would need to do the restoration in 2 stages, fixing the user/account/identity records for all 198 users first, before then proceeding with the rest of the data.
At 00:08, I completed the first stage of this, fixing the accounts of all the affected users.
Next I tried running the second stage fixes for just a single account (my own). This worked, but my profile now showed the same issues that Yngve’s had done earlier on Sunday. Checking with Ísak, we realized this was because I had boosted posts that belonged to one of the other affected users.
Seeing no other option than to go ahead and run the second-stage fixes for all 198 users, I took the plunge and started it going at 00:46.
At 00:55 the script hit an
UPDATE query that failed due to a duplicate key. I figured out that this was because the user in question had logged in to their account and set up some of their follows again. I modified the script to handle this scenario by deleting records that could not be restored in this way and keeping the newer record intact.
The script was soon humming away again, gradually fixing all of the corrupted and lost data.
The anticipation of success
By 01:14, Ísak reported that he could now view all of his posts, although his profile was not yet working.
By 01:18 though, my profile page was finally working again! This meant that all the accounts that I had boosted posts from had now been restored. The mood in the sysadmin team was now extremely buoyant.
“holy shit guys, this is actually working”Me, to my fellow sysadmins – Monday 10 July 01:18 CEST
“holy moly my profile looks ok”Hlini, to his fellow sysadmins – Monday 10 July 01:26 CEST
The script finished its final task at 01:27 on Monday morning. Things were looking pretty good now. The only thing that was noticably off was that the Home feed of the affected users was now empty. I was fairly sure that this was just a simple indexing issue, and asked Hlini if he could fix it. At 01:40, Hlini reported that the re-indexing of the home feeds had completed.
And there it was. My home feed, and the home feeds of all 197 other affected users were now back as they were, almost as if nothing had ever happened.
The full rollback and restore never became necessary.
Emotionally I was now on an insane high after the devastatingly exhausting and stressful period that we had all just been through. I thanked the sysadmin team for all of their help, and the Mastodon devs for all of theirs. I knew that I desparately needed sleep, but that in my current state I would not be able to do so, so I went out for a 2 AM run to try and clear my head.
Eventually I was able to rest, and get some sleep.
On Monday morning I returned to work at 10:50, and soon got a report that there were some things that needed fixing. The main issues were:
- 6 Users with symbols in their usernames couldn’t log in. This turned out to be due to a mistake I’d made in the recovery script, and was very easily fixed.
- Web settings for the 198 affected accounts had been lost. This was an omission in the recovery script, I adjusted the script to add the table that I’d missed, and this data was restored.
- Counters on profiles (number of followers, posts etc) were wrong. This was considered a lower priority to fix and was left for now.
- One user (not among the 198 accounts) reported problems with their account that were hard to explain. I looked into this, but was not able to fix it straight away.
By 14:20 on Monday I knew that I was at my exhaustion limit and couldn’t do anything more. I took the rest of the day off to go for a walk and try not to think about anything Mastodon-related.
On Tuesday I returned to looking at the 2 remaining problems. The incorrect counters turned out to be easy to fix, just needed the right command, which Claire provided for us. Then Claire and I looked at the broken account, eventually leading me to write another script to fix the data issues that we saw, fixing the 4 accounts that were affected in this way.
Official fixes in Mastodon
The Mastodon devs warned other server owners of the dangers of using Mastodon with a Makara-based replication setup. Fortunately such setups are rare, as they would only ever be even considered for large instances like ours.
With the release of Mastodon v4.1.5 came 2 changes that are related to this incident:
- Preventing the use of Makara with sidekiq workers
- A proper fix to ensure correct order of operations in account merges
So hopefully no-one else will have to go through this.
An experience not to be repeated
I do hope I don’t have to go through an experience like this ever again. Although the satisfaction and extreme emotional high at the end of the ordeal was quite something to experience, that really didn’t make up for the stress that we all went through. I have since recovered though, having taken plenty of time off work in lieu of all the extra weekend hours worked.
I do though want to give some massive thanks to everyone who helped out.
- To Hlini and Ísak, my amazing sysadmin colleagues. Together we did the impossible, and without both of you we would never have done it. Thank you! (And Claudia, we missed you! Hope you enjoyed your vacation away from all of this!)
- To Renaud, Claire, and Eugen of the Mastodon developer team, who went above and beyond all expectations to help us out. You folks were amazing, you took our situation very seriously, and immediately jumped in to help us. I really could not have asked for anything more. Thank you!
- To our CEO Jón S. von Tetzchner, for being supportive throughout the entire crisis, and helping coordinate communication with the Mastodon devs. Thank you!
- To Yngve, for providing the crucial piece of information that led us to find the root cause. Thank you for your vigilance.
- To Björgvin, Jane, Daniel, Paweł and everyone else in the Vivaldi team for reporting issues to us and helping make sure that everything was restored correctly. Thanks!
- To our Vivaldi Social users, for being understanding of what was going on, and for all of your kind words after we got everything working again. We appreciate you all very much!
I did warn you it was long.
A bug in the code combined with an ill-advised database configuration caused 198 user accounts to be merged to a single remote account. An entire weekend was spent finding the cause and fixing the damage caused.
Incident timeline summary
Times in UTC
- Saturday 15:15 – Vivaldi Social is sent an account rename message from an external instance. Rogue account merge operation starts in response to this.
- Saturday 15:25 – First signs of incident are noticed.
- Saturday 17:20 – Account merge operation is stopped after restarting docker containers. 198 accounts in total were deleted/merged between 15:15 and 17:20.
- Sunday 13:00 – Probable root cause is identified.
- Sunday 14:25 – Root cause is confirmed.
- Sunday 21:55 – Data restoration commences.
- Sunday 23:27 – Data restoration completed.
- Monday 10:40 – 6 affected accounts with symbols in username are fixed.
- Monday 11:05 – Lost web settings data is restored.
- Tuesday 15:31 – Incorrect counter values are fixed.
- Tuesday 16:01 – 4 accounts with invalid data are fixed.
One last thing
Happy System Administrator Appreciation Day to all the sysadmins out there!