What happened to Vivaldi Social?

2023-07-282023-08-01 — Software, System Administration, Technology — 25 min read

On Saturday 8 July 2023, user accounts started disappearing from the Vivaldi Social Mastodon instance. What was going on, how did this happen, and what were the consequences?

This is a very long blog post, but to be fair, this was also to be a very long weekend.

If you want to skip to the conclusion, there’s a TL;DR (too long; didn’t read) section at the end.

Something’s not right

It was around 17:25 Oslo time (CEST) on the Saturday that I first noticed something was wrong. I’d just got home from a bike ride, and when I happened to check my Vivaldi Social tab, it suddenly asked me to log in again. “Unusual”, I thought, but I wasn’t immediately alarmed by it. But then when I did log in, I saw that my home timeline was now completely empty. I quickly reached out to my colleagues.

“doing anything with mastodon? my home timeline is suddenly empty”
Me, to my fellow sysadmins – Saturday 8 July 17:26 CEST

My fellow sysadmin Hlini very quickly got back to me. No work was ongoing, and his account was showing the same symptoms as mine. He offered to start heading to a computer so he could help me, an offer which I gratefully accepted.

By 17:32, another colleague outside of the sysadmin team had also noticed the same issue. I started to look into the database to see what was going on.

Something bad has happened

Looking at the database I could see that the affected accounts had apparently been deleted, and then recreated as a completely new account when the user logged back in.

Immediately, I started looking to see what database backups were available. As expected, we had a nightly backup from 23:00 UTC on Friday night. I started copying the file to somewhere I could make use of it.

While I was waiting for the backup file to copy, I started checking the database for other users that might be affected. Jón von Tetzchner’s account and another one that I checked had also been deleted, but had not yet been recreated, likely because those users had not tried to log back into their accounts yet.

By this time, Hlini had arrived at a computer and started looking into things with me.

I started checking the web server logs for account deletion requests, but nothing matching the account deletions showed up; and then I realized something else was odd about these deletions.

Normally when an account is deleted in Mastodon, the username is permanently reserved as unusable. If you were to try to create a new account with the same name as a deleted account, it would not allow it (since, due to the nature of the Fediverse, having a new account with the same address as an old one would not be a good thing).

But in the case of these deletions, we were getting reassigned the exact same usernames, so these could not be not normal deletions.

By 18:39, Hlini had figured out the pattern: all accounts with an ID lower than 142 (ie. the oldest accounts) were missing from the database.

We hadn’t seen any discussion from other Mastodon server admins about anything like this, and we wondered if this could be something unique to our setup – after all, Vivaldi Social uses vivaldi.net accounts for logins (thanks to Mastodon’s OAuth support) instead of the normal signup and login system of Mastodon. We started considering asking the Mastodon developers for help, and we also started discussing strategies for restoring the lost data from the backup.

But then…

Something bad is happening right now

At 19:10, I checked the database again, and I saw that all accounts with an ID lower than 217 were now missing from the database, and that number was increasing. This meant that accounts were still being actively deleted from the database.

By this point we both agreed that we needed more help, so at 19:18 we contacted the Mastodon developers. We immediately got a reply from Renaud, and he pinged Claire and Eugen to enlist their help.

Stemming the flood

At 19:20, Hlini restarted all of the docker instances in our Mastodon setup. The deletions seemed to stop the moment he did this. The lowest ID in the database was now 236.

Fortunately it turned out that it would stay that way.

The investigation begins

198 accounts in total had been deleted during the course of this incident, and over the next few hours, together with the Mastodon devs, we started looking into what could be going on. On Eugen’s suggestion, we looked into the possibility of it being the UserCleanupScheduler deleting accounts that were “unconfirmed”, but this was eventually ruled out, as the deleted users could never have matched the query that it operated on.

Since we had upgraded to Mastodon 4.1.3 just 48 hours before the incident occurred, the Mastodon devs looked into all the code changes between v4.1.2 and v4.1.3 to see if anything there could be related. They even (and I cannot credit them enough for this) went the extra mile and looked through our published changes to see if any of the changes we had made could possibly lead to this. The conclusion though was that none of the changes could have triggered anything like this.

At the suggestion of Renaud and Eugen, we checked the filesystem to see if the deletions were being done directly in the database, or if they were being triggered by Mastodon itself. We could see that the avatar and header images for the deleted accounts had themselves also been deleted. This meant that the deletions had to be coming from the Mastodon application itself.

An attack?

We also started looking for signs of system intrusion, since it was certainly a possibility that this was some kind of deliberate attack. I spent some time checking the various logs that we had available to us, but I didn’t find anything (though in these cases, the absence of evidence can never rule out the possibility).

Because Mastodon v4.1.3 included a security fix, the devs also looked into the possibility of a related exploit, for which we combed through the logs, and examined the filesystem for evidence of such an attack. Again though, nothing was found.

We debated whether we should take Vivaldi Social offline altogether while we continued the investigation. The Mastodon devs gave arguments in both directions:

In favour of taking it offline: if we have to roll back the database to the backup, then more content will be lost the longer we keep it up.
In favour of keeping it running: if it is an attack, and it resumes, it might give us more opportunity to investigate how it’s being done.

We ultimately decided to keep it running. In truth what swung the decision that way was probably not the balance between the above arguments, but just a simple fact of us being sysadmins…

XKCD comic in 3 panels. 1 on-panel character talking by radio to his superior.

Panel 1, underling: "We took the hostages, secured the building, and cut the communication lines like you said."
Panel 1, superior: "Excellent."

Panel 2, underling: "But then this guy climbed up the ventilation ducts and walked across broken glass, killing anyone we sent to stop him."
Panel 2, superior: "And he rescued the hostages?"

Panel 3, underling: "No, he ignored them. He just reconnected the cables we cut, muttering something about "uptime"."
Panel 3, superior: "Shit, we're dealing with a SYSADMIN." — https://xkcd.com/705/

Humans need sleep

The hours were passing by, and soon it was starting to get late on Saturday night. Claire wrote and provided for us a patch that we could deploy to add some logging for account deletion actions. I added the patch to our local repo and Hlini started the build process going.

I let the others know that I was going to at least attempt to get some sleep, sentiments which the others also shared. At 00:15 I went off to bed. At 00:29 Hlini reported that he had deployed the patched version and also went to get some rest after checking that everything was still working.

So ended the first day.

Dawn of the second day

Sleep did not come easily, since so much was going through my head, but I eventually managed to get maybe a few hours of actual sleep, before getting back to work somewhere around 09:30 on Sunday.

The Mastodon devs were already awake and had asked us some more follow-up questions, so I set about answering those, checking the admin actions log, detailing our API usage, doing more searching of the log files and doing my best to provide details about our production environment.

By this point, although they were certainly keeping an open mind about the possibility of this being a software bug, the devs felt it more likely that this problem was being caused by some kind of attack. Because of this, I made sure that our internal security team at Vivaldi was up-to-date with everything that had occurred and been discovered so far. This would actually turn out later to lead to a lucky break in our investigation…

Some time after midday I also started to think about our options for data restoration. At 12:26 I asked the devs if they had any advice for how to do a selective restoration, and in particular how to deal with accounts that had been recreated. I then started looking at the database relations myself to figure out which tables would likely need to be recovered. At 12:54 I began the first steps of developing a script to restore the lost data.

The lucky break

Roughly an hour into my development at 13:56, Yngve (one of our security experts at Vivaldi) replied to my earlier message about the current state of the investigation. He reported that his profile page in Vivaldi Social was not loading properly. A series of HTTP 500 errors were being reported when he tried to view it. I checked the database, and could see that Yngve’s account was not one of the 198 accounts that had been deleted, so this was very curious. I checked his profile page myself, and also saw the same errors.

I looked into the Mastodon logs to see if I could find any sign of these errors, and found indeed that there were error messages. I forwarded these findings on to the Mastodon devs. Claire replied, asking for the full stacktraces for the log entries, which I was able to also extract from the logs.

At the same time, another of my sysadmin colleagues, Ísak, started looking in to the logs with me. He discovered that there were now a lot of errors like these in the logs, and there was a common element to all of them. Each of the errors referenced an account on another Mastodon instance – the same account and instance in every single case. For the purposes of this retelling of the story, we’ll call the other instance social.example.com.

Based on this information, Claire started directing me to return the results from various specific database queries. One of the queries that we ran was to select all statuses (aka. posts or toots) from the database belonging to the mystery account on social.example.com.

That query returned 17600 rows.

Either this was an incredibly prolific Mastodon user, or something else very strange had happened.

At 14:43, via querying the database and comparing to the backup I confirmed that all of the statuses for all of the deleted accounts had been reassigned to this 1 user on social.example.com.

Initially Claire suggested that maybe we were seeing an incredibly rare case of PostgreSQL database corruption, and I was absolutely willing to entertain any possibility at this point, but just a few minutes later, realization struck.

The realization

At 15:00, having started developing a theory, Claire asked me to check for AccountMergingWorker entries in the Mastodon worker logs. Several more database queries followed, and also some commands run via the Rails console. By this point she was pretty sure that the account merge worker had for some reason kicked in and started merging all accounts into the 1 remote account, but she had not yet pinned down what exactly that reason could be.

Ísak dug back in to the web server logs for the day before, looking for the first sign of any contact from the social.example.com instance. He found that the instance had contacted Vivaldi Social at 17:15:33 on Saturday. This was just 10 minutes before I had first noticed problems with my account at the start of this saga.

PostgreSQL replication

Claire now started asking questions about our database setup. Specifically, did we have any kind of PostgreSQL replication configured. I was able to confirm that yes, we did indeed have a 2 server replication setup.

This turned out to be the final piece of the puzzle.

The theory

At 17:28, Claire put forward her theory as to what had happened:

Vivaldi Social was notified by social.example.com about a username change for one of their accounts.
The new account was created in our database with a null value in the URI field.
The new account then got its URI set to the correct value for the remote account.
A run of the AccountMergingWorker was scheduled (via Redis), to merge data from the old account to the new.
Because of delays in the database replication, steps 3 and 4 effectively happened in the wrong order, purely due to bad luck with the timing.
The AccountMergingWorker dutifully started merging all accounts with a matching URI value into the new account.

And it just so happens that all local accounts in a Mastodon instance have a null value in their URI field, so they all matched.

Claire theorized that such a thing would be more likely to occur if the database were under heavy load that caused extended replication delays.

“What are the next steps?”

With a very probable cause now identified, Jón asked what the next steps should be. We agreed that the best course of action was for me to return to focusing on recovery efforts, while Claire would write a patch that we could apply to prevent this reoccuring.

Also with Renaud’s help, I was able to provide further details about the replication setup, which confirmed that the worker processes were configured in a way that they could perform database reads against the standby server (using a tool called Makara). This information led all 3 of the devs to agree that we had almost certainly now found the root cause.

Within just a few minutes, Claire provided a patch for us, which I pushed to our repository.

Recovery process

The recovery process was possibly even more of a rollercoaster ride than the problem itself.

At 16:09, Jón and I had a discussion about the best course of action for recovery. Given the amount of perceived data corruption that had occurred, I was now leaning towards doing a full rollback of the database, and Jón agreed that was probably the way to go. After helping out the Mastodon devs with the replication investigation, I took a much needed break. At 17:03, the sysadmin team had a quick discussion and together we agreed to proceed with the full database restore.

A full restore was actually not going to be as simple as we hoped. We knew already from the earlier restore of the backup database that there was one part of the process that would take a very long time, due to a known performance issue that affected our instance. So instead of restoring directly, we planned to convert the .dump file into a .sql file, and edit the .sql file to replace the offending query with a more performant one.

The conversion of the .dump file finished at 17:23, then the problem was how to perform the edits. Editing a 54GB text file comes with its own set of problems. Most text editors will struggle with that kind of size.

At the same time as I was pondering the best strategy for editing the file, Claire started suggesting a strategy for doing a selective database recovery. I explained that we had decided to do a full restore, but when we discussed the reasoning for this (the difficulty in knowing what data had been corrupted), I eventually became convinced that perhaps a selective restore would be possible after all. She agreed to help out by providing full details of what database tables the account merge operation will have touched.

In the sysadmin team we decided to split our efforts. I would work on the selective restoration efforts, while Hlini would carry on working on the full restore procedure. I continued working on the script that I had started on earlier in the day.

Patches and configuration changes

Before he could work on the restoration process, Hlini also had the task of applying the patch that we’d been given, and also changing our database configuration so that we weren’t using this (no-longer-recommended) replication setup. At 17:58, he deployed the changes, and something went wrong. We had our first (and only) full downtime of Vivaldi Social in that weekend. By 18:18, Vivaldi Social was back up and running, and by 18:44 the changes were deployed again, this time successfully.

So, we were now reasonably confident that we were protected against this incident happening again. Now the only thing we needed to focus on was getting the data back.

Recovery work continues

At 19:25, I told the other sysadmins that I had hit a wall in my scripting process. I was having technical issues with the PDO module that I was using to interact with the database, and I needed another pair of eyes to look at it. I showed my work so far to Ísak to see if he could spot my obvious mistake, but he didn’t find anything immediately obvious. By 19:55, I decided to give up completely on the script.

Instead I decided to help out Hlini with his task. He was now in the process of editing the 54GB .sql file.

His first attempt was to edit the entire file with vim. I checked the changes that he was applying, and they seemed to look ok.

I tried running the import of the resulting .sql file, and it quickly returned an error. It turned out that the file that vim ended up saving was for some reason truncated at 629MB, so we decided to try another approach. I split the file into 2 chunks to make editing it easier:

head -c50000 dump.sql > split
tail -c+50001 dump.sql > split2

and Hlini applied his changes again. I then merged the files together (after removing the trailing newline that vim had added):

head -c-1 split > split3
cat split3 split2 > dump-patched.sql

and when that was done, attempted another import.

Unfortunately that attempt also failed.

For the third attempt, I split the file precisely around the section to be edited, so it was split into 3 pieces. This attempt also failed in the same way as the previous one, but by doing the split in this way I was able to generate a sensible diff of the edited section, and that allowed me to spot the error we had made (a case of mismatched paratheses).

At 21:26, on the fourth attempt, we finally started running the import of the .sql file in preparation for the full restore.

While the import was running, I asked Ísak to have another go at trying to figure out what was wrong with my code, and he started digging into it.

The bug in my code

At 22:50, Ísak got back to me. He’d figured out what I’d done wrong. Like most bugs it was a stupid mistake on my part, related to how I had bound the parameters of one of the database queries (by reference instead of by value). In my defence, I was very tired by this point.

Now that the problem had been found, I decided once again to start working on the selective restoration process (the database import was still running).

By 23:04, I had managed to finish the first part of the script, which would fix the user, account and identity records of the 198 affected users. I used the script to restore mine and Ísak’s profiles first. We saw that the avatar and header images were now missing (as they had been deleted from the filesystem), but otherwise the profiles looked good. With advice from Ísak, I adjusted the script to force Mastodon to refetch the avatar image when the user next logs in to Vivaldi Social.

By 23:55, I had finally finished my work on the script. In theory it would now be able to return everyone’s statuses, follows, followers and all other relationship data that referenced the 198 users to the state it was in before the incident.

Selective restoration begins

I soon realized that due to database relationship constraints we would need to do the restoration in 2 stages, fixing the user/account/identity records for all 198 users first, before then proceeding with the rest of the data.

At 00:08, I completed the first stage of this, fixing the accounts of all the affected users.

Next I tried running the second stage fixes for just a single account (my own). This worked, but my profile now showed the same issues that Yngve’s had done earlier on Sunday. Checking with Ísak, we realized this was because I had boosted posts that belonged to one of the other affected users.

Seeing no other option than to go ahead and run the second-stage fixes for all 198 users, I took the plunge and started it going at 00:46.

At 00:55 the script hit an UPDATE query that failed due to a duplicate key. I figured out that this was because the user in question had logged in to their account and set up some of their follows again. I modified the script to handle this scenario by deleting records that could not be restored in this way and keeping the newer record intact.

The script was soon humming away again, gradually fixing all of the corrupted and lost data.

The anticipation of success

By 01:14, Ísak reported that he could now view all of his posts, although his profile was not yet working.

By 01:18 though, my profile page was finally working again! This meant that all the accounts that I had boosted posts from had now been restored. The mood in the sysadmin team was now extremely buoyant.

“holy shit guys, this is actually working”
Me, to my fellow sysadmins – Monday 10 July 01:18 CEST

“holy moly my profile looks ok”
Hlini, to his fellow sysadmins – Monday 10 July 01:26 CEST

The script finished its final task at 01:27 on Monday morning. Things were looking pretty good now. The only thing that was noticably off was that the Home feed of the affected users was now empty. I was fairly sure that this was just a simple indexing issue, and asked Hlini if he could fix it. At 01:40, Hlini reported that the re-indexing of the home feeds had completed.

Jubilation

And there it was. My home feed, and the home feeds of all 197 other affected users were now back as they were, almost as if nothing had ever happened.

The full rollback and restore never became necessary.

Emotionally I was now on an insane high after the devastatingly exhausting and stressful period that we had all just been through. I thanked the sysadmin team for all of their help, and the Mastodon devs for all of theirs. I knew that I desparately needed sleep, but that in my current state I would not be able to do so, so I went out for a 2 AM run to try and clear my head.

Eventually I was able to rest, and get some sleep.

The follow-up

On Monday morning I returned to work at 10:50, and soon got a report that there were some things that needed fixing. The main issues were:

6 Users with symbols in their usernames couldn’t log in. This turned out to be due to a mistake I’d made in the recovery script, and was very easily fixed.
Web settings for the 198 affected accounts had been lost. This was an omission in the recovery script, I adjusted the script to add the table that I’d missed, and this data was restored.
Counters on profiles (number of followers, posts etc) were wrong. This was considered a lower priority to fix and was left for now.
One user (not among the 198 accounts) reported problems with their account that were hard to explain. I looked into this, but was not able to fix it straight away.

By 14:20 on Monday I knew that I was at my exhaustion limit and couldn’t do anything more. I took the rest of the day off to go for a walk and try not to think about anything Mastodon-related.

On Tuesday I returned to looking at the 2 remaining problems. The incorrect counters turned out to be easy to fix, just needed the right command, which Claire provided for us. Then Claire and I looked at the broken account, eventually leading me to write another script to fix the data issues that we saw, fixing the 4 accounts that were affected in this way.

Official fixes in Mastodon

The Mastodon devs warned other server owners of the dangers of using Mastodon with a Makara-based replication setup. Fortunately such setups are rare, as they would only ever be even considered for large instances like ours.

With the release of Mastodon v4.1.5 came 2 changes that are related to this incident:

So hopefully no-one else will have to go through this.

An experience not to be repeated

I do hope I don’t have to go through an experience like this ever again. Although the satisfaction and extreme emotional high at the end of the ordeal was quite something to experience, that really didn’t make up for the stress that we all went through. I have since recovered though, having taken plenty of time off work in lieu of all the extra weekend hours worked.

I do though want to give some massive thanks to everyone who helped out.

To Hlini and Ísak, my amazing sysadmin colleagues. Together we did the impossible, and without both of you we would never have done it. Thank you! (And Claudia, we missed you! Hope you enjoyed your vacation away from all of this!)
To Renaud, Claire, and Eugen of the Mastodon developer team, who went above and beyond all expectations to help us out. You folks were amazing, you took our situation very seriously, and immediately jumped in to help us. I really could not have asked for anything more. Thank you!
To our CEO Jón S. von Tetzchner, for being supportive throughout the entire crisis, and helping coordinate communication with the Mastodon devs. Thank you!
To Yngve, for providing the crucial piece of information that led us to find the root cause. Thank you for your vigilance.
To Björgvin, Jane, Daniel, Paweł and everyone else in the Vivaldi team for reporting issues to us and helping make sure that everything was restored correctly. Thanks!
To our Vivaldi Social users, for being understanding of what was going on, and for all of your kind words after we got everything working again. We appreciate you all very much!

TL;DR

I did warn you it was long.

A bug in the code combined with an ill-advised database configuration caused 198 user accounts to be merged to a single remote account. An entire weekend was spent finding the cause and fixing the damage caused.

Incident timeline summary

Times in UTC

Saturday 15:15 – Vivaldi Social is sent an account rename message from an external instance. Rogue account merge operation starts in response to this.
Saturday 15:25 – First signs of incident are noticed.
Saturday 17:20 – Account merge operation is stopped after restarting docker containers. 198 accounts in total were deleted/merged between 15:15 and 17:20.
Sunday 13:00 – Probable root cause is identified.
Sunday 14:25 – Root cause is confirmed.
Sunday 21:55 – Data restoration commences.
Sunday 23:27 – Data restoration completed.
Monday 10:40 – 6 affected accounts with symbols in username are fixed.
Monday 11:05 – Lost web settings data is restored.
Tuesday 15:31 – Incorrect counter values are fixed.
Tuesday 16:01 – 4 accounts with invalid data are fixed.

One last thing

Happy System Administrator Appreciation Day to all the sysadmins out there!

Join the Conversation

Alexey Shpakovsky says:

2023-07-28 at 20:22

@[email protected] holy molly, what a ride! Great job at getting to the bottom of the issue, and also restoring the data nicely!!
Marcos Dione says:

2023-07-29 at 09:14

Hi! I’m not a Masto sysadmin, but a sysadmin nonetheless, and I really love the fact you spent all this time writing down this postmortem (postzombie? postresucitation?)!

I have one question. at some point you say:

> At the suggestion of Renaud and Eugen, we checked the filesystem to see if the deletions were being done directly in the database, or if they were being triggered by Mastodon itself.

What do you men by this? What could be in the filesystem that shows this? Logs? History files? Or is this like wielding a club in a pitch dark room?

Happy SA day!

Log in to Reply
1. Thomas Pike says:
  
  2023-07-29 at 14:11
  
  By this, I meant checking in the filesystem specifically to see if the avatar image for the deleted accounts still existed. Since it did not, we could draw the conclusion that the accounts were not simply deleted in the database.
  
  Log in to Reply
Tuga says:

2023-07-29 at 14:06

“Purely due to bad luck with the timing.”

Unlike real-life scenarios influenced by various unpredictable factors, coding is purely based on the instructions and procedures put in place by a programmer. There is no concept of “luck”. An occurrence often dubbed as “bad luck” in coding is usually the result of overlooked contingencies, bugs, or faulty logic within the code. Thus, the preferred term would be “bad code” or “poorly written code”, not “bad luck”.

Log in to Reply
1. Thomas Pike says:
  
  2023-07-29 at 14:11
  
  Sure, a more accurate description would probably be a “timing bug”.
  
  Log in to Reply
  1. Vekkq says:
    
    2023-07-30 at 00:02
    
    A race condition.
    
    Log in to Reply
    1. Thomas Pike says:
      
      2023-07-30 at 00:33
      
      That’s the term I was looking for. Thank you!
      
      Log in to Reply
Don Glee says:

2023-07-29 at 14:51

@thomasp Thank you for the hard work and for sharing stories about what happened back then. :tony_happy:
Rijk van Geijtenbeek says:

2023-07-29 at 15:47

@thomasp now I wonder if there was also a problem for the external account involved here…
1. Thomas Pike says:
  
  2023-07-29 at 15:54
  
  It’s a good question. I don’t think they would have seen much. There might be a possibility that because they were now (according to Vivaldi Social) following a whole bunch of accounts that they weren’t before, that their server could have received notifications for them (if those followed accounts were also on Vivaldi Social, and they posted something), but their server would not have any record of such a follow relationship, so I suspect that any such notification would be ignored. I’m not an ActivityPub expert yet though.
  
  Log in to Reply
wildente says:

2023-07-29 at 16:48

Before reading this, I knew that you had a more than stressful couple of days. After reading this, I’m actually kind of proud now, since one of the affected accounts was mine. Referring to the xkcd cartoon, turns out the Sysadmin did rescue the hostages!

Log in to Reply
Cezar says:

2023-07-29 at 18:22

Fascinating to read! Thank you!

Log in to Reply
Mark S. says:

2023-07-29 at 23:32

Literally NO ONE gives a flying fuck about Vivaldi Social or this blog post.

Log in to Reply
1. Thomas Pike says:
  
  2023-07-30 at 00:32
  
  And yet you saw fit to comment on it. Curious use of your time.
  
  Log in to Reply
Vasil Kolev says:

2023-07-30 at 06:06

I don’t run mastodon, so I might be wrong, but from the whole post it looks like there is not enough visibility in what the system does. You had to reach to the developers to see that the account merge was happening, and I’d expect that operation like this would be logged and greppable, thus easier to identify.

Also, I hope after your experience that there will be some better backup/restore tools, as being able to restore a single user seems like an important feature for such system.

(great post-mortem, but the way, and hope you got enough sleep afterward:) )

Log in to Reply
1. Thomas Pike says:
  
  2023-07-30 at 11:54
  
  If we’d known it was the AccountMergingWorker, we absolutely could have grep’d for it much earlier than we did, but up until that moment on Sunday, we had no idea.
  
  I’ll probably try to at least release the script that we used to restore the lost accounts, maybe that will be of use to someone, somewhere.
  
  Log in to Reply
TerrorBite :veripawed3: says:

2023-07-31 at 05:19

@thomasp That's wild. These kinds of outage post-mortems are always worth a read, and it's usually the tiniest unforeseen thing that sets off a chain reaction of disaster. Great work on the restore!
Stu says:

2023-07-31 at 18:01

@thomasp That was a fantastic read, appreciate you being open about this.
old sysops says:

2023-07-31 at 19:23

@thomasp thanks for sharing this!As a #sysadmin myself i can relate.Hope it won't start again.
Joeri says:

2023-07-31 at 19:38

@thomasp 8th July? That’s about the same time I migrated (back) to Vivaldi Social from another instance. I hope I didn’t trigger something bad accidentally 🫣
1. Thomas Pike says:
  
  2023-07-31 at 19:44
  
  You can rest assured that this was not your fault 😀
  
  It was also definitely not the fault of the instance that sent the merge notification to us either. Just an unfortunate software glitch.
  
  Log in to Reply
Luk says:

2023-07-31 at 21:31

@thomasp incredible story, thank you for sharing!
rahaeli says:

2023-08-01 at 04:17

This is an excellently written postmortem and some really strong investigation and debugging work. Thank you so much for spending the time to write it up! I love reading writeups that show other people’s thought processes as they work through an incident; it always provides good insight into how other people start digging into things when the exact scope of the incident isn’t clear yet, and there’s always stuff I can learn from it.

I hope you got a few days of Not Thinking About Mastodon afterwards!

Log in to Reply
fjcasella says:

2023-08-04 at 18:34

I’m exhausted from reading this .. probably because I’m not a tech geek. But I’m glad you shared it because it illustrates what you guys at Vivaldi do in the background of things that we all can take for granted .. you make it all look easy. In fact, I find that the people at Vivaldi seem extremely taleted and intelligent. You make great stuff here and are usually transparent about it all. Which is why I’m puzzled that nobody got back to me promptly (as you usually do) about why you changed the logo on Vivaldi Social instance for the month of June, when you haven’t done anything like that before. I found that logo to be very controversial that can cause division in our society today. I’m left totally surprised! I haven’t used Vivaldi Social since (as well as, many who I’ve refered to Vivialdi), and I miss it deeply. In any event, thank you for your great work and mostly making a positiive difference in the world through it.

Log in to Reply
1. Thomas Pike says:
  
  2023-08-05 at 00:22
  
  You’ve commented on my personal blog, so you’ll get a personal response instead of a Vivaldi official response.
  
  Your comment is personally disappointing to me, but it actually serves as a perfect reminder as to why Pride is still very much necessary in this world. I have friends and relatives in the LGBTQ+ community, who I care deeply for, and I very much support their right to exist, and to live their life in their own way. As such I am very happy that Vivaldi chose to give a small token of support to this marginalized community that is under relentless attack.
  
  I bear you no ill will (though that may change depending if and how you choose to continue this discussion), but I felt I could not in good conscience leave this comment unanswered.
  
  Log in to Reply
  1. Thomas Pike says:
    
    2023-08-05 at 03:28
    
    Speaking of relentless attack, you can of course be forgiven for not being aware that last June was the 1-year anniversary of the mass shooting that occurred in Oslo (home city of Vivaldi’s HQ), specifically targetting the pride community (https://en.wikipedia.org/wiki/2022_Oslo_shooting). Someone I know lost one of their closest friends that day, and was himself wounded in the attack.
    
    Log in to Reply
    1. fjcasella says:
      
      2023-08-05 at 15:23
      
      I did not know thiis, or remember, so thank you for sharing. I am sorry this happened. From a marketing perspective, do you think it would have been better to share this in a social media post and pin it instead of changing the logo? Then it might help people like me to better connect the dots and have a better understanding of the matter?
      
      Log in to Reply
      1. Thomas Pike says:
        
        2023-08-05 at 15:42
        
        Thank you for your replies. While I don’t believe that Vivaldi should be under any _obligation_ to explain changing the image header in the way that it did, I do agree that an official blog post on the subject would have been a nice idea.
    2. fjcasella says:
      
      2023-08-06 at 18:33
      
      Thomas – Thank you for the dialog, it’s important to understand each other today more than ever, even if we agree to disagree. Again, thanks for this blog post and sharing the important work you do …
      
      Log in to Reply
  2. fjcasella says:
    
    2023-08-05 at 15:19
    
    Thank you for your response and dialog Thomas. I was hoping that you’d respond. Something I wish Vivaldi would have done as they usually do, and I really would like to hear from Jon. So I will not get into it, as this is your personal blog. I wrote much of it anyway in the most recent entry on my Vivaldi blog. I just want you to know that I agree with you. My point to be clear is not about the subject but about the action. That I always have fould Vivaldi welcoming to all until this action. When you highlight one part of society, you can push away another part of society. So it is best to just provide world-class sociale media and serve the customer.
    
    Log in to Reply
    1. dbouley says:
      
      2023-08-07 at 18:09
      
      I agree with you on this point. It’s good that people are made aware of the wickedness that is happening in our world. But, on a public service like Vivaldi, focusing on just one group of people in this way can easily bring division, possibly suggesting that one group of people are more important than another.
      
      If Vivaldi wants to show support for marginalized communities of people, then I would like to make case for the thousands of Nigerian Christians slaughtered simply for their faith last year…. https://www.heritage.org/africa/commentary/slaughter-nigerian-christians-warrants-international-attention
      
      Log in to Reply
christiehmalry says:

2023-08-05 at 03:57

You’re a trooper TP thank you so much for your sterling work.

Log in to Reply
Richard B says:

2023-08-11 at 14:53

Very interesting read: sorry you had to go through it (been there before myself with non-Mastodon software years ago). Any chance you could make the “partial restoration” scripts and other bits available online – yes they’ll probably get outdated quickly and will require patching for any other system, but if an admin has a similar issue at least they’ll have a starting off point rather than having to start from scratch (even a list of tables and the necessary restore order can help).

Log in to Reply
greybeard says:

2023-08-20 at 20:35

Wow !
All of you deserve agood sauna, massage and if you’re still awake a nice beverage.
Personally, I didn’t even realize this was happening. Does not seem to have affected my accounts (tho I may may have been napping at the time.
Great Job. Many Thanks for all your work at keeping Vivaldi’s Mastodon Instance up and running.
Also many thanks for this awesome post. it must have taken the wind out of you to put this all together to share with us.

In this line of work, Sysadmin Day should be every day ending in ‘y’. (Yes I know this rule does not hold up in all languages 😉 .)
Cheers to you, the Vivaldi Team and the Mastodon Team. :ThumbsUp

Log in to Reply