Postmortem of today's update, or: why testing is important.
Today (28/12/2019), we released a new and fairly large update, which included a new quest and four new kingdoms, and was planned to be the last one for 2019.
While implementing the quest, which we do together with the writer to make sure it's all up to spec and works as planned, we ran into an issue: the kingdom of Aeston shouldn't be accessible until you've started the new quest.
Simple, right?
Wrong.
The idea is simple, but the implementation turned out to be a pain.
As we didn't have any system for it in place, I started working on one. I extended the current travel system so that locations can be unlocked at a certain stage of a quest, instead of only after the quest is completed. I got it working, tested and all.
With the quest tested and working and all the new locations implemented, I pushed it out as a release.
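To give an idea of what unlocking a location at a quest stage can look like, here is a minimal sketch. It is not the bot's actual code: `UnlockRule`, `UNLOCK_RULES`, `is_accessible`, the quest id `new_quest` and the convention that stage 0 means "not started" are all made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UnlockRule:
    quest_id: str   # hypothetical quest identifier
    min_stage: int  # first quest stage at which the location opens up


# Hypothetical data: Aeston opens up as soon as the new quest is started.
UNLOCK_RULES = {"aeston": UnlockRule(quest_id="new_quest", min_stage=1)}


def is_accessible(location_id: str, quest_stages: dict[str, int]) -> bool:
    """Return True if the player may travel to the location.

    quest_stages maps quest id -> current stage; a quest that is missing
    from the mapping counts as stage 0 (not started). This convention is
    an assumption of the sketch.
    """
    rule = UNLOCK_RULES.get(location_id)
    if rule is None:
        # Locations without a rule are always open.
        return True
    return quest_stages.get(rule.quest_id, 0) >= rule.min_stage
```

The point of a single helper like this is that every command dealing with locations can call the same check, so a change to the unlocking rules only has to be made (and tested) in one place.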
At first, it went fine. Everything got pushed onto the servers without a hitch.
But then, the bot was dead.
I checked the launch logs, and sure enough, there was an error in the new quest code which somehow wasn't caught during testing.
I fixed it and released it again, at which point everything worked fine, and I went on to other work.
But then the bot died again.
And again.
And again.
And a- you get the picture.
After another round of fixes and restarts, I ran into another bug, this time with the location command: it wouldn't work in certain locations because it couldn't find a connection.
The cause was that, while adding the new system to the travel command, I had completely neglected the location command.
As such, I fixed it. In production.
Or so I thought.
What I ended up doing instead was a big ol' test in production in order to get it all up and running again, which after a load of restarts, I managed to do.
Finally, done, I thought.
KA-BLAM! You ain't done yet, suckah!
While I was fixing all the other things, I missed some typos and one very, very important thing with the location command.
See, the thing with the location command is that the fix I implemented replaced the data of any location it couldn't find with ???, which led to a connection showing ??? as the time needed to travel between the two locations.
After a lot of looking around in the code, it turned out I had mixed up a location ID by swapping the position of its digits, writing 86 instead of 68.
And I wasn't done yet.
As it turned out, the bot would now crash, seemingly at random. After a lot of trial and error, we figured out that it had to do with, you guessed it, the travel command!
In a strange chain of events, the bot was failing to find a complete path between two locations because the connection definitions were broken, at which point it would say "nah fam" and die.
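For the curious, this is roughly how a path search can report "no route" instead of taking the whole bot down with it. It's a sketch under assumptions, not our real travel code: `find_path` and `travel_message` are hypothetical names, the search is a plain breadth-first search, and connections are assumed to be a dict mapping a location to the locations reachable from it.

```python
from collections import deque


def find_path(connections, start, goal):
    """Breadth-first search. Returns a list of locations, or None if no route exists."""
    if start == goal:
        return [start]
    previous = {start: None}  # location -> the location we reached it from
    queue = deque([start])
    while queue:
        current = queue.popleft()
        # .get() keeps a location with no definition from raising KeyError.
        for neighbour in connections.get(current, ()):
            if neighbour in previous:
                continue
            previous[neighbour] = current
            if neighbour == goal:
                # Walk back from the goal to the start, then reverse.
                path = [goal]
                while previous[path[-1]] is not None:
                    path.append(previous[path[-1]])
                return path[::-1]
            queue.append(neighbour)
    return None  # broken or missing connections end up here


def travel_message(connections, start, goal):
    """Turn the search result into a reply instead of letting the command crash."""
    path = find_path(connections, start, goal)
    if path is None:
        return f"No route from {start} to {goal}."
    return " -> ".join(str(step) for step in path)
```

With the broken-definition case handled as an ordinary return value, the worst a bad connection can do is produce an unhelpful reply.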
To fix this, we've gone over the connection definitions and made sure they are as they should be.
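Going over the definitions by hand works, but a check like the one below could do it on every startup or test run. Again, this is only a sketch: `check_connections` is a hypothetical helper, and it assumes that location IDs are integers and that every connection is supposed to work in both directions, which may not hold for every route.

```python
def check_connections(locations, connections):
    """Return a list of human-readable problems found in the connection definitions.

    locations:   set of all known location IDs
    connections: dict mapping a location ID to the IDs it connects to
    """
    problems = []
    for source, targets in connections.items():
        if source not in locations:
            problems.append(f"{source}: unknown source location")
        for target in targets:
            if target not in locations:
                # Catches swapped digits such as 86 written instead of 68.
                problems.append(f"{source} -> {target}: unknown target location")
            elif source not in connections.get(target, ()):
                # Assumption of this sketch: all connections are two-way.
                problems.append(f"{source} -> {target}: missing return connection")
    return problems
```

An empty list means the definitions passed; anything else can fail the build before it ever reaches the servers.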
As of the publishing of this post, everything is back up and working fine again, apart from some minor bugs we need more time to squash, and we (hopefully) won't experience anything like this again.
... at least not until next year.
Moral of the story: test everything before you go to production, even if you don't think it'll affect anything else.