The big infrastructure change
In the beginning, there was one server.
This one server ran all of Discord Dungeons - The bot, the databases, the website, you name it.
As our ambitions grew, we had to increase the capacity of the server, so we did, time and time again, going through several major moves until one day it became completely impractical to upgrade the server any further, mainly due to the crazy costs involved with high-end hardware.
The solution? Add another server.
As this was quite early in the project, and in my career of server management, it didn't go as smoothly as I had initially hoped. My lack of knowledge led me to offload side services such as the web presence (website, wiki etc.) to the new server, which - to be fair - actually did improve performance on the main server.
Until it happened again. Our ambitions, and thus our infrastructure needs, kept growing, so I moved the databases to their own server. That, again, improved the bot's performance, but there were still frequent issues with uptime.
Then, as if the universe had read my mind, a new member joined the Discord server, complaining about the bot's frequent downtime and offering to help resolve it. Seeing their previous experience, I accepted the offer, and since then GottZ has either been running most of the infrastructure outright or helping out immensely, while providing valuable insights to learn from, which I am extremely grateful for.
Now, at this stage we were still running on bare metal, and portability was basically a pipe dream - I had only ever run on bare metal and didn't know anything else. However, GottZ introduced me to this magical piece of software that would come to revolutionise how we ran stuff - Docker.
Simply put, Docker is a way of packaging software into a neat little bundle, complete with everything needed to run only that one piece of software, without needing to install anything other than the Docker runtime on a server.
Said and done. We moved our infrastructure, bot and related projects to Docker containers, and that's how we've been running ever since.
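If you've never used it, a whole Docker setup can be as small as a compose file along these lines (the images and names here are made up for illustration, not our actual stack):

```yaml
# docker-compose.yml - a made-up example of running a bot and its database
# with nothing but the Docker runtime installed on the server.
services:
  bot:
    image: ghcr.io/example/discord-bot:latest  # prebuilt image with everything bundled
    restart: unless-stopped
    environment:
      - DATABASE_URL=redis://db:6379
    depends_on:
      - db
  db:
    image: redis:7
    restart: unless-stopped
    volumes:
      - db-data:/data  # persist the database outside the container
volumes:
  db-data:
```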
However, not all is gold and glory. Remember how our needs grew? Yeah, they kept on growing, so we had to add more servers, to the point where we're currently operating 7 separate servers for this one project.
Keeping track of all of this has been a challenge to say the least. We have to keep track of what services run where, we have to do networking between the servers, the whole nine yards.
This was fine for a long while, and we did manage, however, there was a better way. Kubernetes.
Remember how I said Docker revolutionised our infrastructure? Kubernetes is the same thing on steroids. It's a platform that promises to let you link multiple servers together into one big cluster where you can just deploy whatever, and Kubernetes handles everything related to networking, where to run things, how to deal with storage, and everything else you need.
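To give a feel for it, here's roughly what "just deploy whatever" looks like in practice - a made-up Deployment manifest; Kubernetes then picks a server with room for it:

```yaml
# A made-up Deployment: you declare what you want running,
# the cluster decides which server actually runs it.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: discord-bot
spec:
  replicas: 1
  selector:
    matchLabels:
      app: discord-bot
  template:
    metadata:
      labels:
        app: discord-bot
    spec:
      containers:
        - name: bot
          image: ghcr.io/example/discord-bot:latest  # placeholder image
          resources:
            requests:
              memory: "256Mi"
              cpu: "250m"
```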
That was a couple of years ago, and I'd been wanting to move to Kubernetes for a long while, but we never got around to it. Until now.
As with everything, it wasn't all a bed of roses, and a lot of work had to be done.
From my previous, albeit limited, experience with Kubernetes, I knew I needed a couple of things: a storage provider, and a reverse proxy that would let me expose services to the web.
The first one, the storage provider, wasn't hard to set up, as I had some experience using Longhorn, which touts itself as "Cloud native distributed block storage for Kubernetes", and it's perfect for our needs.
Without going into too much detail, Longhorn lets us use the storage across all the servers as one big pool, which can be replicated across anything from one to all of the servers, reducing the chances of data loss, and it's easy to use on top of that.
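As a rough sketch of how it looks from a workload's point of view (the replica count and names here are just examples, not our exact setup):

```yaml
# Example StorageClass backed by Longhorn, keeping 3 copies of every volume
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: longhorn-replicated
provisioner: driver.longhorn.io
parameters:
  numberOfReplicas: "3"     # spread copies across 3 servers
  staleReplicaTimeout: "30"
---
# A claim any pod can mount; Longhorn decides where the data actually lives
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: wiki-data
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: longhorn-replicated
  resources:
    requests:
      storage: 10Gi
```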
The reverse proxy? A whole different story.
To understand this, we first have to go a bit into the workings of Kubernetes.
So, in Kubernetes there's a thing called an ingress, which basically describes and manages external access to services inside the cluster - anything from being able to reach something on port 8000 to serving an HTTP route.
As the plan was to move everything to Kubernetes, we needed a way to route incoming requests into the cluster and have them serve the right content, and this is where the proxy comes in. It listens for requests, checks where they want to go, and routes them appropriately.
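In manifest form, an ingress rule really is just "this hostname goes to that service" - the hostname and service name below are placeholders:

```yaml
# Example Ingress: requests for api.example.com get routed
# to the api service inside the cluster on port 8000
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: api
spec:
  rules:
    - host: api.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: api
                port:
                  number: 8000
```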
Now, in my previous experiences I had used Traefik, which is an industry standard and, once set up, is extremely easy to configure and use. You basically tell it, "hey, I want people to access service x on this url", and it just figures it out, serves SSL certificates and whatnot. (Of course you have to set your DNS up, but that's no biggie)
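With Traefik's own resources, that "access service x on this url" translates into something like the following - example hostnames, and assuming a certificate resolver named letsencrypt has been configured:

```yaml
# Example Traefik IngressRoute: route a hostname to a service
# and let Traefik fetch the TLS certificate on its own
apiVersion: traefik.io/v1alpha1
kind: IngressRoute
metadata:
  name: donate
spec:
  entryPoints:
    - websecure
  routes:
    - match: Host(`donate.example.com`)
      kind: Rule
      services:
        - name: donate
          port: 80
  tls:
    certResolver: letsencrypt   # assumes a resolver with this name exists
```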
However, this is where the first major roadblocks began.
See, the docs for Traefik? Not the best. As I was working away, setting it up according to the loose instructions I had found on their site, I just could not get it working. Not one single time. After spending a couple of days on it, and countless YouTube videos of help, it turned out I needed to add a networking layer that would allow LoadBalancer services to get an IP.
Said and done. After more research, I found the industry standard for this, MetalLB, which was super easy to set up in L2 mode, handing out IPs as it should, without issue.
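For reference, the whole layer-2 MetalLB setup boils down to two small resources (the address range below is just an example):

```yaml
# Example MetalLB config: a pool of addresses to hand out to LoadBalancer services...
apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: default-pool
  namespace: metallb-system
spec:
  addresses:
    - 192.168.1.240-192.168.1.250   # example range on the local network
---
# ...and an L2Advertisement so those addresses get announced on the LAN
apiVersion: metallb.io/v1beta1
kind: L2Advertisement
metadata:
  name: default
  namespace: metallb-system
spec:
  ipAddressPools:
    - default-pool
```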
So, I tried the Traefik approach again, and - of course - it still didn't work.
"Crap", I thought to myself, being out of options.
I went on a rampage, on the brink of giving up, saying "screw it" and trying other proxies with worse configuration options, just out of pure frustration.
But then, as if the universe had read my mind (or just some Google algorithm), I found the solution, which was so stupidly simple.
I had to make Traefik listen on the server's external IP address.
Said and done. I reloaded the configs and - it just worked. Days of frustration lifted off my shoulders in what could only be described as pure bliss.
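I won't reproduce the exact manifest here, but the gist of the fix was along these lines - pointing the Traefik service at the server's external address (the IP and ports below are placeholders, assuming the usual Traefik Helm chart layout):

```yaml
# Sketch of the fix: expose the Traefik service on the node's public IP
# (203.0.113.10 is a placeholder, not our real address)
apiVersion: v1
kind: Service
metadata:
  name: traefik
spec:
  type: LoadBalancer
  externalIPs:
    - 203.0.113.10   # the server's external IP Traefik should listen on
  selector:
    app.kubernetes.io/name: traefik
  ports:
    - name: websecure
      port: 443
      targetPort: 8443   # default container port in the Traefik Helm chart
```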
Having gone through all of this, I sure didn't feel like doing it all manually again if something were to go wrong, and that's where Ansible comes in.
Ansible, a tool for automating IT tasks like setting up servers, was the perfect fit, as it allows you to express the steps in what's basically plain English, and it just works, like magic.
Setting everything up with Ansible did take some testing, but now that it's done, we can magically install and add servers to the cluster with a single command, completely automated - everything from creating users to installing useful software, Ansible does it all.
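To give an idea of what "basically plain English" means, a play for bootstrapping a new node looks roughly like this (the tasks here are simplified examples, not our full playbook):

```yaml
# Simplified example play for preparing a new cluster node
- name: Prepare a new server for the cluster
  hosts: new_nodes
  become: true
  tasks:
    - name: Create the admin user
      ansible.builtin.user:
        name: admin
        groups: sudo
        append: true
        shell: /bin/bash

    - name: Install useful base packages
      ansible.builtin.apt:
        name:
          - curl
          - htop
          - open-iscsi   # required on nodes that run Longhorn
        state: present
        update_cache: true
```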
The migration
With all the base work done, it was time for migrations. For this, we needed something that would help us deploy automatically. After all, all this fancy work hadn't been done just for manual deployment - we want to focus on coding, not deploying.
To do this, we needed a tool capable of handling a Continuous Delivery flow, able to look at private GitHub repos and pull from our private Docker registry, all while being fairly easy to use.
After searching and asking around, I eventually found ArgoCD - a full CD suite, with a nice web interface, which fits extremely well into our toolbox.
Said and done. Deploying it was quick, and so was setting up authentication. And here we are, with a nice, easy-to-use tool for automatic deployments.
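For the curious, pointing ArgoCD at a repo is one small manifest - the repo URL and paths below are placeholders:

```yaml
# Example ArgoCD Application: watch a Git repo and keep the cluster in sync with it
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: api
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example/deployments.git   # placeholder repo
    targetRevision: main
    path: api
  destination:
    server: https://kubernetes.default.svc
    namespace: api
  syncPolicy:
    automated:
      prune: true      # remove resources deleted from the repo
      selfHeal: true   # revert manual drift back to what's in Git
```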
Now, you might be wondering: "How did you manage to migrate everything?"
Well. Let me tell you about it, because it was a journey.
In order to do the migration, there were a couple of steps:
- Figure out which services run where
- Figure out if they run in Docker
- If not, make them run in Docker
- Verify they work as intended.
Service Discovery
Let's go through step by step.
The first step - figuring out what ran where - was mainly needed to strategise the order of the migrations, as well as to ensure we didn't miss any services or their data.
Thankfully, we had already separated a lot of concerns, so figuring out the service map wasn't a huge hassle.
Docker.
Here's where the fun begins.
As we started using Docker fairly early, most services were running in Docker. Some as regular images, some with docker-compose, some were built straight on the server upon running (which wasn't ideal), and some weren't even running in Docker.
Some services were even worse.
Take the API for example. We had two versions of it: the one that was actually running, and a version that had been sitting around for a pretty long while - one I thought we were already running.
As it turns out, the API server we were running was built from code sitting directly on the server, in a docker-compose stack that built it on startup. Worst of all? It was running on Node.js 10, which is ancient by this point.
As mentioned, we had already done work on dockerizing and rewriting, which meant that re-deploying the API was just a matter of using the images that already existed.
And this was the story with most of our services - the images already existed - so it was just a matter of configuring and deploying.
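To make that concrete, the difference boils down to something like this (a simplified sketch, not our real compose files):

```yaml
# Old pattern: the server builds the code itself every time the stack comes up
services:
  api:
    build: .   # source checked out on the server, built on the fly
---
# New pattern: pull a prebuilt, versioned image from the registry instead
services:
  api:
    image: registry.example.com/drpg/api:1.2.3   # placeholder registry and tag
```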
For the services not running on Docker? A whole different story.
The website.
If you're unfamiliar, we run the https://discorddungeons.me site ourselves - no website hosting provider or the like. It previously lived on a DigitalOcean droplet, outside of all the other servers, just ticking away money for no good reason, and I'd been wanting to move off that server for at least two years.
On this site and server we ran a couple of different concerns, none of them running in Docker:
- res.discorddungeons.me - Used for serving all the resources: game images, maps, icons and the like.
- bot.discorddungeons.me - A static site used for inviting the bot
- discord.discorddungeons.me - A static site used for redirecting to the discord invite (which is useful so we don't have to type discord.gg/blablabla all the time)
- donate.discorddungeons.me - The donations page
- discorddungeons.me - The front page site
As this was a lot of different concerns to tackle, I had to start somewhere, and once again my constant rewriting came in handy.
In a similar fashion as the API, I had already done work in rewriting the donations page and dockerized it, which came as a godsend – The previous version ran on PHP, and dockerizing a PHP application is pure hell.
Once done and working (Which came with a bunch of headaches related to signatures and backend stuff), I moved on to the front page.
This came with one small challenge.
As the front page is really just a static page without dynamic content, I could leverage that in order to improve load times.
There are tons of static site frameworks, generators and everything in between, promising extremely fast build times, load times and whatnot. However, I just needed a small library, and it just so happened that I had checked out Astro.
Now, Astro is a static site generator, with a syntax similar to JSX used in React, which is perfect since I'm rather well-versed in React.
Rewriting the front page was straightforward, and pretty fun, even with the headache of figuring out why the styling didn't work properly.
The static sites.
bot.discorddungeons.me and discord.discorddungeons.me are essentially the same: simple, one-page sites with some text, and I didn't want to run a separate container for each. This, again, was an easy concern to solve - just add another enabled site to the nginx container and point two routes in Traefik to it.
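In practice that's just one IngressRoute with two host rules pointing at the same nginx service - a sketch with a made-up service name:

```yaml
# Sketch: two hostnames, one nginx container serving both static sites
apiVersion: traefik.io/v1alpha1
kind: IngressRoute
metadata:
  name: static-sites
spec:
  entryPoints:
    - websecure
  routes:
    - match: Host(`bot.discorddungeons.me`)
      kind: Rule
      services:
        - name: static-nginx   # placeholder service name
          port: 80
    - match: Host(`discord.discorddungeons.me`)
      kind: Rule
      services:
        - name: static-nginx
          port: 80
  tls:
    certResolver: letsencrypt   # assumes the same resolver as before
```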
This post is getting extremely long, so I'll try keeping this last one brief.
The wiki.
By far the worst service to move was the wiki, as it consists of 4 services:
- A database
- A web server
- A PHP backend
- Parsoid
Remember how I mentioned dockerizing PHP is hell? This was even worse. I'm talking all stages of grief, full on inferno.
To start, I had to move the databases.
This was the biggest initial hurdle, as I had zero clue on how to actually import the databases, since I couldn't just upload them to the server.
Or so I thought.
As it turns out, I'm able to mount a Longhorn volume onto a specific system and then mount it like a standard hard drive, completely transparently, which, to be honest, is some really good software magic.
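For anyone trying to reproduce this without touching the host directly, a similar trick is a throwaway pod that mounts the same volume, so you can copy a dump in with kubectl cp and import it from inside - a rough sketch, with a made-up claim name:

```yaml
# Rough sketch: a temporary pod that mounts the wiki's database volume.
# Scale the real database down first if the claim is ReadWriteOnce.
apiVersion: v1
kind: Pod
metadata:
  name: db-import
spec:
  restartPolicy: Never
  containers:
    - name: shell
      image: mariadb:10.11
      command: ["sleep", "infinity"]   # keep the pod around while we work
      volumeMounts:
        - name: data
          mountPath: /var/lib/mysql
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: wiki-db   # placeholder claim name
```

From there it's a kubectl cp of the dump into the pod and an exec to import it.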
Once that was figured out, the rest went way smoother. At least in my romanticised version.
In reality, I had to write a full-on suite of tools in PHP to download and install MediaWiki, with all of its extensions, onto a Longhorn volume, which I could then mount into an nginx and a PHP container, with the nginx container talking to the PHP one.
Yes, it's as cluttered as it sounds. Welcome to the world of PHP.
The reason?
The standard MediaWiki image, when built with all the extensions we run, would be over a gigabyte in size, which is way too much when the files can just be served from disk directly.
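Stripped down, the shape of the wiki deployment is nginx and PHP-FPM side by side, both mounting the Longhorn volume that the install tooling fills - an illustrative sketch, not the literal manifest (nginx's fastcgi config is omitted):

```yaml
# Illustrative sketch: nginx serves the MediaWiki files straight from the
# Longhorn volume and hands PHP requests to the php-fpm container next to it
apiVersion: apps/v1
kind: Deployment
metadata:
  name: wiki
spec:
  replicas: 1
  selector:
    matchLabels:
      app: wiki
  template:
    metadata:
      labels:
        app: wiki
    spec:
      containers:
        - name: nginx
          image: nginx:1.27
          volumeMounts:
            - name: mediawiki
              mountPath: /var/www/html
        - name: php
          image: php:8.2-fpm
          volumeMounts:
            - name: mediawiki
              mountPath: /var/www/html
      volumes:
        - name: mediawiki
          persistentVolumeClaim:
            claimName: wiki-files   # the volume the install tooling populates
```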
Final Thoughts
Doing this, I've learned a lot, and the team has too. We still have some things to move, but this will lead to more stability in our services, and a much easier time with replication, updates and more.
Until next time.
– Mackan