In my time (13 years and counting!) with Logikcull, we’ve migrated from servers in a closet to a private colo cage, and eventually to our current home in the cloud. Each of those moves, and the major upgrades we’ve done along the way, has involved downtime and off-hours work, sometimes several days’ worth. These “maintenance windows” have ranged from software-only updates of existing systems to full-scale shutdown and relocation of physical hardware.
While they all presented their own unique challenges, I’ve been through enough to see some patterns emerge. For the remainder of this post, I’d like to walk through some of those learnings. These insights are probably less profound than they are subtle. It’s a collection of little observations and A-ha! moments captured over many anxious nights and weekends throughout the last several years that I’ll try to distill into the following six tips.
Maintenance windows and data migrations are definitely not the time for surprises. You’re typically given a fixed window to execute a set of tasks that can be unpredictable at best. Your customers (both internal and external) may be waiting for the minute the window expires so that they can get back into your system and get work done. The duration of the window is a commitment to customers that needs to be kept. Each maintenance period that I’ve experienced existed somewhere on a continuum, from ‘lone genius working down his/her own list, delegating tasks where possible’ to ‘several informed people executing pieces of a shared plan’. Without exception, the ‘shared plan’ windows were the most effective (and enjoyable).
If you’re on board with the idea of writing it down beforehand, you should strongly consider keeping the plan in a shared medium that allows collaborative (preferably real-time) editing, like a Google Doc or wiki page. This allows you to distribute the work of compiling the plan, and keeps it all in one SPOT (single point of truth). It will also benefit you during the window, as we’ll explore later. Keeping a central plan becomes increasingly critical as your system becomes more decoupled and the pieces require more specialized knowledge. This is not a time when you want to be manually merging attached document revisions that were shared via email.
I’ve also observed that it’s best to be as detailed as possible. In addition to broad strokes like ‘upgrade Apache’, I like to see step-by-step instructions down to the level of individual clicks and commands. At a minimum, these are the ‘happy path’ steps, but you could also include validation steps, recovery from common errors and pre/post checks. In the best cases, items from the plan can be pasted directly into a live terminal, which cuts down on fat-finger mistakes. Keep in mind that you may be working through this plan at 2AM, so efforts to cut out thinking are rarely wasted.
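As a sketch of what a pasteable runbook step with pre/post checks might look like (the service name, commands, and thresholds here are illustrative placeholders, not taken from our actual plans):

```shell
#!/usr/bin/env bash
# Hypothetical runbook step: upgrade a service with explicit pre/post checks.
# All names and commands are illustrative; substitute your own.
set -euo pipefail

step() {
  # Timestamp each step so the terminal scrollback doubles as a log.
  echo "== $(date -u '+%H:%M:%S') $*"
}

step "Pre-check: at least 100 MiB free on /"
df -Pk / | awk 'NR==2 { exit ($4 < 102400) }'   # fail fast if disk is tight

step "Upgrade (echoed here; your plan would contain the real command)"
echo "sudo apt-get install --only-upgrade apache2"

step "Post-check: service answers locally"
# curl -fsS -o /dev/null http://localhost/     # enable on the real host

step "Done"
```

Because each line is a complete command, the person on keyboard at 2AM can paste straight from the plan instead of composing commands under pressure.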
Leave yourself lots of breadcrumbs like notes, screenshots, links to web-based management UIs and code snippets to help get into the right context for a particular step. I ask myself whether someone else could successfully pick up my sections if I lost power right before the window. To take it one step further, each section of the plan should have a primary and a backup owner. From power outages to stomach flu and kids with night terrors, life doesn’t stop for a maintenance window, so it’s a good idea to build in some extra capacity.
With your plan in hand, it’s time to read it through. Ideally, you and multiple teammates have a chance to individually read through the plan from start to finish, ask questions and revise if needed. Approach it with a skeptical eye and ask questions like: Will each section actually finish within the window? What happens if this step fails partway through? Could someone else pick up these steps cold?
I’ve seen even better results when the upgrade team meets beforehand to do what we refer to as a ‘table read’ of the maintenance plan. The specifics can be tailored to your particular situation, but this generally entails a face-to-face meeting where each section’s primary owner walks down their steps verbally and talks through the expected outcome. Sometimes hearing it spoken out loud can spark priceless insights like “that won’t finish during the window” or “we could patch that system beforehand”.
If the stakes are high enough, you may also want to do a ‘dress rehearsal’. This is a souped-up version of the table read where you pull the team together and actually execute the maintenance plan on a non-production environment. We maintain multiple pre-production environments for testing features before they’re released. These systems are a scaled-down version of production, so we’re still able to go through the motions without necessarily working on production-sized data.
Whether you’re acting as a solo editor or doing high-fidelity run-throughs, this revision step always turns up some useful nugget. At worst, it will make things feel more familiar when it really counts.
Just as it’s important to have the plan in a shared place, I think it’s fairly crucial to have 1-2 shared channels for communication during the window. After the events of 2020, we’re all familiar with video calls like Zoom and Google Hangouts. Having a fully synchronous channel where you can ask for someone’s attention and know that they’re listening is incredibly important when there is a complex set of steps where order and ‘gates’ matter. Even just saying steps out loud and hearing people respond affirmatively really helps my confidence in a high pressure situation. Screen sharing can get multiple eyes on a problem without requiring someone to simultaneously think and narrate what they are seeing.
I also like having a secondary async channel such as Slack, MS Teams Chat or one of several others. This is a good forum for asking lower-priority questions, communicating status or sharing code snippets. Admittedly, it’s also a good place to share goofy memes and jokes along the way. Keeping it light and trying to have a good sense of humor can really help to keep up morale during this stressful time.
At Logikcull, we’ve had good success keeping the async channel open in the run-up to the window and then hopping on synchronous channels 30-60 minutes before the start of the window. We’ll use this time to check up on the current status of systems, start opening consoles, pulling up management UIs and checking in with one another.
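That pre-window check-in lends itself to a small script. Here’s a minimal sketch of a sweep you might run 30-60 minutes before the start; the check names and the commented-out hosts and endpoints are hypothetical placeholders:

```shell
#!/usr/bin/env bash
# Sketch of a pre-window sweep: report PASS/FAIL for each system check
# without aborting on the first failure. Hosts/endpoints are placeholders.
set -u

check() {
  local name="$1"; shift
  if "$@" >/dev/null 2>&1; then
    echo "PASS  $name"
  else
    echo "FAIL  $name"
  fi
}

check "root filesystem mounted"  test -d /
check "UTC clock readable"       date -u
# check "db-01 reachable"        ping -c1 -W2 db-01             # your hosts here
# check "app healthy"            curl -fsS http://app-01/healthz
```

Pasting the PASS/FAIL output into the async channel gives everyone the same picture of system health before the clock starts.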
If you’re starting a maintenance window having followed the previous three tips, you’re starting strong. It’s very likely that you’ll kick off with multiple people cruising through their sections independently and knocking items off the list. If your plan is stored in a real-time collaboration tool like Google Docs, you can literally watch the items being formatted with strikeout (Cmd+Shift+X shortcut for the win!). Sometimes things go more smoothly than expected and people can pitch in on other efforts or jump ahead to start shaving time off the end.
However, it’s common to hit bumps along the way. Maybe a database fails to restart, Windows Updates freeze midstream or a third-party package provider is down. These windows typically take place off hours, so being online but blocked is like rubbing salt in a wound. These things happen, but with a solid plan the current state and remaining work are known, so you can quickly shift people to later steps in order to keep up the pace. Also, if you did the table read, there’s a good chance these possibilities have been previously discussed and ideas are already flowing. Getting stuck in one area doesn’t mean that the whole team has to grind to a halt, and your teammates are more likely to connect the dots if they’ve been properly primed with preparation.
This is a great example of Eisenhower’s idea that “Plans are useless, but planning is indispensable.”
It’s best to think of your maintenance plan as a living document. Unexpected setbacks, and the insights they generate, should be captured inline with the steps being executed. I’ve seen my coworkers repeatedly rise to the occasion to work around hardware flakiness, stale third-party documentation and unresponsive processes (to name just a few unexpected setbacks). Those flashes of brilliance can be captured and fed into the next round of maintenance, lest they be lost in the heat of so many moments. Being diligent about capturing the insights along the way makes the next round better.
It’s also useful to maintain a section of follow-up and ‘not-doing’ items. Some will be known before the window, because they were intentionally scoped out. Others will be discovered along the way. It’s easy to think we’ll remember, but the pace of many maintenance windows is like drinking from a firehose with information and obstacles flying at you from all directions. It’s very easy to forget something along the way.
These documents become a record of the events, which helps remind us which things went well and which still need follow-up. When planning another round of maintenance, our team commonly starts by making a copy of the previous maintenance plan and using it as a starting point. This helps us maintain some consistency from window to window and avoid the pitfalls from the last time around. We’re continually improving our processes with each new iteration.
Quite possibly the best thing you can do during a maintenance window is nothing! Avoid doing work inside the window if at all possible. These windows can be extremely stressful for both you and your customers, especially when things go wrong. Over the years, we’ve put in the work to implement ‘pause’ features, cluster services and add redundant pieces of infrastructure in order to continually expand the list of machines that can be down without causing an outage for customers. This allows us to start patching, upgrading and rebooting critical infrastructure weeks ahead of the actual window. Anything that can be taken care of in advance of the window is almost always worth it. It carries the side benefits of cutting down the required turnaround for rolling out critical security patches and making your infrastructure more robust to outages in general.
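With enough redundancy, that ahead-of-the-window patching can be a simple rolling loop, taking one node out of rotation at a time. A hedged sketch, where the host list and the `drain`/`patch`/`restore` helpers are hypothetical stand-ins for your load balancer and config management commands:

```shell
#!/usr/bin/env bash
# Sketch: patch redundant nodes one at a time so capacity never drops
# below N-1. Host names and helper bodies are placeholders.
set -euo pipefail

hosts=(worker-01 worker-02 worker-03)

drain()   { echo "draining $1";  }   # stop routing new work to the node
patch()   { echo "patching $1";  }   # e.g. ssh "$1" 'sudo apt-get upgrade -y'
restore() { echo "restoring $1"; }   # put the node back in rotation

for h in "${hosts[@]}"; do
  drain "$h"
  patch "$h"
  restore "$h"      # confirm health here before moving to the next node
done
```

The key design point is serializing the loop: because only one node is ever out at a time, the work needs no customer-facing window at all.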
While it may not be obvious that optimizing for maintenance would help with normal operations, it makes some sense. If you’re always analyzing dependencies between systems, then you have an idea of the blast radius of particular kinds of failures. If you’re building in redundancy that allows parts of your infrastructure to be updated independently, you’re making your system more robust. If your team is reinforcing collaboration by working in a cohesive fashion under time pressure, then you’ll have more practice for when the downtime is unplanned. Done right, maintenance windows can be a great way to reinforce behaviors that will help your team reduce your Mean Time To Recovery. The changes to your system will also help to reduce the Mean Time Between Failures.
You can probably tell that I’ve given this some thought. I think it caught my interest because infrastructure maintenance is necessary, but can easily feel like a chore. It needs to happen regardless, so I’d like to think about the experience, figure out a way to elevate it, and use it to drive other operational areas to be better.