Last week, I took some time away from work to travel back to Minnesota and camp in the Boundary Waters Canoe Area (BWCA) with my friends. When I was growing up in Wisconsin, my dad took my brothers and me to the BWCA with an outfitter guide. These days I feel confident going out on my own without a guide.
Our party was six people, all experienced backpackers. Everyone had been to the BWCA on a canoe camping trip before. Still, some things went wrong, and since it's my job to keep things from going wrong, I wanted to write about how the lessons I learned as a Site Reliability Engineer apply to the world outside computers.
Pack light
Each of us carried a hiking pack with our personal stuff – clothes, toiletries, sleeping bag and pad. We split the shared goods – 2-person tent, water filter, rain tarp – evenly among the bags. Even so, pack weight varied drastically from person to person. On the low end, James and I packed ultralight Osprey Atmos packs weighing in fully loaded around 30 lb (14 kg), while Dinh and Jared packed wider, larger "portage packs" clocking in around 50 lb (23 kg).
Carrying bigger, heavier packs means you can bring more. It also means you have to portage them to get to your campsite. Our campsite was separated from the initial put-in lake by three other lakes, each separated by land. We had to pull the canoes out three times, portage the canoes to the other side, haul the bags over, load the bags, and put the canoes back in. The process is lengthy and strenuous: our longest portage was about 0.7 mi (1.1 km) of hilly, rocky trail. We rented light Kevlar canoes, but a "light" canoe still weighs 50 lb (23 kg) – a significant load for your shoulders.
Our canoe – the light packers – could haul everything in one trip using the following configuration:
- A: carry a pack and the canoe
- B: carry a pack and help balance the nose of the canoe
- C: carry two packs
The other canoe – the heavy packers – did not have this option because their packs were so heavy; they had to make two trips between the entry and exit points on every portage. The discrepancy in active portage time made it harder to keep our party together because the "light" canoe would be in the water waiting while the "heavy" canoe was still moving down the trail.
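To put rough numbers on that gap, here is a small Python sketch. It assumes that a second loaded trip also means one empty walk back, so a double carry covers the trail three times. The function name `walked_distance` is made up for illustration; only the 0.7 mi figure comes from the trip.

```python
def walked_distance(portage_miles: float, loaded_trips: int) -> float:
    """Total distance walked to move all gear across one portage.

    Every loaded trip after the first requires an empty walk back,
    so n loaded trips means (2n - 1) traversals of the trail.
    """
    if loaded_trips < 1:
        raise ValueError("need at least one loaded trip")
    return portage_miles * (2 * loaded_trips - 1)


longest = 0.7  # miles, our longest portage
single = walked_distance(longest, loaded_trips=1)  # light canoe
double = walked_distance(longest, loaded_trips=2)  # heavy canoe
print(f"single carry: {single:.1f} mi, double carry: {double:.1f} mi")
```

Under that assumption, the heavy canoe walks three times as far on every portage, and two of those three legs are under load.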
When we unpacked at camp, I noticed lots of ways that our heavy campers could have saved weight:
- Someone brought a blue hardware store tarp that weighed about 4 lb (1.8 kg). Modern ultralight tarps can come in as low as 8 oz (227 g), pack smaller, and serve the same function.
- Someone brought two Cliq camp chairs, each of which weighed 3.7 lb (1.7 kg). They're nice chairs, but we didn't strictly need chairs with legs. Every campsite had logs configured around the firepit grate as seating, and we always sat around the fire anyway. A great alternative would have been the folding foam canoe chairs – these doubled as secure, comfortable seats in the canoes when we paddled, and when we arrived at camp, we unclipped them from the canoes, brought them up to the campfire sitting logs, and used them as seats. Any equipment that pulls double duty is worth its weight!
- Someone brought a bulky 32 oz (950 ml) plastic canteen with a built-in LifeStraw. The LifeStraw doesn't weigh much, but it adds volume to the water bottle, and it's redundant when we're already hauling water filters for the lake water. My hydration bladder fit nicely along the inside flat back of my hiking pack and took up much less volume.
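A quick tally of the items above shows how fast this adds up. This is a back-of-the-envelope sketch in Python: it uses the best-case 8 oz tarp, assumes the camp chairs are simply left at home because the foam canoe chairs were already along, and skips the canteen because no weight is given for it.

```python
# Weights in pounds as (before, after) for each swap.
swaps = {
    "hardware store tarp -> ultralight tarp": (4.0, 0.5),  # 8 oz = 0.5 lb
    "two Cliq camp chairs -> left at home": (2 * 3.7, 0.0),
}

saved = sum(before - after for before, after in swaps.values())
print(f"weight saved: {saved:.1f} lb")
```

Two swaps alone come to roughly 11 lb across the group.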
In my work as an infrastructure engineer, I see lots of ways that we misuse tools and add extraneous redundancy and overhead to daily processes:
- Approval and sign-off workflows in our shared repos demand code review and approval from people who are not only outside of the codebase, but outside of the domain – the mandatory reviewers don't understand the problem that's being solved in the pull request.
- Overly complex domain tools are chained together in redundant ways: using Terragrunt to template out Terraform modules that deploy using Helm charts under the hood makes it incredibly hard to understand how a parameter turns into a value, and makes the system inscrutable to new engineers. It doesn't add value, either: Terraform can do all of this without additional tooling.
- Sometimes, less is more. Many of our shared infrastructure modules define categories of databases, storage systems, or workloads to be used in production environments. But sometimes this work to define domain problems as infra modules is premature, and users find it easier to simply write the verbose infrastructure by hand ("going full YAML") to get what they need. Rich Hickey's Simple Made Easy is a wonderful talk about how to understand your domain to write the correct abstraction for your users without overcomplicating.
Plan backups for your backups
Throughout the trip, we had a series of redundant systems to get where we needed to go without getting lost in the wilderness. We had two canoes, and one person brought a pair of FRS/GMRS radios so we could communicate between parties. We had two waterproof maps of the area, detailing the location of campsites and length of portages. And we had a GPS in each boat as a quick reference and backup to the maps.
On our last day, we had chosen a campsite a stone's throw away from our entry point. We camped on Alton Lake, one lake over from Sawbill Lake via a 0.1 mi (0.16 km) portage. We planned to be back at the parking lot within an hour of shoving off. The heavy boat launched first while we finished loading the light boat. But as our boat headed south along the east shore toward the east portage landing, we saw the other boat head south along the west shore and blow straight past the portage. Worried, we landed our boat at the portage, moved to the exit point, and gave the other party five minutes to catch up. We tried to figure out what happened while we waited, but we only got the full story after we got back to the parking lot:
- The person with the radios had decided to pack them both up, reasoning that it was a short, straightforward trip and we wouldn't need them.
- The person in the other boat had folded their map up into their tent by mistake and couldn't access it when they loaded into the canoe.
- No one in the other party had reviewed the course closely before shoving off.
- Someone in the other party had mistaken the location of the portage, telling the party it was at the south end of the lake rather than halfway down the east bank.
- The person in the other boat with the GPS didn't review the route as they paddled.
- I had asked the other crew to keep our boats together in a single convoy, but they opted to set off immediately so they could get back to civilization faster.
To anyone who has ever participated in a high-pressure production outage that's gone poorly, this sequence of failures likely looks familiar. Disasters are usually not caused by a single event, but rather by a series of systemic failures that chain together into an unwanted outcome. An example of this is the 1977 Tenerife airport disaster, in which two 747 passenger jets collided on a runway, killing 583 people. Major factors in this incident included:
- Pressure on one of the pilots to take off ASAP to remain in compliance with his airline's duty-time regulations
- Pressure to take off immediately to avoid further delays from deterioration of weather conditions
- Sudden fog which greatly limited visibility, preventing plane crews from seeing each other and tower crew members from seeing either plane
- Radio interference which made it difficult for plane crews to understand tower instructions
When you build practices for daily operation – or incident management – into your organization, take time to think through the human factors. Working through the questions below will give you a better understanding of how your system can fail in ways you never designed for. You can use the answers to build stronger processes that break links in the disaster chain before they connect.
- Are your engineers typically working on code and production changes as a team, or are they independent cowboys shipping solo?
- How do your team members typically communicate when something is going right – or wrong?
- How do they behave when under stress, as opposed to during business as usual?
- What conflicts exist in your organization map? What teams don't fully trust one another, and are likely to withhold information in a critical period?
- Which teams are motivated to uphold the integrity of the system as a whole, and which teams are directly motivated to uphold the integrity of their deliverables above other teams' work?
- How are engineers being pressured by executive leadership? Which teams are being overloaded into taking on more work than they can reasonably support? What kinds of tech debt will definitely result in future incidents if not addressed?
Rethink your knots
When camping in the northwoods of the midwestern United States, the most common concern is black bears. Black bears in Minnesota average from 154 lb (70 kg) to 275 lb (125 kg) and are known to be persistent and intelligent about acquiring food. Although they rarely approach campers and direct encounters are uncommon, they become more dangerous when acclimated to human food, learning that human campsites have food which can be foraged.
To minimize bear attacks, park staff have drilled into campers the utmost importance of making food totally inaccessible and removing it from campsites. Before you go to bed in your tent every night, you have to take the following precautions:
- Remove all scent from your campsite. Store food and toiletries (e.g. deodorant sticks, toothpaste) in sealed dry bags. Scoop any loose food debris into the fire pit and either burn it or bury it in ash.
- Wash dishes in a bucket. Bring the bucket of dirty water to the woods several hundred feet from your campsite (usually near the latrine) before dumping it. Brush your teeth and spit out here.
- Hang a bear bag. Attach a rope between two trees at least 12 ft (3.7 m) apart. Rig a pulley system with carabiners. Attach your food bag to the center of the rope between the trees and pull it up. Secure the working end of the pulley rope to the tree.
- Watch out for common pitfalls. Bears can figure out how to shred an improperly hung rope, dropping the bag right into their claws. Bears can stand on their hind legs and shred bags that hang lower than 12 ft (3.7 m) above the ground (the dreaded "bear bag piñata").
- Do this every night. Do not skimp. Do not skip one night because it's late, the sun is down, and you're cozy in your tent.
All of us in the group had hung bear bags before, with varying degrees of ease. The first night, we hung our bear bags between two trees in the simplest configuration: two ropes joined in the middle by a carabiner, hauled up by hand. We hung the bag successfully, but we had trouble lifting it: everyone's food and toiletries were heavy at the start of the trip, before we had eaten much of the food, and it took three people to pull the bag to hanging height. We also used a rope toggle to secure the free ends.
I had brought along a copy of The Little Book of Incredibly Useful Knots, a wonderful book that goes into detail on various sorts of knots, bends, hitches, and more. It gave me some great ideas on how we could minimize the amount of hardware we needed to use and improve the hanging setup using only the rope we brought with us.
The next night, I offered to improvise a hanging method that would be simpler, use fewer carabiners, and accomplish the same security for our bear bags:
- Two ropes joined in the middle by a Zeppelin bend, eliminating the need for a carabiner
- Use of the now-free carabiner to form a pulley, easing the burden of lifting the heavy bags through a 2:1 mechanical advantage
- Securing the working ends of the ropes to the tree using a round turn and two half hitches to clean up the slack and minimize un/winding work
- Hanging the bags from a butterfly loop to secure their position along the line between the trees
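As a rough check on what the pulley buys you, here is the arithmetic for a single moving pulley sketched in Python. The 40 lb bag weight and the 0.5 efficiency figure are placeholder assumptions, not measurements. Rope running over a bare carabiner loses a lot to friction, so the real gain is smaller than the ideal 2:1.

```python
def haul_force(load_lb: float, efficiency: float = 1.0) -> float:
    """Pull needed to raise a load with a single moving pulley (a 2:1 rig).

    The load hangs from two strands of rope. With a frictionless pulley
    both strands share it equally. With friction, the anchored strand
    carries only `efficiency` times the tension of the strand being pulled,
    so pull + efficiency * pull = load.
    """
    if not 0.0 < efficiency <= 1.0:
        raise ValueError("efficiency must be in (0, 1]")
    return load_lb / (1.0 + efficiency)


bag = 40.0  # lb, assumed weight of the food bag
ideal = haul_force(bag)              # frictionless pulley
carabiner = haul_force(bag, 0.5)     # assumed 50% efficient carabiner
print(f"ideal: {ideal:.1f} lb, over a carabiner: {carabiner:.1f} lb")
```

Even with heavy friction losses, the pull stays well under the full weight of the bag, which is the whole point of freeing up the carabiner.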
We had success with most of these techniques, but struggled with tossing our rope over the branches in the trees available to us. We ended up skipping the pulley and muscling the bags up to height – which wasn't so bad in the end, because by the last day, we had eaten most of the food and the bags were much lighter. But the techniques I got to practice made me much more comfortable with my own ability, and I'm now confident that I can nail down a two-tree pulleyed bear bag system on my own next time. (Outside of hanging the bear bags, I also learned some great knots for tying guylines to support tarps and tents!)
In systems engineering, especially in software, it's important to return to your assumptions and your working system frequently. Often, you'll find a live production system has evolved organically to support business needs, and developers have bolted functionality on like campers bolt carabiners to their backpacks. If you take a big step back and audit the system holistically, you can often find ways to rearchitect these legacy applications to simplify them and make them easier to work with.
For example, I've worked on two major "data mover" systems whose purpose was to move data from one big database to another on a regular basis. They effectively served as ETL applications: extract data from the source, transform it to fit the form the consumers want, load it into the destination. But their business logic code had grown organically without a design document or, seemingly, a plan, and both systems were quickly becoming unmaintainable.
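The extract-transform-load shape can be sketched in a few lines of Python. This is a minimal illustration, not code from either system: the record fields, the in-memory lists standing in for databases, and the function names are all hypothetical.

```python
from typing import Iterable, Iterator

Row = dict[str, object]


def extract(source: Iterable[Row]) -> Iterator[Row]:
    """Read rows from the source (any iterable stands in for a database)."""
    yield from source


def transform(rows: Iterable[Row]) -> Iterator[Row]:
    """Reshape each row into the form the consumers want."""
    for row in rows:
        yield {
            "id": row["id"],
            "name": str(row["name"]).strip().title(),
        }


def load(rows: Iterable[Row], destination: list[Row]) -> int:
    """Append rows to the destination and return how many were written."""
    count = 0
    for row in rows:
        destination.append(row)
        count += 1
    return count


# Hypothetical sample data standing in for the source database.
source = [
    {"id": 1, "name": "  ada lovelace "},
    {"id": 2, "name": "grace hopper"},
]
destination: list[Row] = []
written = load(transform(extract(source)), destination)
print(written, destination)
```

Keeping the three stages as separate functions means each one can be tested and replaced on its own, which is exactly what gets lost when business logic accretes without a plan.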
I wrote a project proposal to completely overhaul each system from first principles. The proposal included the time to fully audit each system's business responsibilities and source code, diagram and document each system, build a drop-in replacement, migrate business logic to the new app, and perform a switchover of the daily job. "Big bang" rewrites are often seen as productivity poison by project managers because engineers love to spend time refactoring code that they see as "ugly" or "sub-optimal," but if you put the right plan into place beforehand, you can build your way toward success from the start. I was able to execute both of these "big bang" rewrites successfully by spending my time getting buy-in from my managers and team leads so that they understood how supporting this project would end up directly supporting their work.
The more time I spend inside large systems, the more I tend to think about the world around me as a large system. Even though I type glyphs into a screen all day for a living, I still find that the lessons I learn about reliability and resilience from the digital world apply to the analog one. By packing light, planning backups for my backups, and rethinking my knots, I find that my systems improve in all parts of my life. As a site reliability engineer, you're in an excellent place to help your fellow humans by sharing your favorite techniques, discussing novel approaches, and reworking systems that have been taken for granted. It's your duty to make the world a safer, more reliable place!