The Developer is Dead, Long Live the Developer

April 17, 2014 12:00 AM

I came across an article called "How DevOps is Killing the Developer." It mourns the rise of DevOps and the ever-increasing set of skills a developer needs to work in the resource-constrained environment of a startup. The assumption is that developers have to fill all these roles even though their position is at the top of the company's hierarchy and no one else can do what they can.

I can relate to mourning how things used to be. Writing code is fun, and it's understandable that a developer would want to work on it all day long.

I'm intrigued by this notion of DevOps and the assumption that no one else can do what developers can do, therefore they should be doing what they've specialized to do.

However, times for developers are a-changing and DevOps plays an important role in that.

This is my view of the situation, about what DevOps is about and the change developers are facing.

DevOps is about shared responsibility

The author claims that DevOps stems from startups. Resource constraints and the need to quickly respond to an ever-changing market require developers to fill roles that no one else can fill. I've done my fair share of that, and it's certainly a valid assumption.

However, DevOps has different goals than developers knowing all the infrastructure automation tools out there.

DevOps is about tearing down silos in more traditional companies, where there's a much stricter separation between development and operations.

It's a cultural shift that fosters people working together, towards a common goal, which ultimately leads to serving the customer.

The totem pole described in the article is the exact thing that DevOps is trying to improve. In a more traditional sense, it's described as the ivory tower of parts in an organization, whether that be operations, development or your quality assurance team. The tendency is to throw things over the wall, let them handle it, and not bother with whatever happens after the release anymore.

The ultimate beneficiary of whatever anyone in the company does should be the customer. The argument that developers are too expensive for any other tasks than writing code suggests they're too good to talk to customers, to fix their own code, to see where it breaks in production under their own responsibility, to see how it affects customers.

Customer support falls into the same realm. Traditionally, companies have a front line support team to weed out the unimportant support requests, best done by replying with canned responses. Everyone loves those, right?

The outcome is that developers shouldn't just sympathize with their operations team, the people who run their code in production, they should sympathize with their customers.

DevOps pushes the focus of everyone towards working together rather than a single person trying to wear as many hats as possible at the same time, which will inarguably lead to burnout.

The resource constraints of a startup are a natural cause, at least initially, for people having to do whatever it takes to make their product succeed. It doesn't have to be that way, of course, but improving the constraints is up to the company's entire team, at least in a world where people work together.

Ultimately, DevOps is about empathy, with everyone on your team and with your customers.

You build it, you run it

In this world of silos, development threw releases at the ops or release team to run in production.

The ops team makes sure everything works, everything's monitored, everything's continuing to run smoothly.

When something breaks at night, the ops engineer can hope that enough documentation is in place for them to figure out the dials and knobs in the application to isolate and fix the problem. If it isn't, tough luck.

Putting developers in charge of not just building an app, but also running it in production, benefits everyone in the company, and it benefits the developer too.

It fosters thinking about the environment your code runs in and how you can make sure that when something breaks, the right dials and knobs, metrics and logs, are in place so that you yourself can investigate an issue late at night.

As Werner Vogels put it on how Amazon works: "You build it, you run it."

The responsibility for maintaining your own code in production should encourage any developer to make sure that it breaks as little as possible, and that when it breaks you know what to do and where to look.

That's a good thing.

The developer is dead, long live the developer

It's okay to mourn what we used to do as developers. Heck, I enjoyed writing code too, when I started out as a developer.

But the specialist developer is becoming a liability for any business that's competing in ever-changing markets (which are all of them).

This doesn't mean that a developer should be doing everything at the same time. But they should be ready to when the need arises. It benefits the entire team they're working in.

A developer probably doesn't need to know all the available automation stacks out there. Knowing one should be plenty. What really matters is the willingness to change, to learn a new stack when necessary.

Having more widespread knowledge about the environment their code runs in gives a much better picture of how the code can be improved to run better in production, to serve happy customers.

Here are a few more noteworthy responses to the blog post above:

Start by Building and Selling Small Products

April 07, 2014 12:00 AM

We love grand ideas. As engineers in particular, we like the idea of building something big that solves an idea we've had. We love sweating the details, we love refining architecture, we love building the right tools for the job.

When you start a business, this approach can foil your grand idea before you even hit a market. This ignores that ideas alone are worth nothing without a market for them.

What if, instead of working on your grand idea right away, you start working on something small?

When you've found a market, you don't instantly need to build a big product to serve it. You can start much smaller than that. Starting small has the benefit that you can gauge the market and its interests, and that you keep risk lower than with building something big right away. Products are shrouded in uncertainty. Keeping the risk low means keeping your losses low, which is rather beneficial when you're bootstrapping.

Building something small initially and slowly increasing the scope of what you build has another benefit. Your initial products start feeding into the work for the subsequent products. The money made from your first products helps you to build more products.

Smaller products can beget more, slightly bigger products. I love this idea, also called "stacking bricks."

The Riak Handbook turned out to be such a product for me. Working on it took the better part of three months, a big part of that spent full time on writing, editing, creating the publishing workflow (I wouldn't recommend building the publishing chain yourself), and the marketing site.

While most of the book's sales happened in the first days after going public, it has continued to sell for more than two years now, both on the site and on the Kindle store.

It helped a lot in getting Travis CI off the ground as a product. It wasn't a lot of money that came in, but for the twelve months after publishing the book, it brought in up to $1000 per month extra. Quite handy as passive income when you're working on bootstrapping another product.

How can you build something small before building something bigger? Here are a few ideas.

Build an audience with writing

The Riak Handbook started out by way of this very site, if you will. Early on in the craze of NoSQL databases, I started writing about the fun and silliness I had playing with some of them.

That in turn led to the idea of the NoSQL Handbook, which, after more thought, distilled into the Riak Handbook.

I unknowingly built up interest in these technologies, just by sharing what I discovered playing with them. The joys of new technologies.

Sell what you learn (and what you know)

A book or even a series of screencasts is a great means to start stacking the bricks. It's a nice next step from building up an audience, and it doesn't even require you to be an expert at something right away.

Rather than dump your entire knowledge into a book, write about what you learn, or learn as you write.

Here's a little secret: the Riak Handbook was my personal Riak learning experience.

Sure, I'd had exposure to it before, working for Basho and with their customers, but my deepest exposure to all facets of Riak was writing the book.

It turned out to be a great learning experience for distributed systems and for Riak itself. I even picked up some Erlang along the way.

Sell something that doesn't yet exist

Here's a crazy idea: before you actually build something, sell it. Put up a landing page for your product, start marketing it, see if someone bites.

If they do, you have all the more incentive to actually build it.

I'm a big fan of grand ideas myself. But the Riak Handbook, as small as it is, was a convincing exercise that it pays to start small. It pays off slowly, and revenue will start trickling in, but as you add more products, you add more revenue.

Heck, if you enjoy writing and selling books, keep doing it. Build more, sell more of them.

For some more inspiration on starting small and building your way up, I'd highly recommend these books:

What More Do You Need to Start?

April 04, 2014 12:00 AM

I used to have this beautiful dream that I'd some day open my own coffee shop.

It sounds simple enough. All you need is an espresso machine and you're good to go.

Except you need to rent a shop, buy a giant grinder or two, buy coffee beans, hire a barista or two, buy cups, takeaway paper cups, a water filter, and a freakishly expensive espresso machine.

Add some interior for the shop, and you're easily in the several tens of thousands to start your business.

Compare that to what you need to start selling products on the internet. You need to spend some time to find an audience, to find your market, and then, all you need is a laptop and an internet connection.

I have to keep reminding myself how little you need to start a business and to actually run it.

Beyond your computer, you don't even have to buy anything. You can rent (even pay by the hour) anything you need to serve your application, and you can utilize other services and products to help build yours. Heck, you can build on a ton of open source tools and libraries as well to help you get the job done.

You don't even need an office. All it takes is an itch you want to scratch, whether you want to build and sell a product, build a business around it, or just open source something.

I love that about what I do, and it's hard to overstate how lucky and privileged we are being able to get something off the ground like that and to tell other people about it.

All it takes is a laptop.

What's keeping you?

The Joys of New Technology

April 01, 2014 12:00 AM

Just recently I told you to disregard technology that's new and unknown to you when you're building a product and a business from scratch. This is quite important: in the uncertain beginnings of a new product, it's better to be safe than sorry when choosing technology, as long as the product's future is uncertain. Whether we like it or not, most products' and businesses' futures are, especially when they're just starting out.

However, I just recently experienced myself what happens when you're not exposed to new technology, focusing on getting a business off the ground.

Travis CI uses mostly boring, I'm sorry, proven technology. Our data goes into PostgreSQL, we use Ruby everywhere, mostly JRuby. It even uses older virtualization technologies.

But recently, I've been feeling like I'm falling behind, like I'm missing out on at least playing with some new toys, getting some fresh ideas into my head for solving problems.

We have a lot of problems yet to solve in our code base, and over the last two years, we played it safe. Which has been a good thing, it allowed us to scale up to 1000 customers with just boring technology.

But most of us have a natural curiosity when it comes to technology. We want to play with new toys, just like our kids do.

I've found that this is even more important when you write regularly. Just trying out something new gives some fresh insight into what you can do with technology.

You may not be able to solve a problem with something you're playing with right away, but it might come in handy in the future. If worst comes to worst, you write a blog post about what you've learned, and share it with the rest of us.

Just this week I played with Docker and etcd. It was fun, and it was a day well spent.

What have you been playing with lately?

Here are some ideas to get you started:

The New Technology Fallacy

March 31, 2014 12:00 AM

When we set out to build our first product (Scalarium, now better known as Amazon OpsWorks), we started off with a rookie mistake.

Initially, we played with some ideas to test if we could fit them together. This was mostly focussing on orchestration of servers and their provisioning. We tried out a mix of RabbitMQ, Nanite and Chef (early adopters, yay!).

Back then, NoSQL databases just started appearing, and we thought, screw MySQL, we're going to use something new and shiny. We started out using Amazon's SimpleDB, but were soon hindered by its limitations.

We built Scalarium on Rails, so it was only natural that we started writing our own ActiveRecord-like persistence layer. First, we wrote it to work with SimpleDB, and it was aptly called SimplyStored.

Later, after some first exposure to CouchDB, we decided to use it instead. It was gaining some traction, and we had good access to local community support.

After we hit a wall with our first attempt at using a custom storage layer, we rewrote it to support CouchDB. Slightly different semantics made some things harder to rewrite, and some things were quite awkward to handle, in particular when trying to map an ActiveRecord-like query model on top of a database that requires you to define your data queries upfront and store them as JavaScript or Erlang views.
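To make that mismatch a bit more concrete, here is a toy sketch in Python. It is neither SimplyStored nor CouchDB's actual API. `ViewStore`, `define_view` and `servers_by_role` are hypothetical names, and the store rebuilds its index on every query, which a real database would not do. What it does show is the constraint itself: a query can only go through a view that was defined beforehand, as a function emitting key/value pairs.

```python
from collections import defaultdict


class ViewStore:
    """Toy in-memory store; a hypothetical stand-in, not CouchDB's API."""

    def __init__(self):
        self.docs = {}
        self.views = {}  # view name -> map function, registered upfront

    def define_view(self, name, map_fn):
        # A view is a function emitting (key, value) pairs for a document.
        self.views[name] = map_fn

    def save(self, doc_id, doc):
        self.docs[doc_id] = doc

    def query(self, view_name, key):
        # Only predefined views can be queried; there is no ad-hoc "where".
        if view_name not in self.views:
            raise KeyError(f"view {view_name!r} was never defined")
        # Rebuilt on every call to keep the sketch short.
        index = defaultdict(list)
        for doc in self.docs.values():
            for k, v in self.views[view_name](doc):
                index[k].append(v)
        return index[key]


def servers_by_role(doc):
    # Hypothetical view: index server names by their role.
    if doc.get("type") == "server":
        yield doc["role"], doc["name"]


store = ViewStore()
store.define_view("servers_by_role", servers_by_role)
store.save("1", {"type": "server", "role": "db", "name": "db1"})
store.save("2", {"type": "server", "role": "web", "name": "web1"})
store.save("3", {"type": "server", "role": "db", "name": "db2"})

db_servers = store.query("servers_by_role", "db")
print(db_servers)  # ['db1', 'db2']
```

An ActiveRecord-style finder on an arbitrary attribute has no counterpart here unless someone wrote a view for that attribute first, which is roughly the awkwardness described above.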

We spent a lot of time on SimplyStored, and for no good reason other than the technical fascination with the idea of not using MySQL, a proven and fast database, which would've been very sufficient for our purposes.

In the end, we still managed to build a good product, but shipping it was delayed by an unnecessary amount of time because we tried to build our own stack rather than use what was there and proven.

It's tempting to use new technologies when you're building something new. After all, you've got a clean slate and can just play around.

But when you build a new product as a new company, getting something up and running, something to throw at users, is even more important.

It's okay to build up some technical debt along the way. Yes, you will spend time later cleaning it up, but at least you can do that knowing that what you've built initially was successful enough.

With a proven product that's bringing in revenue, you have more freedom to gradually remove the technical debt built up in the beginning. Of course you'll be adding new debt along the way, but that's just the circle of software engineering life, isn't it?

At Travis CI, we also made some mistakes in where we focused our attention while building a product. Some things were more focused on building something that's technically sound rather than making sure we get a working product in front of customers quickly.

Early on, Travis CI started out as a test balloon if you will, with a few simple components to prove its technical viability as a continuous integration system. Leaving some challenges aside, we were able to scale it out quite nicely. We're still working on removing the technical debt that we built up, but given that our customer base allows us to do so, that's just fine.

With Scalarium, it took us almost a year to get something in front of customers, an insanely long time. Looking back, the thing I would've done differently is not building our own persistence layer, which has no relation at all to what we were trying to build. It just took away precious time from building our first version that could be used by customers.

When building something new, be careful to not fall into the trap of shiny new technologies. They can be a blessing and a burden, but the latter is all the more likely when you step into the unknown.

Using proven technology can be incredibly boring, but it gives you the room to make sure that what you're trying to build and sell is sound as a product, rather than a technical masterpiece.

When building something new, simple and proven technology wins. You can always add more bells and whistles later.

Three Simple yet Incredibly Hard Productivity Tips

March 28, 2014 12:00 AM

Our working days (even our spare time and holidays) are filled with distractions. Every social network that we use is fighting for our attention. Plus, emails are always waiting to be replied to, archived or deleted. Push notifications are constantly reminding us to reply to a friend, that one

Together, they've formed the holy trifecta of distractions trying to pull us away from getting work done.

Here are some simple yet incredibly hard suggestions:

  • Disable push notifications on your phone except for the most important services.

    I've come to think of push notifications as push interruptions. They do nothing but distract, they urge you to pick up your phone, to do something. They directly appeal to our need for something new, something exciting.

    I only have push notifications enabled for text messages these days and for our alerting. If there's one thing I want to be made aware of, it's when production is down.

  • Avoid checking email first thing in the morning

    As helpful as email is in communicating, plowing through your morning inbox sucks the bejesus out of your creativity. I found it to be poison for mine, in particular getting started in the morning.

    Rather than continuously have email open, only check it in intervals. If you can't get used to that easily, set a timer, and don't break the timer.

  • Kill Twitter, Facebook, and all the others

    Okay, this is harsh. But I found that Twitter is just as bad for my creativity juice as reading email first thing. There's always a lot going on, which is why we like checking our social network feeds in the first place.

    And that's exactly what they prey on, our time, the little bit of attention we can muster up to focus on something for a short period of time. I love reading what's happening out there, but at the same time, I love getting work done.

These steps sound so simple, yet they're incredibly hard. We get excited by the thought of a new email bringing us good news, by a friend texting us or by someone liking a photo. But does it really add anything so useful that it warrants distracting us from what's really relevant?

I've removed Twitter, Instagram, games, even email from my phone. It's quite liberating. It does turn an iPhone into a rather expensive two-factor authentication device, but it removes a lot of pointless distractions.

Banksy says it best:

No more vibrating phone when an email comes in, when someone sends me a direct message or likes a photo.

All that can wait. My focus can't.

Building an Ethical Business

March 27, 2014 12:00 AM

With our own company growing, both in terms of our team size and our customer base, I keep finding myself thinking more about what kind of company we want it to be.

This touches on all aspects of the business, relationships with our customers, marketing our product, how we treat our community, both globally and locally, and most importantly, treating and growing our team.

What it boils down to for me is openness on the one hand and fairness on the other. Everywhere a company is active, there are always humans involved. Any issues that come up are best served by being brought out in the open, treated with empathy, the will to solve the problem, and the assumption that people are generally well-intentioned in what they do.

This goes into all directions, because at the very core, empathy is the most fundamental skill, both for the humans working in a company, and for the company itself.

Empathy is sometimes described as a personal trait, but it's a skill, a skill that can be learned, that can be honed, and that can be instilled as a core value of a company.

Empathy means taking your customers' issues seriously, acknowledging their problems, helping them fix any issues they might have, no matter if the issue is on their end or on yours.

Customer loyalty isn't something you can buy, it isn't something you can put a number on. Customer loyalty is something you have to work on every single day.

Empathy means building relationships with your customers rather than looking at them as transactions. When they have issues, you have issues. It's as simple as that.

It also means that when there's a bigger issue at stake that affects your company and your customers, it's tackled out in the open, head on, rather than swept under the proverbial rug. This includes security issues, operational/production issues, but also issues that affect your company in other ways.

We like to think that a company's brand and image can be controlled. The more we repeat our values, what we stand for, the more our customers will believe it.

That's bollocks. You can spend years trying to make yourself look pretty on the outside, but that facade can be destroyed by that one small thing that you didn't want to make public at the time.

Empathy means being open and honest about anything that affects your company. Does that mean you have to tell the world exactly how much money you're making or losing?

In what detail you make what you do public is up to you. We tend to fear that publishing too much could play into the hands of our competitors, that it could confuse customers, despite there being next to no proof this is actually the case.

I admire Buffer's openness in this regard. They're publishing their team's salaries, the letters they send to their investors, numbers about growth and losses. Can that hurt your company in any way? No one knows, because it just hasn't been done before.

Empathy means that your company is aware of its surroundings. Even in times of companies selling things to a global audience, with a distributed team, companies have a home, where they pay taxes, and a community they're inadvertently a part of.

An ethical business is about giving back to the community it's working in. I found inspiration on this in "The Knack", where the employees can get involved in community work on the company's time, and they get to choose a good cause to give something to at the end of the year.

Amy Hoy is doing something similar, part of their profits go to local charities. I found this very inspirational, and we started doing the same with part of our profits last year.

Beyond that, there's community work, helping kids and schools in need, lots of opportunities to jump in and help out in a company's local surroundings.

Empathy means treating your vendors with the same courtesy as you treat your customers. The same applies to them, you want to build relationships rather than think of vendors as a transactional means for your business.

Vendors are people, just like your customers, the people in your community, the people working in your company.

The people on your team are the most important for any company. Some would argue differently, but I'd say that for an ethical business, how you treat the people working for you is what shapes any interaction your business has with its surroundings, with its customers.

There's a quote in "Small Giants" that stuck with me:

For all the extraordinary service and enlightened hospitality that the small giants offer, what really sets them apart is their belief that the customer comes second.

At first sight, it sounds harsh. Clearly, a business' customers are the most important for its continuing success, no?

It takes a happy and driven team to make for happy customers. Relationships can only be made between humans. While customers can use your software or product, or whatever it is you're selling, whenever they have issues, they expect a human to help them out.

Building healthy relationships between your company and your customers requires all people in the company to have healthy relationships with each other, with the people they work with, the people they work for.

Just like with your customers, you can't buy your team's loyalty. It requires you to build relationships with them. Relationships are based on trust.

I'd argue that you can only earn trust by putting your trust in someone in return. Allowing people to do the right thing, while still giving them room to fail and learn, is the simplest way to start building trust.

When it comes to the people you work with, trust is reflected on different levels, not just work, but also how your company treats their personal lives.

Trust, in turn, comes down to empathy.

Empathy is the recurring theme in this post, it's the recurring theme in any human interaction. Empathy means you take your time to appreciate, to contemplate what another person is thinking, what they're saying.

Whether it's your customers or one of the people you work with, listening to their concerns and treating them as if they're yours is the start of building trust. If people learn that you can listen to them, without judgment, and help them figure something out, you're off to a good start.

As Chad Fowler said, empathy is your most important skill.

This is something I'm trying to work on every day: to work against my instincts, to listen to people first and ask questions before I pass my own judgment. It's hard work.

For an ethical business, a lot of things come down to "doing the right thing."

The right thing in the global, local or your company's micro scope can have lots of different meanings, and figuring those out will be the hardest. We've spent a lot of time working that out for our little business, and we're still at the very beginning.

Does an ethical company strive for profits? Of course, the question is how they're used. A company needs cash to survive in the long run, but it also needs to take care of its surroundings to function well.

What makes an ethical company then? I believe the core values lie in openness, honesty and, most importantly, empathy. Those are skills that need to be acquired, practiced and honed. We're only at the beginning of this journey for ourselves, and we're working hard to stick to these values.

On Assessing Risk in Socio-Technical Systems

March 24, 2014 12:00 AM

I gave a talk about risk and safety in engineering at the DevOps user group in Frankfurt recently.

I talked about practical drift, normalization of deviance and the general ideas of risk and how complex systems make it almost impossible to predict all possible outcomes for a system running in production. The idea of the unknown unknowns (thanks, Donnie Rumsfeld!) and Black Swans (courtesy of Nassim Taleb) also came up.

A black swan, or an unknown unknown, is an event that is not just unlikely; no one has ever seen or considered it before. It's an accumulation of events so unlikely that their coming together is beyond the risks anyone would normally consider. 9/11 comes to mind.

I had a chat with one attendee, who suggested that, before you build a system, you look at its properties and look at the possible influences of each one, considering the possible risks of things, going further and further back the causal chain of possible events that could lead up to an incident in the system to be designed and built.

As engineers, this seems like a plausible idea to us. You sit down, you look at your system from all known angles, you measure things, you apply some math here and there.

We like to think of engineering as a predictable practice. Once something's built with the right measurements, with the right tools and with a touch of craftsmanship, it'll last.

As a German, this idea certainly appeals to me. If there's anything we enjoy doing, it's building machines or parts for machines, or building machines to build parts of other machines.

The Boeing wing test

Take this picture, for instance. It's a magnificent sight, and it's a testimony to predictive engineering. It's the infamous wing test for the Boeing 787 Dreamliner.

For the test, the plane's wings are attached to a pretty impressive contraption. They're slowly pulled upwards to find the breaking point.

This test is intended to go way beyond the circumstances commonly found during normal flight operations, up to 150% above normal levels.

There's a video from a similar stress test for the Boeing 767 too. The wings break spectacularly at 154% beyond normal levels.

The engineers are cheering. The wings were built to withstand this kind of pressure, so it's only understandable, especially for us fellow engineers, that these guys are beyond happy to see their predictions realized in this test.

Ideally, you will never see wings being bent to these extremes.

Wings are but one piece in the big, complex system that is a modern plane.

A plane operates in an environment full of uncertainty. While we like to think we can predict the weather pretty well, its behavior cannot be controlled and can change in unpredicted, maybe even unprecedented ways. It is a system in itself.

This is where we come back to the idea that risk in complex systems can be assessed upfront, when designing, before building it.

A plane, on its own already a complex system, interacts with more complex systems. The humans steering it are one of them, the organization the pilots participate in is another. The weather is yet another complex system.

The interaction points of all these systems are almost boundless.

Engineers can try to predict all the possible states of a plane's operating environment. After all, someone is programming these states and the plane's responses to them.

But they can't predict how a human operator will interpret whatever information the system is presenting to them. Operating manuals are a common means to give as much insight as possible, but they're bound to what is known to the designer of the system before it is put into production use.

This is where socio-technical systems come into play. Technology rarely stands on its own, it interacts with human operators to get the job done. Together, they form a system that's shaped and driven both by technology and the social interactions in the organization operating it.

Complex systems exist on the micro and the macro level

A plane's wing is bound to wind, jet stream, speed, the material used to build it, the flaps to adjust the plane's altitude. But it doesn't end there. It's bound to the care that was used building it, designing it, attaching it to the plane, the care of maintaining it.

With these examples alone, the wing is part of several feedback loops. In "Thinking in Systems", a feedback loop is how a system responds to changing conditions. The wing of a plane can respond to increasing pressure from upward winds by simply bending. But as we've seen above, it can only bend so far until it snaps.

But the wing is able to balance the increasing pressure nonetheless, helping to reduce impact of increasing wind conditions on the plane.

The wing then interacts with the plane, with its wheels, with its speed, its jet engines, its weight. The plane interacts with the pilots, it interacts with the wind, with the overall weather, with ever-changing conditions around it.

The wing is therefore resilient. As per "Thinking in Systems":

Resilience is a measure of a system's ability to survive and persist within a variable environment. The opposite of resilience is brittleness and rigidity.

A wing is a complex system on the macro level, and it is constructed of much smaller complex systems at the micro level. It's a complex system constructed of more complex systems. It's part of even bigger complex systems (the plane), that are bound to even more complex systems (the pilot, weather conditions, jet stream, volcanic ash).

These systems interact with each other through an endless number of entry and exit points. One system feeds another system.

Quoting from "Thinking in Systems":

Systems happen all at once. They are connected not just in one direction, but in many directions simultaneously.

"Thinking in Systems" talks about stock and flow. A stock is a system's capacity to fulfill its purpose. Flow is an input and output that the system is able to respond to.

Stock is the wing itself, the material it's made of, whereas flow is a number of inputs and outputs that affect the stock. For instance, a type of input for a wing is speed of air flowing around it, another one the pressure built on it from the jet stream. The wing responds in different ways to each possible input, at least as far as it's been knowingly constructed for them.

If pressure goes up, the wing bends. If the flow of air is fast enough, the wing will generate lift, keeping the plane in the air.

Once you add more systems surrounding it, you increase the number of possible inputs and outputs. Some the wing knows how to respond to, others it may not.

As systems become complex, their behavior can become surprising.

The beauty of complex systems is, and this is a tough one to accept for engineers, the system can respond to certain inputs whether it was intended to do so or not.

If pushed too far, systems may well fall apart or exhibit heretofore unobserved behavior. But, by and large, they manage quite well. And that is the beauty of systems: They can work so well. When systems work well, we see a kind of harmony in their functioning.

With so many complex systems involved, how can we possibly try and predict all events that could feed into any of the systems involved, and how they then play into the other complex systems?

Our human brains aren't exactly built to follow a nearly infinite number of input factors that could contribute to an infinite number of possible outcomes.

It's easier to learn about a system's elements than about its interconnections.

The Columbia disaster

Let's dwell on the topic of wings for a minute.

During the Columbia crash on February 1, 2003, one of the low-signal problems the crew and mission control experienced was the loss of a few sensors in the left wing.

Those sensors showed an off-scale low reading, indicating that they were offline.

Going back to the launch, the left wing was the impact zone of a piece of foam the size of a suitcase, the risk of which was assessed but eventually deemed not to be hazardous to life.

The sensors went offline about five minutes before the shuttle disintegrated. Around the same time, people watching the shuttle's reentry from the ground noticed debris being shed.

The people at mission control didn't see these pictures; they were blind to what was going on with the shuttle.

Contact with the crew and the shuttle broke off five minutes later.

Mission control had no indication that the shuttle was going to crash. Their monitoring just showed the absence of some data, not all of it, at least initially.

A wing may just be one piece, but its connections to the bigger systems it's part of can go beyond what is deemed normal. Without any visuals, would you be able to assume that the shuttle is currently disintegrating, killing the entire crew, just by seeing that a few sensors went offline?

Constraints of building a system

When we set out to build something, we're bound by cost. Most things have a budget attached to them.

How we design and build the system is bound by these constraints, amongst others.

If we were to sit down and try to evaluate all possible outcomes, we would exhaust our budget before we even started building anything.

Should we manage to come up with an exhaustive catalog of possible risks, we then have to design the system in a way that protects it from all of them.

This, in turn, can have the curious outcome that our system loses resilience. Protecting itself from all possible risks could end up creating a rigid system, one that is unable to respond to emerging risks by any other means than failing.

Therein lies the crux of complex systems and their endless possibilities of interacting with each other. When we try to predict all possible interactions, there will still be even more at some point in the future.

The conditions a system was designed for are bound to change over time as it is put into production use. Increasing usage, changing infrastructure, different operations personnel, to name a few.

Weather changes because of climate change, and it takes decades for the effect to have any possible impact on our plane's wings.

How complex systems fail

With a nearly infinite number of interactions, and emerging inputs increasing them even further, the system can have an incredible number of failure modes.

But, according to Richard Cook's "How Complex Systems Fail",

Overt catastrophic failure occurs when small, apparently innocuous failures join to create opportunity for a systemic accident. Each of these small failures is necessary to cause catastrophe but only the combination is sufficient to permit failure.

It requires multiple failures coming together for the system to fail.
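
To make the "necessary but only jointly sufficient" part of that quote concrete, here's a minimal sketch in Python. The three failure names are hypothetical placeholders of mine, not something taken from Cook's paper, and real systems are of course nowhere near this tidy. It simply enumerates every combination of small failures and checks which of them add up to a catastrophe.

```python
from itertools import combinations

# Hypothetical small failures; the names are placeholders for illustration only.
SMALL_FAILURES = ("failure_a", "failure_b", "failure_c")


def is_catastrophic(active_failures):
    """Return True only when every small failure is present at the same time."""
    return set(SMALL_FAILURES) <= set(active_failures)


def catastrophic_combinations():
    """Walk through every subset of small failures and keep the catastrophic ones."""
    found = []
    for size in range(len(SMALL_FAILURES) + 1):
        for combo in combinations(SMALL_FAILURES, size):
            if is_catastrophic(combo):
                found.append(combo)
    return found


if __name__ == "__main__":
    print(catastrophic_combinations())
```

Of the eight possible combinations, only the one containing all three failures is catastrophic. Every additional small failure you have to consider doubles the number of combinations, which hints at why reasoning about all of them up front gets out of hand so quickly.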

With so many systems interacting with each other, predicting how and when a combination of failures is coming together feels beyond our mental capacity.

The human factor

What then holds our systems together when they're facing uncertainty in all directions?

Surprisingly, it's the human operator. Through ever-increasing exposure to and experience with operating systems in production, the human is the truly adaptable element in the system's equation.

What is important for any organization is that these experiences are openly shared to increase overall exposure to these systems, to bring issues to light, to improve the system as its inputs and the system's response to them change over time. Because, depending on their exposure, the knowledge of the system's behaviour under varying circumstances can be unevenly spread.

Again, quoting from Cook:

Recognizing hazard and successfully manipulating system operations to remain inside the tolerable performance boundaries requires intimate contact with failure.

Following this, maybe designing systems should focus more on building them with the human operator in mind than on trying to protect them from as many causes of failure as possible, including the human operator.

Organization culture and risk

Assume your organization has a good track record when it comes to safety and assessing risk. Is that an indicator that future projects are in good hands? Is a history of risk assessment and safety enough to warrant continuing safety?

According to Cook:

People continuously create safety.

Consequently, a good safety track record is no indication of the future. Safety is not a one-time purchase. It is a continuing process that shifts between production and monetary pressure, people's work load, and any activity at the sharp end, on the production system.

The Challenger incident is an interesting example here. On January 28, 1986, the Challenger shuttle lifted off the launchpad, only to disintegrate 73 seconds later. The flight's designation was STS-51-L.

NASA, going back to the Apollo program, inarguably has a history of successfully finishing missions, even to the moon. They had good experience constructing and running hazardous equipment in production.

But, with the Shuttle program, the organization found itself in different circumstances. In the wake of the Vietnam war, budgets were cut significantly, and staff shrank to about a third of its original size as a consequence.

NASA relied a lot more on external contractors to work on specific parts of the Space Shuttle, such as the solid booster rockets propelling the shuttle into the atmosphere.

For budget reasons, the rockets' design was based on the solid boosters of the Titan III rocket. Everyone at NASA assumed that the rockets were not only a good fit, but that there was sufficient experience with them in the organization.

Something else was different with the Shuttle program. NASA suddenly found itself under production pressure from potential customers. The program was meant to be as economical as possible, with up to 50 launches per year to make sure that costs were fully covered by revenue. The US Air Force was very much interested in using the Shuttles as a means of transporting satellites and other gear into space.

Following the changes in production pressure and working with more external contractors, NASA introduced a bigger management structure. Four layers of managers and sub-managers eventually existed at NASA, with every sub-manager reporting up the stream, representing their own teams.

When the first Shuttles were launched, the team responsible for the booster rockets noticed behaviour that differed from what earlier experience had led them to expect.

The joints holding the parts of the rockets together were rotating. The O-rings sealing the joints either burnt through under certain circumstances or behaved in unexpected ways at very low temperatures. When rubber gets below a certain temperature, it stiffens up, leaving it unable to move and potentially unable to fulfill its duty.

Most conditions were only seen in isolation rather than together affecting a single flight. For most of them, the team thought they understood their respective risks.

All these issues were known to the engineering teams involved; they were even considered critical to human life.

Before every launch, NASA held an assessment meeting where all critical issues were discussed. The issues found with the solid booster rockets were brought up regularly in the summaries given by their respective managers. There were slides showing notes on the issue, and the risk was discussed as well.

With every launch, the engineers learned a few new things about the behaviour of the solid booster rocket. Some of these things made it up the reporting chain, others didn't.

On the evening of the fatal Challenger launch, the teams came together to talk about the final go or no go.

A few of the engineers from the contracting companies had doubts about the launch, as the forecast for Cape Canaveral predicted very low temperatures, lower than during any previous launch of a Space Shuttle.

While the engineers voiced their concerns and initially suggested delaying the launch, management eventually overruled them and gave the go for launch.

Again from Richard Cook:

All ambiguity is resolved by actions of practitioners at the sharp end of the system.

There were a lot of miscommunication issues involved in this meeting alone, but the issue goes much deeper. The layers of management within the organization added an unintended filtering mechanism for safety issues and risks.

During presentations in assessment and pre-launch meetings, information was usually presented in slide form. In the Challenger days, they used overhead projectors; in later years, engineers and management resorted to using PowerPoint.

Regardless of the tool, the information was presented in a denser form (denser with every management layer), using bullet points, with several things compacted into a single slide.

This had the curious effect of the relevant information losing salience: the data that could have indicated real risks was intermingled with other information.

The Columbia accident suffered from similar problems. From the Columbia Accident Investigation Board's Report Vol. 1:

As information gets passed up an organization hierarchy, from people who do analysis to mid-level managers to high-level leadership, key explanations and supporting information is filtered out. In this context, it is easy to understand how a senior manager might read this PowerPoint slide and not realize that it addresses a life-threatening situation.

Edward Tufte has written an excellent analysis of the use of PowerPoint to assess the risk of the Columbia incident. Salience and losing detail in condensed information play a big part in it.

The bottom line is that even in the most risk-aware organizations and hazardous environments, assessing safety is an incredibly hard but continuous process. Your organization can drift into a state where a risky component or behaviour becomes the norm.

In "The Challenger Launch Decision", Diane Vaughan coined the term "normalization of deviance." What used to be considered a risk has now become a normal part of the system's accepted behaviour.

Scott Snook later refined it to "practical drift", "the slow steady uncoupling of practice from written procedure."

Sidney Dekker later made it even more concrete and coined the term "drift into failure", "a gradual, incremental decline into disaster driven by environmental pressure, unruly technology and social processes that normalize growing risk."

How do you prevent practical drift or drift into failure? Constant awareness, uncondensed sharing of information, open feedback loops, reducing procedural friction, loose layers, involving the people at the sharp end of the action as much as possible, written reports instead of slide decks as suggested by Tufte?

Maybe all of the above. I'd be very interested in your thoughts and experiences.

On Working (Too) Hard

March 21, 2014 12:00 AM

There's a prevailing idea when it comes to startups and building and running your own business.

It's the idea that to be successful, you need to work hard, put in long hours, and push your team to the limit as well.

Keeping up with the competition, trying to make your customers happy, your investors too, and trying everything you can to turn your business venture into a success, however that is defined.

Some companies even go as far as advertising it as normal that you can just take your work everywhere you go, to the park, to your kids' soccer game, maybe even to the pub?

I've fallen into this trap. I've been putting in 10-12 hours per day, working from home, with my family around. The family is understanding, but that doesn't justify these kinds of working hours.

As someone working on a product that's used around the globe and at every hour of the day, I can relate to this idea. When production is broken, it's handy to have someone around to respond. When a customer is having trouble, I want to help them. I'm used to taking my computer with me, even during the weekends.

Adding to that, with customers only coming online when it's the end of the business day in Europe, our support usually ramps up in the later hours, when customers come into our live chat and expect someone to help them with their problem.

Helping customers succeed is one of the most important purposes of a business, and we're trying as best we can to help them out.

But that thought drove me into a habit that's hard to break free from. It's the fear that there could be a new customer support issue every day, that there could be a new customer in the live chat that needs help.

This very habit has driven me to being on the computer from the morning to the evening hours, always waiting for someone to approach us with an issue.

It's a habit that's been having a very destructive effect on my work and my life, and the two are not the same.

A few weeks back, as our team kept growing, I came to the realization that working longer hours sets a bad example. It doesn't just affect me; it sets an implicit expectation that others on the team work just as long. It's poison for a team when even one person, in particular a manager, works longer hours. It gives the impression that it's normal and expected to work longer than what your contract says.

It wears people out, it's worn me out, on multiple occasions.

We recently started doing support rotation, where everyone gets a dedicated day of doing customer support, escalating to others on the team where necessary. That gives the support person the freedom to only focus on customers for the entire day, and it gives the rest of the team the ability to focus on getting other work done.

It's been effective for us, and it's already improved our own tooling, our user interface and our documentation.

But I still can't shake the habit of always wanting to help. You can say that it's a noble habit to have, but its downsides are starting to show.

There's a quote from Small Giants that stuck with me:

For all the extraordinary service and enlightened hospitality that the small giants offer, what really sets them apart is their belief that the customer comes second.

A company's team is what sets it apart. That team needs to be well rested to make for happy customers.

The only way to get them to do that is to discourage long working hours.

In our company, that starts with me.

Success doesn't depend on how much you work. It depends on where you focus your time.

Don't work too hard.
