This is how senior IT architects and CTOs understand change. To me, it’s a pretty unusual point of view. It explains the roles and qualities of an architect in a large enterprise, and how to train yourself to achieve those qualities. It includes expert advice on how to explain stuff, draw diagrams, examine architecture, to name a few.
The book is not about systems architecture.
The writing is superb: it’s very easy to read and profound at the same time.
Go and read it if you:
- Think that “Move fast and break things” is not a good idea most of the time. The book explains why and when economies of speed are better than economies of scale.
- Fear code and prefer configuration
- Don’t know what to do with black markets inside of your organization
- Don’t know how to achieve better software quality by increasing speed
Senior IT Architects and CTOs play a key role in such a digital transformation endeavor. They combine the technical, communication, and organizational skills to understand how a tech stack refresh can actually benefit the business, what “being agile” and “DevOps” really mean, and which technology infrastructure can improve quality while moving faster. Their job is not an easy one, though: they must maneuver in an organization where IT is often still seen as a cost center, where operations means “run” as opposed to “change”, and where middle-aged middle management has become cozy understanding neither the business strategy nor the underlying technology. It’s no surprise then that software / IT architects have become some of the most sought-after IT professionals around the globe.
Aristotle already knew that a good speech contains not only logos, the facts and structure, but also ethos, a credible character, and pathos, emotions, usually triggered by a good story.
Not only does it resemble architectural viewpoints, it also highlights each model’s strengths and liabilities. I learned how people talk and think differently: Corporate IT aligns all the time, consultants “tee off” things and “touch base” without breaking a sweat, while no one at Google ever used buzzwords like “big data”, “cloud”, or “service-oriented architecture” because all these existed internally before the terms were coined.
It seems one could aspire to innovate like a start-up founder, draw PowerPoints like a management consultant, code like an Internet engineer, and have the political skills of a Corporate IT denizen. I will try.
First, if your systems are still running and can absorb change at a reasonable rate after 5 years, there was likely a good architect involved. For a more concrete description, senior architects in the enterprise work at three levels:
- Define the IT Strategy, e.g., by assuring that the IT landscape adequately supports the business strategy or defining a set of necessary IT characteristics for systems to be built or bought. Strategy also includes “retiring” systems (in the Blade Runner sense of the word) lest you want to live among Zombies.
- Exercise Governance over the IT landscape to achieve harmonization, complexity reduction, and to make sure that systems integrate into a meaningful whole. Governance occurs through architecture review boards and inception.
- Deliver Projects to stay grounded in reality and receive feedback on decisions from real project implementations. Otherwise control remains an illusion.
Other Passengers
If you are riding the elevator up and down as a successful architect, you may encounter other folks riding with you. You may, for example, meet business or non-technical folks who learned that a deeper understanding of IT is critical to the business. Be kind to those folks, take them with you and show them around. Engage them in a dialog – it will allow you to better understand business needs and goals. They might even take you to the higher floors you haven’t been to.
You may also encounter folks who ride the elevator down merely to pick up buzzwords to sell as their own ideas in the penthouse. We don’t call these people architects. People who ride the elevator but don’t get out are commonly called lift boys. They benefit from the ignorance in the penthouse to pursue a “technical” career without touching actual technology. You may be able to convert some of these folks by getting them genuinely interested in what’s going on in the engine room. If you don’t succeed, it’s best to maintain the proverbial elevator silence, avoiding eye contact by examining every ceiling tile in detail. Keep your “elevator pitch” for those moments when you share the cabin with a senior executive, not a mere messenger.
In the end, most architects exhibit a combination of these prototypical stereotypes. Periodic gluing, gardening, guiding, impressing, and a little bit of all-knowing every now and then can make for a pretty good architect.
“Head of” was aptly described in an on-line forum I stumbled upon:
This title typically implies that the candidate wanted a director/VP/executive title but the organization refused to extend the title. By using this obfuscation, the candidate appears senior to external parties but without offending internal constituencies.
Fools with Tools
The scale and complexity of doing architecture at the enterprise level is what makes large-scale IT architecture exciting, but it also presents one of the biggest dangers. It’s far too easy to get lost in this complexity and have an interesting time exploring it, without ever producing tangible results. Such cases are the source of the stereotype that enterprise architecture resides in the ivory tower and delivers little value.
Another danger lies in the long feedback cycles. Judging whether someone performs good enterprise architecture takes even longer than judging good IT architecture. While the digital world forces shorter cycles, many enterprise architecture plans still span three to five years. Thus, enterprise architecture can become a hiding ground for wanna-be cartographers. That’s why enterprise architects need to show impact.
Some enterprise architects associate themselves closely with a specific enterprise architecture tool, which captures the diverse aspects of the enterprise landscape. These tools allow structured mapping from business processes and capabilities, ideally produced by the business architects, to IT assets such as applications and servers. Done well, such tools can be the structured repository that builds the bridge between business and IT architecture. Done poorly, they become a never-ending discovery and documentation process that produces a deliverable that’s missing an emphasis and is outdated by the time it’s published.
Visit all Floors
Architecture, if taken seriously, provides significant value at all levels. The short film Powers of 10, produced in 1977 by Charles and Ray Eames for IBM, comes to mind: the film zooms out from a picnic in Chicago by one order of magnitude every ten seconds until it shows a sea of galaxies. Subsequently, it zooms in until it shows the realm of quarks. Interestingly, the two views don’t look all that different. I feel that large enterprises are the same: the complexity seems much the same from far out as it does from up close. It’s almost like a fractal structure: the more you zoom in or out, the more it looks the same. Therefore, performing serious enterprise architecture is as complex and as valuable as fixing a Java concurrency bug, as long as the enterprise architects leave the ivory tower penthouse and take the elevator at least a few floors down.
Skill, Impact, Leadership
When asked to characterize the seniority of an architect, I resort to a simple framework that I believe applies to most high-end professions. A successful architect has to stand on 3 “legs”:
- Skill is the foundation for practicing architects. It requires knowledge and the ability to apply the knowledge.
- Impact measures how well an architect applies his or her skill in projects to the benefit of a project or a company.
- Lastly, leadership assures that the state of the practice advances and more architects are grown.
This classification maps well to other professional fields such as medicine: after studying and acquiring skill, doctors practice and treat patients before they go to publish in medical journals and pass their learnings to the next generation of doctors.
Skill is the ability to apply relevant knowledge, for example about specific technologies, such as Docker, or architectures, such as Cloud Architectures. Knowledge can usually be acquired by taking a course, reading a book, or perusing on-line material. Most (but not all) certifications focus on verifying knowledge, partly because it’s easily mapped to a set of multiple choice questions. Knowledge is like having a drawer chest full of tools. Skill implies knowing when to open which drawer and which tool to use.
Impact is measured in benefit for the business, usually in additional revenue or reduced cost. Faster times to market or the ability to incorporate unforeseen requirements late in the product cycle without having to start over positively affect revenue and therefore count as impact. Focusing on impact is a good exercise for architects to not drift off into PowerPoint-land. As I converse with colleagues about what distinguishes a great architect, we often identify rational and disciplined decision-making as a key factor in translating skill into impact. This doesn’t mean that just being a good decision maker makes a good architect. You still need to know your stuff.
Leadership acknowledges that experienced architects do more than make architecture. For example, they should mentor junior architects by passing on their knowledge and experience. They should also further the state of the field as a whole, for example through publications, teaching, speaking engagements, research or blogging.
The Virtuous Cycle
While the model is rather simple, it’s important to appreciate the balance between the three aspects, just as a stool cannot stand on two legs. Skill without impact is where new architects start out as students or apprentices. But soon it is time to get out into the world and make an impact. Architects who don’t make an impact don’t have a place in a for-profit business.
Impact without leadership is a typical place for a solution architect who is engrained in projects. However, without leadership this architect will hit a glass ceiling and won’t become a thought leader in the field. Many companies don’t put enough emphasis on nurturing or pushing their architects to this level out of fear that any distraction from daily project work will cost them money. As a result, architects in these companies plateau at an intermediate level and typically won’t lead the company to innovative or transformative solutions. Missing such opportunities is penny-wise and pound-foolish. In contrast, some companies like IBM formalize the aspect of leadership as “give back”: distinguished engineers and fellows are expected to give back to the community inside and outside the company.
Likewise, leadership without (prior) impact lacks foundation and may be a warning signal that you have become an ivory tower architect with a weak relation to reality. This undesirable effect can also happen when the impact stage of an architect lies many years or even decades back: the architect may preach methods or insights that are no longer applicable to current technologies. While some insights are timeless, others age with technology: putting as much logic as possible into the database as stored procedures because it speeds up processing is no longer a wise approach as the database turns out to be the bottleneck in most modern web-scale architectures. The same is true for architectures that rely on nightly batch cycles. Modern 24/7 real-time processing doesn’t have any night.
The circle closes when a senior architect mentors junior architects. Because feedback cycles in (software) architecture are inherently slow, this process can save new architects many years of learning by doing and making mistakes. 10 well-mentored junior architects will have more impact than one senior architect. Every architect should know that scaling vertically (getting smarter) only works up to a certain level and can imply a single point of failure (you!). Therefore, you need to scale horizontally by deploying your knowledge to multiple architects. As I am continuously trying to recruit architects and sense that many other large enterprises are in a similar need of architects, scaling the skill set is as important as ever.
Mentoring not only benefits the mentee, but also the mentor. The old saying that to really understand something you need to teach it to someone else is most true for architecture. Likewise, giving a talk or writing a paper requires you to sharpen your thoughts, which often leads to renewed insight.
Humans are actually terrible decision makers, especially when small probabilities and grave outcomes like death are involved. Kahneman’s book Thinking, Fast and Slow shows so many examples of how our brain can be tricked that it can make you wonder how humanity could get this far despite being such terrible decision makers. I guess we had a lot of tries.
Making decisions is a critical part of an enterprise-scale architect’s job, and many IT decisions, for example those about cybersecurity risks or system uptime, have similar characteristics of small probability but grave downsides. Being a good architect therefore warrants a conscious effort to become a better decision-maker.
IT purchasing decisions are often made based on extensive requirement lists that are calculated into scores. However, when you pick the “winner” with a score of 82.1 over the “loser” with 79.8, it’d be challenging to prove the statistical significance of this decision. Still, numeric scores may be better than traffic light comparison tables that rate each attribute as “green”, “yellow”, or “red”. A product may get “green” for allowing time travel, but a “red” for requiring planned downtime. While this may make it look roughly equivalent to one with the opposite properties, I know which one I’d prefer. Sadly, such comparison charts are often reverse-engineered from either a preferred outcome or the desire to maintain the status quo. I have seen IT requirements analogous to demanding that a new car must rattle at 60 mph and have a squeaky door in order to appropriately replace the existing one.
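To see how fragile a narrow scoring margin can be, here is a minimal sketch of a weighted requirements score. The vendors, criteria, weights, and ratings are all made up for illustration and do not come from the book. It only shows that moving a little weight from one criterion to another can flip the “winner”.

```python
# Hypothetical criteria weights and vendor ratings (0-100).
# None of these numbers come from the book; they only illustrate the effect.
RATINGS = {
    "Vendor A": {"features": 90, "operations": 75, "cost": 80},
    "Vendor B": {"features": 78, "operations": 88, "cost": 74},
}

ORIGINAL_WEIGHTS = {"features": 0.40, "operations": 0.35, "cost": 0.25}
# Same criteria, but 10 points of weight moved from features to operations.
SHIFTED_WEIGHTS = {"features": 0.30, "operations": 0.45, "cost": 0.25}


def weighted_score(ratings, weights):
    """Weighted average of the ratings, normalized by the sum of the weights."""
    total_weight = sum(weights.values())
    return sum(ratings[name] * w for name, w in weights.items()) / total_weight


def winner(weights):
    """Name of the vendor with the highest weighted score under these weights."""
    return max(RATINGS, key=lambda vendor: weighted_score(RATINGS[vendor], weights))


if __name__ == "__main__":
    for label, weights in [("original", ORIGINAL_WEIGHTS), ("shifted", SHIFTED_WEIGHTS)]:
        scores = {v: round(weighted_score(r, weights), 2) for v, r in RATINGS.items()}
        print(label, scores, "winner:", winner(weights))
```

If a modest, defensible change in weights changes the outcome, the score difference alone is probably not a sound basis for the decision.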
Bias
Kahneman’s book lists so many ways in which our thinking is biased that it’s really worth reading. For example, confirmation bias describes our tendency to interpret data in such a way that it supports our own hypotheses. The Google Ad dashboard was designed to overcome this bias.
Another well-known bias is prospect theory: when faced with an opportunity, people tend to favor a smaller, but guaranteed gain over the uncertain chance for a larger one – “A sparrow in the hand is better than the pigeon on the roof.” When it comes to taking a loss, however, people are likely to take a (long) shot at avoiding the penalty over coughing up a smaller amount for sure. We tend to “feel lucky” when we can hope to escape a negative event, an effect called loss aversion. Studies show that people typically demand 1.5 to 2 times the rational payoff to take a gamble on a positive outcome. Being offered a coin toss that makes them pay $100 on heads but gives them $120 on tails, most people will kindly decline despite the expected value being 0.5 × -$100 + 0.5 × $120 = $10. Most people will accept the offer only when the payout is between $150 and $200.
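The coin-toss arithmetic can be checked with a few lines of code. This is a small sketch, with function names chosen here for illustration, that computes the expected value of the bet for the payouts mentioned above.

```python
def expected_value(outcomes):
    """Sum of probability * payoff over (probability, payoff) pairs."""
    return sum(probability * payoff for probability, payoff in outcomes)


def coin_toss(loss_on_heads, gain_on_tails):
    """Expected value of a fair coin toss: pay on heads, receive on tails."""
    return expected_value([(0.5, -loss_on_heads), (0.5, gain_on_tails)])


if __name__ == "__main__":
    # $120 is the offer most people decline; $150 to $200 is the range
    # in which most people accept, according to the text.
    for gain in (120, 150, 200):
        print(f"lose $100 / win ${gain}: expected value = ${coin_toss(100, gain):.2f}")
```

Even the declined $120 offer has a positive expected value. The bets people typically accept pay out 1.5 to 2 times the amount at stake.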
I am sure you have seen project managers avoid the certain loss in velocity for performing a major refactoring because the payoff in system stability or sustained velocity is uncertain. They are feeling lucky.
Micromort
Ron and Ali help us think rationally about the jar with pills from above. A one in one million chance of dying is called 1 micromort. Taking one pill from the jar amounts to being exposed to exactly one micromort. The amount you are willing to pay to avoid this risk is called your micromort value. Micromorts help us reason about decisions with small probabilities, but very serious outcomes, such as deciding whether to undergo surgery that eliminates lifelong pain, but fails with a 1% probability, resulting in immediate death.
To calibrate the micromort value, it helps to consider the risks of daily life: a day of skiing clocks in at between 1 and 9 micromorts while motor vehicle accidents amount to about 0.5 per day. So a ski trip may run you some 5 micromorts – the same as swallowing 5 pills. Is it worth it? You’d have to compare the enjoyment value you derive from skiing against the trip’s cash expense plus the “cost” of the micromort risk you are taking.
So how much should you demand to take one pill? Most people’s micromort value lies between $1 and $20. Assuming a prototypical value of $10, the ski trip that may cost you $100 in gas and lift tickets costs you an extra $50 in risk of death. You should therefore decide whether a day in the mountains is worth $150 to you. This also shows why a micromort value of $1,000,000 makes little sense: you’d hardly be willing to pay $5,000,100 for a one-day ski trip unless you are filthy rich! Lastly, the model helps you judge whether buying a helmet for $100 is a worthwhile investment for you if it cuts the risk of death in half.
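The ski trip numbers can be reproduced with a short sketch. The cost figures are the ones from the text. The helmet break-even calculation is an added assumption: it supposes that the helmet halves the full 5 micromorts of each ski day and is reused on every trip.

```python
def trip_cost(cash_cost, micromorts, micromort_value):
    """Cash expense plus the dollar 'cost' of the risk taken."""
    return cash_cost + micromorts * micromort_value


def helmet_break_even_days(helmet_price, micromorts_per_day, risk_reduction, micromort_value):
    """Ski days needed until the risk avoided is worth the helmet's price.

    Assumes (not stated in the text) that the helmet removes the fraction
    `risk_reduction` of the daily risk and is reused on every ski day.
    """
    saved_per_day = micromorts_per_day * risk_reduction * micromort_value
    return helmet_price / saved_per_day


if __name__ == "__main__":
    print(trip_cost(100, 5, 10))         # prototypical $10 micromort value
    print(trip_cost(100, 5, 1_000_000))  # implausible $1,000,000 micromort value
    print(helmet_break_even_days(100, 5, 0.5, 10))
```

Under these assumptions the helmet pays for itself after a handful of ski days at a $10 micromort value, and sooner for anyone with a higher micromort value.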
The micromort value goes up with income (or rather, consumption) and goes down with age. This is to be expected as the monetary value you assign to your remaining life increases with your income. A wealthy person should easily decide to buy a $100 helmet while a person who is struggling to make ends meet is more likely to accept the risk. As you age, the likelihood of death from natural causes unstoppably increases until it reaches about 100,000 micromorts annually, or almost 300 per day, by the age of 80. At that point, the value derived from buying a risk reduction of 2 micromorts is rather small.
Repeatedly asking questions can annoy people a slight bit, so it’s good to have the reference to the Toyota Production System handy to highlight that it’s a widely adopted and useful technique instead of you just being difficult. It’s also helpful to remind your counterparts that you are not challenging their work or competence, but that your job requires you to understand systems and problems in detail so you can spot potential gaps or misalignments.
Whys Reveal Decisions and Assumptions
When conducting architecture reviews, “why” is a useful question as it helps draw attention to the decisions that were made as well as the assumptions and principles that led to those decisions. Too often results are presented as “god-given” facts that “fell from the sky” or wherever you believe the all-deciding divine creator (the real chief architect!) resides. Uncovering the assumptions that led to a decision can provide much insight and increase the value of an architecture review. An architecture review is not only looking to validate the results, but also the thinking and decisions behind them. To emphasize this fact, one should request an Architecture Decision Record from any team submitting an architecture for review.
Unstated assumptions can be the root of much evil if the environment has changed since the assumptions were made. For example, traditional IT shops often write elaborate GUI configuration tools that could be replaced with a few lines of code and a standard software development tool chain. Their decisions are based on the assumption that writing code is slow and error-prone, which no longer holds universally true, as we learn once we overcome our Fear of Code. If you want to change the behavior of the organization, you often have to identify and overcome outdated assumptions first.
Coming back to The Matrix, the explanation given by the Oracle that “…you didn’t come here to make the choice, you’ve already made it. You’re here to try to understand why you made it.” could make a somewhat dramatic, but very appropriate opening to an architecture review.
As most upper management is well versed in financial models, I often describe investing in architecture as the equivalent to buying an option: an option gives the buyer the right, but not the obligation to execute on a contract, e.g. buying or selling a financial instrument, in the future. In IT architecture, the option allows you to make changes to the system design, the run-time platform, or functional capabilities. Just as in the financial world, options aren’t free - the ability to act on a contract in the future, when more information is available, has a value and therefore a price. I don’t think the Black-Scholes model accurately computes the value of large-scale IT architecture, but it makes apparent that architecture has a measurable value and can therefore demand a price.
When defining architecture in large organizations, architects need to know more than how to draw UML diagrams. They need to:
- gain architecture insights while Waiting in the line at a Coffee Shop
- tell whether Something Is Architecture in the first place
- tackle complexity by Thinking in Systems
- know that Configuration isn’t better than coding
- hunt zombies so they Don’t have their brain eaten
- navigate the IT landscape with an Undistorted world map
- automate everything so that they Never have to send a human to do a machine’s job
- think like software developers as Everything becomes software-defined
Positive feedback loops can be dangerous due to their “explosive” nature. Policies are often designed to counteract such positive feedback loops with negative ones, e.g. by taxing rich people more heavily or by increasing gasoline tax while subsidizing public transit. However, it’s difficult to balance out the exponential character of positive feedback loops. Thinking in systems helps us reason about such effects.
… provides dramatic examples of how misunderstanding a system led to major catastrophes such as the Three Mile Island nuclear reactor incident or the capsizing of the Deepwater Horizon drilling platform.
Their mental model deviated from the real system, causing them to make fatal decisions.
It has repeatedly been observed that humans are particularly bad at steering systems that have slow feedback loops, i.e. that exhibit reactions to changing inputs only after a significant delay. Overuse of credit cards is a classic example. Also, humans are prone to taking actions that have the opposite of the intended effect. For example, people react to overly full work calendars by setting up “blockers”, which make the calendars even fuller. Instead, one needs to understand and fix what causes the full calendars, for example, a misaligned organizational structure that requires too many alignment meetings. You can’t fix a system by merely addressing the symptoms.
Understanding system effects can help you devise more effective ways to influence the system and thus its behavior. For example, transparency is a useful antidote to the bounded rationality effect because it widens people’s bounds. An example from Donella Meadows’ book illustrates that having the electricity meter visible in the hallway caused people to be more conservative with their energy consumption without additional rules or penalties. Interestingly, systems thinking can be applied to both organizational and technical systems. We’ll learn this, for example, when we Scale an Organization.
As described in Question Everything, if you request better documentation for architecture reviews, “the system” may respond by scheduling lengthy workshops that drain your available time. If you increase pressure, the system will respond with sub-quality documentation that increases your review cycles. You must therefore get to the root of the problem and highlight the value of good documentation, properly train architects, and allocate time for this task in project schedules.
Yoda, the wise teacher of Jedi apprentice Luke Skywalker in the Star Wars movies, knows that fear leads to anger; anger leads to hate; hate leads to suffering. Likewise, Corporate IT’s fear of code and the love of configuration can lead it down a path to suffering that is difficult to escape from. Beware of the dark side, which has many faces, including vendors peddling products that “only require configuration”, as opposed to tedious, error-prone coding. Sadly, most complex configuration really is just programming, albeit in a poorly designed, rather constrained language without decent tooling or useful documentation.
Corporate IT, which is often driven by operational considerations, tends to consider code the stuff that contains all the bugs, causes performance problems, and is written by expensive external consultants who are difficult to hold liable as they’ll have long moved to another project by the time the problems surface. Corporate IT’s fear of code plays to the advantage of enterprise vendors who tout configuration over coding with slogans like “this tool does not require programming, everything is done through configuration.” The most grotesque example of fear of code I have observed had corporate IT providing application servers that carry no operational support once you deploy code on them. It’s like voiding a car’s warranty after you start the engine - after all, the manufacturer has no idea what you will do to it!
Simple things should be simple, complex things should be possible.
MapReduce is a positive example: it encapsulates and thus abstracts away the difficult parts of distributed data processing, such as controlling and scheduling many worker instances, dealing with failed nodes, aggregating data, etc. But it nevertheless leaves the programmer enough flexibility to solve a wide range of problems.
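As a rough illustration of that split between framework and programmer, here is a toy, single-process stand-in for the MapReduce programming model. It is not how any real framework is implemented, and all function names are made up for this sketch. A real framework would hide worker scheduling, failed nodes, and data movement behind the same kind of interface, while the programmer supplies only the mapper and the reducer.

```python
from collections import defaultdict


def map_reduce(records, mapper, reducer):
    """Toy 'framework': runs the map, group-by-key, and reduce steps in one process."""
    groups = defaultdict(list)
    for record in records:
        # Map step: each record may emit any number of (key, value) pairs.
        for key, value in mapper(record):
            groups[key].append(value)  # group emitted values by key
    # Reduce step: collapse each key's values into a single result.
    return {key: reducer(key, values) for key, values in groups.items()}


# The programmer's part: two small functions for a word count.
def word_mapper(line):
    for word in line.lower().split():
        yield word, 1


def count_reducer(word, counts):
    return sum(counts)


if __name__ == "__main__":
    lines = ["simple things should be simple", "complex things should be possible"]
    print(map_reduce(lines, word_mapper, count_reducer))
```

Swapping in a different mapper and reducer solves a different problem without touching `map_reduce`, which is the flexibility the abstraction leaves open.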
Packaging the Abstraction
Vendors showing shiny drag-and-drop demos can make us believe that painting a thin visual veneer over an existing programming model can provide a higher level of abstraction. However, when talking about programming abstraction we must distinguish the model from the representation. A complex model, such as workflow, which includes concepts like concurrency, long-running transactions, compensation, etc., carries heavy conceptual weight, even when wrapped in a pretty visual packaging. This is not to say visual representations have no value. Especially in the case of workflow, visual representations are quite expressive. But they cannot wave a magic wand that makes the challenges of workflow design go away.
Visual programming may initially appear to increase productivity but generally does not scale very well. Once applications grow, it becomes difficult to follow what’s going on. Debugging and version control can also be a nightmare. I generally apply two tests when vendors provide a demo of visual programming tools:
- I ask them to enter a typo into one of the fields where the user is expected to enter some logic. Often this leads to cryptic error messages or obscure errors in generated code down the line. I call this “tightrope programming”: as long as you stay exactly on the line, everything is well. One misstep and the deep abyss awaits you.
- I ask them to leave the room for 2 minutes while we change a few random elements of their demo configuration. Upon return, they would have to debug and figure out what was changed. No vendor was ever willing to take the challenge.
Abstractions are a very useful thing, but believing that calling the abstraction “configuration” is going to eliminate complexity or the need to hire developers is a common fallacy. Instead, treat configuration as a first-class citizen that requires design, testing, version control, and deployment management just like code. Otherwise, you have created a proprietary, poorly designed language without tooling support.
Worse yet, there’s no “business case” for updating the system technology. This widespread logic is about as sound as considering changing the oil in your car a waste of money - after all the car still runs if you don’t. And it even makes your quarterly profit statement look a little better, that is, until the engine seizes.
A team from Credit Suisse described how to counterbalance this trap in their aptly titled book Managed Evolution. The key driver for managed evolution is to maintain agility in a system. A system that no one wants to touch has no agility at all - it can’t be changed. In a very static business and technology environment, this may not be all that terrible. Today’s environment is anything but stable, though, rendering the inability to change a system into a major liability for IT and the business.
Most things are the way they are for a reason. This is also true for the fear of change in corporate IT. These organizations typically lack the tools, processes, and skills to closely observe production metrics and to rapidly deploy fixes in case something goes awry. Hence they focus on trying to test for all scenarios before deploying and then running the application more or less “blind”, hoping that nothing breaks. Jeff Sussna describes the necessity to break out of this conundrum very nicely in his great book Designing Delivery.
One may think that by not changing running systems, IT can keep the operational cost low. Ironically, the opposite is true: many IT departments spend more than half of their IT budget on “run” and “maintenance”, leaving only a fraction of the budget for “change” that can support the evolving demands of the business. That’s because running and supporting legacy applications is expensive: operational processes are often manual; the software may not be stable, necessitating constant attention; the software may not scale well, requiring the procurement of expensive hardware; lack of documentation means time-consuming trial-and-error troubleshooting in case of problems. These are reasons why legacy systems tie up valuable IT resources and skills, effectively devouring the brains of IT that could be applied to more useful tasks, for example delivering features to the business.
Digital companies also have to deal with change and obsolescence. The going joke at Google was that every API had two versions: the obsolete one and the not-yet-quite-ready one. Actually, it wasn’t a joke, but pretty close to reality. Dealing with this was often painful - every piece of code you wrote could break at any time because of changes in the dependencies. But living this culture of change allows Google to keep the pace up - the most important of today’s IT capabilities. Sadly, it’s rarely listed as a performance indicator. Even Shaun knows that zombies can’t run fast.
I like to understand whether a vendor’s and our world view align. For that I prefer to meet with vendors’ senior technical staff, such as a CTO, because too many “solution architects” are just glorified technical sales people. When the account manager starts the meeting with “please help us understand your environment”, which roughly translates into “please tell me what I should sell to you”, I typically preempt the exercise by asking the senior person about their product philosophy. Discussing what base assumptions and decisions are baked into a product gives great insight into a vendor’s world map. Asking them about the toughest problem they had to solve when developing their product tells you much about where the center of their map is located. Naturally, this only works when talking to someone who is actually working on the product. Looking at the company leadership page or at the company history can also help you understand “where they come from”.
A vendor’s architect once stated that automation shouldn’t be implemented for infrequently performed tasks because it isn’t economically viable. Basically, the vendor calculated that writing the automation would take more hours than would ever be spent completing the task manually (they also appeared to be on a fixed-price contract).
I challenged this reasoning with the argument of repeatability and traceability: wherever humans are involved, mistakes are bound to happen and work will be performed ad-hoc without proper documentation. That’s why you don’t send humans to do a machine’s job. The error rate is actually likely to be the highest for infrequently performed tasks because the operators are lacking routine.
The second counter-example concerns disaster scenarios and outages: one hopes that they occur infrequently, but when they happen, the systems had better be fully automated to make sure they can return to a running state as quickly as possible. The economic argument here isn’t about saving manpower, but about minimizing the loss of business during the outage, which far exceeds the manual labor cost. To appreciate this thinking, one needs to understand Economies of Speed. Otherwise, you may as well argue that the fire brigade should use a bucket chain because all those fire trucks and pumps are not economically viable given how rarely buildings actually catch fire.
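The two ways of counting can be put side by side in a few lines. This is only a rough sketch: the `Approach` fields, the `total_cost` function, and every number below are hypothetical, chosen to show how the comparison flips once lost business is counted, and are not figures from the book.

```python
from dataclasses import dataclass


@dataclass
class Approach:
    build_hours: float     # one-time effort to write the automation (0 if manual)
    hours_per_run: float   # operator time each time the task is performed
    outage_hours: float    # how long the business is down each time


def total_cost(approach, runs, hourly_rate, loss_per_outage_hour):
    """Labor cost plus business lost while the task is being carried out."""
    labor = (approach.build_hours + approach.hours_per_run * runs) * hourly_rate
    lost_business = approach.outage_hours * runs * loss_per_outage_hour
    return labor + lost_business


# Hypothetical numbers for a recovery procedure that is needed three times.
manual = Approach(build_hours=0, hours_per_run=8, outage_hours=8)
automated = Approach(build_hours=200, hours_per_run=0.5, outage_hours=0.5)
RUNS, RATE, LOSS = 3, 100, 50_000

# The vendor's view: count labor only, and manual work looks cheaper.
labor_only_manual = total_cost(manual, RUNS, RATE, 0)
labor_only_automated = total_cost(automated, RUNS, RATE, 0)

# The Economies of Speed view: include the business lost during the outage.
full_manual = total_cost(manual, RUNS, RATE, LOSS)
full_automated = total_cost(automated, RUNS, RATE, LOSS)
```

With these made-up inputs the automation loses on labor alone but wins by a wide margin once downtime is priced in, which is the fire truck argument expressed in numbers.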
Allowing users to specify what they want and providing it quickly in high quality would seem like a pretty happy scenario. However, in the digital world one can always push things a little further. For example, Google’s “zero click search” initiative, which resulted in “Google Now”, considered even one user click too much of a burden, especially on mobile devices. The system should anticipate the users’ needs and answer before a question is even asked. It’s like going to McDonald’s and finding your favorite Happy Meal already waiting for you at the counter. Now that’s customer service! An IT world equivalent may be auto-scaling, which allows the infrastructure to automatically provision additional capacity under high load without any human intervention.
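To make the auto-scaling idea concrete, here is a minimal sketch of a scaling decision. It assumes a simple proportional rule (scale the instance count by the ratio of observed to target utilization); the function name, parameters, and default limits are hypothetical and not tied to any particular cloud platform.

```python
import math


def desired_instances(current, utilization_pct, target_pct=60, minimum=1, maximum=20):
    """Return how many instances are needed to bring utilization back to the target.

    Assumption: load spreads evenly, so capacity scales linearly with instance count.
    """
    wanted = math.ceil(current * utilization_pct / target_pct)
    # Clamp so that a load spike cannot provision without bound
    # and an idle system never scales to zero.
    return max(minimum, min(maximum, wanted))


# Four instances running at 90% against a 60% target: add capacity.
scale_out = desired_instances(current=4, utilization_pct=90)
# The same four instances at 30%: capacity can be released again.
scale_in = desired_instances(current=4, utilization_pct=30)
```

A scheduler would call such a function periodically and provision or release capacity on its own, so no operator has to notice the load first.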
Explicit Knowledge is Good Knowledge
Tacit knowledge is knowledge that exists only in employees’ heads but isn’t documented or encoded anywhere. Such undocumented knowledge can be a major overhead for large or rapidly growing organizations because it can easily be lost and requires new employees to re-learn things the organization already knew. Encoding tacit knowledge, which existed only in an operator’s head, into a set of scripts, tools, or source code makes these processes visible and eases knowledge transfer. Tacit knowledge is also a sore spot for any regulatory body whose job it is to assure that businesses in regulated industries operate according to well-defined and repeatable principles and procedures. Full automation forces processes to be well-defined and explicit, eliminating unwritten rules and undesired variation inherent in manual processes. Ironically, classic IT often insists on manual steps in order to maintain separation of duty, ignoring the fact that approving a change and manually conducting a change are independent things.
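The point that approving a change and carrying it out are independent can be illustrated with a small sketch. Everything here is hypothetical (the `Change` class, the `approve` and `execute` functions, the example steps); it only shows that separation of duty can be enforced on the approval while the execution itself stays fully automated and logged.

```python
from dataclasses import dataclass, field


@dataclass
class Change:
    description: str
    author: str
    steps: list                      # callables; each performs one step and returns a log line
    approved_by: str = ""
    log: list = field(default_factory=list)


def approve(change, approver):
    """Separation of duty: a person other than the author must approve."""
    if approver == change.author:
        raise PermissionError("author cannot approve their own change")
    change.approved_by = approver


def execute(change):
    """Run the encoded steps without manual intervention, but only after approval."""
    if not change.approved_by:
        raise PermissionError("change has not been approved")
    for step in change.steps:
        change.log.append(step())
    return change.log


# The operator's know-how is encoded as explicit, repeatable steps.
change = Change(
    description="rotate certificate",
    author="alice",
    steps=[lambda: "backed up old certificate", lambda: "installed new certificate"],
)
approve(change, "bob")
log = execute(change)
```

The steps are visible to anyone who reads the code, and the log records exactly what was done, which is what a regulator would want to see.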
The Ironies of Automation
Alas, architecture wouldn’t be interesting if there weren’t some trade-offs. What are the downsides of automation, besides, perhaps, that it takes effort to create? Quite a few actually, and they’ve been known for a good while, aptly summarized in the classic paper The Ironies of Automation by Lisanne Bainbridge (a PDF is easily found via web search). The paper highlights that operating a highly automated system places significantly higher demands on the operator due to several effects taking place:
- Likely, the automation handles the easy parts, leaving the operator with the difficult ones.
- Because the system runs in auto-pilot all the time, operators might not be paying close attention to the system state.
- When manual take-over is needed, the system is already in an abnormal state, which makes managing it even harder.
- Automation black-boxes the system, giving the operator less opportunity to learn how the system reacts to inputs.
- Automation can mask small problems until they become big ones (see also Control is an Illusion).
Several major disasters were triggered by operator error in the context of automated systems, often involving faulty signals, meaning the automation itself was in a failure state (see Every System is Perfect). A dramatic recent example of the dangers of automation is the crash of two Boeing 737 Max aircraft, apparently due to malfunctioning of an automated mechanism that was introduced unbeknown to the pilots. Highly automated environments need transparency and highly trained operators.
If software eats the world, there will be only two kinds of people: those who tell the machines what to do and those where it’s the other way around.
Getting Attention
Technical material can be very exciting, but ironically more so to the presenter than to the audience. Keeping attention through a lengthy presentation on code metrics or data center infrastructure can be taxing for even the most enthusiastic audience. Decision makers don’t just want to see the hard facts, but also be engaged and motivated to support your proposal. Architects therefore have to use both halves of their brain to not only make the material logically coherent, but to also craft an engaging story.
Pushing Paper
The technical decision papers published by my team yielded much praise, but also unexpected criticism like “All that architects can do is produce paper”. You might want to remind people that documentation provides value in numerous ways:
- Coherence - Agreeing on and documenting design principles and decisions improves consistency of decision making and thus preserves the conceptual integrity of the system design.
- Validation - Structured documentation can help identify gaps and inconsistencies in the design.
- Clarity of Thought - You can only write what you have understood. If someone claims that writing their thoughts down is too much effort, I routinely challenge them that this is likely because they haven’t really understood things in the first place.
- Education - New team members become productive faster if they have access to good documentation.
- History - Some decisions were good decisions at the time based on the context and what was known. Documentation can help understand what these were.
- Stakeholder Communication - Architecture documentation can help steer a diverse audience to the same level of understanding.
Useful documentation doesn’t imply reams of paper, rather the opposite: most technical documents my team writes are subject to a five-page limit.
This chapter examines some of the challenges of technical communication and gives advice on how to overcome them:
- Our stuff is complicated. Helping management reason about complex technical topics requires you to build a careful ramp for the audience.
- People are busy. They won’t read every line you write, so make it easy for them to navigate your documents.
- There’s always too much to tell. If everything is important, nothing is important. Place an emphasis!
- Excite your audience by not just showing the building blocks, but also the pirate ship.
- Technical staff often struggles to create a good picture. Help them by sketching bank robbers.
- A picture can not only say more than a thousand words but actually help you design better systems.
- Your audience often understands the components, but not their relationships. You must draw a line.
Build a ramp, not a cliff for the reader
Martin Fowler occasionally introduces himself as a guy “who is good at explaining things”. While this certainly has a touch of British Understatement™, it also highlights a critically important but rare skill in IT. Too often technical people produce either an explanation at such a high level that it is almost meaningless or spew out reams of technical jargon with no apparent rhyme or reason.
High-performance Computing Architectures for Executives
A team of architects once presented a new hardware and software stack for high-performance computing to a management steering committee. The material covered everything from workload management down to storage hardware. It contrasted vertically integrated stacks like Hadoop and HDFS, which comprise a file system and a workload distribution mechanism, against stand-alone workload management solutions like LSF, which run on top of an independent high-performance file system. In one of the comparison slides “POSIX compliance” jumped out as a selection criterion. While this may be completely appropriate, how do you explain to someone who knows little about file systems what this means, why it is important, and what the ramifications are?
First, create a language
When preparing technical conversations, I tend to use a two-step approach: first I set out to establish a basic mental model based on concrete vocabulary. Once equipped with this, the audience is able to reason in the problem space and to discern the relevance of parameters to the decision. The mental model doesn’t have to be anything formal, it merely has to give the audience a way to make a connection between the different elements that are being described.
Consistent level of detail
Determining the appropriate level of detail to support the line of reasoning is difficult. For example, we pretended “POSIX” is a single thing when in reality there are many different versions and components, the Linux Standard Base etc. The ability to draw the line at roughly the right level of detail is an important skill of an architect. Many developers or IT specialists love to inundate their audience with irrelevant jargon. Others consider it all terribly obvious and leave giant gaps by omitting critical details. As so often, the middle ground is where you want to be.
Drawing the line at the right level of detail depends on you knowing your audience. If your audience is mixed, building a good ramp is even more important as it lets folks less familiar with the details catch up without boring the ones who are. The highest form is building a ramp that audience members already familiar with the subject matter appreciate despite not having learned anything new. This is tough to achieve, but a noble goal to aim for.
Getting the level of detail “just right” is usually a crapshoot, even if you do know the audience. At least as important, though, is sticking to a consistent level of detail. If you describe high-level file systems on slide one and then dive into bit encoding on magnetic disks in slide two, you are almost guaranteed to either bore or lose your audience. Therefore, strive to find a line that maintains cohesion for reasoning about the architectural decision at hand, without leaving too many “dangling” aspects. Algorithm-minded people would phrase this challenge as a graph partition problem: your topic consists of many elements that are logically connected, just like a graph of nodes connected by edges. Your task is to split the graph, i.e. to cover only a subset of the elements, while minimizing the number of edges, i.e. logical connections, being cut.
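The graph analogy can be tried out on a toy example. The sketch below is an illustration only: the topic names are borrowed from the high-performance computing story, but the edges between them and the two candidate scopes are assumptions made up for this example, and `cut_edges` is a hypothetical helper.

```python
def cut_edges(edges, covered):
    """Return the logical connections that are severed when only `covered` topics are presented.

    An edge is cut when exactly one of its two endpoints is inside the covered set.
    """
    return [(a, b) for a, b in edges if (a in covered) != (b in covered)]


# Hypothetical topic graph: nodes are topics, edges are logical connections.
edges = [
    ("workload manager", "file system"),
    ("file system", "POSIX compliance"),
    ("file system", "storage hardware"),
    ("storage hardware", "bit encoding"),
    ("POSIX compliance", "Linux Standard Base"),
]

# A scope that stays at one level of detail.
consistent_scope = {"workload manager", "file system", "POSIX compliance"}
# A scope that jumps from file systems straight to bit encoding on disk.
jumpy_scope = {"file system", "bit encoding"}

consistent_cuts = cut_edges(edges, consistent_scope)
jumpy_cuts = cut_edges(edges, jumpy_scope)
```

In this toy graph the consistent scope severs two connections while the jumpy one severs four, so the second presentation leaves twice as many “dangling” aspects for the audience to wonder about.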
The written word has distinct advantages over the spoken word:
- it scales: you can address a large audience without gathering everyone in one room (podcasts admittedly can also accomplish that)
- it’s fast: people read 2-3 times faster than they can listen
- it’s searchable: you can find what you want to read quickly
- it can be edited and versioned: everybody sees the same, versioned content.
Blatant typos or grammar issues are like the proverbial fly in the soup: the taste is arguably the same, but the customer is unlikely to come back for more.
To assess what a short paper will “feel” like to the reader without wasting printer paper, I zoom out my WYSIWYG editor far enough that all pages appear on the screen. I can’t read the text anymore, but can see the headings, diagrams, and overall flow, e.g. the length of paragraphs and sections. This is exactly how a reader will see it when flipping through your document to decide whether it’s worth reading. If they see an endless parade of bullet points, bulky paragraphs, or a giant mess, the paper will leave “the hand” quite quickly as gravity teleports it into the recycling bin.
A few mechanisms can make reading your paper more like watching Shrek:
Story-telling headings replace an executive summary: your reader should get the gist of the paper just by reading the headings. Such headings reduce word count and still take busy readers through the whole paper. Headings like “introduction” or “conclusion” aren’t story-telling and have no place in a short paper.
Diagrams provide a visual anchor for important sections. Readers who flip through a paper likely pause at a diagram, so it’s good to position them strategically near critical sections. Callouts, i.e. short sections that are offset in a different font or color, indicate to the reader that this additional detail can be safely skipped without losing the train of thought.
The curse of writing: linearity
Technical topics are rarely one-dimensional, but your text is forced to be: one word after the other, one paragraph after the next. Only a well-thought-out logical structure can overcome this limitation, as described by Barbara Minto in her book The Pyramid Principle. The “pyramid” in this context denotes the hierarchy of content, not the Pyramids in IT. While somewhat over-hyped and over-priced, the book’s sections on order are a gem: every list or grouping should have an order, either by time (chronological), structure (relationships), or ranking (importance). Note that “alphabetical” and “serendipitous” aren’t valid choices. “How is this ordered?” has become a standard question I ask when reviewing documents containing a list or grouping.
In the hand - First Impressions Count
When Bobby and I wrote “Enterprise Integration Patterns”, the publisher highlighted the importance of the “in the hand” moment, which occurs when a potential buyer picks the book from the shelf to give a quick glimpse to front and back cover, maybe the table of contents, and to leaf through (back in 2003 people still bought books in physical bookstores). The reader makes the purchasing decision at this very moment, not when he or she stumbles on your ingenious conclusion on page 326. This is one reason why we included many diagrams in that book: almost all facing pages contain a graphical element, such as an icon (aka “Gregorgram”), a pattern sketch, a screen shot, or a UML diagram, sending the message to potential readers that it isn’t an academic book, but a pragmatic and approachable one. Technical papers should do the same: use a clean layout, insert a handful of expressive diagrams, and, above all, keep it short and to the point.
Loose usage of the word “this” as a stand-alone reference is another pet peeve of mine, e.g. stating that “this is a problem” without being clear what “this” actually refers to. Jeff Ullman cites such a “non-referential this” as one of the major impediments to clear writing, exemplified in his canonical example:
If you turn the sproggle left, it will jam, and the glorp will not be able to move. This is why we foo the bar.
Do we foo the bar because the glorp doesn’t move or because the sproggle jammed? Programmers understand the dangers of dangling pointers and Null Pointer Exceptions well, but don’t seem to apply the same rigor to writing – maybe because your readers don’t throw a stack trace at you?
Another fantastic piece of advice from Minto is the following:
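Since readers don’t throw stack traces, one could at least let a script flag the suspects. The following is a crude heuristic sketch, not a tool mentioned in the source: the function name and the short verb list are assumptions, and a regular expression will inevitably miss some cases and flag some harmless ones.

```python
import re

# "This" directly followed by a verb (rather than a noun) is a likely
# stand-alone reference. The verb list is a small, hypothetical sample.
BARE_THIS = re.compile(r"\b[Tt]his\s+(?:is|was|means|causes|shows|leads)\b")


def find_bare_this(text):
    """Return the sentences that contain a 'this' followed directly by a verb."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return [s for s in sentences if BARE_THIS.search(s)]


example = (
    "If you turn the sproggle left, it will jam, and the glorp will not "
    "be able to move. This is why we foo the bar."
)
flagged = find_bare_this(example)

# Naming the referent makes the warning go away.
fixed = find_bare_this("This jam is why we foo the bar.")
```

Run on Ullman’s example, the check flags only the second sentence; once “this” is followed by the noun it refers to, nothing is flagged.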
Making a statement to a reader that tells him something he doesn’t know will automatically raise a logical question in his mind […] the writer is now obliged to answer that question. The way to ensure total reader attention, therefore, is to refrain from raising any questions in the reader’s mind before you are ready to answer them.
Follow this single piece of advice and your technical paper will stand above 80% of the rest. This rule also applies to unsubstantiated claims. An internal presentation once stated on the first slide: “only technology ABCD has proven to be a viable solution”. When I demanded to see proof, it turned out that none existed due to “lack of time and funding.” These aren’t just wording issues, but fatal flaws. A reader no longer wants to see page 2 if they cannot trust page 1.
Writing good documents in an organization that is generally poor at writing can give you significant visibility, but it can also rock the political system. The first time I sent a positioning paper on digital ecosystems to senior management, a person complained to both my boss and my boss’s boss about me not having “aligned” the paper with her. Communication is a mighty tool and some people in your organization will fight hard to control it. Pick your targets wisely and make sure you have enough “ammunition”.
When I create a document or a diagram, feedback sometimes includes “ABC is missing” or “you could also include DEF”. While this is well intentioned, I remind the reviewers that completeness wasn’t my primary goal. Rather, I am looking for a scope that is big enough to be meaningful, small enough to be comprehensible, and cohesive enough to make sense. Compare this to drawing a map: a street map of Chicago that ends half-way through the city would be awkward. However, including all of Lake Michigan because the lake doesn’t actually end in a straight line 3cm off the coastline would make the map a lot less useful. Adding Springfield at the same scale is also unlikely to be helpful.
Any diagram or text you create is a model of reality that must set a specific emphasis to be useful. Comments like “ABC is missing” can be helpful to “round off” your model and make it more cohesive. But you also need to decide when something is better placed into another model. For me personally, I can make that decision really only once I have it in front of my eyes - it’s not something I can do a priori.
3-second test
I show the audience a slide for a mere 3 seconds and ask them to describe what they saw. In most cases, the responses boil down to a few words from the headline and statements like “2 yellow boxes on the left and one blue barrel on top”. The authors are usually disappointed to hear this dramatic simplification of their precious content, but understand that less is indeed more.
Making a Statement
The slide or paragraph title sets the tone for a clear and focused statement. Some authors prefer titles that are full sentences while others prefer short phrases. After having gone back-and-forth throughout my career I settled on using both, but in different contexts: “big” presentations tend to have titles consisting of single words or short phrases like the Architect Elevator because they represent a concept that I will explain. The visuals in this case are truly a visual aid to me, the speaker, to draw the audience’s attention and help them memorize the content via a visual metaphor.
For technical presentations that are prepared for a review or decision-making session, however, I prefer clear statements, with which one can either agree or disagree. These statements are much better represented as full sentences, akin to the story-telling headings in documents for busy people. In such cases, “Stateless application servers and full automation enable elastic scale-out” is a better title than “Server Architecture”. What you certainly want to avoid are verbose phrases or crippled sentences that don’t make a statement: “Server infrastructure and application architecture overview diagram (abstracted for simplicity’s sake)”.
A Pop Quiz
I participate in many architecture reviews and decision boards. While such boards often exist due to an undesirable separation of decision-makers and knowledge holders, many large enterprises need them to harmonize the technical landscape and to gain an overview across silos. The topics for these meetings can be fairly technical in nature, making me skeptical whether the audience is truly following along.
To ensure that the decision body understands what they are deciding, I inject a pop quiz into the presentation by telling the presenter to pause and blank the slide (hitting “B” will do this in PowerPoint) and asking the audience who would like to recap what was said up to this point. After observing nervous laughter and frantic staring at the floor, I usually ask the presenter to try to recap the key points so we have a chance of passing this (fictitious) test. In the end, this is a test for the presenter more so than for the audience.
Simple Language
I don’t exclude myself from the pop quiz. When replaying what the speaker said, I often intentionally use very simple language to make sure I really capture the essence. In a presentation about network security architecture in the untrusted network zone, after watching a handful of rather busy slides, I summarized the speaker’s statement as follows: “what worries you is the black line going all the way from top to bottom?” His resounding “yes” confirmed both that I had correctly summarized the issue and that the presenter took away insight into how to better communicate this very aspect. While this technique may seem overly simplistic at first, it validates that there is a solid connection between the model being presented (such as vertical lines depicting legal network paths from the Internet to the trusted network) and the problem statement (security risks). Removing all noise and reducing the statement down to the “black line” sharpens the message.
Technical Memos
The idea to create documents that don’t try to be encyclopedic (who reads an encyclopedia, anyway?), but describe a particular aspect of the system and place a specific emphasis on it, isn’t new: 20 years ago, Ward Cunningham defined the notion of a technical memo in his Episodes pattern language:
Maintain a series of well formatted technical memoranda addressing subjects not easily expressed in the program under development. Focus each memo on a single subject. […] Traditional, comprehensive design documentation […] rarely shines except in isolated spots. Elevate those spots in technical memos and forget about the rest.
Keep in mind, though, that writing technical memos is more useful, but not necessarily easier than producing reams of mediocre documentation. The classic example of the concept of technical memos gone wrong is a project Wiki full of random, mostly outdated, and incohesive documentation. This isn’t the tool’s fault (the Wiki was not quite coincidentally also invented by Ward), but rather due to a lack of emphasis being placed by the writers.
Show the Kids the Pirate Ship!
Why the whole is much more than the parts. This is what people want to see. When you look at the cover of a Lego™ toy box you don’t see a picture of each individual brick that’s inside. Instead, you see the picture of an exciting, fully assembled model, such as a pirate ship. To make it even more exciting, the model isn’t sitting on a living room table, but is positioned in a life-like pirate’s bay with cliffs and sharks – Captain Jack Sparrow would be jealous. What does this have to do with communicating system architecture and design? Sadly, not much, but it should! Technical communication too frequently makes the mistake of listing all the individual elements, but forgets to show the pirate ship: we see tons of boxes (and hopefully some lines), but the gestalt of what they mean as a whole isn’t clear.
Is this a fair comparison, though? Lego is selling toys to kids while architects need to explain the complex interplay between components to management and other professionals. Furthermore, IT professionals have to explain issues like network outages due to flooded network segments, something much less fun than playing pirates. I’d posit that we can learn quite a few things from the pirate ship for the presentation of IT architecture.
Get Attention
The initial purpose of the pirate ship is to draw attention among all the other competing toy boxes. While kids come to the toy store to hunt for new and shiny toys, many corporate meeting attendees are there because they were delegated by their boss, not because they want to hear your content. Grabbing their attention and getting them to put down their smartphones requires you to show something exciting. Sadly, many presentations start with a table of contents, which I consider extremely silly. First, it isn’t exciting: it’s like a list of assembly instructions instead of the ship. Second, a table of contents’ purpose is to allow a reader to navigate a book or a magazine. If the audience has to sit through the whole presentation anyhow, there is no point in giving them a table of contents at the beginning. The old adage of “tell them what you are going to tell them”, which is vaguely attributed to Aristotle, certainly doesn’t translate into a slide with a table of contents. You are going to tell them how to build a pirate ship!
You may feel that excitement is a bit too frivolous for a serious workplace discussion. That’s where you should look back at Aristotle. Some 2300 years ago he concluded that a good argument is based on logos, facts and reasoning, ethos, trust and authority, and pathos, emotion! Most technical presentations deliver 90% logos, 9% ethos, and maybe 1% pathos. From that starting point, a good dose of pathos can go a long way. You just have to make sure that your content can match the picture presented on the cover: pitching a pirate ship and not having the cannons inside the box is prone to lead to disappointment.
Focus on Purpose
Coming back to the pirate ship, the box also clearly shows the purpose of the pieces inside. The purpose isn’t for the bricks to be randomly stacked together, but to build a cohesive, balanced solution. The whole really is much more than the sum of the parts in this case. It’s the same with system design: a database and a few servers are nothing special, but a scale-out, masterless NoSQL database is quite exciting.
Alas, the technical staff who had to put all the pieces together is prone to dwell on said pieces instead of drawing attention to the purpose of the solution they built. They feel the audience has to appreciate the work that went into assembling the pieces as opposed to the usefulness of the complete solution. The bad news is: no one is interested in how much work it took you; people want to see the results you achieved.
Show Context
The Lego box cover image also shows the pirate ship within a useful context, such as a (fake) pirate’s bay. Likewise, the context in which an IT system is embedded is at least as relevant as the intricacies of the internal design. Hardly any system lives in isolation and the interplay between systems is often more difficult to engineer than the innards of a single system. So you should show a system in its natural habitat.
Many architecture methods begin with a system context diagram, which rarely turns out to be a useful communication tool because it aims for a complete system specification without Placing an Emphasis. Such diagrams show a bunch of bricks, but not the pirate ship. They are therefore rarely suitable to be shown on the cover.
The Content on the Inside
Lego toys also show the exact part count and their assembly, but on a leaflet inside the box, not on the cover. Correspondingly, technical communication should display the pirate ship on the first page or slide and keep the description of the bricks and how to stack them together for the subsequent pages. Get your audience’s attention, then take them through the details. If you do it the other way around, they may all be asleep by the time the exciting part finally comes.
Consider the Audience
Just like Lego has different product ranges for different age groups, not every IT audience is suitable for the pirate ship. To some levels of management that are far removed from technology you may need to show the little duckie made from a handful of Lego Duplo bricks.
Play at Work
While on the topic of toys: building pirate ships would be classified by most people as playing, something that is commonly seen as the opposite of work, as we are reminded by the proverb “all work and no play makes Jack a dull boy.” Pulling another reference from the 80s movie archives, let’s hope that lack of play doesn’t have the same effect on IT architects as it had on the author Jack in the movie The Shining – he went insane and tried to kill his family. But it certainly stifles learning and innovation.
Most of what we know we didn’t learn from our school teachers, but from playing and experimenting. Sadly, most people seem to have forgotten how to play, or were told not to, when they entered their professional life. This happens due to social norms, pressure to always be (or appear) productive, and fear. Playing knows no fear and no judgment; that’s why it gives you an open mind for new things.
If playing is learning, times of rapid change that require us to learn new technologies and adapt to new ways of working should re-emphasize the importance of playing. I actively encourage engineers and architects in my team to play. Interestingly, Lego offers a successful method called Serious Play for executives to improve group problem solving. They might be building pirate ships.
When assuming the role of an “architecture sketch artist”, I tend to pursue two different approaches, often sequenced one after the other.
The System Metaphor
First, I look for noteworthy or defining features, i.e. for the key decisions. Is it a pretty vanilla web-site for a customer to review information, like a customer information portal? Or is it rather a new sales channel, or even a piece of a cross-channel strategy? Is it designed to handle tons of volume or is it rather an experiment that will see little traffic, but must evolve very quickly? Or is it a spike to test out new technologies and the use case is secondary? Once I have established this frame, I start filling in the detail.
I am a big fan of Kent Beck’s notion of a system metaphor that describes what kind of “thing” the system is. As Kent wisely states in Extreme Programming Explained:
We need to emphasize the goal of architecture, which is to give everyone a coherent story within which to work, a story that can easily be shared by the business and technical folks. By asking for a metaphor we are likely to get an architecture that is easy to communicate and elaborate.
In the same book Kent also states that “Architecture is just as important in XP [Extreme Programming] projects as it is in any software project”, something to be kept in mind by folks who are tempted to shun architecture because they are agile.
Just like with Diagram-driven Design, architecture sketching can also be a useful design technique. If the picture makes no sense (and the architecture sketch artist is talented) then something may be inconsistent or wrong in the architecture.
Viewpoints
Once I have a rough idea about the nature of the system, I let the metaphor drive which aspects or viewpoints to examine. This is where doing an architecture sketch differs from performing an architecture analysis. An analysis typically walks through a fixed, structured set of aspects, as defined for example by methods such as ATAM or arc42. This is useful as a “checklist” to uncover missing aspects or gaps. In contrast, a criminal sketch artist doesn’t want to draw the details of a person’s trouser finishings (hemmed?, cuffed?), but highlight those characteristics that are unique or noteworthy. The same is true for the architecture sketch artist.
Following a fixed set of viewpoints always runs the risk of becoming a Paint by Numbers exercise where one fills in every section of a template, but forgets to place an emphasis or omits critical points in the process. I therefore find the viewpoint descriptions in Nick Rozanski and Eoin Woods’ Software Systems Architecture useful because they don’t prescribe a fixed notation, but highlight concerns and pitfalls. Nick and Eoin also separate perspectives from views. When sketching an architecture you are most likely interested in a specific perspective, such as performance and security, that spans multiple viewpoints, for example a deployment or functional view.
When talking about Diagram-Driven Design, I don’t mean generating code from UML diagrams. I am pretty firmly rooted in Martin Fowler’s UML as Sketch camp, meaning UML is a picture to aid human comprehension, not a programming language or specification. If people question my view, I simply quote Grady Booch, who as co-creator of the UML remarked that “The UML was never intended to be a programming language.” Instead, I am talking about a picture that conveys important concepts – the proverbial big picture that does not get caught up in irrelevant details.
Make sure your text is readable by using sans serif fonts of decent size and good color contrast. It’s amazing how many slides contain 10pt Times Roman font in dark gray on dark blue background. “I know you can’t read this” isn’t a good introduction to a slide. It’s even more shocking to see that slides with tiny fonts often consist of 50% empty space that could have been used for larger boxes and larger fonts.
Reduce visual noise by making sure elements are properly aligned and have consistent form and shape (e.g. border widths, arrowhead sizes etc). If things look different, make sure that it expresses meaning. Increase the size of arrow heads on the lines so that they can be more easily spotted. If direction isn’t critical to understanding the diagram, then omit the arrowheads.
Diagramming as Design Technique
Once you embrace diagramming as a design technique you can apply a number of methods to aid with your system design:
Establish a Visual Vocabulary and Viewpoints
Good diagrams use a consistent visual language. A box means something (for example, a component, a class, a process), a solid line something else (maybe a build dependency, data flow, or an HTTP request), and a dashed line something else yet. No, you don’t need a Meta-Object Facility and correctness-proven semantics, but you need to have an idea of how you depict each element or relationship.
Limit the Levels of Abstraction
One of the most frequent problems I encounter in technical documents is a wild mix of different levels of abstraction (the same problem can be found in source code). For example, the way configuration data affects a system’s behavior may be described like this:
The system configuration is stored in an XML file, whose “timetravel” entry can be set to either true or false. The file is read from the local file system or alternatively from the network but then you need NFS access or have Samba installed. It uses a SAX parser to preserve memory. The “Config” class, which reads these settings, is a singleton because…
In these few sentences we learn about the file format, project design decisions, implementation detail, performance optimizations and more. It’s rather unlikely that a single reader is actually interested in this smörgåsbord of facts.
Now try to draw a picture of this paragraph! It will be nearly impossible to get all these concepts onto a single sheet of paper.
Reduce to the Essence
Billboard-size database schema posters, which include every single table, stick to a single level of abstraction, but are still fairly useless because they don’t place an emphasis, especially when shrunk down to fit on a single presentation slide. Omit unimportant detail to concentrate on what’s relevant.
Find Balance and Harmony
Limiting the levels of abstraction and scope does not yet guarantee a useful diagram. Good diagrams lay out important entities such that they are logically grouped, relationships become naturally clear, and an overall balance and harmony emerges. If such a balance doesn’t emerge, it may just be that your system doesn’t have one.
I once reviewed a relatively small module of code that consisted of a rather entangled mess of classes and relationships. When the developer and I tried to document this module, we just couldn’t come up with a half-decent way to sketch up what’s going on. After a lot of drawing and erasing we came up with a picture that vaguely resembled a data processing pipeline. We subsequently refactored the entangled code to match this new system metaphor. It improved the structure and testability of the code significantly, thanks to diagram-driven design!
Indicate Degrees of Uncertainty
When looking at a piece of code, one can always figure out what was done, but it’s much harder to understand why it was done. It can be even more difficult to understand which decisions were made consciously and which ones simply happened. When creating diagrams, we have more tools at hand to express these nuances: for example, you can use a hand-drawn sketch to convey that it is merely a basis for discussion as opposed to an engineering blueprint that represents the ultimate truth. Many books, including Eric Evans’, use this technique effectively to avoid the precision vs. accuracy dilemma: “next week it will be roughly 15.235 degrees”. Don’t make precise-looking slides if you know they aren’t accurate.
Diagrams are Art
Diagrams can (and should) be beautiful – little works of art, even. I am a firm believer that system design has a close relationship to art and (non-technical) design. Both visual and technical design start with a blank slate and virtually unlimited possibilities. Decisions are often influenced by multiple, usually conflicting forces. Good design resolves these forces to create a functional solution, which attains a good balance and some degree of beauty. This may explain why many of my friends who are great (software) designers and architects have an artistic vein or at least interest.
How do I tell a well-structured architecture from a Big Ball of Mud? By the lines (between components).
Electric circuit diagrams provide a canonical example of system behavior that depends heavily on connections between components. One of the most versatile elements in analog circuitry is the operational amplifier, short op-amp. Paired with a few resistors and a capacitor or two, this element can act as a comparator, amplifier, inverted amplifier, differentiator, filter, oscillator, wave generator, and much more – op-amp circuits fill many books. The system’s behavior, which varies widely, doesn’t depend on the list of elements, but solely on how they are connected. In the world of IT, a database can act as a cache, ledger, file storage, data store, content store, queue, configuration input, and much more. How the database is connected to its surrounding elements is fundamental, just like the op-amp.
UML
Speaking of lines, UML has a beautiful abundance of line styles: in a class diagram, classes (boxes) can be connected through association (a simple line), aggregation (with a hollow diamond on one end), composition (a solid diamond), or generalization (triangle). Navigability can be indicated by an open arrow and a dependency by a dashed line. On top of this, multiplicities, e.g. a truck having four to eight wheels but only one engine, can be added to the relationship lines. In fact, UML class diagrams allow so many kinds of relationships that Martin Fowler decided to split the discussion into two separate chapters inside his defining book UML Distilled. Interestingly, UML allows composition to be visually expressed through a line or as containment, i.e. drawing one box inside the other.
With such a rich visual vocabulary, why invent your own? The challenge with UML notation is that you can appreciate the nuances of the relationship semantics between classes only if you have in fact read UML Distilled or the UML specification. That’s why such diagrams aren’t as useful when addressing a broad audience: the visual translation of solid diamond vs. hollow diamond or solid line vs. dotted line isn’t immediately intuitive. This is where containment works well: a box inside another is easily understood without having to add a legend.
Elements of Style
Most architects will develop their own visual style over time. My diagrams tend to be bold with large lettering because I value readability over subtle aesthetics. As a result, my diagrams look like cartoons to some viewers, but I am fine with that. My diagrams virtually always have lines, but I keep the lines’ semantics to two or at most three concepts. Each type of relationship that I depict with lines should be intuitive. For example, I may depict a data flow with broad, gray arrows, while control flow is shown in thin, black lines. The line width suggests that a large amount of data flows through the system’s data flow while the control flow is much smaller, but significant. The best visual style, borrowed from advice on writing, is the one “that keeps solely in view the thought one wants to convey”.
Coworkers also routinely talk to each other to solve problems without following the lines in the organizational pyramid. This is a good thing because otherwise managers would quickly become communication bottlenecks. In many cases the org chart depicts the control flow of the organization, e.g. to give budget approvals, while the data flow is much more open and dynamic. Ironically, the way people actually work with each other is rarely depicted in a diagram. Part of the reason may be that this data is difficult to gather, the other part may be that it doesn’t look nearly as neat as the org chart pyramid.
Navigating Large Organizations
This chapter presents different angles of understanding organizations:
- How command-and-control structures are intended to work and why they don’t work that way because Control is an Illusion.
- The reasons why Pyramids went out of vogue 4500 years ago, but are still widely used in IT systems and organizational charts.
- How Black Markets compensate for the inflexibility of command-and-control, but cause a new set of problems.
- How experience in scaling distributed computer systems can be applied to Scaling an Organization.
- Why fast-moving things can appear chaotic while slow-moving things seem well coordinated when in reality it’s often the opposite due to Slow-motion Chaos.
- Why governance by decree is difficult and better done by Planting Ideas through Inception.
“Control is an Illusion”. Even more attention drew my explanation: “You believe that you have control when people tell you what you want to hear.” Perhaps this wasn’t the kind of control these senior executives wanted to have over their business.
The Illusion
How can control be an illusion? “Having control” is based on the assumption that a direction set from top-down is actually being followed and has the desired effect. And this can be a big illusion. How would you know that it does, if you are simply sitting at the top, pushing (control) buttons instead of working alongside the staff? You may rely on management status reports, but then you make another major assumption that the presented information reflects reality. This may be yet another big illusion.
Steven Denning uses the term “semblance of control” in contrast to “actual control” for this phenomenon in large organizations. A more cynical version would be to claim that the inmates are running the asylum. In either case, not the state you want your organization to be in.
Control isn’t a one-way street: turning the furnace on and off may seem like controlling a part of the system, but an actual control circuit keeps the room temperature constant based on a closed feedback loop: turning the furnace on heats the room; the thermostat measures the room temperature and turns the furnace off when the desired temperature is reached. The control circuit is based on one or more sensors, such as the room temperature sensor, and one or more actuators, such as the furnace, which influence the system.
The feedback loop compensates for external factors such as the outside temperature or someone opening the window. The control loop doesn’t determine up-front how warm the room should be by running the heater for a pre-computed amount of time. Instead, the control circuit is set to a specific target and the “controller” uses sensors to measure continuously whether the target is achieved and acts accordingly. One can quickly draw the analogy to project planning that commonly attempts to predict all factors up-front and subsequently tries to eliminate all disturbances. It’s like running the heater for exactly 2 hours and then blaming the cold weather for the room not being warm enough.
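The difference between the fixed plan and the feedback loop can be played through in a few lines of Python. This is only a toy sketch: the room model, the function names, and all constants (heat per minute, leak rate, temperatures, the 18-minute plan) are assumptions made up for illustration, not measurements.

```python
# Toy room model; every constant here is an illustrative assumption.
HEAT_PER_MINUTE = 0.5   # degrees the furnace adds per minute while on
LEAK_RATE = 0.02        # share of the indoor/outdoor gap lost per minute
TARGET = 21.0           # desired room temperature


def simulate(controller, outside, minutes=120, start=15.0):
    """Return the room temperature after each simulated minute."""
    temp, trace = start, []
    for minute in range(minutes):
        furnace_on = controller(minute, temp)
        heating = HEAT_PER_MINUTE if furnace_on else 0.0
        temp += heating - LEAK_RATE * (temp - outside)
        trace.append(temp)
    return trace


def fixed_plan(minute, temp):
    """Open loop: run the furnace for a pre-computed 18 minutes.

    The plan was sized for 10 degrees outside and never looks at the sensor.
    """
    return minute < 18


def thermostat(minute, temp):
    """Closed loop: measure the room and heat only while below target."""
    return temp < TARGET


if __name__ == "__main__":
    for label, outside in (("mild", 10.0), ("cold", 0.0)):
        plan = simulate(fixed_plan, outside)
        loop = simulate(thermostat, outside)
        print(f"{label}: plan peaks at {max(plan):.1f}, "
              f"thermostat ends at {loop[-1]:.1f}")
```

In this toy model the fixed plan peaks close to the target in mild weather but falls well short when it is colder outside, while the thermostat settles near the target in both cases.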
Jeff Sussna describes the importance of feedback loops in his book Designing Delivery, drawing on the notion of cybernetics. Although most people think of cyborgs and terminators when they hear the term, cybernetics is actually a field of study that deals with “control and communication in the animal and the machine”. Such control and communication is almost always based on a closed signaling loop.
When we portray large organizations as “command-and-control” structures, we often focus only on the top-down steering part, and less on the feedback from the “sensors”. But not using the sensors means one is flying blind, possibly with a feeling of control, but one that’s disconnected from reality. It’s like driving a car and turning the steering wheel when you have no lights and have no clue where the car is actually headed – a very dumb idea. It’s shocking to see how such behavior bordering on absurdity can establish itself in large organizations or systems.
Problems on the way up
Even if an organization uses sensors, e.g. by obtaining the infamous status reports, not all is necessarily well. Anyone who has heard the term Watermelon status understands: these are the projects whose status is “green” on the outside, but “red” on the inside, meaning they are presented as going well, but in reality suffer from serious issues. Corporate project managers and status reporters aren’t straight-out liars, but they do tend to take some literary license to make their project look good or are just overly optimistic. “700 happy passengers reach New York after Titanic’s maiden voyage” is also factually correct, but not the status report you may want to get.
Translating from the military context back into the world of large-scale IT organizations, how do you obtain actual control, not illusionary control, in an organization? In my experience you need three elements:
- Enablement: It may sound trivial, but first you need to enable people to perform their work. Sadly, corporate IT knows many mechanisms that disable people: HR processes that restrict recruiting, servers that take 4 weeks to be provisioned, black markets that are not accessible to new hires. A thermostat connected to a furnace with a plugged gas line won’t do much good.
- Autonomy: Let people figure out how to achieve their goals because they have the shortest feedback cycles that allow them to learn and improve. You let the thermostat decide when to turn the furnace on and off, so do the same for your teams!
- Pressure: Set very specific goals that the team needs to achieve, e.g. revenue generated or quantifiable user engagement. A thermostat is only useful if someone sets the desired temperature.
Functional pyramids as we find them in IT system designs face another challenge: the folks building the base layer not only have to move humongous amounts of material, they also have to anticipate the needs of the teams building the upper layers. Building a pyramid from the bottom up negates the principle of “use before reuse”: designing functions to be reused later without first actually using them can be a guessing game at best. It also dangerously ignores the Build-Measure-Learn Cycle of learning what’s needed from observing actual usage.
Not limited to pyramids, but applicable to any layered system, is the challenge of defining the appropriate seams between the layers. Done well, these seams form abstractions that hide the complexity of the layer below while leaving the layer above with sufficient flexibility. Examples that work well, such as abstracting packet-based network routing behind data streams (sockets), are rare, but when implemented well they enable major transformations like the Internet.
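A small Python sketch can hint at what such a seam looks like from the layer above. It uses a local socket pair from the standard library, so no real packet routing takes place; the helper name send_in_chunks and the chunk sizes are assumptions chosen for illustration. The point is only that the reader sees one continuous byte stream, regardless of how the data was cut up on the way in.

```python
import socket


def send_in_chunks(chunks):
    """Write the chunks to one end of a stream and read the other end.

    The reader deliberately uses a different chunk size than the writer
    to show that chunk boundaries are not part of the stream abstraction.
    """
    writer, reader = socket.socketpair()
    with writer, reader:
        for chunk in chunks:
            writer.sendall(chunk)
        writer.shutdown(socket.SHUT_WR)   # signal end of stream
        received = bytearray()
        while True:
            data = reader.recv(4)         # read at most 4 bytes at a time
            if not data:                  # empty read means the stream ended
                break
            received.extend(data)
    return bytes(received)


if __name__ == "__main__":
    print(send_in_chunks([b"hel", b"lo ", b"world"]))
```

The layer above neither knows nor cares how the bytes were split, which is exactly the flexibility a good seam leaves it with.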
Running an organization as a pyramid can be slow and limit feedback cycles, which are needed to drive innovation. However, some organizations have a pyramid model that’s even worse: the inverse pyramid. In this model, a majority of people manage and supervise a minority of people doing actual work. Besides the apparent imbalance, the inevitable need of the managers to obtain updates and status reports from the workers is guaranteed to grind progress to a halt. Such pathetic setups can occur in organizations that used to completely depend on external providers for IT implementation work, and are now starting to bring IT talent back in-house. It can also happen during a crisis, e.g. a major system outage, which gets so much management attention that the team spends more time preparing status calls than resolving the issue.
A second anti-pattern occurs when organizations realize the issues inherent in their hierarchical pyramid setup. They therefore supplement the existing top-down reporting organization (often referred to as line organization), with a new project organization. The combination is typically called a matrix organization (for once, this isn’t a movie reference) as people have a horizontal reporting line into their project and a vertical reporting line into the hierarchy. However, organizations that are not yet flexible and confident enough to give project teams the necessary autonomy are prone to creating a second pyramid, the project pyramid. Now employees struggle not only with one but with two pyramids.
Living in Pyramids
While building pyramids in IT can be debated, organizational pyramids are largely a given: we all report to a boss, who reports to someone else, and so on. In large organizations, we often define our standing by how many people are “above” us in the corporate hierarchy. The key consideration for an organization is whether they actually live the pyramid, i.e. whether the lines of communication and decision making follow the lines in the hierarchy. If that’s the case, then the organization will face severe difficulties in times that favor Economies of Speed because pyramid structures can be efficient, but they are neither fast nor flexible: decisions travel up and down the hierarchy, often suffering from a bottleneck in the coordination layer.
Luckily, many organizations don’t actually work in the patterns described by the organization chart but follow a concept of feature teams or tribes, which have complete ownership of an individual product or service: decisions are pushed down to the level of the people actually most familiar with the problem. This speeds up decision making and provides shorter feedback loops.
Some organizations are looking to speed things up by overlaying communities of practice over their structural hierarchy, bringing people with a common interest or area of expertise together. Communities can be useful change agents, but only if they are empowered and have clear goals. Otherwise, they run the risk of becoming communities of leisure, a hiding place for people to debate and socialize without measurable results.
One should wonder then why organizations are so enamored with org charts that they adorn the second slide of almost any corporate project presentation. My hypothesis is that static structures carry a lower semantic load than dynamic structures: when presented with a picture showing two boxes A and B connected by a line, the viewer can easily derive the model: A and B have a relationship. One can almost imagine two physical cardboard boxes connected by a string wire. Dynamic models are more difficult to internalize: if A and B have multiple lines between them that depict interaction over time, possibly including conditions, parallelism, and repetition, it’s much more difficult to imagine the reality the model is trying to depict. Often only an animation can make it more intuitive. Hence we are more content with static structures even though understanding a system’s behavior is generally much more useful than seeing its structure.
Black Markets to the Rescue
Ironically, under the covers of law-and-order, such organizations are intrinsically aware that their processes hinder progress. That’s why these organizations tolerate a “black market” where things get done quickly and informally without following the self-imposed rules. Such black markets often take the innocuous form of needing to “know who to talk to” to get something done quickly. You need a server urgently? Instead of following the standard process, you call your buddy who can “pull a few strings.” Setting up an official “priority order” process, usually for a higher price, is fine. Bypassing the process to get special favors for those who are well connected is a “black market”.
Another type of black market can originate from “high up.” While it’s not uncommon to offer different service levels, including “VIP support”, providing senior executives with support that ignores the very process- or security-related constraints that were imposed by the executives in the first place is a black market. Such a black market appears, for example, in the form of executives sporting sexy mobile devices that are deemed too insecure for employees, notwithstanding the fact that executives’ devices often contain the most sensitive data.
Black Markets Are Rarely Efficient
These examples have in common that they are based on unwritten rules and undocumented, or sometimes secret, relationships. That’s why black markets are rarely efficient, as you can see from countries where black markets constitute a major portion of the economy: black markets are difficult to control and deprive the government of much-needed tax income. They also tend to circumvent balanced allocation of resources: those with access to the black market will be able to obtain goods or favors that others cannot. Black markets therefore stifle economic development as they don’t provide broad and equal access to resources. This is true for countries as much as large enterprises.
In organizations, black markets often contribute to Slow Chaos where on the outside the organization appears to be disciplined and structured, but the reality is completely different. They also make it difficult for new members of the organization to gain traction because they lack the connections into the black market, presenting one way Systems resist change.
Black markets also cause inefficiency by forcing employees to learn the black market system. Knowing how to work the black market is undocumented organizational knowledge that’s unique to the organization. The time it takes employees to learn the black market doesn’t benefit the organization and presents a real, but rarely measured cost. Once acquired, the knowledge doesn’t benefit the employee either, because it has no market value outside of the organization. Ironically, this effect may contribute to large organizations tolerating black markets: it aids employee retention because much of their knowledge consists of undocumented processes, special vocabulary, and black market structures, which ties them to the organization.
Worse yet, black markets break necessary feedback cycles: if procuring a server is too slow to compete in the digital world, the organization must resolve the issue and speed up that process. Circumventing it in a black market fashion gives management a false sense of security, which often goes along with fabricated heroism: “I knew we could get it done in 2 days”. Amazon can get it done in a few minutes for a hundred thousand customers. The digital transformation is driven by democratization, i.e. giving everyone rapid access to resources. That’s exactly the opposite of what a black market does.
You Cannot Outsource a Black Market
Another very costly limitation of black markets is that they cannot be outsourced. Large organizations tend to outsource commodity processes like human resources or IT operations because specialized providers have better economies of scale and lower cost structures. Naturally, outsourcing provides only the officially established, inefficient processes. Because services are now performed by a third-party provider, and processes are contractually defined, the unofficial black market bypass no longer works. Essentially, the business has subjected itself to a work-to-rule slowdown. Organizations that rely on an internal black market, therefore, will experience a huge loss in productivity when they outsource part of their service portfolio.
Beating the Black Market
How do you avoid running the organization via a black market? More control and governance could be one approach: just like the DEA cracks down on the black market for drugs, you could identify and shut down the black market traders. However, one must recall that the IT organization’s black market isn’t engaged in trading illegal substances. Rather, people circumvent processes that don’t allow them to get their work done. Knowing that overambitious control processes caused the black market in the first place makes more control and governance an unlikely solution. Still, some organizations will be tempted to do so, which is a perfect example of doing exactly the opposite of what has the desired effect (see Every System is Perfect).
The only way to avoid a black market is to build an efficient “white” market, one that doesn’t hinder progress, but enables it. An efficient white market reduces people’s desire to build an alternate black market system, which does take some effort after all. Trying to shut down the black market without offering a functioning white market is likely to result in resistance and substantial reduction in productivity. Self-service systems are a great tool to starve black markets because they remove the human connection and friction by giving everyone equal access, thus democratizing the process. If you can order IT infrastructure through a self-explanatory tool that provides fast provisioning times, there’s much less motivation to do it “through the back door”. Automating undocumented processes is cumbersome, though, and often unwelcome because it may highlight the Slow Chaos.
Feedback and Transparency
Black markets generally originate as a response to cumbersome processes, which result from process designers taking the reporting or control point-of-view: inserting a checkpoint or quality gate at every step provides accurate progress tracking and valuable metrics. However, it makes people using the process jump through an endless sequence of hurdles to get anything done. That’s the reason I have never seen a single user-friendly HR or expense reporting system. Forcing people designing processes to use them for their own daily work can highlight the amount of friction the processes create. This means no more VIP support, but support that’s good enough for everyone to use. HR teams should apply for their own job openings to see how painful the process is (I applied to my own job openings for that very reason).
Transparency is a good antidote to black markets. Black markets are inherently non-transparent, providing benefit only to a small subset of people. Once users gain full transparency of the official processes, such as ordering a server, they may be less inclined to want to order one from the black market, which does carry some overhead and uncertainty. Therefore, full transparency should be embedded into an organization’s systems as a main principle.
Replacing a black market with an efficient, democratic white market also makes control less of an illusion: if users use official, documented, and automated processes, the organization can observe actual behavior and exert governance, e.g. by requiring approvals or issuing usage quotas. No such mechanisms exist for black markets.
The main hurdle to drying up black markets is that improving processes has a measurable up-front cost while the cost of the black market is usually not measured. This gap leads to the cost of no change being perceived as low, which in turn reduces the incentive to change.
Avoid Sync Points - Meetings Don’t Scale
Let’s assume people individually do their best to be productive and have high throughput, meaning we have efficient and effective system components. Now we need to look at the integration architecture, which defines the interaction between components, i.e. people. One of the most common interaction points (short of e-mail, more on that later) surely is the meeting. The name alone gives some of us goose bumps because it suggests that people get together to “meet” each other, but doesn’t define any specific agenda, objective, or outcome.
From a systems design perspective, meetings have another troublesome property: they require multiple humans to be (mostly) in the same place at the same time. In software architecture, we call this a “synchronization point”, widely known as one of the biggest throughput killers. The word “synchronous” derives from Greek and essentially means things happening at the same time. In distributed systems, for things to happen at the same time, some components have to wait for others, which is quite obviously not the way to maximize throughput.
The longer the wait for the synchronization point, the more dramatic the negative impact on performance becomes. In some organizations finding a meeting time slot among senior people can take a month or longer. Such resource contention on people’s time significantly slows down decision-making and project progress (and hurts Economies of Speed). The effect is analogous to locking database updates: if many processes are trying to update the same table record, throughput suffers enormously as most processes just wait for others to complete, eventually ending up in the dreaded deadlock. That administrative teams in large organizations act as transaction monitors underlines the overhead caused by using meetings as the primary interaction model. Worse yet, full schedules cause people to start blocking time “just in case”, a form of pessimistic resource allocation, which has exactly the opposite of the intended effect on the System Behavior.
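The cost of the synchronization point can be shown with a deliberately simple Python sketch. The names, the calendars, and the two helper functions are hypothetical; the model assumes that an asynchronous review is complete once each person has looked at a shared document on their own first free day.

```python
def meeting_day(calendars):
    """Synchronous: first day on which everyone is free at the same time."""
    shared = set.intersection(*(set(days) for days in calendars.values()))
    return min(shared) if shared else None


def async_day(calendars):
    """Asynchronous: everyone comments on their own first free day."""
    return max(min(days) for days in calendars.values())


# Hypothetical free days (counted from today) for three senior people.
calendars = {
    "ana": [1, 4, 9, 30],
    "ben": [2, 9, 15, 30],
    "chris": [3, 5, 12, 30],
}

if __name__ == "__main__":
    print("meeting possible on day", meeting_day(calendars))
    print("async review complete on day", async_day(calendars))
```

With these made-up calendars everybody has free time within the first three days, yet the only slot they all share is a month out. The meeting waits for the intersection of all calendars, while the asynchronous review only waits for the slowest individual.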
While getting together can be useful for brainstorming, critical discussions, or decisions (see below), the worst kind of meetings must be status meetings. If someone wants to know where a project “stands”, why would they want to wait for the next status meeting that takes place in a week or two? To top it off, many status meetings I attended had someone read text off a document that wasn’t distributed ahead of the meeting lest someone read through it and escape the meeting.
I describe the attributes required for fast software development and deployment (often referred to as “DevOps”) as follows:
- Development velocity assures that you can make code changes swiftly. If the code base is fraught with technical debt, such as duplication, you will lose speed right there.
- Once you have made a code change, you must have confidence in your code’s correctness, e.g. through code reviews, rigorous automated tests, and small, incremental releases. If you lack confidence, you will hesitate and you can’t be fast.
- Deployment must be repeatable, usually by being 100% automated. All your creativity should go into writing great features for your users, not into making each deployment work. Once you decide to deploy, you must depend on the deployment working exactly as it did the last 100 times.
- Your run-time must be elastic because once your users like what you built, you must be able to handle the traffic.
- You need feedback from monitoring to make sure you can spot production issues early and to learn what your users need. If you don’t know in which direction to head, moving faster is no help.
- And last but not least you need to secure your run-time environment against accidental and malicious attacks, especially when deploying new features frequently, which may contain, or rely on libraries that contain, security exploits.
ITIL to the Rescue?
If you challenge IT operations about slow chaos, you are likely to receive a stare of disbelief and a reference to ITIL, a proprietary but widely adopted set of practices for IT service management. ITIL provides common vocabulary and structure, which can be of huge value when supplying services or interfacing with service providers. ITIL is also a bit daunting, consisting of five volumes of some 500 pages each.
When an IT organization refers to ITIL, I wonder how large the gap between perception and reality is. Do they really follow ITIL, or is it used as a shield against further investigation into the slow chaos? A few quick tests give valuable hints: I ask a sysadmin which ITIL process he or she primarily follows. Or I ask an IT manager to show me the strategic analysis of the customer portfolio described in section 4.1.5.4 of the volume on service strategy. I also prominently display a set of ITIL manuals in my office to thwart anyone’s temptation of hand-waving their way through the conversation. ITIL itself is a very useful collection of service management practices. However, just like placing a math book under your pillow didn’t get you an A-grade in school, referencing ITIL alone doesn’t repel slow chaos.
The Way Out
“How come no one cleans up the slow chaos?”, you may ask. Many traditional, but successful organizations simply have too much money to really notice or bother. They must first realize that the world has changed from pursuing economies of scale to pursuing Economies of Speed. Speed is a great forcing function for automation and discipline. For most situations besides dynamic scaling, it’s OK if provisioning a server takes a day. But if it takes more than 10 minutes, you know there’ll be the temptation to perform a piece of it manually. And that’s the dangerous beginning of slow-moving chaos. Instead, let software eat the world and don’t send humans to do a machine’s job. You’ll be fast and disciplined.
Corporate IT tends to have its own vocabulary. The top contender for the most frequently used word must be to align, which translates vaguely into having a meeting with no particular objective beyond mulling over a topic and coming to some sort of agreement short of an official approval. Large IT organizations tend to be slowed down by doing this a lot.
Emperor’s New Clothes
Traditional IT governance can also cause an awkward scenario best described as the Emperor’s New Clothes: a central team develops a product that exists primarily in slide decks, so-called vaporware. When such a product is decreed as a standard, which is essentially meaningless, customers may happily adopt it because it’s an easy way to earn a “brownie point”, or even funding, for standard compliance without the need for much actual implementation. In the end everyone appears happy, except the shareholders: it’s a giant and senseless waste of energy.
To effect lasting change in an organization you need to understand:
- That organizations Will not change if there’s no pain.
- How to Lead Change by showing a better way of doing things.
- Why organizations need to think in Economies of Speed instead of Economies of Scale.
- Why an Infinite Loop is an essential part of digital organizations.
- Why excessively Buying IT services can be a fallacy.
- How to speed up organizations by Spending less Time Standing in Line.
- How you can get the organization to Think in New Dimensions.
To illustrate the stages a person or an organization tends to go through when transforming their habits, I drew up the example of someone changing from eating junk food to leading a healthy lifestyle. With no scientific evidence, I quickly came up with 10 stages:
- You eat junk food. Because it’s tasty.
- You realize eating junk food is bad for you. But you keep eating it. Because it is tasty.
- You start watching late-night TV weight-loss programs. While eating junk food. Because it is so tasty.
- You order a miracle-exercise machine from the late-night TV program. Because it looked so easy.
- You use the machine a few times. You realize that it’s hard work. Worse yet, no visible results were achieved during the two weeks you used it. Out of frustration you eat more junk food.
- You force yourself to exercise even though it’s hard work and results are meager. Still eating some junk food.
- You force yourself to eat healthier, but find it not tasty.
- You actually start liking vegetables and other healthy food.
- You become addicted to exercise. Your motivation changed from losing weight to doing what you truly like.
- Friends ask you for advice on how you did it. You have become a source of inspiration to others.
Change happens incrementally and takes a lot of time plus dedication.
Digital Transformation Stages
Drawing the analogy between my colleague’s company and my freshly created framework, I concluded that they must be somewhere between stage 3 and 4 on their transformation journey. What he attended was the digital equivalent of watching late-night miracle solutions. Maybe the company even invested in or acquired one of the nifty start-ups, which are young, hip, and use DevOps. But upon returning to his desk, he experienced that the organization was still eating lots of junk food.
I suggest that the transformation scale from 1 to 10 isn’t linear: the critical steps occur from stage 1 to 2 (awareness, not to be underestimated!), 5 to 6 (overcoming disillusionment) and from 7 to 8 (wanting instead of forcing yourself). I would therefore give his company a lot of credit for starting the journey, but warn them that disillusionment is likely to lie ahead.
Tuning the engine
Not everyone who buys snake oil is a complete fool, though. Many organizations adopt worthwhile practices but don’t understand that these practices don’t work outside of a specific context. For example, sending a few hundred managers to become Scrum Master certified doesn’t make you agile. You need to change the way people think and work and establish new values. Holding a stand-up meeting every day that resembles a status call where people report 73% progress also doesn’t transform your organization. It’s not that stand-up meetings are a bad idea, rather the opposite, but they are about much more than standing up. Real transformation has to go far beyond scratching the surface and change the system.
Systems theory teaches us that to change the observed behavior of a system, you must change the system itself. Everything else is wishful thinking. It’s like wanting to improve the emissions of a car by blocking the exhaust pipe. If you want a cleaner running car, there’s no other way than going all the way back to the engine and tuning it. When you want to change the behavior of a company, you need to go to its engine – the people and the way they are organized. This is the burdensome, but only truly effective way.
A tractor passing the race car
One particular danger of leading change with a different approach is that the existing, slow approaches are often more suitable for the current environment. This is a form of Systems resisting change and can result in your fancy new software / hardware / development approach being pummeled by the old, existing ways. I compare this to building a full-fledged race car, just to find out that in your corporate environment each car has to pull 3 tons of baggage in the form of rules and regulations. And instead of a nice, paved race track, you find yourself in a foot-deep sea of process mud. You will find out that the old corporate tractor slowly but steadily passes your shiny new Formula 1 car, which is busily throwing up mud while shredding its rear tires. In such a scenario, it becomes difficult to argue that you devised a better way of doing things.
It’s therefore critical to change processes and culture along with introducing new technology. A race car at a tractor-pulling contest will be laughable at best. You’ll have to dry up the swamp and build a proper road before it makes sense to commission a race car. You also need to employ your Communication Skills to secure management support when setbacks happen.
Setting course
To motivate people for change, you can either dangle the digital carrot, painting pictures of happy, digital life on far horizons, or wield the digital stick, warning of impending doom through disruption. In the end, you’ll likely need a little bit of both, but the carrot is generally the more noble approach. For the carrot to work, you need to paint a tangible picture of the alternate future and set visible, measurable targets based on the company strategy. For example, if the corporate strategy is based on increasing speed to reduce time-to-market, a tangible and visible goal would be to cut the release cycle for your division’s software products or services in half (or more) every year. If the goal is resilience, you set a goal of halving average times-to-recovery for outages (setting a goal related to the number of outages has two issues: it incentivizes hiding outages and it’s not the number of outages that counts, but the observed downtime). If you want to add a little stick to that goal, deploy a chaos monkey that verifies systems’ resilience by randomly disabling components.
The island of sanity
Some companies’ change programs sail far off the mainland to overcome the constraints imposed by the old world: innovation teams move into separate buildings, use Apple hardware, run services in the Amazon Cloud, and wear hoodies. I refer to this approach as building an “island of sanity in the sea of desperation”. I did exactly this in the year 2000 when our somewhat traditional consulting company vied for talent with Internet startups like WebVan and Pets.com (a plastic bag and a sock puppet decorate my private Internet Bubble archive).
Sooner or later, though, the island will become too small for the people on it, causing them to feel constrained in their career options. If the island has drifted far from the mainland because the mainland hasn’t changed much at all, re-integration will be very difficult, increasing the risk that people leave the company altogether. That’s what happened to most of my team in 2001. In addition, people will wonder why they have to live on a small and remote island when other companies feature the same, desirable (corporate) lifestyle on their mainland. Wouldn’t that seem much easier? Or, as a friend once asked, or rather challenged, me in a very pointed way: “Why don’t you just quit and let them die?”.
Skunkworks
Many significant innovations that came out of people working in a separate location have managed to transform the mothership, though. The best-known example perhaps is the IBM PC, which was developed far away from IBM’s New York headquarters in Boca Raton, Florida. The development bypassed many corporate rules, e.g. by mostly using parts from outside manufacturers, by building an open system, or by selling through retail stores. It’s hard to imagine where IBM (and the computer industry) would be without having built the PC.
IBM was certainly a company not used to moving quickly, with insiders claiming that it “would take at least nine months to ship an empty box”. But the prototype for the IBM PC was assembled in one month and the computer was launched only one year later, which required not only development, but also manufacturing setup. The team didn’t circumvent all processes, though: for example, the PC passed the standard IBM quality assurance tests.
The IBM PC is a positive example of an ambitious but specific project being led by existing management under executive sponsorship. People working on traditional projects probably didn’t feel that this project was a threat, but rather just felt that it was impossible for IBM to make a computer for less than $15,000. This approach avoided the “island” syndrome or the 2-speed IT approach where one-half of the company is “the future” and the other one the “past”, which won’t survive.
The Valley of the Blind
One shouldn’t underestimate resistance to change and innovation in large and successful enterprises that have “done things this way” for a long time. H. G. Wells’ short story of the “Country of the Blind” comes to mind: an explorer falls down a steep slope and discovers a valley completely separated from the rest of the world. Unbeknownst to the explorer, a genetic disease has rendered all villagers unable to see. Upon realizing this peculiarity, the explorer feels that because “the one-eyed man is king” he can teach and rule them. However, his ability to see proves to have little advantage in a place designed for blind people without windows or lights. After struggling to take advantage of his gift, the explorer is to have his eyes removed by the village doctor to cure his strange obsessions.
Oddly, two versions of the story exist, each with a different ending: in the original version, the explorer escapes the village after struggling back up the slope. The revised version has him observe that a rock slide is about to destroy the village and he’s the only one able to escape along with his blind girlfriend. In either case, it’s not a happy ending for the villagers. Be careful not to fall into the “in the land of the blind, the one-eyed man is king” trap. Complex organizational systems settle into specific patterns over time and actively resist change. If you want to change the behavior, you have to change the system.
A modern IT organization or start-up would have spent a few minutes deciding on the product and have accounts set up, a private repository created, and the first commit made in about 10 minutes. The speed-up factor comes to 210 days * (24 hours / day) * (60 minutes / hour) / 10 minutes ≈ 30,000! If that number alone doesn’t scare you, keep in mind that one organization published a paper (without selecting or implementing a product such as BitBucket, GitHub, or GitLab) and is merrily dragging their legacy along. Their “decision” is thus about as meaningful as prescribing that men should wear black shoes, but brown is also allowed for historical reasons. Meanwhile the other organization is already committing code in a live repository. If you extrapolate the traditional organization’s timeline to include vendor selection, license negotiation, internal alignment, paperwork, and setting up the running service, the ratio may well end up in the hundreds of thousands. Should they be scared? Yes!
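The speed-up arithmetic above can be reproduced in a few lines. This is only a sketch of the calculation in the text; the function name `speedup_factor` is made up for illustration, and elapsed time is counted in calendar days of 24 hours, as the formula above does.

```python
MINUTES_PER_DAY = 24 * 60


def speedup_factor(slow_days: float, fast_minutes: float) -> float:
    """Ratio of two elapsed times after converting the slow one to minutes."""
    if fast_minutes <= 0:
        raise ValueError("fast_minutes must be positive")
    return slow_days * MINUTES_PER_DAY / fast_minutes


# Source code repository: 210 days versus about 10 minutes.
repo_factor = speedup_factor(slow_days=210, fast_minutes=10)
print(f"{repo_factor:,.0f}")  # 30,240, i.e. roughly 30,000
```

The exact ratio matters less than its order of magnitude: the difference is four to five digits, not a few percent.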
Old Economies of Scale
How can this happen? Traditional organizations pursue economies of scale, meaning they are looking to benefit from their size. Size can indeed be an advantage, as can be seen in cities: density and scale provide short transportation and communication paths, diverse labor supply, better education, and more cultural offerings. Cities grow because the socioeconomic factors scale in a superlinear fashion (a city of double the size offers more than double the socioeconomic benefits), while increases in infrastructure costs are sublinear (you don’t need twice as many roads for a city twice the size). But density and size also bring pollution, risk of epidemics, and congestion problems, which ultimately limit the size of cities. Still, cities grow larger and live longer than corporate organizations. One reason lies in the fact that organizations suffer more severely from the overhead introduced by processes and control structures that are required or perceived to be required to keep a large organization in check. Geoffrey West, past president of the Santa Fe Institute, summarized this dynamic in his fascinating video conversation Why cities keep growing, corporations and people always die, and life gets faster.
In corporations, economies of scale are generally driven by the desire for efficiency: resources such as machines and people must be used as efficiently as possible, avoiding downtimes due to idling and retooling. This efficiency is often pursued by using large batch sizes: making 10000 of the same widget in one production run costs less than making 10 different batches of 1000 each. The bigger you are, the larger batches you can make, and the more efficient you become. This view is overly simplistic, though, as it ignores the cost of storing intermediate products, for example. Worse yet, it doesn’t consider revenue lost by not being able to serve an urgent customer order because you are in the midst of a large production run: the organization values resource efficiency over customer efficiency.
The manufacturing business realized this about half a century ago, resulting in most things being manufactured in small batches or in one continuous batch of highly customized products. Think about today’s cars: the number of options you can order is mind-boggling, causing the traditional “batch” thinking to completely fall apart: cars are essentially batches of one. With all the thinking about “lean” and “just in time” manufacturing it’s especially astonishing that the IT industry is often still chasing efficiency instead of speed.
While in static environments being big is an advantage thanks to economies of scale, in times of rapid change economies of speed win over and allow start-ups and digital native companies to disrupt much larger companies. Or as Jack Welch famously stated: “If the rate of change on the outside exceeds the rate of change on the inside, the end is near.”
Behold the Flow!
The quest for efficiency focuses on the individual production steps, looking to optimize their utilization. What’s completely missing is the awareness of the production flow, i.e. the flow of a piece of work through a series of production steps. Translated into organizations, individual task optimization results in every department requiring lengthy forms to be filled out before work can begin: I have been told that some organizations require firewall changes to be requested 10 days in advance. And all too often the customer is subsequently told that some thing or another is missing from the request form and is sent back to the beginning of the line. After all, helping the customer fill out the form would be less efficient. If that reminds you of government agencies, you may get the hint that such processes aren’t designed for maximum speed and agility.
Besides the inevitable frustration with such setups, they trade off flow efficiency for processing efficiency: the work stations are nicely efficient, but the customers (or products or widgets) chase from station to station, fill out a form, pick a number, and wait. And wait. And wait some more just to find out they are in the wrong line or their need cannot be processed. This is dead time that isn’t measured anywhere except in the customers’ blood pressure. Come to think of it, in most of these places, the people going through the flow are not customers in the true sense as they don’t choose to visit this process, but are forced to. That’s why you are bound to experience such setups at government offices, where you could at least argue that misguided efficiency is driven by the pursuit to preserve taxpayer money. You’ll also commonly find it in IT departments that exert strong governance.
Flow-based thinking calls this concept the cost of delay (see the excellent book The Principles of Product Development Flow), which must be added to the cost of development. Launching a promising product later means that you lose the opportunity to gain revenue during the time of delay. For products with large revenue upside, the cost of delay can be much higher than the cost of development, but it’s often ignored. On top of avoiding the cost of delay, deferring a feature and launching sooner also allows you to learn from the initial launch and adjust your requirements accordingly. The initial launch may be an utter failure, causing the product to never be launched in the second country. By deferring this feature you avoided wasting time building something that would have never been used. Gathering more information allows you to make a better decision.
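A back-of-the-envelope calculation shows how quickly the cost of delay can outweigh the cost of development. The sketch below uses a deliberately simple model, which is an assumption and not taken from the book: revenue is lost at a constant monthly rate for as long as the launch is delayed. All figures and function names are hypothetical.

```python
def cost_of_delay(monthly_revenue: float, months_delayed: float) -> float:
    """Revenue forgone while the product isn't on the market (constant-rate assumption)."""
    return monthly_revenue * months_delayed


def true_cost(development_cost: float, monthly_revenue: float, months_delayed: float) -> float:
    """Development cost plus the cost of delay, which budgets often leave out."""
    return development_cost + cost_of_delay(monthly_revenue, months_delayed)


# Hypothetical figures: a feature costs 200,000 to build, but waiting for it
# holds back a launch expected to earn 150,000 per month by three months.
delay = cost_of_delay(monthly_revenue=150_000, months_delayed=3)
total = true_cost(development_cost=200_000, monthly_revenue=150_000, months_delayed=3)
print(delay, total)  # 450000 650000
```

In this made-up case the delay costs more than twice as much as the development itself, yet only the development cost would show up in a typical project budget.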
The Value and Cost of Predictability
How come intelligent people ignore basic economic arguments such as calculating the cost of delay? They are working in a system that favors predictability over speed. Adding a feature later, or, worse yet, deciding later whether to add it or not may require going through lengthy budget approval processes. Those processes exist because the people who control the budget value predictability over agility. Predictability makes their lives easier because they plan the budget for the next 12-24 months, and sometimes for good reasons: they don’t want to disappoint shareholders with run-away costs that unexpectedly reduce the company profit. As these teams manage cost, not opportunity, they don’t benefit from an early product launch.
Chasing predictability causes another well-known phenomenon: sandbagging. Project and budget plans sandbag by overestimating timelines or cost in order to more easily achieve their target. Keep in mind that estimates aren’t single numbers, but probability distributions: a project may have a 50% chance of being done in four weeks’ time. If “you are lucky and all goes well” it may be done in three weeks, but with only a 20% likelihood. Sandbaggers pick a number far off on the other end of the probability spectrum and would estimate eight weeks for the project, giving them a greater than 95% chance of meeting the target. Worse yet, if the project happens to be done in four weeks, the sandbaggers idle for another four weeks before release to avoid having their time or budget estimates cut the next time. If a deliverable depends on a series of activities, sandbagging compounds and can extend the time to delivery enormously.
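How sandbagging adds up along a chain of dependent activities can be sketched as follows. The model is an illustrative assumption: activities run strictly one after another, and each team releases its result no earlier than the date it promised, as the idling sandbaggers described above do. The five-step chain and the function name are hypothetical.

```python
def chain_delivery_weeks(actual_weeks: list[float], promised_weeks: list[float]) -> float:
    """Elapsed time for sequential activities when nobody hands over before the promised date."""
    return sum(max(actual, promised) for actual, promised in zip(actual_weeks, promised_weeks))


steps = 5
actual = [4] * steps         # each activity really takes four weeks
honest = [4] * steps         # estimate equal to the actual duration
sandbagged = [8] * steps     # each team promises eight weeks to be safe

print(chain_delivery_weeks(actual, honest))      # 20
print(chain_delivery_weeks(actual, sandbagged))  # 40
```

Every team meets its target comfortably, yet the deliverable arrives 20 weeks later than it had to.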
Old World Hurdles
Unfortunately, traditional companies aren’t built for rapid feedback cycles. They often still separate run from change and assume a project is done by the time it reaches production. Launching a product is about the 120 degree mark in the innovation wheel-of-fortune, so making 1/3 of a single revolution counts for nothing if your competition is on their 100th refinement.
What keeps traditional organizations from completing rapid learning cycles? Their structure as a layered hierarchy: in a fairly static, slow moving world, organizing into layers has distinct advantages: it allows a small group of people to steer a large organization without having to be involved in all details. Information that travels up is aggregated and translated for easy consumption by upper management. Such a setup works very well in large organizations but has one fundamental disadvantage: it’s horribly slow to react to changes in the environment or to insights at the working level. It takes too much time for information to travel all the way up to make a decision because each “layer” in the organization brings communication overhead and requires a translation. Even if architects can ride the elevator, it still takes time for decisions to trickle back down through a web of budgeting and steering processes. Once again we aren’t talking about a difference of 10%, but of factors in the hundreds or thousands: traditional organizations often run feedback cycles to the tune of 18 months while digital companies can do it in days or weeks.
In times where almost every organization wants to become more “digital” and the technical platforms are readily available as open source or cloud services, building a fast-learning organization is a critical success factor.
Build - Measure - Learn
There’s one loop, though, that’s a key element of most digital companies: the continuous learning loop. Because digital companies know well that control is an illusion, they are addicted to rapid feedback. Eric Ries eternalized this concept in his book The Lean Startup as the Build - Measure - Learn cycle: a company builds a minimum viable product and launches it into production to measure user adoption and behavior. Based on the insights from live product usage the company learns and refines the product. Jeff Sussna aptly describes the “learning” part of the cycle as “operate to learn” – the goal of operations isn’t to maintain the status quo, but to deliver critical insights into making a better product.
Digital RPMs
The critical KPI for most digital companies is how much they can learn per dollar or time unit spent, i.e. how many revolutions through the Build - Measure - Learn cycle they can make. The digital world has thus changed the nature of the game completely and it would be foolish at best (fatal at worst) to ignore this change.
Pivoting the Layer Cake
To speed up the feedback engine you need to turn the organizational layer cake on its side by forming teams that carry full responsibility from product concept to technical implementation, operations, and refinement. Often such an approach carries the label of “tribes”, “feature teams”, or “DevOps”, which is associated with a “you build it, you run it” attitude. Doing so not only provides a direct feedback loop to the developers about the quality of their product (pagers going off in the middle of the night are a very immediate form of feedback), but it also scales the organization by removing unnecessary synchronization points: all relevant decisions can be made within the project team.
Running in independent teams that focus on rapid feedback has one other fundamental advantage: it brings the customer back into the picture. In the traditional pyramid of layered command-and-control the customer is nowhere to be found - at best somewhere interacting with the lowest layer of the organization, far from where decisions are made and strategies are set. In contrast, “vertical” teams draw feedback and their energy directly from the customer.
The main challenge in assembling such teams is to get a complete range of skill sets into a compact team, ideally not exceeding the size of a “2 pizza team” that can be fed by 2 large pizzas. This requires qualified staff, a willingness to collaborate across skill sets, and a low-friction environment.
At Google, getting a USB charger cable was a matter of 2.5 minutes: 1 minute to walk to the nearest Tech Stop, 30 seconds to swipe your badge and scan the cable at the self-checkout, and 1 minute to walk back to your desk. In corporate IT, I had to mail someone, who mailed someone, who asked me the type of phone I use and then entered an order, which I had to approve. Elapsed time: about 2 weeks. Speed factor: 14 days x 24 hours/day x 60 minutes/hour / 2.5 minutes = 8064, in the same league as setting up a Source Code Repository.
The challenge an organization faces when “moving up the stack”, e.g. from infrastructure to application software platform or from software platform to end-user application, is well known and has aptly been labeled the Stack fallacy. Even successful companies underestimate the challenge and are subject to the fallacy: VMware missed the shift from virtualization software to Docker containers, Cisco has been spending billions in acquisitions to get closer to application delivery, and even mighty Google failed to move from utility software like search and mail to an engaging social network, a market dominated by Facebook.
How can organizations have too much money? After all, their goal is to maximize profits and shareholder returns. To do so, companies use stringent budgeting processes that control spending. For example, proposed projects are assessed by their expected rate of return against a benchmark typically set by existing investments, sometimes called IRR, “Internal Rate of Return”.
Such processes can hurt innovation, though, when new ideas must compete with existing, highly profitable “cash cows”. Most innovative products can’t match established products’ performance or profitability during early stages. Traditional budgeting processes may therefore reject new and promising ideas, a phenomenon that Christensen named the Innovator’s Dilemma. However, when these new technologies later surpass sustaining technologies, they threaten organizations that didn’t invest early on and now lag behind.
Rich companies tend to have a high IRR and are therefore especially likely to reject new ideas. Also, they perceive the risk of no change as low – after all, things are going great. This dampens the appetite for change (see No Pain, No Change) and increases the danger of disruption.
Beware of the HiPPO
Despite its downsides, companies making investment decisions based on expected return at least use a consistent decision metric. Many rich companies have a different decision process: that of HiPPO, the Highest Paid Person’s Opinion. This approach isn’t just highly subjective but also susceptible to shiny, HiPPO-targeted vendor demos, which peddle incremental “enterprise” solutions as opposed to real innovation. Because those decision makers are far removed from actual technology and software delivery, they don’t realize how fast new solutions can be built on a shoestring budget.
To make matters worse, internal “sales people” exploit management’s limited understanding to push their own pet projects, often at a cost orders of magnitude higher than what digital companies would spend. I have seen someone make it to board level with the idea of exposing functionality as an API, at a cost of many million Euros. It’s easy to sell people in the stone age a wheel.
When in university, we often wonder whether and how what we learn will help us in our future careers and lives. While I am still waiting for the Ackermann function to accelerate my professional advancement (our first semester in computer science blessed us with a lecture on computability), the class on queuing theory was actually helpful: not only can you talk to the people in front of you in the supermarket checkout line about M/M/1 systems and the benefits of single queue, multiple servers systems (which most supermarkets don’t use), but it also gives you an important foundation to reason about Economies of Speed.
Looking Between the Activities
When looking to speed things up in enterprises, most people look at how work is done: are all machines and people utilized, are they working efficiently? Ironically, when looking for speed, you mustn’t look at the activities, but between them. By looking at activities you may find inefficient activity, but between the activities is where you find inactivity, things sitting around and waiting to be worked on. Inactivity can have a much more detrimental effect on speed than inefficient activity. If a machine is working well and almost 100% utilized, but a widget has to wait 3 months to be processed by that machine, you may have replicated the public health care system, which is guided by efficiency, but certainly not speed. Many statistics show that wait times make up more than 90% of typical processes in corporate IT, such as ordering a server. Instead of working more we should wait less.
Our university professor reminded us that if we remember only one thing from his class, it should be Little’s Result.
Finding Queues
Queuing theory proves that driving up utilization increases processing times: if you live in a world where speed counts, you have to stop chasing efficiency. Instead, you have to have a look at your queues. Sometimes these queues are visible like the lines at government offices where you take a number and wonder whether you’ll be served before closing time. In corporate IT the queues are generally less visible – that’s why so little attention is paid to them. By looking a little harder, you can find them almost everywhere, though:
- Busy calendars: When everyone’s calendar is 90% “utilized”, topics queue for people to meet and discuss. I waited for meetings with senior executives for multiple months.
- Steering meetings: Such regular meetings tend to occur once a month or every quarter. Topics will be queued up for them, often holding up decisions or project progress.
- E-mail: Inboxes fill up with items that would take you a mere 3 minutes to take care of, but that you don’t get to for several days because you are highly “utilized” in meetings all day. Stuff often rots in my inbox queue for weeks.
- Software releases: Code that is written and tested but waiting for a release is sitting in a queue, sometimes for 6 months.
- Workflow: Many processes for anything from getting an invoice paid to requesting a raise for employees have excessive wait times built in. For example, ordering a book takes large companies multiple weeks, as opposed to having it delivered the next day from Amazon.
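The claim that driving up utilization increases processing times can be made concrete with a small sketch. It assumes the textbook M/M/1 model (random arrivals, one server), where the average time in the system is the service time divided by (1 - utilization), and applies Little’s Result (items in system = arrival rate × time in system). The function names and the 4-hour service time are illustrative assumptions, not figures from a real IT department.

```python
def mm1_time_in_system(service_time_hours: float, utilization: float) -> float:
    """Average time in system (waiting + service) for an M/M/1 queue.

    Textbook formula: W = S / (1 - rho), valid only for 0 <= rho < 1.
    """
    if not 0.0 <= utilization < 1.0:
        raise ValueError("utilization must be in [0, 1)")
    return service_time_hours / (1.0 - utilization)


def items_in_system(arrival_rate_per_hour: float, time_in_system_hours: float) -> float:
    """Little's Result: L = lambda * W."""
    return arrival_rate_per_hour * time_in_system_hours


# Hypothetical average effort per request, chosen only for illustration.
SERVICE_TIME_HOURS = 4.0

for rho in (0.5, 0.8, 0.9, 0.95, 0.99):
    w = mm1_time_in_system(SERVICE_TIME_HOURS, rho)
    # Utilization = arrival rate * service time, so arrival rate = rho / S.
    arrival_rate = rho / SERVICE_TIME_HOURS
    queue_len = items_in_system(arrival_rate, w)
    print(f"utilization {rho:.0%}: {w:7.1f} h in system, {queue_len:5.1f} items in system")
```

Under these assumptions, going from 50% to 90% utilization stretches the time in the system from 8 to about 40 hours for the same 4 hours of work; the extra time is spent in the queue.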
To get a feeling for the damage done by queues, consider that ordering a server often takes 4 weeks or more. The infrastructure team won’t actually bend metal to build a brand new server just for you: most servers are provisioned as virtual machines these days (thanks to Software Eating the World). If you reasonably assume that there are 4 hours of actual work in setting up a server consisting of assigning an IP address, loading an operating system image, doing some non-automated installations and configurations, the time spent in the queue makes up 99.4% of the total time! That’s why we should look at the queues. Reducing the 4 hours of effort to 2 won’t make any difference unless you reduce the wait times.
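The server arithmetic above can be checked in a few lines. This sketch assumes that the 4 weeks are measured in calendar time (24 hours a day, 7 days a week), which is the reading that reproduces the 99.4% figure; the function name is illustrative.

```python
HOURS_PER_WEEK = 7 * 24  # calendar hours; an assumption about how elapsed time is counted


def wait_share(elapsed_hours: float, work_hours: float) -> float:
    """Fraction of the elapsed time during which nobody works on the request."""
    return (elapsed_hours - work_hours) / elapsed_hours


elapsed = 4 * HOURS_PER_WEEK  # 4 weeks from order to delivery = 672 hours
work = 4                      # hours of actual setup effort

print(f"time spent waiting: {wait_share(elapsed, work):.1%}")

# Halving the effort from 4 hours to 2 removes only 2 hours from the total.
faster_elapsed = elapsed - 2
print(f"elapsed time after halving the effort: {faster_elapsed} h instead of {elapsed} h")
print(f"overall speed-up: {(elapsed - faster_elapsed) / elapsed:.1%}")
```

Halving the effort shortens delivery by roughly 0.3%, whereas cutting the wait from weeks to days would shorten it by an order of magnitude.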
Living Along a Line
IT architecture is a profession of trade-offs: flexibility brings complexity, decoupling increases latency, distributing components introduces communication overhead. The architect’s role is often to determine the “best” spot on such a continuum, based on experience and an understanding of the system context and requirements. A system’s architecture is essentially defined by the combination of trade-offs made across multiple continua.
Quality vs. Speed
When looking at development methods, one well-known trade-off is between quality and speed: if you have more time, you can achieve better quality because you have time to build things properly and to test more extensively to eliminate remaining defects. If you count how many times you have heard the argument: “we would like to have a better (more reusable, scalable, standardized) architecture, but we just don’t have time”, you start to believe that this god-given trade-off is taught in the first lecture of “IT project management 101”. The ubiquitous slogan “quick-and-dirty” further underlines this belief.
The folks bringing this argument often also like to portray companies or teams that are moving fast as undisciplined “cowboys” or as building software where quality doesn’t matter as much as in their “serious” business, because they cannot distinguish Fast Discipline from Slow Chaos. The term banana product is sometimes used in this context – a product that supposedly ripens in the hands of the customer. Again, speed is equated with a disregard for quality.
Ironically, the cause of the “we don’t have time” argument is often self-inflicted, as project teams tend to spend many months documenting and reviewing requirements or getting approval, until finally upper management puts their fist on the table and demands some progress. During all these preparation phases the team “forgot” to talk to the architecture team until someone in budgeting catches them and sends them over for an architecture review, which invariably begins with “I’d love to do it better, but…” The consequence is a fragmented IT landscape consisting of a haphazard collection of ad-hoc decisions because there was never enough time to “do it right” and no business case to fix it later. The old saying “nothing lasts as long as the temporary solution” certainly holds in corporate IT. Most of these solutions last until the software they are built on goes out of vendor support and becomes a security risk.
Now we can portray the trade-off between the two parameters as a curve whose shape depicts how much speed we have to give up to achieve how much better quality.
Changing the Rules of the Game
Once you have moved into the two-dimensional space, you can ask a much more profound question: “can we shift the curve?” And: “if so, what would it take to shift it?” Shifting the curve to the upper right would give you better quality at the same speed or faster speed without sacrificing quality. Changing the shape or position of the curve means we no longer have to move along a fixed continuum between speed and quality. Heresy? Or a doorstep to a hidden world of productivity?
Probably both, but that’s exactly what digital companies have achieved: they have shifted the curve significantly to achieve never-before-seen speeds in IT delivery while maintaining feature quality and system stability. How do they do it? A big factor is following processes that are optimized for speed, as opposed to resource utilization or schedule predictability. The key ingredients are of technical or architectural nature: automation, independent deployability of code modules, resilient run-times, advanced monitoring, analytics, etc:
They understand that software runs fast and predictably, so they never send a human to do a machine’s job. They turn as many problems as possible into software problems, so they can automate them and hence move faster and often more predictably. If something does go wrong, they can react quickly, often with the users barely noticing. This is possible because everything is automated and they use Version Control. They build resilient systems, ones that can absorb disturbance and self-heal, instead of trying to predict and eliminate all failure scenarios.
Inverting the Curve
If adding a new dimension doesn’t make folks’ heads hurt enough, tell them that in software development it’s even possible to invert the curve: faster software often means better software! Much time in software development is lost to friction and manual tasks: long wait times for servers or environments to be set up, manual regression testing, etc. Removing this friction, usually by Automating Everything, not only speeds up software development but also increases quality because manual tasks are a common source of errors. As a result, you can use speed as a lever to increase quality.
What Quality?
When speaking about speed and quality, one should take a moment to consider what quality really means. Most traditional IT folks would define it as the software’s conformance to specification and possibly adherence to a schedule. System uptime and reliability are surely also part of quality. These facets of quality have the essence of predictability: we got what we asked or wished for at the time we were promised it. But how do we know whether we asked for the right thing? Probably someone asked the users, so the requirements reflect what they wanted the system to do. But do they know what they really want, especially if you are building a system the users have never seen before? One of Kent Beck’s great sayings is: “I want to build a system the users wish they asked for.”
The traditional definition of quality is a proxy metric: we presuppose to know what the customers want, or at least that they know what they want. What if this proxy isn’t a very reliable indicator? Companies living in the digital world don’t pretend to know exactly what their customers want because they are building brand-new solutions. Instead of asking their customers what they want, they observe customer behavior. Based on the observed behavior they quickly adjust and improve their product, often trying out new things using A/B testing. One could argue that this results in a product of much higher quality, one that the customers wish they could have asked for. So you’re not only able to shift the curve of how much quality you can get for how much speed, you can also change what quality you are aiming for. Maybe this is yet another dimension?
Losing a Dimension
What happens when a person who is used to working in a world with more degrees of freedom enters a world with fewer? This can lead to a lot of surprises and some headaches, almost like moving from our three-dimensional world to the Planiverse. The best way out is education and Leading Change.
Transforming from Bottom-up
This book’s main purpose is to encourage IT architects to take an active role in transforming traditional IT organizations that must compete with digital disruptors. “Why are technical architects supposed to take on this enormous task?”, you may ask, and rightly so: many managers or IT leaders may have the strong communication and leadership abilities that are needed to change organizations. However, today’s digital revolution is not just any organizational restructuring, but one that is driven by IT innovation: mobile devices, cloud computing, data analytics, wireless networking, and the Internet of Things, to name a few.
Leading an organization into the digital future therefore necessitates a thorough understanding of the underlying technologies along with their application for competitive advantage. It’s hard to imagine that instigating a digital transformation purely from “the top down” can be successful. Non-tech-savvy management can at best limp along based on input from external consultants or trade journals. That’s not going to cut it, though: competition in the digital world is fierce and customer expectations are increasing every day. When we hear of a successful start-up company that went public or was acquired for a huge sum of money, we usually forget the dozens or even hundreds of start-ups in the same space that didn’t make it despite a great idea and a bunch of smart people working extremely hard on it. Architects, who are rooted in technology, are needed to help drive the transformation.
If you are not yet convinced that transforming the organization is part of your job as an architect, you may not have much of a choice: recent technology advances can only be successfully implemented if the organizational structure, processes, and often the culture also change. For example, “DevOps” style development is enabled through the advent of automation technologies, but relies on breaking down change and run silos. Cloud computing can reduce time-to-market and IT cost dramatically, but only if the organization and its processes empower developers to actually provision servers and make necessary network changes. Lastly, being successful with data analytics requires the organization to stop making decisions based on management slide sets and to base them on hard data instead. All these are major organizational transformations. Technology evolution has become inseparable from organizational evolution. Correspondingly, the job of the architect has broadened from designing new IT systems to also designing a matching organization and culture.
Transforming from Inside-out
Most digital markets are winner-takes-all markets: Google owns search, Facebook owns social, Amazon owns fulfillment and cloud, Netflix mostly owns content (battling with Amazon). Apple and Google’s Android own mobile. Google tried to get into social and floundered. Microsoft struggles in search and mobile. Amazon also struggles in mobile, just as Google repeatedly dabbles in fulfillment and can never quite get traction. In cloud computing even almighty Google is at best a runner-up with Amazon holding a huge lead. Watching this battle of the titans from the sidelines of a traditional organization often resembles watching world-class athletes compete from the bleachers while eating popcorn: these organizations sport multi-hundred-billion-dollar valuations (Netflix being the “baby” with roughly $50B market capitalization in 2016), have access to the world’s top IT talent, and are run by extremely talented and skilled management teams.
Watching vendor demos and purchasing a few new products aren’t going to make an organization competitive against these guys. As the overall direction of the digital revolution has become fairly clear, and technology has been democratized to the point where every individual with a credit card can procure servers and big data analytics engines within minutes, the main competitive asset for an organization is its ability to learn fast. External consultants and vendors can give a boost, but cannot substitute for an organization’s ability to learn. Architects are therefore needed to drive or at least support the transformation from inside the organization.
From Ivory Tower Resident to Corporate Savior
In times of digital disruption, the job of the IT architect has become more challenging: keeping pace with ever faster technology evolution, but also being well-versed in organizational engineering, understanding corporate strategy, and communicating to upper management is now part of being an architect. But the architect’s job has also become more meaningful and rewarding, if he or she takes up the challenge. The new world does not reward architects who draw diagrams while sitting in the ivory tower, but hands-on innovation drivers and change agents. I hope this book encourages you to take the challenge and equips you with useful guidance and a little wisdom along your way.
Nothing but the truth
Extorting a final reference from the movie The Matrix, when Morpheus asks Neo to choose between the red pill, which will eject him into reality, and the blue pill, which will keep him inside the illusion of the Matrix, he doesn’t describe what “reality” looks like. Morpheus merely states:
Remember: all I’m offering is the truth. Nothing more.
If he had told Neo that the truth translates into living in the confines of a bare bones hovercraft ship patrolling sewers in the middle of a war against the machines who perpetually hunt the ship to chop it up with their powerful laser beams, he may have taken the blue pill. But Neo had already understood that there’s something wrong with the current state, the Matrix illusion, and felt a strong desire to change the system. Most corporate IT residents, in contrast, are quite content with their current environment and position. So you likely need to push them a little harder to take the red pill.
Just like in the movie The Matrix, though, the new digital reality that awaits the red-pill-taking folks may not be exactly what they expected. In a meeting, a fellow architect once proudly proclaimed that for transformation to succeed the architect’s life needs to be made easier. Aiming to make one’s life easier is unlikely to lead into the digital future but will rather end up in disappointment. Technological advances and new ways of working make IT more interesting and valuable to the business, but they don’t make it easier: new technologies have to be learned and the environment generally becomes more complex, all the while the pace picks up. Digital transformation isn’t a matter of convenience, but of corporate survival.
Looks are Deceiving
Just as it seems unlikely that a simple block of ice can sink a modern (at the time) marvel of engineering, small, digital companies may not feel threatening to a traditional enterprise. Most start-ups are run by relatively inexperienced, sometimes even naive people who believe they can revolutionize an industry while sitting on a beanbag because their office space hasn’t been fully set up yet. They are often understaffed and have to secure multiple rounds of external funding before turning profitable, if ever at all.
However, just like 90% of an iceberg’s volume lies under water, digital companies’ enormous strength is hidden: it lies in their ability to learn much faster, often orders of magnitude faster than traditional organizations. Dismissing or trivializing start-ups’ initial attempts to enter an established market could therefore be a fatal mistake. “They don’t understand our business” is a common observation from traditional businesses. However, what took a business 50 years to learn may take a disruptor only 1 year or less because they are set up for Economies of Speed and have amazing technology at their disposal.
Digital disruptors also don’t have to unlearn bad habits. Learning new things is difficult, but unlearning existing processes, thought patterns, and assumptions is disproportionately more difficult. Unlearning and abandoning what made them successful in the past is one of the biggest transformation hurdles for traditional companies.
Some traditional businesses may feel safe from disruption because their industry is regulated. To demonstrate how thin a safety net regulation provides, I routinely remind business leaders that if the digitals have managed to put electric and self-driving cars on the road and rockets into space, they are surely capable of obtaining a banking or insurance license. For example, they could simply acquire a licensed company.
Lastly, digital disruptors don’t tend to attack from the front. They tend to choose weak spots in existing business models that are highly inefficient, but not significant enough for large, traditional enterprises to pay attention to. Airbnb didn’t build a better hotel, and Fintech companies aren’t interested in rebuilding a complete bank or insurance company. Rather, they attack the distribution channels, where inefficiency, high commissions, and unhappy customers allow new business models to scale rapidly with minimum capital investment. Some researchers claim that had the Titanic hit the iceberg head on, it might not have sunk. Instead, it was taken down because the iceberg tore open a large portion of the relatively weak side of the hull. That’s where the digitals hit.