Why does Software Suck? Part II
Sunday, January 10, 2010 at 2:27 pm

Last week I went over some of the reasons why modern-day computers are what they are. Today I plan to go over some reasons why, regardless of what it’s running on, writing correct software is hard – one of the hardest engineering feats out there. Not in terms of requiring lots of intelligence, but diligence.

When I read The Mythical Man-Month a few months back, I was struck by how dead-on accurate it was about the pitfalls of software engineering, even though it was written back in 1975, when the craft of software was so much younger. But here we are, more than thirty years later, and although we’ve built systems up higher and higher on top of yesterday’s systems, and we have the internet and dual core processors and the Playstation 3 and Photoshop, most of Brooks’ critiques are just as valid today as they were then. He begins his book by comparing software engineering to the tar pits of bygone eras, trapping powerful dinosaurs and sabre-tooth tigers, sinking them, struggling with all their awesome might, into the pit. If you want to understand software – or even how to manage extremely complex projects – I can’t recommend the book to you strongly enough. Here will follow some things I got out of both the book and my experiences, pulling from my often-inaccurate memory.

Here, in a nutshell, are the problems with software engineering:

1. You must be perfect. You cannot be almost-perfect or leave a few things ambiguous. Nothing is ambiguous because everything must become some series of zeros and ones for the processor to run. Every line you write, every bit that is compiled, every flip of some switch deep inside the computer’s memory bank must be perfect. If it isn’t perfect, maybe it’ll work right most of the time. And maybe sometimes it’ll crash terribly and destroy all your data. Human beings are not accustomed to working perfectly and without flaw. In fact, sometimes it is the imperfections – the noticeable paintbrush strokes, the symmetrical dimple, the beauty spot, the awkward laugh – that we find charming. We adapt to imperfections and interpret them. You do not have such wiggle room in programming a machine. A computer does not interpret; it is a dumb machine that does exactly what you tell it to. A slight mistake causes significant consequences.

2. You must be perfect in continually unique tasks. When you are constructing a building, you have simple repetitive tasks that must be done. All must be done well – laying the foundation, constructing support beams, laying bricks – but these are a couple of tasks repeated thousands of times. Laying the second brick is not a different task than laying the first brick; it is just another brick. After several hundred, you become better at them and become a bricklaying expert. There is no such analogue in computer programming. If a programmer finds himself writing the same piece of code, what he does is separate that task into its own subroutine, and whenever he needs it done, he makes a call to that one task. This was the whole point of a computer – if you define how to do something once, you don’t have to define it ever again, and the computer will do it over and over again for you. What this means is when you are making a program, you don’t have the repetition of laying a thousand bricks; once you’ve figured out how to lay a brick, you define the steps needed to lay one brick and then just make a subroutine call to do that every time you find yourself needing to lay a brick. You don’t see the problem of bricklaying ever again (unless you find out you did it wrong and need to modify it). This means that when a programmer is writing a set of tasks, almost everything is unique. There are generally not repetitious programming tasks which must be done over and over again; everything is approached afresh, defined, and then submitted to some library of common tasks. And each task must be done perfectly.

3. Reading code is much more difficult than writing it. It is very difficult to explain this to someone without the experience of working on a software project. Programming and coding are not easily-visualized disciplines. In fact, there is nothing inherently visual about them at all, regardless of how many flow-charts you may want to make. A programmer goes from a pure (or vague) algorithm in his head straight to a list of concrete instructions. These are not lists like “Pick up milk at the grocery store” – but rather explicit instructions about memory structures and how to process those memory structures. Again, these are not visual and beyond a certain level of complexity cannot be described comprehensively with any two-dimensional visual aid. When it’s all at the front of your mind, and you’re seeing the math of how it works, it’s relatively straightforward to define. However, unless you are extremely strict about writing down why you’re doing everything as you do it, you can go back to these mathematical definitions of how to move memory around and ask yourself what on earth did I do. And if it is hard, a month or two down the line, to interpret what you yourself did, it is far more difficult to interpret what someone else did. And if you are on a large software project, you will have to look at and fix problems in other people’s code. If you fail to interpret precisely what they were trying to do, you are likely to introduce further problems. I assure you there are lines in Windows code whose purpose no one knows any longer. But if you remove them, the product breaks. This is why software projects tend to get larger and larger, and never smaller – no one knows what the “legacy code” is (that’s what we call old code that nobody understands anymore but that is somehow still necessary) or how to fix it.

4. On large-scale projects, you have many external dependencies. It doesn’t sound so bad if you have to rely on someone else to do their job, but remember from 1) and 2) above that all these jobs must be done absolutely perfectly. I promise you, no matter how great a company is, not everyone there will write perfect code. Any given software engineer writes code that other people rely on and he has to rely on code written by other people. Consider Jim, who’s in a team of people writing the task that renders images when you double-click on an image file. Jim has to rely on code written by people working in the file system, code which takes something like a filename and gives him back the series of zeros and ones which he will eventually make into an image. If there’s anything wrong in the file system code, Jim’s code will not work. Jim’s code also relies on the code that makes a window with the little ‘x’ in the corner and file drop-down menu, and if there’s anything wrong there, Jim’s code will not work. And so on for other tasks which determine things like the monitor size, what kind of monitor it is, what the color scheme on the computer is, and so forth. And this is all before Jim even gets down to brass tacks. If those teams have failed, Jim is going to be behind schedule (and quite possibly harassed by upper management for being behind). After that, Jim has to figure out his part of the code – determining what kind of image file it is, then processing it, then displaying it. Once Jim’s written this code, it may be called into by other people – the file system folks may then again re-use his code to display a preview image, or another program may want to show an image in the same way and re-use Jim’s code to do that. And if those people find problems in Jim’s code (or if they try to use it in a way Jim didn’t anticipate), then their code will fail and Jim will have to fix what he did.
Every single one of these literally dozens of dependencies for something as simple as displaying an image on-screen is an opportunity for something to go wrong, for a bug to creep in, or for communication to fail between people and between teams. And if the product ships with any problem left unfound or unfixed, it is left for people who come along later trying to use the product as a start point for a bigger project to discover a work-around for the less-than-perfect product.

Issues 1 & 2 (and to some extent, 3) above are about programming anything – whether in a group or solo. Because perfection is required, fixing a problem in code – or as we say, fixing a bug – has a law of diminishing returns. Every time you try to fix an imperfect piece of code (and remember, it may be imperfect because something you are depending on is imperfect), you have some probability of introducing another imperfection, and possibly a devastating one. The larger and more incomprehensible a programming project becomes, the more difficult it is not to introduce a new bug. Although this is true for individual projects, it is especially true when more than a handful of people are working on the same product. This is why large-scale programming products begin limiting the number of fixes they will make before the product ships – because every time you “fix” something you have some probability (dependent upon the complexity of the code and the thoroughness of your engineers) of breaking something else.
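To put rough numbers on that, here is a small Python sketch. The model is my own simplifying assumption, not something from Brooks: every fix removes the bug it targets, but with some fixed probability it also introduces one new bug. The function names and the 20% figure are made up for illustration.

```python
import random

def expected_fix_attempts(bugs, p_new_bug):
    """Expected number of fixes needed if each fix removes one bug but,
    with probability p_new_bug, also introduces a new one (assumed model)."""
    if not 0 <= p_new_bug < 1:
        raise ValueError("p_new_bug must be in [0, 1)")
    return bugs / (1 - p_new_bug)

def simulate_fix_attempts(bugs, p_new_bug, rng):
    """Count the fix attempts in one simulated bug-fixing campaign."""
    attempts = 0
    while bugs > 0:
        attempts += 1
        bugs -= 1                      # the fix removes the bug we targeted
        if rng.random() < p_new_bug:
            bugs += 1                  # but sometimes it breaks something else
    return attempts

rng = random.Random(2010)              # fixed seed so runs are repeatable
trials = [simulate_fix_attempts(100, 0.2, rng) for _ in range(2000)]
average = sum(trials) / len(trials)

print(expected_fix_attempts(100, 0.2))  # analytic estimate
print(round(average, 1))                # simulated average, close to the estimate
```

Under this toy model, 100 known bugs with a one-in-five chance of collateral damage cost about 125 fixes, and the bill grows without bound as that chance approaches one. Real projects are messier, since the probability itself rises with the size and murkiness of the code.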

Issues 3 & 4 are specifically about large-scale team projects. Issue 3 – the difficulty of interpreting code – is why once you have a product, parts of it remain unchanged for very long periods of time, even if everyone recognizes that they are buggy or need to be changed. It is just too difficult to interpret exactly what something is doing and why it is there. And 4 simply exponentiates the problems of 1, 2, and 3, because every new dependency is an opportunity for a schedule to fall behind, communication or interpretation to break down, or for a bug to be introduced.

Although all these problems are, I think, part of the nature of software, they can be mitigated with good practices. I have not seen very many good practices put into practice, but in theory they could be. To avoid the problems of imperfection, rigorous testing can be demanded for every task in a program, on top of rigorously-defined functionality for each task. In most places I have been, a lot of code has been written before the programmer had a clear idea of what it was needed for. Although planning for the product as a whole is always undertaken, planning for each step and each piece is needed as well. Up-front planning is expensive, but in the end it will create better software, and make it easier to read code (if each piece has a rigorous definition). Likewise, testing is usually done from a high-level perspective, but if every task – every entry and exit point of every function – were tested for completion and correctness, this could cut down substantially on imperfections that creep into software. Again, the reason this is not done is that doing so is very time-expensive, but a failure to do so just increases end-of-cycle testing and the scope and number of bugs in a product. And the final, and I think one of the most significant issues – cross-dependencies on large-scale products – can only be gotten around by clearly defining interchangeable parts to a programming product. The industrial revolution turned on the concept of interchangeable parts – the firing piece of one musket was the same as another, because all the pieces that touched other bits of a musket were built to a particular specification. Computing has yet to catch up with this concept. I have yet to work on a project where low-level internal interfaces were clearly defined. On the level of the product as a whole, inputs and outputs to a program are clearly and rigorously defined.
However, inputs and outputs from one programmer’s code to another programmer’s code are not defined at all but rather vaguely and sloppily hashed out as we go along. This is why the guts of software often look to me like a plate of spaghetti; if there were a more clearly architected inside to a product, I think this would help tremendously with all of the problems of software – bugginess, late ship schedules, difficult maintenance, and so on.
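For what a clearly defined internal interface might look like, here is a minimal Python sketch built on the Jim example from earlier. Everything named here (`FileSystem`, `InMemoryFileSystem`, `detect_image_type`, the file names) is hypothetical and only illustrates the idea: the contract between the file-system team and Jim is written down, and Jim’s function is checked at its entry and exit points against a stand-in that obeys the same contract.

```python
from typing import Dict, Protocol

class FileSystem(Protocol):
    """Hypothetical contract between the file-system team and its callers."""

    def read_bytes(self, filename: str) -> bytes:
        """Return the file's full contents; raise FileNotFoundError if missing."""
        ...

class InMemoryFileSystem:
    """A stand-in that honors the contract, so Jim's code can be tested alone."""

    def __init__(self, files: Dict[str, bytes]) -> None:
        self._files = files

    def read_bytes(self, filename: str) -> bytes:
        if filename not in self._files:
            raise FileNotFoundError(filename)
        return self._files[filename]

def detect_image_type(fs: FileSystem, filename: str) -> str:
    """Entry: filename names a file in fs. Exit: 'png', 'jpeg', or 'unknown'."""
    data = fs.read_bytes(filename)
    if data.startswith(b"\x89PNG\r\n\x1a\n"):   # standard PNG signature
        return "png"
    if data.startswith(b"\xff\xd8\xff"):        # JPEG files begin with these bytes
        return "jpeg"
    return "unknown"

fs = InMemoryFileSystem({
    "photo.png": b"\x89PNG\r\n\x1a\n(rest of file)",
    "scan.jpg": b"\xff\xd8\xff(rest of file)",
    "notes.txt": b"hello",
})
print(detect_image_type(fs, "photo.png"))
print(detect_image_type(fs, "scan.jpg"))
print(detect_image_type(fs, "notes.txt"))
```

The point is not the image sniffing. Because the boundary is spelled out, either side can be swapped for another part built to the same specification, and a failure can be pinned on one side of the line.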

There is one final issue which exacerbates all the above problems, although it is not an issue of programming but of capitalism. Although I am attempting to make the case in the above, that it is much more difficult to make functional software than it is to make a functional building or a functional piece of hardware, in one sense software is much easier than any of these: software can be changed, and distributed, on the fly. Once you build a building, to modify it you typically have to shut it down, move people in, and spend days or weeks or even months retooling it. In software, it is a button on a keyboard that changes these. It is a few hours to recompile the program and then you can just update a released product with a patch online. Software is by its nature ephemeral. From a venture capitalist point of view, because software can be changed quickly, the investment input is minimal compared with other ventures. It’s because investment is small and turnaround time is quick that we saw things like the dot-com bubble. In many ways, software is a sort of venture capital wet dream. It’s cheap and changes fast. Everyone can get rich quickly (that’s the theory, if not the reality). This impulse toward capitalist ephemerality works against the necessity of software to be written perfectly. Perfection takes time, and when near-perfection can be done quickly to the siren-song of a million potential dollars, the time to make software air-tight, or even to perform well, is rarely taken. That will put you behind-market! And so we get buggy, better-than-nothing software offered up by the marketplace.

Welcome to software. I have no easily-implemented solutions to the above, and any solutions I do have conflict with the drive to market.

Why does Software Suck? Part I
Sunday, January 3, 2010 at 11:01 pm

Anyone who has used a computer for any length of time has seen it. The program suddenly loses data, it goes slowly for no reason at all, it freezes, your operating system crashes. If you are on Windows, this can be met with useful messages like “A fatal exception 0E has occurred at 002D:4C21000E” graced with a gentle blue background. Thank you, Windows. (Although newer versions try to avoid showing you the infamous blue screens of death.) On a Macintosh, OS X crashes by giving you a little translucent pane in gray with the words “You need to restart your computer” in four languages. Contrary to popular belief, crashes are not more pleasant with beveled edges. Thank you, Apple.

Why does this happen? The personal computer market started in the 1970s. It is now the year 2010. Why haven’t we had more progress in creating reliable systems over the past forty years? The short answer is that we have had progress – vast progress, think back to something even as recent as Windows 95 – but the progress has been slow and halting and there’s no time in the foreseeable future that we will have widely-available multi-purpose computers that do not crash, or that perform uniformly quickly and reliably. A little introduction to computer hardware and computer history is necessary to demonstrate why I believe this. So in part 1 I’m going to explain what all programs generally and the operating system specifically have to do to even get off the ground, and the historical reasons why the machinery we’re using is a mismatch for the tasks we are trying to do; and in part 2 I’m going to go through why programming anything at all correctly is somewhere between extremely difficult and impossible.

Computer hardware was originally designed, and continues to be designed, based on something called the Von Neumann architecture. The quick-and-dirty summary of the Von Neumann architecture is this: there is a piece of hardware which contains space for a set of instructions (we call this a program) which is then sent to a processor that executes all those instructions.* If the program needs to store any information, it can put this in memory (RAM, hard drive). This is how computers have worked since they first appeared, and all in all, it’s a pretty functional system. However, notice something: this system of hardware implicitly assumes you are only running one program at a time. There is space for one set of instructions to be run on one processor. Which works great until you want the machine to do more than one thing at a time – for example, use a text editor and download internet content, or play music and scan for viruses (or a million billion other common tasks).

But computers are fast – this was the whole point of them, performing complex and repetitive mathematical tasks quickly – and it is possible to execute many hundreds of thousands of these instructions sequentially at a blazingly fast rate. So to get more out of them, it would be nice to run multiple programs (in the architecture we’re discussing, these are instruction sets) at once, right? So to get around the single-program structure of the Von Neumann architecture, software engineers came up with something that is basically time-sharing.

Let’s assume you are rich. Maybe you are, I don’t know. Let’s further assume you own a summer home on the beach, where you and your wife (or husband) take the whole family for three months a year. The rest of the year, that real estate is just sitting there, unused but still costing you money. You come up with this great idea: let’s rent it out to other families during the rest of the year. That way it’s still getting used and we’re making up a little bit of the cost for it.

This is exactly how multitasking (executing multiple programs at once) works. The beach house is your computer’s processing hardware. You (and the other tenants) are the programs that run on it. To execute multiple programs on a piece of hardware that was fundamentally designed to run one program at once, we time-share. The process of switching tenants is called “task switching” – one program is taken off the processor and all its data and everything it’s doing is stored precisely in memory so it can come back on the processor later without knowing anything has happened at all. (Think of Han Solo frozen in carbonite.) Then another program is taken from memory and put on the processor and starts up. This happens many, many times a second.
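Here is a toy version of that time-sharing in Python, a sketch rather than how any real operating system is written. It assumes the simplest possible policy, round-robin, where each program gets one turn and then goes to the back of the line. The names `program` and `run_round_robin` are invented for the example, and Python generators stand in for frozen programs, since a generator remembers exactly where it left off.

```python
from collections import deque

def program(name, steps):
    """A toy 'program': each yield is a point where it can be taken off the processor."""
    for i in range(1, steps + 1):
        yield f"{name} step {i}"

def run_round_robin(programs):
    """Run each program one step at a time, rotating until all have finished."""
    ready = deque(programs)
    trace = []
    while ready:
        current = ready.popleft()        # move the next tenant into the house
        try:
            trace.append(next(current))  # let it run for one time slice
        except StopIteration:
            continue                     # program finished; do not reschedule it
        ready.append(current)            # freeze it and send it to the back of the line
    return trace

trace = run_round_robin([program("editor", 3), program("download", 2)])
for line in trace:
    print(line)
```

The editor and the download take alternating turns, and neither one contains any code that knows about the other. All of the juggling lives in the scheduler.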

So everything should be solved, right? Not quite. When you time-share a beach house (or computer), you are somewhat at the mercy of the other tenants. You could come back to your beach house and find it totally trashed. Blinds askew, furniture toppled, hairballs and cat fur everywhere. You could be stuck cleaning up a previous tenant’s mess. The same is true for programs that get plopped back on the processor, with one key difference: unlike you and me coming back to our beach house, the program doesn’t know that it has been away. It was just frozen in time, stored, and then restored. It has no way of knowing that someone else was using its house, and usually can’t tell that any time has passed at all. It can’t take a look around because it doesn’t know that anything has changed. So the program is going to continue as if nothing has happened, and if something has happened – if a piece of memory it had assumed was one thing was accidentally changed by another program, for example – well, that’s when you get strange behavior and program crashes.

This brings us up roughly to the Windows 95 era. This is when you would select Start > Shut Down and there would come up a screen saying “It is now safe to turn off your computer.” And everyone recommended that you restart your machine every day. Why? Well, your computer was only the one beach house, and after having all those tenants in it, it was impossible to assure everyone that the place was just like they expected it to be. So it was not uncommon for programs to tread on other programs’ toes, so to speak. Best to just reboot the whole thing so you know where everything is.

The computer operating system was originally a program that was designed to provide support to other programs – a kind of library of common operations. Do you need to draw something on the screen? Do you need to find out what the time and date are? Do you need to write letters to the screen and read from the keyboard? The operating system can help with that! The operating system would also help boot up your computer and allow you to navigate around the file system. As we moved more and more toward multitasking, there was another place the operating system could obviously help: keeping programs separate so they didn’t interfere with each other. And this is just what was developed. The system is called “virtual memory” – and while it’s not important to get into the nitty-gritty, it’s basically carving up the time-shared house into different rooms for each program to live in. Although a program has full control of the processor when it’s running on the processor, in order to access storage, it now has to go through the operating system – and what the operating system does is it lies. The program thinks it’s accessing one place, but the operating system actually keeps a separate copy of every location for every program so they can’t interfere with each other. In fact, there is no way they can access each other’s storage, even accidentally. The operating system is the tidy butler keeping every tenant separate so that no one else has to see their mess. And ideally, none of them will realize that anyone else is ever there.
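The lie can be sketched in a few lines of Python. This is a deliberately crude model with made-up names (`ToyOS`, `read`, `write`): real systems translate whole pages in hardware, while this one keeps a small lookup table per program and hands out one slot of “physical” memory per address. The only thing it is meant to show is two programs using the same address without ever touching each other’s data.

```python
class ToyOS:
    """Hypothetical toy operating system giving each program its own view of memory."""

    def __init__(self, num_slots):
        self.physical = [None] * num_slots      # the real, shared storage
        self.free_slots = list(range(num_slots))
        self.tables = {}                        # program name -> {address: physical slot}

    def _translate(self, prog, address):
        """Map a program's private address to a physical slot, allocating on first use."""
        table = self.tables.setdefault(prog, {})
        if address not in table:
            if not self.free_slots:
                raise MemoryError("out of physical memory")
            table[address] = self.free_slots.pop(0)
        return table[address]

    def write(self, prog, address, value):
        self.physical[self._translate(prog, address)] = value

    def read(self, prog, address):
        return self.physical[self._translate(prog, address)]

toy_os = ToyOS(num_slots=8)
toy_os.write("editor", 0, "draft text")
toy_os.write("music", 0, "song data")   # same address, different program

print(toy_os.read("editor", 0))  # prints "draft text"
print(toy_os.read("music", 0))   # prints "song data"
```

Both programs believe they own address 0. The operating system quietly sends them to different rooms, so neither can trash the other’s furniture.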

This seems really great, but where this opens Pandora’s box is when it comes to what computer programmers call “threading.” Threading is getting a single program to create several copies (or forks) of itself. Why on earth would you do this? As programs have become more complex, it has become obvious that not only do you want multiple programs to run simultaneously, but you want a single program to do different tasks simultaneously – like spell-checking and doing a word-count. It just speeds everything up! Thus, “threading.” Each thread usually has a different job to do (if you work in the corporate world, think of how many things Microsoft Outlook is doing at the same time – checking mail, checking your calendar, looking at a to-do list…). It’s not uncommon for a large program to be running dozens of threads. And remember, these threads are treated just like different programs by the operating system** – so they are taken on and off the processor dozens of times a second. If this seems like it could get complex very quickly, it does – it is easy to have threads lying around that aren’t doing anything, but are taking up time on the processor, or threads that are all waiting on each other to do something and never do anything themselves (thread deadlock). Threads make a conceptual mess very, very quickly. And when looking at how many different processes are having to be taken on and off your processor, threads add up just like programs. The overhead of having to freeze and store all of a program or thread’s information, and then bring another back from memory to start running adds up much more quickly with threads involved.
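A small Python sketch of two threads, using the standard `threading` module. The job names and the two locks are invented for the example. It shows the usual defence against deadlock, which is to have every thread take its locks in the same order, and it relies on the fact that threads of one program share the same memory.

```python
import threading

lock_a = threading.Lock()
lock_b = threading.Lock()
shared = {"mail_checked": 0, "calendar_checked": 0}   # visible to every thread

def worker(key, rounds):
    """Bump one shared counter, holding both locks while doing so."""
    for _ in range(rounds):
        # Both threads take lock_a first, then lock_b. If one thread took them
        # in the opposite order, each could end up holding one lock while
        # waiting forever for the other: a deadlock.
        with lock_a:
            with lock_b:
                shared[key] += 1

threads = [
    threading.Thread(target=worker, args=("mail_checked", 1000)),
    threading.Thread(target=worker, args=("calendar_checked", 1000)),
]
for t in threads:
    t.start()
for t in threads:
    t.join()                                           # wait for both to finish

print(shared)
```

Nothing in the language stops you from writing the second worker with the locks reversed. The rule of always taking them in the same order exists only in the programmers’ heads, which is exactly the kind of unwritten agreement that goes wrong on a large team.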

The supposed answer to all this is multiple processors, but these are a long way from being an ideal solution – or even a workable solution. To some extent, you can run multiple programs better with multiple processors. But the way these have been designed, they are still accessing the same memory, and the hardware infrastructure around the processors was designed for one, not two of them. So they cannot both access memory at the same time. One processor cannot talk to the other very easily, and so running multiple threads from one program across multiple processors is difficult. Currently the biggest advantage to having more than one processor is that you only have to do half as many task switches between threads/programs (or one-fourth, if you’ve shelled out a lot of money for one of the quad-cores). Fundamentally, we have taken two single-program processors and glued them to the same bit of memory.

So let’s summarize. The computer you are reading this on inherits its internal organs from a machine designed to run one program at a time. Presently it is running multiple programs at a time by taking them on and off its internal brain more quickly than you can perceive. Not only that, but within some of those programs, it is still taking different threads on and off its internal brain, all in the pursuit of the illusion of multitasking. These internals have not been substantially redesigned from the original single-program model; all these things are hacks and small, cumulative modifications to get around it. There’s enough space in all this to drive through truckloads of program crashes and system slow-downs. And this is all just the infrastructure your computer and operating system have to support to run anything useful on top of it. Although there is some hope of things looking up eventually with multiple processors, the way they are designed now does not significantly change this infrastructure.

These are some of the historical reasons we have what we have today – we are not using our computer architecture for what it was originally designed to do, and although we’ve gotten better at it, the more complex workarounds we make for the machine, and the more adjustments we slap onto it, the more likely there is to be some point at which one of them will fail, and the less likely it is that anyone will understand why or where the failure happened.

* These instructions can include conditional statements – this is how we create programs that do different things every time depending on input – and this input can be from a human interacting with a keyboard, from a file, an internal clock, a random number generator, whatever.

** With one exception: threads of a single program will all see the same memory space – that is, they are all given access to the same rooms in the beach house.

Why Series: Why Economics?
Sunday, November 8, 2009 at 3:40 pm

I have been wanting to start a new series of posts on this blog, a series that I have come, at least in my mind, to call Why. Why do things work the way they do? This is not an attempt to explain the mysteries of the world and the universe and existence, just to ask questions, and maybe to find some possible answers. To explore. If I could answer such questions with certitude, I’d either be certifiably insane or the supreme dictator of the universe. I’m clearly not the latter and I hope I’m not the former, so I’m looking at this as an exploration – a journey – rather than a destination. So these explorations will typically take the form of ‘Why does [some phenomenon] happen?’ or, the shorthand ‘Why [some phenomenon]?’

Why blog about this at all? It keeps me accountable to actually asking questions – questions I may otherwise avoid out of laziness or complacency – and doing diligence to find reasonable answers. And then, ideally, I could engage in lively conversation with you in the comments and we could all come away more enlightened. Although I have some ideas of the first few things I want to look into, including some that I happen to have some insight into (for example, Why software sucks – and no, it’s not because Microsoft is evil, my Maccy and Linuxy friends, or anything so simple as that), I’d like to take suggestions of what to look into. So if you have an idea, submit it in the comments or contact me.

Today is a rather light one: why economics? Not why does the economy work the way it does (clearly almost no one understands that or we wouldn’t've gone through the subprime-mortgage-induced credit crash), but what is economics and why does it exist in the first place?

Wikipedia, the world’s best source of eighty-percent accurate information, defines economics as “the social science that studies the production, distribution, and consumption of goods and services.” That’s a decent enough definition, and I’m willing to accept it with one caveat: that we define the term “goods” to include all scarce resources, real or socially-agreed upon. Let me unpack why I defined it this way. General “resources” so we are not limited only to manufactured goods, but we can include natural goods like beaches, gold, fresh water, and even (in a society with slavery) other human beings. “Scarce” so we can safely exclude goods which are, for present purposes if not in reality, unlimited (e.g., air or solar energy). “Real or socially-agreed upon” because this allows us to consider things like beaches and computers alongside patents (one socially-agreed upon “thing”), and sunlight rights (which are in fact a scarce commodity among the towering buildings of Tokyo). My definition may not be expansive enough, but I feel it’s a good start.

Depending on what terms your favorite science-y author likes to use, humans are hypersocial, supersocial, or ultrasocial creatures. I first came across this concept in Jonathan Haidt’s phenomenal book The Happiness Hypothesis, where Haidt looks at the science of social animals before looking at human sociality, and applying that to human happiness. Although a discussion of how animal ultrasociality works is far beyond what I want to look at here, suffice it to say humans are the only animals we know of that demonstrate sociality that extends beyond kin altruism (helping out other individuals that share a significant amount of genetic material). Humans have developed a complex series of reciprocity-based moral intuitions and tribalism to handle altruism beyond kinship, and the upshot is that we can band together and better survive as a group but still attempt to guarantee a benefit to the individual. And this also means that we live in a world formed not only (or even primarily) by our physical environment – grass and trees and apartments and grocery stores – but also in a world of complex social ties of reciprocity and altruism and betrayal and kinship and love. You and I are not cats or horses, who are concerned only with next-of-kin and finding food and copulating. We have these webs of social interactions which give rise to non-kinship relationships like friends and nations and the mafia and a thousand other things. The fact that these social webs exist, regardless of what evolutionary or other process created them, I regard as so obvious it doesn’t require defending. But here we are, and these things exist.

So if economics is the travel and distribution of goods, where do they travel? Obviously among these social webs. This distribution of goods exists in other animals too (a pack of African Dogs may “own” the meat of a kill), but at nowhere near the level of complexity as humans, because African Dogs do not have the same set of complex social interactions. Sometimes goods travel in one direction (e.g., through extortion or bribe or military conquest), but typically two entities come together and they both exchange something that the other entity wants. This is why economists say things like “economics is not a zero sum game” – usually, everyone gets something they want.

But however the details of economics play out in different societies and between societies, we have this thing called economics because we have scarce resources and we are ultrasocial beings. We don’t all simply hoard what we have and refuse to exchange goods with one another, and we can’t magically create everything that we want and so are limited by how much of a good exists. And so we engage in distribution and movement of goods, and everyone tries to benefit themselves and their social webs. Economics exists because of scarce goods and human sociality. These two things both give rise to economics and they are the rules of the game.