Why does Software Suck? Part I
Sunday, January 3, 2010 at 11:01 pm
Anyone who has used a computer for any length of time has seen it. The program suddenly loses data, it goes slowly for no reason at all, it freezes, your operating system crashes. If you are on Windows, this can be met with useful messages like “A fatal exception 0E has occurred at 002D:4C21000E” graced with a gentle blue background. Thank you, Windows. (Although newer versions try to avoid showing you the infamous blue screens of death.) On a Macintosh, OS X crashes by giving you a little translucent pane in gray with the words “You need to restart your computer” in four languages. Contrary to popular belief, crashes are not more pleasant with beveled edges. Thank you, Apple.
Why does this happen? The personal computer market started in the 1970s. It is now the year 2010. Why haven’t we had more progress in creating reliable systems over the past forty years? The short answer is that we have had progress – vast progress, think back to something even as recent as Windows 95 – but the progress has been slow and halting, and there’s no time in the foreseeable future when we will have widely-available multi-purpose computers that do not crash, or that perform uniformly quickly and reliably. A little introduction to computer hardware and computer history is necessary to demonstrate why I believe this. So in part 1 I’m going to explain what all programs generally and the operating system specifically have to do to even get off the ground, and the historical reasons why the machinery we’re using is a mismatch for the tasks we are trying to do; and in part 2 I’m going to go through why programming anything at all correctly is somewhere between extremely difficult and impossible.
Computer hardware was originally designed, and continues to be designed, based on something called the Von Neumann architecture. The quick-and-dirty summary of the Von Neumann architecture is this: there is a piece of hardware which contains space for a set of instructions (we call this a program) which is then sent to a processor that executes all those instructions.* If the program needs to store any information, it can put this in memory (RAM, hard drive). This is how computers have worked since they first appeared, and all in all, it’s a pretty functional system. However, notice something: this system of hardware implicitly assumes you are only running one program at a time. There is space for one set of instructions to be run on one processor. Which works great until you want the machine to do more than one thing at a time – for example, use a text editor and download internet content, or play music and scan for viruses (or a million billion other common tasks).
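To make the one-program-on-one-processor picture concrete, here is a minimal sketch of such a machine in Python. The four instructions (SET, ADD, JUMP_IF_LESS, HALT) are made up for this illustration and do not correspond to any real processor. Note that the machine has exactly one program counter, so it can only ever be partway through one program.

```python
def run(program, memory):
    """Execute a list of instructions one at a time, in order, until HALT."""
    pc = 0  # program counter: index of the next instruction to execute
    while True:
        op, *args = program[pc]
        pc += 1
        if op == "SET":             # SET addr value: store a constant
            memory[args[0]] = args[1]
        elif op == "ADD":           # ADD dst src: memory[dst] += memory[src]
            memory[args[0]] += memory[args[1]]
        elif op == "JUMP_IF_LESS":  # jump to target if memory[a] < memory[b]
            if memory[args[0]] < memory[args[1]]:
                pc = args[2]
        elif op == "HALT":
            return memory
        else:
            raise ValueError(f"unknown instruction: {op}")


# A tiny program that adds up the numbers 1 through 5 using a loop.
program = [
    ("SET", "total", 0),
    ("SET", "i", 1),
    ("SET", "one", 1),
    ("SET", "limit", 6),
    ("ADD", "total", "i"),              # index 4: start of the loop body
    ("ADD", "i", "one"),
    ("JUMP_IF_LESS", "i", "limit", 4),  # conditional: go around again while i < 6
    ("HALT",),
]

result = run(program, {})
print(result["total"])  # prints 15
```

The conditional jump is the kind of instruction the footnote describes: it is what lets the same list of instructions behave differently depending on the data it finds.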
But computers are fast – this was the whole point of them, performing complex and repetitive mathematical tasks quickly – and it is possible to execute many hundreds of thousands of these instructions sequentially at a blazingly fast rate. So to get more out of them, it would be nice to run multiple programs (in the architecture we’re discussing, these are sets of instructions) at once, right? So to get around the single-program structure of the Von Neumann architecture, software engineers came up with something that is basically time-sharing.
Let’s assume you are rich. Maybe you are, I don’t know. Let’s further assume you own a summer home on the beach where you and your wife (or husband) take the whole family for three months a year. The rest of the year, that real estate is just sitting there, unused but still costing you money. You come up with this great idea: let’s rent it out to other families during the rest of the year. That way it’s still getting used and we’re making up a little bit of the cost for it.
This is exactly how multitasking (executing multiple programs at once) works. The beach house is your computer’s processing hardware. You (and the other tenants) are the programs that run on it. To execute multiple programs on a piece of hardware that was fundamentally designed to run one program at a time, we time-share. The process of switching tenants is called “task switching” – one program is taken off the processor and all its data and everything it’s doing is stored precisely in memory so it can come back on the processor later without knowing anything has happened at all. (Think of Han Solo frozen in carbonite.) Then another program is taken from memory and put on the processor and starts up. This happens many, many times a second.
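Here is a rough sketch of time-sharing in Python, using generators to stand in for programs. The names (program, run_time_shared, "editor", "download") are invented for the example. One simplification to keep in mind: these toy programs politely hand the processor back at each yield, whereas a real operating system typically interrupts a running program on a timer whether it likes it or not.

```python
from collections import deque


def program(name, steps, log):
    """A stand-in 'program': does one step of work, then comes off the processor."""
    for step in range(1, steps + 1):
        log.append(f"{name} step {step}")
        yield  # task switch: local state (name, step) is frozen right here


def run_time_shared(programs):
    """Round-robin: give each program one turn, then send it to the back of the queue."""
    ready = deque(programs)
    while ready:
        current = ready.popleft()
        try:
            next(current)          # put it on the "processor" for one time slice
            ready.append(current)  # freeze it and queue it up for later
        except StopIteration:
            pass                   # the program finished, so the tenant moves out


log = []
run_time_shared([program("editor", 2, log), program("download", 3, log)])
print(log)
# ['editor step 1', 'download step 1', 'editor step 2',
#  'download step 2', 'download step 3']
```

Each program picks up exactly where it left off, with its own step counter intact, and neither one contains any code that knows about the other.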
So everything should be solved, right? Not quite. When you time-share a beach house (or computer), you are somewhat at the mercy of the other tenants. You could come back to your beach house and find it totally trashed. Blinds askew, furniture toppled, hairballs and cat fur everywhere. You could be stuck cleaning up a previous tenant’s mess. The same is true for programs that get plopped back on the processor, with one key difference: unlike you and me coming back to our beach house, the program doesn’t know that it has been away. It was just frozen in time, stored, and then restored. It has no way of knowing that someone else was using its house, and usually can’t tell that any time has passed at all. It can’t take a look around because it doesn’t know that anything has changed. So the program is going to continue as if nothing has happened, and if something has happened – if a piece of memory it had assumed was one thing was accidentally changed by another program, for example – well, that’s when you get strange behavior and program crashes.
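A small sketch of a trashed beach house, again in Python with generators standing in for programs. The two programs and the choice of address 3 are made up for the illustration; the point is only that both share one unprotected memory.

```python
def counter(memory):
    """Counts to 3 at address 3, assuming nobody else ever touches that address."""
    memory[3] = 0
    for _ in range(3):
        memory[3] += 1
        yield  # frozen here while some other program gets the processor


def scribbler(memory):
    """A buggy tenant that writes to address 3 by mistake."""
    memory[3] = 100
    yield


memory = [0] * 8  # one flat memory shared by everybody, with no protection
a = counter(memory)
b = scribbler(memory)

next(a)  # counter runs: address 3 now holds 1
next(b)  # task switch: scribbler overwrites address 3 with 100
next(a)  # counter resumes, none the wiser, and adds 1
next(a)  # and again

print(memory[3])  # prints 102, although counter "knows" it only counted to 3
```

Nothing in counter is wrong on its own. It simply carried on from where it was frozen, with no way of noticing that its memory had changed underneath it.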
This brings us up roughly to the Windows 95 era. This is when you would select Start > Shut Down and a screen would come up saying “It is now safe to turn off your computer.” And everyone recommended that you restart your machine every day. Why? Well, your computer was only the one beach house, and after having all those tenants in it, it was impossible to assure everyone that the place was just like they expected it to be. So it was not uncommon for programs to tread on other programs’ toes, so to speak. Best to just reboot the whole thing so you know where everything is.
The computer operating system was originally a program that was designed to provide support to other programs – a kind of library of common operations. Do you need to draw something on the screen? Do you need to find out what the time and date are? Do you need to write letters to the screen and read from the keyboard? The operating system can help with that! The operating system would also help boot up your computer and allow you to navigate around the file system. As we moved more and more toward multitasking, there was another place the operating system could obviously help with: keeping programs separate so they didn’t interfere with each other. And this is just what was developed. The system is called “virtual memory” – and while it’s not important to get into the nitty-gritty, it’s basically carving up the time-shared house into different rooms for each program to live in. Although a program has full control of the processor when it’s running on the processor, in order to access storage, it now has to go through the operating system – and what the operating system does is lie. The program thinks it’s accessing one place, but the operating system actually keeps a separate copy of every location for every program so they can’t interfere with each other. In fact, there is no way they can access each other’s storage, even accidentally. The operating system is the tidy butler keeping every tenant separate so that no one else has to see their mess. And ideally, none of them will realize that anyone else is ever there.
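The lie can be sketched in a few lines of Python. ToyOS and its methods are hypothetical names made up for this example, and the sketch is heavily simplified: real systems do the translation in hardware and in page-sized chunks rather than one address at a time. The idea is the same, though. Each program gets its own private table mapping the addresses it thinks it is using to real locations.

```python
class ToyOS:
    """Gives every program its own private storage behind the same addresses."""

    def __init__(self):
        self.physical = []     # the one real memory
        self.page_tables = {}  # program name -> {address it asked for: real index}

    def _translate(self, prog, vaddr):
        """Look up (or create) the real location behind a program's address."""
        table = self.page_tables.setdefault(prog, {})
        if vaddr not in table:
            self.physical.append(0)  # hand out a fresh real location
            table[vaddr] = len(self.physical) - 1
        return table[vaddr]

    def write(self, prog, vaddr, value):
        self.physical[self._translate(prog, vaddr)] = value

    def read(self, prog, vaddr):
        return self.physical[self._translate(prog, vaddr)]


toy_os = ToyOS()
toy_os.write("editor", 3, 42)   # editor thinks it owns address 3
toy_os.write("player", 3, 100)  # so does player

print(toy_os.read("editor", 3), toy_os.read("player", 3))  # prints 42 100
```

Both programs wrote to "address 3", yet neither disturbed the other, because the butler quietly put them in different rooms. A program has no way to even name another program's storage.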
This seems really great, but where this opens Pandora’s box is when it comes to what computer programmers call “threading.” Threading is getting a single program to create several copies (or forks) of itself. Why on earth would you do this? As programs have become more complex, it has become obvious that not only do you want multiple programs to run simultaneously, but you want a single program to do different tasks simultaneously – like spell-checking and doing a word-count. It just speeds everything up! Thus, “threading.” Each thread usually has a different job to do (if you work in the corporate world, think of how many things Microsoft Outlook is doing at the same time – checking mail, checking your calendar, looking at a to-do list…). It’s not uncommon for a large program to be running dozens of threads. And remember, these threads are treated just like different programs by the operating system** – so they are taken on and off the processor dozens of times a second. If this seems like it could get complex very quickly, it does – it is easy to have threads lying around that aren’t doing anything, but are taking up time on the processor, or threads that are all waiting on each other to do something and never do anything themselves (thread deadlock). Threads make a conceptual mess very, very quickly. And when looking at how many different processes are having to be taken on and off your processor, threads add up just like programs. The overhead of having to freeze and store all of a program or thread’s information, and then bring another back from memory to start running adds up much more quickly with threads involved.
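Thread deadlock is easy to reproduce. The sketch below uses Python's standard threading module; the worker function and lock names are invented for the example. Each thread grabs one lock and then waits for the lock the other thread is holding. In a real deadlock both would wait forever, so this version gives up after a short timeout, purely so that the example finishes.

```python
import threading

lock_a = threading.Lock()
lock_b = threading.Lock()
both_holding = threading.Barrier(2)  # wait until each thread holds its first lock
both_tried = threading.Barrier(2)    # wait until each thread has tried for its second
got_second = {}                      # thread name -> did it get its second lock?


def worker(name, first, second):
    with first:
        both_holding.wait()
        # A real deadlock would block here forever; the timeout is only
        # there so this sketch can end and report what happened.
        acquired = second.acquire(timeout=0.2)
        got_second[name] = acquired
        if acquired:
            second.release()
        both_tried.wait()  # keep holding the first lock until both have tried


t1 = threading.Thread(target=worker, args=("t1", lock_a, lock_b))
t2 = threading.Thread(target=worker, args=("t2", lock_b, lock_a))
t1.start()
t2.start()
t1.join()
t2.join()

print(got_second)  # both values are False: neither thread got its second lock
```

The two barriers are only there to make the bad timing happen every time. In real programs the same collision depends on exactly when the threads happen to be taken on and off the processor, which is why these bugs tend to show up rarely and unpredictably.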
The supposed answer to all this is multiple processors, but these are a long way from being an ideal solution – or even a workable solution. To some extent, you can run multiple programs better with multiple processors. But the way these have been designed, they are still accessing the same memory, and the hardware infrastructure around the processors was designed for one, not two of them. So they cannot both access memory at the same time. One processor cannot talk to the other very easily, and so running multiple threads from one program across multiple processors is difficult. Currently the biggest advantage to having more than one processor is that you only have to do half as many task switches between threads/programs (or one-fourth, if you’ve shelled out a lot of money for one of the quad-cores). Fundamentally, we have taken two single-program processors and glued them to the same bit of memory.
So let’s summarize. The computer you are reading this on inherits its internal organs from a machine designed to run one program at a time. Presently it is running multiple programs at a time by taking them on and off its internal brain more quickly than you can perceive. Not only that, but within some of those programs, it is still taking different threads on and off its internal brain, all in the pursuit of the illusion of multitasking. These internals have not been substantially redesigned from the original single-program model; all these things are hacks and small, cumulative modifications to get around it. There’s enough space in all this to drive through truckloads of program crashes and system slow-downs. And this is all just the infrastructure your computer and operating system have to support to run anything useful on top of it. Although there is some hope of things looking up eventually with multiple processors, the way they are designed now does not significantly change this infrastructure.
These are some of the historical reasons we have what we have today – we are not using our computer architecture for what it was originally designed to do, and although we’ve gotten better at it, the more complex workarounds we make for the machine, and the more adjustments we slap onto it, the more likely there is to be some point at which one of them will fail, and the less likely it is that anyone will understand why or where the failure happened.
* These instructions can include conditional statements – this is how we create programs that do different things every time depending on input – and this input can be from a human interacting with a keyboard, from a file, an internal clock, a random number generator, whatever.
** With one exception: threads of a single program will all see the same memory space – that is, they are all given access to the same rooms in the beach house.
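A minimal sketch of that exception, using Python's standard threading module (the count_words function and the shared dictionary are made up for the example): a thread writes into memory that the rest of the program can see directly, with no copying involved.

```python
import threading

shared = {"word_count": 0}  # lives in the program's one memory space


def count_words(text):
    """Runs in a separate thread but writes straight into the shared dictionary."""
    shared["word_count"] = len(text.split())


t = threading.Thread(target=count_words, args=("the quick brown fox",))
t.start()
t.join()

print(shared["word_count"])  # prints 4: the main thread sees the other thread's write
```

This is convenient, and it is also exactly the unprotected sharing that separate programs no longer have to worry about.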





















