by Jay Johansen | Apr 2, 2008
I participate in forums for programmers now and then. And I've often noticed that someone will mention how long or difficult a certain project has been, and then another person will jump in and write something to the effect of, "Why in the world is it taking you months to do that? That's a simple problem. I could write a program to do that in a week! All you have to do is ...", followed by a couple of sentences of a simplistic solution.
I just saw two examples of this in the last few days.
One was a discussion about electronic voting machine software, with several posters discussing how you could make sure it was accurate and honest. Then someone posted one of these "I could write that in a week" responses. What's the big deal? he asked. All you have to do is read in the list of votes and count how many for each candidate. Trivial.
The second was about a system to track contributions to some non-profit organization. The original poster complained about how bug-ridden the software he had to maintain was. Then someone posted an "I could write that in a week" response, basically suggesting the programmer simply throw away the bug-infested code and start over.
It is, I suppose, possible that the people who make these posts are geniuses who really could do in one week what the most skilled people I know would take months to accomplish. But I suspect that they are students or recent graduates, and they think that real-world problems are just like their school assignments.
Because you see, what makes most real-world programming problems complex is not the "core problem". It is all the messy details and special cases.
For the donation tracking system, the "one week" guy said that all you need is a database with two tables: one for donor with name and address, and another for the donation with date, amount and credit card number. A couple of simple screens and you're done.
For a school project, that might well be all that is required. But if I were assigned this as a real-world project, there are dozens of questions that would immediately come to my mind. Like, what about contributions by check or bank draft? Do we send donors an acknowledgement letter? If so, where does the text of this letter come from, and is it the same for everyone or does it vary? What reports do we have to produce? Surely at a minimum we want to know the total amount raised. Maybe we need reports showing contributions by various categories of donors. If so, what are those categories and what information do we need to categorize people? Does the donation system have to interface to our general accounting system? Do we send people year-end contribution summaries for income tax purposes? What do those look like? Etc., etc.
In real life, just getting answers to basic questions like these would surely take weeks or months. And in real life, the programmers often don't know what questions to ask, and the users never seem to volunteer what later turns out to be vital information. On a system like this, I would fully expect that when we did our first user testing -- or worse, when we went into production -- the users would suddenly say things like, "Hey, wait, I entered a bunch of contributions with a donor name of 'Anonymous' and this crazy system added them all together." "Well, yeah," the programmer replies, "You just created one donor named 'Anonymous' and then you posted all those contributions against that same donor." "But the system should know that 'Anonymous' is special," the user says, incredulous that the stupid programmers didn't provide for this obvious fact. Not that exact problem, of course, but some little detail or special case like that. Or rather, a hundred little details or special cases like that.
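To make the 'Anonymous' case concrete, here is a minimal sketch in Python of what one such special case might look like once somebody finally tells you about it. Everything in it is my own assumption for illustration: the DonorBook class, the idea of matching donors by name, and the rule that every anonymous gift gets its own donor record. A real system would have to ask the users what they actually want.

```python
import itertools

ANONYMOUS = "anonymous"  # assumed sentinel name; the real rule would come from the users


class DonorBook:
    """Hypothetical in-memory donor ledger, for illustration only."""

    def __init__(self):
        self._ids = itertools.count(1)
        self._id_by_name = {}
        self.donations = []  # list of (donor_id, amount)

    def donor_id_for(self, name):
        key = name.strip().lower()
        if key == ANONYMOUS:
            # Special case: every anonymous gift gets a fresh donor id,
            # so anonymous gifts are never added together.
            return next(self._ids)
        if key not in self._id_by_name:
            self._id_by_name[key] = next(self._ids)
        return self._id_by_name[key]

    def record(self, name, amount):
        donor_id = self.donor_id_for(name)
        self.donations.append((donor_id, amount))
        return donor_id

    def totals(self):
        """Total amount given per donor id."""
        totals = {}
        for donor_id, amount in self.donations:
            totals[donor_id] = totals.get(donor_id, 0) + amount
        return totals


book = DonorBook()
first = book.record("Anonymous", 50)
second = book.record("Anonymous", 25)
pat = book.record("Pat Lee", 10)
book.record("pat lee", 5)
```

Note that the "fix" is three lines, but nobody could have written those three lines until a user mentioned the rule. And matching named donors by name alone is itself a simplification that a real system probably could not get away with.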
Or in the case of the voting machine: In real life, a voting machine does not just have to handle a simple flat file with a list of votes for candidates for a single office. There are surely many offices being voted on in any given election. Some of these offices apply to all voters, but others apply only to voters in certain precincts. Like, the people in Detroit don't get to vote for the mayor of Ann Arbor or vice versa. In most races a voter can only vote for one candidate. But in races for school boards or city councils where a number of representatives are elected at large, a voter may be able to vote for several candidates simultaneously. One would hope that the machines provide a way for voters to go back and correct a mistake. When I lived in New York there were routinely more parties than candidates. We had lots of small parties who would only run their own candidates for local offices in areas where they had significant support; for the big offices like governor and president, they would just endorse a big party candidate. So if Democratic Party candidate Joe Smith got 5 million votes and Liberal Party candidate Joe Smith got 1 million votes and Free Love Party candidate Joe Smith got 100 thousand votes, that was counted as 6.1 million votes for Joe Smith for purposes of deciding who wins, but they also had to report all the totals by party so each party could claim their share of a victory. I don't know if any other state has such a system. Does your voting software handle that?
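The New York case alone is enough to break the "just count the votes per candidate" design, because one count is no longer enough. As a rough sketch in Python, using the Joe Smith numbers from above, the tally has to keep two sets of totals at once. The tally function and the tuple layout are my own invention for illustration, not how any real voting system stores its data.

```python
from collections import defaultdict


def tally(ballot_lines):
    """Sum votes two ways: per candidate, and per (candidate, party) ballot line.

    ballot_lines is an iterable of (candidate, party, votes) tuples.
    """
    by_candidate = defaultdict(int)
    by_party_line = defaultdict(int)
    for candidate, party, votes in ballot_lines:
        by_candidate[candidate] += votes               # decides who wins
        by_party_line[(candidate, party)] += votes     # reported per party
    return dict(by_candidate), dict(by_party_line)


results = [
    ("Joe Smith", "Democratic", 5_000_000),
    ("Joe Smith", "Liberal", 1_000_000),
    ("Joe Smith", "Free Love", 100_000),
]
by_candidate, by_party_line = tally(results)
print(by_candidate["Joe Smith"])  # 6100000
```

And this still handles only one race, with no precincts, no multi-seat races, and no way for a voter to correct a mistake.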
Not to mention, I'm sure somewhere in there will be buried a requirement like: in races for governor and state legislature the Republicans are coded as "R" and the Democrats as "D", but for county commissioners Republicans are "1" and Democrats are "2", except in this one city where Republicans are "D" and Democrats are "R".
You think I'm making this up? Just today I was working on a program to scan barcodes attached to products in our warehouse and match them against a database. I didn't even have to worry about the scanning part for this effort: I picked it up from where we have the string of letters and digits that resulted from the scan. Trivial, right? I read in the stock number and look it up in the database, and either we find a match or we don't.
Except ... except that there are three different kinds of codes that could be on those labels. I had to look at the scanned value and figure out which of the three types it was based on the format, like the position of hyphens and whether it's all digits or mixed digits and letters. Oh, except that sometimes we get labels where they left out the hyphens. And oh yeah, the one code is 16 digits, which sounds simple, except that there are still many boxes in the warehouse with the old format that was only 15 digits, and those I have to pad with a zero. Two of the formats I can match against corresponding fields in the database. A third format I have to convert to one of the other two according to a certain formula. Oh, and after I submitted my first draft they came back and said that in that last case I have to try to match against the database twice, first by the converted value and then against another field by the original value. And ... but you get the idea. I considered that a pretty easy problem as real-world problems go.
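Just to show how quickly "look it up in the database" grows, here is a rough Python sketch of only the part I described. The details are made up for illustration: the three format names, the rule for telling them apart once the hyphens are gone, padding the 15-digit codes on the left, the two lookup tables, and the convert function standing in for the formula are all assumptions, not our actual rules.

```python
def classify(scanned):
    """Return (kind, normalized_value) for a scanned label string."""
    # Labels sometimes arrive without hyphens, so drop them before deciding.
    value = scanned.strip().upper().replace("-", "")
    if value.isdigit() and len(value) == 16:
        return "long_numeric", value
    if value.isdigit() and len(value) == 15:
        # Old 15-digit labels: pad with a zero (on the left, by assumption).
        return "long_numeric", "0" + value
    if value.isdigit():
        return "short_numeric", value
    if value.isalnum():
        return "alphanumeric", value
    raise ValueError(f"unrecognized label: {scanned!r}")


def find(scanned, by_stock_number, by_alt_code, convert):
    """Look up a scanned label; returns the matching record or None.

    by_stock_number and by_alt_code are dicts standing in for two database
    fields; convert is a caller-supplied stand-in for the conversion formula.
    """
    kind, value = classify(scanned)
    if kind == "long_numeric":
        return by_stock_number.get(value)
    if kind == "alphanumeric":
        return by_alt_code.get(value)
    # Third format: try the converted value first, then the original value
    # against the other field.
    record = by_stock_number.get(convert(value))
    if record is None:
        record = by_alt_code.get(value)
    return record
```

That is already a handful of branches for a problem that was supposed to be one line, and it does not yet cover whatever came after the "And ..." above.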
In school, the instructor creates a problem that is cleanly defined, with few special cases and exceptions or none at all. This makes perfect sense, because the point of the exercise is to make sure that the student learns how to manage a hash table or how the SQL group-by clause works or whatever the current lesson is. We don't want to get bogged down in a bunch of irrelevant details. We want a problem that the student can solve in a week or two and that the instructor can evaluate and grade in a few minutes.
But apparently this is leaving many students with the seriously wrong impression that this is what real-world problems are like.
© 2008 by Jay Johansen