Ah the memories
Eric Lippert’s blog post about partial order sorting was, as always, interesting and well written. However, this specific post brought back memories of a project I did for Gateway Foods of Lacrosse Wisconsin back in the early 90’s. (I googled them and they don’t appear to no longer exist.)
Though my project was not as fun or easy to understand as Eric’s example, Gateway wanted to perform the "hot and new" business analysis called Activity-Based Costing, or ABC. Instead of just looking at which food items were most profitable, they looked at which customers were most profitable.
Aced-out by the SQL dude
They hired one of the big eight firms (remember when they were called "the big eight?) Booze-Allen-Hamilton to do the analysis and then went looking for a programmer. I heard about the project from my contact "George" for whom I had delivered a training course on Clipper earlier. I got excited, but really had no chance at it because the IT department didn’t believe in PCs, so they hired a SQL programmer to implement.
A few months went by and George called to tell me the project was going no where. A few more later George cakked to tell me the project was in real trouble. A few month more still, and I got a call from George asking if I could I start on the project that week in hopes of salvaging it (of course they had already spent most of the budget on the SQL guy, but what can you do?)
Be careful what you wish for
Later that week I arrived and met with the people from Booze, and they proceeded to hand me a stack of paper 10 inches thick that contained the allocation formulas! (I kid you not.) My job was to write a Clipper program to apply those formulas to a HUGE 300Mb database (remember, this was the early 90s and 300Mb was huge back then.) The allocation formulas were for things like "Allocate the cost of fuel to deliver to each customer based on the miles required to travel to customer. Sum the cost of fuel and multiply by the miles traveled for each customer divided by total miles traveled to all customers." And so on. Around 1800+ formulas in all.
For the next three days, my eyes glazed over as they tried to understand the formulas. I starting thinking "What the hell have I got myself into? There is no way I will ever be able to understand all of these formulas well enough to encode them into a program. No wonder the SQL guy failed. I am screwed too!" But finally, it started to dawn on me; "Don’t learn the formulas, learn the patterns." I’ve always been extremely good at pattern recognition so why didn’t I see it immediately? (which contrasts nicely with all the other things at which I’m extremely bad.)
Patterns; what a concept!
The initial database had number tables like fooditems, customers, and order. It turned out there were only three patterns:
- Scan a table and perform a calculation like sum or average and write the output to a one-record table. Examples included calculating total sales, average order size, etc.
- Scan a table and perform a calculation like sum or average for each fooditem, or customer, or similar, and write the output to a table that has one record for every fooditem, customer, or similar.
- Scan a table and perform a calculation like every record and write the output to a table with the same number of records as the scanned table.
That’s it. Three patterns. All 1800+ formulas boiled down to three patterns. SQL gurus know the following are examples of the above three patterns:
SELECT SUM(SubTotal) AS TotalPrice INTO OneRec FROM Orders
SELECT CustomerID, SUM(SubTotal) AS TotalOrdered INTO OnePerCustomer FROM Orders GROUP BY CustomerID
SELECT opc.CustomerID, opc.TotalOrdered/c.TotalDeliveryMiles AS SalesPerMile INTO CustomerFactors FROM Customer c INNER JOIN OnePerCustomer opc ON opc.CustomerID=c.ID
Technique triumphs over toolset
Ultimately I succeeded using Clipper where the SQL programmer had failed. He didn’t fail because he used SQL and my success was not because I used Clipper. He failed because he tried to hand-code all 1800+ formulas and I succeeded because I wrote a Clipper program to handle the three patterns, not the 1800+ formulas. In hindsight given the tools available at the time, the best solution would probably have used Turbo Pascal and SQL, but such is life.
The Gateway people entered the allocation formulas from Booze into a grid in my program, my program then determined which formulas to calculate and in which order, and then program executed those formulas. I used Clipper’s "macro" capability to allow me to execute the text formulas much like how Execute() can be used in VBScript. However, VB.NET programmer could do the same thing today by writing a program that generates SQL and then passing that SQL to ADO.NET to execute.
Back to Eric’s Post
The "which formulas in which order" was the part I was reminded of by Eric’s post. It actually took me three weeks to write that first program. 12 hours a day. 7 days a week. In the dead of winter. In Lacrosse Wisconsin. Brrrrr. (I’m from the south in Atlanta and I am a wuss when it comes to frigid weather!) Today I could probably write that same program in VB.NET and SQL in an afternoon. But then hardware is a lot faster now.
Why three weeks? My program took almost two days to apply the 1800+ formulas to the 300Mb database! Ever try to debug a program where you have to step through code for 24 hours before you get to see the next bug? Not very productive. It was here I actually learned the tremendous value of creating an "intermediate representation." HTML is a perfect example: one program can generate HTML and another can display it. SQL, XML, and MSIL are also immediate representations. One of the benefits of an intermediate representation is you can validate output without having to observe behavior. In essense intermediate representations provide much the same benefits as state machines.
I split my program into two modules; one that compiled the formulas, and a second that executed the compiler’s output. My "compiler" took only about 2 hours to run which allowed me to debug my program in this lifetime. It still took two days to execute the formulas, but at least those formulas executed correctly!
Just when I thought I was out…
I went home victorious. Or so I thought. A few weeks later I get a call. (They tracked me down while I was on vacation, no less!) The Gateway people decided they wanted to modify the formulas and run "what-ifs." After each "what-if" it took them two days to get an answer. "Couldn’t I make this thing any faster?" (Imagine if the SQL guy had succeeded in encoding all those formulas into a SQL script? He would have had a job for life! A boring-as-hell job, but job nonetheless. Until Gateway realized all the money they were wasting and then threw him out on his ear.)
My original program, except for the compiler part, was almost exactly what Eric Lippert blogged about a few days ago. I was damn proud I wrote it. However, optimizing it was much, much harder. Turns out what was happening was each pass on each of the tables took a really, really long time (we knew that), but also there were formulas that could be performed on the same pass, but I doing a complete pass for each formula.
Well, to finally cut a long story short, another three weeks and I was able to implement an optimizer which batched together formulas that could run at the same. That version took only about 6 hours to run. Good enough, and that was the last of Gateway foods for me (I never was very good at milking clients; guess that’s why never made much money while in consulting.)
So, back to Eric’s example. Let’s assume it took one minute for each seperate Eric put on, but if he was able to put on multiple items "at the same time", then each batch of items would only takes him 1.5 minutes to put on. His original program’s output, which dressed him in this manner:
tophat : [ ] shirt : [ ] bowtie : ["ashirt"a] socks : [ ] vest : ["ashirt"a] pocketwatch : ["avest"a] underpants : [ ] trousers : ["aunderpants"a] shoes : ["atrousers"a, "asocks"a] cufflinks : ["ashirt"a] gloves : [ ] tailcoat : ["avest"a]
This would take Eric 12 minutes to dress. However, if he batched items that could be done at the same time (i.e. items that had no prior dependencies), Eric could dress in 4.5 minutes:
tophat : [ ] socks : [ ] shirt : [ ] underpants : [ ] gloves : [ ] then trousers : ["aunderpants"a] bowtie : ["ashirt"a] vest : ["ashirt"a] cufflinks : ["ashirt"a] then pocketwatch : ["avest"a] shoes : ["atrousers"a, "asocks"a] tailcoat : ["avest"a]
So here’s my challenge Eric (and I hope you don’t hate me for it): Implement an optimizer that will generate batched output where items in each batch have no interrelated dependencies, as shown in the last example above. It was a bitch to write in Clipper; how about in JScript?