Friday, April 18, 2008

Software Performance and Scale

My understanding of software performance has come a long way since i started programming back in high school. I built my very first program using Basic, on a Sinclair ZX Spectrum+. At the time, I did not realize the scale issue with my program or the flaw in my logic, but that dint stop me. 22 years later, i am designing a system for processing twenty to a few hundred peta bytes of data every month using ten to hundred thousand machines, each with 8 Gigs of RAM, 1 Tera byte of storage and most of it with 4 cpu, some with 8. This time around, a mistake can cost me my career, not to mention millions of dollars in tangible loss to my employer. On the other hand, a performant and scalable design will mean millions of dollars saved - both in hardware and power consumption.

In this blog I use different real world performance and scale problems I faced in my Programming and Engineering career to illustrate different approaches in building super scale systems. This will be an interesting read for Computer Science students, Programmers, Engineers and Software Architects.

My first program : A lesson on how not to program
Introduction to scale and performance

As a high school student, I was fascinated by maths and numbers. I can't explain why, but for some reason I found myself summing up consecutive odd numbers (1+3+5+7... ). I would be concerned if my kids did this in their spare time, I turned out to be OK, i think... To my surprise, they all turned out to be perfect squares [1, 4 (1+3), 9(1+3+5), 16(1+3+5+7),..].

I started writing down the sum for larger numbers to verify my theory. I wasted a lot of time doing additions and multiplications using pen and paper, and it progressively became harder and harder. This is hard since at each step, i have to do a multiplication to prove that the sum really is the next square. As an example, say the last number i added was 7, and got 16, which is 4x4. The next step will be add 9 (the odd number following 7) to 16 (the last sum) and i get 25. To check if 25 is a perfect square, i find 5x5 (5x5 follows 4x4 from previous step). This is equivalent to checking if the square root of 25 is 5. I dint know how to find square roots, but finding squares was sufficient to validate the theory.

It is important to understand the scale limits of this method, as this theme is common when designing computer programs and algorithms. The amount of time it takes to do one complete step (Last odd number + 2, then add the result to the current sum; Add 1 to the last square root, then square it; Compare results and make sure are they are same) is not the same as the numbers become larger. The time to find the sum operations are about the same between steps, however the time to do multiplication becomes longer and longer as numbers become bigger. Using a calculator, it still takes longer since there are more numbers to be keyed in and written down on paper. Though I realized there is a scale issue with this method, I did not carefully analyze it. Let me do that now.

Using a calculator, it takes me 30 seconds to do one step with two digit number as the square root. I had to press 15 keys (88x88= 7569+175=), then write down 9 digits (88 7744 175). It takes me 40 seconds to complete the same steps for a three digit number (20 keys, 13 digits). The point to note here is that as numbers become longer, there are more keys to press on the calculator, there are more digits to be written down on paper and there are more numbers to compare to verify the result. Now for the interesting part. Assuming 1 minute per step (which is very optimistic) and doing this for 8 hours a day for 1 year on all working days, i could do only 125 thousand steps (260*8*60). So that's a lot of manual hours, and still not getting very far.

Back to my school days : I wrote down the sum and squares for about 50 numbers before I realized the scale limits of this method. I was not sure if the odd number sums will work for ever. I also lacked the advanced math skill to prove it, which I got to learn about 3 years later in college. When i did prove it, later in college, i realized that this - sum of consecutive odd numbers starting at 1 was a perfect square - was nothing new to mathematicians. [Hint: (x + 1) square]

Around the time i was preoccupied with Maths and Numbers, my dad got me a Sinclair ZX Spectrum+ home computer. I joined a basic computer programming class for kids and started learning how to program. Programming, I realized was a practical solution to the scale issue i had with 'calculator, pen and paper', which lead me to write my first real program in basic. The program repeated the steps i did on paper, and I saw it validate my theory up to a large number. I was excited to see that my theory - sum of consecutive odd numbers starting at 1 was a perfect square - held for larger numbers.

There are two fundamental flaws in my method. In all the excitement of learning to use a new tool - programming - and finding that my theory held for large numbers, i dint really think much about these flaws.

  • I replaced one scale issue with another. The initial scale issue was the slow progress I made using pen and paper. I solved this using a computer program. However, my computer program has a scale limitation - it has finite amount of memory and computing power. I stopped the program once I saw it work for large numbers. Had I let it continue, at some point my program would have stopped working correctly due to arithmetic overflow. With 16 bit integers on that machine, the largest number it processed was around 65,000. I still have a scale limit - one that is much larger than the 2,500 (50x50) using pen and paper, nevertheless, it is a limit.
  • The second flaw is logical. No matter how large a number i prove, it does not prove the theory - there is a finite chance that this was a coincidence. Writing a mathematical proof is the only way to verify the theory. This program does not solve the real problem - prove the theory.

Looking back on my first program, i realize now that i did two things well.

  • The program was very efficient. I started off computing manually by paper, and as it turns out, i selected the steps that minimized my manual work. Coincidentally? computers found my steps to be more efficient, compared to other ways of doing similar steps. In particular, I used multiplication instead of division and remember the previous sum and square to obtain values for the next step. Computers take lot more time and energy to perform division, compared to multiplication. Remembering the values used for last operation reduced the amount of time it took to run - Order(n) - and also saved on memory.
  • I recognized a scale issue early on and used a new tool / technology effectively.

Both of these were valuable lessons.

The first point - try it out manually before writing a program - has it's own practical scale limits with large programs. A more practical approach is to write down the design / algorithm. This helps in three ways. As we document designs, we tend to make it complete and by doing that, discover blind spots. The second benefit is peer and partner design reviews. This is very valuable with large teams with complex dependencies. It gives Peers and Partners the opportunity to find issues with designs early on. The third benefit is maintenance and sustenance. Documented designs prolong the life of a product by enabling enhancements as opposed to re-writes.

The second point applies directly to things i do today. Scale issues in large system need to be identified at Architecture/Design time. Always be on the lookout for new tools / algorithms to tackle scale issues.