Thursday, October 16, 2008

eBay Does It, Why Can't We?

It occurs to me that as web applications have taken over the world over the past decade, we still have a heck of a lot to learn about managing them.

Complexity in the software industry has skyrocketed, and applications that were once versioned and released every six months are now being upgraded with new production code every couple of weeks. eBay is a prime example of this model. It is widely known that eBay has become so adept at addressing issues and deploying changes to its servers that every two weeks a whole new version of eBay is up and running in its data centers around the world.

Now, eBay is not a simple application, and downtime at eBay can cost millions of dollars, not to mention generate an angry mob of users, some of whom are desperately trying to buy the latest iPhone 3G and some of whom are trying to make their living. This is serious stuff. Downtime at eBay is front-page news.

It’s redundant to point out the dangers of our new Software-as-a-Service paradigm. However, the one-to-many relationship between application server and end-users does present a great many pitfalls that did not exist in Bill Gates’ world. When Microsoft Outlook crashes, you may be upset, say something not very nice, then restart it. When your application servers crash, money starts flowing straight out the back door as your customers all collectively say something not very nice about you.

So why aren’t more companies able to follow the eBay model? What does eBay know that they don’t?

The answer may lie not just in how they deploy new changes, but in how quickly they handle and resolve issues when they do occur. Is this truly one of the great remaining challenges in the realm of software? Some would say so, and they have good reasons to back them up.

Multi-tier applications represent some of the greatest levels of complexity ever seen in the software industry. With pieces of your application running on many heterogeneous, physically dispersed servers and environments, understanding what went wrong can be next to impossible. When issues occur, most often the only hope a team has is to attempt to recreate the same conditions that caused the error, and hope it happens again. That means the only path to the root cause of an issue is to rebuild the environment, re-populate the database, and generate the required load on the servers. Frequently, the pain of going through this effort is too great, and the issues lie dormant… until the next time something bad happens!
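To make that pain concrete, here is a minimal sketch of the brute-force approach in Java: hammer the application from many threads and hope the failure shows up again. The endpoint URL, thread count, and request count are purely hypothetical, not taken from any real system.

// A hypothetical brute-force reproduction attempt: generate load against a
// staging endpoint and watch for the error to resurface. The URL and the
// numbers are placeholders.
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class BruteForceRepro {
    public static void main(String[] args) {
        ExecutorService pool = Executors.newFixedThreadPool(50);
        for (int i = 0; i < 100000; i++) {
            pool.submit(new Runnable() {
                public void run() {
                    try {
                        URL url = new URL("http://staging.example.com/checkout");
                        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
                        int status = conn.getResponseCode();
                        if (status >= 500) {
                            // Maybe this is the same failure QA saw... maybe not.
                            System.err.println("Server error: " + status);
                        }
                        conn.disconnect();
                    } catch (Exception e) {
                        System.err.println("Request failed: " + e.getMessage());
                    }
                }
            });
        }
        pool.shutdown();
    }
}

Even when a run like this does trip the error, there is no guarantee it failed for the same reason, which is exactly why the approach is so painful.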

What the software industry is screaming out for is the ability to quickly capture, reproduce, and isolate issues as they occur. What we need is something like ‘TiVo™ for Software’.

One solution that has finally emerged from the chaos introduces the concept of recording and replaying software execution. This technology revolves around the core ability to not only record an application’s execution, but just as importantly, the complex environment in which the application ran.
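As a rough illustration (and not a description of any particular vendor's product), one way to capture that environment is to wrap every call to an external dependency so that each request and response is appended to a recording log. The InventoryService interface, the class names, and the tab-separated log format below are all hypothetical.

// Hypothetical recording wrapper: every call to the real backend is captured
// to a log file so the exact responses can be replayed later.
import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;

interface InventoryService {
    String lookup(String itemId) throws IOException;
}

class RecordingInventoryService implements InventoryService {
    private final InventoryService real;
    private final PrintWriter recording;

    RecordingInventoryService(InventoryService real, String logPath) throws IOException {
        this.real = real;
        this.recording = new PrintWriter(new FileWriter(logPath, true), true);
    }

    public String lookup(String itemId) throws IOException {
        String response = real.lookup(itemId);        // call the live dependency
        recording.println(itemId + "\t" + response);  // capture request -> response
        return response;
    }
}

The same idea extends to authentication servers, LDAP lookups, cache hits, and incoming end-user requests: anything non-deterministic is captured at the boundary where it enters the application.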

With this new ability, teams can dispense with a great deal of workflow that has traditionally been manual, iterative, and error-prone.

Imagine this common scenario: Your newly out-sourced team in India is handling QA for your complex, multi-tier application. They’re doing a great job and have found over 100 issues with your application. You’ve got your problem reports, log files, and the very large database datasets that your application was using when the bad things happened.

Next comes the fun part.

Now it’s your turn to bring up the same environment that your Indian team was running. I hope you’re using virtual servers! Finally, let’s take a shot at generating the same load on your application that existed when the problem occurred. Hopefully, the moons have aligned, and your fingers are crossed…

Now let’s fast-forward to 2008. Your Indian team is using your recording system. You arrive in the morning, log on to your defect tracking system, load the recording of an issue they found, and press ‘play’.

This time, every event that affected your application in that complex environment, including output from your authentication, LDAP, caching, and e-commerce servers, has been recorded and stored. Even the database and its dataset are no longer required. Most importantly, the end-user traffic that ultimately triggered the problem has been recorded as well. All of these elements are perfectly reproduced, allowing you to focus on the most important thing: what went wrong.
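Staying with the hypothetical sketch from earlier, the replay side might look like the class below: the recorded request/response pairs are loaded from the log and served back in place of the live servers, so the failing run can be re-executed deterministically with no original environment or database at all. It reuses the made-up InventoryService interface and log format from the recording sketch.

// Hypothetical replay stub: serves previously recorded responses instead of
// calling the live backend, using the same InventoryService interface as the
// recording sketch above.
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

class ReplayingInventoryService implements InventoryService {
    private final Map<String, String> recorded = new HashMap<String, String>();

    ReplayingInventoryService(String logPath) throws IOException {
        BufferedReader reader = new BufferedReader(new FileReader(logPath));
        try {
            String line;
            while ((line = reader.readLine()) != null) {
                String[] parts = line.split("\t", 2);
                if (parts.length == 2) {
                    recorded.put(parts[0], parts[1]);
                }
            }
        } finally {
            reader.close();
        }
    }

    public String lookup(String itemId) {
        // No live server, no database: just the answer the dependency gave
        // at the moment the failure was recorded.
        return recorded.get(itemId);
    }
}

Because the application code only sees the interface, swapping the recording wrapper for the replay stub requires no changes to the logic you are trying to debug.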

Anyone who has been involved in software development can relate to the age-old conundrum of trying to reproduce an issue that simply doesn’t appear to exist. At least not on your machine. Too many sleepless nights have been wasted chasing down phantom bugs. It’s time for the madness to stop.

The problems we’re facing are only getting more and more complex as new technologies are brought to market. This new software paradigm is here to stay. Luckily, I believe new technologies such as record and replay will help control the chaos.
