farmdev

The Promise of the Cloud

As web developers we are faced with this problem: how do we scale up our code to handle high traffic? A lot of time and engineering goes into this problem -- time to simulate the traffic we expect, add servers to our cluster, cache heavy database access, and so on, all in anticipation of the load. Time is precious. This time could be spent improving the usefulness of our web product and creating interesting content. No one really congratulates you when a website works; they expect it to work.

When Google App Engine was released, the pitch was: "Run your web apps on Google's infrastructure. Easy to build, easy to maintain, easy to scale." As a web developer I was excited by this because it sounded like I could spend my time on the important thing: innovation! I started running some internal apps for an online radio station (CHIRP Radio) because the price was right (free) and we knew that eventually we'd have a lot of data, so infinite scalability was appealing. The apps do not get heavy traffic, but they are used nearly every second of the day by a live DJ in the studio since the station broadcasts 21 hours a day.

After one year of running these apps, here's the reality of what Google App Engine offers to web developers: a volatile environment that is capable of handling high traffic and storing lots of data, but one that requires custom code.

Nothing has been published about the hardware that each app runs on, but I have noticed from the logs that an app instance will typically start up and serve about 10 requests before another instance starts up somewhere else. Within each request, there are a lot of limitations on what an app can do. If it uses too much CPU it might die or time out. The biggest killer is app startup time, which is limited just like any other request (IMO, the limits should be more lax for app startup). Then there is the datastore. Even after its growing pains, we still see random datastore timeouts in our app at least once a day.
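In practice, the only defense against those random timeouts is to retry the query a few times before giving up. Here is a minimal sketch in plain Python of the kind of retry wrapper I mean; the `Timeout` class and `retry_on_timeout` helper are stand-ins of my own, not App Engine API (the real exception would be `db.Timeout` from `google.appengine.ext.db`):

```python
import time

class Timeout(Exception):
    """Stand-in for App Engine's db.Timeout exception."""

def retry_on_timeout(func, attempts=3, delay=0.1):
    """Call func(), retrying a few times when the datastore times out."""
    for attempt in range(attempts):
        try:
            return func()
        except Timeout:
            if attempt == attempts - 1:
                raise  # out of retries; let the request fail
            time.sleep(delay * (2 ** attempt))  # brief backoff before retrying
```

Wrapping every datastore call like this is exactly the kind of platform-specific custom code I'm talking about: it has nothing to do with your app's features, and you only discover you need it once the 500 pages start showing up.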

The Google App Engine status page runs checks that sample latency across the entire system. When those samples cross a threshold, an error is reported and acknowledged by Google. Occasionally there are refunds for paid accounts if the errors are substantial. However, latency in your app might not be enough to tip the status monitor's sampling threshold. When your app times out, the result is a 500 page for your users. This instability is unpredictable and thus hard to plan for and develop against. A page you built might run fine for a couple of days and then time out the next.

But it is possible to write better code to work with this volatile system. So what does that mean for web developers? We are back to spending time on optimization. This is not the cloud I was hoping for. I was hoping that Google App Engine would take on all the responsibility of making the servers scale if I delivered fully tested, working code. If someone could solve the problem of scalability, it would be a huge benefit to developers: it would let us spend our time dreaming up and implementing websites. Well, maybe App Engine will solve this problem over time.

Even with the need to micro-optimize, though, Google App Engine is still useful. The optimization is quite different from typical scaling. Instead of expanding a server or sharding a database, you just have to make the code leaner: use very little CPU, cache datastore queries as aggressively as possible, and so on. This will, admittedly, take less time and effort than scaling up a fully self-hosted infrastructure would.

Ultimately, I hope that future cloud providers understand the limited benefits of App Engine's volatile system. As it stands, the volatility makes App Engine feel more like shared hosting, which never worked well.