I came across this article today on Coding Horror about how Google has a monopoly on search engines and how something must be done about it. I'm not one who falls into the "Google Is Evil" camp; I actually think they are a benevolent force in the world :) However, as with any monopoly, the lack of competition stifles progress. And when I think about the state of today's technology, I can't help but wonder why Google has not fixed the most fatal flaw in their Googlebot:
It does not behave like a web browser.
Search engines are made for people, and the majority of people browse the Internet with a web browser. The first comment on the article is a cry for help: "What can we do?" I have an answer to that question. And you can take my answer and turn it into a business plan and climb the golden staircase to success. Any smart investor would be begging you to take their money. Google generated $5.37 billion in Q2 of 2008 and their flagship product doesn't even work! In fact, I'm going to give this to you all for free; all I ask is that you visit me one day and say thanks. Are you ready?
If you build what I am about to propose, Google would soil their pants. You would invoke the mighty forces of the free market, and perhaps Google would fix their own Googlebot. That is, if they don't buy yours first.
Before I get into the technical details, let's consider the problem. Here is a little hobby site I built for browsing and listening to records on eBay: http://aintjustsoul.net/ As you click around you'll notice it uses lots of sexy Ajax to load images and play sounds. This is good for users with web browsers but not good for Googlebots. In fact, before I added static representations of content, you could not find my website by searching Google. As a webmaster (I love that word), I should not have to produce a static, non-sneaky version of my site just for Google. Humans can already use it, right?! Whether Google wants to admit it or not, Ajax-enabled websites are the future of the Internet. There are just so many usability issues that can be solved with Ajax. Gmail is the best example of how Ajax improves user experience.
If Google doesn't learn how to crawl the web like a real person IT WILL FAIL.
Here is my recipe for a browser-like search indexer. I'll sheepishly point out that I gave myself two hours to build a prototype of this and I failed. However, I am confident that someone with more experience fighting cross-domain browser limitations could build one in two hours or less! That is your challenge. Digg this, slashdot it, do whatever it takes. This is how you can help.
Ingredients:
- Indexer: A server that accepts a POST with three parameters: url, link_clicked, and text. This service saves data for the search index (a rough sketch of what it might look like follows this list). The link_clicked would be the text of a link that might have been clicked while at the given URL (the problem with URLs that do not change is that there is no way to send a person back to the page from a search engine; however, people use anchor-based navigation to work around this).
- Crawler: An HTML file that you load in a web browser and give a URL to start at. It loads the page, posts the text to the Indexer, then clicks each link, posting a "snapshot" after each click.
- Database: A very big database. I'd suggest the Amazon Simple Storage Service (S3).
- Grid: A way to run many web browsers in parallel, like, at least 1,000 at once. The Web is big but don't let that intimidate you! I'd suggest the Amazon Elastic Compute Cloud (EC2) and taking a look at setting up Selenium Grid on EC2 for ideas on how to automate web browsers. The Windmill project may also be useful. The Saucelabs Selenium service might even be great for this.
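To make the Indexer ingredient concrete, here is a minimal sketch of that POST endpoint. Node.js, the port, and the in-memory array are my own assumptions for illustration; a real version would write each snapshot out to the big database (S3).

// Minimal Indexer sketch (assumes Node.js; nothing here prescribes a stack).
// Accepts a form-encoded POST with url, link_clicked, and text, and keeps it
// in memory. A real indexer would persist each snapshot to S3 instead.
var http = require('http');
var querystring = require('querystring');

var snapshots = []; // stand-in for the real index storage

http.createServer(function (req, res) {
  if (req.method === 'POST' && req.url === '/path/to/indexer') {
    var body = '';
    req.on('data', function (chunk) { body += chunk; });
    req.on('end', function () {
      var params = querystring.parse(body);
      snapshots.push({
        url: params.url,                   // includes the #anchor, if any
        link_clicked: params.link_clicked, // text of the link that was clicked
        text: params.text                  // visible text of the page
      });
      res.writeHead(200, {'Content-Type': 'text/plain'});
      res.end('indexed');
    });
  } else {
    res.writeHead(404);
    res.end();
  }
}).listen(8000);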
There you have it. Using these ingredients, I cannot see any technical limitations to building a search engine indexer that behaves like a real web browser. The Crawler is a little complicated so I'll point out some approaches. Conceptually, you want to do something like this (a JavaScript example using jQuery):
$(document).ready(function() {
  load_url("http://aintjustsoul.net/");
  take_snapshot("http://aintjustsoul.net/");
});

function take_snapshot(url, link_clicked) {
  // Save the text:
  $.post("/path/to/indexer", {
    url: window.location.href, // includes the hash #
    link_clicked: link_clicked && link_clicked.text(),
    text: $("body").text()
  });
  // For every <a> tag (a link), click it and take another snapshot.
  // Note that this query will probably need to be done on an iframe (see below).
  $("a").each(function() {
    $(this).click();
    take_snapshot(url, $(this));
  });
}

function load_url(url) {
  // FIXME: cross-domain compatible Ajax load (see below)
}
This code obviously will not work as is, mainly because cross-domain security restrictions will force you to load the target page into an iframe or use some similar approach. But you get the idea. There are several solutions to the cross-domain issues; one is detailed here using iframes, and the dojox.io module solves it like this:
dojox.io.xhrPlugins.addCrossSiteXhr("http://aintjustsoul.net/");
dojo.xhrGet({url: "http://aintjustsoul.net/", ...});
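To sketch the iframe variant of take_snapshot(), here is roughly how the queries could be scoped to a document loaded in an iframe. The frame id and the assumption that the frame's content is same-origin with the crawler page (for example, because it was fetched through the proxy described next) are mine, not part of either library.

// Rough sketch: snapshot the document inside an iframe with id="crawler_frame".
// This only works when the iframe content is same-origin with the crawler page
// (e.g. served through a local proxy); otherwise the browser blocks access.
function take_iframe_snapshot(url, link_clicked) {
  var doc = document.getElementById("crawler_frame").contentWindow.document;

  $.post("/path/to/indexer", {
    url: url,
    link_clicked: link_clicked && link_clicked.text(),
    text: $("body", doc).text() // the second argument scopes the query to the iframe
  });

  // Click each link inside the iframe and snapshot again.
  $("a", doc).each(function() {
    $(this).click();
    take_iframe_snapshot(url, $(this));
  });
}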
If you want to be boring you could make your own proxy server in Python (or whatever) that loads URLs locally and passes through the content. (It would be slightly more exciting if the proxy had a bacon feature, like http://bacolicio.us/.) The take_snapshot() function would then need to pull some tricks to rewrite link URLs before clicking on them.
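For the boring route, a pass-through proxy could look something like this. Node.js, the /proxy path, and the crude regex link rewriting are all my own assumptions for the sketch; here the rewriting happens on the proxy side rather than in take_snapshot(), which amounts to the same trick. A real version would need proper HTML parsing, https support, redirects, and so on (bacon optional).

// Pass-through proxy sketch (assumes Node.js). Serves /proxy?url=http://...,
// fetches that page, and rewrites absolute links so that clicking them stays
// on the proxy's origin, which keeps the iframe readable by the crawler page.
var http = require('http');
var urllib = require('url');

http.createServer(function (req, res) {
  var parsed = urllib.parse(req.url, true);
  if (parsed.pathname !== '/proxy' || !parsed.query.url) {
    res.writeHead(400);
    return res.end('missing url parameter');
  }

  http.get(parsed.query.url, function (upstream) {
    var body = '';
    upstream.on('data', function (chunk) { body += chunk; });
    upstream.on('end', function () {
      // Crude link rewriting: send absolute hrefs back through the proxy.
      var rewritten = body.replace(/href="(http:\/\/[^"]+)"/g, function (match, link) {
        return 'href="/proxy?url=' + encodeURIComponent(link) + '"';
      });
      res.writeHead(200, {'Content-Type': 'text/html'});
      res.end(rewritten);
    });
  });
}).listen(8001);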
I'm still convinced this is easy. I have no idea why Google isn't doing it already. Some other things to consider: you'd probably end up wrestling a little bit with popup windows and JavaScript conflicts, but window.onerror can help you log these problems for analysis. You'd also need a comprehensive browser farm, but Firefox is a great place to start since you can run it cheaply on EC2 using Linux. Most sites seem to work in Firefox these days, so it might even be sufficient for indexing purposes.
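A tiny example of that logging hook, with a made-up /path/to/error-log endpoint on the Indexer side:

// Log JavaScript errors raised by crawled pages so they can be analyzed later.
// The /path/to/error-log endpoint is hypothetical.
window.onerror = function (message, file, line) {
  $.post("/path/to/error-log", {
    url: window.location.href,
    message: message,
    file: file,
    line: line
  });
  return true; // suppress the browser's default error dialog
};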