January 06, 2003

Google Disses 'Mercury' (Sort Of)

So, having noticed recently that Google News indexes Willamette Week but not the Portland Mercury, I sent an email suggesting the latter as a source to be added to the list of news sites they crawl in order to generate content for Google News.

They claim they can't do it. Why? Because of the URL scheme at the Mercury website, wherein from one week to the next, the main URL for each section of the paper contains a different story -- rather than having a front page which links to each story's individual URL.

Of course, users of the Mercury website know that each story does, in fact, possess a unique URL. It's just that you don't see it immediately. When you click from the front page to any given article for the week, the upper right-hand side includes the words, "Linking? Use This URL!" -- with the "URL" containing (of course) the specific archive URL of that story.

Now, this doesn't appear to be overly complicated. Google's web crawling software would simply have to follow each link from the front page and grab whatever individual archive URL is contained within those magic words, "Linking? Use This URL!"

But what is Google's response?

Unfortunately we can not properly spider sites set up in that manner.

Okay, so in other words, the big bad king of search engines can't handle a process which would go something like this: (1) Crawl main page; (2) Descend each link; (3) Find the text "Linking? Use This URL!"; (4) Descend and index the URL contained within.
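To make those four steps concrete, here is a rough Python sketch of the process. It is only an illustration, not how Google's crawler works and not based on the Mercury's real markup: I'm assuming the whole phrase "Linking? Use This URL!" is the text of an ordinary link, and the names `LinkCollector`, `extract_links`, `find_permalinks`, and the `fetch` callable are all made up for this example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Assumption: the marker phrase appears verbatim as the text of a link.
MARKER = "Linking? Use This URL!"


class LinkCollector(HTMLParser):
    """Collects (href, anchor text) pairs from one HTML page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # href of the <a> tag currently open, if any
        self._text = []     # text fragments seen inside that <a> tag

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def extract_links(html, base_url):
    """Return absolute (url, anchor text) pairs for every link in html."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return [(urljoin(base_url, href), text) for href, text in parser.links]


def find_permalinks(front_url, fetch):
    """Steps 1-4: crawl the front page, descend each link, look for the
    marker text, and collect the archive URL it points to.

    fetch is a hypothetical callable that takes a URL and returns HTML.
    """
    permalinks = []
    for story_url, _ in extract_links(fetch(front_url), front_url):
        for href, text in extract_links(fetch(story_url), story_url):
            if MARKER in text:
                permalinks.append(href)  # this is the URL to index
                break
    return permalinks
```

Passing `fetch` in as a parameter keeps the sketch independent of any particular HTTP library; a real crawler would hand it something that wraps `urllib.request.urlopen`, while a test can hand it a plain dictionary lookup.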

If they can't handle something this straightforward, then Google and its software are nowhere near as intelligent as they make themselves out to be, despite how useful they are.

At any rate, the upshot of all of this is that Google News manages to provide links into Willamette Week content but not Portland Mercury content. Not exactly an issue of crisis proportions, of course. But still. That makes their results a little unrepresentative of the City of Portland. And all because they claim to be unable to perform what is, in the scheme of everything Google does, a frighteningly simple piece of spidering.
