Costs and Benefits of Search Engines

Chris Smart writes about the latest money making schemes for OS distributors, Canonical is getting paid by Yahoo to make them stop using Google as the default Firefox search engine [1]. I think this is OK, the user can easily change it back if desired and it allows them to pay the salaries of more employees – who contribute code back to upstream projects.

MSN uses 455M, Google uses 189M, and Yahoo uses 109MGoogle referrs 8250 hits, but Bing only refers 280

Above are sections of my Webalizer output related to my blog which show the data transfer use of (Bing presumably) which is 50% greater than that of Google and Yahoo combined. Why does MSN need to do 455MB of transfer so far this month to scan my blog when Google gets the job done with 189M and Yahoo only takes 109M? Also judging by the referrals Bing is only 3% as much use to me as Google.

MSN uses 525M, Yahoo uses 132M, and Google only uses 35M

Above is a sample of the Webalizer output from, MSN is using 525MB of data to scan the site which contains about 1.2G of static files that change very rarely. A Russian malware site seems to be downloading it three times a month, and Google only takes 35MB of data transfer to scan the site (which is probably still excessive).

If Bing was a quality search engine that returned appropriate results then this could be forgiven. But however it is a very poor search engine that returns bad results. For example if you query Google or Yahoo for “bonnie++” you will get an entire page of search results concerning my Bonnie++ benchmark, and those results are ordered in a sensible way. If you ask Bing then the first four results concern “Bonnie” (three women and a plant) and most of the first page don’t concern my benchmark.

Some time ago I had blocked MSN from scanning a server that I ran. The server in question had all the web servers for my domain plus quite a few other small domains. The total MSN data transfer was 3G per month which was almost half the data allowance for the server in question (data plans in Australia suck – that’s why my web servers are hosted in Germany now), so it was a question of whether to allow normal operation of the business or MSN searches. With Microsoft not running a popular search engine (then or now) it was an easy decision.

I think that anyone who accepts money from Microsoft/Bing is doing their users a mis-service. Bing is simply an inferior search engine, it gets bad results and imposes excessive costs on service providers. Yahoo however seems to be a reasonable service, not as good as Google for the web hosters but not too bad.

I wonder what would happen if Yahoo offered some sponsorship money to the Debian project in exchange for being the default search engine. I’m sure it would be dramatic.

10 comments to Costs and Benefits of Search Engines

  • SIO

    Yandex is not a malware site, it’s local russian search engine. It’s even more popular that Google: (blue line), but still not very useful when searching worldwide.

    Looks like they’re trying to fix this disadvntage =)

  • etbe

    SIO: Any site that downloads 3G of data in a month from a rarely changing site that contains only 1.2G of static data is a hostile site. If they keep doing that they will get banned by lots of web servers and have decreasing relevance for anyone who wants to search sites outside Russia.

    You can’t run a useful search engine if you force the web masters to block you.

  • SIO

    Well, I’m not defending them. Even though I own no websites, I can compare the digits you posted. And even for me it looks inappropriate.

    That comment was just to make it clear, what they tried to do :)

    I’m sure you should ban it first (even before bing) if you want to reduce data transfer. Just because Yandex will not bring you visitors. Ever. Everyone here knows that it’s silly to use Yandex for global search

  • etbe

    I’m not going to ban them at the moment. I’ve got a fairly large bandwidth quota that I have no risk of reaching at the moment. When I put some 200M files online that get downloaded a few thousand times a month I might have some problems and then look at what I can stop.

  • Maybe Google already has your site pages in its index and Bing is catching up?

  • etbe

    Albert: Nice theory, but my blog has less than 900 published posts, 28 categories, and not many tags. I think that even Google’s 27,689 hits in the first 27 days of the month is excessive. Also this goes on month after month.

    Also has 2,652 files and 177 directories. While Apache does render directory indexes in different formats it still seems difficult to imagine how more than 4,000 hits can be required in a month. After all the Apache directory indexing algorithms should be well known and catered for by the search engines. No search engine can do anything with a Debian package (AFAIK) or anything other than the start of a flv file. So excluding those there is only 200M of data on my web site and none of the big files ever change (a HTTP if modified since request will show this). So I can’t imagine how they can be transferring much data.

    Unless of course Webalizer counts if modified since as data transferred.

  • The whole issue of moving Firefox’s search bar from Google to Yahoo! is interesting on a few levels. Mozilla gets paid by Google if it’s used in Chrome (the little search bar in Firefox, not Google’s browser), so the old way benefited both Google and Mozilla. Now with the change to Yahoo! (which is really Bing by proxy) the money goes to Canonical and Microsoft.

    Essentially, what Canonical has done is cut off funding to Mozilla (Firefox upstream) to line their own pockets thanks to Microsoft. That’s pretty special.


  • Russell,

    You might find these two pages useful; you’re certainly not alone in being frustrated by this crap.


  • etbe

    Chris: That is very interesting. Why didn’t you mention that in your blog post?

    John: Thanks for that information, MS being sneaky as usual. The funny thing is that even if none of the hits on my blog with Bing as a referrer are fake – it still doesn’t matter as there is almost no traffic reported as coming from them.

  • Only just thought of it!