"This [torsorophy] example opened our eyes, and over the next few months we noticed that URLs from Google search results would later appear in Bing with increasing frequency for all kinds of queries: popular queries, rare or unusual queries and misspelled queries. Even search results that we would consider mistakes of our algorithms started showing up on Bing."
That's the key. Not the honeypot. The honeypot was just a test to see if they could catch them red-handed.
Bottom line is that Google looked at the statistics and found that Bing results were improbably similar to Google results. Maybe they're lying about this, but I doubt it.
There's not enough information to figure out exactly to what extent Google impacts Bing results, but I would bet a lot of money that in the Bing code base there is Google specific code that behaves along the lines of "if on google, do x" rather than some generic code that just targets all sites.
I don't think anything about this is clear-cut. I can imagine a very simple algorithm, as follows:
* When query Q gets made more than N times at bing.com,
* Mine clickstream data for the next M urls requested after searches for Q
* Any url that appears more than T times (possibly spread across some number of users) is presumed to have been found relevant to Q, and derived either from later searches (corrected spelling) or curated sites or other search engines. Add to mapping of valid responses to Q.
It's not a very good algorithm, of course, and if you have any other source of information about Q you're probably better off using that instead. But it or something like it could explain the torsorophy example and every other part of Google's narrative, and it's not particularly suspicious or questionable, and it certainly doesn't involve targeting Google.
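The hypothetical algorithm above can be sketched in a few lines; all names and the thresholds (N, M, T) are illustrative assumptions, not anything known about Bing's actual code:

```python
from collections import Counter

def mine_clickstream(sessions, query, n_min=100, m_next=3, t_min=10):
    """Hypothetical sketch: each session is a list of (query_or_None, url)
    events. Return URLs presumed relevant to `query`."""
    # Step 1: only consider queries made more than N times overall.
    total = sum(1 for s in sessions for q, _ in s if q == query)
    if total <= n_min:
        return []
    # Step 2: count the next M URLs requested after each search for `query`.
    counts = Counter()
    for session in sessions:
        for i, (q, _) in enumerate(session):
            if q == query:
                for _, url in session[i + 1 : i + 1 + m_next]:
                    counts[url] += 1
    # Step 3: any URL appearing more than T times (possibly spread across
    # some number of users) is presumed relevant to `query`.
    return [url for url, c in counts.items() if c > t_min]
```

Note that nothing here inspects where the URLs came from; a Google results page is just another page in the stream, which is the point of the argument.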
It doesn't matter, because the results (from the honeypot) had nothing to do with the searches, so it would have been impossible to clue Bing in directly by searching on Bing a lot.
Are you claiming that they made the queries without actually making the queries? Reread the first line of my algorithm: once you identify a query Q as more than just a one-off mistake—whether it's an actual new item or a common misspelling or, maybe, a trap—then you decide it's worth looking into.
Put another way, it's impossible not to clue Bing in on at least the fact that you are making these searches.
That is exactly what's being claimed. The queries were not made on bing.com, they were made on google.com. The only way Bing can become aware of the results of these google.com queries is if they're "spying" on the user's activity via the Bing Toolbar and IE8 suggested search features.
"We gave 20 of our engineers laptops with a fresh install of Microsoft Windows running Internet Explorer 8 with Bing Toolbar installed. As part of the install process, we opted in to the “Suggested Sites” feature of IE8, and we accepted the default options for the Bing Toolbar.
We asked these engineers to enter the synthetic queries into the search box on the Google home page, and click on the results, i.e., the results we inserted. We were surprised that within a couple weeks of starting this experiment, our inserted results started appearing in Bing. Below is an example: a search for [hiybbprqag] on Bing returned a page about seating at a theater in Los Angeles. As far as we know, the only connection between the query and result is Google’s result page (shown above)."
All search engines look for keywords in URLs. That's why domain names with keywords in them are more expensive. If you're going to mine clickstreams from the toolbar, it's not far-fetched to think you would mine the entire source URL, including the query string, for keywords. Then there would be no need to postulate special treatment for Google search URLs, since the keyword is in plain sight (google.com?q=xxxxxx).
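For illustration, generic query-string mining might look like the sketch below; the parameter names are common conventions (q, s, search), not a documented list, and the function is a hypothetical assumption rather than anything Microsoft has described:

```python
from urllib.parse import urlparse, parse_qs

# Common search-parameter conventions; illustrative, not exhaustive.
SEARCH_PARAMS = ("q", "s", "search", "query")

def keywords_from_url(url):
    """Return search keywords found in a URL's query string, if any.
    Works on google.com/search?q=... and any other site using the same
    conventions - no Google-specific code required."""
    qs = parse_qs(urlparse(url).query)
    for param in SEARCH_PARAMS:
        if param in qs:
            return qs[param][0].split()
    return []
```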
If that is what's happening, then the moral/ethical argument against Microsoft would have to be that they should treat Google specially, by explicitly ignoring clicks on Google's result pages. That seems to me to open up quite a can of worms.
I'd play devil's advocate here and say that I give Bing a slight benefit of the doubt. Given they cannot perform a parallel Google search on every single user query (this would have been easiest to spot), the technique cannot be of the form "if on google, do x". Rather, it is likely they look at what users search and what results they get as one of the parameters, say normally influencing 0.2% of the search ordering on average (assuming 500 "signals"). In the case of "torsorophy", out of the 500 "signals" only one came up, the Google result (via Bing toolbar from a few users). So, their algorithm makes a note of that, like, hmm... I don't really have anything else so I'd just rehash what users have searched and subsequently visited. So the 0.2% becomes 100%, and internally Bing associates the word "torsorophy" with that website. Next time "torsorophy" is searched it returns that site.
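The renormalization argument above can be shown with a toy calculation; the signal names and scores below are invented for illustration and have nothing to do with Bing's actual ranking:

```python
def effective_weights(signal_scores):
    """Toy model: each signal contributes a score (0 if it has no data for
    this query). A signal's effective influence is its share of the total."""
    total = sum(signal_scores.values())
    if total == 0:
        return {name: 0.0 for name in signal_scores}
    return {name: score / total for name, score in signal_scores.items()}

# For a rare term like "torsorophy", only the toolbar clickstream signal
# has any data, so its nominally small influence becomes 100%:
rare_query = {"link_graph": 0.0, "anchor_text": 0.0, "clickstream": 0.8}
```

Here `effective_weights(rare_query)["clickstream"]` comes out to 1.0, which is the 0.2%-becomes-100% effect described above.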
Yes, Bing is likely using Google's data (if available, via the Bing toolbar) as part of the ingredients. You can argue whether this is stealing or not, whether this is Google results or user search/visit behavior, but I would say it isn't exactly "if on google, do x".
> I would say it isn't exactly "if on google, do x".
I wish Bing had taken this opportunity to answer that question in this blog post. Since they didn't, it makes me more likely to assume the worst. All they would have to say is something like "We use anonymized click data and don't special-case Google" to avoid most of the controversy. Then we would be back to debating whether it is ethical to do something theoretically OK even if you know that it practically means you'll be cribbing from a competitor.
One thing is sure: this episode makes Microsoft look like a cheater, but it also makes Google look like a child. I'd rather Google spent more time combating search spam and content farms than staging this concerted "caught-in-the-act" ambush. Instead, Google likes to cheapen themselves with endless arguments with everyone (how supporting WebM and Flash is open, how the iPhone world is "draconian", etc.).
Just call it "imitation is the most sincere form of flattery" and be done with it, would you, Google? Spend your time improving UI, defend your cash cow against spam, focus on making Android better, be more coherent in your social/location strategies, lead the industry in privacy (not just talk of privacy). All of these will benefit users more.
It is all fun and games, and a bit childish. Obviously Google is not offended, just trying to get some PR out of it - just business.
But the "I'd rather Google spend..." argument can be applied to anything - it isn't as if they haven't been tackling other problems, and they are clearly worried about Bing (at least in PR; Bing spends big - lots of MS employees posting on forums, including here, which I am sure is encouraged). Bing is losing a lot of money for Microsoft, so by that token I could say they should stop it and spend the money on Xbox (which is fantastic - no idea if it is losing money, but what a leader) - make it stupidly cheap, make it dominate the home, etc.
You can also argue that Bing fits in the "offense is the best defense" category. Hit Google where it hurts so Google can't cannibalize MSFT's core business as quickly. From a customer's perspective, it keeps Google honest.
Why should Google improve their search-spam algorithms? Would they gain anything if MSFT then compared its own results against Google's to delete the spam farms, thanks to Google's work?
I can't agree with that. You always want to improve, even if there is no "competition". Don't worry much about competitors that only imitate, but pay attention to those who innovate and exceed you. So, exceed yourself before others do.
> "the technique cannot be of the form 'if on google, do x' ... Rather, it is likely they look at what users search and what results they get"
How do they look at what a user searches and what result they get? OK, the "clickstream" can see what pages a user visits.
But to get the search term, they are parsing something, and unless they implemented a universal search recognizer that will rank up results from any old site's search (allowing 20 SEO guys to push private SERPs to the top), it seems more than probable the parser indeed would start with "if on google, do x".
Most referrers do not contain search terms. You can't just look at the Referer header and infer that the subsequent URL is correlated with a given keyword. If that's what you want, you write a parser that looks at URLs, recognizes human-determined search indicators (q=, s=, search=), and correlates the subsequent URL with the indicated keywords.
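Such a parser might look like the sketch below; the indicator names are the ones mentioned above, and everything else (function name, index structure) is a hypothetical illustration, not Bing's actual pipeline:

```python
from collections import defaultdict
from urllib.parse import urlparse, parse_qs

SEARCH_INDICATORS = ("q", "s", "search")  # human-determined, not exhaustive

def correlate_click(referer, dest_url, index):
    """If `referer` carries a recognized search indicator, map its keyword
    to the subsequently visited URL. Returns True if a keyword was found."""
    qs = parse_qs(urlparse(referer).query)
    for key in SEARCH_INDICATORS:
        if key in qs:
            index[qs[key][0]].add(dest_url)
            return True
    return False
```

Nothing in it is Google-specific, yet in practice Google's result pages would dominate the input.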
Hand-waving that this is "clickstream" data doesn't mean Bing is not looking specifically for Google search terms and the resulting user-selected URLs.
I personally don't mind them doing it. I just think their post is non-responsive in hopes of persuading the non-technical reader there's "no deliberate copying to see here" when they're clearly intentionally ingesting an extraordinary volume of Google-keyword-to-Google-result data and using that data to map keywords seen only on Google to results on Bing.
If Google analytics can scrape this data, so can bing.
Google never answered whether they are using clickstream data from the Google Toolbar / Chrome or not.
Check this simple script if you want the same data.
http://forums.digitalpoint.com/showthread.php?t=1680579&...
The problem is that the honeypot doesn't prove anything. These are rare terms by design, so clickstream data is probably all Bing has to go on, and that clickstream data was voluntarily submitted to Microsoft by Google employees.
Google needs to take some extra steps to show that Bing is copying Google links for popular terms. If Bing is weighting clickstream data from Google searches very highly, that is more or less admitting that other search engines work better.
The most I can deduce from the experiment is that Bing looks at what people click on Google and that this plays some role in their search ranking. If you synthesize a term that doesn't exist in nature, I can see how a search algorithm would weigh the only datapoint it has - data coming from the toolbar - heavily. This may not be the best approach, but it is a far cry from copying, which is what Bing is being accused of.