Click Fraud, Google AdWords and gclid


Published on July 16th, 2007 by Jeremy Chatfield

A tardy response to Matt Cutts' posting and Shuman Ghosemajumder's joining in the debate… I don't see some important stuff in Google's approach to this issue. Maybe I'm selectively vision impaired. Or maybe I read too many other reports about naughtiness on the net.

Update 2009-09 – you can now use Google’s Webmaster Console (Settings) to ignore “gclid” as a page variation. Why is this important? Read the rest of the article and the notes to Matt Cutts at Google about why.
New information since this article (2008-03)! Google Blog article on click fraud forensics.
And also Richard Ball’s analysis of Google’s lousy management of click quality.

This recent Forbes news report about click fraud is probably just about right for the percentage – if we use an advertiser's definition of click fraud, and ignore that Google doesn't charge for the clicks it already identifies as invalid. Do you care if a click is fraudulent, if it isn't charged? What does the presence of an identifiable and large volume of fraudulent or otherwise invalid clicks imply about the unidentified volume? Well, I'm not covering that here. I've tackled some of it in previous postings and I'm sure I'll return to this topic. Again. And probably also cover, again, rational advertiser response to fraud (you should drop your bid, so your advertising costs remain the same) and why advertisers are irrational (and how this is good for search engines and great news for cheaters, social deviants and scumbags). No, this article is about measurement techniques and what they can tell you. Numbers…

If you cannot measure it, you cannot improve it – Lord Kelvin

I should clearly say that while this article is critical of Google, I find the state of many other search engines to be much worse. Google offers the highest quality and volume paid search engine of those that we consistently use or have used, as measured by conversion rates and volume of conversions from specifically targeted adverts. I have a lot of respect for what Google has done. I think AdWords is the smartest high volume paid search advertising system we deal with. That doesn't mean that it is perfect, or that it cannot be improved.

This gclid thing, what is it?

About two years ago, I noticed “&gclid=(stuff)” being added to web server log files on clicks from Google. Pretty cool idea, I thought, and we started writing web log analysis programs that used the gclid to determine whether visitors were unique. We realised that this gclid tag should give you some idea about whether visitors are seeing you for the first time from an advert, or returning, without an additional cost for the click. In other words, it provides illumination into whether Google is charging for duplicate clicks, and the behaviour of unique visitors (e.g. whether individual anonymous users bookmark or use the “back” button a lot) – visitor behaviour and click fraud insight with one tag!

You'll see the tag in the request line of detailed web server logs. You don't (shouldn't) add "gclid" yourself. It gets added by Google to each impression that is delivered. That is, you submit an advert, and each time that advert is shown, Google adds this tag to the destination URL. It makes each advert impression unique, and allows the adverts to be tracked by web analytic packages. Specifically, I believe that it helps Google to work out when they shouldn't charge for a second click on an advert.
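To make that concrete, here's a minimal sketch of the sort of log analysis we write – Python, assuming a combined-format access log; the file name and regular expressions are illustrative, not our production code:

import re
from collections import Counter

LOG_PATH = "access.log"  # assumed file name
# Matches the request line, e.g. "GET /foo.html?x=1&gclid=ABC123 HTTP/1.1"
request_re = re.compile(r'"(?:GET|POST) (?P<url>\S+) HTTP/[\d.]+"')
gclid_re = re.compile(r'[?&]gclid=([^&\s"]+)')

gclids = Counter()
with open(LOG_PATH) as log:
    for line in log:
        m = request_re.search(line)
        if not m:
            continue
        g = gclid_re.search(m.group("url"))
        if g:
            gclids[g.group(1)] += 1

total_tagged = sum(gclids.values())
print("Requests carrying a gclid:", total_tagged)
print("Distinct gclid values:    ", len(gclids))
print("Repeat uses of a gclid:   ", total_tagged - len(gclids))

If the distinct count is well below the tagged-request count, you are looking at revisits (bookmarks, the back button) or tag re-use, not at freshly charged clicks.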

But gclid autotagging has some major weaknesses as a tool for advertisers to detect click fraud. It may be that Google could use web server log files to identify click fraud, but it is not a usable technique, on a mass scale, for advertisers. Why on earth not?

This is going to get technical. And financial. And microeconomic and other things… So strap on your hard hat, buckle up, and do whatever else you do to protect yourself online. We’re going for a ride.

Basics of AdWords & Click Tracking

Imagine that I open a Google AdWords account, and create a campaign and an AdGroup and an advert, and a keyword. I enable auto-tagging, which appends extra information to the destination URL, either "?gclid=(stuff)" or "&gclid=(stuff)" (the choice depends on whether the destination URL already has a query string). You can enable Auto Tagging on the "My Account" tab (up with the "Reports", "Analytics" and "Campaign Management" tabs), under "Account Preferences". It appears to be linked to using Google Analytics – you can register for Google Analytics and enable Auto Tagging, but you don't have to actually use Google Analytics…

The “(stuff)” that is added appears to be unique for each advert impression, and appears to be unique in a clever way… The first part of the ID varies rapidly and the last part varies slowly. This is clever because when you are looking for string matches, you get an early failure in the string match, helping to speed the search up – an indication that some smart people may have been working on this.

Note that this is yet another way to identify Editorial Review related visits to your web site. That's something that most web analytics packages fail to identify – but it is incredibly useful to know. With an editorial review click, if you use a macro, such as "{keyword}" or "{creative}" in the destination URL, then those are not substituted with the real value, as would be the case if the advert was being shown to a real user. Additionally, a gclid tag is not added to editorial review visits. Note this carefully – we have come across some sites (mostly eCommerce sites with URLs that are a database query) where the addition of the gclid tag can cause the site to fail with a 404 (page not found). Because Google's site and content detection systems do not add the gclid tag, they will gleefully direct hundreds of dollars of spend to the site, with very little likelihood of conversion. The only ways you can find out whether this affects you are to manually browse to the destination URL you use with a gclid parameter appended, or to click on one of your own adverts in real search. Clicking an advert shown in the AdWords User Interface to check is no confirmation that the gclid is harmless to your site.
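If you want to check this without clicking your own adverts, a small script will do it. A minimal sketch in Python – the destination URL and the dummy gclid value are illustrative, substitute your real landing pages:

import urllib.request
import urllib.error

destination_urls = [
    "https://www.example.com/products?cat=saucepans",  # illustrative only
]

DUMMY_GCLID = "TEST1234567890"  # made-up value; only the parameter name matters

for url in destination_urls:
    sep = "&" if "?" in url else "?"
    tagged = url + sep + "gclid=" + DUMMY_GCLID
    try:
        with urllib.request.urlopen(tagged, timeout=10) as resp:
            print(resp.status, tagged)
    except urllib.error.HTTPError as err:
        print(err.code, tagged, "<-- real, charged clicks would land on this error")

Anything that comes back 404 (or 500) is a destination where real, charged clicks are landing on a broken page.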

What is in the gclid?

I'll guess that the last part of the gclid value encodes, or more likely references in some way, the advertiser ID, the keyword, adgroup, campaign and account IDs. The first part, which changes rapidly, is probably some combination of timestamp and instance ID or advertising channel (where the advert was published). I suspect that the account and keyword part is a database ID that delivers a row with the account ID, campaign and so on – rather than being an encoding. I suspect that the first part is a timestamp and instance ID, which will also be recorded on Google servers and will tell them when the advert impression was delivered, on which site, and how long it was between that impression and the click.

But the advertiser *only* knows that they've seen a gclid. Not that it was delivered by Google, and not that it is related to their advertising. The gclid tag can be faked, in any request to the web server. As we'll see later in this article, there is an incentive for click fraudsters to forge gclids on requests…

Now, slightly more confusing is what happens on the web server. A request comes in shaped like “/foo.html?gclid=juiuyvyuvuyvjhasfdhgkhj”. The user’s browser will show what the server delivers – the web page that was requested. The tag? That’s left in place. So if the user finds the page helpful, the page may be submitted to delicious or another bookmarking system, complete with a tag that does not specifically identify the page – it’s just some random tracking information, left behind because the web is still pretty primitive and undeveloped. Advert tracking information is not a page name… at least not unless you can conceptually separate the idea of tracking from the page that is addressed – you reach the same content with or without the tag.

If you do some snazzy stuff with your web servers, you can both do the tracking and remove the gclid tag (and other tracking tags), so that users will bookmark only the base page name. You could, for example, use a rewriter that strips off standard tags; that is, when you request "/foo.html?gclid=xxx", you get a redirection to just "/foo.html". A rewriter does, of course, introduce a few more delays in page delivery and increase the points of failure. It is also pretty rare. I can't think of any major web site that does this. However, the user ends up with a more readable URL, better handled in bookmarks. And you'd get fewer appearances of gclid in web server logs. We'll revisit this later – it is more important than it might seem at first.
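What the "snazzy stuff" might look like, as a rough Python/WSGI sketch – your stack may do this with a web-server rewrite rule instead, and the parameter names and logging are illustrative:

from urllib.parse import parse_qsl, urlencode

TRACKING_PARAMS = {"gclid", "utm_source", "utm_medium", "utm_campaign"}

def strip_tracking(app):
    def middleware(environ, start_response):
        query = parse_qsl(environ.get("QUERY_STRING", ""), keep_blank_values=True)
        tracking = [(k, v) for k, v in query if k in TRACKING_PARAMS]
        if not tracking:
            return app(environ, start_response)
        # Record the click however you like before discarding the tag...
        print("tracking hit:", dict(tracking))
        clean = [(k, v) for k, v in query if k not in TRACKING_PARAMS]
        location = environ.get("PATH_INFO", "/")
        if clean:
            location += "?" + urlencode(clean)
        start_response("301 Moved Permanently", [("Location", location)])
        return [b""]
    return middleware

The user ends up bookmarking "/foo.html", your analytics still sees the gclid, and the extra redirect is the price you pay.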

OK, so that’s the basics. Where the gclid comes from. What it probably consists of. How it might be used by Google (I claim no inside knowledge). How it ends up being re-used legitimately. How it appears in web server log files and search engines.

Useless to advertisers? Come on… Get real.

Assume that I have enabled Autotagging and that I am using the uniqueness of the gclid to determine unique visitors.

In an ideal world, I get a unique visitor for each click that I pay for. If I don’t, then the advertising channel is lying to me about sending me users. That’s close to Google’s (narrow) definition of click fraud. Have they delivered the click they charge for? Google will gleefully send you the same person, charging for the click each time, if the user conducts a variety of searches with a time delay between clicks, from unique adverts. Google even appear willing to charge twice for sending the same person, if there’s a sufficient delay between clicks. Check your AdWords account – you will occasionally find keywords with a low impression rate, where the number of clicks is greater than the count of impressions. Google’s reporting interface is really part of their billing interface – this data represents what they think you should be charged for.

By counting clicks, and counting unique gclid values, I can match up the two and determine that Google is sending me unique visitors – unique at the level of having clicked on different adverts and not having double clicked a link, or suffered from a noisy button (you do get extra clicks from some mice, if the click-debouncing circuitry isn't good enough).

What value is this to an advertiser? It says that you have seen as many visitors as Google claims it is sending. It tells you (for low volume keyword cases) whether Google is charging for second clicks on the same advert, and how long it takes for Google to decide that a second click is payable rather than free. This may match Google's definition of click fraud, but it isn't an advertiser's definition of click fraud.

The problem with Google AdWords and click fraud is *NOT* only that advertisers think they aren't getting visitors. It is what the intention of those visitors is. Counting visitors, and making sure they are unique, yeah, that's important. Making sure they come from the geotarget you've selected – that's important too, and not directly addressed by AdWords; you need to use Google Analytics, another web analytics package with a geolocation database, or some geolocation databases with your web server log file analysis. I keep meaning to publish an article on geolocation, but it is a hugely complex area. I'll probably end up publishing several bits of article, so they can be better maintained.

I’ve seen a few users complaining, mostly on the AdWords Help Forum, that they haven’t been delivered clicks. I’m pretty sure that most of these users are getting the visitors they’ve paid for. They simply haven’t tagged the adverts and can’t tell the difference between paid search clicks and ordinary search clicks, and especially they can’t identify paid search clicks from AdWords that bring in visitors from Google’s search partners and content matching programs. I have no serious quibble with Google about receiving the volume of clicks that my clients are paying for – it isn’t volume, but quality. We’ve previously written at moderate length about tagging adverts so you can identify whether you receive clicks from advertising.

Quality of clicks is a whole other story. And very little to do with gclid, as we will see.

In any case, gclid data isn't authoritative. Assume that a user has bookmarked my site. Last year. So this is an old, old gclid value that suddenly re-appears. I can't just look at current day activity and test to see whether gclid and click counts match, I have to identify *unique* gclids, unique on the first usage – and ignore the rest. I have to check back through a year of records to see if this ID previously appeared (strictly, back to some point in 2005, when gclids were first issued).

If there was a financial value to my client in doing this long-record check, it might be worth doing. Of course, I could optimise it. I could just stuff gclids into a DB, with a custom written program, and query the DB. Shouldn't take more than a week or so to develop, with some docs and a web interface of sorts. Then you have to interact with the tech team at the client site to embed this software within the infrastructure that they have, or that they've outsourced. Then you need access to web server log files going back to an unspecified date in 2005 (IIRC, it was around Q3 that we first noticed gclid) – which may mean trawling backups. Ever tried getting a backup from a technical organisation that doesn't really care whether you can spot a unique visitor for whom you've paid less than a couple of dollars? It's a difficult argument, one that I rarely win, at least. Then it'll take far longer to manage this exercise of trawling the records than to implement the technology… and at the end of it, you still only know whether this user was unique. A lot of effort for a small increment in confidence.
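The custom program really is that small – a minimal sketch with SQLite, keeping a first-seen record per gclid; the table and file names are illustrative:

import sqlite3

conn = sqlite3.connect("gclids.db")  # illustrative file name
conn.execute(
    "CREATE TABLE IF NOT EXISTS gclids (gclid TEXT PRIMARY KEY, first_seen TEXT)"
)

def record_click(gclid, timestamp):
    """Return True if this gclid is new, False if it has appeared before."""
    row = conn.execute(
        "SELECT first_seen FROM gclids WHERE gclid = ?", (gclid,)
    ).fetchone()
    if row:
        print("repeat gclid", gclid, "first seen", row[0], "now", timestamp)
        return False
    conn.execute("INSERT INTO gclids (gclid, first_seen) VALUES (?, ?)", (gclid, timestamp))
    conn.commit()
    return True

The hard part isn't the code; it is getting the 2005-onwards log files out of someone's backup rotation to seed it.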

Intention, not volume

There are a few of us with a fairly consistent message that Google persistently ignores. They go under names like John K, Richard Ball, CPC Curmudgeon and a few others… You can find them, fairly easily, using a blog search tool.

The message is this. We know that Google sends us visitors on each click. What we don’t believe, is that Google always sends us high quality visitors that are likely to buy. Some of those clicks may be bots. Some may be bored users clicking on anything. Some may be genuinely interested. I won’t speak for the other people that I mentioned above – because I haven’t consulted them – but I’m quite certain that Google understands the concept of the quality of a click. And I’m suspicious that Google manages the click quality to assure Google of a revenue stream. I have no evidence that Google manipulates the quality to benefit customers… Though I believe that large customers can demand that Google removes low quality click sources (such as, in most of my clients’ cases, domain parks and sites identified only by numeric IP addresses, etc).

Now, this allegation of naughtiness by Google is pretty tricky. After all, the quality of a click is in part down to the advertiser. For example, if I am selling saucepans, and I use as a keyword “cheap flights” or “used cars”, then I might expect a low CPC. However, my conversion rate will be low. People looking for cheap flights, a car or shoes, flowers, whatever… well, *some* of them may want new saucepans. Some people will just click on the advert to find out what it has to do with the search. The low conversion rate from this is my fault – I should have used relevant keywords and a more specific advert, and a better landing page (one that explains the relationship between cheap flights and saucepans, for example).

However, if Google or, for that matter, any other search engine, sends me people who searched for something different, or if they send me a lot of traffic from content pages, then my conversion rates will probably decline – and it isn't my fault. My clients' conversion rates depend, at least in part, on the quality of Google's advertising partners, and how keen Google is to transfer my money to people whose primary interest is making money for themselves, rather than helping users or my clients' business. That, in turn, depends on how broadly Google interprets broad match, and what web pages Google consider to be suitable for content match (and even for site targeting).

Google will say that they control click fraud, but they are controlling click fraud *FOR THEIR BENEFIT*, not mine. If the definition that Google used included conversion activity, then I'd be more convinced that they cared about my clients. As it stands, Matt, Shuman and Eric Schmidt are essentially at pains to assure the world that Google is protecting Google's revenues. The reason they come under repeated questioning is that they have shown no sign of recognising that the quality of a click can represent fraudulent activity perpetrated, supported or tolerated by Google. Well, that and a bunch of advertisers who haven't taken the relatively trivial steps that would let them identify paid search clicks, and make those clicks reasonably unique (add tags to adverts, and enable autotagging – it can't get much easier to set your mind at rest, but it is so rarely done by default).

Google assures advertisers that we now have more control. It is true that Google has added new reports that help to identify poor content match sites. The new "Placement Report" tells us which sites result in AdWords Conversion Tracking events. Brilliant, if you can use AdWords Conversion Tracking. There's also the Search Query report, which sheds a modest increment of light on the breadth of search queries matched by broad match (but not the immensely more useful report of search queries for which no-one clicked). And the Invalid Clicks data. And the Cost Per Action beta test. And they've now started blogging seriously about click fraud.

(I could do a whole aside here about clients that do not trust Google with their sales data, so refuse to use AdWords Conversion Tracking, or that have offline conversions and cannot infer the quality of the content network without significant effort – I'll leave that out of this essay).

For example, Google has now removed the 500-site limit for the "site exclusion" mechanism. If you identify a low quality site (lots of clicks, lots of spend, no conversions) then you could remove that site from your content match advertising. However, identifying a good site is not cheap. Take one UK account, targeting the UK only. There are about 10,000 rows of placement data per month. Sites that have no conversions in one month account for more than 90% of the rows, and a large fraction of the expense. But most sites without conversions have too low a spend to reject them as useless… So we'll continue to spend at a high rate, because we can't (yet) reject these 8,000 to 9,000 sites without an additional data source (in addition to impression, click and conversion data).
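For what it's worth, the mechanical part of the triage is trivial – a minimal sketch, assuming a placement report exported as CSV; the column names ("Domain", "Clicks", "Cost", "Conversions") are assumptions, so match them to whatever your export actually calls the fields:

import csv

candidates = []
with open("placement_report.csv", newline="") as f:
    for row in csv.DictReader(f):
        cost = float(row["Cost"])
        conversions = int(row["Conversions"])
        if cost > 0 and conversions == 0:
            candidates.append((cost, row["Domain"], int(row["Clicks"])))

candidates.sort(reverse=True)
for cost, domain, clicks in candidates[:50]:
    print(domain, clicks, "clicks", "$%.2f" % cost, "0 conversions")

The output is a ranked list of zero-conversion spend – but, as I said, most rows carry too little spend to reject with any statistical confidence.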

Advertisers Costs, Buying Short and Selling Long

When I find a low quality click source, I reject it. But it costs the advertiser money to discover… Let's investigate what Google makes out of this learning process, shall we? I'll make up some numbers. These are representative, but do not accurately portray any specific client that I have.

Assume that I see a conversion rate of 1%. Assume that I have an AvCPC of $0.10 for the keyword. I’ll spend about $10.00 for a conversion on keyword search. Assume that the content network averages the same conversion rate (this is not usually true – the true conversion rate is much lower for reasons discussed in another article). I can then afford to spend $10 on 100 clicks – the $0.10 AvCPC that we saw for keyword search. Google picks up 50% of that. They make $5.00 and the AdSense partners share $5.00. If there was just one site in the list, then that’s one AdSense partner that receives $5.00. I can easily see that this is good value and I’ll invest more to get more placements with them.

Now, if Google spreads the love, and displays my clients' adverts on ten sites… well, if I get one conversion from this $10.00 spend, I can't disprove that the other nine sites were useless. Assume an even spread of clicks (it isn't – it looks more like a power law distribution than anything else, from the data that I have – a lot of sites with a few clicks and a few sites with a lot of clicks). That means that I have to spend (at least – the real cost is higher and more difficult to calculate – I'm simplifying, OK?) $100.00 to prove that the ten sites are worth advertising on. Google picks up $50.00 and I've still only got one conversion… So my response is to drop the bid price. If Google is going to spread my advert everywhere, then I need to spend less per click, in order to compensate for lower quality clicks on the content network. I need to drop the bid to $0.01 to justify the conversion rate.
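The arithmetic, reduced to a couple of lines of Python (the numbers are the made-up ones from above):

def break_even_cpc(target_cpa, conversion_rate):
    # CPA = CPC / conversion rate, so the bid that holds CPA steady is:
    return target_cpa * conversion_rate

print(break_even_cpc(10.00, 0.01))   # keyword search, 1% converts  -> $0.10
print(break_even_cpc(10.00, 0.001))  # spread across 10 sites       -> $0.01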

Next step in the arms race is that Google spots this and introduces "Smart Pricing". This means that if I bid $0.10, some sites get the whole sum, and most of the others get a lot less. The others get a lot less because Google has decided that they are lower quality. Lower quality in what way? Hmm, interesting question. I'll bet it has to do with Quality Score type metrics, traffic volume and CTR. And of course advertisers don't know which are the premium sites and which aren't.

Now, if Google manipulates the average price paid using those measures to allocate the payment, then they can make sure that most of the clicks my clients see come from a wide range of sites. The more sites Google uses, the more money I have to spend to establish that those sites do not yield a conversion. So low quality sites, and many of them, work for Google's benefit, drive up my clients' costs, don't yield a lot of conversions, and don't allow a lot of rationally based decisions on site exclusion.

Rational advertiser response to finding a low quality site is to add it to the site exclusion list. But identifying a low quality site, even for a high volume client, may take many months or even years.

Contrast what happens when a publisher starts to see declining revenue… It takes about ten to twenty minutes to find a new site name, pay for it, get a hosting plan working and redirect AdSense to the new site. About 20 minutes after the income on a site starts to decline because it is being excluded, a money-grubbing leech can have a new site up with cheesy scraped content and stuffed with adverts for every network. Advertisers now have to identify this garbage site and exclude it all over again… costing advertising funds and wasting account management time (wasted compared with the case where there were no useless sites operating). The people who develop low value, poor conversion sites can generate new sites rapidly, but advertisers spend a long time individually identifying poor value sites – this weights the system in favour of those who produce poor value sites.

Google, on the other hand, could take a message from advertisers… Get enough exclusions resulting from spend on the site, and you get dropped from the AdSense network. Not just as a site, but as a publisher. Low conversion rates are not a problem for Google, unless advertisers make it a problem for them. Low conversion rate sites and publishers are something that Google wants. They generate more revenue for Google.

The point is that Google benefits from click fraud across its distribution network. There is no incentive to stop, and no control over, Google's collusion (whether intentional, accidental or systemic) with the publishers of web sites who manipulate clicks for profit, with no intention of responding to the advertising. So, if you can, and if you trust Google with your commercial information, consider using the Beta CPA programme. Regrettably I have no current client that will do so.

So, this is why Google will continue to be the target of criticism about click fraud. While Google manages search quality for their own benefit, and while advertisers use the defaults that Google gives them, there will continue to be allegations that paid search traffic from Google is subject to fraudulent charges. It is because the things that Google believes to be click fraud are only part of what advertisers identify as click fraud.

Gclid and user behaviour tracking

Well, if gclid is at best peripheral to the problem, at least we can use the gclid for something useful. Identifying user behaviour. Or can we…

You can use the gclid as a proxy for a cookie. If your advertising includes the gclid, and the gclid is unique for each impression, then you can spot returning users from bookmarks – though you may pick up some bookmarks from social networking sites. You can therefore extract two more measurements… the number of times that a page is referenced in social networks (that referer_info field), and the ratio of bookmark-using visitors who still carry your cookie versus those who have deleted it. So you can infer the additional success of your programs that depend on cookies for measurement; while it is still an exercise in stats, it is at least a numerically based exercise, with your own data, rather than that of an industry pundit or a terrifying percentage estimate from a commercial vested interest who wants to flog you an authentication based service, or a flash cookie service.
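A minimal sketch of those two measurements, assuming you have already parsed (gclid, referer, has_cookie) tuples out of your logs – the sample rows are illustrative, and the parsing itself is site-specific so it is left out:

from collections import defaultdict

parsed_hits = [
    # (gclid, referer, has_our_cookie) -- illustrative rows only
    ("CJx1abc", "https://www.google.com/search?q=saucepans", True),
    ("CJx1abc", "", False),   # later hit: bookmark or back button, cookie gone
    ("CKq9def", "https://www.google.com/search?q=pans", True),
    ("CKq9def", "https://del.icio.us/somebody/bookmarks", True),  # social re-use
]

hits = defaultdict(list)
for gclid, referer, has_cookie in parsed_hits:
    hits[gclid].append((referer, has_cookie))

revisits = {g: h[1:] for g, h in hits.items() if len(h) > 1}
social = sum(1 for later in revisits.values()
             for referer, _ in later if referer and "google." not in referer)
cookieless = sum(1 for later in revisits.values()
                 for _, has_cookie in later if not has_cookie)

print("gclids seen more than once:", len(revisits))
print("revisits referred from external pages:", social)
print("revisits arriving without our cookie:", cookieless)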

Other than that? Well, Google doesn’t publish a spec for the gclid. If we knew what the parts meant, we could do more with them. As could the bad guys…

Faking the gclid

So, what happens if there is a third party running a site designed to allow bots or human networks to click for revenue generation? Firstly, they should be using browsers and bots that neglect to offer the "Referer_info" (sic) field, or that forge its contents. This field, which could be examined by Google during the redirect, tells you what the browser had previously requested. In other words, if the user was using Google to search, you find out that the last request was for a page on "google.com" (or "google.co.jp" or whatever), and you can find out the last query.

Clearly, if you are running something a bit dodgy, you don't pass on good information, such as the site where you found the advert – carried in the optional referer_info field from the browser and sent to the server on each request. Confusingly, some (legitimate) versions of the AOL browser suppress the referer_info field. Most bots don't offer a referer_info field. If it is a fraudulent browser, there are reasons it would offer a forged referer_info field. For example, you could be fooled into thinking that Google.com was offering a lot of users that did nothing on the site, if the referer_info was forged to pass on fake queries on Google properties – making content matched sites look more attractive.

Bogus traffic could even deliberately fake competitor information. This kind of anti-competitive activity is visible in the world of computer viruses, where one virus writer may embed messages taunting another writer. So, if I wanted to paint, ohh, MySpace as a bad source, then I could stuff the referer_info field with data for a MySpace page (or pages). That’d deflect attention from my scummy sites and have you add MySpace to your site exclusions list – making it more likely that you’ll see impressions from the poorer quality sites.

Even worse, what if the gclid was forged in other requests? That is, you see an excess of clicks received from Google, compared with the clicks they claim they sent. But if the clicks never originated with Google, then the value of gclid to anyone analysing it becomes weaker… If most of the traffic you see has a unique gclid, and you can't tell which tags were added by Google and which were fake gclids added by botnets to confuse the server analysis, then gclid is rendered pointless for advertisers. Note that Google, because it knows which codes it used, *CAN* identify the real clicks in an analytics package, and is the only party that can do so.
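The one cross-check available from the outside is volume: compare gclid-tagged hits in your logs against the clicks AdWords says it charged for, day by day. A minimal sketch with illustrative numbers – in practice the two inputs come from your own log parsing and a downloaded report:

log_gclid_hits = {"2007-07-14": 132, "2007-07-15": 140, "2007-07-16": 410}
adwords_billed = {"2007-07-14": 128, "2007-07-15": 139, "2007-07-16": 131}

for day in sorted(log_gclid_hits):
    seen = log_gclid_hits[day]
    billed = adwords_billed.get(day, 0)
    flag = ""
    if seen > billed * 1.2:  # arbitrary 20% tolerance for revisits and bookmarks
        flag = " <-- excess tagged hits: heavy re-use, or forged gclids?"
    print(day, seen, "tagged hits vs", billed, "billed clicks" + flag)

It tells you that something odd is happening, but not which of the tagged hits were genuine – only Google can do that.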

Spiders and the gclid of doom

Now, if users have saved pages with a gclid to a social bookmarking site, then the tag is treated as part of the page ID. That is “/foo.html” and “/foo.html?gclid=hiufuyviuybfkjgkhghjkkh” are treated as two completely different pages. You can certainly eject spiders that attempt to crawl, if the tag is present.

As recommended in Matt Cutts' blog, you can change the search engine spider response to a tagged page, by adding:

User-agent: *
Disallow: /*gclid=

to your robots.txt file. This will at least mean that you don't get heavy gclid re-use from random strangers who've never seen the advert. But that doesn't really add much to gclid usage. And why would you want to help reduce your page relevance, just because a tracking tag has been found on it? The search engines should be aware of the use of gclid, and actively remove valid gclids from the URLs they index. Shouldn't they, Google? There's no good excuse, that I can imagine at present, for Google to be crawling pages and identifying them as different pages, just because of tracking tags from their own advertising system.

Could the gclid be more useful?

Absolutely. For example, it should be possible for an advertiser to query the gclid with Google. If I authenticate to an account, I ought to be able to submit a bunch of gclid values and find out:

  • whether this advert impression was for my clients’ account
  • when the advert impression was served
  • where the impression was served (site and AdSense publisher ID)

And I can then infer the delay between impression and click (useful for visitor behaviour analysis). Of course, Google could offer that information, too. And they could save a bunch of work by confessing whether they treated the second and subsequent clicks from that advert as uncharged (e.g. key bounce or users that double click links) or charged (e.g. the interval between the first and second clicks was too long to treat the second as an accidental duplicate, so it counts as intentional, revenue-bearing activity by the user).
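To be explicit: no such lookup exists today. Purely as a sketch of the shape it could take – every name and field below is imaginary:

def lookup_gclids(account_id, gclids):
    """Imaginary, hypothetical endpoint -- illustrates the data advertisers would want back."""
    return [
        {
            "gclid": g,
            "belongs_to_account": True,                    # False for forged/foreign values
            "impression_served_at": "2007-07-16T10:31:02Z",
            "placement": "www.example-adsense-site.com",   # hypothetical site
            "publisher_id": "pub-0000000000000000",        # hypothetical publisher
            "second_click_charged": False,
        }
        for g in gclids
    ]

Authenticated advertisers, their own clicks only – nothing here that Google doesn't already record.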

I do realise that publishing this information allows bot writers to tailor their bots to avoid detection. In information security, security through obscurity has long been a failed defence. That is, you have to design systems that are open to scrutiny, but that preserve trust. Google’s approach to click quality is equivalent to a failed strategy in information security. It didn’t work for spies. It won’t work for advertising.

Summary

Google’s use of gclid helps Google to identify user activity in response to advertising.

You can enable Google’s use of gclid by turning on autotagging in the “My Account” area of AdWords.

With appropriately written web analytics, gclid will currently allow you some insight into user behaviour, and offers clues about Google’s unpublished policies on counting second clicks as revenue.

gclid can be subverted, and if its use in analysing fraud were widespread, one possible reaction from fraudsters could render gclid useless for advertisers.

gclid could be more useful, if advertisers were allowed to verify tag values and extract information from Google – but only if they authenticate as an advertiser, and only about their own clicks. This would help mitigate the effect of forged or erroneous gclids.

The real click fraud problems with Google are undocumented policies on double clicking charges and Google controlled click quality. Not whether you get clicks, but from whom, via which sites.

We are not aware of any other paid search vehicles offering similar unique advert tagging mechanisms.

We are not aware of any web analytics package, other than Google Analytics, that uses gclid by default to help identify paid search adverts and unique users. This is a shame, because there's a lot you can learn about users at the moment, if you recognise this tag. I'll gladly maintain a list here of analytics packages that *do* correctly and usefully (my opinion) handle gclid without special configuration. I expect the list below to be empty of competition for some time:

Web Analytics Packages that use gclid sensibly

  • Google Analytics – basic usage of the tag – doesn't indicate revisits, show whether the tag has been saved to bookmarking sites, extract the referrer, etc.

What Should Google Do

Matt Cutts (praise be his name – seriously – the organic search indexes would be a real disaster area without him and his team) has asked what Google should be doing. I’ll suppress my mild annoyance that I’m acting as an unpaid Google product manager and I’ll take the question seriously. As much as I take anything seriously, that is. I haven’t thought a lot about Google’s actions… So expect this section to evolve as I think of stuff or people suggest things.

  1. Document the double click policy. It isn't fair to advertisers to receive two charges for a single impression, without a rational explanation of why this is regarded as fair. This *will* open the door for fraudsters – when you know how a system works, it can be easier to subvert it. In InfoSec, the general rule is that if you leave a policy undocumented or obfuscated, only the miscreants end up understanding it. Sophisticated bot-builders and human clicker fraudsters have already worked out where the edges are… it is advertisers that don't know.
  2. Check that destination URLs, especially on new accounts, have tracking tags, and use the alerts system to draw attention to adding tracking tags. Document how to use destination URL tags, and autotagging, and what to look for in web server log files, for at least the major web analytics systems – or even allow web analytics vendors an area to document how to use their tags. I’ll drop rank for my articles, here, when Google does document this stuff properly, but that’s the right thing to recommend. Call me Cut-Me-Own-Throat Dibbler and see if I care.
  3. Look at what some of your competitor paid search vehicles are doing. Some of them make tagging *much* easier. For example, one competitor makes it easy to use a redirection server (e.g. Nedstat/Sitestats’ redirection service) and to append standard tags to destination keywords. Google requires an editorial review when I change a tracking tag – painful, that is…
  4. Add some more values that are substituted in AdWords. In addition to keyword, creative, and placement, we could do with matchtype, position and search query, at least. Why? It's a real sweetener to induce people to tag – that'll help reduce your workload of advertisers who have no real data on which to base their claims, and helps build the third party analytics industry.
  5. When auto-tagging is enabled, add a dummy gclid (e.g. “gclid={gclid}”) to editorial review and site detection visits. This will protect advertisers and help Google to deliver adverts that have fewer 404s.
  6. Allow advertisers to submit gclid values for checking. At the most basic, Google could simply confirm whether they were issued for an authenticated AdWords account. At the most extensive, allow information on the timestamp of impression, position, publisher ID, and URL for the advert – allowing advertisers insight denied when browsers or bots suppress or fabricate referer_info. I see a difference in the ratio of clicks with and without referer_info when I compare keyword search and content match – this implies that there's a different audience for content match, and some of it is probably not acting in the best interests of my clients. If advertisers can compare what we see with what you report, it makes it harder for the bots to hide and for browsers to obscure fraudulent activity.
  7. Allow automated web server log file submissions from advertisers – report on the valid and invalid gclid data found – until web analytics vendors get a clue about analysing marketing information (yeah, I might criticise Google, but I reserve my main loathing for useless web analytics). I'll guess that you guys have a tool that does this sort of thing for your analyses. You could use it as a pre-screen. See no gclids? Then diagnose "enable auto-tagging". See a lot of weird gclids, and you guys will want to investigate.
  8. Allow advertisers a dial for experimentation. I've found ways to control content match that the account strategists I talk to at Google (previously known as maximisers) seem to think are novel. I think you could turn this into a way to allow advertisers to control their risk with content match. Turn the dial down, and you get exposed to better sites, with less traffic. Turn it up and you get more traffic, but it may be less tightly focused and hence lower converting. Also applies to broad match… Just because I bid high to gain position doesn't mean that I want "chicken sandwich" matched with "Turkish feudalism" (I made that one up; it's based on the etymology of the sandwich).
  9. Automatically remove valid gclids from organic search indexes – combine the page ranks. Tracking tags are not a separate page id. And, for that matter, not just gclid, but Urchin tags (utm_.*), Core Metrics tags (cm_mmc), Nedstat tags (ns_.*), etc. Just because users end up linking to pages with tracking tags, doesn’t mean that Google has to reflect the tracking data as if the page was unique. Does it?
  10. Search is a very powerful tool. Why do so few web analytics packages do anything sane with the data from search? This includes Google Analytics. They aren't ducking my venom, either. 🙂 Have a conference on "ways to extract data from search, about user behaviour, for marketing purposes". Have someone from Google ready to talk about anonymity and privacy – there's a lot of nonsense in the industry and a bit of dancing on eggs that Google does. We don't identify someone by name. We don't pass on their IP address. But it is immensely important to be able to do stuff like saying "this user's search evolved in normal ways" and "this user's search is not evolving, or represents an unusual search evolution that falls outside statistical bounds of normalcy". We extract that information in order to understand what users are trying to achieve – but it accidentally sheds some light on click fraud…

Material Disclosure

Merjis manages paid search and does a little SEO. We also have some custom web server log file analysis, and a redirection service, search analysis and other similar software – unproductised, so far…

Change History

2007-08-17 – added first paragraph link to the new Google Click Traffic Quality resource, and noted, only here, the similar Yahoo!Search Marketing click traffic quality resource.

2007-07-26 – added link in the second paragraph to John K’s blog about Google results. Tightening up? Could be. I’ve some more analysis coming up shortly and may be able to infer changes. I’ve already screwed the hatches closed for my clients, so detecting the effect of Google cleaning up may not be possible for me. The loss of business to those clients is small, but the cost was high – really poor ROAS, couldn’t justify the spend, and we’re not usually compensated on a percentage of spend basis; no skin off my nose to implement ways to get a better ROAS.

2007-07-23 – added reference to the PaidContent.org article. And fixed an embarrassing typo in the change history, where I’d put 25% and not 15%. Oops. Adjacent key error and haste. Tut tut.

2007-07-21 – more error corrections and clarifications, mostly surrounding security-through-obscurity paragraphs.

2007-07-20 – tightened up some language; corrected typos; now a little less nasty about the self-serving scum that generate AdSense pages with no intention of helping users or advertisers – I was in need of committing a random act of kindness, I guess, but I’m feeling better now, thank you; added new section on “what Google should do”, following Matt Cutts comment. Added new second paragraph about the Forbes report. And yes, I do believe in an average of around 15% for low quality clicks for an unsophisticated advertiser using Google defaults for an account and bidding to be on page 1. I’ve got some numbers here…

2008-03-15 – updated link target for “other naughtiness” in the first paragraph – the draft had changed state and was getting a 404. Link rot.

2008-03-23 – added link to new Google Blog article by Shuman, about the kinds of data that they inspect and the inferences that they make.

2008-08-20 Noticed that ClickTracks is one of those web analytics vendors adversely affected by Google’s private use of data. If Google were Microsoft, this is the sort of activity that would get censure. Google holds the data to gain advantage for their closed systems.

2009-09-21 Google Webmaster Console now allows users to request that Google ignores parameters. Intriguingly, none of my customers has had Google recommend by default, ignoring “gclid”, though all clients have that parameter present in the “additional” list. Also, none have had recommended ignoring utm_campaign, or the other Urchin tags often used with Google Analytics, or the Overture paid search tags such as OVRAW, etc. So the lack of attention to “gclid” doesn’t look too odd, given that it is in the company of an august body of other well known tracking parameters.
