Ghost Busting
There might be more spammy traffic sneaking into your Google Analytics data than you originally thought. Referral spam from sites like semalt.com and 100dollars-seo.com has been around for a while now (I’ve seen it as far back as February of 2014). Many of us know about this and have heard of a variety of filtering methods to keep this kind of spam out of our data. But spammers have evolved, and there are new issues to address now. Ghostly issues.
What is ‘Ghost Spam’?
What hasn’t been addressed as much is the huge influx of ‘ghost spam’ that comes in through not just referral channels, but direct and organic as well. These ghost spam visitors don’t actually visit your website. Instead, they target your Google Analytics tracking ID. Using this ghostly method, spammers can automatically generate or retrieve hundreds of thousands of tracking IDs and send false hits to Analytics properties around the globe. Typical .htaccess rules and source filters won’t work on these ghosties, because they never touch your server and they come in through the direct and organic channels with no identifiable source. These spam sessions have to be dealt with in a different way. Scary.
How to Identify and Isolate Ghost Spam
Ghost spam isn’t always obvious at first glance because direct and organic spam sessions don’t provide an incriminating source domain like referral spam does. The key to spotting ghost spam is the hostname dimension in Google Analytics. Remember: “Hostname” refers to the domain that your tracking code was on when a particular session triggered. So when someone visits www.yourdomain.com, the tracking code on that site triggers and tells Google Analytics that the hostname is www.yourdomain.com. But since ghost spam uses auto-generated tracking IDs and never visits your site, it can’t possibly send an accurate hostname to Google Analytics.
If you see sessions triggering a hostname that is not a domain that you have some form of tracking on, then those are almost certainly ghost spam (with few exceptions)!
There are a number of ways to see hostnames, but a good “big picture” method is to go to Audience > Technology > Network. Then, switch the Primary Dimension to Hostname.
Look at all these illegitimate hostnames! I certainly don’t have tracking code on 4webmaster.org or hulfingtonpost or darodar or (not set). These are all ghost spam. Note the fourth hostname: translate.googleusercontent.com. This is a legitimate hostname because it is what triggers when users view your site through Google Translate. So we want to leave that alone.
But what channels are they coming from? Simply set the secondary dimension to Default Channel Grouping or Medium while in the Network view with Hostname set as the Primary Dimension.
Ghost spam is coming in through Organic, Direct and Referral channels. But without looking at the hostname, we have no way of seeing the direct and organic ghost traffic, because those sessions carry no spammy source domain to give them away.
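If you export that report, the same check can be scripted. The following is a minimal Python sketch, not anything built into Google Analytics: the row layout, the session counts, and the `VALID_HOSTNAMES` set are all made up for illustration. It totals sessions per channel for every hostname that is not on your own list.

```python
from collections import Counter

# Hostnames you actually have tracking on (assumed for this example).
VALID_HOSTNAMES = {
    "www.yourdomain.com",
    "yourdomain.com",
    "translate.googleusercontent.com",
}

# (hostname, channel, sessions) rows, as you might copy them out of the
# Network report with a secondary dimension. The numbers are invented.
rows = [
    ("www.yourdomain.com", "Organic Search", 120),
    ("4webmaster.org", "Referral", 40),
    ("(not set)", "Direct", 25),
    ("(not set)", "Organic Search", 10),
    ("translate.googleusercontent.com", "Direct", 3),
]

# Add up sessions per channel for every hostname we do not recognize.
ghost_by_channel = Counter()
for hostname, channel, sessions in rows:
    if hostname.lower() not in VALID_HOSTNAMES:
        ghost_by_channel[channel] += sessions

for channel, sessions in ghost_by_channel.most_common():
    print(f"{channel}: {sessions} suspected ghost sessions")
```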
How to Use Hostname Filters to Block Ghost Spam Traffic
Ok, so now that we’ve identified ghost traffic, we need to filter it out. We are going to create an include hostname filter that only allows valid hostnames to enter our Analytics data. Why not exclude invalid hostnames? Well, because there are so many spammers out there and because they are changing all the time, it will generally be much easier to specify which hostnames are valid rather than having to constantly update a list of invalid hostnames.
This is where we need to be careful! If we create a filter that includes valid hostnames, but we forget one or two, we are going to lose data. And as we all know, data in Google Analytics cannot be retroactively altered. This is why it is essential to create a test view and apply your new hostname filter to it. This will allow you to spot any mistakes and make corrections before you apply it to your master view.
Steps to Creating the Hostname Filter
- Identify valid hostnames using Audience > Technology > Network > Primary Dimension Hostname
- Create a regular expression using the valid hostnames identified in step one
- Create the filter using the regular expression to include valid hostnames only
1. Identify Valid Hostnames
To identify which hostnames are valid, I recommend going back into historical data as far as you can and pulling a list of hostnames using the method identified above (Audience > Technology > Network > Primary Dimension Hostname). Comb through that list of hostnames and identify which are valid. These may include:
- Any domain or subdomain that you have tracking code on
- translate.googleusercontent.com (if users translate your content)
- webcache.googleusercontent.com (if users are accessing cached versions of your site)
- Any other website (social media, advertising, payment portals etc) that you have configured with your tracking code
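Combing through a long hostname list is easier when it is collapsed and sorted first. Here is a rough Python sketch of that step. It assumes you have exported the report as CSV with `Hostname` and `Sessions` columns and plain integer counts; a real export may be laid out differently, and the sample rows are invented.

```python
import csv
import io

# Stand-in for an exported hostname report (assumed column names).
export_text = """Hostname,Sessions
www.yourdomain.com,1200
yourdomain.com,300
4webmaster.org,45
(not set),30
translate.googleusercontent.com,4
"""

def hostnames_by_sessions(csv_text):
    """Return (hostname, total_sessions) pairs, busiest hostname first."""
    totals = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        name = row["Hostname"].strip().lower()
        totals[name] = totals.get(name, 0) + int(row["Sessions"])
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)

# Print the list for manual review; deciding what is valid is still up to you.
for name, sessions in hostnames_by_sessions(export_text):
    print(f"{sessions:>6}  {name}")
```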
2. Create a Regular Expression
Once you have identified valid hostnames, you will create a regular expression. For example, let’s say these are the valid hostnames you have identified in step one:
- www.yourdomain.com (your website)
- yourdomain.com (your website without www subdomain)
- shop.yourdomain.com (your shop subdomain)
- translate.googleusercontent.com (Google Translate services)
- webcache.googleusercontent.com (cached versions of your site)
- www.youtube.com (YouTube ads configured with your tracking data)
To create a regular expression, you will list each of these domain names with a pipe “|” in between them. Make sure there are no spaces in between domains and no pipe “|” at the end of the expression. In order to include all subdomains in a regex, you can add .* before the domain name. I recommend doing this so you don’t miss either the www or the non-www version of your domain. So the example above would result in a regex of:
.*yourdomain.com|.*googleusercontent.com|www.youtube.com
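Before pasting a pattern into Analytics, it can help to test it against the hostnames you expect to keep and drop. The sketch below uses Python's `re.search`, on the assumption that the filter does partial matching in the same way. The lookalike hostnames and `STRICT_PATTERN` are assumptions added for illustration, not part of the example above. It also shows one caveat: an unescaped `.` matches any character and a leading `.*` matches any prefix, so the loose pattern accepts lookalike hostnames that an escaped, anchored variant rejects.

```python
import re

# The pattern from the example above, exactly as written.
ORIGINAL_PATTERN = r".*yourdomain.com|.*googleusercontent.com|www.youtube.com"

# A stricter variant (an assumption for illustration): dots are escaped and
# each alternative is anchored, with an optional "anything plus dot" prefix
# standing in for subdomains.
STRICT_PATTERN = (
    r"^(.*\.)?yourdomain\.com$"
    r"|^(.*\.)?googleusercontent\.com$"
    r"|^www\.youtube\.com$"
)

def matches(pattern, hostname):
    """Partial-match check: True if the pattern is found anywhere in hostname."""
    return re.search(pattern, hostname) is not None

valid = [
    "www.yourdomain.com",
    "yourdomain.com",
    "shop.yourdomain.com",
    "translate.googleusercontent.com",
    "webcache.googleusercontent.com",
    "www.youtube.com",
]
ghosts = ["4webmaster.org", "(not set)"]
# Hypothetical hostnames a spammer could fake to slip past a loose pattern.
lookalikes = ["notyourdomain.com", "yourdomain-com.example"]

for hostname in valid + ghosts + lookalikes:
    print(
        f"{hostname:35} original={matches(ORIGINAL_PATTERN, hostname)!s:5} "
        f"strict={matches(STRICT_PATTERN, hostname)}"
    )
```

Both patterns keep every valid hostname and drop the ghosts; they only differ on the lookalikes. Whether the stricter form is worth the extra typing depends on how much fake-hostname spam you actually see.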
3. Create the Hostname Filter
Now that you have your regex, it’s time to create the filter itself. Create a new filter and set the Filter Type to Custom and Include. Set the Filter Field to Hostname and paste your regex into the Filter Pattern.
That’s it! You’ve created a filter that will bust all the ghosts trying to sneak their way into your data. Remember that you are including valid hostnames, which means that any time you add tracking to a new domain, you must update the hostname filter to include that new domain. It is essential to keep a clean unfiltered view so that you can always go back and see any data that has been filtered out of your main view.
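To get a feel for what the test view is protecting you from, here is a small Python dry run of an include filter. It is only a sketch: the hit records are invented, `checkout.payments.example` is a hypothetical payment portal that carries your tracking code but was left out of the regex, and `re.search` stands in for the filter's matching.

```python
import re
from collections import Counter

INCLUDE_PATTERN = r".*yourdomain.com|.*googleusercontent.com|www.youtube.com"

def apply_include_filter(hits, pattern):
    """Split hits into (kept, dropped) the way an include filter would."""
    compiled = re.compile(pattern)
    kept, dropped = [], []
    for hit in hits:
        if compiled.search(hit["hostname"]):
            kept.append(hit)
        else:
            dropped.append(hit)
    return kept, dropped

# Invented sample hits for illustration.
hits = [
    {"hostname": "www.yourdomain.com", "page": "/"},
    {"hostname": "4webmaster.org", "page": "/"},
    {"hostname": "checkout.payments.example", "page": "/pay"},  # forgotten valid host
    {"hostname": "(not set)", "page": "/"},
]

kept, dropped = apply_include_filter(hits, INCLUDE_PATTERN)

# Review what was dropped: anything legitimate here must go into the regex.
dropped_hostnames = Counter(hit["hostname"] for hit in dropped)
for hostname, count in dropped_hostnames.items():
    print(f"dropped {count} hit(s) from {hostname}")
```

The forgotten portal shows up in the dropped list next to the real ghosts, which is exactly the kind of mistake you want to catch in a test view rather than in your master view.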
But now that you’ve blocked future ghost spam, how can you segment it out of historical data?
Segmenting Ghost Spam Out of Historical Data
It’s likely that hundreds or even thousands of ghost spam visits per month have skewed your Analytics data for the past few months. Ghost spam causes traffic and bounce rate to be inflated and other engagement metrics like average session duration and pages per session to be dragged down. In order to get a better picture of your site’s actual data, you can create a segment that uses the same regular expression that you used in your hostname filter.
Click into Google Analytics segments (All Sessions by default), then click +New Segment.
Make sure the filter is set to Sessions, Include. Set the first dropdown to Hostname and the second dropdown to matches regex. Paste the regex that includes all valid hostnames into the field and save the segment.
Now, when this segment is active, you can go back into historical data and view it with all ghost spam removed!
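The effect of the segment on your metrics can be illustrated with a few lines of Python. This is a simplified sketch with invented session records: it treats any single-pageview session as a bounce, which is an approximation, and it reuses the example regex as the segment condition.

```python
import re

VALID_HOSTNAME_PATTERN = re.compile(
    r".*yourdomain.com|.*googleusercontent.com|www.youtube.com"
)

# Invented historical sessions: real visitors plus ghost spam.
sessions = [
    {"hostname": "www.yourdomain.com", "pageviews": 3},
    {"hostname": "www.yourdomain.com", "pageviews": 1},
    {"hostname": "yourdomain.com", "pageviews": 4},
    {"hostname": "shop.yourdomain.com", "pageviews": 2},
    {"hostname": "4webmaster.org", "pageviews": 1},
    {"hostname": "4webmaster.org", "pageviews": 1},
    {"hostname": "(not set)", "pageviews": 1},
    {"hostname": "(not set)", "pageviews": 1},
]

def bounce_rate(rows):
    """Share of sessions with exactly one pageview (simplified bounce)."""
    return sum(1 for row in rows if row["pageviews"] == 1) / len(rows)

def pages_per_session(rows):
    """Average pageviews per session."""
    return sum(row["pageviews"] for row in rows) / len(rows)

# The "segment": keep only sessions whose hostname matches the regex.
clean = [s for s in sessions if VALID_HOSTNAME_PATTERN.search(s["hostname"])]

print(f"all sessions:   bounce={bounce_rate(sessions):.1%} "
      f"pages/session={pages_per_session(sessions):.2f}")
print(f"valid hostname: bounce={bounce_rate(clean):.1%} "
      f"pages/session={pages_per_session(clean):.2f}")
```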
Note that this method only blocks ghost spam, and not the actual referral spam that physically visits your site via bots (because those bots trigger a valid hostname). To block those, you can create a filter that excludes the specific source of each spam bot, and you can block them in your .htaccess file. But that is covered in another blog post on Moz by Jared Gardner, so I won’t delve into that here.
Happy ghost busting!