At the beginning of the year, we made a troubling discovery about Siteturner: the site had been cloned in its entirety. And the clone was showing up in Google’s search results. At first, the discovery was disturbing. Then, it was annoying. In the end, we just felt a little sad.
Sometime over the 2014 holiday season, someone, somewhere, went to a little bit of trouble to mirror this site with a proxy script, creating an almost flawless Siteturner clone. Stumbling upon it for the first time, I was confused – slightly shocked – and didn’t immediately understand what I was seeing. I’ve worked online for nearly twenty years, but until that moment had missed out on this sleazy scraping technique. A few cursory Google searches turned up little information about it, suggesting that it wasn’t common.
Site content is harvested by scraping bots and scripts and re-posted elsewhere all the time, and there’s not much to be done when it happens but file DMCA complaints and report it to Google. But what do you do when your entire website has been neatly copied, and the copy is nearly indistinguishable from the original? We got to find out.
What’s Going on Here?
Here’s what the enterprising gentleman had done. First, he registered a domain name remarkably similar to our own, changing three letters. Then, on his server, he used a proxy script that, in effect, mirrored Siteturner. The design was nearly identical; the content was nearly identical. Updating Siteturner would instantly update the clone, and that was our first clue that Siteturner hadn’t been copied by a scraper like HTTrack.
There were two notable differences between the clone site and the original. First, rather optimistically, the scraper went to the trouble of editing our PNG logo to reflect the ripoff domain name he had purchased. To be honest, I can appreciate this attention to detail. Go big, or go home. The second difference was that the script replaced all instances of the name ‘Siteturner’ with the altered domain name. At first glance, a visitor to the clone site would have no reason to suspect the content was being ripped off from its smarter, better-looking author.
Should I Panic?
Readers who have zeroed in on a clone of their own website can relax a bit. Anyone reading this in the year 2015 has little to worry about. While digging for info, it became clear to us why you don’t see it happening more often: mirroring sites is a giant waste of time. There’s little to gain from this activity but the knowledge that you’ve slightly aggrieved and annoyed another human being. The infringing party has more in common with a hobbyist, or a common internet troll, than a copyright criminal.
Ten years ago, Google wasn’t as hip to these shady techniques as it is now, and it wasn’t unheard of for clone sites to replace the original site in the search results, or even outrank the original. But the last reference we could find to that happening was way, way back in 2011, and, indeed, Google has waged an all-out war on content scrapers since that time. While it would be nice if its algorithm were sophisticated enough to omit these scam sites from the results entirely, as long as they fail to rank and don’t hurt the ranking of the original sites, there’s not much point in losing sleep over them.
You Still Don’t Have to Take It
Once we realized Siteturner wasn’t doomed to Google oblivion, we felt a bit better. But it’s a slap in the face to see all your hard work callously thieved and re-branded, and there’s no reason to tolerate it. Fortunately, when it comes to copyright infringement, you’ve got options.
In the rare instance that scraped content is outranking your original content, Google provides a handy tool that will allow you to report scraping sites. Or you might want to consider filing a DMCA takedown request with Google to have infringing content expunged from their search results entirely. Know that this option is serious business: read the fine print, and don’t report a copyright infringement if the copyright wasn’t yours to begin with.
Our first action was to use DomainTools to find out everything we could about the domain, who registered it, and who was hosting the website. Most hosting providers will take reports of infringement seriously, and though that might not prevent a clone site from reappearing in the future, you can at least get it taken down temporarily.
But the best way to deal with a proxy mirror has nothing to do with filling out forms and tracking down abuse@ e-mail addresses. You can take it down yourself.
Finding and Blocking the IP
Every time someone clicks a link on a mirrored website, a request for the content is made on the original. This is why updates to the original site are reflected on the imitator instantly: no real data is stored on the infringing server. And that’s good news, because it means all you need to do to put the copy website out of commission is find the IP address of the client making the request, and block that IP address using either a firewall or htaccess.
Tracking down the IP won’t be too difficult if you have access to your server’s raw Apache log. Visit the cloned website, click on a specific link, and note the time you clicked on it. Download the raw log, and search around the bottom of it for the request. It ought to look something like this (all requests will look something like this, which is why it’s important to note which link you clicked, and when):
126.96.36.199 - - [03/Mar/2015:16:23:31 -0500] "GET /url-of-the-link-you-clicked-on-the-clone-site/ HTTP/1.1" 200 19147 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.10; rv:37.0) Gecko/20100101 Firefox/37.0"
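If the log is large, searching by eye gets tedious. Here is a rough Python sketch of the same lookup, not the method we actually used. It assumes your server writes Apache's combined log format. The function name `find_candidate_ips`, the 60-second window, and the sample lines (including the `/some-post/` path and the second IP) are made up for illustration.

```python
import re
from datetime import datetime, timedelta

# Assumed: Apache "combined" log format. Adjust the pattern if your LogFormat differs.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+'
)
TIME_FORMAT = "%d/%b/%Y:%H:%M:%S %z"


def find_candidate_ips(lines, clicked_path, clicked_at, window_seconds=60):
    """Return the IPs that requested clicked_path within window_seconds of clicked_at."""
    window = timedelta(seconds=window_seconds)
    hits = []
    for line in lines:
        match = LOG_PATTERN.match(line)
        if not match:
            continue  # skip lines that do not look like access-log entries
        when = datetime.strptime(match.group("time"), TIME_FORMAT)
        if match.group("path") == clicked_path and abs(when - clicked_at) <= window:
            hits.append(match.group("ip"))
    return hits


# Hypothetical sample lines; in practice, read them from your downloaded log file.
sample = [
    '126.96.36.199 - - [03/Mar/2015:16:23:31 -0500] "GET /some-post/ HTTP/1.1" 200 19147 "-" "Mozilla/5.0"',
    '203.0.113.7 - - [03/Mar/2015:16:23:40 -0500] "GET /other-page/ HTTP/1.1" 200 5120 "-" "Mozilla/5.0"',
]
clicked_at = datetime.strptime("03/Mar/2015:16:23:30 -0500", TIME_FORMAT)
print(find_candidate_ips(sample, "/some-post/", clicked_at))
```

Matching on both the path and the time is what narrows the field: a popular page may be requested by many visitors, but few of them will hit it in the same minute you clicked the link on the clone.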
Once you’ve identified the insidious IP address, you can then go about blocking it with your firewall (we used Cloudflare), or block it via htaccess (we did that, too). By adding the following to htaccess, you can deny access to the IP shown in the example above:
# Apply allow rules first, then deny rules, so the deny takes precedence
order allow,deny
allow from all
deny from 126.96.36.199
After the IP is blocked, any visit to the mirrored site will return an Access Denied error, and a 403 error will be recorded in your log.
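To confirm the block from the log rather than by reloading the clone, you could tally the response codes your server returned to that IP. The following is a minimal Python sketch, again assuming the combined log format. The helper name `status_counts_for_ip` and the sample entries are invented for illustration.

```python
import re
from collections import Counter

# Assumed: Apache "combined" log format; captures the client IP and the status code.
LINE = re.compile(r'(?P<ip>\S+) \S+ \S+ \[[^\]]+\] "[^"]*" (?P<status>\d{3}) ')


def status_counts_for_ip(lines, ip):
    """Count HTTP status codes returned to a single client IP."""
    counts = Counter()
    for line in lines:
        match = LINE.match(line)
        if match and match.group("ip") == ip:
            counts[match.group("status")] += 1
    return counts


# Hypothetical entries: one request before the block, one after, one unrelated visitor.
sample = [
    '126.96.36.199 - - [03/Mar/2015:16:23:31 -0500] "GET /some-post/ HTTP/1.1" 200 19147 "-" "Mozilla/5.0"',
    '126.96.36.199 - - [03/Mar/2015:17:02:10 -0500] "GET /some-post/ HTTP/1.1" 403 202 "-" "Mozilla/5.0"',
    '203.0.113.7 - - [03/Mar/2015:17:02:12 -0500] "GET /some-post/ HTTP/1.1" 200 19147 "-" "Mozilla/5.0"',
]
counts = status_counts_for_ip(sample, "126.96.36.199")
print(dict(counts))
```

Once the block is live, new requests from the mirror's IP should add to the 403 count, while the 200 count for that IP should stop growing.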
If you don’t manage your own server, and don’t have access to the logs, you can try contacting your hosting provider for help. Explain the situation to them, and hopefully someone will help you block the mirror.
Is That It?
There’s not a day that goes by that our logs don’t reveal a new brute-force bot hammering away on our doorstep. Although mirrored sites are aggravating, if you’re concerned about the integrity and security of your website, time might be better spent making sure your passwords are rock-solid and your plugins and themes are up to date. Threats like this one, from a few months ago, hold far more potential to bring your website and your business to their knees.
Our conclusion: block mirrors or report them, but don’t take them too seriously. They’re an obnoxious reality of doing business online, and being a creator of unique content. Take pride in the fact that if someone wants to copy you, you’re probably doing something right.