Archive for the ‘Common Questions’ Category

Will crawling affect your Google Analytics?

This is a question we’ve been asked a few times. Many users want to know whether the hits generated when their pages are checked for broken links will show up in Analytics. For both CloudTrawl and DeepTrawl the answer is no, it won’t be a problem. For reasons I’ve written about previously, neither product executes the JavaScript on your pages. Since Google Analytics relies on JavaScript to count page views, it has no way of knowing that we’ve visited a page, so our visits won’t show up.

As a side note, in the next version of DeepTrawl we’re planning to implement a way to make sure all of your pages contain analytics tracking code. Until then it’s relatively easy to do this check yourself using DeepTrawl’s ability to add your own new checks.
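If you want a starting point for rolling your own check in the meantime, a crude version can be as simple as a substring test on each page’s HTML. This is just a sketch (the function name and the strings it searches for are my own assumptions, not anything built into DeepTrawl; adjust them to whatever tracking snippet your pages actually use):

```javascript
// Hypothetical sketch: report whether a page's HTML appears to contain
// a Google Analytics tracking snippet. The strings searched for are
// assumptions -- the classic ga.js, analytics.js and gtag.js snippets
// all look different, so match these to the one your site really uses.
function hasTrackingCode(html) {
  return html.includes('google-analytics.com') ||
         html.includes('gtag(');
}
```

A check like this won’t catch a malformed snippet, only a missing one, but that’s usually the mistake that slips onto a page unnoticed.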

Link checking JavaScript links

Having worked on DeepTrawl for a long time, I’ve found there are a few questions which come up again and again. Some of these are equally applicable to CloudTrawl (since it also does link checking) & I’ll try to address some of them in this blog.

One of the really common support questions is “does your link checker work with JavaScript links?”. The answer is always no, and it’s likely neither product ever will. Here’s why.

JavaScript is a tricky beast. In the simplest case it might seem easy for our software to follow a JavaScript link. Perhaps one looking like this:

<a href="#" onclick="'somedoc.html', 'newWindow')">Click Me</a>

The problem comes in that onclick attribute. You really could have anything in there. For instance, there could be JavaScript to:

– Ask the user which of several pages they’d like

– Take account of their session (are they logged in?) and act accordingly

– Get some information from the server

The examples above can stretch out into infinity because JavaScript code can do anything you want; that’s the power of using a programming language in a web page.
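To make that concrete, here’s an invented example (the function and page names are mine, purely for illustration) of a link target that simply cannot be determined by reading the page source, because it depends on runtime state:

```javascript
// Invented example: the destination depends entirely on runtime state.
// A crawler reading the page source can't know whether this user is
// logged in, so it can't know which URL the "link" will open.
function linkTarget(loggedIn) {
  return loggedIn ? 'dashboard.html' : 'login.html';
}

// In a page it might be wired up something like:
//   <a href="#" onclick="window.open(linkTarget(session.loggedIn))">My account</a>
```

A link checker sees only the source; without running the script against a real session it has no way to pick between the two targets, let alone verify both.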

So the answer might appear to be that we should build in a JavaScript engine and have our products do exactly what the browser would have done when it encountered that link. The problem really comes when the script has some interaction with the user. There’s no way to know what the user would have done, so there’s really no way to know where that link should lead. The code might be asking the user to click one of two buttons, but it could just as easily be asking them to sign up for a service, enter a new username, address, etc., and then serving them some unique content based on what they entered.

If we tried to implement this we’d get into a spiral of problem solving, and at every step there would be things we couldn’t solve perfectly. We’d have to fudge it & take guesses. I don’t think guesses are what anyone wants from a link checker. We want certainty.

When I first started thinking about this problem I decided to take a look at what some of our competitors were doing. Some claim to be able to follow JavaScript links, so I tested them out with some real examples and found results I didn’t like.

They would cope with very simple examples like the one above, perhaps even slightly more complex ones, like this:

<a href="#" onclick="'somedoc' + '.html', 'newWindow')">Click Me</a>

(Note that here I’m building somedoc.html by adding together two strings).

But if things got a little more complicated they just refused to check the link. That isn’t what I want for our customers. I’m not saying our competitors are intentionally misleading in their marketing; there are a lot of very simple JavaScript links on the web. But I think people would expect all types of links to be handled, and I just don’t believe we could ever live up to that promise.

Since we can’t make a good guarantee, I didn’t want to make one at all. False negatives are really bad: our products saying they’ll find an error and then failing to find it is the kind of issue that keeps me awake at night.

Usually when people ask, I tell them it’s a good idea to always have a regular <a> link with an href attribute to match every JavaScript link, even if the ‘copy’ is hidden away at the bottom of the page. This means DeepTrawl, and now CloudTrawl, will be able to scan the site for broken links. But there’s another really important reason to make sure all your links can be found somewhere in the page in regular, old-fashioned <a> tags.
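As a sketch of that advice (the URL is a placeholder), the idea is simply to give every scripted link a plain-HTML twin somewhere on the page, so anything that ignores JavaScript can still discover the target:

```html
<!-- The scripted link users actually click -->
<a href="#" onclick="'somedoc.html', 'newWindow')">Click Me</a>

<!-- A plain fallback, perhaps in a footer sitemap, that crawlers can follow -->
<a href="somedoc.html">Some doc</a>
```

The fallback doesn’t need to be prominent; it just needs to exist in the markup.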

Google and other search engines don’t guarantee to follow JavaScript links. A lot of the time they discover the content in a site the same way our technology does: by starting at the first page and following all the links. If a link is in JavaScript the search engines may not follow it, so they may not see your pages and won’t list them in the search results. Ouch. Better get putting in those regular <a> tags!