Archive for the ‘CloudTrawl’ Category

The world’s biggest companies have boatloads of broken links

Since we recently released CloudTrawl we decided to undertake some research to prove just how valuable it is. The uptime of major websites, and the damage downtime does to reputation and profits, has been written about extensively, so we decided to go a different way. Every web user has seen a broken link; they make our blood boil, and people will frequently leave a site on seeing one, assuming the content they're seeking simply doesn't exist. 404 has become the new blue screen of death. Broken links are a real risk to reputation & profit, but we've never seen a comprehensive study of just how common they are on major sites.

We decided to undertake that study and to perform it on the group of sites whose owners aren’t lacking in resources: the Fortune 500.

The Results

Here’s a big figure to open with:


You read that right, 92% of the sites in our sample included at least one broken link & most had several. 68% had more than 10 broken links, 49% had more than 50 and a surprising 43% of Fortune 500 sites have more than 100 broken links.

We also broke down the number of pages which had broken links against the total number of pages in each site. A stunning 13% of all pages in Fortune 500 sites have at least one broken link (many pages have several).


What isn't shown in the figures is the importance of some of these links. We saw examples of broken links to annual reports, quarterly statements, social presences (e.g. broken Facebook links) & external and internal news articles. Perhaps most worrying were the unreachable legal notices & terms & conditions documents. Along with making users leave the site (& possibly making lawyers pass out!) these things are bad for search engine optimization: Google won't be able to find these pages & sites will be penalized.

Our Method

To get a fair cross section of the Fortune 500 we chose 100 companies at random across the set. We entered their names into Google and picked the first US / international result owned by that company. This resulted in a mix of sites. Some were corporate (company news, quarterly statements etc.) and some were online presences for customers (stores & marketing). We rejected any sites which CloudTrawl didn't finish crawling in 5 hours or which contained more than 5,000 pages; these can sometimes spawn loops in page generation and unfairly bias results, and search engines also stop crawling sites if they think this is happening.

To eliminate false positives we quality checked results both randomly and where sites contained a high percentage of broken links. To make sure the headline figures weren't biased we only checked links (not images) and only checked for 404 & 410 HTTP error codes, ignoring server timeouts etc. as these can sometimes be temporary.
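For the curious, the core of a check like this can be sketched in a few lines of Python. This is a hypothetical, simplified sketch (not our production crawler): a link only counts as broken when the server definitively answers 404 or 410, and anything ambiguous like a timeout is ignored.

```python
from urllib import request, error

# Only these codes count as genuinely broken, per the method above.
BROKEN_CODES = {404, 410}

def status_of(url, timeout=10):
    """Fetch just the status code for a URL; None means 'could not determine'."""
    try:
        req = request.Request(url, method="HEAD")
        with request.urlopen(req, timeout=timeout) as resp:
            return resp.status
    except error.HTTPError as e:
        return e.code
    except error.URLError:
        return None  # timeout / DNS trouble: possibly temporary, not counted

def broken_links(urls, status_fn=status_of):
    """Return the subset of urls whose status is definitively broken."""
    return [u for u in urls if status_fn(u) in BROKEN_CODES]
```

Note that a 500 or an unreachable server never counts: with this rule the headline figures can only understate the real number of broken links, never overstate them.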


Although there are some big headline figures above, the one that troubles us most is the 13%. Essentially we're saying that more than 1 in 10 Fortune 500 web pages has a severe bug that's waiting to pop up and grab unsuspecting users.

Next time you see a 404 error you’ll at least have the consolation that they’re proven to be really common. Of course we do give webmasters the tools to fix these issues – and I think we’ve presented a decisive demonstration of why they’re needed.

Note: feel free to use the infographics in this post; we hereby release them for use on other sites.

OMG; We’ve Launched!

It's a proud day: we just launched the full live service!

We'd love it if you signed up for the free trial, and we're all ears for new feature requests & suggestions.

So, what made it into the first version? CloudTrawl is designed to watch out for stuff that goes wrong on its own, even if you don't change your site. So for the first version we have:

Link Checking (we check every page of your site, daily or weekly)

Uptime Monitoring (we check your site is online every 30 seconds / 24×7)

We also have features like complete history charting, the ability to share site reports and settings with colleagues & customers, very cool looking real time views for uptime checks, the ability to “Start Now” for link checks, image validation and a lot more.

Even this tidy set of features is really just the tip of the iceberg of what's planned for CloudTrawl. The ultimate goal: monitor absolutely everything that could go wrong with your site on its own. Over time we'll be adding more checks, and we'd love for you to tell us what extra features and checks you think CloudTrawl should have.

Happy Trawling!

Last bug is fixed!

This is a real development milestone. All of the code for CloudTrawl v1 has been written for a while and we’ve been focused entirely on testing. Our testing has included a lot of steps:

1. Automated testing; we now have a massive suite of automated tests which can be run at the click of a button

2. Functional testing; making sure every feature works as described and they all hang together well

3. Cross browser testing; making sure the interface works across browsers and operating systems

4. Scale; running up hundreds or thousands of uptime checks and hundreds of link & image checks simultaneously to make sure the system performs well with lots of people using it (if I can think of a way to make this not boring it deserves a blog post all of its own).

5. Third party testing; we got the guys over at TestLab² to do a barrage of tests to make sure we hadn’t missed anything.

And then this evening it finally happened… the last known bug was fixed. So "OMG", it's so nearly time to open the champagne and hit the release button. Watch this space!

Why we’re building CloudTrawl using Amazon Web Services (and why you should consider them too)

For those not in the know AWS is a Cloud hosting provider; they allow their customers to use servers on a pay as you go basis, starting up and shutting them down quickly and paying by the hour.

Some of their customers are traditional web sites, some are web applications. In both cases the beauty is that extra web servers can be added almost instantly to cope when peak load comes along, i.e. when lots and lots of people are using the site.

So what's so special about CloudTrawl that we need this? Are we expecting 100 users to log on one hour and then 10,000,000 the next? Well no, probably not.

The answer lies in the type of things CloudTrawl does:

1) Uptime Checking

This is nice and consistent. At launch we’ll have three servers doing this job, based in the US, Ireland and Japan. That number will grow but not overnight, as we get more customers we can add more.

2) Link Checking

This is the big reason we need a true Cloud service to run on, but it's not obvious at first sight. With the other online link checking services we've seen, you set up your account and your site is scanned perhaps once a day, once a week or once a month. That's nice and consistent, right? Surely we can balance all of that out and just add servers as we need them? Nope, afraid not. We have an awesome button that rides right over that idea:


That little Start Now button means our service needs to be truly flexible. One minute we could be checking 10 sites for broken links, the next minute it could be 1,000.

So we needed to make sure we’d always have enough servers to do all that work and that’s why we’re running on AWS. We can automatically start up as many servers as we need to do the work and our customers don’t have to wait around.
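The core decision behind that elasticity is simple. Here's a hypothetical Python sketch of the "capacity follows demand" idea; the numbers and function names are made up for illustration, not CloudTrawl's real tuning:

```python
import math

def workers_needed(queued_scans, scans_per_worker=10, max_workers=200):
    """How many crawl workers we'd want running for the current queue.

    Capacity follows demand: one minute the queue might justify a single
    worker, the next minute a hundred. (Illustrative numbers only.)
    """
    if queued_scans <= 0:
        return 0
    return min(max_workers, math.ceil(queued_scans / scans_per_worker))

def workers_to_start(queued_scans, currently_running):
    """Only ever start extra workers when demand spikes; running crawls
    are left alone rather than terminated mid-job."""
    return max(0, workers_needed(queued_scans) - currently_running)
```

On AWS the output of a function like `workers_to_start` would feed straight into an API call that launches that many instances; the point is that the fleet size is recomputed from the queue, not fixed in advance.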

If they’re worried their site might have broken links they can always hit Start Now and see CloudTrawl checking their site in real time and even fix the errors as they come in.

Pretty cool, huh?

So what’s the lesson for the web community? Well, the requirement to scale your site can come when you least expect it. Once your site is gaining some popularity it may be time to start seriously wondering: will one server always be enough? What if I suddenly get linked to from the BBC, CNN or Slashdot?

Luckily scaling isn't necessarily that hard. For example, if you have a site running static HTML, Amazon's EC2 is pretty easy to set up for scaling. If you're into WordPress, services like WP Engine are designed to scale automatically for you. It's not that old-fashioned single server hosting is dead, but if you think there's a chance you might see a big spike in traffic some day, now is a great time to start looking into options.

Sharing reports with customers

I'd like to tell you a little about what we're working on right now. In the past we've had quite a few requests from web consultants who'd like to be able to share exported reports from DeepTrawl with their own branding attached.

That’s been a priority for CloudTrawl from the beginning. If you’re a consultant working in web design we think CloudTrawl is exactly the kind of thing you’ll want to use and share with your clients, because:

– CloudTrawl proves the site you’ve created is consistently available & functioning well
– It allows both them and you to rest easy, knowing that if there's a problem you'll be alerted
– CloudTrawl is an awesome value-add service you can provide & shows you care about their site

So how are we going to make this work? Perhaps by allowing you to export a PDF containing a report and manually email it to your client? Nope, that’s so last century.

Surely the best way would be to allow them to log into CloudTrawl directly, see reports themselves and optionally allow them to change settings so they can get alerts for things like downtime and broken links.

That's exactly what we're doing. We're implementing a feature called site sharing. Your CloudTrawl account could contain perhaps tens or hundreds of sites, all being constantly monitored. You can choose to share any one of these with anyone. If they don't already have a CloudTrawl account we'll automatically send them an email inviting them to create one for free; they'll then be able to see and interact with the reports and settings you've shared with them.

As an added bonus when that user logs in they’ll see your branding.

That’s some serious added value for your clients. For a low monthly fee you’ll be able to add all the sites under your care, share their reports with your customers and prove you care about their site. Feel the love!

Uptime check frequency – why does it matter?

Since we started planning our uptime monitoring service we wanted to offer something different – not just a me-too service but a game changer. In short, we wanted to do uptime checking better than anyone else.

One of the big differentiators is how often a monitoring service checks your site. Is it once per hour? Once every 5 minutes? The industry consensus seems to be that once per minute is adequate for everyone. That's an assumption we wanted to challenge. At first, checking every minute may seem fine. When you receive a downtime alert SMS, does it really matter that it came perhaps 50 seconds after your site went down? We say yes, and here's why.


The Slashdot effect

This is a pretty common issue when running a site. You want lots of visitors; millions would be nice. You've paid for a server or hosting service which can deal with your normal amount of traffic, but then a massive spike comes along from a popular link. Now your site should be serving huge numbers of requests. But it doesn't; it falls over under the weight of the traffic.

If you know the site has gone down you may be able to quickly add capacity or deal with the influx by replacing some of the big images and keep it online. In that situation 50 seconds is too long to wait. You could easily have lost several thousand viewers. If you’re selling stuff how many sales will that lose you? We’d bet quite a lot.

Even for your regular traffic 50 seconds could mean losing important viewers. Not good.

Another time you need to know right away is when you’re doing updates to your site. Perhaps a piece of hardware is being changed out. Perhaps the server settings are being tweaked. In that situation isn’t it best to know immediately if there’s a problem? Especially since you’re probably sitting right in front of your computer ready to fix any issues.

History matters

The next reason for more frequent checks is viewing the uptime history of your site. The more often the checks, the better the history.


When viewing charts like the one above it’s important to know the figures are accurate. Is your service provider really maintaining the uptime you expect? Is it time to switch? Better data allows you to make better decisions.

So for these and many less dramatic reasons we decided that 1 minute checks simply aren't good enough. We're developing our uptime monitoring to check every 30 seconds. That's more frequent than anything we've seen anyone offer – in fact, twice as frequent as the existing market leaders.

Let’s talk a little about confirmations

There's another thing to think about when trying to get meaningful downtime alerts as fast as possible: false positives and how we deal with them. A lot of existing services will send you an alert only once more than one of their monitoring stations has seen your site as down. This is because monitoring stations themselves aren't infallible. The network could be flaky near one of the stations, and that's why it sees your site as down. So it's a good idea to get another station to confirm before alerting you.

The problem with a lot of services we’ve seen is that they’ll do this in their own sweet time. A station will check your site, see it’s down and then wait at least another minute for another one to confirm it.

We made an architectural decision that when a station sees your site is down it will immediately ask another station to check it. That way you get alerts right away, and we make sure there aren’t any false positives.
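That decision can be sketched in a few lines of Python. This is a hypothetical, simplified sketch of the idea rather than our actual station code; `stations` stands in for checks run from different regions:

```python
def site_is_down(url, stations):
    """Report 'down' only when a second station immediately confirms.

    `stations` is a list of callables; each returns True if that
    monitoring station can reach `url`.
    """
    primary, *others = stations
    if primary(url):
        return False  # the primary station reached the site: all good
    if not others:
        return False  # a single station's opinion isn't enough to alert on
    # Primary failed, so ask another station *right away* rather than
    # waiting for its next scheduled check.
    confirm = others[0]
    return not confirm(url)
```

The point of the sketch is the last two lines: the confirmation happens immediately after the first failure, so the alert isn't delayed by a full check interval.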

One last thing: Realtime

One final observation we made was that other services give you no feedback about what's happening until something goes wrong. This unnerved us. You can sit and look at some services' web interfaces and have no clue anything is happening at all. For peace of mind we added the Realtime view.


This constantly shows what CloudTrawl is doing. You can actually see the countdown until the next check, where it’ll come from and the results of the last check from every worldwide monitoring station.

To sum up, we:

Check your site every 30 seconds (twice as often as our competition).

Perform immediate confirmations (no false positives – no delays).

Show a realtime view allowing you to see exactly what CloudTrawl is doing and exactly what the state of your site is at any time.

These three reasons are why I'm personally very proud of what we're doing with our uptime checking and why I genuinely believe there is no service out there which can beat us.

Why does CloudTrawl exist?

For a long time we've been selling DeepTrawl, a desktop application which checks sites for errors such as broken links, bad spelling, invalid html, missing meta-tags and many other issues a webmaster can introduce without realizing.

It makes sense to use a desktop app to check for these things. For instance, when your site is live and not being updated you aren't going to get invalid html appearing. There's no magic; that only happens because you've changed something. So it makes sense to have a tool you run after you make changes, perhaps even running it over the HTML on your hard drive before you put it live.

The problem is, not every type of error needs your intervention to pop up and grab a user. The bug trolls can move in under the bridge at any time; it doesn't matter if you checked for trolls when you built the bridge.

Some of these issues cross over with what DeepTrawl does: just as a broken link can be created when you update a site, one can also appear on its own. The content you're pointing to on an external site changes and suddenly you have a broken link. No warning, no alarms ringing, just bam… your users are sent down a black hole when they click your link. Not good.

But it gets worse: your site could go offline. Not because you did something wrong, just because that's what websites do. The internet was designed to be very resilient; unfortunately that doesn't extend to the individual server running your site. Servers go down… website visitors get unhappy.

So we need something online constantly checking for stuff that goes wrong on its own. That's why we're now working hard on CloudTrawl. There are a lot of services which will check for downtime. There are a few which will check for broken links automatically on a weekly basis. The problem is that's several services, several monthly payments, several things to keep track of.

The idea behind CloudTrawl is you shouldn’t need several of these things, you should have one. One service which takes care of your site and tells you when bad stuff happens.

So this is what we're spending our time working on. Hopefully sometime early in 2012 DeepTrawl will get a new baby sister, and CloudTrawl can fill the gaps DeepTrawl leaves… answering the question "what about when my site breaks on its own?"

Oh, and yes, this is the first ever blog post here, so in time honored tradition:

Hello World!