Archive for the ‘Web Musings’ Category

The world’s biggest companies have boatloads of broken links

Since we recently released CloudTrawl we decided to undertake some research to prove just how valuable it is. The uptime of major websites and the damage to reputation and profits downtime causes has been written about extensively, so we decided to go a different way. Every web user has seen a broken link; they often make our blood boil & frequently people will leave a site on seeing one assuming the content they’re seeking simply doesn’t exist. 404 has become the new blue screen of death. Broken links are a real risk to reputation & profit but we’ve never seen a comprehensive study on just how common they are in major sites.

We decided to undertake that study and to perform it on the group of sites whose owners aren’t lacking in resources: the Fortune 500.

The Results

Here’s a big figure to open with:


You read that right, 92% of the sites in our sample included at least one broken link & most had several. 68% had more than 10 broken links, 49% had more than 50 and a surprising 43% of Fortune 500 sites have more than 100 broken links.

We also compared the number of pages which had broken links against the total number of pages in each site. A stunning 13% of all pages in Fortune 500 sites have at least one broken link (many pages have several).


What isn’t shown in the figures is the importance of some of these links. We saw examples of broken links to annual reports, quarterly statements, social presences (e.g. broken Facebook links) & external + internal news articles. Perhaps most worrying were the unreachable legal notices & terms & conditions documents. Along with making users leave the sites (& possibly making lawyers pass out!) these things are bad for search engine optimization. Google won’t be able to find these pages & sites will be penalized.

Our Method

To get a fair cross section of the Fortune 500 we chose 100 companies at random across the set. We entered their names into Google and picked the first US / international result owned by that company. This resulted in a mix of sites. Some were corporate (company news, quarterly statements etc.) and some were online presences for customers (stores & marketing). We rejected any sites which CloudTrawl didn’t finish crawling in 5 hours or which contained more than 5,000 pages. Sites that large can sometimes spawn loops in page generation and unfairly bias results; search engines also stop crawling sites if they think this is happening.

To eliminate false positives we quality checked results both randomly and where sites contained a high percentage of broken links. To make sure the headline figures weren’t biased we only checked links (not images) and only counted 404 & 410 HTTP error codes, ignoring server timeouts etc. as these can sometimes be temporary.
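The counting rule above (only 404 and 410 count as broken; timeouts and other failures are ignored) can be sketched in a few lines of Python. This is not CloudTrawl’s code, just a minimal illustration using the standard library. The names `fetch_status`, `is_broken` and `broken_links` are made up for the example, and the use of HEAD requests is an assumption.

```python
import urllib.error
import urllib.request

# Only these status codes count as a broken link under the rule above.
BROKEN_STATUSES = {404, 410}


def fetch_status(url, timeout=10):
    """Return the HTTP status code for url, or None if the request failed."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        # 4xx/5xx responses are raised as exceptions carrying the status code.
        return error.code
    except (urllib.error.URLError, OSError):
        # Timeouts, DNS failures and similar may be temporary, so no verdict.
        return None


def is_broken(status):
    """True only for 404 and 410; None and other codes are not counted."""
    return status in BROKEN_STATUSES


def broken_links(urls, fetch=fetch_status):
    """Return the urls whose status is 404 or 410, in their original order."""
    return [url for url in urls if is_broken(fetch(url))]
```

Passing the fetch function in as a parameter keeps the rule itself testable without touching the network. Some servers answer HEAD differently from GET, so a real checker would probably fall back to GET before calling a link broken.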


Although there are some big headline figures above, the one that troubles us most is the 13%. Essentially we’re saying that more than one in ten Fortune 500 web pages has a severe bug that’s waiting to pop up and grab unsuspecting users.

Next time you see a 404 error you’ll at least have the consolation that they’re proven to be really common. Of course we do give webmasters the tools to fix these issues – and I think we’ve presented a decisive demonstration of why they’re needed.

Note: feel free to use the infographics in this post; we hereby release them for use on other sites.

404s are so important…

… there’s even a TED video devoted to them!

(Embedded video: a talk from TED2012.)

Why we’re building CloudTrawl using Amazon Web Services (and why you should consider them too)

For those not in the know, AWS is a cloud hosting provider: it lets customers use servers on a pay-as-you-go basis, starting them up and shutting them down quickly and paying by the hour.

Some of their customers are traditional web sites, some are web applications. In both cases the beauty is that extra web servers can be added almost instantly to cope when peak load comes along, i.e. when lots and lots of people are using the site.

So what’s so special about CloudTrawl that we need this? Are we expecting 100 users to log on one hour and then 10,000,000 the next? Well no, probably not.

The answer lies in the type of things CloudTrawl does:

1) Uptime Checking

This is nice and consistent. At launch we’ll have three servers doing this job, based in the US, Ireland and Japan. That number will grow, but not overnight; as we get more customers we can add more.

2) Link Checking

This is the big reason we need a true Cloud service to run on, but it’s not obvious at first sight. With other online link checking services we’ve seen, you set up your account and your site is scanned perhaps once a day, once a week or once a month. That’s nice and consistent, right? Surely we can balance all of that out and just add servers as we need them? Nope, afraid not. We have an awesome button that rides right over that idea:


That little Start Now button means our service needs to be truly flexible. One minute we could be checking 10 sites for broken links, the next minute it could be 1,000.

So we needed to make sure we’d always have enough servers to do all that work and that’s why we’re running on AWS. We can automatically start up as many servers as we need to do the work and our customers don’t have to wait around.
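To make the idea concrete, here is a rough Python sketch of the kind of sum an autoscaler has to do. It is not our production code, and every number in it (10 sites per server, a ceiling of 100 servers) is a made-up assumption for illustration, as is the function name `servers_needed`.

```python
import math


def servers_needed(pending_sites, sites_per_server=10, min_servers=1, max_servers=100):
    """Return how many crawl servers to run for the current queue of sites."""
    # Round up so a partly filled server still gets started.
    wanted = math.ceil(pending_sites / sites_per_server)
    # Never drop below the minimum or exceed the budget ceiling.
    return max(min_servers, min(wanted, max_servers))


# The two situations described above: a quiet minute and a busy one.
print(servers_needed(10))
print(servers_needed(1000))
```

The cap matters as much as the scaling: without an upper limit, a flood of Start Now clicks would turn straight into a hosting bill.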

If they’re worried their site might have broken links they can always hit Start Now and see CloudTrawl checking their site in real time and even fix the errors as they come in.

Pretty cool, huh?

So what’s the lesson for the web community? Well, the requirement to scale your site can come when you least expect it. Once your site is gaining some popularity it may be time to start seriously wondering: will one server always be enough? What if I suddenly get linked to from the BBC, CNN or Slashdot?

Luckily scaling isn’t necessarily that hard. For example, if you have a site running static HTML, Amazon’s EC2 is pretty easy to set up for scaling. If you’re into WordPress, services like WP Engine are designed to scale automatically for you. It’s not that old-fashioned single server hosting is dead, but if you think there’s a chance you might see a big spike in traffic some day, now is a great time to start looking into options.

There’s no spam in Facebook, right?

Something rather depressing landed in my inbox today. I subscribe to the mailing list over at They send through offers every few days, some of them are pretty cool. Then there’s this:


It seems it’s now possible to pay to get Facebook likes. Presumably the company offers the people giving the likes money or some other sweetener to encourage them to hit the blue button of power.

This strikes me as very much the same thing as black hat SEO: tricking the system to seem popular and sell more stuff. It’s a shame really. We’re planning on putting like buttons on when it goes live, but with people buying likes doesn’t that cheapen the whole thing just a little?

Where can I see how Google ranks my site in the USA?

Just a very quick tip we think is worth sharing.

When you’re outside of the USA and you go to, quite often you’ll find that you’re redirected to one of Google’s country-specific sites. This presents a problem for webmasters. If we’re outside of the 50 states, how can we tell where our site is ranking in the world’s biggest economy?

Well thankfully the answer is really simple. All you need to do is use this address for Google:

This will always take you to the US version and you can see where your site is ranking for specific keywords as if you were sitting in the USA.

Book recommendation: The Lean Startup

Many of us are quite familiar with the concept of testing ideas. Not sure which Buy Now page design will work better? Use an A/B test and find out. Not sure if a feature of your site is clear and easy to use? Get some users in front of it and watch how they use it.

Services like Google Website Optimizer and its successor, Google Analytics experiments, make this so simple that it seems crazy not to test any idea with real people and get real figures on what works and what doesn’t.
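If you’d rather see what “real figures” means in practice, here is a small Python sketch of one common way to judge an A/B test, a two-proportion z-test. It isn’t taken from the book or from any Google tool; the function name `ab_test` and the visitor numbers in the usage line are invented for the example.

```python
import math


def ab_test(conversions_a, visitors_a, conversions_b, visitors_b):
    """Two-proportion z-test: return (z, two-sided p-value) for B versus A."""
    rate_a = conversions_a / visitors_a
    rate_b = conversions_b / visitors_b
    # Pooled rate under the assumption that both pages convert equally well.
    pooled = (conversions_a + conversions_b) / (visitors_a + visitors_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / visitors_a + 1 / visitors_b))
    z = (rate_b - rate_a) / std_err
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical figures: 50 of 1,000 visitors bought on page A, 80 of 1,000 on page B.
z, p = ab_test(50, 1000, 80, 1000)
print(f"z = {z:.2f}, p = {p:.4f}")
```

A small p-value suggests the difference between the two pages is unlikely to be chance. The sketch divides by zero if nobody (or everybody) converts on both pages, so a real tool would need to guard against that.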

This is the type of thinking encouraged by The Lean Startup, an influential new book by Eric Ries. He systematically dismantles the reasons for using what he calls “vanity metrics”, e.g. how many new visitors your site is getting per month. The reason he doesn’t like this type of thinking? Because if you’re making changes to your site it’s way too easy to imagine the changes you’re making are the reason for the increase in visitors. In fact the increase may be due to word of mouth, and your graphs would keep going up even if you did nothing.

It may appear that this book is for hardened software developers and their CEOs, not for web designers and site owners, but we’d argue there’s a lot to be learned here for both camps. A web site is a big interactive thing and site owners can easily fall into business traps just like software developers.

Now a word of warning: In our opinion the book became a little repetitive, but even if you only read the first few chapters it could get your brain buzzing and soon enough you may find your thinking completely rewired and your way of viewing your work could be greatly enhanced. You can get the book here.

Want cool charts for your site?

With the release of CloudTrawl drawing closer we’ve been concentrating on polishing the user interface. We’ve been working hard to make it feel like something webmasters intuitively already know how to use. To achieve this we’ve taken some inspiration from services such as Google Analytics. Their interface gets one thing really, really right: charts.

Analytics charts look awesome and are really easy to use. Based on that inspiration we’ve come up with our own charting system which we believe is just as cool and intuitive:


The fully interactive chart above took only two days to implement. Now we can re-use it over and over to show lots of different kinds of data. It isn’t Flash, it isn’t a static image, and it has rollovers and all the other neat stuff you’d expect.

So how did we get something so feature rich up and running so fast? Easy; Google Charts.

Like a lot of the stuff Google does this is so easy to use it makes you want to cry. We’re a Java shop and so we used the GWT API which allowed us to create the extra controls for viewing data between two dates.

But if all you need is a simple chart with some copy and paste HTML this is really easy. Google’s Quick Start guide has some script which you can copy, paste and edit to show your first chart with your own data in a couple of minutes.
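If your data lives in a script rather than in your head, you can also generate that copy and paste page. The Python sketch below is our own illustration, not something from Google’s guide. It writes a small HTML page that loads Google Charts through its current `loader.js` script and draws a line chart. The function name `chart_page`, the `__ROWS__` and `__TITLE__` placeholders and the sample figures are all made up for the example.

```python
import json

# HTML skeleton; __ROWS__ and __TITLE__ are placeholders filled in below.
PAGE_TEMPLATE = """<html>
  <head>
    <script src="https://www.gstatic.com/charts/loader.js"></script>
    <script>
      google.charts.load('current', {packages: ['corechart']});
      google.charts.setOnLoadCallback(drawChart);
      function drawChart() {
        var data = google.visualization.arrayToDataTable(__ROWS__);
        var chart = new google.visualization.LineChart(document.getElementById('chart'));
        chart.draw(data, {title: __TITLE__});
      }
    </script>
  </head>
  <body>
    <div id="chart" style="width: 600px; height: 300px;"></div>
  </body>
</html>
"""


def chart_page(title, header, rows):
    """Return an HTML page that draws rows as a Google Charts line chart."""
    # arrayToDataTable expects the column names as the first row.
    table = [list(header)] + [list(row) for row in rows]
    page = PAGE_TEMPLATE.replace("__ROWS__", json.dumps(table))
    return page.replace("__TITLE__", json.dumps(title))


# Hypothetical data: broken links found per day.
html = chart_page("Broken links", ["Day", "Broken links"], [["Mon", 3], ["Tue", 1]])
with open("chart.html", "w", encoding="utf-8") as handle:
    handle.write(html)
```

Open the resulting chart.html in a browser to see the chart. Using `json.dumps` keeps the data valid JavaScript, though a real page built from untrusted text would need proper escaping as well.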

A shot of their chart gallery is below to give you an idea of some of the cool stuff you can use to jazz up your site:


Many data centers run far too cold

I remember a few years ago I was working in a data center which was so cold we needed to wear sweaters and gloves just to work.

Recently we’ve heard a lot about hosting providers moving to colder countries just to save on the expense and environmental impact of keeping servers cool while avoiding downtime due to machines running too hot.

I just read an excellent Wired article which concludes that many machine rooms run far too cold anyway. Many of us could save a lot of money, CO2 & frozen hands by just dialling up the temperature. It’s a highly recommended read for anyone involved in operating a data center:

World’s Data Centers Refuse to Exit Ice Age

Regression tests are awesome things

So I’m sitting here waiting for my regression tests for a new CloudTrawl component to run. If you’re not in the know a regression test is a bit of code which tests some other code. Kind of like a check list to stop a programmer making dumb mistakes.
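For anyone who has never seen one, here is roughly what a regression test looks like in Python. It’s a toy example rather than anything from CloudTrawl: `slugify` is a made-up function that turns a post title into a URL slug, and the test pins down behaviour we never want to break again.

```python
import re


def slugify(title):
    """Turn a post title into a URL slug."""
    # Collapse every run of non-alphanumeric characters into one hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower())
    return slug.strip("-")


def test_slugify_regressions():
    # Each assertion is one item on the check list.
    assert slugify("Hello World") == "hello-world"
    assert slugify("  404's Are so important...  ") == "404-s-are-so-important"
    assert slugify("") == ""


test_slugify_regressions()
```

If someone later “improves” `slugify` and one of those cases stops working, the test fails straight away instead of a user finding the problem.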

While my mind wandered I thought –

“Why on earth don’t regular web sites have regression tests?!”

And then it hit me: that’s what we do. Seriously, DeepTrawl & CloudTrawl are the regression test. Awesome. Job done. Thought over.

Legacy Internet Explorer vs. 99.999

Right now we’re hearing a lot about individuals being asked (begged?) to move away from Internet Explorer 6. As many readers may know even Microsoft is getting in on the act.

This is important for home users. Security is a big problem for all web users; combining a lack of security awareness with a browser which won’t be patched at all in coming years is a really, really nasty mix.

There’s a reason why many of these users haven’t already upgraded: if they’re really happy with IE6 that probably means they’re not into the latest-greatest web apps. They spend their time doing email and browsing Amazon & eBay. Productivity boosts possibly aren’t their bag.

But here’s the thing. There’s a very large group of people who really would benefit from using the latest web technologies: pretty much everyone who works in an office.

Google Docs is basic but awesome for collaboration. Lucid Charts freaking rocks. I challenge anyone to watch their demo and tell me Visio is more compelling. More and more web apps are appearing which push not just what a web browser can do but what a productive professional can do.

Even better, I.T. departments don’t need to get involved for users to start playing with these things and prove their value. Maybe for free, maybe by using their department’s expense account, users themselves can start using these apps and test their value without I.T. needing to invest their time or budget. I would imagine that I.T. departments would be thrilled by this.

In the I.T. world there’s long been talk about the five nines. This means systems should have a guaranteed uptime of 99.999%. This amounts to no more than about 5 minutes of downtime per year. This is an excellent goal (which of course as an uptime monitor we applaud).
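The five minute figure is easy to check. Here is the arithmetic as a tiny Python sketch; the function name `allowed_downtime_minutes` is just something we made up, and it ignores leap years.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def allowed_downtime_minutes(uptime_percent):
    """Return the minutes of downtime per year permitted by an uptime target."""
    return MINUTES_PER_YEAR * (1 - uptime_percent / 100)


# Three, four and five nines side by side.
for target in (99.9, 99.99, 99.999):
    minutes = allowed_downtime_minutes(target)
    print(f"{target}% uptime allows about {minutes:.2f} minutes of downtime per year")
```

Each extra nine cuts the allowance by a factor of ten, which is why five nines works out at a little over 5 minutes a year.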

Seeking this value means I.T. has its head screwed on when buying systems; they’re serving the organization well by keeping its employees productive.

So here’s the elephant in the room. The next improvement in productivity may not come from internal systems; it may very well come from applications chosen by end users, hosted in the cloud.

That makes it seem really, really crazy that many I.T. departments hang on to historic browsers which won’t work well with these new productivity aids.

We understand the reasons. IE6 at least has an understood security profile. Probably more importantly, many organizations have internal applications which were written specifically for IE6 and absolutely will not work with anything else.

Since Microsoft plan to stop supporting it, organizations are going to have to move beyond IE6, and many are doing so right now. At the same time they may need to upgrade or dump their creaking IE6 compatible applications. That will hurt. Users will lose data. Hair will be torn out.

I believe many applications used by organizations in the future will be chosen by the users and won’t have much to do with the I.T. department. But there will still be many applications which will be developed for or sold to organizations from the top down. These will cost a lot of money, probably $100k+.

So here’s my plea to larger organizations: when considering buying any new software system please, please make sure it’s standards compliant. It should work perfectly in every modern browser. Its HTML should validate. It shouldn’t use Flash or anything else which keeps you locked in to technologies which may disappear.

This way next time I.T. wants to upgrade browsers company wide they’ll be able to do it without fear of breaking everything and users can use the latest and greatest innovative web-apps without being held back by an ageing browser no-one in the company wants to keep.

The commitment to 99.999% uptime is awesome. Let’s make sure we can keep a similar commitment to keeping workers as productive as they want to be.