Web content analyzer

Martin Scharm

11 years ago

[This article was first published on binfalse » R, and kindly contributed to R-bloggers].


I just developed a small crawler to check my online content at binfalse.de for W3C validity and for the availability of external links. Here is the code and some statistics…

The new year has just started and I wanted to check what I produced in my blog last year. Mainly I wanted to improve quality: my aim was to make sure that all my blog content is W3C valid and that all external resources I'm linking to are still available.
First I thought about parsing the database content, but in the end I decided to check the real content as it is served to all of you. For me, at least, the easiest way to do something like this is Perl.
The following tasks had to be done for each page of my blog (called a "site" in the code, logs and statistics below):

Check whether the W3C validator accepts the page
For each link to an external resource: check that it responds with 200 OK
For each internal link: check that page too, unless it was already checked

While checking each page I also saved its number of internal and outgoing links to a file to get an overview.
Here is the code:

#!/usr/bin/perl -w

# for more information visit:
#
# http://binfalse.de

use strict;
use LWP::UserAgent;
use XML::TreeBuilder;
use WebService::Validator::HTML::W3C;

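# only if the url looks like ^(http(s)?:\/\/)?[^\/]*$domain it's recognized to
# be an internal link
my $domain = "example.org/";

# mark the start page as visited, otherwise the first link back to it
# would queue it a second time
my %visited = ( "http://".$domain => 1 );
my @tovisit = ( "http://".$domain );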

my $browser = LWP::UserAgent->new;
$browser->timeout(10);
my $validator = WebService::Validator::HTML::W3C->new( detailed => 1 );

while (@tovisit)
{
    my $act = shift @tovisit;
    print "processing: ".$act." (todo:".@tovisit.")\n";

    my $response = $browser->get ($act);
    # is our site available?
    if ($response->is_success)
    {
        # check w3c validity
        if ( $validator->validate($act) )
        {
            if ( $validator->is_valid )
            {
                validator ("ok", $validator->uri);
            }
            else
            {
                foreach my $error ( @{$validator->errors} )
                {
                    validator ("error", $validator->uri, $error->msg, $error->line);
                }
            }
        }
        else
        {
            validator ("failed", $validator->validator_error);
        }

        my $iLinks = 0;
        my $oLinks = 0;

        my $xml = XML::TreeBuilder->new();
        $xml->parse ($response->decoded_content);
        foreach my $link ($xml->find_by_tag_name ('a'))
        {
            my $href = $link->attr ('href');
            next unless defined $href;

            # does this link point to our domain?
            if ($href =~ m/^(http(s)?:\/\/)?[^\/]*$domain/i)
            {
                # internal
                # add to array if:
                # not yet visited &&
                # link ends with / (all my content ends with / and i don't want to check .tgz and .png and so on…)
                push (@tovisit, $href) if (! defined $visited{$href} && $href =~ m/\/$/);
                $iLinks++;
            }
            else
            {
                # external -> check if the site is available, but only request
                # http(s) links that have not been checked before
                if (! defined $visited{$href} && $href =~ m/^http/i)
                {
                    my $res = $browser->get ($href)->code;
                    failed ($act, $href, $res) if ($res != 200);
                }
                $oLinks++;
            }

            $visited{$href} = 1;
        }
        # for data analysis
        loglinks ($iLinks, $oLinks);
    }
    else
    {
        failed ($act);
    }
}

# log a page that could not be fetched, or a dead external link found on it
sub failed
{
    my $site = shift;
    my $ext = shift;
    my $res = shift;
    open (my $fh, ">>", "/tmp/check-links.fail") or die "cannot open /tmp/check-links.fail: $!";
    print {$fh} $site . "\n" if (! defined $ext);
    print {$fh} $site . " -> " . $ext . " (" . $res .")\n" if (defined $ext);
    close $fh;
}

# log a W3C validation result
sub validator
{
    my $status = shift;
    my $site = shift;
    my $msg = shift;
    my $line = shift;
    open (my $fh, ">>", "/tmp/check-links.val") or die "cannot open /tmp/check-links.val: $!";
    print {$fh} $status . ": " . $site . "\n" if (! defined $msg || ! defined $line);
    print {$fh} $status . ": " . $site . " -> " . $msg. " (" . $line .")\n" if (defined $msg && defined $line);
    close $fh;
}

# log the number of internal and external links of one page
sub loglinks
{
    my $intern = shift;
    my $extern = shift;
    open (my $fh, ">>", "/tmp/check-links.log") or die "cannot open /tmp/check-links.log: $!";
    print {$fh} $intern . " " . $extern . "\n";
    close $fh;
}

You need to install LWP::UserAgent, XML::TreeBuilder and WebService::Validator::HTML::W3C. On a Debian-based distribution just execute:

aptitude install libxml-treebuilder-perl libwww-perl libwebservice-validator-html-w3c-perl libxml-xpath-perl

The script checks every page it can find whose URL matches

m/^(http(s)?:\/\/)?[^\/]*$domain/i

So adjust the $domain variable at the start of the script to fit your needs.
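
To get a feeling for what this pattern treats as internal, here is a minimal sketch that applies it to a few made-up URLs. The helper names is_internal and is_internal_strict are hypothetical and not part of the script. The strict variant is only a suggestion: it quotes $domain with \Q…\E so that the dot matches a literal dot instead of any character.

```perl
use strict;
use warnings;

my $domain = "example.org/";   # placeholder, as in the script

# same pattern as in the crawler
sub is_internal
{
    my ($href) = @_;
    return $href =~ m/^(http(s)?:\/\/)?[^\/]*$domain/i ? 1 : 0;
}

# assumption: a stricter variant that escapes regex metacharacters in $domain
sub is_internal_strict
{
    my ($href) = @_;
    return $href =~ m/^(https?:\/\/)?[^\/]*\Q$domain\E/i ? 1 : 0;
}

# made-up URLs for illustration
foreach my $href ("http://example.org/", "https://www.example.org/blog/",
                  "example.org/about/", "http://example.com/", "/relative/path/")
{
    printf "%-32s %s\n", $href, is_internal ($href) ? "internal" : "external";
}
```

Note that relative links such as /relative/path/ do not match and are therefore counted as external by the crawler.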
All W3C results are written to /tmp/check-links.val. The following line types may be found in that file:

# SITE is valid
ok: SITE
# SITE contains invalid FAILURE at line number LINE
error: SITE -> FAILURE (LINE)
# failed to connect to W3C because of CAUSE
failed: CAUSE

So it should be easy to parse if you are searching for invalid pages.
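
As a rough illustration of such parsing, the following sketch counts the error lines per page. The helper count_errors and the sample lines are made up for this example; in practice you would read the lines from /tmp/check-links.val instead.

```perl
use strict;
use warnings;

# count "error: SITE -> FAILURE (LINE)" entries per SITE
sub count_errors
{
    my (@lines) = @_;
    my %errors;
    foreach my $line (@lines)
    {
        if ($line =~ m/^error: (\S+) -> (.*) \((\d+)\)$/)
        {
            $errors{$1}++;
        }
    }
    return %errors;
}

# made-up sample lines, not real results
my @sample = (
    "ok: http://example.org/",
    "error: http://example.org/a/ -> required attribute \"alt\" not specified (12)",
    "error: http://example.org/a/ -> end tag for \"img\" omitted (12)",
    "failed: could not contact validator",
);

my %errors = count_errors (@sample);
foreach my $site (sort keys %errors)
{
    printf "%s has %d error(s)\n", $site, $errors{$site};
}
```

Lines starting with ok: or failed: are simply ignored here; a real report would probably want to list the failed: entries as well.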
Each external link that doesn't answer with 200 OK produces an entry in /tmp/check-links.fail of the form

SITE -> EXTERNAL (RESPONSE_CODE)
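
Entries of this form can be summarised with a few lines of Perl. The sketch below counts failures per response code and, separately, the failures whose target does not match a given pattern (for example to leave out one noisy host). The helper summarise_failures and the sample lines are hypothetical.

```perl
use strict;
use warnings;

# returns (hash ref: failures per response code,
#          number of failures whose target does not match $skip)
sub summarise_failures
{
    my ($skip, @lines) = @_;
    my %by_code;
    my $others = 0;
    foreach my $line (@lines)
    {
        # lines without " -> " are pages that could not be fetched at all
        next unless $line =~ m/^(\S+) -> (\S+) \((\d+)\)$/;
        my ($site, $target, $code) = ($1, $2, $3);
        $by_code{$code}++;
        $others++ unless $target =~ $skip;
    }
    return (\%by_code, $others);
}

# made-up sample lines, not real results
my @sample = (
    "http://example.org/a/ -> http://twitter.com/share (401)",
    "http://example.org/b/ -> http://twitter.com/share (401)",
    "http://example.org/b/ -> http://example.com/gone (404)",
    "http://example.org/missing/",
);

my ($by_code, $others) = summarise_failures (qr/twitter\.com/i, @sample);
foreach my $code (sort keys %$by_code)
{
    printf "%s: %d\n", $code, $by_code->{$code};
}
printf "failures not matching the skip pattern: %d\n", $others;
```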

In addition, for each page it writes the number of internal links and the number of external links to /tmp/check-links.log, one line per page.

If you want to try it on your own site, remember to change the content of $domain and take care of the pattern that guards the push onto @tovisit (line 65 of the original script):

$href =~ m/\/$/

Because I don't want to check internal links to files like .png or .tgz, the URL has to end with /. All my pages containing parseable XML end with /; if yours don't, try to find a similar expression.
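
One possible "similar expression", sketched under the assumption that your non-page files can be recognised by their extension: instead of requiring a trailing /, reject URLs that end with a known file extension. The helper is_page and the extension list are made up for illustration and need to be adapted to your content.

```perl
use strict;
use warnings;

# hypothetical list of extensions that should not be crawled
my $skip_ext = qr/\.(?:png|jpe?g|gif|tgz|zip|pdf)$/i;

sub is_page
{
    my ($href) = @_;
    # drop a query string or fragment before looking at the extension
    (my $path = $href) =~ s/[?#].*$//;
    return $path !~ $skip_ext ? 1 : 0;
}

# made-up URLs for illustration
foreach my $href ("http://example.org/post/", "http://example.org/about.html",
                  "http://example.org/pic.PNG", "http://example.org/src.tgz?v=1")
{
    printf "%-32s %s\n", $href, is_page ($href) ? "crawl" : "skip";
}
```

In the crawler this would replace the trailing-slash test, for example `push (@tovisit, $href) if (! defined $visited{$href} && is_page ($href));`. Keep in mind that every page that gets crawled still has to be parseable by XML::TreeBuilder.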

As I said, I've looked at the results a bit. Here are some statistics (as of 2011/Jan/06):

Processed sites	481
Sites containing W3C errors	38
Number of errors	63
Mean error per site	0.1309771
Mean of internal/external links per site	230.9833 / 15.39875
Median of internal/external links per site	216 / 15
Dead external links	82
Dead external links w/o Twitter	5
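
Means and medians like the ones above can be derived directly from the "INTERNAL EXTERNAL" lines in check-links.log. The following is only a sketch with made-up numbers; the helpers mean and median are not part of the crawler.

```perl
use strict;
use warnings;

sub mean
{
    my @v = @_;
    my $sum = 0;
    $sum += $_ foreach @v;
    return $sum / @v;
}

sub median
{
    my @v = sort { $a <=> $b } @_;
    my $mid = int (@v / 2);
    # odd count: middle element, even count: mean of the two middle elements
    return @v % 2 ? $v[$mid] : ($v[$mid - 1] + $v[$mid]) / 2;
}

# made-up log lines, not the real data
my @log = ("200 10", "220 15", "240 20", "260 15");

my (@intern, @extern);
foreach my $line (@log)
{
    my ($i, $e) = split ' ', $line;
    push @intern, $i;
    push @extern, $e;
}

printf "mean of internal/external links:   %s / %s\n", mean (@intern), mean (@extern);
printf "median of internal/external links: %s / %s\n", median (@intern), median (@extern);
```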

Figure 1: outgoing links

Most of the errors are fixed by now; the rest are in progress.
The high number of dead links comes from the little Twitter buttons at the end of each article. My crawler is of course not authorized to tweet, so Twitter responds with 401 Unauthorized. One of the other five fails because of a certificate problem, and the administrators of the remaining dead links have been informed.

I also analyzed the outgoing links per page. I clustered them with k-means; the result can be seen in figure 1. How did I produce this graphic? Here is some R code:

library(MASS)

x <- read.table("check-links.log")
colnames(x) <- c("internal", "external")
intern <- x[,1]
extern <- x[,2]
# 2D kernel density estimate for the contour lines
z <- kde2d(intern, extern, n=50)

# cluster
(cl <- kmeans(x, 2))

# draw
png("check-links.png", width = 600, height = 600)

# save actual settings (only those that can be restored)
op <- par(no.readonly = TRUE)
# main plot top right, one boxplot to its left, one below it
layout( matrix( c(2,1,0,3), 2, 2, byrow=TRUE ), c(1,6), c(4,1))

par(mar=c(1,1,5,2))
contour(z, col = "black", lwd = 2, drawlabels = FALSE)
points(x, col = 2*cl$cluster, pch = 2 * (cl$cluster - 1), lwd=2, cex=1.3)
points(cl$centers, col = c(2,4), pch = 13, cex=3, lwd=3)
rug(side=1, intern )
rug(side=2, extern )
abline(lm( external ~ internal, data = x))
title(main = "internal -vs- external links per site")

# boxplot left
par(mar=c(1,2,5,1))
boxplot(extern, axes=FALSE)
title(ylab='external links', line=0)

# boxplot bottom
par(mar=c(5,1,1,2))
boxplot(intern, horizontal=TRUE, axes=FALSE)
title(xlab='internal links', line=1)
# restore settings
par(op)

dev.off()

You're right, there is a lot of stuff in the image that is not essential, but take it as an example that shows R beginners what is possible. Maybe you want to produce similar graphics!?

Download:
Perl: check-links.pl
R: check-links.R
(Please take a look at the man-page)
