Web content analyzer
I just developed a small crawler to check my online content at binfalse.de for W3C validity and for the availability of external links. Here is the code and some statistics…
The new year has just started, and I wanted to check what I produced in my blog last year. Mainly I wanted to ensure quality: my aim was to make sure that all my blog content is W3C valid and that all external resources I’m linking to are still available.
First I thought about parsing the database content, but in the end I decided to check the real content as it is available to all of you. The easiest way to do something like this is with Perl, at least for me.
The following tasks had to be done for each site of my blog:
- Check if W3C likes the site
- For each link to an external resource: check if it responds with 200 OK
- For each internal link: check that site too, if not already checked
While checking each site I also saved its number of internal and outgoing links to a file to get an overview.
Here is the code:
```perl
#!/usr/bin/perl -w
# for more information visit:
#
#     http://binfalse.de

use strict;
use LWP::UserAgent;
use XML::TreeBuilder;
use WebService::Validator::HTML::W3C;

# only if the url looks like ^(http(s)?:\/\/)?[^\/]*$domain it's recognized to
# be an internal link
my $domain = "example.org/";

my %visited = ();
my @tovisit = ( "http://".$domain );
my $browser = LWP::UserAgent->new;
$browser->timeout(10);
my $validator = WebService::Validator::HTML::W3C->new( detailed => 1 );

while (@tovisit)
{
    my $act = shift @tovisit;
    print "processing: ".$act." (todo:".@tovisit.")\n";

    my $response = $browser->get ($act);
    # our site avail !?
    if ($response->is_success)
    {
        # check w3c validity
        if ( $validator->validate($act) )
        {
            if ( $validator->is_valid )
            {
                validator ("ok", $validator->uri);
            }
            else
            {
                foreach my $error ( @{$validator->errors} )
                {
                    validator ("error", $validator->uri, $error->msg, $error->line);
                }
            }
        }
        else
        {
            validator ("failed", $validator->validator_error);
        }

        my $iLinks = 0;
        my $oLinks = 0;
        my $xml = XML::TreeBuilder->new();
        $xml->parse ($response->decoded_content);
        foreach my $link ($xml->find_by_tag_name ('a'))
        {
            my $href = $link->attr ('href');
            next unless defined $href;
            # links this link to our domain?
            if ($href =~ m/^(http(s)?:\/\/)?[^\/]*$domain/i)
            {
                # intern
                # add to array if:
                #   not yet visited &&
                #   link ends with / (all my content ends with / and i don't want to check .tgz and .png and so on...)
                push (@tovisit, $href) if (! defined $visited{$href} && $href =~ m/\/$/);
                $iLinks++;
            }
            else
            {
                # extern -> check if site is available
                my $res = $browser->get ($href)->code;
                failed ($act, $href, $res) if (! defined $visited{$href} && $href =~ m/^http/i && $res != 200);
                $oLinks++;
            }
            $visited{$href} = 1;
        }
        # for data analyzing
        loglinks ($iLinks, $oLinks);
    }
    else
    {
        failed ($act);
    }
}

sub failed
{
    my $site = shift;
    my $ext = shift;
    my $res = shift;
    open FAIL, ">>/tmp/check-links.fail";
    print FAIL $site . "\n" if (! defined $ext);
    print FAIL $site . " -> " . $ext . " (" . $res .")\n" if (defined $ext);
    close FAIL;
}

sub validator
{
    my $status = shift;
    my $site = shift;
    my $msg = shift;
    my $line = shift;
    open VAL, ">>/tmp/check-links.val";
    print VAL $status . ": " . $site . "\n" if (! defined $msg || ! defined $line);
    print VAL $status . ": " . $site . " -> " . $msg. " (" . $line .")\n" if (defined $msg && defined $line);
    close VAL;
}

sub loglinks
{
    my $intern = shift;
    my $extern = shift;
    open LOG, ">>/tmp/check-links.log";
    print LOG $intern . " " . $extern . "\n";
    close LOG;
}
```
You need to install LWP::UserAgent, XML::TreeBuilder and WebService::Validator::HTML::W3C. On a Debian-based distribution just execute:

```
aptitude install libxml-treebuilder-perl libwww-perl libwebservice-validator-html-w3c-perl libxml-xpath-perl
```
The script checks all sites that it can find and that match

```perl
m/^(http(s)?:\/\/)?[^\/]*$domain/i
```

So adjust the $domain variable at the start of the script to fit your needs.
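To get a feeling for which links this pattern treats as internal, here is a minimal sketch that runs the same expression against a few made-up URLs. The helper name `is_internal` and the sample links are assumptions for illustration only; the pattern itself is the one from the script.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# same value as in the crawler; adjust to your own site
my $domain = "example.org/";

# hypothetical helper: 1 if $href matches the crawler's internal-link pattern, else 0
sub is_internal {
    my ($href) = @_;
    return $href =~ m/^(http(s)?:\/\/)?[^\/]*$domain/i ? 1 : 0;
}

# made-up sample links
my @samples = (
    "http://example.org/2010/12/some-post/",
    "https://www.example.org/about/",
    "example.org/feed/",
    "http://other.example.com/example.org/",
    "/relative/path/",
);

foreach my $href (@samples) {
    printf "%-45s %s\n", $href, is_internal($href) ? "internal" : "external";
}
```

Note that relative links such as `/relative/path/` do not match, so they are counted as external. The dot in $domain is also not escaped and therefore matches any character; writing `\Q$domain\E` inside the pattern would be a stricter variant.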
It writes all W3C results to /tmp/check-links.val. The following line types may be found in that file:

```
# SITE is valid
ok: SITE
# SITE contains invalid FAILURE at line number LINE
error: SITE -> FAILURE (LINE)
# failed to connect to W3C because of CAUSE
failed: CAUSE
```

So it should be easy to parse if you are searching for invalid sites.
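A parser for that file could look roughly like the following sketch. The helper name `summarize_val` and the sample lines are made up; only the three line types come from the script.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# hypothetical helper: count "error:" lines per site and collect "failed:" causes
sub summarize_val {
    my @lines = @_;
    my (%errors, @failed);
    foreach my $line (@lines) {
        chomp $line;
        if ($line =~ m/^error: (\S+)/) {
            # one line per validation error, so counting lines counts errors
            $errors{$1}++;
        }
        elsif ($line =~ m/^failed: (.*)$/) {
            push @failed, $1;
        }
        # "ok: SITE" lines need no action
    }
    return (\%errors, \@failed);
}

# made-up sample lines in the format described above
my @sample = (
    "ok: http://example.org/\n",
    "error: http://example.org/a/ -> some validator message (12)\n",
    "error: http://example.org/a/ -> another validator message (40)\n",
    "failed: some connection problem\n",
);

my ($errors, $failed) = summarize_val(@sample);
print "$_: $errors->{$_} error(s)\n" for sort keys %$errors;
print "validator failures: " . scalar(@$failed) . "\n";
```

To run it on real results, pass the lines read from /tmp/check-links.val instead of the sample array.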
Each external link that doesn’t answer with 200 OK produces an entry in /tmp/check-links.fail of the form

```
SITE -> EXTERNAL (RESPONSE_CODE)
```

Additionally, for each site it writes the number of internal links and the number of external links to /tmp/check-links.log.
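If you just want a quick summary of the link counts without leaving Perl, a sketch like the following computes mean and median per column. The helper names `mean` and `median` and the sample numbers are assumptions; the "INTERNAL EXTERNAL" line format is the one written by `loglinks`.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# arithmetic mean of a list of numbers (undef for an empty list)
sub mean {
    my @v = @_;
    return undef unless @v;
    my $sum = 0;
    $sum += $_ for @v;
    return $sum / @v;
}

# median: middle value of the sorted list, or the average of the two
# middle values if the list has an even number of elements
sub median {
    my @v = sort { $a <=> $b } @_;
    return undef unless @v;
    my $mid = int(@v / 2);
    return @v % 2 ? $v[$mid] : ($v[$mid - 1] + $v[$mid]) / 2;
}

# made-up sample in the log format "INTERNAL EXTERNAL", one line per site
my @log = ("210 14", "230 15", "250 20", "216 9");

my (@intern, @extern);
foreach my $line (@log) {
    my ($i, $o) = split ' ', $line;
    push @intern, $i;
    push @extern, $o;
}

printf "mean   internal/external: %.2f / %.2f\n", mean(@intern), mean(@extern);
printf "median internal/external: %s / %s\n", median(@intern), median(@extern);
```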
If you want to try it on your site, remember to change the content of $domain and take care of the pattern in the condition that queues internal links:

```perl
$href =~ m/\/$/
```

Because I don’t want to check internal links to files like .png or .tgz, the URL has to end with /. All my sites containing parseable XML end with /; if yours don’t, try to find a similar expression.
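One possible alternative for sites whose pages don’t end with / is to exclude links by file extension instead. This is only a sketch: the helper name `looks_like_page` and the extension list are assumptions, not part of my script.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# hypothetical list of extensions that should not be crawled
my @skip_ext = qw(png jpg jpeg gif tgz zip pdf);
my $skip_re  = join '|', map { quotemeta($_) } @skip_ext;

# hypothetical helper: 0 if the URL path ends in one of the extensions above, else 1
sub looks_like_page {
    my ($href) = @_;
    # strip query string and fragment before looking at the extension
    (my $path = $href) =~ s/[?#].*$//;
    return $path =~ m/\.(?:$skip_re)$/i ? 0 : 1;
}

# made-up sample links
for my $href ("http://example.org/post.html", "http://example.org/pic.PNG",
              "http://example.org/src.tgz?download=1", "http://example.org/dir/") {
    print "$href => ", (looks_like_page($href) ? "crawl" : "skip"), "\n";
}
```

In the crawler, `looks_like_page($href)` would then take the place of `$href =~ m/\/$/`.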
As I said, I’ve looked at the results a bit. Here are some statistics (as of 2011/Jan/06):
Processed sites | 481 |
Sites containing W3C errors | 38 |
Number of errors | 63 |
Mean error per site | 0.1309771 |
Mean of internal/external links per site | 230.9833 / 15.39875 |
Median of internal/external links per site | 216 / 15 |
Dead external links | 82 |
Dead external links w/o Twitter | 5 |
Most of the errors are now repaired; the others are in progress.
The high number of links that aren’t working anymore comes from the little Twitter buttons at the end of each article. My crawler is of course not authorized to tweet, so Twitter responds with 401 Unauthorized. One of the other five fails because of a certificate problem; the administrators of the other dead links have been informed.
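Separating the Twitter entries from the remaining dead links can be done with a few lines of Perl. The following is a rough sketch, not what I actually ran: the helper name `split_failures`, the sample lines and the Twitter pattern are assumptions; only the line format is the one written to /tmp/check-links.fail.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# hypothetical helper: split fail-file entries into those whose target matches
# $ignore_re and all the others
sub split_failures {
    my ($ignore_re, @lines) = @_;
    my (@ignored, @rest);
    foreach my $line (@lines) {
        chomp $line;
        # lines holding only a SITE (the site itself was unreachable) are skipped
        next unless $line =~ m/^(.+?) -> (.+) \((\d+)\)$/;
        my $entry = { site => $1, target => $2, code => $3 };
        if ($entry->{target} =~ $ignore_re) {
            push @ignored, $entry;
        }
        else {
            push @rest, $entry;
        }
    }
    return (\@ignored, \@rest);
}

# made-up sample lines in the format of /tmp/check-links.fail
my @sample = (
    "http://example.org/a/ -> http://twitter.com/home?status=a (401)\n",
    "http://example.org/b/ -> http://twitter.com/home?status=b (401)\n",
    "http://example.org/b/ -> http://gone.example.net/page (404)\n",
);

# assumption: the tweet buttons all point to twitter.com
my $twitter = qr{^https?://(?:www\.)?twitter\.com/}i;

my ($ignored, $rest) = split_failures($twitter, @sample);
printf "dead links: %d, without Twitter: %d\n", @$ignored + @$rest, scalar @$rest;
print "$_->{site} -> $_->{target} ($_->{code})\n" for @$rest;
```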
I also analyzed the outgoing links per site. I’ve clustered them with k-means; the result can be seen in figure 1. How did I produce this graphic? Here is some R code:
```r
library(MASS)

x=read.table("check-links.log")
intern=x[,1]
extern=x[,2]
z <- kde2d(intern,extern, n=50)

# cluster
colnames(x) <- c("internal", "external")
(cl <- kmeans(x, 2))

# draw
png("check-links.png", width = 600, height = 600)

# save actual settings (without the read-only ones, which cannot be restored)
op <- par(no.readonly = TRUE)

layout( matrix( c(2,1,0,3), 2, 2, byrow=T ), c(1,6), c(4,1))

par(mar=c(1,1,5,2))
contour(z, col = "black", lwd = 2, drawlabels = FALSE)
points(x, col = 2*cl$cluster, pch = 2 * (cl$cluster - 1),lwd=2,cex=1.3)
points(cl$centers, col = c(2,4), pch = 13, cex=3,lwd=3)
rug(side=1, intern )
rug(side=2, extern )
abline(lm( external ~ internal, data = x))
title(main = "internal -vs- external links per site")

# boxplot left
par(mar=c(1,2,5,1))
boxplot(extern, axes=F)
title(ylab='external links', line=0)

# boxplot right
par(mar=c(5,1,1,2))
boxplot(intern, horizontal=T, axes=F)
title(xlab='internal links', line=1)

# restore settings
par(op)
dev.off()
```
You’re right, there is a lot of stuff in the image that is not essential, but take it as an example to show R beginners what is possible. Maybe you want to produce similar graphics!?