Screenscraping using wget and Shell Scripting

Submitted by nigel on Sunday 6th December 2009

There are always a number of ways of achieving the same goal in computing. In this tutorial I want to screenscrape the 2010 FIFA World Cup football schedule from a news website to generate an .ics iCalendar file. I will be developing the .ics generator in Perl, and Perl offers LWP (short for "Library for WWW in Perl"), a popular group of Perl modules for accessing data on the Web.

I certainly could use LWP - and have done many times in the past for other projects - but it isn't the easiest tool to use. The problem I have with it is the awkwardness of dropping anchor points throughout the screen-scraped data, and then having to write convoluted, hard-to-maintain code to extract the information that is actually required.

Wherever practicable I therefore try to eschew LWP, and instead use Linux command line utilities built around wget, together with sed and awk. The beauty of this method is that the scraper is only loosely coupled with the Perl .ics generator: should the news website's page format change, I will only have to modify some sed/awk commands rather than get my hands dirty modifying Perl.

I have located a page at ESPN Soccernet which contains all the schedule information, so that's our starting point.

We'll now create a shell script and assign this link to a variable so it can easily be changed in a trice should ESPN reorganise their site. We will be using the wget utility as previously mentioned, so let's add a call to it with the required parameters.

#!/bin/sh
 
# URL to screenscrape
wc2010http='http://soccernet.espn.go.com/world-cup/fixtures?cc=5739&ver=global'
 
# Screenscrape: -q suppresses wget's progress output, -O - writes the page to stdout
wget -q -O - "$wc2010http"
Now save this script - I have called it runwc2010 on my machine - and set the file mode bits to executable:
$ chmod u+x runwc2010
If you now run the script with
$ ./runwc2010
you should see the HTML output of the ESPN page in your terminal. If you do, great! It should look something like the output below, although it has been trimmed for the sake of brevity.
<div class="mod-container mod-no-footer mod-grp-matchup">
<div class="mod-header">
<h4>June 11, 2010 @ 15:00UK</h4>
<h4>Group Stage Group A</h4>
<h5>at Soccer City (Johannesburg)</h5>
</div>
<div class="mod-content">
 
<div class="gradient-container-down">
<div class="team visitor">
 
<div class="wc-flag-35 wc-flag-35-467"> </div>
 
<p>South Africa</p>
</div>
 
<div class="vs">
<p> v </p>
 
</div>

We now need to pare down that output to extract only the data we want. We will achieve this with sed and awk, building up the pipeline one step at a time by piping the output of wget through each stage.

Firstly, we want to crop the output between a starting point where the fixtures begin and an end point after the World Cup final. Looking through the HTML, the text 'June 11, 2010 @ 15:00UK' denotes the very first game, whilst 'Table' is the first text to appear after the final. So we will constrain our output between these two strings. Translated into sed, it looks like:

wget -q -O - "$wc2010http" | sed -n '/June 11, 2010 @ 15:00UK/,/Table/p'
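If sed address ranges are new to you, here is the idea on a tiny, purely illustrative input: -n suppresses sed's automatic printing, and the /start/,/end/p address prints only the lines between the two matches, inclusive of the matching lines themselves.
$ printf 'a\nSTART\nb\nEND\nc\n' | sed -n '/START/,/END/p'
START
b
END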
Next we should eliminate all the HTML tags - we just don't need them and they clutter the output. The sed substitution to remove HTML tags is s/<[^>]*>//g, so we pipe our output through a second sed invocation:
wget -q -O - "$wc2010http" | sed -n '/June 11, 2010 @ 15:00UK/,/Table/p' | sed 's/<[^>]*>//g'
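To see what that substitution does in isolation, try it on a single tag-wrapped line (this standalone echo is purely illustrative and not part of the script):
$ echo '<p>South Africa</p>' | sed 's/<[^>]*>//g'
South Africa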
The full pipeline now produces the following output:
                July 11, 2010 @ 19:30UK
                Final
                at Soccer City (Johannesburg)
 
 
 
 
 
 
 
                                Winners SF1
 
 
 
                                 v 
 
 
 
 
 
 
                                Winners SF2
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
                                Tables and Fixtures
Loads of white space! Let's get rid of the leading white space on each line - the sed substitution for that is s/^[ \t]*//, which can be appended to our previous expression, so our wget/sed command now looks like:
wget -q -O - "$wc2010http" | sed -n '/June 11, 2010 @ 15:00UK/,/Table/p' | sed 's/<[^>]*>//g;s/^[ \t]*//;'
which produces
July 11, 2010 @ 19:30UK
Final
at Soccer City (Johannesburg)
 
 
 
 
 
 
 
Winners SF1
 
 
 
 v 
 
 
 
 
 
 
Winners SF2
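A quick portability note before we go on: the \t escape inside a bracket expression is a GNU sed extension (fine on a typical Linux box) and is not guaranteed by POSIX. If that concerns you, the [[:space:]] character class is a standard way to strip all leading whitespace - the line below is an equivalent alternative, not a change to our script:
wget -q -O - "$wc2010http" | sed -n '/June 11, 2010 @ 15:00UK/,/Table/p' | sed 's/<[^>]*>//g;s/^[[:space:]]*//'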
Now let's get rid of empty lines with the /^$/d sed command. Note that the order matters here: because the previous substitution stripped leading white space, lines that contained only white space are now completely empty and /^$/d catches them too. Our command looks like:
wget -q -O - "$wc2010http" | sed -n '/June 11, 2010 @ 15:00UK/,/Table/p' | sed 's/<[^>]*>//g;s/^[ \t]*//;/^$/d'
which produces:
Winner Q/F 1
July 7, 2010 @ 19:30UK
Semi-finals
at Durban (Durban)
Winner Q/F 3
 v 
Winner Q/F 4
July 10, 2010 @ 19:30UK
Third Place
at Nelson Mandela Bay (Port Elizabeth)
Loser SF1
 v 
Loser SF2
July 11, 2010 @ 19:30UK
Final
at Soccer City (Johannesburg)
Winners SF1
 v 
Winners SF2
Tables and Fixtures
We are now close to what we want. However, we could still do with removing some lines we don't need - all lines containing 'Table', 'nbsp' or 'Home' (the 'Home' lines are links to individual pages for each team in the World Cup - not needed here). We will use an awk statement for this, again piped from the previous output.
wget -q -O - "$wc2010http" | sed -n '/June 11, 2010 @ 15:00UK/,/Table/p' | sed 's/<[^>]*>//g;s/^[ \t]*//;/^$/d' | awk '$0 !~ /Table/ && $0 !~ /Home/ && $0 !~ /nbsp/ {print}'
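As an aside, awk prints matching lines by default, so the same filter can be written more compactly as a single negated pattern - both forms behave identically here:
awk '!/Table|Home|nbsp/'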
So this is (in part) our final output, and beneath it, the final script:
July 6, 2010 @ 19:30UK
Semi-finals
at Green Point (Cape Town)
Winner Q/F 2
Winner Q/F 1
July 7, 2010 @ 19:30UK
Semi-finals
at Durban (Durban)
Winner Q/F 3
Winner Q/F 4
July 10, 2010 @ 19:30UK
Third Place
at Nelson Mandela Bay (Port Elizabeth)
Loser SF1
Loser SF2
July 11, 2010 @ 19:30UK
Final
at Soccer City (Johannesburg)
Winners SF1
Winners SF2
#!/bin/sh
 
# URL to screenscrape
wc2010http='http://soccernet.espn.go.com/world-cup/fixtures?cc=5739&ver=global'
 
# Screenscrape and clean up
wget -q -O - "$wc2010http" \
    | sed -n '/June 11, 2010 @ 15:00UK/,/Table/p' \
    | sed 's/<[^>]*>//g;s/^[ \t]*//;/^$/d' \
    | awk '$0 !~ /Table/ && $0 !~ /Home/ && $0 !~ /nbsp/ {print}'
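The schedule data is now in a clean, line-oriented form ready for the Perl .ics generator to parse. As a sketch of how the two halves will eventually fit together (wc2010ics.pl is a hypothetical name for the Perl script still to be written - it is not developed in this tutorial):
$ ./runwc2010 | ./wc2010ics.pl > worldcup2010.ics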