Consuming Linked Data from the Web

Part 1: Looking up URIs with curl

In this exercise you will practice looking up URIs from the Web of Data, using the cURL command line tool.

look up a URI with curl: $ curl http://tomheath.com/id/me
what do you get back from the previous step? nothing. why?
get the list of curl options: $ curl --help
which option can be used to provide more verbose output?
repeat the same URI lookup, but with more verbose output: $ curl -v http://tomheath.com/id/me

what is the HTTP method used in the request sent by curl? (clue: it's in capitals)
which content types is curl willing to accept?

how did the server respond? which response code was provided? how long was the content in the response?
what should you do next to get some content? (clue: what is the value of the Location header in the response?)
what does the flag -L do?

repeat the request; this time tell curl to follow Location: hints
what HTTP response code was sent with the second response from the server?
did you get a document back? what was the content-type of the document? what was the length of the content in the response?
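
The redirect-following behaviour explored above can be sketched in Python without touching the network. This is a minimal illustration, not a description of the server used in the exercise: it assumes a server that answers a "thing" URI with 303 See Other (a pattern often used for Linked Data) and serves a document at the redirect target. The paths /id/me and /doc/me, the HTML body and the helper names are hypothetical.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical paths and body, chosen only for this sketch.
THING_PATH = "/id/me"
DOC_PATH = "/doc/me"
DOC_BODY = b"<html><body>A document about the thing.</body></html>"


class RedirectingHandler(BaseHTTPRequestHandler):
    """Answers the thing URI with 303 and the document URI with 200."""

    def do_GET(self):
        if self.path == THING_PATH:
            self.send_response(303)
            self.send_header("Location", DOC_PATH)
            self.send_header("Content-Length", "0")
            self.end_headers()
        elif self.path == DOC_PATH:
            self.send_response(200)
            self.send_header("Content-Type", "text/html")
            self.send_header("Content-Length", str(len(DOC_BODY)))
            self.end_headers()
            self.wfile.write(DOC_BODY)
        else:
            self.send_error(404)

    def log_message(self, format, *args):
        pass  # keep the terminal quiet


def start_server():
    """Start the toy server on any free local port, in a background thread."""
    server = HTTPServer(("127.0.0.1", 0), RedirectingHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def fetch(port, path, follow=False, max_hops=5):
    """GET a path and return (status, headers, body).

    With follow=True, redirects are followed via the Location header.
    Only path-style Location values are handled in this sketch.
    """
    for _ in range(max_hops):
        conn = http.client.HTTPConnection("127.0.0.1", port)
        conn.request("GET", path)
        resp = conn.getresponse()
        body = resp.read()
        headers = {k.lower(): v for k, v in resp.getheaders()}
        conn.close()
        redirected = resp.status in (301, 302, 303, 307, 308)
        if follow and redirected and "location" in headers:
            path = headers["location"]
            continue
        return resp.status, headers, body
    raise RuntimeError("too many redirects")


if __name__ == "__main__":
    server = start_server()
    port = server.server_address[1]
    print(fetch(port, THING_PATH)[0])               # status without following
    print(fetch(port, THING_PATH, follow=True)[0])  # status after following
    server.shutdown()
```

Calling fetch without follow corresponds roughly to plain curl, and follow=True to curl -L; compare the status codes and headers with what you saw from the real server.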

which option can you use to send custom headers to the server?
resend the request with a custom header asking for the content in RDF/XML: $ curl -H "Accept:application/rdf+xml" http://tomheath.com/id/me
include the -L flag to tell curl to follow the redirect. did your custom header work? what was the content-type of the response?
try using an additional flag -I to show just the headers; this helps with debugging the interaction, though you don't get to see the content
reenable the display of content and try redirecting the content returned by the server to a file: $ curl -L -H "Accept:application/rdf+xml" http://tomheath.com/id/me > output.rdf
have a look at the output: $ cat output.rdf
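
To see why the Accept header changes what comes back, here is a rough sketch of how a server might choose a representation. It is a simplification under stated assumptions: q-values and wildcards are handled, but the specificity rules of real HTTP content negotiation are ignored, and the function names and the list of available types are hypothetical rather than taken from any particular server.

```python
def parse_accept(header):
    """Turn an Accept header into (media_type, q) pairs, best first."""
    prefs = []
    for part in header.split(","):
        fields = [f.strip() for f in part.split(";")]
        media_type = fields[0].lower()
        if not media_type:
            continue
        q = 1.0  # default quality when no q parameter is given
        for param in fields[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0
        prefs.append((media_type, q))
    # sorted() is stable, so equal q-values keep their header order
    return sorted(prefs, key=lambda p: p[1], reverse=True)


def matches(pattern, media_type):
    """True if an Accept pattern such as */* or text/* covers media_type."""
    if pattern == "*/*":
        return True
    if pattern.endswith("/*"):
        return media_type.startswith(pattern[:-1])
    return pattern == media_type


def negotiate(accept_header, available):
    """Return the first available type matching the best preference, else None."""
    for pattern, q in parse_accept(accept_header):
        if q <= 0:
            continue
        for media_type in available:
            if matches(pattern, media_type):
                return media_type
    return None


if __name__ == "__main__":
    # Hypothetical set of representations a server might offer.
    offered = ["text/html", "application/rdf+xml"]
    print(negotiate("*/*", offered))
    print(negotiate("application/rdf+xml", offered))
```

In this sketch a client that accepts anything gets the server's first choice, while a client that names application/rdf+xml gets RDF/XML if it is on offer.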

In this exercise you will experiment with crawling Linked Data using the LDspider crawler.

create a text file called seed.txt in /home/sssw2012/session01/ldspider/ containing one or more URIs, one per line. Examples you can use include
- http://tomheath.com/id/me
- http://data.ordnancesurvey.co.uk/id/50kGazetteer/218013
change to the directory where ldspider is installed: $ cd /opt/ldspider
run a simple crawl of the seed URIs: $ java -jar ldspider-1.1e.jar -d /home/sssw2012/session01/data -a /home/sssw2012/session01/ldspider/ldspider.log -s /home/sssw2012/session01/ldspider/seed.txt
inspect the messages from the crawler in the terminal window; did your crawl succeed?
the -d /home/sssw2012/session01/data flag tells the crawler where to store the crawled data. open the zip file to check you got what you expected.

run the crawler again, but with no parameters: $ java -jar ldspider-1.1e.jar ; this prints the help text; these links are also useful:
- http://code.google.com/p/ldspider/
- http://code.google.com/p/ldspider/wiki/GettingStartedCommandLine

read about the various crawling strategies at http://code.google.com/p/ldspider/wiki/GettingStartedCommandLine
try the load-balanced crawling strategy by executing: $ java -jar ldspider-1.1e.jar -c 1 -o /home/sssw2012/session01/data/crawl.nq -a /home/sssw2012/session01/ldspider/ldspider.log -s /home/sssw2012/session01/ldspider/seed.txt
notice that the output isn't going into a zip file, but into the file called crawl.nq; open the file and take a look; do you notice anything strange about the data in the file?
study the various options for running the crawler; which flag do you need to add to avoid having the header information added to the data? add it and rerun the crawl
vary the value of max-uris to something a bit higher and rerun the crawl; does this have a significant effect on the time taken?
study the configuration options again; how would you combine these to focus the crawl in different ways?
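
When inspecting crawl.nq it helps to know that N-Quads extends N-Triples with a fourth term on each line, naming the context (source) of the statement. The sketch below counts statements per context. It is a rough aid, not a full N-Quads parser: the regular expression covers only common term shapes, unparsable lines are skipped silently, and the sample data is made up for illustration.

```python
import re
from collections import Counter

# One RDF term: <iri>, _:blank, or "literal" with optional @lang or ^^<datatype>.
TERM = r'(<[^>]*>|_:\S+|"(?:[^"\\]|\\.)*"(?:@[A-Za-z0-9-]+|\^\^<[^>]*>)?)'
QUAD = re.compile(
    r"^\s*" + TERM + r"\s+" + TERM + r"\s+" + TERM + r"\s+" + TERM + r"\s*\.\s*$"
)


def quads_per_context(lines):
    """Count statements per fourth term (context) in an iterable of lines."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        m = QUAD.match(line)
        if m is None:
            continue  # plain triples and anything this sketch cannot parse
        counts[m.group(4)] += 1
    return counts


# Made-up sample data; example.org URIs are placeholders.
SAMPLE = [
    '<http://example.org/a> <http://xmlns.com/foaf/0.1/name> "Alice" <http://example.org/doc1> .',
    '<http://example.org/a> <http://xmlns.com/foaf/0.1/knows> <http://example.org/b> <http://example.org/doc1> .',
    '<http://example.org/b> <http://xmlns.com/foaf/0.1/name> "Bob"@en <http://example.org/doc2> .',
]

if __name__ == "__main__":
    for context, n in quads_per_context(SAMPLE).most_common():
        print(n, context)
```

To try it on real output, pass an open file instead of SAMPLE, for example quads_per_context(open("/home/sssw2012/session01/data/crawl.nq")).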

In this exercise you will practice starting a Jena Fuseki server and using it for storing and querying RDF data.

go to the directory where Fuseki is installed: $ cd /opt/jena-fuseki-0.2.1-incubating
start a Fuseki server with in-memory storage: $ ./fuseki-server --update --mem /memdataset (the --update flag tells the server to accept SPARQL updates; the --mem flag tells the server to use in-memory storage; the /memdataset part is a name for the dataset)
in Firefox, visit the homepage of your Fuseki server at http://localhost:3030/
click on Control Panel; your dataset should be selected in the drop-down box; click Select to get a SPARQL query form.
enter describe <http://tomheath.com/id/me>, select XML output and click Get Results
does the resulting document contain any triples? No, so let's add some using the LDspider crawler

you'll need to tell LDspider to send data to Fuseki using the SPARQL Update protocol. The Fuseki update endpoint will be at http://localhost:3030/memdataset/update, so...
go back to the ldspider directory: $ cd /opt/ldspider
run the crawler with the command: $ java -jar ldspider-1.1e.jar -c 1 -oe http://localhost:3030/memdataset/update -a /home/sssw2012/session01/ldspider/ldspider.log -s /home/sssw2012/session01/ldspider/seed.txt -e
scroll back through the output from the crawler; can you see evidence of data being added to the store?
go back to the Fuseki query form and rerun the query describe <http://tomheath.com/id/me>
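
For a sense of what a SPARQL Update request to the endpoint can look like, here is a minimal sketch that builds (but does not send) an INSERT DATA request. It assumes the direct POST form of the SPARQL 1.1 Protocol, with a Content-Type of application/sparql-update; the exercise does not say what LDspider actually sends, and the Fuseki version used here may behave differently. The function name and the sample triple are hypothetical.

```python
import urllib.request

# Endpoint taken from the exercise text.
UPDATE_ENDPOINT = "http://localhost:3030/memdataset/update"


def build_insert_request(triples, endpoint=UPDATE_ENDPOINT):
    """Build a POST request carrying an INSERT DATA update.

    triples is a list of (subject, predicate, object) strings that are
    already written in SPARQL/N-Triples term syntax.
    """
    lines = [f"  {s} {p} {o} ." for s, p, o in triples]
    body = "INSERT DATA {\n" + "\n".join(lines) + "\n}"
    return urllib.request.Request(
        endpoint,
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/sparql-update"},
        method="POST",
    )


if __name__ == "__main__":
    # Made-up triple for illustration only.
    req = build_insert_request(
        [("<http://example.org/a>", "<http://xmlns.com/foaf/0.1/name>", '"Alice"')]
    )
    print(req.get_method(), req.full_url)
    print(req.data.decode("utf-8"))
```

With the Fuseki server from this exercise running, the request could be sent with urllib.request.urlopen(req); nothing is sent by the code as written.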

If you have any extra time, have a go at these additional exercises.

in a new terminal window run the command: $ sudo apt-get install raptor-utils (password for installation is "sssw2012")
this installs the rapper tool for manipulating RDF, e.g. converting between formats
run the command: $ rapper --help to see the options available
use curl to get an RDF/XML description of Tom, then pass it through rapper to convert the output to N-Triples. clue: ask someone about Unix pipes if you're not sure how to combine curl and rapper.