Consuming Linked Data from the Web
Part 1: Looking up URIs with curl
In this exercise you will practice looking up URIs from the Web of Data, using the cURL command line tool
Create a Working Directory
- open a terminal window; you should see a prompt: $
- move to the session01 directory: $ cd session01
- make a new directory: $ mkdir curl
- change to new directory: $ cd curl
Simple Lookups
- use curl to look up a URI: $ curl http://tomheath.com/id/me
Debugging the Client-Server Interaction
- what do you get back from the previous step? nothing. why?
- get the list of curl options: $ curl --help
- which option can be used to provide more verbose output?
- repeat the same URI lookup, but with more verbose output: $ curl -v http://tomheath.com/id/me
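If the full verbose trace is hard to read, a narrower variant (a sketch using only standard curl options) dumps just the response headers to the terminal and discards the body:

  # -s hides the progress meter, -D - writes the received headers to stdout,
  # -o /dev/null discards the body
  $ curl -s -D - -o /dev/null http://tomheath.com/id/me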
Analysing the Request
- what is the HTTP method used in the request sent by curl? (clue: it's in capitals)
- which content types is curl willing to accept?
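To isolate just the request that curl sends, one option (a sketch; it relies on curl writing its verbose trace to stderr, with outgoing header lines prefixed by '>') is:

  # redirect the verbose trace into the pipe and keep only the outgoing lines
  $ curl -s -v -o /dev/null http://tomheath.com/id/me 2>&1 | grep '^>'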
Analysing the Response
- how did the server respond? which response code was provided? how long was the content in the response?
- what should you do next to get some content? (clue: what is the value of the Location header in the response?)
- what does the flag -L do?
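If your curl build supports the --write-out variables used here, the following sketch answers all three questions in one line:

  # prints the status code, the Location target, and the number of body bytes in the first response
  $ curl -s -o /dev/null -w 'status: %{http_code}\nredirects to: %{redirect_url}\nbytes: %{size_download}\n' http://tomheath.com/id/me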
Following an HTTP 303 Redirect
- repeat the request; this time tell curl to follow Location: hints (one possible invocation is sketched after this list)
- what HTTP response code was sent with the second response from the server?
- did you get a document back? what was the content-type of the document? what was the length of the content in the response?
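One possible invocation, combining the -L flag with the verbose output used earlier (the exact responses you see will depend on the server):

  $ curl -L -v http://tomheath.com/id/me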
Asking for a Specific Content-type
- which option can you use to send custom headers to the server?
- resend the request with a custom header asking for the content in RDF/XML: $ curl -H "Accept:application/rdf+xml" http://tomheath.com/id/me
- include the -L flag to tell curl to follow the redirect. did your custom header work? what was the content-type of the response?
- try using an additional flag -I to show just the headers; this helps with debugging the interaction, though you don't get to see the content
- reenable the display of content and try redirecting the content returned by the server to a file: $ curl -L -H "Accept:application/rdf+xml" http://tomheath.com/id/me > output.rdf
- have a look at the output: $ cat output.rdf
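If you want to experiment further, try asking for a different RDF serialisation; note that text/turtle is an assumption here, and will only work if the server actually offers that format:

  $ curl -L -H "Accept: text/turtle" http://tomheath.com/id/me > output.ttl
  $ cat output.ttl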
Part 2: Crawling
In this exercise you will experiment with crawling Linked Data using the LDspider crawler
Running a Simple Crawl
- create a text file called seed.txt in /home/sssw2012/session01/ldspider/ containing one or more URIs, one per line (one way to create this file from the command line is sketched after this list). Examples you can use include:
- http://tomheath.com/id/me
- http://data.ordnancesurvey.co.uk/id/50kGazetteer/218013
- change to the directory where ldspider is installed: $ cd /opt/ldspider
- run a simple crawl of the seed URIs: $ java -jar ldspider-1.1e.jar -d /home/sssw2012/session01/data -a /home/sssw2012/session01/ldspider/ldspider.log -s /home/sssw2012/session01/ldspider/seed.txt
- inspect the messages from the crawler in the terminal window; did your crawl succeed?
- the -d /home/sssw2012/session01/data flag tells the crawler where to store the crawled data. open the zip file to check you got what you expected.
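One way to create the seed file from the command line (a sketch only; a plain text editor works just as well, and the paths are those assumed throughout this exercise):

  $ cat > /home/sssw2012/session01/ldspider/seed.txt <<EOF
  http://tomheath.com/id/me
  http://data.ordnancesurvey.co.uk/id/50kGazetteer/218013
  EOF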
Getting Help
- run the crawler again, but with no parameters: $ java -jar ldspider-1.1e.jar ; this prints the built-in help; these links are also useful:
- http://code.google.com/p/ldspider/
- http://code.google.com/p/ldspider/wiki/GettingStartedCommandLine
Crawling Strategies
- read about the various crawling strategies at http://code.google.com/p/ldspider/wiki/GettingStartedCommandLine
- try the load-balanced crawling strategy by executing: $ java -jar ldspider-1.1e.jar -c 1 -o /home/sssw2012/session01/data/crawl.nq -a /home/sssw2012/session01/ldspider/ldspider.log -s /home/sssw2012/session01/ldspider/seed.txt
- notice that the output isn't going into a zip file, but into the file called crawl.nq; open the file and take a look; do you notice anything strange about the data in the file?
- study the various options for running the crawler; which flag do you need to add to avoid having the header information added to the data? add it and rerun the crawl
- vary the value of max-uris to something a bit higher and rerun the crawl; does this have a significant effect on the time taken?
- study the configuration options again; how would you combine these to focus the crawl in different ways?
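As a starting point for these experiments, here is the load-balanced crawl again with a higher limit, assuming (as the wording above suggests) that the number after -c is the max-uris value:

  $ java -jar ldspider-1.1e.jar -c 10 \
      -o /home/sssw2012/session01/data/crawl.nq \
      -a /home/sssw2012/session01/ldspider/ldspider.log \
      -s /home/sssw2012/session01/ldspider/seed.txt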
Part 3: Storing Data
In this exercise you will practice starting a Jena Fuseki server and using it for storing and querying RDF data
Starting Fuseki
- go to the directory where Fuseki is installed: $ cd /opt/jena-fuseki-0.2.1-incubating
- start a Fuseki server with in-memory storage: $ ./fuseki-server --update --mem /memdataset (the --update flag tells the server to accept SPARQL updates; the --mem flag tells the server to use in-memory storage; the /memdataset part is a name for the dataset)
- in Firefox, visit the homepage of your Fuseki server at http://localhost:3030/
- click on Control Panel; your dataset should be selected in the drop-down box; click Select to get a SPARQL query form.
- enter describe <http://tomheath.com/id/me>, select XML output and click Get Results
- does the resulting document contain any triples? No, so let's add some using the ldspider crawler
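You can also send the same query from the command line with curl; this sketch assumes the dataset's query endpoint is at /memdataset/query, matching the /memdataset/update endpoint used in the next part:

  $ curl -G "http://localhost:3030/memdataset/query" \
      --data-urlencode "query=DESCRIBE <http://tomheath.com/id/me>" \
      -H "Accept: application/rdf+xml"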
Storing Some Data
- you'll need to tell LDspider to send data to Fuseki using the SPARQL Update protocol. The Fuseki update endpoint will be at http://localhost:3030/memdataset/update, so...
- go back to the ldspider directory: $ cd /opt/ldspider
- run the crawler with the command: $ java -jar ldspider-1.1e.jar -c 1 -oe http://localhost:3030/memdataset/update -a /home/sssw2012/session01/ldspider/ldspider.log -s /home/sssw2012/session01/ldspider/seed.txt -e
- scroll back through the output from the crawler; can you see evidence of data being added to the store?
- go back to the Fuseki query form and rerun the query describe <http://tomheath.com/id/me>
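For a quick check from the command line (again assuming a /memdataset/query endpoint), a count over the default graph and any named graphs the crawler may have written should now be greater than zero:

  $ curl -G "http://localhost:3030/memdataset/query" \
      --data-urlencode "query=SELECT (COUNT(*) AS ?n) WHERE { { ?s ?p ?o } UNION { GRAPH ?g { ?s ?p ?o } } }"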
Part 4: Extras
If you have any extra time, have a go at these additional exercises
- in a new terminal window run the command: $ sudo apt-get install raptor-utils (password for installation is "sssw2012")
- this installs the rapper tool for manipulating RDF, e.g. converting between formats
- run the command: $ rapper --help to see the options available
- use curl to get an RDF/XML description of Tom, then pass it through rapper to convert the output to N-Triples (clue: ask someone about Unix pipes if you're not sure how to combine curl and rapper; one possible pipeline is sketched below)
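One possible pipeline (a sketch; it assumes rapper's rdfxml and ntriples format names, and passes the original URI as the base URI that rapper needs when parsing from stdin):

  $ curl -s -L -H "Accept: application/rdf+xml" http://tomheath.com/id/me \
      | rapper -i rdfxml -o ntriples - http://tomheath.com/id/me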