From Annotations to Annotations
Here, we will try to take advantage of distributed resources for manipulating data on the web. Your goal is to convert semantic markup from HTML+RDFa web pages so that it can be indexed by search engines. For that purpose, it will need to be converted to the schema.org ontology. If you look at some pages online, such as http://data.linkedevents.org/event/89be8dd3-758e-4cec-9254-80f6d093bdd0, which is a LinkedEvents page served by Virtuoso, or a corresponding page such as http://www.codemotion.it, you will find that these are RDFa pages, i.e. they are provided with RDF annotations.
- Your first task is to extract the RDF markup from these pages. You may try various online resources such as http://rdfa.info/tools/ (honestly, I use the Python one, which is more robust).
- Then you may try to find out the vocabularies used in these pages, either by hand or programmatically.
- You will then have to consult our local alignment server for the availability of alignments between these vocabularies and schema.org:
curl -L -H "Accept:application/rdf+xml" \ 'http://aserv.inrialpes.fr/rest/find?onto1=http://xmlns.com/foaf/0.1/&onto2=http://schema.org/'You will try to use these alignments for converting the vocabulary used in the ontologies, by finding corresponding entities and replacing them:
curl -L -H "Accept:application/rdf+xml" \ 'http://aserv.inrialpes.fr/rest/corresp?id=http://aserv.inrialpes.fr:8089/alid/1335542900539/204&entity=http://xmlns.com/foaf/0.1/mbox'
From Text to Annotations
The pages:
- http://www.london2012.com/news/articles/day-olympic-flame-greeted-the-queen-and-the-duke-edinburgh-windsor-castle.html
- http://www.london2012.com/news/articles/largest-olympic-rings-unveiled-richmond-park.html
- http://www.london2012.com/news/articles/paralympic-torch-relay-route-revealed-1258473.html
Stanbol Enhancer, FRED, and Tipalo are text-processing tools that analyse the text of a page and generate RDF data reflecting its content. Try them on the pages above, inspect the results, and consider the following questions:
- What does the result express? Is it the whole content of the page? Can you understand how the tools came up with these results?
- Do you agree with the results? What would you have done differently?
- Do you think these results could be used automatically? Would you be able to combine the results in order to get a coherent representation of the page content?
Some practical hints:
- You can give the whole text to Stanbol Enhancer;
- FRED can take up to 1000 characters; if you have more, you will have to split the input into parts (see the chunking sketch after this list);
- Tipalo takes a Wikipedia page as input. Hence, once Stanbol Enhancer has recognised a set of DBpedia entities, you can obtain additional RDF data about each entity by giving its corresponding Wikipedia page URI as input to Tipalo. You will have to feed Tipalo one entity at a time.
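Here is a minimal sketch of the chunking step in Python, assuming a plain sentence-boundary split is good enough; the 1000-character threshold comes from the note above, and article.txt is a placeholder for the page text you saved. The actual submission to FRED is left out, since it depends on the endpoint and client you use.

import re

def chunk_text(text, limit=1000):
    """Split text into chunks of at most `limit` characters, cutting at
    sentence boundaries so each part is well-formed input for FRED."""
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if len(current) + len(sentence) + 1 <= limit:
            current = (current + " " + sentence).strip()
        else:
            if current:
                chunks.append(current)
            current = sentence  # assumes no single sentence exceeds the limit
    if current:
        chunks.append(current)
    return chunks

# article.txt is a placeholder for the text of one of the pages above.
for part in chunk_text(open("article.txt").read()):
    print(len(part), part[:60], "...")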
Enriching Annotations with Resources From the Web
The annotations you obtained previously use specific vocabularies that are driven by the tools used to produce them. What we would like to do here is to enrich this representation with more ontological constructs from resources available on the Web. In other words, we want to build a more complete representation of these annotations, by including links and relations to resources we will find using a variety of services.
- Using Watson, find ontologies that can be used to represent the classes and properties that are mentioned in the automatically generated annotations. Choose some alternative representations of these concepts and relationships and see how you would integrate them with the structures you obtained from the tools used before.
- Using sameAs.org, can you find links between the instances you obtained and other resources? (See the lookup sketch after this list.)
- Using Sindice.com and Sig.ma, can you find other entities that should be linked to the entities mentioned in the automatic annotations you obtained before?
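Here is a minimal lookup sketch in Python. It assumes sameAs.org exposes a JSON interface at http://sameas.org/json?uri=<URI> returning a list of bundles, each with a "duplicates" list of co-referent URIs; both the endpoint pattern and the response structure are assumptions to check against the site's documentation.

import json
import urllib.parse
import urllib.request

def sameas_links(uri):
    # Assumed endpoint: http://sameas.org/json?uri=<url-encoded URI>
    query = "http://sameas.org/json?uri=" + urllib.parse.quote(uri, safe="")
    with urllib.request.urlopen(query) as response:
        bundles = json.load(response)
    # Assumed structure: a list of bundles, each with a "duplicates" list.
    return [dup for bundle in bundles for dup in bundle.get("duplicates", [])]

# Example with a DBpedia entity you might have obtained from Stanbol Enhancer.
for link in sameas_links("http://dbpedia.org/resource/London"):
    print(link)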
Querying Across Linked Datasets
In the SQUIN directory, execute the shell command ./bin/squin.sh start to launch the SQUIN service. Then go to the URL http://localhost:8080/SQUIN/ to start running SPARQL queries directly on the Linked Data cloud!

Dereference the following URI in your browser and look at the owl:sameAs links:

<http://geo.linkeddata.es/resource/Provincia/Segovia>
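You can do the same dereferencing programmatically; here is a minimal sketch with rdflib, which content-negotiates for RDF when fetching the URI (if the automatic format detection fails, pass an explicit format argument to parse):

from rdflib import Graph
from rdflib.namespace import OWL

# Dereference the URI and print its owl:sameAs links.
g = Graph()
g.parse("http://geo.linkeddata.es/resource/Provincia/Segovia")
for s, o in g.subject_objects(OWL.sameAs):
    print(s, "owl:sameAs", o)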
Run the following SPARQL query in SQUIN:
PREFIX owl: <http://www.w3.org/2002/07/owl#>
SELECT * WHERE {
  <http://geo.linkeddata.es/resource/Provincia/Segovia> owl:sameAs ?o1 .
}

Look at the results. What did you get back? Note that you are not running this query on the GeoLinkedData.es SPARQL endpoint.
Run the following SPARQL queries in SQUIN:
Get more owl:sameAs links
PREFIX owl: <http://www.w3.org/2002/07/owl#>
SELECT * WHERE {
  <http://geo.linkeddata.es/resource/Provincia/Segovia> owl:sameAs ?o1 .
  OPTIONAL { ?o1 owl:sameAs ?o2 . }
}

What are the lat and long? Where is that data coming from?
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX wgs84_pos: <http://www.w3.org/2003/01/geo/wgs84_pos#>
SELECT * WHERE {
  <http://geo.linkeddata.es/resource/Provincia/Segovia> owl:sameAs ?o1 .
  OPTIONAL {
    ?o1 wgs84_pos:lat ?lat .
    ?o1 wgs84_pos:long ?long .
  }
  OPTIONAL { ?o1 owl:sameAs ?o2 . }
}

Who was born in Segovia?
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX wgs84_pos: <http://www.w3.org/2003/01/geo/wgs84_pos#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT * WHERE {
  <http://geo.linkeddata.es/resource/Provincia/Segovia> owl:sameAs ?o1 .
  OPTIONAL {
    ?o1 wgs84_pos:lat ?lat .
    ?o1 wgs84_pos:long ?long .
    OPTIONAL {
      ?ppl <http://dbpedia.org/ontology/birthPlace> ?o1 ;
           foaf:name ?name .
    }
  }
  OPTIONAL { ?o1 owl:sameAs ?o2 . }
}
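The same queries can be posed to SQUIN from a script. Here is a minimal sketch, assuming SQUIN implements the standard SPARQL protocol (a GET request with a url-encoded query parameter) at the service URL given earlier; if your installation differs, adapt the URL accordingly.

import urllib.parse
import urllib.request

SQUIN = "http://localhost:8080/SQUIN/"  # service URL from the start of this section

query = """
PREFIX owl: <http://www.w3.org/2002/07/owl#>
SELECT * WHERE {
  <http://geo.linkeddata.es/resource/Provincia/Segovia> owl:sameAs ?o1 .
}
"""

url = SQUIN + "?query=" + urllib.parse.quote(query)
req = urllib.request.Request(url, headers={"Accept": "application/sparql-results+xml"})
print(urllib.request.urlopen(req).read().decode("utf-8"))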