Library clips

sharing ideas thoughts and feedback

October 10, 2005

Semantic web: subject searching (XML, DC, RDF, ontologies…)

Filed under: search, semantic

I’ve been surfing the web in an effort to learn about XML, RDF, ontologies…basically the semantic web.
I’ve found this hard to grasp on my own, so I’ve decided to share my understanding on this blog in the hope that comments will help clarify my grip on the subject matter (after all, what are blogs for!)

NOTE: I’ve used the web to gain knowledge (as opposed to consulting a person, taking a course, or reading a print publication), used the web to publish and disseminate my learning, and hopefully in turn will use the web for discussion…so in tackling “Web 2.0” I have been using the tools that Web 2.0 has to offer!

Here is my effort in describing what I understand of the semantic web aspect of Web 2.0 and how a part of it may emulate subject searching on an OPAC or scholarly database.

Ontology is the hardest term for me to grasp. I thought it was about mapping all the vocabularies used on the web (by mapping I mean establishing the semantic relationships), but it seems to be this and more: a type of controlled vocabulary that is more detailed and complex than a thesaurus, and one that deals with more than just subject terms. (In the library environment this is where a thesaurus stops and AACR2 takes care of the other descriptive terms, but AACR2 is only grammatical; it doesn’t infer relationships as ontologies do.)

An ontology will map relationships between anything: subject terms, people, places… It is very complex: it sets rules, relationships, and inferences, and maps things in every instance of a term or phrase so a computer can process the information accordingly.

So an ontology for the web will serve computers, not people…I’m not sure if you can use an ontology for browsing purposes. Maybe it doesn’t have a structure and is just a list of inferences that represents stuff, so it’s not used to navigate; it’s just a running list of relationships and rules (logical statements).

For an example of an ontology, see Converting a controlled vocabulary into an ontology: the case of GEM.

The semantic web promises contextual search where all information is marked-up in order to define itself, set rules (so they can be retrieved in a given context), and mapped to other resources (to distinguish relationships).
This will allow users to search in a way similar to fielded searching: if you search for an author you can expect to find items created by that author, ignoring items where the author’s name only appears in the full text or in the bibliography (this is searching without the noise). As mentioned before, the semantic web goes beyond fielded searching and defines relationships at a very granular level.
Since information will be machine-processable, it is said that we can have electronic web agents perform automated searching on our behalf. (What, we don’t have to surf the web ourselves!)

For our purposes we will narrow the focus to ontologies mapping subject terms on the web, so the web can act like an OPAC or fielded database.

MY INTERPRETATION

HTML is a mark up language for web browsers

<strong>hello</strong>

the web browser is told to make this bold

XML is a mark-up tool to make your own mark up language

So you can make up your own tags
<bike>racer</bike>

The beauty of XML is that tags are not just about presentation (like most of HTML); they can be about structure, or anything for that matter. You can even separate content from form, which is great because you can take the content from, say, a memo and present it in the structure of a poster. You can also keep the rules for your mark-up in a separate place, for example by declaring all your structural tags in a DTD file.
The DTD file can also act like a very simple ontology, as it defines the grammar of your XML: which tags are allowed and how they may nest. (As far as I can tell this is a different job from an XML namespace, which only says which vocabulary a tag name belongs to.)

I guess XML tags can be verbs as well; they’re just labels in the end.

The other great thing is that XML is interoperable, meaning you can import/export XML files to any program…it’s up to the program to be able to read it, and how it reads it. One program may interpret the tag different to another program, and it can ignore the tags that aren’t important.
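To make the point about different programs reading the same file more concrete, here is a minimal Python sketch. The document and its tags (<shop>, <bike>, <price>, <note>) are invented for illustration, and the two "programs" are just two small functions; each one picks out the tags it cares about and ignores the rest.

```python
import xml.etree.ElementTree as ET

# A made-up document: every tag name here is an assumption for this sketch.
DOC = """
<shop>
  <bike>racer</bike>
  <price>450</price>
  <note>second hand</note>
</shop>
"""

def catalogue_view(xml_text):
    """A 'catalogue' program: reads only <bike> and ignores everything else."""
    root = ET.fromstring(xml_text)
    return [el.text for el in root.iter("bike")]

def accounts_view(xml_text):
    """An 'accounts' program: reads only <price> and treats it as a number."""
    root = ET.fromstring(xml_text)
    return [int(el.text) for el in root.iter("price")]

print(catalogue_view(DOC))
print(accounts_view(DOC))
```

Neither function knows or cares about <note>, which is the sense in which a program can skip the tags that aren’t important to it.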

When it comes to the web (which is one big program or database) then all these tag meanings need to correlate. All the databases are coming together to form one big database, but really the databases are still separate, they will be mapped together so they can work in unison.

Define mark-up elements

Firstly we have to give meaning to our mark-up labels and then map relationships between everyone’s mark-up labels – as there will be multiple mark-up standards out there. Even though there is freedom with XML we can’t have 2 or more people using the “bike tag”, when they mean different things.

So each person who uses XML tags needs to give meaning to them, as they mean nothing if they can’t be interpreted.

In an ideal world, if everyone used the same set of mark-up labels to describe resources (as HTML does for presentation), we wouldn’t have a problem searching the web like an OPAC. This is what Dublin Core (DC) tries to do.

But the reality is that there are many mark-up standards to cover many different types of resources, industries and disciplines, so creating a universal standard is futile, rigid, and non-productive.

So we need a way to tell the web which mark-up standard we used to describe a web page, so it doesn’t clash with another standard that happens to use the same tag term for perhaps a different meaning…it’s a way to distinguish between ambiguous elements.

This is achieved with the XML namespace, which is declared in your document (usually on the opening tag) and is identified by a URI, often a URL, saying this is where I got my mark-up tags from (verification).
E.g. it may point to the web page of a particular mark-up tag set such as DC (although the URI is really just a unique name; there doesn’t have to be a page behind it).
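As a rough illustration of how a namespace keeps two “bike” tags apart, here is a small Python sketch. Both namespace URIs and the prefixes cyc and moto are made up for the example; the only real behaviour relied on is that Python’s ElementTree expands a prefixed tag into the form {namespace-uri}tagname.

```python
import xml.etree.ElementTree as ET

# Both namespace URIs below are invented for this sketch.
DOC = """
<listing xmlns:cyc="http://example.org/cycling#"
         xmlns:moto="http://example.org/motorsport#">
  <cyc:bike>racer</cyc:bike>
  <moto:bike>superbike</moto:bike>
</listing>
"""

root = ET.fromstring(DOC)

# ElementTree stores each tag as "{namespace-uri}localname",
# so the two <bike> tags are different names to the program.
CYCLING_BIKE = "{http://example.org/cycling#}bike"
MOTOR_BIKE = "{http://example.org/motorsport#}bike"

pedal_bikes = [el.text for el in root.iter(CYCLING_BIKE)]
motor_bikes = [el.text for el in root.iter(MOTOR_BIKE)]

print(pedal_bikes)
print(motor_bikes)
```

The label “bike” is the same in both cases, but the namespace says which community’s “bike” is meant, so the two never collide.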

Map mark-up

Now the next problem is for the web to map all the different mark-up standards. E.g. DC creator relates generally to the author of an item, but other mark-up standards may use the mark-up “author” or “owner”. They all roughly refer to the same thing, so the web needs to map these mark-up standards so that when we do a search we are searching in multiple communities (mark-up standards) at the one time. This is interoperability.

I’m not sure how this will happen. The web has to compare elements from multiple mark-up standards and decide: yes, “Creator” in DC is similar if not equal to “Author” in some other language, so I’ll merge these items in the results.
I know that in the library environment it doesn’t matter which label you use: the MARC number (a unique identifier) identifies what both elements mean, and they can then be mapped via the Z39.50 standard…
…but this is easy in a library environment, as all labels/elements have a MARC number. On the web there is no consensus; it’s much more chaotic.

RDF

So how does all this actually work from the computer’s perspective, making it machine-processable (an illusion that the machine understands)?

RDF is usually written in XML. It is a framework for exchanging the data we have been talking about above, and it allows this data to work beyond proprietary systems: the metadata can be shared amongst different web applications. It also allows you to use multiple mark-up standards within the same document.

RDF also has its own namespace to list its own custom mark-up tags…for more see An idiot’s guide to the resource description framework.

RDF encapsulates the resource and makes statements about resources, so it’s meta-data for the meta-data (RDF is a container that holds all this information). Ultimately it allows different vocabularies to exist in a distributed way without needing a central place.

What it stipulates is the structure for your mark-up (the source code for your web page), and this is all held between the opening <rdf:RDF> and closing </rdf:RDF> tags.

The structure is made up of 3 aspects:

  • Resource - a URI pointing to the resource (which is usually the URL of a web page)
  • Property type - a property describing the resource (eg. one taken from DC),
    identified via an XML namespace
  • Value - the value of the property, ie. the actual title, the name of the person, the subject term, etc…

NOTE: the property can become a resource itself with its own property types and values
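The three-part structure above can be pulled out of a small piece of RDF/XML with a few lines of Python. This is only a sketch: it handles the simplest possible layout (one rdf:Description with plain-text property values), and the resource URI and the values are made up. The two namespace URIs are the published ones for RDF and for the Dublin Core elements.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# The resource URI and the values are invented for this sketch.
DOC = """
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/elements/1.1/">
  <rdf:Description rdf:about="http://example.org/bike-review">
    <dc:title>Choosing a racer</dc:title>
    <dc:creator>Jane Example</dc:creator>
    <dc:subject>bicycles</dc:subject>
  </rdf:Description>
</rdf:RDF>
"""

def triples(xml_text):
    """Return (resource, property type, value) tuples from very simple RDF/XML."""
    root = ET.fromstring(xml_text)
    found = []
    for description in root.iter("{%s}Description" % RDF_NS):
        # rdf:about holds the URI of the resource being described.
        resource = description.get("{%s}about" % RDF_NS)
        for prop in description:
            # prop.tag is the property type with its namespace expanded.
            found.append((resource, prop.tag, prop.text))
    return found

for statement in triples(DOC):
    print(statement)
```

Each printed tuple is one statement: the resource, the property type (with the DC namespace spelled out in full), and the value.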

See below for more on how it does this by defining the structure of a resource.

Metadata for the web RDF and the Dublin core

What is RDF?

The semantic web: a network of content for the digital city

An introduction to the resource description framework

Although it is beyond the scope of this article, I will also mention that RDF is more than this…it is also referred to as data modelling.

This article explains that even if you use a tag such as <Author>Paul Warren</Author>, XML on its own can’t say anything about Paul Warren, as it can’t map relationships when used by itself…more from the article:

“…we have no way of saying anything about “Paul Warren”, e.g. that he is employed by BT, that he has written other articles etc. To overcome this limitation Resource Description Framework (RDF) was designed […] based on the use of “subject, verb, object” triples (e.g. “Paul Warren”, is employed by, “BT”).”

So the semantic web is more than listing values in a database within elements, i.e. listing a whole heap of author names under the element ID of “Author”. It differs from a relational database in that it is a list of free-floating logical statements that aren’t limited to being identified within one element; instead the value itself (the actual author’s name) can have a unique ID via a URI…read what I’m referring to at this article.
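Here is a small Python sketch of what “free-floating statements” might look like, using the Paul Warren example from the quote. The URIs and the verb names (author, name, employedBy) are invented for illustration; the point is only that once Paul Warren has his own ID, further statements can hang off him and be chained together.

```python
# All URIs and verb names below are assumptions made up for this sketch.
ARTICLE = "http://example.org/articles/semantic-web"
PAUL = "http://example.org/people/paul-warren"
BT = "http://example.org/orgs/bt"

# Each statement is a (subject, verb, object) triple.
statements = [
    (ARTICLE, "author", PAUL),
    (PAUL, "name", "Paul Warren"),
    (PAUL, "employedBy", BT),
    (BT, "name", "BT"),
]

def objects(subject, verb):
    """Everything stated about `subject` with the given verb."""
    return [o for s, v, o in statements if s == subject and v == verb]

def employers_of_authors(article):
    """Chain two statements: article -> author -> employer."""
    return [employer
            for author in objects(article, "author")
            for employer in objects(author, "employedBy")]

print(objects(PAUL, "name"))
print(employers_of_authors(ARTICLE))
```

A plain <Author>Paul Warren</Author> tag could only answer the first question; the second one needs the author to be something you can make further statements about.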

Map mark-up elements

This is where ontologies will have to shine, in the form of RDF Schema, OWL, OIL, and DAML (the DARPA Agent Markup Language).

From this example:

Say that in describing the aboutness of an object I use dc:subject, and someone else uses ab:descriptor.

The web will know via the namespace URI where these mark-up standards reside, and since someone has done the groundwork to say that [X dc:subject Y] is the same as [X ab:descriptor Y] (this is an ontology at work), the computer can deduce that “subject” and “descriptor” mean the same thing, so we get higher recall, and context, from our dataset.
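A toy version of this “groundwork” can be sketched in Python as a lookup table of equivalent properties. Everything here is an assumption for illustration: the ab: vocabulary is the made-up one from the example above, the page URLs are invented, and a real ontology language would express the equivalence in its own syntax rather than in a dictionary.

```python
# Hypothetical groundwork: someone has declared these properties equivalent.
EQUIVALENT = {
    "ab:descriptor": "dc:subject",
}

# Made-up statements drawn from two different mark-up standards.
statements = [
    ("http://example.org/page1", "dc:subject", "cycling"),
    ("http://example.org/page2", "ab:descriptor", "cycling"),
    ("http://example.org/page3", "dc:title", "cycling"),
]

def canonical(prop):
    """Map a property to its agreed equivalent, or leave it unchanged."""
    return EQUIVALENT.get(prop, prop)

def search(prop, value):
    """Find resources whose property (or an equivalent one) has this value."""
    wanted = canonical(prop)
    return [s for s, p, v in statements
            if canonical(p) == wanted and v == value]

print(search("dc:subject", "cycling"))
```

The search picks up page1 and page2 (higher recall across the two standards) but not page3, where “cycling” only appears in the title (context is kept).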

Going back, in the library environment this is easier, as MARC uniquely identifies what a subject or descriptor element is via a string of numbers, regardless of what you label it. This is the advantage of using a central registry. That is impossible on the web, so the idea is to achieve a decentralised approach by using XML, namespaces, RDF, and ontologies.

Map subject terms

My question then is: after you have mapped the elements from the different mark-up sets (I presume this is done manually by a team of people), what about mapping the values?

So if we know that the element “subject” from one standard equals “descriptor” from another, fair enough, but what if the value in one document is “bike” and the value in the other is “bicycle”? In other words, do we have to map thesauri, so that subject terms from different thesauri are related and considered when serving up results?

…this point is the precise focus of this blog post.

Even before mapping the different controlled vocabularies, how do you define them? Is this done using a namespace as well?

Then comes the question of mapping them.
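One possible shape for an answer, sketched in Python: treat each thesaurus as a named vocabulary (much like a namespace) and map each (vocabulary, term) pair to a shared concept. The thesaurus names, concept IDs, and documents are all hypothetical, and this is just one guess at how value mapping could work, not a description of how the semantic web actually does it.

```python
# Hypothetical mapping from (vocabulary, term) to a shared concept ID.
CONCEPTS = {
    ("thesaurusA", "bike"): "concept:bicycle",
    ("thesaurusB", "bicycle"): "concept:bicycle",
    ("thesaurusB", "forest"): "concept:woodland",
}

# Made-up documents, each saying which thesaurus its subject term came from.
documents = [
    {"url": "http://example.org/a", "vocab": "thesaurusA", "subject": "bike"},
    {"url": "http://example.org/b", "vocab": "thesaurusB", "subject": "bicycle"},
    {"url": "http://example.org/c", "vocab": "thesaurusB", "subject": "forest"},
]

def search_subject(vocab, term):
    """Return documents whose subject is the same term or the same concept."""
    concept = CONCEPTS.get((vocab, term))
    hits = []
    for doc in documents:
        doc_concept = CONCEPTS.get((doc["vocab"], doc["subject"]))
        same_concept = concept is not None and doc_concept == concept
        same_term = (doc["vocab"], doc["subject"]) == (vocab, term)
        if same_concept or same_term:
            hits.append(doc["url"])
    return hits

print(search_subject("thesaurusA", "bike"))
```

A search for “bike” from one thesaurus also finds the document indexed under “bicycle” in the other, because both terms point at the same concept.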

Recap

  • Using XML we can describe or mark up content any way we see fit, whatever works for us personally
  • Using RDF we can merge these packets of information
  • But for purposes of aggregating our content with other content, we need to distinguish our tags so they don’t collide with tags describing other content that may happen to use the same tags but mean a different concept.
    So we have to declare where we got our mark-up tags from, thereby showing what they mean or refer to (via a namespace).
    Library communities have been using the MARC standard (a string of numbers that uniquely identifies an element regardless of what it is labelled); since the web is not a controlled environment, using a centralised registry like this is not possible.
  • Then the web has to map the mark-up tags from different systems.
    Library communities have been using the Z39.50 standard, whereas the web has to bridge elements from various standards using ontologies
  • Then we have to tell the web from which controlled vocabulary we got our subject terms
  • Then the web can map these subject terms from all the different vocabularies
    …defining the homonyms, synonyms, and many other granular relationships

    …this is at the heart of the semantic web - higher recall, and contextual searching (higher precision).

Subject terms again

So the term “bush” in one vocabulary is mapped to the word “forest” in another vocabulary…so if I do a search for “bush” I also get hits for “forest”. (this is like a synonym ring)

This makes for exhaustivity across disciplines, but we may end up having too much recall, or even lose our context.
If we do a search for “bush”, we get hits for “forest” as well, but we may not want hits for “forest”.
So how far do we go with mapping subject terms from different vocabularies?

This is where operators such as “NOT” help. Web results could also be returned in clusters that show folders of the mapped subject terms, so if you searched the web for the term “bush” in a subject field (what! the web with a subject field…dream on) it would show only those hits, but also list folders with hits from similar subject terms.

…we also need to map the homonyms (the word “bush” could be referring to a president’s name or the name of a rock band); this is searching in context. I guess a way around this is to present the results in clusters, showing the results for the term “bush” split into folders such as nature, music, president, etc…
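The folder idea is easy to sketch in Python, provided each hit already carries a context label, which is exactly the part the semantic web mark-up would have to supply. The hits, URLs, and context labels below are invented for illustration.

```python
from collections import defaultdict

# Made-up search hits; the "context" field is assumed to come from page mark-up.
hits = [
    {"url": "http://example.org/outback", "subject": "bush", "context": "nature"},
    {"url": "http://example.org/band", "subject": "bush", "context": "music"},
    {"url": "http://example.org/whitehouse", "subject": "bush", "context": "president"},
    {"url": "http://example.org/scrub", "subject": "bush", "context": "nature"},
    {"url": "http://example.org/trees", "subject": "forest", "context": "nature"},
]

def cluster(term, records):
    """Group the hits for one subject term into folders by context."""
    folders = defaultdict(list)
    for record in records:
        if record["subject"] == term:
            folders[record["context"]].append(record["url"])
    return dict(folders)

for folder, urls in cluster("bush", hits).items():
    print(folder, urls)
```

A searcher who typed only “bush” would see three folders and could pick the meaning they had in mind, without writing a boolean query.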

Hopefully the semantic web will get around this problem of searching in context by using the elaborate ontologies put in place, so a particular web page’s mark-up can define that the term “bush” in this web page means “president” and not a “tree” and not a “rock band”. At the moment we have to use boolean searching methods to gain some context, which is a problem because lay people don’t search this way; they want quick results with context from a simple query.

At the moment, search engines don’t find subject searching viable and prefer to use free-text page ranking and other methods…see more.

If the web mapped all these subject terms from different vocabularies, could we use this as a browsing mechanism? It would be a massive cross-discipline meta-thesaurus. (This is probably unlikely, as the semantic web would be using ontologies and not a thesaurus.) Although maybe we could browse a discipline thesaurus, or even a particular thesaurus (they could all have check boxes, so you can search web pages seen from the perspective of a set of thesauri).

But then we would have a directory of thesauri to choose from: when you find one (or many) you search or browse it for a subject term, apply this to the search engine query, then scroll your results and view clusters of related or similar subject terms…a time-consuming process!

In brief

  • Syntax is XML (meta-data)
  • Community eg. DC (point to a URL via an XML namespace)
  • Structure is RDF (meta-data about the meta-data)
  • Semantics is ontology (maps rules and relationships - between different elements, and different values, and more)

Please add your views to this post as I’m sure there are parts I haven’t interpreted properly…I will then add an addendum of some sort or fix up the mistakes.

Doubts

This is all put in a different perspective with so much clarity by this essay,
The Semantic Web, Syllogism, and Worldview

Social Web

The social web, mostly the blogosphere and bookmark folksonomies, is creating a watered-down version of the semantic web by default. Aggregated user-defined tags are a quick answer, and are of great value for discovery, sharing, and aggregating to make topic portals, but the question is: does an aggregate of personal contexts cross over to a defined community context? At the moment, “not really”…although we do see notions of the power law, long tail, and related algorithms helping to create emergent vocabularies. But is this enough?

The notion of structured blogging and dataBlogging will contribute to the semantic web; by adding tags to blog content we can derive context. As is mentioned in this article, if content such as a job listing is tagged with the appropriate tag, then any website can aggregate all job listings, so the current players will need to re-think their services beyond just providing content. Lots of players are starting to aggregate content: just look at Yahoo! News or Google News, which are the new competition to traditional news aggregators. So services are now moving beyond just delivering content and into customer service such as personalisation, customisation and integration.

New Web 2.0 services are doing the same thing: Technorati Tags is aggregating blogosphere content into user-defined categories, and the same is being done for the many social bookmark services by services such as Wink.

More references:

The semantic web

What is an Ontology

A quick guide to…XML

A dozen primers on standards

Metadata and the web

RDF - The resource description framework

Re-inventing subject access for the semantic web

XML and the resource description framework: the great web hope

The semantic web: how RDF will change learning technology standards

Semantic web – on the respective roles of XML and RDF

Webopedia – RDF

JISC – Semantic Web Technologies

Search Engines and Resource Discovery on the Web: Is the Dublin Core an Impact Factor?

Writing Semantic Markup

Themes and metaphors in the semantic web discussion

Semantic Web technologies for digital libraries

A bit of commentary on Google and the Semantic Web

How Google beat Amazon and Ebay to the Semantic Web

[ADDED 14/06/06:
Why using RDF instead of XML?
A description of XML namespaces
Avoiding Information Overload : Knowledge Management on the Internet]
