Library clips

sharing ideas thoughts and feedback

June 14, 2006

Attention agents

Filed under: km, semantic, attention

I recently came across a km article called Avoiding Information Overload: Knowledge Management on the Internet…it talks about the many aspects of knowledge representation; search engines, ontologies, XML, RDF, metadata, extraction, semantics, etc.
Something I was curious about was its mention of “agents”…one day we won’t need browsers or search engines as your agent will search and retrieve on your behalf based on your “profile”.

The term “profile” could be similar to “attention” in a way…Alex Barnett often explains the potential of a portable attention file where you can plug in to a service and cut to the chase (the service will seem to know your interests, and personalise to you).

This attention file is based on your searching (eg. Google personalised search), RSS reading (eg. Findory, Rojo), bookmarks, browser recorder, transactions at services like amazon, etc…

The point I’m making is that this scenario so far is us plugging in our attention file into a service (possibly via OPML)…then when we use the service it will return personally relevant stuff based on our attention file.

What about going even further, where we ask an agent to stand in for us, give it some commands and it will fetch what we need based on our attention file?
I suppose this has nothing new to do with attention files, but maybe how they could be used by an agent…I think attention files could be very usable in the semantic web.

I find agents or bots similar to search feeds…but what you are doing is saying “go find stuff I like”.
At the moment we collect RSS feeds ourselves, maybe in the future a bot can go looking for RSS feeds it knows we will like based on our attention file (something Rojo does, but this is based on the data Rojo has about us…we need our own attention file made up from all the services we use. I bet if I then plugged this into Rojo, the feed recommendations would be much more tuned).
Further to that, maybe it could replace our RSS Readers…based on our attention file, our bot could trawl the net (not sure if you ask it where to visit) and return a list of daily readings…the more expansive our attention file becomes, the more the bot will seem to know our preferences.
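To make the idea a bit more concrete, here is a minimal sketch of how a bot might rank what it finds against an attention file. The profile format (topic mapped to a weight), the sample items and the function names are all my own assumptions, not any real attention file standard.

```python
# Hypothetical attention file: topic -> weight, built up from searches,
# feeds read, bookmarks and so on. The format is an assumption.
attention = {"semantic web": 5, "tagging": 3, "opac": 2}

# Hypothetical items the bot found while trawling: (title, topics).
found = [
    ("Microformats primer", ["tagging", "semantic web"]),
    ("Football results", ["sport"]),
    ("Fielded search in library catalogues", ["opac"]),
]

def score(topics, profile):
    """Sum the profile weights of the topics an item carries."""
    return sum(profile.get(topic, 0) for topic in topics)

def daily_readings(items, profile, limit=10):
    """Return the titles that match the profile, best match first."""
    scored = [(score(topics, profile), title) for title, topics in items]
    keep = [pair for pair in scored if pair[0] > 0]
    keep.sort(key=lambda pair: pair[0], reverse=True)
    return [title for _, title in keep[:limit]]

readings = daily_readings(found, attention)
```

The more topics and weights the profile holds, the more the ranking can separate what we care about from what we don’t, which is the “bot seems to know us” effect described above.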

June 4, 2006

Technorati cooking semantic tags

Filed under: General, tags, semantic

Technorati Tags is great at collecting user based tags from blog posts, when you view a tag like apple you’ll get a mix of computer and food posts.

What if there was a register for all these tags…microformats does something similar.

When you include the tag review in your blog post, you append it according to what the code for review is in the register…this is hReview.
When you use the hReview, you are saying this blog post is a review/summary/critique…you don’t use the hReview if your blog post is about reviewing your homework before you hand it in…just use a simple Technorati Tag like review in these cases…this I believe is the difference…there has been a consensus or agreed use for this special tag.
Let’s just be careful that whoever makes these up does it in a group environment, we want these special tags to be right.

Structured Blogging tries to achieve the same thing, by people having a field in their blog editors.

Now for each of these special tags there is a dedicated search module…check out the hReview tag search engine in the Technorati kitchen…compare it with the tag “review” in Technorati Tags.

eg.
Technorati Tags - search for book+review.
Technorati Reviews - search for book.
I like the idea of separating the topic from the activity, but then again book is maybe a document type.

In the future for a book review on cats we could mark up a blog post:
rel=tag cats
MF hReview

All you would have to do is search book in the review engine, and you will get all book reviews about cats…is this right…is the search for the term book a fielded or full-text search?
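Here is a rough sketch of what a review engine might do with such a post. The markup is a heavily simplified, hypothetical hReview (only the “hreview” root class and one rel=tag link), and as I understand the rel-tag draft the tag is read from the last segment of the link URL. The class and function names are my own.

```python
from html.parser import HTMLParser

# Simplified, hypothetical blog post: an hReview wrapper plus one tag link.
POST = """
<div class="hreview">
  <span class="item"><span class="fn">A book about cats</span></span>
  <a href="http://example.com/tag/cats" rel="tag">cats</a>
</div>
"""

def tag_from_href(href):
    """Take the last path segment of the link as the tag."""
    return href.rstrip("/").rsplit("/", 1)[-1]

class PostScanner(HTMLParser):
    """Record whether a post is marked up as a review, and its tags."""

    def __init__(self):
        super().__init__()
        self.is_review = False
        self.tags = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "hreview" in (attrs.get("class") or "").split():
            self.is_review = True
        if tag == "a" and "tag" in (attrs.get("rel") or "").split():
            self.tags.append(tag_from_href(attrs.get("href") or ""))

def in_review_engine(scanner, topic):
    """A fielded match: the post must be a review AND carry the topic tag."""
    return scanner.is_review and topic in scanner.tags

scanner = PostScanner()
scanner.feed(POST)
```

The point of the sketch is that “review” is answered by the structure of the post, while “cats” is answered by the tag, so the two never get mixed up in one free-text search.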

A service like Edgeio who collects posts with the tag “listing” could really benefit from a hListing tag (this will disambiguate tags)…maybe we can have a glossary for hTags…this way tags have got meaning.
But then Edgeio wouldn’t harvest much content as not everyone is going to use hTags, people are already Technorati Tagging away…it’s here to stay.

I guess microformats or these “h” Type tags are a way to share clean and unpolluted information…can anyone make these up, is all they need a little momentum?
What if people make up similar ones, will we need not only a glossary, but a thesaurus…will the namespace solution not be a cure?
NOTE: I use the word cure because tags are spreading like a virus, the tagspace is uncontainable, it is still a semantic mess.

See more from Solution Watch and /message…and of course the Technorati Weblog.

Check out Microformats search in the Technorati Kitchen.

Check out what microformats is all about.

[ADDED: now I see that Technorati Tags is just a microformat like all these new ones (topic/subject), the problem is that the tags aren’t used only for topic labels, whereas at least hReview is certain it will be describing a review. But what kind of review is it…a restaurant, music, book, film, etc…it seems the hReview has an item type field within it…but is this a free choice like tagging, or a selection (it’s a drop down in the hReview Creator below)

Not sure if I’ve grasped this concept entirely…it looks promising.

Here is a sample Review.

Here is the hReview Creator.]

April 6, 2006

Microformat clash

Filed under: semantic

A while ago I posted on how services like Edgeio can aggregate tagged blogposts or tagged URLs and include this data into their service…as long as these tagged posts or objects use the microformat rel=tag.

What if a service aggregates data from objects with the tag “apple”, it may be a service listing Apple computers…then what if another service starts, and also aggregates data from objects with the tag “apple”, but this service is a listing for peoples experiences with the fruit apple.
These services may not know about each other, thus they are creating noise for each other…I guess this is the difference between the publisher explicitly submitting a post rather than the service collecting posts on their own.

Maybe service one has to broadcast to the blogosphere to use the tag “applecomputer” instead…if a service like Edgeio starts up, it is essential they need to tell everyone what tag names they aggregate, and hope that other services don’t come along and try to advocate the same tag, especially if they are about a different topic or type of service.

If other classified services start up they can just aggregate the “listing” tag also, this is also handy for publishers as their posts will appear on multiple boards at the same time.

But if a new service started that isn’t about classifieds but chose to use the “listing” tag…if it generated enough popularity, people would tag their posts “listing” in order to appear on this new service. The repercussions are that Edgeio will now get posts tagged “listing” but the content may have nothing to do with their service…so there has to be a greater quality control filter.
(And also this new service will get all posts tagged “listing” that were intended to appear at Edgeio).
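A tiny sketch of the clash, with made-up posts and a made-up “intended_for” field that only exists so we can count the noise (a real harvester would not know the publisher’s intent, which is exactly the problem):

```python
# Hypothetical posts. "intended_for" records which kind of service the
# publisher had in mind; a real aggregator cannot see this.
posts = [
    {"title": "Bike for sale", "tags": {"listing"}, "intended_for": "classifieds"},
    {"title": "Gig listing: Friday", "tags": {"listing"}, "intended_for": "events"},
    {"title": "My holiday", "tags": {"travel"}, "intended_for": None},
]

def harvest(posts, tag):
    """Collect every post carrying the tag, as a tag aggregator would."""
    return [post for post in posts if tag in post["tags"]]

def noise(harvested, service):
    """Titles that were harvested but were meant for some other service."""
    return [p["title"] for p in harvested if p["intended_for"] != service]

classifieds_noise = noise(harvest(posts, "listing"), "classifieds")
events_noise = noise(harvest(posts, "listing"), "events")
```

Both services harvest the same two posts, and each ends up with the other’s post as noise.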

So are these microformat tags going to one day start clashing, due to harnessing the bottom-up approach?
Will a more defined approach such as structured blogging be more appropriate?

Maybe there is more to this, not sure…

Check out Alex Barnett’s latest podcast on microformats…I’m yet to listen to this.

December 19, 2005

Blogosphere as an OPAC

Filed under: General, blogs, tags, semantic

Burningbird has her take on Structured Blogging, mentioning that 2006 will be the year of metadata (that old semantic thing)

…here’s a pioneer to this idea.

Basically it includes some additional fields in your blog’s posting interface, these fields are descriptive metadata…so you can explicitly say, this post is a movie review…I guess this would work by labelling your post with the blog category “movie”, and if you check the structured blogging field “review”, this would distinguish the post from just being about movies in general, to explicitly a movie review…I guess it is leaving blog categories to describe the subject, and the blog fields to describe the post type or style.
Actually, looking at an example, it is much more definitive, if this takes off we can aggregate precise content according to our requests (very semantic).

Can’t we already do this using Tags in our posts, which are collated by services such as Technorati…we can go to Technorati Tags and browse.
It’s just a bit messier with tags as it is a bottom-up system, you have to browse around a bit for tags like film, cinema, movies, dvd…although this may be a good thing as it is more specific, but it doesn’t necessarily mean the posts will be “review” type, you could add this tag (but that may limit your results).

Sample: film OR cinema OR movie OR dvd

film OR cinema OR movie OR dvd AND review

(film OR cinema OR movie OR dvd) AND review
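The brackets matter. Here is a small sketch of the difference between the last two queries, assuming the search engine gives AND a higher precedence than OR, the way Python does (I don’t know how Technorati actually parses them). The function names and the sample post are made up.

```python
def unbracketed(tags):
    # AND is evaluated before OR, so this really means
    # film OR cinema OR movie OR (dvd AND review).
    return ("film" in tags or "cinema" in tags or "movie" in tags
            or "dvd" in tags and "review" in tags)

def bracketed(tags):
    # (film OR cinema OR movie OR dvd) AND review
    return (("film" in tags or "cinema" in tags or "movie" in tags
             or "dvd" in tags) and "review" in tags)

# A post about a film festival that is not a review.
post = {"film", "festival"}
```

Under that assumption the unbracketed query lets the festival post through, while the bracketed one only returns posts that also carry the “review” tag.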

But this would mean people would have to tag their posts with both tags, eg. movie, review…eg. movie+review…and what about blog categories, they are not generally going to have both categories; movie, and review.

So the aboutness of these tags is a bit more vague, whereas with Structured Blogging the aboutness can be more exact…but then a more controlled system like Structured Blogging is going to be harder to get everyone to adhere to, even if the 3 biggest blogging platforms have plugins, will everyone use it (I’d have to ask my host), and what about the rest of the other blogging platforms.

Also who decides on these metadata fields, who is the authority, can we suggest fields?

I like the idea of defining an element set, kind of like a version of DC for the blogosphere, but a centralised subject index is another ball game (the values within the element).
So I still like the idea of bottom-up tags, but I also like the idea of a subtle structure, as tags are used not only as subject terms, so if we could just classify our tags into facets, these facets are the Structured Blogging fields, eg. Author, Review, Media, Publisher, Title, etc…

If this works, this is a step towards the user driven semantic web (ie. we the bloggers define and describe contents at the time of posting according to a standard)…then any service will be able to aggregate posts on a topic with hopefully minimal noise.

I can even see this working even better at a blog collective level, such as the Corante Web Hub, as tag clouds can start to get out of hand…that is of course unless participants decide on a tag set, but this has more to do with the values in the field, what I like about Structured Blogging is that it defines an element type…like search fields in an OPAC.

At the moment you can’t search within a tag at Technorati Tags, let alone limiting this to an OPML, hopefully Structured Blogging will allow us to search within a subject type, that is also a certain document type…this is targeted fielded searching, the blogosphere as an OPAC.

More

How will this integrate with social bookmarking services, if you bookmark a post (that has been made with Structured Blogging) will the fields auto-fill, and will these fields be available in bookmark services…then we would be able to search bookmarks by author, title, subject type, document type, format type, etc…this would be a human indexed web OPAC.

October 10, 2005

Semantic web: subject searching (XML, DC, RDF, ontologies…)

Filed under: search, semantic

I’ve been surfing the web in an effort to learn about XML, RDF, ontologies…basically the semantic web.
I’ve found this hard to grasp on my own so I’ve decided to share my understandings via this blog in an effort to induce comment in order to clarify my grip of the subject matter (after all what are blogs for!)

NOTE: I’ve used the web to gain knowledge (as opposed to consulting a person or taking a course or reading a print publication) and used the web to publish and disseminate my learning, and hopefully in turn use the web for discussion…so in tackling “Web 2.0” I have been using the tools that Web 2.0 has to offer!

Here is my effort in describing what I understand of the semantic web aspect of Web 2.0 and how a part of it may emulate subject searching on an OPAC or scholarly database.

Ontology is the hardest term for me to grasp, I thought it was about mapping all the vocabularies used on the web (by mapping I mean establishing the semantic relationships) but it seems to be this and more, a type of controlled vocabulary that is more detailed and complex than a thesaurus and it deals with more than just subject terms (in the library environment this is where a thesaurus stops, and AACR2 takes care of other descriptive terms, but this is only grammatical, it doesn’t infer relationships as ontologies do).

Ontology will map relationships between anything; subject terms, people, places… it is very complex…it sets rules, relationships, inferences and maps things in every instance of a term or phrase so a computer can process the information accordingly.

So an ontology for the web will serve computers not people… …I’m not sure if you can use an ontology for browsing purposes, maybe it doesn’t have a structure and is just a list of inferences that represents stuff…so it’s not used to navigate, it’s just a running list of relationships and rules (logical statements).

For an example of an instance of an ontology see, Converting a controlled vocabulary into an ontology: the case of GEM.

The semantic web promises contextual search where all information is marked-up in order to define itself, set rules (so they can be retrieved in a given context), and mapped to other resources (to distinguish relationships).
This will allow users to search, similar to fielded searching; if you search for an author you will hope to find items created by the author, ignoring items where the author name appears in the full-text or in the bibliography (this is searching without the noise)…as mentioned before the semantic web goes beyond fielded searching and defines relationships at a very granular level.
Since information will be machine processable, it is mentioned that we can have electronic web-agents perform automated searching on our behalf. (What we don’t have to surf the web ourselves!)

For our purposes we will narrow the focus to ontologies mapping subject terms on the web, so the web can act like an OPAC or fielded database.

MY INTERPRETATION

HTML is a mark up language for web browsers

<strong>hello</strong>

the web browser is told to make this bold

XML is a mark-up tool to make your own mark up language

So you can make up your own tags
<bike>racer</bike>

The beauty about XML is that tags are not just about presentation (like most of HTML); they can be about structure, or anything for that matter. You can even separate content from form (which is great because you can take the content from, say a memo, and present it in the structure of a poster.)…you can separate parts of mark-up by keeping, for example, all your structural tags in a DTD file.
The DTD file can also act like a simple ontology as it can be used to define the grammar of your XML syntax (similar to an XML namespace).

I guess XML tags can be verbs as well, they’re just labels in the end.

The other great thing is that XML is interoperable, meaning you can import/export XML files to any program…it’s up to the program to be able to read it, and how it reads it. One program may interpret the tag different to another program, and it can ignore the tags that aren’t important.
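A minimal sketch of a program reading only the tags it cares about, using Python’s standard XML parser. The document and its tag names (garage, colour) are made up to go with the bike example.

```python
import xml.etree.ElementTree as ET

# Made-up XML using our own tags.
DOC = "<garage><bike>racer</bike><colour>red</colour><bike>tourer</bike></garage>"

root = ET.fromstring(DOC)

# This program only cares about <bike>; it simply never looks at <colour>.
bikes = [element.text for element in root.iter("bike")]
```

Another program could read the same file and only pull out the colour; the file itself doesn’t care.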

When it comes to the web (which is one big program or database) then all these tag meanings need to correlate. All the databases are coming together to form one big database, but really the databases are still separate, they will be mapped together so they can work in unison.

Define mark-up elements

Firstly we have to give meaning to our mark-up labels and then map relationships between everyone’s mark-up labels – as there will be multiple mark-up standards out there. Even though there is freedom with XML we can’t have 2 or more people using the “bike tag”, when they mean different things.

So each person who uses XML tags needs to give meaning to them, as they mean nothing if they can’t be interpreted.

In an ideal world if everyone used the same set of mark-up labels to describe resources (as HTML does for presentation) then we wouldn’t have a problem in searching the web like an OPAC, this is what DC tries to do.

But the reality is that there are many mark-up standards to cover many different types of resources, industries and disciplines, so creating a universal standard is futile, rigid, and non-productive.

So we need a way to tell the web the mark-up (standard) we used to describe a web page so it doesn’t clash with another standard that happens to use the same tag term for perhaps a different meaning….it’s a way to distinguish between ambiguous elements.

This is achieved with the XML namespace (declared near the top of your document, usually on the root element), it points to a URL, saying this is where I got my mark up tags from (verification).
eg. It may point to the web page of a particular mark-up tag set like DC
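Here is a small sketch of two “bike” tags kept apart by namespaces. The two example.org URLs and the prefixes are invented for illustration; only the mechanism is the point.

```python
import xml.etree.ElementTree as ET

# Two communities both use a "bike" tag; the namespace URLs are made up.
DOC = """
<catalogue xmlns:cyc="http://example.org/cycling#"
           xmlns:moto="http://example.org/motorcycles#">
  <cyc:bike>racer</cyc:bike>
  <moto:bike>cruiser</moto:bike>
</catalogue>
"""

NS = {
    "cyc": "http://example.org/cycling#",
    "moto": "http://example.org/motorcycles#",
}

root = ET.fromstring(DOC.strip())

# The parser stores each tag together with its namespace, so the two
# "bike" elements are no longer ambiguous.
full_names = [element.tag for element in root]
pedal_bikes = [el.text for el in root.findall("cyc:bike", NS)]
motor_bikes = [el.text for el in root.findall("moto:bike", NS)]
```

Internally the parser sees {http://example.org/cycling#}bike and {http://example.org/motorcycles#}bike, two different names that just happen to share a label.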

Map mark-up

Now the next problem is for the web to map all the different mark-up standards, eg DC creator relates generally to the author of an item, but other mark-up standards may use the mark-up “author” or “owner”…they all roughly refer to the same thing so the web needs to map these mark-up standards so when we do a search we are searching in multiple communities (mark-up standards) at the one time, this is interoperability.

I’m not sure how this will happen, the web has to compare elements from multiple mark-up standards and decide, yes “Creator” in DC is similar if not equal to “Author” in some other language, so I’ll merge these items in the results.
I know in the library environment, it doesn’t matter which label you use, the MARC number (unique identifier) is able to identify what both elements mean and then it can be mapped via the Z39.50 standard
…but this is easy in a library environment as all labels/elements have a MARC number, in the web there is no consensus, it’s much more chaotic.
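The simplest way I can picture this mapping is a hand-made crosswalk table, sketched below. The “xx” and “yy” prefixes stand in for some other mark-up standards and are invented, as are the records; dc:creator is the only real element name here.

```python
# Hypothetical crosswalk: each standard's element -> one shared field name.
CROSSWALK = {
    "dc:creator": "author",
    "xx:author": "author",
    "yy:owner": "author",
}

# Made-up records described with three different standards.
records = [
    {"dc:creator": "Jane Smith", "dc:title": "On tags"},
    {"xx:author": "Jane Smith", "xx:name": "Tags again"},
    {"yy:owner": "John Brown"},
]

def normalise(record):
    """Rename mapped elements to the shared field; leave the rest alone."""
    return {CROSSWALK.get(key, key): value for key, value in record.items()}

def search(records, field, value):
    """Search across all standards at once through the shared field."""
    return [r for r in records if normalise(r).get(field) == value]

hits = search(records, "author", "Jane Smith")
```

This is the centralised, library-style answer: someone has to maintain the table. The decentralised version of the same idea is what the ontology languages are for.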

RDF

So how does all this actually work, from the computer’s perspective, making it machine-processable (an illusion that the machine understands)

RDF is made from XML, it is a framework to exchange the data we have been talking about above, it allows this data to work beyond proprietary systems – the metadata can be shared amongst different web applications, it also allows you to use multiple mark-up standards within the same document.

RDF also has its own namespace to list its own custom mark-up tags…for more see, An idiot’s guide to the resource description framework.

RDF encapsulates the resource, it makes statements about resources…so it’s meta-data for the meta-data (so RDF is a container that holds all this information), ultimately it allows for different vocabularies to exist in a distributed way without needing a central place.

What it is stipulating is the structure for your mark-up (source code for your web page), and this is all held within the RDF beginning and end tags, <rdf:RDF> and </rdf:RDF>.

The structure is made up of 3 aspects:

  • Resource - URI pointing to the resource (which is usually a URL webpage)
  • Property type - properties describing the resource (eg. DC) and validate this
    via an XML namespace
  • Value - value of the property. ie name of the title or person or the subject term, etc…

NOTE: the property can become a resource itself with its own property types and values
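To see the three parts in one place, here is a sketch that reads a tiny RDF/XML description and pulls out (resource, property, value) statements. The resource URL and the values are made up, and the loop only copes with this flat shape, it is nowhere near a full RDF parser.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# A tiny description using the Dublin Core element namespace.
# The resource URL and the values are invented for illustration.
DOC = """
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/elements/1.1/">
  <rdf:Description rdf:about="http://example.org/post/1">
    <dc:title>Attention agents</dc:title>
    <dc:creator>A. Blogger</dc:creator>
  </rdf:Description>
</rdf:RDF>
"""

root = ET.fromstring(DOC.strip())

triples = []
for description in root.findall(f"{{{RDF_NS}}}Description"):
    resource = description.get(f"{{{RDF_NS}}}about")   # the resource (URI)
    for prop in description:
        # prop.tag is the property type, qualified by its namespace;
        # prop.text is the value.
        triples.append((resource, prop.tag, prop.text))
```

Each entry in triples is one statement: which resource, which property (and which standard it came from), and what value.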

See below for more on how it does this by defining the structure of a resource.

Metadata for the web RDF and the Dublin core

What is RDF?

The semantic web: a network of content for the digital city

An introduction to the resource description framework

Although beyond the scope of this article, I will also mention that RDF is more than this…it is also referred to as Data Modelling.

This article explains that even if you use a tag such as <Author>Paul Warren</Author>, XML on its own can’t say anything about Paul Warren as it can’t map relationships when used by itself…more from the article:

“…we have no way of saying anything about “Paul Warren”, e.g. that he is employed by BT, that he has written other articles etc. To overcome this limitation Resource Description Framework (RDF) was designed […] based on the use of “subject, verb, object” triples (e.g. “Paul Warren”, is employed by, “BT”).”

So the semantic web is more than listing values in a database within elements, ie. listing whole heaps of author names under the element ID of “Author”…it is different to a relational database in that it is a list of free floating logical statements, that aren’t limited to be only identified within one element, instead the value itself (the actual author’s name) can have a unique ID via a URI…read what I’m referring to at this article.
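A rough sketch of those free floating statements, using the Paul Warren example from the quote. The “ex:” identifiers stand in for real URIs, and the article title is a placeholder I made up.

```python
# Statements as (subject, verb, object). The "ex:" names are stand-ins for
# URIs, and the article title is a made-up placeholder.
triples = [
    ("ex:PaulWarren", "ex:isEmployedBy", "ex:BT"),
    ("ex:PaulWarren", "ex:hasWritten", "ex:Article1"),
    ("ex:Article1", "dc:title", "An article title"),
]

def about(subject):
    """Everything that has been said about one subject."""
    return [(p, o) for s, p, o in triples if s == subject]

def follow(subject, *path):
    """Follow a chain of properties, e.g. person -> article -> title."""
    current = {subject}
    for prop in path:
        current = {o for s, p, o in triples if s in current and p == prop}
    return current

titles = follow("ex:PaulWarren", "ex:hasWritten", "dc:title")
```

Because the author is an identifier rather than a string sitting in an “Author” column, anyone can add another statement about him without touching the original record.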

Map mark-up elements

This is where ontologies will have to shine, in the form of RDF Schema, OWL, OIL, DAML (the DARPA Agent Markup Language).

From this example:

Just say in describing the aboutness of an object I use DC:subject, and someone else uses AB:descriptor.

Well the web will know via the namespace URI where these mark-up standards reside, and since someone has done the groundwork to say [X dc:subject Y] is the same as [X ab:descriptor Y] (this is an ontology at work), then the computer can deduce that “subject” and “descriptor” mean the same thing, so we get higher recall, and context from our dataset.
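A sketch of that deduction, assuming someone has published a single equivalence statement between the two properties (OWL has owl:equivalentProperty for this kind of thing). The documents and values are made up, and “ab” is the imaginary standard from the example.

```python
# The groundwork someone has done: these two properties mean the same thing.
EQUIVALENT = [("dc:subject", "ab:descriptor")]

# Made-up statements from documents described with different standards.
triples = [
    ("doc1", "dc:subject", "bicycles"),
    ("doc2", "ab:descriptor", "bicycles"),
    ("doc3", "dc:title", "bicycles"),
]

def equivalents(prop):
    """The property itself plus anything declared equivalent to it."""
    same = {prop}
    for a, b in EQUIVALENT:
        if a == prop:
            same.add(b)
        if b == prop:
            same.add(a)
    return same

def find(prop, value):
    """Fielded search that also looks in the equivalent properties."""
    props = equivalents(prop)
    return sorted(s for s, p, o in triples if p in props and o == value)

hits = find("dc:subject", "bicycles")
```

doc2 is the extra recall, and doc3 staying out of the results (the word is only in its title) is the context.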

Going back, in the library environment it is easier as MARC uniquely identifies what a subject or descriptor element is regardless of what you label it via a string of numbers, this is the advantage of using a central registry. This is impossible on the web so the idea of RDF is achieving a decentralised approach by using XML, namespaces, RDF, and ontologies.

Map subject terms

My question is then after you have mapped the elements from the different mark-up sets (I presume this is done manually by a team of people), what about mapping the values?

So if we know that the element, “subject”, from one standard equals, “descriptor”, from another fair enough, but what if the value in one document is “bike”, and the value in the other is “bicycle” - in other words do we have to map thesauri, so subject terms from different thesauri are related, and considered when serving up results?

…this point is the precise focus of this blog post.

Even before mapping the different controlled vocabularies, how do you define them, is this using a namespace as well?

Then comes the question of mapping them?
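One way I can imagine it working, sketched below: every subject term is recorded together with the vocabulary it came from, and someone declares which terms from different vocabularies stand for the same concept. The vocabulary names “vocabA” and “vocabB” and the documents are invented.

```python
# Hypothetical mapping: each set holds (vocabulary, term) pairs that
# someone has declared to mean the same concept.
SAME_CONCEPT = [
    {("vocabA", "bike"), ("vocabB", "bicycle")},
]

# Made-up documents, each saying which vocabulary its subject term is from.
documents = [
    {"id": "doc1", "vocab": "vocabA", "subject": "bike"},
    {"id": "doc2", "vocab": "vocabB", "subject": "bicycle"},
    {"id": "doc3", "vocab": "vocabB", "subject": "boat"},
]

def expand(vocab, term):
    """Return the term plus its mapped equivalents in other vocabularies."""
    for ring in SAME_CONCEPT:
        if (vocab, term) in ring:
            return ring
    return {(vocab, term)}

def subject_search(vocab, term):
    """Subject search that honours the cross-vocabulary mapping."""
    wanted = expand(vocab, term)
    return [d["id"] for d in documents if (d["vocab"], d["subject"]) in wanted]

hits = subject_search("vocabA", "bike")
```

Naming the vocabulary alongside the term is doing the same job for values that the namespace does for elements.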

Recap

  • Using XML we can describe or mark-up content anyway we see fit, whatever works for us personally
  • Using RDF we can merge these packets of information
  • But for purposes of aggregating our content with other content, we need to distinguish our tags so they don’t collide with tags describing other content that may happen to use the same tags but mean a different concept.
    So we have to define where we got our mark-up tags from therefore showing what they mean or refer to (via a namespace)
    Library communities have been using the MARC standard (a string of numbers that uniquely identifies an element regardless of what it is labelled), since the web is not a controlled environment using a centralised registry like this is not possible.
  • Then the web has to map the mark-up tags from different systems.
    Library communities have been using the Z39.50 standard, whereas the web has to bridge elements from various standards using ontologies
  • Then we have to tell the web from which controlled vocabulary we got our subject terms
  • Then the web can map these subject terms from all the different vocabularies
    …defining the homonyms, synonyms, and many other granular relationships

    …this is at the heart of the semantic web - higher recall, and contextual searching (higher precision).

Subject terms again

So the term “bush” in one vocabulary is mapped to the word “forest” in another vocabulary…so if I do a search for “bush” I also get hits for “forest”. (this is like a synonym ring)

This makes for exhaustivity across disciplines, but we may end up having too much recall, or even lose our context.
If we do a search for “bush”, we get hits for “forest” as well, but we may not want hits for “forest”.
So how far do we go with mapping subject terms from different vocabularies?

This is where using operators such as “NOT“ help, also web results could be returned in clusters that show folders of the mapped subject terms…so if you searched the web for the term “bush” in a subject field (what! the web with a subject field…dream on) it would show only those hits, but also list folders with hits from similar subject terms.

…also need to map the homonyms (the word “bush” could be referring to a president’s name or the name of a rock band) this is searching in context…I guess a way around this is to present the results in clusters, showing the results for the term “bush” split into folders as nature, music, president, etc…
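A sketch of the folder idea, assuming each hit already carries a “context” label saying which sense of the word it means (how the web would get those labels is the hard part). The hits and labels are made up, and the exclude argument plays the part of a NOT operator.

```python
from collections import defaultdict

# Made-up hits; the "context" label is assumed to come from the page mark-up.
hits = [
    {"title": "Walking tracks in the bush", "subject": "bush", "context": "nature"},
    {"title": "Bush tour dates", "subject": "bush", "context": "music"},
    {"title": "Bush gives speech", "subject": "bush", "context": "president"},
    {"title": "Forest fire season", "subject": "forest", "context": "nature"},
]

def cluster(hits, term, exclude=()):
    """Group hits for a subject term into folders, one per context."""
    folders = defaultdict(list)
    for hit in hits:
        if hit["subject"] == term and hit["context"] not in exclude:
            folders[hit["context"]].append(hit["title"])
    return dict(folders)

folders = cluster(hits, "bush")
without_music = cluster(hits, "bush", exclude=("music",))
```

The searcher still types the simple query “bush”, and picks the folder instead of writing a boolean expression.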

Hopefully the semantic web will get around this problem of searching in context by using the elaborate ontologies put in place…so a particular web page’s mark-up can define that the term “bush” in this webpage means “president” and not a “tree” and not a “rock band”…at the moment we have to use boolean searching methods to gain some context, this may be a problem at present as lay people don’t search this way, they want quick results with context by using a simple query.

At the moment, search engines don’t find subject searching viable and prefer to use free-text page ranking and other methods…see more..

If the web mapped all these subject terms from different vocabularies; could we use this as a browsing mechanism, it would be a massive cross discipline meta-thesaurus. (This is probably unlikely, as the semantic web would be using ontologies and not a thesaurus). Although maybe we could browse a discipline thesaurus, or even a particular thesaurus…(they could all have check boxes, so you can search webpages seen from the perspective of a set of thesauri).

But then we would have a directory of thesauri to choose from, when you find one (or many) you then do a search or browse for a subject term, then you apply this to the search engine query, then scroll your results and view clusters of related or similar subject terms…a time-consuming process!

In brief

  • Syntax is XML (meta-data)
  • Community eg. DC (point to a URL via a XML namespace)
  • Structure is RDF (meta-data about the meta-data)
  • Semantics is ontology (maps rules and relationships - between different elements, and different values, and more)

Please add your views to this post as I’m sure there are parts I haven’t interpreted properly…I will then add an addendum of some sort or fix up the mistakes.

Doubts

This is all put in a different perspective with so much clarity by this essay,
The Semantic Web, Syllogism, and Worldview

Social Web

The social web, mostly the blogosphere and bookmark folksonomies are creating a watered-down version of the semantic web by default. Aggregated user-defined tags are a quick answer, and are of great value for discovery, sharing, and aggregating to make topic portals, but the question is does an aggregate of personal contexts cross over to a defined community context, at the moment “not really”…although we do see notions of the power law, long tail, and related algorithms helping to create emergent vocabularies, but is this enough.

The notion of structured blogging and dataBlogging will contribute to the semantic web; by adding tags to blog content we can derive context. As is mentioned in this article if content such as a job listing is tagged with the appropriate tag, then any website can aggregate all job listings, the current players will need to re-think their services other than just providing content…lots of players are starting to aggregate content, just look at Yahoo! News or Google News, they are the new competition to traditional news aggregators, so now services are moving forward beyond just delivering content, and into customer service such as personalisation, customisation and integration.

New Web 2.0 services are doing the same thing: Technorati Tags is aggregating blogosphere content into user-defined categories, and the same is being done for the many social bookmark services by services such as Wink.

More references:

The semantic web

What is an Ontology

A quick guide to…XML

A dozen primers on standards

Metadata and the web

RDF - The resource description framework

Re-inventing subject access for the semantic web

XML and the resource description framework: the great web hope

The semantic web: how RDF will change learning technology standards

Semantic web – on the respective roles of XML and RDF

Webopedia – RDF

JISC – Semantic Web Technologies

Search Engines and Resource Discovery on the Web: Is the Dublin Core an Impact Factor?

Writing Semantic Markup

Themes and metaphors in the semantic web discussion

Semantic Web technologies for digital libraries

A bit of commentary on Google and the Semantic Web

How Google beat Amazon and Ebay to the Semantic Web

[ADDED 14/06/06:
Why using RDF instead of XML?
A description of XML namespaces
Avoiding Information Overload : Knowledge Management on the Internet]
