Library clips

sharing ideas, thoughts and feedback

June 14, 2006

Attention agents

Filed under: km, semantic, attention

I recently came across a km article called Avoiding Information Overload: Knowledge Management on the Internet…it talks about the many aspects of knowledge representation; search engines, ontologies, XML, RDF, metadata, extraction, semantics, etc.
Something I was curious about was its mention of “agents”…one day we won’t need browsers or search engines as your agent will search and retrieve on your behalf based on your “profile”.

The term “profile” could be similar to “attention” in a way…Alex Barnett often explains the potential of a portable attention file that you can plug in to a service and cut to the chase (the service will seem to know your interests, and personalise to you).

This attention file is based on your searching (eg. Google personalised search), RSS reading (eg. Findory, Rojo), bookmarks, browser recorder, transactions at services like amazon, etc…

The point I’m making is that this scenario so far is us plugging in our attention file into a service (possibly via OPML)…then when we use the service it will return personally relevant stuff based on our attention file.

What about even further where we ask an agent to stand in for us, give it some commands and it will fetch what we need based on our attention file.
I suppose this has nothing new to do with attention files, but maybe how they could be used by an agent…I think attention files could be very usable in the semantic web.

I find agents or bots similar to search feeds…but what you are doing is saying “go find stuff I like”.
At the moment we collect RSS feeds ourselves, maybe in the future a bot can go looking for RSS feeds it knows we will like based on our attention file (something Rojo does, but this is based on the data Rojo has about us…we need our own attention file made up from all the services we use. I bet if I then plugged this into Rojo, the feed recommendations would be much more tuned).
Further to that, maybe it could replace our RSS Readers…based on our attention file, our bot could trawl the net (not sure if you ask it where to visit) and return a list of daily readings…the more expansive our attention file becomes, the more the bot will seem to know your preferences.

June 4, 2006

Technorati cooking semantic tags

Filed under: General, tags, semantic

Technorati Tags is great at collecting user-based tags from blog posts, but when you view a tag like apple you’ll get a mix of computer and food posts.

What if there was a register for all these tags…microformats does something similar.

When you include the tag review in your blog post, you append it according to what the code for review is in the register…this is hReview.
When you use the hReview, you are saying this blog post is a review/summary/critique…you don’t use the hReview if your blog post is about reviewing your homework before you hand it in…just use a simple Technorati Tag like review in these cases…this I believe is the difference…there has been a consensus or agreed use for this special tag.
Let’s just be careful that whoever makes these up does it in a group environment; we want these special tags to be right.

Structured Blogging tries to achieve the same thing, by people having a field in their blog editors.

Now for each of these special tags there is a dedicated search module…check out the hReview tag search engine in the Technorati kitchen…compare it with the tag “review” in Technorati Tags.

eg.
Technorati Tags - search for book+review.
Technorati Reviews - search for book.
I like the idea of separating the topic from the activity, but then again book is maybe a document type.

In the future for a book review on cats we could mark up a blog post:
rel=tag cats
MF hReview

All you would have to do is search book in the review engine, and you will get all book reviews about cats…is this right…is the search for the term book a fielded or full-text search?
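
To make the fielded versus full-text question a bit more concrete, here is a rough Python sketch of what a review engine might do with such a post. It treats the hreview class as the “this is a review” field and takes the topic from the rel=tag link (with rel=tag the tag is the last segment of the link’s URL, not the link text). The sample post, the example.org URL and the names MicroformatScanner and scan are made up for illustration; this is not how Technorati actually indexes anything.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse, unquote


class MicroformatScanner(HTMLParser):
    """Notes whether a post carries an hReview and collects its rel=tag tags."""

    def __init__(self):
        super().__init__()
        self.tags = []
        self.is_review = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        rels = (attrs.get("rel") or "").split()
        # The activity field: an element with class "hreview" marks a review.
        if "hreview" in classes:
            self.is_review = True
        # The topic field: the tag is the last path segment of the href.
        if tag == "a" and "tag" in rels and attrs.get("href"):
            path = urlparse(attrs["href"]).path.rstrip("/")
            self.tags.append(unquote(path.rsplit("/", 1)[-1]).lower())


# Hypothetical blog post: a book review about cats.
POST = """
<div class="hreview">
  <span class="item"><span class="fn">A Field Guide to Cats</span></span>
  <a href="http://example.org/tag/cats" rel="tag">cats</a>
</div>
"""


def scan(html):
    scanner = MicroformatScanner()
    scanner.feed(html)
    return scanner.is_review, scanner.tags


print(scan(POST))
```

With the two fields kept apart like this, “is it a review” and “is it about cats” can be answered separately, which is the fielded behaviour I was hoping for.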

A service like Edgeio who collects posts with the tag “listing” could really benefit from a hListing tag (this will disambiguate tags)…maybe we can have a glossary for hTags…this way tags have got meaning.
But then Edgeio wouldn’t harvest much content as not everyone is going to use hTags, people are already Technorati Tagging away…it’s here to stay.

I guess microformats or these “h” Type tags are a way to share clean and unpolluted information…can anyone make these up, is all they need a little momentum.
What if people make up similar ones, will we need not only a glossary, but a thesaurus…will the namespace solution not be a cure.
NOTE: I use the word cure because tags are spreading like a virus, the tagspace is uncontainable, it is still a semantic mess.

See more from Solution Watch and /message…and of course the Technorati Weblog.

Check out Microformats search in the Technorati Kitchen.

Check out what microformats is all about.

[ADDED: now I see that Technorati Tags is just a microformat like all these new ones (topic/subject), the problem is that the tags aren’t used only for topic labels, whereas at least hReview is certain it will be describing a review. But what kind of review is it…a restaurant, music, book, film, etc…it seems the hReview has an item type field within it…but is this a free choice like tagging, or a selection (it’s a drop down in the hReview Creator below)

Not sure if I’ve grasped this concept entirely…it looks promising.

Here is a sample Review.

Here is the hReview Creator.]

April 6, 2006

Microformat clash

Filed under: semantic

A while ago I posted on how services like Edgeio can aggregate tagged blogposts or tagged URLs and include this data into their service…as long as these tagged posts or objects use the microformat rel=tag.

What if a service aggregates data from objects with the tag “apple”, it may be a service listing Apple computers…then what if another service starts, and also aggregates data from objects with the tag “apple”, but this service is a listing for people’s experiences with the fruit apple.
These services may not know about each other, thus they are creating noise for each other…I guess this is the difference between the publisher explicitly submitting a post rather than the service collecting posts on their own.

Maybe service one has to broadcast to the blogosphere to use the tag “applecomputer” instead…if a service like Edgeio starts up, it is essential they need to tell everyone what tag names they aggregate, and hope that other services don’t come along and try to advocate the same tag, especially if they are about a different topic or type of service.

If other classifieds services start up they can just aggregate the “listing” tag as well; this is also handy for publishers as their posts will appear on multiple boards at the same time.

But if a new service started that isn’t about classifieds but chose to use the “listing” tag…if they generated enough popularity, people would tag their posts “listing” in order to appear on this new service. The repercussions are that Edgeio will now get posts tagged “listing” but the content may have nothing to do with their service…so there has to be a greater quality control filter.
(And also this new service will get all posts tagged “listing” that were intended to appear at Edgeio).
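
A toy Python sketch of the clash, assuming both services harvest purely on the tag name. The posts, the “gig guide” service and the meant_for field are invented (a real harvester has no such field, which is exactly the problem).

```python
# Hypothetical posts; "meant_for" records what the publisher had in mind,
# information that a tag-only harvester never gets to see.
POSTS = [
    {"url": "http://example.org/bike-for-sale", "tags": {"listing"}, "meant_for": "classifieds"},
    {"url": "http://example.org/flat-to-rent", "tags": {"listing"}, "meant_for": "classifieds"},
    {"url": "http://example.org/gig-dates", "tags": {"listing"}, "meant_for": "gig guide"},
]


def harvest(tag):
    """Everything carrying the tag, which is all a rel=tag harvester can ask for."""
    return [post for post in POSTS if tag in post["tags"]]


def noise_for(service, tag):
    """Harvested posts that were never intended for this service."""
    return [post["url"] for post in harvest(tag) if post["meant_for"] != service]


print(noise_for("classifieds", "listing"))
print(noise_for("gig guide", "listing"))
```

Both services receive all three posts, so each one ends up filtering out the other’s content by hand or by some extra quality control.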

So are these microformat tags going to one day start clashing, due to harnessing the bottom-up approach?
Will a more defined approach such as structured blogging be more appropriate?

Maybe there is more to this, not sure…

Check out Alex Barnett’s latest podcast on microformats…I’m yet to listen to this.

December 19, 2005

Blogosphere as an OPAC

Filed under: General, blogs, tags, semantic

Burningbird has her take on Structured Blogging, mentioning that 2006 will be the year of metadata (that old semantic thing)

…here’s a pioneer to this idea.

Basically it includes some additional fields in your blog’s posting interface, these fields are descriptive metadata…so you can explicitly say, this post is a movie review…I guess this would work by labelling your post with the blog category “movie”, and if you check the structured blogging field “review”, this would distinguish the post from just being about movies in general, to explicitly a movie review…I guess it is leaving blog categories to describe the subject, and the blog fields to describe the post type or style.
Actually, looking at an example, it is much more definitive, if this takes off we can aggregate precise content according to our requests (very semantic).

Can’t we already do this using Tags in our posts, which are collated by services such as Technorati…we can go to Technorati Tags and browse.
It’s just a bit messier with tags as it is a bottom-up system, you have to browse around a bit for tags like film, cinema, movies, dvd…although this may be a good thing as it is more specific, but it doesn’t necessarily mean the posts will be “review” type, you could add this tag (but that may limit your results).

Sample: film OR cinema OR movie OR dvd

film OR cinema OR movie OR dvd AND review

(film OR cinema OR movie OR dvd) AND review

But this would mean people would have to tag their posts with both tags, eg. movie, review…eg. movie+review…and what about blog categories, they are not generally going to have both categories; movie, and review.
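
The difference between the last two sample queries is only the brackets, so here is a small Python sketch with sets to show why they matter. It assumes AND binds tighter than OR when there are no brackets (I don’t know how Technorati actually parses this), and the four posts are invented.

```python
# Hypothetical posts and their tags.
POSTS = {
    "a": {"film"},                # about film, not a review
    "b": {"dvd", "review"},
    "c": {"movie", "review"},
    "d": {"review", "homework"},  # a review, but not of a film
}


def tagged(tag):
    """Ids of all posts carrying the tag."""
    return {post_id for post_id, tags in POSTS.items() if tag in tags}


# film OR cinema OR movie OR dvd AND review
# (assumed reading: AND applies to "dvd" only)
ungrouped = tagged("film") | tagged("cinema") | tagged("movie") | (tagged("dvd") & tagged("review"))

# (film OR cinema OR movie OR dvd) AND review
grouped = (tagged("film") | tagged("cinema") | tagged("movie") | tagged("dvd")) & tagged("review")

print(sorted(ungrouped))
print(sorted(grouped))
```

Under that assumption the ungrouped query lets post “a” through even though it was never tagged review, while the grouped query only returns posts that carry both a film-ish tag and the review tag.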

So the aboutness of these tags is a bit more vague, whereas with Structured Blogging the aboutness can be more exact…but then a more controlled system like Structured Blogging is going to be harder to get everyone to adhere to, even if the 3 biggest blogging platforms have plugins, will everyone use it (I’d have to ask my host), and what about the rest of the other blogging platforms.

Also who decides on these metadata fields, who is the authority, can we suggest fields?

I like the idea of defining an element set, kind of like a version of DC for the blogosphere, but a centralised subject index is another ball game (the values within the element).
So I still like the idea of bottom-up tags, but I also like the idea of a subtle structure, as tags are used not only as subject terms, so if we could just classify our tags into facets, these facets are the Structured Blogging fields, eg. Author, Review, Media, Publisher, Title, etc…

If this works, this is a step towards the user driven semantic web (ie. we the bloggers define and describe contents at the time of posting according to a standard)…then any service will be able to aggregate posts on a topic with hopefully minimal noise.

I can even see this working even better at a blog collective level, such as the Corante Web Hub, as tag clouds can start to get out of hand…that is of course unless participants decide on a tag set, but this has more to do with the values in the field, what I like about Structured Blogging is that it defines an element type…like search fields in an OPAC.

At the moment you can’t search within a tag at Technorati Tags, let alone limiting this to an OPML, hopefully Structured Blogging will allow us to search within a subject type, that is also a certain document type…this is targeted fielded searching, the blogosphere as an OPAC.

More

How will this integrate with social bookmarking services, if you bookmark a post (that has been made with Structured Blogging) will the fields auto-fill, and will these fields be available in bookmark services…then we would be able to search bookmarks by author, title, subject type, document type, format type, etc…this would be a human indexed web OPAC.

October 10, 2005

Semantic web: subject searching (XML, DC, RDF, ontologies…)

Filed under: search, semantic

I’ve been surfing the web in an effort to learn about XML, RDF, ontologies…basically the semantic web.
I’ve found this hard to grasp on my own so I’ve decided to share my understandings via this blog in an effort to induce comment in order to clarify my grip of the subject matter (after all what are blogs for!)

NOTE: I’ve used the web to gain knowledge (as opposed to consulting a person or taking a course or reading a print publication) and used the web to publish and disseminate my learning, and hopefully in turn use the web for discussion…so in tackling “Web 2.0” I have been using the tools that Web 2.0 has to offer!

Here is my effort in describing what I understand of the semantic web aspect of Web 2.0 and how a part of it may emulate subject searching on an OPAC or scholarly database.

Ontology is the hardest term for me to grasp, I thought it was about mapping all the vocabularies used on the web (by mapping I mean establishing the semantic relationships) but it seems to be this and more, a type of controlled vocabulary that is more detailed and complex than a thesaurus and it deals with more than just subject terms (in the library environment this is where a thesaurus stops, and AACR2 takes care of other descriptive terms, but this is only grammatical, it doesn’t infer relationships as ontologies do).

Ontology will map relationships between anything; subject terms, people, places… it is very complex…it sets rules, relationships, inferences and maps things in every instance of a term or phrase so a computer can process the information accordingly.

So an ontology for the web will serve computers not people… …I’m not sure if you can use an ontology for browsing purposes, maybe it doesn’t have a structure and is just a list of inferences that represents stuff…so it’s not used to navigate, it’s just a running list of relationships and rules (logical statements).

For an example of an instance of an ontology see, Converting a controlled vocabulary into an ontology: the case of GEM.

The semantic web promises contextual search where all information is marked-up in order to define itself, set rules (so they can be retrieved in a given context), and mapped to other resources (to distinguish relationships).
This will allow users to search, similar to fielded searching; if you search for an author you will hope to find items created by the author, ignoring items where the author name appears in the full-text or in the bibliography (this is searching without the noise)…as mentioned before the semantic web goes beyond fielded searching and defines relationships at a very granular level.
Since information will be machine processable, it is mentioned that we can have electronic web-agents perform automated searching on our behalf. (What we don’t have to surf the web ourselves!)

For our purposes we will narrow the focus to ontologies mapping subject terms on the web, so the web can act like an OPAC or fielded database.

MY INTERPRETATION

HTML is a mark up language for web browsers

<strong>hello</strong>

the web browser is told to make this bold

XML is a mark-up tool to make your own mark up language

So you can make up your own tags
<bike>racer</bike>

The beauty about XML is that tags are not just about presentation (like most of HTML); they can be about structure, or anything for that matter. You can even separate content from form (which is great because you can take the content from, say a memo, and present it in the structure of a poster.)…you can separate parts of mark-up by keeping, for example, all your structural tags in a DTD file.
The DTD file can also act like a simple ontology as it can be used to define the grammar of your XML syntax (similar to an XML namespace).
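
As a small illustration of separating content from form, here is a Python sketch using the standard library’s ElementTree. The memo and its tag names (memo, to, subject, body) are my own invention, and the two functions are just two possible presentations of the same content.

```python
import xml.etree.ElementTree as ET

# Made-up tags: XML lets us name them after what the content is,
# not after how it should look.
MEMO = (
    "<memo>"
    "<to>Staff</to>"
    "<subject>Bike shed</subject>"
    "<body>The shed opens Monday.</body>"
    "</memo>"
)

root = ET.fromstring(MEMO)


def as_memo(doc):
    """Present the content in memo form."""
    return f"To: {doc.findtext('to')}\nRe: {doc.findtext('subject')}\n\n{doc.findtext('body')}"


def as_poster(doc):
    """Present the very same content in poster form."""
    return f"*** {doc.findtext('subject').upper()} ***\n{doc.findtext('body')}"


print(as_memo(root))
print(as_poster(root))
```

The XML never changes; only the program reading it decides what the result looks like.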

I guess XML tags can be verbs as well, they’re just labels in the end.

The other great thing is that XML is interoperable, meaning you can import/export XML files to any program…it’s up to the program to be able to read it, and how it reads it. One program may interpret the tag different to another program, and it can ignore the tags that aren’t important.

When it comes to the web (which is one big program or database) then all these tag meanings need to correlate. All the databases are coming together to form one big database, but really the databases are still separate, they will be mapped together so they can work in unison.

Define mark-up elements

Firstly we have to give meaning to our mark-up labels and then map relationships between everyone’s mark-up labels – as there will be multiple mark-up standards out there. Even though there is freedom with XML we can’t have 2 or more people using the “bike tag”, when they mean different things.

So each person who uses XML tags needs to give meaning to them, as they mean nothing if they can’t be interpreted.

In an ideal world if everyone used the same set of mark-up labels to describe resources (as HTML does for presentation) then we wouldn’t have a problem in searching the web like an OPAC, this is what DC tries to do.

But the reality is that there are many mark-up standards to cover many different types of resources, industries and disciplines, so creating a universal standard is futile, rigid, and non-productive.

So we need a way to tell the web the mark-up (standard) we used to describe a web page so it doesn’t clash with another standard that happens to use the same tag term for perhaps a different meaning….it’s a way to distinguish between ambiguous elements.

This is achieved with the XML namespace (it lives in the header of your document), which points to a URL, saying this is where I got my mark-up tags from (verification).
eg. It may point to the web page of a particular mark-up tag set like DC
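
Here is a rough Python sketch of a namespace keeping two identically named tags apart. The dc prefix points at the real Dublin Core element set URL; the shop vocabulary and its URL are hypothetical. ElementTree stores each tag together with its namespace URL, so the two “type” elements can’t be confused.

```python
import xml.etree.ElementTree as ET

# Both vocabularies have a "type" element, but they mean different things:
# dc:type is the kind of resource, shop:type (made up) is the kind of bike.
DOC = """
<record xmlns:dc="http://purl.org/dc/elements/1.1/"
        xmlns:shop="http://example.org/bike-shop#">
  <dc:title>Touring on two wheels</dc:title>
  <dc:type>Text</dc:type>
  <shop:type>racer</shop:type>
</record>
"""

NS = {
    "dc": "http://purl.org/dc/elements/1.1/",
    "shop": "http://example.org/bike-shop#",
}

root = ET.fromstring(DOC)

# Each tag is held internally as "{namespace-url}localname".
for child in root:
    print(child.tag, "=", child.text)

dc_type = root.find("dc:type", NS).text
shop_type = root.find("shop:type", NS).text
print(dc_type, shop_type)
```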

Map mark-up

Now the next problem is for the web to map all the different mark-up standards, eg DC creator relates generally to the author of an item, but other mark-up standards may use the mark-up “author” or “owner”…they all roughly refer to the same thing so the web needs to map these mark-up standards so when we do a search we are searching in multiple communities (mark-up standards) at the one time, this is interoperability.

I’m not sure how this will happen, the web has to compare elements from multiple mark-up standards and decide, yes “Creator” in DC is similar if not equal to “Author” in some other language, so I’ll merge these items in the results.
I know in the library environment, it doesn’t matter which label you use, the MARC number (unique identifier) is able to identify what both elements mean and then it can be mapped via the Z39.50 standard
…but this is easy in a library environment as all labels/elements have a MARC number, in the web there is no consensus, it’s much more chaotic.

RDF

So how does all this actually work, from the computers perspective, making it machine-processable (an illusion that the machine understands)

RDF is made from XML, it is a framework to exchange the data we have been talking about above, it allows this data to work beyond proprietary systems – the metadata can be shared amongst different web applications, it also allows you to use multiple mark-up standards within the same document.

RDF also has its own namespace to list its own custom mark-up tags…for more see, An idiot’s guide to the resource description framework.

RDF encapsulates the resource, it makes statements about resources…so it’s meta-data for the meta-data (so RDF is a container that holds all this information), ultimately it allows for different vocabularies to exist in a distributed way without needing a central place.

What it is stipulating is the structure for your mark-up (source code for your web page), and this is all held within the RDF beginning and end tags, <rdf:RDF> and </rdf:RDF>.

The structure is made up of 3 aspects:

  • Resource - URI pointing to the resource (which is usually a URL webpage)
  • Property type - properties describing the resource (eg. DC) and validate this
    via an XML namespace
  • Value - value of the property. ie name of the title or person or the subject term, etc…

NOTE: the property can become a resource itself with its own property types and values
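
The three parts can be sketched in Python as plain tuples of (resource, property type, value). This is only my reading of the structure, not real RDF syntax; the example.org URLs, the “terms/name” property and the author name are invented, while the dc URL is the real Dublin Core element set.

```python
DC = "http://purl.org/dc/elements/1.1/"
EX = "http://example.org/"  # hypothetical site

POST = EX + "posts/semantic-web"
AUTHOR = EX + "people/author"

# Each statement is (resource, property type, value).
STATEMENTS = [
    (POST, DC + "title", "Semantic web: subject searching"),
    (POST, DC + "creator", AUTHOR),            # the value here is itself a URI...
    (AUTHOR, EX + "terms/name", "A. Blogger"),  # ...so it can be described in turn
]


def describe(resource):
    """All property/value pairs stated about one resource."""
    return {prop: value for res, prop, value in STATEMENTS if res == resource}


post = describe(POST)
author = describe(post[DC + "creator"])
print(post[DC + "title"])
print(author[EX + "terms/name"])
```

The second statement is how I understand the NOTE: something that appears in one statement gets its own URI, and then has property types and values of its own.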

See below for more on how it does this by defining the structure of a resource.

Metadata for the web RDF and the Dublin core

What is RDF?

The semantic web: a network of content for the digital city

An introduction to the resource description framework

Although beyond the scope of this article, I will also mention that RDF is more than this…it is also referred to as Data Modelling.

This article explains that even if you use a tag such as <Author>Paul Warren</Author>, XML on its own can’t say anything about Paul Warren as it can’t map relationships when used by itself…more from the article:

“…we have no way of saying anything about “Paul Warren”, e.g. that he is employed by BT, that he has written other articles etc. To overcome this limitation Resource Description Framework (RDF) was designed […] based on the use of “subject, verb, object” triples (e.g. “Paul Warren”, is employed by, “BT”).”

So the semantic web is more than listing values in a database within elements, ie. listing a whole heap of author names under the element ID of “Author”…it is different to a relational database in that it is a list of free floating logical statements, that aren’t limited to being identified within one element, instead the value itself (the actual author’s name) can have a unique ID via a URI…read what I’m referring to at this article.

Map mark-up elements

This is where ontologies will have to shine, in the form of RDF Schema, OWL, OIL, and DAML (the DARPA Agent Markup Language).

From this example:

Just say in describing the aboutness of an object I use DC:subject, and someone else uses AB:descriptor.

Well the web will know via the URI namespace where these mark-up standards reside, and since someone has done the groundwork, to say [X dc:subject Y] is the same as [Y ab:descriptor X] (this is an ontology at work), then the computer can deduce that “subject”, and “descriptor” mean the same thing, so we get higher recall, and context from our dataset.
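
A minimal Python sketch of that groundwork, assuming the equivalence is itself recorded as a statement. The property name EQUIVALENT is a stand-in I made up (OWL has a real one for this purpose, owl:equivalentProperty), and the three documents are invented.

```python
# Stand-in for an ontology's "these two properties mean the same" relation.
EQUIVALENT = "http://example.org/ontology#equivalentProperty"

STATEMENTS = [
    ("doc:1", "dc:subject", "cycling"),
    ("doc:2", "ab:descriptor", "cycling"),
    ("doc:3", "dc:title", "cycling"),  # same word, but in a title field
    ("dc:subject", EQUIVALENT, "ab:descriptor"),  # the groundwork someone did
]


def equivalents(prop):
    """The property plus anything declared equivalent to it (one hop only)."""
    found = {prop}
    for subject, predicate, obj in STATEMENTS:
        if predicate == EQUIVALENT:
            if subject == prop:
                found.add(obj)
            if obj == prop:
                found.add(subject)
    return found


def find(prop, value):
    """Resources whose property, or an equivalent one, has the given value."""
    props = equivalents(prop)
    return sorted(s for s, p, o in STATEMENTS if p in props and o == value)


print(find("dc:subject", "cycling"))
```

Searching on dc:subject picks up doc:2 as well (higher recall), but not doc:3, because a title is not a subject (the context is kept).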

Going back, in the library environment it is easier as MARC uniquely identifies what a subject or descriptor element is regardless of what you label it via a string of numbers, this is the advantage of using a central registry. This is impossible on the web so the idea of RDF is achieving a decentralised approach by using XML, namespaces, RDF, and ontologies.

Map subject terms

My question is then after you have mapped the elements from the different mark-up sets (I presume this is done manually by a team of people), what about mapping the values?

So if we know that the element, “subject”, from one standard equals, “descriptor”, from another fair enough, but what if the value in one document is “bike”, and the value in the other is “bicycle” - in other words do we have to map thesauri, so subject terms from different thesauri are related, and considered when serving up results?

…this point is the precise focus of this blog post.

Even before mapping the different controlled vocabularies, how do you define them, is this using a namespace as well?

Then comes the question of mapping them?

Recap

  • Using XML we can describe or mark-up content anyway we see fit, whatever works for us personally
  • Using RDF we can merge these packets of information
  • But for purposes of aggregating our content with other content, we need to distinguish our tags so they don’t collide with tags describing other content that may happen to use the same tags but mean a different concept.
    So we have to define where we got our mark-up tags from therefore showing what they mean or refer to (via a namespace)
    Library communities have been using the MARC standard (a string of numbers that uniquely identifies an element regardless of what it is labelled), since the web is not a controlled environment using a centralised registry like this is not possible.
  • Then the web has to map the mark-up tags from different systems.
    Library communities have been using the Z39.50 standard, whereas the web has to bridge elements from various standards using ontologies
  • Then we have to tell the web from which controlled vocabulary we got our subject terms
  • Then the web can map these subject terms from all the different vocabularies
    …defining the homonyms, synonyms, and many other granular relationships

    …this is at the heart of the semantic web - higher recall, and contextual searching (higher precision).

Subject terms again

So the term “bush” in one vocabulary is mapped to the word “forest” in another vocabulary…so if I do a search for “bush” I also get hits for “forest”. (this is like a synonym ring)

This makes for exhaustivity across disciplines, but we may end up having too much recall, or even lose our context.
If we do a search for “bush”, we get hits for “forest” as well, but we may not want hits for “forest”.
So how far do we go with mapping subject terms from different vocabularies?

This is where using operators such as “NOT“ help, also web results could be returned in clusters that show folders of the mapped subject terms…so if you searched the web for the term “bush” in a subject field (what! the web with a subject field…dream on) it would show only those hits, but also list folders with hits from similar subject terms.
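
Here is a rough Python sketch of a synonym ring with a NOT operator and clustered results. The subject index, the document ids and the function names are all invented; it only shows the idea of expanding “bush” to “forest” while still letting the searcher pull the “forest” hits back out or see them in their own folder.

```python
# One synonym ring mapping terms across two vocabularies (hypothetical).
RINGS = [{"bush", "forest"}]

# Hypothetical subject index: subject term -> documents indexed under it.
INDEX = {
    "bush": {"doc1", "doc2"},
    "forest": {"doc2", "doc3"},
    "garden": {"doc4"},
}


def expand(term):
    """The term plus every term it has been mapped to."""
    for ring in RINGS:
        if term in ring:
            return set(ring)
    return {term}


def search(term, not_terms=()):
    """Hits for the expanded term, minus anything indexed under a NOT term."""
    hits = set()
    for mapped in expand(term):
        hits |= INDEX.get(mapped, set())
    for unwanted in not_terms:
        hits -= INDEX.get(unwanted, set())
    return hits


def clustered(term):
    """The same hits, grouped into one folder per mapped subject term."""
    return {mapped: sorted(INDEX.get(mapped, set())) for mapped in sorted(expand(term))}


print(sorted(search("bush")))
print(sorted(search("bush", not_terms=["forest"])))
print(clustered("bush"))
```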

…also need to map the homonyms (the word “bush” could be referring to a president’s name or the name of a rock band) this is searching in context…I guess a way around this is to present the results in clusters, showing the results for the term “bush” split into folders as nature, music, president, etc…

Hopefully the semantic web will get around this problem of searching in context by using the elaborate ontologies put in place…so a particular web page’s mark-up can define that the term “bush” in this webpage means “president” and not a “tree” and not a “rock band”…at the moment we have to use boolean searching methods to gain some context, this may be a problem at present as lay people don’t search this way, they want quick results with context by using a simple query.

At the moment, search engines don’t find subject searching viable and prefer to use free-text page ranking and other methods…see more…

If the web mapped all these subject terms from different vocabularies; could we use this as a browsing mechanism, it would be a massive cross discipline meta-thesaurus. (This is probably unlikely, as the semantic web would be using ontologies and not a thesaurus). Although maybe we could browse a discipline thesaurus, or even a particular thesaurus…(they could all have check boxes, so you can search webpages seen from the perspective of a set of thesauri).

But then we would have a directory of thesauri to choose from, when you find one (or many) you then do a search or browse for a subject term, then you apply this to the search engine query, then scroll your results and view clusters of related or similar subject terms…timely process!

In brief

  • Syntax is XML (meta-data)
  • Community eg. DC (point to a URL via a XML namespace)
  • Structure is RDF (meta-data about the meta-data)
  • Semantics is ontology (maps rules and relationships - between different elements, and different values, and more)

Please add your views to this post as I’m sure there are parts I haven’t interpreted properly…I will then add an addenda of some sort or fix up the mistakes.

Doubts

This is all put in a different perspective with so much clarity by this essay,
The Semantic Web, Syllogism, and Worldview

Social Web

The social web, mostly the blogosphere and bookmark folksonomies are creating a watered-down version of the semantic web by default. Aggregated user-defined tags are a quick answer, and are of great value for discovery, sharing, and aggregating to make topic portals, but the question is does an aggregate of personal contexts cross over to a defined community context, at the moment “not really”…although we do see notions of the power law, long tail, and related algorithms helping to create emergent vocabularies, but is this enough.

The notion of structured blogging and dataBlogging will contribute to the semantic web; by adding tags to blog content we can derive context. As is mentioned in this article if content such as a job listing is tagged with the appropriate tag, then any website can aggregate all job listings, the current players will need to re-think their services other than just providing content…lots of players are starting to aggregate content, just look at Yahoo! News or Google News, they are the new competition to traditional news aggregators, so now services are moving forward beyond just delivering content, and into customer service such as personalisation, customisation and integration.

New Web 2.0 services are doing the same thing: Technorati Tags is aggregating blogosphere content into user-defined categories, and the same is being done for the many social bookmark services by services such as Wink.

More references:

The semantic web

What is an Ontology

A quick guide to…XML

A dozen primers on standards

Metadata and the web

RDF - The resource description framework

Re-inventing subject access for the semantic web

XML and the resource description framework: the great web hope

The semantic web: how RDF will change learning technology standards

Semantic web – on the respective roles of XML and RDF

Webopedia – RDF

JISC – Semantic Web Technologies

Search Engines and Resource Discovery on the Web: Is the Dublin Core an Impact Factor?

Writing Semantic Markup

Themes and metaphors in the semantic web discussion

Semantic Web technologies for digital libraries

A bit of commentary on Google and the Semantic Web

How Google beat Amazon and Ebay to the Semantic Web

[ADDED 14/06/06:
Why using RDF instead of XML?
A description of XML namespaces
Avoiding Information Overload : Knowledge Management on the Internet]

September 21, 2005

Scuttle: auto community tags

Filed under: tags, semantic

Yesterday I posted about using facets or automated tags within del.icio.us to augment the browsing experience…well I just discovered that Todd over at Big IDEA has implemented automated tags in scuttledu…see the added browsing feature.

From the post:

“When you register for the service, you are asked to provide your grade level and subject area. When you add a bookmark, these two pieces of information become tags. You have the option of not using these tags as well.

…When you log in to the service you are in “teacher mode” by default. This means that when you add a bookmark it will automatically be tagged with your grade level and subject area. You can change the grade level and subject area if you teach more than one subject. You can also switch to “not-teacher mode” by clicking an icon in the upper-right corner. In teacher mode, the icon is a person wearing a graduation cap; in not-teacher mode, the icon is a person wearing a baseball cap. The link toggles between the two modes. In “not-teacher” mode, your bookmarks are not automatically tagged with your grade level and subject area. Use this, for example, if you are bookmarking sites not related to your teaching duties.”
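
The behaviour described in the quote can be sketched in a few lines of Python. This is my guess at the logic, not Scuttledu’s code; the function name, the profile fields and the tag spellings (“grade-5”, “mathematics”) are all assumptions.

```python
# Hypothetical profile captured at registration.
PROFILE = {"grade_level": "grade-5", "subject_area": "mathematics"}


def tags_for_bookmark(user_tags, profile, teacher_mode=True):
    """Tags saved with a bookmark; teacher mode appends the two profile tags."""
    tags = list(user_tags)
    if teacher_mode:
        for auto_tag in (profile["grade_level"], profile["subject_area"]):
            if auto_tag not in tags:  # avoid doubling up if typed by hand
                tags.append(auto_tag)
    return tags


print(tags_for_bookmark(["fractions"], PROFILE))
print(tags_for_bookmark(["holiday"], PROFILE, teacher_mode=False))
```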

More on this other post:

“Scuttledu does not try to impose a formal taxonomy on users; if users don’t want to tag their bookmarks with grade level and subject area tags, they don’t have to. It is interesting to note, however, that there have been some attempts by del.icio.us users to use community-defined tags. Emily writes about her experience in the nptech tagging experiment…”

Site search: auto-suggest vocabulary

Filed under: General, search, semantic

Robin Good has the low down on a new intelligent site search tool called LookAhead.

Basically it is a search box for your site, that auto-suggests search terms as you type…these suggested search terms are actually subject terms based on a thesaurus or lexicon that you have programmed into your LookAhead site search tool.
If you don’t have a controlled vocabulary to import into the tool, they also have a service called Lex-It which will generate a lexicon for you (obviously not as effective as human indexing or thesaurus construction).

So basically it is subject searching with auto-complete, but the beauty of it is that it has a pre-browsing window: you don’t have to commit to viewing a page until you are done navigating the vocabulary (which seems to be listed alphabetically).

Next to each suggested term is a number denoting the number of terms under that, and so on…if the number is one, this will be a direct link to the webpage.
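To make the idea concrete, here is a minimal sketch of prefix-based suggestions from a controlled vocabulary…it is not LookAhead's code (I haven't seen it). The lexicon, the page paths and the function name are invented, and I've simplified the number next to each term to a count of pages filed under it.

```python
# Hypothetical lexicon: subject term -> pages indexed under that term.
LEXICON = {
    "rss": ["/rss-basics", "/rss-readers"],
    "rss metrics": ["/measuring-feeds"],
    "search": ["/site-search", "/enterprise-search", "/search-tips"],
}


def suggest(prefix, lexicon=LEXICON):
    """Return (term, page_count) pairs for terms starting with prefix, A to Z."""
    prefix = prefix.lower().strip()
    if not prefix:
        return []
    return [
        (term, len(pages))
        for term, pages in sorted(lexicon.items())
        if term.startswith(prefix)
    ]


for term, count in suggest("rs"):
    # A count of one could be rendered as a direct link to that single page.
    print(term, count)
```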

If you type in the term “RSS”, the backend thesaurus says: if someone types in “RSS”, also show hits from the terms “Atom”, “Feed”, “Syndication”…very much an enterprise search mechanism. Sometimes when this happens you wonder why you are getting these types of hits, so it is good when the search engine tells you that it has also included hits from these other terms.

Not sure if it does this, but it would be good if it would say “see also Atom”, “broader term Feed”, “narrower term RSS metrics”…that way you could look at only articles within the term “RSS”, or only articles within the term “Atom”, or articles from both by choosing to click on the broader subject term “feed”, or articles from the narrower subject term “RSS Metrics”
…I guess this is an alternate view to an alphabetical view.
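A small sketch of what I mean by telling the user why each extra term was included…the thesaurus entries and relation names below are my own invention for the “RSS” example, and I don't know whether LookAhead stores anything like this.

```python
# Hypothetical thesaurus entry with the three relation types mentioned above.
THESAURUS = {
    "rss": {
        "see_also": ["atom", "syndication"],
        "broader": ["feed"],
        "narrower": ["rss metrics"],
    },
}

RELATION_LABELS = (
    ("see_also", "see also"),
    ("broader", "broader term"),
    ("narrower", "narrower term"),
)


def expand_query(term, thesaurus=THESAURUS):
    """Return (term, reason) pairs so the interface can explain each extra term."""
    term = term.lower()
    entry = thesaurus.get(term, {})
    expanded = [(term, "your search term")]
    for relation, label in RELATION_LABELS:
        for related in entry.get(relation, []):
            expanded.append((related, label))
    return expanded


for related, reason in expand_query("RSS"):
    print(f"{reason}: {related}")
```

Because every term carries its reason, the results page could let you keep only the “RSS” hits, only the “Atom” hits, or widen out to the broader term.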

More from the interview on Robin’s post:

“…On the Google search box that allows my readers to find content on my site, the moment that they would start typing, whatever key word or sentence on some topic or issue or tool, they would get immediately a popup box next to that search box that would reveal all of the articles that I have on that topic.”

“…We also provide the webmaster or the site owner with the ability to create their own unique vocabulary.
So, for example, if you type in the term, you start typing in the term “mountain lion,” perhaps the term mountain lion is no where to be found on the site, but in fact the webmaster through our technology has said panther and mountain lion are the same, they are equivalent, so you could type in mountain lion and in fact find the pages with panther on it.

…So, for example, if you typed in mountain, and you selected mountain lion from the drop down, it could take you directly to a mountain lion page, or it could just populate the search box with “mountain lion,” and then you could further modify that.”

Refining the results

A related function to this service is experienced when searching on Scirus, although this lacks the auto-suggest, and pre-browsing window…but the similar thing is the term suggestions…really it’s only refining what you got (your results), in contrast with the focus of this post which is about the initial discovery.

If you search a subject term on Scirus like “kuroko mines”, the sidebar will list keyword suggestions…although I don’t think this is based on a thesaurus, I think it is based on these keywords appearing in your results, so it’s a refined search. Actually Engineering Village2 also does this but you can refine your results not only by keyword, but by subject term, publisher, author, document type, year, etc…
I guess Teoma, and Clusty also do this with their subject/topic/group clustering.
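My guess at how this kind of refinement works, as a sketch: count the keywords that appear in the results you already have, offer the most frequent ones as suggestions, and filter when one is clicked. The result set and function names are made up…I don't know how Scirus or Engineering Village2 actually do it.

```python
from collections import Counter

# Made-up result set: each hit carries the keywords found in that document.
RESULTS = [
    {"title": "Paper A", "keywords": ["mining", "japan", "geology"]},
    {"title": "Paper B", "keywords": ["mining", "japan"]},
    {"title": "Paper C", "keywords": ["mining", "zinc"]},
]


def keyword_suggestions(results, top_n=2):
    """Return the top_n keywords by number of hits that contain them."""
    counts = Counter()
    for hit in results:
        # set() so a keyword repeated within one hit is only counted once.
        counts.update(set(hit["keywords"]))
    return counts.most_common(top_n)


def refine(results, keyword):
    """Keep only the hits that carry the chosen keyword."""
    return [hit for hit in results if keyword in hit["keywords"]]


print(keyword_suggestions(RESULTS))
print([hit["title"] for hit in refine(RESULTS, "japan")])
```

Swap "keywords" for author, publisher or year and you get the other refine options with the same counting.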

I suppose these are more mining techniques to drill down into the opaque web (they make up for lousy search terms)
ie. a lay person might be looking for “bottle wholesalers in Australia”, and their search term may be: bottle wholesaler Australia. They may be lucky, they may not…the cluster might suggest the term “manufacturers”, offering a worm hole into what would be the 200th page in the results (as it turns out, the successful website may have called themselves a manufacturer even though they are also a wholesaler…so the refined suggestion was fortunate, and made up for the user not thinking to use that term as an alternative search attempt)

…boolean searching also fits into this scenario as another web-mining tool.

September 20, 2005

del.icio.us with facets

Filed under: General, tags, semantic

fac.etio.us groups tags into facets, so you can find a bookmark via multiple paths, naturally you can do this sort of thing with tagging as you can apply multiple tags to the one bookmark, but this system divides batches of tags into facets…it’s the next best thing to a prefix for tags.

So now instead of browsing through one long list of tags, a tag set is divided into facets to give you more contextual browsing

…but this system doesn’t seem anything new, because if you wanted you could use tag bundles to make facets in your del.icio.us account…although what this system does is aggregate many accounts, and groups tags into facets.

Now I wonder how the tags are grouped into facets, is this done by a computer or is it a person placing tags in groups (facets)…ohhh no controlling the browsing of a folksonomy, this was bound to happen as our tag sets grow wildly…actually I see it more as an offer to browse another way (more contextually).
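If it is a person (or a simple lookup table) doing the grouping, it could be as plain as the sketch below. I have no idea how fac.etio.us does it, so the mapping, the facet names and the fallback to “subject” are all assumptions of mine.

```python
from collections import defaultdict

# Hypothetical hand-made mapping from tag to facet.
FACET_OF = {
    "pdf": "doctype",
    "html": "doctype",
    "australia": "geo",
    "toread": "action",
}


def group_by_facet(tags, facet_of=FACET_OF, default="subject"):
    """Group a flat tag list into facets; unmapped tags fall into the default facet."""
    facets = defaultdict(list)
    for tag in tags:
        facets[facet_of.get(tag, default)].append(tag)
    return dict(facets)


print(group_by_facet(["sustainability", "pdf", "australia", "energy"]))
```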

This process could be automated into system designed facets quite easily for a tool like del.icio.us…for each bookmark, you could also choose from a drop down menu to describe the document type as CiteULike does, eg, journal article, conference, thesis, etc…this would be like an automated facet for the bookmark in general.

This does create more work for the user, but at least the system is still a pure folksonomy…also users are selecting from the facet list, who decides the facet list…is there a facet suggestion list?

I like this idea as the tags people apply are generally more “subject/topic based”, so to separate these from other functional or descriptive tags we could choose from various drop-down menus to apply the other types of tags…this leaves the tag list to be subject/topic based.

So if I bookmark a page in del.icio.us on “sustainability” I can give it the tag “sustainability” and “sustainabledevelopment”, etc…but at the same time for a little extra work I can also select some options from 4 or 5 drop-down menus. This isn’t too much thinking, you just have to select from the options.

So, similar to the CiteULike approach, there could be the following fields (just fill in what you need to):

Automated facets by the system

DOCTYPE
doc
xls
html
xml
pdf
ppt
etc…

DOMAIN
gov
org
com
edu
etc…

DATE

USER/CONTRIBUTOR

For the bookmark

FORMAT (choose one or none from the following)
Journal article
White paper
Conference
Blog post
Website
Product
Type in your own
etc…

ACTION
to read
to blog
need access
to buy
to send
Type in your own
etc…

GEO
various countries…

ORG
various organisations…

Example

So if I have a pdf of a white paper on energy efficiency (sustainable practice) from a firm in Australia, and I want to blog about it, here is the tagging process.

USER-DEFINED TAGS (subject tags)
- energy
- sustainabledevelopment
- sustainability

AUTOMATED SYSTEM TAGS FOR THE BOOKMARK
- pdf
- org
- 20/09/05
- johnt

USER-SELECTED BOOKMARK FIELDS (drop-down menu) FOR NON-SUBJECT TAGS
- White paper
- to blog
- Australia
- Organisation name
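The automated system tags in this example could be worked out from the URL alone. A minimal sketch, assuming the doctype comes from the file extension and the domain from the hostname…the function name, the example URL and the “no extension means html” rule are mine, not anything del.icio.us does.

```python
from datetime import date
from pathlib import PurePosixPath
from urllib.parse import urlparse

KNOWN_DOMAINS = {"gov", "org", "com", "edu"}


def system_tags(url, user, today=None):
    """Return [doctype, domain, date, user] for a bookmark, derived automatically."""
    parsed = urlparse(url)
    # DOCTYPE: file extension of the path; assume a plain web page if there is none.
    suffix = PurePosixPath(parsed.path).suffix.lstrip(".").lower()
    doctype = suffix or "html"
    # DOMAIN: first known label scanning the hostname from the right,
    # so "example.org.au" gives "org".
    labels = (parsed.hostname or "").split(".")
    domain = next((label for label in reversed(labels) if label in KNOWN_DOMAINS), "other")
    # DATE: day/month/year, as in the example above.
    today = today or date.today()
    return [doctype, domain, today.strftime("%d/%m/%y"), user]


print(system_tags("http://www.example.org.au/reports/energy.pdf", "johnt", date(2005, 9, 20)))
```

The user-selected fields (format, action, geo, org) would still come from the drop-down menus, as they can't be guessed reliably from the URL.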

What about subject facets?

The tags applied by the user, as per usual, are now considered the “subject tags” (eg. sustainability, etc.)…but then how do we make facets out of these?
eg. genre, technology, attribute, etc…
…a drop down menu for a subject facet just couldn’t cover the scope, although you could give a choice, and a space to type your own.
When thinking of applying a subject facet to your tag, it would be like thinking of which broad category/discipline this tag would come under…in some circumstances the facet name could be the same as your tag name, eg. politics.

Then you could browse del.icio.us by these facets…or at least you could make up these facets for your own account (that means you would have to customise the fields in your bookmarklet)

For more see:
Siderean - Trees, tags and facets
Facets + Tags

Another way of having automated tag bundles is having a pre-fix for a tag (tag for the tag)…the other day I posted about doing this manually in del.icio.us..
Wists has this feature (a form of meta-tagging they call themes)…in this sense your tag bundles or facets could automatically be created by the pre-fix heading, although I also like the option to make your own weird and wonderful tag bundles.
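As a sketch of bundles being created automatically from a pre-fix: split each tag on a separator and file it under the part before it. The colon separator and the “subject” fallback are assumptions for illustration…this is not how Wists or del.icio.us store their tags as far as I know.

```python
def bundle_by_prefix(tags, separator=":", default="subject"):
    """Group tags like 'action:toblog' into bundles keyed by their pre-fix."""
    bundles = {}
    for tag in tags:
        prefix, found, rest = tag.partition(separator)
        if found and prefix and rest:
            bundles.setdefault(prefix, []).append(rest)
        else:
            # No usable pre-fix, so treat it as an ordinary subject tag.
            bundles.setdefault(default, []).append(tag)
    return bundles


tags = ["sustainability", "format:whitepaper", "action:toblog", "geo:australia", "energy"]
print(bundle_by_prefix(tags))
```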

see more…

[ADDED 23/09/05: Furl lets you check the boxes read or unread at the time of bookmarking…so later on you can sort your archive by read or unread.
This has an added handiness, as at the time of bookmarking you may not have read the webpage you are bookmarking, so you can’t really fill in the comments/description field…later on once you have read the webpage, you can check it as “read” and also add a comment.]

[ADDED 3/10/05: Faceted Folksonomy]

September 1, 2005

Search trends: relevancy, discovery, findability

Filed under: General, tags, search, semantic

The choices (search/browse)

Directory vs. PageRank vs. clustering vs. social search vs. personalised search vs. human-indexed web vs. social tags vs. computer processed tags vs. rss

For starters (for me anyway) there was Yahoo! Directory, ODP, or Google search (based on Page Rank)

To alleviate overload of results and relevancy this was what followed (not sure if that equates to the old precision vs. recall issue):

Cluster results – Clusty, Gigablast, Mooter
Here is more on Clusty.

Keyword extraction – TagCloud
Here is more on TagCloud.

Refine searching (similar to Clustering) – Teoma, Ask Jeeves

Here is more on Teoma
Here is more on Ask Jeeves

Search by social network (re-ranking) – Eurekster
Here is more on Eurekster.
(there are also engines that base relevancy on social network ratings of pages)

Search by your search history plus implicit profile (re-ranking) – Google Personalised
Here is more on Google Personalised (mentions old explicit profile approach).

Search by your reading behaviour at the granular level (re-ranking) – Findory
Here is more on Findory.

Search by RSS - Technorati, Feedster, Blogpulse, etc…
See more on indexing issues:
SEW: How Search Engines Index RSS & Why It Doesn’t Necessarily Matter
For the Vox Populi, Part II: A Comparison of How Some Blog Aggregation and RSS Search Tools Work for Keyword Search

Browse social bookmarks - Furl, del.icio.us, Spurl, Simpy, Connotea, CiteULike, etc…
More here:
Ontology is Overrated — Categories, Links, and Tags
CollaborativeRank

Search full-text human/social web – Zniff (Spurl), Simpy, Furl, or meta-search (Gataga, now defunct)
Zniff searches the full-text of the Spurl bookmark community, as Furl and Simpy have full-text search of their bookmark communities.
I think Zniff has got the idea, just incorporate tags (human indexed web) as part of the ranking…Gataga would just augment this by being a meta-engine…then re-rank again for personalisation.
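Roughly what I mean by incorporating tags as part of the ranking, as a sketch: add a bonus to the full-text score for every person who tagged the page with the search term. The scoring formula, the weight and the data are all invented…I don't know what Zniff really does.

```python
def rerank(hits, query, tag_weight=0.5):
    """Sort hits by full-text score plus a bonus per user who applied the query as a tag."""
    query = query.lower()

    def score(hit):
        tag_votes = hit["tags"].get(query, 0)
        return hit["text_score"] + tag_weight * tag_votes

    return sorted(hits, key=score, reverse=True)


# Made-up hits: text_score from the full-text engine, tags -> number of taggers.
hits = [
    {"url": "http://example.com/a", "text_score": 2.0, "tags": {}},
    {"url": "http://example.com/b", "text_score": 1.0, "tags": {"rss": 4}},
]
print([hit["url"] for hit in rerank(hits, "rss")])
```

A personalisation step could then be one more re-sort on top, using the same pattern with a different bonus.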

Even with tagging, search is still needed (discovery vs. findability)
See here:
Yahoo My Web Tagging & Why (So Far) It Sucks…and more.
Coming To Terms With Tags: Folksonomies, Tagging Systems And Human Information

Some of my related posts:
Zniff clusters
Gataga: meta-tag search
Google with del.icio.us clusters
RSS: full-text or summaries!
Blogs/RSS Engines: keyword search comparison

…and finally:
Tagging alone is not a panacea for retrieval!

All this re-ranking, when you can become search experts, using boolean syntax, fielded searching, etc..dig into the opaque web! (I think this means using your search intelligence to get those results that are relevant to you at the top of your results)…but will lay people develop skills..

That’s the issue, lay searchers, and also many experts are both finding relevancy and exhaustivity an issue (one more than the other)…what will the semantic web do…will it finally be a library catalogue for the web (DC) and probably more (ontologies and all that stuff)…Google don’t care, keyword search works for the lay person…what about the research needs of the scholar…even if people go to the trouble to add metadata to webpages to help relevancy in searches, the search engines have to take notice of this input in their indexing procedures.

Context, relevancy (re-ranking, metadata, concepts) applies more than ever in the time crucial environment of the enterprise…here are 2 articles on enterprise search:
Beyond keyword search: How a “concept” search can improve information gathering in a CI context.
Recent Trends in Enterprise Search

It comes down to ranking for relevancy (general and personalised), search syntax, user interface, education in searching and how search engines work, and most of all understanding there is an array of tools for an array of needs.
That is, web search engines are serving individual purposes and needs…because they can do a lot of things it doesn’t mean they do them all well…if keyword search doesn’t serve your purpose (eg. scholarly research) use a subscription database…maybe Answers.com is better for facts…there is more than one tool (a world outside Google).

[ADDED 10/10/05: Better query refinement]

June 22, 2005

Social bookmarks vs. free text search

A lot of people are starting to use social bookmarking tools as a means for SEO (increasing your web traffic)…tagging your own blog posts gets you double the exposure…

Particletree points this out with an experiment using del.icio.us to bookmark webpages instead of waiting for his site to be crawled by Google …read the post for the results.

Here is more from this post:

“I think the reason del.icio.us is so successful at bringing the appropriate audience to good material is because they track the changing web by using people to calculate what is essentially page rank. They get access to decent fuzzy logic for a fraction of the cost and the democracy of the system allows anyone to get their idea of what deserves face-time into the system almost immediately.”

The differences for SEO:

  • Crawling vs. pinging

    Search engines like Google take longer to index content
    Google Sitemaps is their solution to overcome this issue.

  • Pinging services enable the World Live Web or the ChangingWeb; you can keep track of new additions to the web

  • Social bookmarking and blog categories vs. PageRank

    The web community chooses the importance of a page, not an algorithm

    …and once a page has been tagged, it is visible and shared around, and can be re-tagged by many people, applying many tags (according to their way of seeing the world, so to speak - hopefully applying a popular and/or accurate tag), increasing its exposure, and maybe making it visible on the front page (most popular tags).

    Users track other users’ tag accounts (called an inbox in del.icio.us) on a daily basis because they like the stuff they bookmark, or people track a general tag itself (also at the user level) to keep up with the latest content bookmarked with that tag.

    Now if you bookmark a page for SEO reasons, someone tracking the tag you used will come across your page, and then they might tag your page with another tag, and someone tracking that tag will see it, and at the same time people tracking someone’s whole account or a tag within an account will see it as a new inclusion.

    So people will come across your page without even looking for it (serendipity)…as folksonomies are largely about discovery (see the end of this post).

    In contrast to Google, where you will come across the page if you are trying to find something specific.
    Although people do set Google Alerts, or do repetitive searches daily with their favourite search terms…so they may come across your page this way, but again, will your page be ranked high enough to be seen before people stop clicking through results?

    Two engines that are using tagging instead of page rank are Technorati Tags (covers the blogosphere) and Gataga (covers the folkosphere – or whatever it’s called?… I guess the tagosphere would be both of these combined).

    NOTE: see my post on Gataga searching in many fields and searching in just the tag field…according to my trials and observations.

    These tools cover a portion of the web that is savvy with current awareness, so it’s a good place to be a part of if you want traffic to your site.

Tags vs. free-text search

What subject tags try to achieve is to filter through the noise in search results, often this is called the “gray web”.

Even though the authors of websites tried this in the earlier days, it failed because of spam issues, and now it’s different as users are tagging pages (multiple heads are better than one when defining the aboutness of something)…see more here.

In Google you see results according to your search terms coupled with the algorithm (that decides on the ranking of results).
So there may be more relevant results than you think, but you can’t see them as they may be hit 2,620 or hit 100,265…you’re just not going to scroll through that many results (so this is a part of the open web, that is not invisible, it’s just hidden)..see more here.
So your only solution to bring all the relevant hits to the top is to improve your search terms, make them more precise, use boolean, etc…

In social bookmark tools pages are found according to a tag or subject term, as opposed to a free keyword search, or both eg. Zniff (even though this is just for one bookmark tool)

…soon these social bookmark tools will have lots and lots of results per tag (will the same problem emerge even at the subject level, let alone the free-text level).
At the moment this is alleviated by searching for more specific tags or combining tags.
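Combining tags is really just a set intersection: keep only the bookmarks that carry every tag you asked for. A tiny sketch with made-up bookmarks and a made-up function name:

```python
# Made-up bookmarks: URL -> set of tags applied to it.
BOOKMARKS = {
    "http://example.org/energy-report": {"sustainability", "energy", "pdf"},
    "http://example.org/green-building": {"sustainability", "architecture"},
    "http://example.com/solar-blog": {"energy", "solar"},
}


def find_by_tags(bookmarks, *tags):
    """Return the URLs (sorted) whose tag set contains every requested tag."""
    wanted = set(tags)
    return sorted(url for url, applied in bookmarks.items() if wanted <= applied)


print(find_by_tags(BOOKMARKS, "sustainability"))
print(find_by_tags(BOOKMARKS, "sustainability", "energy"))
```

Each extra tag can only shrink the list, which is why it helps while the per-tag result counts keep growing.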

Also social bookmark tools have a popular tags home page which increases visibility

So, the more your webpage/post is seen, the more chance it has of being blogged about, increasing your traffic again.

Socialbookmark tools don’t index all the web, only the pages people choose to tag (it’s a selective version of the web)

But this isn’t a problem for SEO, as you can just bookmark your own page, and away it goes…

Of course there is way more to SEO, but this is just about one aspect of SEO in relation to traditional search compared to social bookmarks.

This post digresses to ideas of the semantic web, where blogs, webpages, index their markup codes with more fields so to speak, eg. date, author, subject, review, job listing…this way anyone can aggregate the content and share it…structured blogging is a foray into semantic blogging.

In the end, is Google (PageRank) suitable to the lay person’s searching needs, or will a subject fielded search be more appropriate?

I guess the argument is in the accuracy in indexing the aboutness of a page…if the page is not correctly indexed according to the searcher, then they may think it doesn’t exist, that’s why free-text search is the safe option.

So the problem is; we are getting too many results

…we need more context for better precision

…but who defines the context is the question?
(tag/subject name and bookmarking the right items within this tag)

…I guess this is now a combination of the user (social bookmarks) and the author (blogs)

…another question is of controlled vocabularies

(that’s out the window in a non-domain specific, multi-discipline environment, with millions of contributors - the content is already uncontrollable, let alone trying to control labelling it all…labelling it brings context, which is what we want, but we can’t control the labels…well we have no choice, and who says controlled labels will help people search in context better than a user-defined/free-tagging system).

…as mentioned the user may have more of a chance finding something with free-text, rather than jumping from tag to tag trying to locate something (although when you find the right tag, you will hopefully find lots of items of quality, compared to free-text where you are competing with a lot of noise).

But then again, maybe free-text is for finding and tags are for discovery…maybe they shouldn’t be compared as one or the other, as they are slightly different tools.

Do we need to educate users on search terms and syntax techniques?

…or should we define webpages by user-defined tags?
(well this is happening anyway)

But then if someone searches a tag/s and doesn’t find what they are looking for, they may quit, whereas the item they were looking for was located in another tag.

The search experience needs to be intuitive.

I think our current answer is to use a bit of both.

Whether the tags become part of the pagerank (Zniff), or from the results page you show the tags applied to each hit (see comment on this post).
