Napsterization has a follow-up post comparing keyword searching on RSS/blog engines.
Here is the comparison chart.
The main difference from the previous analysis of URL lookups is that people have different expectations of keyword searching, and because it is the live web, results are shown by date rather than by relevance (as on e.g. Google).
But with so many results, duplication, and non-blog content, how can we create some sort of personal relevance? We need date limits and limits to the top % of blogs. (A top blog is ranked by its number of incoming links, although this doesn't mean it is more relevant than an unknown blog. PubSub segments the blogosphere by popular blogs, enabling you to filter or personalise your result set, and Feedster limits to blog hosts, etc.)
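To make the idea of date limits plus a "top % of blogs" limit concrete, here is a rough sketch in Python. Everything in it is an assumption for illustration: the result fields (`blog`, `date`), the `inbound_links` table, and the function name are made up, and none of the engines above are known to work this way. It also inherits the caveat already mentioned: ranking by incoming links says nothing about whether a post is actually relevant.

```python
import math
from datetime import date

def filter_results(results, since, top_percent, inbound_links):
    """Keep results dated on or after `since` from the top `top_percent` of blogs.

    results       -- list of dicts with hypothetical keys 'blog' and 'date'
    inbound_links -- dict mapping blog name -> number of incoming links
    """
    # Rank blogs by incoming links, most-linked first.
    ranked = sorted(inbound_links, key=inbound_links.get, reverse=True)
    # Always keep at least one blog so a tiny percentage doesn't empty the set.
    keep_n = max(1, math.ceil(len(ranked) * top_percent / 100))
    top_blogs = set(ranked[:keep_n])

    kept = [r for r in results
            if r["date"] >= since and r["blog"] in top_blogs]
    # Live-web convention: newest first.
    return sorted(kept, key=lambda r: r["date"], reverse=True)
```

Called with something like `filter_results(results, date(2005, 6, 1), 50, inbound_links)`, this would return only posts from June 2005 onwards written by the most-linked half of the blogs, newest first.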
Duplication of URLs and titles seems an easy enough problem to clean (I think!), but what about stories that are re-syndicated? Many people are publishing blog aggregators, like a public version of an RSS reader, where they take several RSS feeds and re-publish the content like a mega-blog.
This means the exact same content is being indexed by the blog/RSS engines. The permalink should be the same (otherwise there would be a copyright issue), but the same permalink will be indexed multiple times because it is re-published in multiple places.
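If re-syndicated copies really do keep the original permalink, collapsing them could look something like the sketch below. This is only my guess at one approach, not how any of these engines do it: the result fields (`permalink`, `found_at`, `date`, `title`) are hypothetical, and the normalisation is deliberately crude (lower-case the host, drop the trailing slash and the fragment).

```python
from urllib.parse import urlsplit, urlunsplit

def normalize_permalink(url):
    """Crude normalisation so trivially different URLs compare equal."""
    parts = urlsplit(url.strip())
    path = parts.path.rstrip("/") or "/"
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(),
                       path, parts.query, ""))

def collapse_by_permalink(results):
    """Group results sharing a permalink; keep the earliest, list the rest.

    results -- list of dicts with hypothetical keys:
               'permalink', 'found_at', 'date' (ISO string), 'title'
    """
    groups = {}
    for r in results:
        groups.setdefault(normalize_permalink(r["permalink"]), []).append(r)

    collapsed = []
    for key, items in groups.items():
        items.sort(key=lambda r: r["date"])  # earliest sighting first
        first = items[0]
        collapsed.append({
            "permalink": key,
            "title": first["title"],
            "date": first["date"],
            # Places where the same permalink was re-published.
            "also_seen_at": [r["found_at"] for r in items[1:]],
        })
    # Newest first, as on the live-web engines.
    collapsed.sort(key=lambda r: r["date"], reverse=True)
    return collapsed
```

Keeping the re-published locations in `also_seen_at` rather than throwing them away means the spread of a post across aggregators is still visible, just folded under one entry.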
Google News has a good way to alleviate duplicates, and older content that is related to the current content, by collapsing the stories under the current story.
Blogpulse does something similar with its conversation seed feature, which is based on a URL or on a keyword.
Also worth noting when comparing these engines: some are exclusive to RSS blog content, while others incorporate RSS from any type of content.
Here are some excerpts from the blog post:
“…Does the user want a quick taste of what is out there around a particular topic? Or do they want every instance of a keyword match, with an accurate count of those instances? Do they want to see only the most relevant posts that use the keywords, or the most recent?….
…Technorati has redone keyword search, removed the 7 day result limit…results only go back to last October….
…Technorati has reduced duplication over the past 10 months, bringing it more in line with Blogpulse’s cleaner result set…
…Technorati is now faster than Blogpulse…”
Phenomenally only returned 3 duplicates
Only blog posts (not mixed with RSS from traditional news), so easier to analyse just the blogosphere from the result set (well there is no distinguishing to be made, as they are all blog posts)
All blog titles in the result set correctly linked to the permalinks
Missing some posts (but aren’t they all)
Cleaner and easier to find an initial meme (the conversation seed)
Hard to compare, as the search is forced into a phrase search, so the result set will be smaller and more succinct in comparison
Lots of duplication
Not all results link to the permalink, some just link to the front page
Harder to find an initial meme (the conversation seed)
On this sample search almost ¾ of the results were non-blog content (traditional news, del.icio.us, etc.), making it hard to distinguish the view of just the blogosphere
Large result set as Feedster allows any instances of the search term, in any order
Not all results link to the permalink
¼ of the results were non-blog content, making it hard to distinguish the view of just the blogosphere
Large result set as Bloglines allows any instances of the search term, in any order
Not all results link to the permalink
I wonder how IceRocket would perform.
It would be good for the post to sum up when they’d use a particular engine over another one…I guess I’m being lazy.
Even though Blogpulse didn't cover every post in the blogosphere, it seems the easiest and cleanest to use for analysing the blogosphere.
For exhaustiveness, though, it seems that Technorati (beginning October 2004) and Bloglines (for a blog to be included, it has to be registered by a Bloglines account) index posts over a longer period of time than Feedster and Blogpulse (only latest 6 months).
Since Blogpulse seemed the best overall, it seems a great tool for current stuff; to get a more historic picture, both Technorati and Bloglines are the go (you just have a harder time going through the results, connecting to the actual posts, and summing up the trail and spread of a meme).