Information Architecture

Udell Observations on Taxonomy in Practice

I find myself retreating from politics and the real-world ramifications of social theory, back into the world of ideas about ideas.

This morning, via a new site in my aggregation list, Jon Udell on knowledge gardening:

Conventional wisdom holds that people will never assign metadata tags to content. It just isn't on the path of least resistance, the story goes, and those few who do step off the path succeed only in creating unwieldy taxonomies. (Do you file the revised XML Schema specification under xml/specifications or specifications/xml? We can never agree, and many good minds are sacrificed in the vain attempt.) Yet somehow, users of Flickr and do routinely tag content, and those tags open new dimensions of navigation and search. It's worth pondering how and why this works.

Abandoning taxonomy is the first ingredient of success. These systems just use bags of keywords that draw from -- and extend -- a flat namespace. In other words, you tag an item with a list of existing and/or new keywords.
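To make the "bag of keywords in a flat namespace" idea concrete, here is a minimal sketch in Python. It is not how any particular tagging site works; the `TagIndex` class, its method names, and the sample file names are all made up for illustration. The point is only that with flat tags, the xml/specifications versus specifications/xml argument never comes up, because order carries no meaning.

```python
from collections import defaultdict


class TagIndex:
    """Hypothetical flat tag store: no hierarchy, just items and keyword sets."""

    def __init__(self):
        self.items_by_tag = defaultdict(set)
        self.tags_by_item = defaultdict(set)

    def tag(self, item, *keywords):
        # Existing and new keywords are treated identically: a new keyword
        # simply extends the flat namespace.
        for kw in keywords:
            kw = kw.strip().lower()
            self.items_by_tag[kw].add(item)
            self.tags_by_item[item].add(kw)

    def find(self, *keywords):
        """Items that carry every one of the given keywords, in any order."""
        sets = [self.items_by_tag.get(kw.strip().lower(), set()) for kw in keywords]
        return set.intersection(*sets) if sets else set()

    def related(self, keyword):
        """Other keywords that co-occur with this one on at least one item."""
        keyword = keyword.strip().lower()
        found = set()
        for item in self.items_by_tag.get(keyword, set()):
            found |= self.tags_by_item[item]
        found.discard(keyword)
        return found


index = TagIndex()
index.tag("xml-schema-revised.html", "xml", "specifications")
index.tag("rss-notes.txt", "xml", "syndication")
```

Here `index.find("xml", "specifications")` and `index.find("specifications", "xml")` return the same set, and `index.related("xml")` is what gives you the "new dimensions of navigation" from the quote: a way to wander sideways from one keyword to its neighbors.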

Two quick lessons, here: People don't check their behavior to make sure it conforms to theory; and in order to find out what people do, you need to watch what they do. (To paraphrase Malinowski: Don't just listen to what they say they do; watch what they really do.)

I find taxonomies fascinating, yet limiting. This site makes use of Drupal's taxonomy module for organization. But I've always found the hierarchical nature of taxonomy to be a bit frustrating. Drupal's design lets me give a term multiple parents, and define associated keywords. It really serves, in effect, as a browsing tool, and as such can serve as a combination of gauntlet and conceptual joke (after the link, note the crumb trail at the top of the page).
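A rough sketch of what "multiple parents" buys you, in Python. This is not Drupal's API or its storage format; the `PARENTS` mapping, the term names, and `crumb_trails` are invented for illustration, and the sketch assumes the hierarchy has no cycles. A term with two parents simply ends up with two crumb trails.

```python
# Hypothetical taxonomy: each term maps to its list of parent terms.
PARENTS = {
    "xml schema": ["xml", "specifications"],
    "xml": ["technology"],
    "specifications": ["technology"],
    "technology": [],
}


def crumb_trails(term, parents=PARENTS):
    """Return every root-to-term path; assumes the taxonomy is acyclic."""
    ups = parents.get(term, [])
    if not ups:
        # A term with no parents is a root, so its trail is just itself.
        return [[term]]
    trails = []
    for parent in ups:
        for trail in crumb_trails(parent, parents):
            trails.append(trail + [term])
    return trails


trails = crumb_trails("xml schema")
# trails == [["technology", "xml", "xml schema"],
#            ["technology", "specifications", "xml schema"]]
```

Both paths lead to the same term, which is why this kind of structure works better as a browsing tool than as a filing cabinet.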

Aside: SoulSoup is too much. I could just pore over it all morning. Alas, I need to get some work done, now...

How I Want To Work, Part I

Here's how I want to work: I want to be able to just note stuff down, wherever I happen to be at that moment, and have it get stored and automatically categorized, and be available for publication wherever I want from wherever I am, whenever I want to. This has been an achievable dream for nearly ten years -- people are constantly hacking together systems to do just that. But we're stuck in a technologically-determined rut that keeps these solutions from being developed.

I've been thinking about these things a lot, and decided it was time that I wrote it all out, to organize my own ideas as much as anything else. So here's part one, where I try to unpack what it is that I'm really asking for, and start to get a sense for what's not working now, and why. So, as a separate story (because they're long, and would push everything down the page and out of sight), here's how I want to work...

How I Want To Work, Part One

[continued from blog entry]

Here's how I want to work: I want to be able to just note stuff down (in my ideal world, wherever I am at that moment) and have it get stored and automatically categorized, organized -- by timestamp, at least, but ideally also in some kind of navigable taxonomy

That Pernicious "Search Is King" Meme

There's an ever-waxing meme out there which basically boils down to this: "Forget about organizing information by subject -- let a full-text search do everything for you." The chief rationale is that such searching will help increase serendipity by locating things across subject boundaries.

Here's the problem: It's a load of crap. It throws the baby out with the bathwater, by discarding one time-honored, effective way of organizing for serendipity in exchange for another, inferior (but sexier) one.

This morning, via Wired News:

"We all have a million file folders and you can't find anything," Jobs said during his keynote speech introducing Tiger, the next iteration of Mac OS X, due next year.

"It's easier to find something from among a billion Web pages with Google than it is to find something on your hard disk," he added.

... which is bullshit, incidentally. At least, it is on my hard drive...

The solution, Jobs said, is a system-wide search engine, Spotlight, which can find information across files and applications, whether it be an e-mail message or a copyright notice attached to a movie clip. "We think it's going to revolutionize the way you use your system," Jobs declared.

In Jobs' scheme, the hierarchy of files and folders is a dreary, outdated metaphor inspired by office filing. In today's communications era, characterized by the daily barrage of new e-mails, websites, pictures and movies, who wants to file when you can simply search? What does it matter where a file is stored, as long as you can find it?

Ah, I see -- the idea of hierarchically organizing data is bad because it's "dreary" and "outdated" -- that is, of course, so quintessentially Jobsian a dismissal that we can be pretty sure the reporter took his words from The Steve, Himself.

But this highlights something important: That this is not a new issue for Jobs, or for a lot of people. Jobs was an early champion (though, let's be clear, not an "innovator") in the cause of shifting to a "document-centric paradigm". The idea was that one ought not have to think about the applications one uses to create documents -- one just ought to create documents, and then make them whatever kind of document one needs. Which, to me, seems a little like not having to care what kind of vehicle you want, when you decide to drive to the night club or go haul manure.

But I digress. This is supposed to be how Macs work, but it's actually not: Macs are just exactly as application-centric as anything else, though it doesn't appear that way at first. The few attempts at removing the application from the paradigm, like ClarisWorks and the early versions of StarOffice (now downstream from OpenOffice), merely emphasized the application-centricity even more: While word processors and spreadsheet software could generally translate single-type documents without much data loss, there was no way that they were going to be able to translate a multi-mode (i.e. word processor plus presentation plus spreadsheet) document from one format to another without significant data loss or mangling.

Take for example, Rael Dornfest, who has stopped sorting his e-mail. Instead of cataloging e-mail messages into neat mailboxes, Dornfest allows his correspondence to accumulate into one giant, unsorted inbox. Whenever Dornfest, an editor at tech publisher O'Reilly and Associates, needs to find something, he simply searches for it.

Again, a problem: It doesn't work. I do the same thing (though I do actually organize into folders -- large single-file email repositories are a data meltdown just waiting to happen). This is a good paradigmatic case, so let's think it through: I want to find out about a business trip to Paris that was being considered a year and a half ago. I search for "trip" and "paris". If my spam folder's blocked, and assuming we're still just talking about email, I'm probably not going to get a lot of hits on Simple Life 2 or the meta-tags for some other Paris Hilton <ahem!> documentary footage. In fact, unless the office was in Paris, and the emails explicitly used the term "trip", which they may well not, I probably won't find the right emails at all. Or I'll only find part of the thread, and since no email system currently in wide use threads messages, I won't have a good way of linking on from there to ensure that I've checked all messages on-topic. (And that could lead into another rant about interaction protocols in business email, but I'll stop for now.)

By contrast, if I've organized my email by project, and I remember when the trip was, I can go directly to the folder where I keep that information and scan messages for the date range in question.
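Here is the Paris example as a toy sketch in Python. Everything in it is made up for illustration: the `Message` record, the folder names, the four sample emails, and the two functions. It assumes a plain "all terms must appear" substring match, which is cruder than a real search engine, but it shows the failure mode I mean.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Message:
    folder: str
    sent: date
    subject: str
    body: str


# Hypothetical mailbox; none of these messages are real.
MAILBOX = [
    Message("projects/acme", date(2003, 1, 14), "Visiting the client",
            "Can you fly over to meet their team at the Paris office in March?"),
    Message("projects/acme", date(2003, 1, 20), "Re: Visiting the client",
            "Flights look expensive that week. Could we go in April instead?"),
    Message("projects/other", date(2003, 1, 15), "Status report",
            "Nothing new this week."),
    Message("personal", date(2004, 5, 2), "Road trip",
            "Paris, Texas is on the route."),
]


def keyword_search(messages, *terms):
    """Messages whose subject or body contains every term (case-insensitive)."""
    hits = []
    for m in messages:
        text = (m.subject + " " + m.body).lower()
        if all(t.lower() in text for t in terms):
            hits.append(m)
    return hits


def browse(messages, folder, start, end):
    """Messages filed in one folder and sent within a date range (inclusive)."""
    return [m for m in messages if m.folder == folder and start <= m.sent <= end]


search_hits = keyword_search(MAILBOX, "trip", "paris")
folder_hits = browse(MAILBOX, "projects/acme", date(2003, 1, 1), date(2003, 1, 31))
```

With this sample data, `search_hits` holds only the unrelated "Road trip" message, because nobody in the real thread ever wrote the word "trip", and the reply never mentions Paris at all. `folder_hits` holds both messages of the thread, found by following a path rather than guessing at vocabulary.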

The key problem here is that search makes you work, whereas with organization, you just have to follow a path. I used to train students on internet searching. This was back in the days when search engines actually let you input Boolean searches (i.e., when you could actually get precise results that hadn't been Googlewhacked into irrelevance). Invariably, students could get useful results faster by using the Yahoo-style directory drill-down, or a combination of directory search and drill-down, than they could through search.

If they wanted to get unexpected results, they were better off searching (at least, with the directory systems we had then and have now -- these aren't library catalogs, after all). And real research is all about looking for unexpected results, after all.

And that leads me to meta data.

Library catalogs achieve serendipity through thesauri and cross-referencing. (Though in the 1980s, the LC apparently deprecated cross-referencing for reasons of administrative load.)

The only way a system like Spotlight works to achieve serendipitous searching -- and it does, by the accounts I've read -- is through cataloged meta-data. That is, when a file is created, there's a meta-information section of the file that contains things like subject, keywords, copyright statement, ownership, authorship, etc. Which almost nobody ever fills out. Trust me, I'm not making this up: from my own experience, and that of others, I know that people think meta-data is a nuisance. Some software is capable of generating its own meta-data from a document, but such schemes have two obvious problems:

  1. They only include the terms in the document -- no synonyms or antonyms or related subjects, and no obvious way of mapping ownership or institutional positioning -- so they're no real help to search.
  2. They only apply to that software, and then only going forward, and then only if people actually use them.
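The first problem is easy to see in a toy version of automatic keyword generation. This Python sketch is an assumption about how such a scheme might work (simple word-frequency counting), not a description of Spotlight or any real product; `auto_keywords`, `STOPWORDS`, and the sample text are invented for illustration.

```python
import re
from collections import Counter

# Hypothetical stopword list; a real one would be much longer.
STOPWORDS = {"the", "a", "an", "and", "are", "of", "to", "in", "for", "is", "on", "too"}


def auto_keywords(text, limit=5):
    """Pick the most frequent non-stopword terms that appear in the text."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if w not in STOPWORDS and len(w) > 2)
    return [word for word, _ in counts.most_common(limit)]


doc = ("Automobile sales rose. Automobile exports rose too, "
       "and automobile dealers are pleased.")
keywords = auto_keywords(doc)
```

The top keyword is "automobile", and every keyword is a word that already occurs in the document. Someone searching for "car" or "vehicle" gets nothing, which is exactly the gap a human cataloger (or a thesaurus) would have closed.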

Now, a lot of this is wasted cycles if I take the position that filesystems aren't going away and this really all amounts to marketer wanking. But it's not wasted cycles, if I consider that the words of The Steve, dropped from On High, tend to be taken as the words of God by a community of technorati/digerati who think he's actually an innovator instead of a slick-operating second-mover with a gift for self-promotion and good taste in clothes.

This kind of thinking, in other words, can cause damage. Because people will think it's true, and they'll design things based on the idea that it's true. And since "thought leaders" like Jobs say it's important, people will use these deficient new designs, and I'll be stuck with them.

But there's little that anyone can do about it, really, except stay the course. Keep organizing your files (because otherwise, you're going to lose things, trust me on this, I know a little about these things). The "true way" to effective knowledge management (if there is one) will always involve a combination of effective search systems (from which I exclude systems like Google's that rely entirely on predictive weighting) with organization and meta-data (yes, I do believe in it, for certain things like automated resource discovery).

Funny, who would have thunk it: The "true way" is balance, as things almost always seem to come out, anyway. You can achieve motion through imbalance, but you cannot achieve progress unless your motions are in harmony -- in dynamic balance, as it were. What a strange concept...

Unsung Development of the Moment: Wikipedia Reinvents

Wikipedia is probably the most significant, important website on the net right here in May/June 2004. It's the signal success we can point to for bazaar-style projects, and the great white hope for the persistence of free, non-corporate-sponsored information on the web. Not to disregard Wikipedia's smaller cousin, WikInfo; they're just not big enough to be a great white hope, yet.

So, now, Wikipedia has done something intriguing: You can now talk about any article, or view previous versions. These appear to be benefits of the upgrade to version 1.3 of MediaWiki, the hyper-extended Wiki implementation that Wikipedia develops and uses to drive the site.

Tired terms like "community portal" don't do this justice. I don't think the great mass of the digerati really have any clue how important Wikipedia (and WikInfo) are. This kind of move, once they notice it, could blow Wikipedia wide open.

My great fear is that it could literally blow it wide open: How will they be able to handle the loads? Will their community software be able to cope with input from every Tom, Dick and Harry with an opinion?

The upside, of course, is that with a project of this broad scope, we'll finally get that "online experiment" that other "communities" have been claiming to be for years.

Addendum: I've posted this on Mefi; let's see if anybody cares.

Second addendum: Mefites assure me that it's always had that functionality, though it wasn't as obvious as it is, now. I wonder if they've made changes that will let them handle the greater load and have decided to front-and-center those features?
