The Future of Indexing?

By Jan Wright, WrightInformation

A recent article in the Society for Technical Communication's Intercom magazine proclaimed that indexing is on the rise (Seth Maislin, "The Indexing Revival," February 2005), and that there is a renaissance of work in the field. But at the WritersUA March conference, Microsoft's Longhorn features session declared that Longhorn's Help system will not contain an index, because "no one uses it." Then, to add to the discussion, at that same conference Apple revealed that its next help engine will include synonym rings and will add a form of indexing back into its display. Who's right? Who's correctly predicting the trends?

All three, actually

"Indexing is an arcane art whose time has not yet come," according to Lise Kreps, an indexer I worked with at Aldus and Microsoft long ago. This arcane art, in which a human designates both the subject of a chunk of information and where that chunk lives, still hasn't been completely mimicked by automation. Natural language search engines provide some great results and some very mixed ones, and the results improve for people who are willing to learn the tricks for getting the best out of each engine. But natural language engines solve only one aspect of searching. Vast holes remain that an engine doesn't pick up, because of metaphor, or the way a document is titled or structured, or how the search is phrased. If we rely totally on automation to retrieve information, some will be lost. "Important information will no longer be made retrievable. Instead, information will become important simply because it is retrievable" (Richard Evans, 2002). If the information's structure doesn't work well with the engine, or if the user is not in a search engine mode, the search may come up short and miss pieces of important data.

Easy answer searches, such as "What's the shortcut for typing an em dash?", are great for search engines. But for a more complex problem, natural language engines can come up short. Try this one, for instance: "How do you freeze the header columns in Excel so that they don't scroll, and so that they appear on each printed page?" Searching for an answer to a complex question is an iterative process. The user switches search modes several times in the course of a complex search. A search like the Excel question may have a user starting by typing words or terminology in the search box. "'Freezing' didn't work. What else can I try?" And it is at that moment of "What else can I try" that a user loves to see a list of other categories, or terms, or an analyzed index.

Unless a browsing search presents categories or analysis, information may become unretrievable for the user who is at a loss for words. Categories can be provided by indexes and other such analyzed lists of content in context. There is a browsing period in the search process that natural language engines don't accommodate, a time when a reader wants to know what other types, what other modes, what other features, what other subsets, or what other ideas he or she can use to solve a problem or get more information.

In this light, who is right in predicting the trends in indexing? I said all three. Seth Maislin is right, because the people who tend to do indexing, whether freelancers or in-house, are going to see the need for their categorization, classification, and language fine-tuning as companies face the fact that they have libraries and libraries of bits of information to control and make retrievable. Microsoft is right about their analysis of their own indexes, because they have not exposed true indexes to their users in their mainstream products for several years, so indeed, "no one" is using the index in their products. Their users have learned not to expect much from that Index tab. And Apple is right to make a form of indexing accessible again, because they recognize that users take alternating paths to information. Users have different learning styles, different searching styles, and different iterating paths within one search session.

As Gordon Meyer of Apple says, "The Apple Help search engine is really quite good (full-text, natural language) but some users just aren't ‘searchers.' The index is there to provide alternative access for those who don't, or won't, use the search function. A key reason behind its inclusion now, as opposed to in an earlier version [of Apple Help], is that we've added a technical solution for generating the index ‘on-the-fly'--based on tagging done by instructional designers--which makes interlinked pages much more compatible with our continuous publishing, Internet-driven, model."

Users don't have physical clues to online information outside of help's navigation systems. Search appears fickle, especially if the user needs to type in the same question after a month or two has passed. I know I have tried to find the answer to the Excel header question more than once, and the answers I get vary each time. I phrase it a little differently whenever I look that up, and I can't remember what the perfect term was that got me the original answer. I know it is in there, but I'm not hitting it 100%. This unpredictability, the feeling that you didn't get the search "right," makes search feel unreliable for complex questions. Providing a homing device that is always there, like an index, gives the user an alternative source of help.

Serving all your users, and all your information, may mean using an old form of access and linking it to information you haven't even written yet. Predicting what topics to interlink to an index means categorizing and classifying the nature of the knowledge your company publishes now, and is likely to publish in the future.

It's about aboutness

There is still a strong need to connect "aboutness metadata" to chunks of content. That aboutness metadata can be exposed, as in an index, or listed in a categories list, or hidden in fields and used by a fine-tuned search engine. Indexes may go away in the next version of Longhorn, but they will be back in other ways, because it still takes human analysis to provide oversight on "aboutness." Searching for content in the right context is a last frontier, and although we are on the edges of the frontier, we still don't have automated content retrieval completely solved. We get a lot of results that don't meet our needs at the time, or when we switch search modes. There's still a lot of unfindable information. Aboutness metadata provides the contextual clues.

In the last several conferences I've attended, an emphasis has been put on metadata schemas. One company I've contracted with has over 100 fields of metadata to be filled out on each document for its intranet. I don't think attaching that much metadata to content will solve all the issues, because employees don't have that much time, and companies normally don't have that much money. And automation or natural language engines will probably not solve it all, because if they did, both Google and the talking paper clip would always work.

What I do see happening is a cycle of effort back and forth as people realize the problems they have getting good information recall. The proposed solutions flip-flop between human analysis and automation as companies find the problems inherent in each way of addressing findability.

For the most part, content developers do not want to take the time to keyword documents and fill in metadata fields. When this individual reluctance scales up to a large help system, you are left with only automation. That means gaps when the user lacks the terminology, or gaps in a topic that the search engine seems to skip over.

It's about learning pathways

A good user assistance system should leave a user more able to cope with the next question he or she has, by adding a bit of explanation, or pattern recognition, or map-like structures that show how information is accessible. That learning-for-next-time is a piece we need to address. Let's say a user found what was needed this time with a full-text search engine. Great. Next question? No luck: 90 hits from full-text search, and he gave up. We need to make that number smaller on our end (with good weighting, metadata, and vocabulary control). But we also want to help users make the results more focused on their end. How can we do that? How do we help them recognize the patterns of search that work in this particular body of information? We could do it by exposing some pieces of the metadata in a non-threatening alternative access mode. We have figured out some great ways of doing it for specific tasks: walking a user through a decision path, exposing contextual help, exposing tutorials. And we have figured out some standards that users learn to expect: exposed indexes, TOCs, cross references, and related topics. We need to figure out how to expose structure-to-learn pathways depending on the question's context and the topic's context.

If we want the system to scale and to meet challenges like changed and updated information, that will require aboutness metadata on the topic side, and predictive ability on the search side, and that's where the indexing skills come in. Building up a body of controlled aboutness information is a task that takes off from indexing, and reforms and reshapes it into something that can serve multiple purposes. For example, if all the topics in a help system have metadata attached, dealing with product name, task, version, and aboutness, results of a search could automatically lead to matched topics with the same metadata attributes, regardless of whether the topic lives locally or on the web, and regardless of whether it has been changed recently. But it takes a very controlled set of aboutness metadata, in place, and followed rigorously.
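The matching described above can be sketched in a few lines. This is a hypothetical illustration, not any real help system's schema: the four fields (product, task, version, aboutness) come from the paragraph, but the topic titles, field values, and matching rule are invented for the example.

```python
from dataclasses import dataclass

@dataclass
class Topic:
    """A help topic tagged with aboutness metadata (hypothetical schema)."""
    title: str
    product: str
    task: str
    version: str
    aboutness: set  # controlled-vocabulary terms for what the topic is about

def related_topics(hit, corpus):
    """Find other topics sharing the hit's product and task plus at least
    one aboutness term -- regardless of where each topic lives."""
    return [t for t in corpus
            if t is not hit
            and t.product == hit.product
            and t.task == hit.task
            and t.aboutness & hit.aboutness]

# Illustrative corpus: a local topic and a web topic share attributes.
corpus = [
    Topic("Freeze panes", "Excel", "layout", "2003", {"headers", "scrolling"}),
    Topic("Repeat header rows when printing", "Excel", "layout", "2003",
          {"headers", "printing"}),
    Topic("Insert an em dash", "Word", "typing", "2003", {"special characters"}),
]

hit = corpus[0]
for t in related_topics(hit, corpus):
    print(t.title)  # the printing topic surfaces alongside the freeze-panes hit
```

The point of the sketch is the last clause of the paragraph: the matching works only if the aboutness terms are drawn from one rigorously controlled set, so that "headers" in one topic is guaranteed to mean "headers" in another.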

Broadening the index world

The first steps toward this type of controlled language set involve analyzing types of content and types of questions, and creating controlled vocabularies, so that your data-to-be is standardized across all of your documents. This means developing the standards, checking data across all documents, and reworking spots where some content has been analyzed too deeply in the metadata and other content not enough. That's human work, and indexing skills are a natural fit for it. You can rely on automated concordances to sample what is in each body of knowledge, but the final analysis still needs to be human, and matched to the needs of the company and the users.
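An automated concordance of the kind mentioned above can be as simple as a term-frequency count across the document set. This is a minimal sketch, assuming plain-text documents and naive tokenization; it produces raw material for the human analysis, not a finished vocabulary.

```python
from collections import Counter
import re

def concordance(docs):
    """Count term occurrences across a body of documents -- a starting
    sample for vocabulary work, with no stemming or stopword removal."""
    counts = Counter()
    for text in docs:
        counts.update(re.findall(r"[a-z']+", text.lower()))
    return counts

# Illustrative documents standing in for a body of knowledge.
docs = [
    "Freeze the header rows so they do not scroll.",
    "Header rows can repeat on each printed page.",
]

for term, n in concordance(docs).most_common(3):
    print(term, n)
```

A real pass would fold in stemming, phrase detection, and stopword lists, but even this crude count shows which candidate terms ("header", "rows") recur and deserve a place in the standardized vocabulary.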

At some point you will notice overlapping areas, where help crosses over into web forums, and where one structure could be devised multiple ways. That's when this kind of work becomes highly political and cultural: whose structure of the universe do we take as the "real" one? As soon as you get into those questions, the work becomes highly charged, because no two people structure content the same way. How closely does your in-house structure resemble how the user thinks about the content?

Jorge Luis Borges claimed to have found a Chinese encyclopedia that divided animals into these categories:

a. belonging to the Emperor
b. embalmed
c. tame
d. sucking pigs
e. sirens
f. fabulous
g. stray dogs
h. included in the present classification
i. frenzied
j. innumerable
k. drawn with a very fine camelhair brush
l. et cetera
m. having just broken the water pitcher
n. that from a long way off look like flies

  (Michel Foucault, The Order of Things)

Our first take on this is "Hunh?" We in the West are used to a class-and-species approach to animals. Take a second look, and you realize that how we categorize is a reflection of our culture. These are important categories to the person who wrote the list. The way we break down content as content developers and representatives of a company's product is also cultural, and as writers and editors of content, we have a slightly different culture than our users do. Our notions of what user assistance looks like may resemble this animal category list to some of our users. Our categories of tasks and concepts may not make any sense to them. And our aboutness metadata must reflect their categorizations as well as our own, or their searches may not get good results from our data.

You can see this cultural difference between developer and user when you look at poorly designed interfaces. Both application interfaces and web site interfaces are the "face" of a culture presented to users who may not be coming from a similar cultural background. Think of a senior citizen getting his first computer handed down from his son, and wondering why there is a C: drive and no B: drive. On web sites, the cultural gap is sometimes even wider: web site information architecture brings politics into the mix, and whoever has the most power in the organization gets their material out in front. The users are not part of the company's culture, and often their opinions of what should be easily accessible on a home page are lost in the politics.

Alternative access points like indexes are not used very often on web sites, for two main reasons:

  • a lack of truly great tools that allow this work to be done quickly, easily, and flexibly, with instant updating
  • a lack of skilled people power

Web sites change and update too much to hand-code such things as indexes. So if a company wants an index on its site, it has to build a controlled vocabulary, apply it in an easy-to-use tool, and take the time to pull terms as metadata into each posted page. (You can stop at a certain point: apply terms down to a certain level, and rely on a mini search for rapidly changing materials or ASP pages.)
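Once pages carry vocabulary terms as metadata, the index itself can be regenerated mechanically whenever pages change, which is what makes the approach survive constant updates. A minimal sketch, with invented page records and field names rather than any real CMS schema:

```python
from collections import defaultdict

# Hypothetical page metadata: each posted page carries terms pulled
# from the controlled vocabulary (records here are illustrative).
pages = [
    {"url": "/help/freeze-panes.html", "terms": ["headers", "scrolling"]},
    {"url": "/help/print-titles.html", "terms": ["headers", "printing"]},
    {"url": "/help/em-dash.html", "terms": ["special characters"]},
]

def build_index(pages):
    """Invert page->terms into an alphabetized term->pages index,
    regenerated on demand so updates never require hand-coding."""
    index = defaultdict(list)
    for page in pages:
        for term in page["terms"]:
            index[term].append(page["url"])
    return dict(sorted(index.items()))

for term, urls in build_index(pages).items():
    print(term, "->", ", ".join(urls))
```

The hard, human part is upstream: the index is only as good as the controlled vocabulary the terms are drawn from, which is why the inversion step is trivial and the vocabulary work is not.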

Adding that data means dictating some field-filling to your content creators, and making the vocabulary easy to use. The company also needs to build that vocabulary with an interface in mind, knowing what the user will be seeing and navigating. Design the interface first, then build the vocabulary to fill it. That's a lot of pre-coordinated work, before you even count the maintenance the vocabulary requires. That's why there aren't more real indexes on the web. Search is so much easier to implement.

Back to the Index

We started out talking about indexing, and somehow, we've wound up discussing categories and metadata instead. That's where indexing is going. Traditional indexes work great for static information contained within finite boundaries. But the boundaries of user assistance aren't very finite any more, and static indexing no longer works as well when your knowledge base keeps growing or changing. This doesn't mean indexing goes away. It morphs, and becomes controlled vocabulary or taxonomy work or aboutness metadata. Pieces of larger vocabulary structures can be exposed as an index, or aid a search engine's work, or predict what labels a company needs for content management.

So it can be said that indexing skills are still important, for two reasons: 1) ensuring that your users can have alternative paths of access to information, and 2) realizing that information becomes more retrievable when it is tagged with aboutness metadata. Indexing keeps both your users and important data from becoming lost.

Whether indexes will disappear or not depends on the amount of pain users are willing to endure when searching, and the amount of time and money companies are willing to spend to make information more retrievable. It's hard to tell at this point, but I see signs that alternative access is starting to make its case. Until we know, the best course is to keep your options open: learn what you can about standardizing vocabulary, think about the interfaces in which you would like to expose alternate lists and indexes, keep your documents tagged with a minimum set of metadata, and keep an eye on what other companies are doing, especially those with money to spend on development. If they start spending money on this kind of work, it's because the pain is increasing.

Resources to check:

Jan Wright has been working with online content since 1991, both as an indexer and as a taxonomist. Her clients have included companies such as Microsoft, Autodesk, and Apple, and her company has provided indexing, controlled vocabularies, and structured taxonomies for online content. She has a masters degree in library science, and extensive experience in software documentation. Her web site is