Migrating from HTML to XHTML and XML - Part II

By Char James-Tanny, JTF Associates, Inc.


This is the second part of a two-part article describing a detailed methodology for migrating HTML files to the structure and flexibility of XHTML and/or XML.

In Part 1, I compared HTML, XHTML, and XML, and looked at the differences between them. I also covered ways that you can migrate from HTML to XHTML and how to fix common problems.

In Part 2, I'll cover validating your XHTML files, migrating from XHTML to XML, possible XML standards, creating your own standard, validating your XML files, and creating an XML to HTML transform.

Validating your XHTML files

Validating guarantees that you have tagged your files correctly by checking them against the applicable World Wide Web Consortium (W3C) standard. Validation is important because correctly tagged files:

  • Are properly indexed by search engines.
  • Render better and faster than files with errors.
  • Are usually both backward and forward compatible.
  • Are easier, faster, and cheaper to maintain.

If you aren't sure how to correctly tag your files according to the W3C standards, a validator will help by pointing out the mistakes.

Tip: I usually create a "shell" for my pages and validate it, then add content. The shell includes everything but the content, including graphics and navigation. Once the shell is valid, the possibility of errors after adding the content is reduced.

Free validators include:

You can find more validators with Google. The ones I've listed are those that I've worked with. All produce reports showing what the errors are and where they occur.

The free tools typically only check one page at a time. Many HTML editors include built-in validators or link to installed validators. Browser extensions (such as the Accessibility Toolbar for Internet Explorer and the HTML Validator for Mozilla and Firefox) let you validate individual pages quickly.
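
Because XHTML is XML, a first pass over many pages at once can be sketched with any XML parser. The following is a minimal sketch, not a replacement for a W3C validator: it checks only well-formedness (properly nested and closed tags), not validity against the XHTML DTD, and it uses Python's standard xml.etree.ElementTree module. The page names and markup are hypothetical. This parser also reports named entities such as &nbsp; as errors unless they are defined.

```python
import xml.etree.ElementTree as ET


def check_well_formed(pages):
    """Map each page name to None (well-formed) or (line, column, message)."""
    report = {}
    for name, markup in pages.items():
        try:
            ET.fromstring(markup)
            report[name] = None
        except ET.ParseError as err:
            # err.position is a (line, column) tuple supplied by the parser.
            line, column = err.position
            report[name] = (line, column, str(err))
    return report


# Hypothetical in-memory pages; real use would read the files from disk.
pages = {
    "good.html": "<html><body><p>Session</p></body></html>",
    "bad.html": "<html><body><p>Session</body></html>",  # <p> is never closed
}

for name, problem in check_well_formed(pages).items():
    print(name, "OK" if problem is None else problem)
```

Pages that pass this check can still fail validation, so the validators described above remain the final step.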

Migrating from XHTML to XML

As discussed in Part 1, it's easier to convert your HTML files to XHTML before finishing the conversion to XML. The initial passes help you determine the structure of the content and get the files cleaned up.

The process requires planning because of the many intermixed pieces. You're going from an HTML file that references a CSS (for formatting and layout) to XML files (for content) that reference a schema (.xsd) or DTD. You must also have a transform, which applies formatting to the XML when creating the desired output (such as (X)HTML, PDF, XML using another schema, and so on). The transform file uses the .xsl extension.

The first step is to determine which standard you will use (or whether you'll create a new one). You must adapt your content to the chosen standard, following its rules, or the files won't display correctly. The two most common standards (explained later) are DocBook and DITA. However, if the files you are creating don't match an existing standard, you can create your own.

The next step is to convert your XHTML files to XML, using the chosen standard (DTD or schema). Many XML editors let you reference the DTD or schema to guarantee that your files are tagged correctly.

What XML standards are available?

In Part 1, I briefly covered DTDs and schemas. To recap: both define the rules for the files that use them. However, DTDs use a complex non-XML syntax and support only document-oriented typing, while schemas are written in XML and support both data-oriented and document-oriented typing.

You can choose from existing standards or create your own.


DocBook

The oldest standard, DocBook, is large and robust. The DTD is currently at version 4.4 (January 2005), and schemas are in development. DocBook was originally designed to divide one long topic into smaller topics during transformation. Although DocBook is primarily intended for print (and now PDF) output, several transforms have been developed for online Help, including HTML, HTML Help, and XHTML.

The simplified DocBook standard contains 116 elements, which is a subset of the entire standard. However, it is not extensible (that is, it can't be extended with new tags).

Get more information on DocBook at the DocBook Wiki (http://wiki.docbook.org).


DITA

Darwin Information Typing Architecture, or DITA, is an open standard originally created by IBM and now managed by OASIS (http://www.oasis-open.org/). Unlike DocBook, DITA was originally designed around the concept of topics and topic types (concept, task, and reference).

DITA uses multiple reusable topics that can be assembled in different combinations. It is optimized for navigation and search, and is extensible.

Get more information on DITA at SourceForge (http://sourceforge.net/projects/dita-ot/). The dita-users Yahoo! group provides peer support, and user groups have been created in several cities.


MAML

Microsoft Assistance Markup Language, or MAML, was designed specifically for online Help for Vista, Microsoft's next operating system. The schemas were taken offline earlier this year and aren't available yet.

MAML includes several topic templates: conceptual, FAQ, Automated Task, and procedural. More information will be available soon.

Other schemas and DTDs

Visit the OASIS site for a list of other schemas and DTDs. And, of course, you can always create your own!

Creating your own DTD or schema

In this case, the sample HTML files don't work with either DocBook or DITA, because they are designed for a conference program. Therefore, we'll create a new schema called "Conference XML". The samples include both a DTD and schema, although only the schema will be referenced by the XML file. By using the existing XHTML file, we can set up the structure for the DTD and schema.

Sample DTD

This is the Conference XML DTD. It defines "parent" elements (conference, session, and speaker), and then the elements required for each parent element.

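<!-- parent elements -->
<!ELEMENT conference (session+)>
<!ELEMENT session (title,description,time,date,speaker+)>
<!ELEMENT speaker (name,company,website,bio)>
<!-- text-only elements -->
<!ELEMENT title (#PCDATA)>
<!ELEMENT description (#PCDATA)>
<!ELEMENT time (#PCDATA)>
<!ELEMENT date (#PCDATA)>
<!ELEMENT name (#PCDATA)>
<!ELEMENT company (#PCDATA)>
<!ELEMENT website (#PCDATA)>
<!ELEMENT bio (#PCDATA)>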

The DTD uses document-oriented typing only, identified by #PCDATA (parsed character data). Any information found within a specific tag will be parsed (interpreted). However, PCDATA doesn't let us confirm that the correct type of data has been entered. For example, someone could enter any character data for website.

Sample schema

This sample schema, designed as an upgrade to the sample DTD, defines the same XML standard. It's much longer than the corresponding DTD because it includes individual elements, parent element definitions, and data typing.

<?xml version="1.0" encoding="ISO-8859-1" ?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
   <xs:annotation>
      <xs:documentation xml:lang="en">
         Schema for conference session sample.
      </xs:documentation>
   </xs:annotation>
   <!-- definition of simple elements -->
   <xs:element name="title" type="xs:string"/>
   <xs:element name="description" type="xs:string"/>
   <xs:element name="time" type="xs:time"/>
   <xs:element name="date" type="xs:date"/>
   <xs:element name="name" type="xs:string"/>
   <xs:element name="company" type="xs:string"/>
   <xs:element name="website" type="xs:anyURI"/>
   <xs:element name="bio" type="xs:string"/>
   <!-- definition of complex elements -->
   <xs:element name="speaker">
      <xs:complexType>
         <xs:sequence>
            <xs:element ref="name"/>
            <xs:element ref="company"/>
            <xs:element ref="website"/>
            <xs:element ref="bio"/>
         </xs:sequence>
      </xs:complexType>
   </xs:element>
   <xs:element name="session">
      <xs:complexType>
         <xs:sequence>
            <xs:element ref="title"/>
            <xs:element ref="description"/>
            <xs:element ref="time"/>
            <xs:element ref="date"/>
            <xs:element ref="speaker" maxOccurs="unbounded"/>
         </xs:sequence>
      </xs:complexType>
   </xs:element>
   <!-- root element, as used by the sample XML file -->
   <xs:element name="conference">
      <xs:complexType>
         <xs:sequence>
            <xs:element ref="session" maxOccurs="unbounded"/>
         </xs:sequence>
      </xs:complexType>
   </xs:element>
</xs:schema>

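The schema uses data-oriented typing, identified by the type attribute. This lets us confirm that the correct type of data has been entered. For example, website uses the type anyURI, and time and date use the types xs:time and xs:date. If the content inside one of these tags doesn't match its type (for example, "2 pm" instead of 14:00:00 inside the time tag), the validator reports an error.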
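
To illustrate the difference between #PCDATA and typed content, here is a minimal sketch in Python. It is not schema validation: Python's ISO 8601 parsers stand in for the xs:time and xs:date checks (and are somewhat more lenient than a real schema validator), and the sample values are hypothetical. Elements typed as xs:string accept any text, just as #PCDATA does.

```python
from datetime import date, time

# Stand-ins for the schema's typed elements (an approximation of xs:time and xs:date).
TYPE_CHECKS = {"time": time.fromisoformat, "date": date.fromisoformat}


def type_errors(fields):
    """Return the tags whose text does not fit the declared type."""
    failed = []
    for tag, text in fields.items():
        parse = TYPE_CHECKS.get(tag)
        if parse is None:
            continue  # xs:string in the schema: any text is accepted
        try:
            parse(text)
        except ValueError:
            failed.append(tag)
    return failed


# Hypothetical session data.
good = {"title": "XML 101", "time": "14:00:00", "date": "2006-05-08"}
bad = {"title": "XML 101", "time": "2 pm", "date": "May 8"}

print(type_errors(good))
print(type_errors(bad))
```

A DTD would accept both sets of values, because it sees only character data. A schema-aware validator would flag the second set.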

Reworking the HTML files

We'll start with the XHTML file (from Part 1):

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
  <head>
     <title></title>
  </head>
  <body>
     <h1>Session</h1>
     <p class="title"></p>
     <p class="description"></p>
     <p class="timedate"></p>
     <p class="name"></p>
     <p class="company"></p>
     <p class="website"></p>
     <p class="bio"></p>
  </body>
</html>

Because the XHTML file has been structured, converting it to XML will go fairly quickly.

Sample XML file

The XML file needs some standard content: the XML declaration, the stylesheet reference, and the schema reference on the root element. Only the variable content (such as the names of the XSL and XSD files) changes from file to file.

The rest of the information comes from the XHTML file. For example, <h1>Session</h1> is converted to <session>...</session>, and <p class="title"></p> is converted to <title></title>.

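<?xml version="1.0" encoding="ISO-8859-1"?>
<?xml-stylesheet type="text/xsl" href="sample.xsl"?>
<conference xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="schema.xsd">
  <session>
     <title></title> <description></description>
     <time></time> <date></date>
     <speaker>
       <name></name> <company></company> <website></website> <bio></bio>
     </speaker>
  </session>
</conference>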
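
The mapping from class names to element names can be sketched in a few lines of Python. This is a rough illustration, not a general converter. It assumes one session per page, that speaker is nested inside session (as in the sample schema), and that the timedate paragraph holds the time and date separated by " - ". The sample content is hypothetical.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "{http://www.w3.org/1999/xhtml}"
SESSION_FIELDS = ("title", "description")
SPEAKER_FIELDS = ("name", "company", "website", "bio")


def xhtml_to_conference(xhtml_text):
    """Turn <p class="..."> paragraphs into Conference XML elements."""
    page = ET.fromstring(xhtml_text)
    by_class = {
        p.get("class"): (p.text or "").strip()
        for p in page.iter(XHTML_NS + "p")
    }
    conference = ET.Element("conference")
    session = ET.SubElement(conference, "session")
    for field in SESSION_FIELDS:
        ET.SubElement(session, field).text = by_class.get(field, "")
    # Assumption: the timedate paragraph reads "time - date".
    time_text, _, date_text = by_class.get("timedate", "").partition(" - ")
    ET.SubElement(session, "time").text = time_text
    ET.SubElement(session, "date").text = date_text
    speaker = ET.SubElement(session, "speaker")
    for field in SPEAKER_FIELDS:
        ET.SubElement(speaker, field).text = by_class.get(field, "")
    return conference


# Hypothetical page content.
sample = """<html xmlns="http://www.w3.org/1999/xhtml">
  <body>
    <h1>Session</h1>
    <p class="title">XML 101</p>
    <p class="description">An introduction.</p>
    <p class="timedate">14:00:00 - 2006-05-08</p>
    <p class="name">A. Speaker</p>
    <p class="company">Example Co.</p>
    <p class="website">http://www.example.com</p>
    <p class="bio">Writes about XML.</p>
  </body>
</html>"""

print(ET.tostring(xhtml_to_conference(sample), encoding="unicode"))
```

The sketch works only because the XHTML was structured first. Each class attribute names exactly one kind of content, so it maps directly to one element.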

Validating your XML files

You can validate your XML files by using an:

Sample transform

Once you have an XML file that matches your schema, you need a transform. The transform applies formatting to the XML when creating the desired output. The sample transform is for HTML-based output. Without a transform, the XML is displayed in its raw format in the browser. (People could explore the structure of the file, but not see the formatted end result.)

In this case, the transform also applies logic. For example, the <xsl:for-each> tag under <xsl:template> shows that for each conference session, the title, description, time, and so on, should be displayed. This means that if a new session is added to the XML file, it is displayed automatically. No other changes need to be made to any of the supporting files.

This transform converts the XML to HTML for viewing through a browser.

<?xml version="1.0" encoding="ISO-8859-1"?>
<xsl:stylesheet version="1.0"
     xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:template match="/">
  <html>
  <body>
    <!-- repeat the block below once for each session -->
    <xsl:for-each select="conference/session">
    <h1><xsl:value-of select="title"/></h1>
      <p><xsl:value-of select="description"/></p>
      <p><xsl:value-of select="time"/> - <xsl:value-of select="date"/></p>
    <h2><xsl:value-of select="speaker/name"/>, <xsl:value-of select="speaker/company"/></h2>
      <p><xsl:value-of select="speaker/website"/></p>
      <p><xsl:value-of select="speaker/bio"/></p>
    </xsl:for-each>
  </body>
  </html>
</xsl:template>
</xsl:stylesheet>
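
For readers more familiar with scripting than with XSLT, the loop can be mirrored in Python. This is only an illustration of the logic, not an XSLT processor. The function names and the two sample sessions are hypothetical.

```python
import xml.etree.ElementTree as ET
from html import escape


def field(node, path):
    """Return the escaped text at path, or an empty string if it is missing."""
    return escape(node.findtext(path, default=""))


def render_sessions(xml_text):
    """Build one HTML fragment per <session>, as the for-each loop does."""
    conference = ET.fromstring(xml_text)
    parts = []
    for session in conference.findall("session"):
        parts.append(f"<h1>{field(session, 'title')}</h1>")
        parts.append(f"<p>{field(session, 'description')}</p>")
        parts.append(f"<p>{field(session, 'time')} - {field(session, 'date')}</p>")
        parts.append(
            f"<h2>{field(session, 'speaker/name')}, "
            f"{field(session, 'speaker/company')}</h2>"
        )
        parts.append(f"<p>{field(session, 'speaker/website')}</p>")
        parts.append(f"<p>{field(session, 'speaker/bio')}</p>")
    return "\n".join(parts)


# Hypothetical conference data with two sessions.
sample = """<conference>
  <session>
    <title>XML 101</title><description>An introduction.</description>
    <time>14:00:00</time><date>2006-05-08</date>
    <speaker><name>A. Speaker</name><company>Example Co.</company>
      <website>http://www.example.com</website><bio>Writes about XML.</bio></speaker>
  </session>
  <session>
    <title>XSLT 101</title><description>Transforms.</description>
    <time>15:30:00</time><date>2006-05-08</date>
    <speaker><name>B. Speaker</name><company>Example Co.</company>
      <website>http://www.example.com</website><bio>Writes about XSLT.</bio></speaker>
  </session>
</conference>"""

print(render_sessions(sample))
```

Adding a third session to the sample data produces a third block of output without any change to the function, which is the same property the transform provides.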


In this article, you learned ways of validating your XHTML files, how to migrate from XHTML to XML, possible XML standards, how to create your own standard and validate your XML files, and how to create an XML to HTML transform.

Char James-Tanny is president of JTF Associates and has almost 25 years of experience as a technical writer. She is well known in the Help community for her knowledge of online Help tools and concepts. Author of two books, she speaks frequently at conferences around the world on Help topics, cross-browser issues, and tool-specific functionality. Char is an AuthorIT Certified Consultant, a senior member of STC's Boston Chapter, and a 2006 Microsoft Help MVP. Visit her web site at www.helpstuff.com.