Migrating from HTML to XHTML and XML - Part I

By Char James-Tanny, JTF Associates, Inc.



Introduction

This is the first part of a two-part article describing a detailed methodology for migrating HTML files to the structure and flexibility of XHTML and/or XML. By using XHTML to add structure and separate content from presentation, you'll be better positioned for a move to XML. Even if you never move to XML, your XHTML files will be easier to create and maintain, and will be more accessible. (See Using Web Standards to Create HTML Files.)

Comparing HTML, XHTML, and XML

HTML, XHTML, and XML are called "markup languages" because they include tags to indicate structure, formatting, and layout. They all have their roots in SGML (Standard Generalized Markup Language), which is a metalanguage (a language used to create or describe other languages).

  • HTML (Hypertext Markup Language) was announced in 1991, and last updated in December, 1999. The original purpose was to make information available to anyone, anywhere. By the last update, developers could use tables for layout, specify fonts and colors, and apply browser-specific markup (like "marquee" and "blink"). Cascading Style Sheets let developers separate content and presentation, but it wasn't a requirement, and many developers continued to embed layout information within their HTML files. However, HTML wasn't extensible enough to handle all the ways that people wanted to use it.
  • XML (Extensible Markup Language) was announced in early 1998. XML was designed so that it wouldn't suffer the same growing pains as HTML, and was geared more toward describing data, not just sharing information. XML also went beyond HTML by implementing rules requiring the separation of content and presentation, structured and semantic authoring, and syntax.
  • XHTML (Extensible HyperText Markup Language) was announced in January 2000. XHTML is XML that browsers interpret as HTML, and lets developers use both HTML and XML features. In return, developers have to follow many of the rules associated with XML.

While HTML, XHTML, and XML have several things in common, they differ in the following areas:

  • Document Type Definition (DTD) or schema
  • Focus (purpose)
  • Structure and validity
  • Syntax
  • Extensibility

Understanding Document Type Definitions (DTD) and schemas

Both a DTD and a schema set the rules for any files that use them.

HTML and XHTML use pre-defined DTDs, while XML uses a custom DTD or schema. You can use publicly available DTDs or schemas, like DocBook and DITA, or create your own.

The pre-defined DTDs for HTML and XHTML tend to be very flexible, and are available at the W3C's site. Structure isn't enforced, and authors can choose if they want to tag strictly to standards or not.

However, the available XML standards are very strict, with rules that must be followed. For example, the DTD or schema may include rules on which tags can be used inside other tags, the minimum number of items in a list, and more. If the rules are not followed, the content isn't displayed and errors result.

The following paragraphs briefly describe the differences between DTDs and schemas. I'll cover them in more detail in Part 2.

Document Type Definition

  • Uses complex non-XML syntax.
  • Supports limited data typing (a way of identifying data).
  • Doesn't support newer XML features.

Schema

  • Uses XML syntax. The schema must follow the inherent rules of tagging XML: matching case in start and end tags, quoted attribute values, and so on.
  • Allows for simple and complex data typing. Simple data typing describes stand-alone attributes or elements. Complex data typing describes elements with embedded elements or attributes.
  • Describes both elements and attributes.
  • Supports namespaces. A namespace indicates where an element is defined. For example, what do you think of when you read the word "orange"? The namespace will determine if "orange" is a fruit, a tree, a color, a river, a major mobile phone operator, a British bicycle manufacturer, a city, or something else.
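
To make the "orange" problem concrete, here is a minimal sketch in Python using the standard library's xml.etree.ElementTree module. The catalog document and both namespace URIs are invented for this example; they don't come from any real DTD or schema.

```python
import xml.etree.ElementTree as ET

# Two vocabularies that both define an element named "orange".
# The namespace URIs below are made up for this example.
DOC = """\
<catalog xmlns:fruit="http://example.com/ns/fruit"
         xmlns:color="http://example.com/ns/color">
  <fruit:orange>navel</fruit:orange>
  <color:orange>#FFA500</color:orange>
</catalog>"""

root = ET.fromstring(DOC)

# The parser expands each prefix to its namespace URI, so the two
# "orange" elements end up with different qualified names.
tags = [child.tag for child in root]
for tag in tags:
    print(tag)
# {http://example.com/ns/fruit}orange
# {http://example.com/ns/color}orange
```

Both elements are spelled "orange" in the file, but the software reading the file can always tell them apart.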

Focus

Each markup language has its own focus, or purpose.

  • HTML focuses on the ultimate display, typically in a browser window. Even though Cascading Style Sheets (CSS) were added as a standard in 1996, most developers continued to apply inline formatting. When the HTML 3.2 specification was released, it included table tags, and many developers took advantage of this to use tables (and nested tables) for layout. (The worst file I ever saw had 14 nested tables. It was almost impossible to edit the page, and the only way I was able to do it was by enabling different table borders to find the correct cells. And, yes, I would have modified the page to use CSS for layout, but the client preferred the tables.)
  • XML focuses on describing the data, but requires a transform to display the results in a browser or other user agent. Transforms convert the data to a specific output, and existing transforms can be used to generate HTML/XHTML, PDF, XML (using a different DTD or schema), and RTF. Some XML editors include default transforms. For example, oXygen distributes transforms for HTML Help and JavaHelp.
  • XHTML focuses on either the display or describing data, or both. Depending on how the XHTML was tagged, files will typically be displayed within a browser window, but may also be displayed in any user agent (mobile phone, Blackberry, or my favorite, the Internet-enabled refrigerator).

Structure and validity

Structure and validity describe how well the rules are followed.

  • A structured document follows the rules prescribed in the DTD or schema. For example, XML files must have one root tag, and elements must be properly nested.
  • A well-formed document is one that follows the rules of the language. For example, XML and XHTML require that all tags have an end tag. In HTML, singleton tags, such as img, are allowed.
  • A valid document is one that is structured and well-formed. XML that is well-formed, but doesn't specify a DTD or schema, is not valid.

HTML                           | XML                 | XHTML
Doesn't have to be structured  | Must be structured  | Doesn't have to be structured (but it helps)
Doesn't have to be well-formed | Must be well-formed | Doesn't have to be well-formed (but it helps)
Doesn't have to be valid       | Must be valid       | Doesn't have to be valid (but it helps)

If an XML file isn't structured, well-formed, or valid, it won't display. An XHTML file typically will display whether or not it follows the rules.
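
If you want to see how an XML parser reacts to markup that isn't well-formed, the following Python sketch is one way to try it. The function name is_well_formed is my own, and the check covers well-formedness only: the standard library parser used here does not validate against a DTD or schema.

```python
import xml.etree.ElementTree as ET

def is_well_formed(markup: str) -> bool:
    """Return True if an XML parser accepts the markup, False otherwise."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False

# Properly nested tags are accepted.
print(is_well_formed("<p><strong><em>text</em></strong></p>"))  # True
# Overlapping tags are rejected.
print(is_well_formed("<p><strong><em>text</strong></em></p>"))  # False
# An img tag that is never closed is rejected.
print(is_well_formed('<p><img src="a.png" alt="A"></p>'))       # False
# The same img tag, closed with " />", is accepted.
print(is_well_formed('<p><img src="a.png" alt="A" /></p>'))     # True
```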

Syntax

XML syntax is very important, because it is very strict and must be followed. However, the rules are fairly simple to understand and easy to use.

HTML Syntax

HTML syntax is the most lax of all three languages, and pretty much anything goes. Elements don't have to be properly nested, and they are not case sensitive. Neither empty nor non-empty elements are required to have end tags.

XML Syntax

The XML syntax rules are:

  • The file should begin with an XML declaration. (XML 1.0 recommends the declaration; XML 1.1 requires it.)

    The following is a sample declaration that defines the XML version and character encoding (in this case, Latin-1/West European character set).

    <?xml version="1.0" encoding="ISO-8859-1"?>

  • All elements have a closing tag. (The XML declaration is not considered part of the XML document, so it does not need a closing tag.) This applies to all tags, including paragraphs (<p></p>), images (<img /> or <img></img>), and line breaks (<br /> or <br></br>).
  • Start and end tags use the same case. XML tags are case sensitive. Therefore, <Content> isn't the same as <content>.
  • All tags are properly nested. For example, to display text that is bold and italic, you can use <strong><em></em></strong> or <em><strong></strong></em>, but not <strong><em></strong></em>.
  • A root element is included. All other elements are contained within this element as sub-elements.
  • Attribute values must always be in quotes.

In addition, white space in an XML file is preserved, which is the opposite of HTML.
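
Two of these points, case sensitivity and white space, are easy to check for yourself. The following is a small Python sketch that uses the standard library parser; the <note> and <Content> elements are made up for the example.

```python
import xml.etree.ElementTree as ET

# White space inside an element is handed to the application unchanged.
note = ET.fromstring("<note>  two   spaces\n  and a line break  </note>")
text = note.text
print(repr(text))  # '  two   spaces\n  and a line break  '

# Tag names are case sensitive, so <Content> cannot be closed by </content>.
try:
    ET.fromstring("<Content>text</content>")
    case_mismatch_accepted = True
except ET.ParseError:
    case_mismatch_accepted = False
print(case_mismatch_accepted)  # False
```

A browser displaying the same text as HTML would typically collapse the runs of spaces and the line break into single spaces.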

XHTML Syntax

XHTML syntax is not quite as strict as XML, but is stricter than HTML.

  • The file must include a DOCTYPE declaration: strict, transitional, or frameset. The strict DOCTYPE requires that no layout information is included in the file, while the transitional DOCTYPE allows for some layout information.

    While you can include the XML declaration in an XHTML file, it tends to cause problems in Internet Explorer 6. (This has been fixed for IE7.)

  • All elements must have a closing tag.
  • All element and attribute names must be lowercase.
  • All elements must be correctly nested.
  • Attribute values must be in quotes.

Extensibility

Extensible languages can be extended by anyone. HTML and XHTML are extensible only with Cascading Style Sheets. XML, however, is fully extensible, especially if you're using a custom DTD or schema.

Migrating from HTML to XHTML

The migration path consists of replacing incorrect tags and markup with their correct equivalents. For example, here is part of an HTML file that needs some work:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<HTML>
<HEAD><TITLE>HTML Sample File for WritersUA Session</TITLE></HEAD>
<BODY BGCOLOR="white">
<P><B><I><FONT SIZE=5 COLOR=blue FACE="Arial">Session</FONT></I></B></P>
<P><B><FONT SIZE="3" COLOR=black FACE="Arial">Introducing Windows "Longhorn" Help</FONT></B></P>

You may know of other examples that are in much worse shape, especially if they use tables or nested tables for layout. Visit Vincent Flanders' site, Web Pages That Suck, if you need some truly bad examples.

So how do we go about fixing badly tagged files? The following steps describe a possible process (although you can do the steps in any order). You can manually fix each item, or you can use any HTML editor that supports a global search-and-replace.

At this point, you don't want to remove any layout tables. Nested tables can be a real pain to untangle when you're trying to maintain a design. After you've cleaned the code, you can create a new design and then drop the content in place.

Fix the DOCTYPE

The current DOCTYPE is close, but wrong, because it doesn't include a link to the DTD. The DOCTYPE I use most often is for transitional XHTML:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

For more information, read Fix Your Site with the Right DOCTYPE by Jeffrey Zeldman.
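
If you have many files to fix, a short script can stand in for the global search-and-replace. The following Python sketch is one possible approach, not the only one. The names fix_doctype, DOCTYPE_RE, and XHTML_TRANSITIONAL are my own, and the sketch assumes the DOCTYPE (if there is one) is the first thing in the file.

```python
import re

XHTML_TRANSITIONAL = (
    '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" '
    '"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">'
)

# Matches an existing DOCTYPE declaration at the very top of the file.
DOCTYPE_RE = re.compile(r"\A\s*<!DOCTYPE[^>]*>", re.IGNORECASE)

def fix_doctype(html: str) -> str:
    """Replace the existing DOCTYPE, or add one if the file has none."""
    if DOCTYPE_RE.match(html):
        return DOCTYPE_RE.sub(XHTML_TRANSITIONAL, html, count=1)
    return XHTML_TRANSITIONAL + "\n" + html

old = '<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">\n<HTML>'
print(fix_doctype(old))
```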

Add the namespace

As explained earlier, the namespace determines how the data is interpreted. XHTML files declare the XHTML namespace on the root element:

<html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en">
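
The same change can be scripted. The following Python sketch rewrites the first opening html tag it finds. The function name add_namespace is hypothetical, and the sketch assumes the original tag carries no attributes you need to keep, because the whole tag is replaced.

```python
import re

XHTML_NS = "http://www.w3.org/1999/xhtml"

def add_namespace(markup: str, lang: str = "en") -> str:
    """Rewrite the opening html tag so it declares the XHTML namespace."""
    new_tag = f'<html xmlns="{XHTML_NS}" lang="{lang}" xml:lang="{lang}">'
    # Replaces only the first opening html tag, whatever its case.
    # Any attributes on the original tag are discarded.
    return re.sub(r"<html\b[^>]*>", new_tag, markup, count=1, flags=re.IGNORECASE)

print(add_namespace("<HTML>\n<HEAD>"))
```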

Determine class names

At this stage, you can do one of three things:

  • Ignore class names for now.
  • Create semantic class names to replace the display settings.
    Semantic classes are those named for their purpose. For example, if you have a style called "boldRed" and the boss says to make it purple, then you have a dilemma: do you rename the class (and fix it everywhere), or do you have a style called "boldRed" that isn't red? Of course, the real problem with boldRed is that you can't identify when to use it; you only know what it will look like.
    A better approach is to name the class for its purpose. For example, if you typically use boldRed for warnings, then name the class "warning."
  • Create any class names to replace the display settings, or possibly none at all. At this point, you're striving for consistency, so you can get away with converting:

    <P><B><FONT SIZE="3" COLOR="black" FACE="Arial">

    To: <p class="bodytext">

    Or: <p>

Remove display tags and attributes

This step takes the longest time. Any tags used only for display (such as <FONT>) must be removed. You may have to make several passes through your files to catch the display tags. Use the classes from the previous step, if you created them, to help pull the file into shape.
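
As a rough sketch of what one pass might look like, the following Python code strips a handful of display-only tags and keeps the text they wrapped. The tag list and the name strip_display_tags are my own choices. The sketch does not touch display attributes on other tags (such as BGCOLOR on <BODY>), and it does not assign classes, so treat it as a starting point.

```python
import re

# Tags that exist only to control appearance. Extend the list as needed.
DISPLAY_TAGS = ("font", "b", "i", "u", "center")

# Matches the opening or closing form of any tag in DISPLAY_TAGS.
# The \b keeps "b" from matching <body> or <br>, and "i" from matching <img>.
DISPLAY_TAG_RE = re.compile(
    r"</?(?:%s)\b[^>]*>" % "|".join(DISPLAY_TAGS), re.IGNORECASE
)

def strip_display_tags(html: str) -> str:
    """Remove display-only tags but keep the text they wrapped."""
    return DISPLAY_TAG_RE.sub("", html)

line = '<P><B><I><FONT SIZE=5 COLOR=blue FACE="Arial">Session</FONT></I></B></P>'
print(strip_display_tags(line))  # <P>Session</P>
```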

Change all tags to lowercase

Convert any remaining tags that are uppercase or mixed case to lowercase.
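
A regular expression can handle most of this step. The following Python sketch lowercases element names only; the name lowercase_tags is hypothetical. Attribute names are deliberately left alone, because a simple pattern can't safely tell an attribute name from text inside a quoted value, so you would still need to check those separately.

```python
import re

# Captures the "<" or "</" and the element name that follows it.
# A DOCTYPE is not affected, because "<!" is not followed by a letter.
TAG_NAME_RE = re.compile(r"(</?)([A-Za-z][A-Za-z0-9]*)")

def lowercase_tags(html: str) -> str:
    """Lowercase element names; attribute names and values are left alone."""
    return TAG_NAME_RE.sub(lambda m: m.group(1) + m.group(2).lower(), html)

print(lowercase_tags('<P CLASS="Title">Session</P>'))  # <p CLASS="Title">Session</p>
```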

Quote all attribute values

Attributes provide additional information for the element. For example, the <img> tag requires attributes that specify the source and alternate text. You can use either double or single quotes, as long as you're consistent.

Double quotes are the most common. However, if an attribute value itself contains quotes, wrap the value in the other kind of quote (for example, single quotes around a value that contains double quotes).
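
The following Python sketch shows one way to add the missing quotes. The names quote_attributes, START_TAG_RE, and ATTR_RE are my own, and the sketch assumes that no attribute value contains a ">" character. Values that are already quoted are matched whole and passed through unchanged, so text inside them is never rewritten.

```python
import re

# An opening tag, from "<name" up to the next ">".
START_TAG_RE = re.compile(r"<[A-Za-z][^>]*>")

# An attribute name, "=", then a double-quoted, single-quoted, or bare value.
ATTR_RE = re.compile(r"""(\s[A-Za-z][A-Za-z0-9:-]*)=("[^"]*"|'[^']*'|[^\s"'>]+)""")

def _quote(match: re.Match) -> str:
    name, value = match.group(1), match.group(2)
    if value[0] in "\"'":
        return match.group(0)      # already quoted: leave as-is
    return f'{name}="{value}"'     # bare value: wrap in double quotes

def quote_attributes(html: str) -> str:
    """Wrap unquoted attribute values in double quotes, inside tags only."""
    return START_TAG_RE.sub(lambda m: ATTR_RE.sub(_quote, m.group(0)), html)

print(quote_attributes('<FONT SIZE=5 COLOR=blue FACE="Arial">'))
# <FONT SIZE="5" COLOR="blue" FACE="Arial">
```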

Close all tags

Make sure all tags are closed. HTML doesn't require closed tags, although many developers used them for paired tags (like <p>).

Singleton tags (like <br> and <img>) don't require closing tags in HTML. Instead of creating paired tags (for example, <br> </br>), you can use just one tag by including a space and a slash before the close bracket. For example, to close the <br> tag, type <br />. To close the <img> tag, type <img [attributes] />. (Be sure to include the space before the slash.)
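
Closing the singleton tags is another step that lends itself to a script. The following Python sketch rewrites a short list of singleton tags so they end with a space and a slash. The list of elements and the name close_empty_elements are my own; add any other singleton tags your files use.

```python
import re

# Elements that never have content. Extend the list as needed.
EMPTY_ELEMENTS = ("br", "hr", "img", "input", "meta", "link")

# The tag name, its attributes (if any), and an optional existing slash.
EMPTY_TAG_RE = re.compile(
    r"<(%s)\b([^>]*?)\s*/?>" % "|".join(EMPTY_ELEMENTS), re.IGNORECASE
)

def close_empty_elements(html: str) -> str:
    """Rewrite <br> as <br />, <img ...> as <img ... />, and so on."""
    return EMPTY_TAG_RE.sub(r"<\1\2 />", html)

print(close_empty_elements('<p>one<br>two<img src="a.png" alt="A"></p>'))
# <p>one<br />two<img src="a.png" alt="A" /></p>
```

Tags that are already closed come out unchanged, so it's safe to run the function more than once on the same file.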

Remove tables (optional)

The W3C guidelines state that it's best if tables are used for data, not layout. However, using layout tables does not violate the XHTML specification.

If you want to remove the tables, now is the time. You need to know how to create positioned elements (typically <div>) to re-create the layout. Or you might decide to create a new layout! Once the files have been cleaned, you can use copy and paste to quickly move content from the tabled layout to the new layout.

Sample Files

For the sake of examples, this HTML file was cleaned in two passes.

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<HTML>
<HEAD><TITLE>Sample HTML File</TITLE></HEAD>
<BODY BGCOLOR="white">
<P><B><I><FONT SIZE=5 COLOR=blue FACE="Arial">Session</FONT></I></B>
<P><B><I><FONT SIZE="3" COLOR=black FACE="Arial">Title</FONT></B></I>
<P><I><FONT SIZE=3 COLOR="black" FACE="Arial">Description</FONT></I></B>
<P><B><I><FONT SIZE="3" COLOR="black" FACE="Arial">Time - Date</FONT></I></B>
</BODY>
</HTML>

During the first pass, the code was cleaned, following the steps above, and basic CSS classes were assigned.

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>...</head>
<body>
<p class="style1">Session</p>
<p class="style2">Title</p>
<p class="style3">Description</p>
<p class="style2">Time - Date</p>
</body>
</html>

During the second pass, the basic classes were replaced with semantic classes (and a minor change was made to the structure of the file).

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head></head>
<body>
<h1>Session</h1>
<p class="title">Title</p>
<p class="description">Description</p>
<p class="timedate">Time - Date</p>
</body>
</html>

Conclusion

In this article, you learned about the differences between HTML, XHTML, and XML, and you learned how to clean HTML files. The second installment will show you how to validate your XHTML files. Then we'll go into more detail on XML, including existing standards.


Char James-Tanny is president of JTF Associates and has almost 25 years of experience as a technical writer. She is well known in the Help community for her knowledge of online Help tools and concepts. Author of two books, she speaks frequently at conferences around the world on Help topics, cross-browser issues, and tool-specific functionality. Char is an AuthorIT Certified Consultant, a senior member of STC's Boston Chapter, and a 2006 Microsoft Help MVP. Visit her web site www.helpstuff.com.

