Migrating from HTML to XHTML and XML - Part I

By Char James-Tanny, JTF Associates, Inc.

Introduction

This is the first part of a two-part article describing a detailed methodology for migrating HTML files to the structure and flexibility of XHTML and/or XML. By using XHTML to add structure and separate content from presentation, you'll be better positioned for a move to XML. Even if you never move to XML, your XHTML files will be easier to create and maintain, and will be more accessible. (See Using Web Standards to Create HTML Files.)

Comparing HTML, XHTML, and XML

HTML, XHTML, and XML are called "markup languages" because they include tags to indicate structure, formatting, and layout. They all have their roots in SGML (Standard Generalized Markup Language), which is a metalanguage (a language used to create or describe other languages).
While HTML, XHTML, and XML have several things in common, there are some differences:
Understanding Document Type Definitions (DTD) and schemas

Both a DTD and a schema set the rules for any files that use them. HTML and XHTML use pre-defined DTDs, while XML uses a custom DTD or schema. You can use publicly available DTDs or schemas, like DocBook and DITA, or create your own.

The pre-defined DTDs for HTML and XHTML tend to be very flexible, and are available at the W3C's site. Structure isn't enforced, and authors can choose whether or not to tag strictly to standards. The available XML standards, however, are very strict, with rules that must be followed. For example, the DTD or schema may include rules on which tags can be used inside other tags, the minimum number of items in a list, and more. If the rules are not followed, the content isn't displayed and errors result.

The following paragraphs briefly describe the differences between DTDs and schemas. I'll cover them in more detail in Part 2.

Document Type Definition
Schema
Focus

Each markup language has its own focus, or purpose.
Structure and validity

Structure and validity determine how well the rules are followed.
If an XML file isn't structured, well-formed, or valid, it won't display. An XHTML file typically will display whether or not it follows the rules.

Syntax

XML syntax is very important, because it is very strict and must be followed. However, the rules are fairly simple to understand and easy to use.

HTML Syntax

HTML syntax is the most lax of the three languages, and pretty much anything goes. Elements don't have to be properly nested, and element names are not case sensitive. Neither empty nor non-empty elements are required to have end tags.

XML Syntax

The XML syntax elements are:
<?xml version="1.0" encoding="ISO-8859-1"?>

In addition, white space in an XML file is preserved, unlike in HTML, where runs of white space are collapsed.

XHTML Syntax

XHTML syntax is not quite as strict as XML, but is stricter than HTML.
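To make the difference in strictness concrete, here is a minimal sketch. It assumes that Python's standard-library xml.etree.ElementTree parser is an acceptable stand-in for a strict XML processor. The helper name is_well_formed and the three sample strings are illustrative only and are not taken from the article's sample files. The check covers well-formedness only, not validity against a DTD or schema.

```python
import xml.etree.ElementTree as ET


def is_well_formed(markup: str) -> bool:
    """Return True if a strict XML parser accepts the markup (hypothetical helper)."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        # The parser stops at the first syntax violation it finds.
        return False


# HTML-style markup: unquoted attribute value, unclosed <br>, mismatched tag case.
html_style = '<p class=bodytext>First line<br>Second line</P>'

# XHTML-style markup: lowercase tags, quoted attribute value, every element closed.
xhtml_style = '<p class="bodytext">First line<br />Second line</p>'

# Improperly nested elements, which HTML browsers usually tolerate.
badly_nested = '<p><b>bold text</p></b>'

for label, markup in [
    ("HTML style", html_style),
    ("XHTML style", xhtml_style),
    ("Badly nested", badly_nested),
]:
    print(label, "->", "well-formed" if is_well_formed(markup) else "rejected")
```

Only the XHTML-style string gets past the parser. A browser would most likely render all three, which is the leniency that the cleanup steps later in this article aim to remove.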
Extensibility

Extensible languages can be extended by anyone. HTML and XHTML are extensible only with Cascading Style Sheets. XML, however, is fully extensible, especially if you're using a custom DTD or schema.

Migrating from HTML to XHTML

The migration path consists of replacing incorrect tags and markup with correct ones. For example, here is part of an HTML file that needs some work:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">

You may know of other examples that are in much worse shape, especially if they use tables or nested tables for layout. Visit Vincent Flanders' site, Web Pages That Suck, if you need some truly bad examples.

So how do we go about fixing badly tagged files? The following steps describe a possible process (although you can do the steps in any order). You can manually fix each item, or you can use any HTML editor that supports a global search-and-replace.

At this point, you don't want to remove any layout tables. Nested tables can be a real pain to untangle when you're trying to maintain a design. After you've cleaned the code, you can create a new design and then drop the content in place.

Fix the DOCTYPE

The current DOCTYPE is close, but wrong, because it doesn't include a link to the DTD. The DOCTYPE I use most often is for transitional XHTML:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

For more information, read Fix Your Site with the Right DOCTYPE.

Add the namespace

As explained earlier, the namespace determines how the data is interpreted. XHTML declares its namespace with the XML xmlns attribute:

<html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en">

Determine class names

At this stage, you can do one of three things:
<P><B><FONT SIZE="3" COLOR="black" FACE="Arial">

To:

<p class="bodytext">

Or:

<p>

Remove display tags and attributes

This step takes the longest. Any tags used only for display (such as <FONT>) must be removed. You may have to make several passes through your files to catch all the display tags. Use the classes from the previous step, if you created them, to help pull the file into shape.

Change all tags to lowercase

Convert any remaining tags that are uppercase or mixed case to lowercase.

Quote all attribute values

Attributes provide additional information for the element. For example, the <img> tag requires attributes that specify the source and alternate text. You can use either double or single quotes, as long as you're consistent. Double quotes are the most common. However, if an attribute value itself contains quotes, use one kind of quote around the value and the other kind inside it.

Close all tags

Make sure all tags are closed. HTML doesn't require closing tags, although many developers used them for paired tags (like <p>). Singleton tags (like <br> and <img>) don't require closing tags in HTML.

Instead of creating paired tags (for example, <br> </br>), you can use just one tag by including a space and a slash before the close bracket. For example, to close the <br> tag, type <br />. To close the <img> tag, type <img [attributes] />. (Be sure to include the space before the slash.)

Remove tables (optional)

The W3C guidelines state that it's best if tables are used for data, not layout. However, using layout tables does not violate the XHTML specification.

If you want to remove the tables, now is the time. You need to know how to create positioned elements (typically <div>) to re-create the layout. Or you might decide to create a new layout! Once the files have been cleaned, you can use copy and paste to quickly move content from the tabled layout to the new layout.

Sample Files

As an example, this HTML file was cleaned in two passes.
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">

During the first pass, the code was cleaned, following the steps above, and basic CSS classes were assigned.

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

During the second pass, the basic classes were replaced with semantic classes (and a minor change was made to the structure of the file).

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

Conclusion

In this article, you learned about the differences between HTML, XHTML, and XML, and you learned how to clean HTML files. The second installment will show you how to validate your XHTML files. Then we'll go into more detail on XML, including existing standards.