Geek Blight - Creating structured documents from plain text

Posted on 2007-10-01T20:08Z. Updated on 2020-07-25T19:13Z.

Ever since personal computers appeared, people have used them to try to solve the problem of text composition. The goal is to write and format a long piece of text, like a book or an article, with minimum work. Ideally, you would only have to write the text itself and nothing more. That is impossible, because the computer needs to know how the text should be formatted: either you provide the formatting information yourself, or you provide clues about the document structure so the computer, which is smarter than a typewriter, can format the text itself. Or you could provide a mixture of the two: the structure plus some directives about how you want it to look.

Typewriters

Think for a moment about typewriters. They are simple devices with many limitations, but they are easy to use and let you concentrate on the contents of what you’re writing. The most basic models can’t make text bold or italic, and they can’t change the size of the text. These restrictions force you to play by some rules and forget a bit about the appearance of the text. Naturally, they have the disadvantages of a mechanical device: you type the text directly onto the final physical medium, mistakes are hard to correct, and you have to decide beforehand how you are going to number the document sections, among other things. But, apart from that minimal preparation, writing the text is straightforward.

Word processors

Word processors are the immediate and typical computer solution to the problem. They are sophisticated typewriters, as powerful as printing presses. Over time, word processors have become big programs with many options and features, trying to become the Swiss Army knife of text composition. Unfortunately, many people use them to make every decision about the document format by hand, without paying attention to the structure of the document itself. It’s very common to press Enter twice to separate paragraphs, and to use spaces or tabs to indent the first line of a paragraph or to align text at some position. Those documents lose all the advantages of being managed by a computer, because in reality they are "typewriter documents" that, on top of that, have styles everywhere that were applied by hand and can’t be changed easily.

It is true, however, that most word processors let you define paragraph styles (document title, section title, subsection title, normal paragraph, etc.) and write your document quickly. Every time you write something you indicate what it is, and somewhere else you define how that type of paragraph is to be presented. If you later want all section titles underlined, you change the style of section titles instead of going over every section title. The same goes for section numbering and automatic generation of Tables of Contents (TOCs), among others. Some word processors, like LyX, force you to work this way. For many people, that’s a positive thing. I’ve used LyX to write some very long documents and I must admit that it’s a pleasure to work with. LyX uses LaTeX in the background, but that’s another story.
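The idea behind paragraph styles fits in a few lines of Python. This is only an illustrative sketch, not how any real word processor stores its styles: the names STYLES, DOCUMENT and render are made up for the example, and HTML is used as the output simply because it is convenient to print.

```python
# Hypothetical style table: each style name maps to an HTML template.
STYLES = {
    "title": "<h1>{}</h1>",
    "section": "<h2>{}</h2>",
    "normal": "<p>{}</p>",
}

# The document records only what each piece of text is, never how it looks.
DOCUMENT = [
    ("title", "Creating structured documents"),
    ("section", "Typewriters"),
    ("normal", "They are simple devices."),
    ("section", "Word processors"),
    ("normal", "They are sophisticated typewriters."),
]


def render(document, styles):
    """Apply the style table to every (style name, text) pair."""
    return "\n".join(styles[name].format(text) for name, text in document)


print(render(DOCUMENT, STYLES))

# Underlining every section title is a single change to the style table;
# the document itself is not touched.
underlined = dict(STYLES, section="<h2><u>{}</u></h2>")
print(render(DOCUMENT, underlined))
```

The second call changes both section titles at once, which is exactly what applying formatting by hand can’t give you.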

This approach to the problem is relatively good. You barely lose any time thinking about the text format. Strictly speaking, you only write the text itself. The rest is usually configured with a handful of mouse clicks. However, word processors have traditionally suffered from other problems. For example, if you want to send the document to someone else, you usually need to be sure the other person has a word processor capable of opening that type of document (maybe the same word processor you used). Until not long ago, document files were stored on disk in proprietary, undocumented formats. This was especially true of Microsoft Word, the most widely used word processor. You couldn’t, and in many cases still can’t, operate on them automatically or export them to a format really suitable for the web. Still, as I mentioned before, when you used paragraph styles the problem was solved nicely, even though the resulting document wasn’t very flexible.

Nowadays word processors try to save their documents in an XML dialect so they can be translated easily to other formats and gain the flexibility they lacked in the past. Remember that word processors want to become the Swiss Army knife of text composition, and always feel the need to adapt and meet everybody’s needs.

Structured plain text

On the other hand, we have structured plain text formats like TeX (or LaTeX), HTML, XHTML or DocBook. These formats tried to do the right thing from the start, more or less. You create a plain text file with your text editor of choice. It holds the document contents and information about the document structure. Based on this structure, you give the computer some instructions (or none) about how the document should be formatted when it appears on the final presentation medium, be it paper or the computer screen.

In other words, they provide the same solution as a word processor using styles, with some differences. First, by their plain text nature, they have or had a bit more flexibility than their word processor counterparts. If you are going to write the plain text file yourself, it’s obvious that the internal file format is open and well known and, even more important, understandable for a human. In a worst case scenario, you could send such a document to someone who has no program to format or display it, and they could still read it. And, as the format is open, it’s much easier to find tools to render those documents on whatever platform you’re using, with varying degrees of accuracy. Everyone who has written a webpage knows the problems Internet Explorer has with Cascading Style Sheets (CSS) if you decide to create a sophisticated one.

Furthermore, many of these documents can be put on the web easily. HTML and XHTML are designed directly for the web, and DocBook can very easily be translated to them. There are also tools for LaTeX that allow you to transform it to HTML (latex2html being one). In the modern world, where everything can be done online and people are starting to avoid sheets of paper, every time a group of people discusses text composition somebody suggests avoiding word processors completely and writing your documents in (X)HTML instead. They would be structured, they could be easily printed if needed or transformed to other formats and, with CSS, you could easily change their appearance. These people definitely have a point. The problem is… how to say it… that writing (X)HTML and DocBook tags is completely annoying, among other minor problems. You need more than a plain text editor to create them easily if you don’t want to lose your patience writing tags. This is obviously a problem…

Formatted plain text

…because formatted plain text formats and applications wouldn’t exist otherwise. You probably use these applications more than once a day. They are present in forum software and wikis. Surrounding every paragraph with <p> or <para> makes you lose too much time, so some people have rightly thought: "What if we rescue the good things typewriters had, like using the Enter key to separate paragraphs? Can’t we infer the document structure from those formatting elements?" Furthermore, the people who separate paragraphs with the Enter key in their word processors would feel almost at home. Even with minor glitches, in my humble opinion, formatted plain text is probably one of the best ideas in text composition.

For example, much web forum software lets you break paragraphs by pressing the Enter key twice. Some of it translates your text to HTML using <br /> elements. Some surrounds each paragraph with the proper tags. The level of sophistication varies. That change alone lets you structure your document while writing far fewer tags. I’m using that at this moment, as the blog software surrounds my paragraphs with the proper tags automatically. In MediaWiki, the wiki software used by Wikipedia, the formatting goes further. Titles are surrounded by equals signs (=), which are way easier to type than heading tags. You could have:

= Section title =

Section paragraph.

== Subsection title ==

Subsection paragraph.
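A converter for this kind of markup can be quite small. The Python sketch below is only an illustration, not how MediaWiki or any forum software actually works: the function name to_html and the two rules it applies (blank lines separate blocks, and a block wrapped in matching runs of equals signs is a heading) are assumptions made for the example.

```python
import html
import re

# Assumed rule: a block like "== Title ==" is a heading whose level is the
# number of equals signs on each side.
HEADING = re.compile(r"^(=+)\s*(.+?)\s*\1$")


def to_html(text):
    """Convert blank-line-separated blocks into HTML headings and paragraphs."""
    out = []
    # One or more blank lines (pressing Enter twice) separate blocks.
    for block in re.split(r"\n\s*\n", text.strip()):
        # Join the lines of a block so a wrapped paragraph stays one paragraph.
        block = " ".join(line.strip() for line in block.splitlines())
        if not block:
            continue
        match = HEADING.match(block)
        if match:
            level = min(len(match.group(1)), 6)  # HTML only has h1 to h6
            out.append(f"<h{level}>{html.escape(match.group(2))}</h{level}>")
        else:
            out.append(f"<p>{html.escape(block)}</p>")
    return "\n".join(out)


sample = """= Section title =

Section paragraph.

== Subsection title ==

Subsection paragraph."""

print(to_html(sample))
```

Run on the sample above, it emits an h1 and an h2 heading with a p element after each, and nobody had to type a single tag.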

Lists can be written by starting each list item with an asterisk or a dash. But wikis and forum software have a problem: it’s not very easy to get the documents out of them. So the last big question is… is there any software that lets me use plain text formatting as an indicator of document structure, without the need for tags, and that stands on its own, so the result can then be translated to (X)HTML, DocBook or something else?

The answer is yes: AsciiDoc. I "discovered" AsciiDoc some years ago, but it was not until today that I decided to spend a few minutes reading the user guide (which explains the conventions used to format the plain text and how AsciiDoc interprets them). Shame on me, because it’s very easy to use. AsciiDoc turns out to be incredibly flexible and powerful, yet the default settings are more than enough for most users. I immediately converted a couple of webpages to AsciiDoc to see the results, and they are excellent. I think I’m going to keep using AsciiDoc whenever I can, because the XHTML output is flexible and customizable, and the DocBook output can be easily translated into chunked HTML or serious-looking PDF documents using FOP. Give AsciiDoc a try if you ever need to write an article, book or Unix manpage.
