Monday, February 18, 2013

Convert To HTML

This is part 2 of a multipart tutorial on How To Publish An Ebook: Convert to HTML.

You may not know this, but an ebook is a lot like a web page. You're looking at a web page right now, and what makes it work is a markup language called HTML. The people who keep track of internet standards can tell you the difference between HTML4, HTML5, XHTML, XML and several other similar data format standards. You probably don't care. I'll be inexact in my terminology: when I say HTML I may be playing fast-and-loose and really mean XHTML instead. Will this hurt anything?

I hope not. I'd rather you get a few niggling details wrong and get the overall concept right.

Let's suppose you've gone through the process of writing a book. And that book just happens to be in DOCX format. This is the file format that Microsoft Word uses. If you use another word processor, you can probably get it converted to DOC or DOCX format easily enough. Otherwise, ask and we'll work out that contingency.

There are a lot of ways to convert a DOCX file into HTML, and they all work fairly well. However, they tend to generate bloated HTML code. HTML lets you represent the same formatting in many different ways, and a Word document can contain all sorts of odd formatting that anyone might have put in for any reason. But that's not you, because you're writing an ebook.

However, file format translation programs don't know that, and they can add a lot of just-in-case code in their HTML translations. Maybe you like having all that bloat in your HTML files, but I like to be able to see what's going on when I inspect an HTML file. (Generally this happens after something goes wrong and I need to find out why.)

That's why I like a clean, lean, light-weight HTML translation that's relatively minimalist. (And if your ebook design is not minimalist, you're doing something wrong.)
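To make "lean" a little more concrete, here is a rough Python sketch of the idea. It is not how Rick's translator works (I haven't seen its internals), and the list of tags worth keeping is purely my assumption. It throws away every attribute and every tag not on a short list, so a paragraph wrapped in Word-style class and style clutter comes out as a plain paragraph.

```python
from html import escape
from html.parser import HTMLParser

# Tags a simple ebook probably needs. This whitelist is an assumption
# for illustration; adjust it to suit your own book.
KEEP_TAGS = {"p", "h1", "h2", "h3", "em", "strong", "i", "b", "blockquote", "br"}
# Elements whose text content should be dropped entirely.
SKIP_CONTENT = {"style", "script"}


class LeanHTML(HTMLParser):
    """Rebuild HTML keeping only whitelisted tags, with no attributes."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip = 0  # greater than zero while inside <style> or <script>

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_CONTENT:
            self.skip += 1
        elif tag in KEEP_TAGS:
            self.out.append(f"<{tag}>")  # attributes are deliberately dropped

    def handle_endtag(self, tag):
        if tag in SKIP_CONTENT:
            self.skip = max(0, self.skip - 1)
        elif tag in KEEP_TAGS and tag != "br":
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if self.skip == 0:
            # Re-escape &, < and > so the output is still valid HTML.
            self.out.append(escape(data, quote=False))


def strip_bloat(html_text):
    """Return html_text reduced to bare whitelisted tags and text."""
    parser = LeanHTML()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.out)
```

Feeding it something like `<p class="MsoNormal" style="margin-bottom:0in"><span style="font-family:Times">Hello <i>world</i></span></p>` gives back `<p>Hello <i>world</i></p>`. A real cleaner has to worry about lists, links, images and tables, which this sketch ignores.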

That's where Rick Boatright's translation comes in handy for me. Go here to see what I mean.

You'll see two boxes. One on the left and another on the right.

Go into Word. Hit Control-A to select everything. Then hit Control-C to copy everything.

Go into your browser and Rick's translation page. Click in the left box and hit Control-V to paste everything.

Select the checkboxes you want, then click the button marked "Clean up Word Text" and wait for a few minutes--depending on how long your document is. If it barfs, break up your ebook into chapters and try again, one chapter at a time.

When each piece of your ebook has been translated to HTML, click in the right box and hit Control-A to select everything, then Control-C to copy it. Paste the result into a Notepad file and save it with an .html extension in your project directory.

When done, you should be able to double-click on each HTML document to see what it looks like in your browser. Pictures can be a bit tricky. You may want to get some help to get pictures put in the right places. It's not hard to do, just hard to explain in a brief post like this one.

Now is a good time to look for badly translated symbols like smart quotes, copyright or trademark symbols, and other bits of noise that'll hurt the appearance of your ebook. It's best to find these errors as early in the workflow as possible to avoid rework.
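If eyeballing every page sounds tedious, a short script can point you at the likely trouble spots. What follows is a minimal Python sketch, not a complete checker: the list of suspect characters is my own guess at what tends to go wrong, and the function names are made up for this post. The second function looks for the "â€" sequence that typically shows up when UTF-8 text gets read as Windows-1252.

```python
import unicodedata

# Characters that often get mangled on the way out of Word: curly quotes,
# dashes, ellipsis, copyright, registered, trademark, and the Unicode
# replacement character. Illustrative, not exhaustive.
SUSPECTS = "\u2018\u2019\u201c\u201d\u2013\u2014\u2026\u00a9\u00ae\u2122\ufffd"

# "â€" is the usual prefix left behind when UTF-8 punctuation is decoded
# as Windows-1252.
MOJIBAKE_MARKER = "\u00e2\u20ac"


def find_suspect_chars(html_text):
    """Return (line_number, column, character_name) for each suspect character."""
    hits = []
    for line_no, line in enumerate(html_text.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if ch in SUSPECTS:
                hits.append((line_no, col, unicodedata.name(ch, "UNKNOWN")))
    return hits


def find_mojibake(html_text):
    """Return the line numbers that contain the mojibake marker."""
    return [
        line_no
        for line_no, line in enumerate(html_text.splitlines(), start=1)
        if MOJIBAKE_MARKER in line
    ]
```

To use it, read one of your saved HTML files as UTF-8 text, pass the text to both functions, and look at the reported lines in Notepad. A hit is not necessarily an error; a correctly encoded smart quote is fine. The point is only to tell you where to look.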

Do you have to use Rick Boatright's translator? No. Can you use other HTML translators? Probably. I'm only telling you what worked for me. And you might have some other way that works better. I certainly haven't cornered the market on truth. Can you avoid Word altogether and use another tool that generates clean HTML automatically? Idunno. Haven't tried.

Let me know if you have tried something different--like, say, Scrivener.

(You can find the bullet-point outline of How To Publish An Ebook here.)


  1. I've created a few epubs in eCub. Nice and clean: import plain text, then format it using the standard XML markup for epub and mobi.

    1. I love Sigil so much, but I'll give eCub a look next time I have to create an ebook.

