www.kurttrue.net - making epubs

What is an EPUB, anyway?

The EPUB standard is like the Texas Constitution. Nobody's ever read the whole thing, but you can get a pretty good idea of what's in there by observing how it's been applied in real world situations.

So, sure, before setting out to build an epub-generating application, I read most of the important stuff in the EPUB standard, but I mostly got the hang of things by dissecting epubs that I downloaded from Project Gutenberg.

If you ever want to dissect an epub yourself, there's not much to it. An epub is just a directory zipped into a single file with a .epub extension. And when I say "zipped" I mean zipped. I'm talking about standard zip compression. So you can just unzip an epub with any zip utility. You might have to change the file extension first, depending on how persnickety your zip utility is with regard to file extensions.
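
If you'd rather poke around from a script than from a zip utility, here is a minimal Python sketch. It isn't part of isgihgen. It builds a tiny stand-in archive in memory (the entry names are placeholders, not a real book) and then lists what's inside using the standard zipfile module.

```python
import io
import zipfile

def list_epub_entries(epub_file):
    """Return the names of every entry in an epub (which is just a zip archive)."""
    with zipfile.ZipFile(epub_file) as archive:
        return archive.namelist()

def read_mimetype(epub_file):
    """Return the contents of the archive's mimetype entry as text."""
    with zipfile.ZipFile(epub_file) as archive:
        return archive.read("mimetype").decode("ascii").strip()

# Build a tiny stand-in epub in memory so the sketch runs without a real book.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("mimetype", "application/epub+zip")
    archive.writestr("META-INF/container.xml", "<container/>")
    archive.writestr("OEBPS/content.opf", "<package/>")

print(list_epub_entries(buffer))
print(read_mimetype(buffer))
```

To try it on a real epub, pass a file path such as one downloaded from Project Gutenberg instead of the in-memory buffer.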

Your unzipped epub will look something like this:


  mimetype [a text file containing the string "application/epub+zip"]
  META-INF
    container.xml [the file that tells the epubreader where to find metadata]
  OEBPS
    content.opf [top level bibliographic data and a list of files in the zip, in xml format]
    toc.ncx [the epub's table of contents in xml format]
    CONTENTS
      cover.xhtml
      contentfilea.xhtml
      contentfileb.xhtml
      contentfilec.xhtml
      style.css
      coverimg.jpeg
      anotherimg.jpeg
      onemoreimg.jpeg

Once I started looking at unzipped epubs (not the directory structure, but the actual file contents), I noticed that epubs typically contain a lot of redundant data. Or, to use a term that you hear a lot in my line of work, I felt as if I were looking at many views of the same data.

For example, if you were to unzip the epub version of Moll Flanders, you would find the title (which is not really Moll Flanders, but we'll get to that later) in the file manifest (typically named content.opf), and in the table of contents file (typically called toc.ncx), and on the book's first page, which you might find in a document called chapter1.html or contentFileA.xhtml or ax532sm@wbp*xyz.htm.

The epub standard isn't too persnickety about what you call your content files, or what you put in them, as long as they're formatted as xhtml. (Your xhtml might have href attributes referring to image, video or sound files.)

So what is an epub then? It's a zipped file that contains xhtml and other content (represented in the example by the /OEBPS/CONTENTS directory), and some metadata files that tell the ereader how to present and navigate the content.
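
Going the other direction is just as short. One detail worth knowing: the EPUB container format expects the mimetype entry to come first in the archive and to be stored without compression. The Python sketch below shows one way to honor that. The function name and the sample file contents are made up for illustration, and this is not a description of how isgihgen.jar does its zipping.

```python
import io
import zipfile

def write_epub(target, files):
    """Write {archive_name: data} to an epub, putting the mimetype entry first."""
    with zipfile.ZipFile(target, "w") as archive:
        # The mimetype entry goes in first and is stored, not compressed.
        archive.writestr("mimetype", "application/epub+zip",
                         compress_type=zipfile.ZIP_STORED)
        # Everything else can use ordinary deflate compression.
        for name, data in files.items():
            archive.writestr(name, data, compress_type=zipfile.ZIP_DEFLATED)

# Placeholder contents standing in for real metadata and content files.
buffer = io.BytesIO()
write_epub(buffer, {
    "META-INF/container.xml": "<container/>",
    "OEBPS/content.opf": "<package/>",
})

with zipfile.ZipFile(buffer) as archive:
    first_entry = archive.infolist()[0]
    entry_names = archive.namelist()

print(first_entry.filename, first_entry.compress_type == zipfile.ZIP_STORED)
```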

As I see it, the problem that an epub generating application needs to solve is the creation of the metadata. Right? Because the content presumably already exists in some form that can be marked up in a way that makes sense to the application.

All sorts of high level bibliographic data that the application needs (title, author, date of publication, ISBN) can be parsed out of those content files, right? Because the contents are all formatted as xhtml.

To use the most obvious example, if I want the application to be able to find my title, I can mark up my title like this:

 <div class="titlestyle" title="bib=dc:title">The Fortunes and Misfortunes of the Famous Moll Flanders</div>

Then, when the application encounters the attribute title="bib=dc:title", it knows that the value of the element with that attribute is the epub's title.
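
To make that lookup concrete, here is a minimal Python sketch of the idea. It is an illustration only, not the code inside the application: the function name find_bib_value is hypothetical, and it assumes the content file is well-formed xhtml that the standard library's ElementTree can parse.

```python
import xml.etree.ElementTree as ET

def find_bib_value(xhtml_text, bib_name):
    """Return the text of the first element whose title attribute holds bib=<bib_name>."""
    root = ET.fromstring(xhtml_text)
    wanted = "bib=" + bib_name
    for element in root.iter():
        # Split on whitespace so the match is against a whole token.
        if wanted in element.get("title", "").split():
            return "".join(element.itertext()).strip()
    return None

sample = """<html xmlns="http://www.w3.org/1999/xhtml"><body>
<div class="titlestyle" title="bib=dc:title">The Fortunes and Misfortunes of the Famous Moll Flanders</div>
</body></html>"""

print(find_bib_value(sample, "dc:title"))
```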

And if the application can find high level bibliographic data in the content files, does that mean it can find table of contents data in there too? And could it build table of contents data "on the fly" by parsing the content file or files? Well, sure it can. Because the table of contents (the file called toc.ncx, in our example) mostly consists of a lot of references to id attributes inside the contents.

So, for example, if the preface of the Moll Flanders epub begins in the file ./CONTENTS/contentfilea.xhtml with a heading that looks like this:

 <h4 id="preface"><b>THE AUTHOR'S PREFACE</b></h4>

…then my application can harvest the data in that heading and create an entry in the epub's table of contents file that looks like this:

    <navPoint playOrder="2" id="np-2">
       <navLabel>
	     <text>THE AUTHOR'S PREFACE</text>
       </navLabel>
       <content src="CONTENTS/contentfilea.xhtml#preface"/>
    </navPoint>
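
Here is a rough Python sketch of that harvesting step, under a couple of assumptions: the heading has already been parsed into an element, and the caller supplies the content file's path and the play order. The function name make_nav_point is invented for this example, and the output omits the NCX namespace for brevity.

```python
import xml.etree.ElementTree as ET

def make_nav_point(heading, content_path, play_order):
    """Build an NCX navPoint element from an xhtml heading element."""
    nav_point = ET.Element(
        "navPoint", {"playOrder": str(play_order), "id": "np-%d" % play_order}
    )
    nav_label = ET.SubElement(nav_point, "navLabel")
    text = ET.SubElement(nav_label, "text")
    # itertext() gathers the heading text even when it is wrapped in <b>.
    text.text = "".join(heading.itertext()).strip()
    # The link target is the content file plus the heading's id as a fragment.
    ET.SubElement(
        nav_point, "content", {"src": "%s#%s" % (content_path, heading.get("id"))}
    )
    return nav_point

heading = ET.fromstring('<h4 id="preface"><b>THE AUTHOR\'S PREFACE</b></h4>')
nav_point = make_nav_point(heading, "CONTENTS/contentfilea.xhtml", 2)
print(ET.tostring(nav_point, encoding="unicode"))
```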

Well, that makes things easy, right? I can just make my application look for id attributes that consist of "preface" or "introduction" or begin with the text string "chapter." Hmmm… but what if I need to mark up the epub version of King Lear? Then I guess my application would have to look for id attributes that start with "act" and "scene." Oh, but what if I'm marking up the epub version of Virgil's Aeneid? Or the California Vehicle Code? Or the speeches of Susan B. Anthony? Or your grandmother's recipe book?

Hmmm… well, I suppose the application could look for id attributes that begin with "stanza" or "section" or "amendment" or "speech" or "soupsandstews," and it could employ a controlled vocabulary that would require all content to be organized in such a way as to conform to a certain fixed number of publication types corresponding to particular table of contents templates.

It could do that, but that would be really dumb. So, instead of relying on the id attribute to find out what belongs in the table of contents, the application looks for a key-value pair tucked inside of a title attribute. So, for example, the heading for the preface of Moll Flanders would look like this:

 <h4 title="navcategory=chapter" id="preface">THE AUTHOR'S PREFACE</h4>

So now the value of id can be any text string that uniquely identifies the id attribute's parent element, and the key-value pair in the title attribute tells isgihgen.jar that this heading represents an entry in the table of contents (a "navPoint" in epub parlance), and that the level of the navPoint is "chapter."

So far so good, right? But what if I want the book title to appear in the table of contents (a pretty standard practice in the publishing world), and I want that same title to appear in my epub's top-level metadata? What that means in practical terms is that I want the title to appear in two different files: the table of contents file (./OEBPS/toc.ncx, in our example) and the file that the epub community usually refers to as the "manifest" (./OEBPS/content.opf, in our example).

Well then, if I want the title to appear in both those places, then I can mark it up this way:

 <div class="titlestyle" title="navcategory=book bib=dc:title" id="booktitle">The Fortunes and Misfortunes of the Famous Moll Flanders</div>

Now, you probably notice that the title attribute has two key-value pairs, separated by some white space. The first key-value pair navcategory=book means that the text inside the title attribute's parent element (in this case a "div") needs to go into the table of contents, and that the text belongs to the category "book." (More on categories later.)

The title attribute's second key-value pair bib=dc:title probably looks familiar. That's the pair that tells isgihgen.jar that the text inside the div element is the canonical title of the epub. ("Canonical" is just a fancy way of saying "The real thing.")
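
Splitting a title attribute into its key-value pairs takes only a few lines. The Python sketch below shows one way to do it, assuming pairs are separated by whitespace and each key is separated from its value by the first equals sign. The function name is hypothetical, and this is not isgihgen's actual parser.

```python
def parse_title_attribute(title_value):
    """Split a title attribute into a dict of key-value pairs.

    Pairs are separated by whitespace; each key ends at the first "=".
    """
    pairs = {}
    for token in title_value.split():
        key, separator, value = token.partition("=")
        if separator:  # skip tokens that contain no "=" at all
            pairs[key] = value
    return pairs

print(parse_title_attribute("navcategory=book bib=dc:title"))
```

One consequence of splitting on whitespace is that a value can't contain a space, which is worth keeping in mind when choosing category names.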

So what exactly is the application going to do with that canonical title? When isgihgen.jar sees "bib=dc:title" in that div element, it's going to take that canonical title and use it to create an entry in the file manifest (./OEBPS/content.opf) that will look like this:

 <metadata>
     …
   <dc:title>The Fortunes and Misfortunes of the Famous Moll Flanders</dc:title>
     …
 </metadata>

As you can see, that entry goes in the metadata element of the manifest. The metadata element supplies the ereader with all sorts of important information, like title, author, ISBN, publisher, language, and date of publication.

Any data in your content files that needs to go into the metadata element can be marked up with a bib= string within a title attribute. So, for example, you can mark up your author's name this way:

 <p title="bib=dc:creator">Daniel Defoe</p>

And then isgihgen.jar will add an entry to your metadata element that looks like this:

 <dc:creator>Daniel Defoe</dc:creator>

But what if my entry in the metadata element needs to include attributes? For example, what if I want isgihgen.jar to pick up my ISBN from my content file and turn it into a metadata entry that looks like this:

 <dc:identifier id="uuid_id" opf:scheme="uuid">123456789</dc:identifier>

Well, then you just add some more key-value pairs. The keys all start with the string bib. Here's what that looks like:

 <div class="isbn" title="bib=dc:identifier bib.id=uuid_id bib.opf:scheme=uuid">123456789</div>
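
As a sketch of how those extra keys could map onto attributes, the Python below treats the bib pair as the element name and every bib.something pair as an attribute. It is only an illustration: the function name is made up, and the dc: and opf: prefixes are kept as plain strings rather than bound to real XML namespaces, which a production tool would need to do.

```python
import xml.etree.ElementTree as ET

def make_metadata_entry(title_value, text):
    """Turn bib= and bib.<attribute>= pairs into a single metadata element."""
    pairs = dict(
        token.split("=", 1) for token in title_value.split() if "=" in token
    )
    # The plain "bib" key names the element, for example dc:identifier.
    entry = ET.Element(pairs["bib"])
    entry.text = text
    for key, value in pairs.items():
        if key.startswith("bib."):
            # Everything after "bib." becomes the attribute name.
            entry.set(key[len("bib."):], value)
    return entry

entry = make_metadata_entry(
    "bib=dc:identifier bib.id=uuid_id bib.opf:scheme=uuid", "123456789"
)
print(ET.tostring(entry, encoding="unicode"))
```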

An alternate method for populating the metadata element is to add a meta element to your content's head element. For example, this meta element in a content file…

 <head>
            …
      <meta name="dc:subject" content="London (England) -- Fiction"/>
            …
 </head>

… turns into this content.opf element.

 <metadata>
             …
      <dc:subject>London (England) -- Fiction</dc:subject>
             …
 </metadata>

Putting a lot of meta elements in your content file's head element might not be the most convenient method for adding dc:* elements to your file manifest (for reasons we'll see later), but you'll probably find it's a handy way to mark up your book's cover.

An epub's cover is typically just an xhtml file that contains a reference to an image file. In our example, the cover is represented by a file called cover.xhtml.

Your epub's file manifest needs to know which one of your files represents your cover, and it needs to know where your cover image is, but you don't need to worry about all that. You just need to add a couple of meta elements to your cover file's head element that look like this:

 <meta name="type" content="cover"/>
 <meta name="id" content="[unique_id_for_your_cover]"/>

So the cover file for our Moll Flanders epub, marked up for isgihgen.jar, might look like this:

 <?xml version="1.0"?>

 <html xmlns="http://www.w3.org/1999/xhtml">
   <head>
     <title>Cover</title>
     <meta name="type" content="cover"/>
     <meta name="id" content="owl-cover"/>
     <link id="stylelink" rel="stylesheet" type="text/css" href="style.css"/>
   </head>
   <body>
     <div class="cover">
       <img src="bigcover.jpg" alt="Cover" />
     </div>
   </body>
 </html>
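
For the curious, here is a small Python sketch of how a program might recognize a cover file from those two meta elements and pick up the image path along the way. The function name describe_cover is hypothetical, and the logic is a guess at the general approach rather than a copy of what isgihgen.jar does.

```python
import xml.etree.ElementTree as ET

XHTML = "{http://www.w3.org/1999/xhtml}"

def describe_cover(xhtml_text):
    """Return the cover id and image path, or None if the file isn't marked as a cover."""
    root = ET.fromstring(xhtml_text)
    # Collect every meta element as a name -> content mapping.
    meta = {m.get("name"): m.get("content") for m in root.iter(XHTML + "meta")}
    if meta.get("type") != "cover":
        return None
    image = root.find(".//" + XHTML + "img")
    return {
        "id": meta.get("id"),
        "image": image.get("src") if image is not None else None,
    }

cover = """<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <title>Cover</title>
    <meta name="type" content="cover"/>
    <meta name="id" content="owl-cover"/>
  </head>
  <body>
    <div class="cover"><img src="bigcover.jpg" alt="Cover"/></div>
  </body>
</html>"""

print(describe_cover(cover))
```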

And if the content of Moll Flanders begins in contentfilea.xhtml, the top of contentfilea.xhtml would look something like this:

  <?xml version='1.0' encoding='utf-8'?>
     <html xmlns="http://www.w3.org/1999/xhtml">
         <head>
               …
               <meta name="dc:subject" content="London (England) -- Fiction"/>
               …
         </head>
             <body>

             <div class="titlestyle" title="navcategory=book bib=dc:title" id="booktitle">The Fortunes and Misfortunes of the Famous Moll Flanders</div>

             <div class="fancystyle">Who was Born in Newgate, and during a Life of continu'd Variety for Threescore Years, besides her Childhood, was Twelve Year a Whore, five times a Wife (whereof once to her own Brother), Twelve Year a Thief, Eight Year a Transported Felon in Virginia, at last grew Rich, liv'd Honest, and dies a Penitent. Written from her own Memorandums…</div>

             <p>
                 <b>by</b>
             </p>

             <p title="bib=dc:creator">Daniel Defoe</p>

             <div class="heading">ISBN: </div>
             <div class="isbn" title="bib=dc:identifier bib.id=uuid_id bib.opf:scheme=uuid">123456789</div>

             <h4 title="navcategory=chapter" id="preface"><b>THE AUTHOR'S PREFACE</b></h4>

             <p>The world is so taken up of late with novels and romances…

Great. So now we know how to mark up our content so that isgihgen.jar knows what to do with it. But how does isgihgen.jar know where to find the content? Or where to save the epub? Or what kind of structure to apply to the epub's table of contents? Or, who knows? What xml namespace to apply to content.opf?

Yes, there's some important stuff isgihgen.jar needs to know, and all that important stuff goes in a simple xml document called the ISGIH file. (ISGIH stands for "Important stuff goes in here.")

Below are the contents of sample_a.xml, a sample ISGIH file available at the kurttrue/isgihgen github repository.

 <?xml version='1.0' encoding='UTF-8'?>

 <epub>

	<!-- opf element represents your epub's content.opf file -->
	<opf>

        	<!-- package is the root element of content.opf -->

        	<package xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:dcterms="http://purl.org/dc/terms/" xmlns:opf="http://www.idpf.org/2007/opf" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://www.idpf.org/2007/opf" version="2.0" unique-identifier="id">
        	   <metadata>
        	      <dc:publisher>www.kurttrue.net</dc:publisher>
                      <dc:language xsi:type="dcterms:RFC4646">en</dc:language>
                      <dc:date opf:event="publication">2017-05-08</dc:date>
        	   </metadata>
        	</package>

	</opf>

	<!--
	     ncx is the root element of your epub's toc.ncx file.
	     docTitle element can appear in the ncx element, or isgihgen can parse it from your content file.
	-->

        <ncx xmlns="http://www.daisy.org/z3986/2005/ncx/" version="2005-1" xml:lang="en">

   		<head>
   		   <meta  name="dtb:uid" content=""/>
   		   <meta  name="dtb:depth" content="2"/>
   		   <meta  name="dtb:totalPageCount" content="0"/>
   		   <meta  name="dtb:maxPageNumber" content="0"/>
   		</head>

        </ncx>

        <!--

             container is the root element of your epub's ./META-INF/container.xml file.
             isgihgen will add a reference to content.opf to this element.

        -->

        <container xmlns="urn:oasis:names:tc:opendocument:xmlns:container" version="1.0"/>

        <!--
              isgihgen wants to know how your content is organized.
              That information goes in the hierarchy element.

        -->

	<hierarchy>


	      <book>
	         <chapter>
		     <entry/>
	         </chapter>
	      </book>


	</hierarchy>


	<!--

	   References to your content paths can be relative or absolute
	   and can include the asterisk as a wildcard.
	   See ./input/sample_c.xml for a path value that includes a wildcard.


        -->

	<content>
	    <paths>
	       <!--
	           The attribute type="cover" here can indicate that the file represents the cover,
	           or isgihgen can determine the cover from a meta tag in the cover's head element.
	           See cover.html's head element for a sample of this meta tag (name="type" content="cover").
	           If you use ./input/sample_b.xml as your input file, isgihgen will
	       -->
	       <path type="cover" id="test-cover">cover.xhtml</path>
	       <path>owl.xhtml</path>
	    </paths>
	</content>

	<!--

	     The output element tells isgihgen where to put your output.

	-->

	<output>
	   <paths>
	        <!--
	           isgihgen will create two directories under root, docs and epub.
	           Directory docs contains the prezipped epub content.
	           Directory epub contains the zipped epub.
	        -->
	        <root>../output/owlandpussycat</root>
	        <oebps>OEBPS</oebps>
	        <meta-inf>META-INF</meta-inf>
	        <!-- Content files will output to the subdirectory designated in the text element. -->
	        <text>text</text>
	   </paths>

	   <!-- the name of your epub. -->
	   <name>owlandpussycat.epub</name>

	   <!-- yes here means that isgihgen will delete your previous output. -->
	   <delete>yes</delete>

	</output>

 </epub>

The comments in sample_a.xml (the elements that begin with "<!--") give you an idea of what kind of data needs to go in the ISGIH document.

The package element

We've already seen the package element. package is an ancestor of our dc:creator and dc:title elements, and it corresponds to the root element of our file manifest (./OEBPS/content.opf). That means that any element that you add to the package element in your ISGIH document will be output to your file manifest.

Remember when I said that you could populate the package->metadata element by adding meta elements to your content file's head element, but that this might not be the most convenient method? You will probably find it more convenient to add those elements directly to the package element in your ISGIH file. In sample_a.xml, you can see an example of what I was talking about: dc:publisher, dc:language, and dc:date have all been added directly to the ISGIH document, and isgihgen.jar will output them to the file manifest.

Now if you know something about the EPUB standard, you know that the file manifest also needs to include the paths to the epub's content files. That's not something you need to add manually. The isgihgen classes take care of that for you. You just need to include the location of your *.xhtml content in your ISGIH file at epub->content->paths.

The ncx element

The ncx element in our ISGIH document corresponds to the root element of our epub's table of contents file (./OEBPS/toc.ncx). We really don't have to do much of anything to our ncx element. Just make sure it has the xmlns, version and xml:lang attributes as seen in sample_a.xml, and a head element like the one in sample_a.xml, with four meta tags that most ereaders ignore completely. The one meta tag you might want to pay attention to is dtb:depth, which refers to the number of levels in your table of contents. For example, a table of contents that refers to Parts broken up into Chapters would have a dtb:depth of 2.

The container element

The container element should look just as it does in the example. This is what becomes the content of the epub's META-INF/container.xml file.
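
For reference, a finished container.xml holds a rootfiles element with one rootfile entry pointing at the file manifest. The Python sketch below builds that structure with the standard library. The function name is invented, the OEBPS/content.opf path simply follows this article's example layout, and none of this is taken from isgihgen's source.

```python
import xml.etree.ElementTree as ET

CONTAINER_NS = "urn:oasis:names:tc:opendocument:xmlns:container"

# Serialize the container namespace as the default (unprefixed) namespace.
ET.register_namespace("", CONTAINER_NS)

def build_container(opf_path):
    """Build the container element with a rootfile reference to the manifest."""
    container = ET.Element("{%s}container" % CONTAINER_NS, {"version": "1.0"})
    rootfiles = ET.SubElement(container, "{%s}rootfiles" % CONTAINER_NS)
    ET.SubElement(
        rootfiles,
        "{%s}rootfile" % CONTAINER_NS,
        {"full-path": opf_path, "media-type": "application/oebps-package+xml"},
    )
    return container

container = build_container("OEBPS/content.opf")
print(ET.tostring(container, encoding="unicode"))
```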

The hierarchy element

The hierarchy element tells isgihgen about your table of contents levels. Remember when we marked up our sample content file with key-value pairs that began with navcategory? And one of those key-value pairs was navcategory=book and one was navcategory=chapter? Well, the hierarchy tree tells isgihgen what to do with those navcategory values and how the child-parent relationships work. Each element under hierarchy corresponds to the value of a navcategory key-value pair.
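
One plausible way to read that tree is as a map from each navcategory value to its nesting level. The Python sketch below does just that for the hierarchy in sample_a.xml. It is an interpretation for illustration, with a made-up function name, and isgihgen may well represent the relationships differently.

```python
import xml.etree.ElementTree as ET

def category_levels(hierarchy_xml):
    """Map each element name under hierarchy to its nesting level (1 = top)."""
    levels = {}

    def walk(element, level):
        for child in element:
            levels[child.tag] = level
            walk(child, level + 1)

    walk(ET.fromstring(hierarchy_xml), 1)
    return levels

sample = "<hierarchy><book><chapter><entry/></chapter></book></hierarchy>"
print(category_levels(sample))
```

With a map like this, a heading marked navcategory=chapter would nest under the most recent heading marked navcategory=book.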

The content element

The content element tells isgihgen.jar where to find your content. Paths to your content can be relative or absolute, and can include the asterisk as a wildcard. isgihgen.jar treats relative paths as relative to the directory that contains your ISGIH file.
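
Here is a minimal Python sketch of that path resolution, using the standard glob module for the wildcard. The function name and the file names are placeholders, and this is an approximation of the behavior described above rather than isgihgen's own logic.

```python
import glob
import os
import tempfile

def resolve_content_paths(isgih_file, path_values):
    """Expand each path value (wildcards allowed) relative to the ISGIH file's directory."""
    base_dir = os.path.dirname(os.path.abspath(isgih_file))
    resolved = []
    for value in path_values:
        # Absolute paths are used as-is; relative ones hang off the ISGIH directory.
        pattern = value if os.path.isabs(value) else os.path.join(base_dir, value)
        resolved.extend(sorted(glob.glob(pattern)))
    return resolved

# Demonstrate with a throwaway directory holding three empty content files.
with tempfile.TemporaryDirectory() as folder:
    for name in ("cover.xhtml", "chapter1.xhtml", "chapter2.xhtml"):
        open(os.path.join(folder, name), "w").close()
    isgih_file = os.path.join(folder, "sample.xml")
    found = resolve_content_paths(isgih_file, ["cover.xhtml", "chapter*.xhtml"])
    names = [os.path.basename(path) for path in found]

print(names)
```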

As you can see in sample_a.xml, you can use a type attribute and an id attribute to indicate which file is the cover (and what the cover's id value should be), but you only have to do that if you haven't provided those values in meta elements in your cover file (as seen in the sample cover file above).

Your content element doesn't need to include references to your css or image files (or your sound, video, or javascript files, if you're making one of those fancy interactive epubs). isgihgen will follow the path in your href attribute to find those files.

The output element

The output element is the part of your ISGIH document that tells isgihgen.jar where to put (and how to organize) your output.

output->paths->root is the directory where you want your output to go. This value can be expressed as an absolute or relative path. As with the content element, isgihgen.jar treats relative paths as relative to the directory that contains your ISGIH file.

output->paths->oebps and output->paths->meta-inf tell isgihgen.jar what to call your OEBPS and META-INF directories. You probably want to call them OEBPS and META-INF, but you can always change those values if you want.

output->paths->text tells isgihgen.jar the name of a subdirectory in which to put your content files. That subdirectory, in our example, would go under OEBPS. Or you can just omit the text element, and isgihgen.jar will deposit your content files in the same directory as the table of contents and file manifest.

output->name is the name of the epub.

output->delete is the element that tells isgihgen.jar whether to clean the output destination prior to writing new output. yes here means isgihgen.jar should recursively delete all output in the directory specified at output->paths->root before outputting new content. You might find that it's helpful to set this value to "yes" if you change the names of content files, or remove files from your input location. That way stale content doesn't accumulate in your output destination.

Once you have your content and ISGIH files in their final form, you can invoke the generator on the command line this way:

  java -jar [path_to_isgihgen-version.jar] input=[path_to_your_ISGIH_file]

So, for example, if isgihgen.jar (version 1.0) is in ~/lib/isgihgen10.jar and your input file is in ~/input/epubs/owlandpussycat.xml, you could navigate to your home directory and type this command on the command line:

  java -jar ./lib/isgihgen10.jar input=./input/epubs/owlandpussycat.xml

isgihgen.jar will create two directories (docs and epub) in the output path specified in your ISGIH document (at epub->output->paths->root).

So, for example, if your ISGIH document designates the output path as ~/output/epubs/owlandpussycat, then isgihgen.jar will create an unzipped version of your epub at ~/output/epubs/owlandpussycat/docs and a zipped version of your epub (the version that you can load into an ereader) at ~/output/epubs/owlandpussycat/epub.
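
The output layout and the delete setting can be pictured with a short Python sketch. It creates the docs and epub directories under a root and, when delete is "yes", clears the root first. The function name and the stale file are invented for the demonstration, and it only mimics the behavior described here.

```python
import os
import shutil
import tempfile

def prepare_output(root, delete):
    """Create root/docs and root/epub, optionally clearing root first."""
    if delete == "yes" and os.path.isdir(root):
        shutil.rmtree(root)  # recursively remove any stale output
    docs_dir = os.path.join(root, "docs")
    epub_dir = os.path.join(root, "epub")
    os.makedirs(docs_dir, exist_ok=True)
    os.makedirs(epub_dir, exist_ok=True)
    return docs_dir, epub_dir

# Demonstrate in a scratch directory that already holds one stale file.
with tempfile.TemporaryDirectory() as scratch:
    root = os.path.join(scratch, "owlandpussycat")
    os.makedirs(root)
    stale_file = os.path.join(root, "stale.xhtml")
    open(stale_file, "w").close()
    docs_dir, epub_dir = prepare_output(root, delete="yes")
    stale_survived = os.path.exists(stale_file)
    created = sorted(os.listdir(root))

print(created, stale_survived)
```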

Debugging your epub

isgihgen.jar doesn't do a lot of debugging or syntax checking. For that, you probably want to use a wonderful command line tool called EpubCheck, which is available on github at this url:
https://github.com/IDPF/epubcheck/releases

I've used EpubCheck many times. It's a very handy tool, and if you have any errors in your content or metadata, it will tell you the exact filename and line and column numbers where the error occurs. I'd definitely recommend running your finished epub past EpubCheck before trying to load it into an ereader.

FAQs

How do you pronounce "isgih"?

I say it IZZ-gee (with a hard "g"), but I'm open to alternative pronunciations.

OK, then, how do you pronounce "isgihgen"?

Well, the "gen" is short for "generator," so I would recommend IZZ-gee-jen. (I started out calling the isgihgen project Generator, but that seemed a little too unspecific.)

What version of Java did you use to compile and test isgihgen?

Java version 1.7.0_151.

What version of the EPUB Standard does isgihgen support?

The sample input at the isgihgen github repository complies with EPUB 3.0.

What are isgihgen's dependencies?

Java 1.7 or higher. isgihgen doesn't need any external libraries.

What do I need to know to use isgihgen?

Not a whole lot. You need to know how to run java on the command line, and you should have some knowledge of XML and HTML, and some acquaintance with the EPUB standard. If you're new to the EPUB standard, the sample ISGIH files (sample_a.xml, sample_b.xml, sample_c.xml) available at the isgihgen github repository can get you started with learning the fundamentals.

How do I obtain isgihgen.jar?

Follow this link: download isgihgen.jar

I went to high school in Spring Valley (next door to Houston). In those days, we all had to take Civics.

Making EPUBs with isgihgen.jar