Reviewing hyphenation

From ConTeXt wiki

The Problem

Hyphenation is an issue for us because we use a 2-column layout with narrow columns. After a lot of tweaking of the hyphenation and interword spacing settings, the result looks pretty good: there are few cases of consecutive lines that end with hyphens, and few cases where a word is hyphenated across a right-hand page break. We have been fixing the few remaining cases manually, using \hbox{...} to prevent hyphenation at the trouble spot.

But hyphenation is by nature somewhat volatile, so whenever we change something we would like to be able to recheck it easily. Our book is over 1200 pages, so tools that make the checking more efficient would be very helpful.

Potential Solutions

PDF viewers

One tool we found was the "evince" PDF viewer on Linux, which highlights all search results at once. If you search for "-", it highlights every hyphen, which makes it easier to scan the PDF visually for hyphenation problems.

Still, this approach has its limitations: our layout domain experts don't have Linux machines, "evince" is not preinstalled, and the checking is still a manual task.

"evince" Windows binaries: http://live.gnome.org/Evince/Downloads (We are still looking into Okular, which is available for Windows at http://windows.kde.org)

A ConTeXt solution

Another approach we wondered about was having TeX highlight the hyphenations, e.g. by changing the background color to yellow or red when it outputs a word that is dynamically broken/hyphenated (much as we have TeX output red grid lines to help with debugging layout). We would also want to highlight static hyphens that occur at the end of a line, as in "Niger-
Congo," because they have a similar visual impact, possibly using a different color.

This would be an ideal solution, I think, but we don't know how to have TeX detect when a word gets dynamically hyphenated. (I made some inquiries on the NTG list to this effect. The response was that this would not be difficult to implement in MkIV, but that it could not be done in MkII. We are not free to move to MkIV at this time.)

Adobe Acrobat / Javascript

Another possibility is using JavaScript in Adobe Acrobat Pro to automatically find and highlight end-of-line (and end-of-page) hyphens. This is the approach with which we had the most success. The features and limitations are described below, followed by the JavaScript code.

Features

  • In Acrobat Pro, load a PDF and select "Highlight Hyphens" from the Tools menu to begin the highlighting. The first part of each word that is line-broken with a hyphen is highlighted.
  • The JavaScript console window shows progress.
  • The console reports the number of hyphens (actually, words line-broken with a hyphen) on each page.
  • The resulting PDF can be saved with the highlights included.
  • The saved, highlighted PDF can be viewed with highlights using Adobe Reader (does not require Acrobat Pro).

The resulting PDF looks something like this:

[Screenshot: Highlight-hyphens-scrshot.png]

Limitations

  • Slow. A representative test ran at 0.07 pages per second (14 seconds per page!). At that rate, our book would take about 5 hours.
  • The resulting PDF file grows by about 25%.
  • Sometimes the highlighting function stops with an error ("Internal error" / "General error") after about 30 pages. We don't know why, but it might be avoidable by processing only a limited number of pages at a time.
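
The last limitation suggests processing the book in batches of pages. Here is a minimal sketch of how the page ranges could be computed. The names pageBatches and batchSize are hypothetical and not part of the script below, and the batch size of 25 is just a guess chosen to stay under the roughly 30 pages at which the error appeared. We have not verified that batching actually avoids the error.

```javascript
// Split the zero-based page range [startPage, endPage) into batches of at
// most batchSize pages. Returns an array of {start, end} objects, where
// `end` is exclusive. (Hypothetical helper, for illustration only.)
function pageBatches(startPage, endPage, batchSize) {
    if (!(batchSize > 0)) {
        throw new Error('batchSize must be a positive number');
    }
    var batches = [];
    for (var s = startPage; s < endPage; s += batchSize) {
        batches.push({ start: s, end: Math.min(s + batchSize, endPage) });
    }
    return batches;
}

// Example: a 70-page document in batches of 25 pages gives the ranges
// 0-25, 25-50 and 50-70.
var exampleBatches = pageBatches(0, 70, 25);
```

To make use of this, highlightHyphens() would need to take a start and end page as arguments instead of always running to this.numPages. Each batch could then be run, and the PDF saved, before starting the next one.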

The code

The following two JavaScript files go in the Acrobat JavaScripts folder, e.g. C:\Program Files\Adobe\Acrobat 9.0\Acrobat\Javascripts. Acrobat must then be restarted.

add-hyphen-menu.js adds a menu item for "Highlight Hyphens..." on the Tools menu.

// Add a menu item for "Highlight Hyphens"

app.addMenuItem({
    cName: "Highlight Hyphens...",
    cParent: "Tools",
    cExec: "highlightHyphens()"
});
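
A possible variant, not part of the original script: app.addMenuItem also accepts a cEnable expression, which can be used to gray out the menu item when no document is open. The sketch below assumes that behavior is wanted. The variable name hyphenMenuItem and the typeof guard are our own additions, the latter only so that the file can be loaded outside Acrobat for checking.

```javascript
// Menu item description. Acrobat evaluates cEnable when the menu is shown;
// event.target is the active document, or null if none is open.
var hyphenMenuItem = {
    cName: "Highlight Hyphens...",
    cParent: "Tools",
    cExec: "highlightHyphens()",
    cEnable: "event.rc = (event.target != null);"
};

// `app` exists only inside Acrobat; the guard lets the file load elsewhere.
if (typeof app !== "undefined") {
    app.addMenuItem(hyphenMenuItem);
}
```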

findAndAnnot.js defines the function that finds line-broken words and highlights the first "quad" of each.

// Find and highlight all words that are line-broken with "-"
// Lars Huttar, lars_huttar@sil.org, 2009-02-04

function highlightHyphens() {
    console.show(); // show console, for debugging.
    console.println('entered highlightHyphens()');

    var word, numWords, q;
    var count = 0;
    var startTime = (new Date()).getTime();
    var startPage = 0; // may want to specify page range
    var hyphThisPage;

    for (var i = startPage; i < this.numPages; i++) {
        hyphThisPage = 0;
        numWords = this.getPageNumWords(i);
        for (var j = 0; j < numWords; j++) {
            word = this.getPageNthWord(i, j, false); // don't strip punctuation!
            // console.println('word ' + j + ': ' + word);
            // was: if (word.charAt(word.length - 1) == '-') {
            // Don't highlight single hyphens (which are common in headers/footers)
            //  and which tend to come in as "- ";
            // Also, words that are line-broken with hyphens (tend to?) come in as
            //  a single word with medial hyphen, e.g. "vari-eties", whereas those that
            //  have a non-breaking hyphen come in as multiple words, e.g. "day-" "to-" "day".
            //  So don't highlight words whose only hyphen is at the end.
            // We don't expect any word breaks with one character before the hyphen, e.g. "C-ontinent",
            //  but if there were any we'd want to know about it!
            if (word.length > 2 || (word.length > 1 && word.charAt(0) != '-')) {
                var ind = word.indexOf('-'); // declared with var to avoid creating an implicit global
                if (ind > 0 && ind < word.length - 1) {
                    hyphThisPage++;
                    console.println('page ' + (i+1) + '/' + this.numPages + ' word: ' + word);
                    q = this.getPageNthWordQuads(i, j);
                    // The following call throws an exception in Reader; we don't have the right to manipulate comments,
                    // unless such has been enabled on this PDF using LiveCycle.
                    this.addAnnot({page: i, type: "Highlight", quads: [q[0]] }); // highlight only the first quad
                }
            }
        }
        console.println('page ' + (i+1) + '/' + this.numPages + ': ' + hyphThisPage + ' hyphens');
    }
    var nPages = this.numPages - startPage;
    var endTime = (new Date()).getTime();
    console.println('Completed ' + nPages + ' pages in ' + ((endTime - startTime)/1000.0).toFixed(2) + ' sec. ('
        + ((nPages + 0.0) * 1000 / (endTime - startTime)).toFixed(2) + ' pages/sec)');
}
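
The test that decides whether a word counts as line-broken can be tried outside Acrobat. The sketch below pulls it out into a helper with the hypothetical name isLineBrokenWord; it is not part of the script above. It keeps only the inner check on the position of the first hyphen, since any word that passes that check has at least three characters and so also passes the outer length test.

```javascript
// Return true if `word` looks like a word that was broken across lines with
// a hyphen: its first hyphen is neither the first nor the last character.
// (Hypothetical helper, for illustration only.)
function isLineBrokenWord(word) {
    var ind = word.indexOf('-');
    return ind > 0 && ind < word.length - 1;
}

// Sample words as they might come back from getPageNthWord().
var samples = ['vari-eties', 'day-', '- ', '-', 'Congo', 'C-ontinent'];
var flagged = [];
for (var k = 0; k < samples.length; k++) {
    if (isLineBrokenWord(samples[k])) {
        flagged.push(samples[k]);
    }
}
// flagged now holds 'vari-eties' and 'C-ontinent'
```

Note that this rule depends on the observation in the code comments that words with a non-breaking hyphen tend to come in as separate words. A word such as "day-to-day" returned as a single word would be flagged as well.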