doc/bookloupe.txt
author ali <ali@juiblex.co.uk>
Sat Jul 20 11:17:23 2013 +0100 (2013-07-20)
changeset 77 9edfe77d747d
Update documentation for 1.91
     1 
     2 
     3                             Bookloupe documentation
     4 
     5 
     6 bookloupe: lists possible common formatting errors in a Project
     7 Gutenberg candidate file. Bookloupe is based on gutcheck, written
     8 by Jim Tinsley. It is a command line program and can be used under
     9 Microsoft Windows, Mac or Unix. For Windows-only people, there is
    10 an appendix at the end with brief instructions for running it.
    11 
    12 Current version: 1.91, a beta version leading up to version 2.0
    13 
    14 This software is Copyright Jim Tinsley 2000-2005 and
    15 J. Ali Harlow 2012 onwards.
    16 
    17 Bookloupe comes with ABSOLUTELY NO WARRANTY. For details, read the file COPYING.
    18 This is Free Software; you may redistribute it under certain conditions (GPL).
    19 
    20 See http://www.juiblex.co.uk/pgdp/bookloupe/ for the latest version.
    21 
    22 
    23 Usage is: bookloupe [-setopxlywm] filename
    24       where:
    25       -s checks Single quotes 
    26       -e switches off Echoing of lines 
    27       -t checks Typos
    28       -o produces an Overview only
    29       -p sets strict quotes checking for Paragraphs
    30       -x (paranoid) switches OFF typo checking and extra checks
    31       -l turns off Line-end checks
    32       -y sets error messages to stdout
    33       -w is a special mode for web uploads (for future use)
    34       -v (verbose) forces individual reporting of minor problems
    35       -m interprets Markup of some common HTML tags and entities    
    36       -u warns about words in a user-defined typo file (bookloupe.typ or gutcheck.typ)
    37       -d ignores some DP-specific markup
    38 
    39 Running bookloupe without any parameters will display a brief help message.
    40 
    41 Sample usage: 
    42 
    43     bookloupe warpeace.txt
    44 
    45 
    46 More detail:
    47 
    48     Character encoding
    49 
    50       Bookloupe will handle e-texts encoded in UTF-8 (preferred),
    51       ISO-8859-1 (also known as Latin-1), or WINDOWS-1252 (also known,
    52       incorrectly, as ansi). The output will be in the same encoding
    53       as the input e-text.
    54 
    55     Echoing lines (-e to switch off)
    56 
    57       You may find it convenient, when reviewing Bookloupe's
    58       suggestions, to see the line that Bookloupe is questioning.
    59       That way, you can often see at a glance whether it is
    60       a real error that needs to be fixed, or a false positive:
    61       text that is correct as it stands, but that Bookloupe's
    62       limited programming doesn't understand.
    63 
    64       By default, bookloupe echoes these lines, but if you don't 
    65       want to see the lines referred to, -e will switch it OFF.
    66 
    67 
    68     Quotes (-s and -p switches)
    69 
    70       Bookloupe always looks for unbalanced doublequotes in a 
    71       paragraph. It is a common convention for writers not to
    72       close quotes in a paragraph if the next paragraph opens
    73       with quotes and is a continuation by the same speaker.
    74 
    75       Bookloupe therefore does not normally report unclosed quotes 
    76       if the next paragraph begins with a quote. If you need
    77       to see all unclosed quotes, even where the next paragraph
    78       begins with a quote, you should use the -p switch.
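
As a rough illustration only (this is not bookloupe's actual code), the
paragraph rule described above can be sketched in Python. The function
name and the "strict" parameter are made up for this example, and the
sketch assumes paragraphs are separated by a single blank line.

```python
def unclosed_quote_paragraphs(text, strict=False):
    """Return 1-based numbers of paragraphs with an odd count of doublequotes.

    strict=True imitates the idea behind -p: report the paragraph even
    when the next paragraph opens with a quote.
    """
    paragraphs = [p for p in text.split("\n\n") if p.strip()]
    flagged = []
    for i, para in enumerate(paragraphs):
        if para.count('"') % 2 == 0:
            continue  # quotes balance within this paragraph
        next_opens_with_quote = (
            i + 1 < len(paragraphs)
            and paragraphs[i + 1].lstrip().startswith('"')
        )
        if strict or not next_opens_with_quote:
            flagged.append(i + 1)
    return flagged


sample = (
    '"I went to town," he said. "It rained all day.\n\n'
    '"Then I came home."\n\n'
    'She said, "Really?\n\n'
    'Nobody answered.'
)
print(unclosed_quote_paragraphs(sample))               # default behaviour
print(unclosed_quote_paragraphs(sample, strict=True))  # in the spirit of -p
```

In the sample, the first paragraph is a speech continued in the second,
so only the strict call reports it; the third paragraph is reported
either way.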
    79 
    80       Singlequotes (') are a problem, since the same character
    81       is used for an apostrophe. I'm not sure that it is 
    82       possible to get 100% accuracy on singlequotes checking,
    83       particularly since dialect, quite common in PG texts,
    84       upsets the normal rules so badly. Consider the sentence:
    85         'Tis often said that a man's a man for a' that.
    86       As humans, we recognize that all three apostrophes are used
    87       for contractions rather than quotes, but it isn't easy 
    88       to get a program to recognize that.
    89 
    90       Since bookloupe makes too many mistakes when trying to match
    91       singlequotes, it doesn't look for unbalanced singlequotes
    92       unless you specify the -s switch.
    93 
    94       Consider these sentences, which illustrate the main cases:
    95 
    96         'Tis often said that a fool and his money are soon parted.
    97 
    98         'Becky's goin' home,' said Tom.
    99 
   100         The dogs' tails wagged in unison.
   101 
   102         Those 'pack dogs' of yours look more like wolves.
   103 
   104 
   105 
   106     Typos (-t switch)
   107 
   108       It's not bookloupe's job to be a spelling checker, but it
   109       does check for a list of common typos and OCR errors if you
   110       use the -t switch. (The -x switch also turns typo checking on.)
   111 
   112       It also checks for character combinations, especially involving
   113       h and b, which are often confused by OCR, that rarely or never
   114       occur. For example, it queries "tbe" in a word. Now, "the" often
   115       occurs, but "tbe" is very rare (heartbeat, hotbed), so I'm
   116       playing the odds - a few false positives for many errors found.
   117       Similarly with "ii", which is a very common OCR error.
   118 
   119       Bookloupe suppresses multiple reporting of the first 40 "typos"
   120       found. This is to remove the annoyance of seeing something like
   121       "FN" (footnote) or "LK" (initials) flagged as a typo 147 times
   122       in a text. 
   123 
   124 
   125     Line-end checking (-l switch to disable)
   126 
   127       All PG texts should have a Carriage Return (CR - character 13)
   128       and a Line Feed (LF - character 10) at end of each line,
   129       regardless of what O/S you made them on. DOS/Windows, Unix
   130       and Mac have different conventions, but the final text should
   131       always use a CR/LF pair as its line terminator.
   132 
   133       By default, bookloupe verifies that every line does have
   134       the correct terminator, but if you're on a work-in-progress
   135       in Linux, you might want to convert the line-ends as a final
   136       step, and not want to see thousands of errors every time you
   137       run bookloupe before that final step, so you can turn off 
   138       this checking with the -l switch.
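
If you want a quick count of how many lines would fail this check before
running bookloupe, something like the following Python sketch will do.
It is only an illustration of the CR/LF rule described above, not how
bookloupe itself does it, and the function name is hypothetical.

```python
def count_bad_line_ends(data: bytes) -> int:
    """Count lines in `data` that are not terminated by a CR/LF pair."""
    bad = 0
    # bytes.splitlines() splits on LF, CR and CR/LF; keepends=True keeps
    # the terminator so that it can be inspected.
    for line in data.splitlines(keepends=True):
        if not line.endswith(b"\r\n"):
            bad += 1
    return bad


# Two of these four lines lack a proper CR/LF terminator.
print(count_bad_line_ends(b"one\r\ntwo\nthree\rfour\r\n"))
```

To try it on a real text, read the file in binary mode, for example with
open("yourfile.txt", "rb"), so that Python does not translate the line
ends for you.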
   139 
   140 
   141     Paranoid mode (-x switch to disable: Trust No One :-)
   142 
   143       -x automatically switches OFF typo-checking (the -t flag)
   144       and some extra checks, like the standalone 1 and 0 queries.
   145 
   146 
   147     Overview mode (-o switch)
   148 
   149       This mode just gives a count of queries found
   150       instead of a detailed list.
   151 
   152 
   153     Header quote (-h switch)
   154 
   155       If you use the -h switch, bookloupe will also display
   156       the Title, Author, Release and Edition fields from the
   157       PG header. This is useful mostly for the automated
   158       checks we do on recently-posted texts.
   159 
   160 
   161     Errors to stdout (-y switch)
   162 
   163       If you're just running bookloupe normally, you can ignore
   164       this. It's only there for programs that provide a front
   165       end to bookloupe. It makes error messages appear within
   166       the output of bookloupe so that the front end knows whether
   167       bookloupe ran OK.
   168 
   169 
   170     Verbose reporting (-v switch)
   171 
   172       Normally, if bookloupe sees lots of long lines, short lines,
   173       spaced dashes, non-ASCII characters or dot-commas ".," it
   174       assumes these are features of the text, counts and summarizes
   175       them at the top of its report, but does not list them 
   176       individually. If the -v switch is on, bookloupe will list them all.
   177 
   178 
   179     Markup interpretation (-m switch)
   180 
   181       Normally, bookloupe flags anything it suspects of being HTML
   182       markup as a possible error. When you use the -m switch,
   183       however, it matches anything that looks like markup against
   184       a short list of common HTML tags and entities. If the markup
   185       is in that list, it either ignores the markup, in the case
   186       of a tag, or "interprets" the markup as its nearest ASCII 
   187       equivalent, in the case of an entity. So, for example, using
   188       this switch, bookloupe will "see"
   189 
   190       &ldquo;He went <i>thataway!</i>&rdquo;
   191 
   192       as
   193 
   194       "He went thataway!"
   195 
   196       and report accordingly.
   197 
   198       This switch does not, not, NOT check the validity of HTML;
   199       it exists so that you can run bookloupe on most HTML texts
   200       for PG, and get sane results. It does not support all tags.
   201       It does not support all entities. When it sees a tag or entity
   202       it does not recognize, it will query it as HTML just as if
   203       you hadn't specified the -m switch.
   204 
   205       Bookloupe will automatically switch on markup interpretation
   206       if it sees a lot of tags that appear to be markup, so mostly, you
   207       won't have to specify this.
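
The idea of "interpreting" markup can be sketched in Python as below.
This is a guess at the general approach, not bookloupe's code: the tag
and entity lists here are hypothetical and much shorter than whatever
bookloupe really recognizes, and the function name is invented.

```python
import re

# Hypothetical short lists, chosen for this example only.
KNOWN_TAGS = {"i", "b", "em", "p"}
KNOWN_ENTITIES = {"ldquo": '"', "rdquo": '"', "amp": "&", "mdash": "--"}

TAG_RE = re.compile(r"</?([A-Za-z][A-Za-z0-9]*)[^<>]*>")
ENTITY_RE = re.compile(r"&([A-Za-z]+);")


def interpret_markup(line):
    """Drop known tags and replace known entities; leave the rest alone."""
    def tag_sub(match):
        return "" if match.group(1).lower() in KNOWN_TAGS else match.group(0)

    def entity_sub(match):
        return KNOWN_ENTITIES.get(match.group(1), match.group(0))

    return ENTITY_RE.sub(entity_sub, TAG_RE.sub(tag_sub, line))


print(interpret_markup("&ldquo;He went <i>thataway!</i>&rdquo;"))
print(interpret_markup("<blink>Hello</blink> &emdash;"))
```

The first call prints the plain quoted sentence; the second comes back
unchanged, because neither the tag nor the entity is in the lists, so a
checker would still query it as HTML.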
   208 
   209     User-defined typos (-u switch)
   210 
   211       If you have a file named bookloupe.typ or gutcheck.typ either
   212       in your current working directory or in the directory from
   213       which you explicitly invoked bookloupe, but not necessarily on
   214       your path, and if you specify the -u switch, bookloupe will
   215       query any word specified in that file. The file is simple: one
   216       word, in lower case, per line. Be careful not to put multiple
   217       words onto a line, or leave any rubbish other than the word on
   218       the line. You should have received a sample file bookloupe.typ
   219       with this package. The file may be encoded in UTF-8 (preferred),
   220       ISO-8859-1 (also known as Latin-1), or WINDOWS-1252 (also known,
   221       incorrectly, as ansi).
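
A minimal Python sketch of how a one-word-per-line typo list might be
applied follows. It is an illustration under assumptions, not a
description of bookloupe's internals: the function names are invented,
and matching words without regard to case is my own choice.

```python
import re


def load_typo_words(lines):
    """Build the set of user typos from an iterable of lines, one word each."""
    return {line.strip().lower() for line in lines if line.strip()}


def query_user_typos(text, typo_words):
    """Return (line_number, word) pairs for words listed in typo_words."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        # Runs of letters, optionally joined by apostrophes (e.g. don't).
        for word in re.findall(r"[^\W\d_]+(?:'[^\W\d_]+)*", line):
            if word.lower() in typo_words:
                hits.append((lineno, word))
    return hits


typos = load_typo_words(["arid\n", "modem\n", "\n"])
text = "The arid plain.\nA modem convenience, he said.\nArid again."
for lineno, word in query_user_typos(text, typos):
    print(f"Line {lineno} - Query word {word}")
```

With a real list you would pass an open file to load_typo_words instead
of the literal list used here.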
   222 
   223     Ignore DP markup (-d switch)
   224         
   225       Distributed Proofreaders (http://www.pgdp.net) has for some
   226       time been the main source of PG texts, and proofers there use
   227       special conventions. This switch understands those conventions,
   228       so that people can use bookloupe on files in process that haven't
   229       yet had the special conventions removed. The special
   230       conventions supported are page-separators and
   231       "<sc>", "</sc>", "/*", "*/", "/#", "#/", "/$", "$/".
   232 
   233 
   234 You will probably only run bookloupe on a text once or maybe twice,
   235 just prior to uploading; it usually finds a few formatting problems;
   236 it also usually finds queries that aren't problems at all - it often
   237 questions Tables of Contents for having short lines, for example.
   238 These are called "false positives," and need a human to decide on
   239 them.
   240 
   241 The text should be standard prose, and already close to PG normal
   242 format (plain text, about 70 characters per line with blank lines
   243 between paragraphs).
   244 
   245 Bookloupe merely draws your attention to things that might be errors.
   246 It is NOT a substitute for human judgement. Formatting choices like
   247 short lines may be for a reason that this program can't understand.
   248 
   249 Even the most careful human proofing can leave errors behind in a
   250 text, and there are several automated checks you can do to help find
   251 them. Of these, spellchecking (with _very_ careful human judgement) is
   252 the most important and most useful.
   253 
   254 Bookloupe does perform some basic typo-checking if you ask it to,
   255 but its focus is on formatting errors specific to PG texts--
   256 mismatched quotes, non-ASCII characters, bad spacing, bad line
   257 length, HTML tags perhaps left from a conversion, unbalanced
   258 brackets.
   259 
   260 Suggestions for additional checks would be appreciated and duly 
   261 considered, but no guarantees that they will be implemented.
   262 
   263 
   264 
   265 
   266         How does Jim Tinsley use gutcheck?
   267 
   268 Practically everyone I give gutcheck to asks me how _I_ use it.
   269 Well, when I get a text for posting, say filename.txt, I run
   270 
   271     gutcheck -o filename.txt
   272 
   273 That gives me a quick idea what I'm dealing with. It'll tell
   274 me what kind of problems gutcheck sees, and give me an idea 
   275 of how much more work needs to be done on the text. Keep in 
   276 mind that gutcheck doesn't do anything like a full spellcheck,
   277 but when I see a text that has a lot of problems, I assume that
   278 it probably needs a spellcheck too.
   279 
   280 Having got a feel for the ballpark, I run
   281 
   282     gutcheck filename.txt > jj
   283 
   284 where jj is my personal, all-purpose filename for temporary data
   285 that doesn't need to be kept. Then I open filename.txt and jj in
   286 a split-screen view in my editor, and work down the text, fixing
   287 whatever needs fixing, and skipping whatever doesn't. If your 
   288 editor doesn't split-screen, you can get much the same effect by 
   289 opening your original file in your normal editor, and jj (or your
   290 equivalent name) in something like Notepad, keeping both in view 
   291 at the same time.
   292 
   293 Twice a day, an automatic process looks at all recently-posted
   294 texts, and emails Michael, me, and sometimes other people with
   295 their gutcheck summaries.
   296 
   297 
   298 
   299         Future development of bookloupe
   300 
   301 Bookloupe version 2.0 is intended to add UTF-8 support to
   302 gutcheck. All the functionality should already be implemented
   303 in the beta versions leading up to version 2.0, although
   304 some bugs may well remain.
   305 
   306 Future versions will add support for UTF-8 characters that
   307 are not in ISO-8859-1 (e.g., curled quotation marks);
   308 characters that do not have a composed form (version 2
   309 treats these as taking 2 or more columns); zero width and
   310 wide characters (version 2 treats these as taking 1 column).
   311 
   312 
   313 
   314 
   315 Explanations of common bookloupe messages:
   316 
   317     --> 74 lines in this file have white space at end
   318 
   319     PG texts shouldn't have extra white space added at end of line.
   320     Don't worry too much about this; they're not doing any harm,
   321     and they'll be removed during posting anyway.
   322 
   323 
   324     --> 348 lines in this file are short. Not reporting short lines.
   325     --> 84 lines in this file are long. Not reporting long lines.
   326     --> 8 lines in this file are VERY long!
   327 
   328     If there are a lot of long or short lines, bookloupe won't list
   329     them individually. The short lines version of this message
   330     is commonly seen when gutchecking poetry and some plays, where
   331     the normal line length is shorter than the standard for prose.
   332     A "VERY long" line is one over 80 characters.  You normally
   333     shouldn't have any of these, but sometimes you may have to render
   334     a table that must be that long, or some special preformatted
   335     quotation that can't be broken.
   336 
   337 
   338     --> There are 75 spaced dashes and em-dashes in this file. Not reporting them.
   339 
   340     The PG standard for an em-dash--like these--is two minus signs
   341     with no spaces before or after them. However, some older texts
   342     used spaced dashes - like these -- and if there are very many
   343     such spaced dashes in the file, bookloupe just draws your
   344     attention to it and doesn't list them individually.
   345 
   346 
   347 
   348     Line 3020 - Non-ASCII character 233
   349 
   350     Standard PG texts should use only ASCII characters with values
   351     up to 127; however, non-English, accented characters can be 
   352     represented according to several different non-ASCII encoding 
   353     schemes, using values over 127. If you have a plain English text
   354     with a few accented characters in words like cafe or tete-a-tete,
   355     you might replace the accented characters with their unaccented 
   356     versions. The English pound sign is another commonly-seen
   357     non-ASCII character. If you have enough non-ASCII characters in
   358     your text that you feel removing them would degrade your text,
   359     you should probably consider doing a UTF-8 text.
   360 
   361 
   362 
   363     Line 1207 - Non-ISO-8859 character 156
   364 
   365     Even in "8-bit" texts, there are distinctions between code sets.
   366     The ISO-8859 family of 8-bit code sets is the most commonly used
   367     in PG, and these sets do not define values in the range 128 through
   368     159 as printable characters. It's quite common for someone on a
   369     Windows or Mac machine to use a non-ISO character inadvertently,
   370     so this message warns that the character is not only not ASCII,
   371     but also outside the ISO-8859 range.
   372 
   373 
   374 
   375     Line 46 - Tab character?
   376 
   377     Some editors and WPs will put in Tab characters (character 9) to
   378     indicate indented text. You should not use these in a PG text,
   379     because you can't be sure how they will appear on a reader's
   380     screen. Find the Tab, and replace it with the appropriate number
   381     of spaces.
   382 
   383 
   384     Line 1327 - Tilde character?
   385 
   386     The tilde character (~) might be legitimately used, but it's the
   387     character commonly used by OCR software to indicate a place where
   388     it couldn't make out the letter, so bookloupe flags it.
   389 
   390 
   391 
   392     Line 1347 - Asterisk?
   393 
   394     Asterisks are reported only in paranoid mode (see -x). 
   395     Like tildes, they are often used to indicate errors, but they are
   396     also legitimately used as line delimiters and footnote markers.
   397 
   398 
   399 
   400     Line 1451 - Long line 129
   401 
   402     PG texts should have lines shorter than 76 characters. There may
   403     be occasions where you decide that you really have to go out to 79
   404     characters, but the sample above says that line 1451 is 129
   405     characters long--probably two lines run together.
   406 
   407 
   408 
   409     Line 1590 - Short line?
   410 
   411     PG texts should have lines longer than 54 characters. However,
   412     there are special cases like poetry and tables of contents where
   413     the lines _should_ be shorter. So treat bookloupe warnings about
   414     short lines carefully. Sometimes it's a genuine formatting
   415     problem; sometimes the line really needs to be short.
   416 
   417     Hint: bookloupe will not flag lines as short if they are indented--
   418     if they start with a space. I like to start inserted stanzas
   419     and other such items indented with a couple of spaces so that 
   420     they stand out from the main text anyway.
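
Putting the line-length figures quoted in this file together (long at 76
characters or more, VERY long over 80, short at 54 or fewer, indented
lines exempt) gives roughly the Python sketch below. It is not
bookloupe's code. In particular, skipping the last line of each
paragraph is my own assumption, added so that ordinary paragraph
endings are not reported; the function name is invented too.

```python
def classify_lines(lines):
    """Return (line_number, kind) pairs for long, very long and short lines."""
    results = []
    for i, line in enumerate(lines):
        n = len(line)
        # Assumption: a line followed by a blank line (or by nothing) ends
        # a paragraph, and such lines are naturally short.
        ends_paragraph = i + 1 >= len(lines) or not lines[i + 1].strip()
        if n > 80:
            results.append((i + 1, "very long"))
        elif n >= 76:
            results.append((i + 1, "long"))
        elif 0 < n <= 54 and not line.startswith(" ") and not ends_paragraph:
            results.append((i + 1, "short"))
    return results


sample_lines = [
    "x" * 60, "short one", "y" * 60, "",
    "  indented", "z" * 60, "w" * 78, "v" * 129,
]
for lineno, kind in classify_lines(sample_lines):
    print(f"Line {lineno} - {kind}")
```

The indented line 5 is left alone, which is the effect the hint above
relies on.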
   421 
   422 
   423 
   424     Line 1804 - Begins with punctuation?
   425 
   426     Lines should normally not begin with commas, periods and so on.
   427     An exception is ellipses . . . which can happen at start of line.
   428 
   429 
   430 
   431     Line 1850 - Spaced em-dash?
   432 
   433     The PG standard for an em-dash--like these--is two minus signs
   434     with no spaces before or after them. Bookloupe flags non-PG
   435     em-dashes - like this one. Normally, you will replace it with a 
   436     PG-standard em-dash.
   437 
   438 
   439 
   440     Line 1904 - Query he/be error?
   441 
   442     Bookloupe makes a very minor effort to look for that scourge of all
   443     proofreaders, "be" replacing "he" or vice-versa, and draws your
   444     attention to it when it thinks it has found one.
   445 
   446 
   447 
   448     Line 2017 - Query digit in a1most
   449 
   450     The digit 1 is commonly OCRed for the letter l, the digit 0 for
   451     the letter O, and so on. When bookloupe sees a mix of digits and
   452     letters, it warns you. It may generate a false positive for
   453     something like 7am.
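
The mixed digits-and-letters test is easy to picture in Python. This is
only a sketch of the idea, with an invented function name; bookloupe may
well make exceptions that this does not.

```python
import re


def words_with_digits(line):
    """Return the words in `line` that mix digits and letters."""
    hits = []
    for word in re.findall(r"\w+", line):
        has_digit = any(c.isdigit() for c in word)
        has_letter = any(c.isalpha() for c in word)
        if has_digit and has_letter:
            hits.append(word)
    return hits


print(words_with_digits("It was a1most 7am on 12 May 1805."))
```

Both a1most and 7am come back: the first is a real OCR error and the
second is the kind of false positive mentioned above. Pure numbers such
as 12 and 1805 are not reported.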
   454 
   455 
   456 
   457     Line 2083 - Query standalone 0
   458 
   459     In paranoid mode (see -x) only, bookloupe warns about the digits 0
   460     and 1 standing alone as a word. This can happen if the OCR
   461     misreads the words O or I.
   462 
   463 
   464 
   465     Line 2115 - Query word whetber
   466 
   467     If you have switched typo-checking on, bookloupe looks for
   468     potential typos, especially common h/b errors. It's not
   469     infallible; it sometimes queries legit words, but it's
   470     always worth taking a look.
   471 
   472 
   473 
   474     Line 2190 column 14 - Missing space?
   475 
   476     Omitting a space is a very common error,especially coming from
   477     OCRed text,and can be hard for a human to spot. The commas in
   478     the previous sentence illustrate the kind of thing I mean.
   479 
   480 
   481 
   482     Line 2240 column 48 - Spaced punctuation?
   483 
   484     The flip side of the "missing space" error , here , is when extra
   485     spaces are added before punctuation . Some old texts appear to add
   486     extra spaces around punctuation consistently, but this was a
   487     typographical convention rather than the author's intent, and the
   488     extra "spaces" should be removed when preparing a PG text.
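
The "missing space" and "spaced punctuation" queries can both be
approximated with simple patterns, as in this Python sketch. The
patterns, the function name and the choice to report the column of the
punctuation mark are all assumptions for illustration; bookloupe's own
rules are not reproduced here and are surely more careful (for example
about ellipses).

```python
import re

# Punctuation immediately followed by a letter, as in "error,especially".
MISSING_SPACE_RE = re.compile(r"[,;:!?][A-Za-z]")
# Whitespace, then punctuation, then whitespace or end of line.
SPACED_PUNCT_RE = re.compile(r"\s[,;:!?.](?=\s|$)")


def spacing_queries(line):
    """Return sorted (column, message) pairs, with 1-based columns."""
    queries = []
    for m in MISSING_SPACE_RE.finditer(line):
        queries.append((m.start() + 1, "Missing space?"))
    for m in SPACED_PUNCT_RE.finditer(line):
        # m.start() is the space; the punctuation mark is one further on.
        queries.append((m.start() + 2, "Spaced punctuation?"))
    return sorted(queries)


for column, message in spacing_queries("a very common error,especially here , too ."):
    print(f"column {column} - {message}")
```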
   489 
   490 
   491 
   492     Line 2301 column 19 - Unspaced quotes?
   493 
   494     Another common spacing problem occurs in a phrase like "You wait
   495     there,"he said.
   496 
   497 
   498 
   499     Line 2385 column 27 - Wrongspaced quotes?
   500 
   501     Bookloupe checks whether a quote seems to be a start or end quote,
   502     and queries those that appear to be misplaced. This does give rise
   503     to false positives when quotes are nested, for example:
   504 
   505     "And how," she asked, "will your "friends" help you now?"
   506 
   507     but these false positives are worth it because of the many cases
   508     that this test catches, notably those like:
   509 
   510     "And how, "she said," will your friends help you now?"
   511 
   512     Sometimes a "wrongspaced quotes" query will arise because an earlier
   513     quote in the paragraph was omitted, so if the place specified seems
   514     to be OK, look back to see whether there's a problem in the preceding
   515     lines.
   516 
   517 
   518 
   519     Line 2400 - HTML Tag? <PRE>
   520 
   521     Some PG texts have been converted from HTML, and not all of the
   522     HTML tags have been removed.
   523 
   524 
   525 
   526     Line 2402 - HTML symbol? &emdash;
   527 
   528     Similarly, special HTML symbol characters can survive into PG
   529     texts. This check can occasionally produce amusing false positives like
   530     . . . Marwick & Co were well known for it;
   531 
   532 
   533 
   534     Line 2540 - Mismatched quotes
   535 
   536     Another bookloupe mainstay--unclosed doublequotes in a paragraph.
   537     See the discussion of quotes in the switches section near the
   538     start of this file.
   539     
   540     Since the mismatch doesn't occur on any one line, bookloupe quotes
   541     the line number of the first blank line following the paragraph,
   542     since this is the point where it reconciles the count of quotes.
   543     However, if bookloupe is echoing lines, that is, you haven't used
   544     the -e switch, it will show the _first_ line of the paragraph, 
   545     to help you find the place without using line numbers. The 
   546     offending paragraph is therefore between the quoted line and 
   547     the line number given.
   548 
   549 
   550 
   551     Line 2587 - Mismatched single quotes
   552 
   553     Only checked with the -s switch, since checking single quotes is 
   554     not a very reliable process. Otherwise, the same logic as for 
   555     doublequotes applies.
   556 
   557 
   558 
   559     Line 2877 - Mismatched round brackets?
   560 
   561     Also curly and square brackets. Texts with a lot of brackets, like
   562     plays with bracketed stage instructions, may have mismatches.
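
A bracket check of this kind comes down to comparing counts of opening
and closing characters. The Python sketch below assumes the comparison
is made one paragraph at a time, as with quotes; that assumption and the
names used are mine, not taken from bookloupe.

```python
PAIRS = {"(": ")", "[": "]", "{": "}"}
NAMES = {"(": "round", "[": "square", "{": "curly"}


def mismatched_brackets(paragraph):
    """Return the kinds of bracket whose open and close counts differ."""
    return [
        NAMES[opener]
        for opener, closer in PAIRS.items()
        if paragraph.count(opener) != paragraph.count(closer)
    ]


for kind in mismatched_brackets("[Enter HAMLET (reading a book.]"):
    print(f"Mismatched {kind} brackets?")
```

Counting alone cannot tell you where the missing bracket belongs, which
is one reason these queries need a human to look at the paragraph.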
   563 
   564 
   565     Line 3150 - No CR?
   566     Line 3204 - Two successive CRs?
   567     Line 3281 position 75 - CR without LF?
   568 
   569     These are the invalid line-end warnings. See the discussion of
   570     line-end checking in the switches section near the start of this
   571     file. If you see these, and your editor doesn't show anything
   572     wrong, you should probably try deleting the characters just before
   573     and after the line end, and the line-end itself, then retyping the
   574     characters and the line-end.
   575 
   576 
   577     Line 2940 - Paragraph starts with lower-case
   578 
   579     A common error in an e-text is for an extra blank line
   580 
   581     to be put in, like the blank line above, and this often
   582     shows up as a new paragraph beginning with lower case.
   583     Sometimes the blank line is deliberate, as when a 
   584     quotation is inserted in a speech. Use your judgement.
   585 
   586 
   587     Line 2987 - Extra period?
   588 
   589     An extra period. is a. common problem in OCRed text. and usually
   590     arises when a speck of dust on the page is mistaken for a period.
   591     or. as occasionally happens. when a comma loses its tail.
   592 
   593 
   594     Line 3012 column 12 - Double punctuation?
   595 
   596     Double punctuation., like that,, is a common typo and
   597     scanno. Some books have much legit double punctuation,
   598     like etc., etc., but it's worth checking anyway.
   599 
   600 
   601 
   602             *       *       *        *
   603 
   604 For Windows-only users who are unfamiliar with DOS:
   605 
   606     If you're a Windows-only user, you need to save
   607     bookloupe.exe into the folder (directory) where the
   608     text file you want to check is. Let's say your
   609     text file is in C:\gut, then you should save
   610     bookloupe.exe into C:\gut.
   611 
   612     Now get to a console. You can do this by
   613     selecting the "Command Prompt" or "MS-DOS Prompt"
   614     option that will be somewhere on your
   615     Start/Programs menu.
   616 
   617     Now get into the C:\gut directory. 
   618     You can do this using the cd (change directory) 
   619     command, like this:
   620         cd \gut
   621     and your prompt will change to 
   622         C:\gut>
   623     so you know you're in the right place.
   624 
   625     Now type
   626         bookloupe yourfile.txt
   627     and you'll see bookloupe's report.
   628 
   629     By default, bookloupe prints its queries to screen.
   630     If you want to create a file of them, to edit
   631     against the text, you can use the greater-than
   632     sign (>) to tell it to output the report to a
   633     file. For example, if you want its report in a
   634     file called queries.lst, you could type
   635 
   636         bookloupe yourfile.txt > queries.lst
   637 
   638     The queries.lst file will then contain the listing
   639     of possible formatting errors, and you can
   640     edit it alongside your text.
   641 
   642     Whatever you do, DON'T make the filename after
   643     the greater-than sign the name of a file already
   644     on your disk that you want to keep, because
   645     the greater-than sign will cause bookloupe to
   646     replace any existing file of that name.
   647 
   648     So, for example, if you have two Tolstoy files
   649     that you want to check, called WARPEACE.TXT and 
   650     ANNAK.TXT, make sure that neither of these names
   651     is ever used following the greater-than sign.
   652     To check these correctly, you might do:
   653 
   654     bookloupe warpeace.txt > war.lst
   655 
   656     and
   657 
   658     bookloupe annak.txt > annak.lst
   659 
   660     separately. Then you can look at war.lst and annak.lst
   661     to see the bookloupe reports.