doc/bookloupe.txt
author ali <ali@juiblex.co.uk>
Thu Dec 05 10:58:41 2013 +0000 (2013-12-05)
changeset 219 98fc47ee1beb
     1 
     2 
     3                             Bookloupe documentation
     4 
     5 
     6 bookloupe: lists possible common formatting errors in a Project
     7 Gutenberg candidate file. Bookloupe is based on gutcheck, written
     8 by Jim Tinsley. It is a command line program and can be used under
     9 Microsoft Windows, Mac or Unix. For Windows-only people, there is
    10 an appendix at the end with brief instructions for running it.
    11 
    12 Current version: 2.0.69, an alpha version leading up to version 2.1
    13 
    14 This software is Copyright Jim Tinsley 2000-2005 and
    15 J. Ali Harlow 2012 onwards.
    16 
    17 Bookloupe comes with ABSOLUTELY NO WARRANTY. For details, read the file COPYING.
    18 This is Free Software; you may redistribute it under certain conditions (GPL).
    19 
    20 See http://www.juiblex.co.uk/pgdp/bookloupe/ for the latest version.
    21 
    22 
    23                        Compatibility with guiguts v1.0.25
    24 
    25 Versions of guiguts up to at least 1.0.25 have a bug in the way that they
    26 prepare a copy of the ebook for gutcheck (or bookloupe) to check. This causes
    27 problems with ebooks that contain Unicode characters not present in Latin-1.
    28 
    29 The guiguts bug report is here: http://sourceforge.net/p/guiguts/bugs/95/
    30 The bug report also includes details of how to edit guiguts to work around
    31 the problem until an official fix is released.
    32 
    33 
    34                           Recent changes in behaviour
    35 
    36 Each new version of bookloupe brings bug fixes and improvements. Sometimes
    37 the behaviour is also changed in ways that might be unexpected:
    38 
    39 Odd characters
    40 
    41     The check for "odd" characters (tab, tilde, caret, forward slash and
    42     asterisk) is disabled in bookloupe 2.0 when the character set is
    43     switched from ASCII/ISO-8859-1 to UNICODE (i.e., when the "There are a
    44     lot of foreign letters here." message is printed). As of bookloupe 2.1
    45     these tests operate independently of the character set selected.
    46 
    47     Users may notice this change most especially in the case of the
    48     DP-specific /* ... */ markup. Bookloupe 2.0 often did not warn when
    49     this markup was encountered even when the --dp switch was not given.
    50     Bookloupe 2.1 will warn about this markup unless dp-specific mode is
    51     switched on, paranoid mode is switched off or the ebook contains more
    52     than 10 lines containing asterisks. In the last case
    53 
    54       --> 11 lines in this file contain asterisks. Not reporting them.
    55 
    56     will be printed.
    57 
    58 
    59 
    60 Usage is: bookloupe [OPTION...] filename
    61 
    62 Options:
    63       -d, --dp                  ignores some DP-specific markup
    64       -e, --no-echo             switches off Echoing of lines
    65       -s, --squote              checks Single quotes
    66       --typo                    checks Typos
    67       -p, --qpara               sets strict quotes checking for Paragraphs
    68       --no-paranoid             switches OFF typo checking and extra checks
    69       -l, --no-line-end         turns off Line-end checks
    70       -o, --overview            produces an Overview only
    71       -y, --stdout              sets error messages to stdout
    72       -h, --header              echoes the header fields
    73       -m, --markup              ignore some common HTML markup
    74       -u, --usertypo            warns about words in a user-defined typo file
    75       -v, --verbose             forces individual reporting of minor problems
    76       -w, --web                 special mode for web uploads (for future use)
    77       --charset=NAME            the set of characters valid for this ebook
    78       --dump-config             dump the current configuration
    79 
    80 There are also inverted options available which are useful when it is
    81 desired to override an option set in the configuration file:
    82 
    83       --no-dp, --echo, --no-squote, --no-typo, --no-qpara, --paranoid,
    84       --line-end, --no-overview, --no-stdout, --no-header, --no-markup,
    85       --no-usertypo, --no-verbose.
    86 
    87 Note: there is no --no-web since --web simply selects a set of options.
    88 
    89 Finally there are a couple of options that toggle the state of options
    90 rather than setting or unsetting them: -t (for typo) and -x (for typo
    91 and paranoid). These are mainly intended for compatibility with gutcheck.
    92 
    93 Running bookloupe without any parameters will display a brief help message.
    94 
    95 Sample usage:
    96 
    97     bookloupe warpeace.txt
    98 
    99 
   100 More detail:
   101 
   102     Configuration file
   103 
   104       Bookloupe will look for a file named bookloupe.ini to read as
   105       a configuration file. Options set in a configuration file can
   106       be overridden from the command line as required.
   107 
   108       The following directories are searched in order:
   109 
   110         1) The current working directory. When run from the command
   111 	line, this is the directory you ran it from. When run from
   112 	guiguts it will normally be the directory that contains the
   113 	guiguts program.
   114 
   115 	2) The directory containing the bookloupe program.
   116 
   117 	3) The user's configuration directory. Under MS-Windows this
   118 	is normally CSIDL_LOCAL_APPDATA which is typically set to
   119 	C:\Documents and Settings\<user>\Local Settings\Application Data.
   120 	On other platforms this is normally $XDG_CONFIG_HOME which, if
   121 	not set, defaults to $HOME/.config.
   122 
   123 	The directories to search can also be changed using the
   124 	$BOOKLOUPE_CONFIG_PATH environment variable which is a colon
   125 	separated (semi-colon separated under MS-Windows) list of
   126 	directories.
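
As a rough illustration of the search order described above, the following
Python sketch builds the list of directories for a non-Windows system. It is
not bookloupe's own code: the function name config_search_dirs and its
parameters are hypothetical, and it assumes that $BOOKLOUPE_CONFIG_PATH, when
set, replaces the default list entirely (the text above only says that the
list "can be changed").

```python
import os


def config_search_dirs(environ, cwd, program_dir):
    """Return the directories to search for bookloupe.ini, in order.

    Hypothetical helper. It mirrors the documented order and ASSUMES that
    BOOKLOUPE_CONFIG_PATH replaces the defaults rather than adding to them.
    """
    override = environ.get("BOOKLOUPE_CONFIG_PATH")
    if override:
        # os.pathsep is ":" on Unix and ";" under MS-Windows.
        return [d for d in override.split(os.pathsep) if d]

    # User configuration directory on non-Windows platforms:
    # $XDG_CONFIG_HOME, falling back to $HOME/.config when it is not set.
    user_dir = environ.get("XDG_CONFIG_HOME") or os.path.join(
        environ.get("HOME", ""), ".config")
    return [cwd, program_dir, user_dir]
```

Passing the environment in as a plain dictionary keeps the sketch easy to try
out without changing any real environment variables.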
   127 
   128       The configuration file is a key file. This is very similar to,
   129       but not identical to a typical ini file as found under MS-Windows.
   130       Key files consist of a number of groups which start with the
   131       group name enclosed in square brackets on a line by itself.
   132       Bookloupe recognises just one group, "options". Then below the
   133       group name there follow the keys and their values for that
   134       group, one per line in the format key=value. Most of bookloupe's
   135       options are flags (i.e., either on or off). For these keys, the
   136       value must be either "true" or "false". The file may also contain
   137       comment lines which begin with the # symbol. The names of the
   138       keys follow the long option names.
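
To make the layout concrete, here is a small Python sketch that holds a
minimal configuration file as a string and reads it back. The key names "dp"
and "squote" are assumptions based on the statement that keys follow the long
option names, and Python's configparser module is only a stand-in for the key
file parser that bookloupe itself uses, so treat this as a picture of the
format rather than a compatibility test.

```python
import configparser
import io

# ASSUMPTION: the keys "dp" and "squote" correspond to --dp and --squote.
SAMPLE = """\
# Illustrative bookloupe.ini
[options]
dp=true
squote=true
"""

parser = configparser.ConfigParser()
parser.read_file(io.StringIO(SAMPLE))

# Collect the keys of the single "options" group as plain strings.
options = dict(parser["options"])
print(options)
```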
   139 
   140       A sample configuration file is provided (in sample.ini). The file
   141       will need to be copied to bookloupe.ini before bookloupe will
   142       read it. You can also use the --dump-config option to write a
   143       configuration file for you. For example, if you typically want
   144       to run bookloupe with the --dp and --squote options, then you
   145       might do:
   146 
   147         $ bookloupe --dp --squote --dump-config > configuration.ini
   148         $ ren configuration.ini bookloupe.ini
   149 
   150       (Under Unix, use mv rather than ren. Don't merge these two steps
   151       or bookloupe will see an empty configuration file and complain.)
   152 
   153       This same idea can also be used to modify an existing configuration.
   154 
   155 
   156     Character encoding
   157 
   158       Bookloupe will handle e-texts encoded in UTF-8 (preferred),
   159       ISO-8859-1 (also known as Latin-1), or WINDOWS-1252 (also known,
   160       incorrectly, as ansi). The output will be in the same encoding
   161       as the input e-text.
   162 
   163 
   164     Character set (--charset)
   165 
   166       Character encodings have an implicit set of characters that
   167       can be encoded and thus define a set of characters that can
   168       be present in the text. However sometimes it is desirable
   169       that not all characters that can be encoded should be present
   170       in a text. The set of characters that should be present is
   171       known as the character set.
   172 
   173       The default setting for the character set (called auto) does
   174       the same as gutcheck for Windows-1252 encoded texts for
   175       compatibility:
   176 
   177       If the file is predominately ASCII then the set of legal
   178       characters is ASCII and warnings are issued whenever non-ASCII
   179       characters are encountered. The message will either warn of
   180       non-ASCII or non-ISO-8859-1 characters as appropriate.
   181 
   182       If the file contains a significant number of non-ASCII characters
   183       then a message is printed as follows:
   184 
   185         --> There are a lot of foreign letters here. Not reporting them.
   186 
   187       and the character set is widened to include all possible
   188       characters.
   189 
   190       For UTF-8 encoded texts, auto selects UNICODE.
   191       
   192       Most character sets are simply defined in bookloupe as the
   193       set of all characters that can be encoded in the encoding of
   194       the same name. UNICODE is an exception and includes only the
   195       characters assigned in the relevant Unicode standard but
   196       excluding the Private Use Area characters. Note that the
   197       relevant Unicode standard is given by the version of glib in
   198       use rather than by any code in bookloupe and thus can vary
   199       from system to system. PG texts however are likely to be
   200       using characters assigned in very early Unicode standards,
   201       thus mitigating this issue.
   202 
   203 
   204     Echoing lines (--no-echo to switch off)
   205 
   206       You may find it convenient, when reviewing Bookloupe's
   207       suggestions, to see the line that Bookloupe is questioning.
   208       That way, you can often see at a glance whether it is
   209       a real error that needs to be fixed, or a false positive
   210       that should be in the text, but Bookloupe's limited
   211       programming doesn't understand.
   212 
   213       By default, bookloupe echoes these lines, but if you don't
   214       want to see the lines referred to, --no-echo will switch it
   215       OFF.
   216 
   217 
   218     Quotes (--squote and --qpara switches)
   219 
   220       Bookloupe always looks for unbalanced doublequotes in a
   221       paragraph. It is a common convention for writers not to
   222       close quotes in a paragraph if the next paragraph opens
   223       with quotes and is a continuation by the same speaker.
   224 
   225       Bookloupe therefore does not normally report unclosed quotes
   226       if the next paragraph begins with a quote. If you need
   227       to see all unclosed quotes, even where the next paragraph
   228       begins with a quote, you should use the -p switch.
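
The rule described above can be sketched in a few lines of Python. This is a
simplified illustration, not bookloupe's implementation: the function name
unbalanced_paragraphs is hypothetical, paragraphs are assumed to be separated
by blank lines, and only the straight doublequote character is counted.

```python
def unbalanced_paragraphs(text, strict=False):
    """Return indexes of paragraphs with an odd number of double quotes.

    With strict=False an unclosed quote is tolerated when the next
    paragraph opens with a quote (the continuation convention).
    strict=True models the behaviour described for the -p switch.
    """
    paragraphs = [p for p in text.split("\n\n") if p.strip()]
    flagged = []
    for i, para in enumerate(paragraphs):
        if para.count('"') % 2 == 0:
            continue  # quotes balance within this paragraph
        following = paragraphs[i + 1] if i + 1 < len(paragraphs) else ""
        if not strict and following.lstrip().startswith('"'):
            continue  # same speaker carries on in the next paragraph
        flagged.append(i)
    return flagged
```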
   229 
   230       Singlequotes (', `, ‘ and ’) are a problem, since the same
   231       character can be used for an apostrophe. I'm not sure that it
   232       is possible to get 100% accuracy on singlequotes checking,
   233       particularly since dialect, quite common in PG texts,
   234       upsets the normal rules so badly. Consider the sentence:
   235         'Tis often said that a man's a man for a' that.
   236       As humans, we recognize that both apostrophes are used
   237       for contractions rather than quotes, but it isn't easy
   238       to get a program to recognize that.
   239 
   240       Since bookloupe makes too many mistakes when trying to match
   241       singlequotes, it doesn't look for unbalanced singlequotes
   242       unless you specify the --squote switch.
   243 
   244       Consider these sentences, which illustrate the main cases:
   245 
   246         'Tis often said that a fool and his money are soon parted.
   247 
   248         'Becky's goin' home,' said Tom.
   249 
   250         The dogs' tails wagged in unison.
   251 
   252         Those 'pack dogs' of yours look more like wolves.
   253 
   254 
   255     Typos (--typo switch)
   256 
   257       It's not bookloupe's job to be a spelling checker, but it does
   258       check for a list of common typos and OCR errors if you use the
   259       --typo switch. (The -t and -x switches also toggle typo checking.)
   260 
   261       It also checks for character combinations, especially involving
   262       h and b, which are often confused by OCR, that rarely or never
   263       occur. For example, it queries "tbe" in a word. Now, "the" often
   264       occurs, but "tbe" is very rare (heartbeat, hotbed), so I'm
   265       playing the odds - a few false positives for many errors found.
   266       Similarly with "ii", which is a very common OCR error.
   267 
   268       Bookloupe suppresses multiple reporting of the first 40 "typos"
   269       found. This is to remove the annoyance of seeing something like
   270       "FN" (footnote) or "LK" (initials) flagged as a typo 147 times
   271       in a text.
   272 
   273 
   274     Line-end checking (--no-line-end switch to disable)
   275 
   276       All PG texts should have a Carriage Return (CR - character 13)
   277       and a Line Feed (LF - character 10) at end of each line,
   278       regardless of what O/S you made them on. DOS/Windows, Unix
   279       and Mac have different conventions, but the final text should
   280       always use a CR/LF pair as its line terminator.
   281 
   282       By default, bookloupe verifies that every line does have
   283       the correct terminator, but if you're on a work-in-progress
   284       in Linux, you might want to convert the line-ends as a final
   285       step, and not want to see thousands of errors every time you
   286       run bookloupe before that final step, so you can turn off
   287       this checking with the --no-line-end switch.
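
If you want to see which lines lack the CR/LF pair, or to convert the line
ends as that final step, something along the following lines would do. It is
a minimal Python sketch working on the raw bytes of a file; the function
names are hypothetical and it is not how bookloupe itself performs the check.

```python
def lines_without_crlf(data):
    """Return 1-based numbers of lines in `data` (bytes) not ending in CR/LF."""
    lines = data.splitlines(keepends=True)
    return [number for number, line in enumerate(lines, start=1)
            if not line.endswith(b"\r\n")]


def to_crlf(data):
    """Rewrite LF, CR or CR/LF line ends as CR/LF."""
    return b"".join(line + b"\r\n" for line in data.splitlines())
```

Note that to_crlf also terminates a final line that had no line end at all.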
   288 
   289 
   290     Paranoid mode (--no-paranoid switch to disable: Trust No One :-)
   291 
   292       --no-paranoid switches OFF some extra checks like standalone
   293       1 and 0 queries.
   294 
   295 
   296     Overview mode (--overview switch)
   297 
   298       This mode just gives a count of queries found
   299       instead of a detailed list.
   300 
   301 
   302     Header quote (--header switch)
   303 
   304       If you use the --header switch, bookloupe will also display
   305       the Title, Author, Release and Edition fields from the
   306       PG header. This is useful mostly for the automated
   307       checks we do on recently-posted texts.
   308 
   309 
   310     Errors to stdout (--stdout switch)
   311 
   312       If you're just running bookloupe normally, you can ignore
   313       this. It's only there for programs that provide a front
   314       end to bookloupe. It makes error messages appear within
   315       the output of bookloupe so that the front end knows whether
   316       bookloupe ran OK.
   317 
   318 
   319     Verbose reporting (--verbose switch)
   320 
   321       Normally, if bookloupe sees lots of long lines, short lines,
   322       spaced dashes, non-ASCII characters or dot-commas ".," it
   323       assumes these are features of the text, counts and summarizes
   324       them at the top of its report, but does not list them
   325       individually. If the verbose switch is on, bookloupe will list
   326       them all.
   327 
   328 
   329     Markup interpretation (--markup switch)
   330 
   331       Normally, bookloupe flags anything it suspects of being HTML
   332       markup as a possible error. When you use the --markup switch,
   333       however, it matches anything that looks like markup against
   334       a short list of common HTML tags and entities. If the markup
   335       is in that list, it either ignores the markup, in the case
   336       of a tag, or "interprets" the markup as its nearest ASCII
   337       equivalent, in the case of an entity. So, for example, using
   338       this switch, bookloupe will "see"
   339 
   340       &ldquo;He went <i>thataway!</i>&rdquo;
   341 
   342       as
   343 
   344       "He went thataway!"
   345 
   346       and report accordingly.
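
The idea can be sketched in Python as follows. The tag and entity lists here
are deliberately tiny and hypothetical (bookloupe's real lists are not
reproduced in this document), and the function name interpret_markup is
invented for the example.

```python
import re

# Hypothetical, deliberately short lists for illustration only.
KNOWN_TAGS = {"i", "b", "em", "p"}
KNOWN_ENTITIES = {"ldquo": '"', "rdquo": '"', "mdash": "--", "amp": "&"}


def interpret_markup(line):
    """Drop known tags and replace known entities with ASCII stand-ins."""
    def tag(match):
        # Known tags vanish; unknown tags are left in place to be queried.
        name = match.group(1).lower()
        return "" if name in KNOWN_TAGS else match.group(0)

    def entity(match):
        # Unknown entities are also left untouched.
        return KNOWN_ENTITIES.get(match.group(1), match.group(0))

    line = re.sub(r"</?([A-Za-z][A-Za-z0-9]*)[^<>]*>", tag, line)
    return re.sub(r"&([A-Za-z]+);", entity, line)
```

Anything not in the lists passes through unchanged, which matches the
description above of unrecognized markup still being queried.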
   347 
   348       This switch does not, not, NOT check the validity of HTML;
   349       it exists so that you can run bookloupe on most HTML texts
   350       for PG, and get sane results. It does not support all tags.
   351       It does not support all entities. When it sees a tag or entity
   352       it does not recognize, it will query it as HTML just as if
   353       you hadn't specified the --markup switch.
   354 
   355       Bookloupe will automatically switch on markup interpretation
   356       if it sees a lot of tags that appear to be markup, so mostly, you
   357       won't have to specify this.
   358 
   359 
   360     User-defined typos (--usertypo switch)
   361 
   362       If you have a file named bookloupe.typ or gutcheck.typ either
   363       in your current working directory or in the directory from
   364       which you explicitly invoked bookloupe, but not necessarily on
   365       your path, and if you specify the --usertypo switch, bookloupe
   366       will query any word specified in that file. The file is simple:
   367       one word, in lower case, per line. Be careful not to put multiple
   368       words onto a line, or leave any rubbish other than the word on
   369       the line. You should have received a sample file bookloupe.typ
   370       with this package. The file may be encoded in UTF-8 (preferred),
   371       ISO-8859-1 (also known as Latin-1), or WINDOWS-1252 (also known,
   372       incorrectly, as ansi).
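
A minimal Python sketch of this kind of check is shown below. The function
names are hypothetical, and the sketch assumes that words in the text are
compared against the list without regard to case, which this document does
not state explicitly.

```python
import re


def load_typo_words(lines):
    """Read one lower-case word per line, skipping blank lines."""
    return {line.strip() for line in lines if line.strip()}


def query_user_typos(text, typo_words):
    """Yield (line_number, word) for each word that is in the typo list."""
    for number, line in enumerate(text.splitlines(), start=1):
        # [^\W\d_]+ matches runs of letters, including accented ones.
        for word in re.findall(r"[^\W\d_]+", line):
            # ASSUMPTION: matching ignores the case of the word in the text.
            if word.lower() in typo_words:
                yield number, word
```

In practice the lines would come from an open file, for example
load_typo_words(open("bookloupe.typ", encoding="utf-8")).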
   373 
   374 
   375     Ignore DP markup (--dp switch)
   376 
   377       Distributed Proofreaders (http://www.pgdp.net) has for some
   378       time been the main source of PG texts, and proofers there use
   379       special conventions. This switch understands those conventions,
   380       so that people can use bookloupe on files in process that have
   381       not yet had the special conventions removed. The special
   382       conventions supported are page-separators and
   383       "<sc>", "</sc>", "/*", "*/", "/#", "#/", "/$", "$/".
   384  
   385 
   386     Dump the current configuration (--dump-config switch)
   387 
   388       The --dump-config switch can be used to dump the current
   389       configuration. This is a combination of the internal defaults,
   390       the configuration file (if any) and the command line options.
   391       If a configuration file is present, any comments found in that
   392       file will be preserved in the dumped configuration. If there
   393       is no configuration file, then a default set of comments to
   394       go with the internal default configuration is generated.
   395 
   396 
   397 You will probably only run bookloupe on a text once or maybe twice,
   398 just prior to uploading; it usually finds a few formatting problems;
   399 it also usually finds queries that aren't problems at all - it often
   400 questions Tables of Contents for having short lines, for example.
   401 These are called "false positives," and need a human to decide on
   402 them.
   403 
   404 The text should be standard prose, and already close to PG normal
   405 format (plain text, about 70 characters per line with blank lines
   406 between paragraphs).
   407 
   408 Bookloupe merely draws your attention to things that might be errors.
   409 It is NOT a substitute for human judgement. Formatting choices like
   410 short lines may be for a reason that this program can't understand.
   411 
   412 Even the most careful human proofing can leave errors behind in a
   413 text, and there are several automated checks you can do to help find
   414 them. Of these, spellchecking (with _very_ careful human judgement) is
   415 the most important and most useful.
   416 
   417 Bookloupe does perform some basic typo-checking if you ask it to,
   418 but its focus is on formatting errors specific to PG texts--
   419 mismatched quotes, non-ASCII characters, bad spacing, bad line
   420 length, HTML tags perhaps left from a conversion, unbalanced
   421 brackets.
   422 
   423 Suggestions for additional checks would be appreciated and duly
   424 considered, but no guarantees that they will be implemented.
   425 
   426 
   427 
   428 
   429         How does Jim Tinsley use gutcheck?
   430 
   431 Practically everyone I give gutcheck to asks me how _I_ use it.
   432 Well, when I get a text for posting, say filename.txt, I run
   433 
   434     gutcheck -o filename.txt
   435 
   436 That gives me a quick idea what I'm dealing with. It'll tell
   437 me what kind of problems gutcheck sees, and give me an idea
   438 of how much more work needs to be done on the text. Keep in
   439 mind that gutcheck doesn't do anything like a full spellcheck,
   440 but when I see a text that has a lot of problems, I assume that
   441 it probably needs a spellcheck too.
   442 
   443 Having got a feel for the ballpark, I run
   444 
   445     gutcheck filename.txt > jj
   446 
   447 where jj is my personal, all-purpose filename for temporary data
   448 that doesn't need to be kept. Then I open filename.txt and jj in
   449 a split-screen view in my editor, and work down the text, fixing
   450 whatever needs fixing, and skipping whatever doesn't. If your
   451 editor doesn't split-screen, you can get much the same effect by
   452 opening your original file in your normal editor, and jj (or your
   453 equivalent name) in something like Notepad, keeping both in view
   454 at the same time.
   455 
   456 Twice a day, an automatic process looks at all recently-posted
   457 texts, and emails Michael, me, and sometimes other people with
   458 their gutcheck summaries.
   459 
   460 
   461 
   462 Explanations of common bookloupe messages:
   463 
   464     --> 74 lines in this file have white space at end
   465 
   466     PG texts shouldn't have extra white space added at end of line.
   467     Don't worry too much about this; they're not doing any harm,
   468     and they'll be removed during posting anyway.
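
For the curious, counting such lines takes very little code. The Python
fragment below is only an illustration with a hypothetical function name; it
treats trailing spaces and tabs as white space.

```python
def count_trailing_whitespace(text):
    """Count lines that end in one or more spaces or tabs."""
    return sum(1 for line in text.splitlines()
               if line != line.rstrip(" \t"))
```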
   469 
   470 
   471     --> 348 lines in this file are short. Not reporting short lines.
   472     --> 84 lines in this file are long. Not reporting long lines.
   473     --> 8 lines in this file are VERY long!
   474 
   475     If there are a lot of long or short lines, bookloupe won't list
   476     them individually. The short lines version of this message
   477     is commonly seen when gutchecking poetry and some plays, where
   478     the normal line length is shorter than the standard for prose.
   479     A "VERY long" line is one over 80 characters.  You normally
   480     shouldn't have any of these, but sometimes you may have to render
   481     a table that must be that long, or some special preformatted
   482     quotation that can't be broken.
   483 
   484 
   485     --> There are 75 spaced dashes and em-dashes in this file. Not reporting them.
   486 
   487     The PG standard for an em-dash--like these--is two minus signs
   488     with no spaces before or after them. However, some older texts
   489     used spaced dashes - like these -- and if there are very many
   490     such spaced dashes in the file, bookloupe just draws your
   491     attention to it and doesn't list them individually.
   492 
   493 
   494 
   495     Line 3020 - Non-ASCII character 233
   496 
   497     Standard PG texts should use only ASCII characters with values
   498     up to 127; however, non-English, accented characters can be
   499     represented according to several different non-ASCII encoding
   500     schemes, using values over 127. If you have a plain English text
   501     with a few accented characters in words like cafe or tete-a-tete,
   502     you might replace the accented characters with their unaccented
   503     versions. The English pound sign is another commonly-seen
   504     non-ASCII character. If you have enough non-ASCII characters in
   505     your text that you feel removing them would degrade your text,
   506     you should probably consider doing a UTF-8 text.
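
The number in the message is the character's code value; 233, for instance,
is a lower-case e with an acute accent. The Python sketch below produces
messages in the same shape. It is an illustration only, with a hypothetical
function name, and it assumes one report per line.

```python
def non_ascii_queries(text):
    """Return one 'Non-ASCII character' message per affected line."""
    messages = []
    for number, line in enumerate(text.splitlines(), start=1):
        for ch in line:
            if ord(ch) > 127:
                messages.append(
                    f"Line {number} - Non-ASCII character {ord(ch)}")
                break  # ASSUMPTION: report only the first one on a line
    return messages
```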
   507 
   508 
   509 
   510     Line 1207 - Non-ISO-8859 character 156
   511 
   512     Even in "8-bit" texts, there are distinctions between code sets.
   513     The ISO-8859 family of 8-bit code sets is the most commonly used
   514     in PG, and these sets do not define values in the range 128 through
   515     159 as printable characters. It's quite common for someone on a
   516     Windows or Mac machine to use a non-ISO character inadvertently,
   517     so this message warns that the character is not only not ASCII,
   518     but also outside the ISO-8859 range.
   519 
   520 
   521 
   522     Line 46 - Tab character?
   523 
   524     Some editors and WPs will put in Tab characters (character 9) to
   525     indicate indented text. You should not use these in a PG text,
   526     because you can't be sure how they will appear on a reader's
   527     screen. Find the Tab, and replace it with the appropriate number
   528     of spaces.
   529 
   530 
   531 
   532     Line 1327 - Tilde character?
   533 
   534     The tilde character (~) might be legitimately used, but it's the
   535     character commonly used by OCR software to indicate a place where
   536     it couldn't make out the letter, so bookloupe flags it.
   537 
   538 
   539 
   540     Line 1347 - Asterisk?
   541 
   542     Asterisks are reported only in paranoid mode (see -x).
   543     Like tildes, they are often used to indicate errors, but they are
   544     also legitimately used as line delimiters and footnote markers.
   545 
   546 
   547 
   548     Line 1451 - Long line 129
   549 
   550     PG texts should have lines shorter than 76 characters. There may be
   551     occasions where you decide that you really have to go out to 79
   552     characters, but the sample above says that line 1451 is 129
   553     characters long--probably two lines run together.
   554 
   555 
   556 
   557     Line 1590 - Short line?
   558 
   559     PG texts should have lines longer than 54 characters. However,
   560     there are special cases like poetry and tables of contents where
   561     the lines _should_ be shorter. So treat bookloupe warnings about
   562     short lines carefully. Sometimes it's a genuine formatting
   563     problem; sometimes the line really needs to be short.
   564 
   565     Hint: bookloupe will not flag lines as short if they are indented
   566     (that is, if they start with a space). I like to start inserted stanzas
   567     and other such items indented with a couple of spaces so that
   568     they stand out from the main text anyway.
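
Putting the figures from the long-line and short-line explanations together
gives the following Python sketch. The constant and function names are
hypothetical, and the sketch looks at each line in isolation; a real checker
also has to allow for the last line of a paragraph, which is naturally short,
and this sketch does not attempt that.

```python
LONGEST_LINE = 75    # "lines shorter than 76"
VERY_LONG_LINE = 80  # a "VERY long" line is one over 80 characters
SHORTEST_LINE = 55   # "lines longer than 54 characters"


def classify_line(line):
    """Return 'very long', 'long', 'short' or None for one line of text."""
    length = len(line)
    if length > VERY_LONG_LINE:
        return "very long"
    if length > LONGEST_LINE:
        return "long"
    # Blank lines and indented lines are never reported as short.
    if line and not line[0].isspace() and length < SHORTEST_LINE:
        return "short"
    return None
```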
   569 
   570 
   571 
   572     Line 1804 - Begins with punctuation?
   573 
   574     Lines should normally not begin with commas, periods and so on.
   575     An exception is ellipses . . . which can happen at start of line.
   576 
   577 
   578 
   579     Line 1850 - Spaced em-dash?
   580 
   581     The PG standard for an em-dash--like these--is two minus signs
   582     with no spaces before or after them. Bookloupe flags non-PG
   583     em-dashes - like this one. Normally, you will replace it with a
   584     PG-standard em-dash.
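
A simple way to find and repair such dashes is a regular expression. The
Python sketch below is an assumption-laden illustration, not bookloupe's own
test: it only recognises one or two hyphens with white space on both sides,
and the names are hypothetical.

```python
import re

# One or two hyphens with white space on both sides.
SPACED_DASH = re.compile(r"\s-{1,2}\s")


def has_spaced_dash(line):
    """Return True if the line contains a spaced dash."""
    return SPACED_DASH.search(line) is not None


def unspace_dashes(line):
    """Replace each spaced dash with the PG-standard unspaced '--'."""
    return SPACED_DASH.sub("--", line)
```

Review the result by hand: a hyphen used as a minus sign or in a list could
match the same pattern.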
   585 
   586 
   587 
   588     Line 1904 - Query he/be error?
   589 
   590     Bookloupe makes a very minor effort to look for that scourge of all
   591     proofreaders, "be" replacing "he" or vice-versa, and draws your
   592     attention to it when it thinks it has found one.
   593 
   594 
   595 
   596     Line 2017 - Query digit in a1most
   597 
   598     The digit 1 is commonly OCRed for the letter l, the digit 0 for
   599     the letter O, and so on. When bookloupe sees a mix of digits and
   600     letters, it warns you. It may generate a false positive for
   601     something like 7am.
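
The test amounts to finding words that contain both letters and digits, as
in this Python sketch. It is illustrative only: the function name is
hypothetical and only unaccented ASCII letters are considered.

```python
import re

WORD = re.compile(r"[A-Za-z0-9]+")


def digit_queries(line):
    """Return the words on a line that mix letters and digits."""
    return [word for word in WORD.findall(line)
            if re.search(r"[0-9]", word) and re.search(r"[A-Za-z]", word)]
```

As the explanation above warns, a legitimate form such as 7am is returned
alongside a genuine error such as a1most.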
   602 
   603 
   604 
   605     Line 2083 - Query standalone 0
   606 
   607     In paranoid mode (see -x) only, bookloupe warns about the digit 0
   608     and the number 1 standing alone as a word. This can happen if the
   609     OCR misreads the words O or I.
   610 
   611 
   612 
   613     Line 2115 - Query word whetber
   614 
   615     If you have switched typo-checking on, bookloupe looks for
   616     potential typos, especially common h/b errors. It's not
   617     infallible; it sometimes queries legit words, but it's
   618     always worth taking a look.
   619 
   620 
   621 
   622     Line 2190 column 14 - Missing space?
   623 
   624     Omitting a space is a very common error,especially coming from
   625     OCRed text,and can be hard for a human to spot. The commas in
   626     the previous sentence illustrate the kind of thing I mean.
   627 
   628 
   629 
   630     Line 2240 column 48 - Spaced punctuation?
   631 
   632     The flip side of the "missing space" error , here , is when extra
   633     spaces are added before punctuation . Some old texts appear to add
   634     extra spaces around punctuation consistently, but this was a
   635     typographical convention rather than the author's intent, and the
   636     extra "spaces" should be removed when preparing a PG text.
   637 
   638 
   639 
   640     Line 2301 column 19 - Unspaced quotes?
   641 
   642     Another common spacing problem occurs in a phrase like "You wait
   643     there,"he said.
   644 
   645 
   646 
   647     Line 2385 column 27 - Wrongspaced quotes?
   648 
   649     Bookloupe checks whether a quote seems to be a start or end quote,
   650     and queries those that appear to be misplaced. This does give rise
   651     to false positives when quotes are nested, for example:
   652 
   653     "And how," she asked, "will your "friends" help you now?"
   654 
   655     but these false positives are worth it because of the many cases
   656     that this test catches, notably those like:
   657 
   658     "And how, "she said," will your friends help you now?"
   659 
   660     Sometimes a "wrongspaced quotes" query will arise because an earlier
   661     quote in the paragraph was omitted, so if the place specified seems
   662     to be OK, look back to see whether there's a problem in the preceding
   663     lines.
   664 
   665 
   666 
   667     Line 2400 - HTML Tag? <PRE>
   668 
   669     Some PG texts have been converted from HTML, and not all of the
   670     HTML tags have been removed.
   671 
   672 
   673 
   674     Line 2402 - HTML symbol? &emdash;
   675 
   676     Similarly, special HTML symbol characters can survive into PG
   677     texts. This check can occasionally produce amusing false positives
   678     like . . . Marwick & Co were well known for it;
   679 
   680 
   681 
   682     Line 2540 - Mismatched quotes
   683 
   684     Another bookloupe mainstay--unclosed doublequotes in a paragraph.
   685     See the discussion of quotes in the switches section near the
   686     start of this file.
   687 
   688     Because the mismatch doesn't occur on any one line, bookloupe quotes
   689     the line number of the first blank line following the paragraph,
   690     which is the point where it reconciles the count of quotes.
   691     However, if bookloupe is echoing lines, that is, you haven't used
   692     the -e switch, it will show the _first_ line of the paragraph,
   693     to help you find the place without using line numbers. The
   694     offending paragraph is therefore between the quoted line and
   695     the line number given.
   696 
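    The reporting scheme described above can be sketched in Python as
    follows. This is a simplified illustration, not bookloupe's code:
    the function name is hypothetical, and the sketch ignores any
    special cases the real program allows for (see the discussion of
    quotes in the switches section).

```python
# Hypothetical helper for illustration; not part of bookloupe.
def mismatched_quote_paragraphs(lines):
    """Return (blank_line_number, first_line_of_paragraph) for every
    paragraph holding an odd number of double quotes. Line numbers
    are 1-based."""
    reports = []
    count = 0
    first_line = None
    # The extra empty string stands in for the blank line that would
    # close a final paragraph running to the end of the file.
    for number, line in enumerate(list(lines) + [""], start=1):
        if line.strip():
            if first_line is None:
                first_line = line
            count += line.count('"')
        else:
            # Blank line: reconcile the count for the paragraph just ended.
            if first_line is not None and count % 2:
                reports.append((number, first_line))
            count = 0
            first_line = None
    return reports

text = ['"Hello, he said.', 'More text.', '', '"Fine," she said.', '']
print(mismatched_quote_paragraphs(text))  # prints [(3, '"Hello, he said.')]
```

    The offending paragraph lies between the echoed first line and the
    reported line number, just as in the description above.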
   697 
   698 
   699     Line 2587 - Mismatched single quotes
   700 
   701     Only checked with the -s switch, since checking single quotes is
   702     not a very reliable process. Otherwise, the same logic as for
   703     doublequotes applies.
   704 
   705 
   706 
   707     Line 2877 - Mismatched round brackets?
   708 
   709     Curly and square brackets are checked too. Texts with many brackets,
   710     like plays with bracketed stage instructions, may have mismatches.
   711 
   712 
   713     Line 3150 - No CR?
   714     Line 3204 - Two successive CRs?
   715     Line 3281 position 75 - CR without LF?
   716 
   717     These are the invalid line-end warnings. See the discussion of
   718     line-end checking in the switches section near the start of this
   719     file. If you see these, and your editor doesn't show anything
   720     wrong, you should probably try deleting the characters just before
   721     and after the line end, and the line-end itself, then retyping the
   722     characters and the line-end.
   723 
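    If your editor hides the problem, a short script can help locate
    it. The Python sketch below is an illustration under stated
    assumptions, not bookloupe's own test: it assumes the file is meant
    to use CR LF line ends throughout, and it assumes each message means
    what its wording suggests. The function name is hypothetical.

```python
# Hypothetical helper for illustration; not part of bookloupe.
def line_end_queries(data):
    """Scan raw bytes and return (line_number, message) pairs for line
    ends that are not a single CR followed by LF."""
    queries = []
    lines = data.split(b"\n")
    if lines and lines[-1] == b"":
        lines.pop()  # the data ended with LF, so there is no extra line
    for number, line in enumerate(lines, start=1):
        body = line.rstrip(b"\r")
        trailing_crs = len(line) - len(body)
        if trailing_crs == 0:
            queries.append((number, "No CR?"))
        elif trailing_crs >= 2:
            queries.append((number, "Two successive CRs?"))
        if b"\r" in body:
            # A CR in the middle of a line has no LF after it.
            queries.append((number, "CR without LF?"))
    return queries

sample = b"one\r\ntwo\nthree\r\r\nfo\rur\r\n"
for query in line_end_queries(sample):
    print(query)
```

    To check a real file, read it in binary mode, for example with
    open("yourfile.txt", "rb"), so that Python does not translate the
    line ends before the script sees them.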
   724 
   725     Line 2940 - Paragraph starts with lower-case
   726 
   727     A common error in an e-text is for an extra blank line
   728 
   729     to be put in, like the blank line above, and this often
   730     shows up as a new paragraph beginning with lower case.
   731     Sometimes the blank line is deliberate, as when a
   732     quotation is inserted in a speech. Use your judgement.
   733 
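    A check of this kind is simple to sketch in Python. The function
    name is hypothetical, and the rule (first non-space character of the
    first line after a blank line is a lower-case letter) is an
    assumption for this example rather than bookloupe's exact test.

```python
# Hypothetical helper for illustration; not part of bookloupe.
def lowercase_paragraph_starts(lines):
    """Return 1-based numbers of lines that open a paragraph with a
    lower-case letter."""
    hits = []
    previous_blank = True  # the start of the file counts as a boundary
    for number, line in enumerate(lines, start=1):
        text = line.strip()
        if text and previous_blank and text[0].islower():
            hits.append(number)
        previous_blank = not text
    return hits

sample = [
    "A common error in an e-text is for an extra blank line",
    "",
    "to be put in, like the blank line above.",
]
print(lowercase_paragraph_starts(sample))  # prints [3]
```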
   734 
   735     Line 2987 - Extra period?
   736 
   737     An extra period. is a. common problem in OCRed text. and usually
   738     arises when a speck of dust on the page is mistaken for a period.
   739     or. as occasionally happens. when a comma loses its tail.
   740 
   741 
   742     Line 3012 column 12 - Double punctuation?
   743 
   744     Double punctuation., like that,, is a common typo and
   745     scanno. Some books have much legit double punctuation,
   746     like etc., etc., but it's worth checking anyway.
   747 
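    As a rough illustration, the Python sketch below finds pairs of
    adjacent punctuation marks and reports their columns. The set of
    marks, the choice to skip runs of three or more (such as an
    ellipsis) and the function name are all assumptions for this
    example. Bookloupe's own rules may differ.

```python
import re

# Exactly two adjacent marks from the set . , ; : ! ?
# Longer runs, such as "...", are deliberately not matched.
DOUBLE_PUNCT = re.compile(r"(?<![.,;:!?])[.,;:!?]{2}(?![.,;:!?])")

# Hypothetical helper for illustration; not part of bookloupe.
def double_punctuation_columns(line):
    """Return the 1-based column of each pair of adjacent punctuation marks."""
    return [m.start() + 1 for m in DOUBLE_PUNCT.finditer(line)]

print(double_punctuation_columns("Double punctuation., like that,, is common"))
# prints [19, 31]
```

    A legitimate "etc., etc." is reported by this sketch as well, which
    is why each query still needs checking by eye.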
   748 
   749 
   750             *       *       *       *
   751 
   752 For Windows-only users who are unfamiliar with DOS:
   753 
   754     If you're a Windows-only user, you need to save
   755     bookloupe.exe into the folder (directory) where the
   756     text file you want to check is. Let's say your
   757     text file is in C:\gut, then you should save
   758     bookloupe.exe into C:\gut.
   759 
   760     Now get to a console. You can do this by
   761     selecting the "Command Prompt" or "MS-DOS Prompt"
   762     option that will be somewhere on your
   763     Start/Programs menu.
   764 
   765     Now get into the C:\gut directory.
   766     You can do this using the cd (change directory)
   767     command, like this:
   768         cd \gut
   769     and your prompt will change to
   770         C:\gut>
   771     so you know you're in the right place.
   772 
   773     Now type
   774         bookloupe yourfile.txt
   775     and you'll see bookloupe's report.
   776 
   777     By default, bookloupe prints its queries to the screen.
   778     If you want to create a file of them, to edit
   779     against the text, you can use the greater-than
   780     sign (>) to tell it to output the report to a
   781     file. For example, if you want its report in a
   782     file called queries.lst, you could type
   783 
   784         bookloupe yourfile.txt > queries.lst
   785 
   786     The queries.lst file will then contain the listing
   787     of possible formatting errors, and you can
   788     edit it alongside your text.
   789 
   790     Whatever you do, DON'T make the filename after
   791     the greater-than sign the name of a file already
   792     on your disk that you want to keep, because
   793     the greater-than sign causes any existing file
   794     of that name to be replaced.
   795 
   796     So, for example, if you have two Tolstoy files
   797     that you want to check, called WARPEACE.TXT and
   798     ANNAK.TXT, make sure that neither of these names
   799     is ever used following the greater-than sign.
   800     To check these correctly, you might do:
   801 
   802         bookloupe warpeace.txt > war.lst
   803 
   804     and
   805 
   806         bookloupe annak.txt > annak.lst
   807 
   808     separately. Then you can look at war.lst and annak.lst
   809     to see the bookloupe reports.
   810 
   811 For Windows-only users who want to use bookloupe from guiguts:
   812 
   813     1) If you haven't already done so, download bookloupe-win32-xxx.zip
   814     from http://www.juiblex.co.uk/pgdp/bookloupe/
   815 
   816     2) Extract the files into a suitable folder, e.g. C:\DP\bookloupe
   817 
   818     3) Start Guiguts
   819 
   820     4) Choose Preferences | File Paths | Set File Paths...
   821 
   822     5) Click the "Locate Gutcheck..." button
   823 
   824     6) Browse to the folder where you extracted bookloupe
   825 
   826     7) Double-click bookloupe.exe
   827 
   828     Now, whenever you do "Gutcheck" in Guiguts, it will run bookloupe
   829     instead. Since the output will look very like gutcheck output, you
   830     may want to check that it is actually bookloupe that is running. To do
   831     this, look at the black command line message window, which will say:
   832 
   833     "bookloupe: Check and report on an e-text".
   834 
   835     To return to using gutcheck for any reason, repeat steps 4 and 5
   836     above, and then,
   837 
   838     6b) Browse back to the gutcheck folder, which is in a "tools"
   839     folder inside the main Guiguts folder. It will be something like
   840     "C:\DP\guiguts-win\tools\gutcheck", depending on where you installed
   841     Guiguts originally.
   842 
   843     7b) Double-click gutcheck.exe
   844 
   845     Now doing "Gutcheck" in Guiguts will run gutcheck itself, and the
   846     message in the black window should read:
   847 
   848     "gutcheck: Check and report on an e-text".