doc/bookloupe.txt
author ali <ali@juiblex.co.uk>
Thu Dec 05 10:58:41 2013 +0000 (2013-12-05)
changeset 219 98fc47ee1beb
parent 217 0c0f6373324e
permissions -rw-r--r--
Added tag 2.0.69 for changeset b01d4a64a929

                            Bookloupe documentation


bookloupe: lists possible common formatting errors in a Project
Gutenberg candidate file. Bookloupe is based on gutcheck, written
by Jim Tinsley. It is a command line program and can be used under
Microsoft Windows, Mac or Unix. For Windows-only people, there is
an appendix at the end with brief instructions for running it.

Current version: 2.0.69, an alpha version leading up to version 2.1

This software is Copyright Jim Tinsley 2000-2005 and
J. Ali Harlow 2012 onwards.

Bookloupe comes with ABSOLUTELY NO WARRANTY. For details, read the file COPYING.
This is Free Software; you may redistribute it under certain conditions (GPL).

See http://www.juiblex.co.uk/pgdp/bookloupe/ for the latest version.


                       Compatibility with guiguts v1.0.25

Versions of guiguts up to at least 1.0.25 have a bug in the way that they
prepare a copy of the ebook for gutcheck (or bookloupe) to check. This causes
problems with ebooks that contain Unicode characters not present in Latin-1.

The guiguts bug report is here: http://sourceforge.net/p/guiguts/bugs/95/
The bug report also includes details of how to edit guiguts to work around
the problem until an official fix is released.


                          Recent changes in behaviour

Each new version of bookloupe brings bug fixes and improvements. Sometimes
the behaviour is also changed in ways that might be unexpected:

Odd characters

    The check for "odd" characters (tab, tilde, caret, forward slash and
    asterisk) was disabled in bookloupe 2.0 when the character set was
    switched from ASCII/ISO-8859-1 to UNICODE (i.e., when the "There are a
    lot of foreign letters here." message was printed). As of bookloupe 2.1
    these tests operate independently of the character set selected.

    Users may notice this change most especially in the case of the
    DP-specific /* ... */ markup. Bookloupe 2.0 often did not warn when
    this markup was encountered even when the --dp switch was not given.
    Bookloupe 2.1 will warn about this markup unless dp-specific mode is
    switched on, paranoid mode is switched off or the ebook contains more
    than 10 lines containing asterisks. In the last case

      --> 11 lines in this file contain asterisks. Not reporting them.

    will be printed.


Usage is: bookloupe [OPTION...] filename

Options:
      -d, --dp                  ignores some DP-specific markup
      -e, --no-echo             switches off Echoing of lines
      -s, --squote              checks Single quotes
      --typo                    checks Typos
      -p, --qpara               sets strict quotes checking for Paragraphs
      --no-paranoid             switches OFF typo checking and extra checks
      -l, --no-line-end         turns off Line-end checks
      -o, --overview            produces an Overview only
      -y, --stdout              sets error messages to stdout
      -h, --header              echoes the header fields
      -m, --markup              ignores some common HTML markup
      -u, --usertypo            warns about words in a user-defined typo file
      -v, --verbose             forces individual reporting of minor problems
      -w, --web                 special mode for web uploads (for future use)
      --charset=NAME            the set of characters valid for this ebook
      --dump-config             dumps the current configuration

There are also inverted options available which are useful when it is
desired to override an option set in the configuration file:

      --no-dp, --echo, --no-squote, --no-typo, --no-qpara, --paranoid,
      --line-end, --no-overview, --no-stdout, --no-header, --no-markup,
      --no-usertypo, --no-verbose.

Note: there is no --no-web since --web simply selects a set of options.

Finally there are a couple of options that toggle the state of options
rather than setting or unsetting them: -t (for typo) and -x (for typo
and paranoid). These are mainly intended for compatibility with gutcheck.

Running bookloupe without any parameters will display a brief help message.

Sample usage:

    bookloupe warpeace.txt


More detail:

    Configuration file

      Bookloupe will look for a file named bookloupe.ini to read as
      a configuration file. Options set in a configuration file can
      be overridden from the command line as required.

      The following directories are searched in order:

        1) The current working directory. When run from the command
        line, this is the directory you ran it from. When run from
        guiguts it will normally be the directory that contains the
        guiguts program.

        2) The directory containing the bookloupe program.

        3) The user's configuration directory. Under MS-Windows this
        is normally CSIDL_LOCAL_APPDATA which is typically set to
        C:\Documents and Settings\<user>\Local Settings\Application Data.
        On other platforms this is normally $XDG_CONFIG_HOME which, if
        not set, defaults to $HOME/.config.

      The directories to search can also be changed using the
      $BOOKLOUPE_CONFIG_PATH environment variable which is a colon
      separated (semi-colon separated under MS-Windows) list of
      directories.
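
      The search order above can be sketched in Python as follows. This
      is an illustration only, not bookloupe's own code: the function
      name is hypothetical, LOCALAPPDATA stands in for
      CSIDL_LOCAL_APPDATA, and the sketch assumes that
      $BOOKLOUPE_CONFIG_PATH replaces the default list rather than
      adding to it.

```python
import os
import sys
from pathlib import Path


def config_search_dirs(program_dir, env=None, platform=None):
    """Return the directories to search for bookloupe.ini, in order."""
    env = os.environ if env is None else env
    platform = sys.platform if platform is None else platform
    is_windows = platform.startswith("win")

    override = env.get("BOOKLOUPE_CONFIG_PATH")
    if override:
        # Assumption: the variable replaces the default list outright.
        separator = ";" if is_windows else ":"
        return [Path(d) for d in override.split(separator) if d]

    if is_windows:
        # Assumption: LOCALAPPDATA stands in for CSIDL_LOCAL_APPDATA.
        user_dir = Path(env.get("LOCALAPPDATA", ""))
    else:
        xdg = env.get("XDG_CONFIG_HOME")
        user_dir = Path(xdg) if xdg else Path(env.get("HOME", "")) / ".config"

    # 1) current directory, 2) program directory, 3) user configuration.
    return [Path.cwd(), Path(program_dir), user_dir]
```

      Passing env and platform explicitly makes it easy to see what
      would be searched on another system without changing your own
      environment.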

      The configuration file is a key file. This is very similar to,
      but not identical to, a typical ini file as found under MS-Windows.
      Key files consist of a number of groups which start with the
      group name enclosed in square brackets on a line by itself.
      Bookloupe recognises just one group, "options". Then below the
      group name there follow the keys and their values for that
      group, one per line in the format key=value. Most of bookloupe's
      options are flags (i.e., either on or off). For these keys, the
      value must be either "true" or "false". The file may also contain
      comment lines which begin with the # symbol. The names of the
      keys follow the long option names.

      A sample configuration file is provided (in sample.ini). The file
      will need to be copied to bookloupe.ini before bookloupe will
      read it. You can also use the --dump-config option to write a
      configuration file for you. For example, if you typically want
      to run bookloupe with the --dp and --squote options, then you
      might do:

        $ bookloupe --dp --squote --dump-config > configuration.ini
        $ ren configuration.ini bookloupe.ini

      (Don't be tempted to merge these two steps or bookloupe will see
      an empty configuration file and complain. The ren command is for
      MS-Windows; on Unix use mv instead.)

      This same idea can also be used to modify an existing configuration.
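
      To make the key file format concrete, here is a small Python
      sketch that reads the "options" group. The sample contents are
      hypothetical (the key names are assumed from the long option
      names), and Python's configparser is only an approximation of the
      key file parser bookloupe itself uses.

```python
import configparser

# Hypothetical file contents; key names assumed from the long options.
SAMPLE = """\
# Options I normally want for DP projects
[options]
dp=true
squote=true
verbose=false
"""


def read_flags(text):
    """Return the flags in the "options" group as a dict of booleans."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    flags = {}
    for key, value in parser["options"].items():
        # Flags must be spelled exactly "true" or "false".
        if value not in ("true", "false"):
            raise ValueError(f"{key}: expected true or false, got {value!r}")
        flags[key] = value == "true"
    return flags


print(read_flags(SAMPLE))
```

      A non-flag key such as charset would need separate handling; this
      sketch deliberately rejects anything that is not true or false.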


    Character encoding

      Bookloupe will handle e-texts encoded in UTF-8 (preferred),
      ISO-8859-1 (also known as Latin-1), or WINDOWS-1252 (also known,
      incorrectly, as ANSI). The output will be in the same encoding
      as the input e-text.


    Character set (--charset)

      Character encodings have an implicit set of characters that
      can be encoded and thus define a set of characters that can
      be present in the text. However, sometimes it is desirable
      that not all characters that can be encoded should be present
      in a text. The set of characters that should be present is
      known as the character set.

      The default setting for the character set (called auto) does
      the same as gutcheck for Windows-1252 encoded texts, for
      compatibility:

      If the file is predominantly ASCII then the set of legal
      characters is ASCII and warnings are issued whenever non-ASCII
      characters are encountered. The message will warn of either
      non-ASCII or non-ISO-8859-1 characters, as appropriate.

      If the file contains a significant number of non-ASCII characters
      then a message is printed as follows:

        --> There are a lot of foreign letters here. Not reporting them.

      and the character set is widened to include all possible
      characters.

      For UTF-8 encoded texts, auto selects UNICODE.

      Most character sets are simply defined in bookloupe as the
      set of all characters that can be encoded in the encoding of
      the same name. UNICODE is an exception and includes only the
      characters assigned in the relevant Unicode standard but
      excluding the Private Use Area characters. Note that the
      relevant Unicode standard is given by the version of glib in
      use rather than by any code in bookloupe and thus can vary
      from system to system. PG texts however are likely to be
      using characters assigned in very early Unicode standards,
      thus mitigating this issue.
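
      The auto behaviour for 8-bit texts can be pictured with the
      Python sketch below. It is not bookloupe's code: the function
      name is hypothetical and the threshold of 20 is an invented
      stand-in, since this document does not say how many non-ASCII
      characters count as "significant".

```python
def auto_charset(text, threshold=20):
    """Pick the set of legal characters for an 8-bit text."""
    non_ascii = sum(1 for ch in text if ord(ch) > 127)
    # Assumption: "significant" means more than `threshold` characters.
    if non_ascii > threshold:
        # "There are a lot of foreign letters here." The set is widened.
        return "UNICODE"
    # Predominantly ASCII: each non-ASCII character will be queried.
    return "ASCII"
```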


    Echoing lines (--no-echo to switch off)

      You may find it convenient, when reviewing Bookloupe's
      suggestions, to see the line that Bookloupe is questioning.
      That way, you can often see at a glance whether it is
      a real error that needs to be fixed, or a false positive
      that should be in the text, but Bookloupe's limited
      programming doesn't understand.

      By default, bookloupe echoes these lines, but if you don't
      want to see the lines referred to, --no-echo will switch it
      OFF.


    Quotes (--squote and --qpara switches)

      Bookloupe always looks for unbalanced doublequotes in a
      paragraph. It is a common convention for writers not to
      close quotes in a paragraph if the next paragraph opens
      with quotes and is a continuation by the same speaker.

      Bookloupe therefore does not normally report unclosed quotes
      if the next paragraph begins with a quote. If you need
      to see all unclosed quotes, even where the next paragraph
      begins with a quote, you should use the -p switch.

      Singlequotes (', `, ‘ and ’) are a problem, since the same
      character can be used for an apostrophe. I'm not sure that it
      is possible to get 100% accuracy on singlequotes checking,
      particularly since dialect, quite common in PG texts,
      upsets the normal rules so badly. Consider the sentence:
        'Tis often said that a man's a man for a' that.
      As humans, we recognize that all three apostrophes are used
      for contractions rather than quotes, but it isn't easy
      to get a program to recognize that.

      Since bookloupe makes too many mistakes when trying to match
      singlequotes, it doesn't look for unbalanced singlequotes
      unless you specify the --squote switch.

      Consider these sentences, which illustrate the main cases:

        'Tis often said that a fool and his money are soon parted.

        'Becky's goin' home,' said Tom.

        The dogs' tails wagged in unison.

        Those 'pack dogs' of yours look more like wolves.
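
      The doublequote rule described above (normal versus -p) can be
      sketched in Python like this. The function and its arguments are
      hypothetical, and the sketch looks only at the plain " character;
      it makes no attempt at the singlequote cases.

```python
def unclosed_quote_paragraphs(paragraphs, strict=False):
    """Return indexes of paragraphs with an odd number of double quotes.

    With strict=False, a paragraph is excused when the next paragraph
    opens with a quote (the same speaker continuing). strict=True
    plays the part of the -p switch and reports every one.
    """
    flagged = []
    for i, para in enumerate(paragraphs):
        if para.count('"') % 2 == 0:
            continue  # quotes balance within this paragraph
        following = paragraphs[i + 1] if i + 1 < len(paragraphs) else ""
        continued = following.lstrip().startswith('"')
        if strict or not continued:
            flagged.append(i)
    return flagged
```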


    Typos (--typo switch)

      It's not bookloupe's job to be a spelling checker, but it does
      check for a list of common typos and OCR errors if you use the
      --typo switch. (The -t and -x switches also toggle typo checking.)

      It also checks for character combinations, especially involving
      h and b, which are often confused by OCR, that rarely or never
      occur. For example, it queries "tbe" in a word. Now, "the" often
      occurs, but "tbe" is very rare (heartbeat, hotbed), so I'm
      playing the odds - a few false positives for many errors found.
      Similarly with "ii", which is a very common OCR error.

      Bookloupe suppresses repeat reports of the first 40 "typos"
      found. This is to remove the annoyance of seeing something like
      "FN" (footnote) or "LK" (initials) flagged as a typo 147 times
      in a text.
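
      As a rough Python illustration of playing the odds, the sketch
      below queries words containing the two combinations named above
      and reports each of the first 40 distinct words only once. The
      names are hypothetical and bookloupe's real lists are longer.

```python
import re

# Assumption: only the two combinations mentioned in the text above.
SUSPECT = re.compile(r"tbe|ii")
MAX_TRACKED = 40


def query_words(text):
    """Return suspect words, suppressing repeats of tracked words."""
    tracked = []
    reports = []
    for word in re.findall(r"[A-Za-z]+", text):
        lower = word.lower()
        if not SUSPECT.search(lower):
            continue
        if lower in tracked:
            continue  # already reported once
        if len(tracked) < MAX_TRACKED:
            tracked.append(lower)
        reports.append(word)
    return reports
```

      Note that "hotbed" and "skiing" would both be queried: these are
      the false positives accepted in exchange for catching "tbe".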


    Line-end checking (--no-line-end switch to disable)

      All PG texts should have a Carriage Return (CR - character 13)
      and a Line Feed (LF - character 10) at end of each line,
      regardless of what O/S you made them on. DOS/Windows, Unix
      and Mac have different conventions, but the final text should
      always use a CR/LF pair as its line terminator.

      By default, bookloupe verifies that every line has the correct
      terminator. If you're working on a text in progress under
      Linux, however, you might want to convert the line-ends as a
      final step, and not want to see thousands of errors every time
      you run bookloupe before then. In that case, you can turn off
      this checking with the --no-line-end switch.
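
      If you want to see what this check amounts to, the following
      Python sketch lists the lines of a file that do not end in a
      CR/LF pair. The function name is hypothetical; it works on the
      raw bytes of the file so that the terminators are not translated.

```python
def bad_line_ends(data):
    """Return 1-based numbers of lines in `data` not ending in CR/LF."""
    lines = data.splitlines(keepends=True)
    return [
        number
        for number, line in enumerate(lines, start=1)
        if not line.endswith(b"\r\n")
    ]
```

      For example, bad_line_ends(open("warpeace.txt", "rb").read())
      returns an empty list for a text that is ready for posting.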


    Paranoid mode (--no-paranoid switch to disable: Trust No One :-)

      --no-paranoid switches OFF some extra checks like standalone
      1 and 0 queries.


    Overview mode (--overview switch)

      This mode just gives a count of queries found
      instead of a detailed list.


    Header quote (--header switch)

      If you use the --header switch, bookloupe will also display
      the Title, Author, Release and Edition fields from the
      PG header. This is useful mostly for the automated
      checks we do on recently-posted texts.


    Errors to stdout (--stdout switch)

      If you're just running bookloupe normally, you can ignore
      this. It's only there for programs that provide a front
      end to bookloupe. It makes error messages appear within
      the output of bookloupe so that the front end knows whether
      bookloupe ran OK.


    Verbose reporting (--verbose switch)

      Normally, if bookloupe sees lots of long lines, short lines,
      spaced dashes, non-ASCII characters or dot-commas ".," it
      assumes these are features of the text, counts and summarizes
      them at the top of its report, but does not list them
      individually. If the verbose switch is on, bookloupe will list
      them all.


    Markup interpretation (--markup switch)

      Normally, bookloupe flags anything it suspects of being HTML
      markup as a possible error. When you use the --markup switch,
      however, it matches anything that looks like markup against
      a short list of common HTML tags and entities. If the markup
      is in that list, it either ignores the markup, in the case
      of a tag, or "interprets" the markup as its nearest ASCII
      equivalent, in the case of an entity. So, for example, using
      this switch, bookloupe will "see"

      &ldquo;He went <i>thataway!</i>&rdquo;

      as

      "He went thataway!"

      and report accordingly.

      This switch does not, not, NOT check the validity of HTML;
      it exists so that you can run bookloupe on most HTML texts
      for PG, and get sane results. It does not support all tags.
      It does not support all entities. When it sees a tag or entity
      it does not recognize, it will query it as HTML just as if
      you hadn't specified the --markup switch.

      Bookloupe will automatically switch on markup interpretation
      if it sees a lot of tags that appear to be markup, so mostly, you
      won't have to specify this.
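
      The idea of "seeing through" markup can be sketched in Python as
      below. The tag and entity tables are tiny, invented stand-ins for
      bookloupe's own lists, which this document does not enumerate.

```python
import re

# Assumption: small illustrative tables, not bookloupe's real lists.
KNOWN_TAGS = {"i", "b", "em", "p"}
ENTITIES = {"&ldquo;": '"', "&rdquo;": '"', "&mdash;": "--", "&amp;": "&"}


def interpret_markup(line):
    """Drop known tags and map known entities to ASCII; keep the rest."""

    def drop_known(match):
        name = match.group(1).lower()
        return "" if name in KNOWN_TAGS else match.group(0)

    line = re.sub(r"</?([A-Za-z][A-Za-z0-9]*)>", drop_known, line)
    for entity, replacement in ENTITIES.items():
        line = line.replace(entity, replacement)
    return line
```

      Anything left over that still looks like a tag or entity is what
      would be queried as HTML.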


    User-defined typos (--usertypo switch)

      If you have a file named bookloupe.typ or gutcheck.typ either
      in your current working directory or in the directory from
      which you explicitly invoked bookloupe, but not necessarily on
      your path, and if you specify the --usertypo switch, bookloupe
      will query any word specified in that file. The file is simple:
      one word, in lower case, per line. Be careful not to put multiple
      words onto a line, or leave any rubbish other than the word on
      the line. You should have received a sample file bookloupe.typ
      with this package. The file may be encoded in UTF-8 (preferred),
      ISO-8859-1 (also known as Latin-1), or WINDOWS-1252 (also known,
      incorrectly, as ANSI).
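
      A Python sketch of how such a file might be applied follows. The
      function names are hypothetical, and matching words without
      regard to case is an assumption made for the example.

```python
import re


def load_user_typos(text):
    """Parse typo file contents: one lower-case word per line."""
    return {line.strip() for line in text.splitlines() if line.strip()}


def query_user_typos(line, typos):
    """Return the words on `line` that appear in the user's list."""
    words = re.findall(r"[^\W\d_]+", line)
    # Assumption: the comparison ignores case.
    return [word for word in words if word.lower() in typos]
```

      For example, with a file containing "arid" and "modem" (two
      classic OCR misreadings), the line "The modem desk was Arid."
      yields both words as queries.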


    Ignore DP markup (--dp switch)

      Distributed Proofreaders (http://www.pgdp.net) has for some
      time been the main source of PG texts, and proofers there use
      special conventions. This switch understands those conventions,
      so that people can use bookloupe on files in progress that
      haven't had the special conventions removed yet. The special
      conventions supported are page-separators and
      "<sc>", "</sc>", "/*", "*/", "/#", "#/", "/$", "$/".


    Dump the current configuration (--dump-config switch)

      The --dump-config switch can be used to dump the current
      configuration. This is a combination of the internal defaults,
      the configuration file (if any) and the command line options.
      If a configuration file is present, any comments found in that
      file will be preserved in the dumped configuration. If there
      is no configuration file, then a default set of comments to
      go with the internal default configuration is generated.


You will probably only run bookloupe on a text once or maybe twice,
just prior to uploading. It usually finds a few formatting problems.
It also usually raises queries that aren't problems at all: it often
questions Tables of Contents for having short lines, for example.
These are called "false positives," and need a human to decide on
them.

The text should be standard prose, and already close to PG normal
format (plain text, about 70 characters per line with blank lines
between paragraphs).

Bookloupe merely draws your attention to things that might be errors.
It is NOT a substitute for human judgement. Formatting choices like
short lines may be for a reason that this program can't understand.

Even the most careful human proofing can leave errors behind in a
text, and there are several automated checks you can do to help find
them. Of these, spellchecking (with _very_ careful human judgement) is
the most important and most useful.

Bookloupe does perform some basic typo-checking if you ask it to,
but its focus is on formatting errors specific to PG texts:
mismatched quotes, non-ASCII characters, bad spacing, bad line
length, HTML tags perhaps left from a conversion, unbalanced
brackets.

Suggestions for additional checks would be appreciated and duly
considered, but there are no guarantees that they will be implemented.



        How does Jim Tinsley use gutcheck?

Practically everyone I give gutcheck to asks me how _I_ use it.
Well, when I get a text for posting, say filename.txt, I run

    gutcheck -o filename.txt

That gives me a quick idea what I'm dealing with. It'll tell
me what kind of problems gutcheck sees, and give me an idea
of how much more work needs to be done on the text. Keep in
mind that gutcheck doesn't do anything like a full spellcheck,
but when I see a text that has a lot of problems, I assume that
it probably needs a spellcheck too.

Having got a feel for the ballpark, I run

    gutcheck filename.txt > jj

where jj is my personal, all-purpose filename for temporary data
that doesn't need to be kept. Then I open filename.txt and jj in
a split-screen view in my editor, and work down the text, fixing
whatever needs fixing, and skipping whatever doesn't. If your
editor doesn't split-screen, you can get much the same effect by
opening your original file in your normal editor, and jj (or your
equivalent name) in something like Notepad, keeping both in view
at the same time.

Twice a day, an automatic process looks at all recently-posted
texts, and emails Michael, me, and sometimes other people with
their gutcheck summaries.



Explanations of common bookloupe messages:

    --> 74 lines in this file have white space at end

    PG texts shouldn't have extra white space added at end of line.
    Don't worry too much about this; it does no harm, and it will
    be removed during posting anyway.


    --> 348 lines in this file are short. Not reporting short lines.
    --> 84 lines in this file are long. Not reporting long lines.
    --> 8 lines in this file are VERY long!

    If there are a lot of long or short lines, bookloupe won't list
    them individually. The short lines version of this message
    is commonly seen when gutchecking poetry and some plays, where
    the normal line length is shorter than the standard for prose.
    A "VERY long" line is one over 80 characters. You normally
    shouldn't have any of these, but sometimes you may have to render
    a table that must be that long, or some special preformatted
    quotation that can't be broken.


    --> There are 75 spaced dashes and em-dashes in this file. Not reporting them.

    The PG standard for an emdash--like these--is two minus signs
    with no spaces before or after them. However, some older texts
    used spaced dashes - like these -- and if there are very many
    such spaced dashes in the file, bookloupe just draws your
    attention to it and doesn't list them individually.


    Line 3020 - Non-ASCII character 233

    Standard PG texts should use only ASCII characters with values
    up to 127; however, non-English, accented characters can be
    represented according to several different non-ASCII encoding
    schemes, using values over 127. If you have a plain English text
    with a few accented characters in words like cafe or tete-a-tete,
    you might replace the accented characters with their unaccented
    versions. The English pound sign is another commonly-seen
    non-ASCII character. If you have enough non-ASCII characters in
    your text that you feel removing them would degrade your text,
    you should probably consider doing a UTF-8 text.


    Line 1207 - Non-ISO-8859 character 156

    Even in "8-bit" texts, there are distinctions between code sets.
    The ISO-8859 family of 8-bit code sets is the most commonly used
    in PG, and these sets do not define values in the range 128 through
    159 as printable characters. It's quite common for someone on a
    Windows or Mac machine to use a non-ISO character inadvertently,
    so this message warns that the character is not only not ASCII,
    but also outside the ISO-8859 range.
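
    The two ranges described above can be told apart with a few lines
    of Python. This is a sketch with a hypothetical function name; it
    works on the raw bytes of one line of an 8-bit text and simply
    mimics the wording of the two messages.

```python
def character_queries(line_number, raw):
    """Return messages for the bytes of one line of an 8-bit text."""
    messages = []
    for code in raw:
        if 128 <= code <= 159:
            # Not printable in the ISO-8859 family.
            messages.append(
                f"Line {line_number} - Non-ISO-8859 character {code}")
        elif code > 159:
            messages.append(
                f"Line {line_number} - Non-ASCII character {code}")
    return messages
```

    Character 233 is e-acute in ISO-8859-1, and character 156 is the
    oe ligature in Windows-1252, which is how such a byte typically
    slips into a text prepared on Windows.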


    Line 46 - Tab character?

    Some editors and WPs will put in Tab characters (character 9) to
    indicate indented text. You should not use these in a PG text,
    because you can't be sure how they will appear on a reader's
    screen. Find the Tab, and replace it with the appropriate number
    of spaces.


    Line 1327 - Tilde character?

    The tilde character (~) might be legitimately used, but it's the
    character commonly used by OCR software to indicate a place where
    it couldn't make out the letter, so bookloupe flags it.


    Line 1347 - Asterisk?

    Asterisks are reported only in paranoid mode (see -x).
    Like tildes, they are often used to indicate errors, but they are
    also legitimately used as line delimiters and footnote markers.


    Line 1451 - Long line 129

    PG texts should have lines shorter than 76 characters. There may
    be occasions where you decide that you really have to go out to
    79 characters, but the sample above says that line 1451 is 129
    characters long, probably two lines run together.


    Line 1590 - Short line?

    PG texts should have lines longer than 54 characters. However,
    there are special cases like poetry and tables of contents where
    the lines _should_ be shorter. So treat bookloupe warnings about
    short lines carefully. Sometimes it's a genuine formatting
    problem; sometimes the line really needs to be short.

    Hint: bookloupe will not flag lines as short if they are indented
    (that is, if they start with a space). I like to start inserted
    stanzas and other such items indented with a couple of spaces so
    that they stand out from the main text anyway.
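
    Putting the numbers above together, a line could be classified by
    length roughly as in this Python sketch. The function name is
    hypothetical and the exact boundaries are read off the text above
    rather than taken from bookloupe itself. A real check also has to
    allow for the last line of each paragraph, which is naturally
    short; the sketch ignores that.

```python
def classify_length(line):
    """Classify one line (without its terminator) by its length."""
    length = len(line)
    if length > 80:
        return "very long"
    if length > 75:  # lines should be shorter than 76 characters
        return "long"
    # Lines should be longer than 54 characters, but blank lines and
    # indented lines are never treated as short.
    if 0 < length < 55 and not line.startswith(" "):
        return "short"
    return None
```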


    Line 1804 - Begins with punctuation?

    Lines should normally not begin with commas, periods and so on.
    An exception is ellipses . . . which can happen at start of line.


    Line 1850 - Spaced em-dash?

    The PG standard for an em-dash--like these--is two minus signs
    with no spaces before or after them. Bookloupe flags non-PG
    em-dashes - like this one. Normally, you will replace it with a
    PG-standard em-dash.
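
    A simple way to look for spaced dashes yourself is a regular
    expression such as the one in this Python sketch. The pattern is
    an assumption about what counts as "spaced" (a lone hyphen with a
    space on both sides, or a double hyphen with a space on either
    side); bookloupe's own test may differ.

```python
import re

# Assumption: " - ", or "--" with a space immediately before or after.
SPACED_DASH = re.compile(r" - | --|-- ")


def has_spaced_dash(line):
    """Return True if the line contains a spaced dash or em-dash."""
    return SPACED_DASH.search(line) is not None
```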



    Line 1904 - Query he/be error?

    Bookloupe makes a very minor effort to look for that scourge of all
    proofreaders, "be" replacing "he" or vice-versa, and draws your
    attention to it when it thinks it has found one.



    Line 2017 - Query digit in a1most

    The digit 1 is commonly OCRed for the letter l, the digit 0 for
    the letter O, and so on. When bookloupe sees a mix of digits and
    letters, it warns you. It may generate a false positive for
    something like 7am.
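
    The idea of "a mix of digits and letters" can be shown with a short
    Python sketch. It is deliberately naive and has no exceptions at
    all, which is why it also picks up 7am. The function name is
    hypothetical, and treating only ASCII letters and digits as word
    characters is an assumption.

```python
import re

# Assumption: a "word" is a run of ASCII letters and digits.
WORD = re.compile(r"[A-Za-z0-9]+")


def find_mixed_words(line):
    """Return the words that contain both letters and digits."""
    mixed = []
    for word in WORD.findall(line):
        has_digit = any(ch.isdigit() for ch in word)
        has_letter = any(ch.isalpha() for ch in word)
        if has_digit and has_letter:
            mixed.append(word)
    return mixed
```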



    Line 2083 - Query standalone 0

    In paranoid mode (see -x) only, bookloupe warns about the digits 0
    and 1 standing alone as words. This can happen if the OCR misreads
    the words O or I.



    Line 2115 - Query word whetber

    If you have switched typo-checking on, bookloupe looks for
    potential typos, especially common h/b errors. It's not
    infallible; it sometimes queries legit words, but it's
    always worth taking a look.



    Line 2190 column 14 - Missing space?

    Omitting a space is a very common error,especially coming from
    OCRed text,and can be hard for a human to spot. The commas in
    the previous sentence illustrate the kind of thing I mean.



    Line 2240 column 48 - Spaced punctuation?

    The flip side of the "missing space" error , here , is when extra
    spaces are added before punctuation . Some old texts appear to add
    extra spaces around punctuation consistently, but this was a
    typographical convention rather than the author's intent, and the
    extra "spaces" should be removed when preparing a PG text.
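
    Both the "missing space" and the "spaced punctuation" queries come
    down to looking at the characters on either side of a punctuation
    mark. The sketch below shows one way to do that in Python. The two
    patterns and the function names are assumptions made for this
    example, not bookloupe's rules; in particular, the second pattern
    tries to leave spaced ellipses . . . alone.

```python
import re

# Assumption: punctuation followed directly by a letter is a missing space.
MISSING_SPACE = re.compile(r"[,;:!?](?=[A-Za-z])")
# Assumption: a space directly before punctuation is an extra space,
# unless it is part of a spaced ellipsis such as ". . .".
SPACED_PUNCT = re.compile(r"(?<!\.) (?=[,;:!?]|\.(?! \.))")


def find_missing_spaces(line):
    """Return 1-based columns of punctuation with no space after it."""
    return [match.start() + 1 for match in MISSING_SPACE.finditer(line)]


def find_spaced_punctuation(line):
    """Return 1-based columns of spaces that come before punctuation."""
    return [match.start() + 1 for match in SPACED_PUNCT.finditer(line)]
```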



    Line 2301 column 19 - Unspaced quotes?

    Another common spacing problem occurs in a phrase like "You wait
    there,"he said.



    Line 2385 column 27 - Wrongspaced quotes?

    Bookloupe checks whether a quote seems to be a start or end quote,
    and queries those that appear to be misplaced. This does give rise
    to false positives when quotes are nested, for example:

    "And how," she asked, "will your "friends" help you now?"

    but these false positives are worth it because of the many cases
    that this test catches, notably those like:

    "And how, "she said," will your friends help you now?"

    Sometimes a "wrongspaced quotes" query will arise because an earlier
    quote in the paragraph was omitted, so if the place specified seems
    to be OK, look back to see whether there's a problem in the preceding
    lines.



    Line 2400 - HTML Tag? <PRE>

    Some PG texts have been converted from HTML, and not all of the
    HTML tags have been removed.



    Line 2402 - HTML symbol? &emdash;

    Similarly, special HTML symbol characters can survive into PG
    texts. This check can occasionally produce amusing false
    positives like
    . . . Marwick & Co were well known for it;



    Line 2540 - Mismatched quotes

    Another bookloupe mainstay: unclosed double quotes in a paragraph.
    See the discussion of quotes in the switches section near the
    start of this file.

    Since the mismatch doesn't occur on any one line, bookloupe quotes
    the line number of the first blank line following the paragraph,
    since this is the point where it reconciles the count of quotes.
    However, if bookloupe is echoing lines, that is, you haven't used
    the -e switch, it will show the _first_ line of the paragraph,
    to help you find the place without using line numbers. The
    offending paragraph is therefore between the quoted line and
    the line number given.
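
    To see why the query is attached to the blank line after the
    paragraph, here is a minimal Python sketch of a per-paragraph quote
    count. It is a simplification written for this explanation: the
    function name is hypothetical, and it ignores any special cases
    covered in the switches section.

```python
def check_double_quotes(lines):
    """Return (line_number, message) for paragraphs with an odd quote count.

    The count is reconciled at the first blank line after a paragraph,
    so that blank line's number is the one reported.
    """
    queries = []
    count = 0
    in_paragraph = False
    for number, line in enumerate(lines, start=1):
        text = line.strip()
        if text:
            count += text.count('"')
            in_paragraph = True
        else:
            if in_paragraph and count % 2 != 0:
                queries.append((number, "Mismatched quotes"))
            count = 0
            in_paragraph = False
    # Assumption: a final paragraph with no blank line after it is
    # reported one line past the end of the text.
    if in_paragraph and count % 2 != 0:
        queries.append((len(lines) + 1, "Mismatched quotes"))
    return queries
```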



    Line 2587 - Mismatched single quotes

    Single quotes are only checked when you use the -s switch, since
    checking single quotes is not a very reliable process. Otherwise,
    the same logic as for double quotes applies.



    Line 2877 - Mismatched round brackets?

    The same check applies to curly and square brackets. Texts with a
    lot of brackets, like plays with bracketed stage instructions, may
    have mismatches.
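
    A bracket mismatch can be pictured as a simple count of opening and
    closing characters. The Python sketch below assumes the count is
    made over one paragraph at a time, in the same way as for quotes;
    that assumption, and the names used, are not taken from bookloupe.

```python
# The three kinds of bracket named in the text above.
PAIRS = {"round": "()", "square": "[]", "curly": "{}"}


def check_brackets(paragraph):
    """Return the kinds of bracket whose open and close counts differ."""
    return [kind for kind, (open_ch, close_ch) in PAIRS.items()
            if paragraph.count(open_ch) != paragraph.count(close_ch)]
```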


    Line 3150 - No CR?
    Line 3204 - Two successive CRs?
    Line 3281 position 75 - CR without LF?

    These are the invalid line-end warnings. See the discussion of
    line-end checking in the switches section near the start of this
    file. If you see these, and your editor doesn't show anything
    wrong, you should probably try deleting the characters just before
    and after the line end, and the line-end itself, then retyping the
    characters and the line-end.
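
    Because these problems are usually invisible in an editor, it can
    help to look at the raw bytes. The following Python sketch reports
    the three conditions named above for a text that is expected to use
    CR/LF line ends throughout. That expectation is an assumption of
    the sketch, as is the function name; it is not a description of how
    bookloupe decides.

```python
def check_line_ends(data):
    """Return (line_number, message) pairs for odd line ends in bytes data.

    Assumption: every line is meant to end with CR followed by LF.
    """
    queries = []
    pieces = data.split(b"\n")
    for index, piece in enumerate(pieces):
        number = index + 1
        terminated = index < len(pieces) - 1  # this piece was followed by LF
        body = piece
        if terminated:
            if not piece.endswith(b"\r"):
                queries.append((number, "No CR?"))
            elif piece.endswith(b"\r\r"):
                queries.append((number, "Two successive CRs?"))
            body = piece.rstrip(b"\r")
        # Any CR left inside the line has no LF directly after it.
        position = body.find(b"\r")
        if position != -1:
            queries.append((number, f"position {position + 1} - CR without LF?"))
    return queries
```

    To try it on a file, read the file in binary mode, for example with
    open("yourfile.txt", "rb").read(), so that Python does not
    translate the line ends first.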


    Line 2940 - Paragraph starts with lower-case

    A common error in an e-text is for an extra blank line

    to be put in, like the blank line above, and this often
    shows up as a new paragraph beginning with lower case.
    Sometimes the blank line is deliberate, as when a
    quotation is inserted in a speech. Use your judgement.
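
    The check itself is easy to picture: find each line that follows a
    blank line and look at its first character. A minimal Python
    sketch, with a hypothetical function name and the assumption that
    the first line of the text also counts as the start of a paragraph:

```python
def find_lowercase_paragraph_starts(lines):
    """Return 1-based numbers of paragraph-opening lines that start lower-case."""
    queries = []
    previous_blank = True  # assumption: the first line opens a paragraph
    for number, line in enumerate(lines, start=1):
        text = line.strip()
        if text and previous_blank and text[0].islower():
            queries.append(number)
        previous_blank = not text
    return queries
```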


    Line 2987 - Extra period?

    An extra period. is a. common problem in OCRed text. and usually
    arises when a speck of dust on the page is mistaken for a period.
    or. as occasionally happens. when a comma loses its tail.


    Line 3012 column 12 - Double punctuation?

    Double punctuation., like that,, is a common typo and
    scanno. Some books have much legit double punctuation,
    like etc., etc., but it's worth checking anyway.



            *       *       *        *

For Windows-only users who are unfamiliar with DOS:

    If you're a Windows-only user, you need to save
    bookloupe.exe into the folder (directory) where the
    text file you want to check is. Let's say your
    text file is in C:\gut, then you should save
    bookloupe.exe into C:\gut.

    Now get to a console. You can do this by
    selecting the "Command Prompt" or "MS-DOS Prompt"
    option that will be somewhere on your
    Start/Programs menu.

    Now get into the C:\gut directory.
    You can do this using the cd (change directory)
    command, like this:
        cd \gut
    and your prompt will change to
        C:\gut>
    so you know you're in the right place.

    Now type
        bookloupe yourfile.txt
    and you'll see bookloupe's report.

    By default, bookloupe prints its queries to screen.
    If you want to create a file of them, to edit
    against the text, you can use the greater-than
    sign (>) to tell it to output the report to a
    file. For example, if you want its report in a
    file called queries.lst, you could type

        bookloupe yourfile.txt > queries.lst

    The queries.lst file will then contain the listing
    of possible formatting errors, and you can
    edit it alongside your text.

    Whatever you do, DON'T make the filename after
    the greater-than sign the name of a file already
    on your disk that you want to keep, because
    the greater-than sign will cause bookloupe to
    replace any existing file of that name.
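
    If you happen to have Python installed and are worried about
    overwriting something by accident, a small wrapper can refuse to
    replace an existing file. This is only a sketch: the function names
    are made up for this example, and it assumes that bookloupe can be
    run from the current folder exactly as shown above.

```python
import subprocess


def save_report(report, path):
    """Write the report text to path, refusing to replace an existing file."""
    # Mode "x" raises FileExistsError instead of overwriting.
    with open(path, "x", encoding="utf-8") as handle:
        handle.write(report)


def run_bookloupe(text_file, report_file):
    """Run "bookloupe text_file" and save what it prints to report_file."""
    # Assumption: the bookloupe program can be found from this folder.
    result = subprocess.run(["bookloupe", text_file],
                            capture_output=True, text=True)
    save_report(result.stdout, report_file)
```

    Calling run_bookloupe("yourfile.txt", "queries.lst") a second time
    stops with an error rather than replacing queries.lst.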

    So, for example, if you have two Tolstoy files
    that you want to check, called WARPEACE.TXT and
    ANNAK.TXT, make sure that neither of these names
    is ever used following the greater-than sign.
    To check these correctly, you might do:

        bookloupe warpeace.txt > war.lst

    and

        bookloupe annak.txt > annak.lst

    separately. Then you can look at war.lst and annak.lst
    to see the bookloupe reports.

For Windows-only users who want to use bookloupe from guiguts:

    1) If you haven't already done so, download bookloupe-win32-xxx.zip
    from http://www.juiblex.co.uk/pgdp/bookloupe/

    2) Extract the files into a suitable folder, e.g. C:\DP\bookloupe

    3) Start Guiguts

    4) Choose Preferences | File Paths | Set File Paths..

    5) Click the "Locate Gutcheck..." button

    6) Browse to the folder where you extracted bookloupe

    7) Double-click bookloupe.exe

    Now, whenever you do "Gutcheck" in Guiguts, it will run bookloupe
    instead. Since the output will look very like gutcheck output, you
    may want to check that it is actually bookloupe that is running. To do
    this, look at the black command line message window, which will say:

    "bookloupe: Check and report on an e-text".

    To return to using gutcheck for any reason, repeat steps 4 and 5
    above, and then:

    6b) Browse back to the gutcheck folder, which is in a "tools"
    folder inside the main Guiguts folder. It will be something like
    "C:\DP\guiguts-win\tools\gutcheck", depending on where you installed
    Guiguts originally.

    7b) Double-click gutcheck.exe

    Now doing "Gutcheck" in Guiguts will run gutcheck itself, and the
    message in the black window should read:

    "gutcheck: Check and report on an e-text".