doc/bookloupe.txt
author ali <ali@juiblex.co.uk>
Wed Oct 02 09:14:33 2013 +0100 (2013-10-02)
changeset 105 2d48e8cdda24
parent 92 7a62c77a0dbe
permissions -rw-r--r--
Fix bug #19: Update documentation for 2.1

                            Bookloupe documentation


bookloupe: lists possible common formatting errors in a Project
Gutenberg candidate file. Bookloupe is based on gutcheck, written
by Jim Tinsley. It is a command line program and can be used under
Microsoft Windows, Mac or Unix. For Windows-only people, there is
an appendix at the end with brief instructions for running it.

Current version: 2.1

This software is Copyright Jim Tinsley 2000-2005 and
J. Ali Harlow 2012 onwards.

Bookloupe comes with ABSOLUTELY NO WARRANTY. For details, read the file COPYING.
This is Free Software; you may redistribute it under certain conditions (GPL).

See http://www.juiblex.co.uk/pgdp/bookloupe/ for the latest version.


                         Recent changes in behaviour

Each new version of bookloupe brings bug fixes and improvements. Sometimes
the behaviour is also changed in ways that might be unexpected:

Odd characters

    The check for "odd" characters (tab, tilde, caret, forward slash and
    asterisks) was disabled in bookloupe 2.0 when the character set was
    switched from ASCII/ISO-8859-1 to UNICODE (i.e., when the "There are a
    lot of foreign letters here." message is printed). As of bookloupe 2.1
    these tests operate independently of the character set selected.

    Users may notice this change most especially in the case of the
    DP-specific /* ... */ markup. Bookloupe 2.0 often did not warn when
    this markup was encountered even when the --dp switch was not given.
    Bookloupe 2.1 will warn about this markup unless dp-specific mode is
    switched on, paranoid mode is switched off or the ebook contains more
    than 10 lines containing asterisks. In the last case

      --> 11 lines in this file contain asterisks. Not reporting them.

    will be printed.



Usage is: bookloupe [OPTION...] filename

Options:
      -d, --dp                  ignores some DP-specific markup
      -e, --no-echo             switches off Echoing of lines
      -s, --squote              checks Single quotes
      --typo                    checks Typos
      -p, --qpara               sets strict quotes checking for Paragraphs
      --no-paranoid             switches OFF typo checking and extra checks
      -l, --no-line-end         turns off Line-end checks
      -o, --overview            produces an Overview only
      -y, --stdout              sets error messages to stdout
      -h, --header              echoes the header fields
      -m, --markup              ignores some common HTML markup
      -u, --usertypo            warns about words in a user-defined typo file
      -v, --verbose             forces individual reporting of minor problems
      -w, --web                 special mode for web uploads (for future use)
      --charset=NAME            the set of characters valid for this ebook
      --dump-config             dump the current configuration

There are also inverted options available which are useful when it is
desired to override an option set in the configuration file:

      --no-dp, --echo, --no-squote, --no-typo, --no-qpara, --paranoid,
      --line-end, --no-overview, --no-stdout, --no-header, --no-markup,
      --no-usertypo, --no-verbose.

Note: there is no --no-web since --web simply selects a set of options.

Finally there are a couple of options that toggle the state of options
rather than setting or unsetting them: -t (for typo) and -x (for typo
and paranoid). These are mainly intended for compatibility with gutcheck.

Running bookloupe without any parameters will display a brief help message.

Sample usage:

    bookloupe warpeace.txt


More detail:

    Configuration file

      Bookloupe will look for a file named bookloupe.ini to read as
      a configuration file. Options set in a configuration file can
      be overridden from the command line as required.

      The following directories are searched in order:

        1) The current working directory. When run from the command
        line, this is the directory you ran it from. When run from
        guiguts it will normally be the directory that contains the
        guiguts program.

        2) The directory containing the bookloupe program.

        3) The user's configuration directory. Under MS-Windows this
        is normally CSIDL_LOCAL_APPDATA which is typically set to
        C:\Documents and Settings\<user>\Local Settings\Application Data.
        On other platforms this is normally $XDG_CONFIG_HOME which, if
        not set, defaults to $HOME/.config.

        The directories to search can also be changed using the
        $BOOKLOUPE_CONFIG_PATH environment variable which is a
        colon-separated (semicolon-separated under MS-Windows) list of
        directories.
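
      The search order above can be sketched in Python. This is only
      an illustration, not bookloupe's own code: the function name is
      made up, only the non-Windows user directory is modelled, and it
      assumes that $BOOKLOUPE_CONFIG_PATH, when set, replaces the
      default list rather than adding to it.

```python
import os


def candidate_config_files(environ, cwd, program_dir, home):
    """Return the bookloupe.ini paths to try, in search order.

    Hypothetical helper. Assumption: a non-empty BOOKLOUPE_CONFIG_PATH
    replaces the three default directories entirely.
    """
    override = environ.get("BOOKLOUPE_CONFIG_PATH")
    if override:
        # os.pathsep is ":" on Unix and ";" under MS-Windows.
        directories = [d for d in override.split(os.pathsep) if d]
    else:
        # $XDG_CONFIG_HOME, falling back to $HOME/.config when unset.
        user_dir = environ.get("XDG_CONFIG_HOME") or os.path.join(home, ".config")
        directories = [cwd, program_dir, user_dir]
    return [os.path.join(d, "bookloupe.ini") for d in directories]


# Example directories only; substitute your own.
for path in candidate_config_files({}, "work", "bin", "home"):
    print(path)
```

      The first file in the returned list that actually exists would
      be the one read.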

      The configuration file is a key file. This is very similar to,
      but not identical to a typical ini file as found under MS-Windows.
      Key files consist of a number of groups which start with the
      group name enclosed in square brackets on a line by itself.
      Bookloupe recognises just one group, "options". Then below the
      group name there follow the keys and their values for that
      group, one per line in the format key=value. Most of bookloupe's
      options are flags (i.e., either on or off). For these keys, the
      value must be either "true" or "false". The file may also contain
      comment lines which begin with the # symbol. The names of the
      keys follow the long option names.

      A sample configuration file is provided (in sample.ini). The file
      will need to be copied to bookloupe.ini before bookloupe will
      read it. You can also use the --dump-config option to write a
      configuration file for you. For example, if you typically want
      to run bookloupe with the --dp and --squote options, then you
      might do:

        $ bookloupe --dp --squote --dump-config > configuration.ini
        $ ren configuration.ini bookloupe.ini

      (ren is the MS-Windows rename command; under Unix use mv. Don't
      be tempted to merge these two steps or bookloupe will see an
      empty configuration file and complain.)

      This same idea can also be used to modify an existing configuration.
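
      To make the key-file layout concrete, here is a small Python
      sketch. It uses Python's configparser module as a stand-in
      reader, which is an assumption of convenience: bookloupe itself
      does not use it, and the two formats differ in details. The key
      names dp and squote are assumed from the rule that keys follow
      the long option names.

```python
import configparser

# Hypothetical contents for bookloupe.ini: one "options" group,
# key=value lines, and a comment line starting with #.
SAMPLE_INI = """\
# Options I normally want when checking DP projects
[options]
dp=true
squote=true
"""


def read_options(text):
    """Return the flags in the [options] group as a dict of booleans."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    return {key: parser.getboolean("options", key)
            for key in parser["options"]}


options = read_options(SAMPLE_INI)
print(options)
```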


    Character encoding

      Bookloupe will handle e-texts encoded in UTF-8 (preferred),
      ISO-8859-1 (also known as Latin-1), or WINDOWS-1252 (also known,
      incorrectly, as ansi). The output will be in the same encoding
      as the input e-text.


    Character set (--charset)

      Character encodings have an implicit set of characters that
      can be encoded and thus define a set of characters that can
      be present in the text. However, sometimes it is desirable
      that not all characters that can be encoded should be present
      in a text. The set of characters that should be present is
      known as the character set.

      The default setting for the character set (called auto) does
      the same as gutcheck for Windows-1252 encoded texts for
      compatibility:

      If the file is predominantly ASCII then the set of legal
      characters is ASCII and warnings are issued whenever non-ASCII
      characters are encountered. The message will either warn of
      non-ASCII or non-ISO-8859-1 characters as appropriate.

      If the file contains a significant number of non-ASCII characters
      then a message is printed as follows:

        --> There are a lot of foreign letters here. Not reporting them.

      and the character set is widened to include all possible
      characters.

      For UTF-8 encoded texts, auto selects UNICODE.

      Most character sets are simply defined in bookloupe as the
      set of all characters that can be encoded in the encoding of
      the same name. UNICODE is an exception and includes only the
      characters assigned in the relevant Unicode standard but
      excluding the Private Use Area characters. Note that the
      relevant Unicode standard is given by the version of glib in
      use rather than by any code in bookloupe and thus can vary
      from system to system. PG texts, however, are likely to be
      using characters assigned in very early Unicode standards,
      thus mitigating this issue.


    Echoing lines (--no-echo to switch off)

      You may find it convenient, when reviewing Bookloupe's
      suggestions, to see the line that Bookloupe is questioning.
      That way, you can often see at a glance whether it is
      a real error that needs to be fixed, or a false positive:
      something that belongs in the text but that Bookloupe's
      limited programming doesn't understand.

      By default, bookloupe echoes these lines, but if you don't
      want to see the lines referred to, --no-echo will switch it
      OFF.


    Quotes (--squote and --qpara switches)

      Bookloupe always looks for unbalanced doublequotes in a
      paragraph. It is a common convention for writers not to
      close quotes in a paragraph if the next paragraph opens
      with quotes and is a continuation by the same speaker.

      Bookloupe therefore does not normally report unclosed quotes
      if the next paragraph begins with a quote. If you need
      to see all unclosed quotes, even where the next paragraph
      begins with a quote, you should use the -p (--qpara) switch.
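
      The doublequote rule can be illustrated with a short Python
      sketch. It is a simplification written for this document, not
      bookloupe's actual logic: it treats a paragraph as unbalanced
      when it holds an odd number of straight doublequotes, and the
      strict parameter plays the part of the -p switch.

```python
import re


def unclosed_quote_paragraphs(text, strict=False):
    """Return 0-based indexes of paragraphs with an unclosed doublequote.

    Unless strict is set, a paragraph is let off when the following
    paragraph opens with a doublequote (same speaker continuing).
    """
    paragraphs = [p for p in re.split(r"\n\s*\n", text) if p.strip()]
    flagged = []
    for index, paragraph in enumerate(paragraphs):
        if paragraph.count('"') % 2 == 0:
            continue  # quotes balance within this paragraph
        following = paragraphs[index + 1].lstrip() if index + 1 < len(paragraphs) else ""
        if not strict and following.startswith('"'):
            continue  # continuation convention: not reported by default
        flagged.append(index)
    return flagged


SAMPLE = ('"First part of a long speech.\n\n'
          '"Second part, now closed."\n\n'
          'He said, "Oops.\n')
print(unclosed_quote_paragraphs(SAMPLE))
print(unclosed_quote_paragraphs(SAMPLE, strict=True))
```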

      Singlequotes (', `, ‘ and ’) are a problem, since the same
      character can be used for an apostrophe. I'm not sure that it
      is possible to get 100% accuracy on singlequotes checking,
      particularly since dialect, quite common in PG texts,
      upsets the normal rules so badly. Consider the sentence:
        'Tis often said that a man's a man for a' that.
      As humans, we recognize that the first and last apostrophes
      mark contractions rather than quotes, but it isn't easy
      to get a program to recognize that.

      Since bookloupe makes too many mistakes when trying to match
      singlequotes, it doesn't look for unbalanced singlequotes
      unless you specify the --squote switch.

      Consider these sentences, which illustrate the main cases:

        'Tis often said that a fool and his money are soon parted.

        'Becky's goin' home,' said Tom.

        The dogs' tails wagged in unison.

        Those 'pack dogs' of yours look more like wolves.


    Typos (--typo switch)

      It's not bookloupe's job to be a spelling checker, but it does
      check for a list of common typos and OCR errors if you use the
      --typo switch. (The -t and -x switches also toggle typo checking.)

      It also checks for character combinations, especially involving
      h and b, which are often confused by OCR, that rarely or never
      occur. For example, it queries "tbe" in a word. Now, "the" often
      occurs, but "tbe" is very rare (heartbeat, hotbed), so I'm
      playing the odds - a few false positives for many errors found.
      Similarly with "ii", which is a very common OCR error.

      Bookloupe reports each of the first 40 distinct "typos" it finds
      only once. This is to remove the annoyance of seeing something
      like "FN" (footnote) or "LK" (initials) flagged as a typo 147
      times in a text.


    Line-end checking (--no-line-end switch to disable)

      All PG texts should have a Carriage Return (CR - character 13)
      and a Line Feed (LF - character 10) at the end of each line,
      regardless of what O/S you made them on. DOS/Windows, Unix
      and Mac have different conventions, but the final text should
      always use a CR/LF pair as its line terminator.

      By default, bookloupe verifies that every line has the correct
      terminator. If you're working on a text in progress under
      Linux, though, you may prefer to convert the line-ends as a
      final step and not see thousands of errors every time you run
      bookloupe before then. In that case, you can turn off
      this checking with the --no-line-end switch.
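
      As a rough illustration of what a line-end check looks for, and
      of the final conversion step mentioned above, here is a Python
      sketch. Both function names are invented for this example and
      neither is part of bookloupe.

```python
def count_bad_line_ends(data):
    """Count lines in a bytes object that do not end with CR/LF."""
    return sum(1 for line in data.splitlines(keepends=True)
               if not line.endswith(b"\r\n"))


def to_crlf(data):
    """Rewrite every line terminator (LF, CR or CR/LF) as CR/LF."""
    return b"".join(line.rstrip(b"\r\n") + b"\r\n"
                    for line in data.splitlines(keepends=True))


# One Unix line, one correct line and one old-style Mac line.
sample = b"one\ntwo\r\nthree\r"
print(count_bad_line_ends(sample))
print(count_bad_line_ends(to_crlf(sample)))
```

      Work on bytes rather than text here, since opening a file in
      text mode may silently translate the line-ends.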


    Paranoid mode (--no-paranoid switch to disable: Trust No One :-)

      --no-paranoid switches OFF some extra checks like standalone
      1 and 0 queries.


    Overview mode (--overview switch)

      This mode just gives a count of queries found
      instead of a detailed list.


    Header quote (--header switch)

      If you use the --header switch, bookloupe will also display
      the Title, Author, Release and Edition fields from the
      PG header. This is useful mostly for the automated
      checks we do on recently-posted texts.


    Errors to stdout (--stdout switch)

      If you're just running bookloupe normally, you can ignore
      this. It's only there for programs that provide a front
      end to bookloupe. It makes error messages appear within
      the output of bookloupe so that the front end knows whether
      bookloupe ran OK.


    Verbose reporting (--verbose switch)

      Normally, if bookloupe sees lots of long lines, short lines,
      spaced dashes, non-ASCII characters or dot-commas ".," it
      assumes these are features of the text, counts and summarizes
      them at the top of its report, but does not list them
      individually. If the verbose switch is on, bookloupe will list
      them all.


    Markup interpretation (--markup switch)

      Normally, bookloupe flags anything it suspects of being HTML
      markup as a possible error. When you use the --markup switch,
      however, it matches anything that looks like markup against
      a short list of common HTML tags and entities. If the markup
      is in that list, it either ignores the markup, in the case
      of a tag, or "interprets" the markup as its nearest ASCII
      equivalent, in the case of an entity. So, for example, using
      this switch, bookloupe will "see"

      &ldquo;He went <i>thataway!</i>&rdquo;

      as

      "He went thataway!"

      and report accordingly.

      This switch does not, not, NOT check the validity of HTML;
      it exists so that you can run bookloupe on most HTML texts
      for PG, and get sane results. It does not support all tags.
      It does not support all entities. When it sees a tag or entity
      it does not recognize, it will query it as HTML just as if
      you hadn't specified the --markup switch.

      Bookloupe will automatically switch on markup interpretation
      if it sees a lot of tags that appear to be markup, so mostly, you
      won't have to specify this.
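
      The idea of ignoring known tags and "interpreting" known
      entities can be sketched in Python as follows. The two tables
      are short guesses made up for the example; bookloupe's real
      list of supported tags and entities is not reproduced here.

```python
import re

# Hypothetical tables, for illustration only.
KNOWN_TAGS = {"i", "b", "em", "strong"}
KNOWN_ENTITIES = {"ldquo": '"', "rdquo": '"', "lsquo": "'",
                  "rsquo": "'", "mdash": "--", "amp": "&"}


def interpret_markup(line):
    """Drop known tags and replace known entities with ASCII text.

    Anything not in the tables is left untouched, so that it can
    still be queried as possible HTML.
    """
    def replace_tag(match):
        return "" if match.group(1).lower() in KNOWN_TAGS else match.group(0)

    def replace_entity(match):
        return KNOWN_ENTITIES.get(match.group(1), match.group(0))

    line = re.sub(r"</?([A-Za-z][A-Za-z0-9]*)>", replace_tag, line)
    return re.sub(r"&([A-Za-z]+);", replace_entity, line)


print(interpret_markup("&ldquo;He went <i>thataway!</i>&rdquo;"))
```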


    User-defined typos (--usertypo switch)

      If you have a file named bookloupe.typ or gutcheck.typ either
      in your current working directory or in the directory from
      which you explicitly invoked bookloupe, but not necessarily on
      your path, and if you specify the --usertypo switch, bookloupe
      will query any word specified in that file. The file is simple:
      one word, in lower case, per line. Be careful not to put multiple
      words onto a line, or leave any rubbish other than the word on
      the line. You should have received a sample file bookloupe.typ
      with this package. The file may be encoded in UTF-8 (preferred),
      ISO-8859-1 (also known as Latin-1), or WINDOWS-1252 (also known,
      incorrectly, as ansi).


    Ignore DP markup (--dp switch)

      Distributed Proofreaders (http://www.pgdp.net) has for some
      time been the main source of PG texts, and proofers there use
      special conventions. This switch understands those conventions,
      so that people can use bookloupe on files in process that still
      haven't had the special conventions removed yet. The special
      conventions supported are page-separators and
      "<sc>", "</sc>", "/*", "*/", "/#", "#/", "/$", "$/".


    Dump the current configuration (--dump-config switch)

      The --dump-config switch can be used to dump the current
      configuration. This is a combination of the internal defaults,
      the configuration file (if any) and the command line options.
      If a configuration file is present, any comments found in that
      file will be preserved in the dumped configuration. If there
      is no configuration file, then a default set of comments to
      go with the internal default configuration is generated.


You will probably only run bookloupe on a text once or maybe twice,
just prior to uploading; it usually finds a few formatting problems;
it also usually finds queries that aren't problems at all - it often
questions Tables of Contents for having short lines, for example.
These are called "false positives," and need a human to decide on
them.

The text should be standard prose, and already close to PG normal
format (plain text, about 70 characters per line with blank lines
between paragraphs).

Bookloupe merely draws your attention to things that might be errors.
It is NOT a substitute for human judgement. Formatting choices like
short lines may be for a reason that this program can't understand.

Even the most careful human proofing can leave errors behind in a
text, and there are several automated checks you can do to help find
them. Of these, spellchecking (with _very_ careful human judgement) is
the most important and most useful.

Bookloupe does perform some basic typo-checking if you ask it to,
but its focus is on formatting errors specific to PG texts:
mismatched quotes, non-ASCII characters, bad spacing, bad line
length, HTML tags perhaps left from a conversion, unbalanced
brackets.

Suggestions for additional checks would be appreciated and duly
considered, but no guarantees that they will be implemented.




        How does Jim Tinsley use gutcheck?

Practically everyone I give gutcheck to asks me how _I_ use it.
Well, when I get a text for posting, say filename.txt, I run

    gutcheck -o filename.txt

That gives me a quick idea what I'm dealing with. It'll tell
me what kind of problems gutcheck sees, and give me an idea
of how much more work needs to be done on the text. Keep in
mind that gutcheck doesn't do anything like a full spellcheck,
but when I see a text that has a lot of problems, I assume that
it probably needs a spellcheck too.

Having got a feel for the ballpark, I run

    gutcheck filename.txt > jj

where jj is my personal, all-purpose filename for temporary data
that doesn't need to be kept. Then I open filename.txt and jj in
a split-screen view in my editor, and work down the text, fixing
whatever needs fixing, and skipping whatever doesn't. If your
editor doesn't split-screen, you can get much the same effect by
opening your original file in your normal editor, and jj (or your
equivalent name) in something like Notepad, keeping both in view
at the same time.

Twice a day, an automatic process looks at all recently-posted
texts, and emails Michael, me, and sometimes other people with
their gutcheck summaries.



Explanations of common bookloupe messages:

    --> 74 lines in this file have white space at end

    PG texts shouldn't have extra white space added at end of line.
    Don't worry too much about this; the extra spaces do no harm,
    and they'll be removed during posting anyway.


    --> 348 lines in this file are short. Not reporting short lines.
    --> 84 lines in this file are long. Not reporting long lines.
    --> 8 lines in this file are VERY long!

    If there are a lot of long or short lines, bookloupe won't list
    them individually. The short lines version of this message
    is commonly seen when gutchecking poetry and some plays, where
    the normal line length is shorter than the standard for prose.
    A "VERY long" line is one over 80 characters.  You normally
    shouldn't have any of these, but sometimes you may have to render
    a table that must be that long, or some special preformatted
    quotation that can't be broken.


    --> There are 75 spaced dashes and em-dashes in this file. Not reporting them.

    The PG standard for an emdash--like these--is two minus signs
    with no spaces before or after them. However, some older texts
    used spaced dashes - like these -- and if there are very many
    such spaced dashes in the file, bookloupe just draws your
    attention to it and doesn't list them individually.



    Line 3020 - Non-ASCII character 233

    Standard PG texts should use only ASCII characters with values
    up to 127; however, non-English, accented characters can be
    represented according to several different non-ASCII encoding
    schemes, using values over 127. If you have a plain English text
    with a few accented characters in words like café or tête-à-tête,
    you might replace the accented characters with their unaccented
    versions. The English pound sign is another commonly-seen
    non-ASCII character. If you have enough non-ASCII characters in
    your text that you feel removing them would degrade your text,
    you should probably consider doing a UTF-8 text.



    Line 1207 - Non-ISO-8859 character 156

    Even in "8-bit" texts, there are distinctions between code sets.
    The ISO-8859 family of 8-bit code sets is the most commonly used
    in PG, and these sets do not define values in the range 128 through
    159 as printable characters. It's quite common for someone on a
    Windows or Mac machine to use a non-ISO character inadvertently,
    so this message warns that the character is not only not ASCII,
    but also outside the ISO-8859 range.
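
    The two value ranges described above can be written down as a
    small Python sketch. It assumes a single-byte encoding and only
    mirrors the wording of the two messages; the function name is
    made up and it is not bookloupe's own test.

```python
def character_query(value):
    """Return the query text for a character value, or None if ASCII."""
    if value < 128:
        return None  # plain ASCII: nothing to report
    if value <= 159:
        # 128 through 159 are not printable characters in ISO-8859.
        return "Non-ISO-8859 character %d" % value
    return "Non-ASCII character %d" % value


for value in (65, 156, 233):
    print(value, character_query(value))
```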



    Line 46 - Tab character?

    Some editors and word processors will put in Tab characters
    (character 9) to indicate indented text. You should not use
    these in a PG text, because you can't be sure how they will
    appear on a reader's screen. Find the Tab, and replace it with
    the appropriate number of spaces.



    Line 1327 - Tilde character?

    The tilde character (~) might be legitimately used, but it's the
    character commonly used by OCR software to indicate a place where
    it couldn't make out the letter, so bookloupe flags it.



    Line 1347 - Asterisk?

    Asterisks are reported only in paranoid mode (see --no-paranoid).
    Like tildes, they are often used to indicate errors, but they are
    also legitimately used as line delimiters and footnote markers.



    Line 1451 - Long line 129

    PG texts should have lines shorter than 76 characters. There may
    be occasions where you decide that you really have to go out to
    79 characters, but the sample above says that line 1451 is 129
    characters long, which is probably two lines run together.



    Line 1590 - Short line?

    PG texts should have lines longer than 54 characters. However,
    there are special cases like poetry and tables of contents where
    the lines _should_ be shorter. So treat bookloupe warnings about
    short lines carefully. Sometimes it's a genuine formatting
    problem; sometimes the line really needs to be short.

    Hint: bookloupe will not flag lines as short if they are indented
    (that is, if they start with a space). I like to start inserted
    stanzas and other such items indented with a couple of spaces so
    that they stand out from the main text anyway.
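
    The length limits quoted in the last few messages can be pulled
    together in one Python sketch. The thresholds come from the text
    above; everything else is an assumption. In particular, this
    naive version would call the last line of every paragraph short,
    which a real checker has to allow for.

```python
def classify_line(line):
    """Classify one line of text by its length.

    Thresholds from the text: over 80 is "very long", 76 or more is
    "long", and 54 or fewer is "short" unless the line is indented.
    """
    length = len(line.rstrip("\r\n"))
    if length > 80:
        return "very long"
    if length > 75:
        return "long"
    if 0 < length < 55 and not line.startswith(" "):
        return "short"
    return "ok"


for sample in ("x" * 129, "x" * 78, "x" * 40, "  " + "x" * 40):
    print(len(sample), classify_line(sample))
```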



    Line 1804 - Begins with punctuation?

    Lines should normally not begin with commas, periods and so on.
    An exception is ellipses . . . which can happen at the start of
    a line.



    Line 1850 - Spaced em-dash?

    The PG standard for an em-dash--like these--is two minus signs
    with no spaces before or after them. Bookloupe flags non-PG
    em-dashes - like this one. Normally, you will replace it with a
    PG-standard em-dash.
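
    A spaced dash of this kind is easy to describe with a regular
    expression. The Python pattern below is a guess at a reasonable
    definition (one or two minus signs with a space on at least one
    side of a double dash, or on both sides of a single one); it is
    not taken from bookloupe.

```python
import re

# Matches " - ", " -- ", " --" and "-- ". Hyphenated words and the
# PG-standard unspaced "--" are left alone.
SPACED_DASH = re.compile(r" -{1,2} | --|-- ")

print(SPACED_DASH.findall("used spaced dashes - like these -- and"))
print(SPACED_DASH.search("an em-dash--like these--is"))
```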
ali@0
   574
ali@0
   575
ali@0
   576
ali@0
   577
    Line 1904 - Query he/be error?
ali@0
   578
ali@74
   579
    Bookloupe makes a very minor effort to look for that scourge of all
ali@0
   580
    proofreaders, "be" replacing "he" or vice-versa, and draws your
ali@0
   581
    attention to it when it thinks it has found one.



    Line 2017 - Query digit in a1most

    The digit 1 is commonly OCRed for the letter l, the digit 0 for
    the letter O, and so on. When bookloupe sees a mix of digits and
    letters, it warns you. It may generate a false positive for
    something like 7am.
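
    To make the idea concrete, here is a small Python sketch that
    lists words mixing digits and letters. The function name and the
    definition of a "word" (a run of ASCII letters and digits) are
    assumptions for this example, not bookloupe's actual rules.

```python
import re

# Assumption: a word is a run of ASCII letters and digits.
WORD = re.compile(r"[A-Za-z0-9]+")


def words_mixing_digits(line):
    """Return the words on a line that contain both digits and letters."""
    mixed = []
    for word in WORD.findall(line):
        has_digit = any(ch.isdigit() for ch in word)
        has_letter = any(ch.isalpha() for ch in word)
        if has_digit and has_letter:
            mixed.append(word)
    return mixed
```

    Note that the sketch reports 7am just as readily as a1most, which
    is exactly the kind of false positive mentioned above.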



    Line 2083 - Query standalone 0

    In paranoid mode (see -x) only, bookloupe warns about the digits 0
    and 1 standing alone as words. This can happen if the OCR misreads
    the words O or I.



    Line 2115 - Query word whetber

    If you have switched typo-checking on, bookloupe looks for
    potential typos, especially common h/b errors. It's not
    infallible; it sometimes queries legit words, but it's
    always worth taking a look.



    Line 2190 column 14 - Missing space?

    Omitting a space is a very common error,especially coming from
    OCRed text,and can be hard for a human to spot. The commas in
    the previous sentence illustrate the kind of thing I mean.
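
    A check of this kind can be sketched in a few lines of Python.
    The pattern below is an assumption chosen for illustration (it
    deliberately ignores periods, to avoid abbreviations), and the
    column it reports is simply the 1-based position of the
    punctuation mark; bookloupe's own rules and column numbering may
    differ.

```python
import re

# Assumption: a comma, semicolon, colon, exclamation mark or question
# mark followed immediately by a letter suggests a missing space.
MISSING_SPACE = re.compile(r"[,;:!?][A-Za-z]")


def missing_space_columns(line):
    """Return 1-based columns of punctuation that lacks a following space."""
    return [match.start() + 1 for match in MISSING_SPACE.finditer(line)]
```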



    Line 2240 column 48 - Spaced punctuation?

    The flip side of the "missing space" error , here , is when extra
    spaces are added before punctuation . Some old texts appear to add
    extra spaces around punctuation consistently, but this was a
    typographical convention rather than the author's intent, and the
    extra "spaces" should be removed when preparing a PG text.



    Line 2301 column 19 - Unspaced quotes?

    Another common spacing problem occurs in a phrase like "You wait
    there,"he said.



    Line 2385 column 27 - Wrongspaced quotes?

    Bookloupe checks whether a quote seems to be a start or end quote,
    and queries those that appear to be misplaced. This does give rise
    to false positives when quotes are nested, for example:

    "And how," she asked, "will your "friends" help you now?"

    but these false positives are worth it because of the many cases
    that this test catches, notably those like:

    "And how, "she said," will your friends help you now?"

    Sometimes a "wrongspaced quotes" query will arise because an earlier
    quote in the paragraph was omitted, so if the place specified seems
    to be OK, look back to see whether there's a problem in the preceding
    lines.



    Line 2400 - HTML Tag? <PRE>

    Some PG texts have been converted from HTML, and not all of the
    HTML tags have been removed.



    Line 2402 - HTML symbol? &emdash;

    Similarly, special HTML symbol characters can survive into PG
    texts. This check can occasionally produce amusing false positives
    like
    . . . Marwick & Co were well known for it;



    Line 2540 - Mismatched quotes

    Another bookloupe mainstay: unclosed doublequotes in a paragraph.
    See the discussion of quotes in the switches section near the
    start of this file.

    Since the mismatch doesn't occur on any one line, bookloupe quotes
    the line number of the first blank line following the paragraph,
    since this is the point where it reconciles the count of quotes.
    However, if bookloupe is echoing lines, that is, you haven't used
    the -e switch, it will show the _first_ line of the paragraph,
    to help you find the place without using line numbers. The
    offending paragraph is therefore between the quoted line and
    the line number given.
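
    The reporting scheme described above can be sketched in Python as
    follows. This is a simplified illustration, not bookloupe's code:
    the function name is hypothetical, and it simply counts straight
    doublequotes in each paragraph without any of the special cases
    discussed in the switches section.

```python
def paragraphs_with_odd_quotes(lines):
    """Return (first_line, blank_line) pairs, both 1-based.

    first_line is where the offending paragraph starts; blank_line is
    the first blank line after it, where the count is reconciled.
    """
    results = []
    count = 0
    first = None
    # A blank sentinel line closes a paragraph that runs to end of file.
    for number, line in enumerate(list(lines) + [""], start=1):
        if line.strip():
            if first is None:
                first = number
            count += line.count('"')
        else:
            if first is not None and count % 2:
                results.append((first, number))
            count, first = 0, None
    return results
```

    Each pair brackets the paragraph to inspect: the problem lies
    somewhere between the two line numbers.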



    Line 2587 - Mismatched single quotes

    Only checked with the -s switch, since checking single quotes is
    not a very reliable process. Otherwise, the same logic as for
    doublequotes applies.



    Line 2877 - Mismatched round brackets?

    Also curly and square brackets. Texts with a lot of brackets, like
    plays with bracketed stage instructions, may have mismatches.
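
    A simple way to picture this check is to compare the number of
    opening and closing brackets of each kind in a paragraph. The
    Python sketch below does just that; the names used are
    hypothetical and the counting approach is an assumption, so the
    real test may behave differently (for instance, it may care about
    nesting order).

```python
# Bracket kinds and their opening/closing characters.
PAIRS = {"round": "()", "square": "[]", "curly": "{}"}


def mismatched_brackets(paragraph):
    """Return the kinds of bracket whose open and close counts differ."""
    return [
        name
        for name, (opener, closer) in PAIRS.items()
        if paragraph.count(opener) != paragraph.count(closer)
    ]
```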


    Line 3150 - No CR?
    Line 3204 - Two successive CRs?
    Line 3281 position 75 - CR without LF?

    These are the invalid line-end warnings. See the discussion of
    line-end checking in the switches section near the start of this
    file. If you see these, and your editor doesn't show anything
    wrong, you should probably try deleting the characters just before
    and after the line end, and the line-end itself, then retyping the
    characters and the line-end.
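
    Because editors often hide these characters, it can help to look
    at the raw bytes. The Python sketch below scans a file's contents
    for the three situations named in the warnings, assuming that the
    expected line end is CR followed by LF. The function name and the
    exact conditions are assumptions for illustration; they are not
    taken from bookloupe's source.

```python
CR, LF = 0x0D, 0x0A


def line_end_problems(data):
    """Scan bytes and return (line_number, message) pairs."""
    problems = []
    line = 1
    for i, byte in enumerate(data):
        previous = data[i - 1] if i > 0 else None
        following = data[i + 1] if i + 1 < len(data) else None
        if byte == LF:
            if previous != CR:
                problems.append((line, "No CR?"))
            line += 1  # a LF always ends the current line
        elif byte == CR:
            if following == CR:
                problems.append((line, "Two successive CRs?"))
            elif following != LF:
                problems.append((line, "CR without LF?"))
    return problems
```

    To try it on a real file, read the file in binary mode, for
    example with open(name, "rb").read(), and pass the result in.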


    Line 2940 - Paragraph starts with lower-case

    A common error in an e-text is for an extra blank line

    to be put in, like the blank line above, and this often
    shows up as a new paragraph beginning with lower case.
    Sometimes the blank line is deliberate, as when a
    quotation is inserted in a speech. Use your judgement.
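
    As an illustration only, the following Python sketch finds lines
    that follow a blank line (or start the file) and begin with a
    lower-case letter. The function name is hypothetical and the
    sketch makes no attempt to recognise deliberate cases.

```python
def lowercase_paragraph_starts(lines):
    """Return 1-based numbers of paragraph-opening lines in lower case."""
    flagged = []
    previous_blank = True  # the start of the file opens a paragraph
    for number, line in enumerate(lines, start=1):
        text = line.strip()
        if text and previous_blank and text[0].islower():
            flagged.append(number)
        previous_blank = not text
    return flagged
```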


    Line 2987 - Extra period?

    An extra period. is a. common problem in OCRed text. and usually
    arises when a speck of dust on the page is mistaken for a period.
    or. as occasionally happens. when a comma loses its tail.


    Line 3012 column 12 - Double punctuation?

    Double punctuation., like that,, is a common typo and
    scanno. Some books have much legit double punctuation,
    like etc., etc., but it's worth checking anyway.
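
    For illustration, a check along these lines could be sketched in
    Python as below. The pattern, the decision to ignore runs made up
    only of periods (so that an unspaced ellipsis is not reported) and
    the function name are all assumptions; bookloupe's own test may be
    stricter or more lenient.

```python
import re

# Assumption: two or more adjacent marks from this set are suspicious.
DOUBLE = re.compile(r"[.,;:!?]{2,}")


def double_punctuation_columns(line):
    """Return 1-based columns where a run of punctuation marks begins."""
    return [
        match.start() + 1
        for match in DOUBLE.finditer(line)
        if set(match.group()) != {"."}  # skip runs of periods only
    ]
```

    As with the real check, a legitimate "etc.," is reported too, so
    each hit still needs a human decision.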



            *       *       *        *

For Windows-only users who are unfamiliar with DOS:

    If you're a Windows-only user, you need to save
    bookloupe.exe into the folder (directory) where the
    text file you want to check is. If, say, your
    text file is in C:\gut, then you should save
    bookloupe.exe into C:\gut.

    Now get to a console. You can do this by
    selecting the "Command Prompt" or "MS-DOS Prompt"
    option that will be somewhere on your
    Start/Programs menu.

    Now get into the C:\gut directory.
    You can do this using the cd (change directory)
    command, like this:
        cd \gut
    and your prompt will change to
        C:\gut>
    so you know you're in the right place.

    Now type
        bookloupe yourfile.txt
    and you'll see bookloupe's report.

    By default, bookloupe prints its queries to the screen.
    If you want to create a file of them, to edit
    against the text, you can use the greater-than
    sign (>) to tell it to output the report to a
    file. For example, if you want its report in a
    file called queries.lst, you could type

        bookloupe yourfile.txt > queries.lst

    The queries.lst file will then contain the listing
    of possible formatting errors, and you can
    edit it alongside your text.

    Whatever you do, DON'T make the filename after
    the greater-than sign the name of a file already
    on your disk that you want to keep, because
    the greater-than sign will replace any existing
    file of that name.

    So, for example, if you have two Tolstoy files
    that you want to check, called WARPEACE.TXT and
    ANNAK.TXT, make sure that neither of these names
    is ever used following the greater-than sign.
    To check these correctly, you might do:

        bookloupe warpeace.txt > war.lst

    and

        bookloupe annak.txt > annak.lst

    separately. Then you can look at war.lst and annak.lst
    to see the bookloupe reports.
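
    If you happen to have Python available, the precaution above can
    be automated. The sketch below only chooses a report filename and
    refuses to reuse one that already exists; it does not run
    bookloupe. The function name report_path_for and the .lst suffix
    are assumptions made for this example.

```python
import os


def report_path_for(text_path, suffix=".lst"):
    """Return a report filename beside text_path, e.g. annak.lst.

    Raises FileExistsError rather than returning the name of a file
    that is already on disk, so nothing can be overwritten by mistake.
    """
    base, _extension = os.path.splitext(text_path)
    report = base + suffix
    if os.path.exists(report):
        raise FileExistsError(report + " already exists; choose another name")
    return report
```

    The name it returns is then safe to use after the greater-than
    sign.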

For Windows-only users who want to use bookloupe from guiguts:

    1) If you haven't already done so, download bookloupe-win32-xxx.zip
    from http://www.juiblex.co.uk/pgdp/bookloupe/

    2) Extract the files into a suitable folder, e.g. C:\DP\bookloupe

    3) Start Guiguts

    4) Choose Preferences | File Paths | Set File Paths..

    5) Click the "Locate Gutcheck..." button

    6) Browse to the folder where you extracted bookloupe

    7) Double-click bookloupe.exe

    Now, whenever you do "Gutcheck" in Guiguts, it will run bookloupe
    instead. Since the output will look very like gutcheck output, you
    may want to check that it is actually bookloupe that is running. To do
    this, look at the black command line message window, which will say:

    "bookloupe: Check and report on an e-text".

    To return to using gutcheck for any reason, repeat steps 4 and 5
    above, and then,

    6b) Browse back to the gutcheck folder, which is in a "tools"
    folder inside the main Guiguts folder. It will be something like
    "C:\DP\guiguts-win\tools\gutcheck", depending on where you installed
    Guiguts originally.

    7b) Double-click gutcheck.exe

    Now doing "Gutcheck" in Guiguts will run gutcheck itself, and the
    message in the black window should read:

    "gutcheck: Check and report on an e-text".