doc/bookloupe.txt
author ali <ali@juiblex.co.uk>
Thu May 30 20:27:45 2013 +0100 (2013-05-30)
changeset 74 411867e8e20b
parent 5 f600b0d1fc5d
child 77 9edfe77d747d
permissions -rw-r--r--
Update documentation

                            Bookloupe documentation


bookloupe: lists possible common formatting errors in a Project
Gutenberg candidate file. Bookloupe is based on gutcheck, written
by Jim Tinsley. It is a command line program and can be used under
Microsoft Windows, Mac or Unix. For Windows-only people, there is
an appendix at the end with brief instructions for running it.

Current version: 1.90, a beta version leading up to version 2.0

This software is Copyright Jim Tinsley 2000-2005 and
J. Ali Harlow 2012 onwards.

Bookloupe comes with ABSOLUTELY NO WARRANTY. For details, read the file COPYING.
This is Free Software; you may redistribute it under certain conditions (GPL).

See http://www.juiblex.co.uk/pgdp/bookloupe/ for the latest version.


Usage is: bookloupe [-setopxlywm] filename
      where:
      -s checks Single quotes
      -e switches off Echoing of lines
      -t checks Typos
      -o produces an Overview only
      -p sets strict quotes checking for Paragraphs
      -x (paranoid) switches OFF typo checking and extra checks
      -l turns off Line-end checks
      -y sets error messages to stdout
      -w is a special mode for web uploads (for future use)
      -v (verbose) forces individual reporting of minor problems
      -m interprets Markup of some common HTML tags and entities
      -u warns about words in a user-defined typo file
         (bookloupe.typ or gutcheck.typ)
      -d ignores some DP-specific markup

Running bookloupe without any parameters will display a brief help message.

Sample usage:

    bookloupe warpeace.txt


More detail:

    Character encoding

      Bookloupe will handle e-texts encoded in UTF-8 (preferred),
      ISO-8859-1 (also known as Latin-1), or WINDOWS-1252 (also known,
      incorrectly, as ANSI). The output will be in the same encoding
      as the input e-text.

    Echoing lines (-e to switch off)

      You may find it convenient, when reviewing Bookloupe's
      suggestions, to see the line that Bookloupe is questioning.
      That way, you can often see at a glance whether it is
      a real error that needs to be fixed, or a false positive:
      something that belongs in the text but that Bookloupe's
      limited programming doesn't understand.

      By default, bookloupe echoes these lines, but if you don't
      want to see the lines referred to, -e will switch it OFF.


    Quotes (-s and -p switches)

      Bookloupe always looks for unbalanced doublequotes in a
      paragraph. It is a common convention for writers not to
      close quotes in a paragraph if the next paragraph opens
      with quotes and is a continuation by the same speaker.

      Bookloupe therefore does not normally report unclosed quotes
      if the next paragraph begins with a quote. If you need
      to see all unclosed quotes, even where the next paragraph
      begins with a quote, you should use the -p switch.
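
      The continuation rule can be sketched in a few lines of
      Python. This is only an illustration, not bookloupe's actual
      code: the function name and the strict parameter (standing in
      for the -p switch) are made up for this example, and it
      reports paragraph numbers where bookloupe reports line numbers.

```python
def unclosed_quote_paragraphs(text, strict=False):
    """Return 1-based numbers of paragraphs with an odd count of doublequotes.

    strict=True mimics the -p switch: report every unbalanced paragraph,
    even when the next paragraph opens with a quote.
    """
    # Paragraphs are assumed to be separated by one blank line.
    paragraphs = [p for p in text.split("\n\n") if p.strip()]
    flagged = []
    for i, para in enumerate(paragraphs):
        if para.count('"') % 2 == 0:
            continue  # quotes balance within this paragraph
        next_para = paragraphs[i + 1].lstrip() if i + 1 < len(paragraphs) else ""
        continued = next_para.startswith('"')  # same speaker carries on
        if strict or not continued:
            flagged.append(i + 1)
    return flagged


sample = (
    '"I went out," he said. "It was dark.\n\n'
    '"So I came back."\n\n'
    '"This one is broken.\n\n'
    'Plain text.'
)
print(unclosed_quote_paragraphs(sample))               # default behaviour
print(unclosed_quote_paragraphs(sample, strict=True))  # like -p
```

      In the sample, the first paragraph is left open on purpose
      because the speaker continues, so only the strict call
      reports it.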

      Singlequotes (') are a problem, since the same character
      is used for an apostrophe. I'm not sure that it is
      possible to get 100% accuracy on singlequotes checking,
      particularly since dialect, quite common in PG texts,
      upsets the normal rules so badly. Consider the sentence:
        'Tis often said that a man's a man for a' that.
      As humans, we recognize that all three apostrophes are used
      for contractions rather than quotes, but it isn't easy
      to get a program to recognize that.

      Since bookloupe makes too many mistakes when trying to match
      singlequotes, it doesn't look for unbalanced singlequotes
      unless you specify the -s switch.

      Consider these sentences, which illustrate the main cases:

        'Tis often said that a fool and his money are soon parted.

        'Becky's goin' home,' said Tom.

        The dogs' tails wagged in unison.

        Those 'pack dogs' of yours look more like wolves.



    Typos (-t switch)

      It's not bookloupe's job to be a spelling checker, but it
      does check for a list of common typos and OCR errors if you
      use the -t switch. (The -x switch, described below, turns
      typo checking off.)

      It also checks for character combinations that rarely or never
      occur, especially those involving h and b, which OCR often
      confuses. For example, it queries "tbe" in a word. Now, "the" often
      occurs, but "tbe" is very rare (heartbeat, hotbed), so I'm
      playing the odds - a few false positives for many errors found.
      Similarly with "ii", which is a very common OCR error.
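
      As a rough sketch of the "playing the odds" idea, the Python
      below queries any word containing one of the rare
      combinations. The list holds only the two combinations named
      above; bookloupe's real list and matching rules are not
      shown in this document, so treat the names here as invented
      for the example.

```python
import re

# Only the two combinations mentioned in the text; the real list is longer.
RARE_COMBOS = ("tbe", "ii")


def query_rare_combos(line):
    """Return the words in line that contain a rarely-seen letter combination."""
    words = re.findall(r"[A-Za-z]+", line)
    return [w for w in words if any(c in w.lower() for c in RARE_COMBOS)]


print(query_rare_combos("Tbe heartbeat of tbe ship"))
print(query_rare_combos("He went skiing"))
```

      Note that "heartbeat" and "skiing" are queried too. Those are
      the few false positives accepted for the sake of catching
      "tbe" for "the".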

      Bookloupe suppresses multiple reporting of the first 40 "typos"
      found. This is to remove the annoyance of seeing something like
      "FN" (footnote) or "LK" (initials) flagged as a typo 147 times
      in a text.


    Line-end checking (-l switch to disable)

      All PG texts should have a Carriage Return (CR - character 13)
      and a Line Feed (LF - character 10) at the end of each line,
      regardless of what O/S you made them on. DOS/Windows, Unix
      and Mac have different conventions, but the final text should
      always use a CR/LF pair as its line terminator.

      By default, bookloupe verifies that every line does have
      the correct terminator. If you're on a work-in-progress
      in Linux, though, you might want to convert the line-ends as
      a final step, and not want to see thousands of errors every
      time you run bookloupe before that final step. In that case
      you can turn off this checking with the -l switch.
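
      For anyone who wants to test line ends outside bookloupe,
      here is a small Python sketch. The three messages are the ones
      bookloupe prints (they are explained later in this file), but
      the exact conditions that trigger each one are my guess, and
      the function name is invented for the example.

```python
def check_line_ends(data: bytes):
    """Return (line_number, message) pairs for lines not ending in CR LF."""
    problems = []
    lines = data.split(b"\n")
    if lines and lines[-1] == b"":
        lines.pop()  # the final LF does not start another line
    for number, line in enumerate(lines, start=1):
        if not line.endswith(b"\r"):
            problems.append((number, "No CR?"))
        elif line.endswith(b"\r\r"):
            problems.append((number, "Two successive CRs?"))
        elif b"\r" in line[:-1]:
            # A CR in the middle of the line has no LF after it.
            problems.append((number, "CR without LF?"))
    return problems


sample = b"one\r\ntwo\nthree\r\r\nfo\rur\r\n"
for number, message in check_line_ends(sample):
    print(f"Line {number} - {message}")
```

      The file has to be read in binary mode, for example with
      open(name, "rb").read(), so that Python does not translate
      the line ends before the check sees them.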


    Paranoid mode (-x switch to disable: Trust No One :-)

      -x switches OFF typo-checking (the -t flag) and some
      extra checks, like the standalone 1 and 0 queries.


    Overview mode (-o switch)

      This mode just gives a count of queries found
      instead of a detailed list.


    Header quote (-h switch)

      If you use the -h switch, bookloupe will also display
      the Title, Author, Release and Edition fields from the
      PG header. This is useful mostly for the automated
      checks we do on recently-posted texts.


    Errors to stdout (-y switch)

      If you're just running bookloupe normally, you can ignore
      this. It's only there for programs that provide a front
      end to bookloupe. It makes error messages appear within
      the output of bookloupe so that the front end knows whether
      bookloupe ran OK.


    Verbose reporting (-v switch)

      Normally, if bookloupe sees lots of long lines, short lines,
      spaced dashes, non-ASCII characters or dot-commas ".," it
      assumes these are features of the text, counts and summarizes
      them at the top of its report, but does not list them
      individually. If the -v switch is on, bookloupe will list them all.


    Markup interpretation (-m switch)

      Normally, bookloupe flags anything it suspects of being HTML
      markup as a possible error. When you use the -m switch,
      however, it matches anything that looks like markup against
      a short list of common HTML tags and entities. If the markup
      is in that list, it either ignores the markup, in the case
      of a tag, or "interprets" the markup as its nearest ASCII
      equivalent, in the case of an entity. So, for example, using
      this switch, bookloupe will "see"

      &ldquo;He went <i>thataway!</i>&rdquo;

      as

      "He went thataway!"

      and report accordingly.
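
      The idea can be sketched in Python as follows. The tag and
      entity tables are small samples that I picked for the
      illustration; they are not bookloupe's own lists, and
      interpret_markup is an invented name.

```python
import re

# Assumed subsets, for illustration only.
KNOWN_TAGS = {"i", "b", "em", "strong", "p", "br"}
KNOWN_ENTITIES = {
    "ldquo": '"', "rdquo": '"', "lsquo": "'", "rsquo": "'",
    "amp": "&", "mdash": "--", "nbsp": " ",
}


def interpret_markup(line):
    """Drop known tags and replace known entities with ASCII equivalents."""

    def tag(match):
        name = match.group(1).lower()
        # Unknown tags are left in place so they can still be queried.
        return "" if name in KNOWN_TAGS else match.group(0)

    def entity(match):
        return KNOWN_ENTITIES.get(match.group(1), match.group(0))

    line = re.sub(r"</?([A-Za-z][A-Za-z0-9]*)[^<>]*>", tag, line)
    return re.sub(r"&([A-Za-z]+);", entity, line)


print(interpret_markup("&ldquo;He went <i>thataway!</i>&rdquo;"))
print(interpret_markup("<PRE>x &foo;"))
```

      Markup that is not in the tables comes through unchanged, so
      the ordinary HTML queries would still fire on it.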

      This switch does not, not, NOT check the validity of HTML;
      it exists so that you can run bookloupe on most HTML texts
      for PG, and get sane results. It does not support all tags.
      It does not support all entities. When it sees a tag or entity
      it does not recognize, it will query it as HTML just as if
      you hadn't specified the -m switch.

      Bookloupe will automatically switch on markup interpretation
      if it sees a lot of tags that appear to be markup, so mostly, you
      won't have to specify this.

    User-defined typos (-u switch)

      If you have a file named bookloupe.typ or gutcheck.typ either
      in your current working directory or in the directory from
      which you explicitly invoked bookloupe, but not necessarily on
      your path, and if you specify the -u switch, bookloupe will
      query any word specified in that file. The file is simple: one
      word, in lower case, per line. Be careful not to put multiple
      words onto a line, or leave any rubbish other than the word on
      the line. You should have received a sample file bookloupe.typ
      with this package. The file may be encoded in UTF-8 (preferred),
      ISO-8859-1 (also known as Latin-1), or WINDOWS-1252 (also known,
      incorrectly, as ANSI).
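
      To show how simple the file format is, here is a Python
      sketch that reads such a word list and queries a text against
      it. Both function names are invented, and the case-insensitive
      matching is my assumption, based on the file holding words in
      lower case.

```python
import re


def load_user_typos(lines):
    """Collect one word per line, skipping blank lines."""
    return {line.strip() for line in lines if line.strip()}


def query_user_typos(text, typos):
    """Return (line_number, word) for each word found in the typo list."""
    hits = []
    for number, line in enumerate(text.splitlines(), start=1):
        # Runs of letters, allowing apostrophes inside a word.
        for word in re.findall(r"[^\W\d_]+(?:'[^\W\d_]+)*", line):
            if word.lower() in typos:
                hits.append((number, word))
    return hits


typos = load_user_typos(["arid\n", "\n", "modem\n"])
print(query_user_typos("The arid plain\nwas Modem in style.", typos))
```

      With a real file you would pass something like
      open("bookloupe.typ", encoding="utf-8") to load_user_typos
      in place of the list used here.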

    Ignore DP markup (-d switch)

      Distributed Proofreaders (http://www.pgdp.net) has for some
      time been the main source of PG texts, and proofers there use
      special conventions. This switch understands those conventions,
      so that people can use bookloupe on files in process that have
      not yet had the special conventions removed. The special
      conventions supported are page-separators and
      "<sc>", "</sc>", "/*", "*/", "/#", "#/", "/$", "$/".


You will probably only run bookloupe on a text once or maybe twice,
just prior to uploading. It usually finds a few formatting problems;
it also usually finds queries that aren't problems at all - it often
questions Tables of Contents for having short lines, for example.
These are called "false positives," and need a human to decide on
them.

The text should be standard prose, and already close to PG normal
format (plain text, about 70 characters per line with blank lines
between paragraphs).

Bookloupe merely draws your attention to things that might be errors.
It is NOT a substitute for human judgement. Formatting choices like
short lines may be for a reason that this program can't understand.

Even the most careful human proofing can leave errors behind in a
text, and there are several automated checks you can do to help find
them. Of these, spellchecking (with _very_ careful human judgement) is
the most important and most useful.

Bookloupe does perform some basic typo-checking if you ask it to,
but its focus is on formatting errors specific to PG texts--
mismatched quotes, non-ASCII characters, bad spacing, bad line
length, HTML tags perhaps left from a conversion, unbalanced
brackets.

Suggestions for additional checks would be appreciated and duly
considered, but no guarantees that they will be implemented.




        How does Jim Tinsley use gutcheck?

Practically everyone I give gutcheck to asks me how _I_ use it.
Well, when I get a text for posting, say filename.txt, I run

    gutcheck -o filename.txt

That gives me a quick idea what I'm dealing with. It'll tell
me what kind of problems gutcheck sees, and give me an idea
of how much more work needs to be done on the text. Keep in
mind that gutcheck doesn't do anything like a full spellcheck,
but when I see a text that has a lot of problems, I assume that
it probably needs a spellcheck too.

Having got a feel for the ballpark, I run

    gutcheck filename.txt > jj

where jj is my personal, all-purpose filename for temporary data
that doesn't need to be kept. Then I open filename.txt and jj in
a split-screen view in my editor, and work down the text, fixing
whatever needs fixing, and skipping whatever doesn't. If your
editor doesn't split-screen, you can get much the same effect by
opening your original file in your normal editor, and jj (or your
equivalent name) in something like Notepad, keeping both in view
at the same time.

Twice a day, an automatic process looks at all recently-posted
texts, and emails Michael, me, and sometimes other people with
their gutcheck summaries.



        Future development of bookloupe

Bookloupe version 2.0 is intended to add UTF-8 support to
gutcheck. All the functionality should already be implemented
in the beta versions leading up to version 2.0, although
some bugs may well remain.

Future versions will add support for UTF-8 characters that
are not in ISO-8859-1 (e.g., curled quotation marks);
characters that do not have a composed form (version 2
treats these as taking 2 or more columns); zero width and
wide characters (version 2 treats these as taking 1 column).




Explanations of common bookloupe messages:

    --> 74 lines in this file have white space at end

    PG texts shouldn't have extra white space added at end of line.
    Don't worry too much about this; the extra spaces do no harm,
    and they'll be removed during posting anyway.


    --> 348 lines in this file are short. Not reporting short lines.
    --> 84 lines in this file are long. Not reporting long lines.
    --> 8 lines in this file are VERY long!

    If there are a lot of long or short lines, bookloupe won't list
    them individually. The short lines version of this message
    is commonly seen when gutchecking poetry and some plays, where
    the normal line length is shorter than the standard for prose.
    A "VERY long" line is one over 80 characters.  You normally
    shouldn't have any of these, but sometimes you may have to render
    a table that must be that long, or some special preformatted
    quotation that can't be broken.


    --> There are 75 spaced dashes and em-dashes in this file. Not reporting them.

    The PG standard for an em-dash--like these--is two minus signs
    with no spaces before or after them. However, some older texts
    used spaced dashes - like these - and if there are very many
    such spaced dashes in the file, bookloupe just draws your
    attention to it and doesn't list them individually.



    Line 3020 - Non-ASCII character 233

    Standard PG texts should use only ASCII characters with values
    up to 127; however, non-English, accented characters can be
    represented according to several different non-ASCII encoding
    schemes, using values over 127. If you have a plain English text
    with a few accented characters in words like cafe or tete-a-tete,
    you might replace the accented characters with their unaccented
    versions. The English pound sign is another commonly-seen
    non-ASCII character. If you have enough non-ASCII characters in
    your text that you feel removing them would degrade your text,
    you should probably consider doing a UTF-8 text.



    Line 1207 - Non-ISO-8859 character 156

    Even in "8-bit" texts, there are distinctions between code sets.
    The ISO-8859 family of 8-bit code sets is the most commonly used
    in PG, and these sets do not define values in the range 128 through
    159 as printable characters. It's quite common for someone on a
    Windows or Mac machine to use a non-ISO character inadvertently,
    so this message warns that the character is not only not ASCII,
    but also outside the ISO-8859 range.



    Line 46 - Tab character?

    Some editors and WPs will put in Tab characters (character 9) to
    indicate indented text. You should not use these in a PG text,
    because you can't be sure how they will appear on a reader's
    screen. Find the Tab, and replace it with the appropriate number
    of spaces.


    Line 1327 - Tilde character?

    The tilde character (~) might be legitimately used, but it's the
    character commonly used by OCR software to indicate a place where
    it couldn't make out the letter, so bookloupe flags it.



    Line 1347 - Asterisk?

    Asterisks are reported only in paranoid mode (see -x).
    Like tildes, they are often used to indicate errors, but they are
    also legitimately used as line delimiters and footnote markers.



    Line 1451 - Long line 129

    PG texts should have lines shorter than 76 characters. There may
    be occasions where you decide that you really have to go out to
    79 characters, but the sample above says that line 1451 is 129
    characters long--probably two lines run together.



    Line 1590 - Short line?

    PG texts should have lines longer than 54 characters. However,
    there are special cases like poetry and tables of contents where
    the lines _should_ be shorter. So treat bookloupe warnings about
    short lines carefully. Sometimes it's a genuine formatting
    problem; sometimes the line really needs to be short.

    Hint: bookloupe will not flag lines as short if they are indented,
    that is, if they start with a space. I like to start inserted
    stanzas and other such items indented with a couple of spaces so
    that they stand out from the main text anyway.
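
    The length rules quoted in this file (longer than 54, shorter
    than 76, "VERY long" over 80, no short-line query for indented
    lines) can be put together in a short Python sketch. The names
    are invented, and the ends_paragraph exemption is my own
    assumption, since the last line of any paragraph is naturally
    short; bookloupe's real logic may differ.

```python
SHORT_BELOW = 55     # lines should be longer than 54 characters
LONG_FROM = 76       # lines should be shorter than 76 characters
VERY_LONG_OVER = 80  # a "VERY long" line is one over 80 characters


def classify_line(line, ends_paragraph=False):
    """Classify one line of text by its length."""
    length = len(line)
    if length > VERY_LONG_OVER:
        return "very long"
    if length >= LONG_FROM:
        return "long"
    indented = line.startswith(" ")
    if 0 < length < SHORT_BELOW and not indented and not ends_paragraph:
        return "short"
    return "ok"


print(classify_line("x" * 129))           # like the Long line 129 sample
print(classify_line("x" * 40))            # short prose line
print(classify_line("  " + "x" * 38))     # indented stanza line
```

    Blank lines are never classed as short, and an indented line
    can still be reported as long.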



    Line 1804 - Begins with punctuation?

    Lines should normally not begin with commas, periods and so on.
    An exception is ellipses . . . which can happen at start of line.



    Line 1850 - Spaced em-dash?

    The PG standard for an em-dash--like these--is two minus signs
    with no spaces before or after them. Bookloupe flags non-PG
    em-dashes - like this one. Normally, you will replace it with a
    PG-standard em-dash.
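
    A regular expression is enough to illustrate this check. The
    pattern below is my own reading of "spaced" (one or two minus
    signs with a space on either side, or two minus signs with a
    space on one side); it is not taken from bookloupe's source,
    and the names are invented.

```python
import re

# " - ", " -- ", " --" and "-- " all count as spaced dashes here.
SPACED_DASH = re.compile(r" -{1,2} | --|-- ")


def spaced_dash_columns(line):
    """Return the 1-based columns where a spaced dash begins."""
    return [m.start() + 1 for m in SPACED_DASH.finditer(line)]


print(spaced_dash_columns("an em-dash--like these--is fine"))
print(spaced_dash_columns("dashes - like this one."))
print(spaced_dash_columns("wait -- what"))
```

    Ordinary hyphens inside words, as in "em-dash", are not
    matched, but a subtraction like "x - 3" would be, so this
    check gives false positives too.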



    Line 1904 - Query he/be error?

    Bookloupe makes a very minor effort to look for that scourge of all
    proofreaders, "be" replacing "he" or vice-versa, and draws your
    attention to it when it thinks it has found one.



    Line 2017 - Query digit in a1most

    The digit 1 is commonly OCRed for the letter l, the digit 0 for
    the letter O, and so on. When bookloupe sees a mix of digits and
    letters, it warns you. It may generate a false positive for
    something like 7am.
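
    The test itself is easy to sketch in Python: split the line
    into runs of letters and digits and query any run that has
    both. This is a bare illustration with an invented function
    name; it makes no exceptions at all, which bookloupe itself
    may well do.

```python
import re


def query_mixed_words(line):
    """Return the words in line that mix letters and digits."""
    hits = []
    for word in re.findall(r"[A-Za-z0-9]+", line):
        if re.search(r"[A-Za-z]", word) and re.search(r"[0-9]", word):
            hits.append(word)
    return hits


for word in query_mixed_words("It was a1most 7am when 1000 men left"):
    print(f"Query digit in {word}")
```

    Pure numbers like 1000 pass, while 7am is queried along with
    a1most, which is the false positive mentioned above.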



    Line 2083 - Query standalone 0

    In paranoid mode (see -x) only, bookloupe warns about the digit 0
    and the number 1 standing alone as a word. This can happen if the
    OCR misreads the words O or I.



    Line 2115 - Query word whetber

    If you have switched typo-checking on, bookloupe looks for
    potential typos, especially common h/b errors. It's not
    infallible; it sometimes queries legit words, but it's
    always worth taking a look.



    Line 2190 column 14 - Missing space?

    Omitting a space is a very common error,especially coming from
    OCRed text,and can be hard for a human to spot. The commas in
    the previous sentence illustrate the kind of thing I mean.



    Line 2240 column 48 - Spaced punctuation?

    The flip side of the "missing space" error , here , is when extra
    spaces are added before punctuation . Some old texts appear to add
    extra spaces around punctuation consistently, but this was a
    typographical convention rather than the author's intent, and the
    extra "spaces" should be removed when preparing a PG text.



    Line 2301 column 19 - Unspaced quotes?

    Another common spacing problem occurs in a phrase like "You wait
    there,"he said.



    Line 2385 column 27 - Wrongspaced quotes?

    Bookloupe checks whether a quote seems to be a start or end quote,
    and queries those that appear to be misplaced. This does give rise
    to false positives when quotes are nested, for example:

    "And how," she asked, "will your "friends" help you now?"

    but these false positives are worth it because of the many cases
    that this test catches, notably those like:

    "And how, "she said," will your friends help you now?"

    Sometimes a "wrongspaced quotes" query will arise because an earlier
    quote in the paragraph was omitted, so if the place specified seems
    to be OK, look back to see whether there's a problem in the preceding
    lines.



    Line 2400 - HTML Tag? <PRE>

    Some PG texts have been converted from HTML, and not all of the
    HTML tags have been removed.



    Line 2402 - HTML symbol? &emdash;

    Similarly, special HTML symbol characters can survive into PG
    texts. This check can occasionally produce amusing false
    positives like
    . . . Marwick & Co were well known for it;



    Line 2540 - Mismatched quotes

    Another bookloupe mainstay--unclosed doublequotes in a paragraph.
    See the discussion of quotes in the switches section near the
    start of this file.

    Since the mismatch doesn't occur on any one line, bookloupe quotes
    the line number of the first blank line following the paragraph,
    since this is the point where it reconciles the count of quotes.
    However, if bookloupe is echoing lines, that is, you haven't used
    the -e switch, it will show the _first_ line of the paragraph,
    to help you find the place without using line numbers. The
    offending paragraph is therefore between the quoted line and
    the line number given.



    Line 2587 - Mismatched single quotes

    Only checked with the -s switch, since checking single quotes is
    not a very reliable process. Otherwise, the same logic as for
    doublequotes applies.



    Line 2877 - Mismatched round brackets?

    Also curly and square brackets. Texts with a lot of brackets, like
    plays with bracketed stage instructions, may have mismatches.
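
    A count-based version of this check is sketched below in
    Python. Comparing the number of opening and closing brackets
    per paragraph is my assumption, by analogy with the way quotes
    are reconciled at the end of a paragraph; the function name is
    invented.

```python
PAIRS = {"(": ")", "[": "]", "{": "}"}
NAMES = {"(": "round", "[": "square", "{": "curly"}


def mismatched_brackets(paragraph):
    """Return the kinds of bracket whose open and close counts differ."""
    return [
        NAMES[opener]
        for opener, closer in PAIRS.items()
        if paragraph.count(opener) != paragraph.count(closer)
    ]


for kind in mismatched_brackets("[Exit (stage left]"):
    print(f"Mismatched {kind} brackets?")
```

    Counting alone cannot tell that brackets are in the wrong
    order, only that one is missing, which is usually what happens
    in stage directions.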


    Line 3150 - No CR?
    Line 3204 - Two successive CRs?
    Line 3281 position 75 - CR without LF?

    These are the invalid line-end warnings. See the discussion of
    line-end checking in the switches section near the start of this
    file. If you see these, and your editor doesn't show anything
    wrong, you should probably try deleting the characters just before
    and after the line end, and the line-end itself, then retyping the
    characters and the line-end.


    Line 2940 - Paragraph starts with lower-case

    A common error in an e-text is for an extra blank line

    to be put in, like the blank line above, and this often
    shows up as a new paragraph beginning with lower case.
    Sometimes the blank line is deliberate, as when a
    quotation is inserted in a speech. Use your judgement.


    Line 2987 - Extra period?

    An extra period. is a. common problem in OCRed text. and usually
    arises when a speck of dust on the page is mistaken for a period.
    or. as occasionally happens. when a comma loses its tail.


    Line 3012 column 12 - Double punctuation?

    Double punctuation., like that,, is a common typo and
    scanno. Some books have much legit double punctuation,
    like etc., etc., but it's worth checking anyway.



            *       *       *        *

For Windows-only users who are unfamiliar with DOS:

    If you're a Windows-only user, you need to save
    bookloupe.exe into the folder (directory) where the
    text file you want to check is. Let's say your
    text file is in C:\gut, then you should save
    bookloupe.exe into C:\gut.

    Now get to a console. You can do this by
    selecting the "Command Prompt" or "MS-DOS Prompt"
    option that will be somewhere on your
    Start/Programs menu.

    Now get into the C:\gut directory.
    You can do this using the cd (change directory)
    command, like this:
        cd \gut
    and your prompt will change to
        C:\gut>
    so you know you're in the right place.

    Now type
        bookloupe yourfile.txt
    and you'll see bookloupe's report.

    By default, bookloupe prints its queries to screen.
    If you want to create a file of them, to edit
    against the text, you can use the greater-than
    sign (>) to tell it to output the report to a
    file. For example, if you want its report in a
    file called queries.lst, you could type

        bookloupe yourfile.txt > queries.lst

    The queries.lst file will then contain the listing
    of possible formatting errors, and you can
    edit it alongside your text.

    Whatever you do, DON'T make the filename after
    the greater-than sign the name of a file already
    on your disk that you want to keep, because
    the greater-than sign will cause the report to
    overwrite any existing file of that name.

    So, for example, if you have two Tolstoy files
    that you want to check, called WARPEACE.TXT and
    ANNAK.TXT, make sure that neither of these names
    is ever used following the greater-than sign.
    To check these correctly, you might do:

    bookloupe warpeace.txt > war.lst

    and

    bookloupe annak.txt > annak.lst

    separately. Then you can look at war.lst and annak.lst
    to see the bookloupe reports.