doc/bookloupe.txt
author ali <ali@juiblex.co.uk>
Thu Oct 17 08:07:48 2013 +0100 (2013-10-17)
changeset 179 589d5af2c38d
Bugs #13+14: charsets in configuration files

                            Bookloupe documentation


bookloupe: lists possible common formatting errors in a Project
Gutenberg candidate file. Bookloupe is based on gutcheck, written
by Jim Tinsley. It is a command line program and can be used under
Microsoft Windows, Mac or Unix. For Windows-only people, there is
an appendix at the end with brief instructions for running it.

Current version: 2.0

This software is Copyright Jim Tinsley 2000-2005 and
J. Ali Harlow 2012 onwards.

Bookloupe comes with ABSOLUTELY NO WARRANTY. For details, read the file COPYING.
This is Free Software; you may redistribute it under certain conditions (GPL).

See http://www.juiblex.co.uk/pgdp/bookloupe/ for the latest version.


Usage is: bookloupe [-setopxlywm] filename
      where:
      -s checks Single quotes
      -e switches off Echoing of lines
      -t checks Typos
      -o produces an Overview only
      -p sets strict quotes checking for Paragraphs
      -x (paranoid) switches OFF typo checking and extra checks
      -l turns off Line-end checks
      -y sets error messages to stdout
      -w is a special mode for web uploads (for future use)
      -v (verbose) forces individual reporting of minor problems
      -m interprets Markup of some common HTML tags and entities
      -u warns about words in a user-defined typo file bookloupe.typ
      -d ignores some DP-specific markup

Running bookloupe without any parameters will display a brief help message.


Sample usage:

    bookloupe warpeace.txt


More detail:

    Character encoding

      Bookloupe will handle e-texts encoded in UTF-8 (preferred),
      ISO-8859-1 (also known as Latin-1), or WINDOWS-1252 (also known,
      incorrectly, as ansi). The output will be in the same encoding
      as the input e-text.

    Echoing lines (-e to switch off)

      You may find it convenient, when reviewing Bookloupe's
      suggestions, to see the line that Bookloupe is questioning.
      That way, you can often see at a glance whether it is
      a real error that needs to be fixed, or a false positive:
      something that belongs in the text, but that Bookloupe's
      limited programming doesn't understand.

      By default, bookloupe echoes these lines, but if you don't
      want to see the lines referred to, -e will switch it OFF.


    Quotes (-s and -p switches)

      Bookloupe always looks for unbalanced doublequotes in a
      paragraph. It is a common convention for writers not to
      close quotes in a paragraph if the next paragraph opens
      with quotes and is a continuation by the same speaker.

      Bookloupe therefore does not normally report unclosed quotes
      if the next paragraph begins with a quote. If you need
      to see all unclosed quotes, even where the next paragraph
      begins with a quote, you should use the -p switch.
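
      As an illustration only, here is a minimal Python sketch of the
      rule just described. It is not bookloupe's own code; the function
      name and the "strict" argument (standing in for the -p switch)
      are made up for this example, and it assumes that paragraphs are
      separated by blank lines.

```python
def unclosed_quote_paragraphs(text, strict=False):
    """Return 1-based numbers of paragraphs with an odd count of '"'.

    With strict=False, a paragraph left open is not reported when the
    next paragraph begins with a doublequote (continued speech).
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    flagged = []
    for i, para in enumerate(paragraphs):
        if para.count('"') % 2 == 0:
            continue  # doublequotes balance within this paragraph
        next_opens = i + 1 < len(paragraphs) and paragraphs[i + 1].startswith('"')
        if strict or not next_opens:
            flagged.append(i + 1)
    return flagged


sample = (
    '"I shall go on," he said, "for as long as it takes.\n\n'
    '"And then I shall stop."\n\n'
    '"This one is simply broken.\n\n'
    'Plain narration follows.'
)
print(unclosed_quote_paragraphs(sample))               # prints [3]
print(unclosed_quote_paragraphs(sample, strict=True))  # prints [1, 3]
```

      The first paragraph is left open on purpose, because the same
      speaker continues in the second, so it is only reported in the
      strict case.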

      Singlequotes (' and ’) are a problem, since the same
      character is used for an apostrophe. I'm not sure that it is
      possible to get 100% accuracy on singlequotes checking,
      particularly since dialect, quite common in PG texts,
      upsets the normal rules so badly. Consider the sentence:
        'Tis often said that a man's a man for a' that.
      As humans, we recognize that all of these apostrophes are
      used for contractions rather than quotes, but it isn't easy
      to get a program to recognize that.

      Since bookloupe makes too many mistakes when trying to match
      singlequotes, it doesn't look for unbalanced singlequotes
      unless you specify the -s switch.

      Consider these sentences, which illustrate the main cases:

        'Tis often said that a fool and his money are soon parted.

        'Becky's goin' home,' said Tom.

        The dogs' tails wagged in unison.

        Those 'pack dogs' of yours look more like wolves.



    Typos (-t switch)

      It's not bookloupe's job to be a spelling checker, but it
      does check for a list of common typos and OCR errors if you
      use the -t switch. (Paranoid mode, which is on unless you
      give the -x switch, also turns typo checking on.)

      It also checks for character combinations that rarely or never
      occur, especially those involving h and b, which are often
      confused by OCR. For example, it queries "tbe" in a word. Now,
      "the" often occurs, but "tbe" is very rare (heartbeat, hotbed),
      so I'm playing the odds - a few false positives for many errors
      found. Similarly with "ii", which is a very common OCR error.

      Bookloupe suppresses multiple reporting of the first 40 "typos"
      found. This is to remove the annoyance of seeing something like
      "FN" (footnote) or "LK" (initials) flagged as a typo 147 times
      in a text.


    Line-end checking (-l switch to disable)

      All PG texts should have a Carriage Return (CR - character 13)
      and a Line Feed (LF - character 10) at end of each line,
      regardless of what O/S you made them on. DOS/Windows, Unix
      and Mac have different conventions, but the final text should
      always use a CR/LF pair as its line terminator.

      By default, bookloupe verifies that every line does have
      the correct terminator, but if you're on a work-in-progress
      in Linux, you might want to convert the line-ends as a final
      step, and not want to see thousands of errors every time you
      run bookloupe before that final step, so you can turn off
      this checking with the -l switch.
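
      To make the CR/LF rule concrete, here is a small Python sketch
      that looks for line ends that are not a CR/LF pair. It is only
      an illustration, not how bookloupe itself does the check; the
      function name and the wording of the messages are invented for
      this example.

```python
def check_line_ends(data: bytes):
    """Return (line_number, problem) pairs for line ends that are not CR/LF."""
    problems = []
    lines = data.split(b"\n")
    if lines and lines[-1] == b"":
        lines.pop()  # the data ended with LF, so there is no partial last line
    for number, line in enumerate(lines, start=1):
        if not line.endswith(b"\r"):
            problems.append((number, "LF not preceded by CR"))
        elif b"\r" in line[:-1]:
            problems.append((number, "stray CR inside line"))
    return problems


sample = b"one\r\ntwo\nthr\ree\r\n"
print(check_line_ends(sample))
# prints [(2, 'LF not preceded by CR'), (3, 'stray CR inside line')]
```

      Note that the data is read as bytes, so that the check sees the
      line ends exactly as they are stored in the file.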


    Paranoid mode (-x switch to disable: Trust No One :-)

      -x automatically switches OFF typo-checking (the -t flag)
      and some extra checks like standalone 1 and 0 queries.


    Overview mode (-o switch)

      This mode just gives a count of queries found
      instead of a detailed list.


    Header quote (-h switch)

      If you use the -h switch, bookloupe will also display
      the Title, Author, Release and Edition fields from the
      PG header. This is useful mostly for the automated
      checks we do on recently-posted texts.


    Errors to stdout (-y switch)

      If you're just running bookloupe normally, you can ignore
      this. It's only there for programs that provide a front
      end to bookloupe. It makes error messages appear within
      the output of bookloupe so that the front end knows whether
      bookloupe ran OK.


    Verbose reporting (-v switch)

      Normally, if bookloupe sees lots of long lines, short lines,
      spaced dashes, non-ASCII characters or dot-commas ".," it
      assumes these are features of the text, counts and summarizes
      them at the top of its report, but does not list them
      individually. If the -v switch is on, bookloupe will list them all.


    Markup interpretation (-m switch)

      Normally, bookloupe flags anything it suspects of being HTML
      markup as a possible error. When you use the -m switch,
      however, it matches anything that looks like markup against
      a short list of common HTML tags and entities. If the markup
      is in that list, it either ignores the markup, in the case
      of a tag, or "interprets" the markup as its nearest ASCII
      equivalent, in the case of an entity. So, for example, using
      this switch, bookloupe will "see"

      &ldquo;He went <i>thataway!</i>&rdquo;

      as

      "He went thataway!"

      and report accordingly.
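
      The idea can be sketched in a few lines of Python. This is an
      illustration under stated assumptions, not bookloupe's own code:
      the two tables below are tiny, hypothetical stand-ins for the
      real lists of supported tags and entities, and the function name
      is made up.

```python
import re

# Hypothetical, deliberately tiny tables for this example only.
KNOWN_TAGS = {"i", "b", "em", "p"}
KNOWN_ENTITIES = {"ldquo": '"', "rdquo": '"', "lsquo": "'", "rsquo": "'", "amp": "&"}


def interpret_markup(line):
    """Drop known tags and replace known entities; leave the rest alone."""

    def tag(match):
        name = match.group(1).lower()
        return "" if name in KNOWN_TAGS else match.group(0)

    def entity(match):
        return KNOWN_ENTITIES.get(match.group(1), match.group(0))

    line = re.sub(r"</?([A-Za-z][A-Za-z0-9]*)[^<>]*>", tag, line)
    return re.sub(r"&([A-Za-z]+);", entity, line)


print(interpret_markup("&ldquo;He went <i>thataway!</i>&rdquo;"))
# prints "He went thataway!"
```

      Anything that is not in the tables is passed through unchanged,
      so that it can still be queried as possible HTML.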

      This switch does not, not, NOT check the validity of HTML;
      it exists so that you can run bookloupe on most HTML texts
      for PG, and get sane results. It does not support all tags.
      It does not support all entities. When it sees a tag or entity
      it does not recognize, it will query it as HTML just as if
      you hadn't specified the -m switch.

      Bookloupe will automatically switch on markup interpretation
      if it sees a lot of tags that appear to be markup, so mostly, you
      won't have to specify this.


    User-defined typos (-u switch)

      If you have a file named bookloupe.typ or gutcheck.typ either
      in your current working directory or in the directory from
      which you explicitly invoked bookloupe, but not necessarily on
      your path, and if you specify the -u switch, bookloupe will
      query any word specified in that file. The file is simple: one
      word, in lower case, per line. Be careful not to put multiple
      words onto a line, or leave any rubbish other than the word on
      the line. You should have received a sample file bookloupe.typ
      with this package. The file may be encoded in UTF-8 (preferred),
      ISO-8859-1 (also known as Latin-1), or WINDOWS-1252 (also known,
      incorrectly, as ansi).
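
      For the curious, here is a rough Python sketch of reading such a
      word list and querying a line of text against it. It is not
      bookloupe's own code: the function names are invented, and the
      way the encoding is guessed (try UTF-8 first, then fall back to
      WINDOWS-1252) is an assumption made for this example.

```python
import re


def load_typo_words(raw: bytes):
    """Decode a typo list (one lower-case word per line) into a set."""
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        # Assumption for this sketch: anything that is not valid UTF-8
        # is treated as WINDOWS-1252.
        text = raw.decode("cp1252", errors="replace")
    return {line.strip() for line in text.splitlines() if line.strip()}


def query_words(line, typo_words):
    """Return the words on this line that appear in the typo list."""
    words = re.findall(r"[^\W\d_]+(?:'[^\W\d_]+)*", line)
    return [word for word in words if word.lower() in typo_words]


typo_words = load_typo_words(b"arid\ncaf\xe9\n")
print(query_words("The Arid plain near the caf\u00e9.", typo_words))
# prints ['Arid', 'café']
```

      The comparison is made in lower case, which is why the words in
      the file itself need to be in lower case.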


    Ignore DP markup (-d switch)

      Distributed Proofreaders (http://www.pgdp.net) has for some
      time been the main source of PG texts, and proofers there use
      special conventions. This switch understands those conventions,
      so that people can use bookloupe on files in process that
      haven't yet had the special conventions removed. The special
      conventions supported are page-separators and
      "<sc>", "</sc>", "/*", "*/", "/#", "#/", "/$", "$/".


You will probably only run bookloupe on a text once or maybe twice,
just prior to uploading; it usually finds a few formatting problems;
it also usually finds queries that aren't problems at all - it often
questions Tables of Contents for having short lines, for example.
These are called "false positives," and need a human to decide on
them.

The text should be standard prose, and already close to PG normal
format (plain text, about 70 characters per line with blank lines
between paragraphs).

Bookloupe merely draws your attention to things that might be errors.
It is NOT a substitute for human judgement. Formatting choices like
short lines may be for a reason that this program can't understand.

Even the most careful human proofing can leave errors behind in a
text, and there are several automated checks you can do to help find
them. Of these, spellchecking (with _very_ careful human judgement) is
the most important and most useful.

Bookloupe does perform some basic typo-checking if you ask it to,
but its focus is on formatting errors specific to PG texts:
mismatched quotes, non-ASCII characters, bad spacing, bad line
length, HTML tags perhaps left from a conversion, unbalanced
brackets.

Suggestions for additional checks would be appreciated and duly
considered, but no guarantees that they will be implemented.




        How does Jim Tinsley use gutcheck?

Practically everyone I give gutcheck to asks me how _I_ use it.
Well, when I get a text for posting, say filename.txt, I run

    gutcheck -o filename.txt

That gives me a quick idea what I'm dealing with. It'll tell
me what kind of problems gutcheck sees, and give me an idea
of how much more work needs to be done on the text. Keep in
mind that gutcheck doesn't do anything like a full spellcheck,
but when I see a text that has a lot of problems, I assume that
it probably needs a spellcheck too.

Having got a feel for the ballpark, I run

    gutcheck filename.txt > jj

where jj is my personal, all-purpose filename for temporary data
that doesn't need to be kept. Then I open filename.txt and jj in
a split-screen view in my editor, and work down the text, fixing
whatever needs fixing, and skipping whatever doesn't. If your
editor doesn't split-screen, you can get much the same effect by
opening your original file in your normal editor, and jj (or your
equivalent name) in something like Notepad, keeping both in view
at the same time.

Twice a day, an automatic process looks at all recently-posted
texts, and emails Michael, me, and sometimes other people with
their gutcheck summaries.



        Future development of bookloupe

Future versions will add support for UTF-8 characters that
are not in ISO-8859-1 (e.g., curled quotation marks);
characters that do not have a composed form (version 2.0
treats these as taking 2 or more columns); zero width and
wide characters (version 2.0 treats these as taking 1 column).




Explanations of common bookloupe messages:

    --> 74 lines in this file have white space at end

    PG texts shouldn't have extra white space added at end of line.
    Don't worry too much about these; they're not doing any harm,
    and they'll be removed during posting anyway.


    --> 348 lines in this file are short. Not reporting short lines.
    --> 84 lines in this file are long. Not reporting long lines.
    --> 8 lines in this file are VERY long!

    If there are a lot of long or short lines, bookloupe won't list
    them individually. The short lines version of this message
    is commonly seen when gutchecking poetry and some plays, where
    the normal line length is shorter than the standard for prose.
    A "VERY long" line is one over 80 characters.  You normally
    shouldn't have any of these, but sometimes you may have to render
    a table that must be that long, or some special preformatted
    quotation that can't be broken.


    --> There are 75 spaced dashes and em-dashes in this file. Not reporting them.

    The PG standard for an emdash--like these--is two minus signs
    with no spaces before or after them. However, some older texts
    used spaced dashes - like these -- and if there are very many
    such spaced dashes in the file, bookloupe just draws your
    attention to it and doesn't list them individually.



    Line 3020 - Non-ASCII character 233

    Standard PG texts should use only ASCII characters with values
    up to 127; however, non-English, accented characters can be
    represented according to several different non-ASCII encoding
    schemes, using values over 127. If you have a plain English text
    with a few accented characters in words like cafe or tete-a-tete,
    you might replace the accented characters with their unaccented
    versions. The English pound sign is another commonly-seen
    non-ASCII character. If you have enough non-ASCII characters in
    your text that you feel removing them would degrade your text,
    you should probably consider doing a UTF-8 text.
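
    If you want to see where the number in the message comes from,
    this short Python sketch lists every character with a value over
    127, along with the line it is on. It is only an illustration
    (the function name is made up), not bookloupe's own code.

```python
def non_ascii_report(lines):
    """Return (line_number, character_value) for each character above 127."""
    return [
        (number, ord(ch))
        for number, line in enumerate(lines, start=1)
        for ch in line
        if ord(ch) > 127
    ]


sample = ["plain text", "a caf\u00e9 table", "\u00a35 each"]
print(non_ascii_report(sample))
# prints [(2, 233), (3, 163)]
```

    Here 233 is the e-acute, as in the sample message above, and 163
    is the pound sign.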



    Line 1207 - Non-ISO-8859 character 156

    Even in "8-bit" texts, there are distinctions between code sets.
    The ISO-8859 family of 8-bit code sets is the most commonly used
    in PG, and these sets do not define values in the range 128 through
    159 as printable characters. It's quite common for someone on a
    Windows or Mac machine to use a non-ISO character inadvertently,
    so this message warns that the character is not only not ASCII,
    but also outside the ISO-8859 range.



    Line 46 - Tab character?

    Some editors and WPs will put in Tab characters (character 9) to
    indicate indented text. You should not use these in a PG text,
    because you can't be sure how they will appear on a reader's
    screen. Find the Tab, and replace it with the appropriate number
    of spaces.


    Line 1327 - Tilde character?

    The tilde character (~) might be legitimately used, but it's the
    character commonly used by OCR software to indicate a place where
    it couldn't make out the letter, so bookloupe flags it.



    Line 1347 - Asterisk?

    Asterisks are reported only in paranoid mode (see -x).
    Like tildes, they are often used to indicate errors, but they are
    also legitimately used as line delimiters and footnote markers.



    Line 1451 - Long line 129

    PG texts should have lines shorter than 76 characters. There may
    be occasions where you decide that you really have to go out to
    79 characters, but the sample above says that line 1451 is 129
    characters long, which is probably two lines run together.



    Line 1590 - Short line?

    PG texts should have lines longer than 54 characters. However,
    there are special cases like poetry and tables of contents where
    the lines _should_ be shorter. So treat bookloupe warnings about
    short lines carefully. Sometimes it's a genuine formatting
    problem; sometimes the line really needs to be short.

    Hint: bookloupe will not flag lines as short if they are indented,
    that is, if they start with a space. I like to start inserted
    stanzas and other such items indented with a couple of spaces so
    that they stand out from the main text anyway.
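
    The line-length rules above can be put together in a short Python
    sketch. This is an illustration, not bookloupe's own code. The
    limits are taken from the text (shorter than 76, longer than 54,
    "VERY long" over 80), but the names are invented, and skipping the
    last line of a paragraph is an assumption made here so that every
    paragraph end is not reported as short.

```python
LONGEST = 75     # "lines shorter than 76"
VERY_LONG = 80   # a "VERY long" line is one over 80 characters
SHORTEST = 55    # "lines longer than 54 characters"


def classify_lines(lines):
    """Return (line_number, kind) for lines that look too long or too short."""
    results = []
    for index, line in enumerate(lines):
        length = len(line)
        if length > VERY_LONG:
            results.append((index + 1, "very long"))
        elif length > LONGEST:
            results.append((index + 1, "long"))
        elif 0 < length < SHORTEST and not line.startswith(" "):
            next_line = lines[index + 1] if index + 1 < len(lines) else ""
            # Assumption: the last line of a paragraph is allowed to be short.
            if next_line.strip():
                results.append((index + 1, "short"))
    return results


sample = [
    "x" * 60,
    "short one",
    "y" * 60,
    "",
    "  indented stanza",
    "z" * 78,
    "w" * 90,
    "last",
]
print(classify_lines(sample))
# prints [(2, 'short'), (6, 'long'), (7, 'very long')]
```

    The indented line is left alone, as the hint above describes.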



    Line 1804 - Begins with punctuation?

    Lines should normally not begin with commas, periods and so on.
    An exception is ellipses . . . which can happen at start of line.



    Line 1850 - Spaced em-dash?

    The PG standard for an em-dash--like these--is two minus signs
    with no spaces before or after them. Bookloupe flags non-PG
    em-dashes - like this one. Normally, you will replace it with a
    PG-standard em-dash.



    Line 1904 - Query he/be error?

    Bookloupe makes a very minor effort to look for that scourge of all
    proofreaders, "be" replacing "he" or vice-versa, and draws your
    attention to it when it thinks it has found one.



    Line 2017 - Query digit in a1most

    The digit 1 is commonly OCRed for the letter l, the digit 0 for
    the letter O, and so on. When bookloupe sees a mix of digits and
    letters, it warns you. It may generate a false positive for
    something like 7am.
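
    A check of this kind is easy to sketch in Python. The following
    is only an illustration, not bookloupe's own code; the function
    name is made up, and for simplicity it only looks at unaccented
    (ASCII) letters.

```python
import re


def mixed_digit_words(line):
    """Return the words on this line that mix letters and digits."""
    words = re.findall(r"[A-Za-z0-9]+", line)
    return [
        word
        for word in words
        if re.search(r"[A-Za-z]", word) and re.search(r"[0-9]", word)
    ]


print(mixed_digit_words("He had a1most finished by 7am on 12 May."))
# prints ['a1most', '7am']
```

    As you can see, the false positive for 7am turns up here too.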



    Line 2083 - Query standalone 0

    In paranoid mode (see -x) only, bookloupe warns about the digit 0
    and the number 1 standing alone as a word. This can happen if the
    OCR misreads the words O or I.



    Line 2115 - Query word whetber

    If you have switched typo-checking on, bookloupe looks for
    potential typos, especially common h/b errors. It's not
    infallible; it sometimes queries legit words, but it's
    always worth taking a look.



    Line 2190 column 14 - Missing space?

    Omitting a space is a very common error,especially coming from
    OCRed text,and can be hard for a human to spot. The commas in
    the previous sentence illustrate the kind of thing I mean.



    Line 2240 column 48 - Spaced punctuation?

    The flip side of the "missing space" error , here , is when extra
    spaces are added before punctuation . Some old texts appear to add
    extra spaces around punctuation consistently, but this was a
    typographical convention rather than the author's intent, and the
    extra "spaces" should be removed when preparing a PG text.



    Line 2301 column 19 - Unspaced quotes?

    Another common spacing problem occurs in a phrase like "You wait
    there,"he said.



    Line 2385 column 27 - Wrongspaced quotes?

    Bookloupe checks whether a quote seems to be a start or end quote,
    and queries those that appear to be misplaced. This does give rise
    to false positives when quotes are nested, for example:

    "And how," she asked, "will your "friends" help you now?"

    but these false positives are worth it because of the many cases
    that this test catches, notably those like:

    "And how, "she said," will your friends help you now?"

    Sometimes a "wrongspaced quotes" query will arise because an earlier
    quote in the paragraph was omitted, so if the place specified seems
    to be OK, look back to see whether there's a problem in the preceding
    lines.



    Line 2400 - HTML Tag? <PRE>

    Some PG texts have been converted from HTML, and not all of the
    HTML tags have been removed.



    Line 2402 - HTML symbol? &emdash;

    Similarly, special HTML symbol characters can survive into PG
    texts. This check can occasionally produce amusing false
    positives like
    . . . Marwick & Co were well known for it;



    Line 2540 - Mismatched quotes

    Another bookloupe mainstay: unclosed doublequotes in a paragraph.
    See the discussion of quotes in the switches section near the
    start of this file.

    Since the mismatch doesn't occur on any one line, bookloupe quotes
    the line number of the first blank line following the paragraph,
    since this is the point where it reconciles the count of quotes.
    However, if bookloupe is echoing lines, that is, you haven't used
    the -e switch, it will show the _first_ line of the paragraph,
    to help you find the place without using line numbers. The
    offending paragraph is therefore between the quoted line and
    the line number given.



    Line 2587 - Mismatched single quotes

    Only checked with the -s switch, since checking single quotes is
    not a very reliable process. Otherwise, the same logic as for
    doublequotes applies.



    Line 2877 - Mismatched round brackets?

    Also curly and square brackets. Texts with a lot of brackets, like
    plays with bracketed stage instructions, may have mismatches.


    Line 3150 - No CR?
    Line 3204 - Two successive CRs?
    Line 3281 position 75 - CR without LF?

    These are the invalid line-end warnings. See the discussion of
    line-end checking in the switches section near the start of this
    file. If you see these, and your editor doesn't show anything
    wrong, you should probably try deleting the characters just before
    and after the line end, and the line-end itself, then retyping the
    characters and the line-end.


    Line 2940 - Paragraph starts with lower-case

    A common error in an e-text is for an extra blank line

    to be put in, like the blank line above, and this often
    shows up as a new paragraph beginning with lower case.
    Sometimes the blank line is deliberate, as when a
    quotation is inserted in a speech. Use your judgement.


    Line 2987 - Extra period?

    An extra period. is a. common problem in OCRed text. and usually
    arises when a speck of dust on the page is mistaken for a period.
    or. as occasionally happens. when a comma loses its tail.


    Line 3012 column 12 - Double punctuation?

    Double punctuation., like that,, is a common typo and
    scanno. Some books have much legit double punctuation,
    like etc., etc., but it's worth checking anyway.



            *       *       *        *

For Windows-only users who are unfamiliar with DOS:

    If you're a Windows-only user, you need to save
    bookloupe.exe into the folder (directory) where the
    text file you want to check is. Let's say your
    text file is in C:\gut, then you should save
    bookloupe.exe into C:\gut.

    Now get to a console. You can do this by
    selecting the "Command Prompt" or "MS-DOS Prompt"
    option that will be somewhere on your
    Start/Programs menu.

    Now get into the C:\gut directory.
    You can do this using the cd (change directory)
    command, like this:
        cd \gut
    and your prompt will change to
        C:\gut>
    so you know you're in the right place.

    Now type
        bookloupe yourfile.txt
    and you'll see bookloupe's report.

    By default, bookloupe prints its queries to screen.
    If you want to create a file of them, to edit
    against the text, you can use the greater-than
    sign (>) to tell it to output the report to a
    file. For example, if you want its report in a
    file called queries.lst, you could type

        bookloupe yourfile.txt > queries.lst

    The queries.lst file will then contain the listing
    of possible formatting errors, and you can
    edit it alongside your text.

    Whatever you do, DON'T make the filename after
    the greater-than sign the name of a file already
    on your disk that you want to keep, because
    the greater-than sign will cause any existing
    file of that name to be replaced.

    So, for example, if you have two Tolstoy files
    that you want to check, called WARPEACE.TXT and
    ANNAK.TXT, make sure that neither of these names
    is ever used following the greater-than sign.
    To check these correctly, you might do:

    bookloupe warpeace.txt > war.lst

    and

    bookloupe annak.txt > annak.lst

    separately. Then you can look at war.lst and annak.lst
    to see the bookloupe reports.

For Windows-only users who want to use bookloupe from Guiguts:

    1) If you haven't already done so, download bookloupe-win32-xxx.zip
    from http://www.juiblex.co.uk/pgdp/bookloupe/

    2) Extract the files into a suitable folder, e.g. C:\DP\bookloupe

    3) Start Guiguts

    4) Choose Preferences | File Paths | Set File Paths..

    5) Click the "Locate Gutcheck..." button

    6) Browse to the folder where you extracted bookloupe

    7) Double-click bookloupe.exe

    Now, whenever you do "Gutcheck" in Guiguts, it will run bookloupe
    instead. Since the output will look very like gutcheck output, you
    may want to check that it is actually bookloupe that is running. To do
    this, look at the black command line message window, which will say:

    "bookloupe: Check and report on an e-text".

    To return to using gutcheck for any reason, repeat steps 4 and 5
    above, and then,

    6b) Browse back to the gutcheck folder, which is in a "tools"
    folder inside the main Guiguts folder. It will be something like
    "C:\DP\guiguts-win\tools\gutcheck", depending on where you installed
    Guiguts originally.

    7b) Double-click gutcheck.exe

    Now doing "Gutcheck" in Guiguts will run gutcheck itself, and the
    message in the black window should read:

    "gutcheck: Check and report on an e-text".