doc/gutcheck.txt
author ali <ali@juiblex.co.uk>
Tue Jan 24 23:54:05 2012 +0000 (2012-01-24)
changeset 0 c2f4c0285180
permissions -rw-r--r--
Initial version


                            Gutcheck documentation


gutcheck:  lists possible common formatting errors in a Project
Gutenberg candidate file. It is a command line program and can be used
under Win32 or Unix (gutcheck.c should compile anywhere; if it doesn't,
tell me). For Windows-only people, there is an appendix at the end
with brief instructions for running it.


Current version: 0.99. Users of 0.98 see end of file for changes.

You should also have received the licence file COPYING, a README file,
gutcheck.c, the source code, and gutcheck.exe, a Win32 executable, with
this file.

This software is Copyright Jim Tinsley 2000-2005.

Gutcheck comes with ABSOLUTELY NO WARRANTY. For details, read the file COPYING.
This is Free Software; you may redistribute it under certain conditions (GPL).

See http://gutcheck.sourceforge.net for the latest version.


Usage is: gutcheck [-setopxlywvmudh] filename
      where:
      -s checks Single quotes
      -e switches off Echoing of lines
      -t checks Typos
      -o produces an Overview only
      -p sets strict quotes checking for Paragraphs
      -x (paranoid) switches OFF typo checking and extra checks
      -l turns off Line-end checks
      -y sets error messages to stdout
      -w is a special mode for web uploads (for future use)
      -v (verbose) forces individual reporting of minor problems
      -m interprets Markup of some common HTML tags and entities
      -u warns about words in a user-defined typo file gutcheck.typ
      -d ignores some DP-specific markup
      -h also displays the PG header fields (see "Header quote" below)

Running gutcheck without any parameters will display a brief help message.

Sample usage:

    gutcheck warpeace.txt


More detail:

    Echoing lines (-e to switch off)

      You may find it convenient, when reviewing Gutcheck's
      suggestions, to see the line that Gutcheck is questioning.
      That way, you can often see at a glance whether it is
      a real error that needs to be fixed, or a false positive:
      something that should be in the text, but that Gutcheck's
      limited programming doesn't understand.

      By default, gutcheck echoes these lines, but if you don't
      want to see the lines referred to, -e will switch it OFF.


    Quotes (-s and -p switches)

      Gutcheck always looks for unbalanced doublequotes in a
      paragraph. It is a common convention for writers not to
      close quotes in a paragraph if the next paragraph opens
      with quotes and is a continuation by the same speaker.

      Gutcheck therefore does not normally report unclosed quotes
      if the next paragraph begins with a quote. If you need
      to see all unclosed quotes, even where the next paragraph
      begins with a quote, you should use the -p switch.
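
The rule described above can be sketched in a few lines of Python. This is
only an illustration, not gutcheck's own C code: the function name and the
`strict` parameter (standing in for the -p switch) are made up for the
example, and paragraphs are assumed to be separated by blank lines.

```python
def unclosed_quote_paragraphs(text, strict=False):
    """Return 1-based numbers of paragraphs whose double quotes look unclosed."""
    # Normalize CR/LF so that a blank line is always "\n\n".
    text = text.replace("\r\n", "\n")
    paragraphs = [p for p in text.split("\n\n") if p.strip()]
    flagged = []
    for i, para in enumerate(paragraphs):
        if para.count('"') % 2 == 0:
            continue  # quotes balance within this paragraph
        next_para = paragraphs[i + 1].lstrip() if i + 1 < len(paragraphs) else ""
        continued = next_para.startswith('"')  # same speaker carries on
        if strict or not continued:
            flagged.append(i + 1)
    return flagged
```

With `strict=False` a paragraph left open before a paragraph that itself
opens with a quote is let through; with `strict=True` it is reported too.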

      Singlequotes (') are a problem, since the same character
      is used for an apostrophe. I'm not sure that it is
      possible to get 100% accuracy on singlequotes checking,
      particularly since dialect, quite common in PG texts,
      upsets the normal rules so badly. Consider the sentence:
        'Tis often said that a man's a man for a' that.
      As humans, we recognize that the first and last apostrophes
      are used for contractions rather than quotes, but it isn't
      easy to get a program to recognize that.

      Since Gutcheck makes too many mistakes when trying to match
      singlequotes, it doesn't look for unbalanced singlequotes
      unless you specify the -s switch.

      Consider these sentences, which illustrate the main cases
      (a leading contraction, quoted speech containing apostrophes,
      a plural possessive and a genuinely quoted phrase):

        'Tis often said that a fool and his money are soon parted.

        'Becky's goin' home,' said Tom.

        The dogs' tails wagged in unison.

        Those 'pack dogs' of yours look more like wolves.



    Typos (-t switch)

      It's not Gutcheck's job to be a spelling checker, but it
      does check for a list of common typos and OCR errors if you
      use the -t switch. (Paranoid mode also turns typo checking on;
      the -x switch turns both off.)

      It also checks for character combinations that rarely or never
      occur, especially those involving h and b, which are often
      confused by OCR. For example, it queries "tbe" in a word. Now,
      "the" often occurs, but "tbe" is very rare (heartbeat, hotbed),
      so I'm playing the odds - a few false positives for many errors
      found. Similarly with "ii", which is a very common OCR error.

      Gutcheck reports each of the first 40 distinct "typos" it finds
      only once. This is to remove the annoyance of seeing something
      like "FN" (footnote) or "LK" (initials) flagged as a typo 147
      times in a text.
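
As a rough illustration of the "playing the odds" idea, the Python sketch
below queries any word containing a suspect letter combination and reports
each such word only once. It is not gutcheck's code: the pattern list holds
just the two combinations named above, the names are hypothetical, and the
limit of 40 remembered words is not modelled.

```python
import re

# Hypothetical pattern list: only the combinations mentioned in the text.
SUSPECT_PATTERNS = ("tbe", "ii")


def query_words(text, patterns=SUSPECT_PATTERNS):
    """Return (line_number, word) for each suspect word, each word once."""
    seen = set()
    queries = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for word in re.findall(r"[A-Za-z']+", line):
            lower = word.lower()
            if lower in seen:
                continue  # already reported; don't repeat it
            if any(p in lower for p in patterns):
                seen.add(lower)
                queries.append((lineno, word))
    return queries
```

Run on a line containing "heartbeat" or "skiing", the sketch queries those
words as well, which is exactly the kind of false positive described above.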


    Line-end checking (-l switch to disable)

      All PG texts should have a Carriage Return (CR - character 13)
      and a Line Feed (LF - character 10) at end of each line,
      regardless of what O/S you made them on. DOS/Windows, Unix
      and Mac have different conventions, but the final text should
      always use a CR/LF pair as its line terminator.

      By default, Gutcheck verifies that every line does have
      the correct terminator. If you're on a work-in-progress
      in Linux, though, you might want to convert the line-ends as
      a final step, and not want to see thousands of errors every
      time you run Gutcheck before that final step. In that case
      you can turn off this checking with the -l switch.
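
A minimal Python sketch of a CR/LF check follows, assuming the file has
been read in binary mode. The function name and the problem labels are
invented for the example and are not gutcheck's own messages.

```python
def bad_line_ends(data: bytes):
    """Return (line_number, problem) pairs for lines not ending in CR LF."""
    problems = []
    pieces = data.split(b"\n")
    last = pieces.pop()  # text after the final LF (empty if file ends with LF)
    for lineno, piece in enumerate(pieces, start=1):
        if not piece.endswith(b"\r"):
            problems.append((lineno, "LF without CR"))
        elif b"\r" in piece[:-1]:
            problems.append((lineno, "CR without LF"))
    if last:
        problems.append((len(pieces) + 1, "no line terminator"))
    return problems
```

A file saved with Unix line-ends produces one "LF without CR" entry per
line, which is the flood of messages the -l switch is meant to suppress.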


    Paranoid mode (-x switch to disable: Trust No One :-)

      -x switches paranoid mode OFF. This automatically switches
      off typo-checking (the -t flag) and some extra checks, like
      the standalone 1 and 0 queries.


    Overview mode (-o switch)

       This mode just gives a count of queries found
       instead of a detailed list.


    Header quote (-h switch)

       If you use the -h switch, gutcheck will also display
       the Title, Author, Release and Edition fields from the
       PG header. This is useful mostly for the automated
       checks we do on recently-posted texts.


    Errors to stdout (-y switch)

       If you're just running gutcheck normally, you can ignore
       this. It's only there for programs that provide a front
       end to gutcheck. It makes error messages appear within
       the output of gutcheck so that the front end knows whether
       gutcheck ran OK.


    Verbose reporting (-v switch)

       Normally, if gutcheck sees lots of long lines, short lines,
       spaced dashes, non-ASCII characters or dot-commas ".," it
       assumes these are features of the text, counts and summarizes
       them at the top of its report, but does not list them
       individually. If the -v switch is on, gutcheck will list them all.


    Markup interpretation (-m switch)

       Normally, gutcheck flags anything it suspects of being HTML
       markup as a possible error. When you use the -m switch,
       however, it matches anything that looks like markup against
       a short list of common HTML tags and entities. If the markup
       is in that list, it either ignores the markup, in the case
       of a tag, or "interprets" the markup as its nearest ASCII
       equivalent, in the case of an entity. So, for example, using
       this switch, gutcheck will "see"

       &ldquo;He went <i>thataway!</i>&rdquo;

       as

       "He went thataway!"

       and report accordingly.

       This switch does not, not, NOT check the validity of HTML;
       it exists so that you can run gutcheck on most HTML texts
       for PG, and get sane results. It does not support all tags.
       It does not support all entities. When it sees a tag or entity
       it does not recognize, it will query it as HTML just as if
       you hadn't specified the -m switch.

       Gutcheck 0.99 will automatically switch on markup interpretation
       if it sees a lot of tags that appear to be markup, so mostly, you
       won't have to specify this.
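
The behaviour described in this section might be sketched in Python as
below. The tag and entity tables are short, hypothetical stand-ins (the
real lists live in gutcheck.c and are not reproduced here), and the
function name is invented for the example.

```python
import re

# Hypothetical, deliberately short tables for illustration only.
KNOWN_TAGS = {"i", "b", "em", "p"}
ENTITY_TO_ASCII = {"ldquo": '"', "rdquo": '"', "lsquo": "'",
                   "rsquo": "'", "amp": "&", "mdash": "--"}


def interpret_markup(line):
    """Return (plain_line, queries): resolve known markup, query the rest."""
    queries = []

    def tag(match):
        if match.group(1).lower() in KNOWN_TAGS:
            return ""  # a known tag is simply ignored
        queries.append(match.group(0))
        return match.group(0)

    def entity(match):
        if match.group(1) in ENTITY_TO_ASCII:
            return ENTITY_TO_ASCII[match.group(1)]  # nearest ASCII equivalent
        queries.append(match.group(0))
        return match.group(0)

    line = re.sub(r"</?([A-Za-z][A-Za-z0-9]*)[^>]*>", tag, line)
    line = re.sub(r"&([A-Za-z]+);", entity, line)
    return line, queries
```

Anything not in the tables is left in the line and returned as a query,
mirroring the "query it as HTML" fallback described above.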

    User-defined typos (-u switch)

        If you have a file named gutcheck.typ either in your current
        working directory or in the directory from which you explicitly
        invoked gutcheck, but not necessarily on your path, and if you
        specify the -u switch, gutcheck will query any word specified
        in that file. The file is simple: one word, in lower case, per
        line. 999 lines are allowed for. Be careful not to put multiple
        words onto a line, or leave any rubbish other than the word on
        the line. You should have received a sample file gutcheck.typ
        with this package.
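
To make the file format concrete, here is a small Python sketch that reads
such a word list and queries matching words in a text. The function names
are hypothetical, and skipping blank or multi-word lines is an assumption
of the sketch; the text above does not say what gutcheck itself does with
a malformed line.

```python
import re


def parse_typo_lines(lines, limit=999):
    """Collect one lower-case word per line, keeping at most `limit` words."""
    words = set()
    for line in lines:
        word = line.strip()
        if not word or len(word.split()) != 1:
            continue  # assumption: ignore blank and multi-word lines
        words.add(word.lower())
        if len(words) >= limit:
            break
    return words


def query_user_typos(text, words):
    """Return (line_number, word) for every occurrence of a listed word."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for token in re.findall(r"[A-Za-z']+", line):
            if token.lower() in words:
                hits.append((lineno, token))
    return hits
```

A list read with `parse_typo_lines(open("gutcheck.typ"))` can then be
passed straight to `query_user_typos`.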

    Ignore DP markup (-d switch)

        Distributed Proofreaders (http://www.pgdp.net) is currently
        (2005) the main source of PG texts, and proofers there use
        special conventions. This switch understands those conventions,
        so that people can use gutcheck on files in progress that
        haven't had the special conventions removed yet. The special
        conventions supported in 0.99 are page-separators and
        "<sc>", "</sc>", "/*", "*/", "/#", "#/", "/$", "$/".


You will probably only run gutcheck on a text once or maybe twice,
just prior to uploading. It usually finds a few formatting problems;
it also usually raises queries that aren't problems at all - it often
questions Tables of Contents for having short lines, for example.
These are called "false positives", and need a human to decide on
them.

The text should be standard prose, and already close to PG normal
format (plain text, about 70 characters per line with blank lines
between paragraphs).

Gutcheck merely draws your attention to things that might be errors.
It is NOT a substitute for human judgement. Formatting choices like
short lines may be for a reason that this program can't understand.

Even the most careful human proofing can leave errors behind in a
text, and there are several automated checks you can do to help find
them. Of these, spellchecking (with _very_ careful human judgement) is
the most important and most useful.

Gutcheck does perform some basic typo-checking if you ask it to,
but its focus is on formatting errors specific to PG texts -
mismatched quotes, non-ASCII characters, bad spacing, bad line
length, HTML tags perhaps left from a conversion, unbalanced
brackets.

Suggestions for additional checks would be appreciated and duly
considered, but no guarantees that they will be implemented.




                How do _I_ use it?

Practically everyone I give gutcheck to asks me how _I_ use it.
Well, when I get a text for posting, say filename.txt, I run

    gutcheck -o filename.txt

That gives me a quick idea what I'm dealing with. It'll tell
me what kind of problems gutcheck sees, and give me an idea
of how much more work needs to be done on the text. Keep in
mind that gutcheck doesn't do anything like a full spellcheck,
but when I see a text that has a lot of problems, I assume that
it probably needs a spellcheck too.

Having got a feel for the ballpark, I run

    gutcheck filename.txt > jj

where jj is my personal, all-purpose filename for temporary data
that doesn't need to be kept. Then I open filename.txt and jj in
a split-screen view in my editor, and work down the text, fixing
whatever needs fixing, and skipping whatever doesn't. If your
editor doesn't split-screen, you can get much the same effect by
opening your original file in your normal editor, and jj (or your
equivalent name) in something like Notepad, keeping both in view
at the same time.

Twice a day, an automatic process looks at all recently-posted
texts, and emails Michael, me, and sometimes other people with
their gutcheck summaries.



        Future development of gutcheck

Gutcheck has gone about as far as it can, given its current
structure. In order to add better singlequotes checking,
sentence checking, better he/be checking and other good stuff
that I'd like to see, I'll have to rewrite it from a different
angle - looking at the syntax instead of the lines. And I'll
probably get around to that sooner or later.

Meantime, I'm just trying to get this version stabilized, so
please report any bugs you find. When it is stable, I'll run
up a Windows port for those timid souls who can't look a
command line in the eye. :-)

And I've started work on gutspell, a companion to gutcheck
which will concentrate on spelling problems. PG spelling
problems are unusual, since the range of texts we cover is
so wide, and I'll be taking a somewhat unorthodox approach
to writing this spelling-checker _specifically_ for texts
containing a lot of dialect and uncommon words that have
probably already been spell-checked against a standard
modern dictionary.




Explanations of common gutcheck messages:

    --> 74 lines in this file have white space at end

    PG texts shouldn't have extra white space added at end of line.
    Don't worry too much about this; the extra spaces aren't doing
    any harm, and they'll be removed during posting anyway.


    --> 348 lines in this file are short. Not reporting short lines.
    --> 84 lines in this file are long. Not reporting long lines.
    --> 8 lines in this file are VERY long!

    If there are a lot of long or short lines, Gutcheck won't list
    them individually. The short lines version of this message
    is commonly seen when gutchecking poetry and some plays, where
    the normal line length is shorter than the standard for prose.
    A "VERY long" line is one over 80 characters.  You normally
    shouldn't have any of these, but sometimes you may have to render
    a table that must be that long, or some special preformatted
    quotation that can't be broken.


    --> There are 75 spaced dashes and em-dashes in this file. Not reporting them.

    The PG standard for an em-dash--like these--is two minus signs
    with no spaces before or after them. However, some older texts
    used spaced dashes - like these -- and if there are very many
    such spaced dashes in the file, gutcheck just draws your
    attention to it and doesn't list them individually.



    Line 3020 - Non-ASCII character 233

    Standard PG texts should use only ASCII characters with values
    up to 127; however, non-English, accented characters can be
    represented according to several different non-ASCII encoding
    schemes, using values over 127. If you have a plain English text
    with a few accented characters in words like cafe or tete-a-tete,
    you should replace the accented characters with their unaccented
    versions. The English pound sign is another commonly-seen
    non-ASCII character. If you have enough non-ASCII characters in
    your text that you feel removing them would degrade your text
    unacceptably, you should probably consider doing an 8-bit text
    as well as a plain-ASCII version.



    Line 1207 - Non-ISO-8859 character 156

    Even in "8-bit" texts, there are distinctions between code sets.
    The ISO-8859 family of 8-bit code sets is the most commonly used
    in PG, and these sets do not define values in the range 128 through
    159 as printable characters. It's quite common for someone on a
    Windows or Mac machine to use a non-ISO character inadvertently,
    so this message warns that the character is not only not ASCII,
    but also outside the ISO-8859 range.
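
The two ranges above can be sketched as a simple classification of byte
values. This Python fragment is an illustration only; the function names
and the returned labels are invented for the example.

```python
def classify_byte(value):
    """Classify one byte value using the ranges given in the text."""
    if value < 128:
        return "ASCII"
    if value < 160:
        return "non-ISO-8859"  # 128 through 159: not printable in ISO-8859
    return "non-ASCII"


def query_characters(data: bytes):
    """Return (line_number, kind, value) for every byte above 127."""
    queries = []
    for lineno, line in enumerate(data.splitlines(), start=1):
        for value in line:
            kind = classify_byte(value)
            if kind != "ASCII":
                queries.append((lineno, kind, value))
    return queries
```

Byte 233 (an accented e in ISO-8859-1) falls in the first message's range,
and byte 156 in the second's, matching the two sample messages.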



    Line 46 - Tab character?

    Some editors and word processors will put in Tab characters
    (character 9) to indicate indented text. You should not use these
    in a PG text, because you can't be sure how they will appear on a
    reader's screen. Find the Tab, and replace it with the appropriate
    number of spaces.


    Line 1327 - Tilde character?

    The tilde character (~) might be legitimately used, but it's the
    character commonly used by OCR software to indicate a place where
    it couldn't make out the letter, so gutcheck flags it.



    Line 1347 - Asterisk?

    Asterisks are reported only in paranoid mode (see -x).
    Like tildes, they are often used to indicate errors, but they are
    also legitimately used as line delimiters and footnote markers.



    Line 1451 - Long line 129

    PG texts should have lines shorter than 76 characters. There may
    be occasions where you decide that you really have to go out to 79
    characters, but the sample above says that line 1451 is 129
    characters long - probably two lines run together.



    Line 1590 - Short line?

    PG texts should have lines longer than 54 characters. However,
    there are special cases like poetry and tables of contents where
    the lines _should_ be shorter. So treat Gutcheck warnings about
    short lines carefully. Sometimes it's a genuine formatting
    problem; sometimes the line really needs to be short.

    Hint: gutcheck will not flag lines as short if they are indented
    - if they start with a space. I like to start inserted stanzas
    and other such items indented with a couple of spaces so that
    they stand out from the main text anyway.
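
The length rules above can be collected into one small Python function.
The constant names are hypothetical, and the exact cut-offs are read off
the wording above ("shorter than 76", "over 80", "longer than 54") rather
than taken from gutcheck.c. The sketch looks at one line in isolation, so
it makes no allowance for the naturally short last line of a paragraph.

```python
LONGEST_NORMAL = 75   # assumption: "shorter than 76 characters"
VERY_LONG = 80        # a "VERY long" line is one over 80 characters
SHORTEST_NORMAL = 55  # assumption: "longer than 54 characters"


def classify_length(line):
    """Return 'very long', 'long', 'short' or 'ok' for a single line."""
    length = len(line)
    if length == 0:
        return "ok"  # blank lines separate paragraphs
    if length > VERY_LONG:
        return "very long"
    if length > LONGEST_NORMAL:
        return "long"
    if length < SHORTEST_NORMAL and not line.startswith(" "):
        return "short"  # indented lines are never flagged as short
    return "ok"
```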



    Line 1804 - Begins with punctuation?

    Lines should normally not begin with commas, periods and so on.
    An exception is ellipses . . . which can happen at start of line.



    Line 1850 - Spaced em-dash?

    The PG standard for an em-dash--like these--is two minus signs
    with no spaces before or after them. Gutcheck flags non-PG
    em-dashes - like this one. Normally, you will replace it with a
    PG-standard em-dash.
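
One way to spot spaced dashes is a regular expression, as in this Python
sketch. The pattern is a guess at what counts as "spaced" (a single hyphen
with a space on both sides, or a double hyphen with a space on either
side); it is not the test gutcheck itself applies.

```python
import re

# Assumed definition of a spaced dash, for illustration only.
SPACED_DASH = re.compile(r" - | -- ?| ?-- ")


def spaced_dashes(line):
    """Return the 1-based column of each spaced dash found in the line."""
    return [m.start() + m.group().index("-") + 1
            for m in SPACED_DASH.finditer(line)]
```

An unspaced em-dash--like this--and an ordinary hyphenated word are both
left alone.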



    Line 1904 - Query he/be error?

    Gutcheck makes a very minor effort to look for that scourge of all
    proofreaders, "be" replacing "he" or vice-versa, and draws your
    attention to it when it thinks it has found one.



    Line 2017 - Query digit in a1most

    The digit 1 is commonly OCRed for the letter l, the digit 0 for
    the letter O, and so on. When gutcheck sees a mix of digits and
    letters, it warns you. It may generate a false positive for
    something like 7am.
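
A mixed digit-and-letter test is easy to sketch in Python. The function
name is made up for the example, and the way words are split here (runs of
letters and digits) is an assumption, not gutcheck's tokenizer.

```python
import re


def mixed_digit_words(line):
    """Return the words in the line that mix digits with letters."""
    return [word for word in re.findall(r"[A-Za-z0-9]+", line)
            if any(c.isdigit() for c in word)
            and any(c.isalpha() for c in word)]
```

Pure numbers such as a year pass untouched, while "7am" is queried along
with the genuine error, as the text above warns.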



    Line 2083 - Query standalone 0

    In paranoid mode (see -x) only, gutcheck warns about the digits 0
    and 1 standing alone as a word. This can happen if the OCR
    misreads the words O or I.



    Line 2115 - Query word whetber

    If you have switched typo-checking on, gutcheck looks for
    potential typos, especially common h/b errors. It's not
    infallible; it sometimes queries legit words, but it's
    always worth taking a look.



    Line 2190 column 14 - Missing space?

    Omitting a space is a very common error,especially coming from
    OCRed text,and can be hard for a human to spot. The commas in
    the previous sentence illustrate the kind of thing I mean.



    Line 2240 column 48 - Spaced punctuation?

    The flip side of the "missing space" error , here , is when extra
    spaces are added before punctuation . Some old texts appear to add
    extra spaces around punctuation consistently, but this was a
    typographical convention rather than the author's intent, and the
    extra "spaces" should be removed when preparing a PG text.
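
Both spacing checks can be approximated with two regular expressions, as in
the Python sketch below. The patterns are assumptions made for the example:
the period is left out of the missing-space test to avoid querying things
like abbreviations, and the spaced-punctuation test will also query spaced
ellipses ( . . . ). gutcheck's real rules may well differ.

```python
import re

# Punctuation with a letter hard up against it on both sides.
MISSING_SPACE = re.compile(r"(?<=[A-Za-z])[,;:!?](?=[A-Za-z])")
# Punctuation with a space before it and a space or line end after it.
SPACED_PUNCT = re.compile(r"(?<= )[,;:!?.](?= |$)")


def spacing_queries(line):
    """Return sorted (column, message) pairs, with 1-based columns."""
    queries = [(m.start() + 1, "Missing space?")
               for m in MISSING_SPACE.finditer(line)]
    queries += [(m.start() + 1, "Spaced punctuation?")
                for m in SPACED_PUNCT.finditer(line)]
    return sorted(queries)
```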



    Line 2301 column 19 - Unspaced quotes?

    Another common spacing problem occurs in a phrase like "You wait
    there,"he said.



    Line 2385 column 27 - Wrongspaced quotes?

    As of version 0.98, gutcheck adds extra checks on whether a quote
    seems to be a start or end quote, and queries those that appear to
    be misplaced. This does give rise to false positives when quotes are
    nested, for example:

    "And how," she asked, "will your "friends" help you now?"

    but these false positives are worth it because of the many cases
    that this test catches, notably those like:

    "And how, "she said," will your friends help you now?"

    Sometimes a "wrongspaced quotes" query will arise because an earlier
    quote in the paragraph was omitted, so if the place specified seems
    to be OK, look back to see whether there's a problem in the preceding
    lines.



    Line 2400 - HTML Tag? <PRE>

    Some PG texts have been converted from HTML, and not all of the
    HTML tags have been removed.



    Line 2402 - HTML symbol? &emdash;

    Similarly, special HTML symbol characters can survive into PG
    texts. This check can occasionally produce amusing false
    positives like
    . . . Marwick & Co were well known for it;



    Line 2540 - Mismatched quotes

    Another gutcheck mainstay - unclosed doublequotes in a paragraph.
    See the discussion of quotes in the switches section near the
    start of this file.

    Since the mismatch doesn't occur on any one line, gutcheck quotes
    the line number of the first blank line following the paragraph,
    since this is the point where it reconciles the count of quotes.
    However, if gutcheck is echoing lines, that is, you haven't used
    the -e switch, it will show the _first_ line of the paragraph,
    to help you find the place without using line numbers. The
    offending paragraph is therefore between the quoted line and
    the line number given.



    Line 2587 - Mismatched single quotes

    Only checked with the -s switch, since checking single quotes is
    not a very reliable process. Otherwise, the same logic as for
    doublequotes applies.



    Line 2877 - Mismatched round brackets?

    Also curly and square brackets. Texts with a lot of brackets, like
    plays with bracketed stage instructions, may have mismatches.
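
A count-based bracket check for one paragraph can be sketched in Python as
follows. Comparing the number of opening and closing brackets is an
assumption modelled on the way the quote count is reconciled at the end of
a paragraph; the names are invented, and a stricter checker would also
look at nesting order.

```python
PAIRS = {"(": ")", "[": "]", "{": "}"}
NAMES = {"(": "round", "[": "square", "{": "curly"}


def mismatched_brackets(paragraph):
    """Return the kinds of bracket whose open and close counts differ."""
    return [NAMES[opener] for opener, closer in PAIRS.items()
            if paragraph.count(opener) != paragraph.count(closer)]
```

A stage direction such as "[Enter HAMLET (reading a book.]" would be
queried for its round brackets only.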


    Line 3150 - No CR?
    Line 3204 - Two successive CRs?
    Line 3281 position 75 - CR without LF?

    These are the invalid line-end warnings. See the discussion of
    line-end checking in the switches section near the start of this
    file. If you see these, and your editor doesn't show anything
    wrong, you should probably try deleting the characters just before
    and after the line end, and the line-end itself, then retyping the
    characters and the line-end.


    Line 2940 - Paragraph starts with lower-case

    A common error in an e-text is for an extra blank line

    to be put in, like the blank line above, and this often
    shows up as a new paragraph beginning with lower case.
    Sometimes the blank line is deliberate, as when a
    quotation is inserted in a speech. Use your judgement.
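
The idea can be sketched in Python by looking at the first character of
every line that follows a blank line. This is an illustration under simple
assumptions (a paragraph starts at the top of the file or after a blank
line; indented lines are ignored), not gutcheck's own logic.

```python
def lowercase_paragraph_starts(text):
    """Return 1-based line numbers of paragraphs that begin in lower case."""
    lines = text.replace("\r\n", "\n").split("\n")
    hits = []
    for index, line in enumerate(lines):
        starts_paragraph = index == 0 or lines[index - 1].strip() == ""
        if starts_paragraph and line[:1].islower():
            hits.append(index + 1)
    return hits
```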


    Line 2987 - Extra period?

    An extra period. is a. common problem in OCRed text. and usually
    arises when a speck of dust on the page is mistaken for a period.
    or. as occasionally happens. when a comma loses its tail.


    Line 3012 column 12 - Double punctuation?

    Double punctuation., like that,, is a common typo and
    scanno. Some books have much legit double punctuation,
    like etc., etc., but it's worth checking anyway.



            *       *       *        *

For Windows-only users who are unfamiliar with DOS:

    If you're a Windows-only user, you need to save
    gutcheck.exe into the folder (directory) where the
    text file you want to check is. Let's say your
    text file is in C:\GUT, then you should save
    GUTCHECK.EXE into C:\GUT.

    Now get to a DOS prompt. You can do this by
    selecting the "Command Prompt" or "MS-DOS Prompt"
    option that will be somewhere on your
    Start/Programs menu.

    Now get into the C:\GUT directory.
    You can do this using the CD (change directory)
    command, like this:
        CD \GUT
    and your prompt will change to
        C:\GUT>
    so you know you're in the right place.

    Now type
        gutcheck yourfile.txt
    and you'll see gutcheck's report.

    By default, gutcheck prints its queries to screen.
    If you want to create a file of them, to edit
    against the text, you can use the greater-than
    sign (>) to tell it to output the report to a
    file. For example, if you want its report in a
    file called QUERIES.LST, you could type

        gutcheck yourfile.txt > queries.lst

    The queries.lst file will then contain the listing
    of possible formatting errors, and you can
    edit it alongside your text.

    Whatever you do, DON'T make the filename after
    the greater-than sign the name of a file already
    on your disk that you want to keep, because
    the greater-than sign will cause gutcheck to
    replace any existing file of that name.

    So, for example, if you have two Tolstoy files
    that you want to check, called WARPEACE.TXT and
    ANNAK.TXT, make sure that neither of these names
    is ever used following the greater-than sign.
    To check these correctly, you might do:

    gutcheck warpeace.txt > war.lst

    and

    gutcheck annak.txt > annak.lst

    separately. Then you can look at war.lst and annak.lst
    to see the gutcheck reports.

            *       *       *        *


For existing 0.98 users upgrading to 0.99:

    If you run on old 16-bit DOS or Windows 3.x, I'm afraid
    you're out of luck. I'm not saying it _can't_ be compiled
    to run on 16-bit, but the executable with the package is
    for Win32 only. *nix users won't notice the change at all.


    There are two new switches: -u and -d.
          See above for full rundown.


Here's a list of the new errors:

    Line 1456 - Carat character?

    I^ve found a few.


    Line 1821 - Forward slash?

    Common error for italicized "I", or so /'ve found.


    Line 2139 - Query missing paragraph break?

    "Come here, son." "Do I _have_ to go, dad?"
    Like that. False positives in some texts. Sorry 'bout that,
    but these are often errors.


    Line 2200 - Query had/bad error?

    Clear enough. Doesn't catch as many as I'd like it to,
    but rarely gives false alarms.


    Line 2268 - Query punctuation after the?

    Some words, like "the", very rarely have punctuation
    following them. Others, like "Mrs", usually have a
    period, but never a comma. Occasional false positives.


    Line 2380 - Query possible scanno arid

    It found one of your user-defined typos when you
    used the -u switch.


    Line 2511 - Capital "S"?

    Surprisingly common specific case, like: Jane'S


    Line 3469 - endquote missing punctuation?

    OK. This one can really cause a lot of false positives
    in some books, but it switches itself off if it finds
    more than 20 in a text, unless you force it to list them
    all with the -v switch.
    "Hey, dad" Johnny said, "can we go now?"
    is a common punctuation-missing error.


    Line 4266 - Mismatched underscores?

    Like mismatched anything else!