doc/gutcheck.txt
changeset 0 c2f4c0285180
     1.1 --- /dev/null	Thu Jan 01 00:00:00 1970 +0000
     1.2 +++ b/doc/gutcheck.txt	Tue Jan 24 23:54:05 2012 +0000
     1.3 @@ -0,0 +1,742 @@
     1.4 +
     1.5 +
     1.6 +                            Gutcheck documentation
     1.7 +
     1.8 +
     1.9 +gutcheck:  lists possible common formatting errors in a Project
    1.10 +Gutenberg candidate file. It is a command line program and can be used
    1.11 +under Win32 or Unix (gutcheck.c should compile anywhere; if it doesn't,
    1.12 +tell me). For Windows-only people, there is an appendix at the end
    1.13 +with brief instructions for running it.
    1.14 +
    1.15 +
    1.16 +Current version: 0.99. Users of 0.98 see end of file for changes.
    1.17 +
    1.18 +You should also have received the licence file COPYING, a README file, 
    1.19 +gutcheck.c, the source code, and gutcheck.exe, a DOS executable, with
    1.20 +this file.
    1.21 +
    1.22 +This software is Copyright Jim Tinsley 2000-2005.
    1.23 +
    1.24 +Gutcheck comes with ABSOLUTELY NO WARRANTY. For details, read the file COPYING.
    1.25 +This is Free Software; you may redistribute it under certain conditions (GPL).
    1.26 +
    1.27 +See http://gutcheck.sourceforge.net for the latest version.
    1.28 +
    1.29 +
    1.30 +Usage is: gutcheck [-setopxlywm] filename
    1.31 +      where:
    1.32 +      -s checks Single quotes 
    1.33 +      -e switches off Echoing of lines 
    1.34 +      -t checks Typos
    1.35 +      -o produces an Overview only
    1.36 +      -p sets strict quotes checking for Paragraphs
    1.37 +      -x (paranoid) switches OFF typo checking and extra checks
    1.38 +      -l turns off Line-end checks
    1.39 +      -y sets error messages to stdout
    1.40 +      -w is a special mode for web uploads (for future use)
    1.41 +      -v (verbose) forces individual reporting of minor problems
    1.42 +      -m interprets Markup of some common HTML tags and entities    
    1.43 +      -u warns about words in a user-defined typo file gutcheck.typ 
    1.44 +      -d ignores some DP-specific markup
    1.45 +
    1.46 +Running gutcheck without any parameters will display a brief help message.
    1.47 +
    1.48 +Sample usage: 
    1.49 +
    1.50 +    gutcheck warpeace.txt
    1.51 +
    1.52 +
    1.53 +More detail:
    1.54 +
    1.55 +    Echoing lines (-e to switch off)
    1.56 +
    1.57 +      You may find it convenient, when reviewing Gutcheck's 
    1.58 +      suggestions, to see the line that Gutcheck is questioning.
    1.59 +      That way, you can often see at a glance whether it is
    1.60 +      a real error that needs to be fixed, or a false positive
    1.61 +      that should be in the text, but Gutcheck's limited
    1.62 +      programming doesn't understand.
    1.63 +
    1.64 +      By default, gutcheck echoes these lines, but if you don't 
    1.65 +      want to see the lines referred to, -e will switch it OFF.
    1.66 +
    1.67 +
    1.68 +    Quotes (-s and -p switches)
    1.69 +
    1.70 +      Gutcheck always looks for unbalanced doublequotes in a 
    1.71 +      paragraph. It is a common convention for writers not to
    1.72 +      close quotes in a paragraph if the next paragraph opens
    1.73 +      with quotes and is a continuation by the same speaker.
    1.74 +
    1.75 +      Gutcheck therefore does not normally report unclosed quotes 
    1.76 +      if the next paragraph begins with a quote. If you need
    1.77 +      to see all unclosed quotes, even where the next paragraph
    1.78 +      begins with a quote, you should use the -p switch.
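
The doublequote rule described above can be sketched in a few lines of
Python. This is only an illustration under stated assumptions, not gutcheck's
actual C code: the function name `unbalanced_paragraphs` and the `strict`
parameter (standing in for the -p switch) are hypothetical, and paragraphs are
assumed to be separated by blank lines.

```python
def unbalanced_paragraphs(text, strict=False):
    """Return 1-based line numbers of the blank line (or end of text)
    that closes each paragraph holding an odd number of doublequotes."""
    lines = text.splitlines()
    paragraphs = []  # (line number where the paragraph ends, paragraph text)
    current = []
    for number, line in enumerate(lines, start=1):
        if line.strip():
            current.append(line)
        elif current:
            paragraphs.append((number, "\n".join(current)))
            current = []
    if current:
        paragraphs.append((len(lines) + 1, "\n".join(current)))

    queries = []
    for index, (report_line, para) in enumerate(paragraphs):
        if para.count('"') % 2 == 0:
            continue  # quotes balance within this paragraph
        following = paragraphs[index + 1][1] if index + 1 < len(paragraphs) else ""
        continued = following.lstrip().startswith('"')
        # Without strict checking, an unclosed quote is tolerated when
        # the next paragraph opens with a quote (same speaker continuing).
        if strict or not continued:
            queries.append(report_line)
    return queries
```

With `strict=False` a paragraph left open before a paragraph that starts with
a quote is skipped; with `strict=True` every odd count is reported.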
    1.79 +
    1.80 +      Singlequotes (') are a problem, since the same character
    1.81 +      is used for an apostrophe. I'm not sure that it is 
    1.82 +      possible to get 100% accuracy on singlequotes checking,
    1.83 +      particularly since dialect, quite common in PG texts,
    1.84 +      upsets the normal rules so badly. Consider the sentence:
    1.85 +        'Tis often said that a man's a man for a' that.
    1.86 +      As humans, we recognize that both apostrophes are used
    1.87 +      for contractions rather than quotes, but it isn't easy 
    1.88 +      to get a program to recognize that.
    1.89 +
    1.90 +      Since Gutcheck makes too many mistakes when trying to match
    1.91 +      singlequotes, it doesn't look for unbalanced singlequotes
    1.92 +      unless you specify the -s switch.
    1.93 +
    1.94 +      Consider these sentences, which illustrate the main cases:
    1.95 +
    1.96 +        'Tis often said that a fool and his money are soon parted.
    1.97 +
    1.98 +        'Becky's goin' home,' said Tom.
    1.99 +
   1.100 +        The dogs' tails wagged in unison.
   1.101 +
   1.102 +        Those 'pack dogs' of yours look more like wolves.
   1.103 +
   1.104 +
   1.105 +
   1.106 +    Typos (-t switch)
   1.107 +
   1.108 +      It's not Gutcheck's job to be a spelling checker, but it
   1.109 +      does check for a list of common typos and OCR errors if you
   1.110 +      use the -t switch. (The -x switch also turns typo checking on.)
   1.111 +
   1.112 +      It also checks for character combinations that rarely or never
   1.113 +      occur, especially those involving h and b, which are often confused
   1.114 +      by OCR. For example, it queries "tbe" in a word. Now, "the" often
   1.115 +      occurs, but "tbe" is very rare (heartbeat, hotbed), so I'm
   1.116 +      playing the odds - a few false positives for many errors found.
   1.117 +      Similarly with "ii", which is a very common OCR error.
   1.118 +
   1.119 +      Gutcheck suppresses multiple reporting of the first 40 "typos"
   1.120 +      found. This is to remove the annoyance of seeing something like
   1.121 +      "FN" (footnote) or "LK" (initials) flagged as a typo 147 times
   1.122 +      in a text. 
   1.123 +
   1.124 +
   1.125 +    Line-end checking (-l switch to disable)
   1.126 +
   1.127 +      All PG texts should have a Carriage Return (CR - character 13)
   1.128 +      and a Line Feed (LF - character 10) at end of each line,
   1.129 +      regardless of what O/S you made them on. DOS/Windows, Unix
   1.130 +      and Mac have different conventions, but the final text should
   1.131 +      always use a CR/LF pair as its line terminator.
   1.132 +
   1.133 +      By default, Gutcheck verifies that every line has the correct
   1.134 +      terminator. If you're on a work-in-progress in Linux, though,
   1.135 +      you might want to convert the line-ends as a final step, and
   1.136 +      not want to see thousands of errors every time you run Gutcheck
   1.137 +      before that final step. In that case, you can turn off this
   1.138 +      checking with the -l switch.
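
A rough Python sketch of a CR/LF line-end check follows. The three message
strings are the ones this file lists under the line-end warnings; the function
name `check_line_ends` and the exact conditions that trigger each message are
assumptions made for illustration, and gutcheck itself may decide differently.

```python
def check_line_ends(data):
    """Return (line_number, message) queries for bytes data whose lines
    do not end in exactly one CR followed by one LF."""
    queries = []
    pieces = data.split(b"\n")
    for number, piece in enumerate(pieces, start=1):
        # Every piece except the last one was followed by an LF.
        terminated = number < len(pieces)
        if terminated and not piece.endswith(b"\r"):
            queries.append((number, "No CR?"))
            body = piece
        elif terminated:
            body = piece[:-1]  # drop the CR that belongs to the terminator
        else:
            body = piece
        if terminated and body.endswith(b"\r"):
            queries.append((number, "Two successive CRs?"))
        elif b"\r" in body:
            queries.append((number, "CR without LF?"))
    return queries
```

The data has to be read in binary mode, for example with
`open(name, "rb").read()`, so that Python does not translate the line-ends
before they can be checked.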
   1.139 +
   1.140 +
   1.141 +    Paranoid mode (-x switch to disable: Trust No One :-)
   1.142 +
   1.143 +      -x automatically switches OFF typo-checking (the -t flag)
   1.144 +      and some extra checks, like the standalone 1 and 0 queries.
   1.145 +
   1.146 +
   1.147 +    Overview mode (-o switch)
   1.148 +
   1.149 +       This mode just gives a count of queries found
   1.150 +       instead of a detailed list.
   1.151 +
   1.152 +
   1.153 +    Header quote (-h switch)
   1.154 +
   1.155 +       If you use the -h switch, gutcheck will also display
   1.156 +       the Title, Author, Release and Edition fields from the
   1.157 +       PG header. This is useful mostly for the automated
   1.158 +       checks we do on recently-posted texts.
   1.159 +
   1.160 +
   1.161 +    Errors to stdout (-y switch)
   1.162 +
   1.163 +       If you're just running gutcheck normally, you can ignore
   1.164 +       this. It's only there for programs that provide a front
   1.165 +       end to gutcheck. It makes error messages appear within
   1.166 +       the output of gutcheck so that the front end knows whether
   1.167 +       gutcheck ran OK.
   1.168 +
   1.169 +
   1.170 +    Verbose reporting (-v switch)
   1.171 +
   1.172 +       Normally, if gutcheck sees lots of long lines, short lines,
   1.173 +       spaced dashes, non-ASCII characters or dot-commas ".," it
   1.174 +       assumes these are features of the text, counts and summarizes
   1.175 +       them at the top of its report, but does not list them 
   1.176 +       individually. If the -v switch is on, gutcheck will list them all.
   1.177 +
   1.178 +
   1.179 +    Markup interpretation (-m switch)
   1.180 +
   1.181 +       Normally, gutcheck flags anything it suspects of being HTML
   1.182 +       markup as a possible error. When you use the -m switch,
   1.183 +       however, it matches anything that looks like markup against
   1.184 +       a short list of common HTML tags and entities. If the markup
   1.185 +       is in that list, it either ignores the markup, in the case
   1.186 +       of a tag, or "interprets" the markup as its nearest ASCII 
   1.187 +       equivalent, in the case of an entity. So, for example, using
   1.188 +       this switch, gutcheck will "see"
   1.189 +
   1.190 +       &ldquo;He went <i>thataway!</i>&rdquo;
   1.191 +
   1.192 +       as
   1.193 +
   1.194 +       "He went thataway!"
   1.195 +
   1.196 +       and report accordingly.
   1.197 +
   1.198 +       This switch does not, not, NOT check the validity of HTML;
   1.199 +       it exists so that you can run gutcheck on most HTML texts
   1.200 +       for PG, and get sane results. It does not support all tags.
   1.201 +       It does not support all entities. When it sees a tag or entity
   1.202 +       it does not recognize, it will query it as HTML just as if
   1.203 +       you hadn't specified the -m switch.
   1.204 +
   1.205 +       Gutcheck 0.99 will automatically switch on markup interpretation
   1.206 +       if it sees a lot of tags that appear to be markup, so mostly, you
   1.207 +       won't have to specify this.
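
The idea behind markup interpretation can be sketched in Python as below. The
tag and entity tables here are tiny illustrative stand-ins, not gutcheck's real
lists, and the name `interpret_markup` is hypothetical. Known tags are dropped,
known entities become their nearest ASCII equivalent, and anything else is
left in place and collected so it can still be queried.

```python
import re

# Illustrative tables only; the real lists in gutcheck are longer.
KNOWN_TAGS = {"i", "b", "em"}
KNOWN_ENTITIES = {"ldquo": '"', "rdquo": '"', "amp": "&"}

TAG = re.compile(r"</?([A-Za-z][A-Za-z0-9]*)[^<>]*>")
ENTITY = re.compile(r"&([A-Za-z]+);")

def interpret_markup(line):
    """Return (interpreted_line, unknown_markup) for one line of text."""
    unknown = []

    def replace_tag(match):
        if match.group(1).lower() in KNOWN_TAGS:
            return ""  # a recognized tag is simply ignored
        unknown.append(match.group(0))
        return match.group(0)

    def replace_entity(match):
        if match.group(1) in KNOWN_ENTITIES:
            return KNOWN_ENTITIES[match.group(1)]
        unknown.append(match.group(0))
        return match.group(0)

    line = TAG.sub(replace_tag, line)
    line = ENTITY.sub(replace_entity, line)
    return line, unknown
```

Run on the example line from this section, the sketch yields the plain
sentence with straight doublequotes and an empty list of unknown markup.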
   1.208 +
   1.209 +    User-defined typos (-u switch)
   1.210 +
   1.211 +        If you have a file named gutcheck.typ either in your current
   1.212 +        working directory or in the directory from which you explicitly
   1.213 +        invoked gutcheck, but not necessarily on your path, and if you
   1.214 +        specify the -u switch, gutcheck will query any word specified 
   1.215 +        in that file. The file is simple: one word, in lower case, per
   1.216 +        line. 999 lines are allowed for. Be careful not to put multiple
   1.217 +        words onto a line, or leave any rubbish other than the word on
   1.218 +        the line. You should have received a sample file gutcheck.typ
   1.219 +        with this package.
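
As a sketch of how a user-defined typo list might be applied, the Python below
loads one word per line and queries any matching word in a line of text. The
names `load_typos` and `scanno_queries` are hypothetical, and the word-splitting
rule is an assumption; only the file layout, the 999-line limit, and the
"Query possible scanno" wording come from this document.

```python
import re
from itertools import islice

MAX_TYPOS = 999  # this documentation says 999 lines are allowed for

def load_typos(lines):
    """Collect one lower-case word per line, ignoring blank lines."""
    typos = set()
    for line in islice(lines, MAX_TYPOS):
        word = line.strip().lower()
        if word:
            typos.add(word)
    return typos

def scanno_queries(line_number, text, typos):
    """Return a query message for each word of text found in typos."""
    words = re.findall(r"[A-Za-z']+", text)
    return [
        f"Line {line_number} - Query possible scanno {word}"
        for word in words
        if word.lower() in typos
    ]
```

In practice the list would come from the file itself, for example
`with open("gutcheck.typ") as handle: typos = load_typos(handle)`.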
   1.220 +
   1.221 +    Ignore DP markup (-d switch)
   1.222 +        
   1.223 +        Distributed Proofreaders (http://www.pgdp.net) is currently
   1.224 +        (2005) the main source of PG texts, and proofers there use
   1.225 +        special conventions. This switch understands those conventions,
   1.226 +        so that people can use gutcheck on files in process that
   1.227 +        haven't had the special conventions removed yet. The special
   1.228 +        conventions supported in 0.99 are page-separators and
   1.229 +        "<sc>", "</sc>", "/*", "*/", "/#", "#/", "/$", "$/".
   1.230 +
   1.231 +
   1.232 +You will probably only run gutcheck on a text once or maybe twice,
   1.233 +just prior to uploading; it usually finds a few formatting problems;
   1.234 +it also usually finds queries that aren't problems at all - it often
   1.235 +questions Tables of Contents for having short lines, for example.
   1.236 +These are called "false positives", and need a human to decide on
   1.237 +them.
   1.238 +
   1.239 +The text should be standard prose, and already close to PG normal
   1.240 +format (plain text, about 70 characters per line with blank lines
   1.241 +between paragraphs).
   1.242 +
   1.243 +Gutcheck merely draws your attention to things that might be errors.
   1.244 +It is NOT a substitute for human judgement. Formatting choices like
   1.245 +short lines may be for a reason that this program can't understand.
   1.246 +
   1.247 +Even the most careful human proofing can leave errors behind in a
   1.248 +text, and there are several automated checks you can do to help find
   1.249 +them. Of these, spellchecking (with _very_ careful human judgement) is
   1.250 +the most important and most useful.
   1.251 +
   1.252 +Gutcheck does perform some basic typo-checking if you ask it to,
   1.253 +but its focus is on formatting errors specific to PG texts - 
   1.254 +mismatched quotes, non-ASCII characters, bad spacing, bad line
   1.255 +length, HTML tags perhaps left from a conversion, unbalanced
   1.256 +brackets.
   1.257 +
   1.258 +Suggestions for additional checks would be appreciated and duly 
   1.259 +considered, but no guarantees that they will be implemented.
   1.260 +
   1.261 +
   1.262 +
   1.263 +
   1.264 +                How do _I_ use it?
   1.265 +
   1.266 +Practically everyone I give gutcheck to asks me how _I_ use it.
   1.267 +Well, when I get a text for posting, say filename.txt, I run
   1.268 +
   1.269 +    gutcheck -o filename.txt
   1.270 +
   1.271 +That gives me a quick idea what I'm dealing with. It'll tell
   1.272 +me what kind of problems gutcheck sees, and give me an idea 
   1.273 +of how much more work needs to be done on the text. Keep in 
   1.274 +mind that gutcheck doesn't do anything like a full spellcheck,
   1.275 +but when I see a text that has a lot of problems, I assume that
   1.276 +it probably needs a spellcheck too.
   1.277 +
   1.278 +Having got a feel for the ballpark, I run
   1.279 +
   1.280 +    gutcheck filename.txt > jj
   1.281 +
   1.282 +where jj is my personal, all-purpose filename for temporary data
   1.283 +that doesn't need to be kept. Then I open filename.txt and jj in
   1.284 +a split-screen view in my editor, and work down the text, fixing
   1.285 +whatever needs fixing, and skipping whatever doesn't. If your 
   1.286 +editor doesn't split-screen, you can get much the same effect by 
   1.287 +opening your original file in your normal editor, and jj (or your
   1.288 +equivalent name) in something like Notepad, keeping both in view 
   1.289 +at the same time.
   1.290 +
   1.291 +Twice a day, an automatic process looks at all recently-posted
   1.292 +texts, and emails Michael, me, and sometimes other people with
   1.293 +their gutcheck summaries.
   1.294 +
   1.295 +
   1.296 +
   1.297 +        Future development of gutcheck
   1.298 +
   1.299 +Gutcheck has gone about as far as it can, given its current
   1.300 +structure. In order to add better singlequotes checking,
   1.301 +sentence checking, better he/be checking and other good stuff
   1.302 +that I'd like to see, I'll have to rewrite it from a different
   1.303 +angle - looking at the syntax instead of the lines. And I'll
   1.304 +probably get around to that sooner or later.
   1.305 +
   1.306 +Meantime, I'm just trying to get this version stabilized, so
   1.307 +please report any bugs you find. When it is stable, I'll run
   1.308 +up a Windows port for those timid souls who can't look a 
   1.309 +command line in the eye. :-)
   1.310 +
   1.311 +And I've started work on gutspell, a companion to gutcheck
   1.312 +which will concentrate on spelling problems. PG spelling
   1.313 +problems are unusual, since the range of texts we cover is
   1.314 +so wide, and I'll be taking a somewhat unorthodox approach
   1.315 +to writing this spelling-checker _specifically_ for texts
   1.316 +containing a lot of dialect and uncommon words that have
   1.317 +probably already been spell-checked against a standard
   1.318 +modern dictionary.
   1.319 +
   1.320 +
   1.321 +
   1.322 +
   1.323 +Explanations of common gutcheck messages:
   1.324 +
   1.325 +    --> 74 lines in this file have white space at end
   1.326 +
   1.327 +    PG texts shouldn't have extra white space added at end of line.
   1.328 +    Don't worry too much about this; they're not doing any harm,
   1.329 +    and they'll be removed during posting anyway.
   1.330 +
   1.331 +
   1.332 +    --> 348 lines in this file are short. Not reporting short lines.
   1.333 +    --> 84 lines in this file are long. Not reporting long lines.
   1.334 +    --> 8 lines in this file are VERY long!
   1.335 +
   1.336 +    If there are a lot of long or short lines, Gutcheck won't list
   1.337 +    them individually. The short lines version of this message
   1.338 +    is commonly seen when gutchecking poetry and some plays, where
   1.339 +    the normal line length is shorter than the standard for prose.
   1.340 +    A "VERY long" line is one over 80 characters.  You normally
   1.341 +    shouldn't have any of these, but sometimes you may have to render
   1.342 +    a table that must be that long, or some special preformatted
   1.343 +    quotation that can't be broken.
   1.344 +
   1.345 +
   1.346 +    --> There are 75 spaced dashes and em-dashes in this file. Not reporting them.
   1.347 +
   1.348 +    The PG standard for an emdash--like these--is two minus signs
   1.349 +    with no spaces before or after them. However, some older texts
   1.350 +    used spaced dashes - like these -- and if there are very many
   1.351 +    such spaced dashes in the file, gutcheck just draws your
   1.352 +    attention to it and doesn't list them individually.
   1.353 +
   1.354 +
   1.355 +
   1.356 +    Line 3020 - Non-ASCII character 233
   1.357 +
   1.358 +    Standard PG texts should use only ASCII characters with values
   1.359 +    up to 127; however, non-English, accented characters can be 
   1.360 +    represented according to several different non-ASCII encoding 
   1.361 +    schemes, using values over 127. If you have a plain English text
   1.362 +    with a few accented characters in words like cafe or tete-a-tete,
   1.363 +    you should replace the accented characters with their unaccented 
   1.364 +    versions. The English pound sign is another commonly-seen
   1.365 +    non-ASCII character. If you have enough non-ASCII characters in
   1.366 +    your text that you feel removing them would degrade your text
   1.367 +    unacceptably, you should probably consider doing an 8-bit text
   1.368 +    as well as a plain-ASCII version.
   1.369 +
   1.370 +
   1.371 +
   1.372 +    Line 1207 - Non-ISO-8859 character 156
   1.373 +
   1.374 +    Even in "8-bit" texts, there are distinctions between code sets.
   1.375 +    The ISO-8859 family of 8-bit code sets is the most commonly used
   1.376 +    in PG, and these sets do not define values in the range 128 through
   1.377 +    159 as printable characters. It's quite common for someone on a
   1.378 +    Windows or Mac machine to use a non-ISO character inadvertently,
   1.379 +    so this message warns that the character is not only not ASCII,
   1.380 +    but also outside the ISO-8859 range.
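
The two character-range messages above can be illustrated with a short Python
sketch. It assumes the line is available as raw bytes and that a value in the
range 128 through 159 gets only the non-ISO-8859 message; the function name
`character_queries` is hypothetical.

```python
def character_queries(line_number, raw):
    """Return query messages for byte values above 127 in one raw line."""
    queries = []
    for value in raw:  # iterating over bytes yields integers
        if 128 <= value <= 159:
            # Not printable in the ISO-8859 family of code sets.
            queries.append(
                f"Line {line_number} - Non-ISO-8859 character {value}"
            )
        elif value > 127:
            queries.append(
                f"Line {line_number} - Non-ASCII character {value}"
            )
    return queries
```

For instance, an e-acute encoded as ISO-8859-1 has the value 233 and so
produces the non-ASCII message, while value 156 falls in the undefined range.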
   1.381 +
   1.382 +
   1.383 +
   1.384 +    Line 46 - Tab character?
   1.385 +
   1.386 +    Some editors and word processors will put in Tab characters
   1.387 +    (character 9) to indicate indented text. You should not use these
   1.388 +    in a PG text, because you can't be sure how they will appear on a
   1.389 +    reader's screen. Find the Tab, and replace it with the appropriate
   1.390 +    number of spaces.
   1.391 +
   1.392 +
   1.393 +    Line 1327 - Tilde character?
   1.394 +
   1.395 +    The tilde character (~) might be legitimately used, but it's the
   1.396 +    character commonly used by OCR software to indicate a place where
   1.397 +    it couldn't make out the letter, so gutcheck flags it.
   1.398 +
   1.399 +
   1.400 +
   1.401 +    Line 1347 - Asterisk?
   1.402 +
   1.403 +    Asterisks are reported only in paranoid mode (see -x). 
   1.404 +    Like tildes, they are often used to indicate errors, but they are
   1.405 +    also legitimately used as line delimiters and footnote markers.
   1.406 +
   1.407 +
   1.408 +
   1.409 +    Line 1451 - Long line 129
   1.410 +
   1.411 +    PG texts should have lines shorter than 76 characters. There may
   1.412 +    be occasions where you decide that you really have to go out to 79
   1.413 +    characters, but the sample above says that line 1451 is 129
   1.414 +    characters long - probably two lines run together.
   1.415 +
   1.416 +
   1.417 +
   1.418 +    Line 1590 - Short line?
   1.419 +
   1.420 +    PG texts should have lines longer than 54 characters. However,
   1.421 +    there are special cases like poetry and tables of contents where
   1.422 +    the lines _should_ be shorter. So treat Gutcheck warnings about
   1.423 +    short lines carefully. Sometimes it's a genuine formatting
   1.424 +    problem; sometimes the line really needs to be short.
   1.425 +
   1.426 +    Hint: gutcheck will not flag lines as short if they are indented
   1.427 +    - if they start with a space. I like to start inserted stanzas
   1.428 +    and other such items indented with a couple of spaces so that 
   1.429 +    they stand out from the main text anyway.
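
The line-length figures quoted in this file (shorter than 76, longer than 54,
"VERY long" over 80, indented lines excused) can be put together in a small
Python sketch. The function name `classify_length` is hypothetical, and the
sketch judges each line in isolation, so unlike a real checker it would also
call the last line of an ordinary paragraph short.

```python
def classify_length(line):
    """Return "very long", "long", "short", or None for one line of text."""
    text = line.rstrip("\r\n")
    length = len(text)
    if length > 80:
        return "very long"
    if length > 75:  # lines should be shorter than 76 characters
        return "long"
    # Blank lines and lines starting with a space are never called short.
    if 0 < length < 55 and not text.startswith(" "):
        return "short"
    return None
```

Indenting a stanza by a couple of spaces is therefore enough to keep this
sketch from flagging it.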
   1.430 +
   1.431 +
   1.432 +
   1.433 +    Line 1804 - Begins with punctuation?
   1.434 +
   1.435 +    Lines should normally not begin with commas, periods and so on.
   1.436 +    An exception is ellipses . . . which can happen at start of line.
   1.437 +
   1.438 +
   1.439 +
   1.440 +    Line 1850 - Spaced em-dash?
   1.441 +
   1.442 +    The PG standard for an em-dash--like these--is two minus signs
   1.443 +    with no spaces before or after them. Gutcheck flags non-PG
   1.444 +    em-dashes - like this one. Normally, you will replace it with a 
   1.445 +    PG-standard em-dash.
   1.446 +
   1.447 +
   1.448 +
   1.449 +    Line 1904 - Query he/be error?
   1.450 +
   1.451 +    Gutcheck makes a very minor effort to look for that scourge of all
   1.452 +    proofreaders, "be" replacing "he" or vice-versa, and draws your
   1.453 +    attention to it when it thinks it has found one.
   1.454 +
   1.455 +
   1.456 +
   1.457 +    Line 2017 - Query digit in a1most
   1.458 +
   1.459 +    The digit 1 is commonly OCRed for the letter l, the digit 0 for
   1.460 +    the letter O, and so on. When gutcheck sees a mix of digits and
   1.461 +    letters, it warns you. It may generate a false positive for
   1.462 +    something like 7am.
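
A minimal Python sketch of the mixed digit-and-letter test is shown here. The
name `digit_queries` and the way words are split are assumptions; it simply
returns every word that contains at least one digit and at least one letter,
which is why something like 7am is caught along with genuine errors.

```python
import re

WORD = re.compile(r"[A-Za-z0-9]+")

def digit_queries(text):
    """Return the words in text that mix digits with letters."""
    return [
        word
        for word in WORD.findall(text)
        if any(ch.isdigit() for ch in word)
        and any(ch.isalpha() for ch in word)
    ]
```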
   1.463 +
   1.464 +
   1.465 +
   1.466 +    Line 2083 - Query standalone 0
   1.467 +
   1.468 +    In paranoid mode (see -x) only, gutcheck warns about the digits 0
   1.469 +    and 1 standing alone as a word. This can happen if the OCR misreads
   1.470 +    the words O or I.
   1.471 +
   1.472 +
   1.473 +
   1.474 +    Line 2115 - Query word whetber
   1.475 +
   1.476 +    If you have switched typo-checking on, gutcheck looks for
   1.477 +    potential typos, especially common h/b errors. It's not
   1.478 +    infallible; it sometimes queries legit words, but it's
   1.479 +    always worth taking a look.
   1.480 +
   1.481 +
   1.482 +
   1.483 +    Line 2190 column 14 - Missing space?
   1.484 +
   1.485 +    Omitting a space is a very common error,especially coming from
   1.486 +    OCRed text,and can be hard for a human to spot. The commas in
   1.487 +    the previous sentence illustrate the kind of thing I mean.
   1.488 +
   1.489 +
   1.490 +
   1.491 +    Line 2240 column 48 - Spaced punctuation?
   1.492 +
   1.493 +    The flip side of the "missing space" error , here , is when extra
   1.494 +    spaces are added before punctuation . Some old texts appear to add
   1.495 +    extra spaces around punctuation consistently, but this was a
   1.496 +    typographical convention rather than the author's intent, and the
   1.497 +    extra "spaces" should be removed when preparing a PG text.
   1.498 +
   1.499 +
   1.500 +
   1.501 +    Line 2301 column 19 - Unspaced quotes?
   1.502 +
   1.503 +    Another common spacing problem occurs in a phrase like "You wait
   1.504 +    there,"he said.
   1.505 +
   1.506 +
   1.507 +
   1.508 +    Line 2385 column 27 - Wrongspaced quotes?
   1.509 +
   1.510 +    As of version 0.98, gutcheck adds extra checks on whether a quote
   1.511 +    seems to be a start or end quote, and queries those that appear to
   1.512 +    be misplaced. This does give rise to false positives when quotes are
   1.513 +    nested, for example:
   1.514 +
   1.515 +    "And how," she asked, "will your "friends" help you now?"
   1.516 +
   1.517 +    but these false positives are worth it because of the many cases
   1.518 +    that this test catches, notably those like:
   1.519 +
   1.520 +    "And how, "she said," will your friends help you now?"
   1.521 +
   1.522 +    Sometimes a "wrongspaced quotes" query will arise because an earlier
   1.523 +    quote in the paragraph was omitted, so if the place specified seems
   1.524 +    to be OK, look back to see whether there's a problem in the preceding
   1.525 +    lines.
   1.526 +
   1.527 +
   1.528 +
   1.529 +    Line 2400 - HTML Tag? <PRE>
   1.530 +
   1.531 +    Some PG texts have been converted from HTML, and not all of the
   1.532 +    HTML tags have been removed.
   1.533 +
   1.534 +
   1.535 +
   1.536 +    Line 2402 - HTML symbol? &emdash;
   1.537 +
   1.538 +    Similarly, special HTML symbol characters can survive into PG
   1.539 +    texts. This check can occasionally produce amusing false positives like
   1.540 +    . . . Marwick & Co were well known for it;
   1.541 +
   1.542 +
   1.543 +
   1.544 +    Line 2540 - Mismatched quotes
   1.545 +
   1.546 +    Another gutcheck mainstay - unclosed doublequotes in a paragraph.
   1.547 +    See the discussion of quotes in the switches section near the
   1.548 +    start of this file.
   1.549 +    
   1.550 +    Since the mismatch doesn't occur on any one line, gutcheck quotes
   1.551 +    the line number of the first blank line following the paragraph,
   1.552 +    since this is the point where it reconciles the count of quotes.
   1.553 +    However, if gutcheck is echoing lines, that is, you haven't used
   1.554 +    the -e switch, it will show the _first_ line of the paragraph, 
   1.555 +    to help you find the place without using line numbers. The 
   1.556 +    offending paragraph is therefore between the quoted line and 
   1.557 +    the line number given.
   1.558 +
   1.559 +
   1.560 +
   1.561 +    Line 2587 - Mismatched single quotes
   1.562 +
   1.563 +    Only checked with the -s switch, since checking single quotes is 
   1.564 +    not a very reliable process. Otherwise, the same logic as for 
   1.565 +    doublequotes applies.
   1.566 +
   1.567 +
   1.568 +
   1.569 +    Line 2877 - Mismatched round brackets?
   1.570 +
   1.571 +    Also curly and square brackets. Texts with a lot of brackets, like
   1.572 +    plays with bracketed stage instructions, may have mismatches.
   1.573 +
   1.574 +
   1.575 +    Line 3150 - No CR?
   1.576 +    Line 3204 - Two successive CRs?
   1.577 +    Line 3281 position 75 - CR without LF?
   1.578 +
   1.579 +    These are the invalid line-end warnings. See the discussion of
   1.580 +    line-end checking in the switches section near the start of this
   1.581 +    file. If you see these, and your editor doesn't show anything
   1.582 +    wrong, you should probably try deleting the characters just before
   1.583 +    and after the line end, and the line-end itself, then retyping the
   1.584 +    characters and the line-end.
   1.585 +
   1.586 +
   1.587 +    Line 2940 - Paragraph starts with lower-case
   1.588 +
   1.589 +    A common error in an e-text is for an extra blank line
   1.590 +
   1.591 +    to be put in, like the blank line above, and this often
   1.592 +    shows up as a new paragraph beginning with lower case.
   1.593 +    Sometimes the blank line is deliberate, as when a 
   1.594 +    quotation is inserted in a speech. Use your judgement.
   1.595 +
   1.596 +
   1.597 +    Line 2987 - Extra period?
   1.598 +
   1.599 +    An extra period. is a. common problem in OCRed text. and usually
   1.600 +    arises when a speck of dust on the page is mistaken for a period.
   1.601 +    or. as occasionally happens. when a comma loses its tail.
   1.602 +
   1.603 +
   1.604 +    Line 3012 column 12 - Double punctuation?
   1.605 +
   1.606 +    Double punctuation., like that,, is a common typo and
   1.607 +    scanno. Some books have much legit double punctuation,
   1.608 +    like etc., etc., but it's worth checking anyway.
   1.609 +
   1.610 +
   1.611 +
   1.612 +            *       *       *        *
   1.613 +
   1.614 +For Windows-only users who are unfamiliar with DOS:
   1.615 +
   1.616 +    If you're a Windows-only user, you need to save
   1.617 +    gutcheck.exe into the folder (directory) where the
   1.618 +    text file you want to check is. Let's say your
   1.619 +    text file is in C:\GUT, then you should save
   1.620 +    GUTCHECK.EXE into C:\GUT.
   1.621 +
   1.622 +    Now get to a DOS prompt. You can do this by
   1.623 +    selecting the "Command Prompt" or "MS-DOS Prompt"
   1.624 +    option that will be somewhere on your
   1.625 +    Start/Programs menu.
   1.626 +
   1.627 +    Now get into the C:\GUT directory. 
   1.628 +    You can do this using the CD (change directory) 
   1.629 +    command, like this:
   1.630 +        CD \GUT
   1.631 +    and your prompt will change to 
   1.632 +        C:\GUT>
   1.633 +    so you know you're in the right place.
   1.634 +
   1.635 +    Now type
   1.636 +        gutcheck yourfile.txt
   1.637 +    and you'll see gutcheck's report.
   1.638 +
   1.639 +    By default, gutcheck prints its queries to screen.
   1.640 +    If you want to create a file of them, to edit
   1.641 +    against the text, you can use the greater-than
   1.642 +    sign (>) to tell it to output the report to a
   1.643 +    file. For example, if you want its report in a
   1.644 +    file called QUERIES.LST, you could type
   1.645 +    
   1.646 +        gutcheck yourfile.txt > queries.lst
   1.647 +
   1.648 +    The queries.lst file will then contain the listing
   1.649 +    of possible formatting errors, and you can
   1.650 +    edit it alongside your text.
   1.651 +
   1.652 +    Whatever you do, DON'T make the filename after
   1.653 +    the greater-than sign the name of a file already
   1.654 +    on your disk that you want to keep, because
   1.655 +    the greater-than sign will cause gutcheck to
   1.656 +    replace any existing file of that name.
   1.657 +
   1.658 +    So, for example, if you have two Tolstoy files
   1.659 +    that you want to check, called WARPEACE.TXT and 
   1.660 +    ANNAK.TXT, make sure that neither of these names
   1.661 +    is ever used following the greater-than sign.
   1.662 +    To check these correctly, you might do:
   1.663 +
   1.664 +    gutcheck warpeace.txt > war.lst
   1.665 +
   1.666 +    and
   1.667 +
   1.668 +    gutcheck annak.txt > annak.lst
   1.669 +
   1.670 +    separately. Then you can look at war.lst and annak.lst
   1.671 +    to see the gutcheck reports.
   1.672 +
   1.673 +            *       *       *        *
   1.674 +
   1.675 +
   1.676 +For existing 0.98 users upgrading to 0.99:
   1.677 +
   1.678 +    If you run on old 16-bit DOS or Windows 3.x, I'm afraid
   1.679 +    you're out of luck. I'm not saying it _can't_ be compiled
   1.680 +    to run on 16-bit, but the executable with the package is
   1.681 +    for Win32 only. *nix users won't notice the change at all.
   1.682 +
   1.683 +
   1.684 +    There are two new switches: -u and -d. 
   1.685 +          See above for full rundown.
   1.686 +
   1.687 +
   1.688 +Here's a list of the new errors:
   1.689 +
   1.690 +    Line 1456 - Carat character?
   1.691 +
   1.692 +    I^ve found a few.
   1.693 +
   1.694 +
   1.695 +    Line 1821 - Forward slash?
   1.696 +
   1.697 +    Common error for italicized "I", or so /'ve found.
   1.698 +
   1.699 +
   1.700 +    Line 2139 - Query missing paragraph break?
   1.701 +
   1.702 +    "Come here, son." "Do I _have_ to go, dad?"
   1.703 +    Like that. False positives in some texts. Sorry 'bout that,
   1.704 +    but these are often errors.
   1.705 +
   1.706 +
   1.707 +    Line 2200 - Query had/bad error?
   1.708 +
   1.709 +    Clear enough. Doesn't catch as many as I'd like it to,
   1.710 +    but rarely gives false alarms.
   1.711 +
   1.712 +
   1.713 +    Line 2268 - Query punctuation after the?
   1.714 +
   1.715 +    Some words, like "the", very rarely have punctuation
   1.716 +    following them. Others, like "Mrs", usually have a
   1.717 +    period, but never a comma. Occasional false positives.
   1.718 +
   1.719 +
   1.720 +    Line 2380 - Query possible scanno arid
   1.721 +
   1.722 +    It found one of your user-defined typos when you
   1.723 +    used the -u switch.
   1.724 +
   1.725 +
   1.726 +    Line 2511 - Capital "S"?
   1.727 +
   1.728 +    Surprisingly common specific case, like: Jane'S 
   1.729 +
   1.730 +    
   1.731 +    Line 3469 - endquote missing punctuation?
   1.732 +
   1.733 +    OK. This one can really cause a lot of false positives
   1.734 +    in some books, but it switches itself off if it finds
   1.735 +    more than 20 in a text, unless you force it to list them
   1.736 +    all with the -v switch.
   1.737 +    "Hey, dad" Johnny said, "can we go now?"
   1.738 +    is a common punctuation-missing error.
   1.739 +
   1.740 +
   1.741 +    Line 4266 - Mismatched underscores?
   1.742 +
   1.743 +    Like mismatched anything else!
   1.744 +
   1.745 +