1.1 --- /dev/null Thu Jan 01 00:00:00 1970 +0000
1.2 +++ b/doc/bookloupe.txt Sat Feb 18 23:07:09 2012 +0000
1.3 @@ -0,0 +1,742 @@
1.4 +
1.5 +
1.6 + Gutcheck documentation
1.7 +
1.8 +
1.9 +gutcheck: lists possible common formatting errors in a Project
1.10 +Gutenberg candidate file. It is a command line program and can be used
1.11 +under Win32 or Unix (gutcheck.c should compile anywhere; if it doesn't,
1.12 +tell me). For Windows-only people, there is an appendix at the end
1.13 +with brief instructions for running it.
1.14 +
1.15 +
1.16 +Current version: 0.99. Users of 0.98 see end of file for changes.
1.17 +
1.18 +You should also have received the licence file COPYING, a README file,
1.19 +gutcheck.c, the source code, and gutcheck.exe, a DOS executable, with
1.20 +this file.
1.21 +
1.22 +This software is Copyright Jim Tinsley 2000-2005.
1.23 +
1.24 +Gutcheck comes with ABSOLUTELY NO WARRANTY. For details, read the file COPYING.
1.25 +This is Free Software; you may redistribute it under certain conditions (GPL).
1.26 +
1.27 +See http://gutcheck.sourceforge.net for the latest version.
1.28 +
1.29 +
1.30 +Usage is: gutcheck [-setopxlywm] filename
1.31 + where:
1.32 + -s checks Single quotes
1.33 + -e switches off Echoing of lines
1.34 + -t checks Typos
1.35 + -o produces an Overview only
1.36 + -p sets strict quotes checking for Paragraphs
1.37 + -x (paranoid) switches OFF typo checking and extra checks
1.38 + -l turns off Line-end checks
1.39 + -y sets error messages to stdout
1.40 + -w is a special mode for web uploads (for future use)
1.41 + -v (verbose) forces individual reporting of minor problems
1.42 + -m interprets Markup of some common HTML tags and entities
1.43 + -u warns about words in a user-defined typo file gutcheck.typ
1.44 + -d ignores some DP-specific markup
1.45 +
1.46 +Running gutcheck without any parameters will display a brief help message.
1.47 +
1.48 +Sample usage:
1.49 +
1.50 + gutcheck warpeace.txt
1.51 +
1.52 +
1.53 +More detail:
1.54 +
1.55 + Echoing lines (-e to switch off)
1.56 +
1.57 + You may find it convenient, when reviewing Gutcheck's
1.58 + suggestions, to see the line that Gutcheck is questioning.
1.59 + That way, you can often see at a glance whether it is
1.60 + a real error that needs to be fixed, or a false positive:
1.61 + something that belongs in the text but that Gutcheck's
1.62 + limited programming doesn't understand.
1.63 +
1.64 + By default, gutcheck echoes these lines, but if you don't
1.65 + want to see the lines referred to, -e will switch it OFF.
1.66 +
1.67 +
1.68 + Quotes (-s and -p switches)
1.69 +
1.70 + Gutcheck always looks for unbalanced doublequotes in a
1.71 + paragraph. It is a common convention for writers not to
1.72 + close quotes in a paragraph if the next paragraph opens
1.73 + with quotes and is a continuation by the same speaker.
1.74 +
1.75 + Gutcheck therefore does not normally report unclosed quotes
1.76 + if the next paragraph begins with a quote. If you need
1.77 + to see all unclosed quotes, even where the next paragraph
1.78 + begins with a quote, you should use the -p switch.
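The doublequote rule described above can be sketched roughly as follows. This is an illustration in Python, not gutcheck's own C code: the function name, the blank-line paragraph split and the `strict` parameter (standing in for the -p switch) are all assumptions made for the example.

```python
def unbalanced_quote_paragraphs(text, strict=False):
    """Return 1-based indexes of paragraphs with an odd number of '"' characters.

    Unless strict is True (modelled on the -p switch), an unclosed
    paragraph is skipped when the next paragraph opens with a quote.
    """
    # Assumption: paragraphs are separated by one or more blank lines.
    normalized = text.replace("\r\n", "\n")
    paragraphs = [p for p in normalized.split("\n\n") if p.strip()]
    flagged = []
    for i, para in enumerate(paragraphs):
        if para.count('"') % 2 == 0:
            continue  # quotes balance within this paragraph
        following = paragraphs[i + 1].lstrip() if i + 1 < len(paragraphs) else ""
        if not strict and following.startswith('"'):
            continue  # treated as the same speaker continuing
        flagged.append(i + 1)
    return flagged
```

With `strict=False` a paragraph left open before a continuation paragraph is not reported; with `strict=True` it is.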
1.79 +
1.80 + Singlequotes (') are a problem, since the same character
1.81 + is used for an apostrophe. I'm not sure that it is
1.82 + possible to get 100% accuracy on singlequotes checking,
1.83 + particularly since dialect, quite common in PG texts,
1.84 + upsets the normal rules so badly. Consider the sentence:
1.85 + 'Tis often said that a man's a man for a' that.
1.86 + As humans, we recognize that all three apostrophes are used
1.87 + for contractions rather than quotes, but it isn't easy
1.88 + to get a program to recognize that.
1.89 +
1.90 + Since Gutcheck makes too many mistakes when trying to match
1.91 + singlequotes, it doesn't look for unbalanced singlequotes
1.92 + unless you specify the -s switch.
1.93 +
1.94 + Consider these sentences, which illustrate the main cases:
1.95 +
1.96 + 'Tis often said that a fool and his money are soon parted.
1.97 +
1.98 + 'Becky's goin' home,' said Tom.
1.99 +
1.100 + The dogs' tails wagged in unison.
1.101 +
1.102 + Those 'pack dogs' of yours look more like wolves.
1.103 +
1.104 +
1.105 +
1.106 + Typos (-t switch)
1.107 +
1.108 + It's not Gutcheck's job to be a spelling checker, but it
1.109 + does check for a list of common typos and OCR errors if you
1.110 + use the -t switch. (The -x switch turns typo checking OFF.)
1.111 +
1.112 + It also checks for character combinations that rarely or never
1.113 + occur, especially those involving h and b, which OCR often
1.114 + confuses. For example, it queries "tbe" in a word. Now, "the" often
1.115 + occurs, but "tbe" is very rare (heartbeat, hotbed), so I'm
1.116 + playing the odds - a few false positives for many errors found.
1.117 + Similarly with "ii", which is a very common OCR error.
1.118 +
1.119 + Gutcheck suppresses multiple reporting of the first 40 "typos"
1.120 + found. This is to remove the annoyance of seeing something like
1.121 + "FN" (footnote) or "LK" (initials) flagged as a typo 147 times
1.122 + in a text.
1.123 +
1.124 +
1.125 + Line-end checking (-l switch to disable)
1.126 +
1.127 + All PG texts should have a Carriage Return (CR - character 13)
1.128 + and a Line Feed (LF - character 10) at end of each line,
1.129 + regardless of what O/S you made them on. DOS/Windows, Unix
1.130 + and Mac have different conventions, but the final text should
1.131 + always use a CR/LF pair as its line terminator.
1.132 +
1.133 + By default, Gutcheck verifies that every line has the
1.134 + correct terminator. If you're working on a text in Linux,
1.135 + though, you might prefer to convert the line-ends as a final
1.136 + step, and not want to see thousands of errors every time you
1.137 + run Gutcheck before then. In that case, you can turn off
1.138 + this checking with the -l switch.
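As a rough illustration of what a CR/LF check involves, here is a minimal Python sketch. It is not taken from gutcheck.c; the function name and the message strings are made up for the example and do not reproduce gutcheck's exact wording or its full set of line-end queries.

```python
def line_end_problems(data):
    """Return (line_number, message) pairs for line ends in `data` (bytes)
    that are not a CR (13) immediately followed by an LF (10)."""
    problems = []
    line_no = 1
    i = 0
    while i < len(data):
        byte = data[i]
        if byte == 13:  # CR
            if i + 1 < len(data) and data[i + 1] == 10:
                i += 2  # well-formed CR/LF pair
            else:
                problems.append((line_no, "CR without LF"))
                i += 1
            line_no += 1
        elif byte == 10:  # LF that was not preceded by a CR
            problems.append((line_no, "LF without CR"))
            line_no += 1
            i += 1
        else:
            i += 1
    return problems
```

A file saved with Unix line-ends would produce one entry per line here, which is the flood of messages that switching the check off avoids.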
1.139 +
1.140 +
1.141 + Paranoid mode (-x switch to disable: Trust No One :-)
1.142 +
1.143 + -x automatically switches OFF typo-checking (the -t flag)
1.144 + and some extra checks, like standalone 1 and 0 queries.
1.145 +
1.146 +
1.147 + Overview mode (-o switch)
1.148 +
1.149 + This mode just gives a count of queries found
1.150 + instead of a detailed list.
1.151 +
1.152 +
1.153 + Header quote (-h switch)
1.154 +
1.155 + If you use the -h switch, gutcheck will also display
1.156 + the Title, Author, Release and Edition fields from the
1.157 + PG header. This is useful mostly for the automated
1.158 + checks we do on recently-posted texts.
1.159 +
1.160 +
1.161 + Errors to stdout (-y switch)
1.162 +
1.163 + If you're just running gutcheck normally, you can ignore
1.164 + this. It's only there for programs that provide a front
1.165 + end to gutcheck. It makes error messages appear within
1.166 + the output of gutcheck so that the front end knows whether
1.167 + gutcheck ran OK.
1.168 +
1.169 +
1.170 + Verbose reporting (-v switch)
1.171 +
1.172 + Normally, if gutcheck sees lots of long lines, short lines,
1.173 + spaced dashes, non-ASCII characters or dot-commas ".," it
1.174 + assumes these are features of the text, counts and summarizes
1.175 + them at the top of its report, but does not list them
1.176 + individually. If the -v switch is on, gutcheck will list them all.
1.177 +
1.178 +
1.179 + Markup interpretation (-m switch)
1.180 +
1.181 + Normally, gutcheck flags anything it suspects of being HTML
1.182 + markup as a possible error. When you use the -m switch,
1.183 + however, it matches anything that looks like markup against
1.184 + a short list of common HTML tags and entities. If the markup
1.185 + is in that list, it either ignores the markup, in the case
1.186 + of a tag, or "interprets" the markup as its nearest ASCII
1.187 + equivalent, in the case of an entity. So, for example, using
1.188 + this switch, gutcheck will "see"
1.189 +
1.190 + &ldquo;He went <i>thataway!</i>&rdquo;
1.191 +
1.192 + as
1.193 +
1.194 + "He went thataway!"
1.195 +
1.196 + and report accordingly.
1.197 +
1.198 + This switch does not, not, NOT check the validity of HTML;
1.199 + it exists so that you can run gutcheck on most HTML texts
1.200 + for PG, and get sane results. It does not support all tags.
1.201 + It does not support all entities. When it sees a tag or entity
1.202 + it does not recognize, it will query it as HTML just as if
1.203 + you hadn't specified the -m switch.
1.204 +
1.205 + Gutcheck 0.99 will automatically switch on markup interpretation
1.206 + if it sees a lot of tags that appear to be markup, so mostly, you
1.207 + won't have to specify this.
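The idea of markup interpretation can be sketched like this in Python. The tag and entity tables below are hypothetical and deliberately tiny; this file does not list the tags and entities gutcheck actually recognizes, and the regular expression is only an approximation of "anything that looks like markup".

```python
import re

# Hypothetical sample tables, not gutcheck's real lists.
KNOWN_TAGS = {"i", "b", "em", "p"}
KNOWN_ENTITIES = {"&ldquo;": '"', "&rdquo;": '"', "&amp;": "&", "&mdash;": "--"}

# Matches something shaped like a tag (<i>, </i>) or an entity (&name;).
TOKEN = re.compile(r"</?([A-Za-z][A-Za-z0-9]*)[^<>]*>|&[A-Za-z]+;")

def interpret_markup(line):
    """Drop known tags, map known entities to ASCII, leave the rest alone."""
    def replace(match):
        token = match.group(0)
        if token.startswith("&"):
            # Unknown entities are left in place so they can still be queried.
            return KNOWN_ENTITIES.get(token, token)
        # Known tags are ignored; unknown tags stay in the text.
        return "" if match.group(1).lower() in KNOWN_TAGS else token
    return TOKEN.sub(replace, line)
```

Anything the tables do not cover passes through unchanged, mirroring the behaviour described above where unrecognized markup is still queried as HTML.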
1.208 +
1.209 + User-defined typos (-u switch)
1.210 +
1.211 + If you have a file named gutcheck.typ either in your current
1.212 + working directory or in the directory from which you explicitly
1.213 + invoked gutcheck, but not necessarily on your path, and if you
1.214 + specify the -u switch, gutcheck will query any word specified
1.215 + in that file. The file is simple: one word, in lower case, per
1.216 + line. 999 lines are allowed for. Be careful not to put multiple
1.217 + words onto a line, or leave any rubbish other than the word on
1.218 + the line. You should have received a sample file gutcheck.typ
1.219 + with this package.
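A loader for a one-word-per-line typo file might look like the following Python sketch. The function names are invented for the example, and details such as skipping blank lines and matching case-insensitively are assumptions, not documented gutcheck behaviour.

```python
import re

def load_user_typos(lines, limit=999):
    """Collect one word per line, ignoring blank lines, up to `limit` words."""
    words = []
    for raw in lines:
        word = raw.strip()
        if not word:
            continue  # assumption: blank lines are simply skipped
        if len(words) >= limit:
            break  # the file format allows for 999 lines
        words.append(word.lower())
    return set(words)

def query_user_typos(text_line, typos):
    """Return the words on text_line that appear in the user typo set."""
    candidates = re.findall(r"[A-Za-z']+", text_line)
    return [w for w in candidates if w.lower() in typos]
```

In real use the `lines` argument could be an open file object, for example `open("gutcheck.typ")`.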
1.220 +
1.221 + Ignore DP markup (-d switch)
1.222 +
1.223 + Distributed Proofreaders (http://www.pgdp.net) is currently
1.224 + (2005) the main source of PG texts, and proofers there use
1.225 + special conventions. This switch understands those conventions,
1.226 + so that people can use gutcheck on files in progress that
1.227 + haven't had the special conventions removed yet. The special
1.228 + conventions supported in 0.99 are page-separators and
1.229 + "<sc>", "</sc>", "/*", "*/", "/#", "#/", "/$", "$/".
1.230 +
1.231 +
1.232 +You will probably only run gutcheck on a text once or maybe twice,
1.233 +just prior to uploading. It usually finds a few formatting problems.
1.234 +It also usually raises queries that aren't problems at all; it often
1.235 +questions Tables of Contents for having short lines, for example.
1.236 +These are called "false positives", and need a human to decide on
1.237 +them.
1.238 +
1.239 +The text should be standard prose, and already close to PG normal
1.240 +format (plain text, about 70 characters per line with blank lines
1.241 +between paragraphs).
1.242 +
1.243 +Gutcheck merely draws your attention to things that might be errors.
1.244 +It is NOT a substitute for human judgement. Formatting choices like
1.245 +short lines may be for a reason that this program can't understand.
1.246 +
1.247 +Even the most careful human proofing can leave errors behind in a
1.248 +text, and there are several automated checks you can do to help find
1.249 +them. Of these, spellchecking (with _very_ careful human judgement) is
1.250 +the most important and most useful.
1.251 +
1.252 +Gutcheck does perform some basic typo-checking if you ask it to,
1.253 +but its focus is on formatting errors specific to PG texts -
1.254 +mismatched quotes, non-ASCII characters, bad spacing, bad line
1.255 +length, HTML tags perhaps left from a conversion, unbalanced
1.256 +brackets.
1.257 +
1.258 +Suggestions for additional checks would be appreciated and duly
1.259 +considered, but no guarantees that they will be implemented.
1.260 +
1.261 +
1.262 +
1.263 +
1.264 + How do _I_ use it?
1.265 +
1.266 +Practically everyone I give gutcheck to asks me how _I_ use it.
1.267 +Well, when I get a text for posting, say filename.txt, I run
1.268 +
1.269 + gutcheck -o filename.txt
1.270 +
1.271 +That gives me a quick idea what I'm dealing with. It'll tell
1.272 +me what kind of problems gutcheck sees, and give me an idea
1.273 +of how much more work needs to be done on the text. Keep in
1.274 +mind that gutcheck doesn't do anything like a full spellcheck,
1.275 +but when I see a text that has a lot of problems, I assume that
1.276 +it probably needs a spellcheck too.
1.277 +
1.278 +Having got a feel for the ballpark, I run
1.279 +
1.280 + gutcheck filename.txt > jj
1.281 +
1.282 +where jj is my personal, all-purpose filename for temporary data
1.283 +that doesn't need to be kept. Then I open filename.txt and jj in
1.284 +a split-screen view in my editor, and work down the text, fixing
1.285 +whatever needs fixing, and skipping whatever doesn't. If your
1.286 +editor doesn't split-screen, you can get much the same effect by
1.287 +opening your original file in your normal editor, and jj (or your
1.288 +equivalent name) in something like Notepad, keeping both in view
1.289 +at the same time.
1.290 +
1.291 +Twice a day, an automatic process looks at all recently-posted
1.292 +texts, and emails Michael, me, and sometimes other people with
1.293 +their gutcheck summaries.
1.294 +
1.295 +
1.296 +
1.297 + Future development of gutcheck
1.298 +
1.299 +Gutcheck has gone about as far as it can, given its current
1.300 +structure. In order to add better singlequotes checking,
1.301 +sentence checking, better he/be checking and other good stuff
1.302 +that I'd like to see, I'll have to rewrite it from a different
1.303 +angle - looking at the syntax instead of the lines. And I'll
1.304 +probably get around to that sooner or later.
1.305 +
1.306 +Meantime, I'm just trying to get this version stabilized, so
1.307 +please report any bugs you find. When it is stable, I'll run
1.308 +up a Windows port for those timid souls who can't look a
1.309 +command line in the eye. :-)
1.310 +
1.311 +And I've started work on gutspell, a companion to gutcheck
1.312 +which will concentrate on spelling problems. PG spelling
1.313 +problems are unusual, since the range of texts we cover is
1.314 +so wide, and I'll be taking a somewhat unorthodox approach
1.315 +to writing this spelling-checker _specifically_ for texts
1.316 +containing a lot of dialect and uncommon words that have
1.317 +probably already been spell-checked against a standard
1.318 +modern dictionary.
1.319 +
1.320 +
1.321 +
1.322 +
1.323 +Explanations of common gutcheck messages:
1.324 +
1.325 + --> 74 lines in this file have white space at end
1.326 +
1.327 + PG texts shouldn't have extra white space added at end of line.
1.328 + Don't worry too much about this; it isn't doing any harm,
1.329 + and it will be removed during posting anyway.
1.330 +
1.331 +
1.332 + --> 348 lines in this file are short. Not reporting short lines.
1.333 + --> 84 lines in this file are long. Not reporting long lines.
1.334 + --> 8 lines in this file are VERY long!
1.335 +
1.336 + If there are a lot of long or short lines, Gutcheck won't list
1.337 + them individually. The short lines version of this message
1.338 + is commonly seen when gutchecking poetry and some plays, where
1.339 + the normal line length is shorter than the standard for prose.
1.340 + A "VERY long" line is one over 80 characters. You normally
1.341 + shouldn't have any of these, but sometimes you may have to render
1.342 + a table that must be that long, or some special preformatted
1.343 + quotation that can't be broken.
1.344 +
1.345 +
1.346 + --> There are 75 spaced dashes and em-dashes in this file. Not reporting them.
1.347 +
1.348 + The PG standard for an emdash--like these--is two minus signs
1.349 + with no spaces before or after them. However, some older texts
1.350 + used spaced dashes - like these -- and if there are very many
1.351 + such spaced dashes in the file, gutcheck just draws your
1.352 + attention to it and doesn't list them individually.
1.353 +
1.354 +
1.355 +
1.356 + Line 3020 - Non-ASCII character 233
1.357 +
1.358 + Standard PG texts should use only ASCII characters with values
1.359 + up to 127; however, non-English, accented characters can be
1.360 + represented according to several different non-ASCII encoding
1.361 + schemes, using values over 127. If you have a plain English text
1.362 + with a few accented characters in words like cafe or tete-a-tete,
1.363 + you should replace the accented characters with their unaccented
1.364 + versions. The English pound sign is another commonly-seen
1.365 + non-ASCII character. If you have enough non-ASCII characters in
1.366 + your text that you feel removing them would degrade your text
1.367 + unacceptably, you should probably consider doing an 8-bit text
1.368 + as well as a plain-ASCII version.
1.369 +
1.370 +
1.371 +
1.372 + Line 1207 - Non-ISO-8859 character 156
1.373 +
1.374 + Even in "8-bit" texts, there are distinctions between code sets.
1.375 + The ISO-8859 family of 8-bit code sets is the most commonly used
1.376 + in PG, and these sets do not define values in the range 128 through
1.377 + 159 as printable characters. It's quite common for someone on a
1.378 + Windows or Mac machine to use a non-ISO character inadvertently,
1.379 + so this message warns that the character is not only not ASCII,
1.380 + but also outside the ISO-8859 range.
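The two character-range messages above boil down to a simple classification by byte value. A minimal Python sketch, with invented function names and labels:

```python
def classify_byte(value):
    """Classify a byte value by the ranges described above."""
    if not 0 <= value <= 255:
        raise ValueError("expected a byte value in 0..255")
    if value <= 127:
        return "ascii"
    if value <= 159:
        return "non-iso-8859"  # 128 through 159: not printable in ISO-8859
    return "non-ascii"  # 160 through 255: 8-bit, but within ISO-8859

def scan_bytes(data):
    """Return (line_number, byte_value, kind) for every non-ASCII byte."""
    results = []
    for line_no, line in enumerate(data.split(b"\n"), start=1):
        for value in line:
            kind = classify_byte(value)
            if kind != "ascii":
                results.append((line_no, value, kind))
    return results
```

Character 233 from the first sample message falls in the "non-ascii" group, and character 156 from the second falls in the "non-iso-8859" group.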
1.381 +
1.382 +
1.383 +
1.384 + Line 46 - Tab character?
1.385 +
1.386 + Some editors and WPs will put in Tab characters (character 9) to
1.387 + indicate indented text. You should not use these in a PG text,
1.388 + because you can't be sure how they will appear on a reader's
1.389 + screen. Find the Tab, and replace it with the appropriate number
1.390 + of spaces.
1.391 +
1.392 +
1.393 + Line 1327 - Tilde character?
1.394 +
1.395 + The tilde character (~) might be legitimately used, but it's the
1.396 + character commonly used by OCR software to indicate a place where
1.397 + it couldn't make out the letter, so gutcheck flags it.
1.398 +
1.399 +
1.400 +
1.401 + Line 1347 - Asterisk?
1.402 +
1.403 + Asterisks are reported only in paranoid mode (see -x).
1.404 + Like tildes, they are often used to indicate errors, but they are
1.405 + also legitimately used as line delimiters and footnote markers.
1.406 +
1.407 +
1.408 +
1.409 + Line 1451 - Long line 129
1.410 +
1.411 + PG texts should have lines shorter than 76 characters. There may
1.412 + be occasions where you decide that you really have to go out to 79
1.413 + characters, but the sample above says that line 1451 is 129
1.414 + characters long - probably two lines run together.
1.415 +
1.416 +
1.417 +
1.418 + Line 1590 - Short line?
1.419 +
1.420 + PG texts should have lines longer than 54 characters. However,
1.421 + there are special cases like poetry and tables of contents where
1.422 + the lines _should_ be shorter. So treat Gutcheck warnings about
1.423 + short lines carefully. Sometimes it's a genuine formatting
1.424 + problem; sometimes the line really needs to be short.
1.425 +
1.426 + Hint: gutcheck will not flag lines as short if they are indented
1.427 + - if they start with a space. I like to start inserted stanzas
1.428 + and other such items indented with a couple of spaces so that
1.429 + they stand out from the main text anyway.
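Taken together, the line-length figures quoted in this file (shorter than 76, longer than 54, "VERY long" over 80, indented lines exempt from the short-line query) suggest a classification like the Python sketch below. The exact boundary handling is an assumption, and the sketch ignores context: the last line of a paragraph, for instance, is naturally short and this code would still label it "short".

```python
def classify_line_length(line):
    """Label a single line (without its terminator) by length."""
    length = len(line)
    if length > 80:
        return "very long"
    if length >= 76:
        return "long"
    # Indented lines (starting with a space) are not treated as short,
    # and neither are empty lines.
    if 0 < length <= 54 and not line.startswith(" "):
        return "short"
    return "ok"
```

Indenting a stanza by a couple of spaces moves its lines from "short" to "ok" under this rule, which is the effect the hint above relies on.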
1.430 +
1.431 +
1.432 +
1.433 + Line 1804 - Begins with punctuation?
1.434 +
1.435 + Lines should normally not begin with commas, periods and so on.
1.436 + An exception is ellipses . . . which can happen at start of line.
1.437 +
1.438 +
1.439 +
1.440 + Line 1850 - Spaced em-dash?
1.441 +
1.442 + The PG standard for an em-dash--like these--is two minus signs
1.443 + with no spaces before or after them. Gutcheck flags non-PG
1.444 + em-dashes - like this one. Normally, you will replace it with a
1.445 + PG-standard em-dash.
1.446 +
1.447 +
1.448 +
1.449 + Line 1904 - Query he/be error?
1.450 +
1.451 + Gutcheck makes a very minor effort to look for that scourge of all
1.452 + proofreaders, "be" replacing "he" or vice-versa, and draws your
1.453 + attention to it when it thinks it has found one.
1.454 +
1.455 +
1.456 +
1.457 + Line 2017 - Query digit in a1most
1.458 +
1.459 + The digit 1 is commonly OCRed for the letter l, the digit 0 for
1.460 + the letter O, and so on. When gutcheck sees a mix of digits and
1.461 + letters, it warns you. It may generate a false positive for
1.462 + something like 7am.
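Spotting a mix of digits and letters is easy to illustrate. The Python sketch below uses an invented function name and a simple definition of "word" (a run of ASCII letters and digits); gutcheck's own word-splitting rules may differ.

```python
import re

WORD = re.compile(r"[A-Za-z0-9]+")

def mixed_digit_words(line):
    """Return the words on a line that contain both digits and letters."""
    return [
        word
        for word in WORD.findall(line)
        if any(c.isdigit() for c in word) and any(c.isalpha() for c in word)
    ]
```

Pure numbers such as a year are not reported, but a legitimate token like 7am is, which is the kind of false positive mentioned above.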
1.463 +
1.464 +
1.465 +
1.466 + Line 2083 - Query standalone 0
1.467 +
1.468 + In paranoid mode (see -x) only, gutcheck warns about the digit 0
1.469 + and the number 1 standing alone as a word. This can happen if the
1.470 + OCR misreads the words O or I.
1.471 +
1.472 +
1.473 +
1.474 + Line 2115 - Query word whetber
1.475 +
1.476 + If you have switched typo-checking on, gutcheck looks for
1.477 + potential typos, especially common h/b errors. It's not
1.478 + infallible; it sometimes queries legit words, but it's
1.479 + always worth taking a look.
1.480 +
1.481 +
1.482 +
1.483 + Line 2190 column 14 - Missing space?
1.484 +
1.485 + Omitting a space is a very common error,especially coming from
1.486 + OCRed text,and can be hard for a human to spot. The commas in
1.487 + the previous sentence illustrate the kind of thing I mean.
1.488 +
1.489 +
1.490 +
1.491 + Line 2240 column 48 - Spaced punctuation?
1.492 +
1.493 + The flip side of the "missing space" error , here , is when extra
1.494 + spaces are added before punctuation . Some old texts appear to add
1.495 + extra spaces around punctuation consistently, but this was a
1.496 + typographical convention rather than the author's intent, and the
1.497 + extra "spaces" should be removed when preparing a PG text.
1.498 +
1.499 +
1.500 +
1.501 + Line 2301 column 19 - Unspaced quotes?
1.502 +
1.503 + Another common spacing problem occurs in a phrase like "You wait
1.504 + there,"he said.
1.505 +
1.506 +
1.507 +
1.508 + Line 2385 column 27 - Wrongspaced quotes?
1.509 +
1.510 + As of version 0.98, gutcheck adds extra checks on whether a quote
1.511 + seems to be a start or end quote, and queries those that appear to
1.512 + be misplaced. This does give rise to false positives when quotes are
1.513 + nested, for example:
1.514 +
1.515 + "And how," she asked, "will your "friends" help you now?"
1.516 +
1.517 + but these false positives are worth it because of the many cases
1.518 + that this test catches, notably those like:
1.519 +
1.520 + "And how, "she said," will your friends help you now?"
1.521 +
1.522 + Sometimes a "wrongspaced quotes" query will arise because an earlier
1.523 + quote in the paragraph was omitted, so if the place specified seems
1.524 + to be OK, look back to see whether there's a problem in the preceding
1.525 + lines.
1.526 +
1.527 +
1.528 +
1.529 + Line 2400 - HTML Tag? <PRE>
1.530 +
1.531 + Some PG texts have been converted from HTML, and not all of the
1.532 + HTML tags have been removed.
1.533 +
1.534 +
1.535 +
1.536 + Line 2402 - HTML symbol? &emdash;
1.537 +
1.538 + Similarly, special HTML symbol characters can survive into PG
1.539 + texts. These can occasionally produce amusing false positives like
1.540 + . . . Marwick & Co were well known for it;
1.541 +
1.542 +
1.543 +
1.544 + Line 2540 - Mismatched quotes
1.545 +
1.546 + Another gutcheck mainstay - unclosed doublequotes in a paragraph.
1.547 + See the discussion of quotes in the switches section near the
1.548 + start of this file.
1.549 +
1.550 + Because the mismatch doesn't occur on any one line, gutcheck quotes
1.551 + the line number of the first blank line following the paragraph,
1.552 + which is the point where it reconciles the count of quotes.
1.553 + However, if gutcheck is echoing lines, that is, you haven't used
1.554 + the -e switch, it will show the _first_ line of the paragraph,
1.555 + to help you find the place without using line numbers. The
1.556 + offending paragraph is therefore between the quoted line and
1.557 + the line number given.
1.558 +
1.559 +
1.560 +
1.561 + Line 2587 - Mismatched single quotes
1.562 +
1.563 + Only checked with the -s switch, since checking single quotes is
1.564 + not a very reliable process. Otherwise, the same logic as for
1.565 + doublequotes applies.
1.566 +
1.567 +
1.568 +
1.569 + Line 2877 - Mismatched round brackets?
1.570 +
1.571 + Also curly and square brackets. Texts with a lot of brackets, like
1.572 + plays with bracketed stage instructions, may have mismatches.
1.573 +
1.574 +
1.575 + Line 3150 - No CR?
1.576 + Line 3204 - Two successive CRs?
1.577 + Line 3281 position 75 - CR without LF?
1.578 +
1.579 + These are the invalid line-end warnings. See the discussion of
1.580 + line-end checking in the switches section near the start of this
1.581 + file. If you see these, and your editor doesn't show anything
1.582 + wrong, you should probably try deleting the characters just before
1.583 + and after the line end, and the line-end itself, then retyping the
1.584 + characters and the line-end.
1.585 +
1.586 +
1.587 + Line 2940 - Paragraph starts with lower-case
1.588 +
1.589 + A common error in an e-text is for an extra blank line
1.590 +
1.591 + to be put in, like the blank line above, and this often
1.592 + shows up as a new paragraph beginning with lower case.
1.593 + Sometimes the blank line is deliberate, as when a
1.594 + quotation is inserted in a speech. Use your judgement.
1.595 +
1.596 +
1.597 + Line 2987 - Extra period?
1.598 +
1.599 + An extra period. is a. common problem in OCRed text. and usually
1.600 + arises when a speck of dust on the page is mistaken for a period.
1.601 + or. as occasionally happens. when a comma loses its tail.
1.602 +
1.603 +
1.604 + Line 3012 column 12 - Double punctuation?
1.605 +
1.606 + Double punctuation., like that,, is a common typo and
1.607 + scanno. Some books have much legit double punctuation,
1.608 + like etc., etc., but it's worth checking anyway.
1.609 +
1.610 +
1.611 +
1.612 + * * * *
1.613 +
1.614 +For Windows-only users who are unfamiliar with DOS:
1.615 +
1.616 + If you're a Windows-only user, you need to save
1.617 + gutcheck.exe into the folder (directory) where the
1.618 + text file you want to check is. Let's say your
1.619 + text file is in C:\GUT, then you should save
1.620 + GUTCHECK.EXE into C:\GUT.
1.621 +
1.622 + Now get to a DOS prompt. You can do this by
1.623 + selecting the "Command Prompt" or "MS-DOS Prompt"
1.624 + option that will be somewhere on your
1.625 + Start/Programs menu.
1.626 +
1.627 + Now get into the C:\GUT directory.
1.628 + You can do this using the CD (change directory)
1.629 + command, like this:
1.630 + CD \GUT
1.631 + and your prompt will change to
1.632 + C:\GUT>
1.633 + so you know you're in the right place.
1.634 +
1.635 + Now type
1.636 + gutcheck yourfile.txt
1.637 + and you'll see gutcheck's report.
1.638 +
1.639 + By default, gutcheck prints its queries to screen.
1.640 + If you want to create a file of them, to edit
1.641 + against the text, you can use the greater-than
1.642 + sign (>) to tell it to output the report to a
1.643 + file. For example, if you want its report in a
1.644 + file called QUERIES.LST, you could type
1.645 +
1.646 + gutcheck yourfile.txt > queries.lst
1.647 +
1.648 + The queries.lst file will then contain the listing
1.649 + of possible formatting errors, and you can
1.650 + edit it alongside your text.
1.651 +
1.652 + Whatever you do, DON'T make the filename after
1.653 + the greater-than sign the name of a file already
1.654 + on your disk that you want to keep, because
1.655 + the greater-than sign will cause gutcheck to
1.656 + replace any existing file of that name.
1.657 +
1.658 + So, for example, if you have two Tolstoy files
1.659 + that you want to check, called WARPEACE.TXT and
1.660 + ANNAK.TXT, make sure that neither of these names
1.661 + is ever used following the greater-than sign.
1.662 + To check these correctly, you might do:
1.663 +
1.664 + gutcheck warpeace.txt > war.lst
1.665 +
1.666 + and
1.667 +
1.668 + gutcheck annak.txt > annak.lst
1.669 +
1.670 + separately. Then you can look at war.lst and annak.lst
1.671 + to see the gutcheck reports.
1.672 +
1.673 + * * * *
1.674 +
1.675 +
1.676 +For existing 0.98 users upgrading to 0.99:
1.677 +
1.678 + If you run on old 16-bit DOS or Windows 3.x, I'm afraid
1.679 + you're out of luck. I'm not saying it _can't_ be compiled
1.680 + to run on 16-bit, but the executable with the package is
1.681 + for Win32 only. *nix users won't notice the change at all.
1.682 +
1.683 +
1.684 + There are two new switches: -u and -d.
1.685 + See above for full rundown.
1.686 +
1.687 +
1.688 +Here's a list of the new errors:
1.689 +
1.690 + Line 1456 - Carat character?
1.691 +
1.692 + I^ve found a few.
1.693 +
1.694 +
1.695 + Line 1821 - Forward slash?
1.696 +
1.697 + Common error for italicized "I", or so /'ve found.
1.698 +
1.699 +
1.700 + Line 2139 - Query missing paragraph break?
1.701 +
1.702 + "Come here, son." "Do I _have_ to go, dad?"
1.703 + Like that. False positives in some texts. Sorry 'bout that,
1.704 + but these are often errors.
1.705 +
1.706 +
1.707 + Line 2200 - Query had/bad error?
1.708 +
1.709 + Clear enough. Doesn't catch as many as I'd like it to,
1.710 + but rarely gives false alarms.
1.711 +
1.712 +
1.713 + Line 2268 - Query punctuation after the?
1.714 +
1.715 + Some words, like "the", very rarely have punctuation
1.716 + following them. Others, like "Mrs", usually have a
1.717 + period, but never a comma. Occasional false positives.
1.718 +
1.719 +
1.720 + Line 2380 - Query possible scanno arid
1.721 +
1.722 + It found one of your user-defined typos when you
1.723 + used the -u switch.
1.724 +
1.725 +
1.726 + Line 2511 - Capital "S"?
1.727 +
1.728 + Surprisingly common specific case, like: Jane'S
1.729 +
1.730 +
1.731 + Line 3469 - endquote missing punctuation?
1.732 +
1.733 + OK. This one can really cause a lot of false positives
1.734 + in some books, but it switches itself off if it finds
1.735 + more than 20 in a text, unless you force it to list them
1.736 + all with the -v switch.
1.737 + "Hey, dad" Johnny said, "can we go now?"
1.738 + is a common punctuation-missing error.
1.739 +
1.740 +
1.741 + Line 4266 - Mismatched underscores?
1.742 +
1.743 + Like mismatched anything else!
1.744 +
1.745 +