diff -r 000000000000 -r 218904410231 doc/gutcheck.txt --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/doc/gutcheck.txt Fri Jan 27 00:28:11 2012 +0000 @@ -0,0 +1,742 @@ + + + Gutcheck documentation + + +gutcheck: lists possible common formatting errors in a Project +Gutenberg candidate file. It is a command line program and can be used +under Win32 or Unix (gutcheck.c should compile anywhere; if it doesn't, +tell me). For Windows-only people, there is an appendix at the end +with brief instructions for running it. + + +Current version: 0.99. Users of 0.98 see end of file for changes. + +You should also have received the licence file COPYING, a README file, +gutcheck.c, the source code, and gutcheck.exe, a DOS executable, with +this file. + +This software is Copyright Jim Tinsley 2000-2005. + +Gutcheck comes wih ABSOLUTELY NO WARRANTY. For details, read the file COPYING. +This is Free Software; you may redistribute it under certain conditions (GPL). + +See http://gutcheck.sourceforge.net for the latest version. + + +Usage is: gutcheck [-setopxlywm] filename + where: + -s checks Single quotes + -e switches off Echoing of lines + -t checks Typos + -o produces an Overview only + -p sets strict quotes checking for Paragraphs + -x (paranoid) switches OFF typo checking and extra checks + -l turns off Line-end checks + -y sets error messages to stdout + -w is a special mode for web uploads (for future use) + -v (verbose) forces individual reporting of minor problems + -m interprets Markup of some common HTML tags and entities + -u warns about words in a user-defined typo file gutcheck.typ + -d ignores some DP-specific markup + +Running gutcheck without any parameters will display a brief help message. + +Sample usage: + + gutcheck warpeace.txt + + +More detail: + + Echoing lines (-e to switch off) + + You may find it convenient, when reviewing Gutcheck's + suggestions, to see the line that Gutcheck is questioning. + That way, you can often see at a glance whether it is + a real error that needs to be fixed, or a false positive + that should be in the text, but Gutcheck's limited + programming doesn't understand. + + By default, gutcheck echoes these lines, but if you don't + want to see the lines referred to, -e will switch it OFF. + + + Quotes (-s and -p switches) + + Gutcheck always looks for unbalanced doublequotes in a + paragraph. It is a common convention for writers not to + close quotes in a paragraph if the next paragraph opens + with quotes and is a continuation by the same speaker. + + Gutcheck therefore does not normally report unclosed quotes + if the next paragraph begins with a quote. If you need + to see all unclosed quotes, even where the next paragraph + begins with a quote, you should use the -p switch. + + Singlequotes (') are a problem, since the same character + is used for an apostrophe. I'm not sure that it is + possible to get 100% accuracy on singlequotes checking, + particularly since dialect, quite common in PG texts, + upsets the normal rules so badly. Consider the sentence: + 'Tis often said that a man's a man for a' that. + As humans, we recognize that both apostrophes are used + for contractions rather than quotes, but it isn't easy + to get a program to recognize that. + + Since Gutcheck makes too many mistakes when trying to match + singlequotes, it doesn't look for unbalanced singlequotes + unless you specify the -s switch. + + Consider these sentences, which illustrate the main cases: + + 'Tis often said that a fool and his money are soon parted. + + 'Becky's goin' home,' said Tom. + + The dogs' tails wagged in unison. + + Those 'pack dogs' of yours look more like wolves. + + + + Typos (-t switch) + + It's not Gutcheck's job to be a spelling checker, but it + does check for a list of common typos and OCR errors if you + use the -t switch. (The -x switch also turns typo checking on.) + + It also checks for character combinations, especially involving + h and b, which are often confused by OCR, that rarely or never + occur. For example, it queries "tbe" in a word. Now, "the" often + occurs, but "tbe" is very rare (heartbeat, hotbed), so I'm + playing the odds - a few false positives for many errors found. + Similarly with "ii", which is a very common OCR error. + + Gutcheck suppresses multiple reporting of the first 40 "typos" + found. This is to remove the annoyance of seeing something like + "FN" (footnote) or "LK" (initials) flagged as a typo 147 times + in a text. + + + Line-end checking (-l switch to disable) + + All PG texts should have a Carriage Return (CR - character 13) + and a Line Feed (LF - character 10) at end of each line, + regardless of what O/S you made them on. DOS/Windows, Unix + and Mac have different conventions, but the final text should + always use a CR/LF pair as its line terminator. + + By default, Gutcheck verifies that every line does have + the correct terminator, but if you're on a work-in-progress + in Linux, you might want to convert the line-ends as a final + step, and not want to see thousands of errors every time you + run Gutcheck before that final step, so you can turn off + this checking with the -l switch. + + + Paranoid mode (-x switch to disable: Trust No One :-) + + -x switches OFF typo-checking, the -t flag, automatically + and some extra checks like standalone 1 and 0 queries. + + + Overview mode (-o switch) + + This mode just gives a count of queries found + instead of a detailed list. + + + Header quote (-h switch) + + If you use the -h switch, gutcheck will also display + the Title, Author, Release and Edition fields from the + PG header. This is useful mostly for the automated + checks we do on recently-posted texts. + + + Errors to stdout (-y switch) + + If you're just running gutcheck normally, you can ignore + this. It's only there for programs that provide a front + end to gutcheck. It makes error messages appear within + the output of gutcheck so that the front end knows whether + gutcheck ran OK. + + + Verbose reporting (-v switch) + + Normally, if gutcheck sees lots of long lines, short lines, + spaced dashes, non-ASCII characters or dot-commas ".," it + assumes these are features of the text, counts and summarizes + them at the top of its report, but does not list them + individually. If the -v switch is on, gutcheck will list them all. + + + Markup interpretation (-m switch) + + Normally, gutcheck flags anything it suspects of being HTML + markup as a possible error. When you use the -m switch, + however, it matches anything that looks like markup against + a short list of common HTML tags and entities. If the markup + is in that list, it either ignores the markup, in the case + of a tag, or "interprets" the markup as its nearest ASCII + equivalent, in the case of an entity. So, for example, using + this switch, gutcheck will "see" + + “He went thataway!” + + as + + "He went thataway!" + + and report accordingly. + + This switch does not, not, NOT check the validity of HTML; + it exists so that you can run gutcheck on most HTML texts + for PG, and get sane results. It does not support all tags. + It does not support all entities. When it sees a tag or entity + it does not recognize, it will query it as HTML just as if + you hadn't specified the -m switch. + + Gutcheck 0.99 will automatically switch on markup interpretation + if it sees a lot of tags that appear to be markup, so mostly, you + won't have to specify this. + + User-defined typos (-u switch) + + If you have a file named gutcheck.typ either in your current + working directory or in the directory from which you explicitly + invoked gutcheck, but not necessarily on your path, and if you + specify the -u switch, gutcheck will query any word specified + in that file. The file is simple: one word, in lower case, per + line. 999 lines are allowed for. Be careful not to put multiple + words onto a line, or leave any rubbish other than the word on + the line. You should have received a sample file gutcheck.typ + with this package. + + Ignore DP markup (-d switch) + + Distributed Proofreaders (http://www.pgdp.net) is currently + (2005) the main source of PG texts, and proofers there use + special conventions. This switch understands those conventions, + so that people can use gutcheck on files in process that still + haven't had the special conventions removed yet. The special + conventions supported in 0.99 are page-separators and + "", "", "/*", "*/", "/#", "#/", "/$", "$/". + + +You will probably only run gutcheck on a text once or maybe twice, +just prior to uploading; it usually finds a few formatting problems; +it also usually finds queries that aren't problems at all - it often +questions Tables of Contents for having short lines, for example. +These are called "false positives", and need a human to decide on +them. + +The text should be standard prose, and already close to PG normal +format (plain text, about 70 characters per line with blank lines +between paragraphs). + +Gutcheck merely draws your attention to things that might be errors. +It is NOT a substitute for human judgement. Formatting choices like +short lines may be for a reason that this program can't understand. + +Even the most careful human proofing can leave errors behind in a +text, and there are several automated checks you can do to help find +them. Of these, spellchecking (with _very_ careful human judgement) is +the most important and most useful. + +Gutcheck does perform some basic typo-checking if you ask it to, +but its focus is on formatting errors specific to PG texts - +mismatched quotes, non-ASCII characters, bad spacing, bad line +length, HTML tags perhaps left from a conversion, unbalanced +brackets. + +Suggestions for additional checks would be appreciated and duly +considered, but no guarantees that they will be implemented. + + + + + How do _I_ use it? + +Practically everyone I give gutcheck to asks me how _I_ use it. +Well, when I get a text for posting, say filename.txt, I run + + gutcheck -o filename.txt + +That gives me a quick idea what I'm dealing with. It'll tell +me what kind of problems gutcheck sees, and give me an idea +of how much more work needs to be done on the text. Keep in +mind that gutcheck doesn't do anything like a full spellcheck, +but when I see a text that has a lot of problems, I assume that +it probably needs a spellcheck too. + +Having got a feel for the ballpark, I run + + gutcheck filename.txt > jj + +where jj is my personal, all-purpose filename for temporary data +that doesn't need to be kept. Then I open filename.txt and jj in +a split-screen view in my editor, and work down the text, fixing +whatever needs fixing, and skipping whatever doesn't. If your +editor doesn't split-screen, you can get much the same effect by +opening your original file in your normal editor, and jj (or your +equivalent name) in something like Notepad, keeping both in view +at the same time. + +Twice a day, an automatic process looks at all recently-posted +texts, and emails Michael, me, and sometimes other people with +their gutcheck summaries. + + + + Future development of gutcheck + +Gutcheck has gone about as far as it can, given its current +structure. In order to add better singlequotes checking, +sentence checking, better he/be checking and other good stuff +that I'd like to see, I'll have to rewrite it from a different +angle - looking at the syntax instead of the lines. And I'll +probably get around to that sooner or later. + +Meantime, I'm just trying to get this version stabilized, so +please report any bugs you find. When it is stable, I'll run +up a Windows port for those timid souls who can't look a +command line in the eye. :-) + +And I've started work on gutspell, a companion to gutcheck +which will concentrate on spelling problems. PG spelling +problems are unusual, since the range of texts we cover is +so wide, and I'll be taking a somewhat unorthodox approach +to writing this spelling-checker _specifically_ for texts +containing a lot of dialect and uncommon words that have +probably already been spell-checked against a standard +modern dictionary. + + + + +Explanations of common gutcheck messages: + + --> 74 lines in this file have white space at end + + PG texts shouldn't have extra white space added at end of line. + Don't worry too much about this; they're not doing any harm, + and they'll be removed during posting anyway. + + + --> 348 lines in this file are short. Not reporting short lines. + --> 84 lines in this file are long. Not reporting long lines. + --> 8 lines in this file are VERY long! + + If there are a lot of long or short lines, Gutcheck won't list + them individually. The short lines version of this message + is commonly seen when gutchecking poetry and some plays, where + the normal line length is shorter than the standard for prose. + A "VERY long" line is one over 80 characters. You normally + shouldn't have any of these, but sometimes you may have to render + a table that must be that long, or some special preformatted + quotation that can't be broken. + + + --> There are 75 spaced dashes and em-dashes in this file. Not reporting them. + + The PG standard for an emdash--like these--is two minus signs + with no spaces before or after them. However, some older texts + used spaced dashes - like these -- and if there are very many + such spaced dashes in the file, gutcheck just draws your + attention to it and doesn't list them individually. + + + + Line 3020 - Non-ASCII character 233 + + Standard PG texts should use only ASCII characters with values + up to 127; however, non-English, accented characters can be + represented according to several different non-ASCII encoding + schemes, using values over 127. If you have a plain English text + with a few accented characters in words like cafe or tete-a-tete, + you should replace the accented characters with their unaccented + versions. The English pound sign is another commonly-seen + non-ASCII character. If you have enough non-ASCII characters in + your text that you feel removing them would degrade your text + unacceptably, you should probably consider doing an 8-bit text + as well as a plain-ASCII version. + + + + Line 1207 - Non-ISO-8859 character 156 + + Even in "8-bit" texts, there are distinctions between code sets. + The ISO-8859 family of 8-bit code sets is the most commonly used + in PG, and these sets do not define values in the range 128 through + 159 as printable characters. It's quite common for someone on a + Windows or Mac machine to use a non-ISO character inadvertently, + so this message warns that the character is not only not ASCII, + but also outside the ISO-8859 range. + + + + Line 46 - Tab character? + + Some editors and WPs will put in Tab characters (character 9) to + indicate indented text. You should not use these in a PG text, + because you can't be sure how they will appear on a reader's + screen. Find the Tab, and replace it with the appropriate number + of spaces. + + + Line 1327 - Tilde character? + + The tilde character (~) might be legitimately used, but it's the + character commonly used by OCR software to indicate a place where + it couldn't make out the letter, so gutcheck flags it. + + + + Line 1347 - Asterisk? + + Asterisks are reported only in paranoid mode (see -x). + Like tildes, they are often used to indicate errors, but they are + also legitimately used as line delimiters and footnote markers. + + + + Line 1451 - Long line 129 + + PG texts should have lines shorter than 76. There may be occasions + where you decide that you really have to go out to 79 characters, + but the sample above says that line 1451 is 129 characters long - + probably two lines run together. + + + + Line 1590 - Short line? + + PG texts should have lines longer than 54 characters. However, + there are special cases like poetry and tables of contents where + the lines _should_ be shorter. So treat Gutcheck warnings about + short lines carefully. Sometimes it's a genuine formatting + problem; sometimes the line really needs to be short. + + Hint: gutcheck will not flag lines as short if they are indented + - if they start with a space. I like to start inserted stanzas + and other such items indented with a couple of spaces so that + they stand out from the main text anyway. + + + + Line 1804 - Begins with punctuation? + + Lines should normally not begin with commas, periods and so on. + An exception is ellipses . . . which can happen at start of line. + + + + Line 1850 - Spaced em-dash? + + The PG standard for an em-dash--like these--is two minus signs + with no spaces before or after them. Gutcheck flags non-PG + em-dashes - like this one. Normally, you will replace it with a + PG-standard em-dash. + + + + Line 1904 - Query he/be error? + + Gutcheck makes a very minor effort to look for that scourge of all + proofreaders, "be" replacing "he" or vice-versa, and draws your + attention to it when it thinks it has found one. + + + + Line 2017 - Query digit in a1most + + The digit 1 is commonly OCRed for the letter l, the digit 0 for + the letter O, and so on. When gutcheck sees a mix of digits and + letters, it warns you. It may generate a false positive for + something like 7am. + + + + Line 2083 - Query standalone 0 + + In paranoid mode (see -x) only, gutcheck warns about the digit 0 + and the number 1 standing alone as a word. This can happen if the + OCR misreads the words O or I. + + + + Line 2115 - Query word whetber + + If you have switched typo-checking on, gutcheck looks for + potential typos, especially common h/b errors. It's not + infallible; it sometimes queries legit words, but it's + always worth taking a look. + + + + Line 2190 column 14 - Missing space? + + Omitting a space is a very common error,especially coming from + OCRed text,and can be hard for a human to spot. The commas in + the previous sentence illustrate the kind of thing I mean. + + + + Line 2240 column 48 - Spaced punctuation? + + The flip side of the "missing space" error , here , is when extra + spaces are added before punctuation . Some old texts appear to add + extra spaces around punctuation consistently, but this was a + typographical convention rather than the author's intent, and the + extra "spaces" should be removed when preparing a PG text. + + + + Line 2301 column 19 - Unspaced quotes? + + Another common spacing problem occurs in a phrase like "You wait + there,"he said. + + + + Line 2385 column 27 - Wrongspaced quotes? + + As of version 0.98, gutcheck adds extra checks on whether a quote + seems to be a start or end quote, and queries those that appear to + be misplaced. This does give rise to false positives when quotes are + nested, for example: + + "And how," she asked, "will your "friends" help you now?" + + but these false positives are worth it because of the many cases + that this test catches, notably those like: + + "And how, "she said," will your friends help you now?" + + Sometimes a "wrongspaced quotes" query will arise because an earlier + quote in the paragraph was omitted, so if the place specified seems + to be OK, look back to see whether there's a problem in the preceding + lines. + + + + Line 2400 - HTML Tag?
+
+    Some PG texts have been converted from HTML, and not all of the
+    HTML tags have been removed.
+
+
+
+    Line 2402 - HTML symbol? &emdash;
+
+    Similarly, special HTML symbol characters can survive into PG
+    texts. Can occasionally produce amusing false positives like
+    . . . Marwick & Co were well known for it;
+
+
+
+    Line 2540 - Mismatched quotes
+
+    Another gutcheck mainstay - unclosed doublequotes in a paragraph.
+    See the discussion of quotes in the switches section near the
+    start of this file.
+    
+    Since the mismatch doesn't occur on any one line, gutcheck quotes
+    the line number of the first blank line following the paragraph,
+    since this is the point where it reconciles the count of quotes.
+    However, if gutcheck is echoing lines, that is, you haven't used
+    the -e switch, it will show the _first_ line of the paragraph, 
+    to help you find the place without using line numbers. The 
+    offending paragraph is therefore between the quoted line and 
+    the line number given.
+
+
+
+    Line 2587 - Mismatched single quotes
+
+    Only checked with the -s switch, since checking single quotes is 
+    not a very reliable process. Otherwise, the same logic as for 
+    doublequotes applies.
+
+
+
+    Line 2877 - Mismatched round brackets?
+
+    Also curly and square brackets. Texts with a lot of brackets, like
+    plays with bracketed stage instructions, may have mismatches.
+
+
+    Line 3150 - No CR?
+    Line 3204 - Two successive CRs?
+    Line 3281 position 75 - CR without LF?
+
+    These are the invalid line-end warnings. See the discussion of
+    line-end checking in the switches section near the start of this
+    file. If you see these, and your editor doesn't show anything
+    wrong, you should probably try deleting the characters just before
+    and after the line end, and the line-end itself, then retyping the
+    characters and the line-end.
+
+
+    Line 2940 - Paragraph starts with lower-case
+
+    A common error in an e-text is for an extra blank line
+
+    to be put in, like the blank line above, and this often
+    shows up as a new paragraph beginning with lower case.
+    Sometimes the blank line is deliberate, as when a 
+    quotation is inserted in a speech. Use your judgement.
+
+
+    Line 2987 - Extra period?
+
+    An extra period. is a. common problem in OCRed text. and usually
+    arises when a speck of dust on the page is mistaken for a period.
+    or. as occasionally happens. when a comma loses its tail.
+
+
+    Line 3012 column 12 - Double punctuation?
+
+    Double punctuation., like that,, is a common typo and
+    scanno. Some books have much legit double punctuation,
+    like etc., etc., but it's worth checking anyway.
+
+
+
+            *       *       *        *
+
+For Windows-only users who are unfamiliar with DOS:
+
+    If you're a Windows-only user, you need to save
+    gutcheck.exe into the folder (directory) where the
+    text file you want to check is. Let's say your
+    text file is in C:\GUT, then you should save
+    GUTCHECK.EXE into C:\GUT.
+
+    Now get to a DOS prompt. You can do this by
+    selecting the "Command Prompt" or "MS-DOS Prompt"
+    option that will be somewhere on your
+    Start/Programs menu.
+
+    Now get into the C:\GUT directory. 
+    You can do this using the CD (change directory) 
+    command, like this:
+        CD \GUT
+    and your prompt will change to 
+        C:\GUT>
+    so you know you're in the right place.
+
+    Now type
+        gutcheck yourfile.txt
+    and you'll see gutcheck's report
+
+    By default, gutcheck prints its queries to screen.
+    If you want to create a file of them, to edit
+    against the text, you can use the greater-than
+    sign (>) to tell it to output the report to a
+    file. For example, if you want its report in a
+    file called QUERIES.LST, you could type
+    
+        gutcheck yourfile.txt > queries.lst
+
+    The queries.lst file will then contain the listing
+    of possible formatting errors, and you can
+    edit it alongside your text.
+
+    Whatever you do, DON'T make the filename after
+    the greater-than sign the name of a file already
+    on your disk that you want to keep, because
+    the greater-than sign will cause gutcheck to
+    replace any existing file of that name.
+
+    So, for example, if you have two Tolstoy files
+    that you want to check, called WARPEACE.TXT and 
+    ANNAK.TXT, make sure that neither of these names
+    is ever used following the greater-than sign.
+    To check these correctly, you might do:
+
+    gutcheck warpeace.txt >war.lst
+
+    and
+
+    gutcheck annak.txt > annak.lst
+
+    separately. Then you can look at war.lst and annak.lst
+    to see the gutcheck reports.
+
+            *       *       *        *
+
+
+For existing 0.98 users upgrading to 0.99:
+
+    If you run on old 16-bit DOS or Windows 3.x, I'm afraid
+    you're out of luck. I'm not saying it _can't_ be compiled
+    to run on 16-bit, but the executable with the package is
+    for Win32 only. *nix users won't notice the change at all.
+
+
+    There are two new switches: -u and -d. 
+          See above for full rundown.
+
+
+Here's a list of the new errors:
+
+    Line 1456 - Carat character?
+
+    I^ve found a few.
+
+
+    Line 1821 - Forward slash?
+
+    Common error for italicized "I", or so /'ve found.
+
+
+    Line 2139 - Query missing paragraph break?
+
+    "Come here, son." "Do I _have_ to go, dad?"
+    Like that. False positives in some texts. Sorry 'bout that,
+    but these are often errors.
+
+
+    Line 2200 - Query had/bad error?
+
+    Clear enough. Doesn't catch as many as I'd like it to,
+    but rarely gives false alarms.
+
+
+    Line 2268 - Query punctuation after the?
+
+    Some words, like "the", very rarely have punctuation
+    following them. Others, like "Mrs", usually have a
+    period, but never a comma. Occasional false positives.
+
+
+    Line 2380 - Query possible scanno arid
+
+    It found one of your user-defined typos when you
+    used the -u switch.
+
+
+    Line 2511 - Capital "S"?
+
+    Surprisingly common specific case, like: Jane'S 
+
+    
+    Line 3469 - endquote missing punctuation?
+
+    OK. This one can really cause a lot of false positives
+    in some books, but it switches itself off if it finds
+    more than 20 in a text, unless you force it to list them
+    all with the -v switch.
+    "Hey, dad" Johnny said, "can we go now?"
+    is a common punctuation-missing error.
+
+
+    Line 4266 - Mismatched underscores?
+
+    Like mismatched anything else!
+
+