diff -r 218904410231 -r f600b0d1fc5d doc/gutcheck.txt --- a/doc/gutcheck.txt Fri Jan 27 00:28:11 2012 +0000 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,742 +0,0 @@ - - - Gutcheck documentation - - -gutcheck: lists possible common formatting errors in a Project -Gutenberg candidate file. It is a command line program and can be used -under Win32 or Unix (gutcheck.c should compile anywhere; if it doesn't, -tell me). For Windows-only people, there is an appendix at the end -with brief instructions for running it. - - -Current version: 0.99. Users of 0.98 see end of file for changes. - -You should also have received the licence file COPYING, a README file, -gutcheck.c, the source code, and gutcheck.exe, a DOS executable, with -this file. - -This software is Copyright Jim Tinsley 2000-2005. - -Gutcheck comes wih ABSOLUTELY NO WARRANTY. For details, read the file COPYING. -This is Free Software; you may redistribute it under certain conditions (GPL). - -See http://gutcheck.sourceforge.net for the latest version. - - -Usage is: gutcheck [-setopxlywm] filename - where: - -s checks Single quotes - -e switches off Echoing of lines - -t checks Typos - -o produces an Overview only - -p sets strict quotes checking for Paragraphs - -x (paranoid) switches OFF typo checking and extra checks - -l turns off Line-end checks - -y sets error messages to stdout - -w is a special mode for web uploads (for future use) - -v (verbose) forces individual reporting of minor problems - -m interprets Markup of some common HTML tags and entities - -u warns about words in a user-defined typo file gutcheck.typ - -d ignores some DP-specific markup - -Running gutcheck without any parameters will display a brief help message. - -Sample usage: - - gutcheck warpeace.txt - - -More detail: - - Echoing lines (-e to switch off) - - You may find it convenient, when reviewing Gutcheck's - suggestions, to see the line that Gutcheck is questioning. - That way, you can often see at a glance whether it is - a real error that needs to be fixed, or a false positive - that should be in the text, but Gutcheck's limited - programming doesn't understand. - - By default, gutcheck echoes these lines, but if you don't - want to see the lines referred to, -e will switch it OFF. - - - Quotes (-s and -p switches) - - Gutcheck always looks for unbalanced doublequotes in a - paragraph. It is a common convention for writers not to - close quotes in a paragraph if the next paragraph opens - with quotes and is a continuation by the same speaker. - - Gutcheck therefore does not normally report unclosed quotes - if the next paragraph begins with a quote. If you need - to see all unclosed quotes, even where the next paragraph - begins with a quote, you should use the -p switch. - - Singlequotes (') are a problem, since the same character - is used for an apostrophe. I'm not sure that it is - possible to get 100% accuracy on singlequotes checking, - particularly since dialect, quite common in PG texts, - upsets the normal rules so badly. Consider the sentence: - 'Tis often said that a man's a man for a' that. - As humans, we recognize that both apostrophes are used - for contractions rather than quotes, but it isn't easy - to get a program to recognize that. - - Since Gutcheck makes too many mistakes when trying to match - singlequotes, it doesn't look for unbalanced singlequotes - unless you specify the -s switch. - - Consider these sentences, which illustrate the main cases: - - 'Tis often said that a fool and his money are soon parted. - - 'Becky's goin' home,' said Tom. - - The dogs' tails wagged in unison. - - Those 'pack dogs' of yours look more like wolves. - - - - Typos (-t switch) - - It's not Gutcheck's job to be a spelling checker, but it - does check for a list of common typos and OCR errors if you - use the -t switch. (The -x switch also turns typo checking on.) - - It also checks for character combinations, especially involving - h and b, which are often confused by OCR, that rarely or never - occur. For example, it queries "tbe" in a word. Now, "the" often - occurs, but "tbe" is very rare (heartbeat, hotbed), so I'm - playing the odds - a few false positives for many errors found. - Similarly with "ii", which is a very common OCR error. - - Gutcheck suppresses multiple reporting of the first 40 "typos" - found. This is to remove the annoyance of seeing something like - "FN" (footnote) or "LK" (initials) flagged as a typo 147 times - in a text. - - - Line-end checking (-l switch to disable) - - All PG texts should have a Carriage Return (CR - character 13) - and a Line Feed (LF - character 10) at end of each line, - regardless of what O/S you made them on. DOS/Windows, Unix - and Mac have different conventions, but the final text should - always use a CR/LF pair as its line terminator. - - By default, Gutcheck verifies that every line does have - the correct terminator, but if you're on a work-in-progress - in Linux, you might want to convert the line-ends as a final - step, and not want to see thousands of errors every time you - run Gutcheck before that final step, so you can turn off - this checking with the -l switch. - - - Paranoid mode (-x switch to disable: Trust No One :-) - - -x switches OFF typo-checking, the -t flag, automatically - and some extra checks like standalone 1 and 0 queries. - - - Overview mode (-o switch) - - This mode just gives a count of queries found - instead of a detailed list. - - - Header quote (-h switch) - - If you use the -h switch, gutcheck will also display - the Title, Author, Release and Edition fields from the - PG header. This is useful mostly for the automated - checks we do on recently-posted texts. - - - Errors to stdout (-y switch) - - If you're just running gutcheck normally, you can ignore - this. It's only there for programs that provide a front - end to gutcheck. It makes error messages appear within - the output of gutcheck so that the front end knows whether - gutcheck ran OK. - - - Verbose reporting (-v switch) - - Normally, if gutcheck sees lots of long lines, short lines, - spaced dashes, non-ASCII characters or dot-commas ".," it - assumes these are features of the text, counts and summarizes - them at the top of its report, but does not list them - individually. If the -v switch is on, gutcheck will list them all. - - - Markup interpretation (-m switch) - - Normally, gutcheck flags anything it suspects of being HTML - markup as a possible error. When you use the -m switch, - however, it matches anything that looks like markup against - a short list of common HTML tags and entities. If the markup - is in that list, it either ignores the markup, in the case - of a tag, or "interprets" the markup as its nearest ASCII - equivalent, in the case of an entity. So, for example, using - this switch, gutcheck will "see" - - “He went thataway!” - - as - - "He went thataway!" - - and report accordingly. - - This switch does not, not, NOT check the validity of HTML; - it exists so that you can run gutcheck on most HTML texts - for PG, and get sane results. It does not support all tags. - It does not support all entities. When it sees a tag or entity - it does not recognize, it will query it as HTML just as if - you hadn't specified the -m switch. - - Gutcheck 0.99 will automatically switch on markup interpretation - if it sees a lot of tags that appear to be markup, so mostly, you - won't have to specify this. - - User-defined typos (-u switch) - - If you have a file named gutcheck.typ either in your current - working directory or in the directory from which you explicitly - invoked gutcheck, but not necessarily on your path, and if you - specify the -u switch, gutcheck will query any word specified - in that file. The file is simple: one word, in lower case, per - line. 999 lines are allowed for. Be careful not to put multiple - words onto a line, or leave any rubbish other than the word on - the line. You should have received a sample file gutcheck.typ - with this package. - - Ignore DP markup (-d switch) - - Distributed Proofreaders (http://www.pgdp.net) is currently - (2005) the main source of PG texts, and proofers there use - special conventions. This switch understands those conventions, - so that people can use gutcheck on files in process that still - haven't had the special conventions removed yet. The special - conventions supported in 0.99 are page-separators and - "", "", "/*", "*/", "/#", "#/", "/$", "$/". - - -You will probably only run gutcheck on a text once or maybe twice, -just prior to uploading; it usually finds a few formatting problems; -it also usually finds queries that aren't problems at all - it often -questions Tables of Contents for having short lines, for example. -These are called "false positives", and need a human to decide on -them. - -The text should be standard prose, and already close to PG normal -format (plain text, about 70 characters per line with blank lines -between paragraphs). - -Gutcheck merely draws your attention to things that might be errors. -It is NOT a substitute for human judgement. Formatting choices like -short lines may be for a reason that this program can't understand. - -Even the most careful human proofing can leave errors behind in a -text, and there are several automated checks you can do to help find -them. Of these, spellchecking (with _very_ careful human judgement) is -the most important and most useful. - -Gutcheck does perform some basic typo-checking if you ask it to, -but its focus is on formatting errors specific to PG texts - -mismatched quotes, non-ASCII characters, bad spacing, bad line -length, HTML tags perhaps left from a conversion, unbalanced -brackets. - -Suggestions for additional checks would be appreciated and duly -considered, but no guarantees that they will be implemented. - - - - - How do _I_ use it? - -Practically everyone I give gutcheck to asks me how _I_ use it. -Well, when I get a text for posting, say filename.txt, I run - - gutcheck -o filename.txt - -That gives me a quick idea what I'm dealing with. It'll tell -me what kind of problems gutcheck sees, and give me an idea -of how much more work needs to be done on the text. Keep in -mind that gutcheck doesn't do anything like a full spellcheck, -but when I see a text that has a lot of problems, I assume that -it probably needs a spellcheck too. - -Having got a feel for the ballpark, I run - - gutcheck filename.txt > jj - -where jj is my personal, all-purpose filename for temporary data -that doesn't need to be kept. Then I open filename.txt and jj in -a split-screen view in my editor, and work down the text, fixing -whatever needs fixing, and skipping whatever doesn't. If your -editor doesn't split-screen, you can get much the same effect by -opening your original file in your normal editor, and jj (or your -equivalent name) in something like Notepad, keeping both in view -at the same time. - -Twice a day, an automatic process looks at all recently-posted -texts, and emails Michael, me, and sometimes other people with -their gutcheck summaries. - - - - Future development of gutcheck - -Gutcheck has gone about as far as it can, given its current -structure. In order to add better singlequotes checking, -sentence checking, better he/be checking and other good stuff -that I'd like to see, I'll have to rewrite it from a different -angle - looking at the syntax instead of the lines. And I'll -probably get around to that sooner or later. - -Meantime, I'm just trying to get this version stabilized, so -please report any bugs you find. When it is stable, I'll run -up a Windows port for those timid souls who can't look a -command line in the eye. :-) - -And I've started work on gutspell, a companion to gutcheck -which will concentrate on spelling problems. PG spelling -problems are unusual, since the range of texts we cover is -so wide, and I'll be taking a somewhat unorthodox approach -to writing this spelling-checker _specifically_ for texts -containing a lot of dialect and uncommon words that have -probably already been spell-checked against a standard -modern dictionary. - - - - -Explanations of common gutcheck messages: - - --> 74 lines in this file have white space at end - - PG texts shouldn't have extra white space added at end of line. - Don't worry too much about this; they're not doing any harm, - and they'll be removed during posting anyway. - - - --> 348 lines in this file are short. Not reporting short lines. - --> 84 lines in this file are long. Not reporting long lines. - --> 8 lines in this file are VERY long! - - If there are a lot of long or short lines, Gutcheck won't list - them individually. The short lines version of this message - is commonly seen when gutchecking poetry and some plays, where - the normal line length is shorter than the standard for prose. - A "VERY long" line is one over 80 characters. You normally - shouldn't have any of these, but sometimes you may have to render - a table that must be that long, or some special preformatted - quotation that can't be broken. - - - --> There are 75 spaced dashes and em-dashes in this file. Not reporting them. - - The PG standard for an emdash--like these--is two minus signs - with no spaces before or after them. However, some older texts - used spaced dashes - like these -- and if there are very many - such spaced dashes in the file, gutcheck just draws your - attention to it and doesn't list them individually. - - - - Line 3020 - Non-ASCII character 233 - - Standard PG texts should use only ASCII characters with values - up to 127; however, non-English, accented characters can be - represented according to several different non-ASCII encoding - schemes, using values over 127. If you have a plain English text - with a few accented characters in words like cafe or tete-a-tete, - you should replace the accented characters with their unaccented - versions. The English pound sign is another commonly-seen - non-ASCII character. If you have enough non-ASCII characters in - your text that you feel removing them would degrade your text - unacceptably, you should probably consider doing an 8-bit text - as well as a plain-ASCII version. - - - - Line 1207 - Non-ISO-8859 character 156 - - Even in "8-bit" texts, there are distinctions between code sets. - The ISO-8859 family of 8-bit code sets is the most commonly used - in PG, and these sets do not define values in the range 128 through - 159 as printable characters. It's quite common for someone on a - Windows or Mac machine to use a non-ISO character inadvertently, - so this message warns that the character is not only not ASCII, - but also outside the ISO-8859 range. - - - - Line 46 - Tab character? - - Some editors and WPs will put in Tab characters (character 9) to - indicate indented text. You should not use these in a PG text, - because you can't be sure how they will appear on a reader's - screen. Find the Tab, and replace it with the appropriate number - of spaces. - - - Line 1327 - Tilde character? - - The tilde character (~) might be legitimately used, but it's the - character commonly used by OCR software to indicate a place where - it couldn't make out the letter, so gutcheck flags it. - - - - Line 1347 - Asterisk? - - Asterisks are reported only in paranoid mode (see -x). - Like tildes, they are often used to indicate errors, but they are - also legitimately used as line delimiters and footnote markers. - - - - Line 1451 - Long line 129 - - PG texts should have lines shorter than 76. There may be occasions - where you decide that you really have to go out to 79 characters, - but the sample above says that line 1451 is 129 characters long - - probably two lines run together. - - - - Line 1590 - Short line? - - PG texts should have lines longer than 54 characters. However, - there are special cases like poetry and tables of contents where - the lines _should_ be shorter. So treat Gutcheck warnings about - short lines carefully. Sometimes it's a genuine formatting - problem; sometimes the line really needs to be short. - - Hint: gutcheck will not flag lines as short if they are indented - - if they start with a space. I like to start inserted stanzas - and other such items indented with a couple of spaces so that - they stand out from the main text anyway. - - - - Line 1804 - Begins with punctuation? - - Lines should normally not begin with commas, periods and so on. - An exception is ellipses . . . which can happen at start of line. - - - - Line 1850 - Spaced em-dash? - - The PG standard for an em-dash--like these--is two minus signs - with no spaces before or after them. Gutcheck flags non-PG - em-dashes - like this one. Normally, you will replace it with a - PG-standard em-dash. - - - - Line 1904 - Query he/be error? - - Gutcheck makes a very minor effort to look for that scourge of all - proofreaders, "be" replacing "he" or vice-versa, and draws your - attention to it when it thinks it has found one. - - - - Line 2017 - Query digit in a1most - - The digit 1 is commonly OCRed for the letter l, the digit 0 for - the letter O, and so on. When gutcheck sees a mix of digits and - letters, it warns you. It may generate a false positive for - something like 7am. - - - - Line 2083 - Query standalone 0 - - In paranoid mode (see -x) only, gutcheck warns about the digit 0 - and the number 1 standing alone as a word. This can happen if the - OCR misreads the words O or I. - - - - Line 2115 - Query word whetber - - If you have switched typo-checking on, gutcheck looks for - potential typos, especially common h/b errors. It's not - infallible; it sometimes queries legit words, but it's - always worth taking a look. - - - - Line 2190 column 14 - Missing space? - - Omitting a space is a very common error,especially coming from - OCRed text,and can be hard for a human to spot. The commas in - the previous sentence illustrate the kind of thing I mean. - - - - Line 2240 column 48 - Spaced punctuation? - - The flip side of the "missing space" error , here , is when extra - spaces are added before punctuation . Some old texts appear to add - extra spaces around punctuation consistently, but this was a - typographical convention rather than the author's intent, and the - extra "spaces" should be removed when preparing a PG text. - - - - Line 2301 column 19 - Unspaced quotes? - - Another common spacing problem occurs in a phrase like "You wait - there,"he said. - - - - Line 2385 column 27 - Wrongspaced quotes? - - As of version 0.98, gutcheck adds extra checks on whether a quote - seems to be a start or end quote, and queries those that appear to - be misplaced. This does give rise to false positives when quotes are - nested, for example: - - "And how," she asked, "will your "friends" help you now?" - - but these false positives are worth it because of the many cases - that this test catches, notably those like: - - "And how, "she said," will your friends help you now?" - - Sometimes a "wrongspaced quotes" query will arise because an earlier - quote in the paragraph was omitted, so if the place specified seems - to be OK, look back to see whether there's a problem in the preceding - lines. - - - - Line 2400 - HTML Tag?
-
-    Some PG texts have been converted from HTML, and not all of the
-    HTML tags have been removed.
-
-
-
-    Line 2402 - HTML symbol? &emdash;
-
-    Similarly, special HTML symbol characters can survive into PG
-    texts. Can occasionally produce amusing false positives like
-    . . . Marwick & Co were well known for it;
-
-
-
-    Line 2540 - Mismatched quotes
-
-    Another gutcheck mainstay - unclosed doublequotes in a paragraph.
-    See the discussion of quotes in the switches section near the
-    start of this file.
-    
-    Since the mismatch doesn't occur on any one line, gutcheck quotes
-    the line number of the first blank line following the paragraph,
-    since this is the point where it reconciles the count of quotes.
-    However, if gutcheck is echoing lines, that is, you haven't used
-    the -e switch, it will show the _first_ line of the paragraph, 
-    to help you find the place without using line numbers. The 
-    offending paragraph is therefore between the quoted line and 
-    the line number given.
-
-
-
-    Line 2587 - Mismatched single quotes
-
-    Only checked with the -s switch, since checking single quotes is 
-    not a very reliable process. Otherwise, the same logic as for 
-    doublequotes applies.
-
-
-
-    Line 2877 - Mismatched round brackets?
-
-    Also curly and square brackets. Texts with a lot of brackets, like
-    plays with bracketed stage instructions, may have mismatches.
-
-
-    Line 3150 - No CR?
-    Line 3204 - Two successive CRs?
-    Line 3281 position 75 - CR without LF?
-
-    These are the invalid line-end warnings. See the discussion of
-    line-end checking in the switches section near the start of this
-    file. If you see these, and your editor doesn't show anything
-    wrong, you should probably try deleting the characters just before
-    and after the line end, and the line-end itself, then retyping the
-    characters and the line-end.
-
-
-    Line 2940 - Paragraph starts with lower-case
-
-    A common error in an e-text is for an extra blank line
-
-    to be put in, like the blank line above, and this often
-    shows up as a new paragraph beginning with lower case.
-    Sometimes the blank line is deliberate, as when a 
-    quotation is inserted in a speech. Use your judgement.
-
-
-    Line 2987 - Extra period?
-
-    An extra period. is a. common problem in OCRed text. and usually
-    arises when a speck of dust on the page is mistaken for a period.
-    or. as occasionally happens. when a comma loses its tail.
-
-
-    Line 3012 column 12 - Double punctuation?
-
-    Double punctuation., like that,, is a common typo and
-    scanno. Some books have much legit double punctuation,
-    like etc., etc., but it's worth checking anyway.
-
-
-
-            *       *       *        *
-
-For Windows-only users who are unfamiliar with DOS:
-
-    If you're a Windows-only user, you need to save
-    gutcheck.exe into the folder (directory) where the
-    text file you want to check is. Let's say your
-    text file is in C:\GUT, then you should save
-    GUTCHECK.EXE into C:\GUT.
-
-    Now get to a DOS prompt. You can do this by
-    selecting the "Command Prompt" or "MS-DOS Prompt"
-    option that will be somewhere on your
-    Start/Programs menu.
-
-    Now get into the C:\GUT directory. 
-    You can do this using the CD (change directory) 
-    command, like this:
-        CD \GUT
-    and your prompt will change to 
-        C:\GUT>
-    so you know you're in the right place.
-
-    Now type
-        gutcheck yourfile.txt
-    and you'll see gutcheck's report
-
-    By default, gutcheck prints its queries to screen.
-    If you want to create a file of them, to edit
-    against the text, you can use the greater-than
-    sign (>) to tell it to output the report to a
-    file. For example, if you want its report in a
-    file called QUERIES.LST, you could type
-    
-        gutcheck yourfile.txt > queries.lst
-
-    The queries.lst file will then contain the listing
-    of possible formatting errors, and you can
-    edit it alongside your text.
-
-    Whatever you do, DON'T make the filename after
-    the greater-than sign the name of a file already
-    on your disk that you want to keep, because
-    the greater-than sign will cause gutcheck to
-    replace any existing file of that name.
-
-    So, for example, if you have two Tolstoy files
-    that you want to check, called WARPEACE.TXT and 
-    ANNAK.TXT, make sure that neither of these names
-    is ever used following the greater-than sign.
-    To check these correctly, you might do:
-
-    gutcheck warpeace.txt >war.lst
-
-    and
-
-    gutcheck annak.txt > annak.lst
-
-    separately. Then you can look at war.lst and annak.lst
-    to see the gutcheck reports.
-
-            *       *       *        *
-
-
-For existing 0.98 users upgrading to 0.99:
-
-    If you run on old 16-bit DOS or Windows 3.x, I'm afraid
-    you're out of luck. I'm not saying it _can't_ be compiled
-    to run on 16-bit, but the executable with the package is
-    for Win32 only. *nix users won't notice the change at all.
-
-
-    There are two new switches: -u and -d. 
-          See above for full rundown.
-
-
-Here's a list of the new errors:
-
-    Line 1456 - Carat character?
-
-    I^ve found a few.
-
-
-    Line 1821 - Forward slash?
-
-    Common error for italicized "I", or so /'ve found.
-
-
-    Line 2139 - Query missing paragraph break?
-
-    "Come here, son." "Do I _have_ to go, dad?"
-    Like that. False positives in some texts. Sorry 'bout that,
-    but these are often errors.
-
-
-    Line 2200 - Query had/bad error?
-
-    Clear enough. Doesn't catch as many as I'd like it to,
-    but rarely gives false alarms.
-
-
-    Line 2268 - Query punctuation after the?
-
-    Some words, like "the", very rarely have punctuation
-    following them. Others, like "Mrs", usually have a
-    period, but never a comma. Occasional false positives.
-
-
-    Line 2380 - Query possible scanno arid
-
-    It found one of your user-defined typos when you
-    used the -u switch.
-
-
-    Line 2511 - Capital "S"?
-
-    Surprisingly common specific case, like: Jane'S 
-
-    
-    Line 3469 - endquote missing punctuation?
-
-    OK. This one can really cause a lot of false positives
-    in some books, but it switches itself off if it finds
-    more than 20 in a text, unless you force it to list them
-    all with the -v switch.
-    "Hey, dad" Johnny said, "can we go now?"
-    is a common punctuation-missing error.
-
-
-    Line 4266 - Mismatched underscores?
-
-    Like mismatched anything else!
-
-