diff -r 218904410231 -r f600b0d1fc5d doc/gutcheck.txt
--- a/doc/gutcheck.txt Fri Jan 27 00:28:11 2012 +0000
+++ /dev/null Thu Jan 01 00:00:00 1970 +0000
@@ -1,742 +0,0 @@
-
-
- Gutcheck documentation
-
-
-gutcheck: lists possible common formatting errors in a Project
-Gutenberg candidate file. It is a command line program and can be used
-under Win32 or Unix (gutcheck.c should compile anywhere; if it doesn't,
-tell me). For Windows-only people, there is an appendix at the end
-with brief instructions for running it.
-
-
-Current version: 0.99. Users of 0.98 see end of file for changes.
-
-You should also have received the licence file COPYING, a README file,
-gutcheck.c, the source code, and gutcheck.exe, a DOS executable, with
-this file.
-
-This software is Copyright Jim Tinsley 2000-2005.
-
-Gutcheck comes with ABSOLUTELY NO WARRANTY. For details, read the file COPYING.
-This is Free Software; you may redistribute it under certain conditions (GPL).
-
-See http://gutcheck.sourceforge.net for the latest version.
-
-
-Usage is: gutcheck [-setopxlywm] filename
- where:
- -s checks Single quotes
- -e switches off Echoing of lines
- -t checks Typos
- -o produces an Overview only
- -p sets strict quotes checking for Paragraphs
- -x (paranoid) switches OFF typo checking and extra checks
- -l turns off Line-end checks
- -y sets error messages to stdout
- -w is a special mode for web uploads (for future use)
- -v (verbose) forces individual reporting of minor problems
- -m interprets Markup of some common HTML tags and entities
- -u warns about words in a user-defined typo file gutcheck.typ
- -d ignores some DP-specific markup
-
-Running gutcheck without any parameters will display a brief help message.
-
-Sample usage:
-
- gutcheck warpeace.txt
-
-
-More detail:
-
- Echoing lines (-e to switch off)
-
- You may find it convenient, when reviewing Gutcheck's
- suggestions, to see the line that Gutcheck is questioning.
- That way, you can often see at a glance whether it is
- a real error that needs to be fixed, or a false positive:
- text that is correct as it stands, but that Gutcheck's
- limited programming doesn't understand.
-
- By default, gutcheck echoes these lines, but if you don't
- want to see the lines referred to, -e will switch it OFF.
-
-
- Quotes (-s and -p switches)
-
- Gutcheck always looks for unbalanced doublequotes in a
- paragraph. It is a common convention for writers not to
- close quotes in a paragraph if the next paragraph opens
- with quotes and is a continuation by the same speaker.
-
- Gutcheck therefore does not normally report unclosed quotes
- if the next paragraph begins with a quote. If you need
- to see all unclosed quotes, even where the next paragraph
- begins with a quote, you should use the -p switch.
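The doublequote rule described above can be sketched in C. This is only an illustration under stated assumptions, not code taken from gutcheck.c: the function names `has_unbalanced_quotes` and `query_quotes` are hypothetical, and the sketch counts plain ASCII doublequotes only.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not from gutcheck.c: returns 1 if the paragraph
   contains an odd number of doublequote characters. */
static int has_unbalanced_quotes(const char *para)
{
    int count = 0;

    assert(para != NULL);
    for (; *para; para++)
        if (*para == '"')
            count++;
    return count % 2;
}

/* Returns 1 if the paragraph should be queried. `next` is the following
   paragraph (NULL at end of text); `strict` models the -p switch. */
static int query_quotes(const char *para, const char *next, int strict)
{
    if (!has_unbalanced_quotes(para))
        return 0;
    /* Continuation convention: an unclosed quote is tolerated when the
       next paragraph opens with a quote, unless strict checking is on. */
    if (!strict && next != NULL && next[0] == '"')
        return 0;
    return 1;
}
```

Calling `query_quotes(para, next, 0)` models the default behaviour, and passing 1 as the last argument models -p, where every unclosed quote is reported.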
-
- Singlequotes (') are a problem, since the same character
- is used for an apostrophe. I'm not sure that it is
- possible to get 100% accuracy on singlequotes checking,
- particularly since dialect, quite common in PG texts,
- upsets the normal rules so badly. Consider the sentence:
- 'Tis often said that a man's a man for a' that.
- As humans, we recognize that all three apostrophes are used
- for contractions rather than quotes, but it isn't easy
- to get a program to recognize that.
-
- Since Gutcheck makes too many mistakes when trying to match
- singlequotes, it doesn't look for unbalanced singlequotes
- unless you specify the -s switch.
-
- Consider these sentences, which illustrate the main cases:
-
- 'Tis often said that a fool and his money are soon parted.
-
- 'Becky's goin' home,' said Tom.
-
- The dogs' tails wagged in unison.
-
- Those 'pack dogs' of yours look more like wolves.
-
-
-
- Typos (-t switch)
-
- It's not Gutcheck's job to be a spelling checker, but it
- does check for a list of common typos and OCR errors if you
- use the -t switch. (The -x switch, by contrast, switches typo checking OFF.)
-
- It also checks for character combinations that rarely or never
- occur, especially those involving h and b, which OCR often
- confuses. For example, it queries "tbe" in a word. Now, "the" often
- occurs, but "tbe" is very rare (heartbeat, hotbed), so I'm
- playing the odds - a few false positives for many errors found.
- Similarly with "ii", which is a very common OCR error.
-
- Gutcheck suppresses multiple reporting of the first 40 "typos"
- found. This is to remove the annoyance of seeing something like
- "FN" (footnote) or "LK" (initials) flagged as a typo 147 times
- in a text.
-
-
- Line-end checking (-l switch to disable)
-
- All PG texts should have a Carriage Return (CR - character 13)
- and a Line Feed (LF - character 10) at end of each line,
- regardless of what O/S you made them on. DOS/Windows, Unix
- and Mac have different conventions, but the final text should
- always use a CR/LF pair as its line terminator.
-
- By default, Gutcheck verifies that every line does have
- the correct terminator. If you're on a work-in-progress
- in Linux, though, you might want to convert the line-ends
- as a final step, and not want to see thousands of errors
- every time you run Gutcheck before that final step. In that
- case, you can turn off this checking with the -l switch.
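A minimal sketch of a per-line terminator check follows, assuming each line is handed over together with its terminator bytes. The enum and the function name `check_line_end` are hypothetical and gutcheck.c may be organised quite differently; the three failure cases are named after the "No CR?", "CR without LF?" and "Two successive CRs?" queries listed later in this file.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical result codes for this sketch; gutcheck.c may differ. */
enum line_end { LE_OK, LE_NO_CR, LE_CR_WITHOUT_LF, LE_TWO_CRS };

/* Classify the terminator of one line. `buf` holds the line including
   its terminator; `len` is the number of bytes in it. */
static enum line_end check_line_end(const char *buf, size_t len)
{
    size_t i;

    assert(buf != NULL);
    for (i = 0; i < len; i++) {
        if (buf[i] != '\r')
            continue;
        /* A CR (character 13) must be followed by exactly one LF. */
        if (i + 1 < len && buf[i + 1] == '\r')
            return LE_TWO_CRS;
        if (i + 1 >= len || buf[i + 1] != '\n')
            return LE_CR_WITHOUT_LF;
    }
    /* No stray CR found: the line is fine only if it ends in CR/LF. */
    if (len >= 2 && buf[len - 2] == '\r' && buf[len - 1] == '\n')
        return LE_OK;
    return LE_NO_CR;
}
```

A Unix-style line such as "abc\n" comes back as LE_NO_CR, which is the case that produces thousands of queries on a work-in-progress unless -l is used.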
-
-
- Paranoid mode (-x switch to disable: Trust No One :-)
-
- -x automatically switches OFF typo-checking (the -t flag)
- and some extra checks, such as standalone 1 and 0 queries.
-
-
- Overview mode (-o switch)
-
- This mode just gives a count of queries found
- instead of a detailed list.
-
-
- Header quote (-h switch)
-
- If you use the -h switch, gutcheck will also display
- the Title, Author, Release and Edition fields from the
- PG header. This is useful mostly for the automated
- checks we do on recently-posted texts.
-
-
- Errors to stdout (-y switch)
-
- If you're just running gutcheck normally, you can ignore
- this. It's only there for programs that provide a front
- end to gutcheck. It makes error messages appear within
- the output of gutcheck so that the front end knows whether
- gutcheck ran OK.
-
-
- Verbose reporting (-v switch)
-
- Normally, if gutcheck sees lots of long lines, short lines,
- spaced dashes, non-ASCII characters or dot-commas ".," it
- assumes these are features of the text, counts and summarizes
- them at the top of its report, but does not list them
- individually. If the -v switch is on, gutcheck will list them all.
-
-
- Markup interpretation (-m switch)
-
- Normally, gutcheck flags anything it suspects of being HTML
- markup as a possible error. When you use the -m switch,
- however, it matches anything that looks like markup against
- a short list of common HTML tags and entities. If the markup
- is in that list, it either ignores the markup, in the case
- of a tag, or "interprets" the markup as its nearest ASCII
- equivalent, in the case of an entity. So, for example, using
- this switch, gutcheck will "see"
-
- &ldquo;He went thataway!&rdquo;
-
- as
-
- "He went thataway!"
-
- and report accordingly.
-
- This switch does not, not, NOT check the validity of HTML;
- it exists so that you can run gutcheck on most HTML texts
- for PG, and get sane results. It does not support all tags.
- It does not support all entities. When it sees a tag or entity
- it does not recognize, it will query it as HTML just as if
- you hadn't specified the -m switch.
-
- Gutcheck 0.99 will automatically switch on markup interpretation
- if it sees a lot of tags that appear to be markup, so mostly, you
- won't have to specify this.
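The entity half of markup interpretation might look roughly like the sketch below. The four-entry table and the function name `interpret_entities` are assumptions made for illustration; this document does not list the entities gutcheck actually recognizes. Anything not in the table is copied through unchanged, so it would still be queried as HTML.

```c
#include <assert.h>
#include <string.h>

/* Illustrative table only; the real list in gutcheck.c is not shown
   in this document. */
static const struct { const char *name; const char *ascii; } entities[] = {
    { "&ldquo;", "\"" }, { "&rdquo;", "\"" },
    { "&amp;", "&" },    { "&mdash;", "--" }
};

/* Copy src to dst, replacing recognized entities by their nearest ASCII
   equivalents. dst must hold at least strlen(src) + 1 bytes, which is
   enough because no replacement is longer than its entity. */
static void interpret_entities(const char *src, char *dst)
{
    assert(src != NULL && dst != NULL);
    while (*src) {
        size_t i, n = sizeof entities / sizeof entities[0];

        for (i = 0; i < n; i++) {
            size_t len = strlen(entities[i].name);
            if (strncmp(src, entities[i].name, len) == 0) {
                strcpy(dst, entities[i].ascii);
                dst += strlen(entities[i].ascii);
                src += len;
                break;
            }
        }
        if (i == n)            /* nothing matched: copy one character */
            *dst++ = *src++;
    }
    *dst = '\0';
}
```

The later quote and punctuation checks would then run on the interpreted text rather than on the raw markup.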
-
- User-defined typos (-u switch)
-
- If you have a file named gutcheck.typ either in your current
- working directory or in the directory from which you explicitly
- invoked gutcheck, but not necessarily on your path, and if you
- specify the -u switch, gutcheck will query any word specified
- in that file. The file is simple: one word, in lower case, per
- line. 999 lines are allowed for. Be careful not to put multiple
- words onto a line, or leave any rubbish other than the word on
- the line. You should have received a sample file gutcheck.typ
- with this package.
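Reading a file in the format described above could be sketched as follows. The 999-entry limit comes from the text; the word-length limit, the storage layout and the function names `load_user_typos` and `is_user_typo` are assumptions for this sketch and are not taken from gutcheck.c.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define MAX_USER_TYPOS 999   /* limit stated in the documentation */
#define MAX_WORD_LEN   64    /* assumption for this sketch */

static char user_typos[MAX_USER_TYPOS][MAX_WORD_LEN];
static int user_typo_count = 0;

/* Read one lower-case word per line from fp; returns the number stored.
   Lines longer than MAX_WORD_LEN - 1 bytes are not handled specially. */
static int load_user_typos(FILE *fp)
{
    char line[MAX_WORD_LEN];

    assert(fp != NULL);
    user_typo_count = 0;
    while (user_typo_count < MAX_USER_TYPOS &&
           fgets(line, sizeof line, fp) != NULL) {
        line[strcspn(line, "\r\n")] = '\0';   /* strip the terminator */
        if (line[0] != '\0')                  /* skip empty lines */
            strcpy(user_typos[user_typo_count++], line);
    }
    return user_typo_count;
}

/* Returns 1 if word (already lower-cased) is in the user list. */
static int is_user_typo(const char *word)
{
    int i;

    for (i = 0; i < user_typo_count; i++)
        if (strcmp(user_typos[i], word) == 0)
            return 1;
    return 0;
}
```

Because each line is stored whole, a line holding two words would never match a single word in the text, which is one reason to keep the file to one clean word per line.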
-
- Ignore DP markup (-d switch)
-
- Distributed Proofreaders (http://www.pgdp.net) is currently
- (2005) the main source of PG texts, and proofers there use
- special conventions. This switch understands those conventions,
- so that people can use gutcheck on files in process that still
- haven't had the special conventions removed yet. The special
- conventions supported in 0.99 are page-separators and
- "
-
- Some PG texts have been converted from HTML, and not all of the
- HTML tags have been removed.
-
-
-
- Line 2402 - HTML symbol? &emdash;
-
- Similarly, special HTML symbol characters can survive into PG
- texts. Can occasionally produce amusing false positives like
- . . . Marwick & Co were well known for it;
-
-
-
- Line 2540 - Mismatched quotes
-
- Another gutcheck mainstay - unclosed doublequotes in a paragraph.
- See the discussion of quotes in the switches section near the
- start of this file.
-
- Since the mismatch doesn't occur on any one line, gutcheck quotes
- the line number of the first blank line following the paragraph,
- since this is the point where it reconciles the count of quotes.
- However, if gutcheck is echoing lines, that is, you haven't used
- the -e switch, it will show the _first_ line of the paragraph,
- to help you find the place without using line numbers. The
- offending paragraph is therefore between the quoted line and
- the line number given.
-
-
-
- Line 2587 - Mismatched single quotes
-
- Only checked with the -s switch, since checking single quotes is
- not a very reliable process. Otherwise, the same logic as for
- doublequotes applies.
-
-
-
- Line 2877 - Mismatched round brackets?
-
- Also curly and square brackets. Texts with a lot of brackets, like
- plays with bracketed stage instructions, may have mismatches.
-
-
- Line 3150 - No CR?
- Line 3204 - Two successive CRs?
- Line 3281 position 75 - CR without LF?
-
- These are the invalid line-end warnings. See the discussion of
- line-end checking in the switches section near the start of this
- file. If you see these, and your editor doesn't show anything
- wrong, you should probably try deleting the characters just before
- and after the line end, and the line-end itself, then retyping the
- characters and the line-end.
-
-
- Line 2940 - Paragraph starts with lower-case
-
- A common error in an e-text is for an extra blank line
-
- to be put in, like the blank line above, and this often
- shows up as a new paragraph beginning with lower case.
- Sometimes the blank line is deliberate, as when a
- quotation is inserted in a speech. Use your judgement.
-
-
- Line 2987 - Extra period?
-
- An extra period. is a. common problem in OCRed text. and usually
- arises when a speck of dust on the page is mistaken for a period.
- or. as occasionally happens. when a comma loses its tail.
-
-
- Line 3012 column 12 - Double punctuation?
-
- Double punctuation., like that,, is a common typo and
- scanno. Some books have much legit double punctuation,
- like etc., etc., but it's worth checking anyway.
-
-
-
- * * * *
-
-For Windows-only users who are unfamiliar with DOS:
-
- If you're a Windows-only user, you need to save
- gutcheck.exe into the folder (directory) where the
- text file you want to check is. Let's say your
- text file is in C:\GUT, then you should save
- GUTCHECK.EXE into C:\GUT.
-
- Now get to a DOS prompt. You can do this by
- selecting the "Command Prompt" or "MS-DOS Prompt"
- option that will be somewhere on your
- Start/Programs menu.
-
- Now get into the C:\GUT directory.
- You can do this using the CD (change directory)
- command, like this:
- CD \GUT
- and your prompt will change to
- C:\GUT>
- so you know you're in the right place.
-
- Now type
- gutcheck yourfile.txt
- and you'll see gutcheck's report
-
- By default, gutcheck prints its queries to screen.
- If you want to create a file of them, to edit
- against the text, you can use the greater-than
- sign (>) to tell it to output the report to a
- file. For example, if you want its report in a
- file called QUERIES.LST, you could type
-
- gutcheck yourfile.txt > queries.lst
-
- The queries.lst file will then contain the listing
- of possible formatting errors, and you can
- edit it alongside your text.
-
- Whatever you do, DON'T make the filename after
- the greater-than sign the name of a file already
- on your disk that you want to keep, because
- the greater-than sign will cause gutcheck to
- replace any existing file of that name.
-
- So, for example, if you have two Tolstoy files
- that you want to check, called WARPEACE.TXT and
- ANNAK.TXT, make sure that neither of these names
- is ever used following the greater-than sign.
- To check these correctly, you might do:
-
- gutcheck warpeace.txt >war.lst
-
- and
-
- gutcheck annak.txt > annak.lst
-
- separately. Then you can look at war.lst and annak.lst
- to see the gutcheck reports.
-
- * * * *
-
-
-For existing 0.98 users upgrading to 0.99:
-
- If you run on old 16-bit DOS or Windows 3.x, I'm afraid
- you're out of luck. I'm not saying it _can't_ be compiled
- to run on 16-bit, but the executable with the package is
- for Win32 only. *nix users won't notice the change at all.
-
-
- There are two new switches: -u and -d.
- See above for full rundown.
-
-
-Here's a list of the new errors:
-
- Line 1456 - Carat character?
-
- I^ve found a few.
-
-
- Line 1821 - Forward slash?
-
- Common error for italicized "I", or so /'ve found.
-
-
- Line 2139 - Query missing paragraph break?
-
- "Come here, son." "Do I _have_ to go, dad?"
- Like that. False positives in some texts. Sorry 'bout that,
- but these are often errors.
-
-
- Line 2200 - Query had/bad error?
-
- Clear enough. Doesn't catch as many as I'd like it to,
- but rarely gives false alarms.
-
-
- Line 2268 - Query punctuation after the?
-
- Some words, like "the", very rarely have punctuation
- following them. Others, like "Mrs", usually have a
- period, but never a comma. Occasional false positives.
-
-
- Line 2380 - Query possible scanno arid
-
- It found one of your user-defined typos when you
- used the -u switch.
-
-
- Line 2511 - Capital "S"?
-
- Surprisingly common specific case, like: Jane'S
-
-
- Line 3469 - endquote missing punctuation?
-
- OK. This one can really cause a lot of false positives
- in some books, but it switches itself off if it finds
- more than 20 in a text, unless you force it to list them
- all with the -v switch.
- "Hey, dad" Johnny said, "can we go now?"
- is a common punctuation-missing error.
-
-
- Line 4266 - Mismatched underscores?
-
- Like mismatched anything else!
-
-