diff -r c2f4c0285180 -r 08b03c341e61 doc/bookloupe.txt
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/doc/bookloupe.txt Sun Feb 19 09:56:28 2012 +0000
@@ -0,0 +1,742 @@
+
+
+ Gutcheck documentation
+
+
+gutcheck: lists possible common formatting errors in a Project
+Gutenberg candidate file. It is a command line program and can be used
+under Win32 or Unix (gutcheck.c should compile anywhere; if it doesn't,
+tell me). For Windows-only people, there is an appendix at the end
+with brief instructions for running it.
+
+
+Current version: 0.99. Users of 0.98 see end of file for changes.
+
+You should also have received the licence file COPYING, a README file,
+gutcheck.c, the source code, and gutcheck.exe, a DOS executable, with
+this file.
+
+This software is Copyright Jim Tinsley 2000-2005.
+
+Gutcheck comes with ABSOLUTELY NO WARRANTY. For details, read the file COPYING.
+This is Free Software; you may redistribute it under certain conditions (GPL).
+
+See http://gutcheck.sourceforge.net for the latest version.
+
+
+Usage is: gutcheck [-setopxlywm] filename
+ where:
+ -s checks Single quotes
+ -e switches off Echoing of lines
+ -t checks Typos
+ -o produces an Overview only
+ -p sets strict quotes checking for Paragraphs
+ -x (paranoid) switches OFF typo checking and extra checks
+ -l turns off Line-end checks
+ -y sets error messages to stdout
+ -w is a special mode for web uploads (for future use)
+ -v (verbose) forces individual reporting of minor problems
+ -m interprets Markup of some common HTML tags and entities
+ -u warns about words in a user-defined typo file gutcheck.typ
+ -d ignores some DP-specific markup
+
+Running gutcheck without any parameters will display a brief help message.
+
+Sample usage:
+
+ gutcheck warpeace.txt
+
+
+More detail:
+
+ Echoing lines (-e to switch off)
+
+ You may find it convenient, when reviewing Gutcheck's
+ suggestions, to see the line that Gutcheck is questioning.
+ That way, you can often see at a glance whether it is
+ a real error that needs to be fixed, or a false positive:
+ text that belongs as it stands, but that Gutcheck's limited
+ programming doesn't understand.
+
+ By default, gutcheck echoes these lines, but if you don't
+ want to see the lines referred to, -e will switch it OFF.
+
+
+ Quotes (-s and -p switches)
+
+ Gutcheck always looks for unbalanced doublequotes in a
+ paragraph. It is a common convention for writers not to
+ close quotes in a paragraph if the next paragraph opens
+ with quotes and is a continuation by the same speaker.
+
+ Gutcheck therefore does not normally report unclosed quotes
+ if the next paragraph begins with a quote. If you need
+ to see all unclosed quotes, even where the next paragraph
+ begins with a quote, you should use the -p switch.
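+
+ The rule described above can be sketched roughly as follows. This
+ is only an illustration in Python, not gutcheck's own C code; the
+ function name and the "strict" parameter (standing in for the -p
+ switch) are assumptions made for the example.

```python
import re

def unclosed_quote_paragraphs(text, strict=False):
    """Return 1-based numbers of paragraphs with an odd count of double quotes.

    Unless strict is true, a paragraph left open is not reported when the
    next paragraph begins with a double quote (same speaker continuing).
    """
    # Paragraphs are runs of non-blank lines separated by blank lines.
    normalized = text.replace("\r\n", "\n")
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", normalized)]
    paragraphs = [p for p in paragraphs if p]

    flagged = []
    for i, para in enumerate(paragraphs):
        if para.count('"') % 2 == 0:
            continue  # quotes balance within this paragraph
        next_opens_with_quote = (
            i + 1 < len(paragraphs) and paragraphs[i + 1].startswith('"')
        )
        if strict or not next_opens_with_quote:
            flagged.append(i + 1)
    return flagged
```

+ With strict left False, a paragraph whose quotes stay open is
+ passed over when the following paragraph opens with a quote;
+ with strict=True it is listed like any other mismatch.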
+
+ Singlequotes (') are a problem, since the same character
+ is used for an apostrophe. I'm not sure that it is
+ possible to get 100% accuracy on singlequotes checking,
+ particularly since dialect, quite common in PG texts,
+ upsets the normal rules so badly. Consider the sentence:
+ 'Tis often said that a man's a man for a' that.
+ As humans, we recognize that all three apostrophes are used
+ for contractions rather than quotes, but it isn't easy
+ to get a program to recognize that.
+
+ Since Gutcheck makes too many mistakes when trying to match
+ singlequotes, it doesn't look for unbalanced singlequotes
+ unless you specify the -s switch.
+
+ Consider these sentences, which illustrate the main cases:
+
+ 'Tis often said that a fool and his money are soon parted.
+
+ 'Becky's goin' home,' said Tom.
+
+ The dogs' tails wagged in unison.
+
+ Those 'pack dogs' of yours look more like wolves.
+
+
+
+ Typos (-t switch)
+
+ It's not Gutcheck's job to be a spelling checker, but it
+ does check for a list of common typos and OCR errors if you
+ use the -t switch. (The -x switch, by contrast, switches typo checking off.)
+
+ It also checks for character combinations that rarely or never
+ occur, especially those involving h and b, which OCR often
+ confuses. For example, it queries "tbe" in a word. Now, "the" often
+ occurs, but "tbe" is very rare (heartbeat, hotbed), so I'm
+ playing the odds - a few false positives for many errors found.
+ Similarly with "ii", which is a very common OCR error.
+
+ Gutcheck reports each of the first 40 distinct "typos" it finds
+ only once. This is to remove the annoyance of seeing something like
+ "FN" (footnote) or "LK" (initials) flagged as a typo 147 times
+ in a text.
+
+
+ Line-end checking (-l switch to disable)
+
+ All PG texts should have a Carriage Return (CR - character 13)
+ and a Line Feed (LF - character 10) at end of each line,
+ regardless of what O/S you made them on. DOS/Windows, Unix
+ and Mac have different conventions, but the final text should
+ always use a CR/LF pair as its line terminator.
+
+ By default, Gutcheck verifies that every line does have
+ the correct terminator, but if you're on a work-in-progress
+ in Linux, you might want to convert the line-ends as a final
+ step, and not want to see thousands of errors every time you
+ run Gutcheck before that final step, so you can turn off
+ this checking with the -l switch.
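+
+ As a rough illustration of what a CR/LF check involves, here is a
+ small Python sketch. It is not gutcheck's implementation; the
+ function name is hypothetical, and the message strings are simply
+ borrowed from the style of gutcheck's report.

```python
def line_end_problems(data):
    """Return (line_number, message) pairs for lines not ending in CR/LF.

    `data` is the raw file contents as bytes, so no newline translation
    has been applied by the time we look at it.
    """
    problems = []
    lines = data.split(b"\n")
    # A file ending in LF leaves an empty last piece after the split; drop it.
    if lines and lines[-1] == b"":
        lines.pop()
    for number, line in enumerate(lines, start=1):
        if not line.endswith(b"\r"):
            problems.append((number, "No CR?"))       # bare LF (Unix style)
        elif b"\r" in line[:-1]:
            problems.append((number, "CR without LF?"))  # stray CR mid-line
    return problems
```

+ Reading the file in binary mode matters here: in text mode, many
+ environments convert line ends silently and the check would see
+ nothing wrong.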
+
+
+ Paranoid mode (-x switch to disable: Trust No One :-)
+
+ -x automatically switches OFF typo-checking (the -t flag)
+ and some extra checks, like standalone 1 and 0 queries.
+
+
+ Overview mode (-o switch)
+
+ This mode just gives a count of queries found
+ instead of a detailed list.
+
+
+ Header quote (-h switch)
+
+ If you use the -h switch, gutcheck will also display
+ the Title, Author, Release and Edition fields from the
+ PG header. This is useful mostly for the automated
+ checks we do on recently-posted texts.
+
+
+ Errors to stdout (-y switch)
+
+ If you're just running gutcheck normally, you can ignore
+ this. It's only there for programs that provide a front
+ end to gutcheck. It makes error messages appear within
+ the output of gutcheck so that the front end knows whether
+ gutcheck ran OK.
+
+
+ Verbose reporting (-v switch)
+
+ Normally, if gutcheck sees lots of long lines, short lines,
+ spaced dashes, non-ASCII characters or dot-commas ".," it
+ assumes these are features of the text, counts and summarizes
+ them at the top of its report, but does not list them
+ individually. If the -v switch is on, gutcheck will list them all.
+
+
+ Markup interpretation (-m switch)
+
+ Normally, gutcheck flags anything it suspects of being HTML
+ markup as a possible error. When you use the -m switch,
+ however, it matches anything that looks like markup against
+ a short list of common HTML tags and entities. If the markup
+ is in that list, it either ignores the markup, in the case
+ of a tag, or "interprets" the markup as its nearest ASCII
+ equivalent, in the case of an entity. So, for example, using
+ this switch, gutcheck will "see"
+
+ &ldquo;He went thataway!&rdquo;
+
+ as
+
+ "He went thataway!"
+
+ and report accordingly.
+
+ This switch does not, not, NOT check the validity of HTML;
+ it exists so that you can run gutcheck on most HTML texts
+ for PG, and get sane results. It does not support all tags.
+ It does not support all entities. When it sees a tag or entity
+ it does not recognize, it will query it as HTML just as if
+ you hadn't specified the -m switch.
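+
+ The behaviour described above might be sketched like this in
+ Python. The tag and entity tables are deliberately tiny and
+ hypothetical, not gutcheck's real lists, and the function name is
+ an assumption made for the example.

```python
import re

# Hypothetical, deliberately short tables; not gutcheck's real lists.
KNOWN_TAGS = {"i", "b", "em", "p", "br"}
KNOWN_ENTITIES = {"ldquo": '"', "rdquo": '"', "lsquo": "'", "rsquo": "'",
                  "mdash": "--", "amp": "&"}

TAG_RE = re.compile(r"</?([A-Za-z][A-Za-z0-9]*)[^<>]*>")
ENTITY_RE = re.compile(r"&([A-Za-z]+);")

def interpret_markup(line):
    """Return (plain_line, unrecognized) for one line of text."""
    unrecognized = []

    def replace_tag(match):
        if match.group(1).lower() in KNOWN_TAGS:
            return ""                  # known tag: ignore it
        unrecognized.append(match.group(0))
        return match.group(0)          # unknown tag: leave it to be queried

    def replace_entity(match):
        name = match.group(1)
        if name in KNOWN_ENTITIES:
            return KNOWN_ENTITIES[name]  # known entity: ASCII equivalent
        unrecognized.append(match.group(0))
        return match.group(0)          # unknown entity: leave it to be queried

    line = TAG_RE.sub(replace_tag, line)
    line = ENTITY_RE.sub(replace_entity, line)
    return line, unrecognized
```

+ Anything left in the second return value is what would still be
+ queried as possible HTML, exactly as if no interpretation had
+ been requested.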
+
+ Gutcheck 0.99 will automatically switch on markup interpretation
+ if it sees a lot of tags that appear to be markup, so mostly, you
+ won't have to specify this.
+
+ User-defined typos (-u switch)
+
+ If you have a file named gutcheck.typ either in your current
+ working directory or in the directory from which you explicitly
+ invoked gutcheck, but not necessarily on your path, and if you
+ specify the -u switch, gutcheck will query any word specified
+ in that file. The file is simple: one word, in lower case, per
+ line. 999 lines are allowed for. Be careful not to put multiple
+ words onto a line, or leave any rubbish other than the word on
+ the line. You should have received a sample file gutcheck.typ
+ with this package.
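+
+ A minimal Python sketch of the file format and lookup described
+ above, for illustration only. The function names are hypothetical,
+ and the rule for rejecting malformed lines is an assumption; the
+ text above only says to avoid such lines, not how gutcheck treats
+ them.

```python
import re

MAX_WORDS = 999  # the limit stated in the documentation

def load_typo_words(lines):
    """Collect one lower-case word per line, skipping blank or malformed lines."""
    words = set()
    for raw in lines:
        word = raw.strip()
        # Assumption: accept only a single token of lower-case letters
        # (apostrophes allowed); anything else is skipped.
        if not re.fullmatch(r"[a-z']+", word):
            continue
        words.add(word)
        if len(words) >= MAX_WORDS:
            break
    return words

def query_user_typos(text_lines, words):
    """Yield (line_number, word) for each listed word found in the text."""
    for number, line in enumerate(text_lines, start=1):
        for token in re.findall(r"[A-Za-z']+", line):
            if token.lower() in words:
                yield number, token
```

+ In practice the first argument would be the open gutcheck.typ
+ file and the second the lines of the text being checked.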
+
+ Ignore DP markup (-d switch)
+
+ Distributed Proofreaders (http://www.pgdp.net) is currently
+ (2005) the main source of PG texts, and proofers there use
+ special conventions. This switch understands those conventions,
+ so that people can use gutcheck on files in progress that
+ haven't had the special conventions removed yet. The special
+ conventions supported in 0.99 are page-separators and
+ "
+ Some PG texts have been converted from HTML, and not all of the
+ HTML tags have been removed.
+
+
+
+ Line 2402 - HTML symbol? &emdash;
+
+ Similarly, special HTML symbol characters can survive into PG
+ texts. Can occasionally produce amusing false positives like
+ . . . Marwick & Co were well known for it;
+
+
+
+ Line 2540 - Mismatched quotes
+
+ Another gutcheck mainstay - unclosed doublequotes in a paragraph.
+ See the discussion of quotes in the switches section near the
+ start of this file.
+
+ Since the mismatch doesn't occur on any one line, gutcheck quotes
+ the line number of the first blank line following the paragraph,
+ since this is the point where it reconciles the count of quotes.
+ However, if gutcheck is echoing lines, that is, you haven't used
+ the -e switch, it will show the _first_ line of the paragraph,
+ to help you find the place without using line numbers. The
+ offending paragraph is therefore between the quoted line and
+ the line number given.
+
+
+
+ Line 2587 - Mismatched single quotes
+
+ Only checked with the -s switch, since checking single quotes is
+ not a very reliable process. Otherwise, the same logic as for
+ doublequotes applies.
+
+
+
+ Line 2877 - Mismatched round brackets?
+
+ Also curly and square brackets. Texts with a lot of brackets, like
+ plays with bracketed stage instructions, may have mismatches.
+
+
+ Line 3150 - No CR?
+ Line 3204 - Two successive CRs?
+ Line 3281 position 75 - CR without LF?
+
+ These are the invalid line-end warnings. See the discussion of
+ line-end checking in the switches section near the start of this
+ file. If you see these, and your editor doesn't show anything
+ wrong, you should probably try deleting the characters just before
+ and after the line end, and the line-end itself, then retyping the
+ characters and the line-end.
+
+
+ Line 2940 - Paragraph starts with lower-case
+
+ A common error in an e-text is for an extra blank line
+
+ to be put in, like the blank line above, and this often
+ shows up as a new paragraph beginning with lower case.
+ Sometimes the blank line is deliberate, as when a
+ quotation is inserted in a speech. Use your judgement.
+
+
+ Line 2987 - Extra period?
+
+ An extra period. is a. common problem in OCRed text. and usually
+ arises when a speck of dust on the page is mistaken for a period.
+ or. as occasionally happens. when a comma loses its tail.
+
+
+ Line 3012 column 12 - Double punctuation?
+
+ Double punctuation., like that,, is a common typo and
+ scanno. Some books have much legit double punctuation,
+ like etc., etc., but it's worth checking anyway.
+
+
+
+ * * * *
+
+For Windows-only users who are unfamiliar with DOS:
+
+ If you're a Windows-only user, you need to save
+ gutcheck.exe into the folder (directory) where the
+ text file you want to check is. Let's say your
+ text file is in C:\GUT, then you should save
+ GUTCHECK.EXE into C:\GUT.
+
+ Now get to a DOS prompt. You can do this by
+ selecting the "Command Prompt" or "MS-DOS Prompt"
+ option that will be somewhere on your
+ Start/Programs menu.
+
+ Now get into the C:\GUT directory.
+ You can do this using the CD (change directory)
+ command, like this:
+ CD \GUT
+ and your prompt will change to
+ C:\GUT>
+ so you know you're in the right place.
+
+ Now type
+ gutcheck yourfile.txt
+ and you'll see gutcheck's report
+
+ By default, gutcheck prints its queries to screen.
+ If you want to create a file of them, to edit
+ against the text, you can use the greater-than
+ sign (>) to tell it to output the report to a
+ file. For example, if you want its report in a
+ file called QUERIES.LST, you could type
+
+ gutcheck yourfile.txt > queries.lst
+
+ The queries.lst file will then contain the listing
+ of possible formatting errors, and you can
+ edit it alongside your text.
+
+ Whatever you do, DON'T make the filename after
+ the greater-than sign the name of a file already
+ on your disk that you want to keep, because
+ the greater-than sign will cause gutcheck to
+ replace any existing file of that name.
+
+ So, for example, if you have two Tolstoy files
+ that you want to check, called WARPEACE.TXT and
+ ANNAK.TXT, make sure that neither of these names
+ is ever used following the greater-than sign.
+ To check these correctly, you might do:
+
+ gutcheck warpeace.txt >war.lst
+
+ and
+
+ gutcheck annak.txt > annak.lst
+
+ separately. Then you can look at war.lst and annak.lst
+ to see the gutcheck reports.
+
+ * * * *
+
+
+For existing 0.98 users upgrading to 0.99:
+
+ If you run on old 16-bit DOS or Windows 3.x, I'm afraid
+ you're out of luck. I'm not saying it _can't_ be compiled
+ to run on 16-bit, but the executable with the package is
+ for Win32 only. *nix users won't notice the change at all.
+
+
+ There are two new switches: -u and -d.
+ See above for full rundown.
+
+
+Here's a list of the new errors:
+
+ Line 1456 - Carat character?
+
+ I^ve found a few.
+
+
+ Line 1821 - Forward slash?
+
+ Common error for italicized "I", or so /'ve found.
+
+
+ Line 2139 - Query missing paragraph break?
+
+ "Come here, son." "Do I _have_ to go, dad?"
+ Like that. False positives in some texts. Sorry 'bout that,
+ but these are often errors.
+
+
+ Line 2200 - Query had/bad error?
+
+ Clear enough. Doesn't catch as many as I'd like it to,
+ but rarely gives false alarms.
+
+
+ Line 2268 - Query punctuation after the?
+
+ Some words, like "the", very rarely have punctuation
+ following them. Others, like "Mrs", usually have a
+ period, but never a comma. Occasional false positives.
+
+
+ Line 2380 - Query possible scanno arid
+
+ It found one of your user-defined typos when you
+ used the -u switch.
+
+
+ Line 2511 - Capital "S"?
+
+ Surprisingly common specific case, like: Jane'S
+
+
+ Line 3469 - endquote missing punctuation?
+
+ OK. This one can really cause a lot of false positives
+ in some books, but it switches itself off if it finds
+ more than 20 in a text, unless you force it to list them
+ all with the -v switch.
+ "Hey, dad" Johnny said, "can we go now?"
+ is a common punctuation-missing error.
+
+
+ Line 4266 - Mismatched underscores?
+
+ Like mismatched anything else!
+
+