Gutcheck documentation
gutcheck: lists possible common formatting errors in a Project
Gutenberg candidate file. It is a command line program and can be used
under Win32 or Unix (gutcheck.c should compile anywhere; if it doesn't,
tell me). For Windows-only people, there is an appendix at the end
with brief instructions for running it.
Current version: 0.99. Users of 0.98 see end of file for changes.
With this file, you should also have received the licence file
COPYING, a README file, gutcheck.c (the source code) and
gutcheck.exe (a DOS executable).
This software is Copyright Jim Tinsley 2000-2005.
Gutcheck comes with ABSOLUTELY NO WARRANTY. For details, read the file COPYING.
This is Free Software; you may redistribute it under certain conditions (GPL).
See http://gutcheck.sourceforge.net for the latest version.
Usage is: gutcheck [-setopxlywm] filename
where:
-s checks Single quotes
-e switches off Echoing of lines
-t checks Typos
-o produces an Overview only
-p sets strict quotes checking for Paragraphs
-x (paranoid) switches OFF typo checking and extra checks
-l turns off Line-end checks
-y sets error messages to stdout
-w is a special mode for web uploads (for future use)
-v (verbose) forces individual reporting of minor problems
-m interprets Markup of some common HTML tags and entities
-u warns about words in a user-defined typo file gutcheck.typ
-d ignores some DP-specific markup
Running gutcheck without any parameters will display a brief help message.
Sample usage:
gutcheck warpeace.txt
More detail:
Echoing lines (-e to switch off)
You may find it convenient, when reviewing Gutcheck's
suggestions, to see the line that Gutcheck is questioning.
That way, you can often see at a glance whether it is
a real error that needs to be fixed, or a false positive:
something that belongs in the text, but that Gutcheck's
limited programming doesn't understand.
By default, gutcheck echoes these lines, but if you don't
want to see the lines referred to, -e will switch it OFF.
Quotes (-s and -p switches)
Gutcheck always looks for unbalanced doublequotes in a
paragraph. It is a common convention for writers not to
close quotes in a paragraph if the next paragraph opens
with quotes and is a continuation by the same speaker.
Gutcheck therefore does not normally report unclosed quotes
if the next paragraph begins with a quote. If you need
to see all unclosed quotes, even where the next paragraph
begins with a quote, you should use the -p switch.
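As a rough illustration of the doublequote rule just described, here
is a minimal C sketch. It is not the code in gutcheck.c: the function
name unclosed_quotes and its parameters are hypothetical, and treating
"unbalanced" as "an odd number of doublequotes" is a simplifying
assumption.

```c
#include <stddef.h>

/* Hypothetical helper, not taken from gutcheck.c.
 * Returns 1 if the paragraph should be queried for unbalanced
 * doublequotes, 0 otherwise.
 *   para   - text of the paragraph being checked
 *   next   - text of the following paragraph, or NULL if none
 *   strict - non-zero to mimic the -p switch
 */
int unclosed_quotes(const char *para, const char *next, int strict)
{
    size_t count = 0;
    const char *p;

    for (p = para; *p != '\0'; p++)
        if (*p == '"')
            count++;

    if (count % 2 == 0)
        return 0;               /* quotes balance */

    /* Odd count: tolerate it when the next paragraph opens with a
       quote (same speaker continuing), unless strict checking is on. */
    if (!strict && next != NULL && next[0] == '"')
        return 0;

    return 1;
}
```

With strict set to 0, an unclosed quote is let through whenever the
following paragraph opens with a doublequote; with strict non-zero,
every odd count is queried.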
Singlequotes (') are a problem, since the same character
is used for an apostrophe. I'm not sure that it is
possible to get 100% accuracy on singlequotes checking,
particularly since dialect, quite common in PG texts,
upsets the normal rules so badly. Consider the sentence:
'Tis often said that a man's a man for a' that.
As humans, we recognize that all three apostrophes mark
contractions rather than quotes, but it isn't easy
to get a program to recognize that.
Since Gutcheck makes too many mistakes when trying to match
singlequotes, it doesn't look for unbalanced singlequotes
unless you specify the -s switch.
Consider these sentences, which illustrate the main cases:
'Tis often said that a fool and his money are soon parted.
'Becky's goin' home,' said Tom.
The dogs' tails wagged in unison.
Those 'pack dogs' of yours look more like wolves.
Typos (-t switch)
It's not Gutcheck's job to be a spelling checker, but it
does check for a list of common typos and OCR errors if you
use the -t switch. (The -x switch, described below, switches
typo checking OFF.)
It also checks for character combinations that rarely or never
occur, especially those involving h and b, which OCR often
confuses. For example, it queries "tbe" in a word. Now, "the" often
occurs, but "tbe" is very rare (heartbeat, hotbed), so I'm
playing the odds - a few false positives for many errors found.
Similarly with "ii", which is a very common OCR error.
Gutcheck suppresses multiple reporting of the first 40 "typos"
found. This is to remove the annoyance of seeing something like
"FN" (footnote) or "LK" (initials) flagged as a typo 147 times
in a text.
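The letter-combination check can be pictured with a short C sketch.
This is only an illustration, not gutcheck's own code: the function
name has_suspect_combo is hypothetical, and the table holds only the
two combinations named above ("tbe" and "ii").

```c
#include <string.h>

/* Hypothetical sketch, not gutcheck's own code: returns 1 if the
 * lower-case word contains a letter combination that the text above
 * names as suspect ("tbe" or "ii"), 0 otherwise. */
int has_suspect_combo(const char *word)
{
    static const char *const combos[] = { "tbe", "ii" };
    size_t i;

    for (i = 0; i < sizeof combos / sizeof combos[0]; i++)
        if (strstr(word, combos[i]) != NULL)
            return 1;
    return 0;
}
```

Note that a legitimate word such as "heartbeat" is flagged too, which
is exactly the kind of false positive described above.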
Line-end checking (-l switch to disable)
All PG texts should have a Carriage Return (CR - character 13)
and a Line Feed (LF - character 10) at end of each line,
regardless of what O/S you made them on. DOS/Windows, Unix
and Mac have different conventions, but the final text should
always use a CR/LF pair as its line terminator.
By default, Gutcheck verifies that every line does have
the correct terminator, but if you're on a work-in-progress
in Linux, you might want to convert the line-ends as a final
step, and not want to see thousands of errors every time you
run Gutcheck before that final step, so you can turn off
this checking with the -l switch.
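For readers who want to see what a line-end check amounts to, here is
a minimal C sketch. It is an assumption-laden illustration rather than
gutcheck's actual logic: the function name count_bad_line_ends is
hypothetical, and it only counts problems instead of reporting line
numbers.

```c
#include <stddef.h>

#define CR 13   /* Carriage Return */
#define LF 10   /* Line Feed */

/* Hypothetical sketch, not gutcheck's own code.
 * Counts line-end problems in a buffer of len bytes:
 *   - a LF that is not preceded by a CR
 *   - a CR that is not followed by a LF
 */
size_t count_bad_line_ends(const char *buf, size_t len)
{
    size_t i, bad = 0;

    for (i = 0; i < len; i++) {
        if (buf[i] == LF && (i == 0 || buf[i - 1] != CR))
            bad++;              /* bare LF, as written on Unix */
        else if (buf[i] == CR && (i + 1 >= len || buf[i + 1] != LF))
            bad++;              /* CR with no LF after it */
    }
    return bad;
}
```

A text that uses CR/LF pairs throughout gives a count of zero; a file
saved with Unix line ends gives one problem per line.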
Paranoid mode (-x switch to disable: Trust No One :-)
-x automatically switches OFF typo-checking (the -t flag)
and some extra checks, like standalone 1 and 0 queries.
Overview mode (-o switch)
This mode just gives a count of queries found
instead of a detailed list.
Header quote (-h switch)
If you use the -h switch, gutcheck will also display
the Title, Author, Release and Edition fields from the
PG header. This is useful mostly for the automated
checks we do on recently-posted texts.
Errors to stdout (-y switch)
If you're just running gutcheck normally, you can ignore
this. It's only there for programs that provide a front
end to gutcheck. It makes error messages appear within
the output of gutcheck so that the front end knows whether
gutcheck ran OK.
Verbose reporting (-v switch)
Normally, if gutcheck sees lots of long lines, short lines,
spaced dashes, non-ASCII characters or dot-commas ".," it
assumes these are features of the text, counts and summarizes
them at the top of its report, but does not list them
individually. If the -v switch is on, gutcheck will list them all.
Markup interpretation (-m switch)
Normally, gutcheck flags anything it suspects of being HTML
markup as a possible error. When you use the -m switch,
however, it matches anything that looks like markup against
a short list of common HTML tags and entities. If the markup
is in that list, it either ignores the markup, in the case
of a tag, or "interprets" the markup as its nearest ASCII
equivalent, in the case of an entity. So, for example, using
this switch, gutcheck will "see"
&ldquo;He went thataway!&rdquo;
as
"He went thataway!"
and report accordingly.
This switch does not, not, NOT check the validity of HTML;
it exists so that you can run gutcheck on most HTML texts
for PG, and get sane results. It does not support all tags.
It does not support all entities. When it sees a tag or entity
it does not recognize, it will query it as HTML just as if
you hadn't specified the -m switch.
Gutcheck 0.99 will automatically switch on markup interpretation
if it sees a lot of tags that appear to be markup, so mostly, you
won't have to specify this.
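The idea of "interpreting" an entity as its nearest ASCII equivalent
can be sketched in C as below. This is a hedged illustration only: the
function name interpret_entities and the small entity table are
hypothetical, and gutcheck's real list of supported tags and entities
is not reproduced here.

```c
#include <string.h>

/* Hypothetical table: gutcheck's real list of entities is not
 * reproduced here. */
struct entity { const char *name; const char *ascii; };

static const struct entity entities[] = {
    { "&ldquo;", "\"" },
    { "&rdquo;", "\"" },
    { "&lsquo;", "'"  },
    { "&rsquo;", "'"  },
    { "&amp;",   "&"  }
};

/* Copies src to dst, replacing known entities by their ASCII
 * equivalents. Unknown entities are copied unchanged. dst must hold
 * at least strlen(src) + 1 bytes; that is enough because every
 * replacement is shorter than the entity it replaces. */
void interpret_entities(const char *src, char *dst)
{
    size_t i, n;

    while (*src != '\0') {
        int matched = 0;

        if (*src == '&') {
            for (i = 0; i < sizeof entities / sizeof entities[0]; i++) {
                n = strlen(entities[i].name);
                if (strncmp(src, entities[i].name, n) == 0) {
                    strcpy(dst, entities[i].ascii);
                    dst += strlen(entities[i].ascii);
                    src += n;
                    matched = 1;
                    break;
                }
            }
        }
        if (!matched)
            *dst++ = *src++;    /* ordinary character, or unknown entity */
    }
    *dst = '\0';
}
```

Given the input &ldquo;He went thataway!&rdquo; this sketch produces
"He went thataway!" in plain doublequotes, while an entity that is not
in the table passes through untouched, so it can still be queried.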
User-defined typos (-u switch)
If you have a file named gutcheck.typ either in your current
working directory or in the directory from which you explicitly
invoked gutcheck, but not necessarily on your path, and if you
specify the -u switch, gutcheck will query any word specified
in that file. The file is simple: one word, in lower case, per
line. 999 lines are allowed for. Be careful not to put multiple
words onto a line, or leave any rubbish other than the word on
the line. You should have received a sample file gutcheck.typ
with this package.
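If you build gutcheck.typ by hand, a small check like the following C
sketch can catch lines that break the format described above. It is
not part of gutcheck: the function name valid_typo_line is
hypothetical, and treating a "word" as lower-case letters only is an
assumption.

```c
#include <ctype.h>

/* Hypothetical helper for vetting a gutcheck.typ line before use.
 * Returns 1 if the line holds exactly one word made only of
 * lower-case letters, 0 otherwise. A trailing CR and/or LF is
 * ignored. Treating a "word" as letters only is an assumption. */
int valid_typo_line(const char *line)
{
    const char *p = line;

    while (islower((unsigned char)*p))
        p++;
    if (p == line)
        return 0;               /* empty line or bad first character */
    while (*p == '\r' || *p == '\n')
        p++;
    return *p == '\0';          /* anything left over is rubbish */
}
```

Lines with capitals, a second word, or a trailing space are all
rejected by this sketch.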
Ignore DP markup (-d switch)
Distributed Proofreaders (http://www.pgdp.net) is currently
(2005) the main source of PG texts, and proofers there use
special conventions. This switch understands those conventions,
so that people can use gutcheck on files in process that haven't
yet had the special conventions removed. The special
conventions supported in 0.99 are page-separators and
"
Some PG texts have been converted from HTML, and not all of the
HTML tags have been removed.

Line 2402 - HTML symbol? &emdash;
Similarly, special HTML symbol characters can survive into PG
texts. Can occasionally produce amusing false positives like
. . . Marwick & Co were well known for it;

Line 2540 - Mismatched quotes
Another gutcheck mainstay - unclosed doublequotes in a paragraph.
See the discussion of quotes in the switches section near the
start of this file. Since the mismatch doesn't occur on any one
line, gutcheck quotes the line number of the first blank line
following the paragraph, since this is the point where it
reconciles the count of quotes. However, if gutcheck is echoing
lines, that is, you haven't used the -e switch, it will show the
_first_ line of the paragraph, to help you find the place without
using line numbers. The offending paragraph is therefore between
the quoted line and the line number given.

Line 2587 - Mismatched single quotes
Only checked with the -s switch, since checking single quotes is
not a very reliable process. Otherwise, the same logic as for
doublequotes applies.

Line 2877 - Mismatched round brackets?
Also curly and square brackets. Texts with a lot of brackets,
like plays with bracketed stage instructions, may have mismatches.

Line 3150 - No CR?
Line 3204 - Two successive CRs?
Line 3281 position 75 - CR without LF?
These are the invalid line-end warnings. See the discussion of
line-end checking in the switches section near the start of this
file. If you see these, and your editor doesn't show anything
wrong, you should probably try deleting the characters just
before and after the line end, and the line-end itself, then
retyping the characters and the line-end.

Line 2940 - Paragraph starts with lower-case
A common error in an e-text is for an extra blank line

to be put in, like the blank line above, and this often
shows up as a new paragraph beginning with lower case.
Sometimes the blank line is deliberate, as when a quotation is
inserted in a speech. Use your judgement.

Line 2987 - Extra period?
An extra period. is a. common problem in OCRed text. and usually
arises when a speck of dust on the page is mistaken for a period.
or. as occasionally happens. when a comma loses its tail.

Line 3012 column 12 - Double punctuation?
Double punctuation., like that,, is a common typo and scanno.
Some books have much legit double punctuation, like etc., etc.,
but it's worth checking anyway.

* * * *

For Windows-only users who are unfamiliar with DOS:

If you're a Windows-only user, you need to save gutcheck.exe into
the folder (directory) where the text file you want to check is.
Let's say your text file is in C:\GUT, then you should save
GUTCHECK.EXE into C:\GUT.

Now get to a DOS prompt. You can do this by selecting the
"Command Prompt" or "MS-DOS Prompt" option that will be somewhere
on your Start/Programs menu.

Now get into the C:\GUT directory. You can do this using the CD
(change directory) command, like this:

    CD \GUT

and your prompt will change to

    C:\GUT>

so you know you're in the right place.

Now type

    gutcheck yourfile.txt

and you'll see gutcheck's report.

By default, gutcheck prints its queries to screen. If you want to
create a file of them, to edit against the text, you can use the
greater-than sign (>) to tell it to output the report to a file.
For example, if you want its report in a file called QUERIES.LST,
you could type

    gutcheck yourfile.txt > queries.lst

The queries.lst file will then contain the listing of possible
formatting errors, and you can edit it alongside your text.

Whatever you do, DON'T make the filename after the greater-than
sign the name of a file already on your disk that you want to
keep, because the greater-than sign will cause gutcheck to
replace any existing file of that name.
So, for example, if you have two Tolstoy files that you want to
check, called WARPEACE.TXT and ANNAK.TXT, make sure that neither
of these names is ever used following the greater-than sign. To
check these correctly, you might do:

    gutcheck warpeace.txt > war.lst

and

    gutcheck annak.txt > annak.lst

separately. Then you can look at war.lst and annak.lst to see the
gutcheck reports.

* * * *

For existing 0.98 users upgrading to 0.99:

If you run on old 16-bit DOS or Windows 3.x, I'm afraid you're
out of luck. I'm not saying it _can't_ be compiled to run on
16-bit, but the executable with the package is for Win32 only.
*nix users won't notice the change at all.

There are two new switches: -u and -d. See above for full rundown.

Here's a list of the new errors:

Line 1456 - Carat character?
I^ve found a few.

Line 1821 - Forward slash?
Common error for italicized "I", or so /'ve found.

Line 2139 - Query missing paragraph break?
"Come here, son." "Do I _have_ to go, dad?"
Like that. False positives in some texts. Sorry 'bout that, but
these are often errors.

Line 2200 - Query had/bad error?
Clear enough. Doesn't catch as many as I'd like it to, but rarely
gives false alarms.

Line 2268 - Query punctuation after the?
Some words, like "the", very rarely have punctuation following
them. Others, like "Mrs", usually have a period, but never a
comma. Occasional false positives.

Line 2380 - Query possible scanno arid
It found one of your user-defined typos when you used the -u
switch.

Line 2511 - Capital "S"?
Surprisingly common specific case, like: Jane'S

Line 3469 - endquote missing punctuation?
OK. This one can really cause a lot of false positives in some
books, but it switches itself off if it finds more than 20 in a
text, unless you force it to list them all with the -v switch.
"Hey, dad" Johnny said, "can we go now?" is a common
punctuation-missing error.

Line 4266 - Mismatched underscores?
Like mismatched anything else!
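
To make "mismatched" concrete for the underscore case, here is a tiny
C sketch. It is an illustration under assumptions, not gutcheck's own
code: the function name mismatched_underscores is hypothetical, and it
simply treats an odd number of underscores in a paragraph as a
mismatch.

```c
#include <stddef.h>

/* Hypothetical sketch, not gutcheck's own code: returns 1 if the
 * paragraph holds an odd number of underscores, 0 otherwise. */
int mismatched_underscores(const char *para)
{
    size_t count = 0;

    for (; *para != '\0'; para++)
        if (*para == '_')
            count++;
    return count % 2 != 0;
}
```

A paragraph such as "Do I _have_ to go, dad?" passes, while one where
the closing underscore was dropped is flagged.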