ali@0:
ali@0:
ali@0: Gutcheck documentation
ali@0:
ali@0:
ali@0: gutcheck: lists possible common formatting errors in a Project
ali@0: Gutenberg candidate file. It is a command line program and can be used
ali@0: under Win32 or Unix (gutcheck.c should compile anywhere; if it doesn't,
ali@0: tell me). For Windows-only people, there is an appendix at the end
ali@0: with brief instructions for running it.
ali@0:
ali@0:
ali@0: Current version: 0.99. Users of 0.98 see end of file for changes.
ali@0:
ali@0: You should also have received the licence file COPYING, a README file,
ali@0: gutcheck.c, the source code, and gutcheck.exe, a DOS executable, with
ali@0: this file.
ali@0:
ali@0: This software is Copyright Jim Tinsley 2000-2005.
ali@0:
ali@0: Gutcheck comes with ABSOLUTELY NO WARRANTY. For details, read the file COPYING.
ali@0: This is Free Software; you may redistribute it under certain conditions (GPL).
ali@0:
ali@0: See http://gutcheck.sourceforge.net for the latest version.
ali@0:
ali@0:
ali@0: Usage is: gutcheck [-setopxlywm] filename
ali@0: where:
ali@0: -s checks Single quotes
ali@0: -e switches off Echoing of lines
ali@0: -t checks Typos
ali@0: -o produces an Overview only
ali@0: -p sets strict quotes checking for Paragraphs
ali@0: -x (paranoid) switches OFF typo checking and extra checks
ali@0: -l turns off Line-end checks
ali@0: -y sets error messages to stdout
ali@0: -w is a special mode for web uploads (for future use)
ali@0: -v (verbose) forces individual reporting of minor problems
ali@0: -m interprets Markup of some common HTML tags and entities
ali@0: -u warns about words in a user-defined typo file gutcheck.typ
ali@0: -d ignores some DP-specific markup
ali@0:
ali@0: Running gutcheck without any parameters will display a brief help message.
ali@0:
ali@0: Sample usage:
ali@0:
ali@0: gutcheck warpeace.txt
ali@0:
ali@0:
ali@0: More detail:
ali@0:
ali@0: Echoing lines (-e to switch off)
ali@0:
ali@0: You may find it convenient, when reviewing Gutcheck's
ali@0: suggestions, to see the line that Gutcheck is questioning.
ali@0: That way, you can often see at a glance whether it is
ali@0: a real error that needs to be fixed, or a false positive:
ali@0: something that belongs in the text, but that Gutcheck's
ali@0: limited programming doesn't understand.
ali@0:
ali@0: By default, gutcheck echoes these lines, but if you don't
ali@0: want to see the lines referred to, -e will switch it OFF.
ali@0:
ali@0:
ali@0: Quotes (-s and -p switches)
ali@0:
ali@0: Gutcheck always looks for unbalanced doublequotes in a
ali@0: paragraph. It is a common convention for writers not to
ali@0: close quotes in a paragraph if the next paragraph opens
ali@0: with quotes and is a continuation by the same speaker.
ali@0:
ali@0: Gutcheck therefore does not normally report unclosed quotes
ali@0: if the next paragraph begins with a quote. If you need
ali@0: to see all unclosed quotes, even where the next paragraph
ali@0: begins with a quote, you should use the -p switch.
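
The doublequote rule described above can be sketched in a few lines of Python. This is only an illustration of the rule, not Gutcheck's actual code: the function name, the `strict` parameter (standing in for the -p switch) and the assumption that paragraphs are separated by blank lines are all invented for the example.

```python
def unclosed_quote_paragraphs(text, strict=False):
    """Return 1-based numbers of paragraphs with an odd count of doublequotes.

    strict=False follows the default rule: an unclosed quote is tolerated
    when the next paragraph opens with a doublequote.
    strict=True stands in for -p: every odd count is reported.
    """
    # Assumption: paragraphs are separated by blank lines.
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    flagged = []
    for i, para in enumerate(paragraphs):
        if para.count('"') % 2 == 0:
            continue  # quotes balance, nothing to report
        next_opens_with_quote = (
            i + 1 < len(paragraphs) and paragraphs[i + 1].startswith('"')
        )
        if strict or not next_opens_with_quote:
            flagged.append(i + 1)
    return flagged


sample = (
    '"I have a long story to tell," he said. "It began years ago.\n'
    "\n"
    '"It ended only last week."\n'
    "\n"
    'She said, "That is all.\n'
)
print(unclosed_quote_paragraphs(sample))               # default rule
print(unclosed_quote_paragraphs(sample, strict=True))  # -p style
```

In the sample, the first paragraph is a continued speech, so only the strict call reports it; the last paragraph is reported either way.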
ali@0:
ali@0: Singlequotes (') are a problem, since the same character
ali@0: is used for an apostrophe. I'm not sure that it is
ali@0: possible to get 100% accuracy on singlequotes checking,
ali@0: particularly since dialect, quite common in PG texts,
ali@0: upsets the normal rules so badly. Consider the sentence:
ali@0: 'Tis often said that a man's a man for a' that.
ali@0: As humans, we recognize that both apostrophes are used
ali@0: for contractions rather than quotes, but it isn't easy
ali@0: to get a program to recognize that.
ali@0:
ali@0: Since Gutcheck makes too many mistakes when trying to match
ali@0: singlequotes, it doesn't look for unbalanced singlequotes
ali@0: unless you specify the -s switch.
ali@0:
ali@0: Consider these sentences, which illustrate the main cases:
ali@0:
ali@0: 'Tis often said that a fool and his money are soon parted.
ali@0:
ali@0: 'Becky's goin' home,' said Tom.
ali@0:
ali@0: The dogs' tails wagged in unison.
ali@0:
ali@0: Those 'pack dogs' of yours look more like wolves.
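
To see why simple counting is not enough, here is a small Python sketch that treats every ' character as a quote mark and checks whether the count is even. It is a deliberately naive illustration (the function name is made up, and it is not what Gutcheck does internally), run against the four sentences above.

```python
examples = [
    "'Tis often said that a fool and his money are soon parted.",
    "'Becky's goin' home,' said Tom.",
    "The dogs' tails wagged in unison.",
    "Those 'pack dogs' of yours look more like wolves.",
]


def naive_single_quote_count(sentence):
    """Count every ' character, as if each one were a quote mark."""
    return sentence.count("'")


for sentence in examples:
    count = naive_single_quote_count(sentence)
    verdict = "balanced" if count % 2 == 0 else "UNBALANCED"
    print(f"{count} {verdict}: {sentence}")
```

The first and third sentences come out unbalanced although nothing is wrong with them, and the second balances only because it happens to contain two apostrophes as well as two quote marks.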
ali@0:
ali@0:
ali@0:
ali@0: Typos (-t switch)
ali@0:
ali@0: It's not Gutcheck's job to be a spelling checker, but it
ali@0: does check for a list of common typos and OCR errors if you
ali@0: use the -t switch. (The -x switch turns typo checking OFF.)
ali@0:
ali@0: It also checks for character combinations that rarely or never
ali@0: occur, especially those involving h and b, which OCR often
ali@0: confuses. For example, it queries "tbe" in a word. Now, "the" often
ali@0: occurs, but "tbe" is very rare (heartbeat, hotbed), so I'm
ali@0: playing the odds - a few false positives for many errors found.
ali@0: Similarly with "ii", which is a very common OCR error.
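
A rough Python sketch of this kind of letter-combination check follows. The list holds only the two combinations named above; the list name, the function name and the word-splitting rule are assumptions made for the example, and Gutcheck's own list is certainly longer.

```python
import re

# Only the two combinations mentioned in the text; a hypothetical list.
SUSPECT_COMBOS = ("tbe", "ii")


def suspect_words(line):
    """Return the words in line that contain a suspect letter combination."""
    words = re.findall(r"[A-Za-z']+", line)
    return [w for w in words if any(c in w.lower() for c in SUSPECT_COMBOS)]


print(suspect_words("Tbe dog ran to tbe hotbed, wiih heartbeat."))
```

Note that "hotbed" and "heartbeat" are flagged along with the genuine errors, which is the false-positive trade-off described above.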
ali@0:
ali@0: Gutcheck suppresses multiple reporting of the first 40 "typos"
ali@0: found. This is to remove the annoyance of seeing something like
ali@0: "FN" (footnote) or "LK" (initials) flagged as a typo 147 times
ali@0: in a text.
ali@0:
ali@0:
ali@0: Line-end checking (-l switch to disable)
ali@0:
ali@0: All PG texts should have a Carriage Return (CR - character 13)
ali@0: and a Line Feed (LF - character 10) at end of each line,
ali@0: regardless of what O/S you made them on. DOS/Windows, Unix
ali@0: and Mac have different conventions, but the final text should
ali@0: always use a CR/LF pair as its line terminator.
ali@0:
ali@0: By default, Gutcheck verifies that every line does have
ali@0: the correct terminator. If you're on a work-in-progress in
ali@0: Linux, though, you might want to convert the line-ends as a
ali@0: final step, and not see thousands of errors every time you
ali@0: run Gutcheck before that final step. If so, you can turn off
ali@0: this checking with the -l switch.
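
As an illustration of what a CR/LF check involves, here is a minimal Python sketch that works on the raw bytes of a file. The message strings are copied from the line-end queries that Gutcheck reports; the function name and the exact conditions attached to each message are guesses made for the example, not a description of Gutcheck's code.

```python
def check_line_ends(data):
    """Return (line_number, message) pairs for lines not ending in CR/LF.

    data is the raw file content as bytes.
    """
    problems = []
    lines = data.split(b"\n")
    if lines and lines[-1] == b"":
        lines.pop()  # file ended with LF, so there is no partial last line
    for number, line in enumerate(lines, start=1):
        if not line.endswith(b"\r"):
            problems.append((number, "No CR?"))
        elif line.endswith(b"\r\r"):
            problems.append((number, "Two successive CRs?"))
        # A CR anywhere before the line end has no LF after it.
        if b"\r" in line.rstrip(b"\r"):
            problems.append((number, "CR without LF?"))
    return problems


sample = b"one\r\ntwo\nthree\r\r\nfo\rur\r\n"
for number, message in check_line_ends(sample):
    print(f"Line {number} - {message}")
```

To try it on a real file, read the file in binary mode, for example `open("yourfile.txt", "rb").read()`, so that Python does not translate the line ends first.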
ali@0:
ali@0:
ali@0: Paranoid mode (-x switch to disable: Trust No One :-)
ali@0:
ali@0: -x automatically switches OFF typo-checking (the -t flag)
ali@0: and some extra checks, like standalone 1 and 0 queries.
ali@0:
ali@0:
ali@0: Overview mode (-o switch)
ali@0:
ali@0: This mode just gives a count of queries found
ali@0: instead of a detailed list.
ali@0:
ali@0:
ali@0: Header quote (-h switch)
ali@0:
ali@0: If you use the -h switch, gutcheck will also display
ali@0: the Title, Author, Release and Edition fields from the
ali@0: PG header. This is useful mostly for the automated
ali@0: checks we do on recently-posted texts.
ali@0:
ali@0:
ali@0: Errors to stdout (-y switch)
ali@0:
ali@0: If you're just running gutcheck normally, you can ignore
ali@0: this. It's only there for programs that provide a front
ali@0: end to gutcheck. It makes error messages appear within
ali@0: the output of gutcheck so that the front end knows whether
ali@0: gutcheck ran OK.
ali@0:
ali@0:
ali@0: Verbose reporting (-v switch)
ali@0:
ali@0: Normally, if gutcheck sees lots of long lines, short lines,
ali@0: spaced dashes, non-ASCII characters or dot-commas ".," it
ali@0: assumes these are features of the text, counts and summarizes
ali@0: them at the top of its report, but does not list them
ali@0: individually. If the -v switch is on, gutcheck will list them all.
ali@0:
ali@0:
ali@0: Markup interpretation (-m switch)
ali@0:
ali@0: Normally, gutcheck flags anything it suspects of being HTML
ali@0: markup as a possible error. When you use the -m switch,
ali@0: however, it matches anything that looks like markup against
ali@0: a short list of common HTML tags and entities. If the markup
ali@0: is in that list, it either ignores the markup, in the case
ali@0: of a tag, or "interprets" the markup as its nearest ASCII
ali@0: equivalent, in the case of an entity. So, for example, using
ali@0: this switch, gutcheck will "see"
ali@0:
ali@0: &ldquo;He went thataway!&rdquo;
ali@0:
ali@0: as
ali@0:
ali@0: "He went thataway!"
ali@0:
ali@0: and report accordingly.
ali@0:
ali@0: This switch does not, not, NOT check the validity of HTML;
ali@0: it exists so that you can run gutcheck on most HTML texts
ali@0: for PG, and get sane results. It does not support all tags.
ali@0: It does not support all entities. When it sees a tag or entity
ali@0: it does not recognize, it will query it as HTML just as if
ali@0: you hadn't specified the -m switch.
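
The behaviour described above (ignore known tags, turn known entities into ASCII, query everything else) might be sketched in Python like this. The two lookup tables are short hypothetical samples, not Gutcheck's real lists, and the function name is invented for the example.

```python
import re

# Hypothetical sample lists; the real program's lists are not reproduced here.
KNOWN_TAGS = {"i", "b", "p", "br"}
KNOWN_ENTITIES = {"ldquo": '"', "rdquo": '"', "lsquo": "'", "rsquo": "'", "amp": "&"}


def interpret_markup(line):
    """Return (text, queries): known markup interpreted, unknown markup listed."""
    queries = []

    def replace_tag(match):
        if match.group(1).lower() in KNOWN_TAGS:
            return ""                    # known tag: ignored
        queries.append(match.group(0))   # unknown tag: kept and queried
        return match.group(0)

    def replace_entity(match):
        name = match.group(1)
        if name in KNOWN_ENTITIES:
            return KNOWN_ENTITIES[name]  # known entity: nearest ASCII equivalent
        queries.append(match.group(0))   # unknown entity: kept and queried
        return match.group(0)

    text = re.sub(r"</?([A-Za-z][A-Za-z0-9]*)[^<>]*>", replace_tag, line)
    text = re.sub(r"&([A-Za-z]+);", replace_entity, text)
    return text, queries


text, queries = interpret_markup("&ldquo;He went <i>thataway</i>!&rdquo; &emdash; <blink>")
print(text)
print(queries)
```

Anything that ends up in `queries` corresponds to markup that would still be reported as HTML, just as if -m had not been given.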
ali@0:
ali@0: Gutcheck 0.99 will automatically switch on markup interpretation
ali@0: if it sees a lot of tags that appear to be markup, so mostly, you
ali@0: won't have to specify this.
ali@0:
ali@0: User-defined typos (-u switch)
ali@0:
ali@0: If you have a file named gutcheck.typ either in your current
ali@0: working directory or in the directory from which you explicitly
ali@0: invoked gutcheck, but not necessarily on your path, and if you
ali@0: specify the -u switch, gutcheck will query any word specified
ali@0: in that file. The file is simple: one word, in lower case, per
ali@0: line. 999 lines are allowed for. Be careful not to put multiple
ali@0: words onto a line, or leave any rubbish other than the word on
ali@0: the line. You should have received a sample file gutcheck.typ
ali@0: with this package.
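
For anyone preparing a gutcheck.typ file, this Python sketch shows how a file in that format could be read and applied: one lower-case word per line, at most 999 lines. The function names and the word-splitting rule are assumptions for illustration only; they are not taken from gutcheck.c.

```python
import re


def parse_typo_lines(lines, limit=999):
    """Collect one word per line, skipping blank lines, reading at most limit lines."""
    words = set()
    for count, raw in enumerate(lines, start=1):
        if count > limit:
            break
        word = raw.strip()
        if word:
            words.add(word)
    return words


def user_typo_hits(line, typos):
    """Return the words in line whose lower-case form is in typos."""
    return [w for w in re.findall(r"[A-Za-z']+", line) if w.lower() in typos]


typos = parse_typo_lines(["arid\n", "\n", "tbe\n"])
print(user_typo_hits("The land was Arid and dry.", typos))
```

With a real file, the lines can come straight from the file object, for example `with open("gutcheck.typ") as handle: typos = parse_typo_lines(handle)`.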
ali@0:
ali@0: Ignore DP markup (-d switch)
ali@0:
ali@0: Distributed Proofreaders (http://www.pgdp.net) is currently
ali@0: (2005) the main source of PG texts, and proofers there use
ali@0: special conventions. This switch understands those conventions,
ali@0: so that people can use gutcheck on files in process that still
ali@0: haven't had the special conventions removed yet. The special
ali@0: conventions supported in 0.99 are page-separators and
ali@0: "
ali@0:
ali@0: Some PG texts have been converted from HTML, and not all of the
ali@0: HTML tags have been removed.
ali@0:
ali@0:
ali@0:
ali@0: Line 2402 - HTML symbol? &emdash;
ali@0:
ali@0: Similarly, special HTML symbol characters can survive into PG
ali@0: texts. Can occasionally produce amusing false positives like
ali@0: . . . Marwick & Co were well known for it;
ali@0:
ali@0:
ali@0:
ali@0: Line 2540 - Mismatched quotes
ali@0:
ali@0: Another gutcheck mainstay - unclosed doublequotes in a paragraph.
ali@0: See the discussion of quotes in the switches section near the
ali@0: start of this file.
ali@0:
ali@0: Since the mismatch doesn't occur on any one line, gutcheck quotes
ali@0: the line number of the first blank line following the paragraph,
ali@0: since this is the point where it reconciles the count of quotes.
ali@0: However, if gutcheck is echoing lines, that is, you haven't used
ali@0: the -e switch, it will show the _first_ line of the paragraph,
ali@0: to help you find the place without using line numbers. The
ali@0: offending paragraph is therefore between the quoted line and
ali@0: the line number given.
ali@0:
ali@0:
ali@0:
ali@0: Line 2587 - Mismatched single quotes
ali@0:
ali@0: Only checked with the -s switch, since checking single quotes is
ali@0: not a very reliable process. Otherwise, the same logic as for
ali@0: doublequotes applies.
ali@0:
ali@0:
ali@0:
ali@0: Line 2877 - Mismatched round brackets?
ali@0:
ali@0: Also curly and square brackets. Texts with a lot of brackets, like
ali@0: plays with bracketed stage instructions, may have mismatches.
ali@0:
ali@0:
ali@0: Line 3150 - No CR?
ali@0: Line 3204 - Two successive CRs?
ali@0: Line 3281 position 75 - CR without LF?
ali@0:
ali@0: These are the invalid line-end warnings. See the discussion of
ali@0: line-end checking in the switches section near the start of this
ali@0: file. If you see these, and your editor doesn't show anything
ali@0: wrong, you should probably try deleting the characters just before
ali@0: and after the line end, and the line-end itself, then retyping the
ali@0: characters and the line-end.
ali@0:
ali@0:
ali@0: Line 2940 - Paragraph starts with lower-case
ali@0:
ali@0: A common error in an e-text is for an extra blank line
ali@0:
ali@0: to be put in, like the blank line above, and this often
ali@0: shows up as a new paragraph beginning with lower case.
ali@0: Sometimes the blank line is deliberate, as when a
ali@0: quotation is inserted in a speech. Use your judgement.
ali@0:
ali@0:
ali@0: Line 2987 - Extra period?
ali@0:
ali@0: An extra period. is a. common problem in OCRed text. and usually
ali@0: arises when a speck of dust on the page is mistaken for a period.
ali@0: or. as occasionally happens. when a comma loses its tail.
ali@0:
ali@0:
ali@0: Line 3012 column 12 - Double punctuation?
ali@0:
ali@0: Double punctuation., like that,, is a common typo and
ali@0: scanno. Some books have much legit double punctuation,
ali@0: like etc., etc., but it's worth checking anyway.
ali@0:
ali@0:
ali@0:
ali@0: * * * *
ali@0:
ali@0: For Windows-only users who are unfamiliar with DOS:
ali@0:
ali@0: If you're a Windows-only user, you need to save
ali@0: gutcheck.exe into the folder (directory) where the
ali@0: text file you want to check is. Let's say your
ali@0: text file is in C:\GUT, then you should save
ali@0: GUTCHECK.EXE into C:\GUT.
ali@0:
ali@0: Now get to a DOS prompt. You can do this by
ali@0: selecting the "Command Prompt" or "MS-DOS Prompt"
ali@0: option that will be somewhere on your
ali@0: Start/Programs menu.
ali@0:
ali@0: Now get into the C:\GUT directory.
ali@0: You can do this using the CD (change directory)
ali@0: command, like this:
ali@0: CD \GUT
ali@0: and your prompt will change to
ali@0: C:\GUT>
ali@0: so you know you're in the right place.
ali@0:
ali@0: Now type
ali@0: gutcheck yourfile.txt
ali@0: and you'll see gutcheck's report
ali@0:
ali@0: By default, gutcheck prints its queries to screen.
ali@0: If you want to create a file of them, to edit
ali@0: against the text, you can use the greater-than
ali@0: sign (>) to tell it to output the report to a
ali@0: file. For example, if you want its report in a
ali@0: file called QUERIES.LST, you could type
ali@0:
ali@0: gutcheck yourfile.txt > queries.lst
ali@0:
ali@0: The queries.lst file will then contain the listing
ali@0: of possible formatting errors, and you can
ali@0: edit it alongside your text.
ali@0:
ali@0: Whatever you do, DON'T make the filename after
ali@0: the greater-than sign the name of a file already
ali@0: on your disk that you want to keep, because
ali@0: the greater-than sign will cause gutcheck to
ali@0: replace any existing file of that name.
ali@0:
ali@0: So, for example, if you have two Tolstoy files
ali@0: that you want to check, called WARPEACE.TXT and
ali@0: ANNAK.TXT, make sure that neither of these names
ali@0: is ever used following the greater-than sign.
ali@0: To check these correctly, you might do:
ali@0:
ali@0: gutcheck warpeace.txt >war.lst
ali@0:
ali@0: and
ali@0:
ali@0: gutcheck annak.txt > annak.lst
ali@0:
ali@0: separately. Then you can look at war.lst and annak.lst
ali@0: to see the gutcheck reports.
ali@0:
ali@0: * * * *
ali@0:
ali@0:
ali@0: For existing 0.98 users upgrading to 0.99:
ali@0:
ali@0: If you run on old 16-bit DOS or Windows 3.x, I'm afraid
ali@0: you're out of luck. I'm not saying it _can't_ be compiled
ali@0: to run on 16-bit, but the executable with the package is
ali@0: for Win32 only. *nix users won't notice the change at all.
ali@0:
ali@0:
ali@0: There are two new switches: -u and -d.
ali@0: See above for full rundown.
ali@0:
ali@0:
ali@0: Here's a list of the new errors:
ali@0:
ali@0: Line 1456 - Carat character?
ali@0:
ali@0: I^ve found a few.
ali@0:
ali@0:
ali@0: Line 1821 - Forward slash?
ali@0:
ali@0: Common error for italicized "I", or so /'ve found.
ali@0:
ali@0:
ali@0: Line 2139 - Query missing paragraph break?
ali@0:
ali@0: "Come here, son." "Do I _have_ to go, dad?"
ali@0: Like that. False positives in some texts. Sorry 'bout that,
ali@0: but these are often errors.
ali@0:
ali@0:
ali@0: Line 2200 - Query had/bad error?
ali@0:
ali@0: Clear enough. Doesn't catch as many as I'd like it to,
ali@0: but rarely gives false alarms.
ali@0:
ali@0:
ali@0: Line 2268 - Query punctuation after the?
ali@0:
ali@0: Some words, like "the", very rarely have punctuation
ali@0: following them. Others, like "Mrs", usually have a
ali@0: period, but never a comma. Occasional false positives.
ali@0:
ali@0:
ali@0: Line 2380 - Query possible scanno arid
ali@0:
ali@0: It found one of your user-defined typos when you
ali@0: used the -u switch.
ali@0:
ali@0:
ali@0: Line 2511 - Capital "S"?
ali@0:
ali@0: Surprisingly common specific case, like: Jane'S
ali@0:
ali@0:
ali@0: Line 3469 - endquote missing punctuation?
ali@0:
ali@0: OK. This one can really cause a lot of false positives
ali@0: in some books, but it switches itself off if it finds
ali@0: more than 20 in a text, unless you force it to list them
ali@0: all with the -v switch.
ali@0: "Hey, dad" Johnny said, "can we go now?"
ali@0: is a common punctuation-missing error.
ali@0:
ali@0:
ali@0: Line 4266 - Mismatched underscores?
ali@0:
ali@0: Like mismatched anything else!
ali@0:
ali@0: