Gutcheck documentation gutcheck: lists possible common formatting errors in a Project Gutenberg candidate file. It is a command line program and can be used under Win32 or Unix (gutcheck.c should compile anywhere; if it doesn't, tell me). For Windows-only people, there is an appendix at the end with brief instructions for running it. Current version: 0.99. Users of 0.98 see end of file for changes. You should also have received the licence file COPYING, a README file, gutcheck.c, the source code, and gutcheck.exe, a DOS executable, with this file. This software is Copyright Jim Tinsley 2000-2005. Gutcheck comes wih ABSOLUTELY NO WARRANTY. For details, read the file COPYING. This is Free Software; you may redistribute it under certain conditions (GPL). See http://gutcheck.sourceforge.net for the latest version. Usage is: gutcheck [-setopxlywm] filename where: -s checks Single quotes -e switches off Echoing of lines -t checks Typos -o produces an Overview only -p sets strict quotes checking for Paragraphs -x (paranoid) switches OFF typo checking and extra checks -l turns off Line-end checks -y sets error messages to stdout -w is a special mode for web uploads (for future use) -v (verbose) forces individual reporting of minor problems -m interprets Markup of some common HTML tags and entities -u warns about words in a user-defined typo file gutcheck.typ -d ignores some DP-specific markup Running gutcheck without any parameters will display a brief help message. Sample usage: gutcheck warpeace.txt More detail: Echoing lines (-e to switch off) You may find it convenient, when reviewing Gutcheck's suggestions, to see the line that Gutcheck is questioning. That way, you can often see at a glance whether it is a real error that needs to be fixed, or a false positive that should be in the text, but Gutcheck's limited programming doesn't understand. By default, gutcheck echoes these lines, but if you don't want to see the lines referred to, -e will switch it OFF. Quotes (-s and -p switches) Gutcheck always looks for unbalanced doublequotes in a paragraph. It is a common convention for writers not to close quotes in a paragraph if the next paragraph opens with quotes and is a continuation by the same speaker. Gutcheck therefore does not normally report unclosed quotes if the next paragraph begins with a quote. If you need to see all unclosed quotes, even where the next paragraph begins with a quote, you should use the -p switch. Singlequotes (') are a problem, since the same character is used for an apostrophe. I'm not sure that it is possible to get 100% accuracy on singlequotes checking, particularly since dialect, quite common in PG texts, upsets the normal rules so badly. Consider the sentence: 'Tis often said that a man's a man for a' that. As humans, we recognize that both apostrophes are used for contractions rather than quotes, but it isn't easy to get a program to recognize that. Since Gutcheck makes too many mistakes when trying to match singlequotes, it doesn't look for unbalanced singlequotes unless you specify the -s switch. Consider these sentences, which illustrate the main cases: 'Tis often said that a fool and his money are soon parted. 'Becky's goin' home,' said Tom. The dogs' tails wagged in unison. Those 'pack dogs' of yours look more like wolves. Typos (-t switch) It's not Gutcheck's job to be a spelling checker, but it does check for a list of common typos and OCR errors if you use the -t switch. (The -x switch also turns typo checking on.) It also checks for character combinations, especially involving h and b, which are often confused by OCR, that rarely or never occur. For example, it queries "tbe" in a word. Now, "the" often occurs, but "tbe" is very rare (heartbeat, hotbed), so I'm playing the odds - a few false positives for many errors found. Similarly with "ii", which is a very common OCR error. Gutcheck suppresses multiple reporting of the first 40 "typos" found. This is to remove the annoyance of seeing something like "FN" (footnote) or "LK" (initials) flagged as a typo 147 times in a text. Line-end checking (-l switch to disable) All PG texts should have a Carriage Return (CR - character 13) and a Line Feed (LF - character 10) at end of each line, regardless of what O/S you made them on. DOS/Windows, Unix and Mac have different conventions, but the final text should always use a CR/LF pair as its line terminator. By default, Gutcheck verifies that every line does have the correct terminator, but if you're on a work-in-progress in Linux, you might want to convert the line-ends as a final step, and not want to see thousands of errors every time you run Gutcheck before that final step, so you can turn off this checking with the -l switch. Paranoid mode (-x switch to disable: Trust No One :-) -x switches OFF typo-checking, the -t flag, automatically and some extra checks like standalone 1 and 0 queries. Overview mode (-o switch) This mode just gives a count of queries found instead of a detailed list. Header quote (-h switch) If you use the -h switch, gutcheck will also display the Title, Author, Release and Edition fields from the PG header. This is useful mostly for the automated checks we do on recently-posted texts. Errors to stdout (-y switch) If you're just running gutcheck normally, you can ignore this. It's only there for programs that provide a front end to gutcheck. It makes error messages appear within the output of gutcheck so that the front end knows whether gutcheck ran OK. Verbose reporting (-v switch) Normally, if gutcheck sees lots of long lines, short lines, spaced dashes, non-ASCII characters or dot-commas ".," it assumes these are features of the text, counts and summarizes them at the top of its report, but does not list them individually. If the -v switch is on, gutcheck will list them all. Markup interpretation (-m switch) Normally, gutcheck flags anything it suspects of being HTML markup as a possible error. When you use the -m switch, however, it matches anything that looks like markup against a short list of common HTML tags and entities. If the markup is in that list, it either ignores the markup, in the case of a tag, or "interprets" the markup as its nearest ASCII equivalent, in the case of an entity. So, for example, using this switch, gutcheck will "see" “He went thataway!” as "He went thataway!" and report accordingly. This switch does not, not, NOT check the validity of HTML; it exists so that you can run gutcheck on most HTML texts for PG, and get sane results. It does not support all tags. It does not support all entities. When it sees a tag or entity it does not recognize, it will query it as HTML just as if you hadn't specified the -m switch. Gutcheck 0.99 will automatically switch on markup interpretation if it sees a lot of tags that appear to be markup, so mostly, you won't have to specify this. User-defined typos (-u switch) If you have a file named gutcheck.typ either in your current working directory or in the directory from which you explicitly invoked gutcheck, but not necessarily on your path, and if you specify the -u switch, gutcheck will query any word specified in that file. The file is simple: one word, in lower case, per line. 999 lines are allowed for. Be careful not to put multiple words onto a line, or leave any rubbish other than the word on the line. You should have received a sample file gutcheck.typ with this package. Ignore DP markup (-d switch) Distributed Proofreaders (http://www.pgdp.net) is currently (2005) the main source of PG texts, and proofers there use special conventions. This switch understands those conventions, so that people can use gutcheck on files in process that still haven't had the special conventions removed yet. The special conventions supported in 0.99 are page-separators and "", "", "/*", "*/", "/#", "#/", "/$", "$/". You will probably only run gutcheck on a text once or maybe twice, just prior to uploading; it usually finds a few formatting problems; it also usually finds queries that aren't problems at all - it often questions Tables of Contents for having short lines, for example. These are called "false positives", and need a human to decide on them. The text should be standard prose, and already close to PG normal format (plain text, about 70 characters per line with blank lines between paragraphs). Gutcheck merely draws your attention to things that might be errors. It is NOT a substitute for human judgement. Formatting choices like short lines may be for a reason that this program can't understand. Even the most careful human proofing can leave errors behind in a text, and there are several automated checks you can do to help find them. Of these, spellchecking (with _very_ careful human judgement) is the most important and most useful. Gutcheck does perform some basic typo-checking if you ask it to, but its focus is on formatting errors specific to PG texts - mismatched quotes, non-ASCII characters, bad spacing, bad line length, HTML tags perhaps left from a conversion, unbalanced brackets. Suggestions for additional checks would be appreciated and duly considered, but no guarantees that they will be implemented. How do _I_ use it? Practically everyone I give gutcheck to asks me how _I_ use it. Well, when I get a text for posting, say filename.txt, I run gutcheck -o filename.txt That gives me a quick idea what I'm dealing with. It'll tell me what kind of problems gutcheck sees, and give me an idea of how much more work needs to be done on the text. Keep in mind that gutcheck doesn't do anything like a full spellcheck, but when I see a text that has a lot of problems, I assume that it probably needs a spellcheck too. Having got a feel for the ballpark, I run gutcheck filename.txt > jj where jj is my personal, all-purpose filename for temporary data that doesn't need to be kept. Then I open filename.txt and jj in a split-screen view in my editor, and work down the text, fixing whatever needs fixing, and skipping whatever doesn't. If your editor doesn't split-screen, you can get much the same effect by opening your original file in your normal editor, and jj (or your equivalent name) in something like Notepad, keeping both in view at the same time. Twice a day, an automatic process looks at all recently-posted texts, and emails Michael, me, and sometimes other people with their gutcheck summaries. Future development of gutcheck Gutcheck has gone about as far as it can, given its current structure. In order to add better singlequotes checking, sentence checking, better he/be checking and other good stuff that I'd like to see, I'll have to rewrite it from a different angle - looking at the syntax instead of the lines. And I'll probably get around to that sooner or later. Meantime, I'm just trying to get this version stabilized, so please report any bugs you find. When it is stable, I'll run up a Windows port for those timid souls who can't look a command line in the eye. :-) And I've started work on gutspell, a companion to gutcheck which will concentrate on spelling problems. PG spelling problems are unusual, since the range of texts we cover is so wide, and I'll be taking a somewhat unorthodox approach to writing this spelling-checker _specifically_ for texts containing a lot of dialect and uncommon words that have probably already been spell-checked against a standard modern dictionary. Explanations of common gutcheck messages: --> 74 lines in this file have white space at end PG texts shouldn't have extra white space added at end of line. Don't worry too much about this; they're not doing any harm, and they'll be removed during posting anyway. --> 348 lines in this file are short. Not reporting short lines. --> 84 lines in this file are long. Not reporting long lines. --> 8 lines in this file are VERY long! If there are a lot of long or short lines, Gutcheck won't list them individually. The short lines version of this message is commonly seen when gutchecking poetry and some plays, where the normal line length is shorter than the standard for prose. A "VERY long" line is one over 80 characters. You normally shouldn't have any of these, but sometimes you may have to render a table that must be that long, or some special preformatted quotation that can't be broken. --> There are 75 spaced dashes and em-dashes in this file. Not reporting them. The PG standard for an emdash--like these--is two minus signs with no spaces before or after them. However, some older texts used spaced dashes - like these -- and if there are very many such spaced dashes in the file, gutcheck just draws your attention to it and doesn't list them individually. Line 3020 - Non-ASCII character 233 Standard PG texts should use only ASCII characters with values up to 127; however, non-English, accented characters can be represented according to several different non-ASCII encoding schemes, using values over 127. If you have a plain English text with a few accented characters in words like cafe or tete-a-tete, you should replace the accented characters with their unaccented versions. The English pound sign is another commonly-seen non-ASCII character. If you have enough non-ASCII characters in your text that you feel removing them would degrade your text unacceptably, you should probably consider doing an 8-bit text as well as a plain-ASCII version. Line 1207 - Non-ISO-8859 character 156 Even in "8-bit" texts, there are distinctions between code sets. The ISO-8859 family of 8-bit code sets is the most commonly used in PG, and these sets do not define values in the range 128 through 159 as printable characters. It's quite common for someone on a Windows or Mac machine to use a non-ISO character inadvertently, so this message warns that the character is not only not ASCII, but also outside the ISO-8859 range. Line 46 - Tab character? Some editors and WPs will put in Tab characters (character 9) to indicate indented text. You should not use these in a PG text, because you can't be sure how they will appear on a reader's screen. Find the Tab, and replace it with the appropriate number of spaces. Line 1327 - Tilde character? The tilde character (~) might be legitimately used, but it's the character commonly used by OCR software to indicate a place where it couldn't make out the letter, so gutcheck flags it. Line 1347 - Asterisk? Asterisks are reported only in paranoid mode (see -x). Like tildes, they are often used to indicate errors, but they are also legitimately used as line delimiters and footnote markers. Line 1451 - Long line 129 PG texts should have lines shorter than 76. There may be occasions where you decide that you really have to go out to 79 characters, but the sample above says that line 1451 is 129 characters long - probably two lines run together. Line 1590 - Short line? PG texts should have lines longer than 54 characters. However, there are special cases like poetry and tables of contents where the lines _should_ be shorter. So treat Gutcheck warnings about short lines carefully. Sometimes it's a genuine formatting problem; sometimes the line really needs to be short. Hint: gutcheck will not flag lines as short if they are indented - if they start with a space. I like to start inserted stanzas and other such items indented with a couple of spaces so that they stand out from the main text anyway. Line 1804 - Begins with punctuation? Lines should normally not begin with commas, periods and so on. An exception is ellipses . . . which can happen at start of line. Line 1850 - Spaced em-dash? The PG standard for an em-dash--like these--is two minus signs with no spaces before or after them. Gutcheck flags non-PG em-dashes - like this one. Normally, you will replace it with a PG-standard em-dash. Line 1904 - Query he/be error? Gutcheck makes a very minor effort to look for that scourge of all proofreaders, "be" replacing "he" or vice-versa, and draws your attention to it when it thinks it has found one. Line 2017 - Query digit in a1most The digit 1 is commonly OCRed for the letter l, the digit 0 for the letter O, and so on. When gutcheck sees a mix of digits and letters, it warns you. It may generate a false positive for something like 7am. Line 2083 - Query standalone 0 In paranoid mode (see -x) only, gutcheck warns about the digit 0 and the number 1 standing alone as a word. This can happen if the OCR misreads the words O or I. Line 2115 - Query word whetber If you have switched typo-checking on, gutcheck looks for potential typos, especially common h/b errors. It's not infallible; it sometimes queries legit words, but it's always worth taking a look. Line 2190 column 14 - Missing space? Omitting a space is a very common error,especially coming from OCRed text,and can be hard for a human to spot. The commas in the previous sentence illustrate the kind of thing I mean. Line 2240 column 48 - Spaced punctuation? The flip side of the "missing space" error , here , is when extra spaces are added before punctuation . Some old texts appear to add extra spaces around punctuation consistently, but this was a typographical convention rather than the author's intent, and the extra "spaces" should be removed when preparing a PG text. Line 2301 column 19 - Unspaced quotes? Another common spacing problem occurs in a phrase like "You wait there,"he said. Line 2385 column 27 - Wrongspaced quotes? As of version 0.98, gutcheck adds extra checks on whether a quote seems to be a start or end quote, and queries those that appear to be misplaced. This does give rise to false positives when quotes are nested, for example: "And how," she asked, "will your "friends" help you now?" but these false positives are worth it because of the many cases that this test catches, notably those like: "And how, "she said," will your friends help you now?" Sometimes a "wrongspaced quotes" query will arise because an earlier quote in the paragraph was omitted, so if the place specified seems to be OK, look back to see whether there's a problem in the preceding lines. Line 2400 - HTML Tag?

    Some PG texts have been converted from HTML, and not all of the
    HTML tags have been removed.



    Line 2402 - HTML symbol? &emdash;

    Similarly, special HTML symbol characters can survive into PG
    texts. Can occasionally produce amusing false positives like
    . . . Marwick & Co were well known for it;



    Line 2540 - Mismatched quotes

    Another gutcheck mainstay - unclosed doublequotes in a paragraph.
    See the discussion of quotes in the switches section near the
    start of this file.
    
    Since the mismatch doesn't occur on any one line, gutcheck quotes
    the line number of the first blank line following the paragraph,
    since this is the point where it reconciles the count of quotes.
    However, if gutcheck is echoing lines, that is, you haven't used
    the -e switch, it will show the _first_ line of the paragraph, 
    to help you find the place without using line numbers. The 
    offending paragraph is therefore between the quoted line and 
    the line number given.



    Line 2587 - Mismatched single quotes

    Only checked with the -s switch, since checking single quotes is 
    not a very reliable process. Otherwise, the same logic as for 
    doublequotes applies.



    Line 2877 - Mismatched round brackets?

    Also curly and square brackets. Texts with a lot of brackets, like
    plays with bracketed stage instructions, may have mismatches.


    Line 3150 - No CR?
    Line 3204 - Two successive CRs?
    Line 3281 position 75 - CR without LF?

    These are the invalid line-end warnings. See the discussion of
    line-end checking in the switches section near the start of this
    file. If you see these, and your editor doesn't show anything
    wrong, you should probably try deleting the characters just before
    and after the line end, and the line-end itself, then retyping the
    characters and the line-end.


    Line 2940 - Paragraph starts with lower-case

    A common error in an e-text is for an extra blank line

    to be put in, like the blank line above, and this often
    shows up as a new paragraph beginning with lower case.
    Sometimes the blank line is deliberate, as when a 
    quotation is inserted in a speech. Use your judgement.


    Line 2987 - Extra period?

    An extra period. is a. common problem in OCRed text. and usually
    arises when a speck of dust on the page is mistaken for a period.
    or. as occasionally happens. when a comma loses its tail.


    Line 3012 column 12 - Double punctuation?

    Double punctuation., like that,, is a common typo and
    scanno. Some books have much legit double punctuation,
    like etc., etc., but it's worth checking anyway.



            *       *       *        *

For Windows-only users who are unfamiliar with DOS:

    If you're a Windows-only user, you need to save
    gutcheck.exe into the folder (directory) where the
    text file you want to check is. Let's say your
    text file is in C:\GUT, then you should save
    GUTCHECK.EXE into C:\GUT.

    Now get to a DOS prompt. You can do this by
    selecting the "Command Prompt" or "MS-DOS Prompt"
    option that will be somewhere on your
    Start/Programs menu.

    Now get into the C:\GUT directory. 
    You can do this using the CD (change directory) 
    command, like this:
        CD \GUT
    and your prompt will change to 
        C:\GUT>
    so you know you're in the right place.

    Now type
        gutcheck yourfile.txt
    and you'll see gutcheck's report

    By default, gutcheck prints its queries to screen.
    If you want to create a file of them, to edit
    against the text, you can use the greater-than
    sign (>) to tell it to output the report to a
    file. For example, if you want its report in a
    file called QUERIES.LST, you could type
    
        gutcheck yourfile.txt > queries.lst

    The queries.lst file will then contain the listing
    of possible formatting errors, and you can
    edit it alongside your text.

    Whatever you do, DON'T make the filename after
    the greater-than sign the name of a file already
    on your disk that you want to keep, because
    the greater-than sign will cause gutcheck to
    replace any existing file of that name.

    So, for example, if you have two Tolstoy files
    that you want to check, called WARPEACE.TXT and 
    ANNAK.TXT, make sure that neither of these names
    is ever used following the greater-than sign.
    To check these correctly, you might do:

    gutcheck warpeace.txt >war.lst

    and

    gutcheck annak.txt > annak.lst

    separately. Then you can look at war.lst and annak.lst
    to see the gutcheck reports.

            *       *       *        *


For existing 0.98 users upgrading to 0.99:

    If you run on old 16-bit DOS or Windows 3.x, I'm afraid
    you're out of luck. I'm not saying it _can't_ be compiled
    to run on 16-bit, but the executable with the package is
    for Win32 only. *nix users won't notice the change at all.


    There are two new switches: -u and -d. 
          See above for full rundown.


Here's a list of the new errors:

    Line 1456 - Carat character?

    I^ve found a few.


    Line 1821 - Forward slash?

    Common error for italicized "I", or so /'ve found.


    Line 2139 - Query missing paragraph break?

    "Come here, son." "Do I _have_ to go, dad?"
    Like that. False positives in some texts. Sorry 'bout that,
    but these are often errors.


    Line 2200 - Query had/bad error?

    Clear enough. Doesn't catch as many as I'd like it to,
    but rarely gives false alarms.


    Line 2268 - Query punctuation after the?

    Some words, like "the", very rarely have punctuation
    following them. Others, like "Mrs", usually have a
    period, but never a comma. Occasional false positives.


    Line 2380 - Query possible scanno arid

    It found one of your user-defined typos when you
    used the -u switch.


    Line 2511 - Capital "S"?

    Surprisingly common specific case, like: Jane'S 

    
    Line 3469 - endquote missing punctuation?

    OK. This one can really cause a lot of false positives
    in some books, but it switches itself off if it finds
    more than 20 in a text, unless you force it to list them
    all with the -v switch.
    "Hey, dad" Johnny said, "can we go now?"
    is a common punctuation-missing error.


    Line 4266 - Mismatched underscores?

    Like mismatched anything else!