gutcheck: lists possible common formatting errors in a Project
Gutenberg candidate file. It is a command line program and can be used
under Win32 or Unix (gutcheck.c should compile anywhere; if it doesn't,
tell me). For Windows-only people, there is an appendix at the end
with brief instructions for running it.

Current version: 0.99. Users of 0.98 see end of file for changes.

You should also have received the licence file COPYING, a README file,
gutcheck.c, the source code, and gutcheck.exe, a DOS executable, with

This software is Copyright Jim Tinsley 2000-2005.

Gutcheck comes with ABSOLUTELY NO WARRANTY. For details, read the file COPYING.
This is Free Software; you may redistribute it under certain conditions (GPL).

See http://gutcheck.sourceforge.net for the latest version.

Usage is: gutcheck [-setopxlywm] filename

    -s checks Single quotes
    -e switches off Echoing of lines
    -o produces an Overview only
    -p sets strict quotes checking for Paragraphs
    -x (paranoid) switches OFF typo checking and extra checks
    -l turns off Line-end checks
    -y sets error messages to stdout
    -w is a special mode for web uploads (for future use)
    -v (verbose) forces individual reporting of minor problems
    -m interprets Markup of some common HTML tags and entities
    -u warns about words in a user-defined typo file gutcheck.typ
    -d ignores some DP-specific markup

Running gutcheck without any parameters will display a brief help message.

Echoing lines (-e to switch off)

You may find it convenient, when reviewing Gutcheck's
suggestions, to see the line that Gutcheck is questioning.
That way, you can often see at a glance whether it is
a real error that needs to be fixed, or a false positive
that should be in the text, but Gutcheck's limited
programming doesn't understand.

By default, gutcheck echoes these lines, but if you don't
want to see the lines referred to, -e will switch it OFF.

Quotes (-s and -p switches)

Gutcheck always looks for unbalanced doublequotes in a
paragraph. It is a common convention for writers not to
close quotes in a paragraph if the next paragraph opens
with quotes and is a continuation by the same speaker.

Gutcheck therefore does not normally report unclosed quotes
if the next paragraph begins with a quote. If you need
to see all unclosed quotes, even where the next paragraph
begins with a quote, you should use the -p switch.
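
The rule above can be sketched in a few lines of Python. This is
only an illustration of the idea, not gutcheck's actual C code; the
function name unbalanced_paragraphs and the strict parameter (which
stands in for the -p switch) are made up for the example.

```python
def unbalanced_paragraphs(paragraphs, strict=False):
    """Return indexes of paragraphs with an odd number of double quotes.

    strict stands in for the -p switch (an assumption of this sketch):
    when False, an unclosed quote is tolerated if the next paragraph
    opens with a double quote.
    """
    flagged = []
    for i, para in enumerate(paragraphs):
        if para.count('"') % 2 == 0:
            continue  # quotes balance, nothing to report
        next_opens_with_quote = (
            i + 1 < len(paragraphs)
            and paragraphs[i + 1].lstrip().startswith('"')
        )
        if strict or not next_opens_with_quote:
            flagged.append(i)
    return flagged
```

With strict left off, a speech that continues into the next
paragraph is passed over; with strict on, it is reported like any
other unclosed quote.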

Singlequotes (') are a problem, since the same character
is used for an apostrophe. I'm not sure that it is
possible to get 100% accuracy on singlequotes checking,
particularly since dialect, quite common in PG texts,
upsets the normal rules so badly. Consider the sentence:
    'Tis often said that a man's a man for a' that.
As humans, we recognize that all three apostrophes are used
for contractions rather than quotes, but it isn't easy
to get a program to recognize that.

Since Gutcheck makes too many mistakes when trying to match
singlequotes, it doesn't look for unbalanced singlequotes
unless you specify the -s switch.

Consider these sentences, which illustrate the main cases:

    'Tis often said that a fool and his money are soon parted.

    'Becky's goin' home,' said Tom.

    The dogs' tails wagged in unison.

    Those 'pack dogs' of yours look more like wolves.

It's not Gutcheck's job to be a spelling checker, but it
does check for a list of common typos and OCR errors unless you
switch that off with the -t switch. (The -x switch also turns
typo checking off.)

It also checks for character combinations, especially involving
h and b, which are often confused by OCR, that rarely or never
occur. For example, it queries "tbe" in a word. Now, "the" often
occurs, but "tbe" is very rare (heartbeat, hotbed), so I'm
playing the odds - a few false positives for many errors found.
Similarly with "ii", which is a very common OCR error.
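
To show what "playing the odds" means in practice, here is a rough
Python sketch. It is not gutcheck's real list or logic: SUSPECT
holds only the two combinations named above, and the function name
suspect_words is invented for the example.

```python
import re

# Only the two combinations mentioned in the text; the real list in
# gutcheck is not reproduced here.
SUSPECT = ("tbe", "ii")

def suspect_words(line):
    """Return the words on a line that contain a suspect combination."""
    words = re.findall(r"[A-Za-z']+", line)
    return [w for w in words if any(s in w.lower() for s in SUSPECT)]
```

Run on a line containing "Tbe", "heartbeat" and "Hawaii", this flags
all three: one real OCR error and two false positives that a human
has to wave through.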

Gutcheck suppresses multiple reporting of the first 40 "typos"
found. This is to remove the annoyance of seeing something like
"FN" (footnote) or "LK" (initials) flagged as a typo 147 times

Line-end checking (-l switch to disable)

All PG texts should have a Carriage Return (CR - character 13)
and a Line Feed (LF - character 10) at end of each line,
regardless of what O/S you made them on. DOS/Windows, Unix
and Mac have different conventions, but the final text should
always use a CR/LF pair as its line terminator.

By default, Gutcheck verifies that every line does have
the correct terminator, but if you're on a work-in-progress
in Linux, you might want to convert the line-ends as a final
step, and not want to see thousands of errors every time you
run Gutcheck before that final step, so you can turn off
this checking with the -l switch.
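
If you want to see for yourself which lines lack the CR/LF pair, a
small Python sketch along these lines will do it. It is a simplified
stand-in, not gutcheck's own check: the name bad_line_ends is
invented, and a stray CR in the middle of a line is not detected.

```python
def bad_line_ends(data: bytes):
    """Return 1-based numbers of lines not terminated by CR (13) + LF (10)."""
    bad = []
    lines = data.split(b"\n")
    if lines and lines[-1] == b"":
        lines.pop()  # drop the empty piece after a final LF
    for number, line in enumerate(lines, start=1):
        if not line.endswith(b"\r"):
            bad.append(number)
    return bad
```

The file has to be read in binary mode, for example with
open(name, "rb").read(), or the line ends will be translated before
the check ever sees them.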

Paranoid mode (-x switch to disable: Trust No One :-)

-x automatically switches OFF typo-checking (the -t flag)
and some extra checks, like standalone 1 and 0 queries.

Overview mode (-o switch)

This mode just gives a count of queries found
instead of a detailed list.

Header quote (-h switch)

If you use the -h switch, gutcheck will also display
the Title, Author, Release and Edition fields from the
PG header. This is useful mostly for the automated
checks we do on recently-posted texts.

Errors to stdout (-y switch)

If you're just running gutcheck normally, you can ignore
this. It's only there for programs that provide a front
end to gutcheck. It makes error messages appear within
the output of gutcheck so that the front end knows whether

Verbose reporting (-v switch)

Normally, if gutcheck sees lots of long lines, short lines,
spaced dashes, non-ASCII characters or dot-commas ".," it
assumes these are features of the text, counts and summarizes
them at the top of its report, but does not list them
individually. If the -v switch is on, gutcheck will list them all.

Markup interpretation (-m switch)

Normally, gutcheck flags anything it suspects of being HTML
markup as a possible error. When you use the -m switch,
however, it matches anything that looks like markup against
a short list of common HTML tags and entities. If the markup
is in that list, it either ignores the markup, in the case
of a tag, or "interprets" the markup as its nearest ASCII
equivalent, in the case of an entity. So, for example, using
this switch, gutcheck will "see"

    &ldquo;He went <i>thataway!</i>&rdquo;

as

    "He went thataway!"

and report accordingly.

This switch does not, not, NOT check the validity of HTML;
it exists so that you can run gutcheck on most HTML texts
for PG, and get sane results. It does not support all tags.
It does not support all entities. When it sees a tag or entity
it does not recognize, it will query it as HTML just as if
you hadn't specified the -m switch.
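
The behaviour described above might be sketched like this in Python.
The two tables are hypothetical: gutcheck's real list of supported
tags and entities is not given in this file, so KNOWN_TAGS and
KNOWN_ENTITIES hold only a few plausible entries, and the function
name interpret_markup is invented.

```python
import re

# Hypothetical short lists, for illustration only.
KNOWN_TAGS = {"i", "b", "em"}
KNOWN_ENTITIES = {"ldquo": '"', "rdquo": '"', "amp": "&"}

def interpret_markup(line):
    """Drop known tags and replace known entities with ASCII."""
    def tag(match):
        name = match.group(1).lower()
        return "" if name in KNOWN_TAGS else match.group(0)

    def entity(match):
        return KNOWN_ENTITIES.get(match.group(1), match.group(0))

    line = re.sub(r"</?([A-Za-z][A-Za-z0-9]*)>", tag, line)
    return re.sub(r"&([A-Za-z]+);", entity, line)
```

Anything not in the tables is left exactly as it was, so the usual
HTML queries would still fire on it.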

Gutcheck 0.99 will automatically switch on markup interpretation
if it sees a lot of tags that appear to be markup, so mostly, you
won't have to specify this.

User-defined typos (-u switch)

If you have a file named gutcheck.typ either in your current
working directory or in the directory from which you explicitly
invoked gutcheck, but not necessarily on your path, and if you
specify the -u switch, gutcheck will query any word specified
in that file. The file is simple: one word, in lower case, per
line. 999 lines are allowed for. Be careful not to put multiple
words onto a line, or leave any rubbish other than the word on
the line. You should have received a sample file gutcheck.typ

Ignore DP markup (-d switch)

Distributed Proofreaders (http://www.pgdp.net) is currently
(2005) the main source of PG texts, and proofers there use
special conventions. This switch understands those conventions,
so that people can use gutcheck on files in process that still
haven't had the special conventions removed yet. The special
conventions supported in 0.99 are page-separators and
"<sc>", "</sc>", "/*", "*/", "/#", "#/", "/$", "$/".

You will probably only run gutcheck on a text once or maybe twice,
just prior to uploading; it usually finds a few formatting problems;
it also usually finds queries that aren't problems at all - it often
questions Tables of Contents for having short lines, for example.
These are called "false positives", and need a human to decide on

The text should be standard prose, and already close to PG normal
format (plain text, about 70 characters per line with blank lines

Gutcheck merely draws your attention to things that might be errors.
It is NOT a substitute for human judgement. Formatting choices like
short lines may be for a reason that this program can't understand.

Even the most careful human proofing can leave errors behind in a
text, and there are several automated checks you can do to help find
them. Of these, spellchecking (with _very_ careful human judgement) is
the most important and most useful.

Gutcheck does perform some basic typo-checking if you ask it to,
but its focus is on formatting errors specific to PG texts -
mismatched quotes, non-ASCII characters, bad spacing, bad line
length, HTML tags perhaps left from a conversion, unbalanced

Suggestions for additional checks would be appreciated and duly
considered, but no guarantees that they will be implemented.

Practically everyone I give gutcheck to asks me how _I_ use it.
Well, when I get a text for posting, say filename.txt, I run

    gutcheck -o filename.txt

That gives me a quick idea what I'm dealing with. It'll tell
me what kind of problems gutcheck sees, and give me an idea
of how much more work needs to be done on the text. Keep in
mind that gutcheck doesn't do anything like a full spellcheck,
but when I see a text that has a lot of problems, I assume that
it probably needs a spellcheck too.

Having got a feel for the ballpark, I run

    gutcheck filename.txt > jj

where jj is my personal, all-purpose filename for temporary data
that doesn't need to be kept. Then I open filename.txt and jj in
a split-screen view in my editor, and work down the text, fixing
whatever needs fixing, and skipping whatever doesn't. If your
editor doesn't split-screen, you can get much the same effect by
opening your original file in your normal editor, and jj (or your
equivalent name) in something like Notepad, keeping both in view

Twice a day, an automatic process looks at all recently-posted
texts, and emails Michael, me, and sometimes other people with
their gutcheck summaries.

Future development of gutcheck

Gutcheck has gone about as far as it can, given its current
structure. In order to add better singlequotes checking,
sentence checking, better he/be checking and other good stuff
that I'd like to see, I'll have to rewrite it from a different
angle - looking at the syntax instead of the lines. And I'll
probably get around to that sooner or later.

Meantime, I'm just trying to get this version stabilized, so
please report any bugs you find. When it is stable, I'll run
up a Windows port for those timid souls who can't look a
command line in the eye. :-)

And I've started work on gutspell, a companion to gutcheck
which will concentrate on spelling problems. PG spelling
problems are unusual, since the range of texts we cover is
so wide, and I'll be taking a somewhat unorthodox approach
to writing this spelling-checker _specifically_ for texts
containing a lot of dialect and uncommon words that have
probably already been spell-checked against a standard

Explanations of common gutcheck messages:

    --> 74 lines in this file have white space at end

PG texts shouldn't have extra white space added at end of line.
Don't worry too much about this; they're not doing any harm,
and they'll be removed during posting anyway.

    --> 348 lines in this file are short. Not reporting short lines.
    --> 84 lines in this file are long. Not reporting long lines.
    --> 8 lines in this file are VERY long!

If there are a lot of long or short lines, Gutcheck won't list
them individually. The short lines version of this message
is commonly seen when gutchecking poetry and some plays, where
the normal line length is shorter than the standard for prose.
A "VERY long" line is one over 80 characters. You normally
shouldn't have any of these, but sometimes you may have to render
a table that must be that long, or some special preformatted
quotation that can't be broken.

    --> There are 75 spaced dashes and em-dashes in this file. Not reporting them.

The PG standard for an emdash--like these--is two minus signs
with no spaces before or after them. However, some older texts
used spaced dashes - like these -- and if there are very many
such spaced dashes in the file, gutcheck just draws your
attention to it and doesn't list them individually.
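
A rough way to count spaced dashes yourself is sketched below in
Python. The pattern is a guess at what "spaced" means (a single
hyphen with a space on both sides, or a double hyphen with a space
on either side); it is not taken from gutcheck's source, and the
name count_spaced_dashes is invented.

```python
import re

# Assumed definition of a spaced dash, for illustration only.
SPACED_DASH = re.compile(r"\s--|--\s|\s-\s")

def count_spaced_dashes(text):
    """Count dashes that have white space next to them."""
    return len(SPACED_DASH.findall(text))
```

An unspaced emdash--like this--scores zero, while each of the
dashes in "spaced dashes - like these -- are not" counts once.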

    Line 3020 - Non-ASCII character 233

Standard PG texts should use only ASCII characters with values
up to 127; however, non-English, accented characters can be
represented according to several different non-ASCII encoding
schemes, using values over 127. If you have a plain English text
with a few accented characters in words like cafe or tete-a-tete,
you should replace the accented characters with their unaccented
versions. The English pound sign is another commonly-seen
non-ASCII character. If you have enough non-ASCII characters in
your text that you feel removing them would degrade your text
unacceptably, you should probably consider doing an 8-bit text
as well as a plain-ASCII version.

    Line 1207 - Non-ISO-8859 character 156

Even in "8-bit" texts, there are distinctions between code sets.
The ISO-8859 family of 8-bit code sets is the most commonly used
in PG, and these sets do not define values in the range 128 through
159 as printable characters. It's quite common for someone on a
Windows or Mac machine to use a non-ISO character inadvertently,
so this message warns that the character is not only not ASCII,
but also outside the ISO-8859 range.
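
The two ranges can be told apart with a few comparisons, as in this
Python sketch. The message wording is copied from the samples above;
the function name character_queries is invented, and whether
gutcheck itself issues one message or two for a value like 156 is
not stated here, so this sketch assumes one.

```python
def character_queries(line_number, raw: bytes):
    """Build queries for bytes above 127 on one line of a text."""
    queries = []
    for value in raw:
        if 128 <= value <= 159:
            # Not printable in the ISO-8859 code sets.
            queries.append(
                f"Line {line_number} - Non-ISO-8859 character {value}")
        elif value > 159:
            queries.append(
                f"Line {line_number} - Non-ASCII character {value}")
    return queries
```

For example, the e-acute in a Latin-1 encoded "cafe" has the value
233, which is what the first sample message above is reporting.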

    Line 46 - Tab character?

Some editors and WPs will put in Tab characters (character 9) to
indicate indented text. You should not use these in a PG text,
because you can't be sure how they will appear on a reader's
screen. Find the Tab, and replace it with the appropriate number

    Line 1327 - Tilde character?

The tilde character (~) might be legitimately used, but it's the
character commonly used by OCR software to indicate a place where
it couldn't make out the letter, so gutcheck flags it.

    Line 1347 - Asterisk?

Asterisks are reported only in paranoid mode (see -x).
Like tildes, they are often used to indicate errors, but they are
also legitimately used as line delimiters and footnote markers.

    Line 1451 - Long line 129

PG texts should have lines shorter than 76. There may be occasions
where you decide that you really have to go out to 79 characters,
but the sample above says that line 1451 is 129 characters long -
probably two lines run together.

    Line 1590 - Short line?

PG texts should have lines longer than 54 characters. However,
there are special cases like poetry and tables of contents where
the lines _should_ be shorter. So treat Gutcheck warnings about
short lines carefully. Sometimes it's a genuine formatting
problem; sometimes the line really needs to be short.

Hint: gutcheck will not flag lines as short if they are indented
- if they start with a space. I like to start inserted stanzas
and other such items indented with a couple of spaces so that
they stand out from the main text anyway.
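
Putting the numbers from this file together (76, 80 and 54, plus
the indentation hint), the line-length tests might look like the
Python sketch below. It is a simplification and not gutcheck's
actual logic: it looks at one line in isolation, so it would also
call the naturally short last line of a paragraph "short". The
function name and the returned labels are invented.

```python
def classify_length(line):
    """Return 'very long', 'long', 'short' or None for one line."""
    if not line.strip():
        return None  # blank lines are paragraph breaks, not short lines
    length = len(line)
    if length > 80:
        return "very long"
    if length >= 76:
        return "long"
    if length <= 54 and not line.startswith(" "):
        return "short"  # indented lines are never called short
    return None
```

A stanza indented by a couple of spaces passes through untouched,
which is the effect the hint above relies on.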

    Line 1804 - Begins with punctuation?

Lines should normally not begin with commas, periods and so on.
An exception is ellipses . . . which can happen at start of line.

    Line 1850 - Spaced em-dash?

The PG standard for an em-dash--like these--is two minus signs
with no spaces before or after them. Gutcheck flags non-PG
em-dashes - like this one. Normally, you will replace it with a

    Line 1904 - Query he/be error?

Gutcheck makes a very minor effort to look for that scourge of all
proofreaders, "be" replacing "he" or vice-versa, and draws your
attention to it when it thinks it has found one.

    Line 2017 - Query digit in a1most

The digit 1 is commonly OCRed for the letter l, the digit 0 for
the letter O, and so on. When gutcheck sees a mix of digits and
letters, it warns you. It may generate a false positive for

    Line 2083 - Query standalone 0

In paranoid mode (see -x) only, gutcheck warns about the digit 0
and the number 1 standing alone as a word. This can happen if the
OCR misreads the words O or I.
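
Both digit queries can be imitated with a short Python sketch. It
is only an approximation of what is described above: the function
name digit_queries is invented, and the paranoid parameter stands
in for paranoid mode, so passing paranoid=False plays the part of
the -x switch.

```python
import re

def digit_queries(line, paranoid=True):
    """Query words mixing digits and letters, and lone 0 or 1."""
    queries = []
    for word in re.findall(r"[A-Za-z0-9]+", line):
        has_digit = any(c.isdigit() for c in word)
        has_letter = any(c.isalpha() for c in word)
        if has_digit and has_letter:
            queries.append(f"Query digit in {word}")
        elif paranoid and word in ("0", "1"):
            queries.append(f"Query standalone {word}")
    return queries
```

Note that this sketch queries every mix of digits and letters,
legitimate or not, so a human still has to make the final call.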

    Line 2115 - Query word whetber

If you have switched typo-checking on, gutcheck looks for
potential typos, especially common h/b errors. It's not
infallible; it sometimes queries legit words, but it's
always worth taking a look.

    Line 2190 column 14 - Missing space?

Omitting a space is a very common error,especially coming from
OCRed text,and can be hard for a human to spot. The commas in
the previous sentence illustrate the kind of thing I mean.

    Line 2240 column 48 - Spaced punctuation?

The flip side of the "missing space" error , here , is when extra
spaces are added before punctuation . Some old texts appear to add
extra spaces around punctuation consistently, but this was a
typographical convention rather than the author's intent, and the
extra "spaces" should be removed when preparing a PG text.
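
The two spacing queries above are mirror images of each other, and
a pair of regular expressions is enough to sketch them in Python.
The patterns and the column convention (1-based position of the
punctuation mark) are assumptions made for this example, not taken
from gutcheck; the function name spacing_queries is invented.

```python
import re

# Assumed patterns: punctuation glued to the next word, and
# punctuation preceded by a space.
MISSING_SPACE = re.compile(r"[,;:!?][A-Za-z]")
SPACED_PUNCT = re.compile(r"[A-Za-z] [,;:!?.]")

def spacing_queries(line_number, line):
    """Report missing spaces after, and extra spaces before, punctuation."""
    queries = []
    for match in MISSING_SPACE.finditer(line):
        column = match.start() + 1  # 1-based column of the punctuation
        queries.append(
            f"Line {line_number} column {column} - Missing space?")
    for match in SPACED_PUNCT.finditer(line):
        column = match.end()  # 1-based column of the punctuation
        queries.append(
            f"Line {line_number} column {column} - Spaced punctuation?")
    return queries
```

A spaced ellipsis following a word would trip the second pattern,
which is one reason a check like this can only raise queries, not
make corrections.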

    Line 2301 column 19 - Unspaced quotes?

Another common spacing problem occurs in a phrase like "You wait

    Line 2385 column 27 - Wrongspaced quotes?

As of version 0.98, gutcheck adds extra checks on whether a quote
seems to be a start or end quote, and queries those that appear to
be misplaced. This does give rise to false positives when quotes are

    "And how," she asked, "will your "friends" help you now?"

but these false positives are worth it because of the many cases
that this test catches, notably those like:

    "And how, "she said," will your friends help you now?"

Sometimes a "wrongspaced quotes" query will arise because an earlier
quote in the paragraph was omitted, so if the place specified seems
to be OK, look back to see whether there's a problem in the preceding

    Line 2400 - HTML Tag? <PRE>

Some PG texts have been converted from HTML, and not all of the
HTML tags have been removed.

    Line 2402 - HTML symbol? &emdash;

Similarly, special HTML symbol characters can survive into PG
texts. Can occasionally produce amusing false positives like
    . . . Marwick & Co were well known for it;

    Line 2540 - Mismatched quotes

Another gutcheck mainstay - unclosed doublequotes in a paragraph.
See the discussion of quotes in the switches section near the

Since the mismatch doesn't occur on any one line, gutcheck quotes
the line number of the first blank line following the paragraph,
since this is the point where it reconciles the count of quotes.
However, if gutcheck is echoing lines, that is, you haven't used
the -e switch, it will show the _first_ line of the paragraph,
to help you find the place without using line numbers. The
offending paragraph is therefore between the quoted line and
the line number given.

    Line 2587 - Mismatched single quotes

Only checked with the -s switch, since checking single quotes is
not a very reliable process. Otherwise, the same logic as for
doublequotes applies.

    Line 2877 - Mismatched round brackets?

Also curly and square brackets. Texts with a lot of brackets, like
plays with bracketed stage instructions, may have mismatches.
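
A bracket check of this kind comes down to comparing counts of
openers and closers within a paragraph, as in the Python sketch
below. It is an illustration only: the function name
bracket_queries is invented, and the wording of the square and
curly messages is assumed by analogy with the round-bracket sample
above.

```python
PAIRS = {"(": ")", "[": "]", "{": "}"}
NAMES = {"(": "round", "[": "square", "{": "curly"}

def bracket_queries(paragraph):
    """Query each bracket type whose opens and closes do not match."""
    queries = []
    for opener, closer in PAIRS.items():
        if paragraph.count(opener) != paragraph.count(closer):
            # Message text for square and curly is an assumption.
            queries.append(f"Mismatched {NAMES[opener]} brackets?")
    return queries
```

A stage direction that opens in one paragraph and closes in the
next will be queried twice by this sketch, once for each paragraph.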

    Line 3204 - Two successive CRs?
    Line 3281 position 75 - CR without LF?

These are the invalid line-end warnings. See the discussion of
line-end checking in the switches section near the start of this
file. If you see these, and your editor doesn't show anything
wrong, you should probably try deleting the characters just before
and after the line end, and the line-end itself, then retyping the
characters and the line-end.

    Line 2940 - Paragraph starts with lower-case

A common error in an e-text is for an extra blank line

to be put in, like the blank line above, and this often
shows up as a new paragraph beginning with lower case.
Sometimes the blank line is deliberate, as when a
quotation is inserted in a speech. Use your judgement.
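
The test itself is simple enough to sketch in Python: look at the
first non-blank line after each blank line and see whether it
starts with a lower-case letter. This is an illustration, not
gutcheck's code; the function name lowercase_paragraph_starts is
invented, and the start of the file is treated as if it followed a
blank line.

```python
def lowercase_paragraph_starts(lines):
    """Return 1-based numbers of lines that open a paragraph in lower case."""
    flagged = []
    previous_blank = True  # treat the start of the file as a break
    for number, line in enumerate(lines, start=1):
        stripped = line.strip()
        if stripped and previous_blank and stripped[0].islower():
            flagged.append(number)
        previous_blank = not stripped
    return flagged
```

Fed the sample sentence above, with its stray blank line, this
flags the line beginning "to be put in".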

    Line 2987 - Extra period?

An extra period. is a. common problem in OCRed text. and usually
arises when a speck of dust on the page is mistaken for a period.
or. as occasionally happens. when a comma loses its tail.

    Line 3012 column 12 - Double punctuation?

Double punctuation., like that,, is a common typo and
scanno. Some books have much legit double punctuation,
like etc., etc., but it's worth checking anyway.

For Windows-only users who are unfamiliar with DOS:

If you're a Windows-only user, you need to save
gutcheck.exe into the folder (directory) where the
text file you want to check is. Let's say your
text file is in C:\GUT, then you should save
GUTCHECK.EXE into C:\GUT.

Now get to a DOS prompt. You can do this by
selecting the "Command Prompt" or "MS-DOS Prompt"
option that will be somewhere on your

Now get into the C:\GUT directory.
You can do this using the CD (change directory)

and your prompt will change to

so you know you're in the right place.

    gutcheck yourfile.txt
and you'll see gutcheck's report

By default, gutcheck prints its queries to screen.
If you want to create a file of them, to edit
against the text, you can use the greater-than
sign (>) to tell it to output the report to a
file. For example, if you want its report in a
file called QUERIES.LST, you could type

    gutcheck yourfile.txt > queries.lst

The queries.lst file will then contain the listing
of possible formatting errors, and you can
edit it alongside your text.

Whatever you do, DON'T make the filename after
the greater-than sign the name of a file already
on your disk that you want to keep, because
the greater-than sign will cause gutcheck to
replace any existing file of that name.

So, for example, if you have two Tolstoy files
that you want to check, called WARPEACE.TXT and
ANNAK.TXT, make sure that neither of these names
is ever used following the greater-than sign.
To check these correctly, you might do:

    gutcheck warpeace.txt > war.lst

    gutcheck annak.txt > annak.lst

separately. Then you can look at war.lst and annak.lst
to see the gutcheck reports.

For existing 0.98 users upgrading to 0.99:

If you run on old 16-bit DOS or Windows 3.x, I'm afraid
you're out of luck. I'm not saying it _can't_ be compiled
to run on 16-bit, but the executable with the package is
for Win32 only. *nix users won't notice the change at all.

There are two new switches: -u and -d.
See above for full rundown.

Here's a list of the new errors:

    Line 1456 - Carat character?

    Line 1821 - Forward slash?

Common error for italicized "I", or so /'ve found.

    Line 2139 - Query missing paragraph break?

    "Come here, son." "Do I _have_ to go, dad?"
Like that. False positives in some texts. Sorry 'bout that,
but these are often errors.

    Line 2200 - Query had/bad error?

Clear enough. Doesn't catch as many as I'd like it to,
but rarely gives false alarms.

    Line 2268 - Query punctuation after the?

Some words, like "the", very rarely have punctuation
following them. Others, like "Mrs", usually have a
period, but never a comma. Occasional false positives.

    Line 2380 - Query possible scanno arid

It found one of your user-defined typos when you

    Line 2511 - Capital "S"?

Surprisingly common specific case, like: Jane'S

    Line 3469 - endquote missing punctuation?

OK. This one can really cause a lot of false positives
in some books, but it switches itself off if it finds
more than 20 in a text, unless you force it to list them
all with the -v switch.
    "Hey, dad" Johnny said, "can we go now?"
is a common punctuation-missing error.
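
The "switches itself off" behaviour can be sketched in Python as
below. The pattern is a guess at what counts as an endquote with no
punctuation (a letter directly before a closing doublequote); only
the limit of 20 and the -v override come from the description
above. The names endquote_queries, limit and verbose are invented.

```python
import re

# Assumed pattern: a letter, then a doublequote, then white space
# or the end of the line.
ENDQUOTE_NO_PUNCT = re.compile(r'[A-Za-z]"(?=\s|$)')

def endquote_queries(lines, limit=20, verbose=False):
    """Return 1-based line numbers to query, or nothing if too many."""
    hits = [number for number, line in enumerate(lines, start=1)
            if ENDQUOTE_NO_PUNCT.search(line)]
    if len(hits) > limit and not verbose:
        return []  # too many to be errors; assume a feature of the text
    return hits
```

With verbose set, standing in for the -v switch, every hit is
listed however many there are.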

    Line 4266 - Mismatched underscores?

Like mismatched anything else!