Gutcheck documentation


gutcheck: lists possible common formatting errors in a Project
Gutenberg candidate file. It is a command line program and can be used
under Win32 or Unix (gutcheck.c should compile anywhere; if it doesn't,
tell me). For Windows-only people, there is an appendix at the end
with brief instructions for running it.


Current version: 0.99. Users of 0.98: see the end of this file for changes.

You should also have received the licence file COPYING, a README file,
gutcheck.c, the source code, and gutcheck.exe, a DOS executable, with
this file.

This software is Copyright Jim Tinsley 2000-2005.

Gutcheck comes with ABSOLUTELY NO WARRANTY. For details, read the file COPYING.
This is Free Software; you may redistribute it under certain conditions (GPL).

See http://gutcheck.sourceforge.net for the latest version.


Usage is: gutcheck [-setopxlywvmud] filename
  where:
    -s checks Single quotes
    -e switches off Echoing of lines
    -t checks Typos
    -o produces an Overview only
    -p sets strict quotes checking for Paragraphs
    -x (paranoid) switches OFF typo checking and extra checks
    -l turns off Line-end checks
    -y sets error messages to stdout
    -w is a special mode for web uploads (for future use)
    -v (verbose) forces individual reporting of minor problems
    -m interprets Markup of some common HTML tags and entities
    -u warns about words in a user-defined typo file gutcheck.typ
    -d ignores some DP-specific markup

Running gutcheck without any parameters will display a brief help message.

Sample usage:

    gutcheck warpeace.txt


More detail:

Echoing lines (-e to switch off)

You may find it convenient, when reviewing Gutcheck's
suggestions, to see the line that Gutcheck is questioning.
That way, you can often see at a glance whether it is
a real error that needs to be fixed, or a false positive:
something that belongs in the text, but that Gutcheck's
limited programming doesn't understand.

By default, gutcheck echoes these lines, but if you don't
want to see the lines referred to, -e will switch echoing OFF.


Quotes (-s and -p switches)

Gutcheck always looks for unbalanced doublequotes in a
paragraph. It is a common convention for writers not to
close quotes in a paragraph if the next paragraph opens
with quotes and is a continuation by the same speaker.

Gutcheck therefore does not normally report unclosed quotes
if the next paragraph begins with a quote. If you need
to see all unclosed quotes, even where the next paragraph
begins with a quote, you should use the -p switch.
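
The paragraph rule described above can be sketched in a few lines of
Python. This is an illustration only, not gutcheck's own code (gutcheck
is written in C). The function name unbalanced_quote_paragraphs and the
strict parameter, which stands in for the -p switch, are assumptions
made for the example, and paragraphs are assumed to be separated by a
single blank line.

```python
def unbalanced_quote_paragraphs(text, strict=False):
    """Return 1-based numbers of paragraphs with an odd count of doublequotes.

    strict=False imitates the default behaviour described in the text;
    strict=True imitates -p. Both names are illustrative assumptions.
    """
    paragraphs = [p for p in text.split("\n\n") if p.strip()]
    flagged = []
    for index, paragraph in enumerate(paragraphs):
        if paragraph.count('"') % 2 == 0:
            continue  # quotes balance within this paragraph
        following = paragraphs[index + 1].lstrip() if index + 1 < len(paragraphs) else ""
        if not strict and following.startswith('"'):
            continue  # continued speech: the next paragraph reopens the quote
        flagged.append(index + 1)
    return flagged


sample = (
    '"I am going," he said, "and\nI will not stop.\n\n'
    '"Not for anyone."\n\n'
    '"Fine, she said.'
)
default_report = unbalanced_quote_paragraphs(sample)
strict_report = unbalanced_quote_paragraphs(sample, strict=True)
```

In the sample, the first paragraph is left open but the second begins
with a quote, so only the strict run reports it; the last paragraph is
reported in both runs.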

Singlequotes (') are a problem, since the same character
is used for an apostrophe. I'm not sure that it is
possible to get 100% accuracy on singlequotes checking,
particularly since dialect, quite common in PG texts,
upsets the normal rules so badly. Consider the sentence:
    'Tis often said that a man's a man for a' that.
As humans, we recognize that all three apostrophes are used
for contractions rather than quotes, but it isn't easy
to get a program to recognize that.

Since Gutcheck makes too many mistakes when trying to match
singlequotes, it doesn't look for unbalanced singlequotes
unless you specify the -s switch.

Consider these sentences, which illustrate the main cases:

    'Tis often said that a fool and his money are soon parted.

    'Becky's goin' home,' said Tom.

    The dogs' tails wagged in unison.

    Those 'pack dogs' of yours look more like wolves.



Typos (-t switch)

It's not Gutcheck's job to be a spelling checker, but it
does check for a list of common typos and OCR errors if you
use the -t switch. (Paranoid mode, which is on unless you
use the -x switch, also turns typo checking on.)

It also checks for character combinations that rarely or never
occur, especially those involving h and b, which OCR often
confuses. For example, it queries "tbe" in a word. Now, "the" often
occurs, but "tbe" is very rare (heartbeat, hotbed), so I'm
playing the odds - a few false positives for many errors found.
Similarly with "ii", which is a very common OCR error.

Gutcheck suppresses multiple reporting of the first 40 "typos"
found. This is to remove the annoyance of seeing something like
"FN" (footnote) or "LK" (initials) flagged as a typo 147 times
in a text.


Line-end checking (-l switch to disable)

All PG texts should have a Carriage Return (CR - character 13)
and a Line Feed (LF - character 10) at end of each line,
regardless of what O/S you made them on. DOS/Windows, Unix
and Mac have different conventions, but the final text should
always use a CR/LF pair as its line terminator.

By default, Gutcheck verifies that every line does have
the correct terminator. If you're on a work-in-progress
in Linux, though, you might want to convert the line-ends as
a final step, and not want to see thousands of errors every
time you run Gutcheck before that final step. In that case,
you can turn off this checking with the -l switch.
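
As a rough illustration of what a CR/LF check involves, here is a short
Python sketch. It is not gutcheck's implementation: the function name
line_end_queries is an assumption, and it produces only two of the
line-end messages listed later in this file ("No CR?" and "CR without
LF?"), using simplified logic.

```python
def line_end_queries(data):
    """Return (line number, message) pairs for lines not ending in CR/LF.

    `data` is the raw bytes of a text. Simplified sketch, not gutcheck's logic.
    """
    queries = []
    lines = data.split(b"\n")
    if lines and lines[-1] == b"":
        lines.pop()  # the data ended with LF, so there is no partial last line
    for number, line in enumerate(lines, start=1):
        has_cr = line.endswith(b"\r")
        body = line[:-1] if has_cr else line
        if not has_cr:
            queries.append((number, "No CR?"))
        if b"\r" in body:
            queries.append((number, "CR without LF?"))  # stray CR inside the line
    return queries


report = line_end_queries(b"one\r\ntwo\nthr\ree\r\n")
```

Here line 2 ends with a bare LF and line 3 has a CR in the middle of
the line, so each produces one query.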


Paranoid mode (-x switch to disable): Trust No One :-)

-x automatically switches OFF typo-checking (the -t flag)
and some extra checks, like the standalone 1 and 0 queries.


Overview mode (-o switch)

This mode just gives a count of queries found
instead of a detailed list.


Header quote (-h switch)

If you use the -h switch, gutcheck will also display
the Title, Author, Release and Edition fields from the
PG header. This is useful mostly for the automated
checks we do on recently-posted texts.


Errors to stdout (-y switch)

If you're just running gutcheck normally, you can ignore
this. It's only there for programs that provide a front
end to gutcheck. It makes error messages appear within
the output of gutcheck so that the front end knows whether
gutcheck ran OK.


Verbose reporting (-v switch)

Normally, if gutcheck sees lots of long lines, short lines,
spaced dashes, non-ASCII characters or dot-commas ".," it
assumes these are features of the text, counts and summarizes
them at the top of its report, but does not list them
individually. If the -v switch is on, gutcheck will list them all.


Markup interpretation (-m switch)

Normally, gutcheck flags anything it suspects of being HTML
markup as a possible error. When you use the -m switch,
however, it matches anything that looks like markup against
a short list of common HTML tags and entities. If the markup
is in that list, it either ignores the markup, in the case
of a tag, or "interprets" the markup as its nearest ASCII
equivalent, in the case of an entity. So, for example, using
this switch, gutcheck will "see"

    &ldquo;He went <i>thataway!</i>&rdquo;

as

    "He went thataway!"

and report accordingly.

This switch does not, not, NOT check the validity of HTML;
it exists so that you can run gutcheck on most HTML texts
for PG, and get sane results. It does not support all tags.
It does not support all entities. When it sees a tag or entity
it does not recognize, it will query it as HTML just as if
you hadn't specified the -m switch.

Gutcheck 0.99 will automatically switch on markup interpretation
if it sees a lot of tags that appear to be markup, so mostly, you
won't have to specify this.
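
The idea behind markup interpretation can be sketched in Python as
follows. The tag and entity tables here are small, hypothetical
examples chosen for the illustration; gutcheck's real lists are not
given in this file, and the function name interpret_markup is likewise
an assumption.

```python
import re

# Hypothetical tables for illustration; not gutcheck's actual lists.
KNOWN_TAGS = {"i", "b", "em", "p"}
KNOWN_ENTITIES = {
    "ldquo": '"', "rdquo": '"',
    "lsquo": "'", "rsquo": "'",
    "amp": "&", "mdash": "--",
}


def interpret_markup(line):
    """Drop known tags and replace known entities with ASCII equivalents.

    Unknown tags and entities are left in place, so a later check
    could still query them as HTML.
    """
    def replace_tag(match):
        return "" if match.group(1).lower() in KNOWN_TAGS else match.group(0)

    def replace_entity(match):
        return KNOWN_ENTITIES.get(match.group(1), match.group(0))

    line = re.sub(r"</?([A-Za-z][A-Za-z0-9]*)[^<>]*>", replace_tag, line)
    return re.sub(r"&([A-Za-z]+);", replace_entity, line)


seen = interpret_markup("&ldquo;He went <i>thataway!</i>&rdquo;")
untouched = interpret_markup("<PRE>text &emdash;")
```

The first call yields the plain "He went thataway!" line from the
example above; the second is returned unchanged because neither the
tag nor the entity is in the tables.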


User-defined typos (-u switch)

If you have a file named gutcheck.typ either in your current
working directory or in the directory from which you explicitly
invoked gutcheck, but not necessarily on your path, and if you
specify the -u switch, gutcheck will query any word specified
in that file. The file is simple: one word, in lower case, per
line. Up to 999 lines are allowed. Be careful not to put multiple
words onto a line, or leave any rubbish other than the word on
the line. You should have received a sample file gutcheck.typ
with this package.
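
To make the file format concrete, here is a hedged Python sketch of
reading such a word list and querying a text against it. The function
names are assumptions, the word list is passed in as lines rather than
read from gutcheck.typ, and the case-insensitive matching is a choice
made for the example, not a statement about how gutcheck compares
words.

```python
import re


def load_user_typos(lines, limit=999):
    """Build a set of words from typo-file lines: one lower-case word per line."""
    words = set()
    for raw in list(lines)[:limit]:
        word = raw.strip().lower()
        if word and " " not in word:  # skip blank lines and multi-word lines
            words.add(word)
    return words


def query_user_typos(text, typos):
    """Return (line number, word) for every word of `text` found in `typos`."""
    queries = []
    for number, line in enumerate(text.splitlines(), start=1):
        for word in re.findall(r"[A-Za-z']+", line):
            if word.lower() in typos:
                queries.append((number, word))
    return queries


typos = load_user_typos(["arid\n", "tbe\n", "two words\n", "\n"])
found = query_user_typos("He came arid went.\nAll is well.\nTbe end.", typos)
```

The malformed "two words" line is ignored here; the advice above still
applies, since the text does not say how gutcheck itself treats such
lines.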


Ignore DP markup (-d switch)

Distributed Proofreaders (http://www.pgdp.net) is currently
(2005) the main source of PG texts, and proofers there use
special conventions. This switch understands those conventions,
so that people can use gutcheck on in-progress files that
haven't had the special conventions removed yet. The special
conventions supported in 0.99 are page-separators and
"<sc>", "</sc>", "/*", "*/", "/#", "#/", "/$", "$/".


You will probably only run gutcheck on a text once or maybe twice,
just prior to uploading. It usually finds a few formatting problems.
It also usually finds queries that aren't problems at all - it often
questions Tables of Contents for having short lines, for example.
These are called "false positives", and need a human to decide on
them.

The text should be standard prose, and already close to PG normal
format (plain text, about 70 characters per line with blank lines
between paragraphs).

Gutcheck merely draws your attention to things that might be errors.
It is NOT a substitute for human judgement. Formatting choices like
short lines may be for a reason that this program can't understand.

Even the most careful human proofing can leave errors behind in a
text, and there are several automated checks you can do to help find
them. Of these, spellchecking (with _very_ careful human judgement) is
the most important and most useful.

Gutcheck does perform some basic typo-checking if you ask it to,
but its focus is on formatting errors specific to PG texts -
mismatched quotes, non-ASCII characters, bad spacing, bad line
length, HTML tags perhaps left from a conversion, unbalanced
brackets.

Suggestions for additional checks would be appreciated and duly
considered, but there are no guarantees that they will be implemented.




How do _I_ use it?

Practically everyone I give gutcheck to asks me how _I_ use it.
Well, when I get a text for posting, say filename.txt, I run

    gutcheck -o filename.txt

That gives me a quick idea what I'm dealing with. It'll tell
me what kind of problems gutcheck sees, and give me an idea
of how much more work needs to be done on the text. Keep in
mind that gutcheck doesn't do anything like a full spellcheck,
but when I see a text that has a lot of problems, I assume that
it probably needs a spellcheck too.

Having got a feel for the ballpark, I run

    gutcheck filename.txt > jj

where jj is my personal, all-purpose filename for temporary data
that doesn't need to be kept. Then I open filename.txt and jj in
a split-screen view in my editor, and work down the text, fixing
whatever needs fixing, and skipping whatever doesn't. If your
editor doesn't split-screen, you can get much the same effect by
opening your original file in your normal editor, and jj (or your
equivalent name) in something like Notepad, keeping both in view
at the same time.

Twice a day, an automatic process looks at all recently-posted
texts, and emails Michael, me, and sometimes other people with
their gutcheck summaries.



Future development of gutcheck

Gutcheck has gone about as far as it can, given its current
structure. In order to add better singlequotes checking,
sentence checking, better he/be checking and other good stuff
that I'd like to see, I'll have to rewrite it from a different
angle - looking at the syntax instead of the lines. And I'll
probably get around to that sooner or later.

Meantime, I'm just trying to get this version stabilized, so
please report any bugs you find. When it is stable, I'll run
up a Windows port for those timid souls who can't look a
command line in the eye. :-)

And I've started work on gutspell, a companion to gutcheck
which will concentrate on spelling problems. PG spelling
problems are unusual, since the range of texts we cover is
so wide, and I'll be taking a somewhat unorthodox approach
to writing this spelling-checker _specifically_ for texts
containing a lot of dialect and uncommon words that have
probably already been spell-checked against a standard
modern dictionary.




Explanations of common gutcheck messages:

    --> 74 lines in this file have white space at end

PG texts shouldn't have extra white space added at end of line.
Don't worry too much about this; the extra spaces are not doing
any harm, and they'll be removed during posting anyway.


    --> 348 lines in this file are short. Not reporting short lines.
    --> 84 lines in this file are long. Not reporting long lines.
    --> 8 lines in this file are VERY long!

If there are a lot of long or short lines, Gutcheck won't list
them individually. The short lines version of this message
is commonly seen when gutchecking poetry and some plays, where
the normal line length is shorter than the standard for prose.
A "VERY long" line is one over 80 characters. You normally
shouldn't have any of these, but sometimes you may have to render
a table that must be that long, or some special preformatted
quotation that can't be broken.


    --> There are 75 spaced dashes and em-dashes in this file. Not reporting them.

The PG standard for an emdash--like these--is two minus signs
with no spaces before or after them. However, some older texts
used spaced dashes - like these -- and if there are very many
such spaced dashes in the file, gutcheck just draws your
attention to it and doesn't list them individually.



    Line 3020 - Non-ASCII character 233

Standard PG texts should use only ASCII characters with values
up to 127; however, non-English, accented characters can be
represented according to several different non-ASCII encoding
schemes, using values over 127. If you have a plain English text
with a few accented characters in words like cafe or tete-a-tete,
you should replace the accented characters with their unaccented
versions. The English pound sign is another commonly-seen
non-ASCII character. If you have enough non-ASCII characters in
your text that you feel removing them would degrade your text
unacceptably, you should probably consider doing an 8-bit text
as well as a plain-ASCII version.



    Line 1207 - Non-ISO-8859 character 156

Even in "8-bit" texts, there are distinctions between code sets.
The ISO-8859 family of 8-bit code sets is the most commonly used
in PG, and these sets do not define values in the range 128 through
159 as printable characters. It's quite common for someone on a
Windows or Mac machine to use a non-ISO character inadvertently,
so this message warns that the character is not only not ASCII,
but also outside the ISO-8859 range.
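
The two ranges described above can be shown with a small Python
sketch. It is only an illustration of the value ranges (over 127 for
non-ASCII, 128 through 159 for non-ISO-8859); the function name
character_queries is an assumption and the code is not taken from
gutcheck.

```python
def character_queries(data):
    """Return (line number, message) for each byte above 127 in `data`.

    Bytes 128-159 are reported as outside the ISO-8859 printable range;
    other bytes above 127 are reported as non-ASCII.
    """
    queries = []
    for number, line in enumerate(data.splitlines(), start=1):
        for value in line:  # iterating over bytes yields integers
            if 128 <= value <= 159:
                queries.append((number, f"Non-ISO-8859 character {value}"))
            elif value > 127:
                queries.append((number, f"Non-ASCII character {value}"))
    return queries


report = character_queries(b"caf\xe9\nplain\nbad\x9c")
```

Byte 233 on line 1 is reported as non-ASCII, and byte 156 on line 3 as
outside the ISO-8859 range, matching the two sample messages.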



    Line 46 - Tab character?

Some editors and word processors will put in Tab characters
(character 9) to indicate indented text. You should not use these
in a PG text, because you can't be sure how they will appear on a
reader's screen. Find the Tab, and replace it with the appropriate
number of spaces.


    Line 1327 - Tilde character?

The tilde character (~) might be legitimately used, but it's the
character commonly used by OCR software to indicate a place where
it couldn't make out the letter, so gutcheck flags it.



    Line 1347 - Asterisk?

Asterisks are reported only in paranoid mode (see -x).
Like tildes, they are often used to indicate errors, but they are
also legitimately used as line delimiters and footnote markers.



    Line 1451 - Long line 129

PG texts should have lines shorter than 76 characters. There may
be occasions where you decide that you really have to go out to
79 characters, but the sample above says that line 1451 is 129
characters long - probably two lines run together.



    Line 1590 - Short line?

PG texts should have lines longer than 54 characters. However,
there are special cases like poetry and tables of contents where
the lines _should_ be shorter. So treat Gutcheck warnings about
short lines carefully. Sometimes it's a genuine formatting
problem; sometimes the line really needs to be short.

Hint: gutcheck will not flag lines as short if they are indented
- if they start with a space. I like to start inserted stanzas
and other such items indented with a couple of spaces so that
they stand out from the main text anyway.
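
The length limits above, and the indentation hint, can be sketched in
Python like this. The thresholds come from the text (shorter than 76,
longer than 54). Skipping the last line of a paragraph is an
assumption added for the example, since such lines are naturally
short; gutcheck's actual rules are not spelled out in this file, and
the name length_queries is hypothetical.

```python
LONGEST = 75   # the text asks for lines shorter than 76 characters
SHORTEST = 55  # the text asks for lines longer than 54 characters


def length_queries(lines):
    """Return (line number, message) for long lines and suspicious short lines."""
    queries = []
    for index, line in enumerate(lines):
        following = lines[index + 1] if index + 1 < len(lines) else ""
        if len(line) > LONGEST:
            queries.append((index + 1, f"Long line {len(line)}"))
        elif (0 < len(line) < SHORTEST
              and not line.startswith(" ")   # indented lines are never "short"
              and following.strip()):        # assumption: skip paragraph-final lines
            queries.append((index + 1, "Short line?"))
    return queries


sample = ["x" * 60, "short", "y" * 129, "", "  indented", "z" * 60, "end"]
report = length_queries(sample)
```

In the sample, only line 2 (short, with more text following) and
line 3 (129 characters) are queried; the indented line and the final
short line are left alone.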



    Line 1804 - Begins with punctuation?

Lines should normally not begin with commas, periods and so on.
An exception is ellipses . . . which can happen at start of line.



    Line 1850 - Spaced em-dash?

The PG standard for an em-dash--like these--is two minus signs
with no spaces before or after them. Gutcheck flags non-PG
em-dashes - like this one. Normally, you will replace it with a
PG-standard em-dash.
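
A spaced dash is easy to describe as a pattern. The following Python
sketch uses a regular expression that is an assumption made for this
example (a single or double hyphen with a space on either side, or a
double hyphen with a space on one side); it is not the test gutcheck
itself applies.

```python
import re

# Assumed pattern: " - ", " -- ", "word-- " or " --word".
SPACED_DASH = re.compile(r" --? |\S-- | --\S")


def has_spaced_dash(line):
    """Return True if the line appears to contain a spaced dash or em-dash."""
    return SPACED_DASH.search(line) is not None


flagged = has_spaced_dash("Gutcheck flags non-PG em-dashes - like this one.")
standard = has_spaced_dash("The PG standard for an em-dash--like these--is fine.")
```

Ordinary hyphenated words and PG-standard double hyphens without
surrounding spaces do not match the pattern.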



    Line 1904 - Query he/be error?

Gutcheck makes a very minor effort to look for that scourge of all
proofreaders, "be" replacing "he" or vice-versa, and draws your
attention to it when it thinks it has found one.



    Line 2017 - Query digit in a1most

The digit 1 is commonly OCRed for the letter l, the digit 0 for
the letter O, and so on. When gutcheck sees a mix of digits and
letters, it warns you. It may generate a false positive for
something like 7am.
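
Spotting a mix of digits and letters takes only a few lines. This
Python sketch is an illustration under simple assumptions (a "word" is
a run of ASCII letters and digits), with a hypothetical function name;
it also shows the 7am false positive mentioned above.

```python
import re


def mixed_digit_words(line):
    """Return the words in `line` that contain both letters and digits."""
    return [
        word
        for word in re.findall(r"[A-Za-z0-9]+", line)
        if re.search(r"[A-Za-z]", word) and re.search(r"[0-9]", word)
    ]


queried = mixed_digit_words("It was a1most 7am in 1905.")
```

The plain number 1905 is not queried, but both a1most and 7am are.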



    Line 2083 - Query standalone 0

In paranoid mode (see -x) only, gutcheck warns about the digits 0
and 1 standing alone as a word. This can happen if the
OCR misreads the words O or I.



    Line 2115 - Query word whetber

If you have switched typo-checking on, gutcheck looks for
potential typos, especially common h/b errors. It's not
infallible; it sometimes queries legit words, but it's
always worth taking a look.



    Line 2190 column 14 - Missing space?

Omitting a space is a very common error,especially coming from
OCRed text,and can be hard for a human to spot. The commas in
the previous sentence illustrate the kind of thing I mean.



    Line 2240 column 48 - Spaced punctuation?

The flip side of the "missing space" error , here , is when extra
spaces are added before punctuation . Some old texts appear to add
extra spaces around punctuation consistently, but this was a
typographical convention rather than the author's intent, and the
extra "spaces" should be removed when preparing a PG text.
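
Both spacing problems, the missing space and the spaced punctuation,
can be approximated with regular expressions. The patterns and the
function name in this Python sketch are assumptions for illustration
only, and the columns it reports are 1-based positions of the
punctuation mark, which may not match how gutcheck counts columns.
Spaced ellipses (. . .) are masked out first, since the text treats
them as legitimate.

```python
import re

MISSING_SPACE = re.compile(r"[a-z][,;:][A-Za-z]")   # e.g. "error,especially"
SPACED_PUNCT = re.compile(r" [,;:.!?](?= |$)")      # e.g. "error , here"


def spacing_queries(line):
    """Return sorted (column, message) pairs for suspicious spacing."""
    masked = line.replace(". . .", "_____")  # same length, so columns are kept
    queries = []
    for match in MISSING_SPACE.finditer(masked):
        queries.append((match.start() + 2, "Missing space?"))
    for match in SPACED_PUNCT.finditer(masked):
        queries.append((match.start() + 2, "Spaced punctuation?"))
    return sorted(queries)


missing = spacing_queries("a very common error,especially here")
spaced = spacing_queries("error , here")
```

A pattern this simple will produce its own false positives, so it is
best read as a picture of the idea rather than a usable checker.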



    Line 2301 column 19 - Unspaced quotes?

Another common spacing problem occurs in a phrase like "You wait
there,"he said.



    Line 2385 column 27 - Wrongspaced quotes?

As of version 0.98, gutcheck adds extra checks on whether a quote
seems to be a start or end quote, and queries those that appear to
be misplaced. This does give rise to false positives when quotes are
nested, for example:

    "And how," she asked, "will your "friends" help you now?"

but these false positives are worth it because of the many cases
that this test catches, notably those like:

    "And how, "she said," will your friends help you now?"

Sometimes a "wrongspaced quotes" query will arise because an earlier
quote in the paragraph was omitted, so if the place specified seems
to be OK, look back to see whether there's a problem in the preceding
lines.



    Line 2400 - HTML Tag? <PRE>

Some PG texts have been converted from HTML, and not all of the
HTML tags have been removed.



    Line 2402 - HTML symbol? &emdash;

Similarly, special HTML symbol characters can survive into PG
texts. This check can occasionally produce amusing false positives
like
    . . . Marwick & Co were well known for it;



    Line 2540 - Mismatched quotes

Another gutcheck mainstay - unclosed doublequotes in a paragraph.
See the discussion of quotes in the switches section near the
start of this file.

Since the mismatch doesn't occur on any one line, gutcheck quotes
the line number of the first blank line following the paragraph,
since this is the point where it reconciles the count of quotes.
However, if gutcheck is echoing lines, that is, you haven't used
the -e switch, it will show the _first_ line of the paragraph,
to help you find the place without using line numbers. The
offending paragraph is therefore between the quoted line and
the line number given.



    Line 2587 - Mismatched single quotes

Only checked with the -s switch, since checking single quotes is
not a very reliable process. Otherwise, the same logic as for
doublequotes applies.



    Line 2877 - Mismatched round brackets?

Also curly and square brackets. Texts with a lot of brackets, like
plays with bracketed stage instructions, may have mismatches.
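
A simple way to picture the bracket check is to count opening and
closing brackets of each kind within a paragraph. The Python sketch
below does only that. It is an assumption about the general approach,
with hypothetical names, and it does not check nesting order.

```python
PAIRS = {"round": "()", "square": "[]", "curly": "{}"}


def mismatched_brackets(paragraph):
    """Return the kinds of bracket whose open and close counts differ."""
    return [
        kind
        for kind, (opening, closing) in PAIRS.items()
        if paragraph.count(opening) != paragraph.count(closing)
    ]


stage_direction = mismatched_brackets("[Exit (stage left, pursued by a bear.")
balanced = mismatched_brackets("(He bows.) [Aside] Well met.")
```

The first paragraph opens a square and a round bracket without closing
either, so both kinds are reported; the second is balanced.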


    Line 3150 - No CR?
    Line 3204 - Two successive CRs?
    Line 3281 position 75 - CR without LF?

These are the invalid line-end warnings. See the discussion of
line-end checking in the switches section near the start of this
file. If you see these, and your editor doesn't show anything
wrong, you should probably try deleting the characters just before
and after the line end, and the line-end itself, then retyping the
characters and the line-end.


    Line 2940 - Paragraph starts with lower-case

A common error in an e-text is for an extra blank line

to be put in, like the blank line above, and this often
shows up as a new paragraph beginning with lower case.
Sometimes the blank line is deliberate, as when a
quotation is inserted in a speech. Use your judgement.


    Line 2987 - Extra period?

An extra period. is a. common problem in OCRed text. and usually
arises when a speck of dust on the page is mistaken for a period.
or. as occasionally happens. when a comma loses its tail.


    Line 3012 column 12 - Double punctuation?

Double punctuation., like that,, is a common typo and
scanno. Some books have much legit double punctuation,
like etc., etc., but it's worth checking anyway.



                         * * * *

For Windows-only users who are unfamiliar with DOS:

If you're a Windows-only user, you need to save
gutcheck.exe into the folder (directory) where the
text file you want to check is. Let's say your
text file is in C:\GUT, then you should save
GUTCHECK.EXE into C:\GUT.

Now get to a DOS prompt. You can do this by
selecting the "Command Prompt" or "MS-DOS Prompt"
option that will be somewhere on your
Start/Programs menu.

Now get into the C:\GUT directory.
You can do this using the CD (change directory)
command, like this:
    CD \GUT
and your prompt will change to
    C:\GUT>
so you know you're in the right place.

Now type
    gutcheck yourfile.txt
and you'll see gutcheck's report.

By default, gutcheck prints its queries to screen.
If you want to create a file of them, to edit
against the text, you can use the greater-than
sign (>) to tell it to output the report to a
file. For example, if you want its report in a
file called QUERIES.LST, you could type

    gutcheck yourfile.txt > queries.lst

The queries.lst file will then contain the listing
of possible formatting errors, and you can
edit it alongside your text.

Whatever you do, DON'T make the filename after
the greater-than sign the name of a file already
on your disk that you want to keep, because
the greater-than sign will cause gutcheck to
replace any existing file of that name.

So, for example, if you have two Tolstoy files
that you want to check, called WARPEACE.TXT and
ANNAK.TXT, make sure that neither of these names
is ever used following the greater-than sign.
To check these correctly, you might do:

    gutcheck warpeace.txt > war.lst

and

    gutcheck annak.txt > annak.lst

separately. Then you can look at war.lst and annak.lst
to see the gutcheck reports.

                         * * * *


For existing 0.98 users upgrading to 0.99:

If you run on old 16-bit DOS or Windows 3.x, I'm afraid
you're out of luck. I'm not saying it _can't_ be compiled
to run on 16-bit, but the executable with the package is
for Win32 only. *nix users won't notice the change at all.


There are two new switches: -u and -d.
See above for the full rundown.


Here's a list of the new errors:

    Line 1456 - Carat character?

I^ve found a few.


    Line 1821 - Forward slash?

Common error for italicized "I", or so /'ve found.


    Line 2139 - Query missing paragraph break?

    "Come here, son." "Do I _have_ to go, dad?"
Like that. False positives in some texts. Sorry 'bout that,
but these are often errors.


    Line 2200 - Query had/bad error?

Clear enough. Doesn't catch as many as I'd like it to,
but rarely gives false alarms.


    Line 2268 - Query punctuation after the?

Some words, like "the", very rarely have punctuation
following them. Others, like "Mrs", usually have a
period, but never a comma. Occasional false positives.


    Line 2380 - Query possible scanno arid

It found one of your user-defined typos when you
used the -u switch.


    Line 2511 - Capital "S"?

Surprisingly common specific case, like: Jane'S


    Line 3469 - endquote missing punctuation?

OK. This one can really cause a lot of false positives
in some books, but it switches itself off if it finds
more than 20 in a text, unless you force it to list them
all with the -v switch.
    "Hey, dad" Johnny said, "can we go now?"
is a common punctuation-missing error.


    Line 4266 - Mismatched underscores?

Like mismatched anything else!