ali@0
|
1 |
|
ali@0
|
2 |
|
ali@74
|
3 |
Bookloupe documentation
|
ali@0
|
4 |
|
ali@0
|
5 |
|
ali@74
|
6 |
bookloupe: lists possible common formatting errors in a Project
|
ali@74
|
7 |
Gutenberg candidate file. Bookloupe is based on gutcheck, written
|
ali@74
|
8 |
by Jim Tinsley. It is a command line program and can be used under
|
ali@74
|
9 |
Microsoft Windows, Mac or Unix. For Windows-only people, there is
|
ali@74
|
10 |
an appendix at the end with brief instructions for running it.
|
ali@0
|
11 |
|
ali@90
|
12 |
Current version: 2.0
|
ali@0
|
13 |
|
ali@74
|
14 |
This software is Copyright Jim Tinsley 2000-2005 and
|
ali@74
|
15 |
J. Ali Harlow 2012 onwards.
|
ali@0
|
16 |
|
ali@74
|
17 |
Bookloupe comes with ABSOLUTELY NO WARRANTY. For details, read the file COPYING.
|
ali@0
|
18 |
This is Free Software; you may redistribute it under certain conditions (GPL).
|
ali@0
|
19 |
|
ali@74
|
20 |
See http://www.juiblex.co.uk/pgdp/bookloupe/ for the latest version.
|
ali@0
|
21 |
|
ali@0
|
22 |
|
ali@74
|
23 |
Usage is: bookloupe [-setopxlywm] filename
|
ali@0
|
24 |
where:
|
ali@0
|
25 |
-s checks Single quotes
|
ali@0
|
26 |
-e switches off Echoing of lines
|
ali@0
|
27 |
-t checks Typos
|
ali@0
|
28 |
-o produces an Overview only
|
ali@0
|
29 |
-p sets strict quotes checking for Paragraphs
|
ali@0
|
30 |
-x (paranoid) switches OFF typo checking and extra checks
|
ali@0
|
31 |
-l turns off Line-end checks
|
ali@0
|
32 |
-y sets error messages to stdout
|
ali@0
|
33 |
-w is a special mode for web uploads (for future use)
|
ali@0
|
34 |
-v (verbose) forces individual reporting of minor problems
|
ali@0
|
35 |
-m interprets Markup of some common HTML tags and entities
|
ali@0
|
36 |
-u warns about words in a user-defined typo file (bookloupe.typ or gutcheck.typ)
|
ali@0
|
37 |
-d ignores some DP-specific markup
|
ali@0
|
38 |
|
ali@74
|
39 |
Running bookloupe without any parameters will display a brief help message.
|
ali@0
|
40 |
|
ali@0
|
41 |
Sample usage:
|
ali@0
|
42 |
|
ali@74
|
43 |
bookloupe warpeace.txt
|
ali@0
|
44 |
|
ali@0
|
45 |
|
ali@0
|
46 |
More detail:
|
ali@0
|
47 |
|
ali@74
|
48 |
Character encoding
|
ali@74
|
49 |
|
ali@74
|
50 |
Bookloupe will handle e-texts encoded in UTF-8 (preferred),
|
ali@74
|
51 |
ISO-8859-1 (also known as Latin-1), or WINDOWS-1252 (also known,
|
ali@74
|
52 |
incorrectly, as ansi). The output will be in the same encoding
|
ali@74
|
53 |
as the input e-text.
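
As a rough illustration of how the three encodings can be told apart,
here is a minimal Python sketch. It is not bookloupe's own detection
code; the function name guess_encoding and the order of the tests are
assumptions made for this example. It relies on two facts: text that is
not UTF-8 almost always fails to decode as UTF-8, and ISO-8859-1 has no
printable characters in the range 128 through 159, while WINDOWS-1252
does.

```python
def guess_encoding(data: bytes) -> str:
    """Guess which of the three supported encodings `data` uses.

    Hypothetical helper for illustration only; bookloupe's real
    detection logic may differ.
    """
    try:
        data.decode("utf-8")
        return "utf-8"  # plain ASCII also lands here
    except UnicodeDecodeError:
        pass
    # ISO-8859-1 defines no printable characters at 0x80-0x9F, so a
    # byte in that range suggests WINDOWS-1252 instead.
    if any(0x80 <= b <= 0x9F for b in data):
        return "windows-1252"
    return "iso-8859-1"
```

Pure ASCII text is valid in all three encodings, so the sketch simply
reports it as UTF-8.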
|
ali@74
|
54 |
|
ali@0
|
55 |
Echoing lines (-e to switch off)
|
ali@0
|
56 |
|
ali@74
|
57 |
You may find it convenient, when reviewing Bookloupe's
|
ali@74
|
58 |
suggestions, to see the line that Bookloupe is questioning.
|
ali@0
|
59 |
That way, you can often see at a glance whether it is
|
ali@0
|
60 |
a real error that needs to be fixed, or a false positive
|
ali@74
|
61 |
that should be in the text, but Bookloupe's limited
|
ali@0
|
62 |
programming doesn't understand.
|
ali@0
|
63 |
|
ali@74
|
64 |
By default, bookloupe echoes these lines, but if you don't
|
ali@0
|
65 |
want to see the lines referred to, -e will switch it OFF.
|
ali@0
|
66 |
|
ali@0
|
67 |
|
ali@0
|
68 |
Quotes (-s and -p switches)
|
ali@0
|
69 |
|
ali@74
|
70 |
Bookloupe always looks for unbalanced doublequotes in a
|
ali@0
|
71 |
paragraph. It is a common convention for writers not to
|
ali@0
|
72 |
close quotes in a paragraph if the next paragraph opens
|
ali@0
|
73 |
with quotes and is a continuation by the same speaker.
|
ali@0
|
74 |
|
ali@74
|
75 |
Bookloupe therefore does not normally report unclosed quotes
|
ali@0
|
76 |
if the next paragraph begins with a quote. If you need
|
ali@0
|
77 |
to see all unclosed quotes, even where the next paragraph
|
ali@0
|
78 |
begins with a quote, you should use the -p switch.
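
The doublequote rule described above can be sketched in a few lines of
Python. This is only an illustration under stated assumptions, not the
program's actual code: paragraphs are taken to be separated by one
blank line, only the straight doublequote is counted, and the names
unbalanced_quote_paragraphs and strict are invented for the example
(strict plays the role of the -p switch).

```python
def unbalanced_quote_paragraphs(text: str, strict: bool = False) -> list[int]:
    """Return 0-based indexes of paragraphs with unclosed double quotes.

    Hypothetical sketch: `strict=True` mimics the -p switch.
    """
    paragraphs = [p for p in text.split("\n\n") if p.strip()]
    flagged = []
    for i, para in enumerate(paragraphs):
        if para.count('"') % 2 == 0:
            continue  # quotes balance within this paragraph
        next_para = paragraphs[i + 1].lstrip() if i + 1 < len(paragraphs) else ""
        # The same speaker continuing: the next paragraph opens with a quote.
        continues = next_para.startswith('"')
        if strict or not continues:
            flagged.append(i)
    return flagged
```

With strict left off, a paragraph with an odd number of quotes is
passed over when the next paragraph opens with a quote; with strict on,
it is reported regardless.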
|
ali@0
|
79 |
|
ali@92
|
80 |
Singlequotes (' and ’) are a problem, since the same
|
ali@92
|
81 |
character is used for an apostrophe. I'm not sure that it is
|
ali@0
|
82 |
possible to get 100% accuracy on singlequotes checking,
|
ali@0
|
83 |
particularly since dialect, quite common in PG texts,
|
ali@0
|
84 |
upsets the normal rules so badly. Consider the sentence:
|
ali@0
|
85 |
'Tis often said that a man's a man for a' that.
|
ali@0
|
86 |
As humans, we recognize that both apostrophes are used
|
ali@0
|
87 |
for contractions rather than quotes, but it isn't easy
|
ali@0
|
88 |
to get a program to recognize that.
|
ali@0
|
89 |
|
ali@74
|
90 |
Since bookloupe makes too many mistakes when trying to match
|
ali@0
|
91 |
singlequotes, it doesn't look for unbalanced singlequotes
|
ali@0
|
92 |
unless you specify the -s switch.
|
ali@0
|
93 |
|
ali@0
|
94 |
Consider these sentences, which illustrate the main cases:
|
ali@0
|
95 |
|
ali@0
|
96 |
'Tis often said that a fool and his money are soon parted.
|
ali@0
|
97 |
|
ali@0
|
98 |
'Becky's goin' home,' said Tom.
|
ali@0
|
99 |
|
ali@0
|
100 |
The dogs' tails wagged in unison.
|
ali@0
|
101 |
|
ali@0
|
102 |
Those 'pack dogs' of yours look more like wolves.
|
ali@0
|
103 |
|
ali@0
|
104 |
|
ali@0
|
105 |
|
ali@0
|
106 |
Typos (-t switch)
|
ali@0
|
107 |
|
ali@74
|
108 |
It's not bookloupe's job to be a spelling checker, but it
|
ali@0
|
109 |
does check for a list of common typos and OCR errors if you
|
ali@0
|
110 |
use the -t switch. (Paranoid mode also turns typo checking on; see -x.)
|
ali@0
|
111 |
|
ali@0
|
112 |
It also checks for character combinations, especially involving
|
ali@0
|
113 |
h and b, which are often confused by OCR, that rarely or never
|
ali@0
|
114 |
occur. For example, it queries "tbe" in a word. Now, "the" often
|
ali@0
|
115 |
occurs, but "tbe" is very rare (heartbeat, hotbed), so I'm
|
ali@0
|
116 |
playing the odds - a few false positives for many errors found.
|
ali@0
|
117 |
Similarly with "ii", which is a very common OCR error.
|
ali@0
|
118 |
|
ali@74
|
119 |
Bookloupe suppresses multiple reporting of the first 40 "typos"
|
ali@0
|
120 |
found. This is to remove the annoyance of seeing something like
|
ali@0
|
121 |
"FN" (footnote) or "LK" (initials) flagged as a typo 147 times
|
ali@0
|
122 |
in a text.
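
One possible reading of this suppression rule, as a minimal Python
sketch. The function report_typos and its remember parameter are
hypothetical, and the assumption that the program remembers up to 40
distinct words and stays silent on their repeats is an interpretation
of the paragraph above, not a description of the real code.

```python
def report_typos(found: list[str], remember: int = 40) -> list[str]:
    """Return the typo queries that would actually be shown.

    Hypothetical sketch: the first `remember` distinct words are
    remembered, and later repeats of those words are not reported.
    """
    seen: list[str] = []
    reported = []
    for word in found:
        if word in seen:
            continue  # already reported once; stay quiet
        reported.append(word)
        if len(seen) < remember:
            seen.append(word)
    return reported
```

Under this reading, "FN" appearing 147 times in a text produces a
single query.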
|
ali@0
|
123 |
|
ali@0
|
124 |
|
ali@0
|
125 |
Line-end checking (-l switch to disable)
|
ali@0
|
126 |
|
ali@0
|
127 |
All PG texts should have a Carriage Return (CR - character 13)
|
ali@0
|
128 |
and a Line Feed (LF - character 10) at end of each line,
|
ali@0
|
129 |
regardless of what O/S you made them on. DOS/Windows, Unix
|
ali@0
|
130 |
and Mac have different conventions, but the final text should
|
ali@0
|
131 |
always use a CR/LF pair as its line terminator.
|
ali@0
|
132 |
|
ali@74
|
133 |
By default, bookloupe verifies that every line does have
|
ali@0
|
134 |
the correct terminator, but if you're on a work-in-progress
|
ali@0
|
135 |
in Linux, you might want to convert the line-ends as a final
|
ali@0
|
136 |
step, and not want to see thousands of errors every time you
|
ali@74
|
137 |
run bookloupe before that final step, so you can turn off
|
ali@0
|
138 |
this checking with the -l switch.
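
For readers who want to check line ends themselves, the idea can be
sketched in Python as follows. This is an illustration only; the
function name bad_line_ends is invented here, and bookloupe's own
messages ("No CR?", "Two successive CRs?", "CR without LF?") are more
specific than this single list of line numbers.

```python
def bad_line_ends(data: bytes) -> list[int]:
    """Return 1-based numbers of lines not terminated by one CR/LF pair.

    Hypothetical helper; it works on raw bytes so that no newline
    translation hides the problem.
    """
    pieces = data.split(b"\n")
    last_unterminated = pieces.pop()  # text after the final LF, if any
    bad = []
    for number, line in enumerate(pieces, start=1):
        # A good line ends in exactly one CR, with no stray CR earlier.
        if not line.endswith(b"\r") or b"\r" in line[:-1]:
            bad.append(number)
    if last_unterminated:
        bad.append(len(pieces) + 1)  # final line has no LF at all
    return bad
```

The file must be read in binary mode, for example with
open(name, "rb").read(), or the CR characters may never reach the check.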
|
ali@0
|
139 |
|
ali@0
|
140 |
|
ali@0
|
141 |
Paranoid mode (-x switch to disable: Trust No One :-)
|
ali@0
|
142 |
|
ali@0
|
143 |
-x switches OFF the automatic typo-checking (the -t flag)
|
ali@0
|
144 |
and some extra checks like standalone 1 and 0 queries.
|
ali@0
|
145 |
|
ali@0
|
146 |
|
ali@0
|
147 |
Overview mode (-o switch)
|
ali@0
|
148 |
|
ali@74
|
149 |
This mode just gives a count of queries found
|
ali@74
|
150 |
instead of a detailed list.
|
ali@0
|
151 |
|
ali@0
|
152 |
|
ali@0
|
153 |
Header quote (-h switch)
|
ali@0
|
154 |
|
ali@74
|
155 |
If you use the -h switch, bookloupe will also display
|
ali@74
|
156 |
the Title, Author, Release and Edition fields from the
|
ali@74
|
157 |
PG header. This is useful mostly for the automated
|
ali@74
|
158 |
checks we do on recently-posted texts.
|
ali@0
|
159 |
|
ali@0
|
160 |
|
ali@0
|
161 |
Errors to stdout (-y switch)
|
ali@0
|
162 |
|
ali@74
|
163 |
If you're just running bookloupe normally, you can ignore
|
ali@74
|
164 |
this. It's only there for programs that provide a front
|
ali@74
|
165 |
end to bookloupe. It makes error messages appear within
|
ali@74
|
166 |
the output of bookloupe so that the front end knows whether
|
ali@74
|
167 |
bookloupe ran OK.
|
ali@0
|
168 |
|
ali@0
|
169 |
|
ali@0
|
170 |
Verbose reporting (-v switch)
|
ali@0
|
171 |
|
ali@74
|
172 |
Normally, if bookloupe sees lots of long lines, short lines,
|
ali@74
|
173 |
spaced dashes, non-ASCII characters or dot-commas ".," it
|
ali@74
|
174 |
assumes these are features of the text, counts and summarizes
|
ali@74
|
175 |
them at the top of its report, but does not list them
|
ali@74
|
176 |
individually. If the -v switch is on, bookloupe will list them all.
|
ali@0
|
177 |
|
ali@0
|
178 |
|
ali@0
|
179 |
Markup interpretation (-m switch)
|
ali@0
|
180 |
|
ali@74
|
181 |
Normally, bookloupe flags anything it suspects of being HTML
|
ali@74
|
182 |
markup as a possible error. When you use the -m switch,
|
ali@74
|
183 |
however, it matches anything that looks like markup against
|
ali@74
|
184 |
a short list of common HTML tags and entities. If the markup
|
ali@74
|
185 |
is in that list, it either ignores the markup, in the case
|
ali@74
|
186 |
of a tag, or "interprets" the markup as its nearest ASCII
|
ali@74
|
187 |
equivalent, in the case of an entity. So, for example, using
|
ali@74
|
188 |
this switch, bookloupe will "see"
|
ali@0
|
189 |
|
ali@74
|
190 |
&ldquo;He went <i>thataway!</i>&rdquo;
|
ali@0
|
191 |
|
ali@74
|
192 |
as
|
ali@0
|
193 |
|
ali@74
|
194 |
"He went thataway!"
|
ali@0
|
195 |
|
ali@74
|
196 |
and report accordingly.
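
The kind of interpretation described above might look like this in
Python. It is a sketch under assumptions: the two short lists
KNOWN_TAGS and KNOWN_ENTITIES are made up for the example and are not
bookloupe's real lists, and interpret_markup is a hypothetical name.
Known tags are dropped, known entities become their nearest ASCII
equivalent, and anything unrecognized is left in place so that it can
still be queried as HTML.

```python
import re

# Hypothetical short lists; bookloupe's real lists may differ.
KNOWN_TAGS = {"i", "b", "em", "strong", "p", "br"}
KNOWN_ENTITIES = {"ldquo": '"', "rdquo": '"', "lsquo": "'", "rsquo": "'",
                  "mdash": "--", "amp": "&", "nbsp": " "}

def interpret_markup(line: str) -> str:
    """Drop known tags and replace known entities with ASCII."""
    def tag(match):
        name = match.group(1).lower()
        # Unknown tags are kept so they can still be flagged.
        return "" if name in KNOWN_TAGS else match.group(0)

    def entity(match):
        return KNOWN_ENTITIES.get(match.group(1), match.group(0))

    line = re.sub(r"</?([A-Za-z][A-Za-z0-9]*)[^<>]*>", tag, line)
    return re.sub(r"&([A-Za-z]+);", entity, line)
```

Like the -m switch itself, this makes no attempt to check that the HTML
is valid.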
|
ali@0
|
197 |
|
ali@74
|
198 |
This switch does not, not, NOT check the validity of HTML;
|
ali@74
|
199 |
it exists so that you can run bookloupe on most HTML texts
|
ali@74
|
200 |
for PG, and get sane results. It does not support all tags.
|
ali@74
|
201 |
It does not support all entities. When it sees a tag or entity
|
ali@74
|
202 |
it does not recognize, it will query it as HTML just as if
|
ali@74
|
203 |
you hadn't specified the -m switch.
|
ali@0
|
204 |
|
ali@74
|
205 |
Bookloupe will automatically switch on markup interpretation
|
ali@74
|
206 |
if it sees a lot of tags that appear to be markup, so mostly, you
|
ali@74
|
207 |
won't have to specify this.
|
ali@0
|
208 |
|
ali@0
|
209 |
User-defined typos (-u switch)
|
ali@0
|
210 |
|
ali@74
|
211 |
If you have a file named bookloupe.typ or gutcheck.typ either
|
ali@74
|
212 |
in your current working directory or in the directory from
|
ali@74
|
213 |
which you explicitly invoked bookloupe, but not necessarily on
|
ali@74
|
214 |
your path, and if you specify the -u switch, bookloupe will
|
ali@74
|
215 |
query any word specified in that file. The file is simple: one
|
ali@74
|
216 |
word, in lower case, per line. Be careful not to put multiple
|
ali@74
|
217 |
words onto a line, or leave any rubbish other than the word on
|
ali@74
|
218 |
the line. You should have received a sample file bookloupe.typ
|
ali@74
|
219 |
with this package. The file may be encoded in UTF-8 (preferred),
|
ali@74
|
220 |
ISO-8859-1 (also known as Latin-1), or WINDOWS-1252 (also known,
|
ali@74
|
221 |
incorrectly, as ansi).
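
To make the file format concrete, here is a small Python sketch of
reading such a word list and querying a line against it. The names
load_user_typos and query_user_typos are hypothetical, and skipping
lines that hold more than one word is an assumption of this example;
the text above does not say what bookloupe does with such lines.

```python
import re

def load_user_typos(text: str) -> set[str]:
    """Parse the contents of a .typ file: one lower-case word per line."""
    words = set()
    for line in text.splitlines():
        word = line.strip()
        # Assumption: blank lines and multi-word lines are skipped.
        if word and len(word.split()) == 1:
            words.add(word.lower())
    return words

def query_user_typos(line: str, typos: set[str]) -> list[str]:
    """Return the words on `line` that appear in the user's list."""
    return [w for w in re.findall(r"[\w']+", line) if w.lower() in typos]
```

The file contents would first be decoded from whichever of the three
encodings the file uses.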
|
ali@0
|
222 |
|
ali@0
|
223 |
Ignore DP markup (-d switch)
|
ali@0
|
224 |
|
ali@74
|
225 |
Distributed Proofreaders (http://www.pgdp.net) has for some
|
ali@74
|
226 |
time been the main source of PG texts, and proofers there use
|
ali@74
|
227 |
special conventions. This switch understands those conventions,
|
ali@74
|
228 |
so that people can use bookloupe on files in process that still
|
ali@74
|
229 |
haven't had the special conventions removed yet. The special
|
ali@74
|
230 |
conventions supported are page-separators and
|
ali@74
|
231 |
"<sc>", "</sc>", "/*", "*/", "/#", "#/", "/$", "$/".
|
ali@0
|
232 |
|
ali@0
|
233 |
|
ali@74
|
234 |
You will probably only run bookloupe on a text once or maybe twice,
|
ali@0
|
235 |
just prior to uploading; it usually finds a few formatting problems;
|
ali@0
|
236 |
it also usually finds queries that aren't problems at all - it often
|
ali@0
|
237 |
questions Tables of Contents for having short lines, for example.
|
ali@74
|
238 |
These are called "false positives," and need a human to decide on
|
ali@0
|
239 |
them.
|
ali@0
|
240 |
|
ali@0
|
241 |
The text should be standard prose, and already close to PG normal
|
ali@0
|
242 |
format (plain text, about 70 characters per line with blank lines
|
ali@0
|
243 |
between paragraphs).
|
ali@0
|
244 |
|
ali@74
|
245 |
Bookloupe merely draws your attention to things that might be errors.
|
ali@0
|
246 |
It is NOT a substitute for human judgement. Formatting choices like
|
ali@0
|
247 |
short lines may be for a reason that this program can't understand.
|
ali@0
|
248 |
|
ali@0
|
249 |
Even the most careful human proofing can leave errors behind in a
|
ali@0
|
250 |
text, and there are several automated checks you can do to help find
|
ali@0
|
251 |
them. Of these, spellchecking (with _very_ careful human judgement) is
|
ali@0
|
252 |
the most important and most useful.
|
ali@0
|
253 |
|
ali@74
|
254 |
Bookloupe does perform some basic typo-checking if you ask it to,
|
ali@74
|
255 |
but its focus is on formatting errors specific to PG texts—
|
ali@0
|
256 |
mismatched quotes, non-ASCII characters, bad spacing, bad line
|
ali@0
|
257 |
length, HTML tags perhaps left from a conversion, unbalanced
|
ali@0
|
258 |
brackets.
|
ali@0
|
259 |
|
ali@0
|
260 |
Suggestions for additional checks would be appreciated and duly
|
ali@0
|
261 |
considered, but no guarantees that they will be implemented.
|
ali@0
|
262 |
|
ali@0
|
263 |
|
ali@0
|
264 |
|
ali@0
|
265 |
|
ali@74
|
266 |
How does Jim Tinsley use gutcheck?
|
ali@0
|
267 |
|
ali@0
|
268 |
Practically everyone I give gutcheck to asks me how _I_ use it.
|
ali@0
|
269 |
Well, when I get a text for posting, say filename.txt, I run
|
ali@0
|
270 |
|
ali@0
|
271 |
gutcheck -o filename.txt
|
ali@0
|
272 |
|
ali@0
|
273 |
That gives me a quick idea what I'm dealing with. It'll tell
|
ali@0
|
274 |
me what kind of problems gutcheck sees, and give me an idea
|
ali@0
|
275 |
of how much more work needs to be done on the text. Keep in
|
ali@0
|
276 |
mind that gutcheck doesn't do anything like a full spellcheck,
|
ali@0
|
277 |
but when I see a text that has a lot of problems, I assume that
|
ali@0
|
278 |
it probably needs a spellcheck too.
|
ali@0
|
279 |
|
ali@0
|
280 |
Having got a feel for the ballpark, I run
|
ali@0
|
281 |
|
ali@0
|
282 |
gutcheck filename.txt > jj
|
ali@0
|
283 |
|
ali@0
|
284 |
where jj is my personal, all-purpose filename for temporary data
|
ali@0
|
285 |
that doesn't need to be kept. Then I open filename.txt and jj in
|
ali@0
|
286 |
a split-screen view in my editor, and work down the text, fixing
|
ali@0
|
287 |
whatever needs fixing, and skipping whatever doesn't. If your
|
ali@0
|
288 |
editor doesn't split-screen, you can get much the same effect by
|
ali@0
|
289 |
opening your original file in your normal editor, and jj (or your
|
ali@0
|
290 |
equivalent name) in something like Notepad, keeping both in view
|
ali@0
|
291 |
at the same time.
|
ali@0
|
292 |
|
ali@0
|
293 |
Twice a day, an automatic process looks at all recently-posted
|
ali@0
|
294 |
texts, and emails Michael, me, and sometimes other people with
|
ali@0
|
295 |
their gutcheck summaries.
|
ali@0
|
296 |
|
ali@0
|
297 |
|
ali@0
|
298 |
|
ali@74
|
299 |
Future development of bookloupe
|
ali@0
|
300 |
|
ali@74
|
301 |
Future versions will add support for UTF-8 characters that
|
ali@74
|
302 |
are not in ISO-8859-1 (e.g., curled quotation marks);
|
ali@90
|
303 |
characters that do not have a composed form (version 2.0
|
ali@74
|
304 |
treats these as taking 2 or more columns); zero width and
|
ali@90
|
305 |
wide characters (version 2.0 treats these as taking 1 column).
|
ali@0
|
306 |
|
ali@0
|
307 |
|
ali@0
|
308 |
|
ali@0
|
309 |
|
ali@74
|
310 |
Explanations of common bookloupe messages:
|
ali@0
|
311 |
|
ali@0
|
312 |
--> 74 lines in this file have white space at end
|
ali@0
|
313 |
|
ali@0
|
314 |
PG texts shouldn't have extra white space added at end of line.
|
ali@0
|
315 |
Don't worry too much about this; they're not doing any harm,
|
ali@0
|
316 |
and they'll be removed during posting anyway.
|
ali@0
|
317 |
|
ali@0
|
318 |
|
ali@0
|
319 |
--> 348 lines in this file are short. Not reporting short lines.
|
ali@0
|
320 |
--> 84 lines in this file are long. Not reporting long lines.
|
ali@0
|
321 |
--> 8 lines in this file are VERY long!
|
ali@0
|
322 |
|
ali@74
|
323 |
If there are a lot of long or short lines, bookloupe won't list
|
ali@0
|
324 |
them individually. The short lines version of this message
|
ali@0
|
325 |
is commonly seen when gutchecking poetry and some plays, where
|
ali@0
|
326 |
the normal line length is shorter than the standard for prose.
|
ali@0
|
327 |
A "VERY long" line is one over 80 characters. You normally
|
ali@0
|
328 |
shouldn't have any of these, but sometimes you may have to render
|
ali@0
|
329 |
a table that must be that long, or some special preformatted
|
ali@0
|
330 |
quotation that can't be broken.
|
ali@0
|
331 |
|
ali@0
|
332 |
|
ali@0
|
333 |
--> There are 75 spaced dashes and em-dashes in this file. Not reporting them.
|
ali@0
|
334 |
|
ali@0
|
335 |
The PG standard for an em-dash--like these--is two minus signs
|
ali@0
|
336 |
with no spaces before or after them. However, some older texts
|
ali@0
|
337 |
used spaced dashes - like these -- and if there are very many
|
ali@74
|
338 |
such spaced dashes in the file, bookloupe just draws your
|
ali@0
|
339 |
attention to it and doesn't list them individually.
|
ali@0
|
340 |
|
ali@0
|
341 |
|
ali@0
|
342 |
|
ali@0
|
343 |
Line 3020 - Non-ASCII character 233
|
ali@0
|
344 |
|
ali@0
|
345 |
Standard PG texts should use only ASCII characters with values
|
ali@0
|
346 |
up to 127; however, non-English, accented characters can be
|
ali@0
|
347 |
represented according to several different non-ASCII encoding
|
ali@0
|
348 |
schemes, using values over 127. If you have a plain English text
|
ali@0
|
349 |
with a few accented characters in words like cafe or tete-a-tete,
|
ali@74
|
350 |
you might replace the accented characters with their unaccented
|
ali@0
|
351 |
versions. The English pound sign is another commonly-seen
|
ali@0
|
352 |
non-ASCII character. If you have enough non-ASCII characters in
|
ali@74
|
353 |
your text that you feel removing them would degrade your text,
|
ali@74
|
354 |
you should probably consider doing a UTF-8 text.
|
ali@0
|
355 |
|
ali@0
|
356 |
|
ali@0
|
357 |
|
ali@0
|
358 |
Line 1207 - Non-ISO-8859 character 156
|
ali@0
|
359 |
|
ali@0
|
360 |
Even in "8-bit" texts, there are distinctions between code sets.
|
ali@0
|
361 |
The ISO-8859 family of 8-bit code sets is the most commonly used
|
ali@0
|
362 |
in PG, and these sets do not define values in the range 128 through
|
ali@0
|
363 |
159 as printable characters. It's quite common for someone on a
|
ali@0
|
364 |
Windows or Mac machine to use a non-ISO character inadvertently,
|
ali@0
|
365 |
so this message warns that the character is not only not ASCII,
|
ali@0
|
366 |
but also outside the ISO-8859 range.
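
The distinction between the two messages comes down to the numeric
value of the character, which can be sketched in Python like this. The
function classify_byte and its return strings are invented for the
example; only the ranges (up to 127, and 128 through 159) come from the
explanations above.

```python
def classify_byte(value: int) -> str:
    """Classify one 8-bit character value as described above.

    Hypothetical helper for illustration.
    """
    if value < 128:
        return "ascii"
    if value <= 159:
        # 128-159 are not printable characters in the ISO-8859 family.
        return "non-iso-8859"
    return "non-ascii"
```

So character 233 from the earlier sample message is merely non-ASCII,
while character 156 is outside the ISO-8859 printable range as well.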
|
ali@0
|
367 |
|
ali@0
|
368 |
|
ali@0
|
369 |
|
ali@0
|
370 |
Line 46 - Tab character?
|
ali@0
|
371 |
|
ali@0
|
372 |
Some editors and WPs will put in Tab characters (character 9) to
|
ali@0
|
373 |
indicate indented text. You should not use these in a PG text,
|
ali@0
|
374 |
because you can't be sure how they will appear on a reader's
|
ali@0
|
375 |
screen. Find the Tab, and replace it with the appropriate number
|
ali@0
|
376 |
of spaces.
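
If there are many Tabs, a few lines of Python can find and replace
them. This is a convenience sketch, not part of bookloupe; the function
names are hypothetical, and the tab width of 8 is an assumption, so
check the result against how the text was meant to look.

```python
def tab_lines(text: str) -> list[int]:
    """Return 1-based numbers of lines that contain a Tab character."""
    return [n for n, line in enumerate(text.splitlines(), start=1)
            if "\t" in line]

def replace_tabs(line: str, width: int = 8) -> str:
    """Replace each Tab with spaces up to the next multiple of `width`."""
    return line.expandtabs(width)
```

Here str.expandtabs pads to the next tab stop rather than inserting a
fixed number of spaces, which matches how most editors display Tabs.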
|
ali@0
|
377 |
|
ali@0
|
378 |
|
ali@0
|
379 |
Line 1327 - Tilde character?
|
ali@0
|
380 |
|
ali@0
|
381 |
The tilde character (~) might be legitimately used, but it's the
|
ali@0
|
382 |
character commonly used by OCR software to indicate a place where
|
ali@74
|
383 |
it couldn't make out the letter, so bookloupe flags it.
|
ali@0
|
384 |
|
ali@0
|
385 |
|
ali@0
|
386 |
|
ali@0
|
387 |
Line 1347 - Asterisk?
|
ali@0
|
388 |
|
ali@0
|
389 |
Asterisks are reported only in paranoid mode (see -x).
|
ali@0
|
390 |
Like tildes, they are often used to indicate errors, but they are
|
ali@0
|
391 |
also legitimately used as line delimiters and footnote markers.
|
ali@0
|
392 |
|
ali@0
|
393 |
|
ali@0
|
394 |
|
ali@0
|
395 |
Line 1451 - Long line 129
|
ali@0
|
396 |
|
ali@0
|
397 |
PG texts should have lines shorter than 76 characters. There may be occasions
|
ali@0
|
398 |
where you decide that you really have to go out to 79 characters,
|
ali@74
|
399 |
but the sample above says that line 1451 is 129 characters long—
|
ali@0
|
400 |
probably two lines run together.
|
ali@0
|
401 |
|
ali@0
|
402 |
|
ali@0
|
403 |
|
ali@0
|
404 |
Line 1590 - Short line?
|
ali@0
|
405 |
|
ali@0
|
406 |
PG texts should have lines longer than 54 characters. However,
|
ali@0
|
407 |
there are special cases like poetry and tables of contents where
|
ali@74
|
408 |
the lines _should_ be shorter. So treat bookloupe warnings about
|
ali@0
|
409 |
short lines carefully. Sometimes it's a genuine formatting
|
ali@0
|
410 |
problem; sometimes the line really needs to be short.
|
ali@0
|
411 |
|
ali@74
|
412 |
Hint: bookloupe will not flag lines as short if they are indented
|
ali@74
|
413 |
—if they start with a space. I like to start inserted stanzas
|
ali@0
|
414 |
and other such items indented with a couple of spaces so that
|
ali@0
|
415 |
they stand out from the main text anyway.
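
The line-length figures quoted in these explanations can be gathered
into one Python sketch. It is an illustration only: the function
classify_length is hypothetical, the thresholds are simply the ones
mentioned in this file (over 80 is VERY long, 76 or more is long, 54 or
fewer is short), and it looks at each line in isolation, so unlike a
real check it would also flag the naturally short last line of a
paragraph.

```python
def classify_length(line: str) -> str:
    """Classify a line by length, using the figures quoted in this file.

    Hypothetical sketch; it ignores paragraph context.
    """
    length = len(line)
    if length > 80:
        return "very long"
    if length >= 76:
        return "long"
    # Indented lines (starting with a space) are never called short.
    if 0 < length <= 54 and not line.startswith(" "):
        return "short"
    return "ok"
```

Blank lines are treated as "ok", since they separate paragraphs rather
than belong to one.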
|
ali@0
|
416 |
|
ali@0
|
417 |
|
ali@0
|
418 |
|
ali@0
|
419 |
Line 1804 - Begins with punctuation?
|
ali@0
|
420 |
|
ali@0
|
421 |
Lines should normally not begin with commas, periods and so on.
|
ali@0
|
422 |
An exception is ellipses . . . which can happen at start of line.
|
ali@0
|
423 |
|
ali@0
|
424 |
|
ali@0
|
425 |
|
ali@0
|
426 |
Line 1850 - Spaced em-dash?
|
ali@0
|
427 |
|
ali@0
|
428 |
The PG standard for an em-dash--like these--is two minus signs
|
ali@74
|
429 |
with no spaces before or after them. Bookloupe flags non-PG
|
ali@0
|
430 |
em-dashes - like this one. Normally, you will replace it with a
|
ali@0
|
431 |
PG-standard em-dash.
|
ali@0
|
432 |
|
ali@0
|
433 |
|
ali@0
|
434 |
|
ali@0
|
435 |
Line 1904 - Query he/be error?
|
ali@0
|
436 |
|
ali@74
|
437 |
Bookloupe makes a very minor effort to look for that scourge of all
|
ali@0
|
438 |
proofreaders, "be" replacing "he" or vice-versa, and draws your
|
ali@0
|
439 |
attention to it when it thinks it has found one.
|
ali@0
|
440 |
|
ali@0
|
441 |
|
ali@0
|
442 |
|
ali@0
|
443 |
Line 2017 - Query digit in a1most
|
ali@0
|
444 |
|
ali@0
|
445 |
The digit 1 is commonly OCRed for the letter l, the digit 0 for
|
ali@74
|
446 |
the letter O, and so on. When bookloupe sees a mix of digits and
|
ali@0
|
447 |
letters, it warns you. It may generate a false positive for
|
ali@0
|
448 |
something like 7am.
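
A check of this kind is easy to sketch in Python. The function
digit_queries below is a hypothetical illustration, not the program's
code: it splits a line into runs of letters and digits and reports any
run that contains both.

```python
import re

def digit_queries(line: str) -> list[str]:
    """Return words on `line` that mix digits and letters."""
    words = re.findall(r"[A-Za-z0-9]+", line)
    return [w for w in words
            if any(c.isdigit() for c in w) and any(c.isalpha() for c in w)]
```

Note that "7am" is reported along with "a1most", which is exactly the
false positive mentioned above, while a plain number such as 1905 is
left alone.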
|
ali@0
|
449 |
|
ali@0
|
450 |
|
ali@0
|
451 |
|
ali@0
|
452 |
Line 2083 - Query standalone 0
|
ali@0
|
453 |
|
ali@74
|
454 |
In paranoid mode (see -x) only, bookloupe warns about the digit 0
|
ali@0
|
455 |
and the number 1 standing alone as a word. This can happen if the
|
ali@0
|
456 |
OCR misreads the words O or I.
|
ali@0
|
457 |
|
ali@0
|
458 |
|
ali@0
|
459 |
|
ali@0
|
460 |
Line 2115 - Query word whetber
|
ali@0
|
461 |
|
ali@74
|
462 |
If you have switched typo-checking on, bookloupe looks for
|
ali@0
|
463 |
potential typos, especially common h/b errors. It's not
|
ali@0
|
464 |
infallible; it sometimes queries legit words, but it's
|
ali@0
|
465 |
always worth taking a look.
|
ali@0
|
466 |
|
ali@0
|
467 |
|
ali@0
|
468 |
|
ali@0
|
469 |
Line 2190 column 14 - Missing space?
|
ali@0
|
470 |
|
ali@0
|
471 |
Omitting a space is a very common error,especially coming from
|
ali@0
|
472 |
OCRed text,and can be hard for a human to spot. The commas in
|
ali@0
|
473 |
the previous sentence illustrate the kind of thing I mean.
|
ali@0
|
474 |
|
ali@0
|
475 |
|
ali@0
|
476 |
|
ali@0
|
477 |
Line 2240 column 48 - Spaced punctuation?
|
ali@0
|
478 |
|
ali@0
|
479 |
The flip side of the "missing space" error , here , is when extra
|
ali@0
|
480 |
spaces are added before punctuation . Some old texts appear to add
|
ali@0
|
481 |
extra spaces around punctuation consistently, but this was a
|
ali@0
|
482 |
typographical convention rather than the author's intent, and the
|
ali@0
|
483 |
extra "spaces" should be removed when preparing a PG text.
|
ali@0
|
484 |
|
ali@0
|
485 |
|
ali@0
|
486 |
|
ali@0
|
487 |
Line 2301 column 19 - Unspaced quotes?
|
ali@0
|
488 |
|
ali@0
|
489 |
Another common spacing problem occurs in a phrase like "You wait
|
ali@0
|
490 |
there,"he said.
|
ali@0
|
491 |
|
ali@0
|
492 |
|
ali@0
|
493 |
|
ali@0
|
494 |
Line 2385 column 27 - Wrongspaced quotes?
|
ali@0
|
495 |
|
ali@74
|
496 |
Bookloupe checks whether a quote seems to be a start or end quote,
|
ali@74
|
497 |
and queries those that appear to be misplaced. This does give rise
|
ali@74
|
498 |
to false positives when quotes are nested, for example:
|
ali@0
|
499 |
|
ali@0
|
500 |
"And how," she asked, "will your "friends" help you now?"
|
ali@0
|
501 |
|
ali@0
|
502 |
but these false positives are worth it because of the many cases
|
ali@0
|
503 |
that this test catches, notably those like:
|
ali@0
|
504 |
|
ali@0
|
505 |
"And how, "she said," will your friends help you now?"
|
ali@0
|
506 |
|
ali@0
|
507 |
Sometimes a "wrongspaced quotes" query will arise because an earlier
|
ali@0
|
508 |
quote in the paragraph was omitted, so if the place specified seems
|
ali@0
|
509 |
to be OK, look back to see whether there's a problem in the preceding
|
ali@0
|
510 |
lines.
|
ali@0
|
511 |
|
ali@0
|
512 |
|
ali@0
|
513 |
|
ali@0
|
514 |
Line 2400 - HTML Tag? <PRE>
|
ali@0
|
515 |
|
ali@0
|
516 |
Some PG texts have been converted from HTML, and not all of the
|
ali@0
|
517 |
HTML tags have been removed.
|
ali@0
|
518 |
|
ali@0
|
519 |
|
ali@0
|
520 |
|
ali@0
|
521 |
Line 2402 - HTML symbol? &emdash;
|
ali@0
|
522 |
|
ali@0
|
523 |
Similarly, special HTML symbol characters can survive into PG
|
ali@0
|
524 |
texts. This can occasionally produce amusing false positives like
|
ali@0
|
525 |
. . . Marwick & Co were well known for it;
|
ali@0
|
526 |
|
ali@0
|
527 |
|
ali@0
|
528 |
|
ali@0
|
529 |
Line 2540 - Mismatched quotes
|
ali@0
|
530 |
|
ali@74
|
531 |
Another bookloupe mainstay—unclosed doublequotes in a paragraph.
|
ali@0
|
532 |
See the discussion of quotes in the switches section near the
|
ali@0
|
533 |
start of this file.
|
ali@0
|
534 |
|
ali@74
|
535 |
Since the mismatch doesn't occur on any one line, bookloupe quotes
|
ali@0
|
536 |
the line number of the first blank line following the paragraph,
|
ali@0
|
537 |
since this is the point where it reconciles the count of quotes.
|
ali@74
|
538 |
However, if bookloupe is echoing lines, that is, you haven't used
|
ali@0
|
539 |
the -e switch, it will show the _first_ line of the paragraph,
|
ali@0
|
540 |
to help you find the place without using line numbers. The
|
ali@0
|
541 |
offending paragraph is therefore between the quoted line and
|
ali@0
|
542 |
the line number given.
|
ali@0
|
543 |
|
ali@0
|
544 |
|
ali@0
|
545 |
|
ali@0
|
546 |
Line 2587 - Mismatched single quotes
|
ali@0
|
547 |
|
ali@0
|
548 |
Only checked with the -s switch, since checking single quotes is
|
ali@0
|
549 |
not a very reliable process. Otherwise, the same logic as for
|
ali@0
|
550 |
doublequotes applies.
|
ali@0
|
551 |
|
ali@0
|
552 |
|
ali@0
|
553 |
|
ali@0
|
554 |
Line 2877 - Mismatched round brackets?
|
ali@0
|
555 |
|
ali@0
|
556 |
Also curly and square brackets. Texts with a lot of brackets, like
|
ali@0
|
557 |
plays with bracketed stage instructions, may have mismatches.
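
A simple way to picture this check is to count opening and closing
brackets of each kind within a paragraph, as in the Python sketch
below. The function mismatched_brackets is hypothetical, and counting
is an assumption of this example; it does not verify nesting order, and
the real program may work differently.

```python
PAIRS = {"(": ")", "[": "]", "{": "}"}
NAMES = {"(": "round", "[": "square", "{": "curly"}

def mismatched_brackets(paragraph: str) -> list[str]:
    """Return the kinds of bracket whose open and close counts differ."""
    return [NAMES[opener] for opener, closer in PAIRS.items()
            if paragraph.count(opener) != paragraph.count(closer)]
```

A stage direction such as "[Exit (left.]" would be reported as having
mismatched round brackets.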
|
ali@0
|
558 |
|
ali@0
|
559 |
|
ali@0
|
560 |
Line 3150 - No CR?
|
ali@0
|
561 |
Line 3204 - Two successive CRs?
|
ali@0
|
562 |
Line 3281 position 75 - CR without LF?
|
ali@0
|
563 |
|
ali@0
|
564 |
These are the invalid line-end warnings. See the discussion of
|
ali@0
|
565 |
line-end checking in the switches section near the start of this
|
ali@0
|
566 |
file. If you see these, and your editor doesn't show anything
|
ali@0
|
567 |
wrong, you should probably try deleting the characters just before
|
ali@0
|
568 |
and after the line end, and the line-end itself, then retyping the
|
ali@0
|
569 |
characters and the line-end.
|
ali@0
|
570 |
|
ali@0
|
571 |
|
ali@0
|
572 |
Line 2940 - Paragraph starts with lower-case
|
ali@0
|
573 |
|
ali@0
|
574 |
A common error in an e-text is for an extra blank line
|
ali@0
|
575 |
|
ali@0
|
576 |
to be put in, like the blank line above, and this often
|
ali@0
|
577 |
shows up as a new paragraph beginning with lower case.
|
ali@0
|
578 |
Sometimes the blank line is deliberate, as when a
|
ali@0
|
579 |
quotation is inserted in a speech. Use your judgement.
|
ali@0
|
580 |
|
ali@0
|
581 |
|
ali@0
|
582 |
Line 2987 - Extra period?
|
ali@0
|
583 |
|
ali@0
|
584 |
An extra period. is a. common problem in OCRed text. and usually
|
ali@0
|
585 |
arises when a speck of dust on the page is mistaken for a period.
|
ali@0
|
586 |
or. as occasionally happens. when a comma loses its tail.
|
ali@0
|
587 |
|
ali@0
|
588 |
|
ali@0
|
589 |
Line 3012 column 12 - Double punctuation?
|
ali@0
|
590 |
|
ali@0
|
591 |
Double punctuation., like that,, is a common typo and
|
ali@0
|
592 |
scanno. Some books have much legit double punctuation,
|
ali@0
|
593 |
like etc., etc., but it's worth checking anyway.
|
ali@0
|
594 |
|
ali@0
|
595 |
|
ali@0
|
596 |
|
ali@0
|
597 |
* * * *
|
ali@0
|
598 |
|
ali@0
|
599 |
For Windows-only users who are unfamiliar with DOS:
|
ali@0
|
600 |
|
ali@0
|
601 |
If you're a Windows-only user, you need to save
|
ali@74
|
602 |
bookloupe.exe into the folder (directory) where the
|
ali@0
|
603 |
text file you want to check is. Let's say your
|
ali@74
|
604 |
text file is in C:\gut, then you should save
|
ali@74
|
605 |
bookloupe.exe into C:\gut.
|
ali@0
|
606 |
|
ali@74
|
607 |
Now get to a console. You can do this by
|
ali@0
|
608 |
selecting the "Command Prompt" or "MS-DOS Prompt"
|
ali@0
|
609 |
option that will be somewhere on your
|
ali@0
|
610 |
Start/Programs menu.
|
ali@0
|
611 |
|
ali@74
|
612 |
Now get into the C:\gut directory.
|
ali@74
|
613 |
You can do this using the cd (change directory)
|
ali@0
|
614 |
command, like this:
|
ali@74
|
615 |
cd \gut
|
ali@0
|
616 |
and your prompt will change to
|
ali@74
|
617 |
C:\gut>
|
ali@0
|
618 |
so you know you're in the right place.
|
ali@0
|
619 |
|
ali@0
|
620 |
Now type
|
ali@74
|
621 |
bookloupe yourfile.txt
|
ali@74
|
622 |
and you'll see bookloupe's report.
|
ali@0
|
623 |
|
ali@74
|
624 |
By default, bookloupe prints its queries to screen.
|
ali@0
|
625 |
If you want to create a file of them, to edit
|
ali@0
|
626 |
against the text, you can use the greater-than
|
ali@0
|
627 |
sign (>) to tell it to output the report to a
|
ali@0
|
628 |
file. For example, if you want its report in a
|
ali@74
|
629 |
file called queries.lst, you could type
|
ali@74
|
630 |
|
ali@74
|
631 |
bookloupe yourfile.txt > queries.lst
|
ali@0
|
632 |
|
ali@0
|
633 |
The queries.lst file will then contain the listing
|
ali@0
|
634 |
of possible formatting errors, and you can
|
ali@0
|
635 |
edit it alongside your text.
|
ali@0
|
636 |
|
ali@0
|
637 |
Whatever you do, DON'T make the filename after
|
ali@0
|
638 |
the greater-than sign the name of a file already
|
ali@0
|
639 |
on your disk that you want to keep, because
|
ali@74
|
640 |
the greater-than sign will cause bookloupe to
|
ali@0
|
641 |
replace any existing file of that name.
|
ali@0
|
642 |
|
ali@0
|
643 |
So, for example, if you have two Tolstoy files
|
ali@0
|
644 |
that you want to check, called WARPEACE.TXT and
|
ali@0
|
645 |
ANNAK.TXT, make sure that neither of these names
|
ali@0
|
646 |
is ever used following the greater-than sign.
|
ali@0
|
647 |
To check these correctly, you might do:
|
ali@0
|
648 |
|
ali@74
|
649 |
bookloupe warpeace.txt > war.lst
|
ali@0
|
650 |
|
ali@0
|
651 |
and
|
ali@0
|
652 |
|
ali@74
|
653 |
bookloupe annak.txt > annak.lst
|
ali@0
|
654 |
|
ali@0
|
655 |
separately. Then you can look at war.lst and annak.lst
|
ali@74
|
656 |
to see the bookloupe reports.
|
ali@83
|
657 |
|
ali@83
|
658 |
For Windows-only users who want to use bookloupe from guiguts:
|
ali@83
|
659 |
|
ali@83
|
660 |
1) If you haven't already done so, download bookloupe-win32-xxx.zip
|
ali@83
|
661 |
from http://www.juiblex.co.uk/pgdp/bookloupe/
|
ali@83
|
662 |
|
ali@83
|
663 |
2) Extract the files into a suitable folder, e.g. C:\DP\bookloupe
|
ali@83
|
664 |
|
ali@83
|
665 |
3) Start Guiguts
|
ali@83
|
666 |
|
ali@83
|
667 |
4) Choose Preferences | File Paths | Set File Paths..
|
ali@83
|
668 |
|
ali@83
|
669 |
5) Click the "Locate Gutcheck..." button
|
ali@83
|
670 |
|
ali@83
|
671 |
6) Browse to the folder where you extracted bookloupe
|
ali@83
|
672 |
|
ali@83
|
673 |
7) Double-click bookloupe.exe
|
ali@89
|
674 |
|
ali@89
|
675 |
Now, whenever you do "Gutcheck" in Guiguts, it will run bookloupe
|
ali@89
|
676 |
instead. Since the output will look very like gutcheck output, you
|
ali@89
|
677 |
may want to check that it is actually bookloupe that is running. To do
|
ali@89
|
678 |
this, look at the black command line message window, which will say:
|
ali@89
|
679 |
|
ali@89
|
680 |
"bookloupe: Check and report on an e-text".
|
ali@89
|
681 |
|
ali@89
|
682 |
To return to using gutcheck for any reason, repeat steps 4 and 5
|
ali@89
|
683 |
above, and then,
|
ali@89
|
684 |
|
ali@89
|
685 |
6b) Browse back to the gutcheck folder, which is in a "tools"
|
ali@89
|
686 |
folder inside the main Guiguts folder. It will be something like
|
ali@89
|
687 |
"C:\DP\guiguts-win\tools\gutcheck", depending on where you installed
|
ali@89
|
688 |
Guiguts originally.
|
ali@89
|
689 |
|
ali@89
|
690 |
7b) Double-click gutcheck.exe
|
ali@89
|
691 |
|
ali@89
|
692 |
Now doing "Gutcheck" in Guiguts will run gutcheck itself, and the
|
ali@89
|
693 |
message in the black window should read:
|
ali@89
|
694 |
|
ali@89
|
695 |
"gutcheck: Check and report on an e-text".
|