3 Bookloupe documentation
6 bookloupe: lists possible common formatting errors in a Project
7 Gutenberg candidate file. Bookloupe is based on gutcheck, written
8 by Jim Tinsley. It is a command line program and can be used under
9 Microsoft Windows, Mac or Unix. For Windows-only people, there is
10 an appendix at the end with brief instructions for running it.
12 Current version: 1.91, a beta version leading up to version 2.0
14 This software is Copyright Jim Tinsley 2000-2005 and
15 J. Ali Harlow 2012 onwards.
17 Bookloupe comes with ABSOLUTELY NO WARRANTY. For details, read the file COPYING.
18 This is Free Software; you may redistribute it under certain conditions (GPL).
20 See http://www.juiblex.co.uk/pgdp/bookloupe/ for the latest version.
23 Usage is: bookloupe [-setopxlywm] filename
25 -s checks Single quotes
26 -e switches off Echoing of lines
28 -o produces an Overview only
29 -p sets strict quotes checking for Paragraphs
30 -x (paranoid) switches OFF typo checking and extra checks
31 -l turns off Line-end checks
32 -y sets error messages to stdout
33 -w is a special mode for web uploads (for future use)
34 -v (verbose) forces individual reporting of minor problems
35 -m interprets Markup of some common HTML tags and entities
36 -u warns about words in a user-defined typo file (bookloupe.typ or gutcheck.typ)
37 -d ignores some DP-specific markup
39 Running bookloupe without any parameters will display a brief help message.
43 bookloupe warpeace.txt
50 Bookloupe will handle e-texts encoded in UTF-8 (preferred),
51 ISO-8859-1 (also known as Latin-1), or WINDOWS-1252 (also known,
52 incorrectly, as ansi). The output will be in the same encoding
55 Echoing lines (-e to switch off)
57 You may find it convenient, when reviewing Bookloupe's
58 suggestions, to see the line that Bookloupe is questioning.
59 That way, you can often see at a glance whether it is
60 a real error that needs to be fixed, or a false positive:
61 something that belongs in the text but that Bookloupe's
62 limited programming doesn't understand.
64 By default, bookloupe echoes these lines, but if you don't
65 want to see the lines referred to, -e will switch it OFF.
68 Quotes (-s and -p switches)
70 Bookloupe always looks for unbalanced doublequotes in a
71 paragraph. It is a common convention for writers not to
72 close quotes in a paragraph if the next paragraph opens
73 with quotes and is a continuation by the same speaker.
75 Bookloupe therefore does not normally report unclosed quotes
76 if the next paragraph begins with a quote. If you need
77 to see all unclosed quotes, even where the next paragraph
78 begins with a quote, you should use the -p switch.
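
The paragraph rule above can be sketched in a few lines of Python. This is only an illustration of the behaviour described here, not bookloupe's actual code; the function name and the `strict` parameter (standing in for the -p switch) are made up for the example, and paragraphs are assumed to be separated by blank lines.

```python
def unclosed_quote_paragraphs(text, strict=False):
    """Return 1-based indexes of paragraphs with an odd number of double quotes.

    Unless strict is True, an unclosed paragraph is excused when the next
    paragraph opens with a double quote (the continuation convention).
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    flagged = []
    for i, para in enumerate(paragraphs):
        if para.count('"') % 2 == 0:
            continue  # quotes balance in this paragraph
        next_opens_with_quote = (
            i + 1 < len(paragraphs) and paragraphs[i + 1].startswith('"')
        )
        if strict or not next_opens_with_quote:
            flagged.append(i + 1)
    return flagged
```

With `strict=False` a speech that continues into the next paragraph is not reported; with `strict=True` every paragraph with an odd quote count is.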
80 Singlequotes (') are a problem, since the same character
81 is used for an apostrophe. I'm not sure that it is
82 possible to get 100% accuracy on singlequotes checking,
83 particularly since dialect, quite common in PG texts,
84 upsets the normal rules so badly. Consider the sentence:
85 'Tis often said that a man's a man for a' that.
86 As humans, we recognize that both apostrophes are used
87 for contractions rather than quotes, but it isn't easy
88 to get a program to recognize that.
90 Since bookloupe makes too many mistakes when trying to match
91 singlequotes, it doesn't look for unbalanced singlequotes
92 unless you specify the -s switch.
94 Consider these sentences, which illustrate the main cases:
96 'Tis often said that a fool and his money are soon parted.
98 'Becky's goin' home,' said Tom.
100 The dogs' tails wagged in unison.
102 Those 'pack dogs' of yours look more like wolves.
108 It's not bookloupe's job to be a spelling checker, but it
109 does check for a list of common typos and OCR errors if you
110 use the -t switch. (The -x switch also turns typo checking on.)
112 It also checks for character combinations, especially involving
113 h and b, which are often confused by OCR, that rarely or never
114 occur. For example, it queries "tbe" in a word. Now, "the" often
115 occurs, but "tbe" is very rare (heartbeat, hotbed), so I'm
116 playing the odds - a few false positives for many errors found.
117 Similarly with "ii", which is a very common OCR error.
119 Bookloupe suppresses multiple reporting of the first 40 "typos"
120 found. This is to remove the annoyance of seeing something like
121 "FN" (footnote) or "LK" (initials) flagged as a typo 147 times
125 Line-end checking (-l switch to disable)
127 All PG texts should have a Carriage Return (CR - character 13)
128 and a Line Feed (LF - character 10) at end of each line,
129 regardless of what O/S you made them on. DOS/Windows, Unix
130 and Mac have different conventions, but the final text should
131 always use a CR/LF pair as its line terminator.
133 By default, bookloupe verifies that every line has the
134 correct terminator. If you're working on a text in progress
135 under Linux, though, you might prefer to convert the line-ends
136 as a final step, and not want to see thousands of errors every
137 time you run bookloupe before then. In that case, you can turn
138 off this checking with the -l switch.
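
As a rough illustration of what a CR/LF check involves, here is a minimal Python sketch that works on the raw bytes of a file. It is not taken from bookloupe; the function name and the "LF without CR" and "no line terminator" labels are invented for the example ("CR without LF" is the wording bookloupe itself uses).

```python
def bad_line_ends(data):
    """Return (line_number, problem) pairs for lines not ending in CR/LF."""
    problems = []
    lines = data.split(b"\n")
    # A final empty piece just means the data ended with LF; ignore it.
    if lines and lines[-1] == b"":
        lines.pop()
        last_has_lf = True
    else:
        last_has_lf = False
    for number, line in enumerate(lines, start=1):
        is_last = number == len(lines)
        if is_last and not last_has_lf:
            problems.append((number, "no line terminator"))
        elif not line.endswith(b"\r"):
            problems.append((number, "LF without CR"))
        # A CR anywhere before the final byte is not part of a CR/LF pair.
        if b"\r" in line[:-1]:
            problems.append((number, "CR without LF"))
    return problems
```

To try it on a real file, read the file in binary mode (`open(name, "rb").read()`) so that Python does not translate the line ends first.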
141 Paranoid mode (-x switch to disable: Trust No One :-)
143 -x automatically switches OFF typo-checking (the -t flag)
144 and some extra checks, such as the standalone 1 and 0 queries.
147 Overview mode (-o switch)
149 This mode just gives a count of queries found
150 instead of a detailed list.
153 Header quote (-h switch)
155 If you use the -h switch, bookloupe will also display
156 the Title, Author, Release and Edition fields from the
157 PG header. This is useful mostly for the automated
158 checks we do on recently-posted texts.
161 Errors to stdout (-y switch)
163 If you're just running bookloupe normally, you can ignore
164 this. It's only there for programs that provide a front
165 end to bookloupe. It makes error messages appear within
166 the output of bookloupe so that the front end knows whether
170 Verbose reporting (-v switch)
172 Normally, if bookloupe sees lots of long lines, short lines,
173 spaced dashes, non-ASCII characters or dot-commas ".," it
174 assumes these are features of the text, counts and summarizes
175 them at the top of its report, but does not list them
176 individually. If the -v switch is on, bookloupe will list them all.
179 Markup interpretation (-m switch)
181 Normally, bookloupe flags anything it suspects of being HTML
182 markup as a possible error. When you use the -m switch,
183 however, it matches anything that looks like markup against
184 a short list of common HTML tags and entities. If the markup
185 is in that list, it either ignores the markup, in the case
186 of a tag, or "interprets" the markup as its nearest ASCII
187 equivalent, in the case of an entity. So, for example, using
188 this switch, bookloupe will "see"
190 “He went <i>thataway!</i>”
196 and report accordingly.
198 This switch does not, not, NOT check the validity of HTML;
199 it exists so that you can run bookloupe on most HTML texts
200 for PG, and get sane results. It does not support all tags.
201 It does not support all entities. When it sees a tag or entity
202 it does not recognize, it will query it as HTML just as if
203 you hadn't specified the -m switch.
205 Bookloupe will automatically switch on markup interpretation
206 if it sees a lot of tags that appear to be markup, so mostly, you
207 won't have to specify this.
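
The idea of "interpreting" markup can be sketched as follows. The tag and entity lists below are hypothetical stand-ins chosen for the example; the lists bookloupe really uses are in its source and may differ. Unknown tags and entities are left untouched, so a later check could still query them.

```python
import re

# Hypothetical short lists, for illustration only.
KNOWN_TAGS = {"i", "b", "em", "strong", "p", "br"}
KNOWN_ENTITIES = {
    "&amp;": "&", "&lt;": "<", "&gt;": ">", "&quot;": '"',
    "&ldquo;": '"', "&rdquo;": '"', "&mdash;": "--",
}

def interpret_markup(line):
    """Drop known tags, replace known entities, and leave the rest alone."""
    def replace_tag(match):
        name = match.group(1).lower()
        return "" if name in KNOWN_TAGS else match.group(0)

    def replace_entity(match):
        return KNOWN_ENTITIES.get(match.group(0), match.group(0))

    # Tags first, then entities, so that "&lt;i&gt;" is not treated as a tag.
    line = re.sub(r"</?([A-Za-z][A-Za-z0-9]*)[^<>]*>", replace_tag, line)
    return re.sub(r"&[A-Za-z]+;", replace_entity, line)
```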
209 User-defined typos (-u switch)
211 If you have a file named bookloupe.typ or gutcheck.typ either
212 in your current working directory or in the directory from
213 which you explicitly invoked bookloupe, but not necessarily on
214 your path, and if you specify the -u switch, bookloupe will
215 query any word specified in that file. The file is simple: one
216 word, in lower case, per line. Be careful not to put multiple
217 words onto a line, or leave any rubbish other than the word on
218 the line. You should have received a sample file bookloupe.typ
219 with this package. The file may be encoded in UTF-8 (preferred),
220 ISO-8859-1 (also known as Latin-1), or WINDOWS-1252 (also known,
221 incorrectly, as ansi).
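
To show how simple the file format is, here is a small Python sketch that loads such a word list and queries a text against it. It is not bookloupe's implementation: the function names are invented, and the case-insensitive comparison is an assumption made because the file holds lower-case words.

```python
import re

def load_typo_words(lines):
    """Build the set of user-defined typo words, one lower-case word per line."""
    return {line.strip() for line in lines if line.strip()}

def query_user_typos(text, typo_words):
    """Return (line_number, word) for each word of text found in typo_words."""
    hits = []
    for number, line in enumerate(text.splitlines(), start=1):
        # A word is a run of letters, optionally with internal apostrophes.
        for word in re.findall(r"[^\W\d_]+(?:'[^\W\d_]+)*", line):
            if word.lower() in typo_words:
                hits.append((number, word))
    return hits
```

In practice the word list would come from something like `open("bookloupe.typ", encoding="utf-8")`, which yields one line per word.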
223 Ignore DP markup (-d switch)
225 Distributed Proofreaders (http://www.pgdp.net) has for some
226 time been the main source of PG texts, and proofers there use
227 special conventions. This switch understands those conventions,
228 so that people can use bookloupe on files in progress that
229 haven't had the special conventions removed yet. The special
230 conventions supported are page-separators and
231 "<sc>", "</sc>", "/*", "*/", "/#", "#/", "/$", "$/".
234 You will probably only run bookloupe on a text once or maybe twice,
235 just prior to uploading; it usually finds a few formatting problems;
236 it also usually finds queries that aren't problems at all - it often
237 questions Tables of Contents for having short lines, for example.
238 These are called "false positives," and need a human to decide on
241 The text should be standard prose, and already close to PG normal
242 format (plain text, about 70 characters per line with blank lines
245 Bookloupe merely draws your attention to things that might be errors.
246 It is NOT a substitute for human judgement. Formatting choices like
247 short lines may be for a reason that this program can't understand.
249 Even the most careful human proofing can leave errors behind in a
250 text, and there are several automated checks you can do to help find
251 them. Of these, spellchecking (with _very_ careful human judgement) is
252 the most important and most useful.
254 Bookloupe does perform some basic typo-checking if you ask it to,
255 but its focus is on formatting errors specific to PG texts--
256 mismatched quotes, non-ASCII characters, bad spacing, bad line
257 length, HTML tags perhaps left from a conversion, unbalanced
260 Suggestions for additional checks would be appreciated and duly
261 considered, but no guarantees that they will be implemented.
266 How does Jim Tinsley use gutcheck?
268 Practically everyone I give gutcheck to asks me how _I_ use it.
269 Well, when I get a text for posting, say filename.txt, I run
271 gutcheck -o filename.txt
273 That gives me a quick idea what I'm dealing with. It'll tell
274 me what kind of problems gutcheck sees, and give me an idea
275 of how much more work needs to be done on the text. Keep in
276 mind that gutcheck doesn't do anything like a full spellcheck,
277 but when I see a text that has a lot of problems, I assume that
278 it probably needs a spellcheck too.
280 Having got a feel for the ballpark, I run
282 gutcheck filename.txt > jj
284 where jj is my personal, all-purpose filename for temporary data
285 that doesn't need to be kept. Then I open filename.txt and jj in
286 a split-screen view in my editor, and work down the text, fixing
287 whatever needs fixing, and skipping whatever doesn't. If your
288 editor doesn't split-screen, you can get much the same effect by
289 opening your original file in your normal editor, and jj (or your
290 equivalent name) in something like Notepad, keeping both in view
293 Twice a day, an automatic process looks at all recently-posted
294 texts, and emails Michael, me, and sometimes other people with
295 their gutcheck summaries.
299 Future development of bookloupe
301 Bookloupe version 2.0 is intended to add UTF-8 support to
302 gutcheck. All the functionality should already be implemented
303 in the beta versions leading up to version 2.0, although
304 some bugs may well remain.
306 Future versions will add support for UTF-8 characters that
307 are not in ISO-8859-1 (e.g., curled quotation marks);
308 characters that do not have a composed form (version 2
309 treats these as taking 2 or more columns); zero width and
310 wide characters (version 2 treats these as taking 1 column).
315 Explanations of common bookloupe messages:
317 --> 74 lines in this file have white space at end
319 PG texts shouldn't have extra white space added at end of line.
320 Don't worry too much about this; they're not doing any harm,
321 and they'll be removed during posting anyway.
324 --> 348 lines in this file are short. Not reporting short lines.
325 --> 84 lines in this file are long. Not reporting long lines.
326 --> 8 lines in this file are VERY long!
328 If there are a lot of long or short lines, bookloupe won't list
329 them individually. The short lines version of this message
330 is commonly seen when gutchecking poetry and some plays, where
331 the normal line length is shorter than the standard for prose.
332 A "VERY long" line is one over 80 characters. You normally
333 shouldn't have any of these, but sometimes you may have to render
334 a table that must be that long, or some special preformatted
335 quotation that can't be broken.
338 --> There are 75 spaced dashes and em-dashes in this file. Not reporting them.
340 The PG standard for an em-dash--like these--is two minus signs
341 with no spaces before or after them. However, some older texts
342 used spaced dashes - like these -- and if there are very many
343 such spaced dashes in the file, bookloupe just draws your
344 attention to it and doesn't list them individually.
348 Line 3020 - Non-ASCII character 233
350 Standard PG texts should use only ASCII characters with values
351 up to 127; however, non-English, accented characters can be
352 represented according to several different non-ASCII encoding
353 schemes, using values over 127. If you have a plain English text
354 with a few accented characters in words like cafe or tete-a-tete,
355 you might replace the accented characters with their unaccented
356 versions. The English pound sign is another commonly-seen
357 non-ASCII character. If you have enough non-ASCII characters in
358 your text that you feel removing them would degrade your text,
359 you should probably consider doing a UTF-8 text.
363 Line 1207 - Non-ISO-8859 character 156
365 Even in "8-bit" texts, there are distinctions between code sets.
366 The ISO-8859 family of 8-bit code sets is the most commonly used
367 in PG, and these sets do not define values in the range 128 through
368 159 as printable characters. It's quite common for someone on a
369 Windows or Mac machine to use a non-ISO character inadvertently,
370 so this message warns that the character is not only not ASCII,
371 but also outside the ISO-8859 range.
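
The two character-range messages above boil down to a comparison on byte values. The following Python sketch shows that classification for a text in a single-byte encoding such as ISO-8859-1 or WINDOWS-1252; it would not be meaningful on UTF-8 input, and it is an illustration rather than bookloupe's own logic. The function name is invented.

```python
def query_bytes(data):
    """Return (line_number, message) for suspicious bytes in 8-bit text."""
    queries = []
    for number, line in enumerate(data.splitlines(), start=1):
        for value in line:  # iterating over bytes yields integers
            if 128 <= value <= 159:
                # Not printable in the ISO-8859 family.
                queries.append((number, f"Non-ISO-8859 character {value}"))
            elif value > 127:
                queries.append((number, f"Non-ASCII character {value}"))
    return queries
```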
375 Line 46 - Tab character?
377 Some editors and WPs will put in Tab characters (character 9) to
378 indicate indented text. You should not use these in a PG text,
379 because you can't be sure how they will appear on a reader's
380 screen. Find the Tab, and replace it with the appropriate number
384 Line 1327 - Tilde character?
386 The tilde character (~) might be legitimately used, but it's the
387 character commonly used by OCR software to indicate a place where
388 it couldn't make out the letter, so bookloupe flags it.
392 Line 1347 - Asterisk?
394 Asterisks are reported only in paranoid mode (see -x).
395 Like tildes, they are often used to indicate errors, but they are
396 also legitimately used as line delimiters and footnote markers.
400 Line 1451 - Long line 129
402 PG texts should have lines shorter than 76 characters. There may be
403 occasions where you decide that you really have to go out to 79 characters,
404 but the sample above says that line 1451 is 129 characters long--
405 probably two lines run together.
409 Line 1590 - Short line?
411 PG texts should have lines longer than 54 characters. However,
412 there are special cases like poetry and tables of contents where
413 the lines _should_ be shorter. So treat bookloupe warnings about
414 short lines carefully. Sometimes it's a genuine formatting
415 problem; sometimes the line really needs to be short.
417 Hint: bookloupe will not flag lines as short if they are indented
418 --if they start with a space. I like to start inserted stanzas
419 and other such items indented with a couple of spaces so that
420 they stand out from the main text anyway.
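
The length limits mentioned in this file (shorter than 76, "VERY long" over 80, longer than 54, indented lines exempt) can be put together in a short Python sketch. This is a reading of the documentation, not bookloupe's code. In particular, exempting the last line of a paragraph from the short-line check is an assumption added here so that ordinary paragraph endings are not flagged.

```python
LONG_LIMIT = 75       # "lines shorter than 76"
VERY_LONG_LIMIT = 80  # a "VERY long" line is one over 80 characters
SHORT_LIMIT = 55      # "lines longer than 54 characters"

def classify_lengths(lines):
    """Return (line_number, label) for lines that look too long or too short."""
    results = []
    for index, line in enumerate(lines):
        length = len(line)
        next_is_blank = index + 1 >= len(lines) or not lines[index + 1].strip()
        if length > VERY_LONG_LIMIT:
            results.append((index + 1, "very long"))
        elif length > LONG_LIMIT:
            results.append((index + 1, "long"))
        elif (
            0 < length < SHORT_LIMIT
            and not line.startswith(" ")  # indented lines are exempt
            and not next_is_blank         # assumed: paragraph endings are exempt
        ):
            results.append((index + 1, "short"))
    return results
```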
424 Line 1804 - Begins with punctuation?
426 Lines should normally not begin with commas, periods and so on.
427 An exception is ellipses . . . which can happen at start of line.
431 Line 1850 - Spaced em-dash?
433 The PG standard for an em-dash--like these--is two minus signs
434 with no spaces before or after them. Bookloupe flags non-PG
435 em-dashes - like this one. Normally, you will replace it with a
440 Line 1904 - Query he/be error?
442 Bookloupe makes a very minor effort to look for that scourge of all
443 proofreaders, "be" replacing "he" or vice-versa, and draws your
444 attention to it when it thinks it has found one.
448 Line 2017 - Query digit in a1most
450 The digit 1 is commonly OCRed for the letter l, the digit 0 for
451 the letter O, and so on. When bookloupe sees a mix of digits and
452 letters, it warns you. It may generate a false positive for
457 Line 2083 - Query standalone 0
459 In paranoid mode (see -x) only, bookloupe warns about the digit 0
460 and the number 1 standing alone as a word. This can happen if the
461 OCR misreads the words O or I.
465 Line 2115 - Query word whetber
467 If you have switched typo-checking on, bookloupe looks for
468 potential typos, especially common h/b errors. It's not
469 infallible; it sometimes queries legit words, but it's
470 always worth taking a look.
474 Line 2190 column 14 - Missing space?
476 Omitting a space is a very common error,especially coming from
477 OCRed text,and can be hard for a human to spot. The commas in
478 the previous sentence illustrate the kind of thing I mean.
482 Line 2240 column 48 - Spaced punctuation?
484 The flip side of the "missing space" error , here , is when extra
485 spaces are added before punctuation . Some old texts appear to add
486 extra spaces around punctuation consistently, but this was a
487 typographical convention rather than the author's intent, and the
488 extra "spaces" should be removed when preparing a PG text.
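
Both spacing checks can be approximated with regular expressions. The patterns below are a deliberately simple sketch, not the tests bookloupe performs. The reported column is the 1-based position of the punctuation mark (missing space) or of the stray space (spaced punctuation), which is an assumption about how columns are counted.

```python
import re

# A letter, a punctuation mark, then a letter with no space between.
MISSING_SPACE = re.compile(r"[A-Za-z][,;:!?][A-Za-z]")
# A letter, a space, then punctuation followed by white space or end of line.
SPACED_PUNCT = re.compile(r"[A-Za-z] [,;:!?.](?:\s|$)")

def spacing_queries(line):
    """Return sorted (column, message) pairs for one line of text."""
    queries = []
    for match in MISSING_SPACE.finditer(line):
        queries.append((match.start() + 2, "Missing space?"))
    for match in SPACED_PUNCT.finditer(line):
        queries.append((match.start() + 2, "Spaced punctuation?"))
    return sorted(queries)
```

A pattern this simple will also query spaced ellipses in mid-line, so it would need refining before real use.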
492 Line 2301 column 19 - Unspaced quotes?
494 Another common spacing problem occurs in a phrase like "You wait
499 Line 2385 column 27 - Wrongspaced quotes?
501 Bookloupe checks whether a quote seems to be a start or end quote,
502 and queries those that appear to be misplaced. This does give rise
503 to false positives when quotes are nested, for example:
505 "And how," she asked, "will your "friends" help you now?"
507 but these false positives are worth it because of the many cases
508 that this test catches, notably those like:
510 "And how, "she said," will your friends help you now?"
512 Sometimes a "wrongspaced quotes" query will arise because an earlier
513 quote in the paragraph was omitted, so if the place specified seems
514 to be OK, look back to see whether there's a problem in the preceding
519 Line 2400 - HTML Tag? <PRE>
521 Some PG texts have been converted from HTML, and not all of the
522 HTML tags have been removed.
526 Line 2402 - HTML symbol? &emdash;
528 Similarly, special HTML symbol characters can survive into PG
529 texts. Can occasionally produce amusing false positives like
530 . . . Marwick & Co were well known for it;
534 Line 2540 - Mismatched quotes
536 Another bookloupe mainstay--unclosed doublequotes in a paragraph.
537 See the discussion of quotes in the switches section near the
540 Since the mismatch doesn't occur on any one line, bookloupe quotes
541 the line number of the first blank line following the paragraph,
542 since this is the point where it reconciles the count of quotes.
543 However, if bookloupe is echoing lines, that is, you haven't used
544 the -e switch, it will show the _first_ line of the paragraph,
545 to help you find the place without using line numbers. The
546 offending paragraph is therefore between the quoted line and
547 the line number given.
551 Line 2587 - Mismatched single quotes
553 Only checked with the -s switch, since checking single quotes is
554 not a very reliable process. Otherwise, the same logic as for
555 doublequotes applies.
559 Line 2877 - Mismatched round brackets?
561 Also curly and square brackets. Texts with a lot of brackets, like
562 plays with bracketed stage instructions, may have mismatches.
566 Line 3204 - Two successive CRs?
567 Line 3281 position 75 - CR without LF?
569 These are the invalid line-end warnings. See the discussion of
570 line-end checking in the switches section near the start of this
571 file. If you see these, and your editor doesn't show anything
572 wrong, you should probably try deleting the characters just before
573 and after the line end, and the line-end itself, then retyping the
574 characters and the line-end.
577 Line 2940 - Paragraph starts with lower-case
579 A common error in an e-text is for an extra blank line
581 to be put in, like the blank line above, and this often
582 shows up as a new paragraph beginning with lower case.
583 Sometimes the blank line is deliberate, as when a
584 quotation is inserted in a speech. Use your judgement.
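
A check of this kind only needs to know whether the previous line was blank. The Python sketch below shows the idea; it is an illustration with an invented function name, and it treats the first line of the file as the start of a paragraph.

```python
def lowercase_paragraph_starts(lines):
    """Return 1-based numbers of lines that open a paragraph in lower case."""
    results = []
    previous_blank = True  # the first line counts as a paragraph start
    for number, line in enumerate(lines, start=1):
        stripped = line.strip()
        if stripped and previous_blank and stripped[0].islower():
            results.append(number)
        previous_blank = not stripped
    return results
```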
587 Line 2987 - Extra period?
589 An extra period. is a. common problem in OCRed text. and usually
590 arises when a speck of dust on the page is mistaken for a period.
591 or. as occasionally happens. when a comma loses its tail.
594 Line 3012 column 12 - Double punctuation?
596 Double punctuation., like that,, is a common typo and
597 scanno. Some books have much legit double punctuation,
598 like etc., etc., but it's worth checking anyway.
604 For Windows-only users who are unfamiliar with DOS:
606 If you're a Windows-only user, you need to save
607 bookloupe.exe into the folder (directory) where the
608 text file you want to check is. Let's say your
609 text file is in C:\gut, then you should save
610 bookloupe.exe into C:\gut.
612 Now get to a console. You can do this by
613 selecting the "Command Prompt" or "MS-DOS Prompt"
614 option that will be somewhere on your
617 Now get into the C:\gut directory.
618 You can do this using the cd (change directory)
621 and your prompt will change to
623 so you know you're in the right place.
626 bookloupe yourfile.txt
627 and you'll see bookloupe's report
629 By default, bookloupe prints its queries to screen.
630 If you want to create a file of them, to edit
631 against the text, you can use the greater-than
632 sign (>) to tell it to output the report to a
633 file. For example, if you want its report in a
634 file called queries.lst, you could type
636 bookloupe yourfile.txt > queries.lst
638 The queries.lst file will then contain the listing
639 of possible formatting errors, and you can
640 edit it alongside your text.
642 Whatever you do, DON'T make the filename after
643 the greater-than sign the name of a file already
644 on your disk that you want to keep, because
645 the greater-than sign will cause bookloupe to
646 replace any existing file of that name.
648 So, for example, if you have two Tolstoy files
649 that you want to check, called WARPEACE.TXT and
650 ANNAK.TXT, make sure that neither of these names
651 is ever used following the greater-than sign.
652 To check these correctly, you might do:
654 bookloupe warpeace.txt > war.lst
658 bookloupe annak.txt > annak.lst
660 separately. Then you can look at war.lst and annak.lst
661 to see the bookloupe reports.