Fix bug #12: Balanced square brackets test should recognize multi-line [Illustration] tags
3 Bookloupe documentation
6 bookloupe: lists possible common formatting errors in a Project
7 Gutenberg candidate file. Bookloupe is based on gutcheck, written
8 by Jim Tinsley. It is a command line program and can be used under
9 Microsoft Windows, Mac or Unix. For Windows-only people, there is
10 an appendix at the end with brief instructions for running it.
14 This software is Copyright Jim Tinsley 2000-2005 and
15 J. Ali Harlow 2012 onwards.
17 Bookloupe comes with ABSOLUTELY NO WARRANTY. For details, read the file COPYING.
18 This is Free Software; you may redistribute it under certain conditions (GPL).
20 See http://www.juiblex.co.uk/pgdp/bookloupe/ for the latest version.
23 Usage is: bookloupe [-setopxlywm] filename
25 -s checks Single quotes
26 -e switches off Echoing of lines
28 -o produces an Overview only
29 -p sets strict quotes checking for Paragraphs
30 -x (paranoid) switches OFF typo checking and extra checks
31 -l turns off Line-end checks
32 -y sets error messages to stdout
33 -w is a special mode for web uploads (for future use)
34 -v (verbose) forces individual reporting of minor problems
35 -m interprets Markup of some common HTML tags and entities
36 -u warns about words in a user-defined typo file (bookloupe.typ or gutcheck.typ)
37 -d ignores some DP-specific markup
39 Running bookloupe without any parameters will display a brief help message.
43 bookloupe warpeace.txt
50 Bookloupe will handle e-texts encoded in UTF-8 (preferred),
51 ISO-8859-1 (also known as Latin-1), or WINDOWS-1252 (also known,
52 incorrectly, as ansi). The output will be in the same encoding
55 Echoing lines (-e to switch off)
57 You may find it convenient, when reviewing Bookloupe's
58 suggestions, to see the line that Bookloupe is questioning.
59 That way, you can often see at a glance whether it is
60 a real error that needs to be fixed, or a false positive
61 that should be in the text, but Bookloupe's limited
62 programming doesn't understand.
64 By default, bookloupe echoes these lines, but if you don't
65 want to see the lines referred to, -e will switch it OFF.
68 Quotes (-s and -p switches)
70 Bookloupe always looks for unbalanced doublequotes in a
71 paragraph. It is a common convention for writers not to
72 close quotes in a paragraph if the next paragraph opens
73 with quotes and is a continuation by the same speaker.
75 Bookloupe therefore does not normally report unclosed quotes
76 if the next paragraph begins with a quote. If you need
77 to see all unclosed quotes, even where the next paragraph
78 begins with a quote, you should use the -p switch.
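
The continuation rule above can be sketched in a few lines of Python. This is only an illustration, not bookloupe's actual code: the function name and the `strict` parameter are hypothetical, paragraphs are assumed to be separated by one blank line, and only the straight doublequote (") is counted.

```python
def unclosed_quote_paragraphs(text, strict=False):
    """Return 1-based numbers of paragraphs with an odd count of doublequotes.

    With strict=False, a paragraph is excused when the next paragraph
    opens with a doublequote (the continuation convention). strict=True
    plays the role of the -p switch in this sketch.
    """
    paragraphs = [p for p in text.split("\n\n") if p.strip()]
    flagged = []
    for i, para in enumerate(paragraphs):
        if para.count('"') % 2 == 0:
            continue  # quotes balance within this paragraph
        next_opens_with_quote = (
            i + 1 < len(paragraphs)
            and paragraphs[i + 1].lstrip().startswith('"')
        )
        if strict or not next_opens_with_quote:
            flagged.append(i + 1)
    return flagged


sample = (
    '"I went home,\n\n'
    '"and then I slept."\n\n'
    '"Who is there?\n\n'
    'Nobody answered.'
)
print(unclosed_quote_paragraphs(sample))               # [3]
print(unclosed_quote_paragraphs(sample, strict=True))  # [1, 3]
```

Paragraph 1 is excused by default because paragraph 2 opens with a quote; the strict call reports it as well.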
80 Singlequotes (' and ’) are a problem, since the same
81 character is used for an apostrophe. I'm not sure that it is
82 possible to get 100% accuracy on singlequotes checking,
83 particularly since dialect, quite common in PG texts,
84 upsets the normal rules so badly. Consider the sentence:
85 'Tis often said that a man's a man for a' that.
86 As humans, we recognize that both apostrophes are used
87 for contractions rather than quotes, but it isn't easy
88 to get a program to recognize that.
90 Since bookloupe makes too many mistakes when trying to match
91 singlequotes, it doesn't look for unbalanced singlequotes
92 unless you specify the -s switch.
94 Consider these sentences, which illustrate the main cases:
96 'Tis often said that a fool and his money are soon parted.
98 'Becky's goin' home,' said Tom.
100 The dogs' tails wagged in unison.
102 Those 'pack dogs' of yours look more like wolves.
108 It's not bookloupe's job to be a spelling checker, but it
109 does check for a list of common typos and OCR errors if you
110 use the -t switch. (The -x switch, which disables paranoid mode, also turns typo checking off.)
112 It also checks for character combinations, especially involving
113 h and b, which are often confused by OCR, that rarely or never
114 occur. For example, it queries "tbe" in a word. Now, "the" often
115 occurs, but "tbe" is very rare (heartbeat, hotbed), so I'm
116 playing the odds - a few false positives for many errors found.
117 Similarly with "ii", which is a very common OCR error.
119 Bookloupe suppresses multiple reporting of the first 40 "typos"
120 found. This is to remove the annoyance of seeing something like
121 "FN" (footnote) or "LK" (initials) flagged as a typo 147 times
125 Line-end checking (-l switch to disable)
127 All PG texts should have a Carriage Return (CR - character 13)
128 and a Line Feed (LF - character 10) at end of each line,
129 regardless of what O/S you made them on. DOS/Windows, Unix
130 and Mac have different conventions, but the final text should
131 always use a CR/LF pair as its line terminator.
133 By default, bookloupe verifies that every line does have
134 the correct terminator, but if you're on a work-in-progress
135 in Linux, you might want to convert the line-ends as a final
136 step, and not want to see thousands of errors every time you
137 run bookloupe before that final step, so you can turn off
138 this checking with the -l switch.
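
As a rough illustration of what a line-end check involves, the Python sketch below walks the raw bytes of a file and reports any terminator that is not a CR/LF pair. It is not taken from bookloupe's source; the function name is hypothetical, and the "LF without CR" wording is an assumption modelled on the "CR without LF?" message described later in this file.

```python
def bad_line_ends(data: bytes):
    """Return (line_number, problem) pairs for terminators other than CR/LF."""
    problems = []
    line = 1
    i = 0
    while i < len(data):
        byte = data[i]
        if byte == 13:  # CR (character 13)
            if i + 1 < len(data) and data[i + 1] == 10:
                i += 2  # well-formed CR/LF pair
            else:
                problems.append((line, "CR without LF"))
                i += 1
            line += 1
        elif byte == 10:  # bare LF (character 10), Unix style
            problems.append((line, "LF without CR"))
            line += 1
            i += 1
        else:
            i += 1
    return problems


print(bad_line_ends(b"one\r\ntwo\nthree\rfour\r\n"))
# [(2, 'LF without CR'), (3, 'CR without LF')]
```

A file saved on Linux would produce one report per line, which is the flood of messages the -l switch is there to suppress.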
141 Paranoid mode (-x switch to disable: Trust No One :-)
143 -x automatically switches OFF typo-checking (the -t flag)
144 and some extra checks, like standalone 1 and 0 queries.
147 Overview mode (-o switch)
149 This mode just gives a count of queries found
150 instead of a detailed list.
153 Header quote (-h switch)
155 If you use the -h switch, bookloupe will also display
156 the Title, Author, Release and Edition fields from the
157 PG header. This is useful mostly for the automated
158 checks we do on recently-posted texts.
161 Errors to stdout (-y switch)
163 If you're just running bookloupe normally, you can ignore
164 this. It's only there for programs that provide a front
165 end to bookloupe. It makes error messages appear within
166 the output of bookloupe so that the front end knows whether
170 Verbose reporting (-v switch)
172 Normally, if bookloupe sees lots of long lines, short lines,
173 spaced dashes, non-ASCII characters or dot-commas ".," it
174 assumes these are features of the text, counts and summarizes
175 them at the top of its report, but does not list them
176 individually. If the -v switch is on, bookloupe will list them all.
179 Markup interpretation (-m switch)
181 Normally, bookloupe flags anything it suspects of being HTML
182 markup as a possible error. When you use the -m switch,
183 however, it matches anything that looks like markup against
184 a short list of common HTML tags and entities. If the markup
185 is in that list, it either ignores the markup, in the case
186 of a tag, or "interprets" the markup as its nearest ASCII
187 equivalent, in the case of an entity. So, for example, using
188 this switch, bookloupe will "see"
190 “He went <i>thataway!</i>”
196 and report accordingly.
198 This switch does not, not, NOT check the validity of HTML;
199 it exists so that you can run bookloupe on most HTML texts
200 for PG, and get sane results. It does not support all tags.
201 It does not support all entities. When it sees a tag or entity
202 it does not recognize, it will query it as HTML just as if
203 you hadn't specified the -m switch.
205 Bookloupe will automatically switch on markup interpretation
206 if it sees a lot of tags that appear to be markup, so mostly, you
207 won't have to specify this.
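
The behaviour described for -m (ignore known tags, map known entities to an ASCII equivalent, query everything else) might look like the Python sketch below. The tag and entity lists here are short, made-up examples, not bookloupe's real tables, and the function name is hypothetical.

```python
import re

# Hypothetical short lists; bookloupe's real lists may differ.
KNOWN_TAGS = {"i", "b", "em", "p", "br"}
KNOWN_ENTITIES = {"&amp;": "&", "&lt;": "<", "&gt;": ">",
                  "&quot;": '"', "&eacute;": "e"}

# Matches an opening or closing tag (group 1 is the tag name) or an entity.
MARKUP = re.compile(r"</?([A-Za-z][A-Za-z0-9]*)[^<>]*>|&[A-Za-z]+;")


def interpret_markup(line):
    """Return (interpreted_line, unknown_tokens) for one line of text."""
    unknown = []

    def replace(match):
        token = match.group(0)
        tag_name = match.group(1)
        if tag_name is not None:               # an HTML tag
            if tag_name.lower() in KNOWN_TAGS:
                return ""                      # known tag: ignore it
        elif token in KNOWN_ENTITIES:          # an HTML entity
            return KNOWN_ENTITIES[token]       # nearest ASCII equivalent
        unknown.append(token)                  # unrecognized: leave and query
        return token

    return MARKUP.sub(replace, line), unknown


print(interpret_markup("She said <b>no</b> &amp; left"))
# ('She said no & left', [])
print(interpret_markup("Caf&eacute; <blink>now</blink> &foo;"))
# ('Cafe <blink>now</blink> &foo;', ['<blink>', '</blink>', '&foo;'])
```

Anything returned in the second list would still be queried as HTML, just as if -m had not been given.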
209 User-defined typos (-u switch)
211 If you have a file named bookloupe.typ or gutcheck.typ either
212 in your current working directory or in the directory from
213 which you explicitly invoked bookloupe, but not necessarily on
214 your path, and if you specify the -u switch, bookloupe will
215 query any word specified in that file. The file is simple: one
216 word, in lower case, per line. Be careful not to put multiple
217 words onto a line, or leave any rubbish other than the word on
218 the line. You should have received a sample file bookloupe.typ
219 with this package. The file may be encoded in UTF-8 (preferred),
220 ISO-8859-1 (also known as Latin-1), or WINDOWS-1252 (also known,
221 incorrectly, as ansi).
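
To make the file format concrete, here is a small Python sketch of reading a one-word-per-line typo list and querying a text against it. The function names are hypothetical and the word-splitting rule is an assumption; bookloupe's own tokenizing may differ.

```python
import re

# A "word" here is a run of letters, optionally with internal apostrophes.
WORD = re.compile(r"[^\W\d_]+(?:'[^\W\d_]+)*")


def load_typo_words(lines):
    """Collect one lower-case word per line, skipping blank lines."""
    return {line.strip().lower() for line in lines if line.strip()}


def query_user_typos(text_lines, typo_words):
    """Return (line_number, word) for each word found in the typo list."""
    hits = []
    for number, line in enumerate(text_lines, start=1):
        for word in WORD.findall(line):
            if word.lower() in typo_words:
                hits.append((number, word))
    return hits


typo_words = load_typo_words(["arid\n", "\n", "modem\n"])
print(query_user_typos(["He looked arid saw", "a Modem house."], typo_words))
# [(1, 'arid'), (2, 'Modem')]
```

The sample list uses "arid" and "modem", which are real words but are often OCR misreadings of "and" and "modern".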
223 Ignore DP markup (-d switch)
225 Distributed Proofreaders (http://www.pgdp.net) has for some
226 time been the main source of PG texts, and proofers there use
227 special conventions. This switch understands those conventions,
228 so that people can use bookloupe on files in process that
229 haven't had the special conventions removed yet. The special
230 conventions supported are page-separators and
231 "<sc>", "</sc>", "/*", "*/", "/#", "#/", "/$", "$/".
234 You will probably only run bookloupe on a text once or maybe twice,
235 just prior to uploading; it usually finds a few formatting problems;
236 it also usually finds queries that aren't problems at all - it often
237 questions Tables of Contents for having short lines, for example.
238 These are called "false positives," and need a human to decide on
241 The text should be standard prose, and already close to PG normal
242 format (plain text, about 70 characters per line with blank lines
245 Bookloupe merely draws your attention to things that might be errors.
246 It is NOT a substitute for human judgement. Formatting choices like
247 short lines may be for a reason that this program can't understand.
249 Even the most careful human proofing can leave errors behind in a
250 text, and there are several automated checks you can do to help find
251 them. Of these, spellchecking (with _very_ careful human judgement) is
252 the most important and most useful.
254 Bookloupe does perform some basic typo-checking if you ask it to,
255 but its focus is on formatting errors specific to PG texts—
256 mismatched quotes, non-ASCII characters, bad spacing, bad line
257 length, HTML tags perhaps left from a conversion, unbalanced
260 Suggestions for additional checks would be appreciated and duly
261 considered, but no guarantees that they will be implemented.
266 How does Jim Tinsley use gutcheck?
268 Practically everyone I give gutcheck to asks me how _I_ use it.
269 Well, when I get a text for posting, say filename.txt, I run
271 gutcheck -o filename.txt
273 That gives me a quick idea what I'm dealing with. It'll tell
274 me what kind of problems gutcheck sees, and give me an idea
275 of how much more work needs to be done on the text. Keep in
276 mind that gutcheck doesn't do anything like a full spellcheck,
277 but when I see a text that has a lot of problems, I assume that
278 it probably needs a spellcheck too.
280 Having got a feel for the ballpark, I run
282 gutcheck filename.txt > jj
284 where jj is my personal, all-purpose filename for temporary data
285 that doesn't need to be kept. Then I open filename.txt and jj in
286 a split-screen view in my editor, and work down the text, fixing
287 whatever needs fixing, and skipping whatever doesn't. If your
288 editor doesn't split-screen, you can get much the same effect by
289 opening your original file in your normal editor, and jj (or your
290 equivalent name) in something like Notepad, keeping both in view
293 Twice a day, an automatic process looks at all recently-posted
294 texts, and emails Michael, me, and sometimes other people with
295 their gutcheck summaries.
299 Future development of bookloupe
301 Future versions will add support for UTF-8 characters that
302 are not in ISO-8859-1 (e.g., curled quotation marks);
303 characters that do not have a composed form (version 2.0
304 treats these as taking 2 or more columns); zero width and
305 wide characters (version 2.0 treats these as taking 1 column).
310 Explanations of common bookloupe messages:
312 --> 74 lines in this file have white space at end
314 PG texts shouldn't have extra white space added at end of line.
315 Don't worry too much about this; they're not doing any harm,
316 and they'll be removed during posting anyway.
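
If you want to reproduce this count yourself, a one-line Python sketch is enough. It assumes the lines have already had their line terminators removed and treats only spaces and tabs as white space; the function name is hypothetical.

```python
def count_trailing_whitespace(lines):
    """Count lines that end in a space or a tab."""
    return sum(1 for line in lines if line != line.rstrip(" \t"))


print(count_trailing_whitespace(["ok", "bad ", "tab\t", ""]))  # 2
```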
319 --> 348 lines in this file are short. Not reporting short lines.
320 --> 84 lines in this file are long. Not reporting long lines.
321 --> 8 lines in this file are VERY long!
323 If there are a lot of long or short lines, bookloupe won't list
324 them individually. The short lines version of this message
325 is commonly seen when gutchecking poetry and some plays, where
326 the normal line length is shorter than the standard for prose.
327 A "VERY long" line is one over 80 characters. You normally
328 shouldn't have any of these, but sometimes you may have to render
329 a table that must be that long, or some special preformatted
330 quotation that can't be broken.
333 --> There are 75 spaced dashes and em-dashes in this file. Not reporting them.
335 The PG standard for an em-dash--like these--is two minus signs
336 with no spaces before or after them. However, some older texts
337 used spaced dashes - like these -- and if there are very many
338 such spaced dashes in the file, bookloupe just draws your
339 attention to it and doesn't list them individually.
343 Line 3020 - Non-ASCII character 233
345 Standard PG texts should use only ASCII characters with values
346 up to 127; however, non-English, accented characters can be
347 represented according to several different non-ASCII encoding
348 schemes, using values over 127. If you have a plain English text
349 with a few accented characters in words like cafe or tete-a-tete,
350 you might replace the accented characters with their unaccented
351 versions. The English pound sign is another commonly-seen
352 non-ASCII character. If you have enough non-ASCII characters in
353 your text that you feel removing them would degrade your text,
354 you should probably consider doing a UTF-8 text.
358 Line 1207 - Non-ISO-8859 character 156
360 Even in "8-bit" texts, there are distinctions between code sets.
361 The ISO-8859 family of 8-bit code sets is the most commonly used
362 in PG, and these sets do not define values in the range 128 through
363 159 as printable characters. It's quite common for someone on a
364 Windows or Mac machine to use a non-ISO character inadvertently,
365 so this message warns that the character is not only not ASCII,
366 but also outside the ISO-8859 range.
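
The two character-range messages can be illustrated with a short Python sketch that classifies the bytes of an 8-bit line. It is a simplification under stated assumptions: the function name is hypothetical, the input is treated as single-byte text, and UTF-8 input is not handled.

```python
def classify_bytes(line: bytes):
    """Return a query string for each byte outside the ASCII range."""
    queries = []
    for value in line:
        if 128 <= value <= 159:
            # Not printable in the ISO-8859 family
            queries.append(f"Non-ISO-8859 character {value}")
        elif value > 127:
            queries.append(f"Non-ASCII character {value}")
    return queries


print(classify_bytes("café".encode("latin-1")))
# ['Non-ASCII character 233']
print(classify_bytes(b"\x9c"))
# ['Non-ISO-8859 character 156']
```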
370 Line 46 - Tab character?
372 Some editors and WPs will put in Tab characters (character 9) to
373 indicate indented text. You should not use these in a PG text,
374 because you can't be sure how they will appear on a reader's
375 screen. Find the Tab, and replace it with the appropriate number
379 Line 1327 - Tilde character?
381 The tilde character (~) might be legitimately used, but it's the
382 character commonly used by OCR software to indicate a place where
383 it couldn't make out the letter, so bookloupe flags it.
387 Line 1347 - Asterisk?
389 Asterisks are reported only in paranoid mode (see -x).
390 Like tildes, they are often used to indicate errors, but they are
391 also legitimately used as line delimiters and footnote markers.
395 Line 1451 - Long line 129
397 PG texts should have lines shorter than 76 characters. There may be occasions
398 where you decide that you really have to go out to 79 characters,
399 but the sample above says that line 1451 is 129 characters long—
400 probably two lines run together.
404 Line 1590 - Short line?
406 PG texts should have lines longer than 54 characters. However,
407 there are special cases like poetry and tables of contents where
408 the lines _should_ be shorter. So treat bookloupe warnings about
409 short lines carefully. Sometimes it's a genuine formatting
410 problem; sometimes the line really needs to be short.
412 Hint: bookloupe will not flag lines as short if they are indented
413 —if they start with a space. I like to start inserted stanzas
414 and other such items indented with a couple of spaces so that
415 they stand out from the main text anyway.
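
The long-line and short-line rules can be sketched together in Python. The limits come from the text above (lines should be shorter than 76 and longer than 54 characters, and indented lines are never short). Skipping the last line of a paragraph is an assumption of this sketch, since such lines are naturally short; the function name is hypothetical.

```python
LONG_LIMIT = 75    # "lines shorter than 76"
SHORT_LIMIT = 55   # "lines longer than 54 characters"


def line_length_queries(lines):
    """Return (line_number, message) pairs for suspicious line lengths."""
    queries = []
    for number, line in enumerate(lines, start=1):
        length = len(line)
        # lines is 0-based, so lines[number] is the line after this one.
        next_line = lines[number] if number < len(lines) else ""
        if length > LONG_LIMIT:
            queries.append((number, f"Long line {length}"))
        elif (0 < length < SHORT_LIMIT
              and not line.startswith(" ")   # indented lines are exempt
              and next_line.strip()):        # assumption: skip paragraph ends
            queries.append((number, "Short line?"))
    return queries


text = ["x" * 129, "A short line", "y" * 60, "", "  indented verse", "more text"]
print(line_length_queries(text))
# [(1, 'Long line 129'), (2, 'Short line?')]
```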
419 Line 1804 - Begins with punctuation?
421 Lines should normally not begin with commas, periods and so on.
422 An exception is ellipses . . . which can happen at start of line.
426 Line 1850 - Spaced em-dash?
428 The PG standard for an em-dash--like these--is two minus signs
429 with no spaces before or after them. Bookloupe flags non-PG
430 em-dashes - like this one. Normally, you will replace it with a
435 Line 1904 - Query he/be error?
437 Bookloupe makes a very minor effort to look for that scourge of all
438 proofreaders, "be" replacing "he" or vice-versa, and draws your
439 attention to it when it thinks it has found one.
443 Line 2017 - Query digit in a1most
445 The digit 1 is commonly OCRed for the letter l, the digit 0 for
446 the letter O, and so on. When bookloupe sees a mix of digits and
447 letters, it warns you. It may generate a false positive for
452 Line 2083 - Query standalone 0
454 In paranoid mode (see -x) only, bookloupe warns about the digit 0
455 and the number 1 standing alone as a word. This can happen if the
456 OCR misreads the words O or I.
460 Line 2115 - Query word whetber
462 If you have switched typo-checking on, bookloupe looks for
463 potential typos, especially common h/b errors. It's not
464 infallible; it sometimes queries legit words, but it's
465 always worth taking a look.
469 Line 2190 column 14 - Missing space?
471 Omitting a space is a very common error,especially coming from
472 OCRed text,and can be hard for a human to spot. The commas in
473 the previous sentence illustrate the kind of thing I mean.
477 Line 2240 column 48 - Spaced punctuation?
479 The flip side of the "missing space" error , here , is when extra
480 spaces are added before punctuation . Some old texts appear to add
481 extra spaces around punctuation consistently, but this was a
482 typographical convention rather than the author's intent, and the
483 extra "spaces" should be removed when preparing a PG text.
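
Both spacing checks lend themselves to simple pattern matching. The Python sketch below is an approximation, not bookloupe's logic: the function name is hypothetical, the reported column is assumed to be the 1-based position of the punctuation mark, and periods are left out of the missing-space pattern to avoid flagging abbreviations. Spaced ellipses . . . would be flagged by this sketch.

```python
import re

SPACED = re.compile(r" [,;:!?.]")                   # space before punctuation
MISSING = re.compile(r"[A-Za-z][,;:!?][A-Za-z]")    # no space after punctuation


def spacing_queries(line):
    """Return sorted (column, message) pairs for one line."""
    queries = []
    for match in MISSING.finditer(line):
        # match starts at the letter before the punctuation mark
        queries.append((match.start() + 2, "Missing space?"))
    for match in SPACED.finditer(line):
        # match starts at the space before the punctuation mark
        queries.append((match.start() + 2, "Spaced punctuation?"))
    return sorted(queries)


print(spacing_queries("a very common error,especially"))
# [(20, 'Missing space?')]
print(spacing_queries("here , is"))
# [(6, 'Spaced punctuation?')]
```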
487 Line 2301 column 19 - Unspaced quotes?
489 Another common spacing problem occurs in a phrase like "You wait
494 Line 2385 column 27 - Wrongspaced quotes?
496 Bookloupe checks whether a quote seems to be a start or end quote,
497 and queries those that appear to be misplaced. This does give rise
498 to false positives when quotes are nested, for example:
500 "And how," she asked, "will your "friends" help you now?"
502 but these false positives are worth it because of the many cases
503 that this test catches, notably those like:
505 "And how, "she said," will your friends help you now?"
507 Sometimes a "wrongspaced quotes" query will arise because an earlier
508 quote in the paragraph was omitted, so if the place specified seems
509 to be OK, look back to see whether there's a problem in the preceding
514 Line 2400 - HTML Tag? <PRE>
516 Some PG texts have been converted from HTML, and not all of the
517 HTML tags have been removed.
521 Line 2402 - HTML symbol? &emdash;
523 Similarly, special HTML symbol characters can survive into PG
524 texts. This can occasionally produce amusing false positives like
525 . . . Marwick & Co were well known for it;
529 Line 2540 - Mismatched quotes
531 Another bookloupe mainstay—unclosed doublequotes in a paragraph.
532 See the discussion of quotes in the switches section near the
535 Since the mismatch doesn't occur on any one line, bookloupe quotes
536 the line number of the first blank line following the paragraph,
537 since this is the point where it reconciles the count of quotes.
538 However, if bookloupe is echoing lines, that is, you haven't used
539 the -e switch, it will show the _first_ line of the paragraph,
540 to help you find the place without using line numbers. The
541 offending paragraph is therefore between the quoted line and
542 the line number given.
546 Line 2587 - Mismatched single quotes
548 Only checked with the -s switch, since checking single quotes is
549 not a very reliable process. Otherwise, the same logic as for
550 doublequotes applies.
554 Line 2877 - Mismatched round brackets?
556 Also curly and square brackets. Texts with a lot of brackets, like
557 plays with bracketed stage instructions, may have mismatches.
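
A bracket check that works on a whole paragraph, rather than line by line, copes with tags such as [Illustration] whose text wraps across lines. The Python sketch below only compares counts of opening and closing brackets; it is an illustration with hypothetical names, not the test bookloupe actually runs.

```python
PAIRS = {"(": ")", "[": "]", "{": "}"}
NAMES = {"(": "round", "[": "square", "{": "curly"}


def mismatched_brackets(paragraph):
    """Return the kinds of bracket whose open and close counts differ."""
    return [
        NAMES[opener]
        for opener, closer in PAIRS.items()
        if paragraph.count(opener) != paragraph.count(closer)
    ]


caption = "[Illustration: A long caption\nthat wraps onto a second line.]"
print(mismatched_brackets(caption))                    # []
print(mismatched_brackets("He left (quietly. [Exit"))  # ['round', 'square']
```

Counting alone cannot catch brackets that balance but appear in the wrong order, such as ")(".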
561 Line 3204 - Two successive CRs?
562 Line 3281 position 75 - CR without LF?
564 These are the invalid line-end warnings. See the discussion of
565 line-end checking in the switches section near the start of this
566 file. If you see these, and your editor doesn't show anything
567 wrong, you should probably try deleting the characters just before
568 and after the line end, and the line-end itself, then retyping the
569 characters and the line-end.
572 Line 2940 - Paragraph starts with lower-case
574 A common error in an e-text is for an extra blank line
576 to be put in, like the blank line above, and this often
577 shows up as a new paragraph beginning with lower case.
578 Sometimes the blank line is deliberate, as when a
579 quotation is inserted in a speech. Use your judgement.
582 Line 2987 - Extra period?
584 An extra period. is a. common problem in OCRed text. and usually
585 arises when a speck of dust on the page is mistaken for a period.
586 or. as occasionally happens. when a comma loses its tail.
589 Line 3012 column 12 - Double punctuation?
591 Double punctuation., like that,, is a common typo and
592 scanno. Some books have much legit double punctuation,
593 like etc., etc., but it's worth checking anyway.
599 For Windows-only users who are unfamiliar with DOS:
601 If you're a Windows-only user, you need to save
602 bookloupe.exe into the folder (directory) where the
603 text file you want to check is. Let's say your
604 text file is in C:\gut, then you should save
605 bookloupe.exe into C:\gut.
607 Now get to a console. You can do this by
608 selecting the "Command Prompt" or "MS-DOS Prompt"
609 option that will be somewhere on your
612 Now get into the C:\gut directory.
613 You can do this using the cd (change directory)
616 and your prompt will change to
618 so you know you're in the right place.
621 bookloupe yourfile.txt
622 and you'll see bookloupe's report
624 By default, bookloupe prints its queries to screen.
625 If you want to create a file of them, to edit
626 against the text, you can use the greater-than
627 sign (>) to tell it to output the report to a
628 file. For example, if you want its report in a
629 file called queries.lst, you could type
631 bookloupe yourfile.txt > queries.lst
633 The queries.lst file will then contain the listing
634 of possible formatting errors, and you can
635 edit it alongside your text.
637 Whatever you do, DON'T make the filename after
638 the greater-than sign the name of a file already
639 on your disk that you want to keep, because
640 the greater-than sign will cause bookloupe to
641 replace any existing file of that name.
643 So, for example, if you have two Tolstoy files
644 that you want to check, called WARPEACE.TXT and
645 ANNAK.TXT, make sure that neither of these names
646 is ever used following the greater-than sign.
647 To check these correctly, you might do:
649 bookloupe warpeace.txt > war.lst
653 bookloupe annak.txt > annak.lst
655 separately. Then you can look at war.lst and annak.lst
656 to see the bookloupe reports.
658 For Windows-only users who want to use bookloupe from guiguts:
660 1) If you haven't already done so, download bookloupe-win32-xxx.zip
661 from http://www.juiblex.co.uk/pgdp/bookloupe/
663 2) Extract the files into a suitable folder, e.g. C:\DP\bookloupe
667 4) Choose Preferences | File Paths | Set File Paths..
669 5) Click the "Locate Gutcheck..." button
671 6) Browse to the folder where you extracted bookloupe
673 7) Double-click bookloupe.exe
675 Now, whenever you do "Gutcheck" in Guiguts, it will run bookloupe
676 instead. Since the output will look very like gutcheck output, you
677 may want to check that it is actually bookloupe that is running. To do
678 this, look at the black command line message window, which will say:
680 "bookloupe: Check and report on an e-text".
682 To return to using gutcheck for any reason, repeat steps 4 and 5
685 6b) Browse back to the gutcheck folder, which is in a "tools"
686 folder inside the main Guiguts folder. It will be something like
687 "C:\DP\guiguts-win\tools\gutcheck", depending on where you installed
690 7b) Double-click gutcheck.exe
692 Now doing "Gutcheck" in Guiguts will run gutcheck itself, and the
693 message in the black window should read:
695 "gutcheck: Check and report on an e-text".