ali@0:
ali@0:
ali@74: Bookloupe documentation
ali@0:
ali@0:
ali@74: bookloupe: lists possible common formatting errors in a Project
ali@74: Gutenberg candidate file. Bookloupe is based on gutcheck, written
ali@74: by Jim Tinsley. It is a command line program and can be used under
ali@74: Microsoft Windows, Mac or Unix. For Windows-only people, there is
ali@74: an appendix at the end with brief instructions for running it.
ali@0:
ali@77: Current version: 1.91, a beta version leading up to version 2.0
ali@0:
ali@74: This software is Copyright Jim Tinsley 2000-2005 and
ali@74: J. Ali Harlow 2012 onwards.
ali@0:
ali@74: Bookloupe comes with ABSOLUTELY NO WARRANTY. For details, read the file COPYING.
ali@0: This is Free Software; you may redistribute it under certain conditions (GPL).
ali@0:
ali@74: See http://www.juiblex.co.uk/pgdp/bookloupe/ for the latest version.
ali@0:
ali@0:
ali@74: Usage is: bookloupe [-setopxlywm] filename
ali@0: where:
ali@0: -s checks Single quotes
ali@0: -e switches off Echoing of lines
ali@0: -t checks Typos
ali@0: -o produces an Overview only
ali@0: -p sets strict quotes checking for Paragraphs
ali@0: -x (paranoid) switches OFF typo checking and extra checks
ali@0: -l turns off Line-end checks
ali@0: -y sets error messages to stdout
ali@0: -w is a special mode for web uploads (for future use)
ali@0: -v (verbose) forces individual reporting of minor problems
ali@0: -m interprets Markup of some common HTML tags and entities
ali@0: -u warns about words in a user-defined typo file (bookloupe.typ or gutcheck.typ)
ali@0: -d ignores some DP-specific markup
ali@0:
ali@74: Running bookloupe without any parameters will display a brief help message.
ali@0:
ali@0: Sample usage:
ali@0:
ali@74: bookloupe warpeace.txt
ali@0:
ali@0:
ali@0: More detail:
ali@0:
ali@74: Character encoding
ali@74:
ali@74: Bookloupe will handle e-texts encoded in UTF-8 (preferred),
ali@74: ISO-8859-1 (also known as Latin-1), or WINDOWS-1252 (also known,
ali@74: incorrectly, as ansi). The output will be in the same encoding
ali@74: as the input e-text.
ali@74:
ali@0: Echoing lines (-e to switch off)
ali@0:
ali@74: You may find it convenient, when reviewing Bookloupe's
ali@74: suggestions, to see the line that Bookloupe is questioning.
ali@0: That way, you can often see at a glance whether it is
ali@0: a real error that needs to be fixed, or a false positive
ali@74: that should be in the text, but Bookloupe's limited
ali@0: programming doesn't understand.
ali@0:
ali@74: By default, bookloupe echoes these lines, but if you don't
ali@0: want to see the lines referred to, -e will switch it OFF.
ali@0:
ali@0:
ali@0: Quotes (-s and -p switches)
ali@0:
ali@74: Bookloupe always looks for unbalanced doublequotes in a
ali@0: paragraph. It is a common convention for writers not to
ali@0: close quotes in a paragraph if the next paragraph opens
ali@0: with quotes and is a continuation by the same speaker.
ali@0:
ali@74: Bookloupe therefore does not normally report unclosed quotes
ali@0: if the next paragraph begins with a quote. If you need
ali@0: to see all unclosed quotes, even where the next paragraph
ali@0: begins with a quote, you should use the -p switch.
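
The rule described above can be sketched in a few lines of Python.
This is only an illustration of the idea, not bookloupe's actual
code: the function name, the "strict" parameter (standing in for
the -p switch) and the use of an odd quote count as the test for
"unclosed" are all assumptions made for the example.

```python
def unclosed_quote_paragraphs(text, strict=False):
    """Return 1-based numbers of paragraphs with an odd count of doublequotes.

    strict=False imitates the default described above: an unclosed quote
    is tolerated when the next paragraph opens with a quote.
    strict=True imitates the -p switch: every unclosed quote is reported.
    """
    # Paragraphs are assumed to be separated by one blank line.
    paragraphs = [p for p in text.split("\n\n") if p.strip()]
    flagged = []
    for i, para in enumerate(paragraphs):
        if para.count('"') % 2 == 0:
            continue  # quotes balance in this paragraph
        next_opens_with_quote = (
            i + 1 < len(paragraphs)
            and paragraphs[i + 1].lstrip().startswith('"')
        )
        if strict or not next_opens_with_quote:
            flagged.append(i + 1)
    return flagged
```

With a two-paragraph speech followed by a genuinely unclosed quote,
the default mode reports only the genuine case, while strict mode
reports the continued speech as well.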
ali@0:
ali@0: Singlequotes (') are a problem, since the same character
ali@0: is used for an apostrophe. I'm not sure that it is
ali@0: possible to get 100% accuracy on singlequotes checking,
ali@0: particularly since dialect, quite common in PG texts,
ali@0: upsets the normal rules so badly. Consider the sentence:
ali@0: 'Tis often said that a man's a man for a' that.
ali@0: As humans, we recognize that both apostrophes are used
ali@0: for contractions rather than quotes, but it isn't easy
ali@0: to get a program to recognize that.
ali@0:
ali@74: Since bookloupe makes too many mistakes when trying to match
ali@0: singlequotes, it doesn't look for unbalanced singlequotes
ali@0: unless you specify the -s switch.
ali@0:
ali@0: Consider these sentences, which illustrate the main cases:
ali@0:
ali@0: 'Tis often said that a fool and his money are soon parted.
ali@0:
ali@0: 'Becky's goin' home,' said Tom.
ali@0:
ali@0: The dogs' tails wagged in unison.
ali@0:
ali@0: Those 'pack dogs' of yours look more like wolves.
ali@0:
ali@0:
ali@0:
ali@0: Typos (-t switch)
ali@0:
ali@74: It's not bookloupe's job to be a spelling checker, but it
ali@0: does check for a list of common typos and OCR errors if you
ali@0: use the -t switch. (The -x switch, by contrast, turns typo checking off.)
ali@0:
ali@0: It also checks for character combinations that rarely or never
ali@0: occur, especially those involving h and b, which OCR often
ali@0: confuses. For example, it queries "tbe" in a word. Now, "the" often
ali@0: occurs, but "tbe" is very rare (heartbeat, hotbed), so I'm
ali@0: playing the odds - a few false positives for many errors found.
ali@0: Similarly with "ii", which is a very common OCR error.
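
As a rough illustration of "playing the odds", the sketch below
flags any word containing one of a short list of suspect letter
combinations. The list holds only the two combinations named above;
bookloupe's real list and matching rules are not reproduced here,
and the names SUSPECT_COMBINATIONS and suspect_words are invented
for the example.

```python
import re

# Only the two combinations mentioned in the text; an assumption, not
# bookloupe's actual list.
SUSPECT_COMBINATIONS = ("tbe", "ii")


def suspect_words(line):
    """Return the words in line that contain a suspect letter combination."""
    words = re.findall(r"[A-Za-z]+", line)
    return [
        word
        for word in words
        if any(combo in word.lower() for combo in SUSPECT_COMBINATIONS)
    ]
```

Note that a legitimate word such as "hotbed" is flagged too, which
is exactly the kind of false positive the paragraph above accepts.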
ali@0:
ali@74: Bookloupe suppresses multiple reporting of the first 40 "typos"
ali@0: found. This is to remove the annoyance of seeing something like
ali@0: "FN" (footnote) or "LK" (initials) flagged as a typo 147 times
ali@0: in a text.
ali@0:
ali@0:
ali@0: Line-end checking (-l switch to disable)
ali@0:
ali@0: All PG texts should have a Carriage Return (CR - character 13)
ali@0: and a Line Feed (LF - character 10) at end of each line,
ali@0: regardless of what O/S you made them on. DOS/Windows, Unix
ali@0: and Mac have different conventions, but the final text should
ali@0: always use a CR/LF pair as its line terminator.
ali@0:
ali@74: By default, bookloupe verifies that every line does have
ali@0: the correct terminator, but if you're on a work-in-progress
ali@0: in Linux, you might want to convert the line-ends as a final
ali@0: step, and not want to see thousands of errors every time you
ali@74: run bookloupe before that final step, so you can turn off
ali@0: this checking with the -l switch.
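
A minimal sketch of a line-end check follows. It reads the text as
raw bytes and reports lines that do not end in a single CR/LF pair.
The message texts are the ones bookloupe prints, as listed later in
this file, but the exact conditions under which bookloupe issues
each one are an assumption here, as is the function name.

```python
def line_end_queries(data):
    """Return (line_number, message) pairs for suspicious line ends.

    data is the whole text as bytes, so CR and LF are seen as-is.
    """
    queries = []
    lines = data.split(b"\n")
    if lines and lines[-1] == b"":
        lines.pop()  # the text ended with LF; there is no partial last line
    for number, line in enumerate(lines, start=1):
        if not line.endswith(b"\r"):
            queries.append((number, "No CR?"))
        elif line.endswith(b"\r\r"):
            queries.append((number, "Two successive CRs?"))
        elif b"\r" in line[:-1]:
            # A CR in the middle of the line is not followed by LF.
            queries.append((number, "CR without LF?"))
    return queries
```

A file saved with Unix line ends would produce one "No CR?" query
per line, which is why the -l switch exists.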
ali@0:
ali@0:
ali@0: Paranoid mode (-x switch to disable: Trust No One :-)
ali@0:
ali@0: -x automatically switches OFF typo checking (the -t flag)
ali@0: and some extra checks, such as the standalone 1 and 0 queries.
ali@0:
ali@0:
ali@0: Overview mode (-o switch)
ali@0:
ali@74: This mode just gives a count of queries found
ali@74: instead of a detailed list.
ali@0:
ali@0:
ali@0: Header quote (-h switch)
ali@0:
ali@74: If you use the -h switch, bookloupe will also display
ali@74: the Title, Author, Release and Edition fields from the
ali@74: PG header. This is useful mostly for the automated
ali@74: checks we do on recently-posted texts.
ali@0:
ali@0:
ali@0: Errors to stdout (-y switch)
ali@0:
ali@74: If you're just running bookloupe normally, you can ignore
ali@74: this. It's only there for programs that provide a front
ali@74: end to bookloupe. It makes error messages appear within
ali@74: the output of bookloupe so that the front end knows whether
ali@74: bookloupe ran OK.
ali@0:
ali@0:
ali@0: Verbose reporting (-v switch)
ali@0:
ali@74: Normally, if bookloupe sees lots of long lines, short lines,
ali@74: spaced dashes, non-ASCII characters or dot-commas ".," it
ali@74: assumes these are features of the text, counts and summarizes
ali@74: them at the top of its report, but does not list them
ali@74: individually. If the -v switch is on, bookloupe will list them all.
ali@0:
ali@0:
ali@0: Markup interpretation (-m switch)
ali@0:
ali@74: Normally, bookloupe flags anything it suspects of being HTML
ali@74: markup as a possible error. When you use the -m switch,
ali@74: however, it matches anything that looks like markup against
ali@74: a short list of common HTML tags and entities. If the markup
ali@74: is in that list, it either ignores the markup, in the case
ali@74: of a tag, or "interprets" the markup as its nearest ASCII
ali@74: equivalent, in the case of an entity. So, for example, using
ali@74: this switch, bookloupe will "see"
ali@0:
ali@74: &ldquo;He went thataway!&rdquo;
ali@0:
ali@74: as
ali@0:
ali@74: "He went thataway!"
ali@0:
ali@74: and report accordingly.
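
The idea can be sketched as a pair of lookups: known tags are
dropped, known entities are replaced by an ASCII equivalent, and
anything unrecognized is left in place so that it would still be
queried. The tables below are small illustrative assumptions, not
bookloupe's real list of supported tags and entities, and the
function name is invented for the example.

```python
import re

# Illustrative tables only; bookloupe's own list is not reproduced here.
ENTITIES = {
    "&ldquo;": '"',
    "&rdquo;": '"',
    "&lsquo;": "'",
    "&rsquo;": "'",
    "&mdash;": "--",
    "&amp;": "&",
}
IGNORED_TAGS = {"i", "b", "em", "p"}


def interpret_markup(line):
    """Drop known tags and map known entities to ASCII; keep the rest."""

    def replace_entity(match):
        return ENTITIES.get(match.group(0), match.group(0))

    def replace_tag(match):
        name = match.group(1).lower()
        return "" if name in IGNORED_TAGS else match.group(0)

    line = re.sub(r"&[A-Za-z]+;", replace_entity, line)
    return re.sub(r"</?([A-Za-z][A-Za-z0-9]*)[^>]*>", replace_tag, line)
```

An unknown entity such as &emdash; passes through unchanged, so a
later check can still flag it as a possible HTML symbol.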
ali@0:
ali@74: This switch does not, not, NOT check the validity of HTML;
ali@74: it exists so that you can run bookloupe on most HTML texts
ali@74: for PG, and get sane results. It does not support all tags.
ali@74: It does not support all entities. When it sees a tag or entity
ali@74: it does not recognize, it will query it as HTML just as if
ali@74: you hadn't specified the -m switch.
ali@0:
ali@74: Bookloupe will automatically switch on markup interpretation
ali@74: if it sees a lot of tags that appear to be markup, so mostly, you
ali@74: won't have to specify this.
ali@0:
ali@0: User-defined typos (-u switch)
ali@0:
ali@74: If you have a file named bookloupe.typ or gutcheck.typ either
ali@74: in your current working directory or in the directory from
ali@74: which you explicitly invoked bookloupe, but not necessarily on
ali@74: your path, and if you specify the -u switch, bookloupe will
ali@74: query any word specified in that file. The file is simple: one
ali@74: word, in lower case, per line. Be careful not to put multiple
ali@74: words onto a line, or leave any rubbish other than the word on
ali@74: the line. You should have received a sample file bookloupe.typ
ali@74: with this package. The file may be encoded in UTF-8 (preferred),
ali@74: ISO-8859-1 (also known as Latin-1), or WINDOWS-1252 (also known,
ali@74: incorrectly, as ansi).
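
For illustration, a typo file in the format described above could
be read and applied as in the sketch below. The function names are
invented for the example, and treating the comparison as
case-insensitive is an assumption, not a statement about how
bookloupe matches words.

```python
import re


def load_typo_words(lines):
    """Collect one lower-case word per line, skipping blank lines."""
    return {line.strip().lower() for line in lines if line.strip()}


def query_user_typos(text_lines, typo_words):
    """Return (line_number, word) for each word found in typo_words."""
    queries = []
    for number, line in enumerate(text_lines, start=1):
        for word in re.findall(r"[A-Za-z']+", line):
            if word.lower() in typo_words:
                queries.append((number, word))
    return queries
```

Both functions accept any iterable of lines, so an open file object
can be passed directly, for example load_typo_words(open(path))
where path names your own typo file.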
ali@0:
ali@0: Ignore DP markup (-d switch)
ali@0:
ali@74: Distributed Proofreaders (http://www.pgdp.net) has for some
ali@74: time been the main source of PG texts, and proofers there use
ali@74: special conventions. This switch understands those conventions,
ali@74: so that people can use bookloupe on files in process that
ali@74: haven't had the special conventions removed yet. The special
ali@74: conventions supported are page-separators and
ali@74: "
ali@0:
ali@0: Some PG texts have been converted from HTML, and not all of the
ali@0: HTML tags have been removed.
ali@0:
ali@0:
ali@0:
ali@0: Line 2402 - HTML symbol? &emdash;
ali@0:
ali@0: Similarly, special HTML symbol characters can survive into PG
ali@0: texts. Can occasionally produce amusing false positives like
ali@0: . . . Marwick & Co were well known for it;
ali@0:
ali@0:
ali@0:
ali@0: Line 2540 - Mismatched quotes
ali@0:
ali@74: Another bookloupe mainstay - unclosed doublequotes in a paragraph.
ali@0: See the discussion of quotes in the switches section near the
ali@0: start of this file.
ali@0:
ali@74: Since the mismatch doesn't occur on any one line, bookloupe quotes
ali@0: the line number of the first blank line following the paragraph,
ali@0: since this is the point where it reconciles the count of quotes.
ali@74: However, if bookloupe is echoing lines, that is, you haven't used
ali@0: the -e switch, it will show the _first_ line of the paragraph,
ali@0: to help you find the place without using line numbers. The
ali@0: offending paragraph is therefore between the quoted line and
ali@0: the line number given.
ali@0:
ali@0:
ali@0:
ali@0: Line 2587 - Mismatched single quotes
ali@0:
ali@0: Only checked with the -s switch, since checking single quotes is
ali@0: not a very reliable process. Otherwise, the same logic as for
ali@0: doublequotes applies.
ali@0:
ali@0:
ali@0:
ali@0: Line 2877 - Mismatched round brackets?
ali@0:
ali@0: Also curly and square brackets. Texts with a lot of brackets, like
ali@0: plays with bracketed stage instructions, may have mismatches.
ali@0:
ali@0:
ali@0: Line 3150 - No CR?
ali@0: Line 3204 - Two successive CRs?
ali@0: Line 3281 position 75 - CR without LF?
ali@0:
ali@0: These are the invalid line-end warnings. See the discussion of
ali@0: line-end checking in the switches section near the start of this
ali@0: file. If you see these, and your editor doesn't show anything
ali@0: wrong, you should probably try deleting the characters just before
ali@0: and after the line end, and the line-end itself, then retyping the
ali@0: characters and the line-end.
ali@0:
ali@0:
ali@0: Line 2940 - Paragraph starts with lower-case
ali@0:
ali@0: A common error in an e-text is for an extra blank line
ali@0:
ali@0: to be put in, like the blank line above, and this often
ali@0: shows up as a new paragraph beginning with lower case.
ali@0: Sometimes the blank line is deliberate, as when a
ali@0: quotation is inserted in a speech. Use your judgement.
ali@0:
ali@0:
ali@0: Line 2987 - Extra period?
ali@0:
ali@0: An extra period. is a. common problem in OCRed text. and usually
ali@0: arises when a speck of dust on the page is mistaken for a period.
ali@0: or. as occasionally happens. when a comma loses its tail.
ali@0:
ali@0:
ali@0: Line 3012 column 12 - Double punctuation?
ali@0:
ali@0: Double punctuation., like that,, is a common typo and
ali@0: scanno. Some books have much legit double punctuation,
ali@0: like etc., etc., but it's worth checking anyway.
ali@0:
ali@0:
ali@0:
ali@0: * * * *
ali@0:
ali@0: For Windows-only users who are unfamiliar with DOS:
ali@0:
ali@0: If you're a Windows-only user, you need to save
ali@74: bookloupe.exe into the folder (directory) where the
ali@0: text file you want to check is. Let's say your
ali@74: text file is in C:\gut, then you should save
ali@74: bookloupe.exe into C:\gut.
ali@0:
ali@74: Now get to a console. You can do this by
ali@0: selecting the "Command Prompt" or "MS-DOS Prompt"
ali@0: option that will be somewhere on your
ali@0: Start/Programs menu.
ali@0:
ali@74: Now get into the C:\gut directory.
ali@74: You can do this using the cd (change directory)
ali@0: command, like this:
ali@74: cd \gut
ali@0: and your prompt will change to
ali@74: C:\gut>
ali@0: so you know you're in the right place.
ali@0:
ali@0: Now type
ali@74: bookloupe yourfile.txt
ali@74: and you'll see bookloupe's report
ali@0:
ali@74: By default, bookloupe prints its queries to screen.
ali@0: If you want to create a file of them, to edit
ali@0: against the text, you can use the greater-than
ali@0: sign (>) to tell it to output the report to a
ali@0: file. For example, if you want its report in a
ali@74: file called queries.lst, you could type
ali@74:
ali@74: bookloupe yourfile.txt > queries.lst
ali@0:
ali@0: The queries.lst file will then contain the listing
ali@0: of possible formatting errors, and you can
ali@0: edit it alongside your text.
ali@0:
ali@0: Whatever you do, DON'T make the filename after
ali@0: the greater-than sign the name of a file already
ali@0: on your disk that you want to keep, because
ali@74: the greater-than sign will cause bookloupe to
ali@0: replace any existing file of that name.
ali@0:
ali@0: So, for example, if you have two Tolstoy files
ali@0: that you want to check, called WARPEACE.TXT and
ali@0: ANNAK.TXT, make sure that neither of these names
ali@0: is ever used following the greater-than sign.
ali@0: To check these correctly, you might do:
ali@0:
ali@74: bookloupe warpeace.txt > war.lst
ali@0:
ali@0: and
ali@0:
ali@74: bookloupe annak.txt > annak.lst
ali@0:
ali@0: separately. Then you can look at war.lst and annak.lst
ali@74: to see the bookloupe reports.