diff -r 7a62c77a0dbe -r 2d48e8cdda24 doc/bookloupe.txt --- a/doc/bookloupe.txt Sat Sep 21 23:40:18 2013 +0100 +++ b/doc/bookloupe.txt Wed Oct 02 09:14:33 2013 +0100 @@ -9,7 +9,7 @@ Microsoft Windows, Mac or Unix. For Windows-only people, there is an appendix at the end with brief instructions for running it. -Current version: 2.0 +Current version: 2.1 This software is Copyright Jim Tinsley 2000-2005 and J. Ali Harlow 2012 onwards. @@ -20,31 +20,128 @@ See http://www.juiblex.co.uk/pgdp/bookloupe/ for the latest version. -Usage is: bookloupe [-setopxlywm] filename - where: - -s checks Single quotes - -e switches off Echoing of lines - -t checks Typos - -o produces an Overview only - -p sets strict quotes checking for Paragraphs - -x (paranoid) switches OFF typo checking and extra checks - -l turns off Line-end checks - -y sets error messages to stdout - -w is a special mode for web uploads (for future use) - -v (verbose) forces individual reporting of minor problems - -m interprets Markup of some common HTML tags and entities - -u warns about words in a user-defined typo file gutcheck.typ - -d ignores some DP-specific markup + Recent changes in behaviour + +Each new version of bookloupe brings bug fixes and improvements. Sometimes +the behaviour is also changed in ways that might be unexpected: + +Odd characters + + The check for "odd" characters (tab, tilde, caret, forward slash and + asterisks) is disabled in bookloupe 2.0 when the character set is + switched from ASCII/ISO-8859-1 to UNICODE (ie., when the "There are a + lot of foreign letters here." message is printed). As of bookloupe 2.1 + these tests operate independently of the character set selected. + + Users may notice this change most especially in the case of the + DP-specific /* ... */ markup. Bookloupe 2.0 often did not warn when + this markup was encountered even when the --dp switch was not given. 
+ Bookloupe 2.1 will warn about this markup unless dp-specific mode is + switched on, paranoid mode is switched off or the ebook contains more + than 10 lines containing asterisks. In the last case + + --> 11 lines in this file contain asterisks. Not reporting them. + + will be printed. + + + +Usage is: bookloupe [OPTION...] filename + +Options: + -d, --dp ignores some DP-specific markup + -e, --no-echo switches off Echoing of lines + -s, --squote checks Single quotes + --typo checks Typos + -p, --qpara sets strict quotes checking for Paragraphs + --no-paranoid switches OFF typo checking and extra checks + -l, --no-line-end turns off Line-end checks + -o, --overview produces an Overview only + -y, --stdout sets error messages to stdout + -h, --header echoes the header fields + -m, --markup ignores some common HTML markup + -u, --usertypo warns about words in a user-defined typo file + -v, --verbose forces individual reporting of minor problems + -w, --web special mode for web uploads (for future use) + --charset=NAME the set of characters valid for this ebook + --dump-config dumps the current configuration + +There are also inverted options available which are useful when it is +desired to override an option set in the configuration file: + + --no-dp, --echo, --no-squote, --no-typo, --no-qpara, --paranoid, + --line-end, --no-overview, --no-stdout, --no-header, --no-markup, + --no-usertypo, --no-verbose. + +Note: there is no --no-web since --web simply selects a set of options. + +Finally there are a couple of options that toggle the state of options +rather than setting or unsetting them: -t (for typo) and -x (for typo +and paranoid). These are mainly intended for compatibility with gutcheck. Running bookloupe without any parameters will display a brief help message. -Sample usage: +Sample usage: bookloupe warpeace.txt More detail: + Configuration file + + Bookloupe will look for a file named bookloupe.ini to read as + a configuration file. 
Options set in a configuration file can + be overridden from the command line as required. + + The following directories are searched in order: + + 1) The current working directory. When run from the command + line, this is the directory you ran it from. When run from + guiguts it will normally be the directory that contains the + guiguts program. + + 2) The directory containing the bookloupe program. + + 3) The user's configuration directory. Under MS-Windows this + is normally CSIDL_LOCAL_APPDATA which is typically set to + C:\Documents and Settings\\Local Settings\Application Data. + On other platforms this is normally $XDG_CONFIG_HOME which, if + not set defaults to $HOME/.config + + The directories to search can also be changed using the + $BOOKLOUPE_CONFIG_PATH environment variable which is a colon + separated (semi-colon separated under MS-Windows) list of + directories. + + The configuration file is a key file. This is very similar to, + but not identical to a typical ini file as found under MS-Windows. + Key files consist of a number of groups which start with the + group name enclosed in square brackets on a line by itself. + Bookloupe recognises just one group, "options". Then below the + group name there follows the keys and their values for that + group, one per line in the format key=value. Most of bookloupe's + options are flags (ie., either on or off). For these keys, the + value must be either "true" or "false". The file may also contain + comment lines which begin with the # symbol. The names of the + keys follow the long option names. + + A sample configuration file is provided (in sample.ini). The file + will need to be copied to bookloupe.ini before bookloupe will + read it. You can also use the --dump-config option to write a + configuration file for you. 
For example, if you typically want + to run bookloupe with the --dp and --squote options, then you + might do: + + $ bookloupe --dp --squote --dump-config > configuration.ini + $ ren configuration.ini bookloupe.ini + + (Don't be tempted to merge these two steps or bookloupe will see + an empty configuration file and complain.) + + This same idea can also be used to modify an existing configuration. + + Character encoding Bookloupe will handle e-texts encoded in UTF-8 (preferred), @@ -52,44 +149,86 @@ incorrectly, as ansi). The output will be in the same encoding as the input e-text. - Echoing lines (-e to switch off) - You may find it convenient, when reviewing Bookloupe's + Character set (--charset) + + Character encodings have an implicit set of characters that + can be encoded and thus define a set of characters that can + be present in the text. However sometimes it is desirable + that not all characters that can be encoded should be present + in a text. The set of characters that should be present is + known as the character set. + + The default setting for the character set (called auto) does + the same as gutcheck for Windows-1252 encoded texts for + compatibility: + + If the file is predominately ASCII then the set of legal + characters is ASCII and warnings are issued whenever non-ASCII + characters are encountered. The message will either warn of + non-ASCII or non-ISO-8859-1 characters as appropriate. + + If the file contains a significant number of non-ASCII characters + then a message is printed as follows: + + --> There are a lot of foreign letters here. Not reporting them. + + and the character set is widened to include all possible + characters. + + For UTF-8 encoded texts, auto selects UNICODE. + + Most character sets are simply defined in bookloupe as the + set of all characters that can be encoded in the encoding of + the same name. 
UNICODE is an exception and includes only the + characters assigned in the relevant Unicode standard but + excluding the Private Use Area characters. Note that the + relevant Unicode standard is given by the version of glib in + use rather than by any code in bookloupe and thus can vary + from system to system. PG texts however are likely to be + using characters assigned in very early Unicode standards, + thus mitigating this issue. + + + Echoing lines (--no-echo to switch off) + + You may find it convenient, when reviewing Bookloupe's suggestions, to see the line that Bookloupe is questioning. That way, you can often see at a glance whether it is a real error that needs to be fixed, or a false positive that should be in the text, but Bookloupe's limited programming doesn't understand. - By default, bookloupe echoes these lines, but if you don't - want to see the lines referred to, -e will switch it OFF. + By default, bookloupe echoes these lines, but if you don't + want to see the lines referred to, --no-echo will switch it + OFF. - Quotes (-s and -p switches) + Quotes (--squote and --qpara switches) - Bookloupe always looks for unbalanced doublequotes in a + Bookloupe always looks for unbalanced doublequotes in a paragraph. It is a common convention for writers not to close quotes in a paragraph if the next paragraph opens with quotes and is a continuation by the same speaker. - Bookloupe therefore does not normally report unclosed quotes + Bookloupe therefore does not normally report unclosed quotes if the next paragraph begins with a quote. If you need to see all unclosed quotes, even where the next paragraph begins with a quote, you should use the -p switch. - Singlequotes (' and ’) are a problem, since the same - character is used for an apostrophe. I'm not sure that it is - possible to get 100% accuracy on singlequotes checking, + Singlequotes (', `, ‘ and ’) are a problem, since the same + character can be used for an apostrophe. 
I'm not sure that it + is possible to get 100% accuracy on singlequotes checking, particularly since dialect, quite common in PG texts, upsets the normal rules so badly. Consider the sentence: 'Tis often said that a man's a man for a' that. As humans, we recognize that both apostrophes are used - for contractions rather than quotes, but it isn't easy + for contractions rather than quotes, but it isn't easy to get a program to recognize that. Since bookloupe makes too many mistakes when trying to match singlequotes, it doesn't look for unbalanced singlequotes - unless you specify the -s switch. + unless you specify the --squote switch. Consider these sentences, which illustrate the main cases: @@ -102,12 +241,11 @@ Those 'pack dogs' of yours look more like wolves. + Typos (--typo switch) - Typos (-t switch) - - It's not bookoupe's job to be a spelling checker, but it - does check for a list of common typos and OCR errors if you - use the -t switch. (The -x switch also turns typo checking on.) + It's not bookloupe's job to be a spelling checker, but it does + check for a list of common typos and OCR errors if you use the + --typo switch. (The -t and -x switches also toggle typo checking.) It also checks for character combinations, especially involving h and b, which are often confused by OCR, that rarely or never @@ -119,10 +257,10 @@ Bookloupe suppresses multiple reporting of the first 40 "typos" found. This is to remove the annoyance of seeing something like "FN" (footnote) or "LK" (initials) flagged as a typo 147 times - in a text. + in a text. 
- Line-end checking (-l switch to disable) + Line-end checking (--no-line-end switch to disable) All PG texts should have a Carriage Return (CR - character 13) and a Line Feed (LF - character 10) at end of each line, @@ -134,31 +272,31 @@ the correct terminator, but if you're on a work-in-progress in Linux, you might want to convert the line-ends as a final step, and not want to see thousands of errors every time you - run bookloupe before that final step, so you can turn off - this checking with the -l switch. + run bookloupe before that final step, so you can turn off + this checking with the --no-line-end switch. - Paranoid mode (-x switch to disable: Trust No One :-) + Paranoid mode (--no-paranoid switch to disable: Trust No One :-) - -x switches OFF typo-checking, the -t flag, automatically - and some extra checks like standalone 1 and 0 queries. + --no-paranoid switches OFF some extra checks like standalone + 1 and 0 queries. - Overview mode (-o switch) + Overview mode (--overview switch) This mode just gives a count of queries found instead of a detailed list. - Header quote (-h switch) + Header quote (--header switch) - If you use the -h switch, bookloupe will also display + If you use the --header switch, bookloupe will also display the Title, Author, Release and Edition fields from the PG header. This is useful mostly for the automated checks we do on recently-posted texts. - Errors to stdout (-y switch) + Errors to stdout (--stdout switch) If you're just running bookloupe normally, you can ignore this. It's only there for programs that provide a front @@ -167,23 +305,24 @@ bookloupe ran OK. - Verbose reporting (-v switch) + Verbose reporting (--verbose switch) Normally, if bookloupe sees lots of long lines, short lines, spaced dashes, non-ASCII characters or dot-commas ".," it assumes these are features of the text, counts and summarizes - them at the top of its report, but does not list them - individually. 
If the -v switch is on, bookloupe will list them all. + them at the top of its report, but does not list them + individually. If the verbose switch is on, bookloupe will list + them all. - Markup interpretation (-m switch) + Markup interpretation (--markup switch) Normally, bookloupe flags anything it suspects of being HTML - markup as a possible error. When you use the -m switch, + markup as a possible error. When you use the --markup switch, however, it matches anything that looks like markup against a short list of common HTML tags and entities. If the markup is in that list, it either ignores the markup, in the case - of a tag, or "interprets" the markup as its nearest ASCII + of a tag, or "interprets" the markup as its nearest ASCII equivalent, in the case of an entity. So, for example, using this switch, bookloupe will "see" @@ -200,28 +339,30 @@ for PG, and get sane results. It does not support all tags. It does not support all entities. When it sees a tag or entity it does not recognize, it will query it as HTML just as if - you hadn't specified the -m switch. + you hadn't specified the --markup switch. Bookloupe will automatically switch on markup interpretation if it sees a lot of tags that appear to be markup, so mostly, you won't have to specify this. - User-defined typos (-u switch) + + User-defined typos (--usertypo switch) If you have a file named bookloupe.typ or gutcheck.typ either in your current working directory or in the directory from which you explicitly invoked bookoupe, but not necessarily on - your path, and if you specify the -u switch, bookloupe will - query any word specified in that file. The file is simple: one - word, in lower case, per line. Be careful not to put multiple + your path, and if you specify the --usertypo switch, bookloupe + will query any word specified in that file. The file is simple: + one word, in lower case, per line. 
Be careful not to put multiple words onto a line, or leave any rubbish other than the word on the line. You should have received a sample file bookloupe.typ with this package. The file may be encoded in UTF-8 (preferred), ISO-8859-1 (also known as Latin-1), or WINDOWS-1252 (also known, incorrectly, as ansi). - Ignore DP markup (-d switch) - + + Ignore DP markup (--dp switch) + Distributed Proofreaders (http://www.pgdp.net) has for some time been the main source of PG texts, and proofers there use special conventions. This switch understands those conventions, @@ -229,6 +370,17 @@ haven't had the special conventions removed yet. The special conventions supported are page-separators and "", "", "/*", "*/", "/#", "#/", "/$", "$/". + + + Dump the current configuration (--dump-config switch) + + The --dump-config switch can be used to dump the current + configuration. This is a combination of the internal defaults, + the configuration file (if any) and the command line options. + If a configuration file is present, any comments found in that + file will be preserved in the dumped configuration. If there + is no configuration file, then a default set of comments to + go with the internal default configuration is generated. You will probably only run bookloupe on a text once or maybe twice, @@ -257,7 +409,7 @@ length, HTML tags perhaps left from a conversion, unbalanced brackets. -Suggestions for additional checks would be appreciated and duly +Suggestions for additional checks would be appreciated and duly considered, but no guarantees that they will be implemented. @@ -271,8 +423,8 @@ gutcheck -o filename.txt That gives me a quick idea what I'm dealing with. It'll tell -me what kind of problems gutcheck sees, and give me an idea -of how much more work needs to be done on the text. Keep in +me what kind of problems gutcheck sees, and give me an idea +of how much more work needs to be done on the text. 
Keep in mind that gutcheck doesn't do anything like a full spellcheck, but when I see a text that has a lot of problems, I assume that it probably needs a spellcheck too. @@ -284,10 +436,10 @@ where jj is my personal, all-purpose filename for temporary data that doesn't need to be kept. Then I open filename.txt and jj in a split-screen view in my editor, and work down the text, fixing -whatever needs fixing, and skipping whatever doesn't. If your -editor doesn't split-screen, you can get much the same effect by +whatever needs fixing, and skipping whatever doesn't. If your +editor doesn't split-screen, you can get much the same effect by opening your original file in your normal editor, and jj (or your -equivalent name) in something like Notepad, keeping both in view +equivalent name) in something like Notepad, keeping both in view at the same time. Twice a day, an automatic process looks at all recently-posted @@ -296,17 +448,6 @@ - Future development of bookloupe - -Future versions will add support for UTF-8 characters that -are not in ISO-8859-1 (eg., curled quotation marks); -characters that do not have a composed form (version 2.0 -treats these as taking 2 or more columns); zero width and -wide characters (version 2.0 treats these as taking 1 column). - - - - Explanations of common bookloupe messages: --> 74 lines in this file have white space at end @@ -343,11 +484,11 @@ Line 3020 - Non-ASCII character 233 Standard PG texts should use only ASCII characters with values - up to 127; however, non-English, accented characters can be - represented according to several different non-ASCII encoding + up to 127; however, non-English, accented characters can be + represented according to several different non-ASCII encoding schemes, using values over 127. 
If you have a plain English text with a few accented characters in words like cafe or tete-a-tete, - you might replace the accented characters with their unaccented + you might replace the accented characters with their unaccented versions. The English pound sign is another commonly-seen non-ASCII character. If you have enough non-ASCII characters in your text that you feel removing them would degrade your text, @@ -376,6 +517,7 @@ of spaces. + Line 1327 - Tilde character? The tilde character (~) might be legitimately used, but it's the @@ -386,7 +528,7 @@ Line 1347 - Asterisk? - Asterisks are reported only in paranoid mode (see -x). + Asterisks are reported only in paranoid mode (see -x). Like tildes, they are often used to indicate errors, but they are also legitimately used as line delimiters and footnote markers. @@ -411,7 +553,7 @@ Hint: bookloupe will not flag lines as short if they are indented —if they start with a space. I like to start inserted stanzas - and other such items indented with a couple of spaces so that + and other such items indented with a couple of spaces so that they stand out from the main text anyway. @@ -427,7 +569,7 @@ The PG standard for an em-dash--like these--is two minus signs with no spaces before or after them. Bookloupe flags non-PG - em-dashes - like this one. Normally, you will replace it with a + em-dashes - like this one. Normally, you will replace it with a PG-standard em-dash. @@ -451,8 +593,8 @@ Line 2083 - Query standalone 0 - In paranoid mode (see -x) only, bookloupe warns about the digit 0 - and the number 1 standing alone as a word. This can happen if the + In paranoid mode (see -x) only, bookloupe warns about the digit 0 + and the number 1 standing alone as a word. This can happen if the OCR misreads the words O or I. @@ -531,22 +673,22 @@ Another bookloupe mainstay—unclosed doublequotes in a paragraph. See the discussion of quotes in the switches section near the start of this file. 
- + Since the mismatch doesn't occur on any one line, bookloupe quotes the line number of the first blank line following the paragraph, since this is the point where it reconciles the count of quotes. However, if bookloupe is echoing lines, that is, you haven't used - the -e switch, it will show the _first_ line of the paragraph, - to help you find the place without using line numbers. The - offending paragraph is therefore between the quoted line and + the -e switch, it will show the _first_ line of the paragraph, + to help you find the place without using line numbers. The + offending paragraph is therefore between the quoted line and the line number given. Line 2587 - Mismatched single quotes - Only checked with the -s switch, since checking single quotes is - not a very reliable process. Otherwise, the same logic as for + Only checked with the -s switch, since checking single quotes is + not a very reliable process. Otherwise, the same logic as for doublequotes applies. @@ -575,7 +717,7 @@ to be put in, like the blank line above, and this often shows up as a new paragraph beginning with lower case. - Sometimes the blank line is deliberate, as when a + Sometimes the blank line is deliberate, as when a quotation is inserted in a speech. Use your judgement. @@ -609,11 +751,11 @@ option that will be somewhere on your Start/Programs menu. - Now get into the C:\gut directory. - You can do this using the cd (change directory) + Now get into the C:\gut directory. + You can do this using the cd (change directory) command, like this: cd \gut - and your prompt will change to + and your prompt will change to C:\gut> so you know you're in the right place. @@ -641,7 +783,7 @@ replace any existing file of that name. So, for example, if you have two Tolstoy files - that you want to check, called WARPEACE.TXT and + that you want to check, called WARPEACE.TXT and ANNAK.TXT, make sure that neither of these names is ever used following the greater-than sign. 
To check these correctly, you might do: @@ -670,7 +812,7 @@ 6) Browse to the folder where you extracted bookloupe - 7) Double-click bookloupe.exe + 7) Double-click bookloupe.exe Now, whenever you do "Gutcheck" in Guiguts, it will run bookloupe instead. Since the output will look very like gutcheck output, you