# HG changeset patch
# User ali
# Date 1369471256 -3600
# Node ID 20d51419e077af81580f07f79e2a24dd33f38213
# Parent  68b1403e2971fecd8f1530b668c3bbff9dac7838
Break report_first_pass() out

diff -r 68b1403e2971 -r 20d51419e077 bookloupe/bookloupe.c
--- a/bookloupe/bookloupe.c	Sat May 25 08:52:47 2013 +0100
+++ b/bookloupe/bookloupe.c	Sat May 25 09:40:56 2013 +0100
@@ -494,7 +494,8 @@
             if (usertypo_count>=MAX_USER_TYPOS)
             {
                 printf(" --> Only %d user-defined typos "
-                  "allowed: ignoring the rest\n");
+                  "allowed: ignoring the rest\n",
+                  MAX_USER_TYPOS);
                 break;
             }
         }
@@ -694,6 +695,199 @@
     return &results;
 }
 
+struct warnings {
+    signed int shortline,longline,bin,dash,dotcomma,ast,fslash,digit,hyphen;
+    signed int endquote,isDutch,isFrench;
+};
+
+/*
+ * report_first_pass:
+ *
+ * Make some snap decisions based on the first pass results.
+ */
+struct warnings *report_first_pass(struct first_pass_results *results)
+{
+    static struct warnings warnings={0};
+    if (cnt_spacend>0)
+        printf(" --> %ld lines in this file have white space at end\n",
+          cnt_spacend);
+    warnings.dotcomma=1;
+    if (results->dotcomma>5)
+    {
+        warnings.dotcomma=0;
+        printf(" --> %ld lines in this file contain '.,'. "
+          "Not reporting them.\n",results->dotcomma);
+    }
+    /*
+     * If more than 50 lines, or one-tenth, are short,
+     * don't bother reporting them.
+     */
+    warnings.shortline=1;
+    if (results->shortline>50 || results->shortline*10>linecnt)
+    {
+        warnings.shortline=0;
+        printf(" --> %ld lines in this file are short. "
+          "Not reporting short lines.\n",results->shortline);
+    }
+    /*
+     * If more than 50 lines, or one-tenth, are long,
+     * don't bother reporting them.
+     */
+    warnings.longline=1;
+    if (results->longline>50 || results->longline*10>linecnt)
+    {
+        warnings.longline=0;
+        printf(" --> %ld lines in this file are long. "
+          "Not reporting long lines.\n",results->longline);
+    }
+    /* If more than 10 lines contain asterisks, don't bother reporting them. */
+    warnings.ast=1;
+    if (results->astline>10)
+    {
+        warnings.ast=0;
+        printf(" --> %ld lines in this file contain asterisks. "
+          "Not reporting them.\n",results->astline);
+    }
+    /*
+     * If more than 10 lines contain forward slashes,
+     * don't bother reporting them.
+     */
+    warnings.fslash=1;
+    if (results->fslashline>10)
+    {
+        warnings.fslash=0;
+        printf(" --> %ld lines in this file contain forward slashes. "
+          "Not reporting them.\n",results->fslashline);
+    }
+    /*
+     * If more than 20 lines contain unpunctuated endquotes,
+     * don't bother reporting them.
+     */
+    warnings.endquote=1;
+    if (results->endquote_count>20)
+    {
+        warnings.endquote=0;
+        printf(" --> %ld lines in this file contain unpunctuated endquotes. "
+          "Not reporting them.\n",results->endquote_count);
+    }
+    /*
+     * If more than 15 lines contain standalone digits,
+     * don't bother reporting them.
+     */
+    warnings.digit=1;
+    if (results->standalone_digit>10)
+    {
+        warnings.digit=0;
+        printf(" --> %ld lines in this file contain standalone 0s and 1s. "
+          "Not reporting them.\n",results->standalone_digit);
+    }
+    /*
+     * If more than 20 lines contain hyphens at end,
+     * don't bother reporting them.
+     */
+    warnings.hyphen=1;
+    if (results->hyphens>20)
+    {
+        warnings.hyphen=0;
+        printf(" --> %ld lines in this file have hyphens at end. "
+          "Not reporting them.\n",results->hyphens);
+    }
+    if (results->htmcount>20 && !pswit[MARKUP_SWITCH])
+    {
+        printf(" --> Looks like this is HTML. Switching HTML mode ON.\n");
+        pswit[MARKUP_SWITCH]=1;
+    }
+    if (results->verylongline>0)
+        printf(" --> %ld lines in this file are VERY long!\n",
+          results->verylongline);
+    /*
+     * If there are more non-PG spaced dashes than PG em-dashes,
+     * assume it's deliberate.
+     * Current PG guidelines say don't use them, but older texts do,
+     * and some people insist on them whatever the guidelines say.
+     */
+    warnings.dash=1;
+    if (results->spacedash+results->non_PG_space_emdash>
+      results->PG_space_emdash)
+    {
+        warnings.dash=0;
+        printf(" --> There are %ld spaced dashes and em-dashes. "
+          "Not reporting them.\n",
+          results->spacedash+results->non_PG_space_emdash);
+    }
+    /* If more than a quarter of characters are hi-bit, bug out. */
+    warnings.bin=1;
+    if (results->binlen*4>results->totlen)
+    {
+        printf(" --> This file does not appear to be ASCII. "
+          "Terminating. Best of luck with it!\n");
+        exit(1);
+    }
+    if (results->alphalen*4<results->totlen)
+    {
+        printf(" --> This file does not appear to be text. "
+          "Terminating. Best of luck with it!\n");
+        exit(1);
+    }
+    if (results->binlen*100>results->totlen || results->binlen>100)
+    {
+        printf(" --> There are a lot of foreign letters here. "
+          "Not reporting them.\n");
+        warnings.bin=0;
+    }
+    warnings.isDutch=0;
+    if (results->Dutchcount>50)
+    {
+        warnings.isDutch=1;
+        printf(" --> This looks like Dutch - "
+          "switching off dashes and warnings for 's Middags case.\n");
+    }
+    warnings.isFrench=0;
+    if (results->Frenchcount>50)
+    {
+        warnings.isFrench=1;
+        printf(" --> This looks like French - "
+          "switching off some doublepunct.\n");
+    }
+    if (results->firstline && results->footerline)
+        printf(" The PG header and footer appear to be already on.\n");
+    else
+    {
+        if (results->firstline)
+            printf(" The PG header is on - no footer.\n");
+        if (results->footerline)
+            printf(" The PG footer is on - no header.\n");
+    }
+    printf("\n");
+    if (pswit[VERBOSE_SWITCH])
+    {
+        warnings.bin=1;
+        warnings.shortline=1;
+        warnings.dotcomma=1;
+        warnings.longline=1;
+        warnings.dash=1;
+        warnings.digit=1;
+        warnings.ast=1;
+        warnings.fslash=1;
+        warnings.hyphen=1;
+        warnings.endquote=1;
+        printf(" *** Verbose output is ON -- you asked for it! ***\n");
+    }
+    if (warnings.isDutch)
+        warnings.dash=0;
+    if (results->footerline>0 && results->firstline>0 &&
+      results->footerline>results->firstline &&
+      results->footerline-results->firstline<100)
+    {
+        printf(" --> I don't really know where this text starts. \n");
+        printf(" There are no reference points.\n");
+        printf(" I'm going to have to report the header and footer "
+          "as well.\n");
+        results->firstline=0;
+    }
+    return &warnings;
+}
+
 /*
  * procfile:
  *
@@ -706,11 +900,10 @@
     char parastart[81]; /* first line of current para */
     FILE *infile;
     struct first_pass_results *first_pass_results;
+    struct warnings *warnings;
     long quot,squot,start_para_line;
     signed int i,j,llen,isemptyline,isacro,isellipsis,istypo,alower,
       eNon_A,eTab,eTilde,eAst,eFSlash,eCarat;
-    signed int warn_short,warn_long,warn_bin,warn_dash,warn_dotcomma,
-      warn_ast,warn_fslash,warn_digit,warn_hyphen,warn_endquote;
     unsigned int lastlen,lastblen;
     signed int s_brack,c_brack,r_brack,c_unders;
     signed int open_single_quote,close_single_quote,guessquote,dquotepar,
@@ -720,7 +913,6 @@
       cbrack_err[80],unders_err[80];
     signed int qword_index,qperiod_index,isdup;
     signed int enddash;
-    signed int isDutch,isFrench;
     laststart=CHAR_SPACE;
     lastlen=lastblen=0;
     *dquote_err=*squote_err=*rbrack_err=*cbrack_err=*sbrack_err=
@@ -728,13 +920,10 @@
     linecnt=checked_linecnt=start_para_line=0;
     quot=squot=s_brack=c_brack=r_brack=c_unders=0;
     i=llen=isemptyline=isacro=isellipsis=istypo=0;
-    warn_short=warn_long=warn_bin=warn_dash=warn_dotcomma=
-      warn_ast=warn_fslash=warn_digit=warn_endquote=0;
     isnewpara=vowel=consonant=enddash=0;
     qword_index=qperiod_index=isdup=0;
     *inword=*testword=0;
     open_single_quote=close_single_quote=guessquote=dquotepar=squotepar=0;
-    isDutch=isFrench=0;
     for (j=0;j0)
-        printf(" --> %ld lines in this file have white space at end\n",
-          cnt_spacend);
-    warn_dotcomma=1;
-    if (first_pass_results->dotcomma>5)
-    {
-        warn_dotcomma=0;
-        printf(" --> %ld lines in this file contain '.,'. "
-          "Not reporting them.\n",first_pass_results->dotcomma);
-    }
-    /* if more than 50 lines, or one-tenth, are short,
-     * don't bother reporting them */
-    warn_short=1;
-    if (first_pass_results->shortline>50 ||
-      first_pass_results->shortline*10>linecnt)
-    {
-        warn_short=0;
-        printf(" --> %ld lines in this file are short. "
-          "Not reporting short lines.\n",first_pass_results->shortline);
-    }
-    /*
-     * If more than 50 lines, or one-tenth, are long,
-     * don't bother reporting them.
-     */
-    warn_long=1;
-    if (first_pass_results->longline>50 ||
-      first_pass_results->longline*10>linecnt)
-    {
-        warn_long=0;
-        printf(" --> %ld lines in this file are long. "
-          "Not reporting long lines.\n",first_pass_results->longline);
-    }
-    /* If more than 10 lines contain asterisks, don't bother reporting them. */
-    warn_ast=1;
-    if (first_pass_results->astline>10)
-    {
-        warn_ast=0;
-        printf(" --> %ld lines in this file contain asterisks. "
-          "Not reporting them.\n",first_pass_results->astline);
-    }
-    /*
-     * If more than 10 lines contain forward slashes,
-     * don't bother reporting them.
-     */
-    warn_fslash=1;
-    if (first_pass_results->fslashline>10)
-    {
-        warn_fslash=0;
-        printf(" --> %ld lines in this file contain forward slashes. "
-          "Not reporting them.\n",first_pass_results->fslashline);
-    }
-    /*
-     * If more than 20 lines contain unpunctuated endquotes,
-     * don't bother reporting them.
-     */
-    warn_endquote=1;
-    if (first_pass_results->endquote_count>20)
-    {
-        warn_endquote=0;
-        printf(" --> %ld lines in this file contain unpunctuated endquotes. "
-          "Not reporting them.\n",first_pass_results->endquote_count);
-    }
-    /*
-     * If more than 15 lines contain standalone digits,
-     * don't bother reporting them.
-     */
-    warn_digit=1;
-    if (first_pass_results->standalone_digit>10)
-    {
-        warn_digit=0;
-        printf(" --> %ld lines in this file contain standalone 0s and 1s. "
-          "Not reporting them.\n",first_pass_results->standalone_digit);
-    }
-    /*
-     * If more than 20 lines contain hyphens at end,
-     * don't bother reporting them.
-     */
-    warn_hyphen=1;
-    if (first_pass_results->hyphens>20)
-    {
-        warn_hyphen=0;
-        printf(" --> %ld lines in this file have hyphens at end. "
-          "Not reporting them.\n",first_pass_results->hyphens);
-    }
-    if (first_pass_results->htmcount>20 && !pswit[MARKUP_SWITCH])
-    {
-        printf(" --> Looks like this is HTML. Switching HTML mode ON.\n");
-        pswit[MARKUP_SWITCH]=1;
-    }
-    if (first_pass_results->verylongline>0)
-        printf(" --> %ld lines in this file are VERY long!\n",
-          first_pass_results->verylongline);
-    /*
-     * If there are more non-PG spaced dashes than PG em-dashes,
-     * assume it's deliberate.
-     * Current PG guidelines say don't use them, but older texts do,
-     * and some people insist on them whatever the guidelines say.
-     */
-    warn_dash=1;
-    if (first_pass_results->spacedash+first_pass_results->non_PG_space_emdash>
-      first_pass_results->PG_space_emdash)
-    {
-        warn_dash=0;
-        printf(" --> There are %ld spaced dashes and em-dashes. "
-          "Not reporting them.\n",first_pass_results->spacedash+
-          first_pass_results->non_PG_space_emdash);
-    }
-    /* If more than a quarter of characters are hi-bit, bug out. */
-    warn_bin=1;
-    if (first_pass_results->binlen*4>first_pass_results->totlen)
-    {
-        printf(" --> This file does not appear to be ASCII. "
-          "Terminating. Best of luck with it!\n");
-        exit(1);
-    }
-    if (first_pass_results->alphalen*4<first_pass_results->totlen)
-    {
-        printf(" --> This file does not appear to be text. "
-          "Terminating. Best of luck with it!\n");
-        exit(1);
-    }
-    if (first_pass_results->binlen*100>first_pass_results->totlen ||
-      first_pass_results->binlen>100)
-    {
-        printf(" --> There are a lot of foreign letters here. "
-          "Not reporting them.\n");
-        warn_bin=0;
-    }
-    isDutch=0;
-    if (first_pass_results->Dutchcount>50)
-    {
-        isDutch=1;
-        printf(" --> This looks like Dutch - "
-          "switching off dashes and warnings for 's Middags case.\n");
-    }
-    isFrench=0;
-    if (first_pass_results->Frenchcount>50)
-    {
-        isFrench=1;
-        printf(" --> This looks like French - "
-          "switching off some doublepunct.\n");
-    }
-    if (first_pass_results->firstline && first_pass_results->footerline)
-        printf(" The PG header and footer appear to be already on.\n");
-    else
-    {
-        if (first_pass_results->firstline)
-            printf(" The PG header is on - no footer.\n");
-        if (first_pass_results->footerline)
-            printf(" The PG footer is on - no header.\n");
-    }
-    printf("\n");
-    if (pswit[VERBOSE_SWITCH])
-    {
-        warn_bin=1;
-        warn_short=1;
-        warn_dotcomma=1;
-        warn_long=1;
-        warn_dash=1;
-        warn_digit=1;
-        warn_ast=1;
-        warn_fslash=1;
-        warn_hyphen=1;
-        warn_endquote=1;
-        printf(" *** Verbose output is ON -- you asked for it! ***\n");
-    }
-    if (isDutch)
-        warn_dash=0;
-    infile=fopen(filename,"rb");
-    if (!infile)
-    {
-        if (pswit[STDOUT_SWITCH])
-            fprintf(stdout,"bookloupe: cannot open %s\n",filename);
-        else
-            fprintf(stderr,"bookloupe: cannot open %s\n",filename);
-        exit(1);
-    }
-    if (first_pass_results->footerline>0 && first_pass_results->firstline>0 &&
-      first_pass_results->footerline>first_pass_results->firstline &&
-      first_pass_results->footerline-first_pass_results->firstline<100)
-    {
-        printf(" --> I don't really know where this text starts. \n");
-        printf(" There are no reference points.\n");
-        printf(" I'm going to have to report the header and footer "
-          "as well.\n");
-        first_pass_results->firstline=0;
-    }
+    warnings=report_first_pass(first_pass_results);
+    rewind(infile);
     /*
      * Here we go with the main pass. Hold onto yer hat!
      * Re-init some variables we've dirtied.
@@ -1210,7 +1210,7 @@
                 cnt_bin++;
             }
         }
-        if (warn_bin)
+        if (warnings->bin)
         {
             /* Don't repeat multiple warnings on one line. */
             eNon_A=eTab=eTilde=eCarat=eFSlash=eAst=0;
@@ -1274,7 +1274,7 @@
                     cnt_odd++;
                     eCarat=1;
                 }
-                if (!eFSlash && *s==CHAR_FORESLASH && warn_fslash)
+                if (!eFSlash && *s==CHAR_FORESLASH && warnings->fslash)
                 {
                     if (pswit[ECHO_SWITCH])
                         printf("\n%s\n",aline);
@@ -1289,7 +1289,7 @@
                  * Report asterisks only in paranoid mode,
                  * since they're often deliberate.
                  */
-                if (!eAst && pswit[PARANOID_SWITCH] && warn_ast &&
+                if (!eAst && pswit[PARANOID_SWITCH] && warnings->ast &&
                   !isemptyline && *s==CHAR_ASTERISK)
                 {
                     if (pswit[ECHO_SWITCH])
@@ -1304,7 +1304,7 @@
             }
         }
         /* Check for line too long. */
-        if (warn_long)
+        if (warnings->longline)
         {
             if (strlen(aline)>LONGEST_PG_LINE)
             {
@@ -1338,7 +1338,7 @@
          * then just assume it's OK? Need to look at some texts to see
          * how often a formula like this would get the right result.
          */
-        if (warn_short && strlen(aline)>1 && lastlen>1 &&
+        if (warnings->shortline && strlen(aline)>1 && lastlen>1 &&
           lastlen<SHORTEST_PG_LINE && lastblen>1 &&
           lastblen>SHORTEST_PG_LINE && laststart!=CHAR_SPACE)
         {
@@ -1370,7 +1370,7 @@
          * hence the loop - even if the first double-dash is OK
          * there may be another that's wrong later on.
          */
-        if (warn_dash)
+        if (warnings->dash)
         {
             s=aline;
             while (strstr(s,"--"))
@@ -1390,7 +1390,7 @@
             }
         }
         /* Check for spaced dashes. */
-        if (warn_dash)
+        if (warnings->dash)
         {
             if (strstr(aline," -"))
             {
@@ -1590,7 +1590,7 @@
                 t++;
                 continue;
             }
-            if (isDutch)
+            if (warnings->isDutch)
             {
                 /* For Frank & Jeroen -- 's Middags case */
                 if (t[2]==CHAR_SQUOTE && t[3]>='a' && t[3]<='z' &&
@@ -1718,7 +1718,7 @@
                 if (pswit[ECHO_SWITCH])
                     printf("\n%s\n",aline);
                 if (!pswit[OVERVIEW_SWITCH])
-                    printf(" Line %ld column %ld - Query digit in %s\n",
+                    printf(" Line %ld column %d - Query digit in %s\n",
                       linecnt,(int)(wordstart-aline)+1,inword);
                 else
                     cnt_word++;
@@ -1890,7 +1890,7 @@
                       "Query possible scanno %s\n",
                       linecnt,(int)(wordstart-aline)+2,inword);
             }
-            if (pswit[PARANOID_SWITCH] && warn_digit)
+            if (pswit[PARANOID_SWITCH] && warnings->digit)
             {
                 /* In paranoid mode, query all 0 and 1 standing alone. */
                 if (!strcmp(inword,"0") || !strcmp(inword,"1"))
@@ -2089,7 +2089,7 @@
                     printf("\n%s\n",aline);
                 if (!pswit[OVERVIEW_SWITCH])
                     printf(" Line %ld column 1 - Wrongspaced quotes?\n",
-                      linecnt,(int)(s-aline)+1);
+                      linecnt);
                 else
                     cnt_punct++;
             }
@@ -2143,7 +2143,7 @@
          * e.g. "etc., etc.," and vol. 1.; vol 3.;
          * OTOH, from my initial tests, there are also fairly
          * common errors. What to do? Make these cases paranoid?
-         * ".," is the most common, so warn_dotcomma is used
+         * ".," is the most common, so warnings->dotcomma is used
          * to suppress detailed reporting if it occurs often.
          */
         llen=strlen(aline);
@@ -2156,28 +2156,28 @@
                 /* followed by punctuation, it's a query, unless . . . */
                 if (aline[i]==aline[i+1] &&
                   (aline[i]=='.' || aline[i]=='?' || aline[i]=='!') ||
-                  !warn_dotcomma && aline[i]=='.' && aline[i+1]==',' ||
-                  isFrench && !strncmp(aline+i,",...",4) ||
-                  isFrench && !strncmp(aline+i,"...,",4) ||
-                  isFrench && !strncmp(aline+i,";...",4) ||
-                  isFrench && !strncmp(aline+i,"...;",4) ||
-                  isFrench && !strncmp(aline+i,":...",4) ||
-                  isFrench && !strncmp(aline+i,"...:",4) ||
-                  isFrench && !strncmp(aline+i,"!...",4) ||
-                  isFrench && !strncmp(aline+i,"...!",4) ||
-                  isFrench && !strncmp(aline+i,"?...",4) ||
-                  isFrench && !strncmp(aline+i,"...?",4))
+                  !warnings->dotcomma && aline[i]=='.' && aline[i+1]==',' ||
+                  warnings->isFrench && !strncmp(aline+i,",...",4) ||
+                  warnings->isFrench && !strncmp(aline+i,"...,",4) ||
+                  warnings->isFrench && !strncmp(aline+i,";...",4) ||
+                  warnings->isFrench && !strncmp(aline+i,"...;",4) ||
+                  warnings->isFrench && !strncmp(aline+i,":...",4) ||
+                  warnings->isFrench && !strncmp(aline+i,"...:",4) ||
+                  warnings->isFrench && !strncmp(aline+i,"!...",4) ||
+                  warnings->isFrench && !strncmp(aline+i,"...!",4) ||
+                  warnings->isFrench && !strncmp(aline+i,"?...",4) ||
+                  warnings->isFrench && !strncmp(aline+i,"...?",4))
                 {
-                    if (isFrench && !strncmp(aline+i,",...",4) ||
-                      isFrench && !strncmp(aline+i,"...,",4) ||
-                      isFrench && !strncmp(aline+i,";...",4) ||
-                      isFrench && !strncmp(aline+i,"...;",4) ||
-                      isFrench && !strncmp(aline+i,":...",4) ||
-                      isFrench && !strncmp(aline+i,"...:",4) ||
-                      isFrench && !strncmp(aline+i,"!...",4) ||
-                      isFrench && !strncmp(aline+i,"...!",4) ||
-                      isFrench && !strncmp(aline+i,"?...",4) ||
-                      isFrench && !strncmp(aline+i,"...?",4))
+                    if (warnings->isFrench && !strncmp(aline+i,",...",4) ||
+                      warnings->isFrench && !strncmp(aline+i,"...,",4) ||
+                      warnings->isFrench && !strncmp(aline+i,";...",4) ||
+                      warnings->isFrench && !strncmp(aline+i,"...;",4) ||
+                      warnings->isFrench && !strncmp(aline+i,":...",4) ||
+                      warnings->isFrench && !strncmp(aline+i,"...:",4) ||
+                      warnings->isFrench && !strncmp(aline+i,"!...",4) ||
+                      warnings->isFrench && !strncmp(aline+i,"...!",4) ||
+                      warnings->isFrench && !strncmp(aline+i,"?...",4) ||
+                      warnings->isFrench && !strncmp(aline+i,"...?",4))
                         i+=4;
                     ; /* do nothing for .. !! and ?? which can be legit */
                 }
@@ -2280,7 +2280,7 @@
          * Dash at end of line may well be legit - paranoid mode only
          * and don't report em-dash at line-end.
          */
-        if (pswit[PARANOID_SWITCH] && warn_hyphen)
+        if (pswit[PARANOID_SWITCH] && warnings->hyphen)
         {
             for (i=llen-1;i>0 && (unsigned char)aline[i]<=CHAR_SPACE;i--)
                 ;
@@ -2315,7 +2315,7 @@
             }
         }
         llen=strlen(aline);
-        if (warn_endquote)
+        if (warnings->endquote)
         {
             for (i=1;i