NAME

TmplTokenizer.pm - Simple-minded tokenizer class for HTML::Template .tmpl files

DESCRIPTION

Because .tmpl files contain HTML::Template directives that tend to confuse real parsers (e.g., HTML::Parse), it can be better to scan the template files for tokens with a customized scanner. This module is a simple-minded attempt at such a scanner.

In addition to the basic scanning, this class will also perform the following:

-: Emulation of c-format strings (see below)
-: Display of warnings about things that either affect the ability of this class to yield correct output or are known to cause trouble in the original template.
-: Automatic correction of some of the things warned about (e.g., SGML "closed start tag" notation).

c-format strings emulation

Because English word order is not universal, a simple extraction of translatable strings may yield incomplete strings like "Accounts for" or ambiguous strings like "in". Such strings are difficult to translate, though not equally so for all languages. For example, Chinese (with a somewhat different word order) would be hit harder, but French would be relatively unaffected.

To overcome this problem, the scanner can be configured to detect patterns with <TMPL_VAR> directives (as well as certain HTML tags), and try to construct a larger pattern that will appear in the PO file as c-format strings with %s placeholders. This additional step allows the translator to deal with cases where word order is different (replacing %s with %1$s, %2$s, etc.), or when certain words will require certain inflectional suffixes in sentences.
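The idea can be illustrated with a minimal sketch. The helper to_cformat below is hypothetical and is not part of TmplTokenizer; it only collapses <TMPL_VAR> directives into %s and ignores the HTML tags the real scanner also handles. The template text, the variable names and the reordered translation are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of TmplTokenizer): collapse every
# <TMPL_VAR ...> directive in a run of text into a %s placeholder.
sub to_cformat {
    my ($text) = @_;
    $text =~ s/%/%%/g;                   # escape literal percent signs first (assumption)
    $text =~ s/<TMPL_VAR\b[^>]*>/%s/gi;  # each directive becomes one %s
    return $text;
}

# The fragment "Accounts for" is now extracted together with its variables.
my $msgid = to_cformat(
    'Accounts for <TMPL_VAR NAME="firstname"> <TMPL_VAR NAME="surname">'
);
print "$msgid\n";

# A translator can change the argument order with explicit indexes,
# which Perl's sprintf understands.
my $msgstr = '%2$s, %1$s: accounts';
my $line   = sprintf($msgstr, 'Ada', 'Lovelace');
print "$line\n";
```

The single-quoted strings matter here: in a double-quoted Perl string, $s would be interpolated as a variable.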

Because this is an incompatible change, this mode must be explicitly turned on using the set_cformat(1) method call.

The flag characters

The character % is followed by zero or more of the following flags:

#: The value comes from HTML <INPUT> elements. This abuse of the flag character is somewhat reasonable, since TMPL_VAR and INPUT are both variables, but of different kinds.

The field width and precision

An optional 0.0 can be given for %s (that is, %0.0s) to indicate that the <TMPL_VAR> should be suppressed.
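The notation mirrors how an ordinary sprintf treats a zero precision on a string: the argument is consumed but nothing is printed. The sketch below only demonstrates that standard Perl behavior; whether TmplTokenizer implements suppression through sprintf is not stated here and should not be assumed. The message texts are made up.

```perl
use strict;
use warnings;

my $name = 'Ada';

# Normal placeholder: the value is printed.
my $shown = sprintf('Hello, %s!', $name);

# Zero precision: the argument is still consumed, but no characters
# from it reach the output.
my $suppressed = sprintf('Hello%0.0s!', $name);

print "$shown\n";
print "$suppressed\n";
```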

The conversion specifier

p: Specifies any input field that is neither text nor hidden (which currently means radio buttons). The p conversion specifier was chosen because it does not evoke any particular data type.
S: Specifies a text input field (<INPUT TYPE=TEXT>). This use of the S conversion specifier is somewhat reasonable, since text input fields contain values of undeterminable type, which can be treated as strings.
s: Specifies a <TMPL_VAR>. This use of the s conversion specifier is somewhat reasonable, since <TMPL_VAR> denotes values of undeterminable type, which can be treated as strings.
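The three specifiers above can be summarized as a small lookup. The function specifier_for is a hypothetical sketch, not the module's code: it takes the raw tag text, ignores the # flag, treats a missing TYPE attribute as text (the HTML default, which the module may or may not follow), and returns nothing for hidden fields because the description above does not say how they are represented.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of TmplTokenizer): choose the conversion
# specifier for one directive or tag, following the list above.
sub specifier_for {
    my ($tag) = @_;

    # <TMPL_VAR> directives become %s.
    return '%s' if $tag =~ /^<TMPL_VAR\b/i;

    if ($tag =~ /^<INPUT\b/i) {
        my ($type) = $tag =~ /\bTYPE\s*=\s*["']?([\w-]+)/i;
        $type = lc($type // 'text');      # assumption: missing TYPE means text

        return '%S' if $type eq 'text';   # text input field
        return      if $type eq 'hidden'; # not covered by the list above
        return '%p';                      # anything else, e.g. radio buttons
    }

    return;                               # not a placeholder candidate
}

for my $tag ('<TMPL_VAR NAME="x">', '<INPUT TYPE=TEXT NAME=q>',
             '<input type="radio" name="r">') {
    printf "%-32s -> %s\n", $tag, specifier_for($tag);
}
```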

BUGS

There is no code to save the tag name anywhere in the scanned token.

The use of <Ai> to stand for the ith anchor is not very well thought out. Some abuse of c-format specifiers might have been more appropriate.

HISTORY

This tokenizer is mostly based on Ambrose's hideous Perl script known as subst.pl.