C4::Charset - utilities for handling character set conversions.
use C4::Charset;
This module contains routines for dealing with character set conversions, particularly for MARC records.
A variety of character encodings are in use by various MARC standards, and even more character encodings are used by non-standard MARC records. The various MARC formats generally do not do a good job of advertising a given record's character encoding, and even when a record does advertise its encoding, e.g., via the Leader/09, experience has shown that one cannot trust it.
Ultimately, all MARC records are stored in Koha in UTF-8 and must be converted from whatever the source character encoding is. The goal of this module is to ensure that these conversions take place accurately. When a character conversion cannot take place, or at least not accurately, the module provides enough information to allow user-facing code to inform the user how to deal with the situation.
my $is_utf8 = IsStringUTF8ish($str);
Determines if $str is valid UTF-8.
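This can mean one of two things: either the Perl UTF-8 flag is set and the string contains valid UTF-8, or the Perl UTF-8 flag is not set but the octets contained in $str are valid UTF-8.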
The function is named IsStringUTF8ish instead of IsStringUTF8 because one could be presented with a MARC blob that is not actually in UTF-8 but whose sequence of octets appears to be valid UTF-8.
The rest of the MARC character conversion functions assume that this situation does not occur very often.
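The "ish" caveat can be illustrated with a minimal sketch that uses only the core Encode module. The helper name looks_like_utf8 is hypothetical and is not claimed to be how IsStringUTF8ish is implemented; it only shows that an octet-level validity check cannot tell real UTF-8 from another encoding whose octets happen to form valid UTF-8.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical stand-in for a "UTF-8-ish" check; not the module's own code.
sub looks_like_utf8 {
    my ($str) = @_;

    # Case 1: Perl already treats the string as characters.
    return 1 if utf8::is_utf8($str);

    # Case 2: flag not set, so test whether the octets decode strictly.
    # decode() with a CHECK value may modify its argument, so use a copy.
    my $copy = $str;
    my $ok = eval { Encode::decode('UTF-8', $copy, Encode::FB_CROAK()); 1 };
    return $ok ? 1 : 0;
}

# The octets C3 A9 are "e with acute" in UTF-8, but they are also a legal
# two-character sequence in ISO-8859-1. The check accepts them either way.
my $ambiguous = "\xC3\xA9";

# A lone E9 octet is "e with acute" in ISO-8859-1 and is not valid UTF-8.
my $latin1_only = "\xE9";
```

Here looks_like_utf8($ambiguous) is true even if the data was really ISO-8859-1, which is exactly the situation the function name warns about.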
($marc_record, $converted_from, $errors_arrayref) = MarcToUTF8Record($marc_blob, $marc_flavour[, $source_encoding]);
Given a MARC blob or a MARC::Record, the MARC flavour, and an optional source encoding, return a MARC::Record that is converted to UTF-8.
The returned $marc_record is guaranteed to be in valid UTF-8, but is not guaranteed to have been converted correctly. Specifically, if $converted_from is 'failed', the MARC record returned failed character conversion and had each of its non-ASCII octets changed to the Unicode replacement character.
If the source encoding was not specified, this routine will try to guess it; the character encoding used for a successful conversion is returned in $converted_from.
SetMarcUnicodeFlag($marc_record, $marc_flavour);
Set both the internal MARC::Record encoding flag and the appropriate Leader/09 (MARC21) or 100/26-29 (UNIMARC) to indicate that the record is in UTF-8. Note that this does not do any actual character conversion.
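For the MARC21 case, the Leader/09 change amounts to writing the character 'a' (the MARC21 code for UCS/Unicode) at offset 9 of the 24-character leader. The sketch below works on a bare leader string; the helper name leader_with_unicode_flag is hypothetical, and the sketch does not touch the internal MARC::Record encoding flag or the UNIMARC 100 field.

```perl
use strict;
use warnings;

# Hypothetical helper: returns a copy of a MARC21 leader with position 09
# set to 'a'. It flags the encoding only; no character data is converted.
sub leader_with_unicode_flag {
    my ($leader) = @_;
    die "a MARC leader must be exactly 24 characters\n"
        unless defined $leader && length($leader) == 24;

    substr($leader, 9, 1) = 'a';    # Leader/09: 'a' means UCS/Unicode
    return $leader;
}

# Sample leader with a blank at position 09 (blank means MARC-8).
my $marc8_leader   = '00000nam  2200000 a 4500';
my $unicode_leader = leader_with_unicode_flag($marc8_leader);
```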
my $new_str = StripNonXmlChars($old_str);
Given a string, return a copy with the characters that are illegal in XML removed.
This function exists to work around a problem that can occur with badly-encoded MARC records. Specifically, if a UTF-8 MARC record also contains escape (\x1b) characters, MARC::File::XML will let the escape characters pass through when as_xml() or as_xml_record() is called. The problem is that the escape character is not legal in well-formed XML documents, so when MARC::File::XML later attempts to parse such a record, the XML parser will fail.
Stripping such characters allows MARC::Record->new_from_xml() to work, at the possible risk of some data loss.
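A minimal sketch of such a filter follows, assuming the set of legal characters is the Char production of XML 1.0 (tab, line feed, carriage return, U+0020 to U+D7FF, U+E000 to U+FFFD, and U+10000 to U+10FFFF). The helper name strip_non_xml_chars is hypothetical and is not claimed to match the module's implementation.

```perl
use strict;
use warnings;

# Hypothetical helper: returns a copy of $str without characters that
# fall outside the XML 1.0 Char production.
sub strip_non_xml_chars {
    my ($str) = @_;
    return $str unless defined $str;

    $str =~ s/[^\x{09}\x{0A}\x{0D}\x{20}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]//g;
    return $str;
}

# The escape character is removed; the accented letter and newline remain.
my $dirty = "Caf\x{E9}\x1b title\n";
my $clean = strip_non_xml_chars($dirty);
```

Removing characters silently is what creates the risk of data loss mentioned above, so callers may want to compare the lengths of the two strings and log a warning when they differ.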
my ($new_marc_record, $guessed_charset) = _default_marc21_charconv_to_utf8($marc_record);
Converts a MARC::Record of unknown character set to UTF-8, first by trying a MARC-8 to UTF-8 conversion, then ISO-8859-1 to UTF-8, then a default conversion that replaces each non-ASCII character with the replacement character.
The $guessed_charset return value contains the character set that resulted in a conversion to valid UTF-8; note that if the MARC-8 and ISO-8859-1 conversions failed, the value of this is 'failed'.
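The try-in-order strategy can be sketched on plain octet strings as follows. This is an illustration under stated assumptions, not the module's code: convert_with_fallback is a hypothetical helper, the converters are passed in as code references, and because MARC-8 is not supported by the core Encode module the sample candidate list uses strict UTF-8 and ISO-8859-1 decoders as stand-ins. The result is a Perl character string rather than a MARC::Record.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: tries each [name, converter] pair in order and
# returns (text, name) for the first converter that succeeds. If every
# converter dies or returns undef, each non-ASCII octet is replaced with
# U+FFFD and the name 'failed' is returned.
sub convert_with_fallback {
    my ($octets, @candidates) = @_;

    for my $candidate (@candidates) {
        my ($name, $converter) = @$candidate;
        my $text = eval { $converter->($octets) };
        return ($text, $name) if defined $text;
    }

    (my $text = $octets) =~ s/[^\x00-\x7F]/\x{FFFD}/g;
    return ($text, 'failed');
}

# Stand-in converters (assumption: a real MARC21 chain would start with a
# MARC-8 converter). Each one works on its own copy of the octets.
my @candidates = (
    [ 'UTF-8'      => sub { my $o = shift; Encode::decode('UTF-8',      $o, Encode::FB_CROAK()) } ],
    [ 'ISO-8859-1' => sub { my $o = shift; Encode::decode('iso-8859-1', $o, Encode::FB_CROAK()) } ],
);

my ($text, $guessed_charset) = convert_with_fallback("caf\xE9", @candidates);
```

Callers would check $guessed_charset for the value 'failed' before trusting the converted text.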
my ($new_marc_record, $guessed_charset) = _default_unimarc_charconv_to_utf8($marc_record);
Converts a MARC::Record of unknown character set to UTF-8, first by trying an ISO-5426 to UTF-8 conversion, then ISO-8859-1 to UTF-8, then a default conversion that replaces each non-ASCII character with the replacement character.
The $guessed_charset return value contains the character set that resulted in a conversion to valid UTF-8; note that if the ISO-5426 and ISO-8859-1 conversions failed, the value of this is 'failed'.
my @errors = _marc_marc8_to_utf8($marc_record, $marc_flavour, $source_encoding);
Convert a MARC::Record to UTF-8 in-place from MARC-8.
If the conversion fails for some reason, appropriate messages will be placed in the returned @errors array.
my @errors = _marc_iso5426_to_utf8($marc_record, $marc_flavour, $source_encoding);
Convert a MARC::Record to UTF-8 in-place from ISO-5426.
If the conversion fails for some reason, appropriate messages will be placed in the returned @errors array.
FIXME - is ISO-5426 equivalent enough to MARC-8 that MARC::Charset can be used instead?
my @errors = _marc_to_utf8_via_text_iconv($marc_record, $marc_flavour, $source_encoding);
Convert a MARC::Record to UTF-8 in-place using the Text::Iconv CPAN module. Any source encoding accepted by the user's iconv installation should work.
If the source encoding is not recognized on the user's server or the conversion fails for some reason, appropriate messages will be placed in the returned @errors array.
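The error-collecting pattern described here (report problems in an array instead of dying) can be sketched as below. To keep the example self-contained it uses the core Encode module in place of Text::Iconv, so the set of recognized encodings differs from that of an iconv installation. The helper name decode_collecting_errors and the wording of the messages are hypothetical.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: decodes $octets from $source_encoding and returns
# ($text, @errors). $text is undef when nothing could be converted.
sub decode_collecting_errors {
    my ($octets, $source_encoding) = @_;
    my @errors;

    # An unrecognized encoding name is reported, not treated as fatal.
    my $encoding = Encode::find_encoding($source_encoding);
    unless ($encoding) {
        push @errors, "unrecognized source encoding '$source_encoding'";
        return (undef, @errors);
    }

    # decode() with a CHECK value may modify its argument, so use a copy.
    my $copy = $octets;
    my $text = eval { $encoding->decode($copy, Encode::FB_CROAK()) };
    push @errors, "conversion from $source_encoding failed: $@"
        unless defined $text;

    return ($text, @errors);
}

my ($text, @errors) = decode_collecting_errors("caf\xE9", 'iso-8859-1');
```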
_marc_to_utf8_replacement_char($marc_record, $marc_flavour);
Convert a MARC::Record to UTF-8 in-place, adopting the unsatisfactory method of replacing every non-ASCII octet (i.e., every octet with the eighth bit set) with the Unicode replacement character.
This is meant as a last-ditch method, and would be best used as part of a UI that lets a cataloguer try various character conversions until he or she finds the right one.
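On a plain octet string, the replacement step can be sketched with a single substitution. The helper name replace_non_ascii is hypothetical; the routine documented above applies the idea to a whole MARC::Record.

```perl
use strict;
use warnings;

# Hypothetical helper: returns a copy of $octets in which every octet
# above 0x7F is replaced by U+FFFD (the Unicode replacement character).
# The original non-ASCII data cannot be recovered from the result.
sub replace_non_ascii {
    my ($octets) = @_;
    (my $text = $octets) =~ s/[^\x00-\x7F]/\x{FFFD}/g;
    return $text;
}

# Each non-ASCII octet becomes one replacement character, so a two-octet
# UTF-8 sequence yields two of them.
my $replaced = replace_non_ascii("caf\xE9 \xC3\xA9");
```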
my $utf8string = char_decode5426($iso_5426_string);
Converts a string from ISO-5426 to UTF-8.
Koha Development Team <info@koha.org>
Galen Charlton <galen.charlton@liblime.com>