...one of the most highly regarded and expertly designed C++ library projects in the world.
— Herb Sutter and Andrei Alexandrescu, C++ Coding Standards
This section focuses on some of the design questions.
Unicode support was one of the features specifically requested during the formal review. Throughout this document, "Unicode support" is a synonym for "wchar_t" support, on the assumption that "wchar_t" always uses a Unicode encoding. Also, when talking about "ascii" (in lowercase) we don't mean strict 7-bit ASCII encoding, but rather "char" strings in the local 8-bit encoding.
Generally, "Unicode support" can mean many things, but for the program_options library it means that:
Each parser should accept either char* or wchar_t*, correctly split the input into option names and option values, and return the data.
For each option, it should be possible to specify whether the conversion from string to value uses ascii or Unicode.
The library guarantees that:
ascii input is passed to an ascii value without change
Unicode input is passed to a Unicode value without change
ascii input passed to a Unicode value, and Unicode input passed to an ascii value, will be converted using a codecvt facet (which may be specified by the user).
The important point is that it's possible to have some "ascii options" together with "Unicode options". There are two reasons for this. First, for a given type you might not have the code to extract the value from a Unicode string, and it's not good to require that such code be written. Second, imagine a reusable library which declares some options and exposes their description in its interface. If all options had to be either ascii or Unicode, and the library itself does not use any Unicode strings, then its author would likely use ascii options, which would make the library unusable inside Unicode applications. Essentially, it would be necessary to provide two versions of the library -- ascii and Unicode.
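For illustration, here is a minimal sketch of how ascii and Unicode options might be mixed in one description, using the library's value and wvalue factory functions; the option names and types are made up for this example.

    #include <boost/program_options.hpp>
    #include <string>

    namespace po = boost::program_options;

    int main()
    {
        // Hypothetical options, for illustration only.
        po::options_description desc("Allowed options");
        desc.add_options()
            // "ascii option": the value is extracted from a char string
            ("level", po::value<int>(), "verbosity level")
            // "Unicode option": the value is extracted from a wchar_t string;
            // ascii input reaching it is converted via a codecvt facet
            ("greeting", po::wvalue<std::wstring>(), "greeting text");
        return 0;
    }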
Another important point is that ascii strings are passed through without modification. In other words, it's not possible to just convert ascii to Unicode and process the Unicode further. The problem is that the default conversion mechanism -- the codecvt facet -- might not work with 8-bit input without additional setup.
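To give a rough idea of what such setup involves, here is a hypothetical helper (not part of the library) that converts a local 8-bit string to a wide string through the codecvt facet of a given locale. With the default "C" locale the conversion may fail or be wrong for non-ascii bytes, which is exactly why the library does not silently convert ascii strings.

    #include <locale>
    #include <stdexcept>
    #include <string>
    #include <vector>

    // Hypothetical helper, for illustration only: convert a local 8-bit
    // string to a wide string using the codecvt facet of 'loc'. With the
    // default locale this is only reliable for 7-bit input.
    std::wstring widen(const std::string& s, const std::locale& loc = std::locale())
    {
        typedef std::codecvt<wchar_t, char, std::mbstate_t> facet_type;
        const facet_type& cvt = std::use_facet<facet_type>(loc);

        std::vector<wchar_t> buf(s.size() + 1);
        std::mbstate_t state = std::mbstate_t();
        const char* from_next = 0;
        wchar_t* to_next = 0;

        facet_type::result r = cvt.in(state,
                                      s.data(), s.data() + s.size(), from_next,
                                      &buf[0], &buf[0] + buf.size(), to_next);
        if (r != facet_type::ok)
            throw std::runtime_error("character conversion failed");

        return std::wstring(&buf[0], to_next);
    }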
The Unicode support outlined above is not complete. For example, we don't support Unicode option names. Unicode support is hard and requires a Boost-wide solution. Even comparing two arbitrary Unicode strings is non-trivial. Finally, using Unicode in option names is related to internationalization, which has its own complexities. E.g. if option names depend on the current locale, then all parts of the program, and any other code which uses the names, must be internationalized too.
The primary question in implementing the Unicode support is whether to use templates and std::basic_string or to use some internal encoding and convert between internal and external encodings on the interface boundaries.
The choice, mostly, is between code size and execution speed. A templated solution would either link library code into every application that uses the library (thereby making a shared library impossible), or provide explicit instantiations in the shared library (increasing its size). A solution based on an internal encoding would necessarily perform conversions in a number of places and would be somewhat slower. Since speed is generally not an issue for this library, the second solution looks more attractive, but we'll take a closer look at the individual components.
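To make the two alternatives concrete, here is a purely schematic sketch of the two interface styles; the function names are hypothetical and neither declaration is taken from the actual library headers.

    #include <string>
    #include <vector>

    // Alternative 1: fully templated interface. Each character type either
    // pulls the implementation into the client binary or needs an explicit
    // instantiation (e.g. for char and wchar_t) shipped in the shared library.
    template<class charT>
    std::vector<std::basic_string<charT> >
    split_command_line_t(const std::basic_string<charT>& input);

    // Alternative 2: internal encoding. The public entry points convert to one
    // internal representation (say, UTF-8 in std::string) at the boundary and
    // share a single non-template implementation.
    std::vector<std::string> split_command_line(const std::string& utf8_input);
    std::vector<std::string> split_command_line(const std::wstring& input);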
For the parsers component, we have three choices:
Use a fully templated implementation: given a string of a certain type, a parser will return a parsed_options instance with strings of the same type (i.e. the parsed_options class will be templated).
Use internal encoding: same as above, but strings will be converted to and from the internal encoding.
Use and partly expose the internal encoding: same as above, but the strings in the parsed_options instance will be in the internal encoding. This might avoid a conversion if the parsed_options instance is passed directly to other components, but can also be dangerous or confusing for a user.
The second solution appears to be the best -- it does not increase the code size much and is cleaner than the third. To avoid extra conversions, the Unicode version of parsed_options can also store strings in the internal encoding.
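A minimal sketch of that design, with hypothetical types rather than the library's actual declarations: the wide parser converts once at the interface boundary, and everything downstream works on UTF-8 strings.

    #include <string>
    #include <vector>

    // Hypothetical sketch of parser results in the internal encoding.
    struct option_token {
        std::string name;                 // stored in UTF-8
        std::vector<std::string> values;  // stored in UTF-8
    };

    // The narrow parser stores its 8-bit input as-is; the wide parser converts
    // to UTF-8 once, at the boundary, and the rest of the library works with
    // the same option_token type regardless of the original input.
    std::vector<option_token> parse_narrow(int argc, const char* const argv[]);
    std::vector<option_token> parse_wide(int argc, const wchar_t* const argv[]);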
For the options descriptions component, we don't have much choice. Since it's not desirable to require that either all options use ascii or all of them use Unicode, but rather to allow some ascii and some Unicode options, the interface of value_semantic must work with both. The only way is to pass an additional flag telling whether strings use ascii or the internal encoding. The instance of value_semantic can then convert into some other encoding if needed.
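In code, the flag could surface roughly as follows; this is a hypothetical sketch, and the actual value_semantic declaration in the library may differ.

    #include <string>
    #include <vector>

    // Hypothetical sketch of the "encoding flag" approach, for illustration only.
    class value_semantic_sketch {
    public:
        virtual ~value_semantic_sketch() {}

        // 'new_tokens' are either in the local 8-bit encoding (utf8 == false)
        // or in the internal UTF-8 encoding (utf8 == true). The implementation
        // converts to whatever encoding its value type needs before parsing.
        virtual void parse(const std::vector<std::string>& new_tokens,
                           bool utf8) const = 0;
    };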
For the storage component, the only affected function is store. For Unicode input, the store function should convert the value to the internal encoding. It should also inform the value_semantic class about the encoding used.
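Usage could then look roughly like the following sketch for the wide-character case, where the wide parse result is passed to store; the wmain entry point and the option name are assumptions for illustration, not taken from the library's documentation.

    #include <boost/program_options.hpp>
    #include <string>

    namespace po = boost::program_options;

    // Hypothetical wide-character entry point (e.g. wmain on Windows).
    int wmain(int argc, wchar_t* argv[])
    {
        po::options_description desc("Allowed options");
        desc.add_options()
            ("name", po::wvalue<std::wstring>(), "user name");  // made-up option

        po::variables_map vm;
        // The wide overload of store() converts to the internal encoding and
        // tells each value_semantic which encoding the tokens use.
        po::store(po::parse_command_line(argc, argv, desc), vm);
        po::notify(vm);
        return 0;
    }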
Finally, what internal encoding should we use? The alternatives are std::wstring (using UCS-4 encoding) and std::string (using UTF-8 encoding). The differences between the alternatives are:
Speed: UTF-8 is a bit slower
Space: UTF-8 takes less space when input is ascii
Code size: UTF-8 requires additional conversion code. However, it allows one to use existing parsers without converting them to std::wstring, and such conversion is likely to create a number of new instantiations.
There's no clear leader, but the last point seems important, so UTF-8 will be used.
Choosing the UTF-8 encoding allows the use of existing parsers, because 7-bit ascii characters retain their values in UTF-8, so searching for 7-bit strings is simple. However, there are two subtle issues:
We need to assume that character literals use an ascii-compatible encoding and that the input uses a Unicode encoding.
A Unicode character (say '=') can be followed by a "combining character", and the combination is not the same as plain '=', so a simple search for '=' might match in the wrong place.
Neither of these issues appears to be critical in practice, since ascii is an almost universal encoding, and since combining characters following '=' (and other characters with special meaning to the library) are not likely to appear.
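To illustrate why searching for 7-bit characters keeps working: no byte of a UTF-8 multi-byte sequence falls in the 7-bit range, so a plain byte-wise search for '=' in a UTF-8 string cannot match inside a multi-byte character. The strings below are made up for illustration.

    #include <cassert>
    #include <string>

    int main()
    {
        // "naïve=café" encoded as UTF-8: the multi-byte sequences for 'ï' and
        // 'é' contain no bytes below 0x80, so a byte-wise search for '='
        // cannot accidentally match inside them.
        const std::string line = "na\xc3\xafve=caf\xc3\xa9";

        std::string::size_type eq = line.find('=');
        assert(eq != std::string::npos);

        std::string name  = line.substr(0, eq);   // "naïve" in UTF-8 (6 bytes)
        std::string value = line.substr(eq + 1);  // "café" in UTF-8 (5 bytes)
        assert(name.size() == 6 && value.size() == 5);
        return 0;
    }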