///////////////////////////////////////////////////////////////////////////// // Name: string.h // Purpose: topic overview // Author: wxWidgets team // Licence: wxWindows licence ///////////////////////////////////////////////////////////////////////////// /** @page overview_string wxString Overview @tableofcontents wxString is used for all strings in wxWidgets. This class is very similar to the standard string class, and is implemented using it, but provides additional compatibility functions to allow applications originally written for the much older versions of wxWidgets to continue to work with the latest ones. When writing new code, you're encouraged to use wxString as if it were `std::wstring` and use only functions compatible with the standard class. @section overview_string_settings wxString Related Compilation Settings The main build options affecting wxString are `wxUSE_UNICODE_WCHAR` and `wxUSE_UNICODE_UTF8`, exactly one of which must be set to determine whether fixed-width `wchar_t` or variable-width `char`-based strings are used internally. Please see @ref overview_unicode_support_utf for more information about this choice. The other options all affect the presence, or absence, of various implicit conversions provided by this class. By default, wxString can be implicitly created from `char*`, `wchar_t*`, `std::string` and `std::wstring` and can be implicitly converted to `char*` or `wchar_t*`. This behaviour is convenient and compatible with the previous wxWidgets versions, but is dangerous and may result in unwanted conversions, please see @ref string_conv for how to disable them. @section overview_string_iterating Iterating over wxString It is possible to iterate over strings using indices, but the recommended way to do it is to use iterators, either explicitly: @code wxString s = "hello"; wxString::const_iterator i; for (i = s.begin(); i != s.end(); ++i) { wxUniChar uni_ch = *i; // do something with it } @endcode or, even simpler, implicitly, using range for loop: @code wxString s = "hello"; for ( auto c : s ) { // do something with "c" } @endcode @note wxString iterators have unusual proxy-like semantics and can be used to modify the string even when @e not using references, i.e. with just @c auto, as in the example above. @section overview_string_internal wxString Internal Representation @note This section can be skipped at first reading and is provided solely for informational purposes. As mentioned above, wxString may use any of @c UTF-16 (under Windows, using the native 16 bit @c wchar_t), @c UTF-32 (under Unix, using the native 32 bit @c wchar_t) or @c UTF-8 (under both Windows and Unix) to store its content. By default, @c wchar_t is used under all platforms, but wxWidgets can be compiled with wxUSE_UNICODE_UTF8=1 to use UTF-8 instead. For simplicity of implementation, wxString uses per code unit indexing instead of per code point indexing when using UTF-16, i.e. in the default wxUSE_UNICODE_WCHAR==1 build under Windows and doesn't know anything about surrogate pairs. In other words it always considers code points to be composed by 1 code unit, while this is really true only for characters in the @e BMP (Basic Multilingual Plane), as explained in more details in the @ref overview_unicode_encodings section. Thus when iterating over a UTF-16 string stored in a wxString under Windows, the user code has to take care of surrogate pairs manually if it needs to handle them (note however that Windows itself has built-in support for surrogate pairs in UTF-16, such as for drawing strings on screen, so nothing special needs to be done when just passing strings containing surrogates to wxWidgets functions). @remarks Note that while the behaviour of wxString when wxUSE_UNICODE_WCHAR==1 resembles UCS-2 encoding, it's not completely correct to refer to wxString as UCS-2 encoded since you can encode code points outside the @e BMP in a wxString as two code units (i.e. as a surrogate pair; as already mentioned however wxString will "see" them as two different code points) In wxUSE_UNICODE_UTF8==1 case, wxString handles UTF-8 multi-bytes sequences just fine also for characters outside the BMP (it implements per code point indexing), so that you can use UTF-8 in a completely transparent way: Example: @code // first test, using exotic characters outside of the Unicode BMP: wxString test = wxString::FromUTF8("\xF0\x90\x8C\x80"); // U+10300 is "OLD ITALIC LETTER A" and is part of Unicode Plane 1 // in UTF8 it's encoded as 0xF0 0x90 0x8C 0x80 // it's a single Unicode code-point encoded as: // - a UTF16 surrogate pair under Windows // - a UTF8 multiple-bytes sequence under Linux // (without considering the final NUL) wxPrintf("wxString reports a length of %d character(s)", test.length()); // prints "wxString reports a length of 1 character(s)" on Linux // prints "wxString reports a length of 2 character(s)" on Windows // since wxString on Windows doesn't have surrogate pairs support! // second test, this time using characters part of the Unicode BMP: wxString test2 = wxString::FromUTF8("\x41\xC3\xA0\xE2\x82\xAC"); // this is the UTF8 encoding of capital letter A followed by // 'small case letter a with grave' followed by the 'euro sign' // they are 3 Unicode code-points encoded as: // - 3 UTF16 code units under Windows // - 6 UTF8 code units under Linux // (without considering the final NUL) wxPrintf("wxString reports a length of %d character(s)", test2.length()); // prints "wxString reports a length of 3 character(s)" on Linux // prints "wxString reports a length of 3 character(s)" on Windows @endcode To better explain what stated above, consider the second string of the example above; it's composed by 3 characters and the final @NUL: @image html overview_wxstring_encoding.png As you can see, UTF16 encoding is straightforward (for characters in the @e BMP) and in this example the UTF16-encoded wxString takes 8 bytes. UTF8 encoding is more elaborated and in this example takes 7 bytes. In general, for strings containing many latin characters UTF8 provides a big advantage with regards to the memory footprint respect UTF16, but requires some more processing for common operations like e.g. length calculation. Finally, note that the type used by wxString to store Unicode code units (@c wchar_t or @c char) is always @c typedef-ined to be ::wxStringCharType. */