WideCharToMultiByte and MultiByteToWideChar: Information Summary - 白红宇's Personal Blog


Published: 2021-09-02 01:10:01 | Category: Technical articles



Visual C++ in Short: Converting between Unicode and UTF-8

The Windows SDK provides the WideCharToMultiByte function to convert a Unicode (UTF-16) string (WCHAR*) to a character string (CHAR*) using a particular code page. Windows also provides the MultiByteToWideChar function to convert a character string from a particular code page to a Unicode string. These functions can be a bit daunting at first, but unless you have a lot of legacy code or APIs to deal with, you can just specify CP_UTF8 as the code page and these functions will convert between UTF-16 and UTF-8. UTF-8 isn’t really a code page in the original sense, but the API functions live on and now support UTF conversions.

ATL provides a set of class templates that wrap these functions to simplify conversions even further. It takes a fairly efficient and elegant approach to memory management (compared to previous versions of ATL) that should serve you well in most cases. CW2A is a typedef for the CW2AEX class template that wraps the WideCharToMultiByte function. Similarly, CA2W is a typedef for the CA2WEX class template that wraps the MultiByteToWideChar function.

In the example below, I start with a Unicode string that includes the Greek capital letters Alpha and Omega. The string is converted to UTF-8 with CW2A and then back to Unicode with CA2W. Be sure to specify CP_UTF8 as the second parameter in both cases; otherwise ATL will use the current ANSI code page.

Keep in mind that although UTF-8 strings look like character strings, you cannot rely on pointer arithmetic to index them by character, as each character may consume anywhere from one to four bytes. Likewise, a character in a UTF-16 string may require more than two bytes (a surrogate pair of two 16-bit code units) if it falls in the range above U+FFFF. In general you should treat user input as opaque buffers.

#include <atlconv.h>
#include <atlstr.h>

#define ASSERT ATLASSERT

int main()
{
    const CStringW unicode1 = L"\x0391 and \x03A9"; // 'Alpha' and 'Omega'

    const CStringA utf8 = CW2A(unicode1, CP_UTF8);

    ASSERT(utf8.GetLength() > unicode1.GetLength());

    const CStringW unicode2 = CA2W(utf8, CP_UTF8);

    ASSERT(unicode1 == unicode2);
}
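
The earlier caution about variable-width characters can be made concrete with a small sketch. It uses only standard C++ rather than ATL, the function names are illustrative (they are not part of ATL or the Windows SDK), and it assumes well-formed input; it is not a validator.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Counts code points in well-formed UTF-8 by skipping continuation
// bytes, which always have the bit pattern 10xxxxxx.
std::size_t count_code_points_utf8(const std::string& utf8)
{
    std::size_t count = 0;
    for (unsigned char byte : utf8) {
        if ((byte & 0xC0) != 0x80) {
            ++count;
        }
    }
    return count;
}

// Counts code points in well-formed UTF-16 by skipping low surrogates
// (0xDC00..0xDFFF), which only appear as the second half of a pair.
std::size_t count_code_points_utf16(const std::u16string& utf16)
{
    std::size_t count = 0;
    for (char16_t unit : utf16) {
        const bool is_low_surrogate = (unit >= 0xDC00 && unit <= 0xDFFF);
        if (!is_low_surrogate) {
            ++count;
        }
    }
    return count;
}

void code_point_demo()
{
    // "Alpha and Omega": 7 code points, 7 UTF-16 units, 9 UTF-8 bytes.
    const std::string utf8 = "\xCE\x91 and \xCE\xA9";
    const std::u16string utf16 = u"\u0391 and \u03A9";
    assert(utf8.size() == 9);
    assert(utf16.size() == 7);
    assert(count_code_points_utf8(utf8) == 7);
    assert(count_code_points_utf16(utf16) == 7);

    // U+1F600 lies above U+FFFF: 4 UTF-8 bytes, 2 UTF-16 code units.
    const std::string high8 = "\xF0\x9F\x98\x80";
    const std::u16string high16 = u"\U0001F600";
    assert(high8.size() == 4);
    assert(high16.size() == 2);
    assert(count_code_points_utf8(high8) == 1);
    assert(count_code_points_utf16(high16) == 1);
}
```

The point is that neither size() of a UTF-8 string nor the length of a UTF-16 string equals the number of characters in general, which is why indexing by character with plain pointer arithmetic is unreliable.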


Conversion between Unicode UTF-8 and UTF-16 with STL strings

Suppose there is a need to convert between Unicode UTF-8 and Unicode UTF-16 in a Windows C++ application. This can happen because it is convenient to use UTF-16 as the Unicode encoding inside a C++ app (UTF-16 is the encoding used by the Win32 Unicode APIs) and UTF-8 outside app boundaries (e.g. in text files).

To do that, it is possible to use ATL conversion helpers like CA2W and CW2A, as shown in this blog post by Kenny Kerr. Alternatively, it is possible to call MultiByteToWideChar and WideCharToMultiByte directly, together with the CString(A/W) classes, as illustrated in a previous blog post here.

Another option is to use STL strings instead of ATL/MFC CString. An advantage of this approach is that it also works with the Express editions of Visual Studio (which do not include ATL and MFC). Moreover, STL strings integrate better with the rest of the STL and with Boost, and some C++ programmers simply prefer STL strings to ATL/MFC CString. The code that uses STL strings is similar to the code illustrated previously for CString. For a conversion from UTF-8 to UTF-16, the MultiByteToWideChar API is called twice: the first call determines the length of the resulting UTF-16 string, so that enough memory can be reserved for it; the second call performs the actual conversion. The symmetric conversion (from UTF-16 to UTF-8) follows the same pattern, this time using the WideCharToMultiByte API.

A couple of differences between CString and STL’s strings in the context of Win32 programming are worth noting.

First, Win32 APIs tend to receive input strings in the form of LPCTSTR, which is a typedef for “const TCHAR *”, i.e. raw, NUL-terminated C strings. CString plays well in this model: it is possible to simply pass CString instances where LPCTSTR parameters are expected (thanks to the implicit conversion operator to PCXSTR implemented by CSimpleStringT, the base class of CStringT). With std::[w]string arguments, by contrast, the c_str() method must be called explicitly (data() is equivalent only since C++11, which guarantees that it returns a NUL-terminated buffer).

Second, when there is a need to reserve some memory inside the CString buffer in order to modify its content directly, it is possible to call the GetBuffer() or GetBufferSetLength() methods (these return a non-const pointer to the internal string buffer, allowing direct modification of its content). With STL strings, it is instead possible to call the resize() method to make the string large enough for the content, and then use code like &myString[0] to get direct (non-const) access to the internal string content. (This technique works at least with the current Visual C++ implementation of STL strings; since C++11 the standard guarantees that std::basic_string storage is contiguous.)
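
The two differences can be tried out without any Windows header. The sketch below is only an analogy: it swaps the Win32 conversion APIs for the standard C function std::snprintf, which also reports the required size when given no buffer, and the helper name format_hex is made up for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <cstring>
#include <string>

// Formats a value into a std::string through a C-style API that writes
// into a caller-provided buffer (std::snprintf stands in for Win32 here).
std::string format_hex(unsigned int value)
{
    // First call: a null buffer of size 0 only asks for the required
    // length, which excludes the terminating NUL.
    const int length = std::snprintf(nullptr, 0, "0x%04X", value);
    if (length < 0) {
        return std::string();
    }

    // resize() gives the string 'length' writable characters.
    std::string text;
    text.resize(static_cast<std::size_t>(length));

    // Second call: write through &text[0]. The size passed leaves room for
    // the NUL that snprintf stores at text[length], where the string
    // already keeps its own terminator.
    std::snprintf(&text[0], text.size() + 1, "0x%04X", value);
    return text;
}

void c_api_demo()
{
    const std::string text = format_hex(0x3A9);

    // A C-style API needs an explicit c_str(); std::string has no
    // implicit conversion to const char*.
    assert(std::strlen(text.c_str()) == 6);
    assert(text == "0x03A9");
}
```

With CString the same two steps would be GetBuffer() followed by ReleaseBuffer(), and the instance could be passed to the C-style function without any explicit call.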

With these two differences between CString and STL strings in mind, it should be easy to follow the commented code in the “utf8conv.h” file, attached to this blog post.

As a final note, the Win32 APIs used in the UTF-8 conversion process can fail; as is common in the Win32 programming model, the GetLastError function can be used to retrieve more details on the error. Instead of using return codes for error conditions, the attached source code throws C++ exceptions. For this purpose, an exception class named utf8_error is derived from std::exception and used to signal error conditions during the conversion process.
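
Since the attached header is not reproduced here, the following is a minimal sketch of what an STL-based version might look like, not the author's actual “utf8conv.h”. The function names, the error_code() accessor, and the choice of std::runtime_error (itself derived from std::exception) as the base class are assumptions. The `#ifdef _WIN32` guard is only there so the snippet stays inert on other platforms.

```cpp
#include <cassert>
#include <climits>
#include <cstddef>
#include <stdexcept>
#include <string>

#ifdef _WIN32
#include <windows.h>

// Hypothetical exception type that carries the GetLastError() code.
class utf8_error : public std::runtime_error
{
public:
    utf8_error(const char* message, DWORD error)
        : std::runtime_error(message), m_error(error) {}
    DWORD error_code() const { return m_error; }
private:
    DWORD m_error;
};

// WC_ERR_INVALID_CHARS is only defined when targeting Vista or later.
#ifdef WC_ERR_INVALID_CHARS
const DWORD kToUtf8Flags = WC_ERR_INVALID_CHARS;
#else
const DWORD kToUtf8Flags = 0;
#endif

inline std::wstring utf16_from_utf8(const std::string& utf8)
{
    if (utf8.empty()) {
        return std::wstring();
    }
    if (utf8.size() > static_cast<std::size_t>(INT_MAX)) {
        throw utf8_error("UTF-8 input is too long", ERROR_INVALID_PARAMETER);
    }
    const int cbUtf8 = static_cast<int>(utf8.size());

    // First call: no destination buffer, so only the required size in
    // WCHARs is returned (no NUL, because the source length excludes it).
    const int cchUtf16 = ::MultiByteToWideChar(
        CP_UTF8, MB_ERR_INVALID_CHARS, utf8.data(), cbUtf8, nullptr, 0);
    if (cchUtf16 == 0) {
        throw utf8_error("Cannot get UTF-16 length", ::GetLastError());
    }

    // Second call: convert directly into the resized string buffer.
    std::wstring utf16;
    utf16.resize(static_cast<std::size_t>(cchUtf16));
    const int written = ::MultiByteToWideChar(
        CP_UTF8, MB_ERR_INVALID_CHARS, utf8.data(), cbUtf8,
        &utf16[0], cchUtf16);
    if (written == 0) {
        throw utf8_error("Cannot convert UTF-8 to UTF-16", ::GetLastError());
    }
    return utf16;
}

inline std::string utf8_from_utf16(const std::wstring& utf16)
{
    if (utf16.empty()) {
        return std::string();
    }
    if (utf16.size() > static_cast<std::size_t>(INT_MAX)) {
        throw utf8_error("UTF-16 input is too long", ERROR_INVALID_PARAMETER);
    }
    const int cchUtf16 = static_cast<int>(utf16.size());

    // Same two-call pattern, this time with WideCharToMultiByte.
    const int cbUtf8 = ::WideCharToMultiByte(
        CP_UTF8, kToUtf8Flags, utf16.data(), cchUtf16,
        nullptr, 0, nullptr, nullptr);
    if (cbUtf8 == 0) {
        throw utf8_error("Cannot get UTF-8 length", ::GetLastError());
    }

    std::string utf8;
    utf8.resize(static_cast<std::size_t>(cbUtf8));
    const int written = ::WideCharToMultiByte(
        CP_UTF8, kToUtf8Flags, utf16.data(), cchUtf16,
        &utf8[0], cbUtf8, nullptr, nullptr);
    if (written == 0) {
        throw utf8_error("Cannot convert UTF-16 to UTF-8", ::GetLastError());
    }
    return utf8;
}

inline void utf8_round_trip_demo()
{
    const std::wstring utf16 = L"\x0391 and \x03A9"; // 'Alpha' and 'Omega'
    const std::string utf8 = utf8_from_utf16(utf16);
    assert(utf8.size() == 9);
    assert(utf16_from_utf8(utf8) == utf16);
}
#endif // _WIN32
```

Passing the exact string length (without the terminating NUL) lets std::string and std::wstring supply their own terminator, so resize() receives precisely the count returned by the first call.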


Conversion between Unicode UTF-16 and UTF-8 in C++/Win32

There are several possible representations of Unicode text, e.g. UTF-8, UTF-16, UTF-32, etc.

UTF-16 is the default Unicode encoding form used by Windows.

UTF-8 is a common encoding form used to exchange text data on the Internet.

One of the advantages of UTF-8 is that there is no endianness problem (i.e. big-endian vs. little-endian), because UTF-8 is interpreted just as a sequence of bytes (by contrast, it is important to specify the correct endianness of UTF-16 and UTF-32 code units).
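
The endianness point can be illustrated with a short standard C++ sketch. The helper utf16_bytes is a made-up name for this example; it simply writes each 16-bit code unit as two bytes in the requested order.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Serializes UTF-16 code units to bytes in the requested byte order.
std::vector<std::uint8_t> utf16_bytes(const std::u16string& text,
                                      bool big_endian)
{
    std::vector<std::uint8_t> bytes;
    bytes.reserve(text.size() * 2);
    for (char16_t unit : text) {
        const std::uint8_t high = static_cast<std::uint8_t>(unit >> 8);
        const std::uint8_t low = static_cast<std::uint8_t>(unit & 0xFF);
        if (big_endian) {
            bytes.push_back(high);
            bytes.push_back(low);
        } else {
            bytes.push_back(low);
            bytes.push_back(high);
        }
    }
    return bytes;
}

void endianness_demo()
{
    // U+03A9 (Omega) is a single UTF-16 code unit, 0x03A9.
    const std::u16string omega = u"\u03A9";
    const std::vector<std::uint8_t> le = utf16_bytes(omega, false);
    const std::vector<std::uint8_t> be = utf16_bytes(omega, true);
    assert((le == std::vector<std::uint8_t>{0xA9, 0x03}));
    assert((be == std::vector<std::uint8_t>{0x03, 0xA9}));

    // The UTF-8 form has one fixed byte sequence, CE A9, regardless of
    // the byte order of the machine that produced it.
    const std::string utf8 = "\xCE\xA9";
    assert(static_cast<unsigned char>(utf8[0]) == 0xCE);
    assert(static_cast<unsigned char>(utf8[1]) == 0xA9);
}
```

The same code unit yields two different byte streams for UTF-16, so a reader has to know (or detect) the byte order, whereas the UTF-8 bytes are unambiguous.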

To convert text between Unicode UTF-8 and UTF-16 encodings, a couple of Win32 APIs come in handy: the MultiByteToWideChar and WideCharToMultiByte functions.

Suppose we want to convert text from UTF-8 to UTF-16. In this case, the MultiByteToWideChar function can be used. To request a conversion from UTF-8, the CP_UTF8 code page value must be specified as the first parameter of MultiByteToWideChar.

This function should be called twice. In the first call, we set the cchWideChar parameter to 0, so the function returns the required buffer size, in WCHARs, for the resulting UTF-16 ("wide char") string. We can then dynamically allocate a buffer to store the UTF-16 string (this is done using the CStringW::GetBuffer method in the code sample below). Finally, we call MultiByteToWideChar again to perform the actual conversion from UTF-8 to UTF-16.

(So, to summarize: the purpose of the first call to the function is to get the destination buffer size, the second call to the function does the actual conversion.)

A similar process occurs for WideCharToMultiByte, which can be used to convert text from Unicode UTF-16 ("wide char") to UTF-8.

The following commented C++ code shows how to use these Win32 functions to convert text between UTF-8 and UTF-16.

This code is pure Win32 C++ code; it uses ATL's convenient CString class (UTF-16 strings are stored in instances of CStringW; UTF-8 strings are stored in instances of CStringA). The code can be used in the context of MFC as well.

// *** Routines to convert between Unicode UTF-8 and Unicode UTF-16 ***
//
// By Giovanni Dicanio <giovanni.dicanio AT gmail.com>
//
// Last update: 2010, January 2nd
//
// These routines use ::MultiByteToWideChar and ::WideCharToMultiByte
// Win32 API functions to convert between Unicode UTF-8 and UTF-16.
//
// UTF-16 strings are stored in instances of CStringW.
// UTF-8 strings are stored in instances of CStringA.
//
// On error, the conversion routines use AtlThrow to signal the
// error condition.
//
// If input string pointers are NULL, empty strings are returned.
//
// Prefixes used in these routines:
// --------------------------------
// - cch : count of characters (CHAR's or WCHAR's)
// - cb  : count of bytes
// - psz : pointer to a NUL-terminated string (CHAR* or WCHAR*)
// - str : instance of CString(A/W) class
//
// Useful Web References:
// ----------------------
// WideCharToMultiByte Function
// MultiByteToWideChar Function
// AtlThrow
//
// Developed on VC9 (Visual Studio 2008 SP1)

#include <limits.h>   // INT_MAX
#include <atlbase.h>  // ATL core: ATLASSERT, AtlThrow, AtlThrowLastWin32
#include <atlstr.h>   // CStringA, CStringW
#include <strsafe.h>  // StringCchLengthA, StringCchLengthW

namespace UTF8Util
{

//----------------------------------------------------------------------------
// FUNCTION: ConvertUTF8ToUTF16
// DESC: Converts Unicode UTF-8 text to Unicode UTF-16 (Windows default).
//----------------------------------------------------------------------------
CStringW ConvertUTF8ToUTF16( __in const CHAR * pszTextUTF8 )
{
    // Special case of NULL or empty input string
    if ( (pszTextUTF8 == NULL) || (*pszTextUTF8 == '\0') )
    {
        // Return empty string
        return L"";
    }

    // Consider CHAR's count corresponding to total input string length,
    // including end-of-string (\0) character
    const size_t cchUTF8Max = INT_MAX - 1;
    size_t cchUTF8;
    HRESULT hr = ::StringCchLengthA( pszTextUTF8, cchUTF8Max, &cchUTF8 );
    if ( FAILED( hr ) )
    {
        AtlThrow( hr );
    }

    // Consider also terminating \0
    ++cchUTF8;

    // Convert to 'int' for use with MultiByteToWideChar API
    int cbUTF8 = static_cast<int>( cchUTF8 );

    // Get size of destination UTF-16 buffer, in WCHAR's
    int cchUTF16 = ::MultiByteToWideChar(
        CP_UTF8,                // convert from UTF-8
        MB_ERR_INVALID_CHARS,   // error on invalid chars
        pszTextUTF8,            // source UTF-8 string
        cbUTF8,                 // total length of source UTF-8 string,
                                // in CHAR's (= bytes), including end-of-string \0
        NULL,                   // unused - no conversion done in this step
        0                       // request size of destination buffer, in WCHAR's
        );
    ATLASSERT( cchUTF16 != 0 );
    if ( cchUTF16 == 0 )
    {
        AtlThrowLastWin32();
    }

    // Allocate destination buffer to store UTF-16 string
    CStringW strUTF16;
    WCHAR * pszUTF16 = strUTF16.GetBuffer( cchUTF16 );

    // Do the conversion from UTF-8 to UTF-16
    int result = ::MultiByteToWideChar(
        CP_UTF8,                // convert from UTF-8
        MB_ERR_INVALID_CHARS,   // error on invalid chars
        pszTextUTF8,            // source UTF-8 string
        cbUTF8,                 // total length of source UTF-8 string,
                                // in CHAR's (= bytes), including end-of-string \0
        pszUTF16,               // destination buffer
        cchUTF16                // size of destination buffer, in WCHAR's
        );
    ATLASSERT( result != 0 );
    if ( result == 0 )
    {
        AtlThrowLastWin32();
    }

    // Release internal CString buffer
    strUTF16.ReleaseBuffer();

    // Return resulting UTF16 string
    return strUTF16;
}

//----------------------------------------------------------------------------
// FUNCTION: ConvertUTF16ToUTF8
// DESC: Converts Unicode UTF-16 (Windows default) text to Unicode UTF-8.
//----------------------------------------------------------------------------
CStringA ConvertUTF16ToUTF8( __in const WCHAR * pszTextUTF16 )
{
    // Special case of NULL or empty input string
    if ( (pszTextUTF16 == NULL) || (*pszTextUTF16 == L'\0') )
    {
        // Return empty string
        return "";
    }

    // Consider WCHAR's count corresponding to total input string length,
    // including end-of-string (L'\0') character.
    const size_t cchUTF16Max = INT_MAX - 1;
    size_t cchUTF16;
    HRESULT hr = ::StringCchLengthW( pszTextUTF16, cchUTF16Max, &cchUTF16 );
    if ( FAILED( hr ) )
    {
        AtlThrow( hr );
    }

    // Consider also terminating \0
    ++cchUTF16;

    // WC_ERR_INVALID_CHARS flag is set to fail if invalid input character
    // is encountered.
    // This flag is supported on Windows Vista and later.
    // Don't use it on Windows XP and previous.
#if (WINVER >= 0x0600)
    DWORD dwConversionFlags = WC_ERR_INVALID_CHARS;
#else
    DWORD dwConversionFlags = 0;
#endif

    // Get size of destination UTF-8 buffer, in CHAR's (= bytes)
    int cbUTF8 = ::WideCharToMultiByte(
        CP_UTF8,                        // convert to UTF-8
        dwConversionFlags,              // specify conversion behavior
        pszTextUTF16,                   // source UTF-16 string
        static_cast<int>( cchUTF16 ),   // total source string length, in WCHAR's,
                                        // including end-of-string \0
        NULL,                           // unused - no conversion required in this step
        0,                              // request buffer size
        NULL, NULL                      // unused
        );
    ATLASSERT( cbUTF8 != 0 );
    if ( cbUTF8 == 0 )
    {
        AtlThrowLastWin32();
    }

    // Allocate destination buffer for UTF-8 string
    CStringA strUTF8;
    int cchUTF8 = cbUTF8; // sizeof(CHAR) = 1 byte
    CHAR * pszUTF8 = strUTF8.GetBuffer( cchUTF8 );

    // Do the conversion from UTF-16 to UTF-8
    int result = ::WideCharToMultiByte(
        CP_UTF8,                        // convert to UTF-8
        dwConversionFlags,              // specify conversion behavior
        pszTextUTF16,                   // source UTF-16 string
        static_cast<int>( cchUTF16 ),   // total source string length, in WCHAR's,
                                        // including end-of-string \0
        pszUTF8,                        // destination buffer
        cbUTF8,                         // destination buffer size, in bytes
        NULL, NULL                      // unused
        );
    ATLASSERT( result != 0 );
    if ( result == 0 )
    {
        AtlThrowLastWin32();
    }

    // Release internal CString buffer
    strUTF8.ReleaseBuffer();

    // Return resulting UTF-8 string
    return strUTF8;
}

} // namespace UTF8Util

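Reposted from: https://blog.csdn.net/weixin_34129696/article/details/85525620. If this infringes your copyright, please leave a comment with the address of the original article and we will remove this post. We apologize for any inconvenience.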
