检测字节流是否是UTF8编码
发布日期:2021-10-17 00:06:11 浏览次数:19 分类:技术文章

本文共 3911 字,大约阅读时间需要 13 分钟。

几天前偶尔看到有人发帖子问“如何自动识别判断url中的中文参数是GB2312还是Utf-8编码”

也拜读了wcwtitxu使用巨牛的正则表达式检测UTF8编码的算法。

使用无数或条件的正则表达式用起来却是性能不高。

刚好曾经在项目中有类似的需求,这里把处理思路和整理后的源代码贴出来供大家参考

先聊聊原理:

UTF8的编码规则如下表

UTF8 Encoding Rule

看起来很复杂,总结起来如下:

ASCII码(U+0000 - U+007F),不编码

其余编码规则为

•第一个Byte二进制以形式为n个1紧跟个0 (n >= 2), 0后面的位数用来存储真正的字符编码,n的个数说明了这个多Byte字节组字节数(包括第一个Byte) 
•结下来会有n个以10开头的Byte,后6个bit存储真正的字符编码。 
因此对整个编码byte流进行分析可以得出是否是UTF8编码的判断。

根据这个规则,我给出的C#代码如下:

/// <summary>
///   Determines whether the given <paramref name="inputStream"/>is UTF8 encoding bytes.
/// </summary>
/// <param name="inputStream">
///    The input stream.
///  </param>
/// <returns>
///   <see langword="true"/> if given bystes stream is in UTF8 encoding; otherwise, <see langword="false"/>.
/// </returns>
/// <remarks>
///   All ASCII chars will regards not UTF8 encoding.
/// </remarks>
public 
static 
bool 
IsTextUTF8(
ref 
byte
[] inputStream)
{
    
int 
encodingBytesCount = 0;
    
bool 
allTextsAreASCIIChars =
true
;
 
    
for 
(
int 
i = 0; i < inputStream.Length; i++)
    
{
        
byte 
current = inputStream[i];
 
        
if 
((current & 0x80) == 0x80)
        
{                   
            
allTextsAreASCIIChars =
false
;
        
}
        
// First byte
        
if 
(encodingBytesCount == 0)
        
{
            
if 
((current & 0x80) == 0)
            
{
                
// ASCII chars, from 0x00-0x7F
                
continue
;
            
}
 
            
if 
((current & 0xC0) == 0xC0)
            
{
                
encodingBytesCount = 1;
                
current <<= 2;
 
                
// More than two bytes used to encoding a unicode char.
                
// Calculate the real length.
                
while 
((current & 0x80) == 0x80)
                
{
                    
current <<= 1;
                    
encodingBytesCount++;
                
}
            
}                   
            
else
            
{
                
// Invalid bits structure for UTF8 encoding rule.
                
return 
false
;
            
}
        
}               
        
else
        
{
            
// Following bytes, must start with 10.
            
if 
((current & 0xC0) == 0x80)
            
{                       
                
encodingBytesCount--;
            
}
            
else
            
{
                
// Invalid bits structure for UTF8 encoding rule.
                
return 
false
;
            
}
        
}
    
}
 
    
if 
(encodingBytesCount != 0)
    
{
        
// Invalid bits structure for UTF8 encoding rule.
        
// Wrong following bytes count.
        
return 
false
;
    
}
 
    
// Although UTF8 supports encoding for ASCII chars, we regard as a input stream, whose contents are all ASCII as default encoding.
    
return 
!allTextsAreASCIIChars;
}

 

 

再附上单元测试代码:

 

/// <summary>
///This is a test class for EncodingHelperTest and is intended
///to contain all EncodingHelperTest Unit Tests
///</summary>
[TestClass()]
public 
class 
EncodingHelperTest
{
    
/// <summary>
    
///  Normal test for this method.
    
///</summary>
    
[TestMethod()]
    
public 
void 
IsTextUTF8Test()
    
{
        
for 
(
int 
i = 0; i < 1000; i++)
        
{
            
List<Char> chars =
new 
List<
char
>();
            
chars.Add(
'中'
);
 
            
List<UnicodeCategory> temp =
new 
List<UnicodeCategory>();
            
Random rd =
new 
Random((
int
)(DateTime.Now.Ticks & 0x7FFFFFFF));
 
            
for 
(
int 
j = 0; j < 255; j++)
            
{
                
char 
ch = (
char
)rd.Next(0xFFFF);
                
UnicodeCategory uc = System.Globalization.CharUnicodeInfo.GetUnicodeCategory(ch);
                
if 
(uc == UnicodeCategory.Surrogate ||
// Single surrogate could not be encoding correctly.
                    
uc == UnicodeCategory.PrivateUse ||
// Private use blocks should be excluded.
                    
uc == UnicodeCategory.OtherNotAssigned
                    
)
                
{
                    
j--;
                
}
                
else
                
{
                    
chars.Add(ch);
                    
temp.Add(uc);
                
}
            
}
 
            
string 
str =
new 
string
(chars.ToArray());
 
            
byte
[] inputStream = Encoding.UTF8.GetBytes(str);
            
bool 
expected =
true
;
            
bool 
actual;
            
actual = EncodingHelper.IsTextUTF8(
ref 
inputStream);
            
Assert.AreEqual(expected, actual,
string
.Format(
"UTF8_Assert Fails at:{0}"
, str));
 
            
inputStream = Encoding.GetEncoding(932).GetBytes(str);
            
expected =
false
;
 
            
actual = EncodingHelper.IsTextUTF8(
ref 
inputStream);
            
Assert.AreEqual(expected, actual,
string
.Format(
"ShiftJIS_Assert Fails at:{0}"
, str));
        
}
    
}
 
    
/// <summary>
    
///   Check with All ASCII chars
    
/// </summary>
    
[TestMethod]
    
public 
void 
IsTextUTF8Test_AllASCII()
    
{
        
string 
str =
"ABCDEFGHKLHSJKLDFHJKLHAJKLSHJKLHAJKLSHDJKLAHSDJKLHAJKLSDHJKLASHDJKLHASJKLDHJKLASD"
;
 
        
byte
[] inputStream = Encoding.UTF8.GetBytes(str);
        
bool 
expected =
false
;
        
bool 
actual;
        
actual = EncodingHelper.IsTextUTF8(
ref 
inputStream);
        
Assert.AreEqual(expected, actual,
string
.Format(
"UTF8_Assert Fails at:{0}"
, str));
 
 
    
}
}

 

另:

如果是判断一个文件是否使用了UTF8编码,不一定非用这种方法,因为通常以UTF8格式保存的文件最初两个字符是BOM头,标示该文件使用了UTF8编码。

参考:

维基百科:

转载地址:https://blog.csdn.net/sulan2131/article/details/69053582 如侵犯您的版权,请留言回复原文章的地址,我们会给您删除此文章,给您带来不便请您谅解!

上一篇:Url Rewrite 再说Url 重写
下一篇:如何使用MOQ进行单元测试

发表评论

最新留言

路过按个爪印,很不错,赞一个!
[***.219.124.196]2024年04月02日 12时57分44秒

关于作者

    喝酒易醉,品茶养心,人生如梦,品茶悟道,何以解忧?唯有杜康!
-- 愿君每日到此一游!

推荐文章