检测字节流是否是UTF8编码-白红宇的个人博客

检测字节流是否是UTF8编码

发布日期：2021-10-17 00:06:11 浏览次数：19 分类：技术文章

本文共 3911 字，大约阅读时间需要 13 分钟。

几天前偶尔看到有人发帖子问“如何自动识别判断url中的中文参数是GB2312还是Utf-8编码”

也拜读了wcwtitxu使用巨牛的正则表达式检测UTF8编码的算法。

使用无数或条件的正则表达式用起来却是性能不高。

刚好曾经在项目中有类似的需求，这里把处理思路和整理后的源代码贴出来供大家参考

先聊聊原理：

UTF8的编码规则如下表

UTF8 Encoding Rule

看起来很复杂，总结起来如下：

ASCII码（U+0000 - U+007F），不编码

其余编码规则为

•第一个Byte二进制以形式为n个1紧跟个0 (n >= 2), 0后面的位数用来存储真正的字符编码，n的个数说明了这个多Byte字节组字节数（包括第一个Byte）
•结下来会有n个以10开头的Byte，后6个bit存储真正的字符编码。
因此对整个编码byte流进行分析可以得出是否是UTF8编码的判断。

根据这个规则，我给出的C#代码如下：

              
/// <summary>
///   Determines whether the given <paramref name="inputStream"/>is UTF8 encoding bytes.
/// </summary>
/// <param name="inputStream">
///    The input stream.
///  </param>
/// <returns>
///   <see langword="true"/> if given bystes stream is in UTF8 encoding; otherwise, <see langword="false"/>.
/// </returns>
/// <remarks>
///   All ASCII chars will regards not UTF8 encoding.
/// </remarks>
public
static
bool
IsTextUTF8(
ref
byte
[] inputStream)
{ 
    
int
encodingBytesCount = 0;
    
bool
allTextsAreASCIIChars = 
true
;
    
for
(
int
i = 0; i < inputStream.Length; i++)
    
{ 
        
byte
current = inputStream[i];
        
if
((current & 0x80) == 0x80)
        
{                    
            
allTextsAreASCIIChars = 
false
;
        
}
        
// First byte
        
if
(encodingBytesCount == 0)
        
{ 
            
if
((current & 0x80) == 0)
            
{ 
                
// ASCII chars, from 0x00-0x7F
                
continue
;
            
}
            
if
((current & 0xC0) == 0xC0)
            
{ 
                
encodingBytesCount = 1;
                
current <<= 2;
                
// More than two bytes used to encoding a unicode char.
                
// Calculate the real length.
                
while
((current & 0x80) == 0x80)
                
{ 
                    
current <<= 1;
                    
encodingBytesCount++;
                
}
            
}                    
            
else
            
{ 
                
// Invalid bits structure for UTF8 encoding rule.
                
return
false
;
            
}
        
}                
        
else
        
{ 
            
// Following bytes, must start with 10.
            
if
((current & 0xC0) == 0x80)
            
{                        
                
encodingBytesCount--;
            
}
            
else
            
{ 
                
// Invalid bits structure for UTF8 encoding rule.
                
return
false
;
            
}
        
}
    
}
    
if
(encodingBytesCount != 0)
    
{ 
        
// Invalid bits structure for UTF8 encoding rule.
        
// Wrong following bytes count.
        
return
false
;
    
}
    
// Although UTF8 supports encoding for ASCII chars, we regard as a input stream, whose contents are all ASCII as default encoding.
    
return
!allTextsAreASCIIChars;
}
       
     

再附上单元测试代码：

              
/// <summary>
///This is a test class for EncodingHelperTest and is intended
///to contain all EncodingHelperTest Unit Tests
///</summary>
[TestClass()]
public
class
EncodingHelperTest
{ 
    
/// <summary>
    
///  Normal test for this method.
    
///</summary>
    
[TestMethod()]
    
public
void
IsTextUTF8Test()
    
{ 
        
for
(
int
i = 0; i < 1000; i++)
        
{ 
            
List<Char> chars = 
new
List<
char
>();
            
chars.Add(
'中'
);
            
List<UnicodeCategory> temp = 
new
List<UnicodeCategory>();
            
Random rd = 
new
Random((
int
)(DateTime.Now.Ticks & 0x7FFFFFFF));
            
for
(
int
j = 0; j < 255; j++)
            
{ 
                
char
ch = (
char
)rd.Next(0xFFFF);
                
UnicodeCategory uc = System.Globalization.CharUnicodeInfo.GetUnicodeCategory(ch);
                
if
(uc == UnicodeCategory.Surrogate || 
// Single surrogate could not be encoding correctly.
                    
uc == UnicodeCategory.PrivateUse || 
// Private use blocks should be excluded.
                    
uc == UnicodeCategory.OtherNotAssigned
                    
)
                
{ 
                    
j--;
                
}
                
else
                
{ 
                    
chars.Add(ch);
                    
temp.Add(uc);
                
}
            
}
            
string
str = 
new
string
(chars.ToArray());
            
byte
[] inputStream = Encoding.UTF8.GetBytes(str);
            
bool
expected = 
true
; 
            
bool
actual;
            
actual = EncodingHelper.IsTextUTF8(
ref
inputStream);
            
Assert.AreEqual(expected, actual, 
string
.Format(
"UTF8_Assert Fails at:{0}"
, str));
            
inputStream = Encoding.GetEncoding(932).GetBytes(str);
            
expected = 
false
;
            
actual = EncodingHelper.IsTextUTF8(
ref
inputStream);
            
Assert.AreEqual(expected, actual, 
string
.Format(
"ShiftJIS_Assert Fails at:{0}"
, str));
        
}
    
}
    
/// <summary>
    
///   Check with All ASCII chars
    
/// </summary>
    
[TestMethod]
    
public
void
IsTextUTF8Test_AllASCII()
    
{ 
        
string
str = 
"ABCDEFGHKLHSJKLDFHJKLHAJKLSHJKLHAJKLSHDJKLAHSDJKLHAJKLSDHJKLASHDJKLHASJKLDHJKLASD"
;
        
byte
[] inputStream = Encoding.UTF8.GetBytes(str);
        
bool
expected = 
false
;
        
bool
actual;
        
actual = EncodingHelper.IsTextUTF8(
ref
inputStream);
        
Assert.AreEqual(expected, actual, 
string
.Format(
"UTF8_Assert Fails at:{0}"
, str));
    
}
}
       
     

另：

如果是判断一个文件是否使用了UTF8编码，不一定非用这种方法，因为通常以UTF8格式保存的文件最初两个字符是BOM头，标示该文件使用了UTF8编码。

参考：

维基百科：

转载地址：https://blog.csdn.net/sulan2131/article/details/69053582 如侵犯您的版权，请留言回复原文章的地址，我们会给您删除此文章，给您带来不便请您谅解！

上一篇：Url Rewrite 再说Url 重写

下一篇：如何使用MOQ进行单元测试

发表评论

关于作者

喝酒易醉，品茶养心，人生如梦，品茶悟道，何以解忧？唯有杜康！

-- 愿君每日到此一游！

发表评论

最新留言

关于作者

推荐文章