How to Identify UTF-8 Text Without a BOM

2016-09-02 12:59:03 Source: www.45fan.com

/* IsTextUTF8
 *
 * UTF-8 is the encoding of Unicode based on Internet Society RFC2279
 * ( See http://www.cis.ohio-state.edu/htbin/rfc/rfc2279.html )
 *
 * Basically:
 * 0000 0000-0000 007F - 0xxxxxxx (ascii converts to 1 octet!)
 * 0000 0080-0000 07FF - 110xxxxx 10xxxxxx ( 2 octet format)
 * 0000 0800-0000 FFFF - 1110xxxx 10xxxxxx 10xxxxxx (3 octet format)
 * (this keeps going for 32 bit unicode)
 *
 *
 * Return value: TRUE, if the text is in UTF-8 format.
 *               FALSE, if the text is not in UTF-8 format.
 *               We will also return FALSE if it is only 7-bit ascii, so the right code page
 *               will be used.
 *
 * Actually for 7 bit ascii, it doesn't matter which code page we use, but
 * notepad will remember that it is utf-8 and "save" or "save as" will store
 * the file with a UTF-8 BOM. Not cool.
 */

INT IsTextUTF8( LPSTR lpstrInputStream, INT iLen )
{
    INT i;
    DWORD cOctets; // octets to go in this UTF-8 encoded character
    UCHAR chr;
    BOOL bAllAscii= TRUE;

    cOctets= 0;
    for( i=0; i < iLen; i++ ) {
        chr= *(lpstrInputStream+i);

        if( (chr&0x80) != 0 ) bAllAscii= FALSE;

        if( cOctets == 0 ) {
            //
            // 7 bit ascii after 7 bit ascii is just fine. Handle start of encoding case.
            //
            if( chr >= 0x80 ) {
                //
                // count of the leading 1 bits is the number of octets in the encoded character
                //
                do {
                    chr <<= 1;
                    cOctets++;
                }
                while( (chr&0x80) != 0 );

                cOctets--; // count includes this octet
                if( cOctets == 0 ) return FALSE; // must start with 11xxxxxx
            }
        }
        else {
            // non-leading bytes must start as 10xxxxxx
            if( (chr&0xC0) != 0x80 ) {
                return FALSE;
            }
            cOctets--; // processed another octet in encoding
        }
    }

    //
    // End of text. Check for consistency.
    //
    if( cOctets > 0 ) { // anything left over at the end is an error
        return FALSE;
    }

    if( bAllAscii ) { // Not utf-8 if all ascii. Forces caller to use code pages for conversion
        return FALSE;
    }

    return TRUE;
}
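
The lead-byte handling in the function above can be studied in isolation. The following is only a sketch in standard C: the helper name continuation_octets is hypothetical and not part of the original code, and it merely repeats the do/while loop with a plain unsigned char in place of the Windows UCHAR type.

```c
#include <stddef.h>

/* Hypothetical helper (not in the original source): repeats the
 * do/while loop of IsTextUTF8 for a single byte >= 0x80 and returns
 * how many continuation octets the routine would then expect.
 * A result of 0 means IsTextUTF8 would reject the byte as a lead byte. */
unsigned continuation_octets(unsigned char chr)
{
    unsigned count = 0;

    /* Count the leading 1 bits by shifting them out one at a time. */
    do {
        chr <<= 1;          /* result is truncated back to 8 bits */
        count++;
    } while ((chr & 0x80) != 0);

    return count - 1;       /* the count includes the lead byte itself */
}
```

With this helper, 0xC3 (110xxxxx) yields 1, 0xE4 (1110xxxx) yields 2, and 0x80 (10xxxxxx) yields 0. Note that 0xFF yields 7: the loop only counts bits and does not check that the lead byte is one UTF-8 actually allows.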

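However, this code has a bug: the well-known "联通" bug in Microsoft Notepad is caused by this very routine. Saved in ANSI (GBK), "联通" is the four bytes C1 AA CD A8, which fit the 110xxxxx 10xxxxxx pattern twice, so the file is misidentified as UTF-8. I have not yet found a better way to identify UTF-8 text without a BOM.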
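
One possible mitigation is a stricter well-formedness check. The sketch below is an assumption of mine, not the method Notepad uses: the function name is_strict_utf8 is hypothetical, and the byte ranges follow the later RFC 3629 definition of UTF-8 (no overlong forms, no surrogates, nothing above U+10FFFF). Like IsTextUTF8, it reports pure 7-bit ASCII as "not UTF-8".

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stricter check (not from the original article).
 * Returns true only if buf[0..len) is well-formed UTF-8 under the
 * RFC 3629 byte ranges and contains at least one non-ASCII byte. */
bool is_strict_utf8(const unsigned char *buf, size_t len)
{
    bool all_ascii = true;
    size_t i = 0;

    while (i < len) {
        unsigned char c = buf[i];
        unsigned char lo = 0x80, hi = 0xBF; /* allowed range of 2nd octet */
        size_t need;                        /* continuation octets needed */

        if (c < 0x80) { i++; continue; }    /* plain ASCII */
        all_ascii = false;

        if (c >= 0xC2 && c <= 0xDF) {
            need = 1;                       /* 0xC0 and 0xC1 are overlong */
        } else if (c >= 0xE0 && c <= 0xEF) {
            need = 2;
            if (c == 0xE0) lo = 0xA0;       /* reject overlong 3-octet forms */
            if (c == 0xED) hi = 0x9F;       /* reject UTF-16 surrogates */
        } else if (c >= 0xF0 && c <= 0xF4) {
            need = 3;
            if (c == 0xF0) lo = 0x90;       /* reject overlong 4-octet forms */
            if (c == 0xF4) hi = 0x8F;       /* reject values above U+10FFFF */
        } else {
            return false;                   /* 0x80..0xC1 or 0xF5..0xFF */
        }

        if (len - i <= need) return false;  /* sequence is truncated */
        if (buf[i + 1] < lo || buf[i + 1] > hi) return false;
        for (size_t k = 2; k <= need; k++) {
            if ((buf[i + k] & 0xC0) != 0x80) return false;
        }
        i += need + 1;
    }
    return !all_ascii;
}
```

With this check the GBK bytes C1 AA CD A8 are rejected, because C1 can only begin an overlong encoding, while the UTF-8 form of "联通" (E8 81 94 E9 80 9A) is accepted. It remains a heuristic: other GBK byte pairs can still happen to form valid UTF-8, so short inputs may still be misclassified.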

Article URL: http://www.45fan.com/a/question/71222.html
Tags: text, UTF8, BOM
Editor: 路饭网