Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


html - How to detect "â€‹" (a combination of Unicode characters) in a C++ string

I am trying to detect certain combinations of Unicode characters (like â€‹) in order to clean up a string. Detection works for a single Unicode character, but a combination of characters is not detected.

I use these strings to build an HTML page from another HTML page that needs cleaning up. I only want to clean strings containing this kind of Unicode character, which is not even visible when the HTML page is rendered in a browser.

Below is the sample code:

void detect_Unicode(string& str) {
      if(!str.empty() && str.find_first_not_of(" \t\n\r\f\v\u00A0\u00C2\u00E2\u20AC\u2039")==string::npos)
                str.assign(" ");
      return;
 }

Input string:

1. " â€‹    â€‹ " ;
2. "are Â Â there is something Â Â Â â€‹ combination    â€‹"
3. " Â Â "
4. "â€‹  Â Â â€‹"
5. "Â Â â â"

Expected Output:

1. " "  
2. "are Â Â there is something Â Â Â â€‹ combination    â€‹"
3. " "  
4. " "  
5. " "

Please let me know other ways too.


Fight dragons for too long and you become a dragon yourself; gaze into the abyss for too long and the abyss gazes back…

1 Reply


OK, following on from the comments above, I think it's highly likely that the input string is in UTF-8 (after all, in an HTML context, what else would it be?).

On that basis, I humbly submit this:

#include <string>
#include <codecvt>
#include <locale>

std::string narrow (const std::wstring& ws)
{
    std::wstring_convert <std::codecvt_utf8 <wchar_t>, wchar_t> convert;
    return convert.to_bytes (ws);
}

std::wstring widen (const std::string& s)
{
    std::wstring_convert <std::codecvt_utf8 <wchar_t>, wchar_t> convert;
    return convert.from_bytes (s);
}

std::string detect_Unicode (const std::string& s)
{ 
    std::wstring ws = widen (s);
    if (ws.empty() || ws.find_first_not_of (L" 

fvu00A0u00C2u00E2u20ACu2039") != std::wstring::npos)
        return " ";
    return s;
}

#include <iostream>

int main ()
{
    std::cout << narrow (L"\u00A0 \u00C2 \u00E2 \u20AC \u2039\n\n");
    std::cout << "0.\"" << detect_Unicode (u8"abcde") << "\"\n";
    std::cout << "1.\"" << detect_Unicode (u8" â€‹    â€‹ ") << "\"\n";
    std::cout << "2.\"" << detect_Unicode (u8"are Â Â there is something Â Â Â â€‹ combination    â€‹") << "\"\n";
    std::cout << "3.\"" << detect_Unicode (u8" Â Â ") << "\"\n";
    std::cout << "4.\"" << detect_Unicode (u8"â€‹  Â Â â€‹") << "\"\n";
    std::cout << "5.\"" << detect_Unicode (u8"Â Â â â") << "\"\n";
}

Output:

  Â â € ‹

0." "
1." â€‹    â€‹ "
2." "
3." Â Â "
4."â€‹  Â Â â€‹"
5."Â Â â â"

Now this is not the output the OP expects, but I think that's simply because the logic (as opposed to the implementation) of detect_Unicode() looks flawed. The point here is that converting the input string to a wide string means that you can use standard basic_string operations on it reliably, because there are no multibyte issues now.
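To make the "multibyte issues" concrete, here is a minimal sketch (my own illustration, not part of the code above) of what happens when the same character set is searched byte by byte in a narrow UTF-8 string. The names kCleanupBytes and all_bytes_in_set are made up for this example. Because find_first_not_of on a std::string compares individual bytes, a character that is not in the set can still pass if each of its bytes happens to occur in some other character of the set.

```cpp
#include <cassert>
#include <string>

// The cleanup set from the question, spelled out as UTF-8 bytes
// (hypothetical name, for illustration only).
const std::string kCleanupBytes =
    " \t\n\r\f\v"
    "\xC2\xA0"       // U+00A0 NO-BREAK SPACE
    "\xC3\x82"       // U+00C2 (A with circumflex)
    "\xC3\xA2"       // U+00E2 (a with circumflex)
    "\xE2\x82\xAC"   // U+20AC (euro sign)
    "\xE2\x80\xB9";  // U+2039 (single left angle quote)

// Byte-wise check: true if every *byte* of s occurs somewhere in the set.
// This is what a narrow-string find_first_not_of actually tests.
bool all_bytes_in_set(const std::string& s) {
    return s.find_first_not_of(kCleanupBytes) == std::string::npos;
}
```

For instance, the cent sign U+00A2 is encoded as the bytes C2 A2. It is not in the set, yet all_bytes_in_set("\xC2\xA2") reports a match, because C2 comes from U+00A0 and A2 comes from U+00E2. Comparing whole code points (as the wide-string version does) avoids this kind of false positive.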

An alternative, slightly radical, implementation of detect_Unicode() might be:

for (auto wide_char : ws)
{
    if (wide_char > 0xff)
        return " ";
}
return s;

But really, now you have a wide string to hand in detect_Unicode, anything is possible, so go wild OP.
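If all you need is the "anything above 0xff" test, a conversion is arguably not required at all. The sketch below is a different technique from the loop above: it works directly on the UTF-8 bytes and assumes the input is valid UTF-8. In valid UTF-8, code points up to U+00FF use lead bytes no higher than 0xC3 and continuation bytes are always in the range 0x80 to 0xBF, so any byte of 0xC4 or more means a code point above U+00FF is present. The function name is made up for this example.

```cpp
#include <cassert>
#include <string>

// Returns true if the UTF-8 string contains a code point above U+00FF.
// Assumption: utf8 holds valid UTF-8; malformed input is not diagnosed.
bool has_code_point_above_ff(const std::string& utf8) {
    for (unsigned char byte : utf8) {
        // Lead bytes 0xC4 and up start code points >= U+0100;
        // continuation bytes (0x80..0xBF) never reach this value.
        if (byte >= 0xC4)
            return true;
    }
    return false;
}
```

A caller could then write something like: return has_code_point_above_ff (s) ? " " : s;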

Other notes:

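  • The <codecvt> header and std::wstring_convert are deprecated as of C++17, but since the standard library offers no obvious replacement you might as well run with them. You can always change the implementations of narrow and widen if it comes to it.
  • Depending on platform, std::wstring might not be the best choice, but it's probably fine. On Windows, for example, wchar_t is 16 bits wide, so std::codecvt_utf8<wchar_t> cannot represent characters outside the Basic Multilingual Plane. You could also look at std::u16string and std::u32string.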
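As a rough idea of what swapping out the conversion could look like, here is a minimal hand-rolled sketch that decodes UTF-8 into a std::u32string without <codecvt>. The names utf8_to_utf32 and only_cleanup_chars are hypothetical, and the decoder is deliberately simple: it rejects bad lead bytes and truncated or malformed sequences, but it does not check for overlong encodings or surrogates. The character set is the one from the question, and the test direction (every character is in the set) follows the question's original logic.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for widen(): decode UTF-8 into UTF-32 code points.
std::u32string utf8_to_utf32(const std::string& s) {
    std::u32string out;
    for (std::size_t i = 0; i < s.size();) {
        const unsigned char lead = static_cast<unsigned char>(s[i]);
        char32_t cp;
        std::size_t extra;  // number of continuation bytes expected
        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");
        if (s.size() - i <= extra)
            throw std::invalid_argument("truncated UTF-8 sequence");
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(s[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::invalid_argument("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F);
        }
        out.push_back(cp);
        i += extra + 1;
    }
    return out;
}

// True if every code point of the UTF-8 string s is in the question's set.
// Note: an empty string also yields true; callers may want to test for that.
bool only_cleanup_chars(const std::string& s) {
    static const std::u32string set =
        U" \t\n\r\f\v\u00A0\u00C2\u00E2\u20AC\u2039";
    return utf8_to_utf32(s).find_first_not_of(set) == std::u32string::npos;
}
```

Since char32_t is 32 bits on every platform, this sidesteps the wchar_t width question as well. Whether throwing on malformed input is the right policy depends on where the HTML comes from; replacing bad sequences with U+FFFD would be another reasonable choice.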

Live demo.

Inspiration taken from here.

