html - how to detect "â€?" (combination of unicode) in c++ string

posted Oct 17, 2021 in Technique by 深蓝 (71.8m points)

I am trying to detect certain combinations of Unicode characters (such as a€?) so that I can clean up a string. Detection works for a single Unicode character, but not for a combination of them.

I use these strings to build an HTML page from another HTML page that needs cleaning up. I only want to clean strings that consist of this kind of Unicode, which is not even visible when the HTML page is shown in a browser.

Below is the sample code:

void detect_Unicode(string& str) {
    if (!str.empty() && str.find_first_not_of(" \t\r\n\f\v\u00A0\u00C2\u00E2\u20AC\u2039") == string::npos)
        str.assign(" ");
    return;
}

Input string:

1. " a€?    a€? " ;
2. "are ? ? there is something ? ? ? a€? combination    a€?"  
3. " ? ? "   
4. "a€?  ? ? a€?" 
5. "? ? a a"

Expected Output:

1. " "  
2. "are ? ? there is something ? ? ? a€? combination    a€?"   
3. " "  
4. " "  
5. " "

Please let me know other ways too.

1 Reply

深蓝 · Answer 1 · 2021-10-17T01:33:41+0000

OK, following on from the comments above, I think it's highly likely that the input string is in UTF-8 (after all, in an HTML context, what else would it be?).

On that basis, I humbly submit this:

#include <string>
#include <codecvt>
#include <locale>

std::string narrow (const std::wstring& ws)
{
    std::wstring_convert <std::codecvt_utf8 <wchar_t>, wchar_t> convert;
    return convert.to_bytes (ws);
}

std::wstring widen (const std::string& s)
{
    std::wstring_convert <std::codecvt_utf8 <wchar_t>, wchar_t> convert;
    return convert.from_bytes (s);
}

std::string detect_Unicode (const std::string& s)
{ 
    std::wstring ws = widen (s);
    if (ws.empty() || ws.find_first_not_of (L" \t\r\n\f\v\u00A0\u00C2\u00E2\u20AC\u2039") != std::wstring::npos)
        return " ";
    return s;
}

#include <iostream>

int main ()
{
    std::cout << narrow (L"\u00A0 \u00C2 \u00E2 \u20AC \u2039\n\n");
    std::cout << "0.\"" << detect_Unicode (u8"abcde") << "\"\n";
    std::cout << "1.\"" << detect_Unicode (u8" a€?    a€? ") << "\"\n";
    std::cout << "2.\"" << detect_Unicode (u8"are ? ? there is something ? ? ? a€? combination    a€?") << "\"\n";
    std::cout << "3.\"" << detect_Unicode (u8" ? ? ") << "\"\n";
    std::cout << "4.\"" << detect_Unicode (u8"a€?  ? ? a€?") << "\"\n";
    std::cout << "5.\"" << detect_Unicode (u8"? ? a a") << "\"\n";
}

Output:

  ? a € ?

0.  " "
1.  " a€?    a€? "
2.  " "
3.  " ? ? "
4.  "a€?  ? ? a€?"
5.  "? ? a a"

Now this is not the output the OP expects, but I think that's simply because the logic (as opposed to the implementation) of detect_Unicode() looks flawed. The point here is that converting the input string to a wide string means that you can use standard basic_string operations on it reliably, because there are no multibyte issues now.

An alternative, slightly radical, implementation of detect_Unicode() might be:

for (auto wide_char : ws)
{
    if (wide_char > 0xff)
        return " ";
}
return s;

But really, now you have a wide string to hand in detect_Unicode, anything is possible, so go wild OP.

Other notes:

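std::wstring_convert and std::codecvt_utf8 (the <codecvt> header) are deprecated as of C++17, but since the standard library offers no obvious replacement you might as well run with them. You can always change the implementations of narrow and widen if it comes to it.
Depending on platform, std::wstring might not be the best choice, but it's probably fine here. Where wchar_t is 16 bits wide (Windows, for example), std::codecvt_utf8<wchar_t> converts to UCS-2 and cannot represent characters outside the Basic Multilingual Plane. You could also look at std::u16string and std::u32string.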

Live demo.

Inspiration taken from here.
