Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


Classic ASP - How to convert a UTF-8 string to UCS-2?

I have a problem where I am storing a UTF-8 string in SQL Server as UCS-2. When I pull it out to display on a page with the content type set to UTF-8, it works fine. But I have a third-party JavaScript component which, when I pass it the string from the database, renders it as UCS-2, or at any rate not as UTF-8.

Is there a way in ASP to convert this string to UTF-8 after reading it from the database, so that I can pass it to the third-party component (which is obfuscated)?

Hope this makes sense.


1 Reply


My suspicion is you are falling foul of the classic form post character encoding mismatch problem.

It goes like this:-

  • You have a form which is presented to the client using the UTF-8 encoding.
  • As a result the browser posts text values entered into the form using UTF-8 encoding.
  • The action page receiving the post has its Response.CodePage set to a typical Windows (ANSI) codepage such as 1252.
  • Each byte of the posted UTF-8 string is treated by the server as an individual character, rather than each set of UTF-8 encoded bytes being decoded to the correct Unicode character.
  • The string is stored in the DB with the now corrupted characters.
  • A page wishes to present to the client the content of a DB field containing the corrupted characters.
  • The page sets its CharSet to UTF-8 but its Response.CodePage remains at the Windows codepage such as 1252.
  • Response.Write is used to send the field content to the client; the Unicode characters are transformed back into the same byte-for-byte sequence that was received in the earlier post.
  • The client thinks it's getting UTF-8, so it decodes the bytes received from the server as UTF-8, just as they were originally encoded, and they appear on screen correctly.
  • Everything proceeds as if all is OK while these characters are simply being bounced back and forth through ASP. A bug in one page has a matching bug in the other (which could be the same page), and that makes everything look fine.
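
The round trip in the list above can be traced at the byte level. What follows is a minimal sketch in Python rather than ASP, meant only to illustrate the mechanism; the function names and the sample string are made up for the example, and Python's "cp1252" codec stands in for what ASP does under codepage 1252.

```python
def store_with_wrong_codepage(text):
    """Browser posts UTF-8 bytes; server decodes each byte as a 1252 character."""
    posted_bytes = text.encode("utf-8")
    return posted_bytes.decode("cp1252")


def echo_with_wrong_codepage(stored):
    """Server writes the stored characters as 1252 bytes; browser decodes as UTF-8."""
    sent_bytes = stored.encode("cp1252")
    return sent_bytes.decode("utf-8")


original = "café €5"
stored = store_with_wrong_codepage(original)
displayed = echo_with_wrong_codepage(stored)

print(stored)     # cafÃ© â‚¬5  (what actually sits in the DB field)
print(displayed)  # café €5    (the second bug undoes the first)
```

The stored value is 10 characters long where the original has 7, which is why the corruption shows up as soon as anything other than the matching buggy page reads the field.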

If you examine the field contents directly with SQL Server tools you will likely see the corrupted strings there. The bug only surfaces now because you want to use the string with another component, which expects a straightforward Unicode string.

The solution is to ensure that every page not only sends Response.CharSet = "UTF-8" in the response but also sets Response.CodePage = 65001 before using Response.Write and before attempting to read any Request.Form values. Alternatively, set the codepage with the CodePage attribute of the <%@ directive in the page header.

Now you are left with repairing the corrupt strings already in your DB.

Use an ADODB.Stream:-

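Function ConvertFromUTF8(sIn)

    ' Re-encode the corrupted string as Windows-1252 bytes, then
    ' decode those same bytes as UTF-8 to recover the intended text.
    Dim oIn: Set oIn = CreateObject("ADODB.Stream")

    oIn.Open
    oIn.CharSet = "Windows-1252"   ' one byte per corrupted character
    oIn.WriteText sIn
    oIn.Position = 0               ' CharSet can only be changed at position 0
    oIn.CharSet = "UTF-8"          ' reinterpret the bytes as UTF-8
    ConvertFromUTF8 = oIn.ReadText
    oIn.Close

End Function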

This function (which BTW is the answer to your actual question) takes a corrupted string (one that holds the byte-for-byte representation) and converts it to the string it should have been. You need to apply this transform to every field in the DB that has fallen victim to the bug.
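
When applying such a transform across a whole table, it may help to leave alone any value that was never corrupted. The sketch below is Python rather than ASP and is only an illustration of that guard: repair_mojibake is a hypothetical name, and the rule "if the re-decode fails, keep the original" is an assumption of this example, not part of the ADODB.Stream function.

```python
def repair_mojibake(text):
    """Undo UTF-8 bytes that were read as Windows-1252 characters.

    Returns the text unchanged when it cannot be such a corruption,
    i.e. when it has characters outside 1252 or its bytes are not valid UTF-8.
    """
    try:
        return text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


print(repair_mojibake("cafÃ©"))  # café  (corrupted value is repaired)
print(repair_mojibake("café"))   # café  (already correct, left as-is)
```

This is a heuristic: a correctly stored value that happens to look like a corrupted one would still be rewritten, and Python's codec rejects a handful of byte values that Windows maps to control characters, so spot-check the results before updating the DB.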

