Convert UTF-8 multibyte characters into multiple ASCII characters

- Updated: 2016-02-01
Question:

Can someone help me convert some (potentially bogus) UTF-8 multibyte characters into ASCII as follows?

"\u6162" => ["\x61", "\x62"] => ["a", "b"] => "ab"

My use case is just for fun. I know I'm not compressing anything by representing two ASCII characters in one multibyte character.

I've played around with various versions of unpack but it never seems to work correctly:

"\u6162".unpack('H*')
# => ["e685a2"]

force_encoding seems to return the same bytes:

"\u6162".force_encoding('US-ASCII')
# => "\xE6\x85\xA2"
Answer:

"\u6162" is not equivalent to "\x61" + "\x62". \u denotes a Unicode code point, and the bytes that represent a code point depend on the string's encoding; they are not simply the hex digits of the code point. Unicode code point U+6162 is 慢.

Because it is a string, and because Ruby uses UTF-8 by default, when you unpack it you get the UTF-8 value of U+6162 which is three bytes: E6 85 A2.

"\u6162".encoding
# => #<Encoding:UTF-8>
"\u6162".unpack("A*")
# => ["\xE6\x85\xA2"]
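As a rough illustration of the difference between the code point and its UTF-8 bytes, the sketch below inspects both. The variable names are made up for this example; `codepoints`, `bytes`, and `force_encoding` are standard String methods.

```ruby
s = "\u6162"

# The code point is the abstract number U+6162.
code_point = s.codepoints.first
code_point_hex = code_point.to_s(16)

# Encoded as UTF-8, that single code point occupies three bytes.
utf8_bytes = s.bytes.map { |b| format("%02X", b) }

# force_encoding only relabels the existing bytes; it does not transcode them.
relabeled_bytes = s.dup.force_encoding("US-ASCII").bytes

puts code_point_hex        # prints 6162
puts utf8_bytes.join(" ")  # prints E6 85 A2
```

This is why both unpack and force_encoding in the question show E6 85 A2: neither one changes how the character is encoded.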

To get what you want, you need its UTF-16 representation, whose bytes are 61 62. But if you just encode as UTF-16, Ruby prepends a byte order mark (BOM), giving FE FF 61 62. So use UTF-16BE (big endian) to avoid this.

"\u6162".encode("UTF-16BE").unpack("A*")
# => ["ab"]
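To make the BOM difference concrete, here is a minimal sketch comparing the two encodings, plus a hypothetical helper (`split_into_ascii` is a name invented for this example) that applies the same idea to a whole string. It assumes every code point is in the Basic Multilingual Plane and that both of its bytes are in the ASCII range; code points above U+FFFF would become surrogate pairs and not produce the intended result.

```ruby
s = "\u6162"

# Plain UTF-16 prepends a big-endian byte order mark (FE FF).
with_bom = s.encode("UTF-16").bytes

# UTF-16BE fixes the byte order, so no BOM is written.
without_bom = s.encode("UTF-16BE").bytes

# Hypothetical helper: reinterpret each UTF-16BE byte as an ASCII character.
# Assumes BMP code points whose high and low bytes are both below 0x80.
def split_into_ascii(str)
  str.encode("UTF-16BE").force_encoding("US-ASCII")
end

puts split_into_ascii("\u6162\u6364")  # prints abcd
```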
Answer:

"\u6162".codepoints.first.divmod(16 ** 2).map(&:chr).join
# => "ab"
Comment: Good solution, but downvoted for lack of explanatory text. Also, prefer pack('c*') to map(&:chr).join.
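Since the one-liner above comes without explanation, here is a step-by-step sketch of what it appears to do, along with the pack('c*') variant the comment suggests. The variable names are illustrative only. The idea is that dividing the code point by 16 ** 2 (256) splits it into its high byte (quotient) and low byte (remainder).

```ruby
code_point = "\u6162".codepoints.first   # the integer 0x6162

# divmod(256) returns [quotient, remainder], i.e. [high byte, low byte].
high, low = code_point.divmod(16 ** 2)

# Original approach: turn each byte into a one-character string and join.
via_chr = [high, low].map(&:chr).join

# Variant from the comment: pack the integers directly into a byte string.
via_pack = [high, low].pack("c*")

puts via_chr   # prints ab
puts via_pack  # prints ab
```

Like the UTF-16BE approach, this only splits a code point into two bytes, so it assumes code points no larger than U+FFFF.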