Unicode vs UTF-8 - 白红宇



Published: 2019-06-19


Environment

Python 2.7.3

IPython

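Background

Unicode and UTF-8 are two completely different things. Put simply, Unicode assigns every character a number (a code point), and each number maps to one entry in an enormous table. Take u'\u6211': the code point 6211 (hexadecimal) maps to the character 我.

UTF-8, UTF-16, and the other UTF-xx schemes are encoding standards: they define how Unicode code points are turned into bytes. We can encode a unicode object into a UTF-8 byte string, so the method for converting unicode to UTF-8 is encode('utf8'); decode('utf8') does the reverse and decodes a UTF-8 byte string back into unicode.

A few common Python examples make this clear: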
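To make the "number in a big table" idea concrete, here is a minimal sketch using only the standard library. It is written with escape sequences so that it should run unchanged on both Python 2.7 and Python 3 (an assumption of this example, not something the original post states); the variable names are illustrative.

```python
import unicodedata

ch = u'\u6211'                    # the character from the example above

code_point = ord(ch)              # the number Unicode assigns to it
name = unicodedata.name(ch)       # the entry's official name in the table

print(hex(code_point))            # 0x6211
print(name)                       # CJK UNIFIED IDEOGRAPH-6211
```

Nothing here involves bytes yet: a code point is just an integer, and how it is stored is the job of an encoding.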

In [1]: a = '我'

In [2]: a
Out[2]: '\xe6\x88\x91'  # already UTF-8; this probably depends on the terminal you use

In [3]: a.decode('utf8')  # decode to unicode
Out[3]: u'\u6211'
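The same idea can be sketched outside an interactive session. The snippet below encodes one code point with two different encodings and decodes it back; it uses b'...' and u'...' literals so it should behave the same on Python 2.7 and Python 3 (that portability is an assumption of this sketch).

```python
text = u'\u6211'                          # one Unicode code point

utf8_bytes = text.encode('utf-8')         # three bytes: e6 88 91
utf16_bytes = text.encode('utf-16-be')    # two bytes: 62 11

# Different encodings give different bytes for the same code point,
# but decoding with the matching codec always restores the same text.
roundtrip_utf8 = utf8_bytes.decode('utf-8')
roundtrip_utf16 = utf16_bytes.decode('utf-16-be')

print(roundtrip_utf8 == text and roundtrip_utf16 == text)  # True
```

This is why a byte string alone is ambiguous: you need to know which encoding produced it before you can turn it back into text.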

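Scenario

A few days ago I wrote a test case, mainly to check that our current Model reads from and writes to MongoDB correctly. Along the way I ran into a unicode pitfall. The test went wrong here: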

a = '我'
b = u'我'
assertEqual(a, b)  # False
c = 'c'
d = u'c'
assertEqual(c, d)  # True

Storing a in MongoDB works fine, but reading it back (which gives b here) and comparing it with the original a fails, with this warning:

UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode
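The ASCII case passes because Python 2 implicitly decodes the byte string with the ASCII codec before comparing, which works for 'c' but not for the non-ASCII bytes of '我'. One way around this is to decode explicitly before comparing. The helper below is a hypothetical sketch (to_text is not part of PyMongo or the original test), and it assumes the byte strings are UTF-8.

```python
def to_text(value, encoding='utf-8'):
    """Return value as a text (unicode) string, decoding bytes if needed."""
    if isinstance(value, bytes):      # on Python 2, bytes is the same type as str
        return value.decode(encoding)
    return value


stored = b'\xe6\x88\x91'   # what the test wrote: a UTF-8 byte string
loaded = u'\u6211'         # what came back from the database: unicode

# Compare only after both sides are text, so no implicit ASCII decoding happens.
same = to_text(stored) == to_text(loaded)
print(same)  # True
```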

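I had always read from and written to MongoDB directly and never noticed this difference. After looking through the MongoDB documentation, I found this passage: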

You probably noticed that the regular Python strings we stored earlier look different when retrieved from the server (e.g. u'Mike' instead of 'Mike'). A short explanation is in order.

MongoDB stores data in BSON format. BSON strings are UTF-8 encoded so PyMongo must ensure that any strings it stores contain only valid UTF-8 data. Regular strings (<type 'str'>) are validated and stored unaltered. Unicode strings (<type 'unicode'>) are encoded UTF-8 first. The reason our example string is represented in the Python shell as u'Mike' instead of 'Mike' is that PyMongo decodes each BSON string to a Python unicode string, not a regular str.

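So that explains it: everything that comes back out is unicode.

Afterword

In this test case I was comparing two large dictionaries containing Chinese, English, and assorted other text. To compare against the data read back from MongoDB, would I really have to encode every single field? Maddening. There is a neat shortcut, though: round-trip the data through json.dumps and json.loads.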

self.data = json.loads(json.dumps(self.data))  # everything comes back as unicode, nice
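The JSON round trip has side effects worth knowing about: tuples come back as lists, non-string dictionary keys become strings, and values that JSON cannot serialize (a datetime, for example) raise a TypeError. If that matters, a small recursive helper is an alternative. The function below is a hypothetical sketch, not something from the original test suite, and it assumes every byte string is UTF-8.

```python
def to_unicode(obj, encoding='utf-8'):
    """Recursively decode every byte string inside dicts, lists, and tuples."""
    if isinstance(obj, bytes):        # on Python 2, bytes is the same type as str
        return obj.decode(encoding)
    if isinstance(obj, dict):
        return dict(
            (to_unicode(key, encoding), to_unicode(value, encoding))
            for key, value in obj.items()
        )
    if isinstance(obj, (list, tuple)):
        # Rebuild the same container type so tuples stay tuples.
        return type(obj)(to_unicode(item, encoding) for item in obj)
    return obj                        # numbers, None, unicode, etc. pass through


data = {b'name': b'\xe6\x88\x91', b'tags': [b'c', u'\u6211'], b'count': 1}
normalized = to_unicode(data)
print(normalized[u'name'] == u'\u6211')  # True
```

In a test, normalizing the expected dictionary this way before calling assertEqual keeps both sides as unicode without touching non-string values.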

Reposted from: https://my.oschina.net/pengfeix/blog/149403
