What You Must Know About Python String Encoding

en_wan, 9 years ago
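<p>String encoding problems come up constantly in development: errors such as SyntaxError: Non-ASCII character, or 'ascii' codec can't encode characters in position 0-2: ordinal not in range(128), or output that shows up as garbled text.</p>
<p>Because I did not understand how encodings work, whenever this happened I could only keep trying decode and encode with one encoding after another.</p>
<p>Today I am writing down the causes of, and fixes for, the various encoding problems in Python, so that the next time I hit one I will not be running around like a headless fly.</p>
<p>Everything below was run on Python 2.7. I have heard that these problems are gone in 3.x because every string there is unicode; I will install 3.x and try it later.</p>
<h2><strong>I. What the encoding declaration does</strong></h2>
<p>1. If a Python source file contains Chinese, the encoding must be declared at the top of the file (on the first or second line), for example #encoding=utf-8 to use utf-8. What does this declaration do, and what does it change?</p>
<p>demo1.py</p>
<pre><code class="language-python"># encoding=utf-8
test = '测试test'
print type(test)
print repr(test)
</code></pre>
<p>Output:</p>
<pre><code class="language-python">&lt;type 'str'&gt;
'\xe6\xb5\x8b\xe8\xaf\x95test'
</code></pre>
<p>When we print a variable to the terminal, the IDE or the system usually converts the output for us, for example rendering the bytes of Chinese characters as Chinese characters, so we never see the variable's raw content.</p>
<p>The repr function shows the form of the variable that Python itself sees, that is, its raw content.</p>
<p>The output shows that test is of type str and that its bytes are utf-8 encoded (see Part III for how to tell), which matches the declared encoding.</p>
<p>If we change the declaration to gbk (and save the file as gbk):</p>
<p>demo2.py</p>
<pre><code class="language-python"># encoding=gbk
test = '测试test'
print type(test)
print repr(test)
</code></pre>
<p>Output:</p>
<pre><code class="language-python">&lt;type 'str'&gt;
'\xb2\xe2\xca\xd4test'
</code></pre>
<p>Now test is gbk encoded.</p>
<p>So the declaration determines the encoding of the str literals defined in this .py file.</p>
<p>What about a variable that is imported from another .py file, or read from a database, redis and so on?</p>
<p>a.py</p>
<pre><code class="language-python"># encoding=utf-8
test = '测试test'
</code></pre>
<p>b.py</p>
<pre><code class="language-python"># encoding=gbk
from a import test
print repr(test)
</code></pre>
<p>Output:</p>
<pre><code class="language-python">'\xe6\xb5\x8b\xe8\xaf\x95test'
</code></pre>
<p>a.py defines test and is declared utf-8; b.py is declared gbk and imports test from a. The result shows that test is still utf-8 encoded, that is, it follows a.py.</p>
<p>So the declaration only affects the file it appears in; it has no effect on variables that are imported or read from elsewhere.</p>
<h2><strong>II. Why the common error codec can't encode characters occurs</strong></h2>
<p>Python programs often fail with codec can't encode characters or codec can't decode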
characters</p>
<p>Define a string in Python:</p>
<pre><code class="language-python"># encoding=utf-8
import sys
print sys.getdefaultencoding()  # prints ascii
unicode_test = u'测试test'
print repr(str(unicode_test))
</code></pre>
<p>The code above fails with:</p>
<pre><code class="language-python">'ascii' codec can't encode characters in position 0-1: ordinal not in range(128)
</code></pre>
<p>Besides str(), operations that combine a unicode string with a str also fail when both contain Chinese, but succeed when only one of them does:</p>
<pre><code class="language-python"># encoding=utf-8
unicode_test = u'测试test%s{0}'

print '%stest' % unicode_test      # ok
print '%s测试' % unicode_test      # raises

print unicode_test % 'test'        # ok
print unicode_test % '测试'        # raises

print unicode_test.format('test')  # ok
print unicode_test.format('测试')  # raises

print unicode_test.split('test')   # ok
print unicode_test.split('测试')   # raises

print unicode_test + 'test'        # ok
print unicode_test + '测试'        # raises
</code></pre>
<p>Why does this happen?</p>
<p>The reason is explained further down; first, here is the fix for this error.</p>
<p>The fix is to set the system default encoding to utf-8:</p>
<pre><code class="language-python"># encoding=utf-8
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
print sys.getdefaultencoding()
unicode_test = u'测试test'
</code></pre>
<p>demo3.py</p>
<pre><code class="language-python"># encoding=utf-8
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
unicode_test = u'测试test'
utf8_test = '测试test'
gbk_test = unicode_test.encode('gbk')

# concatenate unicode and utf-8
merge = unicode_test + utf8_test
print type(merge)
print repr(merge)

# concatenate unicode and gbk (this line raises, see case 2)
merge = unicode_test + gbk_test
print type(merge)
print repr(merge)
print merge

# concatenate utf-8 and gbk
merge = utf8_test + gbk_test
print type(merge)
print repr(merge)
print merge
</code></pre>
<p>This defines three strings, unicode_test, utf8_test and gbk_test, holding unicode, utf-8 and gbk data respectively.</p>
<p>1. Concatenating unicode and utf-8 prints:</p>
<pre><code class="language-python">&lt;type 'unicode'&gt;
u'\u6d4b\u8bd5test\u6d4b\u8bd5test'
</code></pre>
<p>The result is of type unicode.</p>
<p>2. Concatenating unicode and gbk fails:</p>
<pre><code class="language-python">'utf8' codec can't decode byte 0xb2 in position 0: invalid start byte
</code></pre>
<p>So we can infer:</p>
<p>When Python operates on two strings, one of them unicode and the other not, it first decodes the non-unicode string to unicode and then performs the operation.</p>
<p>For example, concatenation could be written as the following function:</p>
<pre><code class="language-python">import sys

def merge_str(str1, str2):
    # decode whichever side is a byte string, using the default encoding
    if isinstance(str1, unicode) and not isinstance(str2, unicode):
        str2 = str2.decode(sys.getdefaultencoding())
    elif not isinstance(str1, unicode) and isinstance(str2, unicode):
        str1 = str1.decode(sys.getdefaultencoding())
    return str1 + str2
</code></pre>
<p>PS: the initial value of sys.getdefaultencoding() is ascii.</p>
<p>Therefore,</p>
<p>the error codec can't encode (decode) characters is raised by these implicit encode or decode calls, whose argument is sys.getdefaultencoding(). Decoding a string that contains Chinese with the ascii codec fails, which is why changing the system default encoding avoids the error, as long as the byte strings really are in that encoding.</p>
<p>When str() is called, Python runs unicode_test.encode(sys.getdefaultencoding()), so that fails as well.</p>
<p>3. Concatenating utf-8 and gbk does not fail: Python joins the two byte strings directly, with no decode or encode step, but part of the output is garbled.</p>
<p>demo4.py</p>
<pre><code class="language-python"># encoding=gbk
import sys

reload(sys)
sys.setdefaultencoding('utf-8')
unicode_test = u'测试test'
utf8_test = unicode_test.encode('utf-8')
gbk_test = unicode_test.encode('gbk')

merge = utf8_test + gbk_test
print type(merge)
print repr(merge)
print merge
</code></pre>
<p>Here the file is declared gbk and sys.getdefaultencoding() is set to utf-8. The result is:</p>
<pre><code class="language-python">&lt;type 'str'&gt;
'\xe6\xb5\x8b\xe8\xaf\x95test\xb2\xe2\xca\xd4test'
测试test����test
</code></pre>
<p>The gbk part is garbled. print writes the bytes of a str unchanged, and the terminal decodes them with its own encoding (utf-8 in this run), so neither the file's declared encoding nor sys.getdefaultencoding() makes the gbk bytes readable.</p>
<h2><strong>III. How to determine the encoding of a string</strong></h2>
<p>1. There is no way to determine the encoding of a string with certainty. For example, if "\aa" means character A in gbk and character B in utf-8, how would we decide which encoding a given "\aa" uses? It could be either gbk or utf-8.</p>
<p>2. What we can do is make a rough guess, because the situation above is rare. Far more often '\aa' means A in gbk but is garbage in utf-8, such as �, so we can conclude that '\aa' is gbk, since decoding it as utf-8 produces nothing meaningful.</p>
<p>3. In practice we mostly meet only three forms: utf-8, gbk and unicode.</p>
<ul>
<li>unicode usually starts with \u followed by four hex digits, e.g. \u6d4b\u8bd5; one \u group is one Chinese character</li>
<li>utf-8 byte strings usually show a
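\x prefix followed by two hex digits, e.g. \xe6\xb5\x8b\xe8\xaf\x95\xe5\x95\x8a; three \x groups are one Chinese character</li>
<li>gbk byte strings also show a \x prefix followed by two hex digits, e.g. \xb2\xe2\xca\xd4\xb0\xa1; two \x groups are one Chinese character</li>
</ul>
<p>4. Use the chardet module to guess</p>
<pre><code class="language-python"># encoding=utf-8
import chardet
raw = u'我是一只小小鸟'
print chardet.detect(raw.encode('utf-8'))
print chardet.detect(raw.encode('gbk'))
</code></pre>
<p>Output:</p>
<pre><code class="language-python">{'confidence': 0.99, 'encoding': 'utf-8'}
{'confidence': 0.99, 'encoding': 'GB2312'}
</code></pre>
<p>chardet estimates the probability that a string is in a given encoding; for basically 99% of use cases this module is good enough.</p>
<h2><strong>IV. string_escape and unicode_escape</strong></h2>
<h3>1. string_escape</h3>
<p>In a str literal, \x is an escape prefix: the two hex digits after it denote one byte, e.g. '\xe6'. In utf-8, three such bytes usually make up one Chinese character.</p>
<p>So a='\xe6\x88\x91' defines the Chinese character "我". Sometimes, however, we want a to hold the 3*4=12 ASCII characters themselves rather than a Chinese character; encode('string_escape') does that conversion:</p>
<pre><code class="language-python">'\xe6\x88\x91'.encode('string_escape') == '\\xe6\\x88\\x91'
</code></pre>
<p>decode does the reverse.</p>
<p>The type is str both before and after the conversion.</p>
<p>One more observation: a='\x' and a='\x0' both raise ValueError: invalid \x escape, while a='\a', where the backslash is not followed by x, is fine, and so is a='\x00', where x is followed by two hex digits.</p>
<h3>2. unicode_escape</h3>
<p>Likewise, in a unicode literal \u is an escape prefix: the four hex digits after it denote one character, e.g. b=u'\u6211' is "我". If we want b to hold the 6 ASCII characters instead of one Chinese character, we can use encode('unicode-escape'):</p>
<pre><code class="language-python">u'\u6211'.encode('unicode-escape') == '\u6211'
</code></pre>
<p>Note that the value is unicode before encode and str afterwards.</p>
<p>\u is an escape in unicode literals but not in str literals, which is why the str on the right needs only one backslash rather than two (repr shows it as '\\u6211').</p>
<p>decode does the reverse.</p>
<p>Likewise, a=u'\u' raises an error.</p>
<h3>3. 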
Examples</h3>
<pre><code class="language-python"># encoding=utf-8
# ordinary str and unicode characters
str_char = '我'
uni_char = u'我'
print repr(str_char)  # '\xe6\x88\x91'
print repr(uni_char)  # u'\u6211'

# decode('unicode-escape')
s1 = '\u6211'
s2 = s1.decode('unicode-escape')
print repr(s1)  # '\\u6211'
print repr(s2)  # u'\u6211'

# encode('unicode-escape')
s1 = u'\u6211'
s2 = s1.encode('unicode-escape')
print repr(s1)  # u'\u6211'
print repr(s2)  # '\\u6211'

# decode("string_escape")
s1 = '\\xe6\\x88\\x91'
s2 = s1.decode('string_escape')
print repr(s1)  # '\\xe6\\x88\\x91'
print repr(s2)  # '\xe6\x88\x91'

# encode("string_escape")
s1 = '\xe6\x88\x91'
s2 = s1.encode('string_escape')
print repr(s1)  # '\xe6\x88\x91'
print repr(s2)  # '\\xe6\\x88\\x91'
</code></pre>
<h3>4. Applications</h3>
<ol>
<li>The content is a unicode escape but the type is str: use decode("unicode_escape") to get a value whose content and type are both unicode <pre><code class="language-python">s1 = '\u6211'
s2 = s1.decode('unicode-escape')
</code></pre> </li>
<li>The content is str bytes but the type is unicode: use encode("unicode_escape").decode("string_escape") to get a value whose content and type are both str <pre><code class="language-python">s1 = u'\xe6\x88\x91'
s2 = s1.encode('unicode_escape').decode("string_escape")
</code></pre> </li>
</ol>
<p>Source: http://python.jobbole.com/87042/</p>
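<p>As a rough illustration of the "try to decode it" heuristic from Part III, here is a minimal sketch. It assumes Python 3, whereas the article's own code targets 2.7, and guess_encoding and CANDIDATES are hypothetical names that belong neither to chardet nor to the standard library. It returns the first candidate encoding that decodes the bytes without error, which is only a guess: a successful decode does not prove the encoding is right.</p>

```python
# Python 3 sketch (assumption: the article itself uses Python 2.7).
# CANDIDATES and guess_encoding are illustrative names only.
CANDIDATES = ("utf-8", "gbk")  # tried in this order


def guess_encoding(raw, candidates=CANDIDATES):
    """Return the first candidate that decodes raw without error, else None."""
    for name in candidates:
        try:
            raw.decode(name)
        except UnicodeDecodeError:
            continue  # these bytes are not valid in this encoding
        return name
    return None


text = "测试test"
utf8_bytes = text.encode("utf-8")
gbk_bytes = text.encode("gbk")

print(repr(utf8_bytes))            # b'\xe6\xb5\x8b\xe8\xaf\x95test'
print(repr(gbk_bytes))             # b'\xb2\xe2\xca\xd4test'
print(guess_encoding(utf8_bytes))  # utf-8
print(guess_encoding(gbk_bytes))   # gbk
```

<p>utf-8 is tried first on purpose: its rules are strict, so gbk bytes such as \xb2\xe2 are rejected, while many utf-8 byte sequences would also pass a gbk decode and come out as meaningless characters.</p>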