Python 2: str and unicode

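In Python 2, anything enclosed in a matching pair of (single/double/triple) quotes is a string object. There are two kinds of string object: str and unicode.

For example: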

>>> a = 'hello'
>>> type(a)
<type 'str'>
>>> ua = u'hello'
>>> type(ua)
<type 'unicode'>

When the opening quote is prefixed with u, the literal is a unicode string.

In the example above,

a is an 8-bit string (str)

ua is a unicode string

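When a string object's type is str, it holds bytes in some specific encoding (ascii, big5, utf-8, utf-16, and so on). The decode method converts those bytes into a unicode object, which is the form Python uses internally to represent text ( <type 'unicode'> ).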

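When a string object's type is unicode (Python's internal representation of text), the encode method converts it into bytes in whichever encoding we want (for Chinese text, the common choices are utf-8 and big5).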
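
The decode direction described above can be wrapped in a small normalizing helper. This is only a sketch: to_unicode is a hypothetical name, not a built-in, and the utf-8 default is an assumption about where the bytes came from. It is written so that it runs on both Python 2 and Python 3 (in Python 2, bytes is simply another name for str).

```python
# -*- coding: utf-8 -*-

def to_unicode(value, encoding='utf-8'):
    """Return `value` as unicode text (hypothetical helper).

    Byte strings are decoded with `encoding`; values that are already
    unicode are returned unchanged.
    """
    if isinstance(value, bytes):      # Python 2: bytes is an alias of str
        return value.decode(encoding)
    return value

from_bytes = to_unicode(b'\xe4\xbd\xa0')   # UTF-8 bytes for u'\u4f60'
from_text = to_unicode(u'\u4f60')          # already unicode, passed through
```

Decoding once at the point where bytes enter the program, and working with unicode objects everywhere else, keeps the encoding in one known place.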

In the Python 2 interpreter (the byte values shown for a assume a terminal that uses UTF-8):

>>> a = '你'
>>> type(a)
<type 'str'>
>>> ua = u'你'
>>> type(ua)
<type 'unicode'>

>>> a
'\xe4\xbd\xa0'
>>> print a
你
>>> ua
u'\u4f60'
>>> print ua
你

>>> ua.encode("utf-8")
'\xe4\xbd\xa0'
>>> a.decode("utf-8")
u'\u4f60'
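
Because unicode sits between all the byte encodings, converting from one encoding to another is a decode followed by an encode. A minimal sketch, assuming the input really is in the source encoding named; transcode is a hypothetical helper, not a standard-library function, and the code runs on both Python 2 and Python 3.

```python
# -*- coding: utf-8 -*-

def transcode(data, src, dst):
    """Re-encode the byte string `data` from `src` to `dst` (hypothetical helper)."""
    text = data.decode(src)    # bytes in `src` -> unicode text
    return text.encode(dst)    # unicode text -> bytes in `dst`

utf8_bytes = u'你'.encode('utf-8')                       # '\xe4\xbd\xa0', 3 bytes
big5_bytes = transcode(utf8_bytes, 'utf-8', 'big5')      # same character, 2 bytes in big5
round_trip = transcode(big5_bytes, 'big5', 'utf-8')      # back to the original bytes
```

The same character takes a different number of bytes in each encoding, which is why a str is only meaningful together with the name of its encoding.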

Demo1.py

# -*- coding: utf-8 -*-
print '哈摟'

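When source code contains non-ASCII characters, remember to put an encoding declaration on the first or second line of the file:

# -*- coding: utf-8 -*-

Without it, Python 2 assumes the source is ASCII and stops with a SyntaxError before the program runs.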

Demo2.py

# -*- coding: utf-8 -*-
a = '哈摟'
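# In Python 2 this raises UnicodeDecodeError: calling encode() on a str first
# decodes the bytes implicitly as ASCII, and '哈摟' contains non-ASCII bytes.
a.encode("utf-8")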
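
In Python 2, calling encode on a str first decodes the bytes implicitly with the ascii codec, so Demo2.py fails on non-ASCII bytes. The sketch below reproduces that hidden step explicitly and then shows the working order: decode with the real encoding first, then encode. It assumes the source file is saved as UTF-8, and the variable names are illustrative only. Python 3 has no implicit step (bytes has no encode method), but the explicit version below runs on both Python 2 and Python 3.

```python
# -*- coding: utf-8 -*-

# The bytes that a UTF-8 source file stores for the literal '哈摟'.
raw = u'哈摟'.encode('utf-8')

# The step Python 2 performs implicitly inside str.encode():
try:
    raw.decode('ascii')
    implicit_step_failed = False
except UnicodeDecodeError:
    implicit_step_failed = True      # non-ASCII bytes cannot be decoded as ascii

# The working order: decode with the actual encoding, then encode.
text = raw.decode('utf-8')           # bytes -> unicode
encoded = text.encode('utf-8')       # unicode -> bytes
```

In Demo2.py itself, writing the literal as u'哈摟' gives a unicode object directly, so a.encode("utf-8") then works without the extra decode.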