python pinyin module support simplified and traditional chinese codecs

git clone

Pinyin python module

Origin:      Author:cleverdeng (not reachable)

forked by Qingfeng Xia based on v0.9:  
               (1)renamed class name from PinYin to Pinyin, 
               (2)dict file ""  is  renamed  as "" 
               (3) add encoding support, or it will not work for windows cmd prompt! 
               (4) move load_word() (renamed as loaddict() ) into __init__(), to make API concise
               (5)  "ord(UNICODE)= list of pinyin", for quick loading and human readable check
                  why some unicode has multiple pinyin units??? 

               copy the two files:, , into your project folder, or under $PYTHONPATH , or your sitepackages

test:   testing code is under __main__ section
               python -m

suggested new features: 
                  (1)  traditional chinese support:    done!
                  (2)  repr()   print the tones, print in two lines,  first line using ASCII char as tone -- / \ V
                  (3)  other  pinyin romanization styles:  Yale, Wade-Giles, etc, using pinyin4j 's pinyindb
                  (4) consider python 3.x support: by replace print with print_function  for test()
                  (5) performance improvement, using better container than dict

see also: 
    (1) ruby    "Ruby module:  hanzi_to_pinyin" 
          java  pinyin4j:   supports  6 pinyin resprensation styles:  Yale, Wade-Giles, etc
    (2) ibus-pinyin: phrase 
    (3) oopinyinguide: openoffice 3.x extension
    (4) Unicode for  greater Chinese charset: CJKV:  4E00-9FFF,  
          this (0x3400-9F2D) and (0x20000-0x2B6F8)

from pinyin import Pinyin
def test_console(encoding='cp936'):
    import os
        print "POSIX os input encoding is utf-8, for windows try cp936/gbk for simplified"
        print "str(test.hanzi2pinyin(string)"
        s=raw_input("input hanzi string in windows console") # only for python 2.x
        print "pinyin for input hanzi are:"
        print test2.hanzi2pinyin(s)

if __name__ == "__main__":
    test = Pinyin()
    print "test with utf8 console encoding"
    string = "钓鱼岛是中国的"  #utf8  simplified chinese
    print "in: %s" % string
    print "out: %s" % str(test.hanzi2pinyin(string, showingtone=True))
    print "out: %s" % test.hanzi2pinyin_split(string, split="-")
    str2="釣魚島是台灣的也是中國的" #utf8 traditional chinese
    print "in: %s" % str2
    print "out: %s" % str(test.hanzi2pinyin(str2, showingtone=True))
This entry was posted in Download, Programming. Bookmark the permalink.