Python pinyin module supporting Simplified and Traditional Chinese in multiple text encodings

Copyright (C) Qingfeng Xia 2011-2020, CC-BY-NC 4.0

git clone https://github.com/qingfengxia/pinyin.py.git

Pinyin python module

Origin:      Author:cleverdeng      E-mail:clverdeng@gmail.com (not reachable)

forked by Qingfeng Xia, based on v0.9:
               (1) renamed the class from PinYin to Pinyin
               (2) renamed the dict file "word.data" to "pinyin.data"
               (3) added encoding support; without it the module does not work in the Windows cmd prompt
               (4) moved load_word() (renamed loaddict()) into __init__() to keep the API concise
               (5) the dict file stores "ord(UNICODE) = list of pinyin" per entry, for quick loading and human-readable checking
                   open question: why do some Unicode code points have multiple pinyin units?
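
Item (5) describes the dict file as a mapping from a code point to a list of pinyin. As a rough illustration only, the sketch below parses lines in an assumed "HEXCODE=py1,py2" layout into a dict keyed by code point. The real pinyin.data layout may differ, and parse_dict_lines() and the sample lines are hypothetical, not part of the module. The sample also shows one plausible reason an entry can hold several pinyin: a character such as 中 has more than one reading.

```python
def parse_dict_lines(lines):
    """Build {code point: [pinyin, ...]} from lines like '4E2D=zhong1,zhong4'.

    The 'HEX=py1,py2' layout is an assumption made for illustration;
    check the real pinyin.data before reusing this.
    """
    table = {}
    for line in lines:
        line = line.strip()
        if not line or "=" not in line:
            continue  # skip blank or malformed lines
        code, _, readings = line.partition("=")
        # Keep every non-empty reading, in file order.
        table[int(code, 16)] = [r.strip() for r in readings.split(",") if r.strip()]
    return table


# Hypothetical sample data, not copied from pinyin.data.
sample = ["4E2D=zhong1,zhong4", "56FD=guo2", ""]
table = parse_dict_lines(sample)
print(table[ord(u"\u4e2d")])
```

Keying the dict by integer code point means a lookup is just table[ord(ch)] for each character of the input.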

installation: 
               copy the two files, pinyin.py and pinyin.data, into your project folder, a directory on $PYTHONPATH, or your site-packages

test:   the test code is in the __main__ section
               python pinyin.py


suggested new features: 
                  (1) Traditional Chinese support: done!
                  (2) repr(): print the tones on two lines, with the first line using ASCII characters as tone marks -- / \ V
                  (3) other pinyin romanization styles (Yale, Wade-Giles, etc.), using pinyin4j's pinyindb
                  (4) Python 3.x support: replace the print statement with print_function in test()
                  (5) performance improvement, using a better container than dict
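
Feature (2) is only a proposal, so the following is a minimal sketch of what a two-line tone display could look like, not the module's behaviour. It assumes each syllable carries a trailing tone digit (for example "zhong1") and assumes the mapping 1 -> "-", 2 -> "/", 3 -> "V", 4 -> "\"; render_two_lines() and TONE_MARKS are hypothetical names.

```python
# Assumed mapping from tone number to an ASCII mark; anything else gets a blank.
TONE_MARKS = {"1": "-", "2": "/", "3": "V", "4": "\\"}


def render_two_lines(syllables):
    """Return (tone_line, pinyin_line) with each mark aligned over its syllable."""
    marks, plain = [], []
    for syl in syllables:
        if syl and syl[-1].isdigit():
            base, tone = syl[:-1], syl[-1]
        else:
            base, tone = syl, ""  # no tone digit: leave the mark blank
        mark = TONE_MARKS.get(tone, " ")
        marks.append(mark.ljust(len(base)))  # pad so columns line up
        plain.append(base)
    return " ".join(marks).rstrip(), " ".join(plain)


tone_line, pinyin_line = render_two_lines(["zhong1", "guo2"])
print(tone_line)
print(pinyin_line)
```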

see also: 
    (1) Ruby module hanzi_to_pinyin
        Java pinyin4j: supports 6 pinyin representation styles (Yale, Wade-Giles, etc.)
    (2) ibus-pinyin: phrase 
    (3) oopinyinguide: OpenOffice 3.x extension
    (4) Unicode ranges for the greater Chinese charset: CJKV is 4E00-9FFF;
        this pinyin.data covers 0x3400-0x9F2D and 0x20000-0x2B6F8
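
The two ranges quoted in item (4) can be turned into a quick pre-check before a lookup. This is a sketch under the assumption that both bounds are inclusive; in_covered_range() and COVERED_RANGES are hypothetical names, and a code point inside a range is not guaranteed to have an entry in pinyin.data.

```python
# Ranges quoted above for pinyin.data, treated here as inclusive bounds.
COVERED_RANGES = [(0x3400, 0x9F2D), (0x20000, 0x2B6F8)]


def in_covered_range(ch):
    """Return True if the single character ch falls in one of the ranges above."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in COVERED_RANGES)


print(in_covered_range(u"\u4e2d"))  # U+4E2D, inside the first range
print(in_covered_range(u"A"))       # ASCII letter, outside both ranges
```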

example:
# -*- coding: utf-8 -*-
from pinyin import Pinyin

def test_console(encoding='cp936'):
    """Interactive test for the Windows console (Python 2.x only)."""
    import os
    if os.name == 'nt':
        print "POSIX consoles usually use utf-8; on Windows try cp936/gbk for Simplified Chinese"
        test2 = Pinyin(encoding=encoding)
        print "calling test2.hanzi2pinyin(s)"
        s = raw_input("input hanzi string in windows console: ")  # raw_input exists only in python 2.x
        print "pinyin for input hanzi are:"
        print test2.hanzi2pinyin(s)
if __name__ == "__main__":
    test = Pinyin()
    # Simplified Chinese input
    print "test with utf8 console encoding"
    string = "钓鱼岛是中国的"  # utf8 simplified chinese
    print "in: %s" % string
    print "out: %s" % str(test.hanzi2pinyin(string, showingtone=True))
    print "out: %s" % test.hanzi2pinyin_split(string, split="-")
    # Traditional Chinese input
    str2 = "釣魚島是台灣的也是中國的"  # utf8 traditional chinese
    print "in: %s" % str2
    print "out: %s" % str(test.hanzi2pinyin(str2, showingtone=True))
    # test_console()  # uncomment to try interactive input on Windows
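
The encoding argument in the example matters because the same hanzi arrive as different bytes depending on the console. The sketch below uses only the standard codecs to show this; it does not call the pinyin module, and how Pinyin uses its encoding parameter internally is an assumption not confirmed here.

```python
text = u"\u4e2d\u56fd"  # the two hanzi for "China", written as escapes

utf8_bytes = text.encode("utf-8")  # typical POSIX terminal encoding
gbk_bytes = text.encode("cp936")   # typical Simplified Chinese Windows console

# Same characters, different byte sequences and lengths.
print(len(utf8_bytes), len(gbk_bytes))

# Decoding with the matching codec recovers identical text.
assert utf8_bytes.decode("utf-8") == gbk_bytes.decode("cp936") == text
```

Decoding with the wrong codec either raises an error or yields the wrong characters, which is presumably why the Windows test passes cp936 explicitly.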


CC-BY-NC 4.0 licensed free for non-commercial usage
Author: Qingfeng XIA
copyright (C) 2011-2020
http://www.iesensor.com
please keep the original link in your reference.
http://www.iesensor.com/blog/2013/03/26/python-pinyin-module/