Overall, it is not very complex. First, split the task into three parts: building the phonetic dictionary, the language model, and the acoustic model. Start with the phonetic dictionary.
You need to write a Python script that maps each word written in Unicode to its transliteration, i.e. its sequence of phones:
?? r a tt a
????? e k a ng a yi
???? ??? a v a s a r a d i m a
Basically, for every character you write the corresponding transliteration. That is all you need to do; later you can feed the list of words into your script and get a dictionary in CMUSphinx format. This part is covered in the tutorial:
http://cmusphinx.sourceforge.net/wiki/tutorialdict
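A minimal sketch of such a script, assuming Sinhala input. The character table below is a tiny hypothetical placeholder (not a verified phone set), and the rules are only the basic abugida ones: a consonant carries an inherent "a", a dependent vowel sign replaces it, and the al-lakuna (virama) removes it. Conjuncts written with ZWJ, anusvara and most letters are not handled; any unmapped character raises an error so gaps in the table show up early. The function names transliterate and build_dictionary are illustrative, not part of CMUSphinx.

```python
import sys

# Hypothetical, deliberately incomplete phone table for a few Sinhala letters.
# Keys are written as escapes so this file's own encoding does not matter.
CONSONANTS = {
    "\u0D9A": "k",   # KA
    "\u0D9C": "g",   # GA
    "\u0DA7": "tt",  # TTA (retroflex t)
    "\u0DAF": "d",   # DA
    "\u0DB8": "m",   # MA
    "\u0DBA": "y",   # YA
    "\u0DBB": "r",   # RA
    "\u0DC0": "v",   # VA
    "\u0DC3": "s",   # SA
}
INDEPENDENT_VOWELS = {"\u0D85": "a", "\u0D89": "i", "\u0D91": "e"}  # A, I, E
VOWEL_SIGNS = {"\u0DCF": "aa", "\u0DD2": "i", "\u0DD4": "u"}        # AA, I, U signs
VIRAMA = "\u0DCA"      # al-lakuna: cancels the inherent vowel
INHERENT_VOWEL = "a"


def transliterate(word):
    """Map one Unicode word to a list of phones."""
    phones = []
    can_replace = False  # True while phones[-1] is an inherent vowel
    for ch in word:
        if ch in CONSONANTS:
            phones += [CONSONANTS[ch], INHERENT_VOWEL]
            can_replace = True
        elif ch in VOWEL_SIGNS and can_replace:
            phones[-1] = VOWEL_SIGNS[ch]   # vowel sign replaces inherent /a/
            can_replace = False
        elif ch == VIRAMA and can_replace:
            phones.pop()                   # bare consonant, no vowel
            can_replace = False
        elif ch in INDEPENDENT_VOWELS:
            phones.append(INDEPENDENT_VOWELS[ch])
            can_replace = False
        else:
            raise ValueError(f"no mapping for {ch!r} (U+{ord(ch):04X}) in {word!r}")
    return phones


def build_dictionary(words):
    """Return dictionary lines: the word, then its phones separated by spaces."""
    return [f"{w} {' '.join(transliterate(w))}" for w in sorted(set(words))]


if __name__ == "__main__" and len(sys.argv) == 3:
    # Usage: python make_dict.py words.txt my.dic  (UTF-8, one word per line)
    with open(sys.argv[1], encoding="utf-8") as f:
        words = [line.strip() for line in f if line.strip()]
    with open(sys.argv[2], "w", encoding="utf-8") as f:
        f.write("\n".join(build_dictionary(words)) + "\n")
```

For example, the word made of RA followed by TTA comes out as "r a tt a", matching the first line of the example above. Keeping the table as plain data means extending it to the full alphabet does not require touching the rules.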
Once you have a transliteration tool, you can proceed with the language model. You need a lot of text to build one; you can download texts from Wikipedia or from a local newspaper. Then use any language model toolkit to create a model in ARPA format. SRILM, MITLM and IRSTLM all support Unicode, so any of them will do. This part is covered in the tutorial:
http://cmusphinx.sourceforge.net/wiki/tutoriallm
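Before running a toolkit, the downloaded text usually needs cleaning into one normalized sentence per line. The sketch below is an assumption about what "clean" means here: it applies NFC normalization, splits sentences on ".", "!" and "?" only, and replaces punctuation and symbols with spaces. It filters by Unicode category rather than using the regex \w, because in Python \w does not match combining vowel signs, which would corrupt Sinhala words. Numbers are not expanded to words. The vocabulary it returns is the word list you can feed to the dictionary script.

```python
import re
import unicodedata
from collections import Counter

# Heuristic sentence boundary; add language-specific punctuation as needed.
SENTENCE_BREAK = re.compile(r"[.!?]+")


def normalize_corpus(raw_text):
    """Split raw text into normalized sentences (words joined by single spaces)."""
    text = unicodedata.normalize("NFC", raw_text)
    sentences = []
    for chunk in SENTENCE_BREAK.split(text):
        # Replace punctuation (P*) and symbols (S*) with spaces; keep letters,
        # combining marks (Indic vowel signs), digits and format chars like ZWJ.
        cleaned = "".join(
            " " if unicodedata.category(c)[0] in "PS" else c for c in chunk
        )
        words = cleaned.lower().split()
        if words:
            sentences.append(" ".join(words))
    return sentences


def vocabulary(sentences, min_count=1):
    """Sorted list of words seen at least min_count times."""
    counts = Counter(w for s in sentences for w in s.split())
    return sorted(w for w, n in counts.items() if n >= min_count)


def write_corpus(raw_text, path):
    """Write one sentence per line (UTF-8) and return the sentences."""
    sentences = normalize_corpus(raw_text)
    with open(path, "w", encoding="utf-8") as f:
        f.write("\n".join(sentences) + "\n")
    return sentences
```

With SRILM, for instance, the resulting file could then be turned into a trigram ARPA model with ngram-count -text corpus.txt -order 3 -lm model.arpa; the file names here are placeholders.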
The third step is to create an acoustic model. You need to record audio or segment existing recordings, and then start training. This part is also covered in the tutorial:
http://cmusphinx.sourceforge.net/wiki/tutorialam
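Once the recordings are segmented, training needs a list of audio file ids and a matching transcription file. The sketch below assumes the layout from the acoustic model tutorial: a fileids line is the audio path relative to the wav folder without extension (e.g. speaker_1/file_1), and a transcription line looks like "<s> hello world </s> (file_1)". Please check both formats against the tutorial for your SphinxTrain version. The db_name and etc_dir arguments and the function names are hypothetical. It also rejects utterances containing words that are missing from the phonetic dictionary, since training cannot use them.

```python
def training_lists(utterances, dictionary_words):
    """Build fileids and transcription lines from (fileid, text) pairs."""
    fileids, transcription = [], []
    for fileid, text in utterances:
        words = text.split()
        missing = [w for w in words if w not in dictionary_words]
        if missing:
            raise ValueError(f"{fileid}: words missing from dictionary: {missing}")
        fileids.append(fileid)
        # The transcription line ends with the file's base name in parentheses.
        name = fileid.rsplit("/", 1)[-1]
        transcription.append(f"<s> {' '.join(words)} </s> ({name})")
    return fileids, transcription


def write_lists(db_name, utterances, dictionary_words, etc_dir="etc"):
    """Write <db_name>_train.fileids and <db_name>_train.transcription (UTF-8)."""
    fileids, transcription = training_lists(utterances, dictionary_words)
    with open(f"{etc_dir}/{db_name}_train.fileids", "w", encoding="utf-8") as f:
        f.write("\n".join(fileids) + "\n")
    with open(f"{etc_dir}/{db_name}_train.transcription", "w", encoding="utf-8") as f:
        f.write("\n".join(transcription) + "\n")
```

The same functions can produce the test lists by passing the held-out utterances and a "_test" style name; keeping train and test speakers separate gives a more honest accuracy estimate.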