Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
328 views
in Technique[技术] by (71.8m points)

python - How to tokenize a Malayalam word?

???????????????  

itu oru stalam anu

This is a Unicode string meaning this is a place

import nltk
nltk.wordpunct_tokenize('??????????????? '.decode('utf8'))

is not working for me .

nltk.word_tokenize('??????????????? '.decode('utf8'))

is also not working other examples

"???????? "  = ????? +????,
"???????"  = ???? + ???

Right Split :

???  ??? ?????? ??? 

output:

[u'u0d07u0d24u0d4du0d12u0d30u0d41u0d38u0d4du0d25u0d32u0d02u0d06u0d23u0d4d']

I just need to split the words as shown in the other example. Other example section is for testing.The problem is not with Unicode. It is with morphology of language. for this purpose you need to use a morphological analyzer
Have a look at this paper. http://link.springer.com/chapter/10.1007%2F978-3-642-27872-3_38

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

After a crash course of the language from wikipedia (http://en.wikipedia.org/wiki/Malayalam), there are some issues in your question and the tools you've requested for your desired output.

Conflated Task

Firstly, the OP conflated the task of morphological analysis, segmentation and tokenization. Often there is a fine distinction especially for aggluntinative languages such as Turkish/Malayalam (see http://en.wikipedia.org/wiki/Agglutinative_language).

Agglutinative NLP and best practices

Next, I don't think tokenizer is appropriate for Malayalam, an agglutinative language. One of the most studied aggluntinative language in NLP, Turkish have adopted a different strategy when it comes to "tokenization", they found that a full blown morphological analyzer is necessary (see http://www.denizyuret.com/2006/11/turkish-resources.html, www.andrew.cmu.edu/user/ko/downloads/lrec.pdf?).

Word Boundaries

Tokenization is defined as the identification of linguistically meaningful units (LMU) from the surface text (see Why do I need a tokenizer for each language?) And different language would require a different tokenizer to identify the word boundary of different languages. Different people have approach the problem for finding word boundary different but in summary in NLP people have subscribed to the following:

  1. Agglutinative Languages requires a full blown morphological analyzer trained with some sort of language models. There is often only a single tier when identifying what is token and that is at the morphemic level hence the NLP community had developed different language models for their respective morphological analysis tools.

  2. Polysynthetic Languages with specified word boundary has the choice of a two tier tokenization where the system can first identify an isolated word and then if necessary morphological analysis should be done to obtain a finer grain tokens. A coarse grain tokenizer can split a string using certain delimiter (e.g. NLTK's word_tokenize or punct_tokenize which uses whitespaces/punctuation for English). Then for finer grain analysis at morphemic level, people would usually use some finite state machines to split words up into morpheme (e.g. in German http://canoo.net/services/WordformationRules/Derivation/To-N/N-To-N/Pre+Suffig.html)

  3. Polysynthetic Langauges without specified word boundary often requires a segmenter first to add whitespaces between the tokens because the orthography doesn't differentiate word boundaries (e.g. in Chinese https://code.google.com/p/mini-segmenter/). Then from the delimited tokens, if necessary, morphemic analysis can be done to produce finer grain tokens (e.g. http://mecab.googlecode.com/svn/trunk/mecab/doc/index.html). Often this finer grain tokens are tied with POS tags.

The answer in brief to OP's request/question, the OP had used the wrong tools for the task:

  • To output tokens for Malayalam, a morphological analyzer is necessary, simple coarse grain tokenizer in NLTK would not work.
  • NLTK's tokenizer is meant to tokenize polysynthetic Languages with specified word boundary (e.g. English/European languages) so it is not that the tokenizer is not working for Malayalam, it just wasn't meant to tokenize aggluntinative languages.
  • To achieve the output, a full blown morphological analyzer needs to be built for the language and someone had built it (aclweb.org/anthology//O/O12/O12-1028.pdf?), the OP should contact the author of the paper if he/she is interested in the tool.
  • Short of building a morphological analyzer with a language model, I encourage the OP to first spot for common delimiters that splits words into morphemes in the language and then perform the simple re.split() to achieve a baseline tokenizer.

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...