
0 votes
520 views
in Technique by (71.8m points)

tokenize - In HuggingFace tokenizers: how can I split a sequence simply on spaces?

I am using the DistilBertTokenizer from HuggingFace.

I would like to tokenize my text by simply splitting it on spaces:

["Don't", "you", "love", "??", "Transformers?", "We", "sure", "do."]

instead of the default behavior, which is like this:

["Do", "n't", "you", "love", "??", "Transformers", "?", "We", "sure", "do", "."]

I read their documentation about Tokenization in general as well as about BERT Tokenizer specifically, but could not find an answer to this simple question :(

I assume that it should be a parameter when loading the tokenizer, but I could not find it in the parameter list ...

EDIT: Minimal code example to reproduce:

from transformers import DistilBertTokenizer

# Load the pretrained tokenizer for the cased DistilBERT checkpoint
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-cased')

tokens = tokenizer.tokenize("Don't you love ?? Transformers? We sure do.")
print("Tokens: ", tokens)
Question from: https://stackoverflow.com/questions/66064503/in-huggingface-tokenizers-how-can-i-split-a-sequence-simply-on-spaces


1 Reply

0 votes
by (71.8m points)

That is not how it works. The transformers library provides different types of tokenizers. In the case of DistilBERT, it is a WordPiece tokenizer with a fixed vocabulary that was used to train the corresponding model, and it therefore does not offer such modifications (as far as I know). Something you can do instead is use the split() method of the Python string:

text = "Don't you love ?? Transformers? We sure do."
tokens = text.split()
print("Tokens: ", tokens)

Output:

Tokens:  ["Don't", 'you', 'love', '??', 'Transformers?', 'We', 'sure', 'do.']

In case you are looking for slightly more complex tokenization that also takes punctuation into account, you can use the basic_tokenizer:

from transformers import DistilBertTokenizer

tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-cased')

# basic_tokenizer splits on whitespace and punctuation but does not
# apply the WordPiece vocabulary.
text = "Don't you love ?? Transformers? We sure do."
tokens = tokenizer.basic_tokenizer.tokenize(text)
print("Tokens: ", tokens)

Output:

Tokens:  ['Don', "'", 't', 'you', 'love', '??', 'Transformers', '?', 'We', 'sure', 'do', '.']
