In my text generation dataset, I have converted all infrequent words into the <unk> token (unknown word), as suggested by most of the text-generation literature.
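For context, this is roughly the preprocessing I am doing. It is only a minimal sketch with a made-up frequency threshold (MIN_COUNT), not my exact code:

import collections

MIN_COUNT = 5  # hypothetical cutoff: words seen fewer times become <unk>

def build_vocab(tokenized_sentences):
    counts = collections.Counter(
        tok for sent in tokenized_sentences for tok in sent)
    # keep only words that occur at least MIN_COUNT times
    return {tok for tok, c in counts.items() if c >= MIN_COUNT}

def replace_rare(sentence, vocab):
    # map every out-of-vocabulary word to the <unk> token
    return [tok if tok in vocab else "<unk>" for tok in sentence]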
However, when training an RNN to take part of a sentence as input and predict the rest of the sentence, I am not sure how I should stop the network from generating <unk> tokens.
When the network encounters an unknown (infrequent) word in the training set, what should its target output be?
Example:
Sentence: I went to the mall and bought a <unk> and some groceries
Network input: I went to the mall and bought a
Current network output: <unk> and some groceries
Desired network output: ??? and some groceries
What should it be outputting instead of the <unk>?
I don't want to build a generator that outputs words it does not know.
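The only workaround I have thought of so far is to mask out the <unk> logit at sampling time so it can never be emitted (sketch below, with a hypothetical unk_id index). But that only hides the problem at inference; my actual question is what the training target should be when the ground-truth word is <unk>.

import numpy as np

def sample_next_token(logits, unk_id, temperature=1.0):
    # forbid <unk> by pushing its logit to -inf before softmax sampling
    logits = np.asarray(logits, dtype=np.float64) / temperature
    logits[unk_id] = -np.inf
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return int(np.random.choice(len(probs), p=probs))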