On https://stanfordnlp.github.io/CoreNLP/truecase.html, truecase.bias is described as:
Biases to choose certain behaviors. You can use this to adjust the proclivities of the truecaser. The truecaser classes are: UPPER, LOWER, INIT_UPPER, and O (for mixed case words like McVey).
It has the default values of INIT_UPPER:-0.7,UPPER:-0.7,O:0, and I'd like to change these, but it looks like the only way to do that (using stanza) is by modifying self.client.start_cmd directly in the constructor. This is what I'm doing:
from stanza.server import CoreNLPClient

class TrueCaseAnnotator(object):
    def __init__(self, classpath=CLASSPATH, bias="INIT_UPPER:-1,UPPER:-1,O:0"):
        self.client = CoreNLPClient(
            annotators=["tokenize,ssplit,truecase"],
            classpath=classpath,
            output_format='json',
        )
        self.client.start_cmd.append("-bias")
        self.client.start_cmd.append(bias)
And this correctly logs the command with the adjusted bias:
2021-02-04 21:12:51 INFO: Starting server with command: java -Xmx5G -cp /app/artifacts/stanford-corenlp-4.2.0/* edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000 -timeout 60000 -threads 5 -maxCharLength 100000 -quiet False -serverProperties corenlp_server-462b8ca2f7024c78.props -annotators tokenize,ssplit,truecase -preload -outputFormat json -bias INIT_UPPER:-1,UPPER:-1,O:0
Assuming the range for these variables is -1 to 1 and an INIT_UPPER value of -1 means never capitalize the initial word of a sentence, I expect the first word to come out lowercase at all times. However, this is not what's happening.
First, can you confirm -1 to 1 is the correct range? (I've also tried other numbers like -100, 0.2 and 1, and nothing changes).
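To try several values in a repeatable way, a small helper can parse and rebuild the bias string. This is only a sketch: parse_bias and format_bias are hypothetical names of my own, and the CLASS:weight,CLASS:weight format is assumed from the default value quoted above.

```python
def parse_bias(text):
    """Parse a string like 'INIT_UPPER:-0.7,UPPER:-0.7,O:0' into a dict of floats."""
    weights = {}
    for item in text.split(","):
        label, _, value = item.strip().partition(":")
        weights[label] = float(value)
    return weights


def format_bias(weights):
    """Format a dict of class weights back into the comma-separated form."""
    return ",".join(f"{label}:{value:g}" for label, value in weights.items())


# Start from the documented defaults and vary only INIT_UPPER.
defaults = parse_bias("INIT_UPPER:-0.7,UPPER:-0.7,O:0")
for candidate in (-100, -1, 0.2, 1):
    trial = dict(defaults, INIT_UPPER=candidate)
    print(format_bias(trial))
```

Each printed line can then be passed as the bias argument in turn, so the only thing that changes between runs is the INIT_UPPER weight.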
I did notice that classBias (with emphasis on class) is always set to the default values (INIT_UPPER:-0.7,UPPER:-0.7,O:0) no matter what I pass for the bias parameter (see logs below). Is it possible classBias is overwriting the value of bias? Where is that getting set and what's the difference between the two anyway?
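One variation I have not confirmed: passing the setting under the documented name truecase.bias through the properties argument of CoreNLPClient, instead of appending -bias to the start command. The sketch below assumes the annotator reads truecase.bias rather than a bare bias property; build_truecase_properties and make_client are hypothetical helper names.

```python
def build_truecase_properties(bias):
    """Return server properties keyed by the documented name `truecase.bias`."""
    return {"truecase.bias": bias}


def make_client(classpath, bias="INIT_UPPER:-1,UPPER:-1,O:0"):
    """Create a client; requires stanza installed and CoreNLP on the classpath."""
    # Imported lazily so the helper above can be used without stanza installed.
    from stanza.server import CoreNLPClient

    return CoreNLPClient(
        annotators="tokenize,ssplit,truecase",
        properties=build_truecase_properties(bias),
        classpath=classpath,
        output_format="json",
    )


print(build_truecase_properties("INIT_UPPER:-1,UPPER:-1,O:0"))
```

If the assumption holds, I'd expect the server log to list truecase.bias among its properties and classBias to report the value passed in rather than the default.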
2021-02-04 21:29:55 INFO: Starting server with command: java -Xmx5G -cp /app/artifacts/stanford-corenlp-4.2.0/* edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000 -timeout 60000 -threads 5 -maxCharLength 100000 -quiet False -serverProperties corenlp_server-ceededd66e0947d7.props -annotators tokenize,ssplit,truecase -preload -outputFormat json -bias INIT_UPPER:-1,UPPER:-1,O:0
[main] INFO CoreNLP - --- StanfordCoreNLPServer#main() called ---
[main] INFO CoreNLP - Server default properties:
(Note: unspecified annotator properties are English defaults)
annotators = tokenize,ssplit,truecase
bias = INIT_UPPER:1,UPPER:-100,O:0
inputFormat = text
outputFormat = json
prettyPrint = false
threads = 5
[main] INFO CoreNLP - Threads: 5
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator tokenize
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator ssplit
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator truecase
[main] INFO edu.stanford.nlp.sequences.SeqClassifierFlags - classBias=INIT_UPPER:-0.7,UPPER:-0.7,O:0
[main] INFO edu.stanford.nlp.sequences.SeqClassifierFlags - loadClassifier=edu/stanford/nlp/models/truecase/truecasing.fast.caseless.qn.ser.gz
[main] INFO edu.stanford.nlp.sequences.SeqClassifierFlags - mixedCaseMapFile=edu/stanford/nlp/models/truecase/MixDisambiguation.list
[main] INFO edu.stanford.nlp.ie.AbstractSequenceClassifier - Loading classifier from edu/stanford/nlp/models/truecase/truecasing.fast.caseless.qn.ser.gz ... done [5.9 sec].
[main] INFO CoreNLP - Starting server...
[main] INFO CoreNLP - StanfordCoreNLPServer listening at /0.0.0.0:9000
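To make the mismatch easier to spot across runs, the two values can be pulled out of the log and compared. This is a minimal sketch: extract_bias_settings is a hypothetical helper, and the two sample lines are copied from the log above.

```python
import re

# Two lines copied verbatim from the server log above.
LOG = """\
bias = INIT_UPPER:1,UPPER:-100,O:0
[main] INFO edu.stanford.nlp.sequences.SeqClassifierFlags - classBias=INIT_UPPER:-0.7,UPPER:-0.7,O:0
"""


def extract_bias_settings(log_text):
    """Return (server 'bias' property, classifier 'classBias' flag), or None if absent."""
    requested = re.search(r"^bias\s*=\s*(\S+)", log_text, flags=re.MULTILINE)
    reported = re.search(r"classBias=(\S+)", log_text)
    return (
        requested.group(1) if requested else None,
        reported.group(1) if reported else None,
    )


requested, reported = extract_bias_settings(LOG)
print("bias property:", requested)
print("classBias flag:", reported)
print("match:", requested == reported)
```

For the log above, the two values differ, which is the behaviour I'm asking about.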
question from:
https://stackoverflow.com/questions/66054176/corenlp-truecaseannotators-truecase-bias-not-having-any-effect