Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
456 views
in Technique[技术] by (71.8m points)

java - Split string into sentences

I have written this piece of code that splits a string and stores it in a string array:-

String[] sSentence = sResult.split("[a-z]\.\s+");

However, I've added the [a-z] because I wanted to deal with some of the abbreviation problem. But then my result shows up as so:-

Furthermore when Everett tried to instruct them in basic mathematics they proved unresponsiv

I see that I lose the pattern specified in the split function. It's okay for me to lose the period, but losing the last letter of the word disturbs its meaning.

Could someone help me with this, and in addition, could someone help me with dealing with abbreviations? For example, because I split the string based on periods, I do not want to lose the abbreviations.

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

Parsing sentences is far from being a trivial task, even for latin languages like English. A naive approach like the one you outline in your question will fail often enough that it will prove useless in practice.

A better approach is to use a BreakIterator configured with the right Locale.

BreakIterator iterator = BreakIterator.getSentenceInstance(Locale.US);
String source = "This is a test. This is a T.L.A. test. Now with a Dr. in it.";
iterator.setText(source);
int start = iterator.first();
for (int end = iterator.next();
    end != BreakIterator.DONE;
    start = end, end = iterator.next()) {
  System.out.println(source.substring(start,end));
}

Yields the following result:

  1. This is a test.
  2. This is a T.L.A. test.
  3. Now with a Dr. in it.

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...