Using split to count isn't the most efficient approach, but if you insist on doing that, the proper way is this:
haystack.split(needle, -1).length - 1
If you don't set limit to -1, split defaults to a limit of 0, which removes trailing empty strings and throws off your count.
From the API:
The limit parameter controls the number of times the pattern is applied and therefore affects the length of the resulting array. [...] If n is zero then [...] trailing empty strings will be discarded.
You also need to subtract 1 from the length of the array, because N occurrences of the delimiter split the string into N+1 parts.
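To make the effect of the limit concrete, here is a minimal sketch. The class name SplitCount and the helper countBySplit are made up for illustration; only String.split itself comes from the JDK.

```java
final class SplitCount {
    // Hypothetical helper: counts regex matches by splitting.
    // The -1 limit keeps trailing empty strings, so N matches give N+1 parts.
    static int countBySplit(String haystack, String needle) {
        return haystack.split(needle, -1).length - 1;
    }

    public static void main(String[] args) {
        String haystack = "a,b,,";

        // Default limit (0): trailing empty strings are dropped -> ["a", "b"]
        System.out.println(haystack.split(",").length - 1); // 1 (wrong)

        // Limit -1: trailing empty strings are kept -> ["a", "b", "", ""]
        System.out.println(countBySplit(haystack, ","));    // 3 (correct)
    }
}
```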
As for the regex itself (i.e. the needle), you can use the word boundary anchors (\b) around the word. If you allow word to contain metacharacters (e.g. count occurrences of "$US"), you may want to Pattern.quote it.
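A rough sketch of how the needle could be built, assuming the word is supplied as a plain string. The class NeedleBuilder and its methods are hypothetical names; Pattern.quote and String.split are the real JDK calls. Note that \b only makes sense next to a word character, so the "$US" case below is quoted without boundaries.

```java
import java.util.regex.Pattern;

final class NeedleBuilder {
    // Hypothetical helper: wraps a literal word in word-boundary anchors.
    static String wholeWord(String word) {
        return "\\b" + Pattern.quote(word) + "\\b";
    }

    // Same split-based count as above, with limit -1.
    static int countBySplit(String haystack, String regex) {
        return haystack.split(regex, -1).length - 1;
    }

    public static void main(String[] args) {
        // "the" inside "other" is not counted because of the boundaries.
        System.out.println(countBySplit("the cat and the other", wholeWord("the"))); // 2

        String prices = "costs $US5 or $US10";
        // Unquoted, '$' is an end-of-input anchor, so nothing matches.
        System.out.println(countBySplit(prices, "$US"));                // 0
        // Quoted, "$US" is treated as a literal.
        System.out.println(countBySplit(prices, Pattern.quote("$US"))); // 2
    }
}
```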
I've come up with this:
numThe += line.split("[^a-zA-Z][Tt]he[^a-zA-Z]", -1).length - 1;
Though still getting some strange numbers. I was able to get an accurate general count (without the regular expression), now my issue is with the regexp.
Now the issue is that you're not counting [Tt]he that appears as the first or last word, because the regex says that it has to be preceded/followed by some character, something that matches [^a-zA-Z] (that is, your match must be of length 5!). You're not allowing the case where there isn't a character at all!
You can try something like this instead:
"(^|[^a-zA-Z])[Tt]he([^a-zA-Z]|$)"
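This isn't the most concise solution, but it handles the first and last word. It still consumes the characters on either side, though, so two occurrences separated by a single character (e.g. "the the") are counted only once.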
Something like this (using negative lookarounds) also works:
"(?<![a-zA-Z])[Tt]he(?![a-zA-Z])"
This has the benefit of matching just [Tt]he, without any extra characters around it like your previous solution did. This is relevant in case you actually want to process the tokens returned by split, because the delimiter in this case isn't "stealing" anything from the tokens.
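The difference between the three patterns can be checked with a small sketch. The class PatternComparison, its constants and the sample lines are invented for illustration; the lookaround variant here uses a negative lookbehind and a negative lookahead on [a-zA-Z], so the word must not be preceded or followed by a letter.

```java
import java.util.Arrays;

final class PatternComparison {
    // Requires a non-letter on both sides: misses the first and last word.
    static final String NAIVE = "[^a-zA-Z][Tt]he[^a-zA-Z]";
    // Also accepts start/end of input, but still consumes the neighbours.
    static final String ANCHORED = "(^|[^a-zA-Z])[Tt]he([^a-zA-Z]|$)";
    // Zero-width checks: matches only the three letters of the word.
    static final String LOOKAROUND = "(?<![a-zA-Z])[Tt]he(?![a-zA-Z])";

    static int countBySplit(String line, String regex) {
        return line.split(regex, -1).length - 1;
    }

    public static void main(String[] args) {
        String line = "The cat and the dog saw the";
        System.out.println(countBySplit(line, NAIVE));      // 1
        System.out.println(countBySplit(line, ANCHORED));   // 3
        System.out.println(countBySplit(line, LOOKAROUND)); // 3

        // Adjacent occurrences share the space between them.
        System.out.println(countBySplit("the the", ANCHORED));   // 1
        System.out.println(countBySplit("the the", LOOKAROUND)); // 2

        // With lookarounds the tokens keep their surrounding spaces.
        System.out.println(Arrays.toString(line.split(LOOKAROUND, -1)));
    }
}
```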
Non-split
Though using split to count is rather convenient, it isn't the most efficient approach (e.g. it's doing all kinds of work to return those strings that you discard). Since, as you said, you're counting line by line, the pattern would also have to be recompiled and thrown away for every line.
A more efficient way would be to use the same regex you did before and do the usual Pattern.compile and while (matcher.find()) count++;
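A minimal sketch of that approach, assuming the lines are already available as strings. The class TheCounter, the method countThe and the sample input are hypothetical; the regex is the lookaround form with (?![a-zA-Z]) as the lookahead. The pattern is compiled once, and a single Matcher is reused via reset for each line.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TheCounter {
    // Compiled once and shared across all lines.
    private static final Pattern THE =
            Pattern.compile("(?<![a-zA-Z])[Tt]he(?![a-zA-Z])");

    static int countThe(Iterable<String> lines) {
        int count = 0;
        Matcher matcher = THE.matcher("");
        for (String line : lines) {
            matcher.reset(line); // reuse the matcher for the next line
            while (matcher.find()) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "The cat and the dog", "saw the other", "the the");
        System.out.println(countThe(lines)); // 5
    }
}
```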