Let's go through this step by step. There are several sub-tasks you should take care of:
- Identify all substrings of length 4 or more.
- Count the occurrence of these substrings.
- Filter all substrings with 2 occurrences or more.
You can actually put all of them into a few statements. For understanding, it is easier to go through them one at a time.
The following examples all use
mystring = "abcdthisisatextwithsampletextforasampleabcd"
min_length = 4
1. Substrings of a given length
You can easily get substrings by slicing - for example, mystring[4:4+6]
gives you the substring from position 4 of length 6: 'thisis'
. More generically, you want substrings of the form mystring[start:start+length]
.
So what values do you need for start
and length
?
start
must...
- cover all substrings, so it must include the first character:
start in range(0, ...)
.
- not map to short substrings, so it can stop
min_length
characters before the end: start in range(..., len(mystring) - min_length + 1)
.
length
must...
- cover the shortest substring of length 4:
length in range(min_length, ...)
.
- not exceed the remaining string after
i
: length in range(..., len(mystring) - i + 1))
The +1
terms come from converting lengths (>=1) to indices (>=0).
You can put this all together into a single comprehension:
substrings = [
mystring[i:i+j]
for i in range(0, len(mystring) - min_length + 1)
for j in range(min_length, len(mystring) - i + 1)
]
2. Count substrings
Trivially, you want to keep a count for each substring. Keeping anything for each specific object is what dict
s are made for. So you should use substrings as keys and counts as values in a dict
. In essence, this corresponds to this:
counts = {}
for substring in substrings:
try: # increase count for existing keys, set for new keys
counts[substring] += 1
except KeyError:
counts[substring] = 1
You can simply feed your substrings
to collections.Counter
, and it produces something like the above.
>>> counts = collections.Counter(substrings)
>>> print(counts)
Counter({'abcd': 2, 'abcdt': 1, 'abcdth': 1, 'abcdthi': 1, 'abcdthis': 1, ...})
Notice how the duplicate 'abcd'
maps to the count of 2.
3. Filtering duplicate substrings
So now you have your substrings and the count for each. You need to remove the non-duplicate substrings - those with a count of 1.
Python offers several constructs for filtering, depending on the output you want. These work also if counts
is a regular dict
:
>>> list(filter(lambda key: counts[key] > 1, counts))
['abcd', 'text', 'samp', 'sampl', 'sample', 'ampl', 'ample', 'mple']
>>> {key: value for key, value in counts.items() if value > 1}
{'abcd': 2, 'ampl': 2, 'ample': 2, 'mple': 2, 'samp': 2, 'sampl': 2, 'sample': 2, 'text': 2}
Using Python primitives
Python ships with primitives that allow you to do this more efficiently.
Use a generator to build substrings. A generator builds its member on the fly, so you never actually have them all in-memory. For your use case, you can use a generator expression:
substrings = (
mystring[i:i+j]
for i in range(0, len(mystring) - min_length + 1)
for j in range(min_length, len(mystring) - i + 1)
)
Use a pre-existing Counter implementation. Python comes with a dict
-like container that counts its members: collections.Counter
can directly digest your substring generator. Especially in newer version, this is much more efficient.
counts = collections.Counter(substrings)
You can exploit Python's lazy filters to only ever inspect one substring. The filter
builtin or another generator generator expression can produce one result at a time without storing them all in memory.
for substring in filter(lambda key: counts[key] > 1, counts):
print(substring, 'occurs', counts[substring], 'times')