Is the id used for the decision to delete the sequence, or is the list of values used for the decision?
You can build a dictionary where the id number is the key (converted to int because of the later sorting) and the following lines are converted to the list of strings that is the value for the key. Then you can delete the item with the key 2, and traverse the items sorted by the key, and output the new id:key plus the formated list of the strings.
Or you can build the list of lists where the order is protected. If the sequence of the id's is to be protected (i.e. not renumbered), you can also remember the id:n in the inner list.
This can be done for a reasonably sized file. If the file is huge, you should copy the source to the destination and skip the unwanted sequence on the fly. The last case can be fairly easy also for the small file.
[added after the clarification]
I recommend to learn the following approach that is usefull in many such cases. It uses so called finite automaton that implements actions bound to transitions from one state to another (see Mealy machine).
The text line is the input element here. The nodes that represent the context status are numbered here. (My experience is that it is not worth to give them names -- keep them just stupid numbers.) Here only two states are used and the status
could easily be replaced by a boolean variable. However, if the case becomes more complicated, it leads to introduction of another boolean variable, and the code becomes more error prone.
The code may look very complicated at first, but it is fairly easy to understand when you know that you can think about each if status == number
separately. This is the mentioned context that captured the previous processing. Do not try to optimize, let the code that way. It can actually be human-decoded later, and you can draw the picture similar to the Mealy machine example. If you do, then it is much more understandable.
The wanted functionality is a bit generalized -- a set of ignored sections can be passed as the first argument:
import re
def filterSections(del_set, fname_in, fname_out):
'''Filtering out the del_set sections from fname_in. Result in fname_out.'''
# The regular expression was chosen for detecting and parsing the id-line.
# It can be done differently, but I consider it just fine and efficient.
rex_id = re.compile(r'^id:(d+)s*$')
# Let's open the input and output file. The files will be closed
# automatically.
with open(fname_in) as fin, open(fname_out, 'w') as fout:
status = 1 # initial status -- expecting the id line
for line in fin:
m = rex_id.match(line) # get the match object if it is the id-line
if status == 1: # skipping the non-id lines
if m: # you can also write "if m is not None:"
num_id = int(m.group(1)) # get the numeric value of the id
if num_id in del_set: # if this id should be deleted
status = 1 # or pass (to stay in this status)
else:
fout.write(line) # copy this id-line
status = 2 # to copy the following non-id lines
#else ignore this line (no code needed to ignore it :)
elif status == 2: # copy the non-id lines
if m: # the id-line found
num_id = int(m.group(1)) # get the numeric value of the id
if num_id in del_set: # if this id should be deleted
status = 1 # or pass (to stay in this status)
else:
fout.write(line) # copy this id-line
status = 2 # to copy the following non-id lines
else:
fout.write(line) # copy this non-id line
if __name__ == '__main__':
filterSections( {1, 3}, 'data.txt', 'output.txt')
# or you can write the older set([1, 3]) for the first argument.
Here the output id-lines where given the original number. If you want to renumber the sections, it can be done via a simple modification. Try the code and ask for details.
Beware, the finite automata have limited power. They cannot be used for the usual programming languages as they are not able to capture nested paired structures (like parenteses).
P.S. The 7000 lines is actually a tiny file from a computer perspective ;)