I need to build a matrix from a list of text files, each containing a frequency distribution of expressions. To do this, I created a list of all the text files in a directory (lof) and used it to build a matrix (thanks to gboffy). Each filename in that list follows the pattern CompanyName-SerialNumber_IssueDate_IFRS.txt (example: GoldmanSachs-123456_31.12.2014_IFRS.txt). The content of each file is structured in exactly the same way as well:
CompanyABC-123456_31.12.2012_IFRS.txt
Company ABC-123456_31.12.2012
financial statement:4
corporate-taxes:8
assets:2
available-for-sale property:0
auditors:213
Company123-789102_31.12.2012_IFRS.txt
Company123-789102_31.12.2012
financial statement:15
corporate-taxes:3
assets:8
available-for-sale property:2
auditors:23
My desired output is a single matrix written to a text file, with one line per company file of the form (CompanyName,SerialNumber,IssueDate,Frequency1,Frequency2,...,FrequencyN):
'CompanyABC','123456','31.12.2012','4','8','2','0','213'
'Company123','789102','31.12.2012','15','3','8','2','23'
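To make the target row concrete, here is a minimal sketch of how one file could be reduced to such a row. The helper name build_row is hypothetical. It assumes the filename always ends in _IFRS.txt, that the serial number follows the last hyphen, and that every frequency line has the form expression:count, with lines lacking a colon (such as a header line) skipped.

```python
import os

def build_row(path, text):
    """Return [company, serial, issue_date, count1, ..., countN] for one file.

    Assumes the name pattern CompanyName-SerialNumber_IssueDate_IFRS.txt.
    """
    stem = os.path.basename(path)
    suffix = '_IFRS.txt'
    if stem.endswith(suffix):
        stem = stem[:-len(suffix)]          # e.g. 'CompanyABC-123456_31.12.2012'
    company_serial, issue_date = stem.rsplit('_', 1)
    company, serial = company_serial.rsplit('-', 1)
    # Keep only the value after the last colon; lines without a colon are ignored.
    counts = [line.rsplit(':', 1)[1].strip()
              for line in text.splitlines() if ':' in line]
    return [company, serial, issue_date] + counts
```

Splitting from the right (rsplit) means company names or expressions that themselves contain a hyphen or colon would not shift the fields.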
Here is my code so far:
import os

def list_textfiles(directory, min_file_size):
    # Creates a list of all files stored in DIRECTORY ending on '.txt' with minimum file size
    textfiles = []
    for root, dirs, files in os.walk(directory):
        for name in files:
            filename = os.path.join(root, name)
            # Check the extension as well, as the comment above promises
            if name.endswith('.txt') and os.stat(filename).st_size > min_file_size:
                textfiles.append(filename)
    return textfiles

directory = 'C:/CompanyFiles'
minimum_size = 30000
lof = list_textfiles(directory, minimum_size)
res = []
for f in lof:
    res += [[entry.split(':')[1] for entry in cdata]
            for cdata in [data.splitlines() for data in open(f).read().split('\n')]]

with open('C:/CompanyFiles/Matrix.txt', 'wt') as outfile:
    outfile.write(str(res))
How can I modify my code to achieve the output as stated above?
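For the writing step, one possible approach (a sketch, assuming Python 3 and that each row is already a list of strings) is to let the csv module produce the single-quoted, comma-separated lines instead of calling str() on the whole nested list. The function names format_matrix and write_matrix are made up for illustration.

```python
import csv
import io

def format_matrix(rows):
    """Render rows as lines like 'CompanyABC','123456','31.12.2012','4',..."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, quotechar="'", quoting=csv.QUOTE_ALL,
                        lineterminator='\n')
    writer.writerows(rows)
    return buffer.getvalue()

def write_matrix(rows, out_path):
    # newline='' keeps the line endings exactly as format_matrix produced them
    with open(out_path, 'w', newline='') as outfile:
        outfile.write(format_matrix(rows))
```

With quoting=csv.QUOTE_ALL every field is wrapped in the quote character, which matches the desired output; if plain unquoted values are acceptable, the default writer settings would do.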