While this is a bit tricky, I was able to solve it.
Steps:
1. Setup the dataframe
2. Cleanup the data by removing $, comma, and dashes. Then convert the data to floats
3. Extract the code (RC0,RC1...) and Date (MMDDYYYY) from filename
4. Create a Quarterly Period from the date
5. Group By code and quarterly period to get the sum of each Line Item
6. Transpose the result to create in the format you asked for
7. Make Quarterly Period as the column header and delete the quarterly row
8. Rename the axis name to Line Item as per your request
9. Add Schedule as the column name as per your request
10. Print the final dataframe
Here's the code to do this.
import pandas as pd
import numpy as np
#Dataframe setup activities
c = ['ACCEPTANCES_EXECUTED_FOR_ACCT____OUT',
'ACCUMULATED_OTH_COMPREHENSIVE_INCOME',
'ALLL_AMT',
'AUDIT_INDICATOR',
'AVAILABLE_FOR_SALE_SECURITIES',
'COMMON_STOCK',
'file']
d = [
['$-','$-' ,'$979.00' ,'$1.00','$-' ,'$1,200.00','FFIEC CDR Call Schedule RC0 03312001.txt'],
['$-','$93.00' ,'$127.00' ,'$4.00','$8,546.00' ,'$120.00' ,'FFIEC CDR Call Schedule RC0 03312001.txt'],
['$-','$246.00','$278.00' ,'$1.00','$27,398.00','$-' ,'FFIEC CDR Call Schedule RC0 03312002.txt'],
['$-','$19.00' ,'$130.00' ,'$4.00','$4,047.00' ,'$25.00' ,'FFIEC CDR Call Schedule RC0 03312002.txt'],
['$-','$125.00','$1,188.00','$1.00','$21,265.00','$596.00' ,'FFIEC CDR Call Schedule RC0 03312002.txt'],
['$-','$-' ,'$164.00' ,'$4.00','$-' ,'$60.00' ,'FFIEC CDR Call Schedule RC1 03312002.txt'],
['$-','$800.00','$2,115.00','$1.00','$55,699.00','$5,400.00','FFIEC CDR Call Schedule RC1 03312003.txt'],
['$-','$199.00','$2,372.00','$2.00','$32,306.00','$19.00' ,'FFIEC CDR Call Schedule RC0 03312003.txt'],
['$-','$174.00','$1,114.00','$5.00','$18,296.00','$1,600.00','FFIEC CDR Call Schedule RC1 03312004.txt'],
['$-','$31.00' ,'$323.00' ,'$5.00','$2,997.00' ,'$240.00' ,'FFIEC CDR Call Schedule RC0 03312004.txt']]
df = pd.DataFrame(d,columns=c)
#Dataframe cleanup activities. Remove $ sign, comma and dash to enable computing
df[df.columns[:6]] = df[df.columns[:6]].apply(lambda x: x.str.replace('[$,]','',regex=True).replace('-','0',regex=True)).astype(np.float64)
#Extract code from the filename
df['code'] = df['file'].str[24:27]
#Extract Date (MMDDYYYY) from file name
df['year'] = pd.to_datetime(df['file'].str[28:36],format='%m%d%Y').dt.strftime('%m/%d/%Y')
#Create a Quarterly Period from the date
df['qyear'] = pd.PeriodIndex(df['year'], freq='Q')
#Calculate the sum based on Quarteryly values using Groupby
#Transpose to create the dataframe in the format you wanted. Store it into df1
df1 = df.groupby(['code','qyear']).agg({'ACCEPTANCES_EXECUTED_FOR_ACCT____OUT':'sum',
'ACCUMULATED_OTH_COMPREHENSIVE_INCOME':'sum',
'ALLL_AMT':'sum',
'AUDIT_INDICATOR':'sum',
'AVAILABLE_FOR_SALE_SECURITIES':'sum',
'COMMON_STOCK':'sum'}).reset_index().T
#Make Quarterly Period as the column header and delete the quarterly row
df1.columns = df1.loc['qyear']
df1.drop(['qyear'],axis=0,inplace=True)
#Rename the axis name to Line Item as per your request
df1.rename_axis('Line Item',axis="columns",inplace=True)
#Add Schedule as the column name as per your request
df1['Schedule'] = 'BalanceSheet'
#You now have the final dataframe as per your request
print (df1)
Note that if filename has values more than 1 code (RC0, then separate sets of Quarterly Periods are setup for each RC code (as columns). Due to this, we cannot generalize the code to RC0. Also, I am having some difficulty in computing the values of Line Items per RC code and adding them as separate rows. Looking at how the data is organized, it is best to have the RC0 and RC1 with Quarterly Periods show up in separate columns rather than duplicate the line items for each RC group.
The final dataframe looks like this (note I have RC0 and RC1 in this):
Line Item 2001Q1 2002Q1 2003Q1 2004Q1 2002Q1
code RC0 RC0 RC0 RC0 RC1
ACCEPTANCES_EXECUTED_FOR_ACCT____OUT 0 0 0 0 0
ACCUMULATED_OTH_COMPREHENSIVE_INCOME 93 390 199 31 0
ALLL_AMT 1106 1596 2372 323 164
AUDIT_INDICATOR 5 6 2 5 4
AVAILABLE_FOR_SALE_SECURITIES 8546 52710 32306 2997 0
COMMON_STOCK 1320 621 19 240 60
Line Item 2003Q1 2004Q1 Schedule
code RC1 RC1 BalanceSheet
ACCEPTANCES_EXECUTED_FOR_ACCT____OUT 0 0 BalanceSheet
ACCUMULATED_OTH_COMPREHENSIVE_INCOME 800 174 BalanceSheet
ALLL_AMT 2115 1114 BalanceSheet
AUDIT_INDICATOR 1 5 BalanceSheet
AVAILABLE_FOR_SALE_SECURITIES 55699 18296 BalanceSheet
COMMON_STOCK 5400 1600 BalanceSheet
Went back and looked at your request. It looks like you also wanted to print the $ sign for all the dollar values.
Instead of transforming the dataframe, I am doing it a bit later. That allows me to add the $ sign to the values.
df1 =
df.groupby(['code','qyear']).agg({'ACCEPTANCES_EXECUTED_FOR_ACCT____OUT':'sum',
'ACCUMULATED_OTH_COMPREHENSIVE_INCOME':'sum',
'ALLL_AMT':'sum',
'AUDIT_INDICATOR':'sum',
'AVAILABLE_FOR_SALE_SECURITIES':'sum',
'COMMON_STOCK':'sum'}).reset_index()
cols = df1.columns[2:]
df1.loc[:, cols] = df1[cols].astype(float).applymap('${:,.2f}'.format)
df1 = df1.T
df1.columns = df1.loc['qyear']
If I transform the data, then RC0/1... will also show up and I cannot add $ value. So doing it before I transform.
The output is as follows:
Line Item 2001Q1 2002Q1 2003Q1
code RC0 RC0 RC0
ACCEPTANCES_EXECUTED_FOR_ACCT____OUT $0.00 $0.00 $0.00
ACCUMULATED_OTH_COMPREHENSIVE_INCOME $93.00 $390.00 $199.00
ALLL_AMT $1,106.00 $1,596.00 $2,372.00
AUDIT_INDICATOR $5.00 $6.00 $2.00
AVAILABLE_FOR_SALE_SECURITIES $8,546.00 $52,710.00 $32,306.00
COMMON_STOCK $1,320.00 $621.00 $19.00
Line Item 2004Q1 2002Q1 2003Q1
code RC0 RC1 RC1
ACCEPTANCES_EXECUTED_FOR_ACCT____OUT $0.00 $0.00 $0.00
ACCUMULATED_OTH_COMPREHENSIVE_INCOME $31.00 $0.00 $800.00
ALLL_AMT $323.00 $164.00 $2,115.00
AUDIT_INDICATOR $5.00 $4.00 $1.00
AVAILABLE_FOR_SALE_SECURITIES $2,997.00 $0.00 $55,699.00
COMMON_STOCK $240.00 $60.00 $5,400.00
Line Item 2004Q1 Schedule
code RC1 BalanceSheet
ACCEPTANCES_EXECUTED_FOR_ACCT____OUT $0.00 BalanceSheet
ACCUMULATED_OTH_COMPREHENSIVE_INCOME $174.00 BalanceSheet
ALLL_AMT $1,114.00 BalanceSheet
AUDIT_INDICATOR $5.00 BalanceSheet
AVAILABLE_FOR_SALE_SECURITIES $18,296.00 BalanceSheet
COMMON_STOCK $1,600.00 BalanceSheet
If you want to replace all the $0.00
with $-
then you can do:
df1.replace(to_replace = '$0.00', value = '$-', inplace = True)
If you have varying columns and want to aggregate on all of them, then do this.
Assuming the last column is file
and you want to aggregate on all the columns from index 0 thru last but one, you can do this:
cagg = {cx:'sum' for cx in df.columns[:-1]}
df1 = df.groupby(['code','qyear']).agg(cagg).reset_index()
Since you are going to add 3 columns towards the end:
df['code'] = df['file'].str[24:27]
df['year'] = pd.to_datetime(df['file'].str[28:36],format='%m%d%Y').dt.strftime('%m/%d/%Y')
df['qyear'] = pd.PeriodIndex(df['year'], freq='Q')
You can give modify the code as follows [:-4] to exclude file + code + year + qyear columns:
df = pd.DataFrame(d,columns=c)
df[df.columns[:6]] = df[df.columns[:6]].apply(lambda x: x.str.replace('[$,]','',regex=True).replace('-','0',regex=True)).astype(np.float64)
df['code'] = df['file'].str[24:27]
df['year'] = pd.to_datetime(df['file'].str[28:36],format='%m%d%Y').dt.strftime('%m/%d/%Y')
df['qyear'] = pd.PeriodIndex(df['year'], freq='Q')
cagg = {cx:'sum' for cx in df.columns[:-4]}
df1 = df.groupby(['code','qyear']).agg(cagg).reset_index()