Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
252 views
in Technique[技术] by (71.8m points)

python - How to Reshape a Data Frame, or do an UnPivot, Based on Headers, Then Insert Two Columns

I have a dataframe that looks like this:

 ACCEPTANCES_EXECUTED_FOR_ACCT____OUT    ACCUMULATED_OTH_COMPREHENSIVE_INCOME    ALLL_AMT    AUDIT_INDICATOR     AVAILABLE_FOR_SALE_SECURITIES   COMMON_STOCK    file 
 $-                                      $-                                      $979.00     $1.00               $-                              $1,200.00       FFIEC CDR Call Schedule RC0 03312001.txt
 $-                                      $93.00                                  $127.00     $4.00               $8,546.00                       $120.00         FFIEC CDR Call Schedule RC0 03312001.txt
 $-                                      $246.00                                 $278.00     $1.00               $27,398.00                      $-              FFIEC CDR Call Schedule RC0 03312002.txt
 $-                                      $19.00                                  $130.00     $4.00               $4,047.00                       $25.00          FFIEC CDR Call Schedule RC0 03312002.txt
 $-                                      $125.00                                 $1,188.00   $1.00               $21,265.00                      $596.00         FFIEC CDR Call Schedule RC0 03312002.txt
 $-                                      $-                                      $164.00     $4.00               $-                              $60.00          FFIEC CDR Call Schedule RC0 03312002.txt
 $-                                      $800.00                                 $2,115.00   $1.00               $55,699.00                      $5,400.00       FFIEC CDR Call Schedule RC0 03312003.txt
 $-                                      $199.00                                 $2,372.00   $2.00               $32,306.00                      $19.00          FFIEC CDR Call Schedule RC0 03312003.txt
 $-                                      $174.00                                 $1,114.00   $5.00               $18,296.00                      $1,600.00       FFIEC CDR Call Schedule RC0 03312004.txt
 $-                                      $31.00                                  $323.00     $5.00               $2,997.00                       $240.00         FFIEC CDR Call Schedule RC0 03312004.txt

How can I pivot the data so it looks like this?

Code     Schedule            Line Item                                  Q12001      Q22002      Q32003       Q42004
RC0      BalanceSheet        ACCEPTANCES_EXECUTED_FOR_ACCT____OUT       $-          $-          $-           $-   
RC0      BalanceSheet        ACCUMULATED_OTH_COMPREHENSIVE_INCOME       $93.00      390         $999.00      $205.00 
RC0      BalanceSheet        ALLL_AMT                                   $1,060.00   $1,760.00   $4,487.00    $1,437.00 
RC0      BalanceSheet        AUDIT_INDICATOR                            $5.00       $10.00      $3.00        $10.00 
RC0      BalanceSheet        AVAILABLE_FOR_SALE_SECURITIES              $8,546.00   $52,710.00  $88,050.00   $21,293.00 
RC0      BalanceSheet        COMMON_STOCK                               $1,320.00   $681.00     $5,419.00    $1,840.00 

I don't see this this is a standard pivot exercise, because I need to parse the file name into quarters and years, so this: 'FFIEC CDR Call Schedule RC0 03312001.txt'

Becomes this: 'Q12001'

  1. I would like to add in 'Line Item' as a header
  2. I would like the very first column to have a header of 'Code' and below this should be the name of the file, which is 'RC0'.
  3. I would like to add

It almost looks something like this...

output = df_append.pivot(index='headers', columns='file')
print(output)

But, of course there is no column named 'headers' and I can't figure out how to parse the file into quarters and years. There are over 100 fields in this dataframe and the 'file' goes back almost 20 years, split into quarters and years. Is this even possible?!

question from:https://stackoverflow.com/questions/65623539/how-to-reshape-a-data-frame-or-do-an-unpivot-based-on-headers-then-insert-two

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

While this is a bit tricky, I was able to solve it.

Steps:

1. Setup the dataframe

2. Cleanup the data by removing $, comma, and dashes. Then convert the data to floats

3. Extract the code (RC0,RC1...) and Date (MMDDYYYY) from filename

4. Create a Quarterly Period from the date

5. Group By code and quarterly period to get the sum of each Line Item

6. Transpose the result to create in the format you asked for

7. Make Quarterly Period as the column header and delete the quarterly row

8. Rename the axis name to Line Item as per your request

9. Add Schedule as the column name as per your request

10. Print the final dataframe

Here's the code to do this.

import pandas as pd
import numpy as np

#Dataframe setup activities

c = ['ACCEPTANCES_EXECUTED_FOR_ACCT____OUT',
     'ACCUMULATED_OTH_COMPREHENSIVE_INCOME',
     'ALLL_AMT',
     'AUDIT_INDICATOR',
     'AVAILABLE_FOR_SALE_SECURITIES',
     'COMMON_STOCK',
     'file']
d = [
['$-','$-'     ,'$979.00'  ,'$1.00','$-'        ,'$1,200.00','FFIEC CDR Call Schedule RC0 03312001.txt'],
['$-','$93.00' ,'$127.00'  ,'$4.00','$8,546.00' ,'$120.00'  ,'FFIEC CDR Call Schedule RC0 03312001.txt'],
['$-','$246.00','$278.00'  ,'$1.00','$27,398.00','$-'       ,'FFIEC CDR Call Schedule RC0 03312002.txt'],
['$-','$19.00' ,'$130.00'  ,'$4.00','$4,047.00' ,'$25.00'   ,'FFIEC CDR Call Schedule RC0 03312002.txt'],
['$-','$125.00','$1,188.00','$1.00','$21,265.00','$596.00'  ,'FFIEC CDR Call Schedule RC0 03312002.txt'],
['$-','$-'     ,'$164.00'  ,'$4.00','$-'        ,'$60.00'   ,'FFIEC CDR Call Schedule RC1 03312002.txt'],
['$-','$800.00','$2,115.00','$1.00','$55,699.00','$5,400.00','FFIEC CDR Call Schedule RC1 03312003.txt'],
['$-','$199.00','$2,372.00','$2.00','$32,306.00','$19.00'   ,'FFIEC CDR Call Schedule RC0 03312003.txt'],
['$-','$174.00','$1,114.00','$5.00','$18,296.00','$1,600.00','FFIEC CDR Call Schedule RC1 03312004.txt'],
['$-','$31.00' ,'$323.00'  ,'$5.00','$2,997.00' ,'$240.00'  ,'FFIEC CDR Call Schedule RC0 03312004.txt']]

df = pd.DataFrame(d,columns=c)

#Dataframe cleanup activities. Remove $ sign, comma and dash to enable computing

df[df.columns[:6]] = df[df.columns[:6]].apply(lambda x: x.str.replace('[$,]','',regex=True).replace('-','0',regex=True)).astype(np.float64)

#Extract code from the filename

df['code'] = df['file'].str[24:27]

#Extract Date (MMDDYYYY) from file name

df['year'] = pd.to_datetime(df['file'].str[28:36],format='%m%d%Y').dt.strftime('%m/%d/%Y')

#Create a Quarterly Period from the date

df['qyear'] = pd.PeriodIndex(df['year'], freq='Q')

#Calculate the sum based on Quarteryly values using Groupby
#Transpose to create the dataframe in the format you wanted. Store it into df1

df1 = df.groupby(['code','qyear']).agg({'ACCEPTANCES_EXECUTED_FOR_ACCT____OUT':'sum',
                  'ACCUMULATED_OTH_COMPREHENSIVE_INCOME':'sum',
                  'ALLL_AMT':'sum',
                                   'AUDIT_INDICATOR':'sum',
                                   'AVAILABLE_FOR_SALE_SECURITIES':'sum',
                                   'COMMON_STOCK':'sum'}).reset_index().T

#Make Quarterly Period as the column header and delete the quarterly row
df1.columns = df1.loc['qyear']
df1.drop(['qyear'],axis=0,inplace=True)

#Rename the axis name to Line Item as per your request
df1.rename_axis('Line Item',axis="columns",inplace=True)

#Add Schedule as the column name as per your request
df1['Schedule'] = 'BalanceSheet'

#You now have the final dataframe as per your request
print (df1)

Note that if filename has values more than 1 code (RC0, then separate sets of Quarterly Periods are setup for each RC code (as columns). Due to this, we cannot generalize the code to RC0. Also, I am having some difficulty in computing the values of Line Items per RC code and adding them as separate rows. Looking at how the data is organized, it is best to have the RC0 and RC1 with Quarterly Periods show up in separate columns rather than duplicate the line items for each RC group.

The final dataframe looks like this (note I have RC0 and RC1 in this):

Line Item                            2001Q1 2002Q1 2003Q1 2004Q1 2002Q1  
code                                    RC0    RC0    RC0    RC0    RC1   
ACCEPTANCES_EXECUTED_FOR_ACCT____OUT      0      0      0      0      0   
ACCUMULATED_OTH_COMPREHENSIVE_INCOME     93    390    199     31      0   
ALLL_AMT                               1106   1596   2372    323    164   
AUDIT_INDICATOR                           5      6      2      5      4   
AVAILABLE_FOR_SALE_SECURITIES          8546  52710  32306   2997      0   
COMMON_STOCK                           1320    621     19    240     60   

Line Item                            2003Q1 2004Q1      Schedule  
code                                    RC1    RC1  BalanceSheet  
ACCEPTANCES_EXECUTED_FOR_ACCT____OUT      0      0  BalanceSheet  
ACCUMULATED_OTH_COMPREHENSIVE_INCOME    800    174  BalanceSheet  
ALLL_AMT                               2115   1114  BalanceSheet  
AUDIT_INDICATOR                           1      5  BalanceSheet  
AVAILABLE_FOR_SALE_SECURITIES         55699  18296  BalanceSheet  
COMMON_STOCK                           5400   1600  BalanceSheet  

Went back and looked at your request. It looks like you also wanted to print the $ sign for all the dollar values.

Instead of transforming the dataframe, I am doing it a bit later. That allows me to add the $ sign to the values.

df1 =

df.groupby(['code','qyear']).agg({'ACCEPTANCES_EXECUTED_FOR_ACCT____OUT':'sum',
                  'ACCUMULATED_OTH_COMPREHENSIVE_INCOME':'sum',
                  'ALLL_AMT':'sum',
                                   'AUDIT_INDICATOR':'sum',
                                   'AVAILABLE_FOR_SALE_SECURITIES':'sum',
                                   'COMMON_STOCK':'sum'}).reset_index()
cols = df1.columns[2:]
df1.loc[:, cols] = df1[cols].astype(float).applymap('${:,.2f}'.format)
df1 = df1.T
df1.columns = df1.loc['qyear']

If I transform the data, then RC0/1... will also show up and I cannot add $ value. So doing it before I transform.

The output is as follows:

Line Item                                2001Q1      2002Q1      2003Q1  
code                                        RC0         RC0         RC0   
ACCEPTANCES_EXECUTED_FOR_ACCT____OUT      $0.00       $0.00       $0.00   
ACCUMULATED_OTH_COMPREHENSIVE_INCOME     $93.00     $390.00     $199.00   
ALLL_AMT                              $1,106.00   $1,596.00   $2,372.00   
AUDIT_INDICATOR                           $5.00       $6.00       $2.00   
AVAILABLE_FOR_SALE_SECURITIES         $8,546.00  $52,710.00  $32,306.00   
COMMON_STOCK                          $1,320.00     $621.00      $19.00   

Line Item                                2004Q1   2002Q1      2003Q1  
code                                        RC0      RC1         RC1   
ACCEPTANCES_EXECUTED_FOR_ACCT____OUT      $0.00    $0.00       $0.00   
ACCUMULATED_OTH_COMPREHENSIVE_INCOME     $31.00    $0.00     $800.00   
ALLL_AMT                                $323.00  $164.00   $2,115.00   
AUDIT_INDICATOR                           $5.00    $4.00       $1.00   
AVAILABLE_FOR_SALE_SECURITIES         $2,997.00    $0.00  $55,699.00   
COMMON_STOCK                            $240.00   $60.00   $5,400.00   

Line Item                                 2004Q1      Schedule  
code                                         RC1  BalanceSheet  
ACCEPTANCES_EXECUTED_FOR_ACCT____OUT       $0.00  BalanceSheet  
ACCUMULATED_OTH_COMPREHENSIVE_INCOME     $174.00  BalanceSheet  
ALLL_AMT                               $1,114.00  BalanceSheet  
AUDIT_INDICATOR                            $5.00  BalanceSheet  
AVAILABLE_FOR_SALE_SECURITIES         $18,296.00  BalanceSheet  
COMMON_STOCK                           $1,600.00  BalanceSheet  

If you want to replace all the $0.00 with $- then you can do:

df1.replace(to_replace = '$0.00', value = '$-', inplace = True)

If you have varying columns and want to aggregate on all of them, then do this.

Assuming the last column is file and you want to aggregate on all the columns from index 0 thru last but one, you can do this:

cagg = {cx:'sum' for cx in df.columns[:-1]}
df1 = df.groupby(['code','qyear']).agg(cagg).reset_index()

Since you are going to add 3 columns towards the end:

df['code'] = df['file'].str[24:27]
df['year'] = pd.to_datetime(df['file'].str[28:36],format='%m%d%Y').dt.strftime('%m/%d/%Y')
df['qyear'] = pd.PeriodIndex(df['year'], freq='Q')

You can give modify the code as follows [:-4] to exclude file + code + year + qyear columns:

df = pd.DataFrame(d,columns=c)
df[df.columns[:6]] = df[df.columns[:6]].apply(lambda x: x.str.replace('[$,]','',regex=True).replace('-','0',regex=True)).astype(np.float64)
df['code'] = df['file'].str[24:27]
df['year'] = pd.to_datetime(df['file'].str[28:36],format='%m%d%Y').dt.strftime('%m/%d/%Y')
df['qyear'] = pd.PeriodIndex(df['year'], freq='Q')
cagg = {cx:'sum' for cx in df.columns[:-4]}
df1 = df.groupby(['code','qyear']).agg(cagg).reset_index()

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...