Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
171 views
in Technique[技术] by (71.8m points)

r - How to read a data set where variables are stored as rows, and some variable names contain "#"

I have a fairly large (about 1200 variables with 14 observations each) data set, stored in a text file, with a very weird and definitely not tidy structure. In practice each variable is stored as a row, instead than as a column, and the first and second row are respectively the variable name, and the measurement unit for that variable. Here is a sample data set:

Date    --- 1/19/2016   1/19/2016   1/19/2016   1/19/2016   1/19/2016   1/19/2016   1/19/2016   1/19/2016   1/19/2016   1/19/2016   1/19/2016   1/19/2016   1/19/2016   1/19/2016
PT-AMB#SRV  V   1.403400    1.403207    1.403265    1.403326    1.403454    1.403783    1.404924    1.404962    1.405291    1.404951    1.404685    1.404812    1.404433    1.404428
PS1-SEC20#SRV   V   2.395769    2.416003    2.362276    2.253045    2.139873    1.939328    2.450442    2.294791    2.085946    1.929666    2.634747    3.067008    3.081949    3.095456

The first variable is called Date and it's nondimensional (units ---), the second one is PT-AMB#SRV and measured in volts V, and so on. NOTE: two entries on the same row are separated by a tab in the original file. I'm not sure if this is conserved once I copy&paste data here on Stack Overflow.

First of all, I tried to read in data like this:

df=read.table("TEST.txt",sep="	")

I get the following error:

Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings,  : 
  line 2 did not have 16 elements

The error goes away if I manually edit the second (and the third) variable names, changing the # to a -.

The first question is: why is this happening, and how can I prevent it from happening? If I need to change all # to - in all variable names, how can I do that automatically? Preferably in R, otherwise the command line is fine (I work in Windows).

Second question: after modifying all the # (just two in this sample data set), I read it with

 df=read.table("TEST.txt",sep="	")

I get:

             V1  V2        V3        V4        V5        V6        V7        V8        V9       V10       V11       V12       V13       V14       V15       V16
1          Date --- 1/19/2016 1/19/2016 1/19/2016 1/19/2016 1/19/2016 1/19/2016 1/19/2016 1/19/2016 1/19/2016 1/19/2016 1/19/2016 1/19/2016 1/19/2016 1/19/2016
2    PT-AMB-SRV   V  1.403400  1.403207  1.403265  1.403326  1.403454  1.403783  1.404924  1.404962  1.405291  1.404951  1.404685  1.404812  1.404433  1.404428
3 PS1-SEC20-SRV   V  2.395769  2.416003  2.362276  2.253045  2.139873  1.939328  2.450442  2.294791  2.085946  1.929666  2.634747  3.067008  3.081949  3.095456

I then try to transpose df, so that variables are stored in columns:

df_t=t(df)

I get:

    [,1]        [,2]         [,3]           
V1  "Date"      "PT-AMB-SRV" "PS1-SEC20-SRV"
V2  "---"       "V"          "V"            
V3  "1/19/2016" "1.403400"   "2.395769"     
V4  "1/19/2016" "1.403207"   "2.416003"     
V5  "1/19/2016" "1.403265"   "2.362276"     
V6  "1/19/2016" "1.403326"   "2.253045"     
V7  "1/19/2016" "1.403454"   "2.139873"     
V8  "1/19/2016" "1.403783"   "1.939328"     
V9  "1/19/2016" "1.404924"   "2.450442"     
V10 "1/19/2016" "1.404962"   "2.294791"     
V11 "1/19/2016" "1.405291"   "2.085946"     
V12 "1/19/2016" "1.404951"   "1.929666"     
V13 "1/19/2016" "1.404685"   "2.634747"     
V14 "1/19/2016" "1.404812"   "3.067008"     
V15 "1/19/2016" "1.404433"   "3.081949"     
V16 "1/19/2016" "1.404428"   "3.095456" 

No more a data frame, but an array of characters. Definitely not what I want to. How can I make it so that variables are stored in columns (tidy data set)? I thought the problem was the column containing the measurement units, but removing it before transposing with

df[,"V2"]=NULL   

doesn't solve anything. Maybe tidyr could help here, but I don't know how.

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

Read the data into DF0, transpose it and use type.convert to get appropriate classes for the columns. Set the names and convert the first column to "Date" class using the appropriate format string.

# replace text = Lines with file = "myfile.dat"
DF0 <- read.table( text = Lines, colClasses = "character", comment = "" )
L <- lapply( as.data.frame( tail( t(DF0), -2 ), stringsAsFactors = FALSE ), type.convert )
DF <- setNames( as.data.frame(L), DF0[[1]] )
DF$Date <- as.Date( DF$Date, format = "%m/%d/%Y" )

The result is:

> DF
         Date PT-AMB#SRV PS1-SEC20#SRV
1  2016-01-19   1.403400      2.395769
2  2016-01-19   1.403207      2.416003
3  2016-01-19   1.403265      2.362276
4  2016-01-19   1.403326      2.253045
5  2016-01-19   1.403454      2.139873
6  2016-01-19   1.403783      1.939328
7  2016-01-19   1.404924      2.450442
8  2016-01-19   1.404962      2.294791
9  2016-01-19   1.405291      2.085946
10 2016-01-19   1.404951      1.929666
11 2016-01-19   1.404685      2.634747
12 2016-01-19   1.404812      3.067008
13 2016-01-19   1.404433      3.081949
14 2016-01-19   1.404428      3.095456

Note: We used this input:

Lines <- 
"Date    --- 1/19/2016   1/19/2016   1/19/2016   1/19/2016   1/19/2016   1/19/2016   1/19/2016   1/19/2016   1/19/2016   1/19/2016   1/19/2016   1/19/2016   1/19/2016   1/19/2016
PT-AMB#SRV  V   1.403400    1.403207    1.403265    1.403326    1.403454    1.403783    1.404924    1.404962    1.405291    1.404951    1.404685    1.404812    1.404433    1.404428
PS1-SEC20#SRV   V   2.395769    2.416003    2.362276    2.253045    2.139873    1.939328    2.450442    2.294791    2.085946    1.929666    2.634747    3.067008    3.081949    3.095456"

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...