I have a fairly large data set (about 1200 variables with 14 observations each), stored in a text file, with a very weird and definitely not tidy
structure. In practice each variable is stored as a row rather than as a column, and the first and second entries of each row are, respectively, the variable name and the measurement unit for that variable. Here is a sample data set:
Date --- 1/19/2016 1/19/2016 1/19/2016 1/19/2016 1/19/2016 1/19/2016 1/19/2016 1/19/2016 1/19/2016 1/19/2016 1/19/2016 1/19/2016 1/19/2016 1/19/2016
PT-AMB#SRV V 1.403400 1.403207 1.403265 1.403326 1.403454 1.403783 1.404924 1.404962 1.405291 1.404951 1.404685 1.404812 1.404433 1.404428
PS1-SEC20#SRV V 2.395769 2.416003 2.362276 2.253045 2.139873 1.939328 2.450442 2.294791 2.085946 1.929666 2.634747 3.067008 3.081949 3.095456
The first variable is called Date and it's nondimensional (its units field is ---), the second one is PT-AMB#SRV and is measured in volts (V), and so on. NOTE: two entries on the same row are separated by a tab in the original file; I'm not sure whether this is preserved once I copy & paste the data here on Stack Overflow.
First of all, I tried to read in the data like this:
df=read.table("TEST.txt",sep=" ")
I get the following error:
Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, :
line 2 did not have 16 elements
The error goes away if I manually edit the second (and the third) variable names, changing the # to a -.
The first question is: why is this happening, and how can I prevent it from happening? If I need to change all the # to - in all variable names, how can I do that automatically? Preferably in R, otherwise the command line is fine (I work on Windows).
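For what it's worth, I noticed that ?read.table lists comment.char = "#" as the default, so I suspect everything after a # is being dropped, but I haven't confirmed this. For the automatic replacement, the best I could sketch is reading the raw lines, substituting, and writing a cleaned copy (TEST_fixed.txt is just a name I made up here):

# sketch: replace every # with - in the raw file, then re-read it
raw <- readLines("TEST.txt")
raw <- gsub("#", "-", raw, fixed = TRUE)
writeLines(raw, "TEST_fixed.txt")
df <- read.table("TEST_fixed.txt", sep = " ")
# (or perhaps read.table("TEST.txt", sep = " ", comment.char = "") would avoid the
#  problem in the first place, since "#" is the default comment character -- not verified)

Is this a reasonable way to do it, or is there something more standard?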
Second question: after modifying all the # (just two in this sample data set), I read the file with
df=read.table("TEST.txt",sep=" ")
and I get:
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16
1 Date --- 1/19/2016 1/19/2016 1/19/2016 1/19/2016 1/19/2016 1/19/2016 1/19/2016 1/19/2016 1/19/2016 1/19/2016 1/19/2016 1/19/2016 1/19/2016 1/19/2016
2 PT-AMB-SRV V 1.403400 1.403207 1.403265 1.403326 1.403454 1.403783 1.404924 1.404962 1.405291 1.404951 1.404685 1.404812 1.404433 1.404428
3 PS1-SEC20-SRV V 2.395769 2.416003 2.362276 2.253045 2.139873 1.939328 2.450442 2.294791 2.085946 1.929666 2.634747 3.067008 3.081949 3.095456
I then try to transpose df, so that variables are stored in columns:
df_t=t(df)
I get:
[,1] [,2] [,3]
V1 "Date" "PT-AMB-SRV" "PS1-SEC20-SRV"
V2 "---" "V" "V"
V3 "1/19/2016" "1.403400" "2.395769"
V4 "1/19/2016" "1.403207" "2.416003"
V5 "1/19/2016" "1.403265" "2.362276"
V6 "1/19/2016" "1.403326" "2.253045"
V7 "1/19/2016" "1.403454" "2.139873"
V8 "1/19/2016" "1.403783" "1.939328"
V9 "1/19/2016" "1.404924" "2.450442"
V10 "1/19/2016" "1.404962" "2.294791"
V11 "1/19/2016" "1.405291" "2.085946"
V12 "1/19/2016" "1.404951" "1.929666"
V13 "1/19/2016" "1.404685" "2.634747"
V14 "1/19/2016" "1.404812" "3.067008"
V15 "1/19/2016" "1.404433" "3.081949"
V16 "1/19/2016" "1.404428" "3.095456"
The result is no longer a data frame, but a character matrix, which is definitely not what I want. How can I make it so that variables are stored in columns (i.e., a tidy data set)? I thought the problem was the column containing the measurement units, but removing it before transposing with
df[,"V2"]=NULL
doesn't solve anything. Maybe tidyr could help here, but I don't know how.
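The closest I have come is rebuilding the data frame by hand after transposing, along the lines of the sketch below (var_names and the use of type.convert are just my guesses at how to recover proper column names and numeric columns; it feels clunky and I'm not sure it's the right approach):

# keep the variable names and units aside, then transpose only the values
var_names <- as.character(df$V1)
units <- as.character(df$V2)
df_t <- as.data.frame(t(df[ , -(1:2)]), stringsAsFactors = FALSE)
names(df_t) <- var_names
rownames(df_t) <- NULL
# try to turn each column back into its natural type (the date column stays character)
df_t[] <- lapply(df_t, type.convert, as.is = TRUE)

Is there a cleaner, tidyr-style way to get the same result?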