Pass the dtype param to read_csv to specify the dtype:
In [159]:
import pandas as pd
import io
t="""uid,f1,f2,f3
01,0.1,1,10
02,0.2,2,20
03,0.3,3,30"""
df = pd.read_csv(io.StringIO(t), dtype={'uid':str})
df.set_index('uid', inplace=True)
df.index
Out[159]:
Index(['01', '02', '03'], dtype='object', name='uid')
So in your case the following should work:
df = pd.read_csv('sample.csv', dtype={'uid':str})
df.set_index('uid', inplace=True)
The one-line equivalent doesn't work, due to a pandas bug (still outstanding at the time of writing) where the dtype param is ignored on columns that are to be treated as the index:
df = pd.read_csv('sample.csv', dtype={'uid':str}, index_col='uid')
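Whether the one-liner misbehaves may depend on the installed pandas version, so it can be worth checking directly. A minimal sketch, reusing the sample data from above (the variable names are only illustrative, not part of pandas):

```python
import io
import pandas as pd

t = """uid,f1,f2,f3
01,0.1,1,10
02,0.2,2,20
03,0.3,3,30"""

expected = ['01', '02', '03']

# Two-step approach: read uid as str, then make it the index.
df_two_step = pd.read_csv(io.StringIO(t), dtype={'uid': str}).set_index('uid')

# One-line approach: dtype and index_col passed together.
df_one_line = pd.read_csv(io.StringIO(t), dtype={'uid': str}, index_col='uid')

# Each flag is True only if the leading zeros survived in the index.
two_step_ok = list(df_two_step.index) == expected
one_line_ok = list(df_one_line.index) == expected
print('two-step keeps leading zeros:', two_step_ok)
print('one-line keeps leading zeros:', one_line_ok)
```

If the second line prints False, stick with the two-step version.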
You can do this dynamically if we assume the first column is the index column:
In [171]:
t="""uid,f1,f2,f3
01,0.1,1,10
02,0.2,2,20
03,0.3,3,30"""
cols = pd.read_csv(io.StringIO(t), nrows=1).columns.tolist()
index_col_name = cols[0]
dtypes = dict(zip(cols[1:], [float]* len(cols[1:])))
dtypes[index_col_name] = str
df = pd.read_csv(io.StringIO(t), dtype=dtypes)
df.set_index(index_col_name, inplace=True)
df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 3 entries, 01 to 03
Data columns (total 3 columns):
f1 3 non-null float64
f2 3 non-null float64
f3 3 non-null float64
dtypes: float64(3)
memory usage: 96.0+ bytes
In [172]:
df.index
Out[172]:
Index(['01', '02', '03'], dtype='object', name='uid')
Here we read only the header and the first data row (nrows=1) to get the column names:
cols = pd.read_csv(io.StringIO(t), nrows=1).columns.tolist()
We then generate a dict mapping the column names to the desired dtypes:
index_col_name = cols[0]
dtypes = dict(zip(cols[1:], [float]* len(cols[1:])))
dtypes[index_col_name] = str
We get the index name, assuming it's the first entry, then create a dict from the rest of the columns with float as the desired dtype, and add the index column with its type set to str. You can then pass this dict as the dtype param to read_csv.
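The steps above can be wrapped in a small helper. This is only a sketch under the same assumptions (the first column is the index and every other column should be float). The function name read_csv_str_index is hypothetical, not part of pandas, and it uses nrows=0 so that only the header is parsed on the first pass:

```python
import io
import pandas as pd

def read_csv_str_index(csv_text):
    """Hypothetical helper: first column becomes a str index, the rest float."""
    # First pass: parse only the header to learn the column names.
    cols = pd.read_csv(io.StringIO(csv_text), nrows=0).columns.tolist()
    index_col_name = cols[0]
    # Assumption: every non-index column should be read as float.
    dtypes = {name: float for name in cols[1:]}
    dtypes[index_col_name] = str
    # Second pass: read the data with explicit dtypes, then set the index.
    df = pd.read_csv(io.StringIO(csv_text), dtype=dtypes)
    return df.set_index(index_col_name)

t = """uid,f1,f2,f3
01,0.1,1,10
02,0.2,2,20
03,0.3,3,30"""

df = read_csv_str_index(t)
print(df.index.tolist())
print(df.dtypes)
```

For a file on disk you would pass the path to both read_csv calls instead of wrapping the text in io.StringIO.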