python 2.7 - Scikit-learn labelencoder: how to preserve mappings between batches?

I have 185 million samples, each about 3.8 MB. To prepare my dataset, I need to one-hot encode many of the features, after which I end up with over 15,000 features.

But I need to prepare the dataset in batches, since the memory footprint exceeds 100 GB for the features alone when one-hot encoding only 3 million samples.

The question is: how do I preserve the encodings/mappings/labels between batches? The batches will not necessarily contain all the levels of a category. That is, batch #1 may have Paris, Tokyo, Rome, while batch #2 may have Paris, London. But in the end I need Paris, Tokyo, Rome, and London all mapped to one consistent encoding.

Assuming that I cannot determine the levels of my Cities column across all 185 million samples at once since they won't fit in RAM, what should I do? If I apply the same LabelEncoder instance to different batches, will the mappings remain the same? After this I will also need to one-hot encode in batches, either with scikit-learn or with Keras' np_utils.to_categorical. So, same question: how do I use those methods in batches, or apply them at once to a file format stored on disk?
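
For illustration, here is a minimal sketch (using the city names from the question) of why re-fitting a LabelEncoder on each batch gives inconsistent integer codes, which is exactly the problem to avoid:

from sklearn.preprocessing import LabelEncoder

# Two batches that do not share the same set of levels
batch_1 = ['Paris', 'Tokyo', 'Rome']
batch_2 = ['Paris', 'London']

# Re-fitting per batch overwrites the previous mapping
le = LabelEncoder()
print(le.fit_transform(batch_1))  # [0 2 1]  -> Paris=0, Rome=1, Tokyo=2
print(le.fit_transform(batch_2))  # [1 0]    -> London=0, Paris=1; Paris no longer maps to 0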

1 Reply


I suggest using Pandas' get_dummies() for this, since sklearn's OneHotEncoder() needs to see all possible categorical values during .fit(); otherwise it will raise an error when it encounters an unseen value during .transform().

import pandas as pd

# Create a toy dataset and split it into batches
data_column = pd.Series(['Paris', 'Tokyo', 'Rome', 'London', 'Chicago', 'Paris'])
batch_1 = data_column[:3]
batch_2 = data_column[3:]

# Convert categorical feature column to matrix of dummy variables
batch_1_encoded = pd.get_dummies(batch_1, prefix='City')
batch_2_encoded = pd.get_dummies(batch_2, prefix='City')

# Row-bind (append) the encoded batches back together
final_encoded = pd.concat([batch_1_encoded, batch_2_encoded], axis=0)

# Final wrap-up: columns missing from a batch appear as NaN after concat, so replace NaNs with 0 and convert the flags from float to int
final_encoded = final_encoded.fillna(0)
final_encoded[final_encoded.columns] = final_encoded[final_encoded.columns].astype(int)

final_encoded

Output:

   City_Chicago  City_London  City_Paris  City_Rome  City_Tokyo
0             0            0           1          0           0
1             0            0           0          0           1
2             0            0           0          1           0
3             0            1           0          0           0
4             1            0           0          0           0
5             0            0           1          0           0
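
If you can eventually gather the full list of levels (for example from a single pass over the column on disk), sklearn's OneHotEncoder can instead be pinned to a fixed category list, so every batch is encoded against the same columns. This is only a sketch, assuming scikit-learn 0.20 or newer, where the categories and handle_unknown parameters are available:

import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Full list of levels, gathered beforehand from a pass over the data on disk
all_cities = ['Chicago', 'London', 'Paris', 'Rome', 'Tokyo']

# Pin the encoder to that list so every batch produces the same columns
enc = OneHotEncoder(categories=[all_cities], handle_unknown='ignore')
enc.fit(np.array(all_cities).reshape(-1, 1))

batch_1 = np.array(['Paris', 'Tokyo', 'Rome']).reshape(-1, 1)
batch_2 = np.array(['Paris', 'London']).reshape(-1, 1)

print(enc.transform(batch_1).toarray())  # columns are always Chicago, London, Paris, Rome, Tokyo
print(enc.transform(batch_2).toarray())

The transform output is sparse by default, which also helps keep the memory footprint down when you end up with 15,000+ columns.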
