
python - Convert large csv to sparse matrix for use in sklearn

I have a ~30 GB (~1.7 GB compressed; 180K rows x 32K columns) matrix saved in CSV format. I would like to convert this matrix to a sparse format so that the full dataset fits in memory for machine learning with sklearn. The populated cells contain floats less than 1. One caveat: the target variable is stored as the last column of the matrix. What is the best way to make this large matrix usable in sklearn? That is, how can you convert the ~30 GB CSV into a scipy sparse format without loading the original matrix into memory?

Pseudocode

  1. Remove the target variable (keeping row order intact)
  2. Convert the ~30 GB matrix to sparse format (help!!)
  3. Load the sparse matrix and the target variable into memory to run a machine learning pipeline (how would I do this?)

1 Reply


You can build up a sparse matrix row by row in memory pretty easily:

import numpy as np
import scipy.sparse as sps

input_file_name = "something.csv"
sep = ","  # CSV field delimiter

def _process_data(row_array):
    # Hook for any per-row processing, e.g. splitting off the target column.
    return row_array

sp_data = []
with open(input_file_name) as csv_file:
    for row in csv_file:
        # Parse one text line into a 1-D float array.
        data = np.fromstring(row, sep=sep)
        data = _process_data(data)
        # Keep only the non-zero entries of this row.
        data = sps.coo_matrix(data)
        sp_data.append(data)

# Stack the per-row sparse matrices into one sparse matrix.
sp_data = sps.vstack(sp_data)
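
For the specific case in the question, where the target variable sits in the last column, _process_data is the natural place to split it off. A minimal sketch, assuming every row ends with the target value (the targets list is my addition, not part of the original answer):

targets = []

def _process_data(row_array):
    # Assumes the target is the last field of each row, per the question.
    targets.append(row_array[-1])
    return row_array[:-1]

After the vstack you can then hand the data to sklearn, e.g.:

X = sp_data.tocsr()  # CSR is the format most sklearn estimators accept directly
y = np.array(targets)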

This will also be easier to write out to HDF5, which is a far better way to store numbers at this scale than a text file.
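
A minimal sketch of one way to do that with h5py, persisting the three arrays that define a CSR matrix (the helpers save_csr_h5 and load_csr_h5 are hypothetical names of my own, not from the answer):

import h5py
import scipy.sparse as sps

def save_csr_h5(path, matrix):
    # Store the three arrays that fully define a CSR matrix, plus its shape.
    csr = matrix.tocsr()
    with h5py.File(path, "w") as f:
        f.create_dataset("data", data=csr.data)
        f.create_dataset("indices", data=csr.indices)
        f.create_dataset("indptr", data=csr.indptr)
        f.attrs["shape"] = csr.shape

def load_csr_h5(path):
    # Rebuild the CSR matrix from the stored arrays.
    with h5py.File(path, "r") as f:
        return sps.csr_matrix(
            (f["data"][:], f["indices"][:], f["indptr"][:]),
            shape=tuple(f.attrs["shape"]),
        )

If you do not specifically need HDF5, scipy.sparse.save_npz and load_npz are an even simpler way to persist the matrix.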

