python - Conceptual problem with inserting csv "large data" into database and data cleaning

I have two big CSV files from Kaggle (200 MB and 3 GB). The data inside the CSVs is of mixed type (text, floats, integers) and "dirty" (empty fields, undecodable characters, etc.).

I have to do a project whose aim is to:

  • use 3 different databases: 1 SQL (I chose MySQL for that) and 2 NoSQL databases,
  • run some statistical and analytical operations on the data inside each database type,
  • compare the execution time for each database (a timing sketch follows this list).
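For the last point, a minimal timing sketch (nothing database-specific is assumed here; `run_mysql_query` and `run_mongo_aggregation` are hypothetical placeholders for whatever query code you end up writing for each database):

    import time

    def time_it(label, func, *args, **kwargs):
        """Run func once and report its wall-clock time in seconds."""
        start = time.perf_counter()
        result = func(*args, **kwargs)
        print(f"{label}: {time.perf_counter() - start:.3f} s")
        return result

    # Example use -- run the same aggregation against each database and compare:
    # time_it("MySQL GROUP BY", run_mysql_query, "SELECT col, AVG(val) FROM t GROUP BY col")
    # time_it("MongoDB $group", run_mongo_aggregation, [{"$group": {"_id": "$col", "avg": {"$avg": "$val"}}}])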

I want to use Python, pandas and NumPy to do what's described above.

My problem is that I don't know how to approach the data cleaning. I have two ideas:

  1. Split the CSV files into chunks of at most 200k rows > read each chunk with pandas > do the cleaning > create the MySQL database and tables > insert the already-clean data from the pandas DataFrames into the tables > do the rest of the calculations in the database (a sketch of this option follows after the list);

or

  2. Create the MySQL database and tables > split the CSV files into chunks of at most 200k rows > insert the "dirty" data from the split CSV files into the tables > do the cleaning in the database > do the rest of the calculations in the database.
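To make option 1 concrete, here is a minimal sketch of the chunked read > clean > insert pipeline, assuming SQLAlchemy with the PyMySQL driver; the connection string, file name, table name and `clean()` logic are placeholders, and the `encoding_errors` argument needs pandas >= 1.3:

    import pandas as pd
    from sqlalchemy import create_engine

    # Assumed connection string -- adjust user, password, host and schema.
    engine = create_engine("mysql+pymysql://user:password@localhost/kaggle_db")

    def clean(df: pd.DataFrame) -> pd.DataFrame:
        """Minimal cleaning: strip whitespace in text columns, drop empty rows, de-duplicate."""
        text_cols = df.select_dtypes(include="object").columns
        df[text_cols] = df[text_cols].apply(lambda col: col.str.strip())
        return df.dropna(how="all").drop_duplicates()

    # Read the CSV in 200k-row chunks so it never has to fit in memory at once;
    # encoding_errors="replace" substitutes undecodable bytes instead of raising.
    for chunk in pd.read_csv("big_file.csv", chunksize=200_000, encoding_errors="replace"):
        clean(chunk).to_sql("my_table", con=engine, if_exists="append", index=False)

Option 2 would instead bulk-load the raw chunks (for example with MySQL's LOAD DATA INFILE) and clean afterwards with SQL UPDATE/DELETE statements; inserting is usually faster that way, but the cleaning logic tends to be easier to express in pandas.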

Does any of this make sense, or am I missing something?

question from: https://stackoverflow.com/questions/65872075/conceptual-problem-with-inserting-csv-large-data-into-database-and-data-cleani


1 Reply

Waiting for answers
