I've got a 7.4 GB CSV file. After converting it to an SQLite database with a Python script, the output DB is 4.7 GB, about 64% of the original size.
The CSV has around 150,000,000 rows. It has the header:
tkey,ipaddr,healthtime,numconnections,policystatus,activityflag
And each row looks something like
261846,172.10.28.15,2012-02-03 16:15:00,22,1,1
The script uses healthtime to split the data into 192 tables.
When I first saw these numbers, I assumed I had made an error someplace. How big of a reduction in file size should I expect from the added efficiency of only writing the healthtime 192 times instead of 150,000,000 times?
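For concreteness, here is a minimal sketch of the kind of conversion script described above (the file names, table-naming scheme, and column types are my assumptions, not the actual script):

    import csv
    import sqlite3

    # Sketch: convert the CSV to SQLite, creating one table per distinct
    # healthtime value, so healthtime is encoded in the table name instead
    # of being stored in every row. Names here are illustrative assumptions.
    conn = sqlite3.connect("output.db")
    cur = conn.cursor()
    created = set()

    with open("input.csv", newline="") as f:
        reader = csv.DictReader(f)
        for row in reader:
            # e.g. "2012-02-03 16:15:00" -> "t_2012_02_03_16_15_00"
            table = "t_" + row["healthtime"].replace("-", "_").replace(" ", "_").replace(":", "_")
            if table not in created:
                cur.execute(
                    f"CREATE TABLE IF NOT EXISTS {table} "
                    "(tkey INTEGER, ipaddr TEXT, numconnections INTEGER, "
                    "policystatus INTEGER, activityflag INTEGER)"
                )
                created.add(table)
            cur.execute(
                f"INSERT INTO {table} VALUES (?, ?, ?, ?, ?)",
                (int(row["tkey"]), row["ipaddr"], int(row["numconnections"]),
                 int(row["policystatus"]), int(row["activityflag"])),
            )

    conn.commit()
    conn.close()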
EDIT: As soon as I posted this I realized the answer. I'm removing about 40% of the string, hence the 40% reduction in size.
Edit 2: Let's calculate the difference in size between the plain-text row:
"261846,172.10.28.15,2012-02-03 16:15:00,22,1,1"
And the database entry:
db(261846,'172.10.28.15',22,1,1)
First of all, we drop from 46 to 26 characters in the plain-text representation.
The remaining characters are:
"261846,172.10.28.15,22,1,1"
or 26 bytes. If each integer needs to be stored as a 32-bit value (4 bytes), then we have:
12 bytes (ipaddr) + 4 bytes * 4 (integer fields) = 28 bytes.
So it looks like converting to integers makes the storage slightly less efficient, and all of my gains come from reducing the number of characters stored in each row.
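A quick sanity check of that arithmetic (a rough approximation: it assumes 4 bytes per integer and ignores SQLite's actual record encoding and per-row overhead):

    # Rough per-row size comparison, assuming 4 bytes per stored integer.
    csv_row = "261846,172.10.28.15,2012-02-03 16:15:00,22,1,1"
    csv_row_no_healthtime = "261846,172.10.28.15,22,1,1"

    print(len(csv_row))                 # 46 characters in the original CSV row
    print(len(csv_row_no_healthtime))   # 26 characters once healthtime is dropped

    db_row = len("172.10.28.15") + 4 * 4  # 12 bytes of TEXT + 4 x 4-byte integers
    print(db_row)                         # 28 bytes per stored row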