I'd recommend using Solr to process your data, it supports the indexing of CSV data.
Solr solves the problems associated with uploading, downloading and searching the data. It's performance is such that the index can be easily updated or built again from scratch.
Example
Install Solr
Read the documentation for how to install Solr. The following is a "kick-start" to make this demo work on Linux:
wget http://www.apache.org/dist/lucene/solr/3.5.0/apache-solr-3.5.0.tgz
tar zxvf apache-solr-3.5.0.tgz
cd apache-solr-3.5.0/example
java -jar start.jar
Solr admin screen available on the following URL
http://localhost:8983/solr/admin/
Generate sample data
Wrote a groovy script to generate a sample CSV file.
new File("data.csv").withWriter { writer ->
writer.println "id;A_s;B_i;C_s;D_s;E_s;F_s"
for (i in 1..500000) {
writer.println "${i};${i*10};${i*20};${i*30};${i*40};${i*50};${i*60}"
}
}
Example
id;A_s;B_i;C_s;D_s;E_s;F_s
1;10;20;30;40;50;60
2;20;40;60;80;100;120
3;30;60;90;120;150;180
4;40;80;120;160;200;240
5;50;100;150;200;250;300
6;60;120;180;240;300;360
7;70;140;210;280;350;420
8;80;160;240;320;400;480
9;90;180;270;360;450;540
..
Note:
I performed no customisation to the "out of the box" Solr settings. This meant
- Default Solr configuration requires a unique id field
- Appending "_s" to the column name indicates string fields. See dynamic fields feature in Solr
- Loaded Column B as an integer (B_i) to enable numeric sorting
See the Solr Wiki for details on how to write a custom schema.
Load CSV file
The Linux curl command is used to load the data.csv file, by performing a HTTP post operation:
$ curl 'http://localhost:8983/solr/update/csv?separator=;&commit=true' -H 'Content-type:text/plain; charset=utf-8' --data-binary @data.csv
Note:
- 500,000 rows of data loaded in approx 90 seconds
- Use separator parameter to indicate the data is not separated by the default ","
- Use commit parameter to indicate data is committed to index at the end.
Query the data
The following query returns the top 50 rows sorted on column B
http://localhost:8983/solr/select/?q=*:*&rows=50&sort=B_i+desc&fl=id,A_s,B_i,C_s,D_s,E_s,F_s&wt=csv
Output is CSV formatted:
id,A_s,B_i,C_s,D_s,E_s,F_s
500000,5000000,10000000,15000000,20000000,25000000,30000000
499999,4999990,9999980,14999970,19999960,24999950,29999940
499998,4999980,9999960,14999940,19999920,24999900,29999880
499997,4999970,9999940,14999910,19999880,24999850,29999820
499996,4999960,9999920,14999880,19999840,24999800,29999760
..
REST parameters:
|-------------------------------|------------------------------|
| Rest parameter | Description |
|-------------------------------|------------------------------|
| q=*:* | Everything. |
| rows=50 | Number of rows to output |
| sort=B_i+desc | descending order, sort on B_i|
| fl=id,A_s,B_i,C_s,D_s,E_s,F_s | columns to include |
| wt=csv | CSV output |
|-------------------------------|------------------------------|