We have an existing search function that involves data across multiple tables in SQL Server. This causes a heavy load on our DB, so I'm trying to find a better way to search through this data (it doesn't change very often). I have been working with Logstash and Elasticsearch for about a week using an import containing 1.2 million records. My question is essentially, "how do I update existing documents using my 'primary key'"?
CSV data file (pipe delimited) looks like this:
369|90045|123 ABC ST|LOS ANGELES|CA
368|90045|PVKA0010|LA|CA
367|90012|20000 Venice Boulvd|Los Angeles|CA
365|90045|ABC ST 123|LOS ANGELES|CA
363|90045|ADHOCTESTPROPERTY|DALES|CA
My logstash config looks like this:
input {
stdin {
type => "stdin-type"
}
file {
path => ["C:/Data/sample/*"]
start_position => "beginning"
}
}
filter {
csv {
columns => ["property_id","postal_code","address_1","city","state_code"]
separator => "|"
}
}
output {
elasticsearch {
embedded => true
index => "samples4"
index_type => "sample"
}
}
A document in elasticsearch, then looks like this:
{
"_index": "samples4",
"_type": "sample",
"_id": "64Dc0_1eQ3uSln_k-4X26A",
"_score": 1.4054651,
"_source": {
"message": [
"369|90045|123 ABC ST|LOS ANGELES|CA
"
],
"@version": "1",
"@timestamp": "2014-02-11T22:58:38.365Z",
"host": "[host]",
"path": "C:/Data/sample/sample.csv",
"property_id": "369",
"postal_code": "90045",
"address_1": "123 ABC ST",
"city": "LOS ANGELES",
"state_code": "CA"
}
I think would like the unique ID in the _id
field, to be replaced with the value of property_id
. The idea is that subsequent data files would contain updates. I don't need to keep previous versions and there wouldn't be a case where we added or removed keys from a document.
The document_id
setting for elasticsearch output doesn't put that field's value into _id
(it just put in "property_id" and only stored/updated one document). I know I'm missing something here. Am I just taking the wrong approach?
EDIT: WORKING!
Using @rutter's suggestion, I've updated the output
config to this:
output {
elasticsearch {
embedded => true
index => "samples6"
index_type => "sample"
document_id => "%{property_id}"
}
}
Now documents are updating by dropping new files into the data folder as expected. _id
and property_id
are the same value.
{
"_index": "samples6",
"_type": "sample",
"_id": "351",
"_score": 1,
"_source": {
"message": [
"351|90045|Easy as 123 ST|LOS ANGELES|CA
"
],
"@version": "1",
"@timestamp": "2014-02-12T16:12:52.102Z",
"host": "TXDFWL3474",
"path": "C:/Data/sample/sample_update_3.csv",
"property_id": "351",
"postal_code": "90045",
"address_1": "Easy as 123 ST",
"city": "LOS ANGELES",
"state_code": "CA"
}
See Question&Answers more detail:
os