
ElasticSearch: Index vs type and handling updates

I'm pretty familiar with the capabilities of ElasticSearch and its benefits, but this is the first time I'm getting my hands dirty building an index. So I'm eager to get the following approach vetted with experts.

Requirement:

Our application gets metadata about various products from multiple sources. Typically these feeds come in as XML files (file sizes vary from 2 GB to 12 GB, and sometimes a single record is spread across multiple files), and the information in one provider's feed may or may not overlap with the information in others. So we de-duplicate this data, normalize it to a set of common formats depending on the product type, and we need to provide the ability to search against this consolidated data set (obviously this is where ElasticSearch comes in).

All products share certain common identifiers (like id, price, etc.), but the core metadata can look completely different between product types. To quantify this, let's say all products have roughly 30% of their fields in common, while the other 70% differ by product type. There aren't too many product types, and it's safe to assume there will never be more than 10 at any point in time; to start with, the number is much smaller (around 3-4).
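To make the shape of the data concrete, here is a minimal sketch of per-product-type index mappings, assuming ES 7+ style mappings (no mapping types). The index names (products-book, products-movie), the field names and the localhost endpoint are illustrative assumptions only:

    import requests

    ES = "http://localhost:9200"  # assumed local cluster, adjust as needed

    # The ~30% of fields shared by every product type
    common_fields = {
        "id":    {"type": "keyword"},
        "price": {"type": "double"},
        "title": {"type": "text"},
    }

    # Type-specific fields (illustrative names only)
    book_fields  = {"isbn": {"type": "keyword"}, "author": {"type": "text"}}
    movie_fields = {"runtime_minutes": {"type": "integer"}, "director": {"type": "text"}}

    def create_index(name, extra_fields):
        """Create one index per product type, merging common and type-specific fields."""
        body = {"mappings": {"properties": {**common_fields, **extra_fields}}}
        requests.put(f"{ES}/{name}", json=body).raise_for_status()

    create_index("products-book", book_fields)
    create_index("products-movie", movie_fields)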

Additionally, there can be updates coming in at random intervals from these data sources and some of these updates need to be reflected in searches right away (near real time), without bringing down the search capability.

The proposed solution:

I'm considering having different indexes for different product types, based on what I read here: https://www.elastic.co/blog/index-vs-type. So the normalization job will look at the source files, create the normalized structure for the given product type and add it to the appropriate index. The search API that we expose will perform a search against each of these indices for the search term used and consolidate the results into a single JSON response with multiple sections (one section for each product type).
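As a rough sketch of that search API (the index names, field names and localhost endpoint are assumptions on my part): Elasticsearch accepts a comma-separated list of index names in a single _search call, and each hit carries its _index, which can be used to group the results into per-type sections:

    import requests
    from collections import defaultdict

    ES = "http://localhost:9200"                    # assumed local cluster
    INDICES = ["products-book", "products-movie"]   # one index per product type (assumed names)

    def search_all_types(term, size=20):
        """Run one search across all product-type indices and group hits by type."""
        body = {"query": {"match": {"title": term}}, "size": size}
        resp = requests.post(f"{ES}/{','.join(INDICES)}/_search", json=body)
        resp.raise_for_status()

        sections = defaultdict(list)
        for hit in resp.json()["hits"]["hits"]:
            # _index identifies which product-type index the hit came from
            sections[hit["_index"]].append(hit["_source"])
        return sections

Grouping by _index keeps the response shape stable as new product-type indices are added, as long as they are appended to the list (or matched by a wildcard pattern such as products-*).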

For updates, we plan on using the bulk API for updates, inserts and deletes, and given the limitations around REST API call sizes, we will have to make these calls in batches of x MB each.
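A minimal sketch of that batching logic, with a 5 MB budget standing in for the unspecified "x MB" and an (op, index, id, doc) tuple as a made-up input format; the newline-delimited body and the index/update/delete action lines follow the documented _bulk format:

    import json
    import requests

    ES = "http://localhost:9200"          # assumed local cluster
    MAX_BATCH_BYTES = 5 * 1024 * 1024     # stand-in for the "x MB" batch limit

    def flush(lines):
        """Send one _bulk request; the body is newline-delimited JSON and must end with a newline."""
        if not lines:
            return
        body = "\n".join(lines) + "\n"
        resp = requests.post(
            f"{ES}/_bulk",
            data=body.encode("utf-8"),
            params={"refresh": "wait_for"},  # return once the batch is searchable
            headers={"Content-Type": "application/x-ndjson"},
        )
        resp.raise_for_status()

    def bulk_apply(changes):
        """changes: iterable of (op, index, doc_id, doc) where op is 'index', 'update' or 'delete'."""
        lines, size = [], 0
        for op, index, doc_id, doc in changes:
            action = [json.dumps({op: {"_index": index, "_id": doc_id}})]
            if op == "index":
                action.append(json.dumps(doc))           # full document
            elif op == "update":
                action.append(json.dumps({"doc": doc}))  # partial document
            chunk = sum(len(line.encode("utf-8")) + 1 for line in action)
            if lines and size + chunk > MAX_BATCH_BYTES:
                flush(lines)
                lines, size = [], 0
            lines.extend(action)
            size += chunk
        flush(lines)

The refresh=wait_for parameter is one way to address the near-real-time requirement: the call returns once a refresh has made the batch visible to search, without forcing an immediate refresh on every request the way refresh=true would.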

Question:

Is this the best way to organize the data and update it in ElasticSearch (for my use-case)? Would it be better to use multiple types on the same index (example: /products containing products of type typeA, typeB, etc.) instead of creating one index per product type? If so, will the search be significantly faster than searching across indices? Are there better ways to handle the CRUD of records after the index has been created?

Thanks in advance!



1 Reply


First, it is worth noting that mapping types are going away in upcoming ES releases (soft deprecation in ES 6 and removal in ES 7).

Now, whether types go away or not, it is still possible to use a single index; however, you'd increase sparsity, since only 30% of your fields are common, and that should be avoided at all costs.

So, I'd say that your multi-index approach is the only one that makes sense given the nature of your data.
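One hedged addition: an index alias can make the set of per-type indices look like a single searchable name, so the search layer doesn't have to track the index list itself. A minimal sketch, assuming indices named products-<type> and a local cluster:

    import requests

    ES = "http://localhost:9200"   # assumed local cluster

    # Point one alias at every per-type index; the search layer only needs to know "products-all".
    actions = {
        "actions": [
            {"add": {"index": "products-book",  "alias": "products-all"}},
            {"add": {"index": "products-movie", "alias": "products-all"}},
        ]
    }
    requests.post(f"{ES}/_aliases", json=actions).raise_for_status()

    # Searching the alias fans out to every index behind it.
    query = {"query": {"match": {"title": "example"}}}
    requests.post(f"{ES}/products-all/_search", json=query).raise_for_status()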

Additional information worth reading: https://www.elastic.co/guide/en/elasticsearch/reference/master/removal-of-types.html

