Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
229 views
in Technique[技术] by (71.8m points)

database design - Optimal data architecture for tagging, clouds, and searching (like StackOverflow)?

I'd love to know how Stack Overflow's tagging and search is architected, because it seems to work pretty well.

What is a good database/search model if I want to do all of the following:

  1. Storing Tags on various entities, (how normalized? i.e. Entity, Tag, and Entity_Tag tables?)
    • Searching for items with particular tags
    • Building a tag cloud of all tags that apply to a particular search result set
    • How to show a tag list for each item in a search result?

Perhaps it makes sense to store the tags in a normalized form, but also as a space-delimited string for the purposes of #2, #4, and perhaps #3. Thoughts?

I have heard it said that Stack Overflow uses Lucene for search. Is that true? I've heard a couple of podcasts discussing SQL optimization, but nothing about Lucene. If they do use Lucene, I'm wondering how much of the search result comes from Lucene, and whether the "drill-down" tag cloud comes from Lucene.

question from:https://stackoverflow.com/questions/234796/optimal-data-architecture-for-tagging-clouds-and-searching-like-stackoverflow

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

Wow I just wrote a big post and SO choked and hung on it, and when I hit my back button to resubmit, the markup editor was empty. aaargh.

So here I go again...

Regarding Stack Overflow, it turns out that they use SQL server 2005 full text search.

Regarding the OS projects recommended by @Grant:

  • *DotNetKicks uses the DB for tagging and Lucene for full-text search. There appears to be no way to combine a full text search with a tag search
  • Kigg uses Linq-to-SQL for both search and tag queries. Both queries join Stories->StoryTags->Tags.
  • Both projects have a 3-table approach to tagging as everyone generally seems to recommend

I also found some other questions on SO that I'd missed before:

What I'm currently doing for each of the items I mentioned:

  1. In the DB, 3 tables: Entity, Tag, Entity_Tag. I use the DB to:
    • Build site-wide tag clouds
    • browse by tag (i.e. urls like SO's /questions/tagged/ASP.NET)
  2. For search I use Lucene + NHibernate.Search
    • Tags are concat'd into a TagString that is indexed by Lucene
      • So I have the full power of the Lucene query engine (AND / OR / NOT queries)
      • I can search for text and filter by tags at the same time
      • The Lucene analyzer merges words for better tag searches (i.e. a tag search for "test" will also find stuff tagged "testing")
    • Lucene returns a potentially enormous result set, which I paginate to 20 results
    • Then NHibernate loads the result Entities by Id, either from the DB or the Entity cache
    • So it's entirely possible that a search results in 0 hits to the DB
  3. Not doing this yet, but I think I will probably try to find a way to build the tag cloud from the TagString in Lucene, rather than take another DB hit
  4. Haven't done this yet either, but I will probably store the TagString in the DB so that I can show an Entity's Tag list without having to make 2 more joins.

This means that whenever an Entity's tags are modified, I have to:

  • Insert any new Tags that do not already exist
  • Insert/Delete from the EntityTag table
  • Update Entity.TagString
  • Update the Lucene index for the Entity

Given that the ratio of reads to writes is very big in my application, I think I'm ok with this. The only really time-consuming part is Lucene indexing, because Lucene can only insert and delete from its index, so I have to re-index the entire entity in order to update the TagString. I'm not excited about that, but I think that if I do it in a background thread, it will be fine.

Time will tell...


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...