sql server - Best Practice to Combine both DB and Lucene Search
I am developing an advanced search engine using .NET. Users can build a query based on several fields:
- title
- content of document
- date from, date to
- modified date from, modified date to
- owner
- location
- other metadata
I am using Lucene to index the document content along with the corresponding IDs; the other metadata resides in the MS SQL DB (to avoid enlarging the index, and to avoid having to update the index whenever the metadata is modified).
How can I perform the search?
Currently, when a user searches for a term:
- I narrow down the search results according to the criteria the user selected by looking them up in the SQL DB.
- I pass the matching IDs to the Lucene searcher web service, which searches for the entered keyword within the document IDs returned by the advanced-search web service.
- Then I look up the relevant metadata for the document IDs (returned by Lucene) in the DB again.
As you can see, there is one lookup in the DB, then Lucene, then the DB again before the values are displayed in the grid.
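To make the three round trips concrete, here is a minimal toy sketch of the flow described above. It is plain Python with in-memory dicts standing in for MS SQL and the Lucene index; all names and sample data are illustrative, not the actual .NET implementation.

```python
# Toy stand-ins: a "DB" holding metadata and a "Lucene index" holding text.
DB = {
    1: {"owner": "alice", "title": "Q3 report"},
    2: {"owner": "bob",   "title": "Q3 budget"},
    3: {"owner": "alice", "title": "Meeting notes"},
}
TEXT_INDEX = {1: "quarterly revenue report", 2: "quarterly budget", 3: "notes"}

def search(owner, keyword):
    # Step 1: narrow by metadata in the DB (may return a huge ID set).
    candidate_ids = [doc_id for doc_id, meta in DB.items()
                     if meta["owner"] == owner]
    # Step 2: full-text search in "Lucene", bounded to the candidate IDs.
    matching_ids = [doc_id for doc_id in candidate_ids
                    if keyword in TEXT_INDEX[doc_id]]
    # Step 3: materialize display metadata from the DB again.
    return [DB[doc_id] for doc_id in matching_ids]

print(search("alice", "revenue"))  # → [{'owner': 'alice', 'title': 'Q3 report'}]
```

The pain point is step 1 handing a potentially enormous ID list to step 2, which the answer below quantifies.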
Questions:
How can I improve this? I thought about starting the search with Lucene, but that has a drawback once the number of indexed documents reaches 2 million. (I think narrowing down the results using the DB first has a large positive effect on performance.)
Another issue is passing the IDs to the Lucene search service: how efficient is it to pass hundreds of thousands of IDs? Is there an alternative solution?
I welcome any ideas; please share your thoughts.
Your current solution incurs the following overhead at query time:
1) Narrowing the search space via MS SQL
- generating the query in your app
- sending it over the wire to MS SQL
- parsing/optimizing/executing the SQL query
- [!!] I/O overhead of returning 100,000s of IDs
2) Executing a bounded full-text search via Lucene.NET
- [!!] Lucene memory overhead of generating/executing a large BooleanQuery containing 100,000s of ID clauses in your app (you'll first need to override the default limit of 1024 clauses just to measure this effect)
- standard Lucene full-text search execution
- returning the matching IDs
3) Materializing result details via MS SQL
- fast, indexed, ID-based lookup of the search-result documents (only needed for the first page of displayed results, ~10-25 records)
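To see why the 1024-clause default matters, here is a rough sketch of the batching arithmetic for step 2. One common workaround (if you keep this architecture) is splitting the ID filter into BooleanQuery-sized chunks; the helper below is a hypothetical illustration in Python, not Lucene.NET API code.

```python
# Lucene's BooleanQuery defaults to a maximum of 1024 clauses, so filtering
# on 100,000s of IDs means either raising that limit (with the memory cost
# noted above) or splitting the ID list into many separate queries.
MAX_CLAUSES = 1024

def id_batches(ids, batch_size=MAX_CLAUSES):
    """Split a large ID list into BooleanQuery-sized chunks."""
    return [ids[i:i + batch_size] for i in range(0, len(ids), batch_size)]

ids = list(range(250_000))   # e.g. 250k IDs returned by the SQL filter
batches = id_batches(ids)
print(len(batches))          # → 245 separate Lucene queries to execute
print(len(batches[-1]))      # → 144 (250000 % 1024) IDs in the final batch
```

Either way, 245 round trips or one quarter-million-clause query is exactly the overhead the answer recommends designing away.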
There are two assumptions you may be making that are worth reconsidering:
A) Indexing the metadata (dates, author, location, etc.) would unacceptably increase the size of the index.
Try this out first: it is the best practice, and you'll massively reduce query execution overhead by letting Lucene do all of the filtering in addition to the text search.
Also, the size of the index has more to do with the cardinality of each field. For example, if you have 500 unique owner names, only 500 strings are stored, and each Lucene document internally references its owner through a symbol-table lookup (a 4-byte integer * 2MM docs + 500 strings = < 8MB additional).
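A back-of-envelope check of that estimate, assuming (illustratively) an average owner-name length of 20 bytes:

```python
# Low-cardinality field overhead ≈ one small integer per document
# plus one stored copy of each unique value.
num_docs = 2_000_000
bytes_per_ref = 4            # 4-byte ordinal per document
unique_owners = 500
avg_owner_len = 20           # assumed average owner-name length in bytes

overhead = num_docs * bytes_per_ref + unique_owners * avg_owner_len
print(overhead / (1024 * 1024))   # ≈ 7.6 MB, i.e. "< 8 MB additional"
```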
B) MS SQL queries are the quickest way to filter on the non-text metadata.
- Reconsider this: with the metadata indexed using the appropriate Lucene types, you won't incur any additional overhead querying Lucene vs. querying MS SQL. (In some cases, Lucene may even be faster.)
- Your mileage may vary, but in my experience this type of filtered full-text search, executed against a Lucene collection of 2MM documents, will typically run in well under 100ms.
So to summarize the best practice:
- Index all of the data you want to query or filter by (no need to store the source data, since MS SQL remains your system of record).
- Run filtered queries against Lucene (e.g. text + date ranges, owner, location, etc.)
- Return the IDs.
- Materialize the documents from MS SQL using the returned IDs.
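The recommended flow collapses the DB pre-filter and the text search into a single query. A toy sketch of that single pass follows, again in Python with a list of dicts standing in for the index; in Lucene.NET this would be one BooleanQuery combining the metadata filters with the text query, so only the final (small) hit list goes back to MS SQL.

```python
# Toy index where every document carries both its text and its metadata,
# so one query does the filtering and the full-text match together.
INDEX = [
    {"id": 1, "owner": "alice", "year": 2013, "text": "quarterly revenue report"},
    {"id": 2, "owner": "bob",   "year": 2013, "text": "quarterly budget"},
    {"id": 3, "owner": "alice", "year": 2011, "text": "old revenue notes"},
]

def filtered_search(keyword, owner=None, year_from=None, year_to=None):
    """Single-pass equivalent of a BooleanQuery with MUST clauses."""
    hits = []
    for doc in INDEX:
        if keyword not in doc["text"]:
            continue
        if owner is not None and doc["owner"] != owner:
            continue
        if year_from is not None and doc["year"] < year_from:
            continue
        if year_to is not None and doc["year"] > year_to:
            continue
        hits.append(doc["id"])
    return hits   # only these IDs go back to MS SQL for materialization

print(filtered_search("revenue", owner="alice", year_from=2012))  # → [1]
```

Note that no ID list crosses the wire until the end, and that list is only as big as the result set.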
I'd also recommend exploring a move to a standalone search server (Solr or Elasticsearch) for a number of reasons:
- You won't have to worry about search-index memory requirements cannibalizing your application's memory requirements.
- You'll take advantage of sophisticated filter caching, performance boosts, and OS-based I/O optimizations.
- You'll be able to iterate on your search solution in a configuration-based environment that is widely used and supported.
- You'll have tools in place to scale/tune/backup/restore your search without impacting your application.