sql server - Best Practice to Combine both DB and Lucene Search
I am developing an advanced search engine using .NET. Users can build a query based on several fields:
- title
- content of document
- date from, date to
- modified date from, modified date to
- owner
- location
- other metadata
I am using Lucene to index the document content along with the corresponding IDs; the other metadata resides in the MS SQL DB (to avoid enlarging the index, and to avoid having to update the index whenever the metadata is modified).
How can I perform the search?
Currently, when a user searches for a term:
- I narrow down the search results according to the criteria the user selected by looking them up in the SQL DB.
- I pass the matching IDs to the Lucene searcher web service, which searches for the entered keyword within the document IDs returned by the advanced-search web service.
- Then I look up the relevant metadata for the document IDs (returned by Lucene) in the DB again.
As you can see, there is one lookup in the DB, then Lucene, then the DB again before the values are displayed in the grid.
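To make the three round trips concrete, here is a minimal toy sketch of the flow described above. It is plain Python with in-memory dicts standing in for MS SQL and the Lucene index; all names and sample data are illustrative, not the actual .NET implementation.

```python
# Toy stand-ins: a "DB" holding metadata and a "Lucene index" holding text.
DB = {
    1: {"owner": "alice", "title": "Q3 report"},
    2: {"owner": "bob",   "title": "Q3 budget"},
    3: {"owner": "alice", "title": "Meeting notes"},
}
TEXT_INDEX = {1: "quarterly revenue report", 2: "quarterly budget", 3: "notes"}

def search(owner, keyword):
    # Step 1: narrow by metadata in the DB (may return a huge ID set).
    candidate_ids = [doc_id for doc_id, meta in DB.items()
                     if meta["owner"] == owner]
    # Step 2: full-text search in "Lucene", bounded to the candidate IDs.
    matching_ids = [doc_id for doc_id in candidate_ids
                    if keyword in TEXT_INDEX[doc_id]]
    # Step 3: materialize display metadata from the DB again.
    return [DB[doc_id] for doc_id in matching_ids]

print(search("alice", "revenue"))  # → [{'owner': 'alice', 'title': 'Q3 report'}]
```

The pain point is step 1 handing a potentially enormous ID list to step 2, which the answer below quantifies.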
Questions:
How can I improve this? I thought about starting the search with Lucene, but that has a drawback once the number of indexed documents reaches 2 million. (I think narrowing down the results using the DB first has a large positive effect on performance.)
Another issue is passing the IDs to the Lucene search service: how efficient is it to pass hundreds of thousands of IDs? Is there an alternative solution?
I welcome any ideas; please share your thoughts.
Your current solution incurs the following overhead at query time:
1) Narrowing the search space via MS SQL
- generating the query in your app
- sending it over the wire to MS SQL
- parsing/optimizing/executing the SQL query
- [!!] I/O overhead of returning 100,000s of IDs
2) Executing a bounded full-text search via Lucene.NET
- [!!] Lucene memory overhead of generating/executing a large BooleanQuery containing 100,000s of ID clauses in your app (you'll first need to override the default limit of 1024 clauses just to measure this effect)
- standard Lucene full-text search execution
- returning the matching IDs
3) Materializing result details via MS SQL
- fast, indexed, ID-based lookup of the search-result documents (only needed for the first page of displayed results, ~10-25 records)
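To see why the 1024-clause default matters, here is a rough sketch of the batching arithmetic for step 2. One common workaround (if you keep this architecture) is splitting the ID filter into BooleanQuery-sized chunks; the helper below is a hypothetical illustration in Python, not Lucene.NET API code.

```python
# Lucene's BooleanQuery defaults to a maximum of 1024 clauses, so filtering
# on 100,000s of IDs means either raising that limit (with the memory cost
# noted above) or splitting the ID list into many separate queries.
MAX_CLAUSES = 1024

def id_batches(ids, batch_size=MAX_CLAUSES):
    """Split a large ID list into BooleanQuery-sized chunks."""
    return [ids[i:i + batch_size] for i in range(0, len(ids), batch_size)]

ids = list(range(250_000))   # e.g. 250k IDs returned by the SQL filter
batches = id_batches(ids)
print(len(batches))          # → 245 separate Lucene queries to execute
print(len(batches[-1]))      # → 144 (250000 % 1024) IDs in the final batch
```

Either way, 245 round trips or one quarter-million-clause query is exactly the overhead the answer recommends designing away.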
There are two assumptions you may be making that are worth reconsidering:
A) Indexing the metadata (dates, author, location, etc.) would unacceptably increase the size of the index.
Try this out first: it is the best practice, and you'll massively reduce query execution overhead by letting Lucene do all of the filtering in addition to the text search.
Also, the size of the index has more to do with the cardinality of each field. For example, if you have 500 unique owner names, only 500 strings are stored, and each Lucene document internally references its owner through a symbol-table lookup (a 4-byte integer * 2MM docs + 500 strings = < 8MB additional).
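A back-of-envelope check of that estimate, assuming (illustratively) an average owner-name length of 20 bytes:

```python
# Low-cardinality field overhead ≈ one small integer per document
# plus one stored copy of each unique value.
num_docs = 2_000_000
bytes_per_ref = 4            # 4-byte ordinal per document
unique_owners = 500
avg_owner_len = 20           # assumed average owner-name length in bytes

overhead = num_docs * bytes_per_ref + unique_owners * avg_owner_len
print(overhead / (1024 * 1024))   # ≈ 7.6 MB, i.e. "< 8 MB additional"
```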
B) MS SQL queries are the quickest way to filter on the non-text metadata.
- Reconsider this: with the metadata indexed using the appropriate Lucene types, you won't incur any additional overhead querying Lucene vs. querying MS SQL. (In some cases, Lucene may even be faster.)
- Your mileage may vary, but in my experience this type of filtered full-text search, executed against a Lucene collection of 2MM documents, will typically run in well under 100ms.
So to summarize the best practice:
- Index all of the data you want to query or filter by (no need to store the source data, since MS SQL remains your system of record).
- Run filtered queries against Lucene (e.g. text + date ranges, owner, location, etc.)
- Return the IDs.
- Materialize the documents from MS SQL using the returned IDs.
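The recommended flow collapses the DB pre-filter and the text search into a single query. A toy sketch of that single pass follows, again in Python with a list of dicts standing in for the index; in Lucene.NET this would be one BooleanQuery combining the metadata filters with the text query, so only the final (small) hit list goes back to MS SQL.

```python
# Toy index where every document carries both its text and its metadata,
# so one query does the filtering and the full-text match together.
INDEX = [
    {"id": 1, "owner": "alice", "year": 2013, "text": "quarterly revenue report"},
    {"id": 2, "owner": "bob",   "year": 2013, "text": "quarterly budget"},
    {"id": 3, "owner": "alice", "year": 2011, "text": "old revenue notes"},
]

def filtered_search(keyword, owner=None, year_from=None, year_to=None):
    """Single-pass equivalent of a BooleanQuery with MUST clauses."""
    hits = []
    for doc in INDEX:
        if keyword not in doc["text"]:
            continue
        if owner is not None and doc["owner"] != owner:
            continue
        if year_from is not None and doc["year"] < year_from:
            continue
        if year_to is not None and doc["year"] > year_to:
            continue
        hits.append(doc["id"])
    return hits   # only these IDs go back to MS SQL for materialization

print(filtered_search("revenue", owner="alice", year_from=2012))  # → [1]
```

Note that no ID list crosses the wire until the end, and that list is only as big as the result set.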
I'd also recommend exploring a move to a standalone search server (Solr or Elasticsearch) for a number of reasons:
- You won't have to worry about search-index memory requirements cannibalizing your application's memory requirements.
- You'll take advantage of sophisticated filter caching, performance boosts, and OS-based I/O optimizations.
- You'll be able to iterate on your search solution in a configuration-based environment that is widely used and supported.
- You'll have tools in place to scale/tune/backup/restore your search without impacting your application.