sql server - Best Practice to Combine both DB and Lucene Search


I am developing an advanced search engine in .NET. Users can build a query based on several fields:

  1. title
  2. document content
  3. date range (from/to)
  4. modified-date range (from/to)
  5. owner
  6. location
  7. other metadata

I use Lucene to index the document content along with the corresponding IDs. The other metadata, however, resides in an MS SQL database (to avoid enlarging the index, and to avoid having to update the index whenever metadata is modified).

How should I perform the search?

Currently, when a user searches for a term:

  • First, narrow down the results according to the criteria the user selected by querying the SQL database.
  • Then pass the matching IDs to the Lucene searcher web service, which searches for the entered keyword within the document IDs returned by the advanced-search web service.
  • Finally, look up the relevant metadata for the document IDs returned by Lucene in the database again.

As you can see, there is one lookup in the database, then one in Lucene, and then one more lookup in the database for the values displayed in the grid.
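The three-step flow above amounts to something like the following sketch, using plain Python dicts as stand-ins for MS SQL and Lucene. All function names, field names, and data are illustrative, not the actual services:

```python
# Sketch of the current DB -> Lucene -> DB pipeline, with in-memory
# dicts standing in for MS SQL and Lucene. All names are illustrative.

# "MS SQL": document metadata keyed by document ID
SQL_METADATA = {
    1: {"owner": "alice", "location": "NY", "title": "Q1 report"},
    2: {"owner": "bob",   "location": "LA", "title": "Q2 report"},
    3: {"owner": "alice", "location": "NY", "title": "design notes"},
}

# "Lucene": inverted index mapping a term to the IDs that contain it
LUCENE_INDEX = {"report": {1, 2}, "design": {3}, "notes": {3}}

def filter_ids_in_sql(criteria):
    """Step 1: narrow the search space by metadata (potentially huge)."""
    return {doc_id for doc_id, meta in SQL_METADATA.items()
            if all(meta.get(k) == v for k, v in criteria.items())}

def search_lucene_bounded(keyword, allowed_ids):
    """Step 2: full-text search restricted to the IDs from step 1."""
    return LUCENE_INDEX.get(keyword, set()) & allowed_ids

def fetch_metadata(doc_ids):
    """Step 3: materialize the result rows for display in the grid."""
    return [SQL_METADATA[d] for d in sorted(doc_ids)]

allowed = filter_ids_in_sql({"owner": "alice"})
hits = search_lucene_bounded("report", allowed)
print(fetch_metadata(hits))
# -> [{'owner': 'alice', 'location': 'NY', 'title': 'Q1 report'}]
```

The pain point is the hand-off between steps 1 and 2: `allowed` can contain hundreds of thousands of IDs that must cross a service boundary.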

questions:

How can I improve this design? I thought about starting with the Lucene search instead, but that has a drawback once the number of indexed documents reaches 2 million. (I also think narrowing down the results using the database first will have a large effect on performance.)

Another issue is passing the IDs to the Lucene search service: how efficient is passing hundreds of thousands of IDs, and is there an alternative solution?

Any ideas are welcome; please share your thoughts.

Your current solution incurs the following overhead at query time:

1) Narrowing the search space via MS SQL

  • generating the query in the application
  • sending it over the wire to MS SQL
  • parsing/optimizing/executing the SQL query
  • [!!] the I/O overhead of returning hundreds of thousands of IDs

2) Executing a bounded full-text search via Lucene.NET

  • [!!] Lucene's memory overhead when generating/executing a large BooleanQuery containing hundreds of thousands of ID clauses in the application (you'll need to override the default limit of 1024 clauses first just to measure the effect)
  • standard Lucene full-text search execution
  • returning the matching IDs
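A back-of-envelope calculation shows why that BooleanQuery is the [!!] item. The per-clause byte count below is a rough assumption (clause wrapper, term object, headers), not a measured Lucene figure, so treat the output as an order-of-magnitude estimate only:

```python
# Rough per-query memory cost of a BooleanQuery with one clause per ID.
# bytes_per_clause is an assumed figure, not a measured Lucene value.
ids_per_query = 200_000       # IDs surviving the SQL pre-filter
bytes_per_clause = 150        # assumed overhead per ID clause
concurrent_queries = 10       # queries in flight at peak

per_query_mb = ids_per_query * bytes_per_clause / 1024**2
total_mb = per_query_mb * concurrent_queries
print(f"~{per_query_mb:.0f} MB per query, ~{total_mb:.0f} MB at peak")
# -> ~29 MB per query, ~286 MB at peak
```

Even with a generous error margin on the assumed constant, building and garbage-collecting tens of megabytes of query objects per search does not scale, and that cost is paid before Lucene even starts matching.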

3) Materializing result details via MS SQL

  • a fast, indexed, ID-based lookup of the search-result documents (only needed for the first page of displayed results, ~10-25 records)

There are two assumptions you may be making that are worth reconsidering:

a) Indexing the metadata (dates, author, location, etc.) would unacceptably increase the size of the index.

  • Try it out first: it's best practice, and you'll massively reduce query execution overhead by letting Lucene do all of the filtering in addition to the text search.

  • Also, the size of the index depends on the cardinality of each field. For example, if you have 500 unique owner names, only 500 strings are stored, and each Lucene document internally references its owner through a symbol-table lookup (a 4-byte integer × 2MM docs + 500 strings = less than 8MB additional).
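Checking that arithmetic: the per-document cost of a low-cardinality field is roughly one small integer ordinal, plus one stored copy of each unique value. The 20-byte average owner-name length is an assumption:

```python
# Size estimate for adding one low-cardinality field (e.g. owner)
# to a 2MM-document index: one ordinal per document plus one copy
# of each unique value. avg_owner_len is an assumed figure.
num_docs = 2_000_000
unique_owners = 500
bytes_per_ordinal = 4         # one small int per document
avg_owner_len = 20            # assumed average bytes per owner name

extra_bytes = num_docs * bytes_per_ordinal + unique_owners * avg_owner_len
print(f"{extra_bytes / 1024**2:.1f} MiB added to the index")
# -> 7.6 MiB added to the index
```

The dominant term is the 4-byte ordinal per document; the 500 unique strings themselves are noise (about 10 KB). The same reasoning applies to location and other enumerated fields.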

b) MS SQL queries are the quickest way to filter on the non-text metadata.

  • Reconsider this: with the metadata indexed using appropriate Lucene types, you won't incur any additional overhead querying Lucene vs. querying MS SQL. (In some cases, Lucene may even be faster.)
  • Your mileage may vary, but in my experience, this type of filtered full-text search, executed against a Lucene collection of 2MM documents, will typically run in well under 100ms.

So to summarize, the best practice is:

  • Index all of the data you want to query or filter by. (There is no need to store the source data in the index, since MS SQL remains your system of record.)

  • Run filtered queries against Lucene (i.e. text plus date ranges, owner, location, etc.)

  • Return the matching IDs.

  • Materialize the documents from MS SQL using the returned IDs.
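The recommended flow can be sketched the same way as the original one, again with in-memory stand-ins and illustrative names. Note that there is now a single search call carrying only the filter criteria, never a list of IDs, and SQL only materializes one page:

```python
# Sketch of the recommended flow: the search index handles both the
# keyword match and the metadata filters in one pass; SQL only
# materializes one page of hits. All names and data are illustrative.

# "Lucene" documents: ID plus every field you want to filter by
INDEXED_DOCS = [
    {"id": 1, "terms": {"quarterly", "report"}, "owner": "alice", "year": 2013},
    {"id": 2, "terms": {"quarterly", "report"}, "owner": "bob",   "year": 2012},
    {"id": 3, "terms": {"design", "notes"},     "owner": "alice", "year": 2013},
]

# "MS SQL": the system of record, keyed by document ID
SQL_ROWS = {1: "Q1 report row", 2: "Q2 report row", 3: "design notes row"}

def filtered_search(keyword, owner=None, year=None):
    """One pass over the index: text match plus metadata filters."""
    return [d["id"] for d in INDEXED_DOCS
            if keyword in d["terms"]
            and (owner is None or d["owner"] == owner)
            and (year is None or d["year"] == year)]

def materialize_page(ids, page_size=25):
    """Fetch only the displayed page from the system of record."""
    return [SQL_ROWS[i] for i in ids[:page_size]]

ids = filtered_search("report", owner="alice", year=2013)
print(materialize_page(ids))  # -> ['Q1 report row']
```

Compared with the original pipeline, the SQL round trip that returned hundreds of thousands of IDs and the giant BooleanQuery both disappear; the only ID list that crosses a boundary is the one page of results.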


I'd also recommend exploring a move to a standalone search server (Solr or Elasticsearch) for a number of reasons:

  1. You won't have to worry about the search index's memory requirements cannibalizing your application's memory.
  2. You'll take advantage of sophisticated filter caching for performance boosts, and of OS-based I/O optimizations.
  3. You'll be able to iterate on your search solution in a widely used and well-supported configuration-based environment.
  4. You'll have tools in place to scale/tune/backup/restore your search infrastructure without impacting your application.
