...by splitting input into characters, instead of whitespace-delimited
words. This means you can now match partial words, real substrings from
anywhere: "foo ba" will match "Foo Bar Baz", whereas previously you
needed full words ("foo bar") to match anything.
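As a rough illustration (the index and analyzer names here are made up,
not necessarily what the project uses), you can ask ES what grams a
title produces via the analyze API with elasticsearch-py:

    from elasticsearch import Elasticsearch

    es = Elasticsearch()

    # Inspect what the (hypothetical) character-level ngram analyzer
    # emits for a title.
    resp = es.indices.analyze(
        index="torrents",  # hypothetical index name
        body={"analyzer": "my_ngram_analyzer", "text": "Foo Bar Baz"},
    )
    print(sorted({t["token"] for t in resp["tokens"]}))
    # Expect grams like 'f', 'fo', 'foo', 'b', 'ba', 'bar', 'baz', ...
    # The query "foo ba" splits into "foo" and "ba", each of which is
    # present as a gram, so a boolean AND match succeeds.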
My dev setup incurred an 8% increase in storage usage, from ~13MB to
~14MB (for ~40k torrents).
Small change, big improvement. Wonder why I didn't do this at first.
Before, long.tokens.with.dots.or.dashes would get edge-ngrammed up to the
ngram limit, leaving us with long.tokens.wit, which would then be split,
discarding "with.dots.or.dashes" completely. The fullword index would
keep the complete long token, but without any ngramming, so incomplete
searches (like "tokens") would not match it, only the full token.
Now, we split words before ngramming them, so the main index will
properly handle words up to the ngram limit. The fullword index will
still handle the longer words for non-ngram matching.
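A minimal sketch of that "split first, then ngram" pipeline (all index,
analyzer, and filter names below are illustrative; the project's actual
mapping may differ):

    from elasticsearch import Elasticsearch

    es = Elasticsearch()

    body = {
        "settings": {
            "analysis": {
                "tokenizer": {
                    # split on any non-word character, so
                    # long.tokens.with.dots becomes separate words
                    # *before* any ngramming happens
                    "word_split": {"type": "pattern", "pattern": r"\W+"},
                },
                "filter": {
                    # assumed ngram limit of 15, per the next commit
                    "edge_15": {
                        "type": "edge_ngram",
                        "min_gram": 1,
                        "max_gram": 15,
                    },
                },
                "analyzer": {
                    "ngram_words": {
                        "type": "custom",
                        "tokenizer": "word_split",
                        "filter": ["lowercase", "edge_15"],
                    },
                    "plain_words": {
                        "type": "custom",
                        "tokenizer": "word_split",
                        "filter": ["lowercase"],  # whole words, no ngrams
                    },
                },
            }
        },
        "mappings": {
            "properties": {
                "display_name": {
                    "type": "text",
                    "analyzer": "ngram_words",
                    "search_analyzer": "plain_words",  # don't ngram queries
                    "fields": {
                        "fullword": {
                            "type": "text",
                            "analyzer": "plain_words",
                        },
                    },
                },
            }
        },
    }

    es.indices.create(index="torrents_sketch", body=body)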
Also optimized away duplicate tokens from the indices (since we rely on
boolean matching, not scoring) to save a couple megabytes of space.
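For example (assuming the analyzer sketched above), the built-in unique
token filter is one way to drop those duplicates at index time; since
matching is boolean, term frequencies don't matter and repeated tokens
only cost disk space:

    # Same analyzer as before, with deduplication appended to the chain.
    ngram_words_dedup = {
        "type": "custom",
        "tokenizer": "word_split",
        "filter": ["lowercase", "edge_15", "unique"],
    }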
* Optimize Elasticsearch fullword field
Since the main display_name field ngrams words up to 15 characters,
anything at or under that length is already indexed there; the fullword
field (which we keep for words longer than 15 characters) only needs to
index words longer than that.
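A sketch of that constraint, using the built-in length token filter
(names are again illustrative):

    fullword_analysis = {
        "filter": {
            "longer_than_ngram_limit": {
                # keep only tokens the ngram field can't already cover
                "type": "length",
                "min": 16,
            },
        },
        "analyzer": {
            "fullword": {
                "type": "custom",
                "tokenizer": "word_split",
                "filter": ["lowercase", "longer_than_ngram_limit"],
            },
        },
    }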
* Preprocess ES terms for better literal matching
This commit adds a new .exact subfield to display_name, which holds a
barely-filtered version of the original title that we can do "literal"
matching against. This is not real substring matching, but quoting
terms now actually does something!
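A sketch of what that multi-field mapping could look like (analyzer
names are placeholders):

    display_name_mapping = {
        "display_name": {
            "type": "text",
            "analyzer": "ngram_words",
            "fields": {
                "exact": {
                    "type": "text",
                    # barely filtered: whitespace tokenizer + lowercase,
                    # so the title survives nearly verbatim for
                    # match_phrase ("literal") queries
                    "analyzer": "exact_lowercase",
                },
                "fullword": {"type": "text", "analyzer": "fullword"},
            },
        },
    }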
Implements a simple preprocessor that extracts quoted parts from the
search terms, optionally prefixed with - to negate them. The
preprocessor builds a query that joins all three query types:
the simple_query_string, must-phrases, and must-not-phrases.
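A rough sketch of such a preprocessor (the regex, function, and field
names are illustrative, not the actual implementation):

    import re

    QUOTED = re.compile(r'(-?)"([^"]+)"')

    def build_query(search_terms: str, field: str = "display_name.exact") -> dict:
        """Split out quoted and -negated phrases, then combine them with
        the remaining free-text terms into one bool query."""
        must, must_not = [], []

        def extract(match: re.Match) -> str:
            phrase = {"match_phrase": {field: match.group(2)}}
            (must_not if match.group(1) == "-" else must).append(phrase)
            return " "  # strip the quoted part from the free-text remainder

        remainder = QUOTED.sub(extract, search_terms).strip()
        if remainder:
            must.append({
                "simple_query_string": {
                    "query": remainder,
                    "fields": ["display_name"],
                    "default_operator": "and",
                },
            })
        return {"query": {"bool": {"must": must, "must_not": must_not}}}

    # build_query('foo "Bar Baz" -"qux"') yields one simple_query_string,
    # one must match_phrase and one must_not match_phrase.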
This fixes searching for "Machiavellianism" (16 characters); previously
only the 15-character "Machiavellianis" would match.
Does not (seem to!) break anything, but requires a re-indexing of ES.