Before, long.tokens.with.dots.or.dashes would get edge-ngrammed up to the
ngram limit, so we'd get to long.tokens.wit, which would then be split,
discarding "with.dots.or.dashes" completely. The fullword index would
keep the complete long token, but without any ngramming, so partial
searches (like "tokens") would not match it; only the full token would.
Now, we split words before ngramming them, so the main index will
properly handle words up to the ngram limit. The fullword index will
still handle the longer words for non-ngram matching.
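For illustration, a minimal sketch of the two filter orderings against the
_analyze API. The whitespace tokenizer, word_delimiter filter, and min_gram
value are assumptions for the sketch (elasticsearch-py 8.x style), not
necessarily our exact analyzer config:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

TEXT = "long.tokens.with.dots.or.dashes"
NGRAM = {"type": "edge_ngram", "min_gram": 3, "max_gram": 15}

# Old (broken) order: ngram first, split second. The splitter only ever
# sees prefixes of the first 15 characters, so "with.dots.or.dashes" is lost.
old = es.indices.analyze(tokenizer="whitespace",
                         filter=[NGRAM, "word_delimiter"], text=TEXT)

# New order: split first, then ngram each piece, so "with", "dots", "or"
# and "dashes" all get their own edge ngrams in the main index.
new = es.indices.analyze(tokenizer="whitespace",
                         filter=["word_delimiter", NGRAM], text=TEXT)

print(sorted({t["token"] for t in old["tokens"]}))
print(sorted({t["token"] for t in new["tokens"]}))
```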
Also optimized away duplicate tokens in the indices (since we rely on
boolean matching, not scoring), saving a couple of megabytes of space.
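One way to express the deduplication (a sketch, not necessarily our exact
settings) is Elasticsearch's built-in unique token filter at the end of the
analyzer chain; edge_ngram_15 is a placeholder name for the filter above:

```python
# Hypothetical analyzer settings; names other than "unique" are placeholders.
analysis = {
    "analyzer": {
        "ngram_analyzer": {
            "type": "custom",
            "tokenizer": "whitespace",
            "filter": [
                "word_delimiter",
                "lowercase",
                "edge_ngram_15",
                "unique",  # boolean matching doesn't need duplicate tokens
            ],
        }
    }
}
```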
* Optimize Elasticsearch fullword field
Since the main display_name field ngrams words up to 15 characters,
anything at or under that length is already indexed; the fullword field
(which we keep for words longer than 15 characters) only needs to index
words longer than that.
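A sketch of one way to enforce that cutoff, using Elasticsearch's length
token filter; the analyzer layout and filter names are assumptions:

```python
fullword_settings = {
    "analysis": {
        "filter": {
            # Keep only the tokens the ngram field cannot cover.
            "longer_than_ngram_limit": {"type": "length", "min": 16},
        },
        "analyzer": {
            "fullword": {
                "type": "custom",
                "tokenizer": "whitespace",
                "filter": ["word_delimiter", "lowercase",
                           "longer_than_ngram_limit"],
            }
        },
    }
}
```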
* Preprocess ES terms for better literal matching
This commit adds a new .exact subfield to display_name, which holds a
barely-filtered version of the original title we can do "literal"
matching against. This is not real substring matching, but quoting
terms now actually does something!
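A sketch of what the multi-field mapping could look like; analyzer names
here are illustrative, and the barely-filtered analyzer is assumed to be
little more than tokenization plus lowercasing:

```python
mappings = {
    "properties": {
        "display_name": {
            "type": "text",
            "analyzer": "ngram_analyzer",  # main edge-ngrammed field
            "fields": {
                # Words longer than the 15-char ngram limit only.
                "fullword": {"type": "text", "analyzer": "fullword"},
                # New: barely-filtered text for "literal" phrase matching,
                # e.g. just a whitespace tokenizer plus lowercase.
                "exact": {"type": "text", "analyzer": "exact"},
            },
        }
    }
}
```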
Implements a simple preprocessor that extracts quoted parts from the
search terms, each optionally prefixed with - to negate it. The
preprocessor builds a query joining all three query types: the
simple_query_string, must-phrases, and must-not-phrases.
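A minimal sketch of such a preprocessor; the regex, helper name, and
queried fields are assumptions rather than our exact code:

```python
import re

QUOTED = re.compile(r'(-?)"([^"]+)"')

def build_query(raw: str) -> dict:
    """Split quoted phrases out of the search string and build a bool query."""
    must, must_not = [], []

    def extract(match):
        clause = {"match_phrase": {"display_name.exact": match.group(2)}}
        (must_not if match.group(1) else must).append(clause)
        return " "  # drop the phrase from the remaining free-text terms

    remainder = QUOTED.sub(extract, raw).strip()
    if remainder:
        must.append({"simple_query_string": {
            "query": remainder,
            "fields": ["display_name", "display_name.fullword"],
        }})
    return {"query": {"bool": {"must": must, "must_not": must_not}}}

# build_query('heavy -"dots or dashes" "machiavellianism"') yields a
# simple_query_string on 'heavy', one must-phrase, and one must-not-phrase.
```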
This fixes searching for "Machiavellianism" (16 chars); previously only
the truncated "Machiavellianis" (15 chars) matched.
Does not (seem to!) break anything, but requires a re-indexing of ES.