mirror of https://gitlab.com/SIGBUS/nyaa.git synced 2025-01-09 17:24:09 +00:00
Commit graph

11 commits

Author SHA1 Message Date
Anna-Maria Meriniemi bc1901baa5 ES: implement real substring matching (#500)
...by splitting input into characters, instead of whitespace delimited
words. This means you can now match partial words, real substrings from
anywhere: "foo ba" will match "Foo Bar Baz", while previously you had to
have full words ("foo bar") to match anything.

My dev setup incurred an 8% increase in storage usage, from ~13MB to
~14MB (for ~40k torrents).
Small change, big improvement. Wonder why I didn't do this at first.
2018-06-08 00:59:19 -07:00
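
The character-level matching described in bc1901baa5 can be illustrated with a small plain-Python sketch. It is not the project's Elasticsearch configuration: the helper names `char_ngrams` and `matches` are hypothetical, and the n-gram limit of 15 is borrowed from the older commit messages further down, so it may not be the value this commit used.

```python
def char_ngrams(text, min_n=1, max_n=15):
    """Return every character n-gram of the lowercased text (hypothetical helper)."""
    text = text.lower()
    grams = set()
    for n in range(min_n, max_n + 1):
        for start in range(len(text) - n + 1):
            grams.add(text[start:start + n])
    return grams


def matches(query, title, max_n=15):
    """True if the lowercased query is one of the title's character n-grams."""
    return query.lower() in char_ngrams(title, max_n=max_n)


if __name__ == "__main__":
    # Whitespace is part of the n-grams, so a query can span a word boundary.
    print(matches("foo ba", "Foo Bar Baz"))   # True
    print(matches("foo baz", "Foo Bar Baz"))  # False: not a contiguous substring
```

Because whitespace is n-grammed along with everything else, "foo ba" is simply one of the indexed n-grams of "Foo Bar Baz". The price is more tokens per title, which fits the storage increase the message reports.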
Anna-Maria Meriniemi 59db958977 ES: delimit words before ngram, optimize tokens (#487)
Before, long.tokens.with.dots.or.dashes would get edgengrammed up to the
ngram limit, so we'd get to long.tokens.wit which would then be split -
discarding "with.dots.or.dashes" completely. The fullword index would
keep the complete large token, but without any ngramming, so incomplete
searches (like "tokens") would not match it, only the full token.

Now, we split words before ngramming them, so the main index will
properly handle words up to the ngram limit. The fullword index will
still handle the longer words for non-ngram matching.

Also optimized away duplicate tokens from the indices (since we rely on
boolean matching, not scoring) to save a couple megabytes of space.
2018-04-28 18:09:40 -07:00
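
The ordering problem described in 59db958977 can be reproduced in a few lines of plain Python. This is only a sketch under assumptions: the limit of 15 comes from the other commit messages, splitting on every non-alphanumeric character is a guess at the delimiter set, and the function names are hypothetical rather than the project's analyzer names.

```python
import re

NGRAM_LIMIT = 15  # assumed from the neighbouring commit messages


def edge_ngrams(token, limit=NGRAM_LIMIT):
    """Prefixes of the token, from length 1 up to the limit."""
    return [token[:n] for n in range(1, min(len(token), limit) + 1)]


def split_words(text):
    """Split on anything that is not a letter or digit (assumed delimiter set)."""
    return [w for w in re.split(r"[^0-9a-z]+", text.lower()) if w]


def ngram_then_split(token):
    """Old order: edge-ngram the whole token, then split each n-gram."""
    tokens = set()
    for gram in edge_ngrams(token):
        tokens.update(split_words(gram))
    return tokens


def split_then_ngram(token):
    """New order: split into words first, then edge-ngram each word."""
    tokens = set()
    for word in split_words(token):
        tokens.update(edge_ngrams(word))
    return tokens


if __name__ == "__main__":
    name = "long.tokens.with.dots.or.dashes"
    print(edge_ngrams(name)[-1])             # long.tokens.wit
    print("dots" in ngram_then_split(name))  # False
    print("dots" in split_then_ngram(name))  # True
```

Collecting tokens in a set also drops duplicates, which mirrors the deduplication the message mentions: with boolean matching, a repeated token adds storage but no information.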
Anna-Maria Meriniemi 0b78428abc [ES Change] Improve Elasticsearch term quoting (#473)
* Optimize Elasticsearch fullword field

Since the main display_name field ngrams words up to 15 characters,
anything at or under that length is already indexed - the fullword field
(which we have for words longer than 15 characters) needs to index only
words longer than that.

* Preprocess ES terms for better literal matching

This commit adds a new .exact subfield to display_name, which holds a
barely-filtered version of the original title we can do "literal"
matching against. This is not real substring matching, but quoting
terms now actually does something!

Implements a simple preprocessor for the search terms to extract quoted
parts from the search terms, optionally prefixed with - to negate them.
The preprocessor will create a query that'll join all three query-types:
the simple_query_string, must-phrases and must-not-phrases.
2018-04-13 17:06:25 -07:00
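
The search-term preprocessor described in 0b78428abc might look roughly like the sketch below. The field names `display_name` and `display_name.exact` and the `simple_query_string` query type come from the message. Everything else is an assumption made for illustration: the regular expression, the function names `preprocess` and `build_query`, and the use of `match_phrase` inside a `bool` query for the quoted parts.

```python
import re

# Optional leading "-" followed by a double-quoted phrase (assumed syntax).
QUOTED = re.compile(r'(-?)"([^"]+)"')


def preprocess(terms):
    """Split raw search terms into (unquoted rest, must phrases, must-not phrases)."""
    must, must_not = [], []

    def collect(match):
        target = must_not if match.group(1) else must
        target.append(match.group(2))
        return " "  # remove the quoted part from the remaining terms

    rest = " ".join(QUOTED.sub(collect, terms).split())
    return rest, must, must_not


def build_query(terms):
    """Join the three query types into one bool query body (a plain dict)."""
    rest, must, must_not = preprocess(terms)
    clauses = {"must": [], "must_not": []}
    if rest:
        clauses["must"].append(
            {"simple_query_string": {"query": rest, "fields": ["display_name"]}}
        )
    for phrase in must:
        clauses["must"].append({"match_phrase": {"display_name.exact": phrase}})
    for phrase in must_not:
        clauses["must_not"].append({"match_phrase": {"display_name.exact": phrase}})
    return {"bool": clauses}


if __name__ == "__main__":
    print(preprocess('foo "bar baz" -"qux quux" end'))
    # ('foo end', ['bar baz'], ['qux quux'])
```

The unquoted remainder keeps going through `simple_query_string`, so existing search behaviour is unchanged for queries that contain no quotes.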
TheAMM 2d0cf7cbb4 [ES Schema change] Multi-field search display_name to match words over ngram limit
This fixes searching for "Machiavellianism", 16 chars ("Machiavellianis", 15 chars, worked previously).
Does not (seem to!) break anything, but requires a re-indexing of ES.
2017-06-05 17:29:00 +03:00
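
Why a 16-character word needed a second field can be shown with a toy model of the two fields from 2d0cf7cbb4. This sketch assumes the main field stores word prefixes up to the 15-character limit and the extra field stores each whole word. The real analyzers may differ, and the function names are hypothetical.

```python
NGRAM_LIMIT = 15  # limit stated in the commit messages


def main_field_tokens(word):
    """Assumed main field: prefixes of the word up to the n-gram limit."""
    word = word.lower()
    return {word[:n] for n in range(1, min(len(word), NGRAM_LIMIT) + 1)}


def fullword_tokens(word):
    """Assumed fullword field: the whole word, regardless of length."""
    return {word.lower()}


def multi_field_match(query, word):
    """Match if either field contains the query as a token."""
    query = query.lower()
    return query in main_field_tokens(word) or query in fullword_tokens(word)


if __name__ == "__main__":
    word = "Machiavellianism"  # 16 characters
    print("machiavellianis" in main_field_tokens(word))   # True
    print("machiavellianism" in main_field_tokens(word))  # False
    print(multi_field_match("machiavellianism", word))    # True
```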
aldacron 535be9c8bd Fixes #227 2017-06-04 23:03:32 -07:00
TheAMM 9cd6c506ae Update Elasticsearch index and scripts for comment_count 2017-05-26 16:12:47 +03:00
aldacron 142dd5359c Resolves #129 and refactored create magnet es naming 2017-05-24 23:19:08 -07:00
aldacron 6b4d487314 updated indices 2017-05-18 01:58:08 -07:00
aldacron 6ad43bbcaa Reverted previous commit for mapping 2017-05-16 22:53:03 -07:00
aldacron b2a7b49757 changed es mapping to disable fields that don't need querying 2017-05-16 22:12:58 -07:00
aldacron c2c547e786 some more elasticsearch work, including index mapping and analyzer 2017-05-15 11:14:01 -07:00