ES: implement real substring matching (#500)

...by splitting input into characters instead of whitespace-delimited words. This means you can now match partial words, i.e. real substrings from anywhere in the name: "foo ba" will match "Foo Bar Baz", while previously you needed full words ("foo bar") to match anything. My dev setup incurred an 8% increase in storage usage, from ~13MB to ~14MB (for ~40k torrents). Small change, big improvement. Wonder why I didn't do this at first.
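
To illustrate why per-character tokens allow substring matches, here is a minimal Python sketch. It only simulates the idea and does not call Elasticsearch. The helper names (`char_tokens`, `word_tokens`, `phrase_match`) are hypothetical, and it assumes the quoted search is evaluated as a match on consecutive tokens, which this commit does not show.

```python
import re


def char_tokens(text):
    """One lowercase token per character, mimicking a pattern tokenizer
    with pattern "(.)" and group 1, followed by a lowercase filter."""
    return [m.group(1).lower() for m in re.finditer(r"(.)", text)]


def word_tokens(text):
    """Lowercase whitespace-delimited tokens (the previous behaviour)."""
    return text.lower().split()


def phrase_match(query_tokens, doc_tokens):
    """True if query_tokens appear as one consecutive run in doc_tokens.
    Assumption: this stands in for a phrase-style query."""
    n = len(query_tokens)
    if n == 0:
        return False
    return any(
        doc_tokens[i:i + n] == query_tokens
        for i in range(len(doc_tokens) - n + 1)
    )


if __name__ == "__main__":
    name = "Foo Bar Baz"
    # Character tokens: "foo ba" is a consecutive run inside the name.
    print(phrase_match(char_tokens("foo ba"), char_tokens(name)))
    # Word tokens: "ba" is not a whole word in the name, so no match.
    print(phrase_match(word_tokens("foo ba"), word_tokens(name)))
```

With character tokens, a run of consecutive tokens is exactly a substring, spaces included, which is why the partial word "ba" no longer blocks the match.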
2025-04-10 07:49:26 +00:00 · 2018-06-08 10:59:19 +03:00 · bc1901baa5
parent d407f09cab
commit bc1901baa5
2 changed files with 10 additions and 3 deletions
--- a/es_mapping.yml
+++ b/es_mapping.yml
@@ -24,9 +24,9 @@ settings:
          - my_ngram
          - trim_zero
          - unique
-      # For exact matching - simple lowercase + whitespace delimiter
+      # For exact matching - separate each character for substring matching + lowercase
      exact_analyzer:
-        tokenizer: whitespace
+        tokenizer: exact_tokenizer
        filter:
          - lowercase
      # For matching full words longer than the ngram limit (15 chars)
@@ -43,6 +43,13 @@ settings:
          - fullword_min
          - unique

+    tokenizer:
+      # Splits input into characters, for exact substring matching
+      exact_tokenizer:
+        type: pattern
+        pattern: "(.)"
+        group: 1
+
    filter:
      my_ngram:
        type: edgeNGram
--- a/nyaa/templates/help.html
+++ b/nyaa/templates/help.html
@@ -46,7 +46,7 @@
 	name, but not those which have <em>bar</em> in the name as well.
 </div>
 <div>
-	If you want to search for a several-word expression in its entirety, you can
+	If you want to search for a several-word expression (substring) in its entirety, you can
 	surround searches with <kbd>"</kbd> (double quotes), such as
 	<kbd>"foo bar"</kbd>, which would match torrents named <em>foo bar</em> but not
 	those named <em>bar foo</em>. You may also use the aforementioned <kbd>|</kbd> to group