nyaa/es_mapping.yml

---
# CREATE DATABASE/TABLE equivalent for Elasticsearch, in YAML
# for inline comments.
settings:
  analysis:
    analyzer:
      my_search_analyzer:
        type: custom
        tokenizer: standard
        char_filter:
          - my_char_filter
        filter:
          - standard
          - lowercase
      my_index_analyzer:
        type: custom
        tokenizer: standard
        char_filter:
          - my_char_filter
        filter:
          - resolution
          - lowercase
          - word_delimit
          - my_ngram
          - trim_zero
          - unique
      # For exact matching - separate each character for substring matching + lowercase
      exact_analyzer:
        tokenizer: exact_tokenizer
        filter:
          - lowercase
      # For matching full words longer than the ngram limit (15 chars)
      my_fullword_index_analyzer:
        type: custom
        tokenizer: standard
        char_filter:
          - my_char_filter
        filter:
          - lowercase
          - word_delimit
          # Skip tokens shorter than N characters,
          # since they're already indexed in the main field
          - fullword_min
          - unique

    tokenizer:
      # Splits input into characters, for exact substring matching
      exact_tokenizer:
        type: pattern
        pattern: "(.)"
        group: 1
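        # Example (a sketch of the expected token stream, not verified
        # output): "Foo Bar" is split into "F", "o", "o", " ", "B", "a", "r",
        # and exact_analyzer then lowercases each token. A phrase query for
        # "foo ba" therefore lines up with consecutive characters of
        # "Foo Bar Baz".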

    filter:
      my_ngram:
        type: edgeNGram
        min_gram: 1
        max_gram: 15
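        # Example (illustrative): "machiavellianism" (16 chars) yields the
        # prefixes "m", "ma", ... up to "machiavellianis" (15 chars). The
        # complete 16-char word is never emitted by this filter, so it can
        # only be matched through the display_name.fullword subfield.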
      fullword_min:
        type: length
        # Remember to change this if you change the max_gram above!
        # (min should stay at max_gram + 1)
        min: 16
      resolution:
        type: pattern_capture
        patterns: ["(\\d+)[xX](\\d+)"]
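        # Example (assuming pattern_capture's default preserve_original):
        # the token "1920x1080" is kept, and "1920" and "1080" are added
        # as separate tokens so either dimension can be searched alone.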
      trim_zero:
        type: pattern_capture
        patterns: ["0*([0-9]*)"]
      word_delimit:
        type: word_delimiter
        preserve_original: true
        split_on_numerics: false
    char_filter:
      my_char_filter:
        type: mapping
        mappings: ["-=>_", "!=>_", "_=>\\u0020"]
  index:
    # we're running a single es node, so no sharding necessary,
    # plus replicas don't really help either.
    number_of_shards: 1
    number_of_replicas: 0
    mapper:
      # disable elasticsearch's "helpful" autoschema
      dynamic: false
    # since we disabled the _all field, default query the
    # name of the torrent.
    query:
      default_field: display_name
mappings:
  torrent:
    # don't want everything concatenated
    _all:
      enabled: false
    properties:
      id:
        type: long
      display_name:
        # TODO: could do a fancier tokenizer here to parse out the
        # scene convention of stuff in brackets, plus stuff like k-on
        type: text
        analyzer: my_index_analyzer
        fielddata: true # Is this required?
        fields:
          # Multi-field for full-word matching (when going over ngram limits)
          # Note: will have to be queried for, not automatic
          fullword:
            type: text
            analyzer: my_fullword_index_analyzer
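          # Illustrative only (the query shape is an assumption, not taken
          # from the application code): a simple_query_string search would
          # need to list both fields explicitly, e.g.
          #   fields: [display_name, display_name.fullword]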
          # Stored for exact phrase matching
          exact:
            type: text
            analyzer: exact_analyzer
      created_time:
        type: date
      # Only in the ES index for generating magnet links
      info_hash:
        enabled: false
      filesize:
        type: long
      anonymous:
        type: boolean
      trusted:
        type: boolean
      remake:
        type: boolean
      complete:
        type: boolean
      hidden:
        type: boolean
      deleted:
        type: boolean
      has_torrent:
        type: boolean
      download_count:
        type: long
      leech_count:
        type: long
      seed_count:
        type: long
      comment_count:
        type: long
      # these ids are really only for filtering, thus keyword
      uploader_id:
        type: keyword
      main_category_id:
        type: keyword
      sub_category_id:
        type: keyword