nyaa/es_mapping.yml

---
# CREATE DATABASE/TABLE equivalent for elasticsearch, in yaml
# for inline comments.
settings:
  analysis:
    analyzer:
      my_search_analyzer:
        type: custom
        tokenizer: standard
        char_filter:
          - my_char_filter
        filter:
          - standard
          - lowercase
      my_index_analyzer:
        type: custom
        tokenizer: standard
        char_filter:
          - my_char_filter
        filter:
          - resolution
          - lowercase
          - my_ngram
          - word_delimit
          - trim_zero
      # For exact matching - simple lowercase + whitespace delimiter
      exact_analyzer:
        tokenizer: whitespace
        filter:
          - lowercase
      # For matching full words longer than the ngram limit (15 chars)
      my_fullword_index_analyzer:
        type: custom
        tokenizer: standard
        char_filter:
          - my_char_filter
        filter:
          - lowercase
          - word_delimit
          # Skip tokens shorter than N characters,
          # since they're already indexed in the main field
          - fullword_min

    filter:
      my_ngram:
        type: edgeNGram
        min_gram: 1
        max_gram: 15
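      # Illustration (worked by hand, not captured from a cluster): with
      # min_gram 1 and max_gram 15, the token "nyaa" is indexed as
      # "n", "ny", "nya", "nyaa". Prefixes stop at 15 characters, so a
      # 16-character word is never indexed whole by this filter; that is
      # what the display_name.fullword subfield is for.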
      fullword_min:
        type: length
        # Remember to change this if you change my_ngram's max_gram above!
        # (min should stay at max_gram + 1)
        min: 16
      resolution:
        type: pattern_capture
        patterns: ["(\\d+)[xX](\\d+)"]
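        # Illustration, assuming the tokenizer keeps "1920x1080" as a single
        # token and pattern_capture's default preserve_original (true):
        # the filter should emit "1920x1080", "1920" and "1080", so a search
        # for either dimension alone can match.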
      trim_zero:
        type: pattern_capture
        patterns: ["0*([0-9]*)"]
      word_delimit:
        type: word_delimiter
        preserve_original: true
        split_on_numerics: false
    char_filter:
      my_char_filter:
        type: mapping
        mappings: ["-=>_", "!=>_", "_=>\\u0020"]
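        # Illustration, assuming the mapping char filter makes a single pass
        # and does not re-map its own output: "k-on!" becomes "k_on_",
        # while "foo_bar" becomes "foo bar".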
  index:
    # we're running a single es node, so no sharding necessary,
    # plus replicas don't really help either.
    number_of_shards: 1
    number_of_replicas: 0
    mapper:
      # disable elasticsearch's "helpful" autoschema
      dynamic: false
    # since we disabled the _all field, query the torrent's
    # name by default.
    query:
      default_field: display_name
mappings:
  torrent:
    # don't want everything concatenated
    _all:
      enabled: false
    properties:
      id:
        type: long
      display_name:
        # TODO could do a fancier tokenizer here to parse out
        # the scene convention of stuff in brackets, plus stuff like k-on
        type: text
        analyzer: my_index_analyzer
        fielddata: true # Is this required?
        fields:
          # Multi-field for full-word matching (when going over ngram limits)
          # Note: will have to be queried for, not automatic
          fullword:
            type: text
            analyzer: my_fullword_index_analyzer
          # Stored for exact phrase matching
          exact:
            type: text
            analyzer: exact_analyzer
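          # A minimal sketch of a query body that names the subfields
          # explicitly, since they are not searched automatically. This is
          # hypothetical; the project's real query may differ (for example,
          # it may also set an analyzer):
          #
          #   query:
          #     simple_query_string:
          #       query: "machiavellianism"
          #       fields: [display_name, display_name.fullword]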
      created_time:
        type: date
      # Only in the ES index for generating magnet links
      info_hash:
        enabled: false
      filesize:
        type: long
      anonymous:
        type: boolean
      trusted:
        type: boolean
      remake:
        type: boolean
      complete:
        type: boolean
      hidden:
        type: boolean
      deleted:
        type: boolean
      has_torrent:
        type: boolean
      download_count:
        type: long
      leech_count:
        type: long
      seed_count:
        type: long
      comment_count:
        type: long
      # these ids are really only for filtering, thus keyword
      uploader_id:
        type: keyword
      main_category_id:
        type: keyword
      sub_category_id:
        type: keyword