ES Analyzer Configuration

Posted on 2021-03-06 in Study Notes

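In ES, text analysis is performed by analyzers, whose main job is to turn text into individual terms (tokens). The main built-in analyzers are standard (the default; splits on word boundaries, lowercases the terms, and drops punctuation), simple (splits on any non-letter character and lowercases the terms, so punctuation and digits are dropped), and whitespace (splits on whitespace only, without lowercasing or removing punctuation; because Chinese text has no spaces between words, it cannot segment Chinese).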
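To make the difference between the three analyzers concrete, here is a rough Python approximation. It is only a sketch for plain English input: the function names are made up for illustration, and the real analyzers are implemented in Lucene with Unicode-aware rules that these regular expressions do not reproduce (for example, the real standard analyzer splits Chinese into single characters).

```python
import re

def standard_like(text):
    # Approximation: lowercase, keep runs of letters/digits, drop punctuation.
    return re.findall(r"[^\W_]+", text.lower())

def simple_like(text):
    # Approximation: lowercase, keep runs of letters only,
    # so digits and punctuation both disappear.
    return re.findall(r"[^\W\d_]+", text.lower())

def whitespace_like(text):
    # Split on whitespace only; case and punctuation are left untouched.
    return text.split()

sample = "Are you 18 years old, young man."
print(standard_like(sample))    # ['are', 'you', '18', 'years', 'old', 'young', 'man']
print(simple_like(sample))      # ['are', 'you', 'years', 'old', 'young', 'man']
print(whitespace_like(sample))  # ['Are', 'you', '18', 'years', 'old,', 'young', 'man.']
```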

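The standard analyzer

How the standard analyzer tokenizes

Running the command below shows that the words are lowercased and the punctuation has been removed.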

GET _analyze
{
  "analyzer": "standard",
  "text": "Are you 18 years old, young man."
}

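Next, change the value of text to “中华人民共和国国歌”. The result shows that the Chinese text is split into single characters, which is not what we want, so we need a Chinese analyzer.

The ik analyzer

Download the ik analyzer: https://github.com/medcl/elasticsearch-analysis-ik/releases
Unzip the archive into its own subdirectory (for example ik) under the es plugins directory, then restart es. The plugin version should match the es version.
Running the command below shows the result of Chinese word segmentation.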

GET _analyze
{
  "analyzer": "ik_max_word",
  "text": "中华人民共和国国歌"
}

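The ik analyzer comes in two modes: ik_smart (smart segmentation, coarse-grained) and ik_max_word (exhaustive segmentation, fine-grained). The recommended practice is to index with ik_max_word and search with ik_smart. If a search with ik_smart finds nothing, you can fall back to forcing ik_max_word for the search.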
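The fallback strategy can be sketched in Python as follows. This is a minimal sketch, not a definitive implementation: run_search is a hypothetical callable that sends a request body to es and returns the list of hits, and fake_search only stands in for it here so the example runs without a cluster. The one es feature relied on is that a match query accepts an "analyzer" parameter that overrides the search analyzer for that request.

```python
def build_match_query(field, text, analyzer):
    # A match query can name the analyzer to apply to the query text.
    return {"query": {"match": {field: {"query": text, "analyzer": analyzer}}}}

def search_with_fallback(run_search, field, text):
    # run_search is a hypothetical function: request body in, list of hits out.
    hits = run_search(build_match_query(field, text, "ik_smart"))
    if hits:
        return hits
    # Nothing found with the coarse analyzer: retry with the fine-grained one.
    return run_search(build_match_query(field, text, "ik_max_word"))

def fake_search(body):
    # Stand-in for a real es call: pretends only ik_max_word finds a document.
    analyzer = body["query"]["match"]["name"]["analyzer"]
    return ["doc-1"] if analyzer == "ik_max_word" else []

print(search_with_fallback(fake_search, "name", "凯悦酒店"))  # ['doc-1']
```

Keeping the query builder separate from the transport makes it easy to swap fake_search for whichever es client you actually use.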

Custom dictionary

GET _analyze
{
  "analyzer": "ik_max_word",
  "text": "凯悦酒店"
}

{
  "tokens" : [
    {
      "token" : "凯",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "CN_WORD",
      "position" : 0
    },
    {
      "token" : "悦",
      "start_offset" : 1,
      "end_offset" : 2,
      "type" : "CN_WORD",
      "position" : 1
    },
    {
      "token" : "酒店",
      "start_offset" : 2,
      "end_offset" : 4,
      "type" : "CN_WORD",
      "position" : 2
    }
  ]
}

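Because the word “凯悦” is not in the dictionary, the command above does not return the result we want. A custom dictionary is needed to handle this.

Create a file named new_word.dic in the ik/config directory and enter the text “凯悦” (one word per line, UTF-8 encoded).
Edit IKAnalyzer.cfg.xml:

<!-- Users can configure their own extension dictionaries here -->
<entry key="ext_dict">new_word.dic</entry>

After restarting es, run the ik analysis again to get the following result: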

{
  "tokens" : [
    {
      "token" : "凯悦",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "CN_WORD",
      "position" : 0
    },
    {
      "token" : "酒店",
      "start_offset" : 2,
      "end_offset" : 4,
      "type" : "CN_WORD",
      "position" : 1
    }
  ]
}
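Why one extra dictionary entry changes the output can be illustrated with a toy dictionary-based segmenter. This is a simplified forward maximum matching sketch, not ik's actual algorithm; the segment function and its max_len parameter are assumptions made only for illustration.

```python
def segment(text, dictionary, max_len=4):
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character
        # when no dictionary word starts at this position.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += size
                break
    return tokens

base = {"酒店"}
print(segment("凯悦酒店", base))             # ['凯', '悦', '酒店']
print(segment("凯悦酒店", base | {"凯悦"}))  # ['凯悦', '酒店']
```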

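Hot-updating the dictionary

A custom dictionary file only takes effect after es is restarted, which is generally not recommended; hot updating is the better approach.
Make the dictionary reachable through an HTTP request and point the remote_ext_dict entry (not ext_dict, which is for local files) at its URL. The HTTP response must carry two headers, Last-Modified and ETag; whenever either one changes, ik fetches the dictionary again. ik checks once a minute.

<!-- Users can configure a remote extension dictionary here -->
<entry key="remote_ext_dict">http://yoursite.com/getCustomeDic</entry>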
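A dictionary endpoint of this kind could look like the following Python sketch, built on the standard library only. The names WORDS, LAST_CHANGED, DictHandler, and serve, as well as port 8000, are assumptions for illustration; a real service would load the words from a file or database and record when they last changed. The body is served as UTF-8 with one word per line, and both HEAD and GET are answered since a poller may use either.

```python
import hashlib
from email.utils import formatdate
from http.server import BaseHTTPRequestHandler, HTTPServer

WORDS = ["凯悦"]    # hypothetical in-memory dictionary
LAST_CHANGED = 0.0  # hypothetical Unix timestamp of the last dictionary change

def dictionary_body(words):
    # One word per line, UTF-8 encoded.
    return ("\n".join(words) + "\n").encode("utf-8")

def change_headers(body, changed_at):
    # Both headers change whenever the dictionary changes.
    return {
        "Last-Modified": formatdate(changed_at, usegmt=True),
        "ETag": '"' + hashlib.sha256(body).hexdigest() + '"',
    }

class DictHandler(BaseHTTPRequestHandler):
    def _send_headers(self, body):
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        for name, value in change_headers(body, LAST_CHANGED).items():
            self.send_header(name, value)
        self.end_headers()

    def do_HEAD(self):
        # Headers only, so a poller can detect changes cheaply.
        self._send_headers(dictionary_body(WORDS))

    def do_GET(self):
        body = dictionary_body(WORDS)
        self._send_headers(body)
        self.wfile.write(body)

def serve(port=8000):
    # Blocks forever; call this from your own entry point.
    HTTPServer(("0.0.0.0", port), DictHandler).serve_forever()
```

Deriving the ETag from the content means that editing WORDS is enough to signal a change, even if LAST_CHANGED is not updated.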

Synonym search

In the es/config directory, create the file ik/synonyms.txt and enter the text “凯悦,锡伯,红桃”.
Create the index and add some data:

PUT /testindex2
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 1,
    "analysis": {
      "filter": {
        "my_synonym_filter": {
          "type": "synonym",
          "synonyms_path": "ik/synonyms.txt"
        }
      },
      "analyzer": {
        "ik_syno": {
          "type": "custom",
          "tokenizer": "ik_smart",
          "filter": ["my_synonym_filter"]
        },
        "ik_syno_max": {
          "type": "custom",
          "tokenizer": "ik_max_word",
          "filter": ["my_synonym_filter"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "name": {
        "type": "text", 
        "analyzer": "ik_syno_max",
        "search_analyzer": "ik_syno"
      }
    }
  }
}

PUT /testindex2/_doc/1
{
  "name": "凯悦酒店"
}
PUT /testindex2/_doc/2
{
  "name": "锡伯酒店"
}
PUT /testindex2/_doc/3
{
  "name": "红桃酒店"
}
PUT /testindex2/_doc/4
{
  "name": "测试酒店"
}

Testing synonym search

GET /testindex2/_analyze
{
  "field": "name",
  "text": "凯悦"
}

GET /testindex2/_search
{
  "query": {
    "match": {
      "name": "红桃"
    }
  }
}

The _analyze request returns:

{
  "tokens" : [
    {
      "token" : "凯悦",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "CN_WORD",
      "position" : 0
    },
    {
      "token" : "锡伯",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "SYNONYM",
      "position" : 0
    },
    {
      "token" : "红桃",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "SYNONYM",
      "position" : 0
    }
  ]
}
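The response above shows each synonym emitted at the same position as the original token. A toy Python model of that expansion is sketched below; it is not Lucene's synonym filter, and load_synonym_groups and expand are hypothetical helper names used only to illustrate the idea.

```python
def load_synonym_groups(lines):
    # Each line is a comma-separated group of equivalent words.
    groups = {}
    for line in lines:
        words = [w.strip() for w in line.split(",") if w.strip()]
        for word in words:
            groups[word] = words
    return groups

def expand(tokens, groups):
    # Returns (position, term) pairs; synonyms share the position
    # of the token they were derived from.
    expanded = []
    for position, token in enumerate(tokens):
        expanded.append((position, token))
        for synonym in groups.get(token, []):
            if synonym != token:
                expanded.append((position, synonym))
    return expanded

groups = load_synonym_groups(["凯悦,锡伯,红桃"])
print(expand(["凯悦", "酒店"], groups))
# [(0, '凯悦'), (0, '锡伯'), (0, '红桃'), (1, '酒店')]
```

Because the three words expand to the same set of terms, a query for any one of them can match documents indexed under the other two.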