Gulimall Study Notes: Advanced Elasticsearch (query, mapping, IK analyzer)

Author: 小舞很执着 | 2024-08-09 07:44:30

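Contents

1. Two ways to search
- 1.1 Searching with request parameters
- 1.2 Searching with a URL plus request body (match_all)
2. Query DSL
3. Creating, adding to, and modifying mappings
4. IK analyzer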

1. Two ways to search

1.1 Searching with request parameters

GET /bank/_search?q=*&sort=account_number:asc

Notes:
(1) q=* matches all documents;
(2) sort=account_number:asc sorts the results by account_number in ascending order.
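
For scripted use, the same request-parameter search can be assembled with Python's standard library. This is only a minimal sketch: the host http://localhost:9200 and the helper name build_uri_search are assumptions made for illustration, and the snippet builds the URL without sending it.

```python
from urllib.parse import urlencode

# Assumed local Elasticsearch address; adjust to your environment.
ES_HOST = "http://localhost:9200"


def build_uri_search(index: str, q: str, sort: str) -> str:
    """Build a request-parameter (URI) search URL for the given index."""
    # Keep '*' and ':' unescaped so the URL reads like the console request.
    params = urlencode({"q": q, "sort": sort}, safe="*:")
    return f"{ES_HOST}/{index}/_search?{params}"


url = build_uri_search("bank", "*", "account_number:asc")
print(url)
```

Without the safe argument, urlencode would percent-encode the asterisk and the colon; the request would still be equivalent, just harder to read.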

1.2 Searching with a URL plus request body (match_all)

GET /bank/_search
{
  "query": {
    "match_all": {}
  },
  "sort": [
    {
      "account_number": "asc"
    },
    {
      "balance": "desc"
    }
  ],
  "from": 1,
  "size": 5,
  "_source": [
    "balance",
    "firstname"
  ]
}

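Notes:
(1) Results are sorted by account_number in ascending order first, then by balance in descending order.
(2) from: the zero-based offset of the first hit to return (here 1, so the very first hit is skipped).
(3) size: the number of hits to return.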
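
Because from is an offset rather than a page number, paging code usually has to convert between the two. The sketch below shows one way to do that; the helper names page_to_from_size and paged_match_all are hypothetical, and the sort and _source values are simply copied from the request in this section.

```python
def page_to_from_size(page: int, page_size: int) -> dict:
    """Translate a 1-based page number into Elasticsearch from/size values."""
    if page < 1 or page_size < 1:
        raise ValueError("page and page_size must be positive")
    # from is a zero-based offset: page 1 starts at 0, page 2 at page_size, ...
    return {"from": (page - 1) * page_size, "size": page_size}


def paged_match_all(page: int, page_size: int) -> dict:
    """Request body for a match_all search using this section's sort order."""
    body = {
        "query": {"match_all": {}},
        "sort": [{"account_number": "asc"}, {"balance": "desc"}],
        "_source": ["balance", "firstname"],
    }
    body.update(page_to_from_size(page, page_size))
    return body


print(paged_match_all(page=2, page_size=5))
```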

2. Query DSL

2.1 match: full-text search

GET /bank/_search
{
  "query":{
    "match":{
      "address": "mill lane"
    }
  }
}

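Notes:
(1) When the field queried with match is numeric, the match is exact; when it is a text (string) field, a full-text match on the analyzed terms is performed.
(2) match analyzes the query string. In the example above, "mill lane" is split into the terms mill and lane, and a document matches if its address contains either term (the default operator is OR).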

2.2 match_phrase: phrase matching

GET /bank/_search
{
  "query":{
    "match_phrase":{
      "address": "mill lane"
    }
  }
}

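Notes:
(1) match_phrase also analyzes "mill lane", but it treats the resulting terms as one phrase rather than as independent terms: only documents whose address field contains mill immediately followed by lane, in that order, are returned.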
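
The difference between match and match_phrase can be illustrated with a toy model in plain Python. This is not how Elasticsearch is implemented: the analyzer here just lowercases and splits on whitespace, scoring is ignored, and every address except "198 Mill Lane" is made up for the example.

```python
def analyze(text: str) -> list:
    """Toy analyzer: lowercase and split on whitespace (not the real one)."""
    return text.lower().split()


def toy_match(field_value: str, query: str) -> bool:
    """match-like: true if ANY query term occurs in the field (OR)."""
    field_terms = set(analyze(field_value))
    return any(term in field_terms for term in analyze(query))


def toy_match_phrase(field_value: str, query: str) -> bool:
    """match_phrase-like: query terms must be adjacent and in order."""
    field_terms = analyze(field_value)
    query_terms = analyze(query)
    n = len(query_terms)
    return n > 0 and any(
        field_terms[i:i + n] == query_terms
        for i in range(len(field_terms) - n + 1)
    )


# Only the first address comes from this article; the others are invented.
addresses = ["198 Mill Lane", "990 Mill Road", "715 Lane Mill Court"]
for address in addresses:
    print(address, toy_match(address, "mill lane"),
          toy_match_phrase(address, "mill lane"))
```

All three addresses satisfy the match-like check because each contains at least one of the terms, while only the first satisfies the phrase check.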

2.3 multi_match: matching across multiple fields

GET /bank/_search
{
  "query":{
    "multi_match": {
      "query": "mill movico",
      "fields": ["address","city"]
    }
  }
}

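Notes:
(1) multi_match runs the same query against several fields.
(2) The fields parameter of multi_match accepts wildcards.
(3) How multi_match executes internally depends mainly on the type parameter. The default type is:

Type	Description
best_fields	(Default) Finds documents that match any field, but uses the _score from the best field.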

https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-multi-match-query.html
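
To make the type parameter explicit, a request body can be assembled in code. The following is a small sketch with a hypothetical helper named multi_match_query; the set of type values is the one listed on the documentation page linked above, and the "*name" pattern is only an illustrative wildcard.

```python
# Values accepted by the type parameter of multi_match.
MULTI_MATCH_TYPES = {
    "best_fields", "most_fields", "cross_fields",
    "phrase", "phrase_prefix", "bool_prefix",
}


def multi_match_query(query, fields, match_type="best_fields"):
    """Build a multi_match request body, validating the type first."""
    if match_type not in MULTI_MATCH_TYPES:
        raise ValueError(f"unknown multi_match type: {match_type}")
    return {
        "query": {
            "multi_match": {
                "query": query,
                "fields": list(fields),
                "type": match_type,
            }
        }
    }


# Same search as the request in this section, with the default type spelled out.
print(multi_match_query("mill movico", ["address", "city"]))
# Wildcard field pattern: applies to every field whose name ends in "name".
print(multi_match_query("forbes", ["*name"], match_type="most_fields"))
```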

2.4 bool compound query

GET /bank/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "address": "Mill"
          }
        },
        {
          "match": {
            "gender": "M"
          }
        }
      ],
      "must_not": [
        {
          "match": {
            "firstname": "Forbes"
          }
        }
      ],
      "should": [
        {
          "match": {
            "state": "KY"
          }
        }
      ],
      "filter": [
        {
          "range": {
            "age": {
              "gte": 35,
              "lte": 38
            }
          }
        }
      ]
    }
  }
}

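Notes:
(1) bool is a compound query used to build complex queries out of other clauses.
(2) must: the clause must match; it runs in query context and contributes to the score.
(3) must_not: the clause must not match; it runs in filter context, so it only filters and does not affect the score.
(4) should: the clause should match; it runs in query context and adds to the score when it does.
(5) filter: the clause must match; it runs in filter context, so it only filters and does not affect the score.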
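
When such queries are generated by application code, it helps to assemble the four clause types separately and leave out the empty ones. A minimal sketch, assuming a hypothetical helper named bool_query (its filters argument maps to the filter clause):

```python
import json


def bool_query(must=None, must_not=None, should=None, filters=None):
    """Assemble a bool query body, leaving out clauses that are empty."""
    clauses = {
        "must": must,
        "must_not": must_not,
        "should": should,
        "filter": filters,
    }
    bool_body = {name: list(items) for name, items in clauses.items() if items}
    return {"query": {"bool": bool_body}}


# Rebuilds the request from this section.
body = bool_query(
    must=[{"match": {"address": "Mill"}}, {"match": {"gender": "M"}}],
    must_not=[{"match": {"firstname": "Forbes"}}],
    should=[{"match": {"state": "KY"}}],
    filters=[{"range": {"age": {"gte": 35, "lte": 38}}}],
)
print(json.dumps(body, indent=2))
```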

2.5 term query (how terms, .keyword, and match_phrase differ)

GET /bank/_search
{
  "query": {
    "term": {
      "balance": "45801"
    }
  }
}

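Notes:
(1) Use term queries for exact matches on non-text data such as price, productID, or username.
(2) Use match queries for text fields.
(3) term: finds documents whose field contains the given term exactly.
(4) terms: finds documents whose field contains any of several given terms.
(5) .keyword: matches the whole, unanalyzed field value exactly.

To match only documents whose address is exactly "198 Mill Lane":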

GET /bank/_search
{
  "query":{
    "match": {
      "address.keyword": "198 Mill Lane"
    }
  }
}

(6) match_phrase: matches documents whose field contains the given phrase.
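
Why term suits keyword data but not text fields becomes clearer with a toy model. The sketch below is an illustration only, not Elasticsearch's implementation: it assumes the text field is lowercased and split on spaces at index time, that the .keyword sub-field stores the value verbatim, and that term-style lookups compare the search value without analyzing it.

```python
def analyze(text):
    """Toy stand-in for the standard analyzer: lowercase, split on spaces."""
    return text.lower().split()


def index_document(address):
    """Index address twice: analyzed (text) and verbatim (.keyword)."""
    return {"address": analyze(address), "address.keyword": [address]}


def toy_term(doc, field, value):
    """term-like lookup: the value is NOT analyzed before comparison."""
    return value in doc[field]


def toy_terms(doc, field, values):
    """terms-like lookup: true if any of the given values is present."""
    return any(toy_term(doc, field, v) for v in values)


doc = index_document("198 Mill Lane")
print(toy_term(doc, "address", "Mill"))   # capitalised value vs. lowercased tokens
print(toy_term(doc, "address", "mill"))   # same spelling as the indexed token
print(toy_term(doc, "address.keyword", "198 Mill Lane"))
print(toy_terms(doc, "address", ["road", "lane"]))
```

In this model the capitalised "Mill" finds nothing in the analyzed field, while the full string matches the .keyword sub-field.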

3. Creating, adding to, and modifying mappings

3.1 Creating an index with a custom mapping

PUT /my_index
{
  "mappings": {
    "properties": {
      "age": {
        "type": "integer"
      },
      "email": {
        "type": "keyword"
      },
      "name": {
        "type": "text"
      }
    }
  }
}

3.2 Adding a mapping for a new field

PUT /my_index/_mapping
{
  "properties": {
    "employee-id": {
      "type": "keyword",
      "index": false
    }
  }
}

If we now search on employee-id, the following error is returned:

{
  ...
 "reason" : "failed to create query: Cannot search on field [employee-id] since it is not indexed.",
  ...
}

Notes:
(1) employee-id: the name of the field being added;
(2) type: the field type;
(3) index: whether the field is indexed and therefore searchable (defaults to true).

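3.3 Changing a mapped field's type (migrating with reindex)

(1) We cannot change an existing mapping rule or the type of a field that is already mapped (although we can add new field mappings, as above).
(2) Instead, we create a new index with the desired mapping and then migrate the data from the old index into it with reindex.

If the documents in the old index were stored under a type, that type must be specified in source; otherwise it can be omitted.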

POST _reindex
{
  "source": {
    "index": "bank",
    "type": "account"
  },
  "dest": {
    "index": "newbank"
  }
}
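
The two rules above can be turned into a small planning step before any request is sent. This is a sketch under stated assumptions: plan_mapping_change and reindex_body are hypothetical helper names, the existing mapping is the one from section 3.1, and changing age to long is an invented example of a type change.

```python
def plan_mapping_change(existing: dict, desired: dict):
    """Split desired field mappings into addable fields and type conflicts."""
    addable, conflicts = {}, {}
    for field, spec in desired.items():
        if field not in existing:
            addable[field] = spec      # new field: can be added via _mapping
        elif existing[field].get("type") != spec.get("type"):
            conflicts[field] = spec    # type change: needs a new index + reindex
    return addable, conflicts


def reindex_body(source_index, dest_index, source_type=None):
    """Build a _reindex body; include the source type only when one is given."""
    source = {"index": source_index}
    if source_type is not None:
        source["type"] = source_type
    return {"source": source, "dest": {"index": dest_index}}


existing = {
    "age": {"type": "integer"},
    "email": {"type": "keyword"},
    "name": {"type": "text"},
}
desired = {
    "age": {"type": "long"},                                # hypothetical change
    "employee-id": {"type": "keyword", "index": False},
}
addable, conflicts = plan_mapping_change(existing, desired)
print("add via _mapping:", addable)
print("needs reindex:", conflicts)
print(reindex_body("bank", "newbank", source_type="account"))
```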

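4. IK analyzer

4.1 Installation

(1) Download the elasticsearch-analysis-ik-7.9.2.zip release that matches your Elasticsearch version from https://github.com/medcl/elasticsearch-analysis-ik/releases (the versions must match).
(2) Unzip the archive and copy all of the extracted files into the plugins directory of the ES installation.
For example: D:\Software\ElasticSearch7.9.2\elasticsearch-7.9.2\plugins\analysis-ik\ (the analysis-ik directory has to be created first).
(3) Restart Elasticsearch.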

4.2 Verification and usage

Two analyzers are now available: ik_smart (smart segmentation) and ik_max_word (the largest set of word combinations).

ik_max_word request:

POST _analyze
{
  "analyzer": "ik_max_word",
  "text": "我是中国人"
}

ik_max_word response:

{
  "tokens" : [
    {
      "token" : "我",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "CN_CHAR",
      "position" : 0
    },
    {
      "token" : "是",
      "start_offset" : 1,
      "end_offset" : 2,
      "type" : "CN_CHAR",
      "position" : 1
    },
    {
      "token" : "中国人",
      "start_offset" : 2,
      "end_offset" : 5,
      "type" : "CN_WORD",
      "position" : 2
    },
    {
      "token" : "中国",
      "start_offset" : 2,
      "end_offset" : 4,
      "type" : "CN_WORD",
      "position" : 3
    },
    {
      "token" : "国人",
      "start_offset" : 3,
      "end_offset" : 5,
      "type" : "CN_WORD",
      "position" : 4
    }
  ]
}

ik_smart request:

POST _analyze
{
  "analyzer": "ik_smart",
  "text": "我是中国人"
}

ik_smart response:

{
  "tokens" : [
    {
      "token" : "我",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "CN_CHAR",
      "position" : 0
    },
    {
      "token" : "是",
      "start_offset" : 1,
      "end_offset" : 2,
      "type" : "CN_CHAR",
      "position" : 1
    },
    {
      "token" : "中国人",
      "start_offset" : 2,
      "end_offset" : 5,
      "type" : "CN_WORD",
      "position" : 2
    }
  ]
}

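4.3 Extending the dictionary with custom words

In the config directory of the IK analyzer, find the IKAnalyzer.cfg.xml configuration file. This is where we configure the location of the file that contains the extra words. Its contents are:
(My path: D:\Software\ElasticSearch7.9.2\elasticsearch-7.9.2\plugins\analysis-ik\config\IKAnalyzer.cfg.xml)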

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
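	<comment>IK Analyzer extension configuration</comment>
	<!-- Users can configure their own extension dictionary here -->
	<entry key="ext_dict"></entry>
	 <!-- Users can configure their own extension stopword dictionary here -->
	<entry key="ext_stopwords"></entry>
	<!-- Users can configure a remote extension dictionary here -->
	<!-- <entry key="remote_ext_dict">words_location</entry> -->
	<!-- Users can configure a remote extension stopword dictionary here -->
	<!-- <entry key="remote_ext_stopwords">words_location</entry> -->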
</properties>

Notes:
(1) To configure a remote extension dictionary, we can set up an nginx server and place the extension word list on it as a txt file, with one word per line, like this:

测试分词1
测试分词2

(2) Since this is only for learning, the custom word file is configured locally on Windows (in D:\Software\ElasticSearch7.9.2\elasticsearch-7.9.2\plugins\analysis-ik\config).
Create an extend.dic file in the config directory:

王二麻
分词效果

IKAnalyzer.cfg.xml configuration:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
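	<comment>IK Analyzer extension configuration</comment>
	<!-- Users can configure their own extension dictionary here -->
	<entry key="ext_dict">extend.dic</entry>
	 <!-- Users can configure their own extension stopword dictionary here -->
	<entry key="ext_stopwords"></entry>
	<!-- Users can configure a remote extension dictionary here -->
	<!-- <entry key="remote_ext_dict">words_location</entry> -->
	<!-- Users can configure a remote extension stopword dictionary here -->
	<!-- <entry key="remote_ext_stopwords">words_location</entry> -->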
</properties>

First, here is what the following request returns when the custom dictionary has not been added:

POST _analyze
{
  "analyzer": "ik_smart",
  "text": "王二麻测试分词效果!"
}

{
  "tokens" : [
    {
      "token" : "王",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "CN_CHAR",
      "position" : 0
    },
    {
      "token" : "二",
      "start_offset" : 1,
      "end_offset" : 2,
      "type" : "TYPE_CNUM",
      "position" : 1
    },
    {
      "token" : "麻",
      "start_offset" : 2,
      "end_offset" : 3,
      "type" : "CN_CHAR",
      "position" : 2
    },
    {
      "token" : "测试",
      "start_offset" : 3,
      "end_offset" : 5,
      "type" : "CN_WORD",
      "position" : 3
    },
    {
      "token" : "分词",
      "start_offset" : 5,
      "end_offset" : 7,
      "type" : "CN_WORD",
      "position" : 4
    },
    {
      "token" : "效果",
      "start_offset" : 7,
      "end_offset" : 9,
      "type" : "CN_WORD",
      "position" : 5
    }
  ]
}

After enabling the custom dictionary:

{
  "tokens" : [
    {
      "token" : "王二麻",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "CN_WORD",
      "position" : 0
    },
    {
      "token" : "测试",
      "start_offset" : 3,
      "end_offset" : 5,
      "type" : "CN_WORD",
      "position" : 1
    },
    {
      "token" : "分词效果",
      "start_offset" : 5,
      "end_offset" : 9,
      "type" : "CN_WORD",
      "position" : 2
    }
  ]
}
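
How adding dictionary entries changes the tokens can be reproduced with a toy segmenter. The sketch below uses simple forward maximum matching, which is a stand-in chosen for illustration and not IK's actual algorithm; the contents of base_dictionary are an assumption about which words are already known, and the trailing "!" of the sample text is left out.

```python
def segment(text, dictionary):
    """Forward maximum matching: at each position take the longest dictionary
    word; fall back to a single character when nothing matches."""
    max_len = max((len(word) for word in dictionary), default=1)
    tokens, pos = [], 0
    while pos < len(text):
        for length in range(min(max_len, len(text) - pos), 0, -1):
            candidate = text[pos:pos + length]
            if length == 1 or candidate in dictionary:
                tokens.append(candidate)
                pos += length
                break
    return tokens


base_dictionary = {"测试", "分词", "效果"}           # assumed built-in words
extended = base_dictionary | {"王二麻", "分词效果"}   # entries from extend.dic

text = "王二麻测试分词效果"
print(segment(text, base_dictionary))
print(segment(text, extended))
```

With only the base words, the name falls apart into single characters; with the two extend.dic entries added, it stays whole and 分词效果 becomes a single token, mirroring the two responses in this section.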

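Disclaimer: This article was contributed by a user and does not represent the position of 【wpsshop博客】. Copyright remains with the original author, and this site assumes no legal liability for it. If you find infringing content, please contact us. When reposting, please credit the source: https://www.wpsshop.cn/w/小舞很执着/article/detail/952168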