Elasticsearch Proximity Search: Proximity Queries with search

Author: 运维做开发 | 2024-08-13 17:55:44

Proximity queries with search

How match_phrase phrase search works
How slop works
Combining match with proximity matching to balance recall and precision
Performance comparison and optimization

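Search requirement:
We want to find docs that contain the phrase java spark, that is, the terms java and spark next to each other and in that order, rather than either word on its own. A phrase match does this. Separately, if we want docs in which java and spark sit close together to be returned first, with a higher relevance score the closer they are, we can use a proximity match.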

1. match_phrase phrase matching:

GET /forum/article/_search
{
  "query": {
    "match": {
      "content": "java spark"
    }
  }
}

This query can only return docs that match java or spark, because the query text is analyzed into the two separate terms java and spark.

GET /forum/article/_search
{
  "query": {
    "match_phrase": {
      "content": "java spark"
    }
  }
}

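With match_phrase we search for the phrase as a whole: a doc matches only if it contains all of the terms, in the same order and with the same spacing as in the query.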

Let's look at this from the point of view of the docs' inverted index.
doc1: hello world, java spark
doc2: hi,spark java

After analysis, each term is stored together with its term position:
hello doc1(0)
world doc1(1)
hi doc2(0)
java doc1(2) doc2(2)
spark doc1(3) doc2(1)
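
The listing above can be reproduced with a small sketch of a positional inverted index. This is only an illustration: the `build_positional_index` helper is hypothetical, and its regex tokenizer is a simplified stand-in for the standard analyzer, not the analyzer Elasticsearch actually uses.

```python
import re
from collections import defaultdict


def build_positional_index(docs):
    """Map each term to {doc_id: [positions]}, counting positions from 0."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        # Assumption: lowercase and split on non-word characters,
        # which is enough for these two sample docs.
        tokens = re.findall(r"\w+", text.lower())
        for position, token in enumerate(tokens):
            index[token].setdefault(doc_id, []).append(position)
    return dict(index)


docs = {"doc1": "hello world, java spark", "doc2": "hi,spark java"}
index = build_positional_index(docs)
for term in ("java", "spark"):
    print(term, index[term])
```

Printing the entries for java and spark shows the same positions as the listing: java at position 2 in both docs, spark at 3 in doc1 and at 1 in doc2.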

You can inspect this process with the _analyze API:

GET _analyze
{
  "text": ["hello world, java spark"],
  "analyzer": "standard"
}

{
  "tokens": [
    {
      "token": "hello",
      "start_offset": 0,
      "end_offset": 5,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "world",
      "start_offset": 6,
      "end_offset": 11,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "java",
      "start_offset": 13,
      "end_offset": 17,
      "type": "<ALPHANUM>",
      "position": 2
    },
    {
      "token": "spark",
      "start_offset": 18,
      "end_offset": 23,
      "type": "<ALPHANUM>",
      "position": 3
    }
  ]
}
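
To see where the offsets and positions in this response come from, here is a rough offline imitation. The `analyze` function is a hypothetical helper built on a regex, assumed to behave like the standard analyzer only for simple ASCII text such as this sample; it omits the `type` field.

```python
import re


def analyze(text):
    """Return token dicts with character offsets and 0-based positions."""
    tokens = []
    for position, match in enumerate(re.finditer(r"\w+", text.lower())):
        tokens.append({
            "token": match.group(),
            "start_offset": match.start(),  # index of the first character
            "end_offset": match.end(),      # index just past the last character
            "position": position,           # ordinal of the token in the text
        })
    return tokens


tokens = analyze("hello world, java spark")
for token in tokens:
    print(token)
```

The offsets count characters in the original text, while position counts tokens; phrase and proximity matching rely on position only.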

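The number in parentheses is the position, that is, the term's original position in the doc, counted from 0. Here is the basic principle behind match_phrase. For the query phrase java spark, it first finds the docs that match java, then the docs that match spark, and keeps only the docs that match both. Finally, and most importantly, the term position of spark must be exactly 1 greater than the term position of java: the two terms must be adjacent in the original doc and in that order. In the example, doc1 matches (java at 2, spark at 3) and doc2 does not (spark at 1 comes before java at 2).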
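
The steps just described can be sketched in a few lines. This is a simplified model of the idea, not Lucene's actual phrase scorer; `phrase_match` and the hand-written `index` are illustrative names and data taken from the two sample docs.

```python
def phrase_match(index, terms):
    """Return doc ids where the terms occur consecutively and in order."""
    postings = [index.get(term, {}) for term in terms]
    # Step 1: keep only docs that contain every term.
    candidates = set(postings[0])
    for posting in postings[1:]:
        candidates &= set(posting)
    matches = []
    for doc_id in sorted(candidates):
        # Step 2: term i must sit exactly i positions after the first term.
        for start in postings[0][doc_id]:
            if all(start + i in postings[i][doc_id] for i in range(1, len(terms))):
                matches.append(doc_id)
                break
    return matches


# Positional index for doc1 "hello world, java spark" and doc2 "hi,spark java".
index = {
    "hello": {"doc1": [0]},
    "world": {"doc1": [1]},
    "hi": {"doc2": [0]},
    "java": {"doc1": [2], "doc2": [2]},
    "spark": {"doc1": [3], "doc2": [1]},
}
print(phrase_match(index, ["java", "spark"]))
```

Both docs survive step 1, but only doc1 passes the position check in step 2.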

2. slop: the number of times the terms in the query string have to be moved in order to match a document is called the slop.

hello world , java is very good , spark is also very good

GET /forum/article/_search
{
  "query": {
    "match_phrase": {
      "content": {
        "query": "java spark",
        "slop": 3
      }
    }
  }
}

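A plain match_phrase search for java spark does not find the doc above. After analysis java is at position 2 and spark at position 6, so spark in the query has to move three positions to the right before the phrase lines up with the doc. slop sets the maximum number of moves allowed, so a slop of 3 is just enough for a match. With a slop search, the closer the terms are to each other, the higher the relevance score.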
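
The move counting can be checked with a small sketch. It assumes a simplified model of sloppy phrase matching: shift each matched position back by the term's offset in the query, then take the spread between the largest and smallest shifted value. `required_slop` is a hypothetical helper, and the model ignores edge cases such as repeated query terms.

```python
from itertools import product


def required_slop(doc_tokens, query_terms):
    """Smallest slop at which the query phrase would match, or None."""
    positions = []
    for term in query_terms:
        found = [i for i, token in enumerate(doc_tokens) if token == term]
        if not found:
            return None  # a missing term can never be fixed by slop
        positions.append(found)
    best = None
    for combo in product(*positions):
        # An exact phrase collapses to a single shifted value (spread 0).
        shifted = [pos - offset for offset, pos in enumerate(combo)]
        spread = max(shifted) - min(shifted)
        if best is None or spread < best:
            best = spread
    return best


# Tokens of the sample doc after analysis (punctuation dropped).
doc = "hello world java is very good spark is also very good".split()
print(required_slop(doc, ["java", "spark"]))
```

Under this model the sample doc needs a slop of 3, an exact phrase needs 0, and two terms in reversed order (as in doc2, "hi,spark java") need 2.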

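3. The search above does improve precision, but it lowers recall: docs that contain only java or only spark are no longer returned. We can therefore combine it with match to strike a balance between the two. In the following query, the match clause in must decides which docs are returned, and the match_phrase clause in should only adds to the score of docs where the terms are close together.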

GET /forum/article/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "content": {
              "query": "java spark"
            }
          }
        }
      ],
      "should": [
        {
          "match_phrase": {
            "content": {
              "query": "java spark",
              "slop": 50
            }
          }
        }
      ]
    }
  }
}
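
If the same request body is needed for different fields or slop values, it can be assembled programmatically. The sketch below only builds the JSON body shown above; `recall_plus_proximity_query` is a hypothetical helper name, and sending the body to a cluster is left out.

```python
import json


def recall_plus_proximity_query(field, text, slop):
    """Build a bool query: match for recall, match_phrase with slop for ranking."""
    return {
        "query": {
            "bool": {
                # must: a doc has to match at least one of the analyzed terms.
                "must": [{"match": {field: {"query": text}}}],
                # should: optional clause that raises the score when terms are close.
                "should": [
                    {"match_phrase": {field: {"query": text, "slop": slop}}}
                ],
            }
        }
    }


body = recall_plus_proximity_query("content", "java spark", 50)
print(json.dumps(body, indent=2))
```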

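4. Performance: a match query is much faster than a phrase match or a proximity match (match_phrase with slop), because the latter two have to compare term positions. As a rough rule of thumb, a match query is about 10 times faster than a phrase match and about 20 times faster than a proximity match. Since Elasticsearch queries usually complete in milliseconds, the cost of these position-based queries is generally still acceptable.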

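To optimize a proximity query, the usual approach is to reduce the number of docs it has to examine. The main idea is to first use a match query to select the docs we need, and then use a proximity match to raise the score of docs according to term distance. We can limit how many docs the proximity match affects, because users usually page through results and only look at the first few pages. The rescore feature does this: window_size sets how many of the top docs from the match query are rescored.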

GET /forum/article/_search
{
  "query": {
    "match": {
      "content": "java spark"
    }
  },
  "rescore": {
    "window_size": 50,
    "query": {
      "rescore_query": {
        "match_phrase": {
          "content": {
            "query": "java spark",
            "slop": 50
          }
        }
      }
    }
  }
}
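
The effect of window_size can be illustrated with a toy simulation. It assumes the rescore score is simply added to the original score and that docs outside the window keep their first-pass order. `rescore_top_window`, the sample hits and the bonus values are all made up for illustration.

```python
def rescore_top_window(hits, window_size, rescore_fn):
    """hits: list of (doc_id, score) sorted by score, highest first."""
    window = hits[:window_size]
    rest = hits[window_size:]  # never passed to the expensive second query
    rescored = [(doc_id, score + rescore_fn(doc_id)) for doc_id, score in window]
    rescored.sort(key=lambda hit: hit[1], reverse=True)
    return rescored + rest


# First-pass scores from the cheap match query (made-up numbers).
hits = [("a", 1.2), ("b", 1.1), ("c", 1.0), ("d", 0.9)]
# Extra score the proximity query would give each doc (made-up numbers).
proximity_bonus = {"b": 0.8, "d": 5.0}

result = rescore_top_window(hits, 3, lambda doc_id: proximity_bonus.get(doc_id, 0.0))
print([doc_id for doc_id, _ in result])
```

Doc b moves to the top because its terms are close together, while doc d stays last: it falls outside the window, so its proximity score is never computed. That is the trade-off of a small window_size.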

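Disclaimer: this article was contributed by a community user and does not represent the position of 【wpsshop博客】. Copyright belongs to the original author, and this site assumes no legal liability. If you find infringing content, please contact us. When reposting, please cite the source: https://www.wpsshop.cn/w/运维做开发/article/detail/975909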