Analysis (a concept, not an API): text analysis is the process of converting full text into a series of terms, also known as tokenization. Analysis is performed by an analyzer; you can use one of Elasticsearch's built-in analyzers or define a custom one. Besides converting terms when data is indexed, the same analyzer must also be applied to the query string at search time.
An analyzer consists of three parts: character filters, a tokenizer, and token filters. For example, given the text:

Hello a World, the world is beautiful

[Figure: the analysis pipeline (images2/analysis.png)]
Analyzer | Behavior |
---|---|
Standard Analyzer | The default analyzer; splits on word boundaries and lowercases tokens |
Simple Analyzer | Splits on any non-letter character (symbols are discarded) and lowercases |
Stop Analyzer | Lowercases and removes stop words (the, a, this, ...) |
Whitespace Analyzer | Splits on whitespace; does not lowercase |
Keyword Analyzer | No tokenization; emits the whole input as a single token |
Pattern Analyzer | Splits with a regular expression, \W+ (non-word characters) by default |
A. Standard Analyzer
GET _analyze
{
"analyzer": "standard",
"text": "2 Running quick brown-foxes leap over lazy dog in the summer evening"
}
B. Simple Analyzer
GET _analyze
{
"analyzer": "simple",
"text": "2 Running quick brown-foxes leap over lazy dog in the summer evening"
}
C. Stop Analyzer
GET _analyze
{
"analyzer": "stop",
"text": "2 Running quick brown-foxes leap over lazy dog in the summer evening"
}
D. Whitespace Analyzer
GET _analyze
{
"analyzer": "whitespace",
"text": "2 Running quick brown-foxes leap over lazy dog in the summer evening"
}
E. Keyword Analyzer
GET _analyze
{
"analyzer": "keyword",
"text": "2 Running quick brown-foxes leap over lazy dog in the summer evening"
}
F. Pattern Analyzer
GET _analyze
{
"analyzer": "pattern",
"text": "2 Running quick brown-foxes leap over lazy dog in the summer evening"
}
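For reference, the requests above should produce roughly the following tokens (a sketch of the expected output; token attributes such as offsets and positions are omitted):

standard:   [2, running, quick, brown, foxes, leap, over, lazy, dog, in, the, summer, evening]
simple:     [running, quick, brown, foxes, leap, over, lazy, dog, in, the, summer, evening]
stop:       [running, quick, brown, foxes, leap, over, lazy, dog, summer, evening]
whitespace: [2, Running, quick, brown-foxes, leap, over, lazy, dog, in, the, summer, evening]
keyword:    [2 Running quick brown-foxes leap over lazy dog in the summer evening]
pattern:    [2, running, quick, brown, foxes, leap, over, lazy, dog, in, the, summer, evening]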
Chinese tokenization is a major challenge for every search engine: a Chinese sentence has to be split into individual words, and the same sentence can be read differently depending on context. For example:

这个苹果,不大好吃 (this apple is not very tasty) / 这个苹果,不大,好吃 (this apple is small, and tasty)
The IK analyzer supports custom dictionaries and hot updates of the tokenization dictionary; it is available at https://github.com/medcl/elasticsearch-analysis-ik
elasticsearch-plugin.bat install https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v6.3.0/elasticsearch-analysis-ik-6.3.0.zip
Installation: run the elasticsearch-plugin install command shown above (the plugin version must match your Elasticsearch version), then restart the node.

The IK plugin provides the following analyzers:

ik_max_word: segments text at the finest granularity
ik_smart: segments text at the coarsest granularity

The HanLP analysis plugin is installed as follows:

1. Download the dictionary archive data-for-1.7.5.zip and extract it into the plugin's analysis-hanlp directory.
2. Copy the two files hanlp.properties and hanlp-remote.xml from the config folder of the extracted archive into the analysis-hanlp folder under the config directory of the Elasticsearch home (the analysis-hanlp directory must be created manually).
3. Copy the six files provided in the hanlp folder into the $ES_HOME\plugins\analysis-hanlp\data\dictionary\custom directory.

The analyzers provided by HanLP include hanlp (HanLP's default segmentation), hanlp_standard (standard segmentation), and hanlp_speed (extreme-speed segmentation), demonstrated below:
ik_smart
GET _analyze
{
"analyzer": "ik_smart",
"text": ["剑桥分析公司多位高管对卧底记者说,他们确保了唐纳德·特朗普在总统大选中获胜"]
}
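For comparison, ik_max_word segments the same sentence at a finer granularity (assuming the IK plugin installed above):

GET _analyze
{
  "analyzer": "ik_max_word",
  "text": ["剑桥分析公司多位高管对卧底记者说,他们确保了唐纳德·特朗普在总统大选中获胜"]
}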
hanlp
GET _analyze
{
"analyzer": "hanlp",
"text": ["剑桥分析公司多位高管对卧底记者说,他们确保了唐纳德·特朗普在总统大选中获胜"]
}
hanlp_standard
GET _analyze
{
"analyzer": "hanlp_standard",
"text": ["剑桥分析公司多位高管对卧底记者说,他们确保了唐纳德·特朗普在总统大选中获胜"]
}
hanlp_speed
GET _analyze
{
"analyzer": "hanlp_speed",
"text": ["剑桥分析公司多位高管对卧底记者说,他们确保了唐纳德·特朗普在总统大选中获胜"]
}
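Depending on the plugin version, further analyzers may be available; as an assumption based on the plugin's documentation, hanlp_index (an index-oriented, finer-grained segmentation) can be tried the same way:

GET _analyze
{
  "analyzer": "hanlp_index",
  "text": ["剑桥分析公司多位高管对卧底记者说,他们确保了唐纳德·特朗普在总统大选中获胜"]
}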
With so many analyzers listed above, how are they applied in practice?

To use an analyzer, first specify which analyzer should be applied to which field, as shown below:
PUT customers
{
"mappings": {
"properties": {
"content": {
"type": "text",
"analyzer": "hanlp_standard"
}
}
}
}
POST customers/_bulk
{"index":{}}
{"content":"如不能登录,请在百端登录百度首页,点击【登录遇到问题】,进行找回密码操作"}
{"index":{}}
{"content":"网盘客户端访问隐藏空间需要输入密码方可进入。"}
{"index":{}}
{"content":"剑桥的网盘不好用"}
GET customers/_search
{
"query": {
"match": {
"content": "密码"
}
}
}
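To inspect how the content field tokenizes text, the _analyze API can be run against the mapped field (a minimal sketch against the customers index created above; the sample text is taken from the first document):

GET customers/_analyze
{
  "field": "content",
  "text": "进行找回密码操作"
}

Since both of the first two documents produce the token 密码, the match query above should return those two hits.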
During queries we may need to search by pinyin. The pinyin analyzer was mentioned among the Chinese analyzers; how is it used in real work?
PUT /medcl
{
  "settings": {
    "analysis": {
      "analyzer": {
        "pinyin_analyzer": {
          "tokenizer": "my_pinyin"
        }
      },
      "tokenizer": {
        "my_pinyin": {
          "type": "pinyin",
          "keep_separate_first_letter": false,
          "keep_full_pinyin": true,
          "keep_original": true,
          "limit_first_letter_length": 16,
          "lowercase": true,
          "remove_duplicated_term": true
        }
      }
    }
  }
}
As shown above, we defined an analyzer named pinyin_analyzer on top of the existing pinyin tokenizer. The available parameters are documented at https://github.com/medcl/elasticsearch-analysis-pinyin
PUT medcl/_mapping
{
  "properties": {
    "name": {
      "type": "keyword",
      "fields": {
        "pinyin": {
          "type": "text",
          "analyzer": "pinyin_analyzer",
          "boost": 10
        }
      }
    }
  }
}
POST medcl/_bulk
{"index":{}}
{"name": "刘德华"}
{"index":{}}
{"name": "张学友"}
{"index":{}}
{"name": "四大天王"}
{"index":{}}
{"name": "柳岩"}
{"index":{}}
{"name": "angel baby"}
GET medcl/_search
{
"query": {
"match": {
"name.pinyin": "ldh"
}
}
}
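The query above matches 刘德华 because the first-letter abbreviation ldh is indexed (see limit_first_letter_length). Full pinyin syllables work as well; as a sketch against the same index, the following should match both 刘德华 and 柳岩, whose surnames share the syllable liu:

GET medcl/_search
{
  "query": {
    "match": {
      "name.pinyin": "liu"
    }
  }
}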
PUT goods
{
  "settings": {
    "analysis": {
      "analyzer": {
        "hanlp_standard_pinyin": {
          "type": "custom",
          "tokenizer": "hanlp_standard",
          "filter": ["my_pinyin"]
        }
      },
      "filter": {
        "my_pinyin": {
          "type": "pinyin",
          "keep_separate_first_letter": false,
          "keep_full_pinyin": true,
          "keep_original": true,
          "limit_first_letter_length": 16,
          "lowercase": true,
          "remove_duplicated_term": true
        }
      }
    }
  }
}
PUT goods/_mapping
{
  "properties": {
    "content": {
      "type": "text",
      "analyzer": "hanlp_standard_pinyin"
    }
  }
}
POST goods/_bulk
{"index":{}}
{"content":"如不能登录,请在百端登录百度首页,点击【登录遇到问题】,进行找回密码操作"}
{"index":{}}
{"content":"网盘客户端访问隐藏空间需要输入密码方可进入。"}
{"index":{}}
{"content":"剑桥的网盘不好用"}
GET goods/_search
{
  "query": {
    "match": {
      "content": "caozuo"
    }
  },
  "highlight": {
    "pre_tags": "<em>",
    "post_tags": "</em>",
    "fields": {
      "content": {}
    }
  }
}
keep_separate_first_letter: when true, input 刘德华 is tokenized as l, d, h
keep_full_pinyin: when true, input 刘德华 is tokenized as liu, de, hua
keep_original: when true, the original input 刘德华 is kept as a token
limit_first_letter_length: the maximum length of the first-letter abbreviation
lowercase: convert tokens to lowercase
remove_duplicated_term: remove duplicated tokens
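The effect of these settings can be verified with the _analyze API (a sketch; with the configuration above, the output should contain the full-pinyin syllables, the original term, and the first-letter abbreviation, roughly liu, de, hua, 刘德华, ldh):

GET medcl/_analyze
{
  "analyzer": "pinyin_analyzer",
  "text": "刘德华"
}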
To integrate with Spring Boot, add the Spring Data Elasticsearch starter:

<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-data-elasticsearch</artifactId>
</dependency>
Then configure the cluster address in application.yml:

spring:
  elasticsearch:
    rest:
      uris: http://localhost:9200
@Configuration
public class ElasticsearchConfig extends ElasticsearchConfigurationSupport {

    @Bean
    public Client elasticsearchClient() throws UnknownHostException {
        Settings settings = Settings.builder()
                .put("cluster.name", "my-application")
                .build();
        TransportClient client = new PreBuiltTransportClient(settings);
        client.addTransportAddress(new TransportAddress(InetAddress.getByName("127.0.0.1"), 9300));
        return client;
    }

    @Bean(name = {"elasticsearchOperations", "elasticsearchTemplate"})
    public ElasticsearchTemplate elasticsearchTemplate() throws UnknownHostException {
        return new ElasticsearchTemplate(elasticsearchClient(), entityMapper());
    }

    // use the ElasticsearchEntityMapper
    @Bean
    @Override
    public EntityMapper entityMapper() {
        ElasticsearchEntityMapper entityMapper = new ElasticsearchEntityMapper(
                elasticsearchMappingContext(), new DefaultConversionService());
        entityMapper.setConversions(elasticsearchCustomConversions());
        return entityMapper;
    }
}
@Document(indexName = "movies", type = "_doc")
public class Movie {
private String id;
private String title;
private Integer year;
private List<String> genre;
// setters and getters
}
A. Paged query

// Paged query; PageRequest.of is zero-based, so page 0 is the first page
@RequestMapping("/page")
public Object pageQuery(
        @RequestParam(required = false, defaultValue = "10") Integer size,
        @RequestParam(required = false, defaultValue = "0") Integer page) {
    SearchQuery searchQuery = new NativeSearchQueryBuilder()
            .withPageable(PageRequest.of(page, size))
            .build();
    List<Movie> movies = elasticsearchTemplate
            .queryForList(searchQuery, Movie.class);
    return movies;
}
B. Range query

// Single-condition range query: find all movies released between 2016 and 2018
@RequestMapping("/range")
public Object rangeQuery() {
SearchQuery searchQuery = new NativeSearchQueryBuilder()
.withQuery(new RangeQueryBuilder("year").from(2016).to(2018))
.build();
List<Movie> movies = elasticsearchTemplate
.queryForList(searchQuery, Movie.class);
return movies;
}
C. Match query

// Single-condition full-text query: matches documents whose title contains any of the analyzed search terms
@RequestMapping("/match")
public Object singleCriteriaQuery(String searchText) {
SearchQuery searchQuery = new NativeSearchQueryBuilder()
.withQuery(new MatchQueryBuilder("title", searchText))
.build();
List<Movie> movies = elasticsearchTemplate
.queryForList(searchQuery, Movie.class);
return movies;
}
D. Multi-condition paged query

// Multi-condition paged query: both clauses must match
@RequestMapping("/match/multiple")
public Object multiplePageQuery(
        @RequestParam(required = true) String searchText,
        @RequestParam(required = false, defaultValue = "10") Integer size,
        @RequestParam(required = false, defaultValue = "0") Integer page) {
    SearchQuery searchQuery = new NativeSearchQueryBuilder()
            .withQuery(new BoolQueryBuilder()
                    .must(new MatchQueryBuilder("title", searchText))
                    .must(new RangeQueryBuilder("year").from(2016).to(2018)))
            .withPageable(PageRequest.of(page, size))
            .build();
    List<Movie> movies = elasticsearchTemplate
            .queryForList(searchQuery, Movie.class);
    return movies;
}
E. Multi-condition OR query

// Multi-condition OR query: a document matches if either clause matches
@RequestMapping("/match/or/multiple")
public Object multipleOrQuery(@RequestParam(required = true) String searchText) {
    SearchQuery searchQuery = new NativeSearchQueryBuilder()
            .withQuery(new BoolQueryBuilder()
                    .should(new MatchQueryBuilder("title", searchText))
                    .should(new RangeQueryBuilder("year").from(2016).to(2018)))
            .build();
    List<Movie> movies = elasticsearchTemplate
            .queryForList(searchQuery, Movie.class);
    return movies;
}
F. Exact match on a single term

// Matches documents containing the given term; term queries are not analyzed, so the input must be a single term
@RequestMapping("/term")
public Object termQuery(@RequestParam(required = true) String searchText) {
SearchQuery searchQuery = new NativeSearchQueryBuilder()
.withQuery(new TermQueryBuilder("title", searchText)).build();
List<Movie> movies = elasticsearchTemplate.queryForList(searchQuery, Movie.class);
return movies;
}
Exact match on multiple terms

// Matches documents containing any of the given terms
@RequestMapping("/terms")
public Object termsQuery(@RequestParam(required = true) String searchText) {
SearchQuery searchQuery = new NativeSearchQueryBuilder()
.withQuery(new TermsQueryBuilder("title", searchText.split("\\s+"))).build();
List<Movie> movies = elasticsearchTemplate.queryForList(searchQuery, Movie.class);
return movies;
}
G. Phrase match
@RequestMapping("/phrase")
public Object phraseQuery(@RequestParam(required = true) String searchText) {
SearchQuery searchQuery = new NativeSearchQueryBuilder()
.withQuery(new MatchPhraseQueryBuilder("title", searchText))
.build();
List<Movie> movies = elasticsearchTemplate
.queryForList(searchQuery, Movie.class);
return movies;
}
H. Fetch only selected fields
@RequestMapping("/source")
public Object sourceQuery(@RequestParam(required = true) String searchText) {
SearchQuery searchQuery = new NativeSearchQueryBuilder()
.withSourceFilter(new FetchSourceFilter(
new String[]{"title", "year", "id"}, new String[]{}))
.withQuery(new MatchPhraseQueryBuilder("title", searchText))
.build();
List<Movie> movies = elasticsearchTemplate.queryForList(searchQuery, Movie.class);
return movies;
}
I. Multi-field match
@RequestMapping("/multiple/field")
public Object allTermsQuery(@RequestParam(required = true) String searchText) {
SearchQuery searchQuery = new NativeSearchQueryBuilder()
.withQuery(new MultiMatchQueryBuilder(searchText, "title", "genre")
.type(MultiMatchQueryBuilder.Type.MOST_FIELDS))
.build();
List<Movie> movies = elasticsearchTemplate.queryForList(searchQuery, Movie.class);
return movies;
}
J. Require all terms to be present

// All search terms must be present (AND semantics)
@RequestMapping("/also/include")
public Object alsoInclude(@RequestParam(required = true) String searchText) {
SearchQuery searchQuery = new NativeSearchQueryBuilder()
.withQuery(new QueryStringQueryBuilder(searchText)
.field("title").defaultOperator(Operator.AND))
.build();
List<Movie> movies = elasticsearchTemplate.queryForList(searchQuery, Movie.class);
return movies;
}
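For reference, the Java snippet above builds the equivalent of the following REST request (a sketch; the movies index name comes from the @Document mapping, while the sample search text is an assumption):

GET movies/_search
{
  "query": {
    "query_string": {
      "default_field": "title",
      "query": "beautiful mind",
      "default_operator": "AND"
    }
  }
}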
To import data with Logstash, first copy the MySQL JDBC driver jar into logstash-core\lib\jars under the Logstash home directory.
input {
  jdbc {
    jdbc_driver_class => "com.mysql.jdbc.Driver"
    jdbc_connection_string => "jdbc:mysql://localhost:3306/es?useSSL=false&serverTimezone=UTC"
    jdbc_user => es
    jdbc_password => "123456"
    # enable value tracking; if true, tracking_column must be specified
    use_column_value => false
    # the column to track
    tracking_column => "id"
    # type of the tracked column: numeric or timestamp, numeric by default
    tracking_column_type => "numeric"
    # record the result of the last run
    record_last_run => true
    # where the last-run state is stored
    last_run_metadata_path => "mysql-position.txt"
    statement => "SELECT * FROM news where tags is not null"
    # run every day at 17:57
    schedule => "0 57 17 * * *"
  }
}
filter {
  mutate {
    split => { "tags" => "," }
  }
}
output {
  elasticsearch {
    document_id => "%{id}"
    document_type => "_doc"
    index => "news"
    hosts => ["http://localhost:9200"]
  }
  stdout {
    codec => rubydebug
  }
}