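The requirement: process local mail content together with PDF, Excel, Word and other attachments, store everything in Elasticsearch, and provide full-text search over both the mail body and the attachment content.
OS: CentOS 7.3
Elasticsearch version: 7.13.3
Kibana version: 7.16.3
ingest-attachment plugin version: 7.13.3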
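Kibana is an open-source analytics and visualization platform designed to work with Elasticsearch. Here we mainly use it to run commands in Kibana Dev Tools.
Ingest-Attachment is a plugin that works out of the box. It lets Elasticsearch index files in common formats as attachments. The plugin uses Apache Tika to extract text from the files; supported formats include TXT, DOC, PPT, XLS and PDF, and the extraction runs automatically at ingest time. Note: the source field must contain the binary data encoded as base64.
Limitation: for XLS and XLSX files, the sheets cannot be indexed separately; the whole file is inserted into Elasticsearch as a single document.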
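Because the source field must be base64, the client has to encode each file before sending it. A minimal sketch in Python, assuming the file fits in memory; the helper names encode_attachment and encode_file are illustrative and not part of the original setup.

```python
import base64
from pathlib import Path


def encode_attachment(raw: bytes) -> str:
    """Return the base64 text that the attachment processor expects."""
    return base64.b64encode(raw).decode("ascii")


def encode_file(path: str) -> str:
    """Read a file from disk and base64-encode its bytes."""
    return encode_attachment(Path(path).read_bytes())


# The sample payload used later in this article is plain UTF-8 text.
sample = encode_attachment("我爱你中国2023".encode("utf-8"))
print(sample)  # 5oiR54ix5L2g5Lit5Zu9MjAyMw==
```

The printed string is the same value used as the content of the first attachment in the indexing request further down.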
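3. Installing the plugin
Here I install Ingest-Attachment offline. Download the plugin archive that matches the Elasticsearch version with wget:
wget https://artifacts.elastic.co/downloads/elasticsearch-plugins/ingest-attachment/ingest-attachment-7.13.3.zip
Upload it to the server, to this path:
/home/es/install/ingest-attachment-7.13.3.zip
Change to the Elasticsearch home directory (ES_HOME) and run the following commands to install it:
- cd /home/elasticsearch/
- ./bin/elasticsearch-plugin install file:///home/es/install/ingest-attachment-7.13.3.zip
Restart the Elasticsearch service after the installation finishes.
The plugin is now installed.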
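To confirm the installation from code, one option is to call GET _cat/plugins?format=json and look for the plugin on each node. The sketch below assumes the response has already been fetched and parsed into a list of rows; the function name and the sample node names are made up for illustration.

```python
def nodes_with_plugin(cat_plugins_rows, plugin="ingest-attachment"):
    """Return the sorted names of nodes that report the given plugin.

    cat_plugins_rows is the parsed JSON of GET _cat/plugins?format=json:
    a list of objects with "name" (node), "component" (plugin) and "version".
    """
    return sorted(
        row["name"] for row in cat_plugins_rows if row.get("component") == plugin
    )


# Illustrative rows; the node names are invented for this example.
sample_rows = [
    {"name": "node-1", "component": "ingest-attachment", "version": "7.13.3"},
    {"name": "node-2", "component": "analysis-ik", "version": "7.13.3"},
]
print(nodes_with_plugin(sample_rows))
```

The plugin has to be present on every node in the cluster, so a node that is missing from the result still needs the install and restart steps.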
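A mail may carry several attachments, so I define a text-extraction pipeline for multiple attachments. It also removes the base64 binary data once the text has been extracted.
Note that with multiple attachments, field and target_field must be written as _ingest._value.*; otherwise the processor cannot find the right fields.
Run the following in Kibana Dev Tools: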
- PUT _ingest/pipeline/multiple_attachment
- {
- "description" : "Extract attachment information from arrays",
- "processors" : [
- {
- "foreach" : {
- "field" : "attachments",
- "processor" : {
- "attachment" : {
- "target_field" : "_ingest._value.attachment",
- "field" : "_ingest._value.content"
- }
- }
- }
- },
- {
- "foreach" : {
- "field" : "attachments",
- "processor" : {
- "remove" : {
- "field" : "_ingest._value.content"
- }
- }
- }
- }
- ]
- }
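Before indexing real mail, the pipeline can be tried out with the simulate API (POST _ingest/pipeline/multiple_attachment/_simulate). The sketch below only builds the request body in Python; build_simulate_body is a hypothetical helper, and the sample text stands in for a real attachment.

```python
import base64
import json


def build_simulate_body(attachments):
    """Build the body for POST _ingest/pipeline/<name>/_simulate.

    attachments: list of (filename, raw_bytes) pairs.
    """
    source = {
        "attachments": [
            {"filename": name, "content": base64.b64encode(raw).decode("ascii")}
            for name, raw in attachments
        ]
    }
    # The simulate API takes the test documents under "docs", each with "_source".
    return {"docs": [{"_source": source}]}


body = build_simulate_body([("note.txt", "ChatGPT 牛逼!".encode("utf-8"))])
print(json.dumps(body, ensure_ascii=False, indent=2))
```

In the simulated result, each element of attachments should carry an attachment object and no content field, which is the combined effect of the two foreach processors.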
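The attachment processor accepts the following options:
Name | Required | Default | Description |
field | yes | - | The field to read the base64-encoded binary from |
target_field | no | attachment | The field that holds the extracted attachment information; mainly needed with multiple attachments |
indexed_chars | no | 100000 | Maximum number of characters to extract, which limits the size of the stored field. -1 means no limit. |
indexed_chars_field | no | - | Name of a field in the document whose value overrides indexed_chars for that document |
properties | no | all properties | The properties to store, for example content, title, name, author, keywords, date, content_type, content_length, language |
ignore_missing | no | false | If true and field does not exist, the processor skips extraction and the document is indexed unchanged; otherwise an error is raised |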
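To show how these options fit into the multi-attachment pipeline, here is a small sketch that generates the pipeline body with indexed_chars and properties set. The function build_multi_attachment_pipeline and the chosen option values are assumptions for illustration; the resulting JSON would be sent with PUT _ingest/pipeline/<name>.

```python
import json


def build_multi_attachment_pipeline(indexed_chars=100000, properties=None):
    """Return a pipeline body like multiple_attachment with extra options."""
    attachment = {
        "field": "_ingest._value.content",
        "target_field": "_ingest._value.attachment",
        "indexed_chars": indexed_chars,
    }
    if properties is not None:
        # Only store the listed properties instead of all of them.
        attachment["properties"] = list(properties)
    remove = {"field": "_ingest._value.content"}
    return {
        "description": "Extract attachment information from arrays",
        "processors": [
            {"foreach": {"field": "attachments", "processor": {"attachment": attachment}}},
            {"foreach": {"field": "attachments", "processor": {"remove": remove}}},
        ],
    }


pipeline = build_multi_attachment_pipeline(
    indexed_chars=-1, properties=["content", "content_type", "content_length"]
)
print(json.dumps(pipeline, indent=2))
```

Setting indexed_chars to -1 removes the extraction limit, which may noticeably increase index size and memory use for large files.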
- PUT mail
- {
- "settings": {
- "index": {
- "max_result_window": 100000000
- },
- "number_of_shards": 3,
- "number_of_replicas": 0
- },
- "mappings": {
- "properties": {
- "mfrom": {
- "type": "keyword"
- },
- "mto": {
- "type": "keyword"
- },
- "mcc": {
- "type": "keyword"
- },
- "mbcc": {
- "type": "keyword"
- },
- "rcvtime": {
- "type": "date",
- "format": "yyyy-MM-dd HH:mm:ss"
- },
- "subject": {
- "type": "keyword"
- },
- "importance": {
- "type": "keyword"
- },
- "savepath": {
- "type": "keyword"
- },
- "mbody": {
- "type": "text",
- "fields": {
- "keyword": {
- "ignore_above": 256,
- "type": "keyword"
- }
- }
- },
- "attachments": {
- "properties": {
- "attachment": {
- "properties": {
- "content": {
- "type": "text",
- "fields": {
- "keyword": {
- "ignore_above": 256,
- "type": "keyword"
- }
- }
- },
- "filename": {
- "type": "keyword"
- },
- "type": {
- "type": "keyword"
- }
- }
- }
- }
- }
- }
- }
- }
If the index is created successfully, Elasticsearch returns:
- {
- "acknowledged" : true,
- "shards_acknowledged" : true,
- "index" : "mail"
- }
Documents can be inserted or updated by calling the Elasticsearch RESTful API, for example from Postman.
Request method: POST
Request URL: http://192.168.31.200:9200/mail/_doc?pipeline=multiple_attachment
In the URL, mail is the index name, and pipeline=multiple_attachment selects the multiple_attachment pipeline.
The request body is JSON:
- {
- "mfrom": "microsoft.teams@outlook.com",
- "mto": "network@163.com",
- "mcc": "",
- "mbcc": "",
- "rcvtime": "2023-05-18 23:35:29",
- "subject": "神奇的邮件2023066- ",
- "importance": "1",
- "savepath": "d:\\mail\\TEST123.eml",
- "mbody": "这是邮件内容",
- "attachments": [
- {
- "filename": "附件名字1.pdf",
- "type": ".pdf",
- "content": "5oiR54ix5L2g5Lit5Zu9MjAyMw=="
- },
- {
- "filename": "附件名字2.xlsx",
- "type": ".xlsx",
- "content": "Q2hhdEdQVCDniZvpgLwh"
- }
- ]
- }
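attachments is a JSON array; here it holds two attachments. filename is the attachment's file name, and content is the attachment file's bytes as a base64-encoded string. During indexing the pipeline detects the content type and extracts the text automatically; everything else works as with an ordinary index.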
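A body of this shape can be produced from a saved .eml file with Python's standard email package. The following is a sketch under assumptions: the field names follow the mapping above, importance is left out, address headers are copied as-is, and eml_to_document is a hypothetical helper. The demo message is built in memory only so that the example is self-contained; its attachment bytes are a placeholder, not a real PDF.

```python
import base64
import os
from email import message_from_bytes, policy
from email.message import EmailMessage
from email.utils import parsedate_to_datetime


def eml_to_document(raw_eml: bytes, savepath: str) -> dict:
    """Turn the bytes of an .eml file into a document for the mail index."""
    msg = message_from_bytes(raw_eml, policy=policy.default)
    body_part = msg.get_body(preferencelist=("plain", "html"))
    attachments = []
    for part in msg.iter_attachments():
        filename = part.get_filename() or ""
        payload = part.get_payload(decode=True) or b""
        attachments.append({
            "filename": filename,
            "type": os.path.splitext(filename)[1].lower(),
            "content": base64.b64encode(payload).decode("ascii"),
        })
    doc = {
        "mfrom": str(msg.get("From", "")),
        "mto": str(msg.get("To", "")),
        "mcc": str(msg.get("Cc", "")),
        "mbcc": str(msg.get("Bcc", "")),
        "subject": str(msg.get("Subject", "")),
        "savepath": savepath,
        "mbody": body_part.get_content() if body_part else "",
        "attachments": attachments,
    }
    date_header = msg.get("Date")
    if date_header:
        # Match the "yyyy-MM-dd HH:mm:ss" format declared for rcvtime.
        # The wall-clock time of the header is kept; the UTC offset is dropped.
        parsed = parsedate_to_datetime(str(date_header))
        doc["rcvtime"] = parsed.strftime("%Y-%m-%d %H:%M:%S")
    return doc


# Demo message built in memory (all values are illustrative).
demo = EmailMessage()
demo["From"] = "sender@example.com"
demo["To"] = "receiver@example.com"
demo["Subject"] = "Demo mail"
demo["Date"] = "Thu, 18 May 2023 23:35:29 +0800"
demo.set_content("This is the mail body")
demo.add_attachment(b"%PDF-1.4 demo", maintype="application",
                    subtype="pdf", filename="report.pdf")

doc = eml_to_document(bytes(demo), "d:\\mail\\demo.eml")
print(doc["subject"], doc["rcvtime"], [a["filename"] for a in doc["attachments"]])
```

For a file on disk, the same function would be called with the file's bytes, for example open(path, "rb").read().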
A successful request returns:
- {
- "_index": "mail",
- "_type": "_doc",
- "_id": "eiCNNIgBUc2qXUv978Tg",
- "_version": 1,
- "result": "created",
- "_shards": {
- "total": 1,
- "successful": 1,
- "failed": 0
- },
- "_seq_no": 0,
- "_primary_term": 1
- }
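The same call can be made from code instead of Postman. Below is a minimal sketch using only the Python standard library; the host, index and pipeline name are the ones from this article, while build_index_request and send are hypothetical helpers. The request is only constructed here, and send is defined but not called, since it needs a reachable cluster.

```python
import json
import urllib.request


def build_index_request(base_url, index, pipeline, document):
    """Build a POST <index>/_doc?pipeline=<pipeline> request for a document."""
    url = f"{base_url}/{index}/_doc?pipeline={pipeline}"
    data = json.dumps(document, ensure_ascii=False).encode("utf-8")
    return urllib.request.Request(
        url,
        data=data,
        method="POST",
        headers={"Content-Type": "application/json"},
    )


def send(request, timeout=30):
    """Send the request and return the parsed JSON response."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


request = build_index_request(
    "http://192.168.31.200:9200",
    "mail",
    "multiple_attachment",
    {"subject": "demo", "attachments": []},
)
print(request.get_method(), request.full_url)
```

On success, send(request) would return a dictionary like the response above, with result set to created and the generated _id.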
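6.1 Fetching a document by _id
GET request URL: http://192.168.31.200:9200/mail/_doc/eiCNNIgBUc2qXUv978Tg
No parameters or request body are needed.
Here eiCNNIgBUc2qXUv978Tg is the document _id and mail is the index to query.
Response: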
- {
- "_index": "mail",
- "_type": "_doc",
- "_id": "eiCNNIgBUc2qXUv978Tg",
- "_version": 1,
- "_seq_no": 0,
- "_primary_term": 1,
- "found": true,
- "_source": {
- "savepath": "d:\\mail\\TEST123.eml",
- "mbody": "这是邮件内容",
- "attachments": [
- {
- "filename": "附件名字1.pdf",
- "attachment": {
- "content_type": "text/plain; charset=UTF-8",
- "language": "lt",
- "content": "我爱你中国2023",
- "content_length": 10
- },
- "type": ".pdf"
- },
- {
- "filename": "附件名字2.xlsx",
- "attachment": {
- "content_type": "text/plain; charset=UTF-8",
- "language": "lt",
- "content": "ChatGPT 牛逼!",
- "content_length": 12
- },
- "type": ".xlsx"
- }
- ],
- "mbcc": "",
- "subject": "神奇的邮件2023066- ",
- "importance": "1",
- "mfrom": "microsoft.teams@outlook.com",
- "mto": "network@163.com",
- "mcc": "",
- "rcvtime": "2023-05-18 23:35:29"
- }
- }
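When this response is consumed in code, the extracted text sits under attachments[i].attachment.content, and the raw base64 content field should be gone. A small sketch of reading it, assuming the response has been parsed into a dictionary; summarize_attachments is a hypothetical helper and the sample is a shortened copy of the response above.

```python
def summarize_attachments(get_response):
    """Summarize the attachments of a GET <index>/_doc/<id> response."""
    if not get_response.get("found"):
        return []
    summaries = []
    for item in get_response["_source"].get("attachments", []):
        extracted = item.get("attachment", {})
        summaries.append({
            "filename": item.get("filename"),
            "content_type": extracted.get("content_type"),
            "text": extracted.get("content", ""),
            # True would mean the remove processor did not strip the base64 data.
            "has_raw_base64": "content" in item,
        })
    return summaries


sample = {
    "found": True,
    "_source": {
        "attachments": [
            {
                "filename": "附件名字1.pdf",
                "attachment": {
                    "content_type": "text/plain; charset=UTF-8",
                    "content": "我爱你中国2023",
                },
                "type": ".pdf",
            }
        ]
    },
}
for row in summarize_attachments(sample):
    print(row)
```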
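6.2 Wildcard search on attachment file names
POST request URL: http://192.168.31.200:9200/mail/_search
The request body is JSON. attachments.filename holds the attachment file name; in the mapping above it is a keyword field, so it is not analyzed and can be matched with a wildcard pattern.
- {
- "query": {
- "bool": {
- "should": [{
- "wildcard": {
- "attachments.filename": "*附件*"
- }
- }]
- }
- }
- }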
Response:
- {
- "took": 2,
- "timed_out": false,
- "_shards": {
- "total": 3,
- "successful": 3,
- "skipped": 0,
- "failed": 0
- },
- "hits": {
- "total": {
- "value": 1,
- "relation": "eq"
- },
- "max_score": 1.0,
- "hits": [
- {
- "_index": "mail",
- "_type": "_doc",
- "_id": "eiCNNIgBUc2qXUv978Tg",
- "_score": 1.0,
- "_source": {
- "savepath": "d:\\mail\\TEST123.eml",
- "mbody": "这是邮件内容",
- "attachments": [
- {
- "filename": "附件名字1.pdf",
- "attachment": {
- "content_type": "text/plain; charset=UTF-8",
- "language": "lt",
- "content": "我爱你中国2023",
- "content_length": 10
- },
- "type": ".pdf"
- },
- {
- "filename": "附件名字2.xlsx",
- "attachment": {
- "content_type": "text/plain; charset=UTF-8",
- "language": "lt",
- "content": "ChatGPT 牛逼!",
- "content_length": 12
- },
- "type": ".xlsx"
- }
- ],
- "mbcc": "",
- "subject": "神奇的邮件2023066- ",
- "importance": "1",
- "mfrom": "microsoft.teams@outlook.com",
- "mto": "network@163.com",
- "mcc": "",
- "rcvtime": "2023-05-18 23:35:29"
- }
- }
- ]
- }
- }
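In application code, the search body for a file name lookup can be generated from user input. A minimal sketch, assuming the filename field is mapped as keyword as in the mapping shown earlier (hence the default field attachments.filename); filename_wildcard_query is a hypothetical helper.

```python
import json


def filename_wildcard_query(fragment, field="attachments.filename"):
    """Build a search body matching file names that contain fragment."""
    return {"query": {"wildcard": {field: {"value": f"*{fragment}*"}}}}


body = filename_wildcard_query("附件")
print(json.dumps(body, ensure_ascii=False))
```

A pattern that starts with * cannot use the term index efficiently, so this kind of query may become slow on large indices.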
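6.3 Full-text search on attachment content
POST request URL: http://192.168.31.200:9200/mail/_search
The request body is JSON. attachments.attachment.content holds the text extracted from the attachment (plain text, not base64). The match query analyzes the search text into terms and does not interpret * as a wildcard. _source limits the returned fields to subject.
- {
- "size": 10000,
- "_source": [
- "subject"
- ],
- "query": {
- "match": {
- "attachments.attachment.content": "ChatGPT"
- }
- }
- }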
Response:
- {
- "took": 1,
- "timed_out": false,
- "_shards": {
- "total": 3,
- "successful": 3,
- "skipped": 0,
- "failed": 0
- },
- "hits": {
- "total": {
- "value": 1,
- "relation": "eq"
- },
- "max_score": 0.2876821,
- "hits": [
- {
- "_index": "mail",
- "_type": "_doc",
- "_id": "eiCNNIgBUc2qXUv978Tg",
- "_score": 0.2876821,
- "_source": {
- "subject": "神奇的邮件2023066- "
- }
- }
- ]
- }
- }
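On the client side, a search response like this is usually reduced to the fields the application needs. The sketch below builds a match query body and pulls _id, _score and subject out of the hits; both function names are hypothetical, and the sample response is a shortened copy of the one above.

```python
def content_match_query(text, size=10, source_fields=("subject",)):
    """Build a match query on the extracted attachment text."""
    return {
        "size": size,
        "_source": list(source_fields),
        "query": {"match": {"attachments.attachment.content": text}},
    }


def hits_overview(search_response):
    """Return (_id, _score, subject) for each hit of a _search response."""
    return [
        (hit["_id"], hit["_score"], hit.get("_source", {}).get("subject"))
        for hit in search_response["hits"]["hits"]
    ]


sample_response = {
    "hits": {
        "total": {"value": 1, "relation": "eq"},
        "hits": [
            {
                "_index": "mail",
                "_id": "eiCNNIgBUc2qXUv978Tg",
                "_score": 0.2876821,
                "_source": {"subject": "神奇的邮件2023066- "},
            }
        ],
    }
}
print(content_match_query("ChatGPT"))
print(hits_overview(sample_response))
```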
7. Additional notes
For reference, here is a text-extraction pipeline for a single attachment, single_attachment.
Run it in Kibana Dev Tools:
- PUT _ingest/pipeline/single_attachment
- {
- "description" : "Extract single attachment information",
- "processors" : [
- {
- "attachment" : {
- "field": "data",
- "indexed_chars" : -1,
- "ignore_missing" : true
- }
- }
- ]
- }
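With this pipeline the document carries one base64 string in the data field instead of an attachments array, and the extracted text ends up under the default target field attachment. A short sketch of building such a document; build_single_attachment_doc is a hypothetical helper and filename is an extra field added only for illustration.

```python
import base64
import json


def build_single_attachment_doc(filename, raw):
    """Build a document for an index request with ?pipeline=single_attachment."""
    return {
        "filename": filename,  # illustrative extra field, not read by the pipeline
        "data": base64.b64encode(raw).decode("ascii"),  # field the pipeline reads
    }


single_doc = build_single_attachment_doc("note.txt", b"hello")
print(json.dumps(single_doc))
```

Because ignore_missing is true in this pipeline, a document without a data field would be indexed as-is rather than rejected.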
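What remains is integrating this into application code. Using the IK plugin for Chinese word segmentation will be covered in detail later.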