
[Troubleshooting/Ops] Fixing Missing Blocks, Corruption, and Replica Count Problems in HDFS

I. Problem Description

I set up a demo Hadoop environment for functional testing. After it had been in use for a while, Flink jobs could no longer be submitted to the cluster. Resources looked sufficient, but inspecting HDFS revealed missing and corrupt files. This article walks through fixing those HDFS problems.

 

II. Problem Analysis and Resolution

1. HDFS Block Corruption

1.1. Symptoms

Run a filesystem check:

hdfs fsck /

The report shows files with missing and corrupt blocks:

.....
/dodb/datalake/jars/110/e24d18b0014183c95f56a26724353c15.jar:  Under replicated BP-1704786246-10.101.1.140-1663251681207:blk_1073742003_1179. Target Replicas is 3 but found 1 live replica(s), 0 decommissioned replica(s), 0 decommissioning replica(s).
.................................................................................
/flink-savepoint/34c63dd8507daaa028860d984baa6597/chk-142/_metadata: CORRUPT blockpool BP-1704786246-10.101.1.140-1663251681207 block blk_1073742316

/flink-savepoint/34c63dd8507daaa028860d984baa6597/chk-142/_metadata: MISSING 1 blocks of total size 17009 B................
..................
/flink-savepoint/ced56a39eca402a8050c99a5187b995a/chk-151/_metadata: CORRUPT blockpool BP-1704786246-10.101.1.140-1663251681207 block blk_1073742317

/flink-savepoint/ced56a39eca402a8050c99a5187b995a/chk-151/_metadata: MISSING 1 blocks of total size 2439 B.............
/user/commonuser/.flink/application_1663251697076_0001/dataflow-flink.jar:  Under replicated BP-1704786246-10.101.1.140-1663251681207:blk_1073741830_1006. Target Replicas is 3 but found 1 live replica(s), 0 decommissioned replica(s), 0 decommissioning replica(s).
.
.....
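A long fsck report is easier to triage once it is reduced to just the affected paths. The snippet below is a sketch: it assumes the report has been saved to a local file, and embeds a few sample lines in the same style as the output above so it is self-contained.

```shell
# Hypothetical local copy of the report (in practice: hdfs fsck / > /tmp/fsck_report.txt)
cat > /tmp/fsck_report.txt <<'EOF'
/flink-savepoint/34c63dd8507daaa028860d984baa6597/chk-142/_metadata: CORRUPT blockpool BP-1704786246-10.101.1.140-1663251681207 block blk_1073742316
/flink-savepoint/34c63dd8507daaa028860d984baa6597/chk-142/_metadata: MISSING 1 blocks of total size 17009 B
/dodb/datalake/jars/110/e24d18b0014183c95f56a26724353c15.jar:  Under replicated BP-1704786246-10.101.1.140-1663251681207:blk_1073742003_1179.
EOF

# Keep only the paths whose blocks are corrupt or missing, de-duplicated
grep -E 'CORRUPT|MISSING' /tmp/fsck_report.txt | cut -d: -f1 | sort -u
```

On a real cluster, `hdfs fsck / -list-corruptfileblocks` asks the NameNode for a similar path list directly.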

 

1.2. Resolution

The missing blocks belong to savepoints taken while running Flink jobs. Savepoints are generally only needed for cluster migration and similar operations, so here they can simply be discarded.

Case 1: the files can simply be deleted

Use the fsck report above to locate the problematic files; after confirming they are no longer needed, delete them and re-check HDFS health:

hdfs dfs -rm -f -R /flink-savepoint/34c63dd8507daaa028860d984baa6597
Case 2: the files need to be recovered

Reference:
How to fix corrupt HDFS Files
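For reference, the main repair options that `hdfs fsck` itself offers can be sketched as follows. These commands are destructive and need a live cluster, so this sketch only stages them into a script for review rather than executing them; the paths are the ones from this article and should be adapted.

```shell
# Stage the repair commands for review rather than running them blindly
cat > /tmp/hdfs_repair.sh <<'EOF'
#!/usr/bin/env bash
set -euo pipefail

# List every file that has a corrupt block
hdfs fsck / -list-corruptfileblocks

# Option A: move corrupt files to /lost+found, preserving any healthy blocks
hdfs fsck /flink-savepoint -move

# Option B: delete the corrupt files outright (data is gone for good)
hdfs fsck /flink-savepoint -delete
EOF

chmod +x /tmp/hdfs_repair.sh
bash -n /tmp/hdfs_repair.sh && echo "repair script staged"
```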

 

2. Replica Synchronization Problems

2.1. Symptoms

After repairing the corrupt files, check the health of HDFS again:

hdfs fsck /

Status: HEALTHY
 Total size:    1039836161 B
 Total dirs:    179
 Total files:   188
 Total symlinks:                0
 Total blocks (validated):      188 (avg. block size 5531043 B)
 Minimally replicated blocks:   188 (100.0 %)
 Over-replicated blocks:        0 (0.0 %)
 Under-replicated blocks:       69 (36.70213 %)
 Mis-replicated blocks:         0 (0.0 %)
 Default replication factor:    1
 Average block replication:     1.0
 Corrupt blocks:                0
 Missing replicas:              138 (42.331287 %)
 Number of data-nodes:          1
 Number of racks:               1
FSCK ended at Thu Sep 29 13:53:02 CST 2022 in 7 milliseconds


The filesystem under path '/' is HEALTHY

A large number of blocks still have replica problems:

 Missing replicas:              138 (42.331287 %)

The demo environment is a single node holding only one replica, yet the log reports a target of three replicas:

/user/commonuser/.flink/application_1664426842010_0004/log4j.properties:  Under replicated BP-1704786246-10.101.1.140-1663251681207:blk_1073752973_12149. 
Target Replicas is 3 but found 1 live replica(s), 0 decommissioned replica(s), 0 decommissioning replica(s).
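One plausible cause (an assumption here, not something the logs confirm) is that `dfs.replication` is applied by the *writing client* rather than enforced by the NameNode: a Flink client whose Hadoop configuration still carries the default of 3 will request three replicas even against a single-DataNode cluster. A minimal client-side override in `hdfs-site.xml` would look like:

```xml
<!-- hdfs-site.xml in the client's HADOOP_CONF_DIR (e.g. the one Flink picks up) -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
```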

 

2.2. Resolution

First check the configured replication factor: dfs.replication=1 is already correct, which guarantees that newly created files get one replica.

Then reset the replica count of the files that already exist to 1 (`-setrep` applies recursively to all files under a directory by default):

hdfs dfs -setrep -w 1 /
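A quick way to confirm the fix without reading the whole report is to check the `Missing replicas` line of the fsck summary. A minimal sketch (the summary lines are embedded here; on a real cluster, save them with `hdfs fsck / > /tmp/fsck_summary.txt` first):

```shell
# Embedded sample of the summary lines fsck prints after the fix
cat > /tmp/fsck_summary.txt <<'EOF'
 Corrupt blocks:                0
 Missing replicas:              0 (0.0 %)
EOF

# Report OK only when the missing-replica count is zero
awk -F: '/Missing replicas/ {
  gsub(/^[ \t]+/, "", $2)   # trim leading spaces from the value
  split($2, a, " ")         # a[1] is the count
  print (a[1] == 0 ? "OK: no missing replicas" : "WARN: " a[1] " missing replicas")
}' /tmp/fsck_summary.txt
```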

Check HDFS health once more; the filesystem is back to normal:

hdfs fsck /


Status: HEALTHY
....
 Corrupt blocks:                0
 Missing replicas:              0
....
FSCK ended at Fri Sep 30 10:49:05 CST 2022 in 5 milliseconds


The filesystem under path '/' is HEALTHY