HDFS-17815. Fix upload fsimage failure when checkpoint takes a long time #7845
Conversation
💔 -1 overall
This message was automatically generated.
LGTM. Left nit comments inline, FYI. Please check the Yetus report; we need a green Yetus report before commit. Thanks again.
LOG.info("Triggering checkpoint because it has been {} seconds " + | ||
"since the last checkpoint, which exceeds the configured " + | ||
"interval {}", secsSinceLast, checkpointConf.getPeriod()); | ||
LOG.info("Triggering checkpoint because it has been {} seconds " |
Please keep the original format.
@Hexiaoqiao Thanks for the review. I have resolved it.
@@ -487,8 +488,9 @@ private void doWork() {
  namesystem.setCreatedRollbackImages(true);
  namesystem.setNeedRollbackFsImage(false);
}
lastCheckpointTime = now;
LOG.info("Checkpoint finished successfully.");
lastCheckpointTime = monotonicNow(); |
Great catch here. Thanks.
Force-pushed from 3f9c5da to 63e6466.
💔 -1 overall
This message was automatically generated.
💔 -1 overall
This message was automatically generated.
Force-pushed from d585b7b to e525d98.
💔 -1 overall
This message was automatically generated.
🎊 +1 overall
This message was automatically generated.
...oop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/ha/TestStandbyCheckpoints.java (outdated, resolved)
/**
 * Test that lastCheckpointTime is correctly updated at each checkpoint
 */
@Test(timeout = 300000)
public void testLastCheckpointTime() throws Exception { |
This test is passing with your prod change as well for me
e525d98
to
4084cfa
Compare
💔 -1 overall
This message was automatically generated.
Force-pushed from 4084cfa to a4e83fc.
🎊 +1 overall
This message was automatically generated.
Force-pushed from a4e83fc to de1bdae.
🎊 +1 overall
This message was automatically generated.
LGTM.
Force-pushed from de1bdae to 89899c4.
🎊 +1 overall
This message was automatically generated.
@Hexiaoqiao @ayushtkn Hi, do you have any other suggestions?
LGTM. +1. Let's wait to see if @ayushtkn has any more comments here. Thanks again.
LGTM
Our HDFS federation cluster has a capacity of more than 500 PB, with one nameservice containing over 600 million files. A single checkpoint takes nearly two hours.
We found that checkpoints frequently fail because uploading the fsimage to the active NameNode is rejected, which leads to repeated checkpoints. We have configured dfs.recent.image.check.enabled=true. After debugging, the root cause is that the standby NN updates lastCheckpointTime with the start time of the checkpoint rather than the end time. In our cluster, the standby NN's lastCheckpointTime is therefore approximately 80 minutes earlier than the active NN's lastCheckpointTime.
When the interval since the standby NN's lastCheckpointTime exceeds dfs.namenode.checkpoint.period, the next checkpoint is performed. Because the active NN's lastCheckpointTime is later than the standby NN's, the interval as seen by the active NN is still less than dfs.namenode.checkpoint.period, so the fsimage upload is rejected, causing the checkpoint to fail and be retried.
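To make the timing issue concrete, here is a minimal, self-contained Java sketch of a standby checkpointer loop. It is an illustration only, not the actual Hadoop source: the class and method names (StandbyCheckpointerSketch, doCheckpointAndUpload) and the hard-coded period are assumptions; only the idea of moving the lastCheckpointTime update from the start of the checkpoint to the end mirrors this PR's change.

// Illustrative sketch only; names and structure are simplified assumptions.
class StandbyCheckpointerSketch {
  private final long checkpointPeriodSecs = 3600;   // stands in for dfs.namenode.checkpoint.period
  private long lastCheckpointTimeMs = monotonicNow();

  void doWork() {
    long secsSinceLast = (monotonicNow() - lastCheckpointTimeMs) / 1000;
    if (secsSinceLast >= checkpointPeriodSecs) {
      doCheckpointAndUpload();                      // can run for ~80 minutes on a 600M-file namespace
      // Before the fix, the time captured *before* the checkpoint was stored here,
      // so the standby's "time since last checkpoint" ran ahead of the active NN's
      // record of when it last received an image; with dfs.recent.image.check.enabled
      // the next upload could then be rejected as too recent by the active NN.
      lastCheckpointTimeMs = monotonicNow();        // the fix: record the end time
    }
  }

  private void doCheckpointAndUpload() {
    // save the namespace and upload the fsimage to the active NameNode (omitted)
  }

  private static long monotonicNow() {
    return System.nanoTime() / 1_000_000;           // milliseconds from a monotonic clock
  }
}

With the end time recorded, the standby measures the next interval from roughly when the active NN actually received the previous image, so both nodes agree on whether dfs.namenode.checkpoint.period has elapsed and the upload is no longer rejected.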