As always, this is a reprint from my blog. https://munchkins-diary.hatenablog.com/entry/2019/10/28/230924 I haven't written much recently, but I want to start again soon.
This post covers a pitfall that is easy to fall into when getting the checksum of a file on S3 and, as a bonus at the bottom, how to calculate a file's checksum in Java.
I hope it helps someone. (Not too much)
I was writing a process that uploads a file of several GB to S3, and I wanted to be able to retry it when a later processing step failed.
In that case it is wasteful to upload the file again, but the target file in the migration source storage may have changed in the meantime.
So on a retry I want to skip the upload only if exactly the same file is already on S3.
Comparing checksums is the easy way to check this, so I wrote the code below using S3's `getObjectMetadata` API.
private boolean shouldSkip(String bucketName, String key, String md5CheckSum) {
  try {
    ObjectMetadata meta = s3Client.getObjectMetadata(bucketName, key);
    if (meta == null || meta.getContentMD5() == null) {
      log.info("meta data not exist for the file {} in bucket {}", key, bucketName);
      return false;
    }
    log.info(
        "Checksum of existing file is {} and present file checksum is {}",
        meta.getContentMD5(),
        md5CheckSum);
    return meta.getContentMD5().equals(md5CheckSum);
  } catch (SdkClientException e) {
    log.error("Exception thrown while validating the checksum of the file {}", key, e);
    return false;
  }
}
But it doesn't work. `ObjectMetadata#contentMD5` is always null.
After checking, it turns out that the checksum of an existing object in S3 is exposed as the `ETag`, not as `contentMD5`.
Then what is `contentMD5` used for? It is sent in the HTTP `Content-MD5` header at upload time so that S3 can verify the data was not corrupted or tampered with in transit (its proper use), so it is not returned when you fetch an existing object's metadata.
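To make the difference concrete, here is a minimal sketch using only the JDK. It shows the same MD5 digest in two encodings: Base64, which is the form the `Content-MD5` header carries, and lowercase hex, which is the form you see in the ETag of an object uploaded with a single PUT. The class `Md5Forms` and its method names are hypothetical helpers made up for illustration, not part of the AWS SDK.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Base64;

// Hypothetical helper class for illustration; not part of the AWS SDK.
final class Md5Forms {

  // Raw 16-byte MD5 digest of the given bytes.
  static byte[] md5(byte[] data) {
    try {
      return MessageDigest.getInstance("MD5").digest(data);
    } catch (NoSuchAlgorithmException e) {
      // MD5 is one of the algorithms every Java platform must provide.
      throw new IllegalStateException(e);
    }
  }

  // Base64 form, as carried by the Content-MD5 request header.
  static String toContentMd5(byte[] digest) {
    return Base64.getEncoder().encodeToString(digest);
  }

  // Lowercase hex form, as seen in the ETag of a single-PUT object.
  static String toHex(byte[] digest) {
    StringBuilder sb = new StringBuilder(digest.length * 2);
    for (byte b : digest) {
      sb.append(String.format("%02x", b));
    }
    return sb.toString();
  }

  public static void main(String[] args) {
    byte[] digest = md5("hello".getBytes(StandardCharsets.UTF_8));
    System.out.println(toContentMd5(digest)); // XUFAKrxLKna5cZ2REBfFkg==
    System.out.println(toHex(digest));        // 5d41402abc4b2a76b9719d911017c592
  }
}
```

So even where both values describe the same content, comparing a Base64 string with a hex string directly would never match.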
Therefore, if you want to check the checksum of a file stored on S3, you need to compare against the ETag.
Like this.
private boolean shouldSkip(String bucketName, String key, String md5CheckSum) {
  try {
    ObjectMetadata meta = this.s3Client.getObjectMetadata(bucketName, key);
    if (meta == null || meta.getETag() == null) {
      log.info("meta data not exist for the file {} in bucket {}", key, bucketName);
      return false;
    }
    log.info(
        "Checksum of existing file is {} and present file checksum is {}",
        meta.getETag(),
        md5CheckSum);
    return meta.getETag().equals(md5CheckSum);
  } catch (SdkClientException e) {
    log.error("Exception thrown while validating the checksum of the file {}", key, e);
    return false;
  }
}
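This works fine as long as the object was uploaded in a single PUT, which is the case where the ETag is the MD5 hex digest of the content. I hope it helps someone.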
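One caveat is worth guarding against: an object uploaded with multipart upload gets an ETag that is not the MD5 of the whole file and carries a `-<part count>` suffix, and objects encrypted with SSE-KMS or SSE-C also have ETags that are not an MD5 of the content. The sketch below is an assumption-laden guard, not something from the AWS SDK: `EtagCheck` and its methods are hypothetical names. It only checks the shape of the ETag, so it can rule out multipart ETags but cannot detect the encryption cases. It also strips surrounding double quotes, since the raw HTTP header is quoted and some clients return it that way.

```java
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of the AWS SDK.
final class EtagCheck {

  // Exactly 32 hex characters, the shape of an MD5 digest in hex.
  private static final Pattern PLAIN_MD5 = Pattern.compile("[0-9a-fA-F]{32}");

  // Removes the double quotes surrounding a raw ETag header value, if present.
  static String stripQuotes(String etag) {
    if (etag.length() >= 2 && etag.startsWith("\"") && etag.endsWith("\"")) {
      return etag.substring(1, etag.length() - 1);
    }
    return etag;
  }

  // True when the ETag has the shape of a plain MD5 hex digest.
  // Multipart ETags of the form "<hex>-<part count>" do not match.
  static boolean looksLikeMd5(String etag) {
    return etag != null && PLAIN_MD5.matcher(stripQuotes(etag)).matches();
  }

  // Compares only when the ETag can be an MD5 at all; otherwise reports no match.
  static boolean matchesMd5(String etag, String md5Hex) {
    return md5Hex != null
        && looksLikeMd5(etag)
        && stripQuotes(etag).equalsIgnoreCase(md5Hex);
  }
}
```

In `shouldSkip`, a call such as `EtagCheck.matchesMd5(meta.getETag(), md5CheckSum)` could stand in for the plain `equals`, so that a multipart ETag leads to a re-upload instead of a comparison that can never succeed.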
For those who arrived here by googling checksums, here's how to calculate a checksum in Java.
// DigestUtils is org.apache.commons.codec.digest.DigestUtils (Apache Commons Codec).
public static String checkMd5Checksum(File file) {
  try (BufferedInputStream is = new BufferedInputStream(new FileInputStream(file))) {
    // Returns the MD5 digest as a 32-character lowercase hex string.
    return DigestUtils.md5Hex(is);
  } catch (Exception e) {
    // Not likely to occur.
    log.error(
        "ERROR Happened while calculating the check sum for file {}", file.getAbsolutePath(), e);
    return "NOT FOUND";
  }
}
For SHA-256, just change `DigestUtils#md5Hex` to `DigestUtils#sha256Hex`.
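If you would rather not depend on Commons Codec, roughly the same thing can be sketched with the JDK alone. This is an alternative to the `DigestUtils` approach above, not a drop-in copy of it: `FileDigest` is a hypothetical name, and the algorithm is passed as a string (`"MD5"` or `"SHA-256"`) so one method covers both cases.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper for illustration; an alternative to DigestUtils.
final class FileDigest {

  // Streams the file through the named algorithm (e.g. "MD5", "SHA-256")
  // and returns the digest as a lowercase hex string.
  static String hex(Path file, String algorithm)
      throws IOException, NoSuchAlgorithmException {
    MessageDigest md = MessageDigest.getInstance(algorithm);
    byte[] buffer = new byte[8192];
    try (InputStream in = Files.newInputStream(file)) {
      int read;
      // Feed the file in chunks so a multi-GB file never has to fit in memory.
      while ((read = in.read(buffer)) != -1) {
        md.update(buffer, 0, read);
      }
    }
    StringBuilder sb = new StringBuilder();
    for (byte b : md.digest()) {
      sb.append(String.format("%02x", b));
    }
    return sb.toString();
  }
}
```

Unlike `checkMd5Checksum`, this sketch lets exceptions propagate instead of returning a placeholder string, so a failed read cannot be mistaken for a checksum that merely differs.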
That's all for this memo.