Tar gzip file compression calculation without decompressing the file
-
So I was able to use 7zip to run a test against the file in question, but I'm not understanding the output...
Testing archive: path/name.tar.gz
Path = path/name.tar.gz
Type = gzip
Headers Size = 10
Everything is Ok
Size:       1608683520
Compressed: 95962485348
How does that make any sense? The compressed amount matches what I can get through Finder, but the size makes literally zero sense (unless it is using a different measurement, like bytes instead of bits).
-
Now I know I can also browse the file with
tar -tf name.tar.gz
and see what is actually in the tarball, but it's a bit of a pain in the rear to read through potentially millions of entries.
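If a sanity check is enough, counting the entries avoids scrolling through them all; a quick sketch using the same hypothetical filename:
tar -tzf name.tar.gz | wc -l
-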
@DustinB3403 said in Tar gzip file compression calculation without decompressing the file:
So I have a tar.gz file that I've built; it's about 95GB compressed from what the system shows. I would like to determine the uncompressed size of the tar file without actually decompressing the file.
Using
gzip -l file.tar.gz
should work, but reports an incorrect record (total size in bytes is just over 1GB). How else should I do this?
I'm not sure exactly what you're after. The gzip format only stores the uncompressed size modulo 2^32 (4 GiB) in its trailer, so gzip -l wraps around on anything bigger than 4 GiB; that's also why the 7zip "Size" above looks far too small. So the only way to calculate how big the uncompressed file becomes is to read through the entire compressed file.
The problem with decompressing to disk is that you have to create a big file and it takes time. If you instead decompress on the fly and count the bytes, you avoid that problem:
gzip -c -d file.gz | wc -c
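As a cross-check on the wraparound above, the byte count from that pipe should equal gzip -l's figure once you reduce it modulo 2^32. A sketch, assuming gzip -l prints the uncompressed size in the second column of its second output line:
actual=$(gzip -cd file.tar.gz | wc -c)
reported=$(gzip -l file.tar.gz | awk 'NR==2 {print $2}')
echo $(( actual % 4294967296 ))   # should print the same number as $reported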
-
@Pete-S said in Tar gzip file compression calculation without decompressing the file:
I'm not sure exactly what you're after.
I'm attempting to create an archive, and confirm that 100% of what I archived is in said archive by byte count.
Rather than having to review the compressed tarball and look through it for specific files or folders.
Essentially, I want to verify my archives before I offload them to cloud storage, rather than find out (who knows how far down the line) that something was missed, for whatever reason.
-
I guess a more relevant way to have expressed my question would have been to ask:
How do you quickly and efficiently verify what is in your tarball before you offload it?
I want to trust, but verify (as this is a backup).
-
@DustinB3403 said in Tar gzip file compression calculation without decompressing the file:
I guess a more relevant way to have expressed my question would have been to ask:
How do you quickly and efficiently verify what is in your tarball before you offload it?
I want to trust, but verify (as this is a backup).
Generally done with a hash of the files: SHA-256, MD5, or similar.
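For example, a minimal sketch with GNU coreutils (paths hypothetical): hash every file in the source folder once, then re-check the list whenever you want:
find /path/to/folder -type f -exec sha256sum {} + > folder.sha256
sha256sum -c folder.sha256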
-
@Pete-S said in Tar gzip file compression calculation without decompressing the file:
@DustinB3403 said in Tar gzip file compression calculation without decompressing the file:
I guess a more relevant way to have expressed my question would have been to ask:
How do you quickly and efficiently verify what is in your tarball before you offload it?
I want to trust, but verify (as this is a backup).
Generally done with a hash of the files: SHA-256, MD5, or similar.
And how would that work from my source of unpacked files?
-
@DustinB3403 said in Tar gzip file compression calculation without decompressing the file:
@Pete-S said in Tar gzip file compression calculation without decompressing the file:
@DustinB3403 said in Tar gzip file compression calculation without decompressing the file:
I guess a more relevant way to have expressed my question would have been to ask:
How do you quickly and efficiently verify what is in your tarball before you offload it?
I want to trust, but verify (as this is a backup).
Generally done with a hash of the files: SHA-256, MD5, or similar.
And how would that work from my source of unpacked files?
What point in the chain from
original file -> backup -> tarball -> gzip -> offload to archive
do you want to verify?
Is it safe to assume that the gzip file is correct when it is created?
-
@Pete-S So the simplest way I can explain this is like this:
You have a network share which is relatively organized.
You create a compressed tarball of any folder on that share and then move that tarball to offsite storage.
How would I realistically get a hash of that folder pre and post tar and compression and have it make sense? They aren't the same thing, even if they contain the same things.
@Pete-S said in Tar gzip file compression calculation without decompressing the file:
Is it safe to assume that the gzip file is correct when it is created?
This is what I'm looking to verify
-
@DustinB3403 said in Tar gzip file compression calculation without decompressing the file:
@Pete-S So the simplest way I can explain this is like this:
You have a network share which is relatively organized.
You create a compressed tarball of any folder on that share and then move that tarball to offsite storage.
How would I realistically get a hash of that folder pre and post tar and compression and have it make sense? They aren't the same thing, even if they contain the same things.
@Pete-S said in Tar gzip file compression calculation without decompressing the file:
Is it safe to assume that the gzip file is correct when it is created?
This is what I'm looking to verify
Use a FIM (file integrity monitor) like Wazuh.
-
@DustinB3403 said in Tar gzip file compression calculation without decompressing the file:
@Pete-S So the simplest way I can explain this is like this:
You have a network share which is relatively organized.
You create a compressed tarball of any folder on that share and then move that tarball to offsite storage.
How would I realistically get a hash of that folder pre and post tar and compression and have it make sense? They aren't the same thing, even if they contain the same things.
@Pete-S said in Tar gzip file compression calculation without decompressing the file:
Is it safe to assume that the gzip file is correct when it is created?
This is what I'm looking to verify
I'm assuming that files are static during backup.
If you first run md5deep on all files in the folder, you'll create a text file that contains MD5 (or SHA-256, or whatever you want) signatures for every file in the folder. Place it into the folder so it ends up inside the backup, and you'll always have the ability to verify any individual uncompressed file.
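A minimal sketch of that idea (hypothetical paths; md5deep's -r recurses and -l records relative paths so the list stays portable). Hashing before moving the list into the folder keeps the manifest from hashing itself mid-write:
cd /path/to
md5deep -r -l folder > MANIFEST.md5   # hash everything first
mv MANIFEST.md5 folder/               # then drop the list inside the folder
tar -czf folder.tar.gz folder         # the manifest travels inside the archive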
If you really want to verify your tar.gz file after it's created, I think you have to decompress the files to a temporary folder and run md5deep on them to compare against the original files. What you are really testing is that the compress-decompress round trip is lossless on every file. It should be by design, but if there is an unlikely bug somewhere, it's technically possible that it isn't.
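Something like this, reusing the manifest from the step above (temporary path hypothetical; md5deep's hash list should be readable by md5sum -c):
mkdir /tmp/verify
tar -xzf folder.tar.gz -C /tmp/verify
(cd /tmp/verify && md5sum -c folder/MANIFEST.md5)   # every line should say OK
rm -rf /tmp/verify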
If you use gzip compression with tar, gzip has a CRC-32 checksum inside that can be used to verify the integrity of the gzip file.
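gzip's built-in test mode runs that CRC check without writing anything to disk:
gzip -t folder.tar.gz && echo "archive CRC OK"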
Or, to be even more certain, you can create an MD5 signature of the entire gzip archive with md5sum or md5deep. Then you can always verify that the archive has not been corrupted.
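For instance, record the hash before offloading and re-check it wherever the archive lands:
md5sum folder.tar.gz > folder.tar.gz.md5
md5sum -c folder.tar.gz.md5   # run again after the transfer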
If you ever need to restore the files, you can verify the integrity of the restored files against the MD5 signatures you created on the original files before you did the backup.