PDA

View Full Version : A word on md5 sums



turbine
01-13-2008, 02:18 PM
A lot of us rely on the md5sum test to determine whether our download has errors, but the test is not infallable. If the md5's don't match then you can be certain that there is a problem. But if they do match, you can't be absolutely certain that everything is ok.

The problem lies in the mathematics in that md5 typically takes a large number of bits, the message or the file, and assigns them a much smaller number of bits, the md5 sum. Regardless of whatever computations are involved, there has to be duplicates in that assignment. Lots of them. For example, suppose you have 1000 items and you want to assign each one of them a 2 digit number. No matter how you do the assignment there has to be duplicates. In the case of md5 that means that there are a lot of different files or messages having the same md5 sum.

Harry Kuhman
01-13-2008, 08:50 PM
Let me give the opposite point of view here.

If the md5 test fails, the file is bad. If the md5 test passes, you can have absolute faith that the file is good.

Yes, it is a mathematical hash function, and so a large number of bits gets reduced to a smaller number of bits. And there have been ways found to deliberately force a collision of hash functions. But a deliberate false substitution of files would not be caught anyway, since in this case we are checking the md5 sum against a md5 value that we obtained from the same mirror (they could just give us the new md5 sum if they wanted to pass along a bad file, no need to create a bad file with the same md5 hash), and turbine didn't suggest this is happening, so he's really talking about could you download a corrupted file but still compute the same md5 checksum

So what about it? Could it happen? My position is that it's completely irresponsible to claim that this could happen, particularly in a public forum like this where someone is likely to read it and later repeat the bad information. The people who designed md5 certainly understood that 128 bits is less than billions of bits. But the test is completely valid. Before one starts spreading false rumors that it is not, one should at least have a minimal understanding of really big numbers (http://pages.prodigy.net/jhonig/bignum/). You will never get a corrupted download that matches the md5 checksum. here's why:

The md5 checksum is 128 bits in size. "Gee, only 128" I can imagine turbine thinking, "that doesn't seem like a very big number". But it's important to understand big numbers, and what that means in this case is that there are 2^128 values that a md5 hash can have. In decimal numbers that is more than 3.4^38. If anyone reading doesn't quite know how big that number is, it's larger than all of the grains of sand on the earth. It's larger than all of the stars in all of the galaxies in the universe (estimated by scientists to be roughly around 10^21 (N.A.S.A.) (http://imagine.gsfc.nasa.gov/docs/ask_astro/answers/970115.html) to 5x10^22 (http://astronomy.swin.edu.au/~gmackie/billions.html))! And it's approaching (not quite there) the number of all of the atoms in the Earth. And it's way way more than the number of seconds since the birth of the universe, or the number of md5 tests that will ever be done. Do you think there is any chance that you could download a corrupt file and the md5 checksum would just happen to compute the same value? Not gonna happen.

This number is so large that there is a much much better chance that if the md5 checksum did match, that one or more bits in the file would suddenly change spontaneously and the file would now be bad. But if you live your life thinking that that is going to happen, you likely don't use computers at all. For all practical purposes (and for practical dolphins also) the md5 test is completely valid and can be trusted. If you get a bad md5 check then the file is bad (or you have a CPU that made a math mistake), if you get a matching md5 checksum then the file is good. Don't let anyone talk about 1000 items and then generalize into a claim that it's not a valid test if they don't understand the math behind it, or are not interested in explaining the implications of the math to you.

turbine
01-13-2008, 11:55 PM
Well, Harry, as a practicing mathematician for 30 years and having spent 12 of those years working in data communications specifically designing communications protocols with an emphasis on forward error correction, I do indeed understand about large numbers. Now granted 2^128 is a large number but if your message is a billion bits long then the number of messages that you can have in a one billion bit set is 2^1000000000. Of those messages, the number that can have the same md5sum is 2^1000000000 minus 2^128. My man, that's a whopping large number and it's quite easy to encounter a transmission error that generates a message that is contained in that set.

Harry Kuhman
01-14-2008, 12:58 AM
I really take exception to the claim:

and it's quite easy to encounter a transmission error that generates a message that is contained in that set.
To claim that it's quite easy to encounter a transmission error that generates a message with a matching md5 checksum is absurd. It just doesn't really happen. The math is still the math, there is only a greater than 1 in 3.4^38 chance of a match, and that number is so mindbogglingly large that most people who read this thread just can't grasp how large it is or how infinitesimal the chance of ever having a match are.

Sure, if you start talking about theoretical infinities of files then you can claim that there must be md5 collisions. But in a real world situation with finite numbers of files, the odds of ever getting a md5 match that matches the checksum for the file you are testing, is so insanely large that if you checked files at the rate of one a second from the birth of the universe until now, you still would not be at all likely to get a bad match (that is only about 5.2x10^19 seconds (opps, rechecked my math, now I get only 4.7304e+17 seconds since the big bang, chances of a false MD5 match just went way down!)).

So I would suggest that instead of it being quite easy to encounter a message that generates such an error, you have never actually encountered one. Or know anyone who has ever really encountered even one. Or that it has ever happened, to anyone.

I'm not a mathematician. I'm an engineer, with a similar or slightly longer timeframe (not that that has any relation to the true facts of this question). I've also worked in telecommunications and satellite communications, and have done some work (less that you) in error correcting fields. If, as a mathematician you choose to say that for a huge set of files reduced to 128 bit hashes there must be some collision, sure, I can completely agree with that. But to post in a pubic forum like this that it is "quite easy" to actually find such collisions is absurd. It just doesn't happen. It will not happen. What is your estimate of the number of all of the files on all of the computers in the world? My quick calculations show that it is so much lower than 3.4x10^38 that even if every file that existed was tested against the md5 of the current Knoppix ISO it would be extremely unlikely (http://www.bartleby.com/59/4/snowballscha.html) that any of them match the Knoppix md5 checksum. It is irresponsible to spread any expectation that false md5 matches happen.