Results 1 to 4 of 4

Thread: A word on md5 sums

  1. #1
    Senior Member registered user
    Join Date
    Nov 2003
    Posts
    203

    A word on md5 sums

    A lot of us rely on the md5sum test to determine whether our download has errors, but the test is not infallable. If the md5's don't match then you can be certain that there is a problem. But if they do match, you can't be absolutely certain that everything is ok.

    The problem lies in the mathematics in that md5 typically takes a large number of bits, the message or the file, and assigns them a much smaller number of bits, the md5 sum. Regardless of whatever computations are involved, there has to be duplicates in that assignment. Lots of them. For example, suppose you have 1000 items and you want to assign each one of them a 2 digit number. No matter how you do the assignment there has to be duplicates. In the case of md5 that means that there are a lot of different files or messages having the same md5 sum.

  2. #2
    Administrator Site Admin-
    Join Date
    Apr 2003
    Location
    USA
    Posts
    5,441
    Let me give the opposite point of view here.

    If the md5 test fails, the file is bad. If the md5 test passes, you can have absolute faith that the file is good.

    Yes, it is a mathematical hash function, and so a large number of bits gets reduced to a smaller number of bits. And there have been ways found to deliberately force a collision of hash functions. But a deliberate false substitution of files would not be caught anyway, since in this case we are checking the md5 sum against a md5 value that we obtained from the same mirror (they could just give us the new md5 sum if they wanted to pass along a bad file, no need to create a bad file with the same md5 hash), and turbine didn't suggest this is happening, so he's really talking about could you download a corrupted file but still compute the same md5 checksum

    So what about it? Could it happen? My position is that it's completely irresponsible to claim that this could happen, particularly in a public forum like this where someone is likely to read it and later repeat the bad information. The people who designed md5 certainly understood that 128 bits is less than billions of bits. But the test is completely valid. Before one starts spreading false rumors that it is not, one should at least have a minimal understanding of really big numbers. You will never get a corrupted download that matches the md5 checksum. here's why:

    The md5 checksum is 128 bits in size. "Gee, only 128" I can imagine turbine thinking, "that doesn't seem like a very big number". But it's important to understand big numbers, and what that means in this case is that there are 2^128 values that a md5 hash can have. In decimal numbers that is more than 3.4^38. If anyone reading doesn't quite know how big that number is, it's larger than all of the grains of sand on the earth. It's larger than all of the stars in all of the galaxies in the universe (estimated by scientists to be roughly around 10^21 (N.A.S.A.) to 5x10^22)! And it's approaching (not quite there) the number of all of the atoms in the Earth. And it's way way more than the number of seconds since the birth of the universe, or the number of md5 tests that will ever be done. Do you think there is any chance that you could download a corrupt file and the md5 checksum would just happen to compute the same value? Not gonna happen.

    This number is so large that there is a much much better chance that if the md5 checksum did match, that one or more bits in the file would suddenly change spontaneously and the file would now be bad. But if you live your life thinking that that is going to happen, you likely don't use computers at all. For all practical purposes (and for practical dolphins also) the md5 test is completely valid and can be trusted. If you get a bad md5 check then the file is bad (or you have a CPU that made a math mistake), if you get a matching md5 checksum then the file is good. Don't let anyone talk about 1000 items and then generalize into a claim that it's not a valid test if they don't understand the math behind it, or are not interested in explaining the implications of the math to you.

  3. #3
    Senior Member registered user
    Join Date
    Nov 2003
    Posts
    203
    Well, Harry, as a practicing mathematician for 30 years and having spent 12 of those years working in data communications specifically designing communications protocols with an emphasis on forward error correction, I do indeed understand about large numbers. Now granted 2^128 is a large number but if your message is a billion bits long then the number of messages that you can have in a one billion bit set is 2^1000000000. Of those messages, the number that can have the same md5sum is 2^1000000000 minus 2^128. My man, that's a whopping large number and it's quite easy to encounter a transmission error that generates a message that is contained in that set.

  4. #4
    Administrator Site Admin-
    Join Date
    Apr 2003
    Location
    USA
    Posts
    5,441
    I really take exception to the claim:
    Quote Originally Posted by turbine
    and it's quite easy to encounter a transmission error that generates a message that is contained in that set.
    To claim that it's quite easy to encounter a transmission error that generates a message with a matching md5 checksum is absurd. It just doesn't really happen. The math is still the math, there is only a greater than 1 in 3.4^38 chance of a match, and that number is so mindbogglingly large that most people who read this thread just can't grasp how large it is or how infinitesimal the chance of ever having a match are.

    Sure, if you start talking about theoretical infinities of files then you can claim that there must be md5 collisions. But in a real world situation with finite numbers of files, the odds of ever getting a md5 match that matches the checksum for the file you are testing, is so insanely large that if you checked files at the rate of one a second from the birth of the universe until now, you still would not be at all likely to get a bad match (that is only about 5.2x10^19 seconds (opps, rechecked my math, now I get only 4.7304e+17 seconds since the big bang, chances of a false MD5 match just went way down!)).

    So I would suggest that instead of it being quite easy to encounter a message that generates such an error, you have never actually encountered one. Or know anyone who has ever really encountered even one. Or that it has ever happened, to anyone.

    I'm not a mathematician. I'm an engineer, with a similar or slightly longer timeframe (not that that has any relation to the true facts of this question). I've also worked in telecommunications and satellite communications, and have done some work (less that you) in error correcting fields. If, as a mathematician you choose to say that for a huge set of files reduced to 128 bit hashes there must be some collision, sure, I can completely agree with that. But to post in a pubic forum like this that it is "quite easy" to actually find such collisions is absurd. It just doesn't happen. It will not happen. What is your estimate of the number of all of the files on all of the computers in the world? My quick calculations show that it is so much lower than 3.4x10^38 that even if every file that existed was tested against the md5 of the current Knoppix ISO it would be extremely unlikely that any of them match the Knoppix md5 checksum. It is irresponsible to spread any expectation that false md5 matches happen.

Similar Threads

  1. Mozilla Composer As Word Processor
    By sakiZ in forum The Lounge
    Replies: 2
    Last Post: 02-08-2006, 02:38 AM
  2. sorry about the off color word: Windows
    By Leyden in forum Hardware & Booting
    Replies: 4
    Last Post: 12-29-2005, 08:08 PM
  3. Any word on new Cebit Edition of Knoppix
    By Jack_ in forum General Support
    Replies: 2
    Last Post: 03-18-2004, 08:27 PM
  4. Scanners -- a word of praise!
    By fingers99 in forum Hardware & Booting
    Replies: 5
    Last Post: 02-01-2004, 03:21 AM
  5. A word from the stupid :p
    By Gandalf_Grey_ in forum Customising & Remastering
    Replies: 2
    Last Post: 01-23-2003, 02:44 PM

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •  


Dell OptiPlex 3060 MICRO DESKTOP PC COMPUTER I5 8TH GEN UP TO 32GB UP TO 2TB M.2 picture

Dell OptiPlex 3060 MICRO DESKTOP PC COMPUTER I5 8TH GEN UP TO 32GB UP TO 2TB M.2

$295.00



Dell Latitude Business Class Laptop Quad Core i5 8th Gen 16GB 512GB SSD Grade A picture

Dell Latitude Business Class Laptop Quad Core i5 8th Gen 16GB 512GB SSD Grade A

$219.99



HP Workstation Z640 2x Xeon E5-2623V4 32GB Ram 2x256GB SSD Quadro K420 No OS GA2 picture

HP Workstation Z640 2x Xeon E5-2623V4 32GB Ram 2x256GB SSD Quadro K420 No OS GA2

$164.98



New HP 24

New HP 24" All In One Desktop Computer 11th Gen i3 Up To 32GB RAM 2TB SSD Win 11

$419.99



Apple Mac Mini A2348 (2020) M1 8 Core 3.2GHz | 16GB | 256GB SSD picture

Apple Mac Mini A2348 (2020) M1 8 Core 3.2GHz | 16GB | 256GB SSD

$399.00



Intel - Core i9-13900K 13th Gen 24 cores 8 P-cores + 16 E-cores 36M Cache, 3 ... picture

Intel - Core i9-13900K 13th Gen 24 cores 8 P-cores + 16 E-cores 36M Cache, 3 ...

$459.99



Intel - Core i7-13700K 13th Gen 16 cores 8 P-cores + 8 E-cores 30M Cache, 3.4... picture

Intel - Core i7-13700K 13th Gen 16 cores 8 P-cores + 8 E-cores 30M Cache, 3.4...

$329.99



Intel - Core i7-11700 11th Generation - 8 Core - 16 Thread - 2.5 to 4.9 GHz -... picture

Intel - Core i7-11700 11th Generation - 8 Core - 16 Thread - 2.5 to 4.9 GHz -...

$189.99



HP ENVY Desktop Computer, Intel Core i7-10700, 16 GB RAM, 1 TB HD & 512 GB SSD picture

HP ENVY Desktop Computer, Intel Core i7-10700, 16 GB RAM, 1 TB HD & 512 GB SSD

$339.00



HP Workstation Z640 2x Xeon E5-2623V4 32GB Ram 2x 256GB SSD Quadro P620 Linux GA picture

HP Workstation Z640 2x Xeon E5-2623V4 32GB Ram 2x 256GB SSD Quadro P620 Linux GA

$186.99