Chapter 7: Message digest
7.1 What is a message digest?
A short answer: a message digest is a numerical summary or
``fingerprint'' of a given set of data (such as a data file or e-mail
message).
A longer answer. A computer file is really just a sequence of
numbers.
For example, the file containing the single word ``Hello''
is really just the sequence 72 101 108 108 111. The file
containing the file ``Goodbye'' is the sequence
71 111 111 100 98 121 101.
Of course, most computer files are much bigger with correspondingly
longer sequences.
Imagine that a file with one million characters needs to be sent
from one computer to another computer thousands of miles away. How can
the recipient be assured that the data received is in fact the data
that was sent? One way would be to send the data twice and compare the
two transmissions. This might work, but would be rather slow: it would
take twice as long to send the message twice.
So, instead of sending the message again a second time we send a
shorter message that gives a summary of the original message. This
summary is calculated at both ends, and the receiver compares her
summary with the one sent by the transmitter. If they match, this is
evidence that the file was received intact.
Of course, the form of this summary will affect how reliable the
summary is at detecting if the data was sent correctly.
One sort of summary is called a parity check: add up all the
numbers in the file and see if this sum is odd or even. The message
digest is then either 0 (if the sum is even) or 1 (if the sum is
odd). As you may imagine, this form of fingerprint is not very useful
when the files get long.
Thus one desirable aspect of a message digest is that different data
sets which differ in a small number of places generate different
message digests. This way, changes to the data are readily detected.
One system that does this well is the MD5 message digest. It is
``cryptographically secure'', that is, if one file generates a given
MD5 message digest, it is hard to construct a different file with the
same MD5 message digest. Thus, it is suitable for ``digital
signatures'' that are hard to forge.
7.2 Why use a message digest with TeX?
The last (and often trickiest) step in setting a book is
``page layout''. In the TeX world this corresponds to optimizing
figure placement and fixing bad page breaks. This is often delicate:
changing one line of text can change not just the page on which the
line appears, but the pages following (``text reflow'').
Thus, it would be very nice to determine, after making a small change,
which pages have changed. One way to determine this is to look at
a printout of each
page and see which pages have changed. This, however, is not
very practical except for the smallest documents.
My approach is to take each page's ``finger print'' (via a message
digest) before making a change, make the change, and then take each
page's fingerprint again after making the change. Then, just check which
pages' fingerprints have changed.
Here is an example. Let filex.tex
contain the following source:
a\eject b\eject c\eject d\eject e\eject f\end
The summary of this file is
dvii -u filex
File size: 628 bytes (1 K)
Comment string: TeX output 1999.12.17:0937
Page count: 6
Number of fonts: 1
We can save and display the page information along with a message digest of
each page if we use the -M1
option:
dvii -p -M1 filex > before.md
cat before.md
[message digest: simple sum (ignore font)]
p:[1/1]::9C8E26458F1B019011D2F28DA18B18CC
p:[2/2]::9C8E26468F1B029011D2F28DA18B18CC
p:[3/3]::9C8E26478F1B039011D2F28DA18B18CC
p:[4/4]::9C8E26488F1B049011D2F28DA18B18CC
p:[5/5]::9C8E26498F1B059011D2F28DA18B18CC
p:[6/6]::9C8E264A8F1B069011D2F28DA18B18CC
We now edit filex.tex
and change the 'b
' that appears on
page 2 to a 'B
'; we now have
a\eject B\eject c\eject d\eject e\eject f\end
We next TeX the file, save the new fingerprints, and see which pages
have changed using the diff
utility:
tex filex
dvii -p -M1 filex > after.md
diff before.md after.md
3c3
< p:[2/2]::9C8E26468F1B029011D2F28DA18B18CC
---
> p:[2/2]::9C8E26468F1AE29011D2F28DA18B18CC
Note that page 2 is the only page with a different message digest.