

An Automatic Comparison of the Human and Chimpanzee Genomes




Published human (Homo sapiens) and chimpanzee (Pan troglodytes) genome percentage similarities - 98 %,[1] 99.4 %,[2] 98.77 %,[3] 95 %,[4] and 96 %[5] - demonstrate that any perceived precision is illusory. Some similarity between these genomes is unsurprising, but the important lesson these numbers show is that relatively small differences in DNA may produce large morphological and behavioral differences. Summarizing the sequence differences in a single number is challenging if not impossible.


Among the important factors to take into consideration when comparing DNA sequences from a structural point of view are 1) the length of the sequences, 2) the way encoded information will be expressed, 3) the completeness of the sequences (significant portions of H. sapiens and P. troglodytes heterochromatin remain unsequenced[6]), 4) how differences are distributed, 5) the locations at which recombination occurs (these differ between H. sapiens and P. troglodytes chromosomes[7]) and 6) how DNA is segmented (i.e., the different chromosome numbers and the different gene distribution among chromosomes).


From a purely informatics point of view, DNA sequences are simple strings of characters. It is therefore also possible to develop tests that compare genomes as unstructured sequences of symbols, without considering genes, pseudo-genes, coding and non-coding regions, vertical and horizontal gene transfer, open reading frames (ORFs), or any other structured concept. This is the goal of this article.

Comparing strings of characters


Strings of symbols are more complex objects than, for example, polygons. It is easy to define equality between polygons: they are equal when they have equal sides, angles and vertices. Polygons are similar when they have equal angles and vertices but different side lengths. Unlike polygons, strings of characters can be compared in many ways. One approach to the problem is to consider the set of all strings of characters as a metric space and to define a distance function on all pairs of strings. Mathematicians have developed many distance functions for studying similarity among strings (for a list of them see for example http://www.dcs.shef.ac.uk/~sam/stringmetrics.html ).


The simplest way of comparing two strings A and B is the pairwise comparison test or identity test. In this test the n-th character of A is compared to the n-th character of B. In other words the order matters. The test starts from the first character and terminates at the last character. If two strings of total length n have m matching characters, we say the two strings are 100*m/n per cent identical. Of course, if two strings of identical length n have n matching characters, we say the two strings are 100 per cent identical (in a sense they are the same abstract string). In the pairwise comparison test we can calculate a simple metric distance called the "Hamming distance". At every comparison, if the two characters don't match, the Hamming distance increases by 1.
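The identity test and the Hamming distance described above can be sketched in a few lines of Python. This is only an illustration, not the program used for the tests in this article; the function names hamming_distance and percent_identity are our own, and the sketch assumes the two strings have the same length.

```python
def hamming_distance(a: str, b: str) -> int:
    """Count the positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("the identity test assumes strings of equal length")
    return sum(1 for x, y in zip(a, b) if x != y)


def percent_identity(a: str, b: str) -> float:
    """Return 100*m/n, where m is the number of matching positions."""
    n = len(a)
    if n == 0:
        raise ValueError("empty strings have no defined identity")
    m = n - hamming_distance(a, b)
    return 100.0 * m / n
```

For instance, "GATTACA" and "GACTATA" differ at two of their seven positions, so their Hamming distance is 2 and their identity is 100*5/7 per cent.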


If the order doesn't matter, and we can compare sub-strings inside the parent strings A and B even if they are at different positions in the two strings, then many different tests are possible. We call them pattern matching or similarity tests. While in principle there is only one identity test (the one above), there are many possible similarity tests, depending on the rules of pattern matching we choose.


Any final result of a similarity test (especially if it is a single number) has meaning only if: 1) the distance function is mathematically defined; 2) the rules of pattern matching and the formulas for calculating that number are explained in detail; 3) it is declared which parts of the input strings are considered; 4) if computer programs were used to make the comparison, their source code and algorithms are freely available. For example, here is a test we call 3SS-similarity: consider two strings A = xyz and B = yzx, where x, y, z are three sub-strings composed of i, j, k characters respectively. Suppose we establish the following rules of pattern matching and formulas: i) find identical sub-strings in both entire strings, independently of their positions, and eliminate them from A and B; ii) after the deletions, count the characters remaining in A (call this ra) and the characters remaining in B (call this rb), their sum being our 3SS metric distance; iii) obtain the value of 3SS-similarity by the formula 100 - 50*(ra+rb)/(i+j+k). In this case, since ra=0 and rb=0, the 3SS-similarity is 100 per cent. If instead A and B share no sub-strings, then ra=(i+j+k) and rb=(i+j+k), and the 3SS-similarity is 0 per cent.
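The 3SS example can be checked with a small Python sketch. It rests on two assumptions that are ours: the division of each string into sub-strings is already known and is passed in as a list, and the two strings have the same total length, as in the example (the formula above is stated only for that case). The function name three_ss_similarity is hypothetical.

```python
from collections import Counter


def three_ss_similarity(blocks_a, blocks_b):
    """Return (3SS distance, 3SS similarity in per cent) for two block lists."""
    total = sum(len(block) for block in blocks_a)
    if total == 0 or total != sum(len(block) for block in blocks_b):
        raise ValueError("this sketch assumes non-empty strings of equal length")
    count_a, count_b = Counter(blocks_a), Counter(blocks_b)
    shared = count_a & count_b  # sub-strings present in both, whatever their position
    # Characters left in each string once the shared sub-strings are eliminated.
    ra = sum(len(block) * k for block, k in (count_a - shared).items())
    rb = sum(len(block) * k for block, k in (count_b - shared).items())
    distance = ra + rb
    similarity = 100.0 - 50.0 * distance / total
    return distance, similarity
```

With A = ["AAA", "CC", "GTGT"] and B = ["CC", "GTGT", "AAA"], which mirrors A = xyz and B = yzx, the function returns a distance of 0 and a similarity of 100 per cent.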


Comparing DNA sequences


The characters most commonly present in DNA sequences are A, C, G, T. Other, less common characters are used mainly to indicate ambiguity about the identity of certain bases in the sequences.


The Homo sapiens and Pan troglodytes genomes were freely downloaded from the public bioinformatics archives of UCSC Genome Bioinformatics: http://genome.ucsc.edu/. The downloaded DNA sequences are in FASTA format. Before running the tests we discarded all symbols other than A, C, G, T. Mainly we had to discard the "N" symbols, which represent rare undefined positions (probably due to difficulties of the sequencing technology). Other symbols are very rare, if present at all. Deleting the "N" symbols does not change the overall results much, however. Here we show the results of two methods of comparison that we applied to the human and chimp genomes.
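The filtering step can be sketched as follows. This is not the original program, only a minimal Python illustration; the function name read_acgt is ours. It assumes one sequence per FASTA file and, as a further assumption not stated above, treats lower-case letters as the same bases as their upper-case forms.

```python
def read_acgt(fasta_lines):
    """Join the sequence lines of a FASTA record, keeping only A, C, G, T."""
    kept = []
    for line in fasta_lines:
        if line.startswith(">"):  # FASTA header line, not sequence data
            continue
        # Assumption: lower-case bases are folded to upper case before filtering.
        kept.append("".join(ch for ch in line.strip().upper() if ch in "ACGT"))
    return "".join(kept)
```

Because the function accepts any iterable of lines, it can be applied directly to an open file, for example read_acgt(open(path)) where path is the location of a downloaded chromosome file.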


First method: pairwise comparison (equality test)


The first difficulty in applying the pairwise comparison test is that, in general, homologous chromosomes have DNA sequences of different lengths. In other words, when we arrive at the end of the shorter chromosome we must stop the comparison. These differences often amount to millions of bases. This observation alone should lead us to understand that the two strings cannot be so equal after all. In particular, human and chimp homologous chromosomes always have different lengths. In our pairwise comparison test we discarded the unmatched tails of the longer chromosomes.


A second problem is that Homo sapiens and the chimpanzee have different numbers of chromosomes. Whereas Homo sapiens has a single chromosome #2, the chimpanzee has its chromosome #2 split into two parts, namely chromosome #2a and chromosome #2b. Therefore we compared human chromosome #2 with the concatenation of chimp chromosomes #2a and #2b (the longer of the two sequences).
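The handling of unequal lengths can be sketched in Python as below. The function name compare_with_truncation is ours and the code is only an illustration of the procedure just described: the comparison stops at the end of the shorter string and the unmatched tail of the longer one is discarded.

```python
def compare_with_truncation(a: str, b: str):
    """Return (per cent identity over the common length, discarded tail length)."""
    n = min(len(a), len(b))
    if n == 0:
        raise ValueError("both strings must be non-empty")
    # zip stops at the end of the shorter string, so the tail is never compared.
    matches = sum(1 for x, y in zip(a, b) if x == y)
    discarded = max(len(a), len(b)) - n
    return 100.0 * matches / n, discarded
```

For chromosome #2 the second argument would be the concatenation of the two chimp sequences, for example compare_with_truncation(human_chr2, chimp_chr2a + chimp_chr2b), where the three variable names are placeholders for the filtered sequences.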


Consider symbols drawn from a common vocabulary of N symbols, each of them having the same probability 1/N of occurring. If two sets composed of an equal number of such symbols are fully random, the pairwise comparison test must find them to be a fraction 1/N equal. In fact, at each single symbol-by-symbol comparison there is the same probability of 1/N that the two paired symbols match. For example, if the vocabulary has four symbols (as in the case of DNA), this probability is 1/4, and we can say that two random strings generated from such a vocabulary are 25% equal.


In real DNA the average probabilities of A, T, G, C are not exactly 0.25 but approximately the following: A=0.3, T=0.3, G=0.2, C=0.2. When the probabilities differ, the probability of a single match is the sum of the squared probabilities of the symbols, which for real DNA gives:


(30*30 + 30*30 + 20*20 + 20*20)/(100*100) = (900+900+400+400)/10000 = 26%
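The arithmetic above can be reproduced with a short Python sketch. It assumes, as the reasoning above does, that the two strings are random and independent and that both follow the same symbol frequencies; the function name expected_match_probability is ours.

```python
def expected_match_probability(freqs):
    """Probability that two independently drawn symbols are the same."""
    total = sum(freqs.values())
    if abs(total - 1.0) > 1e-9:
        raise ValueError("symbol frequencies must sum to 1")
    # A match on a given symbol has probability p*p; sum over all symbols.
    return sum(p * p for p in freqs.values())


uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
skewed = {"A": 0.3, "T": 0.3, "G": 0.2, "C": 0.2}
```

Calling the function on uniform gives 0.25 and on skewed gives 0.26 (up to floating-point rounding), the two reference values used in the rest of this section.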


We will see below that this is very close to what our identity test produced.


The following table and graph show the report of the pairwise comparison test:

Remember that 25% represents the equality percentage of two random sequences of four equally probable symbols, and 26% represents the equality percentage of two random DNA sequences. All the percentage values of pairwise identity (nucleotide by nucleotide, starting from the same end along the entire chromosome) are very near 26%, which is the value that the theory predicts. At the same time this value means that, if many local similarities exist, they are at different offsets; in other words, identical patterns (if any) are scrambled in homologous chromosomes. This issue is related to the deletions, additions, inversions, translocations, transfers and a number of other chromosomal events, that is, the structural alterations that evolutionary theory hypothesizes chromosomes have undergone. In any case, the simple pairwise comparison was not designed to take such problems into account. This observation leads us directly to the second test.

Second method: 30-base pattern matching (30BPM-similarity test)


Faced with the 26% average value of the pairwise comparison in the previous test, some critics say that such a result makes little sense, because human and chimp genomes do not show global identity (the kind investigated by a simple pairwise comparison test) but have many local similarities. Therefore we tried to discover such alleged local similarities by means of a new test. To limit the running times somewhat, we chose a Monte Carlo approach. According to this method a pseudorandom number generator (PRNG) generates a set of uniformly distributed random numbers, which determine the places where the metric measures will be probed. In short, in the Monte Carlo method only a portion of the metric space is investigated, but this portion reveals the characteristics of the whole. As a consequence our measure of similarity will be statistical.


However, this second method is a real pattern matching test, because it searches for identical patterns in chromosome N of Homo sapiens and of the chimpanzee. In other words, in this test patterns can match independently of their offsets in the chromosomes. In fact, the meaning of local similarities in homologous chromosomes is: identical patterns lying at different positions in the two chromosomes. This test allows a total scrambling of the patterns between the paired chromosomes. Of course it is very difficult to know what the functional implications of this scrambling are. As an analogy, we know for example that in software, randomly scrambling parts of the binary code degrades functionality, to the point of halting the computer. Maybe the positions of genes can shift, but when non gene-coding DNA is scrambled it is doubtful that functionality is preserved.


Many technologies have been developed to investigate genomes. One of them is the BLAST (Basic Local Alignment Search Tool) set of programs (see for example the NCBI web site[8]). BLAST is able to find regions of local similarity between sequences by searching a database of genomes. Alignment methods (such as those that BLAST and other techniques implement) allow geneticists to search interactively for common local patterns at different positions. The global comparison of two genomes is a job that cannot be carried out interactively by humans; only fully automatic computer programs can perform such a task.


From this point of view our test #2 can be considered a fully automatic program for searching for local alignments between two different chromosomes. It is always possible to search for local similarities (even in a pair of fully random strings), but does a local similarity found by pattern matching always imply a genetic functional similarity? That isn't easy to prove. One can accept that searching for local similarities in homologous genes makes some sense, but in the non gene-coding regions (about which molecular biology knows little) a BLAST search makes less sense. There the offsets (or the starting points of any pattern matching) are entirely arbitrary. After all, if non gene-coding regions are called "junk DNA", it is because those who use the term consider these sequences to be random. Moreover, whereas in the gene-coding regions of DNA we know that the universal genetic code matching codons and amino acids is respected (with some very rare exceptions), in the non gene-coding regions nobody knows what code or specification is used.


This additional test searches for shared 30-base-long patterns between two chromosomes. It might seem arbitrary to choose 30-base matches. It is arbitrary, as any other number would be, but if the genomes were really 9x% identical, as is claimed, then a 30-base pattern comparison (or any other n-base pattern comparison) should also give 9x% results.


Our second test (30BPM-similarity) implements the following simple iterative algorithm. For each pair of chromosomes a PRNG generates 10000 uniformly distributed random numbers that specify the offsets of 10000 30-base sequences inside chromosome A. Each of these 10000 30-base sequences is searched for in B. The absolute difference between 10000 and the number of patterns found in B (minimum 0, maximum 10000) is our 30BPM-distance. To be precise, this space is only pseudo-metric (or quasi-metric), inasmuch as the axiom of identity ("the distance is zero if and only if A and B are equal") defining a metric space is relaxed (the distance could be zero even if A and B are different) and the axiom of symmetry ("the distance between A and B is equal to the distance between B and A") doesn't hold. The 30BPM-distance is zero if the two strings are identical. In a test with two pseudorandom DNA strings of 70 million characters the 30BPM-distance was 10000 (no patterns of A found in B). We call this value, 10000, the "random-distance". The 30BPM-dissimilarity percentage is calculated as 100*30BPMdistance/randomdistance. The 30BPM-similarity percentage is 100-30BPMdissimilarity.
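The algorithm just described can be sketched in Python as follows. This is not the Perl program used for the tests, only a minimal illustration under stated assumptions: the function names bpm_similarity and random_dna, the seed parameter and the use of Python's substring search are ours, and the random-distance is taken to be equal to the number of sampled patterns, as in the pseudorandom control above.

```python
import random


def random_dna(length: int, seed: int) -> str:
    """Pseudorandom string over A, C, G, T with equal probabilities (control)."""
    rng = random.Random(seed)
    return "".join(rng.choice("ACGT") for _ in range(length))


def bpm_similarity(a: str, b: str, pattern_len: int = 30,
                   samples: int = 10000, seed: int = 0):
    """Return (distance, similarity in per cent) for patterns of A searched in B."""
    if len(a) < pattern_len:
        raise ValueError("string A is shorter than the pattern length")
    rng = random.Random(seed)
    last_offset = len(a) - pattern_len
    found = 0
    for _ in range(samples):
        offset = rng.randint(0, last_offset)  # uniformly distributed offset in A
        if a[offset:offset + pattern_len] in b:  # the pattern may occur anywhere in B
            found += 1
    distance = samples - found
    # Assumption: the random-distance equals the number of sampled patterns.
    dissimilarity = 100.0 * distance / samples
    return distance, 100.0 - dissimilarity
```

Because patterns are sampled from A and searched for in B, swapping the arguments can change the result, which is the lack of symmetry noted above. On whole chromosomes the plain substring search is slow; an index of B's 30-base patterns would be a natural refinement, but it is not part of the description above.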


The following table and graph show the report of this pattern matching test:

The results are statistically meaningful. When the same test was run on a sample of 1000 random 30-base-long patterns, the percentages were almost identical.


The source files of the Perl programs used for the tests are freely downloadable at:





Different methods of genome comparison result in very different estimates of similarity. The assumptions driving the methods used also drive the results obtained and their interpretation. More objective genome comparison strategies, like the ones reported in this paper, tend to give lower estimates of similarity than those commonly reported. It is worth noting that as more information comparing the genomes is published, the differences appear to be more profound than originally thought. What one should conclude from the similarities and differences between humans and chimpanzees remains the big question. Commonly reported statistics that should inform answers to this question may actually obscure the true answer.


We have seen that in genome comparison only "similarity" makes sense. Unfortunately the message reaching the public speaks even of 99% "identity". In fact it is usual to find statements like this: "because the chimpanzee lies at such a short evolutionary distance with respect to human, nearly all of the bases are identical by descent and sequences can be readily aligned except in recently derived, large repetitive regions"[9]. As a consequence laypeople truly believe that human and ape genomes are almost identical. We have shown that the scenario is not so simple: many similarity measures are possible, and to label the comparison with a single measure is like trying to describe a very complex geometrical object by means of a single number. We hope our work adds a bit to the truth about the 99%-identity myth.


[1] For example: Marks J. 2002. What It Means to Be 98% Chimpanzee: Apes, People, and Their Genes. University of California Press, Berkeley. 325 pages.

[2] Wildman DE, Uddin M, Liu G, Grossman LI, Goodman M. 2003. Implications of natural selection in shaping 99.4% nonsynonymous DNA identity between humans and chimpanzees: Enlarging genus Homo. Proceedings of the National Academy of Sciences (USA) 100:7181-7188.

[3] Fujiyama A, Watanabe A, Toyoda A, Taylor TD, Itoh T, Tsai S-F, Park H-S, Yaspo M-L, Lehrach H, Chen Z, Fu G, Saitou N, Osoegawa K, de Jong PJ, Suto Y, Hattori M, Sakaki Y. 2002. Construction and Analysis of a Human-Chimpanzee Comparative Clone Map. Science 295:131-134.

[4] Britten, R.J. 2002. Divergence between samples of chimpanzee and human DNA sequences is 5% counting indels. Proceedings of the National Academy of Sciences (USA) 99:13633-13635.

[5] The Chimpanzee Sequencing and Analysis Consortium. 2005. Initial sequence of the chimpanzee genome and comparison with the human genome. Nature 437:69-87.

[6] The human genome contains about 2.9 Gb of euchromatin, its total size is approximately 3.2 Gb, so approximately 10 % is heterochromatin, little of which has been sequenced. Green ED, Chakravarti A. 2001. The Human Genome Sequence Expedition: Views from the "Base Camp." Genome Res. 11: 645-651.

[7] Winckler W, Myers SR, Richter DJ, Onofrio RC, McDonald GJ, Bontrop RE, McVean GAT, Gabriel SB, Reich D, Donnelly P, Altshuler D. 2005. Comparison of Fine-Scale Recombination Rates in Humans and Chimpanzees. Science 308:107-111.

[9] The Chimpanzee Sequencing and Analysis Consortium. 2005. Initial sequence of the chimpanzee genome and comparison with the human genome. Nature 437:69-87. doi:10.1038/nature04072.