Cophenetic correlation

In statistics, and especially in biostatistics, cophenetic correlation[1] (more precisely, the cophenetic correlation coefficient) is a measure of how faithfully a dendrogram preserves the pairwise distances between the original unmodeled data points. Although it has been most widely applied in the field of biostatistics (typically to assess cluster-based models of DNA sequences, or other taxonomic models), it can also be used in other fields of inquiry where raw data tend to occur in clumps, or clusters.[2] This coefficient has also been proposed for use as a test for nested clusters.[3]

Calculating the cophenetic correlation coefficient

Suppose that the original data {X_i} have been modeled using a clustering method to produce a dendrogram {T_i}; that is, a simplified model in which data that are "close" have been grouped into a hierarchical tree. Define the following distance measures.

  • x(i, j) = | X_i − X_j |, the ordinary Euclidean distance between the ith and jth observations.
  • t(i, j) = the dendrogrammatic distance between the model points T_i and T_j. This distance is the height of the node at which these two points are first joined together.
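
As an illustration of the two distance measures, the following minimal Python sketch computes x(i, j) and t(i, j) for a small, made-up set of four points. It assumes single-linkage agglomerative clustering as the cluster method; the article does not prescribe one, and other linkage rules would generally give different dendrogram heights. The function names and the sample points are hypothetical and chosen only for this example.

```python
import math
from itertools import combinations


def pairwise_distances(points):
    """Return {(i, j): Euclidean distance} for every pair with i < j."""
    return {
        (i, j): math.dist(points[i], points[j])
        for i, j in combinations(range(len(points)), 2)
    }


def cophenetic_distances(x, n):
    """Return {(i, j): t(i, j)} for i < j.

    Builds the tree by naive single-linkage clustering (an assumption of
    this sketch) and records, for each pair, the height at which the two
    observations are first joined.
    """
    clusters = [{i} for i in range(n)]
    t = {}
    while len(clusters) > 1:
        # Find the two clusters whose closest members are nearest.
        best = None
        for a, b in combinations(range(len(clusters)), 2):
            d = min(
                x[min(i, j), max(i, j)]
                for i in clusters[a]
                for j in clusters[b]
            )
            if best is None or d < best[0]:
                best = (d, a, b)
        height, a, b = best
        # Every pair split across the two merged clusters is first
        # joined at this node, so its t(i, j) is the merge height.
        for i in clusters[a]:
            for j in clusters[b]:
                t[min(i, j), max(i, j)] = height
        clusters[a] = clusters[a] | clusters[b]
        del clusters[b]  # b > a, so index a is unaffected
    return t


# Hypothetical sample data: two loose pairs of points in the plane.
points = [(0.0, 0.0), (0.0, 1.0), (4.0, 0.0), (4.0, 3.0)]
x = pairwise_distances(points)
t = cophenetic_distances(x, len(points))
```

In this example points 0 and 1 are joined first, then points 2 and 3, and finally the two pairs, so every pair that straddles the two groups shares the height of the root node.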

Then, letting \bar{x} be the average of the x(i, j), and letting \bar{t} be the average of the t(i, j), the cophenetic correlation coefficient c is given by[4]


c = \frac {\sum_{i<j} (x(i,j) - \bar{x})(t(i,j) - \bar{t})}{\sqrt{[\sum_{i<j}(x(i,j)-\bar{x})^2] [\sum_{i<j}(t(i,j)-\bar{t})^2]}}.
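
The formula is the ordinary Pearson correlation applied to the two lists of pairwise distances. A minimal Python sketch follows; the function name cophenetic_correlation and the sample values are assumptions made for illustration. The x values are the Euclidean distances between the hypothetical points (0, 0), (0, 1), (4, 0) and (4, 3), and the t values are the merge heights that a single-linkage clustering of those points would produce, both listed in the same pair order.

```python
import math


def cophenetic_correlation(x, t):
    """Correlation between original distances x and dendrogram distances t.

    Both arguments are sequences over the same pairs i < j, in the same order.
    """
    if len(x) != len(t) or len(x) < 2:
        raise ValueError("x and t must have the same length, at least 2")
    x_bar = sum(x) / len(x)
    t_bar = sum(t) / len(t)
    numerator = sum((xi - x_bar) * (ti - t_bar) for xi, ti in zip(x, t))
    denominator = math.sqrt(
        sum((xi - x_bar) ** 2 for xi in x)
        * sum((ti - t_bar) ** 2 for ti in t)
    )
    if denominator == 0:
        raise ValueError("coefficient is undefined when x or t is constant")
    return numerator / denominator


# Pair order: (0,1), (0,2), (0,3), (1,2), (1,3), (2,3)
x = [1.0, 4.0, 5.0, math.sqrt(17), math.sqrt(20), 3.0]
t = [1.0, 4.0, 4.0, 4.0, 4.0, 3.0]
c = cophenetic_correlation(x, t)
print(f"cophenetic correlation: {c:.4f}")
```

A value of c close to 1 indicates that the dendrogram preserves the original pairwise distances well; lower values indicate more distortion.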

References
