Background

COMET’s goal is to make it easier to isolate a specified cluster of cells from a larger population. We attempt to find the best set of ‘marker’ surface proteins that occur in the specified cluster, but not in the rest of the population. Given this information, researchers can isolate the specified cluster using antibodies which bind to these ‘marker’ proteins.

We attempt to accomplish this by applying the hypergeometric statistical test to a dataset generated by single-cell RNA sequencing of a representative cell population. This dataset maps each single cell to a numerical expression value for each gene measured by this sequencing. By normalizing these values, we can compare expression of a set of genes across the population, finding genes which are expressed by our specified cluster but not in the rest of the population.

The hypergeometric test

Traditional methods of extracting ‘marker’ proteins from a single-cell RNA sequencing dataset use the statistical t-test, finding single genes where the mean expression in the specified cluster differs most from the mean of the rest of the population. This method has limitations in utility and statistical rigor. Most significantly, it cannot find sets of ‘marker’ proteins, only single proteins.

COMET uses the hypergeometric statistical test to overcome these limitations. The hypergeometric test considers discrete expression/non-expression instead of a continuous expression scale, allowing us to test gene sets by considering expression/non-expression of the entire set. It is possible to combine genes using a continuous expression scale and t-tests, for example by simply taking the n ‘best’ single marker genes. This is ineffective, however, because a combination of genes does not necessarily mark the same cells as its individual components.
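
To make the idea concrete, here is a minimal sketch of a hypergeometric test on discretized expression, written with the Python standard library only. It is an illustration under stated assumptions, not COMET's actual code: the helper name `hypergeom_sf`, the toy data, and the choice to combine two genes with a logical AND are all hypothetical.

```python
from math import comb

def hypergeom_sf(k, N, K, n):
    """Upper tail P(X >= k) for X ~ Hypergeometric.

    N: cells in the population
    K: cells that express the gene (or gene set)
    n: cells in the specified cluster
    k: cluster cells that express the gene (or gene set)
    """
    upper = min(K, n)
    total = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return total / comb(N, n)

# Toy data (hypothetical): 10 cells, the first 4 form the specified cluster.
in_cluster = [True] * 4 + [False] * 6
# Discretized expression (True = expressed) for two made-up genes.
gene_a = [True, True, True, True, True, True, True, False, False, False]
gene_b = [True, True, True, True, False, False, False, True, True, True]

def p_value(expressed):
    N = len(expressed)
    K = sum(expressed)
    n = sum(in_cluster)
    k = sum(e and c for e, c in zip(expressed, in_cluster))
    return hypergeom_sf(k, N, K, n)

p_a = p_value(gene_a)
# A cell "expresses the set" only if it expresses both genes (an assumed rule).
p_pair = p_value([a and b for a, b in zip(gene_a, gene_b)])
print(p_a, p_pair)
```

In this toy example each gene alone also marks three cells outside the cluster, while the pair marks exactly the cluster, so the pair receives a much smaller p-value than either gene by itself.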

Additionally, COMET uses Florian Wagner’s implementation of the ‘mHG’ statistical test, which in this context finds the most statistically significant cutoff between expression/non-expression, given our continuous gene expression values.
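
The cutoff search can be sketched as follows. This is a simplified illustration and not Wagner's implementation: it scans every possible expression threshold and keeps the one with the smallest hypergeometric p-value, but it omits the correction the full mHG test applies for having tried many cutoffs. The function names and toy data are hypothetical.

```python
from math import comb

def hypergeom_sf(k, N, K, n):
    """Upper tail P(X >= k): N cells, K in the cluster, n cells above the cutoff."""
    upper = min(K, n)
    total = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return total / comb(N, n)

def best_cutoff(expression, in_cluster):
    """Return (threshold, p_value) minimizing the hypergeometric p-value.

    Cells with expression >= threshold are treated as 'expressing'.
    The returned p-value is NOT corrected for the number of cutoffs tried.
    """
    N = len(expression)
    K = sum(in_cluster)
    # Rank cells from highest to lowest expression.
    order = sorted(range(N), key=lambda i: expression[i], reverse=True)
    best_threshold, best_p = None, 1.0
    hits = 0  # cluster cells seen so far among the top-ranked cells
    for n, idx in enumerate(order, start=1):
        hits += in_cluster[idx]
        # Tied values cannot be separated by a threshold, so skip until the tie ends.
        if n < N and expression[order[n]] == expression[idx]:
            continue
        p = hypergeom_sf(hits, N, K, n)
        if p < best_p:
            best_threshold, best_p = expression[idx], p
    return best_threshold, best_p

# Toy data (hypothetical): the first 4 cells form the specified cluster.
expression = [5.1, 4.8, 4.2, 3.9, 0.3, 0.0, 2.5, 0.1, 0.0, 0.2]
in_cluster = [True] * 4 + [False] * 6
threshold, p = best_cutoff(expression, in_cluster)
print(threshold, p)
```

Here the four cluster cells have the four highest expression values, so the scan settles on the lowest of those values as the threshold separating expression from non-expression.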