Publications and Meeting Abstracts

Papers

Topology based data analysis identifies a subgroup of breast cancers with a unique mutational profile and excellent survival

Proc Natl Acad Sci U S A. 2011 Apr 26;108(17):7265-70. Epub 2011 Apr 11.

Monica Nicolau1, Arnold J. Levine2, Gunnar Carlsson1,3

1Department of Mathematics, Stanford University, Stanford, CA, 2School of Natural Sciences, Institute for Advanced Study, Princeton, NJ, 3Ayasdi, Inc., Palo Alto, CA.

High-throughput biological data, whether generated by sequencing, transcriptional microarrays, proteomics, or other means, continues to require analytic methods that address its high-dimensional aspects. Because the computational part of data analysis ultimately identifies shape characteristics in the organization of data sets, the mathematics of shape recognition in high dimensions continues to be a crucial part of data analysis. This article introduces a method that extracts information from high-throughput microarray data and, by using topology, provides greater depth of information than current analytic techniques. The method, termed Progression Analysis of Disease (PAD), first identifies robust aspects of cluster analysis, then goes deeper to find a multitude of biologically meaningful shape characteristics in these data. Additionally, because PAD incorporates a visualization tool, it provides a simple picture or graph that can be used to further explore these data. Although PAD can be applied to a wide range of high-throughput data types, it is used here as an example to analyze breast cancer transcriptional data. This analysis identified a unique subgroup of Estrogen Receptor-positive (ER+) breast cancers that express high levels of c-MYB and low levels of innate inflammatory genes. These patients exhibit 100% survival and no metastasis. No supervised step beyond distinction between tumor and healthy patients was used to identify this subtype. The group has a clear and distinct, statistically significant molecular signature; it highlights coherent biology but is invisible to cluster methods, and it does not fit into the accepted classification of Luminal A/B and Normal-like subtypes of ER+ breast cancers. We denote the group as c-MYB+ breast cancer.

 

Meeting Abstracts

R-2835:  Analysis of Escherichia coli and Shigella spp. Strain Relationships Based on Topological Data Analysis

American Society for Microbiology (ASM) May 2011, New Orleans, LA

M. K. Mammel1, J. Kloke2, G. Carlsson2,3, G. Singh2, D. W. Lacher1, S. A. Jackson1, I. R. Patel1, J. L. Lewis1, J. Gangiredla1, C. A. Elkins1, and P. Y. Lum2

1DMB / OARSA / CFSAN, U.S. FDA, Laurel, MD; 2Ayasdi, Inc., Palo Alto, CA; 3Department of Mathematics, Stanford University, Stanford, CA

Background:  The Centers for Disease Control and Prevention estimates that 110,000 cases of enterohemorrhagic Escherichia coli infection occur annually in the United States, while 14,000 cases of shigellosis are reported.  There is a continuing need to distinguish the various serotypes and pathotypes of E. coli, along with the Shigella serogroups, and to understand the relationships among these strains.  Using Ayasdi’s Topological Data Analysis (TDA) methods, we reconstruct the relationship structure of these strains with a multiresolution output.

Methods:  Alignments of 1718 core genes present in 64 sequenced genomes of E. coli and Shigella yielded 133,140 distinct SNPs.  Relationships among the 64 strains were interactively visualized with TDA software using the Hamming distance (the number of aligned positions at which two sequences differ, equivalently the number of single-site substitutions needed to change one sequence into the other) together with a standard centrality-based geometric construction.  Clusters obtained via TDA were compared to a minimum evolution phylogenetic tree generated with the MEGA 4 software from a pairwise distance matrix based on the 133,140 SNPs.
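The Hamming distance and the pairwise distance matrix mentioned in the Methods can be illustrated with a minimal Python sketch. The strain names and the eight-site SNP profiles below are made-up toy data, and the function names are hypothetical; this is not the Ayasdi or MEGA 4 implementation, only an illustration of the metric itself.

```python
from itertools import combinations


def hamming(a, b):
    """Count the aligned positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b))


def distance_matrix(profiles):
    """Return pairwise Hamming distances keyed by sorted (strain, strain) pairs."""
    return {
        (s, t): hamming(profiles[s], profiles[t])
        for s, t in combinations(sorted(profiles), 2)
    }


# Hypothetical toy SNP profiles (one character per SNP site).
profiles = {
    "strain_A": "ACGTACGT",
    "strain_B": "ACGTACGA",
    "strain_C": "TCGAACTA",
}

distances = distance_matrix(profiles)
for pair, d in distances.items():
    print(pair, d)
```

On real data each profile would hold one allele per SNP site (133,140 sites in this study), and the resulting matrix could be passed to either a tree-building method or a topological analysis.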

Results:  Distinct groups identified by the software included K-12, O157:H7, APEC/ExPEC, Shigella sonnei, Shigella MLST group 1, and Shigella MLST group 3.  The analysis exhibits relationships between strains based on the similarity in their sequences as measured by the Hamming metric.  Increasing the analysis resolution produces a progressively fragmented view of the data, thus allowing for the construction of a phylogenetic tree from the data.  SNPs which differentiate each group from the rest of the data were also easily identified.

Conclusions:  We have used TDA as a new approach to understanding phylogenetic relationships.  Using the Hamming distance and a geometric lens, we were able to quickly identify how groups of bacterial strains are related to one another.  The procedure is much less time consuming than standard phylogenetic methods and permits rapid interactive study of SNP data sets.  This allows “rapid prototyping,” for example experimentation with different notions of distance on spaces of sequences.  We have also identified the SNP positions that best explain groups of strains or an individual strain, which can be used as markers for molecular assays.

 

 

Topological Data Analysis of PhyloChip Assay Hybridization Scores Reveals Community Shift in Deep-Sea Oil Plume

American Society for Microbiology (ASM) May 2011, New Orleans, LA

T. Z. DeSantis1,2, P. Y. Lum3, Y. Piceno1, E. Dubinski1, L. Tom1, P. Hu1, G. Singh3, G. Carlsson3, G. Andersen1

1Lawrence Berkeley National Laboratory, Berkeley, CA, 2Second Genome, Inc., San Francisco, CA, 3Ayasdi, Inc., Palo Alto, CA.

Background:  The 2010 oil well blowout in the Gulf of Mexico was the deepest and one of the largest oil spills in history.  We hypothesized that distinct communities would exist within the resulting plume compared to pristine waters at similar depths and that changes in specific populations can be rapidly detected with the PhyloChip Assay (Second Genome, Inc.) followed by Topological Data Analysis (Ayasdi, Inc.).

Methods: Amplicons of 16S rRNA were generated from 17 water samples collected in and near the plume. The overnight PhyloChip Assay was performed to collect hybridization fluorescence data from over one million probes.  TDA was performed with a novel software package (Ayasdi, Inc.) using either the most varying probes or all probes to find topologically distinct domains useful for sample classification. The computations were performed on a laptop, and the network was completed in five minutes. In both procedures, probe responses distinguishing the sample types were evaluated for taxonomic inference.
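Selecting "the most varying probes" can be sketched as ranking probes by the variance of their hybridization scores across samples. The abstract does not state which variability measure or cutoff was used, so population variance and the cutoff `k` are assumptions here, and the probe identifiers and scores are invented toy values.

```python
from statistics import pvariance


def most_varying_probes(scores, k):
    """Return the k probe ids whose scores vary most across samples.

    `scores` maps a probe id to its list of hybridization scores, one per
    sample. Variance as the ranking criterion is an assumption.
    """
    ranked = sorted(scores, key=lambda p: pvariance(scores[p]), reverse=True)
    return ranked[:k]


# Hypothetical toy data: three probes measured on four samples.
scores = {
    "probe_1": [10.0, 10.0, 10.0, 10.0],
    "probe_2": [2.0, 12.0, 3.0, 11.0],
    "probe_3": [5.0, 6.0, 5.0, 6.0],
}

top = most_varying_probes(scores, 2)
print(top)
```

Filtering this way reduces the number of probes passed to the downstream analysis; the alternative described in the Methods is to skip the filter and use all probes.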

Results: The water samples were classified into 3 topologically distinct groups: plume, non-plume, and boundary. With both probe-selection methods, lower diversity was found in the plume than in the non-plume samples. More than 99.95% of the probes had greater signal outside the plume than in the plume. Probes that best discriminate the in-the-plume group from the out-of-the-plume group were identified. Within the plume, we observed a general decrease in Verrucomicrobia, Gammaproteobacteria, Flavobacteriales, and Prochlorococcus, while Oceanospirillales and Pseudoalteromonadales displayed elevated populations.  Furthermore, multiple probes without a match in the current Greengenes 16S rRNA database displayed significantly different hybridization among the sample types.
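The probe-level comparison in the Results can be illustrated as follows. This sketch assumes that "greater signal" means a higher mean score across the samples of a group and that probes are ranked by absolute difference in group means; neither choice is stated in the abstract, and the probe ids, scores, and sample indices are invented toy values.

```python
from statistics import mean


def group_means(values, plume_idx, nonplume_idx):
    """Return (mean in plume, mean outside plume) for one probe."""
    return (
        mean(values[i] for i in plume_idx),
        mean(values[i] for i in nonplume_idx),
    )


def fraction_higher_outside(scores, plume_idx, nonplume_idx):
    """Fraction of probes whose mean signal is greater outside the plume."""
    higher = 0
    for values in scores.values():
        inside, outside = group_means(values, plume_idx, nonplume_idx)
        if outside > inside:
            higher += 1
    return higher / len(scores)


def rank_by_mean_difference(scores, plume_idx, nonplume_idx):
    """Probe ids sorted by absolute difference in group means, largest first."""
    def gap(probe):
        inside, outside = group_means(scores[probe], plume_idx, nonplume_idx)
        return abs(outside - inside)
    return sorted(scores, key=gap, reverse=True)


# Hypothetical toy data: samples 0-1 are in the plume, samples 2-3 are not.
scores = {
    "p1": [1.0, 2.0, 8.0, 9.0],
    "p2": [5.0, 5.0, 4.0, 4.0],
    "p3": [0.0, 1.0, 3.0, 3.0],
    "p4": [2.0, 2.0, 6.0, 7.0],
}
plume_idx, nonplume_idx = [0, 1], [2, 3]

fraction = fraction_higher_outside(scores, plume_idx, nonplume_idx)
ranking = rank_by_mean_difference(scores, plume_idx, nonplume_idx)
print(fraction, ranking)
```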

Conclusions:  PhyloChip-TDA rapidly revealed the community shift within the oil plume.  The focus on individual probe responses, as opposed to traditional “probe sets,” allowed discovery of topological substructures, confirmed population shifts reported by previous methods, and tracked potentially novel strains of bacteria that were enriched in the oil plume.