PRIMARY ENTRY ------------- Genetic Programming Based Feature Generation for Automated DNA Sequence Analysis SUPPORTING PAPERS -------------------------------- Uday Kamath, Jack Compton, Rezarta Islamaj Dogan, Kenneth A. De Jong, and Amarda Shehu. "An Evolutionary Algorithm Approach for Feature Generation from Sequence Data and its Application to DNA Splice-Site Prediction". Journal of Transactions in Computational Biology and Bioinformatics 2012, (in press). Uday Kamath, Kenneth A De Jong, and Amarda Shehu. "An Evolutionary-based Approach for Feature Generation: Eukaryotic Promoter Recognition." IEEE Congress on Evolutionary Computation (IEEE CEC), New Orleans, LA, pg. 277-284, 2011. BACKGROUND OF RESEARCH -------------------------------- (HISTORY) The birth of genome sequence analysis can be traced back to the 1960s, when Margaret Dayhoff and others at the National Biomedical Research Foundation collected protein sequences into a database and started studying evolutionary relationships. Since then, the post-genomic revolution caused an explosion in the number of genomic sequences available for analysis. The human genome project alone has resulted in the collection of about 25,000 protein-encoding genes. In the last two decades, significant efforts have been made by molecular biologists and bioinformaticians to extract information about the function and role of these sequences in healthy and sick cells. Nowadays, a driving effort in molecular biology is to annotate genomic sequences with functional information. While machine learning has started to make contributions to the underlying problem of finding meaningful signals hidden in biological sequences, important problems remain open, such as the annotation of regulatory elements in DNA sequences and the identification of splice sites. These problems, in particular, have proved difficult for biologists and other domain experts to scan sequences and extract signals or features that correlate with a specific functional category. (THE PROBLEM AND PREVIOUS RESEARCH) Predicting and annotating key functional regions in DNA sequences lends itself to formulation as a classification problem in machine learning. The effectiveness of machine learning algorithms for classification largely depends on finding discriminating signals or features. Generally, when human experts are taken out of the loop and computers take the central role in analysis and prediction, the approach is to either explicitly seek function-related features by essentially enumerating short k-mers or implicitly consider the role of features through black box kernel methods. There are several issues with these approaches. Techniques that do not explicitly identify the features that determine the functional role of a biological sequence (e.g., standard SVM approaches) fail to provide any meaningful insight to biologists into understanding how sequence determines function in healthy cells and how changes to sequence translate to changes in function in abnormal cells. On the other hand, enumeration-based approaches to feature identification introduce artificial limits on the length and the complexity of features in order to achieve reasonable computation times. Many central problems in genome analysis, such as the identification of splice sites in DNA sequences, appear to require complex features that are able to capture more than local information. If we are to automate such procedures, it is important to go beyond a dependence on humans for providing feature sets without being caught in the computational complexity of exhaustive enumeration. (OUR APPROACH) Our work, described in detail in the two included publications, advances the state-of-the-art in two central challenging problems in genome analysis and offers a general powerful approach to automate the process of feature generation in biological sequences by essentially wrapping machine learning inside an evolutionary computing framework. Specifically, we generalize the definition of a sequence-derived feature to incorporate local and global information about a biological sequence by representing features as boolean combinations of basic building blocks. While resulting in very general and rich features, expanding the scope and complexity of the feature generation process invariably confronts us with an NP-hard problem. We address this problem by employing evolutionary search algorithms that allow the exploration of large feature spaces in a computationally viable manner. Specifically, our work shows that the Genetic Programming (GP) approach is particularly powerful in identifying meaningful features for splice sites in DNA sequences. These features are generally too complex for humans to construct, but they are shown to be necessary to obtain high accuracy and precision in splice site identification and annotation. (WHY IT WORKS) Our approach has shown to be effective in two very difficult genome sequence analysis areas: the identification of splice sites and (regulatory) promoter regions in DNA sequences. The key advantage of our evolutionary-based feature-generation approach is that not only it automates the design of highly discriminating and complex features from datasets of biological sequences, but it also improves the prediction of unknown sequences from similar distributions with higher accuracy and precision as compared to other (kmer- or kernel-based) state-of-the-art techniques. In particular, the GP-based methodology that we have described in our publications has the advantage of being a white-box model, namely, the discriminating features are available to anyone for further analysis. (SUMMARY OF PAPERS) In the splice-site work (Paper #1), we compare our GP-based method with the state-of-the-art feature generation algorithms that use Markovian, probabilistic, and iterative methods on various data sets. We also compare with models proposed by black-box based kernel methods. The comparisons are extensive and thorough, comparing our method with both feature-based and kernel-based approaches. Our method improves various classification statistics, including precision, accuracy, and f-measures, not only in controlled settings but on actual genome-wide sequences. Moreover, the method performs very well beyond a classification setting; in a prediction setting, the method correctly identifies the location of splice sites in long human gene sequences. In the promoter region identification work (Paper #2), our method is also compared on various datasets with state-of-the-art Markovian, probabilistic, and iterative methods. According to various statistics, including accuracy, precision, and f-measures, our method outperforms seven state-of-the-art methods. In both settings, statistical techniques, such as Information Gain and F-Measure, allow demonstrating the high quality of the features constructed by the evolutionary framework as opposed to naive, iterative-based feature-generation techniques. Reassuringly, the features include some that have been the result of decades of built insight and expertise of biological domain experts. It is worth emphasizing that our work not only automates the process of feature generation for classification and annotation of genomic sequences with functional information, but it actually allows biologists to analyze all reported features and draw new knowledge and understanding into the inner workings of the biology of organisms. In addition to making available to the wide community our code and datasets, all features are also posted online on our websites for public access. (AVAILABILITY) We have made our algorithms, codes, utilities, datasets, and features publicly available under http://cs.gmu.edu/~ashehu/sites/default/files/tools/TCBB_2011/Software.html We have already been contacted by researchers interested in porting some of the ideas we implement in other biological problems. CLAIMS AND SUPPORT FOR CLAIMS -------------------------- (B) The result is equal to or better than a result that was accepted as a new scientific result at the time when it was published in a peer-reviewed scientific journal. (E) The result is equal to or better than the most recent human-created solution to a long-standing problem for which there has been a succession of increasingly better human-created solutions. (F) The result is equal to or better than a result that was considered an achievement in its field at the time it was first discovered. (G) The result solves a problem of indisputable difficulty in its field. Paper #1 -------------------------- CLAIMS: B,E,F RESULT: Table 2 shows comparisons with two state-of-the-art feature-generation methods and shows improvements of 12% and 3-4% respectively in accuracy over human sequences. CLAIMS: B,E,F and G RESULT: Figures 7 and 8 compare precision-recall curves with two state-of-the-art feature-generation methods. Statistically-significant changes of 10-20% are obtained in the break-even curves. AU-ROC comparisons also show statistically-significant improvements over state-of-the-arts methods. Table 4 shows AU-PRC improvements of 3-4% on classification of genome wide sequences over two state-of-the-art kernel-based methods. The datasets are those where the kernel methods to which we compare our method were shown superior before our work. CLAIM: G RESULT: Figure 11 and additional detailed results shown on our website (http://cs.gmu.edu/~ashehu/sites/default/files/tools/TCBB_2011/Software.html) showcases gene annotations as another key by-product of our method. Paper #2 -------------------------- CLAIMS: B,F RESULT: Table 2 shows significant improvements by our method of 4-10% on AU-ROCs in comparison with enumeration-based methods. Improvements are shown on a broad list of species. CLAIMS: B,F,G RESULT: Table 3 shows that our method achieves the highest precision and accurcy on extensive comparisons with 7 promoter identification methods on the plant dataset. AUTHORS FOR PRIMARY ENTRY -------------------------- Uday Kamath Department of Information Technology George Mason University Fairfax Virginia ukamath@gmu.edu Amarda Shehu Department of Computer Science George Mason University Fairfax Virginia ashehu@gmu.edu Kenneth A. De Jong Department of Computer Science George Mason University Fairfax Virginia kdejong@gmu.edu CORRESPONDING AUTHOR -------------------------- Kenneth A. De Jong kdejong@gmu.edu ABSTRACT OF PAPERS -------------------------- Paper 1: Uday Kamath, Jack Compton, Rezarta Islamaj Dogan, Kenneth A. De Jong, and Amarda Shehu. An Evolutionary Algorithm Approach for Feature Generation from Sequence Data and its Application to DNA Splice-Site Prediction. Journal of Transactions in Computational Biology and Bioinformatics 2012, (in press) Associating functional information with biological sequences remains a challenge for machine learning methods. The performance of these methods often depends on deriving predictive features from the sequences sought to be classified. Feature generation is a difficult problem, as the connection between the sequence features and the sought property is not known a priori. It is often the task of domain experts or exhaustive feature enumeration techniques to generate a few features whose predictive power is then tested in the context of classification. This paper proposes an evolutionary algorithm to effectively explore a large feature space and generate predictive features from sequence data. The effectiveness of the algorithm is demonstrated on an important component of the gene-finding problem, DNA splice site prediction. This application is chosen due to the complexity of the features needed to obtain high classification accuracy and precision. Our results test the effectiveness of the obtained features in the context of classification by Support Vector Machines and show significant improvement in accuracy and precision over state-of-the-art approaches. Paper 2: Uday Kamath, Kenneth A De Jong, and Amarda Shehu. "An Evolutionary-based Approach for Feature Generation: Eukaryotic Promoter Recognition." IEEE Congress on Evolutionary Computation (IEEE CEC), New Orleans, LA, pg. 277-284, 2011 Abstract: Prediction of promoter regions continues to be a challenging subproblem in mapping out eukaryotic DNA. While this task is key to understanding the regulation of differential transcription, the gene-specific architecture of promoter sequences does not readily lend itself to general strategies.To date, the best approaches are based on a Support Vector Machines (SVMs) that employ standard spectrum features and achieve promoter region classification accuracies from a low of 84% to a high of 94% depending on the particular species involved. In this paper, we propose a general and powerful methodology that uses Genetic Programming (GP) techniques to generate more complex and more gene-specific features to be used with a standard SVM for promoter region identification. We evaluate our methodology on three data sets from different species and observe consistent classification accuracies in the 94-95% range. In addition, because the GP-generated features are gene-specific, they can be used by biologists to advance their understanding of the architecture of eukaryotic promoter regions. CITATION OF PAPERS -------------------------- Uday Kamath, Jack Compton, Rezarta Islamaj Dogan, Kenneth A. De Jong, and Amarda Shehu. An Evolutionary Algorithm Approach for Feature Generation from Sequence Data and its Application to DNA Splice-Site Prediction. Journal of Transactions in Computational Biology and Bioinformatics 2012, (in press). Uday Kamath, Kenneth A De Jong, and Amarda Shehu. "An Evolutionary-based Approach for Feature Generation: Eukaryotic Promoter Recognition." IEEE Congress on Evolutionary Computation (IEEE CEC), New Orleans, LA, pg. 277-284, 2011. PRIZE MONEY -------------------------- In order to scale this approach to much larger large genome-wide sequences (billions), we will need to develop a highly parallel version. We are already exploring the possibility of exploiting the power of GPU clusters and would use the prize money for GPU hardware and further research and development. WHY IS THIS THE BEST? -------------------------- a) It solves non-trivial computational biology/bioinformatics problems of sequence classification in two difficult arenas of splice site and promoter region identification. b) It reduces the computational cost while simultaneously improving the accuracy/precision over most state of the arts. c) It is a unique way of combining Genetic Programming and Machine Learning in bioinformatics for automating the process that has inspired and continues to inspire similar applications to other difficult problems in biology such as structure predictions and protein folding. d) It not only automates the sequence annotation task, but but does so in a more informative manner in the sense that researchers can study and learn from the generated features. For both problems, unique highly discriminating features are made available for biological experts to analyze (http://cs.gmu.edu/~ashehu/sites/default/files/tools/TCBB_2011/Software.html and http://promoter e) The entire code, features generated, tools and annotations of data sets are open sourced for the community to learn, use and extend it for other applications. (http://cs.gmu.edu/~ashehu/sites/default/files/tools/TCBB_2011/Software.html)