1) PAPER TITLE

Towards Better than Human Capability in Diagnosing Prostate Cancer Using Infrared Spectroscopic Imaging

2) AUTHORS:

Xavier Llora
National Center for Supercomputing Applications
1205 W. Clark St.
2028 NCSA Building
University of Illinois at Urbana-Champaign
Urbana, IL 61801
(217) 265-0894
xllora@uiuc.edu

Rohith Reddy
Department of Bioengineering & Beckman Institute for Advanced Science and Technology
1304 W. Springfield Ave.
3120 Digital Computer Laboratory
University of Illinois at Urbana-Champaign
Urbana, IL 61801
(217) 333-1867
reddy2@uiuc.edu

Brian Matesic
Department of Bioengineering
1304 W. Springfield Ave.
3120 Digital Computer Laboratory
University of Illinois at Urbana-Champaign
Urbana, IL 61801
(217) 333-1867
matesic2@uiuc.edu

Rohit Bhargava
Department of Bioengineering & Beckman Institute for Advanced Science and Technology
1304 W. Springfield Ave.
3120 Digital Computer Laboratory
University of Illinois at Urbana-Champaign
Urbana, IL 61801
(217) 265-6598
rxb@uiuc.edu

3) CORRESPONDING AUTHOR:

Xavier Llora

4) ABSTRACT:

Cancer diagnosis is essentially a human task. Almost universally, the process requires the extraction of tissue (biopsy) and examination of its microstructure by a human. To improve diagnoses based on limited and inconsistent morphologic knowledge, a new approach has recently been proposed that uses molecular spectroscopic imaging to exploit microscopic chemical composition for diagnosis. In contrast to visible imaging, the approach results in very large data sets, as each pixel contains the entire molecular vibrational spectroscopy data from all chemical species. Here, we propose data handling and analysis strategies to allow computer-based diagnosis of human prostate cancer by applying a novel genetics-based machine learning technique (NAX). We apply this technique to demonstrate both fast learning and accurate classification that, additionally, scales well with parallelization.
Preliminary results demonstrate that THIS APPROACH CAN IMPROVE CURRENT CLINICAL PRACTICE in diagnosing prostate cancer.

5) CRITERIA:

(B) The result is equal to or better than a result that was accepted as a new scientific result at the time when it was published in a peer-reviewed scientific journal.

(D) The result is publishable in its own right as a new scientific result, independent of the fact that the result was mechanically created.

(E) The result is equal to or better than the most recent human-created solution to a long-standing problem for which there has been a succession of increasingly better human-created solutions.

6) STATEMENT:

A pathologist's opinion of structures in stained tissue is the definitive diagnosis for almost all cancers and provides critical input for therapy. In particular, prostate cancer accounts for one-third of noncutaneous cancers diagnosed in US men, and it is a leading cause of cancer-related death. The diagnostic procedure involves removing cells or tissues, staining them with dyes to provide visual contrast, and having a skilled person (a pathologist) examine them under a microscope. Because of personnel, training, natural variability, and biologic differences, the challenge in prostate cancer research and practice is to provide accurate, objective, and reproducible decisions. The biopsy-staining-microscopy-manual recognition approach has been used for over 150 years, and no automated method has thus far proven to be human-competitive.

Fourier transform infrared (FTIR) imaging provides the entire vibrational spectroscopic information from every pixel of a sample's microscopy image. The chemical composition of each cell in the image can be inferred from its spectroscopic signature. Thus, objective predictions may be possible if we can successfully navigate such large collections of high-dimensional images.
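To make the data-size point concrete, the following Python sketch treats an FTIR image as a cube in which every pixel carries a full spectrum, and flattens it into one feature vector per pixel, the form a classifier would typically consume. The dimensions, the helper names, and the zero-filled spectra are hypothetical placeholders, not values or code from the study.

```python
def make_cube(rows, cols, bands):
    """Build a toy image cube: cube[r][c] is the spectrum at pixel (r, c)."""
    return [[[0.0] * bands for _ in range(cols)] for _ in range(rows)]


def flatten_pixels(cube):
    """Turn the cube into a list of ((row, col), spectrum) records."""
    return [
        ((r, c), spectrum)
        for r, row in enumerate(cube)
        for c, spectrum in enumerate(row)
    ]


# Hypothetical sizes chosen only for illustration.
rows, cols, bands = 4, 5, 8
pixels = flatten_pixels(make_cube(rows, cols, bands))
total_values = sum(len(spectrum) for _, spectrum in pixels)
print(len(pixels), total_values)  # 20 pixels, 160 stored values
```

The number of stored values grows as rows x cols x bands, which is why a spectrum-per-pixel image is so much larger than a visible-light image with three channels per pixel.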
A classification process, resulting in an automated labeling of tissue types and proper prediction of each cell's pathology, is the basis of our approach to providing an accurate and objective decision. However, the large data size requires proper feature selection and learning of accurate models. Previous efforts on labeling tissue types using vibrational spectroscopy have been partially successful but have not been able to achieve accuracy rates comparable to those of a human pathologist in a large sample population. Our work using efficient large-scale genetics-based machine learning is the first to show that accurate, objective predictions may be possible.

Our submission satisfies the following criteria for a human-competitive result:

(B) The result is better than a result that was accepted as a new scientific result at the time when it was published in a peer-reviewed scientific journal

AND

(E) The result is equal to or better than the most recent human-created solution to a long-standing problem for which there has been a succession of increasingly better human-created solutions.

Our work has been able to demonstrate that FTIR imaging can not only be used to analyze tissue, as reported previously, but can also be used to detect disease accurately. Guidelines to diagnose prostate cancer manually have been published periodically but are, as yet, imperfect. Our genetics-based machine learning algorithm (NAX) was able to perform feature selection and model induction as part of the process. NAX's efficient parallelization was able to deal with FTIR data and evolve accurate models. Previous attempts have used simplified data sets, obtained by pixel aggregation and extreme filtering of the training instances, to fit traditional learning algorithms. These black-box models were unable to match pathologist accuracy (patient diagnostic accuracy did not break the 80% barrier).
On the other hand, our approach, using all the available FTIR data, correctly classified 87.43% of the raw pixels. Those predictions, when inspected at the patient level, lead to an overall patient diagnosis accuracy of more than 95%, which is in the region of human performance by the world's leading authorities in prostate cancer. It is important to note that human predictions are prone to error, and NAX misclassifications may also reveal inaccurate diagnoses in complex cases. However, the critical point is that no previous approach has been able to match human-driven diagnosis. Further, the models produced by NAX are human-readable, providing helpful insight into the underlying problem structure and a rationale for the diagnosis. Human diagnosis, by contrast, is subjective and not easily quantified.

(D) The result is publishable in its own right as a new scientific result, independent of the fact that the result was mechanically created.

Results obtained via both pathologists and automated diagnosis techniques have been previously published on numerous occasions. Since our results are significantly better than those previously published and yield accurate and objective diagnoses, they should be publishable in their own right. Indeed, we have a journal paper in press in a reputable evolutionary computation journal describing the process and the results obtained using NAX, and we are submitting a paper to a top medical journal on the results and their implications for clinical diagnosis.

7) CITATION:

X. Llorà, R. Reddy, B. Matesic, R. Bhargava, "Towards Better than Human Capability in Diagnosing Prostate Cancer Using Infrared Spectroscopic Imaging", Proceedings of the Genetic and Evolutionary Computation Conference 2007 (GECCO 2007), in press.

8) STATEMENT OF PRIZE DISTRIBUTION:

Prize money, if any, is to be divided among the contributing research personnel with the following percentages: X. Llorà (40%), R. Reddy (15%), B. Matesic (5%), R.
Bhargava (40%)

9) STATEMENT OF COMPARISON TO OTHER "HUMAN COMPETITIVE" ENTRIES

Prostate cancer accounts for one-third of noncutaneous cancers diagnosed in US men, and it is a leading cause of cancer-related death. Providing clinical input that will allow for more effective detection and treatment of human cancers has far-reaching implications for both patients and physicians. By combining expertise in molecular chemistry, microscopy image processing for spectroscopy and structural information, optimization, and machine learning, we have been able to provide practical methods that: (1) determine all cell types in tissue in an automated and objective manner, and (2) detect disease (especially cancer) accurately and with a defined degree of confidence. NAX-induced models produce, for the first time, predictions that are accurate when compared with those of human experts. Such a tool may make it possible to reduce the workload of pathologists, helping them to focus on complex cases, and to reduce the overall diagnosis time, which also benefits the treatment process.

Most importantly, this genetics-based machine learning approach is generic enough to be applied to tissue types other than prostate, which we have used as a "proof of principle". Our initial experiments with other tissues (breast and colon) show human-competitive results very similar to those obtained in the prostate domain, without sacrificing accuracy, efficiency, near-instantaneous predictions, or large throughput. These results may also be the basis for model transfer between domains (tissue types).

Our submission also stands out by competing with humans in an exercise that has been the exclusive domain of human capability for almost 150 years. The computational approach to analyzing this unique data is itself interesting and innovative; more importantly, the results are comparable to those from the leading experts in the world. It is anticipated that this approach will not just compete with but will beat the average practitioner.
The problem is certainly difficult and is exceptionally relevant, as almost one in six men will be diagnosed with prostate cancer in the US during their lifetime.
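
The statement in section 6 reports results at two levels: 87.43% of raw pixels classified correctly and a patient-level diagnosis accuracy above 95%. The submission does not spell out how pixel predictions are combined into a patient diagnosis, so the following Python sketch only illustrates one plausible rule, a simple vote over each patient's pixels. The function name, the label strings, and the `threshold` parameter are all hypothetical and are not taken from NAX.

```python
from collections import defaultdict


def diagnose_patients(pixel_predictions, threshold=0.5):
    """Roll pixel-level labels up to one label per patient.

    pixel_predictions: iterable of (patient_id, label) pairs, where label
    is "cancer" or "benign" (hypothetical label names).
    threshold: assumed fraction of cancer pixels above which the patient
    is called "cancer"; this rule is an illustration, not the NAX method.
    """
    counts = defaultdict(lambda: [0, 0])  # patient_id -> [cancer, total]
    for patient_id, label in pixel_predictions:
        counts[patient_id][1] += 1
        if label == "cancer":
            counts[patient_id][0] += 1
    return {
        pid: "cancer" if cancer / total > threshold else "benign"
        for pid, (cancer, total) in counts.items()
    }


# Toy data: a few mislabeled pixels do not change either patient's outcome.
toy = (
    [("p1", "cancer")] * 7 + [("p1", "benign")] * 3
    + [("p2", "benign")] * 8 + [("p2", "cancer")] * 2
)
print(diagnose_patients(toy))
```

Under a voting rule of this kind, scattered pixel errors tend to be outvoted within each patient, which is one way a patient-level accuracy can exceed the pixel-level accuracy.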