1. The complete title of one (or more) paper(s) published in the open literature describing the work that the author claims describes a human-competitive result;
Automated discovery of test statistics using genetic programming
2. The name, complete physical mailing address, e-mail address, and phone number of EACH author of EACH paper(s);
Jason H. Moore, Ph.D. jhmoore@upenn.edu
Randal S. Olson, Ph.D. rso@randalolson.com
Yong Chen, Ph.D. ychen123@pennmedicine.upenn.edu
Moshe Sipper, Ph.D. sipper@gmail.com
Institute for Biomedical Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia PA 19104, USA. Phone: 215-573-4411
3. The name of the corresponding author (i.e., the author to whom notices will be sent concerning the competition);
Jason H. Moore, Ph.D.
4. The abstract of the paper(s);
The process of developing new test statistics is laborious, requiring the manual development and evaluation of mathematical functions that satisfy several theoretical properties. Automating this process, hitherto not done, would greatly accelerate the discovery of much-needed, new test statistics. This automation is a challenging problem because it requires the discovery method to know something about the desirable properties of a good test statistic in addition to having an engine that can develop and explore candidate mathematical solutions with an intuitive representation. In this paper we describe a genetic programming-based system for the automated discovery of new test statistics. Specifically, our system was able to discover test statistics as powerful as the t-test for comparing sample means from two distributions with equal variances.
5. A list containing one or more of the eight letters (A, B, C, D, E, F, G, or H) that correspond to the criteria (see above) that the author claims that the work satisfies;
B, D, E, G
6. A statement stating why the result satisfies the criteria that the contestant claims (see examples of statements of human-competitiveness as a guide to aid in constructing this part of the submission);
The process of developing much-needed new test statistics is laborious, requiring the manual development and evaluation of mathematical functions that satisfy theoretical properties such as being unbiased, having low variance (efficient), and capturing the relevant information contained in the data (sufficient). Despite the many advances in applied statistics and data science, the development process for new test statistics has not yet been automated using computational methods. Automation would greatly accelerate the discovery of new test statistics that are very much needed in the era of big data. This would in turn accelerate scientific discovery and research translation.
The need for new statistics is exploding as new technologies give us new data with unique characteristics that yield new scientific questions. To date, the development of new statistics has been done by humans. Our study is the first to develop a new and viable test statistic automatically, by means of an evolutionary algorithm. As such, it meets criteria B, D, E, and G definitively: our result is as good as the t-test, and actually better in terms of complexity (B); it is readily publishable in its own right as a new test statistic (D); it is better than the most recent human-created test statistic (E), being as good in terms of performance and better in terms of complexity; and it solves a fundamental problem of indisputable difficulty in the field of statistics: the discovery of new test statistics (G).
7. A full citation of the paper (that is, author names; publication date; name of journal, conference, technical report, thesis, book, or book chapter; name of editors, if applicable, of the journal or edited book; publisher name; publisher city; page numbers, if applicable);
J. H. Moore, R. S. Olson, Y. Chen, and M. Sipper, Automated discovery of test statistics using genetic programming, Genetic Programming and Evolvable Machines, vol. 20, issue 1, pp. 127-137, March 2019.
8. A statement either that "any prize money, if any, is to be divided equally among the co-authors" OR a specific percentage breakdown as to how the prize money, if any, is to be divided among the co-authors;
Prize money, if any, is to be divided equally among the co-authors
9. A statement stating why the authors expect that their entry would be the "best," and
The goal of the present study was to develop an evolutionary system for the automated discovery of new test statistics. This has never been done in an automated manner before, only by humans. To solve such an arduous task automatically, we had to address several challenges. First, we needed an engine for generating mathematical candidates for test statistics, in our case using available array-based operators in a modern programming language with a data structure that is easy for the computer to manipulate. Second, we needed a set of evaluation criteria that are general enough to allow the computer to generate innovative solutions while specific enough to satisfy human statistical objectives without directing the computer to pre-determined outcomes. Third, we needed a system that could tinker with candidate test statistics as a mathematician would, by making small changes and by interchanging functional modules to create new solutions.
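The three ingredients above can be illustrated with a minimal Python sketch. It is not the system described in the paper: the primitive set, the nested-tuple tree encoding, the `empirical_power` fitness, and the `mutate` operator are all hypothetical choices made for illustration, using only the Python standard library.

```python
import math
import random
import statistics

# Hypothetical primitive set: reducers turn a sample into a number,
# binary operators combine numbers. None of these names come from the paper.
REDUCERS = {"mean": statistics.fmean, "var": statistics.variance}
BINARY = {
    "add": lambda a, b: a + b,
    "sub": lambda a, b: a - b,
    "mul": lambda a, b: a * b,
    # Protected division, a common GP convention to avoid dividing by zero.
    "div": lambda a, b: a / b if abs(b) > 1e-12 else 0.0,
}


def evaluate(tree, x, y):
    """Evaluate an expression tree (nested tuples) on two samples."""
    if isinstance(tree, (int, float)):
        return float(tree)  # numeric constant leaf
    op = tree[0]
    if op in REDUCERS:
        return REDUCERS[op](x if tree[1] == "x" else y)
    if op == "sqrt":
        return math.sqrt(abs(evaluate(tree[1], x, y)))  # protected sqrt
    return BINARY[op](evaluate(tree[1], x, y), evaluate(tree[2], x, y))


def empirical_power(tree, shift, n=20, trials=300, alpha=0.05, seed=0):
    """Fraction of shifted-mean trials whose |statistic| exceeds the
    (1 - alpha) empirical quantile of |statistic| under equal means."""
    rng = random.Random(seed)

    def draw(mu):
        x = [rng.gauss(0.0, 1.0) for _ in range(n)]
        y = [rng.gauss(mu, 1.0) for _ in range(n)]
        return abs(evaluate(tree, x, y))

    null = sorted(draw(0.0) for _ in range(trials))
    threshold = null[round((1 - alpha) * trials) - 1]
    return sum(draw(shift) > threshold for _ in range(trials)) / trials


def mutate(tree, rng):
    """Point mutation: swap the root binary operator for a different one."""
    if isinstance(tree, tuple) and tree[0] in BINARY:
        new_op = rng.choice([k for k in BINARY if k != tree[0]])
        return (new_op,) + tree[1:]
    return tree


# A hand-written candidate with the shape of a two-sample t-like statistic.
T_LIKE = ("div",
          ("sub", ("mean", "x"), ("mean", "y")),
          ("sqrt", ("add", ("var", "x"), ("var", "y"))))
```

Under these assumptions, a call such as `empirical_power(T_LIKE, shift=1.0)` scores a candidate by how often it separates shifted samples from null samples, and `mutate(T_LIKE, random.Random(1))` produces a small variation of it. A full evolutionary loop would add a population, subtree crossover, and selection.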
AI has shown promise for automating several human-driven processes, such as object detection in images, speech recognition, game playing, financial trading, and more. The deep neural networks in common use today are purely pattern-recognition engines that differentiate signal from noise in big data. The development of test statistics is a far more challenging problem because it requires the optimization system to know something about the desirable properties of a good test statistic in addition to having an engine that can develop and explore candidate mathematical solutions with an intuitive representation. Very few fundamental mathematical problems have been tackled by automated systems to date.
Despite the many advances in applied statistics and data science, the development process for new test statistics has not yet been automated using computational methods. Automation would greatly accelerate the discovery of new test statistics that are very much needed in the era of big data. This would in turn accelerate scientific discovery and research translation. We therefore believe that the results of this work have the potential to be highly transformative.
10. An indication of the general type of genetic or evolutionary computation used, such as GA (genetic algorithms), GP (genetic programming), ES (evolution strategies), EP (evolutionary programming), LCS (learning classifier systems), GE (grammatical evolution), GEP (gene expression programming), DE (differential evolution), etc.
Genetic programming
11. The date of publication of each paper. If the date of publication is not on or before the deadline for submission, but instead, the paper has been unconditionally accepted for publication and is "in press" by the deadline for this competition, the entry must include a copy of the documentation establishing that the paper meets the "in press" requirement.
March 2019