(1) TITLES PAPERS:
-Evolving local and global weighting schemes in Information Retrieval-
-An analysis of the Solution Space for Genetically Programmed Term-Weighting Schemes in Information Retrieval-
-Term-Weighting in Information Retrieval using Genetic Programming: A Three-Stage Process-

(2) AUTHORS:
Ronan Cummins
Department of Information Technology
National University of Ireland
Galway
Ireland
ronan.cummins@nuigalway.ie
353-91-492042

Colm O'Riordan
Department of Information Technology
National University of Ireland
Galway
Ireland
colmor@it.nuigalway.ie
353-91-493143

(3) CORRESPONDING AUTHOR:
Ronan Cummins

(4) ABSTRACTS:
-Evolving local and global weighting schemes in Information Retrieval-
"This paper describes a method, using Genetic Programming, to automatically determine term-weighting schemes for the vector space model. Based on a set of queries and their human determined relevant documents, weighting schemes are evolved which achieve a high average precision. In Information Retrieval (IR) systems, useful information for term weighting schemes is available from the query, individual documents and the collection as a whole. We evolve term weighting schemes in both local (within-document) and global (collection-wide) domains which interact with each other correctly to achieve a high average precision. These weighting schemes are tested on well-known test collections and are compared to the traditional tf-idf weighting scheme and to the BM25 weighting scheme using standard IR performance metrics. Furthermore, we show that the global weighting schemes evolved on small collections also increase average precision on larger TREC data. These global weighting schemes are shown to adhere to Luhn's resolving power as both high and low frequency terms are assigned low weights. However, the local weightings evolved on small collections do not perform as well on large collections.
We conclude that in order to evolve improved local (within-document) weighting schemes it is necessary to evolve these on large collections."

-An analysis of the Solution Space for Genetically Programmed Term-Weighting Schemes in Information Retrieval-
"Evolutionary algorithms and Genetic Programming (GP) in particular are increasingly being applied to the problem of evolving term-weighting schemes in Information Retrieval (IR). One fundamental problem with the solutions generated by this stochastic, non-deterministic process is that they are often difficult to analyse. We develop a number of different distance measures between the phenotypes (ranked lists) of the solutions (term-weighting schemes) returned by a GP process. Using these distance measures, we develop trees which show how different solutions are clustered in the solution space. Using this framework we show that our evolved solutions lie in a different part of the solution space than two of the best benchmark term-weighting schemes available."

-Term-Weighting in Information Retrieval using Genetic Programming: A Three-Stage Process-
"This paper presents term-weighting schemes that have been evolved using genetic programming in an adhoc Information Retrieval model. We create an entire term-weighting scheme by firstly assuming that term-weighting schemes contain a global part, a term-frequency influence part and a normalisation part. By separating the problem into three distinct phases, we reduce the search space and ease the analysis of the schemes generated by the process."

(5) CRITERIA:
(B), (D), (E), (F), (G)

(6) STATEMENT (1):
The retrieval of documents based on weighted keywords is well known to be a difficult problem in the field of information retrieval for many reasons (G). In 1972, Karen Spärck Jones published, in the Journal of Documentation, the paper that defined the term-weighting scheme now known as inverse document frequency (IDF).
IDF is still the basis for many term-weighting, feature extraction and document classification approaches. It has also been adopted into many other areas of information science. The genetically evolved global term-weighting schemes identified in work by Cummins and O'Riordan are shown to improve upon IDF (as a measure of the information content of a term). Equally importantly, the evolved global term-weighting schemes adhere more closely than IDF to Luhn's theory (1958) of the resolving power of significant terms (B, F), as IDF is not fully consistent with Luhn's hypothesis. The different schemes produced, the improvements seen using these schemes and the fact that they appear to validate earlier theories (Luhn, 1958) are important regardless of the fact that they were evolved (D). All current benchmarks for the problem of term-weighting (BM25 and pivoted document length normalisation schemes) were analytically developed. The problem of term-weighting is not seen as solved (E, G), and improvements and/or theoretical advances in term-weighting are important.

(7) CITATIONS:
Ronan Cummins and Colm O’Riordan. Evolving local and global weighting schemes in information retrieval. Information Retrieval, 9(3):311-330, June 2006.
Ronan Cummins and Colm O’Riordan. An analysis of the solution space for genetically programmed Term-Weighting schemes in Information Retrieval. In: P. M. D. Bell and P. Sage (eds.): 17th Artificial Intelligence and Cognitive Science Conference (AICS 2006). Queen’s University, Belfast, Northern Ireland, September 2006.
Ronan Cummins and Colm O’Riordan. Term-weighting in information retrieval using genetic programming: A three stage process. In The 17th European Conference on Artificial Intelligence, ECAI-2006, Riva del Garda, Italy, August 28th - September 1st 2006.

(8) BREAKDOWN:
Any prize money is to be divided equally between the authors.

(9) STATEMENT (2):
Term-weighting in information retrieval is a difficult problem.
These term-weighting schemes try to model the human concept of relevance, using weighted keywords to determine which documents to return to the user. The term-weighting schemes that are evolved are not only comparable to, but often surpass, the best modern benchmark (i.e. BM25). This alone is an interesting result in the field of information retrieval (i.e. the fact that better term-weighting schemes exist). Furthermore, the term-weighting schemes produced by the GP process provide useful information as to why they achieve such high performance. An analysis of some of the best schemes shows that they are consistent with a theory regarding the resolving power of significant terms proposed by Luhn in 1958. Other benchmarks in the literature are based on inverse document frequency (IDF), which is not totally consistent with this theory. Moreover, the best term-weighting schemes evolved are all similar in form, which indicates that the GP process converges to the same area of the search space (of term-weighting schemes) in a consistent manner, further validating the usefulness of the GP approach adopted.
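
As a point of reference for the IDF baseline discussed in both statements, the following is a minimal Python sketch of the common textbook formulation idf = log(N / df), where N is the number of documents in the collection and df is the number of documents containing the term. The function names, the collection size and the document frequencies are illustrative assumptions and are not taken from the papers; the evolved schemes themselves are not reproduced here.

```python
import math


def idf(num_docs: int, doc_freq: int) -> float:
    """Textbook inverse document frequency: log(N / df).

    num_docs -- total number of documents in the collection (N)
    doc_freq -- number of documents that contain the term (df)
    """
    if not 0 < doc_freq <= num_docs:
        raise ValueError("doc_freq must be between 1 and num_docs")
    return math.log(num_docs / doc_freq)


def idf_profile(num_docs: int, doc_freqs: list[int]) -> list[tuple[int, float]]:
    """Return (df, idf) pairs for a list of document frequencies."""
    return [(df, idf(num_docs, df)) for df in doc_freqs]


# Hypothetical collection of 1000 documents (assumed for illustration only).
N = 1000
profile = idf_profile(N, [1, 10, 100, 500, 1000])

for df, weight in profile:
    print(f"df={df:>4}  idf={weight:.3f}")
```

In this sketch the weight falls monotonically as document frequency rises, so the rarest terms always receive the highest weight. Luhn's notion of resolving power, as summarised in the first abstract, instead calls for low weights on both very frequent and very rare terms, which is the sense in which IDF is described above as not fully consistent with that theory.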