Facta Univ. Ser.: Elec. Energ., vol. 19, No. 3, December 2006, pp. 429-451.

Text Classification: Forming Candidate Key-Phrases from Existing Shorter Ones

Nikitas N. Karanikolas and Christos Skourlas

Abstract: Text classification is a hard problem with many aspects and potential solutions. In this paper, two main research directions for the classification of narrative documents are considered. The first is based on data mining and rule induction techniques, while the second combines traditional Text Retrieval techniques (the vector space model, index terms, and similarity measures) with Natural Language Processing and instance-based learning techniques. Key-phrases can be used as attributes for mining rules or as a basis for measuring the similarity of new (unclassified) documents to existing (classified) ones. Hence, we focus on the problem of extracting key-phrases from a text collection in order to use them as attributes for text classification. A new algorithm for the discovery of key-phrases is described. Candidate key-phrases are built from frequent shorter ones, and special emphasis is given to reducing the complexity of the algorithm.

Keywords: Text classification, key-phrase extraction, text indexing, information retrieval, document management.
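
The idea stated in the abstract, building candidate key-phrases from frequent shorter ones, can be illustrated with a minimal sketch. The abstract does not give the authors' algorithm, so the code below is only an Apriori-style approximation under stated assumptions: documents are already tokenized, a phrase is a run of adjacent words, and "frequent" means occurring in at least `min_support` documents. All function and parameter names (`frequent_phrases`, `join_candidates`, `min_support`, `max_len`) are hypothetical.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all runs of n adjacent tokens as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def document_frequency(docs, n, allowed=None):
    """Count in how many documents each n-word phrase occurs.

    If `allowed` is given, only phrases in that candidate set are counted.
    """
    counts = Counter()
    for tokens in docs:
        seen = set(ngrams(tokens, n))
        if allowed is not None:
            seen &= allowed
        counts.update(seen)
    return counts


def join_candidates(frequent):
    """Join two frequent n-word phrases into one (n+1)-word candidate.

    Assumption: two phrases are joined when the first without its first
    word equals the second without its last word.
    """
    by_prefix = {}
    for phrase in frequent:
        by_prefix.setdefault(phrase[:-1], []).append(phrase)
    candidates = set()
    for left in frequent:
        for right in by_prefix.get(left[1:], []):
            candidates.add(left + (right[-1],))
    return candidates


def frequent_phrases(docs, min_support=2, max_len=3):
    """Grow frequent phrases level by level, shortest first."""
    counts = document_frequency(docs, 1)
    frequent = {p for p, c in counts.items() if c >= min_support}
    result = {p: counts[p] for p in frequent}
    for n in range(2, max_len + 1):
        candidates = join_candidates(frequent)
        if not candidates:
            break
        # Only candidates built from frequent shorter phrases are counted.
        counts = document_frequency(docs, n, allowed=candidates)
        frequent = {p for p, c in counts.items() if c >= min_support}
        result.update({p: counts[p] for p in frequent})
    return result


docs = [
    "text classification rule induction".split(),
    "text classification key phrase".split(),
    "rule induction key phrase extraction".split(),
]
phrases = frequent_phrases(docs, min_support=2, max_len=3)
```

Restricting each level to candidates assembled from frequent shorter phrases is what keeps the search small in this sketch: a longer phrase cannot be frequent unless its sub-phrases are, so most word sequences are never counted. Whether the paper reduces complexity in the same way cannot be determined from the abstract alone.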
