Proceedings of the Second International Conference on Advances in Electronics, Electrical and Computer Engineering (EEC 2013)
"A TEXT MINING APPROACH FOR AUTOMATIC CLASSIFICATION OF WEB PAGES"
Abstract: “The web contains a huge amount of information in the form of HTML and XML pages, and the number of pages is growing rapidly as the web expands. In web text mining, extracting text and filtering the extracted content form the foundation for all later analysis. Automatic text classification is a semi-supervised machine learning task that assigns a given document to categories from a pre-defined set based on its features and text content. This paper describes a generic strategy for the automatic classification of web pages that handles both unstructured and semi-structured text. The datasets were classified into labeled classes using kNN and Naïve Bayesian classification techniques. In the experimental evaluation, kNN achieved better accuracy, precision and recall than Naïve Bayesian classification. The paper presents a unified approach that provides robust classification and validation of web pages across different categories.”
Keywords: accuracy, automatic classification, cosine similarity, kNN, Naïve Bayes, precision, recall, tf-idf
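
As a rough illustration of how the keywords fit together, the sketch below weights terms with tf-idf, compares pages by cosine similarity, and labels a new page by majority vote among its k nearest training pages. It is not the authors' implementation: the function names, the smoothed idf formula, the whitespace tokenizer, the value k = 3 and the toy training texts are all assumptions made for illustration, and a Naïve Bayes counterpart is not shown.

```python
import math
from collections import Counter

def tokenize(text):
    # Assumed tokenizer: lowercase and split on whitespace.
    return text.lower().split()

def build_idf(token_lists):
    # Document frequency: number of documents that contain each term.
    n = len(token_lists)
    df = Counter()
    for tokens in token_lists:
        df.update(set(tokens))
    # Smoothed idf (an assumption): every known term keeps a positive weight.
    return {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}

def tfidf(tokens, idf):
    # Sparse tf-idf vector; terms unseen in training are dropped.
    tf = Counter(tokens)
    return {t: tf[t] * idf[t] for t in tf if t in idf}

def cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def knn_classify(text, train_vecs, train_labels, idf, k=3):
    # Rank training pages by similarity to the query, then take a majority vote.
    query = tfidf(tokenize(text), idf)
    ranked = sorted(zip(train_vecs, train_labels),
                    key=lambda pair: cosine(query, pair[0]),
                    reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Toy training set, invented purely for illustration (not the paper's data).
TRAIN = [
    ("football match goal score team", "sports"),
    ("team wins league match", "sports"),
    ("player scores goal in final", "sports"),
    ("stock market shares fall", "finance"),
    ("bank raises interest rates", "finance"),
    ("market investors buy shares", "finance"),
]
TRAIN_TOKENS = [tokenize(text) for text, _ in TRAIN]
TRAIN_LABELS = [label for _, label in TRAIN]
IDF = build_idf(TRAIN_TOKENS)
TRAIN_VECS = [tfidf(tokens, IDF) for tokens in TRAIN_TOKENS]

SPORTS_GUESS = knn_classify("team scores late goal in match",
                            TRAIN_VECS, TRAIN_LABELS, IDF)
FINANCE_GUESS = knn_classify("investors sell shares as market falls",
                             TRAIN_VECS, TRAIN_LABELS, IDF)
```

On this toy data the first query shares terms only with the sports pages and the second only with the finance pages, so the votes go to "sports" and "finance" respectively. Real web pages would first need the text extraction and filtering step that the abstract mentions.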