A great collection of information, tips, and good advice for text mining with Pentaho's WEKA, put together by Jose María Gómez Hidalgo, a professor at the UEM. We use WEKA for sentiment analysis on social networks.
Some of the entries included in his collection:
- Text Mining in WEKA: Chaining Filters and Classifiers explains how and why you should do so when evaluating your text classifiers with cross-validation. The explanation uses the Explorer tools, and it serves as a quick introduction to the process of building a text classifier in WEKA, along with the FilteredClassifier class.
- Text Mining in WEKA Revisited: Selecting Attributes by Chaining Filters describes how to complete the life cycle of the learning process by adding feature selection to it with the MultiFilter class.
- Command Line Functions for Text Mining in WEKA shows how to perform the previous experiments with the FilteredClassifier and MultiFilter classes, but now from the command line interface instead of WEKA's Explorer.
- A Simple Text Classifier in Java with WEKA presents and discusses two small programs as examples of how to integrate WEKA into your Java code for text mining.
- URL Text Classification with WEKA, Part 1: Data Analysis shows an application of text classification to processing URL text as a complement to URL database-based filtering in Web filters. This first post just explains how I have built the dataset, while an upcoming post will explain my ongoing experiments.
- Mapping Vocabulary from Train to Test Datasets in WEKA Text Classifiers discusses three ways of mapping the set of terms used in the representations of the training and test sets of a text dataset for enabling learning, namely using batch filters, the FilteredClassifier class and the InputMappedClassifier class.
- Language Identification as Text Classification with WEKA explains how to build an automated language guesser for texts as a complete example of a Text Mining process with WEKA, and demonstrates a more advanced usage of the StringToWordVector class.
- Baseline Sentiment Analysis with WEKA shows how to configure and run an experiment on sentiment analysis and opinion mining using WEKA, and especially the TextDirectoryLoader and the NGramTokenizer classes.
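
Several of these entries revolve around one idea: the vocabulary used to represent the texts should be learned from the training data only, and then reused unchanged on the test data. The following is a minimal, library-free sketch of that idea in Python. It does not use the WEKA API, and the function names (`fit_vocabulary`, `vectorize`) and the toy documents are hypothetical, chosen only for illustration.

```python
def fit_vocabulary(train_docs):
    """Collect the tokens seen in the training documents only.

    Returns a dict mapping each term to a fixed column index.
    """
    terms = sorted({tok for doc in train_docs for tok in doc.lower().split()})
    return {term: index for index, term in enumerate(terms)}


def vectorize(doc, vocab):
    """Turn a document into term counts over a fixed vocabulary.

    Terms that were never seen in training are silently dropped,
    so train and test vectors always have the same columns.
    """
    vector = [0] * len(vocab)
    for tok in doc.lower().split():
        if tok in vocab:
            vector[vocab[tok]] += 1
    return vector


# Toy data, invented for this sketch.
train_docs = ["good movie", "bad movie", "good plot"]
test_docs = ["good acting", "bad bad plot"]

vocab = fit_vocabulary(train_docs)                       # learned from training data only
train_vectors = [vectorize(d, vocab) for d in train_docs]
test_vectors = [vectorize(d, vocab) for d in test_docs]  # same columns as training

print(vocab)
print(test_vectors)
```

In a cross-validation run, the same two steps would be repeated inside every fold, so that no term from the held-out fold leaks into the vocabulary. As the entries above describe, chaining the filter and the classifier with the FilteredClassifier class is what handles this in WEKA.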