This project is the PHP TextCat extension, which aims to provide a fast, language-independent and extensive tool for categorizing texts.
The underlying theory is N-Gram-Based Text Categorization. At this point the extension offers little extensibility from PHP itself, but that will change with a rewrite of the library.
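To give a feel for the theory, here is a minimal sketch of N-gram-based categorization as it is commonly described: build a ranked list of the most frequent character n-grams for each category, then pick the category whose ranking is closest to the ranking of the input text (the "out-of-place" distance). It is written in Python only for illustration and is not the extension's code. The function names, the tokenization, and the `max_n=5` and `top_k=300` defaults are assumptions made for this example.

```python
import re
from collections import Counter


def ngram_profile(text, max_n=5, top_k=300):
    """Return the top_k character n-grams (lengths 1..max_n), most frequent first.

    Assumption: words are lowercased runs of letters, padded with "_" on both sides.
    """
    counts = Counter()
    for word in re.findall(r"[^\W\d_]+", text.lower()):
        padded = "_" + word + "_"
        for n in range(1, max_n + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    # Sort by descending count, then alphabetically so ties are deterministic.
    ranked = sorted(counts, key=lambda gram: (-counts[gram], gram))
    return ranked[:top_k]


def out_of_place(doc_profile, category_profile):
    """Sum of rank differences; n-grams missing from the category get a fixed penalty."""
    rank = {gram: position for position, gram in enumerate(category_profile)}
    max_penalty = len(category_profile)
    total = 0
    for position, gram in enumerate(doc_profile):
        if gram in rank:
            total += abs(position - rank[gram])
        else:
            total += max_penalty
    return total


def classify(text, profiles):
    """Return the name of the category profile closest to the text."""
    doc_profile = ngram_profile(text)
    return min(profiles, key=lambda name: out_of_place(doc_profile, profiles[name]))


if __name__ == "__main__":
    # Made-up training samples, far too small for real use.
    samples = {
        "english": "the quick brown fox jumps over the lazy dog near the old house",
        "spanish": "el zorro salta sobre el perro perezoso cerca de la casa vieja",
    }
    profiles = {name: ngram_profile(text) for name, text in samples.items()}
    print(classify("the dog sleeps in the house", profiles))
```

With samples this small the result is not reliable; the point is only to show the shape of the computation that the training and classification steps perform.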
First, fetch the code, either from the Download page or from the git repository:
git clone git://github.com/crodas/phplibtextcat.git
cd phplibtextcat
If you wish to use the development version:
git pull origin devel
To compile the module, you need the PHP development files (on Debian-based systems, the php5-dev or php4-dev package) or a PHP built from source. Then run the following commands:
$ phpize
$ ./configure --with-textcat
$ make
$ make test
$ make install
To use the extension, you first train it by feeding it sample texts. If you want to skip this step, it ships with knowledge files for some common languages, which can be found in samples/knowledge/.
textcat_train( "knowledge-output.lm", "Here goes a sample of the text", "Here another text", "And so forth" );
Accuracy depends on the quality and quantity of the samples. If the extension assigns a text to the wrong category and you notice it, use that file as a sample when you rebuild your knowledge.