PHP Textcat

This project is the PHP TextCat extension, which aims to provide a fast, language-independent, and extensible tool for categorizing texts.

View the Project Git repository

Theory of operations

The extension is based on N-Gram-Based Text Categorization. At this point, not much extensibility is possible from PHP itself, but this will change with the rewrite of the library.
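To make the idea concrete, here is a minimal sketch of N-gram-based categorization in plain PHP (7.0 or later). It is not the extension's implementation. The function names (ngram_profile, out_of_place, classify), the ASCII-only normalization, and the defaults of 3 for the longest N-gram and 300 for the profile size are all assumptions chosen for illustration. Each text is reduced to a list of its most frequent character N-grams, and a document is assigned to the category whose ranked list differs least from its own.

```php
<?php
// Illustrative sketch only: names and defaults are assumptions, not the extension's API.

// Build a ranked profile: the most frequent character n-grams come first.
function ngram_profile(string $text, int $maxN = 3, int $limit = 300): array
{
    // Simplified normalization: lowercase, ASCII letters only.
    $text = trim(preg_replace('/[^a-z]+/', ' ', strtolower($text)));
    $counts = [];
    foreach (explode(' ', $text) as $word) {
        if ($word === '') {
            continue;
        }
        $padded = '_' . $word . '_';          // mark word boundaries
        $len = strlen($padded);
        for ($n = 1; $n <= $maxN; $n++) {
            for ($i = 0; $i + $n <= $len; $i++) {
                $gram = substr($padded, $i, $n);
                if ($gram === '_') {
                    continue;                 // a lone boundary marker carries no information
                }
                $counts[$gram] = ($counts[$gram] ?? 0) + 1;
            }
        }
    }
    $grams = array_keys($counts);
    // Sort by frequency (descending); break ties alphabetically so the order is deterministic.
    usort($grams, function ($a, $b) use ($counts) {
        return $counts[$b] <=> $counts[$a] ?: strcmp($a, $b);
    });
    return array_slice($grams, 0, $limit);
}

// "Out-of-place" distance: sum of rank differences between two profiles.
// An n-gram missing from the category profile costs the maximum penalty.
function out_of_place(array $doc, array $category): int
{
    $rank = array_flip($category);            // n-gram => rank in the category profile
    $max = count($category);
    $distance = 0;
    foreach ($doc as $i => $gram) {
        $distance += isset($rank[$gram]) ? abs($i - $rank[$gram]) : $max;
    }
    return $distance;
}

// Pick the category whose profile is closest to the document's profile.
function classify(string $text, array $profiles): string
{
    $doc = ngram_profile($text);
    $best = '';
    $bestDistance = PHP_INT_MAX;
    foreach ($profiles as $label => $profile) {
        $d = out_of_place($doc, $profile);
        if ($d < $bestDistance) {
            $bestDistance = $d;
            $best = (string) $label;
        }
    }
    return $best;
}

// Usage with two tiny, made-up training texts.
$profiles = [
    'en' => ngram_profile('the quick brown fox jumps over the lazy dog'),
    'es' => ngram_profile('el veloz zorro marron salta sobre el perro perezoso'),
];
echo classify('the dog jumps over the fox', $profiles), "\n";
```

Real training data would be far larger than these one-line samples. With so little text the ranking is noisy, which is why the quality and quantity of samples matter.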

How to compile

First fetch the code, either from the Download page or from the Git repository:

git clone git://github.com/crodas/phplibtextcat.git
cd phplibtextcat

If you wish to use the development version:

git pull origin devel

To compile the module, you need the PHP development package (php5-dev or php4-dev on Debian-based systems, php-devel on Red Hat-based ones) or a PHP built from source. Then run the following commands:

$ phpize
$ ./configure --with-textcat
$ make 
$ make test
$ make install

How to use

Before using the extension, train it by feeding it sample texts. To skip this step, use the bundled knowledge files for some common languages, found in samples/knowledge/.

textcat_train(
   "knowledge-output.lm",
   "Here goes a sample of the text",
   "Here another text",
   "And so forth"
);

Accuracy depends on the quality and quantity of the samples. If the extension miscategorizes a text and you notice it, add that text as a sample the next time you rebuild your knowledge file.
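One way to make rebuilding easy is to keep the samples as files and regenerate the knowledge file from them. The sketch below assumes a hypothetical layout with one directory of .txt samples per category (samples/english/ here) and a hypothetical output name (english.lm). The collect_samples helper is not part of the extension. The only extension call used is textcat_train, with the same argument order as in the call above, and it is skipped when the extension is not loaded.

```php
<?php
// Hypothetical helper: read every non-empty .txt file in $dir as one training sample.
function collect_samples(string $dir): array
{
    $samples = [];
    // glob() may return false on error, so fall back to an empty list.
    foreach (glob(rtrim($dir, '/') . '/*.txt') ?: [] as $file) {
        $text = file_get_contents($file);
        if ($text !== false && trim($text) !== '') {
            $samples[] = $text;
        }
    }
    return $samples;
}

// Assumed layout: drop a miscategorized text into this directory, then rerun the script.
$samples = collect_samples('samples/english');

if ($samples && function_exists('textcat_train')) {
    // Output file first, then the sample texts, as in the call shown above.
    textcat_train('english.lm', ...$samples);
}
```

Keeping the samples on disk means a correction is a matter of adding one file and rerunning the script, rather than editing the training call by hand.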

 
php-textcat.txt · Last modified: 2009/04/16 10:06 by crodas
 
Except where otherwise noted, content on this wiki is licensed under the following license: Public Domain