Archive for the ‘NLP’ Category

Open Text Summarizer

Sunday, July 6th, 2008

http://libots.sourceforge.net/

The Open Text Summarizer is an open source tool for summarizing texts. The program reads a text and decides which sentences are important and which are not. It ships with Ubuntu, Fedora and other major Linux distributions. Several academic publications have benchmarked it and praised it.

Example 1:
cat articles/sacbee1.txt | ./ots --ratio 20 --html

This command will summarize the article from the Sacramento Bee (sacbee1.txt) and highlight the 20% of sentences most important to the content of the article.

Example 2:
cat articles/sacbee1.txt | ./ots --ratio 40

This command will summarize the article, giving the 40% of the article which is most important, and print it as text.
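
To make the idea of a ratio concrete, here is a minimal Python sketch of frequency-based extractive summarization. It is not the Open Text Summarizer's code or algorithm; the stopword list, the naive sentence splitter and the function names are all assumptions made for illustration.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption, not OTS's word list).
STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "it",
             "that", "this", "for", "on", "with", "as", "was", "are"}


def split_sentences(text):
    # Naive splitter: break after ., ! or ? when followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def content_words(text):
    # Lowercased words with the stopwords removed.
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS]


def summarize(text, ratio=20):
    """Return roughly the top `ratio` percent of sentences, in original order."""
    sentences = split_sentences(text)
    freq = Counter(content_words(text))

    def score(sentence):
        # A sentence scores the summed document frequency of its content words.
        return sum(freq[w] for w in content_words(sentence))

    keep = max(1, round(len(sentences) * ratio / 100))
    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]), reverse=True)
    return [sentences[i] for i in sorted(ranked[:keep])]
```

Calling summarize(text, 40) on an article would return about 40% of its sentences. Summing raw frequencies favours long sentences, so a real tool would likely normalize the score.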

Lingua::EN::Syllable & Lingua::EN::Fathom

Monday, June 2nd, 2008

http://search.cpan.org/~gregfast/Lingua-EN-Syllable-0.251

http://search.cpan.org/dist/Lingua-EN-Fathom

Lingua::EN::Fathom analyses English text in either a string or a file. Totals are then calculated for the number of characters, words, sentences, blank and non-blank (text) lines, and paragraphs. Three common readability statistics are also derived: the Fog, Flesch and Kincaid indices. All of these properties can be accessed through individual methods, or by generating a text report. A hash of all unique words and the number of times they occur is also generated.
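
For reference, the three indices can be computed from word, sentence and syllable counts using the standard published formulas. The Python sketch below is not the Lingua::EN::Fathom or Lingua::EN::Syllable API; the function names are made up, and the syllable counter is a rough vowel-group heuristic that will miscount some words.

```python
import re


def count_syllables(word):
    # Heuristic (an assumption): count runs of vowels and discount a
    # trailing silent "e", except in "-le" endings such as "table".
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    return max(1, count)


def readability(text):
    """Return the Flesch, Kincaid and Fog scores for `text` as a dict."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one word")

    syllables = [count_syllables(w) for w in words]
    words_per_sentence = len(words) / len(sentences)
    syllables_per_word = sum(syllables) / len(words)
    # Fog treats words of three or more syllables as "complex".
    complex_share = sum(1 for n in syllables if n >= 3) / len(words)

    return {
        "flesch": 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word,
        "kincaid": 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59,
        "fog": 0.4 * (words_per_sentence + 100 * complex_share),
    }
```

Scores from this sketch will not match the Perl modules exactly, because the syllable and sentence counting differ.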

HiLiter, Keywords, Snipper, SpellCheck

Wednesday, May 21st, 2008

Search-Tools: http://search.cpan.org/~karman/

Search::Tools::Keywords extracts the meaningful words from a search query. Since many search engines support a syntax that includes special characters, boolean words, stopwords, and fields, search queries can become complicated. In order to separate the wheat from the chaff, the supporting words and symbols are removed and just the actual search terms (keywords) are returned.

Search::Tools - tools for building search applications (0.16)
Search::Tools::HiLiter - extract and highlight search results in original text (0.16)
Search::Tools::Keywords - extract keywords from a search query (0.16)
Search::Tools::Object - base class for Search::Tools objects
Search::Tools::RegExp - build regular expressions from search queries (0.16)
Search::Tools::RegExp::Keyword - access regular expressions for a keyword (0.16)
Search::Tools::RegExp::Keywords - access regular expressions for keywords (0.16)
Search::Tools::Snipper - extract keywords in context (0.16)
Search::Tools::SpellCheck - offer spelling suggestions (0.16)
Search::Tools::Transliterate - transliterations of UTF-8 chars (0.16)
Search::Tools::UTF8 - UTF8 string wrangling (0.16)
Search::Tools::XML - methods for playing nice with XML and HTML
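
The keyword extraction described above for Search::Tools::Keywords can be illustrated with a small Python sketch. This is not the module's API or its parsing rules; the query syntax (field: prefixes, +/- operators, quoted phrases), the word lists and the function name are assumptions chosen for the example.

```python
import re

# Illustrative word lists (assumptions, not the module's defaults).
BOOLEAN_WORDS = {"and", "or", "not", "near"}
STOPWORDS = {"the", "a", "an", "of", "in", "for", "to"}


def extract_keywords(query):
    """Return the bare search terms from `query`, lowercased, in order."""
    # Keep quoted phrases whole; otherwise split on whitespace.
    tokens = re.findall(r'"[^"]+"|\S+', query)
    keywords = []
    for token in tokens:
        # Drop a leading field name such as title: (hypothetical syntax).
        token = re.sub(r"^\w+:", "", token)
        # Strip operator characters and quotes surrounding the term.
        token = token.strip('+-()*"~')
        if not token:
            continue
        term = token.lower()
        if term in BOOLEAN_WORDS or term in STOPWORDS:
            continue
        if term not in keywords:
            keywords.append(term)
    return keywords
```

For example, extract_keywords('title:perl AND (search OR "open source") +the tools') gives ['perl', 'search', 'open source', 'tools']. The sketch makes no attempt to handle negation, so an excluded term such as -spam would still be returned.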

Statistical natural language processing …

Wednesday, May 21st, 2008

http://www-nlp.stanford.edu/links/statnlp.html

Statistical natural language processing and corpus-based computational linguistics:

An annotated list of resources:

* Tools: Machine Translation, POS Taggers, NP chunking, Sequence models, Parsers, Semantic Parsers/SRL, NER, Language models, Concordances, Summarization, Other
* Corpora: Large collections, Particular languages, Treebanks, Discourse, WSD, Literature, Acquisition
* SGML/XML
* Dictionaries
* Lexical/morphological resources
* Courses, Syllabi, and other Educational Resources
* Mailing lists
* Other stuff on the Web: General, IR, IE/Wrappers, People, Societies

Automatically extracts keywords from text

Wednesday, May 21st, 2008

Lingua::EN::Keywords

http://search.cpan.org/~simon/Lingua-EN-Keywords-2.0/Keywords.pm

This is a very simple algorithm which removes stopwords from a summarized version of a text (generated with Lingua::EN::Summarize) and then counts up what it considers to be the most important “keywords”. The keywords subroutine returns a list of five keywords in order of relevance.
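
The counting step is easy to picture. The following Python sketch only approximates it: it skips the summarization pass that Lingua::EN::Keywords performs first, and the stopword list, the minimum word length and the function name are assumptions, not the module's own.

```python
import re
from collections import Counter

# Small illustrative stopword list (an assumption).
STOPWORDS = {"the", "and", "for", "that", "this", "with", "are", "was",
             "but", "not", "you", "from", "have", "has", "its"}


def keywords(text, limit=5):
    """Return up to `limit` of the most frequent non-stopwords in `text`."""
    words = re.findall(r"[a-z]+", text.lower())
    # Ignore stopwords and very short words, then count what is left.
    counts = Counter(w for w in words if w not in STOPWORDS and len(w) > 2)
    return [word for word, _ in counts.most_common(limit)]
```

The default limit of five mirrors the five keywords the module is described as returning.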

This is pretty dumb. Don’t expect any clever document categorization algorithms here, because you won’t find them. But it does a reasonable job.

Flesch-Kincaid grade level.

Thursday, May 15th, 2008

http://www.editcentral.com/gwt/com.editcentral.EC/EC.html

This is an interactive web page for checking a sample of writing. It is modeled after the ancient Unix utilities style and diction. Enter or copy text into the first box below. The scores to the right give the readability of the text according to various formulas.

The Flesch reading ease score is based on a range of 0-100, with lower values for harder text and higher values for easier text; the other scores show the approximate (US) school grade of the text. Grades 13-16 correspond to college level; grades higher than 16 correspond to graduate school level.
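
The grade bands in that description can be written down as a small Python helper. The function name and the "pre-college" label are made up for this sketch, and how to treat fractional grades between the bands is an assumption.

```python
def school_level(grade):
    """Map an approximate US grade-level score to a coarse label."""
    if grade > 16:
        return "graduate school"
    if grade >= 13:
        # Grades 13-16 correspond to college level.
        return "college"
    # Assumption: anything below grade 13 is lumped together here.
    return "pre-college"
```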

Calculate Gunning Fog Index

Thursday, May 15th, 2008

http://www.nlpmax.com/Tests/FogIndex.aspx