Text::Ngrams

Text::Ngrams is a Perl module for flexible n-gram analysis (of characters, words, and more).

Text::Ngrams Ranking & Summary

  • License: Perl Artistic License
  • Price: FREE
  • Publisher Name: Simon Cozens
  • Publisher web site: http://search.cpan.org/~simon/Sub-Versive-0.01/Versive.pm

Text::Ngrams Description

Text::Ngrams is a Perl module for flexible n-gram analysis (of characters, words, and more).

SYNOPSIS

For default character n-gram analysis of a string:

    use Text::Ngrams;
    my $ng3 = Text::Ngrams->new;
    $ng3->process_text('abcdefg1235678hijklmnop');
    print $ng3->to_string;
    my @ngramsarray = $ng3->get_ngrams;

One can also feed tokens manually:

    use Text::Ngrams;
    my $ng3 = Text::Ngrams->new;
    $ng3->feed_tokens('a');
    $ng3->feed_tokens('b');
    $ng3->feed_tokens('c');
    $ng3->feed_tokens('d');
    $ng3->feed_tokens('e');
    $ng3->feed_tokens('f');
    $ng3->feed_tokens('g');
    $ng3->feed_tokens('h');

We can choose n-grams of various sizes, e.g.:

    my $ng = Text::Ngrams->new( windowsize => 6 );

or different types of n-grams, e.g.:

    my $ng = Text::Ngrams->new( type => 'byte' );
    my $ng = Text::Ngrams->new( type => 'word' );
    my $ng = Text::Ngrams->new( type => 'utf8' );

To process a list of files:

    $ng->process_files('somefile.txt', 'otherfile.txt');

This module implements text n-gram analysis, supporting several types of analysis, including character and word n-grams.

The module Text::Ngrams is very flexible. For example, it allows a user to manually feed a sequence of arbitrary tokens. It handles several types of tokens (character, word), and also allows a lot of flexibility in the automatic recognition and feeding of tokens and in the way they are combined into an n-gram. It counts all n-gram frequencies up to the maximal specified length. The output format is meant to be human-readable, while also being loadable by the module.

The module can be used from the command line through the script ngrams.pl provided with the package.

Limitations:

  • If a user customizes a type, a resulting n-gram may be ambiguous, so that two different n-grams are counted as one. With predefined types of n-grams, this should not happen. For example, if a user chooses that a token can contain a space, and uses a space as the n-gram separator, then a trigram like "x x x x" is ambiguous.
  • Method process_file does not handle multi-line tokens by default. This can be fixed, but it does not seem to be worth the code complication. There are various ways around this if one really needs such tokens. One way is to preprocess them. Another way is to read as much text as necessary at a time and then use process_text, which does handle multi-line tokens.

Requirements:

  • Perl
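The ambiguity described under Limitations can be shown without the module. The following is a minimal sketch in plain Perl and does not use Text::Ngrams or reflect its internals. The token lists and the %count hash are hypothetical, chosen only for illustration. It assumes tokens may contain a space and that a space is also the n-gram separator.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical illustration (not part of Text::Ngrams): three different
# trigrams whose tokens are allowed to contain a space.
my @trigrams = (
    [ 'x x', 'x',   'x'   ],
    [ 'x',   'x x', 'x'   ],
    [ 'x',   'x',   'x x' ],
);

# Join each trigram with a space as the n-gram separator and count
# how often each resulting string occurs.
my %count;
$count{ join ' ', @$_ }++ for @trigrams;

# The three distinct trigrams collapse into the single key "x x x x".
for my $key (sort keys %count) {
    print qq{"$key" => $count{$key}\n};
}
```

Choosing a separator that cannot occur inside a token avoids the collision, and the description above states that the predefined types should not be affected.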
