boilerpipe: A Java library for boilerplate removal and fulltext extraction from HTML pages
boilerpipe Ranking & Summary
- License: Apache
- Publisher Name: Christian Kohlschütter
- Publisher web site: http://code.google.com/u/@UBhURFFSDxBAWAV8/
- Operating Systems: Mac OS X
- File Size: 2 MB
boilerpipe Description
boilerpipe is a free and open-source Java library that provides algorithms to detect and remove the surplus "clutter" (boilerplate, templates) around the main textual content of a web page.
boilerpipe already provides specific strategies for common tasks (for example, news article extraction) and can also be easily extended for individual problem settings.
Extraction is very fast (on the order of milliseconds), needs only the input document (no global or site-level information is required), and is usually quite accurate.
Detailed instructions on how to install and use boilerpipe on your Mac are available on the project's web site.
boilerpipe is cross-platform and runs on any operating system with Java support (e.g. Mac OS X, Windows, Linux).
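To make the idea of per-document boilerplate removal concrete, here is a minimal sketch in plain Java. It is not boilerpipe's implementation and does not use boilerpipe's API; it is a toy heuristic that assumes a block of text is "content" when it has enough words and a low share of link text. The class name BoilerplateSketch, the method names, and both thresholds are hypothetical and chosen only for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Toy illustration only: NOT boilerpipe's algorithm or API.
final class BoilerplateSketch {
    // Hypothetical thresholds, picked for the example rather than tuned.
    private static final int MIN_WORDS = 10;
    private static final double MAX_LINK_DENSITY = 0.33;

    // Tags treated as block boundaries when splitting the page into text blocks.
    private static final Pattern BLOCK_BREAK =
            Pattern.compile("(?i)</?(p|div|li|ul|ol|td|tr|table|h[1-6]|br)\\b[^>]*>");
    // Captures the inner markup of each anchor element.
    private static final Pattern ANCHOR = Pattern.compile("(?is)<a\\b[^>]*>(.*?)</a>");
    private static final Pattern TAG = Pattern.compile("<[^>]+>");

    // Counts whitespace-separated tokens in plain text.
    static int countWords(String text) {
        String t = text.trim();
        return t.isEmpty() ? 0 : t.split("\\s+").length;
    }

    // Replaces every remaining tag with a space and collapses whitespace.
    static String stripTags(String html) {
        return TAG.matcher(html).replaceAll(" ").replaceAll("\\s+", " ").trim();
    }

    // Keeps blocks that are long enough and not dominated by link text.
    // Only the input document is inspected; no site-level information is used.
    static String extract(String html) {
        List<String> kept = new ArrayList<>();
        for (String block : BLOCK_BREAK.split(html)) {
            String text = stripTags(block);
            int words = countWords(text);
            if (words == 0) {
                continue; // nothing but markup or whitespace between two tags
            }
            int linkWords = 0;
            Matcher m = ANCHOR.matcher(block);
            while (m.find()) {
                linkWords += countWords(stripTags(m.group(1)));
            }
            double linkDensity = (double) linkWords / words;
            if (words >= MIN_WORDS && linkDensity <= MAX_LINK_DENSITY) {
                kept.add(text);
            }
        }
        return String.join("\n", kept);
    }

    public static void main(String[] args) {
        String html = "<div><a href=\"/\">Home</a> <a href=\"/news\">News</a></div>"
                + "<p>The council voted on Tuesday to extend the tram line by four "
                + "kilometres to the northern suburbs.</p>"
                + "<div>Copyright 2010</div>";
        System.out.println(extract(html));
    }
}
```

In this sketch the navigation links and the short copyright line are dropped because they fall below the word threshold, while the article sentence is kept. A real library such as boilerpipe uses more robust parsing and more carefully designed features than this regular-expression approach.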