Posted by: Wildan Maulana | January 19, 2009

Nutch Plugins

Nutch is a Lucene subproject that works as a search engine, for a local site or intranet as well as for the internet. Its advantage over Solr (at least for now) is that Nutch has quite a large number of plugins, although Solr is said to be superior in terms of scalability. I am currently studying both engines, which are both Lucene subprojects. In this post I only want to list the plugins that Nutch already has (SVN revision 734360), as personal documentation:

  1. analysis-de
  2. analysis-fr
  3. clustering-carrot2
    Info : www.carrot2.org
  4. creativecommons
    Support for crawling and searching Creative-Commons licensed content.
  5. feed
  6. field-basic
  7. field-boost
  8. index-anchor
  9. index-basic
  10. index-more
  11. languageidentifier
  12. lib-http
  13. lib-jakarta-poi
  14. lib-lucene-analyzers
  15. lib-nekohtml
    NekoHTML is a simple HTML scanner and tag balancer. (http://sourceforge.net/projects/nekohtml)
  16. lib-parsems
    A common framework for microsoft documents parsers implementations
  17. lib-regex-filter
    A common framework for RegExp based URL filters
  18. lib-xml
    XML library – Gathers many XML related libraries:

  19. microformats-reltag
  20. nutch-extensionpoints
  21. ontology
    20041129, John Xing
    Ontology plugin is a contribution from Michael J Pan <mjpan@cs.ucla.edu>.
    Currently it is used to do one kind of query refinement as implemented
    in refine-query-init.jsp and refine-query.jsp (both are called by search.jsp).

    By default, ontology plugin is compiled, but query refinement based on it
    is ignored in search.jsp. To enable query refinement, do the following

    (1) specify url(s) of owl files to property extension.ontology.urls in
    ./conf/nutch-default.xml (or better, ./conf/nutch-site.xml).
    (2) uncomment refine-query-init.jsp and refine-query.jsp in search.jsp

    If you want to check ontology defined by different owl file, modify property
    extension.ontology.urls in ./conf/nutch-default.xml (or better,
    ./conf/nutch-site.xml), and insert the following to ./bin/nutch:

    elif [ "$COMMAND" = "ontology" ] ; then
      for f in $NUTCH_HOME/build/plugins/ontology/*.jar; do
        CLASSPATH=${CLASSPATH}:$f;
      done
      CLASS='org.apache.nutch.ontology.OntologyImpl'

    ---------------
    Possible issue:
    ---------------
    If search.jsp fails with this or similar error:

    ……
    root cause

    java.lang.NoSuchFieldError: actualValueType
    at
    com.hp.hpl.jena.datatypes.xsd.XSDDatatype.convertValidatedDataValue(XSDDatatype.java:371)
    ……

    it is because jena and tomcat are using conflicting versions of the same
    xerces library. To solve this, one needs to update tomcat’s xerces library.
    Here’s a reference
    http://jena.sourceforge.net/jena-faq.html#general-1

  22. parse-ext
  23. parse-html
  24. parse-js
  25. parse-mp3
  26. parse-msexcel
  27. parse-mspowerpoint
    Plugin to support parsing of MS PowerPoint files.
    Contributed by Stephan Strittmatter <Stephan.Strittmatter@sybit.de>.

    Note:
    ======
    For parsing MS PowerPoint files it is required to get the complete filestream.
    Please check the property <protocol>.content.limit at nutch-default.xml.

  28. parse-msword
  29. parse-oo
    OpenOffice/OpenDocument Parse Plug-in
  30. parse-pdf
  31. parse-rss
  32. parse-rtf
    From the readme :
    Prereqs: JDK 1.4+ and javacc version 3.2+

    This document describes how to create rtf-parser.jar file as used by Nutch.

    Source files are contained in:

    http://www.cobase.cs.ucla.edu/pub/javacc/rtf_parser_src.jar

    Create a new directory with the following files in:

    LICENCE
    RTFParser.jj
    RTFParserDelegate.java

    cd into this new directory and create a src directory

    $mkdir src

    copy RTFParser.jj RTFParserDelegate.java into this src directory

    $cp RTFParser.jj RTFParserDelegate.java src/

    now cd into this src directory and generate the javacc classes for the parser
    and then cd out again

    $cd src
    $javacc RTFParser.jj
    $cd ..

    now compile all the source and generated files

    $javac -d . src/*.java

    (optional) remove the generated source

    $rm -rf src # (optional)

    finally create the jar archive of all the salient files

    $jar -cvf rtf-parser.jar com/ LICENCE RTFParser*

    –Andy Hedges

    Credits:

    Thanks to Eric Friedman for writing this javacc grammar file.

  33. parse-swf
  34. parse-text
  35. parse-zip
  36. protocol-file
  37. protocol-ftp
  38. protocol-http
  39. protocol-httpclient
  40. query-basic
    Basic Query Filter
  41. query-custom
    Custom Query Filter
  42. query-more
    More Query Filter
  43. query-site
    Site Query Filter
  44. query-url
    URL Query Filter
  45. response-json
    JSON Response Writer Plug-in
  46. response-xml
    XML Response Writer Plug-in
  47. scoring-link
    Link Analysis Scoring Plug-in
  48. scoring-opic
    OPIC Scoring Plug-in
  49. subcollection
    Subcollection indexing and query filter
    Readme :
    For brief description about this plugin see
    src/java/org/apache/nutch/collection/package.html

    Basically:
    You need to enable this during indexing and during searching

    After indexing you can limit your searches to certain
    subcollection with keyword subcollection, eg.

    "subcollection:nutch hadoop"

  50. summary-basic
    Basic Summarizer Plug-in
  51. summary-lucene
    Lucene Highlighter Summary Plug-in
  52. tld
    Top Level Domain Plugin
  53. urlfilter-automaton
    Automaton URL Filter
  54. urlfilter-domain
    Domain URL Filter
  55. urlfilter-prefix
    Prefix URL Filter
  56. urlfilter-regex
    Regex URL Filter
  57. urlfilter-suffix
    Suffix URL Filter
  58. urlfilter-validator
    URL Validator
  59. urlnormalizer-basic
    Basic URL Normalizer
  60. urlnormalizer-pass
    Pass-through URL Normalizer
  61. urlnormalizer-regex
    Regex URL Normalizer

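The parse-rtf readme (item 32) lists its build steps one command at a time. They could be collected into a single function roughly as follows. This is a sketch under assumptions, not something from the readme: build_rtf_parser is a hypothetical name, javacc, javac and jar are assumed to be on the PATH, and the directory passed in is assumed to already contain the three files the readme names. The optional "rm -rf src" step is left out.

```shell
#!/bin/sh
# Hypothetical helper: run the readme's build steps in one go.
# Usage: build_rtf_parser <directory containing LICENCE, RTFParser.jj,
#        RTFParserDelegate.java>
build_rtf_parser() {
  dir="$1"
  # Refuse to start unless the three files named in the readme are present.
  for f in LICENCE RTFParser.jj RTFParserDelegate.java; do
    if [ ! -f "$dir/$f" ]; then
      echo "missing $dir/$f" >&2
      return 1
    fi
  done
  (
    # Each step runs only if the previous one succeeded.
    cd "$dir" &&
    mkdir -p src &&
    cp RTFParser.jj RTFParserDelegate.java src/ &&
    (cd src && javacc RTFParser.jj) &&
    javac -d . src/*.java &&
    jar -cvf rtf-parser.jar com/ LICENCE RTFParser*
  )
}
```

Calling `build_rtf_parser /path/to/new/directory` should leave rtf-parser.jar in that directory if every step succeeds. The subshell keeps the cd from changing the caller's working directory.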