Reference: http://www.rondhuit.com/scriptupdateprocessor.html
The referenced article covers JavaScript. Here I describe how to register fields with Python using Jython. My environment is CentOS 7 with Solr 6 and ManifoldCF 2.4.
** 1. Add an updateRequestProcessorChain to solrconfig.xml **
solrconfig.xml
...
<updateRequestProcessorChain name="script">
<processor class="solr.StatelessScriptUpdateProcessorFactory">
<str name="script">update-script.py</str>
</processor>
<processor class="solr.LogUpdateProcessorFactory" />
<processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
...
From Solr 5 onward, solrconfig.xml appears to be located at:
var/solr/data/<corename>/conf/solrconfig.xml
** 2. Specify the UpdateChain defined above in requestHandler **
solrconfig.xml
...
<requestHandler name="/update" class="solr.UpdateRequestHandler">
<lst name="defaults">
<str name="update.chain">script</str>
</lst>
</requestHandler>
...
If you crawl PDF or Excel files with ManifoldCF, the ExtractingRequestHandler is used; if you crawl an RDB, the DataImportHandler is used. In either case, add the same update.chain setting shown above to that handler.
** 3. Place the Python script in update-script.py **
Placement:
var/solr/data/<corename>/conf/update-script.py
To register a field from Python, write:
doc.setField("field_name","field_value")
The field_name set by the script must be defined in the schema file (managed-schema in Solr 6) in advance. If a field with that name is defined there, Solr maps the value to it.
Values of fields that are already in the document can be read with:
doc.getFieldValue("field_name")
The script is re-evaluated each time it runs, so you do not need to restart Solr after editing it.
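The calls above need a surrounding script, so here is a minimal sketch of what update-script.py might look like. It assumes the layout of the sample update script that ships with Solr, where Solr calls functions such as processAdd by name and the document is reached through cmd.solrDoc. The field names title and title_copy are hypothetical and would both have to exist in the schema.

```python
# update-script.py: minimal sketch, not a tested production script.
# Solr is assumed to call the functions below by name for each update command.

def processAdd(cmd):
    # cmd.solrDoc is the document that is about to be indexed.
    doc = cmd.solrDoc
    # "title" and "title_copy" are hypothetical field names.
    title = doc.getFieldValue("title")
    if title is not None:
        doc.setField("title_copy", title)

# The remaining hooks do nothing here; they are defined so that a lookup
# by name finds them.
def processDelete(cmd):
    pass

def processMergeIndexes(cmd):
    pass

def processCommit(cmd):
    pass

def processRollback(cmd):
    pass

def finish():
    pass
```

Keeping the field logic inside processAdd means the other update commands pass through the chain unchanged.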
** 4. Install Jython **
I use Jython to run Python on the Java VM. Download it from the link below.
http://www.jython.org/downloads.html
You appear to need the Standalone version rather than the Installer version. Place the jar on the Solr side like this:
var/solr/data/<corename>/lib/jython-standalone-2.X.X.jar
** 5. When linking with a DB **
I needed to fetch data from a PostgreSQL database, so a JDBC driver is required to access the DB from Python.
Get the JDBC driver that matches your JDK and PostgreSQL versions from https://jdbc.postgresql.org/download.html
Place it at:
var/solr/data/<corename>/lib/postgresql-9.X-XXXX.jar
Example of the import statement and connection:
update-script.py
from com.ziclix.python.sql import zxJDBC
DB_URL = "jdbc:postgresql://yourpostgreshost:port/dbname"
DB_USER = "postgres"
DB_PASS = "password"
DB_DRIVER = "org.postgresql.Driver"
connection = zxJDBC.connect(DB_URL, DB_USER, DB_PASS, DB_DRIVER)
Write it in the script as shown above.
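Once a connection exists, a lookup might look like the sketch below. The table doc_labels and its columns doc_id and label are hypothetical, as is the function name fetch_label. The query uses the "?" placeholder style of zxJDBC and only standard DB-API cursor calls, so the same function can also be tried locally against another DB-API driver that accepts "?" placeholders.

```python
def fetch_label(connection, doc_id):
    """Return the label stored for doc_id, or None when no row matches."""
    cursor = connection.cursor()
    try:
        # Pass the value as a parameter instead of building the SQL string.
        cursor.execute(
            "SELECT label FROM doc_labels WHERE doc_id = ?", [doc_id]
        )
        row = cursor.fetchone()
    finally:
        # Always release the cursor, even if the query fails.
        cursor.close()
    if row is None:
        return None
    return row[0]
```

Inside the script, the result could then be stored with doc.setField("label", fetch_label(connection, doc.getFieldValue("id"))), assuming the documents carry an id field and a label field is defined in the schema.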
** 6. When referencing an external library **
If you want to import an external library, one way is to unzip the Jython standalone jar, place the external library in an appropriate location inside it (usually /Lib), and then repackage the jar. There may be a better way.
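One possible alternative, which I have not verified inside Solr's script engine, is to leave the jar alone and add a directory of pure-Python modules to sys.path at the top of the script. The directory name below is hypothetical, and this can only work for libraries written purely in Python, since C extension modules do not run on Jython.

```python
import sys

# Hypothetical directory that holds pure-Python modules.
EXTRA_LIB_DIR = "/var/solr/data/extra-python-libs"

def add_library_dir(path):
    """Append path to sys.path unless it is already there."""
    if path not in sys.path:
        sys.path.append(path)

add_library_dir(EXTRA_LIB_DIR)
# Modules placed under EXTRA_LIB_DIR could then be imported as usual.
```

The membership check matters because the script is evaluated repeatedly, and without it the same entry could pile up in sys.path.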
That's it. Run the job in ManifoldCF to check the result. If your Python script does not work properly, it is a bit tedious, but you may need to proceed by trial and error while reading the error logs.
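To shorten the trial-and-error loop, it may help to send a single test document straight to Solr instead of running a whole crawl. The sketch below runs with ordinary Python 3 on any client machine, not inside Solr. The host, port, core name mycore and document id are assumptions. It only builds the request; passing it to urllib.request.urlopen would send it.

```python
import json
from urllib.request import Request

def build_update_request(base_url, core, docs, chain="script"):
    """Build a POST to the /update handler that names the update chain."""
    url = "%s/%s/update?update.chain=%s&commit=true" % (
        base_url.rstrip("/"), core, chain
    )
    body = json.dumps(docs).encode("utf-8")
    return Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# Hypothetical host, core name and document.
req = build_update_request(
    "http://localhost:8983/solr", "mycore", [{"id": "test-1"}]
)
```

Naming update.chain in the URL is redundant when the handler already defaults to it, but it keeps the test independent of that setting.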