Commit 53fe151e authored by Christoph Glaubitz

initial

Example for indexing calibre metadata
=====================================
**WARNING: this is only a stub**

I started to import some ebooks into calibre_ and served them via
calibre-server on my Raspberry Pi. Searching the built-in calibre index felt
really slow: I sometimes waited up to a few minutes for a result. Accessing
the mobi files is slow too, since calibre-server does not seem to serve those
files as static assets.

I thought there must be a faster solution. Because I had already configured
solr_, I decided to set it up for this and feed it with calibre's metadata.

I took the default solr configuration, created my own core named
``ebooks``, and added it to ``/etc/solr/solr.xml``::

    <core name="ebooks" instanceDir="/media/data/ebooks/solr" />

In this core directory, I again started from an example configuration,
``/media/data/ebooks/solr/conf/solrconfig.xml``, and only replaced the data
and updateLog directories::

    <dataDir>/media/data/ebooks/solr/data</dataDir>
    [...]
    <updateLog>
      <str name="dir">/media/data/ebooks/solr/data_updatelog</str>
    </updateLog>

After that, I just had to create ``/media/data/ebooks/solr/conf/schema.xml``.
Again, I started from one of the solr example files and defined some fields,
such as author::

    <field name="author" type="string" indexed="true"
           stored="true" multiValued="false" />
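
Because the schema defines no dynamic or catch-all fields, every key of a
document sent to solr has to match one of the defined fields. A minimal
sketch of checking that before indexing could look like the following. The
``SCHEMA`` string is a shortened stand-in for the real ``schema.xml``, and
``schema_fields`` and ``check_document`` are hypothetical helper names, not
part of solr or of the indexing script.

```python
import xml.etree.ElementTree as etree

# Assumption: a shortened stand-in for schema.xml with only three fields.
SCHEMA = """
<schema name="example core one" version="1.1">
  <field name="id" type="long" indexed="true" stored="true" required="true"/>
  <field name="author" type="string" indexed="true" stored="true"/>
  <field name="title" type="string" indexed="true" stored="true"/>
</schema>
"""


def schema_fields(schema_xml):
    """Return {field name: is required} for every <field> in the schema."""
    root = etree.fromstring(schema_xml)
    return {f.get('name'): f.get('required') == 'true'
            for f in root.findall('field')}


def check_document(doc, fields):
    """Return a list of problems: unknown keys and missing required fields."""
    problems = ['unknown field: %s' % key
                for key in sorted(doc) if key not in fields]
    problems += ['missing required field: %s' % name
                 for name, required in sorted(fields.items())
                 if required and name not in doc]
    return problems


fields = schema_fields(SCHEMA)
print(check_document({'title': 'Example', 'isbn': '123'}, fields))
```

An empty list means the document only uses fields the schema knows about and
carries every required one.
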

In fact, I only used solr example config files with three changed lines, and
changed no performance or memory parameters.

And it turns out that the index is quite fast, even though solr is also
running on just the Raspberry Pi.

As mentioned already, this is far from ready for real-life use, but it may
give you an idea of how to use solr in combination with a little Python
example (see python-workshop_).
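
To make the combination of calibre metadata and solr a bit more concrete,
here is a minimal sketch of how one ``metadata.opf`` file could be turned
into the JSON body of a solr update request. The ``SAMPLE_OPF`` content is
invented and heavily shortened (real calibre files carry many more
elements), and ``opf_to_doc`` is a hypothetical helper name. Only the field
names and the ``calibre_id`` lookup follow the schema and the indexing
script in this repository.

```python
import json
import xml.etree.ElementTree as etree

NAMESPACES = {'dc': 'http://purl.org/dc/elements/1.1/',
              'opf': 'http://www.idpf.org/2007/opf'}

# Assumption: invented sample data, not taken from a real calibre library.
SAMPLE_OPF = """
<package xmlns="http://www.idpf.org/2007/opf" version="2.0">
  <metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
    <dc:identifier id="calibre_id">42</dc:identifier>
    <dc:title>Example Title</dc:title>
    <dc:creator>Example Author</dc:creator>
    <dc:language>eng</dc:language>
  </metadata>
</package>
"""


def opf_to_doc(opf_xml):
    """Map the Dublin Core elements of an OPF document to schema fields."""
    root = etree.fromstring(opf_xml)

    def text_of(name):
        match = root.find('./opf:metadata/dc:%s' % name,
                          namespaces=NAMESPACES)
        # Missing or empty elements become an empty string.
        return match.text if match is not None and match.text else ''

    id_ = text_of('identifier[@id="calibre_id"]')
    if not id_:
        return None  # without a calibre id there is no unique key
    return {'id': int(id_),
            'author': text_of('creator'),
            'language': text_of('language'),
            'title': text_of('title'),
            'description': text_of('description')}


doc = opf_to_doc(SAMPLE_OPF)
payload = json.dumps({'add': {'doc': doc}})
print(payload)
```

The printed string is what would be sent as the request body to the core's
update handler with a ``Content-type: application/json`` header.
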

Next Steps...
`````````````

... that will very likely never happen ;)

* Hacking a little web UI for querying solr
* Must be mobile compatible to look nice on Kindle
* Serve mobi files as static assets

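
A web UI for querying solr would mostly have to build select URLs. A rough
sketch of that step, using only the standard library, might look like this.
``SOLR_SELECT_URL`` is an assumption that mirrors the host, port and core
path of the update URL in the indexing script, and ``build_select_url`` is a
hypothetical helper name.

```python
from urllib.parse import urlencode

# Assumption: the select handler lives next to the update handler used by
# the indexing script; adjust host, port and path to your setup.
SOLR_SELECT_URL = 'http://localhost:8983/ebooks/select'


def build_select_url(field, value, rows=10):
    """Return a select URL that searches one schema field for a value."""
    # Quote the value so that spaces and colons are not read as query syntax.
    escaped = value.replace('\\', '\\\\').replace('"', '\\"')
    params = {'q': '%s:"%s"' % (field, escaped),
              'wt': 'json',
              'rows': rows}
    return '%s?%s' % (SOLR_SELECT_URL, urlencode(params))


url = build_select_url('author', 'Jules Verne')
print(url)
```

Since all text fields in the schema are of type ``string``
(``solr.StrField``), values are indexed verbatim, so a quoted query like
this only matches the complete field value.
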
.. _calibre: http://calibre-ebook.com/
.. _solr: http://lucene.apache.org/solr/
.. _python-workshop: https://github.com/chrigl/python-workshop
<?xml version="1.0" ?>
<!--
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements. See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to You under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
<schema name="example core one" version="1.1">

  <fieldtype name="string" class="solr.StrField" sortMissingLast="true" omitNorms="true"/>
  <fieldType name="long" class="solr.TrieLongField" precisionStep="0" positionIncrementGap="0"/>

  <!-- general -->
  <field name="id" type="long" indexed="true" stored="true" multiValued="false" required="true"/>
  <field name="author" type="string" indexed="true" stored="true" multiValued="false" />
  <field name="language" type="string" indexed="true" stored="true" multiValued="false" />
  <field name="title" type="string" indexed="true" stored="true" multiValued="false" />
  <field name="description" type="string" indexed="true" stored="true" multiValued="false" />
  <field name="path" type="string" indexed="false" stored="true" multiValued="false" />
  <field name="mobi" type="string" indexed="false" stored="true" multiValued="false" />
  <field name="cover" type="string" indexed="false" stored="true" multiValued="false" />

  <!-- NEVER remove this -->
  <field name="_version_" type="long" indexed="true" stored="true" />

  <!-- field to use to determine and enforce document uniqueness. -->
  <uniqueKey>id</uniqueKey>

  <!-- field for the QueryParser to use when an explicit fieldname is absent -->
  <defaultSearchField>title</defaultSearchField>

  <!-- SolrQueryParser configuration: defaultOperator="AND|OR" -->
  <solrQueryParser defaultOperator="OR"/>
</schema>
<?xml version="1.0" encoding="UTF-8" ?>
<!--
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements. See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to You under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
<!--
This is a stripped down config file used for a simple example...
It is *not* a good example to work from.
-->
<config>
  <luceneMatchVersion>4.8</luceneMatchVersion>

  <!-- The DirectoryFactory to use for indexes.
       solr.StandardDirectoryFactory, the default, is filesystem based.
       solr.RAMDirectoryFactory is memory based, not persistent, and doesn't work with replication. -->
  <directoryFactory name="DirectoryFactory" class="${solr.directoryFactory:solr.StandardDirectoryFactory}"/>

  <dataDir>/media/data/ebooks/solr/data</dataDir>

  <!-- To enable dynamic schema REST APIs, use the following for <schemaFactory>:

       <schemaFactory class="ManagedIndexSchemaFactory">
         <bool name="mutable">true</bool>
         <str name="managedSchemaResourceName">managed-schema</str>
       </schemaFactory>

       When ManagedIndexSchemaFactory is specified, Solr will load the schema from
       the resource named in 'managedSchemaResourceName', rather than from schema.xml.
       Note that the managed schema resource CANNOT be named schema.xml. If the managed
       schema does not exist, Solr will create it after reading schema.xml, then rename
       'schema.xml' to 'schema.xml.bak'.

       Do NOT hand edit the managed schema - external modifications will be ignored and
       overwritten as a result of schema modification REST API calls.

       When ManagedIndexSchemaFactory is specified with mutable = true, schema
       modification REST API calls will be allowed; otherwise, error responses will be
       sent back for these requests.
  -->
  <schemaFactory class="ClassicIndexSchemaFactory"/>

  <updateHandler class="solr.DirectUpdateHandler2">
    <updateLog>
      <str name="dir">/media/data/ebooks/solr/data_updatelog</str>
    </updateLog>
  </updateHandler>

  <!-- realtime get handler, guaranteed to return the latest stored fields
       of any document, without the need to commit or open a new searcher. The current
       implementation relies on the updateLog feature being enabled. -->
  <requestHandler name="/get" class="solr.RealTimeGetHandler">
    <lst name="defaults">
      <str name="omitHeader">true</str>
    </lst>
  </requestHandler>

  <requestHandler name="/replication" class="solr.ReplicationHandler" startup="lazy" />

  <requestDispatcher handleSelect="true" >
    <requestParsers enableRemoteStreaming="false" multipartUploadLimitInKB="2048" formdataUploadLimitInKB="2048" />
  </requestDispatcher>

  <requestHandler name="standard" class="solr.StandardRequestHandler" default="true" />
  <requestHandler name="/analysis/field" startup="lazy" class="solr.FieldAnalysisRequestHandler" />
  <requestHandler name="/update" class="solr.UpdateRequestHandler" />
  <requestHandler name="/admin/" class="org.apache.solr.handler.admin.AdminHandlers" />

  <requestHandler name="/admin/ping" class="solr.PingRequestHandler">
    <lst name="invariants">
      <str name="q">solrpingquery</str>
    </lst>
    <lst name="defaults">
      <str name="echoParams">all</str>
    </lst>
  </requestHandler>

  <!-- config for the admin interface -->
  <admin>
    <defaultQuery>solr</defaultQuery>
  </admin>
</config>
#!/usr/bin/env python

import xml.etree.ElementTree as etree
import requests
import json
import os

BASEDIR = '/media/data/ebooks'

namespaces = {'dc': 'http://purl.org/dc/elements/1.1/',
              'opf': 'http://www.idpf.org/2007/opf'}


def get_mobi_file(files):
    """returns mobi file, if there is one in the file list"""
    for f in files:
        if f.endswith('.mobi'):
            return f
    return ''


def get_metadata(basedir):
    """walks basedir and yields relevant metadata, path, cover and mobi"""
    for root, dirs, files in os.walk(basedir):
        if 'metadata.opf' in files:
            path = '/'.join(root.split('/')[-2:])
            mobi = get_mobi_file(files)
            cover = ''
            if 'cover.jpg' in files:
                cover = 'cover.jpg'
            yield ('%s/metadata.opf' % root, path, cover, mobi)


def parse_metadata(metadata):
    """parse metadata and returns fields"""
    x = etree.parse(metadata)
    root = x.getroot()

    def get_field(matcher):
        match = root.find('./opf:metadata/dc:%s' % matcher, namespaces=namespaces)
        if match is None:
            return ''
        return match.text

    id_ = get_field('identifier[@id="calibre_id"]')
    if not id_:
        return None
    author = get_field('creator')
    language = get_field('language')
    title = get_field('title')
    description = get_field('description')

    return {'id': int(id_),
            'author': author,
            'language': language,
            'title': title,
            'description': description,
            }


def update_entry(ebook_data):
    """updates the solr entry"""
    solr_data = {'add': {
        'doc': ebook_data,
        },
    }
    headers = {'Content-type': 'application/json', 'Accept': 'application/json'}
    requests.post('http://localhost:8983/ebooks/update',
                  data=json.dumps(solr_data), headers=headers)


def reload_core():
    """tells solr to reload the core"""
    payload = {'wt': 'json', 'action': 'RELOAD', 'core': 'ebooks'}
    requests.get('http://leierkasten.local:8983/admin/cores', params=payload)


if __name__ == '__main__':
    for idx, metadata in enumerate(get_metadata(BASEDIR)):
        metadata_file, path, cover, mobi = metadata
        ebook_data = parse_metadata(metadata_file)
        if not ebook_data:
            print("Unable to find metadata in %s." % metadata_file)
            continue
        ebook_data.update({'path': path,
                           'cover': cover,
                           'mobi': mobi,
                           })
        update_entry(ebook_data)
        if idx > 0 and idx % 100 == 0:
            print("Added %s entries" % idx)

    print('Finally reloading solr')
    reload_core()
    print('Done')

# vim: set tabstop=4 shiftwidth=4 expandtab:
<?xml version="1.0" encoding="UTF-8" ?>
<!--
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements. See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to You under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
<!--
All (relative) paths are relative to the installation path
persistent: Save changes made via the API to this file
sharedLib: path to a lib directory that will be shared across all cores
-->
<solr persistent="false">
  <!--
  adminPath: RequestHandler path to manage cores.
    If 'null' (or absent), cores will not be manageable via request handler
  -->
  <cores adminPath="/admin/cores" host="${host:}" hostPort="${jetty.port:8983}" hostContext="${hostContext:solr}">
    <core name="ebooks" instanceDir="/media/data/ebooks/solr" />

    <shardHandlerFactory name="shardHandlerFactory" class="HttpShardHandlerFactory">
      <str name="urlScheme">${urlScheme:}</str>
    </shardHandlerFactory>
  </cores>
</solr>