Commit ab2f659d authored by Citronalco's avatar Citronalco

initial commit

parent f7f540d4
# Calibre solr Web Search
Calibre and Calibre's web server are quite slow for bigger libraries: searching a library with a few ten thousand books takes several minutes.
But if you index Calibre's metadata in an Apache solr server and use a web site for searching, you get your results within a second.
This is an indexing tool that parses Calibre's metadata.opf files and stores the results in solr, together with the schema file required for solr's index and a neat website for searching the index and downloading books.
![screenshot 1](screenshot1.png?raw=true)
![screenshot 2](screenshot2.png?raw=true)
### Installation (on Debian 9):
#### What's where?
- In the "indextool" directory is the indexer, a python script that looks through your Calibre library, parses all metadata.opf files it can find and stuffs the results into the solr server, which in turn builds the index and answers queries.
- The "website" directory contains a web site for searching (querying) the solr server and for downloading books.
- The "solr" directory contains a schema.xml file which tells solr how to store what in the index and a solrconfig.xml file, which tells solr where to save the index and how to talk to clients.
#### How to install and configure solr
I assume you already have a webserver up and running. I tested this on Debian 9; I'm pretty sure it works nearly the same way on other operating systems.
Debian 9 comes with solr 3.6.2; if you're using another version, some parts of schema.xml and/or solrconfig.xml might need to be changed.
1. Put the content of the "website" directory on your web server (e.g. /var/www/html/ebooksearch/)
1. Configure your webserver to make your Calibre library available in a subdirectory of the website (e.g. in /ebooksearch/calibrelibrary/)
1. Install solr: `apt-get install solr-tomcat`
1. Create a directory `/var/lib/solr/ebooksearch/conf/` and put the content of the "solr" directory into it.
1. `chown -R tomcat8:tomcat8 /var/lib/solr/ebooksearch`
1. Edit /etc/solr/solr.xml and add `<core name="ebooksearch" instanceDir="/var/lib/solr/ebooksearch" />` between `<cores>` and `</cores>`
1. Now map solr/Tomcat's search interface to /ebooksearch/solr, e.g. by using Apache's mod_proxy module:

        <Location /ebooksearch/solr/select>
            ProxyPass http://localhost:8080/solr/ebooksearch/select
            ProxyPassReverse http://localhost:8080/solr/ebooksearch/select
        </Location>
1. Start tomcat8: `systemctl start tomcat8`
1. Edit the website's index.html and set `calibre_url_prefix` to the webserver subdirectory from above (e.g. /ebooksearch/calibrelibrary/) and `solr_url_prefix` to the mapped solr location (e.g. /ebooksearch/solr/select/)
1. Edit the indexer from the "indextool" directory and set `BASEDIR` to point to your calibre library directory on the file system.
1. Start the indexer and wait until it finishes.
1. You might need to reload the solr ebooksearch core to be able to search in the newly created index: either restart tomcat8 or execute `curl "http://localhost:8080/solr/admin/cores?action=RELOAD&core=ebooksearch"`
1. Done. Open the website in a browser and you should be able to search.
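To sanity-check the mapped solr location, it helps to look at the kind of URL the search page will request. A minimal sketch in python (the prefix comes from the steps above; the search term, `wt` and `rows` parameters are made-up examples, not taken from the website's actual code):

```python
from urllib.parse import urlencode

def build_solr_query_url(prefix, term, rows=10):
    """Build a select URL like the one behind solr_url_prefix."""
    # wt=json asks solr for a JSON response; rows limits the result count.
    params = {'q': term, 'wt': 'json', 'rows': rows}
    return '%s?%s' % (prefix.rstrip('/'), urlencode(params))

# Example: the mapped location from the installation steps above.
print(build_solr_query_url('/ebooksearch/solr/select/', 'tolkien'))
```

Requesting that URL in a browser (relative to your webserver) should return raw JSON from solr if the proxy mapping works.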
The schema.xml is loosely based on the schema of OPUS4, and the web interface was inspired by calibre-web.
The idea itself is from the "Example for indexing calibre metadata" write-up included below.
Example for indexing calibre metadata
=====================================
**WARNING: this is only a stub**
I started to import some ebooks into calibre_ and served them via
calibre-server on my Raspberry Pi. Searching the built-in calibre index felt
really slow: waiting up to a few minutes for a result. Also,
accessing the mobi files is slow, since calibre-server does not seem to serve
those files as static assets.
Directory structure of my calibre library::

    Some Author
    |-- Some Title
    |   |-- cover.jpg
    |   |-- Some
    |   |-- metadata.opf
    |-- Some other Title
    |   |-- cover.jpg
    |   |-- metadata.opf
    |   |-- Some other
``metadata.opf`` is the XML file containing the metadata.
However, I thought there must be a faster solution. And because I had already
configured solr_, I decided to set this up and feed it with calibre's metadata.
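Reading fields out of such a ``metadata.opf`` can be sketched like this (a hedged sketch: the sample document below is made up and heavily trimmed; the namespace URIs are the standard Dublin Core and OPF ones that calibre's real files declare):

```python
import xml.etree.ElementTree as etree

# Standard Dublin Core / OPF namespace URIs used by metadata.opf files.
NS = {'dc': 'http://purl.org/dc/elements/1.1/',
      'opf': 'http://www.idpf.org/2007/opf'}

# A made-up, heavily trimmed metadata.opf for illustration only.
SAMPLE = '''<package xmlns:dc="http://purl.org/dc/elements/1.1/"
                     xmlns:opf="http://www.idpf.org/2007/opf"
                     xmlns="http://www.idpf.org/2007/opf">
  <metadata>
    <dc:title>Some Title</dc:title>
    <dc:creator>Some Author</dc:creator>
  </metadata>
</package>'''

root = etree.fromstring(SAMPLE)
# XPath-style lookup with explicit namespace prefixes, as the indexer does.
title = root.find('./opf:metadata/dc:title', NS).text
author = root.find('./opf:metadata/dc:creator', NS).text
print(title, '/', author)
```

In the real files there are many more fields (identifiers, subjects, calibre's ``<meta>`` entries), but the lookup pattern is the same.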
I took the default solr configuration and just created my own core named
``ebooks``, and added it to ``/etc/solr/solr.xml``::
    <core name="ebooks" instanceDir="/media/data/ebooks/solr" />
In this core directory, I again used an example configuration,
``/media/data/ebooks/solr/conf/solrconfig.xml``, and only replaced the data and
updateLog directory::

    <str name="dir">/media/data/ebooks/solr/data_updatelog</str>
After that, I just had to create a ``/media/data/ebooks/solr/conf/schema.xml``.
Again, I only used one of the solr example files and defined some fields like::

    <field name="author" type="string" indexed="true"
           stored="true" multiValued="false" />
In fact, I only used solr example config files with three changed lines. No
changes on any performance or memory parameters.
And it turns out... the index is quite fast, even though solr is also running
on the raspberry pi.
Like I mentioned already, this is far from real-life use, but maybe it
will give you an idea how to use solr, combining this with a little python
example: walking the file system tree, parsing xml files, searching the xml
result set via xpath, and talking to an http api using requests_
(see python-workshop_).
Little webui
------------
I just added a little webui based on jquery_. It only queries solr for json
results and shows author, title, cover and description.
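A standard solr JSON response wraps the result documents in ``response.docs``; picking out the fields the webui shows can be sketched like this (the sample response is made up, the field names follow the schema above):

```python
import json

# A made-up solr JSON response, trimmed to the standard envelope.
RESPONSE = json.loads('''{
  "response": {
    "numFound": 1,
    "docs": [
      {"author": "Some Author",
       "title": "Some Title",
       "cover": "cover.jpg",
       "description": "A made-up description."}
    ]
  }
}''')

# The webui iterates the docs and renders these fields per book.
for doc in RESPONSE['response']['docs']:
    print(doc['author'], '-', doc['title'])
```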
As a requirement, your library must be linked as ``books`` inside your
DOCUMENTROOT, which should look like this::

    |-- books -> /YOUR/EBOOK/LIBRARY
    |-- css
    |   |-- master.css
    |-- img
    |   |-- next_book.png
    |   |-- next.png
    |   |-- prev_book.png
    |   |-- prev.png
    |   |-- top.png
    |-- index.html
    |-- js
    |   |-- jquery-2.1.1.min.js
You have to configure your webserver to pass requests to /ebooks directly to
solr. This is how to do it with nginx_::

    location /ebooks {
        proxy_pass http://localhost:8983/ebooks;
    }
This is what it looks like.
.. image:: screen.png
Unfortunately, the buttons with internal references do not work on the
Kindle :/
.. _calibre:
.. _solr:
.. _requests:
.. _python-workshop:
.. _jquery:
.. _nginx:
<?xml version="1.0" ?>
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements. See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to You under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
See the License for the specific language governing permissions and
limitations under the License.
<schema name="example core one" version="1.1">
<fieldtype name="string" class="solr.StrField" sortMissingLast="true" omitNorms="true"/>
<fieldType name="long" class="solr.TrieLongField" precisionStep="0" positionIncrementGap="0"/>
<!-- general -->
<field name="id" type="long" indexed="true" stored="true" multiValued="false" required="true"/>
<field name="author" type="string" indexed="true" stored="true" multiValued="false" />
<field name="language" type="string" indexed="true" stored="true" multiValued="false" />
<field name="title" type="string" indexed="true" stored="true" multiValued="false" />
<field name="description" type="string" indexed="true" stored="true" multiValued="false" />
<field name="path" type="string" indexed="false" stored="true" multiValued="false" />
<field name="mobi" type="string" indexed="false" stored="true" multiValued="false" />
<field name="cover" type="string" indexed="false" stored="true" multiValued="false" />
<!-- NEVER remove this -->
<field name="_version_" type="long" indexed="true" stored="true" />
<!-- field to use to determine and enforce document uniqueness. -->
<!-- field for the QueryParser to use when an explicit fieldname is absent -->
<!-- SolrQueryParser configuration: defaultOperator="AND|OR" -->
<solrQueryParser defaultOperator="OR"/>
<?xml version="1.0" encoding="UTF-8" ?>
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements. See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to You under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
See the License for the specific language governing permissions and
limitations under the License.
This is a stripped down config file used for a simple example...
It is *not* a good example to work from.
<!-- The DirectoryFactory to use for indexes.
solr.StandardDirectoryFactory, the default, is filesystem based.
solr.RAMDirectoryFactory is memory based, not persistent, and doesn't work with replication. -->
<directoryFactory name="DirectoryFactory" class="${solr.directoryFactory:solr.StandardDirectoryFactory}"/>
<!-- To enable dynamic schema REST APIs, use the following for <schemaFactory>:
<schemaFactory class="ManagedIndexSchemaFactory">
<bool name="mutable">true</bool>
<str name="managedSchemaResourceName">managed-schema</str>
When ManagedIndexSchemaFactory is specified, Solr will load the schema from
the resource named in 'managedSchemaResourceName', rather than from schema.xml.
Note that the managed schema resource CANNOT be named schema.xml. If the managed
schema does not exist, Solr will create it after reading schema.xml, then rename
'schema.xml' to 'schema.xml.bak'.
Do NOT hand edit the managed schema - external modifications will be ignored and
overwritten as a result of schema modification REST API calls.
When ManagedIndexSchemaFactory is specified with mutable = true, schema
modification REST API calls will be allowed; otherwise, error responses will be
sent back for these requests.
<schemaFactory class="ClassicIndexSchemaFactory"/>
<updateHandler class="solr.DirectUpdateHandler2">
<str name="dir">/media/data/ebooks/solr/data_updatelog</str>
<!-- realtime get handler, guaranteed to return the latest stored fields
of any document, without the need to commit or open a new searcher. The current
implementation relies on the updateLog feature being enabled. -->
<requestHandler name="/get" class="solr.RealTimeGetHandler">
<lst name="defaults">
<str name="omitHeader">true</str>
<requestHandler name="/replication" class="solr.ReplicationHandler" startup="lazy" />
<requestDispatcher handleSelect="true" >
<requestParsers enableRemoteStreaming="false" multipartUploadLimitInKB="2048" formdataUploadLimitInKB="2048" />
<requestHandler name="standard" class="solr.StandardRequestHandler" default="true" />
<requestHandler name="/analysis/field" startup="lazy" class="solr.FieldAnalysisRequestHandler" />
<requestHandler name="/update" class="solr.UpdateRequestHandler" />
<requestHandler name="/admin/" class="org.apache.solr.handler.admin.AdminHandlers" />
<requestHandler name="/admin/ping" class="solr.PingRequestHandler">
<lst name="invariants">
<str name="q">solrpingquery</str>
<lst name="defaults">
<str name="echoParams">all</str>
<!-- config for the admin interface -->
#!/usr/bin/env python
# -*- coding: utf-8 -*-

import xml.etree.ElementTree as etree
import requests
import json
import os

from bs4 import BeautifulSoup
from dateutil.parser import parse
from dateutil.tz import tzutc  # module name missing in this dump; tzutc() is used below

BASEDIR = '/media/data/ebooks'  # kept from the previous revision of this script

# namespace URIs were stripped in this dump; these are the standard
# Dublin Core and OPF ones used by metadata.opf
namespaces = {'dc': 'http://purl.org/dc/elements/1.1/',
              'opf': 'http://www.idpf.org/2007/opf',}

def get_ebook_file(files):
    """returns ebook file, if there is one in the file list"""
    for f in files:
        if f.endswith(('.epub', '.pdf')):
            return f
    return ''

def get_metadata(basedir):
    """walks basedir and yields relevant metadata, path, cover and ebook file"""
    for root, dirs, files in os.walk(basedir):
        if 'metadata.opf' in files:
            path = '/'.join(root.split('/')[-2:])
            filename = get_ebook_file(files)
            extension = os.path.splitext(filename)[1].lower()[1:]
            cover = ''
            if 'cover.jpg' in files:
                cover = 'cover.jpg'
            yield ('%s/metadata.opf' % root, path, cover, filename, extension)

def parse_metadata(metadata):
    # ... (lines collapsed in the diff view)
    root = x.getroot()

    def get_field(matcher):
        matches = []
        for match in root.findall('./opf:metadata/dc:%s' % matcher, namespaces=namespaces):
            matches.append(match.text)
        return matches

    def get_meta_field(matcher):
        matches = []
        for match in root.findall("./opf:metadata/opf:meta[@name='%s']" % matcher, namespaces=namespaces):
            matches.append(match.get('content'))  # line missing in this dump; calibre keeps the value in the 'content' attribute
        return matches

    def get_identifiers():
        matches = []
        for match in root.findall("./opf:metadata/dc:identifier[@opf:scheme]", namespaces=namespaces):
            identifier_type = match.get('{http://www.idpf.org/2007/opf}scheme')
            if identifier_type == 'calibre':
                matches.append(identifier_type + ':' + match.text)
        return matches

    calibre_id = get_field('identifier[@id="calibre_id"]')
    if not calibre_id:
        return None
    title = get_field('title')
    author = get_field('creator')
    series = get_meta_field('calibre:series')
    series_index = get_meta_field('calibre:series_index')
    subject = get_field('subject')
    # description may contain html, we remove that
    description = get_field('description')
    if description:
        soup = BeautifulSoup(description[0], "lxml")
        for i in soup(['script', 'style']):  # remove script and style
            i.extract()  # loop body missing in this dump; extract() drops the tag
        description = soup.get_text()  # get text
        lines = (line.strip() for line in description.splitlines())  # break into lines and remove leading and trailing space on each
        chunks = (phrase.strip() for line in lines for phrase in line.split(" "))  # break multi-headlines into a line each
        description = '\n'.join(chunk for chunk in chunks if chunk)  # drop blank lines
    identifier = get_identifiers()
    language = get_field('language')
    date = get_field('date')
    if date and date[0]:
        date = "%sZ" % parse(date[0]).astimezone(tzutc()).isoformat()
    publisher = get_field('publisher')
    author_sort = get_meta_field('calibre:author_sort')
    title_sort = get_meta_field('calibre:title_sort')
    return {'id': calibre_id,
            'title': title,
            'title_output': title,
            'author': author,
            'author_facet': author,
            'series': series,
            'series_index': series_index,
            'subject': subject,
            'subject_facet': subject,
            'abstract': description,
            'abstract_output': description,
            'identifier': identifier,
            'language': language,
            'date': date,
            'year': date[:4],
            'publisher': publisher,
            'author_sort': author_sort,
            'title_sort': title_sort,
            }

def update_entry(ebook_data):
    # ... (lines collapsed in the diff view; solr_data is built from ebook_data here)
    headers = {'Content-type': 'application/json', 'Accept': 'application/json'}
    requests.post('http://localhost:8080/solr/ebooks/update/json',
                  data=json.dumps(solr_data), headers=headers)

def reload_core():
    """tells solr to reload the core"""
    # curl "http://localhost:8080/solr/admin/cores?action=RELOAD&core=ebooks"
    payload = {'action': 'RELOAD', 'core': 'ebooks'}
    requests.get('http://localhost:8080/solr/admin/cores', params=payload)

if __name__ == '__main__':
    for idx, metadata in enumerate(get_metadata(BASEDIR)):
        metadata_file, path, cover, filename, extension = metadata
        ebook_data = parse_metadata(metadata_file)
        if not ebook_data:
            print("Unable to find metadata in %s." % metadata_file)
            continue
        ebook_data.update({'path': path,
                           'coverfile': cover,
                           'filename': filename,
                           'filetype': extension,
                           })
        update_entry(ebook_data)
        if idx > 0 and idx % 100 == 0:
            pass  # ... (lines collapsed in the diff view)
    print('Finally reloading solr')
    reload_core()
    print('Done')

# vim: set tabstop=4 shiftwidth=4 expandtab:
@@ -16,24 +16,11 @@
   limitations under the License.
 -->
-<!--
-  All (relative) paths are relative to the installation path
-
-  persistent: Save changes made via the API to this file
-  sharedLib: path to a lib directory that will be shared across all cores
--->
-<solr persistent="false">
-  <!--
-    adminPath: RequestHandler path to manage cores.
-    If 'null' (or absent), cores will not be manageable via request handler
-  -->
-  <cores adminPath="/admin/cores" host="${host:}" hostPort="${jetty.port:8983}" hostContext="${hostContext:solr}">
-    <core name="ebooks" instanceDir="/media/data/ebooks/solr" />
-    <shardHandlerFactory name="shardHandlerFactory" class="HttpShardHandlerFactory">
-      <str name="urlScheme">${urlScheme:}</str>
+<!-- If this file is found in the config directory, it will only be
+     loaded once at startup. If it is found in Solr's data
+     directory, it will be re-loaded every commit.
+-->
+<elevate>
+</elevate>
<?xml version="1.0" encoding="UTF-8" ?>
<schema name="calibre" version="0.1">
<!-- attribute "name" is the name of this schema and is only used for display purposes.
Applications should change this to reflect the nature of the search collection.
version="1.2" is Solr's version number for the schema syntax and semantics. It should
not normally be changed by applications.
1.0: multiValued attribute did not exist, all fields are multiValued by nature
1.1: multiValued attribute introduced, false by default
1.2: omitTermFreqAndPositions attribute introduced, true by default except for text fields.
<!-- field type definitions. The "name" attribute is
just a label to be used by field definitions. The "class"
attribute and any other attributes determine the real
behavior of the fieldType.
Class names starting with "solr" refer to java classes in the
org.apache.solr.analysis package.
<!-- The StrField type is not analyzed, but indexed/stored verbatim.
- StrField and TextField support an optional compressThreshold which
limits compression (if enabled in the derived fields) to values which
exceed a certain size (in characters).
<fieldType name="string" class="solr.StrField" sortMissingLast="true" omitNorms="true"/>
<!-- boolean type: "true" or "false" -->
<fieldType name="boolean" class="solr.BoolField" sortMissingLast="true" omitNorms="true"/>
<!--Binary data type. The data should be sent/retrieved in as Base64 encoded Strings -->
<fieldtype name="binary" class="solr.BinaryField"/>
<!-- The optional sortMissingLast and sortMissingFirst attributes are
currently supported on types that are sorted internally as strings.
This includes "string","boolean","sint","slong","sfloat","sdouble","pdate"
- If sortMissingLast="true", then a sort on this field will cause documents
without the field to come after documents with the field,
regardless of the requested sort order (asc or desc).
- If sortMissingFirst="true", then a sort on this field will cause documents
without the field to come before documents with the field,
regardless of the requested sort order.
- If sortMissingLast="false" and sortMissingFirst="false" (the default),
then default lucene sorting will be used which places docs without the
field first in an ascending sort and last in a descending sort.
Default numeric field types. For faster range queries, consider the tint/tfloat/tlong/tdouble types.
<fieldType name="int" class="solr.TrieIntField" precisionStep="0" omitNorms="true" positionIncrementGap="0"/>
<fieldType name="float" class="solr.TrieFloatField" precisionStep="0" omitNorms="true" positionIncrementGap="0"/>
<fieldType name="long" class="solr.TrieLongField" precisionStep="0" omitNorms="true" positionIncrementGap="0"/>
<fieldType name="double" class="solr.TrieDoubleField" precisionStep="0"