Commit ab2f659d authored by Citronalco

initial commit

parent f7f540d4
# Calibre solr Web Search
Calibre and Calibre's web server are quite slow for bigger libraries. Searching in a library with tens of thousands of books takes several minutes.
But if you index Calibre's metadata in an Apache solr server and use a web site for searching, you get your results within a second.
This project contains an index tool that parses Calibre's metadata.opf files and stores them in solr, the required schema file for solr's index, and a neat website for searching the index and downloading books.
![screenshot 1](screenshot1.png?raw=true)
![screenshot 2](screenshot2.png?raw=true)
### Installation (on Debian 9):
#### What's where?
- The "indextool" directory contains the indexer, a python script that looks through your Calibre library, parses all metadata.opf files it can find and stuffs the results into the solr server, which in turn creates the index and answers queries.
- The "website" directory contains a web site for searching (querying) the solr server and for downloading books.
- The "solr" directory contains a schema.xml file which tells solr how to store what in the index and a solrconfig.xml file, which tells solr where to save the index and how to talk to clients.
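
To give an idea of what the indexer does with a single metadata.opf file, here is a minimal sketch that reads a few Dublin Core fields with Python's standard library. The sample OPF content and the helper name `read_opf` are made up for illustration; the real indexer extracts more fields and sends them to solr.

```python
import xml.etree.ElementTree as ET

# XML namespaces used inside Calibre's metadata.opf files.
NAMESPACES = {
    "dc": "http://purl.org/dc/elements/1.1/",
    "opf": "http://www.idpf.org/2007/opf",
}

# A trimmed, made-up metadata.opf, used only for illustration.
SAMPLE_OPF = """<package xmlns="http://www.idpf.org/2007/opf" version="2.0">
  <metadata xmlns:dc="http://purl.org/dc/elements/1.1/"
            xmlns:opf="http://www.idpf.org/2007/opf">
    <dc:identifier opf:scheme="calibre" id="calibre_id">42</dc:identifier>
    <dc:title>Some Title</dc:title>
    <dc:creator opf:role="aut">Some Author</dc:creator>
    <dc:language>eng</dc:language>
  </metadata>
</package>
"""


def read_opf(xml_text):
    """Return a few Dublin Core fields from an OPF document (hypothetical helper)."""
    root = ET.fromstring(xml_text)

    def values(tag):
        # Collect the text of every matching dc:<tag> element.
        path = "./opf:metadata/dc:%s" % tag
        return [element.text for element in root.findall(path, NAMESPACES)]

    ids = values('identifier[@id="calibre_id"]')
    return {
        "id": ids[0] if ids else None,  # None if the file has no calibre id
        "title": values("title"),
        "author": values("creator"),
        "language": values("language"),
    }
```

Calling `read_opf(SAMPLE_OPF)` returns a dict with the calibre id as a string and the other fields as lists, since a book may have several authors or languages.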
#### How to install and configure solr
I assume you already have a webserver up and running. I tested this on Debian 9, but I'm pretty sure it works nearly the same way on other operating systems.
Debian 9 comes with solr 3.6.2. If you're using another version, some parts of schema.xml and/or solrconfig.xml might need to be changed.
1. Put the content of the "website" directory on your web server (e.g. /var/www/html/ebooksearch/)
1. Configure your webserver to make your Calibre library available in a subdirectory of the website (e.g. in /ebooksearch/calibrelibrary/)
1. Install solr: `apt-get install solr-tomcat`
1. Create a directory `/var/lib/solr/ebooksearch/conf/` and put the content of the "solr" directory into it.
1. `chown -R tomcat8:tomcat8 /var/lib/solr/ebooksearch`
1. Edit /etc/solr/solr.xml and add `<core name="ebooksearch" instanceDir="/var/lib/solr/ebooksearch" />` between `<cores>` and `</cores>`
1. Now map solr/Tomcat's search interface (`http://127.0.0.1:8080/solr/ebooksearch/select`) to /ebooksearch/solr, e.g. by using Apache's mod_proxy module:

        <Location /ebooksearch/solr/select>
            ProxyPass http://127.0.0.1:8080/solr/ebooksearch/select
            ProxyPassReverse /ebooksearch/solr/select
        </Location>

1. Start tomcat8: `systemctl start tomcat8`
1. Edit the website's index.html and set `calibre_url_prefix` to the webserver subdirectory from above (e.g. /ebooksearch/calibrelibrary/) and `solr_url_prefix` to the mapped solr location (e.g. /ebooksearch/solr/select/)
1. Edit the indexer from the "indextool" directory and set the ``BASEDIR`` to point to your calibre library directory on the file system.
1. Start the indexer and wait until it finishes.
1. You might need to reload the solr ebooksearch core to be able to search in the newly created index: either restart tomcat8 or execute `curl "http://localhost:8080/solr/admin/cores?action=RELOAD&core=ebooksearch"`
1. Done. Open the website in a browser and you should be able to search.
The schema.xml is loosely based on the schema of [OPUS4](https://github.com/OPUS4/search), and the web interface is inspired by [calibre-web](https://github.com/janeczku/calibre-web).
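
If the website shows no results, it can help to ask solr directly. The following is a minimal sketch, assuming solr listens on the address used in the proxy step above and that the indexed documents have `title` and `author` fields; the helper names `build_select_url` and `search` are made up for illustration.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Address taken from the proxy step above; adjust it if your Tomcat listens elsewhere.
SOLR_SELECT = "http://127.0.0.1:8080/solr/ebooksearch/select"


def build_select_url(query, rows=5, base=SOLR_SELECT):
    """Return a select URL that asks solr for JSON results (hypothetical helper)."""
    params = {"q": query, "wt": "json", "rows": rows}
    return base + "?" + urlencode(params)


def search(query, rows=5):
    """Run the query against solr and return the list of matching documents."""
    with urlopen(build_select_url(query, rows)) as response:
        payload = json.load(response)
    return payload["response"]["docs"]
```

For example, `search("title:python")` should return up to five documents as dicts if the core is loaded and the index is populated. An HTTP error at this point usually means the core name or the port does not match your setup.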
The idea itself is from https://github.com/chrigl/solr-calibre-example
Example for indexing calibre metadata
=====================================
**WARNING: this is only a stub**
I started importing some ebooks into calibre_ and serving them via
calibre-server on my Raspberry Pi. Searching the built-in calibre index felt
really slow: I waited up to a few minutes for a result. Also,
accessing the mobi files is slow, since calibre-server does not seem to serve
those files as static assets.
Directory structure of my calibre library::
    Some Author
    |-- Some Title
    |   |-- cover.jpg
    |   |-- Some Title.mobi
    |   |-- metadata.opf
    |-- Some other Title
        |-- cover.jpg
        |-- metadata.opf
        |-- Some other Title.mobi
``metadata.opf`` is the XML file containing the metadata.
However, I thought there must be a faster solution. And because I had already
configured solr_, I decided to set this up and feed it with calibre's metadata.
I took the default solr configuration and just created my own core named
``ebooks``, and added it to ``/etc/solr/solr.xml``::
<core name="ebooks" instanceDir="/media/data/ebooks/solr" />
In this core directory, I again used an example configuration,
``/media/data/ebooks/solr/conf/solrconfig.xml``, and only replaced the data and
updateLog directories::
<dataDir>/media/data/ebooks/solr/data</dataDir>
[...]
<updateLog>
<str name="dir">/media/data/ebooks/solr/data_updatelog</str>
</updateLog>
After that, I just had to create a ``/media/data/ebooks/solr/conf/schema.xml``.
Again, I only used one of the solr example files and defined some fields, such
as author::
<field name="author" type="string" indexed="true"
stored="true" multiValued="false" />
In fact, I only used solr example config files with three changed lines. I made
no changes to any performance or memory parameters.
And it turns out the index is quite fast, even though solr is also running on
just the Raspberry Pi.
As I mentioned already, this is far from being usable in real life, but maybe
it will give you an idea of how to use solr, combined with a little python
example: walking the file system tree, parsing xml files, searching in the xml
result set via xpath, and talking to an http api using requests_.
(See python-workshop_).
Little webui
````````````
I just added a little webui based on jquery_. It only queries solr for json
results and shows author, title, cover and description.
As a requirement, your library must be at::
$DOCUMENTROOT/books
However, your DOCUMENTROOT should look like this::
    .
    |-- books -> /YOUR/EBOOK/LIBRARY
    |-- css
    |   |-- master.css
    |-- img
    |   |-- next_book.png
    |   |-- next.png
    |   |-- prev_book.png
    |   |-- prev.png
    |   |-- top.png
    |-- index.html
    |-- js
        |-- jquery-2.1.1.min.js
You have to configure your webserver to pass requests to /ebooks directly to
solr. This is how to do it with nginx_::
location /ebooks {
proxy_pass http://localhost:8983/ebooks;
}
This is what it looks like.
.. image:: screen.png
Unfortunately, the buttons with internal references do not work on the
Kindle :/
.. _calibre: http://calibre-ebook.com/
.. _solr: http://lucene.apache.org/solr/
.. _requests: http://docs.python-requests.org/
.. _python-workshop: https://github.com/chrigl/python-workshop
.. _jquery: http://jquery.com/
.. _nginx: http://nginx.org/
<?xml version="1.0" ?>
<!--
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements. See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to You under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
<schema name="example core one" version="1.1">
<fieldtype name="string" class="solr.StrField" sortMissingLast="true" omitNorms="true"/>
<fieldType name="long" class="solr.TrieLongField" precisionStep="0" positionIncrementGap="0"/>
<!-- general -->
<field name="id" type="long" indexed="true" stored="true" multiValued="false" required="true"/>
<field name="author" type="string" indexed="true" stored="true" multiValued="false" />
<field name="language" type="string" indexed="true" stored="true" multiValued="false" />
<field name="title" type="string" indexed="true" stored="true" multiValued="false" />
<field name="description" type="string" indexed="true" stored="true" multiValued="false" />
<field name="path" type="string" indexed="false" stored="true" multiValued="false" />
<field name="mobi" type="string" indexed="false" stored="true" multiValued="false" />
<field name="cover" type="string" indexed="false" stored="true" multiValued="false" />
<!-- NEVER remove this -->
<field name="_version_" type="long" indexed="true" stored="true" />
<!-- field to use to determine and enforce document uniqueness. -->
<uniqueKey>id</uniqueKey>
<!-- field for the QueryParser to use when an explicit fieldname is absent -->
<defaultSearchField>title</defaultSearchField>
<!-- SolrQueryParser configuration: defaultOperator="AND|OR" -->
<solrQueryParser defaultOperator="OR"/>
</schema>
<?xml version="1.0" encoding="UTF-8" ?>
<!--
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements. See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to You under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
<!--
This is a stripped down config file used for a simple example...
It is *not* a good example to work from.
-->
<config>
<luceneMatchVersion>4.8</luceneMatchVersion>
<!-- The DirectoryFactory to use for indexes.
solr.StandardDirectoryFactory, the default, is filesystem based.
solr.RAMDirectoryFactory is memory based, not persistent, and doesn't work with replication. -->
<directoryFactory name="DirectoryFactory" class="${solr.directoryFactory:solr.StandardDirectoryFactory}"/>
<dataDir>/media/data/ebooks/solr/data</dataDir>
<!-- To enable dynamic schema REST APIs, use the following for <schemaFactory>:
<schemaFactory class="ManagedIndexSchemaFactory">
<bool name="mutable">true</bool>
<str name="managedSchemaResourceName">managed-schema</str>
</schemaFactory>
When ManagedIndexSchemaFactory is specified, Solr will load the schema from
the resource named in 'managedSchemaResourceName', rather than from schema.xml.
Note that the managed schema resource CANNOT be named schema.xml. If the managed
schema does not exist, Solr will create it after reading schema.xml, then rename
'schema.xml' to 'schema.xml.bak'.
Do NOT hand edit the managed schema - external modifications will be ignored and
overwritten as a result of schema modification REST API calls.
When ManagedIndexSchemaFactory is specified with mutable = true, schema
modification REST API calls will be allowed; otherwise, error responses will be
sent back for these requests.
-->
<schemaFactory class="ClassicIndexSchemaFactory"/>
<updateHandler class="solr.DirectUpdateHandler2">
<updateLog>
<str name="dir">/media/data/ebooks/solr/data_updatelog</str>
</updateLog>
</updateHandler>
<!-- realtime get handler, guaranteed to return the latest stored fields
of any document, without the need to commit or open a new searcher. The current
implementation relies on the updateLog feature being enabled. -->
<requestHandler name="/get" class="solr.RealTimeGetHandler">
<lst name="defaults">
<str name="omitHeader">true</str>
</lst>
</requestHandler>
<requestHandler name="/replication" class="solr.ReplicationHandler" startup="lazy" />
<requestDispatcher handleSelect="true" >
<requestParsers enableRemoteStreaming="false" multipartUploadLimitInKB="2048" formdataUploadLimitInKB="2048" />
</requestDispatcher>
<requestHandler name="standard" class="solr.StandardRequestHandler" default="true" />
<requestHandler name="/analysis/field" startup="lazy" class="solr.FieldAnalysisRequestHandler" />
<requestHandler name="/update" class="solr.UpdateRequestHandler" />
<requestHandler name="/admin/" class="org.apache.solr.handler.admin.AdminHandlers" />
<requestHandler name="/admin/ping" class="solr.PingRequestHandler">
<lst name="invariants">
<str name="q">solrpingquery</str>
</lst>
<lst name="defaults">
<str name="echoParams">all</str>
</lst>
</requestHandler>
<!-- config for the admin interface -->
<admin>
<defaultQuery>solr</defaultQuery>
</admin>
</config>
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import xml.etree.ElementTree as etree
import requests
import json
import os
BASEDIR='/media/data/ebooks'
from bs4 import BeautifulSoup
from dateutil.parser import parse
from dateutil.tz import *
BASEDIR='/data/my-calibre-library/'
namespaces = {'dc': 'http://purl.org/dc/elements/1.1/',
'opf': 'http://www.idpf.org/2007/opf'}
'opf': 'http://www.idpf.org/2007/opf',}
def get_mobi_file(files):
"""returns mobi file, if there is one in the file list"""
def get_ebook_file(files):
"""returns ebook file, if there is one in the file list"""
for f in files:
if f.endswith('.mobi'):
if f.endswith(('.epub','.pdf')):
return f
return ''
def get_metadata(basedir):
"""walks basedir and yields relevant metadata, path, cover and mobi"""
"""walks basedir and yields relevant metadata, path, cover and ebook file"""
for root, dirs, files in os.walk(basedir):
if 'metadata.opf' in files:
path = '/'.join(root.split('/')[-2:])
mobi = get_mobi_file(files)
filename = get_ebook_file(files)
extension = os.path.splitext(filename)[1].lower()[1:]
cover = ''
if 'cover.jpg' in files:
cover = 'cover.jpg'
yield ('%s/metadata.opf' % root, path, cover, mobi)
yield ('%s/metadata.opf' % root, path, cover, filename, extension)
def parse_metadata(metadata):
@@ -36,23 +45,86 @@ def parse_metadata(metadata):
root = x.getroot()
def get_field(matcher):
match = root.find('./opf:metadata/dc:%s' % matcher, namespaces=namespaces)
if match is None:
return ''
return match.text
matches = []
for match in root.findall('./opf:metadata/dc:%s' % matcher, namespaces=namespaces):
matches.append(match.text)
return matches
def get_meta_field(matcher):
matches = []
for match in root.findall("./opf:metadata/opf:meta[@name='%s']" % matcher, namespaces=namespaces):
matches.append(match.get("content"))
return matches
def get_identifiers():
matches = []
for match in root.findall("./opf:metadata/dc:identifier[@opf:scheme]", namespaces=namespaces):
identifier_type = match.get('{http://www.idpf.org/2007/opf}scheme')
if identifier_type == 'calibre':
continue
matches.append(identifier_type + ':' + match.text)
return matches
id_ = get_field('identifier[@id="calibre_id"]')
if not id_:
calibre_id = get_field('identifier[@id="calibre_id"]')
if not calibre_id:
return None
author = get_field('creator')
language = get_field('language')
title = get_field('title')
author = get_field('creator')
series = get_meta_field('calibre:series')
series_index = get_meta_field('calibre:series_index')
subject = get_field('subject')
# description may contain html, we remove that
description = get_field('description')
return {'id': int(id_),
if description:
soup = BeautifulSoup(description[0],"lxml")
for i in soup (['script','style']): # remove script and style
i.extract()
description = soup.get_text() # get text
lines = (line.strip() for line in description.splitlines()) # break into lines and remove leading and trailing space on each
chunks = (phrase.strip() for line in lines for phrase in line.split(" ")) # break multi-headlines into a line each
description = '\n'.join(chunk for chunk in chunks if chunk) # drop blank lines
identifier = get_identifiers()
language = get_field('language')
date = get_field('date')
if (date[0]):
date="%sZ" % parse(date[0]).astimezone(tzutc()).isoformat()
publisher = get_field('publisher')
author_sort = get_meta_field('calibre:author_sort')
title_sort = get_meta_field('calibre:title_sort')
return {'id': calibre_id,
'title': title,
'title_output': title,
'author': author,
'author_facet': author,
'series': series,
'series_index': series_index,
'subject': subject,
'subject_facet': subject,
'abstract': description,
'abstract_output': description,
'identifier': identifier,
'language': language,
'title': title,
'description': description,
'date' : date,
'year': date[:4],
'publisher': publisher,
'author_sort': author_sort,
'title_sort': title_sort,
}
@@ -64,26 +136,28 @@ def update_entry(ebook_data):
}
headers = {'Content-type': 'application/json', 'Accept': 'application/json'}
requests.post('http://localhost:8983/ebooks/update',
requests.post('http://localhost:8080/solr/ebooks/update/json',
data=json.dumps(solr_data), headers=headers)
def reload_core():
"""tells solr to reload the core"""
payload = {'wt': 'json', 'action': 'RELOAD', 'core': 'ebooks'}
requests.get('http://leierkasten.local:8983/admin/cores', params=payload)
payload = { 'action': 'RELOAD', 'core': 'ebooks'}
requests.get('http://localhost:8080/solr/admin/cores', params=payload)
# curl "http://localhost:8080/solr/admin/cores?action=RELOAD&core=ebooks"
if __name__ == '__main__':
for idx, metadata in enumerate(get_metadata(BASEDIR)):
metadata_file, path, cover, mobi = metadata
metadata_file, path, cover, filename, extension = metadata
ebook_data = parse_metadata(metadata_file)
if not ebook_data:
print("Unable to find metadata in %s." % metadata_file)
continue
ebook_data.update({'path': path,
'cover': cover,
'mobi': mobi,
'coverfile': cover,
'filename': filename,
'filetype': extension,
})
update_entry(ebook_data)
if idx > 0 and idx % 100 == 0:
@@ -92,5 +166,3 @@ if __name__ == '__main__':
print('Finally reloading solr')
reload_core()
print('Done')
# vim: set tabstop=4 shiftwidth=4 expandtab:
@@ -16,24 +16,11 @@
limitations under the License.
-->
<!--
All (relative) paths are relative to the installation path
persistent: Save changes made via the API to this file
sharedLib: path to a lib directory that will be shared across all cores
<!-- If this file is found in the config directory, it will only be
loaded once at startup. If it is found in Solr's data
directory, it will be re-loaded every commit.
-->
<solr persistent="false">
<!--
adminPath: RequestHandler path to manage cores.
If 'null' (or absent), cores will not be manageable via request handler
-->
<cores adminPath="/admin/cores" host="${host:}" hostPort="${jetty.port:8983}" hostContext="${hostContext:solr}">
<core name="ebooks" instanceDir="/media/data/ebooks/solr" />
<shardHandlerFactory name="shardHandlerFactory" class="HttpShardHandlerFactory">
<str name="urlScheme">${urlScheme:}</str>
</shardHandlerFactory>
</cores>
<elevate>
</solr>
</elevate>
<?xml version="1.0" encoding="UTF-8" ?>
<schema name="calibre" version="0.1">
<!-- attribute "name" is the name of this schema and is only used for display purposes.
Applications should change this to reflect the nature of the search collection.
version="1.2" is Solr's version number for the schema syntax and semantics. It should
not normally be changed by applications.
1.0: multiValued attribute did not exist, all fields are multiValued by nature
1.1: multiValued attribute introduced, false by default
1.2: omitTermFreqAndPositions attribute introduced, true by default except for text fields.
-->
<types>
<!-- field type definitions. The "name" attribute is
just a label to be used by field definitions. The "class"
attribute and any other attributes determine the real
behavior of the fieldType.
Class names starting with "solr" refer to java classes in the
org.apache.solr.analysis package.
-->
<!-- The StrField type is not analyzed, but indexed/stored verbatim.
- StrField and TextField support an optional compressThreshold which
limits compression (if enabled in the derived fields) to values which
exceed a certain size (in characters).
-->
<fieldType name="string" class="solr.StrField" sortMissingLast="true" omitNorms="true"/>
<!-- boolean type: "true" or "false" -->
<fieldType name="boolean" class="solr.BoolField" sortMissingLast="true" omitNorms="true"/>
<!--Binary data type. The data should be sent/retrieved in as Base64 encoded Strings -->
<fieldtype name="binary" class="solr.BinaryField"/>
<!-- The optional sortMissingLast and sortMissingFirst attributes are
currently supported on types that are sorted internally as strings.
This includes "string","boolean","sint","slong","sfloat","sdouble","pdate"
- If sortMissingLast="true", then a sort on this field will cause documents
without the field to come after documents with the field,
regardless of the requested sort order (asc or desc).
- If sortMissingFirst="true", then a sort on this field will cause documents
without the field to come before documents with the field,
regardless of the requested sort order.
- If sortMissingLast="false" and sortMissingFirst="false" (the default),
then default lucene sorting will be used which places docs without the
field first in an ascending sort and last in a descending sort.
-->
<!--
Default numeric field types. For faster range queries, consider the tint/tfloat/tlong/tdouble types.
-->
<fieldType name="int" class="solr.TrieIntField" precisionStep="0" omitNorms="true" positionIncrementGap="0"/>
<fieldType name="float" class="solr.TrieFloatField" precisionStep="0" omitNorms="true" positionIncrementGap="0"/>
<fieldType name="long" class="solr.TrieLongField" precisionStep="0" omitNorms="true" positionIncrementGap="0"/>
<fieldType name="double" class="solr.TrieDoubleField" precisionStep="0" omitNorms="true" positionIncrementGap="0"/>
<!--
Numeric field types that index each value at various levels of precision
to accelerate range queries when the number of values between the range
endpoints is large. See the javadoc for NumericRangeQuery for internal
implementation details.
Smaller precisionStep values (specified in bits) will lead to more tokens
indexed per value, slightly larger index size, and faster range queries.
A precisionStep of 0 disables indexing at different precision levels.
-->
<fieldType name="tint" class="solr.TrieIntField" precisionStep="8" omitNorms="true" positionIncrementGap="0"/>
<fieldType name="tfloat" class="solr.TrieFloatField" precisionStep="8" omitNorms="true" positionIncrementGap="0"/>
<fieldType name="tlong" class="solr.TrieLongField" precisionStep="8" omitNorms="true" positionIncrementGap="0"/>
<fieldType name="tdouble" class="solr.TrieDoubleField" precisionStep="8" omitNorms="true" positionIncrementGap="0"/>
<!-- The format for this date field is of the form 1995-12-31T23:59:59Z, and
is a more restricted form of the canonical representation of dateTime
http://www.w3.org/TR/xmlschema-2/#dateTime
The trailing "Z" designates UTC time and is mandatory.
Optional fractional seconds are allowed: 1995-12-31T23:59:59.999Z
All other components are mandatory.
Expressions can also be used to denote calculations that should be
performed relative to "NOW" to determine the value, ie...
NOW/HOUR
... Round to the start of the current hour
NOW-1DAY
... Exactly 1 day prior to now
NOW/DAY+6MONTHS+3DAYS
... 6 months and 3 days in the future from the start of
the current day
Consult the DateField javadocs for more information.
Note: For faster range queries, consider the tdate type
-->
<fieldType name="date" class="solr.TrieDateField" omitNorms="true" precisionStep="0" positionIncrementGap="0"/>
<!-- A Trie based date field for faster date range queries and date faceting. -->