One of the most important parts of a wiki is the ability to search. Implementing search in a text-only, no-database wiki like DokuWiki isn't a trivial task.
The naive approach would be to search through the contents of every page file when asked. In fact this is what was done in very early releases of DokuWiki. It's simple but doesn't scale at all. As soon as you have a couple of hundred pages, it takes quite a while to search through each and every text file individually.
What you need is a search index. Instead of scanning every page, you look up your search term in the index, and the index tells you which pages contain it. DokuWiki uses a simple index organized by word length. Basically, when you search for "airplane", DokuWiki opens the index for words that are 8 characters long, finds the word "airplane" in it and then knows all the pages that contain this word. There are a few more complicated steps on top of that for handling phrases or matching only parts of a word, but this is the basic mechanism. Besides being simple, it works surprisingly well for the vast majority of wikis.
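To make the idea concrete, here is a minimal in-memory sketch of an index bucketed by word length. It is an illustration only: the function names `build_index` and `lookup` are made up for this example, and DokuWiki's real index lives in files on disk with its own format.

```python
import re
from collections import defaultdict


def build_index(pages):
    """Build {word_length: {word: set(page_ids)}} from {page_id: text}."""
    index = defaultdict(lambda: defaultdict(set))
    for page_id, text in pages.items():
        for word in re.findall(r"\w+", text.lower()):
            # Words are grouped by their length, so a lookup only has to
            # open the one bucket that can possibly contain the term.
            index[len(word)][word].add(page_id)
    return index


def lookup(index, term):
    """Return the sorted page ids that contain the exact word `term`."""
    term = term.lower()
    bucket = index.get(len(term), {})
    return sorted(bucket.get(term, set()))


pages = {
    "start": "An airplane flies over the wiki",
    "wiki:fly": "Airplane and helicopter",
}
index = build_index(pages)
print(lookup(index, "airplane"))
```

Phrase search and partial matches need extra work on top of this, which is where the real implementation gets more involved.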
But sometimes the built-in search is not enough. That's when you want to use a "professional" search engine. Now, if your wiki is completely public, you can use a Google Custom Search engine or a similar offering and let an external search engine provide what you need. But when you have an internal wiki, you want something that you can run yourself. Enter Elasticsearch.
Elasticsearch is a search engine based on the Lucene library. It provides a distributed, multitenant-capable full-text search engine with an HTTP web interface and schema-free JSON documents.
A new DokuWiki Plugin
For one of our customers, we created a DokuWiki plugin to properly integrate Elasticsearch. As always, the plugin has also been made available for the broader community at dokuwiki.org.
Once set up, indexing happens automatically: pages are submitted to the Elasticsearch server whenever they are changed. That's not much different from what happens in vanilla DokuWiki. On indexing, we also save additional metadata that can be used for filtering later on, such as the namespace, the author and the last modification date. When the translation plugin is installed, we also save the language of each page.
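As a rough sketch of what such an indexed document could look like, the snippet below builds a JSON body with the metadata mentioned above and the request that would store it. The field names, the index name `wiki` and both function names are assumptions made for this example, not the plugin's actual schema. The only things taken as given are that DokuWiki page IDs use colons to separate namespaces and that Elasticsearch stores a document via `PUT /<index>/_doc/<id>`.

```python
import json
from datetime import datetime, timezone
from urllib.parse import quote


def build_document(page_id, content, author, modified_ts, language=None):
    """Assemble a document for one wiki page (field names are hypothetical)."""
    doc = {
        "uri": page_id,
        # Everything before the last colon is the namespace; root pages get "".
        "namespace": page_id.rpartition(":")[0],
        "author": author,
        "modified": datetime.fromtimestamp(modified_ts, tz=timezone.utc).isoformat(),
        "content": content,
    }
    if language is not None:
        # Only known when something like the translation plugin provides it.
        doc["language"] = language
    return doc


def index_request(doc, index="wiki"):
    """Return (method, path, body) for storing the document; nothing is sent."""
    path = "/{}/_doc/{}".format(index, quote(doc["uri"], safe=""))
    return "PUT", path, json.dumps(doc)


doc = build_document("wiki:syntax", "Formatting syntax ...", "andi", 1500000000, "en")
method, path, body = index_request(doc)
print(method, path)
print(body)
```

Sending the request is left out on purpose, so the sketch runs without an Elasticsearch server.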
A command line utility allows for (re)indexing the whole wiki on demand.
For a seamless experience, the plugin integrates in a way that replaces the standard DokuWiki search. The user interface is very similar to the one provided by DokuWiki itself.
There are, however, a few improvements besides the better speed and scalability in the back end.
Complex searches can now be constructed using the Elasticsearch Simple Query Syntax, which, despite its name, is quite powerful.
Elasticsearch will also use proper language-dependent stemming. E.g. searching for "run" will also search for "running" or "runs".
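For illustration, this is roughly how a user's input can be handed to Elasticsearch as a `simple_query_string` query. The field names `title` and `content` and the helper `build_query` are assumptions for this sketch. The query type itself and its operators (`+` for AND, `|` for OR, `-` for negation, quotes for phrases, `*` for prefixes) are part of Elasticsearch.

```python
import json


def build_query(user_input, fields=("title", "content")):
    """Wrap raw user input in a simple_query_string search body."""
    return {
        "query": {
            "simple_query_string": {
                "query": user_input,
                "fields": list(fields),
                # Require all terms unless the user writes an explicit "|".
                "default_operator": "and",
            }
        }
    }


# A phrase, a group of alternatives and an excluded word in one search.
body = build_query('"search index" +(wiki | plugin) -draft')
print(json.dumps(body, indent=2))
```

A nice property of this query type is that it ignores invalid parts of the input rather than failing on them, which makes it a reasonable fit for a search box.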
Search snippets provide context to a result, allowing the user to judge the usefulness of a result before following the link. For performance reasons, DokuWiki's default search only provides snippets for the first 15 results. With the Elasticsearch plugin, every search result has a snippet.
Showing results on multiple pages is important for searches that return a lot of results. With DokuWiki's simple approach, pagination is not possible. With Elasticsearch it's readily available.
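Snippets and pagination both map onto standard parts of an Elasticsearch search body: `highlight` asks for matching fragments, and `from` and `size` select a window of results. The sketch below assumes a `content` field and a hypothetical helper named `paginate`; the page size of 20 is likewise just an example value.

```python
def paginate(body, page, per_page=20):
    """Return a copy of a search body limited to one result page (1-based)."""
    if page < 1 or per_page < 1:
        raise ValueError("page and per_page must be positive")
    paged = dict(body)
    # Skip the results of all previous pages, then take one page worth.
    paged["from"] = (page - 1) * per_page
    paged["size"] = per_page
    # Ask for highlighted fragments of the content field as snippets.
    paged["highlight"] = {"fields": {"content": {}}}
    return paged


base = {"query": {"match": {"content": "airplane"}}}
print(paginate(base, page=3, per_page=10))
```

Keep in mind that Elasticsearch caps `from` plus `size` at 10,000 by default (the `index.max_result_window` setting), which is rarely a problem for a wiki search.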
Access Control Lists are an important feature of DokuWiki in a corporate context, and of course any search needs to make sure that they are checked correctly. Otherwise you risk leaking information. The plugin we created uses a clever mechanism to map ACLs to properties in the search index itself. This makes it possible to create search queries that do the ACL checking directly in the search engine. That way, Elasticsearch already returns only the results the current user is allowed to see.
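The general idea of checking permissions inside the search engine can be sketched as follows. This is a deliberately simplified illustration and not the plugin's actual mapping: it assumes each document carries a hypothetical `read_access` field listing the users and groups that may read it, and it ignores the fact that real DokuWiki ACLs can also deny access through more specific rules.

```python
def restrict_to_user(query, user, groups):
    """Wrap a query so only documents readable by the user can match."""
    # Hypothetical encoding: "user:<name>" and "group:<name>" tokens.
    principals = ["user:" + user] + ["group:" + g for g in groups]
    return {
        "query": {
            "bool": {
                "must": [query],
                # A filter does not affect scoring; it only removes
                # documents that share no token with the current user.
                "filter": [{"terms": {"read_access": principals}}],
            }
        }
    }


text_query = {"simple_query_string": {"query": "budget", "fields": ["content"]}}
secured = restrict_to_user(text_query, "alice", ["ALL", "finance"])
print(secured)
```

Because the filter is part of the query, result counts and pagination stay correct without any post-filtering in PHP.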
Possible Future Enhancements
The plugin already covers what our customer asked for. However, we do have a few ideas we would love to implement in the future.
The flexibility of Elasticsearch makes it possible to store different kinds of data in the index. This could be used to index not only the page contents, but also the contents of uploaded files like Word or PDF documents, or even images using OCR technology.
We have an internal prototype for this feature already and it looks quite promising.
Elasticsearch has a feature to implement search suggestions. This is helpful to suggest searches that are similar but more likely to deliver results than the given search term. You're probably familiar with the "Did you mean ..." suggestion of popular internet search engines.
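As a sketch of how this could look, Elasticsearch's term suggester can be requested alongside a normal search by adding a `suggest` section to the search body. The suggestion name `did_you_mean`, the `content` field and the helper `with_suggestions` are assumptions for this example; none of this is implemented in the plugin yet.

```python
def with_suggestions(body, text, field="content"):
    """Return a copy of a search body that also asks for term suggestions."""
    result = dict(body)
    result["suggest"] = {
        # The key is a free-form label under which the response
        # lists the suggested corrections.
        "did_you_mean": {
            "text": text,
            "term": {"field": field},
        }
    }
    return result


search = {"query": {"match": {"content": "airplnae"}}}
print(with_suggestions(search, "airplnae"))
```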
It would be easy to add metadata provided by other plugins to the indexed pages. This could then be used to further filter or cluster the search results. Plugins like the Tagging Plugin or the Struct Plugin come to mind.
If you're interested in deploying Elasticsearch with DokuWiki in your company or want to sponsor future development, please contact us.