Word Splitting Filter in Solr

written by Clayton Parker on 11/08/10

Solr has a fantastic set of options for processing text. One of them is WordDelimiterFilterFactory, which lets you split a single token into multiple tokens. Here is an example of the filter applied to a TextField:

<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory"
        splitOnCaseChange="1"
        splitOnNumerics="1"
        generateWordParts="1"
        stemEnglishPossessive="0"
        generateNumberParts="0"
        catenateWords="0"
        catenateNumbers="0"
        catenateAll="0"
        preserveOriginal="0"/>
  </analyzer>
</fieldType>

This would turn the following token:

SpecialWords123More-words

Into several tokens based on the options that are turned on (splitOnCaseChange, splitOnNumerics, generateWordParts):

Special, Words, 123, More, words

We recently had a request to turn off the splitOnNumerics option so that letter-to-number transitions don't cause a word to split. This seemed like a simple request, but we ended up spending a fair amount of time on it because the Solr docs are inaccurate. Our client was still using the 1.3 release of Solr, and this particular option was not introduced until the 1.4 release. Solr happily accepts the non-existent option and never warns you that it isn't valid! Getting this simple toggle released required an upgrade to the 1.4 release. While the upgrade brings a lot of bug fixes and features, it was an unexpected setback.
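
To make the effect of the toggle concrete, here is a rough Python sketch of the boundary rules described above. It is not Solr's implementation: the function name `split_token` and its two keyword arguments are made up for illustration, and it only models where a token gets cut (delimiters, case changes, letter/digit transitions). It ignores the generate*, catenate*, preserveOriginal and stemEnglishPossessive options entirely.

```python
def split_token(token, split_on_case_change=True, split_on_numerics=True):
    """Toy model of word-delimiter boundaries. Hypothetical helper, not Solr code."""

    def kind(ch):
        # Classify each character; anything that is not a letter or digit
        # is treated as a delimiter (for example "-").
        if ch.isdigit():
            return "digit"
        if ch.isupper():
            return "upper"
        if ch.isalpha():
            return "lower"
        return "delim"

    def is_break(prev, cur):
        if prev == cur:
            return False
        letters = ("upper", "lower")
        if prev in letters and cur in letters:
            # lower -> upper is a case change ("lW" in "SpecialWords");
            # upper -> lower stays together ("Sp" in "Special").
            return split_on_case_change and prev == "lower"
        # What remains is a letter <-> digit transition.
        return split_on_numerics

    parts, current, prev = [], "", None
    for ch in token:
        k = kind(ch)
        if k == "delim":
            # Delimiters always end the current part and are dropped.
            if current:
                parts.append(current)
            current, prev = "", None
            continue
        if prev is not None and is_break(prev, k):
            parts.append(current)
            current = ""
        current += ch
        prev = k
    if current:
        parts.append(current)
    return parts


print(split_token("SpecialWords123More-words"))
print(split_token("SpecialWords123More-words", split_on_numerics=False))
```

With both flags on, the sketch cuts the token into the same five pieces listed earlier. With `split_on_numerics=False`, the letters and digits stay attached, so the middle of the token comes out as one piece, `Words123More`, which is the behavior our client was asking for.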

If you are new to Solr, I highly recommend reading the book Solr 1.4 Enterprise Search Server. The authors do an excellent job explaining how all the pieces of Solr work.

Posted by Eric Pugh on 11/10/10
I've had the same challenge in the Solr wiki. While there is an attempt to mark features with Solr4.0 or Solr1.4 links in the text to highlight when a feature was introduced, that kind of breaks down when an aspect like the WordDelimiterFilterFactory is modified over multiple versions! I just updated the CoreAdmin wiki page, and somewhat awkwardly added the MERGEINDEXES command, a 1.4 feature, when the rest of the page primarily features 1.3 enhancements!

Thanks for the recommendation on the book. We are going to start a revision soon to update it for all the new features in Solr. Maybe we can provide an index or chart of which features belong to which release of Solr.

Posted by clayton on 11/10/10
The book certainly brought light to some of the dark corners of Solr! Is there any effort in the Solr community to make the online docs more coherent? I think I refer to the book more often than the wiki.