Word Splitting Filter in Solr

written by Clayton Parker on Monday November 8, 2010

Solr has a fantastic set of options for processing text. One of them is the WordDelimiterFilterFactory, which lets you split a single token into multiple tokens. Here is an example of the filter applied to a TextField:

<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory"
        splitOnCaseChange="1"
        splitOnNumerics="1"
        generateWordParts="1"
        stemEnglishPossessive="0"
        generateNumberParts="0"
        catenateWords="0"
        catenateNumbers="0"
        catenateAll="0"
        preserveOriginal="0"/>
  </analyzer>
</fieldType>

This would turn the following token:

SpecialWords123More-words

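Into several tokens based on the options that are turned on (splitOnCaseChange, splitOnNumerics, generateWordParts):

Special, Words, More, words

Note that the numeric part, 123, is split off but not emitted, because generateNumberParts is set to 0 in this configuration. Set generateNumberParts="1" if you want 123 to come through as a token as well.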
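If you want to get a feel for these rules without firing up Solr, here is a rough Python sketch of the splitting behavior. It is only an approximation for ASCII input, not the real filter: the function name split_word and its keyword arguments are made up for this example, and it ignores the catenate*, preserveOriginal and stemEnglishPossessive options entirely. Number parts such as 123 are kept only when generate_number_parts is set, mirroring generateNumberParts.

```python
import re


def split_word(token, split_on_case_change=True, split_on_numerics=True,
               generate_word_parts=True, generate_number_parts=False):
    """Approximate WordDelimiterFilter sub-word splitting (ASCII only).

    Hypothetical helper for illustration; not part of Solr.
    """
    # Non-alphanumeric characters always act as delimiters.
    marked = re.sub(r"[^A-Za-z0-9]+", " ", token)
    if split_on_case_change:
        # Break on lowercase -> uppercase transitions ("lW" in "SpecialWords").
        marked = re.sub(r"(?<=[a-z])(?=[A-Z])", " ", marked)
    if split_on_numerics:
        # Break on letter -> digit and digit -> letter transitions.
        marked = re.sub(
            r"(?<=[A-Za-z])(?=[0-9])|(?<=[0-9])(?=[A-Za-z])", " ", marked)

    kept = []
    for part in marked.split():
        if part.isdigit():
            # Purely numeric parts are emitted only if number parts are on.
            if generate_number_parts:
                kept.append(part)
        elif generate_word_parts:
            kept.append(part)
    return kept


print(split_word("SpecialWords123More-words"))
print(split_word("SpecialWords123More-words", generate_number_parts=True))
print(split_word("SpecialWords123More-words", split_on_numerics=False))
```

The last call shows what turning off the numeric split is meant to do: Words123More stays together as one part instead of being broken at the digits.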

We recently had a request to turn off the splitOnNumerics option so that letter-to-number transitions don't cause a word to split. This seemed like a simple request, but we ended up spending a fair amount of time on it because the Solr docs were inaccurate for the version we were running. Our client was still using the 1.3 release of Solr, and splitOnNumerics was not introduced until the 1.4 release. Solr happily accepts the non-existent option and never warns you that it isn't valid! Getting this simple toggle released required an upgrade to the 1.4 release. While the upgrade brings a lot of bug fixes and features, it was an unexpected setback.
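Since Solr won't tell you about an option it doesn't understand, one way to catch this earlier is a small check of your own against the schema. The sketch below is just that, a sketch: the function unknown_filter_options and the ALLOWED_OPTIONS_EXAMPLE table are hypothetical, and the option names in the table are placeholders for illustration, not a verified list for any Solr release. You would fill it in from the Javadoc or source of the release you actually run.

```python
import xml.etree.ElementTree as ET

# Placeholder allow-list (an assumption, not taken from Solr itself).
# Replace it with the options your Solr release really supports.
ALLOWED_OPTIONS_EXAMPLE = {
    "solr.WordDelimiterFilterFactory": {
        "generateWordParts",
        "generateNumberParts",
        "catenateWords",
        "catenateNumbers",
        "catenateAll",
    },
}


def unknown_filter_options(schema_xml, allowed=ALLOWED_OPTIONS_EXAMPLE):
    """Return (filter class, attribute) pairs that are not in the allow-list."""
    root = ET.fromstring(schema_xml)
    problems = []
    for flt in root.iter("filter"):
        cls = flt.get("class")
        if cls not in allowed:
            # No allow-list for this filter class, so nothing to check.
            continue
        for name in flt.attrib:
            if name != "class" and name not in allowed[cls]:
                problems.append((cls, name))
    return problems


fragment = """
<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory"
        generateWordParts="1"
        splitOnNumerics="0"/>
  </analyzer>
</fieldType>
"""

for cls, name in unknown_filter_options(fragment):
    print("Option %r is not in the allow-list for %s" % (name, cls))
```

A check like this is only as good as the allow-list behind it, but it would have flagged splitOnNumerics before we spent time wondering why the setting had no effect.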

If you are new to Solr, I highly recommend reading the book Solr 1.4 Enterprise Search Server. The authors do an excellent job explaining how all the pieces of Solr work.

 
Posted by Eric Pugh on Nov 10, 2010 05:35 AM
I've had the same challenge in the Solr wiki. While there is an attempt to mark features with Solr4.0 or Solr1.4 links in the text to highlight when a feature is introduced, that kind of breaks down when an aspect like the WordDelimiterFilterFactory is modified over multiple versions! I just updated the CoreAdmin wiki page, and somewhat awkwardly added the MERGEINDEX command, a 1.4 feature, when the rest of the page primarily features 1.3 enhancements! Thanks for the recommendation on the book, we are going to be starting a revision soon to update it for all the new features in Solr. Maybe we can provide an index/chart of which features belong to which revision of Solr.