Solr Queries With Dashes

I am currently using Solr's edismax query parser for searches on our website. What I'd like to do is essentially have dashes ignored.

For example, if I search for "wi-fi adapter" and I have a document with the title "wifi adapter", I get no results.

I am currently using solr.MappingCharFilterFactory to map dashes to spaces. This is what my text_general fieldtype looks like in my schema.

  <fieldType name="text_general" class="solr.TextField" positionIncrementGap="100" multiValued="true">
    <analyzer type="index">
      <tokenizer class="solr.ClassicTokenizerFactory"/>
      <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
      <charFilter class="solr.MappingCharFilterFactory" mapping="mapping.txt"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <charFilter class="solr.HTMLStripCharFilterFactory"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.ClassicTokenizerFactory"/>
      <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
      <charFilter class="solr.MappingCharFilterFactory" mapping="mapping.txt"/>
      <filter class="solr.SynonymGraphFilterFactory" expand="true" ignoreCase="true" synonyms="synonyms.txt"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <charFilter class="solr.HTMLStripCharFilterFactory"/>
    </analyzer>
  </fieldType>

My mapping.txt contains the line:

"-" => " "

This rule converts each dash to a space.

So if I search for "wi-fi adapter", I get the same results as for "wi fi adapter", but still no results for "wifi adapter".

Is there any way to treat dashes like this? Essentially I'd want to treat "wifi adapter", "wi-fi adapter", and "wi fi adapter" the same.

-Paul

asked Nov 27 '18 at 18:49

Paul

3271414

solr solrnet


1 Answer
You can use the WordDelimiterGraphFilterFactory in your analyzer. It has many attributes; I have listed a few of the relevant ones below.

generateWordParts : (integer, default 1) If non-zero, splits words at delimiters. For example: "CamelCase", "hot-spot" → "Camel", "Case", "hot", "spot"

preserveOriginal : (integer, default 0) If non-zero, the original token is preserved: "Zap-Master-9000" → "Zap-Master-9000", "Zap", "Master", "9000"

catenateWords : (integer, default 0) If non-zero, maximal runs of word parts will be joined: "hot-spot-sensor’s" → "hotspotsensor"

In your case, the field type would look something like this:

<fieldType name="text_wd" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
        <!-- Splits words based on whitespace characters -->
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <!-- Splits words at delimiters based on different arguments -->
        <filter class="solr.WordDelimiterGraphFilterFactory" preserveOriginal="1" catenateWords="1"/>
        <!-- Transforms text to lower case -->
        <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
    <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
</fieldType>
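One refinement that may be worth considering: the Solr Reference Guide notes that a graph-producing filter such as WordDelimiterGraphFilterFactory, when used at index time, should be followed by FlattenGraphFilterFactory, because the index cannot store a token graph directly. Below is a minimal sketch of that variant. The field type name text_wd_flat is a hypothetical name chosen for illustration, and the sketch assumes a Solr version that ships FlattenGraphFilterFactory.

```xml
<!-- Sketch only: "text_wd_flat" is a hypothetical name used for illustration -->
<fieldType name="text_wd_flat" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <!-- Keeps the original token and also emits the joined form ("wi-fi" also yields "wifi") -->
        <filter class="solr.WordDelimiterGraphFilterFactory" preserveOriginal="1" catenateWords="1"/>
        <!-- Flattens the token graph produced above so that it can be indexed -->
        <filter class="solr.FlattenGraphFilterFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
    <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
</fieldType>
```

After changing an index-time analyzer, existing documents need to be reindexed before the new tokens become searchable. The Analysis screen in the Solr admin UI is a convenient way to check which tokens a value such as "wi-fi adapter" actually produces.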

More information can be found in the Solr documentation on the available filters.

edited Nov 28 '18 at 11:54

answered Nov 28 '18 at 4:08

Abhijit Bashetti

4,35652034

This is a correct answer, of course, and it is the preferred way to do it. But the original question might involve phrase queries, which can be converted to span queries in some complex cases, so be aware of issues.apache.org/jira/projects/LUCENE/issues/LUCENE-7398 .

– Nikolay
Nov 28 '18 at 12:57

@Nikolay: I will check again on my side whether anything can be done. Waiting for Paul's feedback.

– Abhijit Bashetti
Nov 28 '18 at 13:54

Your answer is perfect. I just added a minor warning for Paul about a Lucene bug that breaks this in some rare cases.

– Nikolay
Nov 28 '18 at 18:18

This helped a lot, thank you.

– Paul
Dec 4 '18 at 16:31
