Solr Queries With Dashes
I am currently using Solr's edismax query parser for searches on our website. What I'd like to do is essentially have dashes ignored.
For example, if I search for "wi-fi adapter" and I have a document with the title "wifi adapter", I get no results.
I am currently using solr.MappingCharFilterFactory to map dashes to spaces. This is what my text_general field type looks like in my schema:
<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100" multiValued="true">
  <analyzer type="index">
    <tokenizer class="solr.ClassicTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
    <charFilter class="solr.MappingCharFilterFactory" mapping="mapping.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.ClassicTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
    <charFilter class="solr.MappingCharFilterFactory" mapping="mapping.txt"/>
    <filter class="solr.SynonymGraphFilterFactory" expand="true" ignoreCase="true" synonyms="synonyms.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
  </analyzer>
</fieldType>
My mapping.txt contains the line:
"-" => " "
This rule converts each dash to a space.
So if I search for "wi-fi adapter", it always shows the same results as "wi fi adapter", but it won't show results for "wifi adapter".
Is there any way to treat dashes like this? Essentially I'd want to treat "wifi adapter", "wi-fi adapter", and "wi fi adapter" the same.
-Paul
solr solrnet
asked Nov 27 '18 at 18:49
Paul
1 Answer
You can use the WordDelimiterGraphFilterFactory in your analyzer. It has many attributes; a few of the relevant ones are listed below.
generateWordParts: (integer, default 1) If non-zero, splits words at delimiters. For example: "CamelCase", "hot-spot" → "Camel", "Case", "hot", "spot"
preserveOriginal: (integer, default 0) If non-zero, the original token is preserved: "Zap-Master-9000" → "Zap-Master-9000", "Zap", "Master", "9000"
catenateWords: (integer, default 0) If non-zero, maximal runs of word parts will be joined: "hot-spot-sensor’s" → "hotspotsensor"
In your case, the field type could look like this:
<fieldType name="text_wd" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <!-- Splits words on whitespace characters -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- Splits words at delimiters, keeps the original token and adds the joined form -->
    <filter class="solr.WordDelimiterGraphFilterFactory" preserveOriginal="1" catenateWords="1"/>
    <!-- Needed after a graph filter in an index analyzer, because the indexer cannot consume a token graph -->
    <filter class="solr.FlattenGraphFilterFactory"/>
    <!-- Transforms text to lower case -->
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
More information can be found under "Filters available in Solr".
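One gap worth noting: the query analyzer in the field type above does not split or join on dashes, so a query for "wi-fi" is only compared against the indexed token "wi-fi", and a document that contains just "wifi" may still not be found. Below is a minimal sketch of a variant that might close this gap, assuming the same text_wd field type and a Solr version that ships both WordDelimiterGraphFilterFactory and FlattenGraphFilterFactory. The query-side attribute choices are an assumption made for illustration and are not part of the original answer, so it is worth checking the resulting tokens in the Analysis screen of the Solr admin UI before relying on it.

```xml
<!-- Sketch only: the query-side attributes are assumptions, verify them on your own data -->
<fieldType name="text_wd" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- "wi-fi" is indexed as "wi-fi", "wi", "fi" and "wifi" -->
    <filter class="solr.WordDelimiterGraphFilterFactory" preserveOriginal="1" catenateWords="1"/>
    <!-- The indexer cannot consume a token graph, so flatten it -->
    <filter class="solr.FlattenGraphFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- Also emit the joined form at query time, so "wi-fi" can match an indexed "wifi" -->
    <filter class="solr.WordDelimiterGraphFilterFactory" preserveOriginal="1" catenateWords="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

This filter works on one token at a time, so a query typed as two separate words ("wi fi") will probably still not match a document that contains only "wifi". That case likely needs a synonym entry instead. Changing the index analyzer also requires reindexing the existing documents.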
This is a correct answer, of course, and it is the preferred way to do it. But the original question might involve phrase queries, which can be converted to span queries in some complex cases, so be aware of issues.apache.org/jira/projects/LUCENE/issues/LUCENE-7398.
– Nikolay
Nov 28 '18 at 12:57
@Nikolay: I will check again from my side whether anything can be done. Waiting for Paul's feedback.
– Abhijit Bashetti
Nov 28 '18 at 13:54
Your answer is perfect. I just added a minor warning for Paul about a Lucene bug that breaks this in some rare cases.
– Nikolay
Nov 28 '18 at 18:18
This helped a lot, thank you.
– Paul
Dec 4 '18 at 16:31
edited Nov 28 '18 at 11:54
answered Nov 28 '18 at 4:08
Abhijit Bashetti