Shortest unique combination of at least three characters for each string in list












1















I want to find the shortest unique combination of characters for each element in a list of strings. Each combination should consist of the string's first character and its two rarest characters at least (more if necessary) and order matters. If a character appears more than once in one string, it should get more weight.



Consider the following example:



liste = ["apple", "pear", "banana", "xylophone", "bear", "banunu"]
combinations = ["apl", "per", "ban", "xyh", "ber", "bnu"


for apple, both p and e appear 4 times overall, but since p appears twice in apple, it should be used in the combination.



What is the most efficient way to write this logic in python?










share|improve this question




















  • 1





    "for apple, both p and e appear 4 times overall,..." I didn't get it. How do they appear 4 times?

    – Austin
    Nov 25 '18 at 16:28











  • In all strings together, both p and e occur 4 times in total. The total number of occurrences of a character is what determines its rarity and whether it should be included in the combination. In case of a tie, like here, preference should be given to the character that appears more than once in the string (if any).

    – iuvbio
    Nov 25 '18 at 16:34













  • Why 'l' instead of 'h' in xylophone?

    – Daniel Mesejo
    Nov 25 '18 at 16:46













  • My mistake, thanks. Edited the question.

    – iuvbio
    Nov 25 '18 at 16:48
















1















I want to find the shortest unique combination of characters for each element in a list of strings. Each combination should consist of the string's first character and its two rarest characters at least (more if necessary) and order matters. If a character appears more than once in one string, it should get more weight.



Consider the following example:



liste = ["apple", "pear", "banana", "xylophone", "bear", "banunu"]
combinations = ["apl", "per", "ban", "xyh", "ber", "bnu"


for apple, both p and e appear 4 times overall, but since p appears twice in apple, it should be used in the combination.



What is the most efficient way to write this logic in python?










share|improve this question




















  • 1





    "for apple, both p and e appear 4 times overall,..." I didn't get it. How do they appear 4 times?

    – Austin
    Nov 25 '18 at 16:28











  • In all strings together, both p and e occur 4 times in total. The total number of occurrences of a character is what determines its rarity and whether it should be included in the combination. In case of a tie, like here, preference should be given to the character that appears more than once in the string (if any).

    – iuvbio
    Nov 25 '18 at 16:34













  • Why 'l' instead of 'h' in xylophone?

    – Daniel Mesejo
    Nov 25 '18 at 16:46













  • My mistake, thanks. Edited the question.

    – iuvbio
    Nov 25 '18 at 16:48














1












1








1








I want to find the shortest unique combination of characters for each element in a list of strings. Each combination should consist of the string's first character and its two rarest characters at least (more if necessary) and order matters. If a character appears more than once in one string, it should get more weight.



Consider the following example:



liste = ["apple", "pear", "banana", "xylophone", "bear", "banunu"]
combinations = ["apl", "per", "ban", "xyh", "ber", "bnu"


for apple, both p and e appear 4 times overall, but since p appears twice in apple, it should be used in the combination.



What is the most efficient way to write this logic in python?










share|improve this question
















I want to find the shortest unique combination of characters for each element in a list of strings. Each combination should consist of the string's first character and its two rarest characters at least (more if necessary) and order matters. If a character appears more than once in one string, it should get more weight.



Consider the following example:



liste = ["apple", "pear", "banana", "xylophone", "bear", "banunu"]
combinations = ["apl", "per", "ban", "xyh", "ber", "bnu"


for apple, both p and e appear 4 times overall, but since p appears twice in apple, it should be used in the combination.



What is the most efficient way to write this logic in python?







python combinatorics






share|improve this question















share|improve this question













share|improve this question




share|improve this question








edited Nov 25 '18 at 16:48







iuvbio

















asked Nov 25 '18 at 16:21









iuvbioiuvbio

359




359








  • 1





    "for apple, both p and e appear 4 times overall,..." I didn't get it. How do they appear 4 times?

    – Austin
    Nov 25 '18 at 16:28











  • In all strings together, both p and e occur 4 times in total. The total number of occurrences of a character is what determines its rarity and whether it should be included in the combination. In case of a tie, like here, preference should be given to the character that appears more than once in the string (if any).

    – iuvbio
    Nov 25 '18 at 16:34













  • Why 'l' instead of 'h' in xylophone?

    – Daniel Mesejo
    Nov 25 '18 at 16:46













  • My mistake, thanks. Edited the question.

    – iuvbio
    Nov 25 '18 at 16:48














  • 1





    "for apple, both p and e appear 4 times overall,..." I didn't get it. How do they appear 4 times?

    – Austin
    Nov 25 '18 at 16:28











  • In all strings together, both p and e occur 4 times in total. The total number of occurrences of a character is what determines its rarity and whether it should be included in the combination. In case of a tie, like here, preference should be given to the character that appears more than once in the string (if any).

    – iuvbio
    Nov 25 '18 at 16:34













  • Why 'l' instead of 'h' in xylophone?

    – Daniel Mesejo
    Nov 25 '18 at 16:46













  • My mistake, thanks. Edited the question.

    – iuvbio
    Nov 25 '18 at 16:48








1




1





"for apple, both p and e appear 4 times overall,..." I didn't get it. How do they appear 4 times?

– Austin
Nov 25 '18 at 16:28





"for apple, both p and e appear 4 times overall,..." I didn't get it. How do they appear 4 times?

– Austin
Nov 25 '18 at 16:28













In all strings together, both p and e occur 4 times in total. The total number of occurrences of a character is what determines its rarity and whether it should be included in the combination. In case of a tie, like here, preference should be given to the character that appears more than once in the string (if any).

– iuvbio
Nov 25 '18 at 16:34







In all strings together, both p and e occur 4 times in total. The total number of occurrences of a character is what determines its rarity and whether it should be included in the combination. In case of a tie, like here, preference should be given to the character that appears more than once in the string (if any).

– iuvbio
Nov 25 '18 at 16:34















Why 'l' instead of 'h' in xylophone?

– Daniel Mesejo
Nov 25 '18 at 16:46







Why 'l' instead of 'h' in xylophone?

– Daniel Mesejo
Nov 25 '18 at 16:46















My mistake, thanks. Edited the question.

– iuvbio
Nov 25 '18 at 16:48





My mistake, thanks. Edited the question.

– iuvbio
Nov 25 '18 at 16:48












1 Answer
1






active

oldest

votes


















3














You could do something like this:



import heapq

from collections import Counter
from operator import itemgetter


def combination(word, n, counts):
word_count = Counter(word)
elements =
seen = set()
for i, c in enumerate(word[1:]):
if c not in seen:
elements.append((-1 * counts[c], word_count[c], i, c))
seen.add(c)
top = heapq.nlargest(n, elements)
characters = map(itemgetter(3), top)

return word[0] + ''.join(sorted(characters, key=lambda x: word.index(x)))


lst = ["apple", "pear", "banana", "xylophone", "bear", "banunu"]

counts = Counter(''.join(lst))

result = [combination(w, 2, counts) for w in lst]
print(result)


Output



['apl', 'per', 'ban', 'xyh', 'ber', 'bnu']


The idea is to create a tuple of the priority criterias representing each unique letter. So elements is a list containing tuples represeting:





  1. counts[c]: The overall count (as you want the rarest multiply by -1)


  2. word_count[c]: The specific count of the letter in the word


  3. i: represents the firs position of the letter


  4. c: the letter itself.


Once you create the list elements with:



elements = 
seen = set()
for i, c in enumerate(word[1:]):
if c not in seen:
elements.append((-1 * counts[c], word_count[c], i, c))
seen.add(c)


Note that as the characters must be unique we use a set (seen) to guarantee uniqueness. Finally you use heapq.nlargest to get the top n elements according to the above criterias.






share|improve this answer


























  • Zero explanation on how this works, or what heapq is

    – Moralous
    Nov 25 '18 at 17:00











  • That seems like exactly what I was looking for! Could you elaborate a bit on elements.append((-1 * counts[c], word_count[c], i, c)) and what heapq.nlargest(n, elements) does?

    – iuvbio
    Nov 25 '18 at 17:02











  • @iuvbio Updated the answer!

    – Daniel Mesejo
    Nov 25 '18 at 17:20











Your Answer






StackExchange.ifUsing("editor", function () {
StackExchange.using("externalEditor", function () {
StackExchange.using("snippets", function () {
StackExchange.snippets.init();
});
});
}, "code-snippets");

StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "1"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});

function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});


}
});














draft saved

draft discarded


















StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53469457%2fshortest-unique-combination-of-at-least-three-characters-for-each-string-in-list%23new-answer', 'question_page');
}
);

Post as a guest















Required, but never shown

























1 Answer
1






active

oldest

votes








1 Answer
1






active

oldest

votes









active

oldest

votes






active

oldest

votes









3














You could do something like this:



import heapq

from collections import Counter
from operator import itemgetter


def combination(word, n, counts):
word_count = Counter(word)
elements =
seen = set()
for i, c in enumerate(word[1:]):
if c not in seen:
elements.append((-1 * counts[c], word_count[c], i, c))
seen.add(c)
top = heapq.nlargest(n, elements)
characters = map(itemgetter(3), top)

return word[0] + ''.join(sorted(characters, key=lambda x: word.index(x)))


lst = ["apple", "pear", "banana", "xylophone", "bear", "banunu"]

counts = Counter(''.join(lst))

result = [combination(w, 2, counts) for w in lst]
print(result)


Output



['apl', 'per', 'ban', 'xyh', 'ber', 'bnu']


The idea is to create a tuple of the priority criterias representing each unique letter. So elements is a list containing tuples represeting:





  1. counts[c]: The overall count (as you want the rarest multiply by -1)


  2. word_count[c]: The specific count of the letter in the word


  3. i: represents the firs position of the letter


  4. c: the letter itself.


Once you create the list elements with:



elements = 
seen = set()
for i, c in enumerate(word[1:]):
if c not in seen:
elements.append((-1 * counts[c], word_count[c], i, c))
seen.add(c)


Note that as the characters must be unique we use a set (seen) to guarantee uniqueness. Finally you use heapq.nlargest to get the top n elements according to the above criterias.






share|improve this answer


























  • Zero explanation on how this works, or what heapq is

    – Moralous
    Nov 25 '18 at 17:00











  • That seems like exactly what I was looking for! Could you elaborate a bit on elements.append((-1 * counts[c], word_count[c], i, c)) and what heapq.nlargest(n, elements) does?

    – iuvbio
    Nov 25 '18 at 17:02











  • @iuvbio Updated the answer!

    – Daniel Mesejo
    Nov 25 '18 at 17:20
















3














You could do something like this:



import heapq

from collections import Counter
from operator import itemgetter


def combination(word, n, counts):
word_count = Counter(word)
elements =
seen = set()
for i, c in enumerate(word[1:]):
if c not in seen:
elements.append((-1 * counts[c], word_count[c], i, c))
seen.add(c)
top = heapq.nlargest(n, elements)
characters = map(itemgetter(3), top)

return word[0] + ''.join(sorted(characters, key=lambda x: word.index(x)))


lst = ["apple", "pear", "banana", "xylophone", "bear", "banunu"]

counts = Counter(''.join(lst))

result = [combination(w, 2, counts) for w in lst]
print(result)


Output



['apl', 'per', 'ban', 'xyh', 'ber', 'bnu']


The idea is to create a tuple of the priority criterias representing each unique letter. So elements is a list containing tuples represeting:





  1. counts[c]: The overall count (as you want the rarest multiply by -1)


  2. word_count[c]: The specific count of the letter in the word


  3. i: represents the firs position of the letter


  4. c: the letter itself.


Once you create the list elements with:



elements = 
seen = set()
for i, c in enumerate(word[1:]):
if c not in seen:
elements.append((-1 * counts[c], word_count[c], i, c))
seen.add(c)


Note that as the characters must be unique we use a set (seen) to guarantee uniqueness. Finally you use heapq.nlargest to get the top n elements according to the above criterias.






share|improve this answer


























  • Zero explanation on how this works, or what heapq is

    – Moralous
    Nov 25 '18 at 17:00











  • That seems like exactly what I was looking for! Could you elaborate a bit on elements.append((-1 * counts[c], word_count[c], i, c)) and what heapq.nlargest(n, elements) does?

    – iuvbio
    Nov 25 '18 at 17:02











  • @iuvbio Updated the answer!

    – Daniel Mesejo
    Nov 25 '18 at 17:20














3












3








3







You could do something like this:



import heapq

from collections import Counter
from operator import itemgetter


def combination(word, n, counts):
word_count = Counter(word)
elements =
seen = set()
for i, c in enumerate(word[1:]):
if c not in seen:
elements.append((-1 * counts[c], word_count[c], i, c))
seen.add(c)
top = heapq.nlargest(n, elements)
characters = map(itemgetter(3), top)

return word[0] + ''.join(sorted(characters, key=lambda x: word.index(x)))


lst = ["apple", "pear", "banana", "xylophone", "bear", "banunu"]

counts = Counter(''.join(lst))

result = [combination(w, 2, counts) for w in lst]
print(result)


Output



['apl', 'per', 'ban', 'xyh', 'ber', 'bnu']


The idea is to create a tuple of the priority criterias representing each unique letter. So elements is a list containing tuples represeting:





  1. counts[c]: The overall count (as you want the rarest multiply by -1)


  2. word_count[c]: The specific count of the letter in the word


  3. i: represents the firs position of the letter


  4. c: the letter itself.


Once you create the list elements with:



elements = 
seen = set()
for i, c in enumerate(word[1:]):
if c not in seen:
elements.append((-1 * counts[c], word_count[c], i, c))
seen.add(c)


Note that as the characters must be unique we use a set (seen) to guarantee uniqueness. Finally you use heapq.nlargest to get the top n elements according to the above criterias.






share|improve this answer















You could do something like this:



import heapq

from collections import Counter
from operator import itemgetter


def combination(word, n, counts):
word_count = Counter(word)
elements =
seen = set()
for i, c in enumerate(word[1:]):
if c not in seen:
elements.append((-1 * counts[c], word_count[c], i, c))
seen.add(c)
top = heapq.nlargest(n, elements)
characters = map(itemgetter(3), top)

return word[0] + ''.join(sorted(characters, key=lambda x: word.index(x)))


lst = ["apple", "pear", "banana", "xylophone", "bear", "banunu"]

counts = Counter(''.join(lst))

result = [combination(w, 2, counts) for w in lst]
print(result)


Output



['apl', 'per', 'ban', 'xyh', 'ber', 'bnu']


The idea is to create a tuple of the priority criterias representing each unique letter. So elements is a list containing tuples represeting:





  1. counts[c]: The overall count (as you want the rarest multiply by -1)


  2. word_count[c]: The specific count of the letter in the word


  3. i: represents the firs position of the letter


  4. c: the letter itself.


Once you create the list elements with:



elements = 
seen = set()
for i, c in enumerate(word[1:]):
if c not in seen:
elements.append((-1 * counts[c], word_count[c], i, c))
seen.add(c)


Note that as the characters must be unique we use a set (seen) to guarantee uniqueness. Finally you use heapq.nlargest to get the top n elements according to the above criterias.







share|improve this answer














share|improve this answer



share|improve this answer








edited Nov 25 '18 at 17:20

























answered Nov 25 '18 at 16:49









Daniel MesejoDaniel Mesejo

16.8k21430




16.8k21430













  • Zero explanation on how this works, or what heapq is

    – Moralous
    Nov 25 '18 at 17:00











  • That seems like exactly what I was looking for! Could you elaborate a bit on elements.append((-1 * counts[c], word_count[c], i, c)) and what heapq.nlargest(n, elements) does?

    – iuvbio
    Nov 25 '18 at 17:02











  • @iuvbio Updated the answer!

    – Daniel Mesejo
    Nov 25 '18 at 17:20



















  • Zero explanation on how this works, or what heapq is

    – Moralous
    Nov 25 '18 at 17:00











  • That seems like exactly what I was looking for! Could you elaborate a bit on elements.append((-1 * counts[c], word_count[c], i, c)) and what heapq.nlargest(n, elements) does?

    – iuvbio
    Nov 25 '18 at 17:02











  • @iuvbio Updated the answer!

    – Daniel Mesejo
    Nov 25 '18 at 17:20

















Zero explanation on how this works, or what heapq is

– Moralous
Nov 25 '18 at 17:00





Zero explanation on how this works, or what heapq is

– Moralous
Nov 25 '18 at 17:00













That seems like exactly what I was looking for! Could you elaborate a bit on elements.append((-1 * counts[c], word_count[c], i, c)) and what heapq.nlargest(n, elements) does?

– iuvbio
Nov 25 '18 at 17:02





That seems like exactly what I was looking for! Could you elaborate a bit on elements.append((-1 * counts[c], word_count[c], i, c)) and what heapq.nlargest(n, elements) does?

– iuvbio
Nov 25 '18 at 17:02













@iuvbio Updated the answer!

– Daniel Mesejo
Nov 25 '18 at 17:20





@iuvbio Updated the answer!

– Daniel Mesejo
Nov 25 '18 at 17:20


















draft saved

draft discarded




















































Thanks for contributing an answer to Stack Overflow!


  • Please be sure to answer the question. Provide details and share your research!

But avoid



  • Asking for help, clarification, or responding to other answers.

  • Making statements based on opinion; back them up with references or personal experience.


To learn more, see our tips on writing great answers.




draft saved


draft discarded














StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53469457%2fshortest-unique-combination-of-at-least-three-characters-for-each-string-in-list%23new-answer', 'question_page');
}
);

Post as a guest















Required, but never shown





















































Required, but never shown














Required, but never shown












Required, but never shown







Required, but never shown

































Required, but never shown














Required, but never shown












Required, but never shown







Required, but never shown







Popular posts from this blog

A CLEAN and SIMPLE way to add appendices to Table of Contents and bookmarks

Calculate evaluation metrics using cross_val_predict sklearn

Insert data from modal to MySQL (multiple modal on website)