Capturing repeating subpatterns in Python regex












21














While matching an email address, after I match something like yasar@webmail, I want to capture one or more occurrences of (\.\w+) (what I am doing is a little more complicated; this is just an example). I tried adding (\.\w+)+, but it only captures the last match. For example, yasar@webmail.something.edu.tr matches, but the group only contains .tr after the yasar@webmail part, so I lose the .something and .edu groups. Can I do this with Python regular expressions, or would you suggest matching everything first and splitting the subpatterns later?




























  • 1




    Capturing repeated expressions was proposed in Python Issue 7132 but rejected. It is however supported by the third-party regex module.
    – Todd Owen
    Oct 15 '18 at 0:27












  • @ToddOwen But, isn't this now possible in 2.7? I don't know when it became possible. But, the answer from stackoverflow.com/a/9765037/3541976 seems to work just fine for me in 2.7 using the re module.
    – Michael Ohlrogge
    Nov 25 '18 at 0:22








  • 1




    @MichaelOhlrogge Issue 7132 is about what happens if the capturing parentheses are inside a repeat. The issue is not fixed, and will still only keep the last match. A possible workaround, as mentioned in the answer you linked to, is to put the capturing parentheses around a repeating pattern. (Note that (?: ...) are not capturing parentheses).
    – Todd Owen
    Nov 28 '18 at 21:36










  • @ToddOwen Got it, thank you, that is a helpful clarification!
    – Michael Ohlrogge
    Nov 29 '18 at 1:03
















python regex






asked Mar 19 '12 at 4:09









yasar








4 Answers


















24














The re module doesn't support repeated captures (the third-party regex module does):



>>> import regex
>>> m = regex.match(r'([.\w]+)@((\w+)(\.\w+)+)', 'yasar@webmail.something.edu.tr')
>>> m.groups()
('yasar', 'webmail.something.edu.tr', 'webmail', '.tr')
>>> m.captures(4)
['.something', '.edu', '.tr']


In your case I'd go with splitting the repeated subpatterns later. It leads to simple, readable code; see, for example, the code in @Li-aung Yip's answer.
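
For readers limited to the standard library, the effect of m.captures(4) can be approximated in two passes. The following is only a sketch under that assumption; the helper name repeated_captures is made up for illustration and is not part of re or regex.

```python
import re

# Group 2 captures the whole domain; the inner repetition is non-capturing,
# so nothing is overwritten on each iteration of the repeat.
ADDRESS = re.compile(r'([.\w]+)@(\w+(?:\.\w+)+)')

def repeated_captures(address):
    """Hypothetical helper: return every '.label' piece after the first label."""
    m = ADDRESS.fullmatch(address)
    if m is None:
        return []
    # Second pass: enumerate the repetitions inside the captured domain.
    return re.findall(r'\.\w+', m.group(2))

print(repeated_captures('yasar@webmail.something.edu.tr'))
```

The second pass is cheap because it only scans the substring that the first pattern has already accepted as a domain.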





























  • Out of curiosity, how do you write a replacement pattern when you match repeated captures? Does the meaning of \1, \2, \3 etc. change depending on how many times you matched (\.\w+)?
    – Li-aung Yip
    Mar 19 '12 at 7:55










  • @Li-aung Yip: \1 corresponds to m.group(1); the meaning hasn't changed. You could use a function as the replacement and call m.captures() in it.
    – jfs
    Mar 19 '12 at 9:03










  • In your example, the meaning of \1, \2, and \3 is obvious because they only capture once. But what is the meaning of \4, corresponding to (\.\w+)+? \4 appears to be "the last substring matched by the 4th capture group", in this case .tr.
    – Li-aung Yip
    Mar 19 '12 at 9:12










  • @Li-aung Yip: m.groups() above explicitly shows what \4 is.
    – jfs
    Mar 19 '12 at 9:13










  • The meaning hasn't changed: \4 is m.group(4), whatever it is.
    – jfs
    Mar 19 '12 at 9:21



















12














This will work:



>>> import re
>>> regexp = r"[\w.]+@(\w+)(\.\w+)?(\.\w+)?(\.\w+)?(\.\w+)?(\.\w+)?"
>>> email_address = "william.adama@galactica.caprica.fleet.mil"
>>> m = re.match(regexp, email_address)
>>> m.groups()
('galactica', '.caprica', '.fleet', '.mil', None, None)


But it's limited to a maximum of six subgroups. A better way to do this would be:



>>> m = re.match(r"[\w.]+@(.+)", email_address)
>>> m.groups()
('galactica.caprica.fleet.mil',)
>>> m.group(1).split('.')
['galactica', 'caprica', 'fleet', 'mil']


Note that regexps are fine so long as the email addresses are simple, but there are all kinds of valid addresses that this pattern will fail on. See this question for a detailed treatment of email address regexes.
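
The match-then-split approach can be wrapped in a small function. This is a sketch, not a validator for real-world addresses; the name domain_labels and the decision to reject empty labels are assumptions made for the example.

```python
import re

def domain_labels(address):
    """Hypothetical helper: return the dot-separated labels of the domain part."""
    # Simplified pattern: local part, '@', then everything that follows.
    m = re.fullmatch(r'[\w.]+@(.+)', address)
    if m is None:
        raise ValueError(f'not a simple address: {address!r}')
    labels = m.group(1).split('.')
    # '.+' also accepts input such as 'a@b..c'; reject empty labels here.
    if not all(labels):
        raise ValueError(f'empty label in domain: {address!r}')
    return labels

print(domain_labels('william.adama@galactica.caprica.fleet.mil'))
```

Unlike the version with a fixed number of optional groups, this places no upper bound on how many labels the domain may have.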





































    6














    You can fix the problem of (\.\w+)+ only capturing the last match by doing this instead: ((?:\.\w+)+)
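
To make the difference concrete, here is a short comparison using the question's example address. It is only an illustration; the variable names are arbitrary.

```python
import re

address = 'yasar@webmail.something.edu.tr'

# Capturing group inside the repeat: each iteration overwrites the group.
inner = re.match(r'[.\w]+@\w+(\.\w+)+', address)
print(inner.group(1))   # only the last repetition survives

# Capturing group around a non-capturing repeat: the whole run is kept.
outer = re.match(r'[.\w]+@\w+((?:\.\w+)+)', address)
print(outer.group(1))   # every repetition, as one string

# The individual pieces can then be recovered with an ordinary split.
pieces = outer.group(1).lstrip('.').split('.')
print(pieces)
```

Note that the outer group still yields a single string, so a split (or a second findall) is needed to get the repetitions as separate items.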























    • 2




      For abbreviations (if you've lower-cased): re.sub(r'((?:[a-z]\.){2,})', lambda m: m.group(1).replace('.', ''), text)
      – scharfmn
      Aug 15 '15 at 9:58






    • 1




      Thanks. Adding parentheses allowed me to match a repeated subpattern, but then the match contained a group holding only the last repetition of the pattern. I hadn't seen that (?: ...) makes a non-capturing group. docs.python.org/2/library/re.html#regular-expression-syntax Adding that fixes the problem.
      – Tim Swast
      Jul 21 '16 at 22:22










    • Thank you @TimSwast, this was exactly the comment and reference I needed!
      – Michael Ohlrogge
      Nov 24 '18 at 18:00



















    2














    This is what you are looking for:



    >>> import re
    >>> s = "yasar@webmail.something.edu.tr"
    >>> r = re.compile(r"\.\w+")
    >>> m = r.findall(s)
    >>> m
    ['.something', '.edu', '.tr']
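
One caveat is that findall scans the whole string, so on free text it can return pieces that do not belong to an address. Below is a sketch of one way to scope the search, assuming the simplified address pattern used elsewhere on this page; the sample text and variable names are invented for the example.

```python
import re

text = 'Release v1.2 notes: mail yasar@webmail.something.edu.tr today'

# Unscoped: also picks up the '.2' from the version number.
unscoped = re.findall(r'\.\w+', text)
print(unscoped)

# Scoped: find each address first, then search only inside its domain.
ADDRESS = re.compile(r'[.\w]+@(\w+(?:\.\w+)+)')
scoped = [re.findall(r'\.\w+', m.group(1)) for m in ADDRESS.finditer(text)]
print(scoped)
```

The scoped version returns one list per address found, which keeps the pieces of different addresses apart.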


























    • This doesn't match the yasar@webmail part. As such, it could easily pick up false positives wherever something other than an email address contains several period-separated parts.
      – Michael Ohlrogge
      Nov 24 '18 at 18:07










    • OP has clearly written that this is just an example and what he is trying to do is more complicated. Hence, my answer.
      – Tushar Vazirani
      Nov 24 '18 at 18:09










    • Yes, but the problem is that your solution won't work even on the simplified version of the problem OP gave. Your solution is trivially simple for anyone with even the most basic understanding of RegEx. All other answers are more complicated because this is a genuinely non-trivial problem to solve.
      – Michael Ohlrogge
      Nov 24 '18 at 18:31











    Answer scored 24: answered Mar 19 '12 at 5:22 by jfs (edited May 23 '17 at 12:09 by Community)












    Answer scored 12: answered Mar 19 '12 at 4:50 by Li-aung Yip (edited May 23 '17 at 11:46 by Community)























    Answer scored 6: answered Mar 19 '12 at 4:28 by Taymon








            • 2




              For abbreviations (if you've lower-cased): re.sub(ur'((?:[a-z].){2,})', lambda m: m.group(1).replace('.', ''), text)
              – scharfmn
              Aug 15 '15 at 9:58






            • 1




              Thanks. I was able adding parentheses allowed me to match a repeated subpattern, but then there was a group in the match with the last one of the pattern. I hadn't seen that (?: ...) makes a non-capturing group. docs.python.org/2/library/re.html#regular-expression-syntax Adding that fixes that problem.
              – Tim Swast
              Jul 21 '16 at 22:22










            • Thank you@TimSwast this was exactly the comment and reference I needed!
              – Michael Ohlrogge
              Nov 24 '18 at 18:00














            • 2




              For abbreviations (if you've lower-cased): re.sub(ur'((?:[a-z].){2,})', lambda m: m.group(1).replace('.', ''), text)
              – scharfmn
              Aug 15 '15 at 9:58






            • 1




              Thanks. I was able adding parentheses allowed me to match a repeated subpattern, but then there was a group in the match with the last one of the pattern. I hadn't seen that (?: ...) makes a non-capturing group. docs.python.org/2/library/re.html#regular-expression-syntax Adding that fixes that problem.
              – Tim Swast
              Jul 21 '16 at 22:22










            • Thank you@TimSwast this was exactly the comment and reference I needed!
              – Michael Ohlrogge
              Nov 24 '18 at 18:00



















            2














            This is what you are looking for:



            >>> import re

            >>> s="yasar@webmail.something.edu.tr"
            >>> r=re.compile(r"\.\w+")
            >>> m=r.findall(s)

            >>> m
            ['.something', '.edu', '.tr']


























            • This doesn't match the yasar@webmail part. As a result, it could easily produce false positives on text that is not an email address but contains several period-separated words.
              – Michael Ohlrogge
              Nov 24 '18 at 18:07










            • OP has clearly written that this is just an example and what he is trying to do is more complicated. Hence, my answer.
              – Tushar Vazirani
              Nov 24 '18 at 18:09










            • Yes, but the problem is that your solution won't work even on the simplified version of the problem OP gave. Your solution is trivially simple for anyone with even the most basic understanding of RegEx. All other answers are more complicated because this is a genuinely non-trivial problem to solve.
              – Michael Ohlrogge
              Nov 24 '18 at 18:31
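One way to address the objection raised in these comments is to match the whole address first and only then split the repeated part. The following is a rough sketch of that two-step approach, not a complete email validator: the `ADDRESS` pattern is deliberately simplified and the helper name `domain_parts` is hypothetical.

```python
import re

# Simplified, illustrative address pattern: local part, "@", first label,
# then one or more ".label" repetitions captured together as group 3.
ADDRESS = re.compile(r'(\w+)@(\w+)((?:\.\w+)+)')

def domain_parts(text):
    """Return the '.label' pieces of the first address in text, or [] if none."""
    match = ADDRESS.search(text)
    if match is None:
        return []
    # Split the captured run into its individual repetitions.
    return re.findall(r'\.\w+', match.group(3))

print(domain_parts("yasar@webmail.something.edu.tr"))  # ['.something', '.edu', '.tr']
print(domain_parts("version 1.2.3 has no address"))     # []
```

Because the split only runs on text that already matched the address pattern, strings with periods but no address no longer produce results.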


























            answered Oct 4 '17 at 18:22









            Tushar Vazirani

            40539





















