Capturing repeating subpatterns in Python regex
While matching an email address, after I match something like yasar@webmail, I want to capture one or more of (\.\w+) (what I am doing is a little more complicated; this is just an example). I tried adding (\.\w+)+, but it only captures the last match. For example, yasar@webmail.something.edu.tr matches, but the only group captured after the yasar@webmail part is .tr, so I lose the .something and .edu groups. Can I do this with Python regular expressions, or would you suggest matching everything first and splitting the subpatterns later?
Tags: python, regex
asked Mar 19 '12 at 4:09 by yasar
Capturing repeated expressions was proposed in Python Issue 7132 but rejected. It is, however, supported by the third-party regex module. – Todd Owen, Oct 15 '18 at 0:27
@ToddOwen But isn't this now possible in 2.7? I don't know when it became possible, but the answer at stackoverflow.com/a/9765037/3541976 seems to work just fine for me in 2.7 using the re module. – Michael Ohlrogge, Nov 25 '18 at 0:22
@MichaelOhlrogge Issue 7132 is about what happens if the capturing parentheses are inside a repeat. The issue is not fixed, and re will still keep only the last match. A possible workaround, as mentioned in the answer you linked to, is to put the capturing parentheses around a repeating pattern. (Note that (?: ...) are not capturing parentheses.) – Todd Owen, Nov 28 '18 at 21:36
@ToddOwen Got it, thank you, that is a helpful clarification! – Michael Ohlrogge, Nov 29 '18 at 1:03
4 Answers
The re module doesn't support repeated captures (the third-party regex module does):
>>> import regex
>>> m = regex.match(r'([.\w]+)@((\w+)(\.\w+)+)', 'yasar@webmail.something.edu.tr')
>>> m.groups()
('yasar', 'webmail.something.edu.tr', 'webmail', '.tr')
>>> m.captures(4)
['.something', '.edu', '.tr']
In your case I'd go with splitting the repeated subpatterns later. It leads to simple and readable code; see, e.g., the code in @Li-aung Yip's answer.
edited May 23 '17 at 12:09 by Community♦
answered Mar 19 '12 at 5:22 by jfs
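For readers who cannot install the regex module, the list returned by m.captures(4) can be approximated with the standard library in two passes. This is only a sketch: the helper name repeated_captures and the slightly simplified, end-anchored pattern are assumptions made for illustration, not part of the answer above.

```python
import re

def repeated_captures(address):
    """Return every '.label' after the first domain label, or [] if no match."""
    # Pass 1: capture the whole repeated run in a single group (group 3).
    m = re.match(r'([.\w]+)@(\w+)((?:\.\w+)+)$', address)
    if m is None:
        return []
    # Pass 2: enumerate the individual repetitions inside that run.
    return re.findall(r'\.\w+', m.group(3))

print(repeated_captures('yasar@webmail.something.edu.tr'))
# ['.something', '.edu', '.tr']
```

The second pass reuses the same subpattern that was repeated in the first, which is why the two results line up; for a more complicated subpattern the same idea should still apply, as long as the inner pattern can be matched on its own.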
Out of curiosity, how do you write a replacement pattern when you match repeated captures? Does the meaning of \1, \2, \3 etc. change depending on how many times you matched (\.\w+)? – Li-aung Yip, Mar 19 '12 at 7:55
@Li-aung Yip: \1 corresponds to m.group(1); the meaning hasn't changed. You could use a function as a replacement pattern and call m.captures() in it. – jfs, Mar 19 '12 at 9:03
In your example, the meaning of \1, \2, and \3 is obvious because they only capture once. But what is the meaning of \4, corresponding to (\.\w+)+? \4 appears to be "the last substring matched by the 4th capture group", in this case .tr. – Li-aung Yip, Mar 19 '12 at 9:12
@Li-aung Yip: m.groups() above explicitly shows what \4 is. – jfs, Mar 19 '12 at 9:13
The meaning hasn't changed: \4 is m.group(4), whatever it is. – jfs, Mar 19 '12 at 9:21
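The "function as a replacement pattern" idea from the comments can be sketched with the standard re module as well. Since re has no captures() method, the callback below rebuilds the list of repetitions itself. The names DOMAIN_TAIL and upper_tld, and the choice to upper-case the last label, are purely illustrative assumptions.

```python
import re

# Group 2 wraps the whole repeated run so the callback can see all of it.
DOMAIN_TAIL = re.compile(r'(?<=@)(\w+)((?:\.\w+)+)')

def upper_tld(m):
    # re keeps no per-repetition list, so split the run again here.
    labels = re.findall(r'\.\w+', m.group(2))
    labels[-1] = labels[-1].upper()
    return m.group(1) + ''.join(labels)

result = DOMAIN_TAIL.sub(upper_tld, 'yasar@webmail.something.edu.tr')
print(result)  # yasar@webmail.something.edu.TR
```

A string template such as r'\1' could not express this, because a backreference only ever sees one value per group; the callback is what gives access to each repetition.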
This will work:
>>> import re
>>> regexp = r"[\w.]+@(\w+)(\.\w+)?(\.\w+)?(\.\w+)?(\.\w+)?(\.\w+)?"
>>> email_address = "william.adama@galactica.caprica.fleet.mil"
>>> m = re.match(regexp, email_address)
>>> m.groups()
('galactica', '.caprica', '.fleet', '.mil', None, None)
But it's limited to a maximum of six subgroups. A better way to do this would be:
>>> m = re.match(r"[\w.]+@(.+)", email_address)
>>> m.groups()
('galactica.caprica.fleet.mil',)
>>> m.group(1).split('.')
['galactica', 'caprica', 'fleet', 'mil']
Note that regexps are fine so long as the email addresses are simple, but there are all kinds of addresses that this will break on. See this question for a detailed treatment of email address regexes.
edited May 23 '17 at 11:46 by Community♦
answered Mar 19 '12 at 4:50 by Li-aung Yip
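To make the six-subgroup limit concrete, here is a small sketch comparing the two approaches on a longer domain. The address below is made up for illustration and is not from the original question.

```python
import re

SIX_GROUPS = re.compile(r"[\w.]+@(\w+)(\.\w+)?(\.\w+)?(\.\w+)?(\.\w+)?(\.\w+)?")
address = "user@a.b.c.d.e.f.g.h"  # hypothetical address with eight labels

# The fixed pattern still matches, but it stops after six labels.
fixed = SIX_GROUPS.match(address).groups()

# Capturing everything after '@' and splitting keeps all of them.
split = re.match(r"[\w.]+@(.+)", address).group(1).split(".")

print(fixed)  # ('a', '.b', '.c', '.d', '.e', '.f')
print(split)  # ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h']
```

Note that the fixed pattern fails silently here: re.match does not require the whole string to be consumed, so the trailing .g.h is simply ignored rather than reported.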
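You can fix the problem of (\.\w+)+ only capturing the last match by doing this instead: ((?:\.\w+)+)
The non-capturing (?:...) group does the repeating, and the outer parentheses capture the whole run (here .something.edu.tr) as a single group.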
answered Mar 19 '12 at 4:28 by Taymon
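A short side-by-side sketch of the two patterns may help. The local-part and first-label prefix (\w+@\w+) is an assumption added so that the patterns match the example address from the question.

```python
import re

address = 'yasar@webmail.something.edu.tr'

# Capturing group inside the repeat: only the final repetition survives.
inside = re.match(r'\w+@\w+(\.\w+)+', address)
print(inside.group(1))   # .tr

# Capturing group around a non-capturing repeat: the whole run is kept.
around = re.match(r'\w+@\w+((?:\.\w+)+)', address)
print(around.group(1))   # .something.edu.tr
```

Both patterns match exactly the same text; they differ only in what group 1 reports afterwards.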
For abbreviations (if you've lower-cased): re.sub(ur'((?:[a-z]\.){2,})', lambda m: m.group(1).replace('.', ''), text) – scharfmn, Aug 15 '15 at 9:58
Thanks. Adding parentheses allowed me to match a repeated subpattern, but then there was a group in the match containing only the last repetition of the pattern. I hadn't seen that (?: ...) makes a non-capturing group. docs.python.org/2/library/re.html#regular-expression-syntax Adding that fixes the problem. – Tim Swast, Jul 21 '16 at 22:22
Thank you @TimSwast, this was exactly the comment and reference I needed! – Michael Ohlrogge, Nov 24 '18 at 18:00
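The abbreviation one-liner in the first comment uses the ur'' string prefix, which is Python 2 syntax and is rejected by Python 3. A Python 3 sketch of the same idea follows; the function name strip_abbreviation_dots and the sample sentence are assumptions for illustration.

```python
import re

def strip_abbreviation_dots(text):
    # Group 1 wraps two or more "letter + dot" repetitions, e.g. "u.s.a."
    return re.sub(r'((?:[a-z]\.){2,})',
                  lambda m: m.group(1).replace('.', ''),
                  text)

print(strip_abbreviation_dots('made in the u.s.a. by n.a.s.a. staff'))
# made in the usa by nasa staff
```

As in the comment, this assumes the text has already been lower-cased; upper-case abbreviations would need [A-Za-z] or the re.IGNORECASE flag.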
This is what you are looking for:
>>> import re
>>> s = "yasar@webmail.something.edu.tr"
>>> r = re.compile(r"\.\w+")
>>> m = r.findall(s)
>>> m
['.something', '.edu', '.tr']
answered Oct 4 '17 at 18:22 by Tushar Vazirani
This doesn't match the yasar@webmail part. As such, it could easily pick up false positives wherever the text contains things other than email addresses with multiple periods separating them. – Michael Ohlrogge, Nov 24 '18 at 18:07
OP has clearly written that this is just an example and that what he is trying to do is more complicated. Hence, my answer. – Tushar Vazirani, Nov 24 '18 at 18:09
Yes, but the problem is that your solution won't work even on the simplified version of the problem OP gave. Your solution is trivially simple for anyone with even the most basic understanding of regex. All the other answers are more complicated because this is a genuinely non-trivial problem to solve. – Michael Ohlrogge, Nov 24 '18 at 18:31
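The false-positive concern raised in the comments can be illustrated with a short sketch. The sample text and the EMAIL pattern are assumptions chosen for the demonstration; the pattern is deliberately simple and is not a full email validator.

```python
import re

EMAIL = re.compile(r'[\w.]+@\w+((?:\.\w+)+)')

text = 'See notes.v2.txt or write to yasar@webmail.something.edu.tr today.'

# A bare findall also picks up dotted names that are not email addresses.
bare = re.findall(r'\.\w+', text)
print(bare)
# ['.v2', '.txt', '.something', '.edu', '.tr']

# Matching the address first keeps only the pieces that follow an '@' part.
parts = [re.findall(r'\.\w+', m.group(1)) for m in EMAIL.finditer(text)]
print(parts)
# [['.something', '.edu', '.tr']]
```

The second form returns one list per address found, so it also keeps the pieces of different addresses apart when the text contains more than one.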