Tokenisation List Comprehension
I've created this code with the aim of using a large sample of a corpus to establish the extent to which vocabulary size is reduced when both number and case normalisation is applied.
def vocabulary_size(sentences):
    tok_counts = {}
    for sentence in sentences:
        for token in sentence:
            tok_counts[token] = tok_counts.get(token, 0) + 1
    return len(tok_counts.keys())

rcr = ReutersCorpusReader()

sample_size = 10000

raw_sentences = rcr.sample_raw_sents(sample_size)
tokenised_sentences = [word_tokenize(sentence) for sentence in raw_sentences]
lowered_sentences = [sentence.lower() for sentence in tokenised_sentences] # something going wrong here
normalised_sentences = [normalise(sentence) for sentence in tokenised_sentences] # something going wrong here

raw_vocab_size = vocabulary_size(tokenised_sentences)
normalised_vocab_size = vocabulary_size(normalised_sentences)

print("Normalisation produced a {0:.2f}% reduction in vocabulary size from {1} to {2}".format(
    100*(raw_vocab_size - normalised_vocab_size)/raw_vocab_size, raw_vocab_size, normalised_vocab_size))
Though as it stands it only prints each individual character. I think I have localised the problem to two lines: a list has no attribute .lower(), so I'm not sure how I would replace it.
I also think I may have to feed lowered_sentences into my normalised_sentences.
Here is my normalise function:
def normalise(token):
    print(["NUM" if token.isdigit()
           else "Nth" if re.fullmatch(r"\d+(st|nd|rd|th)", token)
           else token for token in token])
Though I may not even be meant to make use of this specific normalise function. Perhaps I'm attacking this the wrong way; my apologies, I shall be back with more information.
Tags: python, python-3.x, token, list-comprehension
asked Nov 24 '18 at 14:26 by bemzoo, edited Nov 24 '18 at 15:19 by usr2564301
Welcome to StackOverflow. What exactly is your question for us?
– Rory Daulton
Nov 24 '18 at 14:39
A text suddenly printing as separate characters usually means that you applied a list function to a single text string, where you should have fed it a list of text strings.
– usr2564301
Nov 24 '18 at 14:40
Do not add an answer in your question. If you want, and if it's sufficiently different from the one(s) given, you can always add it as a proper answer. I removed it from your question; if you still need the code, it's in the edit history.
– usr2564301
Nov 24 '18 at 15:18
2 Answers
I see a few things that should clear this up for you.
lowered_sentences = [tokenised_sentences.lower() for sentence in tokenised_sentences] # something going wrong here
normalised_sentences = [normalise(tokenised_sentences) for sentence in tokenised_sentences] # something going wrong here
Here you've forgotten to actually use the correct variable; you probably meant
lowered_sentences = [sentence.lower() for sentence in tokenised_sentences]
normalised_sentences = [normalise(sentence) for sentence in tokenised_sentences]
Also, since a list doesn't have the function lower(), you'd have to apply it to every token in each sentence, i.e.
lowered_sentences = [[token.lower() for token in sentence] for sentence in tokenised_sentences]
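For example, with some made-up tokenised sentences:

>>> sents = [["The", "Cat", "sat"], ["IBM", "shares", "ROSE"]]
>>> [[token.lower() for token in sentence] for sentence in sents]
[['the', 'cat', 'sat'], ['ibm', 'shares', 'rose']]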
Also, your normalise(token) is not returning anything, just using print. So the list comprehension

normalised_sentences = [normalise(tokenised_sentences) for sentence in tokenised_sentences] # something going wrong here

does not produce a list of anything but None.
I'd suggest you refrain from using list comprehensions and start off with normal for loops until you have your algorithm in place; convert them later if speed is needed.
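For instance, here is a minimal loop-based sketch of normalise, assuming the regex was meant to be \d+ rather than [d]+ (backslashes are easily lost in formatting) and that the function should return its result instead of printing it:

import re

def normalise(sentence):
    # Return a new list of tokens, replacing digit strings with "NUM"
    # and ordinals like "1st"/"2nd" with "Nth".
    normalised = []
    for token in sentence:
        if token.isdigit():
            normalised.append("NUM")
        elif re.fullmatch(r"\d+(st|nd|rd|th)", token):
            normalised.append("Nth")
        else:
            normalised.append(token)
    return normalised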
answered Nov 24 '18 at 14:51 by IAmBullsaw
You appear to be using the wrong variable in your comprehensions:
# Wrong
lowered_sentences = [tokenised_sentences.lower() for sentence in tokenised_sentences]
normalised_sentences = [normalise(tokenised_sentences) for sentence in tokenised_sentences]
# Right
lowered_sentences = [sentence.lower() for sentence in tokenised_sentences]
normalised_sentences = [normalise(sentence) for sentence in tokenised_sentences]
However, if you want to normalise your lower-cased sentences, that line needs to change too:
# Right
lowered_sentences = [sentence.lower() for sentence in tokenised_sentences]
normalised_sentences = [normalise(sentence) for sentence in lowered_sentences]
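Note that each sentence here is itself a list of tokens, so sentence.lower() will still raise AttributeError; a rough sketch that lowers per token first, assuming a normalise that accepts and returns a list of tokens, would be:

# Lower-case token by token (lists have no .lower()), then normalise
# the lowered sentences rather than the raw tokenised ones.
lowered_sentences = [[token.lower() for token in sentence]
                     for sentence in tokenised_sentences]
normalised_sentences = [normalise(sentence) for sentence in lowered_sentences]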
answered Nov 24 '18 at 14:41 by Jonah Bishop, edited Nov 24 '18 at 14:50

Yes, sorry, I did originally have this but was just switching loads of things around to see if it was a basic syntax problem. The problem persists that I'm not going through the list but applying the operation to the list as a whole, which I will try to fix now
– bemzoo
Nov 24 '18 at 14:48
I've updated my answer to include a fix for another potential error.
– Jonah Bishop
Nov 24 '18 at 14:51