NLTK Tokenizer encoding issue
After tokenizing, my sentence contains many weird characters. How can I remove them?
This is my code:
import glob
import io
import nltk

def summary(filename, method):
    list_names = glob.glob(filename)
    orginal_data = []
    topic_data = []
    print(list_names)
    for file_name in list_names:
        article = []
        article_temp = io.open(file_name, "r", encoding="utf-8-sig").readlines()
        for line in article_temp:
            print(line)
            if (line.strip()):
                tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
                sentences = tokenizer.tokenize(line)
                print(sentences)
                article = article + sentences
        orginal_data.append(article)
        topic_data.append(preprocess_data(article))
    if (method == "orig"):
        summary = generate_summary_origin(topic_data, 100, orginal_data)
    elif (method == "best-avg"):
        summary = generate_summary_best_avg(topic_data, 100, orginal_data)
    else:
        summary = generate_summary_simplified(topic_data, 100, orginal_data)
    return summary
print(line) prints one line of the text file, and print(sentences) prints the sentences tokenized from that line.
But sometimes the sentences contain weird characters after NLTK's processing. For example, print(line) shows:
    Assaly, who is a fan of both Pusha T and Drake, said he and his friends
    wondered if people in the crowd might boo Pusha T during the show, but
    said he never imagined actual violence would take place.
while print(sentences) shows:
    [u'Assaly, who is a fan of both Pusha T and Drake, said he and his
    friends wondered if people in\xa0the crowd might boo Pusha\xa0T during
    the show, but said he never imagined actual violence would take
    place.']
As in the example above, where do the \xa0 and \xa0T come from?
python nlp nltk
\xa0 is a Unicode character representing a no-break space. Your original text probably contains part UTF-8 encoding, part unicode encoding. Try to re-encode your original text file to UTF-8.
– PEREZje
Nov 22 at 22:28
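A quick way to check what the character is, and why it only shows up in the second print: printing a string renders the no-break space as a blank, while printing a list shows each element's repr(), which escapes it. A minimal sketch, assuming Python 3 and a made-up sample string:

```python
import unicodedata

# Hypothetical sample line containing U+00A0 (not taken from the asker's file).
line = "people in\xa0the crowd"

# The Unicode name confirms what the character is.
print(unicodedata.name("\xa0"))

# Printing the string itself renders U+00A0 as a blank.
print(line)

# Printing a list uses repr() for each element, which escapes U+00A0.
print([line])
```

So the character is most likely already in the text file; the tokenizer does not add it.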
edited Nov 23 at 6:33 by Aqueous Carlos
asked Nov 22 at 22:20 by TIANLUN ZHU
1 Answer
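    import unicodedata

    x = u'Assaly, who is a fan of both Pusha T and Drake, said he and his friends wondered if people in\xa0the crowd might boo Pusha\xa0T during the show, but said he never imagined actual violence would take place.'

    # method 1: replace the no-break space with a regular space
    # (strings are immutable, so the result must be assigned)
    cleaned = x.replace(u'\xa0', u' ')

    # method 2: compatibility normalization maps U+00A0 to a regular space
    cleaned = unicodedata.normalize('NFKD', x)

    print(cleaned)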
Output:
Assaly, who is a fan of both Pusha T and Drake, said he and his friends wondered if people in the crowd might boo Pusha T during the show, but said he never imagined actual violence would take place.
Reference: unicodedata.normalize()
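If the goal is to clean the text before it reaches the tokenizer, the same normalization could be applied to each line as it is read. A minimal sketch, assuming Python 3; `clean_line` is a hypothetical helper name, the sample lines are made up, and NFKC is used here instead of NFKD so that accented letters stay composed (both forms map U+00A0 to a regular space):

```python
import unicodedata

def clean_line(line):
    """Return the line with compatibility characters such as U+00A0 normalized."""
    return unicodedata.normalize("NFKC", line)

# Hypothetical sample lines standing in for the result of readlines().
lines = ["people in\xa0the crowd might boo Pusha\xa0T\n", "caf\u00e9\n"]
cleaned = [clean_line(l) for l in lines]
print(cleaned)
```

Note that compatibility normalization also rewrites other characters (ligatures, full-width forms, and so on), so if only the no-break space should change, the plain replace() from method 1 is the narrower option.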
Thanks, the second method works
– TIANLUN ZHU
Nov 23 at 22:10
answered Nov 23 at 5:00
Srce Cde