NLTK Tokenizer encoding issue
After tokenizing, my sentence contains many weird characters. How can I remove them?
This is my code:



import glob
import io
import nltk

def summary(filename, method):
    list_names = glob.glob(filename)
    orginal_data = []
    topic_data = []
    print(list_names)
    for file_name in list_names:
        article = []
        article_temp = io.open(file_name, "r", encoding="utf-8-sig").readlines()
        for line in article_temp:
            print(line)
            if line.strip():
                tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
                sentences = tokenizer.tokenize(line)
                print(sentences)
                article = article + sentences
        orginal_data.append(article)
        topic_data.append(preprocess_data(article))
    if method == "orig":
        summary = generate_summary_origin(topic_data, 100, orginal_data)
    elif method == "best-avg":
        summary = generate_summary_best_avg(topic_data, 100, orginal_data)
    else:
        summary = generate_summary_simplified(topic_data, 100, orginal_data)
    return summary


The print(line) call prints one line of the text file, and print(sentences) prints the sentences tokenized from that line.



But sometimes the sentences contain weird characters after NLTK's processing.



Assaly, who is a fan of both Pusha T and Drake, said he and his friends 
wondered if people in the crowd might boo Pusha T during the show, but
said he never imagined actual violence would take place.

[u'Assaly, who is a fan of both Pusha T and Drake, said he and his
friends wondered if people in\xa0the crowd might boo Pusha\xa0T during
the show, but said he never imagined actual violence would take
place.']


As in the example above, where do the \xa0 characters (as in "in\xa0the" and "Pusha\xa0T") come from?
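For reference, a minimal sketch (not part of the original post) that inspects the non-ASCII characters in one of the tokenized sentences; it shows that the character in question is U+00A0, the no-break space:

import unicodedata

sentence = u'people in\xa0the crowd might boo Pusha\xa0T'
for ch in sentence:
    if ord(ch) > 127:
        # print the character's repr and its official Unicode name
        print(repr(ch), unicodedata.name(ch))
# prints something like: '\xa0' NO-BREAK SPACE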
python nlp nltk
asked Nov 22 at 22:20 by TIANLUN ZHU, edited Nov 23 at 6:33 by Aqueous Carlos

  • \xa0 is a Unicode character representing a no-break space. Your original text probably contains a mix of encodings. Try re-encoding your original text file to UTF-8.
    – PEREZje
    Nov 22 at 22:28
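Following the comment's suggestion, one possible sketch (illustrative, not from the original post; the helper name read_clean_lines is hypothetical) is to strip the no-break spaces as each line is read, before handing the text to the tokenizer:

import io

def read_clean_lines(path):
    # utf-8-sig also drops a leading BOM if the file has one
    with io.open(path, "r", encoding="utf-8-sig") as f:
        for line in f:
            # U+00A0 (no-break space) is common in text copied from web pages;
            # replace it with an ordinary space before tokenization
            yield line.replace(u'\xa0', u' ')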
1 Answer
x = u'Assaly, who is a fan of both Pusha T and Drake, said he and his friends wondered if people in\xa0the crowd might boo Pusha\xa0T during the show, but said he never imagined actual violence would take place.'

# method 1: replace the no-break space with a regular space
x = x.replace(u'\xa0', u' ')

# method 2: Unicode normalization (NFKD maps \xa0 to an ordinary space)
import unicodedata
x = unicodedata.normalize('NFKD', x)

print(x)


Output:



Assaly, who is a fan of both Pusha T and Drake, said he and his friends wondered if people in the crowd might boo Pusha T during the show, but said he never imagined actual violence would take place.


Reference: unicodedata.normalize()
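As a usage note (a sketch, assuming the rest of the question's summary function stays unchanged; the helper name clean_line is hypothetical), the same normalization can be applied to each line before sentence tokenization:

import unicodedata

def clean_line(line):
    # NFKD maps the no-break space (U+00A0) to an ordinary space
    return unicodedata.normalize('NFKD', line)

# e.g. inside the question's loop:
#     sentences = tokenizer.tokenize(clean_line(line))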
answered Nov 23 at 5:00 by Srce Cde
  • Thanks, the second method works
    – TIANLUN ZHU
    Nov 23 at 22:10