NLTK Tokenizer encoding issue

After tokenizing, my sentence contains many weird characters. How can I remove them?
This is my code:

import glob
import io

import nltk


def summary(filename, method):
    list_names = glob.glob(filename)
    original_data = []
    topic_data = []
    print(list_names)

    # Load the Punkt sentence tokenizer once instead of once per line
    tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')

    for file_name in list_names:
        article = []
        # utf-8-sig strips a leading byte order mark, if present
        with io.open(file_name, "r", encoding="utf-8-sig") as f:
            article_temp = f.readlines()
        for line in article_temp:
            print(line)
            if line.strip():
                sentences = tokenizer.tokenize(line)
                print(sentences)
                article = article + sentences
        original_data.append(article)
        topic_data.append(preprocess_data(article))

    # preprocess_data and the generate_summary_* functions are defined elsewhere
    if method == "orig":
        summary = generate_summary_origin(topic_data, 100, original_data)
    elif method == "best-avg":
        summary = generate_summary_best_avg(topic_data, 100, original_data)
    else:
        summary = generate_summary_simplified(topic_data, 100, original_data)
    return summary

print(line) prints one line of the text file, and print(sentences) prints the sentences tokenized from that line.

But sometimes the sentences contain weird characters after NLTK's processing.

Assaly, who is a fan of both Pusha T and Drake, said he and his friends wondered if people in the crowd might boo Pusha T during the show, but said he never imagined actual violence would take place.

[u'Assaly, who is a fan of both Pusha T and Drake, said he and his friends wondered if people in\xa0the crowd might boo Pusha\xa0T during the show, but said he never imagined actual violence would take place.']

As in the example above, where do the \xa0 sequences in "in\xa0the" and "Pusha\xa0T" come from?

asked Nov 22 at 22:20 by TIANLUN ZHU
edited Nov 23 at 6:33 by Aqueous Carlos

\xa0 is the escaped form of U+00A0, the Unicode no-break space character. Your original text probably mixes encodings. Try re-encoding your original text file as UTF-8.
– PEREZje
Nov 22 at 22:28
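
A minimal sketch of why the escape shows up only in the list output, assuming the source file really contains no-break spaces (the sample string below is made up for illustration). Printing a list shows the repr() of each element, and repr() escapes U+00A0, whereas printing the string itself writes the character, which looks like an ordinary space:

```python
import unicodedata

# Hypothetical sample line containing a no-break space (U+00A0)
line = u"people in\u00a0the crowd"

# A list is displayed using repr() of its elements, which escapes U+00A0
shown_in_list = repr([line])
print(shown_in_list)

# The character behind the escape sequence
nbsp_name = unicodedata.name(u"\u00a0")
print(nbsp_name)
```

So the tokenizer is not adding anything here; the character is already in the input line and only becomes visible once the sentences are printed as a list.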

python nlp nltk

edited Nov 23 at 6:33

Aqueous Carlos

289213

asked Nov 22 at 22:20

TIANLUN ZHU

567

edited Nov 23 at 6:33

Aqueous Carlos

289213

asked Nov 22 at 22:20

TIANLUN ZHU

567

edited Nov 23 at 6:33

Aqueous Carlos

289213

edited Nov 23 at 6:33

Aqueous Carlos

289213

edited Nov 23 at 6:33

Aqueous Carlos

289213

asked Nov 22 at 22:20

TIANLUN ZHU

567

asked Nov 22 at 22:20

TIANLUN ZHU

567

asked Nov 22 at 22:20

TIANLUN ZHU

567

xa0 is an unicode character representing a no-break space. Your original text probably contains part UTF-8 encoding, part unicode encoding. Try to re-encode your original text file to UTF-8.
– PEREZje
Nov 22 at 22:28

add a comment |

xa0 is an unicode character representing a no-break space. Your original text probably contains part UTF-8 encoding, part unicode encoding. Try to re-encode your original text file to UTF-8.
– PEREZje
Nov 22 at 22:28

xa0 is an unicode character representing a no-break space. Your original text probably contains part UTF-8 encoding, part unicode encoding. Try to re-encode your original text file to UTF-8.
– PEREZje
Nov 22 at 22:28

add a comment |

1 Answer

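import unicodedata

x = u'Assaly, who is a fan of both Pusha T and Drake, said he and his friends wondered if people in\xa0the crowd might boo Pusha\xa0T during the show, but said he never imagined actual violence would take place.'

# method 1: replace the no-break space (U+00A0) with a regular space
cleaned = x.replace(u'\xa0', u' ')

# method 2: compatibility normalization also maps U+00A0 to a regular space
cleaned = unicodedata.normalize('NFKD', x)

# Strings are immutable, so print the returned value rather than x
print(cleaned)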

Output:

Assaly, who is a fan of both Pusha T and Drake, said he and his friends wondered if people in the crowd might boo Pusha T during the show, but said he never imagined actual violence would take place.

Reference: unicodedata.normalize()
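
One caveat worth illustrating: NFKD also decomposes accented letters into a base letter plus a combining mark, which may or may not be wanted before tokenizing. The sketch below compares a targeted replace, NFKC (a swapped-in alternative to the NFKD form used in the answer), and NFKD. The helper names `replace_nbsp` and `normalize_compat` and the sample string are hypothetical, chosen only for this example:

```python
import unicodedata


def replace_nbsp(text):
    """Replace only the no-break space, leaving everything else untouched."""
    return text.replace(u"\u00a0", u" ")


def normalize_compat(text):
    """Apply NFKC: maps U+00A0 to a space and keeps accented letters composed."""
    return unicodedata.normalize("NFKC", text)


# Hypothetical sample with an accented letter and a no-break space
sample = u"caf\u00e9 in\u00a0the crowd"

targeted = replace_nbsp(sample)
nfkc = normalize_compat(sample)
nfkd = unicodedata.normalize("NFKD", sample)

# NFKD splits the accented letter into two code points, so it is one longer
print(len(targeted), len(nfkc), len(nfkd))
```

In the question's loop, either helper could be applied to each line before it is passed to tokenizer.tokenize(line).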

answered Nov 23 at 5:00 by Srce Cde

Thanks, the second method works
– TIANLUN ZHU
Nov 23 at 22:10


