str.normalize not doing anything in pandas

I have a pandas DataFrame, loaded from a CSV, with one column that contains escaped Unicode characters like \u00ca. The str.normalize() method should take care of these, but it's not working, even though unicodedata.normalize handles the same string:

import unicodedata

s = 'BC - CPE LE H\u00caTRE INC.'
unicodedata.normalize('NFKD', s)

>> 'BC - CPE LE HÊTRE INC.'

But it doesn't work when the strings are in a pandas Series.

import pandas as pd

names = ['BC - CPE LE H\u00caTRE INC.',
         'BC - CPE LE CHEZ-MOI DES PETITS',
         'BC GARDE MILIEU FAMILIAL DE BORDEAUX-CARTIERVILLE',
         'BC - BCGMF AHUNSTIC',
         'BC - CPE LE JARDIN DES R\u00caVES INC.',
         'BC - FORCE VIVE" CPE"',
         'BC - CPE GAMINVILLE INC.',
         'BC - CPE PIROUETTE DE FABREVILLE INC.',
         'B.C. ST-MICHEL',
         'BC - CPE DU PARC',
         'BC - CPE LA TROTTINETTE CAROTTEE',
         'BC - CPE DE MONTR\u00c9AL-NORD']

names = pd.Series(names)
names.str.normalize('NFKD')



>> 0                           BC - CPE LE H\u00caTRE INC.
   1                       BC - CPE LE CHEZ-MOI DES PETITS
   2     BC GARDE MILIEU FAMILIAL DE BORDEAUX-CARTIERVILLE
   3                                   BC - BCGMF AHUNSTIC
   4                BC - CPE LE JARDIN DES R\u00caVES INC.
   5                                 BC - FORCE VIVE" CPE"
   6                              BC - CPE GAMINVILLE INC.
   7                 BC - CPE PIROUETTE DE FABREVILLE INC.
   8                                        B.C. ST-MICHEL
   9                                      BC - CPE DU PARC
   10                     BC - CPE LA TROTTINETTE CAROTTEE
   11                       BC - CPE DE MONTR\u00c9AL-NORD
   dtype: object

I have also tried every variation possible of str.encode and str.decode before and after normalize. Nothing changed.

asked Nov 28 '18 at 20:22

robroc


I see the problem now. The strings are being displayed as BC - CPE LE H\u00caTRE INC. but in reality are stored as BC - CPE LE H\\u00caTRE INC., with the Unicode escape itself getting escaped. Do you know how to decode this?

– robroc
Nov 28 '18 at 21:45

python pandas unicode

2 Answers

unicodedata.normalize isn't doing what you think it is. It does not process \u escape sequences; it converts its input into one of the Unicode normalization forms.
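As a rough illustration of what normalization does and does not do, here is a minimal sketch. The variable names are made up for this example; it assumes Python 3, where str is Unicode.

```python
import unicodedata

# Ê written as a single precomposed code point (U+00CA).
composed = "\u00ca"

# NFKD splits it into a base letter plus a combining accent.
decomposed = unicodedata.normalize("NFKD", composed)
print(len(composed))    # 1
print(len(decomposed))  # 2
print([hex(ord(c)) for c in decomposed])  # ['0x45', '0x302']

# A literal backslash followed by "u00ca" is just six ASCII characters,
# so normalization has nothing to change.
escaped = r"\u00ca"
print(unicodedata.normalize("NFKD", escaped) == escaped)  # True
```

Normalization only changes how existing characters are represented; it never interprets backslash sequences.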

Python string literal processing is what converts the \u00ca to an Ê character, and it is applied only to Python string literals. The input you're reading from a CSV file does not get Python string literal processing applied. (The contents of the names list in your question do get string literal processing applied, so your posted code fails to reproduce your error. You really should have checked that before posting.)
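The difference can be sketched as follows. A raw string is used here only as a stand-in for text that arrives from a file with a literal backslash in it; the variable names are hypothetical.

```python
# In an ordinary literal, the parser resolves the escape before the
# string object is ever created.
literal = 'H\u00caTRE'

# A raw string keeps the backslash, which is what data read from a
# file containing the six characters \u00ca would look like.
from_file = r'H\u00caTRE'

print(literal)         # HÊTRE
print(len(literal))    # 5
print(from_file)       # H\u00caTRE
print(len(from_file))  # 10
```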

Depending on the content of the file and the context of your application, decoding your input with the unicode-escape encoding using codecs.decode may be an appropriate way to handle the \u escapes.
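A minimal sketch of that suggestion, assuming Python 3 and a string that holds only ASCII text plus literal \u escapes (the raw string below stands in for a value read from the CSV):

```python
import codecs

# Stand-in for one value as it might come out of the file.
raw = r'BC - CPE LE H\u00caTRE INC.'

# The unicode-escape codec interprets the backslash sequences.
decoded = codecs.decode(raw, 'unicode-escape')
print(decoded)  # BC - CPE LE HÊTRE INC.
```

Whether this is safe depends on the data: any other backslash in the input is also treated as the start of an escape.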

answered Nov 28 '18 at 20:34

user2357112


This isn't strictly true; the posted code reproduces the problem in 2.7, and str.normalize does actually result in the correct string processing to convert the unicode literals. It, however, only does this if python2.7 has been compiled with 4-byte unicode support (otherwise, he gets the error that he's seeing).

– CJR
Nov 28 '18 at 21:03


@CJ59: On Python 2.7, unicodedata.normalize would have rejected s as an argument, since s would be a bytestring instead of a Unicode string on 2.7. Also, an Ê character would not have shown up in the repr of a string.

– user2357112
Nov 28 '18 at 21:07


@CJ59: Funnily enough, the Series.str.normalize call does actually process unicode escapes on Python 2.7, but as a side effect of a weird compatibility handling routine that Series.str.normalize calls, which decodes bytestrings with unicode-escape. This doesn't seem limited to 4-byte Unicode builds, but I don't have a 2-byte build handy to test with.

– user2357112
Nov 28 '18 at 21:18

On further inspection, I see that the Unicode characters are being displayed as \u00ca but in reality are stored as \\u00ca, with the Unicode escape itself getting escaped. How do I make them Unicode again? Decode/encode do nothing.

– robroc
Nov 28 '18 at 21:47


The problem was that the backslash of the Unicode escape was itself escaped in the string, so \u00ca was being stored as \\u00ca. To decode it back, I just needed this, which @user2357112 hinted at:

Series.str.decode('unicode-escape')
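For comparison, the same decoding can be sketched per element with only the standard library. The helper name decode_escapes is hypothetical, and the Latin-1 round trip is a swapped-in variation rather than what the answer above uses; it is meant to leave already-decoded accented characters intact. A function like this could also be passed to Series.map.

```python
def decode_escapes(text):
    # Characters outside Latin-1 become backslash escapes, then the
    # unicode-escape codec turns every escape back into a character.
    return text.encode("latin-1", "backslashreplace").decode("unicode-escape")

# Raw strings stand in for values stored with a literal backslash.
stored = [
    r"BC - CPE LE H\u00caTRE INC.",
    "B.C. ST-MICHEL",
    r"BC - CPE DE MONTR\u00c9AL-NORD",
]

decoded = [decode_escapes(name) for name in stored]
print(decoded)
```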

answered Nov 28 '18 at 21:51

robroc

