Get historical spelling corrected

Hello everyone, this is my first post here. I am writing a Python script that returns the standard form of words. I use rules to transform a historical text (spelling normalization). The code does not work properly: it only displays the modified words, not the entire file. How can I solve this?

import re, string, unicodedata
from nltk.corpus import stopwords
import spacy
import codecs

nlp = spacy.load('fr')

with codecs.open(r'/home/m16/fatkab/RD_project/corpus.txt', encoding='utf8') as f:
    word = f.read()
    tokens = re.split(r'\W+', word)
    print(tokens)

for word in tokens:
    rule1 = word.replace('y', 'i')

    # to avoid modifying y as a word itself:
    if word.endswith('y') and len(word) >= 2:
        print(rule1)

My sample input:

Or puis que Dieu est ainsi descendu
à nous,qu'il luy a pleu de nous communiquer
ainsi sa bonté : n'est ce pas raison que nous
soyons du tout siens? Et d'autant qu'il nous a tendu
la main pour nous racheter, ne faut-il pas que
nous soyons son heritage, quand il nous a acquis
par sa vertu? Le peuple donc s'il eust eu vn grain
de prudence , deuoit bien se ranger en toute humilité
pour receuoir la doctrine qui luy estoit
preschee par Moyse. Et mesme quelle authorite
meritoit la Loy , qui estoit ainsi approuuee par
tant de miracles?Car Dieu ne commande pas simplement
à Moyse de parler, apres l'auoir choisi
pour son prophete:mais il le tire en la montagne,
il le separe de la compagnie des hommes,afin que
quand il viendra mettre en auant la Loy,qu'on le
tienne comme vn Ange,& non point comme vne creature mortelle.

Here is the output:

lui

lui

lui

ai

oui

Loi

lui

foi

Loi

hui

soi

lui

lui

lui

ci

Loi

soi

lui

ai

lui

lui

doi

quoi

soi

ai

lui

lui

soi

The language is French.

edited Nov 27 '18 at 14:51

Ivan Kolesnikov

1,26111032

asked Nov 27 '18 at 10:57

Timat

184

1

Please add your code, your attempts and finally your error message or at least incorrect output. With your question we cannot reproduce your problem.

– Alex_P
Nov 27 '18 at 11:00

@Timat that should go in your question. :-)

– TrebuchetMS
Nov 27 '18 at 11:06

1

@Timat Please add your code in the post itself, not in comments.

– Mayank Porwal
Nov 27 '18 at 11:06

can you also add sample input?

– planetmaker
Nov 27 '18 at 11:14


python python-3.x


1 Answer
1


Use re.sub on the entire text.

One major benefit of regex is that you can run a rule across large amounts of text - without having to manually tokenise and rebuild the output.

import re

text = "ouy you are the best luy guy in the try"
sub_pattern = re.compile(r"y(\W+|$)")
print(re.sub(sub_pattern, r"i\1", text))
# oui you are the best lui gui in the tri

Here we use the re.sub functionality to replace each match of the pattern with our replacement, across the entire file.

To keep the spaces and punctuation between the words, we use the backreference \1 in the replacement pattern. This adds the text matched by capture group 1 back into the output.
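As a minimal sketch of the same idea applied to multi-line text, the snippet below runs the substitution over two lines copied from the sample input in the question. The function name `normalize_final_y` is a hypothetical helper introduced here for illustration, not part of the original answer.

```python
import re

# Pattern from the answer: a "y" followed by non-word characters
# or by the end of the text. Group 1 captures what follows the "y".
FINAL_Y = re.compile(r"y(\W+|$)")


def normalize_final_y(text):
    """Replace word-final 'y' with 'i' and leave all other text unchanged."""
    return FINAL_Y.sub(r"i\1", text)


# Two lines taken from the sample input in the question.
sample = (
    "pour receuoir la doctrine qui luy estoit\n"
    "la Loy , qui estoit ainsi approuuee par"
)

print(normalize_final_y(sample))
# pour receuoir la doctrine qui lui estoit
# la Loi , qui estoit ainsi approuuee par
```

For a real corpus, the string returned by `f.read()` in the question's code could be passed to the function in place of `sample`, so the whole normalized text is printed rather than only the changed words.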

Regex patterns explained:

re.compile - if you're using the same regex over and over, compiling it once saves the machine from re-computing it each time. In this case, it's just used to put the regex on its own line for clarity.

r"y(\W+|$)" - the r tells Python to treat the string as raw, so backslashes reach the regex engine unchanged instead of being read as string escapes. To match a "y" at the end of a word, the rule is "a 'y' followed by one or more non-word characters (\W+), or by the end of the string ($)". This is the pattern we use to match all the "incorrect" 'y' endings in the input. Note that the trailing non-word characters (spaces, punctuation) are captured in a group () so we can use them in the backreference later.

r"i\1" - First we replace the matched "y" plus its trailing characters with an "i", as per your rule. Then we put the trailing characters back with the backreference \1, which inserts whatever was captured by group 1 of our pattern, (\W+|$).

Alternatively

Instead of capturing the trailing characters and adding them back in, we can use a lookahead in the pattern, so that only the "y" is matched and replaced.

For this you could use the pattern:

sub_pattern = re.compile(r"y(?=\W+|$)")
print(re.sub(sub_pattern, r"i", text))
# oui you are the best lui gui in the tri

Note that the part of the pattern that matches the trailing characters is now prefixed with ?=, which makes it a positive lookahead. It checks that these characters follow the "y" but does not consume them, so they are not part of the match and are left untouched by the replacement. As such, the replacement only needs to be "i".
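The question's code also tries to leave "y" alone when it is a word by itself. Neither pattern above does that, so here is one possible refinement, offered as an assumption rather than as part of the original answer: add a lookbehind requiring a word character before the "y". The names `FINAL_Y_IN_WORD` and `normalize_keep_lone_y` are hypothetical.

```python
import re

# Assumed refinement: (?<=\w) requires a word character directly
# before the "y", so a standalone "y" is not matched.
# (?=\W|$) is the lookahead from the answer, shortened to one character.
FINAL_Y_IN_WORD = re.compile(r"(?<=\w)y(?=\W|$)")


def normalize_keep_lone_y(text):
    """Replace word-final 'y' with 'i', but keep the one-letter word 'y'."""
    return FINAL_Y_IN_WORD.sub("i", text)


print(normalize_keep_lone_y("il y a vn roy, et luy"))
# il y a vn roi, et lui
```

In Python 3, \w is Unicode-aware for str patterns, so accented letters such as "é" also count as word characters here.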

answered Nov 27 '18 at 11:38

Bilkokuya

781616

This is very useful! Thank you so much for your great assistance. However, I have a question regarding other modifications. How can I use regex when the character to be changed is in the middle of the word, and the change concerns many words from different lemmas, such as 'sauuage', 'gouuernement', 'inuoque', etc., where I need to turn one 'u' into 'v'? Since I am not good at regex, I was handling each word individually.

– Timat
Nov 27 '18 at 12:00

1

@Timat Often the easiest solution is to create a number of separate rules that each solve one specific issue (such as ending y -> i) and then run them one after another, rather than trying to make a single regex pattern that solves everything. For your 'uu' rule, for example, you might simply replace all 'uu' with 'uv', or use something like (?<=\w)uu(?=\w) if you want to ensure the 'uu' has at least one letter before and after it. If you're still unsure, please ask a separate question, and mark this one as accepted if it has solved the issue you originally posted.

– Bilkokuya
Nov 27 '18 at 12:06

Thank you so much it is solved.

– Timat
Nov 27 '18 at 12:27
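
The "one rule after another" approach from the comments could be sketched roughly as follows. This is only an illustration under stated assumptions: the `RULES` list and the `normalize` function are hypothetical names, and the 'uu' rule assumes the intended results are "sauvage" and "gouvernement", so 'uu' becomes 'uv'. A single medial 'u' as in 'inuoque' is not covered and would need its own rule.

```python
import re

# Each rule is a (compiled pattern, replacement) pair, applied in order.
# Both rules are illustrative assumptions based on the thread,
# not a complete normalization scheme for historical French.
RULES = [
    (re.compile(r"(?<=\w)y(?=\W|$)"), "i"),  # word-final y -> i
    (re.compile(r"(?<=\w)uu(?=\w)"), "uv"),  # medial uu -> uv
]


def normalize(text):
    """Apply every rule to the whole text, one after another."""
    for pattern, replacement in RULES:
        text = pattern.sub(replacement, text)
    return text


print(normalize("luy, sauuage gouuernement"))
# lui, sauvage gouvernement
```

Keeping the rules in a list makes it easy to add, remove or reorder them without touching the loop.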
