Tokenizing non-English text in Python into sentences


























I have an Arabic text file that looks like this:



اغاني و اغانياخلاق تربطنا ساخنه بن الخطاب حريم منتدى نضال و امراه اخرى قابيل و قوموا جميعا حاله الجو متى و انا نحن احبابك رامي مرض النقرس ماذا تاكل‪.‬ افضل من قلب راشد ليش اتعب دار



I want to generate a list of sentences from this paragraph using Python, where the sentences are separated by a dot.



I found this answer: Tokenizing non English Text in Python



It splits text into words but not into sentences.



I also tried this:



from nltk.tokenize import sent_tokenize, word_tokenize
import regex
text = "اغاني و اغانياخلاق تربطنا ساخنه بن الخطاب حريم منتدى نضال و امراه اخرى قابيل و قوموا جميعا حاله الجو متى و انا نحن احبابك رامي مرض النقرس ماذا تاكل‪.‬ افضل من قلب راشد ليش اتعب"
# \p{L} matches any Unicode letter (needs the third-party regex module)
words = regex.findall(r'\p{L}+', text.replace('\u200c', ''))
print(sent_tokenize(text))


It returned the text with '\u202a' and '\u202c' around the dot:



زيز 240 و انا بدرب منال تاريخ\u202a.\u202c برقاء
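
For reference, U+202A (LEFT-TO-RIGHT EMBEDDING) and U+202C (POP DIRECTIONAL FORMATTING) are invisible Unicode directional formatting characters, both in general category Cf. A minimal sketch of how they could be listed and stripped with the standard library follows. The helper names `format_chars` and `strip_format_chars` are hypothetical, and removing every Cf character is an assumption that may be too aggressive for text that relies on zero-width joiners.

```python
import unicodedata


def format_chars(text):
    """Return (code point, name) for each Unicode format (Cf) character in text."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))
        for ch in text
        if unicodedata.category(ch) == "Cf"
    ]


def strip_format_chars(text):
    """Remove every character whose Unicode category is Cf (format)."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Cf")


# Tail of the output shown above, with the escapes written out explicitly.
sample = "تاريخ\u202a.\u202c برقاء"

print(format_chars(sample))        # which invisible characters are present
print(strip_format_chars(sample))  # same text with only the plain dot left
```

Printing the code points first makes it easier to confirm which invisible characters are actually in the file before deciding what to remove.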


NB: The sentence doesn't make any sense; it is just an example in Arabic characters.



I need the output to be in the form of sentences:



[اغاني و اغانياخلاق تربطنا ساخنه , بن الخطاب حريم منتدى نضال و امراه , انا نحن,  احبابك رامي مرض , النقرس ماذا]


which means:



[sentence 1, sentence 2, sentence 3]
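
One possible approach, sketched under the assumption that a plain dot is the only sentence delimiter (as stated above) and that the directional control characters can simply be dropped: remove those characters, split on the dot, and discard empty pieces. The name `split_sentences` and the `BIDI_CONTROLS` set are hypothetical choices for illustration, not part of NLTK or any other library.

```python
# Directional controls to drop before splitting (an assumed, non-exhaustive set).
BIDI_CONTROLS = "\u200e\u200f\u202a\u202b\u202c\u202d\u202e"
_STRIP_TABLE = {ord(ch): None for ch in BIDI_CONTROLS}


def split_sentences(text, delimiter="."):
    """Split text on delimiter after removing bidi controls; drop empty pieces."""
    cleaned = text.translate(_STRIP_TABLE)
    pieces = (piece.strip() for piece in cleaned.split(delimiter))
    return [piece for piece in pieces if piece]


# Shortened version of the sample paragraph, with the escapes written out.
text = "النقرس ماذا تاكل\u202a.\u202c افضل من قلب راشد"
sentences = split_sentences(text)
print(sentences)
```

Since the sample paragraph contains a single dot, a split like this would give two pieces for it; input with more dots would give more.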

































  • For those who don't read Arabic, can you edit your question and add your desired output as well?
    – usr2564301
    Nov 23 '18 at 18:14










  • @usr2564301 Edited his question with translations (using Google Translate). It may not be 100% accurate, but someone needed to do this at least, since leaving it untranslated limits the target audience. Hope this helps.
    – Jaba
    Nov 23 '18 at 18:32












  • @Jaba: but that does not show a list of sentences, does it? I don't need to know what it means; I wanted to see where that long input line needs to be broken.
    – usr2564301
    Nov 23 '18 at 19:09










  • I just edited the original post. The paragraph doesn't make any sense. It is just an example. I have a text file with similar paragraphs and I need to tokenise it into sentences.
    – J.Doe
    Nov 23 '18 at 20:25










  • How does one know where a sentence ends in Arabic?
    – user8408080
    Nov 23 '18 at 22:31
















python string python-3.x stringtokenizer














edited Nov 24 '18 at 9:01 by kcorlidy

asked Nov 23 '18 at 18:05 by J.Doe






























