Tokenizing non-English text into sentences in Python
I have an Arabic text file that looks like this:
اغاني و اغانياخلاق تربطنا ساخنه بن الخطاب حريم منتدى نضال و امراه اخرى قابيل و قوموا جميعا حاله الجو متى و انا نحن احبابك رامي مرض النقرس ماذا تاكل. افضل من قلب راشد ليش اتعب دار
I want to generate a list of sentences from this paragraph using Python, where the sentences are separated by a dot.
I found this answer: Tokenizing non English Text in Python
It splits the text into words, but not into sentences.
I also tried this:
from nltk.tokenize import sent_tokenize, word_tokenize
import regex
text = "اغاني و اغانياخلاق تربطنا ساخنه بن الخطاب حريم منتدى نضال و امراه اخرى قابيل و قوموا جميعا حاله الجو متى و انا نحن احبابك رامي مرض النقرس ماذا تاكل. افضل من قلب راشد ليش اتعب"
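# Remove zero-width non-joiners, then collect runs of letters (this yields words, not sentences)
words = regex.findall(r'\p{L}+', text.replace('\u200c', ''))
print(sent_tokenize(text))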
It returned the text separated by '\u202a':
زيز 240 و انا بدرب منال تاريخ\u202a.\u202c برقاء
NB: The sentence doesn't make any sense; it is just an example in Arabic characters.
I need the output to be in the form of sentences:
[اغاني و اغانياخلاق تربطنا ساخنه , بن الخطاب حريم منتدى نضال و امراه , انا نحن, احبابك رامي مرض , النقرس ماذا]
which means:
[sentence 1, sentence 2, sentence 3]
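For reference, here is a minimal sketch of the kind of dot-based splitting described above, using only the standard library. It assumes every sentence really ends with a plain '.' and that the stray '\u202a' / '\u202c' characters are Unicode bidirectional formatting marks that can safely be dropped. The names `CONTROL_CHARS` and `split_sentences` are hypothetical helpers for illustration, not part of NLTK.

```python
import re

# Assumption: these invisible characters carry no meaning for the split.
# U+200C is the zero-width non-joiner; U+202A..U+202E are the bidirectional
# embedding/override marks (LRE, RLE, PDF, LRO, RLO).
CONTROL_CHARS = re.compile("[\u200c\u202a-\u202e]")


def split_sentences(text):
    """Split text on '.' and return the non-empty pieces, stripped of whitespace."""
    cleaned = CONTROL_CHARS.sub("", text)
    return [part.strip() for part in cleaned.split(".") if part.strip()]


# Hypothetical sample: a dot wrapped in bidi marks, then a plain dot.
sample = "تاريخ\u202a.\u202c برقاء. افضل من قلب"
for sentence in split_sentences(sample):
    print(sentence)
```

This treats every '.' as a sentence boundary, so it would also split on decimal points or abbreviations; whether that is acceptable depends on the actual text file.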
python string python-3.x stringtokenizer
For those who don't read Arabic, can you edit your question and add your desired output as well?
– usr2564301
Nov 23 '18 at 18:14
@usr2564301 I edited the question to add translations (using Google Translate). They may not be 100% accurate, but someone needed to do at least this, since leaving the text untranslated limits the target audience. Hope this helps.
– Jaba
Nov 23 '18 at 18:32
@Jaba: but that does not show a list of sentences, does it? I don't need to know what it means; I wanted to see where that long input line needs to be broken.
– usr2564301
Nov 23 '18 at 19:09
I just edited the original post. The paragraph doesn't make any sense. It is just an example. I have a text file with similar paragraphs and I need to tokenise it to sentences.
– J.Doe
Nov 23 '18 at 20:25
How does one know where a sentence ends in Arabic?
– user8408080
Nov 23 '18 at 22:31
edited Nov 24 '18 at 9:01
kcorlidy
asked Nov 23 '18 at 18:05
J.Doe