Tokenizing non-English text into sentences in Python

I have an Arabic text file that looks like this:

اغاني و اغانياخلاق تربطنا ساخنه بن الخطاب حريم منتدى نضال و امراه اخرى قابيل و قوموا جميعا حاله الجو متى و انا نحن احبابك رامي مرض النقرس ماذا تاكل‪.‬ افضل من قلب راشد ليش اتعب دار

I want to generate a list of sentences from this paragraph using Python, where the sentences are separated by a dot.

I found this answer: Tokenizing non English Text in Python

It splits the text into words, but not into sentences.

I also tried this:

from nltk.tokenize import sent_tokenize, word_tokenize

import regex

text = "اغاني و اغانياخلاق تربطنا ساخنه بن الخطاب حريم منتدى نضال و امراه اخرى قابيل و قوموا جميعا حاله الجو متى و انا نحن احبابك رامي مرض النقرس ماذا تاكل‪.‬ افضل من قلب راشد ليش اتعب" 

regex.findall(r'\p{L}+', text.replace('\u200c', ''))

print(sent_tokenize(text))

It returned the text separated by '\u202a':

زيز 240 و انا بدرب منال تاريخ\u202a.\u202c برقاء
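A minimal sketch for inspecting what surrounds the dot, assuming the stray characters are Unicode formatting marks (general category `Cf`) such as U+202A and U+202C. The helper name `format_chars` and the short sample string are made up for illustration; only the standard-library `unicodedata` module is used.

```python
import unicodedata

def format_chars(s):
    """Return (index, code point, Unicode name) for every format character (category Cf) in s."""
    return [
        (i, f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))
        for i, ch in enumerate(s)
        if unicodedata.category(ch) == "Cf"
    ]

# Hypothetical sample: a dot wrapped in U+202A / U+202C, as in the output above.
sample = "تاكل\u202a.\u202c افضل"
for entry in format_chars(sample):
    print(entry)
```

If the list comes back non-empty, the dot is wrapped in invisible directional marks, and they can be removed before the text is handed to a tokenizer.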

NB: The sentence doesn't make any sense; it is just an example in Arabic characters.

I need the output to be in the form of sentences:

[اغاني و اغانياخلاق تربطنا ساخنه , بن الخطاب حريم منتدى نضال و امراه , انا نحن,  احبابك رامي مرض , النقرس ماذا]

which means:

[sentence 1, sentence 2, sentence 3]
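One possible way to get a list of that shape, sketched with only the standard-library `re` module. It assumes that a sentence ends at a full stop, an exclamation mark, or a Latin or Arabic question mark, and that the directional marks around the dot can simply be dropped. The names `split_sentences`, `SENTENCE_END` and `INVISIBLE` are illustrative, not part of any library.

```python
import re

# Assumption: directional formatting marks carry no meaning here and can be removed.
INVISIBLE = re.compile("[\u200e\u200f\u202a-\u202e\u2066-\u2069]")

# Assumption: these characters end a sentence ('\u061f' is the Arabic question mark).
SENTENCE_END = re.compile("[.!?\u061f]+")

def split_sentences(text):
    """Strip directional marks, split on sentence-ending punctuation, drop empty pieces."""
    cleaned = INVISIBLE.sub("", text)
    return [part.strip() for part in SENTENCE_END.split(cleaned) if part.strip()]

# Shortened sample taken from the paragraph above, with the dot wrapped in U+202A / U+202C.
text = "ماذا تاكل\u202a.\u202c افضل من قلب راشد"
print(split_sentences(text))
```

This is a plain punctuation split, so a dot inside an abbreviation or a decimal number would also end a sentence; a trained sentence tokenizer would be needed to handle those cases.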

edited Nov 24 '18 at 9:01 by kcorlidy

asked Nov 23 '18 at 18:05 by J.Doe

For those who don't read Arabic, can you edit your question and add your desired output as well?
– usr2564301
Nov 23 '18 at 18:14

@usr2564301 Edited his question with translations (using Google Translate). They may not be 100% accurate, but someone needed to do at least this, since leaving the text untranslated limits the target audience. Hope this helps.
– Jaba
Nov 23 '18 at 18:32

@Jaba: but that does not show a list of sentences, does it? I don't need to know what it means; I wanted to see where that long input line needs to be broken.
– usr2564301
Nov 23 '18 at 19:09

I just edited the original post. The paragraph doesn't make any sense; it is just an example. I have a text file with similar paragraphs and I need to tokenise it into sentences.
– J.Doe
Nov 23 '18 at 20:25

How does one know where a sentence ends in the Arabic language?
– user8408080
Nov 23 '18 at 22:31

python string python-3.x stringtokenizer
