Tokenisation List Comprehension












python python-3.x token list-comprehension

I've created this code with the aim of using a large sample of a corpus to establish the extent to which vocabulary size is reduced when both number and case normalisation is applied.



import re
from nltk.tokenize import word_tokenize
# ReutersCorpusReader is assumed to be defined or imported elsewhere in the project

def vocabulary_size(sentences):
    tok_counts = {}
    for sentence in sentences:
        for token in sentence:
            tok_counts[token] = tok_counts.get(token, 0) + 1
    return len(tok_counts.keys())

rcr = ReutersCorpusReader()

sample_size = 10000

raw_sentences = rcr.sample_raw_sents(sample_size)
tokenised_sentences = [word_tokenize(sentence) for sentence in raw_sentences]

lowered_sentences = [sentence.lower() for sentence in tokenised_sentences] # something going wrong here
normalised_sentences = [normalise(sentence) for sentence in tokenised_sentences] # something going wrong here

raw_vocab_size = vocabulary_size(tokenised_sentences)
normalised_vocab_size = vocabulary_size(normalised_sentences)
print("Normalisation produced a {0:.2f}% reduction in vocabulary size from {1} to {2}".format(
    100*(raw_vocab_size - normalised_vocab_size)/raw_vocab_size, raw_vocab_size, normalised_vocab_size))


Though as it stands it only prints each individual character. I think I have localised the problem to the two commented lines: a list has no attribute .lower(), so I'm not sure how I would replace it.



I also think I may have to feed lowered_sentences into my normalised_sentences.



Here is my normalise function:



def normalise(token):
    print(["NUM" if token.isdigit()
           else "Nth" if re.fullmatch(r"\d+(st|nd|rd|th)", token)
           else token for token in token])


Though I may not even be meant to make use of this specific normalise function. Perhaps I'm attacking this the wrong way; my apologies, I shall be back with more information.










asked Nov 24 '18 at 14:26 – bemzoo
edited Nov 24 '18 at 15:19 – usr2564301

  • Welcome to StackOverflow. What exactly is your question for us?

    – Rory Daulton
    Nov 24 '18 at 14:39






  • A text suddenly printing as separate characters usually means that you applied a list function to a single text string, where you should have fed it a list of text strings.

    – usr2564301
    Nov 24 '18 at 14:40











  • Do not add an answer in your question. If you want, and if it's sufficiently different from the one(s) given, you can always add it as a proper answer. I removed it from your question; if you still need the code, it's in the edit history.

    – usr2564301
    Nov 24 '18 at 15:18


















2 Answers
I see a few things that should clear this up for you.



 lowered_sentences = [tokenised_sentences.lower() for sentence in tokenised_sentences] # something going wrong here
normalised_sentences = [normalise(tokenised_sentences) for sentence in tokenised_sentences] # something going wrong here


Here you've forgotten to use your loop variable; you probably meant



 lowered_sentences = [sentence.lower() for sentence in tokenised_sentences]
normalised_sentences = [normalise(sentence) for sentence in tokenised_sentences]


Also, since a list doesn't have the method lower(), you'd have to apply it to every token in each sentence, i.e.



 lowered_sentences = [[token.lower() for token in sentence] for sentence in tokenised_sentences]


Also, your normalise(token) is not returning anything, just using print. So the list comprehension



 normalised_sentences = [normalise(tokenised_sentences) for sentence in tokenised_sentences] # something going wrong here


does not produce a list of anything but None.
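
A quick way to see this in a REPL: print writes its output as a side effect and returns None, so None is all the comprehension can collect.

 result = print("hello")   # prints "hello", then returns None
 result is None            # evaluates to True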



I'd suggest you refrain from using list comprehensions and start off with normal for loops until you have your algorithm in place; convert later if speed is needed.
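
To illustrate that suggestion, here is a minimal loop-based sketch, assuming normalise is rewritten to return its result instead of printing it, and assuming the intended regex was \d+ rather than [d]+:

 import re

 def normalise(token):
     # Return the normalised form of a single token rather than printing a list.
     if token.isdigit():
         return "NUM"
     if re.fullmatch(r"\d+(st|nd|rd|th)", token):
         return "Nth"
     return token

 normalised_sentences = []
 for sentence in tokenised_sentences:      # each sentence is a list of tokens
     normalised_sentence = []
     for token in sentence:
         # lower-case first, then normalise; "NUM"/"Nth" stay uppercase markers
         normalised_sentence.append(normalise(token.lower()))
     normalised_sentences.append(normalised_sentence)

Once this produces the expected vocabulary counts, it translates straightforwardly back into the nested comprehension shown above.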






answered Nov 24 '18 at 14:51 – IAmBullsaw

    You appear to be using the wrong variable in your comprehensions:



    # Wrong
    lowered_sentences = [tokenised_sentences.lower() for sentence in tokenised_sentences]
    normalised_sentences = [normalise(tokenised_sentences) for sentence in tokenised_sentences]

    # Right
    lowered_sentences = [sentence.lower() for sentence in tokenised_sentences]
    normalised_sentences = [normalise(sentence) for sentence in tokenised_sentences]


    However, if you want to normalise your lower-case sentences, we need to change that line too:



    # Right
    lowered_sentences = [sentence.lower() for sentence in tokenised_sentences]
    normalised_sentences = [normalise(sentence) for sentence in lowered_sentences]
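
    One caveat to hedge: word_tokenize returns a list of tokens for each sentence, so sentence.lower() will still fail on those lists, and the lowering has to happen per token. A sketch combining both steps, assuming a normalise that takes a single token and returns its result:

    lowered_sentences = [[token.lower() for token in sentence]
                         for sentence in tokenised_sentences]
    normalised_sentences = [[normalise(token) for token in sentence]
                            for sentence in lowered_sentences]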





    edited Nov 24 '18 at 14:50 · answered Nov 24 '18 at 14:41 – Jonah Bishop

    • Yes, sorry, I did originally have this but was just switching loads of things around to see if it was a basic syntax problem. The problem persists: I'm not going through the list but applying functions to the list as a whole, which I will try to fix now.

      – bemzoo
      Nov 24 '18 at 14:48











    • I've updated my answer to include a fix for another potential error.

      – Jonah Bishop
      Nov 24 '18 at 14:51










