Word frequencies from large body of scraped text












I have a file with word frequency logs from a very messy corpus of scraped Polish text that I am trying to clean to get accurate word frequencies. Since this is a big text file, I divided it into batches.



Here is a snippet from the original file:




 1 środka(byłe
1 środka.było
1 środkacccxli.
1 (środkach)
1 „środkach”
1 środ­kach
1 środkach...
1 środkach.",
1 środkach"
1 środkach".
1 środkachwzorem
1 środkach.życie
1 środkajak
1 "środkami"
1 (środkami)
1 „środkami”)
1 środkami!"
1 środkami”
1 środkami)?
1 środkami˝.



My goal is to clean the true word labels and remove the noisy ones (e.g. collocations of words concatenated through punctuation). This is what the first part of the script does. As you can see in the data sample above, several noisy entries belong to the same true label. Once they are cleaned, their frequencies should be summed. This is what I try to achieve in the second part of my script.



Here is the code in one piece with fixed indentation, in case you are able to reproduce my issues on your end:



# -*- coding: utf-8 -*-

import io
import pandas as pd
import numpy as np

num_batches = 54

for i in range(1 ,num_batches +1):

    infile_path = r'input_batch_' + str(i) + r'.txt'
    outfile_path = r'output_batch_' + str(i) + r'.txt'

    with io.open(infile_path, 'r', encoding = 'utf8') as infile, \
         io.open(outfile_path, 'w', encoding='utf8') as outfile:

        entries_raw = infile.readlines()
        entries_single = [x.strip() for x in entries_raw]
        entries = [x.split('\t') for x in entries_single]

        data = pd.DataFrame({"word": [], "freq": []})

        for j in range(len(entries)):
            data.loc[j] = entries[j][1], entries[j][0]

        freq_dict = dict()
        keys = np.unique(data['word'])

        for key in keys:
            for x in range(len(data)):
                if data['word'][x] == key:
                    if key in freq_dict:
                        prior_freq = freq_dict.get(key)
                        freq_dict[key] = prior_freq + data['freq'][x]
                    else:
                        freq_dict[key] = data['freq'][x]

        for key in freq_dict.keys():
            outfile.write("%s,%s\n" % (key, freq_dict[key]))


The problem with this code is that it is either buggy (running into an infinite loop or something similar) or very slow, even when processing a single batch, to the point of being impractical. Are there ways to streamline this code to make it computationally tractable? In particular, can I achieve the same goal without using for loops, or by using a different data structure than a dictionary for the word-frequency lookup?










    I've added the fixed code in one piece below. Thank you!
    – Des Grieux
    2 hours ago
















python performance dictionary lookup






edited 1 hour ago

























asked 2 hours ago









Des Grieux

2 Answers
Reinderien covered most of the other issues with your code. But you should know that the standard library has a class, collections.Counter, for simplifying the task of tallying word frequencies:



from collections import Counter

yourListOfWords = [...]

frequencyOfEachWord = Counter(yourListOfWords)
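
Since the input here already carries a count per entry, one way to apply Counter is to add each entry's frequency to its word's running total instead of counting list items. A minimal sketch, assuming the entries have already been cleaned and parsed into (frequency, word) pairs; the `pairs` list is made up for illustration:

```python
from collections import Counter

# Hypothetical, already-cleaned (frequency, word) pairs; not taken from the real corpus.
pairs = [(1, "środkach"), (1, "środkami"), (2, "środkach"), (1, "środkami")]

totals = Counter()
for freq, word in pairs:
    # Missing keys start at 0, so no membership check is needed.
    totals[word] += freq

print(totals.most_common())
```

This makes a single pass over the entries, so the work grows linearly with the number of lines in a batch.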





    for i in range(1 ,num_batches +1):


    Your inter-token spacing here is a little wonky. I suggest running this code through a linter to get it to be PEP8-compliant.



    This string:



    r'input_batch_' + str(i) + r'.txt'


    can be:



    f'input_batch_{i}.txt'


    This code:



    entries_raw = infile.readlines()
    entries_single = [x.strip() for x in entries_raw]
    entries = [x.split('\t') for x in entries_single]


    can also be simplified, to:



    entries = [line.rstrip().split('\t') for line in infile]


    Note a few things. You don't need to call readlines(); you can treat the file object itself as an iterator. Also, avoid calling a variable x even if it's an intermediate variable; you need meaningful names.
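
As a rough illustration of iterating over the file object directly, the sketch below uses an in-memory buffer as a stand-in for the open input file and assumes one tab-separated "count, word" entry per line. Converting the count to int is an addition here: values split from a line are strings, so they need converting before they can be summed numerically.

```python
import io

# Stand-in for an open input file; assumes one "count<TAB>word" entry per line.
infile = io.StringIO("1\tśrodkach\n2\tśrodkami\n")

entries = []
for line in infile:
    # Strip only the line ending so whitespace inside the word is left alone.
    freq_text, word = line.rstrip("\n").split("\t")
    entries.append((int(freq_text), word))

print(entries)
```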



    This is an antipattern inherited from C:



    for j in range(len(entries)):
        data.loc[j] = entries[j][1], entries[j][0]


    You should instead do:



    for j, entry in enumerate(entries):
        data.loc[j] = entry[1], entry[0]


    That also applies to your for x in range(len(data)):.
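
When the position is not needed at all, the loop can go one step further and unpack each entry directly. A small sketch with made-up entries in the question's [frequency, word] order:

```python
# Hypothetical parsed entries in the question's [frequency, word] order.
entries = [["1", "środkach"], ["1", "środkami"]]

# enumerate() yields the position alongside each entry.
indexed = [(j, entry[1], entry[0]) for j, entry in enumerate(entries)]

# Without a need for the position, unpack each entry directly.
words = [word for _freq, word in entries]

print(indexed)
print(words)
```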



    This:



    freq_dict = dict()


    should be:



    freq_dict = {}


    This:



    if key in freq_dict:
        prior_freq = freq_dict.get(key)
        freq_dict[key] = prior_freq + data['freq'][x]
    else:
        freq_dict[key] = data['freq'][x]


    can be simplified to:



    prior_freq = freq_dict.get(key)
    freq_dict[key] = data['freq'][x]
    if prior_freq is not None:
        freq_dict[key] += prior_freq


    Note that get was being used inappropriately: either check for key presence and then index with [], or call get and then check the return value (which is preferred, as it requires fewer key lookups). The call to get has to come before the assignment; otherwise the new value would be read back and added to itself.
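
dict.get also accepts a default value as its second argument, which folds the lookup and the missing-key case into one statement. A minimal sketch, with made-up sample rows and frequencies assumed to be integers already:

```python
# Hypothetical (word, frequency) rows; frequencies assumed to be integers already.
rows = [("środkach", 1), ("środkami", 1), ("środkach", 4)]

freq_dict = {}
for key, freq in rows:
    # get() returns 0 when the key is absent, so both cases share one statement.
    freq_dict[key] = freq_dict.get(key, 0) + freq

print(freq_dict)
```

Because every row is visited once and each dictionary update is a constant-time operation on average, no outer loop over the unique keys is required.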



    This loop:



    for key in freq_dict.keys():
        outfile.write("%s,%s\n" % (key, freq_dict[key]))


    needs adjustment in a few ways. Firstly, it won't run at all because its indentation is wrong. Also, rather than only iterating over keys, you should be iterating over items:



    for key, freq in freq_dict.items():
        outfile.write(f'{key},{freq}\n')
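
One caveat with a hand-built comma-separated line: if any noisy labels survive cleaning, they may themselves contain commas or quotes (the sample includes an entry such as środkach.",), which would make the output ambiguous. A possible alternative, sketched here with an in-memory buffer standing in for outfile and made-up totals, is the standard csv module, which quotes such fields:

```python
import csv
import io

# Made-up totals; the second key contains a comma and a quote on purpose.
freq_dict = {"środkach": 3, 'środkach.",': 1}

# StringIO stands in for the output file; newline="" is what the csv docs
# recommend when opening a real file for csv.writer.
outfile = io.StringIO(newline="")
writer = csv.writer(outfile)
for key, freq in freq_dict.items():
    writer.writerow([key, freq])

# Reading the text back recovers the original labels intact.
recovered = list(csv.reader(io.StringIO(outfile.getvalue(), newline="")))
print(recovered)
```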





          answered 36 mins ago









          AleksandrH

                  answered 59 mins ago









                  Reinderien
