How to generate a large dataset and randomize it using a pandas DataFrame in Python



























I have written a program that generates a large dataset and randomizes it according to the conditions below.
Please go through the whole program and the conditions; if anything is unclear, please ping me.



Input data:



Asset_Id  Asset Family  Asset Name  Location    Asset Component          Keywords                       
1 Haptic Analy HAL1 Zenoa Tetris Measuring Unit Measurement Inaccuracy,
1 Haptic Analy HAL2 Zenoa Micro Pressure Platform Low air pressure,
1 Haptic Analy HAL3 Technopolis Rotation Chamber Rotation Chamber Intermittent Slowdown
1 Haptic Analy HAL4 Technopolis Mirror Lens Combinator Mirror Angle Skewed,
2 Hyperdome Insp HISP1 Technopolis Laser Column Column Alignment Issues,
2 Hyperdome Insp HISP2 Zenoa Turbo Quantifier Quantifier output Drops Intermittently
2 Hyperdome Insp HISP3 Technopolis Generator Generator
2 Hyperdome Insp HISP4 Zenoa High Frequency Emulator Emulator Frequency Drop

3 Nano Dial Assem NDA11 Zenoa Fusion Tank Fall in Diffusion Ratio
3 Nano Dial Assem NDA12 Zenoa Dial Loading Unit Faulty Scanner Unit
3 Nano Dial Assem NDA13 Zenoa Vaccum Line Control Above Normal
3 Nano Dial Assem NDA14 Zenoa Wave Generator Generator Power Failure
4 Geometric Synth GeoSyn22 La Puente Scanning Electronic Faulty Scanner Unit
4 Geometric Synth GeoSyn23 La Puente Draft Synthesis Chamber Beam offset beyond Tolerance
4 Geometric Synth GeoSyn24 La Puente Progeometric Plane Progeometric Plane Fault Detected
4 Geometric Synth GeoSyn25 La Puente Ion Gas Diffuser Column Alignment Issues


CONDITIONS:
1) The data should be read from a CSV file and all rows randomized.
2) The "Location" column should also be randomized separately and printed along with the rest of the randomized data.
3) More than 30k rows should be generated from the given data.
4) Important: the "Asset Component" column should also be randomized separately, but only within each "Asset Family". Components that belong to "Haptic Analyser" must not mix with those of "Hyperdome Inspector", "Nano Dial Assembler" and so on. In other words, the column should be shuffled so that a component never ends up under a different "Asset Family".
If anything about the 4th condition is unclear, please let me know.
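Condition 3 can be sketched as below. This is only an illustration under stated assumptions: the helper name `expand`, the `seed` parameter and the 16-row `seed_df` (built from the asset names in the input table) are hypothetical and not part of the original program.

```python
import pandas as pd

def expand(df, min_rows, seed=None):
    """Repeat df until it has strictly more than min_rows rows, then shuffle."""
    # Integer division plus one guarantees copies * len(df) > min_rows
    copies = min_rows // len(df) + 1
    big = pd.concat([df] * copies, ignore_index=True)
    # frac=1 returns every row in random order
    return big.sample(frac=1, random_state=seed).reset_index(drop=True)

# Hypothetical seed frame: 16 rows, one per asset name in the input table
seed_df = pd.DataFrame({
    "Asset Name": ["HAL1", "HAL2", "HAL3", "HAL4",
                   "HISP1", "HISP2", "HISP3", "HISP4",
                   "NDA11", "NDA12", "NDA13", "NDA14",
                   "GeoSyn22", "GeoSyn23", "GeoSyn24", "GeoSyn25"],
})

large = expand(seed_df, 30000, seed=0)
print(len(large))
```

Computing the number of copies from the seed size avoids hard-coding a multiplier that may fall short of 30k rows when the seed file is small.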



For this I have written a program that satisfies the first three conditions:



import pandas as pd


def main():
    # Raw strings stop the backslashes in Windows paths being read as escapes
    df = pd.read_csv(r"C:\Users\rahul\Desktop\Data Manufacturing - Seed Data.csv")

    # Condition 1: shuffle all rows
    ds = df.sample(frac=1).reset_index(drop=True)

    # Condition 2: shuffle the Location column on its own.
    # to_numpy() discards the index; otherwise pandas re-aligns the
    # shuffled values to their original rows and the shuffle is undone.
    ds["Location"] = ds["Location"].sample(frac=1).to_numpy()

    result = ds[['Asset_Id ', 'Asset Family', 'Asset Name', 'Location', 'Asset Component','Keywords','Conditions','Parts','No. of Parts','SR_Id','SR_Date','SR_Month','SR_Year']]

    # Now randomise the whole data again
    ds1 = result.sample(frac=1)

    # Condition 3: generate a large dataset (501 copies) and randomise it.
    # DataFrame.append was removed in pandas 2.0, so pd.concat is used.
    dd = pd.concat([ds1] * 501, ignore_index=True)
    ds2 = dd.sample(frac=1)
    print(ds2)

    # Write the enlarged, shuffled dataset
    ds2.to_csv(r'C:\Users\rahul\Desktop\people1.csv')


if __name__ == '__main__':
    main()
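One detail of the "Location" step is worth checking: pandas aligns on index labels when a Series is joined or assigned to a DataFrame, so a shuffled Series is put back in its original row order unless its index is discarded first. The small frame below is a hypothetical stand-in for the CSV, using values from the input table.

```python
import pandas as pd

# Hypothetical stand-in for the seed CSV
df = pd.DataFrame({
    "Asset Name": ["HAL1", "HAL2", "HAL3", "HAL4"],
    "Location": ["Zenoa", "Zenoa", "Technopolis", "Technopolis"],
})

# Shuffled values, but each one still carries its original index label
shuffled = df["Location"].sample(frac=1, random_state=1)

# join() matches on index labels, so every value returns to its own row
aligned = df.drop(columns="Location").join(shuffled)

# to_numpy() drops the labels, so the values are placed by position
positional = df.assign(Location=shuffled.to_numpy())

print(aligned)
print(positional)
```

In `aligned` the "Location" column is identical to the input, while `positional` holds the same values in the shuffled order.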


This program generates a large dataset, randomizes it, and also randomizes the "Location" column.
The only thing I am not able to do is the 4th condition: "Asset Component" should be randomized according to the "Asset Family" column, so that the components of "Haptic Analyser" and "Hyperdome Inspector" do not mix with each other and stay separate in the output.
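One possible way to approach the 4th condition is to permute "Asset Component" inside each "Asset Family" group. The sketch below is an assumption about what is wanted, not a confirmed solution: the helper name `shuffle_within_groups`, the `seed` parameter and the small `seed_df` are hypothetical, and the family names are written as they appear (truncated) in the input table.

```python
import numpy as np
import pandas as pd

def shuffle_within_groups(df, group_col, value_col, seed=None):
    """Return a copy of df with value_col permuted inside each group_col group."""
    rng = np.random.default_rng(seed)
    out = df.copy()
    # transform() keeps the original row positions, so only the values
    # inside each group change places with one another
    out[value_col] = out.groupby(group_col)[value_col].transform(
        lambda s: rng.permutation(s.to_numpy())
    )
    return out

# Hypothetical excerpt of the seed data
seed_df = pd.DataFrame({
    "Asset Family": ["Haptic Analy"] * 3 + ["Hyperdome Insp"] * 3,
    "Asset Component": [
        "Tetris Measuring Unit", "Micro Pressure Platform", "Mirror Lens Combinator",
        "Laser Column", "Turbo Quantifier", "High Frequency Emulator",
    ],
})

shuffled = shuffle_within_groups(seed_df, "Asset Family", "Asset Component", seed=0)
print(shuffled)
```

Because the permutation happens per group, a "Haptic Analy" row can only receive a component that already belonged to "Haptic Analy". The result can then be enlarged and row-shuffled as in the program above without breaking that pairing.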



The output data:



Asset_Id   Asset Family     Asset Name  Location    Asset Component     Keywords
3 Nano Dial Assem NDA11 Zenoa Fusion Tank Fall in Diffusion Ratio
1 Haptic Analy HAL3 Technopolis Rotation Chamber Rotation Chamber Intermittent Slowdown
2 Hyperdome Insp HISP2 Zenoa Turbo Quantifier Quantifier output Drops Intermittently
4 Geometric Synth GeoSyn25 La Puente Ion Gas Diffuser Column Alignment Issues
1 Haptic Analy HAL1 Zenoa Tetris Measuring Unit Measurement Inaccuracy,
2 Hyperdome Insp HISP1 Technopolis Laser Column Column Alignment Issues,
3 Nano Dial Assem NDA14 Zenoa Wave Generator Generator Power Failure
4 Geometric Synth GeoSyn24 La Puente Progeometric Plane Progeometric Plane Fault Detected


This output satisfies the first three conditions; only the 4th condition I am not able to do. Please help me get it. Thanks in advance.



Note: please go through all my conditions before coming to the code. If anything is unclear, please ask in a comment. Thanks.










      python pandas csv dataframe random














      asked Nov 25 '18 at 18:08 by rahul singh
      edited Nov 25 '18 at 18:16








