How to generate a large dataset and randomize it using a Python DataFrame
I have written a program that generates a large dataset and randomizes it according to some conditions.
Please go through the whole program and the conditions below; if anything is unclear, please leave a comment.
Input data:
Asset_Id Asset Family Asset Name Location Asset Component Keywords
1 Haptic Analy HAL1 Zenoa Tetris Measuring Unit Measurement Inaccuracy,
1 Haptic Analy HAL2 Zenoa Micro Pressure Platform Low air pressure,
1 Haptic Analy HAL3 Technopolis Rotation Chamber Rotation Chamber Intermittent Slowdown
1 Haptic Analy HAL4 Technopolis Mirror Lens Combinator Mirror Angle Skewed,
2 Hyperdome Insp HISP1 Technopolis Laser Column Column Alignment Issues,
2 Hyperdome Insp HISP2 Zenoa Turbo Quantifier Quantifier output Drops Intermittently
2 Hyperdome Insp HISP3 Technopolis Generator Generator
2 Hyperdome Insp HISP4 Zenoa High Frequency Emulator Emulator Frequency Drop
3 Nano Dial Assem NDA11 Zenoa Fusion Tank Fall in Diffusion Ratio
3 Nano Dial Assem NDA12 Zenoa Dial Loading Unit Faulty Scanner Unit
3 Nano Dial Assem NDA13 Zenoa Vaccum Line Control Above Normal
3 Nano Dial Assem NDA14 Zenoa Wave Generator Generator Power Failure
4 Geometric Synth GeoSyn22 La Puente Scanning Electronic Faulty Scanner Unit
4 Geometric Synth GeoSyn23 La Puente Draft Synthesis Chamber Beam offset beyond Tolerance
4 Geometric Synth GeoSyn24 La Puente Progeometric Plane Progeometric Plane Fault Detected
4 Geometric Synth GeoSyn25 La Puente Ion Gas Diffuser Column Alignment Issues
CONDITIONS:
1) The data should be read from a CSV file and all rows randomized.
2) The "Location" column should also be randomized separately and printed along with the rest of the randomized data.
3) More than 30k rows should be generated from the given data.
4) Important: the "Asset Component" column should also be randomized separately, but only within each "Asset Family". For example, the components of the "Haptic Analyser" family must not mix with those of "Hyperdome Inspector", "Nano Dial Assembler" and so on.
If anything about the 4th condition is unclear, please let me know.
For this I have written a program that satisfies the first three conditions:
import pandas as pd

def main():
    # Raw strings stop the backslashes in Windows paths being read as escapes
    df = pd.read_csv(r"C:\Users\rahul\Desktop\Data Manufacturing - Seed Data.csv")
    # Shuffle all rows
    ds = df.sample(frac=1)
    # Shuffle the Location column separately. to_numpy() drops the index, so the
    # assignment below cannot re-align the values to their original rows
    # (a join on the index would undo the shuffle)
    rand_loc = df["Location"].sample(frac=1).to_numpy()
    # Replace the original Location column with the shuffled one
    result = ds.drop("Location", axis=1)
    result["Location"] = rand_loc
    result = result[['Asset_Id ', 'Asset Family', 'Asset Name', 'Location', 'Asset Component','Keywords','Conditions','Parts','No. of Parts','SR_Id','SR_Date','SR_Month','SR_Year']]
    # Now randomise the whole data again
    ds1 = result.sample(frac=1)
    # Generate the large dataset (501 copies, as append([ds1]*500) produced) and shuffle it
    dd = pd.concat([ds1] * 501)
    ds2 = dd.sample(frac=1)
    print(ds2)
    # Write the large shuffled dataset (the original wrote ds1, the small one)
    ds2.to_csv(r'C:\Users\rahul\Desktop\people1.csv')

if __name__ == '__main__':
    main()
This program generates a large dataset, randomizes it, and also randomizes the "Location" column.
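One detail worth checking when a single column is shuffled on its own: pandas aligns a Series by index when it is assigned or joined, which can silently put every value back on its original row. The sketch below is only an illustration on toy data; the helper name `shuffle_column` and the `seed` parameter are hypothetical and not part of the program above.

```python
import pandas as pd

def shuffle_column(df, column, seed=None):
    """Return a copy of df with one column shuffled independently of the rows.

    Hypothetical helper for illustration only.
    """
    out = df.copy()
    # to_numpy() discards the index, so the shuffled order is kept on assignment
    out[column] = out[column].sample(frac=1, random_state=seed).to_numpy()
    return out

toy = pd.DataFrame({
    "Asset Name": ["HAL1", "HAL2", "HISP1", "NDA11"],
    "Location": ["Zenoa", "Technopolis", "La Puente", "Zenoa"],
})
print(shuffle_column(toy, "Location", seed=0))
```

Assigning the shuffled Series directly (without `to_numpy()`) would be re-aligned by index and leave the column unchanged.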
The only thing I am not able to do is the 4th condition: "Asset Component" should be randomized according to the values in the "Asset Family" column, so that the components of "Haptic Analyser" and "Hyperdome Inspector" do not mix with each other and are printed separately.
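A minimal sketch of one way the 4th condition could be approached, assuming it means "shuffle Asset Component only among rows that share the same Asset Family". It uses `groupby(...).transform(...)` with a NumPy permutation; the helper name `shuffle_within_groups`, the `seed` parameter and the toy rows are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def shuffle_within_groups(df, group_col, value_col, seed=None):
    """Return a copy of df with value_col shuffled separately inside each group.

    Hypothetical helper: values never leave the group_col group they came from.
    """
    rng = np.random.default_rng(seed)
    out = df.copy()
    # transform() calls the lambda once per group and puts the permuted
    # values back at the positions of that group's rows
    out[value_col] = out.groupby(group_col)[value_col].transform(
        lambda s: rng.permutation(s.to_numpy())
    )
    return out

toy = pd.DataFrame({
    "Asset Family": ["Haptic Analy"] * 3 + ["Hyperdome Insp"] * 3,
    "Asset Component": [
        "Tetris Measuring Unit", "Micro Pressure Platform", "Rotation Chamber",
        "Laser Column", "Turbo Quantifier", "Generator",
    ],
})
print(shuffle_within_groups(toy, "Asset Family", "Asset Component", seed=0))
```

Because the permutation is applied per group, a "Haptic Analy" row can only receive a component that already belonged to a "Haptic Analy" row; the full-row shuffle and the row multiplication can then be applied afterwards as before.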
The output data:
Asset_Id Asset Family Asset Name Location Asset Component Keywords
3 Nano Dial Assem NDA11 Zenoa Fusion Tank Fall in Diffusion Ratio
1 Haptic Analy HAL3 Technopolis Rotation Chamber Rotation Chamber Intermittent Slowdown
2 Hyperdome Insp HISP2 Zenoa Turbo Quantifier Quantifier output Drops Intermittently
4 Geometric Synth GeoSyn25 La Puente Ion Gas Diffuser Column Alignment Issues
1 Haptic Analy HAL1 Zenoa Tetris Measuring Unit Measurement Inaccuracy,
2 Hyperdome Insp HISP1 Technopolis Laser Column Column Alignment Issues,
3 Nano Dial Assem NDA14 Zenoa Wave Generator Generator Power Failure
4 Geometric Synth GeoSyn24 La Puente Progeometric Plane Progeometric Plane Fault Detected
This output satisfies the first three conditions; only the 4th condition I am not able to do. Please help me with it. Thanks in advance.
Note: please go through all my conditions before looking at the code. If anything is unclear, please ask in a comment. Thanks.
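For the 3rd condition, the number of copies needed depends on how many rows the seed file has. The sketch below computes it instead of hard-coding it, assuming "more than 30k" means at least 30,001 rows; `grow_and_shuffle`, `min_rows` and `seed` are hypothetical names used only for illustration.

```python
import math
import pandas as pd

def grow_and_shuffle(df, min_rows=30_001, seed=None):
    """Repeat df until it has at least min_rows rows, then shuffle the rows.

    Hypothetical helper for illustration only.
    """
    # Smallest number of whole copies that reaches min_rows
    copies = math.ceil(min_rows / len(df))
    big = pd.concat([df] * copies, ignore_index=True)
    # sample(frac=1) returns all rows in random order
    return big.sample(frac=1, random_state=seed).reset_index(drop=True)

seed_rows = pd.DataFrame({"Asset_Id": range(16)})  # stand-in for the 16 sample rows
big = grow_and_shuffle(seed_rows, seed=0)
print(len(big))
```

With the 16 sample rows shown in the input data, this gives 1876 copies.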
python pandas csv dataframe random
edited Nov 25 '18 at 18:16
rahul singh
asked Nov 25 '18 at 18:08