Filtering data stored in csv files
I have a csv file that contains my portfolio information (i.e. records of stocks purchased in last 20 years). File has around millions of records. I want to write a python function that reads this files and returns filtered data.
Fields in csv file:
Stock Quantity Price PurchaseDate
Querying ability examples:
- Ability to query purchases made in a given duration.
- Ability to query purchases made for a specific stock.
- Ability to query information like stocks purchased in quantity greater then 50 units, stocks purchased for less then 50$ etc.
Questions:
- How will the signature of search_portfolio method look like? Specifically, how will users provide all these filtering information?
- Which efficient data structure should be used to read and filter this information from file? Should I read this Pandas DF or use something like Python dicts? Why one and not the other?
- I don't want to read the file content again if data in the file has not changed, how can I memoize this data?
Generic API signature and implementation ideas are really appreciated.
Sample Data:
MSFT,50,100,23/02/2018
APPL,20,901,03/02/2010
AMZN,50,210,02/11/2005
MSFT,21,103,25/02/2015
MSFT,54,101,14/06/2014
AMZN,80,230,09/08/2009
MSFT,35,100,13/09/2007
GOOG,24,830,08/02/2006
MSFT,45,102,18/07/2007
GOOG,30,701,03/03/2006
python
add a comment |
I have a csv file that contains my portfolio information (i.e. records of stocks purchased in last 20 years). File has around millions of records. I want to write a python function that reads this files and returns filtered data.
Fields in csv file:
Stock Quantity Price PurchaseDate
Querying ability examples:
- Ability to query purchases made in a given duration.
- Ability to query purchases made for a specific stock.
- Ability to query information like stocks purchased in quantity greater then 50 units, stocks purchased for less then 50$ etc.
Questions:
- How will the signature of search_portfolio method look like? Specifically, how will users provide all these filtering information?
- Which efficient data structure should be used to read and filter this information from file? Should I read this Pandas DF or use something like Python dicts? Why one and not the other?
- I don't want to read the file content again if data in the file has not changed, how can I memoize this data?
Generic API signature and implementation ideas are really appreciated.
Sample Data:
MSFT,50,100,23/02/2018
APPL,20,901,03/02/2010
AMZN,50,210,02/11/2005
MSFT,21,103,25/02/2015
MSFT,54,101,14/06/2014
AMZN,80,230,09/08/2009
MSFT,35,100,13/09/2007
GOOG,24,830,08/02/2006
MSFT,45,102,18/07/2007
GOOG,30,701,03/03/2006
python
2
post a minimal and testable input data
– RomanPerekhrest
Nov 25 '18 at 7:01
added sample data
– Lokesh Agrawal
Nov 25 '18 at 7:25
The reason people will downvote it because you didn't post what have you tried so far. make a habit of posting the tried code snippet's so that people can help you better.
– Shiv Rajawat
Nov 25 '18 at 7:51
add a comment |
I have a csv file that contains my portfolio information (i.e. records of stocks purchased in last 20 years). File has around millions of records. I want to write a python function that reads this files and returns filtered data.
Fields in csv file:
Stock Quantity Price PurchaseDate
Querying ability examples:
- Ability to query purchases made in a given duration.
- Ability to query purchases made for a specific stock.
- Ability to query information like stocks purchased in quantity greater then 50 units, stocks purchased for less then 50$ etc.
Questions:
- How will the signature of search_portfolio method look like? Specifically, how will users provide all these filtering information?
- Which efficient data structure should be used to read and filter this information from file? Should I read this Pandas DF or use something like Python dicts? Why one and not the other?
- I don't want to read the file content again if data in the file has not changed, how can I memoize this data?
Generic API signature and implementation ideas are really appreciated.
Sample Data:
MSFT,50,100,23/02/2018
APPL,20,901,03/02/2010
AMZN,50,210,02/11/2005
MSFT,21,103,25/02/2015
MSFT,54,101,14/06/2014
AMZN,80,230,09/08/2009
MSFT,35,100,13/09/2007
GOOG,24,830,08/02/2006
MSFT,45,102,18/07/2007
GOOG,30,701,03/03/2006
python
I have a csv file that contains my portfolio information (i.e. records of stocks purchased in last 20 years). File has around millions of records. I want to write a python function that reads this files and returns filtered data.
Fields in csv file:
Stock Quantity Price PurchaseDate
Querying ability examples:
- Ability to query purchases made in a given duration.
- Ability to query purchases made for a specific stock.
- Ability to query information like stocks purchased in quantity greater then 50 units, stocks purchased for less then 50$ etc.
Questions:
- How will the signature of search_portfolio method look like? Specifically, how will users provide all these filtering information?
- Which efficient data structure should be used to read and filter this information from file? Should I read this Pandas DF or use something like Python dicts? Why one and not the other?
- I don't want to read the file content again if data in the file has not changed, how can I memoize this data?
Generic API signature and implementation ideas are really appreciated.
Sample Data:
MSFT,50,100,23/02/2018
APPL,20,901,03/02/2010
AMZN,50,210,02/11/2005
MSFT,21,103,25/02/2015
MSFT,54,101,14/06/2014
AMZN,80,230,09/08/2009
MSFT,35,100,13/09/2007
GOOG,24,830,08/02/2006
MSFT,45,102,18/07/2007
GOOG,30,701,03/03/2006
python
python
edited Nov 26 '18 at 17:32
E_net4
12.1k63568
12.1k63568
asked Nov 25 '18 at 6:49
Lokesh AgrawalLokesh Agrawal
416930
416930
2
post a minimal and testable input data
– RomanPerekhrest
Nov 25 '18 at 7:01
added sample data
– Lokesh Agrawal
Nov 25 '18 at 7:25
The reason people will downvote it because you didn't post what have you tried so far. make a habit of posting the tried code snippet's so that people can help you better.
– Shiv Rajawat
Nov 25 '18 at 7:51
add a comment |
2
post a minimal and testable input data
– RomanPerekhrest
Nov 25 '18 at 7:01
added sample data
– Lokesh Agrawal
Nov 25 '18 at 7:25
The reason people will downvote it because you didn't post what have you tried so far. make a habit of posting the tried code snippet's so that people can help you better.
– Shiv Rajawat
Nov 25 '18 at 7:51
2
2
post a minimal and testable input data
– RomanPerekhrest
Nov 25 '18 at 7:01
post a minimal and testable input data
– RomanPerekhrest
Nov 25 '18 at 7:01
added sample data
– Lokesh Agrawal
Nov 25 '18 at 7:25
added sample data
– Lokesh Agrawal
Nov 25 '18 at 7:25
The reason people will downvote it because you didn't post what have you tried so far. make a habit of posting the tried code snippet's so that people can help you better.
– Shiv Rajawat
Nov 25 '18 at 7:51
The reason people will downvote it because you didn't post what have you tried so far. make a habit of posting the tried code snippet's so that people can help you better.
– Shiv Rajawat
Nov 25 '18 at 7:51
add a comment |
1 Answer
1
active
oldest
votes
First, read the data from csv to a pandas dataframe.
import pandas as pd
# parse dates from column index 3: Purchasedate
df=pd.read_csv('file_name.csv', parse_dates=True, index_col=3)
Now if you want to select stocks purchased between two dates
print(df.loc['2010-07-01' : '2010-10-02'])
Similarly if you want data before/after a particular date, use
print(df.loc['2010-07-01' : ]) ## All records after given date
print(df.loc[ : '2010-10-02']) ## All records before given date
If you want to filter records based on some condition, simply iterate through entire dataframe and use conditional statement to filter them.
for index, row in df.iterrows():
if(row['Stock'] == 'MSFT'):
print(row)
Or
for index, row in df.iterrows():
if(row['Price'] > 100):
print(row)
There are many other ways you can achieve this. Just play around it and you'll get to know.
Thanks for the answer. Is Pandas DF an optimal data structure for solving such problems in Python? What about search_portfolio methods signature and what about caching the data?
– Lokesh Agrawal
Nov 25 '18 at 7:51
1
Yes, if you are dealing with 1000s of records pandas dataframe is the best data structure. I
– Shiv Rajawat
Nov 25 '18 at 7:53
Create methods for all different possible combination of conditions that user can use to filter data. You can even do it in one method, but it'll require some effort from your side. For caching data, use temporary dataframes to store results, and if you are interested in only one column then use list to store results.
– Shiv Rajawat
Nov 25 '18 at 7:57
If it worked, kindly accept the answer.
– Shiv Rajawat
Nov 25 '18 at 14:12
It will work, but I am still looking for some opinions around caching of data and suggestions around generic method signature. Creating multiple methods for different conditions would be cumbersome and may not be maintainable in long run.
– Lokesh Agrawal
Nov 25 '18 at 14:17
add a comment |
Your Answer
StackExchange.ifUsing("editor", function () {
StackExchange.using("externalEditor", function () {
StackExchange.using("snippets", function () {
StackExchange.snippets.init();
});
});
}, "code-snippets");
StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "1"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);
StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});
function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});
}
});
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53465311%2ffiltering-data-stored-in-csv-files%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
1 Answer
1
active
oldest
votes
1 Answer
1
active
oldest
votes
active
oldest
votes
active
oldest
votes
First, read the data from csv to a pandas dataframe.
import pandas as pd
# parse dates from column index 3: Purchasedate
df=pd.read_csv('file_name.csv', parse_dates=True, index_col=3)
Now if you want to select stocks purchased between two dates
print(df.loc['2010-07-01' : '2010-10-02'])
Similarly if you want data before/after a particular date, use
print(df.loc['2010-07-01' : ]) ## All records after given date
print(df.loc[ : '2010-10-02']) ## All records before given date
If you want to filter records based on some condition, simply iterate through entire dataframe and use conditional statement to filter them.
for index, row in df.iterrows():
if(row['Stock'] == 'MSFT'):
print(row)
Or
for index, row in df.iterrows():
if(row['Price'] > 100):
print(row)
There are many other ways you can achieve this. Just play around it and you'll get to know.
Thanks for the answer. Is Pandas DF an optimal data structure for solving such problems in Python? What about search_portfolio methods signature and what about caching the data?
– Lokesh Agrawal
Nov 25 '18 at 7:51
1
Yes, if you are dealing with 1000s of records pandas dataframe is the best data structure. I
– Shiv Rajawat
Nov 25 '18 at 7:53
Create methods for all different possible combination of conditions that user can use to filter data. You can even do it in one method, but it'll require some effort from your side. For caching data, use temporary dataframes to store results, and if you are interested in only one column then use list to store results.
– Shiv Rajawat
Nov 25 '18 at 7:57
If it worked, kindly accept the answer.
– Shiv Rajawat
Nov 25 '18 at 14:12
It will work, but I am still looking for some opinions around caching of data and suggestions around generic method signature. Creating multiple methods for different conditions would be cumbersome and may not be maintainable in long run.
– Lokesh Agrawal
Nov 25 '18 at 14:17
add a comment |
First, read the data from csv to a pandas dataframe.
import pandas as pd
# parse dates from column index 3: Purchasedate
df=pd.read_csv('file_name.csv', parse_dates=True, index_col=3)
Now if you want to select stocks purchased between two dates
print(df.loc['2010-07-01' : '2010-10-02'])
Similarly if you want data before/after a particular date, use
print(df.loc['2010-07-01' : ]) ## All records after given date
print(df.loc[ : '2010-10-02']) ## All records before given date
If you want to filter records based on some condition, simply iterate through entire dataframe and use conditional statement to filter them.
for index, row in df.iterrows():
if(row['Stock'] == 'MSFT'):
print(row)
Or
for index, row in df.iterrows():
if(row['Price'] > 100):
print(row)
There are many other ways you can achieve this. Just play around it and you'll get to know.
Thanks for the answer. Is Pandas DF an optimal data structure for solving such problems in Python? What about search_portfolio methods signature and what about caching the data?
– Lokesh Agrawal
Nov 25 '18 at 7:51
1
Yes, if you are dealing with 1000s of records pandas dataframe is the best data structure. I
– Shiv Rajawat
Nov 25 '18 at 7:53
Create methods for all different possible combination of conditions that user can use to filter data. You can even do it in one method, but it'll require some effort from your side. For caching data, use temporary dataframes to store results, and if you are interested in only one column then use list to store results.
– Shiv Rajawat
Nov 25 '18 at 7:57
If it worked, kindly accept the answer.
– Shiv Rajawat
Nov 25 '18 at 14:12
It will work, but I am still looking for some opinions around caching of data and suggestions around generic method signature. Creating multiple methods for different conditions would be cumbersome and may not be maintainable in long run.
– Lokesh Agrawal
Nov 25 '18 at 14:17
add a comment |
First, read the data from csv to a pandas dataframe.
import pandas as pd
# parse dates from column index 3: Purchasedate
df=pd.read_csv('file_name.csv', parse_dates=True, index_col=3)
Now if you want to select stocks purchased between two dates
print(df.loc['2010-07-01' : '2010-10-02'])
Similarly if you want data before/after a particular date, use
print(df.loc['2010-07-01' : ]) ## All records after given date
print(df.loc[ : '2010-10-02']) ## All records before given date
If you want to filter records based on some condition, simply iterate through entire dataframe and use conditional statement to filter them.
for index, row in df.iterrows():
if(row['Stock'] == 'MSFT'):
print(row)
Or
for index, row in df.iterrows():
if(row['Price'] > 100):
print(row)
There are many other ways you can achieve this. Just play around it and you'll get to know.
First, read the data from csv to a pandas dataframe.
import pandas as pd
# parse dates from column index 3: Purchasedate
df=pd.read_csv('file_name.csv', parse_dates=True, index_col=3)
Now if you want to select stocks purchased between two dates
print(df.loc['2010-07-01' : '2010-10-02'])
Similarly if you want data before/after a particular date, use
print(df.loc['2010-07-01' : ]) ## All records after given date
print(df.loc[ : '2010-10-02']) ## All records before given date
If you want to filter records based on some condition, simply iterate through entire dataframe and use conditional statement to filter them.
for index, row in df.iterrows():
if(row['Stock'] == 'MSFT'):
print(row)
Or
for index, row in df.iterrows():
if(row['Price'] > 100):
print(row)
There are many other ways you can achieve this. Just play around it and you'll get to know.
answered Nov 25 '18 at 7:48
Shiv RajawatShiv Rajawat
23117
23117
Thanks for the answer. Is Pandas DF an optimal data structure for solving such problems in Python? What about search_portfolio methods signature and what about caching the data?
– Lokesh Agrawal
Nov 25 '18 at 7:51
1
Yes, if you are dealing with 1000s of records pandas dataframe is the best data structure. I
– Shiv Rajawat
Nov 25 '18 at 7:53
Create methods for all different possible combination of conditions that user can use to filter data. You can even do it in one method, but it'll require some effort from your side. For caching data, use temporary dataframes to store results, and if you are interested in only one column then use list to store results.
– Shiv Rajawat
Nov 25 '18 at 7:57
If it worked, kindly accept the answer.
– Shiv Rajawat
Nov 25 '18 at 14:12
It will work, but I am still looking for some opinions around caching of data and suggestions around generic method signature. Creating multiple methods for different conditions would be cumbersome and may not be maintainable in long run.
– Lokesh Agrawal
Nov 25 '18 at 14:17
add a comment |
Thanks for the answer. Is Pandas DF an optimal data structure for solving such problems in Python? What about search_portfolio methods signature and what about caching the data?
– Lokesh Agrawal
Nov 25 '18 at 7:51
1
Yes, if you are dealing with 1000s of records pandas dataframe is the best data structure. I
– Shiv Rajawat
Nov 25 '18 at 7:53
Create methods for all different possible combination of conditions that user can use to filter data. You can even do it in one method, but it'll require some effort from your side. For caching data, use temporary dataframes to store results, and if you are interested in only one column then use list to store results.
– Shiv Rajawat
Nov 25 '18 at 7:57
If it worked, kindly accept the answer.
– Shiv Rajawat
Nov 25 '18 at 14:12
It will work, but I am still looking for some opinions around caching of data and suggestions around generic method signature. Creating multiple methods for different conditions would be cumbersome and may not be maintainable in long run.
– Lokesh Agrawal
Nov 25 '18 at 14:17
Thanks for the answer. Is Pandas DF an optimal data structure for solving such problems in Python? What about search_portfolio methods signature and what about caching the data?
– Lokesh Agrawal
Nov 25 '18 at 7:51
Thanks for the answer. Is Pandas DF an optimal data structure for solving such problems in Python? What about search_portfolio methods signature and what about caching the data?
– Lokesh Agrawal
Nov 25 '18 at 7:51
1
1
Yes, if you are dealing with 1000s of records pandas dataframe is the best data structure. I
– Shiv Rajawat
Nov 25 '18 at 7:53
Yes, if you are dealing with 1000s of records pandas dataframe is the best data structure. I
– Shiv Rajawat
Nov 25 '18 at 7:53
Create methods for all different possible combination of conditions that user can use to filter data. You can even do it in one method, but it'll require some effort from your side. For caching data, use temporary dataframes to store results, and if you are interested in only one column then use list to store results.
– Shiv Rajawat
Nov 25 '18 at 7:57
Create methods for all different possible combination of conditions that user can use to filter data. You can even do it in one method, but it'll require some effort from your side. For caching data, use temporary dataframes to store results, and if you are interested in only one column then use list to store results.
– Shiv Rajawat
Nov 25 '18 at 7:57
If it worked, kindly accept the answer.
– Shiv Rajawat
Nov 25 '18 at 14:12
If it worked, kindly accept the answer.
– Shiv Rajawat
Nov 25 '18 at 14:12
It will work, but I am still looking for some opinions around caching of data and suggestions around generic method signature. Creating multiple methods for different conditions would be cumbersome and may not be maintainable in long run.
– Lokesh Agrawal
Nov 25 '18 at 14:17
It will work, but I am still looking for some opinions around caching of data and suggestions around generic method signature. Creating multiple methods for different conditions would be cumbersome and may not be maintainable in long run.
– Lokesh Agrawal
Nov 25 '18 at 14:17
add a comment |
Thanks for contributing an answer to Stack Overflow!
- Please be sure to answer the question. Provide details and share your research!
But avoid …
- Asking for help, clarification, or responding to other answers.
- Making statements based on opinion; back them up with references or personal experience.
To learn more, see our tips on writing great answers.
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53465311%2ffiltering-data-stored-in-csv-files%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
2
post a minimal and testable input data
– RomanPerekhrest
Nov 25 '18 at 7:01
added sample data
– Lokesh Agrawal
Nov 25 '18 at 7:25
The reason people will downvote it because you didn't post what have you tried so far. make a habit of posting the tried code snippet's so that people can help you better.
– Shiv Rajawat
Nov 25 '18 at 7:51