Count number of true and false condition in spark data frame
I am coming from a MATLAB background, where I can simply do this:
age_sum_error = sum(age > prediction - 4 & age < prediction + 4);
This counts the number of age values that lie within 4 of the prediction (prediction - 4 < age < prediction + 4). I want to do something similar with a Spark DataFrame.
Say that below is my Spark DataFrame:
+----+--------+------------+
|age | gender | prediction |
+----+--------+------------+
|35  | M      | 30         |
|40  | F      | 42         |
|45  | F      | 38         |
|26  | F      | 29         |
+----+--------+------------+
I want my result to look something like this:
+------+----------+
|false | positive |
+------+----------+
|2     | 2        |
+------+----------+
python apache-spark pyspark apache-spark-sql
asked Nov 24 '18 at 21:13 by Jam1
2 Answers
First compute the condition as an integer column, then aggregate the result by summing up the 1s and 0s:
df.selectExpr(
'cast(abs(age - prediction) < 4 as int) as condition'
).selectExpr(
'sum(condition) as positive',
'sum(1-condition) as negative'
).show()
+--------+--------+
|positive|negative|
+--------+--------+
| 2| 2|
+--------+--------+
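To see why both counts come out as 2, the same cast-and-sum logic can be traced in plain Python on the four sample rows from the question. This is only a stand-in for what the Spark expressions compute, not Spark code; the `rows` list and the variable names are assumptions made for illustration.

```python
# Sample rows from the question as (age, prediction) pairs (gender omitted).
rows = [(35, 30), (40, 42), (45, 38), (26, 29)]

# Mirror of cast(abs(age - prediction) < 4 as int): True -> 1, False -> 0.
condition = [int(abs(age - prediction) < 4) for age, prediction in rows]

# Mirror of sum(condition) and sum(1 - condition).
positive = sum(condition)
negative = sum(1 - c for c in condition)

print(condition)           # [0, 1, 0, 1]
print(positive, negative)  # 2 2
```

Only the second and fourth rows (differences of 2 and 3) fall inside the tolerance, which matches the output shown above.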
answered Nov 24 '18 at 21:57 by Psidom
It's a lot more code than MATLAB, but here's how I would do it with NumPy (on plain arrays rather than a Spark DataFrame).
import numpy as np
# use NumPy arrays so that adding/subtracting the scalar tolerance works element-wise
ages = np.array([35, 40, 45, 26])
pred = np.array([30, 42, 38, 29])
tolerance = 4
# boolean arrays: is each age above the lower limit / below the upper limit?
is_older = np.greater(ages, pred - tolerance)  # a boolean array
is_younger = np.less(ages, pred + tolerance)  # a boolean array
# convert these boolean arrays to ints, then multiply. True = 1, False = 0.
in_range = is_older.astype(int) * is_younger.astype(int)  # any 0 zeroes the product
# add up the entries that are still 1
senior_count = np.sum(in_range)
Hope this helps.
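The two comparisons can also be combined directly with NumPy's element-wise `&`, which is closer to the MATLAB one-liner in the question. A minimal sketch, assuming the sample values from the question are held in NumPy arrays; the names `in_range_count` and `out_of_range_count` are illustrative.

```python
import numpy as np

ages = np.array([35, 40, 45, 26])
pred = np.array([30, 42, 38, 29])
tolerance = 4

# Combine both bounds with element-wise logical AND (&), mirroring MATLAB's `&`.
in_range = (ages > pred - tolerance) & (ages < pred + tolerance)

# count_nonzero counts the True entries of the boolean mask.
in_range_count = int(np.count_nonzero(in_range))
out_of_range_count = int(in_range.size - in_range_count)

print(in_range_count, out_of_range_count)  # 2 2
```

The parentheses around each comparison matter in Python, because `&` binds more tightly than `>` and `<`.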
answered Nov 24 '18 at 21:54 by Charles Strauss