assembler treating float as string
I have the example code below. I'm trying to build an ML pipeline in Scala to do simple linear regression. When I run the VectorAssembler on my list of features, I get the error shown below. The features I'm using are all floats without missing values; example data is included below. I'm very new to Scala and I'm wondering what the issue is. Does the assembler have trouble with floats? I'm using Spark 2.3.0.
code:
import org.apache.spark.ml.evaluation.RegressionEvaluator
import org.apache.spark.ml.regression.LinearRegression
import org.apache.spark.ml.tuning.{ParamGridBuilder, TrainValidationSplit}
// To see fewer warnings
import org.apache.log4j._
Logger.getLogger("org").setLevel(Level.ERROR)
// Start a simple Spark Session
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder().getOrCreate()
// Prepare training and test data.
// val data = spark.read.option("header","true").option("inferSchema","true").format("csv").load("USA_Housing.csv")
val data = spark.read.option("header","true").option("inferSchema","true").format("csv").load("/Users/sshields/Desktop/stuff/udemy/spark/spark-for-big-data/Scala-and-Spark-Bootcamp-master/Machine_Learning_Sections/Regression/USA_Housing.csv")
// Check out the Data
data.printSchema()
// See an example of what the data looks like
// by printing out a Row
val colnames = data.columns
val firstrow = data.head(1)(0)
println("\n")
println("Example Data Row")
for(ind <- Range(1,colnames.length)){
  println(colnames(ind))
  println(firstrow(ind))
  println("\n")
}
////////////////////////////////////////////////////
//// Setting Up DataFrame for Machine Learning ////
//////////////////////////////////////////////////
// A few things we need to do before Spark can accept the data!
// It needs to be in the form of two columns
// ("label","features")
// This will allow us to join multiple feature columns
// into a single column of an array of feature values
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.linalg.Vectors
// Rename Price to label column for naming convention.
// Grab only numerical columns from the data
val df = data.select(data("Price").as("label"),$"Avg Area Income",$"Avg Area House Age",$"Avg Area Number of Rooms",$"Area Population")
// An assembler converts the input values to a vector
// A vector is what the ML algorithm reads to train a model
// Set the input columns from which we are supposed to read the values
// Set the name of the column where the vector will be stored
val assembler = new VectorAssembler().setInputCols(Array("Avg Area Income","Avg Area House Age","Avg Area Number of Rooms","Area Population")).setOutputCol("features")
// Use the assembler to transform our DataFrame to the two columns
val output = assembler.transform(df).select($"label",$"features")
Data:
   Avg Area Income  Avg Area House Age  Avg Area Number of Rooms
0     79545.458574            5.682861                  7.009188
1     79248.642455            6.002900                  6.730821
2     61287.067179            5.865890                  8.512727
3     63345.240046            7.188236                  5.586729
4     59982.197226            5.040555                  7.839388

   Avg Area Number of Bedrooms  Area Population         Price
0                         4.09     23086.800503  1.059034e+06
1                         3.09     40173.072174  1.505891e+06
2                         5.13     36882.159400  1.058988e+06
3                         3.26     34310.242831  1.260617e+06
4                         4.23     26354.109472  6.309435e+05

   Address
0  208 Michael Ferry Apt. 674\nLaurabury, NE 3701...
1  188 Johnson Views Suite 079\nLake Kathleen, CA...
2  9127 Elizabeth Stravenue\nDanieltown, WI 06482...
3  USS Barnett\nFPO AP 44820
4  USNS Raymond\nFPO AE 09386
Error:
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.linalg.Vectors
df: org.apache.spark.sql.DataFrame = [label: double, Avg Area Income: string ... 3 more fields]
assembler: org.apache.spark.ml.feature.VectorAssembler = vecAssembler_3e70ff1660b1
java.lang.IllegalArgumentException: Data type StringType of column Avg Area Income is not supported.
Data type StringType of column Avg Area House Age is not supported.
at org.apache.spark.ml.feature.VectorAssembler.transformSchema(VectorAssembler.scala:124)
at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:74)
at org.apache.spark.ml.feature.VectorAssembler.transform(VectorAssembler.scala:54)
... 121 elided
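The error says the assembler received StringType columns, so the floats themselves are probably not the problem; the schema inferred from the CSV is. Why inference produced strings is not shown in the post. One unconfirmed possibility is that the quoted Address values contain line breaks, which would split records unless the CSV reader's `multiLine` option is enabled. As a minimal sketch of a workaround for pasting into spark-shell, assuming Spark 2.x default (non-ANSI) cast semantics where unparseable strings become null, the feature columns can be cast to double and the rows that fail to parse dropped before assembling. The in-memory `raw` DataFrame and its values are a hypothetical stand-in for the loaded file.

```scala
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.DoubleType

val spark = SparkSession.builder().master("local[*]").appName("cast-sketch").getOrCreate()
import spark.implicits._

// Hypothetical stand-in for the loaded CSV: every column is a string,
// and the last row holds non-numeric text where numbers are expected.
val rows: Seq[(String, String, String, String, String)] = Seq(
  ("1059034.0", "79545.458574", "5.682861", "7.009188", "23086.800503"),
  ("1505891.0", "79248.642455", "6.002900", "6.730821", "40173.072174"),
  (null, "not-a-number", "also-not-a-number", null, null)
)
val raw = rows.toDF("Price", "Avg Area Income", "Avg Area House Age",
  "Avg Area Number of Rooms", "Area Population")

val featureCols = Array("Avg Area Income", "Avg Area House Age",
  "Avg Area Number of Rooms", "Area Population")

// Cast the label and every feature to double, keeping the original column names.
val labelCol = col("Price").cast(DoubleType).as("label")
val castCols = featureCols.map(c => col(c).cast(DoubleType).as(c))
val typed = raw.select((labelCol +: castCols): _*)

// Drop rows where any cast produced null, so the assembler only sees numeric values.
val clean = typed.na.drop()

val assembler = new VectorAssembler().setInputCols(featureCols).setOutputCol("features")
val output = assembler.transform(clean).select("label", "features")
output.show(false)
```

Comparing `typed.count()` with `clean.count()` shows how many rows failed to parse. If many rows are dropped on the real file, the parsing of the CSV itself is likely worth investigating rather than relying on the cast.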
scala apache-spark
asked Nov 26 '18 at 19:58
user3476463