assembler treating float as string
I have the example code below. I'm trying to build an ML pipeline in Scala; my goal is simple linear regression. I'm getting the error below when I try to run the assembler with the list of features. The features I'm using are all floats with no missing values (example data below). I'm very new to Scala and I'm wondering what the issue is. Does the assembler have trouble with floats? I'm using Spark 2.3.0.



Code:



import org.apache.spark.ml.evaluation.RegressionEvaluator
import org.apache.spark.ml.regression.LinearRegression
import org.apache.spark.ml.tuning.{ParamGridBuilder, TrainValidationSplit}

// To see fewer warnings
import org.apache.log4j._
Logger.getLogger("org").setLevel(Level.ERROR)


// Start a simple Spark Session
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder().getOrCreate()

// Prepare training and test data.
// val data = spark.read.option("header","true").option("inferSchema","true").format("csv").load("USA_Housing.csv")

val data = spark.read.option("header","true").option("inferSchema","true").format("csv").load("/Users/sshields/Desktop/stuff/udemy/spark/spark-for-big-data/Scala-and-Spark-Bootcamp-master/Machine_Learning_Sections/Regression/USA_Housing.csv")

// Check out the Data
data.printSchema()

// See an example of what the data looks like
// by printing out a Row
val colnames = data.columns
val firstrow = data.head(1)(0)
println("n")
println("Example Data Row")
for(ind <- Range(1,colnames.length)){
println(colnames(ind))
println(firstrow(ind))
println("n")
}

////////////////////////////////////////////////////
//// Setting Up DataFrame for Machine Learning ////
//////////////////////////////////////////////////

// A few things we need to do before Spark can accept the data!
// It needs to be in the form of two columns
// ("label","features")

// This will allow us to join multiple feature columns
// into a single column containing an array of feature values
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.linalg.Vectors

// Rename Price to label column for naming convention.
// Grab only numerical columns from the data
val df = data.select(data("Price").as("label"),$"Avg Area Income",$"Avg Area House Age",$"Avg Area Number of Rooms",$"Area Population")

// An assembler converts the input values to a vector
// A vector is what the ML algorithm reads to train a model

// Set the input columns from which we are supposed to read the values
// Set the name of the column where the vector will be stored
val assembler = new VectorAssembler().setInputCols(Array("Avg Area Income","Avg Area House Age","Avg Area Number of Rooms","Area Population")).setOutputCol("features")

// Use the assembler to transform our DataFrame to the two columns
val output = assembler.transform(df).select($"label",$"features")


Data:



Avg Area Income  Avg Area House Age  Avg Area Number of Rooms  
0 79545.458574 5.682861 7.009188
1 79248.642455 6.002900 6.730821
2 61287.067179 5.865890 8.512727
3 63345.240046 7.188236 5.586729
4 59982.197226 5.040555 7.839388

Avg Area Number of Bedrooms Area Population Price
0 4.09 23086.800503 1.059034e+06
1 3.09 40173.072174 1.505891e+06
2 5.13 36882.159400 1.058988e+06
3 3.26 34310.242831 1.260617e+06
4 4.23 26354.109472 6.309435e+05

Address
0 208 Michael Ferry Apt. 674\nLaurabury, NE 3701...
1 188 Johnson Views Suite 079\nLake Kathleen, CA...
2 9127 Elizabeth Stravenue\nDanieltown, WI 06482...
3 USS Barnett\nFPO AP 44820
4 USNS Raymond\nFPO AE 09386
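
One thing I notice in the data above: the Address values appear to contain embedded "\n" characters. Could the default CSV parsing be getting confused by line breaks inside quoted fields, and end up inferring string for the numeric columns as a result? If so, maybe re-reading with the multiLine option would help (untested guess on my part):

// Untested guess: let quoted fields span multiple lines so rows aren't split mid-record
val dataMultiLine = spark.read.option("header","true").option("inferSchema","true").option("multiLine","true").format("csv").load("/Users/sshields/Desktop/stuff/udemy/spark/spark-for-big-data/Scala-and-Spark-Bootcamp-master/Machine_Learning_Sections/Regression/USA_Housing.csv")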


Error:



import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.linalg.Vectors
df: org.apache.spark.sql.DataFrame = [label: double, Avg Area Income: string ... 3 more fields]
assembler: org.apache.spark.ml.feature.VectorAssembler = vecAssembler_3e70ff1660b1
java.lang.IllegalArgumentException: Data type StringType of column Avg Area Income is not supported.
Data type StringType of column Avg Area House Age is not supported.
at org.apache.spark.ml.feature.VectorAssembler.transformSchema(VectorAssembler.scala:124)
at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:74)
at org.apache.spark.ml.feature.VectorAssembler.transform(VectorAssembler.scala:54)
... 121 elided
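
For reference, the df line in the output above shows Avg Area Income: string, so the feature columns do seem to be coming out of the CSV read as strings rather than doubles. As an untested workaround I'm considering casting them to DoubleType before running the assembler, something along these lines (sketch only, reusing the df from the code above):

import org.apache.spark.sql.types.DoubleType
import org.apache.spark.sql.functions.col

// Untested sketch: explicitly cast the feature columns (currently strings)
// to double before handing them to the VectorAssembler
val featureCols = Array("Avg Area Income","Avg Area House Age","Avg Area Number of Rooms","Area Population")

val dfDouble = featureCols.foldLeft(df)((tmp, c) => tmp.withColumn(c, col(c).cast(DoubleType)))

val assemblerCast = new VectorAssembler().setInputCols(featureCols).setOutputCol("features")
val outputCast = assemblerCast.transform(dfDouble).select($"label",$"features")

Is casting like this a reasonable approach, or is there a cleaner way to get inferSchema to read these columns as doubles in the first place?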








