assembler treating float as string
I have the example code below. I'm trying to build an ML pipeline in Scala; my goal is simple linear regression. I get the error below when I run the assembler on my list of feature columns. The features are all floats with no missing values, and there is example data below. I'm very new to Scala and I'm wondering what the issue is. Does the assembler have trouble with floats? I'm using Spark 2.3.0.



Code:



import org.apache.spark.ml.evaluation.RegressionEvaluator
import org.apache.spark.ml.regression.LinearRegression
import org.apache.spark.ml.tuning.{ParamGridBuilder, TrainValidationSplit}

// To see fewer warnings
import org.apache.log4j._
Logger.getLogger("org").setLevel(Level.ERROR)


// Start a simple Spark Session
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder().getOrCreate()

// Prepare training and test data.
// val data = spark.read.option("header","true").option("inferSchema","true").format("csv").load("USA_Housing.csv")

val data = spark.read.option("header","true").option("inferSchema","true").format("csv").load("/Users/sshields/Desktop/stuff/udemy/spark/spark-for-big-data/Scala-and-Spark-Bootcamp-master/Machine_Learning_Sections/Regression/USA_Housing.csv")

// Check out the Data
data.printSchema()

// See an example of what the data looks like
// by printing out a Row
val colnames = data.columns
val firstrow = data.head(1)(0)
println("\n")
println("Example Data Row")
for(ind <- Range(1, colnames.length)){
  println(colnames(ind))
  println(firstrow(ind))
  println("\n")
}

////////////////////////////////////////////////////
//// Setting Up DataFrame for Machine Learning ////
//////////////////////////////////////////////////

// A few things we need to do before Spark can accept the data!
// It needs to be in the form of two columns
// ("label","features")

// This will allow us to join multiple feature columns
// into a single column holding a vector of feature values
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.linalg.Vectors

// Rename Price to label column for naming convention.
// Grab only numerical columns from the data
val df = data.select(data("Price").as("label"),$"Avg Area Income",$"Avg Area House Age",$"Avg Area Number of Rooms",$"Area Population")

// An assembler converts the input values to a vector
// A vector is what the ML algorithm reads to train a model

// Set the input columns from which we are supposed to read the values
// Set the name of the column where the vector will be stored
val assembler = new VectorAssembler().setInputCols(Array("Avg Area Income","Avg Area House Age","Avg Area Number of Rooms","Area Population")).setOutputCol("features")

// Use the assembler to transform our DataFrame to the two columns
val output = assembler.transform(df).select($"label",$"features")


Data:



Avg Area Income  Avg Area House Age  Avg Area Number of Rooms  
0 79545.458574 5.682861 7.009188
1 79248.642455 6.002900 6.730821
2 61287.067179 5.865890 8.512727
3 63345.240046 7.188236 5.586729
4 59982.197226 5.040555 7.839388

Avg Area Number of Bedrooms Area Population Price
0 4.09 23086.800503 1.059034e+06
1 3.09 40173.072174 1.505891e+06
2 5.13 36882.159400 1.058988e+06
3 3.26 34310.242831 1.260617e+06
4 4.23 26354.109472 6.309435e+05

Address
0 208 Michael Ferry Apt. 674\nLaurabury, NE 3701...
1 188 Johnson Views Suite 079\nLake Kathleen, CA...
2 9127 Elizabeth Stravenue\nDanieltown, WI 06482...
3 USS Barnett\nFPO AP 44820
4 USNS Raymond\nFPO AE 09386


Error:



import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.linalg.Vectors
df: org.apache.spark.sql.DataFrame = [label: double, Avg Area Income: string ... 3 more fields]
assembler: org.apache.spark.ml.feature.VectorAssembler = vecAssembler_3e70ff1660b1
java.lang.IllegalArgumentException: Data type StringType of column Avg Area Income is not supported.
Data type StringType of column Avg Area House Age is not supported.
at org.apache.spark.ml.feature.VectorAssembler.transformSchema(VectorAssembler.scala:124)
at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:74)
at org.apache.spark.ml.feature.VectorAssembler.transform(VectorAssembler.scala:54)
... 121 elided
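For what it's worth, the `df:` line in the output above shows the real symptom: `Avg Area Income` came back as `string`, so `inferSchema` apparently read those columns as text rather than numbers (the double spaces in the header row may be related). A minimal workaround sketch, assuming the column names match the printed schema exactly, is to cast the feature columns to `DoubleType` before assembling:

```scala
import org.apache.spark.sql.types.DoubleType

// Cast each string-typed feature column to DoubleType before assembling.
// Column names are assumed to match the printed schema exactly.
val featureCols = Seq("Avg Area Income", "Avg Area House Age",
                      "Avg Area Number of Rooms", "Area Population")
val dfNumeric = featureCols.foldLeft(df) { (acc, name) =>
  acc.withColumn(name, acc(name).cast(DoubleType))
}

// The assembler now sees only numeric columns, so transform succeeds.
val output = assembler.transform(dfNumeric).select($"label", $"features")
```

This sidesteps the exception, but it doesn't explain why `inferSchema` typed those columns as strings in the first place; checking the raw CSV for stray quotes or whitespace would confirm that.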









scala apache-spark






      asked Nov 26 '18 at 19:58









user3476463