How to replace whitespace with commas in Spark (with Scala)?

I have a log file like this, and I want to create a DataFrame from it in Scala.



2015-05-13T23:39:43.945958Z my-loadbalancer 192.168.131.39:2817 10.0.0.1:80 0.000086 0.001048 0.001337 200 200 0 57 "GET https://www.example.com:443/ HTTP/1.1" "curl/7.38.0" DHE-RSA-AES128-SHA TLSv1.2


I want to replace all the spaces with commas so that I can query the data with spark.sql, but I have not been able to.



Here is everything I tried:


  1. Importing it as a text file first, to see whether there is a replaceAll method.

  2. Splitting on spaces.


Any suggestions? I went through the documentation and found no mention of a replace function like the one in Pandas.










scala apache-spark apache-spark-sql databricks






asked Nov 26 '18 at 20:51 by San

  • Possible duplicate of how to use Regexp_replace in spark

    – user10465355
    Nov 26 '18 at 20:54
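
For reference, the regexp_replace route that the linked duplicate covers would look roughly like this. This is only a sketch (the file path and the "line" alias are placeholders, not from the question), and a blanket replace has the drawback raised in the comments on the answers below: it also converts the spaces inside the quoted request and user-agent fields.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.regexp_replace

val spark = SparkSession.builder.appName("elb-logs").getOrCreate()
import spark.implicits._

// spark.read.text yields a single string column named "value"
val raw = spark.read.text("path/to/logfile")
val withCommas = raw.select(
  regexp_replace($"value", " ", ",").as("line")  // every space becomes a comma
)
withCommas.show(truncate = false)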
3 Answers






You can simply tell Spark that your delimiter is a white space, like this:



val df = spark.read.option("delimiter", " ").csv("path/to/file")
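
To query the result with spark.sql, you can name the columns and register a temporary view. A minimal sketch, assuming the fifteen fields of the sample line above; the column names are illustrative guesses at the ELB log fields, not something this answer specifies. Since the CSV reader's default quote character is ", the quoted request string comes through as a single column.

val named = spark.read
  .option("delimiter", " ")
  .csv("path/to/file")
  .toDF("timestamp", "elb", "client", "backend", "request_time",
        "backend_time", "response_time", "elb_status", "backend_status",
        "received_bytes", "sent_bytes", "request", "user_agent",
        "ssl_cipher", "ssl_protocol")  // names must match the column count exactly

named.createOrReplaceTempView("elb_logs")
spark.sql("SELECT elb_status, count(*) AS n FROM elb_logs GROUP BY elb_status").show()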





answered Nov 27 '18 at 7:08 by Oli
    Since you don't have typed columns yet, I'd start with an RDD, split the text with a map, and then convert to a DataFrame with a schema.
    Roughly:



    val rdd = sc.textFile("path/to/logfile").map(line => line.split("\\s+"))


    Then you need to turn your RDD (where each record is an array of tokens) into a DataFrame. The most robust way would be to map your arrays to Row objects, since an RDD[Row] is what underlies a DataFrame; a sketch of that follows after the snippet below.



    A simpler way to get up and running, though, would be:



    spark.createDataFrame(rdd).toDF("datetime", "host", "ip", ...)
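
    The Row-based route mentioned above could look roughly like this. The field names and the all-string schema are assumptions for illustration, and the whitespace split still breaks the quoted request field (see the comments below):

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.types.{StringType, StructField, StructType}

    // Illustrative schema covering the first few log fields, all as strings
    val schema = StructType(
      Seq("datetime", "elb", "client", "backend").map(StructField(_, StringType))
    )

    val rowRdd = sc.textFile("path/to/logfile")
      .map(_.split("\\s+"))
      .map(tokens => Row.fromSeq(tokens.take(schema.length)))

    val df = spark.createDataFrame(rowRdd, schema)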





    answered Nov 26 '18 at 21:24 by benlaird
    • Almost precise. Thank you.

      – San
      Nov 26 '18 at 21:30











    • It is also replacing the space inside the quotes. Looking for a way to overcome it.

      – San
      Nov 26 '18 at 22:09











    • Now that I think of it, Spark DataFrames have a CSV reader; it probably makes sense to just use that

      – benlaird
      Nov 26 '18 at 22:23











    • Scala CSV reader: spark.apache.org/docs/2.1.0/api/scala/…

      – benlaird
      Nov 26 '18 at 22:26











    • I want to use it, but my dataset is a group of arrays; I mean each row is an array, as shown in the log above. So I am looking to split everything on spaces, give names to the columns, and then run SQL on it.

      – San
      Nov 26 '18 at 22:45

    If you just want to split on spaces while keeping the strings inside double quotes intact, you can use the Apache Commons CSV library.



    import org.apache.commons.csv.{CSVFormat, CSVParser}

    val str = """2015-05-13T23:39:43.945958Z my-loadbalancer 192.168.131.39:2817 10.0.0.1:80 0.000086 0.001048 0.001337 200 200 0 57 "GET https://www.example.com:443/ HTTP/1.1" "curl/7.38.0" DHE-RSA-AES128-SHA TLSv1.2"""
    val record = CSVParser.parse(str, CSVFormat.newFormat(' ').withQuote('"')).getRecords.get(0)
    val http = record.get(11)  // the quoted request field
    val curl = record.get(12)  // the quoted user-agent field
    println(http)
    println(curl)


    Results:



    GET https://www.example.com:443/ HTTP/1.1
    curl/7.38.0
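
    To get from here to a queryable DataFrame, one possible combination (an assumption on my part, not part of this answer) is to run the parser line by line inside a map and name the resulting columns:

    import org.apache.commons.csv.{CSVFormat, CSVParser}

    // Parse each line with the space-delimited, double-quoted format, then
    // pull out a few fields per line ("path/to/logfile" is a placeholder)
    val parsed = sc.textFile("path/to/logfile").map { line =>
      val rec = CSVParser.parse(line, CSVFormat.newFormat(' ').withQuote('"')).getRecords.get(0)
      (rec.get(0), rec.get(11), rec.get(12))  // timestamp, request, user agent
    }

    import spark.implicits._
    val df = parsed.toDF("timestamp", "request", "user_agent")
    df.createOrReplaceTempView("logs")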





    answered Nov 27 '18 at 13:06 by stack0114106