Spark show works for whole dataframe yet fails for same dataframe filtered
I am using Spark 2.3.1 in a Zeppelin notebook. I create a DataFrame by loading data from Hive. This is how the DataFrame is created:



    val df = hive.executeQuery("select trim(a_vno) as dst, trim(s_vno) as src, share, administrator, account, all_shares from ebyn.babs_edges_2016 where (share <> 0 or administrator <> 0 or account <> 0 or all_shares <> 0 ) and trim(date) = '201601'")
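
For context, `hive` here comes from the Hive Warehouse Connector (the table is transactional, so it is read through HWC rather than plain `spark.sql`). A minimal sketch of how such a session is built, assuming the standard HDP 3.x HWC API; other environments may differ:

    import com.hortonworks.hwc.HiveWarehouseSession

    // Assumed setup: build an HWC session on top of the existing SparkSession.
    // executeQuery then runs the SQL through Hive and returns a DataFrame.
    val hive = HiveWarehouseSession.session(spark).build()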


When I call



    df.show



it shows the first 20 rows.
But when I call



    df.where("src = 'XXXXX' and dst = 'YYYYY'").show


it gives the following error:



    org.apache.spark.SparkException: Job aborted due to stage failure: Task 3 in stage 303.0 failed 4 times, most recent failure: Lost task 3.3 in stage 303.0 (TID 10797, analitik10.host, executor 96): org.apache.spark.util.TaskCompletionListenerException: null
        at org.apache.spark.TaskContextImpl.invokeListeners(TaskContextImpl.scala:139)
        at org.apache.spark.TaskContextImpl.markTaskCompleted(TaskContextImpl.scala:117)
        at org.apache.spark.scheduler.Task.run(Task.scala:125)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)

    Driver stacktrace:
        at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1602)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1590)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1589)
        at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
        at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1589)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
        at scala.Option.foreach(Option.scala:257)
        at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:831)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1823)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1772)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1761)
        at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
        at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:642)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:2034)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:2055)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:2074)
        at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:363)
        at org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:38)
        at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collectFromPlan(Dataset.scala:3273)
        at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2484)
        at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2484)
        at org.apache.spark.sql.Dataset$$anonfun$52.apply(Dataset.scala:3254)
        at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77)
        at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3253)
        at org.apache.spark.sql.Dataset.head(Dataset.scala:2484)
        at org.apache.spark.sql.Dataset.take(Dataset.scala:2698)
        at org.apache.spark.sql.Dataset.showString(Dataset.scala:254)
        at org.apache.spark.sql.Dataset.show(Dataset.scala:723)
        at org.apache.spark.sql.Dataset.show(Dataset.scala:682)
        at org.apache.spark.sql.Dataset.show(Dataset.scala:691)
        ... 56 elided
    Caused by: org.apache.spark.util.TaskCompletionListenerException: null
        at org.apache.spark.TaskContextImpl.invokeListeners(TaskContextImpl.scala:139)
        at org.apache.spark.TaskContextImpl.markTaskCompleted(TaskContextImpl.scala:117)
        at org.apache.spark.scheduler.Task.run(Task.scala:125)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
        ... 3 more
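
For completeness, the same filter written with column expressions instead of a SQL string, which I would expect to behave identically (untested sketch):

    import org.apache.spark.sql.functions.col

    // Equivalent to df.where("src = 'XXXXX' and dst = 'YYYYY'").show,
    // expressed with typed column expressions
    df.filter(col("src") === "XXXXX" && col("dst") === "YYYYY").show()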


Here are the properties of the Hive table:



    CREATE TABLE `EBYN.BABS_EDGES_2016`(
      `date` string,
      `a_vno` string,
      `s_vno` string,
      `amount` double,
      `number` int,
      `share` int,
      `share_ratio` int,
      `administrator` int,
      `account` int,
      `all_sharelik` int)
    COMMENT 'Imported by sqoop on 2018/10/17 14:53:12'
    ROW FORMAT SERDE
      'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
    WITH SERDEPROPERTIES (
      'field.delim'='',
      'line.delim'='\n',
      'serialization.format'='')
    STORED AS INPUTFORMAT
      'org.apache.hadoop.mapred.TextInputFormat'
    OUTPUTFORMAT
      'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
    LOCATION
      'hdfs://ggmprod/warehouse/tablespace/managed/hive/ebyn.db/babs_edges_2016'
    TBLPROPERTIES (
      'bucketing_version'='2',
      'last_modified_by'='hadoop_etluser',
      'last_modified_time'='1539867401',
      'transactional'='true',
      'transactional_properties'='insert_only',


Why does show succeed on the whole DataFrame but fail on the same DataFrame once it is filtered?
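
A workaround I have considered but not yet verified is pushing the predicate into the Hive query itself, so the filter is evaluated by Hive rather than in a separate Spark stage (sketch, reusing the `hive` session from above; note that `src` maps to `trim(s_vno)` and `dst` to `trim(a_vno)`):

    // Untested sketch: filter inside the HWC query instead of on the DataFrame
    val dfFiltered = hive.executeQuery(
      "select trim(a_vno) as dst, trim(s_vno) as src, share, administrator, account, all_shares " +
      "from ebyn.babs_edges_2016 " +
      "where (share <> 0 or administrator <> 0 or account <> 0 or all_shares <> 0) " +
      "and trim(date) = '201601' " +
      "and trim(s_vno) = 'XXXXX' and trim(a_vno) = 'YYYYY'")
    dfFiltered.show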

scala apache-spark apache-spark-sql

asked Nov 23 '18 at 14:54 by Gofrette, edited Nov 26 '18 at 8:48