BigQuery: find the following row matching a condition












I'm looking at text sequences in BigQuery and trying to identify word completions over a number of rows (sharing an ID). The data looks like:



ID, Text
1, t
1, th
1, the
1, the
1, the c
1, the ca
1, the cat
1, the cat
1, the cat s
...
1, the cat sat on the mat
2, r
...


For each ID and sequence, I'm trying to find the next word boundary. The ideal output would be:



ID, Text, Boundary
1, t, the
1, th, the
1, the c, the cat
1, the ca, the cat
1, the cat s, the cat sat


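In the above, the next row that shares the same ID and ends in a space gives the next word-completion boundary (there can be several boundaries per ID). In the sample data, the second "the" row and the second "the cat" row each end in a trailing space.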
      sql google-bigquery






asked Nov 21 at 21:15 by mish15

          2 Answers
          accepted










The query below is for BigQuery Standard SQL.

Note: this is a brute-force approach, so the query is not as elegant as it could be, but it should give you a good start.



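#standardSQL
SELECT id, item, boundary
FROM (
  SELECT id, grp,
    -- the row of the group that ends in a space is its boundary
    STRING_AGG(IF(boundary, text, ''), '') boundary,
    -- all other rows of the group, shortest first
    ARRAY_AGG(IF(NOT boundary, text, NULL) IGNORE NULLS ORDER BY LENGTH(text)) items
  FROM (
    SELECT id, text,
      -- grp = number of spaces, not counting a trailing one,
      -- i.e. the 0-based index of the word being typed
      LENGTH(text) - LENGTH(REPLACE(text, ' ', '')) - IF(SUBSTR(text, -1) = ' ', 1, 0) grp,
      SUBSTR(text, -1) = ' ' boundary
    FROM `project.dataset.table`
  )
  GROUP BY id, grp
), UNNEST(items) item WITH OFFSET pos
-- drop rows that already equal the completed word (e.g. 'the' vs 'the ')
WHERE RTRIM(item) != RTRIM(boundary)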

Applied to the dummy data from your question:



#standardSQL
WITH `project.dataset.table` AS (
  SELECT 1 id, 't' text UNION ALL
  SELECT 1, 'th' UNION ALL
  SELECT 1, 'the' UNION ALL
  SELECT 1, 'the ' UNION ALL
  SELECT 1, 'the c' UNION ALL
  SELECT 1, 'the ca' UNION ALL
  SELECT 1, 'the cat' UNION ALL
  SELECT 1, 'the cat ' UNION ALL
  SELECT 1, 'the cat s' UNION ALL
  SELECT 1, 'the cat sat '
)
SELECT id, item, boundary
FROM (
  SELECT id, grp,
    STRING_AGG(IF(boundary, text, ''), '') boundary,
    ARRAY_AGG(IF(NOT boundary, text, NULL) IGNORE NULLS ORDER BY LENGTH(text)) items
  FROM (
    SELECT id, text,
      LENGTH(text) - LENGTH(REPLACE(text, ' ', '')) - IF(SUBSTR(text, -1) = ' ', 1, 0) grp,
      SUBSTR(text, -1) = ' ' boundary
    FROM `project.dataset.table`
  )
  GROUP BY id, grp
), UNNEST(items) item WITH OFFSET pos
WHERE RTRIM(item) != RTRIM(boundary)
ORDER BY id, grp, pos


The result is:



Row  id  item       boundary
1    1   t          the
2    1   th         the
3    1   the c      the cat
4    1   the ca     the cat
5    1   the cat s  the cat sat
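
As an alternative sketch (not part of the answer's query), the "next row ending in a space" rule can also be written with a window function. It assumes, as the query above does, that rows within an id can be ordered by LENGTH(text), since the question shows no explicit ordering column. The CTE names typed and with_next are made up for illustration.

```sql
#standardSQL
WITH typed AS (
  SELECT 1 AS id, 't' AS text UNION ALL
  SELECT 1, 'th' UNION ALL
  SELECT 1, 'the' UNION ALL
  SELECT 1, 'the ' UNION ALL
  SELECT 1, 'the c' UNION ALL
  SELECT 1, 'the ca' UNION ALL
  SELECT 1, 'the cat' UNION ALL
  SELECT 1, 'the cat ' UNION ALL
  SELECT 1, 'the cat s' UNION ALL
  SELECT 1, 'the cat sat '
),
with_next AS (
  SELECT
    id,
    text,
    -- first later row (by length) of the same id that ends in a space
    FIRST_VALUE(IF(ENDS_WITH(text, ' '), text, NULL) IGNORE NULLS) OVER (
      PARTITION BY id
      ORDER BY LENGTH(text)
      ROWS BETWEEN 1 FOLLOWING AND UNBOUNDED FOLLOWING
    ) AS boundary
  FROM typed
)
SELECT id, text, boundary
FROM with_next
WHERE NOT ENDS_WITH(text, ' ')   -- boundary rows are not reported themselves
  AND boundary IS NOT NULL       -- nothing follows the last boundary
  AND RTRIM(boundary) != text    -- skip the finished word, e.g. 'the' before 'the '
ORDER BY id, LENGTH(text)
```

On this sample it should return the same five rows as the result above. Note that the boundary values keep their trailing space in both versions; wrap them in RTRIM if that matters.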






answered Nov 21 at 22:00 by Mikhail Berlyant

            • Perfect, thank you.
              – mish15
              Nov 23 at 20:51



















BigQuery UDFs come in handy in these situations. Here is a working solution:



#standardSQL
/* boundary function: keep as many leading words of the full sentence
   as there are words started in text */
CREATE TEMP FUNCTION boundaryf(text STRING, sentence STRING) AS (
  ARRAY_TO_STRING(ARRAY(
    SELECT w
    FROM UNNEST(SPLIT(sentence, ' ')) w WITH OFFSET i
    -- respect the ending space; i is 0-based, hence < rather than <=
    WHERE i < ARRAY_LENGTH(SPLIT(text, ' ')) - (LENGTH(text) - LENGTH(RTRIM(text)))
    ORDER BY i
  ), ' ')
);

WITH items AS (
  -- your data; rows are ordered by LENGTH(text) below rather than relying on input order
  SELECT 1 AS id, 't' AS text UNION ALL
  SELECT 1, 'th' UNION ALL
  SELECT 1, 'the' UNION ALL
  SELECT 1, 'the ' UNION ALL
  SELECT 1, 'the c' UNION ALL
  SELECT 1, 'the ca' UNION ALL
  SELECT 1, 'the cat' UNION ALL
  SELECT 1, 'the cat ' UNION ALL
  SELECT 1, 'the cat s' UNION ALL
  SELECT 1, 'the cat sa' UNION ALL
  SELECT 1, 'the cat sat' UNION ALL
  SELECT 1, 'the cat sat ' UNION ALL
  SELECT 1, 'the cat sat o' UNION ALL
  SELECT 1, 'the cat sat on' UNION ALL
  SELECT 1, 'the cat sat on ' UNION ALL
  SELECT 1, 'the cat sat on a' UNION ALL
  SELECT 1, 'the cat sat on a ' UNION ALL
  SELECT 1, 'the cat sat on a m' UNION ALL
  SELECT 1, 'the cat sat on a ma' UNION ALL
  SELECT 1, 'the cat sat on a mat' UNION ALL
  SELECT 2, 'i' UNION ALL
  SELECT 2, 'i a' UNION ALL
  SELECT 2, 'i am' UNION ALL
  SELECT 2, 'i am f' UNION ALL
  SELECT 2, 'i am fr' UNION ALL
  SELECT 2, 'i am fre' UNION ALL
  SELECT 2, 'i am free'
),
sentences AS (
  -- the longest text per id is taken as the full sentence
  SELECT id, sentences[OFFSET(ARRAY_LENGTH(sentences) - 1)] AS sentence FROM (
    SELECT id, ARRAY_AGG(text ORDER BY LENGTH(text)) AS sentences
    FROM items GROUP BY 1
  )
),
control AS (
  SELECT i.id, i.text, boundaryf(i.text, s.sentence) AS boundary
  FROM items i
  LEFT JOIN sentences s ON s.id = i.id
)
SELECT * FROM control
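
The cut-off expression inside boundaryf can be checked in isolation. The sketch below evaluates it for a few sample strings; the alias words_to_keep and the inline values are illustrative only.

```sql
#standardSQL
-- Number of leading words of the full sentence that boundaryf keeps for each text.
SELECT
  text,
  -- SPLIT yields an extra empty element when text ends in a space,
  -- so the trailing-space count is subtracted again
  ARRAY_LENGTH(SPLIT(text, ' ')) - (LENGTH(text) - LENGTH(RTRIM(text))) AS words_to_keep
FROM UNNEST(['t', 'the ', 'the c', 'the cat ']) AS text
ORDER BY LENGTH(text)
```

This should give 1 for 't' and 'the ', and 2 for 'the c' and 'the cat '. Because the kept words come from the final sentence of each id, this approach returns a boundary for every input row, including rows that already end in a space, so an extra filter would be needed to match the output requested in the question exactly.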






answered Nov 22 at 22:02 by khan





























