bigquery: find following row matching condition

I'm looking at text sequences in BigQuery and trying to identify word completions over a number of rows (sharing an ID). The data looks like:

ID, Text
1, 't'
1, 'th'
1, 'the'
1, 'the '
1, 'the c'
1, 'the ca'
1, 'the cat'
1, 'the cat '
1, 'the cat s'
...
1, 'the cat sat on the mat'
2, 'r'
...

For each ID and partial sequence, I'm trying to find the next word boundary. The ideal output would be:

ID, Text, Boundary
1, t, the
1, th, the
1, the c, the cat
1, the ca, the cat
1, the cat s, the cat sat

In the above, the next row that shares the same ID and ends in a space gives the next word-completion boundary (a single ID can have several such boundaries).
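To pin down the rule before looking at SQL, here is a minimal Python sketch of it. It is only an illustration: the function name `next_boundaries`, the assumption that rows arrive in typing order, and the final `'the cat sat '` sample row (standing in for the elided rows) are not part of the question itself.

```python
def next_boundaries(rows):
    """Pair each partial text with the next same-ID row that ends in a space.

    `rows` is a list of (id, text) tuples, assumed to be in typing order.
    """
    result = []
    for i, (row_id, text) in enumerate(rows):
        if text.endswith(" "):
            continue  # rows that are themselves boundaries are not reported
        # First later row with the same ID whose text ends in a space.
        boundary = next(
            (t for rid, t in rows[i + 1:] if rid == row_id and t.endswith(" ")),
            None,
        )
        # Skip rows with no later boundary, and complete words such as 'the'.
        if boundary is not None and boundary.rstrip() != text:
            result.append((row_id, text, boundary.rstrip()))
    return result


# Sample rows; the last one is an assumed stand-in for the elided data.
sample = [
    (1, "t"), (1, "th"), (1, "the"), (1, "the "),
    (1, "the c"), (1, "the ca"), (1, "the cat"), (1, "the cat "),
    (1, "the cat s"), (1, "the cat sat "),
]
for row in next_boundaries(sample):
    print(row)
```

The forward scan is quadratic in the number of rows per ID, so it is meant as a specification of the desired output rather than something to run at BigQuery scale.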

asked Nov 21 at 21:15

mish15

203

sql google-bigquery


2 Answers

accepted

The query below is for BigQuery Standard SQL.

Note: this is a brute-force approach, so the query is not as elegant as it could be, but I hope it gives you a good start.

#standardSQL
SELECT id, item, boundary
FROM (
  SELECT id, grp,
    STRING_AGG(IF(boundary, text, ''), '') boundary,
    ARRAY_AGG(IF(NOT boundary, text, NULL) IGNORE NULLS ORDER BY LENGTH(text)) items
  FROM (
    SELECT id, text,
      LENGTH(text) - LENGTH(REPLACE(text, ' ', '')) - IF(SUBSTR(text, -1) = ' ', 1, 0) grp,
      SUBSTR(text, -1) = ' ' boundary
    FROM `project.dataset.table`
  )
  GROUP BY id, grp
), UNNEST(items) item WITH OFFSET pos
WHERE RTRIM(item) != RTRIM(boundary)
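The key to the query is the `grp` expression: the number of spaces in the text, minus one if the text ends in a space. That puts every partial word and the row that completes it into the same group. A small Python sketch of the same arithmetic may make this easier to see (the function name `word_group` is made up for illustration):

```python
def word_group(text):
    """Mirror of the SQL `grp` expression: spaces, minus one for a trailing space."""
    spaces = text.count(" ")                   # LENGTH(text) - LENGTH(REPLACE(text, ' ', ''))
    trailing = 1 if text.endswith(" ") else 0  # IF(SUBSTR(text, -1) = ' ', 1, 0)
    return spaces - trailing


for text in ["t", "the", "the ", "the c", "the cat ", "the cat s"]:
    print(repr(text), word_group(text))
```

Here 't', 'the' and 'the ' all land in group 0, while 'the c' and 'the cat ' land in group 1, so each group contains exactly one boundary row as long as every word is completed once.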

Applied to the sample data from your question:

#standardSQL
WITH `project.dataset.table` AS (
  SELECT 1 id, 't' text UNION ALL
  SELECT 1, 'th' UNION ALL
  SELECT 1, 'the' UNION ALL
  SELECT 1, 'the ' UNION ALL
  SELECT 1, 'the c' UNION ALL
  SELECT 1, 'the ca' UNION ALL
  SELECT 1, 'the cat' UNION ALL
  SELECT 1, 'the cat ' UNION ALL
  SELECT 1, 'the cat s' UNION ALL
  SELECT 1, 'the cat sat '
)
SELECT id, item, boundary
FROM (
  SELECT id, grp,
    STRING_AGG(IF(boundary, text, ''), '') boundary,
    ARRAY_AGG(IF(NOT boundary, text, NULL) IGNORE NULLS ORDER BY LENGTH(text)) items
  FROM (
    SELECT id, text,
      LENGTH(text) - LENGTH(REPLACE(text, ' ', '')) - IF(SUBSTR(text, -1) = ' ', 1, 0) grp,
      SUBSTR(text, -1) = ' ' boundary
    FROM `project.dataset.table`
  )
  GROUP BY id, grp
), UNNEST(items) item WITH OFFSET pos
WHERE RTRIM(item) != RTRIM(boundary)
ORDER BY id, grp, pos

The result is:

Row  id  item       boundary
1    1   t          the
2    1   th         the
3    1   the c      the cat
4    1   the ca     the cat
5    1   the cat s  the cat sat
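For readers who want to check the grouping logic outside BigQuery, the following Python sketch mirrors the query step by step. It is an approximation written for illustration, not the query itself: `boundaries_by_group` is a made-up name, and plain Python lists stand in for `STRING_AGG` and `ARRAY_AGG`.

```python
from collections import defaultdict


def boundaries_by_group(rows):
    """Group rows by (id, word index) and pair partial words with the boundary row."""
    groups = defaultdict(lambda: {"boundary": "", "items": []})
    for row_id, text in rows:
        is_boundary = text.endswith(" ")
        # Same key as the SQL `grp`: spaces minus one for a trailing space.
        key = (row_id, text.count(" ") - int(is_boundary))
        if is_boundary:
            groups[key]["boundary"] += text   # stands in for STRING_AGG
        else:
            groups[key]["items"].append(text)  # stands in for ARRAY_AGG
    result = []
    for (row_id, _grp), group in sorted(groups.items()):
        for item in sorted(group["items"], key=len):
            # Same filter as WHERE RTRIM(item) != RTRIM(boundary).
            if item.rstrip() != group["boundary"].rstrip():
                result.append((row_id, item, group["boundary"]))
    return result


sample = [
    (1, "t"), (1, "th"), (1, "the"), (1, "the "),
    (1, "the c"), (1, "the ca"), (1, "the cat"), (1, "the cat "),
    (1, "the cat s"), (1, "the cat sat "),
]
for row in boundaries_by_group(sample):
    print(row)
```

As in the query, the boundary keeps its trailing space. One thing the sketch makes visible: a group with no completing row (for example a last word that is never followed by a space) is still returned, with an empty boundary, so you may want an extra filter if that case occurs in your data.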

answered Nov 21 at 22:00

Mikhail Berlyant

54.2k43166

Perfect, thank you.
– mish15
Nov 23 at 20:51


BigQuery UDFs come in handy in these situations. Here is a working solution:

#standardSQL
/*boundary function*/
create temp function boundaryf (text string, sentence string) as (
  array_to_string(array(
    select q.w from unnest(
      array(select struct(w as w, row_number() over () as i)  from unnest(split(sentence, ' ')) w
      )
    ) q
    -- respect the ending space
    where q.i <= array_length(split(text, ' ')) - (length(text) - length(rtrim(text)))
  ), ' ')
);

WITH items AS (
  #--your data. assuming this is already ordered
  SELECT 1 as id, 't' as text UNION ALL
  SELECT 1, 'th' UNION ALL
  SELECT 1, 'the' UNION ALL
  SELECT 1, 'the ' UNION ALL
  SELECT 1, 'the c' UNION ALL
  SELECT 1, 'the ca' UNION ALL
  SELECT 1, 'the cat' UNION ALL
  SELECT 1, 'the cat ' UNION ALL
  SELECT 1, 'the cat s' UNION ALL
  SELECT 1, 'the cat sa' union all
  SELECT 1, 'the cat sat' union all
  SELECT 1, 'the cat sat ' union all
  SELECT 1, 'the cat sat o' union all
  SELECT 1, 'the cat sat on' union all
  SELECT 1, 'the cat sat on ' union all
  SELECT 1, 'the cat sat on a' union all
  SELECT 1, 'the cat sat on a ' union all
  SELECT 1, 'the cat sat on a m' union all
  SELECT 1, 'the cat sat on a ma' union all
  SELECT 1, 'the cat sat on a mat' union all
  select 2, 'i' union all
  select 2, 'i a' union all
  select 2, 'i am' union all
  select 2, 'i am f' union all
  select 2, 'i am fr' union all
  select 2, 'i am fre' union all
  select 2, 'i am free'
),
sentences as (
  select id, sentences[offset (array_length(sentences)-1)] as sentence from (
    select id, array_agg(text) as sentences
    from items group by 1
  )
),
control as (
  select i.id, i.text, boundaryf(i.text, s.sentence) as boundary
  from items i
  left join sentences s on s.id  = i.id
)
select * from control
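The idea behind `boundaryf` is to count how many words the partial text has started and then take that many words from the full sentence. A rough Python equivalent, written only to illustrate the arithmetic (it is not BigQuery code, and it assumes words are separated by single spaces):

```python
def boundaryf(text, sentence):
    """Return the first N words of `sentence`, where N is the word count of `text`."""
    words = sentence.split(" ")
    # A trailing space produces an empty last element in the split, so the
    # number of trailing whitespace characters is subtracted to compensate.
    trailing = len(text) - len(text.rstrip())
    keep = len(text.split(" ")) - trailing
    return " ".join(words[:max(keep, 0)])


sentence = "the cat sat on a mat"
for text in ["t", "the ", "the c", "the cat sat on ", "the cat sat on a m"]:
    print(repr(text), "->", repr(boundaryf(text, sentence)))
```

Note that this approach relies on the full sentence being known for each ID. In the query above it is taken as the last element of an unordered `array_agg`, which is why the data is assumed to be already ordered.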

answered Nov 22 at 22:02

khan

1,79883051
