Sliding window on a tensor

I'm trying to build a simple word generator, but I'm having difficulty with sliding windows.

Here is my current code:

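from glob import glob
import tensorflow as tf

files = glob("transfdata/*")  # a list of text files
dataset = tf.data.TextLineDataset(files)  # each file is a single line
dataset = dataset.map(lambda x: tf.strings.split(x))  # tokenize on whitespace
dataset = dataset.window(6, shift=1, stride=1, drop_remainder=False)  # windows over whole texts, not tokens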

The code doesn't do what I expected: it applies the sliding window at the text level, i.e. over whole lines (which is the normal behavior of Dataset.window). However, I want the window at the token level, inside each text.

I did find a suboptimal solution. The code below works, but the sliding window runs over all the documents at once, so some windows span two documents. From a methodological point of view it shouldn't (different authors, topics, etc.). Is there any way to apply a window to a tensor rather than to a dataset?

files = glob("transfdata/*")
dataset = tf.data.TextLineDataset(files)
dataset = dataset.map(lambda x: tf.strings.split(x))
t = dataset.flat_map( lambda x: tf.data.Dataset.from_tensor_slices(x))
t = t.window(6,1,1, drop_remainder=False)
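
To make the difference concrete, here is a minimal plain-Python sketch (no TensorFlow) of the two behaviors. The helper name sliding_windows and the toy documents are hypothetical and only illustrate the intent; they are not part of the original code.

```python
def sliding_windows(tokens, width):
    """Return every full window of `width` consecutive tokens (step of 1)."""
    return [tokens[i:i + width] for i in range(len(tokens) - width + 1)]

# Two toy "documents" (hypothetical data).
docs = ["a b c d", "e f g h"]
width = 3

# Desired behavior: window inside each document separately.
per_doc = [w for doc in docs for w in sliding_windows(doc.split(), width)]

# Behavior of the flattened pipeline: one window over all tokens,
# so some windows mix tokens from two documents.
all_tokens = [tok for doc in docs for tok in doc.split()]
across = sliding_windows(all_tokens, width)

print(per_doc)
print(across)
```

The second list contains windows such as ['c', 'd', 'e'] that cross the document boundary, which is exactly what should be avoided.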

Any help would be appreciated, thanks!



Solution 1:[1]

Try tensorflow-text; it has a decent sliding window function:

import tensorflow as tf
import tensorflow_text as tft

with open('data.txt', 'w') as f:
  f.write('How are we going to solve this extremely difficult problem with a bit of patience\n')

dataset = tf.data.TextLineDataset(['data.txt'])  # same relative path the file was written to above
dataset = dataset.map(tf.strings.split) 
window_size = 6
dataset = dataset.map(lambda x: tft.sliding_window(x, width=window_size, axis=0)).flat_map(tf.data.Dataset.from_tensor_slices)

for d in dataset:
  print(d)

Output:

tf.Tensor([b'How' b'are' b'we' b'going' b'to' b'solve'], shape=(6,), dtype=string)
tf.Tensor([b'are' b'we' b'going' b'to' b'solve' b'this'], shape=(6,), dtype=string)
tf.Tensor([b'we' b'going' b'to' b'solve' b'this' b'extremely'], shape=(6,), dtype=string)
tf.Tensor([b'going' b'to' b'solve' b'this' b'extremely' b'difficult'], shape=(6,), dtype=string)
tf.Tensor([b'to' b'solve' b'this' b'extremely' b'difficult' b'problem'], shape=(6,), dtype=string)
tf.Tensor([b'solve' b'this' b'extremely' b'difficult' b'problem' b'with'], shape=(6,), dtype=string)
tf.Tensor([b'this' b'extremely' b'difficult' b'problem' b'with' b'a'], shape=(6,), dtype=string)
tf.Tensor([b'extremely' b'difficult' b'problem' b'with' b'a' b'bit'], shape=(6,), dtype=string)
tf.Tensor([b'difficult' b'problem' b'with' b'a' b'bit' b'of'], shape=(6,), dtype=string)
tf.Tensor([b'problem' b'with' b'a' b'bit' b'of' b'patience'], shape=(6,), dtype=string)
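
If installing tensorflow-text is not an option, a similar per-document result can probably be obtained with tf.data alone by windowing inside flat_map. This is only a sketch under assumptions: the helper name token_windows and the in-memory sample lines are hypothetical, and drop_remainder=True is chosen so that only full windows are produced, as in the output above.

```python
import tensorflow as tf

WINDOW_SIZE = 6

def token_windows(tokens):
    """Turn one document's 1-D token tensor into a dataset of full windows."""
    doc = tf.data.Dataset.from_tensor_slices(tokens)
    # Each window is itself a small dataset of WINDOW_SIZE scalar tokens.
    windows = doc.window(WINDOW_SIZE, shift=1, drop_remainder=True)
    # Batch every window back into a single tensor of shape (WINDOW_SIZE,).
    return windows.flat_map(lambda w: w.batch(WINDOW_SIZE, drop_remainder=True))

# Hypothetical stand-in for TextLineDataset(files): one line per document.
lines = tf.data.Dataset.from_tensor_slices(["a b c d e f g", "h i j k l m n"])
dataset = lines.map(tf.strings.split).flat_map(token_windows)

result = [[tok.decode() for tok in window.numpy()] for window in dataset]
for window in result:
    print(window)
```

Because the window is created separately for each document inside flat_map, no window mixes tokens from two documents. Documents with fewer than WINDOW_SIZE tokens yield no windows at all.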

Solution 2:[2]

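That is a very nice example. I adapted it a bit for word generators (as in the question), which are composed of sounds and winds. Here the input is split into single characters with tf.strings.bytes_split, so the window slides over characters instead of whitespace-separated tokens.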

[ Sample ]:

import tensorflow as tf
import tensorflow_text as tft
import numpy as np

input_word = tf.constant(' \'Cause it\'s easy as an ice cream sundae Slipping outta your hand into the dirt Easy as an ice cream sundae Every dancer gets a little hurt Easy as an ice cream sundae Slipping outta your hand into the dirt Easy as an ice cream sundae Every dancer gets a little hurt Easy as an ice cream sundae Oh, easy as an ice cream sundae ')

print( 'input_word: ' + str(input_word) )
print( " " )


dataset = tf.data.Dataset.from_tensors( tf.strings.bytes_split(input_word) )
print( dataset )

window_size = 6
dataset = dataset.map(lambda x: tft.sliding_window(x, width=window_size, axis=0)).flat_map(tf.data.Dataset.from_tensor_slices)

for d in dataset:
    print(d)

input('...')

[ Output ] (excerpt):

tf.Tensor([b'i' b'c' b'e' b' ' b'c' b'r'], shape=(6,), dtype=string)
tf.Tensor([b'c' b'e' b' ' b'c' b'r' b'e'], shape=(6,), dtype=string)
tf.Tensor([b'e' b' ' b'c' b'r' b'e' b'a'], shape=(6,), dtype=string)
tf.Tensor([b' ' b'c' b'r' b'e' b'a' b'm'], shape=(6,), dtype=string)
tf.Tensor([b'c' b'r' b'e' b'a' b'm' b' '], shape=(6,), dtype=string)
tf.Tensor([b'r' b'e' b'a' b'm' b' ' b's'], shape=(6,), dtype=string)
tf.Tensor([b'e' b'a' b'm' b' ' b's' b'u'], shape=(6,), dtype=string)
tf.Tensor([b'a' b'm' b' ' b's' b'u' b'n'], shape=(6,), dtype=string)
tf.Tensor([b'm' b' ' b's' b'u' b'n' b'd'], shape=(6,), dtype=string)
tf.Tensor([b' ' b's' b'u' b'n' b'd' b'a'], shape=(6,), dtype=string)
tf.Tensor([b's' b'u' b'n' b'd' b'a' b'e'], shape=(6,), dtype=string)
tf.Tensor([b'u' b'n' b'd' b'a' b'e' b' '], shape=(6,), dtype=string)
tf.Tensor([b'n' b'd' b'a' b'e' b' ' b'O'], shape=(6,), dtype=string)
tf.Tensor([b'd' b'a' b'e' b' ' b'O' b'h'], shape=(6,), dtype=string)
tf.Tensor([b'a' b'e' b' ' b'O' b'h' b','], shape=(6,), dtype=string)
tf.Tensor([b'e' b' ' b'O' b'h' b',' b' '], shape=(6,), dtype=string)


Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1
Solution 2 Martijn Pieters