Sliding window on a tensor
I'm trying to build a simple word generator, but I'm having difficulty with sliding windows.
Here is my current code:
from glob import glob
import tensorflow as tf

files = glob("transfdata/*")  # a list of text files
dataset = tf.data.TextLineDataset(files)  # each file is a single line
dataset = dataset.map(lambda x: tf.strings.split(x))  # tokenize on whitespace
dataset = dataset.window(6, 1, 1, drop_remainder=False)
The code doesn't do what I expected: it applies the sliding window at the text level, grouping whole documents (which is the normal behavior of Dataset.window). However, I want to window at the token level, inside each text.
I did find a suboptimal workaround. The code works, but the sliding window runs across all the documents, so some windows span two documents. From a methodological point of view, it shouldn't (different authors, topics, etc.). Is there any way to apply a window to a tensor rather than to a dataset?
files = glob("transfdata/*")
dataset = tf.data.TextLineDataset(files)
dataset = dataset.map(lambda x: tf.strings.split(x))
# Flatten every document into one stream of tokens; document boundaries are lost here
t = dataset.flat_map(lambda x: tf.data.Dataset.from_tensor_slices(x))
t = t.window(6, 1, 1, drop_remainder=False)
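To make the difference concrete, here is a minimal plain-Python sketch (no TensorFlow involved) of the two behaviors: windows built inside each document versus windows built over the flattened token stream. The helper `sliding_windows` and the toy `docs` are hypothetical names used only for illustration, and only full windows are kept.

```python
def sliding_windows(tokens, width):
    """Return every full window of `width` consecutive tokens (hypothetical helper)."""
    return [tokens[i:i + width] for i in range(len(tokens) - width + 1)]

# Two toy "documents", already tokenized (illustrative data only)
docs = [["a", "b", "c", "d"], ["x", "y", "z"]]
width = 3

# Wanted: windows never cross a document boundary
per_document = [w for doc in docs for w in sliding_windows(doc, width)]

# Workaround above: all tokens flattened first, so some windows mix documents
flat = [tok for doc in docs for tok in doc]
across_documents = sliding_windows(flat, width)

print(per_document)      # 3 windows, each from a single document
print(across_documents)  # 5 windows, two of them mix "d" with "x"
```

The windows `['c', 'd', 'x']` and `['d', 'x', 'y']` in the second result are the ones that should not exist.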
Any help would be appreciated, thanks!
Solution 1:[1]
Try using tensorflow-text; it has a decent sliding window function:
import tensorflow as tf
import tensorflow_text as tft

with open('data.txt', 'w') as f:
    f.write('How are we going to solve this extremely difficult problem with a bit of patience\n')

# Read back the same file that was just written
dataset = tf.data.TextLineDataset(['data.txt'])
dataset = dataset.map(tf.strings.split)

window_size = 6
# Window each line's tokens separately, then emit one window per dataset element
dataset = dataset.map(lambda x: tft.sliding_window(x, width=window_size, axis=0)).flat_map(tf.data.Dataset.from_tensor_slices)

for d in dataset:
    print(d)

Output:
tf.Tensor([b'How' b'are' b'we' b'going' b'to' b'solve'], shape=(6,), dtype=string)
tf.Tensor([b'are' b'we' b'going' b'to' b'solve' b'this'], shape=(6,), dtype=string)
tf.Tensor([b'we' b'going' b'to' b'solve' b'this' b'extremely'], shape=(6,), dtype=string)
tf.Tensor([b'going' b'to' b'solve' b'this' b'extremely' b'difficult'], shape=(6,), dtype=string)
tf.Tensor([b'to' b'solve' b'this' b'extremely' b'difficult' b'problem'], shape=(6,), dtype=string)
tf.Tensor([b'solve' b'this' b'extremely' b'difficult' b'problem' b'with'], shape=(6,), dtype=string)
tf.Tensor([b'this' b'extremely' b'difficult' b'problem' b'with' b'a'], shape=(6,), dtype=string)
tf.Tensor([b'extremely' b'difficult' b'problem' b'with' b'a' b'bit'], shape=(6,), dtype=string)
tf.Tensor([b'difficult' b'problem' b'with' b'a' b'bit' b'of'], shape=(6,), dtype=string)
tf.Tensor([b'problem' b'with' b'a' b'bit' b'of' b'patience'], shape=(6,), dtype=string)
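If installing tensorflow-text is not an option, a similar per-document result can probably be obtained with tf.data alone, by building the windows inside `flat_map` so that each document is windowed on its own. The following is a sketch under assumptions, not the method from the answer above: the in-memory toy strings stand in for the text files, `windows_within_document` is a hypothetical helper name, and `drop_remainder=True` is chosen so that only full windows are produced (the question used `drop_remainder=False`).

```python
import tensorflow as tf

window_size = 6

# Two toy "documents" standing in for the one-line text files (illustrative data)
docs = tf.data.Dataset.from_tensor_slices([
    "a b c d e f g",
    "u v w x y z",
])
tokens = docs.map(tf.strings.split)  # one 1-D string tensor per document

def windows_within_document(doc_tokens):
    """Hypothetical helper: sliding windows over the tokens of a single document."""
    per_doc = tf.data.Dataset.from_tensor_slices(doc_tokens)
    per_doc = per_doc.window(window_size, shift=1, drop_remainder=True)
    # Each window is itself a small dataset; batch it back into one tensor
    return per_doc.flat_map(lambda w: w.batch(window_size, drop_remainder=True))

# flat_map runs the helper once per document, so no window spans two documents
windows = tokens.flat_map(windows_within_document)

result = [[t.decode() for t in w.numpy()] for w in windows]
for row in result:
    print(row)
```

With these toy inputs, the first document (7 tokens) yields two windows and the second (6 tokens) yields one, and none of them mixes tokens from both documents.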
Solution 2:[2]
That is a very nice example. I adapted it a bit for a word generator (as in the question): the text is split into single bytes (single characters for ASCII text) instead of words, so each window holds 6 consecutive characters.
[ Sample ]:
import tensorflow as tf
import tensorflow_text as tft

input_word = tf.constant(' \'Cause it\'s easy as an ice cream sundae Slipping outta your hand into the dirt Easy as an ice cream sundae Every dancer gets a little hurt Easy as an ice cream sundae Slipping outta your hand into the dirt Easy as an ice cream sundae Every dancer gets a little hurt Easy as an ice cream sundae Oh, easy as an ice cream sundae ')
print('input_word: ' + str(input_word))
print(" ")

# One dataset element holding the whole text, split into single bytes
dataset = tf.data.Dataset.from_tensors(tf.strings.bytes_split(input_word))
print(dataset)

window_size = 6
dataset = dataset.map(lambda x: tft.sliding_window(x, width=window_size, axis=0)).flat_map(tf.data.Dataset.from_tensor_slices)

for d in dataset:
    print(d)

input('...')  # wait for Enter before exiting
[ Output (excerpt) ]:
tf.Tensor([b'i' b'c' b'e' b' ' b'c' b'r'], shape=(6,), dtype=string)
tf.Tensor([b'c' b'e' b' ' b'c' b'r' b'e'], shape=(6,), dtype=string)
tf.Tensor([b'e' b' ' b'c' b'r' b'e' b'a'], shape=(6,), dtype=string)
tf.Tensor([b' ' b'c' b'r' b'e' b'a' b'm'], shape=(6,), dtype=string)
tf.Tensor([b'c' b'r' b'e' b'a' b'm' b' '], shape=(6,), dtype=string)
tf.Tensor([b'r' b'e' b'a' b'm' b' ' b's'], shape=(6,), dtype=string)
tf.Tensor([b'e' b'a' b'm' b' ' b's' b'u'], shape=(6,), dtype=string)
tf.Tensor([b'a' b'm' b' ' b's' b'u' b'n'], shape=(6,), dtype=string)
tf.Tensor([b'm' b' ' b's' b'u' b'n' b'd'], shape=(6,), dtype=string)
tf.Tensor([b' ' b's' b'u' b'n' b'd' b'a'], shape=(6,), dtype=string)
tf.Tensor([b's' b'u' b'n' b'd' b'a' b'e'], shape=(6,), dtype=string)
tf.Tensor([b'u' b'n' b'd' b'a' b'e' b' '], shape=(6,), dtype=string)
tf.Tensor([b'n' b'd' b'a' b'e' b' ' b'O'], shape=(6,), dtype=string)
tf.Tensor([b'd' b'a' b'e' b' ' b'O' b'h'], shape=(6,), dtype=string)
tf.Tensor([b'a' b'e' b' ' b'O' b'h' b','], shape=(6,), dtype=string)
tf.Tensor([b'e' b' ' b'O' b'h' b',' b' '], shape=(6,), dtype=string)
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | |
Solution 2 | Martijn Pieters |