Reverse of keras Text Vectorization layer?

The tf.keras.layers.TextVectorization layer maps text features to integer sequences, and since it can be added as a layer of a Keras model, it makes it easy to deploy the model as a single file that takes a string as input and processes it. But I also need to do the reverse operation, and I cannot find any way to do this. I am working with an LSTM model that predicts the next word from the previous words. For example, my model needs to accept the string "I love" and should output possible next words like "cats", "dogs", etc. I can map strings to and from integers manually using tf.keras.preprocessing.text.Tokenizer like this:

text = "I love cats"
tokenizer = tf.keras.preprocessing.text.Tokenizer(num_words=10000, oov_token='<oov>')
tokenizer.fit_on_texts([text])

seqs = tokenizer.texts_to_sequences([text])
prediction = model.predict(seqs) # probabilities over the vocabulary
predicted_ids = prediction.argmax(axis=-1) # indices of the most likely words
actual_prediction = tokenizer.sequences_to_texts(predicted_ids.tolist()) # now the desired string

How can I achieve the functionality of the TextVectorization layer in the model's output layer, so that instead of a predicted index I get the string that the TextVectorization layer maps that index to?



Solution 1:[1]

It is straightforward, but you need to separate the string-to-sequence mapping from the model itself, so you can work with the relationship between them directly.

[ Sample 1 ]: As string sequences

import tensorflow as tf

# character-level vocabulary; "_" is used here as a padding character
vocab = [ "a", "b", "c", "d", "e", "f", "g", "h", "I", "j", "k", "l", "m", "n", "o", "p", "q", "r", "s", "t", "u", "v", "w", "x", "y", "z", "_" ]
data = tf.constant([["_", "_", "_", "I"], ["l", "o", "v", "e"], ["c", "a", "t", "s"]])

# forward mapping: characters -> integer indices (index 0 is reserved for OOV)
layer = tf.keras.layers.StringLookup(vocabulary=vocab)
sequences_mapping_string = layer(data)
sequences_mapping_string = tf.reshape( sequences_mapping_string, (1, 12) )

# inverse mapping: integer indices -> characters
decoder = tf.keras.layers.StringLookup(vocabulary=vocab, output_mode="int", invert=True)
result = decoder(sequences_mapping_string)
print( "encode: " + str( sequences_mapping_string ) )
print( "decode: " + str( result ) )

# each character's index multiplied by its reciprocal gives 1.0,
# illustrating that the forward and reverse mappings match one-to-one
mapping_vocab = [ "_", "I", "l", "o", "v", "e", "c", "a", "t", "s" ]
string_matching = [ 27, 9, 12, 15, 22, 5, 3, 1, 20, 19 ]
string_matching_reverse = [ 1/27, 1/9, 1/12, 1/15, 1/22, 1/5, 1/3, 1/1, 1/20, 1/19 ]

print( tf.math.multiply( tf.constant(string_matching, dtype=tf.float32), tf.constant(string_matching_reverse, dtype=tf.float32 ), name=None ) )

[ Output ]:

# encode: tf.Tensor([[27 27 27  9 12 15 22  5  3  1 20 19]], shape=(1, 12), dtype=int64)
# decode: tf.Tensor([[b'_' b'_' b'_' b'I' b'l' b'o' b'v' b'e' b'c' b'a' b't' b's']], shape=(1, 12), dtype=string)
# tf.Tensor([1. 1. 1. 1. 1. 1. 1. 1. 1. 1.], shape=(10,), dtype=float32)
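For word-level models, the same inversion can be applied to the vocabulary that TextVectorization itself learns. A minimal sketch (assuming TF 2.x, where get_vocabulary() returns the index-to-word list; index 0 is the padding token "" and the default standardization lowercases the input):

```python
import tensorflow as tf

corpus = ["I love cats", "I love dogs"]

# forward: text -> integer sequences (lowercased by default standardization)
vectorizer = tf.keras.layers.TextVectorization()
vectorizer.adapt(corpus)
seqs = vectorizer(tf.constant(["I love cats"]))

# reverse: look up each index in the learned vocabulary, skipping padding
vocab = vectorizer.get_vocabulary()
decoded = [" ".join(vocab[i] for i in seq if i != 0) for seq in seqs.numpy()]
print(decoded)  # -> ['i love cats']
```

The same vocab list can also seed a tf.keras.layers.StringLookup with invert=True if you want the reverse mapping as a layer rather than a Python loop.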

[ Sample 2 ]: As word sequences, applied to a model's input requirements

# build a batched dataset of (features, labels) pairs
dataset = tf.data.Dataset.from_tensor_slices((batched_features, batched_labels))
dataset = dataset.batch(10)

# the batched dataset can be fed to the model directly
predictions = model.predict(dataset)
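To turn the model's output back into a word, take the argmax over the vocabulary axis and look the index up in the same index-to-word list. A small sketch with a hypothetical prediction row standing in for the output of model.predict:

```python
import numpy as np

# index-to-word list, as returned by e.g. TextVectorization.get_vocabulary()
vocab = ["", "[UNK]", "love", "i", "cats", "dogs"]

# stand-in for model.predict(...): one probability row per input sequence
predictions = np.array([[0.01, 0.02, 0.05, 0.02, 0.60, 0.30]])

next_word_ids = predictions.argmax(axis=-1)     # most likely index per row
next_words = [vocab[i] for i in next_word_ids]
print(next_words)  # -> ['cats']
```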


Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Martijn Pieters