TFPyEnvironment wrapper changing shape of observation
I have a custom gym environment wrapped in a PyEnvironment, which is in turn wrapped in a TFPyEnvironment. Upon resetting, the TFPyEnvironment seems to change the shape of the observation.
Here is the observation_space spec of the gym env:
space = {'observation': gym.spaces.Box(np.float32(0), np.float32(3),
                                       shape=(size*size*3,)),
         'legal_moves': gym.spaces.Discrete(gogame.action_size(self.state_)-1)}
self.observation_space = gym.spaces.Dict(space)
Here is the return of the reset() method:
observations_and_legal_moves = {'observation': np.copy(self.state_)[:3].flatten(),
                                'legal_moves': 1 - self.state_[govars.INVD_CHNL].flatten()}
return observations_and_legal_moves
And of the step() method:
observations_and_legal_moves = {'observation': np.copy(self.state_)[:3].flatten(),
                                'legal_moves': 1 - self.state_[govars.INVD_CHNL].flatten()}
return observations_and_legal_moves, self.reward(), self.done, self.info()
I wrap the gym environment in a PyEnvironment:
tp_env = suite_gym.load('gym_go:go-v1', gym_kwargs={'size':3,'komi':0})
Then I call reset (tp_env.reset()), with the output:
TimeStep(
{'discount': array(1., dtype=float32),
'observation': OrderedDict([('legal_moves',
array([1, 1, 1, 1, 1, 1, 1, 1, 1])),
('observation',
array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0.], dtype=float32))]),
'reward': array(0., dtype=float32),
'step_type': array(0, dtype=int32)})
This is the correct shape: (9,) or () for legal_moves, and (27,) for observation.
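For reference, the same unbatched shapes can be read directly off the PyEnvironment's specs (a minimal sketch using the tp_env from above):
# Print the unbatched specs the PyEnvironment reports, to compare against
# the arrays in the TimeStep above.
print(tp_env.observation_spec())  # dict of specs: 'observation' shape (27,), 'legal_moves' shape ()
print(tp_env.time_step_spec())    # step_type/reward/discount are scalar (unbatched) specs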
Then I wrap it in a TFPyEnvironment:
t_env = tf_py_environment.TFPyEnvironment(suite_gym.load('gym_go:go-v1', gym_kwargs={'size':3,'komi':0}))
and then call t_env.reset(), with the output:
TimeStep(
{'discount': <tf.Tensor: shape=(1,), dtype=float32, numpy=array([1.], dtype=float32)>,
'observation': OrderedDict([('legal_moves',
<tf.Tensor: shape=(1, 9), dtype=int64, numpy=array([[1, 1, 1, 1, 1, 1, 1, 1, 1]])>),
('observation',
<tf.Tensor: shape=(1, 27), dtype=float32, numpy=
array([[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]], dtype=float32)>)]),
'reward': <tf.Tensor: shape=(1,), dtype=float32, numpy=array([0.], dtype=float32)>,
'step_type': <tf.Tensor: shape=(1,), dtype=int32, numpy=array([0], dtype=int32)>})
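A quick way to see the discrepancy side by side (a sketch, using the same t_env as above):
# The TFPyEnvironment reports its specs without a batch dimension, while the
# tensors it emits carry a leading dimension equal to its batch_size.
print(t_env.batch_size)                                # 1
print(t_env.observation_spec()['observation'].shape)   # (27,)
print(t_env.reset().observation['observation'].shape)  # (1, 27)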
I've looked through the source code for TFPyEnvironment but can't seem to figure out what it is that is changing the shape. Why is this happening?
EDIT:
This doesn't appear to be a problem in itself, since it also happens when the observation_space is just a gym.spaces.Box as opposed to a Dict.
When self.observation_space = gym.spaces.Box(np.float32(0), np.float32(3), shape=(size*size*3,)) in the custom gym environment, the TFPyEnvironment reset returns an observation of shape (1, 27), and it works fine with RandomTFPolicy and the PyDriver. However, when it's a Dict and an observation splitter is provided:
space = {'observation': gym.spaces.Box(np.float32(0), np.float32(3),
                                       shape=(size*size*3,)),
         'legal_moves': gym.spaces.Discrete(gogame.action_size(self.state_)-1)}
self.observation_space = gym.spaces.Dict(space)
...
def observation_and_action_constraint_splitter_func(obs):
    return obs['observation'], obs['legal_moves']
random_policy = random_tf_policy.RandomTFPolicy(
    t_env.time_step_spec(),
    t_env.action_spec(),
    observation_and_action_constraint_splitter=observation_and_action_constraint_splitter_func)
py_driver.PyDriver(
    env,
    py_tf_eager_policy.PyTFEagerPolicy(random_policy, use_tf_function=True),
    [rb_observer],
    max_steps=initial_collect_steps).run(tp_env.reset())
The following error is thrown:
ValueError: Received a mix of batched and unbatched Tensors, or Tensors are not compatible with Specs. num_outer_dims: 1.
Saw tensor_shapes:
TimeStep(
{'discount': TensorShape([1]),
'observation': {'legal_moves': TensorShape([1, 9]),
'observation': TensorShape([1, 27])},
'reward': TensorShape([1]),
'step_type': TensorShape([1])})
And spec_shapes:
TimeStep(
{'discount': TensorShape([]),
'observation': {'legal_moves': TensorShape([]),
'observation': TensorShape([27])},
'reward': TensorShape([]),
'step_type': TensorShape([])})
Even though the observation appears to be the exact same shape, (1, 27), this makes even less sense than before.
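For completeness, printing the policy's expected spec next to what the two environments actually return shows where the leading dimension appears (a sketch; the variable names are the same as above):
# The TF policy validates incoming time steps against its own time_step_spec.
print(random_policy.time_step_spec)  # unbatched spec (shapes like (27,) and ())
print(t_env.reset())                 # tensors with a leading batch dimension of 1
print(tp_env.reset())                # plain numpy arrays with no batch dimension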
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow