Which TensorFlow method decides which particular batch of examples the model learns from?

I'm trying to understand the implementation of SGD in tensorflow.

I began with gradient_descent.py because of the file name.

Per the Keras docs, an optimizer needs to implement the _resource_apply_dense method, which corresponds to the (partial) code shown below:

def _resource_apply_dense(self, grad, var, apply_state=None):
    var_device, var_dtype = var.device, var.dtype.base_dtype
    coefficients = ((apply_state or {}).get((var_device, var_dtype))
                    or self._fallback_apply_state(var_device, var_dtype))

    if self._momentum:
        momentum_var = self.get_slot(var, "momentum")
        return gen_training_ops.ResourceApplyKerasMomentum(
            ...
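For context, here is my own minimal sketch of what implementing that hook looks like. This is not TensorFlow source; the class name PlainSGD is made up, and it assumes the OptimizerV2 API discussed here (in newer TensorFlow releases the corresponding base class is tf.keras.optimizers.legacy.Optimizer):

import tensorflow as tf

class PlainSGD(tf.keras.optimizers.Optimizer):
    """Bare-bones SGD-like optimizer sketch; only the dense update is implemented."""

    def __init__(self, learning_rate=0.01, name="PlainSGD", **kwargs):
        super().__init__(name, **kwargs)
        self._set_hyper("learning_rate", learning_rate)

    def _create_slots(self, var_list):
        # Plain SGD keeps no extra state (no momentum slots).
        pass

    def _resource_apply_dense(self, grad, var, apply_state=None):
        # grad and var arrive from the framework; the optimizer only decides
        # how to turn the gradient into an update of this single variable.
        lr = self._get_hyper("learning_rate", var.dtype.base_dtype)
        return var.assign_sub(lr * grad)

    def _resource_apply_sparse(self, grad, var, indices, apply_state=None):
        raise NotImplementedError("sparse updates omitted in this sketch")

    def get_config(self):
        config = super().get_config()
        config.update(
            {"learning_rate": self._serialize_hyperparameter("learning_rate")})
        return config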

I'd like to know: who passes the var variable to the _resource_apply_dense method? In other words, which method decides that this particular batch of examples is what the model learns from?



Solution 1:[1]

Checking the optimizer_v2 module of TensorFlow's Keras code, we find the only use of this function in the entire TensorFlow codebase:

   #...
   def apply_grad_to_update_var(var, grad):
      #...
      if "apply_state" in self._dense_apply_args:
        apply_kwargs["apply_state"] = apply_state
      update_op = self._resource_apply_dense(grad, var, **apply_kwargs)
      if var.constraint is not None:
        with ops.control_dependencies([update_op]):
          return var.assign(var.constraint(var))

Later in that same file, we see that the var variable comes from an argument to the _distributed_apply function:

#...
def _distributed_apply(self, distribution, grads_and_vars, name, apply_state):
    #...
    with name_scope_only_in_function_or_graph(name or self._name):
      for grad, var in grads_and_vars:
        #...

Finally, the grads_and_vars argument is defined as a list of (gradient, variable) pairs in the apply_gradients function:

  #...
  def apply_gradients(self,
                      grads_and_vars,
    #...
    """...
    Args:
      grads_and_vars: List of (gradient, variable) pairs.
    """

If you search the codebase for occurrences of apply_gradients, you will see that it is the common way to update the weights of the network, and is thus driven by the "update" step of training.
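As a concrete illustration of that update step, here is a minimal sketch of mine (the toy model, data, and loss are made up, not taken from the TensorFlow sources): a custom training loop computes gradients with tf.GradientTape and hands the (gradient, variable) pairs to apply_gradients.

import tensorflow as tf

# Toy model and data, made up purely for illustration.
model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
optimizer = tf.keras.optimizers.SGD(learning_rate=0.01)
x = tf.random.normal((32, 4))   # one batch of 32 examples
y = tf.random.normal((32, 1))

with tf.GradientTape() as tape:
    loss = tf.reduce_mean(tf.square(model(x) - y))

grads = tape.gradient(loss, model.trainable_variables)
# The (gradient, variable) pairs enter the optimizer here; from apply_gradients
# they flow through _distributed_apply down to _resource_apply_dense.
optimizer.apply_gradients(zip(grads, model.trainable_variables))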

Solution 2:[2]

These are two different questions:

  1. The caller: "who passes the var variable to the _resource_apply_dense method?"
  2. Particular examples: "which method decides this particular batch of examples is for the model to learn?"

1. The caller

The main function that updates weights in any TensorFlow optimizer is apply_gradients, and it receives a zip of trainable weights and their gradients. Each var is one of those trainable weights, unzipped from that list inside _distributed_apply. From my understanding, here is the call sequence (a small sketch to observe it follows the list):

  1. apply_gradients calls _distributed_apply.
  2. _distributed_apply calls an inner function, apply_grad_to_update_var.
  3. apply_grad_to_update_var calls the inherited or custom _resource_apply_dense or _resource_apply_sparse.
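One way to observe this sequence yourself is a debugging sketch of my own (the class name LoggingSGD is made up; in TensorFlow 2.11+ you may need tf.keras.optimizers.legacy.SGD as the base class): subclass the built-in SGD and log every variable that reaches _resource_apply_dense.

import tensorflow as tf

class LoggingSGD(tf.keras.optimizers.SGD):
    """Prints which variable reaches _resource_apply_dense on every update."""

    def _resource_apply_dense(self, grad, var, apply_state=None):
        tf.print("applying dense update to:", var.name)
        return super()._resource_apply_dense(grad, var, apply_state)

# Tiny made-up model and data, just to trigger a few updates.
model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(4,))])
model.compile(optimizer=LoggingSGD(learning_rate=0.01), loss="mse")
model.fit(tf.random.normal((8, 4)), tf.random.normal((8, 1)), epochs=1, verbose=0)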

2. Particular examples

The decision about which examples are picked for the model to learn from has nothing to do with the optimizer. Optimizers only decide by how much the weights are changed (that amount can be just the gradients, or something more elaborate) and then apply that change.

A batch is a subset of the data. Thus, you can specify the batches yourself, or let other classes decide for you, such as the Dataset class (see its shuffle function).
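For example (a minimal sketch with made-up toy data), the shuffling and batching below is what actually decides which examples end up in each batch, long before the optimizer ever sees a gradient:

import numpy as np
import tensorflow as tf

# Toy data, made up for illustration.
features = np.random.rand(1000, 4).astype("float32")
labels = np.random.rand(1000, 1).astype("float32")

# The dataset pipeline, not the optimizer, decides which examples form a batch.
dataset = (tf.data.Dataset.from_tensor_slices((features, labels))
           .shuffle(buffer_size=1000)  # randomizes which examples are grouped
           .batch(32))                 # groups them into batches of 32

model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
model.compile(optimizer="sgd", loss="mse")
model.fit(dataset, epochs=1)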

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution 1: ibarrond
Solution 2: