How do I include the bias term with other weights when performing gradient descent in TensorFlow?

I'm a beginner at ML and have been following the Coursera intro course. I am trying to implement the exercises using TensorFlow rather than Octave.

I have three versions - the first two work fine and the third doesn't. I would like to know why.

Note - I am using TensorFlow.NET in F#, but the binding is a 1:1 mapping of the Python API, so it should look pretty familiar to Python devs.

1.) Completely manual, works fine

let gradientDescent (x : Tensor) (y : Tensor) (alpha : Tensor) iters = 

    // one batch gradient descent step: theta := theta - alpha * (X^T (X*theta - y)) / m
    let update (theta : Tensor) =
        let delta = 
            let h = tf.matmul(x, theta)                  // hypothesis
            let errors = h - y
            let s = tf.matmul((tf.transpose x), errors)
            s / m                                        // m = number of training examples, defined elsewhere

        theta - alpha * delta

    let rec search (theta : Tensor) i =
        if i = 0 then
            theta
        else
            search (update theta) (i - 1)

    let initTheta = tf.constant([| 0.; 0. |], shape = Shape [| 2; 1 |])      
    search initTheta iters 

let ones = tf.ones(Shape [| m; 1 |], dtype = TF_DataType.TF_DOUBLE)
let X = tf.concat([ ones; x ], axis = 1)
let theta = gradientDescent X y alpha iterations
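
For reference, the update this loop computes is the standard vectorized batch gradient descent step, with the bias folded into theta via the column of ones prepended to X:

\theta \leftarrow \theta - \frac{\alpha}{m} X^{\top} (X\theta - y)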

2.) Using Gradient Tape for auto differentiation with a separate bias term - works fine also

let gradientDescent (x : Tensor) (y : Tensor) (alpha : float32) iters = 
    let W = tf.Variable(0.f, name = "weight")
    let b = tf.Variable(0.f, name = "bias")
    let optimizer = keras.optimizers.SGD alpha
    for _ in 0 .. iters do 
        use g = tf.GradientTape()
        let h = W * x + b                                        // hypothesis with an explicit bias variable
        let loss = tf.reduce_sum(tf.pow(h - y, 2)) / (2 * m)     // squared-error cost
        let gradients = g.gradient(loss, struct (b, W))
        optimizer.apply_gradients(zip(gradients, struct (b, W))) // pairs each gradient with its variable
        
    tf
        .constant(value = [| b.value(); W.value() |], shape = Shape [| 2; 1 |])
        .numpy()
        .astype(TF_DataType.TF_DOUBLE)

let theta = gradientDescent x y alpha iterations
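
For what it's worth, the cost here is the same halved squared-error cost, now written with a scalar weight and an explicit bias (m is the number of training examples), and the tape differentiates it with respect to both variables:

L(W, b) = \frac{1}{2m} \sum_{i=1}^{m} (W x_i + b - y_i)^2

\frac{\partial L}{\partial W} = \frac{1}{m} \sum_{i=1}^{m} (W x_i + b - y_i) x_i

\frac{\partial L}{\partial b} = \frac{1}{m} \sum_{i=1}^{m} (W x_i + b - y_i)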

3.) Using Gradient Tape as before, this time including the bias term as part of the weights - this throws a stack overflow exception when calling apply_gradients.

let gradientDescent (x : Tensor) (y : Tensor) (alpha : float32) iters = 
    let W = tf.Variable(tf.constant([| 0.; 0. |], shape = Shape [| 2; 1 |]))
    let optimizer = keras.optimizers.SGD alpha
    for _ in 0 .. iters do 
        use g = tf.GradientTape()
        let h = tf.matmul(x, W)
        let loss = tf.reduce_sum(tf.pow(h - y,2)) / (2 * m)
        let gradients = g.gradient(loss, W) // correct gradient tensor returned here
        optimizer.apply_gradients(struct(gradients, W)) // boom!
        
    tf
        .constant(value = W.value().ToArray<double>(), shape = Shape [| 2; 1 |])
        .numpy()

let ones = tf.ones(Shape [| m; 1 |], dtype = TF_DataType.TF_DOUBLE)
let X = tf.concat([ ones; x ], axis = 1)
let theta = gradientDescent X y alpha iterations


Solution 1: [1]

I worked it out - optimizer.apply_gradients requires an iterable of (gradient, variable) pairs, not a single tuple.

All I had to do was change

optimizer.apply_gradients(struct(gradients, W))

to

optimizer.apply_gradients(zip([|gradients|], [|W|]))

plus a bit of float32/float64 casting.
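
Putting it together, the version 3 loop after the fix looks roughly like this. This is only a sketch: the float32 initial weights and the tf.cast calls on x and y are just one way of handling the 32/64-bit mismatch with the float32 learning rate - adjust to whatever dtypes your inputs actually have.

let gradientDescent (x : Tensor) (y : Tensor) (alpha : float32) iters = 
    // sketch only - cast the inputs to float32 to match the learning rate and the initial weights
    let x = tf.cast(x, TF_DataType.TF_FLOAT)
    let y = tf.cast(y, TF_DataType.TF_FLOAT)
    let W = tf.Variable(tf.constant([| 0.f; 0.f |], shape = Shape [| 2; 1 |]))
    let optimizer = keras.optimizers.SGD alpha
    for _ in 0 .. iters do 
        use g = tf.GradientTape()
        let h = tf.matmul(x, W)                                   // bias included via the ones column in x
        let loss = tf.reduce_sum(tf.pow(h - y, 2)) / (2 * m)      // same cost as before
        let gradients = g.gradient(loss, W)
        optimizer.apply_gradients(zip([| gradients |], [| W |]))  // iterable of (gradient, variable) pairs
    W.value()                                                     // convert back to whatever shape/dtype you need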

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

[1] Solution 1: Ryan