Recently I started toying with neural networks. I was trying to implement an AND
gate with Tensorflow. I am having trouble understanding when to use different cost and activation functions. This is a basic neural network with only input and output layers, no hidden layers.
First I tried to implement it in this way. As you can see this is a poor implementation, but I think it gets the job done, at least in some way. So, I tried only the real outputs, no one hot true outputs. For activation functions, I used a sigmoid function and for cost function I used squared error cost function (I think its called that, correct me if I'm wrong).
I've tried using ReLU and Softmax as activation functions (with the same cost function) and it doesn't work. I figured out why they don't work. I also tried the sigmoid function with Cross Entropy cost function, it also doesn't work.
import tensorflow as tf
import numpy
train_X = numpy.asarray([[0,0],[0,1],[1,0],[1,1]])
train_Y = numpy.asarray([[0],[0],[0],[1]])
x = tf.placeholder("float",[None, 2])
y = tf.placeholder("float",[None, 1])
W = tf.Variable(tf.zeros([2, 1]))
b = tf.Variable(tf.zeros([1, 1]))
activation = tf.nn.sigmoid(tf.matmul(x, W)+b)
cost = tf.reduce_sum(tf.square(activation - y))/4
optimizer = tf.train.GradientDescentOptimizer(.1).minimize(cost)
init = tf.initialize_all_variables()
with tf.Session() as sess:
sess.run(init)
for i in range(5000):
train_data = sess.run(optimizer, feed_dict={x: train_X, y: train_Y})
result = sess.run(activation, feed_dict={x:train_X})
print(result)
after 5000 iterations:
[[ 0.0031316 ]
[ 0.12012422]
[ 0.12012422]
[ 0.85576665]]
Question 1 - Is there any other activation function and cost function, that can work(learn) for the above network, without changing the parameters(meaning without changing W, x, b).
Question 2 - I read from a StackOverflow post here:
[Activation Function] selection depends on the problem.
So there are no cost functions that can be used anywhere? I mean there is no standard cost function that can be used on any neural network. Right? Please correct me on this.
I also implemented the AND
gate with a different approach, with the output as one-hot true. As you can see the train_Y
[1,0]
means that the 0th index is 1, so the answer is 0. I hope you get it.
Here I have used a softmax activation function, with cross entropy as cost function. Sigmoid function as activation function fails miserably.
import tensorflow as tf
import numpy
train_X = numpy.asarray([[0,0],[0,1],[1,0],[1,1]])
train_Y = numpy.asarray([[1,0],[1,0],[1,0],[0,1]])
x = tf.placeholder("float",[None, 2])
y = tf.placeholder("float",[None, 2])
W = tf.Variable(tf.zeros([2, 2]))
b = tf.Variable(tf.zeros([2]))
activation = tf.nn.softmax(tf.matmul(x, W)+b)
cost = -tf.reduce_sum(y*tf.log(activation))
optimizer = tf.train.GradientDescentOptimizer(0.5).minimize(cost)
init = tf.initialize_all_variables()
with tf.Session() as sess:
sess.run(init)
for i in range(5000):
train_data = sess.run(optimizer, feed_dict={x: train_X, y: train_Y})
result = sess.run(activation, feed_dict={x:train_X})
print(result)
after 5000 iteration
[[ 1.00000000e+00 1.41971401e-09]
[ 9.98996437e-01 1.00352429e-03]
[ 9.98996437e-01 1.00352429e-03]
[ 1.40495342e-03 9.98595059e-01]]
Question 3 So in this case what cost function and activation function can I use? How do I understand what type of cost and activation functions I should use? Is there a standard way or rule, or just experience only? Should I have to try every cost and activation function in a brute force manner? I found an answer here. But I am hoping for a more elaborate explanation.
Question 4 I have noticed that it takes many iterations to converge to a near accurate prediction. I think the convergance rate depends on the learning rate (using too large of will miss the solution) and the cost function (correct me if I'm wrong). So, is there any optimal way (meaning the fastest) or cost function for converging to a correct solution?
See Question&Answers more detail:
os