Best guess without reading all the code in detail: it looks like you applied a sigmoid activation. If you train with no activation (activation='linear'), you should get the visualization you are looking for. You may have to train longer to get convergence (assuming it can converge without an activation). If you want to keep the sigmoid, then you need to map your linear neuron through that activation (hence it won't look like a plane anymore).
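For example, a minimal sketch of what I mean (assuming a Keras Sequential setup roughly like yours with 3 input features; I've swapped the loss to mse here, since binary_crossentropy expects outputs in (0, 1), so treat this as a sketch rather than a drop-in replacement):

```python
from tensorflow import keras

# Single dense neuron with no activation: the fitted function is literally
# the plane a*x + b*y + c*z + d, which is easy to visualize.
model = keras.Sequential([
    keras.Input(shape=(3,)),
    keras.layers.Dense(1, activation='linear')
])
model.compile(optimizer='adam', loss='mse')
# model.fit(X, y, epochs=..., batch_size=...)  # X: (n, 3) features, y: (n,) targets
```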
EDIT:
My understanding of NNs: a dense layer from 3 inputs to 1 output with a sigmoid activation is an attempt to optimize the variables a, b, c, d in the equation
f(x,y,z) = 1/(1 + e^(-D(x,y,z))), where D(x,y,z) = ax + by + cz + d
so that the binary_crossentropy (what you picked) is minimized. I will use B for the per-sample log-loss term. Our loss equation would look something like:
L = ∑ B(y,Y)
where y is the value we want to predict (a 0 or 1 in this case), Y is the value output by the equation above, and the sum runs over all the data (or over batches in a NN). Hence, this can be written as
L = ∑ B(y,f(x,y,z))
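Spelled out in code, the prediction and the summed loss look roughly like this (a NumPy sketch of the equations above, with names of my own choosing; note that Keras' binary_crossentropy averages over the batch instead of summing, which only rescales L):

```python
import numpy as np

def predict(points, a, b, c, d):
    """f(x, y, z) = 1 / (1 + exp(-D)), with D = a*x + b*y + c*z + d."""
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    D = a * x + b * y + c * z + d
    return 1.0 / (1.0 + np.exp(-D))

def loss(y_true, Y, eps=1e-7):
    """L = sum over the data of B(y, Y) = -[y*log(Y) + (1 - y)*log(1 - Y)]."""
    Y = np.clip(Y, eps, 1 - eps)  # avoid log(0)
    return -np.sum(y_true * np.log(Y) + (1 - y_true) * np.log(1 - Y))
```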
Finding the minimum of L over the variables a, b, c, d can probably be calculated directly by taking partial derivatives and solving the resulting system of equations (this is why a NN should never be used with such a small set of variables (like 4): they can be solved explicitly, so there is no point in training). Whether you solve directly or use stochastic gradient descent to slowly move a, b, c, d towards a minimum, either way we end up with the optimized a, b, c, d.
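To make the gradient-descent route concrete, here is a toy sketch (not what Keras does internally, just the same idea; it uses the fact that for cross-entropy composed with a sigmoid, the derivative of L with respect to D is simply Y - y):

```python
import numpy as np

def fit_plain_gd(X, y, lr=0.1, epochs=5000):
    """Gradient descent on a, b, c, d for the sigmoid + cross-entropy loss above.
    X: (n, 3) inputs, y: (n,) labels in {0, 1}. Illustrative only."""
    w = np.zeros(X.shape[1])                    # a, b, c
    d = 0.0                                     # bias term
    for _ in range(epochs):
        Y = 1.0 / (1.0 + np.exp(-(X @ w + d)))  # forward pass through the sigmoid
        err = Y - y                             # dL/dD per sample
        w -= lr * (X.T @ err) / len(y)          # update a, b, c
        d -= lr * err.mean()                    # update d
    return w, d
```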
a, b, c, d have been tuned specifically to produce values that, when plugged into the sigmoid equation, yield predicted categories that give us the minimum loss when tested against the loss equation.
I stand corrected though. In this case, because we specifically have a sigmoid, setting up and solving the boundary equation does appear to always produce a plane (I did not know that). I don't think this would work with any other activation or with any NN that has more than one layer.
1/2 = 1/(1 + e^(-D(x,y,z)))
1 + e^(-D(x,y,z)) = 2
e^(-D(x,y,z)) = 1
D(x,y,z) = 0
ax + by + cz + d = 0
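In practice that means you can read the boundary plane straight off the trained layer, along these lines (assuming `model` is your fitted single-neuron Keras model from before):

```python
# For a Dense(1) layer on 3 inputs, weights has shape (3, 1) and bias has shape (1,).
weights, bias = model.layers[0].get_weights()
a, b, c = weights.ravel()
d = bias[0]
print(f"decision boundary: {a:.3f}*x + {b:.3f}*y + {c:.3f}*z + {d:.3f} = 0")
```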
So, I downloaded your data and ran your code. I don't get convergence at all; I tried various batch_sizes, loss functions, and activation functions. Nothing. Based on the picture, it seems plausible that nearly every random initialization of the weights will favor moving away from the cluster rather than trying to find the center of it.
You probably need to transform your data first (normalizing all the axes might do the trick), or manually initialize your weights to something near the center, so that training converges. Long story short, your a, b, c, d are not optimal. You could also explicitly solve the partial derivatives above and find the optimal a, b, c, d instead of trying to get a single neuron to converge. There are also explicit methods for calculating an optimal separating plane for binary data (an extension of linear regression).
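As a sketch of both suggestions at once (the synthetic X and y below are stand-ins for your data, and the scikit-learn calls are my own choice, not from your code):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Stand-in data; replace with your own (n, 3) features X and 0/1 labels y.
X = np.random.randn(200, 3)
y = (X[:, 0] + X[:, 1] - X[:, 2] > 0).astype(int)

# 1) Normalize every axis so the single neuron (or any optimizer) has an easier time.
X_scaled = StandardScaler().fit_transform(X)

# 2) Or skip the neural net entirely: logistic regression is the same
#    sigmoid-plus-plane model, fitted directly.
clf = LogisticRegression().fit(X_scaled, y)
a, b, c = clf.coef_[0]
d = clf.intercept_[0]
```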