NoProp: Training neural networks without back-propagation or forward-propagation
by belleville
It's a neat idea. It's not too dissimilar in spirit from gradient boosting. The point about credit assignment is crucial, and that's the same reason most architectures and initialization methods are so convoluted nowadays.
I don't really like one of their premises and conclusions:
> that does not learn hierarchical representations
There's an implicit bias here that (a) traditional networks do learn hierarchical representations, (b) that's bad, and (c) this training method does not learn those. However, (a) is situational, and it's easy to construct datasets where a standard gradient-descent neural net will learn a different way, even with a reverse hierarchy. (b) is unproven and also doesn't make a lot of intuitive sense to me. (c), even in this paper where they make that claim, has no evidence and also doesn't seem likely to be true.
https://www.reddit.com/r/MachineLearning/comments/1jsft3c/r_...
I'm still not quite sure how to think of this. Maybe as being like unrolling a diffusion model, the equivalent of BPTT for RNNs?
I think we need to start thinking about one-shot training. I.e. instead of putting context into an LLM, you should be able to tell it a fact, and it will encode that fact into the updated weights.
In all their experiments, backprop is used for most of their parameters, though...
There is a meaningful distinction. They only use backprop one layer at a time, requiring additional space proportional to that layer. Full backprop requires additional space proportional to the whole network.
It's also a bit interesting as an experimental result, since the core idea doesn't require backprop. Since backprop here is just an implementation detail, you could theoretically swap in other layer types or solvers.
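The memory distinction above can be sketched in a few lines. This is a hypothetical toy, not the paper's method: one layer is trained against a local target while the earlier layer's activations are treated as fixed inputs, so the only gradient state kept around is for that single layer. All names, shapes, and constants are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data; every name and shape here is illustrative.
X = rng.normal(size=(64, 8))
Y = X @ rng.normal(size=(8, 4))          # linear teacher to fit

W1 = rng.normal(size=(8, 16)) * 0.1      # earlier layer, left untouched
W2 = rng.normal(size=(16, 4)) * 0.1      # the single layer we train

H = np.tanh(X @ W1)                      # activations treated as constants
loss_before = float(np.mean((H @ W2 - Y) ** 2))

lr = 0.05
for _ in range(200):
    err = H @ W2 - Y                     # local prediction error
    # The gradient touches only W2; nothing flows back into W1, so the
    # extra memory is proportional to this one layer, not the whole net.
    W2 -= lr * H.T @ err / len(X)

loss_after = float(np.mean((H @ W2 - Y) ** 2))
```

Full backprop would instead have to store intermediate activations and gradients for every layer between the loss and the input.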
An interesting idea for sure, but why only evaluate it on 28x28 pixel images? Why is their flow-matching variant so much worse in some cases? Some analysis is missing. Their only words on it say nothing:
> For CIFAR-100 with one-hot embeddings, NoProp-FM fails to learn effectively, resulting in very slow accuracy improvement
In general any actual analysis is made impossible because of the lack of signal in the results. Fig 5 tells me nothing when the span is 99.58 to 99.46 percent accuracy.
If we could ever figure out what wet brains actually do (continuous feedback? enzyme release?), this might be possible.
We know quite a lot. For example, we know that brains have various different neuromodulatory pathways. Take for example the dopamine reward mechanism that is being talked about more openly these days: dopamine is literally secreted by various different parts of the brain and affects different pathways.
I don't think it is anywhere feasible to emulate anything resembling this in a computational neural network with fixed input and output neurons.
Dopamine is not permanent, though. We're talking about long-term synaptic plasticity, not short-term neurotransmitter modulation.
Dopamine modulates long term potentiation and depression, in some complicated way.
Aren't we already emulating it? It's sort of a distributed and overlaid reward function, which we just undistributed
Keep in mind that our brains also have a great deal of built in trained structure from evolution. So even if we understood exactly how a brain learns, we may still not be able to replicate it if we can't figure out the highly optimized initial state from which it starts in a fetus.
Presumably that is limited by the gig or so of information in our DNA, though?
The amount of information transmitted from one generation to the next is potentially much more than the contents of DNA. DNA is not an encoding of every detail of a living body; it is a set of instructions for a living body to create an approximate copy of itself. You can't, as far as we know, use DNA to create a new organism from scratch without having the parent organism around to build it. We do know for certain that many parts of a cell divide separately from the nucleus and have no relation to the DNA of the cell - the best known being the mitochondria, which have their own DNA, but also many organelles that just split off and migrate to the new cell quasi-independently. And this is just the simplest layer in some of the simplest organisms - we have no idea whatsoever how much other information is transmitted from the parent organism to the child in ways other than DNA.
In particular in mammals, we have no idea how actively the mother's body helps shape the child. Of course, there's no direct neuron to neuron contact, but that doesn't mean that the mother's body can't contribute to aspects of even the fetal brain development in other ways.
Interesting. As you say, that certainly makes sense for mammals. But I'd be interested in knowing what mechanisms you might conjecture for birds, where pretty much all foetal development happens inside the egg, separated from the mother -- or fish, or octopuses.
I concur. It might not be feasible in terms of computational power available, but I don't think there is anything fundamentally stopping application of those training mechanisms, unless the whole neuralnet paradigm is fundamentally incompatible with those learning methods.
How much of cognition, especially "higher level cognition" like language, is encoded genetically is highly controversial, and the thinking/pendulum in the last decade or two has shifted substantially towards only general mechanisms being innate. E.g. the cortex may be in an essentially "random state" prior to getting input.
Yet for example the auditory/language processing part is almost always located in the same region for all humans.
E.g. ear input is connected to the same cortical location in almost all humans.
That's why I qualified all of my statements with "may" and "might". Still, I think it's extraordinarily unlikely that human brains could turn out, for example, to have no special bias for learning language. The training algorithm in our brains would have to be so many orders of magnitude better than the state of the art in ANNs that it would boggle the mind.
Consider the comparison with LLM training. A state-of-the-art LLM that is, say, only an order of magnitude better than an average 4-year-old human child in language use is trained on ~all of the human text ever produced, consuming many megawatts of power in the process. And it's helped with plenty of pre-processing of this text information, and receives virtually no noise.
In contrast, a human child that is not deaf acquires language from a noisy environment with plenty of auditory stimuli, from which they first have to even understand that they are picking up language. To be able to communicate and thus receive significant feedback on the learning, they also have to learn how to control a very complex set of organs (tongue, lips, larynx, chest muscles), all with many degrees of freedom and precise timing needed to produce any sound whatsoever.
And yet virtually all human children learn all of this in a matter of 12-24 months, and then spend another 2-3 years learning more language without struggling as much with the basics of word recognition and pronunciation. And they do all this while consuming a total of some 5kWh, which includes many bodily processes that are not directly related to language acquisition, and a lot of direct physical activity too.
So, either we are missing something extremely fundamental, or the initial state of the brain is very, very far from random and much of this was actually trained over tens or hundreds of thousands of years of evolution of the hominids.
Language capability is a bit difficult to quantify, but LLMs know tens of languages, and many of those better, at least grammar- and vocabulary-wise, than the vast majority of even native humans. They also encode magnitudes more fact-type knowledge than any human being. My take is that language isn't that hard but humans just kinda suck at it, like we suck at arithmetic and chess.
There sure is some "inductive bias" in the anatomy of the brain to develop things like language but it could be closer to how transformer architectures differ from pure MLPs.
The argument was for decades that no generic system can learn language from input alone. That turned out flat wrong.
Didn't they get neurons in a petri dish to fly a flight simulator?
We have gradient-free algorithms: Hebbian learning. Since 1949?
And there's good reasons why we use gradients today.
That's more a theory/principle, not an algorithm by itself.
It is an update rule:
w_ij = f(w_ij, x_i, x_j)
The weight of the connection between nodes i and j is modified by a function over the activations or inputs of node i and j.
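The update rule above takes only a few lines to implement. A minimal sketch of plain Hebbian learning (learning rate and network size are arbitrary choices for illustration):

```python
import numpy as np

def hebb_step(W, x, eta=0.1):
    """Plain Hebbian update: dW[i, j] = eta * x[i] * x[j].
    Purely local -- each weight sees only its two endpoint activations."""
    return W + eta * np.outer(x, x)

rng = np.random.default_rng(0)
W = np.zeros((4, 4))
for _ in range(50):
    # Units 0 and 1 repeatedly fire together; units 2 and 3 stay silent.
    x = np.array([1.0, 1.0, 0.0, 0.0]) + 0.01 * rng.normal(size=4)
    W = hebb_step(W, x)

# W[0, 1] grows large; W[2, 3] stays near zero.
```

Note that the co-active weights grow without bound here: plain Hebb has no normalization, which is the classic stability problem with the rule.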
There are many variants of back-propagation too.
Regardless, yes it would be used within a network model such as a Hopfield network.
"Whenever these kind of papers come out I skim it looking for where they actually do backprop.
Check the pseudo code of their algorithms.
"Update using gradient based optimizations""
I mean the only claim is no propagation; you always need a gradient of sorts to update parameters, unless you just stumble upon the desired parameters. Even genetic algorithms effectively have gradients, which are obfuscated through random projections.
No you don't. See Hebbian learning (neurons that fire together wire together). Bonus: it is one of the biologically plausible options.
Maybe you have a way of seeing it differently so that this looks like a gradient? Gradient keys my brain into a desired outcome expressed as an expectation function.
Nope, that rank-one update is exactly the projected gradient of the reconstruction loss. That's just not the way it is usually taught. So Hebbian learning was an unfortunate example.
Gradient descent is only one way of searching for minima, so in that sense it is not necessary; for example, one can sometimes analytically solve for the extrema of the loss. As an alternative one could do Monte Carlo search instead of gradient descent. For a convex loss that would be less efficient, of course.
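A minimal example of the analytic case: for linear least squares the minimum is available in closed form via the normal equations, with no descent at all. The data and true coefficients below are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(100, 3))
b = A @ np.array([1.0, -2.0, 0.5]) + 0.01 * rng.normal(size=100)

# Convex quadratic loss ||A w - b||^2: instead of descending its
# gradient, solve the normal equations A^T A w = A^T b directly.
w = np.linalg.solve(A.T @ A, A.T @ b)
```

One solve recovers (up to the small noise) the same minimizer that gradient descent would only approach iteratively.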
> See Hebbian learning
The one that is not used, because it's inherently unstable?
Learning using locally accessible information is an interesting approach, but it needs to be more complex than "fire together, wire together". And then you might have propagation of information that allows to approximate gradients locally.
Is that what they're teaching now? Originally it was not used because it was believed it couldn't learn XOR (it can [just not as perceptrons were defined]).
Is there anyone in particular whose work focuses on this that you know of?
Oja's rule dates back to 1982?
It’s Hebbian and solves all stability problems.
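A quick sketch of Oja's rule, a Hebbian term plus a decay term that keeps the weight vector bounded. The 2-D covariance below is an arbitrary toy choice whose top principal component lies along [1, 1]:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: most variance lies along the direction [1, 1] (illustrative).
C = np.array([[3.0, 2.0], [2.0, 3.0]])
X = rng.multivariate_normal([0.0, 0.0], C, size=2000)

w = rng.normal(size=2)
eta = 0.01
for x in X:
    y = w @ x
    w += eta * y * (x - y * w)   # Oja's rule: Hebbian term minus decay

# w settles near a unit vector along the first principal component.
```

Unlike plain Hebb, the norm of w stays near 1, and the learned direction is the leading PCA component of the data.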
If there is a weight update, there is a gradient, and a loss objective. You might not write them down explicitly.
I can't recall exactly what the Hebbian update is, but something tells me it minimises the "reconstruction loss", and effectively learns the PCA matrix.
> loss objective
There is no prediction or desired output, certainly not an explicit one. I was playing with these things in my work to try to understand how our brains cause the emergence of intelligence, rather than to solve some classification or related problem. What I managed to replicate was the learning of XOR by some nodes, and further that multidimensional XORs up to the number of inputs could be learned.
Perhaps you can say that PCAish is the implicit objective/result but I still reject that there is any conceptual notion of what a node "should" output even if iteratively applying the learning rule leads us there.
Not every vector field has a potential. So not every weight update can be written as a gradient.
True.
Even with Hebbian learning, isn't there a synapse strength? If so, then you at least need a direction (+/-) if not a specific gradient value.
Yes, there is a weight on every connection. At least when I was at it, gradients were talked about in reference to the solution space (e.g. gradient descent). The implication is that there is some notion of what is "correct" for some neuron to have output, and then we bend it to our will by updating the weight. In Hebbian learning there isn't a notion of a correct activation, just a calculation over the local environment.
In genetic algorithms, any gradient found would be implied by way of the fitness function and would not be something to inherently pursue. There are no free lunches like the one the chain rule of calculus gives you.
GP is essentially isomorphic with beam search where the population is the beam. It is a fancy search algorithm. It is not "training" anything.
True, in genetic algorithms gradients are only implied, but those implied gradients are used in the more successful evolution strategies. So while they might not look like it (because they're not used in a continuous descent), when aggregated they work very much like regular back-prop gradients, although they represent a smoother function.
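That aggregation can be made concrete with an evolution-strategies sketch: perturb the parameters with random projections, score each perturbation with the fitness function, and average the noise weighted by (shaped) fitness into a smoothed gradient estimate. The objective, population size, and step sizes below are all toy choices:

```python
import numpy as np

rng = np.random.default_rng(0)

def fitness(theta):
    """Toy fitness (illustrative): peaked at theta = 3 in every coordinate."""
    return -np.sum((theta - 3.0) ** 2)

theta = np.zeros(5)
sigma, lr, pop = 0.1, 0.02, 50

for _ in range(300):
    eps = rng.normal(size=(pop, theta.size))             # random projections
    F = np.array([fitness(theta + sigma * e) for e in eps])
    F = (F - F.mean()) / (F.std() + 1e-8)                # fitness shaping
    grad_est = (F[:, None] * eps).mean(axis=0) / sigma   # implied gradient
    theta += lr * grad_est                               # ascend it
```

No derivative of `fitness` is ever computed, yet the averaged update behaves like a (smoothed) gradient ascent step.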
GP glancing at the pseudo-code is certainly an efficient way to dismiss an article, but something tells me he missed the crucial sentence in the abstract:
>"We believe this work takes a first step TOWARDS introducing a new family of GRADIENT-FREE learning methods"
I.e. for the time being, the authors can't convince themselves not to take advantage of efficient hardware for taking gradients.
(*Checks that Oxford University is not under sanctions*)
Check out feedback alignment. You provide feedback with a random static linear transformation of the loss to earlier layers, and they eventually align with the feedback matrix to enable learning.
It's certifiably insane that it works at all. And not even vaguely backprop, though if you really wanted to stretch the definition I guess you could say that the feedforward layers align to take advantage of a synthetic gradient in a way that approximates backprop.
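A minimal sketch of the idea, under toy assumptions (shapes, learning rate, and task are all made up): the backward pass for the hidden layer uses a fixed random matrix B instead of the transpose of the forward weights.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression task; every shape and constant here is illustrative.
X = rng.normal(size=(256, 10))
T = np.tanh(X @ rng.normal(size=(10, 3)))    # targets from a random teacher

W1 = rng.normal(size=(10, 20)) * 0.1
W2 = rng.normal(size=(20, 3)) * 0.1
B = rng.normal(size=(3, 20)) * 0.1           # fixed random feedback matrix

def loss():
    return float(np.mean((np.tanh(X @ W1) @ W2 - T) ** 2))

loss_before = loss()
lr = 0.1
for _ in range(500):
    H = np.tanh(X @ W1)
    e = H @ W2 - T                           # output error
    # Backprop would send e @ W2.T to the hidden layer;
    # feedback alignment sends e @ B, with B never updated.
    dH = (e @ B) * (1.0 - H ** 2)
    W2 -= lr * H.T @ e / len(X)
    W1 -= lr * X.T @ dH / len(X)
loss_after = loss()
```

The striking part is that the forward weights W2 tend to rotate toward B's transpose during training, so the random feedback ends up carrying a usable teaching signal.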
Same.
If I had to guess it's just local gradients, not an end-to-end gradient.
"Years of work in the genetic algorithms community came to the conclusion that if you can compute a gradient, then you should use it in one way or another.
If you go for toy experiments you can brute-force the optimization. Is it efficient? Hell no."
Posted on 31st March which would've been 1st April somewhere else in the world?