With all the talk about AGI, I thought it would be interesting to bring up the fact that we need to limit how neural networks generalize in order for them to perform as well as they do. I don’t know if this is something everyone already knows, but the multilayer perceptron (MLP), a direct descendant of Rosenblatt’s 1958 perceptron, is theoretically capable of doing almost anything today’s top architectures can. This is because of the universal approximation theorem, which says that an MLP with even a single hidden layer, given enough units, can approximate any continuous function on a compact domain to arbitrary accuracy. The catch is that the theorem says nothing about how hard it is to actually find that approximation through training. In practice, plain MLPs are still limited to smaller problems, because throwing them at tasks like NLP and computer vision would demand huge amounts of parameters, data, and computational power.
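To be a bit more precise, here’s a rough statement of the classic single-hidden-layer version of the theorem (Cybenko’s 1989 result, with related versions by Hornik and others): for a suitable nonconstant activation $\sigma$, any continuous $f$ on a compact set $K \subset \mathbb{R}^n$ can be matched to within any $\varepsilon > 0$ by a finite sum of the form

$$
F(x) = \sum_{i=1}^{N} \alpha_i \, \sigma\!\left(w_i^{\top} x + b_i\right),
\qquad
\sup_{x \in K} \bigl| F(x) - f(x) \bigr| < \varepsilon .
$$

Note that nothing here bounds $N$ or tells you how to find the weights; it’s a pure existence statement.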
The deep learning models we use today, like convolutional neural networks for images, recurrent or transformer-based models for sequences, and graph neural networks for graph-structured data, are all shaped by these computational limits. To make training tractable, we deliberately build structural priors (inductive biases) into the models to reduce complexity, share parameters, and keep training manageable. Basically, we’ve put a lot of effort into figuring out what structure each task needs and, instead of letting the network discover it on its own, we hard-wire it in. From a training perspective, we haven’t really made neural networks better at generalizing; we’ve just shrunk the space of functions they can learn so that what’s left works well on specific tasks.
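To make “share parameters” concrete, here’s a small sketch (PyTorch is my assumption here, not anything from a particular codebase) comparing a convolutional layer against the fully connected layer you’d need to produce the same outputs from the same image:

```python
# Sketch (assumes PyTorch): compare a convolutional layer, which reuses its weights
# at every spatial position, against a dense layer producing the same number of outputs.
import torch.nn as nn

def n_params(module: nn.Module) -> int:
    return sum(p.numel() for p in module.parameters())

# 3x224x224 image in, 64 feature maps of the same spatial size out.
conv = nn.Conv2d(in_channels=3, out_channels=64, kernel_size=3, padding=1)
print(f"conv params:  {n_params(conv):,}")   # 1,792

# The fully connected equivalent needs one weight per (input pixel, output unit) pair.
in_features = 3 * 224 * 224      # 150,528
out_features = 64 * 224 * 224    # 3,211,264
dense_params = in_features * out_features + out_features
print(f"dense params: {dense_params:,}")     # ~4.8e11, far too big to even instantiate
```

The convolution gets away with a couple of thousand parameters because the same 3×3 filters are reused across the whole image; the dense layer would need a separate weight for every input–output pair, which is exactly the kind of blow-up that makes unstructured networks so expensive.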
If we could train large, fully connected networks (MLPs) at scale without running into these compute and optimization problems, we wouldn’t need carefully designed architectures anymore. We could just train an overparameterized, fully connected network that, given enough data, finds its own internal representations. No convolutions, no attention mechanisms, no hand-crafted structure. This could lead to models that are simpler in design, with their complexity coming from their size and the data they’re trained on.
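For what it’s worth, the “structure-free” alternative is trivial to write down; the hard part is training it at scale. A minimal sketch (again assuming PyTorch, with made-up sizes) of an image model with zero built-in spatial priors:

```python
# Sketch (assumes PyTorch): an overparameterized, purely fully connected model.
# No convolutions, no attention, no weight sharing -- just flatten and stack Linear layers.
import torch.nn as nn

def plain_mlp(in_dim: int, out_dim: int, width: int = 4096, depth: int = 4) -> nn.Sequential:
    layers = [nn.Flatten()]
    dims = [in_dim] + [width] * depth
    for d_in, d_out in zip(dims[:-1], dims[1:]):
        layers += [nn.Linear(d_in, d_out), nn.ReLU()]
    layers.append(nn.Linear(width, out_dim))
    return nn.Sequential(*layers)

# e.g. a CIFAR-sized classifier: 3x32x32 input, 10 classes, roughly 63M parameters
model = plain_mlp(in_dim=3 * 32 * 32, out_dim=10)
```

Everything interesting here would have to come from the width, depth, and data rather than from the wiring, and whether that can ever match a convolutional or transformer model is exactly the open question.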
In other words, we might be able to move forward by “moving backwards”: improving the efficiency and scalability of MLPs until today’s limits no longer apply. That could reduce the need for specialized architectures and all the careful engineering it currently takes to get neural networks to do what they do. But here’s the real question: is it even possible to train fully connected networks at that scale, or is this like the theoretical warp drive in physics, where even if the exotic matter needed to power it exists, the energy required is far beyond anything we can currently handle?
And since we still have to customize architectures for specific domains or tasks, will there always be gaps in how we can use neural networks unless we overcome this computational barrier with MLPs? The issue is that we’re designing networks for specific tasks, but we don’t know what kind of surprising properties could come from structures we haven’t even thought of yet (and if that doesn’t make sense, it’s probably because I took an edible about 90 minutes ago, and it’s really kicking in).