The Path to AGI - Pt 2

I started shaking as my tests of my new synthetic audio cortex model showed early signs of achieving self-learning. I expected to need a lot more tweaking, and perhaps weeks, if not months, of trial and error to find the right way to build the architecture, so you can imagine my surprise, excitement and simultaneous "nah, this can't be right so early on" apprehension when the results showed the model challenging itself and improving its understanding.

Note that the model was not trained with ANY labels.

THE MODEL - Head 1 - Contrastive Segregation of Audio

I trained the first head of the model on 1 million two-second audio triplets (one speech, one noise and a mix). The idea here was to ensure that the model could adequately separate speech from other noises. Speech/communication is the primary function of the final model, but environmental audio and audible context are also highly important. I used a combination of NT-Xent and BYOL for this leg of the training. I'm not sure this is a common approach, but the results I got suggest perhaps it should be. NT-Xent, with the triplets, helped to anchor embeddings of the same class (voice, noise, mix) closer together while pushing the other classes apart. This, I theorised, would help the model's downstream heads (things like noise classification, VAD, etc.). BYOL ensures that the model learns high-level abstractions of audio features, filling the gaps where contrastive negatives are scarce or noisy.
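
For anyone curious what that combination could look like, here's a minimal sketch (my own rough reconstruction, not the actual training code): NT-Xent pulls a clip and its same-class view together against in-batch negatives, while a BYOL-style regression term smooths the representation. The function names, shapes and loss weighting are assumptions on my part.

```python
# Sketch of a combined NT-Xent + BYOL objective (assumed wiring, not the real code).
import torch
import torch.nn.functional as F

def nt_xent(z_a, z_b, temperature=0.1):
    """NT-Xent over a batch: z_a[i] and z_b[i] are embeddings of views of the same
    clip (e.g. clean speech and its noisy mix); every other pair acts as a negative."""
    z_a, z_b = F.normalize(z_a, dim=1), F.normalize(z_b, dim=1)
    logits = z_a @ z_b.t() / temperature                  # pairwise cosine similarities
    targets = torch.arange(z_a.size(0), device=z_a.device)
    return F.cross_entropy(logits, targets)

def byol_loss(online_pred, target_proj):
    """BYOL regression term: the online predictor output should match the
    stop-gradient projection from the slowly-updated target network."""
    p = F.normalize(online_pred, dim=1)
    t = F.normalize(target_proj.detach(), dim=1)
    return (2 - 2 * (p * t).sum(dim=1)).mean()

def combined_loss(z_anchor, z_positive, online_pred, target_proj, byol_weight=1.0):
    """Anchor same-class views with NT-Xent, smooth the space with BYOL."""
    return nt_xent(z_anchor, z_positive) + byol_weight * byol_loss(online_pred, target_proj)
```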

So how did the model fare when tested?
I ran 300,000 audio clips from my triplet dataset through the model to see how well it would segregate noise and speech, and plotted the embeddings on a graph. The results were significantly better than I had expected at this point of training. I really expected a lot more overlap, but what I saw in the plot was clear, clean clusters of noise on the outer limits, with the voice and voice+noise mixes overlapping in a cluster of their own (which is what we wanted to see). The model can distinguish what is noise and what is communication, even when noise is present.
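
For reference, a plot like that can be produced along these lines (a simplified sketch: the encoder is whatever trained head you have, and t-SNE is just one choice of 2D projection; the class labels are used only for colouring, not for training):

```python
# Sketch: embed each clip, project to 2D, and colour by class (speech / noise / mix).
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_embedding_clusters(encoder, clips, labels):
    """encoder: trained contrastive head returning a 1-D numpy embedding per clip.
    clips: iterable of waveforms; labels: 0 = noise, 1 = speech, 2 = mix (plotting only)."""
    embeddings = np.stack([encoder(clip) for clip in clips])
    coords = TSNE(n_components=2).fit_transform(embeddings)
    labels = np.array(labels)
    for cls, name in [(0, "noise"), (1, "speech"), (2, "speech+noise mix")]:
        mask = labels == cls
        plt.scatter(coords[mask, 0], coords[mask, 1], s=2, label=name)
    plt.legend()
    plt.show()
```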

This was great. I then tested how well it would generate "tokens" for a proto-vocabulary. This is where the model showed some promising signs, but it wasn't great at finding boundaries. Still, it was about where I expected it to be (actually, I didn't expect it to be this good). The next step was to train the model on longer audio clips, enhancing its ability to use richer temporal context. The thought process here was that with longer speech and audio clips the model would pick up stronger semantic content, such as full words, background transitions, pauses and speaker changes. All of this allows it to distinguish fine-grained acoustic events and have better embedding cohesion. But most importantly, real-world audio is messy, and this round of training further enhances the model's ability to untangle overlapping sources.
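
One simple way to propose those token boundaries (a sketch of the general idea, not necessarily the exact method I used) is to cut wherever neighbouring frame embeddings stop looking alike, then average each segment into a single proto-token:

```python
# Sketch: boundary proposal from frame-embedding similarity (assumed approach).
import torch
import torch.nn.functional as F

def propose_boundaries(frame_embeddings, threshold=0.6):
    """frame_embeddings: (T, D) tensor of per-frame embeddings from the audio head.
    Returns indices where adjacent frames are dissimilar enough to mark a boundary."""
    z = F.normalize(frame_embeddings, dim=1)
    sim = (z[:-1] * z[1:]).sum(dim=1)                    # cosine similarity of neighbours
    return (sim < threshold).nonzero(as_tuple=True)[0] + 1

def segments_to_tokens(frame_embeddings, boundaries):
    """Average each segment between boundaries into one 'proto-token' embedding."""
    starts = [0] + boundaries.tolist()
    ends = boundaries.tolist() + [frame_embeddings.size(0)]
    return torch.stack([frame_embeddings[s:e].mean(dim=0) for s, e in zip(starts, ends)])
```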

THE MODEL - Head 2 - Symbolic Self Play
The model relies on external data, i.e. embeddings, learned "grammar" rules and learned "vocab" tokens. Symbolic Self Play is a self-supervised curriculum generation engine designed to simulate structured symbolic interaction scenarios, automatically generating difficult training samples that encourage the model to discover and refine abstract distinctions (like speech with background noise vs noise with faint speech). The curriculum progressively gets more challenging to encourage concept emergence in the latent space. It also helps the model recognise contradictions, dissonance and overlaps better. Samples that confuse the model are reused with slight modifications, making the self-play adversarial and evolving.
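
In rough Python, the loop looks something like this (the mixing, confusion check and training calls below are placeholders I've made up to show the shape of the curriculum, not the real engine):

```python
# Sketch of the self-play curriculum loop: mix speech and noise at a difficulty-
# controlled SNR, keep the samples the model gets wrong, and re-inject slightly
# perturbed versions of them in later rounds (the adversarial, evolving part).
import random
import torch

def mix_at_snr(speech, noise, snr_db):
    """Scale the noise so the mixture hits the requested signal-to-noise ratio."""
    speech_power = speech.pow(2).mean()
    noise_power = noise.pow(2).mean().clamp(min=1e-8)
    scale = torch.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * noise

def self_play_round(model, speech_bank, noise_bank, hard_pool, snr_db):
    """One curriculum round; lower snr_db over rounds to make mixtures harder."""
    new_hard = []
    for speech in speech_bank:
        noise = random.choice(noise_bank)
        mix = mix_at_snr(speech, noise, snr_db)
        if not model.distinguishes(speech, mix):      # placeholder confusion check
            new_hard.append((speech, noise))
    # Replay previously confusing pairs with a small SNR jitter (adversarial reuse).
    for speech, noise in hard_pool:
        jittered = mix_at_snr(speech, noise, snr_db + random.uniform(-2.0, 2.0))
        model.train_step(speech, jittered)            # placeholder training call
    hard_pool.extend(new_hard)
    return hard_pool
```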
Why tho? If the Contrastive Segregation head is already strong, what's the point of this? Well, let's say you just started working in a noisy cafe. You struggle to hear the barista who is patiently trying to explain how things work over all the noise, you keep making mistakes, and you go home wondering if tomorrow you will get fired. The next day comes, it's equally noisy, but for some reason you seem to be able to hear the barista's instructions better than yesterday. What's changed? While you slept your brain processed the day, replaying things over and over, working out how to filter out the noise and pay attention where it is needed. This is the same concept. Environments change, speech and what requires attention change, accents are different, languages are different. This part of the model ensures that it can quickly adjust to various environments and improve its ability to function at a high level.

It may have its drawbacks tho: as the model learns to deal with a certain environment, it may become less accurate in environments it was previously good at, BUT once back in that previous environment it will again re-adjust.

So .... What can the model do? Can it talk? Can it read? Can it understand when spoken to?

No, none of these things, and it's not supposed to. These heads are to some degree parallels to parts of the human brain that have specific functions, particularly how we learn through experience, abstraction and internal simulation (think daydreaming). The Contrastive Head simulates the auditory cortex, extracting low-level features from auditory input such as tone, rhythm, timbre and phonemes, and splitting these up into things like background noise, who's speaking, what's being heard and what noise is present. The Self-Play head is more analogous to the prefrontal cortex, handling executive functions (in this case, just for audio), challenging its own current understanding, and guiding its own learning to resolve ambiguity and contradictions. There are possibly closer parallels to the default mode network, where our minds are active with imagination, mind-wandering and abstract concept formation; the anterior cingulate cortex is another area I would loosely associate the symbolic self-play with.

So what's next?
The model is now going to be my universal audio pretraining backbone, to be used for things like ASR, VAD, speaker identification, emotion recognition, sound event detection and audio scene understanding. At the same time, I am going to write a very basic self-learning model that will sit and run for days on an old computer, with the idea that over time it should understand words, learn grammar rules on its own and, with a lightweight reasoning model, be able to communicate to seek clarification. Just for shits and giggles.
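
The rough shape of that backbone-plus-heads setup, purely as an illustration (the class names, embedding dimension and head list here are placeholders, not the real architecture):

```python
# Sketch: frozen pretrained backbone shared by small task-specific heads.
import torch.nn as nn

class DownstreamHead(nn.Module):
    """Small task-specific classifier on top of shared backbone embeddings."""
    def __init__(self, embed_dim, num_classes):
        super().__init__()
        self.classifier = nn.Linear(embed_dim, num_classes)

    def forward(self, embedding):
        return self.classifier(embedding)

class AudioSystem(nn.Module):
    """One shared backbone, many heads; only the heads train per task."""
    def __init__(self, backbone, embed_dim=256):
        super().__init__()
        self.backbone = backbone
        for p in self.backbone.parameters():
            p.requires_grad = False                  # keep pretrained weights fixed
        self.heads = nn.ModuleDict({
            "vad": DownstreamHead(embed_dim, 2),
            "speaker_id": DownstreamHead(embed_dim, 100),
            "sound_event": DownstreamHead(embed_dim, 50),
        })

    def forward(self, audio, task):
        return self.heads[task](self.backbone(audio))
```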