The Path to AGI

Working on my P.A.S.T.A model with quite a satisfying degree of success, I felt something was off. The time I've spent developing training data, writing code and training the models has been diverting me from what I originally wanted to achieve: to create an AI that is driven by curiosity.

The models I am developing all fit the same old tired approach to AI - train the model on copious amounts of data and labels, and when it is sufficiently accurate at predicting, lock it down and that's it. The model is ready for production, woefully limited to the “knowledge” and accuracy it had at the point in time it was finalised.

My reasoning for following this approach was sound: I wanted to use these fixed models to give my AI the means to “sense”, and then learn and interpret from those “senses” downstream. But I realised pretty quickly that if the senses are locked down and limited in the way they are, everything downstream becomes so convoluted and complicated that hallucinations would creep in and the general viability of such an approach would soon prove impossible.

While I will still work on the P.A.S.T.A models, having already completed the VAD (voice activity detection) model with noise categorisation, ready for real-time inference, I've decided to get back to the root of my objective: to create real, curiosity-driven AI.

Inspired recently by a chapter in the book Genome by Matt Ridley, specifically "Chromosome 7 - Instinct", where he discusses the instinctive “gene” for grammar and language, the idea resurfaced. Rather than creating a model that has learned grammar through training on thousands of hours of audio and text, why not create a model that instinctively looks for patterns in audio, is language agnostic, and does not fix grammatical rules within the model itself, but rather has pure instinct driving its ability to learn, with its learning engine housed outside in an evolving structure? Perhaps this is a more realistic and achievable start to the beginnings of my curious AI system…


In essence, the model I am now thinking of creating would be the auditory cortex of my AI system, responsible for learning language - not trained on English, French or Japanese texts or audio clips, but completely label/token free in its training.


Here's how I conceive of it coming together.

Raw Audio Instincts - Only built-in “senses” allowed, so no labels, no text, no pretrained embeddings, nada!


Each of these senses, I expect, can be achieved with some very shallow CNNs.
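To make that concrete, here's a minimal sketch, assuming PyTorch, of what one of these shallow, label-free “senses” might look like - a tiny 1D CNN that turns raw waveform into frame-level embeddings. The layer sizes and names are placeholders of my own, not a fixed design.

```python
# A minimal sketch of one raw-audio "sense": a very shallow 1D CNN that
# maps raw waveform to frame-level embeddings. No labels, no text, no
# pretrained weights - the sizes here are illustrative assumptions.
import torch
import torch.nn as nn

class RawAudioSense(nn.Module):
    def __init__(self, embed_dim: int = 32):
        super().__init__()
        self.net = nn.Sequential(
            # A wide first kernel acts like a learned filterbank:
            # ~25 ms windows with a ~10 ms hop at 16 kHz.
            nn.Conv1d(1, 16, kernel_size=400, stride=160),
            nn.ReLU(),
            nn.Conv1d(16, embed_dim, kernel_size=3, padding=1),
            nn.ReLU(),
        )

    def forward(self, waveform: torch.Tensor) -> torch.Tensor:
        # waveform: (batch, samples) -> embeddings: (batch, frames, embed_dim)
        return self.net(waveform.unsqueeze(1)).transpose(1, 2)

# One second of stand-in 16 kHz audio becomes a sequence of frame embeddings.
frames = RawAudioSense()(torch.randn(1, 16000))
print(frames.shape)  # torch.Size([1, 98, 32])
```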


Vocabulary Discovery

With each of these elements in place I can then work on proto-vocabulary discovery. I say proto because the model isn't going to know what a word is, or what a word means. In fact, thinking about it as I write now, I don't think this needs to be limited to language; it can cover general audio interpretation, which includes language but also environmental sounds. Effectively the model creates tokens from audio without knowing what they are. It's just going to cluster these tokens under arbitrary names, then track their temporal co-occurrence and segment boundary confidence outside of the model. This gives us some form of arbitrary symbolic representation of acoustic units (many of which will be words, while others will be environmental sounds like cars, instruments, animals etc.).
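As a toy sketch of what this could look like, assuming scikit-learn (the token count, function names and stand-in data are my own placeholders): cluster frame embeddings into arbitrary token IDs, then count co-occurrence outside the model.

```python
# A toy sketch of proto-vocabulary discovery, assuming scikit-learn.
# Frame embeddings are clustered into arbitrary token IDs ("proto-tokens"),
# and token co-occurrence is tracked entirely outside the model.
import numpy as np
from collections import Counter
from sklearn.cluster import MiniBatchKMeans

def discover_proto_tokens(frame_embeddings: np.ndarray, n_tokens: int = 64):
    """Give each embedding frame an arbitrary cluster ID - a proto-token."""
    kmeans = MiniBatchKMeans(n_clusters=n_tokens, random_state=0)
    return kmeans.fit_predict(frame_embeddings)

def cooccurrence(tokens) -> Counter:
    """Count which token tends to follow which - raw material for rules."""
    return Counter(zip(tokens[:-1], tokens[1:]))

embeddings = np.random.randn(1000, 32)  # stand-in for real frame embeddings
tokens = discover_proto_tokens(embeddings)
print(cooccurrence(tokens).most_common(5))  # most frequent transitions
```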


Grammar - Structural Pattern Discovery

I'd hazard a guess that one of the main reasons we humans like music so much is that it plays to our desire to detect patterns in audio. Music is pleasing to many because it has clear patterns, rhythm and fluidity that are not taxing on our thinking brain. We can listen to music without expending copious amounts of energy and are rewarded with dopamine for effortlessly detecting a pattern. Speech, unfortunately, is more taxing on our cognition and requires a fair amount of energy to discern its patterns, especially when mixed in with all forms of interfering and overlapping noise (music included).

So, while I started this paragraph with the header “Grammar”, I realised that the model would find structural patterns in more than just grammar - in music and nature too. I've updated the header accordingly to read Structural Pattern Discovery instead of simply Grammar Discovery. This head of the model will look for tokens that tend to follow other tokens, forming clusters (outside of the model). I envision the need to create a temporal graph of sequences, from which we can apply entropy, frequency and mutual information to find “rules”. These rules could be in the form of triplets, which are then formed into higher-order patterns - recursive abstraction. Can you see what I mean by this not being limited to grammar, but rather emergent structure? I just got goosebumps…
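Here's a hedged sketch of that rule-finding idea in plain Python: score adjacent token pairs by pointwise mutual information (PMI) and frequency, then promote strong pairs into new higher-order symbols. The thresholds and the greedy merge are assumptions of mine, not a settled design.

```python
# Score adjacent pairs in a token stream by PMI and frequency; pairs that
# beat both thresholds become "rules", and the stream is rewritten with
# each rule pair merged into a single higher-order symbol.
import math
from collections import Counter

def find_rules(tokens, min_count=5, min_pmi=1.0):
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens[:-1], tokens[1:]))
    total = len(tokens)
    rules = {}
    for (a, b), n_ab in bigrams.items():
        if n_ab < min_count:
            continue
        # PMI: how much more often b follows a than chance would predict.
        pmi = math.log((n_ab / (total - 1)) /
                       ((unigrams[a] / total) * (unigrams[b] / total)))
        if pmi >= min_pmi:
            rules[(a, b)] = pmi
    return rules

def abstract(tokens, rules):
    """Rewrite the stream, merging each rule pair into one new symbol."""
    out, i = [], 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if pair in rules:
            out.append(pair)  # the pair itself is the higher-order token
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

stream = list("abcabcabxabc")  # characters standing in for proto-tokens
rules = find_rules(stream, min_count=2, min_pmi=0.5)
print(rules)                    # e.g. ('a','b') and ('b','c') score highly
print(abstract(stream, rules))  # the stream rewritten with merged symbols
```

Applying abstract repeatedly to its own output is what I mean by recursive abstraction: merged pairs become tokens that can themselves be merged into still larger patterns.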


Learning and Curiosity

Every utterance, every soundbite passed through the model is an opportunity for the model to use what it has already learned, self-improve and self-repair. The model should always be challenging what it “knows”, just as we do. It should try to predict boundaries and sequences using what it already knows, use “surprise” for inconsistencies, adjust “rule” weights, split tokens and merge similar ones. This feeds into a multi-tier system: sound unit to token, token to phrase, phrase to rule/pattern, pattern to generalisation.
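A minimal sketch of that surprise loop, in plain Python (the smoothing constant and the update rule are placeholders of my own): predict the next token from learned transition counts, measure surprise as negative log probability, and let surprising observations update the counts more strongly.

```python
# Predict the next token from transition counts; surprise = -log P(next|prev).
# Surprising transitions get a bigger weight update - the model "pays more
# attention" to what it failed to predict. Constants are assumptions.
import math
from collections import defaultdict

class CuriousListener:
    def __init__(self, smoothing: float = 1.0):
        self.counts = defaultdict(lambda: defaultdict(float))
        self.smoothing = smoothing

    def surprise(self, prev, nxt) -> float:
        """Negative log P(nxt | prev); high when the learned rules fail."""
        row = self.counts[prev]
        # Smooth as if ~100 token types exist - an arbitrary assumption.
        total = sum(row.values()) + self.smoothing * 100
        return -math.log((row[nxt] + self.smoothing) / total)

    def observe(self, prev, nxt) -> float:
        s = self.surprise(prev, nxt)
        self.counts[prev][nxt] += 1.0 + s  # surprise scales the update
        return s

listener = CuriousListener()
stream = list("abcabcabcaxc")
for prev, nxt in zip(stream[:-1], stream[1:]):
    print(f"{prev}->{nxt}  surprise={listener.observe(prev, nxt):.2f}")
# 'a'->'x' should score high: it violates the learned a->b pattern.
```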


Effectively we now have a synthetic auditory cortex that needs no supervision and has no fixed language; everything the model “knows” is discovered, and it evolves from experience. And… now that I think more about it, it could be enhanced to output audio too, although then we segue into giving this module some form of reasoning, which I think will be a separate model that feeds from and into this one. Let my mind not wander too far down the rabbit hole tonight.