Building The Auditory Cortex

I'm no longer sure how many iterations of model design and redesign I have been through for the auditory cortex. But as I learn more and build a stronger understanding of how the brain interprets audio signals, and how these are stored and recalled as phonemes, syllables, words and grammar structures, my model designs keep improving and getting closer to human auditory processing and recall.

Right now I am in the process of training and fine-tuning my latest iteration of the early-stage AI "ear". This is responsible for figuring out what needs focus and what can be filtered out from the myriad of sounds in the real world.

The model is designed to disentangle speech from noise. This is not a typical voice activity detection (VAD) pipeline; rather, it is trained without labels, using a combination of contrastive loss and BYOL (bootstrap your own latent) applied to both raw waveforms and PCEN (per-channel energy normalization) Mel spectrogram embeddings, with multiple pathways for richer sonic context.
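
As a rough sketch of those two ingredients, here is what a librosa-based PCEN front-end and a BYOL-style loss could look like. The sample rate, mel band count and scaling below are illustrative assumptions, not the model's actual settings:

import librosa
import numpy as np
import torch
import torch.nn.functional as F

def pcen_mel(waveform: np.ndarray, sr: int = 16000, n_mels: int = 64) -> np.ndarray:
    # Mel power spectrogram of the raw waveform.
    mel = librosa.feature.melspectrogram(y=waveform, sr=sr, n_mels=n_mels)
    # Per-channel energy normalization: adaptively normalizes each mel
    # channel, suppressing stationary background energy far more than a
    # plain log transform. The scaling mimics integer-valued input, as
    # suggested in the librosa docs.
    return librosa.pcen(mel * (2 ** 31), sr=sr)

def byol_loss(online_pred: torch.Tensor, target_proj: torch.Tensor) -> torch.Tensor:
    # BYOL regression loss: negative cosine similarity between the online
    # network's prediction and the stop-gradient target projection of a
    # second augmented "view" of the same clip. No negative pairs needed.
    p = F.normalize(online_pred, dim=-1)
    z = F.normalize(target_proj.detach(), dim=-1)
    return 2.0 - 2.0 * (p * z).sum(dim=-1).mean()

The contrastive side works similarly, but with explicit negatives; part of BYOL's appeal here is that it avoids needing them at all.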

Put into diagram form, it looks something like this:

                      |‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾|
                      |     Raw Audio Encoder    |
                      |__________________________|
                                   |
                      |‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾|
                      |   Instinctive Head (A)   | → (fast and contextual)
                      |__________________________|
                                   |
                                   |
                 |‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾|
    |‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾|        |‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾|
    |     Shared Raw Audio     |        |     Shared PCEN / Mel    |
    |          Encoder         |        |         Encoder          |
    |__________________________|        |__________________________|
         |                 |                  |                  |
|‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾||‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾|  |‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾|  |‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾|
|  Voice Encoder  || Noise Encoder |  | Voice Encoder |  | Noise Encoder |
|_________________||_______________|  |_______________|  |_______________|
         |                 |                  |                  |
|‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾||‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾|  |‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾|  |‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾|
|    Raw Voice    ||   Raw Noise   |  |   PCEN Voice  |  |   PCEN Noise  |
|    Embedding    ||   Embedding   |  |   Embedding   |  |   Embedding   |
|_________________||_______________|  |_______________|  |_______________|
And then, from each of these embeddings (regrouped into voice and noise paths):

|‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾||‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾|  |‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾|  |‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾|
|    Raw Voice    ||   PCEN Voice  |  |   Raw Noise   |  |   PCEN Noise  |
|    Embedding    ||   Embedding   |  |   Embedding   |  |   Embedding   |
|_________________||_______________|  |_______________|  |_______________|
         |                 |                  |                  |
|‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾|  |‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾|
|       Fused Voice Embedding      |  |       Fused Noise Embedding      |
|__________________________________|  |__________________________________|
         |                 |                  |                  |
|‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾||‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾|  |‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾|  |‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾|
|     Speaker     ||  Instinctive  |  | Noise Learner |  |  Instinctive  |
| Invariance Head ||   Head (B)    |  |               |  |   Head (B)    |
|_________________||_______________|  |_______________|  |_______________|
         |                 |                  |                  |
|‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾||‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾|  |‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾|  |‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾|
|  Proto-Phoneme  ||  Instinctive  |  | Noise Learner |  |  Instinctive  |
|    Separator    ||   Head (B)    |  |               |  |   Head (B)    |
|_________________||_______________|  |_______________|  |_______________|
         |                 |                  |                  |
|‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾||‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾|  |‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾|  |‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾|
|  Proto-Syllable ||   Attention   |  |     Noise     |  |   Attention   |
|     Isolator    ||     Mask      |  |   Classifier  |  |     Mask      |
|_________________||_______________|  |_______________|  |_______________|
         |                 |
|‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾||‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾|
|  Proto-Syllable ||      Word     |
|     Learner     ||     Recall    |
|_________________||_______________|
         |                 |
|‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾||‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾|
|   Proto-Word    ||   Contextual  |
|     Learner     || Understanding |
|_________________||_______________|
         |                 |
|‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾||‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾|
|  Proto-Grammar  ||   Downstream  |
|     Learner     ||   Reasoning   |
|_________________||_______________|

The next phase of the model is what I am calling an "Instinctive Head", whose main responsibility is to determine where attention is needed. Think of it like a primal threat sensor: it detects whether something may need immediate attention and prioritizes it.
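
To make that concrete, here is a hypothetical sketch of what such a head could be. The layer choices and feature dimensions are placeholders, not the actual architecture:

import torch
import torch.nn as nn

class InstinctiveHead(nn.Module):
    # Illustrative: maps per-frame encoder features to a salience score
    # in [0, 1], i.e. "how urgently does this frame need attention?".
    def __init__(self, feat_dim: int = 256):
        super().__init__()
        # A couple of small temporal convolutions keep it cheap: a fast
        # reflex path that runs ahead of deeper contextual processing.
        self.scorer = nn.Sequential(
            nn.Conv1d(feat_dim, 64, kernel_size=5, padding=2),
            nn.GELU(),
            nn.Conv1d(64, 1, kernel_size=1),
        )

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, time, feat_dim) -> salience: (batch, time)
        x = frames.transpose(1, 2)  # (batch, feat_dim, time)
        return torch.sigmoid(self.scorer(x)).squeeze(1)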
 

Everything below the fused voice and fused noise embeddings is still conceptual, and there's a lot more going on in those stages than I can capture in an ASCII diagram (note to self: update the site to support adding images, because drawing out these ASCII diagrams takes AGES!!!)
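
The fusion step itself is simple enough to sketch, though. Assuming each pathway emits a per-stream embedding, a concatenate-and-project module is one plausible shape for it (the dimensions and the learned projection here are illustrative assumptions):

import torch
import torch.nn as nn

class FusedEmbedding(nn.Module):
    # Illustrative fusion of the raw-waveform and PCEN/Mel pathway
    # embeddings into a single vector per stream (voice or noise).
    def __init__(self, raw_dim: int = 256, pcen_dim: int = 256, out_dim: int = 256):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(raw_dim + pcen_dim, out_dim),
            nn.LayerNorm(out_dim),
            nn.GELU(),
        )

    def forward(self, raw_emb: torch.Tensor, pcen_emb: torch.Tensor) -> torch.Tensor:
        # Concatenate the two pathway views, then learn a joint projection.
        return self.proj(torch.cat([raw_emb, pcen_emb], dim=-1))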
 

What's next? Once I am happy with the AI "ear" (everything up to the fused embeddings), I will start training Instinctive Head (B), which provides downstream modules with time-aware areas of attention in both the voice and noise paths. On the voice side, this will help the auditory cortex learn speech sentiment such as anger, happiness, indifference and deceit, along with other paralinguistic features, while the noise side will learn what is a potential threat versus background noise. Good for robotics, perhaps..
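
A hypothetical sketch of what Instinctive Head (B) might look like: a soft per-frame mask over the fused embeddings that downstream modules can use to focus on the right time regions. The recurrent layer and dimensions are assumptions, not the final design:

import torch
import torch.nn as nn

class TimeAwareAttention(nn.Module):
    # Illustrative Instinctive Head (B): emits a per-frame mask in [0, 1]
    # marking which time regions deserve downstream attention.
    def __init__(self, emb_dim: int = 256, hidden: int = 128):
        super().__init__()
        # A small bidirectional GRU gives the mask temporal context
        # without the cost of full self-attention.
        self.rnn = nn.GRU(emb_dim, hidden, batch_first=True, bidirectional=True)
        self.mask = nn.Linear(2 * hidden, 1)

    def forward(self, fused: torch.Tensor):
        # fused: (batch, time, emb_dim)
        ctx, _ = self.rnn(fused)
        attn = torch.sigmoid(self.mask(ctx))   # (batch, time, 1)
        return fused * attn, attn.squeeze(-1)  # gated features + the mask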
 

After that, it will be speaker invariance. This basically ensures that we don't end up with thousands of unnecessary embeddings bloating memory. It will be designed to let the model know that "ba" or "cat" or whatever sound is "ba" or "cat" no matter the speaker, essentially storing a single representation of each proto-phoneme, proto-syllable and proto-word in a voice-independent manner. This will greatly aid the downstream proto-learners in their ability to find repeating patterns regardless of speaker-specific pitch, tone, harmonics and so on.
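
One plausible way to train such invariance, sketched below, is a symmetric contrastive loss over pairs of embeddings that share content but differ in speaker. Obtaining those pairs (e.g. via voice-conversion augmentation) is assumed here, and this is a sketch of the idea rather than the actual objective:

import torch
import torch.nn.functional as F

def speaker_invariance_loss(emb_a: torch.Tensor,
                            emb_b: torch.Tensor,
                            temperature: float = 0.1) -> torch.Tensor:
    # emb_a[i] and emb_b[i] encode the same content spoken by different
    # speakers; every other pairing in the batch acts as a negative.
    # Minimizing this pulls same-content embeddings together regardless
    # of voice, while pushing different content apart.
    a = F.normalize(emb_a, dim=-1)
    b = F.normalize(emb_b, dim=-1)
    logits = a @ b.t() / temperature  # (batch, batch) similarity matrix
    targets = torch.arange(a.size(0), device=a.device)
    # Matching content should be the most similar pair in both directions.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))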
 

The next big steps come after the auditory cortex's memory and self-play have been formed: I'll begin working on the visual cortex and the speech brain (Broca's and Wernicke's area equivalents). These are already being scaffolded, and my notebook is filling up with ideas on how they will take shape. Exciting times ahead.