Models of lexical retrieval for both perception and production have contributed significantly to our understanding of the relationship between the brain and language. They give insight into how the lexicon is structured and where words are stored. They show the most efficient (quickest) pathways leading from stimulus (auditory or visual speech input) to target (the intended word being retrieved), or from concept to spoken word. They also help explain how various linguistic levels (semantic, phonological, etc.) may interact during the retrieval process.
No two models are identical. In fact, many are developed with the goal of addressing the weaknesses of previous models. Although a number of models seem to be in competition, each gives insight into how linguistic knowledge is represented in the mind and how it is used, acquired, and lost. To date, no single model accounts for everything we know about human speech.
This tutorial is a brief overview of the structures, or architectures, of the better-known models of lexical retrieval in speech perception and production.
The following are the most frequently used terms in the literature.
Serial vs. Parallel Models
Serial models represent lexical retrieval as a process in which one unit of speech is activated at a time. For example, activation at the semantic level must be completed before phonological representations can be activated. Each linguistic process is thus distinct and independent of the others, with no interaction between levels.
In parallel models, the processing of information at different levels can occur simultaneously. This means levels can converge or interact, allowing each to receive feedback from the others. Thus a node at one level can activate nodes at the same level or on other levels.
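To make the contrast concrete, here is a minimal Python sketch, not drawn from any published model, of the two time-courses: in the serial version, phonological activation cannot begin until semantic activation is complete, while in the parallel version both levels gain activation on every time step and the phonological level feeds back to the semantic level. All activation values and update rules are invented for illustration.
# A schematic contrast between serial and parallel time-courses.
# All numbers and update rules are invented for illustration.
def serial_course(steps=4):
    sem, phon, history = 0.0, 0.0, []
    for t in range(steps):
        if sem < 1.0:
            sem = min(1.0, sem + 0.5)    # the semantic level must finish first
        else:
            phon = min(1.0, phon + 0.5)  # only then is phonology activated
        history.append((t, sem, phon))
    return history

def parallel_course(steps=4):
    sem, phon, history = 0.0, 0.0, []
    for t in range(steps):
        sem = min(1.0, sem + 0.5 + 0.1 * phon)  # receives feedback from phonology
        phon = min(1.0, phon + 0.5 * sem)       # driven by semantics on the same step
        history.append((t, sem, phon))
    return history

print(serial_course())    # phonology stays at 0.0 until semantics reaches 1.0
print(parallel_course())  # both levels rise together from the first step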
Active vs. Passive Processes
The terms active vs. passive describe the degree to which the listener is engaged in the speech perception process. Active models claim that speech perception involves linking sounds to how they are produced. Thus when a listener hears the word ‘kite’ spoken, s/he is actively analyzing the phonetic properties of /k/, /aj/, and /t/. Passive theories hold that speech perception is a sensory process and that information about how speech sounds are produced is not accessed; the listener is thus relatively passive.
Autonomous vs. Interactive Models
Autonomous models depict perception as a serial process (a series of steps) in which one level of information is recovered at a time. A listener first processes information at the phonetic stage; that output then becomes the input to the lexical stage, whose output in turn becomes the input to the semantic stage. These models are ‘closed’ in that they contain everything necessary for perception, with no need for external sources of information such as context. Autonomous theories thus posit uni-directional, feed-forward processing, with lexical influence restricted to post-perceptual decision processes.
Interactive theories hold that information and knowledge from many sources available to the listener can be involved at any or all stages of processing. These are generally bi-directional models, meaning that information can be accessed from the bottom up, e.g., from smaller linguistic units, or from the top down, e.g., from broader linguistic units.
Bottom-up vs. Top-Down Processes
These processes describe the direction in which information is accessed between levels. In bottom-up models, the essential information needed for speech recognition is found in the acoustic signal itself. Thus the first stage of perception involves sensitivity to, and identification of, the smallest elements of speech: phonetic features such as voicing, constriction of airflow, and which articulators are being used. No other information is necessary. This view rests on the idea that speech perception is carried out at the segmental level and that segments are composed of a number of acoustic/gestural features that act as cues. For instance, in perceiving the word cold, the brain recognizes that the initial sound /k/ is a voiceless velar stop and not a voiced velar stop, /g/. Thus there is no confusion as to whether the word cold or gold has been produced. In other words, listeners actually perceive bundles of features, which create segments, which are then strung into words. Bottom-up perception does an excellent job of accounting for the perception of well-defined sounds when there is no interference in the acoustic signal.
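The feature-to-segment mapping can be illustrated with a toy Python sketch. This is a deliberate simplification, not a model of actual perception, and the feature inventory here is invented.
# A toy sketch of bottom-up perception: each bundle of acoustic/gestural
# features is matched to a segment, and segments are strung into a word.
# The feature inventory is drastically simplified for illustration.
SEGMENTS = {
    ("voiceless", "velar", "stop"): "k",
    ("voiced", "velar", "stop"): "g",      # voicing alone separates /k/ and /g/
    ("voiced", "mid-back", "vowel"): "o",
    ("voiced", "alveolar", "lateral"): "l",
    ("voiced", "alveolar", "stop"): "d",
}

def perceive(feature_bundles):
    """Map each feature bundle to its segment and string segments into a word."""
    return "".join(SEGMENTS[tuple(bundle)] for bundle in feature_bundles)

# The initial bundle is voiceless, so the word is perceived as cold, not gold.
print(perceive([
    ("voiceless", "velar", "stop"),
    ("voiced", "mid-back", "vowel"),
    ("voiced", "alveolar", "lateral"),
    ("voiced", "alveolar", "stop"),
]))  # -> 'kold'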
But what happens when the acoustic signal is not clear? Let’s say your neighbor knocks on your door. As she hands you a tray of homemade brownies, you hear her say, “Sarah’s bob baked these for you.” Every sound in this sentence seems acoustically clear, yet the word bob makes no sense. This is where top-down processing kicks in. You must access a higher level of linguistic knowledge in order to interpret the utterance. Thus your brain begins a memory search, locating information about your neighbor in order to provide a context. You remember that her cousin’s name is Sarah, whose mother owns a bakery. You also remember that your neighbor has a bad cold. Now you can surmise that you heard /b/ in place of /m/ in mom.
Top-down processing situates ambiguous speech segments in the context of a word, phrase, or discourse, as seen above. When a speech sound is contaminated by background noise, rapid speech, or any other interference, a listener will use context, language-specific constraints, and information about the speaker to find meaning. In other words, top-down processing involves accessing previously acquired knowledge rather than discrete units of the speech signal.
Connectionist models are computational. They are ‘trained’ to simulate how the brain processes linguistic information at the sub-lexical, lexical, and phrasal levels by means of probabilistic algorithms. These architectures represent language processing as a network in which many discrete units of speech are activated and interact. When a unit is activated, all other units to which it is connected are also activated, until an optimal ‘match’ is reached between the actual input (a speech signal or word) and the targeted output (the same word stored in the lexicon). Simply stated, these models simulate how activation from an auditory input spreads to all connected neurons. The strength of the connections between neurons (each given a numeric value) determines the pathway of the interactions: strong connections facilitate activation of neurons at the next level, while weak connections are deactivated.
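A minimal Python sketch of this spreading-activation mechanism follows. The units, connection weights, and threshold are hypothetical values chosen only to show how strong connections pass activation forward while weak ones are cut off.
# A minimal sketch of spreading activation. Units, connection weights,
# and the threshold are hypothetical, chosen only to show the mechanism.
connections = {
    "/k/": {"cold": 0.7, "gold": 0.1, "cat": 0.4},
    "/o/": {"cold": 0.6, "gold": 0.6},
    "/d/": {"cold": 0.5, "gold": 0.5},
}

THRESHOLD = 0.3  # connections weaker than this are deactivated

def spread(input_units):
    """Propagate activation from sub-lexical input units to word units."""
    word_activation = {}
    for unit in input_units:
        for word, weight in connections.get(unit, {}).items():
            if weight >= THRESHOLD:  # strong connections facilitate activation
                word_activation[word] = word_activation.get(word, 0.0) + weight
    return word_activation

# Hearing /k/, /o/, /d/ activates 'cold' most strongly; the weak
# /k/-'gold' connection never contributes.
print(spread(["/k/", "/o/", "/d/"]))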
Generally speaking, connectionist models are designed as either a localist or a distributed network. Localist models consist of interconnected categories, e.g., phonemes, syllables, and words, each represented by a single node that can interact at the sub-lexical level. In this architecture there is a one-to-one correspondence between a word and a node. Such models propose that word units are connected via lateral inhibitory links, enabling a unit to suppress, or inhibit, the activation of its competitors (see McQueen, Norris, & Cutler, 1994). The more similar a unit is to the target word, the more activation it receives, and the more strongly it inhibits its competitors.
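A schematic sketch of lateral inhibition follows. The activation values and inhibition rate are invented; this shows the general mechanism, not the specific implementation of McQueen, Norris, & Cutler (1994).
# A schematic sketch of lateral inhibition between word units in a
# localist network. Activation values and the inhibition rate are
# hypothetical; only the mechanism matters here.
activations = {"cold": 0.9, "gold": 0.6, "goal": 0.3}  # initial bottom-up support
INHIBITION = 0.2  # how strongly each unit suppresses its competitors

def inhibit(acts, steps=5):
    """Each unit suppresses its competitors in proportion to its own activation."""
    for _ in range(steps):
        new_acts = {}
        for word, act in acts.items():
            competition = sum(a for w, a in acts.items() if w != word)
            new_acts[word] = max(0.0, act - INHIBITION * competition)
        acts = new_acts
    return acts

# The most strongly activated unit ('cold') suppresses its competitors
# more than they suppress it, so it gradually wins the competition.
print(inhibit(activations))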
Distributed models represent a many-to-one correspondence between words and nodes. Features of speech signals are mapped onto simple units with no intermediate (sub-lexical) units. Lexical selection happens as activation is distributed across all words that are represented on a single node.
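The distributed idea can be illustrated with a toy sketch in which each word is a pattern of activation over shared feature units rather than a single dedicated node, and a (possibly degraded) input pattern is mapped to the best-matching stored pattern. The feature vectors here are invented.
# A toy sketch of distributed representation: each word is a pattern of
# activation over shared feature units, not one dedicated node. The
# feature vectors are invented for illustration.
lexicon = {
    "cold": [1, 0, 1, 1, 0],
    "gold": [0, 1, 1, 1, 0],
    "goat": [0, 1, 1, 0, 1],
}

def recognize(input_pattern):
    """Select the word whose stored pattern best matches the input."""
    def overlap(word):
        return sum(a * b for a, b in zip(input_pattern, lexicon[word]))
    return max(lexicon, key=overlap)

# Even a degraded input pattern maps to the closest stored word.
print(recognize([1, 0, 1, 0, 0]))  # -> 'cold'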
Figure 1. Connectionist structures. Left: uni-directional activation; right: bi-directional activation.
Regardless of design, all models are organized as layers of input units, hidden units (in certain models), and output units. The input units serve as the initial point of access for the stimuli and are activated according to the particular task of the model. Once the desired information has been accessed at one level, activation then spreads to the next. When hidden units are present in a model, non-linear activation occurs across the layers of input and output units. Output units are activated when a match between the desired output and the target form has been reached.
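To ground the layered description, here is a minimal numeric sketch with an input layer, one hidden layer applying a non-linear (sigmoid) activation, and an output layer. The weights are random placeholders standing in for trained connection strengths.
# A minimal sketch of the layered organization described above: input
# units feed hidden units, which apply a non-linear activation before
# feeding output units. Weights are random placeholders, not trained values.
import math
import random

random.seed(0)

N_INPUT, N_HIDDEN, N_OUTPUT = 4, 3, 2

w_ih = [[random.uniform(-1, 1) for _ in range(N_HIDDEN)] for _ in range(N_INPUT)]
w_ho = [[random.uniform(-1, 1) for _ in range(N_OUTPUT)] for _ in range(N_HIDDEN)]

def sigmoid(x):
    """Non-linear activation applied at the hidden and output layers."""
    return 1.0 / (1.0 + math.exp(-x))

def forward(input_units):
    """Spread activation from input units through hidden units to output units."""
    hidden = [sigmoid(sum(input_units[i] * w_ih[i][h] for i in range(N_INPUT)))
              for h in range(N_HIDDEN)]
    output = [sigmoid(sum(hidden[h] * w_ho[h][o] for h in range(N_HIDDEN)))
              for o in range(N_OUTPUT)]
    return output

# The output unit with the highest activation is the best 'match'
# between the input stimulus and a stored target form.
print(forward([1.0, 0.0, 0.5, 1.0]))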