This study examines how cognitive and embodied processes are integrated in interpreting. It analyzes the interaction of verbal, visual, and kinetic modalities, focusing on the role of gestures, gaze, and eye contact in meaning formation, cognitive load management, and communication effectiveness in both simultaneous and consecutive interpreting. Using a mixed-methods design, the study analyzed conference speeches, simulated dialogues, and short texts. Gaze and gestures were coded for their coherence with speech, and visual–kinetic behaviour was linked to performance metrics across groups of experienced professional interpreters and graduate student interpreters with different levels of experience.
The results reveal a dynamic interplay of bodily actions and cognitive processes. Gestures served as external memory aids: subtle pointing or rhythmic movements facilitated the structuring of information and the memorization of terminology, reducing cognitive load and freeing working memory for comprehension and retrieval. Gaze patterns reflected attentional strategies — experienced professional interpreters proactively anticipated key information, while graduate student interpreters demonstrated a reactive gaze, sometimes delaying paraphrasing.
The study presents a multimodal model of interpreting as an integrated system of cognitive and bodily processes, where gaze cues attention, gestures serve as memory aids and communicative tools, and verbal output with prosody interacts with these modalities to enhance clarity and coherence. Overall, the findings confirm that interpreting is an inherently multimodal process, and bodily behaviours serve as integrated cognitive strategies.
Traditional studies focused on linguistic precision, cognitive effort, and memory. Modern cognitive translation studies see interpreting as something the whole body performs.
In recent decades, interpreting has increasingly been viewed not merely as a linguistic activity but as a complex, multimodal form of human communication. Interpreting naturally incorporates multiple modes of expression — speech, intonation, gaze, gestures, and body movements — that operate simultaneously to create and convey meaning between the interpreter, speaker, and audience.
Integrating multimodal analysis allows scholars to explore how visual and kinetic modalities contribute to the comprehension, processing, and transmission of messages, offering a more holistic understanding of the interpreting process.
Despite growing interest in multimodal communication, empirical research on interpreters’ eye movements and gestures remains limited. Gesture studies focus on public speaking and bilingual interaction; eye-tracking is widely used in reading and translation, yet rarely applied to the interpreting process.
Systematic research is therefore needed to explore how visual attention and gesture dynamics interact with cognitive processing during interpreting.
The study draws together two research traditions — cognitive-linguistic approaches emphasizing mental representations, and interactionist approaches highlighting real-time, culture-specific coordination.
Some of this work integrates cognitive linguistics and multimodality, viewing meaning as co-constructed through multiple semiotic channels, with verbal, visual, and gestural cues interacting with perception, memory, and attention.
Other studies explore multimodal patterns in cognition and communication, providing empirical evidence that gestures, gaze, and posture systematically contribute to meaning-making.
Further work proposes a cognitive framework in which multimodal language is an integrated system: gestures, gaze, and visual symbols act as linguistic elements that organize meaning and facilitate comprehension and production.
Research on multimodal syntactic constructions in digital English shows that they are systematically structured and classifiable by their dominant semiotic components, which reveals their cognitive and pragmatic potential.
Crosslinguistic research shows that gestures and gaze are shaped by linguistic and cultural norms: multimodal strategies reflect both universal cognitive processes and language-specific regularities.
Interactionist studies emphasize the social and pragmatic functions of gestures, gaze, and bodily cues, which support turn-taking, emphasis, and coordination in a dynamic, interaction-oriented view of multimodality.
Research in interpreting often relies on Gile’s Effort Models, which describe how interpreters manage cognitive load across the efforts of listening, production, and memory. Developing this concept, Boiko (2025) examines these efforts in business translation, showing how high terminology density, rapid speech rate, and cultural differences increase cognitive stress, and identifies adaptive techniques such as anticipation, segmentation, and reformulation.
Embodied cognition theory (Macrine & Fugate; Gallagher) posits that cognition is grounded in sensorimotor experience rather than purely abstract symbols. Milošević & Risku (2024) argue interpreters act as bodily agents whose gestures, posture, and gaze actively support attention allocation, memory retrieval, and real-time reformulation — the body becomes a cognitive tool for coping with high cognitive loads.
The aim of the study is to reveal the multimodal integration of cognitive and embodied processes in both simultaneous and consecutive interpreting. The focus is on how the mental operations of perception, comprehension, attention, working memory, and meaning reformulation are dynamically coordinated with bodily actions such as gaze, gestures, and posture during meaning construction, cognitive load management, and interpreter-mediated communication.
Analyze the interpreter’s efficiency through the integration of eye-tracking and gesture analysis methods.
Examine how interpreters coordinate visual attention and bodily movements during different interpreting stages.
Identify patterns of multimodal synchronization and their cognitive-pragmatic implications for interpreting efficiency.
Propose a multimodal model of the interpreting process that accounts for both cognitive processing and bodily interactional behaviour.
Quantitative eye-tracking measures were integrated with qualitative gesture analysis, with all data streams time-aligned for cross-modal analysis.
Multimodal correlation analyses then examined the relationships between gaze, gestures, and interpreting performance (accuracy, fluency, and cognitive load), giving a comprehensive view of how visual and kinetic modalities contribute to effective interpreting.
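The alignment and correlation steps can be illustrated with a minimal sketch. It is not the study's actual pipeline: the segment boundaries, fixation records, and accuracy scores are hypothetical toy values, and the helper names `align_to_segments` and `pearson` are introduced here only for illustration. The sketch assumes each fixation is stored as an onset time with a duration and that every speech segment has a performance score.

```python
from bisect import bisect_right
from math import sqrt


def align_to_segments(segment_starts, events):
    """Group (timestamp, value) events by the speech segment they fall in.

    segment_starts must be sorted segment onset times in seconds.
    Returns one list of values per segment.
    """
    buckets = [[] for _ in segment_starts]
    for timestamp, value in events:
        idx = bisect_right(segment_starts, timestamp) - 1
        if idx >= 0:  # ignore events before the first segment starts
            buckets[idx].append(value)
    return buckets


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)


# Hypothetical toy data: segment onsets (s), fixations as
# (onset in s, duration in ms), and one accuracy score per segment.
segment_starts = [0.0, 5.0, 10.0, 15.0]
fixations = [
    (0.4, 220), (2.1, 310),
    (5.5, 450), (7.9, 520),
    (10.2, 260), (12.0, 240),
    (15.3, 600), (17.8, 640),
]
accuracy = [0.90, 0.70, 0.95, 0.55]

per_segment = align_to_segments(segment_starts, fixations)
# Every toy segment contains fixations; real data would need a guard
# for empty segments before averaging.
mean_fixation = [sum(values) / len(values) for values in per_segment]
r = pearson(mean_fixation, accuracy)
print(mean_fixation, r)
```

In this toy data, segments with longer mean fixations have lower accuracy scores, so the coefficient is negative. With real recordings the same alignment step would be applied to gesture annotations as well, so that all streams share the segment index.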
The interpreter acts as a mediator, dynamically coordinating verbal and non-verbal cues under time pressure to direct attention, ensure coherence, and enhance message clarity.
Meaning is co-constructed through several semiotic channels rather than language alone. Each modality carries a distinct cognitive function.
Multimodality is the integration of verbal, visual, gestural, and prosodic channels in communication. In interpreting, multimodal signals support comprehension, attention management, and message delivery.
Gaze is a primary indicator of attention, focus, and cognitive processing. It guides comprehension and interaction management by distributing visual attention between speaker, notes, and audience.
Iconic, deictic, metaphorical, and rhythmic gestures facilitate verbal expression, memory retrieval, and emphasis — functioning as external cognitive tools and memory aids.
Intonation, stress, and rhythm convey discourse structure, emotional tone, and pragmatic intent — interacting with gestures and gaze to enhance clarity and coherence.
Cognitive processes are grounded in sensorimotor experience. The body mediates and facilitates cognition, linking mental representations with external actions during meaning-making.
Temporal synchronization is the alignment of verbal, visual, and kinetic channels in time. It is the mechanism that optimizes the combination of cognitive and bodily resources during interpreting.
Eye-tracking and gesture analysis exposed systematic differences between experienced professionals (Group A) and graduate students (Group B) across stages, gesture types, and coordination.
| Stage | Pattern | Group A — professionals | Group B — students | Cognitive implication |
|---|---|---|---|---|
| Listening | Fixations on speaker’s face and visual aids | More targeted fixations, moderate gaze shifts, some reliance on visual aids | Long fixations, frequent gaze shifts, reliance on visual cues | Group B exerts higher cognitive effort; Group A shows emerging efficiency |
| Reformulation | Fixations on critical information for retrieval | Intermediate fixation patterns, partially focused gaze | Diffuse, chaotic gaze patterns, high cognitive load | Group B struggles with attention allocation; Group A consolidates visual strategies |
| Output | Fixations toward audience or notes | Inconsistent transitions, frequent returns to speaker or aids | More stable transitions, occasional backtracking | Group A shows ongoing effort; Group B demonstrates emerging automaticity |
| Gesture | Function | Group A | Group B | Interpretation |
|---|---|---|---|---|
| Iconic | Represent concrete concepts or actions | Used inconsistently, often to reinforce comprehension | Used strategically to emphasize key terms | Group B integrates gestures purposefully; Group A uses them compensatorily |
| Deictic | Direct attention or reference materials | Frequent; high reliance on visual cues | Less frequent; moderate use | Group A depends more on visual cues; Group B shows emerging independence |
| Metaphorical | Convey abstract ideas or relationships | Equally distributed | Equally distributed | Both groups use gestures for non-literal meaning similarly |
| Rhythmic | Mark speech rhythm, emphasize key points | Sporadically used | Consistently used; supports coherence and fluency | Group B shows more automatic integration; Group A is irregular |
| Overall | Memory support, emphasis, clarification | Less consistent; compensatory use | Greater consistency and integration with speech | Strategic vs. compensatory multimodal behavior |
| Aspect | Group A | Group B |
|---|---|---|
| Gaze–speech | Often delayed; fixations lag behind speech | Anticipates or complements speech; smoother transitions |
| Gesture timing | Gestures frequently follow verbal output | Gestures anticipate or align with speech |
| Prosody–gesture | Less consistent; rhythmic gestures not always timed | Rhythmic gestures consistently coincide with prosodic emphasis |
| Integration of cues | Slower coordination across channels | Efficient integration, supporting load management |
| Effectiveness | Lower coherence; cognitive effort visible | Higher coherence; more fluid and structured |
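The gesture-timing contrast in the table (gestures that follow verbal output versus gestures that anticipate or align with it) can be expressed as a signed onset lag. The sketch below is one possible operationalization, not the coding scheme used in the study. The function name `onset_lags` and all timestamps are hypothetical, and it assumes gesture onsets and the onsets of the associated speech units have already been annotated on a common timeline.

```python
from statistics import mean


def onset_lags(gesture_onsets, speech_onsets):
    """Signed lag in seconds from each gesture to its nearest speech onset.

    Negative values mean the gesture starts before the speech unit
    (anticipation); positive values mean it starts after (following).
    """
    return [
        min((g - s for s in speech_onsets), key=abs)
        for g in gesture_onsets
    ]


# Hypothetical annotated onsets in seconds.
speech_onsets = [1.0, 4.0, 7.5]
anticipating_gestures = [0.8, 3.9, 7.5]  # start with or just before speech
following_gestures = [1.4, 4.5, 8.1]     # start after speech

lead = onset_lags(anticipating_gestures, speech_onsets)
lag = onset_lags(following_gestures, speech_onsets)
print(round(mean(lead), 2), round(mean(lag), 2))
```

A mean lag at or below zero corresponds to the anticipating pattern in the table, and a clearly positive mean corresponds to the delayed pattern. Matching each gesture to the nearest speech onset is a simplifying assumption; an annotation that links gestures to specific lexical items would be more precise.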
More experienced interpreters showed a more strategic use of gestures, coordinating them with verbal output to support memory retrieval and reduce cognitive effort. This was reflected in shorter and more focused fixations.
Less experienced interpreters showed a higher but less systematic frequency of gestures, often used inconsistently as a compensatory mechanism under higher cognitive load. This was accompanied by longer fixations and more diffuse gaze.
Subtle pointing or rhythmic hand movements helped interpreters structure information segments and recall terminological elements when paraphrasing — freeing working memory to focus on comprehension rather than retention.
More experienced participants showed proactive gaze, anticipating important information; less experienced interpreters showed reactive gaze, attending to notes or speaker only after key material had passed — sometimes delaying paraphrasing.
Redirecting gaze toward the audience signalled active monitoring of comprehension. Brief eye contact maintained alignment and pragmatic coherence, particularly during information-dense segments.
The model conceptualizes interpreting as an integrated system of cognitive and bodily processes in which temporal synchronization links comprehension and production.
Gaze behavior reflects attentional allocation and cognitive monitoring, showing how interpreters distribute visual attention between speaker, notes, and audience.
Gestures act as external cognitive tools — functioning simultaneously as memory aids, communicative cues, and cognitive load regulators that facilitate information retention and discourse organization.
Verbal output and prosody form the primary channels of meaning transmission, where rhythm, intonation, and stress interact with gestures and gaze to enhance clarity and coherence.
The model emphasizes the temporal synchronization of these modalities and their dynamic, bidirectional contributions to both comprehension (processing the input) and production (formulating the output).
The following targeted training strategies progressively build automatic attentional control, cognitive management, and multimodal coordination.
Watch short clips and recognize key visual cues — gestures, slide highlights, facial expressions — while ignoring irrelevant movement, optimizing load management.
Interpret short speeches while intentionally incorporating iconic, deictic, and rhythmic gestures, then review recordings for accuracy, rhythm, and coherence with speech.
Combine listening, note-taking, and verbal inference under gradually increasing difficulty and controlled distractions to build cognitive resilience.
Use eye-tracking and video to explore gaze patterns, fixation durations, and gesture synchronization, comparing against expert models for targeted adjustment.
Move from short, clear speeches to terminologically rich lectures and interactive simulations with audience questions — developing adaptive, real-time integration.
Combining gaze, gesture, and speech in increasingly complex tasks reduces cognitive load, improves interactional coherence, and prepares interpreters for high-pressure environments.
Effective interpreting extends beyond linguistic competence to encompass integrated multimodal expertise. The study makes three contributions.
It reveals the dynamic interplay between bodily behaviour and cognitive processes during real-time interpreting, showing how gaze, gestures, and prosody are coordinated to manage attention, regulate cognitive load, and maintain accuracy under stress.
It integrates eye-tracking and gesture analysis within a single multimodal framework, offering a comprehensive, empirically grounded understanding of interpreter behaviour across perceptual, kinetic, and verbal domains.
It provides empirical support for cognitive-pragmatic and embodied cognitive models, demonstrating that temporal synchronization of gaze, gesture, and speech enhances efficiency and communicative accuracy in both simultaneous and consecutive interpreting.
Future work should examine how language-specific syntactic structures, information-packaging patterns, and cultural traditions of gesture and gaze influence multimodal strategies. Interpreters working with flexible-word-order or high-context languages such as Ukrainian or Japanese may exhibit different gaze distribution and gesture synchronization than those working with fixed-syntax languages such as English or German — pointing toward a cross-culturally sensitive model of multimodal interpreting effectiveness.