Multimodal Integration of Cognitive and Embodied Processes in Interpreting

Abstract

The body as a cognitive tool in real-time interpreting

This study examines how cognitive and embodied processes are integrated in interpreting. It analyzes the interaction of verbal, visual, and kinetic modalities, focusing on the role of gestures, gaze, and eye contact in meaning formation, cognitive load management, and communication effectiveness in both simultaneous and consecutive interpreting. Using a mixed-methods design, the study analyzed conference speeches, simulated dialogues, and short texts. Gaze and gestures were coded to examine their coherence with speech, and visual–kinetic behaviour was linked to performance metrics across two groups with different experience levels: experienced professional interpreters and graduate student interpreters.

The results reveal a dynamic interplay of bodily actions and cognitive processes. Gestures served as external memory aids: subtle pointing or rhythmic movements facilitated the structuring of information and the memorization of terminology, reducing cognitive load and freeing working memory for comprehension and retrieval. Gaze patterns reflected attentional strategies — experienced professional interpreters proactively anticipated key information, while graduate student interpreters demonstrated a reactive gaze, sometimes delaying paraphrasing.

The study presents a multimodal model of interpreting as an integrated system of cognitive and bodily processes, where gaze cues attention, gestures serve as memory aids and communicative tools, and verbal output with prosody interacts with these modalities to enhance clarity and coherence. Overall, the findings confirm that interpreting is an inherently multimodal process, and bodily behaviours serve as integrated cognitive strategies.

Keywords: multimodality, interpreting efficiency, gaze, gestures, eye contact, cognitive load, embodied cognition, cognitive processing, multimodal model of interpreting process

Introduction

Interpreting as a multimodal form of human communication

Traditional studies focused on linguistic precision, cognitive effort, and memory. Modern cognitive translation studies see interpreting as something the whole body performs.

In recent decades, interpreting has increasingly been viewed not merely as a linguistic activity but as a complex, multimodal form of human communication. Interpreting naturally incorporates multiple modes of expression — speech, intonation, gaze, gestures, and body movements — that operate simultaneously to create and convey meaning between the interpreter, speaker, and audience.

Integrating multimodal analysis allows scholars to explore how visual and kinetic modalities contribute to the comprehension, processing, and transmission of messages, offering a more holistic understanding of the interpreting process.

The research gap

Despite growing interest in multimodal communication, empirical research on interpreters’ eye movements and gestures remains limited. Gesture studies focus on public speaking and bilingual interaction; eye-tracking is widely used in reading and translation, yet rarely applied to the interpreting process.

Systematic research is therefore needed to explore how visual attention and gesture dynamics interact with cognitive processing during interpreting.

Theoretical Framework

From cognitive linguistics to embodied cognition

The study draws together two research traditions — cognitive-linguistic approaches emphasizing mental representations, and interactionist approaches highlighting real-time, culture-specific coordination.

Pinar · 2013

Integrates cognitive linguistics and multimodality, viewing meaning as co-constructed through multiple semiotic channels, with verbal, visual, and gestural cues interacting with perception, memory, and attention.

Jelec · 2020

Explores multimodal patterns in cognition and communication, providing empirical evidence that gestures, gaze, and posture systematically contribute to meaning-making.

Cohn & Schilperoord · 2024

Propose a cognitive framework in which multimodal language is an integrated system — gestures, gaze, and visual symbols act as linguistic elements organizing meaning and facilitating comprehension and production.

Makaruk · 2025

Examines multimodal syntactic constructions in digital English, showing they are systematically structured and classifiable by dominant semiotic components — revealing cognitive and pragmatic potential.

Özyürek · 2021

From a crosslinguistic perspective, shows that gestures and gaze are shaped by linguistic and cultural norms — multimodal strategies reflect both universal cognitive processes and language-specific regularities.

Feyaerts, Brône & Oben · 2017

Emphasize the social and pragmatic functions of gestures, gaze, and bodily cues — supporting turn-taking, emphasis, and coordination in a dynamic, interaction-oriented view of multimodality.

Managing cognitive effort

Research on interpreting often relies on Gile’s Effort Model, which describes how cognitive load is managed across listening, production, and memory efforts. Building on this concept, Boiko (2025) examines these efforts in business interpreting, showing how high terminology density, rapid speech rate, and cultural differences increase cognitive stress, and identifies adaptive techniques such as anticipation, segmentation, and reformulation.

The interpreter as a bodily agent

Embodied cognition theory (Macrine & Fugate, 2020; Gallagher, 2011) posits that cognition is grounded in sensorimotor experience rather than purely abstract symbols. Milošević & Risku (2024) argue that interpreters act as bodily agents whose gestures, posture, and gaze actively support attention allocation, memory retrieval, and real-time reformulation. In this view, the body becomes a cognitive tool for coping with high cognitive loads.

Aim & Tasks

Tracing where the mind meets the body

Aim of the study

To reveal the multimodal integration of cognitive and embodied processes in both simultaneous and consecutive interpreting — focusing on how mental operations of perception, comprehension, attention, working memory, and meaning reformulation are dynamically coordinated with bodily actions such as gaze, gestures, and posture during meaning construction, cognitive load management, and interpreter-mediated communication.

Analyze the interpreter’s efficiency through the integration of eye-tracking and gesture analysis methods.

Examine how interpreters coordinate visual attention and bodily movements during different interpreting stages.

Identify patterns of multimodal synchronization and their cognitive-pragmatic implications for interpreting efficiency.

Propose a multimodal model of the interpreting process that accounts for both cognitive processing and bodily interactional behaviour.

Methodology

A mixed experimental and observational design

Quantitative eye-tracking measures were integrated with qualitative gesture analysis, with all data streams time-aligned for cross-modal analysis.

Corpus: Conference speeches (5–7 minutes each) with formal and informational content, simulated professional dialogues from business and academic contexts, and brief expository texts for consecutive interpreting tasks.
Participants: 12 participants in two groups of six. Group A: experienced professional interpreters; Group B: graduate student interpreters. All were balanced and fluent in both source and target languages, so differences reflect strategy, not proficiency.
Gaze metrics: Fixation duration, saccade direction, and gaze shifts across the listening, reformulating, and presenting phases.
Gesture coding: Classified by type (iconic, deictic, rhythmic, metaphorical), then analyzed for frequency, function, and timing relative to speech.
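
The gaze metrics listed above can be illustrated with a minimal Python sketch. It assumes fixations have already been exported as time-stamped intervals with an area-of-interest label; the `Fixation` record, the phase boundaries, and the sample values are hypothetical and are not taken from the study. Only fixation duration and gaze shifts are covered here; saccade direction would need gaze coordinates that this sketch does not model.

```python
from dataclasses import dataclass
from statistics import mean

@dataclass
class Fixation:
    start: float   # seconds from recording start
    end: float     # seconds from recording start
    target: str    # area of interest, e.g. "speaker", "notes", "audience"

# Hypothetical phase boundaries (seconds) for one consecutive-interpreting trial.
PHASES = [
    ("listening", 0.0, 10.0),
    ("reformulating", 10.0, 14.0),
    ("presenting", 14.0, 25.0),
]

def phase_of(t, phases=PHASES):
    """Return the phase whose half-open interval [start, end) contains t."""
    for name, start, end in phases:
        if start <= t < end:
            return name
    return None

def gaze_metrics(fixations, phases=PHASES):
    """Mean fixation duration and gaze-shift count per phase.

    A gaze shift is counted when two consecutive fixations within the
    same phase land on different targets.
    """
    by_phase = {name: [] for name, _, _ in phases}
    for fix in sorted(fixations, key=lambda f: f.start):
        name = phase_of(fix.start, phases)
        if name is not None:
            by_phase[name].append(fix)

    result = {}
    for name, fixes in by_phase.items():
        durations = [f.end - f.start for f in fixes]
        shifts = sum(1 for a, b in zip(fixes, fixes[1:]) if a.target != b.target)
        result[name] = {
            "n_fixations": len(fixes),
            "mean_duration": mean(durations) if durations else 0.0,
            "gaze_shifts": shifts,
        }
    return result

# Made-up sample data for illustration only.
fixations = [
    Fixation(0.5, 1.0, "speaker"),
    Fixation(1.2, 1.5, "slides"),
    Fixation(2.0, 3.0, "speaker"),
    Fixation(10.5, 11.0, "notes"),
    Fixation(14.2, 14.6, "audience"),
    Fixation(15.0, 15.2, "notes"),
]
metrics = gaze_metrics(fixations)
for phase, values in metrics.items():
    print(phase, values)
```

Assigning each fixation to a phase by its start time keeps the sketch simple; a fixation that straddles a phase boundary could instead be split, which is a design choice the source does not specify.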

Why combine the channels?

With all data streams time-aligned, multimodal correlations examined the relationships between gaze, gestures, and interpreting performance (accuracy, fluency, and cognitive load), giving a comprehensive view of how visual and kinetic modalities contribute to effective interpreting.
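
The source does not state which correlation statistic was used. As one plausible option for small groups, the sketch below computes a Spearman rank correlation in plain Python. The function names and the sample numbers are assumptions made for illustration; they are not results from the study.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    return pearson(average_ranks(x), average_ranks(y))

# Made-up per-participant values, for illustration only.
gestures_per_minute = [2, 4, 6, 8, 10, 12]
accuracy_score = [60, 65, 63, 80, 85, 90]
rho = spearman(gestures_per_minute, accuracy_score)
print(round(rho, 3))
```

A rank-based statistic is used here because it makes no assumption of linearity or normality, which is hard to justify with six participants per group.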

The interpreter acts as a mediator, dynamically coordinating verbal and non-verbal cues under time pressure to direct attention, ensure coherence, and enhance message clarity.

Key Concepts

The modalities at work

Meaning is co-constructed through several semiotic channels rather than language alone. Each modality carries a distinct cognitive function.

Multimodality

The integration of verbal, visual, gestural, and prosodic channels into communication. In interpreting, multimodal signals support comprehension, attention management, and message delivery.

Gaze

A primary indicator of attention, focus, and cognitive processing. It guides comprehension and interaction management — distributing visual attention between speaker, notes, and audience.

Gestures

Iconic, deictic, metaphorical, and rhythmic gestures facilitate verbal expression, memory retrieval, and emphasis — functioning as external cognitive tools and memory aids.

Prosody

Intonation, stress, and rhythm convey discourse structure, emotional tone, and pragmatic intent — interacting with gestures and gaze to enhance clarity and coherence.

Embodied cognition

Cognitive processes are grounded in sensorimotor experience. The body mediates and facilitates cognition, linking mental representations with external actions during meaning-making.

Temporal synchronization

The alignment of verbal, visual, and kinetic channels in time — the mechanism that optimizes the combination of cognitive and bodily resources during interpreting.
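
One way to make temporal synchronization measurable is a signed lag between each gesture onset and the nearest point of prosodic stress. The Python sketch below illustrates this under stated assumptions: onsets are given in seconds on a shared timeline, and the 0.2-second tolerance for calling a gesture "aligned" is a hypothetical threshold, not a value from the study.

```python
from bisect import bisect_left

def nearest(sorted_times, t):
    """Return the element of a non-empty sorted list that is closest to t."""
    i = bisect_left(sorted_times, t)
    # Only the neighbours on either side of the insertion point can be closest.
    candidates = sorted_times[max(i - 1, 0): i + 1]
    return min(candidates, key=lambda s: abs(s - t))

def gesture_lags(gesture_onsets, stress_onsets):
    """Signed lag in seconds: gesture onset minus nearest stress onset.

    Negative values mean the gesture starts before the stress,
    positive values mean it starts after.
    """
    stresses = sorted(stress_onsets)
    if not stresses:
        raise ValueError("stress_onsets must not be empty")
    return [g - nearest(stresses, g) for g in gesture_onsets]

def classify(lag, tolerance=0.2):
    """Label a lag as aligned, anticipating, or following (hypothetical cutoff)."""
    if abs(lag) <= tolerance:
        return "aligned"
    return "anticipates" if lag < 0 else "follows"

# Made-up onsets for illustration only.
stress_onsets = [1.0, 2.5, 4.0]
gesture_onsets = [0.9, 2.5, 3.6, 4.6]
lags = gesture_lags(gesture_onsets, stress_onsets)
labels = [classify(lag) for lag in lags]
print(labels)
```

Summarizing the share of each label per participant would give a simple profile of whether gestures tend to anticipate, coincide with, or trail speech.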

Findings

What the eyes and hands revealed

Eye-tracking and gesture analysis exposed systematic differences between experienced professionals (Group A) and graduate students (Group B) across stages, gesture types, and coordination.

Table 1

Visual attention patterns across interpreting stages

Stage	Pattern	Group A — professionals	Group B — students	Cognitive implication
Listening	Fixations on speaker’s face and visual aids	More targeted fixations, moderate gaze shifts, some reliance on visual aids	Long fixations, frequent gaze shifts, reliance on visual cues	Group B exerts higher cognitive effort; Group A shows emerging efficiency
Reformulation	Fixations on critical information for retrieval	Intermediate fixation patterns, partially focused gaze	Diffuse, chaotic gaze patterns, high cognitive load	Group B struggles with attention allocation; Group A consolidates visual strategies
Output	Fixations toward audience or notes	Inconsistent transitions, frequent returns to speaker or aids	More stable transitions, occasional backtracking	Group A shows ongoing effort; Group B demonstrates emerging automaticity

Table 2

Gesture types and functions

Gesture	Function	Group A	Group B	Interpretation
Iconic	Represent concrete concepts or actions	Used inconsistently, often to reinforce comprehension	Used strategically to emphasize key terms	Group B integrates gestures purposefully; Group A uses them compensatorily
Deictic	Direct attention or reference materials	Frequent; high reliance on visual cues	Less frequent; moderate use	Group A depends more on visual cues; Group B shows emerging independence
Metaphorical	Convey abstract ideas or relationships	Equally distributed	Equally distributed	Both groups use gestures for non-literal meaning similarly
Rhythmic	Mark speech rhythm, emphasize key points	Sporadically used	Consistently used; supports coherence and fluency	Group B shows more automatic integration; Group A is irregular
Overall	Memory support, emphasis, clarification	Less consistent; compensatory use	Greater consistency and integration with speech	Strategic vs. compensatory multimodal behavior
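
A frequency comparison across gesture types starts from counting coded gestures and normalizing by recording length. The Python sketch below shows one way to do this; the annotation format, the function name, and the sample values are assumptions for illustration, since the source does not describe how frequencies were computed.

```python
from collections import Counter

# The four gesture categories named in the coding scheme.
GESTURE_TYPES = ("iconic", "deictic", "metaphorical", "rhythmic")

def gesture_rates(annotations, duration_s):
    """Gestures per minute for each type.

    annotations: iterable of (onset_seconds, gesture_type) pairs.
    duration_s:  length of the recording in seconds.
    """
    if duration_s <= 0:
        raise ValueError("duration_s must be positive")
    counts = Counter(g_type for _, g_type in annotations)
    unknown = set(counts) - set(GESTURE_TYPES)
    if unknown:
        raise ValueError(f"unknown gesture types: {sorted(unknown)}")
    minutes = duration_s / 60
    return {g_type: counts[g_type] / minutes for g_type in GESTURE_TYPES}

# Made-up annotations for a two-minute recording, for illustration only.
annotations = [
    (3.1, "deictic"),
    (7.4, "rhythmic"),
    (12.0, "rhythmic"),
    (20.5, "iconic"),
]
rates = gesture_rates(annotations, duration_s=120)
print(rates)
```

Normalizing to gestures per minute makes recordings of different lengths comparable, which matters when speeches range from five to seven minutes.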

Table 3

Multimodal coordination patterns

Aspect	Group A	Group B
Gaze–speech	Often delayed; fixations lag behind speech	Anticipates or complements speech; smoother transitions
Gesture timing	Gestures frequently follow verbal output	Gestures anticipate or align with speech
Prosody–gesture	Less consistent; rhythmic gestures not always timed	Rhythmic gestures consistently coincide with prosodic emphasis
Integration of cues	Slower coordination across channels	Efficient integration, supporting load management
Effectiveness	Lower coherence; cognitive effort visible	Higher coherence; more fluid and structured

Visualized

Coordination profile by group

Qualitative comparison — ordinal encoding of the descriptive findings in Table 3.

Figure 1

Gesture frequency and cognitive load measures

Reproduction of the study’s statistical comparison (relative values on a 0–5 scale). Series — Saccade count, Fixation duration, Gesture frequency.

Group A — strategic

A more strategic use of gestures, coordinating them with verbal output to support memory retrieval and reduce cognitive effort — reflected in shorter and more focused fixations.

Group B — compensatory

A higher but less systematic frequency of gestures, often used inconsistently as a compensatory mechanism under higher cognitive load — accompanied by longer fixations and more diffuse gaze.

Bodily action as a functionally integrated cognitive strategy

Gestures as memory aids

Subtle pointing or rhythmic hand movements helped interpreters structure information segments and recall terminological elements when paraphrasing — freeing working memory to focus on comprehension rather than retention.

Proactive vs. reactive gaze

More experienced participants showed proactive gaze, anticipating important information; less experienced interpreters showed reactive gaze, attending to notes or speaker only after key material had passed — sometimes delaying paraphrasing.

Audience monitoring

Redirecting gaze toward the audience signalled active monitoring of comprehension. Brief eye contact maintained alignment and pragmatic coherence, particularly during information-dense segments.

Coordination differs by mode

Simultaneous

Tight, minimal, strategic

High cognitive load, time constraints, and parallel processing demand stricter synchronization between modalities.
Gestures are minimalistic and rhythmic — short movements aligned with prosodic stress, so as not to disrupt speech production.
Gaze is limited to brief, strategic fixations on the speaker or reference materials, avoiding excessive audience engagement.

Consecutive

Flexible, expressive, orchestrated

Greater temporal flexibility allows more purposeful coordination of multimodal processes.
While taking notes and reformulating, interpreters show broader gaze movements between notes, speaker, and audience.
A wider range of iconic and deictic gestures supports discourse structuring, referent tracking, and listener engagement.

The Model

A multimodal model of the interpreting process

Interpreting conceptualized as an integrated system of cognitive and bodily processes, where temporal synchronization links comprehension and production.

Cognitive Regulation: attention · memory · load control

Multimodal Integration: synchronization of gaze, gesture & speech

Gaze: monitoring
Gestures: memory / load
Verbal Output: meaning

Interpreting Process: coherence · efficiency · temporal alignment

Gaze behavior reflects attentional allocation and cognitive monitoring, showing how interpreters distribute visual attention between speaker, notes, and audience.

Gestures act as external cognitive tools — functioning simultaneously as memory aids, communicative cues, and cognitive load regulators that facilitate information retention and discourse organization.

Verbal output and prosody form the primary channels of meaning transmission, where rhythm, intonation, and stress interact with gestures and gaze to enhance clarity and coherence.

The model emphasizes the temporal synchronization of these modalities and their dynamic, bidirectional contributions to both comprehension (processing the input) and production (formulating the output).

Pedagogy

Training multimodal coordination

Targeted strategies that progressively build automatic attentional control, cognitive management, and multimodal coordination.

Strategy 01

Gaze-focused training

Watch short clips and recognize key visual cues — gestures, slide highlights, facial expressions — while ignoring irrelevant movement, optimizing load management.

Strategy 02

Gesture awareness

Interpret short speeches while intentionally incorporating iconic, deictic, and rhythmic gestures, then review recordings for accuracy, rhythm, and coherence with speech.

Strategy 03

Multitasking training

Combine listening, note-taking, and verbal inference under gradually increasing difficulty and controlled distractions to build cognitive resilience.

Strategy 04

Feedback & reflection

Use eye-tracking and video to explore gaze patterns, fixation durations, and gesture synchronization, comparing against expert models for targeted adjustment.

Strategy 05

Gradual complexity

Move from short, clear speeches to terminologically rich lectures and interactive simulations with audience questions — developing adaptive, real-time integration.

Synthesis

A progressive multimodal approach

Combining gaze, gesture, and speech in increasingly complex tasks reduces cognitive load, improves interactional coherence, and prepares interpreters for high-pressure environments.

Conclusions

Interpreting is bodily, multimodally integrated cognition

Effective interpreting extends beyond linguistic competence to encompass integrated multimodal expertise. The study makes three contributions.

It reveals the dynamic interplay between bodily behaviour and cognitive processes during real-time interpreting, showing how gaze, gestures, and prosody are coordinated to manage attention, regulate cognitive load, and maintain accuracy under stress.

It integrates eye-tracking and gesture analysis within a single multimodal framework, offering a comprehensive, empirically grounded understanding of interpreter behaviour across perceptual, kinetic, and verbal domains.

It provides empirical support for cognitive-pragmatic and embodied cognitive models, demonstrating that temporal synchronization of gaze, gesture, and speech enhances efficiency and communicative accuracy in both simultaneous and consecutive interpreting.

Future research

Future work should examine how language-specific syntactic structures, information-packaging patterns, and cultural traditions of gesture and gaze influence multimodal strategies. Interpreters working with flexible-word-order or high-context languages such as Ukrainian or Japanese may exhibit different gaze distribution and gesture synchronization than those working with fixed-syntax languages such as English or German — pointing toward a cross-culturally sensitive model of multimodal interpreting effectiveness.

References

Works cited

Boiko, Ya. V. (2025). Interpreting in Business: Challenges and Solutions. Folium, 6, 32–38. DOI: 10.32782/folium/2025.6.4

Cohn, N., Schilperoord, J. (2024). A Multimodal Language Faculty: A Cognitive Framework for Communication. London: Bloomsbury. DOI: 10.5040/9781350404861

Feyaerts, K., Brône, G., Oben, B. (2017). Multimodality in interaction. In B. Dancygier (Ed.), The Cambridge Handbook of Cognitive Linguistics (pp. 135–156). Cambridge University Press. DOI: 10.1017/9781316339732.010

Gallagher, S. (2011). Interpretations of embodied cognition. In W. Tschacher & C. Bergomi (Eds.), The Implications of Embodiment: Cognition and Communication (pp. 59–74). Exeter: Imprint Academic.

Giberga, A., Ahufinger, N., Igualada, A., Aguilera, M., Guerra, E., Esteve-Gibert, N. (2024). Prosody and gesture in the comprehension of pragmatic meanings: The case of children with developmental language disorder. In Yi. Chen, A. Chen, A. Arvaniti (Eds.), Proceedings of the 12th International Conference on Speech Prosody (pp. 697–701). Leiden: Leiden University Publ. DOI: 10.21437/SpeechProsody.2024-141

Gile, D. (2021). The effort models of interpreting as a didactic construct. In R. Muñoz Martín, S. Sun, D. Li (Eds.), Advances in Cognitive Translation Studies (pp. 139–160). Singapore: Springer. DOI: 10.1007/978-981-16-2070-6_7

Hu, T., Wang, X., Xu, H. (2022). Eye-tracking in interpreting studies: A review of four decades of empirical studies. Frontiers in Psychology, 13: 872247. DOI: 10.3389/fpsyg.2022.872247

Jelec, A. (2020). Multimodal patterns in cognition and communication. Studia Anglica Posnaniensia, 55 (s1), 179–184. DOI: 10.2478/stap-2020-0007

Macrine, S., Fugate, J. (2020). Embodied Cognition. In K. Hytten (Ed.), Oxford Research Encyclopedia of Education. New York: Oxford Academic. DOI: 10.1093/acrefore/9780190264093.013.885

Makaruk, L. (2025). Multimodal Syntactic Constructions: A Striking Feature of Digital Communication in Modern English. Alfred Nobel University Journal of Philology, 1 (29), 265–283. DOI: 10.32342/3041-217X-2025-1-29-16

Milošević, J., Risku, H. (2024). Interpreting and embodied cognition. In C. D. Mellinger (Ed.), The Routledge Handbook of Interpreting and Cognition (pp. 324–340). London: Routledge. DOI: 10.4324/9780429297533-24

Oben, B., Brône, G. (2015). What you see is what you do: On the relationship between gaze and gesture in multimodal alignment. Language and Cognition, 7 (4), 546–562. DOI: 10.1017/langcog.2015.22

Özyürek, A. (2021). Considering the nature of multimodal language from a crosslinguistic perspective. Journal of Cognition, 4 (1): 42, 1–5. DOI: 10.5334/joc.165

Pinar, M. J. (2013). Multimodality and cognitive linguistics: Introduction to the special volume. Review of Cognitive Linguistics, 11 (2), 227–235. DOI: 10.1075/rcl.11.2.01pin

Singer, M. A., Radinsky, J., Goldman, S. R. (2008). The role of gesture in meaning construction. Discourse Processes, 45 (4), 365–386. DOI: 10.1080/01638530802145601

Tiselius, E., Sneed, K. (2020). Gaze and eye movement in dialogue interpreting: An eye-tracking study. Bilingualism: Language and Cognition, 23 (4), 780–787. DOI: 10.1017/S1366728920000309

Vranjes, J., Brône, G. (2020). Eye-tracking in interpreter-mediated talk: From research to practice. In H. Salaets & G. Brône (Eds.), Linking up With Video: Perspectives on Interpreting Practice and Research (pp. 203–233). London: Benjamins Translation Library. DOI: 10.1075/btl.149.09vra