Mojito: LLM-Aided Motion Instructor with
Jitter-Reduced Inertial Tokens

Ziwei Shan1,*, Yaoyu He1,*, Chengfeng Zhao1,*,†, Jiashen Du1, Jingyan Zhang1, Qixuan Zhang1,2, Jingyi Yu1,‡, Lan Xu1,‡

1ShanghaiTech University    2Deemos Technology
*Equal contributions
†Project lead    ‡Corresponding author

Human bodily movements convey critical insights into action intentions and cognitive processes, yet existing multimodal systems primarily focus on understanding human motion via language, vision, and audio, and struggle to capture the dynamic forces and torques inherent in 3D motion. Inertial measurement units (IMUs) present a promising alternative, offering lightweight, wearable, and privacy-conscious motion sensing. However, processing streaming IMU data faces challenges such as wireless transmission instability, sensor noise, and drift, limiting its utility for long-term real-time motion capture (MoCap) and, more importantly, online motion analysis. To address these challenges, we introduce Mojito, an intelligent motion agent that integrates inertial sensing with large language models (LLMs) for interactive motion capture and behavioral analysis. The core innovation of Mojito lies in a jitter-reduced inertial token representation built on a novel IMU signal encoding framework, together with a language model extended with inertial tokens. By employing a VQ-VAE, Mojito learns a discrete latent space over continuous IMU signals, mitigating sensor noise and drift through quantization. The inertial tokens are then aligned with the inductive bias of natural language and mapped to textual semantics to enhance compatibility with LLMs, enabling efficient sequence modeling. To support domain-specific applications, Mojito further incorporates tunable LoRA adapters, facilitating personalized feedback tailored to roles such as fitness trainer or rehabilitation therapist. Extensive experiments demonstrate that Mojito outperforms existing IMU-based methods in motion capture under noisy conditions and achieves behavior analysis capability comparable to that of large vision-language models. A user study further highlights its practical effectiveness across diverse scenarios as a versatile tool for intelligent human-agent interaction.
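As a rough sketch of the quantization idea, the snippet below shows how a continuous IMU feature could be snapped to its nearest codebook entry, discarding small jitter. The class name, codebook size, and tensor shapes are illustrative assumptions, not our released implementation.

```python
# Minimal sketch of nearest-neighbor codebook quantization (VQ-VAE style)
# for IMU features. All names and shapes are illustrative assumptions.
import torch
import torch.nn as nn


class IMUQuantizer(nn.Module):
    def __init__(self, num_codes: int = 512, code_dim: int = 256):
        super().__init__()
        # Learnable codebook: each row is one "inertial token" embedding.
        self.codebook = nn.Embedding(num_codes, code_dim)

    def forward(self, z_e: torch.Tensor):
        # z_e: (batch, time, code_dim) continuous encoder outputs.
        flat = z_e.reshape(-1, z_e.shape[-1])               # (B*T, D)
        dists = torch.cdist(flat, self.codebook.weight)     # (B*T, K)
        indices = dists.argmin(dim=-1)                      # discrete token ids
        z_q = self.codebook(indices).view_as(z_e)           # quantized features
        # Straight-through estimator so gradients still reach the encoder.
        z_q = z_e + (z_q - z_e).detach()
        return z_q, indices.view(z_e.shape[:-1])
```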
Pipeline

Overview of our training pipeline. We quantize continuous, jittery IMU signals into a sequence of jitter-reduced, motion-aware inertial tokens by learning an IMU tokenizer with a distribution-matching strategy, and adopt a semantically aligned, LoRA fine-tuned LLM to generate precise, professional, and stylized text feedback for human motion analysis.
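To make the LLM side concrete, here is a hypothetical sketch of how inertial token ids could be exposed to an off-the-shelf language model with Hugging Face transformers and peft: placeholder tokens are appended to the vocabulary, the embedding table is resized, and only LoRA adapters (plus the new embeddings) are trained. The base model name and adapter hyperparameters are assumptions, not our actual configuration.

```python
# Sketch: adding inertial tokens to an LLM vocabulary and attaching LoRA
# adapters. Model name and hyperparameters are illustrative assumptions.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "meta-llama/Llama-2-7b-hf"  # placeholder base model
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

# One special token per codebook entry, e.g. <imu_0> ... <imu_511>.
imu_tokens = [f"<imu_{i}>" for i in range(512)]
tokenizer.add_tokens(imu_tokens, special_tokens=True)
model.resize_token_embeddings(len(tokenizer))

# Train only low-rank adapters and the resized embeddings, not the full model.
lora_cfg = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    modules_to_save=["embed_tokens", "lm_head"],
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()
```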

Demo Results

Mojito delivers precise motion descriptions and instructions within seconds, markedly enhancing efficiency and user experience compared to vision-language models. Here, we showcase online interactions between users and our system for motion capture and analysis from streaming IMU data. The LLM running on our system backend responds instantly to questions posted from the web frontend, providing concise and professional feedback. The corresponding demonstration video and backend logs are shown alongside for reference.
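For intuition only, the pseudo-backend below sketches how such a demo loop might look: buffer streaming IMU frames, tokenize each window into inertial tokens, and interleave them with the user's question in a prompt. The helper functions (read_imu_frame, imu_tokenizer, llm_chat) and the 60 Hz window are placeholders, not our actual serving stack.

```python
# Hypothetical streaming loop for an IMU-to-LLM demo backend.
# Only the overall flow is illustrated; helpers are placeholders.
from collections import deque

WINDOW = 60  # e.g. one second of frames at an assumed 60 Hz


def handle_stream(read_imu_frame, imu_tokenizer, llm_chat, user_question):
    buffer = deque(maxlen=WINDOW)
    while True:
        buffer.append(read_imu_frame())           # raw orientation + acceleration
        if len(buffer) < WINDOW:
            continue
        token_ids = imu_tokenizer(list(buffer))   # jitter-reduced inertial tokens
        prompt = (
            "Motion context: " + " ".join(f"<imu_{t}>" for t in token_ids) + "\n"
            "User: " + user_question
        )
        yield llm_chat(prompt)                    # concise textual feedback
```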

Qualitative MoCap Comparisons under Noisy Conditions

Compared to previous inertial posers, Mojito delivers robust motion capture across the diverse noisy input conditions commonly encountered in practical applications. In particular, when the IMU sensor attached to the root joint is disturbed or missing, as in the three-point tracker setting in VR, our jitter-reduced inertial tokens can still reconstruct reasonable full-body motions.
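As a purely illustrative way to simulate this condition, one can corrupt or zero out the root sensor's channels before tokenization; the sensor layout assumed below (six sensors, nine channels each, root first) is a placeholder, not our exact input format.

```python
# Illustrative corruption of the root IMU before tokenization, mimicking
# a disturbed or missing root sensor. Sensor layout is an assumption.
import torch


def corrupt_root(imu_seq: torch.Tensor, mode: str = "drop", noise_std: float = 0.5):
    # imu_seq: (time, num_sensors=6, channels=9), root sensor at index 0 (assumed).
    out = imu_seq.clone()
    if mode == "drop":
        out[:, 0] = 0.0                                        # missing tracker
    elif mode == "noise":
        out[:, 0] += noise_std * torch.randn_like(out[:, 0])   # disturbed tracker
    return out
```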

Citation

@article{shan2025mojito,
  title   = {Mojito: LLM-Aided Motion Instructor with Jitter-Reduced Inertial Tokens},
  author  = {Shan, Ziwei and He, Yaoyu and Zhao, Chengfeng and Du, Jiashen and Zhang, Jingyan and Zhang, Qixuan and Yu, Jingyi and Xu, Lan},
  journal = {arXiv preprint arXiv:2502.16175},
  year    = {2025}
}

Thanks to Lior Yariv and Jianfeng Xiang for the website template