1ShanghaiTech University
2Deemos Technology
*Equal contributions
†Project lead ‡Corresponding author
Overview of our training pipeline. We quantize continuous, jittery IMU signals into a sequence of jitter-reduced, motion-aware inertial tokens by learning an IMU tokenizer through a distribution matching strategy, and adopt a semantically aligned, LoRA fine-tuned LLM to generate precise, professional and stylistic text feedback for human motion analysis.
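To make the quantization step concrete, here is a minimal sketch of turning continuous feature vectors into discrete tokens by nearest-codebook lookup. It is an illustration only, not the tokenizer used by Mojito: the codebook values, the 2-D features, and the function names `quantize` and `dequantize` are all hypothetical, and the distribution matching strategy used to learn the tokenizer is not shown.

```python
import math

def quantize(frames, codebook):
    """Map each continuous feature vector to the index of its nearest codebook entry."""
    tokens = []
    for frame in frames:
        # Euclidean distance from this frame to every codebook entry.
        dists = [math.dist(frame, code) for code in codebook]
        tokens.append(dists.index(min(dists)))
    return tokens

def dequantize(tokens, codebook):
    """Replace each token index with its codebook vector."""
    return [codebook[t] for t in tokens]

# Hypothetical 2-D codebook; a learned tokenizer would use far more entries
# in a higher-dimensional latent space.
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]

# Toy "jittery" readings scattered around two underlying states.
frames = [(0.05, -0.02), (-0.03, 0.04), (0.97, 0.06), (1.04, -0.05)]

tokens = quantize(frames, codebook)
print(tokens)  # [0, 0, 1, 1]
```

In this toy setting, small perturbations around the same underlying state collapse onto the same token, which gives an intuition for why a discrete token sequence can be less jittery than the raw signal.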
Mojito delivers precise motion descriptions and instructions within seconds, markedly enhancing efficiency and user experience compared to vision-language models. Here, we showcase online interactions between a user and our system for motion capture and analysis from streaming IMU data. The LLM running on our system backend responds instantly to the user's questions posted from the web frontend, providing concise and professional feedback. The corresponding demonstration video and backend logs are shown alongside for reference.
Compared to previous inertial posers, Mojito captures motion robustly across diverse noisy input conditions, which are commonly encountered in practical applications. In particular, when the IMU sensor attached to the root joint is disturbed or missing, as in the three-point tracker setting in VR, our jitter-reduced inertial tokens can still reconstruct reasonable full-body motions.
@article{shan2025mojito,
  title   = {Mojito: LLM-Aided Motion Instructor with Jitter-Reduced Inertial Tokens},
  author  = {Shan, Ziwei and He, Yaoyu and Zhao, Chengfeng and Du, Jiashen and Zhang, Jingyan and Zhang, Qixuan and Yu, Jingyi and Xu, Lan},
  journal = {arXiv preprint arXiv:2502.16175},
  year    = {2025}
}