Achieving a robust and universal semantic representation for action description remains an key challenge in natural language understanding. Current approaches often check here struggle to capture the complexity of human actions, leading to limited representations. To address this challenge, we propose innovative framework that leverages multimodal