Embodied AI for Vision-Language based Navigation

Healthcare Robotics

A simulation platform that trains AI agents to understand and perform hospital tasks from natural language instructions, in both 3D and text-based environments, to improve their reasoning and real-world readiness.

Institute:
IIT Kharagpur

PI Name:
Prof. Pawan Goya

Technology Readiness Level (TRL)
3

Problem Addressed

This project focuses on developing complementary simulation environments tailored for training embodied agents to perform hospital-based tasks. The environments cater to two specific domains:
  • Vision-Language Navigation (VLN): Agents learn to navigate a simulated 3D hospital setting by following natural language instructions. This addresses the challenge of training agents in visually rich environments.
  • Text-Based Tasks: Agents interact with a text-based representation of a hospital to perform language-driven reasoning and task execution. This enables language-centric reasoning in a purely text-based scenario (TextWorld).

About the Technology

  • VLN using Unity: Unity3D serves as the development platform for a visually immersive hospital simulation environment. The environment is modeled on a reference hospital setup, and the agent simulation is built on a vision-language navigation framework.
  • Text-based tasks using TextWorld: TextWorld offers a framework for simulating text-based hospital tasks, focusing on the agent’s ability to perform language-driven reasoning and task execution.
Simulation Scope:
  • First-person, waypoint-based navigation in the simulation environment.
  • Environment consists of corridors, wards, ICUs, operating rooms, etc.
  • Dynamic elements: a few pieces of medical equipment.
  • Textual descriptions of hospital rooms and objects.
  • Commands issued as text (e.g., “Pick up the thermometer from the cabinet”).
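
The waypoint-based navigation and textual room descriptions listed above can be illustrated with a small sketch. It assumes the hospital layout is stored as a graph of named waypoints; the node names, links, descriptions, and the `shortest_route` helper are all hypothetical and are not taken from the project itself.

```python
from collections import deque

# Hypothetical waypoint graph: node names and links are illustrative only.
WAYPOINTS = {
    "entrance": ["corridor_a"],
    "corridor_a": ["entrance", "ward_1", "corridor_b"],
    "ward_1": ["corridor_a"],
    "corridor_b": ["corridor_a", "icu", "operating_room"],
    "icu": ["corridor_b"],
    "operating_room": ["corridor_b"],
}

# Hypothetical textual descriptions for a text-based view of the same layout.
DESCRIPTIONS = {
    "ward_1": "A general ward with beds and a supply cabinet.",
    "icu": "An intensive care unit with monitors beside each bed.",
}


def shortest_route(graph, start, goal):
    """Breadth-first search over waypoints; returns a list of nodes or None."""
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for neighbour in graph[node]:
            if neighbour not in visited:
                visited.add(neighbour)
                queue.append(path + [neighbour])
    return None  # goal is not reachable from start


route = shortest_route(WAYPOINTS, "entrance", "icu")
print(route)
print(DESCRIPTIONS[route[-1]])
```

Breadth-first search is used here only because it is the simplest way to get a shortest hop count on an unweighted graph; a real environment might weight edges by distance instead.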
Agent Training:
  • Input: RGB or RGB-D visual data and natural language instructions using medical vocabulary.
  • Output: Navigation actions (e.g., move forward, turn left) or textual responses for reasoning tasks.
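
As a rough illustration of the output side, the sketch below models a discrete navigation action space and applies actions to an agent pose. It assumes a grid world with 90-degree turns and unit steps; the `Action`, `Pose`, `apply_action`, and `rollout` names are hypothetical, and the project's actual action space may differ.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    MOVE_FORWARD = "move forward"
    TURN_LEFT = "turn left"
    TURN_RIGHT = "turn right"


# Unit step for each heading, in degrees counter-clockwise from the +x axis.
HEADING_STEPS = {0: (1, 0), 90: (0, 1), 180: (-1, 0), 270: (0, -1)}


@dataclass(frozen=True)
class Pose:
    x: int
    y: int
    heading: int  # one of 0, 90, 180, 270


def apply_action(pose, action):
    """Return the pose that results from taking one discrete action."""
    if action is Action.TURN_LEFT:
        return Pose(pose.x, pose.y, (pose.heading + 90) % 360)
    if action is Action.TURN_RIGHT:
        return Pose(pose.x, pose.y, (pose.heading - 90) % 360)
    dx, dy = HEADING_STEPS[pose.heading]
    return Pose(pose.x + dx, pose.y + dy, pose.heading)


def rollout(pose, actions):
    """Apply a sequence of actions in order and return the final pose."""
    for action in actions:
        pose = apply_action(pose, action)
    return pose


final = rollout(
    Pose(0, 0, 0),
    [Action.MOVE_FORWARD, Action.TURN_LEFT, Action.MOVE_FORWARD],
)
print(final)
```

In a trained agent, the action sequence would come from a policy conditioned on the visual input and the instruction rather than from a fixed list.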

Application Areas & Use Cases

  • Healthcare Robotics: Train robots to assist with navigation and reasoning in hospital environments.
  • AI Training for Assistive Systems: Develop intelligent systems for patient care and logistics.
  • Cognitive Reasoning: Test language models for understanding and executing instructions.
Unity Development Platform for VLN:
  • Assets: High-quality 3D models of domain objects.
  • Navigation Meshes: Unity’s built-in NavMesh for agent navigation.
  • AI Frameworks: ML-Agents Toolkit for agent training, LLaVA models for VLM training.
TextWorld for Language Agents:
  • A Gym-like framework for developing text-based games.
  • Presents observations as natural language descriptions and accepts actions as text.
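
To make the Gym-like interaction loop concrete, here is a toy text environment written in plain Python. It does not use the TextWorld API; the `ToyHospitalTextEnv` class, its observation strings, and its simplified `(observation, reward, done)` return tuple are assumptions made purely for illustration.

```python
class ToyHospitalTextEnv:
    """Toy Gym-style text environment; this is not the TextWorld API."""

    GOAL_COMMAND = "pick up the thermometer from the cabinet"

    def reset(self):
        """Start a new episode and return the first text observation."""
        self.holding = None
        self.done = False
        return "You are in a ward. A cabinet holds a thermometer."

    def step(self, command):
        """Return (observation, reward, done) for one text command."""
        # Normalise case, surrounding whitespace, and a trailing full stop.
        text = command.lower().strip().rstrip(".")
        if text == self.GOAL_COMMAND and self.holding is None:
            self.holding = "thermometer"
            self.done = True
            return "You take the thermometer.", 1.0, True
        return "Nothing happens.", 0.0, self.done


env = ToyHospitalTextEnv()
print(env.reset())
obs, reward, done = env.step("Pick up the thermometer from the cabinet")
print(obs, reward, done)
```

The agent only ever sees and emits strings, which is what lets the same language model be evaluated on reasoning without any visual input.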