Overview
CRAB (Cross-environment Agent Benchmark) is a framework for evaluating and benchmarking multimodal language model agents across diverse computational environments. It provides a unified platform for assessing agent performance through rigorous, multi-dimensional testing.
Key Features
- Cross-environment support, so the same agent can be deployed and tested in multiple environments
- Graph-based evaluator that decomposes each task into sub-goal checks for fine-grained performance analysis (see the sketch after this list)
- Automated task generation by composing sub-tasks into complex composite tasks
- Easy-to-use Python-based configuration
- Support for both single-agent and multi-agent communication structures
- Benchmark includes 120 tasks across Ubuntu and Android environments
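A minimal sketch of the graph-evaluator idea, assuming sub-task checks arranged in a dependency DAG; the function names (`file_created`, `build_evaluator`, `evaluate`) and the `env` state dict are illustrative stand-ins, not CRAB's actual API:

```python
# Illustrative sketch only: sub-task checks form a DAG, and a check is only
# attempted once all of its predecessors have passed. Names are hypothetical.
import networkx as nx

def file_created(env) -> bool:
    """Check that the agent created the target file."""
    return env.get("file_exists", False)

def text_written(env) -> bool:
    """Check that the expected text was written into the file."""
    return env.get("file_text") == "hello"

def file_uploaded(env) -> bool:
    """Check that the file was uploaded afterwards."""
    return env.get("uploaded", False)

def build_evaluator() -> nx.DiGraph:
    g = nx.DiGraph()
    g.add_edge(file_created, text_written)   # writing depends on creation
    g.add_edge(text_written, file_uploaded)  # uploading depends on writing
    return g

def evaluate(g: nx.DiGraph, env: dict) -> dict:
    """Run checks in topological order; skip nodes whose parents failed."""
    passed = set()
    for check in nx.topological_sort(g):
        if all(p in passed for p in g.predecessors(check)) and check(env):
            passed.add(check)
    total = g.number_of_nodes()
    return {"completion_ratio": len(passed) / total,
            "success": len(passed) == total}

state = {"file_exists": True, "file_text": "hello", "uploaded": False}
print(evaluate(build_evaluator(), state))  # completion_ratio 2/3, success False
```

With two of the three checks passing, the evaluator reports a completion ratio of 2/3 rather than an all-or-nothing failure, which is what enables the fine-grained analysis mentioned above.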
Use Cases
- Evaluating multimodal AI agents' capabilities
- Comparing performance across different language models
- Testing AI agents' adaptability in complex, real-world scenarios
- Generating dynamic, realistic task sequences for AI testing (a composition sketch follows this list)
- Benchmarking agent performance across different platforms
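To illustrate the task-generation use case, here is a hedged sketch of chaining two sub-task evaluator graphs into one composite task; `compose_tasks` and the node names are hypothetical, not part of CRAB:

```python
# Hypothetical sketch of sub-task composition: the composite graph is the
# union of the two sub-task graphs, plus edges from the first sub-task's
# final checks (sinks) to the second's initial checks (sources).
import networkx as nx

def compose_tasks(g1: nx.DiGraph, g2: nx.DiGraph) -> nx.DiGraph:
    combined = nx.compose(g1, g2)  # union of nodes and edges
    sinks = [n for n in g1 if g1.out_degree(n) == 0]
    sources = [n for n in g2 if g2.in_degree(n) == 0]
    combined.add_edges_from((s, t) for s in sinks for t in sources)
    return combined

# Chain two single-check sub-tasks into one composite task graph.
a, b = nx.DiGraph(), nx.DiGraph()
a.add_node("open_settings"); b.add_node("enable_wifi")
composite = compose_tasks(a, b)
print(list(composite.edges()))  # [('open_settings', 'enable_wifi')]
```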
Technical Specifications
- Environments: Ubuntu, Android
- Supported Models: GPT-4o, Claude 3, Gemini 1.5 Pro, open-source models
- Evaluation Metrics:
  - Completion Ratio (fraction of evaluator-graph nodes satisfied)
  - Success Rate (fraction of tasks completed in full)
  - Termination Reason Analysis (why an episode ended)
- Communication Settings: Single and Multi-agent
- Visual Prompt Technique: Set-of-Mark (SoM) prompting (see the sketch below)
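As a rough illustration of SoM-style visual prompting, assuming element bounding boxes have already been detected (the boxes below are hard-coded and `draw_marks` is a hypothetical helper, not CRAB's implementation):

```python
# Minimal Set-of-Mark sketch: each interactive element on a screenshot is
# overlaid with a numbered mark so the model can answer with an index
# ("click mark 1") instead of raw pixel coordinates.
from PIL import Image, ImageDraw

def draw_marks(screenshot: Image.Image,
               boxes: list[tuple[int, int, int, int]]) -> Image.Image:
    marked = screenshot.copy()
    draw = ImageDraw.Draw(marked)
    for idx, (x0, y0, x1, y1) in enumerate(boxes):
        draw.rectangle((x0, y0, x1, y1), outline="red", width=3)
        draw.text((x0 + 4, y0 + 4), str(idx), fill="red")
    return marked

# Example: two hypothetical element boxes on a blank 400x300 "screenshot".
img = draw_marks(Image.new("RGB", (400, 300), "white"),
                 [(20, 20, 120, 60), (150, 100, 300, 140)])
img.save("som_marked.png")
```

The agent can then respond with a mark index rather than coordinates, which is typically easier for a multimodal model to produce reliably.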