Self-operating computer

AI framework for autonomous computer operation

Self-operating computer is a framework that lets multimodal AI models control a computer by viewing the screen and issuing mouse and keyboard inputs. It is compatible with GPT-4, Gemini Pro Vision, Claude 3, and LLaVa, and offers voice input and OCR for element detection.

Details
Free
Open Source
Self-Operating Computer: Empowering AI to Navigate Digital Interfaces

Introduction

The Self-Operating Computer is a framework that enables multimodal AI models to operate a computer autonomously. The model uses the same inputs and outputs as a human operator: it views the screen and executes mouse and keyboard actions. This makes it suitable for AI-driven computer interaction and task automation.

Key Features

Multimodal Model Compatibility

  • Designed to work with various multimodal AI models
  • Currently integrated with:
    • GPT-4
    • Gemini Pro Vision
    • Claude 3
    • LLaVa

Flexible Operation Modes

  1. Standard Mode: Default operation using GPT-4 with OCR capabilities
  2. Voice Mode: Allows voice input for objectives
  3. Set-of-Mark (SoM) Prompting: Enhances visual grounding capabilities
  4. Optical Character Recognition (OCR): Improves element detection and interaction; this is the capability Standard Mode uses by default

Easy Installation and Usage

  • Simple pip installation: pip install self-operating-computer
  • Straightforward execution: run the operate command in a terminal

Customization and Extensibility

  • Support for multiple AI models
  • Ability to integrate new models and capabilities

How It Works

  1. The AI model views the computer screen
  2. Based on the objective, it decides on a series of mouse and keyboard actions
  3. The framework translates these decisions into actual computer operations
  4. The process repeats until the objective is achieved
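
The four steps above can be sketched as a simple observe-decide-act loop. This is a minimal illustration, not the framework's actual code: the names Action, run_loop, capture_screen, decide, and execute are hypothetical, and the model and screen are replaced with stand-ins so the sketch runs without any API key or desktop access.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Action:
    """Hypothetical action record; not the framework's real schema."""
    kind: str          # e.g. "click", "type", or "done"
    payload: str = ""  # e.g. a target description or text to type


def run_loop(
    objective: str,
    capture_screen: Callable[[], bytes],
    decide: Callable[[str, bytes, List[Action]], Action],
    execute: Callable[[Action], None],
    max_steps: int = 10,
) -> List[Action]:
    """Repeat: view the screen, pick an action, perform it, until done."""
    history: List[Action] = []
    for _ in range(max_steps):
        screenshot = capture_screen()                    # step 1: view the screen
        action = decide(objective, screenshot, history)  # step 2: choose an action
        if action.kind == "done":                        # step 4: objective reached
            break
        execute(action)                                  # step 3: perform it
        history.append(action)
    return history


def demo():
    """Drive the loop with a scripted stand-in for the model."""
    scripted = iter([
        Action("click", "search box"),
        Action("type", "hello"),
        Action("done"),
    ])
    performed: List[Action] = []
    history = run_loop(
        "search for hello",
        capture_screen=lambda: b"fake-screenshot-bytes",
        decide=lambda objective, shot, hist: next(scripted),
        execute=performed.append,
    )
    return history, performed
```

The max_steps cap is an assumption added here so the loop always terminates, even if the model never reports that the objective is complete.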

Use Cases

  • Automated software testing
  • User experience evaluation
  • Task automation for repetitive computer operations
  • Accessibility improvements for users with disabilities
  • AI-assisted computer troubleshooting

Future Developments

  • Agent-1-Vision: HyperwriteAI is developing a multimodal model with more accurate click location predictions
  • API Access: Upcoming API access to the Agent-1-Vision model
  • Expanded Model Support: Plans to integrate additional AI models into the framework

Getting Started

  1. Install the framework: pip install self-operating-computer
  2. Run the project: operate
  3. Enter your API key for the chosen model (e.g., OpenAI, Google AI Studio)
  4. Grant necessary permissions for screen recording and accessibility

Advanced Usage

Voice Mode

  • Install additional requirements: pip install -r requirements-audio.txt
  • Run with voice mode: operate --voice

Set-of-Mark Prompting

  • Use the command: operate -m gpt-4-with-som
  • Utilizes YOLOv8 for button detection (customizable)

OCR Mode

  • Default mode, enhances element detection
  • Run with: operate or operate -m gpt-4-with-ocr
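
For scripting, the commands above can be assembled programmatically. The helper below is a hypothetical convenience, not part of the framework; it uses only the operate command and the -m and --voice flags shown in this section, and it assumes that the two flags can be combined in one invocation.

```python
from typing import List, Optional


def build_operate_command(model: Optional[str] = None, voice: bool = False) -> List[str]:
    """Build an argument list for the operate CLI.

    Hypothetical helper: it only assembles arguments and does not check
    that the model name is one the installed version accepts.
    """
    command = ["operate"]
    if model is not None:
        command += ["-m", model]   # e.g. "gpt-4-with-som" or "gpt-4-with-ocr"
    if voice:
        command.append("--voice")  # assumption: may be combined with -m
    return command


if __name__ == "__main__":
    # Print the commands rather than running them, since operate
    # takes control of the mouse and keyboard once started.
    print(" ".join(build_operate_command()))
    print(" ".join(build_operate_command(model="gpt-4-with-som")))
    print(" ".join(build_operate_command(voice=True)))
```

The resulting list can be passed to subprocess.run(command, check=True) once the package is installed and the required permissions are granted.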

Conclusion

The Self-Operating Computer framework lets AI models operate a computer the way a human does, by viewing the screen and issuing mouse and keyboard actions. This supports automation, accessibility, and AI-assisted computing, and the range of applications should widen as the framework adds support for more models.