Fara

Automates web tasks by visually perceiving webpages and controlling the computer's mouse and keyboard to execute multi-step actions.

The gist

Fara is an open-source, 7-billion-parameter small language model from Microsoft, designed as a Computer Use Agent (CUA). It carries out multi-step tasks on websites by visually perceiving the screen and directly controlling the mouse and keyboard. The model is compact enough to run on-device, which is intended to reduce latency and improve user privacy by keeping data local. Fara is trained on a synthetic dataset of 145K web trajectories.

What it does

  • Automates web-based tasks by visually perceiving interfaces and controlling the mouse and keyboard.
  • Searches for information online and summarizes the results.
  • Fills out forms, manages online accounts, and books reservations or tickets.
  • Shops on retail websites and compares prices across different platforms.
  • Finds job postings or real estate listings based on user criteria.

How it works

A user provides a task via a command-line interface. Fara-7B, a vision-language model, then interprets what is visible on the screen and generates a sequence of mouse clicks and keyboard inputs, which are executed in the browser. The model can be self-hosted locally with vLLM or accessed through cloud services like Azure Foundry. Running it requires a Python environment and Playwright for browser automation.
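The perceive-then-act cycle described above can be sketched as a small loop. This is an illustrative sketch only, not Fara's actual code: the `Action` format, the `Browser` interface, `run_task`, `FakeBrowser`, and `scripted_policy` are all hypothetical names invented for this example. In a real setup, the policy would presumably call the hosted model, and the browser methods would be backed by Playwright calls such as `page.screenshot()`, `page.mouse.click(x, y)`, and `page.keyboard.type(text)`.

```python
from dataclasses import dataclass
from typing import Callable, List, Protocol


@dataclass
class Action:
    """Hypothetical action format; Fara's real output schema may differ."""
    kind: str          # "click", "type", or "done"
    x: int = 0
    y: int = 0
    text: str = ""


class Browser(Protocol):
    """Minimal browser interface assumed by this sketch."""
    def screenshot(self) -> bytes: ...
    def click(self, x: int, y: int) -> None: ...
    def type_text(self, text: str) -> None: ...


# A policy maps (task, screenshot, history so far) to the next action.
Policy = Callable[[str, bytes, List[Action]], Action]


def run_task(task: str, browser: Browser, policy: Policy,
             max_steps: int = 10) -> List[Action]:
    """Alternate between observing the screen and executing one action."""
    history: List[Action] = []
    for _ in range(max_steps):
        image = browser.screenshot()            # perceive
        action = policy(task, image, history)   # decide
        history.append(action)
        if action.kind == "click":              # act
            browser.click(action.x, action.y)
        elif action.kind == "type":
            browser.type_text(action.text)
        elif action.kind == "done":
            break
        else:
            raise ValueError(f"unknown action kind: {action.kind}")
    return history


class FakeBrowser:
    """Stand-in browser that records calls instead of driving a real page."""
    def __init__(self) -> None:
        self.log: List[str] = []

    def screenshot(self) -> bytes:
        self.log.append("screenshot")
        return b""

    def click(self, x: int, y: int) -> None:
        self.log.append(f"click {x} {y}")

    def type_text(self, text: str) -> None:
        self.log.append(f"type {text}")


def scripted_policy(task: str, image: bytes, history: List[Action]) -> Action:
    """Stand-in for the model: replays a fixed three-step script."""
    script = [Action("click", x=100, y=200),
              Action("type", text="laptops"),
              Action("done")]
    return script[len(history)]


browser = FakeBrowser()
history = run_task("search for laptops", browser, scripted_policy)
print(browser.log)
```

The `max_steps` cap is a design choice of this sketch: it keeps a misbehaving policy from acting indefinitely, which fits the recommendation to run such agents in a sandbox.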

Best for

Fara is best for developers and researchers who need a lightweight, locally deployable agent to automate complex, multi-step tasks on websites, such as data entry, comparison shopping, or testing web application workflows.

Watch out for

Fara is an experimental release. The creators recommend running it in a sandboxed environment and avoiding use with sensitive data or for high-risk domains.