Introduction
Key Takeaways
- By fine-tuning open-source language models on production-scale workflow data, we build web agents that outperform existing solutions at understanding and executing complex web tasks while significantly reducing computational cost.
- Our models, ScribeAgent-Small and ScribeAgent-Large, achieve state-of-the-art performance on Mind2Web and substantially improve on the WebArena benchmark, reaching a 53% task success rate and surpassing the previous best text-only results by 7.3 percentage points.
- In partnership with our co-authors at Carnegie Mellon University, we open-source our complete code, along with a version of our LLM trained on open-source datasets.
- If you are passionate about cutting-edge AI techniques and turning innovations into impactful features for millions of users, join our team! We are hiring for multiple open ML/AI engineering roles!
Read the full paper here: https://arxiv.org/abs/2411.15004
Key Results
We evaluated our proprietary models, ScribeAgent-Small (fine-tuned from Qwen2 7B) and ScribeAgent-Large (fine-tuned from Qwen2.5 32B), on three benchmarks:
- One internal benchmark:
- Our human-annotated proprietary workflow dataset, comprising 1,200 enterprise software workflows collected across 250 real-world domains
- Two external benchmarks:
- Mind2Web, a static text-based dataset for assessing the navigation ability of web agents
- WebArena, a dynamic web environment for end-to-end task completion benchmarking
We do not use any task-specific adaptation for Mind2Web or WebArena, even when extra training data is available. We summarize our results in the following sections.
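For concreteness, the sketch below shows one way the next-step prediction interface underlying these evaluations could be assembled: the model receives the task objective, the current page's HTML, and the action history, and must emit the next action. The field names and action syntax are illustrative assumptions, not the exact schema from our paper.

```python
# Illustrative sketch of a next-step prediction prompt. The exact field names,
# DOM preprocessing, and action syntax used by ScribeAgent are assumptions here.
def build_prompt(objective: str, dom_html: str, previous_actions: list[str]) -> str:
    """Combine the task objective, the current page HTML, and the action history
    into a single prompt asking the model for the next action."""
    history = "\n".join(f"{i + 1}. {a}" for i, a in enumerate(previous_actions)) or "None"
    return (
        f"Objective: {objective}\n\n"
        f"Current page HTML:\n{dom_html}\n\n"
        f"Previous actions:\n{history}\n\n"
        "Next action:"
    )


example_prompt = build_prompt(
    objective="Create a new issue titled 'Update docs'",
    dom_html="<html>...pruned DOM with node ids...</html>",
    previous_actions=["click <a node='12'>Issues</a>"],
)
```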
Proprietary Workflow Benchmark
On our proprietary test dataset, our ScribeAgent models significantly outperform the proprietary GPT-4o and GPT-4o mini in the next-step prediction setting, demonstrating the benefit of specialized fine-tuning over general-purpose LLMs. Moreover, while the non-fine-tuned Qwen2 performs poorly, fine-tuning on our dataset boosts its performance by nearly 6x, highlighting the importance of domain-specific data.
We also benchmarked the o1 suite of models on a subset of our test set. While o1-preview performs best among all general-purpose baselines, ScribeAgent models still outperform it by a wide margin, underscoring the value of fine-tuning on real-world web navigation data.
Notably, ScribeAgent models do not require any inference-time scaling, whereas the proprietary baselines are typically larger and slower at inference time. This makes the ScribeAgent family a better choice in terms of accuracy, latency, and cost.
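As a rough illustration of what "no inference-time scaling" means in practice, the sketch below performs a single greedy generation per step with a standard transformers pipeline. It uses the public Qwen2-7B-Instruct checkpoint as a stand-in for a fine-tuned ScribeAgent model, and the prompt format follows the earlier sketch, so both are assumptions rather than our exact setup.

```python
# Minimal single-pass inference sketch: one greedy generation per step, with no
# search, reflection, or other inference-time scaling. The checkpoint below is the
# public Qwen2 base, standing in for a fine-tuned ScribeAgent model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2-7B-Instruct"  # assumption: swap in your fine-tuned checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="auto"
)

# Prompt assembled as in the earlier sketch (objective + page HTML + action history).
prompt = (
    "Objective: Create a new issue titled 'Update docs'\n\n"
    "Current page HTML:\n<html>...pruned DOM with node ids...</html>\n\n"
    "Previous actions:\nNone\n\n"
    "Next action:"
)
inputs = tokenizer.apply_chat_template(
    [{"role": "user", "content": prompt}],
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)
output_ids = model.generate(inputs, max_new_tokens=256, do_sample=False)  # single greedy decode
next_action = tokenizer.decode(output_ids[0, inputs.shape[-1]:], skip_special_tokens=True)
print(next_action)
```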
Mind2Web
On the Mind2Web benchmark, ScribeAgent-Large achieves state-of-the-art zero-shot performance in the multi-stage setting and is competitive with the best fine-tuned baseline. We attribute our model’s strong performance to the diversity and high quality of the workflows in our dataset.
WebArena
Compared with existing text-only baselines, ScribeAgent-Large augmented with GPT-4o obtains the highest task success rate in all five categories, a 16% relative improvement in total success rate over the previous best AgentOccam results.
Notably, on the Reddit and GitLab tasks, whose domains are more realistic and thus closer to those in our training data, ScribeAgent-Small generalizes better and achieves higher task success rates than on the other domains.
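For readers curious how the GPT-4o augmentation could be wired, the sketch below shows one plausible division of labor: the fine-tuned model proposes the next action grounded in the page HTML, and GPT-4o rewrites that proposal into a WebArena-style command. This split, along with the helper `scribe_agent_next_action`, is a hypothetical illustration rather than the exact pipeline used in our experiments.

```python
# Hedged sketch of augmenting a fine-tuned web agent with GPT-4o for a
# WebArena-style environment. The division of labor and the helper below are
# illustrative assumptions, not the exact pipeline from the paper.
from openai import OpenAI

client = OpenAI()


def scribe_agent_next_action(objective: str, page_html: str, history: list[str]) -> str:
    """Hypothetical placeholder for the fine-tuned model's next-step prediction
    (see the single-pass inference sketch above)."""
    raise NotImplementedError


def translate_action(objective: str, proposed_action: str) -> str:
    """Ask GPT-4o to rewrite an HTML-grounded action proposal into a WebArena-style
    command such as `click [id]`, `type [id] [text]`, or `stop [answer]`."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "system",
                "content": "Rewrite the proposed web action as a single WebArena command.",
            },
            {
                "role": "user",
                "content": f"Objective: {objective}\nProposed action: {proposed_action}",
            },
        ],
    )
    return response.choices[0].message.content


def run_step(objective: str, page_html: str, history: list[str]) -> str:
    proposed = scribe_agent_next_action(objective, page_html, history)
    return translate_action(objective, proposed)
```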
Conclusion
With the ScribeAgent family of models, we showcase the power of domain-specific LLMs and how fine-tuning on high-quality, real-world workflow data can benefit specialized web agents. To experience the power of AI and stay tuned for our upcoming releases, try Scribe today.