Overview
The ML Alignment & Theory Scholars (MATS) Program, together with researchers from Anthropic and University College London (UCL), needed high-quality data to test scalable oversight mechanisms for ensuring AI safety and alignment.
The project demanded a highly skilled AI training workforce to judge debates between language models, adaptable task guidelines that could keep pace with the iterative nature of research, and direct collaboration channels for quick feedback between researchers and workers.
Pareto.AI assisted the consortium of researchers in collecting high-quality data from human-judged debates between LLMs.
Our advantage: Why they chose Pareto.AI
The researchers ran into significant problems finding the right partner for this project. Crowd work providers were hesitant to take on the project's complexity, particularly its need for iterative weekly changes, customized workflows, and direct communication with labelers throughout.
Some notable providers in this space demonstrated a lack of responsiveness, taking over a month to reply to initial inquiries and showing reluctance to facilitate direct communication between the researchers and the workers. Their disinterest in supporting a project of this size and complexity revealed a gap between the researchers' needs and what most data labeling vendors could offer.
Existing crowd work platforms were not built to handle projects of this nature: they are designed primarily for high-volume, repeatable tasks, not for work that requires nuanced understanding, expert judgment, or the ability to iterate rapidly as guidelines evolve.
Additionally, most existing systems minimize direct contact with workers to simplify management and reduce overhead, which can lead to misunderstandings, generic feedback, and slower resolution of issues. For a project like this, that lack of direct engagement and adaptability would have hindered its success.
Solution
Pareto sourced, onboarded, and trained 20 experts in less than a month through referral-based sourcing, leveraging our extensive network of highly skilled workers.
We secured and retained top-tier talent by committing upfront to guaranteed working hours for a carefully vetted group of expert workers over an extended period. We also implemented a thorough testing and qualification process, which included a week of feedback.
The data collection project involved judging debates between LLMs, where the goal was to choose the correct answer to the question under debate. Pareto oversaw the data collection process from start to finish, providing daily updates on labelers' debate judgements and ensuring adherence to quality standards.
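To make the workflow concrete, here is a minimal sketch of what a single debate-judgement record might look like. The `DebateJudgement` class, its field names, and the example values are illustrative assumptions, not the researchers' actual schema.

```python
# Minimal sketch (hypothetical schema, not the researchers' actual one) of how a
# single human debate judgement might be recorded during data collection.
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class DebateJudgement:
    question: str          # the question the two debaters argue over
    answer_a: str          # answer defended by debater A
    answer_b: str          # answer defended by debater B
    transcript: list[str]  # alternating debate turns shown to the judge
    judge_id: str          # vetted expert worker who judged the debate
    chosen_answer: str     # "A" or "B" -- the judge's verdict
    confidence: float      # judge's self-reported confidence, 0.0-1.0
    judged_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

# Example record of the kind that could be aggregated into daily quality updates.
judgement = DebateJudgement(
    question="Which city does the passage describe?",
    answer_a="Lisbon",
    answer_b="Porto",
    transcript=["Debater A: ...", "Debater B: ...", "Debater A: ...", "Debater B: ..."],
    judge_id="expert-007",
    chosen_answer="A",
    confidence=0.8,
)
```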
Bugs and high latency were identified as issues that could hurt workers' performance. To address this, we helped the researchers implement a robust error recovery system and thoroughly tested the platform before rolling it out to workers. This was possible because our system is intentionally designed to accommodate quick pivots and adjustments.
To facilitate collaboration between workers and requesters, Pareto established direct communication channels, enabling real-time interaction between researchers and workers.
By prioritizing open dialogue and immediate feedback, Pareto's model fostered a more collaborative and adaptive environment than traditional platforms can easily provide for projects like this.
Results
As a result of Pareto's high-quality, comprehensive data, researchers from MATS, Anthropic, and UCL were able to derive crucial insights:
- As language models become more capable, debate between LLMs enables scalable oversight by non-expert human evaluators.
- Language models optimized for "judge approval" become more truthful in the course of debate. In other words, debating with more persuasive LLMs leads to more truthful answers.
- Debate with language models leads to higher judge accuracy than consultancy, in which a single model argues for one assigned answer (see the simplified sketch after this list).
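For readers unfamiliar with these protocols, the sketch below illustrates the difference under simplified assumptions: in debate, two models argue for opposing answers before a judge, while in consultancy a single model argues only for the answer it was assigned. The function names and stub callables are hypothetical and are not drawn from the researchers' codebase.

```python
# Simplified illustration of the two oversight protocols compared in the study.
# These are assumptions for exposition, not the paper's implementation.

def run_debate(question, answer_a, answer_b, debater, judge, rounds=3):
    """Two copies of `debater` argue for opposing answers; the judge then picks one."""
    transcript = []
    for _ in range(rounds):
        # Each round, one debater defends answer_a and the other defends answer_b.
        transcript.append(debater(question, defend=answer_a, transcript=transcript))
        transcript.append(debater(question, defend=answer_b, transcript=transcript))
    # The (human) judge sees only the question, candidate answers, and transcript.
    return judge(question, [answer_a, answer_b], transcript)

def run_consultancy(question, answer_a, answer_b, consultant, judge, assigned, rounds=3):
    """A single consultant argues only for the answer it was assigned."""
    transcript = []
    for _ in range(rounds):
        transcript.append(consultant(question, defend=assigned, transcript=transcript))
    return judge(question, [answer_a, answer_b], transcript)

# Toy usage with stubs standing in for real LLM debaters and a human judge.
stub_debater = lambda question, defend, transcript: f"Argument for {defend}"
stub_judge = lambda question, answers, transcript: answers[0]
print(run_debate("Which answer does the passage support?", "A", "B", stub_debater, stub_judge))
```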
Powerful insights like these allow MATS, Anthropic, and UCL researchers to better understand where they need to focus their efforts next. Furthermore, these results pave the way for future research on adversarial oversight methods and protocols that enable non-experts to elicit truth from sophisticated language models.
You can read MATS' research paper on the viability of aligning AI models with debate in the absence of ground truth here.