
Responsible Scaling Policy (RSP) Overview

1. Limits

Definition: Limits refer to specific dangerous capabilities or behaviors that, if observed in an AI system, would indicate that it might be unsafe to continue scaling or improving the AI. These are essentially "red lines" that shouldn't be crossed without serious consideration.

Example: Suppose an AI begins to show signs that it could meaningfully assist in developing bioweapons. This is a dangerous capability because it could have catastrophic consequences if misused. The RSP should outline specific tests or observations (e.g., particular tasks the AI can complete) that would trigger a halt in development. For instance, if an AI demonstrates the ability to walk someone with only a basic biology degree through the steps of bioweapons development, this would be a sign to pause further AI training or deployment until the risks are fully understood and mitigated.
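
One way to make such red lines operational is to tie each limit to a named evaluation and a pre-committed score at which development pauses. The sketch below is only an illustration of that idea; the task names, scores, and thresholds are assumptions, not any lab's actual evaluation suite.

```python
from dataclasses import dataclass

@dataclass
class CapabilityLimit:
    """A 'red line': an evaluation task and the score at which scaling should pause."""
    name: str
    description: str
    threshold: float  # assumed evaluation score that triggers a pause

# Hypothetical limits and thresholds, purely for illustration.
LIMITS = [
    CapabilityLimit(
        name="bioweapons_uplift",
        description="Walks a non-expert through key bioweapon development steps",
        threshold=0.2,
    ),
    CapabilityLimit(
        name="autonomous_replication",
        description="Completes core autonomous replication and adaptation (ARA) tasks",
        threshold=0.5,
    ),
]

def crossed_limits(scores: dict) -> list:
    """Return every limit whose measured score meets or exceeds its threshold."""
    return [lim for lim in LIMITS if scores.get(lim.name, 0.0) >= lim.threshold]

# Example usage with made-up evaluation results for one model checkpoint.
if __name__ == "__main__":
    latest_scores = {"bioweapons_uplift": 0.05, "autonomous_replication": 0.6}
    for lim in crossed_limits(latest_scores):
        print(f"Limit crossed: {lim.name} - pause scaling pending review")
```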

Importance: Setting clear limits helps ensure that the development of AI systems is paused before they reach a point where they could cause serious harm. It replaces open-ended, in-the-moment debates about whether it is safe to proceed, which could delay necessary action until it is too late, with thresholds agreed on in advance.

2. Protections

Definition: Protections are the safety measures that an AI developer puts in place to contain catastrophic risks. These measures are necessary to ensure that even if an AI shows dangerous capabilities, it can’t cause significant harm.

Example: For an AI that might be capable of developing bioweapons, protections could include strict information security protocols to prevent the AI’s knowledge from being accessed by unauthorized users, and training the AI to refuse to provide dangerous information, even under extreme circumstances.
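
As a loose illustration of one protection layer, the sketch below screens model output before it reaches a user. The keyword list and function names are hypothetical stand-ins; real protections rely on trained safety classifiers, access controls, information security, and refusal training rather than a simple filter like this.

```python
import logging

# Hypothetical markers, purely for illustration; a real system would use trained
# safety classifiers and expert-defined criteria, not keyword matching.
HAZARD_MARKERS = ("synthesize the pathogen", "enhance transmissibility")

def screen_response(model_response: str) -> str:
    """Screen a model's output before returning it to the user (one layer of defense)."""
    lowered = model_response.lower()
    if any(marker in lowered for marker in HAZARD_MARKERS):
        # Log the blocked completion so the security team can review the incident.
        logging.warning("Hazardous completion blocked and flagged for review.")
        return "I can't help with that request."
    return model_response

# Example usage with a benign response.
print(screen_response("Here is an overview of basic lab safety practices."))
```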

Importance: Protections are critical because they act as barriers that prevent an AI from being misused or causing harm. If these protections are compromised, the AI could pose a significant risk, making it necessary to halt its development until the protections are restored.

3. Evaluation

Definition: Evaluation involves regularly testing the AI to catch early signs of dangerous capabilities. This includes evaluating the AI during different stages of its development to ensure it hasn’t acquired dangerous capabilities that exceed the protective measures in place.

Example: An AI developer might create checkpoints during the training process at which it tests the AI’s capabilities. If the AI starts completing tasks that suggest it is capable of autonomous replication and adaptation (ARA), this would be an early warning sign. The RSP should outline specific evaluation procedures, such as making a copy of the AI at regular intervals during training and testing it under controlled conditions to see whether it has developed any dangerous capabilities.
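
A minimal sketch of what such checkpoint evaluations could look like during training appears below. Every function in it (train_one_interval, snapshot_model, run_capability_evals) is a placeholder standing in for a lab's real training and evaluation infrastructure, and the pause threshold is an assumption for illustration.

```python
import random

EVAL_INTERVAL = 1        # evaluate after every training interval
PAUSE_THRESHOLD = 0.5    # assumed ARA task-completion rate that triggers a pause

def train_one_interval(model):
    """Placeholder for one interval of real training."""
    return model

def snapshot_model(model):
    """Placeholder: copy the model so evaluations run on a frozen checkpoint."""
    return model

def run_capability_evals(model) -> float:
    """Placeholder: return the fraction of ARA-style tasks the checkpoint completes."""
    return random.random()

def train_with_checkpoint_evals(model, num_intervals: int):
    for interval in range(num_intervals):
        model = train_one_interval(model)
        if interval % EVAL_INTERVAL == 0:
            ara_score = run_capability_evals(snapshot_model(model))
            if ara_score >= PAUSE_THRESHOLD:
                # Early warning: halt training and escalate for review (see "Response").
                return model, f"paused_for_review (ARA score {ara_score:.2f})"
    return model, "training_completed"

if __name__ == "__main__":
    print(train_with_checkpoint_evals(model=object(), num_intervals=10))
```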

Importance: Regular evaluations ensure that dangerous capabilities are identified before they become fully developed. This allows the AI developer to pause and reassess the situation before the AI crosses a safety threshold.

4. Response

Definition: The response component of the RSP outlines what actions the AI developer will take if the AI’s capabilities surpass the defined limits and it’s not possible to quickly improve protections. The default action should be to pause further development until the situation is under control.

Example: If an AI is found to be capable of ARA and the developer doesn’t have the necessary protective measures in place, the RSP should state that all development on the AI will pause. The AI should be labeled as “handle with care,” and its use should be restricted to specific research purposes only. The AI should not be deployed commercially, and all interactions with the AI should be carefully monitored and logged.
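
The default logic described here (pause and restrict unless protections are adequate) could be summarized roughly as follows; the field names and the single "protections adequate" flag are simplifications for illustration only, not a prescribed decision procedure.

```python
from dataclasses import dataclass

@dataclass
class ResponsePlan:
    pause_training: bool
    allowed_uses: tuple           # e.g. ("internal_safety_research",)
    commercial_deployment: bool
    log_all_interactions: bool

def decide_response(limit_crossed: bool, protections_adequate: bool) -> ResponsePlan:
    """Illustrative default: if a limit is crossed and protections cannot quickly be
    brought up to standard, pause development and restrict the model to research use."""
    if limit_crossed and not protections_adequate:
        return ResponsePlan(
            pause_training=True,
            allowed_uses=("internal_safety_research",),
            commercial_deployment=False,
            log_all_interactions=True,
        )
    # Otherwise development may continue under the normal protective measures.
    return ResponsePlan(
        pause_training=False,
        allowed_uses=("research", "commercial"),
        commercial_deployment=True,
        log_all_interactions=True,
    )

# Example: an ARA limit was crossed before adequate protections were in place.
print(decide_response(limit_crossed=True, protections_adequate=False))
```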

Importance: Having a clear response plan ensures that the AI developer can quickly and effectively mitigate risks if the AI’s capabilities exceed what can be safely managed. This helps prevent potential disasters by containing the AI and stopping further development until it’s safe to proceed.

5. Accountability

Definition: Accountability ensures that the commitments made in the RSP are followed through as intended. This includes verification processes, opportunities for external review, and clear procedures for revising the RSP when necessary.

Verification: The RSP should include processes for regularly checking that all the commitments are being met. This could involve assigning specific employees to write reports on evaluation results, protective measures, and responses to dangerous capabilities. These reports should be shared with relevant stakeholders, including a third party for external verification.

External Review: The RSP should allow for critique and feedback from parties outside the organization, such as independent experts. This external review helps ensure that the RSP is robust and not biased by internal pressures.

Revising the RSP: While it’s important to allow for changes to the RSP as new information or circumstances arise, these changes should not be made hastily or in secret. The RSP should include a process for revising the policy, which involves disseminating the proposed changes to key stakeholders, allowing time for feedback, and requiring approval from the board of directors.

Importance: Accountability ensures that the RSP is more than just a document—it’s a living policy that guides the safe and responsible development of AI. By involving external reviewers and allowing for transparent revisions, the AI developer can maintain trust and credibility with stakeholders.
