ASL-2

ASL-2, or "AI Safety Level 2," as defined in Anthropic's Responsible Scaling Policy, denotes the safety and security measures required for current state-of-the-art AI models: systems that show early signs of dangerous capabilities but do not yet pose a significant risk of catastrophe.

Key Aspects of ASL-2:

  1. Capabilities and Threat Models:

    • ASL-2 models do not pose a significant risk of catastrophe but may show early signs of capabilities that could lead to catastrophic misuse if not properly contained.

    • For example, a model might provide bioweapon-related information, but not reliably enough for that information to be practically dangerous.

  2. Containment Measures:

    • Although today's ASL-2 models do not pose significant risks merely by existing, Anthropic commits to treating AI model weights as core intellectual property, with an emphasis on cybersecurity and insider threat prevention.

    • Security commitments include limiting access to model weights to essential personnel, establishing a robust insider threat detection program, and ensuring secure environments for storing and working with model weights.

  3. Deployment Measures:

    • Although the catastrophic risk is lower, deploying ASL-2 models still involves trust and safety, legal, and ethical risks. To mitigate these, Anthropic commits to several measures:

      • Model Cards: Publishing detailed model cards for new models to describe their capabilities, limitations, evaluations, and intended use cases.

      • Acceptable Use Policy (AUP): Enforcing an AUP that restricts high-risk use cases, including catastrophic harm scenarios, and maintaining the ability to restrict access in cases of extreme misuse.

      • Vulnerability Reporting: Providing clear paths for users to report harmful or dangerous model outputs or use cases.

      • Harm Refusal Techniques: Training models to refuse requests that could lead to harm, using techniques like Constitutional AI (a sketch of its critique-and-revision loop appears after this list).

      • Trust & Safety (T&S) Tooling: Requiring enhanced trust and safety detection and enforcement, such as classifiers to identify harmful user prompts and model completions.
