AI Guardrails Implementation in a Disconnected Environment
Status
- Draft
- Proposed
- Accepted
- Deprecated
Context
Acronyms used in the ADR have to be defined in the context section.
- FPR: False Positive Rate
- FNR: False Negative Rate
- P50: 50th Percentile
- P99: 99th Percentile
In high-security environments, internet access is restricted, making it impossible to use cloud-based moderation services such as Azure AI Safety.
To ensure safe and compliant AI operations in these disconnected (air-gapped) environments, we need a fully local guardrails solution that can filter inappropriate content without external dependencies.
Key challenges include:
- Model Availability: Selecting a model that can operate locally with acceptable accuracy and latency.
- Dataset Selection: Choosing appropriate datasets to evaluate and benchmark the model without external API access.
- Performance Optimization: Ensuring the model runs efficiently on available hardware, which may have limited GPU/CPU resources.
- Security & Compliance: Avoiding proprietary cloud services that require internet access while maintaining high moderation standards.
Decision
The following decisions have been made to enable guardrails on a disconnected environment:
- Use a locally deployed model for content moderation. Based on our evaluation, Llama-Guard-3-1B is selected due to its low False Positive Rate (FPR), acceptable False Negative Rate (FNR), and low latency compared to other models.
- Dataset Selection: The evaluation pipeline will primarily use ai-safety-institute/AgentHarm for content safety assessments and a golden dataset for false negative detection.
- Performance Considerations: The model should be optimized for low latency and reduced memory consumption using quantization or GPU acceleration where available.
- Security Compliance: The solution must ensure that no data is sent outside the local environment, meeting privacy and security requirements.
Inappropriate categories
The following are a selection of categories identified by the MLCommons AI Safety working group deemed Hazard Categories. These categories were used in evaluating and benchmarking LLM models on how well they could respond to dangerous and/or inappropriate prompts. Refer to the AI Safety v0.5 POC for more details.
- S1: Violent Crimes
- S2: Non-Violent Crimes
- S3: Sex Crimes
- S4: Child Exploitation
- S5: Defamation
- S6: Specialized Advice
- S7: Privacy
- S8: Intellectual Property
- S9: Indiscriminate Weapons
- S10: Hate
- S11: Self-Harm
- S12: Sexual Content
- S13: Elections
Decision Drivers
- Self-Sufficiency: The system must operate independently without cloud APIs.
- Performance Efficiency: The model should provide low latency and high accuracy within available computational limits.
- Security & Compliance: Ensuring that the solution meets air-gapped environment requirements while maintaining guardrails effectiveness.
- Scalability: Future-proofing the setup to allow easy integration of new datasets and models.
Model Comparison
We evaluated multiple models based on False Positive Rate (FPR), False Negative Rate (FNR), Latency (P50, P99), and Setup Complexity:
| Model | FPR | FNR | Latency (P50/P99) | Setup Complexity | Notes |
|---|---|---|---|---|---|
| Llama-Guard-3-1B | 0.61% | 19% | 0.02s / 0.04s | Low | Selected for its balance of accuracy, speed, and ease of setup. |
| Llama-Guard-3-8B | 0.2% | 38% | 0.07s / 0.11s | High | Higher accuracy but significantly higher FNR and slower latency. |
Based on this comparison, Llama-Guard-3-1B was chosen due to its low latency, reasonable FPR/FNR, and ease of deployment in an offline environment.
Dataset Evaluation
We evaluated multiple datasets to determine which ones are best suited for testing content moderation in a disconnected environment.
| Dataset | Rows | Categories | License | Notes |
|---|---|---|---|---|
| ai-safety-institute/AgentHarm | 416 | ['Disinformation', 'Harassment', 'Drugs', 'Fraud', 'Hate', 'Cybercrime', 'Sexual', 'Copyright'] | MIT | Selected for content safety evaluation. |
| Deepset Prompt Injection | 662 | Prompt Injection | Apache 2.0 | Not used in this phase (only relevant for security). |
| xTRam1/safe-guard-prompt-injection | 10,296 | Prompt Injection | Unknown | Not selected due to lack of licensing information. |
| Babelscape/ALERT | 44,000 | 6 macro / 32 micro categories | CC BY-NC-SA 4.0 | Rejected due to restrictive license. |
| Anthropic/hh-rlhf | Unknown | Human preference (helpfulness/harmlessness) | MIT | Rejected due to potential bias (used for LLM fine-tuning). |
| Hate Speech and Offensive Language Dataset | 25,000 | Hate Speech, Offensive, Neutral | CC0 (Public Domain) | Not used due to domain mismatch (Twitter-based). |
Azure AI Safety Adversarial Simulator is included in the table, however it is not dataset but rather a tool.
- Not a dataset but a tool to generate inappropriate content and test it
- Can manage both QA and conversation scenarios
- License: No license issues with the simulator
Final selection
- ai-safety-institute/AgentHarm for content safety evaluation.
- Golden dataset for false negative detection.
Considered Options
Option 1: Locally Deployed Llama-Guard-3-1B (Chosen)
- Pros:
- Works offline with no external dependencies.
- Good balance of FPR, FNR, and latency.
- Can be further optimized via quantization.
- Cons:
- May require additional hardware optimizations.
- Lacks advanced cloud-based security checks.
Other Option: Fine-tuned Phi-3.5-mini-instruct
Even though it is highlighted here, It requires further exploration.
Initial findings as follows;
- Pros:
- Custom fine-tuning could improve performance for specific use cases.
- Cons:
- High resource requirements, making it unsuitable for an offline setup.
- Answer parsing is difficult.
Other Option: Azure AI Safety
Even though it is highlighted here, It requires further exploration.
Initial findings as follows;
- Pros:
- Comprehensive moderation capabilities.
- Cons:
- Not viable in a disconnected environment since it requires internet access.
Consequences
By adopting this decision:
- Self-Sufficiency: The solution operates independently, without cloud dependencies.
- Optimized Performance: The selected model balances latency and accuracy within local hardware constraints.
- Scalability: Future enhancements, such as fine-tuning and new datasets, can be added without disrupting the offline setup.
- Security & Compliance: The approach ensures full compliance with air-gapped environment policies.
Future Considerations
- Hardware Optimization: Investigate quantization to reduce model size and optimize inference speed.
- Security Dataset Expansion: Evaluate prompt injection datasets to enhance guardrails against adversarial inputs.
- Multi-Language Support: Provide more dataset with different language support.
- Fine-tuning Exploration: Consider lightweight fine-tuning methods to improve accuracy without significantly increasing resource usage.
- Deployment Strategy Exploration: Consider packaged, k3s and other deployment strategies to deliver application.
AI and automation capabilities described in this scenario should be implemented following responsible AI principles, including fairness, reliability, safety, privacy, inclusiveness, transparency, and accountability. Organizations should ensure appropriate governance, monitoring, and human oversight are in place for all AI-powered solutions.