
Enterprise Knowledge Base

The AccelOS Knowledge Base is a central repository for your team’s operational knowledge. It stores runbooks, procedures, and learnings that the AI agent can reference during incident investigations.

Why a Knowledge Base?

When an incident occurs, your team’s collective knowledge is often scattered across wikis, Slack threads, and tribal memory. The AccelOS Knowledge Base:
  • Centralizes operational knowledge in one searchable location
  • Enables AI to reference your team’s expertise during investigations
  • Captures learnings from incidents to prevent future issues
  • Standardizes procedures across your team

Content Types

Alert Runbooks

Step-by-step procedures for handling specific alerts and incident types.

Integrations

Service-specific troubleshooting guides tied to your integrations.

Generic Instructions

General operational procedures and best practices.

Memories

Learnings from past investigations and incidents.

How It Works

1. Create Documentation

Write runbooks and procedures using markdown. Each document includes:
  • Title - Clear, searchable name
  • Category - Alert Runbook, Integration, Generic Instruction, or Memory
  • Content - Step-by-step procedures, context, and commands
  • Metadata - Associated alerts, integrations, or services
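
The four fields above can be pictured as a small record. The sketch below is illustrative only: `KBDocument`, `Category`, and the field names are hypothetical and not part of any AccelOS API. Only the value `alert_runbooks` appears in the sample document's front matter; the other three category values are assumed spellings.

```python
from dataclasses import dataclass, field
from enum import Enum


class Category(Enum):
    # "alert_runbooks" comes from the sample front matter; the remaining
    # values are assumed spellings, not confirmed identifiers.
    ALERT_RUNBOOK = "alert_runbooks"
    INTEGRATION = "integrations"
    GENERIC_INSTRUCTION = "generic_instructions"
    MEMORY = "memories"


@dataclass
class KBDocument:
    title: str                    # clear, searchable name
    category: Category
    content: str                  # markdown body
    metadata: dict[str, str] = field(default_factory=dict)


doc = KBDocument(
    title="High CPU Usage Runbook",
    category=Category.ALERT_RUNBOOK,
    content="## Symptoms\n- CPU usage above 90% for more than 5 minutes\n",
    metadata={"alert_name": "HighCPUUsage"},
)
```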

2. AI Agent Access

During investigations, the AI agent automatically:
  • Searches the knowledge base for relevant documents
  • References runbooks that match the incident type
  • Suggests procedures based on your team’s documented knowledge
  • Cites sources so you can verify recommendations
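
One way to picture the matching and citation steps: given an incoming alert name, select the documents whose metadata names that alert, then return their titles as citations. This is a toy sketch, not a description of how the AccelOS agent is implemented; `find_runbooks`, `cite`, the dictionary layout, and the exact-match rule are all assumptions made for illustration.

```python
def find_runbooks(alert_name: str, documents: list[dict]) -> list[dict]:
    """Return documents whose (hypothetical) metadata names the given alert."""
    return [
        doc for doc in documents
        if doc.get("metadata", {}).get("alert_name") == alert_name
    ]


def cite(documents: list[dict]) -> list[str]:
    """Build readable source citations so a recommendation can be verified."""
    return [f'{doc["title"]} ({doc["category"]})' for doc in documents]


documents = [
    {"title": "High CPU Usage Runbook", "category": "alert_runbooks",
     "metadata": {"alert_name": "HighCPUUsage"}},
    {"title": "Incident Response Protocol", "category": "generic_instructions",
     "metadata": {}},
]

matches = find_runbooks("HighCPUUsage", documents)
citations = cite(matches)
```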

3. Continuous Improvement

After each incident:
  • Document learnings in the knowledge base
  • Update runbooks based on what worked
  • Add new procedures for novel issues
  • Build organizational memory over time

Document Structure

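Each knowledge base document follows a consistent structure: a metadata header between two --- lines, followed by a markdown body. For example, an alert runbook: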
---
category: alert_runbooks
alert_name: HighCPUUsage
integration_id: <uuid>
---

# High CPU Usage Runbook

## Symptoms
- CPU usage above 90% for more than 5 minutes
- Increased latency on affected services

## Investigation Steps
1. Check which processes are consuming CPU
2. Review recent deployments
3. Check for traffic spikes

## Remediation
1. Scale horizontally if load-related
2. Restart affected pods if process is stuck
3. Roll back if caused by recent deployment

## Escalation
Contact the platform team if the issue persists for more than 30 minutes.
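
The block between the `---` lines is metadata and everything after it is the markdown body. A minimal sketch of splitting the two, assuming the metadata is flat `key: value` pairs as in the example. The function name `split_front_matter` is hypothetical, and a real parser would more likely rely on a YAML library.

```python
def split_front_matter(text: str) -> tuple[dict[str, str], str]:
    """Split a document into (metadata, body).

    Assumes the document starts with a '---' line, followed by flat
    'key: value' pairs, followed by a closing '---' line.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, text  # no front matter present

    metadata: dict[str, str] = {}
    for index, line in enumerate(lines[1:], start=1):
        if line.strip() == "---":
            body = "\n".join(lines[index + 1:]).lstrip("\n")
            return metadata, body
        key, sep, value = line.partition(":")
        if sep:
            metadata[key.strip()] = value.strip()

    return {}, text  # closing '---' never found; treat it all as body


sample = """---
category: alert_runbooks
alert_name: HighCPUUsage
---

# High CPU Usage Runbook
"""

metadata, body = split_front_matter(sample)
```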

Categories Explained

Alert Runbooks

Alert runbooks are tied to specific alerts or incident types. They provide:
  • Clear symptoms to identify the issue
  • Step-by-step investigation procedures
  • Remediation actions
  • Escalation paths
Best for: Recurring alerts, known failure modes, standard operating procedures.

Integrations

Integration docs contain service-specific knowledge for your connected tools:
  • How to interpret specific metrics
  • Common queries and commands
  • Service-specific troubleshooting
  • Architecture context
Best for: Service-specific knowledge, tool documentation, integration guides.

Generic Instructions

General procedures that apply across services:
  • Incident response protocols
  • Communication templates
  • General debugging techniques
  • Tool usage guides
Best for: Team-wide procedures, best practices, onboarding materials.

Memories

Learnings captured from past investigations:
  • Root cause analyses
  • Novel problems and solutions
  • Lessons learned
  • Post-mortem insights
Best for: Building institutional knowledge, preventing repeat incidents.

Features

Version History

All documents are version-controlled:
  • Track changes over time
  • See who made edits
  • Revert to previous versions if needed
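
A minimal sketch of what version tracking for a document could look like, assuming an append-only history in which a revert is recorded as a new version instead of deleting later ones. `VersionedDocument`, `Version`, and their methods are hypothetical; the source does not describe how AccelOS stores versions.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class Version:
    number: int
    author: str          # who made the edit
    content: str
    saved_at: datetime


@dataclass
class VersionedDocument:
    title: str
    versions: list[Version] = field(default_factory=list)

    def save(self, author: str, content: str) -> Version:
        """Append a new version; earlier versions are never modified."""
        version = Version(
            number=len(self.versions) + 1,
            author=author,
            content=content,
            saved_at=datetime.now(timezone.utc),
        )
        self.versions.append(version)
        return version

    def revert(self, author: str, to_number: int) -> Version:
        """Restore old content by saving it again as the newest version."""
        if not 1 <= to_number <= len(self.versions):
            raise ValueError(f"no version {to_number}")
        return self.save(author, self.versions[to_number - 1].content)

    @property
    def current(self) -> Version:
        return self.versions[-1]


doc = VersionedDocument("High CPU Usage Runbook")
doc.save("alice", "1. Check which processes are consuming CPU")
doc.save("bob", "1. Restart everything")
doc.revert("alice", to_number=1)
```

Recording the revert as a third version keeps the full edit trail intact, so the rejected change by `bob` remains visible in the history.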

AI-Generated Reviews

When you save a document, the system can:
  • Suggest improvements for clarity
  • Identify missing sections
  • Flag potential issues
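
The "identify missing sections" check can be approximated without AI by comparing a runbook's headings against an expected list. The sketch below is a plain rule-based stand-in, not the AI review itself. `REQUIRED_SECTIONS` is taken from the headings of the sample runbook, and `missing_sections` is a hypothetical helper.

```python
import re

# Section names borrowed from the sample alert runbook.
REQUIRED_SECTIONS = ["Symptoms", "Investigation Steps", "Remediation", "Escalation"]


def missing_sections(markdown: str, required: list[str] = REQUIRED_SECTIONS) -> list[str]:
    """Return required section names that have no matching '## ' heading."""
    headings = {
        match.group(1).strip().lower()
        for match in re.finditer(r"^##[ \t]+(.+)$", markdown, flags=re.MULTILINE)
    }
    return [name for name in required if name.lower() not in headings]


draft = """# High CPU Usage Runbook

## Symptoms
- CPU usage above 90% for more than 5 minutes

## Remediation
1. Scale horizontally if load-related
"""

gaps = missing_sections(draft)
```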

Search & Discovery

Find relevant documents through:
  • Full-text search
  • Category filters
  • Integration/alert associations
  • AI-powered recommendations
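
Full-text search combined with a category filter can be illustrated with a naive scorer: count how often the query terms occur in each document's title and content, optionally restricting results to one category. Production search would normally use an index and a ranking model; `search`, the dictionary layout, and the scoring rule here are assumptions for illustration only.

```python
import re
from typing import Optional


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def search(query: str, documents: list[dict],
           category: Optional[str] = None) -> list[dict]:
    """Rank documents by how many query-term occurrences they contain."""
    terms = set(tokenize(query))
    scored = []
    for doc in documents:
        if category is not None and doc["category"] != category:
            continue  # category filter
        words = tokenize(doc["title"] + " " + doc["content"])
        score = sum(1 for word in words if word in terms)
        if score > 0:
            scored.append((score, doc))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored]


documents = [
    {"title": "High CPU Usage Runbook", "category": "alert_runbooks",
     "content": "Check which processes are consuming CPU. Review recent deployments."},
    {"title": "Post-mortem: CPU spike after deploy", "category": "memories",
     "content": "Root cause was a runaway worker process."},
    {"title": "Communication templates", "category": "generic_instructions",
     "content": "Status page wording for customer updates."},
]

all_hits = search("cpu usage", documents)
runbook_hits = search("cpu usage", documents, category="alert_runbooks")
```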

Best Practices

Begin by documenting your most frequent or critical alerts:
  • Alerts that page on-call regularly
  • Incidents that take longest to resolve
  • Issues that require specialized knowledge

Write for the 3 AM on-call engineer:
  • Use clear, numbered steps
  • Include actual commands (with placeholders)
  • Specify expected outputs
  • Define clear escalation criteria

Treat runbooks as living documents:
  • Update procedures that didn’t work
  • Add new steps discovered during investigation
  • Remove outdated information
  • Capture novel solutions as memories
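
For the advice about including actual commands with placeholders, one option is to name each placeholder explicitly and fill it in at incident time. A small sketch using Python's `string.Template`; the `kubectl` command and the placeholder names `deployment` and `namespace` are illustrative choices, not commands prescribed by AccelOS.

```python
from string import Template

# A runbook step written with named placeholders.
step = Template("kubectl rollout restart deployment/$deployment -n $namespace")

# substitute() raises KeyError if any placeholder is left unfilled, which
# guards against running a half-completed command.
command = step.substitute(deployment="checkout-api", namespace="prod")
```

Named placeholders make it obvious to the on-call engineer which values must be supplied before the command is safe to run.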

Getting Started

Setup Guide

Step-by-step guide to setting up your knowledge base.