Apps on Azure Blog
Build a Custom SSL Certificate Monitor with Azure SRE Agent: From Python Tool to Production Skill

dbandaru (Microsoft)
Feb 19, 2026

Learn how to build a custom SSL certificate monitoring tool for Azure SRE Agent using Python tools and skills. This tutorial walks through creating a CheckSSLCertificateExpiry Python tool, composing it into a reusable skill, and deploying it as a dedicated SSLCertificateMonitor agent — all through the Azure SRE Agent portal.

 

TL;DR

Expired SSL certificates cause outages that are 100% preventable. In this post, you’ll learn how to create a custom Python tool in Azure SRE Agent that checks SSL certificate expiry across your domains, then wrap it in a skill that gives your agent a complete certificate health audit workflow. The result: your SRE Agent proactively flags certificates expiring in the next 30 days and recommends renewal actions before they become 3 AM pages.


The Problem Every ITOps Team Knows Too Well

It’s a Tuesday morning. Your monitoring dashboard lights up with alerts: your customer-facing API is returning connection errors. Users are calling. Slack is on fire. After 20 minutes of frantic debugging, someone discovers the root cause: an SSL certificate expired overnight.

This scenario plays out across enterprises every week. According to industry data, certificate-related outages cost an average of $300,000 per incident in downtime and remediation. The frustrating part? Every single one is preventable.

ITOps teams say: “We have spreadsheets for tracking certs, but someone always forgets to update them after a renewal.”

On-call engineers say: “I spent 20 minutes debugging before realizing it was just an expired certificate.”

Most teams rely on a patchwork of solutions, and they all have gaps:

| Current Approach | The Gap |
|------------------|---------|
| Spreadsheets | Go stale: someone forgets to update after renewal |
| Calendar reminders | Fire too late: 7 days isn’t enough for compliance review |
| Standalone SaaS tools | Don’t integrate with existing incident workflows |
| Manual checks | Don’t scale with multi-domain sprawl |

What if your SRE Agent could check certificate health as part of its regular investigation workflow, and proactively warn you during routine health checks?


What We’ll Build

In this tutorial, you’ll create two things:

  1. A Python Tool (CheckSSLCertificateExpiry): A custom tool that connects to any domain, retrieves its SSL certificate details, and returns structured data about the certificate’s validity, issuer, and days until expiry.
  2. A Skill (ssl_certificate_audit): A reusable knowledge package that teaches your SRE Agent how to perform a complete certificate health audit across multiple domains, classify risk levels, and recommend actions.

By the end, your agent will respond to prompts like:

  • “Check the SSL certificates for all our production domains”
  • “Are any of our certificates expiring in the next 30 days?”
  • “Run a certificate health audit for api.contoso.com, portal.contoso.com, and store.contoso.com”


The CheckSSLCertificateExpiry tool in the Azure SRE Agent portal, showing the Python code, parameters, and description.


Prerequisites

  • An Azure SRE Agent instance deployed in your subscription
  • Access to the Azure SRE Agent portal
  • Basic familiarity with Python and YAML

Part 1: Creating the Python Tool

Step 1: Create the Tool in the Portal

Navigate to the Azure SRE Agent portal, go to Settings > Subagent Builder, and click Create New Tool. Select Python Tool as the type, enter the name CheckSSLCertificateExpiry, and provide the description:

Checks SSL/TLS certificate expiry for a given domain and returns certificate details including days until expiration, issuer, and validity dates.

Add two parameters:

  • domain (string, required): The fully qualified domain name to check (e.g., api.contoso.com)
  • port (string, optional): The port to connect on (default 443)
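
Because port arrives as a string, it has to be converted to an integer before the tool can connect. The following is a minimal sketch of that conversion with range checking added. The helper name parse_port is hypothetical and is not part of the tool in this post, which calls int(port) directly.

```python
def parse_port(port="443"):
    """Convert a port given as a string to an int, rejecting bad values.

    Hypothetical helper for illustration only.
    """
    try:
        value = int(str(port).strip())
    except ValueError:
        raise ValueError(f"port must be a number, got {port!r}") from None
    if not 1 <= value <= 65535:
        raise ValueError(f"port must be between 1 and 65535, got {value}")
    return value


print(parse_port())         # 443
print(parse_port(" 8443 ")) # 8443
```

Validating the value up front gives the agent a clear error message instead of a failure deep inside the socket call.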

Step 2: Write the Python Code

In the Function Code field, paste the following Python implementation:

import ssl
import socket
from datetime import datetime, timezone

def main(domain, port="443"):
    """Check SSL certificate expiry for a domain."""
    port = int(port)
    context = ssl.create_default_context()

    try:
        with socket.create_connection((domain, port), timeout=10) as sock:
            with context.wrap_socket(sock, server_hostname=domain) as ssock:
                cert = ssock.getpeercert()

        not_before = datetime.strptime(cert["notBefore"], "%b %d %H:%M:%S %Y %Z").replace(tzinfo=timezone.utc)
        not_after = datetime.strptime(cert["notAfter"], "%b %d %H:%M:%S %Y %Z").replace(tzinfo=timezone.utc)
        now = datetime.now(timezone.utc)
        days_remaining = (not_after - now).days

        issuer = dict(x[0] for x in cert.get("issuer", []))
        subject = dict(x[0] for x in cert.get("subject", []))

        if days_remaining < 0:
            risk_level = "EXPIRED"
        elif days_remaining <= 7:
            risk_level = "CRITICAL"
        elif days_remaining <= 30:
            risk_level = "WARNING"
        elif days_remaining <= 60:
            risk_level = "ATTENTION"
        else:
            risk_level = "HEALTHY"

        san_list = []
        for entry_type, value in cert.get("subjectAltName", []):
            if entry_type == "DNS":
                san_list.append(value)

        return {
            "domain": domain,
            "port": port,
            "status": "valid" if days_remaining >= 0 else "expired",
            "risk_level": risk_level,
            "days_remaining": days_remaining,
            "not_before": not_before.isoformat(),
            "not_after": not_after.isoformat(),
            "issuer": issuer.get("organizationName", "Unknown"),
            "issuer_cn": issuer.get("commonName", "Unknown"),
            "subject_cn": subject.get("commonName", domain),
            "serial_number": cert.get("serialNumber", "Unknown"),
            "version": cert.get("version", "Unknown"),
            "san_count": len(san_list),
            "san_domains": san_list[:10],
            "checked_at": now.isoformat()
        }

    except ssl.SSLCertVerificationError as e:
        return {
            "domain": domain,
            "port": port,
            "status": "verification_failed",
            "risk_level": "CRITICAL",
            "error": str(e),
            "checked_at": datetime.now(timezone.utc).isoformat()
        }
    except OSError as e:
        # OSError is the base class of socket.timeout, socket.gaierror, and ConnectionRefusedError
        return {
            "domain": domain,
            "port": port,
            "status": "connection_failed",
            "risk_level": "UNKNOWN",
            "error": str(e),
            "checked_at": datetime.now(timezone.utc).isoformat()
        }
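
The notBefore and notAfter strings returned by getpeercert() follow a fixed "Mon DD HH:MM:SS YYYY GMT" layout. As an alternative to the strptime calls above, the standard library's ssl.cert_time_to_seconds parses the same layout. The sketch below uses it to compute days until expiry without a network connection. The function name days_until_expiry and its now parameter are assumptions added for illustration.

```python
import ssl
from datetime import datetime, timezone


def days_until_expiry(not_after, now=None):
    """Return whole days between `now` and a certificate's notAfter string."""
    expires = datetime.fromtimestamp(ssl.cert_time_to_seconds(not_after), tz=timezone.utc)
    now = now or datetime.now(timezone.utc)
    # timedelta.days rounds toward negative infinity, so a certificate
    # that expired an hour ago reports -1 rather than 0
    return (expires - now).days


fixed_now = datetime(2026, 1, 1, tzinfo=timezone.utc)
print(days_until_expiry("Jan 31 00:00:00 2026 GMT", now=fixed_now))  # 30
print(days_until_expiry("Dec 31 00:00:00 2025 GMT", now=fixed_now))  # -1
```

Passing a fixed now makes the calculation repeatable, which helps when checking the risk thresholds against known dates.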

Key Design Decisions:

Structured output: The tool returns a JSON-serializable dictionary with clearly labeled fields so the LLM can compare, sort, and aggregate results across multiple domains.

Risk classification: Five risk levels (EXPIRED, CRITICAL, WARNING, ATTENTION, HEALTHY) give the agent clear thresholds to reason about.

Error handling: Specific exception types return structured error objects rather than crashing, so the agent gets useful information even when a domain is unreachable. Note that ssl.create_default_context() verifies the certificate during the handshake, so a certificate that has already expired typically raises ssl.SSLCertVerificationError and is reported as verification_failed with risk level CRITICAL rather than EXPIRED.

Zero dependencies: Uses only the Python standard library (ssl, socket, datetime) for fast cold starts and no third-party supply chain risk.
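
To make the thresholds easy to check, the risk classification can be pulled out into a standalone function. This is a sketch using the same cutoffs as the tool code above. The name classify_risk is hypothetical and does not appear in the tool itself.

```python
def classify_risk(days_remaining):
    """Map days until expiry to the five risk levels used by the tool."""
    if days_remaining < 0:
        return "EXPIRED"
    if days_remaining <= 7:
        return "CRITICAL"
    if days_remaining <= 30:
        return "WARNING"
    if days_remaining <= 60:
        return "ATTENTION"
    return "HEALTHY"


# Values on either side of each threshold
for days in (-1, 0, 7, 8, 30, 31, 60, 61):
    print(days, classify_risk(days))
```

Note that a certificate with 0 days remaining is CRITICAL, not EXPIRED: it expires within the next 24 hours but is still valid.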

Step 3: Deploy the Tool

Click Save in the tool editor to deploy the tool to your SRE Agent instance. The portal validates the YAML and Python code before saving.


The Subagent Builder in the Azure SRE Agent portal, showing all deployed subagents, Python tools, and skills at a glance.

Step 4: Test the Tool

Open a new chat thread in the portal, select the SSLCertificateMonitor agent (configured in Part 2, Step 3), and type: "Check the SSL certificate for microsoft.com"


The agent checks microsoft.com and returns real certificate data: valid, healthy, 164 days remaining, issued by Microsoft Azure RSA TLS Issuing CA 04.


Part 2: Creating the Skill

A tool gives the agent a capability. A skill gives the agent a methodology.

Tool: “I can check one certificate.”

Skill: “Here’s how to audit all your certificates, classify the risks, and tell you exactly what to do about each one.”

What is a Skill?

A skill is a markdown document with YAML frontmatter that contains:

  • Metadata: name, description, and which tools the skill uses
  • Instructions: step-by-step guidance the agent follows when the skill is loaded

Think of it as a runbook injected into the agent’s context when relevant.
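
To illustrate the two-part layout, here is a minimal sketch that splits a skill document into its frontmatter and its markdown instructions at the "---" delimiter lines. It is not how the portal parses skills, which this post does not describe. The function name split_skill_document is an assumption made for the example.

```python
def split_skill_document(text):
    """Split a skill document into (frontmatter, body) at the '---' delimiter lines."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        raise ValueError("skill document must start with a '---' line")
    for index in range(1, len(lines)):
        if lines[index].strip() == "---":
            frontmatter = "\n".join(lines[1:index])
            body = "\n".join(lines[index + 1:]).strip()
            return frontmatter, body
    raise ValueError("frontmatter is missing its closing '---' line")


example = """---
name: ssl_certificate_audit
tools:
  - CheckSSLCertificateExpiry
---

# SSL/TLS Certificate Health Audit Skill
"""
frontmatter, body = split_skill_document(example)
print(frontmatter)  # the metadata block
print(body)         # the instructions the agent follows
```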

Step 1: Create the Skill in the Portal

In the Azure SRE Agent portal, go to Settings > Subagent Builder and click Create New Skill. You will need to provide the full SKILL.md content, which includes both the YAML frontmatter and the markdown instructions.

Step 2: Write the Skill Document

Paste the following as the complete skill content:

---
name: ssl_certificate_audit
description: |
  Load this skill when the user asks about SSL/TLS certificate health, certificate expiry,
  certificate monitoring, or requests a certificate audit across one or more domains.
  Trigger phrases: "check our certificates", "are any certs expiring", "SSL audit",
  "certificate health check", "TLS certificate status", "cert renewal needed".
  Do NOT load for general security assessments, network connectivity issues unrelated to TLS,
  or application-level HTTPS errors (use standard troubleshooting for those).
tools:
  - CheckSSLCertificateExpiry
---

# SSL/TLS Certificate Health Audit Skill

## Purpose

Perform a structured certificate health audit across one or more domains: check each certificate, classify risk, aggregate findings, and deliver a prioritized action plan with specific renewal deadlines.

## Scope

Focus ONLY on SSL/TLS certificate validity, expiry, and health. Exclude:
- Application-level HTTPS configuration issues
- Cipher suite or TLS version analysis (unless certificate is the root cause)
- Certificate Authority trust chain debugging (unless verification fails)

## Workflow

### Phase 1: Domain Collection

1. If the user provides specific domains, use those directly.
2. If the user says "all our domains" or "production domains," ask them to list the domains or provide a resource group to discover App Services, Front Doors, or API Management instances with custom domains.
3. Confirm the domain list before proceeding.

### Phase 2: Certificate Checks

1. Run CheckSSLCertificateExpiry for each domain. Execute checks in parallel when possible.
2. Collect all results before analysis.
3. If any domain returns a connection error, note it separately; do not abort the audit.

### Phase 3: Risk Classification and Reporting

Classify each certificate into one of these categories:

| Risk Level | Criteria | Required Action |
|------------|----------|-----------------|
| EXPIRED | days_remaining < 0 | Immediate renewal, this is causing outages |
| CRITICAL | days_remaining <= 7 | Emergency renewal within 24 hours |
| WARNING | days_remaining <= 30 | Schedule renewal this sprint |
| ATTENTION | days_remaining <= 60 | Add to next renewal cycle |
| HEALTHY | days_remaining > 60 | No action needed |

### Phase 4: Summary Report

Present findings in this order:

1. **Executive Summary** (1-2 sentences): Total domains checked, how many need action.
2. **Certificates Needing Action** (table): Domain, expiry date, days remaining, risk level, recommended action. Sort by days_remaining ascending (most urgent first).
3. **Healthy Certificates** (compact list): Domain and expiry date only.
4. **Unreachable Domains** (if any): Domain and error reason.
5. **Recommendations**: Specific next steps based on findings.

### Phase 5: Actionable Recommendations

Based on findings, recommend:

- **For EXPIRED or CRITICAL**: "Renew the certificate for {domain} immediately. If using Azure-managed certificates, check the App Service custom domain binding. If using a third-party CA, initiate the renewal process with {issuer}."
- **For WARNING**: "Schedule renewal for {domain} (expires {date}). Recommended to renew by {date - 7 days} to allow for propagation and testing."
- **For ATTENTION**: "Add {domain} to the renewal queue. Certificate expires {date}."
- **For mixed results**: "Consider implementing automated certificate management (e.g., Azure Key Vault with auto-renewal) to prevent future expiry risks."

## Output Format

Use markdown tables for certificate status. Include the checked_at timestamp to establish when the audit was performed. Bold the risk level for EXPIRED and CRITICAL entries.

## Example Output (Condensed)

Certificate Health Audit: 5 domains checked at 2026-02-18T14:30:00Z.

2 certificates need action; 3 are healthy.

| Domain | Expires | Days Left | Risk | Action |
|--------|---------|-----------|------|--------|
| api.contoso.com | 2026-02-20 | **2** | **CRITICAL** | Renew within 24 hours |
| store.contoso.com | 2026-03-10 | 20 | WARNING | Schedule renewal this sprint |
| portal.contoso.com | 2026-06-15 | 117 | HEALTHY | None |
| auth.contoso.com | 2026-08-22 | 185 | HEALTHY | None |
| cdn.contoso.com | 2026-09-01 | 195 | HEALTHY | None |

Recommendation: Renew api.contoso.com immediately to prevent service disruption. Schedule store.contoso.com renewal by March 3rd.

## Quality Principles

- Check all domains before reporting (don't report one-by-one).
- Never guess certificate details; only report what the tool returns.
- Sort urgent items first in all outputs.
- Include specific dates, not vague timeframes.
- Align with system prompt: answer first, then evidence.
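
The skill document ends above. Outside the skill itself, the Phase 4 ordering can be sketched in Python to show what the agent is being asked to produce: group the results, sort the actionable ones by days_remaining, and bold the most urgent risk levels. The function name build_report is hypothetical. The sample rows reuse values from the skill's example output, while legacy.contoso.com and the time portions of not_after are invented for illustration.

```python
def build_report(results):
    """Split tool results into action/healthy/failed groups and render the action table."""
    failed = [r for r in results if "days_remaining" not in r]
    checked = [r for r in results if "days_remaining" in r]
    healthy = [r for r in checked if r["risk_level"] == "HEALTHY"]
    # Most urgent first: fewest days remaining at the top
    action = sorted(
        (r for r in checked if r["risk_level"] != "HEALTHY"),
        key=lambda r: r["days_remaining"],
    )
    lines = [
        f"{len(results)} domains checked; {len(action)} need action.",
        "",
        "| Domain | Expires | Days Left | Risk |",
        "|--------|---------|-----------|------|",
    ]
    for r in action:
        risk = r["risk_level"]
        if risk in ("EXPIRED", "CRITICAL"):
            risk = f"**{risk}**"  # bold the most urgent levels
        expires = r["not_after"][:10]  # date part of the ISO timestamp
        lines.append(f"| {r['domain']} | {expires} | {r['days_remaining']} | {risk} |")
    return "\n".join(lines), healthy, failed


# Illustrative inputs shaped like the tool's return values
sample = [
    {"domain": "store.contoso.com", "risk_level": "WARNING", "days_remaining": 20,
     "not_after": "2026-03-10T00:00:00+00:00"},
    {"domain": "portal.contoso.com", "risk_level": "HEALTHY", "days_remaining": 117,
     "not_after": "2026-06-15T00:00:00+00:00"},
    {"domain": "api.contoso.com", "risk_level": "CRITICAL", "days_remaining": 2,
     "not_after": "2026-02-20T00:00:00+00:00"},
    # Hypothetical unreachable domain: error results carry no days_remaining
    {"domain": "legacy.contoso.com", "risk_level": "UNKNOWN",
     "status": "connection_failed", "error": "timed out"},
]
table, healthy, failed = build_report(sample)
print(table)
```

Results without a days_remaining field, such as connection or verification failures, are kept in their own group so one unreachable domain does not hide the rest of the audit.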

Step 3: Deploy the Skill and Configure the Agent

Back in the Subagent Builder, create a new subagent called SSLCertificateMonitor. In the agent configuration:

  1. Add the CheckSSLCertificateExpiry tool to the agent's tool list
  2. Enable Allow Parallel Tool Calls in the agent settings
  3. Click Save to deploy the agent

Skills are automatically enabled on every agent, so no additional configuration is needed. The portal will validate and deploy the skill, tool, and agent together.


The SSLCertificateMonitor subagent in the portal, showing the CheckSSLCertificateExpiry tool, agent instructions, and skills enabled.


Part 3: See It in Action

Here’s what happens when you ask the agent to audit four real domains: microsoft.com, azure.com, github.com, and learn.microsoft.com.

Open a new chat thread in the portal, select the SSLCertificateMonitor agent, and type:

"Run a certificate health audit for microsoft.com, azure.com, github.com, and learn.microsoft.com"


The agent checks all 4 domains in parallel, classifies github.com as ATTENTION (45 days remaining), and recommends scheduling renewal by March 29, 2026.

The agent:

  1. ✅ Loaded the ssl_certificate_audit skill (matched by “certificate health audit”)
  2. ✅ Ran CheckSSLCertificateExpiry for all 4 domains in parallel
  3. ✅ Classified github.com as ATTENTION (45 days) and the rest as HEALTHY
  4. ✅ Produced a prioritized report: action items first, healthy domains second
  5. ✅ Recommended a specific renewal date and suggested Azure Key Vault auto-renewal

Real result: This audit ran against live production domains and completed in under 25 seconds. The agent correctly identified that github.com’s certificate expires soonest and needs to be added to the renewal cycle.
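
In the portal, the parallelism comes from the agent's parallel tool calls. As a local analogy only, the same fan-out pattern can be sketched with a thread pool. The names audit_domains and fake_check are hypothetical, and the stub checker stands in for the real tool so the example runs without network access.

```python
from concurrent.futures import ThreadPoolExecutor


def audit_domains(domains, check, max_workers=8):
    """Run check(domain) for every domain concurrently, keeping input order.

    A failing check is recorded as a structured error instead of aborting the audit.
    """
    def safe_check(domain):
        try:
            return check(domain)
        except Exception as exc:  # keep going so one failure does not stop the audit
            return {"domain": domain, "status": "check_failed", "error": str(exc)}

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(safe_check, domains))


# Stub checker standing in for the real tool so the sketch runs offline
def fake_check(domain):
    if domain == "down.example":
        raise ConnectionRefusedError("connection refused")
    return {"domain": domain, "status": "valid"}


results = audit_domains(["a.example", "down.example", "b.example"], fake_check)
print([r["status"] for r in results])  # ['valid', 'check_failed', 'valid']
```

Certificate checks are network-bound, so running them concurrently keeps the total audit time close to that of the slowest single check rather than the sum of all of them.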


Scenario 1: Morning Certificate Health Check

User: “Run a certificate health check across our production domains: api.contoso.com, portal.contoso.com, store.contoso.com, auth.contoso.com, and payments.contoso.com”

The agent:

  1. ✅ Loads the ssl_certificate_audit skill (matched by “certificate health check”)
  2. ✅ Runs CheckSSLCertificateExpiry for each domain in parallel
  3. ✅ Classifies each result by risk level
  4. ✅ Delivers a prioritized report with specific action items
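
One of those action items is a renew-by date, which the skill defines as the expiry date minus 7 days. A minimal sketch of that arithmetic follows. The function name renew_by and the buffer_days parameter are assumptions for illustration.

```python
from datetime import date, timedelta


def renew_by(expiry, buffer_days=7):
    """Return the recommended renewal deadline: expiry minus a safety buffer."""
    return expiry - timedelta(days=buffer_days)


# store.contoso.com in the skill's example output expires on 2026-03-10
print(renew_by(date(2026, 3, 10)))  # 2026-03-03
```

This matches the skill's example recommendation to schedule the store.contoso.com renewal by March 3rd.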

Scenario 2: Discovering Cert Issues During Incident Investigation

During a connectivity incident, the agent may use CheckSSLCertificateExpiry to check if the certificate has expired, discovering the root cause without the engineer needing to check manually.

Scenario 3: Cross-Agent Integration

Because the skill references tools by name, any agent with access to CheckSSLCertificateExpiry can use it. Add it to your triage agent, weekly health-check workflow, or other skills that deal with frontend health.


How Tools and Skills Work Together

┌──────────────────────────────────────┐
│              Skill                   │
│  "ssl_certificate_audit"             │
│                                      │
│  Methodology:                        │
│  1. Collect domains                  │
│  2. Check each certificate  ─┐       │
│  3. Classify risk levels     │       │
│  4. Generate report          │       │
│  5. Recommend actions        │       │
└──────────────────────────────┼───────┘
                               │
                               ▼
┌──────────────────────────────────────┐
│              Tool                    │
│  "CheckSSLCertificateExpiry"         │
│                                      │
│  Capability:                         │
│  - Connect to domain:port            │
│  - Read SSL certificate              │
│  - Return structured cert data       │
└──────────────────────────────────────┘

| Concept | Role | Analogy |
|---------|------|---------|
| Tool | Atomic capability: does one thing, returns data | A stethoscope |
| Skill | Methodology: combines tools, interprets results, makes decisions | A diagnostic protocol |

Key Takeaways

Custom Python tools are first-class citizens
You don’t need to build a microservice or deploy an MCP server. Write a Python function, deploy it through the Azure SRE Agent portal, and it’s immediately available.

Skills turn tools into expertise
A tool tells the agent what it can do. A skill tells the agent what it should do and how. The audit skill transforms a simple certificate check into a comprehensive capability.

Start small, iterate fast
Tool creation, skill creation, deployment, and testing take under 30 minutes. Start with one domain check and expand incrementally.

ITOps value is immediate
Every team has certificates. Every team has been burned by an expired one. Deploy this on day one and prevent the next certificate outage.

Want to learn more about Azure SRE Agent extensibility? Check out the YAML Schema Reference and the Python Tool documentation.

Updated Feb 19, 2026
Version 1.0