Web Content Ingestion

How content flows from markdown frontmatter through ingestion to database storage, and how to extend the system with custom fields.

Data Flow Overview

Markdown File (frontmatter + body)
         ↓
    parse_markdown()
         ↓
    ContentMetadata (struct)
         ↓
    CreateContentParams (builder)
         ↓
    ContentRepository::create()
         ↓
    markdown_content (PostgreSQL table)
         ↓
    ContentDataProvider::enrich_content()
         ↓
    Enriched JSON (for templates)

Ingestion Process

Step 1: Parse Markdown

File: extensions/web/src/services/ingestion.rs

The parse_markdown() function extracts YAML frontmatter and body:

fn parse_markdown(content: &str) -> Result<(ContentMetadata, String), BlogError> {
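    // Find frontmatter delimiters (---). The file opens with "---\n" (4 bytes);
    // end_idx is the byte offset of the closing "---" (lookup elided here).
    let frontmatter = content[4..end_idx].trim();
    let body = content[end_idx + 3..].trim().to_string();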

    // Deserialize YAML to ContentMetadata
    let metadata: ContentMetadata = serde_yaml::from_str(frontmatter)?;
    Ok((metadata, body))
}
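
The delimiter search that the excerpt above elides can be sketched with the standard library alone. This is a minimal illustration, not the project's implementation: split_frontmatter is a hypothetical helper, it returns Option instead of BlogError, and it leaves YAML deserialization to the caller.

```rust
/// Splits a markdown document into (frontmatter, body).
/// Returns None when the opening or closing `---` delimiter is missing.
/// Hypothetical helper for illustration; not part of the codebase.
fn split_frontmatter(content: &str) -> Option<(&str, &str)> {
    // The document must start with a `---` line.
    let rest = content.strip_prefix("---")?;
    let rest = rest
        .strip_prefix("\r\n")
        .or_else(|| rest.strip_prefix('\n'))?;

    // The closing delimiter is the next `---` that begins a line.
    let closing = "\n---";
    let end_idx = rest.find(closing)?;

    let frontmatter = rest[..end_idx].trim();
    let body = rest[end_idx + closing.len()..].trim();
    Some((frontmatter, body))
}
```

Returning None for a missing delimiter keeps the slicing safe: no index is used unless find() produced it, so malformed files cannot cause a panic.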

Step 2: ContentMetadata Struct

File: extensions/web/src/models/content.rs

All frontmatter fields map to this struct:

#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct ContentMetadata {
    // Required fields
    pub title: String,
    pub description: String,
    pub author: String,
    pub published_at: String,
    pub slug: String,
    pub keywords: String,
    pub kind: String,

    // Optional fields
    #[serde(default)]
    pub image: Option<String>,
    #[serde(default)]
    pub category: Option<String>,
    #[serde(default)]
    pub tags: Vec<String>,

    // Relation fields (stored as JSONB)
    #[serde(default)]
    pub links: Vec<ContentLinkMetadata>,
    #[serde(default)]
    pub after_reading_this: Vec<String>,
    #[serde(default)]
    pub related_playbooks: Vec<ContentLinkMetadata>,
    #[serde(default)]
    pub related_code: Vec<ContentLinkMetadata>,
    #[serde(default)]
    pub related_docs: Vec<ContentLinkMetadata>,
}
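
Because the seven required fields carry no #[serde(default)], a file that omits any of them should fail deserialization. A pre-flight check along the following lines can report which keys are absent before ingestion runs. It is a rough sketch: missing_required_fields and REQUIRED_FIELDS are hypothetical names, and the check is line-based rather than a real YAML parse.

```rust
/// Required frontmatter keys, copied from the ContentMetadata struct.
const REQUIRED_FIELDS: [&str; 7] = [
    "title",
    "description",
    "author",
    "published_at",
    "slug",
    "keywords",
    "kind",
];

/// Returns the required keys that have no top-level `key:` line.
/// Line-based check only: indented (nested) keys are ignored and
/// values are not validated.
fn missing_required_fields(frontmatter: &str) -> Vec<&'static str> {
    REQUIRED_FIELDS
        .iter()
        .copied()
        .filter(|key| {
            !frontmatter.lines().any(|line| {
                line.strip_prefix(*key)
                    .map_or(false, |rest| rest.starts_with(':'))
            })
        })
        .collect()
}
```

Optional fields such as image, category, and tags are deliberately left out of the list, since the struct defaults them when they are missing.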

Step 3: Database Storage

File: extensions/web/src/repository/content.rs

The ContentRepository::create() method inserts content:

sqlx::query!(
    r#"
    INSERT INTO markdown_content (
        id, slug, title, description, body, author,
        published_at, keywords, kind, image, category_id, source_id,
        version_hash, links, after_reading_this, related_playbooks,
        related_code, related_docs, updated_at
    )
    VALUES ($1, $2, $3, $4, $5, $6, $7, $8, $9, $10, $11, $12, $13, $14, $15, $16, $17, $18, $19)
    ON CONFLICT (slug) DO UPDATE SET ...
    "#,
    // ... parameters
)
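
The ON CONFLICT (slug) clause means re-running ingestion updates the existing row instead of creating a duplicate. The in-memory model below illustrates that behavior without a database. ContentRow, UpsertOutcome, and upsert are hypothetical names, and keeping the original id on conflict is an assumption, since the SET list is elided above.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, PartialEq)]
struct ContentRow {
    id: String,
    title: String,
    body: String,
}

#[derive(Debug, PartialEq)]
enum UpsertOutcome {
    Inserted,
    Updated,
}

/// In-memory model of `INSERT ... ON CONFLICT (slug) DO UPDATE`.
/// The map key plays the role of the unique `slug` column.
fn upsert(
    table: &mut HashMap<String, ContentRow>,
    slug: &str,
    row: ContentRow,
) -> UpsertOutcome {
    match table.get_mut(slug) {
        Some(existing) => {
            // Conflict on slug: overwrite the content columns and keep the
            // original id (assumption: `id` is not in the elided SET list).
            existing.title = row.title;
            existing.body = row.body;
            UpsertOutcome::Updated
        }
        None => {
            table.insert(slug.to_string(), row);
            UpsertOutcome::Inserted
        }
    }
}
```

Ingesting the same slug twice leaves one row whose content reflects the second run.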

Adding Custom Frontmatter Fields

To add a new field, follow the steps below. They use category, the optional field shown in the ContentMetadata struct in Step 2, as the worked example:

1. Add Database Column

File: extensions/web/schema/011_content_category.sql (new)

-- Add category column for content filtering
ALTER TABLE markdown_content
ADD COLUMN IF NOT EXISTS category TEXT;

CREATE INDEX IF NOT EXISTS idx_markdown_content_category_filter
ON markdown_content(category);

2. Add to ContentMetadata

File: extensions/web/src/models/content.rs

#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct ContentMetadata {
    // ... existing fields ...

    #[serde(default)]
    pub category: Option<String>,  // Add new field
}

3. Add to CreateContentParams Builder

File: extensions/web/src/models/builders/content.rs

pub struct CreateContentParams {
    // ... existing fields ...
    pub category: Option<String>,
}

impl CreateContentParams {
    // ... existing methods ...

    #[must_use]
    pub fn with_category(mut self, category: Option<String>) -> Self {
        self.category = category;
        self
    }
}
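
To see the builder pattern end to end, here is a trimmed-down, self-contained version. It is a sketch under assumptions: only three fields are modelled, the constructor signature mirrors the new(source_id, slug) call used in the ingestion step, and params_for is a hypothetical helper.

```rust
/// Trimmed-down stand-in for the real builder: only three fields are modelled.
#[derive(Debug, Clone, PartialEq)]
pub struct CreateContentParams {
    pub source_id: String,
    pub slug: String,
    pub category: Option<String>,
}

impl CreateContentParams {
    /// Required values go through the constructor; optional ones start as None.
    pub fn new(source_id: String, slug: String) -> Self {
        Self {
            source_id,
            slug,
            category: None,
        }
    }

    /// Consumes and returns the builder so calls can be chained.
    #[must_use]
    pub fn with_category(mut self, category: Option<String>) -> Self {
        self.category = category;
        self
    }
}

/// Hypothetical helper: builds params from already-parsed values.
fn params_for(source_id: &str, slug: &str, category: Option<&str>) -> CreateContentParams {
    CreateContentParams::new(source_id.to_string(), slug.to_string())
        .with_category(category.map(str::to_string))
}
```

Taking Option<String> in with_category lets the ingestion step pass the frontmatter value straight through, whether or not the author set it.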

4. Update Repository INSERT

File: extensions/web/src/repository/content.rs

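Add category to the column list, add a matching placeholder to VALUES, and add it to the ON CONFLICT ... DO UPDATE SET clause so that re-ingested content updates the column too. The existing statement binds $1 through $19, so the new placeholder is $20, and the query needs one more bound parameter: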

INSERT INTO markdown_content (
    ..., category, ...
)
VALUES (..., $20, ...)
ON CONFLICT (slug) DO UPDATE SET
    ...,
    category = EXCLUDED.category,
    ...
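
The column list, the placeholders, and the SET clause have to stay in step, which is easy to get wrong by hand. The generator below shows the relationship for an arbitrary column list. It is an illustration only: build_upsert is a hypothetical helper, excluding id and slug from the SET clause is an assumption, and sqlx::query! requires a string literal, so the real statement still has to be edited in place.

```rust
/// Builds an upsert statement for `markdown_content` from a column list.
/// Each column gets one numbered placeholder; every column except `id`
/// and `slug` is refreshed from EXCLUDED on conflict (an assumption).
fn build_upsert(columns: &[&str]) -> String {
    // $1, $2, ... one placeholder per column, in order
    let placeholders: Vec<String> = (1..=columns.len())
        .map(|i| format!("${}", i))
        .collect();

    // col = EXCLUDED.col for every updatable column
    let updates: Vec<String> = columns
        .iter()
        .filter(|c| !matches!(**c, "id" | "slug"))
        .map(|c| format!("{} = EXCLUDED.{}", c, c))
        .collect();

    format!(
        "INSERT INTO markdown_content ({}) VALUES ({}) ON CONFLICT (slug) DO UPDATE SET {}",
        columns.join(", "),
        placeholders.join(", "),
        updates.join(", ")
    )
}
```

Adding "category" to the input slice adds exactly one placeholder and one EXCLUDED assignment, which is the same three-part change the manual edit requires.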

5. Update Ingestion

File: extensions/web/src/services/ingestion.rs

Pass the field to the builder:

let params = CreateContentParams::new(source_id.clone(), metadata.slug.clone())
    // ... existing fields ...
    .with_category(metadata.category);

6. Rebuild and Migrate

just build
systemprompt infra db migrate
systemprompt infra jobs run blog_content_ingestion

ContentDataProvider

ContentDataProvider enriches content after it is loaded from the database. Use it when you need to add computed fields, fetch related data, or transform data for templates.

Trait Definition

#[async_trait]
pub trait ContentDataProvider: Send + Sync {
    /// Unique identifier for this provider
    fn provider_id(&self) -> &'static str;

    /// Which content sources this provider applies to
    fn applies_to_sources(&self) -> Vec<String>;

    /// Enrich content item with additional data
    async fn enrich_content(
        &self,
        ctx: &ContentDataContext<'_>,
        item: &mut serde_json::Value,
    ) -> Result<()>;
}
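
How applies_to_sources() gates enrichment is easier to see in a simplified, synchronous model. Everything in this sketch is an assumption made for illustration: Provider, ReadingTime, and run_providers are hypothetical, items are plain string maps rather than serde_json values, and sources are matched by exact string equality.

```rust
use std::collections::BTreeMap;

type Item = BTreeMap<String, String>;

/// Synchronous, simplified analogue of ContentDataProvider.
trait Provider {
    fn provider_id(&self) -> &'static str;
    fn applies_to_sources(&self) -> Vec<String>;
    fn enrich(&self, item: &mut Item);
}

/// Example provider that adds a computed field.
struct ReadingTime;

impl Provider for ReadingTime {
    fn provider_id(&self) -> &'static str {
        "reading-time"
    }

    fn applies_to_sources(&self) -> Vec<String> {
        vec!["blog".to_string()]
    }

    fn enrich(&self, item: &mut Item) {
        let words = item
            .get("body")
            .map_or(0, |body| body.split_whitespace().count());
        // Assume 200 words per minute, rounding up.
        let minutes = (words + 199) / 200;
        item.insert("reading_minutes".to_string(), minutes.to_string());
    }
}

/// Runs every provider whose source list contains `source`
/// and returns the ids of the providers that were applied.
fn run_providers(
    providers: &[Box<dyn Provider>],
    source: &str,
    item: &mut Item,
) -> Vec<&'static str> {
    let mut applied = Vec::new();
    for provider in providers {
        if provider.applies_to_sources().iter().any(|s| s == source) {
            provider.enrich(item);
            applied.push(provider.provider_id());
        }
    }
    applied
}
```

A provider registered for "blog" is skipped for "documentation" content, so a mismatch between the source id and the provider's list leaves the item untouched without raising an error.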

Example: DocsContentDataProvider

File: extensions/web/src/docs/content_provider.rs

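// Project-specific imports (ContentDataProvider, ContentDataContext, Database, Result) omitted
use std::sync::Arc;

use async_trait::async_trait;
use serde_json::json;

pub struct DocsContentDataProvider;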

#[async_trait]
impl ContentDataProvider for DocsContentDataProvider {
    fn provider_id(&self) -> &'static str {
        "docs-content-enricher"
    }

    fn applies_to_sources(&self) -> Vec<String> {
        vec!["documentation".to_string()]
    }

    async fn enrich_content(
        &self,
        ctx: &ContentDataContext<'_>,
        item: &mut serde_json::Value,
    ) -> Result<()> {
        let db = ctx.db_pool::<Arc<Database>>()?;
        let pool = db.pool()?;
        let content_id = ctx.content_id();

        // Fetch additional data
        let row = sqlx::query!(
            r#"
            SELECT
                slug, kind, source_id,
                COALESCE(after_reading_this, '[]'::jsonb) as "after_reading_this!",
                COALESCE(related_playbooks, '[]'::jsonb) as "related_playbooks!"
            FROM markdown_content
            WHERE id = $1
            "#,
            content_id
        )
        .fetch_one(&*pool)
        .await?;

        // Insert enriched fields
        if let Some(obj) = item.as_object_mut() {
            obj.insert("after_reading_this".to_string(), row.after_reading_this);
            obj.insert("related_playbooks".to_string(), row.related_playbooks);
        }

        // Add children for index pages
        if row.kind == "docs-index" {
            let children = self.get_children(&pool, &row.source_id, &row.slug).await;
            if let Some(obj) = item.as_object_mut() {
                obj.insert("children".to_string(), json!(children));
            }
        }

        Ok(())
    }
}

Registration

File: extensions/web/src/extension.rs

impl Extension for WebExtension {
    fn content_data_providers(&self) -> Vec<Arc<dyn ContentDataProvider>> {
        vec![
            Arc::new(DocsContentDataProvider), // unit struct, so no constructor call is needed
            // Add more providers here
        ]
    }
}

Database Schema

markdown_content Table

Column	Type	Description
`id`	TEXT	Primary key (UUID)
`slug`	TEXT	URL-friendly identifier
`title`	TEXT	Content title
`description`	TEXT	SEO description
`body`	TEXT	Markdown content
`author`	TEXT	Author name
`published_at`	TIMESTAMPTZ	Publication date
`keywords`	TEXT	SEO keywords
`kind`	TEXT	Content type (article, guide, etc.)
`image`	TEXT	Featured image URL
`category_id`	TEXT	Source category
`source_id`	TEXT	Content source (blog, documentation)
`version_hash`	TEXT	Content hash for change detection
`public`	BOOLEAN	Published status
`links`	JSONB	External reference links
`after_reading_this`	JSONB	Learning objectives
`related_playbooks`	JSONB	Related playbook links
`related_code`	JSONB	Related code links
`related_docs`	JSONB	Related documentation links
`updated_at`	TIMESTAMPTZ	Last modification time
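
The version_hash column supports change detection: if the hash of a file matches the stored value, the content has not changed. The sketch below shows the idea with a 64-bit FNV-1a hash. The hash algorithm the project actually uses is not stated here, so FNV-1a is a stand-in, and version_hash and needs_update are hypothetical function names.

```rust
/// 64-bit FNV-1a over the input bytes, rendered as 16 hex digits.
/// Stand-in for whatever hash the project uses.
fn version_hash(content: &str) -> String {
    const OFFSET_BASIS: u64 = 0xcbf2_9ce4_8422_2325;
    const PRIME: u64 = 0x0000_0100_0000_01b3;

    let mut hash = OFFSET_BASIS;
    for byte in content.bytes() {
        hash ^= u64::from(byte);
        hash = hash.wrapping_mul(PRIME);
    }
    format!("{:016x}", hash)
}

/// True when the row must be written: either no hash is stored yet
/// or the stored hash differs from the hash of the current content.
fn needs_update(stored_hash: Option<&str>, content: &str) -> bool {
    stored_hash != Some(version_hash(content).as_str())
}
```

Comparing hashes avoids loading and diffing the full body for every file on each ingestion run.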

CLI Commands

# Run ingestion job
systemprompt infra jobs run blog_content_ingestion

# List content
systemprompt core content list --source blog

# Show content details
systemprompt core content show <slug> --source <source>

# Query database directly (the category column exists only after the migration from "Adding Custom Frontmatter Fields")
systemprompt infra db query "SELECT slug, kind, category FROM markdown_content WHERE source_id = 'blog'"

Troubleshooting

Problem	Solution
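Field not stored	Check that ContentMetadata has the field and that the builder, ingestion call, and INSERT all pass it through
Column doesn't exist	Run the database migration (`systemprompt infra db migrate`)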
Provider not called	Check `applies_to_sources()` matches content source
Field not in template	Check ContentDataProvider enriches the field

Quick Reference

Task	Location
Add frontmatter field	`extensions/web/src/models/content.rs`
Add database column	`extensions/web/schema/*.sql`
Store field in DB	`extensions/web/src/repository/content.rs`
Enrich at runtime	Create `ContentDataProvider`
Register provider	`extensions/web/src/extension.rs`