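AI-Crawler-Friendly Architecture: Helping AI Engines Discover and Cite Your Content Efficiently

Core Design of an AI-Crawler-Friendly Architecture

The biggest difference between a site architecture aimed at AI engines and a traditional web architecture is that the machine readability of the content matters more than its visual presentation.

The three-layer model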

Layer 1: AI discovery protocols (llms.txt + robots.txt + sitemap.xml)
Layer 2: Structured data (Schema.org + JSON-LD + Open Graph)
Layer 3: Semantic content (HTML5 semantic tags + FAQ + tables)

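Layer 1: AI discovery protocols

llms.txt

AI models cannot process unlimited content and can ingest only a small part of a site. llms.txt tells the AI: "If you want to understand this site, start with these pages and read them in this order."

Best practices:

Curate the 20-50 most important pages
Order them by importance
Add a short description to each link
Keep the file under 5,000 tokens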
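
The practices above can be sketched as a small generator. This is a minimal illustration, not a reference implementation: the `Page` record, its `priority` field, and the "Key pages" section name are hypothetical, and the token count is a rough estimate of about 4 characters per token, which will undercount for CJK text. The output follows the commonly used llms.txt layout of a title, a one-line summary, and a link list with descriptions.

```python
from dataclasses import dataclass


@dataclass
class Page:
    title: str
    url: str
    description: str
    priority: int  # hypothetical field: lower number = more important


def build_llms_txt(site_name: str, summary: str, pages: list[Page],
                   max_pages: int = 50, max_tokens: int = 5000) -> str:
    """Render an llms.txt body: title, summary, then links ordered by importance."""
    ranked = sorted(pages, key=lambda p: p.priority)[:max_pages]
    lines = [f"# {site_name}", "", f"> {summary}", "", "## Key pages", ""]
    for page in ranked:
        lines.append(f"- [{page.title}]({page.url}): {page.description}")
    text = "\n".join(lines) + "\n"
    # Rough heuristic only (assumption): ~4 characters per token.
    estimated_tokens = len(text) // 4
    if estimated_tokens > max_tokens:
        raise ValueError(
            f"llms.txt is about {estimated_tokens} tokens, over the {max_tokens} budget"
        )
    return text
```

Raising an error when the budget is exceeded, rather than silently truncating, keeps the choice of which pages to drop with the editor.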

robots.txt configuration for AI crawlers

# Allow all AI crawlers
User-agent: GPTBot        # OpenAI
User-agent: ChatGPT-User   # ChatGPT
User-agent: CCBot          # Common Crawl
User-agent: Google-Extended # Google AI
User-agent: PerplexityBot   # Perplexity
User-agent: ClaudeBot       # Anthropic
User-agent: DeepSeekBot     # DeepSeek
User-agent: Bytespider      # Doubao / ByteDance
Allow: /
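
One way to check a configuration like this before deploying it is to run it through Python's standard `urllib.robotparser`. The sketch below is an assumption-laden sanity check: the helper name `allowed_agents` is hypothetical, and the standard-library parser is only one interpretation of the rules, so individual crawlers may resolve edge cases differently.

```python
from urllib.robotparser import RobotFileParser

# User-agent tokens taken from the robots.txt example in this article.
AI_AGENTS = ["GPTBot", "ChatGPT-User", "CCBot", "Google-Extended",
             "PerplexityBot", "ClaudeBot", "DeepSeekBot", "Bytespider"]


def allowed_agents(robots_txt: str, url: str, agents=AI_AGENTS) -> dict[str, bool]:
    """Return, for each user-agent token, whether robots_txt lets it fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines and strips trailing "#" comments itself.
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}
```

An agent that matches no group and has no `*` fallback is reported as allowed, which mirrors the default of the robots exclusion protocol.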

sitemap.xml

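Make sure the sitemap includes:

URLs of all public pages
<lastmod> timestamps (a freshness signal)
<changefreq> update frequency (optional; Google documents that it ignores this value)
<priority> priority weight (optional; Google documents that it ignores this value as well)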
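
A sitemap with these fields can be generated with the standard library alone. The following is a minimal sketch: the function name `build_sitemap` and the `(url, modified)` input shape are assumptions for illustration, and it emits only `<loc>` and `<lastmod>`, leaving the optional `<changefreq>` and `<priority>` elements out.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(entries: list[tuple[str, datetime]]) -> str:
    """Build a sitemap.xml string from (url, last-modified) pairs."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for loc, modified in entries:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc
        # W3C datetime format; pass timezone-aware datetimes to include the offset.
        ET.SubElement(url, "lastmod").text = modified.isoformat(timespec="seconds")
    body = ET.tostring(urlset, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body
```

Driving `lastmod` from the content's real modification time, rather than the build time, keeps the freshness signal meaningful.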

Layer 2: Structured data

JSON-LD embedding strategy

Embed JSON-LD in the <head> of every page:

{
    "@context": "https://schema.org",
    "@type": "TechArticle",
    "headline": "Article title",
    "description": "Article description",
    "author": {
        "@type": "Organization",
        "name": "AIRef"
    },
    "datePublished": "2026-05-18T00:00:00+08:00",
    "dateModified": "2026-05-18T00:00:00+08:00",
    "mainEntityOfPage": {
        "@type": "WebPage",
        "@id": "https://airef.dev/architecture/ai-crawler-friendly/"
    }
}
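
In a static site generator, an object like this is usually built from page metadata and wrapped in a script element at build time. The sketch below shows one possible approach; the function names and parameters are hypothetical, and the `AIRef` default simply mirrors the example above.

```python
import json


def article_json_ld(headline: str, description: str, url: str,
                    published: str, modified: str, org_name: str = "AIRef") -> dict:
    """Build a TechArticle object shaped like the example in this article."""
    return {
        "@context": "https://schema.org",
        "@type": "TechArticle",
        "headline": headline,
        "description": description,
        "author": {"@type": "Organization", "name": org_name},
        "datePublished": published,
        "dateModified": modified,
        "mainEntityOfPage": {"@type": "WebPage", "@id": url},
    }


def to_script_tag(data: dict) -> str:
    """Serialize data as a JSON-LD script element for the page <head>."""
    payload = json.dumps(data, ensure_ascii=False, indent=2)
    # Escape "</" so a string value cannot close the script element early.
    # "\/" is a valid JSON escape, so the payload still parses to the same data.
    payload = payload.replace("</", "<\\/")
    return f'<script type="application/ld+json">\n{payload}\n</script>'
```

Using `ensure_ascii=False` keeps non-ASCII headlines readable in the page source instead of turning them into `\uXXXX` escapes.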

Schema types by page type

Page type	Schema type	Key fields
Blog post	Article	headline, datePublished, dateModified, author
FAQ page	FAQPage	mainEntity → Question → acceptedAnswer
How-to guide	HowTo	step → HowToStep → name, text
Technical documentation	TechArticle	proficiencyLevel, dependencies
About page	Organization	name, url, logo, sameAs
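
The table can double as a build-time lint rule. The sketch below checks only the top-level key fields listed above; the `REQUIRED_FIELDS` mapping and `missing_fields` helper are hypothetical names, and the field lists reflect this article's recommendations rather than what Schema.org or any search engine strictly requires.

```python
# Top-level key fields per Schema type, taken from the table in this article.
REQUIRED_FIELDS = {
    "Article": ["headline", "datePublished", "dateModified", "author"],
    "FAQPage": ["mainEntity"],
    "HowTo": ["step"],
    "TechArticle": ["proficiencyLevel", "dependencies"],
    "Organization": ["name", "url", "logo", "sameAs"],
}


def missing_fields(data: dict) -> list[str]:
    """Return the key fields absent from a JSON-LD object, based on its @type."""
    schema_type = data.get("@type")
    if schema_type not in REQUIRED_FIELDS:
        raise KeyError(f"no rule for @type {schema_type!r}")
    return [field for field in REQUIRED_FIELDS[schema_type] if field not in data]
```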

Layer 3: Semantic content

HTML5 semantic structure

<article itemscope itemtype="https://schema.org/Article">
    <header>
        <h1 itemprop="headline">Title</h1>
        <time datetime="2026-05-18" itemprop="datePublished">2026-05-18</time>
        <meta itemprop="dateModified" content="2026-05-18">
    </header>
    <div itemprop="articleBody">
        <!-- Content -->
    </div>
</article>

FAQ structure (a gold mine for AI citations)

<section itemscope itemtype="https://schema.org/FAQPage">
    <div itemscope itemprop="mainEntity" itemtype="https://schema.org/Question">
        <h2 itemprop="name">Question text</h2>
        <div itemscope itemprop="acceptedAnswer" itemtype="https://schema.org/Answer">
            <div itemprop="text">Answer text</div>
        </div>
    </div>
</section>
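
The same FAQPage structure can also be expressed as JSON-LD, which matches the embedding strategy from Layer 2. The sketch below builds it from question and answer pairs; the function name `faq_json_ld` and the input shape are assumptions for illustration.

```python
def faq_json_ld(pairs: list[tuple[str, str]]) -> dict:
    """Build a FAQPage object from (question, answer) pairs."""
    return {
        "@context": "https://schema.org",
        "@type": "FAQPage",
        "mainEntity": [
            {
                "@type": "Question",
                "name": question,
                "acceptedAnswer": {"@type": "Answer", "text": answer},
            }
            for question, answer in pairs
        ],
    }
```

Generating the markup from the same source data that renders the visible questions and answers helps keep the two from drifting apart.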

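Architecture choices under performance constraints

On a server with 512MB of memory, the key decisions for an AI-crawler-friendly architecture are:

Decision	Traditional choice	AI-optimized choice	Reason
Rendering	SPA / SSR	SSG (static generation)	Most AI crawlers do not execute JS
Data format	API JSON	HTML + JSON-LD	Semantic HTML is readable by both people and machines
Content discovery	JS routing	sitemap + llms.txt	AI needs explicit entry points
Page structure	Visually driven	Semantically driven	Machine readability > visual polish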
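
The rendering decision can be tested by looking at a page the way a crawler that does not run JavaScript would: parse the raw HTML and ignore script content. The sketch below uses the standard `html.parser`; the class and function names are hypothetical, and it is a rough approximation, since real crawlers apply their own extraction logic.

```python
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    """Collect text outside script, style, and template elements."""

    SKIP = {"script", "style", "template"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def static_text(html: str) -> str:
    """Return the text available in the HTML without executing any JS."""
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def renders_without_js(html: str, must_contain: list[str]) -> bool:
    """True if every required phrase is present in the static text."""
    text = static_text(html)
    return all(phrase in text for phrase in must_contain)
```

A client-rendered page whose body is an empty root element fails this check, while a statically generated page with the same content passes.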

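Verification checklist

/llms.txt is reachable and served with Content-Type text/plain
/llms-full.txt is reachable
/robots.txt allows AI crawlers
/sitemap.xml lists all pages with lastmod
Every page has JSON-LD structured data
FAQ pages have the FAQPage schema
Author information has a Person/Organization schema
HTML uses semantic tags (article, section, nav)
Core content renders without depending on JS
An RSS/Atom feed is available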
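
Part of this checklist can be automated. The sketch below covers a few of the items with simple string checks; the `Fetcher` interface (a callable returning status, content type, and body for a path) is a hypothetical abstraction so the checks can run against a live site or against local build output, and the substring tests are deliberately crude.

```python
from typing import Callable

# Hypothetical interface: path -> (HTTP status, Content-Type header, body text).
Fetcher = Callable[[str], tuple[int, str, str]]


def run_checks(fetch: Fetcher) -> dict[str, bool]:
    """Run a subset of the checklist and return a pass/fail flag per item."""
    results = {}

    status, ctype, _ = fetch("/llms.txt")
    results["llms.txt served as text/plain"] = (
        status == 200 and ctype.startswith("text/plain")
    )

    status, _, _ = fetch("/llms-full.txt")
    results["llms-full.txt reachable"] = status == 200

    status, _, _ = fetch("/robots.txt")
    results["robots.txt reachable"] = status == 200

    status, _, body = fetch("/sitemap.xml")
    results["sitemap.xml has lastmod"] = status == 200 and "<lastmod>" in body

    status, _, body = fetch("/")
    results["home page has JSON-LD"] = status == 200 and "application/ld+json" in body
    results["home page uses <article>"] = status == 200 and "<article" in body

    return results
```

Injecting the fetcher keeps the checks free of network code, so the same function can run in CI against the generated files before anything is published.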