Website crawlability and discovery checklist for AI agents
AI agents need the same foundational crawlability signals that search engines need, plus cleaner paths to public business context such as docs, pricing, products, policies, and actions.
Start with the homepage
The homepage should return a successful public response, use HTTPS, include crawlable text, and expose a clear title and meta description. A visually rich page still fails agent discovery when its core copy appears only after client-side JavaScript runs.
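One way to approximate what a non-rendering agent sees is to inspect the raw HTML without executing any scripts. The sketch below uses only Python's standard library; the names `HomepageAudit` and `audit_homepage` and the 200-character threshold are illustrative assumptions, not an established standard.

```python
from html.parser import HTMLParser


class HomepageAudit(HTMLParser):
    """Collects the title, meta description, and visible text length from raw HTML."""

    # Content inside these tags is not readable copy for a non-rendering agent.
    SKIP = {"script", "style", "template"}

    def __init__(self):
        super().__init__()
        self.title = ""
        self.description = ""
        self.text_chars = 0
        self._skip_depth = 0
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1
        elif tag == "title":
            self._in_title = True
        elif tag == "meta":
            attrs = dict(attrs)
            if (attrs.get("name") or "").lower() == "description":
                self.description = (attrs.get("content") or "").strip()

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1
        elif tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._skip_depth:
            return
        if self._in_title:
            self.title += data
        else:
            self.text_chars += len(data.strip())


def audit_homepage(html, min_text_chars=200):
    """Report which basic signals are present in the unrendered HTML.

    min_text_chars is an arbitrary illustrative threshold (assumption).
    """
    parser = HomepageAudit()
    parser.feed(html)
    parser.close()
    return {
        "has_title": bool(parser.title.strip()),
        "has_description": bool(parser.description),
        "has_crawlable_text": parser.text_chars >= min_text_chars,
    }
```

A JavaScript-only shell such as `<div id="root"></div>` followed by a script tag would report no crawlable text here, which mirrors the failure mode described above.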
Publish robots.txt
robots.txt should be reachable, consistent, and clear about crawler access. It is not only a search-engine file; agents and AI crawlers often inspect it first to understand whether public reading is allowed or intentionally restricted.
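To see how a given crawler would interpret the file, the rules can be evaluated locally. This is a minimal sketch using Python's standard `urllib.robotparser`; the sample rules, the `ExampleAIBot` user agent, and the `example.com` URLs are hypothetical. Note that this parser applies the first matching rule in a group, which can differ from crawlers that use longest-match precedence.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content used only for illustration.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /admin/

User-agent: ExampleAIBot
Disallow: /

Sitemap: https://example.com/sitemap.xml
"""


def check_access(robots_txt, user_agent, urls):
    """Return {url: allowed} for one user agent, based on robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {url: parser.can_fetch(user_agent, url) for url in urls}


def listed_sitemaps(robots_txt):
    """Return the Sitemap URLs declared in robots.txt (empty list if none)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.site_maps() or []


if __name__ == "__main__":
    pages = ["https://example.com/pricing", "https://example.com/admin/users"]
    for agent in ("ExampleAIBot", "OtherBot"):
        print(agent, check_access(SAMPLE_ROBOTS, agent, pages))
    print(listed_sitemaps(SAMPLE_ROBOTS))
```

Running the same check for each crawler you care about makes it easier to confirm that a restriction is intentional rather than a leftover rule.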
Make sitemap.xml useful
A sitemap should be discoverable, parseable, and non-empty. Prioritize canonical URLs for pages agents need to understand the business: product, documentation, pricing, policy, help, contact, and relevant conversion pages.
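The "parseable and non-empty" check, plus a rough coverage check for key pages, could look like the following sketch. It assumes a plain `urlset` sitemap (not a sitemap index) in the standard sitemaps.org namespace; the function names and the `required` keywords are hypothetical and should be adapted to the site's real URL structure.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlparse

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_urls(xml_text):
    """Return every <loc> URL in a urlset sitemap.

    Raises xml.etree.ElementTree.ParseError if the XML is malformed.
    """
    root = ET.fromstring(xml_text)
    return [
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text
    ]


def missing_sections(urls, required=("pricing", "docs", "contact")):
    """Return the required keywords that appear in no URL path.

    The keyword list is an illustrative assumption, not a standard.
    """
    paths = [urlparse(url).path.lower() for url in urls]
    return [name for name in required if not any(name in path for path in paths)]
```

An empty list from `sitemap_urls` signals an empty sitemap, and a non-empty list from `missing_sections` points at business pages that agents would have to find some other way.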
Keep metadata agent-readable
Canonical tags, the html lang attribute, Open Graph title and description, and robots metadata that permits indexing help agents reconcile duplicate URLs and cite the right public page. These signals are small, but they reduce ambiguity in reports and AI search summaries.
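These metadata signals can be read straight from the page source. The sketch below is one possible check using Python's standard `html.parser`; `MetadataParser` and `metadata_report` are hypothetical names, and treating `noindex` or `none` in the robots meta tag as "not indexable" is a simplification that ignores crawler-specific meta tags and the X-Robots-Tag header.

```python
from html.parser import HTMLParser

SIGNALS = ("lang", "canonical", "og:title", "og:description")


class MetadataParser(HTMLParser):
    """Collects canonical, lang, Open Graph, and robots metadata from HTML."""

    def __init__(self):
        super().__init__()
        self.found = {name: None for name in SIGNALS}
        self.found["robots"] = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "html":
            self.found["lang"] = attrs.get("lang")
        elif tag == "link":
            # rel is a space-separated token list, e.g. rel="canonical".
            if "canonical" in (attrs.get("rel") or "").lower().split():
                self.found["canonical"] = attrs.get("href")
        elif tag == "meta":
            key = (attrs.get("property") or attrs.get("name") or "").lower()
            if key in ("og:title", "og:description", "robots"):
                self.found[key] = attrs.get("content")


def metadata_report(html):
    """Return the found values, the missing signals, and an indexable flag."""
    parser = MetadataParser()
    parser.feed(html)
    parser.close()
    report = dict(parser.found)
    directives = {d.strip() for d in (report["robots"] or "").lower().split(",")}
    report["indexable"] = not ({"noindex", "none"} & directives)
    report["missing"] = [name for name in SIGNALS if not report[name]]
    return report
```

A page with no robots meta tag is reported as indexable, since indexing is the default when no directive is present.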
Use discovery links when helpful
HTTP Link headers and homepage resource links can advertise sitemaps, OpenAPI, alternate formats, or protocol manifests. They are not required for every site, but they are useful for API, documentation, and protocol-heavy products.
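As an illustration of how an agent might read such a header, here is a deliberately simplified Link header parser. It handles the common `<target>; param="value"` form but is not a complete RFC 8288 implementation. The `service-desc` and `service-doc` relation types are registered by RFC 8631; the paths and function names are hypothetical.

```python
import re

# One link: <target> followed by zero or more ;name=value parameters.
LINK_RE = re.compile(r'<([^>]*)>((?:\s*;\s*[\w*-]+\s*=\s*(?:"[^"]*"|[^\s;,]+))*)')
# One parameter: ;name="quoted value" or ;name=bare-token
PARAM_RE = re.compile(r';\s*([\w*-]+)\s*=\s*(?:"([^"]*)"|([^\s;,]+))')


def parse_link_header(value):
    """Parse a Link header value into a list of dicts with 'href' plus params."""
    links = []
    for target, params in LINK_RE.findall(value):
        attrs = {
            name.lower(): quoted or bare
            for name, quoted, bare in PARAM_RE.findall(params)
        }
        links.append({"href": target, **attrs})
    return links


def find_by_rel(links, rel):
    """Return the targets whose rel list contains the given relation type."""
    return [
        link["href"]
        for link in links
        if rel in link.get("rel", "").lower().split()
    ]


if __name__ == "__main__":
    header = (
        '</openapi.json>; rel="service-desc"; type="application/json", '
        "</docs>; rel=service-doc"
    )
    print(find_by_rel(parse_link_header(header), "service-desc"))
```

The same relation types can also be exposed as `<link>` elements in the homepage head, which keeps the signal available to agents that only read HTML.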
Do not assume that a sitemap fixes uncrawlable content. Discovery files point agents at pages; the pages still need readable HTML, clear metadata, and public access.
- What should an agent discover first?
- The homepage, canonical URLs, robots.txt, sitemap.xml, product or service pages, docs, pricing, policies, contact paths, and important actions should be easy to find from public entry points.
- Can sitemap.xml fix JS-only content?
- No. A sitemap can point to a URL, but the page still needs public readable HTML, useful metadata, and enough text or structured data for agents to understand it.
- Do all discovery links need to be in the header?
- No. Headers can help protocol-heavy sites, but normal homepage links, footer links, sitemap entries, and well-known files can also provide clear discovery signals.