Schema design

Practical patterns for product, content and SaaS schemas — including SKU variants, multi-level taxonomies and computed search keywords.

The schema you declare when you create an index is the single biggest lever you have on search quality. Most "the engine isn't finding my products" complaints trace back to a schema that didn't anticipate how customers actually type.

This page collects patterns we recommend after watching hundreds of catalogues go live.

Before you write a schema, capture two things:

  1. The five most common queries. Get them from your current search logs, Google Analytics site-search, or by asking customer support what people ask for.
  2. The single most-confused product type. The one where customers describe it five different ways. That's the field that needs the most love.

Then design the schema around making those queries instant — not around your database tables.

# Product schema

A real-world Skryx schema for an automotive parts retailer. Notice how multiple SKU variants and a layered taxonomy are explicit fields rather than free-text descriptions:

{
  "id": "1234",
  "title": "Alternator 12V Dacia Logan",
  "sku":              "BK99651",
  "sku_alternative":  "DIS99651",
  "sku_no_prefix":    "99651",
  "sku_no_punctuation": "BK99651",
  "search_keywords":  "alternator dacia logan 12v generator",
  "price": 285.50,
  "image_url": "https://cdn.example.com/products/1234.jpg",
  "stock": 1,
  "category":     "Electrice",
  "subcategory":  "Alternatoare",
  "product_type": "Alternatori 12V",
  "is_xl": 0,
  "brand": "Breckner",
  "loyalty_points": 28
}

The matching schema declaration:

{
  "name": "products",
  "schema": {
    "fields": [
      { "name": "title",                "type": "string"               },
      { "name": "sku",                  "type": "string",              "facet": false },
      { "name": "sku_alternative",      "type": "string",              "optional": true },
      { "name": "sku_no_prefix",        "type": "string",              "optional": true },
      { "name": "sku_no_punctuation",   "type": "string",              "optional": true },
      { "name": "search_keywords",      "type": "string",              "optional": true },
      { "name": "price",                "type": "float",  "facet": true },
      { "name": "image_url",            "type": "string",              "optional": true, "index": false },
      { "name": "stock",                "type": "int32",  "facet": true },
      { "name": "category",             "type": "string", "facet": true },
      { "name": "subcategory",          "type": "string", "facet": true },
      { "name": "product_type",         "type": "string", "facet": true },
      { "name": "is_xl",                "type": "int32",  "facet": true },
      { "name": "brand",                "type": "string", "facet": true },
      { "name": "loyalty_points",       "type": "int32",  "facet": false }
    ],
    "default_sorting_field": "stock"
  }
}

Search it with field-weighted query_by:

{
  "q": "BK99651",
  "query_by": "sku,sku_alternative,sku_no_prefix,sku_no_punctuation,title,search_keywords",
  "query_by_weights": "10,10,8,8,5,3"
}
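
The two comma-separated lists have to stay in the same order, and that is easy to break when editing them by hand. A minimal sketch of one way to keep each field next to its weight; build_weighted_query is a hypothetical helper, not part of any Skryx client, and it only assembles the parameters shown above:

```python
def build_weighted_query(q, field_weights):
    """Build query parameters from an ordered list of (field, weight) pairs."""
    fields = [name for name, _ in field_weights]
    weights = [str(weight) for _, weight in field_weights]
    return {
        "q": q,
        "query_by": ",".join(fields),
        "query_by_weights": ",".join(weights),
    }


# Same fields and weights as the query above, kept side by side.
SKU_FIRST = [
    ("sku", 10),
    ("sku_alternative", 10),
    ("sku_no_prefix", 8),
    ("sku_no_punctuation", 8),
    ("title", 5),
    ("search_keywords", 3),
]

params = build_weighted_query("BK99651", SKU_FIRST)
```

Because each weight travels with its field, adding or removing a field cannot leave the two lists with different lengths.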

# Why multiple SKU variants

Customers type the SKU different ways. They type the version printed on the part (BK99651), the alternative supplier code from a competitor catalogue (DIS99651), the bare number (99651), or the punctuated form (BK-99651). One field per variant lets each one match with a high weight without polluting the title with codes.

sku_no_prefix and sku_no_punctuation are usually computed at index time from sku — strip the letters, strip the dashes — but stored as fields so the engine can match them with a single lookup.
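
A sketch of how the computed variants might be derived before ingest. It assumes the prefix is a leading run of letters and that anything other than letters and digits counts as punctuation; sku_variants is a hypothetical helper, so adjust both rules to your own code format:

```python
import re


def sku_variants(sku):
    """Derive the computed SKU fields from the canonical SKU string."""
    # ASSUMPTION: punctuation means any character that is not a letter or digit.
    no_punctuation = re.sub(r"[^A-Za-z0-9]", "", sku)
    # ASSUMPTION: the prefix is the leading run of letters only.
    no_prefix = re.sub(r"^[A-Za-z]+", "", no_punctuation)
    return {
        "sku": sku,
        "sku_no_prefix": no_prefix,
        "sku_no_punctuation": no_punctuation,
    }


variants = sku_variants("BK-99651")
```

For a SKU made only of letters, sku_no_prefix comes out empty; since the field is optional in the schema above, it is probably better to drop it from the record in that case.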

# Why a search_keywords field

Free-form text you control. Use it for:

  • Common misspellings of the title (already handled by typo tolerance, but this gives you per-product overrides).
  • Translations and synonyms when your customers search in more than one language ("alternator generator alternateur").
  • Trade names, generation numbers, vehicle compatibility codes that don't belong in the title.

Keep search_keywords weighted lower than title so the title is what mostly drives ranking. The keyword field is a safety net.
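
One possible way to assemble the field at index time, sketched under the assumption that you keep a list of extra terms per product. build_search_keywords is a hypothetical helper; it lowercases the terms and drops repeated words so the field stays short:

```python
def build_search_keywords(terms):
    """Join terms into one lowercase string, keeping the first copy of each word."""
    seen = set()
    keywords = []
    for term in terms:
        for word in term.lower().split():
            if word not in seen:
                seen.add(word)
                keywords.append(word)
    return " ".join(keywords)


keywords = build_search_keywords(
    ["Alternator Dacia Logan 12V", "generator", "alternator"]
)
```

With these inputs the result matches the search_keywords value in the example record above.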

# Why three levels of taxonomy

Splitting category / subcategory / product_type instead of one field gives you:

  • Facets at the right granularity. Customers filter by category=Electrice, drill into subcategory=Alternatoare, and may further pick product_type=Alternatori 12V.
  • Boost rules per level. "Boost in-stock items in category:Electrice" is one rule; "pin specific products in product_type:Alternatori 12V" is another.
  • Cleaner analytics. Top zero-result queries grouped by category tell you which taxonomy branch needs catalog work.
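
The drill-down in the first bullet can be sketched as a filter that grows one level at a time. The field:value form follows the filter examples used elsewhere on this page; joining clauses with " && " is an assumption about the filter syntax, and taxonomy_filter is a hypothetical helper:

```python
TAXONOMY_LEVELS = ("category", "subcategory", "product_type")


def taxonomy_filter(selection):
    """Build a filter_by string from the taxonomy levels the customer has picked."""
    clauses = []
    for level in TAXONOMY_LEVELS:
        value = selection.get(level)
        if value is None:
            break  # a deeper level only applies once its parent is chosen
        clauses.append(f"{level}:{value}")
    # ASSUMPTION: " && " is the AND operator between filter clauses.
    return " && ".join(clauses)


filter_by = taxonomy_filter(
    {"category": "Electrice", "subcategory": "Alternatoare"}
)
```

Values that contain spaces, such as Alternatori 12V, may need quoting or escaping; check the filter syntax reference before relying on this.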

# Custom boolean / flag fields

Model anything you want to filter/boost on as a typed field, not free text in the title:

  • is_xl — large items that need different shipping pricing.
  • is_featured, is_new, is_bestseller — promotion flags.
  • stock — int, not boolean, so you can do filter_by: stock:>0.

Skryx accepts bool, but int32 for stock is more useful because it doubles as a sort key for "show in-stock first then by price".

# Image URL — store but don't index

image_url is shown in your front-end but never searched. Add "index": false so it doesn't bloat the search index. Same for any field that's display-only — last_modified, internal notes, supplier_part_number codes the customer never sees.

# Content / docs schema

Quite different from products. The query is usually a question or partial phrase; relevance hinges on title + first paragraph.

{
  "fields": [
    { "name": "title",     "type": "string"               },
    { "name": "subtitle",  "type": "string", "optional": true },
    { "name": "body",      "type": "string"               },
    { "name": "breadcrumb", "type": "string[]",            "optional": true },
    { "name": "author",    "type": "string", "facet": true },
    { "name": "section",   "type": "string", "facet": true },
    { "name": "tags",      "type": "string[]", "facet": true },
    { "name": "published_at", "type": "int64", "facet": true }
  ],
  "default_sorting_field": "published_at"
}

Query weighted heavily toward titles and subtitles:

{ "q": "synonyms", "query_by": "title,subtitle,body", "query_by_weights": "8,4,1" }

Chunk long documents (articles, manuals) into paragraph-sized records so highlights are useful — one match window per result, not one giant truncated block.
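
A rough sketch of the chunking step, assuming paragraphs are separated by blank lines. chunk_document, the max_chars limit and the id format are all hypothetical choices; the shared fields (title, author, section and so on) are copied onto every chunk so each record still satisfies the schema:

```python
import re


def chunk_document(doc_id, shared_fields, text, max_chars=800):
    """Split a long document into paragraph-sized records."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks, current = [], ""
    for paragraph in paragraphs:
        # Start a new chunk when adding this paragraph would exceed the limit.
        if current and len(current) + len(paragraph) + 2 > max_chars:
            chunks.append(current)
            current = paragraph
        else:
            current = f"{current}\n\n{paragraph}" if current else paragraph
    if current:
        chunks.append(current)
    # A single paragraph longer than max_chars is kept whole, not cut mid-sentence.
    return [
        {**shared_fields, "id": f"{doc_id}-{i}", "body": body}
        for i, body in enumerate(chunks)
    ]


records = chunk_document(
    "guide",
    {"title": "Synonyms"},
    "First paragraph.\n\nSecond paragraph.",
    max_chars=20,
)
```

Short paragraphs are merged up to the limit so that a highlight has some surrounding context instead of a single isolated sentence.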

# SaaS / multi-tenant schema

Whatever the unit of search is in your product (jobs, candidates, tickets, messages), pin the tenant ID on every record:

{
  "fields": [
    { "name": "tenant_id",  "type": "int64",  "facet": true },
    { "name": "title",      "type": "string" },
    { "name": "body",       "type": "string" },
    { "name": "owner",      "type": "string", "facet": true },
    { "name": "status",     "type": "string", "facet": true },
    { "name": "created_at", "type": "int64",  "facet": true }
  ]
}

Then filter every query with filter_by: "tenant_id:42". See Multi-tenancy for the trust model.
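
Since a single forgotten filter leaks another tenant's records, it may be safer to add the clause in one place on the server than to trust each caller. A minimal sketch; scoped_query is a hypothetical helper, and both the " && " operator and the parentheses are assumptions about the filter syntax:

```python
def scoped_query(params, tenant_id):
    """Return a copy of params whose filter_by is always limited to one tenant."""
    # int() rejects anything that is not a plain number before it reaches the filter.
    tenant_clause = f"tenant_id:{int(tenant_id)}"
    user_filter = params.get("filter_by", "").strip()
    scoped = dict(params)
    # ASSUMPTION: " && " means AND, and parentheses group the caller's filter.
    scoped["filter_by"] = (
        f"{tenant_clause} && ({user_filter})" if user_filter else tenant_clause
    )
    return scoped


query = scoped_query({"q": "invoice", "filter_by": "status:open"}, 42)
```

The caller's own filter is wrapped rather than concatenated, so an OR inside it cannot widen the search beyond the tenant.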

# Rules of thumb

  • Make a field for everything customers filter on. Filtering on a stored field is cheap at query time; computing the value on the fly isn't.
  • Don't put codes in the title. They hurt human-readability and don't help ranking — give them their own SKU-variant fields instead.
  • Don't change types after going live. Adding optional fields is free (PATCH /settings); renaming a field or changing its type requires a re-index. The zero-downtime swap pattern is built for exactly this.
  • Use optional: true on fields you might not have on every record. Without it, the missing-field error breaks ingest for the whole batch.
  • Mark display-only fields "index": false. Saves memory, speeds search.
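
Because one record with a missing required field can fail a whole batch, a pre-flight check before ingest is cheap insurance. A sketch that assumes the "fields" list format shown in the schemas above; missing_required_fields is a hypothetical helper that runs entirely on your side:

```python
def missing_required_fields(schema_fields, records):
    """Map each record index to the required fields that record lacks."""
    required = [f["name"] for f in schema_fields if not f.get("optional", False)]
    problems = {}
    for index, record in enumerate(records):
        missing = [name for name in required if record.get(name) is None]
        if missing:
            problems[index] = missing
    return problems


fields = [
    {"name": "title", "type": "string"},
    {"name": "image_url", "type": "string", "optional": True, "index": False},
]
problems = missing_required_fields(
    fields,
    [{"title": "Alternator 12V Dacia Logan"}, {"image_url": "https://cdn.example.com/products/1234.jpg"}],
)
```

Records listed in the result can be fixed or set aside, and the rest of the batch sent on its own.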