Schema Validation

Ensure data quality and consistency with JSON Schema validation

Schema validation is a powerful feature that ensures your extracted data matches your expected structure and types. By defining a JSON Schema, you can guarantee data quality, catch extraction errors early, and maintain consistency across your applications.

Why Use Schema Validation?

Without schema validation, the same extraction request can return different field names on each call:

First Request Response
[
  {"product_name": "iPhone 14", "cost": 999.99},
  {"product_name": "Samsung Galaxy", "cost": 899.50}
]
Second Request Response (Same prompt)
[
  {"name": "iPhone 14", "price": 999.99},
  {"name": "Samsung Galaxy", "price": 899.50}
]
Third Request Response (Same prompt)
[
  {"title": "iPhone 14", "amount": 999.99},
  {"title": "Samsung Galaxy", "amount": 899.50}
]

With schema validation, you get consistent, predictable results.
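
To make the inconsistency concrete, here is a minimal sketch in plain Python (no validator library) that checks the three responses above against the fields a schema would mark as required. The `REQUIRED` set and the `matches_required` helper are illustrative names, not part of the NextRows API.

```python
# First item of each of the three responses shown above.
responses = [
    [{"product_name": "iPhone 14", "cost": 999.99}],
    [{"name": "iPhone 14", "price": 999.99}],
    [{"title": "iPhone 14", "amount": 999.99}],
]

# Fields a schema would list under "required" (illustrative choice).
REQUIRED = {"name", "price"}

def matches_required(items, required=REQUIRED):
    """Return True if every item contains all required keys."""
    return all(required.issubset(item) for item in items)

for number, items in enumerate(responses, start=1):
    print(f"response {number}: {matches_required(items)}")
```

Only the second response uses the expected field names, so downstream code that reads `item["name"]` would break on the other two.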

Basic Schema Structure

A schema defines the expected structure of your data using the JSON Schema specification:

Basic Schema
{
  "type": "array",
  "items": {
    "type": "object",
    "properties": {
      "name": {
        "type": "string",
        "minLength": 1
      },
      "price": {
        "type": "number",
        "minimum": 0
      }
    },
    "required": ["name", "price"]
  }
}

Schema Types and Validation

String Validation

String Schema
{
  "type": "string",
  "minLength": 1,
  "maxLength": 100,
  "pattern": "^[A-Za-z\\s]+$"  // Letters and whitespace only (digits and punctuation are rejected)
}

Use cases:

  • Product names
  • Company names
  • Addresses
  • Categories
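
The pattern above is stricter than it may look: it rejects digits, so a product name such as "iPhone 14" would fail. The sketch below reproduces the three string constraints with Python's `re` module, assuming the pattern behaves the same in Python as in the regular expression dialect your validator uses (true for this simple expression). `check_string` is a hypothetical helper for local experimentation.

```python
import re

# Same regular expression as the "pattern" keyword above.
NAME_PATTERN = re.compile(r"^[A-Za-z\s]+$")

def check_string(value, min_length=1, max_length=100, pattern=NAME_PATTERN):
    """Return a list of constraint violations for a string (empty if valid)."""
    problems = []
    if len(value) < min_length:
        problems.append("shorter than minLength")
    if len(value) > max_length:
        problems.append("longer than maxLength")
    # JSON Schema patterns are not implicitly anchored, hence search().
    if not pattern.search(value):
        problems.append("does not match pattern")
    return problems

print(check_string("Samsung Galaxy"))  # no violations
print(check_string("iPhone 14"))       # pattern violation: digits are not allowed
```

If names with digits are expected, loosening or dropping the pattern is usually simpler than encoding every allowed character.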

Number Validation

Number Schema
{
  "type": "number",
  "minimum": 0,
  "maximum": 10000,
  "multipleOf": 0.01  // Allows at most 2 decimal places
}

Use cases:

  • Prices
  • Ratings
  • Quantities
  • Measurements
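
Depending on the validator, `multipleOf` with a fractional step such as 0.01 may be evaluated with binary floating-point arithmetic, which can misjudge some values. When checking extracted prices locally, an exact decimal comparison avoids that problem. This is a minimal sketch; `is_multiple_of` is an illustrative helper, not a library function.

```python
from decimal import Decimal

def is_multiple_of(value, step):
    """Check the multipleOf constraint exactly using decimal arithmetic."""
    # str() gives the shortest decimal representation of a float,
    # so 999.99 becomes Decimal("999.99") rather than a long binary expansion.
    remainder = Decimal(str(value)) % Decimal(str(step))
    return remainder == 0

print(is_multiple_of(999.99, 0.01))   # two decimal places
print(is_multiple_of(899.505, 0.01))  # three decimal places
```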

Integer Validation

Integer Schema
{
  "type": "integer",
  "minimum": 0,
  "exclusiveMaximum": 1000
}

Use cases:

  • Stock quantities
  • Review counts
  • Page numbers
  • Years
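
Note that `minimum` is inclusive while `exclusiveMaximum` is not, so the schema above accepts 0 through 999. The following sketch mirrors that range check in Python. It is a simplification: it rejects all floats, whereas recent JSON Schema drafts also accept values such as 1.0 as integers. `is_valid_count` is a hypothetical name.

```python
def is_valid_count(value, minimum=0, exclusive_maximum=1000):
    """Mirror {"type": "integer", "minimum": 0, "exclusiveMaximum": 1000}."""
    # bool is a subclass of int in Python, but a JSON boolean is not an integer.
    if isinstance(value, bool) or not isinstance(value, int):
        return False
    return minimum <= value < exclusive_maximum

for candidate in (0, 999, 1000, -1, 12.5, True):
    print(candidate, is_valid_count(candidate))
```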

Boolean Validation

Boolean Schema
{
  "type": "boolean"
}

Use cases:

  • In stock status
  • Featured items
  • Active listings
  • Available options

Enum Validation

Enum Schema
{
  "type": "string",
  "enum": ["new", "used", "refurbished"]
}

Use cases:

  • Product conditions
  • Categories
  • Status values
  • Fixed options

Advanced Schema Features

Date and Time Formats

Date Schema
{
  "type": "string",
  "format": "date",           // YYYY-MM-DD
  "pattern": "^\\d{4}-\\d{2}-\\d{2}$"
}
DateTime Schema
{
  "type": "string",
  "format": "date-time"       // RFC 3339 date-time, e.g. 2024-01-15T10:30:00Z
}
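
The `pattern` in the date schema only checks the shape of the string, so a value like "2024-13-45" satisfies it, and many validators treat `format` as an annotation unless format checking is explicitly enabled. If calendar validity matters, a local check along these lines can help. This is a sketch using only the Python standard library; `is_valid_date` is an illustrative helper.

```python
import re
from datetime import date

def is_valid_date(text):
    """Return True only for real calendar dates written as YYYY-MM-DD."""
    # Shape check, equivalent to the anchored pattern in the schema above.
    if not re.fullmatch(r"\d{4}-\d{2}-\d{2}", text):
        return False
    try:
        date.fromisoformat(text)  # rejects month 13, day 45, Feb 29 in non-leap years
    except ValueError:
        return False
    return True

for candidate in ("2024-01-15", "2024-13-45", "01/15/2024"):
    print(candidate, is_valid_date(candidate))
```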

URL Validation

URL Schema
{
  "type": "string",
  "format": "uri",
  "pattern": "^https?://"     // Only HTTP/HTTPS URLs
}
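
A rough client-side equivalent of this URL schema can be written with `urllib.parse`. The sketch below is an approximation rather than a faithful reimplementation: unlike the case-sensitive `^https?://` pattern, `urlparse` lowercases the scheme. `is_http_url` is a hypothetical helper.

```python
from urllib.parse import urlparse

def is_http_url(text):
    """Return True if text parses as an absolute http(s) URL with a host."""
    parts = urlparse(text)
    return parts.scheme in ("http", "https") and bool(parts.netloc)

for candidate in ("https://store.com/products", "ftp://example.com/file", "store.com/products"):
    print(candidate, is_http_url(candidate))
```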

Email Validation

Email Schema
{
  "type": "string",
  "format": "email"
}

Complex Object Validation

Nested Object Schema
{
  "type": "object",
  "properties": {
    "product": {
      "type": "object",
      "properties": {
        "name": {"type": "string"},
        "specs": {
          "type": "object",
          "properties": {
            "weight": {"type": "number"},
            "dimensions": {
              "type": "object",
              "properties": {
                "length": {"type": "number"},
                "width": {"type": "number"},
                "height": {"type": "number"}
              }
            }
          }
        }
      }
    }
  }
}

Array Validation

Array Schema
{
  "type": "array",
  "items": {"type": "string"},
  "minItems": 1,
  "maxItems": 10,
  "uniqueItems": true
}
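
The three array keywords can be checked by hand as follows. This is a minimal sketch for experimenting with sample data; it compares elements by their JSON serialization, which approximates but does not exactly match JSON Schema equality (for example, it treats 1 and 1.0 as different). `check_array` is an illustrative name.

```python
import json

def check_array(items, min_items=1, max_items=10):
    """Mirror minItems, maxItems, and uniqueItems for a list of JSON values."""
    problems = []
    if len(items) < min_items:
        problems.append("fewer than minItems")
    if len(items) > max_items:
        problems.append("more than maxItems")
    # Serialize each element so unhashable values (lists, dicts) can be compared.
    serialized = [json.dumps(item, sort_keys=True) for item in items]
    if len(set(serialized)) != len(serialized):
        problems.append("duplicate items")
    return problems

print(check_array(["electronics", "phones"]))
print(check_array(["phones", "phones"]))
```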

Real-World Examples

E-commerce Product Schema

Product Schema
{
  "type": "array",
  "items": {
    "type": "object",
    "properties": {
      "name": {
        "type": "string",
        "minLength": 1,
        "maxLength": 200
      },
      "price": {
        "type": "number",
        "minimum": 0,
        "multipleOf": 0.01
      },
      "currency": {
        "type": "string",
        "enum": ["USD", "EUR", "GBP", "JPY"]
      },
      "rating": {
        "type": "number",
        "minimum": 0,
        "maximum": 5
      },
      "review_count": {
        "type": "integer",
        "minimum": 0
      },
      "in_stock": {
        "type": "boolean"
      },
      "category": {
        "type": "array",
        "items": {"type": "string"},
        "minItems": 1
      },
      "url": {
        "type": "string",
        "format": "uri"
      }
    },
    "required": ["name", "price", "currency"]
  }
}

Job Listing Schema

Job Schema
{
  "type": "array",
  "items": {
    "type": "object",
    "properties": {
      "title": {
        "type": "string",
        "minLength": 5
      },
      "company": {
        "type": "string",
        "minLength": 2
      },
      "location": {
        "type": "string"
      },
      "salary_min": {
        "type": "number",
        "minimum": 0
      },
      "salary_max": {
        "type": "number",
        "minimum": 0
      },
      "experience_level": {
        "type": "string",
        "enum": ["entry", "mid", "senior", "executive"]
      },
      "employment_type": {
        "type": "string",
        "enum": ["full-time", "part-time", "contract", "internship"]
      },
      "posted_date": {
        "type": "string",
        "format": "date"
      },
      "skills": {
        "type": "array",
        "items": {"type": "string"},
        "uniqueItems": true
      }
    },
    "required": ["title", "company", "location"]
  }
}

Real Estate Schema

Property Schema
{
  "type": "array",
  "items": {
    "type": "object",
    "properties": {
      "address": {
        "type": "string",
        "minLength": 10
      },
      "price": {
        "type": "number",
        "minimum": 1000
      },
      "bedrooms": {
        "type": "integer",
        "minimum": 0,
        "maximum": 20
      },
      "bathrooms": {
        "type": "number",
        "minimum": 0,
        "maximum": 20,
        "multipleOf": 0.5
      },
      "square_feet": {
        "type": "integer",
        "minimum": 100
      },
      "property_type": {
        "type": "string",
        "enum": ["house", "condo", "townhouse", "apartment", "land"]
      },
      "year_built": {
        "type": "integer",
        "minimum": 1800,
        "maximum": 2030
      },
      "listing_date": {
        "type": "string",
        "format": "date"
      }
    },
    "required": ["address", "price", "bedrooms", "bathrooms"]
  }
}

Using Schemas in Practice

With NextRows API

Python Example
import requests

def extract_with_schema(urls, prompt, schema):
    """Send an extraction request and return the data, or None on failure."""
    response = requests.post(
        "https://api.nextrows.com/v1/extract",
        headers={"Authorization": "Bearer sk-nr-your-api-key"},
        json={
            "type": "url",
            "data": urls,
            "prompt": prompt,
            "schema": schema
        },
        timeout=60,  # Illustrative value; avoids hanging on a stalled connection
    )

    result = response.json()

    if result.get('success'):
        # Validation runs before the response is returned,
        # so the data here has already passed the schema
        return result['data']
    else:
        print(f"Extraction failed: {result.get('error')}")
        return None

# Usage
product_schema = {
    "type": "array",
    "items": {
        "type": "object",
        "properties": {
            "name": {"type": "string", "minLength": 1},
            "price": {"type": "number", "minimum": 0}
        },
        "required": ["name", "price"]
    }
}

products = extract_with_schema(
    ["https://store.com/products"],
    "Extract product name and price",
    product_schema
)

Schema Evolution

As your needs change, you can evolve your schemas:

Version 1 - Basic
{
  "type": "object",
  "properties": {
    "name": {"type": "string"},
    "price": {"type": "number"}
  }
}
Version 2 - Enhanced
{
  "type": "object",
  "properties": {
    "name": {"type": "string"},
    "price": {"type": "number"},
    "currency": {"type": "string", "default": "USD"},
    "availability": {"type": "boolean", "default": true}
  },
  "required": ["name", "price"]
}
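
In standard JSON Schema, `default` is an annotation: validators generally report it but do not insert the value into the data. Whether NextRows fills in defaults is not stated here, so if older records need the Version 2 fields, one option is to apply the defaults yourself. The sketch below shows one way to do that for top-level properties; `apply_defaults` is a hypothetical helper.

```python
import copy

SCHEMA_V2 = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "price": {"type": "number"},
        "currency": {"type": "string", "default": "USD"},
        "availability": {"type": "boolean", "default": True},
    },
    "required": ["name", "price"],
}

def apply_defaults(item, schema):
    """Return a copy of item with missing top-level properties set to their schema defaults."""
    filled = copy.deepcopy(item)
    for key, subschema in schema.get("properties", {}).items():
        if key not in filled and "default" in subschema:
            filled[key] = copy.deepcopy(subschema["default"])
    return filled

old_record = {"name": "Product A", "price": 99.99}  # shaped like Version 1 data
print(apply_defaults(old_record, SCHEMA_V2))
```

The original record is left untouched, which makes it easy to compare data before and after migration.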

Error Handling and Debugging

Common Validation Errors

Type Mismatch

// Schema expects number, got string
{
  "error": "Type validation failed",
  "details": "Expected number, got string '99.99' for field 'price'"
}

Solution: Adjust your prompt to specify data types:

{
  "prompt": "Extract price as a number without currency symbols"
}
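
If you post-process results yourself, for example data collected earlier without a schema, a small coercion step can turn price strings into numbers. This is a sketch that assumes US-style formatting (comma as thousands separator, period as decimal point); `parse_price` is an illustrative helper and not part of the API.

```python
import re

def parse_price(text):
    """Convert strings like "$1,299.00" to a float, or return None if no number is found."""
    # Keep only digits, the decimal point, and a minus sign.
    cleaned = re.sub(r"[^0-9.\-]", "", text)
    try:
        return float(cleaned)
    except ValueError:
        return None

for raw in ("99.99", "$1,299.00", "N/A"):
    print(repr(raw), parse_price(raw))
```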

Missing Required Fields

{
  "error": "Validation failed",
  "details": "Required field 'name' is missing"
}

Solution: Make your prompt more explicit:

{
  "prompt": "Extract product name (required) and price for each item"
}

Format Validation Errors

{
  "error": "Format validation failed", 
  "details": "Field 'date' does not match format 'date'"
}

Solution: Specify the expected format in your prompt:

{
  "prompt": "Extract publication date in YYYY-MM-DD format"
}
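
The three cases above can be handled in one place in client code. The sketch below maps each error to the corresponding prompt adjustment, assuming error responses carry the `error` and `details` fields with exactly the strings shown in these examples; the real API may use other values. `PROMPT_HINTS` and `suggest_fix` are hypothetical names.

```python
# Hints taken from the three error examples above, keyed by their "error" strings.
PROMPT_HINTS = {
    "Type validation failed": "State the expected data type in the prompt.",
    "Validation failed": "Name the required fields explicitly in the prompt.",
    "Format validation failed": "Specify the expected format in the prompt.",
}

def suggest_fix(error_response):
    """Build a one-line message combining the error, its details, and a prompt hint."""
    error = error_response.get("error", "")
    details = error_response.get("details", "")
    hint = PROMPT_HINTS.get(error, "No specific hint; inspect the details field.")
    return f"{error}: {details} -> {hint}"

print(suggest_fix({
    "error": "Validation failed",
    "details": "Required field 'name' is missing",
}))
```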

Debugging Schema Issues

  1. Start with a minimal schema:

    {
      "type": "array",
      "items": {"type": "object"}
    }
  2. Add constraints gradually:

    {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "name": {"type": "string"}
        },
        "required": ["name"]
      }
    }
  3. Test with sample data:

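    # Test the schema against known data before using it with NextRows
    import jsonschema

    test_data = [{"name": "Test Product", "price": 99.99}]
    # `schema` is the dict from step 2; raises ValidationError on a mismatch
    jsonschema.validate(test_data, schema)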

Best Practices

1. Start Simple, Add Complexity

// ❌ Too complex initially
{
  "type": "array",
  "items": {
    "type": "object",
    "properties": {
      "name": {"type": "string", "pattern": "^[A-Z][a-z\\s]+$"},
      "price": {"type": "number", "multipleOf": 0.01, "minimum": 0.01},
      "category": {"type": "string", "enum": ["electronics", "clothing", "books"]}
    }
  }
}

// ✅ Start simple
{
  "type": "array", 
  "items": {
    "type": "object",
    "properties": {
      "name": {"type": "string"},
      "price": {"type": "number"}
    },
    "required": ["name"]
  }
}

2. Use Descriptive Error Messages

{
  "type": "string",
  "minLength": 1,
  "errorMessage": "Product name cannot be empty"  // Not part of core JSON Schema; requires validator support
}

3. Handle Optional Data Gracefully

{
  "type": "object",
  "properties": {
    "name": {"type": "string"},
    "price": {"type": "number"},
    "sale_price": {"type": ["number", "null"]},  // Nullable, and optional because it is not listed in "required"
    "description": {"type": "string", "default": "No description available"}  // Standard validators treat "default" as an annotation
  },
  "required": ["name", "price"]
}

4. Use Appropriate Data Types

// ❌ Everything as string
{
  "price": {"type": "string"},
  "rating": {"type": "string"},
  "in_stock": {"type": "string"}
}

// ✅ Proper types
{
  "price": {"type": "number"},
  "rating": {"type": "number", "minimum": 0, "maximum": 5},
  "in_stock": {"type": "boolean"}
}

Performance Considerations

Schema Complexity Impact

  • Simple schemas: Faster validation, lower processing overhead
  • Complex schemas: More validation rules, slightly higher overhead
  • Deep nesting: Can impact performance with large datasets

Optimization Tips

  1. Minimize nesting depth when possible
  2. Use specific types rather than generic validation
  3. Avoid overly restrictive patterns for large datasets
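
To get a quick sense of how deeply a schema nests, you can measure it before sending it. The sketch below counts levels through `properties` and `items` only; it ignores other keywords that can contain subschemas (such as `anyOf`), so treat the result as a rough indicator. `schema_depth` is an illustrative helper.

```python
def schema_depth(schema):
    """Return the nesting depth of a schema, following properties and items."""
    children = list(schema.get("properties", {}).values())
    items = schema.get("items")
    if isinstance(items, dict):
        children.append(items)
    # A schema with no children has depth 1.
    return 1 + max((schema_depth(child) for child in children), default=0)

flat = {
    "type": "array",
    "items": {"type": "object", "properties": {"name": {"type": "string"}}},
}
nested = {
    "type": "object",
    "properties": {
        "specs": {
            "type": "object",
            "properties": {
                "dimensions": {
                    "type": "object",
                    "properties": {"length": {"type": "number"}},
                }
            },
        }
    },
}

print(schema_depth(flat), schema_depth(nested))
```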

Schema validation happens after data extraction but before the response is returned. Failed validation will result in an error response, so ensure your schemas match your extraction requirements.

Schema Tools and Resources

Schema Generation Tools

Python Schema Generation
from genson import SchemaBuilder

# Generate schema from sample data
builder = SchemaBuilder()
builder.add_object({"name": "Product A", "price": 99.99})
builder.add_object({"name": "Product B", "price": 149.50})

schema = builder.to_schema()
print(schema)

Testing Schemas

Schema Testing
import jsonschema

def test_schema(schema, sample_data):
    try:
        jsonschema.validate(sample_data, schema)
        print("Schema validation passed")
        return True
    except jsonschema.ValidationError as e:
        print(f"Schema validation failed: {e.message}")
        return False

# Test before using with NextRows
schema = {"type": "array", "items": {"type": "object"}}
sample = [{"name": "Test", "price": 99.99}]
test_schema(schema, sample)
