Schema Validation
Ensure data quality and consistency with JSON Schema validation
Schema validation checks that your extracted data matches the structure and types you expect. By defining a JSON Schema, you can catch extraction errors early and keep output consistent across your applications.
Why Use Schema Validation?
Without schema validation, the same extraction request can return different response formats each time:
[
{"product_name": "iPhone 14", "cost": 999.99},
{"product_name": "Samsung Galaxy", "cost": 899.50}
]
[
{"name": "iPhone 14", "price": 999.99},
{"name": "Samsung Galaxy", "price": 899.50}
]
[
{"title": "iPhone 14", "amount": 999.99},
{"title": "Samsung Galaxy", "amount": 899.50}
]
With schema validation, you get consistent, predictable results.
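To make the inconsistency concrete, the sketch below collects the distinct sets of field names used across the three responses shown above. The helper name key_sets is illustrative and not part of any API. More than one key set means downstream code has to handle each variant separately.

```python
def key_sets(responses):
    """Return the distinct sets of field names used across several responses."""
    return {frozenset(record) for response in responses for record in response}

responses = [
    [{"product_name": "iPhone 14", "cost": 999.99}],
    [{"name": "iPhone 14", "price": 999.99}],
    [{"title": "iPhone 14", "amount": 999.99}],
]

distinct = key_sets(responses)
print(len(distinct))  # 3 different shapes for the same data
```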
Basic Schema Structure
A schema defines the expected structure of your data using the JSON Schema specification:
{
"type": "array",
"items": {
"type": "object",
"properties": {
"name": {
"type": "string",
"minLength": 1
},
"price": {
"type": "number",
"minimum": 0
}
},
"required": ["name", "price"]
}
}
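As a rough illustration of what the schema above enforces, here is a hand-written check for a single record. It is only a sketch of these four rules (required fields, string type, "minLength", "minimum"), not a JSON Schema implementation, and the function name check_product is an assumption made for this example.

```python
def check_product(record):
    """Return a list of problems for one record, mirroring the basic schema."""
    problems = []
    for field in ("name", "price"):  # "required": ["name", "price"]
        if field not in record:
            problems.append(f"missing required field '{field}'")
    if "name" in record:
        name = record["name"]
        if not (isinstance(name, str) and len(name) >= 1):
            problems.append("'name' must be a non-empty string")
    if "price" in record:
        price = record["price"]
        # bool is a subclass of int in Python, but not a JSON number
        is_number = isinstance(price, (int, float)) and not isinstance(price, bool)
        if not is_number or price < 0:
            problems.append("'price' must be a number >= 0")
    return problems

print(check_product({"name": "iPhone 14", "price": 999.99}))  # []
```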
Schema Types and Validation
String Validation
{
"type": "string",
"minLength": 1,
"maxLength": 100,
"pattern": "^[A-Za-z\\s]+$"
}
The pattern allows only letters and whitespace.
Use cases:
- Product names
- Company names
- Addresses
- Categories
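Before applying a pattern, it can help to try it against real values. JSON Schema patterns follow the ECMA-262 regular expression dialect; for a simple pattern like the one above, Python's re module should behave the same way. The sketch below suggests that a letters-only pattern rejects product names that contain digits, so it may be too strict for names like "iPhone 14".

```python
import re

# Same expression as the "pattern" keyword in the string schema
NAME_PATTERN = re.compile(r"^[A-Za-z\s]+$")

def matches_name_pattern(value):
    """True if the value consists only of letters and whitespace."""
    return NAME_PATTERN.search(value) is not None

print(matches_name_pattern("Samsung Galaxy"))  # True
print(matches_name_pattern("iPhone 14"))       # False: contains digits
```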
Number Validation
{
"type": "number",
"minimum": 0,
"maximum": 10000,
"multipleOf": 0.01
}
A "multipleOf" of 0.01 limits values to at most two decimal places.
Use cases:
- Prices
- Ratings
- Quantities
- Measurements
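One caveat with "multipleOf": 0.01 is that a check done in binary floating point can reject values that look valid, because numbers like 19.99 and 0.01 are not stored exactly. Whether this affects you depends on the validator. The sketch below shows the effect and one way to test the rule with decimal arithmetic; the helper name is hypothetical.

```python
from decimal import Decimal

def has_at_most_two_decimals(value):
    """Check the multipleOf 0.01 rule using decimal arithmetic instead of binary floats."""
    return Decimal(str(value)) % Decimal("0.01") == 0

print(19.99 % 0.01 == 0)                 # False, because of binary rounding
print(has_at_most_two_decimals(19.99))   # True
print(has_at_most_two_decimals(19.999))  # False
```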
Integer Validation
{
"type": "integer",
"minimum": 0,
"exclusiveMaximum": 1000
}
Use cases:
- Stock quantities
- Review counts
- Page numbers
- Years
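Two details of the integer schema are easy to get wrong when checking values in Python: "exclusiveMaximum" excludes the bound itself, and Python treats True and False as integers while JSON does not. Recent JSON Schema drafts also accept a number with a zero fractional part, such as 5.0, as an integer. The function below is a sketch of those rules under that assumption, not a validator.

```python
def is_schema_integer(value, minimum=0, exclusive_maximum=1000):
    """Mirror the integer schema above: minimum <= value < exclusive_maximum."""
    if isinstance(value, bool):  # True/False are ints in Python, not in JSON
        return False
    if isinstance(value, float):
        if not value.is_integer():
            return False
    elif not isinstance(value, int):
        return False
    return minimum <= value < exclusive_maximum

print(is_schema_integer(999))   # True
print(is_schema_integer(1000))  # False: the maximum is exclusive
```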
Boolean Validation
{
"type": "boolean"
}
Use cases:
- In stock status
- Featured items
- Active listings
- Available options
Enum Validation
{
"type": "string",
"enum": ["new", "used", "refurbished"]
}
Use cases:
- Product conditions
- Categories
- Status values
- Fixed options
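Enum matching is exact and case-sensitive, so "New" or " used " would not match the values above. If you post-process extracted text yourself, a small normalization step like the following sketch can help. The helper name normalize_condition is an assumption for this example.

```python
ALLOWED_CONDITIONS = ("new", "used", "refurbished")

def normalize_condition(raw):
    """Trim and lowercase a value; return None if it is not an allowed condition."""
    if not isinstance(raw, str):
        return None
    value = raw.strip().lower()
    return value if value in ALLOWED_CONDITIONS else None

print(normalize_condition("  Refurbished "))  # refurbished
print(normalize_condition("like new"))        # None
```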
Advanced Schema Features
Date and Time Formats
{
"type": "string",
"format": "date",
"pattern": "^\\d{4}-\\d{2}-\\d{2}$"
}
{
"type": "string",
"format": "date-time"
}
The "date" format expects YYYY-MM-DD. The "date-time" format expects an ISO 8601 timestamp, for example 2024-01-15T10:30:00Z.
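The pattern on its own only checks the shape of the string, so a value like 2024-13-45 would pass it, and some validators treat "format" as an annotation unless format checking is enabled. The sketch below combines the shape check with a real calendar check using only the standard library; is_valid_date is a hypothetical helper.

```python
import re
from datetime import date

# Same shape as the schema pattern, restricted to ASCII digits
DATE_PATTERN = re.compile(r"\d{4}-\d{2}-\d{2}", re.ASCII)

def is_valid_date(value):
    """True only for real calendar dates written as YYYY-MM-DD."""
    if not isinstance(value, str) or not DATE_PATTERN.fullmatch(value):
        return False
    try:
        date.fromisoformat(value)
    except ValueError:
        return False
    return True

print(is_valid_date("2024-02-29"))  # True: 2024 is a leap year
print(is_valid_date("2024-13-45"))  # False: matches the pattern, not a real date
```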
URL Validation
{
"type": "string",
"format": "uri",
"pattern": "^https?://"
}
The pattern restricts values to HTTP and HTTPS URLs.
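A rough standard-library counterpart to this schema is shown below. It is a sketch rather than an exact match: urlparse lowercases the scheme, so it is slightly more lenient than the case-sensitive pattern. The helper name is_http_url is an assumption.

```python
from urllib.parse import urlparse

def is_http_url(value):
    """True if the value parses as an http(s) URL with a host part."""
    if not isinstance(value, str):
        return False
    try:
        parts = urlparse(value)
    except ValueError:  # raised for some malformed inputs, e.g. bad IPv6 brackets
        return False
    return parts.scheme in ("http", "https") and bool(parts.netloc)

print(is_http_url("https://store.com/products"))  # True
print(is_http_url("ftp://store.com/products"))    # False
```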
Email Validation
{
"type": "string",
"format": "email"
}
Complex Object Validation
{
"type": "object",
"properties": {
"product": {
"type": "object",
"properties": {
"name": {"type": "string"},
"specs": {
"type": "object",
"properties": {
"weight": {"type": "number"},
"dimensions": {
"type": "object",
"properties": {
"length": {"type": "number"},
"width": {"type": "number"},
"height": {"type": "number"}
}
}
}
}
}
}
}
}
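With nested schemas it is easy to lose track of which fields you are actually asking for. The sketch below walks a schema and lists the dotted path of every non-object property, which can be handy when writing the extraction prompt. leaf_paths is a hypothetical helper, and it only follows "properties", not arrays or references.

```python
def leaf_paths(schema, prefix=""):
    """Yield dotted paths for every non-object property in a schema."""
    for name, sub in schema.get("properties", {}).items():
        path = f"{prefix}.{name}" if prefix else name
        if sub.get("type") == "object":
            yield from leaf_paths(sub, path)
        else:
            yield path

schema = {
    "type": "object",
    "properties": {
        "product": {
            "type": "object",
            "properties": {
                "name": {"type": "string"},
                "specs": {
                    "type": "object",
                    "properties": {
                        "weight": {"type": "number"},
                        "dimensions": {
                            "type": "object",
                            "properties": {
                                "length": {"type": "number"},
                                "width": {"type": "number"},
                                "height": {"type": "number"},
                            },
                        },
                    },
                },
            },
        }
    },
}

paths = list(leaf_paths(schema))
print(paths[0])  # product.name
```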
Array Validation
{
"type": "array",
"items": {"type": "string"},
"minItems": 1,
"maxItems": 10,
"uniqueItems": true
}
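The three array keywords above translate into simple checks. The following sketch mirrors them for a list of strings; check_tags is a hypothetical name, and the defaults match the schema's "minItems" and "maxItems".

```python
def check_tags(items, min_items=1, max_items=10):
    """Mirror the array schema: a list of 1 to 10 unique strings."""
    if not isinstance(items, list) or not all(isinstance(i, str) for i in items):
        return False
    if not (min_items <= len(items) <= max_items):
        return False
    return len(set(items)) == len(items)  # "uniqueItems": true

print(check_tags(["phones", "electronics"]))  # True
print(check_tags(["phones", "phones"]))       # False: duplicate item
```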
Real-World Examples
E-commerce Product Schema
{
"type": "array",
"items": {
"type": "object",
"properties": {
"name": {
"type": "string",
"minLength": 1,
"maxLength": 200
},
"price": {
"type": "number",
"minimum": 0,
"multipleOf": 0.01
},
"currency": {
"type": "string",
"enum": ["USD", "EUR", "GBP", "JPY"]
},
"rating": {
"type": "number",
"minimum": 0,
"maximum": 5
},
"review_count": {
"type": "integer",
"minimum": 0
},
"in_stock": {
"type": "boolean"
},
"category": {
"type": "array",
"items": {"type": "string"},
"minItems": 1
},
"url": {
"type": "string",
"format": "uri"
}
},
"required": ["name", "price", "currency"]
}
}
Job Listing Schema
{
"type": "array",
"items": {
"type": "object",
"properties": {
"title": {
"type": "string",
"minLength": 5
},
"company": {
"type": "string",
"minLength": 2
},
"location": {
"type": "string"
},
"salary_min": {
"type": "number",
"minimum": 0
},
"salary_max": {
"type": "number",
"minimum": 0
},
"experience_level": {
"type": "string",
"enum": ["entry", "mid", "senior", "executive"]
},
"employment_type": {
"type": "string",
"enum": ["full-time", "part-time", "contract", "internship"]
},
"posted_date": {
"type": "string",
"format": "date"
},
"skills": {
"type": "array",
"items": {"type": "string"},
"uniqueItems": true
}
},
"required": ["title", "company", "location"]
}
}
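The schema validates salary_min and salary_max independently. Comparing one property against another is not something the standard keywords used here express, so a check such as "minimum salary must not exceed maximum salary" is usually done in code after validation. A minimal sketch, with a hypothetical helper name:

```python
def salary_range_problems(listings):
    """Return (index, message) pairs for listings whose salary range is inverted."""
    problems = []
    for i, job in enumerate(listings):
        low, high = job.get("salary_min"), job.get("salary_max")
        if low is not None and high is not None and low > high:
            problems.append((i, f"salary_min {low} exceeds salary_max {high}"))
    return problems

jobs = [
    {"title": "Data Engineer", "company": "Acme", "location": "Remote",
     "salary_min": 90000, "salary_max": 120000},
    {"title": "Data Analyst", "company": "Acme", "location": "Remote",
     "salary_min": 80000, "salary_max": 60000},
]
print(salary_range_problems(jobs))  # [(1, 'salary_min 80000 exceeds salary_max 60000')]
```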
Real Estate Schema
{
"type": "array",
"items": {
"type": "object",
"properties": {
"address": {
"type": "string",
"minLength": 10
},
"price": {
"type": "number",
"minimum": 1000
},
"bedrooms": {
"type": "integer",
"minimum": 0,
"maximum": 20
},
"bathrooms": {
"type": "number",
"minimum": 0,
"maximum": 20,
"multipleOf": 0.5
},
"square_feet": {
"type": "integer",
"minimum": 100
},
"property_type": {
"type": "string",
"enum": ["house", "condo", "townhouse", "apartment", "land"]
},
"year_built": {
"type": "integer",
"minimum": 1800,
"maximum": 2030
},
"listing_date": {
"type": "string",
"format": "date"
}
},
"required": ["address", "price", "bedrooms", "bathrooms"]
}
}
Using Schemas in Practice
With NextRows API
import requests

def extract_with_schema(urls, prompt, schema):
    """Send an extraction request and return the validated data, or None on failure."""
    response = requests.post(
        "https://api.nextrows.com/v1/extract",
        headers={"Authorization": "Bearer sk-nr-your-api-key"},
        json={
            "type": "url",
            "data": urls,
            "prompt": prompt,
            "schema": schema
        }
    )
    result = response.json()
    if result.get('success'):
        # Data is guaranteed to match schema
        return result['data']
    else:
        print(f"Validation failed: {result.get('error')}")
        return None

# Usage
product_schema = {
    "type": "array",
    "items": {
        "type": "object",
        "properties": {
            "name": {"type": "string", "minLength": 1},
            "price": {"type": "number", "minimum": 0}
        },
        "required": ["name", "price"]
    }
}

products = extract_with_schema(
    ["https://store.com/products"],
    "Extract product name and price",
    product_schema
)
Schema Evolution
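As your needs change, you can evolve your schemas, for example from a minimal first version to a later one that adds optional fields with defaults and marks the core fields as required: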
{
"type": "object",
"properties": {
"name": {"type": "string"},
"price": {"type": "number"}
}
}
{
"type": "object",
"properties": {
"name": {"type": "string"},
"price": {"type": "number"},
"currency": {"type": "string", "default": "USD"},
"availability": {"type": "boolean", "default": true}
},
"required": ["name", "price"]
}
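In the JSON Schema specification, "default" is an annotation: a validator is not required to insert the value into your data. Whether NextRows fills in defaults is not stated here, so the sketch below assumes it does not and applies top-level defaults on the client side. apply_defaults is a hypothetical helper.

```python
def apply_defaults(record, schema):
    """Return a copy of record with missing properties filled from schema defaults."""
    filled = dict(record)
    for name, sub in schema.get("properties", {}).items():
        if name not in filled and "default" in sub:
            filled[name] = sub["default"]
    return filled

schema_v2 = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "price": {"type": "number"},
        "currency": {"type": "string", "default": "USD"},
        "availability": {"type": "boolean", "default": True},
    },
    "required": ["name", "price"],
}

old_record = {"name": "Product A", "price": 99.99}
print(apply_defaults(old_record, schema_v2)["currency"])  # USD
```

Records produced under the first schema version then carry the same keys as records produced under the second.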
Error Handling and Debugging
Common Validation Errors
Type Mismatch
The schema expects a number, but the extracted value is a string:
{
"error": "Type validation failed",
"details": "Expected number, got string '99.99' for field 'price'"
}
Solution: Adjust your prompt to specify data types:
{
"prompt": "Extract price as a number without currency symbols"
}
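If prices keep coming back as text while you iterate on the prompt, one stopgap is to type the field as a string temporarily and convert it yourself. The sketch below is a hypothetical helper, not part of the NextRows API, and it assumes a comma as the thousands separator and a dot as the decimal separator.

```python
import re

def parse_price(text):
    """Convert strings such as '$1,299.00' to a float; return None if no number is found."""
    if isinstance(text, (int, float)) and not isinstance(text, bool):
        return float(text)
    # Assumes "1,299.00" style numbers; "1.299,00" would be misread
    match = re.search(r"\d[\d,]*(?:\.\d+)?", str(text))
    if match is None:
        return None
    return float(match.group().replace(",", ""))

print(parse_price("$1,299.00"))  # 1299.0
print(parse_price("N/A"))        # None
```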
Missing Required Fields
{
"error": "Validation failed",
"details": "Required field 'name' is missing"
}
Solution: Make your prompt more explicit:
{
"prompt": "Extract product name (required) and price for each item"
}
Format Validation Errors
{
"error": "Format validation failed",
"details": "Field 'date' does not match format 'date'"
}
Solution: Specify the expected format in your prompt:
{
"prompt": "Extract publication date in YYYY-MM-DD format"
}
Debugging Schema Issues
- Start with a minimal schema:
{ "type": "array", "items": {"type": "object"} }
- Add constraints gradually:
{ "type": "array", "items": { "type": "object", "properties": { "name": {"type": "string"} }, "required": ["name"] } }
- Test with sample data:
# Test schema with known data first
test_data = [{"name": "Test Product", "price": 99.99}]
# Validate against schema before using with NextRows
Best Practices
1. Start Simple, Add Complexity
// ❌ Too complex initially
{
"type": "array",
"items": {
"type": "object",
"properties": {
"name": {"type": "string", "pattern": "^[A-Z][a-z\\s]+$"},
"price": {"type": "number", "multipleOf": 0.01, "minimum": 0.01},
"category": {"type": "string", "enum": ["electronics", "clothing", "books"]}
}
}
}
// ✅ Start simple
{
"type": "array",
"items": {
"type": "object",
"properties": {
"name": {"type": "string"},
"price": {"type": "number"}
},
"required": ["name"]
}
}
2. Use Descriptive Error Messages
{
"type": "string",
"minLength": 1,
"errorMessage": "Product name cannot be empty"
}
Note that errorMessage is not a standard JSON Schema keyword. It is an extension supported by some validators (for example, through the ajv-errors plugin for Ajv), so check whether your validator honors it before relying on it.
3. Handle Optional Data Gracefully
{
"type": "object",
"properties": {
"name": {"type": "string"},
"price": {"type": "number"},
"sale_price": {"type": ["number", "null"]},
"description": {"type": "string", "default": "No description available"}
},
"required": ["name", "price"]
}
Here sale_price is optional and may be null, and description declares a default. Neither appears in "required".
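Code that consumes such records has to cope with sale_price being absent or null. The sketch below shows one way to do that; effective_price is a hypothetical helper. It compares against None explicitly so that a legitimate sale price of 0 is not mistaken for a missing value.

```python
def effective_price(product):
    """Return the sale price when one is present, otherwise the regular price."""
    sale = product.get("sale_price")  # None if the key is missing or null
    return product["price"] if sale is None else sale

print(effective_price({"name": "A", "price": 10.0, "sale_price": None}))  # 10.0
print(effective_price({"name": "A", "price": 10.0, "sale_price": 8.0}))   # 8.0
```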
4. Use Appropriate Data Types
// ❌ Everything as string
{
"price": {"type": "string"},
"rating": {"type": "string"},
"in_stock": {"type": "string"}
}
// ✅ Proper types
{
"price": {"type": "number"},
"rating": {"type": "number", "minimum": 0, "maximum": 5},
"in_stock": {"type": "boolean"}
}
Performance Considerations
Schema Complexity Impact
- Simple schemas: Faster validation, lower processing overhead
- Complex schemas: More validation rules, slightly higher overhead
- Deep nesting: Can impact performance with large datasets
Optimization Tips
- Minimize nesting depth when possible
- Use specific types rather than generic validation
- Avoid overly restrictive patterns for large datasets
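To keep an eye on nesting depth, you can measure it directly. The sketch below counts schema nodes along the longest path through "properties" and "items". The function name and this particular definition of depth are assumptions made for illustration.

```python
def schema_depth(schema):
    """Count schema nodes on the longest path through properties and items."""
    children = list(schema.get("properties", {}).values())
    if isinstance(schema.get("items"), dict):
        children.append(schema["items"])
    return 1 + max((schema_depth(child) for child in children), default=0)

product_schema = {
    "type": "array",
    "items": {
        "type": "object",
        "properties": {
            "name": {"type": "string"},
            "price": {"type": "number"},
        },
    },
}
print(schema_depth(product_schema))  # 3: array -> object -> property
```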
Schema validation happens after data extraction but before the response is returned. Failed validation will result in an error response, so ensure your schemas match your extraction requirements.
Schema Tools and Resources
Official Documentation
- JSON Schema - Official specification and documentation
- JSON Schema Getting Started Guide - Step-by-step tutorial
Online Schema Validators
- JSONSchemaLint
- JSON Schema Validator
- SchemaStore - Common schema patterns
Schema Generation Tools
from genson import SchemaBuilder
# Generate schema from sample data
builder = SchemaBuilder()
builder.add_object({"name": "Product A", "price": 99.99})
builder.add_object({"name": "Product B", "price": 149.50})
schema = builder.to_schema()
print(schema)
Testing Schemas
import jsonschema
def test_schema(schema, sample_data):
    try:
        jsonschema.validate(sample_data, schema)
        print("Schema validation passed")
        return True
    except jsonschema.ValidationError as e:
        print(f"Schema validation failed: {e.message}")
        return False
# Test before using with NextRows
schema = {"type": "array", "items": {"type": "object"}}
sample = [{"name": "Test", "price": 99.99}]
test_schema(schema, sample)