Data Extraction - OpenSteer

Overview

OpenSteer extracts structured data by combining AI vision models with persistent element paths. You define a schema, OpenSteer maps page elements to it, and the resolved element paths are cached so that later runs replay quickly and deterministically.

How Extraction Works

1. Take an extraction snapshot: get a data-oriented HTML representation of the page.
2. Define a schema: specify the structure of the data you want.
3. Extract data: OpenSteer uses AI to map page elements to the schema.
4. Cache paths: element paths are saved for instant replay.

Basic Extraction

Define a Schema

A schema is a plain object. Its keys name the fields to extract, and its default values hint at each field's type:

const schema = {
  title: '',
  price: '',
  description: ''
}

Extract Data

import { Opensteer } from 'opensteer'

const opensteer = new Opensteer({ name: 'product-scraper' })

try {
  await opensteer.launch()
  await opensteer.goto('https://example.com/product')

  // Take extraction snapshot
  await opensteer.snapshot({ mode: 'extraction' })

  // Extract with schema
  const data = await opensteer.extract({
    description: 'product details',
    schema: {
      title: '',
      price: '',
      imageUrl: ''
    }
  })

  console.log(data)
  // { title: 'Product Name', price: '$99.99', imageUrl: 'https://...' }
} finally {
  await opensteer.close()
}

The first extraction uses AI vision to locate elements. Subsequent runs reuse the cached paths, so they are much faster.
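
The cache-then-replay behavior described above can be pictured with a small sketch. This is not OpenSteer's implementation: `PathCache`, `resolve`, and the `locate` callback are hypothetical names, used only to show why the expensive lookup runs once per description.

```typescript
// Illustrative only: a generic memoization sketch, not OpenSteer internals.
// `locate` stands in for the expensive AI-backed element lookup.
class PathCache {
  private readonly paths = new Map<string, string>()

  // Returns the cached path for `description`, calling `locate` only on a miss.
  resolve(description: string, locate: () => string): string {
    const cached = this.paths.get(description)
    if (cached !== undefined) return cached
    const path = locate()
    this.paths.set(description, path)
    return path
  }
}

let lookups = 0
const pathCache = new PathCache()
const locateTitle = (): string => {
  lookups += 1
  return 'main > h1'
}

pathCache.resolve('product title', locateTitle) // first run: performs the lookup
pathCache.resolve('product title', locateTitle) // second run: served from the cache
console.log(lookups) // 1
```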

Schema Field Types

String Fields

Extract text content:

const schema = {
  name: '',
  description: '',
  category: ''
}

Number Fields

Extract numeric values:

const schema = {
  price: 0,
  rating: 0,
  reviewCount: 0
}
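
If you extract a value as display text instead (as in the earlier examples that return `price: '$99.99'`) and need a number downstream, a small post-processing helper can normalize it. The sketch below is an assumption-laden example, not part of the OpenSteer API: `parseNumericText` is a hypothetical name, and it assumes US-style formatting (comma as thousands separator, dot as decimal point).

```typescript
// Hypothetical post-processing helper; not part of the OpenSteer API.
// Pulls the first number out of display text such as '$1,299.99'.
function parseNumericText(text: string): number | null {
  // Drop thousands separators, then match an optional sign, digits, and decimals.
  const match = text.replace(/,/g, '').match(/-?\d+(\.\d+)?/)
  return match ? Number(match[0]) : null
}

console.log(parseNumericText('$1,299.99'))    // 1299.99
console.log(parseNumericText('4.5 out of 5')) // 4.5
console.log(parseNumericText('Sold out'))     // null
```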

Boolean Fields

Extract boolean values:

const schema = {
  inStock: true,
  onSale: false
}

Null Fields

Extract nullable values:

const schema = {
  salePrice: null,  // May or may not exist
  badge: null
}

Nested Structures

Object Fields

Extract nested objects:

const schema = {
  product: {
    name: '',
    price: '',
    specs: {
      weight: '',
      dimensions: ''
    }
  }
}

const data = await opensteer.extract({
  description: 'product with specs',
  schema
})

// Result:
// {
//   product: {
//     name: 'Widget',
//     price: '$50',
//     specs: { weight: '1kg', dimensions: '10x10cm' }
//   }
// }
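
Nested results like the one above are convenient in code but awkward to write to a CSV row or a flat database table. The following sketch flattens nested objects into dot-separated keys. `flatten`, `Primitive`, and `Nested` are hypothetical names introduced for illustration, and arrays are deliberately out of scope.

```typescript
type Primitive = string | number | boolean | null
interface Nested {
  [key: string]: Primitive | Nested
}

// Hypothetical helper: flattens nested plain objects into dot-separated keys.
function flatten(input: Nested, prefix = ''): Record<string, Primitive> {
  const out: Record<string, Primitive> = {}
  for (const [key, value] of Object.entries(input)) {
    const path = prefix ? `${prefix}.${key}` : key
    if (value !== null && typeof value === 'object') {
      // Recurse into nested objects, carrying the accumulated key path.
      Object.assign(out, flatten(value, path))
    } else {
      out[path] = value
    }
  }
  return out
}

const nestedResult: Nested = {
  product: {
    name: 'Widget',
    price: '$50',
    specs: { weight: '1kg', dimensions: '10x10cm' }
  }
}

const flatRow = flatten(nestedResult)
console.log(flatRow['product.specs.weight']) // 1kg
```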

Array Fields

Extract lists of items:

const schema = {
  products: [
    {
      title: '',
      price: '',
      imageUrl: ''
    }
  ]
}

const data = await opensteer.extract({
  description: 'product listing',
  schema
})

// Result:
// {
//   products: [
//     { title: 'Product 1', price: '$10', imageUrl: 'https://...' },
//     { title: 'Product 2', price: '$20', imageUrl: 'https://...' },
//     { title: 'Product 3', price: '$30', imageUrl: 'https://...' }
//   ]
// }

For arrays, OpenSteer automatically finds all matching items and extracts their fields.
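
A listing page may show the same item more than once (for example in a featured row and again in the main grid), in which case the extracted array can contain repeats. The sketch below removes them after extraction. `dedupeBy` and `ProductCard` are hypothetical names, and choosing `title` as the key is an assumption that suits this sample data.

```typescript
interface ProductCard {
  title: string
  price: string
  imageUrl: string
}

// Hypothetical helper: keeps the first item seen for each key value.
function dedupeBy<T>(items: T[], keyOf: (item: T) => string): T[] {
  const seen = new Set<string>()
  return items.filter((item) => {
    const key = keyOf(item)
    if (seen.has(key)) return false
    seen.add(key)
    return true
  })
}

const cards: ProductCard[] = [
  { title: 'Product 1', price: '$10', imageUrl: 'https://example.com/1.png' },
  { title: 'Product 2', price: '$20', imageUrl: 'https://example.com/2.png' },
  { title: 'Product 1', price: '$10', imageUrl: 'https://example.com/1.png' }
]

const uniqueCards = dedupeBy(cards, (card) => card.title)
console.log(uniqueCards.length) // 2
```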

Advanced Field Options

Extract Attributes

Extract HTML attributes instead of text:

const schema = {
  imageUrl: { element: 0, attribute: 'src' },
  linkUrl: { element: 0, attribute: 'href' },
  productId: { element: 0, attribute: 'data-id' }
}

Extract Current URL

Include the current page URL in extraction:

const schema = {
  title: '',
  price: '',
  sourceUrl: { source: 'current_url' }
}

const data = await opensteer.extract({
  description: 'product with url',
  schema
})

// Result:
// {
//   title: 'Product',
//   price: '$99',
//   sourceUrl: 'https://example.com/product/123'
// }
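
Having `sourceUrl` in the result is also useful for post-processing. Attribute values such as `href` or `src` may be relative, depending on the page's markup, and the standard `URL` constructor can resolve them against the page URL. `toAbsoluteUrl` is a hypothetical helper name, not an OpenSteer function.

```typescript
// Hypothetical helper: resolves a possibly relative link against the page URL.
// Already-absolute URLs pass through unchanged.
function toAbsoluteUrl(href: string, pageUrl: string): string {
  return new URL(href, pageUrl).href
}

const sourceUrl = 'https://example.com/product/123'

console.log(toAbsoluteUrl('/images/widget.png', sourceUrl))
// https://example.com/images/widget.png
console.log(toAbsoluteUrl('reviews', sourceUrl))
// https://example.com/product/reviews
console.log(toAbsoluteUrl('https://cdn.example.com/a.png', sourceUrl))
// https://cdn.example.com/a.png
```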

Explicit Element Selectors

Manually specify elements by their counter numbers from a snapshot:

const schema = {
  title: { element: 5 },
  price: { element: 8 },
  image: { element: 3, attribute: 'src' }
}

CSS Selectors

Use explicit CSS selectors:

const schema = {
  title: { selector: 'h1.product-title' },
  price: { selector: '.price-value' }
}

Real-World Example

Here’s a complete extraction script from the OpenSteer source:

import { Opensteer } from 'opensteer'

async function run() {
  const opensteer = new Opensteer({
    name: 'product-extraction',
    model: 'gpt-5.1',
  })

  await opensteer.launch({ headless: false })

  try {
    await opensteer.goto(
      'https://kbdfans.com/search?type=product&q=tactile+switches'
    )

    console.log('Starting extraction...')
    const data = await opensteer.extract({
      description: 'Extract product cards with title, price, image, and url',
      schema: {
        products: [
          {
            title: '',
            price: '',
            imageUrl: '',
            url: '',
          },
        ],
      },
    })

    console.log(data)
  } finally {
    await opensteer.close()
  }
}

run().catch((err) => {
  console.error(err)
  process.exit(1)
})

Two-Phase Extraction

For complex extractions, use extractFromPlan() to separate planning from execution.

Phase 1: Generate Plan

The first extract() call generates an extraction plan:

const plan = await opensteer.extract({
  description: 'product listing',
  schema: {
    products: [{ title: '', price: '' }]
  }
})

// Plan contains:
// - fields: Element counter mappings
// - paths: Cached element paths
// - data: Initial extracted data

Phase 2: Execute Plan

Reuse the plan for fast extraction:

const data = await opensteer.extractFromPlan({
  description: 'product listing',
  schema: {
    products: [{ title: '', price: '' }]
  },
  plan: plan
})

extractFromPlan() skips AI inference and uses the cached paths directly. Because no model call is made, repeated extractions run significantly faster.
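
One way to combine the two phases is to try the plan first and fall back to a full extraction if it fails. The sketch below assumes that a plan-based run signals failure by throwing, which the text above does not confirm. `extractWithFallback`, `fromPlan`, and `fromScratch` are hypothetical names; in practice the two callbacks would be thin closures around extractFromPlan() and extract(), and stubs stand in for them here so the example runs on its own.

```typescript
// Hypothetical wrapper: tries the fast path first, then the full extraction.
// Assumption: the plan-based call rejects when the plan no longer fits the page.
async function extractWithFallback<T>(
  fromPlan: () => Promise<T>,
  fromScratch: () => Promise<T>
): Promise<{ data: T; usedPlan: boolean }> {
  try {
    return { data: await fromPlan(), usedPlan: true }
  } catch {
    // The plan-based run failed; redo the extraction from scratch.
    return { data: await fromScratch(), usedPlan: false }
  }
}

// Stubs standing in for the real calls.
const stalePlanRun = async (): Promise<string[]> => {
  throw new Error('path not found')
}
const fullRun = async (): Promise<string[]> => ['Product 1', 'Product 2']

extractWithFallback(stalePlanRun, fullRun).then((result) => {
  console.log(result.usedPlan, result.data.length) // false 2
})
```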

Extraction Options

Custom Snapshot

Provide snapshot options:

const data = await opensteer.extract({
  description: 'product data',
  schema: { title: '', price: '' },
  snapshot: {
    mode: 'extraction',
    withCounters: true
  }
})

Custom Prompt

Add instructions for the AI:

const data = await opensteer.extract({
  description: 'product prices',
  schema: { prices: [''] },
  prompt: 'Extract only regular prices, ignore sale prices'
})

Extraction Best Practices

1. Take Extraction Snapshots

Always take a snapshot before extraction:

// Take snapshot
await opensteer.snapshot({ mode: 'extraction' })

// Then extract
const data = await opensteer.extract({
  description: 'product data',
  schema: { title: '', price: '' }
})

2. Use Descriptive Names

Provide clear descriptions for caching:

// Good - descriptive
await opensteer.extract({
  description: 'product listing with name, price, and image',
  schema: { /* ... */ }
})

// Bad - vague
await opensteer.extract({
  description: 'data',
  schema: { /* ... */ }
})

3. Cache All Page Types

During CLI exploration, cache extraction for every page type your scraper will visit:

# List page
opensteer snapshot extraction
opensteer extract '{"products":[{"name":"","price":""}]}' \
  --description "product listing"

# Detail page
opensteer click 1 --description "first product"
opensteer snapshot extraction
opensteer extract '{"title":"","description":"","specs":[""]}' \
  --description "product detail page"

4. Handle Missing Data

Some fields may not exist on all pages:

const schema = {
  title: '',
  price: '',
  salePrice: null,  // May not exist
  badge: null       // May not exist
}

const data = await opensteer.extract({
  description: 'product',
  schema
})

// Check for null values
if (data.salePrice !== null) {
  console.log('On sale:', data.salePrice)
}
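
When many fields are optional, checking them one by one gets verbose. The sketch below lists every field that came back as null so that it can be logged or counted in one place. `missingFields` is a hypothetical helper name, and the sample object mirrors the schema above.

```typescript
// Hypothetical helper: lists the keys whose value came back as null.
function missingFields(data: Record<string, unknown>): string[] {
  return Object.keys(data).filter((key) => data[key] === null)
}

const productResult = {
  title: 'Widget',
  price: '$99',
  salePrice: null,
  badge: null
}

console.log(missingFields(productResult)) // [ 'salePrice', 'badge' ]
```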

5. Structure Arrays Properly

For arrays, include a representative item in the schema that lists every field you want:

// Good - shows all fields
const schema = {
  products: [
    {
      title: '',
      price: '',
      imageUrl: ''
    }
  ]
}

// OpenSteer caches the pattern and finds all matching items

6. Use Type Hints

Use appropriate primitive types as defaults:

const schema = {
  name: '',           // String
  price: 0,           // Number
  inStock: true,      // Boolean
  badge: null,        // Nullable
  specs: [''],        // String array
  metadata: {}        // Object
}
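
Because the schema defaults double as type hints, they can also serve as a cheap sanity check on the result. The sketch below compares the runtime type of each top-level field with the type of its schema default and reports mismatches. `typeMismatches` is a hypothetical helper name, and treating a `null` default as "accept anything" is a design choice of this sketch rather than documented OpenSteer behavior.

```typescript
// Hypothetical helper: reports top-level fields whose runtime type differs
// from the type of the schema default. Null defaults accept any value.
function typeMismatches(
  schema: Record<string, unknown>,
  data: Record<string, unknown>
): string[] {
  const kindOf = (value: unknown): string =>
    value === null ? 'null' : Array.isArray(value) ? 'array' : typeof value

  return Object.keys(schema).filter((key) => {
    if (schema[key] === null) return false // nullable field: skip the check
    return kindOf(schema[key]) !== kindOf(data[key])
  })
}

const hintedSchema = { name: '', price: 0, inStock: true, badge: null, specs: [''] }
const hintedResult = { name: 'Widget', price: '$50', inStock: true, badge: null, specs: ['1kg'] }

console.log(typeMismatches(hintedSchema, hintedResult)) // [ 'price' ]
```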

Debugging Extraction

When extraction produces wrong or missing data:

Check timing

Ensure SPA content has loaded:

await opensteer.waitForText('Products loaded')
await opensteer.snapshot({ mode: 'extraction' })
const data = await opensteer.extract({ /* ... */ })
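
When the page offers no stable marker text to wait for, another option is to repeat the snapshot-and-extract step until the result looks complete. The sketch below is a generic polling helper, not an OpenSteer feature: `retryUntil`, `attempt`, and `isReady` are hypothetical names, and a stub replaces the real extraction call so the example runs on its own. If every try fails the readiness check, the last result is returned as-is.

```typescript
// Hypothetical helper: re-runs `attempt` until `isReady` accepts its result
// or `maxTries` is reached, pausing `delayMs` between tries.
async function retryUntil<T>(
  attempt: () => Promise<T>,
  isReady: (value: T) => boolean,
  maxTries = 5,
  delayMs = 500
): Promise<T> {
  let last = await attempt()
  for (let tries = 1; tries < maxTries && !isReady(last); tries += 1) {
    await new Promise<void>((resolve) => setTimeout(resolve, delayMs))
    last = await attempt()
  }
  return last
}

// Stub standing in for "snapshot, then extract": empty until the third call.
let attemptCount = 0
const flakyExtract = async (): Promise<string[]> => {
  attemptCount += 1
  return attemptCount < 3 ? [] : ['Product 1']
}

retryUntil(flakyExtract, (items) => items.length > 0, 5, 0).then((items) => {
  console.log(attemptCount, items) // 3 [ 'Product 1' ]
})
```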

Verify cache exists

Make sure you cached the extraction during CLI exploration for this page type.

Handle obstacles

Remove cookie banners, modals, or login walls before extraction:

await opensteer.click({ description: 'close cookie banner' })
await opensteer.snapshot({ mode: 'extraction' })

Check for missing data

Some pages genuinely lack certain fields. Use null defaults and handle missing data:

const schema = { optional: null }
const data = await opensteer.extract({ description: 'optional field', schema })
if (data.optional === null) {
  console.log('Field not found on page')
}

Do NOT replace opensteer.extract() with page.evaluate() + querySelectorAll when debugging. Fix timing, caching, or obstacles instead.

Extraction vs Manual Parsing

OpenSteer Extraction

AI-powered element detection
Automatic path caching
Works across page structure changes
Deterministic replay
Type-safe schemas

Manual Parsing

Brittle CSS selectors
No caching
Breaks on DOM changes
Requires maintenance
Error-prone

Next Steps

Browser Automation

Learn core automation features and navigation

AI Agents

Integrate extraction with AI agent workflows

Cloud Integration

Scale extraction with cloud mode

Skills

Install OpenSteer skills for AI assistants