Information Extraction

Important

This feature is in Public Preview and is HIPAA compliant.

This page covers the new version of Information Extraction. For information about the previous version, see Use Information Extraction (legacy).

Information Extraction transforms unstructured documents and text into key, structured insights using a defined schema. This lets you use information embedded in unstructured text, PDFs, images, or tables directly for analysis, reporting, or downstream agents and applications.

Examples of Information Extraction include:

  • Extracting legal parties and terms from contracts.
  • Extracting line items and payment terms from invoices.
  • Pulling key details from medical records and notes.

Information Extraction is built on top of the AI function ai_extract. It provides a visual UI for customizing and optimizing the function with a defined extraction schema.

Information Extraction uses default storage to store temporary data transformations, model checkpoints, and internal metadata that power each agent. When you delete an agent, Databricks removes all data associated with the agent from default storage.

Requirements

Create an Information Extraction agent

Go to Agents in the left navigation pane of your workspace. Click Create Agent > Information Extraction.

Step 1. Select the data to extract information from

  1. On the Start with your data page, select the files or data you want to extract information from. You can do any of the following:

    • Drag and drop one or more files into the upload area, or click to browse for files to upload.
    • Click Select volume to select a Unity Catalog volume with supported file types.
    • Click Select table to select a Unity Catalog table that contains text data.
  2. If you select a table, select the column that contains the data to extract from. You must select a column with a supported type, such as STRING or VARIANT, before you can continue. If the table has no supported columns, select a different table.

  3. Click Create Agent. This button is enabled only after you select a valid data source, and, for a table, a supported column.
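Before selecting a table, it can help to check whether it has any column of a supported type. The following is a minimal sketch of that check, not part of the product. The helper name `supported_columns` and the sample table layout are hypothetical, and the set of supported types is an assumption: this page names STRING and VARIANT only as examples, so the real list may be longer.

```python
# Assumption: the page lists STRING and VARIANT as examples of supported
# column types; the complete set is not given, so adjust this as needed.
SUPPORTED_TYPES = {"STRING", "VARIANT"}


def supported_columns(columns):
    """Return the names of columns whose declared type is in SUPPORTED_TYPES.

    `columns` is a list of (name, type) pairs, for example copied from a
    table's schema listing. Type names are compared case-insensitively.
    """
    return [
        name
        for name, col_type in columns
        if col_type.strip().upper() in SUPPORTED_TYPES
    ]


# Hypothetical table layout, for illustration only.
invoice_table = [
    ("invoice_id", "BIGINT"),
    ("raw_text", "string"),
    ("payload", "VARIANT"),
    ("received_at", "TIMESTAMP"),
]

candidates = supported_columns(invoice_table)
print(candidates)
```

If the resulting list is empty, the table has no column that this check recognizes as usable, which corresponds to the case where you would select a different table.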

Step 2. Configure and refine your extraction schema

After Information Extraction processes your data, configure and refine what data you want to extract from your documents.

  1. Under Configuration, define your extraction schema. There are several ways to do this:

    • Enter natural language that describes the information you want to extract and click Generate Schema. Information Extraction automatically generates a JSON schema with field names and definitions for you. Edit these descriptions as needed.
    • Alternatively, click Or, Define manually to manually define your schema:
      1. Click Add field.
      2. Enter your field name, type, and description.
      3. Click Confirm.
      4. Repeat for each field you want to extract.
      5. Click Save and run extraction.
    • You can also click JSON to edit the JSON schema directly. Click Apply Changes when complete.

    Each time you update your schema and click Save and run extraction, Information Extraction updates the extraction agent, runs the extraction, and shows the results for each input.

  2. On the left, review the parsed document and the agent's extraction. You can iterate on the extraction results in two ways: provide natural language feedback on one or more inputs, which auto-tunes your descriptions when you click Save and run extraction, or manually revise the schema descriptions, which take effect when you click Save and run extraction.

  3. Use versions to compare or revert to a previous configuration. Click Versions, then click Compare to compare the schema definition of a previous version with the current version. Click Restore to restore a previous version.
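For the JSON option in step 1, it may be easier to draft the schema in code and paste the result into the editor. The sketch below builds a small invoice schema with a name, type, and description for each field. The exact layout the JSON editor expects is not documented on this page, so the JSON Schema-style `object` with `properties` used here is an assumption, as are the helper `make_field` and all field names. Compare the output with what Generate Schema produces before relying on it.

```python
import json


def make_field(field_type, description):
    """Build one field definition (hypothetical helper, not a product API)."""
    return {"type": field_type, "description": description}


# Assumed layout: a JSON Schema-style object whose properties are the
# fields to extract. Field names and descriptions are illustrative only.
schema = {
    "type": "object",
    "properties": {
        "vendor_name": make_field(
            "string", "Legal name of the vendor that issued the invoice."
        ),
        "invoice_total": make_field(
            "number", "Total amount due, including tax."
        ),
        "payment_terms": make_field(
            "string", "Payment terms as written on the invoice, such as Net 30."
        ),
    },
}

# Serialize to text that can be pasted into a JSON editor.
schema_json = json.dumps(schema, indent=2)
print(schema_json)
```

Specific descriptions matter here because the descriptions are what the feedback and manual revision steps refine.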

Step 3. Use your extraction agent

After you're happy with the agent's performance, use the agent to extract information.

Click Use Agent in the upper-right corner. You can select either of the following:

  • Run in SQL to use the agent to extract information from all your data. This opens a SQL query that uses ai_extract to extract information from your volume or table using the schema you defined. For more information on using ai_extract in SQL queries, see ai_extract function.
  • Create a Spark Declarative Pipeline to deploy an ETL pipeline that invokes your agent on new data at scheduled intervals. This creates a pipeline in Lakeflow Spark Declarative Pipelines that updates a streaming table with your extracted data. You can configure the pipeline's schedule to run when new data arrives. For more information, see Lakeflow Spark Declarative Pipelines.

Limitations

  • Information Extraction agents have a maximum context length of 128k tokens.
  • Workspaces that have Enhanced Security and Compliance enabled are not supported.
  • Union schema types are not supported.
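Because union schema types are not supported, a hand-edited schema can be checked before it is applied. The sketch below is one possible check, under the assumption that "union type" means a JSON Schema-style `type` given as a list, or an `anyOf` or `oneOf` keyword. This page does not define the term, so treat that interpretation and the helper name `find_union_fields` as hypothetical.

```python
def find_union_fields(schema, path="$"):
    """Return the paths of schema nodes that look like union types.

    Assumption: a node counts as a union if its "type" is a list or if it
    contains an "anyOf" or "oneOf" key.
    """
    hits = []
    if isinstance(schema, dict):
        is_union = (
            isinstance(schema.get("type"), list)
            or "anyOf" in schema
            or "oneOf" in schema
        )
        if is_union:
            hits.append(path)
        # Recurse into every nested value so deeply nested fields are checked.
        for key, value in schema.items():
            hits.extend(find_union_fields(value, f"{path}.{key}"))
    elif isinstance(schema, list):
        for index, item in enumerate(schema):
            hits.extend(find_union_fields(item, f"{path}[{index}]"))
    return hits


# Illustrative schema with two union-style fields and one plain field.
example_schema = {
    "type": "object",
    "properties": {
        "total": {"type": ["number", "string"]},
        "vendor": {"type": "string"},
        "terms": {"anyOf": [{"type": "string"}, {"type": "null"}]},
    },
}

union_paths = find_union_fields(example_schema)
print(union_paths)
```

Any path the check reports points at a field to rewrite with a single type before saving the schema.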