AI Search basic embedding model example (GTE)

This notebook shows how to use the AI Search Python SDK, which provides AISearchClient as the primary API for working with AI Search.

This notebook uses Databricks Foundation Model APIs to access the GTE embedding model and generate embeddings.

%pip install --upgrade --force-reinstall databricks-ai-search
dbutils.library.restartPython()

from databricks.ai_search.client import AISearchClient

vsc = AISearchClient(disable_notice=True)

help(AISearchClient)

Load the toy dataset into the source Delta table

The following cells create the source Delta table.

# Specify the catalog and schema to use. You must have USE_CATALOG privilege on the catalog and USE_SCHEMA and CREATE_TABLE privileges on the schema.
# Change the catalog and schema here if necessary.

catalog_name = "main"
schema_name = "default"

source_table_name = "wiki_articles_demo"
source_table_fullname = f"{catalog_name}.{schema_name}.{source_table_name}"

# Uncomment if you want to start from scratch.

# spark.sql(f"DROP TABLE {source_table_fullname}")

source_df = spark.read.parquet("/databricks-datasets/wikipedia-datasets/data-001/en_wikipedia/articles-only-parquet").limit(10)
display(source_df)

Chunk the sample dataset

Splitting the sample dataset into chunks helps you avoid exceeding the context limit of the embedding model. The GTE model supports up to 8192 tokens. However, Databricks recommends splitting the data into smaller context chunks so that you can feed a wider variety of examples into the reasoning model for your RAG application.

import tiktoken
import pandas as pd

# The GTE model has been trained on a max context length of 8192 tokens.
max_chunk_tokens = 8192
# cl100k_base only approximates the token count; it is not the GTE model's own tokenizer.
encoding = tiktoken.get_encoding("cl100k_base")

def chunk_text(text):
    # Encode the text, slice the tokens into fixed-size windows,
    # and decode each window back into a string.
    tokens = encoding.encode(text)
    chunks = []
    while tokens:
        chunk_tokens = tokens[:max_chunk_tokens]
        chunks.append(encoding.decode(chunk_tokens))
        tokens = tokens[max_chunk_tokens:]
    return chunks

# Process the data and store in a new list
pandas_df = source_df.toPandas()
processed_data = []
for _, row in pandas_df.iterrows():
    for chunk_no, chunk in enumerate(chunk_text(row['text'])):
        row_data = row.to_dict()

        # Replace the id column with a new unique chunk id
        # and the text column with the text chunk.
        row_data['id'] = f"{row['id']}_{chunk_no}"
        row_data['text'] = chunk

        processed_data.append(row_data)

chunked_pandas_df = pd.DataFrame(processed_data)
chunked_spark_df = spark.createDataFrame(chunked_pandas_df)

# Write the chunked DataFrame to a Delta table
spark.sql(f"DROP TABLE IF EXISTS {source_table_fullname}")
chunked_spark_df.write.format("delta") \
    .option("delta.enableChangeDataFeed", "true") \
    .saveAsTable(source_table_fullname)

display(spark.sql(f"SELECT * FROM {source_table_fullname}"))
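
The recommendation above to use smaller context chunks can be sketched as a sliding window over the token list. This is a minimal illustration, not the notebook's method: the helper name chunk_tokens, the overlap parameter, and the window sizes are assumptions introduced here, and plain integers stand in for token ids so that the example runs without a tokenizer.

```python
from typing import List, Sequence


def chunk_tokens(tokens: Sequence[int], max_tokens: int, overlap: int = 0) -> List[List[int]]:
    """Split a token sequence into windows of at most max_tokens tokens.

    Consecutive windows share `overlap` tokens. Both parameters are
    illustrative; the notebook itself uses fixed, non-overlapping windows.
    """
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must be in [0, max_tokens)")

    step = max_tokens - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(list(tokens[start:start + max_tokens]))
        if start + max_tokens >= len(tokens):
            break  # this window already reaches the end of the input
    return chunks


# Toy example with integers standing in for token ids.
example_chunks = chunk_tokens(list(range(10)), max_tokens=4, overlap=1)
print(example_chunks)
```

With a real tokenizer, the input would be the result of encoding.encode(text), and each returned window would be passed to encoding.decode, as the chunking cell above does.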

Create an endpoint

ai_search_endpoint_name = "ai-search-demo-endpoint"

vsc.create_endpoint(
    name=ai_search_endpoint_name,
    endpoint_type="STANDARD" # or "STORAGE_OPTIMIZED"
)

vsc.get_endpoint(
  name=ai_search_endpoint_name
)
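
Re-running the cell above after the endpoint already exists may fail, so a get-or-create wrapper can be convenient. The sketch below uses only the create_endpoint and get_endpoint calls shown above. It assumes that get_endpoint raises an exception when the endpoint does not exist, which is not confirmed here; the helper name and the FakeClient stand-in are hypothetical and exist only so that the example runs outside Databricks.

```python
def get_or_create_endpoint(client, name, endpoint_type="STANDARD"):
    """Return the endpoint description, creating the endpoint if the lookup fails.

    Assumption: client.get_endpoint raises when the endpoint does not exist.
    Check the exception type raised by your SDK version and narrow the
    except clause, so that unrelated errors are not mistaken for "not found".
    """
    try:
        return client.get_endpoint(name=name)
    except Exception:
        client.create_endpoint(name=name, endpoint_type=endpoint_type)
        return client.get_endpoint(name=name)


class FakeClient:
    """In-memory stand-in used only to exercise the helper."""

    def __init__(self):
        self.endpoints = {}
        self.create_calls = 0

    def create_endpoint(self, name, endpoint_type):
        self.create_calls += 1
        self.endpoints[name] = {"name": name, "endpoint_type": endpoint_type}

    def get_endpoint(self, name):
        if name not in self.endpoints:
            raise KeyError(name)
        return self.endpoints[name]


fake = FakeClient()
first = get_or_create_endpoint(fake, "ai-search-demo-endpoint")
second = get_or_create_endpoint(fake, "ai-search-demo-endpoint")
print(fake.create_calls)
```

In the notebook, the call would be get_or_create_endpoint(vsc, ai_search_endpoint_name).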

Create an index

# AI Search index
vs_index = f"{source_table_name}_gte_index"
vs_index_fullname = f"{catalog_name}.{schema_name}.{vs_index}"

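# Embedding model serving endpoint. The name below refers to a Qwen3 embedding
# endpoint although this page describes GTE; use the endpoint that serves the
# model you intend to use.
embedding_model_endpoint = "databricks-qwen3-embedding-0-6b"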

index = vsc.create_delta_sync_index(
  endpoint_name=ai_search_endpoint_name,
  source_table_name=source_table_fullname,
  index_name=vs_index_fullname,
  pipeline_type='TRIGGERED',
  primary_key="id",
  embedding_source_column="text",
  embedding_model_endpoint_name=embedding_model_endpoint
)
index.describe()['status']['message']

# Wait for index to come online. Expect this command to take several minutes.
# You can also track the status of the index build in Catalog Explorer in the
# Overview tab for the index.
import time
index = vsc.get_index(endpoint_name=ai_search_endpoint_name,index_name=vs_index_fullname)
while not index.describe().get('status')['ready']:
  print("Waiting for index to be ready...")
  time.sleep(30)
print("Index is ready!")
index.describe()
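
The wait loop above runs until the index reports ready, with no upper bound. A bounded variant might look like the following sketch. It relies only on describe() returning a dictionary with a status entry that has ready and message keys, as used above; the helper name wait_until_ready, the timeout values, and the fake_describe stand-in are assumptions made for illustration.

```python
import time


def wait_until_ready(describe, timeout_s=1800, poll_s=30,
                     sleep=time.sleep, clock=time.monotonic):
    """Poll describe() until status['ready'] is truthy or timeout_s elapses."""
    deadline = clock() + timeout_s
    while True:
        status = describe().get("status", {})
        if status.get("ready"):
            return status
        if clock() >= deadline:
            raise TimeoutError(
                f"Index not ready after {timeout_s} s: {status.get('message')}"
            )
        sleep(poll_s)


# Stand-in for index.describe that reports ready on the third call.
calls = {"n": 0}


def fake_describe():
    calls["n"] += 1
    return {"status": {"ready": calls["n"] >= 3, "message": "building"}}


final_status = wait_until_ready(fake_describe, timeout_s=5, poll_s=0,
                                sleep=lambda seconds: None)
print(final_status["ready"], calls["n"])
```

In the notebook, the call would be wait_until_ready(index.describe), which raises TimeoutError instead of looping indefinitely if the build stalls.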

Similarity search

The following cells show how to query the index to find similar documents.

results = index.similarity_search(
  query_text="Greek myths",
  columns=["id", "text", "title"],
  num_results=5
  )
rows = results['result']['data_array']
for (id, text, title, score) in rows:
  if len(text) > 32:
    # trim text output for readability
    text = text[0:32] + "..."
  print(f"id: {id}  title: {title} text: '{text}' score: {score}")
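
For downstream processing, it can help to turn the positional rows into dictionaries keyed by column name. The sketch below assumes, as the loop above does, that each row holds the requested columns in order followed by the score. The helper name rows_to_dicts is hypothetical, and the sample response is a hand-written placeholder, not real query output.

```python
def rows_to_dicts(results, columns):
    """Pair each row of result.data_array with the requested columns plus 'score'."""
    names = list(columns) + ["score"]
    rows = results.get("result", {}).get("data_array") or []
    return [dict(zip(names, row)) for row in rows]


# Hand-written placeholder that mimics the shape used above; not real output.
sample_results = {
    "result": {"data_array": [["1_0", "sample text", "Sample title", 0.5]]}
}

records = rows_to_dicts(sample_results, ["id", "text", "title"])
print(records[0]["title"], records[0]["score"])
```

The resulting list can be passed to pd.DataFrame(records) if a tabular view is preferred.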

# Search with a filter. Note that the syntax depends on the endpoint type.

# Standard endpoint syntax
results = index.similarity_search(
  query_text="Greek myths",
  columns=["id", "text", "title"],
  num_results=5,
  filters={"title NOT": "Hercules"}
)

# Storage-optimized endpoint syntax
# results = index.similarity_search(
#   query_text="Greek myths",
#   columns=["id", "text", "title"],
#   num_results=5,
#   filters='title != "Hercules"'
#   )


rows = results['result']['data_array']
for (id, text, title, score) in rows:
  if len(text) > 32:
    # trim text output for readability
    text = text[0:32] + "..."
  print(f"id: {id}  title: {title} text: '{text}' score: {score}")
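
Because the filter syntax differs by endpoint type, a small helper can keep the choice in one place. This sketch covers only the "not equal" case shown above and reproduces the two forms from this page; the function name is hypothetical, and the string form does no escaping of quotes inside the value.

```python
def not_equal_filter(column, value, endpoint_type="STANDARD"):
    """Build a 'column is not value' filter in the syntax shown for each endpoint type."""
    if endpoint_type == "STANDARD":
        # Dictionary syntax, as in the standard endpoint example above.
        return {f"{column} NOT": value}
    if endpoint_type == "STORAGE_OPTIMIZED":
        # String syntax, as in the storage-optimized example above.
        # Note: quotes inside `value` are not escaped in this sketch.
        return f'{column} != "{value}"'
    raise ValueError(f"Unknown endpoint type: {endpoint_type}")


print(not_equal_filter("title", "Hercules"))
print(not_equal_filter("title", "Hercules", "STORAGE_OPTIMIZED"))
```

The returned value would be passed as the filters argument of index.similarity_search.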

Delete the index

vsc.delete_index(
  endpoint_name=ai_search_endpoint_name,
  index_name=vs_index_fullname
)

Last updated on 2026-06-23
