Read image files

Important

Databricks recommends that you use the binary file data source to load image data into a Spark DataFrame as raw bytes. See Reference solution for image applications for the recommended workflow to handle image data.

The image data source provides a standard API for loading image files into Spark DataFrames as a decoded struct, giving you direct access to image metadata such as height, width, channel count, and raw pixel data. It is primarily used in machine learning preprocessing pipelines where structured image fields are required alongside pixel data. Azure Databricks supports the image data source for batch reads, including partition discovery for organized image directories. To read image files, specify the data source format as image.

Prerequisites

Azure Databricks does not require additional configuration to use the image data source.

Options

Use the .option() and .options() methods of DataFrameReader to configure the image data source. For a complete list of supported options, see Spark API options reference.
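As a minimal sketch of how an option is passed, the following helper sets dropInvalid, an option of the open-source Apache Spark image data source that skips files that cannot be decoded as images. The helper name read_images and its parameters are hypothetical, and the sketch assumes an existing SparkSession named spark. Confirm in the options reference that the option is available for your runtime.

```python
def read_images(spark, path, drop_invalid=True):
    """Load images from `path` with the image data source.

    `spark` is an existing SparkSession. `read_images` is a hypothetical
    helper name used only for illustration.
    """
    return (
        spark.read.format("image")
        # dropInvalid: skip files that cannot be decoded as images.
        .option("dropInvalid", drop_invalid)
        .load(path)
    )
```

In a notebook, you might call it as read_images(spark, "/Volumes/<catalog>/<schema>/<volume>/images/").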

Usage

The following examples demonstrate loading image files using the Spark DataFrame API, selecting image metadata fields, displaying image thumbnails, and saving decoded image data to a Delta table.

Read image files

Use the Apache Spark DataFrame API to load image files into a DataFrame. You can import a nested directory structure by providing a directory path, and use partition discovery by specifying a path with a partition directory (for example, /path/to/dir/date=2018-01-02/category=automobile).

Python

# Read all images from a directory
df = spark.read.format("image").load("/Volumes/<catalog>/<schema>/<volume>/images/")
display(df)

# Use partition discovery by specifying a partitioned path
df = spark.read.format("image").load("/Volumes/<catalog>/<schema>/<volume>/images/date=2024-01-01/category=dogs/")
display(df)

Scala

// Read all images from a directory
val df = spark.read.format("image").load("/Volumes/<catalog>/<schema>/<volume>/images/")
df.show()

// Use partition discovery by specifying a partitioned path
val partitioned = spark.read.format("image").load("/Volumes/<catalog>/<schema>/<volume>/images/date=2024-01-01/category=dogs/")
partitioned.show()

SQL

-- Read all images from a directory
SELECT * FROM read_files(
  '/Volumes/<catalog>/<schema>/<volume>/images/',
  format => 'image'
)
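Partition discovery relies on the key=value directory naming convention shown in the example path above. The following plain-Python sketch only illustrates that convention. Spark performs the actual parsing itself (and also infers column types), and the function name partition_values is hypothetical.

```python
def partition_values(path):
    """Return {key: value} for each `key=value` directory segment in `path`."""
    values = {}
    for segment in path.strip("/").split("/"):
        key, sep, value = segment.partition("=")
        # Only segments of the form key=value describe a partition.
        if sep and key:
            values[key] = value
    return values


print(partition_values("/path/to/dir/date=2018-01-02/category=automobile"))
# {'date': '2018-01-02', 'category': 'automobile'}
```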

Select image metadata

To work with image dimensions or channel information without processing the full pixel data, select specific fields from the image struct column.

Python

df = spark.read.format("image").load("/Volumes/<catalog>/<schema>/<volume>/images/")
metadata = df.select("image.origin", "image.height", "image.width", "image.nChannels")
display(metadata)

Scala

val df = spark.read.format("image").load("/Volumes/<catalog>/<schema>/<volume>/images/")
val metadata = df.select("image.origin", "image.height", "image.width", "image.nChannels")
metadata.show()

SQL

SELECT image.origin, image.height, image.width, image.nChannels FROM read_files(
  '/Volumes/<catalog>/<schema>/<volume>/images/',
  format => 'image'
)
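Because the metadata fields are ordinary columns once selected, you can also filter on them. The following sketch keeps only images at or above a minimum size. The helper name select_large_images and the 224-pixel defaults are assumptions for illustration, and df is assumed to be a DataFrame loaded with the image data source.

```python
def select_large_images(df, min_width=224, min_height=224):
    """Select image metadata and keep rows that meet a minimum size.

    `df` is a DataFrame read with format("image"). After the select,
    the nested fields become top-level columns named origin, height,
    width, and nChannels.
    """
    metadata = df.select(
        "image.origin", "image.height", "image.width", "image.nChannels"
    )
    return metadata.where(
        f"width >= {int(min_width)} AND height >= {int(min_height)}"
    )
```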

Display image data

The Databricks display function renders image thumbnails directly in the image column when working with the image data source. See Images for supported display options.

Python

df = spark.read.format("image").load("/Volumes/<catalog>/<schema>/<volume>/images/")
display(df)

Scala

val df = spark.read.format("image").load("/Volumes/<catalog>/<schema>/<volume>/images/")
display(df)

SQL

SELECT * FROM read_files(
  '/Volumes/<catalog>/<schema>/<volume>/images/',
  format => 'image'
)

Save image data to a Delta table

To improve read performance when loading image data back, save the DataFrame to a Delta table.

Note

The image data source stores decoded pixel data, which increases disk usage compared to raw bytes. For storage-efficient persistence, use the binary file data source instead.

Python

df = spark.read.format("image").load("/Volumes/<catalog>/<schema>/<volume>/images/")
df.write.format("delta").saveAsTable("<catalog>.<schema>.<table>")

Scala

val df = spark.read.format("image").load("/Volumes/<catalog>/<schema>/<volume>/images/")
df.write.format("delta").saveAsTable("<catalog>.<schema>.<table>")
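To make the disk-usage note in this section concrete, the following sketch estimates the decoded size of one image from its dimensions. It assumes 8-bit pixel values (one byte per value) and the (height, width, nChannels) layout of the data field. The function name decoded_size_bytes is hypothetical.

```python
def decoded_size_bytes(height, width, n_channels, bytes_per_value=1):
    """Estimate the size of decoded pixel data for one image.

    Assumes every pixel stores `n_channels` values of `bytes_per_value`
    bytes each (1 byte for 8-bit images).
    """
    return height * width * n_channels * bytes_per_value


# A 1920x1080 three-channel 8-bit image decodes to about 6.2 MB.
print(decoded_size_bytes(1080, 1920, 3))  # 6220800
```

A compressed JPEG or PNG of the same picture is typically much smaller, which is why the raw bytes are cheaper to store.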

Output schema

Image files are loaded as a DataFrame containing a single struct-type column called image with the following fields:

root
 |-- image: struct (nullable = true)
 |    |-- origin: string (nullable = true)
 |    |-- height: integer (nullable = false)
 |    |-- width: integer (nullable = false)
 |    |-- nChannels: integer (nullable = false)
 |    |-- mode: integer (nullable = false)
 |    |-- data: binary (nullable = false)

The following fields describe the image file and its decoded pixel data.

origin: The file path of the source image.
height: The height of the image in pixels.
width: The width of the image in pixels.
nChannels: The number of color channels. Typical values are 1 for grayscale images, 3 for colored images (for example, RGB), and 4 for colored images with alpha channel.
mode: Integer flag that indicates how to interpret the data field. It specifies the data type and channel order of the stored data. The value is expected (but not enforced) to map to one of the OpenCV types in the following table. OpenCV types are defined for 1, 2, 3, or 4 channels and for several pixel data types. Channel order is the order in which the colors are stored. For example, a typical three-channel image with red, green, and blue components has six possible orderings. Most libraries use either RGB or BGR. Three-channel and four-channel OpenCV types are expected to be in BGR and BGRA order, respectively.

Map of type to numbers in OpenCV (data types x number of channels)

Type	C1	C2	C3	C4
CV_8U	0	8	16	24
CV_8S	1	9	17	25
CV_16U	2	10	18	26
CV_16S	3	11	19	27
CV_32S	4	12	20	28
CV_32F	5	13	21	29
CV_64F	6	14	22	30

data: Image data stored in a binary format. Image data is represented as a 3-dimensional array with shape (height, width, nChannels) and array values of the type specified by the mode field. The array is stored in row-major order.
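The mode and data fields can be interpreted with a few lines of plain Python. The following sketch assumes the OpenCV encoding, in which the low three bits of the type code hold the data type and the remaining bits hold the channel count minus one, and it assumes 8-bit pixel values. The function names decode_mode and pixel_at and the two-pixel sample image are hypothetical.

```python
def decode_mode(mode):
    """Split an OpenCV type code into (depth_code, channels)."""
    depth_code = mode & 7        # low 3 bits: data type (0 is CV_8U)
    channels = (mode >> 3) + 1   # remaining bits: channel count minus one
    return depth_code, channels


def pixel_at(data, width, n_channels, row, col):
    """Return the channel values of one pixel from row-major 8-bit data."""
    start = (row * width + col) * n_channels
    return tuple(data[start:start + n_channels])


# Hypothetical 1x2 image in BGR order: a blue pixel, then a red pixel.
data = bytes([255, 0, 0, 0, 0, 255])

print(decode_mode(16))  # (0, 3): CV_8U with 3 channels

# Reorder the second pixel from BGR to RGB.
b, g, r = pixel_at(data, width=2, n_channels=3, row=0, col=1)
print((r, g, b))  # (255, 0, 0)
```

For wider data types, each value spans more than one byte, so the offset arithmetic would need to account for the value size.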

Limitations

Because the image data source decodes image files during DataFrame creation, it increases data size and has the following limitations:

Disk usage when persisting: Decoded image data is significantly larger than raw bytes. If you persist the DataFrame to a Delta table, store raw bytes instead of decoded data to save disk space.
Shuffle performance: Shuffling decoded image data requires more disk space and network bandwidth, resulting in slower shuffle operations. Delay decoding as long as possible in your pipeline.
Fixed decoding library: The image data source uses the Java ImageIO library (javax.imageio) to decode images, which prevents you from using alternative decoding libraries for better performance or custom decoding logic.

To avoid these limitations, use the binary file data source to load image data and decode only as needed.
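A minimal sketch of the raw-bytes alternative follows. It uses the Apache Spark binaryFile format with the pathGlobFilter option, and assumes an existing SparkSession named spark. The helper name read_raw_images and the *.jpg default are assumptions for illustration.

```python
def read_raw_images(spark, path, glob="*.jpg"):
    """Load image files as raw bytes with the binary file data source.

    The resulting DataFrame has the columns path, modificationTime,
    length, and content (the undecoded file bytes).
    """
    return (
        spark.read.format("binaryFile")
        # Only pick up files whose names match the glob pattern.
        .option("pathGlobFilter", glob)
        .load(path)
    )
```

Decoding can then be deferred to the step that needs pixel data, using whichever decoding library suits the workload.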

Additional resources

Read binary files: If your workload requires raw image bytes rather than a decoded struct, the binary file data source avoids the decoding overhead and limitations of the image data source.

Last updated on 2026-06-15