Important
Databricks recommends that you use the binary file data source to load image data into the Spark DataFrame as raw bytes. See Reference solution for image applications for the recommended workflow to handle image data.
The image data source provides a standard API for loading image files into Spark DataFrames as a decoded struct, giving you direct access to image metadata such as height, width, channel count, and raw pixel data. It is primarily used in machine learning preprocessing pipelines where structured image fields are required alongside pixel data. Azure Databricks supports the image data source for batch reads, including partition discovery for organized image directories. To read image files, specify the data source format as image.
Prerequisites
Azure Databricks does not require additional configuration to use the image data source.
Options
Use the .option() and .options() methods of DataFrameReader to configure the image data source. For a complete list of supported options, see Spark API options reference.
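As a minimal sketch of chaining `.option()` calls, the helper below wraps the reader in a function. The helper name `read_images` and its parameters are hypothetical. The option names `dropInvalid` (skip files that cannot be decoded as images) and `pathGlobFilter` (generic file source filter) come from open-source Apache Spark and are assumed to behave the same way on your runtime; check the options reference before relying on them.

```python
def read_images(spark, path, drop_invalid=True, glob="*"):
    """Hypothetical helper: load images from `path` with reader options set."""
    return (
        spark.read.format("image")
        # Assumed Spark image source option: drop files that are not valid images.
        .option("dropInvalid", drop_invalid)
        # Assumed generic file source option: only read file names matching the glob.
        .option("pathGlobFilter", glob)
        .load(path)
    )
```

In a notebook, where `spark` is already defined, a call might look like `read_images(spark, "/Volumes/<catalog>/<schema>/<volume>/images/", glob="*.png")`.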
Usage
The following examples demonstrate loading image files using the Spark DataFrame API, selecting image metadata fields, displaying image thumbnails, and saving decoded image data to a Delta table.
Read image files
Use the Apache Spark DataFrame API to load image files into a DataFrame. You can import a nested directory structure by providing a directory path, and use partition discovery by specifying a path with a partition directory (for example, /path/to/dir/date=2018-01-02/category=automobile).
Python
# Read all images from a directory
df = spark.read.format("image").load("/Volumes/<catalog>/<schema>/<volume>/images/")
display(df)
# Use partition discovery by specifying a partitioned path
df = spark.read.format("image").load("/Volumes/<catalog>/<schema>/<volume>/images/date=2024-01-01/category=dogs/")
display(df)
Scala
// Read all images from a directory
val df = spark.read.format("image").load("/Volumes/<catalog>/<schema>/<volume>/images/")
df.show()
// Use partition discovery by specifying a partitioned path
val partitioned = spark.read.format("image").load("/Volumes/<catalog>/<schema>/<volume>/images/date=2024-01-01/category=dogs/")
partitioned.show()
SQL
-- Read all images from a directory
SELECT * FROM read_files(
  '/Volumes/<catalog>/<schema>/<volume>/images/',
  format => 'image'
)
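To make the partition directory convention concrete, the sketch below extracts `key=value` segments from a path, which is the naming pattern that partition discovery relies on. It illustrates the convention only and is not Spark's implementation: the function name `partition_values` is hypothetical, and Spark may also infer column types, whereas this sketch keeps every value as a string.

```python
from pathlib import PurePosixPath

def partition_values(path):
    """Return {column: value} for every key=value directory segment in `path`."""
    values = {}
    for segment in PurePosixPath(path).parts:
        key, sep, value = segment.partition("=")
        if sep and key:  # keep only segments shaped like key=value
            values[key] = value
    return values

print(partition_values("/path/to/dir/date=2018-01-02/category=automobile"))
# {'date': '2018-01-02', 'category': 'automobile'}
```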
Select image metadata
To work with image dimensions or channel information without processing the full pixel data, select specific fields from the image struct column.
Python
df = spark.read.format("image").load("/Volumes/<catalog>/<schema>/<volume>/images/")
metadata = df.select("image.origin", "image.height", "image.width", "image.nChannels")
display(metadata)
Scala
val df = spark.read.format("image").load("/Volumes/<catalog>/<schema>/<volume>/images/")
val metadata = df.select("image.origin", "image.height", "image.width", "image.nChannels")
metadata.show()
SQL
SELECT image.origin, image.height, image.width, image.nChannels
FROM read_files(
  '/Volumes/<catalog>/<schema>/<volume>/images/',
  format => 'image'
)
Display image data
The Databricks display function renders image thumbnails directly in the image column when working with the image data source. See Images for supported display options.
Python
df = spark.read.format("image").load("/Volumes/<catalog>/<schema>/<volume>/images/")
display(df)
Scala
val df = spark.read.format("image").load("/Volumes/<catalog>/<schema>/<volume>/images/")
display(df)
SQL
SELECT * FROM read_files(
  '/Volumes/<catalog>/<schema>/<volume>/images/',
  format => 'image'
)
Save image data to a Delta table
To improve read performance when loading image data back, save the DataFrame to a Delta table.
Note
The image data source stores decoded pixel data, which increases disk usage compared to raw bytes. For storage-efficient persistence, use the binary file data source instead.
Python
df = spark.read.format("image").load("/Volumes/<catalog>/<schema>/<volume>/images/")
df.write.format("delta").saveAsTable("<catalog>.<schema>.<table>")
Scala
val df = spark.read.format("image").load("/Volumes/<catalog>/<schema>/<volume>/images/")
df.write.format("delta").saveAsTable("<catalog>.<schema>.<table>")
Output schema
Image files are loaded as a DataFrame containing a single struct-type column called image with the following fields:
root
|-- image: struct (nullable = true)
| |-- origin: string (nullable = true)
| |-- height: integer (nullable = false)
| |-- width: integer (nullable = false)
| |-- nChannels: integer (nullable = false)
| |-- mode: integer (nullable = false)
| |-- data: binary (nullable = false)
The following fields describe the image file and its decoded pixel data.
origin: The file path of the source image.
height: The height of the image in pixels.
width: The width of the image in pixels.
nChannels: The number of color channels. Typical values are 1 for grayscale images, 3 for colored images (for example, RGB), and 4 for colored images with alpha channel.
mode: Integer flag that indicates how to interpret the data field. It specifies the data type and channel order the data is stored in. The value of the field is expected (but not enforced) to map to one of the OpenCV types displayed in the following table. OpenCV types are defined for 1, 2, 3, or 4 channels and several data types for the pixel values. Channel order specifies the order in which the colors are stored. For example, if you have a typical three channel image with red, blue, and green components, there are six possible orderings. Most libraries use either RGB or BGR. Three (four) channel OpenCV types are expected to be in BGR(A) order.
Map of type to numbers in OpenCV (data types x number of channels)
| Type | C1 | C2 | C3 | C4 |
|---|---|---|---|---|
| CV_8U | 0 | 8 | 16 | 24 |
| CV_8S | 1 | 9 | 17 | 25 |
| CV_16U | 2 | 10 | 18 | 26 |
| CV_16S | 3 | 11 | 19 | 27 |
| CV_32S | 4 | 12 | 20 | 28 |
| CV_32F | 5 | 13 | 21 | 29 |
| CV_64F | 6 | 14 | 22 | 30 |
data: Image data stored in a binary format. Image data is represented as a 3-dimensional array with the dimension shape (height, width, nChannels) and array values of type t specified by the mode field. The array is stored in row-major order.
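The mode and data descriptions above can be sketched in plain Python. The numbers in the table follow the pattern `mode = depth code + 8 * (channels - 1)`, where the depth code is the value in the C1 column, and a row-major array of shape (height, width, nChannels) places the pixel at (row, col) at byte offset `(row * width + col) * nChannels` when each value is one byte. The function names and the 2x2 sample image are hypothetical, and the pixel helper assumes 8-bit unsigned values (depth code 0).

```python
def decode_mode(mode):
    """Split a mode number into (depth code, channel count), per the table."""
    depth_code = mode % 8        # value in the C1 column, for example 0 for CV_8U
    channels = mode // 8 + 1     # column index: C1..C4
    return depth_code, channels

def pixel_at(data, width, n_channels, row, col):
    """Return the channel values of one pixel from row-major 8-bit data."""
    start = (row * width + col) * n_channels
    return tuple(data[start:start + n_channels])

# Hypothetical 2x2 image in BGR order, one byte per channel value
data = bytes([
    255, 0, 0,   0, 255, 0,    # row 0: blue pixel, green pixel
    0, 0, 255,   10, 20, 30,   # row 1: red pixel, arbitrary pixel
])

print(decode_mode(16))  # (0, 3)
b, g, r = pixel_at(data, width=2, n_channels=3, row=1, col=0)
print((r, g, b))        # (255, 0, 0)
```

Reordering the unpacked values, as in the last two lines, is one way to convert a BGR pixel to RGB for libraries that expect that order.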
Limitations
Because the image data source decodes image files during DataFrame creation, it increases data size and has the following limitations:
- Disk usage when persisting: Decoded image data is significantly larger than raw bytes. If you persist the DataFrame to a Delta table, store raw bytes instead of decoded data to save disk space.
- Shuffle performance: Shuffling decoded image data requires more disk space and network bandwidth, resulting in slower shuffle operations. Delay decoding as long as possible in your pipeline.
- Fixed decoding library: The image data source uses the Java Image I/O library (javax.imageio) to decode images, which prevents you from using alternative decoding libraries for better performance or custom decoding logic.
To avoid these limitations, use the binary file data source to load image data and decode only as needed.
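The disk usage limitation above can be estimated with simple arithmetic: a decoded image occupies height x width x nChannels values, regardless of how well the original file was compressed. The sketch below assumes one byte per value (8-bit data); the function name and the 1920 x 1080 example dimensions are hypothetical.

```python
def decoded_size_bytes(height, width, n_channels, bytes_per_value=1):
    """Size of the decoded pixel array stored in the data field."""
    return height * width * n_channels * bytes_per_value

size = decoded_size_bytes(height=1080, width=1920, n_channels=3)
print(size)                        # 6220800
print(round(size / 1024**2, 2))    # 5.93 (MiB)
```

Comparing this figure with the size of the source file on disk gives a rough idea of how much decoding inflates the data before you persist or shuffle it.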
Additional resources
- Read binary files: If your workload requires raw image bytes rather than a decoded struct, the binary file data source avoids the decoding overhead and limitations of the image data source.