Data preprocessing using tf.keras.utils.image_dataset_from_directory
In this tutorial, we will learn about image preprocessing using tf.keras.utils.image_dataset_from_directory from the TensorFlow Keras API in Python, and walk through how this data/image preprocessing utility works.
What is Data Preprocessing?
The TensorFlow/Keras preprocessing utility functions let you go from raw data on disk to a tf.data.Dataset object that can be used to train a model.
For example, say your train directory contains 9 subfolders, one per category of skin cancer, each holding around 5,000 images, and you want to train a classifier that assigns each picture to one of these categories.
This is what the class subfolders of your training data look like:
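For instance, the layout might resemble the sketch below (the subfolder names here are hypothetical placeholders; in practice they would be the names of the 9 skin-cancer categories):

```
train/
├── class_1/
│   ├── img_001.jpg
│   ├── img_002.jpg
│   └── ...
├── class_2/
│   └── ...
└── class_9/
    └── ...
```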
Then calling image_dataset_from_directory(main_directory, labels='inferred') returns a tf.data.Dataset that yields batches of images from the subdirectories.
Supported image formats are JPEG, PNG, BMP, and GIF.
Usage of tf.keras.utils.image_dataset_from_directory
- Image Classification.
- Load and preprocess images.
- Retrain an image classifier.
- Transfer learning fine-tuning.
Let's say we have images of different kinds of skin cancer inside our train directory. We want to load them with tf.keras.utils.image_dataset_from_directory(), using 80% of the images for training and the remaining 20% for validation. We set the batch size to 32, the image size to 224×224 pixels, and seed=123.
```python
import tensorflow as tf

IMG_SIZE = (224, 224)
BATCH_SIZE = 32

train_ds = tf.keras.utils.image_dataset_from_directory(
    train,  # path to the train directory
    validation_split=0.2, subset='training', shuffle=True,
    batch_size=BATCH_SIZE, image_size=IMG_SIZE, seed=123)

valid_ds = tf.keras.utils.image_dataset_from_directory(
    train,
    validation_split=0.2, subset='validation', shuffle=True,
    batch_size=BATCH_SIZE, image_size=IMG_SIZE, seed=123)

test_ds = tf.keras.utils.image_dataset_from_directory(
    test,  # path to the test directory
    shuffle=True, batch_size=BATCH_SIZE, image_size=IMG_SIZE)
```
The dataset contains about 20,239 images belonging to 9 classes; of these, about 16,192 go to training and about 4,047 to validation.
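To sanity-check counts like these on your own data, you can inspect class_names and a sample batch. The snippet below is a self-contained sketch: it builds a tiny synthetic directory with two placeholder classes of random images so it runs anywhere; with real data you would point image_dataset_from_directory at your train directory instead.

```python
import os
import tempfile

import numpy as np
import tensorflow as tf

# Build a tiny stand-in for the real skin-cancer folders:
# 2 classes with 4 random PNG images each.
root = tempfile.mkdtemp()
for cls in ("class_a", "class_b"):
    os.makedirs(os.path.join(root, cls))
    for i in range(4):
        img = np.random.randint(0, 255, (32, 32, 3), dtype=np.uint8)
        tf.keras.utils.save_img(os.path.join(root, cls, f"{i}.png"), img)

ds = tf.keras.utils.image_dataset_from_directory(
    root, validation_split=0.25, subset="training",
    seed=123, image_size=(224, 224), batch_size=2)

print(ds.class_names)        # class names inferred from the subdirectory names
for images, labels in ds.take(1):
    print(images.shape)      # (2, 224, 224, 3): batch of resized RGB images
    print(labels.shape)      # (2,): one integer label per image
```

The same pattern (class_names plus one batch's shapes) works on any directory-backed dataset, regardless of its size.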
The different kinds of arguments that are passed inside image_dataset_from_directory are as follows :
- directory: The directory in which the data (images) is stored. If labels is "inferred", it should contain subdirectories, each holding images of one class. Otherwise, the directory structure is ignored.
- labels: Either "inferred" (labels are generated from the directory structure), None (no labels), or a list/tuple of integer labels of the same size as the number of image files found in the directory. Labels should be sorted according to the alphanumeric order of the image file paths (obtained in Python via os.walk(directory)).
- class_names: Only valid if labels is "inferred". An explicit list of class names (must match the names of the subdirectories). Used to control the order of the classes (otherwise alphanumeric order is used).
- image_size: Size to which images are resized after being read from disk. The default value is (256, 256). This is required because the pipeline processes batches of images, which must all be the same size.
- batch_size: The size of the data batches. Defaults to 32. If None, the data will not be batched (the dataset will yield individual samples).
- shuffle: Whether or not to shuffle the data. Defaults to True. If False, the data is sorted in alphanumeric order.
- validation_split: Optional float between 0 and 1, the fraction of the data to reserve for validation.
- subset: Either "training" or "validation". Only used if validation_split is set.
- interpolation: The interpolation method used when resizing images. Defaults to "bilinear"; other supported methods include "nearest", "bicubic", "area", "lanczos3", "lanczos5", "gaussian", and "mitchellcubic".
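These interpolation names match the method argument of tf.image.resize, which is a convenient way to see what each method does (the assumption here is only that the method names behave the same; the toy image below is purely illustrative):

```python
import tensorflow as tf

# A toy 2x2 single-channel "image", upscaled to 4x4 with two methods.
img = tf.constant([[[0.0], [1.0]],
                   [[1.0], [0.0]]])  # shape (2, 2, 1)

bilinear = tf.image.resize(img, (4, 4), method="bilinear")
nearest = tf.image.resize(img, (4, 4), method="nearest")

print(bilinear.shape)  # (4, 4, 1)
# "nearest" simply repeats pixels, so it keeps only the original
# values 0.0 and 1.0, while "bilinear" produces intermediate values.
print(sorted(set(nearest.numpy().ravel())))
```

In practice, "bilinear" is a sensible default for photographic data, while "nearest" is preferred when interpolated values would be meaningless (for example, segmentation masks).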
To read more about tf.keras.utils.image_dataset_from_directory, refer to the official TensorFlow API documentation.