*Underlying concepts and step-by-step Python code explanation*

Vidushi Bhatia


In his PhD thesis at MIT (circa 1960), Larry Roberts discussed the possibility of extracting 3D geometric information from 2D images and is widely considered to have laid the foundation for Computer Vision research. Since then, researchers have made tremendous progress, especially within the last decade, putting Computer Vision at the frontier of real-world AI applications such as facial recognition, medical imaging, self-driving cars, and many more.

In this blog, I take a deep dive into one such computer vision model, the U-Net. The blog explains the operations used in the U-Net architecture (Convolution, Max Pooling, Transposed Convolution, and Skip Connections) and shows how to implement them from scratch using TensorFlow.

By the end of this blog, you will have built the following architecture (fig-2) to classify image pixels into segments (as in fig-1).

- Overview of U-Net
- Understanding the Key Operations used in U-Net
- Processing the Data
- Defining the U-Net Architecture
- Training the Model
- Evaluating the Model
- Prediction!

The U-Net architecture was introduced by Olaf Ronneberger, Philipp Fischer, and Thomas Brox in 2015 for biomedical image segmentation and has since proven useful across multiple industries. As an image segmentation tool, the model aims to classify each pixel as one of the output classes, creating an output resembling fig-1.

Many Neural Nets have tried to perform ‘image segmentation’ before, but U-Net beats its predecessors by being less computationally expensive and minimizing information loss. Let’s deep dive further to learn more about how U-Net does this.

Before we create a U-Net, let’s understand the key operations used in the architecture (bottom-right corner of fig-3).

If we only used fully connected layers to build networks for high-resolution images, the models would become extremely computationally expensive. Hence, the mathematical operation called ‘convolution’ is a white knight in the Computer Vision story. Each output value of a convolution depends only on a small neighborhood of input pixels, and the same filter weights are reused across the whole image, which sharply reduces the number of parameters and the computation cost.

To perform a convolution operation, repeat the following steps for the entire input image matrix:

**Step 1:** Take a filter matrix K that is smaller than the input image matrix I. Multiply the overlaid elements element-wise and add up the products to create a single value in the output matrix.

**Step 2:** Move the filter to the right by the defined stride and repeat Step 1. Once a row is finished, move the filter down by the stride and start again from the left.

*Example: If we started the operation with column 1 and the stride is 3, then we’ll move to column 4 and repeat Step 1.*
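To make the two steps concrete, here is a minimal sketch in plain Python (no padding, a single channel). The function name `conv2d_single_channel` and the toy matrices are my own illustration, not part of the referenced code. Like deep learning libraries, it does not flip the filter, so strictly speaking it computes a cross-correlation.

```python
def conv2d_single_channel(image, kernel, stride=1):
    """Slide `kernel` over `image` (no padding) and return the output matrix."""
    img_h, img_w = len(image), len(image[0])
    k_h, k_w = len(kernel), len(kernel[0])
    out_h = (img_h - k_h) // stride + 1
    out_w = (img_w - k_w) // stride + 1

    output = []
    for row in range(out_h):
        out_row = []
        for col in range(out_w):
            top, left = row * stride, col * stride
            # Step 1: multiply the overlaid elements and sum the products
            total = sum(
                image[top + i][left + j] * kernel[i][j]
                for i in range(k_h)
                for j in range(k_w)
            )
            out_row.append(total)
            # Step 2: the next loop iteration shifts the filter by `stride`
        output.append(out_row)
    return output


image = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
    [13, 14, 15, 16],
]
kernel = [
    [1, 0],
    [0, 1],
]

print(conv2d_single_channel(image, kernel, stride=2))  # [[7, 11], [23, 27]]
```

With a stride of 2 the 4 x 4 input shrinks to 2 x 2; with a stride of 1 the same filter would produce a 3 x 3 output.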

Change in dimensions for the convolution operation:

- **Input Matrix**: A x B x C, where the height is A, the width is B, and the number of channels/depth is C (e.g. RGB images have 3 channels)
- **Filter Matrix**: D x E x C x G, where the height of the filter is D, the width is E, C is the number of channels/depth (same as the input image), and G is the number of applied filters
- **Output Matrix**: H x W x G, where H = floor((A + 2P - D) / S) + 1 and W = floor((B + 2P - E) / S) + 1 for padding P and stride S, and G is the number of filters that were applied to the input
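The output height and width follow the usual convolution arithmetic, floor((n + 2p - f) / s) + 1, where n is the input size, f the filter size, p the padding, and s the stride. A small helper can do this bookkeeping; the function names below are hypothetical and only meant as a sketch.

```python
def conv_output_size(input_size, filter_size, stride=1, padding=0):
    """Output length along one axis: floor((n + 2p - f) / s) + 1."""
    return (input_size + 2 * padding - filter_size) // stride + 1


def conv_output_shape(input_shape, filter_shape, stride=1, padding=0):
    """Map an (A, B, C) input and a (D, E, C, G) filter bank to (H, W, G)."""
    a, b, c = input_shape
    d, e, filter_channels, g = filter_shape
    if filter_channels != c:
        raise ValueError("filter depth must match the input channels")
    height = conv_output_size(a, d, stride, padding)
    width = conv_output_size(b, e, stride, padding)
    return (height, width, g)


# 32 filters of size 3x3 applied to a 128x128 RGB image
print(conv_output_shape((128, 128, 3), (3, 3, 3, 32)))             # (126, 126, 32)
print(conv_output_shape((128, 128, 3), (3, 3, 3, 32), padding=1))  # (128, 128, 32)
```

The second call shows why a padding of 1 with a 3 x 3 filter keeps the image size unchanged, which is what ‘same’ padding does for this filter size at stride 1.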

The elements of the filter matrix act as the ‘weight’ parameters and are optimized while training the model. Please refer to this article for more information on the Conv operation and ConvNets.

To allocate a class to each pixel in an image, Image Segmentation requires the downscaled image (downscaled by the convolution and pooling steps) to be upscaled to a size closer to the original image. This can be done using fully connected layers, but it becomes very computationally expensive. To solve this, U-Net uses the transposed convolution operation, which increases the dimensions of its input: each input value is multiplied by the filter, and the scaled copies of the filter are placed into (and summed in) a larger output matrix.

Please refer to this article to find out more about Transposed Convolutions
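As a rough illustration of how a transposed convolution enlarges its input, here is a plain-Python sketch for a single channel without padding. The function name and the toy values are assumptions made for this example. Each input value "stamps" a scaled copy of the filter into the output, and overlapping stamps are added together.

```python
def conv2d_transpose_single_channel(inputs, kernel, stride=2):
    """Upsample `inputs` by stamping scaled copies of `kernel` into the output."""
    in_h, in_w = len(inputs), len(inputs[0])
    k_h, k_w = len(kernel), len(kernel[0])
    # Output size without padding: (n - 1) * stride + filter size
    out_h = (in_h - 1) * stride + k_h
    out_w = (in_w - 1) * stride + k_w

    output = [[0] * out_w for _ in range(out_h)]
    for row in range(in_h):
        for col in range(in_w):
            for i in range(k_h):
                for j in range(k_w):
                    # Overlapping contributions accumulate in the same cell
                    output[row * stride + i][col * stride + j] += (
                        inputs[row][col] * kernel[i][j]
                    )
    return output


inputs = [
    [1, 2],
    [3, 4],
]
kernel = [
    [1, 1],
    [1, 1],
]

for out_row in conv2d_transpose_single_channel(inputs, kernel, stride=2):
    print(out_row)
# [1, 1, 2, 2]
# [1, 1, 2, 2]
# [3, 3, 4, 4]
# [3, 3, 4, 4]
```

Here a 2 x 2 input becomes 4 x 4. In Keras, a `Conv2DTranspose` layer with `strides=(2, 2)` and `padding='same'` likewise doubles the height and width, which is the setting used in the decoder later in this blog.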

Pooling serves a purpose similar to convolution: it shrinks the feature maps, which reduces the amount of computation in the following layers and speeds up training. The layer also inadvertently allows for a bit of regularization. There are typically 2 operations performed in pooling: average or max. In both of them, we slide a window over the input based on the filter size ‘f’ and stride ‘s’ and then apply the function (max or average) to each window.

Unlike convolutions, pooling operations have no weight parameters to learn.
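Both pooling variants can be sketched in a few lines of plain Python. The helper `pool2d` and its arguments are illustrative names only; `f` and `s` follow the notation used above.

```python
def pool2d(matrix, f=2, s=2, mode="max"):
    """Apply max or average pooling with window size f and stride s."""
    if mode == "max":
        reducer = max
    elif mode == "average":
        reducer = lambda values: sum(values) / len(values)
    else:
        raise ValueError("mode must be 'max' or 'average'")

    out_h = (len(matrix) - f) // s + 1
    out_w = (len(matrix[0]) - f) // s + 1

    pooled = []
    for row in range(out_h):
        pooled_row = []
        for col in range(out_w):
            # Collect the f x f window whose top-left corner is (row*s, col*s)
            window = [
                matrix[row * s + i][col * s + j]
                for i in range(f)
                for j in range(f)
            ]
            pooled_row.append(reducer(window))
        pooled.append(pooled_row)
    return pooled


feature_map = [
    [1, 3, 2, 4],
    [5, 6, 1, 2],
    [7, 2, 9, 0],
    [1, 8, 3, 4],
]

print(pool2d(feature_map, mode="max"))      # [[6, 4], [8, 9]]
print(pool2d(feature_map, mode="average"))  # [[3.75, 2.25], [4.5, 4.0]]
```

Note that the function has nothing to train: the result depends only on the input values, the window size, and the stride.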

Skip Connections in U-Net copy the feature maps from the earlier layers (LHS layers of fig-3) and use them as part of the input to the later layers (RHS layers). This lets the model reuse the finer detail captured before downsampling and prevents information loss. A lot of popular Computer Vision architectures use skip connections to make the output richer.
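In U-Net, the copied feature map is joined to the later layer by concatenating along the channel axis. Here is a toy sketch using nested lists shaped height x width x channels; the helper names and values are made up for illustration.

```python
def concat_channels(upsampled, skip):
    """Concatenate two H x W x C feature maps along the channel axis."""
    if len(upsampled) != len(skip) or len(upsampled[0]) != len(skip[0]):
        raise ValueError("height and width must match before concatenation")
    return [
        [up_pixel + skip_pixel for up_pixel, skip_pixel in zip(up_row, skip_row)]
        for up_row, skip_row in zip(upsampled, skip)
    ]


def shape(feature_map):
    """Return (height, width, channels) of a nested-list feature map."""
    return (len(feature_map), len(feature_map[0]), len(feature_map[0][0]))


# A 4x4 decoder feature map with 2 channels and a 4x4 encoder map with 3 channels
decoder_map = [[[0.1, 0.2] for _ in range(4)] for _ in range(4)]
encoder_map = [[[0.9, 0.8, 0.7] for _ in range(4)] for _ in range(4)]

merged = concat_channels(decoder_map, encoder_map)
print(shape(merged))  # (4, 4, 5)
print(merged[0][0])   # [0.1, 0.2, 0.9, 0.8, 0.7]
```

The height and width must already agree, which is why the decoder upsamples first and merges afterwards; only the channel count grows.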

Now that we have brushed up on some underlying concepts, let’s start implementing this model and get some hands-on knowledge using The Oxford-IIIT Pet Dataset. The files in this dataset are of varying sizes, so we’ll use resize and reshape to transform them all into a consistent desired size. We will also normalize the image matrix by dividing the pixel values by 256. Please note that the values in the ‘mask’ matrix represent the classes; hence, we won’t normalize them and will only shift them so that the class labels start at 0.

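    import os
    import numpy as np
    from PIL import Image

    # img, mask: lists of image and mask file names in matching order;
    # path1, path2: their folders; X, y: preallocated arrays
    # (all assumed to be defined before this loop).
    for index, file in enumerate(img):
        # Load the image, force 3 channels, and resize (PIL expects (width, height))
        path = os.path.join(path1, file)
        single_img = Image.open(path).convert('RGB')
        single_img = single_img.resize((i_w, i_h))
        single_img = np.reshape(single_img, (i_h, i_w, i_c))
        single_img = single_img / 256.  # scale pixel values to [0, 1)
        X[index] = single_img

        # Load the matching mask; nearest-neighbor resizing avoids blending class labels
        single_mask_ind = mask[index]
        path = os.path.join(path2, single_mask_ind)
        single_mask = Image.open(path)
        single_mask = single_mask.resize((m_w, m_h), Image.NEAREST)
        single_mask = np.reshape(single_mask, (m_h, m_w, m_c))
        single_mask = single_mask - 1  # shift class labels to start at 0
        y[index] = single_mask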

Congratulations! Our folder of images has been converted to X *(dims: # images, img height, img width, img channels)* and y *(dims: # masks, mask height, mask width, mask channels)*. We can now proceed with designing the architecture of U-Net!

The number of images in X should be equal to the number of masks in y; the other dimensions of the two arrays can differ.
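The training step later in this blog uses `X_train`, `y_train`, `X_valid`, and `y_valid`, so X and y need to be split at some point. Below is one possible sketch, assuming a simple random split; the function name, the 20% validation share, and the seed are illustrative choices of mine and not taken from the referenced code.

```python
import random


def train_valid_split(num_samples, valid_fraction=0.2, seed=42):
    """Return shuffled (train_indices, valid_indices) for `num_samples` items."""
    indices = list(range(num_samples))
    random.Random(seed).shuffle(indices)  # seeded so the split is reproducible
    num_valid = int(round(num_samples * valid_fraction))
    return indices[num_valid:], indices[:num_valid]


train_idx, valid_idx = train_valid_split(10, valid_fraction=0.2)
print(len(train_idx), len(valid_idx))  # 8 2

# With NumPy arrays, the same indices select matching images and masks, e.g.:
#   X_train, y_train = X[train_idx], y[train_idx]
#   X_valid, y_valid = X[valid_idx], y[valid_idx]
```

Using the same index lists for X and y keeps every image paired with its own mask, which is the requirement stated above.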

While coding the U-Net architecture, I divided it into 2 parts — encoder and decoder. They can further be divided into a sequence of repeated encoder mini-blocks and decoder mini-blocks.

To design a U-Net, we will have to design reusable mini-blocks and simply string them together.

We will develop a function for encoder mini-block which would allow us to dynamically create all encoder layers. If we look at the above diagram, there are two conv 3x3 operations in each mini-block with a max pool operation (the latter is not present in the ‘bottleneck’ block).

The function below implements this, along with optional Batch Normalization and dropout to make the model more robust. We use ‘He initialization’, which is designed for ReLU activations. Before we apply max pool, we save the output for a skip connection that we’ll use later in the decoder.

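    import tensorflow as tf
    from tensorflow.keras.layers import Conv2D, BatchNormalization

    def EncoderMiniBlock(inputs, n_filters=32, dropout_prob=0.3, max_pooling=True):
        # Two 3x3 convolutions; 'same' padding keeps height and width unchanged
        conv = Conv2D(n_filters,
                      3,  # filter size
                      activation='relu',
                      padding='same',
                      kernel_initializer='HeNormal')(inputs)
        conv = Conv2D(n_filters,
                      3,  # filter size
                      activation='relu',
                      padding='same',
                      kernel_initializer='HeNormal')(conv)

        # training=False keeps batch normalization in inference mode
        conv = BatchNormalization()(conv, training=False)

        # Optional dropout for regularization
        if dropout_prob > 0:
            conv = tf.keras.layers.Dropout(dropout_prob)(conv)

        # Halve height and width, except in the bottleneck block
        if max_pooling:
            next_layer = tf.keras.layers.MaxPooling2D(pool_size=(2, 2))(conv)
        else:
            next_layer = conv

        # Keep the pre-pooling output for the decoder's skip connection
        skip_connection = conv
        return next_layer, skip_connection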

To complete the encoder, we’ll stack these mini-blocks with the number of filters doubling in each subsequent block (as shown in fig-10).

The decoder first increases image dimensions using transposed convolutions and then merges the results with the information from the skip connection (stored in the encoder code block). With 2 more convolution operations, our mini-block is ready. Note that we are using ‘same’ padding in convolutions to ensure our image size doesn’t decrease.
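To see what stacking does to the tensor shapes, here is a small plain-Python trace of the encoder. It assumes a 128 x 128 input, 32 filters in the first block, and 4 pooled blocks followed by a bottleneck; these numbers and the function name are assumptions for illustration, so adjust them to your own setup.

```python
def encoder_shapes(height, width, base_filters=32, depth=4):
    """Trace (H, W, C) through `depth` encoder mini-blocks plus the bottleneck."""
    shapes = []
    for level in range(depth):
        n_filters = base_filters * 2 ** level  # filters double in every block
        # 'same' convolutions keep H and W; this output feeds the skip connection
        skip_shape = (height, width, n_filters)
        # 2x2 max pooling halves H and W before the next block
        height, width = height // 2, width // 2
        shapes.append({"skip": skip_shape, "next": (height, width, n_filters)})
    # The bottleneck block convolves once more without pooling
    bottleneck = (height, width, base_filters * 2 ** depth)
    return shapes, bottleneck


shapes, bottleneck = encoder_shapes(128, 128)
for level, block in enumerate(shapes, start=1):
    print(f"block {level}: skip={block['skip']} next={block['next']}")
print("bottleneck:", bottleneck)
# block 1: skip=(128, 128, 32) next=(64, 64, 32)
# block 2: skip=(64, 64, 64) next=(32, 32, 64)
# block 3: skip=(32, 32, 128) next=(16, 16, 128)
# block 4: skip=(16, 16, 256) next=(8, 8, 256)
# bottleneck: (8, 8, 512)
```

The spatial size halves while the channel count doubles at every level, so the network trades resolution for richer features on the way down.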

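    from tensorflow.keras.layers import Conv2D, Conv2DTranspose, concatenate

    def DecoderMiniBlock(prev_layer_input, skip_layer_input, n_filters=32):
        # Upsample: stride 2 with 'same' padding doubles height and width
        up = Conv2DTranspose(
            n_filters,
            (3, 3),
            strides=(2, 2),
            padding='same')(prev_layer_input)

        # Merge with the encoder's skip connection along the channel axis
        merge = concatenate([up, skip_layer_input], axis=3)

        # Two 3x3 convolutions, as in the encoder mini-block
        conv = Conv2D(n_filters,
                      3,
                      activation='relu',
                      padding='same',
                      kernel_initializer='HeNormal')(merge)
        conv = Conv2D(n_filters,
                      3,
                      activation='relu',
                      padding='same',
                      kernel_initializer='HeNormal')(conv)
        return conv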

After stacking 4 mini-blocks, we will top off the decoder with a conv 1x1 operation, which converts the mini-block output to the desired dimensions.

The number of filters used in the output layer is equal to the number of output classes. Hence, our output will have the dimensions: H * W * # classes.
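The same kind of shape bookkeeping works for the decoder. The sketch below assumes that each decoder mini-block uses as many filters as the skip connection it receives, that the bottleneck output is 8 x 8 x 512, and that there are 3 output classes; all of these, along with the function name, are assumptions for illustration rather than values confirmed by the referenced code.

```python
def decoder_shapes(bottleneck, skip_shapes, n_classes):
    """Trace (H, W, C) through the decoder mini-blocks and the final 1x1 conv."""
    height, width, _ = bottleneck
    trace = []
    # The deepest skip connection is consumed first
    for skip_h, skip_w, skip_c in reversed(skip_shapes):
        n_filters = skip_c  # assumption: decoder filters mirror the encoder
        # Transposed convolution with stride 2 and 'same' padding doubles H and W
        height, width = height * 2, width * 2
        if (height, width) != (skip_h, skip_w):
            raise ValueError("upsampled size does not match the skip connection")
        trace.append({
            "upsampled": (height, width, n_filters),
            "merged": (height, width, n_filters + skip_c),  # channel concat
            "output": (height, width, n_filters),           # after two 3x3 convs
        })
    # The 1x1 convolution maps the last block to one channel per class
    final = (height, width, n_classes)
    return trace, final


skips = [(128, 128, 32), (64, 64, 64), (32, 32, 128), (16, 16, 256)]
trace, final = decoder_shapes((8, 8, 512), skips, n_classes=3)
for block in trace:
    print(block)
print("final output:", final)  # final output: (128, 128, 3)
```

If the check inside the loop raises an error for your own input size, the height or width is probably not divisible by 2 often enough for the pooled and upsampled sizes to line up.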

After assembling all the mini-blocks shown in the previous section, we now need to choose an optimizer, loss function, and accuracy metric for the model. We can then use model.fit() for training. Below, I have used the Adam optimizer along with Sparse Categorical Cross Entropy.

If your output labels are one-hot encoded, use Categorical Cross Entropy instead of Sparse Categorical Cross Entropy.

    unet.compile(optimizer=tf.keras.optimizers.Adam(),
                 loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
                 metrics=['accuracy'])

    results = unet.fit(X_train, y_train,
                       batch_size=32,
                       epochs=20,
                       validation_data=(X_valid, y_valid))
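The two cross entropy variants compute the same quantity and differ only in how the label is encoded. The plain-Python sketch below (my own helper functions, not the Keras implementation) shows this for a single pixel with 3 classes. It works on raw scores, which corresponds to `from_logits=True`: the loss applies the softmax itself, so the model’s last layer should not apply one.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities (shifted by the max for stability)."""
    highest = max(logits)
    exps = [math.exp(value - highest) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


def sparse_categorical_crossentropy(label, logits):
    """Label is an integer class id, e.g. 0."""
    return -math.log(softmax(logits)[label])


def categorical_crossentropy(one_hot, logits):
    """Label is a one-hot vector, e.g. [1, 0, 0]."""
    probabilities = softmax(logits)
    return -sum(t * math.log(p) for t, p in zip(one_hot, probabilities))


logits = [2.0, 0.5, -1.0]  # raw model output for one pixel
sparse_loss = sparse_categorical_crossentropy(0, logits)
dense_loss = categorical_crossentropy([1, 0, 0], logits)
print(abs(sparse_loss - dense_loss) < 1e-12)  # True
```

Since our masks store one integer class id per pixel, the sparse variant fits them directly and avoids building a one-hot array.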

- First, we will check if our model is **learning at the correct rate**. We can do so by plotting the loss for each epoch. If the learning rate is too large, the train loss would oscillate; otherwise, we would see a consistently decreasing loss.
- Second, we will look for **high bias or underfitting**, i.e. both the training and validation accuracy are very low. This means the model hasn’t been trained well and needs to be tuned. Some options to **solve for high bias** are a bigger network, more training iterations, or adding more features. A better optimization algorithm and better initialization of weights might also help.
- Lastly, we will check for **high variance or overfitting**, i.e. the train accuracy is high but the validation accuracy is low. This means that the model is very tightly fitted to the train data and not general enough to predict new data values. To solve this, we can use **regularization**, which shrinks the influence of the weights, or add more examples to our train set.
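These three checks can be turned into a rough helper. The thresholds below (`low_acc`, `gap`, `tolerated_fraction`) and the function names are arbitrary placeholders that I picked for illustration; treat them as a starting point and not as fixed rules.

```python
def loss_is_oscillating(train_losses, tolerated_fraction=0.3):
    """Flag a loss curve that goes up in too many epochs."""
    steps = list(zip(train_losses, train_losses[1:]))
    increases = sum(1 for previous, current in steps if current > previous)
    return increases / max(len(steps), 1) > tolerated_fraction


def diagnose(train_acc, valid_acc, low_acc=0.7, gap=0.1):
    """Very rough bias/variance check from the final accuracies."""
    if train_acc < low_acc and valid_acc < low_acc:
        return "high bias (underfitting)"
    if train_acc - valid_acc > gap:
        return "high variance (overfitting)"
    return "no obvious bias/variance problem"


# In Keras, results.history holds per-epoch lists such as 'loss',
# 'accuracy', and 'val_accuracy'; the values below are made up.
print(loss_is_oscillating([1.0, 0.8, 0.6, 0.5]))       # False
print(loss_is_oscillating([1.0, 1.4, 0.9, 1.3, 0.8]))  # True
print(diagnose(train_acc=0.55, valid_acc=0.50))        # high bias (underfitting)
print(diagnose(train_acc=0.95, valid_acc=0.70))        # high variance (overfitting)
```

A plot of the curves is still worth a look, since a single summary number can hide when in training the problem started.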

After evaluation, tune the model to get the best results on the criteria shown above.

Now that we have checked how our model is performing in numbers, we can also visualize its predictions by using model.predict(). Don’t forget to ensure the dimensions of your input match the input dimensions of the trained model. Also, to visualize the predicted mask, adjust its axes to match the output dimensions.

    import matplotlib.pyplot as plt

    def VisualizeResults(index):
        # Add a batch dimension, since predict() expects a batch of images
        img = X_valid[index]
        img = img[np.newaxis, ...]
        pred_y = unet.predict(img)

        # Pick the most likely class for every pixel
        pred_mask = tf.argmax(pred_y[0], axis=-1)
        pred_mask = pred_mask[..., tf.newaxis]

        # Show the input image, the true mask, and the predicted mask side by side
        fig, arr = plt.subplots(1, 3, figsize=(15, 15))
        arr[0].imshow(X_valid[index])
        arr[0].set_title('Processed Image')
        arr[1].imshow(y_valid[index, :, :, 0])
        arr[1].set_title('Actual Masked Image')
        arr[2].imshow(pred_mask[:, :, 0])
        arr[2].set_title('Predicted Masked Image')
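The argmax step is what turns the per-class scores into a mask. Here is a tiny plain-Python version of that step on a 2 x 2 "image" with 3 classes; the helper name and the scores are made up for illustration.

```python
def argmax_mask(pixel_scores):
    """For each pixel, return the index of the class with the highest score."""
    return [
        [max(range(len(scores)), key=scores.__getitem__) for scores in row]
        for row in pixel_scores
    ]


# Shape: height x width x classes
scores = [
    [[2.1, 0.3, -1.0], [0.2, 1.5, 0.1]],
    [[-0.5, 0.0, 3.2], [4.0, 1.0, 1.0]],
]

print(argmax_mask(scores))  # [[0, 1], [2, 0]]
```

The class axis disappears in the process, which is why the function above adds a trailing axis back before plotting.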

The below images compare the actual mask vs the predicted mask from the U-Net model. Try using the model we have created to predict the outline and background of an image of your choice!

With the help of transposed convolutions and skip connections, U-Net has outperformed its predecessors and proved to be a useful Computer Vision tool in multiple industries. I hope this blog is a good starting point for you to try making a U-Net model for your own application. I would also highly recommend reading the original published paper U-Net: Convolutional Networks for Biomedical Image Segmentation. The referenced code in this blog is stored on GitHub and I would be happy to answer any questions.

Pro tip: A lot of errors can be resolved by keeping track of input and output data dimensions at each step. Happy coding!