U-Net: Image Segmentation Network

23 November 2018


U-Net is considered one of the standard CNN architectures for image segmentation tasks: cases where we need not only to classify the whole image but also to segment it into areas by class, i.e. produce a mask that separates the image into several classes. The architecture consists of a contracting path to capture context and a symmetric expanding path that enables precise localization.

The network is trained in end-to-end fashion from very few images and outperforms the prior best method (a sliding-window convolutional network) on the ISBI challenge for segmentation of neuronal structures in electron microscopic stacks. Using the same network trained on transmitted light microscopy images (phase contrast and DIC), U-Net won the ISBI cell tracking challenge 2015 in these categories by a large margin. Moreover, the network is fast. Segmentation of a 512×512 image takes less than a second on a modern GPU.

u-net medical segmentation
Brain segmentation

Key Points

  1. Achieves good performance on various real-life tasks, especially biomedical applications;
  2. Has modest computational requirements: training takes about 10 hours on a single NVidia Titan GPU (6 GB);
  3. Needs only a small amount of training data to achieve good results.

The U-net Architecture

u-net architecture
Fig. 1. U-net architecture (example for 32×32 pixels in the lowest resolution). Each blue box corresponds to a multi-channel feature map. The number of channels is denoted on top of the box. The x-y-size is provided at the lower left edge of the box. White boxes represent copied feature maps. The arrows denote the different operations.

The network architecture is illustrated in Figure 1. It consists of a contracting path (left side) and an expansive path (right side). The contracting path follows the typical architecture of a convolutional network. It consists of the repeated application of two 3×3 convolutions (unpadded convolutions), each followed by a rectified linear unit (ReLU) and a 2×2 max pooling operation with stride 2 for downsampling.

At each downsampling step, feature channels are doubled. Every step in the expansive path consists of an upsampling of the feature map followed by a 2×2 convolution (up-convolution) that halves the number of feature channels, a concatenation with the correspondingly cropped feature map from the contracting path, and two 3×3 convolutions, each followed by a ReLU. The cropping is necessary due to the loss of border pixels in every convolution.

u-net pipeline

At the final layer, a 1×1 convolution is used to map each 64-component feature vector to the desired number of classes. In total the network has 23 convolutional layers.
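Because every 3×3 convolution is unpadded, the output map shrinks relative to the input by a fixed amount. A few lines of Python are enough to trace the spatial sizes through the depth-4 configuration shown in Fig. 1 (this is a sketch of the arithmetic, not the paper's code):

```python
def unet_output_size(n, depth=4):
    """Spatial size of the U-Net output for an n x n input tile.

    Each contracting step applies two unpadded 3x3 convolutions (each
    trims 2 pixels) followed by 2x2 max pooling; the expanding path
    mirrors this with 2x2 up-convolutions that double the size.
    """
    for _ in range(depth):
        n -= 4       # two 3x3 valid convolutions
        n //= 2      # 2x2 max pooling with stride 2
    n -= 4           # two convolutions at the bottleneck
    for _ in range(depth):
        n = n * 2 - 4  # up-convolution, then two 3x3 valid convolutions
    return n

print(unet_output_size(572))  # 388: a 572x572 tile yields a 388x388 map
```

This shrinkage is also why the contracting-path feature maps have to be cropped before they are concatenated in the expanding path.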


The input images and their corresponding segmentation maps are used to train the network with stochastic gradient descent. Due to the unpadded convolutions, the output image is smaller than the input by a constant border width. A pixel-wise soft-max over the final feature map, combined with the cross-entropy loss function, defines the energy function. The cross-entropy penalizes, at each position, the deviation of the predicted probability of the true label from 1:

E = Σ_{x∈Ω} w(x) log(p_ℓ(x)(x))

where ℓ(x) is the true label of pixel x, p_k(x) is the soft-max probability of class k at that pixel, and w is a pixel-wise weight map.

The separation border between touching cells is computed using morphological operations. The weight map is then computed as:

w(x) = w_c(x) + w_0 · exp(−(d_1(x) + d_2(x))² / (2σ²))

where w_c is the weight map that balances the class frequencies, d_1 denotes the distance to the border of the nearest cell and d_2 the distance to the border of the second nearest cell (the paper uses w_0 = 10 and σ ≈ 5 pixels).
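Given precomputed distance maps d1 and d2 (in practice these would come from a distance transform of the cell masks), the weight map is a one-liner. A minimal NumPy sketch, using w0 = 10 and σ = 5 as reported in the paper:

```python
import numpy as np

def border_weight_map(w_c, d1, d2, w0=10.0, sigma=5.0):
    """w(x) = w_c(x) + w0 * exp(-(d1(x) + d2(x))^2 / (2 * sigma^2)).

    w_c balances class frequencies; d1 and d2 are per-pixel distances
    to the nearest and second-nearest cell border.
    """
    return w_c + w0 * np.exp(-((d1 + d2) ** 2) / (2.0 * sigma ** 2))
```

Pixels on a separation border (d1 = d2 = 0) get the maximum extra weight w0, while pixels far from any border keep essentially only the class-balancing term w_c.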

Use Cases and Implementation

U-Net has been applied to many real-world problems, particularly in biomedical imaging. As these examples show, the network is versatile and can be used for any reasonable image masking task: high accuracy is achieved given proper training, an adequate dataset and sufficient training time.



u-net comparison
Fig. 2. Segmentation results (IOU) on the ISBI cell tracking challenge 2015.

U-Net was applied to a cell segmentation task in light microscopic images, part of the ISBI cell tracking challenge 2014 and 2015. The PhC-U373 dataset contains Glioblastoma-astrocytoma U373 cells on a polyacrylamide substrate recorded by phase contrast microscopy, with 35 partially annotated training images. Here U-Net achieved an average IOU (intersection over union) of 92%, significantly better than the second-best algorithm with 83% (see Fig. 2). The second dataset, DIC-HeLa, contains HeLa cells on flat glass recorded by differential interference contrast (DIC) microscopy (see figures below), with 20 partially annotated training images. Here U-Net achieved an average IOU of 77.5%, significantly better than the second-best algorithm with 46%.
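For reference, the IOU metric reported by the challenge is simply the overlap between the predicted and ground-truth masks; a quick NumPy sketch:

```python
import numpy as np

def iou(pred, gt):
    """Intersection over union of two binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    intersection = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return intersection / union
```

An IOU of 1.0 means the masks coincide exactly; disjoint masks score 0.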

u-net results
Result on the ISBI cell tracking challenge. (a) part of an input image of the PhC-U373 data set. (b) Segmentation result (cyan mask) with the manual ground truth (yellow border) (c) input image of the DIC-HeLa data set. (d) Segmentation result (random colored masks) with the manual ground truth (yellow border).

The U-Net architecture achieves outstanding performance on very different biomedical segmentation applications. It needs only very few annotated images and has a reasonable training time of just 10 hours on an NVidia Titan GPU (6 GB).

Synthesising Images of Humans in Unseen Poses

19 July 2018
Modelling a pose in a photograph


Humans have an incredible capability to imagine things in a different context. At the core of our intelligence are imagination and learning from experience. The two are connected: creativity always draws on memory and experience. That is why we can estimate the shape of an object even when we see it from only one viewpoint, and can imagine the motion or deformation of an object just by looking at it while it is static. Our memory gives us the ability to visualize complicated things, such as what a person would look like in a different context or a different pose.

Researchers from the Massachusetts Institute of Technology have addressed the computer vision task of novel human pose synthesis. In their work, they present an image synthesis method that, given an image containing a person and a desired target pose, synthesises a realistic depiction of the person in that pose.

synthesizing images of human poses

They combine multiple techniques, and they frame the novel pose synthesis as a deep learning problem. The approach is unique, and it produces realistic images as they demonstrate in a few different use cases.

Problem Statement

The problem statement: given an image and a target pose, synthesise the person in the picture in that pose. From this point of view, the task of novel human pose synthesis is non-trivial, and a few crucial things have to be taken into account.

source and synthesized image
The problem statement: Given an image and a target pose synthesise the person in the image in that pose

First, the generated image has to be as realistic as possible. Second, changing the pose requires segmenting the person from the background or other objects present in the picture. Third, introducing a novel pose leaves gaps in the background caused by disocclusion, which have to be filled appropriately; moreover, self-occlusion has to be handled as well.

Capturing these complex changes in the image space represents a challenge and in their approach, the researchers tackle the problem by dividing it into smaller sub-problems, solved by separate modules.


In fact, they design a modular architecture in which each module addresses a different challenge, and which together provide realistic image synthesis. The architecture consists of 4 modules:

A. Source Image Segmentation

B. Spatial Transformation

C. Foreground Synthesis

D. Background Synthesis

The architecture is trained in a supervised manner, mapping a tuple of a source image, source pose and target pose to a target image. The whole architecture is trained jointly, with the target image as the supervision signal.

GNN architecture
The proposed architecture, comprising 4 modules

A. Source Image Segmentation

Differences in poses and motion introduced by a pose transformation often involve several moving body parts, large displacements, occlusions and self-occlusion. To handle this, the first module segments the source image. The segmentation is two-fold: first, the image is segmented into foreground and background; then the foreground (the person) is segmented into body parts such as arms, legs etc. The output of the segmentation stage therefore consists of one background layer and L foreground layers corresponding to the L predefined body parts.

As mentioned before, a tuple of the input image, its pose and the desired target pose is given as input. Unlike the source image, which is an RGB image, a pose is defined as a stack of layers: a 3D volume in ℝ^(H×W×J). Each of the J layers (or channels) in the pose representation contains a "Gaussian bump" centred at the (x, y) location of one joint. The Gaussian representation (instead of a deterministic dense representation) acts as a form of regularization, since joint location estimates can often be noisy or incorrect. In the experiments, the authors use J = 14 joints (head, neck, shoulders, elbows, wrists, hips, knees and ankles), i.e. 14 channels.
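A minimal NumPy sketch of this pose encoding (the joint coordinates and the bump width sigma below are illustrative assumptions, not values from the paper):

```python
import numpy as np

def pose_to_volume(joints, height, width, sigma=6.0):
    """Encode J (x, y) joint locations as a height x width x J volume,
    one Gaussian bump per channel, centred at each joint."""
    ys, xs = np.mgrid[0:height, 0:width]
    channels = [np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2.0 * sigma ** 2))
                for x, y in joints]
    return np.stack(channels, axis=-1)

volume = pose_to_volume([(5, 7), (10, 3)], height=16, width=16)
print(volume.shape)  # (16, 16, 2): one channel per joint
```

Each channel peaks at 1 at its joint location and falls off smoothly, which is what makes the representation tolerant to small localization errors.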

results from the segmentation module using Gaussian bumps
The results from the segmentation module using Gaussian bumps

The segmentation module is a U-Net that takes a concatenated volume (the input image plus the pose layers) as input and outputs L layer masks specifying the rough location of each body part. The output is, in fact, a 2D Gaussian mask over the approximate spatial region of each body part, which enables the desired segmentation to be obtained.

B. Spatial Transformation

The segmented layers from the segmentation module are then spatially transformed to fit the desired pose parts. The spatial transformation is not learned but directly computed from the input poses.

C. Foreground Synthesis

The foreground synthesis module is again a U-shaped network (an encoder-decoder with skip connections). It takes the spatially transformed layers concatenated with the target pose layers as input and, by branching the end of the network, produces two outputs: the target foreground and the target mask.

D. Background Synthesis

The background synthesis module fills in the missing background, i.e. the parts occluded by the person in the input image. This module is also a U-Net; it takes a volume of the input image (with Gaussian noise in place of the foreground pixels) and the input pose mask, and outputs a realistic background without the foreground (the person).

Source Image Segmentation
The results of the separate submodules, gradually building the final synthesised image

Image Synthesis

Finally, the target background and foreground images are fused by a weighted linear sum taking into account the target mask (see formula below):

image synthesis
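The fusion itself is just a per-pixel weighted sum; a minimal NumPy sketch of the compositing step:

```python
import numpy as np

def fuse(foreground, background, mask):
    """y = mask * foreground + (1 - mask) * background, per pixel.

    mask is the predicted target mask in [0, 1]; soft values blend
    the two layers smoothly at object boundaries."""
    return mask * foreground + (1.0 - mask) * background
```

A soft (rather than binary) mask is what lets the composite avoid hard, unrealistic seams around the person.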

As in many recent generative models, the researchers add an adversarial discriminator to force the generation of realistic images. The generative model was trained using an L1 loss, a feature-wise loss (denoted L-VGG), and finally a combined L-VGG + GAN loss using the binary cross-entropy classification error of the discriminator.

Image synthesis with different loss functions
Results with the different loss function
gradient magnitude
A plot of pixel gradient magnitude for different loss functions


The method was evaluated using videos of people performing actions collected from YouTube. The experiments used videos from three action classes: golf swings, yoga/workout routines and tennis actions, with datasets of 136, 60 and 70 videos, respectively. Simple data augmentation was also used to increase the size of the dataset.

Errors and SSIM score
Errors (lower is better) and SSIM score (higher is better) of the proposed method vs. a UNet architecture
Synthesized Images
Some outputs of the proposed method
Synthesized images with different loss functions
Comparison of different loss functions using the same input

Bottom line

The evaluation shows that the method is capable of synthesizing realistic images across different action classes. Though trained on pairs of images within the same video, the method can generalize to pose-appearance combinations that it has never seen (e.g., a golfer in a tennis player’s pose). The decoupling approach proved successful in this non-trivial task, and it shows that tackling a problem by dividing it into sub-problems can give outstanding results despite the complexity of the problem itself.

Dane Mitrev

Image Editing Becomes Easy with Semantically Meaningful Objects Generated

3 July 2018
semantic soft segmentation


Image editing and compositing can be a fascinating creative process, unless you have to spend most of your time on the tedious task of object selection. The process becomes even more time-consuming when fuzzy boundaries and transparency are involved. Existing tools such as the magnetic lasso and the magic wand exploit only low-level cues and produce binary selections that the virtual artist then has to refine to account for soft boundaries.

In this article, we are going to discover how neural networks may assist with this challenging task and create a set of layers that correspond to semantically meaningful regions with accurate soft transitions between different objects.

Suggested Approach

A group of researchers from MIT CSAIL (USA) and ETH Zürich (Switzerland), headed by Y. Aksoy, suggested approaching this problem from a spectral segmentation angle. In particular, they propose a graph structure that combines the texture and color information from the input image with higher-level semantic information generated by a neural network. The soft segments are then generated fully automatically via eigendecomposition of a carefully constructed Laplacian matrix. High-quality layers generated from the eigenvectors can be used for quick and easy image editing. Combining elements from different images has always been a powerful way to produce new content, and it becomes much more efficient with automatically created layers.

Overview of the suggested approach

Model Specifications

Let’s now discuss this approach to creating semantically meaningful layers step-by-step:

1. Spectral matting. The approach builds upon the work of Levin and his colleagues, who were the first to introduce the matting Laplacian: a matrix L, built from local color distributions, that captures the affinity between each pair of pixels in a local patch. Using this matrix, they minimize the quadratic functional aᵀLa subject to user-provided constraints, with a denoting a vector of all the alpha values for a layer. Each soft segment is then a linear combination of the K eigenvectors corresponding to the smallest eigenvalues of L that maximizes matting sparsity.

2. Color affinity. To define nonlocal color affinity, the researchers suggest generating 2500 superpixels and estimating the affinity between each superpixel and all superpixels within a radius corresponding to 20% of the image size. This affinity essentially ensures that regions with very similar colors stay connected in challenging scene structures like the one depicted below.

Nonlocal color affinity

3. High-level semantic affinity. This stage was introduced to create segments that are confined to semantically similar regions. Semantic affinity encourages the grouping of pixels that belong to the same scene object and discourages the grouping of pixels from different objects. Here the researchers build upon prior work in object recognition to compute, at each pixel, a feature vector that correlates with the underlying object. Feature vectors are computed via a neural network, discussed in more detail later. Semantic affinity is defined over superpixels, similarly to color affinity. However, unlike the color affinity, the semantic affinity only relates nearby superpixels, to favor the creation of connected objects. The combination of nonlocal color affinity and local semantic affinity allows layers to cover spatially disconnected parts of the same semantically coherent region (e.g., greenery, sky, other types of background).

Semantic affinity

4. Creating the layers. This part uses the affinities defined earlier to form a Laplacian matrix L. The eigenvectors corresponding to the 100 smallest eigenvalues of L are extracted, and a two-step sparsification process creates 40 layers from these eigenvectors. The number of layers is then reduced to 5 by running the k-means algorithm with k = 5. This approach produced better results than directly sparsifying the 100 eigenvectors into 5 layers, since such a drastic reduction makes the problem overly constrained. The researchers chose 5 segments as a reasonable number for most images, but this number can be changed by the user depending on the scene structure.
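The core spectral step can be illustrated on a toy affinity graph: build the Laplacian, take the eigenvectors of its smallest eigenvalues, and read a segmentation off their sign pattern. This is a 5-node sketch of the principle, not the paper's 100-eigenvector pipeline:

```python
import numpy as np

# Toy affinity matrix over 5 "superpixels": nodes 0-2 strongly connected,
# nodes 3-4 strongly connected, only a weak link (0.01) between the groups.
W = np.array([[0.00, 1.00, 1.00, 0.01, 0.00],
              [1.00, 0.00, 1.00, 0.00, 0.00],
              [1.00, 1.00, 0.00, 0.00, 0.00],
              [0.01, 0.00, 0.00, 0.00, 1.00],
              [0.00, 0.00, 0.00, 1.00, 0.00]])
L = np.diag(W.sum(axis=1)) - W        # graph Laplacian
vals, vecs = np.linalg.eigh(L)        # eigenvalues in ascending order
# The sign pattern of the second-smallest eigenvector (the Fiedler
# vector) separates the two weakly connected groups.
labels = (vecs[:, 1] > 0).astype(int)
```

In the paper the same idea operates on a much larger Laplacian built from the matting, color and semantic affinities, and the eigenvectors are sparsified rather than simply thresholded.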

Soft segments before and after grouping

5. Semantic feature vectors. In this implementation, a semantic segmentation approach is combined with a network for metric learning. The feature extractor is based on DeepLab-ResNet-101 and trained with a metric-learning objective: the network minimizes the L2 distance between the features of samples with the same ground-truth class and maximizes the distance between features of different objects.

Qualitative Comparison to the Related Methods

Comparison between the proposed soft segments and soft color segments by Aksoy

The figures below show the results of the suggested approach (marked as 'Our result') together with those of spectral matting, the most closely related soft segmentation method, and two state-of-the-art methods for semantic segmentation: the scene parsing method PSPNet and the instance segmentation method Mask R-CNN.

Qualitative comparison of the semantic soft segmentation approach with the related methods

You may observe that PSPNet and Mask R-CNN tend to produce inaccuracies around object boundaries, while the soft segments of spectral matting often extend beyond object boundaries. At the same time, the semantic soft segmentation approach described here captures objects in their entirety without grouping unrelated objects, and achieves high accuracy at edges, including soft transitions when appropriate. However, it should be noted that the semantic features in this method are not instance-aware, i.e. the features of two different objects of the same class are similar. This results in multiple objects being represented in the same layer, such as the cows or giraffes in the pictures above.

Image Editing with Semantic Soft Segments

Several use cases of soft segments for targeted editing and compositing are demonstrated below. As you see, the soft segments can also be used to define masks for specific adjustment layers such as adding motion blur to the train in (2), color grading the people and the backgrounds separately in (5, 6) and separate stylization of the hot-air balloon, sky, terrain and the person in (8). While these edits can be done via user-drawn masks or natural matting algorithms, automated defining of the semantically meaningful objects makes the targeted edits effortless for the visual artist.

Use of semantic soft segmentation in image editing tasks

Bottom Line

The proposed approach generates soft segments that correspond to semantically meaningful regions in the image by fusing the high-level information from a neural network with low-level image features, fully automatically. However, the method has several limitations. First of all, it is relatively slow: the runtime for a 640×480 image lies between 3 and 4 minutes. Secondly, the method does not generate separate layers for different instances of the same class of objects. And finally, as demonstrated below, the method may fail at the initial constrained sparsification step when object colors are very similar (top example), or the grouping of soft segments may fail due to unreliable semantic feature vectors around large transition regions (bottom example).

Failure cases

Still, soft segments generated using the presented approach provide a convenient intermediate image representation that makes image editing and compositing tasks, which otherwise require a lot of manual labor, much easier to handle.

FAIR Proposed a New Partially Supervised Training Paradigm to Segment Every Thing

26 June 2018
image segmentation


Object detectors have become significantly more accurate and gained new capabilities. One of the most exciting is the ability to predict a foreground segmentation mask for each detected object, a task called instance segmentation. In practice, typical instance segmentation systems are restricted to a narrow slice of the vast visual world that includes only around 100 object categories. A principal reason for this limitation is that state-of-the-art instance segmentation algorithms require strong supervision and such supervision may be limited and expensive to collect for new categories. By comparison, bounding box annotations are more abundant and cheaper.

FAIR (Facebook AI Research) introduced a new partially supervised instance segmentation task and proposed a novel transfer learning method to address it. The partially supervised instance segmentation task is defined as follows:

  1. Given a set of categories of interest, a small subset has instance mask annotations, while the other categories have only bounding box annotations.
  2. The instance segmentation algorithm should utilize this data to fit a model that can segment instances of all object categories in the set of interest.

Since the training data is a mixture of strongly annotated examples (those with masks) and weakly annotated examples (those with only boxes), the task is referred to as partially supervised. To address it, the authors propose a novel transfer learning approach built on Mask R-CNN. Mask R-CNN is well-suited to this task because it decomposes the instance segmentation problem into the subtasks of bounding box object detection and mask prediction.

Learning to Segment Every Thing

Let C be the set of object categories for which an instance segmentation model is to be trained, and assume C = A ∪ B, where samples from the categories in A have masks while those in B have only bounding boxes. Since the examples of the B categories are weakly labeled with respect to the target task (instance segmentation), training on this combination of strong and weak labels is a partially supervised learning problem. Given an instance segmentation model like Mask R-CNN, which has a bounding box detection component and a mask prediction component, the authors propose Mask^X R-CNN: a method that transfers category-specific information from the model's bounding box detectors to its instance mask predictors.

Mask^X R-CNN method
Detailed illustration of proposed Mask^X R-CNN method.

This method is built on Mask R-CNN, because it is a simple instance segmentation model that also achieves state-of-the-art results. In Mask R-CNN, the last layer in the bounding box branch and the last layer in the mask branch both contain category-specific parameters that are used to perform bounding box classification and instance mask prediction, respectively, for each category. Instead of learning the category-specific bounding box parameters and mask parameters independently, authors propose to predict a category’s mask parameters from its bounding box parameters using a generic, category-agnostic weight transfer function that can be jointly trained as part of the whole model.

For a given category c, let w(det) be the class-specific object detection weights in the last layer of the bounding box head, and w(seg) the class-specific mask weights in the mask branch. Instead of treating w(seg) as model parameters, w(seg) is parameterized using a generic weight prediction function T(·):

w(seg) = T(w(det); θ)

where θ are class-agnostic, learned parameters. The same transfer function T(·) may be applied to any category c, and thus θ should be set such that T generalizes to classes whose masks are not observed during training.

T(·) can be implemented as a small fully connected neural network. The figure above illustrates how the weight transfer function fits into Mask R-CNN to form Mask^X R-CNN. Note that the bounding box head contains two types of detection weights: the RoI classification weights w(cls) and the bounding box regression weights w(box).
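A minimal NumPy sketch of such a transfer function (the hidden size is an illustrative assumption; the paper's best variant passes the concatenated class and box weights through a 2-layer MLP with a LeakyReLU):

```python
import numpy as np

def weight_transfer(w_det, theta1, theta2, slope=0.01):
    """Class-agnostic T(w_det; theta) -> w_seg as a 2-layer MLP.

    w_det is one class's concatenated detection weights (cls + box);
    the output is that class's predicted mask-prediction weights."""
    hidden = w_det @ theta1
    hidden = np.where(hidden > 0, hidden, slope * hidden)  # LeakyReLU
    return hidden @ theta2

# Dimensions from the COCO setup described below: 1024-d w(cls) plus
# 4096-d w(box) in, 256-d w(seg) out (the 512-d hidden is an assumption).
rng = np.random.default_rng(0)
theta1 = rng.normal(size=(1024 + 4096, 512))
theta2 = rng.normal(size=(512, 256))
w_seg = weight_transfer(rng.normal(size=5120), theta1, theta2)
print(w_seg.shape)  # (256,)
```

Because θ is shared across all categories, mask weights can be predicted even for classes that never had mask annotations during training.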

Experiments on COCO

The method is evaluated on the COCO dataset, which is small-scale with respect to the number of categories but contains exhaustive mask annotations for 80 of them. This property enables rigorous quantitative evaluation using standard detection metrics such as average precision (AP). Each class has a 1024-d RoI classification parameter vector w(cls) and a 4096-d bounding box regression parameter vector w(box) in the detection head, and a 256-d segmentation parameter vector w(seg) in the mask head. The output mask resolution is M × M = 28 × 28. The table below compares the full Mask^X R-CNN method (i.e., Mask R-CNN with 'transfer+MLP' and T implemented as 'cls+box, 2-layer, LeakyReLU') and the class-agnostic baseline using end-to-end training.

Experiments on COCO

Mask^X R-CNN outperforms these approaches by a large margin (over 20% relative increase in mask AP).


Mask^X R-CNN approach
Mask predictions from the class-agnostic baseline (top row) vs. Mask^X R-CNN approach (bottom row). Green boxes are classes in set A while the red boxes are classes in set B. The left 2 columns are A = {voc} and the right 2 columns are A = {non-voc}.

This research addresses the problem of large-scale instance segmentation by formulating a partially supervised learning paradigm in which only a subset of classes have instance masks during training while the rest have box annotations. FAIR proposes a novel transfer learning approach in which a learned weight transfer function predicts how each class should be segmented based on the parameters learned for detecting its bounding boxes. Experimental results on the COCO dataset demonstrate that the method significantly improves the generalization of mask prediction to categories without mask training data, and it makes it possible to build a large-scale instance segmentation model covering the 3000 classes in the Visual Genome dataset.

Mask predictions on 3000 classes in Visual Genome.
Mask predictions from Mask^X R-CNN on 3000 classes in Visual Genome.

Muneeb Ul Hassan

Blazingly Fast Video Object Segmentation with Pixel-Wise Metric Learning

12 June 2018

An immense amount of multimedia data is recorded and shared in the current era of the Internet. Among it, video is one of the most common and rich modalities, albeit also one of the most expensive to process. Algorithms for fast and accurate video processing are thus crucially important for real-world applications. Video object segmentation, i.e. classifying the set of pixels of a video sequence into the object(s) of interest and background, is among the tasks that, despite having numerous attractive applications, cannot currently be performed at a satisfactory quality level and an acceptable speed.

The problem is modelled in a simple and intuitive, yet powerful and previously unexplored way: video object segmentation is formulated as pixel-wise retrieval in a learned embedding space. Ideally, in the embedding space, pixels belonging to the same object instance are close together and pixels from other objects are further apart. The model is built by training a Fully Convolutional Network (FCN) as the embedding model, using a modified triplet loss tailored for video object segmentation, where no explicit correspondence between pixels is given.

This formulation has several main advantages. Firstly, the proposed method is highly efficient, since there is no fine-tuning at test time; processing each frame requires only a single forward pass through the embedding network and a nearest-neighbour search. Secondly, the method provides the flexibility to support different types of user input (clicked points, scribbles, segmentation masks, etc.) in a unified framework. Moreover, the embedding process is independent of the user input, so the embedding vectors do not need to be recomputed when the user input changes, which makes this method ideal for the interactive scenario.

Interactive Video Object Segmentation: interactive video object segmentation relies on iterative user interaction to segment the object of interest. Many techniques have been proposed for this task.

Deep Metric Learning: the key idea of deep metric learning is to transform the raw features with a network and then compare the samples directly in the embedding space. Usually, metric learning is performed to learn the similarity between images or patches; methods based on pixel-wise metric learning are rare.

Proposed Architecture

Video object segmentation is formulated as a pixel-wise retrieval problem: for each pixel in the video, we look for the most similar reference pixel in the embedding space and assign the same label to it. The method consists of two steps:

  1. First, embed each pixel into a d-dimensional embedding space using the proposed embedding network.
  2. Second, perform per-pixel retrieval in the embedding space, transferring to each pixel the label of its nearest reference pixel.
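Step 2 is an ordinary nearest-neighbour lookup in the embedding space. A brute-force NumPy sketch of the idea (a real implementation would use an approximate nearest-neighbour index for speed):

```python
import numpy as np

def retrieve_labels(query_emb, ref_emb, ref_labels):
    """Label each query pixel with the label of its nearest reference
    pixel (squared Euclidean distance in the embedding space)."""
    d2 = ((query_emb[:, None, :] - ref_emb[None, :, :]) ** 2).sum(axis=-1)
    return ref_labels[d2.argmin(axis=1)]

refs = np.array([[0.0, 0.0], [10.0, 10.0]])   # two annotated reference pixels
ref_labels = np.array([0, 1])                 # e.g. background / object
queries = np.array([[1.0, 1.0], [9.0, 9.0]])
print(retrieve_labels(queries, refs, ref_labels))  # [0 1]
```

Because only this lookup depends on the user's annotations, changing the input (new clicks or scribbles) does not require re-running the network.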

Embedding Network

User input to fine-tune the model: the first way is to fine-tune the network to the specific object based on the user input. For example, techniques such as OSVOS or MaskTrack fine-tune the network at test time based on the user input. When processing a new video, they require many iterations of training to adapt the model to the specific target object. This approach can be time-consuming (seconds per sequence) and is therefore impractical for real-time applications, especially with a human in the loop.

User input as the network input: another way of injecting user interaction is to use it as an additional input to the network. In this way, no training is performed at test time. A drawback of these methods is that the network output has to be recomputed once the user input changes. This can still take a considerable amount of time, especially for video, given the large number of frames.

In contrast to the two methods above, in the proposed work the user input is disentangled from the network computation, so the forward pass of the network needs to be computed only once. The only computation after the user input is the nearest-neighbour search, which is very fast and enables a rapid response to the user input.

Embedding Model: in the proposed model f, each pixel x_{j,i} is represented as a d-dimensional embedding vector e_{j,i} = f(x_{j,i}). Ideally, pixels belonging to the same object are close to each other in the embedding space, and pixels belonging to different objects are distant from each other. The embedding model is built on DeepLab with a ResNet backbone architecture.

  1. First, the network is pre-trained for semantic segmentation on COCO.
  2. Second, the final classification layer is removed and replaced with a new convolutional layer with d output channels.
  3. Then the network is fine-tuned to learn the embedding for video object segmentation.

The DeepLab architecture serves as the base feature extractor, with two convolutional layers added as the embedding head. The resulting network is fully convolutional, so the embedding vectors of all pixels in a frame can be obtained in a single forward pass. For an image of size h × w pixels the output is a tensor of shape [h/8, w/8, d], where d is the dimension of the embedding space. Since an FCN is used as the embedding model, spatial and temporal information is not preserved, due to the translation-invariant nature of the convolution operation. Formally, the embedding function can be represented as:

Modified Triplet

where i and j refer to the i-th pixel in frame j. A modified triplet loss is used:

Modified Triplet

The proposed method is evaluated on the DAVIS 2016 and DAVIS 2017 datasets, in both the semi-supervised and the interactive scenario. In the semi-supervised Video Object Segmentation (VOS) setting, the fully annotated mask of the first frame is provided as input.

Evaluation results
Evaluation results on DAVIS 2016 validation set


 Pixel-wise feature distribution
Illustration of pixel-wise feature distribution

This work presents a conceptually simple yet highly effective method for video object segmentation. The problem is cast as pixel-wise retrieval in an embedding space learned via a modification of the triplet loss designed specifically for video object segmentation. This way, the annotated pixels in the video (via scribbles, the segmentation of the first frame, clicks, etc.) are the reference samples, and the rest of the pixels are classified via a simple and fast nearest-neighbour approach.

Video object segmentation

Muneeb Ul Hassan