[Paper] Exploring the Potential of UPerNet-based Multi-task Learning


Abstract

In this study, we investigate the potential of UPerNet-based multi-task learning. UPerNet is a model that can predict five tasks (scene, object, part, material, and texture) from a single image input. We conducted experiments to explore the interrelations between these tasks and incorporated the Ultra-Lightweight Subspace Attention Module (ULSAM) into the model using different configurations to optimize task performance. Although the results of this study are similar to those of the original UPerNet model, our findings provide valuable insights into multi-task learning. Specifically, we highlight the benefits of selectively utilizing feature maps and exploiting the interrelations between heads. These experiments pave the way for future research aimed at enhancing performance in multi-task learning schemes.

Introduction

[Fig. 1. Overall architecture of the base UPerNet model.]

Training multiple tasks simultaneously on a single image is challenging because each task requires a different set of features. UPerNet [1] is a model that can predict five different tasks (scene, object, part, material, and texture) from a single image. The model is composed of the following components: 1) a backbone network (ResNet-50) that extracts features from the input image; 2) a Pyramid Pooling Module (PPM) [3] that generates feature maps at various scales to gather diverse contextual information about objects; 3) a multi-scale feature fusion module that effectively combines feature maps from the backbone network and the PPM at different scales to extract high-level semantic information; and 4) five heads that independently perform predictions for each task. The overall architecture of the base model is illustrated in Fig. 1, and a minimal sketch of the pipeline is given below. Xiao et al. [1] also presented an analysis of the relationships between the heads.
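To make the four components concrete, the following is a minimal PyTorch sketch of the pipeline, assuming a backbone that returns its four stage outputs (C2 to C5). The channel sizes, the placeholder class counts, fusing only down to the P2 level, and showing only the segmentation heads (in the full model the scene head attaches elsewhere, e.g. to the PPM output) are our illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PPM(nn.Module):
    """Pyramid Pooling Module: pool the deepest map at several bin sizes
    and fuse the upsampled pools back with the original map."""
    def __init__(self, in_ch, out_ch, bins=(1, 2, 3, 6)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Sequential(nn.AdaptiveAvgPool2d(b),
                          nn.Conv2d(in_ch, out_ch, 1, bias=False),
                          nn.BatchNorm2d(out_ch),
                          nn.ReLU(inplace=True))
            for b in bins)
        self.fuse = nn.Conv2d(in_ch + len(bins) * out_ch, out_ch, 3, padding=1)

    def forward(self, x):
        size = x.shape[-2:]
        pooled = [F.interpolate(b(x), size, mode="bilinear",
                                align_corners=False) for b in self.branches]
        return self.fuse(torch.cat([x] + pooled, dim=1))

class UPerNetSketch(nn.Module):
    def __init__(self, backbone, chans=(256, 512, 1024, 2048), fpn_ch=256,
                 num_classes=None):
        super().__init__()
        # placeholder class counts; the real datasets define these
        num_classes = num_classes or {"object": 151, "part": 86, "material": 26}
        self.backbone = backbone              # 1) e.g. ResNet-50, returns C2..C5
        self.ppm = PPM(chans[-1], fpn_ch)     # 2) global context on C5
        # 3) FPN-style top-down fusion: lateral 1x1 convs + upsample-and-add
        self.lateral = nn.ModuleList(nn.Conv2d(c, fpn_ch, 1) for c in chans[:-1])
        # 4) one simple head per task (only segmentation heads shown here)
        self.heads = nn.ModuleDict(
            {name: nn.Conv2d(fpn_ch, n, 1) for name, n in num_classes.items()})

    def forward(self, x):
        c2, c3, c4, c5 = self.backbone(x)
        feats = [self.ppm(c5)]                # P5
        for lat, c in zip(reversed(self.lateral), (c4, c3, c2)):
            up = F.interpolate(feats[-1], c.shape[-2:], mode="bilinear",
                               align_corners=False)
            feats.append(lat(c) + up)         # P4, P3, P2
        fused = feats[-1]                     # highest-resolution fused map
        return {name: head(fused) for name, head in self.heads.items()}
```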

In multi-task learning, each head performs its prediction independently. However, the prediction of one head may positively influence the prediction of another head, so considering the correlation between their predictions is a crucial factor for performance improvement. We conducted experiments on multi-task learning with the aim of investigating the correlation between heads and examining the effects of attention-based head interaction. In this paper, we present two experimental approaches: a context-aware head and an attention mechanism. For the context-aware head, we focused on two of the five heads to verify the correlation between heads.

We selected the object and material heads, which use different datasets, and conducted experiments with these two heads. For the attention approach, we applied ULSAM (Ultra-Lightweight Subspace Attention Module), an attention module that preserves multi-scale information, to predict all heads from a single integrated feature. Our goal is to enhance the quality of the features used by each head by incorporating relevant information from other heads. We attempted to improve performance by using this attention-based approach to integrate the predictions of multiple heads over one feature map; however, it did not contribute to performance improvement. The reason could be poor compatibility between ULSAM and UPerNet, or ULSAM's lack of consideration for cross-subspace dependencies; we discuss this in detail in Section 2-C. On the other hand, we conducted an experiment in which we fused the feature of the object head with the input of the material head. We observed that while the performance of the material head slightly increased, the performance of the object head tended to decrease. This suggests that there is an interdependent relationship between the heads: the material and object heads extract different features of the image while preserving its overall consistency. This relationship can serve as evidence supporting our hypothesis.

Experiments and Analysis

A. Experimental Environment

We implemented the UPerNet model as a baseline and conducted experiments on two NVIDIA GeForce RTX 3090 GPUs with a maximum batch size of 4. The learning rate was set to 0.02 for both the encoder and the decoder, and the Stochastic Gradient Descent (SGD) optimizer was used with a momentum of 0.9 and a weight decay of 0.0001. The NLLLoss function was used. Pixel Accuracy (P.A.) and mean Intersection over Union (mIoU) were used to measure the performance of the semantic segmentation tasks, while Top-1 Accuracy was used to measure the performance of the scene classification task. This setup is summarized in the sketch below.
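A hedged sketch of the configuration above in PyTorch. The `model.encoder` / `model.decoder` attribute names, the `ignore_index` value, and the metric helper are our assumptions for illustration; the hyperparameter values come from the text.

```python
import torch

# assumed attribute names: the model is taken to expose encoder/decoder parts
optimizer = torch.optim.SGD(
    [{"params": model.encoder.parameters()},
     {"params": model.decoder.parameters()}],
    lr=0.02, momentum=0.9, weight_decay=1e-4)
criterion = torch.nn.NLLLoss()  # expects log-probabilities (log_softmax outputs)

def seg_metrics(pred, target, num_classes, ignore_index=-1):
    """Pixel accuracy and mIoU; pred/target are (N, H, W) label maps."""
    valid = target != ignore_index
    pa = (pred[valid] == target[valid]).float().mean()
    ious = []
    for c in range(num_classes):
        inter = ((pred == c) & (target == c) & valid).sum().float()
        union = (((pred == c) | (target == c)) & valid).sum().float()
        if union > 0:
            ious.append(inter / union)
    miou = torch.stack(ious).mean() if ious else torch.zeros(())
    return pa, miou
```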

B. Context-aware Head

[Table 1. Quantitative results (P.A. / mIoU) of the baseline and the modified models.]

Prior to conducting the experiments, we simplified UPerNet by keeping only the material and object heads, which allowed us to directly examine the interrelation between the two. Each head consists of a convolutional layer with a 3x3 kernel, followed by batch normalization (BN), ReLU activation, and a classifier. We merged the object feature map from the penultimate layer with the input of the material head to incorporate object information into the material predictions, as sketched below.
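A minimal sketch of this Model B head fusion, assuming fusion by channel concatenation; the exact fusion operator and channel sizes in our implementation may differ from these illustrative values.

```python
import torch
import torch.nn as nn

class ContextAwareHeads(nn.Module):
    """Object head plus a material head that also sees the object features."""
    def __init__(self, in_ch=256, mid_ch=256, n_obj=151, n_mat=26):
        super().__init__()
        def block(cin):  # 3x3 conv -> BN -> ReLU, as described in the text
            return nn.Sequential(
                nn.Conv2d(cin, mid_ch, 3, padding=1, bias=False),
                nn.BatchNorm2d(mid_ch), nn.ReLU(inplace=True))
        self.obj_block = block(in_ch)
        self.obj_cls = nn.Conv2d(mid_ch, n_obj, 1)
        # the material head's input is widened by the object feature map
        self.mat_block = block(in_ch + mid_ch)
        self.mat_cls = nn.Conv2d(mid_ch, n_mat, 1)

    def forward(self, feat):
        obj_feat = self.obj_block(feat)          # penultimate object features
        obj_out = self.obj_cls(obj_feat)
        mat_in = torch.cat([feat, obj_feat], dim=1)
        mat_out = self.mat_cls(self.mat_block(mat_in))
        return obj_out, mat_out
```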

As shown in Table 1 (Model B), the pixel accuracy (P.A.) and mean intersection over union (mIoU) for materials were 81.8% and 50.1%, respectively; for objects, they were 68.7% and 16.7%. The material head thus improved by utilizing object information, while the object head declined. This is because the two tasks have different objectives and levels of abstraction: for material prediction, different objects are assigned the same label if they share the same material, whereas for object prediction, objects must be distinguished by shape or category regardless of their material. Using material information for object prediction can therefore be misleading or confusing. However, this does not mean that the two heads are completely independent. Rather, they have an interdependent relationship that maintains the overall consistency of the image features. This relationship supports our hypothesis that the prediction results of one head can help improve those of another, especially for tasks that are highly related.

C. Attention Mechanism

[Fig. 2. Modified architectures of Models C.(1)-C.(3).]

In our preliminary experiments, we observed that when one head uses information from another head, the performance of the consuming head improves while that of the providing head deteriorates. The head structure of UPerNet consists of a simple 3x3 convolution and a classifier, which limits its ability to handle multiple tasks effectively.

To perform multi-tasking with a single integrated feature map, we experimented with an attention mechanism. We adopted ULSAM (Ultra-Lightweight Subspace Attention Module) [5], which employs subspace channel attention: it can represent multi-scale and multi-frequency features with very few parameters and enables fast learning. We applied ULSAM to UPerNet in an attempt to predict all heads with a single integrated feature map while preserving multi-scale information. The modified architectures used in our experiments are shown in Fig. 2, and a sketch of the attention module follows.
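A hedged sketch of subspace channel attention in the spirit of ULSAM [5]: channels are split into g subspaces, each subspace produces a single softmax-normalized spatial attention map, and the attended features are added back residually. The kernel choices below follow our reading of [5] and should be checked against the original code.

```python
import torch
import torch.nn as nn

class SubspaceAttention(nn.Module):
    def __init__(self, channels, groups=4):
        super().__init__()
        assert channels % groups == 0
        self.groups, g_ch = groups, channels // groups
        # per-subspace depthwise 1x1 conv, shared 3x3 max pool, pointwise conv
        self.dw = nn.ModuleList(
            nn.Conv2d(g_ch, g_ch, 1, groups=g_ch) for _ in range(groups))
        self.pool = nn.MaxPool2d(3, stride=1, padding=1)
        self.pw = nn.ModuleList(nn.Conv2d(g_ch, 1, 1) for _ in range(groups))

    def forward(self, x):
        outs = []
        for i, f in enumerate(torch.chunk(x, self.groups, dim=1)):
            a = self.pw[i](self.pool(self.dw[i](f)))          # (N, 1, H, W)
            a = torch.softmax(a.flatten(2), dim=-1).view_as(a)  # spatial softmax
            outs.append(f + f * a)                             # residual attention
        return torch.cat(outs, dim=1)
```

Note that each subspace computes its attention map independently of the others; this is precisely the missing cross-subspace interaction we return to at the end of this section.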

In the first experiment (Model C.(1)), we resized the feature maps obtained from the Feature Pyramid Network (FPN) to a uniform size and combined them into a fused feature map. Attention was applied to this fused map, and the resulting attention map was fed into three heads to predict the outputs. Using ULSAM in this way preserves the multi-scale information that might otherwise be lost during the convolution of the fused map. Additionally, it maximizes the multi-scale and multi-frequency representations for predicting objects, parts, and materials in parallel within the subspaces into which the channels of the fused map are divided.
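A hedged sketch of the Model C.(1) forward pass, using the `SubspaceAttention` sketch above; fusion by concatenation plus a 1x1 convolution (`fuse_conv`) is our assumption about the exact implementation.

```python
import torch
import torch.nn.functional as F

def fused_attention_forward(fpn_maps, fuse_conv, attention, heads):
    """fpn_maps: [P2, P3, P4, P5]; heads: dict of task name -> head module."""
    size = fpn_maps[0].shape[-2:]                 # resize everything to P2
    resized = [F.interpolate(m, size, mode="bilinear", align_corners=False)
               for m in fpn_maps]
    fused = fuse_conv(torch.cat(resized, dim=1))  # single integrated map
    attended = attention(fused)                   # ULSAM-style attention
    return {task: head(attended) for task, head in heads.items()}
```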

In the second experiment (Model C.(2)), we used individual FPN maps to help the heads concentrate on their respective tasks. The P5 feature map of the FPN, which passes through the Pyramid Pooling Module (PPM), is designed to capture global information; local information from the backbone is incorporated as it progresses through P4, P3, and P2, producing feature maps with rich context. We combined P2 and P5 to generate Map 1, which has comprehensive context, and P4 and P5 to create Map 2, which emphasizes global context. We then applied attention to each generated map (Map 1, Map 2) and concatenated the results before passing them to the heads.
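A hedged sketch of Model C.(2): Map 1 (P2 + P5, comprehensive context) and Map 2 (P4 + P5, global emphasis) are attended separately and then concatenated. Pairwise fusion by resize-and-concatenate is our assumption.

```python
import torch
import torch.nn.functional as F

def paired_attention_forward(p2, p4, p5, attn1, attn2, heads):
    def pair(a, b):  # resize b to a's resolution and concatenate channels
        b = F.interpolate(b, a.shape[-2:], mode="bilinear", align_corners=False)
        return torch.cat([a, b], dim=1)
    map1 = attn1(pair(p2, p5))   # Map 1: local detail + global context
    map2 = attn2(pair(p4, p5))   # Map 2: coarser, globally weighted context
    map2 = F.interpolate(map2, map1.shape[-2:], mode="bilinear",
                         align_corners=False)
    combined = torch.cat([map1, map2], dim=1)
    return {task: head(combined) for task, head in heads.items()}
```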

In the third experiment (Model C.(3)), we created Maps 1, 2, and 3 by concatenating adjacent FPN feature maps to generate features suited to the tasks. We applied attention to each of these maps and then combined all the attention maps, so as to include all contexts, before passing them to the heads.
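A hedged sketch of Model C.(3), fusing adjacent FPN levels (P2+P3, P3+P4, P4+P5) into Maps 1-3; the pairing and the final merge by concatenation are our assumptions.

```python
import torch
import torch.nn.functional as F

def adjacent_attention_forward(fpn_maps, attns, heads):
    """fpn_maps: [P2, P3, P4, P5]; attns: one attention module per adjacent pair."""
    size = fpn_maps[0].shape[-2:]
    maps = []
    for (a, b), attn in zip(zip(fpn_maps[:-1], fpn_maps[1:]), attns):
        b = F.interpolate(b, a.shape[-2:], mode="bilinear", align_corners=False)
        m = attn(torch.cat([a, b], dim=1))
        maps.append(F.interpolate(m, size, mode="bilinear", align_corners=False))
    combined = torch.cat(maps, dim=1)  # all contexts together
    return {task: head(combined) for task, head in heads.items()}
```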

As shown in Table 1, all three configurations showed a decline in performance; in particular, applying attention to all feature maps caused a significant drop. This suggests that there was too much redundant information for a meaningful attention map to be obtained. Both the first and second experiments also declined, but comparing them reveals that applying attention selectively is better for performance than applying it to all FPN maps. One possible reason for these results is the incompatibility between the ULSAM module and UPerNet: ULSAM and UPerNet's FPN both learn multi-scale features, and combining the two mechanisms may extract redundant features that degrade performance. Furthermore, the subspaces in ULSAM learn independent attention maps that only consider inter-channel relationships; they do not handle cross-subspace dependencies. This limitation may hinder their ability to effectively model the spatial context necessary for segmentation tasks.


Conclusion

In this paper, we investigated the inter-task correlations of the multi-task model UPerNet in order to improve its performance. We also conducted an experiment in which all heads were predicted from an integrated feature by applying the attention module ULSAM. In summary, applying attention maps did not enhance performance compared with UPerNet; however, we experimentally found that there is an interaction between the heads. Moreover, in multi-task schemes, it is more effective to selectively utilize the feature maps that allow focused attention for each task than to use all of them. ULSAM has few parameters and may be computationally efficient, but it may not be compatible with the UPerNet model. For future work, we plan to explore other attention modules that can generate new information from the feature maps and apply them to UPerNet. An important aspect of our future work is to investigate attention mechanisms suited to multi-task learning and to explore how to design and evaluate them effectively. We expect this to enhance both the performance and the interpretability of multi-task learning.