[Paper] Material Recognition in the Wild with the Materials in Context Database


Paper link: Material Recognition in the Wild with the Materials in Context Database
Sean Bell, Paul Upchurch, Noah Snavely, Kavita Bala; Computer Vision and Pattern Recognition (CVPR), 2015

Abstract

Figure 1. Overview. (a) We construct a new dataset by combining OpenSurfaces [1] with a novel three-stage Amazon Mechanical Turk (AMT) pipeline. (b) We train various CNNs on patches from MINC to predict material labels. (c) We transfer the weights to a fully convolutional CNN to efficiently generate a probability map across the image; we then use a fully connected CRF to predict the material at every pixel.

Recognizing materials in real-world images is a challenging task. Real-world materials have rich surface texture, geometry, lighting conditions, and clutter, which combine to make the problem particularly difficult. In this paper, we introduce a new, large-scale, open dataset of materials in the wild, the Materials in Context Database (MINC), and combine this dataset with deep learning to achieve material recognition and segmentation of images in the wild. MINC is an order of magnitude larger than previous material databases, while being more diverse and well-sampled across its 23 categories. Using MINC, we train convolutional neural networks (CNNs) for two tasks: classifying materials from patches, and simultaneous material recognition and segmentation in full images. For patch-based classification on MINC we found that the best performing CNN architectures can achieve 85.2% mean class accuracy. We convert these trained CNN classifiers into an efficient fully convolutional framework combined with a fully connected conditional random field (CRF) to predict the material at every pixel in an image, achieving 73.1% mean class accuracy. Our experiments demonstrate that having a large, well-sampled dataset such as MINC is crucial for real-world material recognition and segmentation.

1. Introduction

Material recognition plays a critical role in our understanding of and interactions with the world. To tell whether a surface is easy to walk on, or what kind of grip to use to pick up an object, we must recognize the materials that make up our surroundings. Automatic material recognition can be useful in a variety of applications, including robotics, product search, and image editing for interior design. But recognizing materials in real-world images is very challenging. Many categories of materials, such as fabric or wood, are visually very rich and span a diverse range of appearances. Materials can further vary in appearance due to lighting and shape. Some categories, such as plastic and ceramic, are often smooth and featureless, requiring reasoning about subtle cues or context to differentiate between them.

Large-scale datasets (e.g., ImageNet [21], SUN [31, 19] and Places [34]) combined with convolutional neural networks (CNNs) have been key to recent breakthroughs in object recognition and scene classification. Material recognition is similarly poised for advancement through large-scale data and learning. To date, progress in material recognition has been facilitated by moderate-sized datasets like the Flickr Material Database (FMD) [26]. FMD contains ten material categories, each with 100 samples drawn from Flickr photos. These images were carefully selected to illustrate a wide range of appearances for these categories. FMD has been used in research on new features and learning methods for material perception and recognition [17, 10, 20, 25]. While FMD was an important step towards material recognition, it is not sufficient for classifying materials in real-world imagery. This is due to the relatively small set of categories, the relatively small number of images per category, and also because the dataset has been designed around hand-picked iconic images of materials. The OpenSurfaces dataset [1] addresses some of these problems by introducing 105,000 material segmentations from real-world images, and is significantly larger than FMD. However, in OpenSurfaces many material categories are under-sampled, with only tens of images.

A major contribution of our paper is a new, well-sampled material dataset, called the Materials in Context Database (MINC), with 3 million material samples. MINC is more diverse, has more examples in less common categories, and is much larger than existing datasets. MINC draws data from both Flickr images, which include many “regular” scenes, as well as Houzz images from professional photographers of staged interiors. These sources of images each have different characteristics that together increase the range of materials that can be recognized. See Figure 2 for examples of our data. We make our full dataset available online at https://minc.cs.cornell.edu/

We use this data for material recognition by training different CNN architectures on this new dataset. We perform experiments that illustrate the effect of network architecture, image context, and training data size on subregions (i.e., patches) of a full scene image. Further, we build on our patch classification results and demonstrate simultaneous material recognition and segmentation of an image by performing dense classification over the image with a fully connected conditional random field (CRF) model [12]. By replacing the fully connected layers of the CNN with convolutional layers [24], the computational burden is significantly lower than a naive sliding window approach.

In summary, we make two new contributions:

  • We introduce a new material dataset, MINC, and a 3-stage crowdsourcing pipeline for efficiently collecting millions of click labels (Section 3.2).
  • Our new semantic segmentation method combines a fully-connected CRF with unary predictions based on CNN learned features (Section 4.2) for simultaneous material recognition and segmentation.

2. Prior Work

Material Databases.

Much of the early work on material recognition focused on classifying specific instances of textures or material samples. For instance, the CUReT [4] database contains 61 material samples, each captured under 205 different lighting and viewing conditions. This led to research on the task of instance-level texture or material classification [15, 30], and an appreciation of the challenges of building features that are invariant to pose and illumination. Later, databases with more diverse examples from each material category began to appear, such as KTH-TIPS [9, 2], and led to explorations of how to generalize from one example of a material to another—from one sample of wood to a completely different sample, for instance. Real-world texture attributes have also recently been explored [3].

In the domain of categorical material databases, Sharan et al. released FMD [26] (described above). Subsequently, Bell et al. released OpenSurfaces [1], which contains over 20,000 real-world scenes labeled with both materials and objects, using a multi-stage crowdsourcing pipeline. Because OpenSurfaces images are drawn from consumer photos on Flickr, material samples have real-world context, in contrast to prior databases (CUReT, KTH-TIPS, FMD), which feature cropped stand-alone samples. While OpenSurfaces is a good starting point for a material database, we substantially expand it with millions of new labels.

Material recognition.

Much prior work on material recognition has focused on the classification problem (categorizing an image patch into a set of material categories), often using hand-designed image features. For FMD, Liu et al. [17] introduced reflectance-based edge features in conjunction with other general image features. Hu et al. [10] proposed features based on variances of oriented gradients. Qi et al. [20] introduced a pairwise local binary pattern (LBP) feature. Li et al. [16] synthesized a dataset based on KTH-TIPS2 and built a classifier from LBP and dense SIFT. Timofte et al. [29] proposed a classification framework with minimal parameter optimization. Schwartz and Nishino [23] introduced material traits that incorporate learned convolutional auto-encoder features. Recently, Cimpoi et al. [3] developed a CNN and improved Fisher vector (IFV) classifier that achieves state-of-the-art results on FMD and KTH-TIPS2. Finally, it has been shown that jointly predicting objects and materials can improve performance [10, 33].

Convolutional neural networks.

While CNNs have been around for a few decades, with early successes such as LeNet [14], they have only recently led to state-of-the-art results in object classification and detection, spurring enormous progress. Driven by the ILSVRC challenge [21], we have seen many successful CNN architectures [32, 24, 28, 27], led by the work of Krizhevsky et al. on their SuperVision (a.k.a. AlexNet) network [13], with more recent architectures including GoogLeNet [28]. In addition to image classification, CNNs are the state of the art for detection and localization of objects, with recent work including R-CNNs [7], OverFeat [24], and VGG [27]. Finally, relevant to our goal of per-pixel material segmentation, Farabet et al. [6] use a multi-scale CNN to predict the class at every pixel in a segmentation. Oquab et al. [18] employ a sliding window approach to localize patch classification of objects. We build on this body of work in deep learning to solve our problem of material recognition and segmentation.

3. The Materials in Context Database (MINC)

Figure 2. Example patches from all 23 categories of the Materials in Context Database (MINC). Note that we sample patches so that the patch center is the material in question (and not necessarily the entire patch). See Table 1 for the size of each category.

We now describe the methodology that went into building our new material database. Why a new database? We needed a dataset with the following properties:

  • Size: It should be sufficiently large that learning methods can generalize beyond the training set.
  • Well-sampled: Rare categories should be represented with a large number of examples.
  • Diversity: Images should span a wide range of appearances of each material in real-world settings.
  • Number of categories: It should contain many different materials found in the real world.

3.1. Sources of data

We decided to start with the public, crowdsourced OpenSurfaces dataset [1] as the seed for MINC since it is drawn from Flickr imagery of everyday, real-world scenes with reasonable diversity. Furthermore, it has a large number of categories and the most samples of all prior databases.

While OpenSurfaces data is a good start, it has a few limitations. Many categories in OpenSurfaces are not well sampled. While the largest category, wood, has nearly 20K samples, smaller categories, such as water, have only tens of examples. This imbalance is due to the way the OpenSurfaces dataset was annotated; workers on Amazon Mechanical Turk (AMT) were free to choose any material subregion to segment. Workers often gravitated towards certain common types of materials or salient objects, rather than being encouraged to label a diverse set of materials. Further, the images come from a single source (Flickr). We decided to augment OpenSurfaces with substantially more data, especially for underrepresented material categories, with the initial goal of gathering at least 10K samples per material category. We decided to gather this data from another source of imagery, professional photos on the interior design website Houzz (houzz.com). Our motivation for using this different source of data was that, despite Houzz photos being more “staged” (relative to Flickr photos), they actually represent a larger variety of materials. For instance, Houzz photos contain a wide range of types of polished stone. With these sources of image data, we now describe how we gather material annotations.

3.2. Segments, Clicks, and Patches

Figure 3. AMT pipeline schematic for collecting clicks. (a) Workers filter by images that contain a certain material, (b) workers click on materials, and (c) workers validate click locations by re-labeling each point. Example responses are shown in orange.

What specific kinds of material annotations make for a good database? How should we collect these annotations? The type of annotations to collect is guided in large part by the tasks we wish to generate training data for. For some tasks such as scene recognition, whole-image labels can suffice [31, 34]. For object detection, labeled bounding boxes as in PASCAL are often used [5]. For segmentation or scene parsing tasks, per-pixel segmentations are required [22, 8]. Each style of annotation comes with a cost proportional to its complexity. For materials, we decided to focus on two problems, guided by prior work:

  • Patch material classification. Given an image patch, what kind of material is it at the center?
  • Full scene material classification. Given a full image, produce a full per-pixel segmentation and labeling. Also known as semantic segmentation or scene parsing (but in our case, focused on materials). Note that classification can be a component of segmentation, e.g., with sliding window approaches.

Segments.

OpenSurfaces contains material segmentations—carefully drawn polygons that enclose same-material regions. To form the basis of MINC, we selected OpenSurfaces segments with high confidence (inter-worker agreement) and manually curated segments with low confidence, giving a total of 72K shapes. To better balance the categories, we manually segmented a few hundred extra samples for sky, foliage, and water. Since some of the OpenSurfaces categories are difficult for humans to distinguish, we consolidated them. We found that many AMT workers could not disambiguate stone from concrete, clear plastic from opaque plastic, and granite from marble. Therefore, we merged these into stone, plastic, and polished stone respectively. Without this merging, many ground truth examples in these categories would be incorrect. The final list of 23 categories is shown in Table 1. The category other is different in that it was created by combining various smaller categories.
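As an illustration only (not the authors' code), this consolidation can be thought of as a simple label-remapping table; the exact source label strings below are assumptions.

```python
# Illustrative sketch of the category merging described above.
# The raw label strings are assumptions; only the merge targets come from the text.
CATEGORY_MERGES = {
    "concrete": "stone",          # stone vs. concrete merged into stone
    "clear plastic": "plastic",   # clear vs. opaque plastic merged into plastic
    "opaque plastic": "plastic",
    "granite": "polished stone",  # granite vs. marble merged into polished stone
    "marble": "polished stone",
}

def canonical_label(raw_label: str) -> str:
    """Map a raw worker label to its merged MINC category (unchanged if not merged)."""
    return CATEGORY_MERGES.get(raw_label, raw_label)
```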

Table 1. MINC patch counts by category. Patches were created from both OpenSurfaces segments and our newly collected clicks. See Section 3.2 for details.

Clicks.

Since we want to expand our dataset to millions of samples, we decided to augment OpenSurfaces segments by collecting clicks: single points in an image along with a material label, which are much cheaper and faster to collect. Figure 3 shows our pipeline for collecting clicks.

Initially, we tried asking workers to click on examples of a given material in a photo. However, we found that workers would get frustrated if the material was absent in too many of the photos. Thus, we added an initial filtering stage where workers screen out such photos. To increase the accuracy of our labels, we verify the click labels by asking different workers to specify the material for each click, without providing them with the label from the previous stage. To ensure that we obtain high quality annotations and avoid collecting labels from workers who are not making an effort, we include secret known answers (sentinels) in the first and third stages, and block workers whose accuracy falls below 50% and 85% respectively. We do not use sentinels in the second stage since it would require per-pixel ground truth labels, and it turned out not to be necessary. Workers generally performed all three tasks, so we could identify bad workers in the first or third task.
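A minimal sketch of how such sentinel checks might be enforced; the 50% and 85% thresholds come from the text, while the data structures and function names are hypothetical.

```python
# Hypothetical sketch of the sentinel check described above (not the authors' code).
SENTINEL_THRESHOLDS = {"filter": 0.50, "validate": 0.85}  # stage 1 and stage 3

def sentinel_accuracy(responses: dict, sentinels: dict) -> float:
    """Fraction of secret known-answer items a worker answered correctly."""
    if not sentinels:
        return 1.0
    hits = sum(1 for item_id, answer in sentinels.items()
               if responses.get(item_id) == answer)
    return hits / len(sentinels)

def should_block(responses: dict, sentinels: dict, stage: str) -> bool:
    """Block the worker if their sentinel accuracy falls below the stage threshold."""
    return sentinel_accuracy(responses, sentinels) < SENTINEL_THRESHOLDS[stage]
```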

Material clicks were collected for both OpenSurfaces images and the new Houzz images. This allowed us to use labels from OpenSurfaces to generate the sentinel data; we included 4 sentinels per task. With this streamlined pipeline we collected 2,341,473 annotations at an average cost of $0.00306 per annotation (stage 1: $0.02 / 40 images, stage 2: $0.10 / 50 images, stage 3: $0.10 / 50 points).

Patches.

Labeled segments and clicks form the core of MINC. For training CNNs and other types of classifiers, it is useful to have data in the form of fixed-sized patches. We convert both forms of data into a unified dataset format: square image patches. We use a patch center and patch scale (a multiplier of the smaller image dimension) to define the image subregion that makes a patch. For our patch classification experiments, we use a patch scale of 23.3% of the smaller image dimension. Increasing the patch scale provides more context but reduces the spatial resolution. Later in Section 5 we justify our choice with experiments that vary the patch scale for AlexNet.
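The following sketch shows one way to turn a (center, patch scale) pair into a square crop, using the 23.3% default mentioned above; the boundary handling (clamping to the image) is an assumption, not necessarily what the authors did.

```python
import numpy as np

def extract_patch(image: np.ndarray, center_xy, patch_scale: float = 0.233) -> np.ndarray:
    """Crop a square patch whose side is patch_scale times the smaller image dimension."""
    h, w = image.shape[:2]
    side = int(round(patch_scale * min(h, w)))
    cx, cy = center_xy
    x0 = int(round(cx - side / 2))
    y0 = int(round(cy - side / 2))
    # Clamp to image bounds (one simple choice; the original pipeline may pad instead).
    x0 = max(0, min(x0, w - side))
    y0 = max(0, min(y0, h - side))
    return image[y0:y0 + side, x0:x0 + side]
```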

We place a patch centered around each click label. For each segment, if we were to place a patch at every interior pixel then we would have a very large and redundant dataset. Therefore, we Poisson-disk subsample each segment, separating patch centers by at least 9.1% of the smaller image dimension. These segments generated 655,201 patches (an average of 9.05 patches per segment). In total, we generated 2,996,674 labeled patches from 436,749 images. Patch counts are shown in Table 1, and example patches from various categories are illustrated in Figure 2.
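A simple dart-throwing sketch of the Poisson-disk subsampling step, assuming the 9.1% minimum spacing from the text; the greedy rejection strategy is an illustrative choice rather than the authors' exact sampler.

```python
import random

def poisson_disk_subsample(interior_pixels, image_min_dim: int, spacing_frac: float = 0.091):
    """Greedily keep points that are at least spacing_frac * image_min_dim apart."""
    min_dist2 = (spacing_frac * image_min_dim) ** 2
    candidates = list(interior_pixels)
    random.shuffle(candidates)
    kept = []
    for (x, y) in candidates:
        if all((x - kx) ** 2 + (y - ky) ** 2 >= min_dist2 for (kx, ky) in kept):
            kept.append((x, y))
    return kept
```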

4. Material recognition in real-world images

Our goal is to train a system that recognizes the material at every pixel in an image. We split our training procedure into multiple stages and analyze the performance of the network at each stage. First, we train a CNN that produces a single prediction for a given input patch. Then, we convert the CNN into a sliding window and predict materials on a dense grid across the image. We do this at multiple scales and average to obtain a unary term. Finally, a dense CRF [12] combines the unary term with fully connected pairwise reasoning to output per-pixel material predictions. The entire system is depicted in Figure 1, and described more below.

Figure 4. Pipeline for full scene material classification. An image (a) is resized to multiple scales [1/√2, 1, √2]. The same sliding CNN predicts a probability map (b) across the image for each scale; the results are upsampled and averaged. A fully connected CRF predicts a final label for each pixel (c). This example shows predictions from a single GoogLeNet converted into a sliding CNN (no average pooling).

4.1. Training procedure

MINC contains 3 million patches that we split into training, validation and test sets. Randomly splitting would result in nearly identical patches (e.g., from the same OpenSurfaces segment) being put in training and test, thus inflating the test score. To prevent correlation, we group photos into clusters of near-duplicates, then assign each cluster to one of train, validate or test. We make sure that there are at least 75 segments of each category in the test set to ensure there are enough segments to evaluate segmentation accuracy. To detect near-duplicates, we compare AlexNet CNN features computed from each photo (see the supplemental for details). For exact duplicates, we discard all but one of the copies.
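A rough sketch of a cluster-then-split procedure of this kind, assuming cosine similarity over per-photo CNN features; the similarity threshold and split fractions below are placeholders, not values reported in the paper.

```python
import numpy as np

def cluster_near_duplicates(features: np.ndarray, threshold: float = 0.95) -> np.ndarray:
    """Greedy clustering of photos by cosine similarity of their CNN features."""
    feats = features / np.linalg.norm(features, axis=1, keepdims=True)
    cluster_ids = -np.ones(len(feats), dtype=int)
    centers = []                              # one representative feature per cluster
    for i, f in enumerate(feats):
        if centers:
            sims = np.stack(centers) @ f
            j = int(np.argmax(sims))
            if sims[j] >= threshold:          # near-duplicate of an existing cluster
                cluster_ids[i] = j
                continue
        cluster_ids[i] = len(centers)         # start a new cluster
        centers.append(f)
    return cluster_ids

def split_by_cluster(cluster_ids, fractions=(0.85, 0.05, 0.10), seed=0):
    """Assign each cluster (not each photo) to train, validation, or test."""
    rng = np.random.default_rng(seed)
    clusters = np.unique(cluster_ids)
    rng.shuffle(clusters)
    n = len(clusters)
    cut1, cut2 = int(fractions[0] * n), int((fractions[0] + fractions[1]) * n)
    assignment = {c: "train" for c in clusters[:cut1]}
    assignment.update({c: "val" for c in clusters[cut1:cut2]})
    assignment.update({c: "test" for c in clusters[cut2:]})
    return [assignment[c] for c in cluster_ids]
```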

We train all of our CNNs by fine-tuning the network starting from the weights obtained by training on 1.2 million images from ImageNet (ILSVRC2012). When training AlexNet, we use stochastic gradient descent with batch size 128, dropout rate 0.5, momentum 0.9, and a base learning rate of $10^{-3}$ that decreases by a factor of 0.25 every 50,000 iterations. For GoogLeNet, we use batch size 69, dropout 0.4, and learning rate $\alpha_t = 10^{-4}\sqrt{1 - t/250000}$ for iteration $t$.
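For concreteness, the two learning-rate schedules can be written as small helper functions; the schedules follow the text, while the function names are ours.

```python
import math

def alexnet_lr(iteration: int, base_lr: float = 1e-3, gamma: float = 0.25, step: int = 50_000) -> float:
    """Step decay: multiply the base rate by 0.25 every 50,000 iterations."""
    return base_lr * (gamma ** (iteration // step))

def googlenet_lr(iteration: int, base_lr: float = 1e-4, max_iter: int = 250_000) -> float:
    """Polynomial-style decay: alpha_t = 1e-4 * sqrt(1 - t / 250000)."""
    return base_lr * math.sqrt(max(0.0, 1.0 - iteration / max_iter))
```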

Our training set has a different number of examples per class, so we cycle through the classes and randomly sample an example from each class. Failing to properly balance the examples results in a 5.7% drop in mean class accuracy (on the validation set). Further, since it has been shown to reduce overfitting, we randomly augment samples by taking crops (227 × 227 out of 256 × 256), horizontal mirror flips, spatial scales in the range [1/√2, √2], aspect ratios from 3:4 to 4:3, and amplitude shifts in [0.95, 1.05]. Since we are looking at local regions, we subtract a per-channel mean (R: 124, G: 117, B: 104) rather than a mean image [13].
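A minimal sketch of class-balanced sampling and of drawing the augmentation parameters listed above; the parameter values mirror the text, while the generator structure is an assumption.

```python
import random

def balanced_patch_stream(patches_by_class: dict):
    """Yield (patch, label) pairs one class at a time, so every class is seen equally often."""
    classes = sorted(patches_by_class)
    while True:
        for label in classes:
            yield random.choice(patches_by_class[label]), label

def sample_augmentation() -> dict:
    """Randomly draw the augmentation parameters described above."""
    return {
        "flip": random.random() < 0.5,                 # horizontal mirror flip
        "scale": random.uniform(2 ** -0.5, 2 ** 0.5),  # spatial scale in [1/sqrt(2), sqrt(2)]
        "aspect_ratio": random.uniform(3 / 4, 4 / 3),  # aspect ratio from 3:4 to 4:3
        "amplitude": random.uniform(0.95, 1.05),       # amplitude shift
        "crop": 227,                                   # 227 x 227 crop out of 256 x 256
    }
```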

4.2. Full Scene material classification

Figure 4 shows an overview of our method for simultaneously segmenting and recognizing materials. Given a CNN that can classify individual points in the image, we convert it to a sliding window detector and densely classify a grid across the image. Specifically, we replace the last fully connected layers with convolutional layers, so that the network is fully convolutional and can classify images of any shape. After conversion, the weights are fixed and not fine-tuned. With our converted network, the strides of each layer cause the network to output a prediction every 32 pixels. We obtain predictions every 16 pixels by shifting the input image by half-strides (16 pixels). While this appears to require 4x the computation, Sermanet et al. [24] showed that the convolutions can be reused and only the pool5 through fc8 layers need to be recomputed for the half-stride shifts. Adding half-strides resulted in a minor 0.2% improvement in mean class accuracy across segments (after applying the dense CRF, described below), and about the same mean class accuracy at click locations.
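A hedged sketch (in PyTorch rather than the original Caffe setup) of copying fully connected weights into equivalent convolutions so the classifier can slide over larger inputs; the AlexNet-style layer sizes (a 6 × 6 × 256 pool5 map, 4096-d fc6/fc7, 23-way fc8) are assumptions used for illustration.

```python
import torch.nn as nn

def fc_to_conv(fc: nn.Linear, in_channels: int, kernel_size: int) -> nn.Conv2d:
    """Copy fc weights into a convolution with the same receptive field and outputs."""
    conv = nn.Conv2d(in_channels, fc.out_features, kernel_size)
    conv.weight.data.copy_(
        fc.weight.data.view(fc.out_features, in_channels, kernel_size, kernel_size))
    conv.bias.data.copy_(fc.bias.data)
    return conv

def convert_head(fc6: nn.Linear, fc7: nn.Linear, fc8: nn.Linear) -> nn.Sequential:
    """AlexNet-style head (fc6: 256*6*6 -> 4096, fc7: 4096 -> 4096, fc8: 4096 -> 23)
    becomes a 6x6 conv followed by two 1x1 convs; applied to a larger input it emits
    one 23-way prediction per output location."""
    return nn.Sequential(
        fc_to_conv(fc6, in_channels=256, kernel_size=6), nn.ReLU(inplace=True),
        fc_to_conv(fc7, in_channels=4096, kernel_size=1), nn.ReLU(inplace=True),
        fc_to_conv(fc8, in_channels=4096, kernel_size=1),
    )
```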

The input image is resized so that a patch maps to a 256 × 256 square. Thus, for a network trained at patch scale s, the resized input has smaller dimension d = 256/s. Note that d is inversely proportional to scale, so increased context leads to lower spatial resolution. We then add padding so that the output probability map is aligned with the input when upsampled. We repeat this at 3 different scales (smaller dimension d/√2, d, d√2), upsample each output probability map with bilinear interpolation, and average the predictions. To make the next step more efficient, we upsample the output to a fixed smaller dimension of 550.
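A sketch of the multi-scale averaging described above; the resize rule d = 256/s, the three scale factors, and the 550-pixel output dimension follow the text, while run_sliding_cnn is a stand-in for the converted network and is assumed to return a (num_classes, H', W') probability map.

```python
import numpy as np
from PIL import Image

def upsample_probs(probs: np.ndarray, out_hw) -> np.ndarray:
    """Bilinearly upsample a (C, h, w) probability map to (C, out_h, out_w)."""
    out_h, out_w = out_hw
    return np.stack([
        np.asarray(Image.fromarray(p).resize((out_w, out_h), Image.BILINEAR))
        for p in probs.astype(np.float32)
    ])

def multi_scale_probability_map(image: Image.Image, run_sliding_cnn, patch_scale: float = 0.233):
    d = 256.0 / patch_scale                     # smaller input dimension so a patch maps to 256x256
    w, h = image.size
    out_ratio = 550.0 / min(w, h)               # upsample outputs to a smaller dimension of 550
    out_hw = (int(round(h * out_ratio)), int(round(w * out_ratio)))
    maps = []
    for factor in (2 ** -0.5, 1.0, 2 ** 0.5):   # scales d/sqrt(2), d, d*sqrt(2)
        ratio = (d * factor) / min(w, h)
        resized = image.resize((int(round(w * ratio)), int(round(h * ratio))), Image.BILINEAR)
        probs = run_sliding_cnn(resized)        # assumed (num_classes, H', W') probability map
        maps.append(upsample_probs(probs, out_hw))
    return np.mean(maps, axis=0)                # averaged unary input for the dense CRF
```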

We then use the dense CRF of Krähenbühl et al. [12] to predict a label at every pixel, using the following energy:

$$E(\mathbf{x} \mid \mathbf{I}) = \sum_i \psi_i(x_i) + \sum_{i<j} \psi_{ij}(x_i, x_j), \qquad \psi_{ij}(x_i, x_j) = w_p \, \delta(x_i \neq x_j) \, k(\mathbf{f}_i - \mathbf{f}_j)$$

where $\psi_i$ is the unary energy (negative log of the aggregated softmax probabilities) and $\psi_{ij}$ is the pairwise term that connects every pair of pixels in the image. We use a single pairwise term with a Potts label compatibility term $\delta$ weighted by $w_p$ and a unit Gaussian kernel $k$. For the features $\mathbf{f}_i$, we convert the RGB image to L*a*b* and use color $(I^L_i, I^a_i, I^b_i)$ and position $(p^x_i, p^y_i)$ as pairwise features for each pixel:

$$\mathbf{f}_i = \left[ \frac{p^x_i}{\theta_p d}, \; \frac{p^y_i}{\theta_p d}, \; \frac{I^L_i}{\theta_L}, \; \frac{I^a_i}{\theta_{ab}}, \; \frac{I^b_i}{\theta_{ab}} \right],$$

where $d$ is the smaller image dimension. Figure 4 shows an example unary term $p_i$ and the resulting segmentation $\mathbf{x}$.
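To make the feature construction concrete, here is a sketch that builds the per-pixel feature vector f_i in NumPy; the structure follows the formula above, but the θ values are placeholders rather than the paper's tuned parameters.

```python
import numpy as np
from skimage import color

def pairwise_features(rgb: np.ndarray, theta_p=0.1, theta_L=10.0, theta_ab=5.0) -> np.ndarray:
    """Build one 5-D pairwise feature per pixel: scaled position plus scaled L*a*b* color."""
    h, w, _ = rgb.shape
    d = min(h, w)                                        # smaller image dimension
    lab = color.rgb2lab(rgb.astype(np.float64) / 255.0)  # rgb assumed uint8 in [0, 255]
    ys, xs = np.mgrid[0:h, 0:w]
    feats = np.stack([
        xs / (theta_p * d), ys / (theta_p * d),          # position features
        lab[..., 0] / theta_L,                           # lightness
        lab[..., 1] / theta_ab, lab[..., 2] / theta_ab,  # chroma
    ], axis=-1)
    return feats.reshape(-1, 5)
```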

5. Experiments and Results

In progress . . .




6. Conclusion

Material recognition is a long-standing, challenging problem. We introduce a new large, open, material database, MINC, that includes a diverse range of materials of everyday scenes and staged designed interiors, and is at least an order of magnitude larger than prior databases. Using this large database we conduct an evaluation of recent deep learning algorithms for simultaneous material classification and segmentation, and achieve results that surpass prior attempts at material recognition.

Some lessons we have learned are:

  • Training on a dataset which includes the surrounding context is crucial for real-world material classification.
  • Labeled clicks are cheap and sufficient to train a CNN alone. However, to obtain high quality segmentation results, training a CRF on polygons results in much better boundaries than training on clicks.

Many future avenues of work remain. Expanding the dataset to a broader range of categories will require new ways to mine images that have more variety, and new annotation tasks that are cost-effective. Inspired by attributes for textures [3], in the future we would like to identify material attributes and expand our database to include them. We also believe that further exploration of joint material and object classification and segmentation will be fruitful [10] and lead to improvements in both tasks. Our database, trained models, and all experimental results are available online at http://minc.cs.cornell.edu/.