Erasing Integrated Learning: A Simple yet Effective Approach for Weakly Supervised Object Localization

Formally, we denote the training image set as $I=\{(I_i,y_i)\}^N_{i=1}$, where $y_i \in \{1,\dots,C\}$ is the label of the image $I_i$, $C$ is the total number of classes and $N$ is the number of images. Let $\theta$, lowercase $f$, and uppercase $F$ denote network parameters, functions, and feature maps, respectively.
The part of the network before EIL is applied, $f^1$ with parameters $\theta^1$, produces the original unerased feature map, denoted as $F^{u}=f^1(I_i,\theta^1)$ with $F^{u}\in\mathbb{R}^{K\times W\times H}$, where $K$ stands for the number of channels, and $W$ and $H$ for the width and height, respectively. We make use of $F^{u}$ as self-attention to generate the erasing mask. Specifically, we compress $F^{u}$ into an average map $M_{avg}$ through channel-wise average pooling. Then we apply a hard threshold $\gamma$ on $M_{avg}$ to produce the erasing mask $M_{e}$, in which the spatial locations of those pixels whose intensity is no smaller than $\gamma$ are set to zero. We perform the erasing operation by spatial-wise multiplication between the unerased feature map $F^{u}$ and the mask $M_{e}$, producing the erased feature map $F^{e}=F^{u}\otimes M_{e}$.
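For concreteness, the erasing step can be sketched in a few lines of PyTorch-style code; the function name \texttt{erase\_features} and the $(B, K, H, W)$ tensor layout are illustrative assumptions rather than part of our reference implementation.
\begin{verbatim}
import torch

def erase_features(F_u: torch.Tensor, gamma: float):
    """Sketch of the erasing step (illustrative, not official code).

    F_u:   unerased feature map of shape (B, K, H, W)
    gamma: hard threshold applied to the channel-wise average map
    """
    # Channel-wise average pooling -> average map of shape (B, 1, H, W)
    M_avg = F_u.mean(dim=1, keepdim=True)
    # Locations whose intensity reaches gamma are zeroed, the rest kept
    M_e = (M_avg < gamma).float()
    # Spatial-wise multiplication; the mask broadcasts over all channels
    F_e = F_u * M_e
    return F_e, M_e
\end{verbatim}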
Afterwards, both the unerased feature map $F^{u}$ and its erased counterpart $F^{e}$ are fed into the latter part of the network. As these two data streams are processed by the same function $f^2$ and parameters $\theta^2$, this structure can be regarded as a dual-branch network with shared weights. More specifically, $f^2$ produces class activation maps (CAM) \cite{zhou2016learning}, applies global average pooling \cite{lin2013network} on the CAM and utilizes a fully connected layer followed by a softmax operation to obtain the prediction score for each branch, with $p^{e}$ and $p^{u}$ for the erased and the unerased branch, respectively. In the end, the classification losses from the two branches are added up to form the total loss $\mathcal{L}$. Note that we also introduce a loss weighting hyperparameter $\sigma$ to control the relative importance between the unerased loss $\mathcal{L}_{u}$ and the erased loss $\mathcal{L}_{e}$, i.e., $\mathcal{L}=\mathcal{L}_{u}+\sigma\mathcal{L}_{e}$.
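A minimal sketch of this shared-weight head and the combined loss is given below; the module name \texttt{EILHead} and the helper \texttt{eil\_loss} are hypothetical, and the fully connected classifier is folded into an equivalent $1\times1$ convolution so that its outputs can be read directly as CAMs.
\begin{verbatim}
import torch
import torch.nn as nn
import torch.nn.functional as F

class EILHead(nn.Module):
    """One plausible instantiation of f^2 (illustrative only)."""

    def __init__(self, in_channels: int, num_classes: int):
        super().__init__()
        # 1x1 conv whose C output channels serve as class activation maps
        self.classifier = nn.Conv2d(in_channels, num_classes, 1)

    def forward(self, feat):
        cam = self.classifier(feat)       # (B, C, H, W) activation maps
        scores = cam.mean(dim=(2, 3))     # global average pooling
        return F.softmax(scores, dim=1), cam

def eil_loss(p_u, p_e, labels, sigma):
    """Total loss L = L_u + sigma * L_e over the two branches."""
    loss_u = F.nll_loss(torch.log(p_u.clamp_min(1e-8)), labels)
    loss_e = F.nll_loss(torch.log(p_e.clamp_min(1e-8)), labels)
    return loss_u + sigma * loss_e
\end{verbatim}
Because both branches call the same module, gradients from the erased and unerased streams update a single set of parameters $\theta^2$.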
\begin{algorithm}
\SetAlgoLined
\KwInput{Training set $I=\{(I_i,y_i)\}^N_{i=1}$ from $C$ classes, erasing threshold $\gamma$, loss weighting hyperparameter $\sigma$}
\While{training has not converged}{
Calculate the feature map $F^{u}\leftarrow f^1(I_i,\theta^1)$ \;
Calculate the average map $M_{avg}=\frac{1}{K}\sum^K_{k=1}F^{u}_k$\;
Calculate the erasing mask $M_{e_{i,j}}=\begin{cases}
0, &\text{if } M_{avg_{i,j}}\ge \gamma\\
1, &\text{otherwise}
\end{cases}$\;
Get the erased feature map $F^{e}=F^{u}\otimes M_{e}$ \;
Calculate the prediction of $F^{e}$: $p^{e}\leftarrow f^2(F^{e},\theta^2)$ \;
Calculate the prediction of $F^{u}$: $p^{u}\leftarrow f^2(F^{u},\theta^2)$ \;
Obtain the erased loss: $\mathcal{L}_{e}=-\frac{1}{C}\sum_c y_{i,c} \log(p^{e}_c)$ \;
Obtain the unerased loss: $\mathcal{L}_{u}=-\frac{1}{C}\sum_c y_{i,c} \log(p^{u}_c)$\;
Calculate the total loss: $\mathcal{L}=\mathcal{L}_{u}+\sigma \mathcal{L}_{e}$ \;
Back-propagate and update parameters $\theta^1$, $\theta^2$ \;
}
\caption{Training algorithm for EIL}
\label{al:1}
\end{algorithm}
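Putting the two sketches above together, one training iteration of Algorithm~\ref{al:1} could be organized as follows; the \texttt{backbone}/\texttt{head} split, the optimizer, and the default values of $\gamma$ and $\sigma$ are placeholders rather than the settings used in our experiments.
\begin{verbatim}
def train_step(backbone, head, optimizer, images, labels,
               gamma=0.6, sigma=1.0):
    """One EIL iteration, reusing erase_features and EILHead above."""
    F_u = backbone(images)                   # F^u = f^1(I, theta^1)
    F_e, _ = erase_features(F_u, gamma)      # F^e = F^u (x) M_e
    p_u, _ = head(F_u)                       # unerased branch
    p_e, _ = head(F_e)                       # erased branch, shared weights
    loss = eil_loss(p_u, p_e, labels, sigma) # L = L_u + sigma * L_e
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                         # update theta^1 and theta^2
    return loss.item()
\end{verbatim}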