Re-annotation of training samples for robust maritime object detection

Machine learning, and specifically deep learning, addresses many of the issues faced in visual object detection and classification tasks. However, these techniques have the caveat of requiring large amounts of annotated training data. In the maritime domain, objects may be encountered fairly infrequently, depending on weather and location, which makes data collection an issue. Areas such as harbors and channels see a lot of traffic, but the ships there are of specific classes. Moreover, the variability of buoys from region to region, and within regions, is difficult and expensive to sample. Thus, the amount and quality of available data are severely lacking, and very few publicly available maritime datasets exist. In this work, we present a novel approach that detects possibly "poor" training samples and automatically re-annotates them based on the current state of the object detector. We demonstrate the applicability of our approach on real-life maritime data and show that the poor annotation quality of the datasets used can be mitigated. The performance gain over a baseline approach is proportional to the amount of poorly annotated data in the dataset. When 25% of the data is poor, we achieve 5.5%, 13.7%, and 8.0% increases in performance on three separate datasets compared to a baseline model; with 50% noise we reach increases of 58.5%, 18.7%, and 94.2%, respectively. Our approach also allows for the iterative improvement of a given dataset by providing a set of pseudo-annotations to replace the current incorrect ones.


Introduction
In this work, our objective is to improve object detection in maritime environments, as an enabler of more complex tasks in the autonomous maritime industry. General image-based object detection has greatly benefited from advances in deep neural networks, and the same applies to the maritime domain (Becktor, Schöller, Boukas, Blanke, & Nalpantidis, 2020; Schöller, Blanke, Plenge-Feidenhans'l, & Nalpantidis, 2020), as in the example frame in Fig. 1. Acceptable performance in a more applied setting, such as the automotive or maritime domain, requires large amounts of carefully annotated data; such data are commonplace in the automotive domain, but not in the maritime one.
Obtaining such data, representative of the real world, is a very time-consuming task and, to make the situation even worse, manual annotation requires an additional large investment of time. Furthermore, the annotation process is tedious and error-prone. This means that poor samples, that is, missing, incorrect, or imprecise annotations, are to be expected in a given dataset. This work has been performed within the ShippingLab project (ShippingLab, 2019-2022), during which we have collected maritime data around the Danish coast. As stated earlier, a big issue is the need for precise and correct labels. This becomes even more important as we move to finer distinctions between classes. While there are services that produce labels, it is tedious and expensive to validate and verify everything.
With our proposed approach, we can save both time and money while increasing the robustness of our networks.
We build upon the hypothesis that automating the annotation process and improving the resilience of models to erroneous data can significantly improve the practical applicability of deep neural networks in marine applications, as well as in other autonomous robotics applications. To this end, we present an algorithm that automatically re-annotates poorly annotated training samples and thus leads to more robust and reliable object detection results. Our approach is based on the work initially proposed in Ren, Zeng, Yang, and Urtasun (2018) and then used in Becktor, Boukas, Blanke, and Nalpantidis (2021), which allows for the identification of poor training samples. According to that work, it is possible to compute and assign appropriate weights to each sample of the training set with respect to a small but carefully and precisely annotated set, hereafter referred to as the re-weight set $D_r$. However, in this work, we extend the re-weighting (RW) scheme to propose an additional re-annotation (RA) of down-weighted samples, i.e., instead of assigning small or even zero weights, we attempt to correct the problematic annotations. More precisely, the training samples which are problematic in the "eyes" of $D_r$ are re-annotated with an appropriate pseudo-label obtained from the object detection network in its current state. The intuition is that, given enough iterations of the RA process, the effect of poorly annotated training samples will be filtered out. This allows for better generalization and more robust class detection, as outliers and erroneous cases are improved or fixed.
In the maritime domain, there is a lack of publicly available datasets, and works thus rely on self-collected data. Our work introduces a novel semi-supervised learning mechanism on top of the approach presented in Ren et al. (2018). This method allows the network to learn from mislabeled or missing annotations by creating new pseudo-labels for the samples that are deemed "poor", thereby reducing the overall cost of collecting data and getting it thoroughly annotated, by providing an updated dataset for incremental training and improvement. Furthermore, our method allows for training on very noisy datasets given a small subset of properly annotated labels. Finally, we explore the applicability of such an approach in a real-life domain, targeting autonomous maritime vessels.
Our contribution is three-fold: (i) we introduce a novel re-annotation scheme that, given "poor" training samples, creates new pseudo-labels for those samples based on the current state of the network; (ii) we test the introduced mechanisms and show their real-life applicability on data collected from two different maritime environments; and (iii) our method introduces an automatic annotation framework, achieved through the pseudo-labels generated during training, by creating a set of the existing poor data samples with new, re-annotated labels.

Related work
The main idea of this work is to achieve more robust and reliable object detection at sea. We aim to achieve this by re-sampling poor samples and applying pseudo-labels. Previous work has aimed to build reliable systems for the detection of maritime objects, such as Dulski et al. (2011), where the detection of small surface vessels was achieved using the Scale Invariant Feature Transform (SIFT) along with a bag-of-features approach. The work of Stets et al. (2019) explores the use of neural networks in the maritime domain, comparing RGB, long-wave infrared (LWIR), and near-infrared (NIR) data using a ResNet-50 RetinaNet (Lin, Goyal, Girshick, He, & Dollár, 2017). In the field of object detection, we have seen large leaps in performance, such as the current state-of-the-art real-time object detection framework presented in Wang, Bochkovskiy, and Liao (2022). Furthermore, the work in Alotaibi, Omri, Abdel-Khalek, Khalil, and Mansour (2022) creates an ensemble of object detectors and tracking algorithms, resulting in the computational intelligence-based harmony search algorithm for real-time object detection and tracking (CIHSA-RTODT), which achieves competitive performance in the object tracking and surveillance field.
The importance of how training data is sampled has been explored extensively. The work in Kahn and Marshall (1953) introduces importance sampling; by assigning weights to each sample, it is possible to ensure that the class distributions match more closely, highlighting the similarity between class objects. We draw inspiration from this work to claim that samples from the same classes, being similar, can constitute the basis of a re-weight set. Focal loss (Lin et al., 2017) introduces a soft weighting scheme that focuses on harder samples, thereby increasing their influence on the final network. Hard example mining (Malisiewicz, Gupta, & Efros, 2011) down-samples the majority samples and back-propagates on the samples with the highest loss. Similarly, AdaBoost (Freund & Schapire, 1997) implements several weak classifiers whose outputs are combined into a weighted sum that represents the final output of the boosted classifier, thereby reducing variance. These methods show that there is a need for better data sampling, which is one of the focus points of the current work. However, it is not always beneficial to prefer high-loss samples, as they might be outliers, noisy, or incorrect. In the work of Arazo, Ortego, Albert, O'Connor, and Mcguinness (2019), the authors show that convolutional neural networks trained with stochastic gradient methods can fit random labels; by introducing a generative model that produces a "mislabeled" probability for a given sample, they are able to correct for this in the loss.
Class imbalance is a common issue with real-life datasets, and several approaches have been proposed to mitigate it, such as re-sampling the less represented classes, as introduced in Chawla, Bowyer, Hall, and Kegelmeyer (2002). In our previous work (Becktor et al., 2021), we explored the use of the re-weighting scheme, first introduced in Ren et al. (2018), in a maritime setting: by weighting each training sample according to a smaller, clean re-weight set, performance increases over the baseline when noisy data samples are introduced. The need for robustness of neural networks is addressed in Szegedy et al. (2014), where the authors constrain the weight matrices of each layer to be orthonormal, ensuring that the Lipschitz constant of the network is lower than 1; this makes networks more robust against noise and adversarial attacks. In Anil, Lucas, and Grosse (2019), methods for training neural networks under a strict Lipschitz constraint are explored and shown to be useful for provable adversarial robustness. In our previous work (Becktor et al., 2020), we explored the use of Lipschitz-constrained networks in a maritime setting to ensure robustness of predictions. The work of Song, Kim, Park, Shin, and Lee (2021) presents a method to train on noisy datasets: by training on two datasets, the full dataset and a clean dataset, the method learns to separate out the noisy samples and is thereby able to converge. While similar to our own approach, this work excludes the "poor" samples, whereas we create new pseudo-labels for the discarded samples, which allows for better convergence.
The field of semi-supervised learning (Oliver, Odena, Raffel, Cubuk, & Goodfellow, 2018), that is, using unlabeled data along with only a small amount of labeled data, has been thoroughly explored. Some of the relevant methods include generative methods (Odena, 2016; Radford, Metz, & Chintala, 2015; Springenberg, 2015), consistency regularization methods (Belkin & Niyogi, 2001), and pseudo-labeling methods (Blum & Mitchell, 1998; Pham, Dai, Xie, & Le, 2021). In Lee (2013), a network is trained on labeled data for warm-up, followed by pseudo-labeling. Previous work such as Grandvalet et al. (2005) introduced the concept of entropy regularization, which can be harnessed by semi-supervised learning by encouraging predictions of low entropy for unlabeled data and then using the unlabeled data for future training. More recent work, such as Shi, Gong, Ding, Tao, and Zheng (2018), introduces the Transductive Learning Principle, which uses network classifications as hard labels for unlabeled samples; the samples are regularized by an uncertainty weight, depending on the distance to the k nearest neighbors, thus combating outliers and dissimilar samples. The loss function introduced in Tarvainen and Valpola (2017) encourages compactness and separation between classes and can be used as a consistency term for samples with different perturbations. The work explored in Iscen, Tolias, Avrithis, and Chum (2019) extends pseudo-labeling by introducing graph-based label selection; the method switches between training the model using unlabeled/labeled data and building a graph representation of the network to apply the k-nearest neighbor for better pseudo-labels. Recent studies have shown the usefulness of meta-networks, for example for learning under noisy labels (Ren et al., 2018; Shu et al., 2019), discounting outliers, or correcting for class imbalance. Meta-networks can also automate the process of selecting the learning curriculum (Elman, 1993): they choose the order of samples to learn from, instead of a random order, achieving better performance (Fan, Tian, Qin, Li, & Liu, 2018; Jiang, Zhou, Leung, Li, & Fei-Fei, 2018). We aim to fill the gap that exists when the sampled data contain imprecise or poorly labeled annotations. Furthermore, with our approach, we can incrementally improve the labeling of our datasets by automatically re-annotating or flagging poor samples for inspection. This is of major importance when training on large datasets whose quality is infeasible to verify manually.

Method
In this work, we propose an algorithm to enhance the performance, robustness, and convergence of maritime object detection by creating pseudo-labels for "poor" input samples. Considering the work of Ren et al. (2018) as a starting point, we extend it significantly to tackle the problem at hand. Our approach incorporates ideas from semi-supervised learning, such as pseudo-labels, to re-annotate labels that are "poorly" annotated in the eyes of the meta-network with respect to the re-weight set $D_r$. In the following sections, we describe our proposed method.

Application of re-weight scheme
The re-weight scheme, as introduced in Ren et al. (2018), enables better training of networks with poor-quality data. Consider a convolutional neural network model $f$ with parameters $\theta$, $f(\cdot; \theta)$, and the dataset split into a training set $D_t$, a validation set $D_v$, and a re-weight set $D_r$. For each iteration of the model $f$, a copy of the network in its current state (the meta-model) is created. This allows us to find the importance (weighting) of each sample $x_i$ at the current step $t$ with respect to the re-weight set $D_r$. By scaling the sample losses, we can guide the model to ignore cases that are deemed "poor" by the re-weight set. Our work introduces a novel method to learn from the unlabeled information of the discarded samples by re-adding them to a separate set, the re-annotation set $D_a$, in which the "poor" sample labels are replaced with new pseudo-labels from the current state of the network. The algorithm is presented in pseudo-code in Algorithm 1, along with a block diagram in Fig. 2.
In the following paragraphs, we briefly describe the application of the re-weighting scheme on the output of $f(\cdot; \theta_t)$ to obtain a weighting of each sample; down-weighted samples are then included in the newly introduced re-annotation set. Given the model $f(\cdot; \theta_t)$ at time $t$, the setup for each step is a training loss perturbed by per-sample weights,

$\ell_t(\epsilon) = \sum_i \epsilon_{i,t} \, f(x_i; \theta_t), \quad (1)$

where $x_i$ is each sample in the current step and $\epsilon_{i,t}$ is its corresponding weight that must be found. The meta-model $\hat{f}$ is then updated with a step in the direction of the current training samples perturbed by $\epsilon$ (steps 1-2 in Fig. 2 and lines 10-12 in Algorithm 1). After that, the weighting of each sample is found by using the gradients $\nabla_{\epsilon}$ (step 5 in Fig. 2 and lines 19-20 in Algorithm 1) from the backward pass of $\hat{f}(\cdot; \hat{\theta}_{t+1})$ on the re-weight set $D_r$ with respect to $\epsilon$ (steps 3-4 in Fig. 2 and lines 13-18 in Algorithm 1).
The updated parameters $\hat{\theta}_{t+1}(\epsilon)$ of the meta-model, considering stochastic gradient descent given the model $f$, the current parameters $\theta_t$ at step $t$, and gradient step $\alpha$, are described as follows:

$\hat{\theta}_{t+1}(\epsilon) = \theta_t - \alpha \nabla_{\theta} \sum_i \epsilon_{i,t} \, f(x_i; \theta_t). \quad (2)$

The $\epsilon$ that locally minimizes the meta-loss can then be found by setting up the minimization

$\epsilon^{*} = \arg\min_{\epsilon} \frac{1}{M} \sum_{j=1}^{M} f\!\left(x_j^{r}; \hat{\theta}_{t+1}(\epsilon)\right), \quad (3)$

where $M$ is the number of samples in $D_r$. As running this minimization to completion at each training step is not feasible, a single gradient step with regard to the samples in $D_r$ is taken instead:

$u_{i,t} = -\eta \, \frac{\partial}{\partial \epsilon_{i,t}} \frac{1}{M} \sum_{j=1}^{M} f\!\left(x_j^{r}; \hat{\theta}_{t+1}(\epsilon)\right) \bigg|_{\epsilon=0}, \quad (4)$

where $\eta$ is the size of the descent step with respect to $\epsilon$. By finding the gradients of the updated meta-model with respect to $\epsilon$ and rectifying the output, $\tilde{w}_{i,t} = \max(u_{i,t}, 0)$, we get the weights at step $t$ for each sample $i$.
The weights are then normalized to sum to one for a more consistent parameter update,

$w_{i,t} = \frac{\tilde{w}_{i,t}}{\sum_j \tilde{w}_{j,t} + \delta},$

where $\delta$ prevents division by zero when all rectified weights vanish. These weights are applied to the losses of each input-output pair of the model $f(\cdot; \theta_t)$ in its current state, followed by the parameter update of the model.
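To make the above concrete, the following is a minimal PyTorch sketch of one re-weighting step in the spirit of Ren et al. (2018). The function and variable names, the plain SGD inner update, and the use of `torch.func.functional_call` are our choices for illustration, not the authors' implementation.

```python
import torch
import torch.nn.functional as F
from torch.func import functional_call

def per_sample_weights(model, x_tr, y_tr, x_rw, y_rw, inner_lr=1e-3, eta=1.0):
    """One re-weighting step (sketch): returns normalized weights w_{i,t}."""
    params = dict(model.named_parameters())
    buffers = dict(model.named_buffers())
    eps = torch.zeros(x_tr.size(0), requires_grad=True)

    # Eq. (1): perturbed training loss, each per-sample loss scaled by eps_i.
    losses = F.cross_entropy(model(x_tr), y_tr, reduction="none")
    grads = torch.autograd.grad((eps * losses).sum(),
                                list(params.values()), create_graph=True)

    # Eq. (2): virtual SGD step theta_hat_{t+1}(eps) = theta_t - alpha * grad.
    updated = {k: p - inner_lr * g
               for (k, p), g in zip(params.items(), grads)}

    # Eq. (4): meta-loss on the clean re-weight batch, gradient w.r.t. eps.
    rw_logits = functional_call(model, {**updated, **buffers}, (x_rw,))
    rw_loss = F.cross_entropy(rw_logits, y_rw)
    u = -eta * torch.autograd.grad(rw_loss, eps)[0]

    # Rectify and normalize: w_i = max(u_i, 0) / (sum_j max(u_j, 0) + delta).
    w = torch.clamp(u, min=0.0)
    return (w / (w.sum() + 1e-8)).detach()
```

The returned weights can then scale the per-sample losses before the actual parameter update of $f$.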

Re-annotation via pseudo-labels
This section introduces our novel additions to the work of Ren et al. (2018). Given the normalized weights $w_{i,t}$, if a sample has a weight lower than $\tau / |B_t|$, where $|B_t|$ is the batch size and $\tau$ is determined empirically, it is added to the re-annotation set $D_a$. For each pass of the network, the input of the model is checked against the re-annotation set: if a sample $x_i$ in the batch $B_t$ is found in $D_a$, i.e. it was deemed "poor", its label is swapped with the re-annotated pseudo-label,

$y_{i,t} \rightarrow \hat{y}_{i,t}. \quad (5)$
The selected samples are then run through the current state of the model, which provides the new annotations to be used. If the certainty of a prediction $\hat{y}_{i,t}$ is greater than a threshold $\xi$ (Eq. (6)), the prediction label is updated. We define $\xi$ as a function of the current step $t$ and the total number of training steps $T$ (Eq. (6); line 21 in the pseudo-code of Algorithm 1), controlled by a hyperparameter that needs to be tuned. If none of the predicted certainties exceed $\xi$, the sample is removed from the re-annotation set. This schedule ensures that we initially start with very confident pseudo-labels; as the model converges, we relax the constraint to allow more pseudo-labels, and finally we re-tighten the constraint to avoid overfitting and replacing all labels. Furthermore, to ensure that incorrect pseudo-labels do not have too much influence during training, we introduce pseudo-label decay: when $N_d$ steps have passed since a label was annotated, it is removed from the re-annotation set (see line 25 in the pseudo-code of Algorithm 1). This is partly handled by the re-weighting mechanism, but it is included to guarantee usable pseudo-labels for the label update after training. The presented method introduces a novel semi-supervised learning mechanism on top of the approach presented in Ren et al. (2018), and allows for training on very noisy datasets given a small subset of properly annotated labels.
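As an illustration of the bookkeeping this implies, the sketch below maintains a re-annotation set keyed by sample id. The container design, the fixed stand-in for the scheduled threshold $\xi$, and all names are our assumptions; the actual thresholds and schedule are those of Algorithm 1.

```python
import torch

class ReannotationSet:
    """Bookkeeping for the re-annotation set D_a (a sketch, not the
    authors' implementation; names and defaults are assumptions)."""

    def __init__(self, decay_steps=2000):
        self.pseudo = {}            # sample id -> (pseudo-label, birth step)
        self.decay_steps = decay_steps

    def update(self, ids, weights, probs, step, tau=0.1, xi=0.9):
        # xi stands in for the scheduled certainty threshold of Eq. (6).
        poor = weights < tau / weights.numel()  # "poor" in the eyes of D_r
        conf, pred = probs.max(dim=1)
        for i in torch.nonzero(poor).flatten().tolist():
            if conf[i] > xi:                    # confident: (re)assign label
                self.pseudo[ids[i]] = (pred[i].item(), step)
            else:                               # not confident: drop sample
                self.pseudo.pop(ids[i], None)
        # Pseudo-label decay: forget labels older than decay_steps.
        self.pseudo = {k: v for k, v in self.pseudo.items()
                       if step - v[1] < self.decay_steps}

    def swap_labels(self, ids, labels):
        # On each forward pass: y_i -> y_hat_i for samples present in D_a.
        for j, sid in enumerate(ids):
            if sid in self.pseudo:
                labels[j] = self.pseudo[sid][0]
        return labels
```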

Dataset
In this work, we have used the classical MNIST dataset (Deng, 2012) for our initial experiments, and two maritime datasets to show the real-life applicability of our method. The maritime datasets comprise the Funen Archipelago Dataset (FAD), introduced in the previous work of Stets et al. (2019), and a publicly available one, the Singapore Maritime Dataset (SMD) (Prasad, Rajan, Rachmawati, Rajabally, & Quek, 2017). The two datasets are summarized in Table 1. In the next subsections, we describe the datasets in more detail.
Algorithm 1 The re-annotation algorithm, extended from the re-weighting algorithm introduced in Ren et al. (2018).

MNIST
The MNIST dataset consists of 10,000 32 × 32 pixel images containing handwritten digits ranging from 0 to 9. It is a classical dataset used in the field of machine learning to test new algorithms; it is considered solved, with the current highest achieved accuracy being 99.91% (An, Lee, Park, Yang, & So, 2020). This dataset allows us to modify it, knowing that it is well balanced and clean, so that the noise we introduce is the only noise present.

Funen Archipelago Dataset (FAD)
The work of Stets et al. (2019) introduced the dataset that is used as a baseline for the experimental evaluation of our work. It consists of a total of 51,000 images with 31,900 annotations separated into two classes: boats (17,600) and buoys (14,300). Each raw image is a 2560 × 2048 pixel RGB image, subsequently reduced to 1920 × 1080 pixels. Data were acquired onboard ferries and motorboats operating in the Southern Funen Archipelago in Denmark. Images were captured sequentially from midday to dusk, ensuring a broad range of lighting conditions. The dataset yields a training set containing roughly 51,000 images and a validation set with roughly 1000 images. The image in Fig. 4 shows a cropped section of an annotated image from this dataset.

Singapore Maritime Dataset (SMD)
The work of Prasad et al. (2017) introduced the Singapore Maritime Dataset. The dataset consists of several 1080p videos collected using Canon 70D cameras in Singaporean waters. It is separated into two subcategories, on-shore and on-board videos, acquired by a camera placed on shore on a fixed platform and a camera placed on board a moving vessel, respectively. The videos are from various locations and do not necessarily capture the same backgrounds. They cover various environmental conditions, such as sunrise, mid-day, afternoon, evening, and after sunset, including haze and rain, recorded between July 2015 and May 2016.

Data augmentation
Our previous work (Becktor et al., 2021) introduced an augmentation method to normalize the varying label sizes, for more efficient training and to better make use of the re-weighting scheme. In the maritime environment, the horizon line is easily detected (cf. Fig. 3). Using this as an anchor, the image is sampled into multiple cropped versions with a height of 1/3 of the image height in pixels, while traversing the horizon line. Furthermore, the original image is split into two squares with an overlap in the center; these, along with the cropped images, are scaled to 384 × 384 pixels. The cropped images are randomly sampled to ensure that any existing labels are selected at least once; thus, the issue of an overwhelming number of images without annotations is alleviated. Given hundreds of training iterations, it can be assumed that the network has seen all annotations multiple times. A major issue with data collected in the maritime domain is that, due to a general lack of occlusions, we observe a wide variety of sizes among our known labels: a buoy on the horizon and one right next to the vessel create dramatically different annotations. Thus, the augmentation was necessary to reduce the dimensionality of the input and to normalize the size of the annotations. By reducing the pixel count of the image from 1920 × 1080 = 2,073,600 pixels to 384 × 384 = 147,456 pixels, we reduce the input size by a factor of (1920 × 1080)/384² ≈ 14.
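A minimal sketch of this horizon-anchored cropping is given below; the exact crop placement and sampling logic of Becktor et al. (2021) may differ, so the helper name and the random horizontal placement should be treated as assumptions.

```python
import random
from PIL import Image

def horizon_crops(img: Image.Image, horizon_y: int,
                  out_size: int = 384, n_crops: int = 4):
    """Sample square crops of height h/3 along the horizon line, plus
    two overlapping square halves of the full frame (a sketch)."""
    w, h = img.size
    crop_h = h // 3                                  # 1/3 of image height
    top = max(0, min(h - crop_h, horizon_y - crop_h // 2))
    crops = [img.crop((left, top, left + crop_h, top + crop_h))
             for left in (random.randint(0, w - crop_h)
                          for _ in range(n_crops))]
    # Two squares covering the frame, overlapping at the center.
    crops += [img.crop((0, 0, h, h)), img.crop((w - h, 0, w, h))]
    return [c.resize((out_size, out_size)) for c in crops]
```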

Experimental evaluation
In this section, we present the experimental setup and the results of our approach. Our approach is model-agnostic; thus, the selection of backbone should not impact the relative improvement. The performance of our approach is measured in terms of mean Average Precision (mAP) at an intersection-over-union (IoU) threshold of 0.5 (mAP@50). The models are initially trained with the RW and RA schemes toggled off; after 30 epochs, the schemes are enabled. This ensures that the network has time to start learning before it starts to regularize itself.
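For reference, mAP@50 can be computed with an off-the-shelf metric such as `MeanAveragePrecision` from `torchmetrics`; this library choice and the toy boxes below are ours, not part of the original evaluation pipeline.

```python
import torch
from torchmetrics.detection import MeanAveragePrecision

# mAP at a single IoU threshold of 0.5 (mAP@50).
metric = MeanAveragePrecision(iou_thresholds=[0.5])
preds = [dict(boxes=torch.tensor([[50., 40., 120., 90.]]),
              scores=torch.tensor([0.87]),
              labels=torch.tensor([0]))]     # class 0, e.g. "boat" (assumed)
target = [dict(boxes=torch.tensor([[48., 42., 118., 88.]]),
               labels=torch.tensor([0]))]
metric.update(preds, target)
print(metric.compute()["map_50"])            # IoU ≈ 0.87 > 0.5 -> tensor(1.)
```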
Experiments: We initially ran experiments on the MNIST dataset (Deng, 2012) for quicker prototyping in a more controlled and relatively simple domain. We use the LeNet architecture presented in LeCun et al. (1989) and apply different levels of noise to the labels of the dataset: we switch each label to a random one with a probability of 0%, 50%, 75%, or 90%, respectively.
Then, for the maritime datasets, we train three models each for the baseline RetinaNet, RW, and RA, with clean data, 25% noise, and 50% noise. Noise is introduced by applying a label-flipping augmentation: a given annotation has a chance of having its label changed to an incorrect one, e.g., a boat becomes a buoy or a buoy becomes a boat. With this, we can precisely control the percentage of noisy data in our datasets for a more consistent comparison of the schemes.
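The label-flipping corruption itself is straightforward; a sketch under the assumption of string class labels follows (the function name and signature are ours).

```python
import random

def flip_labels(labels, noise=0.25, classes=("boat", "buoy")):
    """With probability `noise`, replace a label with a different class."""
    return [random.choice([c for c in classes if c != y])
            if random.random() < noise else y
            for y in labels]

# Example: corrupt 50% of the annotations in expectation.
noisy = flip_labels(["boat", "buoy", "boat", "boat"], noise=0.5)
```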

MNIST
Model/Training: The models are trained for 12,000 iterations with a batch size of 100, using SGD and a constant learning rate. The models are trained five times for each level of noise on an Nvidia RTX 3090. See Fig. 6 for an example.

Maritime domain
Model/Training: The base model is a RetinaNet (Lin et al., 2017) with a ResNet-50 backbone, trained with the AdamW optimizer (Loshchilov & Hutter, 2019) on an Nvidia A100 GPU with 40 GB of memory. We used a batch size of 16 for 180 epochs with the cosine annealing with warm restarts learning rate scheduler (Loshchilov & Hutter, 2019) and a base learning rate of 5 × 10⁻⁵, for faster convergence, resulting in 3 restarts.
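A sketch of this optimizer/scheduler setup is shown below; the stand-in model and the restart period `T_0` (chosen here so that 180 epochs yield 3 cycles) are our assumptions, not the authors' exact values.

```python
import torch

model = torch.nn.Linear(10, 2)   # stand-in for RetinaNet with ResNet-50
opt = torch.optim.AdamW(model.parameters(), lr=5e-5)
# Cosine annealing with warm restarts; T_0 = 60 epochs -> 3 cycles in 180.
sched = torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(opt, T_0=60)

for epoch in range(180):
    # ... train one epoch with batch size 16 ...
    sched.step()
```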

Results and discussion
Tables 2 and 3 show that re-weighting and re-annotation perform very similarly, yet re-annotation additionally provides improvements to the dataset: it yields a list of poor annotations along with possible annotations to replace them. Both RW and RA perform noticeably better than the baseline RetinaNet implementation when noise is introduced, while the decrease in performance on the noise-free dataset is negligible. Our method does, however, introduce new hyper-parameters that need tuning: the confidence threshold, the weighting threshold, and the selection of the re-weight set.
For MNIST, Fig. 5 clearly shows that as the amount of noise increases, performance degrades, and that this degradation is much faster for a regular network. The performance of re-weighting and re-annotation is on par when the noise is less extreme, but at 90% noise we can see a clear benefit of our method. When using re-annotation, we are provided with a list with which to update our stored annotations. The number of pseudo-labels depends on the certainty threshold $\xi$; in our tests, the number of pseudo-labels increased as the noise level increased. The accuracy of this list can be seen in Fig. 5 as the orange plot and is equivalent to the accuracy of the models on the validation set. Looking at the standard deviations, we can see that re-annotation has the tightest bounds and converges roughly to the same accuracy, which is a sign of robust training. Our method does increase the computation time and the memory overhead, as we run several back-propagation steps and store the new label information in memory. Recent work such as Song et al. (2021) presents a similar method; however, while they continuously sort and clean the dataset, they do not create new pseudo-labels and are thus limited to the information captured by the annotators.
Maritime Setting: Results on the FAD dataset, summarized in Table 3, show a clear difference between the RetinaNet baseline and our approach when noise is introduced. From Fig. 7, specifically Fig. 7(a), we see a slight decrease in performance (−2.6% and −3.6% mAP) when the dataset is clean. However, when the proportion of "poor" noisy data is 25%, we see an increase of +11.5% mAP for re-weighting and +13.7% mAP for re-annotation. Finally, when the proportion of noise is 50%, the increase is +16.7% mAP for re-weighting and +18.7% mAP for re-annotation. The SMD dataset is summarized in Table 3, and the bar charts in Fig. 7(b) show its performance. When the noise level is 25%, we have an increase of 8.4% for re-weighting and 8.7% for re-annotation, while at 50% noise, re-annotation edges out a 5% performance increase over re-weighting, with 89.1% and 94.2%, respectively. In our previous work (Becktor et al., 2021), we hypothesized that a larger amount of "noisy" data should lead to a larger relative performance gain, and our results clearly show this gain as the level of noise is increased. The discrepancies between MNIST and the maritime domain can possibly be attributed to the selection of the baseline network and the difficulty of the task.
To get an overview of the models, we compare them all to the baseline model trained on clean data (see Fig. 7(c)). For models trained on samples containing 25% "noise", the baseline RetinaNet performance decreases by −23.7% and −22.2%, re-weighting by −9.6% and −18.6%, and re-annotation by −7.8% and −17.2%, on the FAD and SMD datasets respectively. With 50% noise, we see an even more dramatic difference: the baseline RetinaNet decreases by −30.5% and −57.2%, re-weighting by −12.2% and −16.7%, and re-annotation by −10.1% and −14.3%, on FAD and SMD respectively.
The baseline models struggle to handle the larger amounts of noisy data and overfit to the introduced noise, in accordance with the observations of Arazo et al. (2019). The re-weighting and re-annotation schemes are able to maintain decent performance with 50% noisy data. In the simpler case, MNIST, performance remained decent even with 90% noisy data, where the improvement from re-weighting to re-annotation is roughly 15% higher.
Our work combines the benefits of the re-weighting scheme with the task of semi-supervised learning; it is self-regularizing and continuously updates the produced pseudo-labels. In semi-supervised learning, one task is to create pseudo-labels (Lee, 2013; Oliver et al., 2018; Shi et al., 2018) from unlabeled data; however, the selection and weighting of such labels are not that well understood. Compared to the previously referenced works, our approach uses the re-weighting mechanism to rank the new labels and ensure that the network is not contaminated with erroneous pseudo-labels.

Conclusion
In this work, we have presented an extension to the re-weighting scheme: the re-annotation scheme. It addresses common issues in the object detection field, such as (i) poor data, (ii) the cost of labeling data, and (iii) the trust we put in our annotated data. Our algorithm allows for the constant improvement of our datasets through online pseudo-labeling and offline incremental dataset improvements. We show that the re-annotation scheme improves performance on datasets with a high proportion of "poor" samples (+10% or more); our labels are artificially corrupted, but the resulting datasets mimic real cases where procuring annotations is very difficult or very expensive. This is exhibited in two domains: the relatively simple case of the MNIST handwritten digit database, and the maritime domain on two datasets (FAD, SMD) with the task of detecting boats and buoys. Furthermore, we explored the accuracy of the predicted pseudo-labels and show that the obtained results are on par with those of non-corrupted datasets. As the network continuously improves, so do the pseudo-labels. The achieved improvement is shown to follow the proportion of noise in the data, ranging from 5.5%, 13.7%, and 8.0% increases in performance on our three presented datasets when the noise is 25%, to improvements of 58.5%, 18.7%, and 94.2%, respectively, when the noise is 50%.
A limitation of our framework is the initial selection of the re-weight set. There will always be an inherent bias when manually selecting it, and this bias will propagate to the network. With an incremental approach, this can be improved, as samples that are correct but have very low confidence can be added to the re-weight set on a successive run. Furthermore, we see that when the dataset is properly annotated, our method incurs a slight decrease in performance. Using our approach, the training time also increases quite dramatically. Future work could automate the incremental part and slowly introduce new samples to the re-weight set online, perhaps by prompting an end user.
Our objective was to produce a novel training scheme that not only picks the best samples to learn from but also incrementally updates the dataset, providing possible alternatives to poor or missing labels. Secondly, we wanted to prove this in a real-life setting and show its applicability, which is demonstrated in our results section. Finally, we wanted a framework to continuously update our datasets so that they become more precise and better for future use. We have shown the feasibility of this, and we are convinced that it would be an interesting topic to continue working on, possibly by introducing synthetic data or data from other domains.

Declaration of competing interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Fig. 2 .
Fig. 2. Computation graph of the proposed re-annotation algorithm, extended from the re-weighting scheme (Ren et al., 2018). Here, $x_i$ is a training sample with corresponding label $y_i$; if this sample exists in the re-annotation set $D_a$, the corresponding label $y_i$ is replaced with the pseudo-label $\hat{y}_i$.

Fig. 4 .
Fig. 4. Example image (cropped for better visibility) of annotated data used in this paper.

Fig. 5 .
Fig. 5. Training on MNIST with 90% noisy data. On the left, we can see that the network has a hard time converging, as it is being guided with respect to the re-weight set. On the right, we see the accuracy of the model and the accuracy of the selected pseudo-labels.

Fig. 6 .
Fig. 6. MNIST dataset: comparison of each run of the baseline, RW, and RA models, and comparison of the mean accuracy for the different amounts of noise after training. By applying the re-weighting and re-annotation schemes, performance increases as a function of the amount of noise introduced.

Fig. 7 .
Fig. 7. Comparison of the mean average precision (mAP) performance of the models trained on FAD/SMD with differing levels of noise, and comparison of the mAP performance of the models against the baseline model trained without introduced noise. Baseline (BL), re-weight (RW), re-annotation (RA).

Table 1
Summary of the two considered maritime datasets.

Table 2
Accuracy comparison of MNIST for various percentages of noisy annotations.