Why aren't these data sets editable instead of static? Treat them like a collaborative wiki or something (OpenStreetMap being the closest fit) and allow everyone to submit improvements so that all may benefit.
I hope the people in this article had a way to contribute back their improvements, and did so.
The datasets serve as benchmarks. You get an idea for a new model that solves a problem current models have. Most of these ideas don't pan out, so you need empirical evidence that yours actually works. To show that your model does better than previous models, you need some task that your model and previous models can share for both training and evaluation. It's more complicated than that, but that's the gist.
It would be so wasteful to have to retrain a dozen models that require a month of GPU time each just to serve as baselines for your new model...
It also potentially gives every paper N replication problems to solve, in addition to the GPU time. I would have to figure out HOW to retrain all of these models on the current form of the dataset... which is fine for an occasional explicit replication study, but terrible if everyone has to do it.
I think it's probably better to have a (say) yearly release of the dataset, with results of some benchmark models released alongside the new version.
This is similar to how Common Voice is handling the problem: it's a crowdsourced, constantly growing dataset, which is awesome if you want to train on as much data as possible for production models. You can get the whole current version any time, but they also have releases with a static fileset and train/test split, which should be better for research.
Is it wasteful to throw away a batch of food when 20% of it has been found to contain the wrong substance, one that ends up causing disease?
Isn't it even more wasteful to continue using unedited and unverified data sets just because all the previous models were trained on them, to the point that we can no longer advance the state of the research? It's a case of garbage in, garbage out.
The thing is, the value as a baseline doesn't actually change that much for being 20% garbage. A bit counter-intuitive, but basically accepted as true in several fields.
The comparisons are all relative accuracy, not absolute accuracy. And the comparison is fair. The new technique is receiving the same part-garbage input that the old-techniques were trained on. For the most part, the better technique will still tend to do better unless there's specifically something about it that makes it more sensitive to labeling errors.
And frankly, a percentage of junk has some advantages. Real-world data is a pile of ass, so it's useful for academic models to require robustness.
>By one estimate, the training time for AlphaGo cost $35 million [0]
How about XLNet, which cost something like $30k-60k to train [1]? GPT-2 is estimated to be around the same [2], while thankfully BERT only costs about $7k [3], unless of course you're going to do any new hyperparameter tuning on their models, which you of course will do on your own model. Who cares about apples-to-apples comparisons?
We're not talking about spending an extra couple hours and a little money on updated replication. We're talking about an immediate overhead of tens to hundreds of thousands of dollars per new paper.
Tasks are updated over time already to take issues into account, but not continuously as far as I know.
Yeah, it is by no means wasteful for AlphaGo to throw away all its training data and then re-train itself!
That kind of ruthless experimentation is how AlphaGo was able to exceed even itself. The willingness to say - all these human games we've fed the computer? All these terabytes of data? It's all meaningless! We're going to throw it all away! We will have AlphaGo determine what is good by playing games against itself!
And I bet you that for the next iteration of AlphaGo, the creators of this system will again delete their own data and retrain when they have a better approach.
If you don't "waste" your existing datasets (once you realize the flaws in your data sets), you are being held back by the sunk cost fallacy. You only have yourself to blame when someone does train for the exact same purposes, but with cleaner data.
The person who has the cleanest source of training data will win in deep learning.
You're sabotaging yourself in my opinion. $30k is nothing when you're just sabotaging the training with faulty data.
As an investor, $35m to train just about the pinnacle of AI seems like a cheap, oh so cheap cost. I can't even buy one freaking continental jet at that price, and there are thousands of these babies flying (not as we speak, but generally).
I don't think you are fully cognizant yet of the formidable scale of AI in the grander scheme of things, as an industry, which is nowadays comparable to transistors circa 1972 in terms of maturity. Long, long ways to go before we sit on "reference" anything. Whether architectures, protocols, models, or test standards, it's the Wild West as we speak.
You make excellent points in principle, which are important to keep in mind in guiding us all along the way, but now is not the time to set things in stone. More like the opposite.
The fact of the matter is that someone will eventually grab the old and new benchmarks, prove superiority on both, and by that point the new one is the benchmark to beat, since it would presumably be error-free this time.
The dataset is a controlled variable in an experiment, so it has to be held constant. If you update both your model and the dataset for every trial (e.g. new hyperparameters or a new architecture) and find it performs better, you won't know whether the model is really better or the dataset just got easier.
This is accurate, but it's also worth noting the AI community has been moving to new benchmarks all the time (e.g. SQuAD 2.0 came out about a year after SQuAD). So in effect editing does happen all the time, just in a batch way instead of a continuous, wiki-type way. This blog post deals with "VOC 2012, COCO 2017, Self Driving Car Udacity", which seem like pretty old datasets no longer really in use. There were already news stories about the self-driving car dataset, so the knowledge that it has issues is not even new. Not to say this is not really useful, but it would be nice to note...
What you're saying is that it's worth it to lie because it's too expensive to give a truthful answer. That is something that your customers likely would not agree with.
- Hosting static data is dramatically easier than making a public editing interface
- You want reference versions of the dataset for papers to refer to so that results are comparable. Sometimes this is used as a justification for not fixing completely broken data, like with Fasttext.
- Building on the previous point, large datasets like this don't play nice with Git. There are lots of "git for data" things but none of them are very mature, and most people don't spend time trying to figure something out.
One major use of the public datasets in the academic community is to serve as a common reference when comparing new techniques against the existing standard. A static baseline is desirable for this task.
You could maybe split the difference by having an "original" or "reference" version, and a separate moving target that incorporates crowdsourced improvements.
This sounds like a revisioning system would help a lot. Have a quarterly or annual release cycle or something, so that when you want to compare performance across techniques, you just train both of them against the same revision (and ideally all the papers coming out at roughly the same time would already be using the same revision anyway).
You'd always work with a versioned release when training models, and you'd only typically work with HEAD when you were specifically looking to correct flaws in the data (as the authors in the linked article are).
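To make that concrete, here's a minimal sketch of what pinning to a reference release could look like. The releases/ layout, manifest format, and load_release helper are all made up for illustration:

```python
import hashlib
import json
from pathlib import Path

def load_release(root: str, version: str) -> list[dict]:
    """Load a pinned dataset release and verify file checksums.

    Assumes a hypothetical layout: <root>/releases/<version>/manifest.json
    listing each annotation file and its expected SHA-256 hash.
    """
    release_dir = Path(root) / "releases" / version
    manifest = json.loads((release_dir / "manifest.json").read_text())

    samples = []
    for entry in manifest["files"]:
        path = release_dir / entry["path"]
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        if digest != entry["sha256"]:
            raise ValueError(f"{path} does not match release {version}")
        samples.append(json.loads(path.read_text()))
    return samples

# Papers would cite a fixed tag; only dataset-cleanup work tracks HEAD.
train = load_release("voc_clean", version="2020.1")
```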
One problem with correcting the benchmark datasets is that it's important for the algorithms to be robust to labelling errors as well. But having multiple versions sounds important anyways.
In general these things are open source, so you can always contribute an improved version of the dataset. But as another commenter said having relatively static ones is also important for benchmarking purposes.
I'm a Product Manager at Deepomatic and I have been leading the study in question here. To detect the errors, we trained a model (with a different neural network architecture than the 6 listed in the post), and we then have a matching algorithm that highlights all bounding boxes that were either annotated but not predicted (False Negative), or predicted but not annotated (False Positive). Those potential errors are also sorted based on an error score to get first the most obvious errors. Happy to answer any other question you may have!
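For readers curious what such a matching step can look like, here is a rough sketch. This is not Deepomatic's actual code; the IoU threshold, box format, and field names are assumptions:

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def flag_potential_errors(annotations, predictions, iou_thr=0.5):
    """Split boxes into annotated-but-not-predicted (potential false negatives)
    and predicted-but-not-annotated (potential false positives)."""
    matched_ann, matched_pred = set(), set()
    for i, ann in enumerate(annotations):
        for j, pred in enumerate(predictions):
            if j in matched_pred:
                continue
            if ann["label"] == pred["label"] and iou(ann["box"], pred["box"]) >= iou_thr:
                matched_ann.add(i)
                matched_pred.add(j)
                break
    false_negatives = [a for i, a in enumerate(annotations) if i not in matched_ann]
    false_positives = [p for j, p in enumerate(predictions) if j not in matched_pred]
    # Review the most confident unmatched predictions first: they are the most
    # likely to be genuine annotation misses rather than model mistakes.
    false_positives.sort(key=lambda p: p["score"], reverse=True)
    return false_negatives, false_positives
```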
Were the corrected datasets larger or smaller than the originals?
It would also be interesting to see these improved datasets run through crash simulations alongside the existing datasets and see how they handle. Though I'm not sure how you would go about that beyond approaching current providers of such cars for data to work through, and I suspect they may be less open to admitting flaws, which may be a stumbling block.
It certainly makes you wonder how far we can optimise such datasets to get better results. I know some ML datasets are a case of humans fine-tuning, going through examples and classifying them, and I wonder how much that skews or affects error rates, since we all know humans err.
To answer your first question, we had bounding boxes both added and removed, and the main type of error differed depending on the dataset (I'd say overall it was more objects that were forgotten, especially small objects).
It would indeed be very interesting to see the impact of those improved datasets on driving, which is ultimately the task that is automated for cars.
We've been working on many projects at Deepomatic not only related to autonomous cars, and we did see some concrete impact of cleaning the datasets beyond performance metrics.
So in the article you write that you found 20% errors in the data, but at what point do you conclude that “this is an error in the data” and “this is an error in the prediction”?
Is that done manually?
Also, do you have a strategy for finding errors where the model learned to mislabel items in order to increase its score? (E.g., red trucks are labeled as red cars in both train and test.)
There was indeed a manual review of the "potential errors" highlighted by our algorithm to determine whether it was an error in the data or an error in the prediction. The 20% corresponds to the proportion of objects that were corrected through this manual review. So it's actually likely that some errors (ones not found by our algorithm) are still in our clean version of the dataset.
Curious if you could find errors by comparing the results from the different models. Places where models disagree with each other more often would be areas that I would want to target for error checking.
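A rough sketch of what that disagreement ranking could look like, assuming you have per-image class predictions from each model (the data structure here is made up):

```python
import math
from collections import Counter

def disagreement(votes: list[str]) -> float:
    """Vote entropy across models: 0 when all models agree, higher otherwise."""
    counts = Counter(votes)
    total = len(votes)
    return -sum((c / total) * math.log(c / total) for c in counts.values())

# predictions_by_model: {model_name: {image_id: predicted_label}} -- made-up structure
def rank_for_review(predictions_by_model: dict) -> list[tuple[str, float]]:
    image_ids = next(iter(predictions_by_model.values())).keys()
    scores = {
        img: disagreement([preds[img] for preds in predictions_by_model.values()])
        for img in image_ids
    }
    # Highest-disagreement images get checked for labelling errors first.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```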
> Places where models disagree with each other more often would be areas that I would want to target for error checking.
This is a great idea if your goal is to maximize the rate at which things you look at turn out to be errors. (On at least one side.)
But it's guaranteed to miss cases where every model makes the same inexplicable-to-the-human-eye mistake, and those cases would appear to be especially interesting.
This is a good idea, and there are actually 2 objectives when one wants to clean a dataset:
- you might want to optimize your time and correct as many errors as you can as fast as you can. Using several models will help you in that case, and that's actually what we've been focusing on so far.
- you might want to find the most ambiguous cases where you really need to improve your models as those edge cases are the ones causing the problems you have in production.
Those 2 objectives are almost opposite. In the first case, you want to find the "easiest" errors, while in the other, you want to focus on edge cases, and you then probably need to look at errors with intermediate scores, where nothing is really certain.
“Annotator agreement” is a measure of confidence in the correctness of labels, and you should always keep an eye out for how it is handled when reading papers that present a dataset.
Saying we should start doing model agreement is a really good idea imho.
My guess would be that they're using some sort of active learning (rough sketch after the list). In other words:
1) building a model using the data set
2) making predictions using the training data
3) finding the cases where the model is the most confused (difference in probability between classes is low)
4) raising those cases to humans
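Something like this margin-based ranking for steps 3 and 4. This is a toy sketch only, assuming the model exposes per-class probabilities; all names are made up:

```python
import numpy as np

def confusion_margin(probs: np.ndarray) -> np.ndarray:
    """Margin between the top two class probabilities for each sample.

    probs: array of shape (n_samples, n_classes), e.g. softmax outputs.
    A small margin means the model is torn between two classes.
    """
    top2 = np.sort(probs, axis=1)[:, -2:]
    return top2[:, 1] - top2[:, 0]

def cases_for_human_review(probs: np.ndarray, n: int = 100) -> np.ndarray:
    """Indices of the n most ambiguous training samples to raise to humans."""
    margins = confusion_margin(probs)
    return np.argsort(margins)[:n]
```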
- In the cars-on-the-bridge image, the red bounding box for the semitruck in the oncoming lanes is too small, with its upper bound just above the top of the semi's windshield, ignoring the much taller roof and towed container.
- In the same image, there are red bounding boxes around cars that exist, and also red bounding boxes around non-cars that don't exist. If false positives and false negatives are going to be represented in the same picture, it'd be nice to use different colors for them, so the viewer can tell whether the error was identified correctly or spuriously.
- I have trouble understanding the "bus" screenshot. The caption says "(green pictures are valid errors) – The pink dotted boxes are objects that have not been labelled but that our error spotting algorithm highlighted." In other words, the green-highlighted pictures are false negatives considered from the perspective of the original data set, and the red-highlighted pictures are true negatives. Or alternatively, the green-highlighted pictures are true positives from the perspective of the error-spotting algorithm, and the red-highlighted pictures are false positives. What confuses me is that all 9 pictures are labeled "false positive" by the tabbing at the top of the screenshot.
In one of his fastai videos Jeremy Howard makes the point that wrong labels can act as regularization and you shouldn't worry too much about them. I'm a bit skeptical as to how far you can push this but you certainly don't need perfect labelling.
That is true up to a certain point (for instance, in my experience, having bounding boxes that are not pixel-perfect acts as a regularizer), but there is also a good chance that you are mislabelling edge cases, situations that happen rarely, and that definitely hurts the performance of the neural network to make a correct prediction on these difficult / uncommon scenarios.
We did some interesting experiments with Go where we inverted the label of who won and measured what impact that had on the final model. This is a binary label, so it's probably more impactful (it's the only signal we are measuring).
From memory, it had only a small impact (~2% in strength) with ~7% of results flipped; at 4% flipped it was hard to measure any impact (<1%).
A lot of things. One is the "AI", which isn't so much "I", is quite error-prone, and is hard to impossible to analyze in detail and/or debug. The idea that bad people (be it trolls, criminals or spooks) could force deliberate malfunctioning of or misclassifications in AIs and thus cause crashes is off-putting, on top of the general "normal" errors you can expect.
Then the business/political aspects of it, like Tesla demanding somebody who bought a used car pay again for Autopilot.
We already saw crashes by Autopilot users not paying any attention whatsoever (granted AP isn't fully "self-driving", but still).
On top of that, just as with better car safety and even with the introduction of seat belt laws, we saw a stark uptick in accidents that usually affected people outside the car the most, such as pedestrians and bikers. Since I'm a pedestrian quite often, I dread in particular the semi-self-driving/assisted driving car tech like Autopilot, and I have a healthy skepticism when people tell me that the (almost) perfect fully self-driving cars are just around the corner. If my skepticism turns out to be unwarranted, great.
And this tech will keep many consumer cars around longer, to the detriment of public transportation. The one good-ish thing that came out of SARS-CoV-2 is the reduction in air pollution (I am not saying it is a net positive because of that, far from it). The air smells noticeably nicer around here and the noise is also down.
> The idea that bad people (be it trolls, criminals or spooks) could force deliberate malfunctioning of/misclassifications in AIs and thus cause crashes
I wish people would stop trotting this one out. Bad actors can deliberately cause humans to crash just as easily if not moreso. If they don't, it's only because such behavior is punishable.
Ah, yes, the ethical murderer who only wants to fuck up just that one car but who sincerely worries about the other drivers on the road. That's the demographic you're concerned about? So how does indiscriminately trying to trick generally available systems specifically target only one person without risking other drivers?
If you're interested in replying in a condescending manner and attacking strawmen arguments I never made, be my guest, but I have no desire to further discuss this with you.
Is it really a 20% annotation error rate? I read it as 20% of the errors being detected. Errors could be a very small percentage overall, and of those, 20% were detected.
The process is actually a bit complicated but let me explain it to you.
Once you are on a dataset, click on the label that you want and use the slider at the top right corner of the page to switch modes (we call it smart detection).
You should then be able to access three tabs and the errors are listed in the False Positive and False Negative tabs (I've added a screenshot in the blogpost so that you can make sure to be at the right place).
Let me know if you have any problem, thanks!
"Cleaning algorithm finds 20% of errors in major image recognition datasets" -> "Cleaning algorithm finds errors in 20% of annotations in major image recognitions."
We don't know if the found errors represent 20%, 90% or 2% of the total errors in the dataset.
Best I can tell, they are using the ML model to detect the errors. Isn't this a bit of an ouroboros? The model will naturally get better, because you are only correcting problems where it was right but the label was wrong.
It's not necessarily a representation of a better model, but just of a better testing set.
Using simple techniques, they found out that popular open source datasets like VOC or COCO contain up to 20% annotation errors. By manually correcting those errors, they got an average error reduction of 5% for state-of-the-art computer vision models.
An idea on how this could work: repeatedly re-split the dataset (to cover all of it), re-train a detector on each split, and at the end of each training cycle surface the validation frames with the highest computed loss (or some other metric more directly derived from bounding boxes, such as the number of high-confidence "false" positives, which could be instances of under-labeling). That's what I do on noisy, non-academic datasets, anyway.
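Roughly something like this. This is a sketch only; train_detector and frame_loss are stand-ins for whatever detector and loss you already use:

```python
import numpy as np
from sklearn.model_selection import KFold

def surface_suspect_frames(frames, labels, train_detector, frame_loss, k=5, top_n=200):
    """Cross-validation-style error surfacing over a labelled detection dataset."""
    suspects = []
    for train_idx, val_idx in KFold(n_splits=k, shuffle=True, random_state=0).split(frames):
        # Re-train on one split...
        model = train_detector([frames[i] for i in train_idx],
                               [labels[i] for i in train_idx])
        # ...then score the held-out frames.
        for i in val_idx:
            # High validation loss (or many confident "false" positives) often
            # means the label is wrong rather than the model.
            suspects.append((frame_loss(model, frames[i], labels[i]), i))
    suspects.sort(reverse=True)
    return [i for _, i in suspects[:top_n]]
```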