Why aren't these data sets editable instead of static? Treat them like a collaborative wiki or something (OpenStreetMap being the closest fit) and allow everyone to submit improvements so that all may benefit.
I hope the people in this article had a way to contribute back their improvements, and did so.
The datasets serve as benchmarks. You get an idea for a new model that solves a problem current models have. Most of these ideas don't pan out, so you need empirical evidence that yours actually works. To show that your model does better than previous models, you need some task that your model and previous models can share for both training and evaluation. It's more complicated than that, but that's the gist.
It would be so wasteful to have to retrain a dozen models that require a month of GPU time each just to serve as baselines for your new model...
It also potentially gives every paper N replication problems to solve, in addition to the GPU time. I would have to figure out HOW to retrain all of these models on the current form of the dataset... which is fine for an occasional explicit replication study, but terrible if everyone has to do it.
I think it's probably better to have a (say) yearly release of the dataset, with results of some benchmark models released alongside the new version.
This is similar to how Common Voice is handling the problem: it's a crowdsourced, constantly growing dataset, which is awesome if you want to train on as much data as possible for production models. You can get the whole current version any time, but they also have releases with a static fileset and train/test split, which should be better for research.
Is it wasteful to throw away a batch of food when 20% of it has been found to contain the wrong substance, one that ends up causing disease?
Isn't it even more wasteful to continue using unedited and unverified data sets just because all the previous models were trained on them, to the point that we can no longer advance the state of the research? It's a case of garbage in, garbage out.
The thing is, the value as a baseline doesn't actually change that much for being 20% garbage. A bit counter-intuitive, but basically accepted as true in several fields.
The comparisons are all relative accuracy, not absolute accuracy. And the comparison is fair. The new technique is receiving the same part-garbage input that the old-techniques were trained on. For the most part, the better technique will still tend to do better unless there's specifically something about it that makes it more sensitive to labeling errors.
And frankly, a percentage of junk has some advantages. Real-world data is a pile of ass, so it's useful for academic models to require robustness.
>By one estimate, the training time for AlphaGo cost $35 million [0]
How about XLNet, which cost something like $30k-60k to train [1]? GPT-2 is estimated to be around the same [2], while thankfully BERT only costs about $7k [3], unless of course you're going to do any new hyperparameter tuning on their models, which you of course will do on your own model. Who cares about apples-to-apples comparisons?
We're not talking about spending an extra couple hours and a little money on updated replication. We're talking about an immediate overhead of tens to hundreds of thousands of dollars per new paper.
Tasks are updated over time already to take issues into account, but not continuously as far as I know.
Yeah, it is by no means wasteful for AlphaGo to throw away all its training data and then re-train itself!
That kind of ruthless experimentation is how AlphaGo was able to exceed even itself. The willingness to say - all these human games we've fed the computer? All these terabytes of data? It's all meaningless! We're going to throw it all away! We will have AlphaGo determine what is good by playing games against itself!
And I bet you that for the next iteration of AlphaGo, the creators of this system will again delete their own data and retrain when they have a better approach.
If you don't "waste" your existing datasets (once you realize the flaws in your data sets), you are being held back by the sunk cost fallacy. You only have yourself to blame when someone does train for the exact same purposes, but with cleaner data.
The person who has the cleanest source of training data will win in deep learning.
You're sabotaging yourself in my opinion. $30k is nothing when you're just sabotaging the training with faulty data.
As an investor, $35m to train just about the pinnacle of AI seems like a cheap, oh so cheap cost. I can't even buy one freaking continental jet at that price, and there are thousands of these babies flying (not as we speak, but generally).
I don't think you are fully cognizant yet of the formidable scale of AI in the grander scheme of things, as an industry, which is nowadays comparable to transistors circa 1972 in terms of maturity. Long, long ways to go before we sit on "reference" anything. Whether architectures, protocols, models, or test standards, it's the Wild West as we speak.
You make excellent points in principle, which are important to keep in mind in guiding us all along the way, but now is not the time to set things in stone. More like the opposite.
The fact of the matter is that someone will eventually grab the old and new benchmarks, prove superiority on both, and by that point the new one is the benchmark to beat, since it would presumably be error-free this time.
The dataset is a controlled variable in an experiment, so it has to be held constant. If you update both your model and the dataset for every trial (e.g. new hyperparameters or a new architecture) and find it performs better, you won't know whether the model is really better or the dataset just got easier.
This is accurate, but it's also worth noting the AI community has been moving to new benchmarks all the time (e.g. SQuAD 2.0 came out about a year after SQuAD). So in effect editing does happen all the time, just in a batch way instead of a continuous, wiki-type way. This blog post deals with "VOC 2012, COCO 2017, Self Driving Car Udacity", which seem like pretty old datasets no longer really in use. There were already news stories about the self-driving car dataset, so the knowledge that it has issues is not even new. Not to say this is not really useful, but it would be nice to note...
What you're saying is that it's worth it to lie because it's too expensive to give a truthful answer. That is something that your customers likely would not agree with.
- Hosting static data is dramatically easier than making a public editing interface
- You want reference versions of the dataset for papers to refer to so that results are comparable. Sometimes this is used as a justification for not fixing completely broken data, like with Fasttext.
- Building on the previous point, large datasets like this don't play nice with Git. There are lots of "git for data" things but none of them are very mature, and most people don't spend time trying to figure something out.
One major use of the public datasets in the academic community is to serve as a common reference when comparing new techniques against the existing standard. A static baseline is desirable for this task.
You could maybe split the difference by having an "original" or "reference" version, and a separate moving target that incorporates crowdsourced improvements.
This sounds like a revisioning system would help a lot. Have a quarterly or annual release cycle or something, so that when you want to compare performance across techniques, you just train both of them against the same revision (and ideally all the papers coming out at roughly the same time would already be using the same revision anyway).
You'd always work with a versioned release when training models, and you'd only typically work with HEAD when you were specifically looking to correct flaws in the data (as the authors in the linked article are).
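To make that concrete, here's a minimal sketch of what pinning to a reference release could look like. The releases/ layout, manifest format, and load_release helper are all made up for illustration:

```python
import hashlib
import json
from pathlib import Path

def load_release(root: str, version: str) -> list[dict]:
    """Load a pinned dataset release and verify file checksums.

    Assumes a hypothetical layout: <root>/releases/<version>/manifest.json
    listing each annotation file and its expected SHA-256 hash.
    """
    release_dir = Path(root) / "releases" / version
    manifest = json.loads((release_dir / "manifest.json").read_text())

    samples = []
    for entry in manifest["files"]:
        path = release_dir / entry["path"]
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        if digest != entry["sha256"]:
            raise ValueError(f"{path} does not match release {version}")
        samples.append(json.loads(path.read_text()))
    return samples

# Papers would cite a fixed tag; only dataset-cleanup work tracks HEAD.
train = load_release("voc_clean", version="2020.1")
```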
One problem with correcting the benchmark datasets is that it's important for the algorithms to be robust to labelling errors as well. But having multiple versions sounds important anyways.
In general these things are open source, so you can always contribute an improved version of the dataset. But as another commenter said having relatively static ones is also important for benchmarking purposes.
I'm a Product Manager at Deepomatic and I have been leading the study in question here. To detect the errors, we trained a model (with a different neural network architecture than the 6 listed in the post), and we then have a matching algorithm that highlights all bounding boxes that were either annotated but not predicted (False Negative), or predicted but not annotated (False Positive). Those potential errors are also sorted based on an error score to get first the most obvious errors. Happy to answer any other question you may have!
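For readers curious what such a matching step can look like, here is a rough sketch. This is not Deepomatic's actual code; the IoU threshold, box format, and field names are assumptions:

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def flag_potential_errors(annotations, predictions, iou_thr=0.5):
    """Split boxes into annotated-but-not-predicted (potential false negatives)
    and predicted-but-not-annotated (potential false positives)."""
    matched_ann, matched_pred = set(), set()
    for i, ann in enumerate(annotations):
        for j, pred in enumerate(predictions):
            if j in matched_pred:
                continue
            if ann["label"] == pred["label"] and iou(ann["box"], pred["box"]) >= iou_thr:
                matched_ann.add(i)
                matched_pred.add(j)
                break
    false_negatives = [a for i, a in enumerate(annotations) if i not in matched_ann]
    false_positives = [p for j, p in enumerate(predictions) if j not in matched_pred]
    # Review the most confident unmatched predictions first: they are the most
    # likely to be genuine annotation misses rather than model mistakes.
    false_positives.sort(key=lambda p: p["score"], reverse=True)
    return false_negatives, false_positives
```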
Were the corrected datasets larger or smaller than the originals?
It would also be interesting to see these improved datasets run through crash simulations alongside the existing datasets and see how they handle. Though I'm not sure how you would go about that beyond approaching current providers of such cars for data to work through, and I suspect they may be less open to admitting flaws, which may be a stumbling block.
It certainly makes you wonder how far we can optimise such datasets to get better results. I know some ML datasets are a case of humans fine-tuning, going through examples and classifying them, and I wonder how much that skews or affects error rates, since we all know humans err.
To answer your first question, we had bounding boxes both added and removed, and the main type of error differed depending on the dataset (I'd say overall it was more objects that were forgotten, especially small objects).
It would indeed be very interesting to see the impact of those improved datasets on driving, which is ultimately the task that is automated for cars.
We've been working on many projects at Deepomatic not only related to autonomous cars, and we did see some concrete impact of cleaning the datasets beyond performance metrics.
So in the article you write that you found 20% errors in the data, but at what point do you conclude that “this is an error in the data” and “this is an error in the prediction”?
Is that done manually?
Also, do you have a strategy for finding errors where the model learned to mislabel items in order to increase its score? (E.g., red trucks are labeled as red cars in both train and test.)
There was indeed a manual review of the "potential errors" highlighted by our algorithm to determine whether it was an error in the data or an error in the prediction. The 20% corresponds to the proportion of objects that were corrected through this manual review. So it's actually likely that some errors (ones not found by our algorithm) are still in our clean version of the dataset.
Curious if you could find errors by comparing the results from the different models. Places where models disagree with each other more often would be areas that I would want to target for error checking.
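A rough sketch of what that disagreement ranking could look like, assuming you have per-image class predictions from each model (the data structure here is made up):

```python
import math
from collections import Counter

def disagreement(votes: list[str]) -> float:
    """Vote entropy across models: 0 when all models agree, higher otherwise."""
    counts = Counter(votes)
    total = len(votes)
    return -sum((c / total) * math.log(c / total) for c in counts.values())

# predictions_by_model: {model_name: {image_id: predicted_label}} -- made-up structure
def rank_for_review(predictions_by_model: dict) -> list[tuple[str, float]]:
    image_ids = next(iter(predictions_by_model.values())).keys()
    scores = {
        img: disagreement([preds[img] for preds in predictions_by_model.values()])
        for img in image_ids
    }
    # Highest-disagreement images get checked for labelling errors first.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```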
> Places where models disagree with each other more often would be areas that I would want to target for error checking.
This is a great idea if your goal is to maximize the rate at which things you look at turn out to be errors. (On at least one side.)
But it's guaranteed to miss cases where every model makes the same inexplicable-to-the-human-eye mistake, and those cases would appear to be especially interesting.
This is a good idea, and there are actually 2 objectives when one wants to clean a dataset:
- you might want to optimize your time and correct as many errors as you can as fast as you can. Using several models will help you in that case, and that's actually what we've been focusing on so far.
- you might want to find the most ambiguous cases where you really need to improve your models as those edge cases are the ones causing the problems you have in production.
Those 2 objectives are almost opposite. In the first case, you want to find the "easiest" errors, while in the other, you want to focus on edge cases, and you then probably need to look at errors with intermediate scores, where nothing is really certain.
“Annotator agreement” is a measure of confidence in the correctness of labels, and you should always keep an eye out for how it is handled when reading papers that present a dataset.
Saying we should start doing model agreement is a really good idea imho.
My guess would be that they're using some sort of active learning (rough sketch after the list). In other words:
1) building a model using the data set
2) making predictions using the training data
3) finding the cases where the model is the most confused (difference in probability between classes is low)
4) raising those cases to humans
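Something like this margin-based ranking for steps 3 and 4. This is a toy sketch only, assuming the model exposes per-class probabilities; all names are made up:

```python
import numpy as np

def confusion_margin(probs: np.ndarray) -> np.ndarray:
    """Margin between the top two class probabilities for each sample.

    probs: array of shape (n_samples, n_classes), e.g. softmax outputs.
    A small margin means the model is torn between two classes.
    """
    top2 = np.sort(probs, axis=1)[:, -2:]
    return top2[:, 1] - top2[:, 0]

def cases_for_human_review(probs: np.ndarray, n: int = 100) -> np.ndarray:
    """Indices of the n most ambiguous training samples to raise to humans."""
    margins = confusion_margin(probs)
    return np.argsort(margins)[:n]
```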
- In the cars-on-the-bridge image, the red bounding box for the semitruck in the oncoming lanes is too small, with its upper bound just above the top of the semi's windshield, ignoring the much taller roof and towed container.
- In the same image, there are red bounding boxes around cars that exist, and also red bounding boxes around non-cars that don't exist. If false positives and false negatives are going to be represented in the same picture, it'd be nice to use different colors for them, so the viewer can tell whether the error was identified correctly or spuriously.
- I have trouble understanding the "bus" screenshot. The caption says "(green pictures are valid errors) – The pink dotted boxes are objects that have not been labelled but that our error spotting algorithm highlighted." In other words, the green-highlighted pictures are false negatives considered from the perspective of the original data set, and the red-highlighted pictures are true negatives. Or alternatively, the green-highlighted pictures are true positives from the perspective of the error-spotting algorithm, and the red-highlighted pictures are false positives. What confuses me is that all 9 pictures are labeled "false positive" by the tabbing at the top of the screenshot.
In one of his fastai videos Jeremy Howard makes the point that wrong labels can act as regularization and you shouldn't worry too much about them. I'm a bit skeptical as to how far you can push this but you certainly don't need perfect labelling.
That is true up to a certain point (for instance, in my experience, having bounding boxes that are not pixel-perfect acts as a regularizer), but there is also a good chance that you are mislabelling edge cases, situations that happen rarely, and that definitely hurts the performance of the neural network to make a correct prediction on these difficult / uncommon scenarios.
We did some interesting experiments with Go where we inverted the label of who won and measured what impact that had on the final model. This is a binary label, so it's probably more impactful (it's the only signal we are measuring).
From memory, it had only a small impact (~2% in strength) with ~7% of results flipped; at 4% flipped it was hard to measure any impact (<1%).
A lot of things. One is the "AI", which isn't so much "I", is quite error-prone, and is hard to impossible to analyze in detail and/or debug. The idea that bad people (be it trolls, criminals or spooks) could force deliberate malfunctioning of or misclassifications in AIs and thus cause crashes is off-putting, on top of the general "normal" errors you can expect.
Then the business/political aspects of it, like Tesla demanding somebody who bought a used car pay again for Autopilot.
We already saw crashes by Autopilot users not paying any attention whatsoever (granted AP isn't fully "self-driving", but still).
On top of that, just as with better car safety and even with the introduction of seat belt laws, we saw a stark uptick in accidents that usually affected people outside the car the most, such as pedestrians and bikers. Since I'm a pedestrian quite often, I dread in particular the semi-self-driving/assisted driving car tech like Autopilot, and I have a healthy skepticism when people tell me that the (almost) perfect fully self-driving cars are just around the corner. If my skepticism turns out to be unwarranted, great.
And this tech will keep many consumer cars around longer, to the detriment of public transportation. The one good-ish thing that came out of SARS-CoV-2 is the reduction in air pollution (I am not saying it is a net positive because of that, far from it). The air smells noticeably nicer around here and the noise is also down.
> The idea that bad people (be it trolls, criminals or spooks) could force deliberate malfunctioning of/misclassifications in AIs and thus cause crashes
I wish people would stop trotting this one out. Bad actors can deliberately cause humans to crash just as easily if not moreso. If they don't, it's only because such behavior is punishable.
Ah, yes, the ethical murderer who only wants to fuck up just that one car but who sincerely worries about the other drivers on the road. That's the demographic you're concerned about? So how does indiscriminately trying to trick generally available systems specifically target only one person without risking other drivers?
If you're interested in replying in a condescending manner and attacking strawmen arguments I never made, be my guest, but I have no desire to further discuss this with you.
Is it really a 20% annotation error rate? I read it as 20% of the errors being detected. Errors could be a very small percentage overall, and of those, 20% were detected.
The process is actually a bit complicated but let me explain it to you.
Once you are on a dataset, click on the label that you want and use the slider at the top right corner of the page to switch modes (we call it smart detection).
You should then be able to access three tabs and the errors are listed in the False Positive and False Negative tabs (I've added a screenshot in the blogpost so that you can make sure to be at the right place).
Let me know if you have any problem, thanks!
"Cleaning algorithm finds 20% of errors in major image recognition datasets" -> "Cleaning algorithm finds errors in 20% of annotations in major image recognitions."
We don't know if the found errors represent 20%, 90% or 2% of the total errors in the dataset.
Best I can tell, they are using the ML model to detect the errors. Isn't this a bit of an ouroboros? The model will naturally get better, because you are only correcting problems where it was right but the label was wrong.
It's not necessarily a representation of a better model, but just of a better testing set.
Using simple techniques, they found out that popular open source datasets like VOC or COCO contain up to 20% annotation errors. By manually correcting those errors, they got an average error reduction of 5% for state-of-the-art computer vision models.
An idea on how this could work: repeatedly re-split the dataset (to cover all of it), re-train a detector on each split, and at the end of each training cycle surface the validation frames with the highest computed loss (or some other metric more directly derived from bounding boxes, such as the number of high-confidence "false" positives, which could be instances of under-labeling). That's what I do on noisy, non-academic datasets, anyway.
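Roughly something like this. This is a sketch only; train_detector and frame_loss are stand-ins for whatever detector and loss you already use:

```python
import numpy as np
from sklearn.model_selection import KFold

def surface_suspect_frames(frames, labels, train_detector, frame_loss, k=5, top_n=200):
    """Cross-validation-style error surfacing over a labelled detection dataset."""
    suspects = []
    for train_idx, val_idx in KFold(n_splits=k, shuffle=True, random_state=0).split(frames):
        # Re-train on one split...
        model = train_detector([frames[i] for i in train_idx],
                               [labels[i] for i in train_idx])
        # ...then score the held-out frames.
        for i in val_idx:
            # High validation loss (or many confident "false" positives) often
            # means the label is wrong rather than the model.
            suspects.append((frame_loss(model, frames[i], labels[i]), i))
    suspects.sort(reverse=True)
    return [i for _, i in suspects[:top_n]]
```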