A transformer-based method for zero and few-shot biomedical NER (arxiv.org)
73 points by nikolamilosevic on May 11, 2023 | 10 comments



It's concerning that there are no references to scispacy (from AllenAI) in the paper. Scispacy is a bit dated in its core tech, but it's still one of the easier ways of getting quick NER results on text.
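For reference, getting entities out of scispacy is about this much code (a minimal sketch, assuming you've installed scispacy and the en_core_sci_sm model from their release page):

    # Minimal scispacy NER sketch; assumes "pip install scispacy"
    # plus the en_core_sci_sm model from the scispacy releases.
    import spacy

    nlp = spacy.load("en_core_sci_sm")
    doc = nlp("Spinal and bulbar muscular atrophy (SBMA) is an inherited "
              "motor neuron disease caused by the expansion of a "
              "polyglutamine tract within the androgen receptor.")

    for ent in doc.ents:
        print(ent.text, ent.start_char, ent.end_char, ent.label_)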

This paper is extremely similar in ___domain (same corpora, etc., though that's not surprising since everyone uses these), but it leans heavily on pretraining to enable few- and zero-shot capabilities, which is already well understood. Ultimately I think it's a good resource for the code, and if the API ends up being easier to modify internally than the many existing options such as scispacy, or any of the pipelines used to build PubTator, then it's a welcome addition.

My assessment is that this is a useful alternative in a space with many existing solutions, but mostly an engineering product, and quite far from a scientific contribution.


Does anybody know of major efforts to use cutting-edge LLMs to read through medical/scientific literature and come up with new insights from it? Preferably open source, but closed projects as well. Of course, there are copyright issues, and you might run into the tension between paid journals and open-access science. The technical challenges would be interesting, though, and this seems like an incredible use case for AI to make novel connections that build on existing literature. Not necessarily because it's "superhuman" intelligence, but rather because it can "read" through vast amounts of text at a clip no human can.


There are plenty of examples of encoders, like the one in this paper, being applied to read through the whole literature, but using an LLM directly (typically meaning the decoder) isn't necessarily useful or efficient for the typical tasks in this field. It's far more efficient to use smaller models for these tasks: if you have 60 million articles at ~2,000 tokens each and your system can process only 30 tokens per second, you're looking at roughly 127 *years* to process the whole dataset. ~10^1 tokens/second is the order of magnitude LLMs work at, maybe the 10^2 regime if you have a machine you purchase for around $1 million. So that's hopeless for any typical researcher trying to do this serially, or months if you have a small amount (single millions) of investment to parallelize with. If it's billions, more possibilities open up, as is clear with DeepMind et al.

So clearly you need to have some very good hardware to process all of it.

However, compare that with some of the simpler encoder models, which have far fewer parameters and are targeted at specific tasks. These systems can plow through 10^5 or 10^6 tokens per second, so that ~127 years becomes a couple of weeks, or about a day.
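Back-of-the-envelope, if you want to sanity-check those figures (the corpus size and throughputs are the rough numbers above, not measured benchmarks):

    # Rough corpus-processing-time estimates from the figures above.
    ARTICLES = 60_000_000
    TOKENS_PER_ARTICLE = 2_000
    total_tokens = ARTICLES * TOKENS_PER_ARTICLE  # 1.2e11 tokens

    SECONDS_PER_YEAR = 365 * 24 * 3600

    for label, tok_per_sec in [("LLM, single stream", 30),
                               ("small encoder", 1e5),
                               ("small encoder, batched", 1e6)]:
        seconds = total_tokens / tok_per_sec
        print(f"{label:>24}: {seconds / SECONDS_PER_YEAR:9.2f} years"
              f" ({seconds / 86_400:9.1f} days)")

That prints ~127 years at 30 tokens/second, versus roughly 14 days and 1.4 days at the two encoder throughputs.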

This is why small, task-specific models are so important: they make much more possible in reasonable time frames and reduce CO2 emissions by orders of magnitude. It's along the same lines as "why use an LLM to extract everywhere the string 'Starbucks' followed by four digits appears in text when you can use a regex?" You can get the output in a few seconds on billions of articles with a DB, whereas an LLM would take more than a decade.
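For concreteness, here's the regex version (reading the example as the literal string 'Starbucks' followed by four digits; the exact pattern is my interpretation):

    import re

    # "Starbucks" followed by exactly four digits, e.g. "Starbucks0042".
    pattern = re.compile(r"Starbucks\d{4}")

    text = "Order placed at Starbucks0042; Starbucks9001 was closed."
    print(pattern.findall(text))  # ['Starbucks0042', 'Starbucks9001']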


I've also gotten remarkably good results pushing regular expressions to the max to perform "NER" where other, fancier (ML) solutions failed. So don't rule them out.


Perhaps... but if that's true, you're likely not doing what is typically referred to as NER in computational linguistics. Either that, or you're using a kind of 'dictionary-based' regex system, which typically gets high recall and low precision. Several years ago, when manual feature engineering (FE) was a thing, that output would be fed as a feature to a CRF model or some other Markov-like model (perhaps even an LSTM, prior to word2vec), along with other features like 'ends in -ly', etc.
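As a toy illustration of that manual-FE style (the feature names and mini-gazetteer here are made up for illustration, not taken from any particular system): each token gets hand-crafted surface features, with the dictionary hit as just one feature among many, and a CRF or similar sequence model consumes them.

    # Toy token-feature extractor in the old manual-FE style.
    GENE_DICT = {"BRCA1", "TP53", "EGFR"}  # illustrative gazetteer

    def token_features(tokens, i):
        tok = tokens[i]
        return {
            "lower": tok.lower(),
            "is_title": tok.istitle(),
            "is_upper": tok.isupper(),
            "ends_in_ly": tok.endswith("ly"),
            "in_gene_dict": tok in GENE_DICT,  # dictionary/regex feature
            "prev": tokens[i - 1].lower() if i > 0 else "<BOS>",
            "next": tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>",
        }

    tokens = "Mutations in BRCA1 frequently co-occur".split()
    print(token_features(tokens, 2))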

There are always tradeoffs. Regexes are fast, and linear CRFs are quite fast too. Simple LSTMs are fast-ish, and BERT-like systems can offer a decent speed/performance tradeoff. LLMs are much slower and need some kind of step-by-step distillation to get anything very useful out of them for a task-specific model.

Ultimately, you're right that regexes, along with other older techniques, should be understood and weighed when choosing the optimal solution for the task.


Regex? Just use Langchain agents, bro!

:-P


Google has developed Med-PaLM 2 for this purpose: https://cloud.google.com/blog/topics/healthcare-life-science...


I would think the problem is that these LLMs don’t have a concept of any reality/objects outside of language. So many of the insights that come intuitively to humans, like Newton’s apple, will probably not come to an LLM. But I’m really just guessing. I also wouldn’t have thought that ChatGPT can draw a unicorn, and yet apparently it can.


Named entity recognition. "Named entity" usually refers to generic units such as personal names, locations, organizations, etc., or ___domain-specific units such as names of genes, proteins or enzymes.


I worked on this package last year; it will be presented at ACL 2023: https://github.com/IBM/zshot
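Roughly, a pipeline looks like this (a simplified sketch in the spirit of the README; check the repo for the exact current API, as class names may differ):

    # Simplified zshot sketch; verify names against the repo's README.
    import spacy
    from zshot import PipelineConfig
    from zshot.linker import LinkerRegen
    from zshot.mentions_extractor import MentionsExtractorSpacy
    from zshot.utils.data_models import Entity

    nlp = spacy.load("en_core_web_sm")
    config = PipelineConfig(
        mentions_extractor=MentionsExtractorSpacy(),
        linker=LinkerRegen(),
        entities=[
            Entity(name="company", description="Names of companies"),
            Entity(name="protein", description="Names of proteins or enzymes"),
        ],
    )
    nlp.add_pipe("zshot", config=config, last=True)

    doc = nlp("P53 is a tumor suppressor protein studied at Roche.")
    print([(ent.text, ent.label_) for ent in doc.ents])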



