4 waves of NLP techniques and how to stitch them together
SOURCE: DRUGDISCOVERYTRENDS.COM
JUL 20, 2024
By Brian Buntz | July 20, 2024
While clinical trials and regulatory filings offer a semi-structured view of drug safety, many insights lie in sources ranging from patient support programs (PSPs) to social media posts. As Natural Language Processing (NLP) evolves, a growing number of tools are becoming available to unlock this potential.
Deepanshu Saini, Director of Program Management at IQVIA, divides NLP techniques into four broad categories as they relate to pharmacovigilance. “These technologies have evolved over time, each bringing new capabilities to process and understand unstructured data in the pharmaceutical world,” Saini said.
[For more on Saini’s take on NLP, check out his article “From social media to safety signals: How AI and NLP are transforming drug safety monitoring”]
While seemingly straightforward, keyword search has been a cornerstone of data analysis for decades. The technology has its roots in decades-old web search techniques that used basic keyword matching to index websites. While fast, it lacks precision. Imagine a pharmacovigilance team tasked with identifying reports of headaches associated with a specific drug. A simple keyword search for “headache” across patient forums and social media would unearth a large volume of mentions. Many of these results could be false positives, such as posts that use the word figuratively or never mention the drug.
This simplicity comes at a cost. As Saini explains, “Keyword search examines large unstructured datasets for keywords and signals, but without any context.” This lack of nuance can lead to misleading results.
In addition, the technique can fail to surface related results. A search for “cephalalgia,” a medical term for headache, might work in medical journals but not on social media sites where patients use everyday language.
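The two failure modes described above can be sketched in a few lines of Python. This is a minimal illustration, not a description of any production system: the posts are invented, “DrugX” is a placeholder name, and `keyword_search` is a hypothetical helper.

```python
import re

# Hypothetical posts, invented for illustration only.
posts = [
    "Started DrugX last week and now I have a pounding headache every morning.",
    "This insurance paperwork is such a headache.",
    "Severe head pain since my second dose of DrugX.",
    "My neurologist noted cephalalgia in my chart after DrugX.",
]

def keyword_search(texts, keyword):
    """Return the texts that contain the keyword as a whole word (case-insensitive)."""
    pattern = re.compile(rf"\b{re.escape(keyword)}\b", re.IGNORECASE)
    return [text for text in texts if pattern.search(text)]

# Matches the genuine report AND the figurative "paperwork" post (a false positive).
hits = keyword_search(posts, "headache")

# The "head pain" and "cephalalgia" posts are never surfaced (false negatives).
missed = [text for text in posts if text not in hits]
```

Here the search has no way to tell the figurative use from the clinical one, and it has no way to know that “head pain” and “cephalalgia” describe the same symptom.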
[Infographic based on IQVIA interview. Created in Photoshop with a Firefly template]
To overcome the shortcomings of purely keyword-based approaches, semantic search emerged as a more intelligent alternative. The technique, whose roots also stretch back decades, became mainstream around 2012 and 2013, when Google made strides in implementing semantic search at scale with its Knowledge Graph and Hummingbird update, which prized context over keywords.
Instead of simply matching words, semantic search explores the meaning and relationships between them. Return to the example of a pharmacovigilance team searching for headache reports. A keyword search for either “headache” or “cephalalgia” might miss mentions of “migraine” or “severe head pain,” whereas semantic search could capture all of these related terms simultaneously. “Semantic search doesn’t just look for a keyword but considers the context and all kinds of meanings associated with it,” Saini said. This ability to recognize synonyms and related terms drastically reduces false negatives.
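One simple way to approximate this behavior is to expand a query with synonyms from an ontology before searching. The sketch below assumes a tiny hand-written synonym table; `ONTOLOGY` and `expanded_search` are hypothetical names, and a real deployment would more plausibly draw on a curated medical vocabulary.

```python
import re

# Hypothetical mini-ontology mapping a concept to terms patients and clinicians use.
ONTOLOGY = {
    "headache": ["headache", "cephalalgia", "migraine", "head pain"],
}

def expanded_search(texts, concept, ontology=ONTOLOGY):
    """Return texts that mention the concept or any synonym listed for it."""
    terms = ontology.get(concept, [concept])
    pattern = re.compile(
        "|".join(rf"\b{re.escape(term)}\b" for term in terms),
        re.IGNORECASE,
    )
    return [text for text in texts if pattern.search(text)]

# Invented example posts.
posts = [
    "Woke up with a migraine after taking DrugX.",
    "Severe head pain since my second dose.",
    "Chart notes cephalalgia following treatment.",
    "No side effects so far, feeling fine.",
]

matches = expanded_search(posts, "headache")
```

None of the first three posts contains the literal word “headache,” yet all three are returned. Synonym expansion improves recall only; it does not by itself resolve the context problem.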
Google also helped shift the landscape of NLP in 2018 with the launch of BERT (Bidirectional Encoder Representations from Transformers), one of the first and most influential transformer-based language models. “This technology enables anyone to train their own state-of-the-art question answering system,” wrote Pandu Nayak, Google Fellow and Vice President, Search, in 2019.
The launch of BERT followed the publication of the seminal 2017 paper “Attention Is All You Need,” which popularized the transformer architecture. Transformers helped displace recurrent neural networks (RNNs), which process words in the order they appear.
Transformers, on the other hand, deploy a mechanism called “self-attention.” This capability allows the model to weigh the importance of all words in a sentence simultaneously no matter where they are, capturing their interdependencies. Imagine the model looking at every word in a sentence at the same time and determining which connections are most meaningful for understanding the overall message. BERT, in being bidirectional, can explore the semantics preceding and following a given word. Models like BERT “provide context in all directions, looking for context before and after the word you’re searching for,” Saini said.
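The idea of weighing every word against every other word can be sketched with scaled dot-product attention. The version below is deliberately simplified: it uses the input vectors directly as queries, keys, and values, whereas real transformers such as BERT apply learned projections and multiple attention heads. The token vectors are invented toy values.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that are positive and sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(vectors):
    """Simplified self-attention: each vector attends to all vectors, itself included.

    Assumption: queries, keys, and values are the inputs themselves
    (no learned projection matrices).
    """
    dim = len(vectors[0])
    outputs, weights = [], []
    for query in vectors:
        # Similarity of this token to every token, in either direction.
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in vectors
        ]
        attn = softmax(scores)
        weights.append(attn)
        # New representation: attention-weighted mix of all token vectors.
        outputs.append(
            [sum(w * vec[i] for w, vec in zip(attn, vectors)) for i in range(dim)]
        )
    return outputs, weights

tokens = ["drug", "caused", "headache"]
toy_vectors = [[1.0, 0.0], [0.5, 0.5], [0.9, 0.1]]  # invented 2-d embeddings
outputs, weights = self_attention(toy_vectors)
```

Each row of `weights` shows how much one token draws on every other token, regardless of position, which is the property that lets the model use context on both sides of a word.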
BERT also makes use of word embeddings, which convert words into vectors of floating-point numbers that can then be analyzed computationally. TensorFlow, for instance, has an Embedding Projector (originally developed by Google) that allows users to visualize embedding data.
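Once words are vectors, relatedness becomes a calculation. A common measure is cosine similarity, sketched here with invented three-dimensional vectors; embeddings from a trained model have far more dimensions and are learned from data, not written by hand.

```python
import math

# Toy vectors invented for illustration; not taken from any trained model.
embeddings = {
    "headache": [0.90, 0.80, 0.10],
    "migraine": [0.85, 0.75, 0.20],
    "invoice": [0.10, 0.20, 0.95],
}

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: near 1 means similar direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

related = cosine_similarity(embeddings["headache"], embeddings["migraine"])
unrelated = cosine_similarity(embeddings["headache"], embeddings["invoice"])
```

With these toy values, `related` is much larger than `unrelated`, mirroring how a trained model would be expected to place “headache” closer to “migraine” than to “invoice.”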
In pharmacovigilance, all of this translates to higher accuracy. “We are actually using [BERT] at IQVIA on the safety side to detect adverse events and reduce false positives,” Saini said. “When you detect an adverse event, your search and semantic search give you a lot of results. BERT helps us nail down the context and reduce those false positives.”
While transformer models like BERT significantly advanced the NLP landscape, the emergence of large language models (LLMs) such as ChatGPT has sparked both excitement and apprehension, especially in the tightly regulated drug safety world. “Large language models have changed the game quite a bit because of the buzz,” Saini said.
While BERT is open source, the most popular large language models are proprietary, which presents a challenge for auditability.
FDA CFR guidelines, for instance, require “validation of computerized systems.” “You should be able to produce technology that is testable and can be validated as per the laws,” Saini said. “If you’re rolling out any new technology in the market, you should be able to validate it.”
The challenge with standard publicly available LLMs lies in their “black box” nature. “You’re not able to predict what the outcomes will be,” Saini observed. “For example, if you go to ChatGPT and want to rephrase or summarize something, every time you give the same input, it’s going to give you a slightly different output. That’s not testable and can’t be validated, so it can’t meet the benchmarks that the current laws have put in place.”
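One common source of this run-to-run variation is that generative models typically sample each next word from a probability distribution instead of always taking the most likely one. The toy sketch below illustrates that general mechanism only; the probabilities are invented, and it makes no claim about how ChatGPT or any specific product is configured.

```python
import random

# Hypothetical next-token probabilities, invented for illustration.
next_token_probs = {"headache": 0.5, "migraine": 0.3, "nausea": 0.2}

def greedy_pick(probs):
    """Always choose the most likely token: same input, same output."""
    return max(probs, key=probs.get)

def sampled_pick(probs, rng):
    """Draw a token in proportion to its probability: output can vary per call."""
    tokens = list(probs)
    return rng.choices(tokens, weights=[probs[t] for t in tokens], k=1)[0]

def sample_sequence(probs, seed, n=5):
    """Sample n tokens from a generator seeded with a fixed value."""
    rng = random.Random(seed)
    return [sampled_pick(probs, rng) for _ in range(n)]

# Greedy decoding is repeatable across any number of runs.
greedy_runs = {greedy_pick(next_token_probs) for _ in range(20)}

# Unseeded sampling may return different tokens from one run to the next.
unseeded_runs = [sampled_pick(next_token_probs, random.Random()) for _ in range(20)]
```

In this toy setup, fixing the seed in `sample_sequence` makes the sampled output repeatable, which is the kind of property a validation test needs. Whether a given hosted model offers equivalent control is a separate question that this sketch does not answer.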
While LLMs show promise in processing vast amounts of unstructured data, they are known to sometimes “hallucinate” or generate plausible-sounding but incorrect information. This tendency can be particularly problematic in the drug safety context, where accuracy is paramount. Researchers have developed methods to mitigate this issue, such as Retrieval Augmented Generation (RAG), which grounds LLM outputs in verified information sources.
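The retrieval half of RAG can be sketched without any model at all: find the most relevant verified passage, then place it in the prompt so the model is asked to answer from that text. Everything below is an illustrative assumption, including the snippets, the word-overlap scoring (real systems more often use embedding similarity), and the helper names.

```python
import re

# Hypothetical "verified" snippets, invented for illustration.
verified_sources = [
    "DrugX label: headache was reported in clinical trials.",
    "DrugX label: take with food to reduce nausea.",
    "DrugY label: dizziness was reported in clinical trials.",
]

def tokenize(text):
    """Lowercase a text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(question, sources, top_k=1):
    """Rank sources by how many words they share with the question."""
    question_words = tokenize(question)
    ranked = sorted(
        sources,
        key=lambda source: len(question_words & tokenize(source)),
        reverse=True,
    )
    return ranked[:top_k]

def build_prompt(question, sources):
    """Assemble a prompt that restricts the model to the retrieved context."""
    context = "\n".join(f"- {s}" for s in retrieve(question, sources))
    return (
        "Answer using only the context below. "
        "If the answer is not there, say so.\n"
        f"Context:\n{context}\n"
        f"Question: {question}"
    )

prompt = build_prompt("Does DrugX cause headache?", verified_sources)
```

The resulting `prompt` would then be passed to a language model. Grounding the answer this way also leaves a record of which source was used, though it does not by itself guarantee a correct answer.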
Saini remains cautious about large language models: “These are Gen AI models like ChatGPT. A lot of research is being done, but I’m yet to see high accuracy, especially in the safety world.”
Off-the-shelf LLMs used in medical contexts can also complicate compliance with FDA guidance on having “audit trails or other physical, logical, or procedural security measures in place to ensure the trustworthiness and reliability.”
While NLP tools have evolved over the years, that doesn’t necessarily mean that each new technology supplants earlier ones. “We don’t use these models in isolation,” Saini explains. “It’s really like a toolbox. You use whichever one is best for the task or a combination of tools.”
In addition to the tools outlined here, when combing through social media posts to find adverse event signals, pharma companies employ a multi-pronged strategy. “It’s a combination of tools – ontologies for semantic search (keywords plus synonyms), BERT to reduce false positives, and another tool to break ties between those two,” Saini said. “We don’t use BERT or [other NLP] models in isolation. We use them to augment the search — either to make the initial search better or to fine-tune the results and reduce false positives after semantic search.”
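The combination Saini describes can be sketched as a three-stage triage. This is a hedged illustration of the general shape, not IQVIA’s implementation: the ontology is invented, `context_classifier` is a crude rule standing in for a fine-tuned BERT model, and routing disagreements to review is an assumed tie-breaking policy.

```python
import re

# Hypothetical synonym list for one concept.
ONTOLOGY = {"headache": ["headache", "migraine", "head pain"]}

def ontology_match(text, concept):
    """Stage 1: recall-oriented search over the concept and its synonyms."""
    return any(
        re.search(rf"\b{re.escape(term)}\b", text, re.IGNORECASE)
        for term in ONTOLOGY[concept]
    )

def context_classifier(text):
    """Stage 2 stand-in for a BERT-style classifier.

    Returns a pseudo-probability that the text describes a personal
    experience with a drug. The cue list is a placeholder, not a real model.
    """
    cues = ("took", "taking", "dose", "since starting")
    return 0.9 if any(cue in text.lower() for cue in cues) else 0.2

def tie_breaker(text):
    """Stage 3 stand-in: when stages 1 and 2 disagree, send to review."""
    return "needs_review"

def triage(text, concept, threshold=0.5):
    """Combine the stages into a single label for one post."""
    found = ontology_match(text, concept)
    likely = context_classifier(text) >= threshold
    if found and likely:
        return "candidate_adverse_event"
    if not found and not likely:
        return "ignore"
    return tie_breaker(text)

labels = [
    triage("Got a migraine after taking DrugX", "headache"),
    triage("This paperwork is a headache", "headache"),
    triage("Lovely weather today", "headache"),
]
```

The second post shows why the stages are layered: the ontology search flags it, the context stage does not, and the disagreement is escalated instead of being silently counted as a safety signal.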
On top of the tools mentioned above, Saini underscores the importance of human expertise as well as myriad machine learning algorithms such as XGBoost and decision trees to explore large datasets of human-categorized information.
But Saini points out that simply identifying social media signals is only the first step. The larger impact comes from confirming these insights and using them to inform strategies. “It’s beneficial for pharma companies to design patient support programs and develop educational material,” Saini said. “On social media, you pick up signals and build hypotheses. We recommend our clients to confirm these signals through primary research or focus groups, inviting patients and HCPs.” Once confirmed, they develop strategies such as designing patient support programs, training materials, and better guidance for HCPs. Saini concluded: “Then they implement and gather feedback to see if the desired change has occurred. It’s a multi-pronged strategy, not as simple as reading something online and making immediate changes.”
Brian Buntz
As the pharma and biotech editor at WTWH Media, Brian has almost two decades of experience in B2B media, with a focus on healthcare and technology. While he has long maintained a keen interest in AI, more recently Brian has made data analysis a central focus, and is exploring tools ranging from NLP and clustering to predictive analytics.
Throughout his 18-year tenure, Brian has covered an array of life science topics, including clinical trials, medical devices, and drug discovery and development. Prior to WTWH, he held the title of content director at Informa, where he focused on topics such as connected devices, cybersecurity, AI and Industry 4.0. A dedicated decade at UBM saw Brian providing in-depth coverage of the medical device sector. Engage with Brian on LinkedIn or drop him an email at bbuntz@wtwhmedia.com.