Source: https://pubmed.ncbi.nlm.nih.gov/21846786/
Natural language processing: an introduction

Prakash M Nadkarni et al.

Review. J Am Med Inform Assoc. 2011 Sep-Oct;18(5):544-51. doi: 10.1136/amiajnl-2011-000464.

Abstract

Objectives: To provide an overview and tutorial of natural language processing (NLP) and modern NLP-system design.

Target audience: This tutorial targets the medical informatics generalist who has limited acquaintance with the principles behind NLP and/or limited knowledge of the current state of the art.

Scope: We describe the historical evolution of NLP and summarize common NLP sub-problems in this extensive field. We then provide a synopsis of selected highlights of medical NLP efforts. After briefly describing common machine-learning approaches used for diverse NLP sub-problems, we discuss how modern NLP architectures are designed, with a summary of the Apache Foundation's Unstructured Information Management Architecture (UIMA). Finally, we consider possible future directions for NLP and reflect on the potential impact of IBM Watson on the medical field.


Conflict of interest statement

Competing interests: None.

Figures

Figure 1. Support vector machines: a simple 2-D case is illustrated. The data points, shown as categories A (circles) and B (diamonds), can be separated by a straight line X–Y. The algorithm that determines X–Y identifies the data points (‘support vectors’) from each category that are closest to the other category (a1, a2, a3 and b1, b2, b3) and computes X–Y such that the margin separating the categories on either side is maximized. In the general N-dimensional case, the separator will be an (N−1)-dimensional hyperplane, and the raw data will sometimes need to be mathematically transformed so that linear separation is achievable.
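The maximum-margin computation the caption describes can be demonstrated in a few lines. The sketch below uses scikit-learn on invented 2-D data (library, data, and labels are illustrative assumptions, not from the paper):

    # Max-margin separation of two 2-D categories, as in figure 1.
    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.default_rng(0)
    A = rng.normal(loc=[-2, -2], scale=0.5, size=(20, 2))  # category A (circles)
    B = rng.normal(loc=[2, 2], scale=0.5, size=(20, 2))    # category B (diamonds)
    X, y = np.vstack([A, B]), np.array([0] * 20 + [1] * 20)

    # kernel='linear' finds the maximum-margin line X-Y; a non-linear
    # kernel (eg, 'rbf') applies the mathematical transformation the
    # caption mentions when the raw data are not linearly separable.
    clf = SVC(kernel="linear").fit(X, y)

    # Only the support vectors (a1..a3, b1..b3 in the figure) determine
    # the separator; every other point could be removed without effect.
    print(clf.support_vectors_)
    print(clf.predict([[0.0, 0.5]]))  # which side of X-Y a new point falls on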
Figure 2. Hidden Markov models. The small circles S1, S2 and S3 represent states. Boxes O1 and O2 represent output values. (In practical cases, hundreds of states/output values may occur.) The solid lines/arcs connecting states represent state switches; the arrow represents the switch's direction. (A state may switch back to itself.) Each line/arc label (not shown) is the switch probability, a decimal number. A dashed line/arc connecting a state to an output value indicates the ‘output probability’: the probability of that output value being generated from the particular state. If a particular switch/output probability is zero, the line/arc is not drawn. The sum of the switch probabilities leaving a given state (and, similarly, the sum of the output probabilities for that state) is equal to 1. The sequential or temporal aspect of an HMM is shown in figure 3.
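To make these probability constraints concrete, here is a small numeric sketch of such a model in Python/NumPy; the specific probability values are invented for illustration:

    # Three states S1-S3 and two output values O1-O2, as in figure 2.
    import numpy as np

    states, outputs = ["S1", "S2", "S3"], ["O1", "O2"]

    # Switch probabilities: row = current state, column = next state.
    # A zero entry corresponds to an arc that is not drawn.
    switch_p = np.array([[0.1, 0.6, 0.3],
                         [0.0, 0.5, 0.5],
                         [0.4, 0.0, 0.6]])

    # Output probabilities: row = state, column = output value.
    output_p = np.array([[0.9, 0.1],
                         [0.2, 0.8],
                         [0.5, 0.5]])

    # Each row sums to 1, as the caption requires.
    assert np.allclose(switch_p.sum(axis=1), 1.0)
    assert np.allclose(output_p.sum(axis=1), 1.0)

    # Walk the model to show its sequential/temporal aspect.
    rng, state = np.random.default_rng(0), 0  # start in S1
    for _ in range(5):
        print(states[state], "emits", outputs[rng.choice(2, p=output_p[state])])
        state = rng.choice(3, p=switch_p[state])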
Figure 3. The relationship between Naive Bayes, logistic regression, hidden Markov models (HMMs) and conditional random fields (CRFs). Logistic regression is the discriminative-model counterpart of Naive Bayes, which is a generative model. HMMs and CRFs extend Naive Bayes and logistic regression, respectively, to sequential data (adapted from Sutton and McCallum [73]). In the generative models, the arrows indicate the direction of dependency. Thus, for the HMM, the state Y2 depends on the previous state Y1, while the output X1 depends on Y1.
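The generative/discriminative pairing in the non-sequential case can be shown directly. The sketch below trains Naive Bayes and logistic regression on the same toy documents with scikit-learn (data, labels, and library are illustrative assumptions; the sequential HMM/CRF counterparts would require a sequence-labeling library and are omitted):

    from sklearn.naive_bayes import MultinomialNB
    from sklearn.linear_model import LogisticRegression
    from sklearn.feature_extraction.text import CountVectorizer

    docs = ["chest pain and dyspnea", "no acute distress",
            "severe chest pain", "patient feels well"]
    labels = [1, 0, 1, 0]  # toy labels: 1 = symptomatic

    X = CountVectorizer().fit_transform(docs)

    # Generative: models P(features | class) and P(class), then applies Bayes' rule.
    nb = MultinomialNB().fit(X, labels)

    # Discriminative: models P(class | features) directly.
    lr = LogisticRegression().fit(X, labels)

    print(nb.predict(X), lr.predict(X))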
Figure 4. A UIMA pipeline. An input task is sequentially put through a series of tasks, with intermediate results at each step and final output at the end. Generally, the output of a task is the input of its successor, but exceptionally, a particular task may provide feedback to a previous one (as in task 4 providing input to task 1). Intermediate results (eg, successive transformations of the original bus) are read from/written to the CAS (common analysis structure), which contains metadata defining the formats of the data required at every step, the intermediate results, and annotations that link to these results.
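The data flow the caption describes can be mimicked in a few lines of Python. Note that real UIMA components are written against the Java (or C++) UIMA SDK; the dictionary-based CAS, the task functions, and the annotation format below are invented stand-ins that show only the pipeline pattern:

    def tokenize(cas):
        cas["tokens"] = cas["text"].split()

    def annotate(cas):
        # Each task reads earlier results from the shared CAS and writes
        # its own annotations back, rather than passing data directly.
        cas["annotations"] = [{"token": t, "pos": i}
                              for i, t in enumerate(cas["tokens"])]

    pipeline = [tokenize, annotate]

    cas = {"text": "Patient denies chest pain"}  # the shared analysis structure
    for task in pipeline:
        task(cas)  # intermediate results accumulate in the CAS

    print(cas["annotations"])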


References

    1. Manning C, Raghavan P, Schuetze H. Introduction to Information Retrieval. Cambridge, UK: Cambridge University Press, 2008.
    2. Hutchins W. The First Public Demonstration of Machine Translation: the Georgetown-IBM System, 7th January 1954. 2005. http://www.hutchinsweb.me.uk/GU-IBM-2005.pdf (accessed 4 Jun 2011).
    3. Chomsky N. Three models for the description of language. IRE Trans Inf Theory 1956;2:113–24.
    4. Aho AV, Sethi R, Ullman JD. Compilers: Principles, Techniques, Tools. Reading, MA: Addison-Wesley, 1988.
    5. Chomsky N. On certain formal properties of grammars. Inform Contr 1959;2:137–67.
