Source: https://pubmed.ncbi.nlm.nih.gov/21846786/
Natural language processing: an introduction

Prakash M Nadkarni et al.

Review. J Am Med Inform Assoc. 2011 Sep-Oct;18(5):544-51. doi: 10.1136/amiajnl-2011-000464.

Abstract

Objectives: To provide an overview and tutorial of natural language processing (NLP) and modern NLP-system design.

Target audience: This tutorial targets the medical informatics generalist who has limited acquaintance with the principles behind NLP and/or limited knowledge of the current state of the art.

Scope: We describe the historical evolution of NLP and summarize common NLP sub-problems in this extensive field. We then provide a synopsis of selected highlights of medical NLP efforts. After briefly describing common machine-learning approaches used for diverse NLP sub-problems, we discuss how modern NLP architectures are designed, with a summary of the Apache Foundation's Unstructured Information Management Architecture (UIMA). Finally, we consider possible future directions for NLP and reflect on the potential impact of IBM Watson on the medical field.


Conflict of interest statement

Competing interests: None.

Figures

Figure 1. Support vector machines: a simple 2-D case is illustrated. The data points, shown as categories A (circles) and B (diamonds), can be separated by a straight line X–Y. The algorithm that determines X–Y identifies the data points (‘support vectors’) from each category that are closest to the other category (a1, a2, a3 and b1, b2, b3) and computes X–Y such that the margin separating the categories on either side is maximized. In the general N-dimensional case, the separator will be an (N−1)-dimensional hyperplane, and the raw data will sometimes need to be mathematically transformed so that linear separation is achievable.
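The maximum-margin computation the caption describes can be demonstrated in a few lines. The sketch below uses scikit-learn on invented 2-D data (library, data, and labels are illustrative assumptions, not from the paper):

    # Max-margin separation of two 2-D categories, as in figure 1.
    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.default_rng(0)
    A = rng.normal(loc=[-2, -2], scale=0.5, size=(20, 2))  # category A (circles)
    B = rng.normal(loc=[2, 2], scale=0.5, size=(20, 2))    # category B (diamonds)
    X, y = np.vstack([A, B]), np.array([0] * 20 + [1] * 20)

    # kernel='linear' finds the maximum-margin line X-Y; a non-linear
    # kernel (eg, 'rbf') applies the mathematical transformation the
    # caption mentions when the raw data are not linearly separable.
    clf = SVC(kernel="linear").fit(X, y)

    # Only the support vectors (a1..a3, b1..b3 in the figure) determine
    # the separator; every other point could be removed without effect.
    print(clf.support_vectors_)
    print(clf.predict([[0.0, 0.5]]))  # which side of X-Y a new point falls on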
Figure 2. Hidden Markov models. The small circles S1, S2 and S3 represent states. Boxes O1 and O2 represent output values. (In practical cases, hundreds of states/output values may occur.) The solid lines/arcs connecting states represent state switches; the arrow represents the switch's direction. (A state may switch back to itself.) Each line/arc label (not shown) is the switch probability, a decimal number. A dashed line/arc connecting a state to an output value indicates the ‘output probability’: the probability of that output value being generated from the particular state. If a particular switch/output probability is zero, the line/arc is not drawn. The sum of the switch probabilities leaving a given state (and, similarly, the sum of the output probabilities for that state) is equal to 1. The sequential or temporal aspect of an HMM is shown in figure 3.
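To make these probability constraints concrete, here is a small numeric sketch of such a model in Python/NumPy; the specific probability values are invented for illustration:

    # Three states S1-S3 and two output values O1-O2, as in figure 2.
    import numpy as np

    states, outputs = ["S1", "S2", "S3"], ["O1", "O2"]

    # Switch probabilities: row = current state, column = next state.
    # A zero entry corresponds to an arc that is not drawn.
    switch_p = np.array([[0.1, 0.6, 0.3],
                         [0.0, 0.5, 0.5],
                         [0.4, 0.0, 0.6]])

    # Output probabilities: row = state, column = output value.
    output_p = np.array([[0.9, 0.1],
                         [0.2, 0.8],
                         [0.5, 0.5]])

    # Each row sums to 1, as the caption requires.
    assert np.allclose(switch_p.sum(axis=1), 1.0)
    assert np.allclose(output_p.sum(axis=1), 1.0)

    # Walk the model to show its sequential/temporal aspect.
    rng, state = np.random.default_rng(0), 0  # start in S1
    for _ in range(5):
        print(states[state], "emits", outputs[rng.choice(2, p=output_p[state])])
        state = rng.choice(3, p=switch_p[state])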
Figure 3. The relationship between Naive Bayes, logistic regression, hidden Markov models (HMMs) and conditional random fields (CRFs). Logistic regression is the discriminative-model counterpart of Naive Bayes, which is a generative model. HMMs and CRFs extend Naive Bayes and logistic regression, respectively, to sequential data (adapted from Sutton and McCallum [73]). In the generative models, the arrows indicate the direction of dependency. Thus, for the HMM, the state Y2 depends on the previous state Y1, while the output X1 depends on Y1.
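The generative/discriminative pairing in the non-sequential case can be shown directly. The sketch below trains Naive Bayes and logistic regression on the same toy documents with scikit-learn (data, labels, and library are illustrative assumptions; the sequential HMM/CRF counterparts would require a sequence-labeling library and are omitted):

    from sklearn.naive_bayes import MultinomialNB
    from sklearn.linear_model import LogisticRegression
    from sklearn.feature_extraction.text import CountVectorizer

    docs = ["chest pain and dyspnea", "no acute distress",
            "severe chest pain", "patient feels well"]
    labels = [1, 0, 1, 0]  # toy labels: 1 = symptomatic

    X = CountVectorizer().fit_transform(docs)

    # Generative: models P(features | class) and P(class), then applies Bayes' rule.
    nb = MultinomialNB().fit(X, labels)

    # Discriminative: models P(class | features) directly.
    lr = LogisticRegression().fit(X, labels)

    print(nb.predict(X), lr.predict(X))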
Figure 4. A UIMA pipeline. An input task is sequentially put through a series of tasks, with intermediate results at each step and final output at the end. Generally, the output of a task is the input of its successor, but exceptionally, a particular task may provide feedback to a previous one (as in task 4 providing input to task 1). Intermediate results (eg, successive transformations of the original bus) are read from/written to the CAS (common analysis structure), which contains metadata defining the formats of the data required at every step, the intermediate results, and annotations that link to these results.
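The data flow the caption describes can be mimicked in a few lines of Python. Note that real UIMA components are written against the Java (or C++) UIMA SDK; the dictionary-based CAS, the task functions, and the annotation format below are invented stand-ins that show only the pipeline pattern:

    def tokenize(cas):
        cas["tokens"] = cas["text"].split()

    def annotate(cas):
        # Each task reads earlier results from the shared CAS and writes
        # its own annotations back, rather than passing data directly.
        cas["annotations"] = [{"token": t, "pos": i}
                              for i, t in enumerate(cas["tokens"])]

    pipeline = [tokenize, annotate]

    cas = {"text": "Patient denies chest pain"}  # the shared analysis structure
    for task in pipeline:
        task(cas)  # intermediate results accumulate in the CAS

    print(cas["annotations"])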


References

    1. Manning C, Raghavan P, Schuetze H. Introduction to Information Retrieval. Cambridge, UK: Cambridge University Press, 2008.
    2. Hutchins W. The First Public Demonstration of Machine Translation: the Georgetown-IBM System, 7th January 1954. 2005. http://www.hutchinsweb.me.uk/GU-IBM-2005.pdf (accessed 4 Jun 2011).
    3. Chomsky N. Three models for the description of language. IRE Trans Inf Theory 1956;2:113–24.
    4. Aho AV, Sethi R, Ullman JD. Compilers: Principles, Techniques, Tools. Reading, MA: Addison-Wesley, 1988.
    5. Chomsky N. On certain formal properties of grammars. Inform Contr 1959;2:137–67.
