United States Patent | 6,507,829 |
Richards, et al. | January 14, 2003 |
A method and apparatus for classifying textual data are provided. The invention is adapted to automatically classify text. In particular, the invention utilizes a sparse vector framework to evaluate natural language text and to accurately and automatically assign that text to a predetermined classification. This can be done even where the disclosed system has not seen an example of the exact text before. The disclosed method and apparatus are particularly well-suited for coding adverse event reports, commonly referred to as "verbatims," generated during clinical trials of pharmaceuticals. The invention also provides a method and apparatus that can be used to translate verbatims that have already been classified according to one coding scheme into another coding scheme in a highly automated process.
Inventors: | Richards; Jon Michael (San Francisco, CA); Kornai; Andras (Arlington, MA) |
Assignee: | PPD Development, Lp (Wilmington, NC) |
Appl. No.: | 483828 |
Filed: | January 17, 2000 |
Current U.S. Class: | 706/45; 704/9 |
Intern'l Class: | G06F 017/00 |
Field of Search: | 706/45,46,12 700/49 |
5991714 | Nov., 1999 | Shaner | 704/1. |
Duda, Richard O.; Hart, Peter E.; Stork, David G., "Linear Discriminant Functions", Chapter 5, Pattern Classification and Scene Analysis: Pattern Classification, Wiley, 1999.
Divita, Guy; Browne, Allen C.; Rindflesch, Thomas C., "Evaluating Lexical Variant Generation to Improve Information Retrieval", Proc. American Informatics Association 1998 Annual Symposium.
Fizames, Christian, MD, MPh, "How to Improve the Medical Quality of the Coding Reports Based on Who-Art and Costart Use", Drug Information Journal, vol. 31, pp. 85-92, 1997.
Lewis, David D.; Schapire, Robert E.; Callan, James P.; Papka, Ron, "Training Algorithms for Linear Text Classifiers", in proceedings of Annual Meeting of ACM Special Interest Group Information Retrieval, pp. 298-306, SIGIR'96 Zurich, Switzerland, 1996.
Chute, Christopher C.; Yang, Yiming, "An Overview of Statistical Methods for the Classification and Retrieval of Patient Events", Methods of Information in Medicine, vol. 34, No. 1/2, pp. 104-109, 1995.
Gillum, Terry L.; George, Robert H.; Leitmeyer, Jack E., "An Autoencoder for Clinical and Regulatory Data Processing", Drug Information Journal, vol. 29, pp. 107-113, 1995.
Schutze, Hinrich; Hull, David A.; Pedersen, Jan O., "A Comparison of Classifiers and Document Representations for the Routing Problem", in proceedings of Annual Meeting of ACM Special Interest Group Information Retrieval, pp. 229-237, SIGIR'95 Seattle, WA, 1995.
Dupin-Spriet, Therese, Ph.D.; Spriet, Alain, Ph.D., "Coding Errors: Classification, Detection, and Prevention", Drug Information Journal, vol. 28, pp. 787-790, 1994.
Press, William H.; Teukolsky, Saul A.; Vetterling, William T.; Flannery, Brian P., "Numerical Recipes in C: The Art of Scientific Computing", second edition, Cambridge University Press, Cambridge UK, 1993.
Joseph, Michael C., MD, MPH; Schoeffler, Kath; Doi, Peggy A.; Yefko, Helen; Engle, Cindy; Nissman, Erika F., "An Automated Costart Coding Scheme", Drug Information Journal, vol. 25, pp. 97-108, 1991.
Abney, Steven P., "Parsing by Chunks", Principle-Based Parsing: Computation and Psycholinguistics, pp. 257-278, Kluwer Academic Publishers, 1991.
Tou, Julius T.; Gonzalez, Rafael C., "Trainable Pattern Classifiers, the Deterministic Approach", Chapter 5, Pattern Recognition Principles, Addison Wesley, Reading, MA, 1974.
VECTOR N-GRAM | FREQUENCY | WEIGHT | PRODUCT |
"pre" | 1 | 0 | 0 |
"menstrual symptom" | 1 | .8 | .8 |
"lack of concentration" | 1 | .05 | .05 |
"bloating" | 1 | .4 | .4 |
"dull headache" | 1 | .05 | .05 |
"pre-menstrual symptom" | 1 | .98 | .98 |
"menstrual symptom-lack of concentration" | 1 | .9 | .9 |
"lack of concentration-bloating" | 1 | .2 | .2 |
"bloating-dull headache" | 1 | .15 | .15 |
SUM | | | 3.53 |

VECTOR N-GRAM | FREQUENCY | WEIGHT | PRODUCT |
"pre" | 1 | 0 | 0 |
"menstrual symptom" | 1 | .1 | .1 |
"lack of concentration" | 1 | .5 | .5 |
"bloating" | 1 | 0 | 0 |
"dull headache" | 1 | .9 | .9 |
"pre-menstrual symptom" | 1 | .2 | .2 |
"menstrual symptom-lack of concentration" | 1 | .05 | .05 |
"lack of concentration-bloating" | 1 | 0 | 0 |
"bloating-dull headache" | 1 | .5 | .5 |
SUM | | | 2.25 |
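
The two tables above can each be read as a sparse dot product: every n-gram's frequency is multiplied by a weight, and the products are summed. The following Python sketch reproduces that arithmetic under stated assumptions. The function names `ngram_counts` and `score`, and the labels `class_a` and `class_b`, are hypothetical and do not come from the patent. The segmentation of the verbatim into five chunks is taken as given from the table rows, and choosing the class with the larger sum is an assumed decision rule that this section does not state.

```python
from collections import Counter

def ngram_counts(chunks, max_n=2):
    """Count chunk n-grams; adjacent chunks are joined with '-' as in the tables."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(chunks) - n + 1):
            counts["-".join(chunks[i:i + n])] += 1
    return counts

def score(counts, weights):
    """Sparse dot product: n-grams absent from the weight map contribute 0."""
    return sum(freq * weights.get(gram, 0.0) for gram, freq in counts.items())

# Chunks as they appear in the table rows (segmentation assumed given).
chunks = ["pre", "menstrual symptom", "lack of concentration",
          "bloating", "dull headache"]

# Weights copied from the first and second tables; class labels are hypothetical.
weights = {
    "class_a": {
        "pre": 0.0, "menstrual symptom": 0.8, "lack of concentration": 0.05,
        "bloating": 0.4, "dull headache": 0.05,
        "pre-menstrual symptom": 0.98,
        "menstrual symptom-lack of concentration": 0.9,
        "lack of concentration-bloating": 0.2,
        "bloating-dull headache": 0.15,
    },
    "class_b": {
        "pre": 0.0, "menstrual symptom": 0.1, "lack of concentration": 0.5,
        "bloating": 0.0, "dull headache": 0.9,
        "pre-menstrual symptom": 0.2,
        "menstrual symptom-lack of concentration": 0.05,
        "lack of concentration-bloating": 0.0,
        "bloating-dull headache": 0.5,
    },
}

counts = ngram_counts(chunks)
scores = {label: round(score(counts, w), 2) for label, w in weights.items()}
best = max(scores, key=scores.get)  # assumed rule: highest sum wins
print(scores, best)
```

Because only n-grams that actually occur in the text are stored and multiplied, the cost of scoring one class grows with the length of the verbatim, not with the size of the full n-gram vocabulary.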