Chairs: Andras Kornai (Metacarta Inc.)
Beth Sundheim (US Navy SPAWAR Systems Center)
Members: Doug Appelt (SRI)
Merrick Lex Berman (Harvard Yenching Institute)
Sean Boisen (BBN Verizon)
Quintin Congdon (Army NGIC)
James Cowie (NMSU/CRL)
Linda Hill (Univ. California, Santa Barbara)
Doug Jones (MIT Lincoln Labs)
George Wilson (Mitre)
Effective analysis of textual references to places is a core technology critical to a wide variety of NLP applications. This workshop focuses on topics concerning the recognition, disambiguation, normalization, storage, and display of geographic references, e.g., "New York", "Nueva York", "LaGuardia Airport", "LaGuardia", "[the] Brooklyn Bridge", "a mile from downtown Manhattan", "the southern tip of Manhattan Island", "the Amazon delta", "the San Diego-Tijuana border".
Many place names are lexically ambiguous with respect to their physical location -- "Orange" as a county in either California or Florida, for example -- and sometimes also with respect to type -- "New York" as city or state in the U.S. Such names are relatively straightforward to normalize, though there is room for discussion about establishing some common normalization practices. Other forms of expression are more difficult to normalize, e.g., references to vague areas such as "the Amazon delta" and relative locations such as "a mile south of the village".
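A minimal sketch of the kind of context-based disambiguation described above, assuming a hand-built toy gazetteer; the entries, context words, and field names are illustrative assumptions, not a real resource:

```python
# Toy gazetteer: each ambiguous name maps to candidate entries, each with a
# set of context words that support that reading. Purely illustrative data.
GAZETTEER = {
    "Orange": [
        {"type": "county", "admin": "California", "context": {"california", "anaheim", "irvine"}},
        {"type": "county", "admin": "Florida", "context": {"florida", "orlando"}},
    ],
    "New York": [
        {"type": "city", "admin": "New York", "context": {"city", "manhattan", "brooklyn"}},
        {"type": "state", "admin": "USA", "context": {"state", "albany"}},
    ],
}

def disambiguate(name, text):
    """Pick the gazetteer entry whose context words best overlap the text."""
    tokens = set(text.lower().split())
    candidates = GAZETTEER.get(name, [])
    if not candidates:
        return None
    # Score each candidate by the number of supporting context words found.
    return max(candidates, key=lambda e: len(e["context"] & tokens))

entry = disambiguate("Orange", "The storm passed through Orlando and Orange county")
print(entry["admin"])  # Florida
```

Real systems would, of course, weigh many more evidence types (document topic, nearby toponyms, containment relations) than this bag-of-words overlap.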
To meet the challenges of providing accurate analyses across broad domains and languages, and of providing useful information on subjects for which training data is sparse, advances in core technology are needed that can bring to bear both lexical and spatial background knowledge about places worldwide. For most applications, it is not enough for the system to bracket a text string and tag it as a "LOCATION"; the system must normalize the information in a way that specifically describes or even uniquely identifies the place in question.
Since texts may contain place references without providing all the extra information needed to disambiguate them, the system needs background knowledge, in some form, that it can draw on for known names and their types and locations. One such knowledge resource is a placename gazetteer, and large electronic gazetteers are publicly available. How can they be tailored and exploited to meet the needs of NLP?
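As a concrete sketch, a flat gazetteer file can be loaded into a name index that returns all candidate entries for a surface form; the tab-separated field layout (name, feature type, administrative unit, latitude, longitude) is an illustrative assumption, not the format of any particular published gazetteer:

```python
from collections import defaultdict

# Illustrative gazetteer rows; coordinates are approximate.
ROWS = """London\tcity\tEngland\t51.51\t-0.13
London\tcity\tOntario\t42.98\t-81.25
Paris\tcity\tFrance\t48.86\t2.35"""

def build_index(rows):
    """Map each name to the list of gazetteer entries sharing that name."""
    index = defaultdict(list)
    for line in rows.splitlines():
        name, ftype, admin, lat, lon = line.split("\t")
        index[name].append({"type": ftype, "admin": admin,
                            "lat": float(lat), "lon": float(lon)})
    return index

index = build_index(ROWS)
print(len(index["London"]))  # 2 candidate entries for "London"
```

Ambiguity shows up immediately: a lookup returns multiple candidates, which is exactly why the disambiguation evidence discussed above is needed.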
Such resources may also contain foreign names in native script or in transliteration. These name forms are critically needed to support analysis in multilingual and cross-lingual settings. Recognition of the various ways that a given place may be referenced in one language is a challenging problem in and of itself, and issues of name translation and transliteration and special character sets multiply that problem.
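One small piece of the cross-lingual matching problem, spelling normalization for Latin-script diacritic variants, can be sketched via Unicode decomposition; this handles accents only, and transliteration from non-Latin scripts needs dedicated tools:

```python
import unicodedata

def normalize(name):
    """Strip diacritics via NFKD decomposition and case-fold, so that
    variant spellings of the same name compare equal."""
    decomposed = unicodedata.normalize("NFKD", name)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return stripped.casefold()

print(normalize("Córdoba") == normalize("Cordoba"))  # True
```

Such a normalized key is one simple way to match names across gazetteers whose entries differ only in accent marks or case.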
Map-based visualization of the results of geographic reference analysis introduces the challenge of associating places with specific locations. Coordinates identifying the centerpoint of a place may be found in some gazetteers; in some cases, more extensive coordinates may be available that approximately define the boundaries of the place. Although it's obvious that such gazetteer information enables data visualization, it's also clear that it can sometimes be useful in the process of geographic reference analysis itself, e.g., to identify the set of states intended by a phrase such as "the states that neighbor California" or "the states on the coast of the Gulf of Mexico". Research that explores this topic further is of interest to the workshop.
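To illustrate how boundary data can feed back into analysis, a minimal sketch of a containment test over gazetteer bounding boxes; the coordinate values below are rough illustrative figures, not authoritative boundaries:

```python
def contains(outer, inner):
    """True if bounding box `inner` lies entirely within `outer`.
    Boxes are (min_lat, min_lon, max_lat, max_lon)."""
    return (outer[0] <= inner[0] and outer[1] <= inner[1] and
            outer[2] >= inner[2] and outer[3] >= inner[3])

# Approximate bounding boxes, for illustration only.
california = (32.5, -124.4, 42.0, -114.1)
san_diego  = (32.5, -117.3, 33.1, -116.9)
print(contains(california, san_diego))  # True
```

Neighbor and border relations of the kind needed for "the states that neighbor California" require actual boundary polygons rather than boxes, but the principle, spatial computation over gazetteer geometry, is the same.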
• Disambiguation of recognized terms and geographic references based on text evidence ("London" in Ontario vs. "London" in England; "New York" as city vs. "New York" as state)
• Special aspects of recognizing and characterizing place entities (i.e., methods, etc. that do not apply equally to processing other types of entities, e.g., persons and organizations)
• Usage of background knowledge (external gazetteers or other knowledge sources) to assist in analysis of textual references to places: Partial term matching; cross-gazetteer term matching; methods for weighing name, type and location evidence to identify best match
• Usage of results of text analysis to improve knowledge resources: detecting and filling gaps in coverage of names, types and containment/coordinate data; detecting need to update existing entries due to text-based evidence of changes in place names and/or characteristics (e.g., a commercial building that is converted to a church, a city that becomes the capital of a newly defined republic)
• Gazetteer localization and multilingual fill; spelling normalization and cross-language term matching
• Automatic categorization and description of place names: Definition of standard categorization schemes and mapping among schemes; automatic processes for coarse- and fine-grained category assignment, e.g., to distinguish at some level among functional (bridge, building, ...), geographical (cave, bay, ...) and administrative-political (county, province, ...) types of places
• Interpretation and representation of complex references, such as relative locations ("10 miles from Ankara"), vague areas ("the Amazon delta"), boundaries ("the San Diego-Tijuana border"), metonymies ("Green Bay" used in reference to the football team)
• Language analysis uses for data on absolute coordinates (bounding boxes, polygons, polylines, etc.) in gazetteers, e.g., understanding containment, border and neighbor relationships via knowledge of areas and boundaries
• Standards for annotation and data interchange
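Relative locations such as "10 miles from Ankara" can be grounded by offsetting a gazetteer centerpoint by a distance and bearing; the sketch below uses a flat-earth approximation that is adequate at these scales, and the Ankara coordinates are approximate illustrative values:

```python
import math

MILES_PER_DEG_LAT = 69.0  # roughly constant over the globe

def offset(lat, lon, miles, bearing_deg):
    """Return the point `miles` away from (lat, lon) along compass
    bearing `bearing_deg` (0 = north, 90 = east), flat-earth approximation."""
    b = math.radians(bearing_deg)
    dlat = (miles * math.cos(b)) / MILES_PER_DEG_LAT
    # Longitude degrees shrink with latitude, hence the cos(lat) factor.
    dlon = (miles * math.sin(b)) / (MILES_PER_DEG_LAT * math.cos(math.radians(lat)))
    return lat + dlat, lon + dlon

ankara = (39.93, 32.86)  # approximate centerpoint
lat, lon = offset(*ankara, miles=10, bearing_deg=180)  # 10 miles due south
print(round(lat, 2))  # ~39.79
```

When, as in "10 miles from Ankara", no bearing is given, the natural representation is not a point but a circle of that radius around the centerpoint; vague areas and boundaries need still richer representations.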