Introduction

Many large organisations continuously monitor the media, and especially the news, to stay informed about events of interest and to find out what the media say about certain persons, organisations, or subjects. Software tools that automatically pre-select the news articles of interest and pre-process the chosen text collection simplify the daily repetitive task of media monitoring. Crestan & de Loupy (2004) showed that Named Entity extraction and visualisation help users to browse large document collections more quickly and efficiently. This seems plausible as, according to Gey (2000), 30% of content-bearing words in news are proper names.

In news analysis it is important to know What is the subject, Who is being talked about, Where and When things happened, and How it was reported. This paper focuses on the occurrence of proper names in news, i.e. […] Previous work focused on answering the questions What (Pouliquen et al. […] Due to the highly multilingual work environment in the European Commission, an organisation with twenty official languages, multilinguality of tools and the cross-lingual aspect are of prime importance.

Our analysis is applied to the output of the Europe Media Monitor system EMM (Best et al., 2002). EMM is a software toolset that monitors a daily average of 25,000 news articles in currently 30 languages, drawn from 800 different international news sources. For a subset of about 15,000 articles per day in currently eight languages, we apply unsupervised hierarchical clustering techniques to group related articles separately for each language. We then track related news clusters within the same language and across six of the languages (Pouliquen et al. […] The JRC's name recognition tools are applied to each of these clusters, i.e. each group of related texts is treated as one meta-text, for which person and geographical place names are extracted and keywords are identified.

After giving some background on name transliteration and referring to related work (Section Background and related work), we describe tools to identify names in text (Section Proper name recognition) and the mechanism to merge name variants, including those written in Cyrillic, Arabic, and Greek script (Section Detecting and merging name variants). This is followed by evaluation results (Section Evaluation) and by a section on learning relationships between people and how the automatically generated information on names can be used in automatic news analysis (Section Using names to explore document collections).

The italics being the recognised trigger word(s).

We thank the whole team of the Web Technology sector at the JRC for providing us with the valuable news data to test the tools, as well as for their technical support. We also want to thank Carlo Ferigato, who introduced us to various fuzzy matching techniques. We thank Tomaž Erjavec for helping us with the Slovene language, and Helen Salak for providing us with knowledge about Farsi.

The following illustrates some of the transliterations available in CLDR. Note: these charts are preliminary; for more information, see below.

- A cell with a blue background indicates a case that doesn't roundtrip.
- A cell with a red background indicates a missing case.
- Hovering over each cell should show the character name, if enabled on your browser.
- The CLDR data currently does not contain many language-specific transliterations. So, for example, the Cyrillic transliteration is not very natural for English speakers.
- The unmarked script transliterations to Latin are generally designed to be reversible, thus some of the transliterations use extra accents to provide for a round-trip. (Implementations like ICU allow those to be easily stripped.)
- Less common characters may be missing, as may be some characters that don't appear in isolation.
- Some transliterations only work in context, which won't be visible. For example, an isolated 'a' transliterates to […]
- There are known bugs in some of the charts, such as Hangul.
- Some browsers will not show combinations of accents correctly.
- Because the context is not taken into account, significant combinations will not show in the charts. For example: for Greek, Φ shows as 'PH', whereas the transliteration rules will change it to 'Ph' in front of lowercase letters.
- Characters that are not normally used in isolation, such as ぁ, will show in an odd format (e.g. with extra punctuation marks or accents).
- Only the script-script charts are shown.
- The display in some of the charts needs to be improved, such as Greek, Indic, and Kana. In particular, the non-Latin to Latin display needs to be merged.
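
Two of the notes above lend themselves to a small illustration: extra accents added for round-tripping can be stripped afterwards, and a digraph such as 'PH' changes case depending on the following letter. The sketch below uses only Python's standard `unicodedata` module. The function names `strip_accents` and `transliterate_phi` are hypothetical, the sample word is arbitrary, and the single-letter rule for Φ is a toy stand-in, not the CLDR rule set or the ICU API.

```python
import unicodedata


def strip_accents(text: str) -> str:
    """Drop combining marks after canonical decomposition (NFD)."""
    decomposed = unicodedata.normalize("NFD", text)
    kept = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", kept)


def transliterate_phi(text: str) -> str:
    """Toy context rule (assumption): Φ -> 'Ph' before a lowercase letter,
    'PH' otherwise. All other characters are left untouched."""
    out = []
    for i, ch in enumerate(text):
        if ch == "Φ":
            nxt = text[i + 1] if i + 1 < len(text) else ""
            out.append("Ph" if nxt.islower() else "PH")
        else:
            out.append(ch)
    return "".join(out)


print(strip_accents("Čajkovskij"))   # Cajkovskij
print(transliterate_phi("Φ"))        # PH  (isolated, as in a chart cell)
print(transliterate_phi("Φως"))      # Phως (lowercase letter follows)
```

The isolated case is what a per-character chart would show, while the in-context case only appears when the neighbouring letter is visible to the rule, which is why such combinations cannot be read off the charts.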