Essay: Cognitive aspects of lexicon in the light of the language picture of the world
Dictionaries
information access lexicography
In most paper-based general and learners’ dictionaries, only some information about synonyms, and sometimes antonyms, is presented. Newer dictionaries, such as the “Longman Language Activator” (Summers, 1999), provide lists of related words. While these are useful to learners, information about the kind of semantic relation is usually missing.
Semantic relations are often available in electronic resources, most famously in WordNet (Fellbaum, 1998) and related projects like Kirrkirr (Jansz et al., 1999), ALEXIA (Chanier and Selva, 1998), or as described in Fontenelle (1997). However, these resources tend to include few relation types (hypernymy, meronymy, antonymy, etc.).
The salience of the relations chosen is not verified experimentally, and the same set of relation types is used for all words that share the same part of speech. Our results below, as well as work by Vinson et al. (2008), indicate that different concept classes should instead be characterized by different relation types (e.g., function is very salient for tools, but not at all for animals).
Work in Cognitive Sciences
Several projects have collected property generation data to provide the community with feature norms for use in psycholinguistic experiments and other analyses. Garrard et al. (2001) instructed subjects to complete phrases (“concept is/has/can ...”), thus restricting the set of producible feature types. McRae et al. (2005) instructed their subjects to list concept properties without such restrictions, but provided them with some examples. Vinson et al. (2008) gave similar instructions, but explicitly asked subjects not to freely associate.
However, these norms have been collected for the English language. It remains to be explored whether concept representations in general, and the semantic relations relevant to our specific investigations, have the same properties across languages.
Data Collection
After choosing the concept classes and appropriate concepts for the production experiment, concept descriptions were collected from participants.
These were transcribed, normalized, and annotated with semantic relation types.
Stimuli
The stimuli for the experiment consisted of 50 concrete concepts from 10 different classes (i.e., 5 concepts for each of the classes): mammal (dog, horse, rabbit, bear, monkey), bird (seagull, sparrow, woodpecker, owl, goose), fruit (apple, orange, pear, pineapple, cherry), vegetable (corn, onion, spinach, peas, potato), body part (eye, finger, head, leg, hand), clothing (chemise, jacket, sweater, shoes, socks), manipulability tool (comb, broom, sword, paintbrush, tongs), vehicle (bus, ship, airplane, train, truck), furniture (table, bed, chair, closet, armchair), and building (garage, bridge, skyscraper, church, tower). They were mainly taken from Garrard et al. (2001) and McRae et al. (2005). The concepts were chosen so that they had unambiguous, reasonably monosemic lexical realizations in both target languages.
The words representing these concepts were translated into the two target languages, German and Italian. A statistical analysis (using Tukey’s honestly significant difference test as implemented in the R toolkit) of word length distributions (within and across categories) showed no significant differences in either language. There were, however, significant differences in the frequency of target words, as collected from the German, Italian and English WaCky corpora. In particular, words of the class body part had significantly larger frequencies across languages than the words of the other classes (not surprisingly, the words eye, head and hand appear much more often in corpora than the other words in the stimuli list).
Experimental Procedure
The participants in the concept description experiment were students attending the last 3 years of a German or Italian high school and reported to be native speakers of the respective languages. 73 German and 69 Italian students participated in the experiment, with ages ranging between 15 and 19.
The average age was 16.7 (standard deviation 0.92) for Germans and 16.8 (s.d. 0.70) for Italians. The experiment was conducted group-wise in schools. Each participant was provided with a random set of 25 concepts, each presented on a separate sheet of paper. To have an equal number of participants describing each concept, for each randomly matched subject pair the whole set of concepts was randomised and divided into 2 subsets.
Each subject saw the target stimuli in his/her subset in a different random order (due to technical problems, the split was not always different across subject pairs).
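The assignment procedure described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' actual script: the function name `assign_concepts`, the `seed` parameter, and the handling of a subject left without a partner are all hypothetical.

```python
import random


def assign_concepts(subjects, concepts, seed=0):
    """Pair subjects at random and give each pair member one half
    of a freshly shuffled concept list (hypothetical helper)."""
    rng = random.Random(seed)
    shuffled_subjects = list(subjects)
    rng.shuffle(shuffled_subjects)      # random matching of subject pairs

    assignment = {}
    # Walk through the shuffled subjects two at a time. Assumption: a
    # leftover subject without a partner just receives the first half.
    for i in range(0, len(shuffled_subjects), 2):
        pair = shuffled_subjects[i:i + 2]
        pool = list(concepts)
        rng.shuffle(pool)               # randomise the whole concept set per pair
        half = len(pool) // 2
        halves = [pool[:half], pool[half:]]
        # Each half of a shuffled list is itself in random order, so it
        # doubles as the presentation order for that subject.
        for subject, subset in zip(pair, halves):
            assignment[subject] = subset
    return assignment
```

With 50 concepts and an even number of subjects, every pair covers the full concept set exactly once, so each concept is described by half of the subjects.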
Short instructions were provided orally before the experiment, and repeated in writing on the front cover of the questionnaire booklet distributed to each subject. To make the concept description task more natural, we suggested that participants should imagine a group of alien visitors, to each of whom a particular word for a concrete object was unknown and thus had to be described.
Participants should assume that each alien visitor knew all other words of the language apart from the unknown (target) word.
Participants were asked to enter a descriptive phrase per line (not necessarily a whole sentence) and to try and write at least 4 phrases per word.
They were given a maximum of one minute per concept, and they were not allowed to go back to the previous pages.
Before the real experiment, subjects were presented with an example concept (not in the target list) and were encouraged to describe it while asking for clarification about the task.
All subjects returned the questionnaire so that for a concept we obtained, on average, descriptions by German subjects
Transcription and Normalization
The collected data were digitally transcribed and responses were manually checked to make sure that phrases denoting different properties had been properly split. We tried to systematically apply the criterion that, if at least one participant produced 2 properties on separate lines, then the properties would always be split in the rest of the data set.
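The splitting criterion can be illustrated with a small sketch. The checking in the study was done manually. The code below is only a hypothetical automation, and it rests on two simplifying assumptions: that combined properties are joined by a comma or by "and", and that a part counts as attested whenever any participant wrote it on a line of its own.

```python
import re

# Assumed separators between properties; chosen for illustration only.
SEPARATOR = re.compile(r"\s*(?:,|\band\b)\s*")


def split_consistently(responses):
    """responses: one list of phrases (one phrase per line) per participant."""
    # Phrases that some participant wrote on a line of their own.
    standalone = {
        phrase
        for lines in responses
        for phrase in lines
        if len(SEPARATOR.split(phrase)) == 1
    }
    result = []
    for lines in responses:
        new_lines = []
        for phrase in lines:
            parts = [p for p in SEPARATOR.split(phrase) if p]
            # Split only if every part is attested as a separate line somewhere.
            if len(parts) > 1 and all(p in standalone for p in parts):
                new_lines.extend(parts)
            else:
                new_lines.append(phrase)
        result.append(new_lines)
    return result
```

For instance, if one participant lists "has fur" and "barks" on separate lines, another participant's single line "has fur and barks" is split into the same two properties, while "red and white" stays whole unless both parts are attested separately.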
However, this approach was not always equally applicable in both languages. For example, Transportmittel (German) and mezzo di trasporto (Italian) are both compounds used as hypernyms for what English speakers would probably rather classify as vehicles. In contrast to Transportmittel, mezzo di trasporto can be split: mezzo can also be used on its own to refer to a kind of vehicle (and is defined more specifically by adding the fact that it is used for transportation). The German compound word also refers to the function of transportation, but -mittel has a rather general meaning, and would not be used alone to refer to a vehicle.
Hence, Transportmittel was kept as a whole and the Italian quasi-equivalent was split, possibly creating a bias between the two data sets (if the Italian string is split into mezzo and trasporto, these will later be classified as hypernym and functional features, respectively; if the German word is not split, it will receive only one of these type labels). More generally, note that in German, compounds are written as single orthographic words, whereas in Italian the equivalent concepts are often expressed by several words. This could create further bias in the data annotation and hence in the analysis.
Data were then normalized and translated into English, before annotating the type of semantic relation. Normalization was done in accordance with McRae et al. (2005), using their feature norms as guidelines. It included leaving out habitual words such as “normally”, “often”, “most”, etc., as they just express the typicality of the concept description, which is implicit in the task.
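As a minimal sketch of the habitual-word step only, the filter below drops the three words quoted above from a phrase. The function name `normalize` and the tokenization are assumptions made for illustration. The full normalization followed the McRae et al. (2005) guidelines, and the source does not give a complete word list.

```python
import string

# Only the words quoted in the text; the full list is not given in the source.
HABITUAL_WORDS = {"normally", "often", "most"}


def normalize(phrase):
    """Lowercase a phrase, strip surrounding punctuation from each token,
    and drop habitual words (hypothetical helper)."""
    tokens = (t.strip(string.punctuation) for t in phrase.lower().split())
    return " ".join(t for t in tokens if t and t not in HABITUAL_WORDS)
```

Under these assumptions, `normalize("Is often kept as a pet.")` yields `"is kept as a pet"`.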