The POS-tagging of Learner Corpora for Second Language Acquisition Research
This site describes the SLAt project, which proposes a way to POS-tag learner corpora automatically, and to run queries on them, using well-known, freely available software designed for L1s. The SLAt project holds closely to four points: (1) SLA research consists in making hypotheses about learner data; (2) far from replacing human re-analysis, an automatic tagger can only 'prepare the field' for such hypotheses, by making large amounts of data easy to manage; (3) the more competing hypotheses an automatic tagger allows researchers to make, the more useful it is; (4) the more the tagger's designer excludes preconceived ideas about the nature of learner data, the less its outcomes are spoiled by circularity.
Our starting assumption is twofold: (a) any tagger based on errors (L1/L2 discrepancies) fails drastically to meet point (4); (b) the probabilistic TreeTagger, designed for L1s, could allow researchers to make competing hypotheses about L2 data, provided that they know how to filter the results and where to direct their gaze.
A similar solution has been adopted independently by researchers working on other Italian learner corpora, notably by the VALICO corpus project at the University of Turin (see http://www.valico.org/) and, in part, by the LIPS corpus in Siena. We hope that in the future it will be possible to share research protocols on L2 acquisition, possibly based on an adapted version of Schmid's TreeTagger.
To let researchers practise running their queries, we provide a downloadable XML-tagged sample drawn from different Italian learner corpora. A description of other ongoing Italian learner corpus projects is also provided.
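As an illustration of what "filtering the results" of an L1 tagger might look like in practice, here is a minimal Python sketch. It is not the project's own tooling. It assumes tagger output in the three-column layout that TreeTagger produces when token and lemma output are enabled (token, tag, lemma, separated by tabs), and it relies on TreeTagger's `<unknown>` lemma marker to flag forms the L1 lexicon does not recognise. The sample sentence, the tag labels, and the function names `parse_tagged`, `unknown_lemmas` and `tag_counts` are all invented for this example.

```python
from collections import Counter

# Invented learner sample in an assumed TreeTagger-like layout:
# token <TAB> tag <TAB> lemma. Tag labels are illustrative only.
SAMPLE = "\n".join([
    "io\tPRO:pers\tio",
    "andato\tVER:pper\tandare",
    "a\tPRE\ta",
    "scuolo\tNOM\t<unknown>",
    "ieri\tADV\tieri",
])


def parse_tagged(text):
    """Turn tab-separated tagger output into (token, tag, lemma) tuples."""
    rows = []
    for line in text.splitlines():
        parts = line.split("\t")
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        rows.append(tuple(parts))
    return rows


def unknown_lemmas(rows):
    """Return (position, token, tag) for tokens the tagger could not lemmatise."""
    return [
        (i, token, tag)
        for i, (token, tag, lemma) in enumerate(rows)
        if lemma == "<unknown>"
    ]


def tag_counts(rows):
    """Count how often each tag was assigned."""
    return Counter(tag for _, tag, _ in rows)


rows = parse_tagged(SAMPLE)
flagged = unknown_lemmas(rows)
for position, token, tag in flagged:
    print(f"token {position}: {token!r} tagged {tag}, lemma unknown")
```

A list like `flagged` makes no claim about what the learner form is. It only points the researcher to places where the L1 model had no lexical entry and the assigned tag therefore rests on context alone, which is where competing hypotheses are most likely to be worth formulating.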