some of the four-hour/month samples yielded inaccurate estimates (range = 12.50% to
57%, SD = 14.60%).
Importantly, the variance across samples was caused only by chance variation in
the number of correct questions and errors captured in any particular sample. In real
terms, the only difference between the samples that showed no or low error rates and
those that showed high error rates was the inclusion or exclusion of one or two inversion errors. However, this chance inclusion/exclusion had a large impact on error rates
because so few questions overall were captured in each sample (on average, six questions with DO/modals in the four-hour samples, three in the two-hour samples, two in
the one-hour samples). Rowland and Fletcher concluded that studies using small samples can substantially over- or under-estimate error rates in utterance types that occur
relatively infrequently, and thus that calculations of error rates based on small amounts
of data are likely to be misleading.
2.2 The effect of calculating overall error rates
To sum up so far, small samples can lead one to miss rare phenomena, can fail to capture short-lived errors or errors in low frequency structures, and can inaccurately
estimate error rates. Given these facts, the temptation is to sacrifice a more fine-grained
analysis of performance in different parts of the system in favour of an overall error
rate in order to ensure enough data for reliable analysis. Thus, the most popular method of assessing the rate of error is to calculate the total number of errors as a proportion of all the possible contexts for error. For example, Stromswold (1990) reports the
error rate of auxiliaries as:
Number of auxiliary errors / Total number of contexts that require an auxiliary (i.e. correct use + errors)
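To make the arithmetic concrete, here is a minimal sketch of this calculation with invented counts (the figures are purely illustrative and are not Stromswold’s data):

# Hypothetical worked example of the overall error-rate calculation
# (invented counts, for illustration only)
aux_errors = 5
correct_aux_uses = 95
error_rate = aux_errors / (correct_aux_uses + aux_errors)  # contexts = correct uses + errors
print(f"{error_rate:.1%}")  # 5.0%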
This method clearly maximizes the amount of data available in small samples. However, this method also leads to an under-estimation of the incidence of errors in certain cases, particularly errors in low frequency structures or short-lived errors. There
are three main problems. First, overall error rates will be statistically dominated by
high frequency items, and thus will tend to represent error rates in high, not low, frequency items. Second, overall error rates fail to give a picture of how error rates change
over time. Third, overall error rates can hide systematic patterns of error specific to
certain subsystems.
2.2.1 High frequency items dominate overall error rates
High frequency items will statistically dominate overall error rates. This problem is outlined clearly by Maratsos (2000) in his criticism of the “massed-token pooling methods”
(p.189) of error rate calculation used by Marcus et al. (1992). In this method, Marcus et
How big is big enough
al. calculated error rates by pooling together all tokens of irregular verbs (those that
occur with correct irregular past tense forms and those with over-regularized pasts) and
calculating the error rate as the proportion of all tokens of irregular pasts that contain
over-regularized past tense forms. Although this method maximizes the sample size
(and thus the reliability of the error rate), it gives much more weight to verbs with high
token frequency, resulting in an error rate that disproportionately reflects how well children perform with these high frequency verbs. For example, verbs sampled over 100
times contributed 10 times as many responses as verbs sampled 10 times and “so have
statistical weight equal to 10 such verbs in the overall rate” (Maratsos 2000: 189).
To illustrate his point, Maratsos analysed the past-tense data from three children
(Abe, Adam and Sarah). Overall error rates were low as Marcus et al. (1992) also reported. However, Maratsos showed that overall rates were disproportionately affected
by the low rates of errors for a very small number of high frequency verbs which each
occurred over 50 times (just 6 verbs for Sarah, 17 for Adam, 11 for Abe). The verbs that
occurred less than 10 times had a much smaller impact on the overall error rate simply
because they occurred less often, despite there being more of them (40 different verbs
for Abe, 22 for Adam, 33 for Sarah). However, it was these verbs that demonstrated
high rates of error (58% for Abe, 54% for Adam, 29% for Sarah). Thus, Maratsos showed
that overall error rates disproportionately reflect how well children perform with high
frequency items and can hide error rates in low frequency parts of the system.
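The weighting effect that Maratsos describes can be illustrated with a small sketch; the verbs and counts below are invented for illustration and are not drawn from Maratsos (2000) or Marcus et al. (1992):

# Hypothetical illustration of how massed-token pooling lets high frequency verbs
# dominate the overall over-regularization rate (all counts invented)
verbs = {
    # verb: (over-regularized tokens, total irregular-past tokens)
    "went": (2, 200),  # high frequency, 1% error
    "came": (1, 100),  # high frequency, 1% error
    "blew": (6, 10),   # low frequency, 60% error
    "drew": (5, 10),   # low frequency, 50% error
}

pooled_rate = sum(e for e, _ in verbs.values()) / sum(t for _, t in verbs.values())
per_verb_rates = {v: e / t for v, (e, t) in verbs.items()}

print(f"token-pooled error rate: {pooled_rate:.1%}")       # ~4.4%: looks reassuringly low
print({v: f"{r:.0%}" for v, r in per_verb_rates.items()})  # but the rare verbs err at 50-60%

The pooled figure is low simply because the two high frequency verbs contribute 15 times as many tokens as the two rare ones, exactly the weighting problem described above.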
2.2.2 Overall error rates collapse over time
A second problem with using overall error rates is that they provide only a representation of average performance over time, taking no account of the fact that since children
will produce fewer errors as they age, error rates are bound to decrease with time. This
problem is intensified by the fact that since children talk more as they get older, overall
error rates are likely to be statistically dominated by data from later, perhaps less error-prone, periods of acquisition. This is illustrated by Maslen, Theakston, Lieven and Tomasello’s (2004) analysis of the past tense verb uses in the dense data of one child, Brian,
who was recorded for five hours a week from age 2;0 to 3;2, then for four or five hours
a month (all recorded during the same week) from 3;3 to 3;11. Because of the denseness
of the data collection, Maslen et al. were able to chart the development of irregular past
tense verb use over time, using weekly samples. They reported that, although the overall error rate was low (7.81%), error rates varied substantially over time, reaching a
peak of 43.5% at 2;11 and gradually decreasing subsequently. They concluded that
“viewed from a longitudinal perspective, … regularizations in Brian’s speech are in fact
more prevalent than overall calculations would suggest” (Maslen et al. 2004: 1323).
2.2.3 Overall error rates collapse over subsystems
Third, the use of overall error rates can hide systematic patterns of error specific to
some of the sub-systems within the structure under consideration. Aguado-Orea and
Pine’s (2005, see also Aguado-Orea 2004) analysis of the development of subject-verb
agreement in children learning Spanish demonstrates this problem. Aguado-Orea and
Pine (2005) analysed dense data from two monolingual Spanish children (approximately aged 2 years). They reported that the overall rate of agreement error in present
tense contexts over a six month period was 4% for both children (see Table 2), as had
been systematically reported in the literature (Gathercole, Sebastián and Soto 1999;
Hoekstra and Hyams 1998; Hyams 1986; Pizzuto and Caselli 1992).5 Yet this figure
overwhelmingly represented how good the children were at providing the correct inflections in 1st and 3rd person singular contexts, which made up over 85% of verb
contexts. Error rates for the other, less frequent, inflections were much higher, especially when the rarer 3rd person plural inflections were required (which comprised
only 8% of all verb contexts). For Juan, 31% of the verbs that required 3rd person plural inflections had inaccurate inflectional endings. For Lucia, this figure was 67%; in
other words, agreement errors occurred in over two thirds of the verbs that required
3rd person plural inflections.6
Not only were rates of error higher for low frequency inflectional contexts, but error
rates for high frequency verbs were significantly lower than error rates for low frequency verbs even within the frequent 3rd person singular inflectional contexts. Thus,
the conclusion of Aguado-Orea and Pine (2005) was not only that agreement error
rates can be high in certain parts of the system, but that they are high in low frequency
parts of the system – with a strong relation between the frequency of a verb form in the
input and the accuracy with which a child can use that verb form in his/her own production. In other words, overall error rates – which are bound to disproportionately
reflect the children’s performance on high frequency structures – will inevitably under-estimate the true extent of errors in low frequency structures.
Table 2. Number of verb contexts requiring present tense inflection and percentage rate of
agreement error.*
Child    Total         Singular                                    Plural
                       1st person    2nd person    3rd person      1st person    2nd person    3rd person
Juan     4.5 (3151)    4.9 (693)     10.2 (147)    0.7 (1997)      0 (61)        33.3 (3)      31.5 (251)
Lucia    4.6 (1676)    3.0 (469)     22.9 (96)     0.5 (1018)      0 (14)        – (0)         66.7 (48)

Cells show % agreement error (no. of contexts).
* The table is based on data from Aguado-Orea (2004).
5. % agreement error = (Number of incorrect inflections / (Number of incorrect + correct inflections)) × 100
6. High rates of error remained even when verbs produced before the children had attested knowledge of the correct inflections were removed.
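The same arithmetic can be seen by reconstructing Juan’s overall rate from the cells of Table 2. The sketch below back-computes approximate error counts from the published percentages and context totals, so the counts are rounded estimates rather than Aguado-Orea’s raw data:

# Approximate reconstruction of Juan's figures from Table 2; error counts are
# back-computed from the published percentages, so they are rounded estimates
contexts  = {"1sg": 693, "2sg": 147, "3sg": 1997, "1pl": 61, "2pl": 3, "3pl": 251}
pct_error = {"1sg": 4.9, "2sg": 10.2, "3sg": 0.7, "1pl": 0.0, "2pl": 33.3, "3pl": 31.5}

errors = {c: round(contexts[c] * pct_error[c] / 100) for c in contexts}
overall = sum(errors.values()) / sum(contexts.values())

print(f"overall agreement error: {overall:.1%}")  # ~4.5%, dominated by the large 1sg and 3sg cells
print(f"3rd person plural error: {errors['3pl'] / contexts['3pl']:.1%}")  # ~31.5%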
2.3 Sampling and error rates: Some solutions
To recap, sample size can have a significant effect on the calculation of error rates.
Small samples are unlikely to capture even one instance of a low frequency error. Even
when errors are recorded, error rates based on small amounts of data are unreliable
because the chance absence or presence of a few tokens can have a substantial effect on
the calculation. Using overall error rates can help alleviate this problem (by maximising the amount of data included in the calculation) but can misrepresent error rates in
low frequency subsections of the system. Thus, it is important to analyse different subsystems separately. However, analysing data at such a fine-grained level sometimes
means that the amount of data on which error rates are based can be very small, even
in substantial corpora. And this leads back to the original problem – when analysing
small samples of data, we often fail to capture rare errors.
The most obvious solution is to collect a lot more data, but this is not always practical or cost-effective. There are a number of alternative solutions, both for those recording new corpora and those using existing datasets.
2.3.1 Techniques for maximising the effectiveness of new corpora
2.3.1.1 Statistical methods for assessing how much data is required
The simplest way to calculate how much data is necessary for the study of a particular
error is to estimate the number and proportion of errors we would expect to capture
given the proportion of data that we are sampling (e.g., a one hour/week sampling
density might capture one example of an error that occurs once an hour every week).
However, this works only if the child regularly produces one error every hour, an improbable assumption. In reality, children’s errors are likely to be more randomly distributed across their speech.
Given this fact, Tomasello and Stahl (2004) suggest calculating hit rates (or hit
probabilities). A hit rate is the “probability of detecting at least one target during a
sampling period” (p. 111), and supplies an estimate of the likelihood of capturing an
error given a particular error rate and sampling density. Figure 2 reproduces Tomasello
and Stahl’s analysis, using the same method of calculation7 and based on the same assumptions (see Footnote 1). The figure plots hit rate (y-axis) against sampling density
(x-axis) for a number of rates of occurrence. This figure can then be used to work out how dense the sampling needs to be to capture errors of different frequency (an accompanying Excel file which can be downloaded and used to calculate the required sampling density for targets of any frequency is available at http://www.liv.ac.uk/psychology/clrc/clrg.html).
7. Hit rate is defined as the probability of detecting one (or more) Poisson distributed target event, which is equal to 1 minus the probability of no events occurring, and is thus calculated:
Hit rate = 1 – [p(k = 0)]
where p(k = 0) is the probability that no Poisson distributed target will be captured and is calculated:
p(k = 0) = e^(–λ)
where e is the base of the natural logarithm (e = 2.71828) and λ = (expected error rate × sampling rate)/waking hours.
For example, let us assume that we want to be 95% certain that our sampling regime will capture at least one error (i.e. we set our criterion to p = 0.05), and that we
estimate that an error occurs 70 times a week. Figure 2 shows that we need to sample
for three hours a week to be 95% certain of capturing at least one error. If the error
occurs only 35 times per week, we need to sample for six hours per week.
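These figures can be reproduced with a short calculation based on the formula in Footnote 7. The sketch below is our own illustration, not Tomasello and Stahl’s code, and it assumes roughly 70 waking hours per week (10 hours a day), the value that makes the worked examples in the text come out as reported:

# Hit-rate sketch following Footnote 7 (our illustration, not Tomasello and Stahl's code)
import math

WAKING_HOURS_PER_WEEK = 70  # assumed: roughly 10 waking hours a day

def hit_rate(errors_per_week, sampled_hours_per_week, waking=WAKING_HOURS_PER_WEEK):
    """Probability of capturing at least one Poisson distributed error in the sample."""
    lam = errors_per_week * sampled_hours_per_week / waking
    return 1 - math.exp(-lam)

def hours_needed(errors_per_week, certainty=0.95, waking=WAKING_HOURS_PER_WEEK):
    """Weekly sampling hours needed to reach the desired hit probability."""
    return -math.log(1 - certainty) * waking / errors_per_week

print(round(hit_rate(70, 3), 2))   # 0.95: 3 h/week suffices for an error occurring 70 times/week
print(round(hit_rate(35, 6), 2))   # 0.95: 6 h/week for an error occurring 35 times/week
print(round(hit_rate(7, 15), 2))   # 0.78: even 15 h/week falls short at 7 errors/week
print(round(hours_needed(14), 1))  # 15.0: hours/week needed for 14 errors/week at 95% certainty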
Figure 2. Probability of capturing at least one target during a one week period, given different sampling densities and target frequencies.
For more infrequent errors (those that occur only 14 or 7 times/week), the figure demonstrates that even intensive sampling regimes may not be enough. For errors that
occur 14 times/week, we need 15 hours data collection per week to be 95% sure of
capturing one or more errors. Even sampling for 15 hours per week, we would only be
78% certain of capturing at least one error at the 7 errors/week rate. More importantly,
the figure only provides information about the sampling density required to capture at
least one error in our sample. If we wish to capture more errors (which is necessary if,
for example, we want to calculate an accurate error rate) we will need to sample even
more intensively. Since this is unlikely to be cost-effective, for rare errors it is important to consider alternatives to simply increasing sampling density.
2.3.1.2 Using different types of sampling regimes
The most popular sampling regime in the child language literature is what we will call
continuous sampling, where data is collected at regular intervals (e.g., every week, every fortnight) over a period of time. However, there are a number of alternatives that
might be better suited to analyses of phenomena that occur with low frequency. One
alternative is to sample densely for a short period of time, with long temporal gaps
between samples (interval sampling). For example, we could sample 5 hours every
week, but only for one week per month (Abbot-Smith and Behrens 2006; Maslen et
al. 2004). This way we can be sure of gaining a detailed picture of the child’s language
at each time point. Another idea is to sample not all utterances but only those of interest
(targeted sampling; similar to sequence sampling in the animal behaviour literature).
This is the technique used by Tomasello (1992) and Rowland (e.g., Rowland and
Fletcher 2006, Rowland et al. 2005) and involves recording all (or nearly all) the productions of a particular structure (e.g., utterances with verbs, wh-questions). Alternatively we could sample only during situations likely to elicit the target structures (situational sampling; e.g., Rowland’s Lara produced most of her why questions during car
journeys). Finally, a more systematic investigation of a particular structure could be
achieved by introducing elicitation games into the communicative context. For example, Kuczaj and Maratsos (1975) introduced elicited imitation games designed to encourage the production of low frequency auxiliaries into their longitudinal study of
Abe’s language. These games not only provided detailed information about Abe’s auxiliary use, but also demonstrated where the naturalistic data failed to provide an accurate
picture of development (e.g., Abe was able to produce non-negative modals correctly
in elicited utterances even though he never produced them in spontaneous speech).
2.3.2 Techniques for maximising the reliability of analyses on existing corpora
For researchers using datasets that have already been collected (e.g., those available on
CHILDES, MacWhinney 2000), it is important to use statistical procedures to assess
the accuracy of error rates. Tomasello and Stahl’s (2004) hit rate method (described
above) can be used to calculate whether an existing sample is big enough to capture a
target structure. However, there are also ways of maximising the use of datasets that,
though too small for reliable analysis in isolation, can be combined with other datasets
to provide accurate estimates of error rates.
2.3.2.1 Statistical methods
The simple fact that “a group of individual scores has more reliability than the individual
scores” themselves (Maratsos 2000: 200) can be exploited to provide more accurate error
estimates. In particular, mean error rates calculated either over a number of children or