
Table 1. Rates of inversion error in Lara's wh-questions calculated from samples of different sizes (% of questions).
some of the four-hour/month samples yielded inaccurate estimates (range = 12.50% to 57%, SD = 14.60%).

Importantly, the variance across samples was caused only by chance variation in the number of correct questions and errors captured in any particular sample. In real terms, the only difference between the samples that showed no or low error rates and those that showed high error rates was the inclusion or exclusion of one or two inversion errors. However, this chance inclusion/exclusion had a large impact on error rates because so few questions overall were captured in each sample (on average, six questions with DO/modals in the four-hour samples, three in the two-hour samples, two in the one-hour samples). Rowland and Fletcher concluded that studies using small samples can substantially over- or under-estimate error rates in utterance types that occur relatively infrequently, and thus that calculations of error rates based on small amounts of data are likely to be misleading.
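The effect of chance inclusion or exclusion is easy to see in a small simulation. The sketch below uses invented numbers (a hypothetical 20% underlying error rate, not Lara's data) and simply draws repeated small samples of the sizes just mentioned, to show how far the estimated error rate can wander when each sample contains only a handful of questions:

```python
import random

# Hypothetical illustration only: a child whose true inversion error rate is 20%.
TRUE_ERROR_RATE = 0.20

def sampled_error_rate(n_questions: int) -> float:
    """Estimate the error rate (%) from one small sample of n_questions."""
    errors = sum(random.random() < TRUE_ERROR_RATE for _ in range(n_questions))
    return 100 * errors / n_questions

random.seed(1)
for n in (6, 3, 2):  # cf. the average of six, three and two questions per sample
    estimates = [sampled_error_rate(n) for _ in range(20)]
    print(f"{n} questions/sample: estimates range from "
          f"{min(estimates):.0f}% to {max(estimates):.0f}%")
```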



2.2 The effect of calculating overall error rates



To sum up so far, small samples can lead researchers to miss rare phenomena, can fail to capture short-lived errors or errors in low frequency structures, and can inaccurately estimate error rates. Given these facts, the temptation is to sacrifice a more fine-grained analysis of performance in different parts of the system in favour of an overall error rate, in order to ensure enough data for reliable analysis. Thus, the most popular method of assessing the rate of error is to calculate the total number of errors as a proportion of all the possible contexts for error. For example, Stromswold (1990) reports the error rate of auxiliaries as:

Error rate = Number of auxiliary errors / Total number of contexts that require an auxiliary (i.e., correct uses + errors)
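As a calculation this is simply errors divided by opportunities; a minimal sketch, with hypothetical counts rather than Stromswold's figures:

```python
def overall_error_rate(n_errors: int, n_correct: int) -> float:
    """Errors as a percentage of all contexts requiring the structure."""
    return 100 * n_errors / (n_errors + n_correct)

# Hypothetical figures: 5 auxiliary errors against 95 correct uses -> 5.0%
print(overall_error_rate(n_errors=5, n_correct=95))
```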

This method clearly maximizes the amount of data available in small samples. However, it also leads to an under-estimation of the incidence of errors in certain cases, particularly errors in low frequency structures or short-lived errors. There are three main problems. First, overall error rates will be statistically dominated by high frequency items, and thus will tend to represent the error rate in high frequency, not low frequency, items. Second, overall error rates fail to give a picture of how error rates change over time. Third, overall error rates can hide systematic patterns of error specific to certain subsystems.

2.2.1 High frequency items dominate overall error rates

High frequency items will statistically dominate overall error rates. This problem is outlined clearly by Maratsos (2000) in his criticism of the "massed-token pooling methods" (p. 189) of error rate calculation used by Marcus et al. (1992). In this method, Marcus et al. calculated error rates by pooling together all tokens of irregular verbs (those that occur with correct irregular past tense forms and those with over-regularized pasts) and calculating the error rate as the proportion of all tokens of irregular pasts that contain over-regularized past tense forms. Although this method maximizes the sample size (and thus the reliability of the error rate), it gives much more weight to verbs with high token frequency, resulting in an error rate that disproportionately reflects how well children perform with these high frequency verbs. For example, verbs sampled over 100 times contributed 10 times as many responses as verbs sampled 10 times and "so have statistical weight equal to 10 such verbs in the overall rate" (Maratsos 2000: 189).

To illustrate his point, Maratsos analysed the past-tense data from three children (Abe, Adam and Sarah). Overall error rates were low, as Marcus et al. (1992) had also reported. However, Maratsos showed that overall rates were disproportionately affected by the low rates of error for a very small number of high frequency verbs which each occurred over 50 times (just 6 verbs for Sarah, 17 for Adam, 11 for Abe). The verbs that occurred fewer than 10 times had a much smaller impact on the overall error rate simply because they occurred less often, despite there being more of them (40 different verbs for Abe, 22 for Adam, 33 for Sarah). However, it was these verbs that demonstrated high rates of error (58% for Abe, 54% for Adam, 29% for Sarah). Thus, Maratsos showed that overall error rates disproportionately reflect how well children perform with high frequency items and can hide error rates in low frequency parts of the system.
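The weighting effect can be sketched with invented counts (not the actual Abe, Adam and Sarah data): a few high frequency verbs produced almost error-free dominate the pooled figure, even though the many low frequency verbs are produced with errors much of the time.

```python
# Hypothetical verb counts, not the actual child data:
# (verb, irregular past tokens produced, over-regularization errors)
verbs = [
    ("go",   120, 2), ("get", 100, 1), ("come", 90, 1),   # high token frequency
    ("sweep",  4, 3), ("blow",  3, 2), ("sting",  2, 2),  # low token frequency
]

tokens = sum(n for _, n, _ in verbs)
errors = sum(e for _, _, e in verbs)
pooled_rate = 100 * errors / tokens                                   # massed-token pooling
mean_verb_rate = sum(100 * e / n for _, n, e in verbs) / len(verbs)   # each verb weighted equally

print(f"Pooled (token-weighted) error rate:  {pooled_rate:.1f}%")     # ~3.4%
print(f"Mean per-verb (type-weighted) rate: {mean_verb_rate:.1f}%")   # ~40.9%
```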

2.2.2 Overall error rates collapse over time

A second problem with using overall error rates is that they provide only a representation of average performance over time, taking no account of the fact that, since children will produce fewer errors as they age, error rates are bound to decrease with time. This problem is intensified by the fact that, since children talk more as they get older, overall error rates are likely to be statistically dominated by data from later, perhaps less error-prone, periods of acquisition. This is illustrated by Maslen, Theakston, Lieven and Tomasello's (2004) analysis of past tense verb use in the dense data of one child, Brian, who was recorded for five hours a week from age 2;0 to 3;2, then for four or five hours a month (all recorded during the same week) from 3;3 to 3;11. Because of the denseness of the data collection, Maslen et al. were able to chart the development of irregular past tense verb use over time, using weekly samples. They reported that, although the overall error rate was low (7.81%), error rates varied substantially over time, reaching a peak of 43.5% at 2;11 and gradually decreasing subsequently. They concluded that "viewed from a longitudinal perspective, … regularizations in Brian's speech are in fact more prevalent than overall calculations would suggest" (Maslen et al. 2004: 1323).
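A sketch with hypothetical counts (not Brian's actual data) shows how the averaging works: once the later, larger samples are pooled in, an early peak in the period-by-period error rate leaves little trace in the overall figure.

```python
# Hypothetical monthly counts: the later, larger samples dominate the overall
# figure and mask the early peak in over-regularization.
months = [  # (age, irregular past tense tokens, over-regularization errors)
    ("2;9",  40,  2), ("2;10",  60, 10), ("2;11",  70, 30),
    ("3;5", 400, 20), ("3;8",  600, 15), ("3;11", 800, 10),
]

total_tokens = sum(t for _, t, _ in months)
total_errors = sum(e for _, _, e in months)
print(f"Overall error rate: {100 * total_errors / total_tokens:.1f}%")  # ~4.4%
for age, tokens, errors in months:
    print(f"  {age}: {100 * errors / tokens:.1f}%")   # peaks at ~42.9% at 2;11
```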

2.2.3 Overall error rates collapse over subsystems

Third, the use of overall error rates can hide systematic patterns of error specific to some of the sub-systems within the structure under consideration. Aguado-Orea and Pine's (2005, see also Aguado-Orea 2004) analysis of the development of subject-verb agreement in children learning Spanish demonstrates this problem. Aguado-Orea and Pine (2005) analysed dense data from two monolingual Spanish children (approximately aged 2 years). They reported that the overall rate of agreement error in present tense contexts over a six month period was approximately 4% for both children (see Table 2), as had been systematically reported in the literature (Gathercole, Sebastián and Soto 1999; Hoekstra and Hyams 1998; Hyams 1986; Pizzuto and Caselli 1992).5 Yet this figure overwhelmingly represented how good the children were at providing the correct inflections in 1st and 3rd person singular contexts, which made up over 85% of verb contexts. Error rates for the other, less frequent, inflections were much higher, especially when the rarer 3rd person plural inflections were required (which comprised only 8% of all verb contexts). For Juan, 31% of the verbs that required 3rd person plural inflections had inaccurate inflectional endings. For Lucia, this figure was 67%; in other words, agreement errors occurred in over two thirds of the verbs that required 3rd person plural inflections.6

Not only were rates of error higher for low frequency inflectional contexts, but error rates for high frequency verbs were also significantly lower than error rates for low frequency verbs, even within the frequent 3rd person singular inflectional contexts. Thus, the conclusion of Aguado-Orea and Pine (2005) was not only that agreement error rates can be high in certain parts of the system, but that they are high in low frequency parts of the system – with a strong relation between the frequency of a verb form in the input and the accuracy with which a child can use that verb form in his/her own production. In other words, overall error rates – which are bound to disproportionately reflect the children's performance on high frequency structures – will inevitably under-estimate the true extent of errors in low frequency structures.

Table 2.  Number of verb contexts requiring present tense inflection and percentage rate of agreement error.*
(Cells show % agreement error, with the number of contexts in parentheses.)

Child     Total          Singular                                    Plural
                         1st person   2nd person   3rd person        1st person   2nd person   3rd person
Juan      4.5 (3151)     4.9 (693)    10.2 (147)   0.7 (1997)        0 (61)       33.3 (3)     31.5 (251)
Lucia     4.6 (1676)     3.0 (469)    22.9 (96)    0.5 (1018)        0 (14)       – (0)        66.7 (48)

* The table is based on data from Aguado-Orea (2004)



5. % agreement error = (Number of incorrect inflections / (Number of incorrect + correct inflections)) × 100

6. High rates of error remained even when the verbs produced before the children attested knowledge of the correct inflections were removed.
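The same point can be checked directly against the counts in Table 2. The sketch below reconstructs approximate error counts for Juan from the published percentages; the heavily practised singular contexts pull the overall rate down to around 4.5%, even though almost a third of the 3rd person plural contexts contain errors.

```python
# Juan's present tense contexts from Table 2: {context: (no. of contexts, % error)}
juan = {
    "1sg": (693, 4.9), "2sg": (147, 10.2), "3sg": (1997, 0.7),
    "1pl": (61, 0.0),  "2pl": (3, 33.3),   "3pl": (251, 31.5),
}

total_contexts = sum(n for n, _ in juan.values())
total_errors = sum(n * rate / 100 for n, rate in juan.values())  # approximate error counts
overall = 100 * total_errors / total_contexts
share_3pl = juan["3pl"][0] / total_contexts

print(f"Overall agreement error rate: {overall:.1f}%")            # ~4.5%
print(f"3rd person plural error rate: {juan['3pl'][1]}% "
      f"in only {share_3pl:.0%} of contexts")                      # 31.5% in ~8% of contexts
```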







2.3 Sampling and error rates: Some solutions



To recap, sample size can have a significant effect on the calculation of error rates. Small samples are unlikely to capture even one instance of a low frequency error. Even when errors are recorded, error rates based on small amounts of data are unreliable because the chance absence or presence of a few tokens can have a substantial effect on the calculation. Using overall error rates can help alleviate this problem (by maximising the amount of data included in the calculation) but can misrepresent error rates in low frequency subsections of the system. Thus, it is important to analyse different subsystems separately. However, analysing data at such a fine-grained level sometimes means that the amount of data on which error rates are based can be very small, even in substantial corpora. And this leads back to the original problem – when analysing small samples of data, we often fail to capture rare errors.

The most obvious solution is to collect a lot more data, but this is not always practical or cost-effective. There are a number of alternative solutions, both for those recording new corpora and those using existing datasets.

2.3.1 Techniques for maximising the effectiveness of new corpora



2.3.1.1 Statistical methods for assessing how much data is required

The simplest way to calculate how much data is necessary for the study of a particular error is to estimate the number and proportion of errors we would expect to capture given the proportion of data that we are sampling (e.g., a one hour/week sampling density might capture one example of an error that occurs once an hour every week). However, this works only if the child regularly produces one error every hour, an improbable assumption. In reality, children's errors are likely to be more randomly distributed across their speech.

Given this fact, Tomasello and Stahl (2004) suggest calculating hit rates (or hit probabilities). A hit rate is the "probability of detecting at least one target during a sampling period" (p. 111), and supplies an estimate of the likelihood of capturing an error given a particular error rate and sampling density. Figure 2 reproduces Tomasello and Stahl's analysis, using the same method of calculation7 and based on the same assumptions (see Footnote 1). The figure plots hit rate (y-axis) against sampling density (x-axis) for a number of rates of occurrence. This figure can then be used to work out how dense the sampling needs to be to capture errors of different frequency (an accompanying Excel file, which can be downloaded and used to calculate the required sampling density for targets of any frequency, is available at http://www.liv.ac.uk/psychology/clrc/clrg.html).

7. Hit rate is defined as the probability of detecting one (or more) Poisson distributed target event, which is equal to 1 minus the probability of no events occurring, and is thus calculated:

Hit rate = 1 – [p(k=0)]

where p(k=0) is the probability that no Poisson distributed target will be captured and is calculated:

p(k=0) = e^(–λ)

where e is the base of the natural logarithm (e = 2.71828) and λ = (expected error rate × sampling rate) / waking hours.

For example, let us assume that we want to be 95% certain that our sampling regime will capture at least one error (i.e. we set our criterion to p = 0.05), and that we estimate that an error occurs 70 times a week. Figure 2 shows that we need to sample for three hours a week to be 95% certain of capturing at least one error. If the error occurs only 35 times per week, we need to sample for six hours per week.



Figure 2.  Probability of capturing at least one target during a one week period, given different sampling densities and target frequencies.
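The hit rate formula in Footnote 7 is easy to compute directly. The sketch below assumes roughly 70 waking hours per week (the original assumption is stated in Tomasello and Stahl's Footnote 1, which is not reproduced in this excerpt); with that figure it reproduces the worked examples above.

```python
import math

WAKING_HOURS_PER_WEEK = 70   # assumed here; the original assumption is in
                             # Tomasello and Stahl's Footnote 1, not shown above

def hit_rate(errors_per_week: float, sampled_hours_per_week: float) -> float:
    """Probability of capturing at least one Poisson-distributed error in a
    week's sample (Footnote 7): 1 - e^(-lambda)."""
    lam = errors_per_week * sampled_hours_per_week / WAKING_HOURS_PER_WEEK
    return 1 - math.exp(-lam)

print(hit_rate(70, 3))   # ~0.95: three hours/week for an error occurring 70 times/week
print(hit_rate(35, 6))   # ~0.95: six hours/week for an error occurring 35 times/week
print(hit_rate(7, 15))   # ~0.78: even 15 hours/week falls short at 7 times/week
```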



For more infrequent errors (those that occur only 14 or 7 times/week), the figure demonstrates that even intensive sampling regimes may not be enough. For errors that occur 14 times/week, we need 15 hours of data collection per week to be 95% sure of capturing one or more errors. Even sampling for 15 hours per week, we would only be 78% certain of capturing at least one error at the 7 errors/week rate. More importantly, the figure only provides information about the sampling density required to capture at least one error in our sample. If we wish to capture more errors (which is necessary if, for example, we want to calculate an accurate error rate) we will need to sample even more intensively. Since this is unlikely to be cost-effective, for rare errors it is important to consider alternatives to simply increasing sampling density.
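Inverting the same formula gives the sampling density needed to reach a chosen hit rate; a sketch under the same 70-waking-hours assumption:

```python
import math

WAKING_HOURS_PER_WEEK = 70   # same assumption as in the previous sketch

def hours_needed(errors_per_week: float, confidence: float = 0.95) -> float:
    """Weekly sampling hours required to capture at least one error with the
    given probability, inverting hit rate = 1 - e^(-lambda)."""
    return -WAKING_HOURS_PER_WEEK * math.log(1 - confidence) / errors_per_week

for freq in (70, 35, 14, 7):
    print(f"{freq} errors/week: {hours_needed(freq):.0f} hours of recording per week")
```

On this calculation, an error occurring only seven times a week would need roughly 30 hours of recording per week to meet the same 95% criterion, which is the sense in which simply increasing sampling density quickly stops being practical.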



2.3.1.2 Using different types of sampling regimes

The most popular sampling regime in the child language literature is what we will call continuous sampling, where data is collected at regular intervals (e.g., every week, every fortnight) over a period of time. However, there are a number of alternatives that might be better suited to analyses of phenomena that occur with low frequency. One alternative is to sample densely for a short period of time, with long temporal gaps between samples (interval sampling). For example, we could sample 5 hours every week, but only for one week per month (Abbot-Smith and Behrens 2006; Maslen et al. 2004). This way we can be sure of gaining a detailed picture of the child's language at each time point. Another idea is to sample not all utterances but only those of interest (targeted sampling; similar to sequence sampling in the animal behaviour literature). This is the technique used by Tomasello (1992) and Rowland (e.g., Rowland and Fletcher 2006, Rowland et al. 2005) and involves recording all (or nearly all) the productions of a particular structure (e.g., utterances with verbs, wh-questions). Alternatively, we could sample only during situations likely to elicit the target structures (situational sampling; e.g., Rowland's Lara produced most of her why questions during car journeys). Finally, a more systematic investigation of a particular structure could be achieved by introducing elicitation games into the communicative context. For example, Kuczaj and Maratsos (1975) introduced elicited imitation games, designed to encourage the production of low frequency auxiliaries, into their longitudinal study of Abe's language. These games not only provided detailed information about Abe's auxiliary use, but also demonstrated where the naturalistic data failed to provide an accurate picture of development (e.g., Abe was able to produce non-negative modals correctly in elicited utterances even though he never produced them in spontaneous speech).

2.3.2 Techniques for maximising the reliability of analyses on existing corpora

For researchers using datasets that have already been collected (e.g., those available on CHILDES, MacWhinney 2000), it is important to use statistical procedures to assess the accuracy of error rates. Tomasello and Stahl's (2004) hit rate method (described above) can be used to calculate whether an existing sample is big enough to capture a target structure. However, there are also ways of maximising the use of datasets that, though too small for reliable analysis in isolation, can be combined with other datasets to provide accurate estimates of error rates.



2.3.2.1 Statistical methods

The simple fact that "a group of individual scores has more reliability than the individual scores" themselves (Maratsos 2000: 200) can be exploited to provide more accurate error estimates. In particular, mean error rates calculated either over a number of children or




