more intensively. Since this is unlikely to be cost-effective, for rare errors it is important to consider alternatives to simply increasing sampling density.
2.3.1.2 Using different types of sampling regimes
The most popular sampling regime in the child language literature is what we will call
continuous sampling, where data is collected at regular intervals (e.g., every week, every fortnight) over a period of time. However, there are a number of alternatives that
might be better suited to analyses of phenomena that occur with low frequency. One
alternative is to sample densely for a short period of time, with long temporal gaps
between samples (interval sampling). For example, we could sample 5 hours every
week, but only for one week per month (Abbot-Smith and Behrens 2006; Maslen et
al. 2004). This way we can be sure of gaining a detailed picture of the child’s language
at each time point. Another idea is to sample not all utterances, but only those of interest
(targeted sampling; similar to sequence sampling in the animal behaviour literature).
This is the technique used by Tomasello (1992) and Rowland (e.g., Rowland and
Fletcher 2006, Rowland et al. 2005) and involves recording all (or nearly all) the productions of a particular structure (e.g., utterances with verbs, wh-questions). Alternatively we could sample only during situations likely to elicit the target structures (situational sampling; e.g., Rowland’s Lara produced most of her why questions during car
journeys). Finally, a more systematic investigation of a particular structure could be
achieved by introducing elicitation games into the communicative context. For example, Kuczaj and Maratsos (1975) introduced into their longitudinal study of Abe’s language elicited imitation games designed to encourage the production of low frequency auxiliaries. These games not only provided detailed information about Abe’s auxiliary use, but also demonstrated where the naturalistic data failed to provide an accurate
picture of development (e.g., Abe was able to produce non-negative modals correctly
in elicited utterances even though he never produced them in spontaneous speech).
2.3.2 Techniques for maximising the reliability of analyses on existing corpora
For researchers using datasets that have already been collected (e.g., those available on
CHILDES, MacWhinney 2000), it is important to use statistical procedures to assess
the accuracy of error rates. Tomasello and Stahl’s (2004) hit rate method (described
above) can be used to calculate whether an existing sample is big enough to capture a
target structure. However, there are also ways of maximising the use of datasets that,
though too small for reliable analysis in isolation, can be combined with other datasets
to provide accurate estimates of error rates.
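To make this kind of adequacy check concrete, here is a minimal sketch of the underlying calculation, assuming a deliberately simple model in which tokens of the target structure fall uniformly and independently across the child’s waking hours (the function name, the 70-hour waking week and the uniformity assumption are ours for illustration; Tomasello and Stahl’s own hit rate calculations are more detailed):

```python
def capture_probability(tokens_per_week: float, hours_sampled: float,
                        waking_hours: float = 70.0) -> float:
    """Chance that a weekly sample of `hours_sampled` hours contains at least
    one token of a structure produced `tokens_per_week` times per week,
    assuming tokens fall uniformly and independently across waking hours."""
    miss_one = 1.0 - hours_sampled / waking_hours   # one token lands outside the sample
    return 1.0 - miss_one ** tokens_per_week        # at least one token is captured

# under these assumptions, a structure produced 5 times a week is caught by
# a weekly one-hour recording only about 7% of the time
print(capture_probability(tokens_per_week=5, hours_sampled=1.0))
```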
2.3.2.1 Statistical methods
The simple fact that “a group of individual scores has more reliability than the individual
scores” themselves (Maratsos 2000: 200) can be exploited to provide more accurate error
estimates. In particular, mean error rates calculated either over a number of children or
over a number of different sub-samples from the same child will provide a much more
reliable estimate than each individual error rate, even in low frequency structures.
This fact is illustrated by the results of the sampling analysis conducted by Rowland and Fletcher (2006) on the Lara data and summarized above (see section 2.1.3).
The analysis demonstrates that small samples of data are extremely inaccurate at estimating true error rates for infrequent structures – error rates for questions with DO/modal auxiliaries varied from 0% to 100% for the smallest sampling density (see Table
1). However, for each sampling density, the mean error rate calculated across seven
samples was often quite accurate, despite the fact that the estimates from the individual samples contributing to these means varied widely (mean error rate: four-hour sample = 26%, two-hour sample = 17%, one-hour sample = 26%; compared to 20% for the
intensive data). Thus, mean error rates calculated across a range of samples provide a
more accurate estimate of error rate.
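The point can be illustrated with a short simulation (a sketch with invented sample sizes, not a reanalysis of the Lara data): individual small samples return widely varying error-rate estimates, while their mean stays close to the true rate.

```python
import random

def estimated_error_rate(true_rate: float, n_tokens: int) -> float:
    """Error rate (%) estimated from one small sample of question tokens."""
    errors = sum(random.random() < true_rate for _ in range(n_tokens))
    return 100.0 * errors / n_tokens

random.seed(2)
true_rate = 0.20  # the intensive Lara data put the true rate at about 20%
estimates = [estimated_error_rate(true_rate, n_tokens=8) for _ in range(7)]
print(estimates)                        # individual small samples swing widely
print(sum(estimates) / len(estimates))  # but their mean lands much nearer 20%
```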
Maratsos (2000) has also used means across verb types to provide reliable figures
for past tense over-regularization errors in low frequency verbs (verbs that occurred
only between one and nine times in the samples). Maratsos calculated error rates for
each individual verb type and then averaged across these error rates to provide a mean
error rate. As well as controlling for small sample size by providing a more accurate
measure of error rate, this method also ensured that each verb type contributed equally to the calculation (thus controlling for verb token frequency). The resulting figure
gives, as Maratsos (2000: 200) says, “an average rate more believable than each individual verb-rate that went into it”.
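A minimal sketch of the difference between a pooled, token-weighted rate and a Maratsos-style mean across verb types (all counts are invented for illustration):

```python
# invented counts of (over-regularized, total) past tense tokens per verb type
counts = {"go": (1, 9), "fall": (2, 4), "hold": (1, 1), "run": (0, 3)}

# pooled, token-weighted rate: high frequency verbs dominate the figure
pooled = sum(e for e, n in counts.values()) / sum(n for e, n in counts.values())

# Maratsos-style mean across verb types: each type contributes equally
mean_of_types = sum(e / n for e, n in counts.values()) / len(counts)

print(f"pooled: {pooled:.0%}, mean across types: {mean_of_types:.0%}")
```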
The samples from which means are derived do not have to be multiple samples
from the same participant. Means calculated across samples from a number of children will also provide more reliable measures of error rates. Of course, this approach
does not record individual differences either across children or across items, nor does
it tell us about the reliability of individual samples. However, information about the
standard deviation and the range can be used to assess the reliability of each individual sample, and to identify outliers with extreme scores.
The range and standard deviation are two commonly used measures of statistical
dispersion. The range of a group of samples is simply the spread between the largest
and smallest estimate and is calculated by subtracting the smallest observation from
the greatest. However, the range only provides information about the spread of the
samples as a whole; it does not provide information about how the individual sample
estimates pattern within this range. The standard deviation (SD) is a more sophisticated
measure of statistical dispersion that provides information about how tightly all the
estimates from all the samples are clustered around the mean.8 A small standard deviation means that most (if not all) estimates are close to the mean. Since the mean of a number of samples is a reliable estimate of error rate, a small standard deviation indicates that the estimates from each individual sample are likely to be reliable. A large standard deviation means that many of the samples have yielded estimates that are substantially different from the mean (and also from each other), indicating that estimates from individual samples are more likely to be inaccurate.

8. Most statistics and spreadsheet packages will calculate standard deviations (SDs). The SD is the square root of the variance and is calculated as SD = \sqrt{\sum (x - \bar{x})^2 / (n - 1)}, where x is an individual score, \bar{x} is the mean, and n is the total number of scores.
For example, returning to the data on questions from Lara (see section 2.1.3 and
Table 1), we can see that the standard deviation derived from the seven one-hour samples was large for questions with DO/modal auxiliaries (38.32%). This indicates that
each individual sample at this sampling density was likely to give an inaccurate estimate. However, for questions with copula BE, the standard deviation for the same
sample density (1 hour/week) was much smaller (3.44%), indicating that each sample
provided a relatively accurate estimate of error rate. Thus, standard deviations can be
used to assess the reliability of a particular sampling density. A low standard deviation
across samples at a particular sampling density indicates that each individual sample
may be large enough to provide reliable error rate estimations on its own. These figures
can then be used to assess optimum sampling density for new data collection studies.
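As a sketch of this diagnostic, with invented estimates standing in for the seven weekly samples, Python’s statistics module is enough to compare dispersion across two structures:

```python
from statistics import mean, stdev

# invented error-rate estimates (%) from seven one-hour weekly samples
do_modal  = [0.0, 100.0, 0.0, 33.3, 0.0, 50.0, 0.0]  # low frequency structure
copula_be = [4.1, 5.2, 3.0, 4.8, 2.9, 3.7, 4.4]      # high frequency structure

for label, estimates in (("DO/modal", do_modal), ("copula BE", copula_be)):
    print(f"{label}: mean = {mean(estimates):.1f}%, SD = {stdev(estimates):.1f}%")

# a large SD flags a sampling density too sparse to trust any single sample;
# a small SD suggests each individual sample is reliable on its own
```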
2.3.2.2 Combining different types of samples
Although combining data from a number of samples may give us accurate error rates,
sometimes we wish to assess the accuracy of an error rate from an individual sample
or, more often, from an individual child. For example, we may have collected dense
data from one child (which we assume give us accurate error rates) but want to check
whether our results can be taken as indicative of language learning in the wider population (i.e. is the child representative or do the results simply reflect idiosyncrasies of
this particular child?). The solution to this problem is to use statistical methods to
compare the dense data with data from a larger number of children, albeit collected in
smaller samples.
Rowland et al. (2005) performed such an analysis. They reported that one child –
Lara – produced large numbers of inversion errors in wh-questions with DO/modal
auxiliaries for a short period of time at the beginning of Brown’s (1973) Stage IV (see
section 2.1.2) but that these high error rates were not reflected in the much less dense
data collected from twelve other children (the Manchester corpus). Rowland et al.
concluded that the Manchester corpus data was not dense enough to capture the very short period of high error, but recognized that the differences could be due to individual differences between Lara and the Manchester corpus children. To check, Rowland et al. compared Lara’s data with the Manchester corpus in terms of the percentage
of correct questions and errors produced at Stage IV overall, and the percentage of
correct questions and errors produced with DO/modal auxiliaries at Stage IV overall.
The data on which this analysis was based are reproduced in Table 3.
Using means, standard deviations and 95% confidence intervals they demonstrated that Lara was not an outlier on any of the comparison measures. For example, her
rate of correct question production (67.02%) over the whole of Stage IV was very close
to the mean rate demonstrated by the Manchester corpus children (68.43%) and was
well within one standard deviation of the mean (68.43% ± 25.73%). In other words,
when we analyse Lara’s data at the same grain size as the Manchester corpus, the high
rate of error disappears. These comparisons indicated that Lara’s data can be considered representative and that the difference between corpora was due to the fact that the
Manchester corpus was not large enough to capture the short period of high error in
the lower frequency structures.
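A minimal sketch of this kind of outlier check (the per-child rates below are invented, chosen only to approximate the mean and standard deviation reported in Table 3):

```python
from statistics import mean, stdev

def within_one_sd(child_rate: float, group_rates: list[float]) -> bool:
    """Is one child's rate within one standard deviation of the group mean?"""
    m, sd = mean(group_rates), stdev(group_rates)
    return abs(child_rate - m) <= sd

# invented per-child rates of correct question production for twelve children,
# chosen to approximate the Manchester corpus mean (~68%) and SD (~26%)
manchester = [95, 88, 30, 72, 99, 41, 65, 83, 25, 90, 77, 56]
print(within_one_sd(67.02, manchester))  # Lara's Stage IV rate -> True
```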
Table 3. Comparison of descriptive statistics: Manchester corpus children and Lara

Question type                            Stage IV Manchester corpus data                    Stage IV Lara data
                                         Mean % of     Standard     95% Confidence          %
                                         questions     deviation    interval

All wh-questions (of total questions)
  Correct                                68.43         25.73        49–88                   67.02
  Omission error                         24.13         28.45        4–48                    23.84
  Inversion error                        1.39          1.84         0–3                     2.35
  Other commission error                 4.40          3.45         2–7                     5.29

Questions with DO/modal forms
  Correct                                53.32         34.39        27–80                   68.64
  Omission error                         35.30         38.22        6–65                    16.38
  Inversion error                        8.51          9.42         1–16                    12.89
  Other commission error                 2.87          4.44         0–6                     2.08
2.4 Summary
To conclude this section, estimates of error rates are dependent upon the size of the
sample and the analysis methods used. In order to estimate error rates accurately, we
need datasets big enough or statistical measures sensitive enough to capture examples
of, and to estimate rates of, low frequency errors, short lived errors and errors in low
frequency structures. However, even if we employ such methods, we should be especially cautious about drawing conclusions about how data support our hypothesis
when we know that the methods we have used may bias the results in its favour. Those
who hypothesize that error rates will be low for a certain structure (e.g., Hyams 1986)
must recognize that overall error rates are likely to under-estimate rates of error in low
frequency parts of the system. Those who argue for high error rates in low frequency
structures (e.g., Maratsos 2000) cannot point to high error rates in individual samples
or at particular points in time as support for their predictions, unless they have also
demonstrated that such error rates cannot be attributed to chance variation.
3. Sampling and the investigation of productivity
A second issue at the heart of much recent work is the extent to which children have
productive knowledge of syntax and morphology from a very early age. Many have
claimed that children have innate knowledge of grammatical categories from the outset (e.g., Hyams 1986; Pinker 1984; Radford 1990; Valian 1986; Wexler 1998). In support is the fact that even children’s very first multi-word utterances obey the distributional and semantic regularities governing the presence and positioning of
grammatical categories.
However, others have claimed that children could demonstrate adult-like levels of correct performance without access to adult-like knowledge, simply by applying much narrower-scope lexical and/or semantic patterns such as agent + action or even ingester
+ ingest or eater + eat. In support are studies on naturalistic data that suggest that children’s performance, although accurate, may reflect an ability to produce certain high
frequency examples of grammatical categories, rather than abstract knowledge of the
category itself (e.g., Bowerman 1973; Braine 1976; Lieven, Pine and Baldwin 1997;
Maratsos 1983). These studies suggest that we cannot attribute abstract categorical
knowledge to children until we have first ruled out the possibility that their utterances
could be produced with only partially productive lexically-specific knowledge.
This is clearly a valid argument. However, it is equally important that we do not
assume that lexical specificity in children’s productions equates simply and directly to
partial productivity in their grammar. In fact, the apparent lexical specificity of children’s speech may sometimes simply be an artefact of the fact that researchers are analysing samples of data. There are three potential problems. First, even in big samples,
we capture only a proportion of the child’s speech, which means children are unlikely