Proficiency means how good they are in the language outside of any formal achievement measure. It’s not a test, and it’s not something they are demonstrating they learned from a specific course. Rather, it’s their ability to use the language in anything from a survival situation, like asking how much a pack of playing cards costs, to something much more sophisticated like a speech act, such as the ability to complain effectively and have the complaint work. That would be the range of speaking. It would involve listening as well: how well they can actually listen to and understand, for example, an important telephone message that may even be a little garbled on the answering machine. Then writing: are they able to write a message that someone is able to understand? Are they able to fill out a form? This could happen even to a grade school kid who has to fill out a form about some sort of school outing. And then reading. Proficiency would be concerned with how well they can read, again not on any specific test they might take in reading or any particular text they have to read, but out there in the real world, in a more general, non-course-specific way: how effective are they at reading? This would be their reading proficiency. Then the field of assessment also adds other things to proficiency. They talk about vocabulary ability, and again that cuts across the different skills. It’s not just speaking, listening, reading, or writing; it can be a combination of all of these. With some kids it’s very clear: they are very strong in vocabulary in speaking, let’s say, in a certain domain. They have the words for things on the street, but they don’t have the words for things in a more technical or content-specific domain.
In their native language, let’s say Spanish, they can’t talk about something in the sciences, something about an earthquake, because maybe they have only heard those terms around the school. So the main thing with proficiency, in distinguishing it from achievement, is that it’s looking at general abilities not linked to any one course, whereas language achievement is looking at what you gained from one specific curriculum that you were part of in one specific course.
The question we’re faced with now is how to help a school administrator, a teacher, or an advisor understand what a single score on a single language measure means at school. So the student has taken a test of general speaking and has a score of, I don’t know, 72 out of 100. What does the score mean? Well, frankly, I think that where we are in the field right now, and people like Elana Shohamy and others have pointed this out, is that we are very wary of putting too much faith, too much stock, in one individual score. Rather, we want to see multiple measures. We want to see different measures of the same thing. So if we’re interested in how well somebody can express themselves and maybe complain, let’s say a fourth grader who is very concerned about something that’s bothering them, something at school that isn’t working out for them and that they want to complain about, maybe there would be one measure which is a role play, where they role-play that situation with another classmate. Another might be a written description, where they have to put in writing what they would say. The advantage there is that they have a little bit more time to think things out. Maybe in a role play they’re nervous. In fact, there are kids who feel confronted in an oral situation and are unwilling or unable to express themselves at all effectively, so maybe they will be rated much lower, and it really has to do with personality, whereas in the written, projected measure of their speaking they could do a better job. Another situation might be to have kids interacting in a group of three or four. Some kids flourish that way, whereas they’ll be intimidated if there’s one adult talking to them; they won’t perform at all well in that kind of situation. So a single score is iffy.
Personally, what I like to do these days is to find out what the child’s experience was in producing that particular response or set of responses for that particular test. So I will ask them: what was it like, what did you want to do, did you do what you wanted to do, did you have any frustrations? And if need be, I’ll ask them in the language they feel comfortable in, like Spanish, if they were performing in English. We’ve done that with adults. Elite Olshtain and I did that in Israel, where we had 15 performers who had to perform three speech acts, requests, apologies, and a complaint, and we had them look at the videotape and reconstruct for us what went into the production of their responses. Because the thing is, a response is like something frozen. It’s a product, and it may just be the tip of the iceberg; it isn’t necessarily telling us what went into producing that response. The other side of the coin is this: let’s say we have a score of 4.3 on a scale from 1 to 5, and it’s been determined by a rater or raters who looked at the videotape of this child responding. What went into that response? Because one 4.5 may not be equal to another 4.5 qualitatively in terms of what (gap in tape) we’d want to see several scores. Why are we going the portfolio route in writing? Why are teachers being encouraged to have their students put together a set of their writing samples, of little short stories they might have written, or a poem, or an essay? So that there is more than one single measure to go on, since a single measure may be deceptive. It may be that the child wasn’t feeling well that day or wasn’t performing the best they possibly could, or maybe they don’t like to do that particular kind of measure but they’re good at other kinds of things. And this is why Merrill Swain has referred to this as bias for best.
In other words, you have the respondent doing enough things, and you also bend over backwards to make sure that whatever they do, they can do in a fair way, so they can feel good about themselves and maintain their own self-esteem. I’m wearing my black sheep tie. This is the outlier, the individual who doesn’t feel accepted. When you said you wanted me to talk about assessment, I thought I’d wear it, because I think a lot of us go through tests like that. We’ve all had an experience where we just bomb out, whatever it is; they don’t ask us the questions we’d like to answer. This happens in assessment in the schools repeatedly, and we should do anything we can to help youngsters feel good about themselves and enjoy the experience. That’s why I enjoy working in speech acts, because it’s a fun area. In particular, we’ve found that respondents like to do speech acts where they are in the driver’s seat. You put someone in a situation where they have to apologize, and some people don’t like that because they are embarrassed. Tasks often can make us feel very inferior. We don’t know the text, we’re reading something in science and we’re not good at that, and we feel, oh, they asked me the wrong thing. So I’m very much in favor of multiple measures and of not putting too much weight on any one test, and certainly of being clear about what went into that test, both in terms of what the respondent had to do and, if there’s a rating involved, what the raters had to do and how easy it was to do it.
There is an excellent article published recently by J. D. Brown and Hudson criticizing alternative assessments, and I think the statement being made was: don’t throw the baby out with the bathwater. We have some time-honored means of assessment, and just because things are new doesn’t mean they’re better, and just because they’re catchy doesn’t mean they’re better. We have to look before we leap. I followed the early literature on portfolios. I mentioned before that I thought portfolios could be an important way to get samples of writing. On the other hand, the state of Vermont did its best to use portfolios in the school system, in 4th grade and in 8th grade, and they found a notorious lack of reliability. Their math portfolios worked out better, but the writing portfolios, and this was actually for English as a native language, were fraught with reliability problems. It’s difficult to get raters of essays in schools to agree on what they mean by certain scores. So we have to distinguish between feel-good measures, measures that will just make everybody feel good, and measures that will actually tell us something. And we have to ask whether we really want to be using the same yardstick for each person, or whether we’re willing to have different yardsticks for different people according to their background. The minority student can come with a certain handicap, particularly coming out of a minority home where they haven’t had the kinds of exposure that other kids have had, where they haven’t filled out box tops and sent away for things, where they don’t spend half their time on the web filling in blanks and things like that. They come into the classroom and they get some alternative measure, one that involves a portfolio approach or something along those lines, and they don’t do any better.
It’s not biasing for best in their case. They’re not going to perform any better just because they’re getting this measure that is bending over backwards to try to be beneficial to them. It is problematic, and I think for that reason there is a need to involve them, the teachers and the children, in the process, to try to come to some sort of understanding about the whole assessment process and how to do it effectively. But definitely not to eliminate the so-called traditional tests out of hand; rather, to worry about making them beneficial. If it’s reading passages, make them texts that are contextualized, that the learners have some sense of understanding, with vocabulary that they can deal with. One of the big frustrations I had in the early days of the bilingual movement was that all we had were tests from Puerto Rico for testing the Spanish and English abilities of youngsters in the bilingual program in Redwood City, California. The norms were for Puerto Rican kids, and the Spanish was Puerto Rican Spanish. I think we’ve come a long way, but there are still some quirks out there. There are Spanish tests used with big textbooks, from textbook companies in the Southwest, where the Spanish tests are back translations from English and haven’t been adequately Hispanized. There are problem areas that I think we are going to have to continue to be concerned about, as the need for bilingual and multilingual assessment will continue.
So the question is why we should care about reliability and validity. Well, if you have a weight problem, you get on a scale, and one time it tells you you are 10 pounds under where you think you are, and obviously you go rushing off and get yourself another piece of cake, and the next time it tells you you’re 30 pounds over. That can be a problem for you. You say, what can I do? I want to know how heavy I actually am; I don’t want to be told so much more one time and so much less the other. That’s the issue of consistency. Reliability is about consistency: are you going to get the same results each time you use the particular instrument? Validity is a more complex issue. It starts with a basic question: does the test, the assessment measure, measure what it purports to measure? Does it assess what it is intended to assess, or is it actually assessing something else? Say you have a measure that is intended to be a measure of listening, but the children who have to deal with this listening measure are actually reading things. They’re listening, but they have to read the responses: they hear something on a tape, and then they have to read and tick off which is the best response. Is that a listening test or is it a reading test? This is what validity is all about: the extent to which your measure is really measuring what you want to measure. And there are a lot of ways of getting at validity. There’s face validity, which asks whether the test looks like it could possibly measure what you want it to measure, and in a lot of cases that’s what turns teachers off. They look at a cloze test and say, how could this Swiss cheese with all these holes possibly be any measure of someone’s language ability? All it is is a riddled text. I can’t even understand it.
It’s like filling in the dots when there aren’t enough dots to begin with, when you’re missing a lot of the numbers and you don’t know what to do. So that is one case, face validity. Then there’s systemic validity: whether the test or the assessment measure makes sense in that community. I gave the example of the test from Puerto Rico, for bilinguals in Puerto Rico, and whether it would be appropriate for Mexican American children from a totally different socioeconomic environment in California. I didn’t think those tests were at that time, but there weren’t any other measures to go on. So all of these things need to come together, and assessment people and teachers, yes, need to be asking the question: does this test test what we want it to? That is why I am so concerned about finding out what the respondents have done to answer a test, from item to item, because sometimes they get there through other means; they don’t get there the way the test wanted them to. So you have an item saying that this one doesn’t need any water, and you have a cactus, a fruit tree, and a cabbage on a plate. And the kid decides it’s going to be the cabbage on the plate. The correct response was intended to be the cactus; you’re supposed to know that it doesn’t need very much water. You ask the kids why, and they say, well, the cabbage doesn’t need any water at all, it isn’t even living. I mean, they have their logic for things. Or the item says the bird has built his nest, and there’s a picture of a little house, a kind of bird cage with a little hole, and then there’s the traditional bird’s nest and the bird, and the child picks the bird’s nest. And, oh, wonderful, the kid really understood.
You ask the kids why they made that selection, and they say the bird could have made that wooden house, but the hole was too small for it to get its body into, so that’s why it’s got to be the other one. So they have all sorts of reasons for their responses. By the same token, I think it’s crucial to know what goes into people’s decisions about what’s a right and a wrong answer. All of this is validity; all of these aspects are validity just as much as the more technical kinds of validity, where you correlate a test with some other measure. You say, we can determine how valid this test is by criterion-related validity: we will correlate it with another measure out there that we think is established and reputable. Man, when you take a minority respondent that the test was not normed for, it’s a whole new ballgame. Those tests may have very little significance with that population; they don’t respond the same way. So we need to do our homework, we need to do more work, and I worry all the time about things like the ACTFL, the American Council on the Teaching of Foreign Languages, proficiency guidelines. I worry about these band scores where all sorts of dimensions and factors are melded together and go into a particular band, so you’re an intermediate plus if you can do this, this, and this. Yeah, but what if you can do a few of those things and not others? It’s a dilemma how you come up with something called a score. So I think we really are back to multiple measures, but this is very much a validity issue. What does a test mean? Does it actually measure what it purports to measure, beyond the fancy labels?
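The two technical ideas in this answer, consistency across administrations and correlation with an established measure, can be sketched in a few lines of Python. This is only a minimal illustration and not a procedure described in the interview: the function name `pearson`, the six students, and every score below are invented for the example.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least two scores")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("scores must vary for a correlation to be defined")
    return cov / sqrt(var_x * var_y)

# Hypothetical scores for six students (invented for illustration only).
first_sitting = [72, 65, 88, 54, 91, 70]               # speaking test, out of 100
second_sitting = [70, 68, 85, 58, 93, 66]              # same test, given again
established_measure = [3.5, 3.0, 4.5, 2.5, 4.8, 3.2]   # separate rating, 1 to 5

# Reliability as consistency: do the two sittings rank students the same way?
reliability = pearson(first_sitting, second_sitting)

# Criterion-related validity: does the test agree with the established measure?
validity = pearson(first_sitting, established_measure)

print(f"test-retest consistency: {reliability:.2f}")
print(f"correlation with established measure: {validity:.2f}")
```

A high coefficient here says only that the two sets of scores rise and fall together for these students. As the answer above points out, it says little about a population the test was not normed for, and nothing about what went into any individual response.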