- Great, well, welcome everyone. We're back for book club meeting number eight, chapter seven, the Trap of Routine Assessment. We're very lucky once again to have Audrey Watters here, and we're very lucky to have Courtney Bell, formerly of the Educational Testing Service and, much more recently, as in this summer, of the University of Wisconsin. And then as always, we're super grateful to have folks here with all different kinds of backgrounds from all different parts of education.
For those of you who are here live, welcome. If you could introduce yourself in the chat and tell us who you are, where you're from, what kinds of things you work on, and how the weather is where you are, we'd love to get to know who's here.
And for those of you who have made it back week after week, it's just a real treat to see you again and have you join us. So Courtney, the way that we ask everyone to introduce themselves, and we'll get more into your background specifically, but maybe you can start with your EdTech story. Is there any particular moment as a student or as a teacher that got you interested in assessment or in education technology, or what stands out to you as you think about your pathway here? - Yeah, so two things, the first of which, Justin, may make you say to yourself, "Oh wow, I really should have rethought this guest." So the first one is in graduate school. We had to take a course at a different institution. And as a graduate student you have that kind of self-consciousness of, "Oh, maybe I don't really understand what the big point here is."
So here we were listening to this super famous science educator, who, this gives away my age, was working with Palm Pilots. And this science teacher educator researcher person was going on and on about how this Palm Pilot was his amazing data collection tool, it was gonna revolutionize K-12 science education. And I am sitting there, wheels turning as fast as they can turn, like, "Okay, I don't get it." And I had just been a science teacher, a high school biology teacher.
I don't get it. How is this gonna work? Which science is it? What am I missing? So I come to EdTech as a skeptic, I should say first. A born skeptic about EdTech- - Say more. I don't know if you can remember in that moment, what was it about the description of the Palm Pilot that just seemed totally discordant with reality for you? - Well, the first and most important thing is it solved a problem I'd never had as a teacher. (laughs) I didn't have any trouble getting people to write things down and collect data. I had 8 million probes that I could use. I had these awesome graphing calculators.
Why do I need a Palm Pilot? It just solved a problem I didn't have as a teacher. So I'm like, okay, I must be missing something. Maybe it's somebody else's favorite, I don't know. - That's great. Did you say there was a second one? - Yeah, so the second one. So that's the skeptic in me, which goes to the second story, which sends the skeptic the other way. And I can share a slide to show people this setup later if we're curious about it.
But so when I was at ETS, we partnered with two organizations. One was TeachingWorks, which is a center at the University of Michigan run by Deborah Ball, for those of you who know her work, and Francesca Forzani, her colleague there. And so they were busy working on these things called high-leverage practices, which are discrete teaching practices that all teachers do across grade levels. They look very different across grade levels and subjects, but they're important, and they are repetitive. People do them a lot.
So teachers do them a lot. And so we'd been partnering with them and with Mursion, which is an EdTech company that we can say more about in a second. They do this kind of AI-supported avatar technology. So it's basically a person, an actor actually when the company was originally founded, behind and animating a number of avatars on the screen. And so- - Like a digital puppetry kind of thing. - You got it, digital puppetry.
- Somebody sitting in a warehouse with an Xbox controller and a voice modulator making five little avatars that look like children talk to people and stuff like that. - And it's a one-actor-to-five-kids thing in the case Justin is describing. So here's what you do in real time. They brought the technology to ETS to show us this. You would put on this headset, which had a microphone, and a camera would capture you. And what you were looking at, imagine yourself standing in front of a big screen like you would for a PowerPoint slide projector.
So you're up there in front of it and it's projecting, it's capturing you and you've got this headset on. And then the kids would say something. They would say like, "Oh, Hey, Ms. Bell." And so I watched a couple of people go through what we call the simulator.
And I was like, "Huh." And everyone's quiet and they kept asking, "Courtney, do you wanna hop in anywhere?" Okay, so I'm like, I'm in. So what was really weird for me is this: the skeptic in me fully expected this to feel like a kind of performing, like me thinking about how do I test this assessment system.
That's what I thought it was gonna feel like. And that is not at all what it felt like. It felt like somebody needed to do an MRI on my brain.
Like I seriously felt like I was a high school teacher again. It's like the neural pathways I was enacting, calling on the kids, who are, by the way, little cartoons, puppets, and I know that full well, right? And completely interacting with them using the kinds of thinking that I did as a teacher, both as a high school teacher and as a university teacher. And that was a profound experience to me. Until I was inside that simulator, I would never have believed it would have felt that way. - That's great. So the skeptic, fully prepared to screen new technologies and say, "That's not what's gonna work, that's not what's gonna be helpful." And then you found something that you could step into, this kind of digital teaching simulator, where you go, "Wow, this is making me exercise my brain in a way that feels really real and authentic to me and could potentially be helpful to other teachers."
That's great, that's a great introduction. So we probably ought to ask you for like a little bit more background just so people know where you're coming from. So you're a biology teacher, and then you taught for a little bit at the University of Connecticut.
And then you went and worked at Educational Testing Service. - Yep, and so- - For the folks who are out of the country, maybe you could just describe what ETS is and what your work there was like.
- Yeah, ETS is a nonprofit testing company. All the nonprofit part means is that the money they make from the test fees that all of us pay for various tests, the GRE, the TOEFL, the TOEIC, for those of you who are out of the country you're probably more familiar with those, gets invested back into the public good. And one version of the public good is supporting foundational basic science research and all kinds of things within the assessment domain. So while I was there, I started and led a center that developed assessments of teaching quality. Some of them were technology-enhanced kinds of assessments, and some of them were not at all, like observation tools used out in schools in the U.S.
and around the globe, actually. And most recently, just before I left at the end of June this past year, we were finishing a big large-scale study in eight countries of the relationship between teaching and learning, using all different kinds of assessment tools. It will be released next week by the OECD.
So my assessment background focuses on observation tools, but we've used multiple choice items, all kinds of computer adaptive stuff, all kinds of stuff around portfolios. So lots of different kinds of approaches, but for me personally, all around the assessment of teaching and teachers. That said, people in my center focused on students and their learning as well. - Great, so your expertise is in this really challenging domain of teaching as an immensely complex task, where the outcomes of teaching are really hard to trace. As all teachers know, sometimes it's obvious that the kid in front of you clicks and gets it, and then sometimes it looks obvious but they actually knew it before and you haven't taught them a thing.
And then sometimes it looks like they totally don't get it at all, and a month later they snap together something from November with something from December and have this major breakthrough. And half of what you're doing isn't really related to academic content anyway; it's making sure they feel like good, healthy, whole people. And how do we figure out who's doing that well and what they're doing when they do it well, so we can tell other people what that work looks like and raise up another generation of educators to be a little bit better than the last one? Is that a reasonable way of capturing what you're aiming for? - For sure. Complex performance assessment is what I would call it, for shorthand, in assessment language. - Good, yeah. So a simple performance assessment would be: can you add? Can you repeat something? Can you remember a list of numbers? And this is complex: doing a real-world task.
Okay, so hopefully now it's obvious to people why we invited you here to talk with us about the trap of routine assessment, one of these things that I describe as a fundamental dilemma for education technology: if we can't get better at assessment technologies, then there are gonna be parts of our education technology that remain stunted for a long time. Maybe we'll pull Audrey Watters into the conversation here. Audrey and then Courtney, especially for folks who haven't had the chance to read the chapter, what did you take away from it? What are the key arguments and key ideas here? And then we can get into what you think worked and made sense and what was a problem. Audrey, do you wanna start this off? - I wanna say one thing about ETS.
I actually just finished working on a book on some of the history of EdTech. And one of the people who I look at is Ben Wood. He ran ETS for a while, and his archives were at ETS headquarters in New Jersey.
And he was a professor at Columbia in the 1920s, '30s, '40s. He was really interested in standardized testing very early, and at the time things were graded by hand, and he was one of the first who became really interested in the idea of starting to use computational machinery, business machinery at the time, in order to be able to scale assessment, right? And so it's just interesting to think about this really long legacy of what assessment started to look like almost a hundred years ago in order for it to scale. He reached out to IBM, formed a partnership, and was very interested in building machines that would automate the grading of tests. And think about what those kinds of machines would look like in the 1930s.
It's not a surprise that they were multiple choice tests, right? And so the machinery that we use to automate assessment is actually one of those classic cart-before-the-horse kinds of things. And in some ways we're still using a technology of assessment, the multiple choice test, that's a hundred years old. And so I think what's interesting, and I liked how you talked about TUTOR, the programming language from PLATO, is that we think we're building these brand new artificial intelligence assessments that are using the latest and greatest in data analysis and machine learning, but really there's this whole other really long legacy of assessment that we're still kind of stuck with. And PLATO was a computer system that was developed at the University of Illinois, Urbana-Champaign.
It was one of the first massively networked computer systems. So people bought terminals rather than standalone machines, and they hooked into the central system, like we do now with most computers, but it was sort of a kind of internet before there was an internet. And TUTOR was one of the very first programming languages for PLATO after machine code.
And it was called TUTOR because one of the main things that people tried to do with the PLATO computer system was to teach other people. They taught them lots of things. The example that I cite in the book is from a lesson about art history, but a lot of what they taught was math. And then of course, universally throughout the history of computer-assisted instruction, people are trying to teach computer programming.
Some of the most popular lessons in TUTOR were how to use TUTOR to program other lessons that people could take on other topics. So one theme in the book is that not only our education technologies, but in some ways our whole learning systems, are shaped by what assessment technologies are available to us. And those assessment technologies have always been limited. They've always been constrained in various ways. - And of course, everyone should buy Teaching Machines, which comes out next year from MIT Press, by Audrey Watters, which I've had a chance to read and is outstanding.
And I'm hoping that we'll be able to do another conversation like this. But Courtney, what would be your takeaway on the trap of routine assessment? - I love the idea that it connects back to the workplace. I love the connection back to the sociological, which is right:
Society keeps valuing things that, to put it in very simple terms, are more and more complex human behaviors, whether it's problem solving or collaboration. And so increasingly, both economically and as humans, we value those things. That's fine, and we're able to take apart and decompose or dissect the lower-level things, but it's those that we teach to computers.
And so, to connect to your world, Justin, our assessment technology is by definition always gonna be behind the thing that we value in society and the thing we want most for our children, or want most for our undergraduates, for example. And so that's the trap, right? And so how do we think about the nature of that trap? One of the things the chapter offers us is this idea that it might be possible to broaden out what those computers can do. We could pick off a little bit here, like maybe the framing of a problem. The computer can't figure out how to score whether or not Courtney can problem solve, but the computer could figure out, or help figure out, make more possible at scale:
Does Courtney have the ability to read a complex situation and frame the problem? That's one piece of problem solving, not the whole of it. So that to me is a very striking idea for a way to walk forward, from a progress perspective. - And I think the chapter is a bit hand-wavy about why this is so hard. One of the arguments that I make is just sort of empirical: if you look back over the last 20 years at education policymakers, there's no serious person out there who is arguing, "No, no, no, we basically have all the assessment technologies that we need, let's just use them." And in the era of the Common Core in the United States, there were these two huge testing consortia that were created.
They had millions of dollars put behind them, PARCC and Smarter Balanced, consortia of states with the expertise to try to come up with something better. And then there are universities, there are organizations like ETS and the College Board; there are lots of smart people working on these things. And somebody this morning, actually in Russia, asked me, well, when AI comes along, how much of a difference is this gonna make? And my answer is something like, I don't think that much. We've had super smart people working on this problem with millions of dollars at their disposal for a long, long time, and lots of motivation, both financial but also educational and moral. I mean, I assume that the people in testing companies look at their tests and go, "Yeah, we wish these things were better."
But I don't think I explained very well why it's the case that it's so hard to make progress. And I'm wondering if you, who've been inside the belly of the beast, can give us some more insight into that dilemma. - Yeah, so I'm gonna try, Justin, if you'll let me. I don't know if you need to make me a host. - I think I have. - I'd like to try to share a slide.
Let's see if we can pull that up. Let's see if this works. So first, can you see this slide, the one that says overlap in activities?
Okay, so don't read the slide for a second. That's a bad teaching move; I should have waited to show you, very sorry. So first we need to think together about what we mean by the word assessment, right? We wanna know some information from a certain kind of setting. And so we can think about the contexts in which human beings interact in the world.
Let's think of them as practices with a little p, not the capital-P Practice of teaching, but a little-p practice. I can engage in a certain way with students around, let's say, double-digit addition, and it goes a certain way. That's a practice, and I do it repetitively. Okay, fine. So let's say what we really care about. We could take a student, but let's keep the teaching example.
So let's say what I really care about from an assessment perspective is: I really wanna know, can Justin teach very well? That's really what I wanna know. I want to assess that. And so Justin has a practice. A little-p practice called Justin's teaching is going on all the time in the real world.
So on the left-hand side of the picture, this is a Venn diagram, there's Justin's real-world practice of teaching. So for example, Justin plans his lesson this week. He thinks about the unit; teachers in general do not think only lesson by lesson, they often have a curriculum.
And then they make a plan for the lesson of that day. They go ahead, they teach the lesson, they think about the lesson, they reflect on it. And then they seek to address that particular lesson's strengths and weaknesses over time. Maybe they found out that Audrey didn't understand something in the lesson on Monday, and so the teacher decides to kinda reteach that thing on Tuesday. Okay, fine.
That all happens, and that's a part of Justin's little-p, real-world teaching practice. Now we say to ourselves, we wanna assess Justin's teaching. We wanna know how well Justin teaches. We now have to intervene in some way in that real-world phenomenon. In the case of children, this is the real-world phenomenon of how, for example, they learn how to read, right? They're learning how to read in lots of places, not just inside of school; they're learning in their homes, as they ride down the street on their bicycles, et cetera, seeing the stop signs.
So when we layer the practice of assessment on top of real-world practice, which is, again, the thing we care about in assessment, now we create all kinds of things going on. First, we've gotta make decisions about what lesson we're gonna go watch; Justin's gonna engage in selecting that lesson. If I'm Justin's principal, he's gonna think real carefully about which lesson he invites me to, if he gets the choice to invite me. Justin and I, his principal, might go back and forth about that lesson, et cetera, and on and on.
The point here being that the assessment practice of observing Justin teach, or Justin's teaching in an assessment situation, is not the same, by definition, as Justin's real-world teaching. That is the real- - It's like the observer effect that people see throughout science. When you intervene in some way, the circumstances become different.
Like, what we would ideally want to evaluate is this sort of abstract thing called real teaching. But as soon as we start looking at Justin's real teaching, we don't get Justin's real teaching anymore. We get Justin's teaching under assessment. - That is exactly right, and you cannot escape that fact. Maybe somebody will argue with us, Justin. I hope they do.
My assertion is that's always true in every assessment. And so if that's the case, then we think to ourselves, where can technology fit into this thing? Some people argue that the thing we need to do is to use technology to create opportunities to make the assessment space more like the real-world space, right? This is the gaming stuff that you talk about. A low-tech version of this is when we started to do student portfolios; Vermont actually had a huge effort around student portfolios. So we wanna keep the things and then use various kinds of technologies to build upon the real-world setting. So I think whenever we're assessing, and in the places where we can imagine assessment technology coming in, unless we figure out which pieces of this we're gonna engage the technology in, we will always be doing the kinds of machine learning things that you were reacting to with your Russian colleague: like, yeah, pretty sure you're not gonna make a lot of progress on that. So we have to be clear-eyed about the reality that there is only so much, right this second, of the real world that the technology can get at.
And so people will be dissatisfied until we begin to learn about, for example, things as complex as games and simulations, et cetera. But that's not gonna get us out of the problem that those are not the thing that we really want. What we really want is: do kids understand science, can kids problem solve, et cetera. - So a thing that you're maybe arguing against in the book, one way you can interpret my chapter, is that I say, look, what we basically do with assessments is some form of pattern matching. And the pattern matching was very simple and very boring with the PLATO system.
You just programmed a list of answers into it, and the response had to match that list. Originally some of the programming had to be very precise: if someone misspelled the number five as F-I-V and it wasn't in your bank, they got the answer wrong. And much of what we've done to improve assessment since then is to do more complex pattern matching. Like, basically, when Duolingo or other language learning apps are deciding whether or not you've pronounced a word correctly, they're still not really listening to the word.
They're just looking at a bank of probabilities about what correctly pronounced words look like and seeing if your pronunciation kind of lines up with that. And I sort of make the claim, "Well, it seems like the way we're gonna make assessment better is by coming up with more and more clever ways of doing that kind of pattern matching." And I hear you saying, "No, maybe not." That might be one thing to do, but a more interesting thing to do is to ask the question: can technologies create environments in which we have more observational control but people still perform in ways that seem authentic and natural to them? Like you described with the Mursion teaching simulator, where a great thing is that you can basically do it in front of any computer monitor.
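To make that pattern-matching contrast concrete, here is a minimal, hypothetical sketch, not actual TUTOR or Duolingo code; the answer bank, the function names, and the misspelling example are all invented for illustration.

```python
# Toy illustration of assessment-as-pattern-matching (hypothetical, not TUTOR or Duolingo code).

def exact_match(response, answer_bank):
    """PLATO-era style: correct only if the response is literally in the programmed bank."""
    return response in answer_bank

def normalized_match(response, answer_bank):
    """A slightly 'cleverer' matcher: ignore case and surrounding whitespace before comparing."""
    cleaned = response.strip().lower()
    return cleaned in {a.strip().lower() for a in answer_bank}

answers = ["5", "five"]                    # the designer must anticipate acceptable forms up front
print(exact_match("FIVE ", answers))       # False: the literal string isn't in the bank
print(normalized_match("FIVE ", answers))  # True: normalization catches the variant
print(normalized_match("fiv", answers))    # False: the misspelling still fails, as in the anecdote
```

Either way the move is the same one: a richer bank and a looser comparison, not any understanding of the answer itself.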
The simulation doesn't have to be with a particular group of students on a particular day in a particular whatever. That is a more promising way of getting out of the trap of routine assessment: building cooler worlds for people to perform in, rather than just trying to do better pattern matching on their responses. How fair is that? - I think that's fair. And I would add: not just cooler, there's actually something very specific we're aiming at. We want the technology to be able to get us closer to the real-world actions that the person is engaged in that we care about.
So in the case of teachers, we care about the quality of teaching, and that is in part their decision-making, their moment-to-moment decision-making. Does the teacher hear Audrey give the wrong answer and decide to ignore it, because the teacher's then gonna call on Justin, who she knows has tracked that math problem and is gonna be able to explain to the class the steps that he went through? Or does the teacher make the choice: let me let Audrey say whatever Audrey's gonna say, and I'm gonna work with what Audrey brings to us as a class.
That's a decision a teacher makes in a moment. So if we can get our technology to put our teachers, or in other cases our students, in a space where they're more likely to engage in the behavior we care about, we're much more likely to learn something that's worth knowing. - I'm gonna go ahead and bring in Candace Thille, who's coming in next week for the toxic power of data and assessment. Candace, if you wanna try to hop in; she promised to argue with us.
So go ahead, Candace. We can't hear you yet, Candace. But let's move on; you'll hop in later if you can. Christine DiCerbo asks a great question, which is: does the digital data collected in online learning environments reduce the need for tests as we know them? What if we gather and aggregate that data? This might not exactly be what Chris is talking about, but some of that reminds me of some of the things that we discussed about stealth assessment.
This idea that one of the things that feels uncomfortable about assessment to teachers is that we do regular learning and regular learning, and now we're gonna stop and do this thing called assessment. Wouldn't it be better if, instead of doing this separate thing called assessment, we just let people keep doing the thing that they were doing and gather data in an online environment? So there's a woman at Florida State University, and in the book I say that she's from the University of Florida, and that's wrong. She's from Florida State, and it's at the top of my errata. So if Val Shute ever listens to this, you have my apology. This was fact-checked by multiple people and we got this one wrong, but it's my fault.
But Val Shute at Florida State University has built this system called Newton's Playground. Basically you do a bunch of physics puzzles, and the idea is that by watching people do a bunch of physics puzzles, you ought to be able to infer their understanding of physics without having to stop them and be like, "Okay, calculate this formula, give me this definition." That's at least how I interpret Christine's question. Where do you see that as a frontier for this? - So we have people at WCER, where I work now, that are building these kinds of educational games, like Val's work. She used to be at ETS, by the way. We overlapped a teeny bit, and she's just fantastic and such a brilliant thinker.
I hope she gets the correction; I imagine she's probably read your book. But at any rate, one of the puzzles that the field faces right now, that frankly at ETS we were still really working on and had not made a tremendous amount of progress on, is how you score that thing. Here's the deal with all that metadata: the metadata is useless without a very clear design for the ways somebody can go through that assessment, which I'm gonna call a task.
Because when you frame a task, let's take one that might be familiar to all of us. We have a problem-solving one at ETS that some researchers are working on, where you want three people in a virtual environment, let's say, to solve some problem. So you have to make decisions as a designer about what you include. What are the prompts? What are the likely ways someone would access the various resources? Are you trying to measure just their ability to learn, let's say, the tools of problem solving? Or are you, for example, also trying to understand to what degree, over let's say two hours of interactions with games in this vein, they begin to learn how to close their mouth and ask a question that somebody else can answer, which is a collaborative problem-solving skill? The knowledge versus the skills that you're trying to assess require you, on the front end when you build those tasks, to have a very clear idea about the claim you want to be able to make on the back end about the person going through it.
And that is a serious engineering problem that also requires you to specify a way to score. You have to say to yourself, what are we gonna pump out of this thing? A number, one to four? Courtney's better than she was the time before she did it, so she gets a four instead of a three because she did, quote-unquote, better? Do we weigh her knowledge of problem solving the same as we weigh her skill at being able to close her mouth and listen to other people's thinking in the problem-solving space? Those are all the nitty-gritty, down-in-the-weeds design decisions you have to have thought about if you hope to be able to make any kind of a claim on the back end.
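As a toy illustration of those design decisions, here is a hypothetical sketch; the weights and cut points are invented, not an ETS scoring model, but they show how the choice of what to weigh has to be pinned down before any one-to-four claim can come out the back end.

```python
# Hypothetical sketch: turning two sub-scores from a task into a reported 1-4 score.
# The weights and cut points are invented design decisions, not any real scoring model.

def report_score(knowledge, collaboration, w_knowledge=0.5, w_collab=0.5):
    """Combine two sub-scores (each 0.0-1.0) into a single reported score from 1 to 4."""
    composite = w_knowledge * knowledge + w_collab * collaboration
    cut_points = [0.25, 0.5, 0.75]   # design decision: where the 1/2/3/4 boundaries sit
    return 1 + sum(composite >= c for c in cut_points)

# Same evidence, different weighting decision, different claim about the test taker.
print(report_score(0.9, 0.4))                                  # equal weights -> 3
print(report_score(0.9, 0.4, w_knowledge=0.8, w_collab=0.2))   # knowledge weighted more -> 4
```

The arithmetic is trivial; the hard part is defending why those weights and boundaries are the right ones for the claim you want to make.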
So I hold real possibility for that, but we should not kid ourselves: those are really complicated design spaces to develop. And if we could figure out how to do it over multiple tasks, I think we really could build both curriculum and assessments that could do the kind of thing that Val and others are after. - But one of the statements you made is that there's gotta be some way of choosing in advance how you're gonna score it. Why does the assessment need a score? Is that part of Chris's question? - Yeah, it's such a good question.
I don't think it does need a score, depending on your purpose, right? And this gets to the scale question that you talk about: what do we want the technology to do for us? What do we want the assessment to do for us? Let's say you're a teacher, and what you really care about is the degree to which students are learning to elicit one another's thinking and respond to it appropriately. So maybe you use it more formatively, diagnostically, for you as the teacher. You wanna watch as a group of 25 kids work on these tasks, and you need some way to report out; the computer program needs some way to report out.
And you then, as the teacher, need some way to figure out: is it changing in the direction I want it to change? So maybe those are verbal descriptions, maybe there's a summary, who knows? It doesn't have to be a number. But as soon as you start to get up to scale, and you want any of those data to inform larger-scale decisions, at the school level, or maybe even at the classroom level, do they all understand it or not, those kinds of things, in general we tend to get down to numbers. We need a way to summarize movement on some underlying principle. - Yeah, I mean, I think very commonly we use assessment to sort and rank people, to say you belong here and you don't belong here.
But even if you got rid of the sorting and ranking function, one of the things that you all do in your center at Wisconsin is these English language learner tests. And you could choose never to use those English language learner tests to sort and rank. You could just use them to say, what parts of English language learning do students in X school typically get better at quickly, and what parts do they get better at slowly? Because if we could find the things that they're getting better at slowly, that might be a better place to invest our professional development, our other kinds of resources. You wouldn't necessarily have to have sorting and ranking as one of your goals in order to have scores be something that you would find useful. Especially if you then were the State of Wisconsin, saying, "Okay, across all of our schools, what are our English learning teachers doing well? Where do they need more help?" Those are the kinds of education policy decisions that we wanna make that might wanna have some numbers associated with them.
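As a tiny, hypothetical sketch of that non-ranking use of scores, with invented domains and numbers rather than any actual reporting method, you could rank the skill domains, not the students, by how slowly they grow:

```python
# Hypothetical sketch: use fall-to-spring score pairs to find slow-growth domains,
# which might be where to invest professional development. All numbers are invented.
growth_data = {
    "listening": [(3.1, 3.9), (2.8, 3.6), (3.5, 4.2)],
    "speaking":  [(2.9, 3.2), (3.0, 3.1), (3.4, 3.6)],
    "reading":   [(2.5, 3.4), (3.1, 3.8), (2.9, 3.7)],
    "writing":   [(2.7, 2.9), (3.2, 3.3), (2.6, 2.8)],
}

def average_growth(pairs):
    """Mean spring-minus-fall gain across students for one domain."""
    return sum(spring - fall for fall, spring in pairs) / len(pairs)

# Sort domains (not students) from slowest to fastest growth.
for domain, pairs in sorted(growth_data.items(), key=lambda kv: average_growth(kv[1])):
    print(f"{domain:9s} average growth: {average_growth(pairs):+.2f}")
```

Used this way, the same scores never rank a student; they only point at where teachers might need more support.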
- Good, that's enormously helpful for understanding why things are developing slowly. It would be great to hear more, both from Audrey and Courtney, about other things in the chapter that you thought, "Oh, this doesn't sound quite right, or we ought to rethink that." And then if there are questions coming up in the chat, it would be great to hear from some other folks. So I'll keep processing the chat, but Audrey and Courtney, were there things in the chapter where you went, "Oh, I'm not sure if that's the right way to think about this"?
- Let me think. I don't think that there was for me. I was really struck by a couple of pieces, and actually I think it ties into what Christine DiCerbo said about whether digital data contains all the answers for us. This is the idea you're talking about with the reification fallacy, right? That when we call it a math test, we're assuming that math is the only thing the test is capturing, when really it captures a lot and fails to capture a lot of other things. And coming back to Christine's point, I do think that there is a narrative, and I think a lot about, what's his name? He's at Google, Peter Norvig. He wrote a piece, I think it was called The Unreasonable Effectiveness of Data, just this idea.
And I think it is very commonly held among a lot of engineering folks: that as long as we have tons of data, the answers are going to bubble up to the top. Norvig would say, "We don't need theory anymore, right? We just have data."
And I think that that runs really counter to some of the stuff that Courtney was talking about, with really carefully thinking about not just how do we design assessment, but what are we designing in terms of instruction, in terms of curriculum as well? - I love that thought, Audrey. I also think one of the things I'd argue back, I guess, into the chapter, is not so much that the chapter gets it wrong, but the chapter has to talk narrowly about assessment as kind of a tool that we can use for a particular purpose. And it puts into the background, right, all authors have to do this.
It puts into the background the social setting in which it comes and sits. So the story that you tell about the duel: can you create an automated scoring engine that can throw all the words into a word bag and figure out, with something similar to a human level of reliability, what score that essay should be given? And then the computer programmers on the other side are like, "Oh, let's break this thing. Let's figure out how to send in gibberish and have that thing score high." And one of the things that's so profound about that, and you do mention it, is this idea that some people object to automated scoring engines for text because they feel like we write for audiences; we're human beings in interaction. You never write for a nondescript audience.
You're always writing for a purpose. So already you're bankrupting it by first making it an assessment, and second, when you put an automated scoring engine into the whole thing, it's like, what are we even after anyway? And the thing that it made me think about is that assessment at some level is built on this idea that we know what we're measuring and we agree on what the scores coming out of that assessment mean. And so if the technology is only ever aiming at getting those scores right, and doesn't actually aim at getting the meaning right, I'll say something provocative: I think we've done something where we've actually started to erode the trust in the assessment itself, because the person taking the test and the person designing the test are actually trying to get at math knowledge. They're actually trying to get at writing capability, right? That isn't to say we shouldn't work on automated scoring engines; for sure we should.
But it is to say assessments wind up in social situations, and the whole assessment enterprise is built on shared trust and shared meaning in what those scores mean. So to the degree technology begins to undermine that, we've got a serious problem on our hands.
- Well, I think the most powerful illustration of that undermining of trust and communication, and I'm sure there are others, Audrey may have her own suggestions, was with peer grading. When massive open online courses were released, there were a bunch of folks who realized we're not gonna be able to assess some of the things that we most care about using multiple choice questions or AI grading. But it's entirely possible that if we ask a bunch of people in the class to evaluate someone else's performance, then what we'll find is that the average of those peer assessment scores comes out to be typically what an expert would say, or that a group of peers will disagree about as much as two experts will. And that proved generally to be true. It proved particularly to be true when people came up with clever mechanisms, like first testing people as peer graders.
So they'd say, "Courtney, evaluate these five essays from your colleagues," and then secretly give you two that they had already graded. And so if you were way too easy or way too hard, we could be like, "Alright, let's downgrade Courtney because she's way too tough, or she's way too easy." And if you're right on with those other two, we can upweight you.
Once we do a few tricks like that, it turns out that if you randomly assign a hundred essays to be graded by peers, and randomly assign a hundred essays to be graded by experts, they average out to about the same scores, or reasonably close. But what a grade means to a person in a course is that there is a single mechanism that has evaluated your performance, usually an expert, and then given you some meaningful feedback on it. Not this gibberish that it just took me three minutes to describe, of like, yeah, a bunch of people looked at your thing and we averaged it, and we're pretty sure that that average is about what an expert would have given you. And so, even though these people are individually clueless, we're fairly confident that the wisdom of the crowd can substitute for expert evaluation.
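Here is a minimal sketch of the calibration trick Justin describes; the bias correction, the numbers, and the function names are invented for illustration and are not the mechanism any particular MOOC platform actually used.

```python
# Toy calibrated peer grading (hypothetical): estimate each grader's bias on secretly
# pre-graded "seed" essays, correct their other grades by that bias, then average.
from statistics import mean

def grader_bias(grader_scores, expert_scores):
    """Average amount a grader over- or under-scores the essays experts already graded."""
    seeded = grader_scores.keys() & expert_scores.keys()
    return mean(grader_scores[e] - expert_scores[e] for e in seeded)

def calibrated_grade(essay, graders, expert_scores):
    """Average the peers' scores for one essay after subtracting each peer's estimated bias."""
    adjusted = [g[essay] - grader_bias(g, expert_scores) for g in graders if essay in g]
    return round(mean(adjusted), 2)

experts = {"seed_1": 3.0, "seed_2": 4.0}                   # essays the staff already graded
peer_a = {"seed_1": 4.0, "seed_2": 5.0, "essay_x": 5.0}    # consistently too easy (+1.0)
peer_b = {"seed_1": 2.5, "seed_2": 3.5, "essay_x": 3.5}    # consistently too tough (-0.5)
print(calibrated_grade("essay_x", [peer_a, peer_b], experts))  # 4.0 after both are corrected
```

Averaging corrected peer scores can track expert scores reasonably well, which is exactly the statistical claim that, as the rest of this exchange suggests, still fails to feel like feedback from a trusted reader.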
And some of the research that came back on people's response to this was like, "No, this just doesn't feel right." It wasn't enough to build the peer assessment technology; you had to sort of enculturate people into a new culture where there's a new kind of trust. And the thing that peer grading was trying to get away from was these mechanisms of automated assessment that felt cheesy. I mean, it feels gross to have a computer say, "Well, we didn't actually read your essay, but we predict based on your word usage that a human would have graded it as X." But even when you have multiple humans, you still come up with these kinds of dilemmas in the social situation.
- To that end, Justin, I think this ties back to what we were talking about when Dan Meyer was the guest: in that situation, those peers were not part of your community. I mean, we use the word peer, but in a MOOC with 10,000 participants, they weren't really part of your community. And I think the other piece that you just alluded to, the claim about a lot of this automatic grading stuff that always makes me chuckle, is that they say the auto-graders are just as good as the people who grade the essay portions of standardized tests.
But the essay portions of standardized tests are also graded in these massive warehouses by people making barely minimum wage, who are given a rubric to follow. It's not the way in which your teacher, again, who's part of your learning community, would grade you. It's a job that someone got off Craigslist and is doing for minimum wage. I mean, the bar is really low. We're not having students' work read by their community. We're not having their writing read by people that they're engaged with.
- That's right, and I mean, I think it's exactly what Courtney said too, about the idea that you're trying to modify the social situation, like, "Hey, the 10,000 of you that are taking this class, you're a community now, and your community is gonna evaluate this." And of course, lots of people will go, no, no, no, I'm pretty sure these 10,000 people I've never seen before, some of whom are just typing gibberish into this peer editing thing, are not my community. Yeah, absolutely.
But that peer editing might come off very differently in a group of 30 or 40 that know one another, rather than in a group of 10,000 whom you could never plausibly meet. Eric, who works on Graspable Math, makes a comment saying that part of what they're trying to do with their particular piece of education technology related to stealth assessment is to not make people feel like they're being assessed, because when people feel like they're being assessed, they behave in ways that are not real-world ways. In your field of teacher evaluation, Courtney, this must come up all the time. I teach a certain way when it's just me and my students in the classroom; as soon as someone else walks in with a Palm Pilot that's taking notes on my performance, or with a rubric, or with a video camera, or something like that, I go, "Oh, maybe I don't wanna teach as regular me," the regular me who makes a bunch of off-color jokes and does these other kinds of things. Is that connected to this idea of the observer effect again? As soon as we tell you you're being assessed, you start doing something different. It seems like a major problem with assessment.
- By definition, and then when you scale it, you get Campbell's law, right? Which is sort of a double disaster. By assessing it, you now incentivize certain kinds of behaviors, which is back to the thing Audrey was commenting on: we reify what we think math learning is down to this assessment, right? So here's the thing: it's not like out there in nature, so to speak, out in the wild, everything is all great.
Things are not great, right? Lots of kids are not learning what we want for them to learn, what they're capable of learning. So we shouldn't vilify assessment as, "Oh, this thing is just awful, a hundred percent." It's not. But we absolutely have to have clear eyes that even in a technology-enhanced environment, it's only ever gonna do part of the work that we need it to do. And if we could get over that and be like, "Yeah, okay, it's only gonna do part of it," let's figure out how to make it do the work that it's best suited to do from a technology standpoint, like we were talking about at the beginning, Justin: build rich environments, figure out how to get at interactions.
That feels really good, and that's a fruitful direction to press ourselves in as researchers. - And what sort of sustains you? Because this is something that you've dedicated so many years of your career towards. You come at this as a bit of a skeptic, you come at this with some authentic teaching experience, and then you decided to go into assessment design.
And because you're there, you see all of the problems and all of the challenges that Audrey described. What are the things that make you go, at the end of the day, yeah, this is still a thing worth really pushing on and worth really trying to get better at? - I mean, I guess this is so personal. Left to our own devices, we're tribalists, right? We're people who love one another. We wanna take care of one another. But the truth is we're pretty bad at doing it with people that are different than ourselves. And we have a really long history in this country and around the world of doing that to one another.
So assessment can be, and people will really hate this thought, but one upside of NCLB, No Child Left Behind- - That's No Child Left Behind, the act that mandated greater assessment in third through eighth grade and 10th grade in the United States. - It shed light on something that had been going on in the United States for years, which is that we have been failing certain groups of kids systematically and egregiously. Egregiously.
Now, it led to a ton of other horrific implications: bubble kids and all kinds of teaching behaviors and test-prep kinds of things that truly have been very detrimental for education. But if we can't document that there's a problem, it's very hard to act on it. So for me, if we take assessment and we right-size our expectations of it, and we try to work on its most productive aspects, and we treat it as a part of a larger system that we use to help ourselves get better in this democracy, speaking specifically in the U.S. context, to create more equitable learning opportunities and outcomes for all children in this country, I think that's the best we could hope for.
And I'm not optimistic we'll do it without something that sheds that light. I guess that's the thing. So in some ways it's like choosing between two evils: do we let ourselves just keep going as we are, or do we pick a tool or set of tools, put it together with other information, and chain ourselves to that as a society to try to use it to improve overall? And I guess I choose the second. - Well, I don't think you could have a more impassioned argument for the tinkerer's view towards assessment: this is what we have, and there's a particular function that it can perform if we build it well and if we build the whole system around it well.
And, for all of its imperfections, to say, well, let's keep working on this thing until we can get it more right. Audrey, do you have any parting words, thinking about the trap of routine assessment, for this week? - Yeah, I mean, it's been interesting to watch the chat and think about these ideas of how students feel anxiety and distrust around assessment and how we create situations where students feel less anxiety. And I think that part of that is not actually having the stealth assessment, where students aren't just taking their big test in the spring every year or two but are somehow always being assessed; that seems to lead us down some of the other paths that we've talked about already with surveillance, and I guess we'll talk about with Candace, right, when we think about data. But I do think, how do we answer some of these questions, and are there ways in which we can think about it without being so reliant on scores? I think that that's an important takeaway.
- Terrific. Well, Audrey Watters, thank you once again. Courtney Bell, thank you so much for joining us; a really great conversation, and one that I hope a lot of folks will have benefited from. I know it was helpful for me to think about some of the ways that the chapter on the trap of routine assessment proposes some ways forward, but you've added some more to that list, which was really wonderful.
So thank you so much, Courtney, for joining us. - Yeah, thank you guys, it's been great. - Good. We've got one more conversation coming up next week with Candace Thille, who helped develop the Open Learning Initiative and then went and worked at Stanford, and is now, I don't know her exact title, something like Chief Learning Officer at amazon.com, a wonderful thinker and researcher. And we're gonna be talking about data and data collection. We're gonna be talking about experimentation, and we're gonna be talking about the powerful ways that these tools might be able to improve online learning and learning at scale.
And then also the ways that they can feel pretty gross, collecting vast amounts of data about small children and using them as guinea pigs in experiments. And like this week, we'll try to see if there are any pathways forward through those thorny dilemmas. So thanks to everyone for joining, stay safe out there around the world, and we'll hopefully see you next week with Candace and Audrey again.
And then one last week to wrap up the book with Kevin Gannon and Cathy Davidson. Thanks so much everybody, have a great afternoon.
2021-02-14