Carmen Wilson taught the construct validity research lesson on Thursday, October 14, 2004, in Psychology 451/551: Psychological Measurement. Rob Dixon observed, along with two guest observers from the Psychology Department, Tracie Blumentritt and Susan Wabaunsee.
Summarize the evidence, identifying major patterns and tendencies in student performance.
The evidence of student learning we collected consisted of:
• Students’ written work: 1) definitions of depression, 2) test items to measure depression, and 3) descriptions of studies to determine test validity.
• Observations of students during the lesson. Three instructors observed students, and a videographer filmed the lesson.
• Minute paper. Students completed a three question “minute paper” at the end of class (what was most difficult; what was the most important thing you learned; what is still confusing).
The comments below identify patterns for some students and some groups and “problems” observed in one or more groups. The problems may not typify the entire class but it is important to make note of them when we revise the lesson.
In preparation for the class, each student developed a definition of depression. Students who were familiar with the concept from other classes wrote well developed definitions.
At the start of the lesson, students were in small groups and asked to compare their ideas and formulate a single group definition. Students were familiar with the concept of depression from other classes and were able to produce a lot of characteristics.
• Problem: Familiarity with the topic of depression had a dual effect. It was easy for groups to generate a list of characteristics about the concept, but some got bogged down in minute details about depression (e.g., talking about etiology and whether it is always triggered by a traumatic event).
Based on their group definition, students created several test items to measure depression.
• Problem: Some groups got bogged down in framing/posing questions effectively.
• Problem: Some groups got bogged down on what statistical test to use.
• Problem: Little or no discussion about the nature of a “construct.” Were students aware that they had created a construct?
Groups then designed two studies to determine whether the test they had just created actually works.
• Problem: Some groups got caught up in technical details of doing a study (e.g., how to word the questions properly, how to get subjects to participate) rather than the logic of the research design. Were students aware of the real point of the lesson—to test validity?
• Problem: In some cases, there was no evidence from the group discussions that students understood the logic of determining test validity. It’s not clear whether students understood it and therefore had no need to discuss it or whether they really did not understand it clearly.
Based on their worksheets:
• 6 of 9 groups described one study that was acceptable
• 6 of 9 groups wrote a tangled description of at least one study. In these cases we could not determine what the students understood about validity.
Last segment of the lesson. Groups were given a handout that described two tests (one of depression and one of academic skills), and asked to use these to determine the validity of their test. Groups were perplexed about whether the academic skills test was related to depression or not.
• 5 groups wrote an acceptable description of a divergent validity study and two others seemed to be on the right track but their answers were tangled.
Describe major findings and conclusions about what, how and why students met or did not meet learning goals.
The concept of “construct” was invisible during the lesson. There is no evidence that students are aware of the idea and how it relates to the test they developed.
Based on observations of groups and their written descriptions:
1. Most groups designed at least one acceptable study to determine test validity, which suggests they “get” the logic at some level.
2. Many group written responses were tangled and incomplete. We can’t tell whether they didn’t get it or just can’t talk about it or are just careless.
3. Groups got sidetracked in the details of research design and data analysis.
4. There were multiple examples in which the term “correlation” was used or invoked incorrectly. In some cases they used it generically (e.g., “correlate scores”) without indicating which scores and in some cases they were just wrong (e.g., “correlate the scores of two groups;” or they used negative correlation when they meant no or low correlation).
5. There is a group dynamics issue. Some students are not engaged, some groups are not effective.
6. The results are actually pretty good considering students have not read about or been taught about construct validity.
Based on your analysis how will you change the lesson?
1. Stay with the overall structure of the lesson
2. Have students define depression and develop test items outside of class
3. Pick a different type of test for the divergent study (e.g., math facts) so students are less likely to anticipate a connection to depression
4. More analysis at the end of class to compare, analyze and discuss their studies and connect them to the major concepts of the lesson
5. Try to de-emphasize statistics and focus more on “evidence” and general descriptions of the studies.
Contact: Bill Cerbin
Project Log 6: Summary of evidence for the second iteration of the research lesson, taught March 2, 2005.
Summarize the evidence, identifying major patterns and tendencies in student performance.
Observers’ comments and evaluation: Four observers filled out an observation protocol. Below are the scores for each Likert scale item [1 (totally disagree) - 7 (totally agree)]. The variability of scores indicates pronounced differences among the groups in terms of their understanding of the concepts. There was little or no evidence that students understood the concept of “construct.”
1. All members participated in the process 3, 2, 6, 5
2. The group was able to stay on track with the lesson (i.e. did not derail, discussing irrelevant information) 6, 2, 5, 5
3. The group seemed confused about the technical processes of the lesson 2, 5, 2, 2
4. The group seemed confused about the concepts the lesson was addressing 2, 6, 2, 5
5. The group seemed to understand the concept of construct validity 5, 3, 5, 2
6. The group seemed to understand the concept of construct. NA, No evidence, NA, 3
7. The group seemed to understand the logic of construct validity 6, 3, 6, 2
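One way to make the spread in the observers' ratings concrete is to compute a mean and range for each item. The sketch below is illustrative only: it uses the scores listed above, shortens the item wording into hypothetical labels, and leaves out item 6 because most of its entries are "NA" or "No evidence."

```python
# Observer ratings copied from the list above (1 = totally disagree, 7 = totally agree).
# Item labels are shortened here for readability; item 6 is omitted (mostly NA).
ratings = {
    "All members participated": [3, 2, 6, 5],
    "Stayed on track": [6, 2, 5, 5],
    "Confused about technical processes": [2, 5, 2, 2],
    "Confused about concepts": [2, 6, 2, 5],
    "Understood construct validity": [5, 3, 5, 2],
    "Understood logic of construct validity": [6, 3, 6, 2],
}

# For each item, store (mean rating, range between highest and lowest rating).
summary = {}
for item, scores in ratings.items():
    mean_score = sum(scores) / len(scores)
    score_range = max(scores) - min(scores)
    summary[item] = (mean_score, score_range)
    print(f"{item}: mean = {mean_score:.2f}, range = {score_range}")
```

With only four observers, a range of 3 or 4 points on a 7-point scale is a rough indicator at best, but it is consistent with the observation that the groups differed noticeably from one another.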
Problems observed in some groups
• Bogged down in technical details such as how to word test items.
• Two of the observers watched groups that were particularly ineffective during the lesson. It was difficult to tell what each student actually thought or knew in these cases. This indicated that students could be disengaged, lost or inattentive during the lesson.
• Students did not discuss the concept of “construct” at any point in the lesson.
Brief Analysis of Groups’ Written Work
All nine groups described a study in which depressed and non-depressed individuals would take the group’s test. The groups’ ability to describe results that would support the validity of the test varied. One group appeared to be on track with the following:
• “Give our test to a group of 500 clinically diagnosed depressed people as well as a group of 500 people that have not been diagnosed as depressed. See if the scores of both groups are significantly different from each other. Clinically depressed people would score higher than non-depressed people.”
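The logic of this group-differences study can be sketched in a few lines. The example below is a minimal illustration, not the students' or instructors' analysis: the scores are invented, the samples are tiny, and Welch's t statistic is just one reasonable choice for comparing two group means.

```python
import math
import statistics

def welch_t(group_a, group_b):
    """Welch's t statistic for the difference in means (group_a minus group_b)."""
    mean_a, mean_b = statistics.mean(group_a), statistics.mean(group_b)
    var_a, var_b = statistics.variance(group_a), statistics.variance(group_b)
    standard_error = math.sqrt(var_a / len(group_a) + var_b / len(group_b))
    return (mean_a - mean_b) / standard_error

# Hypothetical test scores (higher = more depressive symptoms); not real data.
diagnosed = [31, 28, 35, 30, 27, 33, 29, 34]
not_diagnosed = [12, 15, 9, 14, 11, 16, 10, 13]

t_value = welch_t(diagnosed, not_diagnosed)
print(f"mean, diagnosed group:     {statistics.mean(diagnosed):.1f}")
print(f"mean, not diagnosed group: {statistics.mean(not_diagnosed):.1f}")
print(f"Welch t = {t_value:.2f}")
```

If the test measures depression, the diagnosed group should score clearly higher, which shows up as a large positive t value. A t value near zero would count as evidence against the test's validity.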
Another group suggested giving the test and then having a clinician evaluate the participants. Participants who scored high on the test should be diagnosed as depressed. Most other groups talked about the need for the test scores to “correlate” with a psychologist’s classification of the individual as depressed or not depressed.
All nine groups were able to describe the expected correlations between their individual depression tests and another test of depression (i.e. scores on the two tests should be related). Eight of the nine groups correctly predicted the correlation between their individual depression test and a math achievement test would be low/not significant. One group stated they expected a low correlation, but then went on to describe a negative correlation (i.e. “if a person scores high on the depression test, they should score low on the math test.”). Only one group tied the results to the validity of their test.
• “If our test is accurate it will have a high correlation with the depression scale and a low correlation with the math achievement test.”
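The distinction between a high correlation, a low correlation, and a negative correlation can be illustrated with a small sketch. All scores below are invented for eight hypothetical test takers, and the variable names are assumptions made for this example only.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    covariation = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return covariation / math.sqrt(ss_x * ss_y)

# Hypothetical scores for the same eight people on each measure; not real data.
new_depression_test = [10, 14, 18, 22, 26, 30, 34, 38]
established_depression_scale = [12, 13, 20, 21, 27, 29, 36, 37]
math_achievement = [55, 70, 62, 48, 66, 52, 71, 58]
# What the mistaken prediction ("high on depression, low on math") would look like.
math_if_negatively_related = [75, 70, 66, 61, 57, 52, 48, 44]

convergent_r = pearson_r(new_depression_test, established_depression_scale)
divergent_r = pearson_r(new_depression_test, math_achievement)
negative_r = pearson_r(new_depression_test, math_if_negatively_related)

print(f"convergent (depression scale): r = {convergent_r:.2f}")
print(f"divergent (math achievement):  r = {divergent_r:.2f}")
print(f"negative relationship:         r = {negative_r:.2f}")
```

Divergent validity evidence is a correlation near zero, meaning math scores tell us nothing about depression scores. A strong negative correlation is a different result: it would mean the two tests are systematically related, just in opposite directions.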
Results of the “minute paper” at the end of class:
What was the most difficult part of the lesson?
1) Confused by how the math test was related to determining validity of depression test.
2) Lack of clarity of assignment—how much depth, detail, direction
Most important thing learned from the lesson
1) There are multiple ways to determine the validity of a test
What is still confusing?
1) Nothing!
2) Statistics
Describe major findings and conclusions about what, how and why students met or did not meet learning goals.
Results of Related Exam Questions: Three exam questions related specifically to the lesson. Results of the three questions were mixed.
• One of the processes used to examine construct validity is examining group differences. Explain the logic behind this process (2 pts).
o Correct answers:
If the theory of the construct suggests two groups have different levels of the construct, and the test actually measures the construct, then the two groups should have different scores on the test.
If a test is supposed to measure depression then when the test is given to a group of depressed people and a group of non-depressed people, then the depressed group should score higher than the non-depressed group.
o Only 27% of the class received full points for their answers, while 48% received no points for their answers. Those that received 0 points tended to focus on the need for a test to be valid for different demographic groups (e.g. ethnic groups).
“A test must be able to measure results for all types of groups, or one group in particular. Because you can’t give a five year old a test meant for a 20-year-old and expect them to score well. Therefore groups must be examined to make up for the differences.”
“Everybody is different thus when putting people into groups the groups will be different so in order to obtain the desired results from them it is impairable (sic) that the test be accurate in what it wants to measure.”
• I have developed Dr. W’s Attention Deficit Hyperactivity Disorder (ADHD) Scale, a 25 item paper and pencil self-report instrument to diagnose ADHD. I want to evaluate the construct validity of this instrument. Since it is often difficult to differentiate ADHD from anxiety, I want to be sure my test measures ADHD and not anxiety. I have collected data from 75 children. I gave each child my scale as well as a self-report anxiety scale. In addition, each child was observed by a trained research assistant for ADHD and anxiety behaviors. These data yielded the following multitrait-multimethod matrix.
(Note: a correlation matrix was provided.)
o What is the convergent validity evidence for or against Dr. W’s (ADHD Scale) test [make sure to list and explain the number(s)]? (3 pts)
Correct answer: If Dr. W’s scale measures ADHD, then scores on that scale should relate to another scale that measures ADHD. The correlation is .62 (significantly different than 0), therefore the validity of Dr. W’s test is supported.
Nearly 42% of the class received full points for the answer, and no students received 0 points. Students who failed to receive all the points tended to explain the correlation, but failed to include the “theory” behind the explanation.
o What is the divergent validity evidence for or against MY (ADHD Scale) test [make sure to list and explain the number(s)]? (3 pts)
Correct answer: If Dr. W’s scale measures ADHD, then scores on that scale should not relate to scales that measure anxiety. The correlation between the self-report anxiety scale and Dr. W’s scale is .33 (significantly different than 0), therefore the validity of Dr. W’s test is not supported. The correlation between the behavioral observation anxiety scale and Dr. W’s scale is .08 (not significantly different than 0), therefore the validity of Dr. W’s test is supported.
About 35% of the class received perfect scores on the question, while again, no students received 0 points. Mistakes on this question were similar in nature to the convergent validity question. Namely, students correctly interpreted the correlations, but failed to frame them in the theory.
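The significance judgments in the correct answers can be checked with the usual t test for a correlation coefficient. The sketch below uses the three correlations and the sample of 75 children from the exam question. The two-tailed .05 significance level and the rounded critical value are assumptions, since the exam does not state which level was used.

```python
import math

def t_for_correlation(r, n):
    """t statistic for testing whether a Pearson r differs from zero (df = n - 2)."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

N_CHILDREN = 75
# Assumed: two-tailed test at alpha = .05 with df = 73; critical t is about 1.993.
T_CRITICAL = 1.993

correlations = {
    "ADHD scale with other ADHD measure": 0.62,
    "ADHD scale with self-report anxiety": 0.33,
    "ADHD scale with observed anxiety": 0.08,
}

# Map each correlation label to True if it differs significantly from zero.
significant = {}
for label, r in correlations.items():
    t_value = t_for_correlation(r, N_CHILDREN)
    significant[label] = abs(t_value) > T_CRITICAL
    print(f"{label}: r = {r:.2f}, t = {t_value:.2f}, significant = {significant[label]}")
```

Under these assumptions the first two correlations differ from zero and the third does not, which matches the answer key. The step students tended to omit is the interpretation: a significant correlation supports validity in the convergent case but counts against it in the divergent case.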
Based on your analysis how will you change the lesson?
Quicker transition to Step 3. My group did nothing for 20+ minutes. Consider addressing all the groups at the same time—even if they have not completed their work.
Presentation and analysis of the studies. This segment seemed repetitive. Groups presented the same types of studies with minor differences (e.g., number of subjects). Plus, none of the students asked any questions during this segment.
Foster better analysis of the studies. Rather than have each group present its study:
1) Select groups to present on the basis of type of study. Group 1 presents, then ask for a group that has a different type of study. Ask students to point out key differences. The goal should be to categorize the studies and bring out the essential differences among them.
OR
2) Give groups a handout that identifies types of studies and their characteristics and ask the groups to categorize the studies presented.
STEP 4: Leave more time to analyze Step 4.
The group I observed discussed the step 4 handout for one minute. They tried to find a relationship between Math Achievement and Depression. Someone offered a plausible connection and that terminated the discussion. One member wrote an answer on the worksheet while the others contributed nothing. The answer was wrong.
Posted by: Bill Cerbin | April 12, 2005 at 07:43 AM