Alan Reifman, Texas Tech University

QUALITIES OF A GOOD MEASURE

THE TWO QUALITIES, RELIABILITY AND VALIDITY, ARE BOTH BASED LARGELY ON THE CORRELATION STATISTIC.


RELIABILITY: Consistently yields the same result.

If you took a ruler and repeatedly measured your book, you would always get the same answer. Ruler measurements of a physical object are perfectly reliable. Social scientific measurements of people are not perfectly reliable.

[Figure: a ruler, marked 1 through 12, shown measuring a book.]




VALIDITY: Really measuring what it intends to measure.

The article "BMI [Body Mass Index] is a Terrible Measure of Health" raises validity issues:

The goal of using any obesity indicator should be to identify people with excess fat, since that fat has been associated with bad health outcomes. But the BMI is a function of a person’s weight and height. Weight includes fat, but it also includes bones, muscle, fluids and everything else in the body... [One study] found that 47 percent of people classified as overweight by BMI and 29 percent of those who qualified as obese were healthy as measured by [other indicators such as blood pressure and cholesterol]... Using BMI alone as a measure of health would misclassify almost 75 million adults in the U.S., the authors concluded.    
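To make the quoted point concrete, here is a minimal sketch of the standard BMI calculation (weight in kilograms divided by height in meters squared) with the usual adult cutoffs. The two people and their measurements are hypothetical, invented only for illustration.

```python
def bmi(weight_kg, height_m):
    """Body Mass Index: weight (kg) divided by height (m) squared."""
    return weight_kg / height_m ** 2


def bmi_category(value):
    """Standard adult BMI categories."""
    if value < 18.5:
        return "underweight"
    elif value < 25:
        return "normal weight"
    elif value < 30:
        return "overweight"
    return "obese"


# Hypothetical people: same height and weight, very different body composition.
athlete = {"label": "muscular athlete", "weight_kg": 95, "height_m": 1.80}
sedentary = {"label": "sedentary adult", "weight_kg": 95, "height_m": 1.80}

for person in (athlete, sedentary):
    value = bmi(person["weight_kg"], person["height_m"])
    print(f'{person["label"]}: BMI = {value:.1f} ({bmi_category(value)})')
```

Both hypothetical people receive the same BMI and the same label, because the formula sees only weight and height, not how much of the weight is fat. That is the validity concern the article raises.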

EXAMPLE: SAT analogy item → RUNNER : MARATHON

  1. envoy : embassy
  2. martyr : massacre
  3. oarsman : regatta
  4. referee : tournament
  5. horse : stable

This item is intended to measure analogical reasoning ability, but may really be as much, or more, a measure of socioeconomic status.

(From Herrnstein & Murray, 1994, The Bell Curve)



Correlation Coefficient (r) : How do two variables go together?



Determining the Correlation Coefficient (r)

 

Take values from data file (below) and plot each person's values (see graph on right)

            Years of
            Education    Salary

Ann            16        50000
Barney         12        25000
Carol          19        80000
Dave           16        55000
Edna           16        60000
Fred           14        35000

[Graph: scatterplot of Salary (vertical axis, 20K to 90K) against Years of Education (horizontal axis, 12 to 21), with one point per person from the data file.]

A computerized statistical program will find the "best-fitting line," which comes as close as possible to the data points overall. The sign of the correlation (r) follows the direction of the best-fitting line's slope (upward is positive, downward is negative), and the size of r reflects the degree to which the points lie close to the line.
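As a rough sketch of what such a program does, the code below computes r and the slope of the best-fitting line for the six people in the data file. It assumes the usual Pearson formula for r and an ordinary least-squares line; the function names are illustrative, not taken from any particular statistics package.

```python
import math

# Data file from these notes: name -> (years of education, salary in dollars)
people = {
    "Ann": (16, 50000),
    "Barney": (12, 25000),
    "Carol": (19, 80000),
    "Dave": (16, 55000),
    "Edna": (16, 60000),
    "Fred": (14, 35000),
}


def sums_of_squares(xs, ys):
    """Return (Sxy, Sxx, Syy): sums of cross-products and squared deviations."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy, sxx, syy


def pearson_r(xs, ys):
    """Pearson correlation coefficient, always between -1 and +1."""
    sxy, sxx, syy = sums_of_squares(xs, ys)
    return sxy / math.sqrt(sxx * syy)


def best_fit_slope(xs, ys):
    """Slope of the least-squares line predicting y from x."""
    sxy, sxx, _ = sums_of_squares(xs, ys)
    return sxy / sxx


education = [years for years, _ in people.values()]
salary = [dollars for _, dollars in people.values()]

r = pearson_r(education, salary)
slope = best_fit_slope(education, salary)
print(f"r = {r:.2f}")
print(f"slope = {slope:.0f} dollars per additional year of education")
```

For this small data file the result is r = 0.98, a strong positive correlation: the line slopes upward (about 8,091 dollars per additional year of education) and the points lie close to it.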



Websites for learning about correlations via interactive displays of data-points and best-fit lines:

BFW Publisher

NCTM Illuminations

 Plotly


A song to nail down our understanding of correlation, best-fitting lines, upward and downward slopes, etc.

Fitting the Line

Lyrics by Alan Reifman

(May be sung to the tune of “Draggin’ the Line,” James/King)

 

(Back-up vocals in parentheses)

 

Plotting the data, on X and Y,

Finding the slope, with most points nearby,

We want to find the angle, of the trend’s incline,

Fitting the line (fitting the line),

 

Upward slopes make r positive,
Slopes trending down, make it negative,

From minus-one to plus-one, r can feel fine,
Fitting the line (fitting the line),
Fitting the line (fitting the line),

 

Points align, how will the data shine?

If you have upward slopes, it’ll give you a plus sign,

Fitting the line (fitting the line),
Fitting the line (fitting the line),

 

How strongly will your variables relate?

Is there a trend, or just a zero flat state?

You want to know what your analysis will find,

Fitting the line (fitting the line),
Fitting the line (fitting the line),

 

Points align, how will the data shine?

Your r will be minus, if the slope declines,

Fitting the line (fitting the line),
Fitting the line (fitting the line),

 

(Guitar solo)

 

Points align, how will the data shine?

If you have upward slopes, it’ll give you a plus sign,

Fitting the line (fitting the line),
Fitting the line (fitting the line)…

 

Facebook album of Dr. Reifman meeting Tommy James, after the latter's concert at the 2013 South Plains Fair.
 


 

TYPES OF RELIABILITY

PROCEDURAL CONDITIONS: Self-report, multiple occasions to gather data from each participant.
TYPE OF RELIABILITY: TEST-RETEST. Give same measure twice, separated by days, weeks, or months. Correlation between scores at Time 1 and Time 2.

PROCEDURAL CONDITIONS: Self-report, single occasion, multiple-item measure, such as Hendrick & Hendrick love scales, each with seven items (four items in the shortened version). You would compute a separate alpha each for Eros, Ludus, Storge, Pragma, Mania, and Agape.
TYPE OF RELIABILITY: INTERNAL CONSISTENCY (ALPHA, α). If high internal consistency, how a person answered any one item tells you how he/she answered the others. Based in part on correlations, with maximum = 1.0.

PROCEDURAL CONDITIONS: Observation with two raters (single occasion).
TYPE OF RELIABILITY: INTER-RATER RELIABILITY. Correlation between two judges' tallies.

Test-retest reliability correlations involving people who took the SAT more than once have been reported as .77 for whites and .90 for blacks (Vars & Bowen chapter in Jencks & Phillips, The Black-White Test Score Gap, p. 471, footnote 22).

Here's a sports example of "test-retest reliability" that I came up with.


The item listings above are just shortened, keyword descriptions.  The actual wordings are available here.  Instead of the True/False format shown on the web document, we used a system of 0 = Strongly Agree to 4 = Strongly Disagree.

For a given set of items (such as the Storge subscale), alpha is based on the number of items and the average of the correlations between each pair of items (Wikipedia page). As Zeller and Carmines (1980) state in their book Measurement in the Social Sciences: "In general, as the average correlation among the items increases and as the number of items increases, alpha takes on a larger value" (p. 56; see Table 3.2A).
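The relationship Zeller and Carmines describe can be sketched with the standardized form of alpha, which uses only the number of items and the average inter-item correlation. This is a minimal illustration, assuming the standardized formula (which treats the items as having equal variances); the average correlations of .30 and .50 are hypothetical, not values from the love scales.

```python
def standardized_alpha(n_items, mean_inter_item_r):
    """Standardized alpha: k * r_bar / (1 + (k - 1) * r_bar)."""
    k = n_items
    r_bar = mean_inter_item_r
    return (k * r_bar) / (1 + (k - 1) * r_bar)


# Hypothetical average inter-item correlations, for a four-item short
# subscale and a seven-item full subscale.
for r_bar in (0.30, 0.50):
    for k in (4, 7):
        alpha = standardized_alpha(k, r_bar)
        print(f"items = {k}, average r = {r_bar:.2f}, alpha = {alpha:.3f}")
```

With an average correlation of .30, alpha is about .63 for four items and .75 for seven; raising the average correlation to .50 gives .80 and .875. Alpha grows with both the number of items and the average correlation, as the quotation states.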


Both types of reliability shown below should exhibit large positive correlations (high reliability).

TEST-RETEST DEPRESSION EXAMPLE

Most people who score highly on the test the first time would also score highly on the same test the second time, and people with low initial scores would likely also get a low score the second time. A few individuals might get very different scores on the two occasions, but the correlation statistic represents the trend for the whole sample.

  INTER-RATER SMILING EXAMPLE

If a particular wife smiles a lot, two well-trained raters should both record a large number of smiles, although they may differ slightly. However, no rater should have a tally of 0 for this wife. If another wife very rarely smiles, both raters should have tallies at or near 0 for her, with neither rater at, say, 10.
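The smiling example can be sketched as a correlation between two raters' tallies. The tallies below are hypothetical, made up to show two well-trained raters who agree closely but not perfectly; the same calculation applies to test-retest data if the two lists hold Time 1 and Time 2 scores instead.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of scores."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical smile tallies for six wives, one entry per wife.
rater_a = [12, 0, 5, 8, 1, 10]
rater_b = [11, 1, 5, 9, 0, 10]

inter_rater_r = pearson_r(rater_a, rater_b)
print(f"inter-rater r = {inter_rater_r:.2f}")
```

Because the raters differ by at most one smile per wife, the correlation comes out close to 1.0. If one rater recorded 0 for a wife the other tallied at 10, the correlation would drop noticeably.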

 



Reliability correlations tend to be much larger than validity correlations.  Why might this be so?



TYPES OF VALIDITY
Evidence a test is measuring what it intends to measure
(From most to least important, in Dr. Reifman's view)

TYPE: Predictive (Criterion Related)
DEFINITION: Test scores should correlate with real-world outcomes
EXAMPLE: SAT (V) & first-year grades correlation = .36; SAT (M) & first-year grades correlation = .35

TYPE: Construct: Convergent
DEFINITION: Test should correlate with other similar measures
EXAMPLE: SAT should correlate with other academic ability tests

TYPE: Construct: Discriminant
DEFINITION: Test should not correlate with irrelevant tests
EXAMPLE: SAT should not correlate with political attitudes

TYPE: Content
DEFINITION: Covers the necessary range of material
EXAMPLE: Different areas of math and verbal abilities should be covered

TYPE: Face
DEFINITION: Items look like they are covering proper topics
EXAMPLE: Math test should not have history items

Let's revisit the question of how to measure happiness, from the introductory measurement lecture.

Source for SAT validity coefficients: David Owen (with Marilyn Doerr), None of the Above: The Truth Behind the SATs (1999, revised and updated edition; p. 197)


Chart to Summarize Reliability and Validity

RELIABILITY (test-retest preferred, if possible)
Test Correlated With Repeat Administration of Test to Same Persons
If only one testing session available, correlate items with each other (internal consistency).
If a behavioral observation, correlate two judges' scores of same videotapes (inter-rater).

 

VALIDITY (predictive preferred, if possible)
Test Correlated With Real-World Behavior
If only one testing session available, correlate test with other established tests.

In class discussion, Dr. Reifman asked the class how one might try to validate the state examination for barbers and cosmetologists. In other words, with what real-world outcomes would we correlate barbers' and stylists' scores on the exams? Students in a previous class came up with excellent suggestions, such as customer-satisfaction surveys and observing how often the same customers came back to the same barber/stylist. Also, the barber/stylist's work could be judged by experts, such as Vidal Sassoon.

Real-world examples: 

How eHarmony has attempted to validate its measures used for matching singles (and another article about whether online matchmaking companies have the scientific validity to back up their claims; thanks to Dr. Niehuis for sending me the article).

Does the Wonderlic intelligence test, which is given to football players coming out of college at a camp where they work out for NFL team scouts, show validity in predicting on-the-field success once the players begin their pro careers? (the linked study looks at quarterbacks)

Does a popular computerized test for racial bias show reliability and validity?


...and a Song

Reliable and Valid

Lyrics by Alan Reifman

(May be sung to the tune of “Don’t Stop (Thinking About Tomorrow),” Christine McVie, popularized by Fleetwood Mac)

 

 

When selecting a questionnaire,

Psychometrics have to be sound,

You can make your own, if you have to,

But try to use one already around,

 

Make… it… re-liable and valid,

Make… it… the best that you can find,

It will help, strengthen your research,

Measurement’s prime, measurement’s prime,

 

(Guitar solo)

 

To assess re-li-a-bility,

Use test-retest with two occasions,

Use alpha for a one-time test, and,

Inter-rater for observations,

 

Make… it… re-liable and valid,

Make… it… the best that you can find,

It will help, strengthen your research,

Measurement’s prime, measurement’s prime,

 

(Guitar solo)

 

To assess a test’s validity,

There are many forms to make your case,

They may or may not be statistical,

Predictive, construct, content and face,

 

Make… it… re-liable and valid,

Make… it… the best that you can find,

It will help, strengthen your research,

Measurement’s prime, measurement’s prime,

 

Make… it… re-liable and valid,

Make… it… the best that you can find,

It will help, strengthen your research,

Measurement’s prime, measurement’s prime,

 

Ooh, make your tests sound,

Ooh, make your tests sound,…

(Fade out)

 


 

More advanced discussion of reliability and validity in terms of Classical Test Theory (intended for graduate students)