Alan Reifman,
Texas Tech University
QUALITIES OF A GOOD MEASURE
RELIABILITY AND VALIDITY ARE BOTH BASED LARGELY ON THE CORRELATION STATISTIC.
RELIABILITY: Consistently yields the same result.
If you took a ruler and repeatedly measured your book, you would always get the same answer. Ruler measurements of a physical object are perfectly reliable. Social scientific measurements of people are not perfectly reliable.
[Figure: a book measured against a ruler marked 1 to 12]
VALIDITY: Really measuring what it intends to measure.
The article "BMI [Body Mass Index] is a Terrible Measure of Health" raises validity issues:
The goal of using any obesity indicator should be to identify people with excess fat, since that fat has been associated with bad health outcomes. But the BMI is a function of a person’s weight and height. Weight includes fat, but it also includes bones, muscle, fluids and everything else in the body... [One study] found that 47 percent of people classified as overweight by BMI and 29 percent of those who qualified as obese were healthy as measured by [other indicators such as blood pressure and cholesterol]... Using BMI alone as a measure of health would misclassify almost 75 million adults in the U.S., the authors concluded.
EXAMPLE: SAT analogy item → RUNNER : MARATHON
This item is intended to measure analogical reasoning ability, but really may be as much or more a measure of socioeconomic status.
(From Herrnstein & Murray, 1994, The Bell Curve)
Correlation Coefficient (r): How do two variables go together?
Determining the Correlation Coefficient (r)
Take values from data file (below) and plot each person's values (see graph on right)


A computerized statistical program will try to find the "best-fitting line," the line that comes as close as possible to as many data points as it can. The sign of the correlation (r) follows the direction of the best-fitting line's slope, and the size of r reflects how close the points are to the line.
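To make the line-fitting idea concrete, here is a minimal sketch in Python. It assumes the program uses ordinary least squares, the usual way of defining a best-fitting line, and the study-hours and exam-score values are hypothetical numbers invented for illustration (they are not the class data file).

```python
from math import sqrt

def fit_line_and_r(xs, ys):
    """Return (slope, intercept, r) for paired observations.

    Assumes ordinary least squares for the best-fitting line and the
    Pearson formula for the correlation coefficient r.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of squared deviations and of cross-products
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / sqrt(sxx * syy)
    return slope, intercept, r

# Hypothetical data: one (x, y) pair per person
hours_studied = [1, 2, 3, 4, 5]
exam_score = [52, 60, 57, 70, 71]

slope, intercept, r = fit_line_and_r(hours_studied, exam_score)
print(f"slope = {slope:.2f}, intercept = {intercept:.2f}, r = {r:.2f}")
```

An upward slope gives a positive r and a downward slope gives a negative r. The closer the points sit to the line, the closer r gets to +1 or -1.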
Websites for learning about correlations via interactive displays of data points and best-fit lines:
A song to nail down our understanding of correlation, best-fitting lines, upward and downward slopes, etc.
Fitting the Line
Lyrics by Alan Reifman
(May be sung to the tune of “Draggin’ the Line,” James/King)
(Backup vocals in parentheses)
Plotting the data, on X and Y,
Finding the slope, with most points nearby,
We want to find the angle, of the trend’s incline,
Fitting the line (fitting the line),
Upward slopes make r positive,
Slopes trending down, make it negative,
From minus-one to plus-one, r can feel fine,
Fitting the line (fitting the line),
Fitting the line (fitting the line),
Points align, how will the data shine?
If you have upward slopes, it’ll give you a plus sign,
Fitting the line (fitting the line),
Fitting the line (fitting the line),
How strongly will your variables relate?
Is there a trend, or just a zero flat state?
You want to know what your analysis will find,
Fitting the line (fitting the line),
Fitting the line (fitting the line),
Points align, how will the data shine?
Your r will be minus, if the slope declines,
Fitting the line (fitting the line),
Fitting the line (fitting the line),
(Guitar solo)
Points align, how will the data shine?
If you have upward slopes, it’ll give you a plus sign,
Fitting the line (fitting the line),
Fitting the line (fitting the line)…
Facebook album of Dr. Reifman meeting Tommy James, after the latter's concert at the 2013 South Plains Fair.
TYPES OF RELIABILITY

PROCEDURAL CONDITIONS | TYPE OF RELIABILITY
Self-report, multiple occasions to gather data from each participant. | TEST-RETEST: Give the same measure twice, separated by days, weeks, or months. Correlation between scores at Time 1 and Time 2.
Self-report, single occasion, multiple-item measure, such as the Hendrick & Hendrick love scales, each with seven items (four items in the shortened version). You would compute a separate alpha each for Eros, Ludus, Storge, Pragma, Mania, and Agape. | INTERNAL CONSISTENCY (ALPHA, α): With high internal consistency, how a person answered any one item tells you how he/she answered the others. Based in part on correlations, with maximum = 1.0.
Observation with two raters (single occasion) | INTERRATER RELIABILITY: Correlation between the two raters' scores for the same observed behavior.
Test-retest reliability correlations involving people who took the SAT more than once have been reported as .77 for whites and .90 for blacks (Vars & Bowen chapter in Jencks & Phillips, The Black-White Test Score Gap, p. 471, footnote 22).
Here's a sports example of "test-retest reliability" that I came up with.
The item listings above are just shortened, keyword descriptions. The actual wordings are available here. Instead of the True/False format shown on the web document, we used a system of 0 = Strongly Agree to 4 = Strongly Disagree.
For a given set of items (such as the Storge subscale), alpha is based on the number of items and the average of the correlations between each pair of items (Wikipedia page). As Zeller and Carmines (1980) state in their book Measurement in the Social Sciences: "In general, as the average correlation among the items increases and as the number of items increases, alpha takes on a larger value" (p. 56; see Table 3.2A).
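The relationship Zeller and Carmines describe can be sketched with the standardized form of alpha, which depends only on the number of items (k) and the average inter-item correlation. This is an illustration under that assumption: statistical programs typically report a raw alpha computed from item variances and covariances, which can differ somewhat. The function name and the example values are hypothetical.

```python
def standardized_alpha(num_items, mean_inter_item_r):
    """Standardized alpha from item count and average inter-item correlation.

    alpha = (k * r_bar) / (1 + (k - 1) * r_bar)
    """
    k = num_items
    r_bar = mean_inter_item_r
    return (k * r_bar) / (1 + (k - 1) * r_bar)

# Hypothetical average correlations, for a four-item (shortened)
# and a seven-item (full-length) subscale
for r_bar in (0.2, 0.4):
    for k in (4, 7):
        alpha = standardized_alpha(k, r_bar)
        print(f"k = {k}, average r = {r_bar:.1f}, alpha = {alpha:.2f}")
```

With an average correlation of .2, going from four items to seven raises alpha from .50 to about .64. Holding the number of items at four and raising the average correlation to .4 raises alpha to about .73.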
Both types of reliability shown below should exhibit large positive correlations (high reliability).
TEST-RETEST DEPRESSION EXAMPLE: Most people who score highly on the test the first time would also score highly on the same test the second time, and people with low initial scores would likely also get a low score the second time. A few individuals might get very different scores on the two occasions, but the correlation statistic represents the trend for the whole sample.
INTERRATER SMILING EXAMPLE: If a particular wife smiles a lot, two well-trained raters should both record a large number of smiles, although they may differ slightly. However, no rater should have a tally of 0 for this wife. If another wife very rarely smiles, both raters should have tallies at or near 0 for her, with neither rater at, say, 10.
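Both examples can be tried out with a few invented numbers. The sketch below computes a Pearson correlation for a test-retest case and an interrater case. All scores and tallies are hypothetical, made up only to mimic the patterns described above, including one person whose depression score changes noticeably between occasions.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical depression scores for six people at Time 1 and Time 2;
# the last person's score shifts from 11 to 18
depression_time1 = [22, 5, 14, 9, 30, 11]
depression_time2 = [20, 7, 15, 8, 27, 18]
test_retest_r = pearson_r(depression_time1, depression_time2)

# Hypothetical smile tallies for six wives, counted by two raters
rater_a_smiles = [12, 0, 7, 3, 9, 1]
rater_b_smiles = [11, 1, 8, 3, 10, 0]
interrater_r = pearson_r(rater_a_smiles, rater_b_smiles)

print(f"test-retest r = {test_retest_r:.2f}")
print(f"interrater r = {interrater_r:.2f}")
```

Both correlations come out large and positive. The one person with a shifted score lowers the test-retest value somewhat but does not overturn the trend for the whole sample.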

Reliability correlations tend to be much larger than validity correlations. Why might this be so?
Evidence that a test is measuring what it intends to measure (from most to least important, in Dr. Reifman's view)

TYPE | DEFINITION | EXAMPLE
Predictive | Test scores should correlate with real-world outcomes | SAT (V) & first-year grades correlation = .36; SAT (M) & first-year grades correlation = .35
Construct: Convergent | Test should correlate with other similar measures | SAT should correlate with other academic ability tests
Construct: Discriminant | Test should not correlate with irrelevant tests | SAT should not correlate with political attitudes
Content | Covers the necessary range of material | Different areas of math and verbal abilities should be covered
Face | Items look like they are covering proper topics | Math test should not have history items
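The two construct-validity rows (often called convergent and discriminant evidence) describe opposite expectations for the same statistic: a sizable correlation with a similar measure and a near-zero correlation with an irrelevant one. The sketch below shows that contrast. All three sets of scores are hypothetical, invented for illustration, and the variable names are not from any real data set.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical scores for six students
ability_test = [400, 450, 500, 550, 600, 650]   # test being validated
similar_test = [20, 15, 30, 25, 40, 35]         # another academic ability test
political_attitude = [3, 5, 1, 6, 2, 4]         # irrelevant measure (1-7 scale)

similar_r = pearson_r(ability_test, similar_test)
irrelevant_r = pearson_r(ability_test, political_attitude)

print(f"r with similar test = {similar_r:.2f}")
print(f"r with irrelevant measure = {irrelevant_r:.2f}")
```

In this made-up sample the test tracks the similar measure closely and shows essentially no relationship with political attitudes, which is the pattern a valid test should produce.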
Let's revisit the question of how to measure happiness, from the introductory measurement lecture.
Source for SAT validity coefficients: David Owen (with Marilyn Doerr), None of the Above: The Truth Behind the SATs (1999, revised and updated edition; p. 197)
Chart to Summarize Reliability and Validity
RELIABILITY (test-retest preferred, if possible)
Test  Correlated With  Repeat Administration of Test to Same Persons
If only one testing session available, correlate items with each other (internal consistency). If a behavioral observation, correlate two judges' scores of same videotapes (interrater).
VALIDITY (predictive preferred, if possible)
Test  Correlated With  Real-World Behavior
If only one testing session available, correlate test with other established tests.
In class discussion, Dr. Reifman asked the class how one might try to validate the state examination for barbers and cosmetologists. In other words, with what real-world outcomes would we correlate barbers' and stylists' exam scores? Students in a previous class came up with excellent suggestions, such as customer-satisfaction surveys and observing how often the same customers came back to the same barber/stylist. Also, the barber/stylist's work could be judged by experts, such as Vidal Sassoon.
Realworld examples:
How eHarmony has attempted to validate its measures used for matching singles (and another article about whether online matchmaking companies have the scientific validity to back up their claims; thanks to Dr. Niehuis for sending me the article).
Does the Wonderlic intelligence test, which is given to football players coming out of college at a camp where they work out for NFL team scouts, show validity in predicting on-the-field success once the players begin their pro careers? (The linked study looks at quarterbacks.)
Does a popular computerized test for racial bias show reliability and validity?
...and a Song
Reliable and Valid
Lyrics by Alan Reifman
(May be sung to the tune of “Don’t Stop (Thinking About Tomorrow),” Christine McVie, popularized by Fleetwood Mac)
When selecting a questionnaire,
Psychometrics have to be sound,
You can make your own, if you have to,
But try to use one already around,
Make… it… reliable and valid,
Make… it… the best that you can find,
It will help, strengthen your research,
Measurement’s prime, measurement’s prime,
(Guitar solo)
To assess reliability,
Use test-retest with two occasions,
Use alpha for a one-time test, and,
Interrater for observations,
Make… it… reliable and valid,
Make… it… the best that you can find,
It will help, strengthen your research,
Measurement’s prime, measurement’s prime,
(Guitar solo)
To assess a test’s validity,
There are many forms to make your case,
They may or may not be statistical,
Predictive, construct, content and face,
Make… it… reliable and valid,
Make… it… the best that you can find,
It will help, strengthen your research,
Measurement’s prime, measurement’s prime,
Make… it… reliable and valid,
Make… it… the best that you can find,
It will help, strengthen your research,
Measurement’s prime, measurement’s prime,
Ooh, make your tests sound,
Ooh, make your tests sound,…
(Fade out)
More advanced discussion of reliability and validity in terms of Classical Test Theory (intended for graduate students)