Lecture 3:  Describing Data

Outline
Why summarize
 -- the problem
 -- the solution
 -- what you lose
Numbers
 -- central tendency
 -- variability
Graphics
Example for Thinking about Central Tendency

Why summarize

The problem

Let's say that I have collected some memory data.
I presented subjects with a short story,
then 10 minutes later had them write down what they could remember.
Now I need to communicate what I found to other researchers.  How shall I do it?

Can't see the forest for the trees
 I could just type up the responses and publish them, but that isn't very satisfactory and it wastes a lot of space in journals.  Besides, people need some way of summarizing it.

Summarize
 Based on the Operational Definition.
Memory is the number of idea units in the story that the subjects recalled.  Say there were 80 in the original story.  Now we can count how many each subject got.  I could just print all the scores, but even that can be wasteful and silly if there are other ways to summarize -- which there are.  So we are working toward simple and informative summaries.
 

What you lose
 It is also important to know what you lose.  Each time you summarize, you lose the individual instances.  When you score yes or no for each idea unit, you lose what the subject actually wrote.  When you reduce that to a score, you no longer know which units each subject recalled.  When you sum across subjects, you no longer know what each subject did.  Sometimes you sum in more than one way:  say, for each subject and for each idea unit.
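
The losses at each step can be sketched in Python.  The recall matrix here is invented for illustration (3 hypothetical subjects and 4 idea units rather than 80); it is not data from the study.

```python
# Hypothetical yes/no recall matrix: one row per subject, one column per idea unit.
# 1 = the subject recalled that idea unit, 0 = did not.  Invented for illustration.
recall = [
    [1, 0, 1, 1],  # subject A
    [1, 1, 0, 0],  # subject B
    [0, 0, 1, 1],  # subject C
]

# Summing each row gives one score per subject.
# We no longer know WHICH units each subject recalled.
scores_per_subject = [sum(row) for row in recall]

# Summing each column gives one total per idea unit.
# We no longer know WHO recalled each unit.
recalls_per_unit = [sum(col) for col in zip(*recall)]

# Summing everything gives a single number; both kinds of detail are gone.
total_recalled = sum(scores_per_subject)

print(scores_per_subject)  # [3, 2, 2]
print(recalls_per_unit)    # [2, 1, 2, 2]
print(total_recalled)      # 7
```

Subjects B and C get the same score of 2 even though they recalled completely different units, which is exactly the kind of information a score throws away.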

Numbers
 The most common way of summarizing is numerically and our primary concerns here are measures of central tendency and variability.

Central tendency
Mean, median, and mode
 Mean is the numerical average
 Median is the central score in rank order
 Mode is the most common score

Mean:  $M = \frac{\sum_{i=1}^{N} X_i}{N} = \frac{X_1 + X_2 + X_3 + \dots + X_N}{N}$
 In this case:  1315/20 = 65.75
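
A minimal sketch of the same computation in Python.  These notes do not list the 20 individual scores, so the list below is a hypothetical set chosen only so that it sums to 1315; it is not the original data.

```python
# Hypothetical scores (NOT the original data): 20 values that sum to 1315.
scores = [54, 59, 60, 60, 61, 63, 64, 65, 66, 66,
          66, 66, 67, 68, 68, 69, 70, 72, 74, 77]

n = len(scores)      # N, the number of subjects
total = sum(scores)  # the sum of all X_i
mean = total / n     # M = (sum of X_i) / N

print(total, n, mean)  # 1315 20 65.75
```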

Median:  central score
With an odd number of scores, it is the one with an equal number of scores above and below;
with an even number of scores, it is the average of the two middle scores.
 In this case:  average of the 10th and 11th scores = (66 + 66)/2 = 66

Mode:  The most common score.
 66
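
The median and mode rules can be sketched as follows.  The scores are a hypothetical set (not the original data) chosen so that the 10th and 11th ranked scores and the most common score are all 66, as in these notes.

```python
import statistics

# Hypothetical scores (NOT the original data), consistent with the summaries above.
scores = [54, 59, 60, 60, 61, 63, 64, 65, 66, 66,
          66, 66, 67, 68, 68, 69, 70, 72, 74, 77]

ranked = sorted(scores)
n = len(ranked)

if n % 2 == 1:
    # Odd N: the single middle score.
    median = ranked[n // 2]
else:
    # Even N: average of the two middle scores (here the 10th and 11th).
    median = (ranked[n // 2 - 1] + ranked[n // 2]) / 2

# The most common score; the standard library does the counting.
mode = statistics.mode(scores)

print(median, mode)  # 66.0 66
```

The hand-written median agrees with `statistics.median(scores)`, which applies the same odd/even rule.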
 

Variability
Central tendency tells us what the average is, and measures of variability tell us how the scores are distributed around that average.

No scores are "average" -- most deviate from average

*** We would like to know the average deviation ***

Deviation:  How far each score is from the mean, $(X_i - M)$
 If you wanted to know the deviation in general, you could just sum these deviations.  But that sum, it turns out, is always 0.  That's a property of the mean.
*** so we can't get the average straight from the deviations, but that is the goal ***
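
A quick check that the deviations sum to zero, using a hypothetical set of 20 scores (not the original data) whose mean is 65.75.

```python
# Hypothetical scores (NOT the original data) with mean 65.75.
scores = [54, 59, 60, 60, 61, 63, 64, 65, 66, 66,
          66, 66, 67, 68, 68, 69, 70, 72, 74, 77]

mean = sum(scores) / len(scores)

# Deviation of each score from the mean: X_i - M.
deviations = [x - mean for x in scores]

# Negative and positive deviations cancel exactly, because
# sum(X_i - M) = sum(X_i) - N*M = sum(X_i) - sum(X_i) = 0.
print(sum(deviations))  # 0.0
```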

So what we do is square them before we sum them.

Sum of the Squared Deviations; Sum of Squares; SS
 by definition
 $SS = \sum_{i=1}^{N} (X_i - M)^2$
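
As a sketch, the sum of squares for a hypothetical set of 20 scores (not the original data; chosen to have mean 65.75):

```python
# Hypothetical scores (NOT the original data) with mean 65.75.
scores = [54, 59, 60, 60, 61, 63, 64, 65, 66, 66,
          66, 66, 67, 68, 68, 69, 70, 72, 74, 77]

mean = sum(scores) / len(scores)

# Square each deviation so negatives no longer cancel positives, then sum.
ss = sum((x - mean) ** 2 for x in scores)

print(ss)  # 557.75
```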

The problem is that this is totaled across all the instances and depends on the number of observations you have made:  the more subjects, the larger the SS.
*** Now we have a non-zero number and we work back to the average ***
Thus we need the average of this.
*** the sum divided by the number of things ***
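
To see why a total is not enough, here is a sketch in which the same hypothetical scores (not the original data) are simply observed twice.  The helper name `sum_of_squares` is made up for this example.

```python
# Hypothetical scores (NOT the original data).
scores = [54, 59, 60, 60, 61, 63, 64, 65, 66, 66,
          66, 66, 67, 68, 68, 69, 70, 72, 74, 77]

def sum_of_squares(values):
    """Sum of squared deviations from the mean of `values`."""
    m = sum(values) / len(values)
    return sum((x - m) ** 2 for x in values)

# Pretend we ran twice as many subjects and got the same pattern of scores.
doubled = scores + scores

print(sum_of_squares(scores))   # 557.75
print(sum_of_squares(doubled))  # 1115.5
```

The spread of the scores has not changed, yet SS has doubled.  Dividing by (roughly) the number of observations removes that dependence on sample size.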

Variance:  SS divided by the degrees of freedom:  $s^2 = SS/df$

   $s^2 = \frac{\sum_{i=1}^{N} (X_i - M)^2}{N - 1}$
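
A minimal sketch of the variance computation, again with hypothetical scores (not the original data).  The result is checked against Python's `statistics.variance`, which also divides by N - 1.

```python
import math
import statistics

# Hypothetical scores (NOT the original data).
scores = [54, 59, 60, 60, 61, 63, 64, 65, 66, 66,
          66, 66, 67, 68, 68, 69, 70, 72, 74, 77]

n = len(scores)
mean = sum(scores) / n
ss = sum((x - mean) ** 2 for x in scores)

df = n - 1          # degrees of freedom
variance = ss / df  # s^2 = SS / df

print(round(variance, 3))  # 29.355

# The standard library's sample variance uses the same N - 1 denominator.
assert math.isclose(variance, statistics.variance(scores))
```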

*** Could also think of this as the Mean of the Squared Deviations -- Mean Square ***
But the problem here is that the variance is in terms of X-squared, not X.

Standard Deviation:  Square root of the Variance:  $s = \sqrt{s^2}$
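
Taking the square root puts the measure back in the units of the scores (idea units recalled).  A sketch with hypothetical scores (not the original data):

```python
import math
import statistics

# Hypothetical scores (NOT the original data).
scores = [54, 59, 60, 60, 61, 63, 64, 65, 66, 66,
          66, 66, 67, 68, 68, 69, 70, 72, 74, 77]

n = len(scores)
mean = sum(scores) / n
variance = sum((x - mean) ** 2 for x in scores) / (n - 1)

# s = sqrt(s^2): back in the original units of X.
sd = math.sqrt(variance)

print(round(sd, 2))  # 5.42

# statistics.stdev is the square root of the N - 1 sample variance.
assert math.isclose(sd, statistics.stdev(scores))
```

For these made-up scores, a typical subject is about 5.4 idea units away from the mean of 65.75.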

Graphics:  Frequency Distributions
 Sometimes it is enough to have just the numbers:  mean and standard deviation.  Graphic presentation, however, allows you to see more about the data.  A picture is worth a thousand words.

Frequency Distributions:

x-axis = scores on the measure
y-axis = frequency (number of people with that score)

 Choose size of the range:  each score or groups?

How to group?  Between 7 and 10 groups.  Try a few and go with the best picture:

Group by 3          Group by 4
Scores  Frequency   Scores  Frequency
52-54   1           52-55   1
55-57   0           56-59   1
58-60   3           60-63   4
61-63   2           64-67   7
64-66   6           68-71   4
67-69   4           72-75   2
70-72   2           76-80   1
73-75   1
76-78   1
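
Trying different group sizes is easy to sketch in Python.  The scores are a hypothetical set (not the original data) chosen to be consistent with the frequencies listed above, and `frequency_table` is a helper name made up for this example.

```python
from collections import Counter

# Hypothetical scores (NOT the original data), consistent with the table above.
scores = [54, 59, 60, 60, 61, 63, 64, 65, 66, 66,
          66, 66, 67, 68, 68, 69, 70, 72, 74, 77]

def frequency_table(values, width, start):
    """Return (low, high, frequency) for consecutive intervals of `width` scores."""
    # Interval index of each value, counting from the interval that begins at `start`.
    counts = Counter((v - start) // width for v in values)
    n_bins = (max(values) - start) // width + 1
    table = []
    for k in range(n_bins):
        low = start + k * width
        high = low + width - 1
        table.append((low, high, counts.get(k, 0)))  # empty intervals get 0
    return table

for width in (3, 4):
    print(f"Group by {width}")
    for low, high, freq in frequency_table(scores, width, start=52):
        print(f"{low}-{high}  {freq}")
```

With a width of 4 the last interval prints as 76-79 rather than 76-80; the table in these notes stretches it to the maximum possible score of 80.  The frequencies are the same either way.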

Types of Distributions are:  Normal, Skewed, or Bimodal
  What does this mean for the mean, median, and mode?
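
One way to explore the question is to compute all three measures for small made-up data sets of each shape.  The three lists below are invented for illustration, with a roughly symmetric set standing in for a normal distribution; `summarize` is a hypothetical helper.

```python
import statistics

# Three tiny invented data sets, one per shape.
symmetric = [2, 3, 4, 4, 4, 5, 6]   # stand-in for a normal-shaped distribution
skewed    = [2, 3, 4, 4, 4, 5, 20]  # one extreme high score
bimodal   = [2, 2, 2, 4, 6, 6, 6]   # two equally common scores

def summarize(values):
    """Return (mean, median, list of modes) for `values`."""
    mean = sum(values) / len(values)
    median = statistics.median(values)
    modes = statistics.multimode(values)  # all most-common values (Python 3.8+)
    return mean, median, modes

print(summarize(symmetric))  # (4.0, 4, [4])
print(summarize(skewed))     # (6.0, 4, [4])
print(summarize(bimodal))    # (4.0, 4, [2, 6])
```

In the symmetric set all three measures agree.  In the skewed set the single extreme score pulls the mean toward the tail while the median and mode stay put.  In the bimodal set the mean and median fall at a score that only one subject actually has.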

Example for Thinking about Central Tendency

Drawn from the Bellingham Herald

Median family income:  $35,225
  Washington:  The largest family paychecks in the nation are brought home in the big-city suburbs of the Northeast.  At the other end of the scale, family income on a stretch of South Texas farm land on the Mexican border is the lowest in the nation.  There's a connection between what you make and where you live, 1990 census figures show.  The counties where median family income is highest are concentrated outside Washington, D.C., and New York City.  Nationally, the typical family median income was $35,225 a year.  To government statisticians, a family is a group of related people living together.  The Washington suburb of Fairfax County, VA, had the highest median family income:  $65,201 a year.  Most of the lowest-income counties are rural.  In Starr County, Texas, the median family makes $10,903 a year.

Question:  What is the ratio of income for poor to rich?
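
One way to start on this question, using only the two county medians quoted in the article (variable names are made up for the example):

```python
# Median family incomes quoted in the article (1990 census figures).
starr_county_median = 10_903    # lowest-income county
fairfax_county_median = 65_201  # highest-income county

poor_to_rich = starr_county_median / fairfax_county_median
rich_to_poor = fairfax_county_median / starr_county_median

print(round(poor_to_rich, 3))  # 0.167
print(round(rich_to_poor, 2))  # 5.98
```

So the median family in the poorest county makes roughly one sixth of what the median family in the richest county makes.  Note that this compares two medians of families, not the poorest and richest people, which is where the remaining questions come in.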

What measure of central tendency?

Compute by family or per capita?

What were those government statisticians trying to do with these numbers?