Reliability: dependability, stability and consistency

Article author
Rachel Scriven


Reliability – consistency and stability

Personality data is only useful if the information it gives is reliable. If a person completes the questionnaire a second time, will they get broadly the same scores? And will the results be interpreted in the same way, with similar conclusions drawn? This is what test-retest reliability and consistency scores measure.


Test-retest reliability

Test-retest reliability means the questionnaire gives similar results each time it is used with the same person. Whether someone completes it twice on the same day (dependability) or again a few months later (stability), the results should be statistically similar.

Often retest reliability scores are generated from an artificial situation, such as getting a sample group to complete the questionnaire twice in short succession purely for test purposes.

At Facet5 we capture retest data from the real world, using genuine respondents from our database who have completed the questionnaire twice. They might have done so out of interest, to see whether they had changed in some way, or because a long time had passed since they first completed it.

When measured for each factor, Facet5 shows retest scores between .69 and .80. A score of 1 would mean exactly the same scores, and typically anything above .65 in a personality measure is a robust reliability score. This gives us confidence that, most of the time, a person who retakes a personality assessment will get very similar results.
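As a rough illustration of what a retest score expresses, the sketch below assumes it is a Pearson correlation between the scores from the first and second sitting. The function name `pearson_r` and the six pairs of scores are invented for the example and are not Facet5 data.

```python
from math import sqrt

def pearson_r(first, second):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(first)
    if n != len(second) or n < 2:
        raise ValueError("need two equal-length lists with at least 2 scores")
    mean_a = sum(first) / n
    mean_b = sum(second) / n
    # Sum of cross-products and sums of squares around each mean.
    cov = sum((a - mean_a) * (b - mean_b) for a, b in zip(first, second))
    var_a = sum((a - mean_a) ** 2 for a in first)
    var_b = sum((b - mean_b) ** 2 for b in second)
    return cov / sqrt(var_a * var_b)

# Hypothetical scores on one factor for six people at two sittings.
first_sitting = [3.0, 5.5, 7.0, 4.5, 8.0, 6.0]
second_sitting = [3.5, 5.0, 7.5, 4.0, 7.5, 6.5]
retest_r = pearson_r(first_sitting, second_sitting)
print(round(retest_r, 2))
```

A value close to 1 means people kept roughly the same position relative to each other between the two sittings.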

You can read more about retest reliability and circumstances likely to change a personality profile here.



Consistency is a measure of the way the profile is constructed from the questions. 

Once the personality traits have been conceptualised, consistency tells you whether the items (questions) are a fair and even reflection of the trait. Do all the questions that measure "Affection" contribute to the Affection score in the same way? 

This is important because the response to any individual item is made up of the respondent’s genuine position on the scale, plus a little bit of human variation. We might even notice for ourselves how we answer very similar questions a little bit differently, or in ways which seem to contradict a previous answer. 

We don't know how much variation or error is attached to any individual score. But if the response to a question contains a high proportion of error, and that question is a major contributor to the overall score, then the score will be overly affected by the error. If that question is no more important than any other, the effect of the error will be smaller. That is what consistency measures: how evenly the items contribute to the overall score, or how sensitive the overall score is to each individual item.


Main factor consistency

Cronbach’s alpha is a measure of internal consistency, that is, how closely related a set of items are as a group.  It is considered to be a measure of scale reliability. 
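For readers who want to see the calculation, here is a minimal sketch of the standard Cronbach's alpha formula, which compares the sum of the individual item variances with the variance of the total score. The function name `cronbach_alpha` and the small response table are made up for illustration; this is not Facet5 data or Facet5's own scoring code.

```python
from statistics import variance

def cronbach_alpha(responses):
    """Cronbach's alpha for a table of responses.

    responses: one row per respondent, each row a list of item scores.
    """
    n_items = len(responses[0])
    if n_items < 2:
        raise ValueError("alpha needs at least 2 items")
    items = list(zip(*responses))  # one tuple of scores per item
    item_variance_sum = sum(variance(item) for item in items)
    total_variance = variance([sum(row) for row in responses])
    return n_items / (n_items - 1) * (1 - item_variance_sum / total_variance)

# Hypothetical answers from five respondents to four items on one scale.
example_responses = [
    [4, 5, 4, 5],
    [2, 3, 2, 2],
    [3, 3, 4, 3],
    [5, 4, 5, 5],
    [1, 2, 2, 1],
]
alpha = cronbach_alpha(example_responses)
print(round(alpha, 2))
```

The closer the items move together across respondents, the larger the total-score variance is relative to the item variances, and the higher alpha becomes.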


Coefficient and number of items

The more questions in a scale, the greater the coefficient, which reflects how far all the questions measure the same factor.

With only 1 question, there is no reliability to compute. With 2 or more questions, we can calculate the coefficient between them, and we tend to find a higher coefficient with more questions. However, the gains in the coefficient get smaller with each additional question, so we rarely need more than 12 questions for each trait to get a good coefficient value.
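The diminishing gains can be sketched with the standardised form of alpha (the Spearman-Brown relationship), which depends only on the number of questions and their mean inter-item correlation. The mean correlation of 0.2 used here is an assumption chosen for illustration, not a Facet5 figure, and `standardized_alpha` is a hypothetical helper name.

```python
def standardized_alpha(n_items, mean_r):
    """Standardised alpha for n_items questions whose mean inter-item correlation is mean_r."""
    if n_items < 2:
        raise ValueError("a coefficient needs at least 2 questions")
    return n_items * mean_r / (1 + (n_items - 1) * mean_r)

MEAN_R = 0.2  # hypothetical mean inter-item correlation

# Print the coefficient for scales of increasing length.
for n_items in (2, 4, 8, 12, 16, 24):
    print(n_items, round(standardized_alpha(n_items, MEAN_R), 2))
```

Under this assumption the coefficient rises from about 0.33 with 2 questions to 0.75 with 12, while doubling the scale again to 24 questions only brings it to about 0.86.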


Coefficient and item inter-correlation

However, you could also get an extremely high coefficient by asking the exact same question over and over! This would be very reliable, but not very valid. So we need questions which all correlate with the same underlying factor, but are not too highly inter-correlated with each other. Ideally, scale inter-item correlations should be around but not much above 0.7.
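One way to keep an eye on this is to compute the average correlation across every pair of items in a scale. The sketch below does that with plain Pearson correlations. The helper names and the response table are hypothetical, and the final lines append an exact copy of the first item to show how a repeated question pushes the average up.

```python
from itertools import combinations
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(ss_x * ss_y)

def mean_inter_item_r(responses):
    """Average Pearson correlation over all pairs of items.

    responses: one row per respondent, each row a list of item scores.
    """
    items = list(zip(*responses))  # one tuple of scores per item
    pairs = list(combinations(items, 2))
    return sum(pearson_r(a, b) for a, b in pairs) / len(pairs)

# Hypothetical answers from five respondents to three items.
responses = [
    [4, 2, 5],
    [2, 3, 1],
    [3, 5, 4],
    [5, 4, 3],
    [1, 1, 2],
]
baseline = mean_inter_item_r(responses)

# Add an exact copy of the first item, i.e. ask the same question twice.
with_repeat = [row + [row[0]] for row in responses]
inflated = mean_inter_item_r(with_repeat)
print(round(baseline, 2), round(inflated, 2))
```

The repeated item correlates perfectly with its original, so the average rises even though no new information about the trait has been collected.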


Coefficient and item set dimensionality

Finally, consistency as measured by Coefficient α can vary according to the dimensionality of the items in the scale. This means that although all the items relate to one factor, in fact they can be seen as showing different elements of that factor.  In Facet5 we call these sub-factors. 


Sub-factor consistency

Computing the consistency of sub-factors is not as simple as for the main factors. Facet5 sub-factor scores are computed from a Promax rotation of the items in the main factor. So although each Facet5 sub-factor is made from all the items in the scale, the items are given different weights. This method of constructing scales makes estimating sub-factor consistency quite difficult. However, Promax makes sure that the factor loading given to each item is either quite high or quite low, with not much in the middle. So we can get an estimate of Facet5 sub-factor consistency by computing Coefficient α for the items in the sub-factor which have the highest loadings.
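The idea of estimating consistency from the highest-loading items can be sketched as follows. Everything specific here is an assumption made for illustration: the loadings, the 0.4 cut-off for a "high" loading, the response table and the helper names are all hypothetical, and the source does not state which cut-off Facet5 uses.

```python
from statistics import variance

def cronbach_alpha(responses):
    """Cronbach's alpha for rows of item scores (one row per respondent)."""
    n_items = len(responses[0])
    items = list(zip(*responses))
    item_variance_sum = sum(variance(item) for item in items)
    total_variance = variance([sum(row) for row in responses])
    return n_items / (n_items - 1) * (1 - item_variance_sum / total_variance)

def high_loading_items(loadings, cutoff):
    """Indices of the items whose loading on the sub-factor reaches the cut-off."""
    return [i for i, loading in enumerate(loadings) if abs(loading) >= cutoff]

# Hypothetical loadings of four items on one sub-factor after rotation.
loadings = [0.80, 0.10, 0.70, 0.75]
CUTOFF = 0.4  # assumed threshold for a "high" loading

# Hypothetical answers from five respondents to the same four items.
responses = [
    [4, 5, 4, 5],
    [2, 3, 2, 2],
    [3, 3, 4, 3],
    [5, 4, 5, 5],
    [1, 2, 2, 1],
]

selected = high_loading_items(loadings, CUTOFF)
# Keep only the selected items in each row, then compute alpha on that subset.
subset = [[row[i] for i in selected] for row in responses]
sub_factor_alpha = cronbach_alpha(subset)
print(selected, round(sub_factor_alpha, 2))
```

Because the low-loading item is dropped rather than down-weighted, the result is only an estimate of the consistency of the weighted sub-factor score.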


What is acceptable consistency?

There are different opinions as to what is acceptable consistency for a scale, and to some degree it depends on what is being measured. Kline (1999) suggests that the acceptable level of consistency as measured by Coefficient α is around 0.7. He also suggests it might be higher for constructs like intelligence, and this is widely accepted. But, as mentioned previously, he warns against going much higher than 0.7 for personality scales, as this risks making the scale too narrowly defined.

Most tests ask a broad range of questions that relate to the core factor. As the questions range further from the core, the consistency drops. So a lower consistency can in fact be a deliberate choice in order to broaden the measure. The following table shows the Coefficient α reported by some reputable personality models:


These figures are all taken from published data.

Most reputable models fit within the 0.6 to 0.9 range. The lower value (0.28) for the original 16PF reflects Cattell’s broad definition of the domain. Similarly, the lower value (0.29) for an HPI HIC probably reflects the smaller number of items. NEO-PI facets are also lower than the NEO-PI main scores.


Consistency in Facet5

To give as full an understanding as possible of the consistency of Facet5, we have taken Kline’s advice and aimed for the following three parameters:

  • An average item inter-correlation below 0.3 
  • A Coefficient α of around, but not much higher than, 0.7
  • Limited dimensionality: 2–3 sub-factors at most.

The internal consistency of Facet5 is at or above the expected level of 0.7 across factors, and across many different languages and samples.

Download the full reliability and validity document to see computed Coefficient α or Spearman-Brown coefficients across English language data sets and translated versions.

