Southern African Regional Poverty Network (SARPN)

The sensitivity of estimates of post-apartheid changes in South African poverty and inequality to key data imputations

Cally Ardington, David Lam, Murray Leibbrandt and Matthew Welch

Centre for Social Science Research - Working Paper No. 106

University of Cape Town

February 2005

Posted with permission of the Centre for Social Science Research at the University of Cape Town.
The full range of papers produced by the CSSR can be accessed at:


Changes in inequality and poverty are key dimensions of the transformation of any society. Given the twentieth century history of South Africa, these two dimensions of economic well-being and, in particular, their changing racial profiles have been of special interest. One of the important empirical traditions in tracking longer-run South African inequality and poverty changes has made use of records of personal income collected in the national censuses of 1970, 1991 and 1996 (McGrath (1983), Whiteford and McGrath (1994), Whiteford and van Seventer (2000)). In the apartheid era, such empirical work was central to highlighting the destructive impact of racially driven policies on South Africa’s non-white groups. In the post-apartheid era, these empirical analyses have taken on additional importance. The size and national reach of the 10 per cent micro sample from the 1996 census made it uniquely suited to deriving a set of district and provincial level poverty profiles that could be used to inform provincial and municipal budgetary allocations for various anti-poverty policies (Babita et al (2002)).

In 2004, the ten percent micro-sample from the 2001 census was released. This made it possible to use 1996 and 2001 micro-data to track immediate post-1994 progress in undoing the apartheid legacy. Leibbrandt et al. (2004) and Simkins (2005) presented initial results on the changes in the levels and composition of income inequality and poverty between 1996 and 2001 using these data. Whiteford and van Seventer (2000) had documented a high but constant national income inequality for the 1991 to 1996 period. Both Simkins (2005) and Leibbrandt et al (2004) showed that this inequality remained high and even took a turn for the worse. As regards racial inequality, between 1996 and 2001, inequality within each race group increased. Formal decompositions showed that this within-group contribution to aggregate inequality increased while the between-group component decreased. This represented a continuation of the trend that Whiteford and van Seventer (2000) had noted for the 1991-1996 period and, indeed, from as far back as 1975.
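The within-group and between-group split mentioned above can be illustrated with a short sketch. The text does not say which inequality index the cited decompositions used; the Theil T index is assumed here only because it decomposes exactly, and the group labels and incomes are hypothetical.

```python
import math

def theil_t(incomes):
    """Theil T index for a list of strictly positive incomes."""
    mean = sum(incomes) / len(incomes)
    return sum((y / mean) * math.log(y / mean) for y in incomes) / len(incomes)

def theil_decomposition(groups):
    """Split overall Theil T into (within, between) components.

    `groups` maps a group label to that group's list of positive incomes.
    Each group is weighted by its share of total income.
    """
    all_incomes = [y for ys in groups.values() for y in ys]
    total_income = sum(all_incomes)
    overall_mean = total_income / len(all_incomes)
    within = 0.0
    between = 0.0
    for ys in groups.values():
        income_share = sum(ys) / total_income
        group_mean = sum(ys) / len(ys)
        within += income_share * theil_t(ys)
        between += income_share * math.log(group_mean / overall_mean)
    return within, between

# Hypothetical incomes for two illustrative groups.
example = {"group_a": [10.0, 20.0, 30.0], "group_b": [100.0, 200.0, 400.0]}
within, between = theil_decomposition(example)
print("within:", within, "between:", between)
```

The two components sum to the overall index, so a rising within-group contribution alongside a falling between-group contribution can be read directly from the two returned values across years.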

The poverty analysis of Simkins (2005) and Leibbrandt et al (2004) revealed that national poverty worsened over the period, particularly for Africans. This suggested a continuation of the longer-run poverty trend revealed by Whiteford and van Seventer (2000). However, for the 1996-2001 period, the extent to which poverty was measured as increasing was very much dependent on the choice of poverty line. At lower poverty lines, the increase in poverty is significantly more muted than at higher poverty lines.1

The rationale behind Leibbrandt et al (2004) was to produce comparable empirical results to those of analyses of earlier years, such as that of Whiteford and van Seventer (2000). Such comparability demands that detailed attention be given to replicating data assumptions and methods that were used on the pre-1996 data sets. These methods are not necessarily current best practice and a number of improvements can be considered as soon as the focus switches to building up the best possible analyses of the income data for each of 2001 and 1996 without regard to longer-run comparability.2 It is certainly possible to undertake a thorough set of imputations for missing values on the personal income variable in the census and to ascertain the sensitivity of key income-based measures of well-being to these imputations. This is the broad task of our paper.3

In the 1996 and 2001 censuses, data on personal incomes is gathered by means of a question asking each person in the household ‘What is the income category that best describes the gross income of (this person) before tax?’ (Statistics South Africa (1996: 6) and Statistics South Africa (2001a: 3)).4 While the broad reach of the census data is its strength, this income data is far from ideal. Cronje & Budlender (2004) highlight one particular weakness; namely, that in both 1996 and 2001, the question on personal income requested an appropriate income band for each person rather than an income value. These bands were not a consistent set of real income categories across the two years. This is especially true at the top end. The highest band for personal income in 1996 was R30 000 or more. This is lower than the real income equivalent of the top three bands in 2001. This incompatibility of income bands in real terms needs to be dealt with in order to compare the data across time.5 There is no particular subtlety to the decisions that analysts make in this regard and the most that can be asked for is that the decisions are spelled out explicitly and that there is some assessment of the sensitivity of any analysis to alternatives.

A more important but largely unexplored consequence of recording personal incomes in bands is that anyone using the income variable for poverty and inequality analysis has to translate the bands into point incomes for each person. The general practice in South Africa has been to attribute the band midpoints to all individuals. This is only one of a number of possibilities and one of the tasks of this paper is to assess the importance of different within-band point income allocation rules.

The paper explores two further weaknesses in the personal income variable; namely, the large number of working age adults for whom the income variable is missing and the large number of working age adults for whom recorded income is zero. As shown later in the paper, a large percentage of individuals are recorded as having missing incomes or zero incomes. On aggregating these personal incomes into household incomes, this translates into a large number of households with missing total income values or zero total income values.

It is important that these two issues receive detailed attention. Regarding the missing data, who are these people and households? If they were not missing, where would they have fallen in the distribution of income and what impact would they have had on measured poverty and inequality? Regarding the zero incomes, even allowing for South Africa’s low labour market participation rates and high unemployment rates, it is highly unlikely that all of these zero income households had no adult members earning any income. In analysing poverty and inequality, previous practice has been to ignore the zeros or to change them to some arbitrarily small number. The former practice is an arbitrary decision to effectively remove a group of households who currently make up the bottom of the distribution. As such, this decision sharply decreases measured poverty levels and also narrows inequality. The latter practice effectively accepts all recorded zeros as genuine zeros, possibly leading to an overestimate of measured poverty and inequality.6
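The effect of the two practices described above can be seen in a small worked example. This is a minimal sketch: the incomes and the poverty line are hypothetical, the headcount ratio stands in for "measured poverty" and the Gini coefficient for "inequality", neither of which is specified at this point in the text.

```python
def headcount_ratio(incomes, poverty_line):
    """Share of units with income strictly below the poverty line."""
    return sum(1 for y in incomes if y < poverty_line) / len(incomes)

def gini(incomes):
    """Gini coefficient computed from the rank-weighted sum of sorted incomes."""
    ys = sorted(incomes)
    n = len(ys)
    total = sum(ys)
    if total == 0:
        return 0.0
    rank_weighted = sum((i + 1) * y for i, y in enumerate(ys))
    return 2 * rank_weighted / (n * total) - (n + 1) / n

# Hypothetical household incomes, including two recorded zeros.
with_zeros = [0, 0, 100, 300, 500, 900]
without_zeros = [y for y in with_zeros if y > 0]
poverty_line = 250  # hypothetical

for label, data in (("zeros kept", with_zeros), ("zeros dropped", without_zeros)):
    print(label, headcount_ratio(data, poverty_line), round(gini(data), 3))
```

Dropping the zeros removes the bottom of the distribution, so both measures fall; keeping them treats every recorded zero as genuine, so both measures sit at their upper extreme.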

In sum then, the focus of this paper is on three weaknesses in the personal income variable in the 1996 and 2001 South African census data; namely, missing data, a large number of implausible zero values and the fact that income is measured in bands. All of these weaknesses impact on measured individual and household income and therefore on measured poverty and inequality.

In the next section of the paper, we deal with missing data by imputing income bands for those with missing income data for 2001 using contemporary multiple imputation techniques. Statistics South Africa offers users of the 2001 data a single hot deck imputation for the missing 2001 personal income data. In line with contemporary practice, multiple imputation approaches are preferred to single imputations. Our work in this section will discuss and use a multiple imputation approach and will compare our imputation results with the hot deck results of Statistics South Africa. In the third section of the paper we consider the impact of implausible income values; in particular, the high percentage of households with zero income. Our approach is to use a set of decision rules to reclassify potentially problematic zero incomes as missing and then to re-run the multiple imputation process on the augmented missing data. The process allows any value that is reclassified from zero to missing to be imputed back into the data as a zero income if the census data support such an imputation.
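The general shape of this two-stage procedure can be sketched as follows. This is not the authors' imputation model, which is not described at this point in the text. As a stand-in, the sketch uses a simple donor-based scheme (an approximate Bayesian bootstrap within imputation cells), and the decision rule, the record fields `cell`, `band` and `employed`, and the data are all hypothetical. It assumes every cell contains at least one observed donor.

```python
import random
from statistics import mean

ZERO_BAND = 0  # index of the zero-income "band"

def reclassify_suspect_zeros(records):
    """Hypothetical decision rule: a zero income reported by an employed
    person is set to missing (None) so that it can be re-imputed."""
    return [
        {**r, "band": None} if r["band"] == ZERO_BAND and r["employed"] else dict(r)
        for r in records
    ]

def impute_once(records, rng):
    """Fill missing bands by drawing donors from the same imputation cell."""
    pools = {}
    for r in records:
        if r["band"] is not None:
            pools.setdefault(r["cell"], []).append(r["band"])
    # Approximate Bayesian bootstrap: resample each donor pool before drawing,
    # so that completed data sets differ by more than the final random draw.
    pools = {cell: rng.choices(donors, k=len(donors)) for cell, donors in pools.items()}
    return [
        {**r, "band": rng.choice(pools[r["cell"]])} if r["band"] is None else dict(r)
        for r in records
    ]

def multiply_impute(records, estimator, m=5, seed=0):
    """Create m completed data sets, apply the estimator to each, and
    return the pooled (average) estimate with the individual estimates."""
    rng = random.Random(seed)
    estimates = [estimator(impute_once(records, rng)) for _ in range(m)]
    return mean(estimates), estimates

def zero_share(records):
    """Example estimator: share of people in the zero-income band."""
    return sum(r["band"] == ZERO_BAND for r in records) / len(records)

# Hypothetical records; the cell could encode age group, education and so on.
people = [
    {"cell": "a", "band": 3, "employed": True},
    {"cell": "a", "band": 0, "employed": True},    # suspect zero
    {"cell": "a", "band": None, "employed": True}, # missing
    {"cell": "b", "band": 0, "employed": False},   # plausible zero, kept
]
pooled, per_run = multiply_impute(reclassify_suspect_zeros(people), zero_share, m=5)
print(pooled, per_run)
```

Because donors are drawn from observed values in the same cell, a reclassified zero can still come back as a zero whenever zero-income donors exist in that cell. The spread of the per-run estimates gives a rough sense of how much the result depends on the imputed values, which a single hot deck imputation cannot show.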

The income data in the 2001 census is given in twelve bands. As stated previously, in order to estimate measures of poverty and inequality, a continuous measure of income is required. Therefore, a further “imputation” step is required in order to translate the bands into point estimates. The lowest income “band” is zero income and no within-band decisions are necessary here. For the next ten bands, the convention (including for our analysis in sections 2 and 3 of this paper) is to allocate to each individual the midpoint income of the band within which the person is found. Finally, incomes falling in the highest (unbounded) band are all assigned the lower bound value for this top band. In section 4, we examine the sensitivity of the key results that are derived using this set of rules to those that are derived when we impute within-band point estimates from empirical distributions of personal incomes in each band. These empirical distributions are available from a national household income and expenditure survey of 30,000 households that was conducted in 2000.
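The conventional allocation rules just described (zero for the zero band, the midpoint for bounded bands, the lower bound for the open top band) can be written down as a minimal sketch. The band boundaries below are hypothetical placeholders, not the actual 2001 census bands.

```python
# Hypothetical bands as (lower, upper) bounds; None marks the open top band.
# These are NOT the actual census income bands.
EXAMPLE_BANDS = [
    (0, 0),        # zero-income "band": no within-band decision needed
    (1, 400),
    (401, 800),
    (801, 1600),
    (1601, None),  # unbounded top band
]

def band_to_point(band_index, bands=EXAMPLE_BANDS):
    """Translate an income band into a point income.

    Bounded bands receive their midpoint; the unbounded top band
    receives its lower bound.
    """
    lower, upper = bands[band_index]
    if upper is None:
        return float(lower)
    return (lower + upper) / 2.0

point_incomes = [band_to_point(i) for i in range(len(EXAMPLE_BANDS))]
print(point_incomes)
```

Swapping `band_to_point` for a function that draws from an empirical within-band distribution is the kind of alternative whose effect section 4 examines.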

One of the most important uses of census data is to calculate provincial poverty shares. As a final exercise on the 2001 data, in Section 5 we consider the impact of our imputations on the estimates of provincial poverty shares. These shares are important from a policy perspective as they are central to the formula for determining budget allocations for anti-poverty programmes. Encouragingly, we find that provincial poverty shares are robust to a range of assumptions about missing data values and the distribution of incomes within bands.
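A provincial poverty share can be computed as in the sketch below. It assumes that a province's poverty share means its share of the national count of poor people, which is an assumption about the definition; the province labels, incomes and poverty line are hypothetical.

```python
from collections import Counter

def provincial_poverty_shares(people, poverty_line):
    """Return each province's share of the national count of poor people.

    `people` is an iterable of (province, per_capita_income) pairs.
    Provinces with no poor people are omitted from the result.
    """
    poor_counts = Counter(
        province for province, income in people if income < poverty_line
    )
    total_poor = sum(poor_counts.values())
    if total_poor == 0:
        return {}
    return {province: n / total_poor for province, n in poor_counts.items()}

# Hypothetical data: (province label, per capita income).
sample = [
    ("P1", 50), ("P1", 120), ("P1", 900),
    ("P2", 80), ("P2", 700),
    ("P3", 1500),
]
print(provincial_poverty_shares(sample, poverty_line=200))
```

Running the same function on data sets built under different imputation assumptions, and comparing the resulting shares, is one simple way to check the kind of robustness reported here.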

In order to keep the discussion of sections 2 through 5 manageable, we discuss the techniques and illustrate their impact using the 2001 census data. However, all exercises were replicated on the 1996 data. The final section of the paper briefly returns to the issue of comparing 1996 and 2001 poverty and inequality situations in the light of our imputation work.


  1. The Leibbrandt et al (2004) paper goes on to complement the analysis of the income inequality and income poverty changes with an analysis of changes in access to services. This access-based approach focused on type of dwelling, access to water, energy for lighting, energy for cooking, sanitation and refuse removal. These data on access revealed significant improvements in access between 1996 and 2001. The contrast between these findings and the findings on income serves as an important reminder that income is only one of many dimensions to well-being and that non-money metric aspects of well-being are important.

  2. Even from the standpoint of consistency in the way that the census data was collected, there is some justification for an exclusive focus on 1996 and 2001: ‘In the apartheid years, different approaches were used for enumeration in different areas. In particular, some ‘black’ areas were “enumerated” by means of estimates from aerial photographs as it was considered too dangerous for enumerators to go door-to-door. The 1996 census was the first attempt to standardise methodology for all areas, and this practice was repeated in 2001’ (Cronje and Budlender 2004: 68).

  3. Simkins (2005) makes a promising start down this road. For both 1996 and 2001, a set of decision rules is applied to allocate positive incomes to some adults with missing incomes and to adults with zero incomes that are in households with zero income. These decision rules are overt and replicable. However, they are not anchored in the imputation literature and there is no testing for the sensitivity of results to plausible rule changes.

  4. In both years, the respondent was told that the reference period was 1 October of the previous year until 30 September of the census year. They were also told that this income should include all sources of income including housing loan subsidies, bonuses, allowances such as car allowances, investment income as well as any pension or disability grants. In 1996, the first question of a subsequent household module prompted respondents about “additional money that this household generates and that has not been included in the previous section. (For example, the sale of home grown produce or home-brewed beer or cattle or rental of property about remittance income.)” (Statistics South Africa, 1996: 7). This was followed by a question about total income from remittances or payments back home that the household had received over the past year. In 2001, these sources of household income were included as part of the prompt for the personal income question. The meta-data file states: “Income from the sale of home-grown produce or home-brewed beer or cattle was also to be included”. If any of these activities brought in income for the household as a whole rather than for a particular person, the enumerator was instructed to add the amount to the income of someone in the household. If the household had received remittances or payments from a person working or living elsewhere, the instruction was that this income should be added to the total of someone in the household, for example, the head of the household (Statistics South Africa, 2000a: 81). Given these differences between 2001 and 1996, the personal income data is not directly comparable. Aggregate household income data, including the two household level questions in 1996, should theoretically be comparable. However, without the specific questions around household level income, it is unlikely that such income was thoroughly captured in 2001. For this reason, we decided to use only income collected at the individual level in 1996. It should therefore be borne in mind that the 1996 estimates of per capita income are likely to be understated.

  5. Leibbrandt et al (2004) compressed the top end of the 2001 distribution of personal incomes into the real income equivalent of the top band in 1996. As all of these bands are well above any plausible poverty line, this has no impact on the analysis of poverty. However, because it compresses the top end of the 2001 income distribution, this decision does affect the inequality analysis. See Table A.3 in Appendix A of Leibbrandt et al (2004) for a detailed set of results.

  6. An example of the impact of these assumptions on measured poverty and inequality is contained in Appendix A of Leibbrandt et al (2004).
