Real world data is messy. Dirty. Untidy.
Before you can even think about using all of those pretty techniques you learned in econometrics class, you need to clean up the data.
Here is my nine-step approach.
1. Get organized. After several months of analysis do you honestly think you will have any recollection of exactly how you decided to recode the income data, or which specification it was that generated such beautiful results?
The best way of keeping track of your Stata work (and this post is written for people who use Stata) is to write all your code in a ".do" file, and then feed it into the computer. I should do that. But generally I don't, because I make so many mistakes it takes a long time to get the .do file to run (plus my co-authors do .do files). An alternative approach is to create a log file every time you open Stata (click "File" then "log" then "begin log"). From your log files you can go back and reconstruct what you've done. This is not an approach recommended by professional economists of high repute. They all use .do files.
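If you prefer typing to clicking, the same thing can be done with commands at the start of each session. A minimal sketch (the file name is just an example):
* close any log that happens to be open, then start a new one
capture log close
log using "HeightEarningsSession.log", text replace
* ...your analysis commands go here...
log close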
Once you open Stata and load your data file, you are likely to encounter an immediate problem: too much data. The Canadian Community Health Survey, for example, contains literally hundreds of different variables.
Now data cleaning is like any other kind of cleaning. The fastest way to tidy stuff up is just to chuck it out. In other words:
2. Choose your dependent and explanatory variables (and get rid of the other ones). Suppose, for example, you've just discovered that the Canadian Community Health Survey contains data on height and income and you decide to test the theory that tall people earn more.
There's a standard way to estimate earnings, and that's with a Mincer equation:
log earnings = a + r*S + b1*X + b2*X^2,
where S is years of schooling and X is "potential experience", calculated as age minus years of schooling minus six.
Clearly other factors also influence earnings – demographic variables (marital status, age, number of children), regional variables (urban v. rural, province of residence), whether the person's "human capital" was acquired in Canada or elsewhere (immigrant status, ethnicity, language).
And, of course, height.
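Putting that together, the eventual regression will look something like the sketch below. This is just to fix ideas: logincome, schooling, age, exper, expersq and heightcm are hypothetical names that you would have to construct from the CCHS variables first.
* potential experience and its square
gen exper = age - schooling - 6
gen expersq = exper*exper
* Mincer equation, augmented with height
regress logincome schooling exper expersq heightcm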
Which explanatory variables are acceptable and which ones aren't is, to some extent, a matter of custom and tradition – like the practice of calling age-schooling-six "experience." When in doubt, ask WWIMS or "What would intermediate micro (macro) say?" Most of the important relationships in economics are covered there.
But there are a couple of things to avoid.
First, don't just throw stuff into your regression equation. The scientific method as taught in elementary schools ("Formulate a hypothesis and then test to see if the data is consistent with your hypothesis") is not an accurate description of how people actually do research. But it has some merits as a guiding principle.
If you throw enough variables into your regression, the odds of something being statistically significant just by random chance are fairly high, as is illustrated by this brilliant cartoon from xkcd.com.
A second no-no is having variables on the right side of your equation that are influenced by the variable that is on the left hand side. Suppose, for example, you are studying how much low income Canadians spend on housing, and you think "people who spend more on food have less money to spend on housing." So you specify your model as:
housing spending = a + b1*income + b2*spending on food + b3*spending on other goods.
Well, you've basically estimated the person's budget constraint. Sure, more spending on food means less spending on housing, but that's true by definition. You haven't learnt anything about economic behaviour.
Once you've figured out what you need and don't need, you can chuck a whole bunch of stuff out using "keep" or "drop". For example, with the 2007 Canadian Community Health Survey, I used this command to keep the variables I needed:
keep VERDATE ADM_RNO GEOGPRV DHHGAGE DHH_SEX DHHGMS INCGPER INCGHH DHHG611 DHHGLE5 DHHGL12 HWTGHTM HWTGWTK SDC_8 SDC_9 SDCGCBG SDCFIMM SDCGRES SDCGCGT SDCGLNG SDCGLHM EDUDH04 EDUDR04 LBSG31 LBSDHPW
(Yes, I am a data hoarder – I can't help but think "one of these variables might be useful.")
Then save the file USING A DIFFERENT NAME so you don't write over your CCHS file, for example:
save "C:\Documents and Settings\FrancesWoolley\My Documents\My Data Sources\HeightEarningsCCHS.dta", replace
You're now ready for the next step:
3. Assess the situation.
The command
sum *
will give the number of observations for each variable, as well as the mean, standard deviation, minimum and maximum.
You will immediately notice that some variables don't look like you expect them to. For example, INCGPER – the personal income variable – has an average value of 3.2. DHH_SEX – the respondent's gender – has an average value of 1.55.
This is bad news. It means you need to go to the codebook to figure out exactly how these variables were defined, and recode them in a way that is useful to you. (Here is a link to the CCHS 2007-8 documentation). This documentation will be available wherever you got your data.
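A few Stata commands make this detective work easier: describe shows how each variable is stored and labelled, codebook summarizes its values, and tab shows the frequency of each category. For example:
describe INCGPER DHH_SEX
codebook INCGPER DHH_SEX
tab INCGPER
tab DHH_SEX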
And now you're ready for the next stage:
4. Create dummy variables
The documentation tells me that DHH_SEX is coded as male=1, female=2. I want to use sex as a "dummy variable" – a 0/1 on/off switch that is, say 0 for female respondents, and 1 for male respondents. To do that, I need to do some recoding.
First, I generate my new variable:
gen male = .
(this creates the variable "male" and gives everyone a missing value. It's an extra step, but it minimizes the odds of making an error).
Then we give every male respondent a value of "1":
replace male = 1 if DHH_SEX==1
and every female respondent a value of "0":
replace male = 0 if DHH_SEX==2
Notice that the first part of the expression has one equal sign and the second has two.
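(For the record, the same thing can be done in one line; the "if" part keeps missing values missing rather than turning them into zeros:
gen male = (DHH_SEX == 1) if !missing(DHH_SEX)
but the longer version above makes it easier to see exactly what you've done.)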
Sometimes it's necessary to create a whole load of dummy variables. For example, there are ten provinces in Canada, plus the territories. Each province has its own labour market, and so earnings vary in systematic ways across the provinces. The way to capture that is by creating dummy variables for each province. Rather than writing code to create 11 provincial and territorial dummies, you can just use this short-cut:
tab GEOGPRV, gen(prov)
This command creates 11 dummies for the provinces and territories called prov1, prov2, etc.
The general format for this command is
tab NameOfOriginalVariable, gen(PrefixForDummyVariables)
Other variables you may wish to recode in this way are marital status, highest level of education, plus linguistic, ethnic or cultural variables.
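For example, assuming DHHGMS is the CCHS marital status variable and EDUDR04 the highest-level-of-education variable (check the codebook to be sure), the same shortcut gives:
tab DHHGMS, gen(marstat)
tab EDUDR04, gen(educ)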
This is time consuming, and you might ask yourself – is it even necessary? In fact there is a shortcut in Stata that eliminates the need to create, for example, dummy variables for all of the provinces. You just put "xi:" before your regression command, and then an "i." before your categorical variable. So your regression becomes:
xi: regress income age i.GEOGPRV
You can use this for other commands as well.
As always, it's vital to keep track of what you're doing – so that once you have the regression results created by that shortcut, you can figure out which province is which. Short cuts can cause long delays…
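One quick way to see which numeric code (and hence which prov1, prov2, etc. dummy) corresponds to which province is to tabulate the variable with and without its value labels:
tab GEOGPRV
tab GEOGPRV, nolabel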
Getting back to the main topic – height and income –
If you recall, INCGPER – the CCHS personal income variable – has an average value of 3.2. Looking at the codebook reveals that personal income is coded in six categories: no income; less than $20,000; $20,000 to $39,999; and so on, up to $80,000 and above.
If your aim was to explain, say, cannabis consumption in terms of people's income, and you had, as is the case with the CCHS, thousands of observations, then perhaps you might want to convert those 6 income categories into dummy variables, and enter them separately into the regression equation.
Our aim, however, is to use height to explain income. There are techniques like interval regression and multinomial logit that work when income is reported in categories. But the old school approach is to convert those income categories to numbers. It's not the best solution, but it will find a relationship if there is one in the data, and it provides easy-to-interpret results. So the next step in data cleaning is:
5. Recode interval data to numbers as necessary, and vice versa.
A quick and dirty approach to recoding is to use the mid-points of the intervals, so every observation greater than $0 and less than $20,000 becomes $10,000, and so on.
Here is some code to do this:
generate income = .
replace income = 0 if INCGPER==1
replace income = 10000 if INCGPER==2
replace income = 30000 if INCGPER==3
replace income = 50000 if INCGPER==4
replace income = 70000 if INCGPER==5
But then you come to the tricky part: what to do with the people who earn over $80,000 a year?
There are basically two approaches here. One is to use a regression technique that eliminates the need to deal with the problem: Tobit regression.
Another is to use assumptions about the shape of the distribution to come up with a numerical value for top-coded observations – and there are various sensible ways of doing this. If you need a quick and dirty technique, you can use one of the techniques suggested by Michael Hout here.
Here is a Spreadsheet to Download that gives a couple of different formulas for imputing values for top-coded data – you can fill in your values and it will churn out an appropriate number for you. (Chris Auld, I am counting on you to check this and make sure it's correct).
Once you've done all of this recoding, you are now in a position to …
6. Rename if necessary
Pause for a minute to look at the names of your variables – it will save you time in the long run if you give your variables sensible names, so you don't have to puzzle over the cryptic originals every time you produce summary statistics or regression results. Rename variables using the command
rename OLDNAME NEWNAME
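For example, assuming HWTGHTM and HWTGWTK are the CCHS height-in-metres and weight-in-kilograms variables:
rename HWTGHTM heightm
rename HWTGWTK weightkg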
7. Create log variables and interaction terms as needed.
This is easy. For example,
gen agesquared = age*age
or gen logincome = log(income)
(But is it log or is it ln? You know what, it doesn't matter: in Stata, log() and ln() are the same function, and both take natural logarithms. If you don't believe me, generate both lnincome and logincome, and see what happens when you use them as dependent variables).
Now you are in a position to run your descriptive statistics again. And take another look at the frequencies for all of your variables. Are you missing lots of observations for some variables? If so, this is a matter that needs some thought.
8. Develop a strategy for dealing with missing values
Some CCHS respondents do not report their income.
If you don't have information on the dependent variable, there's nothing you can do – you just have to drop that respondent from your sample. That can cause problems, but there are also things that can be done to mitigate those problems, for example, the Heckman correction.
What about missing values for your explanatory variables? You can drop those respondents, too, but you risk making your sample less representative of the population as a whole. (Stata will automatically drop observations with missing values from a regression unless you tell it otherwise.)
If there are many missing observations for a particular explanatory variable, it might be better just to drop that variable from your analysis, rather than lose a lot of respondents. The risk here is that the model becomes "misspecified." Suppose, for example, you are trying to explain cannabis consumption in terms of income and education. If you drop income because there are too many missing values, then you might end up over-estimating the impact of education on cannabis consumption, since income and education are highly correlated.
An alternative strategy is to assign those people some value of the characteristic in question, a strategy known as "imputation". There are people who devote their entire lives to working out how to impute missing values. There are also high tech econometric techniques for dealing with missing values, such as expectation-maximization algorithms.
Most papers I read deal with missing values by dropping either cases or variables. That does not justify the practice, but it might make you feel a bit less guilty about doing it.
If you are dropping respondents with missing values, however, try to do it methodically. Sometimes people report descriptive statistics for their entire sample, then drop respondents with missing values for their regression analysis, and drop different respondents in different regressions. That's not helpful, because it's hard to figure out who is in the sample, or what's causing changes in results across specifications.
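One way to be methodical is to build a single sample indicator and use it in every table and regression, so that all your results are based on the same respondents. A minimal sketch (income, heightm, male and age stand in for whatever recoded variables you are actually using):
gen insample = !missing(income, heightm, male, age)
sum income heightm male age if insample
regress income heightm male age if insample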
9. Refine your sample
You're now almost ready to start running regressions. So it's time to once again step back and think "what am I trying to learn from this analysis?"
The point of analysis is making an inference from a sample to a population. So you have to ask yourself "What is the population that I'm interested in here?"
Suppose, for example, you were interested in the impact of height on earnings. The impact of a 12-year-old's height on her earnings isn't really of much interest, since this child should be in school anyways. And if a 5'3" 80-year-old has low earnings – well, it's probably because he's 80, not because he's short. So in economic analysis it's common to drop people over, say, 50, 60, or 65 and under, say, 18 or 25.
The choice of the younger age limit can make quite a difference to the results, and there aren't any hard-and-fast rules about which limit to choose.
An alternative approach is to drop anyone who is not employed, or not in full-time employment.
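In Stata these restrictions are just keep or drop statements, for example (a sketch, assuming you have already constructed age and employed variables from the CCHS codes):
* missing values count as very large numbers in Stata, so deal with them first
drop if missing(age)
keep if age >= 25 & age <= 65
keep if employed == 1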
The most important thing here is that the choice be careful and deliberate: the results of your analysis can only be generalized to people who are like those in your sample.
And now you're ready to go.
It's a bit like painting: doing the prep work can take hours or even days, but once you get down to it, there is almost instantaneous gratification.
But I'm not going to insult your intelligence by adding "step 10: Have fun!".
By the way, if you're interested, height does have a statistically significant impact on income. For men between 20 and 65 who are employed, an extra centimeter of height increases predicted annual income by $341. Using log of income as the dependent variable, a 1 percent increase in height is associated with a 0.52 percent increase in annual income. For employed women, the effect of height is stronger – $350 in annual earnings for a centimeter increase in height. With a log specification, a 1 percent increase in height translates into a 0.77 percent increase in annual income. You can download my regression results here: Download HeightIncomeRegressionResults.
Now these results may be sensitive to my treatment of top-coded observations or my choice to use OLS rather than another type of regression technique. But that's a topic for another post.
You may have heard this before, but I would strongly recommend you try using a Version Control System. Yes, it means you have to use .do files. Yes, it means a little more work. But, rather than a directory of foo-1.do, foo-2.do, foo-3.do files, with a VCS you’ll just have foo.do, along with a separate window that lists every change you made to the file and a one or two sentence explanation for each change. It will change your life, or at least a very small part of it.
This book provides a good introduction: http://pragprog.com/book/svn2/pragmatic-version-control-using-subversion
And this software doesn’t require a CompSci degree: http://tortoisesvn.net/downloads.html
(Just skip over the stuff about networks. You don’t need it.)
Version control is a wonderful thing. Without it, software as we know it wouldn’t exist.
[Preemptive note to compsci types: Yes, I’m a git user too, but git is clunky on Windows, and not terribly user-friendly. Hence, SVN.]
Sigh
Some day I’ll have to learn Stata.
Stephen – what! are you thinking of becoming a micro guy 😉 ?
Actually, for teaching. The students already know Stata, and it’s a pain to teach them Matlab and time series econometrics.
So economists don’t use R (r-project.org)? Why the heck not? I know nothing of Stata, but especially for the sort of garden-variety linear model problems and basic data processing discussed in this post, I can’t imagine that Stata is worth whatever it costs. R is not only free, it’s versatile, powerful, and has optional menu-driven interfaces (e.g., R Commander) that make it easy to use for most common statistical methods. Every step in this post is really easy with R. And even for more advanced stuff, I’d be very surprised if there’s anything Stata can do that R can’t. Indeed, I’m sure there’s much R can do that Stata can’t (e.g., can Stata simulate systems of ordinary or delay differential equations, or fit them to time series data?) There are mathematical things that R can’t do, for which you probably want commercial software, but in my experience they’re things like symbolic algebra (for which you probably want Mathematica), not number crunching. R is very quickly replacing all commercial software as the standard statistical platform in ecology.
Lots of economists use R. Also, lots of economists have a great deal of human capital in Stata and see no compelling reason to switch yet. Both Stata and R are very flexible matrix-oriented languages and both feature many user-written extensions which can be conveniently downloaded. For microeconometric models, it’s still more likely, I think, that you’ll find some new estimator available as a Stata command than as an R command. For applications not widely used in microeconometrics (such as working with differential equations), R is more well-developed. Stata is also much faster than R, at least for many common routines based on optimizing a function. Stata’s syntax is somewhat less arcane than R’s and Stata is therefore somewhat easier to teach to students with little programming background. A research budget makes Stata’s modest price basically moot, and for that price you get commercial support if you need it. And finally, it is easy to dump data from Stata and read it into R to estimate some model not available in Stata, so one can use Stata to manipulate data and R to estimate if need be. For all of these reasons, R is not clearly better than Stata for microeconometricians, although I would expect that that will slowly change over time if (or as) R becomes the statistical computing standard.
Incidentally, Frances, I didn’t check the spreadsheet but Stata could do the calculation with something like the following, if the highest income category were, say, 8 (over $100k) and the next, 7, is $80k-$100k:
count
local numobs = r(N)
count if INCGPER==8
local ftop = r(N)/`numobs'
count if INCGPER==7
local ftopminus = r(N)/`numobs'
local V = (log(`ftopminus' + `ftop') - log(`ftop')) / (log(100000) - log(80000))
local mtop = 100000*(`V'/(`V'-1))
di "Estimated income for top category: `mtop'"
replace income = `mtop' if INCGPER==8
Warning: I did not try to actually run this.
Chris has written a great follow-up post at:
http://chrisauld.com/2011/10/05/tips-on-estimating-models-and-producing-publication-ready-tables-using-stata/
You teach your students MATLAB Stephen? The engineering courses I took expected students to teach themselves, under pain of failed assignments due in MATLAB.
Stata is used over R because R makes it substantially harder to quickly perform common tasks.
R is much more elegant when toying with abstract, generated data though, to be sure.
@Publicus, IMHO, using any VCS is overkill in a single user environment. The overhead is not worth it. The point of VCS is to prevent development teams from stepping on each other's toes, and to facilitate building versions from multiple source code files. A single researcher working on a do file faces none of these problems… their organizational needs can be solved with a directory structure.
@Darren
“A single researcher working on a do file faces none of these problems… their organizational needs can be solved with a directory structure.”
Keeping multiple old copies of a file or folder around is a version control system. Dedicated VCS software adds the ability to see all your changes in one view with the rationale for each, and allows you to quickly find out when (and why) a specific change was made. (Assuming you diligently check your changes in.)
The biggest advantage of VCS is not having old copies available. It’s being able to answer the question “When did I make this change and what the heck was I thinking when I made it?”