Survey research is a multi-disciplinary activity, drawing on a wide variety of expertise
including subject matter specialists, questionnaire designers, statisticians, survey
managers, interviewers and computer specialists. Analysis of the resulting dataset furthers
research in the health and social sciences, which in turn affects policy. This proposal focuses
on the various complexities of a common data type, the sample survey, with potential to impact
the entire survey process, research in substantive areas, and subsequent policy.
Here is a brief description of the typical survey process. Once the objectives of the survey
are made clear, the subject matter specialist provides general questions that address the survey's
objectives. A questionnaire designer then translates these general questions into a format that
can be easily understood by a potential respondent. The questionnaire becomes the measuring instrument
used to obtain the data from the survey. Once the design of the questionnaire is complete, the sample
of potential respondents is chosen by some random mechanism. Because the cost of running the survey
competes with the statistical efficiency of the data that are collected, the sample may be obtained
in an economically efficient way that nevertheless results in a complicated data structure.
After the sample has been selected, the questionnaire is delivered by mail, telephone, personal
interview or via the Internet. This is carried out under the direction of a survey manager and
this aspect of the process includes the training and supervision of the interviewers. For a wide
variety of reasons, the data file that is returned once data collection is complete will almost
certainly contain missing values. Statisticians are involved in the development of the sampling
design, and in the analysis of the resulting data; the research of this MITACS team focuses on the
issues that arise in data analysis. It is important to note, however, that the analysis of the data
is closely tied to the way in which the data have been collected.
Statistics Canada surveys and similar surveys managed by Westat Inc, as well as other social surveys,
have increasingly complex structures. Most methods of data analysis have been developed for cross-sectional
scenarios where the data are collected at only a single time point. Although they may be described as having
complex structures, cross-sectional surveys are simpler than some new surveys currently being run.
Taking cross-sectional data as the standard fare of the past, current surveys increasingly exhibit
two new and important complexities: the introduction of time, and the introduction of space.
Longitudinal data involve a set of repeated observations on an individual, or group of individuals,
followed through time. Such data now come in many diverse forms, from multi-wave panel surveys to pooled
time-series and event histories, and have the advantage of lending a greater causal interpretation to
variables that are observed to be associated in time. If time adds a horizontal component to the
cross-sectional approach, giving context to the present, then space adds a vertical layer, making
embeddedness in social contexts an additional concern. Time and space methods occupy the developing
frontier in social science and health research.
Many newer surveys done by Statistics Canada collect longitudinal data. One prominent example is the National
Longitudinal Survey of Children and Youth. The data from this survey, along with similar surveys, are housed
at Statistics Canada Research Data Centres (RDCs) across Canada. Statistics Canada has identified a pressing
need for new methodologies in view of these ongoing longitudinal data collection efforts. The connection to
the RDCs is an important one. They are located on university campuses and are important sources of data for
subject matter researchers. Further methodological issues arise when subject matter researchers try to link
data from their own databases to data available through the RDCs. Some of the team members have been closely
involved in currently operating RDCs or in proposed new ones. With these kinds of connections there is the
opportunity for cross-fertilization of ideas as well as technology transfer with subject matter researchers
who also use these RDCs. Mary Thompson was one of the key people in bringing an RDC to the University of
Waterloo. Jamie Stafford is a member of the committee for the University of Toronto's RDC. David Bellhouse
is currently on a committee of social scientists and health care researchers that is in the final
stages of obtaining approval for the building of an RDC at the University of Western Ontario. Currently there
are nine RDCs operating in Canada with more being planned.
The purpose of this project is to further the research and development of methodological tools in complex
data structures arising from surveys by bringing together academics who have both methodological and subject
matter research interests in complex data structures, and researchers from Statistics Canada, Westat Inc and
the Toronto Rehabilitation Institute who have either a research interest or are at the front line working with
these data structures on a daily basis. There are three overlapping areas that we will pursue, all
related to the analysis of a common data type: surveys with complex structures. It is the various
complexities of this data type that have resulted in the following focused projects. Any particular
dataset could involve all, or some subset, of these complexities, requiring that they be simultaneously
addressed in an analysis. These are: (A)
modelling a process arising in a complex data structure; (B) variance, correlation structure and their estimation;
and (C) the handling of missing data. Several sub-projects are briefly described below as they relate to each of these
three general areas. All sub-projects that are being explored have direct application to complex surveys currently run
by Statistics Canada.
A. New models
Data that are collected on an individual may be obtained at a variety of levels. At the basic level is information
on each individual respondent. At another level a respondent may belong to a group and information can be collected
on the group that is relevant to the respondent. As an example, Statistics Canada's National Longitudinal Survey of
Children and Youth collects data on children, and often a second level of information on each child's family,
information that is relevant to all children in the family. Likewise, for a child attending school there might be
information on the child's teacher, and further information on the school. In modelling the dynamics of children's
development one can use the data obtained at the level of the child, or use the data obtained at several levels of
aggregation (multi-level modelling). Here are two projects related to the modelling of processes using survey data.
The first project involves single-level modelling over time, while the second involves multi-level
modelling over space and time.
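To fix ideas (the notation here is ours, not the proposal's), a basic two-level random intercept model for a measurement on child i in family j has the form

```latex
y_{ij} = \mathbf{x}_{ij}^{\top}\boldsymbol{\beta} + u_j + \varepsilon_{ij},
\qquad u_j \sim N(0,\sigma_u^2), \qquad \varepsilon_{ij} \sim N(0,\sigma_\varepsilon^2),
```

where the random effect u_j is shared by all children in family j, and further levels of aggregation (teacher, school) add further nested random effects.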
i. Modelling of correlated durations (spells) and life history transitions using longitudinal survey data
(Bellhouse, Lawless, Sutradhar)
Many types of behaviour over time are increasingly regarded as movements at random intervals from one state
to another. Examples include movements of individuals between different states of an illness, and movements of
individuals between various labour market states. There are several open research issues pertinent to the analysis
of these kinds of data: 1) how to model the processes of transition and duration in the presence of drop-out
or attrition, when the individuals are clustered due to a complex sample design; 2) how to accommodate
interval-censored duration data, where the censoring arises through the superposition of the process of data
collection on the process of spell duration; and 3) how to develop mixed models to accommodate between-subject
variability, and the combination of model-based and sampling-based analysis.
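The interval censoring in issue 2) can be made concrete in one standard formulation (the notation is ours): if a spell's duration has survivor function S(t; θ) and the interview schedule reveals only that the spell ended between successive contact times L_i and R_i, then individual i contributes

```latex
L_i(\theta) = S(L_i;\theta) - S(R_i;\theta)
```

to the likelihood, and under a complex design a weighted pseudo-log-likelihood \(\sum_i w_i \log L_i(\theta)\), with survey weights w_i, is a natural starting point for design-consistent estimation.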
Within Statistics Canada, Georgia Roberts and Milorad Kovacevic have already initiated some work in this area.
Prospects for collaborative research are excellent; for example Sutradhar and Kovacevic have already published
jointly on the topic of longitudinal data analysis (Biometrika, 2000).
ii. Multi-level modelling (Escobar, Lou, Reid)
Recent concerns with the hierarchical model involve the notion of multiple nesting units in social research:
the cross-nested random effects model. This model attempts to incorporate the more complex reality presented
by multiple social or health contexts, where an individual is embedded in multiple contexts at once, each
potentially having its own influence, but the contexts are not hierarchically related. Instead, they occur
at the same level of reality.
One example occurs in considering influences on child development and life course options, where both school and
neighborhood may have a determining role. The problem comes from the fact that schools typically will draw students
across neighborhoods, and children in the same neighborhoods may attend numerous schools: this is cross-nesting.
Separating these roles is essential and important, but the methods that can address this issue are still in development.
Another example of cross-nesting stems from cross-appointed physicians treating subjects at two or more of the treatment
sites, leading to some degree of correlation among providers across centers. Methods for dealing with such correlations
within a hierarchical data structure have yet to be developed to properly examine what patient, physician, and center
factors in addition to the intervention underlie any changes that might occur in the outcomes of interest.
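In symbols (our notation, for illustration only), a cross-nested random effects model for child i attending school j and living in neighborhood k can be written

```latex
y_{i(jk)} = \mathbf{x}_{ijk}^{\top}\boldsymbol{\beta} + u_j + v_k + \varepsilon_{ijk},
\qquad u_j \sim N(0,\sigma_u^2), \qquad v_k \sim N(0,\sigma_v^2),
```

where the crossing lies in the fact that the school index j does not determine the neighborhood index k, nor vice versa, so the two random effects cannot be arranged in a single nesting hierarchy.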
If the final level in a hierarchical model is time, further complexities arise when the hierarchy is not
preserved but changes in time due to migration, policy changes and the like. Current methods ignore part of
the structure, focusing, for example, only on time and not the hierarchy. Approaches are needed that account
for both the longitudinal structure and the spatial structure, where the latter itself changes in time.
Efforts are being made to develop collaborative research in this area. As part of the team's activities,
a session at the Statistical Society of Canada meetings in 2004 on the topic of multi-level modelling was
organized by Roland Thomas. Speakers were J. N. K. Rao (Carleton University), Emmanuel Behnin (Statistics Canada) and
Danny Pfeffermann (University of Southampton).
The group is collaborating with the Statistical and Applied Mathematical Sciences Institute (SAMSI) on a theme year
in Latent Variable Models in the Social Sciences (LVMSS). See www.samsi.info for a link to planned activities.
SAMSI held a kick-off workshop for the LVMSS theme year September 11-14, 2004. The first day featured tutorials on
Structural Equation Modelling and Multilevel Modelling.
B. Variance estimation
i. Estimation and analysis of dependencies with complex sampling designs (Rao, Stafford, Thompson)
Variances and covariances between different durations or spells (described, for example, in A.i) on different
individuals have a complex structure. They can be further complicated in the analysis of ordered multiple spells.
For example, an individual may contribute spells to several strata where we may allow the effects of covariates to
be different in these strata. Statistical inference about these effects needs to allow for the overlapping of the
strata. Model-specific research on statistical properties is needed in the presence of complex covariance structures
of the above type. Moreover, variance estimation methods based on linearization, the jackknife, the one-step
jackknife, the bootstrap and the estimating equation bootstrap require development.
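As one example of the replication methods listed, the stratified delete-one-PSU jackknife (a standard form, stated here in our notation) recomputes the estimate as \(\hat\theta_{(hj)}\) with primary sampling unit j of stratum h removed and its weight reallocated within the stratum, giving

```latex
v_J(\hat\theta) = \sum_{h=1}^{H} \frac{n_h - 1}{n_h}
\sum_{j=1}^{n_h} \left( \hat\theta_{(hj)} - \hat\theta \right)^2 ,
```

where n_h is the number of PSUs sampled in stratum h; linearization and bootstrap methods target the same design variance through Taylor expansion and resampling respectively.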
In addition to event history modelling, longitudinal surveys are used for other purposes such as gross flows estimation,
the elimination of effects of latent variables in linear regression models using individual changes between consecutive
time points, the modelling of marginal means of responses as functions of co-variables, and conditional modelling of the
response at a given time point as a function of past responses and present and past co-variables.
ii. Algorithms for the creation of replication variance estimators (Chen, Sitter, Wu)
Both Statistics Canada and Westat Inc produce public access sample files that have been stripped of identifiers in
order to protect respondent confidentiality. With the identifiers missing, it is impossible to calculate valid variance
estimates without further information. One solution is to use a replication variance estimator such as the bootstrap and
provide the user only with the replication weights. The use of replication methods for variance estimation in complex
surveys is highly computer intensive. However, the creation of a set of replication weights can be viewed as
a design of experiments problem, and there are now, in the field of computer experiments,
a number of sophisticated algorithms to construct large designs quickly and automatically. The adaptation and development of
algorithms for design of experiments toward automatic methods for quick creation of replication weights for use in variance
estimation for complex surveys will be investigated. In addition, there are difficulties in developing such replication
methods which adequately mask strata and cluster identifiers.
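As a minimal sketch of the kind of replication weights discussed here, the following implements the Rao-Wu rescaling bootstrap for a stratified PSU design, a standard survey bootstrap; the toy data, function names and choice of B are ours, and the masking of stratum and cluster identifiers noted above is not addressed.

```python
import numpy as np

def rao_wu_bootstrap_weights(strata, psu, w, B=500, seed=0):
    """Rao-Wu rescaling bootstrap replicate weights for a stratified
    PSU design (assumes at least two PSUs sampled per stratum)."""
    rng = np.random.default_rng(seed)
    strata, psu = np.asarray(strata), np.asarray(psu)
    w = np.asarray(w, dtype=float)
    reps = np.empty((B, w.size))
    for b in range(B):
        factor = np.empty(w.size)
        for h in np.unique(strata):
            in_h = strata == h
            psus = np.unique(psu[in_h])
            n_h = psus.size
            # resample n_h - 1 PSUs with replacement within the stratum
            draws = rng.choice(psus, size=n_h - 1, replace=True)
            m = np.array([(draws == p).sum() for p in psu[in_h]])
            # the rescaling keeps the replicate weights unbiased for w
            factor[in_h] = n_h / (n_h - 1) * m
        reps[b] = w * factor
    return reps

def bootstrap_variance(reps, w, y):
    """Bootstrap variance of the weighted mean, computed from the
    replicate weights alone."""
    theta = np.dot(w, y) / w.sum()
    thetas = (reps * y).sum(axis=1) / reps.sum(axis=1)
    return ((thetas - theta) ** 2).mean()

# toy data: 2 strata, 2 PSUs per stratum, 2 units per PSU
strata = np.array([1, 1, 1, 1, 2, 2, 2, 2])
psu = np.array([1, 1, 2, 2, 3, 3, 4, 4])
w = np.full(8, 10.0)
y = np.array([2.0, 3.0, 5.0, 7.0, 1.0, 4.0, 6.0, 8.0])
reps = rao_wu_bootstrap_weights(strata, psu, w, B=200)
v = bootstrap_variance(reps, w, y)
```

The point of the design-of-experiments connection is that the matrix `reps` is what would be released to users: a valid variance estimate is then computable without the stratum and PSU labels themselves.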
C. Missing data
Not everyone who is asked participates in a sample survey. Of those who do respond, some may not provide
answers to all the questions in the survey. The first situation is called unit nonresponse and the second
is item nonresponse. Our research proposal is directed to item nonresponse. Since survey data files almost
invariably contain missing data due to item nonresponse, this is a topic of interest to all users of survey
data as well as to those who produce the data, Statistics Canada and Westat included. Usually item
nonresponse occurs because of factors such as fatigue in filling out a long survey or a lack of
understanding of a question. In other cases a kind of item nonresponse is planned in advance of the survey.
Here the questionnaire is designed so that a single respondent is not asked all the questions in order to
reduce response burden, but over the entire set of respondents all the questions appear. Of the proposals
below, (i) deals with problems related to the usual type of item nonresponse and (ii) is related to a type
of planned item nonresponse.
i. Swiss cheese missing data (Rao, Sitter)
The standard method for handling item nonresponse is to impute the missing data. As the pattern of item
nonresponse becomes more complex, the imputation problem becomes more challenging. These complex item
nonresponse patterns are often described as Swiss cheese holes. Statistics Canada and Westat would like to
be able to build information into the data sets that are made available to researchers, which would allow
users to seamlessly compute variances that account for the imputation.
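As a minimal illustration of the imputation task itself, the sketch below fills a Swiss-cheese response pattern by a simplified random hot deck; this is not one of the methods under study, the data are made up, and it ignores both the survey design and any matching of donors to recipients.

```python
import numpy as np

def random_hot_deck(data, seed=0):
    """Fill each missing item (np.nan) with the value of a randomly
    chosen donor who responded to that item. A deliberately simplified
    sketch: donors are drawn uniformly within each item (column)."""
    rng = np.random.default_rng(seed)
    out = np.array(data, dtype=float)
    for j in range(out.shape[1]):
        col = out[:, j]            # view: assignments modify `out`
        miss = np.isnan(col)
        donors = col[~miss]
        if miss.any() and donors.size:
            col[miss] = rng.choice(donors, size=miss.sum(), replace=True)
    return out

# a small "Swiss cheese" pattern: holes scattered across items
nan = np.nan
raw = np.array([[1.0, 2.0, nan],
                [4.0, nan, 6.0],
                [nan, 8.0, 9.0],
                [10.0, 11.0, 12.0]])
completed = random_hot_deck(raw)
```

Because each hole is filled independently, a completed file like this one gives no direct way to recover the extra variance due to imputation, which is exactly the information the agencies would like to build into released data sets.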
ii. Item response theory (Thomas)
It is of interest to adapt item response theory and other psychometric methods for scaling and scoring to survey data.
This is of direct concern to Statistics Canada in some of their educational surveys. Much of the test data in these
surveys may be missing by design and must be stochastically imputed several times. Each stochastic imputation yields a
set of plausible values, which may be used for estimation. A problem is to compute variances for this procedure that
incorporate both the stochastic modelling as well as the survey design.
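Variance computation with plausible values is usually built on Rubin's multiple imputation combining rules; a minimal sketch follows (the notation and the numbers are ours, for illustration), combining the point estimates and design-based variances obtained from each of M sets of plausible values.

```python
import numpy as np

def combine_plausible_values(estimates, variances):
    """Rubin's rules: combine M point estimates (one per set of
    plausible values) with their design-based variances."""
    estimates = np.asarray(estimates, dtype=float)
    variances = np.asarray(variances, dtype=float)
    M = estimates.size
    theta_bar = estimates.mean()          # combined point estimate
    W = variances.mean()                  # average within-imputation variance
    B = estimates.var(ddof=1)             # between-imputation variance
    T = W + (1 + 1 / M) * B               # total variance
    return theta_bar, T

# M = 5 sets of plausible values -> 5 estimates with design variances
est = [502.1, 498.7, 500.4, 501.9, 499.2]
var = [4.0, 4.2, 3.9, 4.1, 4.0]
theta, total_var = combine_plausible_values(est, var)
```

The within term W carries the survey design (each `var` entry would itself come from a design-based method such as replication), while the between term B carries the stochastic imputation, so the total reflects both sources as required.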