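Editor's Recommendation
This book is the first teaching and research reference in China to systematically introduce the various multilevel models. It uses SAS, a well-known and internationally used statistical software package, to demonstrate applications of multilevel models. Working through concrete examples, it progresses from the basics to show how to use different SAS procedures, such as PROC MIXED, PROC NLMIXED, and PROC GLIMMIX, to analyze various kinds of multilevel data.
This book can serve as a textbook for graduate or undergraduate students in relevant programs at comprehensive universities, medical schools, universities of finance and economics, and teacher-training colleges, and as a reference for applied practitioners.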
Description
Multilevel Models: Applications Using SAS is written in nontechnical terms and focuses on the methods and applications of various multilevel models, including linear multilevel models, multilevel logistic regression models, multilevel Poisson regression models, and multilevel negative binomial models, as well as some cutting-edge applications such as the multilevel zero-inflated Poisson (ZIP) model, the random effect zero-inflated negative binomial (RE-ZINB) model, mixed-effect mixed-distribution models, bootstrapping multilevel models, and group-based trajectory models. Readers will learn to build and apply multilevel models for hierarchically structured cross-sectional data and longitudinal data using the internationally distributed software package Statistical Analysis System (SAS). Detailed SAS syntax and output are provided for model applications, giving students, research scientists, and data analysts ready templates for their own applications.
Table of Contents
Chapter 1 Introduction
1.1 Conceptual framework of multilevel modeling
1.2 Hierarchically structured data
1.3 Variables in multilevel data
1.4 Analytical problems with multilevel data
1.5 Advantages and limitations of multilevel modeling
1.6 Computer software for multilevel modeling
Chapter 2 Basics of Linear Multilevel Models
2.1 Intraclass correlation coefficient (ICC)
2.2 Formulation of two-level multilevel models
2.3 Model assumptions
2.4 Fixed and random regression coefficients
2.5 Cross-level interactions
2.6 Measurement centering
2.7 Model estimation
2.8 Model fit, hypothesis testing, and model comparisons
2.8.1 Model fit
2.8.2 Hypothesis testing
2.8.3 Model comparisons
2.9 Explained level-1 and level-2 variances
2.10 Steps for building multilevel models
2.11 Higher-level multilevel models
Chapter 3 Application of Two-level Linear Multilevel Models
3.1 Data
3.2 Empty model
3.3 Predicting between-group variation
3.4 Predicting within-group variation
3.5 Testing random level-1 slopes
3.6 Across-level interactions
3.7 Other issues in model development
Chapter 4 Application of Multilevel Modeling to Longitudinal Data
4.1 Features of longitudinal data
4.2 Limitations of traditional approaches for modeling longitudinal data
4.3 Advantages of multilevel modeling for longitudinal data
4.4 Formulation of growth models
4.5 Data description and manipulation
4.6 Linear growth models
4.6.1 The shape of average outcome change over time
4.6.2 Random intercept growth models
4.6.3 Random intercept and slope growth models
4.6.4 Intercept and slope as outcomes
4.6.5 Controlling for individual background variables in models
4.6.6 Coding time score
4.6.7 Residual variance/covariance structures
4.6.8 Time-varying covariates
4.7 Curvilinear growth models
4.7.1 Polynomial growth model
4.7.2 Dealing with collinearity in higher order polynomial growth model
4.7.3 Piecewise (linear spline) growth model
Chapter 5 Multilevel Models for Discrete Outcome Measures
5.1 Introduction to generalized linear mixed models
5.1.1 Generalized linear models
5.1.2 Generalized linear mixed models
5.2 SAS Procedures for multilevel modeling with discrete outcomes
5.3 Multilevel models for binary outcomes
5.3.1 Logistic regression models
5.3.2 Probit models
5.3.3 Unobserved latent variables and observed binary outcome measures
5.3.4 Multilevel logistic regression models
5.3.5 Application of multilevel logistic regression models
5.3.6 Application of multilevel logit models to longitudinal data
5.4 Multilevel models for ordinal outcomes
5.4.1 Cumulative logit models
5.4.2 Multilevel cumulative logit models
5.5 Multilevel models for nominal outcomes
5.5.1 Multinomial logit models
5.5.2 Multilevel multinomial logit models
5.5.3 Application of multilevel multinomial logit models
5.6 Multilevel models for count outcomes
5.6.1 Poisson regression models
5.6.2 Poisson regression with over-dispersion and a negative binomial model
5.6.3 Multilevel Poisson and negative binomial models
5.6.4 Application of multilevel Poisson and negative binomial models
Chapter 6 Other Applications of Multilevel Modeling and Related Issues
6.1 Multilevel zero-inflated models for count data with extra zeros
6.1.1 Fixed-effect ZIP model
6.1.2 Random effect zero-inflated Poisson (RE-ZIP) models
6.1.3 Random effect zero-inflated negative binomial (RE-ZINB) models
6.1.4 Application of RE-ZIP and RE-ZINB models
6.2 Mixed-effect mixed-distribution models for semi-continuous outcomes
6.2.1 Mixed-effects mixed distribution model
6.2.2 Application of the Mixed-Effect mixed distribution model
6.3 Bootstrap multilevel modeling
6.3.1 Nonparametric residual bootstrap multilevel modeling
6.3.2 Parametric residual bootstrap multilevel modeling
6.3.3 Application of nonparametric residual bootstrap multilevel modeling
6.4 Group-based models for longitudinal data analysis
6.4.1 Introduction to group-based model
6.4.2 Group-based logit model
6.4.3 Group-based zero-inflated Poisson (ZIP) model
6.4.4 Group-based censored normal models
6.5 Missing values issue
6.5.1 Missing data mechanisms and their implications
6.5.2 Handling missing data in longitudinal data analyses
6.6 Statistical power and sample size for multilevel modeling
6.6.1 Sample size estimation for two-level designs
6.6.2 Sample size estimation for longitudinal data analysis
References
Excerpt
In the linear model case, this integral can be solved in closed form, and the resulting likelihood or restricted likelihood can be maximized directly. For nonlinear multilevel models, however, the integral usually has no known closed form and must be approximated. Many methods have been proposed for this approximation. Two basic methods are: 1) linearization, which approximates the integrated likelihood function using techniques such as Taylor series expansion, and 2) integral approximation with numerical methods. These approaches are implemented in two SAS procedures, PROC GLIMMIX and PROC NLMIXED, and two macros, %GLIMMIX and %NLMIXED, respectively.
Prior to the current version of SAS (SAS 9.2) (SAS Institute Inc., 2008), PROC GLIMMIX was based solely on linearization methods. In version 9.2 of PROC GLIMMIX, linearization is the default estimation method, and two numerical integration methods, the Laplace approximation and adaptive Gauss-Hermite quadrature, have been added as options. The linearization method is also called a pseudo-likelihood method, in which pseudo-data are generated from the original data and the likelihood function is approximated using Taylor series expansions (Schabenberger, 2005). The essential idea of the linearization method is to approximate the GLMM by repeatedly fitting a normal linear mixed model. Among the various linearization methods available in the procedure, the default method is the restricted or residual pseudo-likelihood (REPL) (Wolfinger & O'Connell, 1993). The maximization of the pseudo-likelihood can be carried out by various optimization techniques in PROC GLIMMIX. The default optimization technique is the Newton-Raphson algorithm.
The major advantages of linearization-based methods include: First, they can fit models for which the joint distribution is difficult or impossible to ascertain. Second, compared with numerical integration methods, they allow a larger number of random effects to be estimated in the model. Third, the variance/covariance structure of the level-1 residual matrix (i.e., the R matrix) can be readily accommodated. Fourth, the model is iteratively estimated based on the linear mixed model, so both ML and REML are available for model estimation (Schabenberger, 2005). In addition, in our experience, linearization-based models are much faster to run.
The disadvantages of linearization-based methods include: First, they are based on iterative model estimation using pseudo-data constructed from the original data; as such, they do not have a real likelihood, and therefore the -2LL or deviance statistic cannot be used for model comparisons. Second, PROC GLIMMIX does not support the broad array of variance/covariance structures for the R matrix that are available in PROC MIXED (Schabenberger, 2005).
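To make the integral-approximation idea in this excerpt concrete, here is a minimal Python sketch of Gauss-Hermite quadrature for the marginal likelihood of a single cluster in a random-intercept logistic model. It is an illustration under stated assumptions, not the SAS implementation: it uses plain (non-adaptive) quadrature, which is simpler than the adaptive version named in the excerpt, and the function name, the single-predictor model, and all parameter values are hypothetical.

```python
import numpy as np


def cluster_marginal_likelihood(y, x, beta0, beta1, sigma_u, n_nodes=20):
    """Approximate one cluster's marginal likelihood by Gauss-Hermite quadrature.

    Assumed model (illustrative): logit P(y_ij = 1 | u_j) = beta0 + beta1 * x_ij + u_j,
    with random intercept u_j ~ N(0, sigma_u^2).
    """
    y = np.asarray(y, dtype=float)
    x = np.asarray(x, dtype=float)
    # Nodes z and weights w for integrals of the form: integral of exp(-z^2) g(z) dz.
    z, w = np.polynomial.hermite.hermgauss(n_nodes)
    # Change of variable u = sqrt(2) * sigma_u * z turns the N(0, sigma_u^2)
    # density into the exp(-z^2) weight, up to a factor 1 / sqrt(pi).
    u = np.sqrt(2.0) * sigma_u * z
    # Linear predictor for every (node, observation) pair.
    eta = beta0 + beta1 * x[None, :] + u[:, None]
    p = 1.0 / (1.0 + np.exp(-eta))
    # Conditional likelihood of the cluster's responses at each node value of u.
    cond = np.prod(np.where(y[None, :] == 1.0, p, 1.0 - p), axis=1)
    return float(np.sum(w * cond) / np.sqrt(np.pi))


# Hypothetical cluster with four binary responses and one covariate.
likelihood = cluster_marginal_likelihood(
    y=[1, 0, 1, 1], x=[0.5, -1.0, 0.3, 1.2], beta0=-0.2, beta1=0.8, sigma_u=1.0
)
print(likelihood)
```

The full likelihood would be the product of such terms over clusters, maximized over the parameters. With several random effects the number of quadrature points grows quickly, which is in line with the excerpt's remark that linearization methods can handle more random effects.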
Preface
Interest in multilevel statistical models for social science and public health studies has grown dramatically since the mid-1980s. New multilevel modeling techniques are giving researchers tools for analyzing data that have a hierarchical or clustered structure. Multilevel models are now applied to a wide range of studies in sociology, population studies, education studies, psychology, economics, epidemiology, and public health.
Individuals and the social contexts (e.g., communities, schools, organizations, or geographic locations) to which they belong are conceptualized as a hierarchical system, in which individuals are micro units and contexts are macro units. Research interest often centers on whether and how individual outcomes vary across contexts, and how that variation is explained by contextual factors; and on whether and how the relationships between the outcome measures and individual characteristics vary across contexts, and how those relationships are influenced or moderated by contextual factors. To address these questions, studies often employ data collected from more than one level of observation units, i.e., observations are collected at both an individual level (e.g., students) and one or more contextual levels (e.g., schools, cities). As a result, the data are characterized by a hierarchical structure in which individuals are nested within units at the higher levels. This kind of data is called hierarchically structured data or multilevel data. Conventional single-level statistical methods, such as ordinary least squares (OLS) regression, are inappropriate for the analysis of multilevel data because observations are nonindependent and the contextual effects cannot be addressed appropriately in such models. Multilevel modeling not only takes into account observation dependence in multilevel data, but also provides a more meaningful conceptual framework by allowing assessment of both individual and contextual effects, as well as cross-level interaction effects.
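As a rough illustration of why nonindependent observations matter, the following Python sketch computes the intraclass correlation coefficient (ICC, listed in the table of contents as Section 2.1) from two variance components, along with the standard design effect 1 + (n - 1) * ICC. The function names and the numbers are hypothetical, and the design effect formula assumes equal cluster sizes and applies to the variance of an estimated mean.

```python
def intraclass_correlation(between_var, within_var):
    """Share of total variance that lies between groups (level 2)."""
    return between_var / (between_var + within_var)


def design_effect(cluster_size, icc):
    """Variance inflation of a mean under clustering, assuming equal cluster sizes."""
    return 1.0 + (cluster_size - 1) * icc


# Hypothetical variance components: 1.0 between schools, 3.0 within schools.
icc = intraclass_correlation(between_var=1.0, within_var=3.0)
deff = design_effect(cluster_size=20, icc=icc)
print(icc, deff)
```

Under these assumed values, a quarter of the variance lies between groups, and treating 20 students per school as independent would understate the sampling variance of the mean several-fold.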
This book covers a broad range of topics in multilevel modeling. Our goal is to help students and researchers who are interested in the analysis of multilevel data to understand the basic concepts, theoretical frameworks, and application methods of multilevel modeling. The book is written in non-mathematical terms and focuses on the methods and applications of various multilevel models, using the internationally widely used statistical software, the Statistical Analysis System (SAS). Examples are drawn from analyses of real-world research data. We focus on two-level models in this book because they are the situation most frequently encountered in real research. These models can be readily expanded to models with three or more levels when applicable. A wide range of linear and non-linear multilevel models are introduced and demonstrated.
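Analyzing Data from Complex Systems: A Guide to Advanced Modeling and Practice

This book is aimed at professionals and advanced students doing in-depth research in data science, sociology, psychology, education, medicine, and related fields. It aims to provide a systematic, comprehensive, and highly practical framework and set of strategies for modeling complex data structures. We focus on moving beyond the limits of traditional linear models and explore in depth the settings in which data have an inherent hierarchical structure, repeated measurements, or non-independent observations.

In today's data-driven research environment, researchers increasingly face "nested" or "longitudinal" data structures: students nested within classes and classes nested within schools; repeated measurements of patients at different time points; or data on individuals from different regions and social backgrounds. Simply treating such data as independent observations in a traditional regression analysis not only underestimates standard errors and biases inference, but also ignores key information contained in the hierarchical structure, namely the interactions and heterogeneity across levels.

The core goal of this book is to build a solid bridge between current statistical theory and current software implementation, particularly for statistical software environments that perform well in fitting complex models. We assume that readers have already mastered the basic principles of statistical inference and the fundamentals of multiple regression, so the book moves directly to the theoretical essentials and practical details of complex models.

---

Part I: Beyond the Independence Assumption: Theoretical Foundations of Hierarchical Models

This part first gives readers a thorough understanding of the nature of complex data structures. We do not stop at recognizing that "levels" exist; we examine how this structure systematically shapes the variance and covariance structure of the data.

Chapter 1: The Challenges of Complex Data and the Logic of Model Selection
We analyze in detail what "non-independence" is and how it harms statistical power. Concrete case studies show the bias that standard OLS (ordinary least squares) produces when applied to nested data. We then introduce mixed-effects models as the tool of choice for such problems and spell out their advantages in modeling effects at different levels.

Chapter 2: Building and Interpreting Random Intercept Models
The random intercept model is the starting point of hierarchical analysis. This chapter explains in detail how to specify random intercepts to capture baseline differences between groups (level 2 or higher). We cover the theory of variance components and show how to interpret them, for example through the intraclass correlation coefficient (ICC), to quantify how much the hierarchical structure contributes to total variation. Interpreting the distribution of the random intercepts and what it says about individual differences is the focus of the chapter.

Chapter 3: Random Slopes and Cross-Level Interactions
Going a step further, this chapter discusses when random slope models are needed. When the effect of a predictor on the outcome itself varies across groups, a random slope model becomes indispensable. We explain in detail how to specify random slopes and describe their joint distribution with the random intercepts. The chapter also sets out the theoretical framework and statistical meaning of cross-level interactions, that is, the moderation of a lower-level predictor's effect by a higher-level variable.

Chapter 4: Statistical Principles of Model Fitting and Convergence Diagnostics
For complex models, especially those with many random effects, convergence is one of the biggest practical challenges. This chapter explains the mathematical differences between maximum likelihood (ML) and restricted maximum likelihood (REML) estimation and guides readers in selecting models with information criteria (such as AIC and BIC). We provide a systematic diagnostic workflow for identifying and resolving poor model fit and unstable parameter estimates.

---

Part II: Extended Applications and Advanced Methodology

With the basic random-effects models in hand, the book turns to more challenging and realistic applications, covering longitudinal data analysis and generalized linear mixed models (GLMMs).

Chapter 5: Longitudinal Data Analysis and Growth Curve Models
Repeated-measures data (such as panel studies) are a typical application of hierarchical models. This chapter concentrates on growth curve models. We show how to include time as a continuous variable and how to distinguish random intercepts from random slopes in trajectories over time. The chapter discusses in detail how to handle irregularly spaced measurement occasions and how to use covariates to predict differences in individual trajectories.

Chapter 6: Theoretical Foundations of Generalized Linear Mixed Models (GLMMs)
When the dependent variable is no longer normally distributed (for example, binary, count, or proportion data), the standard mixed model does not apply. This chapter builds the theoretical framework of the GLMM, with emphasis on link functions and the use of exponential-family distributions with multilevel data. We work through the specific forms that logit, probit, and Poisson/negative binomial models take within the mixed-model structure.

Chapter 7: Implementation Strategies for GLMMs and Special Cases
This chapter focuses on the practical side of GLMMs. For binary data (such as diseased/not diseased) and count data (such as the number of events), we provide detailed guidance on parameter estimation and interpretation. Particular attention goes to the strengths and weaknesses of numerical methods such as the Laplace approximation and penalized quasi-likelihood (PQL) in fitting complex GLMMs, and to how readers can judge the reliability of model output.

Chapter 8: Advantages of Bayesian Methods in Hierarchical Modeling
With highly complex hierarchical structures or small samples, traditional maximum likelihood methods may be limited. This chapter introduces Bayesian statistics as a powerful alternative for hierarchical models. We explain how to specify prior distributions and how to run MCMC (Markov chain Monte Carlo) algorithms, with emphasis on interpreting and reporting the posterior distributions of random effects within the Bayesian framework.

---

Part III: Applying Models and Reporting Results Reliably

This part guides readers in turning theory into persuasive research reports, ensuring robust models and transparent conclusions.

Chapter 9: Model Selection and the Comparison of Nested and Non-Nested Models
This chapter provides a clear decision tree for when a random slope is needed and when a random intercept alone is enough. We explain how to use likelihood ratio tests to compare nested models rigorously, and which statistical criteria to use when comparing non-nested models.

Chapter 10: Decomposing and Interpreting Effects: Quantifying Effects at Each Level
The ultimate value of a model lies in its explanatory power. This chapter focuses on reporting the results of hierarchical models clearly to audiences without a statistical background. We provide guidance on interpreting standardized and unstandardized coefficients, with emphasis on how to decompose and report fixed effects and random-effect variances from different levels, and how to visualize complex cross-level interactions.

Chapter 11: Handling Missing Data and Checking Model Robustness
In real-world research, missing data are the norm. This chapter discusses methods for handling missing data within the hierarchical modeling framework, including the limitations of listwise deletion and strategies for applying full information maximum likelihood (FIML) and multiple imputation in mixed models. It also presents practical techniques for checking robustness by changing model assumptions, such as the residual structure or the distribution of the random effects.

Appendix: Model Design and Reporting Standards
The appendix provides a comprehensive research design checklist for planning studies that use complex models, so that data collection supports the later hierarchical analysis. Drawing on the reporting guidelines of major academic journals, it also shows readers how to write a clear, complete report of mixed-effects model results that meets academic standards.

---

Practical guidance throughout: after the theory in each chapter, the book ties the material to actual workflows in current mainstream statistical software, using detailed example input files and annotated output so that readers can apply what they have learned to their own research data right away. We believe that mastering complex data structures is a key step toward greater depth in modern quantitative research.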
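The likelihood ratio test mentioned above for comparing nested models can be sketched in a few lines of Python. This is a minimal illustration rather than any package's routine: the function names and the -2 log-likelihood values are hypothetical, and the chi-square tail probability is coded in closed form only for 1 and 2 degrees of freedom to keep the example dependency-free. When the added parameter is a variance component, the null value lies on the boundary of the parameter space, so this naive p-value tends to be conservative.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable (closed forms for df = 1, 2)."""
    if x < 0:
        raise ValueError("test statistic must be non-negative")
    if df == 1:
        return math.erfc(math.sqrt(x / 2.0))
    if df == 2:
        return math.exp(-x / 2.0)
    raise ValueError("this sketch only supports df = 1 or df = 2")


def likelihood_ratio_test(neg2ll_reduced, neg2ll_full, df):
    """Compare nested models fitted by true maximum likelihood via their -2LL values."""
    statistic = neg2ll_reduced - neg2ll_full
    return statistic, chi2_sf(statistic, df)


# Hypothetical -2LL values for a reduced model and a model with one extra parameter.
stat, p_value = likelihood_ratio_test(neg2ll_reduced=2450.6, neg2ll_full=2444.1, df=1)
print(stat, p_value)
```

The comparison is only meaningful when both models are fitted to the same data with a true likelihood; fits based on pseudo-likelihood do not provide a usable -2LL.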