A Method for Estimating Relations and Predicting Behavior

Using Individual-Level Data

J. Miller McPherson

University of Arizona

James Cook

Duke University

November 10, 2000

DRAFT

A central task for the relational analysis of social structure has been the development of a practical method that embodies the core assumptions of structuralist theory. The effort to develop such a method involves the solution of a number of related problems. The structure of traditional data sets and the core sociological methods of analysis require both independent and dependent variables to be expressed as characteristics of some individual unit, whether that unit is a person, an organization or a nation. Structuralists have at times held individual-level data and methods responsible for the predominance of individual-level explanations in the social sciences (McPherson 1982; Marsden and Laumann 1984). Furthermore, these dominant methods assume that individual values of dependent variables are independent of one another. Yet one of the assertions at the core of the sociological enterprise is that individual values are, in fact, dependent upon one another (Erbring and Young 1979; Friedkin 1990).

Over the past two decades, a method addressing this problem has emerged. The network effects autocorrelation model adapts traditional regression analysis by adding one term alongside the traditional independent variables. This term, Wy, is the product of W, a matrix of relations among all individuals in the studied population, and y, a vector containing the values of the dependent variable for all cases. The trivial effect of y_i upon itself is excluded by ensuring that the focal case's own value is given zero weight (Burt 1982; Erbring and Young 1979). In the matrix multiplication, each cell in row i of W is multiplied by the matching entry of y, and the resulting products are summed across the row. The result is a vector, an autocorrelation variable now structurally identical to traditional variables. Its value for a given individual represents the accumulated magnitude of the dependent variable among that individual's network contacts. A positive and statistically significant effect of the autocorrelation variable in a regression model indicates that one is positively influenced by one's peers.
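As a minimal sketch of how the autocorrelation variable is formed, the following Python fragment computes Wy for a hypothetical four-person network. The matrix W, the vector y, and the function name network_term are illustrative assumptions, not data or code from the studies cited above.

```python
def network_term(W, y):
    """Return Wy: for each focal case i, the sum over j of W[i][j] * y[j].

    Self-influence is excluded by skipping j == i, which has the same
    effect as giving the focal case's own value zero weight.
    """
    n = len(y)
    return [
        sum(W[i][j] * y[j] for j in range(n) if j != i)
        for i in range(n)
    ]


# Hypothetical tie strengths among four people (rows: ego, columns: alter).
W = [
    [0.0, 0.5, 0.5, 0.0],
    [1.0, 0.0, 0.0, 0.0],
    [0.5, 0.0, 0.0, 0.5],
    [0.0, 0.0, 1.0, 0.0],
]
# Hypothetical values of the dependent variable for the same four people.
y = [1.0, 0.0, 1.0, 0.0]

wy = network_term(W, y)  # one autocorrelation value per person
```

Each entry of wy can then be entered into a regression like any other individual-level variable.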

The Ersatz Method

Unfortunately, some basic problems confront a researcher attempting to use the network effects model. These problems are succinctly identified by Ronald Burt:

· “The number of relations to be estimated for each of the networks within a system increases exponentially with system size…. But who lives in an area occupied by only 2500 persons?”

· “Structural theory makes statements about perceptions and behaviors in terms of the network context of actors. This context is lost for a random sample of actors…. There is no way of knowing how those respondents are connected within the system.” (Burt 1981: 313-314)

To solve the practical problem of network data collection and the methodological problem of stripping individuals of their relations in surveys, Burt proposes that “ersatz positions” with patterns of “ersatz ties” be substituted for observed network ties between actual network positions. An “ersatz position” is defined according to the pattern of relations that stem from it; two individuals occupy the same position to the extent that their patterns of relations resemble one another. Ties between ersatz positions represent the expected relation between occupants of two ersatz positions (Burt 1981: 317). If the number of positions in a system is small and the expected relations between positions can be ascertained, then running a network effects model becomes feasible no matter the system size or lack of direct network data.

When using the ersatz strategy of describing relations between positions rather than people, the challenge is then to identify a set of positions and elaborate a method for describing the relations between those positions. As Burt notes, Blau’s description of society as structured by demographically homophilous “parameters” fits the bill (Burt 1981: 319; Blau 1977). The widely substantiated homophily principle predicts that the probability of interaction increases as demographic similarity increases (for a review, see Smith-Lovin et al. 2001). Because the homophily principle predicts interaction patterns according to demographic position, it also predicts that individuals occupying the same demographic position occupy the same ersatz network position.

How might the relations between positions in a demographic structure be described? The homophily principle already embodies a general description: the farther apart two demographic positions are, the less interaction will tend to occur between any two of their occupants. However, in order to use homophily to construct a usable matrix for a network effects model, a specific quantitative description of the homophily effect is needed. Burt proposes that the mean proportion of the ties of “egos” at position i that extend to “alters” at position j be used as an asymmetric indicator of tie strength between demographic positions (Burt 1981: 325)[1]. The resulting square matrix (the size of which is equivalent to the number of distinct demographic positions) can be used to describe the set of relations in a population, no matter how large that population is.
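The position-to-position indicator just described can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than Burt's own procedure: the function name ersatz_tie_matrix, the education-based position labels, and the three egonetwork reports are all hypothetical, and the choice to skip egos who name no alters is ours.

```python
from collections import defaultdict


def ersatz_tie_matrix(egos):
    """Mean proportion of each ego's ties that go to each alter position.

    `egos` is a list of (ego_position, [alter_position, ...]) pairs.
    Returns {ego_position: {alter_position: mean proportion}}.
    Egos who name no alters are skipped (an assumption of this sketch).
    """
    sums = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(int)
    for ego_pos, alters in egos:
        if not alters:
            continue
        counts[ego_pos] += 1
        for alter_pos in alters:
            # Each named alter adds 1/len(alters) to this ego's proportion.
            sums[ego_pos][alter_pos] += 1.0 / len(alters)
    return {
        i: {j: total / counts[i] for j, total in row.items()}
        for i, row in sums.items()
    }


# Hypothetical egonetwork reports; positions are labeled by education level.
reports = [
    ("college", ["college", "college", "high school", "college"]),
    ("college", ["college", "high school"]),
    ("high school", ["high school", "high school"]),
]
ties = ersatz_tie_matrix(reports)
```

In this toy example the matrix is asymmetric: college egos direct some of their ties to the high school position, while the one high school ego names no college alters.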

An Alternative Model

In his discussion of the ersatz method for characterizing a network, Burt notes that “theory and practice are not the same thing. I do not expect ersatz network positions to be used in practice in the same manner that I have introduced them in theory” (Burt 1981: 331). Burt’s insight that a network of interactions can be winnowed from the set of nodes to the set of equivalent positions is quite valuable. However, there are some limitations to Burt’s approach that prevent its broad application. As a position-specific method, it uses network data incompletely and refrains from a parametric description of homophily that could be more generally applied. At any one position, only ties from that position are used for purposes of estimation. If the homophily effect could be estimated with a few parameters that predict patterns of influence by distance rather than position, network data from all positions could be used to predict network effects across all positions. Ultimately, predictions of network effects would become possible even without network data. This paper elaborates such a method. The first step of this method involves the estimation of homophily parameters. The second step involves the use of those parameters to predict individual behavior.

Estimation

In order to estimate homophily parameters for a demographic dimension or set of dimensions, two kinds of data are required: population density data and egonetwork data. Population density, the number of individuals occupying each value of a demographic category, is easily obtained from population censuses. Egonetwork data consist of the reports of “egos” about their relations with alters. Using a representative data set for which demographic information regarding alters is collected, we can estimate the demographic distribution of alters of an ego in a population. If we know the demographic distribution of the population from which the egonetwork data are sampled, then we can also estimate the demographic distribution of non-alters of an ego in that population (assuming that all those in a population who are not mentioned as alters are not alters).

Using this information, a data set can be constructed in which each dyad involving a potential tie between a sampled ego and a member of the population is a separate case[2]. The dependent variable of interest is the existence of a tie in the potential dyad. To estimate homophily in terms of demographic distance rather than demographic position, the independent variables of interest are measurements of distance in the dyad for various demographic categories. A logistic regression estimates the impact of demographic distance on the likelihood of a tie for a dyad in the population.
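The case structure described above, with non-alters handled as a group as in footnote 2, might be laid out as in the following sketch. It is an illustration only: the function name grouped_dyads, the use of age as the sole dimension, and the population counts and ego reports are hypothetical.

```python
def grouped_dyads(egos, population):
    """Build grouped dyad records for a logistic regression.

    `egos` is a list of (ego_age, [alter_age, ...]) pairs and `population`
    maps each age to the number of people of that age. Everyone in a cell
    who is not named as an alter is treated as a non-tie, so each record
    stands for `ties + non_ties` individual dyads.
    """
    records = []
    for ego_age, alter_ages in egos:
        for age, count in sorted(population.items()):
            ties = alter_ages.count(age)
            records.append({
                "abs_age_distance": abs(ego_age - age),
                "signed_age_distance": ego_age - age,  # ego minus alter
                "ties": ties,
                "non_ties": count - ties,
            })
    return records


# Hypothetical population counts by age and two hypothetical egos.
population = {29: 1000, 30: 1200, 60: 800}
egos = [(30, [29, 30, 30]), (60, [60])]
records = grouped_dyads(egos, population)
```

Six grouped records here stand in for 6,000 individual dyads (two egos times a population of 3,000), which is the saving that makes estimation on a national population tractable.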

Using the Current Population Survey, information was obtained for the demographic distribution of age and education in the population of the United States. Using the 1985 General Social Survey Network Module for ego network data representative of the U.S. population, age and education distances were calculated for every dyad pairing a sampled ego with a member of the U.S. population, including ties reported by egos and non-ties imputed from the Current Population Survey’s demographic distributions. The estimated homophily parameters for age and education distance are reported in Table 1.

Table 1. Regression Results: Estimated Homophily Parameters

Dependent Variable: Social Tie Reported?

Independent Variable	Coefficient	Chi-Square
Intercept	-14.8314**	3770.76
Absolute Age Distance	-0.2833**	900.35
Age Distance Squared	0.0136**	522.80
Absolute Age Distance Cubed	-0.00020**	410.22
Absolute Educ. Distance	-0.7219**	453.03
Educ. Distance Squared	0.1087**	148.12
Absolute Educ. Distance Cubed	-0.00569	85.75
Distance in Age (ego-alter)	0.00830**	63.58
Distance in Education (ego-alter)	0.0580**	86.01
Age of Ego	-0.0128*	5.56
Age of Ego Squared	0.00027**	22.47
Educ. of Ego	-0.2589**	70.21
Educ. of Ego Squared	0.0140**	144.73

Notes: * p<.05 ** p<.01

When the values of age and education distance are known, the significant results of Table 1 indicate that the likelihood of interaction within any particular dyad can also be estimated. Further, these parameters provide support for the homophily principle: as absolute demographic distance in a dyad increases, the likelihood of interaction between members of that dyad decreases[3]. The statistically significant and positive effect of signed distance between ego and alter in age and education belies the competing contention that individuals tie themselves disproportionately to higher-status others (which in this case would be older, more educated individuals). In fact, it seems that individuals are more likely to report a tie to another as that other becomes younger and less educated relative to themselves. The explanatory power of these secondary variables, furthermore, is dwarfed by that of the homophily parameters.

How can the results of Table 1 be substantively interpreted? The probability that one individual interacts with another individual is equal to e^b/(1+e^b), where b is the sum of the intercept and of each coefficient multiplied by the value of its variable for the dyad in question. For instance, the estimated probability that an individual (“Alice”) at 30 years of age and with 12 years of education interacts with another individual (“Bob”) at 29 years of age and with 13 years of education is equal to e^(-17.0052)/(1+e^(-17.0052)), or 4.11 * 10^-8, about 4 in 100 million. In other words, the probability that two particular individuals with the above characteristics interact is astronomically low (in fact, comparable to the probability of a significant asteroid impact in a given year). However, this probability is reasonably high compared to the probability that Alice interacts with another individual (“Cleo”) at 60 years of age and with 20 years of education, 1.74 * 10^-9, less than 2 in a billion.

These very low probabilities do not imply that an interaction with somebody like Bob or Cleo is impossible for Alice. Because there are millions of individuals occupying each position, the probability of interacting with somebody in a position becomes considerable: Alice’s probability of interacting with somebody who is aged 29 years and with 13 years of education is estimated to be 0.0693, and Alice’s probability of interacting with somebody who is aged 60 years and with 20 years of education is estimated to be 0.000189. Another way of expressing such probabilities is that Alice has about a 7% chance of interacting with a 29-year-old with 13 years of education. Thinking about the probabilities positionally, we might say that people occupying Alice’s position will have an average of .07 friends aged 29 years and with 13 years of education, or we might say that the strength of the tie between an individual at age=30, education=12 and the social position at age=29, education=13 is .07.
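The two dyadic probabilities for Alice can be approximately reproduced from the coefficients printed in Table 1, as the sketch below illustrates. The function name tie_probability and the dictionary keys are our own labels. Because the printed coefficients are rounded, the results agree with the figures reported in the text only to within about one percent.

```python
import math

# Coefficients as printed in Table 1 (rounded, so results are approximate).
COEF = {
    "intercept": -14.8314,
    "abs_age": -0.2833, "abs_age_sq": 0.0136, "abs_age_cu": -0.00020,
    "abs_educ": -0.7219, "abs_educ_sq": 0.1087, "abs_educ_cu": -0.00569,
    "signed_age": 0.00830, "signed_educ": 0.0580,
    "ego_age": -0.0128, "ego_age_sq": 0.00027,
    "ego_educ": -0.2589, "ego_educ_sq": 0.0140,
}


def tie_probability(ego_age, ego_educ, alter_age, alter_educ):
    """Estimated probability of a tie in one ego-alter dyad."""
    da = abs(ego_age - alter_age)    # absolute age distance
    de = abs(ego_educ - alter_educ)  # absolute education distance
    b = (
        COEF["intercept"]
        + COEF["abs_age"] * da
        + COEF["abs_age_sq"] * da ** 2
        + COEF["abs_age_cu"] * da ** 3
        + COEF["abs_educ"] * de
        + COEF["abs_educ_sq"] * de ** 2
        + COEF["abs_educ_cu"] * de ** 3
        + COEF["signed_age"] * (ego_age - alter_age)     # ego minus alter
        + COEF["signed_educ"] * (ego_educ - alter_educ)  # ego minus alter
        + COEF["ego_age"] * ego_age
        + COEF["ego_age_sq"] * ego_age ** 2
        + COEF["ego_educ"] * ego_educ
        + COEF["ego_educ_sq"] * ego_educ ** 2
    )
    return math.exp(b) / (1.0 + math.exp(b))


p_bob = tie_probability(30, 12, 29, 13)   # Alice with Bob
p_cleo = tie_probability(30, 12, 60, 20)  # Alice with Cleo
```

Signed distance is taken as ego minus alter, following the "(ego-alter)" labels in Table 1.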

More broadly speaking, the small set of homophily parameters from Table 1, when applied to a population distribution, provides a glimpse into the unique vantage point of any particular social position we might be interested in, into the social worldview of the individual occupying a position. Distributions of estimated probabilities of interaction are more than mathematical niceties; rather, they describe the shape of a fundamental unit of sociology, the social circle. Because interaction has been tied to the diffusion of ideas, materials, diseases, behaviors, organizational affiliation (indeed any mutable social property), the ability to estimate probabilities of interaction between positions also provides us the ability to predict the diffusion of substantively important social content.

Prediction

This insight brings us back to our objective: prediction of individual behavior from patterns of interaction. The values estimated above, when distributed across cells in a matrix with dimensions corresponding to sociodemographic characteristics, constitute the elusive W matrix that allows the network effects model to be employed for a large population. Now that a set of weights representing the strength of ties between individuals and positions in social space has been estimated, it is only necessary to designate a social phenomenon to be explained. In a modification of the classic network effects model (Erbring and Young 1979) to reflect the positional rather than individual nature of the W matrix, the W matrix is multiplied elementwise by a companion Y matrix of the same dimensions. Cells in the Y matrix consist of the conditional mean of a dependent variable, from a random sample of the relevant population, for the particular combination of demographic characteristics the cell represents. By multiplying corresponding cells of the two matrices, the average value of the dependent variable at a demographic position is weighted by the degree of interaction a focal individual is expected to have with that position. Summing the results across all positions creates a single value, the magnitude of which should be positively related to the likelihood that a focal individual is associated with the idea, behavior, disease, organization, or other social content of concern.
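The elementwise multiplication and summation can be sketched as follows. The function name network_effect is ours, the two tie strengths are the worked figures for Alice given earlier in the text, and the Y values are hypothetical. A real application would include every age-by-education position, and skipping positions with no observed mean is an assumption of this sketch.

```python
def network_effect(w_row, y_means):
    """Sum over positions of (expected tie strength) * (mean of y there).

    `w_row` maps each position to the focal individual's expected tie
    strength to that position; `y_means` maps each position to the
    conditional mean of the dependent variable. Positions with no
    observed mean are skipped.
    """
    return sum(w * y_means[pos] for pos, w in w_row.items() if pos in y_means)


# Tie strengths for a focal individual aged 30 with 12 years of education,
# keyed by (age, years of education) of the target position.
w_row = {(29, 13): 0.0693, (60, 20): 0.000189}
# Hypothetical share of daily newspaper readers at each position.
y_means = {(29, 13): 0.40, (60, 20): 0.80}

effect = network_effect(w_row, y_means)
```

The resulting value is the individual-level network effect variable that enters the second-stage regression.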

As a preliminary test of this method, we chose a few dependent variables from the 1985 General Social Survey. After creating independent variables using the method described above, logistic regressions were run to test the ability of those independent variables to predict individual behavior. The results, presented in Table 2, are indeed preliminary but encouraging nonetheless. Further tests should include additional individual-level independent variables from the sociological literature to compare the impact of individual and structural factors on the diffusion of social content.

Table 2. Results of Logistic Regressions Predicting Individual Behavior

Dependent Variable	Intercept	Network Effect
Reads Newspaper Every Day	-1.244**	1.1897**
U.S. Should Get Out of U.N.	0.9871**	3.0818**
Communism is Worst/Bad Government	1.6947**	0.3466
Discusses Politics All/Most of Time	-1.9961**	1.0797**

References

Blau, Peter M. 1977. Inequality and Heterogeneity. New York: MacMillan.

Burt, Ronald S. 1981. "Studying Status/Role-Sets as Ersatz Network Positions in Mass Surveys." Sociological Methods and Research 9: 313-337.

Erbring, Lutz and Alice Young. 1979. "Individuals and Social Structure: Contextual Effects as Endogenous Feedback." Sociological Methods and Research 7: 396-430.

Friedkin, Noah. 1990. "Social Networks in Structural Equation Models." Social Psychology Quarterly 53: 316-328.

Marsden, Peter V. and Edward Laumann. 1984. "Mathematical Ideas in Social Structural Analysis." The Journal of Mathematical Sociology 10: 271-294.

McPherson, J. Miller. 1982. "Hypernetwork Sampling: Duality and Differentiation Among Voluntary Organizations." Social Networks 3: 225-249.

Smith-Lovin, Lynn, J. Miller McPherson and James M. Cook. 2001. Homophily Review Piece. Annual Review of Sociology.

[1] The term “ego” is used to refer to an individual who is asked about their social contacts. The term “alters” is used to refer to the contacts referred to by ego.

[2] The number of cases for analysis therefore equals the number of egos sampled multiplied by population size. Fortunately, the logistic regression procedures used here allow non-events (in this case, non-alters) to be referred to as a group, thereby alleviating an otherwise considerable computing burden.

[3] The non-monotonic nature of the relationship expressed by the squared and cubed terms is due to kin relationships between generations.