Thursday, April 30, 2015

Regression Analysis

Part I:
The regression relating crime rate to the percentage of students getting free lunch has a significance level of .005, indicating that the relationship is indeed statistically significant. The regression predicts that 48.5% of the students will get free lunches when the crime rate is 79.7 per 100,000 people. I'm confident that there is a relationship; I just don't think that relationship is quite as strong as the local news station might be implying.

Part 1 












Part II:
Introduction:
The UW System wants to know why students choose the schools they attend. To analyze this, spatial regression analysis is performed on data regarding university enrollment and county population.

Methods:
Spatial regression was tested through three separate equations for each of two schools, Eau Claire and Milwaukee. For each equation, the null hypothesis states that there is no relationship between the two variables, and the alternate hypothesis states that there is a relationship between the two variables. The variables tested against the number of students attending each school were: population divided by distance, percentage of the county population with a bachelor’s degree, and median household income per county.
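As a rough illustration of what each of these equations does, here is a minimal sketch of a simple linear regression in plain Python. The original analysis was run in statistical software, so this is not the method actually used; the function name `simple_regression` and the county numbers are hypothetical and only there to show how "population divided by distance" becomes the independent variable.

```python
def simple_regression(x, y):
    """Least-squares fit of y = intercept + slope * x.

    Returns (intercept, slope, r_squared).
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = sxy ** 2 / (sxx * syy)
    return intercept, slope, r_squared


# Hypothetical counties: (population, distance to campus, students enrolled).
# These numbers are made up for illustration; they are not the UW data.
counties = [(100000, 50, 410), (20000, 100, 45), (500000, 250, 390), (60000, 30, 400)]

pop_over_dist = [pop / dist for pop, dist, _ in counties]
students = [enrolled for _, _, enrolled in counties]

intercept, slope, r_squared = simple_regression(pop_over_dist, students)
print(intercept, slope, r_squared)
```

An r² close to 1 means the independent variable explains most of the variation in enrollment, while an r² close to 0 means it explains very little, even if the relationship is statistically significant.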

Results:
Figure 1
Figure 2
For each school, only two of the three tests were deemed significant: population divided by distance and percentage of the county population with a bachelor’s degree. Because the significance level of each of these tests was .005 or smaller, for each of these we REJECT THE NULL! The Eau Claire population divided by distance to student regression (Figure 1) has a significance level of .000 and an r² of .945, showing this relationship to be strong. The Eau Claire bachelor’s degree to student regression (Figure 2) has a significance level of .003 and an r² of .121, showing this relationship to be very weak. The Milwaukee population divided by distance to student regression (Figure 3) has a significance level of .000 and an r² of .922, showing this relationship to be strong. The Milwaukee bachelor’s degree to student regression (Figure 4) has a significance level of .001 and an r² of .160, showing this relationship to be weak.

The regressions of the number of students attending against median household income had significance values of .104 and .027 for Eau Claire (Figure 5) and Milwaukee (Figure 6), respectively. Because both of these significance values are greater than .005, both FAIL TO REJECT THE NULL!
Figure 3


Figure 4
Figure 5
Figure 6
When looking at Residual Map 1, it can be seen that areas with larger populations (other than Milwaukee) have higher numbers of students attending Eau Claire than the regression would predict; however, most of the state closely follows the predicted regression. When looking at Residual Map 2, it seems that closer counties deviate above the regression more than counties further away. Regardless of what the symbology of Residual Map 3 seems to indicate, the map shows that areas with higher populations (other than Milwaukee) deviate further above the regression than closer areas with lower populations. Residual Map 4 shows small rural counties with smaller populations, and counties closer to Milwaukee, deviating above the regression. For all of the maps, distance is the most common influence on school selection throughout the state. Percentage of the population with a bachelor’s degree has some influence, but would perhaps be more indicative if it were weighted by distance as well.
Residual Map 1

Residual Map 2

Residual Map 3

Residual Map 4

Friday, April 10, 2015

Correlation and Spatial Autocorrelation

Part I:

1. 
Hypotheses:
Null hypothesis: there is no linear association between distance in feet and sound level in decibels (r = 0)
Alternate hypothesis: there is a linear association between distance in feet and sound level in decibels (r≠ 0)

Question 1
The Pearson correlation for distance and sound level is -.896. The magnitude of .896 shows that the variables are strongly correlated, and the negative sign shows that as distance increases the sound level decreases. The critical value at 8 degrees of freedom for a 95% confidence level is 1.860, and the t-score is -5.71. Because the absolute value of the t-score is greater than the critical value, the null is rejected.
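The t-score above can be checked by hand from the correlation coefficient. This is a small sketch of that arithmetic, assuming 10 paired observations (which is what 8 degrees of freedom implies for a correlation test); the function name `correlation_t` is my own.

```python
from math import sqrt


def correlation_t(r, n):
    """t statistic for testing whether a Pearson r differs from 0.

    Uses t = r * sqrt(n - 2) / sqrt(1 - r^2) with n - 2 degrees of freedom.
    """
    df = n - 2
    t = r * sqrt(df) / sqrt(1 - r ** 2)
    return t, df


# r comes from the text; n = 10 is inferred from the 8 degrees of freedom.
t, df = correlation_t(-0.896, 10)
print(round(t, 2), df)  # -5.71 8
```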


2.
The findings from the correlations show several patterns. The strong negative correlation of -.887 between percent white and percent black reflects one of the major reasons why Milwaukee is seen as one of the most segregated cities in North America. Most of the neighborhoods with a high percentage of white residents have almost no black residents, and there is a large separation between the two groups. The differences between the two groups become even more pronounced when the correlation between percent white and percent with a bachelor's degree is compared to the correlation between percent black and percent with a bachelor's degree, as both are moderately strong, yet in opposite directions. Neighborhoods with a higher percentage of white population typically have less of their population living below the poverty line, as there is a -.767 correlation between the two. Unfortunately, the opposite holds true for percent black and population living below the poverty line, as there is a moderately strong positive correlation of .668 between the two. The correlation between percent white and percent Hispanic is almost identical to the correlation between percent black and percent Hispanic, at -.218 and -.246 respectively. It seems that almost every demographic group is just as likely to walk to work as the others, with the only notable correlation being a weak positive correlation of .354 between the percent below the poverty level and the percent that walk to work. The correlations for the percentage with no high school diploma vary across the groups, from a moderate negative correlation with percent white and percent with a bachelor's degree, to a high positive correlation with percent Hispanic and a moderate positive correlation with percent below the poverty line. This suggests that in the Hispanic neighborhoods, the percentage of the population with a diploma is lower than in other neighborhoods.
Question 2
Part II:

Introduction: 
The Texas election commission is analyzing the patterns of elections and wants to see if any of the election patterns are clustered. Furthermore, they want to determine if election patterns have changed over 20 years. My task is to analyze the data to determine whether there is spatial autocorrelation in the voting results and, if the populations are indeed clustered, whether there are any correlations among the variables.

Methods: 
In order to test clustering in the election data, I used GeoDa to create several LISA maps and to calculate Moran's I for the variables. Next I used SPSS to create a correlation matrix of the variables.
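For readers who have not seen Moran's I before, here is a minimal sketch of the global statistic in plain Python. It is not GeoDa's implementation: it uses a simple binary neighbor matrix (GeoDa typically row-standardizes its weights), and the four-area example and the name `morans_i` are hypothetical.

```python
def morans_i(values, weights):
    """Global Moran's I for a list of values and a square weights matrix.

    weights[i][j] is the spatial weight between area i and area j
    (here 1 for neighbors, 0 otherwise).
    """
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    weight_sum = sum(sum(row) for row in weights)
    cross = sum(
        weights[i][j] * dev[i] * dev[j] for i in range(n) for j in range(n)
    )
    sum_sq = sum(d * d for d in dev)
    return (n / weight_sum) * (cross / sum_sq)


# Hypothetical layout: four areas in a row, each bordering the next one.
weights = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]

clustered = morans_i([10, 12, 30, 32], weights)    # similar values side by side
alternating = morans_i([10, 30, 10, 30], weights)  # high next to low
print(clustered, alternating)
```

Values near 1 indicate that similar values sit next to each other (clustering), values near 0 indicate a random pattern, and negative values indicate that high and low values alternate.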

Results: 
My analysis has determined that all of the variables are clustered, but not all to the same extent. The most clustered was the percentage of Hispanic persons throughout Texas, with a Moran's I value of 0.7787 (graph 1, map 1). The percent of the population that voted Democratic in 2008 was the second most clustered, with a Moran's I of 0.6957 (graph 2, map 2). The voting turnout in 2008 was less clustered than the percentage that voted Democratic, with a Moran's I of only 0.3634 (graph 3, map 3). The 80s voting data differed from the 2008 voting data, but only slightly, with the percent Democratic having a Moran's I of 0.5752 (graph 4, map 4) and the voting turnout having a Moran's I of 0.4681 (graph 5, map 5).

When looking at the correlation matrix (graph 6) and the LISA maps, further results may be observed. The percent that voted Democratic in the 80s and the voting turnout percentage of a county in the 80s had a correlation of -.612, indicating that areas with a higher percentage of the population voting Democratic had fewer people actually vote. In 2008 the same comparison resulted in a correlation of -.604, suggesting that this trend hasn't changed much in the last 20 years. When comparing the Democratic voting percentages from the 80s to 2008, the resulting correlation is .540, suggesting that areas that were more Democratic in the 80s tended to be more Democratic in 2008 as well. When comparing the 2008 voting data to the percentage of the population that describes itself as Hispanic, there is a correlation of .669 between percent Hispanic and percent that voted Democratic, suggesting that the Hispanic members of the population greatly favor the Democratic Party. When comparing the same Hispanic percentage to the voting turnout from 2008 there is a correlation of -.668, suggesting that while the Hispanic population favors the Democratic Party, they are less likely to vote.

Map 1
Graph 1



Map 2

Graph 2

Map 3

Graph 3

Map 4

Graph 4

Map 5

Graph 5

Graph 6

Conclusion: 
The LISA maps show that the voting patterns are indeed clustered, and lend some insight into where these patterns have changed in the last 20 years. When looking at maps 2 and 4, the viewer can see that Democratic voting was more clustered in 2008 than it was in the 80s. When looking at maps 3 and 5, the viewer can see that the voting turnout has become less clustered in the last 20 years. When comparing map 1 to maps 2 and 3, it is clear how the areas of high Hispanic population correspond with areas of high Democratic voting percentages and low voter turnout.


Monday, March 16, 2015

Significance Testing

Part I: Z and T tests



     Interval Type   Confidence Level   n     α     z or t   z or t value
A    Two Tailed      90                 45    .1    Z        1.64
B    Two Tailed      95                 12    .05   T        2.201
C    One Tailed      95                 36    .05   Z        1.64
D    Two Tailed      99                 180   .01   Z        2.58
E    One Tailed      80                 60    .2    Z        .845
F    One Tailed      99                 23    .01   T        2.508
G    Two Tailed      99                 15    .01   T        2.977
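The z critical values in a table like this one can be checked without a printed z-table. The sketch below uses Python's standard library `statistics.NormalDist`; the helper name `z_critical` is my own, and it only covers the z rows, since the standard library has no t distribution.

```python
from statistics import NormalDist


def z_critical(confidence, tails):
    """Critical z value for a confidence level (e.g. 0.95) and 1 or 2 tails."""
    alpha = 1 - confidence
    # For a two-tailed interval, alpha is split evenly between both tails.
    return NormalDist().inv_cdf(1 - alpha / tails)


# (confidence level, number of tails) for the z rows of the table.
for confidence, tails in [(0.90, 2), (0.95, 1), (0.99, 2), (0.80, 1)]:
    print(confidence, tails, round(z_critical(confidence, tails), 3))
```

Small differences from printed tables in the last digit are expected, since tables are rounded and sometimes require picking between two neighboring entries.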

1.     The Department of the Interior in Washington, D.C. estimates that the number of particular invasive species in a certain county (Bucks County) should number as follows (averages based on data from the whole state of Pennsylvania) per acre: Asian-Long Horned Beetle, 4; Emerald Ash Borer Beetle, 10; and Golden Nematode, 75.  A survey of 50 fields had the following results: (10 pts)

                                                           μ            σ
            Asian-Long Horned Beetle   3.2       0.73
            Emerald Ash Borer Beetle    11.7    1.3
            Golden Nematode                 77       5.71
           
a.     Test the hypothesis for each of these products.  Assume that each is two-tailed with a Confidence Level of 95%. *Use the appropriate test
    a.     Asian-Long Horned Beetle
        i.     Null hypothesis: There is no significant difference between the Asian-Long Horned Beetle population of Bucks County and the mean of Pennsylvania
        ii.     Alternative hypothesis: There is a significant difference between the Asian-Long Horned Beetle population of Bucks County and the mean of Pennsylvania
        iii.     Z-test statistic = -7.75
        iv.     Critical value = 1.96 or -1.96
        v.     -7.75 is less than -1.96, so the null hypothesis is REJECTED!
    b.     Emerald Ash Borer Beetle
        i.     Null hypothesis: There is no significant difference between the Emerald Ash Borer Beetle population of Bucks County and the mean of Pennsylvania
        ii.     Alternative hypothesis: There is a significant difference between the Emerald Ash Borer Beetle population of Bucks County and the mean of Pennsylvania
        iii.     Z-test statistic = 9.25
        iv.     Critical value = 1.96 or -1.96
        v.     9.25 is greater than 1.96, so the null hypothesis is REJECTED!
    c.     Golden Nematode
        i.     Null hypothesis: There is no significant difference between the Golden Nematode population of Bucks County and the mean of Pennsylvania
        ii.     Alternative hypothesis: There is a significant difference between the Golden Nematode population of Bucks County and the mean of Pennsylvania
        iii.     Z-test statistic = 2.48
        iv.     Critical value = 1.96 or -1.96
        v.     2.48 is greater than 1.96, so the null hypothesis is REJECTED!
b.     Be sure to present the null and alternative hypotheses for each as well as conclusions
c.     What can be ascertained pertaining to the findings about these invasive species in Bucks County?
    a.     The populations of all three invasive species in Bucks County differ significantly from the Pennsylvania averages. The Asian-Long Horned Beetle is less common than the state average, while the Emerald Ash Borer Beetle and the Golden Nematode are more common.
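The three z-test statistics can be reproduced with a few lines of Python. This is just a sketch of the arithmetic using the numbers from the survey table (sample mean, state average, standard deviation, and n = 50 fields); the function name `one_sample_z` is my own.

```python
from math import sqrt


def one_sample_z(sample_mean, hypothesized_mean, std_dev, n):
    """z statistic: (sample mean - hypothesized mean) / (std dev / sqrt(n))."""
    return (sample_mean - hypothesized_mean) / (std_dev / sqrt(n))


# species: (sample mean, state average, standard deviation), from the question.
species = {
    "Asian-Long Horned Beetle": (3.2, 4, 0.73),
    "Emerald Ash Borer Beetle": (11.7, 10, 1.3),
    "Golden Nematode": (77, 75, 5.71),
}

CRITICAL_Z = 1.96  # two-tailed, 95% confidence level

for name, (mean, state_mean, sd) in species.items():
    z = one_sample_z(mean, state_mean, sd, 50)
    decision = "reject the null" if abs(z) > CRITICAL_Z else "fail to reject the null"
    print(f"{name}: z = {z:.2f}, {decision}")
```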

2.     An exhaustive survey of all users of a wilderness park taken in 1960 revealed that the average number of persons per party was 2.1.  In a random sample of 25 parties in 1985, the average was 3.4 persons with a standard deviation of 1.32 (one tailed test, 95% Con. Level) (5 pts)

a.     Test the hypothesis that the number of people per party has changed in the intervening years.  (State null and alternative hypotheses)
a.     Null Hypothesis: The number of persons per party has not changed
b.     Alternative Hypothesis: The number of persons per party has increased
c.      T-value = 4.924
d.     Critical value = 1.711 (one tailed, 24 degrees of freedom)
e.     4.924 is greater than 1.711, so the null is rejected
b.     What is the corresponding probability value
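The t-value for the wilderness park question follows the same pattern as the z-tests, but uses the sample standard deviation and n - 1 degrees of freedom. A minimal sketch with the numbers from the question (the function name `one_sample_t` is my own, and the critical value is taken from a t-table rather than computed):

```python
from math import sqrt


def one_sample_t(sample_mean, hypothesized_mean, sample_sd, n):
    """t statistic and degrees of freedom for a one-sample t-test."""
    t = (sample_mean - hypothesized_mean) / (sample_sd / sqrt(n))
    return t, n - 1


t, df = one_sample_t(3.4, 2.1, 1.32, 25)
CRITICAL_T = 1.711  # one-tailed, 95% confidence level, 24 degrees of freedom

print(round(t, 3), df, t > CRITICAL_T)  # 4.924 24 True
```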

Part II: Chi-Squared Testing

Introduction:

The tourism board of Wisconsin wishes to determine what defines the concept of "Up-North" as it relates to the state. In order to analyze this concept, three different variables will be analyzed to see whether their distributions differ between the northern and southern regions of the state. The number of licenses sold for the deer gun season will be analyzed, as it is a stereotypical feature of the north. The percentage of the total population that purchased tags for gun deer season will also be analyzed, as well as forest acreage.

Methods:

First, county shapefile data was acquired from the U.S. Census website, and an attribute value of 1 or 2 was given to each county depending on whether it was north or south of Highway 29 (Map 1). Next, data from the Wisconsin Department of Natural Resources Statewide Comprehensive Outdoor Recreation Plan was joined to the shapefiles to determine potential variables. After the variables had been chosen, new fields were added to the attribute table, and small integer values between 1 and 4 were assigned to them based on where they fell in an equal interval classification. After all of the data sets had been classified, the data was exported to SPSS for chi-squared calculations.

Map 1
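The classification step described above can be sketched in a few lines of Python. This is only an illustration of how an equal interval classification assigns a code from 1 to 4; the actual classification was done in the GIS attribute table, and the function name and acreage values here are hypothetical.

```python
def equal_interval_class(value, minimum, maximum, classes=4):
    """Return a class code from 1 to `classes` using equal-width intervals."""
    if maximum == minimum:
        return 1  # every value is identical, so there is only one class
    width = (maximum - minimum) / classes
    code = int((value - minimum) / width) + 1
    # The maximum value lands exactly on the upper edge, so keep it in the top class.
    return min(code, classes)


# Hypothetical forest acreages for four counties (not the real DNR data).
acreages = [12000, 85000, 150000, 310000]
low, high = min(acreages), max(acreages)
codes = [equal_interval_class(a, low, high) for a in acreages]
print(codes)  # [1, 1, 2, 4]
```

Because the intervals have equal width, a few very large values can leave some classes nearly empty, which matters later because chi-squared tests work on the counts in each class.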


Results:

The first chi-squared test (Table 1) was to determine whether or not the distribution of forest acreage was balanced between the northern and southern parts of Wisconsin. For forest acreage, the null hypothesis was that there was no difference in the distribution of forest acreage between the northern and southern parts of the state. The alternate hypothesis was that the distribution of forest acreage was not even between the northern and southern parts of the state. After calculating the chi-squared test, the critical value is 7.8 at 95%, and the test statistic is 33.962. Because the test statistic is greater than the critical value, the null is rejected, as there is a significant difference between forest acreage in the north and the south.

Table 1

The second chi-squared test (Table 2) was to determine whether or not the distribution of deer gun licenses was balanced between the northern and southern parts of Wisconsin. For this test, the null hypothesis was that there was no difference between the expected distribution of gun licenses and the actual distribution. The alternate hypothesis was that the expected distribution of gun licenses was different from the actual distribution of gun licenses. After calculating the chi-squared test, the critical value is 7.8 at 95%, and the test statistic is 4.399. As the test statistic is less than the critical value, I am unable to reject the null.

Table 2


The third chi-squared test (Table 3) was to determine whether or not the percentage of the population that hunts deer is equally distributed between the northern and southern parts of the state. For this test, the null hypothesis was that the percentage of the population that hunts deer is randomly distributed across the state. The alternate hypothesis was that the percentage of the population that hunts deer is not randomly distributed across the state. The critical value is 7.8 at 95%, and the test statistic is 16.428. As the test statistic is greater than the critical value, the null is rejected.

Table 3
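To show what sits behind the three tests, here is a minimal sketch of a chi-squared calculation on a north/south by class table. The actual tests were run in SPSS; the counts below are hypothetical (they are not the values from Tables 1 to 3), and the critical value of 7.815 is the standard table value for 3 degrees of freedom at 95%.

```python
def chi_square(observed):
    """Pearson chi-squared statistic and degrees of freedom for a table of counts."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(observed):
        for j, count in enumerate(row):
            # Expected count if region and class were independent.
            expected = row_totals[i] * col_totals[j] / total
            statistic += (count - expected) ** 2 / expected
    df = (len(observed) - 1) * (len(observed[0]) - 1)
    return statistic, df


# Hypothetical county counts: rows are north and south, columns are classes 1 to 4.
observed = [
    [2, 6, 12, 10],   # north
    [20, 14, 6, 2],   # south
]

statistic, df = chi_square(observed)
CRITICAL_95_DF3 = 7.815

decision = "reject the null" if statistic > CRITICAL_95_DF3 else "fail to reject the null"
print(round(statistic, 3), df, decision)
```

With two regions and four classes there are (2 - 1) × (4 - 1) = 3 degrees of freedom, which is why the same critical value applies to all three tests.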


Conclusion:
When looking at Map 2, the distribution of forest is very visibly skewed to the north, and the results of chi-squared test one show this. These forests play a significant role in the cultural differences between the northern and southern parts of the state, as they are indicative of less agriculture and more of a reliance on other forms of income.
 
Map 2
When looking at Map 3, the deer gun tags seem randomly distributed across the state, and the results of chi-squared test two support this. I chose this result because it did not fit my preconceived notion of the distribution of deer tags throughout the state.
 
Map 3
When looking at Map 4, the percentage of the county population with deer gun tags in the north is visibly higher than that in the south. This distribution is very different from the expected distribution as seen in Table 3. I chose this result because it helps to show how much more of a cultural phenomenon deer hunting is in the northern parts of Wisconsin than in the southern parts of the state, as a far greater percentage of the population hunts in the north than in the south.
 
Map 4