It’s fun to explore your GeoData with GeoDa.
Universities like MIT, Chicago and Arizona use GeoDa because of its powerful spatial data analysis, geovisualization and geostatistics tools.
So that’s what we did too.
Here’s how to download GeoDa software from Spatial@UChicago. Now, let’s check out some of the key features in the new and improved GeoDa.
Getting Started with GeoDa
GeoDa has an intuitive interface that makes it easy for you to add multiple file formats like shapefile, GeoJSON, KML, SQLite and table format (CSV, XLS and DBF).
To see how your geographic data relates in space, GeoDa provides a variety of base maps from Carto and Nokia.
As shown below, these 4 tools are used to load data, save as a GeoDa project (GDA), close the application and open attribute data.
Similar to any GIS software, you can resize and move columns freely. You can join tables, query observations and export data in different formats.
There’s not much more to say except how straightforward we found GeoDa to use. The interface is modern, and you can get your hands dirty with your analysis quickly.
Geovisualization and Data Classification
Geovisualization is one of GeoDa’s specialties. Anyone can gain insights into their data through thematic maps, cartograms and map movies.
Really, you get more options than QGIS and ArcGIS in terms of data classification. The maps and rate drop-down gives you an abundance of ways to classify your data.
- Themeless Map – A simple one color map
- Quantile Map – Arranges classes so each holds the same number of observations.
- Percentile Map – Shades data by percentile ranges, which makes the extremes (such as the bottom 1% and the top 1%) stand out.
- Box Map – A quartile map where outliers are shaded differently.
- Standard Deviation Map – Each standard deviation becomes a class.
- Unique Values Map – Uniquely groups values into categories.
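- Natural Breaks Map – Arranges groupings so there is less variation within each class.
- Equal Intervals Map – Divides the range of values into classes of equal width.
- Rates-Calculated Map – Computes rates (for example, events per population at risk) before mapping them. Some of the options use spatial weights to smooth the rates.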
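To make the difference between two of these schemes concrete, here is a minimal Python sketch of equal-interval and quantile breaks. It is not GeoDa’s code: the function names and the sample values are made up for illustration, and GeoDa may compute its quantile cut points slightly differently.

```python
import statistics


def equal_interval_breaks(values, k):
    """Interior cut points for k classes that each span the same range of values."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    return [lo + width * i for i in range(1, k)]


def quantile_breaks(values, k):
    """Interior cut points that put roughly the same number of observations in each class."""
    # statistics.quantiles returns the k - 1 cut points between the k classes.
    return statistics.quantiles(values, n=k, method="inclusive")


# Made-up, skewed sample values (NOT the St. Louis data).
rates = [1.0, 2.0, 2.5, 3.0, 3.5, 4.0, 5.0, 6.0, 20.0, 36.0]

print("equal interval:", equal_interval_breaks(rates, 4))
print("quantile:      ", quantile_breaks(rates, 4))
```

With skewed data like this, the equal-interval breaks leave most observations in the lowest class, while the quantile breaks spread them evenly. That is why the same variable can look so different from one map type to the next.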
If you don’t want to use these data classification methods, GeoDa has a Category Editor tool for you to interactively edit custom breaks in the data. The neat thing about it is how it interactively generates a histogram as you change the dividing lines in your data.
The cartogram tool substitutes appropriately-sized circles to represent a variable. For example, here we see clusters of population in the United States.
This is also known as a Dorling cartogram. However, the downside of this type of cartogram is that the centroid and shape of each feature are not maintained. This means that readers may have difficulty recognizing features in the map. You may not have even known this represented United States population if I didn’t tell you!
Data Exploration Analysis
For this section, we’re going to hunt down some statistical relationships using the St Louis region county homicide counts and rates.
The three main variables we’ll examine are:
- HR8488 – homicide rate per 100,000
- PE87 – police expenditures per capita
- RDAC85 – resource deprivation/affluence composite variable (percent of families living below the poverty line, median family income)
READ MORE: University of Chicago Sample Data Sets (Great sample data)
When you look at this histogram of police expenditures, you can see the distribution of how money was spent is relatively equal across counties.
But when you look at the histogram for homicide rates, it’s positively skewed. This means the majority of the data has a low homicide rate, but there are some counties with extremely high homicide rates.
This box plot shows that the median number of homicides per 100,000 people is about 3.7. However, there are two counties that really jump out with enormous homicide rates. Those two counties are St. Louis City (36.0) and St. Clair (20.2).
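If you are wondering how a box plot decides which points "jump out", the usual rule flags anything more than 1.5 times the interquartile range beyond the quartiles. Here is a rough Python sketch of that rule. Only the two values 36.0 and 20.2 come from this article; the other values are invented for illustration, and GeoDa’s exact quartile method may differ from the one used here.

```python
import statistics


def box_plot_outliers(values, hinge=1.5):
    """Return the values that fall outside the box plot fences."""
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - hinge * iqr
    upper_fence = q3 + hinge * iqr
    return [v for v in values if v < lower_fence or v > upper_fence]


# Illustrative homicide rates per 100,000. Only 20.2 and 36.0 are from the
# article; every other value is made up for this example.
sample_rates = [1.2, 2.1, 2.8, 3.4, 3.7, 4.1, 4.9, 6.3, 20.2, 36.0]

print(box_plot_outliers(sample_rates))
```

Raising the `hinge` value makes the rule stricter, so fewer observations are flagged as outliers.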
Just where are these two observations? In a standard deviation map, we paint in red the two counties with far higher than normal homicide rates. As you can see, they sit a whopping 3 standard deviations or more above the mean homicide rate.
What’s the best way to see how variables relate to each other? For example, how does the resource deprivation/affluence composite variable relate to homicide rates?
Well, we can put one variable on the x-axis and the other on the y-axis of a graph and see how it all looks. This is called a scatter plot.
The linear regression line (straight red line) gives us an R-squared value of 0.276. The other red curved line is a LOWESS (LOcally WEighted Scatter-plot Smoother) that fits a smooth curve between these two variables.
So what does this actually mean?
It means that across these 78 observations, resource deprivation accounts for 27.6% of the variance in homicide rates. An R-squared of zero would mean the model explains none of the variability of the response data around its mean, and an R-squared of one would mean it explains all of it. So this shows a partial relationship between these two variables (resource deprivation and homicide rates).
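For the curious, here is how an R-squared value falls out of a simple linear regression. This is a minimal sketch in plain Python with four made-up points, not the St. Louis data, and the function name is our own.

```python
def r_squared(x, y):
    """R-squared of the least-squares line fitted to y as a function of x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Variation left over after fitting the line.
    ss_residual = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    # Total variation of y around its mean.
    ss_total = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - ss_residual / ss_total


# Made-up points for illustration only.
x = [1, 2, 3, 4]
y = [1, 3, 2, 4]
print(round(r_squared(x, y), 2))
```

For these points the result is 0.64, so the fitted line explains 64% of the variation in `y`. The 0.276 in the GeoDa scatter plot reads the same way.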
But it really doesn’t end here with GeoDa. If you want to see how a bunch of scatter plots relate to each other, pick all the variables your heart desires with the Scatter Plot Matrix.
3D Scatter Plot
You will have to really put on your thinking cap for GeoDa’s 3D scatter plot. I did at least. What this tool does is graph three separate variables in three-dimensional space like this.
The nice thing about it is how you can project your data points onto the XY, XZ or ZY plane. Rotate the 3D Scatter Plot to see how the data look from each side, and you’ll start to understand how the data points are suspended in 3D space.
For bubble charts, you select the X and Y-axis variables. Further to this, you choose a variable for bubble size and color. What this enables you to do is visualize four variables in a clever way.
Be careful with the size variable as this can really influence your graph. You can right-click the graph and resize bubble size from small to large. We keep it simple here, and use the homicide rates as size. As expected, the two large red bubbles are St. Louis City and St. Clair.
Parallel Coordinate Plot (PCP)
Meet my new favorite graph.
In a Parallel Coordinate Plot, each line corresponds to a county with homicide rates, police expenditures and resource deprivation plotted. Each of the dimensions corresponds to a horizontal axis and each data element is displayed as a series of connected points along the dimensions/axes.
The two red lines on the far right are the counties (St. Louis City and St. Clair) with the greatest homicide rates. The one red line snug against the right of the PCP represents St. Louis City. Not only does St. Louis City have the highest homicide rate, but it also has the highest police expenditures and the highest resource deprivation. This graph really puts these three variables into perspective.
All in all, I am completely blown away by the data exploration tools in GeoDa.
Let’s see how it does with more geostatistical-based tools.
Finding Patterns in Geographic Space
The main difference in this menu is that these analyses are performed in geographic space. While histograms, scatter plots and bubble charts simply analyze attribute values, the next few tools take into account how counties and their attributes relate to each other geographically.
And it all begins with setting contiguity in the weights manager. Counties count as neighbors when their borders are in direct contact, using either queen or rook contiguity. Queen contiguity counts any shared border point, including a single corner, while rook contiguity requires a shared edge. This influences the number of neighbors that connect to each county.
Here’s a histogram showing the queen connectivity and number of neighbors:
Here’s a histogram showing the rook connectivity and number of neighbors:
So similar, but different. GeoDa offers a map for you to interactively see how each county connects with its neighbors under rook and queen contiguity. Love this feature.
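The names come from chess: a rook moves across edges, a queen also moves across corners. Here is a small Python sketch of both rules on a regular grid. Real counties are irregular polygons, so this is only an illustration of the idea, and the function name is made up.

```python
def grid_neighbors(rows, cols, criterion="queen"):
    """Map each grid cell (row, col) to the list of cells it touches."""
    if criterion not in ("rook", "queen"):
        raise ValueError("criterion must be 'rook' or 'queen'")
    edge_steps = [(-1, 0), (1, 0), (0, -1), (0, 1)]       # shared edge
    corner_steps = [(-1, -1), (-1, 1), (1, -1), (1, 1)]   # shared corner only
    steps = edge_steps + (corner_steps if criterion == "queen" else [])
    neighbors = {}
    for r in range(rows):
        for c in range(cols):
            neighbors[(r, c)] = [
                (r + dr, c + dc)
                for dr, dc in steps
                if 0 <= r + dr < rows and 0 <= c + dc < cols
            ]
    return neighbors


rook = grid_neighbors(3, 3, "rook")
queen = grid_neighbors(3, 3, "queen")
print("center cell:", len(rook[(1, 1)]), "rook neighbors,", len(queen[(1, 1)]), "queen neighbors")
print("corner cell:", len(rook[(0, 0)]), "rook neighbors,", len(queen[(0, 0)]), "queen neighbors")
```

On a 3 by 3 grid the center cell has 4 rook neighbors but 8 queen neighbors. With real county boundaries the gap is usually much smaller, which is why the two connectivity histograms look so alike.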
Moran Scatter Plot
Because we’ve set how counties relate to each other, the Moran scatter plot will factor this in.
Positive spatial autocorrelation occurs when Moran’s I is close to +1. This means similar values are clustered together. Negative spatial autocorrelation occurs when Moran’s I is near -1. A checkerboard is an example where Moran’s I is -1 because dissimilar values are next to each other.
A value near 0 for Moran’s I typically indicates no autocorrelation. In this case, Moran’s I is 0.16, meaning that homicide rates are only weakly clustered.
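You can check the checkerboard claim yourself. Below is a minimal Python sketch of global Moran’s I on a small grid, assuming rook contiguity with simple 0/1 weights. The function names are our own, and this shows the textbook formula rather than GeoDa’s implementation.

```python
def rook_pairs(rows, cols):
    """Directed neighbor pairs (i, j) for grid cells that share an edge."""
    pairs = []
    for r in range(rows):
        for c in range(cols):
            i = r * cols + c
            if c + 1 < cols:  # neighbor to the right
                pairs += [(i, i + 1), (i + 1, i)]
            if r + 1 < rows:  # neighbor below
                pairs += [(i, i + cols), (i + cols, i)]
    return pairs


def morans_i(values, pairs):
    """Global Moran's I with a binary weight of 1 for every neighbor pair."""
    n = len(values)
    mean = sum(values) / n
    z = [v - mean for v in values]
    cross_products = sum(z[i] * z[j] for i, j in pairs)
    squared_deviations = sum(zi * zi for zi in z)
    return n * cross_products / (len(pairs) * squared_deviations)


rows, cols = 4, 4
pairs = rook_pairs(rows, cols)

# Alternating 0/1 values: every neighbor is different.
checkerboard = [(r + c) % 2 for r in range(rows) for c in range(cols)]
# Left half 1, right half 0: most neighbors are alike.
two_halves = [1 if c < cols // 2 else 0 for r in range(rows) for c in range(cols)]

print("checkerboard:", morans_i(checkerboard, pairs))
print("two halves:  ", round(morans_i(two_halves, pairs), 3))
```

The checkerboard comes out at -1, while the grid split into two halves comes out at about 0.67. The 0.16 for homicide rates sits much closer to zero than either.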
When you select a LISA Cluster map, it generates a choropleth map showing the locations with a significant Local Moran statistic. Bright red indicates high-high, which suggests clustering of similarly high values. Blue counties are low-low, which suggests clustering of low values.
The high-low and low-high locations indicate spatial outliers, while the remaining grey counties show no significant relationship.
Lastly, GeoDa produces four significance levels – p
GeoDa can produce univariate, differential and local Moran’s I with EB Rate as well.
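The high-high, low-low, high-low and low-high labels come from comparing each area’s value with the average of its neighbors. Here is a rough Python sketch of that labeling step only. It assumes every area has at least one neighbor, uses made-up values for six areas in a row, and leaves out the significance test that decides which areas actually get colored on the cluster map.

```python
def moran_quadrants(values, neighbors):
    """Label each area by its own value and its neighbors' average, relative to the mean."""
    n = len(values)
    mean = sum(values) / n
    z = [v - mean for v in values]
    labels = []
    for i in range(n):
        # Spatial lag: average deviation of the neighbors (assumes >= 1 neighbor).
        lag = sum(z[j] for j in neighbors[i]) / len(neighbors[i])
        own = "High" if z[i] > 0 else "Low"
        around = "High" if lag > 0 else "Low"
        labels.append(f"{own}-{around}")
    return labels


# Six hypothetical areas in a row; each one touches the areas next to it.
values = [9, 8, 2, 1, 2, 8]
neighbors = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}

print(moran_quadrants(values, neighbors))
```

The last area is labeled High-Low: a high value surrounded by low ones, which is exactly the kind of spatial outlier described above.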
Local G Cluster Map
The last tool is a variation on how to see clustering in the data. In the center of the St. Louis region, high homicide rates are concentrated. In the north-eastern portion, homicide rates are much lower.
Imagine how useful this is for the real estate industry and those wanting to move to St. Louis. In this case, the G*Clusters map generates the same results.
If you have homicide rates in a city, you can use spatial regression to understand the factors behind patterns of crime. Why are homicide rates concentrated in the center of St. Louis? Is it police spending? Can resource deprivation explain homicide locations?
Here is some terminology commonly used in regression models.
- Dependent variable (Y): What you are trying to predict. (Homicide rates)
- Independent variable (X): Explanatory variables that explain the dependent variable. (Income, education, etc)
- Beta-coefficient: Weights reflecting the relationship between the explanatory and dependent variable.
- Residual: The value not explained by the model
In our simple model, the homicide rate is the dependent variable. We try to explain high and low homicide rates with police expenditures and resource deprivation.
Our output table is as follows:
When you substitute each coefficient into our regression model, the model predicts higher homicide rates in areas with higher resource deprivation and higher police expenditures. The standard error of the estimate is a measure of the accuracy of predictions. In a regression line, the smaller the standard error of the estimate is, the more accurate the predictions are. The t statistic is the coefficient divided by its standard error.
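To show where a t statistic comes from, here is a minimal Python sketch for a regression with a single explanatory variable. Our GeoDa model has two explanatory variables, so this is a simplification, and the five data points are made up.

```python
import math


def slope_t_statistic(x, y):
    """Return (slope, standard error of the slope, t statistic) for a simple regression."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Sum of squared residuals (observed minus predicted).
    sse = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    residual_variance = sse / (n - 2)  # two estimated parameters
    std_error = math.sqrt(residual_variance / sxx)
    return slope, std_error, slope / std_error


# Made-up data for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
slope, std_error, t_stat = slope_t_statistic(x, y)
print(round(slope, 3), round(std_error, 3), round(t_stat, 3))
```

A t statistic far from zero means the coefficient is large relative to its own uncertainty, which is what you look for in the GeoDa output table.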
Another statistic to keep in mind is the Jarque-Bera statistic, which indicates whether or not the residuals (the observed dependent variable values minus the predicted values) are normally distributed. The null hypothesis is that the residuals are normally distributed, so a histogram of them should resemble a bell curve.
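The Jarque-Bera statistic is built from the skewness and kurtosis of the residuals. Here is a small Python sketch of the standard formula, with a p-value from the chi-square distribution with 2 degrees of freedom that the statistic follows in large samples. The residuals are made up, and the approximation is rough for a sample this small.

```python
import math


def jarque_bera(residuals):
    """Return (JB statistic, approximate p-value) for a list of residuals."""
    n = len(residuals)
    mean = sum(residuals) / n
    m2 = sum((e - mean) ** 2 for e in residuals) / n
    m3 = sum((e - mean) ** 3 for e in residuals) / n
    m4 = sum((e - mean) ** 4 for e in residuals) / n
    skewness = m3 / m2 ** 1.5
    kurtosis = m4 / m2 ** 2  # equals 3 for a normal distribution
    jb = n / 6 * (skewness ** 2 + (kurtosis - 3) ** 2 / 4)
    # Survival function of a chi-square distribution with 2 degrees of freedom.
    p_value = math.exp(-jb / 2)
    return jb, p_value


# Made-up, symmetric residuals for illustration only.
jb, p_value = jarque_bera([-2, -1, 0, 1, 2])
print(round(jb, 3), round(p_value, 3))
```

A small p-value would lead you to reject the idea that the residuals are normally distributed. With the skewed homicide rates we saw earlier, that is worth checking.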
Further to this, the output table also lets you check for multicollinearity, which occurs when two or more predictor variables in a multiple regression model are highly correlated. The calculated Moran’s I tests whether the regression residuals are spatially random or spatially autocorrelated.
Other options in GeoDa are maximum likelihood estimation for spatial lag and spatial error models.
READ MORE: Spatial Autocorrelation and Moran’s I in GIS
GeoDa Final Thoughts
You’ll have a lot of aha moments walking through GeoDa. Not only does it serve as a gentle introduction to spatial analysis and statistics for non-GIS users, but it’s useful for those users who are trying to learn statistics.
Luc Anselin started GeoDa as an ArcView 3.0 extension. Due to its popularity, it’s been reworked into its own open source data exploration tool.
While not necessarily your prototypical full-blown GIS package, GeoDa possesses a range of exciting analytical and geovisualization tools for industries such as economics, health, real estate and more.
Have you tried putting your geostatistics to the test with GeoDa? Let us know what you think of it in our comments section below.