Python - Machine Learning

Adding a population density attribute to the California House Prices dataset.

I wrote this code to add a population density attribute to the well-known California House Prices dataset. The original dataset measures total population per mesh block. However, the geographic size of each mesh block varies greatly, so the population measure does not accurately reflect population density.

I experimented with creating a measure of the total population within a certain radius of the centre of each mesh block. This new attribute proved fruitful, and became one of the strongest predictors of the average house price in each mesh block.
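One way to check how a new attribute ranks as a predictor is to sort every attribute by its correlation with the target. The sketch below is only an illustration: the `toy` DataFrame and all of its values are made up, and the real dataset has more columns than the three shown here.

```python
import pandas as pd

# Hypothetical toy data: every value below is invented for illustration only.
toy = pd.DataFrame({
    "population":         [1200, 1100, 1300, 1250, 1150, 1220],
    "population_density": [1200, 2400, 9800, 15200, 20100, 26000],
    "median_house_value": [90000, 120000, 210000, 300000, 380000, 450000],
})

# Absolute Pearson correlation of each attribute with the target, strongest first.
correlations = (
    toy.corr()["median_house_value"]
    .drop("median_house_value")
    .abs()
    .sort_values(ascending=False)
)
print(correlations)
```

Pearson correlation only captures linear relationships, so it is a rough first check rather than a full measure of predictive strength.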

I chose the Python Pandas data analytics library for this example because it is concise, readable and well established, and it works well with smaller datasets that can be processed quickly in memory.

Data source: https://www.kaggle.com/datasets/camnugent/california-housing-prices

California Housing Data

Population density is distorted because the distance between mesh blocks varies greatly across the state. Mesh blocks along the eastern border are very dispersed, whereas mesh blocks along the western seaboard are highly compacted. The default data does not take this into account, so population density appears higher in the east and lower in the west than it actually is.
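The distortion can be seen in a small toy example. In the sketch below, six made-up blocks all have the same population, but three sit close together and three are far apart. The column names follow the dataset, while the coordinates, the `RADIUS` value and the `population_in_radius` column are assumptions chosen purely for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: six blocks with identical populations.
# The first three are tightly packed, the last three are widely spaced.
blocks = pd.DataFrame({
    "longitude":  [0.00, 0.01, 0.02, 5.00, 6.00, 7.00],
    "latitude":   [0.00, 0.01, 0.00, 0.00, 0.00, 0.00],
    "population": [1000] * 6,
})

RADIUS = 0.05  # in degrees; an arbitrary value for this toy example

coords = blocks[["longitude", "latitude"]].to_numpy()
# Pairwise planar distances between all block centres (an n x n matrix).
diff = coords[:, None, :] - coords[None, :, :]
dist = np.sqrt((diff ** 2).sum(axis=2))

# For each block, add up the population of every block within RADIUS (itself included).
within = (dist < RADIUS).astype(int)
blocks["population_in_radius"] = within @ blocks["population"].to_numpy()
print(blocks)
```

The raw 'population' column is identical for all six blocks, whereas the radius-based total separates the packed blocks from the dispersed ones.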

Graphic created using Matplotlib:

cali_housing_data.plot(kind="scatter", x="longitude", y="latitude", grid=True, s=cali_housing_data["population"] / 100, label="population", c="median_house_value", cmap="jet", colorbar=True, legend=True, sharex=False, figsize=(10, 7))

The code used to correct geographic distribution errors in the 'population' attribute is shown below.

# I use the Shapely library to represent geographic points, and measure distances between points

from shapely.geometry import Point

# SOME CODE OMITTED

# cali_housing_data is a DataFrame which contains the latitude and longitude at the centre of each mesh block.
# It also contains a 'population' attribute which measures the number of people resident within that mesh block.

# Iterating over the rows with iterrows() and writing with .at lets you set a value on each row. This code adds
# a new attribute 'pt_mesh_centre': the geolocation at the centre of each mesh block.
# Shapely points are (x, y), so longitude comes first and latitude second.
for idx, row in cali_housing_data.iterrows():
    pt = Point(row['longitude'], row['latitude'])
    cali_housing_data.at[idx, 'pt_mesh_centre'] = pt

# No reindexing is needed after the loop: writing values with .at does not change the DataFrame's index.
# (reindex() returns a new object, so calling it without keeping the result has no effect.)

# Now sum the population of all mesh blocks within a distance of POPULATION_RADIUS from the centre of each
# mesh block, and store the result in the 'population_density' attribute for each row.
# The coordinates are latitude and longitude, so POPULATION_RADIUS is measured in degrees.
for idx, row in cali_housing_data.iterrows():
    cali_housing_data.at[idx, 'population_density'] = get_total_population(
        row['pt_mesh_centre'], cali_housing_data, POPULATION_RADIUS)

# SOME CODE OMITTED

# This function calculates the total population within a certain distance ('radius') of a particular geolocation
# ('pt_population_centre').
def get_total_population(pt_population_centre, cali_housing_data, radius):
    # This line creates a 'slice' of the data by calculating the distance between the geolocation of the centre of
    # each mesh block ('pt_mesh_centre') and 'pt_population_centre'. The query returns rows where the mesh block is
    # within the prescribed distance ('radius') of our population centre, filtering out rows for mesh blocks that
    # are 'radius' or more away.
    rows_within_radius = cali_housing_data.loc[
        pt_population_centre.distance(cali_housing_data['pt_mesh_centre']) < radius
    ]

    # Next we get the 'population' count attribute for each row.
    population_within_radius = rows_within_radius['population']

    # Next we sum these population counts to get the total population within our prescribed area.
    return population_within_radius.sum()
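The row-by-row loop calls get_total_population once per mesh block. A possible alternative, sketched below, computes the same kind of radius sum with NumPy broadcasting in chunks. This is not the code I used: the function name `population_within_radius_all` and the `chunk_size` parameter are hypothetical, and the sketch works directly on the 'longitude' and 'latitude' columns rather than on Shapely points.

```python
import numpy as np
import pandas as pd

def population_within_radius_all(df, radius, chunk_size=256):
    """For every row, sum 'population' over all rows whose centre lies
    strictly within `radius` (in degrees) of that row's centre."""
    lon = df["longitude"].to_numpy(dtype=float)
    lat = df["latitude"].to_numpy(dtype=float)
    population = df["population"].to_numpy(dtype=float)
    n = len(df)
    totals = np.empty(n, dtype=float)

    # Work in chunks so the distance matrix never exceeds chunk_size x n entries.
    for start in range(0, n, chunk_size):
        stop = start + chunk_size
        dx = lon[start:stop, None] - lon[None, :]
        dy = lat[start:stop, None] - lat[None, :]
        within = np.hypot(dx, dy) < radius
        # Each row of 'within' selects the blocks to add up for one centre.
        totals[start:stop] = within.astype(float) @ population

    return pd.Series(totals, index=df.index, name="population_density")

# Hypothetical usage on a tiny made-up DataFrame.
sample = pd.DataFrame({
    "longitude":  [0.00, 0.01, 5.00],
    "latitude":   [0.00, 0.00, 0.00],
    "population": [100, 200, 400],
})
sample["population_density"] = population_within_radius_all(sample, radius=0.05, chunk_size=2)
print(sample)
```

Chunking trades a little speed for bounded memory use, since a full pairwise distance matrix grows with the square of the number of rows.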

This code sample demonstrates the use of the Pandas library in Python to calculate a more meaningful measure of population distribution for Kaggle's California Housing Prices dataset.
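One limitation worth noting is that the code measures distance directly on latitude and longitude, so the radius is in degrees, and a degree of longitude covers less ground the further you are from the equator. If a radius in kilometres were preferred, a great-circle distance could be used instead. The sketch below is one possible approach and not part of the original code; `haversine_km` and `EARTH_RADIUS_KM` are hypothetical names.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, a common approximation

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between points given in decimal degrees.
    Accepts scalars or NumPy arrays (standard broadcasting rules apply)."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))

# One degree of latitude versus one degree of longitude at 40 degrees north.
one_degree_lat = haversine_km(40.0, -120.0, 41.0, -120.0)
one_degree_lon = haversine_km(40.0, -120.0, 40.0, -119.0)
print(one_degree_lat, one_degree_lon)
```

Because the function accepts arrays, it could replace the planar distance in a vectorised radius query, with the radius then expressed in kilometres.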
