Data Science Case Study: In which neighborhood can I live in Gaziantep?

Servet Demir Ph.D
6 min read · Mar 20, 2021

1. Introduction

This study is a sample application for cluster analysis based on geographical locations and characteristics of neighborhoods.

1.1. Description & Discussion of the Background

Many factors determine which neighborhood you want to live in. For example, crowds, parks, restaurants, and cafes all affect our choice. Some of us prefer a less crowded area or one close to hospitals. Gaziantep is famous for its food culture and is a member of the UNESCO Creative Cities Network (https://en.unesco.org/creative-cities/gaziantep). A promotional video is available at https://www.youtube.com/watch?v=V5gvWa7PWIM. This project groups the central neighborhoods of Gaziantep by considering only their social facilities and population. The basic features of each group are then determined, which can help people choose the neighborhood they want to live in.

2. Data acquisition and cleaning

2.1 Data

In this project, we need five types of data:

  1. Neighborhood list
  2. The population in each neighborhood
  3. The venues in each neighborhood
  4. The rent rate (TL/m2) in each neighborhood
  5. The sales rate (TL/m2) in each neighborhood

From a web search, I got four sources:

  1. https://www.nufusu.com/ilce/sehitkamil_gaziantep-nufusu (Neighborhood list and population)
  2. https://www.nufusu.com/ilce/sahinbey_gaziantep-nufusu (Neighborhood list and population)
  3. Foursquare API (venues in each neighborhood)
  4. https://www.endeksa.com/tr/ (rent rate & sales rate)

2.2 Getting Data

# Importing related modules & libraries
import pandas as pd
import numpy as np # library to handle data in a vectorized manner
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
import json # library to handle JSON files
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values
import requests # library to handle requests
# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors
import matplotlib.pyplot as plt
# import k-means from clustering stage
from sklearn.cluster import KMeans
from sklearn import preprocessing
import folium # map rendering library
print('Related modules & libraries imported.')
# Downloading neighborhoods and population from the web
# I select the downtown neighborhoods
url1="https://www.nufusu.com/ilce/sehitkamil_gaziantep-nufusu"
skamil= pd.read_html(url1, header=0)
skamil=skamil[1]
skamil=skamil.iloc[0:20,:]
skamil.columns= ["year", "borough", "Neighborhood", "populations"]
url2="https://www.nufusu.com/ilce/sahinbey_gaziantep-nufusu"
sbey= pd.read_html(url2, header=0)
sbey=sbey[1]
sbey=sbey.iloc[0:20,:]
sbey.columns= ["year", "borough", "Neighborhood", "populations"]
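# combine the lists and correct some neighborhood names
neighborhoods= pd.concat([skamil, sbey], ignore_index=True)
# remove the trailing "Mah." suffix; rstrip("Mah.") strips a set of characters, not the suffix, and leaves a trailing space
neighborhoods["Neighborhood"]=neighborhoods["Neighborhood"].str.replace(r"\s*Mah\.\s*$", "", regex=True)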
neighborhoods["Neighborhood"]=neighborhoods["Neighborhood"].replace({"Eyüpsultan": "Eyüp Sultan",
"Fıstıklık": "Yukarıbeylerbeyi",
"Zeytinli": "Zeytinli, Şehitkamil",
"Beydilli": "Beydilli, Şahinbey",
"Konak": "Konak, Şahinbey",
"Karataş" : "Karataş, Şahinbey"}, regex=True)
neighborhoods.head()
# Getting rent rate and Sales rate
data_rest_sales=pd.read_excel("data_rent_sales.xlsx")
data_rest_sales.head()
# Adding to the neighborhoods data frame (rows are assumed to be in the same order in both tables)
neighborhoods["rent_rate"]=data_rest_sales["rent_rate"]
neighborhoods["sales_rate"]=data_rest_sales["sales_rate"]
neighborhoods

Add Latitude and Longitude

latitudes_list=[]
longitudes_list=[]
geolocator = Nominatim(user_agent="gaziantep_explorer") # create the geocoder once, outside the loop
for i in list(neighborhoods["Neighborhood"]):
    address_geo = str(i+ ", Gaziantep")
    location = geolocator.geocode(address_geo)
    latitudes_list.append(location.latitude)
    longitudes_list.append(location.longitude)

neighborhoods["latitude"]=latitudes_list
neighborhoods["longitude"]=longitudes_list
neighborhoods
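
The loop above assumes that every address is found. geopy's geocode returns None when an address cannot be resolved, and the public Nominatim service asks clients to stay at no more than one request per second. Below is a minimal sketch of a more defensive version. The helper name geocode_neighborhoods and its geocode parameter are hypothetical, introduced here so that a rate-limited geocoder (or a stub) can be passed in.

```python
def geocode_neighborhoods(names, geocode, city="Gaziantep"):
    """Return (latitudes, longitudes); None where an address is not found.

    `geocode` is any callable that maps an address string to an object
    with .latitude and .longitude, or to None (as geopy's geocode does).
    """
    latitudes, longitudes = [], []
    for name in names:
        location = geocode("{}, {}".format(name, city))
        if location is None:
            # keep the row, but mark the coordinates as missing
            latitudes.append(None)
            longitudes.append(None)
        else:
            latitudes.append(location.latitude)
            longitudes.append(location.longitude)
    return latitudes, longitudes

# Possible usage with geopy (needs network access):
# from geopy.geocoders import Nominatim
# from geopy.extra.rate_limiter import RateLimiter
# geolocator = Nominatim(user_agent="gaziantep_explorer")
# geocode = RateLimiter(geolocator.geocode, min_delay_seconds=1)
# lats, lngs = geocode_neighborhoods(neighborhoods["Neighborhood"], geocode)
```

Rows with missing coordinates can then be inspected and corrected by hand, much like the name corrections made earlier.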

Getting Venues from the Foursquare API

all_venues = getNearbyVenues(names=neighborhoods["Neighborhood"],latitudes=neighborhoods["latitude"], longitudes=neighborhoods["longitude"])
all_venues.head()
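
The getNearbyVenues function is called above but its definition is not shown. The following is a sketch of what such a helper might look like, not the exact code used in this study. It assumes the legacy Foursquare v2 "explore" endpoint and its response layout (response → groups → items → venue), which may no longer be available to new accounts. CLIENT_ID, CLIENT_SECRET, VERSION, the fetch_explore helper, and the output column names other than "Neighborhood" and "Venue Category" are assumptions.

```python
import pandas as pd
import requests

# Hypothetical placeholders: replace with your own Foursquare credentials.
CLIENT_ID = "YOUR_CLIENT_ID"
CLIENT_SECRET = "YOUR_CLIENT_SECRET"
VERSION = "20210320"  # API version date (assumed)

def fetch_explore(lat, lng, radius=500, limit=100):
    """Query the legacy v2 explore endpoint and return its list of items."""
    params = {
        "client_id": CLIENT_ID,
        "client_secret": CLIENT_SECRET,
        "v": VERSION,
        "ll": "{},{}".format(lat, lng),
        "radius": radius,
        "limit": limit,
    }
    resp = requests.get("https://api.foursquare.com/v2/venues/explore",
                        params=params, timeout=10)
    resp.raise_for_status()
    return resp.json()["response"]["groups"][0]["items"]

def getNearbyVenues(names, latitudes, longitudes, radius=500, fetch=fetch_explore):
    """Collect the venues around each neighborhood into one data frame."""
    rows = []
    for name, lat, lng in zip(names, latitudes, longitudes):
        for item in fetch(lat, lng, radius):
            venue = item["venue"]
            categories = venue.get("categories", [])
            rows.append((
                name, lat, lng,
                venue["name"],
                venue["location"]["lat"],
                venue["location"]["lng"],
                categories[0]["name"] if categories else None,
            ))
    return pd.DataFrame(rows, columns=[
        "Neighborhood", "Neighborhood Latitude", "Neighborhood Longitude",
        "Venue", "Venue Latitude", "Venue Longitude", "Venue Category",
    ])
```

Passing the request function in as the fetch argument keeps the data frame logic separate from the network call, so it can be tried out with a stub before any API quota is spent.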

3. Methodology

Gaziantep has 9 boroughs, but only two of them (Şahinbey & Şehitkamil) are downtown, so I study only these two boroughs in this project. They also contain villages, which I removed from the neighborhood list. I then obtained the population, rent rate, and sales rate for each neighborhood, followed by the latitude and longitude of each neighborhood in order to get its venues. Finally, I run a cluster analysis on the neighborhoods. As model features, I use venue categories, population, rent rate, and sales rate.

Exploring Data

# Map of the neighborhoods in Gaziantep
latitude=37.0686307
longitude=37.3674178
map_gaziantep = folium.Map(location=[latitude, longitude], zoom_start=12)
# add markers to map
for lat, lng, borough, neighborhood in zip(neighborhoods['latitude'],neighborhoods['longitude'], neighborhoods['borough'], neighborhoods['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_gaziantep)

map_gaziantep
# Population in neighborhoods
neighborhoods.sort_values(by='populations',ascending=True).plot.barh(x="Neighborhood", y= "populations",
    figsize=(10, 15))
plt.xlabel("Population (x1000)", fontsize=14)
plt.title("The Population of Neighborhoods in Gaziantep", fontsize=16, c="red");
# Compare Sales Rate and Rent Rate
plt.scatter(x=neighborhoods["sales_rate"], y= neighborhoods["rent_rate"],
alpha = 0.5 ,s=neighborhoods["populations"]*20)
plt.title ("Comparing Rent Rate vs Sales Rate", fontsize=16, c="red")
plt.xlabel("Sales Rate (TL/m2)", fontsize=14)
plt.ylabel("Rent Rate (TL/m2)", fontsize=14);
# Top 20 venue categories in Gaziantep
top_20_venues=all_venues["Venue Category"].value_counts().head(20)
# plot the Series directly; the column names produced by reset_index() differ between pandas versions
top_20_venues.plot.bar(figsize=(15, 8))
plt.xlabel("Venue ", fontsize=14)
plt.title("Distribution of Venues in Gaziantep", fontsize=16, c="red");

4. Analyzing Data

# one hot encoding
gaziantep_onehot = pd.get_dummies(all_venues[['Venue Category']], prefix="", prefix_sep="")
# add neighborhood column back to dataframe
gaziantep_onehot['Neighborhood'] = all_venues['Neighborhood']
# move neighborhood column to the first column
fixed_columns = [gaziantep_onehot.columns[-1] ]+ gaziantep_onehot.columns[:-1].tolist()
gaziantep_onehot = gaziantep_onehot[fixed_columns]
gaziantep_onehot.head()
# group rows by neighborhood, taking the mean of the frequency of occurrence of each category
gaziantep_grouped = gaziantep_onehot.groupby('Neighborhood').mean().reset_index()
# add population, rent rate, and sales rate, scaled first so that the features have a similar effect size
# groupby sorts the rows by name, so look the values up by neighborhood instead of assigning them by position
extra = neighborhoods.drop_duplicates("Neighborhood").set_index("Neighborhood").loc[gaziantep_grouped["Neighborhood"]]
gaziantep_grouped["populations"]=preprocessing.MinMaxScaler(feature_range=(0,0.1)).fit_transform(extra[["populations"]].to_numpy()).ravel()
gaziantep_grouped["rent_rate"]=preprocessing.MinMaxScaler(feature_range=(0,0.1)).fit_transform(extra[["rent_rate"]].to_numpy()).ravel()
gaziantep_grouped["sales_rate"]=preprocessing.MinMaxScaler(feature_range=(0,0.05)).fit_transform(extra[["sales_rate"]].to_numpy()).ravel()
gaziantep_grouped.head()
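
The feature_range argument above limits how much weight each extra feature can carry next to the venue frequencies, which lie between 0 and 1. The sketch below illustrates the arithmetic with made-up population values (they are not taken from the data set). The min_max_scale helper is hypothetical and mirrors the formula documented for MinMaxScaler with a lower bound of 0: each value x becomes (x - min) / (max - min) multiplied by the upper bound.

```python
import numpy as np

def min_max_scale(values, upper):
    """Rescale values linearly to the interval [0, upper]."""
    values = np.asarray(values, dtype=float)
    span = values.max() - values.min()
    return (values - values.min()) / span * upper

# Illustrative populations only, not real figures for Gaziantep.
populations = [5000, 20000, 35000]
scaled = min_max_scale(populations, upper=0.1)
print(scaled)
```

With an upper bound of 0.1, the most crowded neighborhood contributes at most 0.1 to a distance along that axis, while a single venue category can contribute up to 1.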

Cluster Neighborhoods

# Prepare data for k-means
gaziantep_grouped_clustering =gaziantep_grouped.drop(columns='Neighborhood')
# Optimize the k value
ssd=[]
K= range(1,15)
for k in K:
    kmeans = KMeans(n_clusters=k, random_state=0).fit(gaziantep_grouped_clustering)
    ssd.append(kmeans.inertia_)
plt.plot(K,ssd, "bx-")
plt.title( "Compare Score for Different K Values")
plt.xlabel("Different K values")
plt.ylabel("Sum of squared distances (inertia)")
plt.show()
# set number of clusters
kclusters = 5
# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(gaziantep_grouped_clustering)
# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10]

5. Results

num_top_venues = 10
indicators = ['st', 'nd', 'rd']
# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))
# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = gaziantep_grouped['Neighborhood']
for ind in np.arange(gaziantep_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(gaziantep_grouped.iloc[ind, :], num_top_venues)
neighborhoods_venues_sorted.head()
# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)
gaziantep_merged = neighborhoods
# join neighborhoods with neighborhoods_venues_sorted to add the cluster label and top venues for each neighborhood
gaziantep_merged = gaziantep_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')
gaziantep_merged.head() # check the last columns!
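
The return_most_common_venues helper used in the loop above is not defined in the text and has to exist before that loop runs. A minimal sketch of one possible definition follows. The exclude parameter is an assumption added here so that the neighborhood name and the scaled population, rent, and sales columns are not ranked as if they were venue categories.

```python
import pandas as pd

def return_most_common_venues(row, num_top_venues,
                              exclude=("Neighborhood", "populations",
                                       "rent_rate", "sales_rate")):
    """Return the names of the most frequent venue categories in one row."""
    # keep only the venue-frequency columns
    venue_freqs = row.drop(labels=list(exclude), errors="ignore")
    # rows taken with .iloc have object dtype, so convert before sorting
    venue_freqs = pd.to_numeric(venue_freqs)
    top = venue_freqs.sort_values(ascending=False)
    return top.index.values[:num_top_venues]
```

The function returns fewer than num_top_venues names if a row has fewer venue columns than requested, so the assignment in the loop assumes at least ten venue categories exist.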

Cluster Features

# create map
latitude=37.0686307
longitude=37.3674178
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=12)
# set color scheme for the clusters (one color per cluster)
colors_array = cm.rainbow(np.linspace(0, 1, kclusters))
rainbow = [colors.rgb2hex(i) for i in colors_array]
# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(gaziantep_merged['latitude'], gaziantep_merged['longitude'], gaziantep_merged['Neighborhood'], gaziantep_merged['Cluster Labels'].astype(int)):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)

map_clusters
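
To see which features characterize each group, one option is to average the numeric features per cluster label. The sketch below uses a small made-up table. The column names follow the ones used above, the values are illustrative only, and the cluster_profile helper is hypothetical.

```python
import pandas as pd

def cluster_profile(df, label_col="Cluster Labels",
                    features=("populations", "rent_rate", "sales_rate")):
    """Mean of each feature per cluster, plus the cluster size."""
    profile = df.groupby(label_col)[list(features)].mean()
    profile["n_neighborhoods"] = df.groupby(label_col).size()
    return profile

# Made-up values for illustration, not results from this study.
toy = pd.DataFrame({
    "Cluster Labels": [0, 0, 1],
    "populations": [10.0, 20.0, 40.0],
    "rent_rate": [8.0, 10.0, 15.0],
    "sales_rate": [2000.0, 3000.0, 5000.0],
})
print(cluster_profile(toy))
```

Applied to gaziantep_merged, a table like this would sit alongside the most common venues when describing each cluster.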

6. Discussion

The clustering results overlap with my own observations of the city. They can help in deciding where to live or work: you can choose among neighborhood alternatives according to your priorities. For example, if you want a lifestyle based on social activities, you can choose a neighborhood from Cluster-5. If you are interested in the vehicle trading or repair business, you can choose Cluster-3.

7. Conclusion

In conclusion, a neighborhood’s social facilities and how crowded it is affect our choice of neighborhood. This short report can be used as a guide in neighborhood selection.

Servet Demir Ph.D

I graduated from Bosphorus University Physics Teaching Program in 1995. In 2001, I earned my master’s degree and in 2006, Ph.D. in Educational Science.