
Journal of Research in Computer Science and Engineering

Volume 2 Issue 3

Analysis of Crime using Machine Learning Approach

Pranjali v. Gurnule1, Pratik Kubal2, Akshay Bhosale3


Department of Computer Science
Mumbai University
Corresponding Authors’ Email Id: pranjali.gurnule18@gmail.com1,pratik.p.kubal@gmail.com2,
akshaysb555@gamil.com3

Abstract
In this paper we use sampled data to predict the incidence of homicidal
crime in India. To achieve this, we use Crime in India data along with
Poverty and Unemployment data. After preliminary analysis, we process the
data and fit various models, both linear and polynomial. We then compare
these models and choose the one that best explains the data. Consequently,
we highlight that poverty in rural and urban areas, and unemployment in
urban areas, are needed to predict homicidal incidences in India.

Keywords: Crime, Feature Scaling, Gradient Descent, Linear Regression,
Machine Learning, Multivariate Learning, Polynomial Regression

I. INTRODUCTION
In documented studies of what influences crime in India, poverty,
unemployment, and inequality are the three criteria that are always taken
into account. These studies have been carried out at various levels, such as
cities, states, and a particular radius [1]. However, with machine learning
this link can be not only documented but also predicted. Furthermore, a more
detailed analysis can be reached on what explains the crime in a particular
sample. We hope that the analysis in this paper will inform Government
policies against homicidal crime in future years.

1 Page 1-11 © MANTECH PUBLICATIONS 2017. All Rights Reserved



II. FRAMEWORK
A. Data-set
Data used is Crime in India 2010, Poverty rural and Urban 2010, Unemployment rural and urban
rates in
TABLE I
Correlation between poverty and unemployment

Factor                              Rural      Urban      Both
Poverty                             0.9029     0.9105     0.9303
Unemployment (pre-processing)      -0.3810    -0.1959    -0.3554
Unemployment (after processing)     0.7277     0.7988     0.8212

Fig. 1 Correlation between poverty and unemployment.


B. Processing
Percent rates lead to poor correlation in both the Poverty and Unemployment
data, since the populations of the states vary. However, the main problem
lies in missing data: the Unemployment data offer no quantitative comparison
between samples, only the unemployment rate between states. Therefore, we
correct it using the population from the Poverty dataset. Table I shows the
change in correlation.

C. Outlier Data
There are outliers in the data, and while modeling, these outliers will
influence the learning. This is seen in Fig 1, which shows a linear rise at
the start that gradually turns into a curve. A linear regression plane
(Fig 2) confirms our hypothesis; in some parts the regression plane
overestimates the data, while in other parts it underestimates it. Therefore
there is a need for a polynomial multivariate model.

D. Different Range Of Features
In Fig 1, the mean of the Unemployment variables (rural, urban, and combined
areas) is shown to be significantly less than that of the Poverty variables
(rural, urban, and combined areas).

Fig. 2 Three dimensional linear plane
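
The correction described under Processing can be sketched as follows. This is a minimal illustration, assuming the unemployment rate is a percentage of each state's population and that the population figures come from the Poverty dataset. The arrays are made-up placeholder values, not the 2010 data, and the helper names `to_counts` and `pearson` are hypothetical.

```python
import numpy as np

def to_counts(rate_percent, population):
    """Convert a percentage rate into an absolute head count."""
    rate = np.asarray(rate_percent, dtype=float)
    return rate / 100.0 * np.asarray(population, dtype=float)

def pearson(a, b):
    """Pearson correlation coefficient between two 1-D arrays."""
    return float(np.corrcoef(a, b)[0, 1])

# Placeholder values for five hypothetical states (NOT the paper's data).
population = np.array([200.0, 110.0, 60.0, 35.0, 10.0])    # millions
unemp_rate = np.array([2.0, 3.5, 4.0, 5.0, 9.0])           # percent
crime = np.array([4800.0, 2700.0, 1500.0, 900.0, 300.0])   # incidences

unemp_count = to_counts(unemp_rate, population)

print("rate  vs crime:", round(pearson(unemp_rate, crime), 4))
print("count vs crime:", round(pearson(unemp_count, crime), 4))
```

With these placeholder numbers the rate correlates negatively with crime while the head count correlates positively, which mirrors the direction of the change reported in Table I.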


Using a polynomial approach and adding a square root feature has two
advantages. Firstly, this particular choice gives us a polynomial curve.
Secondly, when comparing intra-feature group-wise, the square feature gives
emphasis to the independent variable, while the square root feature decreases
the importance of that particular variable.

E. Experimentation
While modeling a 3D plot we are limited to two independent variables. The
three features used in modeling are derived from two independent variables,
hence the three features can be mapped to a 3D domain. Moreover, the problem
lies in the various variables which may or may not affect crime: Poverty in
rural, urban, and combined areas; Unemployment in rural, urban, and combined
areas. Even after discarding the models that give the square root feature to
Unemployment, we are left with multiple inter-variable and intra-variable
group models, which we iterate over in the experimental stage to find a model
that describes the crime incidences with the greatest accuracy and least
error.

Fig. 3 Low data points
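
To make the emphasis idea concrete, the sketch below builds three features from two independent variables, giving the square root term to one variable and the linear and squared terms to the other. It is illustrative only: `build_features` is a hypothetical helper, the inputs are placeholder numbers, and the choice of which variable receives which term is an assumption based on the description of emphasis above.

```python
import numpy as np

def build_features(x1, x2):
    """Return columns [sqrt(x1), x2, x2**2]: x1 is de-emphasized, x2 emphasized."""
    x1 = np.asarray(x1, dtype=float)
    x2 = np.asarray(x2, dtype=float)
    return np.column_stack([np.sqrt(x1), x2, x2 ** 2])

# Placeholder values (NOT the paper's data), e.g. millions of people.
rural = np.array([4.0, 9.0, 16.0])
urban = np.array([1.0, 2.0, 3.0])

urban_emphasis = build_features(rural, urban)   # sqrt(rural), urban, urban^2
rural_emphasis = build_features(urban, rural)   # sqrt(urban), rural, rural^2

print(urban_emphasis)
print(rural_emphasis)
```

Swapping the argument order is all that distinguishes a "rural emphasis" model from an "urban emphasis" model in this sketch; both still depend on only two independent variables.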


III. METHODOLOGY
A. Gradient Descent
We have given a customized polynomial function to the Gradient Descent
algorithm [2], which calculates the slopes ($\theta$) along the two
independent feature axes. Therefore, despite the equation having three
features, the independent variables are still two. The sum is divided by m so
as to remove any dependence of the output on the number of samples.

Firstly, the Gradient Descent algorithm is as follows (equation 2):

repeat until convergence
$$\left\{\; \theta_j = \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)} \;\right\}$$

where,
j = columns,
i = rows,
$y^{(i)}$ = observed value for sample i,
$h_\theta(x^{(i)})$ = predicted value,
Note: $x_0^{(i)} = 1$

Fig: 4 The Gradient Descent Algorithm

Secondly, from Ordinary Least Squares [3] we have derived the Cost of Fit
(Cost) function based on squared errors, given below (equation 3):

$$J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$$

where,
$h_\theta(x) = \theta^T x$
m = number of training samples
$y^{(i)}$ = observed value for sample i
Also,
$$J(\theta) = J(\theta_1 + \theta_2 + \dots + \theta_n)$$

Fig: 5 Equation for Cost Fit Function

A squared error function is used as a heuristic for the multivariate linear
and polynomial regression [4]. Moreover, feature scaling [4] can be done by
(equation 4):

$$x_i = \frac{x_i - \mu_i}{s_i}$$

where,
$\mu_i$ = mean
$s_i$ = standard deviation

B. Multivariate Polynomial Regression
The general multivariate function used is (equation 5), which is derived from
polynomial research [5]:

$$h_\theta(x) = \theta^T x = \theta_0 x_0^0 + \theta_1 x_1^1 + \theta_2 x_2^2 + \dots + \theta_n x_n^n$$

where,
x = independent variables
$h_\theta(x)$ = predicted value
$\theta$ = slope with the feature
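
The update rule, the squared-error cost, and the feature scaling step above can be combined in a short sketch. This is not the authors' implementation: it assumes NumPy, a fixed number of iterations in place of a convergence test, a learning rate of 0.1, the population standard deviation for scaling, and synthetic placeholder data. The function names `scale`, `cost`, and `gradient_descent` are hypothetical.

```python
import numpy as np

def scale(X):
    """Standardize each column: (x - mean) / standard deviation."""
    mu = X.mean(axis=0)
    s = X.std(axis=0)
    return (X - mu) / s, mu, s

def cost(theta, X, y):
    """Squared-error cost J(theta) = 1/(2m) * sum((X @ theta - y)**2)."""
    m = len(y)
    residual = X @ theta - y
    return float(residual @ residual) / (2 * m)

def gradient_descent(X, y, alpha=0.1, iterations=2000):
    """Batch gradient descent; X must already contain the x_0 = 1 column."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(iterations):
        gradient = X.T @ (X @ theta - y) / m
        theta = theta - alpha * gradient   # simultaneous update of every theta_j
    return theta

# Synthetic placeholder data from y = 3 + 2*a - b (NOT the paper's data).
rng = np.random.default_rng(0)
raw = rng.uniform(0, 100, size=(50, 2))
y = 3 + 2 * raw[:, 0] - raw[:, 1]

scaled, mu, s = scale(raw)
X = np.column_stack([np.ones(len(y)), scaled])   # prepend x_0 = 1
theta = gradient_descent(X, y)

print("theta:", theta)
print("cost:", cost(theta, X, y))
```

Scaling first keeps all features on a comparable range, so a single learning rate works for every column; the stored `mu` and `s` are needed again when a new sample is predicted.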

After customization (equation 6):

$$h_\theta(x) = \theta_0 \sqrt{x_1} + \theta_1 x_2 + \theta_2 x_2^2$$

where,
$x_1$ = first feature
$x_2$ = second feature

Table II

Factor                      Mean (in Millions)
Poverty In Rural Areas      99.30357
Poverty In Urban Areas      26.425
Urban and Rural Poverty     125.7321
Rural Unemployment          5.44656
Urban Unemployment          4.41313

Fig. 6 High end points

C. Multivariate Linear Regression
The equation (Gradient Descent) is a multivariate algorithm. Therefore, to
get a linear model, instead of giving it a polynomial function we give it a
linear function [6]. We know (equation 7):

$$h_\theta(x) = \theta^T x = \theta_0 x_0 + \theta_1 x_1 + \theta_2 x_2 + \dots + \theta_n x_n$$

D. Prediction Function
We predict a new sample according to either the linear or the polynomial
model by (equation 8):

Polynomial Regression:
$$h_\theta(x) = \theta^T x$$

where,
$h_\theta(x)$ is the predicted value
$\theta^T$ is the matrix of $\theta$
x is the new sample to be predicted.

Note that if the model is trained using feature-scaled variables, we also
scale the new sample.

IV. EXPERIMENT
In this stage we test various cases against the multivariate linear and
polynomial models and observe their adjusted R-squared and predicted
R-squared values. In the first part, we test various cases of Poverty against
Unemployment (inter-feature-wise), and in the second part we test
intra-feature-wise.
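
The prediction step, including the remark about scaling a new sample, can be sketched as below. The example assumes the training mean and standard deviation were stored at fit time and that the sample is extended with x_0 = 1 as in the gradient descent note; `predict`, `mu`, `s`, and the `theta` values are hypothetical placeholders rather than fitted results from this study.

```python
import numpy as np

def predict(theta, features, mu, s):
    """Apply h_theta(x) = theta^T x to one new sample.

    features: raw feature values of the new sample (without x_0).
    mu, s:    per-feature mean and standard deviation from the training set.
    """
    scaled = (np.asarray(features, dtype=float) - mu) / s
    x = np.concatenate([[1.0], scaled])   # prepend x_0 = 1
    return float(theta @ x)

# Placeholder parameters (NOT fitted values from the paper).
theta = np.array([1000.0, 200.0, 50.0])
mu = np.array([10.0, 4.0])
s = np.array([2.0, 1.0])

# Both features sit one standard deviation above the mean,
# so the prediction is 1000 + 200*1 + 50*1.
print(predict(theta, [12.0, 5.0], mu, s))
```

Reusing the training statistics, rather than recomputing them from the new sample, keeps the new sample on the same scale the parameters were learned on.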

The values of adjusted R-squared and predicted R-squared (Predicted Residual
Error Sum of Squares, PRESS statistic) [8] describe the ability of the model
to explain the variability of the data around the mean. The greater the
value, the better the model explains the variability in the data.
Furthermore, the Residual Standard Error measures the difference between the
observed value and the estimated or predicted value. High values of adjusted
R-squared and predicted R-squared indicate a high capability to predict the
variability in the data.
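
The three statistics used in the comparison can be computed as in the following sketch. It assumes an ordinary least squares fit with an intercept column, uses the standard textbook definitions (adjusted R-squared with n - p - 1 degrees of freedom, PRESS from leave-one-out residuals via the hat matrix diagonal), and runs on synthetic placeholder data. The function name `fit_metrics` is hypothetical and this is not necessarily how the values in the tables were produced.

```python
import numpy as np

def fit_metrics(X, y):
    """Return (adjusted R^2, predicted R^2 from PRESS, residual standard error).

    X must include the intercept column; p counts the remaining predictors.
    """
    n, k = X.shape
    p = k - 1
    theta, *_ = np.linalg.lstsq(X, y, rcond=None)
    residual = y - X @ theta
    rss = float(residual @ residual)
    sst = float(((y - y.mean()) ** 2).sum())

    r2 = 1.0 - rss / sst
    adjusted_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)

    # Leave-one-out residuals: e_i / (1 - h_ii), h_ii = diagonal of X @ pinv(X).
    leverage = np.einsum("ij,ji->i", X, np.linalg.pinv(X))
    press = float(((residual / (1.0 - leverage)) ** 2).sum())
    predicted_r2 = 1.0 - press / sst

    rse = float(np.sqrt(rss / (n - p - 1)))
    return adjusted_r2, predicted_r2, rse

# Synthetic placeholder data (NOT the paper's data).
rng = np.random.default_rng(1)
features = rng.uniform(0, 10, size=(30, 2))
y = 5 + 3 * features[:, 0] + 0.5 * features[:, 1] ** 2 + rng.normal(0, 1, 30)
X = np.column_stack([np.ones(30), features])

adjusted_r2, predicted_r2, rse = fit_metrics(X, y)
print(adjusted_r2, predicted_r2, rse)
```

Predicted R-squared is based on errors for samples the fit has not seen, so it can never exceed the ordinary R-squared and is the stricter of the two measures.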

TABLE III Comparison between Linear and Polynomial Models Inter Feature

                                            Linear Model                     Polynomial Model
Case                                        Adjusted   Predicted  Residual   Adjusted   Predicted  Residual
                                            R-Squared  R-Squared  Std. Error R-Squared  R-Squared  Std. Error
Poverty Rural and Unemployment Rural        0.8007     0.7541     524.6      0.916      0.8333     340.6
Poverty Rural and Unemployment Urban        0.905      0.8718     362.2      0.946      0.9228     273.1
Poverty Rural and Unemployment combined     0.8271     0.7251     488.6      0.9194     0.8315924  333.6
Poverty Urban and Unemployment Rural        0.8591     0.7682578  441.2      0.8891     0.8124214  391.4
Poverty Urban and Unemployment Urban        0.8179     0.7466558  501.4      0.8493     0.7808036  456.1
Poverty Urban and Unemployment combined     0.8444     0.7414832  463.6      0.8722     0.8110839  420.1
Poverty Combined and Unemployment Rural     0.8558     0.8020817  446.2      0.9412     0.9093449  285.1
Poverty Combined and Unemployment Urban     0.9131     0.8717862  346.4      0.9495     0.9284094  264
Poverty Combined and Unemployment Combined  0.8639     0.7718077  433.6      0.9389     0.8890291  290.5
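
As a small check on how Table III is read, the sketch below copies the predicted R-squared columns from the table and picks out the strongest polynomial case. Only the values come from the table; the variable names and the shortened case labels are hypothetical.

```python
# (case, linear predicted R^2, polynomial predicted R^2), copied from Table III.
TABLE_III = [
    ("Poverty Rural / Unemployment Rural",       0.7541,    0.8333),
    ("Poverty Rural / Unemployment Urban",       0.8718,    0.9228),
    ("Poverty Rural / Unemployment Combined",    0.7251,    0.8315924),
    ("Poverty Urban / Unemployment Rural",       0.7682578, 0.8124214),
    ("Poverty Urban / Unemployment Urban",       0.7466558, 0.7808036),
    ("Poverty Urban / Unemployment Combined",    0.7414832, 0.8110839),
    ("Poverty Combined / Unemployment Rural",    0.8020817, 0.9093449),
    ("Poverty Combined / Unemployment Urban",    0.8717862, 0.9284094),
    ("Poverty Combined / Unemployment Combined", 0.7718077, 0.8890291),
]

# Case with the highest polynomial predicted R^2.
best_case, _, best_poly_r2 = max(TABLE_III, key=lambda row: row[2])

# Number of cases in which the polynomial model beats the linear one.
poly_wins = sum(1 for _, linear, poly in TABLE_III if poly > linear)

print(best_case, best_poly_r2)
print(f"polynomial beats linear in {poly_wins} of {len(TABLE_III)} cases")
```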


TABLE IV Comparison between Linear and Polynomial Models Intra Feature

                                               Linear Model                     Polynomial Model
Case                                           Adjusted   Predicted  Residual   Adjusted   Predicted  Residual
                                               R-Squared  R-Squared  Std. Error R-Squared  R-Squared  Std. Error
Poverty Rural and Urban (Rural Emphasis)       0.8991     0.8106955  373.3      0.9573     0.9532678  242.9
Poverty Rural and Urban (Urban Emphasis)       0.8991     0.8106955  373.3      0.9581     0.9511998  240.5
Unemployment Rural and Urban (Rural Emphasis)  0.7137     0.6526985  628.8      0.7624     0.7210551  572.9
Unemployment Rural and Urban (Urban Emphasis)  0.7137     0.6526985  628.8      0.6956     0.6010112  648.3

A. Evaluating Results
According to Table III, the combination of Poverty Combined against people
unemployed in urban areas performs better than any other, with a predicted
R-squared value of 0.9284 and a residual error of 264, which means that
approximately 93% of the time the incidence of crime predicted by the
multivariate polynomial model is accurate. However, in the intra-feature
comparison (Table IV), we observe that the multivariate polynomial model for
the number of people under poverty in rural against urban areas, with the
urban feature emphasized, is even better at predicting incidences of crime,
with an accuracy of 95% and an even lower residual error than before, at 240.

Moreover, the closest the multivariate linear models come is when modeling
Poverty Combined against people unemployed in urban areas, at 87% accuracy
and with a residual standard error of 346, a 22% increase in residual error
compared to the corresponding multivariate polynomial model.

We find that the plots created by both models are similar and differ only in
theta values, but we have found a similar feature in both

the models which explains the main reason why the linear model fails.

V. COMPARATIVE STUDY
As we can see in Fig 3, the polynomial curve highlighted in blue perfectly
models the low-end sample points, exhibiting a linear rise. However, the
linear plane highlighted in green overestimates the sample points. In Fig 4,
the linear model highlighted in green overestimates the sample point of the
state of Uttar Pradesh. Moreover, in the mid section it underestimates the
points of An Pr (Andhra Pradesh) and its neighbors.

The polynomial model displays a curve that passes closer to the mid-section
points and Uttar Pradesh. Overall, the polynomial model fits the data better
than the linear plane. It also handles outliers properly.

VI. FUTURE RESEARCH
We are planning similar research for the year 2016 when the data become
available. Any methods to track poverty are welcome and can be integrated
into our model. Furthermore, our polynomial model is just one of the models
which explain the homicidal incidences; other features such as Income
Inequality (Gini Index) could also be added to the model, which could then
model incidences of crime in other areas. As the crimes due to poverty and
unemployment are addressed by Government policies, our expectation is that in
future years the polynomial model will explain crime to a lesser extent year
by year, eventually resulting in a correlation between poverty and
unemployment and incidences of crime of under 0.1.

VII. CONCLUSION
Using statistical machine learning we have found a link between crime,
poverty and unemployment. The experiment conducted on various permutations of
the independent variables found that poverty in rural and urban areas is the
most viable way to explain the incidences of homicide in the year 2010;
poverty in urban areas contributed more to homicides than poverty in rural
areas. Another interpretation obtained, between poverty and unemployment, was
that unemployment in urban areas and poverty in both rural and urban areas
predict the incidences of homicide.

A limitation of our approach in future studies is that the time taken to survey

poverty is large. A solution to this could be any method which tracks poverty
and unemployment in real time, such as the Stanford work combining satellite
data to map poverty [10]. Such data, together with the population density of
the region, can perhaps give a pragmatic analysis of crime areas. Other
factors such as income inequality can enhance our model.

REFERENCES
[1] Ching-Chi Hsieh and M. D. Pugh, Poverty, Income Inequality, and Violent
Crime: A Meta-Analysis of Recent Aggregate Data Studies, Criminal Justice
Review, Vol 18, Issue 2, pp. 182-202.

[2] Andrew Ng, Supervised Learning, Discriminative Algorithms, CS 229:
Machine Learning (Course handouts).
Available: http://cs229.stanford.edu/materials.html Retrieved: 27th Feb 2017

[3] Hayashi, Fumio, Econometrics (Princeton University Press, 2000).

[4] Bin Mohamad, Ismail; Dauda Usman, "Standardization and Its Effects on
K-Means Clustering Algorithm", Research Journal of Applied Sciences,
Engineering and Technology, 2013.

[5] Greenland, Sander, Dose-Response and Trend Analysis in Epidemiology:
Alternatives to Categorical Analysis, Epidemiology, July 1995, 4.

[6] Rencher, Alvin C.; Christensen, William F., Chapter 10, Multivariate
regression, Section 10.1, Introduction, Methods of Multivariate Analysis
(Wiley Series in Probability and Statistics, 709 (3rd ed.), John Wiley &
Sons, 2012).

[7] Theil, Henri, Economic Forecasts and Policy (Amsterdam: North-Holland,
1961).

[8] Allen, D. M., The Relationship Between Variable Selection and Data
Augmentation and a Method for Prediction, Technometrics, 16, 1974, 125-127.

[9] Daniel Adler, Duncan Murdoch and others (2016). rgl: 3D Visualization
Using OpenGL. R package version 0.95.1441.
https://CRAN.R-project.org/package=rgl

[10] Stanford University, Stanford scientists combine satellite data, machine
learning to map poverty.
Available: http://news.stanford.edu/2016/08/18/combiningsatellitedatamachine-learning-to-map-poverty/
Retrieved: 4 March 2017.
