Abstract

As a side effect of increasingly popular social media, cyberbullying has emerged
as a serious problem afflicting children, adolescents and young adults. Machine learning
techniques make automatic detection of bullying messages in social media possible, and
this could help to construct a healthy and safe social media environment. In this
meaningful research area, one critical issue is robust and discriminative numerical
representation learning of text messages. In this paper, we propose a new representation
learning method to tackle this problem. Our method, named Semantic-Enhanced
Marginalized Denoising Auto-Encoder (smSDA), is developed via semantic extension of
the popular deep learning model stacked denoising autoencoder. The semantic extension
consists of semantic dropout noise and sparsity constraints, where the semantic dropout
noise is designed based on domain knowledge and the word embedding technique. Our
proposed method is able to exploit the hidden feature structure of bullying information
and learn a robust and discriminative representation of text. Comprehensive experiments
on two public cyberbullying corpora (Twitter and MySpace) are conducted, and the results
show that our proposed approaches outperform other baseline text representation learning
methods.
Chapter 1

INTRODUCTION

1.1 Introduction
Social media is a group of Internet-based applications that build on the
ideological and technological foundations of Web 2.0, and that allow the creation
and exchange of user-generated content. Via social media, people can enjoy
abundant information, convenient communication and so on. However,
social media may have some side effects such as cyberbullying, which may have
negative impacts on the life of people, especially children and teenagers.
Cyberbullying can be defined as aggressive, intentional actions performed by an
individual or a group of people via digital communication methods such as sending
messages and posting comments against a victim. Different from traditional bullying
that usually occurs at school during face-to-face communication, cyberbullying on
social media can take place anywhere at any time. Bullies are free to hurt
their peers' feelings because they do not need to face anyone and can hide behind
the Internet. Victims are easily exposed to harassment since all of us,
especially youth, are constantly connected to the Internet or social media.
The reported cyberbullying victimization rate ranges from 10% to 40%. In the
United States, approximately 43% of teenagers have been bullied on social media.
Like traditional bullying, cyberbullying has negative, insidious and sweeping
impacts on children. The outcomes for victims of cyberbullying may even be
tragic, such as self-injurious behavior or suicide.
One way to address the cyberbullying problem is to automatically detect and
promptly report bullying messages so that proper measures can be taken to prevent
possible tragedies. Previous works on computational studies of bullying have shown
that natural language processing and machine learning are powerful tools to study
bullying. Cyberbullying detection can be formulated as a supervised learning
problem. A classifier is first trained on a cyberbullying corpus labeled by humans,
and the learned classifier is then used to recognize a bullying message. Three kinds
of information, namely text, user demography, and social network features, are often
used in cyberbullying detection [9]. Since the text content is the most reliable, our
work here focuses on text-based cyberbullying detection.

1.2 Problem Statement

In text-based cyberbullying detection, the first and also critical step is numerical
representation learning for text messages. In addition, labeling data is labor-intensive
and time-consuming, and cyberbullying is hard to judge from a third-party view due to its
intrinsic ambiguities.

1.3 Aim of the Project

The aim of the project is to develop the Semantic-Enhanced Marginalized Denoising
Auto-Encoder (smSDA), a semantic extension of the popular deep learning model stacked
denoising autoencoder, and to use it to predict sentiment and filter bullying words in
text messages.

1.4 Natural language processing

Natural language processing (NLP) is a field of computer science, artificial intelligence,
and computational linguistics concerned with the interactions between computers and human
(natural) languages. As such, NLP is related to the area of human-computer interaction.
Many challenges in NLP involve natural language understanding, that is, enabling computers
to derive meaning from human or natural language input; others involve natural
language generation (Wikipedia).

"Natural Language Processing is a field that covers computer understanding and
manipulation of human language, and it's ripe with possibilities for newsgathering," Anthony
Pesce said in Natural Language Processing in the Kitchen. "You usually hear about it in
the context of analyzing large pools of legislation or other document sets, attempting to
discover patterns or root out corruption."

1.4.1 What Can Developers Use NLP Algorithms For?

NLP algorithms are typically based on machine learning algorithms. Instead of hand-coding
large sets of rules, NLP can rely on machine learning to automatically learn these
rules by analyzing a set of examples (i.e. a large corpus, like a book, down to a collection
of sentences), and making a statistical inference. In general, the more data analyzed, the
more accurate the model will be.

- Summarize blocks of text using Summarizer to extract the most important and
  central ideas while ignoring irrelevant information.
- Create a chat bot using Parsey McParseface, a language parsing deep learning
  model made by Google that uses part-of-speech tagging.
- Automatically generate keyword tags from content using AutoTag, which
  leverages LDA, a technique that discovers topics contained within a body of text.
- Identify the type of entity extracted, such as a person, place, or
  organization, using Named Entity Recognition.
- Use Sentiment Analysis to identify the sentiment of a string of text, from very
  negative to neutral to very positive.
- Reduce words to their root, or stem, using PorterStemmer, or break up text
  into tokens using Tokenizer.

1.4.2 Open Source NLP Libraries

These libraries provide the algorithmic building blocks of NLP in real-world applications.
Algorithmia provides a free API endpoint for many of these algorithms, without ever
having to set up or provision servers and infrastructure.

- Apache OpenNLP: a machine learning toolkit that provides tokenizers, sentence
  segmentation, part-of-speech tagging, named entity extraction, chunking, parsing,
  coreference resolution, and more.
- Natural Language Toolkit (NLTK): a Python library that provides modules for
  processing text, classifying, tokenizing, stemming, tagging, parsing, and more.
- Stanford NLP: a suite of NLP tools that provide part-of-speech tagging, a
  named entity recognizer, a coreference resolution system, sentiment analysis, and
  more.
- MALLET: a Java package that provides Latent Dirichlet Allocation, document
  classification, clustering, topic modeling, information extraction, and more.

1.4.4 Sentiment analysis

Sentiment analysis (also known as opinion mining) refers to the use of natural language
processing, text analysis and computational linguistics to identify and extract subjective
information in source materials. Sentiment analysis is widely applied to reviews and
social media for a variety of applications, ranging from marketing to customer service.

Generally speaking, sentiment analysis aims to determine the attitude of a speaker or a
writer with respect to some topic or the overall contextual polarity of a document. The
attitude may be his or her judgment or evaluation (see appraisal theory), affective state
(that is to say, the emotional state of the author when writing), or the intended emotional
communication (that is to say, the emotional effect the author wishes to have on the
reader).

A basic task in sentiment analysis is classifying the polarity of a given text at the
document, sentence, or feature/aspect level: whether the expressed opinion in a
document, a sentence or an entity feature/aspect is positive, negative, or neutral.
Advanced, "beyond polarity" sentiment classification looks, for instance, at emotional
states such as "angry", "sad", and "happy".
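
The polarity task described above can be illustrated with a very small lexicon-based sketch. It is only an illustration under stated assumptions: the word sets POSITIVE and NEGATIVE and the function name polarity are hypothetical and are not taken from any library or from this project.

```python
# Toy lexicons, invented for illustration only.
POSITIVE = {"good", "great", "happy", "love"}
NEGATIVE = {"bad", "awful", "sad", "hate"}

def polarity(text):
    """Label text as positive, negative or neutral by counting lexicon hits."""
    words = [w.strip(".,!?").lower() for w in text.split()]
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(polarity("I love this great day!"))
print(polarity("I hate you"))
print(polarity("See you at noon"))
```

A real system would replace the word counts with a trained classifier, but the three output classes are the same.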

1.4.5 Artificial intelligence

Artificial intelligence (AI) is intelligence exhibited by machines. In computer science,
the field of AI research defines itself as the study of "intelligent agents": any device that
perceives its environment and takes actions that maximize its chance of success at some
goal. Colloquially, the term "artificial intelligence" is applied when a machine mimics
"cognitive" functions that humans associate with other human minds, such as "learning"
and "problem solving". As machines become increasingly capable, mental facilities once
thought to require intelligence are removed from the definition. For example, optical
character recognition is no longer perceived as an exemplar of "artificial intelligence",
having become a routine technology. Capabilities currently classified as AI include
successfully understanding human speech, competing at a high level in strategic game
systems (such as Chess and Go), self-driving cars, intelligent routing in Content Delivery
Networks, and interpreting complex data.

AI research is divided into subfields that focus on specific problems or on specific
approaches or on the use of a particular tool or towards satisfying particular applications.

The central problems (or goals) of AI research include reasoning, knowledge, planning,
learning, natural language processing (communication), perception and the ability to
move and manipulate objects. General intelligence is among the field's long-term
goals. Approaches include statistical methods, computational intelligence, and traditional
symbolic AI. Many tools are used in AI, including versions of search and mathematical
optimization, logic, methods based on probability and economics. The AI field draws
upon computer science, mathematics, psychology, linguistics, philosophy, neuroscience
and artificial psychology.

The field was founded on the claim that human intelligence "can be so precisely
described that a machine can be made to simulate it". This raises philosophical arguments
about the nature of the mind and the ethics of creating artificial beings endowed with
human-like intelligence, issues which have been explored by myth, fiction and
philosophy since antiquity. Some people also consider AI a danger to humanity if it
progresses unabatedly.
Chapter 2

LITERATURE SURVEY
Karthik Dinakar, Roi Reichart, Henry Lieberman, Modeling the Detection of Textual
Cyberbullying, MIT Media Lab and Computer Science & Artificial Intelligence
Laboratory, Massachusetts Institute of Technology, Cambridge, MA 02139, USA.

The scourge of cyberbullying has assumed alarming proportions with an ever-increasing
number of adolescents admitting to having dealt with it either as a victim or as
a bystander. Anonymity and the lack of meaningful supervision in the electronic
medium are two factors that have exacerbated this social menace. Comments or posts
involving sensitive topics that are personal to an individual are more likely to be
internalized by a victim, often resulting in tragic outcomes. We decompose the overall
detection problem into detection of sensitive topics, lending itself to text classification
sub-problems. We experiment with a corpus of 4500 YouTube comments, applying a
range of binary and multiclass classifiers. We find that binary classifiers for individual
labels outperform multiclass classifiers. Our findings show that the detection of textual
cyberbullying can be tackled by building individual topic-sensitive classifiers.

Andreas M. Kaplan, Michael Haenlein, Users of the world, unite! The challenges
and opportunities of Social Media, ESCP Europe, 79 Avenue de la Republique,
F-75011 Paris, France.

The concept of Social Media is top of the agenda for many business executives
today. Decision makers, as well as consultants, try to identify ways in which firms can
make profitable use of applications such as Wikipedia, YouTube, Facebook, Second Life,
and Twitter. Yet despite this interest, there seems to be very limited understanding of
what the term Social Media exactly means; this article intends to provide some
clarification. We begin by describing the concept of Social Media, and discuss how it
differs from related concepts such as Web 2.0 and User Generated Content. Based on this
definition, we then provide a classification of Social Media which groups applications
currently subsumed under the generalized term into more specific categories by
characteristic: collaborative projects, blogs, content communities, social networking sites,
virtual game worlds, and virtual social worlds. Finally, we present 10 pieces of advice for
companies which decide to utilize Social Media.

Qianjia Huang, Vivek K. Singh, Pradeep K. Atrey, Cyber Bullying Detection Using
Social and Textual Analysis, Department of Applied Computer Science, The
University of Winnipeg, Winnipeg, MB, Canada; The Media Lab, Massachusetts
Institute of Technology, Cambridge, MA, USA; Department of Computer Science,
University at Albany SUNY, Albany, NY, USA.

Cyber bullying, which often has a deeply negative impact on the victim, has
grown as a serious issue among adolescents. To understand the phenomenon of cyber
bullying, experts in social science have focused on personality, social relationships and
psychological factors involving both the bully and the victim. Recently computer science
researchers have also come up with automated methods to identify cyber
bullying messages by identifying bullying-related keywords in cyber conversations.
However, the accuracy of these textual feature based methods remains limited. In this
work, we investigate whether analyzing social network features can improve the accuracy
of cyber bullying detection. By analyzing the social network structure between users and
deriving features such as number of friends, network embeddedness, and relationship
centrality, we find that the detection of cyber bullying can be significantly improved by
integrating the textual features with social network features.

Chad Ebesutani, Matthew Fierstein, Andres G. Viana, John Young, Manuel Sprung,
The Role of Loneliness in the Relationship Between Anxiety and Depression in
Clinical and School-Based Youth, Duksung Women's University; University of
California at Los Angeles; University of Mississippi Medical Center; University of
Mississippi; International Psychology and Psychotherapy Center, Vienna.

Identifying mechanisms that explain the relationship between anxiety and
depression is needed. The Tripartite Model is one model that has been proposed to help
explain the association between these two problems, positing a shared component called
negative affect. The objective of the present study was to examine the role of loneliness
in relation to anxiety and depression. A total of 10,891 school-based youth (Grades 2-12)
and 254 clinical children and adolescents receiving residential treatment (Grades 2-12)
completed measures of loneliness, anxiety, depression, and negative affect. The
relationships among loneliness, anxiety, depression, and negative affect were examined,
including whether loneliness was a significant intervening variable. Various mediational
tests converged, showing that loneliness was a significant mediator in the relationship
between anxiety and depression. This effect was found across child (Grades 2-6) and
adolescent (Grades 7-12) school-based youth. In the clinical sample, loneliness was found
to be a significant mediator between anxiety and depression, even after introducing
negative affect based on the Tripartite Model. Results supported loneliness as a
significant risk factor in youths' lives that may result from anxiety and place youth at risk
for subsequent depression. Implications related to intervention and prevention in school
settings are also discussed.

Jun-Ming Xu, Kwang-Sung Jun, Xiaojin Zhu, Amy Bellmore, Learning from
Bullying Traces in Social Media, Department of Computer Sciences, University of
Wisconsin-Madison, Madison, WI 53706, USA; Department of Educational
Psychology, University of Wisconsin-Madison, Madison, WI 53706, USA.

We introduce the social study of bullying to the NLP community. Bullying, in
both physical and cyber worlds (the latter known as cyberbullying), has been recognized
as a serious national health issue among adolescents. However, previous social studies of
bullying are handicapped by data scarcity, while the few computational studies narrowly
restrict themselves to cyberbullying which accounts for only a small fraction of all
bullying episodes. Our main contribution is to present evidence that social media, with
appropriate natural language processing techniques, can be a valuable and abundant data
source for the study of bullying in both worlds. We identify several key problems in
using such data sources and formulate them as NLP tasks, including text classification,
role labeling, sentiment analysis, and topic modeling. Since this is an introductory paper,
we present baseline results on these tasks using off-the-shelf NLP solutions, and
encourage the NLP community to contribute better models in the future.
Chapter 3

SYSTEM ANALYSIS

System analysis is the process of finding the best solution to a problem. It is the
procedure through which we learn about the existing problems, define objects and
requirements, and evaluate the solutions. It is a way of thinking about the organization
and the problem it involves, and a set of techniques that helps in solving these
problems. The feasibility study plays an essential part in system analysis, since it
gives the target for design and development.

3.1 Existing system and its disadvantages

A classifier is first trained on a cyberbullying corpus labeled by humans, and the learned
classifier is then used to recognize a bullying message. Three kinds of information
including text, user demography, and social network features are often used in
cyberbullying detection. Since the text content is the most reliable, our work here
focuses on text-based cyberbullying detection.

In the text-based cyberbullying detection, the first and also critical step is the numerical
representation learning for text messages. In fact, representation learning of text is
extensively studied in text mining, information retrieval and natural language processing
(NLP). The bag-of-words (BoW) model is one commonly used model, in which each dimension
corresponds to a term. Latent Semantic Analysis (LSA) and topic models are other
popular text representation models, which are both based on BoW models. By mapping
text units into fixed-length vectors, the learned representation can be further processed
for numerous language processing tasks.
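
As a minimal sketch of the BoW idea described above, the following maps each message to a fixed-length count vector with one dimension per term. The function names build_vocabulary and bow_vector and the toy messages are assumptions made for illustration; whitespace tokenization is a simplification.

```python
from collections import Counter

def build_vocabulary(messages):
    """Map each distinct lowercase token to a fixed column index."""
    vocab = {}
    for message in messages:
        for token in message.lower().split():
            if token not in vocab:
                vocab[token] = len(vocab)
    return vocab

def bow_vector(message, vocab):
    """Return a fixed-length term-count vector; unknown tokens are ignored."""
    counts = Counter(message.lower().split())
    vector = [0] * len(vocab)
    for token, count in counts.items():
        if token in vocab:
            vector[vocab[token]] = count
    return vector

messages = ["you are so stupid", "you are my friend"]  # toy, made-up messages
vocab = build_vocabulary(messages)
vectors = [bow_vector(m, vocab) for m in messages]
print(vectors)
```

Most entries of such vectors are zero once the vocabulary grows, which is the sparsity issue that motivates learned representations.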

Drawbacks of Existing System

- Lack of an adequate training data set.
- Social media messages are often very short and contain a lot of informal language and
  misspellings, which existing models cannot process properly.
- A major limitation of these approaches is that the learned feature space still relies on
  the BoW assumption and may not be robust.

Proposed System and its advantages

Some approaches have been proposed to tackle these problems by incorporating expert
knowledge into feature learning. One approach combined BoW features, sentiment features
and contextual features to train a support vector machine for online harassment detection.

Another used label-specific features to extend the general features, where the
label-specific features are learned by Linear Discriminative Analysis. In addition, common
sense knowledge was also applied. Nahar et al. presented a weighted TF-IDF scheme that
scales bullying-like features by a factor of two. Besides content-based information, Maral
et al. proposed to apply users' information, such as gender and message history, and
context information as extra features. But a major limitation of these approaches is that
the learned feature space still relies on the BoW assumption and may not be robust. In
addition, the performance of these approaches relies on the quality of hand-crafted
features, which require extensive domain knowledge.
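
The weighted TF-IDF idea mentioned above can be sketched as follows. This is a minimal illustration and not the scheme of Nahar et al. itself: the word list BULLYING_TERMS, the function name weighted_tfidf and the smoothed IDF variant log(N/df) + 1 are assumptions; only the factor of two comes from the text.

```python
import math
from collections import Counter

BULLYING_TERMS = {"stupid", "loser"}  # hypothetical word list, for illustration only
BULLYING_WEIGHT = 2.0                 # the factor of two mentioned in the text

def weighted_tfidf(documents):
    """Return one {term: weight} dict per document, doubling bullying-like terms."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    result = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights = {}
        for term, count in counts.items():
            tf = count / len(tokens)
            idf = math.log(n_docs / doc_freq[term]) + 1.0  # assumed smoothed variant
            scale = BULLYING_WEIGHT if term in BULLYING_TERMS else 1.0
            weights[term] = tf * idf * scale
        result.append(weights)
    return result

weights = weighted_tfidf(["you are stupid", "you are kind"])
print(weights[0]["stupid"], weights[1]["kind"])
```

In this toy corpus "stupid" and "kind" have the same term and document frequencies, so the first weight is exactly twice the second.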

Advantages

1) Most cyberbullying detection methods rely on the BoW model. Due to the sparsity
problems of both data and features, the classifier may not be trained very well. The stacked
denoising autoencoder (SDA), as an unsupervised representation learning method, is able
to learn a robust feature space. In SDA, the feature correlation is explored by the
reconstruction of corrupted data. The learned robust feature representation can then boost
the training of the classifier and finally improve the classification accuracy. In addition, the
corruption of data in SDA actually generates artificial data to expand the data size, which
alleviates the small size problem of training data.

2) For the cyberbullying problem, we design semantic dropout noise to emphasize bullying
features in the new feature space, and the resulting new representation is thus more
discriminative for cyberbullying detection.

3) The sparsity constraint is injected into the solution of the mapping matrix W for each
layer, considering that each word is only correlated to a small portion of the whole
vocabulary. We formulate the solution for the mapping weights W as an Iterated Ridge
Regression problem, in which the semantic dropout noise distribution can be easily
marginalized to ensure the efficient training of our proposed smSDA.

4) Based on word embeddings, bullying features can be extracted automatically. In
addition, the possible limitation of expert knowledge can be alleviated by the use of word
embedding.
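
To make the data corruption step in point 1 concrete, here is a minimal sketch of plain dropout noise, assuming each feature is set to zero independently with the same probability. It is not the semantic dropout noise of smSDA, whose distribution is not specified in this section; the function names and the toy vector are hypothetical.

```python
import random

def dropout_corrupt(vector, drop_prob, rng):
    """Zero each feature independently with probability drop_prob."""
    return [0 if rng.random() < drop_prob else value for value in vector]

def corrupted_copies(vector, drop_prob, n_copies, seed=0):
    """Generate several corrupted versions of one feature vector."""
    rng = random.Random(seed)
    return [dropout_corrupt(vector, drop_prob, rng) for _ in range(n_copies)]

original = [1, 0, 2, 1, 0, 3]  # toy BoW counts
copies = corrupted_copies(original, drop_prob=0.5, n_copies=4)
for copy in copies:
    print(copy)
```

Each corrupted copy acts as an extra artificial training example, and the autoencoder is trained to recover the original vector from it.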
Chapter 4

Software Requirement and Specification

Software Requirement
Software System Configuration:-
Operating System : Windows 95/98/2000/XP
Application Server : Tomcat 5.0/6.x
Front End : HTML, Java, JSP
Scripts : JavaScript
Server-side Script : Java Server Pages
Database : MySQL
Database Connectivity : JDBC

Hardware Requirements
Hardware System Configuration:-
Processor - Pentium III
Speed - 1.1 GHz
RAM - 256 MB (min)
Hard Disk - 20 GB
Floppy Drive - 1.44 MB
Keyboard - Standard Windows Keyboard
Mouse - Two or Three Button Mouse
Monitor - SVGA

Functional Requirements
Functional requirements of the system are described in this section. The requirements
are expressed in natural language. They describe the functions of the system and its
behaviour when certain inputs are given.
The functional requirements of this system are as follows:
- A new user must sign up.
- After registering, he/she can sign in.
- Host instance.
- The sender can upload a file using the receiver's email ID.
- The file is uploaded to the Amazon cloud server.

Non-Functional Requirements

The performance requirements are as follows:
- Portability: The system is developed using Java. Since Java is platform
  independent, the code can be ported to any machine with few corrections.
- Resources utilized: The system must make minimum use of the available resources.
- Efficiency: The system must use efficient algorithms in order to use fewer
  resources.
- Feasibility: The system must be economically feasible.
Chapter 5

SYSTEM DESIGN
Design is a creative process; a good design is the key to an effective system. System
"design" is defined as "the process of applying various techniques and principles for
the purpose of defining a process or a system in sufficient detail to permit its
physical realization". Various design features are followed to develop the system. The
design specification describes the features of the system, the components or elements
of the system and their appearance to end users.

5.1 Fundamental Design Concepts


A set of fundamental design concepts has evolved over the past three decades. Although
the degree of interest in each concept has varied over the years, each has stood the
test of time. Each one provides the software designer with a foundation from which more
sophisticated design methods can be applied. The fundamental design concepts provide
the necessary framework for "getting it right". The fundamental design concepts, such as
abstraction, refinement, modularity, software architecture, control hierarchy,
structural partitioning, data structure, software procedure and information hiding, are
applied in this project to carry out the work according to the design.
INPUT DESIGN

The input design is the link between the information system and the user. It comprises
developing the specifications and procedures for data preparation, that is, the steps
necessary to put transaction data into a usable form for processing. This can be achieved
by instructing the computer to read data from a written or printed document, or by
having people key the data directly into the system. The design of input focuses on
controlling the amount of input required, controlling errors, avoiding delay, avoiding
extra steps and keeping the process simple. The input is designed so that it
provides security and ease of use while retaining privacy. Input design considers the
following things:

- What data should be given as input?
- How should the data be arranged or coded?
- The dialog to guide the operating personnel in providing input.
- Methods for preparing input validations and steps to follow when errors occur.

OBJECTIVES

1. Input design is the process of converting a user-oriented description of the input into a
computer-based system. This design is important to avoid errors in the data input process
and to show the correct direction to the management for getting correct information from
the computerized system.

2. It is achieved by creating user-friendly screens for data entry to handle large volumes
of data. The goal of designing input is to make data entry easier and free from
errors. The data entry screen is designed in such a way that all data manipulations can
be performed. It also provides record viewing facilities.

3. When data is entered, it is checked for validity. Data can be entered with the help
of screens. Appropriate messages are provided as and when needed so that the user is
never left confused. Thus the objective of input design is to create an input layout that
is easy to follow.
OUTPUT DESIGN

A quality output is one which meets the requirements of the end user and presents the
information clearly. In any system, the results of processing are communicated to the users
and to other systems through outputs. In output design it is determined how the
information is to be displayed for immediate need and also as hard copy output. It is the
most important and direct source of information to the user. Efficient and intelligent output
design improves the system's relationship with the user and helps decision-making.

Designing computer output should proceed in an organized, well thought out manner;
the right output must be developed while ensuring that each output element is designed so
that people will find the system easy to use effectively. When analysts design
computer output, they should:

1. Identify the specific output that is needed to meet the requirements.

2. Select methods for presenting information.

3. Create documents, reports, or other formats that contain information produced by the
system.

The output form of an information system should accomplish one or more of the
following objectives:

- Convey information about past activities, current status or projections of the future.
- Signal important events, opportunities, problems, or warnings.
- Trigger an action.
- Confirm an action.
Low level design
[Figure: Design 0 flowchart. User -> Check Request -> Accept Request? Yes: Accepted -> Post Messages; No: Rejected.]


First, the user checks the friend requests sent to him/her by other people. If the user
wishes to accept a request, he/she can click on the accept request button and can then
start chatting and posting messages to friends. If the user does not want to accept the
request, he/she can click on the delete request button; the friend request sent by the
other person is then deleted and the user is no longer able to chat with that person.
5.2 System Architecture
System architecture is the conceptual design that defines the structure and
behavior of a system. An architecture description is a formal description of a system,
organized in a way that supports reasoning about the structural properties of the
system. It defines the system components or building blocks and provides a
plan from which products can be procured, and systems developed, that will work
together to implement the overall system. The system architecture is described
below.
Users first enter text messages, and these text messages are stored in a database. From the
database, all the messages are extracted and classified, that is, the words of the
messages are classified into bullying words and normal words. Once the messages are
classified, sentiment prediction is done using NLP, that is, the
attitude of the person who is sending the message is determined by correlating the
words in the message. For the input messages, mining rules are applied to
determine the frequency of occurrence of the words in the messages. Neurons are
trained using the training dataset. Finally, the result is predicted, that is, whether
the entered messages are positive, negative or neutral.
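
The word-level classification and frequency steps described above can be sketched as follows. This is only an illustration under assumptions: the set BULLYING_WORDS and the function split_and_count are hypothetical, since the report does not give the actual word list or code, and the project itself is implemented in Java.

```python
from collections import Counter

BULLYING_WORDS = {"idiot", "ugly"}  # hypothetical list; the report does not give one

def split_and_count(message):
    """Split a message's words into bullying and normal groups with their frequencies."""
    words = [w.strip(".,!?").lower() for w in message.split()]
    words = [w for w in words if w]  # drop tokens that were only punctuation
    bullying = Counter(w for w in words if w in BULLYING_WORDS)
    normal = Counter(w for w in words if w not in BULLYING_WORDS)
    return bullying, normal

bullying, normal = split_and_count("You idiot, you are an idiot!")
print(bullying)
print(normal)
```

The two frequency tables could then be passed on as features to the sentiment prediction step.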
USE CASE DIAGRAM

[Figure: use case diagram. Actor: User. Use cases: Send Request, Accept Request, Post Messages, Share Images.]


In this module, a user can send a friend request to other people and can start
chatting only with those who accept the request. In the same
manner, other people can also send a friend request, and it is up to the user to accept
or delete it. If the friend request sent by the user is deleted,
then the user is not able to chat with that person. Once a friend request is accepted,
the users can share images, post messages and start chatting with
their friends.

[Figure: use case diagram. Actor: NLP. Use cases: Gather Text Messages, Apply Classification, Analyse Sentiment, Generate Result.]

NLP gathers all the text messages from the database, then classifies the
messages, and for the classified messages the sentiment prediction is done. After
training the neurons using the training dataset, the result for the messages is finally
predicted.

[Figure: use case diagram. Actor: Weka. Use cases: Training Data Set, Test Data Set, Apply Mining Rule, Predict Result using J48.]

Weka is a collection of machine learning algorithms for data mining tasks. The
algorithms can either be applied directly to a dataset or called from your own Java code.
Weka contains tools for data pre-processing, classification, regression, clustering,
association rules, and visualization. It is also well-suited for developing new machine
learning schemes.

The training dataset is used to train the model. The data mining rules are then applied
to the test dataset, that is, the messages entered by the user, and finally the result is
predicted using the J48 algorithm.
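
J48 is Weka's implementation of the C4.5 decision tree, which chooses the attribute to split on using the gain ratio, a normalized form of information gain. The sketch below computes that criterion on a toy dataset; it is not Weka code, and a real J48 run also handles pruning, numeric attributes and missing values. The feature and label values are made up.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def gain_ratio(feature_values, labels):
    """Information gain of a split on feature_values, divided by its split information."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    # Weighted entropy that remains after splitting on the feature.
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    gain = entropy(labels) - remainder
    split_info = entropy(feature_values)
    return gain / split_info if split_info > 0 else 0.0

labels = ["bullying", "bullying", "normal", "normal"]  # toy labels
contains_insult = [1, 1, 0, 0]      # feature that separates the classes perfectly
contains_greeting = [1, 0, 1, 0]    # feature that carries no information
print(gain_ratio(contains_insult, labels))
print(gain_ratio(contains_greeting, labels))
```

The tree would split first on the feature with the higher gain ratio, here contains_insult.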
Class Diagram

[Figure: class diagram with three classes and their operations.]

Filter Server: Main(); QueryProcess.getData(); BadWords()
Sentiment Prediction: parseJson(); filterData(); Apply Classification()
AI Module: trainData(); testData(); predictResult()
Sequence Diagram

This module consists of a datastore, NLP and the Weka tool. First, all the text messages
entered by the users are gathered from the database, and sentiment analysis is
performed using NLP. The J48 algorithm is then applied to predict the data, data mining
rules are applied to the predicted data, and finally the result is predicted.
Chapter 6

Implementation of Cyberbullying Detection


Implementation is the stage of the project in which the theoretical design is turned
into a working system. It can therefore be considered the most critical stage in
achieving a successful new system and in giving the user confidence that the new system
will work and be effective.

The implementation stage involves careful planning, investigation of the existing
system and its constraints on implementation, design of methods to achieve
changeover, and evaluation of changeover methods.

Modules

1. Marginalized Stacked Denoising Auto-encoder
2. Semantic Enhancement for mSDA
3. Construction of Bullying Feature Set
4. smSDA for Cyberbullying Detection

Marginalized Stacked Denoising Auto-encoder

Chen et al. [17] proposed a modified version of the Stacked Denoising Auto-encoder that
employs a linear instead of a nonlinear projection so as to obtain a closed-form solution.
The basic idea behind a denoising auto-encoder is to reconstruct the original inputs
x1, ..., xn from their corrupted versions ~x1, ..., ~xn, with the goal of obtaining a
robust representation. In the marginalized denoising auto-encoder, the original data are
reconstructed from the corrupted data via a linear projection.
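The closed-form solution mentioned above can be written out as follows. This is a sketch following the formulation of [17], with the bias term omitted for brevity. The matrix notation is an assumption introduced here: $X = [x_1, \dots, x_n]$ collects the original inputs as columns and $\tilde{X} = [\tilde{x}_1, \dots, \tilde{x}_n]$ the corrupted ones.

```latex
% Squared reconstruction loss of the linear mapping W
\begin{align}
\mathcal{L}(W) &= \frac{1}{2n} \sum_{i=1}^{n} \lVert x_i - W \tilde{x}_i \rVert^2 \\
% Setting the gradient with respect to W to zero gives the least-squares solution
W &= P Q^{-1}, \qquad P = X \tilde{X}^{\top}, \qquad Q = \tilde{X} \tilde{X}^{\top} \\
% Marginalizing over the corruption replaces P and Q by their expectations
W &= \mathbb{E}[P] \, \mathbb{E}[Q]^{-1}
\end{align}
```

Because the last expression depends only on expectations under the noise distribution, no corrupted copies of the data need to be generated explicitly.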

Semantic Enhancement for mSDA

The advantage of corrupting the original input in mSDA can be explained by feature
co-occurrence statistics. Co-occurrence information makes it possible to derive a robust
feature representation under an unsupervised learning framework, and it also motivates
other state-of-the-art text feature learning methods such as Latent Semantic Analysis and
topic models.

A denoising autoencoder is trained to reconstruct the removed feature values from the
remaining uncorrupted ones. Thus, the learned mapping matrix W is able to capture the
correlation between the removed features and the other features. The major modifications
in smSDA are semantic dropout noise and sparse mapping constraints.
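The corruption step can be illustrated with dropout noise, in which each feature is set to zero with some probability. The sketch below is an illustration under stated assumptions, not the smSDA procedure itself. The function name `dropout_corrupt`, the toy bag-of-words vector and the probability values are hypothetical. Per-feature probabilities are used only to show how a semantic variant could treat bullying features differently from other features.

```python
import random


def dropout_corrupt(features, drop_prob, rng):
    """Return a copy of `features` where entry i is zeroed with probability drop_prob[i]."""
    return [0.0 if rng.random() < p else value for value, p in zip(features, drop_prob)]


rng = random.Random(0)               # fixed seed so the sketch is repeatable
message = [1.0, 0.0, 2.0, 1.0]       # hypothetical bag-of-words counts
bullying_index = {2}                 # hypothetical position of a bullying feature

# Assumed values: one dropout probability for bullying features, another for the rest.
drop_prob = [0.8 if i in bullying_index else 0.5 for i in range(len(message))]
corrupted = dropout_corrupt(message, drop_prob, rng)
```

The autoencoder would then be trained to map `corrupted` back to `message`, which forces the mapping to rely on correlations between the dropped and the retained features.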

However, directly using a list of bullying words as features may not achieve good
performance, because these words account for only a small portion of the whole vocabulary
and vulgar words are only one kind of discriminative feature for bullying.

Construction of Bullying Feature Set

The bullying features play an important role and should be chosen properly. In the
following, the steps for constructing the bullying feature set Zb are given, in which the
first layer and the other layers are addressed separately. For the first layer, expert
knowledge and word embeddings are used. For the other layers, discriminative feature
selection is conducted. Layer One: first, we build a list of words with negative affect,
including swear words and dirty words. Then, we compare the word list with the BoW
features of our own corpus, and regard the intersection as bullying features. However,
such an expert-built list is limited and does not reflect the current usage and style of
cyberlanguage.

Therefore, we expand the list of pre-defined insulting words, i.e. insulting seeds, based on
word embeddings as follows. Word embeddings use real-valued, low-dimensional vectors
to represent the semantics of words. Well-trained word embeddings lie in a vector space
where similar words are placed close to each other. In addition, the cosine similarity
between word embeddings quantifies the semantic similarity between words. Since
Internet messages are our corpus of interest, we utilize a word2vec model that has been
well trained on a large-scale Twitter corpus containing 400 million tweets. In a
visualization of some word embeddings after dimensionality reduction (PCA), it is
observed that curse words form distinct clusters, which are also far away from normal
words. Even insulting words are located in different regions, owing to different word
usages and insulting expressions. In addition, since the word embeddings adopted here are
trained on a large-scale corpus from Twitter, the similarity captured by the word
embeddings can represent the specific language pattern. For example, the embedding of the
misspelled word fck is close to the embedding of fuck, so that the word fck can be
automatically extracted based on word embeddings.
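The expansion of insulting seeds by cosine similarity can be sketched as follows. This is a minimal illustration, assuming a plain dictionary of word vectors rather than a real word2vec model. The two-dimensional toy vectors, the function name `expand_seeds` and the threshold of 0.9 are hypothetical and are not taken from the original work.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def expand_seeds(seeds, embeddings, threshold):
    """Add every word whose embedding is close enough to at least one seed."""
    known_seeds = [s for s in seeds if s in embeddings]
    expanded = set(seeds)
    for word, vector in embeddings.items():
        if any(cosine(vector, embeddings[s]) >= threshold for s in known_seeds):
            expanded.add(word)
    return expanded


# Toy 2-d vectors for illustration only; real embeddings have many more dimensions.
embeddings = {
    "idiot": [0.9, 0.1],
    "idiottt": [0.85, 0.2],   # misspelled variant lying close to its seed
    "hello": [0.1, 0.9],
}
result = expand_seeds({"idiot"}, embeddings, threshold=0.9)
```

With these toy vectors the misspelled variant is added to the seed set, while the unrelated word is left out.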

smSDA for Cyberbullying Detection

We propose the Semantic-enhanced Marginalized Stacked Denoising Auto-encoder
(smSDA). In this subsection, we describe how to leverage it for cyberbullying detection.
smSDA provides robust and discriminative representations. The learned numerical
representations can then be fed into a Support Vector Machine (SVM). In the new space,
owing to the captured feature correlation and semantic information, the SVM, even when
trained on a small training corpus, is able to achieve good performance on testing
documents.

Based on prior knowledge, we construct a pre-defined bullying wordlist and compare it
with the original vocabulary of the whole corpus X. The words appearing in both the
vocabulary and the bullying wordlist are selected as insulting seeds. The insulting seeds
are then expanded and refined automatically via word embeddings, which defines the
bullying features Zb for layer one.
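The selection of insulting seeds described above amounts to a set intersection between the corpus vocabulary and the wordlist. The following is a minimal sketch, assuming a simple lowercase tokenizer. The function name `build_vocabulary`, the toy corpus and the toy wordlist are hypothetical.

```python
import re


def build_vocabulary(messages):
    """Collect the set of lowercase word tokens that occur in the corpus."""
    vocabulary = set()
    for message in messages:
        vocabulary.update(re.findall(r"[a-z']+", message.lower()))
    return vocabulary


# Hypothetical corpus and pre-defined wordlist, for illustration only.
corpus = ["You are such a loser", "Have a nice day", "What an idiot move"]
bullying_wordlist = ["idiot", "loser", "moron"]

# Insulting seeds are the words present in both the vocabulary and the wordlist.
seeds = sorted(build_vocabulary(corpus) & set(bullying_wordlist))
```

Words of the wordlist that never occur in the corpus (here "moron") are dropped, so the seeds always correspond to actual BoW features.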
Algorithm : J48

Step 1 : Collect dataset from social network website

Step 2 : Prepare training data set with OSN data

Step 3 : Process OSN data to predict result

Step 4 : Apply classification rules

Step 5 : Apply mining rules

Step 6 : Generate sentiment result

Step 7 : Train neurons with training data set

Step 8 : Process sentiment result to filter bad words

Step 9 : Filter words from messages

Step 10 : Reconstruct messages
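Steps 8 to 10 above, filtering flagged words and reconstructing the message, can be sketched as follows. This is a minimal illustration under assumptions, not the project's actual filter. The function name `filter_message`, the masking character and the sample bad-word set are hypothetical.

```python
import re


def filter_message(message, bad_words, mask="*"):
    """Mask every flagged word and return the reconstructed message."""
    flagged = {word.lower() for word in bad_words}

    def replace(match):
        word = match.group(0)
        # Keep the word length so the layout of the message is preserved.
        return mask * len(word) if word.lower() in flagged else word

    # Only word tokens are rewritten; spacing and punctuation stay untouched.
    return re.sub(r"[A-Za-z']+", replace, message)


cleaned = filter_message("You are an Idiot, really.", {"idiot"})
# cleaned == "You are an *****, really."
```

Matching is case-insensitive, and rebuilding the message through `re.sub` keeps everything except the flagged words unchanged.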


Chapter 7

Software Testing of Cyberbullying Detection


In the software development life cycle, software testing is one of the significant
phases. In the software testing phase, verification and validation of the project under
development are performed with respect to the stated requirements [28]. In this
project, unit testing of the main modules, integration testing of different modules and
complete system testing, along with GUI testing, are performed.

7.1 Test Environment


This project was tested on the following platforms
Software Platform :
Software used for testing is given below
Operating System : Windows XP or Windows 7
Tool / IDE : Eclipse Juno 4.2 with Tomcat Server
Visual Interface : Java Server Pages, Servlets
Database : MySQL 5.0
Hardware Platform :
Processor : Intel Pentium 4 or more
Memory : 1GB RAM or more
Hard disk : 40GB or more with 5400 rpm

7.2 Unit Testing of Main modules


Individual modules or functions are tested under unit testing. This method of
testing is also called White Box Testing. The different modules are tested independently
and their functionality is checked. Each of the unit test cases is explained below.
Table 7.1 Unit Test Case 1 for Registration Page

Sl No of Test Case : UTC-1
Name of Test Case : Test case to verify Registration Page
Feature being Tested : Registration details
Description : User should provide details like Name, Email Id, Mobile Number
Sample Input : Email Id, Mobile Number, Username, Password
Expected Output : Registration Successful message should be displayed
Actual Output : As expected
Remarks : Passed

Table 7.1 shows Unit Test Case 1 (UTC-1), which verifies whether the user interface accepts
the registration details of the sender and the receiver.

Table 7.2 Unit Test Case 2 for Message Post

Sl No of Test Case : UTC-2
Name of Test Case : Test Message Post
Feature being Tested : Messages have been posted
Description : Store in database
Sample Input : Text
Expected Output : Message posted
Actual Output : Device is ready to use
Remarks : Passed

Table 7.2 shows Unit Test Case 2 (UTC-2).


Table 7.3 Unit Test Case 3 for Sentiment Analysis

Sl No of Test Case : UTC-3
Name of Test Case : Check Posted Message
Feature being Tested : Sentiment analysis
Description : Posted message sentiment should be tested
Sample Input : Messages
Expected Output : Predict sentiment
Actual Output : Messages
Remarks : Passed

Table 7.3 shows Unit Test Case 3 (UTC-3).

Table 7.4 Unit Test Case 4 for Data Prediction

Sl No of Test Case : UTC-4
Name of Test Case : Data Prediction using AI
Feature being Tested : Result prediction
Description : Posted messages should be tested to predict bullying words
Sample Input : Posted Messages
Expected Output : Filter Words

Table 7.4 shows Unit Test Case 4 (UTC-4).


7.3 Integration Testing of Modules

In integration testing, two or more modules are combined and tested together. The
purpose of integration testing is to check whether the integrated modules perform as
expected. The integration test cases are given below.

Table 7.9 Integration Test Case 1 for Sender Registration and Post Messages

Sl No of Test Case : ITC-1
Name of Test Case : Test case to register Sender and post messages
Feature being Tested : Registration and post message
Description : Sender should register first to post messages to other user
Sample Input : Text messages
Expected Output : Message successfully sent

Table 7.10 Integration Test Case 2 for sentiment analysis and prediction

Sl No of Test Case : ITC-2
Name of Test Case : Test case to analyse sentiment and predict bullying words
Feature being Tested : Filter bullying words
Description : After sentiment analysis we should apply AI to predict bullying words
Sample Input : Posted Messages
Expected Output : Bullying words Predicted
Actual Output : As expected
Remarks : Passed
System Testing

Sl No of Test Case : ITC-1
Name of Test Case : Test case to analyse overall performance of the application
Feature being Tested : System performance
Description : Should check sentiment predicted properly and bad word filtered
Sample Input : Posted Messages
Expected Output : Bullying words Predicted
Actual Output : As expected
Remarks : Passed

Summary
This chapter presents the software testing methods and the different types of test cases
used to test the modules of the project. Unit test cases are described for the main
modules. Integration testing is performed when unit-tested modules are combined.
Finally, a system test case is conducted to verify the functioning of the overall system.
Chapter 8

Experimental Analysis and Results


Different parameters are considered to evaluate the effectiveness of the
application. The system is validated both theoretically and practically using evaluation
metrics. This chapter describes the experimental results of the different modules of the
project in terms of performance and accuracy. A comparison with competing approaches
is also shown, together with a graphical representation of the performance.

8.1 Evaluation Metric

Metrics are the measures used to evaluate the project. The following metrics are used.
Time : This metric is used to determine the time required for the computation of secret
key generation, security device generation, ciphertext generation and device update.
Bits : This metric is used to determine the number of bits required for the secret key
size, security device size and ciphertext size.

8.2 Experimental Dataset


This section describes the dataset used to evaluate the developed system.
The experimental dataset consists of the number of bits used to convert the plaintext file
into the encrypted file using the secret key and the USB security device.

8.3 Performance Analysis


For the dataset considered, the performance analysis of the application is carried out as
follows. This application requires a secret key size of 64 bits. Three different
approaches are compared in the time analysis. The first approach is a two-factor data
security protection system with no revocability, the second is a single-factor secret
system with revocability, and the third is the current application, a two-factor data
security protection mechanism with revocability.
The current application requires additional computation time in security device
generation and device update, because it uses more bits for secret key generation.
However, this extra computation cost is slight and negligible, and the application
outperforms the other approaches.

[Figure 8.1: plot of Time (vertical axis, scale 0 to 7) for three approaches: Without
Revocability, Without USB Device, Project Approach.]

Figure 8.1 Performance Plot for different Security Mechanism approaches

Table 8.1 Evaluation Metric for different Security Mechanism approaches

Parameters         Without Revocability    Without USB Device    Project Approach
Time Taken         6.5 ms                  5.5 ms                5.5 ms
Bit Size (bits)    32                      32                    64

8.4 Inference From the Result

The inference drawn from the performance plot shown in Figure 8.1 is that the other
approaches take more time for the computation of secret key generation, ciphertext
generation and device update. The other approaches use fewer bits (32 bits only),
whereas the current application uses a 64-bit secret key size.
Table 8.1 shows the number of bits considered and the time taken in milliseconds for
the different approaches. The approach without revocability takes more time and uses
fewer bits (32-bit), whereas the application takes less time and uses more bits (64-bit).

Summary

In this chapter, the evaluation metrics used to determine the performance of the
application are explained. The chapter also describes the dataset considered, the
performance analysis and the inference made from the obtained results.

CONCLUSION

This project addresses the text-based cyberbullying detection problem, where robust and
discriminative representations of messages are critical for an effective detection system.
By designing semantic dropout noise and enforcing sparsity, we have developed the
semantic-enhanced marginalized denoising autoencoder as a specialized representation
learning model for cyberbullying detection. In addition, word embeddings have been
used to automatically expand and refine bullying word lists that are initialized by domain
knowledge.

Future Enhancement

The Semantic-Enhanced Marginalized Denoising Auto-Encoder (smSDA) was developed to
filter bad words in social network messages. It can be extended to a further level where
objectionable videos are also detected and blocked before they reach the destination user.
Chapter 9
Screenshots
References

[1] A. M. Kaplan and M. Haenlein, "Users of the world, unite! The challenges and
opportunities of social media," Business Horizons, vol. 53, no. 1, pp. 59-68, 2010.
[2] R. M. Kowalski, G. W. Giumetti, A. N. Schroeder, and M. R. Lattanner, "Bullying in
the digital age: A critical review and meta-analysis of cyberbullying research among
youth," 2014.
[3] M. Ybarra, "Trends in technology-based sexual and non-sexual aggression over time
and linkages to nontechnology aggression," National Summit on Interpersonal Violence
and Abuse Across the Lifespan: Forging a Shared Agenda, 2010.
[4] B. K. Biggs, J. M. Nelson, and M. L. Sampilo, "Peer relations in the anxiety-depression
link: Test of a mediation model," Anxiety, Stress, & Coping, vol. 23, no. 4,
pp. 431-447, 2010.
[5] S. R. Jimerson, S. M. Swearer, and D. L. Espelage, Handbook of Bullying in Schools:
An International Perspective. Routledge/Taylor & Francis Group, 2010.
[6] G. Gini and T. Pozzoli, "Association between bullying and psychosomatic problems:
A meta-analysis," Pediatrics, vol. 123, no. 3, pp. 1059-1065, 2009.
[7] A. Kontostathis, L. Edwards, and A. Leatherman, "Text mining and cybercrime,"
Text Mining: Applications and Theory. John Wiley & Sons, Ltd, Chichester, UK, 2010.
[8] J.-M. Xu, K.-S. Jun, X. Zhu, and A. Bellmore, "Learning from bullying traces in
social media," in Proceedings of the 2012 Conference of the North American Chapter of
the Association for Computational Linguistics: Human Language Technologies. Association
for Computational Linguistics, 2012, pp. 656-666.
[9] Q. Huang, V. K. Singh, and P. K. Atrey, "Cyber bullying detection using social and
textual analysis," in Proceedings of the 3rd International Workshop on Socially-Aware
Multimedia. ACM, 2014, pp. 3-6.
[10] D. Yin, Z. Xue, L. Hong, B. D. Davison, A. Kontostathis, and L. Edwards,
"Detection of harassment on web 2.0," Proceedings of the Content Analysis in the WEB,
vol. 2, pp. 1-7, 2009.
[11] K. Dinakar, R. Reichart, and H. Lieberman, "Modeling the detection of textual
cyberbullying," in The Social Mobile Web, 2011.
[12] V. Nahar, X. Li, and C. Pang, "An effective approach for cyberbullying detection,"
Communications in Information Science and Management Engineering, 2012.
[13] M. Dadvar, F. de Jong, R. Ordelman, and R. Trieschnigg, "Improved cyberbullying
detection using gender information," in Proceedings of the 12th Dutch-Belgian
Information Retrieval Workshop (DIR2012). Ghent, Belgium: ACM, 2012.
[14] M. Dadvar, D. Trieschnigg, R. Ordelman, and F. de Jong, "Improving cyberbullying
detection with user context," in Advances in Information Retrieval. Springer, 2013, pp.
693-696.
[15] P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, and P.-A. Manzagol, "Stacked
denoising autoencoders: Learning useful representations in a deep network with a local
denoising criterion," The Journal of Machine Learning Research, vol. 11, pp. 3371-3408,
2010.
[16] P. Baldi, "Autoencoders, unsupervised learning, and deep architectures,"
Unsupervised and Transfer Learning Challenges in Machine Learning, Volume 7, p. 43,
2012.
[17] M. Chen, Z. Xu, K. Weinberger, and F. Sha, "Marginalized denoising autoencoders
for domain adaptation," arXiv preprint arXiv:1206.4683, 2012.
[18] T. K. Landauer, P. W. Foltz, and D. Laham, "An introduction to latent semantic
analysis," Discourse Processes, vol. 25, no. 2-3, pp. 259-284, 1998.
[19] T. L. Griffiths and M. Steyvers, "Finding scientific topics," Proceedings of the
National Academy of Sciences of the United States of America, vol. 101, no. Suppl 1, pp.
5228-5235, 2004.
[20] D. M. Blei, A. Y. Ng, and M. I. Jordan, "Latent Dirichlet allocation," The Journal of
Machine Learning Research, vol. 3, pp. 993-1022, 2003.
[21] T. Hofmann, "Unsupervised learning by probabilistic latent semantic analysis,"
Machine Learning, vol. 42, no. 1-2, pp. 177-196, 2001.
[22] Y. Bengio, A. Courville, and P. Vincent, "Representation learning: A review and new
perspectives," Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 35,
no. 8, pp. 1798-1828, 2013.
[23] B. L. McLaughlin, A. A. Braga, C. V. Petrie, M. H. Moore et al., Deadly Lessons:
Understanding Lethal School Violence. National Academies Press, 2002.
[24] J. Juvonen and E. F. Gross, "Extending the school grounds? Bullying experiences in
cyberspace," Journal of School Health, vol. 78, no. 9, pp. 496-505, 2008.
[25] M. Fekkes, F. I. Pijpers, A. M. Fredriks, T. Vogels, and S. P. Verloove-Vanhorick,
"Do bullied children get ill, or do ill children get bullied? A prospective cohort study on
the relationship between bullying and health-related symptoms," Pediatrics, vol. 117, no.
5, pp. 1568-1574, 2006.
[26] M. Ptaszynski, F. Masui, Y. Kimura, R. Rzepka, and K. Araki, "Brute force works
best against bullying," in Proceedings of IJCAI 2015 Joint Workshop on Constraints and
Preferences for Configuration and Recommendation and Intelligent Techniques for Web
Personalization. ACM, 2015.
[27] R. Tibshirani, "Regression shrinkage and selection via the lasso," Journal of the
Royal Statistical Society. Series B (Methodological), pp. 267-288, 1996.
[28] C. C. Paige and M. A. Saunders, "LSQR: An algorithm for sparse linear equations and
sparse least squares," ACM Transactions on Mathematical Software (TOMS), vol. 8, no.
1, pp. 43-71, 1982.
[29] M. A. Saunders et al., "Cholesky-based methods for sparse least squares: The
benefits of regularization," Linear and Nonlinear Conjugate Gradient-Related Methods,
pp. 92-100, 1996.
[30] J. Fan and R. Li, "Variable selection via nonconcave penalized likelihood and its
oracle properties," Journal of the American Statistical Association, vol. 96, no. 456, pp.
1348-1360, 2001.
