Analysing language use on social media for identifying malicious activities.

Author: Pranav Bhandari

Bhandari, Pranav, 2022. Analysing language use on social media for identifying malicious activities. Flinders University, College of Science and Engineering.

Terms of Use: This electronic version is (or will be) made publicly available by Flinders University in accordance with its open access policy for student theses. Copyright in this thesis remains with the author. You may use this material for uses permitted under the Copyright Act 1968. If you are the owner of any included third party copyright material and/or you believe that any material has been made available without permission of the copyright owner please contact copyright@flinders.edu.au with the details.

Abstract

Although advances in natural language processing (NLP) have contributed significantly to text mining and produced promising results, contextualizing text at a level comparable to human performance remains difficult. This thesis addresses several aspects of NLP: discovering underlying patterns in text, classifying it, and contextualizing natural language data. It applies a range of methods to the vast amount of text on social media in order to extract the intentions and behaviors expressed in it. Three datasets are considered, each representing malicious activity in a different domain, and each is subjected to several levels of experimentation. The first dataset concerns the spread of misinformation, the second suspicious activities on Twitter, and the third the spread of threats online. These datasets were chosen because they contain a mixture of malicious and non-malicious activity, which helps to differentiate the behavior of each party. The process begins with Exploratory Data Analysis (EDA), in which methods such as sentiment and polarity analysis and frequent-word analysis, supported by various illustrations, are used to generate insights from the data. The EDA yielded insights that distinguish the features of each category (label) in a dataset from those of the others. In addition, word-level analysis with various techniques allowed us to characterize the themes of the different categories present in the data. Following the EDA, topic modeling is performed on each dataset, with the underlying topics extracted by combining K-means clustering with Principal Component Analysis. This revealed distinct topics in each dataset that could be studied individually for different purposes.
Furthermore, the moral inclination of every document in the corpus is estimated using the Moral Foundations Dictionary and FrameAxis methods. The documents are categorized into the vice and virtue domains of five moral foundation axes, and the results are analyzed. The divergence in moral scores across the categories of each dataset indicates that the use of moral language depends strongly on the topic and context of the discussion. Finally, experiments are performed with BERT, a state-of-the-art NLP model, which after fine-tuning achieved an accuracy of 97% on the first dataset and 98% on the second and third datasets.
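
As an illustration of the moral-scoring step mentioned in the abstract, the sketch below computes a FrameAxis-style bias score for one document along one moral foundation axis. It is a minimal sketch under stated assumptions, not the thesis's implementation: the axis is taken to be the difference between the centroid of virtue-word vectors and the centroid of vice-word vectors, and the bias is the frequency-weighted mean cosine similarity between each document word and that axis. The function names, the word lists, and the three-dimensional vectors in TOY_EMBEDDINGS are hypothetical placeholders chosen for illustration; real use would need pretrained word embeddings and the actual dictionary entries.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def centroid(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def axis_bias(tokens, embeddings, virtue_words, vice_words):
    """Frequency-weighted mean cosine similarity between document words
    and the (virtue centroid - vice centroid) axis.

    Positive values lean towards the virtue pole, negative towards vice.
    Tokens without an embedding are ignored (an assumption of this sketch).
    """
    virtue = centroid([embeddings[w] for w in virtue_words])
    vice = centroid([embeddings[w] for w in vice_words])
    axis = [a - b for a, b in zip(virtue, vice)]

    counts = Counter(t for t in tokens if t in embeddings)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    weighted = sum(n * cosine(embeddings[w], axis) for w, n in counts.items())
    return weighted / total


# Hypothetical toy vectors, NOT real embeddings or dictionary entries.
TOY_EMBEDDINGS = {
    "protect": [0.9, 0.1, 0.0],
    "care": [0.8, 0.2, 0.1],
    "harm": [-0.9, 0.1, 0.0],
    "hurt": [-0.8, 0.2, 0.1],
    "help": [0.7, 0.3, 0.0],
    "attack": [-0.7, 0.3, 0.0],
}
CARE_VIRTUE = ["protect", "care"]  # placeholder virtue pole
CARE_VICE = ["harm", "hurt"]       # placeholder vice pole

doc_bias = axis_bias(["we", "help", "people"], TOY_EMBEDDINGS,
                     CARE_VIRTUE, CARE_VICE)
```

Repeating the computation for each of the five moral foundation axes would give one bias value per axis for every document, which could then be compared across the categories of a dataset.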

Keywords: NLP, Text analysis, machine learning, exploratory data analysis, topic modelling, moral valence, text classification

Subject: Computer Science thesis

Thesis type: Masters
Completed: 2022
School: College of Science and Engineering
Supervisor: Mehwish Nasim