Expert Commentary
Online social media platforms have become an integral part of daily life, with users spending hours on them every day. This activity produces a wealth of data that can be analyzed for insight into public sentiment and mental health, and identifying individuals at risk of suicide early could save lives. However, traditional analysis techniques scale poorly to datasets of this size.
This paper proposes a new methodology based on a big data architecture to predict suicidal ideation from social media content. The approach involves two phases: batch processing and real-time streaming prediction. The batch dataset is collected from the Reddit forum and used for model building and training, while the streaming data is extracted using the Twitter streaming API for real-time prediction.
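The two-phase split can be illustrated with a minimal sketch: a model is fitted once on a labeled batch and then applied to posts as they arrive. Plain Python stands in here for Spark ML and the Twitter streaming API, a small multinomial Naive Bayes stands in for the paper's classifiers, and every name and the toy data below are hypothetical rather than taken from the paper.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(labeled_docs, alpha=1.0):
    """Batch phase: fit a multinomial Naive Bayes model on (tokens, label) pairs."""
    class_counts = Counter()
    term_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in labeled_docs:
        class_counts[label] += 1
        term_counts[label].update(tokens)
        vocab.update(tokens)
    total_docs = sum(class_counts.values())
    model = {"log_prior": {}, "log_likelihood": {}}
    for label, n_docs in class_counts.items():
        model["log_prior"][label] = math.log(n_docs / total_docs)
        # Laplace smoothing (alpha) so unseen class/term pairs keep nonzero probability.
        denom = sum(term_counts[label].values()) + alpha * len(vocab)
        model["log_likelihood"][label] = {
            term: math.log((term_counts[label][term] + alpha) / denom)
            for term in vocab
        }
    return model

def predict(model, tokens):
    """Return the label with the highest log-posterior; out-of-vocabulary tokens are ignored."""
    scores = {}
    for label, log_prior in model["log_prior"].items():
        log_likelihood = model["log_likelihood"][label]
        scores[label] = log_prior + sum(
            log_likelihood[t] for t in tokens if t in log_likelihood
        )
    return max(scores, key=scores.get)

def predict_stream(model, stream):
    """Streaming phase: lazily score each incoming text with the already-trained model."""
    for text in stream:
        yield text, predict(model, text.lower().split())

# Hypothetical toy batch: label 1 = at-risk wording, 0 = neutral wording.
TOY_BATCH = [
    ("i want to end it all".split(), 1),
    ("great day at the park".split(), 0),
    ("i cannot go on anymore".split(), 1),
    ("lovely dinner with friends".split(), 0),
]
```

Calling `train_naive_bayes(TOY_BATCH)` once and passing the result to `predict_stream` with any iterable of strings mirrors the division of labor described above: training cost is paid in the batch phase, and the streaming phase only runs inference.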
The first phase, batch processing, involves preprocessing the raw data and extracting features. These features are then used to train multiple Apache Spark ML classifiers, including Naive Bayes, Logistic Regression, Linear SVM, Decision Trees, Random Forest, and Multilayer Perceptron. Various feature-extraction techniques are explored, and different testing scenarios are used to evaluate performance.
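As a rough illustration of the preprocessing and feature-extraction step, the sketch below cleans a post and produces unigram and bigram terms in plain Python. The cleaning rules (lowercasing, dropping URLs, @mentions, and #hashtags) and all function names are assumptions made for illustration; the commentary does not list the paper's exact preprocessing steps.

```python
import re

# Hypothetical cleaning rules, chosen for illustration only.
URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"[@#]\w+")   # removes @mentions and #hashtags entirely
TOKEN_RE = re.compile(r"[a-z']+")     # keeps lowercase words and apostrophes

def preprocess(text):
    """Lowercase the text, strip URLs and @/# tags, and split into word tokens."""
    text = text.lower()
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    return TOKEN_RE.findall(text)

def ngrams(tokens, n):
    """Return contiguous n-grams as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def unigram_bigram_terms(text):
    """Return the unigram terms followed by the bigram terms of a raw post."""
    tokens = preprocess(text)
    return tokens + ngrams(tokens, 2)
```

The resulting term list is what a count-based vectorizer would consume before any classifier is trained.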
The experimental results of the batch processing phase indicate that the (Unigram + Bigram) + CV-IDF features with the MLP classifier achieved a high accuracy of 93.47% in classifying suicidal ideation. These features are then applied to the real-time streaming prediction phase.
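To make the feature weighting concrete, here is a minimal sketch of count-based TF-IDF over unigram and bigram terms. It assumes that "CV" refers to Spark ML's CountVectorizer and uses the smoothed IDF formula documented for Spark ML, log((N + 1) / (df + 1)); the function names are hypothetical and this is plain Python, not the paper's Spark pipeline.

```python
import math
from collections import Counter

def fit_idf(docs):
    """Learn a vocabulary index and IDF weights from a list of term lists."""
    n_docs = len(docs)
    doc_freq = Counter()
    for terms in docs:
        doc_freq.update(set(terms))  # count each term at most once per document
    vocab = {term: i for i, term in enumerate(sorted(doc_freq))}
    # Smoothed IDF, as documented for Spark ML's IDF estimator (assumed to apply here).
    idf = {t: math.log((n_docs + 1) / (doc_freq[t] + 1)) for t in vocab}
    return vocab, idf

def transform(terms, vocab, idf):
    """Map a term list to a sparse {column index: count * idf} vector."""
    counts = Counter(t for t in terms if t in vocab)
    return {vocab[t]: c * idf[t] for t, c in counts.items()}
```

A term that appears in every document receives weight 0 under this formula, so very common words contribute nothing, while rarer unigrams and bigrams are weighted up.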
This research is significant because it combines a big data architecture with machine learning techniques to tackle the challenge of analyzing large-scale social media data for suicidal ideation detection. Apache Spark ML allows the data to be processed efficiently, both when extracting features and when training the classifiers.
However, there are some limitations to consider. The study focuses only on data from Reddit and Twitter, which may not be representative of all social media platforms. Additionally, the proposed approach assumes that users explicitly express their suicidal thoughts on these platforms, which may not always be the case. Future research could incorporate additional data sources and investigate more advanced natural language processing techniques to improve the accuracy of suicidal ideation prediction.
In conclusion, this research provides a practical and effective approach for predicting suicidal ideation using social media data. The use of big data architecture and machine learning classifiers allows for efficient processing and accurate prediction. With further refinement and expansion, this methodology could have significant implications for public health and suicide prevention efforts.