The way we communicate and work has changed significantly with the rise of
the Internet. While it has opened up new opportunities, it has also brought
about an increase in cyber threats. One common and serious threat is phishing,
where cybercriminals employ deceptive methods to steal sensitive
information. This study addresses the pressing issue of phishing by introducing
an advanced detection model that meticulously focuses on HTML content. Our
proposed approach integrates a specialized Multi-Layer Perceptron (MLP) model
for structured tabular data and two pretrained Natural Language Processing
(NLP) models for analyzing textual features such as page titles and content.
The embeddings from these models are harmoniously combined through a novel
fusion process. The resulting fused embeddings are then input into a linear
classifier. Recognizing the scarcity of recent datasets for comprehensive
phishing research, our contribution extends to the creation of an up-to-date
dataset, which we openly share with the community. The dataset is meticulously
curated to reflect real-life phishing conditions, ensuring relevance and
applicability. The research findings highlight the effectiveness of the
proposed approach, with CANINE demonstrating superior performance in
analyzing page titles and RoBERTa excelling in evaluating page content. The
fusion of two NLP models and one MLP model, termed MultiText-LP, achieves an
F1 score of 96.80 and an accuracy of 97.18 on our research
dataset. Furthermore, our approach outperforms existing methods on the
CatchPhish HTML dataset, demonstrating its efficacy.

Phishing has become a significant concern in today’s digital age. Cybercriminals are constantly finding new ways to deceive individuals and steal sensitive information. This article introduces a novel approach to phishing detection that focuses on analyzing HTML content.

What makes this approach unique is its multi-disciplinary nature. The researchers combine a specialized Multi-Layer Perceptron (MLP) model for structured tabular data with two pretrained Natural Language Processing (NLP) models for analyzing textual features such as page titles and content. By integrating these models and their respective embeddings through a novel fusion process, the researchers are able to create a comprehensive and effective phishing detection model.
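
To make the data flow concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the article does not describe how the fusion process works, so the sketch assumes simple concatenation of the three embeddings followed by a single linear layer with a sigmoid. The function names, embedding sizes, and random values are all hypothetical stand-ins for real model outputs.

```python
import math
import random


def fuse(tabular_emb, title_emb, content_emb):
    """Fuse three embeddings by concatenation.

    Assumption: concatenation is used only for illustration; the
    actual fusion process is not specified in the article.
    """
    return list(tabular_emb) + list(title_emb) + list(content_emb)


def linear_classifier(fused, weights, bias):
    """Map a fused embedding to a phishing probability (linear layer + sigmoid)."""
    if len(fused) != len(weights):
        raise ValueError("weights must match the fused embedding length")
    logit = sum(x * w for x, w in zip(fused, weights)) + bias
    return 1.0 / (1.0 + math.exp(-logit))


rng = random.Random(0)

# Hypothetical embeddings; a real system would take these from the models.
tabular_emb = [rng.uniform(-1, 1) for _ in range(4)]   # MLP over tabular features
title_emb = [rng.uniform(-1, 1) for _ in range(8)]     # title encoder, e.g. CANINE
content_emb = [rng.uniform(-1, 1) for _ in range(8)]   # content encoder, e.g. RoBERTa

fused = fuse(tabular_emb, title_emb, content_emb)

# Untrained, random weights: the probability below is meaningless as a prediction.
weights = [rng.uniform(-0.5, 0.5) for _ in fused]
prob = linear_classifier(fused, weights, bias=0.0)

print(f"fused dimension: {len(fused)}, phishing probability: {prob:.3f}")
```

In a real pipeline the weights would be learned jointly or on top of frozen encoders; which of those the authors chose is not stated here.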

One particularly interesting aspect of this study is the creation of an up-to-date dataset that reflects real-life phishing conditions. The scarcity of recent datasets for comprehensive phishing research is a recognized challenge in the field. By openly sharing their dataset with the community, the researchers contribute to addressing this challenge and ensuring the relevance and applicability of their findings.

The research findings demonstrate the effectiveness of the proposed approach. Among the text models, CANINE performs best on page titles, while RoBERTa performs best on page content. The fusion of these two NLP models with the MLP model, termed MultiText-LP, achieves an F1 score of 96.80 and an accuracy of 97.18 on the research dataset.
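
For readers less familiar with the two reported metrics, the sketch below shows how accuracy and F1 are computed from confusion-matrix counts. The counts are invented purely for illustration and are not taken from the study; the helper names are likewise hypothetical.

```python
def accuracy(tp, fp, tn, fn):
    """Fraction of all pages classified correctly."""
    return (tp + tn) / (tp + fp + tn + fn)


def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall for the phishing class."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Invented counts, NOT from the paper: phishing is treated as the positive class.
tp, fp, tn, fn = 90, 10, 880, 20

acc = accuracy(tp, fp, tn, fn)
f1 = f1_score(tp, fp, fn)

print(f"accuracy: {acc:.4f}, F1: {f1:.4f}")
```

Because F1 ignores true negatives, it can sit well below accuracy when legitimate pages dominate the data, which is one reason to report both.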

Furthermore, the researchers compare their approach to existing methods on the CatchPhish HTML dataset, where it outperforms them.

Overall, this study highlights the importance of a multi-disciplinary approach in addressing complex challenges such as phishing detection. By integrating methods from different domains, such as machine learning and natural language processing, the researchers are able to develop a sophisticated model that can accurately detect phishing attempts. Their contribution to creating a relevant dataset and openly sharing it with the community further strengthens the impact of their research. As cyber threats continue to evolve, it is crucial for researchers and practitioners to collaborate across disciplines to develop effective solutions that can mitigate these risks.