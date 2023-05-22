

Researchers from South Korea have taken the unusual step of training an artificial intelligence (AI) model on data from the dark web, with the aim of shedding light on how to prevent cybercrime.

DarkBERT – The New AI Model

The “Dark Web” is a portion of the internet that is hidden from regular web browsers: its pages are not indexed by search engines and can only be reached with specialized software such as Tor.

Because this area of the internet is hard to track, it is well known for its anonymous websites and is often used to host markets that enable illegal operations such as the trade in drugs and weapons and the sale of stolen data, and to shelter hackers who facilitate cybercrime.

Researchers from the Korea Advanced Institute of Science and Technology (KAIST), together with the data intelligence company S2W, have released DarkBERT, an AI language model pretrained on datasets derived from the dark web.

The aim is to use DarkBERT to analyze content uncovered on the dark web and so inform better ways of dealing with cybercrime in this part of the internet.

The researchers published a paper titled “DarkBERT: A language model for the dark side of the Internet,” which has yet to be peer-reviewed and describes in detail the development and experiment process for this large language model (LLM).


To create a dataset for the model, the research team compiled a sizable corpus by crawling the Tor network, an anonymity network that is reached through specialized software and used to access the dark web. Training on this corpus was meant to help DarkBERT adapt to the language used on the dark web.

The database then underwent deduplication, data filtering, and pre-processing in an effort to ease ethical concerns about the sensitive information found in dark web content. This removed organizations’ names, information about data leaks, threat comments, and illicit photos.
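To give a rough idea of what such a clean-up step can look like, here is a minimal Python sketch. It is not the researchers' pipeline: the function names, the minimum-length threshold, and the two regular expressions (for email addresses and .onion addresses) are assumptions chosen purely for illustration. It drops very short pages, removes exact duplicates after normalization, and replaces sensitive-looking strings with placeholder tokens.

```python
import hashlib
import re

# Hypothetical patterns for illustration only; not the rules used for DarkBERT.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
ONION_RE = re.compile(r"\b[a-z2-7]{16,56}\.onion\b")


def normalize(text: str) -> str:
    """Collapse whitespace and lowercase so near-identical pages hash alike."""
    return re.sub(r"\s+", " ", text).strip().lower()


def mask_identifiers(text: str) -> str:
    """Replace sensitive-looking strings with placeholder tokens."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    return ONION_RE.sub("[ONION]", text)


def preprocess(pages: list[str], min_chars: int = 20) -> list[str]:
    """Filter short pages, drop duplicates, and mask identifiers."""
    seen: set[str] = set()
    cleaned: list[str] = []
    for page in pages:
        norm = normalize(page)
        if len(norm) < min_chars:
            continue  # data filtering: too little text to be useful
        digest = hashlib.sha256(norm.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # deduplication: this page was already kept once
        seen.add(digest)
        cleaned.append(mask_identifiers(norm))
    return cleaned
```

Hashing the normalized text catches only exact repeats; a real crawl would likely also need near-duplicate detection, which this sketch leaves out.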

While DarkBERT is a new artificial intelligence model, it was built on the RoBERTa architecture, a language-model pretraining approach that Facebook researchers introduced in 2019.

RoBERTa, in turn, is an advancement over Google’s BERT (Bidirectional Encoder Representations from Transformers), whose performance Facebook’s researchers were able to boost once it was made open source. According to a research paper describing the inner workings of RoBERTa, it is a “robustly optimized method for pretraining natural language processing (NLP) systems.”


AI To Fight Against Cybercrime

According to DarkBERT’s research paper, the team found that their Large Language Model performed significantly better at understanding the dark web than other models that had been trained to carry out comparable tasks, including RoBERTa, which was created to “predict intentionally hidden sections of text within otherwise unannotated language examples.”
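The idea of predicting “intentionally hidden sections of text” can be illustrated with a small Python sketch. This is a simplified illustration, not code from RoBERTa or DarkBERT: the function name, the `<mask>` placeholder, and the word-level splitting are assumptions. Real BERT-style training operates on subword tokens and applies a more involved replacement scheme, but the principle is the same: hide some tokens at random and ask the model to recover them.

```python
import random

MASK = "<mask>"  # placeholder token; the exact string is an assumption


def mask_tokens(tokens, mask_prob=0.15, seed=None):
    """Hide a random subset of tokens and record what was hidden.

    Returns the masked sequence plus a dict mapping each masked
    position to the original token the model would have to predict.
    """
    rng = random.Random(seed)
    masked = []
    targets = {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            masked.append(MASK)
            targets[i] = tok  # training label for this position
        else:
            masked.append(tok)
    return masked, targets


if __name__ == "__main__":
    sentence = "the vendor ships worldwide and accepts escrow".split()
    masked, targets = mask_tokens(sentence, mask_prob=0.3, seed=1)
    print(" ".join(masked))
    print(targets)
```

Because the labels come from the text itself, no human annotation is needed, which is what makes this kind of training practical on a large crawled corpus.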

The researchers said:

“Our evaluation results show that DarkBERT-based classification model outperforms that of known pre-trained language models.”

They also said that DarkBERT could potentially be used to aid in cybersecurity tasks such as identifying websites that sell or publish private, confidential data of organizations leaked by ransomware groups.

It could additionally be used to monitor the many forums on the dark web, which are updated daily, for any exchange of illegal information.

DarkBERT won’t be accessible to the general public for now because of the potentially dangerous nature of dark web content. The researchers are, however, accepting requests to use the model for academic purposes.

That doesn’t mean DarkBERT is complete: as with other LLMs, additional training and fine-tuning may still improve its performance. What can be learned from it, and how it will be applied, remains to be seen.
