Cyber Threat Intelligence: Applying Machine Learning, Data Mining and Text Feature Extraction to the Darknet

News Monitors – Abstract

The Darknet has become a hub for hacking communities, offering cyber criminals the ability to freely discuss and sell unknown and emerging exploits. This paper studies the effectiveness of machine learning for providing cyber threat intelligence from Darknet hacking forums, by developing a functioning system for extracting information from these communities and applying machine learning methods to predict items of considerable threat. These forums contain posts made by users who may intend to sell or discuss cyber security exploits; the study focuses in particular on identifying zero-day threats. By reviewing Darknet forums, extracting data and building a machine learning model, this approach offers cyber security professionals pre-reconnaissance cyber threat intelligence for a more proactive method of defence. The paper reviews various classification methods for predicting threat levels using text feature extraction, applying the supervised learning models Naive Bayes, Nearest Neighbour, Random Forest and Support Vector Machine. The study found that applying text feature extraction and machine learning methods to user-created Darknet data can predict un-deployed or emerging attack threats, such as malware and exploits, with 81.77% accuracy.

1. Introduction

Threats from hacking, viruses and malware have challenged the cyber security industry since the early years of computing systems (Milošević, 2013). Most notably, one of the first documented computer viruses, titled ‘Brain’ or ‘Pakistani Virus’, was discovered in 1987 (Highland, 1997). However, since the creation and advance of the internet, cyber space has become a central hub for the growth and creation of cyber-attacks. It is estimated that a total of 3.4 billion users are on the internet (46% of the world’s population). Thus, it is not unreasonable to state that the internet has provided cyber criminals with a platform for learning, developing, collaborating and testing methods of hacking (ACS, 2016). Instances such as the shutting down of websites, breaching of data, acts of fraud and distribution of viruses are evident threats to modern computing and its users. According to Verizon’s 2015 Data Breach Investigations Report (2015), cyber-attack vectors by industry include:

  • Point of Sale (Retail, Entertainment, Hospitality) — 28.5%
  • Crimeware (Public sector, Education, Finance) — 18.8%
  • Cyber Espionage (Professional, Information, Manufacturing) — 18%
  • Miscellaneous — 14.7%
  • Privilege Misuse (Mining, Healthcare, Administrative) — 10.6%
  • Web Applications (Finance, Information) — 9.4% (Verizon, 2015)

Furthermore, the report also notes that the most targeted industries in 2015 were the following:

  • Manufacturing — 27.4%
  • Public — 20.2%
  • Professional — 13.3%
  • Information — 6.2%
  • Utilities — 3.9%
  • Transportation — 1.8%
  • Education — 1.7%
  • Real Estate — 1.3%
  • Financial Services — 0.8%
  • Healthcare — 0.7% (Verizon, 2015)

From this, it is apparent that preventing cyber-attacks across a multitude of industries and vectors presents a huge task, with a vast number of challenges, for cyber security professionals. Eric Fischer (2016) categorises the long-term challenges faced by security in technology:

  • Design: Security is often not an integral part of the design and development of software and hardware. Traditionally, developers focus more on features than on security, for economic reasons. Moreover, future security requirements are difficult to predict at design time.
  • Incentive: The economics of cyber security are distorted. Cyber security is generally regarded as expensive and many do not consider it an investment. On the other hand, cyber-attacks themselves can be cheap and very profitable for criminals.
  • Consensus: Stakeholders and directors within private and public-sector organisations regard cyber security differently. Differing understandings of its meaning, implementation and risk mean these individuals may not act effectively to prevent attacks.
  • Environment: Cyberspace could be regarded as one of the fastest growing technological areas in both scale and properties. Applications, social media, mobile, data, cloud computing and Internet of Things (IoT), to name a few, all pose a complicated environment for cyber security. The potential opportunities for cyber-attacks grow as the cyber space grows (Fischer, 2016).

Understanding the volume at which cyber-attacks occur, and the broad spectrum of vectors they occur within, means that taking a proactive approach to cyber security is perhaps a feasible way of addressing these challenges. Whilst challenges such as weaknesses in design may give attackers the upper hand, cyber security professionals have been able to detect attacks proactively by monitoring hacking communities and social media websites (Robertson, 2017). Over recent years the rise of hacking communities across both the surface web and Darknet has become ever more apparent. Whilst it is understood that security professionals and organisations in cyber security gain a large amount of threat intelligence from these communities, we are faced with the ever-growing challenge of monitoring behaviour on social platforms, due to their exponential growth (Chaudhry, 2017). An example is the scale of the 0day Forum (a popular exploit and hacking forum), which has over 47,000 posts, 15,000 threads and 35,000 members (“0day Forum — Homepage”, n.d.).

Despite exploit and vulnerability detection growing day by day, methods of defence are notably slower. A recent example of this is the Mirai botnet. On 20th September 2016, the website of the author Brian Krebs was shut down by a Distributed Denial of Service (DDoS) attack. The attack contained an extraordinary amount of traffic, to the volume of 620 Gigabits per second (Kolias, Kambourakis, Stavrou & Voas, 2017). Soon after, variants of the same botnet were found to be attacking numerous other websites, one of which peaked at 1.1 Terabits per second (Goodin, 2017). Notably, the botnet also affected the service provider Dyn, creating outages on hundreds of websites in the United States. A month after the botnet first came to light, its source code was found on the hacking community Hackforums. Not only was this a large wake-up call for the cyber security of IoT devices, it also displays the prevalence of threat intelligence on hacking forums (Fremantle & Scott, 2017). Hacking communities allow users to freely and anonymously share, sell, purchase and discuss methods of attack, which is of growing concern. The Zero Day Initiative discovered 135 high-threat zero-day exploits in Adobe products, 76 in Microsoft and 50 in Apple products across 2016 alone (“Zero Day Initiative”, 2017). Thus, it is apparent that in the more immediate future the security industry will be faced with more zero-day exploits, and the monitoring of their presence on hacking communities, market places and IRC channels is vital.

This study aims to discover the effectiveness of gathering cyber threat intelligence (CTI) from the Darknet, researching how web crawlers for extracting data, together with machine learning, can be applied to build an effective model for providing CTI. This will be done by collecting primary research on how accurately and effectively a machine can predict a threat autonomously. The primary data used within this study will be extracted and gathered through the engineering of a working web crawler, to collect and parse data from Darknet hacking forums. The intended outcome is to autonomously gather and classify data from Darknet hacking forums to provide CTI. In addition, a review and discussion of preceding literature surrounding CTI, data mining and text classification, related to the Darknet, will provide context across the study. The study intends to address the following questions:

  1. What research is currently being done in data mining for cyber threat intelligence?
  2. How effective can a text feature extraction and data mining model be for providing cyber threat intelligence?
  3. What level of predictive accuracy can be gained using this model?
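As a concrete illustration of the parsing stage of the web crawler described above, the sketch below extracts thread titles from a forum page using Python's standard `html.parser`. The page markup, the `thread-title` class name and the titles themselves are entirely hypothetical; real Darknet forums vary widely and any selector would need adapting per site.

```python
# Sketch of a crawler's parsing stage. The forum markup below is invented:
# it assumes each post title sits inside <a class="thread-title">...</a>.
from html.parser import HTMLParser

class ThreadTitleParser(HTMLParser):
    """Collect the text of every <a class="thread-title"> element."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "a" and dict(attrs).get("class") == "thread-title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "a":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.titles.append(data.strip())

# A hypothetical fragment of a fetched forum page.
sample_page = """
<div class="thread"><a class="thread-title">Selling 0day RCE exploit</a></div>
<div class="thread"><a class="thread-title">How do I set up a VPN?</a></div>
"""

parser = ThreadTitleParser()
parser.feed(sample_page)
print(parser.titles)  # -> ['Selling 0day RCE exploit', 'How do I set up a VPN?']
```

In a full system the extracted titles and post bodies would be stored and passed to the classification stage discussed later.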

In summary, the specific contributions of this study include:

1) A brief introduction to the Darknet, machine learning, data mining and cyber threat intelligence.
2) A review and discussion of current literature surrounding these fields.
3) A methodology for creating a system for gathering primary research and cyber threat intelligence from the Darknet.
4) An evaluation of this system, the primary data found and its effectiveness in predicting threats.
5) A discussion of this study, findings and a review of the primary research for future development.

1.1. The Darknet

This study will define the term “Clearnet” as web pages that can be accessed by standardised web browsers, such as Google Chrome and Safari. Every web page that can be accessed by Clearnet search engines has been indexed and resolves to a 32-bit (IPv4) or 128-bit (IPv6) address through the Domain Name System (DNS). Whilst the number of Clearnet web pages is in the millions and growing day by day, this does not include non-indexed, or ‘hidden’, sites. Omand (2015) cites that Clearnet web pages contribute only around 1/500th of the internet; the rest is contained within the various layers of the internet. Ciancaglini, Balduzzi, Goncharov & McArdle (2013) define the term “Deepweb” as sites that are not indexed by search engines, particularly the following:

  • Dynamic web pages: Pages dynamically generated on the HTTP request.
  • Blocked sites: Sites that explicitly prohibit a crawler from retrieving their content by using CAPTCHAs, pragma no-cache HTTP headers, or ROBOTS.TXT entries, for instance.
  • Unlinked sites: Pages not linked to any other page, preventing a Web crawler from potentially reaching them.
  • Private sites: Pages that require registration and log-in/password authentication.
  • Non-HTML/Contextual/Scripted content: Content encoded in a different format, accessed via Javascript or Flash, or context dependent (i.e., a specific IP range or browsing history entry).
  • Limited-access networks: Content on sites that are not accessible from the public internet infrastructure (Ciancaglini, Balduzzi, Goncharov & McArdle, 2013)

Notably, what is discussed as a ‘deeper’ level to the Deepweb is that of “Darknets”. Darknets, and alternative routing infrastructures, consist of websites that require systems such as TOR, or those hosted on Invisible Internet Project (I2P) networks (Ciancaglini, Balduzzi, Goncharov & McArdle, 2013). Specifically, The Onion Router (TOR) routes traffic through a chain of nodes, wrapping the information in layered encryption; each node removes one layer and blindly passes the traffic to the next, so that no single node knows both the user and the destination, and no registration of the user is required. As suggested by Mansfield-Devine (2009), it is very common for hackers and cyber-dependent criminals to operate across all of the layers discussed. However, it is apparent that the Deepweb and Darknet provide a greater level of anonymity and security for those conducting illegal activity. Thus, preliminary reconnaissance for CTI operates primarily on these networks.
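In practice, a crawler reaches such sites by routing its HTTP traffic through a local TOR client. The sketch below builds the relevant proxy configuration; it assumes a TOR client listening on the default SOCKS port 9050, and the .onion address in the usage comment is invented. The `socks5h` scheme matters because hostname resolution must happen inside the TOR network for .onion addresses to resolve at all.

```python
# Sketch: proxy configuration for routing HTTP requests through a local
# Tor SOCKS proxy. Assumes a Tor client on 127.0.0.1:9050 (the default).

def tor_proxies(host="127.0.0.1", port=9050):
    """Proxy mapping for HTTP libraries such as `requests`.

    socks5h (rather than socks5) resolves hostnames through the proxy,
    which is required for .onion addresses.
    """
    return {
        "http": f"socks5h://{host}:{port}",
        "https": f"socks5h://{host}:{port}",
    }

print(tor_proxies())

# Usage (requires the third-party `requests` and `pysocks` packages,
# and a running Tor client; the address is purely illustrative):
# import requests
# html = requests.get("http://exampleforum.onion/", proxies=tor_proxies()).text
```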

1.2. Machine learning, data mining and text feature extraction

Machine learning, a subfield of artificial intelligence, is an ever-expanding approach to automation used across multiple applications in industry today (Kononenko & Kukar, 2013). Its uses can be seen in the likes of medicine, economics and the natural and technical sciences, to name a few. As technology has developed over the past 20 years, the collection and analysis of data has become vital in modern research. Machine learning encompasses systems that learn from this data: “Learning rules, functions, relations, equation systems, decision and regression trees, Bayesian nets, neural nets, etc” (Kononenko & Kukar, 2013). What is defined as ‘data mining’ is itself a process that draws on machine learning. It encompasses the method of extracting information in order to learn patterns, theories, predictions and models from large data sets. Data mining is a multi-faceted area and, in addition to machine learning, includes statistics, artificial intelligence, databases, pattern recognition and data visualisation (Li, 2014). Thus, it is important to state that the process of data mining, or Knowledge Discovery in Databases (KDD), covers a multitude of techniques, of which machine learning is one. Furthermore, data mining involves many different steps to be repeated and refined in order to provide high levels of accuracy and prediction in data analysis, as seen in figure 1.

There is currently no single standardised framework for carrying out data mining. That said, the Cross-Industry Standard Process for Data Mining (CRISP-DM) defines one framework for the data mining process across multiple industries. As cited by Jain (2012), the main tasks of data mining include:

1. Classification: Assigning a data item to a predefined class
2. Estimation: Determining a value for unknown continuous variables
3. Prediction: Classifying records according to estimated future behaviour
4. Association: Identifying items that occur together
5. Clustering: Dividing a population into subgroups or clusters
6. Description & Visualisation: Representing data (Jain, 2012)

Thus, the process and definition of data mining involves the extraction of knowledge from data. Where machine learning sits within data mining is in the automation of the methods used. As discussed by Kononenko and Kukar (2013):

“Machine Learning cannot be seen as a true subset of data mining, as it also encompasses other fields, not utilised for data mining”

The knowledge gained through differing machine learning techniques varies with the intended outcome. Machine learning encompasses three main categories: Unsupervised, Supervised and Reinforcement Learning (see figure 2). Derivatives of these include classification, regression, clustering, and the learning of associations, logical relations and equations (Kononenko and Kukar, 2013).

Unsupervised learning models are defined by applying data mining algorithms to identify patterns and structures in the attributes of a data set. Supervised learning applies specified or provided variables for the algorithms to predict outcomes.
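The distinction can be sketched with a toy example, using invented two-dimensional “post feature” vectors (none of this data comes from the study): supervised learning classifies a new point against provided labels, here with a one-nearest-neighbour rule, one of the model families used later in this paper, whereas unsupervised learning simply groups unlabelled points by proximity.

```python
# Toy illustration of the supervised/unsupervised distinction.
# The feature vectors and labels below are invented for illustration.
import math

def dist(a, b):
    return math.hypot(a[0] - b[0], a[1] - b[1])

# Supervised: labels are provided; new points are classified against them
# with a 1-nearest-neighbour rule.
labelled = [((0.9, 0.8), "threat"), ((0.1, 0.2), "benign"), ((0.8, 0.9), "threat")]

def predict(point):
    return min(labelled, key=lambda item: dist(point, item[0]))[1]

# Unsupervised: no labels; points are grouped purely by proximity.
def cluster(points, radius=0.3):
    groups = []
    for p in points:
        for g in groups:
            if any(dist(p, q) <= radius for q in g):
                g.append(p)
                break
        else:
            groups.append([p])
    return groups

print(predict((0.95, 0.85)))                                # -> threat
print(len(cluster([(0.0, 0.0), (0.1, 0.1), (0.9, 0.9)])))   # -> 2
```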

1.3. Cyber Threat Intelligence

Modern day cyber security is a complex and multifaceted challenge across numerous dimensions of cyber space. Whilst traditional approaches to security, involving the limitation of vulnerabilities and purging of known exploits, have long been effective, it is a constant challenge to keep up with attackers. Effective strategies to proactively counter cyber-attacks encompass the detection of future threats and of attackers’ behaviours, intent and accessibility. The proactive nature of providing CTI intends to predict and eliminate vulnerabilities and exploits. Furthermore, it greatly improves the ability to react to attacks when they occur. Because of this, CTI is a very complex field and is applied in many different ways. Threat intelligence organisations gain an understanding of future attacks through a multitude of methods, including the monitoring of endpoint data to record threat actors in a system. However, over the last five years, the use of artificial intelligence and machine learning to gather pre-reconnaissance data has rapidly grown. ABI Research speculate that “machine learning in cybersecurity will boost big data, intelligence, and analytics spending to $96 billion by 2021” (ABI-Research, 2017). Due to the large number of data sources found within the cyber infrastructure, there are many avenues in which machine learning can be used for CTI. These include anomaly, botnet and phishing detection, as well as active authentication (Epishkina & Zapechnikov, 2016). The gathering of threat intelligence can be regarded as the first line of defence in the cyber security infrastructure, with the second line of defence involving reactive security systems such as intrusion detection systems (IDSs) and mitigation techniques (Epishkina & Zapechnikov, 2016).

The purpose of being proactive with CTI means that the security industry faces a great challenge with zero-day attacks. A zero-day attack is an unknown exploit exposing a flaw in software or hardware. As a result, these attacks leave virtually no opportunity for detection in the security process. This of course means that cyber security professionals are continually looking for new methods of discovering attacks. One particular source for these exploits lies within hacking communities and market places, across the multiple layers of the internet. The commoditisation of exploits is increasing day by day, and illegal online marketplaces, forums and chatrooms are not uncommon. It is estimated that Silk Road, a market place for illegal goods, generated over $1.2 million in transactions in 2011 (Armona & Stackman, 2014). Furthermore, it is not unreasonable to state that because exploit information has a very small marginal production cost, it is a very valuable commodity for those selling it. What is apparent, however, is the time-sensitivity of these exploits. Companies and government organisations are constantly updating software to stay up to date with technology and implement fixes. It may be the case that an exploit becomes worthless because an organisation updates its software; thus the timing of a sale is very important. Additionally, because zero-day exploits are unique, there are many questions surrounding the legitimacy of any given exploit. As a result, the information gathered concerning zero-day exploits is massively valuable to the cyber security industry, and CTI is particularly focused on gathering as much of this information as possible.

2. Literature Review

As previously discussed, the primary purpose of cyber threat intelligence (CTI) is to help organisations discover and understand potential risks from differing threat actors. Having briefly mentioned zero-day exploits, it should be noted that threat actors come in many shapes and forms. CTI is required to contain in-depth information about an attack or threat in order to help an organisation remedy its security and safeguard against it. Its function within the military, government, business and security provides a strategic advantage against attackers. CTI as a component of cyber security typically includes attacks from three areas:

  • Cyber crime
  • Cyber hacktivism
  • Cyber espionage (Planqué, 2017)

However, as the number of threats grows across an ever-expanding domain, understanding what CTI truly entails becomes increasingly vague. The lack of clear academic literature, and businesses using CTI to define their products, lead to an unclear definition of the term (Planqué, 2017). Section 2.1 will review, compare and discuss various academic texts in order to establish a clear understanding of what is meant by CTI within this paper and its primary research.

The gathering of threat intelligence data from sources such as the Darknet is a developing approach to proactive threat detection, and the collection of information on threats through data mining and machine learning techniques is increasingly apparent. As this area develops and advances further, new and diverse techniques for monitoring illegal activity on the Darknet emerge. Various methods, such as association rules, time-series analysis, clustering, and statistical and correlation analysis, can be applied to intelligence data to provide valuable information on threats. In addition, approaches to text classification using both supervised and unsupervised models allow for great accuracy in predicting outcomes (McCallum & Nigam, 2005). Drawing from academic research, Section 2.2 will cover the application of data mining and machine learning techniques, focusing the literature review on research that relates to predictive data mining and machine learning models.

Text feature extraction involves the use of data mining and machine learning applications, as previously discussed, with large sets of text. It has been applied across many industries; its application to sentiment analysis is an important part of marketing and sales in the digital world (Pang & Lee, 2008). Essentially, this necessitates the extraction of features in text created by users, in order to predict an opinion, sentiment or subjectivity. Its use within threat intelligence is relatively new, and thus Section 2.3 will review, compare and discuss research from across multiple industries, providing an understanding of the functionality of text feature extraction that can be applied to the model in the later sections of this paper.

2.1. Cyber Threat Intelligence

As previously discussed, the use of CTI spans many differing sectors and involves multiple attack vectors. Due to the broad spectrum of threats and actors within cyber space, what is defined as CTI can be somewhat unclear. It is essential to have a clear understanding of what threat intelligence in cyber space entails and what information can be regarded as intelligence. Hutchins, Cloppert & Amin (2012) discuss what they describe as a ‘kill chain analysis’ of a cyber-attack in order to understand what information, or ‘intelligence’, is involved. A reconstruction of an intrusion is detailed in Figure 3.

Intelligence can be gathered at any given stage in this kill chain. Furthermore, Hutchins, Cloppert & Amin detail that if an analyst discovers intelligence at any stage inside the kill chain they can assume that the prior phases have already been actioned. Thus, a complete analysis of current and prior phases must be done to mitigate future attacks. If the prior stages cannot be reproduced, it is not unreasonable to state that action on the current phase will be difficult. Organisations may define threat intelligence at different stages within the kill chain. As discussed by Hutchins, Cloppert & Amin, an intrusion detection system may discover a threat at stage 5, as seen in Figure 4.

A notable example is the detection of a virus on a system. If a virus scanner detects a virus, it is apparent that this is already at stage 6 and has been through every stage beforehand. Whilst this can still be regarded as threat intelligence, Hutchins, Cloppert & Amin define it as late-phase detection in the kill chain. Moreover, in order for intelligence to be as effective as possible, defenders must move their analysis and detection up the kill chain. Figure 5 shows that not only has the detection of the threat come at an earlier stage, there are also fewer phases to reproduce for the mitigation of the attack, as described in Figure 4.

With this said, a general premise can be made: threat intelligence gathered within the earlier stages of the kill chain is more proactive, effective and actionable.

In a report for the Centre for the Protection of National Infrastructure (CPNI), MWR InfoSecurity (2015) define threat intelligence as information that can be acted upon to change outcomes. An example notes the use of ‘Knowns’ and ‘Unknowns’ within intelligence; thus, with more intelligence, the ‘Unknown Unknowns’ of an attack move toward ‘Known Knowns’ (Figure 6).

MWR argue that the definition of CTI is unclear due to it being a young field. In addition, vendors and advisory papers describe CTI differently because of their products and activities. With regard to traditional intelligence (discussed previously), they go on to define threat intelligence as information that can aid a decision in order to prevent an attack, or decrease the time taken to discover one. Additionally, MWR InfoSecurity (2015) note that the subtypes of CTI are Strategic Threat Intelligence and Operational Threat Intelligence:

Strategic Threat Intelligence (STI) entails high-level information for board-level or senior decision makers within an organisation. This level of intelligence may not be technical; however, it will define a threat’s impact on the organisation, e.g. financial.

Operational Threat Intelligence (OTI) entails information more specific to an attack. Typically, this includes technical details that allow, for example, security teams to handle an attack. According to MWR, Operational Threat Intelligence varies depending on the sector. For instance, a business may wish to have intelligence on a potential attacker; however, various restrictions, such as the law, may prevent it from gathering that information. On the other hand, government organisations may have access to this level of information and thus their OTI is at a higher level.

Because of this, when discussing OTI and, more generally, CTI within the public sector, the military’s general definition of CTI is regarded differently due to the nature of its intelligence. The Ministry of Defence (2016) defines CTI as activities in cyber space to gather intelligence on target and adversary systems in order to support military operations. Thus, it is notable that CTI within the military may not necessarily relate to a potential attack, and may relate more to attacks implemented by the military themselves. Notably, the United States Department of Defense (2017) defines intelligence as information in direct support of current or future operations.

Moving into what threat intelligence actually entails, Barnum (2014) notes that traditional intelligence seeks to understand a threat’s capacities, actions and intent. Thus, when discussing CTI, identified items may include:

  • Prior actions
  • Occurred actions
  • Potential actions
  • Detection or identification
  • Mitigation
  • Relevant threat actors
  • Intent
  • Capabilities
  • Tactics, techniques and procedures (TTP)
  • Vulnerabilities
  • Misconfigurations
  • Weaknesses

A greater understanding of these items allows a more holistic and effective decision to be made on the attack. What is detailed in Barnum’s research, and that of Hutchins, Cloppert & Amin, is essentially a contextual understanding of where CTI may be gathered from and at which point in an attack intelligence is defined. Early-phase intelligence is what is generally understood as CTI within corporate fields, according to Hutchins, Cloppert & Amin (2012).

It is notable that when discussing CTI the context of its purpose or intention must be defined clearly in order to have a clear understanding of its definition. Because of this, it is apparent that CTI within differing sectors has a multitude of different domains.

2.2. Darknet Data Mining and Machine Learning

A number of researchers have applied data mining and machine learning techniques to Darknet data in order to provide some level of threat intelligence. Thonnard & Dacier (2008) present a multi-dimensional data mining model to provide information on emerging attack threats from honeypot1 data. Their model includes differing attributes of the data, such as geographical features, temporal features and IP subnets. The framework discovers patterns in the data through time-series correlation and clustering between attacks, with the use of trace comparisons, thus proposing correlations between similar or grouped preceding attacks in order to predict potential threats. This multi-dimensional Knowledge Discovery and Data Mining (KDD) model aims to provide actionable knowledge about the internet. The objective of their methodology is to highlight indicators for assessing the commonness of malicious activities and to provide insight into emerging threats. The methodology resulted in a system that extracted meaningful items of data by mining large honeypot data sets. This information is then ‘synthesised’ in order to extract relevance and predictions, and the experiment concludes by providing attack threat intelligence with significant insight. Thonnard & Dacier (2008) expand on this research in their subsequent journal article ‘A framework for attack patterns’ discovery in honeynet data’. In that research, they apply the same framework to clique-based clustering methods for domain-specific KDD models, and to the analysis of data for identifying the activity of worms2 across the internet. Their clustering model enables the identification of multiple worms and botnets in traffic collected by honeypots. Both frameworks presented by Thonnard & Dacier make important observations about the prevalence of threat intelligence in large Darknet data sets. Their KDD models provide a high degree of accuracy; however, it is worth noting that their research is based on previously extracted honeypot data sets. Thus, whilst these models can provide proactive threat intelligence, they may not have the ability to detect new and emerging unseen threats due to the age of their data.
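The time-series correlation step of such a framework can be sketched as follows, using invented daily attack counts from two honeypot sensors; a Pearson coefficient near 1 marks the two traces as candidates for the same attack cluster. This is only an illustration of the idea, not Thonnard & Dacier's implementation.

```python
# Sketch: correlating attack-count time series from two honeypot sensors.
# The daily counts below are invented for illustration.
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

sensor_a = [3, 5, 9, 40, 42, 8, 4]   # attacks per day, sensor A
sensor_b = [2, 6, 11, 38, 45, 7, 5]  # attacks per day, sensor B

# Both sensors saw the same mid-week spike, so the coefficient is close
# to 1 and the traces would be grouped into the same attack cluster.
print(round(pearson(sensor_a, sensor_b), 2))
```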

Similar clustering methods are also discussed by Fachkha et al. (2012). In this paper, an association rule KDD model is presented to explore the correlation between cyber threats using Darknet data. The model analyses packet distributions, transport, network and application layer protocols, as well as resolved domain names. Fachkha et al.’s method performs characterisation and traffic profiling. Specifically, it identifies and monitors Darknet protocol distributions to indicate potential Distributed Denial of Service (DDoS) attacks, buffer overflow exploits and unsolicited Virtual Private Network (VPN) access. The presented work results in an effective means of interpreting threat patterns and building threat predictions through its KDD model. In review of this work, it is notable that, again, the data used was previously collected and may not necessarily provide accurate results for real-time threat analysis. Whilst its effectiveness for threat intelligence is obvious with the data set used, it may not provide the insight needed to predict features of nearer and emerging threats within current Darknet traffic.
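The association-rule idea can be sketched with invented traffic "transactions": each transaction is the set of features observed together in one traffic sample, and a candidate rule X implies Y is scored by its support and confidence. The feature names below are illustrative only, not taken from Fachkha et al.

```python
# Sketch of association-rule mining over co-occurring traffic features.
# Each transaction is the feature set of one (invented) traffic sample.
transactions = [
    {"udp", "port_53", "spoofed_src"},
    {"udp", "port_53", "spoofed_src"},
    {"tcp", "port_80"},
    {"udp", "spoofed_src"},
]

def support(itemset):
    """Fraction of transactions containing every item in `itemset`."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(lhs, rhs):
    """P(rhs present | lhs present), estimated from the transactions."""
    return support(lhs | rhs) / support(lhs)

# Candidate rule: UDP traffic with a spoofed source also targets port 53
# (a pattern one might associate with DNS amplification DDoS traffic).
print(support({"udp", "spoofed_src"}))                   # -> 0.75
print(confidence({"udp", "spoofed_src"}, {"port_53"}))   # -> 0.666...
```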

In more recent research, Robertson (2017) introduces an operational system for providing real-time threat intelligence from Darknet data. Their model extracts data from marketplaces in order to analyse KDD models with machine learning algorithms such as Naïve Bayes and Support Vector Machine. This methodology provides threat intelligence on products and services focused on malicious hacking on Darknet websites, similar to the system studied and presented in this paper. The system they present provided a predictive machine learning approach with 78-82% accuracy. This paper presents one of the most relevant pieces of work in relation to this study. What is notable from this research is that its focus is primarily on products, as opposed to forum discussion. Thus, whilst it may present a very effective CTI tool, it may not review and predict threats from the more general discussion logs of Darknet users. That said, the model does provide threat analysis through a proactive and real-time methodology. The work presented by Robertson (2017) is a very valuable and recent piece of research relevant to the topic at hand. Work relevant to this research is very sparse, and it thus presents a unique approach to Darknet CTI.
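As a rough illustration of the kind of classifier reported in this line of work, the sketch below implements a toy multinomial Naive Bayes over a handful of invented forum posts. This is not Robertson's system: a real pipeline would use a library implementation and far larger labelled corpora.

```python
# Toy multinomial Naive Bayes text classifier with add-one smoothing.
# The training posts and labels below are invented for illustration.
import math
from collections import Counter

train = [
    ("selling zero day exploit for router firmware", "threat"),
    ("fresh botnet source code private sale", "threat"),
    ("how do i configure my email client", "benign"),
    ("looking for a good linux distro", "benign"),
]

class_docs = {}
for text, label in train:
    class_docs.setdefault(label, []).extend(text.split())

vocab = {w for words in class_docs.values() for w in words}

def score(text, label):
    """Log prior + log likelihood with Laplace (add-one) smoothing."""
    words = class_docs[label]
    counts = Counter(words)
    log_p = math.log(sum(1 for _, l in train if l == label) / len(train))
    for w in text.split():
        log_p += math.log((counts[w] + 1) / (len(words) + len(vocab)))
    return log_p

def classify(text):
    return max(class_docs, key=lambda label: score(text, label))

print(classify("private exploit sale"))  # -> threat
```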

1. Honeypots are information system traps used for attracting and monitoring malicious attacks on the internet in order to gain information on them (Jin, de Vel, Zhang & Liu, 2008).

2. Worm viruses or internet worms are standalone malware programs that replicate themselves in order to spread across a network (Nazario, 2004).

2.3. Text Classification and Sentiment Analysis

Text classification and sentiment analysis aim to identify context and subjective information from text-based data, in order to determine or predict patterns and sentiment. As discussed by Ikonomakis, Kotsiantis & Tampakas (2005), text classification generally consists of the following steps:

i) Read documents
ii) Tokenize texts
iii) Stemming
iv) Stop word deletion
v) Vector representation of text
vi) Feature selection
vii) Apply to supervised learning algorithms
viii) Measure of accuracy
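The steps above can be sketched in a few lines of Python. This is a minimal, self-contained illustration only: the stop-word list, the crude suffix-stripping "stemmer" and the toy documents are assumptions for demonstration, not the study's actual implementation.

```python
import re

STOP_WORDS = {"the", "a", "an", "is", "for", "of", "to"}  # tiny illustrative list

def tokenize(document):
    # ii) Tokenize: lower-case and split on runs of non-alphanumeric characters
    return [t for t in re.split(r"[^a-z0-9]+", document.lower()) if t]

def stem(token):
    # iii) Stemming: crude suffix stripping, standing in for a real stemmer
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def preprocess(document):
    # iv) Stop-word deletion, applied after tokenizing and before stemming
    return [stem(t) for t in tokenize(document) if t not in STOP_WORDS]

def vectorize(documents):
    # v) Vector representation: bag-of-words term counts over a shared vocabulary
    vocab = sorted({t for d in documents for t in preprocess(d)})
    vectors = []
    for d in documents:
        tokens = preprocess(d)
        vectors.append([tokens.count(term) for term in vocab])
    return vocab, vectors

vocab, vectors = vectorize(["Selling zero day exploits", "Selling stolen accounts"])
```

The resulting count vectors are what steps vi)-viii) would then feed into feature selection and a supervised learning algorithm.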

In this paper, Ikonomakis et al. discuss that classifying text documents does not differ greatly from more common machine learning tasks. However, it is noted that one of the largest problems with text classification is the sheer number of features presented within a text document and its instances.

To solve a challenge such as this, Li et al. (2005) present a classification methodology using positive and unlabelled text data. Rather than using a training dataset containing both positive and negative labelled examples, their model uses two classes, one labelled positive and the other unlabelled. The findings show that a single labelled class is sufficient to predict the unlabelled data: it is unnecessary for a training set to contain both negative and positive classes, as a training set containing one labelled class and one unlabelled class achieves roughly the same degree of accuracy.

Another response to this challenge is presented by Liu, Li, Lee & Yu (2004). They discuss text classification using labelled words as opposed to documents. It is noted that, in order for a classifier to be relatively accurate when reading text documents, the relevance of words in a class must be of high polarity. Their model proposes a less labour-intensive method of text classification which, in some instances, can provide more accurate and effective results.

When discussing the use of text classification, it is important to highlight the complexity of building training datasets. With a well-trained dataset, unlabelled or test datasets may be classified to a higher accuracy. Including those previously discussed, methods of text classification have been improved by the use of lexicons. Al-Rowaily, Abulaish, Al-Hasan Haldar & Al-Rubaian (2015) present the development of a Bilingual Sentiment Analysis Lexicon (BiSAL), consisting of a sentiment lexicon for both the English and Arabic languages. The key concept proposed by this model is to develop opinion mining and sentiment analysis systems from Darknet forum data, namely those related to radical content. This model contained a list of 279 English and 1019 Arabic sentimentally represented words along with their morphological variants, in addition to their sentiment polarity (related to radical content). It is notable in this research that Al-Rowaily, Abulaish, Al-Hasan Haldar & Al-Rubaian's BiSAL system has numerous applications for cyber security in order to identify and determine the sentiment polarity within text. Furthermore, user-created text can contain many morphological variants of words, and thus having a method of identifying these and applying them to a level of polarity in order to predict a pattern is vital for threat intelligence.

Having a clear understanding as to what problems may occur in text submitted by humans is essential in this study, because the connotation within a sentence may appear different to the human reader than to the machine. One of the most widely discussed fields of text classification and sentiment analysis is in online media: the number of organisations wanting to learn opinions and gather metrics on their products grows as the volume of online data increases. One such paper, written by Michelle Annett and Grzegorz Kondrak (2008), compares machine learning and lexical-based approaches on large sets of movie reviews. Annett and Kondrak identify a key feature and challenge within sentiment analysis and text feature extraction: thwarted and negated expressions in the text. A thwarted expression is one that comprises a number of words whose polarity conflicts with what the user is in fact expressing. An example being:

“Johnny Depp was alright. The previous two pirate movies were unrealistic and boring. The plot was awful. However, the special effects made the third pirate movie excellent” (Michelle Annett and Grzegorz Kondrak, 2008)

The negative words in this statement may result in a machine incorrectly recognising the statement as negative, whereas in fact the statement is positive. Furthermore, they note that a negated statement consists of a negating word with a noun, adjective, adverb or verb. The authors also note that whilst it may be easy for humans to identify the polarity of a full statement, a machine may only be able to identify the polarity of individual words, such as ‘enjoyable’ and ‘entertaining’. Therefore, prediction is a difficult task for a program when dealing with thwarted and negated words. Understanding problems such as this ahead of building a text classifier can support accuracy and help limit possible problems when building an accurate classifier.
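A naive word-counting classifier makes this failure mode concrete. Using a toy polarity lexicon (an assumption for illustration only, far smaller than any real sentiment resource), the thwarted review above contains more negative than positive words, so simple polarity counting mislabels it:

```python
# Toy polarity lexicon; a real system would use a much larger resource
POSITIVE = {"excellent", "enjoyable", "entertaining", "alright"}
NEGATIVE = {"unrealistic", "boring", "awful"}

def naive_polarity(text):
    # Count polar words with no handling of thwarting or negation
    words = text.lower().replace(".", " ").replace(",", " ").split()
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    return "positive" if pos > neg else "negative"

review = ("Johnny Depp was alright. The previous two pirate movies were "
          "unrealistic and boring. The plot was awful. However, the special "
          "effects made the third pirate movie excellent")
label = naive_polarity(review)  # mislabels the overall-positive review
```

Here the three negative words outvote the two positive ones, so the classifier returns "negative" even though the overall sentiment is positive.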

2.4 Review

The prior three sections reviewed various pieces of academic research in the three different subsections used in this study to present a method for CTI. From the literature, the following concluding points are drawn in order to provide an informative overview moving forward.

  • From the research presented by Hutchins, Cloppert & Amin (2012), it is learnt that determining an attack during its reconnaissance, weaponization and delivery phase provides a more proactive and effective method of CTI.
  • MWR InfoSecurity (2015) describes the subtypes of CTI as including STI and OTI, discovering that STI presents CTI at a more generalised level of threat intelligence, whereas OTI focuses more specifically on the technical details.
  • Barnum (2014) supports the definition as to what items are identified in CTI.
  • Thonnard & Dacier (2008) present two frameworks for applying data mining and machine learning to the Darknet, in order to discover time-series correlations and clustering between attack actors. This research utilises honeypot data over a large period of time.
  • Fachkha et al (2012) build an association rule KDD model in order to discover correlations between cyber threats using Darknet data. Specifically focusing on identifying indicators within traffic. This model identifies attacks such as DDoS, buffer over-flow and unsolicited VPN access.
  • Robertson (2017) introduces a concept for operational real-time threat intelligence from Darknet data. Their model extracts data from Darknet communities applying standardised machine learning algorithms. The model presented correctly predicts threats to a 78%-82% accuracy.
  • Ikonomakis, Kotsiantis & Tampakas (2005) detail the general text classification process and note that text classification has a large challenge due to the number of features within a text document.
  • Li et al (2005) and Liu, Li, Lee and Yu (2004) propose a more efficient method of building a text classification model by firstly considering the labelling of words rather than documents and by using two classes, one labelled positive the other unlabelled.
  • Al-Rowaily, Abulaish, Al-Hasan Haldar & Al-Rubaian (2015) discover that BiSAL can provide text classification methods with more accurate results with the application of morphological variants within a list of words, identified with a polarity indicator.
  • Finally, challenges within text classification often occur because of thwarted and negated expressions, discussed by Annett & Kondrak (2008).

3. Methodology

What has been discussed thus far in this paper presents the knowledge that CTI must be gathered from the early stages of an attack for it to be effective and efficient (Hutchins, Cloppert & Amin, 2012). It is understood that the Darknet has become a central hub for discussions and sales of these attacks and is thus filled with vital information for CTI. The intended outcome is to build a working system to provide CTI from Darknet hacking forums and be able to predict threats from new data. This will be done by drawing from the research studied, the collection of primary data and analysis of results. Figure 7 provides an overview of the CTI system presented in this paper in order to collect primary research, determining the effectiveness of a CTI system such as this. The following sections break down and detail each step of the primary research method presented above.

3.1. Accessing Data

One of the first stages necessary for this model is to identify target websites that contain information relevant to cyber threats. This is a manual process intended to seek out forums and discussions on the Darknet that have a high potential for content related to zero-day attacks. In order to make a correct and decisive judgement as to which websites are relevant to this study, a review of the functionality of each site is required, in order to gain an understanding of its legitimacy for exploit information. It is apparent that many Clearnet, as well as Darknet, websites contain exploit databases freely accessible to the public (Varsalone, McFadden & Morrissey, 2012). After reviewing Clearnet and Darknet websites, it is found that exploit information from easy-to-access websites is of a lesser threat to cyber security and often indicates a position of late-phase detection in the kill chain process (Hutchins, Cloppert & Amin, 2012). Essentially, a large amount of the exploit information found can easily be discovered by the owners of the software, system or website. Thus, in order to access and extract valuable threat indicators, access to hard-to-reach websites with more specific key details is required. According to Su & Pan (2016), these valuable exploit and vulnerability websites often contain some of the following attributes:

  • Publication of exploit information — Sites will either publish vulnerability information openly or restrict the degree to which it can be accessed or viewed.
  • Target provision — Assessed targets according to the information of the vulnerability.
  • Financial contribution — The site may require users to pay for access or purchase vulnerabilities (typically with Bitcoin currency).
  • Verification process — Typically these sites will either contain an automated verification process, pledge system or invite only.

In order to access these websites, connection will be made using TOR, as discussed in Section 1 of this paper. The following Darknet communities contain such attributes and thus will be used to extract data in order to build and evaluate the CTI system studied in this paper. Below is a detailed reasoning behind the selection of websites used in this study and the method of access:

Please see full publication for the rest of this section.

3.2. Collecting Data

The implementation of a web crawler for each of the three websites highlighted above is required in order to extract the relevant data necessary for this study. Crawlers are programs built to automatically navigate across the web in order to retrieve data (Jain & Bansal, 2014). Typically, a crawler will interact with a web source and extract the data stored within it, i.e. the HTML of a web page (Ferrara, De Meo, Fiumara & Baumgartner, 2014). The application of crawlers allows for the collection of large amounts of data from the web, autonomously. However, as noted by Zheng, Wu, Cheng, Jiang & Liu (2013), crawling Deep/Darknet websites presents a number of challenges. These include access to particular websites and the lack of any index for these sites. In a supporting statement, Jain & Bansal (2014) cite:

“In this kind of breadth-oriented crawling, the challenges are locating the data sources, learning and understanding the interface and the returned results so that query submission and data extraction can be automated.”

It is apparent that, for the successful extraction of data from the Darknet, a unique crawler must be built for each target website to precisely extract the data needed. The focused method for this is to use Python and the Scrapy crawling framework. Scrapy is an open-source web framework built for scraping data from web sources. The reason for using Scrapy as the premise for the crawler is due to the following benefits, described by Kouzis-Loukas (2016):

  • Event-based architecture — Allowing users to cascade operations in order to clean, form and store data effectively. Additionally, this allows “Disconnect latency from throughput, by operating smoothly while having thousands of connections open” (Kouzis-Loukas 2016).
  • Scrapy allows users to perform requests in parallel. This allows users to scrape a large amount of data from a site in a short period of time.
  • The framework allows the direct use of additional Python frameworks like Beautiful Soup, lxml and Selenium in order to understand broken HTML or confusing encoding.
  • Scrapy provides selectors for high level XPaths. This allows users to extract specific data from websites more precisely.
  • Scrapy has a well-maintained and organised codebase, separating Python modules such as spiders and pipelines, which allows users to update and improve crawlers without difficulty.

The first challenge presented when building the crawlers in this study is that, in order to scrape data from a Darknet address, crawlers must first be able to route into the TOR network. When using a Linux or Mac OS X system, as done in this paper, it is possible to connect to TOR using TOR SOCKS. This allows traffic to connect directly to Tor via the localhost port ‘9050’ (The Tor Project, 2017). With this implementation, it is possible to direct the crawler to ‘.onion’ addresses such as those listed in the previous section.
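As an illustration of the routing idea (not the exact crawler configuration used in the study), traffic can be pointed at Tor's local SOCKS listener on port 9050. The `socks5h` scheme, supported for example by the `requests` library, delegates hostname resolution to Tor itself, which is what makes `.onion` addresses resolvable:

```python
# Tor's default SOCKS5 listener on the local host
TOR_SOCKS_PORT = 9050

def tor_proxies(port=TOR_SOCKS_PORT):
    # 'socks5h' (rather than 'socks5') delegates DNS resolution to Tor,
    # which is required for resolving '.onion' addresses
    url = f"socks5h://127.0.0.1:{port}"
    return {"http": url, "https": url}

proxies = tor_proxies()
# With a running Tor daemon, a request could then be routed as:
#   import requests  # needs: pip install requests[socks]
#   requests.get("http://exampleonionaddress.onion/", proxies=proxies)
```

The hypothetical `.onion` address above is a placeholder; the actual request is left commented out since it requires a live Tor daemon.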

Another challenge faced is within the code of each of the websites, with a particular focus on websites that contain JavaScript or login pages. It is apparent that all three of the websites used for data within this research contain such methods of verification. Fortunately, this can be overcome by implementing the Selenium Python framework in the crawler. Selenium provides a simple API for users to write functional and acceptance tests for WebDrivers. Thus, it is possible to replicate a user's interaction with a webpage using automated code in the crawler, essentially automating the user-required verification methods (“1. Installation — Selenium Python Bindings 2 documentation”, n.d.).

Looking specifically at what will be extracted from the websites, there are specific items required for threat intelligence. The particular format of each website, in this case, contains the following: topic title, topic post, author, time/date and rating. With these items it is possible to gather a large amount of information from each post, with particular focus on the topic title and topic post. When applying text feature extraction using machine learning approaches, a focus will be made on extracting indicators from topic titles and posts.

3.3. Parsing, Cleaning and Labelling Data

It is apparent, after a general review of the forums and discussion boards, that there is a large amount of text that may serve as noise to the classifier. One particular example of this is within topic titles containing symbols (e.g. *>PRODUCT<*). Instances such as this may impact the classification significantly and thus it is required to parse this data without noise. For this, a method of removing non-alphanumeric characters from the data is implemented for both the topic title and topic post classes.
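A minimal sketch of this cleaning step, assuming titles such as `*>PRODUCT<*` should reduce to their alphanumeric content:

```python
import re

def strip_noise(text):
    # Replace every run of non-alphanumeric characters with a single space,
    # then trim the result and lower-case it for later tokenizing
    return re.sub(r"[^A-Za-z0-9]+", " ", text).strip().lower()

cleaned = strip_noise("*>0day EXPLOIT // Windows 10 RCE<*")
```

This keeps digits (version numbers and identifiers like "0day" carry signal for threat classification) while discarding decorative symbols.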

As this data is created by users, one must be aware that misspellings and changes in the composition of words may occur. Thus, discovering the best method of overcoming this challenge will be a part of this study. Methods evaluated include the use of word stemming and n-grams. In 1994, n-grams were noted as being a viable means of dealing with ASCII noise in text inputs (Trenkle & Cavnar, 1994). Furthermore, Trenkle & Cavnar (1994) additionally quote:

“The key benefit that N-gram-based matching provides derives from its very nature: since every string is decomposed into small parts, any errors that are present tend to affect only a limited number of those parts, leaving the remainder intact.”

To explain, an n-gram consists of an n-character portion of a larger string. To give an example of how n-grams can support this machine learning process, Trenkle & Cavnar (1994) demonstrate n-grams on the word “TEXT” (“_” representing blanks):

bi-grams: _T, TE, EX, XT, T_
tri-grams: _TE, TEX, EXT, XT_, T_ _
quad-grams: _TEX, TEXT, EXT_, XT_ _, T_ _ _
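The decomposition above can be reproduced with a short function, padding the word with one leading and n-1 trailing blanks as in the Trenkle & Cavnar example:

```python
def char_ngrams(word, n):
    # Pad with one leading blank and n-1 trailing blanks ('_'),
    # then slide an n-character window across the padded string
    padded = "_" + word + "_" * (n - 1)
    return [padded[i:i + n] for i in range(len(word) + 1)]

bi = char_ngrams("TEXT", 2)    # ['_T', 'TE', 'EX', 'XT', 'T_']
tri = char_ngrams("TEXT", 3)   # ['_TE', 'TEX', 'EXT', 'XT_', 'T__']
quad = char_ngrams("TEXT", 4)  # ['_TEX', 'TEXT', 'EXT_', 'XT__', 'T___']
```

Because a misspelling corrupts only the few n-grams that overlap it, the remaining n-grams of a word like "expliot" still match most of those of "exploit", which is the robustness property quoted above.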

However, it is important to note that the n-gram tokenizer is but one solution. Thus, as part of this study, an analysis of different stemming algorithms will be made to see their impact on classification accuracy. The Lovins Stemmer, Snowball stemmers and PTStemmers will be briefly investigated.

In order to classify data, it is required to first build a training dataset for the machine to learn from. To build a training dataset it is essential to firstly classify a set amount of data for the machine to use; typically this process is referred to as class labelling (Flach, 2015). This study will consist of labelling data according to whether or not it is regarded as a threat (e.g. ‘Relevance = {yes, no}’). Whilst researchers have previously discussed methods of automating the labelling process, each individual post must be given careful consideration before labelling it (Sebastiani, 2002). Expanding on this, examples of how the identification of a threat is made are highlighted below. These examples are taken from the websites in question:

Topic post: “Invoice Manager 3.1 — Cross-Site Request Forgery (Add Admin) Vulnerability” Relevance: No

This product refers to a vulnerability found in an invoice management web application and PHP-based plugin for web browsers, distributed as a free exploit. In this instance, it is considered a low-level threat for the following reasons. Firstly, as discussed previously, organisations are constantly updating software, and that is particularly apparent within online plugins and applications; this means that the vulnerability is very time-sensitive. The vulnerability being freely accessible to the public also means that it is highly likely the organisation will identify this exploit. Furthermore, the vulnerability itself allows a hacker to add an admin to the application using a cross-site request. Whilst this may be a problem for the account holders, it doesn't present an immediate high threat level to users or cyber security.

Topic post: “Windows 10 RCE (Sendbox Escape/Bypass ASLR/Bypass DEP) 0day Exploit” Relevance: Yes

This product can be seen as a critical zero-day exploit. It details a vulnerability in the Windows 10 operating system which allows for remote code execution via any browser, these being Google Chrome, Mozilla Firefox and Opera. Looking in further detail, the exploit type allows for a sandbox escape. Additionally, the exploit itself is of high value, selling for 1.319 BTC or 6,000 USD. Due to the scale and possible impact this exploit could have within the cyber security field, it is noted as a high-level threat.

As seen in the previous two examples, each item labelled must take various considerations into account in order to justify the outcome. With these details in mind, building a training dataset of considerable size, in order to achieve a high classification accuracy, requires particular attention.

3.4. Machine Learning Approaches

Once the training dataset is expertly labelled, machine learning tools are applied to filter data into threats and non-threats. As discussed previously in this paper, machine learning entails the process of learning rules from instances, in this case the training dataset. This principally allows the creation of a classifier for new instances. In this research, supervised learning algorithms will be applied to the dataset in order to build the model.

Supervised learning is the process of deducing a function from labelled training data (Widanapathirana, 2015). Supervised learning algorithms allow the analysis of the instances in the training data, applying a function which can then be used to map new examples (Maglogiannis, 2007). Dissimilarly, unsupervised learning is defined when instances are unlabelled. There is a wide range of supervised learning algorithms available, each with differing functionality. Every algorithm available has both strengths and weaknesses depending on the task. This is often described by the ‘No Free Lunch’ theorem, meaning that there is no single supervised learning algorithm that works best on all problems; each algorithm has different accuracies for different problems (Wolpert & Macready, 1997). Whilst there is a comprehensive number of algorithms to analyse, it would be a large task in its own body of research to evaluate all methods to solve the problem presented in this study. Given that the task at hand is to research a functional system for Darknet CTI, the analysis of its effectiveness will be made using the most common text classification algorithms: the Decision Tree, Bayesian, Instance-Based and Support Vector Machine categories, as noted by Robertson (2017) and Keming & Jianguo (2016). The following sections will cover a brief overview of how these algorithms work relative to text classification, in order to gauge an understanding of their application in building an effective CTI system.

3.4.1. Decision Trees

A decision tree is a structure that classifies instances by sorting nodes based on their values. Maglogiannis (2007) cites:

“Each node in a decision tree represents a feature in an instance to be classified and each branch represents a value that the node can assume. Instances are classified starting at the root node and sorted based on their feature values”

A depiction of a decision tree as a simple flow chart is presented in Figure 8. This decision tree is an example of how text classification can be used to decide the gender of a name. Each internal node is defined as a decision node, which checks a feature value, with leaf nodes assigning labels (“Learning to Classify Text”, n.d.). To assign a label to the input value, one must first start with a root decision node. In this case, the algorithm determines whether or not the last letter within the word is a vowel, based on what it already knows from the training data. This root decision node checks the feature value and selects a branch. This is then met by another decision node, which will again check a value based on the training set and make a decision. This process is continued until a leaf node is met, which then provides a label for the initial input value.

This example obviously presents only a small portion of what a total decision tree would entail within a large dataset, but gives a general explanation of the process. The RandomForest algorithm provides the possibility of an effective method for the classification task in this study (Robertson, 2017). RandomForest takes the initial instances and divides the data into subsets of trees, like the one shown above. Essentially, this provides a more comprehensive approach than using more simplified decision tree algorithms.
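The flow-chart logic described for Figure 8 can be expressed directly as nested conditionals. The specific feature tests below are assumptions for illustration only; in practice each decision node is learnt from the training data rather than written by hand:

```python
VOWELS = set("aeiou")

def classify_name(name):
    name = name.lower()
    # Root decision node: does the last letter belong to the vowel set?
    if name[-1] in VOWELS:
        return "female"              # leaf node reached: assign label
    # Second (hypothetical) decision node on the consonant branch:
    # check another feature value before reaching a leaf
    if name.endswith(("yn", "ch")):
        return "female"
    return "male"
```

A random forest would train many such trees on different subsets of the features and instances, then combine their leaf-node votes into a single prediction.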

3.4.2. Statistical Learning Algorithms

Statistical algorithms comprise an underlying probability model to provide the probability of an instance belonging to a specific class. The two categories of statistical learning algorithms used include Bayesian networks and instance-based methods.

Bayesian Networks

Bayesian Networks belong to a group of graphical models based on probabilistic variables. Each node within the graphical structure represents a random variable, and each variable is represented by a probabilistic dependency, as seen in Figure 9. These dependencies allow the machine to estimate a probabilistic outcome based on what it already knows from the training data (Ben-Gal, 2007).

Figure 9: Example of Bayesian Networks (Thornton, n.d.)

It is worth noting that in many cases the task of using Bayesian Networks is divided into two main categories: learning the structure of the network and determining its parameters (Maglogiannis, 2007). However, it is apparent that constructing a very large network is a challenge in its own right and not effective for this study. Thus, the Naïve Bayes algorithm presents a very simple Bayesian network to work from within the classification task presented.

Instance-Based

Instance-based learning algorithms delay the induction or generalisation process until classification is performed, defining them as lazy-learning algorithms (Maglogiannis, 2007). Crane (n.d.) describes one of the simplest instance-based learning algorithms as being the nearest neighbour algorithm. Figure 10 gives an example of how k-Nearest Neighbour (k-NN) works. The training data created in this experiment consists of positive (yes) and negative (no) relevances to a threat. In this case, xq is the instance to be classified: the 1-Nearest Neighbour algorithm classifies it as positive, while the 5-Nearest Neighbour algorithm classifies it as negative. With 1-Nearest Neighbour, the decision surface created is shown on the right of Figure 10; the areas shown in the image represent the region of instance space closest to each point. The IBk classifier is one such k-NN algorithm that will be used in this study. Maglogiannis (2007) explains this further by stating:

“K-Nearest Neighbour (k-NN) is based on the principle that the instances within a dataset will generally exist in close proximity to other instances that have similar properties. If the instances are tagged with a classification label, then the value of the label of an unclassified instance can be determined by observing the class of its nearest neighbours. The k-NN locates the k nearest instances to the query instance and determines its class by identifying the single most frequent class label.”

Figure 10: Example of Instance-based Learning (Crane, n.d.)
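The k-NN principle in the quotation above can be sketched in a few lines of Python. The two-dimensional points and labels below are invented for illustration; they mirror the Figure 10 setup, in which k = 1 and k = 5 can disagree on the same query instance:

```python
from collections import Counter

def knn_classify(query, training, k):
    # Rank the labelled instances by squared Euclidean distance to the query...
    ranked = sorted(
        training,
        key=lambda item: (item[0][0] - query[0]) ** 2 + (item[0][1] - query[1]) ** 2,
    )
    # ...then take a majority vote over the k nearest labels
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Invented instances: one 'yes' lies close to the query,
# while more 'no' instances sit slightly further out
training = [((1, 1), "yes"), ((3, 3), "no"), ((3, 2), "no"),
            ((0, 3), "no"), ((2, 3), "no"), ((0, 0), "yes")]
query = (1.2, 1.2)
```

With this data, `knn_classify(query, training, 1)` returns "yes" while `knn_classify(query, training, 5)` returns "no", reproducing the disagreement between the 1-NN and 5-NN classifications described for Figure 10.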

3.4.3 Support Vector Machines

Support Vector Machines (SVMs) are very well suited to learning settings such as text classification, with a well-founded computational theory and analysis (Joachims, 2005). They are built on the notion that between two data classes there is a ‘hyperplane’ separating the two (Maglogiannis, 2007). By applying the SVM algorithm to the training dataset, an optimal hyperplane that categorises instances is formed. The intention is to maximise the margin between the two classes in the training data.
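Formally (this is the standard textbook formulation of the hard-margin linear SVM, not a derivation specific to this study), the classifier learns a weight vector $w$ and bias $b$ defining the hyperplane, and maximising the margin amounts to minimising $\|w\|$:

```latex
f(x) = \operatorname{sign}(w \cdot x + b), \qquad
\min_{w,\, b} \ \tfrac{1}{2}\|w\|^2
\quad \text{subject to} \quad
y_i \, (w \cdot x_i + b) \ge 1 \quad \forall i
```

where each $x_i$ is a training instance with label $y_i \in \{-1, +1\}$, and the resulting margin between the two supporting hyperplanes has width $2 / \|w\|$.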

Figure 11: Example of SVMs (OpenCV, n.d.)

Where SVMs differ from some of the other algorithms discussed is that they categorise data based on the optimal hyperplane margin and not the features of the data, meaning that they can be an effective classification method when using many features, such as text (Joachims, 2005). In an experiment analysing the performance of SVMs against existing text classification methods, Joachims (2005) concludes the following point:

“The experimental results show that SVMs consistently achieve good performance on text categorization tasks, outperforming existing methods substantially and significantly” (Joachims, 2005)

This outcome proposes a possible solution for classifying Darknet data for CTI; however, given the previous statement, the ‘No Free Lunch’ theorem must be considered and thus this method may not be best suited to the problem. In the following evaluation, a review of the accuracy of SVM classification using the Sequential Minimal Optimization (SMO) algorithm will be made. The problem faced initially with training a Support Vector Machine is that it requires a large amount of quadratic programming. SMO breaks large quadratic programming problems into smaller problems, thus optimising the process (Platt, 1998). By using SMO, a more efficient and effective analysis of SVMs is possible.

7. Evaluation

In this section, an evaluation of each process in the system overview, shown in Figure 7, will be made. Firstly, the implementation, efficiency and challenges presented in extracting the data are highlighted. This includes the creation of the crawler and the parsing and labelling of the data ahead of the text classification methods. Furthermore, an evaluation of the accuracy of the text classification techniques used within this study will be made. Concluding this section, a discussion of the outcome of this study, its results and its effectiveness will support a conclusion. A review of possible variances moving forward, that could allow for a more accurate or effective threat intelligence model, will then be conducted.

7.1 Extracting & Pre-Processing Data

As discussed, the Python framework Scrapy is used to assist with the challenge of building the crawler. Appendix A details an image of the source code of the spider within the crawler. Looking at the code, XPath variables are used to select items specific to each site and are iterated over for parsing. An example of this being:

post_title = response.xpath('//*[@class="subject_new"]/text()')

One of the challenges faced, as expected, was the implementation of JavaScript on the websites. This was overcome by using Selenium to launch a WebDriver, in this instance ChromeDriver. Once launched, it mimicked user interaction with each button, delaying the process in order for the website to respond as if it were serving a human user. The crawler was then able to freely download and extract the XPath variables required to build the dataset.

The data was then parsed into a Comma-Separated Values (CSV) document in order to build a training dataset by way of labelling. The labelling was done manually in this case; section 3.3 notes the methodology taken to analyse each topic post and decide whether it was regarded as a threat or not. For an example of the data extracted, refer to Appendix B.

Weka was used to apply machine learning algorithms to the dataset collected by the web crawler. Weka is a Java-based data mining software containing a collection of machine learning algorithms (“Weka 3 — Data Mining with Open Source Machine Learning Software in Java”, n.d.). In order for the dataset to be correctly read by Weka, a conversion from CSV-formatted data into Attribute-Relation File Format (ARFF) was required. ARFF formats data as a text file that contains the instances and their attributes ahead of data mining (“Weka 3 — Data Mining with Open Source Machine Learning Software in Java”, n.d.). Typically, when converting CSV to ARFF, attributes will be defined as numeric. However, in the case of text classification, it is required that attributes are defined as strings. Thus, each attribute had to be manually pre-processed as a string. The training set built and labelled consists of data from all three websites mentioned within the methodology.
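A minimal ARFF file of the kind described might look as follows. The relation name, attribute names and the two instances are illustrative assumptions, not the study's actual data; the key points are the `string` type on the text attribute and the nominal `{yes, no}` class:

```text
@relation darknet_posts

@attribute topic_post string
@attribute relevance {yes, no}

@data
'windows 10 rce 0day exploit', yes
'invoice manager csrf vulnerability', no
```

String values are single-quoted, and the nominal class declaration is what later allows Weka's StringToWordVector filter and classifiers to operate on the text attribute.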

7.2 Analysis of Data

A total of 2,100 items of data was extracted from the forums, with 600 manually labelled for training data (Appendix B). Note that after applying tokenizers the computational requirement greatly exceeded what was available for this study, thus the analysis could only be made on the database of 600 items. Additionally, this data was scrambled before processing in order to eliminate any variable impact on the machine learning process. Applying a StringToWordVector filter to the data converts string attributes into an additional set of attributes representing word occurrences, based on the tokenizer used. In order to account for the spelling and noise challenge discussed in section 3.3, the best-suited tokenizer for this problem was discovered to be the n-gram tokenizer, set to a maximum of 3 n-grams. The LovinsStemmer, Snowball Stemmer and PTStemmers showed significantly lower percentage accuracy on correctly classified instances and thus were not valuable nor worth including in this research. After the n-gram tokenizer was applied, the total number of attributes stood at 6,944.

To begin the analysis, 25% of the data is labelled and 10-fold cross-validation is performed using the supervised learning algorithms NaïveBayes, SMO, RandomForest and IBk. The top-performing algorithms in this analysis are NaïveBayes, at 79.33% correctly classified instances, with RandomForest close behind at 78.00%. Figure 13 details the precision, recall and F1 comparison between these tests.
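10-fold cross-validation partitions the labelled data into ten folds, training on nine and testing on the held-out fold, and averages the results over the ten runs. Weka additionally stratifies the folds by class; the sketch below omits stratification and simply round-robins instance indices, purely to illustrate the mechanics.

```python
def k_fold_indices(n_instances, k=10):
    """Assign instance indices to k disjoint folds (round-robin).
    Each fold serves once as the test set; the other k-1 train."""
    folds = [[] for _ in range(k)]
    for i in range(n_instances):
        folds[i % k].append(i)
    return folds

folds = k_fold_indices(600)
# For each run t, fold t is the test set and the rest form the training set
train_test_pairs = [
    (sorted(i for j, f in enumerate(folds) if j != t for i in f), folds[t])
    for t in range(len(folds))
]
```

With 600 instances this yields ten runs of 540 training and 60 test instances each; the reported accuracy is the average over the ten test folds.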

Notable examples of threats detected by the NaïveBayes algorithm include topic posts selling shellcode exploits and Windows 32-bit privilege-elevation exploits. Examples:
‘shellcode win x8664 download execute generator’
‘cve20151701 win32k elevation of privilege vulnerability’

In a second analysis of the machine learning algorithms, the full database of 600 labelled items was used and split by 25% using Weka, as opposed to a manual data split. This increased the effectiveness of the NaiveBayes classifier to 81.77%, with a total of 368 correctly classified instances. The SMO and RandomForest algorithms showed only a marginal difference of less than 1% compared to the first test. However, the performance of the IBk algorithm improved with the additional training support, reaching an accuracy of 74%.
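As a sanity check on these figures: if 368 correctly classified instances correspond to 81.77%, the test split must contain 450 instances, i.e. 75% of the 600 labelled items, which is consistent with a 25% training percentage split in Weka. Note that the 25% train / 75% test interpretation is our reading of the split, not something the tool output states explicitly.

```python
labelled = 600
train = int(labelled * 0.25)      # 150 instances used for training
test = labelled - train           # 450 instances held out for testing
correct = 368
accuracy = correct / test * 100   # ~81.78%, reported as 81.77% after truncation
```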

7.3 Discussion

The most apparent outcome of the machine learning analysis is that NaiveBayes and RandomForest provide the highest accuracy of the four classifiers. Both algorithms allow a high degree of accuracy when classifying text data and offer an effective means of predicting threats. This is particularly apparent when applying n-gram tokenizers to provide the learner with a large number of attributes; it was found that without n-gram tokenizers this level of accuracy would be difficult to attain. Furthermore, as the amount of labelled data was increased, accuracy improved notably; even so, classification with only 25% labelled data produced a positive and effective outcome. As a general evaluation of the process studied, applying web-crawled Darknet data to machine learning algorithms can provide an effective method of producing CTI. Whilst some level of expertise is clearly required in building a training dataset and reviewing the processed test data, this process can provide experts with an automated system for analysing threats.
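The strength of NaiveBayes on sparse, high-dimensional word-occurrence data can be seen even at toy scale. The sketch below is a from-scratch multinomial Naive Bayes with Laplace smoothing trained on invented forum-style posts; it is not the study's Weka model, and the vocabulary is fabricated purely for illustration.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs):
    """docs: list of (token_list, label). Returns class document counts,
    per-class word counts and the vocabulary for a multinomial NB."""
    class_docs = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in docs:
        class_docs[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_docs, word_counts, vocab

def predict_nb(model, tokens):
    """Pick the class maximising log P(class) + sum log P(word | class)."""
    class_docs, word_counts, vocab = model
    total_docs = sum(class_docs.values())
    best, best_lp = None, float("-inf")
    for label in class_docs:
        lp = math.log(class_docs[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for t in tokens:
            # Laplace smoothing keeps unseen words from zeroing the product
            lp += math.log((word_counts[label][t] + 1) /
                           (total_words + len(vocab)))
        if lp > best_lp:
            best, best_lp = label, lp
    return best

# Fabricated training posts, loosely in the style of forum topic titles
train = [
    ("selling shellcode exploit download".split(), "threat"),
    ("zero day privilege exploit sale".split(), "threat"),
    ("forum rules read before posting".split(), "no-threat"),
    ("welcome introduce yourself here".split(), "no-threat"),
]
model = train_nb(train)
prediction = predict_nb(model, "new exploit shellcode for sale".split())
```

Because each word contributes an independent log-probability term, thousands of n-gram attributes add computation linearly rather than combinatorially, which is one reason NaiveBayes scales well to the 6,944-attribute space used here.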

Moving forward with this research, advancing the system to automate the crawling process and parse the data directly into a machine learning system could create a constant feed of valuable threat intelligence from the Darknet. Additionally, further research could apply semi-supervised algorithms to test their effectiveness, which may reduce the amount of data that needs to be labelled manually.

After studying and implementing each stage of the system presented in Figure 7, the effectiveness and potential opportunities for a CTI model such as this are clear. To conclude, whilst each step presented a number of challenges, the process itself delivers a largely conclusive and valuable outcome for primary research. This study has shown that an effective and accurate level of CTI can be obtained from the Darknet via web crawlers and machine learning.

8. Conclusion

The intended goal of this study was to investigate the application of data mining, machine learning and text feature extraction on Darknet hacking forums to provide CTI. It was found that whilst there has been a great deal of research in each individual field, only a small amount has been conducted with this specific goal. This highlights the importance of CTI drawn from user-generated data as opposed to system-generated data. More specifically, whilst CTI can be gained from the likes of Darknet traffic or protocols, such sources cannot provide security professionals with vital information about potential and emerging attacks. Furthermore, CTI has been applied in many differing ways and can sometimes be misconstrued. Clarification as to whether a method of CTI presents a proactive ‘early phase’ detection of an attack is paramount for the cyber security industry, and this study aimed to address that question. It is the discussion and communication between users on hacking forums and Darknet sites that holds key indicators of potential cyber-attacks. One of the most important conclusions drawn from this study is that gathering data from Darknet hacking communities is an effective means of developing a proactive approach to attack information. Essentially, this allows security professionals to hold an upper hand against exploits in order to mitigate future or occurring attacks.

The primary research of this study collected and processed Darknet data by applying web crawlers to hidden hacking forums. This provided an effective means of extracting data specific to the attributes required for machine learning; crawlers can allow researchers and professionals to gather information from any area of the internet. This data was then used within machine learning to develop a model that can predict cyber threats. This addresses a large proportion of the challenges that cyber-attacks present to cyber security professionals, as previously discussed, by eliminating the ‘unknown unknowns’ and turning them into ‘known knowns’ (MWR InfoSecurity, 2015). The combination of web crawlers, data mining, machine learning and text feature extraction provides an effective solution to real-time and time-sensitive cyber-attack data. Additionally, the level of information provided by this system supports the manual investigation of an exploit, allowing information such as the web page link, author and time posted to be gained.

One of the most important aspects of this study is that this method of CTI is not limited to one attack vector or industry. Thus, its value is apparent across a broad spectrum of computing and can be applied to a multitude of scenarios. Recommendations for future research could entail the study of systems such as this within differing areas of cybercrime. Moreover, research into this topic could be transferred to the likes of detecting child pornography or radical terrorism across the Darknet, allowing law enforcement agencies to monitor Darknet activity specific to a crime or investigation. An additional area of research could entail the development of machine learning to classify text data, looking more specifically at building machine learning tools to label and predict instances to a better accuracy.

To conclude, this study has addressed the research already collected in both machine learning and CTI on the Darknet. With this, primary research in the field has been conducted by developing a system, based on the knowledge gained from web crawling and machine learning, to provide a means and solution for Darknet CTI. The effectiveness of machine learning for CTI has been positively addressed as a proactive approach to countering cyber-attacks. Finally, proof of this was made with a working method that can predict cyber threats to an 81.77% accuracy, from 6,944 attributes, 600 instances and a 25% label training model.

9. References

0day Forum — Homepage. 0day Forum. Retrieved 5 September 2017, from http://qzbkwswfv5k2oj5d.onion

Installation — Selenium Python Bindings 2 documentation. Selenium. Retrieved 1 September 2017, from

ABI-Research. (2017). Machine Learning in Cybersecurity to Boost Big Data, Intelligence, and Analytics Spending to $96 Billion by 2021. Retrieved 18 August 2017, from inte/

ACS. (2016). Cybersecurity: Threats, Challenges, Opportunities. Retrieved from

Al-Rowaily, K., Abulaish, M., Al-Hasan Haldar, N., & Al-Rubaian, M. (2015). BiSAL — A bilingual sentiment analysis lexicon to analyze Dark Web forums for cyber security. Digital Investigation, 14, 53–62.

Annett, M., & Kondrak, G. (2008). A Comparison of Sentiment Analysis Techniques: Polarizing Movie Blogs. Advances In Artificial Intelligence, 5032, 25–35. Retrieved from https://link.springer.com/chapter/10.1007%2F978-3-540-68825-9_3

Armona, L., & Stackman, D. (2014). Learning Darknet Markets.

Barnum, S. (2014). Standardizing Cyber Threat Intelligence Information with the Structured Threat Information eXpression (STIXTM). Retrieved from

Ben-Gal, I. (2007). Bayesian Networks. Encyclopedia Of Statistics In Quality & Reliability.

Chaudhry, P. (2017). The looming shadow of illicit trade on the internet. Business Horizons, 60(1), 77–89.

Ciancaglini, V., Balduzzi, M., Goncharov, M., & McArdle, R. (2013). Deepweb and Cybercrime — It’s Not All About TOR. Retrieved from

Crane, B. Instance Based Learning (AKA Rote Learning) — Bethopedia. Retrieved 3 September 2017, from

Daniel — Home. (2017). Retrieved 1 September 2017, from

Dua, S., & Du, X. (2016). Data Mining and Machine Learning in Cybersecurity. CRC Press.

Epishkina, A., & Zapechnikov, S. (2016). A syllabus on data mining and machine learning with applications to cybersecurity. 2016 Third International Conference On Digital Information Processing, Data Mining, And Wireless Communications (DIPDMWC).

Famous Dark Net Marketplaces to buy Exploits — 0 day Vulnerabilities- Malwares for Research. (2015). International Institute of Cyber Security. Retrieved 1 September 2017, from buy-exploits-0-day-vulnerabilities-malwares-for-research/

Ferrara, E., De Meo, P., Fiumara, G., & Baumgartner, R. (2014). Web data extraction, applications and techniques: A survey. Knowledge-Based Systems, 70, 301–323.

Fischer, E. (2016). Cybersecurity Issues and Challenges: In Brief. Retrieved from

Flach, P. (2015). Machine Learning: The Art and Science of Algorithms that Make Sense of Data. Cambridge: Cambridge University Press.

Fremantle, P., & Scott, P. (2017). A survey of secure middleware for the Internet of Things. Peerj Computer Science, 3, e114.

Goodin, D. (2017). Record-breaking DDoS reportedly delivered by >145k hacked cameras. Ars Technica. Retrieved 17 August 2017, from technology/2016/09/botnet-of-145k-cameras-reportedly-deliver-internets-biggest-ddos-ever/

Highland, H. (1997). A history of computer viruses — The famous ‘trio’. Computers & Security, 16(5), 416–429.

Hutchins, E., Cloppert, M., & Amin, R. (2012). Intelligence-Driven Computer Network Defense Informed by Analysis of Adversary Campaigns and Intrusion Kill Chains. Lockheed Martin Corporation. Retrieved from Paper-Intel-Driven-Defense.pdf

Ikonomakis, M., Kotsiantis, S., & Tampakas, V. (2005). Text Classification Using Machine Learning Technique. WSEAS TRANSACTIONS On COMPUTERS, 4(8), 966–974.

Inj3ct0r Exploit DataBase. (2015). Retrieved 1 September 2017, from

Jain, P., & Bansal, M. (2014). Efficient Crawling the Deep Web. International Journal Of Advanced Research In Computer Science And Software Engineering, 4(5).

Jin, H., de Vel, O., Zhang, K., & Liu, N. (2008). Knowledge Discovery from Honeypot Data for Monitoring Malicious Attacks. AI 2008: Advances In Artificial Intelligence, 470–481.

Joachims, T. (2005). Text categorization with Support Vector Machines: Learning with many relevant features. European Conference On Machine Learning, 98, 137–142.

Keming, C., & Jianguo, Z. (2016). Research on the text classification based natural language processing and machine learning. Balkan Tribological Association, 22(3–1), 2484–2494.

Kolias, C., Kambourakis, G., Stavrou, A., & Voas, J. (2017). DDoS in the IoT: Mirai and Other Botnets. Computer, 50(7), 80–84.

Kononenko, I., & Kukar, M. (2013). Machine learning and data mining. Oxford [u.a.]: Woodhead Publ.

Kouzis-Loukas, D. (2016). Learning Scrapy. Packt Publishing Ltd.

Learning to Classify Text. Retrieved 3 September 2017, from

Li, X., & Liu, B. (2005). Learning to Classify Texts Using Positive and Unlabeled Data. Retrieved from

Li, X., Ng, S., & Wang, J. (2014). Biological data mining and its applications in healthcare. Singapore: World Scientific Pub. Co.

Liu, B., Li, X., Lee, W., & Yu, P. (2004). Text Classification by Labeling Words. Retrieved from

Maglogiannis, I. (2007). Emerging artificial intelligence applications in computer engineering. Amsterdam: IOS Press.

McCallum, A., & Nigam, K. (2005). A Comparison Of Event Models For Naive Bayes Text Classification. Retrieved from

Milošević, N. (2013). History of malware. Retrieved from

MWR Infosecurity. (2015). Threat Intelligence: Collecting, Analysing, Evaluating. CPNI. Retrieved from ce_whitepaper-2015.pdf

Nazario, J. (2004). Defense and detection strategies against Internet worms. Boston (Mass.): Artech House.

Nigam, K., McCallum, A., Thrun, S., & Mitchell, T. (2000). Text Classification from Labeled and Unlabeled Documents using EM. Machine Learning, 39(2–3), 103–134. Retrieved from https://link.springer.com/article/10.1023%2FA%3A1007692713085?LI=true

Omand, D. (2015). The Dark Net: Policing the Internet’s Underworld | World Policy Institute. Retrieved 17 August 2017, from

OpenCV. Example of Support Vector Machine. Retrieved from

Pang, B., & Lee, L. (2008). Opinion Mining and Sentiment Analysis. Foundations And Trends® In Information Retrieval, 2(1–2).

Planqué, D. (2017). Cyber Threat Intelligence: From confusion to clarity; An investigation into Cyber Threat Intelligence. Retrieved from

Platt, J. (1998). Fast Training of Support Vector Machines using Sequential Minimal Optimization. Retrieved from content/uploads/2016/02/smo-book.pdf

Registration. (2017). 0day Forum. Retrieved 1 September 2017, from http://qzbkwswfv5k2oj5d.onion

Reid, F., & Harrigan, M. (2012). An Analysis of Anonymity in the Bitcoin System. Security And Privacy In Social Networks, 197–223. Retrieved from https://link.springer.com/chapter/10.1007/978-1-4614-4139-7_10

Robertson, J. (2017). Darkweb cyber threat intelligence mining. Cambridge University Press.

Sabanal, P., & Yason, M. (2012). Digging Deep Into The Flash Sandboxes. Retrieved from .pdf

Santini, M. (2013). Machine Learning for Language Technology — Decision Trees and Nearest Neighbors. Presentation,

Sebastiani, F. (2002). Machine learning in automated text categorization. ACM Computing Surveys, 34(1), 1–47.

Su, H., & Pan, J. (2016). Crowdsourcing Platform for Collaboration Management in Vulnerability Verification. Network Operations And Management Symposium (APNOMS).

Telefonica. (2016). Analysis of The Inj3ct0r Team. Retrieved from

The Ministry of Defence. (2016). Cyber Primer. MoD. Retrieved from 720-Cyber_Primer_ed_2_secured.pdf

The Tor Project, I. (2017). Tor Project: Mac OS X Install Instructions. Retrieved 1 September 2017, from

Thonnard, O., & Dacier, M. (2008). A framework for attack patterns’ discovery in honeynet data. Digital Investigation, 5, S128-S139.

Thornton, C. KR-IST Lecture 9a Bayesian networks. Retrieved 3 September 2017, from

Trenkle, J., & Cavnar, W. (1994). N-Gram-Based Text Categorization.

United States Department of Defence. (2017). DOD Dictionary of Military and Associated Terms. Retrieved from

Varsalone, J., McFadden, M., & Morrissey, S. (2012). Defense against the black arts. Boca Raton, Fl: CRC Press.

Verizon. (2015). Data Breach Investigation Report. Retrieved from report_2015_en_xg.pdf

Weka 3 — Data Mining with Open Source Machine Learning Software in Java. Retrieved 4 September 2017, from

Widanapathirana, C. (2015). Intelligent Inference Methods for Automated Network Diagnostics.

Wolpert, D., & Macready, W. (1997). No free lunch theorems for optimization. IEEE Transactions On Evolutionary Computation, 1(1), 67–82.

Zero Day Initiative. (2017). Retrieved 17 August 2017, from

Zheng, Q., Wu, Z., Cheng, X., Jiang, L., & Liu, J. (2013). Learning to crawl deep web. Information Systems, 38(6), 801–819.


This publication is officially published and owned by the University of Portsmouth. Any use of this publication must be properly referenced. Feel free to use any of the research made; however, please provide proper credit for this work.

This publication does not contain all the extracts contained within the full research project due to sensitivity of information and research. Please see the full publication.
