Good Bot, Bad Bot:
Characterizing Automated Browsing Activity
Xigao Li
Stony Brook University
Babak Amin Azad
Stony Brook University
Amir Rahmati
Stony Brook University
Nick Nikiforakis
Stony Brook University
Abstract—As the web keeps increasing in size, the number
of vulnerable and poorly-managed websites increases commensu-
rately. Attackers rely on armies of malicious bots to discover these
vulnerable websites, compromising their servers, and exfiltrating
sensitive user data. It is, therefore, crucial for the security of the
web to understand the population and behavior of malicious bots.
In this paper, we report on the design, implementation, and
results of Aristaeus, a system for deploying large numbers of
“honeysites”, i.e., websites that exist for the sole purpose of attract-
ing and recording bot traffic. Through a seven-month-long exper-
iment with 100 dedicated honeysites, Aristaeus recorded 26.4 mil-
lion requests sent by more than 287K unique IP addresses, with
76,396 of them belonging to clearly malicious bots. By analyzing
the type of requests and payloads that these bots send, we discover
that the average honeysite received more than 37K requests each
month, with more than 50% of these requests attempting to brute-
force credentials, fingerprint the deployed web applications, and
exploit large numbers of different vulnerabilities. By comparing
the declared identity of these bots with their TLS handshakes
and HTTP headers, we uncover that more than 86.2% of bots
are claiming to be Mozilla Firefox and Google Chrome, yet are
built on simple HTTP libraries and command-line tools.
I. INTRODUCTION
To cope with the rapid expansion of the web, both legitimate
operators, as well as malicious actors, rely on web bots (also
known as crawlers and spiders) to quickly and autonomously
discover online content. Legitimate services, such as search
engines, use bots to crawl websites and power their products.
Malicious actors also rely on bots to perform credential stuffing
attacks, identify sensitive files that are accidentally made public,
and probe web applications for known vulnerabilities [1], [2].
According to recent industry reports [3], bots are responsible
for 37.2% of the total website-related traffic, with malicious
bots being responsible for 64.7% of the overall bot traffic.
Given the abuse perpetrated by malicious bots, identifying
and stopping them is critical for the security of websites and
their users. Most existing bot-detection techniques rely on
differentiating bots from regular users, through supervised
and unsupervised ML techniques, based on features related
to how clients interact with websites (e.g., the speed and
type of resources requested) [4]–[6] as well as through
browser-fingerprinting techniques [7], [8].
In all of the aforementioned approaches, researchers need
to obtain a ground truth dataset of known bots and known
users to train systems to differentiate between them. This
requirement creates a circular dependency where one needs
a dataset resulting from accurate bot detection to be used for
accurate bot detection. The adversarial nature of malicious bots,
their ability to claim arbitrary identities (e.g., via User-agent
header spoofing), and the automated or human-assisted solving
of CAPTCHAs make this a challenging task [9]–[11].
In this paper, we present a technique that sidesteps the issue
of differentiating between users and bots through the concept
of honeysites. Like traditional high-interaction honeypots, our
honeysites are fully functional websites hosting full-fledged
web applications placed on public IP address space (similar
to Canali and Balzarotti’s honeypot websites used to study the
exploitation and post-exploitation phases of web-application
attacks [12]). By registering domains that have never existed
before (thereby avoiding traffic due to residual trust [13]) and
never advertising these domains to human users, we ensure
that any traffic received on these honeysites will belong to
benign/malicious bots and potentially their operators. To scale
up this idea, we design and build Aristaeus,¹ a system that
provides flexible remote deployment and management of
honeysites while augmenting the deployed web applications
with multiple vectors of client fingerprinting.
Using Aristaeus, we deploy 100 honeysites across the
globe, choosing five open-source web applications (WordPress,
Joomla, Drupal, PHPMyAdmin, and Webmin), which are
both widely popular and affected by hundreds of
historical vulnerabilities, thereby making them attractive targets
for malicious bots. In a period of seven months (January 24 to
August 24, 2020) Aristaeus recorded 26.4 million bot requests
from
287,017
unique IP addresses, totaling more than 200 GB
of raw logs from websites that have zero organic user traffic.
By analyzing the received traffic, we discovered that from
the 37,753 requests that the average Aristaeus-managed
honeysite received per month, 21,523 (57%) were clearly
malicious. Among others, we observed 47,667 bots sending
unsolicited POST requests towards the login endpoints of our
deployed web applications, and uncovered 12,183 unique bot
IP addresses which engaged in web-application fingerprinting.
Over the duration of our experiment, we observed attempts to
exploit five new high-severity vulnerabilities, witnessing bots
weaponizing an exploit on the same day that it became public.
Furthermore, by analyzing the collected fingerprints and
attempting to match them with known browsers and automation
tools, we discovered that at least 86.2% of bots are lying
about their identity, i.e., their stated identity does not match
their TLS and HTTP-header fingerprints. Specifically, out
of the 30,233 clients that claimed to be either Chrome or
Firefox, we found that 86.2% are lying, with most matching
the fingerprints of common Go and Python HTTP libraries as
well as scriptable command-line tools (such as wget and curl).

¹Rustic god in Greek mythology, watching over, among others, beekeepers.
Our main contributions are as follows:
We design and implement Aristaeus, a system for
deploying and managing honeysites. Using Aristaeus, we
deploy 100 honeysites across the globe, obtaining unique
insights into the populations and behavior of benign and
malicious bots.
We extract URLs from exploit databases and web-
application fingerprinting tools and correlate them with
the requests recorded by Aristaeus, discovering that more
than 25,502 bots engage in either fingerprinting or the
exploitation of high-severity, server-side vulnerabilities.
We curate TLS signatures of common browsers and
automation tools, and use them to uncover the true
identity of bots visiting our infrastructure. We find that
the vast majority of bots are built using common HTTP
libraries but claim to be popular browsers. Our results
demonstrate the effectiveness of TLS fingerprinting for
identifying and differentiating users from malicious bots.
Given the difficulty of differentiating between users and
bots on production websites, we will be sharing our curated,
bot-only dataset with other researchers to help them improve
the quality of current and future bot-detection tools.
II. BACKGROUND
Unsolicited requests have become a fixture of the web-
hosting experience. These requests usually originate from bots
with various benign and malicious intentions. On the benign
side, search-engine bots crawl websites to index content for
their services, while large-scale research projects use bots to
collect general statistics. At the same time, malicious actors
use bots to identify and exploit vulnerabilities on a large scale.
Moreover, high-profile websites are victims of targeted bot
attacks that seek to scrape their content and target user accounts.
Bots and automated browsing.
An automated browsing
environment can be as rudimentary as wget or curl requests, or
be as involved as full browsers, controlled programmatically
through libraries such as Selenium [14]. The underlying bot
platforms and their configuration define the capabilities of
a bot in terms of loading and executing certain resources such
as JavaScript code, images, and Cascading Style Sheets (CSS).
As we show in this paper, the capabilities and behavior of
these platforms can be used to identify them.
Browser fingerprinting.
Malicious bots can lie about their
identity. Prior work has proposed detection schemes based on
browser fingerprinting and behavioral analysis to extract static
signatures as well as features that can be used in ML models.
Browser fingerprinting is an integral part of bot detection.
Previous research has focused on many aspects of browsing
environments that make them unique [15]–[17]. The same
techniques have also been used for stateless user tracking by
ad networks, focusing on features that are likely to produce
different results for different users, such as, the supported
JavaScript APIs, list of plugins and browser extensions,
available fonts, and canvas renderings [15], [18], [19].
TLS fingerprinting.
Similarly, the underlying browser and
operating systems can be fingerprinted at the network layer
by capitalizing on the TLS differences between browsers
and environments [20]. Durumeric et al. used discrepancies
between the declared user-agent of a connection and the used
TLS handshake to identify HTTPS interception [21]. In this
paper, we show how TLS signatures (consisting of TLS version,
list of available cipher suites, signature algorithms, e-curves,
and compression methods) can be used to identify the true
nature of malicious bots, regardless of their claimed identities.
Behavioral analysis.
Next to browser and TLS
fingerprinting, the behavior of bots on the target website can
signal the presence of automated browsing. To that end, features
such as the request rate, requests towards critical endpoints
(e.g., login endpoint), and even mouse moves and keystrokes
have been used by prior bot-detection schemes [4], [5], [22].
Browsing sessions.
To study the behavior of bots, we need
a mechanism to group together subsequent requests from the
same bot. While source IP address can be used for that purpose,
in a large-scale study, the IP churn over time can result in
false positives where an address changes hands and ends up
being used by different entities. To address this issue, we used
the notion of “browsing sessions” as used by server-side web
applications and also defined by Google Analytics [23]. For
each IP address, we start a session upon receiving a request and
end it after 30 minutes of inactivity. This allows for an organic
split of the active browsing behavior into groups. Grouping
requests from the same bot in a session enables us to analyze
activities such as changes in a bot's claimed identity and mul-
tiple requests that are part of a credential brute-forcing attack.
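To make the 30-minute rule concrete, the following is a minimal sketch of this session-grouping logic (written in Python; the data layout and function names are illustrative rather than Aristaeus's actual implementation):

```python
from datetime import timedelta

SESSION_TIMEOUT = timedelta(minutes=30)

def group_into_sessions(requests):
    """Group log entries of a single IP address into browsing sessions.

    `requests` is a list of (timestamp, entry) tuples sorted by timestamp;
    a new session starts whenever the gap since the previous request
    exceeds 30 minutes of inactivity.
    """
    sessions, current = [], []
    last_seen = None
    for ts, entry in requests:
        if last_seen is not None and ts - last_seen > SESSION_TIMEOUT:
            sessions.append(current)
            current = []
        current.append(entry)
        last_seen = ts
    if current:
        sessions.append(current)
    return sessions
```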
III. SYSTEM DESIGN
To collect global bot information, we design Aristaeus, a
system that provides flexible honeysite deployment and finger-
print collection. Aristaeus consists of three parts: honeysites,
log aggregation, and analysis modules. Our system can launch
an arbitrary number of honeysites based on user-provided
templates, i.e., sets of existing/custom-made web applications
and scripts, using virtual machines on public clouds. Aristaeus
augments these templates with multiple fingerprinting modules
that collect a wide range of information for each visiting client.
The information collected from honeysites is periodically
pulled by a central server, which is responsible for correlating
and aggregating the data collected from all active honeysites.
Figure 1 presents the overall architecture of our system.
A. Honeysite Design
A honeysite is a real deployment of a web application,
augmented with different fingerprinting techniques, and
increased logging. Like traditional honeypots, our honeysites
are never advertised to real users, nor linked to by other
sites or submitted to search engines for listing. If a honeysite
includes publicly-accessible, user-generated content (such as a
typical CMS showing blog posts), Aristaeus creates randomly-
generated text and populates the main page of the honeysite.
Fig. 1: High-level overview of the architecture of Aristaeus
Lastly, to ensure that the traffic a honeysite is receiving is
not because its domain name used to exist (and therefore
there are residual-trust and residual-traffic issues associated
with it [13]), all domains that we register for Aristaeus were
never registered before. To ensure this, we used a commercial
passive-DNS provider and searched for the absence of
historical resolutions before registering a domain name. As a
result, any traffic that we observe on a honeysite can be safely
characterized as belonging either to bots, or to bot-operators
who decided to investigate what their bots discovered.
To be able to later correlate requests originating from
different IP addresses as belonging to the same bot that changed
its IP address, or to a different deployment of the same
automated-browsing software, Aristaeus augments honeysites
with three types of fingerprinting: browser fingerprinting,
behavior fingerprinting, and TLS fingerprinting. Figure 2
shows how these different types of fingerprinting interface with
a honeysite. We provide details and the rationale about each
of these fingerprinting modules in the following paragraphs:
1) Browser fingerprinting
To fingerprint each browser (or automated agent) that
requests content from our honeysites, we use the following
methods.
JavaScript API support.
Different bots have different
capabilities. To identify a baseline of supported features of
their JavaScript engine, we dynamically include an image using
document.write() and var img APIs, and verify whether the con-
necting agent requests that resource. We also check for AJAX
support by sending POST requests from the jQuery library that
is included on the page. Lastly, we use a combination of inline
JavaScript and external JavaScript files to quantify whether
inline and externally-loaded JavaScript files are supported.
Browser fingerprinting through JavaScript.
We utilize
the Fingerprintjs2 (FPJS2) library [24] to analyze the finger-
printable surface of bots. FPJS2 collects 35 features from web
browsers, including information about the screen resolution,
time zone, creation of canvas, webGL, and other features that
can be queried through JavaScript. To do that, FPJS2 relies on
a list of JavaScript APIs. Even though in a typical browsing
environment these APIs are readily available, there are no
guarantees that all connecting clients have complete JavaScript
support. For instance, the WeChat embedded browser is unable
to execute OfflineAudioContext from FPJS2, breaking the entire
fingerprinting script. To address this issue, we modularized
FPJS2 in a way that the library would survive as many failures
as possible but still collect values for the APIs that are available.
Naturally, if a bot does not support JavaScript at all, we will
not be able to collect any fingerprints from it using this method.
Support for security policies.
Modern browsers support a
gamut of security mechanisms, typically controlled via HTTP
response headers, to protect users and web applications against
attacks. Aristaeus uses these mechanisms as a novel fingerprint-
ing technique, with the expectation that different bots exhibit
different levels of security-mechanism support. Specifically, we
test the support of a simple Content Security Policy and the
enforcement of X-Frame-Options by requesting resources that
are explicitly disallowed by these policies [25], [26]. To the best
of our knowledge, this is the first time these mechanisms have
been used for a purpose other than securing web applications.
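As an illustration of this test, the sketch below shows the kind of policy headers a honeysite could send and how the resulting client behavior can be classified; the exact policy, the /csp-report endpoint, and the helper names are assumptions, not Aristaeus's actual configuration:

```python
# Illustrative security-policy test: the page carries a CSP that disallows
# third-party images and declares a report endpoint, plus X-Frame-Options.
# A client that enforces CSP reports the violation instead of fetching the
# disallowed resource; a client that fetches it anyway (or loads nothing)
# produces a different, fingerprintable behavior.
SECURITY_TEST_HEADERS = {
    "Content-Security-Policy":
        "default-src 'self'; img-src 'self'; report-uri /csp-report",
    "X-Frame-Options": "DENY",
}

def classify_csp_support(fetched_blocked_image: bool, sent_csp_report: bool) -> str:
    """Coarse classification of a client's CSP behavior from two observations."""
    if sent_csp_report:
        return "enforces CSP"
    if fetched_blocked_image:
        return "loads resources but ignores CSP"
    return "does not load embedded resources at all"
```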
2) Behavior fingerprinting
Honoring robots.txt.
Benign bots are expected to follow
the directives of robots.txt (i.e., do not visit the paths explicitly
marked with Disallow). Contrastingly, it is well known
that malicious bots can not only ignore these Disallow
directives but also use robots.txt to identify end-points that
they would have otherwise missed [27]. To test this, Aristaeus
automatically appends to the robots.txt file of each honeysite
a Disallow entry pointing to a "password.txt" file, whose exact path
encodes the IP address of the agent requesting that file. This
enables Aristaeus to not only discover abuses of robots.txt but
also identify a common bot operator behind two seemingly
unrelated requests. That is, if a bot with IP address 5.6.7.8
requests a robots.txt entry that was generated for a different
bot with IP address 1.2.3.4, we can later deduce that both of
these requests belonged to the same operator.
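A minimal sketch of how such per-visitor Disallow entries could be generated and later decoded is shown below; the paper only states that the path encodes the requesting IP address, so the base64-based scheme and path layout here are illustrative assumptions:

```python
import base64

def disallow_entry_for(client_ip: str) -> str:
    """Build a robots.txt Disallow entry whose path encodes the requester's IP.

    The path is unique per visitor, so a later request for this exact path
    from a *different* IP reveals that both IPs are operated together.
    """
    token = base64.urlsafe_b64encode(client_ip.encode()).decode().rstrip("=")
    return f"Disallow: /secret-{token}/password.txt"

def ip_from_path(path: str) -> str:
    """Recover the IP address that a honeypot path was generated for."""
    token = path.split("/secret-")[1].split("/")[0]
    padding = "=" * (-len(token) % 4)
    return base64.urlsafe_b64decode(token + padding).decode()
```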
Customized error pages.
To ensure that Aristaeus gets
a chance to fingerprint bots that are targeting specific web
applications (and therefore request resources that generate
HTTP 404 messages), we use custom 404 pages that incorporate
all of the fingerprinting techniques described in this section.
Fig. 2: Internal design of each honeysite.

Caching and resource sharing.
Regular browsers use
caching to decrease the page load time and load resources
that are reused across web pages more efficiently. The use
of caching can complicate our analysis of logs as we rely
on HTTP requests to identify which features are supported
by bots. To eliminate these effects, we use the “no-cache”
header to ask agents not to cache resources, and we append
a dynamic, cache-breaking token at the end of the URLs for
each resource on a page.
Cache breakers are generated by encrypting the client IP
address and a random nonce using AES and then base64-
encoding the output; this makes every URL to the same
resource unique. For example, if a bot asks for the resource
a.jpg, it will send a request in the format /a.jpg?r=[encoded
IP+nonce]. During our analysis, we decrypt the cache
breakers and obtain the original IP address that requested
these resources. Using this information, we can pinpoint any
resource sharing that occurred across multiple IP addresses.
This happens when a resource is requested from one IP
address but the cache breaker points to a different IP address.
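The sketch below illustrates this scheme; the paper specifies AES encryption of the IP address and a nonce followed by base64 encoding, while the specific AES-GCM mode, key handling, and function names here are assumptions:

```python
import base64
import os
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

KEY = os.urandom(16)  # in practice, a fixed key shared by all honeysites

def make_cache_breaker(client_ip: str) -> str:
    """Encrypt the client IP plus a random nonce and base64-encode the result."""
    nonce = os.urandom(12)
    ciphertext = AESGCM(KEY).encrypt(nonce, client_ip.encode(), None)
    return base64.urlsafe_b64encode(nonce + ciphertext).decode()

def read_cache_breaker(token: str) -> str:
    """Recover the IP address a cache breaker was issued to."""
    blob = base64.urlsafe_b64decode(token)
    nonce, ciphertext = blob[:12], blob[12:]
    return AESGCM(KEY).decrypt(nonce, ciphertext, None).decode()
```

A rewritten URL would then look like /a.jpg?r=&lt;token&gt;, and decrypting the token during analysis recovers the IP address the URL was originally served to.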
3) TLS fingerprinting
We use the FingerprinTLS [20] (FPTLS) library to collect
TLS handshake fingerprints from the underlying TLS library
of each client that connects to our honeysites over the HTTPS
protocol. From each connection, FPTLS gathers fields such
as TLS version, cipher suites, E-curves, and compression
length. The motivation for collecting TLS fingerprints is that
different TLS libraries have different characteristics that can
allow us to later attribute requests back to the same software
or the same machine. This type of fingerprinting is performed
passively (i.e., the connecting agents already send all details
that FPTLS logs as part of their TLS handshakes) and can be
collected by virtually all clients that request content from our
honeysites. This is particularly important because, for other
types of fingerprinting, such as JavaScript-based fingerprinting,
if a bot does not support JavaScript, then Aristaeus cannot
collect a fingerprint from that unsupported mechanism.
B. Honeysite Implementation
While Aristaeus is web-application agnostic, for this paper,
we deployed five types of popular web applications, consisting
of three Content Management Systems (CMSs) and two
website-administration tools. On the CMS front, we deployed
instances of WordPress, Joomla, and Drupal, which are
the three most popular CMSs on the web [28]. WordPress
alone is estimated to be powering more than 30% of sites
on the web [29]. In terms of website-administration tools,
we chose Webmin and PHPMyAdmin. Both of these tools
enable administrators to interact with databases and execute
commands on the deployed server.

Fig. 3: (Top) Daily number of new IP addresses that visited our
honeysites. (Bottom) Daily number of received requests.
Next to their popularity (which makes them an attractive
target), these tools have existed for a number of years and have
been susceptible to a large number of critical vulnerabilities,
ranging from 49 vulnerabilities (in the case of Webmin) to
333 vulnerabilities (in the case of Drupal). Note that, for
CMSs, these core-code vulnerability counts do not include
the thousands of vulnerabilities reported in third-party
plugins. Moreover, all five web applications allow
users and administrators to log in, making them targets for
general-purpose, credential-stuffing bots.
C. Deployment and Log Collection/Aggregation
For the set of experiments described in this paper, we
registered a total of 100 domain names. As described in
Section III-A, we ensure that the registered domains never
existed before (i.e., were never previously registered and left to expire) to avoid
residual-traffic pollution. For each domain, Aristaeus spawns
a honeysite and deploys one of our web application templates
onto the server. We use Let’s Encrypt to obtain valid TLS
certificates for each domain and AWS virtual machines to host
our honeysites over three different continents: North America,
Europe, and Asia. Overall, we have 42 honeysites in North
America, 39 honeysites in Europe, and 19 in Asia.
A central server is responsible for the periodic collection
and aggregation of logs from all 100 honeysites. As described
in Section III-A, these include web-server access logs, TLS
fingerprints, browser fingerprints, behavior fingerprints, and
logs that record violations of security mechanisms. We
correlate entries from different logs using timestamps and
IP addresses and store the resulting merged entries in an
Elasticsearch cluster for later analysis.
IV. BOT TRAFFIC ANALYSIS
In this and the following sections, we report on our findings
on the data collected from 100 honeysites deployed across 3
different continents (North America, Europe, and Asia-Pacific)
over a 7-month period (January 24, 2020 to August 24,
2020). Overall, our honeysites captured 26,427,667 requests
from 287,017 unique IP addresses, totaling 207 GB of data. A
detailed breakdown of our dataset is available in the paper’s
Appendix (Table VI).
From the 26.4 million requests captured, each honeysite
received 1,235 requests each day on average, originating
from 14 IP addresses. Given that these domains have never
been registered before or linked to by any other websites,
we consider the incoming traffic to belong solely to bots and
potentially the bots’ operators.
Daily bot traffic.
Figure 3 (Top), shows the number of
unique IP addresses that visited our honeysites each day. We
observe that this number initially goes down but averages
at around 1,000 IP addresses during the later stages of our
data collection. Our data demonstrate that our honeysites keep
observing traffic from new IP addresses, even 7 months after
the beginning of our experiment.
Figure 3 (Bottom) shows the number of requests received
over time. Beginning on April 18, we observed an increase in
the volume of incoming requests which is not followed by a
commensurate increase in the number of unique IP addresses.
Upon careful inspection, we attribute this increase in traffic to
campaigns of bots performing brute-force attacks on our Word-
Press honeysites (discussed in more detail in Section VI-A2).
Geographical bot distribution.
Although we did not
observe significant variation across the honeysites hosted in
different regions, we did observe that bot requests are not
evenly distributed across countries. Overall, we received most
bot requests from the US, followed by China and Brazil.
Figure 4 presents the top 10 countries based on the number
of IP addresses that contacted our honeysites.
Limited coverage of IP blocklists.
We collect information
about malicious IP addresses from 9 popular IP blocklists that
mainly focus on malicious clients and web threats, including
AlienVault, BadIPs, Blocklist.de, and BotScout. Out of the
76,396 IPs that exhibited malicious behavior against our Aristaeus-
managed infrastructure, only 13% appeared in these blocklists.
To better understand the nature of these malicious bots and
how they relate to the rest of the bots recorded by Aristaeus,
we used the IP2Location lite IP-ASN database [30] to obtain
the type of location of all IP addresses in our dataset. Figure 5
shows this distribution. Contrary to our expectation that most
bots would be located in data centers, most IP addresses
(64.37%) are, in fact, located in residential IP space.
This finding suggests that most bot requests come from
either infected residential devices or from attackers using residential devices
as proxies to evade IP-based blocklists [31]. This is also
confirmed by the distribution of the IP addresses in the
third-party IP blocklists that we obtained. As shown in
Figure 5, the vast majority of the hosts labeled as malicious
by these blocklists are also located in residential networks.
Fig. 4: Bot Origin Country Distribution.
Fig. 5: Location of IP addresses.
Lastly, to understand how often the blocklists need to be
updated, we characterize the lifetime of bots in our dataset.
We define the “lifetime” as the duration between a bot’s first
visit of a honeysite and the final visit through the same IP
address. We observe 69.8% bot IP addresses have a lifetime
less than a day. These bots are active for a short period of
time and then leave, never to return. In contrast, 0.8% of bots
have a lifetime longer than 200 days, approaching the duration
of our study. This indicates that bots frequently switch to
new IP addresses, which make static IP-based blocklists less
effective against IP-churning behavior.
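A small sketch of this lifetime computation (Python, with an illustrative input layout) is shown below:

```python
from datetime import timedelta

def bot_lifetimes(requests_by_ip):
    """Lifetime of each bot IP: time between its first and last recorded request.

    `requests_by_ip` maps an IP address to a sorted list of request timestamps.
    """
    return {ip: ts[-1] - ts[0] for ip, ts in requests_by_ip.items() if ts}

def fraction_short_lived(lifetimes, cutoff=timedelta(days=1)):
    """Share of bot IPs whose entire activity fits within `cutoff`."""
    short = sum(1 for life in lifetimes.values() if life < cutoff)
    return short / len(lifetimes) if lifetimes else 0.0
```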
Our comparison of the malicious bots discovered by Aris-
taeus with popular blocklists demonstrates both the poor cov-
erage of existing blocklists and the ability of Aristaeus to
identify tens of thousands of malicious IP addresses that are
currently evading other tools. We provide more details on our
methodology for labeling bots as malicious in Section VI-A.
Honeysite discovery.
A bot can discover domain names
using DNS zone files, certificate transparency logs, and
passive network monitoring. We observe that, mere minutes
after a honeysite is placed online, Aristaeus already starts
observing requests for resources that are clearly associated
with exploitation attempts (e.g., login.php, /phpinfo.php,
/db_pma.php, and /shell.php). In Section VI-A, we provide
more details as to the types of exploitation attempts that we
observed in Aristaeus’s logs.
To identify how bots find their way onto our honeysites, we
inspect the “Host” header in the bot HTTP requests checking
whether they visit us through the domain name or by the
IP address of each honeysite. Out of the total 287,017 IP
addresses belonging to bots, we find that 126,361 (44%) visit
us through our honeysite IP addresses whereas 74,728 (26%)
visit us through the domain names. The remaining 85,928
(30%) IP addresses do not utilize a "Host" header. Moreover,
in HTTPS traffic, we observe that all of the 36,266 bot IP
addresses visit us through our domain names since Aristaeus
addresses visit us through our domain names since Aristaeus
does not use IP-based HTTPS certificates.
Given the large volume of requests that reach our honeysites,
one may wonder whether we are receiving traffic because
of domains that were once configured to resolve to our
AWS-assigned IP addresses and whose administrators forgot
to change them when they retired their old virtual machines.
Such "dangling domains" are a source of vulnerabilities
and have been recently investigated by Liu et al. [32] and
Borgolte et al. [33]. Using passive DNS, we discovered
that two misconfigured third-party domains pointed to our
infrastructure during the course of our experiment. However,
the clients who connected to our honeysites because of these
misconfigured domains amount to a mere 0.14% of the total
observed IP addresses.
Similarly, to quantify the likelihood that we are receiving
requests from real users (as opposed to bots) whose browsers
are stumbling upon content linked to via an IP address (instead
of through a domain name) back when the Aristaeus-controlled
IP addresses used to belong to other parties, we performed
the following experiment. We crawled 30 pages for each of
the Alexa top 10K websites, searching for content (i.e. images,
JavaScript files, links, and CSS) that was linked to via an
IP address. Out of the 3,127,334 discovered links, just 31
(0.0009%) were links that used a hardcoded IP address.
When considered together, these findings demonstrate that
the vast majority of Aristaeus-recorded visits are not associated
with real users, but with bots (benign and malicious) that are
the subject of our study.
Most targeted URIs.
Not all honeysite endpoints receive
the same number of requests. Here, we list the endpoints that
received the most requests from bots and the web applications
to which they belong.
Among the most requested endpoints, we observe those
that are common across multiple web applications, such as
robots.txt, as well as resources that belong to only one of our
web applications, such as wp-login.php, which is the login page
of WordPress. Figure 6 shows the share of requests for each
URI towards different web applications; darker-colored cells
indicate a higher portion of requests going towards that type
of web application. To enable a baseline comparison, Figure 6
also shows the access distribution of document root (/).
Overall, the login endpoints of our honeysites received the
most attention from bots. These requests attempt to brute-force
login credentials and target endpoints such as wp-login.php or
/user/login. There are also general requests for the robots.txt
file. Interestingly, the robots.txt file was mostly queried on
WordPress and Joomla honeysites and significantly less for
Drupal, PHPMyAdmin, and Webmin honeysites.
Certain resources are only requested on one of our web
platforms. For instance, xmlrpc.php and wp-login.php are
only requested on WordPress honeysites. Similarly, the URI
/changelog.txt is requested only from Drupal honeysites for the
purpose of web-application fingerprinting (i.e. determining the
exact version and whether it is vulnerable to known exploits).
Next is the session_login.cgi file, which hosts the Webmin login
page; we only observe requests to this resource on Webmin
instances. Finally, document root and /latest/dynamic/instance-
identity/document requests are observed equally among all of
our honeysites. The /latest/dynamic/instance-identity/document
endpoint exists on some of the servers hosted on AWS and
can be used in a Server-Side Request Forgery (SSRF) attack
(we discuss this attack in Section VI-A2).
Overall, the application-specific attack pattern suggests that
bots will fingerprint the applications and identify the presence
of a specific vulnerable web application rather than blindly
firing their attack payloads. We discuss the fingerprinting
attempts by bots in more detail in Section VI-A2. Lastly,
Table VII (in the Appendix) lists the most targeted URLs
from the perspective of each web application type.

Fig. 6: Heatmap of the most requested URIs and their respective
web applications. Darker cells indicate a larger fraction of requests
towards that type of web application. The icons in each cell indicate
whether the resource exists in the web application, does not exist,
or exists but is not available to unauthenticated clients.
V. JAVASCRIPT FINGERPRINTS OF BOTS
In this section, we report on our findings regarding bot
detection through browser fingerprinting.
JavaScript support.
We designed several tests to measure
the JavaScript support of bots. From these tests, we discovered
that only 11,205 bot sessions (0.63% of the total 1,760,124
sessions) support JavaScript functionality such as adding
dynamic DOM elements and making AJAX requests.
JavaScript-based browser fingerprinting.
When it comes
to the detection of the bots that visited our honeysites, the
effectiveness of JavaScript-based browser fingerprinting is
greatly impacted by the lack of support for JavaScript from the
majority of bots. Across the total of 1,760,124 sessions, only
0.59% of them returned a JavaScript-based browser fingerprint.
Overall, given that the majority of bots that come to our
websites do not support JavaScript, this type of browser
fingerprinting proves to be less useful for bot identification.
By the same token, this also indicates that if websites demand
JavaScript in order to be accessible to users, the vast majority
of bots identified by Aristaeus will be filtered out.
VI. BOT BEHAVIOR
In this section, we look at different aspects of the behavior
of bots during their visits on our honeysites.
Honoring the robots.txt.
Based on our dynamic robots.txt
generation scheme, we did not observe any violations against
Disallow-marked paths. This is an unexpected finding and
could be because of the popularity of using fake Disallow
entries for identifying reconnaissance attempts [34]–[36].
However, this does not mean that all bots will honor robots.txt,
since only 2.2% of total sessions included a request to this file.
Enforcing the content security policy (CSP).
Less than
1% of the total IP addresses reported CSP violations on
our honeysites. Similarly, less than 1% of bots violated the
CSP rules by loading resources that we explicitly disallowed.
The majority of CSP violations originated from benign
search-engine bots which were capable of loading embedded
resources (such as, third-party images and JavaScript files) but
did not support CSP. The vast majority of bots do not load
CSP-prohibited resources, not because they honor CSP, but
because they do not load these types of resources in general.
Shared/distributed crawling.
Since Aristaeus encodes the
client’s IP addresses into each URL cache-breaker, clients are
expected to make requests that match their URLs. However,
out of 1,253,590 requests that bore valid cache breakers, we
found that 536,258 (42.8%) "re-used" cache-breakers given
to clients with different IP addresses.
Given the large percentage of mismatched requests, we can
conclude that most are because of distributed crawlers which
identify URLs of interest from one set of IP addresses and then
distribute the task of crawling across a different pool of “work-
ers”. This behavior is widely observed in Googlebots (19.6%
of all cache-breaker re-use) and the “MJ12Bot” operated by the
UK-based Majestic (32.1% cache-breaker reuse). Interestingly,
malicious bots do not engage in this behavior, i.e., any cache-
breakers that we receive from them match their IP address.
A. Bot Intentions
Based on their activity on our honeysites, we categorize
bot sessions into three categories: “Benign”, “Malicious”, and
“Other/Gray”. Benign bots are defined as bots visiting our
honeysites and asking for valid resources similar to a normal
browser, with no apparent intent to attack our honeysites. For
example, benign bots do not send unsolicited POST requests
nor try to exploit a vulnerability. Contrastingly, malicious
bots are those that send unsolicited POST requests towards
authentication endpoints, or send invalid requests trying to
exploit vulnerabilities. Apart from these two categories, there
are certain bots that, because of limited interaction with our
honeysites, cannot be clearly labeled as benign or malicious.
We label these bots as “Other/Gray”.
1) Benign bots
Based on their activity, we categorize search-engine bots
and academic/industry scanners as benign bots. In total, we
recorded 347,386 benign requests, which is 1.3% of the total
requests received by Aristaeus.
Search-engine bots.
Search-engine bots are responsible
for the majority of requests in the benign-bot category,
contributing 84.4% of all benign requests. The general way of
identifying search-engine bots is through their User-Agents, where
they explicitly introduce themselves. However, it is possible
for bots to disguise their User-Agents as those of search-engine
bots in order to hide their malicious activity. Search engines
typically provide mechanisms, such as reverse DNS lookups,
that allow webmasters to verify the origin of each bot that
claims to belong to a given search engine [37]–[40].
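For example, a forward-confirmed reverse-DNS check along the lines recommended by search engines can be sketched as follows; the Googlebot hostname suffixes are just an example, as each search engine documents its own:

```python
import socket

def verify_search_engine_bot(ip, allowed_suffixes=(".googlebot.com", ".google.com")):
    """Forward-confirmed reverse DNS check for a claimed search-engine bot.

    1. Reverse-resolve the IP to a hostname and check the hostname suffix.
    2. Forward-resolve that hostname and confirm it maps back to the same IP.
    """
    try:
        host, _, _ = socket.gethostbyaddr(ip)
        if not host.endswith(allowed_suffixes):
            return False
        _, _, addrs = socket.gethostbyname_ex(host)
        return ip in addrs
    except (socket.herror, socket.gaierror):
        return False
```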
In total, we received 317,820 requests from search-engine
bots, with Google bots contributing 80.2% of these requests.
For instance, we observed four different Google-bot-related
user agents ("Googlebot/2.1", "Googlebot-Image/1.0",
"AppEngine-Google", and "Google Page Speed Insights")
which match documentation from Google [41].

TABLE I: Breakdown of requests from search engine bots
Type        Total SEBot Requests   Verified Requests
Googlebot   233,024                210,917 (90.5%)
Bingbot     77,618                 77,574 (99.9%)
Baidubot    2,284                  61 (0.026%)
Yandexbot   4,894                  4,785 (97.8%)
Total       317,820                293,337 (92.3%)
Of the 317,820 requests claiming to originate from
search-engine bots, we verified that 293,337 (92.3%) are
indeed real search-engine bot requests. The share of requests
towards our honeysites from the identified search-engine bots
is listed in Table I.
Academic and Industry scanners.
Apart from anonymous
scanners, we identified 30,402 (0.12%) requests originating
from scanners belonging to companies which collect website
statistics (such as BuiltWith [28] and NetCraft [42]), keep
historical copies of websites (such as the Internet Archive [43]),
and collect SEO-related information from websites (such as
Dataprovider.com). Moreover, the crawlers belonging to a
security group from a German university were observed on our
honeysites. We were able to verify all of the aforementioned
bots via reverse DNS lookups, attributing their source IP
address back to their respective companies and institutions.
2) Malicious bots
We define malicious requests by their endpoints and access
methods. As we described in Section III-A, we ensure that
the domains we register for Aristaeus never existed in the past.
Hence, since benign bots have no memory of past versions
of our sites, there should be no reason for a benign bot to
request a non-existent resource. Therefore, we can label all
invalid requests as reconnaissance (i.e., fingerprinting and
exploitation attempts) requests, which we ultimately classify
as malicious. Similarly, we label bots that make unsolicited
POST requests to other endpoints, such as login pages, as
malicious. Overall, we labeled 15,064,878 requests (57% of
total requests) as malicious.
Credential brute-force attempts. 13,445,474 (50.8%)
requests from 47,667 IP addresses targeted the login pages of
our websites. By analyzing all unsolicited POST requests we
received and checking their corresponding URIs, we discovered
that different web applications attract different attacks. For
example, there are 12,370,063 POST requests towards Word-
Press, 90.3% of which are attempts to brute-force wp-login.php
and xmlrpc.php. However, for Joomla, there are 343,263
unsolicited POST requests, with only 51.6% targeting the
Joomla log-in page. The remaining requests are not specific to
Joomla and are targeting a wide variety of vulnerable software
(e.g. requests towards /cgi-bin/mainfunction.cgi attack DrayTek
devices that are vulnerable to remote code execution [44]).
Interestingly, system management tools attract different
patterns of attacks. While 76.2% of POST requests towards
phpMyAdmin targeted login endpoints, virtually all POST
requests (99.95%) for Webmin targeted its specific login
endpoints. This suggests that most bots targeting Webmin
focus on brute-forcing credentials, as opposed to targeting
other, publicly-accessible pages.

TABLE II: Top fingerprinting requests
Path                            # requests   Unique IPs   Target applications
/CHANGELOG.txt                  116,513      97           Drupal, Joomla, Moodle, and SPIP
/(thinkphp|TP)/(public|index)   55,144       3,608        ThinkPHP
/wp-content/plugins             32,917       2,416        WordPress
/solr/                          23,307       919          Apache Solr
/manager/html                   10,615       1,557        Tomcat Manager
By examining the username and password combinations
that were attempted against our honeysites, we observe that
attackers always try to log in as "admin" using either common
passwords [45] or variations of our honeysite domains (i.e.,
attempting a "www.example.com" password on the honeysite
serving the example.com domain). From the number of
attempts, we found that 99.6% of bots (as identified by their IP
address) issued fewer than 10 attempts per domain before chang-
ing their targets. Only 0.3% of bots issued more than 100 brute-
force attempts per domain. The most active bot issued 64,211
login-related requests towards our WordPress honeysites.
Reconnaissance attempts.
To identify requests related
to reconnaissance, we incorporate a two-pronged mapping
approach. First, we generate signatures based on popular
libraries and databases that include URIs related to Application
fingerprinting, Exploitation attempts, Scanning for open-access
backdoors, and Scanning for unprotected backup files. We
provide details for each specific library and dataset later in
this section. Second, we manually identify the intention of
requests for endpoints that received more than 1,000 requests
in our dataset, mapping each request to the aforementioned
categories of attacks whenever possible. This filtering step is
necessary since we cannot possibly create a comprehensive
database that includes signatures for all bot requests. As an
example of the power of this prioritized-labeling method,
via this process we discovered attacks exploiting the recent
CVE-2020-0618 vulnerability in MSSQL Reporting Servers
which was not part of our original database of signatures.
Overall, we collected a total of 16,044 signatures, with 179
signatures matching requests in our dataset. These signatures
cover 25,502 (9% of total) IP addresses, which generated
659,672 requests.
Application fingerprinting:
In this study, fingerprinting
attempts refer to requests that aim to uncover the presence of
specific web-application versions and their plugins. To quantify
these requests, we use the signatures from BlindElephant [46]
and WhatWeb [47], two open-source fingerprinting tools that
have large databases of fingerprints for popular web applications
and their plugins. By requesting specific files and matching
their content with the signatures in their database, these tools
can identify the type and specific version of the target web
application. We extract the file paths from the databases of
fingerprints and correlate these signatures with our web server
logs, to identify fingerprinting attempts from malicious bots. To
ensure that we do not label regular crawling as fingerprinting,
we discount requests towards generic files, such as, index.php
and robots.txt even if these are valuable in the context of web-
application fingerprinting. Overall, our database includes 13,887
URL-based signatures used to identify fingerprinting attempts.
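A simplified sketch of this matching step is shown below; the field and function names are illustrative, and the real pipeline correlates far richer log entries:

```python
from urllib.parse import urlparse

# Paths that are too generic to count as fingerprinting, per the text above.
GENERIC_PATHS = frozenset({"/index.php", "/robots.txt"})

def label_fingerprinting_requests(log_entries, signature_paths):
    """Return log entries whose request path matches a fingerprinting signature.

    `signature_paths` is the set of file paths extracted from the databases of
    BlindElephant and WhatWeb; `log_entries` are dicts with at least a "uri" key.
    """
    hits = []
    for entry in log_entries:
        path = urlparse(entry["uri"]).path
        if path in GENERIC_PATHS:
            continue  # discount regular crawling of generic files
        if path in signature_paths:
            hits.append(entry)
    return hits
```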
Table II lists the top 5 paths in our database of fingerprinting
signatures that received the most requests. In total, we received
223,913 requests that were categorized as fingerprinting at-
tempts and originated from 12,183 unique bot IP addresses.
Within our database of signatures, /CHANGELOG.txt has
received the largest number of requests since this file can be
used to identify the version of Drupal, Joomla, Moodle (Online
learning platform), and SPIP (Collaborative publishing system).
Second, we observe requests towards remote-code execution
(RCE) vulnerabilities in ThinkPHP deployments which are
known to be targeted by multiple botnets [48]. The fingerprint-
ing of Apache Solr is related to the versions that were reported
to be vulnerable to RCE in November 2019 [49]. Finally, in the
top five categories of fingerprinting requests, we observe large
numbers of requests towards specific vulnerable WordPress plu-
gins as well as the default deployment of Tomcat Manager. The
rest of fingerprinting-related requests follow the same patterns
of probing for highly-specific endpoints, belonging to applica-
tions that are either misconfigured or known to be vulnerable.
Exploitation attempts:
We define exploitation attempts
as requests towards URIs that are directly used to trigger known
exploits. We use exploits from exploit-db.com to generate
signatures for exploitation attempts. Unfortunately, automati-
cally generating signatures based on public exploit descriptions
is challenging due to the diverse format of vulnerability reports.
As a result, we incorporate a human-assisted automation tool
that extracts the URLs of signature candidates for the human
analyst to verify. At the same time, we hypothesize that bots
will most likely focus on server-side exploits that are easy
to mount (such as SQL injections and RCEs) and therefore
focus on these types of exploits, as opposed to including client-
side attacks, such as, XSS and CSRF. The resulting signature
database includes 593 signatures for the 5 web applications
in our dataset corresponding to vulnerabilities from 2014 to
2020. Our database includes 174 exploits for WordPress, 297
exploits for Joomla, 40 for Drupal, 52 for phpMyAdmin, and
19 exploits for Webmin, as well as 14 exploits extracted by
looking at the most requested paths on our honeysites.
Overall, we matched 238,837 incoming requests to
exploitation attempts that originated from 10,358 bot IP
addresses. Table III includes the top 5 endpoints used in
these attempts. In this table, we report on the CVE number
whenever possible, and in the absence of a CVE number, we
report the EDB-ID (Exploit-db ID) for these vulnerabilities.
The RCE vulnerability in PHPUnit received the most
exploitation attempts, followed by a setup PHP code injection
vulnerability of phpMyAdmin, and an RCE on exposed XDebug
servers (PHP Remote debugging tool). Next, an RCE vulner-
ability in ThinkCMF (a CMS application based on ThinkPHP)
is also targeted by malicious bots. The last entry in Table III
refers to a DrayTek vulnerability, which is significant in that its
exploit was released during our study, allowing us to observe
how fast it was weaponized (discussed further in Section VIII).

TABLE III: Top exploit requests
Path                                                     # requests   Unique IPs   CVE/EDB-ID
/vendor/phpunit/.../eval-stdin.php                       70,875       346          CVE-2017-9841
/scripts/setup.php                                       67,417       1,567        CVE-2009-1151
/?XDEBUG_SESSION_START=phpstorm                          23,447       7            EDB-44568
/?a=fetch&content=<php>die(@md5(HelloThinkCMF))</php>    21,819       953          CVE-2019-7580
/cgi-bin/mainfunction.cgi                                20,105       2,055        CVE-2020-8515
Interestingly, 3,221 (14%) of IP addresses that sent
fingerprinting or exploitation requests were observed in both
categories, suggesting that some bots cover a wide range of
vulnerabilities, as opposed to focusing on a single exploit.
Next to exploitation attempts, we also searched for requests
that included tell-tale shell commands (such as rm -rf /tmp
and wget) in one or more request parameters. In this way, we
discovered an additional 24,379 shell-related requests.
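A minimal sketch of this parameter inspection is shown below; the regular expression lists only a few illustrative command patterns, not the full set we searched for:

```python
import re
from urllib.parse import urlparse, parse_qsl

# Tell-tale substrings of injected shell commands (illustrative subset).
SHELL_PATTERNS = re.compile(
    r"(rm\s+-rf\s+/tmp|wget\s+https?://|curl\s+https?://|chmod\s+\+x)"
)

def has_shell_injection(uri: str) -> bool:
    """Return True if any decoded query parameter contains a shell command."""
    query = urlparse(uri).query
    for _, value in parse_qsl(query, keep_blank_values=True):
        if SHELL_PATTERNS.search(value):
            return True
    return False
```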
Though most injected shell commands attempt to
download a malicious payload from a publicly accessible IP
address/domain, we discovered that 2,890 requests contain
the URL of a private IP address, such as "192.168.1.1:8088",
which is, of course, unreachable from our web servers. These
requests could either belong to a buggy bot that extracts the IP
address of the wrong network interface after exploiting a host,
or could indicate botnets that are meant to attack the routers
in a local network but ended up on the public web.
Scanning for open-access backdoors:
We generate
a list of 485 well-known PHP, ASP, Perl, Java and bash,
backdoors. We use the same lists as Starov et al. [50] to extract
the signatures of known web backdoors and augment their
lists with two repositories that host web shells [51], [52]. Our
signatures matched 42 web shells (such as, shell.php, cmd.php
and up.php) requested 144,082 times by 6,721 bot IP addresses.
Scanning for unprotected sensitive files:
Another
group of bots query for unprotected sensitive files, by either
guessing the names of likely-sensitive files (such as backup.sql)
or capitalizing on administrator behavior (e.g. keeping known
working copies of sensitive files with a .old suffix) and leaks
due to specific editors (such as accessing the temporary swap
files left behind by Vim).
Similar to web-shell signatures, we used popular word lists
used in security scanners, such as SecLists [51], to build a
database of 1,016 signatures. These signatures matched 52,840
requests from 5,846 unique bot IP addresses. Files with the
.env extension which include potentially sensitive environment
variables used in Docker as well as other popular development
platforms were requested 29,713 times by 1,877 unique bot IP
addresses. Bots also requested a wide range of likely sensitive
file extensions including .old, .sql, .backup, .zip, and .bak as
well as text editor cache files such as .php~ and .swp files.
Based on all of our signatures introduced in this section,
we observe 929 unique bot IP addresses that participated in
all of the aforementioned types of attacks. This demonstrates
that there exist bots that are equipped with a large number
of exploits and are willing to exhaust them while attacking
a host, before proceeding to their next victim.
B. Duration and frequency of bot visits
We grouped the requests recorded by Aristaeus into
1,760,124 sessions. Overall, 44.9% of sessions consist of only
a single request, and 46% of sessions include between 2 and 20
requests, whereas there exist 2,310 sessions that include more
than 1,000 requests. The majority of bots spend as little as 1-3
seconds on our honeysites. 58.1% of the bots that visited our
honeysites left within 3 seconds, and among these bots, 89.5%
left within 1 second. Contrastingly, 10.7% of bots spent more
than 30 minutes on our honeysites.
A large fraction of bots visiting our honeysites perform too
few and too generic requests for us to be able to categorize
them as benign or malicious. Specifically, 11,015,403 requests
(41.68% of total requests) fall into this category. We provide
additional details for these bots below:
Single-shot scanners. 50.04% of the IP addresses that
visited our honeysites only sent a single request and did
not exhibit any obviously malicious behavior. This is clearly
bot behavior since modern browsers make tens of follow-up
requests in order to load the required first-party and third-party
resources. Similarly, these bots are unlikely to be indexing
websites since that would also require follow-up requests
for pages linked from the main page. We hypothesize that
these single-shot bots are either populating databases for later
processing (by more persistent bots) or are searching for
specific content that is not present on our setup.
Internet-wide scanners. We attributed 114,489 requests to
four different Internet-wide scanners, including Masscan [53]
(21.4%) and Zgrab [54] (73.1%). Moreover, our honeysites
recorded “Stretchoid” (34.3%) and “NetSystemsResearch”
(3.69%) bots, which claim to identify online assets of organiza-
tions and the availability of public systems. The exact intention
behind these requests remains unclear since these tools can be
collecting data for both benign as well as malicious purposes.
C. Unexpected changes in bot identity
In this section, we focus on bots that switched their identity
across requests. We look for changes in certain HTTP headers,
such as, the user agent, as well as artifacts of a change in the
used automation tool, such as, the reordering of HTTP headers.
Multiple User-Agents from the same IP address.
At least
14.28% of all IP addresses that visited our honeysites sent
requests with two or more user agents. There may be benign
reasons why a bot would present two or more User-Agent
(UA) strings, such as, bots behind NATs and proxies, or bots
that upgraded their versions of browsers/crawling tools. At
the same time, we observed clear spoofing behavior, such as,
bots changing their UAs with every request. We summarize
the types of UA changes below:
1) Changing the operating system. As shown in the
following example, only the operating system part of the
user agent was changed across two requests.
”Mozilla/5.0 (
X11; Ubuntu; Linux x86 64
; rv:52.0)
Gecko/ 20100101 Firefox/52.0”
”Mozilla/5.0 (
Windows NT 6.1; WOW64
; rv:52.0)
Gecko/ 20100101 Firefox/52.0”
We identified 5,479 IP addresses that claimed more than
one OS during their requests. One possible explanation
is that these bots are searching for server-side cloaking,
i.e., our honeysites presenting different content to users
of Windows vs. Linux.
2) Changing the browser version.
We use the Levenshtein
distance to measure the similarity of User-Agent strings from a
given IP address, recording their minimum and maximum
similarity (see the sketch after this list). We observed that when the changes are limited
to browser versions as presented in the bots' UAs, the
requests exhibit more than 90% similarity. A total of
11,500 IP addresses present these types of version changes.
In light, however, of our findings in Section VII-A3
regarding bots imitating common browsers, these version
changes are likely to be part of a spoofing strategy, as
opposed to honest announcements of browser updates.
3) Switching user agents frequently.
We observed 4,440 (1.54%) IP addresses that sent requests with more than
5 UAs, and there are 542 IP addresses presenting more
than 20 UAs across their requests. In extreme cases, we
observed 44 IP addresses that sent more than 50 unique UA
headers. However, by looking at other HTTP headers and
their TLS fingerprint, we can attribute their requests to just
one or two tools. This uncovers the strategy of frequently
rotating UAs, presumably to evade server-side blocking.
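The sketch below shows the kind of edit-distance-based User-Agent comparison referred to in item 2 above; the implementation details are illustrative:

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        current = [i]
        for j, cb in enumerate(b, 1):
            current.append(min(previous[j] + 1,                  # deletion
                               current[j - 1] + 1,               # insertion
                               previous[j - 1] + (ca != cb)))    # substitution
        previous = current
    return previous[-1]

def ua_similarity(ua1: str, ua2: str) -> float:
    """Similarity in [0, 1]; above 0.9 typically means only the version changed."""
    if not ua1 and not ua2:
        return 1.0
    return 1.0 - levenshtein(ua1, ua2) / max(len(ua1), len(ua2))
```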
Ordering of HTTP headers.
During our analysis of
crawling tools, we observed that specific orders of headers
can be attributed to the usage of certain tools. For instance, we
discovered that wget and curl have consistent orderings across
versions and are different from each other. By capitalizing on
this observation, we identified 23,727 bots that presented more
than one header ordering, revealing the usage of more than
one tool, regardless of UA claims. Moreover, we discovered
28,627 IP addresses that have only one ordering of HTTP
headers but multiple UAs, which means they are changing
their claimed identities without changing the underlying tools.
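A minimal sketch of turning a request's header ordering into a comparable signature (names and hashing choice are illustrative) follows:

```python
import hashlib

def header_order_signature(raw_header_names):
    """Signature of the order in which a client sent its HTTP headers.

    Header *values* are ignored so that the signature captures only the
    tool-specific ordering (e.g., curl and wget order headers differently).
    """
    ordering = ",".join(name.lower() for name in raw_header_names)
    return hashlib.sha1(ordering.encode()).hexdigest()[:12]

# Two requests claiming different User-Agents but sharing the same ordering
# signature likely come from the same underlying tool.
sig_a = header_order_signature(["Host", "User-Agent", "Accept"])
sig_b = header_order_signature(["Host", "User-Agent", "Accept"])
assert sig_a == sig_b
```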
VII. TLS FINGERPRINTING
Aristaeus serves content to both HTTP and HTTPS requests
to accommodate as many bots as possible. Overall, bots made
10,156,305 requests over HTTPS (38.4% of total requests).
Out of all HTTPS requests, we extracted 558 unique TLS fin-
gerprints. This indicates that most bots use the same few underlying
TLS libraries, which are determined by the tool and the operating system
that they are using. We can therefore use these TLS fingerprints
to identify bots and corroborate their claimed identities.
A. TLS Fingerprint of Web Bots
1) Bots behind NAT/Proxy
It is expected that some bots will use proxies before
connecting to a website, both to evade rate-limiting
and blocklisting and to potentially distribute the load
across multiple servers. Therefore, some requests that originate
from the same IP address may be emitted by different bots. To
understand whether multiple bots are “hiding” behind the same
IP address or whether a single bot is just sending requests
with multiple UAs, we can compare the TLS fingerprints
across requests of interest. To perform this analysis, we make
use of the following observations:
Basic web crawling tools. Tools like curl and wget will
produce only one fingerprint for a given OS and TLS library
combination. We use this information to identify the use of
these tools in the presence of UA spoofing.
Support for GREASE values in the TLS stack. Chrome,
Chromium, and Chromium-based browsers (such as Brave and
Opera) support GREASE, a new, TLS-handshake-related, IETF
standard [55]. The GREASE values that are sent in the TLS
handshakes produce multiple TLS fingerprints across multiple
requests, with differences in TLS cipher suites, extensions, and
E-curves. As a result, the TLS fingerprint of the aforementioned
browsers will be different in the first and the last 1-2 bytes for
all GREASE-related handshake values. GREASE was added
to Chrome in Version 55 [56] which we verified by testing
Chrome on several popular platforms (Ubuntu 16.04, Ubuntu
18.04, CentOS 7, Windows 10, Mac OS, and Android 8).
In contrast, browsers such as Firefox, Edge, and Safari
had not implemented GREASE at the time of our analysis.
As a result, GREASE values are absent from the handshakes
and therefore the TLS fingerprints of these browsers remain
the same across requests. For these browsers, we collected
their fingerprints over different operating systems and used
these fingerprints to uncover the true identity of bot requests.
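As an illustration of this GREASE-based check, the following minimal sketch (field names are assumptions, not Aristaeus' actual code) tests whether any RFC 8701 GREASE value appears among the cipher suites or extensions of a captured ClientHello; a client that consistently claims to be Chrome but never presents such values is unlikely to be a real Chrome browser.

    # Minimal sketch of checking a captured ClientHello for RFC 8701 GREASE values.
    # GREASE values have the form 0xXaXa: 0x0a0a, 0x1a1a, ..., 0xfafa.
    GREASE_VALUES = {0x0A0A + 0x1010 * i for i in range(16)}

    def contains_grease(cipher_suites, extensions):
        """cipher_suites and extensions are lists of 16-bit integers
        extracted from the ClientHello (field names are illustrative)."""
        return any(v in GREASE_VALUES for v in cipher_suites + extensions)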
Out of 43,985 IP addresses with a TLS fingerprint, there are
1,396 (3.17%) IPs with two or more sets of TLS fingerprints.
For this 3.17%, we observe that the requests originated from
different tools and/or OSs, hiding behind the same IP address.
If there was a TLS intercepting proxy in place, we would
not observe multiple TLS fingerprints but rather a single
TLS fingerprint (that of the intercepting proxy). Nevertheless,
distinguishing multiple clients behind a TLS proxy remains
a challenge for TLS fingerprinting.
2) TLS fingerprint of Tools
Given that only 558 unique TLS fingerprints are shared among all
10,156,305 HTTPS requests, the majority of requests can be attributed
to a small number of tools and TLS libraries.
To match bot TLS fingerprints to known tools, we manually extracted the
TLS fingerprints of Go-http-client, wget, curl, and the Chrome, Firefox,
Opera, Edge, and IE browsers, and included them in our database of signatures.
Moreover, for other TLS fingerprints that we could not repro-
duce in our setup, we assume that a crawler will not pretend to
be another crawler. For example, a crawler built using Python
may pretend to be Chrome or Firefox (in order to bypass anti-
bot mechanisms), but it has no reason to pretend to be curl given
that both tools are well known for building bots and therefore
receive the same treatment from anti-bot tools and services [8].
TABLE IV: Popular TLS fingerprint distribution. Entries below the line
correspond to Chromium-based tools that were not in the top ten in terms
of unique bot IP count.

Tools                         Unique FPs   IP Count   Total Requests
Go-http-client                    28         15,862        8,708,876
Libwww-perl or wget               17          6,102          120,423
PycURL/curl                       26          3,942           80,374
Python-urllib 3                    8          2,858           22,885
NetcraftSurveyAgent                2          2,381           14,464
msnbot/bingbot                     4          1,995           44,437
Chrome-1 (Googlebot)               1          1,836           28,082
Python-requests 2.x               11          1,063          754,711
commix/v2.9-stable                 3          1,029            5,738
Java/1.8.0                         8            308            1,710
MJ12Bot                            2            289           28,065
------------------------------------------------------------------
Chrome-2 (Chrome, Opera)           1            490           66,631
Chrome-3 (Headless Chrome)         1             80            2,829
Chrome-4 (coc coc browser)         1              4              101
------------------------------------------------------------------
Total                            113         38,239        9,879,326
Therefore, after excluding browser TLS fingerprints, we labeled each
remaining unknown TLS fingerprint with the description given by the
majority of the UA headers recorded alongside it.
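The following minimal sketch illustrates this labeling heuristic under assumed data structures (it is not Aristaeus' actual code): fingerprints reproduced in our setup keep their known label, while the remaining fingerprints are labeled with the majority UA family observed alongside them.

    # Sketch of the labeling heuristic described above.
    from collections import Counter, defaultdict

    def label_unknown_fingerprints(requests, known_labels):
        """`requests`: iterable of (tls_fingerprint, user_agent_family) pairs.
        `known_labels`: dict mapping fingerprints we reproduced to tool names."""
        seen = defaultdict(Counter)
        for fp, ua_family in requests:
            seen[fp][ua_family] += 1

        labels = {}
        for fp, ua_counts in seen.items():
            if fp in known_labels:
                labels[fp] = known_labels[fp]                # reproduced in our setup
            else:
                labels[fp] = ua_counts.most_common(1)[0][0]  # majority UA claim
        return labels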
Table IV lists the most popular tools, covering 113 unique fingerprints.
Given that one of these tools is based on Google Chrome, the bottom part
of Table IV lists the additional fingerprints that we could trace back to
Chrome. In total, these 14 tools produced 9,879,326 requests, covering
97.2% of all TLS requests. Bots using the Go language (and therefore the
Go-provided TLS libraries) are by far the most popular, exceeding more
traditional choices such as Python, Perl, and wget. We observe a total of
four different Chromium-related fingerprints, with distinct fingerprints
for bots operated by Google (Googlebot, Google-Image, and Google Page
Speed Insights), headless Chrome, and the coc coc browser, a Vietnamese
version of Chrome.
These results show the power of TLS fingerprinting in corroborating the
identity of benign bots and identifying malicious bots that are lying
about their identities. Out of 38,312 requests that claimed to be
msnbot/bingbot and had a valid TLS fingerprint, we were able to use
reverse DNS to verify that all of them were indeed real msnbots or
bingbots. Similarly, out of 28,011 requests that claimed to be Googlebot,
we matched 27,833 (99.4%) of them through TLS fingerprinting and
identified them as real Googlebots. The remaining bots also failed to
produce the expected reverse DNS results, pointing to malicious actors
who claim the Googlebot identity to avoid getting blocked.
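The reverse DNS check follows the verification procedure documented by the search engines themselves [37], [38]: the claimed bot's IP address must reverse-resolve to an official domain, and that hostname must resolve back to the same IP address. A minimal sketch of such a check (the suffix list and function name are illustrative) is shown below.

    # Sketch of reverse-DNS verification for search-engine bots.
    import socket

    OFFICIAL_SUFFIXES = (".googlebot.com", ".google.com",   # Googlebot
                         ".search.msn.com")                  # msnbot/bingbot

    def is_genuine_search_bot(ip):
        try:
            hostname, _, _ = socket.gethostbyaddr(ip)          # reverse lookup
            if not hostname.endswith(OFFICIAL_SUFFIXES):
                return False
            forward_ips = socket.gethostbyname_ex(hostname)[2]  # forward lookup
            return ip in forward_ips
        except (socket.herror, socket.gaierror):
            return False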
3) Using TLS fingerprinting to uncover the real identity of bots
Given our ability to match claimed user agents (UAs) with presented TLS
fingerprints, we checked the TLS fingerprints of all HTTPS-capable bots,
searching for mismatches between the stated UAs and the observed TLS
fingerprints. Overall, we discovered that 27,860 (86.2%) of the total of
30,233 clients that claimed to be Firefox or Chrome were in fact lying
about their identity.
Fake Chrome. Among the 12,148 IP addresses that claimed to be Chrome
through their UAs, 10,041 did not present the expected GREASE values in
their cipher suites. As a result, we can conclude that more than 82.6%
of these clients are lying about being Chrome. From their TLS
fingerprints, we can conclude that they are mostly curl and wget running
on Linux OSs.
Fake Firefox. Similarly, 18,085 IP addresses claimed, through their UAs,
to be Firefox. However, 12,418 (68.7%) of these Firefox clients actually
matched the fingerprints of Go-http-client, and 3,821 (21.1%) matched the
fingerprints of libwww-perl. A small number of requests (5.7%) matched
either Python or curl. The remaining 539 IP addresses did not match any
of the TLS fingerprints in our database, including the fingerprints of
Firefox. Overall, our results show that at least 17,819 out of 18,085
(98.5%) IP addresses that claimed to be Firefox are lying about their
identity.
Real Chrome. 351 of the 2,419 IP addresses that show signs of GREASE
support in their TLS handshakes claimed to belong to mobile Safari. This
is not possible, given that Safari did not support GREASE at the time of
our analysis (neither on Mac nor on the iPhone). This indicates actors
(benign or malicious) who wish to obtain the content that websites would
serve to mobile Safari browsers, but lack the ability to instrument real
Apple devices.
Other TLS fingerprints. Finally, there are 11,693 bots that have other
types of TLS fingerprints, which mostly belong to Go-http-client,
Python-urllib, curl, and wget, as shown in Table IV. They exhibit a wide
range of UAs, including MSIE, Android Browser, .NET CLR, and MS Word.
This indicates a much larger landscape of spoofed client identities,
beyond the Chrome/Firefox spoofing that we investigated in this section.
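A simplified version of the mismatch check used throughout this section can be sketched as follows (the fingerprint database and label names are assumptions, not Aristaeus' actual code): the browser family claimed in the UA is compared against the tool family that the presented TLS fingerprint maps to.

    # Sketch of flagging a mismatch between the claimed UA and the TLS fingerprint.
    def is_spoofed_identity(claimed_ua, tls_fp, fp_database):
        """`fp_database` maps TLS fingerprints to tool/browser families,
        e.g., 'chrome', 'firefox', 'go-http-client', 'curl'."""
        claimed = ("chrome" if "Chrome" in claimed_ua
                   else "firefox" if "Firefox" in claimed_ua
                   else None)
        actual = fp_database.get(tls_fp)   # None for unknown fingerprints
        if claimed is None or actual is None:
            return False                   # cannot decide either way
        return actual != claimed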
4) TLS fingerprints in Exploitation attempts
We applied our method of matching TLS fingerprints to the stated
identities of the bots behind the malicious requests that we previously
discussed in Section VI-A2. Table V presents the results. First, we can
observe that there are almost no real browsers accessing those resources,
corroborating our exploitation labels (under the reasonable assumption
that attackers do not need full-fledged browsers to send malicious
payloads to vulnerable websites). Second, there are major variations
across the different types of malicious requests. For example, 93.4% of
exploit requests use Golang, but only 171 requests use Golang to look for
misplaced backup files. Similarly, libwww/wget is popular in the backdoor
requests, but these tools do not appear in backup-file probing requests.
These results indicate different generations of tools and attackers,
using different underlying technologies to exploit different website
vulnerabilities.
VIII. CASE STUDIES
Bots only focusing on JS resources. Even though many bots do not request
images and other resources (presumably as a way of speeding up their
crawls), we observed bots that only request JavaScript resources. One bot
in our dataset
(IP address: 101.4.60.1**) was particularly interesting, as
it only downloaded JavaScript files but never, according to
Aristaeus’ tests, executed them. Given that the IP address of
this bot belongs to a Chinese antivirus company, we suspect
that the intention of that bot is to collect JavaScript files for
anti-malware research purposes.
TABLE V: TLS fingerprint of malicious requests

Type             Python   Golang   libwww/wget   Chrome/Firefox   Unknown    Total
Backdoor            231    1,718           349                3       482    2,783
Backup File         411      171            84                0     1,803    2,469
Exploits            275   18,283           607                0       390   19,555
Fingerprinting    1,524    3,670           630              139     7,226   13,189
Spikes in incoming traffic. We observed two major spikes in our dataset.
The first traffic surge happened from May 28th to June 17th, when a group
of bots continuously sent us log-in attempts and XML-RPC requests. These
bots initially requested /wp-includes/wlwmanifest.xml to check if a
honeysite was an instance of WordPress. They then extracted the list of
users from the author-list page and started brute-forcing the admin
account through POST requests towards xmlrpc.php (targeting WordPress's
authentication endpoint that is meant to be used as an API). This group
of bots issued a total of 4,851,989 requests, amounting to 18.4% of the
total requests. Similarly, the second traffic surge corresponds to 21.9%
of the total requests in our dataset.
Failed cloaking attempts. Modifying the HTTP User-Agent header is likely
the most common method of cloaking used by bots (both malicious bots
trying to exploit websites as well as benign bots operated by researchers
and security companies). Yet during our study, we observed failed
attempts to modify this header. For instance, we observed wrong spellings
of the “User-Agent” header, including “useragent” and “userAgent”.
Similarly, the “Host” header also appeared with different spellings and
letter cases, such as “HOST”, “host”, or “hoSt”. The appearance of these
spelling artifacts means that these header fields were forged. For
certain HTTP libraries, however, an incorrect spelling results in both
the original header and the new header being sent out. Therefore, some
requests recorded by Aristaeus included both “User-Agent” and “userAgent”
headers. For these bots, the original “User-Agent” header indicated,
e.g., “python-requests/2.23.0”, whereas the “userAgent” header reported
“Mozilla/5.0 (Windows NT 6.1; WOW64; rv:45.0) Gecko/20100101 Firefox/45.0”.
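Such failed cloaking attempts can be flagged with a simple check over the raw header names, as in the following sketch (the canonical list and function are illustrative, not Aristaeus' actual implementation): any header whose name normalizes to a known field but deviates from its canonical spelling is marked as forged.

    # Sketch of detecting misspelled/forged header names such as
    # "useragent", "userAgent", "HOST", or "hoSt".
    CANONICAL = {"useragent": "User-Agent", "host": "Host"}

    def forged_header_names(header_names):
        """`header_names`: list of header names exactly as seen on the wire,
        preserving case."""
        suspicious = []
        for name in header_names:
            key = name.lower().replace("-", "").replace("_", "")
            canonical = CANONICAL.get(key)
            if canonical and name != canonical:
                suspicious.append(name)
        return suspicious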
Time to weaponize public exploits. During the seven-month span of this
study, we observed requests that tried to exploit five remote command
execution (RCE) vulnerabilities that went public after the start of our
data collection. As a result, we have visibility over the initial probes
for these exploits. The five RCE vulnerabilities affect the following
software/firmware: DrayTech modems (CVE-2020-8585), Netgear GPON router
(EDB-48225), MSSQL Reporting Servers (CVE-2020-0618), Liferay Portal
(CVE-2020-7961), and F5 Traffic Management UI (CVE-2020-5902).
For the first vulnerability, affecting DrayTech devices, the exploit was
released on March 29, 2020, and we observed exploitation attempts on
Aristaeus' honeysites a mere two days later. In a similar fashion, the
exploit for Netgear devices went public on March 18, 2020, and the first
exploitation attempts were recorded by Aristaeus on the same day. Next,
the proof-of-concept exploit for the MSSQL Reporting Services
vulnerability was made public by the researcher who reported it on
February 14, 2020, and we received exploitation attempts for this
vulnerability four days later [57]. The Liferay vulnerability went public
on March 20, 2020, and exploitation requests showed up in Aristaeus' logs
four days later. Finally, the F5 vulnerability was publicly announced on
June 30, 2020, and we observed requests targeting the F5 TMUI shell on
the same day.
Based on these five occasions, we can clearly observe that the time
window between an exploit going public and malicious actors probing for
that vulnerability is short and, in certain cases (such as the Netgear
and F5 devices), non-existent.
IX. DISCUSSION
In this section, we first highlight the key takeaways from
our analysis of the data that Aristaeus collected, and then
explore how the size of our infrastructure relates to the
number of bots discovered. We close by discussing Aristaeus’s
limitations as well as future work directions.
A. Key Takeaways
Everyone is a target: Just by being online and publicly accessible, each
one of Aristaeus' honeysites attracted an average of 37,753 requests per
month, 21,523 (57%) of which were clearly malicious. Each online site is
exposed to fingerprinting and a wide range of attacks, abusing both
operator errors (such as common passwords) as well as recently-released
exploits.
Most generic bot requests are generated by rudimentary HTTP libraries:
Throughout our data analysis, we observed that 99.3% of the bots that
visit our websites do not support JavaScript. This renders
state-of-the-art browser fingerprinting, which is based on advanced
browser APIs and JavaScript, ineffective. To combat this, we demonstrated
that TLS fingerprinting can be used to accurately fingerprint browsing
environments based on common HTTP libraries.
Most bots are located in residential IP space:
Through
our experiments, we observed that the majority (64.37%) of
bot IP addresses were residential ones, while only 30.36%
of IP addresses were located in data centers. This indicates
that bots use infected or otherwise proxied residential devices
to scan the Internet. We expect that requests from residential
IP space are less susceptible to rate limiting and blocklisting
compared to requests from data centers and public clouds, out
of fear of blocking residential users.
Generic bots target low-hanging fruit: Aristaeus' logs reveal that 89.5%
of sessions include fewer than 20 requests, and fewer than 0.13% of
sessions include over 1,000 requests. The brute-force attempts exhibit
similar patterns: 99.6% of IP addresses issued fewer than 10 attempts per
domain, while only 0.3% of IP addresses issued more than 100 attempts per
domain. This indicates that most bots are highly selective and surgical
in their attacks, going after easy-to-exploit targets.
IP blocklists are incomplete:
The vast majority (87%)
of the malicious bot IPs from our honeysite logs were not
listed in popular IP blocklists. This further emphasizes the
limited benefits of static IP blocklisting and therefore the
need for reactive defenses against bots. At the same time, the
poor blocklist coverage showcases the practical benefits of
Aristaeus, which can discover tens of thousands of malicious
clients that are currently missing from popular blocklists.
Exploits that go public are quickly abused: We observed that bots start
probing for vulnerable targets as quickly as on the same day that an
exploit was made public. Our findings highlight the importance of
immediately updating Internet-facing software, as any delay, however
small, can be capitalized on by automated bots to compromise systems and
services.
Fake search-engine bots are not as common as one would expect: Contrary
to our expectations, less than 0.3% of the total requests that claimed to
be a search-engine bot were lying about their identity. This could
suggest that either search-engine bots do not receive special treatment
from websites, or that the provided mechanisms for verifying the source
IP addresses of search-engine bots have been successful in deterring bot
authors from pretending to be search engines.
Most generic internet bots use the same underlying
HTTP libraries:
Even though Aristaeus recorded more than
10.2 million HTTPS requests from bots, these bots generated
just 558 unique TLS fingerprints. We were able to trace these
fingerprints back to 14 popular tools and HTTP libraries,
responsible for more than 97.2% of the total requests.
B. Size of Infrastructure
To understand how the number of Aristaeus-managed honeysites affects the
collected intelligence (e.g., could Aristaeus record just as many
attackers with 50 instead of 100 honeysites?), we conduct the following
simulation. We pick an ordering of honeysites at random and calculate the
number of unique IP addresses and TLS fingerprints that each honeysite
contributes.
Figure 7 shows the results when this simulation is repeated
100 times. We resort to simulation (instead of just ordering the
honeysites by contributing intelligence) since a new real de-
ployment would not have access to post-deployment statistics.
We observe two clearly different distributions. While one
could obtain approximately 50% of the TLS fingerprints with
just 10% of the honeysites, the number of unique IP addresses
grows linearly with the number of honeysites. Our findings indicate that,
if the objective is the curation of TLS fingerprints from bots, a smaller
infrastructure can suffice; yet, if the objective is the collection of
bot IP addresses, one could deploy an even larger number of honeysites
and still obtain new observations.
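A minimal sketch of this simulation (assuming per-honeysite observation sets extracted from the collected logs; names are illustrative) is given below: honeysites are shuffled, and the cumulative number of unique observations is recorded as sites are added one at a time.

    # Sketch of the coverage simulation behind Figure 7.
    import random

    def coverage_curve(per_site_observations, repetitions=100):
        """`per_site_observations`: dict mapping each honeysite to the set of
        unique IPs (or TLS fingerprints) it observed. Returns one cumulative
        coverage curve per repetition."""
        sites = list(per_site_observations)
        curves = []
        for _ in range(repetitions):
            random.shuffle(sites)
            seen, curve = set(), []
            for site in sites:
                seen |= per_site_observations[site]   # add this site's observations
                curve.append(len(seen))               # cumulative unique count
            curves.append(curve)
        return curves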
C. Limitations and Future Work
In this study, we deployed honeysites based on five popular web
applications. Since we discovered that many bots first fingerprint their
targets before sending exploits (Section VI-A), our honeysites will not
always capture exploit attempts towards unsupported web applications.
By design, Aristaeus in general, and honeysites in particular, attract
traffic from generic bots that target every website on the Internet. Yet
high-profile websites, as well as specific companies, often receive
targeted bot attacks that are tailored to their infrastructure. These
bots and their attacks will likely not be captured by Aristaeus, unless
attackers first try them on the public web.
As follow-up work, we plan to design honeysites that can dynamically
react to bots, deceiving them into believing that they are interacting
with the web application that they are looking for. Malware analysis
systems already make use of variations of this technique to, e.g., detect
malicious browser extensions [58] and malicious websites that attempt to
compromise vulnerable browser plugins [59]. In this way, we expect that
our honeysites will be able to capture more attacks (particularly from
the single-shot scanners that currently do not send any traffic past an
initial probing request) and exploit payloads.
Fig. 7: (Top) Number of honeysites and their coverage of bot TLS
fingerprints. The opaque line is the median of 100 random experiments.
(Bottom) Number of honeysites and their coverage of bot IP addresses.
X. RELATED WORK
Characterization and detection of crawlers.
To quantify crawler behavior and the differences between crawler
strategies, a number of researchers analyzed large corpora of web server
logs and reported on their findings. These studies include the analysis
of 24 hours' worth of crawler traffic on microsoft.com [60], 12 weeks of
traffic from the website belonging to the Standard Performance Evaluation
Corporation (SPEC) [61], 12.5 years of network traffic logs collected at
the Lawrence Berkeley National Laboratory [62], as well as crawler
traffic on academic networks [63] and publication libraries [64].
The security aspect of web crawlers has received significantly
less attention compared to studies of generic crawling activity.
Most existing attempts to differentiate crawlers from real users rely on
differences in their navigational patterns, such as the percentage of
HTTP methods in requests (GET/POST/HEAD, etc.), the types of links
requested, and the timing between individual requests [4]–[6]. These
features are then used in one or more supervised machine-learning
algorithms trained using ground truth that the authors of each paper were
able to procure, typically by manually labeling traffic of one or more web servers
to which they had access. Xie et al. propose an offline method
for identifying malicious crawlers by searching for clusters of
requests towards non-existent resources [65]. Park et al. [22]
investigated the possibility of detecting malicious web crawlers
by looking for mouse movement, the loading of Cascading
Style Sheets (CSS), and the following of invisible links. Jan
et al. [66] propose an ML-based bot detection scheme trained
on limited labeled bot traffic from an industry partner. By
using data augmentation methods, they are able to synthesize
samples and expand their dataset used to train ML models.
In our paper, we sidestep the issue of having to manually label traffic
as belonging to malicious crawlers through the use of honeysites, i.e.,
websites that are never advertised and whose visitors must therefore, by
definition, either be benign/malicious crawlers or their operators
following up on a discovery from one of their bots. We therefore argue
that the access logs that we collected via our network of honeysites can
be used as ground truth to train more accurate ML-based bot-detection
systems.
Network telescopes and honeypots. Network telescopes and
honeypots are two classes of systems developed to study
Internet scanning activity at scale. First introduced by Moore et
al. [67], network telescopes observe and measure remote network-security
events by using a customized router to reroute traffic destined for
invalid IP address blocks to a collection server. These works are
effective at capturing network security events such as DDoS attacks,
Internet scanning, and worm infections. Recent work by Richter et
al. [68] applies the same principles to 89,000 CDN servers spread across
172 Class A prefixes and finds that 32% of all logged scan traffic is the
result of localized scans.
While large in scale, network telescopes either do not provide any
responses or interactive actions, or support only basic responses at the
network level (e.g., sending back a SYN-ACK upon receiving a SYN on
certain ports [69]). This makes them incapable of analyzing crawler
interactions with servers and web applications.
Honeypots provide decoy computing resources for the
purpose of monitoring and logging the activities of entities
that probe them. High-interaction honeypots can respond to
probes with high fidelity, but are hard to set up and maintain.
In contrast, low-interaction honeypots such as Honeyd [70]
and SGNET [71] intercept traffic sent to nonexistent hosts
and use simulated systems with various “personalities” to
form responses. This allows low-interaction honeypots to be
somewhat extensible while limiting their ability to respond
to the probes [72].
Our system, Aristaeus, combines some of the best properties of both
worlds. Aristaeus is extensible and can be automatically deployed on
globally-dispersed servers. However, unlike network telescopes and
low-interaction honeypots, our system does not restrict itself to
network-layer responses
but instead utilizes real web applications that have been
augmented to perform traditional as well as novel types of
client fingerprinting. In these aspects, our work most closely
relates to the honeynet system by Canali and Balzarotti which
utilized 500 honeypot websites with known vulnerabilities
(such as SQL injections and Remote Command Execution
bugs), and studied the exploitation and post-exploitation
behavior of attackers [12]. While our systems share similarities
(such as the use of real web applications instead of mock web
applications or low-interaction webserver honeypots), our focus
is on characterizing the requests that we receive, clustering
them into crawling campaigns, and uncovering the real identity
of crawlers. In contrast, because of their setup, Canali and
Balzarotti [12] are able to characterize how exactly attackers
attempt to exploit known vulnerabilities, what types of files
they upload to the compromised servers, and how attackers
abuse the compromised servers for phishing and spamming (all
of which are well outside Aristaeus’s goals and capabilities).
XI. CONCLUSION
In this paper, we presented the design and implementation of Aristaeus, a
system for deploying and managing large numbers of web applications on
previously-unused domain names, for the sole purpose of attracting web
bots.
Using Aristaeus, we conducted a seven-month-long, large-scale
study of crawling activity recorded at 100 globally distributed
honeysites. These honeysites captured more than 200 GB of
crawling activity, on websites that have zero organic traffic.
By analyzing this data, we discovered not only the expected
bots operated by search engines, but an active and diverse
ecosystem of malicious bots that constantly probed our
infrastructure for operator errors (such as poor credentials
and sensitive files) as well as vulnerable versions of online
software. Among others, we discovered that an average Aristaeus-managed
honeysite received more than 37K requests per month from bots, 50% of
which were malicious. Out of the 76,000 IP addresses operated by clearly
malicious bots and recorded by Aristaeus, 87% are currently missing from
popular IP-based blocklists. We observed that malicious bots engage in
brute-force attacks and web-application fingerprinting, and can rapidly
add new exploits to their abusive capabilities, even on the same day that
an exploit becomes public. Finally, through novel header-based,
TLS-based, and JavaScript-based fingerprinting techniques, we uncovered
the true identity of bots, finding that most bots that claim to be a
popular browser are in fact lying and are instead implemented on top of
simple HTTP libraries built using Python and Go.
In addition to these insights into the abuse by malicious bots, Aristaeus
allowed us to curate a dataset that is virtually free of organic user
traffic, which we will make available to researchers upon publication of
this paper. This bot-only dataset can be used to better understand the
dynamics of bots and to design more accurate bot-detection algorithms.
XII. AVAILABILITY
One of the main contributions of this paper is the curation of a
bot-only traffic dataset. To facilitate and advance research on the topic
of bot detection, our dataset will be made available to other researchers
upon request.
ACKNOWLEDGMENT
We thank the reviewers for their valuable feedback. This work was
supported by the Office of Naval Research under grants N00014-20-1-2720
and N00014-20-1-2858, by the National Science Foundation under grants
CNS-1813974, CNS-1941617, and CMMI-1842020, as well as by a 2018 Amazon
Research Award. Any opinions, findings, or conclusions expressed in
this material are those of the authors and do not necessarily
reflect the views of the sponsors.
REFERENCES
[1] A. Shirokova, “Cms brute force attacks are still a threat.” [Online].
Available: https://blogs.cisco.com/security/cms-brute-force-attacks-are-still-a-threat
[2]
T. Canavan, CMS Security Handbook: The Comprehensive Guide for
WordPress, Joomla, Drupal, and Plone. John Wiley and Sons, 2011.
[3] Imperva, “Bad bot report 2020: Bad bots strike back.” [Online].
Available: https://www.imperva.com/resources/resource-library/reports/2020-bad-bot-report/
[4] A. G. Lourenço and O. O. Belo, “Catching web crawlers in the act,”
in Proceedings of the 6th International Conference on Web Engineering,
2006, pp. 265–272.
[5]
P.-N. Tan and V. Kumar, “Discovery of web robot sessions based on
their navigational patterns, in Intelligent Technologies for Information
Analysis. Springer, 2004, pp. 193–222.
[6]
G. Jacob, E. Kirda, C. Kruegel, and G. Vigna, “Pubcrawl: Protecting
users and businesses from crawlers, in Presented as part of the 21st
USENIX Security Symposium (USENIX Security 12), 2012, pp. 507–522.
[7]
A. Vastel, W. Rudametkin, R. Rouvoy, and X. Blanc, “FP-Crawlers:
Studying the Resilience of Browser Fingerprinting to Block Crawlers,
in MADWeb’20 - NDSS Workshop on Measurements, Attacks, and
Defenses for the Web.
[8]
B. Amin Azad, O. Starov, P. Laperdrix, and N. Nikiforakis,
“Web Runner 2049: Evaluating Third-Party Anti-bot Services,
in 17th Conference on Detection of Intrusions and Malware
& Vulnerability Assessment (DIMVA), 2020. [Online]. Available:
https://hal.archives-ouvertes.fr/hal-02612454
[9]
K. Bock, D. Patel, G. Hughey, and D. Levin, “uncaptcha: a low-resource
defeat of recaptcha’s audio challenge, in 11th USENIX Workshop on
Offensive Technologies (WOOT 17), 2017.
[10]
M. Motoyama, K. Levchenko, C. Kanich, D. McCoy, G. M. Voelker, and
S. Savage, “Re: Captchas-understanding captcha-solving services in an
economic context. in USENIX Security Symposium, vol. 10, 2010, p. 3.
[11]
S. Sivakorn, J. Polakis, and A. D. Keromytis, “I’m not a human:
Breaking the google recaptcha, Black Hat, 2016.
[12] D. Canali and D. Balzarotti, “Behind the Scenes of Online Attacks: an
Analysis of Exploitation Behaviors on the Web, in Proceedidngs of the
20th Network & Distributed System Security Symposium (NDSS), 2013.
[13]
C. Lever, R. Walls, Y. Nadji, D. Dagon, P. McDaniel, and M. Antonakakis,
“Domain-z: 28 registrations later measuring the exploitation of residual
trust in domains, in IEEE Symposium on Security and Privacy (SP),
2016, pp. 691–706.
[14] “Seleniumhq browser automation, https://www.selenium.dev/.
[15]
P. Eckersley, “How unique is your web browser?” in International Sym-
posium on Privacy Enhancing Technologies Symposium, 2010, pp. 1–18.
[16]
N. Nikiforakis, A. Kapravelos, W. Joosen, C. Kruegel, F. Piessens, and
G. Vigna, “Cookieless monster: Exploring the ecosystem of web-based
device fingerprinting, in 2013 IEEE Symposium on Security and
Privacy, 2013, pp. 541–555.
[17]
O. Starov and N. Nikiforakis, “Xhound: Quantifying the fingerprintability
of browser extensions, in IEEE Symposium on Security and Privacy
(SP), 2017, pp. 941–956.
[18]
P. Laperdrix, W. Rudametkin, and B. Baudry, “Beauty and the beast:
Diverting modern web browsers to build unique browser fingerprints,
in IEEE Symposium on Security and Privacy (SP), 2016, pp. 878–894.
[19]
M. Mulazzani, P. Reschl, M. Huber, M. Leithner, S. Schrittwieser,
E. Weippl, and F. Wien, “Fast and reliable browser identification with
javascript engine fingerprinting, in Web 2.0 Workshop on Security and
Privacy (W2SP), vol. 5, 2013.
[20]
L. Brotherston, “Tls fingerprinting. [Online]. Available:
https://github.com/LeeBrotherston/tls-fingerprinting
[21]
Z. Durumeric, Z. Ma, D. Springall, R. Barnes, N. Sullivan, E. Bursztein,
M. Bailey, J. A. Halderman, and V. Paxson, “The security impact of
https interception. in Proceedings of the 24th Network and Distributed
System Security Symposium (NDSS), 2017.
[22]
K. Park, V. S. Pai, K.-W. Lee, and S. B. Calo, “Securing web service
by automatic robot detection. in USENIX Annual Technical Conference,
General Track, 2006, pp. 255–260.
[23]
G. Analytics, “How a web session is defined in analytics. [Online].
Available: https://support.google.com/analytics/answer/2731565?hl=en
[24]
Valve, “Fingerprintjs2. [Online]. Available: https:
//github.com/Valve/fingerprintjs2
[25]
M. W. Docs, “Content security policy (csp). [Online]. Available:
https://developer.mozilla.org/en-US/docs/Web/HTTP/CSP
[26]
S. Stamm, B. Sterne, and G. Markham, “Reining in the web with content
security policy, in Proceedings of the 19th international conference
on World wide web, 2010, pp. 921–930.
[27]
N. Virvilis, B. Vanautgaerden, and O. S. Serrano, “Changing the game:
The art of deceiving sophisticated attackers, in 2014 6th International
Conference On Cyber Conflict (CyCon 2014). IEEE, 2014, pp. 87–97.
[28]
B. P. Ltd, “Web technology usage trends (accessed march 27,2020).
[Online]. Available: https://trends.builtwith.com
[29] “Wordpress: About us, https://wordpress.com/about/.
[30] “Ip2location lite ip-asn database,” https://lite.ip2location.com/database/ip-asn.
[31]
X. Mi, X. Feng, X. Liao, B. Liu, X. Wang, F. Qian, Z. Li, S. Alrwais,
L. Sun, and Y. Liu, “Resident evil: Understanding residential ip proxy
as a dark service, in 2019 IEEE Symposium on Security and Privacy
(SP), 2019.
[32]
D. Liu, S. Hao, and H. Wang, All your dns records point to us:
Understanding the security threats of dangling dns records, in
Proceedings of the 2016 ACM SIGSAC Conference on Computer and
Communications Security, 2016, pp. 1414–1425.
[33]
K. Borgolte, T. Fiebig, S. Hao, C. Kruegel, and G. Vigna, “Cloud strife:
mitigating the security risks of domain-validated certificates, 2018.
[34]
S. F. McKenna, “Detection and classification of web robots with
honeypots, Naval Postgraduate School Monterey United States, Tech.
Rep., 2016.
[35] R. Barnett, “Setting HoneyTraps with ModSecurity: Adding Fake
robots.txt Disallow Entries,” https://www.trustwave.com/en-us/resources/blogs/spiderlabs-blog/setting-honeytraps-with-modsecurity-adding-fake-robotstxt-disallow-entries/.
[36]
R. Haswell, “Stopping bad robots with honeytraps, https:
//www.davidnaylor.co.uk/stopping-bad-robots-with-honeytraps.html.
[37]
Google, “Verifying googlebot. [Online]. Available:
https://support.google.com/webmasters/answer/80553
[38]
Microsoft, “Verifying bingbot. [Online]. Available:
https://www.bing.com/toolbox/verify-bingbot
[39] Yandex, “Verifying yandexbot.” [Online]. Available:
https://yandex.com/support/webmaster/robot-workings/check-yandex-robots.html
[40] Baidu, “How can i know the crawling is from baiduspider.” [Online].
Available: https://help.baidu.com/question?prod_id=99&class=0&id=3001
[41]
Google, “Overview of google crawlers. [Online]. Available:
https://support.google.com/webmasters/answer/1061943
[42]
Netcraft, “Netcraft: Active cyber defence, URL:
https://www.netcraft.com/, 2014.
[43]
“Internet Archive: Digital Library of Free & Borrowable Books, Movies,
Music & Wayback Machine, https://archive.org/.
[44] C. Security, “Multiple vulnerabilities in draytek products could
allow for arbitrary code execution.” [Online]. Available:
https://www.cisecurity.org/advisory/multiple-vulnerabilities-in-draytek-products-could-allow-for-arbitrary-code-execution_2020-043/
[45] M. Daniel, H. Jason, and g0tmi1k, “10k most common credentials.”
[Online]. Available: https://github.com/danielmiessler/SecLists/blob/master/Passwords/Common-Credentials/10k-most-common.txt
[46]
“The blindelephant web application fingerprinter. [Online]. Available:
https://github.com/lokifer/BlindElephant
[47] “Whatweb.” [Online]. Available: https://github.com/urbanadventurer/WhatWeb
[48] “Thinkphp remote code execution vulnerability used to deploy
malware.” [Online]. Available: https://www.tenable.com/blog/thinkphp-remote-code-execution-vulnerability-used-to-deploy-variety-of-malware-cve-2018-20062
[49] “Exploit code published for two dangerous apache solr remote code
execution flaws.” [Online]. Available: https://www.zdnet.com/article/exploit-code-published-for-two-dangerous-apache-solr-remote-code-execution-flaws/
[50]
O. Starov, J. Dahse, S. S. Ahmad, T. Holz, and N. Nikiforakis, “No
honor among thieves: A large-scale analysis of malicious web shells,
in Proceedings of the 25th International Conference on World Wide
Web, ser. WWW ’16, 2016, p. 1021–1032.
[51]
“Seclists: A collection of multiple types of lists
used during security assessments. [Online]. Available:
https://github.com/danielmiessler/SecLists
[52] “lhlsec/webshell.” [Online]. Available: https://github.com/lhlsec/webshell
[53]
R. D. Graham, “Masscan: Mass ip port scanner, URL: https://github.
com/robertdavidgraham/masscan, 2014.
[54]
Z. Durumeric, E. Wustrow, and J. A. Halderman, “Zmap: Fast
internet-wide scanning and its security applications, in Presented as
part of the 22nd USENIX Security Symposium (USENIX Security 13),
2013, pp. 605–620.
[55] D. Benjamin, “Applying generate random extensions and sustain
extensibility (grease) to tls extensibility,” RFC 8701, Internet
Engineering Task Force (IETF). [Online]. Available: https://tools.ietf.org/html/rfc8701
[56] “Chrome platform status: Grease for tls,”
https://www.chromestatus.com/feature/6475903378915328.
[57] “Cve-2020-0618: Rce in sql server reporting services (ssrs).”
[Online]. Available: https://www.mdsec.co.uk/2020/02/cve-2020-0618-rce-in-sql-server-reporting-services-ssrs/
[58]
A. Kapravelos, C. Grier, N. Chachra, C. Kruegel, G. Vigna, and V. Paxson,
“Hulk: Eliciting malicious behavior in browser extensions, in 23rd
USENIX Security Symposium (USENIX Security 14), 2014, pp. 641–654.
[59]
M. Cova, C. Kruegel, and G. Vigna, “Detection and analysis of drive-by-
download attacks and malicious javascript code, in Proceedings of the
19th international conference on World wide web, 2010, pp. 281–290.
[60]
J. Lee, S. Cha, D. Lee, and H. Lee, “Classification of web robots:
An empirical study based on over one billion requests, computers &
security, vol. 28, no. 8, pp. 795–802, 2009.
[61] M. C. Calzarossa, L. Massari, and D. Tessera, “An extensive study of
web robots traffic,” in Proceedings of International Conference on
Information Integration and Web-based Applications & Services, 2013, p. 410.
[62] M. Allman, V. Paxson, and J. Terrell, “A brief history of scanning,”
in Proceedings of the 7th ACM SIGCOMM conference on Internet measurement,
2007, pp. 77–82.
[63] M. D. Dikaiakos, A. Stassopoulou, and L. Papageorgiou, “An
investigation of web crawler behavior: characterization and metrics,”
Computer Communications, vol. 28, no. 8, pp. 880–897, 2005.
[64]
P. Huntington, D. Nicholas, and H. R. Jamali, “Website usage metrics: A
re-assessment of session data, Information Processing & Management,
vol. 44, no. 1, pp. 358–372, 2008.
[65]
G. Xie, H. Hang, and M. Faloutsos, “Scanner hunter: Understanding
http scanning traffic, in Proceedings of the 9th ACM symposium on
Information, computer and communications security, 2014, pp. 27–38.
[66]
S. T. Jan, Q. Hao, T. Hu, J. Pu, S. Oswal, G. Wang, and B. Viswanath,
“Throwing darts in the dark? detecting bots with limited data using
neural data augmentation, The 41st IEEE Symposium on Security and
Privacy (IEEE SP), Jan 2020.
[67]
D. Moore, C. Shannon, G. Voelker, and S. Savage, “Network telescopes:
Technical report, Cooperative Association for Internet Data Analysis
(CAIDA), Tech. Rep., 2004.
[68]
P. Richter and A. Berger, “Scanning the scanners: Sensing the internet
from a massively distributed network telescope, in Proceedings of the
Internet Measurement Conference, 2019, pp. 144–157.
[69]
M. Bailey, E. Cooke, F. Jahanian, J. Nazario, and D. Watson, “The internet
motion sensor-a distributed blackhole monitoring system. in NDSS, 2005.
[70]
N. Provos, A virtual honeypot framework. in USENIX Security
Symposium, vol. 173, 2004, pp. 1–14.
[71]
C. Leita and M. Dacier, “Sgnet: a worldwide deployable framework
to support the analysis of malware threat models, in 2008 Seventh
European Dependable Computing Conference. IEEE, 2008, pp. 99–109.
[72]
C. Kreibich and J. Crowcroft, “Honeycomb: creating intrusion detection
signatures using honeypots, ACM SIGCOMM computer communication
review, vol. 34, no. 1, pp. 51–56, 2004.
XIII. APPENDIX
TABLE VI: Aristaeus dataset description (collection period: 2020-01-24
00:00:01 to 2020-08-24 23:59:59)

Dataset          Requests       %Requests   Unique IP Addresses   Blocklist Coverage   Shared Crawling
Benign           347,386        1.3%        6,802                 6.91%                Yes
Malicious        15,064,878     57%         76,396                13%                  No
Unknown / Gray   11,015,403     41.68%      206,111               11.64%               No
Total            26.4 million   100%        287K                  11.61%               (Mixed)
TABLE VII: Top requested URLs in different web applications (ranked 1-5
per application)

WordPress:
  1. /xmlrpc.php (62.664%)
  2. /wp-login.php (25.094%)
  3. /wp-admin/ (12.239%)
  4. /administrator/index.php (0.001%)
  5. /administrator (0.001%)

Joomla:
  1. /administrator/index.php (48.333%)
  2. /administrator/ (41.512%)
  3. /wp-login.php (9.29%)
  4. /xmlrpc.php (0.822%)
  5. /wp-admin/ (0.044%)

Drupal:
  1. /user/login?destination=/node/1#comment-form (81.143%)
  2. /wp-login.php (18.064%)
  3. /xmlrpc.php (0.476%)
  4. /administrator/ (0.222%)
  5. /administrator/index.php (0.095%)

PHPMyAdmin:
  1. (POST) /index.php (75.65%)
  2. (POST) /phpmyadmin/index.php (9.658%)
  3. (GET) /phpmyadmin/index.php (5.715%)
  4. /wp-login.php (8.228%)
  5. /vendor/phpunit/phpunit/src/Util/PHP/eval-stdin.php (0.749%)

Webmin:
  1. /session_login.cgi (79.93%)
  2. /wp-login.php (13.649%)
  3. /xmlrpc.php (4.51%)
  4. /robots.txt (1.684%)
  5. /wp-admin/ (0.227%)