Security Data Science: Threat Hunting from Sysmon Logs based on Zipf’s Law
Anomaly detection engineering based on power law distribution in enterprise security telemetry
This article focuses on the “data-centric security” paradigm — the utilization of statistical knowledge for cybersecurity needs. This threat-hunting approach produces consistent results regardless of deployment and environment-specific characteristics. We will define baseline activity based on the statistical pattern of Zipf’s law (more generally known as the “power law”), using the built-in functionality of many tools like Elasticsearch or Splunk, and evaluate out-of-baseline elements. Additionally, we share an end-to-end anomaly-based detection implementation in Python that analyzes Sysmon ID 1 process creation events, with threat intelligence-powered analysis of out-of-baseline executable hashes.
Preface
Cybersecurity, in its current conceptual state, seems essentially unsolvable. Over the last decades, we have seen a never-ending cat-and-mouse game between offensive and defensive techniques. Today, the security problem is essentially a risk-minimization task, and a generally successful approach across the industry is to avoid being among the lowest-hanging fruit.
However, even top-tier organizations have their challenges. For example, in the recent Lapsus$ activity, a threat actor with minimal experience and mediocre resources was able to access sensitive parts of infrastructures at Microsoft, NVIDIA, Okta, and others. Such cases reveal gray areas in conventional security operations, like insider threats or weakness against MFA bypasses.
A supposedly efficient approach, aptly highlighted by Dave, is to focus on baseline activity and treat out-of-baseline (OOB) actions as having investigation potential:
Another term used to describe OOB elements is “anomaly.” Generally, when discussing anomalies, security practitioners focus on machine learning (ML) solutions. However, most of the ML hype is artificially pumped up for marketing needs. Indeed, ML has huge unrealized potential in the cybersecurity domain, with a significant need for applied research. Yet we would like to emphasize that ML is not paramount, often not needed, and frequently counterproductive.
ML challenges occur because, in novelty and outlier detection problems, defining a baseline in a multi-dimensional concept space like an enterprise network poses a significant challenge, hence the failure of many high-cost anomaly detection solutions. This does not mean we cannot use ML algorithms for unsupervised security problems. Still, I deliberately omitted the discussion of ML-based anomaly detection algorithms in this article (some ideas are shared in this Twitter thread) since we propose a more straightforward and widely supported statistical approach.
Zipf’s Law
Here we discuss a specific way of defining the baseline based on Zipf’s Law. By design, it always produces out-of-baseline (OOB) elements; you cannot treat them as alerts directly, or you welcome alert fatigue. Therefore, we discuss various ways to address OOB values.
Zipf’s Law is described with fancy statistical terms in the literature, but we can define it as “a pattern of element distribution” in a specific closed environment. Such a pattern forms a rapidly decaying, long-tailed curve, as shown by the purple line in Figure 1 below:
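In more formal terms (this is the textbook statement of the law, not something derived from our data), the frequency of the k-th most frequent element falls off roughly as the inverse of its rank:

f(k) \propto \frac{1}{k^{s}}, \qquad s \approx 1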
Originally, Zipf’s Law was formulated for natural language and is a staple of Natural Language Processing (NLP), representing the distribution of words in a corpus. In Figure 1 above, the y-axis conveys a word count, and the x-axis represents unique words ordered by frequency. For example, if we count the appearances of words on the Wikipedia page about Elizabeth II, we get:
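A minimal sketch of how these counts can be obtained with the Python standard library (the file name below is a hypothetical stand-in for the saved page text):

import re
from collections import Counter

# hypothetical input: the page text saved locally beforehand
text = open("elizabeth_ii_wiki.txt", encoding="utf-8").read()

# lowercase alphabetic tokens only
tokens = re.findall(r"[a-z]+", text.lower())

# count and order by frequency: the head is dominated by stopwords,
# the tail consists of words that appear only once
word_counts = Counter(tokens).most_common()
print(word_counts[:5])
print(word_counts[-5:])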
It appears that in natural languages, frequent words (“the”, “to”, “of”, “a”, “and”, etc.) have little to no meaning within the text. Such words are called “stopwords” and usually are omitted from further NLP processing. That’s why it is common to see the following code at the beginning of NLP applications:
import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')  # make sure the stopword list is available locally
nltk_stopwords = stopwords.words('english')
words = [word for word in text.split() if word.lower() not in nltk_stopwords]
...
In NLP, to train a model, the most valuable part of the Zipfian distribution is somewhere in the middle, whereas rare words are too unique to affect the training process. Consider the words that appear only once in the word frequency distribution above:
[('Tudor', 1),
('sovereign', 1),
('Jamaica', 1),
('cross', 1),
('argent', 1)]
While such elements bring little value to NLP tasks, they are of particular interest to us since they represent out-of-baseline language! Our experience shows that this property of Zipf’s law is highly relevant for defensive security problems.
Patterns in Enterprise
Zipf’s Law represents a more general pattern than just word frequency in an English corpus. We see the same structure of element appearance in many aspects of production operations. We collected data from the network of an undisclosed security partner representing a medium-sized enterprise, specifically:
- external outbound connection destination IP addresses from Sysmon Event ID 3 events; “external” means non-private network IP range and non-self ASN;
- process binary names and portable executable (PE) hashes from Sysmon Event ID 1 process creation events;
- usernames from account logon events (Event ID 4624).
And this is the pattern these data types form after a count-based aggregation over a seven-day period (note that the y-axes are logarithmic):
Each of them forms the already familiar pattern. A significant benefit of this approach is that acquiring count-based statistics is straightforward and supported by many major tools. For instance, you can collect the lower tail of the hash-value Zipfian distribution using Elasticsearch’s “rare_terms” aggregation. Here is Python code to collect hashes that appeared fewer than 100 times in the last 7 days, given that we collect process creation events (EventID 1) and store the process executable hash in the “hash.sha256” field:
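(A minimal sketch with the official elasticsearch Python client; the index pattern winlogbeat-* and the winlog.event_id field name are assumptions that depend on how your Sysmon events are shipped.)

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed local cluster

body = {
    "size": 0,
    "query": {"bool": {"filter": [
        {"term": {"winlog.event_id": 1}},              # Sysmon process creation
        {"range": {"@timestamp": {"gte": "now-7d"}}},  # last 7 days
    ]}},
    "aggs": {"rare_hashes": {
        # rare_terms returns the lower tail: terms seen at most max_doc_count times
        "rare_terms": {"field": "hash.sha256", "max_doc_count": 100}
    }},
}

resp = es.search(index="winlogbeat-*", body=body)
rare_hashes = resp["aggregations"]["rare_hashes"]["buckets"]
print(rare_hashes[:3])  # [{'key': '<sha256>', 'doc_count': ...}, ...]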
In Spark, a groupBy().count() over the same field acquires comparable statistics (approx_count_distinct() only estimates the overall cardinality); in Splunk, the rare operator provides similar results. A count-per-hash aggregation over the full window yields stats like the following:
[{'key': '935c1861df1f4018d698e8b65abfa02d7e9037d8f68ca3c2065b6ca165d44ad2', 'doc_count': 51564},
{'key': 'fc10d47cb2dc511e648d3020a1f5789d1075be54dcacb6aa6d156f23561edf6c', 'doc_count': 25101},
...
{'key': 'd51f88c09c3ed32243bb588177c159ecc3b2578a98cb09b6959eb134386ba2af', 'doc_count': 1}]
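For the Spark route mentioned above, a comparable PySpark sketch (the parquet path and the hash_sha256 column name are hypothetical placeholders for your own telemetry store):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("zipf-hashes").getOrCreate()

# hypothetical source: a parquet export of the last 7 days of Sysmon ID 1 events
df = spark.read.parquet("s3://telemetry/sysmon_id1/last7d/")

# count-per-hash aggregation, then keep the Zipfian lower tail (< 100 appearances)
rare_hashes = (
    df.groupBy("hash_sha256")
      .count()
      .filter(F.col("count") < 100)
      .orderBy("count")
)
rare_hashes.show(10, truncate=False)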
Detection Engineering Framework
To proceed with threat hunting based on these statistics, we define a threshold of appearance counts to investigate unique elements below it. In Figure 3 above, this is visualized as a dashed red line, which represents a threshold of 10 appearances in seven days.
A threshold of 10 is chosen arbitrarily; usually, this value is a configurable hyperparameter that depends on the following details:
- Aggregation window — how much history you aggregate and how frequently you query the data source. The larger the window, the better the generalization (i.e., a lower probability that baseline elements appear rare). However, you get results less often, which increases potential incident response time.
- Prior frequency of elements — for less frequent events like logins, the threshold might be lower than for process creation events.
- Ability and resources to process out-of-baseline elements.
The general approach we successfully used can be represented by the following map:
Pseudocode example:
instantiate_database()

while True:
    stats = get_count_data()
    for element, count in stats:
        if count < THRESHOLD and not_in_database(element):
            is_malicious = analyze(element)
            if is_malicious:
                report(element)
            put_in_database(element)
    sleep(QUERY_WINDOW)
Sysmon & VirusTotal & SQLite
We used pseudocode above for brevity; a ready-to-use solution is defined in this gist.
It represents a functional example of executable hash aggregation from Sysmon ID 1 using Elasticsearch, analysis of OOB elements with the VirusTotal (VT) API, and storage of hashes in an SQLite database to avoid redundant API queries.
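The gist is the authoritative version; below is only a condensed sketch of the VT + SQLite part (the VT_API_KEY environment variable is an assumption, and error handling is simplified):

import os
import sqlite3

import requests

VT_URL = "https://www.virustotal.com/api/v3/files/{}"
HEADERS = {"x-apikey": os.environ["VT_API_KEY"]}  # hypothetical environment variable

conn = sqlite3.connect("seen_hashes.db")
conn.execute("CREATE TABLE IF NOT EXISTS hashes (sha256 TEXT PRIMARY KEY, malicious INTEGER)")

def already_seen(sha256):
    return conn.execute("SELECT 1 FROM hashes WHERE sha256 = ?", (sha256,)).fetchone() is not None

def check_virustotal(sha256):
    """Return the number of engines flagging the file as malicious (0 if unknown to VT)."""
    resp = requests.get(VT_URL.format(sha256), headers=HEADERS)
    if resp.status_code != 200:  # 404 means VT has never seen this hash
        return 0
    stats = resp.json()["data"]["attributes"]["last_analysis_stats"]
    return stats.get("malicious", 0)

def process_oob_hash(sha256):
    if already_seen(sha256):
        return  # avoid redundant API queries
    malicious = check_virustotal(sha256)
    conn.execute("INSERT INTO hashes VALUES (?, ?)", (sha256, malicious))
    conn.commit()
    if malicious > 0:
        print(f"[!] {sha256}: flagged by {malicious} VT engines")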
Since VT calls are a limited resource, here is a time series of how many calls this script makes to the VT API:
After the initial bootstrap peak, API queries stabilize at a low rate, with a mean of ~5 queries per hour.
Threshold Variability
Since the threshold is a sensitive parameter, we report an in-depth analysis of the queries initiated by this script in a long-term setting, together with the ratio of all data they cover. Specifically, we analyze how threshold variability affects the number of analysis rounds of OOB elements, based on telemetry acquired from a medium-sized enterprise. Here are the time series representing VirusTotal API calls under different thresholds over the long run:
No significant difference is seen; the only dissimilarity is an increased number of queries at the beginning of the analysis.
Applying our detection engineering framework to other data types in our environment shows that, as a rule of thumb, setting threshold=10 results in inspecting ~40% of all elements:
destination ip, 7d: elements below threshold==10: 38.30%
executable hashes, 7d: elements below threshold==10: 40.54%
process names, 7d: elements below threshold==10: 42.81%
usernames, 7d: elements below threshold==10: 36.04%
Since the ratio is directly related to the threshold value, here is a visual representation of the effect of a variable threshold (x-axis) on the ratio of all values (y-axis) for all four data types:
Therefore, you can estimate your analysis budget from the total population of hashes (e.g., your environment has 25k unique hash executions per month). Thanks to storing already-queried values in the database, you will analyze each value only once.
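To estimate that budget for your own telemetry, here is a minimal sketch that computes the below-threshold ratio from per-element counts (the counts list holds toy values purely for illustration):

import numpy as np

def ratio_below(counts, threshold):
    """Fraction of unique elements that appear fewer than `threshold` times."""
    counts = np.asarray(counts)
    return (counts < threshold).mean()

# hypothetical input: one appearance count per unique element (hash, IP, username, ...)
counts = [51564, 25101, 118, 37, 9, 3, 1, 1, 1, 1]

for threshold in (2, 5, 10, 25, 50, 100):
    print(f"threshold={threshold:>3}: {ratio_below(counts, threshold):.2%} of elements are below it")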
Alternative Post-Processing Heuristics
Other than VT, there are myriad options to perform semi- or fully-automated analytics. First, you can finally utilize those 3rd-party licenses that provide threat intelligence (Recorded Future, FireEye, Flashpoint, Anomali, Maltego, etc.) that your CISO repeatedly demands you use.
However, clever in-house logic can yield no less significant results, since OOB elements provide ground for selective operations that otherwise are not practical at the scale of the full infrastructure:
- task your EDR / Velociraptor to check an OOB binary against a library of YARA rules (this collection maintained by Florian Roth and co-authors is a good start);
- execute autorunsc.exe on a host, compare the output against a predefined list of programs, and alert if unusual entries are present;
- query process names against a list of LOLBins, and gradually filter out common occurrences for your environment (based on the process’ CommandLine) — alert on everything else;
- make a correlative analysis (a new query to Elasticsearch) — e.g., analyze whether a host with an OOB PE hash had network communication with an external IP address within 5–10 minutes of execution, and alert a human analyst if it did (see the sketch below).
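A minimal sketch of the last heuristic with the elasticsearch Python client (the index pattern and the winlog.event_id, host.name, and destination.ip field names are assumptions that depend on your ingestion pipeline; the host name and timestamp are hypothetical):

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed local cluster

def had_external_connection(host, exec_time, window="10m"):
    """Check if `host` produced Sysmon ID 3 events to a public IP within `window` after `exec_time`."""
    query = {"bool": {
        "filter": [
            {"term": {"winlog.event_id": 3}},  # Sysmon network connection
            {"term": {"host.name": host}},
            {"range": {"@timestamp": {"gte": exec_time, "lte": f"{exec_time}||+{window}"}}},
        ],
        # drop RFC1918 destinations; CIDR terms work on Elasticsearch `ip` fields
        "must_not": [
            {"term": {"destination.ip": "10.0.0.0/8"}},
            {"term": {"destination.ip": "172.16.0.0/12"}},
            {"term": {"destination.ip": "192.168.0.0/16"}},
        ],
    }}
    resp = es.count(index="winlogbeat-*", body={"query": query})
    return resp["count"] > 0

if had_external_connection("WS-0042.corp.local", "2022-06-01T12:00:00Z"):
    print("[!] OOB binary followed by external network traffic, escalate to a human analyst")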
Conclusions
From time to time, visionaries of the field propose and realize fascinating ideas, for instance, the graph-centric approach to Active Directory (AD) security with tools like BloodHound maintained by the SpecterOps team, which significantly raised the security quality of AD networks around the world. Or consider Jared Atkinson’s intriguing ideas on rethinking MITRE TTPs:
We do not pretend to stand in line with ideas like these. However, we hope this article provides a fresh look at how security professionals treat telemetry in their environment, with wide potential in (a) which planes to define a baseline on and (b) how to treat OOB items. Such an approach, presumably, might enlighten the gray areas of contemporary security operations.
Side note — we would like to emphasize that the Zipf’s law observation is not limited to the aforementioned queries and data structures — it is more ubiquitous. We are considering sharing additional thinking planes where Zipf’s Law might be utilized for anomaly-based detection engineering in a separate article, but for curious minds — consider the multi_terms aggregation.
And as a reminder — the complete code implementing out-of-baseline hash analysis against VirusTotal, with a local SQLite database to avoid redundant queries, in under 100 lines of code, is here.