Security Data Science: Detection Engineering from Windows Events with Recurrent Neural Networks

Dmitrijs Trizna
21 min read · Jun 30, 2020

Abstract

This article describes utilizing Windows Sysmon Events for behavioral security analytics. The decision heuristic is based on a Recurrent Neural Network (RNN), which considers the sequence of events. This work aims to give the reader the necessary background on techniques from both the Data Science and Information Security domains, and to show an actual Sysmon Event data preprocessing pipeline, feature engineering, and model architecture selection using the TensorFlow Keras API.
The dataset used for model training was artificially generated in a lab environment and contains telemetry on legitimate processes and several offensive techniques. Its main goal is to evaluate the ability of the RNN model to learn patterns in the source data. However, due to the modest coverage of attack techniques, the model is not expected to generalize broadly to the real-world threat landscape.

TL;DR: Repository with this article’s code: RNNs, dataset, EDA, etc.

Introduction

The essential security operations problem is classifying infrastructure activity as malicious or benign. An obvious consideration is to outsource this functionality to some artificial intelligence (AI) agent.

This article may be valuable for:

  • security analysts who want to extend their toolkit and implement Deep Learning (DL) techniques in their operations;
  • data science folks who are willing to apply their knowledge in the cybersecurity domain;

By publishing this article and the accompanying code, I hope to narrow the gap between Security Operations and Data Science.

Many Data Science techniques at this point are no longer a theoretical problem but an engineering one, with untapped potential, especially in the infosec domain (both offensive and defensive).

Considering the broad audience and the merger of two disciplines, the article touches on concepts that will be familiar to some readers, while other sections may be novel depending on one's background.

Status quo and hypothesis

So far, most of the security-AI research I've seen performs classification from a single input example using Machine Learning (ML) algorithms like gradient boosting, decision trees, or classical neural networks.

Figure 1. Sysmon EventID 3 processing to np.array example.

This single input example may involve sophisticated feature engineering. For example, training examples contain information about the current process and parent process executables, network communication information, and other contextual data like loaded DLLs.

Still, just as a human analyst cannot reliably identify from a single piece of information whether it corresponds to malicious activity, such ML algorithms have a limit to their precision.

This motivates the following hypothesis:

It is natural to observe system events as a sequence, just as any security professional does. Moreover, security events are always temporal (i.e., they have timestamps), so current Data Science techniques for the analysis of Time Series data, involving RNNs, should provide decent results for system event classification (benign/malicious).

Data

The ultimate goal of such work may be to create a Neural Network that generalizes well across all techniques in the MITRE ATT&CK matrix. At the same time, such a model should be able to distinguish offensive strategies from all sorts of valid infrastructure activity.

Such analysis will require tremendous work on dataset creation (comparable with EDR vendors' resources) and significant computational resources to train Neural Networks afterward.

But we don't need that to test the hypothesis raised above. To verify whether an RNN may be used to identify the correlative logic behind features derived from Sysmon events, I've created a dedicated dataset where several offensive techniques were emulated on the target, including:

  • malware execution from within interactive session (T1204);
  • dropper's initiated PowerShell activity (T1086);
  • maintaining established Command and Control on TCP/8080 (T1043);
  • persistence via Scheduled Tasks (T1053) and WMI Event Subscriptions (T1084);
  • system enumeration (T1082, T1033, etc.).

The dataset has ~3.7k Events from a total of 100 processes, which is enough to develop a full preprocessing, feature engineering, and Neural Network training pipeline.

Recurrent Neural Networks

An RNN is a class of artificial neural networks. A classical neural network takes some specific input and returns output with a high degree of non-linearity (i.e., a "really complex" transformation of the input data). In contrast, an RNN does this multiple times across the temporal dimension, carrying information between every iteration.

For now, I'll refer to the basic visualization below, where each empty circle represents the same neural network, reused multiple times while working on different timesteps of the input:

Figure 2. Source: https://medium.com/@venkatakrishna.panga/time-series-forecasting-lstm-f45fbc7796e1

To use an RNN for classification (as defined in our hypothesis), we care only about the output from the last iteration (o4 in Figure 2), once the whole event sequence has been processed. This type of architecture is usually referred to as "many to one."

Similar architectures are used in Natural Language Processing (NLP), e.g., for sentiment classification (see this dataset and corresponding citations) or, much more relevant to our situation since it involves temporal data, for activity recognition from sensory data (e.g., dataset and several articles here or here).

Sysmon

Sysmon is part of the Windows Sysinternals suite, supported and actively maintained by Microsoft. Sysmon installs a persistent service on the system, which generates logs of specific activity by parsing ETW events (low-level Windows OS sensory data).

It is one of the most common Data Collector engines (which means it does only collection, no transport, detection, or alert functionality). You may refer to an example of EventID 3 in Figure 1 above, which describes network connectivity.

Sysmon is one of the most valuable sources of visibility on Windows systems, both workstations and servers, on which security analysts build detections. Sysmon is relatively easy to set up, and there're plenty of articles online on this topic, e.g., from CQURE. Drop the binary from the Sysinternals suite and provide some configuration — sysmon will install a service running in the background as NT AUTHORITY\SYSTEM.

Preprocessing

Preprocessing is arguably the most crucial part of data operations that security folks need to master to bring ML techniques into everyday operations.

Right away, here're a few suggestions:

  • pandas is a convenient tool for Exploratory Data Analysis (EDA) and preprocessing, but it does not take .xml/.evt files as-is. Instead, save Events in CSV format (EventViewer/Powershell allows that). If you managed to get events in XML — you might refer to my work in this article, which describes reading Events from XML into pandas.
  • If you're working with pandas, feather (.ft) format is much lighter and faster than CSV/XML/JSON, so after the initial load of data, save it in feather format and use this set for subsequent work:
df.to_feather('logs.ft')
  • When your log file is considerably large (3–5 GB), an initial load may consume 10+ GB of RAM because of the expansion of integers in memory and similar specifics. To avoid this behavior, specify column dtypes during reading (see the sketch after this list). For details, refer to this article.
  • If you have a larger dataset than mine, do not start preprocessing on the whole corpus of data. Instead, take the first ~10k elements and perform all syntax lookups on this subset; only after settling on practical preprocessing techniques, apply them to all data. Such a trick will save you lots of time.
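Combining the feather and dtype suggestions, a minimal sketch could look as follows (the file name and the dtype mapping are illustrative assumptions, adjust them to your export):

import pandas as pd

# read the CSV exported from EventViewer/PowerShell with explicit dtypes
# to keep memory consumption under control
dtypes = {'EventID': 'int16', 'ProcessId': 'int32',
          'Image': 'object', 'CommandLine': 'object'}
df = pd.read_csv('sysmon_events.csv', dtype=dtypes)

# cache in feather format; subsequent runs load much faster than CSV/XML/JSON
df.to_feather('logs.ft')
df = pd.read_feather('logs.ft')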

Input Shape

RNN layer takes 3-dimensional input with the shape of:

(batch_size, timesteps, feature_dimensionality)

Figure 3. Source: https://www.tensorflow.org/tutorials/structured_data/time_series

The 3-dimensional data structure may confuse newcomers, as well as Data Science practitioners who have worked only with NLP-related RNN solutions. In the case of NLP, the Embedding layer performs the expansion of data into the 3rd dimension (feature dimensionality).

An important twist to understand is that logs should be grouped by some parameter to form one training example (a sequence) consisting of multiple timesteps, each with multiple features.

*cracking skull sound*

Don't worry — you'll get this data manipulation step and the dimensionality idea from the descriptions in the following paragraphs and the visualization in Figure 4 below.
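For concreteness, the target data structure is just a 3-dimensional NumPy array; with the numbers that appear later in this article (64 process sequences, 128 timesteps, 6 features per event), it can be sketched as:

import numpy as np

# 64 sequences (processes), each padded to 128 timesteps, 6 features per event
X = np.zeros((64, 128, 6), dtype=np.float32)
print(X.shape)  # (64, 128, 6) == (batch_size, timesteps, feature_dimensionality)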

Labels

Here it is worth thinking of a sequence as the entity that will receive a label (benign/malicious). Some examples of grouping parameters may be:

  • network activity: source and destination IP pairs;
  • application logs: user identification (cookie, ID);
  • sysmon event logs (our case): Process ID.

As you can see, in the case of Sysmon, I chose to group event logs by Process ID and marked each group accordingly as malicious or legitimate.

Labels for the provided dataset are available here and here. Still, if you need to label your own data, you may refer to this logic I've created for faster process labeling. It allows you to view the current and parent process CommandLine and mark the activity accordingly.

> python label_pids.py
PID: 10116
C:\Windows\System32\WindowsPowerShell\v1.0\powershell.exe
"C:\Windows\System32\WindowsPowerShell\v1.0\powershell.exe" -NoP -NonI -W Hidden -c $x=$((gp HKCU:Software\Microsoft\Windows Update).Update); powershell -NoP -NonI -W Hidden -enc $x
"C:\Windows\system32\fodhelper.exe"
(input) malicious
PID: 8049
...

Padding

At this point, it is possible to conclude that the desired state of the data should have a structure like this:

Figure 4. The structure of data to target preprocessing.

Why PADDED? Because an RNN requires all input sequences to be the same length, whereas A != B != C, i.e., different processes have different event counts. Let's observe the ten most verbose processes:

Figure 5. Left column: ProcessId; Right column: Count of events.

The most verbose process ID is 1520 with a sequence of 284 events, with lengths decreasing rapidly across processes. Therefore, I've chosen the sequence length to be 128 (MAX_TIMESTEPS in code), where the shorter sequences are padded and the longer ones truncated.

Padding and truncating are better performed from the beginning of the sequence (a.k.a. prepadding); using tf.keras.preprocessing.sequence.pad_sequences(), it's done like this:

pad_sequences(sequence, maxlen=128, padding='pre', truncating='pre', value=0)

The rationale is that data at the beginning is less relevant (loading of standard DLLs or similar), whereas data at the end contains actual process activity like CommandLine parameters or network connections. Consequently, the correct padding/truncating direction is crucial, or truncation will lose important information.
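A toy illustration of this prepadding/pretruncating behavior (maxlen=4 instead of 128 to keep the output short):

from tensorflow.keras.preprocessing.sequence import pad_sequences

# short sequences get zeros prepended, long ones lose their oldest timesteps
print(pad_sequences([[1, 2], [1, 2, 3, 4, 5, 6]],
                    maxlen=4, padding='pre', truncating='pre', value=0))
# [[0 0 1 2]
#  [3 4 5 6]]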

When observing the least verbose processes (no screenshot for brevity), we see that ~15 processes have only 1–3 events, so we may consider dropping all processes having fewer than four events. These are not valuable for the model, as they provide too little information for classification.

In my experience, the simplest malware dropper generates at least four events (including the obligatory process Create and Terminate and a few DLL loads needed to operate with the NT kernel). As per the latest research, it's impossible to create a quieter agent with dynamic syscall invocation (1, 2) or by suppressing ETW/Sysmon logging altogether (1, 2, 3), but this is a topic for another discussion. In usual circumstances, processes having fewer than four events are continuous service tasks that started before the initiation of the Sysmon service.

Feature Engineering

The next thing to consider is the features or (x1, x2, ...) part from Figure 4 above. In other words — what specific event characteristics do we want to include in feature dimensionality?

We start with a few simple yet valuable data characteristics and tune features accordingly. Later improvements can be driven by model performance (e.g., adding features that explicitly distinguish False Positives from True Positives). I decided to keep the following features:

  • 'binary' and 'path' of the process:
# get binary name (last part of "Image" after "\")
newdf['binary'] = df.Image.str.split(r'\\').\
    apply(lambda x: x[-1].lower())
# same with binary path
newdf['path'] = df.Image.str.split(r'\\').\
    apply(lambda x: '\\'.join(x[:-1]).lower())
  • Whether any base64 string appears in the command's arguments: a 'b64' column with values 0 (no base64 appearance) and 1 (base64 present), created using the following code:
import re

# Leave only cmd arguments from CommandLine
# (strip away everything before first space)
df['arguments'] = df.CommandLine.fillna('empty').str.split().\
    apply(lambda x: ' '.join(x[1:]))

# match a consecutive string of at least 64 base64 characters,
# optionally followed by '=' padding
b64_regex = r"[a-zA-Z0-9+\/]{64,}={0,2}"
b64s = df['arguments'].apply(lambda x: re.search(b64_regex, x)).notnull()
newdf['b64'] = b64s.astype(int)
del b64s
  • Whether there's any URL or UNC path appearing in the command arguments:
# matches a call to some file with an extension
# (dot at the end of regex) via a UNC path
unc_regex = r"\\\\[a-zA-Z0-9]+\\[a-zA-Z0-9\\]+\."
uncs = df['arguments'].apply(lambda x: re.search(unc_regex, x)).notnull()

url_regex = r"https?:\/\/(www\.)?[-a-zA-Z0-9@:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b([-a-zA-Z0-9()@:%_\+.~#?&//=]*)"
urls = df['arguments'].apply(lambda x: re.search(url_regex, x)).notnull()

# Merge both URL and UNC into a single boolean feature:
# is there ANY UNC path or URL in the arguments?
newdf['unc_url'] = (uncs | urls).astype(int)
del uncs, urls
Figure 6. Check that the UNC regex (tricky one) is correct.
  • Whether process performs network connections:
newdf['network'] = df['Protocol'].notnull().astype(int)

At that point, I decided to postpone further Feature Engineering (FE), but it's worth noting that FE is a sort of never-ending process, and new ideas may be needed as new evidence comes in.

One may find great insights into more FE ideas beneficial to security analysis in the following Elastic blog post authored by Bobby Filar, including CommandLine embedding using TF-IDF and parent-child relationships.

A more thorough view of this problem is covered in this CAMLIS talk, where Brian Murphy covers valuable ways to encode specific fields.

groupby() transformation

At this point, we have a filtered DataFrame newdf with only the necessary data:

Figure 7. DataFrame after Feature Engineering.

It's time to perform the encoding of categorical features (those with type 'object') and group all 3712 entries into padded 3D sequences.

Below is the function that takes this DataFrame as input and returns Numpy arrays — X is 3-dimensional data with encoded feature sequences across events in every process, and y contains labels (malicious, 1 | benign, 0):
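The actual implementation lives in the accompanying repository; below is only a minimal sketch of what such a function might look like. The label column name ('label'), the val_pids argument (the manually chosen validation ProcessId list described later), and the use of LabelEncoder are illustrative assumptions, and details such as dropping short sequences are omitted:

import numpy as np
from sklearn.preprocessing import LabelEncoder
from tensorflow.keras.preprocessing.sequence import pad_sequences

MAX_TIMESTEPS = 128

def groupby_transform(df, group_field, label_field='label', val_pids=None):
    """Encode categorical columns, group events into per-process sequences,
    pad/truncate them to MAX_TIMESTEPS, and split into train/validation sets."""
    df = df.copy()
    val_pids = set(val_pids or [])

    # encode 'object' columns (e.g. 'binary', 'path') as integers
    for col in df.select_dtypes(include='object').columns:
        df[col] = LabelEncoder().fit_transform(df[col].astype(str))

    feature_cols = [c for c in df.columns if c not in (group_field, label_field)]
    train_X, val_X, train_y, val_y = [], [], [], []

    for pid, events in df.groupby(group_field):
        seq = events[feature_cols].values    # shape: (n_events, n_features)
        label = events[label_field].iloc[0]  # one label per process
        if pid in val_pids:
            val_X.append(seq); val_y.append(label)
        else:
            train_X.append(seq); train_y.append(label)

    # prepad / pretruncate every sequence to the same length
    train_X = pad_sequences(train_X, maxlen=MAX_TIMESTEPS,
                            padding='pre', truncating='pre', value=0)
    val_X = pad_sequences(val_X, maxlen=MAX_TIMESTEPS,
                          padding='pre', truncating='pre', value=0)
    return train_X, val_X, np.array(train_y), np.array(val_y)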

Invoke this function on newdf with groupby field of 'ProcessId':

train_X, val_X, train_y, val_y = groupby_transform(newdf, 'ProcessId')

Training and validation sets

You may have noticed that the function returns train_.. and val_.. sets for both X and y.

Different sets are needed to evaluate the performance of our model in an unbiased manner, by making predictions on data that wasn't involved in training.

Please refer to materials online for more information on this, as it's a pretty essential concept. I skipped the test set in this research's model evaluation and will work only with training and validation sets, an acceptable sacrifice given the small amount of data.

In usual circumstances, splitting is done randomly, e.g., using sklearn's train_test_split, which (with the stratify argument) ensures that an equal portion of every class (malicious/benign) ends up in both sets with a specified ratio (20%/80% in the example below):

from sklearn.model_selection import train_test_split

train_X, val_X, train_y, val_y = train_test_split(X, y, test_size=0.2, stratify=y)

Such a train/test splitting scenario may be considered in the case of a vast dataset with hundreds or thousands of examples of a single MITRE technique. But in our small case, this is unacceptable. We may have only one example of a specific offensive technique (e.g., Events from a process that performs Command and Control communication); by assigning it to the validation set, we leave the model no data to train on for that technique.

Therefore, I manually selected processes to place into the validation set (list here). These processes represent MITRE techniques that also appear in the training set: PowerShell invoked with "encoded" parameters, system enumeration, and a few others. The file is then parsed and, according to the "ProcessId" value, each feature sequence is placed in either train_X or val_X.

tf.data.Dataset

At last, a few datasciency operations are needed, such as:

  • shuffling data order
  • splitting data into smaller batches

For these operations, I use tf.data.Dataset functionality:

import tensorflow as tf

BATCH_SIZE = 8
SHUFFLE_BUFFER = 100

train_ds = tf.data.Dataset.from_tensor_slices((train_X, train_y)).\
    shuffle(SHUFFLE_BUFFER).batch(BATCH_SIZE).prefetch(1)
val_ds = tf.data.Dataset.from_tensor_slices((val_X, val_y)).\
    shuffle(SHUFFLE_BUFFER).batch(BATCH_SIZE).prefetch(1)

Splitting into batches (smaller portions of data) is needed to allow the model to update itself faster using optimization algorithms such as mini-batch gradient descent or Adam (in this article's model, we use Adam).

It's worth noting that the tf.data.Dataset workflow may be extended into a full preprocessing pipeline, including encoding.

Training

Simple Model

Finally, we are ready to test this data against some RNN models.

First of all, let us build a simple RNN with a single LSTM layer of 32 units to verify that the data has the correct format and that appropriate metrics and an optimization algorithm are chosen.

I won't stick with a single evaluation criterion but use multiple metrics to analyze the behavior of the Neural Network:

METRICS = [
    keras.metrics.TruePositives(name='tp'),
    keras.metrics.FalsePositives(name='fp'),
    keras.metrics.TrueNegatives(name='tn'),
    keras.metrics.FalseNegatives(name='fn'),
    # Precision: (TP) / (TP + FP)
    # what proportion of predicted Positives is truly Positive
    keras.metrics.Precision(name='precision'),
    # Recall: (TP) / (TP + FN)
    # what proportion of actual Positives is correctly classified
    keras.metrics.Recall(name='recall'),
    keras.metrics.AUC(name='auc'),
    keras.metrics.Accuracy(name='accuracy')
]

In security classification, data is often skewed (one class, benign behavior, has many more examples than the other, malicious behavior), so it's wise to consider a wider range of metrics and not just the most common "accuracy". Such a METRICS definition, as suggested by TensorFlow, is the way to go for imbalanced data.

Consequently, the model is compiled as follows:

from tensorflow import keras

def model_simplest(MAX_TIMESTEPS, FEATURES):
    model_simplest = keras.models.Sequential([
        keras.layers.LSTM(32,
                          dropout=0.2,
                          recurrent_dropout=0.2,
                          input_shape=[MAX_TIMESTEPS, FEATURES]),
        keras.layers.Dense(units=1, activation='sigmoid')
    ])
    return model_simplest

OPT = keras.optimizers.Adam()

model = model_simplest(MAX_TIMESTEPS, N)
model.compile(optimizer=OPT,
              loss='binary_crossentropy',
              metrics=METRICS)

For readers from the security camp — a small description of terminology and a little of "Deep Learning 101":

  • The last layer, Dense(units=1, activation='sigmoid'), consists of a single neuron that performs the actual binary classification: it provides an output in the range from 0 to 1 (see the sigmoid function's y scale), i.e., the probability with which the network considers the input example to be malicious.
  • binary_crossentropy then gives the largest cost to those outputs that do not correspond to the labels:
Figure 8. Binary Crossentropy for Dummies (yellow/black book cover).
  • The optimizer, in this case keras.optimizers.Adam(), with every iteration over a batch tries to update the internal state of the Neural Network (i.e., its weights) so that the difference between the NN output and the actual label is minimized.
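As a toy illustration (not from the article) of how binary crossentropy penalizes confident wrong predictions:

import tensorflow as tf

bce = tf.keras.losses.BinaryCrossentropy()
# label 1 (malicious) vs. two hypothetical model outputs
print(bce([[1.0]], [[0.9]]).numpy())  # ~0.105: confident and correct, low cost
print(bce([[1.0]], [[0.1]]).numpy())  # ~2.303: confident and wrong, high cost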

Model training happens through a .fit() method. We'll try to proceed with just 3 epochs for now:
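The call itself is captured only as a screenshot (Figure 9); a minimal sketch of it, assuming the train_ds built above, would be:

# 3 passes over the whole training dataset, printing metrics for each epoch
h = model.fit(train_ds, epochs=3)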

Figure 9. Simple RNN model training. Proves that input data is in the correct format and network can learn data patterns — loss decreases over epochs, True Positive and True Negative cumulative numbers grow.
  • An epoch is a training iteration over the whole dataset, so we've passed the dataset 3 times through this RNN model.
  • "Train for 8 steps" in this case represents the number of batches in every epoch. To recall from the code above BATCH_SIZE = 8 and overall size of our dataset contains 64 sequences, so there are 8 batches of 8 sequences in a dataset:
>>> train_X.shape(64, 128, 6)

To estimate the output of a model, we may refer to:

  • the .evaluate() method, which compares the model's output to the provided labels and reports all the METRICS values for the current model state:
Figure 10. Evaluation of simple, almost non-trained model on the validation set. All predictions are negative — only True Negative and False Negative, none as True Positive or False Positive.
  • the .predict() method, when we do not provide any labels but just want to see the model's prediction for the input sequence(s):
Figure 11. The simplest model's prediction gives a 44.6% probability of being malicious for the first sequence from the validation set and 22.7% for the second.
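In code form, those two checks (hypothetical calls with the objects defined above; exact numbers vary between runs) look like:

model.evaluate(val_ds)    # loss plus every metric from METRICS, given true labels
model.predict(val_X[:2])  # raw sigmoid outputs, e.g. array([[0.446], [0.227]])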

As seen from .evaluate(), at the moment all predictions are negative (i.e., less than a 50% probability of being malicious). But this network has the simplest possible architecture and was trained for only a few seconds, whilst seeing mostly benign behavior. Let's take a look at more mature models.

Model selection

Andrej Karpathy gives decent advice on the model selection process:

Overfit a single batch of only a few examples (e.g. as little as two). To do so we increase the capacity of our model (e.g. add layers or filters) and verify that we can reach the lowest achievable loss (e.g. zero).

A few words on overfitting: it is a characteristic of neural networks that have learned random perturbations of specific training examples which are not relevant to the data as a whole. In 2-dimensional space, it may be visualized as follows:

Figure 12. Source: Wikipedia.

The green line represents the output of the "overfitted" model, and the black line is the desired prediction of a good model. I won't go into more detail, as "overfitting" is a pretty essential data science concept; those willing to dig deeper can start here.

We usually want to avoid overfitting, but in this case (during the architecture selection process), if the model can overfit a single batch of examples and drive the loss as low as possible, this means its architecture is complex enough to learn the non-linearity of the input data. Such a network should be able to figure out all parameters that distinguish benign and malicious samples if trained on a larger corpus of data.

Let's prepare multiple models:

First of all, I gradually increased depth (adding either LSTM() or Dense() layers, or both), later added a "convolution" operation using a Conv1D() layer in "model_conv_rnn", and "LSTM bidirectionality" using a Bidirectional() layer in "model_conv_birnn". Notice that these models have no regularization (overfitting-reduction logic that will be added to the final model).
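The exact definitions live in models_no_regularization.py in the repository; as one hedged example, model_conv_rnn could look roughly like this (the layer sizes are illustrative assumptions, not the repository's exact values):

from tensorflow import keras

def model_conv_rnn(MAX_TIMESTEPS, FEATURES):
    # Conv1D extracts local temporal patterns, the LSTM models the whole sequence
    return keras.models.Sequential([
        keras.layers.Conv1D(filters=64, kernel_size=5, padding='causal',
                            activation='relu',
                            input_shape=[MAX_TIMESTEPS, FEATURES]),
        keras.layers.LSTM(64),
        keras.layers.Dense(32, activation='relu'),
        keras.layers.Dense(1, activation='sigmoid')
    ])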

As our dataset isn't large at all, I decided not to implement single-batch logic but to try overfitting on the whole training set with models without regularization.

To observe loss value over the epochs, implement the plot_loss() helper function:

import matplotlib.pyplot as plt

colors = plt.rcParams['axes.prop_cycle'].by_key()['color']

def plot_loss(epochs, history, label, n, val=False):
    # Use a log scale to show the wide range of values.
    plt.semilogy(epochs, history['loss'],
                 color=colors[n], label='Train '+label)
    if val:
        plt.semilogy(epochs, history['val_loss'],
                     color=colors[n], label='Val '+label,
                     linestyle="--")
    plt.xlabel('Epoch')
    plt.ylabel('Loss')
    plt.legend()

.. and trained all models for 500 epochs:

from models_no_regularization import model_simplest, model_dd, model_deep, \
    model_deeper, model_conv_rnn, model_conv_birnn

models = {'simplest': model_simplest(MAX_TIMESTEPS, N),
          'dropout_dense': model_dd(MAX_TIMESTEPS, N),
          'deep': model_deep(MAX_TIMESTEPS, N),
          'deeper': model_deeper(MAX_TIMESTEPS, N),
          'conv1d': model_conv_rnn(MAX_TIMESTEPS, N),
          'conv1d_bi': model_conv_birnn(MAX_TIMESTEPS, N)}

np.random.seed(51)
tf.random.set_seed(51)

for i, (name, model) in enumerate(models.items()):
    print(f"\n{i}th model: {name}!")
    h = model.fit(train_ds, epochs=500, verbose=0)
    plt.figure(figsize=[8, 4])
    plot_loss(h.epoch, h.history, f"\n{i}th model, {name}", i)
    # results on training data - to test overfitting
    results = model.evaluate(train_ds)
    for metric_name, value in zip(model.metrics_names, results):
        print(metric_name, ': ', value)
    print()

To observe the complete output of this loop, please refer to the actual model selection notebook.

Overall, results show that there's no noticeable benefit (i.e., decrease in the model's loss) from adding more layers in depth, e.g., here's a comparison of loss for the simplest and deepest models:

0th model: simplest!
...
loss : 0.21952103544026613
...

3th model: deeper!
...
loss : 0.31504578702151775

Rather than relying only on the loss value at the last epoch, below you can observe the loss variation over all 500 epochs and see that loss values are similar in both cases (in a range from 0.2 to 0.4):

Figure 14. Loss values for "model_simplest" and "model_deeper" during training over 500 epochs.

However, by adding convolutions to the data processing, there's a noticeable decrease in loss, dropping below 0.1 during some epochs:

5th model: conv1d!
...
loss : 0.1116175636011576
Figure 15. Loss of model with Conv1D(..) layer.

As it appears, both models, the one with only convolutions and the one with convolutions AND layers.Bidirectional() around the LSTM, show similarly good results with a loss of ~0.1.

In the case of both convolutions and bidirectional LSTM, the model's architecture may be represented as follows:

Figure 16. Representation of model_conv_birnn architecture.

During later tests, though, I noticed that LSTM bidirectionality doesn't yield any noticeable gain but requires twice as many parameters to train:

>>> model_conv_rnn.summary()
...
Trainable params: 36,353
...
>>> model_conv_birnn.summary()
...
Trainable params: 71,425
...

So instead of training the model for 3 hours, it takes 6 hours, without noticeable improvement in the model's performance. Following Occam's razor, I decided to stay with the model_conv_rnn model.

Convolutions

It's worth saying a few words about convolutions to understand why this helps the most.
With just a few parameters, the convolution operation identifies a specific pattern in the source data. This proved helpful in Computer Vision, where many similar low-level characteristics are shared across all inputs: edges, lines, colors, etc. Here, for example, is what happens to an image when a filter that detects horizontal lines is applied:

Figure 17. Applied horizontal line convolution filter. Source.

In the case of Conv1D, the convolution operation is applied not to the spatial dimensions of an image but to the temporal dimension of the data sequence.

Figure 18. Example of easy-to-find pattern for Conv1D.

So Conv1D should find patterns in specific Sysmon event sequences that characterize maliciousness: maybe a PowerShell command that results in events with network activity, or loading a specific set of DLLs (legitimate images tend to load a different set of DLLs in a different order).
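A toy example (not from the article) of a single Conv1D filter sliding along the temporal axis and lighting up where a pattern occurs:

import numpy as np
import tensorflow as tf

# one sequence of 8 timesteps with 1 feature; an all-ones filter of width 3
x = np.array([[[0.], [0.], [1.], [1.], [1.], [0.], [0.], [0.]]], dtype=np.float32)
conv = tf.keras.layers.Conv1D(filters=1, kernel_size=3, padding='valid',
                              use_bias=False,
                              kernel_initializer=tf.keras.initializers.Constant(1.0))
print(conv(x).numpy().squeeze())  # [1. 2. 3. 2. 1. 0.]: peaks where the pattern sits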

Evaluation of model with regularization

As a final step, I've added regularization to avoid overfitting (a Dropout layer just after the LSTM layer) and trained the selected model_conv_rnn model for a larger number of epochs:

tf.keras.backend.clear_session()
np.random.seed(42)
tf.random.set_seed(42)

OPT = tf.keras.optimizers.Adam(learning_rate=0.005)
model3 = model_conv_rnn(MAX_TIMESTEPS, N)
model3.compile(optimizer=OPT, loss='binary_crossentropy', metrics=METRICS)
h = model3.fit(train_ds, epochs=500, verbose=0, validation_data=val_ds)

Fixing the random seed is common practice and, again, is well described by A. Karpathy:

Fix random seed. Always use a fixed random seed to guarantee that when you run the code twice you will get the same outcome. This removes a factor of variation and will help keep you sane.

During the evaluation, we see that 9 out of 10 processes from the validation set are predicted as benign (True Negatives: tn == 5 and False Negatives: fn == 4), and only one is predicted as malicious (True Positive: tp == 1):

>>> model3.evaluate(val_ds)
1/1 [==============================] - 1s 768ms/step - loss: 0.7081 - tp: 1.0000 - fp: 0.0000e+00 - tn: 5.0000 - fn: 4.0000 - precision: 1.0000 - recall: 0.2000 - auc: 1.0000 - accuracy: 0.0000e+00

Imbalanced data

This is what you'll often see when working on security-related classification. The problem here is that malicious samples usually form a very small fraction of the data. It may be as tiny as 0.172% of the whole dataset, referring to this Credit Card Fraud Detection dataset.

In our case, the situation is much better, with 20% of all processes in the training set being malicious:

>>> u, c = np.unique(train_y, return_counts=True)
>>> neg, pos = c   # label 0 == benign, label 1 == malicious
>>> print(f"Malicious processes: {pos}\nValid processes: {neg}")
>>> print(f"Malicious percentage: {round(pos*100/(pos+neg), 4)} %")
Malicious processes: 14
Valid processes: 56
Malicious percentage: 20.0 %

Still, 4 out of 5 processes are benign, and it's much harder for the model to find feature patterns that classify malicious activity.

TensorFlow has great documentation on addressing this issue, mostly by implementing data oversampling. I won't cover it here, but I found it to work well even on larger datasets, allowing the model to learn the under-represented class.
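While not part of this article's pipeline, a minimal oversampling sketch along the lines of that TensorFlow tutorial, reusing train_X / train_y and BATCH_SIZE from above, could look like this:

pos_ds = tf.data.Dataset.from_tensor_slices(
    (train_X[train_y == 1], train_y[train_y == 1])).repeat()
neg_ds = tf.data.Dataset.from_tensor_slices(
    (train_X[train_y == 0], train_y[train_y == 0])).repeat()
# draw malicious and benign sequences with equal probability
balanced_ds = tf.data.experimental.sample_from_datasets(
    [pos_ds, neg_ds], weights=[0.5, 0.5]).batch(BATCH_SIZE)
# both source datasets repeat forever, so bound the epoch when fitting:
# model3.fit(balanced_ds, epochs=..., steps_per_epoch=16, ...)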

Early Stopping

In the case of this dataset, a more standard approach helped. During training, the model's performance varies, and deeper into the training it may start to overfit the training data and therefore perform worse on validation data.

There's a technique called "early stopping" that monitors the model's performance during training and halts the training process if performance starts to degrade.

es_callback = [
    tf.keras.callbacks.EarlyStopping(
        # Stop training when `val_loss` is no longer improving
        monitor="val_loss",
        # "improving" == "val_loss decreasing by more than 0.05"
        min_delta=0.05,
        # "no longer improving" == "for at least 400 epochs"
        patience=400,
        verbose=1,
        restore_best_weights=True)]

This callback object is then invoked when fitting the model:

>>> h = model3.fit(train_ds, epochs=1000, validation_data=val_ds,
...                callbacks=es_callback, verbose=0)
Restoring model weights from the end of the best epoch.
Epoch 00569: early stopping

Here, although the configuration explicitly states to train for 1000 epochs, early stopping ended training at epoch 569 and restored the most efficient model state according to the val_loss (validation loss) value.

Prediction on the validation set is indeed great: 8 out of 10 processes are classified correctly:

>>> model3.evaluate(val_ds)
1/1 [==============================] - 0s 14ms/step - loss: 0.4425 - tp: 4.0000 - fp: 1.0000 - tn: 4.0000 - fn: 1.0000 - precision: 0.8000 - recall: 0.8000 - auc: 0.9200 - accuracy: 0.0000e+00

Just to emphasize — given Sysmon data from a single process, this Recurrent Neural Network model was able to classify malicious activity whilst distinguishing it from legitimate system behavior.

No string or behavior signatures are involved, just a sequence of Event logs with info on whether the process used URL/base64 in arguments and whether it performs network communication.

Future work

The resulting RNN model itself should be considered only a Proof of Concept. It does not generalize over any noticeable part of the security landscape. Additionally, this specific model may be considered overly optimistic, as no test-set logic is implemented.

As stated in the beginning:

The ultimate goal of such work may be to create a Neural Network that generalizes well across all techniques in the MITRE ATT&CK matrix, while being able to distinguish them from all sorts of valid infrastructure activity.

To accomplish that, the following activities look promising.

Dataset

The largest part of future work must be done on comprehensive dataset collection. A good starting point here is the Mordor dataset (a creation of Roberto Rodriguez and Jose Luis Rodriguez).

This dataset contains events from emulated APT group activity, covering more techniques. With a little preprocessing and customization, the Mordor dataset can be integrated into this article's RNN training pipeline.

Figure 19. MITRE ATT&CK coverage by Mordor dataset. Source: https://redcanary.com/blog/comparing-red-team-platforms/

More examples of each specific technique are needed to build a reliable model. For that, data augmentation techniques may be useful: generating similar logs but with different ProcessId values, source/destination ports, filenames, and similar variations is worth considering.

Here, deliberate consideration of what to modify is required, as some information is crucial and must not be changed (e.g., RPC SMB pipes, some binary names or arguments), while other telemetry is variable and will be modified by an adversary (document names and paths, SMB pipes for C&C).

Nevertheless, I believe that in wise hands this activity will result in a better model, less prone to overfitting and with better predictive power.

Not only malicious telemetry is needed: lots of benign activity should be seen by such an AI agent in order to reduce the False Positive rate.

Should such a model be built by an internal enterprise security team for your own purposes, you may use existing infrastructure logs as a benign baseline + Mordor + your own emulated attack events (emulating the specific APTs whose target group you're in). This collection should provide a decent starting-point dataset.

For EDR or SIEM vendors, the result should be even more generalized, as you need to cover any potential infrastructure. My suggestion is to start with existing telemetry and simulated data (both benign and malicious) and, in the process, address the mistakes of the first model prototypes (a.k.a. "partial fit").

Architecture

I also suggest optimizing the neural network architecture:

  • If this logic is implemented in production, it will likely be part of a composite solution where detection heuristics are based on weights from both manual behavioral signatures and neural network predictions (see Joshua Saxe: hybrid solutions work better than signature-based or AI-based approaches separately).
  • With high probability, a single model won't achieve successful classification on its own; rather, a set of different neural networks, each targeting its own set of offensive techniques, will, e.g., one model good at detecting lateral movement, another at enumeration, a third at persistence.

Conclusions

It's safe to state that the hypothesis specified at the beginning of the article holds true.

The Recurrent Neural Network model proved able to learn patterns from a sequence of Windows Events and to classify security behavior based just on Sysmon events from a single process.

In addition, a detailed elaboration of all crucial manipulations is provided, accompanied by source code. Anyone interested in the topic is weaponized with the necessary background and references to apply this work to their own data or to develop the idea further.


Dmitrijs Trizna

Sr. Security Researcher @ Microsoft. This blog is an independent R&D at the intersection of Machine Learning and Cyber-Security.