Microsoft Sentinel Blog

msticpy - Python Defender Tools

ianhelle

Microsoft

Jun 17, 2019

Introduction

[This article has been superseded by a newer version - please see the "MSTICPy and Jupyter Notebooks in Azure Sentinel" article.]

msticpy is a package of python tools intended to be used for security investigations and hunting (primarily in Jupyter notebooks). Most of the tools originated from code written in Jupyter notebooks which was tidied up and re-packaged into python modules. I’ve added some references to other blogs in the References section, where I describe some of these notebooks in more detail.

The goals of the package are twofold:

Reduce the clutter of code in notebooks making them easier to use and read.
Provide building-blocks for future notebooks to make authoring them simpler and quicker.

There are some side benefits from this:

The functions and classes are easier to test when extracted into standalone modules, so (hopefully) they are more robust.
The code is easier to document, and the functionality is more discoverable than having to wade through old notebooks and copy and paste the desired functions.

While much of the functionality is only useful in Jupyter notebooks (e.g. much of the nbtools sub-package), there are several modules that are usable in any python application - most of the modules in the sectools sub-package fall into this category.

msticpy is organized into three main sub-packages:

sectools - python security tools to help with data analysis or investigation. These are all focused on data transformation, data analysis or data enrichment.
nbtools - Jupyter-specific UI tools such as widgets and data display. These are mostly presentation-layer tools concentrating on how to view or interact with the data.
data - data interfaces and query library for log and alert APIs including Azure Sentinel/Log Analytics, Microsoft Graph Security API and Microsoft Defender Advanced Threat Protection (MDATP).

The package is still in an early preview mode so there are likely to be bugs, possible API changes and much is not yet optimized for performance. We welcome feedback, bug reports and suggestions for new or improved features as well as contributions directly to the package.

In this article I'll give a brief overview of the main components. This is intended as an overview of some of the features rather than a full user guide. Although the modules/functions/classes are documented at the API level, we are still missing more detailed user guidance. In future blogs I will drill down into some of the specific components to describe their use (and limitations) in more detail, which will help fill some of this gap. Some of the modules have user documentation notebooks, which are listed in the References section at the end of the document. The API documentation is available on msticpy ReadTheDocs.

Request for Comments

We would really appreciate suggestions for future or better features. You can add these in comments to this doc or directly as issues on the msticpy GitHub.

Installing

The package requires Python 3.6 or later (see Supported Platforms for more details).

pip install msticpy

or, for the latest dev build (although we usually publish directly to PyPI):

pip install git+https://github.com/microsoft/msticpy

A conda recipe and package is in the works but not yet available.

Installing the package will also install its dependencies if the required versions are not already installed. If you are installing into an environment where you already use some of these dependencies (especially if you use conflicting versions), you should create a python or conda virtual environment and use your notebooks from within that.

Security Tools Sub-package - sectools

This sub-package contains several modules helpful for working on security investigations and hunting. These are mostly data processing modules and classes and usually not restricted to use in a Jupyter/IPython environment (some of the modules have a visualization component that may not work outside a notebook environment).

base64unpack

This is a Base64 and archive (gz, zip, tar) extractor intended to help decode obfuscated attack command lines and http request strings. Input can either be a single string or a specified column of a pandas dataframe. The module will try to identify any base64 encoded strings and decode them. If the result of a decoding looks like one of the supported archive types, it will try to unpack the contents. The results of each decode/unpack are rechecked for further base64 content and it will recurse down up to 20 levels (the default can be overridden, but if you need more than 20 levels, there is probably something wrong!). Output is a decoded string (for single string input) or a DataFrame (for dataframe input).
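To make the recursive decode idea concrete, here is a minimal sketch using only the Python standard library. It is not the msticpy implementation: the names unpack_text and try_decode, the candidate regex and the gzip-only archive handling are all assumptions made for illustration.

```python
import base64
import binascii
import gzip
import re
import zlib

# Runs of 16+ base64 characters with optional padding (a heuristic that can
# produce false positives on long alphanumeric strings).
B64_PATTERN = re.compile(r"[A-Za-z0-9+/]{16,}={0,2}")
MAX_DEPTH = 20  # same default recursion limit as described in the text


def try_decode(candidate):
    """Return the decoded text for a base64 candidate, or None if it does not decode."""
    try:
        raw = base64.b64decode(candidate, validate=True)
    except (binascii.Error, ValueError):
        return None
    if raw[:2] == b"\x1f\x8b":  # gzip magic number: try to unpack the archive
        try:
            raw = gzip.decompress(raw)
        except (OSError, EOFError, zlib.error):
            return None
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return None


def unpack_text(text, depth=0):
    """Replace decodable base64 runs in text, recursing into the decoded content."""
    if depth >= MAX_DEPTH:
        return text

    def replace(match):
        decoded = try_decode(match.group(0))
        if decoded is None:
            return match.group(0)  # leave undecodable runs untouched
        return unpack_text(decoded, depth + 1)

    return B64_PATTERN.sub(replace, text)
```

Calling unpack_text on a command line containing a doubly encoded payload returns the command line with the innermost decoded text substituted in place. A real tool would also need to handle zip and tar content, UTF-16 encoded payloads and binary results, none of which this sketch attempts.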

iocextract

This uses a set of built-in regular expressions to look for Indicator of Compromise (IoC) patterns. Input can be a single string or a pandas dataframe with one or more columns specified as input. You can add additional patterns and override built-in patterns.

The following types are built-in: IPv4 and IPv6, URLs, DNS domains, Hashes (MD5, SHA1, SHA256), Windows file paths and Linux file paths (this latter regex is kind of noisy because a legal linux file path can have almost any character). The two path regexes are not run by default.

Output is a dictionary of matches (for single string input) or a DataFrame (for dataframe input).
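The general approach can be sketched in a few lines of standard-library Python. The patterns below are deliberately simplified and are not the regular expressions shipped with msticpy; the function name extract_iocs and its extra_patterns parameter are hypothetical and only illustrate how built-in patterns might be extended or overridden.

```python
import re
from collections import defaultdict

# Simplified illustrative patterns (assumptions, not the msticpy built-ins).
_OCTET = r"(?:25[0-5]|2[0-4]\d|1?\d?\d)"
IOC_PATTERNS = {
    "ipv4": re.compile(r"\b(?:" + _OCTET + r"\.){3}" + _OCTET + r"\b"),
    "url": re.compile(r"\bhttps?://[^\s<>]+", re.IGNORECASE),
    "md5_hash": re.compile(r"\b[a-fA-F0-9]{32}\b"),
    "sha256_hash": re.compile(r"\b[a-fA-F0-9]{64}\b"),
}


def extract_iocs(text, extra_patterns=None):
    """Return a dict mapping IoC type to the set of matches found in text."""
    patterns = dict(IOC_PATTERNS)
    # Entries with a new key add a pattern; entries with an existing key override it.
    patterns.update(extra_patterns or {})
    found = defaultdict(set)
    for ioc_type, pattern in patterns.items():
        for match in pattern.finditer(text):
            found[ioc_type].add(match.group(0))
    return dict(found)
```

Only types with at least one match appear in the result, so a string with no indicators yields an empty dictionary.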

vtlookup

A wrapper class around the VirusTotal API. Input can be a single IoC observable or a pandas DataFrame containing multiple observables. Processing requires a VirusTotal account and API key, and processing performance is limited by the number of requests per minute allowed for your account type. For example, a VirusTotal free account is limited to 4 requests per minute. Supported IoC types are: Filehash (MD5, SHA1, SHA256), URL, DNS Domain, IPv4 Address.
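If you are submitting many observables, the per-minute quota is usually the bottleneck. The sketch below shows one way a caller could pace requests to stay within such a quota. RequestThrottle is a hypothetical helper, not part of msticpy or the VirusTotal API, and the default of 4 requests per 60 seconds simply mirrors the free-tier figure quoted above.

```python
import time
from collections import deque


class RequestThrottle:
    """Sliding-window throttle: allow at most max_requests per period (seconds)."""

    def __init__(self, max_requests=4, period=60.0,
                 clock=time.monotonic, sleep=time.sleep):
        self._max = max_requests
        self._period = period
        self._clock = clock    # injectable so the class can be tested without waiting
        self._sleep = sleep
        self._sent = deque()   # timestamps of requests inside the current window

    def _expire(self, now):
        # Drop timestamps that have fallen out of the window.
        while self._sent and now - self._sent[0] >= self._period:
            self._sent.popleft()

    def wait(self):
        """Block until another request may be sent, then record it."""
        now = self._clock()
        self._expire(now)
        if len(self._sent) >= self._max:
            # Sleep until the oldest recorded request leaves the window.
            self._sleep(self._period - (now - self._sent[0]))
            now = self._clock()
            self._expire(now)
        self._sent.append(now)
```

A caller would invoke wait() immediately before each lookup request.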

geoip

Geographic location lookup for IP addresses is implemented as a generic class with support for different data providers. The shipped module has two data providers:

GeoLiteLookup - Maxmind Geolite (see https://www.maxmind.com)
IPStackLookup - IPStack (see https://ipstack.com)

Both services offer a free tier for non-commercial use. However, a paid tier will normally get you more accuracy, more detail and a higher throughput rate. Maxmind geolite uses a downloadable database, while IPStack is an online lookup (an account and API key are required).

The following screenshot shows both the use of the GeoIP lookup classes and a map display produced by another msticpy module using folium (a python package using leaflet.js).

eventcluster

This module is intended to be used to summarize large numbers of events into clusters of different patterns. High volume repeating events can often make it difficult to see unique and interesting items.

The module contains functions to generate clusterable features from string data. For example, consider an administration command that performs some maintenance on thousands of servers, with a command line such as:

install-update -hostname {host.fqdn} -tmp:/tmp/{some_GUID}/rollback

These repetitions can be collapsed into a single cluster pattern by ignoring the character values in the string and using delimiters or tokens to group the values.
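As a simplified illustration of the feature idea, the sketch below reduces each command line to a "shape" (a delimiter count and a token count) and groups command lines whose shapes are identical. The function names and the delimiter set are assumptions for illustration only, and exact-match grouping is a stand-in here for proper clustering, which can also group near matches.

```python
import re
from collections import defaultdict

# Characters treated as delimiters (an illustrative choice).
DELIMITERS = re.compile(r"[\s\-\\/.,'|&:;%$()]")


def delim_count(cmdline):
    """Number of delimiter characters in the command line."""
    return len(DELIMITERS.findall(cmdline))


def token_count(cmdline):
    """Number of whitespace-separated tokens in the command line."""
    return len(cmdline.split())


def shape(cmdline):
    """Feature tuple that ignores the actual character values."""
    return (delim_count(cmdline), token_count(cmdline))


def group_by_shape(cmdlines):
    """Group command lines that share an identical feature tuple."""
    groups = defaultdict(list)
    for cmdline in cmdlines:
        groups[shape(cmdline)].append(cmdline)
    return dict(groups)
```

Two install-update command lines that differ only in host name and GUID produce the same feature tuple and land in one group, while an unrelated command ends up in a group of its own.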

This module uses an unsupervised learning model implemented with scikit-learn's DBSCAN.

outliers

Similar to the eventcluster module but a little bit more experimental (read 'less tested'). It uses scikit-learn's Isolation Forest to identify outlier events, either within a single data set or by using one data set as training data and predicting outliers in another.

auditdextract

Module to load and decode Linux audit logs. It collapses messages sharing the same message ID into single events, decodes hex-encoded data fields and performs some event-specific formatting and normalization (e.g. for process start events it will re-assemble the process command line arguments into a single string).
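The collapse-and-decode step can be sketched as follows. This is not the msticpy code: the function names, the regular expressions and the returned dictionary layout are assumptions, and the sketch only decodes hex-encoded EXECVE arguments rather than every field the real module handles.

```python
import re
from collections import defaultdict

# One raw record: type=<TYPE> msg=audit(<timestamp>:<serial>): key=value ...
LINE_RE = re.compile(
    r"type=(?P<type>\w+) msg=audit\((?P<msg_id>[\d.]+:\d+)\): (?P<body>.*)"
)
FIELD_RE = re.compile(r'(?P<key>\w+)=(?P<value>"[^"]*"|\S+)')
ARG_KEY_RE = re.compile(r"a\d+$")  # EXECVE argument fields: a0, a1, a2, ...


def _decode_hex(value):
    """Decode a hex-encoded field; fall back to the raw value if that fails."""
    try:
        return bytes.fromhex(value).decode("utf-8")
    except ValueError:  # also covers UnicodeDecodeError
        return value


def parse_audit_lines(lines):
    """Collapse records sharing a message ID into one event per ID."""
    events = defaultdict(dict)
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match is None:
            continue
        record_type = match["type"]
        fields = {}
        for field in FIELD_RE.finditer(match["body"]):
            key, value = field["key"], field["value"]
            if len(value) >= 2 and value[0] == '"' and value[-1] == '"':
                value = value[1:-1]          # quoted values are plain text
            elif record_type == "EXECVE" and ARG_KEY_RE.match(key):
                value = _decode_hex(value)   # unquoted arguments are hex-encoded
            fields[key] = value
        events[match["msg_id"]][record_type] = fields

    # Re-assemble the process command line from the EXECVE arguments.
    for event in events.values():
        execve = event.get("EXECVE")
        if execve:
            arg_keys = sorted(
                (k for k in execve if ARG_KEY_RE.match(k)), key=lambda k: int(k[1:])
            )
            event["cmdline"] = " ".join(execve[k] for k in arg_keys)
    return dict(events)
```

Each event in the result is keyed by its message ID and holds one dictionary of fields per record type, plus a cmdline entry for process start events.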

The following figures show examples of raw audit messages and converted messages (these are two different event sets, so don’t show the same messages).

Notebook tools sub-package - nbtools

This is a collection of display and utility modules designed to make working with security data in Jupyter notebooks quicker and easier.

nbwidgets - groups common functionality such as list pickers, time boundary settings, saving and retrieving environment variables into a single line callable command. In most cases these are simple wrappers and collections of the standard IPyWidgets.
nbdisplay - functions that implement common display of things like alerts and events in a slightly prettier and more consumable way than print().

nbwidgets

Query time selector

Session browser

Alert browser

nbdisplay

Event timeline

Logon display

Process Tree

Data sub-package - data

Some of these components are currently part of the nbtools sub-package but will be migrated to the data sub-package.

Parameterized query manager

This is a collection of modules that includes a set of commonly used queries and can be supplemented by user-defined queries supplied in yaml files. The purpose of these is to give you quick access to commonly used queries in a way that allows easy substitution of parameter values such as date range, host name, account name, etc. The package currently supports Kusto query language (KQL) queries targeted at Log Analytics and OData queries targeted at Microsoft Graph Security API. We are building driver modules to work with the Microsoft Defender Advanced Threat Protection API and, in principle, the package could be extended to cover any queries expressed as a simple string expression. The architecture and yaml format was inspired by the Intake package – although some of the parameter substitution gymnastics meant that I was not able to use this package directly.
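The core idea of a parameterized query, a template plus default and required parameters, can be sketched with plain Python string formatting. The structure below is an assumption made for illustration: it does not reproduce the msticpy yaml schema or its query provider API, and the KQL text is only an example of what a host logon query might look like.

```python
from datetime import datetime

# Illustrative query definition (not the msticpy yaml format).
QUERY_DEFS = {
    "list_host_logons": {
        "description": "Logon events for a host within a time range",
        "query": (
            "{table}\n"
            "| where TimeGenerated >= datetime({start})\n"
            "| where TimeGenerated <= datetime({end})\n"
            "| where Computer has '{host_name}'"
        ),
        "defaults": {"table": "SecurityEvent"},
        "required": ["start", "end", "host_name"],
    }
}


def build_query(name, **params):
    """Substitute parameter values into the named query template."""
    definition = QUERY_DEFS[name]
    values = {**definition["defaults"], **params}
    missing = [p for p in definition["required"] if p not in values]
    if missing:
        raise ValueError("Missing query parameters: " + ", ".join(missing))
    # Render datetime values in ISO format so they are valid inside datetime().
    values = {
        key: value.isoformat() if isinstance(value, datetime) else value
        for key, value in values.items()
    }
    # Note: plain substitution does no escaping, so only pass trusted values.
    return definition["query"].format(**values)
```

Calling build_query("list_host_logons", start=..., end=..., host_name=...) returns the query text ready to pass to whichever driver executes it. Defaults such as the table name can be overridden by passing them as keyword arguments.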

Sample query definition

Query provider setup

Running a query

Note: the parameters for the query are auto-extracted from the query_times date widget object.

Other Modules

security_alert and security_event

These are encapsulation classes for alerts and events. Each has a standard 'entities' property reflecting the entities found in the alert or event. These can also be used as meta-parameters for many of the queries. For example, the query:

qry.list_host_logons(query_times, alert)

will extract the value for the hostname query parameter from the alert.

entityschema

This module implements entity classes (e.g. Host, Account, IPAddress, etc.) used in Log Analytics alerts and in many of these modules. Each entity encapsulates one or more properties related to the entity. This example shows a Linux alert with the related entities.

To-Do Items

Some of the items on our to-do list are shown below. However, other things requested by popular demand or contributed by others can certainly change this.

Create generic Threat Intel lookup interface supporting multiple providers.
Add additional modules for host-to-ip and ip-to-host resolution.
Add syslog queries, processing and visualizations.
Add network queries, processing and visualizations.
Add additional notebooks to document use of the tools.

Supported Platforms and Packages

msticpy is OS-independent
Requires Python 3.6 or later
Requires the following python packages: pandas, bokeh, matplotlib, seaborn, setuptools, urllib3, ipywidgets, numpy, attrs, requests, networkx, ipython, scikit_learn, typing
The following packages are recommended and needed for some specific functionality: Kqlmagic, maxminddb_geolite2, folium, dnspython, ipwhois

Contributing to msticpy

msticpy is intentionally an open source package so that it is available to be used as-is or in modified form by anyone who wants to. We also welcome contributions – whether these are whole features, extensions of existing features, bug-fixes or additional documentation.

I’m a little finicky about code hygiene so I would (politely) ask the following for potential contributors:

Include doc comments in all modules, classes, public functions and public methods. Please use numpy docstring standard for consistency and to allow our auto-documentation to work well.
We are converting to Black code formatting throughout the project. This will happen whether you format your code like this or not. 😊
Type annotations are a great thing. History and I will thank you for adding type annotations. See this section of the docs for more information.
We write unit tests using Python unittest format but run these with pytest. Please add unit tests for any substantial PRs – and please make sure that the existing unit tests complete successfully.
Linters and other stuff. Committed branches will kick off tests and linting in the Azure build pipeline. Many of these are non-breaking (i.e. your build will complete with warnings) but please try to avoid introducing any new warnings (I’m having a hard enough time fixing my own warnings!). Using pylint, prospector, mypy and pydocstyle is a good minimum combination.