
Image generated by Bing Image Creator using the prompt "Show a cartoon cowboy trying to wrangle a lake"
Microsoft Sentinel Blog

Use Data Wrangler to Streamline Your Microsoft Sentinel data lake Notebook Development

David Hoerster
Apr 29, 2026

As you create a Sentinel data lake notebook, sometimes you need to explore the data and refine it before moving to your next cell. Using the Data Wrangler extension in Visual Studio Code can help you visualize and shape your DataFrames more efficiently.

One of the many exciting features of the Microsoft Sentinel data lake is a built-in advanced analytics engine, powered by Apache Spark. This Spark cluster has access to the data within the Sentinel data lake and can work with it through Jupyter notebooks in Visual Studio Code. As with any coding effort, creating the right data set can be an iterative process, and sometimes making those changes purely through code can be a little tricky. Wouldn't it be great if you could visualize the distribution of your data, apply some actions to shape and refine it, and then translate those actions to code? Well, you can do that with the Data Wrangler extension in VS Code in conjunction with the Sentinel data lake's MicrosoftSentinelProvider class. This blog will walk you through how to enable Data Wrangler in VS Code, how to use some of its functionality, and how to incorporate refinement actions back into your data lake notebook.

Scenario

The DataFrame we build will be sourced from the SigninLogs table and used in a later algorithm. We need to clean up some of the columns by replacing missing values with default values, removing rows that meet certain criteria, and creating some categorical columns for later machine learning tasks.

Initial DataFrame

An essential data structure you'll use in Jupyter notebooks is the DataFrame. A DataFrame is an in-memory representation of your data, much like a database table with columns and rows.
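For example, here's a tiny Pandas DataFrame (the column names and values are made up just for illustration):

import pandas as pd

# Two columns, two rows: like a miniature database table
df = pd.DataFrame(
    {
        "UserPrincipalName": ["alice@contoso.com", "bob@contoso.com"],
        "FailedSignIns": [0, 3],
    }
)
print(df)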

Let's start with a basic DataFrame that contains some sign-in events from the SigninLogs table in the data lake. The returned data is useful, but for our later investigations we will need to "clean" it by removing unusable rows, filling in missing values, renaming columns, and creating true/false columns for analysis. In our notebook cell, we'll perform the following actions.

Initial Includes

Before you can use the Sentinel data lake in your notebook, you need to include the proper class from the sentinel_lake.providers module. This module contains a class named MicrosoftSentinelProvider that provides functions that let you read from and write to the data lake. We'll also use a few other Python libraries in our example, so our imports look like the following:

from sentinel_lake.providers import MicrosoftSentinelProvider
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
import pandas as pd
from datetime import datetime, timedelta

Variable Definitions

Our sample will pull the last 30 days of SigninLogs from the data lake to assist with the investigation. This will be a variable that is defined once in the notebook and can be used elsewhere if needed. The same will be done for the name of the data lake workspace being queried, since the read_table and save_as_table functions can take the workspace name as a parameter, and we only want to define the name once and avoid typos across multiple calls.

In addition, there's a very important step where we instantiate our connection to the Sentinel data lake. The "spark" variable we pass to the MicrosoftSentinelProvider class is a global variable representing your Spark session. The resulting sentinel_provider object exposes the read_table and save_as_table functions that enable reading from and writing to the data lake.

# Look back 30 days from now
one_month_ago = datetime.now() - timedelta(days=30)
# Define the workspace name once to avoid typos across multiple calls
workspaceName = "YOUR_WORKSPACE_NAME"
# Entry point for reading from and writing to the data lake
sentinel_provider = MicrosoftSentinelProvider(spark)

Replace "YOUR_WORKSPACE_NAME" with the name of the Sentinel workspace that you will be working with in the data lake.

Complex Type Definitions

Part of our SigninLogs query returns complex types that contain name/value pairs. The LocationDetails and Status columns have nested values: city and state for LocationDetails, and errorCode and failureReason for Status. To easily access those nested values, we can define each structure with a StructType, which we'll use when retrieving the DataFrame.

location_schema = StructType(
    [
        StructField("city", StringType(), True),
        StructField("state", StringType(), True),
        StructField("countryOrRegion", StringType(), True),
    ]
)

status_schema = StructType(
    [
        StructField("errorCode", IntegerType(), True),
        StructField("failureReason", StringType(), True),
        StructField("additionalDetails", StringType(), True),
    ]
)

DataFrame Definition

We now have the parts needed to create a DataFrame for the last 30 days of data from the SigninLogs table in the lake. Our DataFrame definition uses our time variable as a filter on TimeGenerated, selects a handful of columns that we want returned, breaks down our complex types using the StructTypes defined earlier, and surfaces those nested values as individual DataFrame columns.

signin_events_df = (
    sentinel_provider.read_table("SigninLogs", workspaceName)
    .filter(col("TimeGenerated") >= one_month_ago) 
    .filter(col("UserPrincipalName") != "")  
    .select(
        col("TimeGenerated"),
        col("AppDisplayName"),
        col("IPAddress"),
        col("IsRisky"),
        col("RiskState"),
        col("RiskLevelAggregated"),
        col("RiskLevelDuringSignIn"),
        col("ConditionalAccessStatus"),
        col("ClientAppUsed"),
        col("IsInteractive"),
        col("UserType"),
        col("MfaDetail"),
        col("LocationDetails"),
        col("Status"),
    )  
    .withColumn("loc", from_json(col("LocationDetails"), location_schema))
    .withColumn("status", from_json(col("Status"), status_schema))
    .select(
        "*",
        col("loc.city").alias("City"),
        col("loc.state").alias("State"),
        col("loc.countryOrRegion").alias("Country"),
        col("status.errorCode").alias("ErrorCode"),
        col("status.failureReason").alias("FailureReason"),
        col("status.additionalDetails").alias("AdditionalDetails"),
    )
    .drop("loc", "status")
)

Final Code (for now)

Putting all of these steps together results in the following code for our cell, which retrieves the last 30 days of SigninLogs into a DataFrame:
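from sentinel_lake.providers import MicrosoftSentinelProvider
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
import pandas as pd
from datetime import datetime, timedelta

# Look back 30 days from now
one_month_ago = datetime.now() - timedelta(days=30)
# Define the workspace name once to avoid typos across multiple calls
workspaceName = "YOUR_WORKSPACE_NAME"
# Entry point for reading from and writing to the data lake
sentinel_provider = MicrosoftSentinelProvider(spark)

# Nested structure of the LocationDetails column
location_schema = StructType(
    [
        StructField("city", StringType(), True),
        StructField("state", StringType(), True),
        StructField("countryOrRegion", StringType(), True),
    ]
)

# Nested structure of the Status column
status_schema = StructType(
    [
        StructField("errorCode", IntegerType(), True),
        StructField("failureReason", StringType(), True),
        StructField("additionalDetails", StringType(), True),
    ]
)

signin_events_df = (
    sentinel_provider.read_table("SigninLogs", workspaceName)
    .filter(col("TimeGenerated") >= one_month_ago)
    .filter(col("UserPrincipalName") != "")
    .select(
        col("TimeGenerated"),
        col("AppDisplayName"),
        col("IPAddress"),
        col("IsRisky"),
        col("RiskState"),
        col("RiskLevelAggregated"),
        col("RiskLevelDuringSignIn"),
        col("ConditionalAccessStatus"),
        col("ClientAppUsed"),
        col("IsInteractive"),
        col("UserType"),
        col("MfaDetail"),
        col("LocationDetails"),
        col("Status"),
    )
    .withColumn("loc", from_json(col("LocationDetails"), location_schema))
    .withColumn("status", from_json(col("Status"), status_schema))
    .select(
        "*",
        col("loc.city").alias("City"),
        col("loc.state").alias("State"),
        col("loc.countryOrRegion").alias("Country"),
        col("status.errorCode").alias("ErrorCode"),
        col("status.failureReason").alias("FailureReason"),
        col("status.additionalDetails").alias("AdditionalDetails"),
    )
    .drop("loc", "status")
)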

 

Running that cell and then calling show() on the resulting DataFrame prints the rows as a plain-text table:
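# Preview the first few rows; the row count and truncation settings are just illustrative
signin_events_df.show(5, truncate=False)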

It's great data, but not the most visually appealing. It would be nice to have a cleaner-looking table. That's where Data Wrangler can help right away.

Install Data Wrangler

Data Wrangler is a VS Code extension published by Microsoft. You can find it in the VS Code Marketplace by searching for "Data Wrangler". Installing the extension is quick and only requires Python 3.8 or higher to be installed on your machine.

 

Data Wrangler View of a DataFrame

Data Wrangler, by default, works natively with Pandas DataFrames. Pandas is an open-source Python library that is very popular with data scientists for data analysis and manipulation. When working with the MicrosoftSentinelProvider class, the DataFrame returned is a PySpark DataFrame. We can easily convert a PySpark DataFrame to a Pandas DataFrame by calling `.toPandas()` on it.
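For example:

# Convert the PySpark DataFrame to a Pandas DataFrame so Data Wrangler can open it
signin_events_pandas_df = signin_events_df.toPandas()
signin_events_pandas_df.head()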

Opening the converted DataFrame in Data Wrangler gives us a much cleaner-looking table. Clicking the ellipsis in the bottom right of the table and selecting "Show column insights" changes the view to provide a quick glance at the distribution of the data.

Now, just by glancing at the column headers, you can quickly assess the distribution of data in the DataFrame. You can see that 7% of conditional access attempts failed, that a number of sign-in events were for Security Copilot, and that 30% of the sign-in events came from just three IP addresses.

Wrangling Your Data

A cleaner table view with data distribution statistics is nice, but the real power of Data Wrangler is in shaping and refining your data for use elsewhere in your notebook. In the simple DataFrame we have created, let's perform some data cleansing steps so that we can more easily filter and join this DataFrame with other DataFrames later in our analysis. At first glance, there are a few data cleansing tasks to perform, namely:

  • Remove rows that have non-usable UserType values of -1
  • Create a true/false column for whether the user is a Member or Guest, and drop the original UserType column
  • Fill in column values that have missing data with a default value
  • Filter out sign-ins to the My Profile page

Let's get started by clicking the Data Wrangler icon in the lower-left corner of the DataFrame to open Data Wrangler.

Data Wrangler will open in a new tab in VS Code. There's a lot going on in this tab: the left-hand pane has sections for an operations toolbox, a data summary panel that lists some stats about your DataFrame, and a cleaning steps list that keeps track of the changes you have made to your DataFrame. The rest of the page is split in two, with the DataFrame view taking up the majority of the real estate and the operation preview pane at the bottom.

We'll spend most of our time in the operations pane, but we'll also use the operation preview pane to do some additional tasks. Let's dive in.

Task 1: Remove Rows

Looking at the DataFrame grid, I can see the UserType column has some rows with a value of "-1". I don't want those in my DataFrame, so I'll remove them using a filter.

Selecting Filter in the Operations panel allows me to enter my criteria. I want to exclude rows that have a "-1" for UserType. I'll enter that, and after a few seconds the DataFrame updates, allowing me to preview the change.

I unchecked the "Keep matching rows" checkbox, so my filter is excluding rows that match my criteria of UserType "Equal to" the value "-1". In the DataFrame, UserType is highlighted and I see that -1 is now not part of the DataFrame. Below the DataFrame, in the operation preview, I can see the Python code that makes this change. And in the Cleaning Steps pane, I see my Filter step is present. I can accept this change by clicking the Apply button in the Operations pane.
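The preview code for this step is a one-line Pandas filter (the same line appears later in the exported function):

# Keep only the rows where UserType is not "-1"
df = df[~(df["UserType"] == "-1")]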

Once I click Apply, my DataFrame is updated with the Filter operation. Everything Data Wrangler does happens in a sandbox, so these steps do not affect my original DataFrame...at least not yet. (We'll get to that.) Let's make a few more changes.

Task 2: One-Hot Encoded Columns

I want to be able to filter on UserType later in my notebook, but I don't want to do string comparisons. I'd rather filter on a simple binary column. That's where one-hot encoded columns are useful. I'd like to have a column for IsMember and one for IsGuest. Each column will hold a 0 or a 1 (false or true), which lets me filter quickly instead of doing string comparisons. Let's create those columns.

In the Operations pane, expand Formulas and select One-hot encode. The panel will switch so you can enter the column you're targeting. Select UserType, and in a few seconds, you'll see your DataFrame update with a preview of the new columns.

Notice the new columns created (UserType_Guest and UserType_Member) are in green. The UserType column is in red and will be dropped. Clicking Apply accepts these changes, and you'll see the updated DataFrame.

You can rename the new columns by selecting the Rename column operation under Schema. In this case, we'll rename the new columns to IsMember and IsGuest and accept the changes. Your Data Wrangler tab will now show the renamed columns.
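Behind the scenes, the one-hot encode and rename steps translate to Pandas code along these lines (you'll see the same code in the exported function later):

# Insert the one-hot columns where UserType used to be
insert_loc = df.columns.get_loc("UserType")
df = pd.concat(
    [
        df.iloc[:, :insert_loc],
        pd.get_dummies(df.loc[:, ["UserType"]]),
        df.iloc[:, insert_loc + 1 :],
    ],
    axis=1,
)
# Rename the generated columns to friendlier names
df = df.rename(columns={"UserType_Guest": "IsGuest", "UserType_Member": "IsMember"})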

Task 3: Provide Default Values for Missing Data

Scanning through the DataFrame, we can see that the FailureReason and AdditionalDetails columns have a number of missing values. We'd prefer each cell to have a value rather than be empty. Filling in defaults for missing values is another built-in operation. Under Find and Replace in the Operations pane, select Fill missing values. You can set a default value for multiple columns in one swoop with this operation.

I'm setting the same default value ("N/A") for both columns in one operation. The columns in red are the old values; the columns in green are the new values. Again, if this looks good, hit the Apply button and the DataFrame is updated.
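The generated code for this step is a single fillna call:

# Replace missing values in both columns with "N/A"
df = df.fillna({"FailureReason": "N/A", "AdditionalDetails": "N/A"})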

Task 4: Use Copilot to Create Operations

One last update we wanted to make was to filter out rows where the target application was "My Profile". We already created a filter operation earlier, but this time we'll use Copilot to generate the operation. In the Operation Preview pane, below your DataFrame, there's a text box where you can type a prompt. Enter something like "For the column AppDisplayName, filter out the rows where the value is equal to My Profile". Hit Enter; Copilot thinks for a few seconds and then displays the code in the preview pane, along with a modal dialog stating that the preview is paused. Since this change was generated by Copilot, you need to review the code before accepting it.

If the code looks good, click the Run code link in the modal, and your DataFrame will return to preview mode. You'll see the filtered-out rows highlighted; if everything checks out, click Apply to accept the operation.
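The Copilot-generated code should look something like this (your exact output may vary):

# Keep only the rows where AppDisplayName is not "My Profile"
df = df[df["AppDisplayName"] != "My Profile"]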

Using Copilot to create operations can be very helpful when you know what you want to do but aren't sure what the operation is called, such as one-hot encoding. But you should always examine the generated code before accepting it.

Applying the Changes to Our Notebook

We've created a number of operations and our DataFrame looks great, but how can we translate these operations back to our original notebook? Data Wrangler makes that easy by allowing you to export your operations back into the source notebook.

Once you're satisfied with your changes, click the Export to notebook button above your DataFrame.

This action takes all of the operations you created and generates a new cell in your Jupyter notebook, right below the one where you launched the Data Wrangler tab. Your operations will be contained within a local function, and a copy of your DataFrame will be passed to that function. The result is a new DataFrame that you can work with throughout the rest of your notebook.

Since this is all code, you can change variable names or even the structure of the generated code. Personally, I like to change the DataFrame names from the generic "df" and "df_clean" to something more meaningful, and to rename the local function to something more descriptive. This way, if others are working on the same notebook, they have a better understanding of what the code is doing. It may look like this:

def clean_signin_info(df):
    # Filter rows based on column: 'UserType'
    df = df[~(df["UserType"] == "-1")]
    # One-hot encode column: 'UserType'
    insert_loc = df.columns.get_loc("UserType")
    df = pd.concat(
        [
            df.iloc[:, :insert_loc],
            pd.get_dummies(df.loc[:, ["UserType"]]),
            df.iloc[:, insert_loc + 1 :],
        ],
        axis=1,
    )
    # Rename column 'UserType_Guest' to 'IsGuest'
    df = df.rename(columns={"UserType_Guest": "IsGuest"})
    # Rename column 'UserType_Member' to 'IsMember'
    df = df.rename(columns={"UserType_Member": "IsMember"})
    # Replace missing values with "N/A" in columns: 'FailureReason', 'AdditionalDetails'
    df = df.fillna({"FailureReason": "N/A", "AdditionalDetails": "N/A"})
    # Filter out rows where 'AppDisplayName' is 'My Profile' (the Copilot-generated step from Task 4)
    df = df[df["AppDisplayName"] != "My Profile"]
    return df


signin_events_pandas_df = signin_events_df.toPandas()
cleaned_signin_events_df = clean_signin_info(signin_events_pandas_df)
cleaned_signin_events_df.head()

And the resulting DataFrame will have all of our cleaning steps applied.

Start Using Data Wrangler Today

You can get started using Data Wrangler with your Sentinel data lake notebooks today and explore all of the data wrangling tasks you can do with it. The Data Wrangler extension is available in the VS Code Marketplace and is free to download and use. It works well with the Microsoft Sentinel extension that you use with your Sentinel data lake notebook tasks, so install it today and start wrangling the data lake. Happy wrangling!
