Azure Data Explorer support for inline Python is GA
Published Apr 06 2020
Microsoft

Azure Data Explorer (ADX) supports running Python code embedded in Kusto Query Language (KQL) using the python() plugin. The plugin runtime is hosted in a sandbox, an isolated and secure environment running on ADX's existing compute nodes. This sandbox contains the Python language engine as well as common mathematical and scientific packages. The plugin extends KQL's native functionality with a huge archive of OSS packages, enabling ADX users to run advanced algorithms, such as machine learning, artificial intelligence, statistical tests, time series analysis and many more, as part of the KQL query.

We launched this plugin about a year ago as a private preview with early adopters, followed by a public preview with a large community of internal and external users. During this period we worked directly with selected customers on common use cases, gathered a lot of feedback, and improved the plugin's robustness, functional capabilities, and scale. Today we are pleased to conclude this preview and move to GA, making the python() plugin available to all ADX users.

 

Capabilities

  • The Python image is
    • based on the Anaconda distribution, so many mathematical and scientific packages are pre-installed by default
    • customizable with additional private/public packages
  • The plugin can run in distributed mode, on multiple nodes, handling significant workloads with large amounts of data (see the sketch after this list)
  • The inline Python code can be authored and debugged in VS Code as explained here
  • The python plugin can be enabled/disabled via the Azure portal
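
A minimal sketch of the distributed mode mentioned above, using the evaluate hint.distribution option (the query and column names here are illustrative and not taken from the original post):

// Sketch: ask ADX to run the Python script on each node over its share of the data
range x from 1 to 1000000 step 1
| evaluate hint.distribution = per_node python(typeof(*, y:real),
    'result = df\n'
    'result["y"] = df["x"] * 0.5\n')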

 


 

Figure 1: Enabling the python plugin from the Azure portal

 

Examples

The plugin is invoked using the “tbl | evaluate python(…)” operator. The input table is sent to the Python sandbox, where it is mapped to a pandas DataFrame named ‘df’; the Python script should set a DataFrame named ‘result’, which is sent back to ADX as the query output.
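
For instance, a minimal query (a sketch, not from the original post) that returns the input table with one extra column illustrates the convention:

// 'df' holds the input rows; whatever is assigned to 'result' is returned to ADX
range x from 1 to 4 step 1
| evaluate python(typeof(*, x2:long),
    'result = df\n'
    'result["x2"] = df["x"] * 2\n')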

 

  • Regression analysis:

In this example we leverage numpy's polyfit() to find the optimal cubic curve that fits the (x, y) points. Note that summarize make_list() packs the whole series into a single row of arrays, so the Python script receives the full x, y, and ry columns as arrays in one DataFrame row:

 

range x from -50 to 50 step 1
| extend y = 0.5*pow(x, 3) + rand(100000) - 50000, ry=0
| summarize x=make_list(x), y=make_list(y), ry=make_list(ry)
| evaluate python(typeof(*),
        'def fit(s, deg):\n'
        '    x = np.arange(len(s))\n'
        '    coeff = np.polyfit(x, s, deg)\n'
        '    p = np.poly1d(coeff)\n'
        '    z = p(x)\n'
        '    return z\n'
        '\n'
        'result = df\n'
        'result["ry"] = df["y"].apply(fit, args=(3,))\n')
| render scatterchart with(title='Polynomial Regression')

 

Figure 2: Polynomial Regression scatter chart showing the cubic fit over the noisy (x, y) points

 

  • Scoring using a trained ML model:

We trained a logistic regression model externally and serialized the model as a string into a table in ADX. Here we use ADX as a compute target for fast scoring of new samples, calculating the confusion matrix:

 

let model_str = toscalar(ML_Models | where name == 'Occupancy' | top 1 by timestamp desc | project model);
OccupancyDetection 
| where Test == 1
| extend pred_Occupancy=bool(0)
| evaluate python(typeof(*),
    'import pickle\n'
    'import binascii\n'
    'smodel = kargs["smodel"]\n'
    'bmodel = binascii.unhexlify(smodel)\n'
    'clf1 = pickle.loads(bmodel)\n'
    'df1 = df[["Temperature", "Humidity", "Light", "CO2", "HumidityRatio"]]\n'
    'predictions = clf1.predict(df1)\n'
    'result = df\n'
    'result["pred_Occupancy"] = pd.DataFrame(predictions, columns=["pred_Occupancy"])',
    pack('smodel', model_str))
| summarize n=count() by Occupancy, pred_Occupancy  //  confusion matrix

 

Occupancy    pred_Occupancy    n
1            1                 3006
0            1                 112
1            0                 15
0            0                 9284
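
The training and serialization step happened outside ADX. A hedged sketch of how such a model string might be produced (assuming scikit-learn's LogisticRegression and a tiny synthetic training set standing in for the real Occupancy data; neither appears in the original post):

import pickle
import binascii
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Assumption: a labeled training set with the same feature columns used by the scoring query
train_df = pd.DataFrame({
    "Temperature":   [21.0, 23.5, 20.5, 24.0],
    "Humidity":      [31.0, 27.0, 33.0, 25.0],
    "Light":         [0.0, 450.0, 0.0, 500.0],
    "CO2":           [440.0, 900.0, 430.0, 950.0],
    "HumidityRatio": [0.0047, 0.0049, 0.0048, 0.0050],
    "Occupancy":     [0, 1, 0, 1],
})

features = ["Temperature", "Humidity", "Light", "CO2", "HumidityRatio"]
clf = LogisticRegression()
clf.fit(train_df[features], train_df["Occupancy"])

# Hex-encode the pickled model, mirroring the binascii.unhexlify + pickle.loads
# calls in the scoring query; the resulting string is what would be stored in the
# ML_Models table's 'model' column
smodel = binascii.hexlify(pickle.dumps(clf)).decode("utf-8")
print(smodel[:64] + "...")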

 

For further information, see the documentation.
