Journey Series for Generative AI Application Architecture - Model Inference and Evaluation
Published Mar 25 2024 01:17 AM
Microsoft

In the previous article, we integrated the entire SLMOps process through Microsoft Olive. With a single Olive.config, the development team can handle everything from data preparation and fine-tuning to format conversion and deployment. In this article, I would like to talk about model inference and evaluation.

 

cover.png

Model Inference

In Olive.config we convert the fine-tuned model to ONNX. The goal is to deploy the model in a unified format across different edge devices, giving developers a consistent experience through simple deployment and thereby expanding the usability of the model.
 

Learn about ONNX and ONNX Runtime
 

ONNX

ONNX (Open Neural Network Exchange) is an open format designed for machine learning and used to store trained models. It allows different AI frameworks (such as PyTorch and MXNet) to store model data in the same format and interoperate with each other. The ONNX format is highly versatile and extensible, and developers can easily convert models trained with PyTorch into ONNX. ONNX provides a unified model format standard, lowering the threshold for deploying and maintaining models.

ONNX has good support in terms of hardware compatibility and deployment scenarios. An optimized model can be deployed to the cloud, to edge devices, and to embedded devices, which is exactly what we need in order to deploy an SLM to different devices.
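For readers new to the export step, here is a minimal, hedged sketch of converting a small PyTorch model to ONNX with torch.onnx.export. The model class, file name, and input shape are illustrative assumptions, not part of the Olive workflow shown later in this article.

# A minimal sketch of exporting a PyTorch model to ONNX.
# The model, file name, and input shape are illustrative assumptions.
import torch
import torch.nn as nn

class TinyClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(128, 10)

    def forward(self, x):
        return self.fc(x)

model = TinyClassifier().eval()
dummy_input = torch.randn(1, 128)   # example input used to trace the graph

torch.onnx.export(
    model,
    dummy_input,
    "tiny_classifier.onnx",          # output file in ONNX format
    input_names=["input"],
    output_names=["logits"],
    opset_version=18,                # matches the target_opset used later in Olive.config
)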
 

ONNX Runtime

ONNX Runtime is an inference framework maintained by Microsoft that loads ONNX (.onnx) files and performs model inference directly. You can use ONNX Runtime from different programming languages, and it supports different hardware acceleration environments, including CPU, NVIDIA CUDA, NVIDIA TensorRT, Intel OpenVINO, AMD ROCm, and more.
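As a hedged illustration of how such a model might be loaded and run, here is a minimal Python sketch using the onnxruntime package. The file name and input shape refer to the illustrative export above and are assumptions.

# A minimal sketch of running inference with ONNX Runtime.
import numpy as np
import onnxruntime as ort

# Prefer CUDA if the GPU package is installed, otherwise fall back to CPU.
providers = ["CUDAExecutionProvider", "CPUExecutionProvider"]
session = ort.InferenceSession("tiny_classifier.onnx", providers=providers)

# Feed one example matching the exported input shape (assumption).
inputs = {"input": np.random.randn(1, 128).astype(np.float32)}
logits = session.run(["logits"], inputs)[0]
print(logits.shape)   # (1, 10)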
 

Model Accuracy

When deploying and running inference with models, we need to consider numerical precision, because we have to make trade-offs based on the usage scenario. Commonly used precisions include half precision (FP16), single precision (FP32), and quantized INT4. Because we generally need accuracy guarantees during LLM/SLM training, training usually uses FP32 single-precision computation. For inference, FP16 half precision is generally used, which saves GPU compute while producing comparable results. If you want to further reduce GPU consumption, you can also quantize to INT4. With the Microsoft Olive configuration we can easily convert the fine-tuned model to FP32, FP16, and INT4. The following passes handle the precision conversion of our fine-tuned model.
 



        "convert": {
            "type": "OnnxConversion",
            "config": {
                "use_dynamo_exporter": true,
                "torch_dtype": "float32",
                "target_opset": 18,
                "save_as_external_data": true,
                "all_tensors_to_one_file": true
            }
        },
        "optimize_cuda": {
            "type": "OrtTransformersOptimization",
            "config": {
                "model_type": "phi",
                "use_gpu": true,
                "keep_io_types": false,
                "num_heads": 32,
                "hidden_size": 2560,
                "opt_level": 0,
                "optimization_options": {
                    "attention_op_type": "GroupQueryAttention"
                },
                "save_as_external_data": true,
                "all_tensors_to_one_file": true,
                "float16": true
            }
        },
        "blockwise_quant_int4": {
            "type": "OnnxMatMul4Quantizer",
            "config": {
                "save_as_external_data": true,
                "all_tensors_to_one_file": true,
                "block_size": 16,
                "is_symmetric": true
            }
        }


 
Run

We ran the workflow on Azure with A100 GPU compute, and we obtained consistent results with both FP16 and INT4.
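As a hedged sketch, assuming the configuration above is saved as phi2_olive.json (an illustrative file name), the Olive workflow can be launched from Python roughly like this:

# A hedged sketch: launch the Olive workflow from Python.
# "phi2_olive.json" is an illustrative file name for the configuration above.
from olive.workflows import run as olive_run

# Runs the convert -> optimize_cuda -> blockwise_quant_int4 passes defined in the config
# and writes the FP32 / FP16 / INT4 artifacts to the configured output folder.
olive_run("phi2_olive.json")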

 FP16vsFP14_a100.png

Model Evaluation
 

We have now covered model inference. At this point we need to consider a deeper question: how to evaluate the model. For SLM / LLM evaluation, we have a very complete open source tool, prompt flow. Not only can you combine prompts to see how well the model actually solves a problem, you can also measure the execution time of the model and compare performance across different hardware. This tool not only supports Azure, it can also be run locally, which works very well for evaluating models on different hardware.
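As a hedged sketch of how a batch evaluation run might be started with the prompt flow SDK, here is a minimal local example. The flow folder, data file, and column mapping are illustrative assumptions, not the article's exact setup, and the import path may differ across promptflow versions.

# A hedged sketch of a local prompt flow batch run.
from promptflow import PFClient  # import path may differ across promptflow versions

pf = PFClient()

# Run the flow once per row of the evaluation data set.
base_run = pf.run(
    flow="./my_chat_flow",                      # folder containing flow.dag.yaml (assumption)
    data="./eval_data.jsonl",                   # evaluation questions (assumption)
    column_mapping={"question": "${data.question}"},
)

# Inspect per-row outputs and timing collected by prompt flow.
details = pf.get_details(base_run)
print(details.head())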
 

The following shows my implementation, using a notebook to call prompt flow on Microsoft Azure and evaluate the fine-tuned model.

 

pf.png

Summary
 

In the era of AI 2.0, we should not only pursue applications; we also need to pay attention to the usage scenarios and usability of the model, which means supporting more edge devices. The integration of Microsoft Olive and ONNX models is a very important step in that direction. Thank you again for continuing to follow this series.
 
Journey Series Blogs 
Journey Series for Generative AI Application Architecture - Foundation (microsoft.com)
Journey Series for Generative AI Application Architecture - Fine-tune SLM with Microsoft Olive - Mic...


Resources

  1. Learn about ONNX https://onnx.ai/

  2. Learn about ONNX Runtime https://onnxruntime.ai/

  3. Learn about Prompt flow https://github.com/microsoft/promptflow
