I am a graduate student in Northwestern University's Data Science program and a developer with expertise in Azure.
For this project, I created a simple Python and Flask interface as a container for our Azure experiment.
The interface interacts with an API generated through the Azure ML Ops portal. Azure provides a great interface, reminiscent of SSIS in its simplicity: the different parts of the pipeline are represented visually, and each piece can be dragged onto a canvas and manipulated with a mouse. This is one of the important points I make in my video, as reducing the technical barriers to data science lets us focus on the work of modelling and domain knowledge.
With a tool like ML Ops, we can integrate machine learning into enterprise-level applications without having to worry about optimizing our code. We can focus on choosing the right model for the job and ensuring the data is clean, then provide the enterprise developers with an API to work with.
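As a sketch of what that hand-off can look like, here is a minimal Flask route that forwards a JSON payload to a deployed scoring endpoint. The endpoint URL, API key, and request envelope below are illustrative placeholders, not the project's real values; the actual URL and key come from the deployed service's Consume page in the portal.

```python
import os

import requests
from flask import Flask, request

app = Flask(__name__)

# Hypothetical values -- substitute the real scoring URI and key
# from the deployed Azure ML web service.
SCORING_URI = os.environ.get("SCORING_URI", "https://example.azureml.net/score")
API_KEY = os.environ.get("AML_API_KEY", "<your-key>")


def build_request(features: dict) -> dict:
    """Wrap the feature values in the envelope the scoring endpoint expects
    (this envelope shape is an assumption for illustration)."""
    return {"Inputs": {"input1": [features]}, "GlobalParameters": {}}


@app.route("/predict", methods=["POST"])
def predict():
    """Accept JSON from the front end, score it, and relay the result."""
    payload = build_request(request.get_json())
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {API_KEY}",
    }
    resp = requests.post(SCORING_URI, json=payload, headers=headers, timeout=30)
    resp.raise_for_status()
    return resp.json()
```

The enterprise developers never see the model itself; they only see a plain HTTP endpoint like `/predict`.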
Even though the application was lightweight, I decided to use Azure DevOps for source control. This kept all the changes organized and safe in the cloud. It's not something I have seen data scientists do often; even in my graduate program, code is usually kept in a shared storage space or on a local drive. Developers discovered a long time ago that source control is vital to the success of any large project. Even if you are working on a single-page R Shiny application, I recommend getting comfortable with Git or another source control technology. The benefit is all the pain you avoid in the future.
One service I did not use in my project but spoke about in my presentation is Azure Data Factory.
With this tool we could automate our data pipeline. For example, if we wanted to deploy our project to a production environment, we would need a way to update our data regularly. We could use ADF to call an API on a schedule and refresh our data, then land the data in Azure Blob Storage and access it from our ML Ops portal. Azure Data Factory would also let us trigger the experiment or notebook we created to update our model results.
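The steps ADF would schedule can be sketched in plain Python; the source URL, blob naming scheme, and the injected `upload` callable below are illustrative stand-ins, not real Data Factory APIs, since ADF itself is configured in the portal rather than coded.

```python
import urllib.request
from datetime import date


def blob_name_for(day: date) -> str:
    """Partition the landing zone by date so each scheduled run
    writes a fresh blob instead of overwriting history."""
    return f"raw/{day:%Y/%m/%d}/events.json"


def fetch_source_data(url: str) -> bytes:
    """Step 1 (the copy activity's source): pull the latest records."""
    with urllib.request.urlopen(url, timeout=60) as resp:
        return resp.read()


def refresh(url: str, upload) -> str:
    """Step 2 (the copy activity's sink): write the payload to a dated blob.
    `upload(name, data)` stands in for the Blob Storage client call."""
    name = blob_name_for(date.today())
    upload(name, fetch_source_data(url))
    return name
```

In production, ADF's schedule trigger plays the role of the cron job that would otherwise call `refresh`.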
All these tools working together helped me build something resilient in a short span of time. The project also gave me a good understanding of the HL7 data used in the medical industry. Writing Python scripts to extract data from the documents was enlightening: the data is stored in documents and is non-relational, so the scripts had to flatten it before we could correlate events to drugs. For example, a patient goes to the doctor for a check-up and is prescribed several drugs. Each of those events exists as a separate item on the patient's medical record, so they must be tied together after normalizing.
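A minimal sketch of that flattening step, using made-up HL7 v2-style segments: the field positions assumed here (PID-3 for the patient identifier, RXE-2 for the drug code) follow the standard's layout, but a production script should lean on a proper parsing library rather than raw string splitting.

```python
def parse_segments(message: str) -> list:
    """Split an HL7 v2 message into (segment_id, fields) tuples.
    HL7 v2 separates segments by newlines and fields by pipes."""
    segments = []
    for line in message.strip().splitlines():
        fields = line.split("|")
        segments.append((fields[0], fields))
    return segments


def flatten_prescriptions(message: str) -> list:
    """Tie each RXE (prescription) segment back to the patient in the
    preceding PID segment, producing flat rows for analysis."""
    patient_id = None
    rows = []
    for seg_id, fields in parse_segments(message):
        if seg_id == "PID" and len(fields) > 3:
            patient_id = fields[3]          # PID-3: patient identifier
        elif seg_id == "RXE" and len(fields) > 2 and patient_id is not None:
            rows.append({"patient_id": patient_id,
                         "drug": fields[2]})  # RXE-2: give code
    return rows
```

The flat rows can then be joined against visit events to correlate prescriptions with diagnoses.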
The project also helped me understand some of the issues with applying classification algorithms to medical data, specifically pharmaceuticals. Some drugs cannot be properly classified because of how broadly they are used. Painkillers, for instance, treat a wide range of symptoms and illnesses. An experiment that ties drugs to illnesses in a one-to-one comparison would have very low accuracy for drugs like these but would work well for specialty drugs.
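To make the ceiling concrete, here is a toy calculation over made-up prescription records. Because a one-to-one mapping can predict only one illness per drug, the best it can ever do is guess each drug's most common illness, which caps accuracy for broad drugs like ibuprofen.

```python
from collections import Counter

# Hypothetical (drug, illness actually treated) records.
records = [
    ("ibuprofen", "arthritis"),
    ("ibuprofen", "migraine"),
    ("ibuprofen", "back pain"),
    ("metformin", "type 2 diabetes"),
    ("metformin", "type 2 diabetes"),
    ("levothyroxine", "hypothyroidism"),
]


def best_one_to_one_accuracy(pairs):
    """Upper bound on accuracy for any one-to-one drug-to-illness map:
    assign each drug its most frequent illness and count the hits."""
    by_drug = {}
    for drug, illness in pairs:
        by_drug.setdefault(drug, Counter())[illness] += 1
    correct = sum(c.most_common(1)[0][1] for c in by_drug.values())
    return correct / len(pairs)
```

On these records the bound is 4/6: ibuprofen contributes only one correct prediction out of its three uses, while the specialty drugs are predicted perfectly.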