Blog Post

AI - AI Platform Blog
5 MIN READ

Use generative AI to extract structured data out of emails

AzadehNia's avatar
AzadehNia
Icon for Microsoft rankMicrosoft
Feb 13, 2025

Avoiding custom code to extract tabular data of variable format from emails, let AI do it for you!

One thing we regularly hear from clients is that they receive information that are key to their business such as order requests via email in an unstructured format and sometimes there are structured information within the body of those emails in a variety of table formats.

In today’s fast-paced digital world, businesses need a way to automatically extract, structure, and integrate this information into their existing applications. Whether it’s leveraging AI-powered document processing, natural language processing (NLP), or intelligent automation, the right approach can transform email-based orders into structured, actionable data.

In this blog, we’ll explore one such scenario where AI can be leveraged to extract information in tabular format that has been provided within an email. The emails contextually belong to a specific domain, but the tables are not with consistent headers or shapes. Sometimes in the body of one email there could be multiple tables.

 

The problem Statement

Extract tabular information with varying table formats from emails

 

The typical approach to this problem involves rule-based processing, where individual tables are extracted and merged based on predefined logic. However, given the variety of email formats from hundreds or even thousands of different senders, maintaining such rule-based logic becomes increasingly complex and difficult to manage. A more optimal solution is leveraging the cognitive capabilities of generative AI, which can dynamically adapt to different table structures, column names, and formatting variations—eliminating the need for constant rule updates while improving accuracy and scalability.

 

To create this sample code, I used below email with test data, with two tables with inconsistent column names. It is going to provide some upcoming trainings information. Please note the difference between the column headers:

Hi there,

 

Regarding the upcoming trainings, this is the list:

 

Event Date

Description of Event

Length

Grade

2025-01-21

Digital environments

20 hours

5

2025-03-01

AI for Industry A

10 hours

3

 

and some further events in below list

 

Date

Subject

Duration

Grade

2025-01-21

Digital environments 2

2 days

1

2025-03-01

AI for Industry B

2 weeks

4

 

These sessions are designed to be interactive and informative, so your timely participation is crucial. Please make sure to log in or arrive on time to avoid missing key insights.

If you have any questions or need assistance, feel free to reach out. Looking forward to seeing you there!

Thanks,

Azadeh

These are the two tables within the email, and we need to extract one consistent table format with all the rows from these two tables.

Table 1

Event Date

Description of Event

Length

Grade

2025-01-21

Digital environments

20 hours

5

2025-03-01

AI for Industry A

10 hours

3

 

Table 2

Date

Subject

Duration

Grade

2025-01-21

Digital environments 2

2 days

1

2025-03-01

AI for Industry B

2 weeks

4

 

To extract the tabular data into one single table in json format, I am using python with below libraries installed in my environment:

pandas beautifulsoup4 openai lxml

 

The Code

I use azure OpenAI service with a gpt 4o deployment.

Below code is just one way of solving this type of problem and can be customized or improved to fit to other similar problems. I have provided some guidelines about merging the tables and column names similarity in the user prompt.

This sample code is using an email message that is saved in 'eml' format in a local path, but the email library has other capabilities to help you connect to a mailbox and get the emails.

import email
import pandas as pd
from bs4 import BeautifulSoup
import os  
from openai import AzureOpenAI  

endpoint = os.getenv("ENDPOINT_URL", "https://....myendpointurl....openai.azure.com/")  
deployment = os.getenv("DEPLOYMENT_NAME", "gpt-4o")  
subscription_key = os.getenv("AZURE_OPENAI_API_KEY", "myapikey)  

# Initialize Azure OpenAI Service client with key-based authentication    
client = AzureOpenAI(  
    azure_endpoint=endpoint,  
    api_key=subscription_key,  
    api_version="2024-05-01-preview",
)
  
# Process email content with GPT-4
def extract_information(email_body, client):
    soup = BeautifulSoup(email_body, "html.parser")
    body = soup.get_text()
    print(body)
    #Prepare the chat prompt 
    chat_prompt = [
        {
            "role": "system",
            "content": [
                {
                    "type": "text",
                    "text": "You are an AI assistant that is expert in extracting structured data from emails."
                }
            ]
        },
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": f"Extract the required information from the following email and format it as JSON and consolidate the tables using the common column names. For example the columns length and duration are the same and the columns Event and Subject are the same:\n\n{body}"
                }
            ]
        }
    ] 
        
    messages = chat_prompt  
        
    # Generate the completion  
    completion = client.chat.completions.create(  
        model=deployment,
        messages=messages,
        max_tokens=800,  
        temperature=0.1,  
        top_p=0.95,  
        frequency_penalty=0,  
        presence_penalty=0,
        stop=None,  
        stream=False
    )
    return completion.choices[0].message.content

email_file_name = r'...path to your file....\Test Email with Tables.eml'
with open(email_file_name, "r") as f:
    msg = email.message_from_file(f)

email_body = ""
for part in msg.walk():
    if part.get_content_type() == "text/plain":
        email_body = part.get_payload(decode=True).decode()
    elif part.get_content_type() == "text/html":
        email_body = part.get_payload(decode=True).decode()

extracted_info = extract_information(email_body, client)
print(extracted_info)

 

The output is:

```
[
    {
        "Event": "Digital environments",
        "Date": "2025-01-21",
        "Length": "20 hours",
        "Grade": 5
    },
    {
        "Event": "AI for Industry A",
        "Date": "2025-03-01",
        "Length": "10 hours",
        "Grade": 3
    },
    {
        "Event": "Digital environments 2",
        "Date": "2025-01-21",
        "Length": "2 days",
        "Grade": 1
    },
    {
        "Event": "AI for Industry B",
        "Date": "2025-03-01",
        "Length": "2 weeks",
        "Grade": 4
    }
]
```

 

Key points in the code:

  • Read an email and extract the body
  • Use a gen AI model with the right instructions prompt to complete the task
  • Gen AI will follow the instructions and create a combined consistent table
  • Get the output in the right format, e.g. 'json'

 

I hope you find this blog post helpful, and you can apply it to your use case/domain.

Or you can simply get the idea of how to use generative AI to solve a problem, instead of building layers of custom logic.

 

 

Updated Feb 13, 2025
Version 8.0