A Mutable Log

A blog by Devendra Tewari



Extract all tabular data from multipart MIME documents

Emails and Microsoft’s legacy MHTML files are multipart MIME documents. This post shows how to extract tabular data, in the form of HTML tables, from such documents using Python 3 in a Jupyter Notebook.
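To get a feel for what "multipart" means here, the following minimal sketch builds a small message in memory with the standard library and lists the content type of every part. The subject and bodies are made up for illustration; a real MHTML file typically uses a multipart/related container rather than the multipart/alternative one created here.

```python
from email.message import EmailMessage

# Build a small multipart message in memory (hypothetical content, for illustration only).
message = EmailMessage()
message["Subject"] = "Report"
message.set_content("Plain-text body")
message.add_alternative(
    "<html><body><table><tr><td>1</td></tr></table></body></html>",
    subtype="html",
)

# walk() yields the container first, then each nested part in document order.
content_types = [part.get_content_type() for part in message.walk()]
print(content_types)
```

Only the parts whose content type is text/html are of interest when looking for tables.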

To begin, install BeautifulSoup4 and html-table-extractor using pip:

pip install BeautifulSoup4 html-table-extractor

The following code opens an MHTML file, walks through all of its parts, and uses BeautifulSoup4 to parse each part whose content type is text/html. It then iterates over the tables that are direct children of the body, parses each one with html_table_extractor, and prints the result as a list of rows.

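import email

from bs4 import BeautifulSoup
from html_table_extractor.extractor import Extractor

with open("file.mht") as fp:
    message = email.message_from_file(fp)

for part in message.walk():
    if part.get_content_type() == "text/html":
        # decode=True undoes quoted-printable/base64 transfer encoding and returns bytes
        html = part.get_payload(decode=True)
        soup = BeautifulSoup(html, "html.parser", from_encoding=part.get_content_charset())
        if soup.body is None:
            continue  # HTML fragment without a <body> element
        # recursive=False limits the search to tables that are direct children of <body>
        for table in soup.body.find_all("table", recursive=False):
            extractor = Extractor(table)
            extractor.parse()
            print(extractor.return_list())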
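One thing worth checking in practice is the transfer encoding of the HTML parts. MHTML files are often saved with quoted-printable or base64 encoded parts, and get_payload only undoes that encoding when called with decode=True. The sketch below uses a hypothetical single-part message, written inline for illustration, to show the difference using only the standard library.

```python
import email

# Hypothetical single-part message with a quoted-printable HTML body.
raw = (
    "MIME-Version: 1.0\n"
    "Content-Type: text/html; charset=\"utf-8\"\n"
    "Content-Transfer-Encoding: quoted-printable\n"
    "\n"
    "<table border=3D\"1\"><tr><td>caf=C3=A9</td></tr></table>\n"
)
part = email.message_from_string(raw)

# decode=False returns the payload as stored, still quoted-printable encoded.
encoded = part.get_payload(decode=False)

# decode=True removes the transfer encoding and returns bytes,
# which are then decoded using the charset declared by the part.
decoded = part.get_payload(decode=True)
text = decoded.decode(part.get_content_charset() or "utf-8")

print(encoded)
print(text)
```

In the encoded form, sequences such as =3D end up inside attribute values and cell text, so passing the decoded payload to the HTML parser is likely the safer choice.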