A blog by Devendra Tewari
Emails and Microsoft’s legacy MHTML are multipart mime documents. The following post shows how tabular data in the form of HTML tables can be extracted from such documents using Jupyter Notebook and Python 3.
To begin with, install BeautifulSoup4 and html-table-extractor, using pip
pip install BeautifulSoup4 html-table-extractor
The following code opens an MHTML file, walks through all the parts in the file, uses BeautifulSoup4 to parse parts that have content type text/html
, iterates through all the tables in the body, parses each table using html_table_extractor, and prints it out.
import email
from bs4 import BeautifulSoup
from html_table_extractor.extractor import Extractor
with open("file.mht") as fp:
message = email.message_from_file(fp)
for part in message.walk():
if (part.get_content_type() == "text/html"):
soup = BeautifulSoup(part.get_payload(decode=False))
for table in soup.body.find_all("table", recursive=False):
extractor = Extractor(table)
extractor.parse()
print(extractor.return_list())