Retrieving 3D Compound Structures from PubChem using Python

Faris Izzatur Rahman
4 min read · Mar 22, 2023


Introduction

PubChem is an open chemistry database that provides information on chemical substances, their biological activities, and chemical structures. In the field of bioinformatics, researchers often work with molecular structures to study their interactions, properties, and behavior. The ability to obtain 3D structures of compounds programmatically can save time and streamline research processes.

In this blog post, we’ll walk through a Python script that reads a CSV file containing compound names, searches for their corresponding PubChem Compound ID (CID), and downloads their 3D structures in Structure Data Format (SDF) using the PubChemPy and Requests libraries.

For the full PubChemPy documentation, see the project's documentation site.

Getting Started

To begin, make sure you have the following Python libraries installed:

  • pandas
  • pubchempy
  • requests

The main components of our script are two functions, search_pubchem and download_sdf. The search_pubchem function searches for a compound's CID using the PubChemPy library, while the download_sdf function downloads the compound's 3D structure in SDF format using the Requests library.

Searching for Compounds in PubChem:

The search_pubchem function takes a single argument, compound, which is the name of the chemical compound to look up, and returns the corresponding CID.

The function uses the pubchempy library, which provides a Python interface to PubChem's web services. It calls the library's get_compounds function (pubchempy is conventionally imported as pcp), passing in the compound name as well as the search type (name) and the record type (3d), which specifies that we want 3D information about the compound.

If the search is successful, the function returns the PubChem Compound ID (cid) of the first result. If the search fails with either an IndexError (meaning no results were found) or a PubChemHTTPError (meaning PubChem responded with an HTTP error), the function prints an error message and returns None.
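To make the description concrete, here is a minimal sketch of what search_pubchem could look like. It follows the behavior described above, but the exact error messages are assumptions, and the import is placed inside the function purely for illustration so that defining the function does not require pubchempy to be installed. A module-level import works just as well.

```python
def search_pubchem(compound):
    """Return the CID of the first 3D record matching a compound name, or None."""
    # Assumption: imported lazily for this sketch; a top-of-file
    # "import pubchempy as pcp" is the more common layout.
    import pubchempy as pcp

    try:
        # "name" is the search namespace; record_type="3d" asks PubChem
        # for the 3D record of the compound.
        results = pcp.get_compounds(compound, "name", record_type="3d")
        # An empty result list raises IndexError here.
        return results[0].cid
    except IndexError:
        print(f"No 3D record found for {compound!r}")
    except pcp.PubChemHTTPError as exc:
        print(f"PubChem request failed for {compound!r}: {exc}")
    return None
```

Calling search_pubchem("aspirin") returns an integer CID when PubChem has a 3D record for that name, and None otherwise.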

Downloading the 3D Structure in SDF Format:

The download_sdf function takes a single argument, cid, which is the PubChem Compound ID of the compound to download, and returns the compound's 3D structure in SDF format as a string.

The function constructs the URL for the PubChem REST API and uses the requests library to send an HTTP GET request for the SDF data of the compound with the given cid. If the request is successful (i.e., if the HTTP status code is in the 200 range), the function returns the text of the SDF file. If the request fails with either a requests.exceptions.RequestException (meaning there was an error with the HTTP request) or a PubChemHTTPError (meaning there was an error with the PubChem REST API), the function prints an error message and returns None.
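As an illustration, download_sdf could be sketched as follows. The URL pattern is an assumption based on the public PUG REST conventions (the post does not show the exact URL), and the timeout parameter and the error message are hypothetical additions. This sketch catches only requests.exceptions.RequestException, which also covers the HTTPError raised by raise_for_status() for 4xx and 5xx responses.

```python
# Assumption: base URL of the PubChem PUG REST service.
PUG_REST = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def download_sdf(cid, timeout=30):
    """Return the 3D SDF record for a CID as a string, or None on failure."""
    # Assumption: imported lazily for this sketch; a top-of-file
    # "import requests" is the more common layout.
    import requests

    url = f"{PUG_REST}/compound/cid/{cid}/record/SDF"
    try:
        # record_type=3d requests the 3D conformer instead of the 2D record.
        response = requests.get(url, params={"record_type": "3d"}, timeout=timeout)
        # Turn 4xx/5xx responses into an exception.
        response.raise_for_status()
        return response.text
    except requests.exceptions.RequestException as exc:
        print(f"Could not download SDF for CID {cid}: {exc}")
        return None
```

The returned string can be written straight to a .sdf file or passed to a cheminformatics toolkit for parsing.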

Main Function:

The main function of the script performs the following steps:

  1. Read the CSV file containing compound names into a pandas DataFrame.
  2. Iterate through the DataFrame, search for each compound in PubChem, and store the resulting CIDs in a new column.
  3. Iterate through the DataFrame again, download the 3D structure in SDF format for each compound with a valid CID, and save the SDF files with the compound names.

The script handles errors gracefully, printing informative messages for any issues that may arise during the process.
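The three steps above can be sketched roughly like this. The function name process_compounds, the column name "compound", and the out_dir parameter are hypothetical; the search and download steps are passed in as callables so the sketch stays self-contained, and compound names are assumed to be valid file names.

```python
from pathlib import Path

import pandas as pd


def process_compounds(csv_path, search, download, out_dir=".", name_column="compound"):
    """Look up a CID for every name in the CSV and save each 3D SDF file."""
    # Step 1: read the compound names into a DataFrame.
    df = pd.read_csv(csv_path)

    # Step 2: search for each compound and store the CID (or None) in a new column.
    df["cid"] = [search(name) for name in df[name_column]]

    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)

    # Step 3: download and save the SDF for every compound with a valid CID.
    for name, cid in zip(df[name_column], df["cid"]):
        if pd.isna(cid):
            print(f"Skipping {name}: no CID found")
            continue
        # Missing values turn the column into floats, so cast back to int.
        sdf = download(int(cid))
        if sdf is None:
            continue
        (out_path / f"{name}.sdf").write_text(sdf)
    return df
```

With the two functions described earlier, a call such as process_compounds("compounds.csv", search_pubchem, download_sdf) would write one .sdf file per compound that was found.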

Conclusion:

In this blog post, we’ve demonstrated a simple Python script that can streamline the process of retrieving 3D compound structures from PubChem. By leveraging the capabilities of the PubChemPy and Requests libraries, the script reads a list of compound names from a CSV file, searches for their corresponding CIDs, and downloads their 3D structures in SDF format. This approach can be easily integrated into other bioinformatics workflows, enhancing research efficiency and productivity.

Remember to adapt the script to your specific use case, such as changing the input file name or adding any additional processing steps. With these tools in hand, you’re well-equipped to explore the vast world of chemical structures and their biological implications.

For the full code, you can check it here
