How To Extracting Text From PDFs Using Google Apps Script

How To Extracting Text From PDFs Using Google Apps Script

Learn how to use Google Apps Script to automatically extract text from PDF files. This beginner’s guide covers text conversion from PDFs, information extraction with RegEx, and data exporting to Google Sheets. Use this effective automation strategy to improve your document management productivity.

Any organization may find it difficult to manage invoices and receipts. By producing paper receipts for consumers, an external accounting system attempts to streamline this procedure. These receipts are then conveniently converted to PDF files and posted to a designated folder in Google Drive. The hard part is getting precise data out of these PDF invoices and into a Google Spreadsheet, including the invoice number, invoice date, and buyer’s email address. Our PDF extractor script is capable of handling the task and will lead you through it, which is fortunate.

Convert PDF to Text:

Let’s start by utilizing the following function to convert the PDF to text. Make sure the Advanced Drive API is configured as this guide instructs.

const convertPDFToText = (fileId, language) => {
  fileId = fileId || '18FaqtRcgCozTi0IyQFQbIvdgqaO_UpjW'; // Sample PDF file
  language = language || 'en'; // English

  // Read the PDF file in Google Drive
  const pdfDocument = DriveApp.getFileById(fileId);

  // Use OCR to convert PDF to a temporary Google Document
  // Restrict the response to include file Id and Title fields only
  const { id, title } = Drive.Files.insert(
      title: pdfDocument.getName().replace(/\.pdf$/, ''),
      mimeType: pdfDocument.getMimeType() || 'application/pdf',
      ocr: true,
      ocrLanguage: language,
      fields: 'id,title',

  // Use the Document API to extract text from the Google Document
  const textContent = DocumentApp.openById(id).getBody().getText();

  // Delete the temporary Google Document since it is no longer needed

  // (optional) Save the text content to another text file in Google Drive
  const textFile = DriveApp.createFile(`${title}.txt`, textContent, 'text/plain');
  return textContent;
Extract Information from Text:

We can now extract the necessary data using RegEx patterns since we know the text content of the PDF. The text elements that are pertinent are “Invoice Date” and “Invoice Number.” The function listed below can assist us in achieving this:

const extractInformationFromPDFText = (textContent) => {
  const pattern = /Invoice\sDate\s(.+?)\sInvoice\sNumber\s(.+?)\s/;
  const matches = textContent.replace(/\n/g, ' ').match(pattern) || [];
  const [, invoiceDate, invoiceNumber] = matches;
  return { invoiceDate, invoiceNumber };

Please be aware that you might need to adjust the RegEx pattern if your PDF has a specific structure.

Save Information to Google Sheet:

Using the Google Sheets API, we can now quickly store the information we’ve extracted to a Google Sheet. The function that will handle this task is as follows:

const writeToGoogleSheet = ({ invoiceDate, invoiceNumber }) => {
  const spreadsheetId = '<<Google Spreadsheet ID>>';
  const sheetName = '<<Sheet Name>>';
  const sheet = SpreadsheetApp.openById(spreadsheetId).getSheetByName(sheetName);
  if (sheet.getLastRow() === 0) {
    sheet.appendRow(['Invoice Date', 'Invoice Number']);
  sheet.appendRow([invoiceDate, invoiceNumber]);

There you have it, then! You may easily extract and save the relevant data from your PDF invoices into a Google Sheet by using these features.

You might think about using commercial APIs that use machine learning to assess document layouts and extract certain information at scale in the case of more complicated PDF structures. Amazon Textract, Adobe’s Extract API, and Google’s Vision AI are three notable web services for extracting PDF data that all provide substantial free tiers for small-scale use.

You may effectively manage and organize your bills with the help of these tools, which will enhance your workflow and overall efficiency. Happy bookkeeping!


Leave a Comment