1 min read, from Microsoft Excel | Help & Support with your Formula, Macro, and VBA problems | A Reddit Community

Extract data from Power Query

Our take

Extracting specific data from PDFs, especially when dealing with invoices, can be a challenging task. Power Query (PQ) offers powerful tools, but it’s crucial to navigate its features effectively to avoid unnecessary complications. If you're encountering difficulties in isolating the numbers you need, rest assured that with the right approach, you can streamline the process.

The struggle described in this post will ring familiar to anyone who has tried to wrangle unstructured PDF data into something usable in a spreadsheet. Power Query is exceptionally good at transforming data that already has structure—rows, columns, consistent formatting—but PDFs, particularly invoice-style documents, often lack the predictable architecture that makes automation straightforward. When the tool attempts to convert a visual document into a table, it makes assumptions about layout that frequently result in the "nonsense sheets" described here. This isn't a failure of the user's skills; it's a fundamental mismatch between the tool's strengths and the document's format.

The core issue is that PDFs preserve visual presentation rather than data relationships. A number that appears in the same position on every page of an invoice is, from the PDF's perspective, just text sitting in a particular coordinate space. Power Query, meanwhile, is looking for patterns that suggest tabular data: repeating delimiters, consistent column breaks, or recognizable data types. When it doesn't find those patterns, it guesses, and those guesses often miss the mark entirely. For users facing this challenge, the solution typically involves extracting the raw text from the PDF first, then using Power Query to clean and transform that text. This two-step approach acknowledges that Power Query excels at transformation but not at initial extraction from unstructured formats. Related discussions in our community highlight this exact pain point: readers exploring "Tools for exporting data from PDF to Excel" and "Trying to automate extracting info from PDFs into a table with PowerQuery but they're somehow not structured the same and it's messing up" describe similar frustrations with inconsistent document structures.
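As a rough illustration of the two-step idea, the Python sketch below takes text that has already been pulled out of the PDF by some external tool and reshapes it into page/line rows that Power Query could load through its Text/CSV import. It assumes the extractor separates pages with a form-feed character; the function names `pages_to_rows` and `rows_to_csv` are hypothetical, and the original post does not say which extractor is in use.

```python
import csv
import io


def pages_to_rows(raw_text: str) -> list[tuple[int, int, str]]:
    """Turn extracted PDF text into (page, line, text) rows.

    Assumption: pages are separated by a form feed ("\f").
    """
    rows = []
    for page_no, page in enumerate(raw_text.split("\f"), start=1):
        for line_no, line in enumerate(page.splitlines(), start=1):
            # Keep the original line number so a value that always sits
            # on the same line can be filtered by position later.
            if line.strip():
                rows.append((page_no, line_no, line.strip()))
    return rows


def rows_to_csv(rows: list[tuple[int, int, str]]) -> str:
    """Serialize the rows as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["page", "line", "text"])
    writer.writerows(rows)
    return buffer.getvalue()
```

With the text in this shape, keeping one value per invoice becomes an ordinary filter on the `line` column rather than a fight with the PDF layout.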

That said, the user's instinct to automate this workflow is exactly right. Manually extracting numbers from hundreds of invoice pages would be unsustainable, and the fact that the target numbers appear in consistent positions suggests the data is amenable to automation—the challenge is finding the right toolchain to access it. The good news is that once the raw text is extracted, Power Query becomes remarkably powerful for filtering out everything except the specific values needed. The key is treating PDF extraction and data transformation as separate problems rather than expecting a single tool to handle both.
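To make "filtering out everything except the specific values needed" concrete, here is a minimal Python sketch that keeps one number per page and discards the rest. It assumes each page's text is already available as a string and that the target number follows a label; the label "Total due:" and the name `extract_totals` are made up for illustration, since the post does not describe the invoice layout.

```python
import re
from decimal import Decimal
from typing import Optional

# Hypothetical label: replace with whatever text sits next to the
# number on the real invoices.
TOTAL_PATTERN = re.compile(r"Total due:\s*([\d,]+\.\d{2})")


def extract_totals(pages: list[str]) -> list[Optional[Decimal]]:
    """Return one amount per page, or None where no amount is found."""
    totals: list[Optional[Decimal]] = []
    for page in pages:
        match = TOTAL_PATTERN.search(page)
        if match is None:
            # Keep a placeholder so results stay aligned with page order.
            totals.append(None)
        else:
            # Drop thousands separators before converting.
            totals.append(Decimal(match.group(1).replace(",", "")))
    return totals
```

Recording `None` for pages without a match, instead of skipping them, makes it easier to spot invoices whose layout differs from the rest.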

This scenario points to a broader tension in how we think about spreadsheet tools. Users increasingly expect their data platforms to handle end-to-end workflows, from ingestion to analysis, but the reality is that different stages often require different tools. Power Query is exceptionally capable within its domain, but it's not a PDF parser by design. As more workflows involve extracting data from documents like invoices, receipts, and forms, the gap between what users expect their spreadsheet software to do and what it was originally built for continues to widen. The future of data management will likely involve tighter integration between document extraction and data transformation—but for now, understanding these tool boundaries is essential for anyone looking to build reliable, automated workflows.

Hi, I've been fighting with powerquery (pq) bc I need to extract specific numbers from a pdf, it has hundreds of pages and the numbers are always at the same spot, but they're not spreadsheets, they're invoices.

I've tried pq but it makes nonsense sheets trying to convert the text to a normal sheet, but I can't find how to keep just the number I need and toss the rest of the info

submitted by /u/sbeveguy