Extract data from Power Query
Our take
The struggle described in this post will ring familiar to anyone who has tried to wrangle unstructured PDF data into something usable in a spreadsheet. Power Query is exceptionally good at transforming data that already has structure—rows, columns, consistent formatting—but PDFs, particularly invoice-style documents, often lack the predictable architecture that makes automation straightforward. When the tool attempts to convert a visual document into a table, it makes assumptions about layout that frequently result in the "nonsense sheets" described here. This isn't a failure of the user's skills; it's a fundamental mismatch between the tool's strengths and the document's format.
The core issue is that PDFs preserve visual presentation rather than data relationships. A number that appears in the same position on every page of an invoice is, from the PDF's perspective, just text sitting in a particular coordinate space. Power Query, meanwhile, is looking for patterns that suggest tabular data—repeating delimiters, consistent column breaks, or recognizable data types. When it doesn't find those patterns, it guesses, and those guesses often miss the mark entirely. For users facing this challenge, the solution typically involves extracting the raw text from the PDF first, then using Power Query to clean and transform that text. This two-step approach acknowledges that Power Query excels at transformation but not at initial extraction from unstructured formats. Related discussions in our community highlight this exact pain point, with readers exploring Tools for exporting data from PDF to Excel and Trying to automate extracting info from PDFs into a table with PowerQuery but they're somehow not structured the same and it's messing up reflecting similar frustrations with inconsistent document structures.
That said, the user's instinct to automate this workflow is exactly right. Manually extracting numbers from hundreds of invoice pages would be unsustainable, and the fact that the target numbers appear in consistent positions suggests the data is amenable to automation—the challenge is finding the right toolchain to access it. The good news is that once the raw text is extracted, Power Query becomes remarkably powerful for filtering out everything except the specific values needed. The key is treating PDF extraction and data transformation as separate problems rather than expecting a single tool to handle both.
This scenario points to a broader tension in how we think about spreadsheet tools. Users increasingly expect their data platforms to handle end-to-end workflows, from ingestion to analysis, but the reality is that different stages often require different tools. Power Query is exceptionally capable within its domain, but it's not a PDF parser by design. As more workflows involve extracting data from documents like invoices, receipts, and forms, the gap between what users expect their spreadsheet software to do and what it was originally built for continues to widen. The future of data management will likely involve tighter integration between document extraction and data transformation—but for now, understanding these tool boundaries is essential for anyone looking to build reliable, automated workflows.
Hi, I've been fighting with powerquery (pq) bc I need to extract specific numbers from a pdf, it has hundreds of pages and the numbers are always at the same spot, but they're not spreadsheets, they're invoices.
I've tried pq but it makes nonsense sheets trying to convert the text to a normal sheet, but I can't find how to keep just the number I need and toss the rest of the info
[link] [comments]
Read on the original site
Open the publisher's page for the full experience
Related Articles
- Tools for exporting data from PDF to ExcelHi everyone! I started a new job a few weeks ago and a big part of my role involves extracting data from numerous PDFs (e.g., invoice numbers, amounts, total packages, etc.) and entering them into a massive Excel master file. This file acts as a registry and the foundation for other documents. I’m looking for something that saves me from doing 'copy-paste' all day, hundreds of times over. Browsing this group, I noticed some people suggest Power Query for similar tasks, but I’m not familiar with it and would have to learn it from scratch. Does anyone have any tools to recommend, perhaps something more user-friendly than Power Query? submitted by /u/BomboGanoush [link] [comments]
- Trying to automate extracting info from PDFs into a table with PowerQuery but they're somehow not structured the same and it's messing up.I thought since the PDFs looked like they were the same format (they're documents from a government agency), they would produce the same results if I ran them through PowerQuery. Somehow, they don't. I need three pieces of data from each file. Somehow they all end up on different columns despite looking identical. I've tried my best to make it fit but the moment I try to remove extraneous columns, the same error pops up because one of the file doesn't have a specific numbered column. It's so frustrating. I don't even need it to look nice, I just need the info in a list for convenience. Is there anything I can do to make it work? submitted by /u/DoctorKrakens [link] [comments]
- I want to use Power Query to import data received from a client, where the file name changes each month. What's the easiest way to automate this?I used to use VBA for this, but that's a lot more roundabout, and I have a lot less control over the transformation. I have no issues with transforming the actual data itself. My issue lies in the fact that it's a different file each month. Using wildcard formatting, *filehere*.xls* would always pull the correct file. This file is also stored in the same place relative to my spreadsheet each time, but the location of the spreadsheet and folders itself changes each month. In VBA, I could find the relative position quite easily via ThisWorkbook.Path & "\Data\" However, I don't know how to use PQ to import automatically like this, so that I'd always import the correct data simply by refreshing links. I think I've seen people set up a somewhat hacky way, where PQ first reads a table in the workbook to retrieve values, and then uses those to find the file to query. Is that the only way? submitted by /u/space_reserved [link] [comments]
- Get more from Power Query in Excel with these little-known capabilitiesIf you use Excel or Power BI to work with data, you’ve likely encountered Power Query - and may already rely on it regularly across desktop and web. It pulls data together, cleans it up, and prepares it for analysis. But beyond those familiar tasks, Power Query includes several capabilities that can make solutions more flexible, scalable, and easier to maintain. Get more from Power Query in Excel with these little-known capabilities submitted by /u/beyphy [link] [comments]