What would you do with this task, and how long would it take you to do it?
I'm going to describe a situation as specifically as I can. I am curious what people would do in this situation; I worry that I overcomplicate things for myself. I'm describing the whole task as it was described to me, and then as I discovered it.
Ultimately, I'm here to ask you, what do you do, and how long does it take you to do it?
I started a new role this month. I am new to advertising modeling methods like MMM, so I am reading a lot about how to apply MMM-specific methods in R and Python. I use VS Code; I don't have a GitHub Copilot license, but I get access to Copilot through our Windows/Office license. Although this task did not involve modeling, I do want to ask about that kind of task another day if this goes over well.
The task
5 Excel workbooks are provided. You are told that this is a client's data that was given to another party for some other analysis and augmentation. This is a quality assurance task. The previous process was as follows:
The data
- the data structure: 1 workbook per industry for 5 industries
- 4 workbooks had 1 tab, 1 workbook had 3 tabs
- each tab had a table with a daily date column, 2 categorical columns (advertising_partner, line_of_business), and at least 2 numeric columns per workbook
- sometimes data is updated on our side and the partner has to re-download the data, reprocess it, and share it again
The process
- this is done once per client, per quarter (but it's just this client for now)
- open each workbook
- navigate to each tab
- the data is in a "controllable" summary table
- the summary table header looks like `bing | bing` over `home | home` over `impressions | spend`, with a partner dropdown and a line-of-business dropdown; the bing and home values are controlled with dropdown toggles, each offering a combination of 3-4 categories
- compare with data that is downloaded from a Tableau dashboard
- end state: a comparison of the metrics in Tableau to the Excel tables to ensure that "the numbers are the same"
- the categories presented map 1 to 1 with the data you have downloaded from Tableau
- aggregate the data in a pivot table, select the matching categories, and make sure the values match (a sketch of this comparison step in R follows this list)
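If both sides can be reduced to tidy tables with matching category columns, the comparison itself is just a join plus a filter. Here is a minimal sketch in R; all column names and values are hypothetical stand-ins, not the client's actual fields:

```r
library(dplyr)

# Toy stand-ins for the two sources; in practice the Excel side comes from the
# extraction script and the Tableau side from the dashboard download.
excel_tidy <- tibble::tribble(
  ~partner, ~line_of_business, ~impressions, ~spend,
  "bing",   "home",            100,          20,
  "bing",   "auto",            250,          55
)
tableau_tidy <- tibble::tribble(
  ~partner, ~line_of_business, ~impressions, ~spend,
  "bing",   "home",            100,          20,
  "bing",   "auto",            240,          55
)

# Aggregate both sides the same way, then keep only the category combinations
# that disagree (or exist on one side only); those rows are the QA findings.
summarise_side <- function(df) {
  df |>
    group_by(partner, line_of_business) |>
    summarise(across(c(impressions, spend), sum), .groups = "drop")
}

mismatches <- full_join(
  summarise_side(excel_tidy), summarise_side(tableau_tidy),
  by = c("partner", "line_of_business"),
  suffix = c("_excel", "_tableau")
) |>
  filter(
    is.na(impressions_excel) | is.na(impressions_tableau) |
      impressions_excel != impressions_tableau |
      spend_excel != spend_tableau
  )
```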
Additional info about the files
- the summary table is a complicated SUMPRODUCT lookup against an extremely wide table hidden to its left; depending on the workbook, the summary table can start as early as column AK and as late as column FE
- there are 2 broadly different formats of underlying data across the 5 workbooks, with small structural differences within the group of 3
in the group of 3
- the structure of this wide table is similar to the summary table, with categories in the column headers describing the metric below them, but with additional categories like region, which has the same value for every column header. 1 of these tables has 1 more header category than the other 2
- the left-most columns hold 1 category each, and there are 3 date columns (day, quarter, ...)
| REGION     |         | USA         | USA   | USA    |
| PARTNER    |         | bing        | bing  |        |
| LOB        |         | home        | home  | auto   |
| date       | quarter | impressions | spend | ...etc |
| 2023-01-01 | q1      | 1           | 2     | ...etc |
| 2023-01-02 | q1      | 3           | 4     | ...etc |
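For what it's worth, here is a minimal sketch of reshaping this group-of-3 layout into a long table with readxl and tidyr. The file name, sheet, and the assumption that the block starts at A1 with exactly 3 header rows are illustrative only; the real tables sit much further to the right:

```r
library(readxl)
library(dplyr)
library(tidyr)

# Assumptions: rows 1-3 hold the REGION / PARTNER / LOB values above each
# metric column, row 4 holds date, quarter and the metric names, and the data
# starts on row 5. Real blocks start between AK and FE, so a `range =`
# argument would be needed there.
raw <- read_excel("industry_1.xlsx", sheet = 1, col_names = FALSE)

# Merged header cells come back as NA after the first cell, so carry the last
# seen label across to the right.
fill_right <- function(x) {
  for (i in seq_along(x)[-1]) if (is.na(x[i])) x[i] <- x[i - 1]
  x
}

hdr  <- raw[1:4, ]
body <- raw[-(1:4), ]

metric_cols <- 3:ncol(raw)
names(body)[1:2] <- c("date", "quarter")
names(body)[metric_cols] <- paste(
  fill_right(as.character(unlist(hdr[1, ])))[metric_cols],  # region
  fill_right(as.character(unlist(hdr[2, ])))[metric_cols],  # partner
  fill_right(as.character(unlist(hdr[3, ])))[metric_cols],  # line of business
  as.character(unlist(hdr[4, ]))[metric_cols],              # metric name
  sep = "|"
)

tidy <- body |>
  pivot_longer(
    cols = -c(date, quarter),
    names_to = c("region", "partner", "line_of_business", "metric"),
    names_sep = "\\|",
    values_to = "value"
  ) |>
  mutate(value = as.numeric(value)) |>
  pivot_wider(names_from = metric, values_from = value)
# `date` may come back as Excel serial numbers and need converting afterwards.
```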
in the group of 2
- the left-most columns hold the categories that were the column headers in the group of 3, plus the metric name; the values in each category match
- the dates are now the column headers of this very wide table
- the header labels are separated from the start of the values by 1 column
- there is an empty row immediately below the final row of column headers
| date label  |         |        |             | 2023-01-01 | 2023-01-02 |
| year        |         |        |             | 2023       | 2023       |
| quarter     |         |        |             | q1         | q1         |
| (blank row) |         |        |             |            |            |
| REGION      | PARTNER | LOB    | measure     |            |            |
| (blank row) |         |        |             |            |            |
| US          | bing    | home   | impressions | 1          | 3          |
| US          | bing    | home   | spend       | 2          | 4          |
| US          | auto    | ...etc | ...etc      | ...etc     | ...etc     |
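And a minimal sketch for this group-of-2 layout, pivoting the date columns back into rows. The file name, the row and column positions, and the assumption that the values begin immediately after the measure column are all illustrative; a real file with a separator column would need that column dropped first:

```r
library(readxl)
library(dplyr)
library(tidyr)

# Assumptions: date labels sit on row 1 above the value columns, the
# REGION / PARTNER / LOB / measure header row is row 5, and the data starts on
# row 7 (one blank row above and below that header row).
raw <- read_excel("industry_4.xlsx", sheet = 1, col_names = FALSE)

value_cols <- 5:ncol(raw)
dates      <- as.character(unlist(raw[1, value_cols]))  # may be Excel serials
data_rows  <- raw[7:nrow(raw), ]

names(data_rows) <- c("region", "partner", "line_of_business", "measure", dates)

tidy <- data_rows |>
  pivot_longer(
    cols = -c(region, partner, line_of_business, measure),
    names_to = "date",
    values_to = "value"
  ) |>
  mutate(value = as.numeric(value)) |>
  pivot_wider(names_from = measure, values_from = value)
```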
The question is, what do you do, and how long does it take you to do it?
To be honest, I wrote out this explanation basically in the order in which I was introduced to the information and discovered it. (Oh, it's easy if it's all the same format, even if it's weird... oh, there are 2-ish differently formatted files.)
The meeting for this task ended at 11:00 AM. I saw this copy-paste manual ETL project and I simply didn't want to do it. So I outlined my task by identifying the elements of each table (column name ranges, value ranges, stacked/pivoted column ranges, etc.) for an R script to extract that data: the ranges of each element are passed as arguments, e.g. make_clean_table(left_columns = "B4:E4", header_dims = c(..etc)), and helper functions convert each Excel range into the correct position in the table and extract that element. Then the data was transformed to create a tidy long table.
The function gets called once per workbook, extracting the data from each worksheet and building a single table with columns for the workbook's industry, the category in the tab, partner, line of business, spend, impressions, etc. A sketch of that per-workbook loop follows.
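For context, here is roughly what that loop might look like with purrr. make_clean_table() is stubbed as a placeholder for the real range-driven extractor so the example runs, and the file names and industry labels are made up:

```r
library(readxl)
library(dplyr)
library(purrr)

# Stand-in for the range-driven extractor described above: the real
# make_clean_table() takes the cell ranges of each element of a tab and
# returns one tidy table per worksheet. Stubbed here only so the loop runs.
make_clean_table <- function(path, sheet) {
  read_excel(path, sheet = sheet)
}

# Hypothetical file names and industry labels.
workbooks <- tibble::tribble(
  ~file,             ~industry,
  "industry_1.xlsx", "retail",
  "industry_2.xlsx", "travel"
)

all_data <- map2(workbooks$file, workbooks$industry, function(path, ind) {
  excel_sheets(path) |>
    set_names() |>
    map(\(tab) make_clean_table(path, tab)) |>
    list_rbind(names_to = "tab") |>
    mutate(industry = ind, .before = 1)
}) |>
  list_rbind()
```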
IMO, ideally (if I have to check their data in Excel, that is), I'd like the partner to redo their report so that I receive a workbook with the underlying data in a traditionally tabular form, and a reporting page that uses Power Query and table references rather than cell ranges and formulas.