What would you do with this task, and how long would it take you to do it?
I'm going to describe a situation as specifically as I can. I am curious what people would do in this situation; I worry that I overcomplicate things for myself. I'm describing the whole task as it was described to me, and then as I discovered it.
Ultimately, I'm here to ask you, what do you do, and how long does it take you to do it?
I started a new role this month. I am new to advertising modeling methods like MMM, so I am reading a lot about how to apply MMM-specific methods in R and Python. I use VS Code; I don't have a GitHub Copilot license, but I get access to Copilot through our Windows/Office license. Although this task did not involve modeling, I do want to ask about that kind of task another day if this goes over well.
The task
5 Excel workbooks are provided. You are told that this is a client's data that was given to another party for some other analysis and augmentation. This is a quality assurance task. The previous process was as follows:
The data
- the data structure: 1 workbook per industry for 5 industries
- 4 workbooks had 1 tab, 1 workbook had 3 tabs
- each tab had a table with a daily date column, 2 categorical columns (advertising_partner, line_of_business), and at least 2 numeric columns per workbook
- sometimes data is updated on our side and the partner has to re-download the data, reprocess it, and share it again
The process
- this is done once per client, per quarter (but it's just this client for now)
- open each workbook
- navigate to each tab
- the data is in a "controllable" summary table
- the summary table header looks like `bing | bing` over `home | home` over `impressions | spend`, with a partner dropdown and a line-of-business dropdown; the bing and home values are controlled with dropdown toggles, each offering a combination of 3-4 categories
- compare with data that is downloaded from a Tableau dashboard
- end state: a comparison of the metrics in Tableau to the Excel tables to ensure that "the numbers are the same"
- the categories presented map 1 to 1 with the data you have downloaded from Tableau
- aggregate the data in a pivot table, select the matching categories, and make sure the values match (a sketch of this comparison step in R follows this list)
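If both sides can be reduced to tidy tables with matching category columns, the comparison itself is just a join plus a filter. Here is a minimal sketch in R; all column names and values are hypothetical stand-ins, not the client's actual fields:

```r
library(dplyr)

# Toy stand-ins for the two sources; in practice the Excel side comes from the
# extraction script and the Tableau side from the dashboard download.
excel_tidy <- tibble::tribble(
  ~partner, ~line_of_business, ~impressions, ~spend,
  "bing",   "home",            100,          20,
  "bing",   "auto",            250,          55
)
tableau_tidy <- tibble::tribble(
  ~partner, ~line_of_business, ~impressions, ~spend,
  "bing",   "home",            100,          20,
  "bing",   "auto",            240,          55
)

# Aggregate both sides the same way, then keep only the category combinations
# that disagree (or exist on one side only); those rows are the QA findings.
summarise_side <- function(df) {
  df |>
    group_by(partner, line_of_business) |>
    summarise(across(c(impressions, spend), sum), .groups = "drop")
}

mismatches <- full_join(
  summarise_side(excel_tidy), summarise_side(tableau_tidy),
  by = c("partner", "line_of_business"),
  suffix = c("_excel", "_tableau")
) |>
  filter(
    is.na(impressions_excel) | is.na(impressions_tableau) |
      impressions_excel != impressions_tableau |
      spend_excel != spend_tableau
  )
```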
Additional info about the files
- the summary table is a complicated SUMPRODUCT lookup against an extremely wide table hidden to its left; depending on the workbook, the summary table can start as early as column AK and as late as column FE
- there are 2 broadly different formats of underlying data across the 5 workbooks, with small structural differences within the group of 3
in the group of 3
- the structure of this wide table is similar to the summary table, with categories in the column headers describing the metric below them, but with additional categories like region, which has the same value for every column header. 1 of these tables has 1 more header category than the other 2
- the left-most columns hold 1 category each, and there are 3 date columns (day, quarter, ...)
| REGION     |         | USA         | USA   | USA    |
| PARTNER    |         | bing        | bing  |        |
| LOB        |         | home        | home  | auto   |
| date       | quarter | impressions | spend | ...etc |
| 2023-01-01 | q1      | 1           | 2     | ...etc |
| 2023-01-02 | q1      | 3           | 4     | ...etc |
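For what it's worth, here is a minimal sketch of reshaping this group-of-3 layout into a long table with readxl and tidyr. The file name, sheet, and the assumption that the block starts at A1 with exactly 3 header rows are illustrative only; the real tables sit much further to the right:

```r
library(readxl)
library(dplyr)
library(tidyr)

# Assumptions: rows 1-3 hold the REGION / PARTNER / LOB values above each
# metric column, row 4 holds date, quarter and the metric names, and the data
# starts on row 5. Real blocks start between AK and FE, so a `range =`
# argument would be needed there.
raw <- read_excel("industry_1.xlsx", sheet = 1, col_names = FALSE)

# Merged header cells come back as NA after the first cell, so carry the last
# seen label across to the right.
fill_right <- function(x) {
  for (i in seq_along(x)[-1]) if (is.na(x[i])) x[i] <- x[i - 1]
  x
}

hdr  <- raw[1:4, ]
body <- raw[-(1:4), ]

metric_cols <- 3:ncol(raw)
names(body)[1:2] <- c("date", "quarter")
names(body)[metric_cols] <- paste(
  fill_right(as.character(unlist(hdr[1, ])))[metric_cols],  # region
  fill_right(as.character(unlist(hdr[2, ])))[metric_cols],  # partner
  fill_right(as.character(unlist(hdr[3, ])))[metric_cols],  # line of business
  as.character(unlist(hdr[4, ]))[metric_cols],              # metric name
  sep = "|"
)

tidy <- body |>
  pivot_longer(
    cols = -c(date, quarter),
    names_to = c("region", "partner", "line_of_business", "metric"),
    names_sep = "\\|",
    values_to = "value"
  ) |>
  mutate(value = as.numeric(value)) |>
  pivot_wider(names_from = metric, values_from = value)
# `date` may come back as Excel serial numbers and need converting afterwards.
```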
in the group of 2
- the left-most columns hold the categories that were the column headers in the group of 3, plus the metric name; the values in each category match
- the dates are now the column headers of this very wide table
- the header labels are separated from the start of the values by 1 column
- there is an empty row immediately below the final row of column headers
| date label  |         |        |             | 2023-01-01 | 2023-01-02 |
| year        |         |        |             | 2023       | 2023       |
| quarter     |         |        |             | q1         | q1         |
| (blank row) |         |        |             |            |            |
| REGION      | PARTNER | LOB    | measure     |            |            |
| (blank row) |         |        |             |            |            |
| US          | bing    | home   | impressions | 1          | 3          |
| US          | bing    | home   | spend       | 2          | 4          |
| US          | auto    | ...etc | ...etc      | ...etc     | ...etc     |
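And a minimal sketch for this group-of-2 layout, pivoting the date columns back into rows. The file name, the row and column positions, and the assumption that the values begin immediately after the measure column are all illustrative; a real file with a separator column would need that column dropped first:

```r
library(readxl)
library(dplyr)
library(tidyr)

# Assumptions: date labels sit on row 1 above the value columns, the
# REGION / PARTNER / LOB / measure header row is row 5, and the data starts on
# row 7 (one blank row above and below that header row).
raw <- read_excel("industry_4.xlsx", sheet = 1, col_names = FALSE)

value_cols <- 5:ncol(raw)
dates      <- as.character(unlist(raw[1, value_cols]))  # may be Excel serials
data_rows  <- raw[7:nrow(raw), ]

names(data_rows) <- c("region", "partner", "line_of_business", "measure", dates)

tidy <- data_rows |>
  pivot_longer(
    cols = -c(region, partner, line_of_business, measure),
    names_to = "date",
    values_to = "value"
  ) |>
  mutate(value = as.numeric(value)) |>
  pivot_wider(names_from = measure, values_from = value)
```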
The question is, what do you do, and how long does it take you to do it?
To be honest, I wrote out this explanation basically in the order in which I was introduced to the information and discovered it. (Oh, it's easy if it's all the same format, even if it's weird... oh, there are 2-ish differently formatted files.)
The meeting for this task ended at 11:00 AM. I saw this copy-paste manual ETL project and I simply didn't want to do it. So I outlined my task by identifying the elements of each table (column name ranges, value ranges, stacked/pivoted column ranges, etc.) for an R script to extract that data: the ranges of each element are passed as arguments, e.g. make_clean_table(left_columns = "B4:E4", header_dims = c(..etc)), and helper functions convert each Excel range into the correct position in the table and extract that element. Then the data was transformed to create a tidy long table.
The function gets called once per workbook, extracting the data from each worksheet and building a single table with columns for the workbook's industry, the category in the tab, partner, line of business, spend, impressions, etc. A sketch of that per-workbook loop follows.
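For context, here is roughly what that loop might look like with purrr. make_clean_table() is stubbed as a placeholder for the real range-driven extractor so the example runs, and the file names and industry labels are made up:

```r
library(readxl)
library(dplyr)
library(purrr)

# Stand-in for the range-driven extractor described above: the real
# make_clean_table() takes the cell ranges of each element of a tab and
# returns one tidy table per worksheet. Stubbed here only so the loop runs.
make_clean_table <- function(path, sheet) {
  read_excel(path, sheet = sheet)
}

# Hypothetical file names and industry labels.
workbooks <- tibble::tribble(
  ~file,             ~industry,
  "industry_1.xlsx", "retail",
  "industry_2.xlsx", "travel"
)

all_data <- map2(workbooks$file, workbooks$industry, function(path, ind) {
  excel_sheets(path) |>
    set_names() |>
    map(\(tab) make_clean_table(path, tab)) |>
    list_rbind(names_to = "tab") |>
    mutate(industry = ind, .before = 1)
}) |>
  list_rbind()
```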
IMO, ideally (if I have to check their data in Excel, that is), I'd like the partner to redo their report so that I receive a workbook with the underlying data in a traditionally tabular form, and a reporting page that uses Power Query and table references rather than cell ranges and formulas.