Dataset of 150k+ stool images and not sure how to fully use it [D]

Our take

You have a substantial dataset of over 150,000 stool images, but only about 5,000 of them have been manually verified so far, and reviewing every remaining image by hand will become a bottleneck as you scale. Training on a carefully curated subset is a sound foundation. To extend it, consider model-assisted pre-labeling, where the current model proposes labels and a human only confirms or corrects them, or active learning, where the model's own uncertainty decides which unlabeled images are sent for human review first.
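As a rough illustration of the active learning idea, here is a minimal sketch of uncertainty sampling: rank unlabeled images by the entropy of the model's predicted class probabilities and send the most uncertain ones to a human first. The image names, the three-class probabilities, and the `select_for_review` helper are all hypothetical and not taken from the original post. A real setup would use the trained model's softmax outputs and would probably score each attribute (Bristol type, color, mucus/blood) separately.

```python
import math
from typing import Dict, List, Sequence


def predictive_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy of a probability vector; higher means less certain."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def select_for_review(
    predictions: Dict[str, Sequence[float]], budget: int
) -> List[str]:
    """Return the `budget` image ids whose predictions are most uncertain."""
    ranked = sorted(
        predictions,
        key=lambda image_id: predictive_entropy(predictions[image_id]),
        reverse=True,
    )
    return ranked[:budget]


# Hypothetical model outputs for three unlabeled images (made-up values).
predictions = {
    "img_001.jpg": [0.98, 0.01, 0.01],  # confident prediction
    "img_002.jpg": [0.40, 0.35, 0.25],  # very uncertain
    "img_003.jpg": [0.70, 0.20, 0.10],  # somewhat uncertain
}

# With a review budget of 2, the two most uncertain images are queued.
review_queue = select_for_review(predictions, budget=2)
print(review_queue)
```

Images the model is already confident about can be spot-checked at a lower rate, which keeps human effort focused where a correction is most likely to change the model.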

I have a dataset of around 150k stool images, and I’m trying to better understand the “right” way to use it for training a computer vision model.

Right now, our process is pretty manual. We initially trained on about 5k images that were individually verified by a human. For every image, we checked/corrected the Bristol type, consistency, color, mucus/blood indicators, etc. Then we trained the model on those verified annotations.

As we continue training, we keep doing the same thing: manually reviewing and correcting images before feeding them back into the model.

My question is basically: does this workflow make sense from an ML perspective? Is this how people normally approach building a solid vision dataset/model, especially in a domain where annotation quality matters a lot? Or is there a smarter/more scalable approach people usually move toward once they have a large dataset?

I’m mainly trying to understand best practices around dataset quality, human verification, iterative training, and scaling annotation without introducing bad labels.

submitted by /u/SamePersonality5183

