Everything I'm finding is about training on your own data set, which is not what I'm after. I just want the nature of the subject of the picture. It must run locally (the images do not leave the computer) and be automated, as there are upwards of 50,000 images to deal with.
So... Clip?
This is a pre-trained part of most genAI models, the CLIP layer.
Most models have such a layer built in. Flux adds a T5 text encoder on top of CLIP, I think, but by and large you want to look into local "captioning" models (BLIP is a common one; T5 only handles text, so it won't caption an image by itself).
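If you go the CLIP route, here is a rough sketch of zero-shot labeling over a folder, with no training involved. It assumes the Hugging Face `transformers`, `torch`, and `pillow` packages and the public `openai/clip-vit-base-patch32` checkpoint. The label list and the function names (`find_images`, `load_clip`, `classify`, `label_folder`) are made up for illustration, so swap in whatever subjects you care about. The weights are downloaded once and cached; after that, inference runs entirely on your machine.

```python
from pathlib import Path

# Hypothetical candidate subjects; CLIP scores each image against these prompts.
LABELS = [
    "a photo of a person",
    "a photo of an animal",
    "a photo of a landscape",
    "a photo of a document",
]
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".webp"}


def find_images(root):
    """Return all image paths under root (recursive), sorted."""
    return sorted(
        p for p in Path(root).rglob("*") if p.suffix.lower() in IMAGE_SUFFIXES
    )


def load_clip(model_name="openai/clip-vit-base-patch32"):
    """Load a pre-trained CLIP model and its processor (no fine-tuning)."""
    # Imported here so the file-walking helper works without transformers installed.
    from transformers import CLIPModel, CLIPProcessor

    model = CLIPModel.from_pretrained(model_name).eval()
    processor = CLIPProcessor.from_pretrained(model_name)
    return model, processor


def classify(path, model, processor, labels=LABELS):
    """Return (best_label, probability) for one image."""
    import torch
    from PIL import Image

    image = Image.open(path).convert("RGB")
    inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        # logits_per_image has shape (1, len(labels)); softmax turns it into scores.
        probs = model(**inputs).logits_per_image.softmax(dim=-1)[0]
    best = int(probs.argmax())
    return labels[best], float(probs[best])


def label_folder(root):
    """Print one tab-separated line per image: path, label, score."""
    model, processor = load_clip()
    for path in find_images(root):
        label, score = classify(path, model, processor)
        print(f"{path}\t{label}\t{score:.2f}")
```

Calling `label_folder("/path/to/images")` would walk the folder and print a guess per file. For 50,000 images you would probably want to batch several images per forward pass and run on a GPU, but the one-at-a-time loop shows the idea. The scores are only relative to the labels you supply, so an image that matches none of them still gets assigned to the closest one.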
You're probably going to want something trained on the image subgenre you are trying to positively/negatively ID on specifically.
I think a lot of people like DeepDanbooru too.
What is your use case?