🖍️ Notes

We used to live in a linear, line-by-line world, and now it is a spatial one.

Models
A diffusion model behaves differently from a latent-space model. A latent space can tell you the conceptual distance between one object and another, which makes it more similar to our mind. Diffusion just literally lands in the middle between one object and the other.
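A rough sketch of that difference: blending two images pixel by pixel gives a literal double exposure, while interpolating their latent codes and decoding would give a semantically plausible in-between. The decode call here is a hypothetical stand-in for a trained model's decoder; the arrays are placeholders.

```python
import numpy as np

def lerp(a, b, t):
    """Linear interpolation between two vectors."""
    return (1 - t) * a + t * b

# Pixel-space midpoint: a literal blend of the two images (a ghostly overlap).
img_a = np.random.rand(64, 64)   # stand-ins for two images
img_b = np.random.rand(64, 64)
pixel_mid = lerp(img_a, img_b, 0.5)

# Latent-space midpoint: interpolate the *codes*, then decode.
z_a = np.random.rand(128)        # pretend latent codes for the two images
z_b = np.random.rand(128)
z_mid = lerp(z_a, z_b, 0.5)
# img_mid = decode(z_mid)        # hypothetical decoder of a trained model
```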

GAN: a generator and a discriminator network compete: the generator tries to produce realistic data, while the discriminator tries to distinguish fake data from real data.
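A minimal sketch of one training step of that competition, assuming a toy 2-d data distribution and tiny networks (all sizes here are illustrative, not from any particular paper):

```python
import torch
import torch.nn as nn

# Minimal sketch of one GAN training step on a toy 2-d data distribution.
# A real setup runs this in a loop over many batches.
G = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 2))  # noise -> data
D = nn.Sequential(nn.Linear(2, 32), nn.ReLU(), nn.Linear(32, 1))   # data -> real/fake logit
opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
bce = nn.BCEWithLogitsLoss()

real = torch.randn(64, 2)    # stand-in for a batch of real samples
noise = torch.randn(64, 16)

# Discriminator step: push real toward label 1, generated toward label 0.
fake = G(noise).detach()     # detach so this step doesn't update G
loss_d = bce(D(real), torch.ones(64, 1)) + bce(D(fake), torch.zeros(64, 1))
opt_d.zero_grad(); loss_d.backward(); opt_d.step()

# Generator step: try to make D label its output as real (label 1).
loss_g = bce(D(G(noise)), torch.ones(64, 1))
opt_g.zero_grad(); loss_g.backward(); opt_g.step()
```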


Concepts

Latent space is compressed: it represents high-dimensional data with far fewer dimensions, keeping only the structure that matters.
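For instance, a toy autoencoder (sizes picked arbitrarily for illustration) squeezes a 784-dimensional input through an 8-dimensional bottleneck; that bottleneck is the latent space:

```python
import torch
import torch.nn as nn

# Toy autoencoder: 784 numbers in, 8 numbers retained, 784 reconstructed.
encoder = nn.Sequential(nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 8))
decoder = nn.Sequential(nn.Linear(8, 128), nn.ReLU(), nn.Linear(128, 784))

x = torch.randn(1, 784)   # stand-in for a flattened 28x28 image
z = encoder(x)            # latent code, shape (1, 8): the compressed "location"
x_hat = decoder(z)        # reconstruction from only those 8 numbers
```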


An embedding is a 'location'; it needs another embedding to be useful (so you can start making connections / measuring the space between them).
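A small illustration with made-up 4-d vectors (real embeddings have hundreds of dimensions): a single embedding says nothing on its own, but two of them define a distance.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1 = same direction."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical embeddings; the numbers are invented for illustration.
cat = np.array([0.9, 0.1, 0.3, 0.0])
dog = np.array([0.8, 0.2, 0.4, 0.1])
car = np.array([0.0, 0.9, 0.0, 0.8])

print(cosine_similarity(cat, dog))  # high: the two locations are close
print(cosine_similarity(cat, car))  # lower: farther apart in the space
```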

Writing

The geometry of the space reflects meaningful semantic relationships.
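The classic illustration is word-vector arithmetic: directions in the space encode relations. A toy sketch with hand-picked 3-d vectors (purely illustrative, not trained; they are chosen so the analogy works out exactly):

```python
import numpy as np

# Hand-picked toy vectors; real word embeddings learn this from data.
vecs = {
    "king":  np.array([0.9, 0.9, 0.1]),
    "man":   np.array([0.1, 0.9, 0.1]),
    "woman": np.array([0.1, 0.1, 0.9]),
    "queen": np.array([0.9, 0.1, 0.9]),
}

# king - man + woman lands (by construction) on queen.
target = vecs["king"] - vecs["man"] + vecs["woman"]
nearest = min(vecs, key=lambda w: np.linalg.norm(vecs[w] - target))
print(nearest)  # "queen": the same direction encodes the relation for both pairs
```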


Hiano’s writing makes me think about the journeys taken to arrive at a vector in a latent space. Since there are infinitely many vector points in a latent space, there are infinitely many journeys that could be taken to arrive at any specific vector point. We can probably metaphorize each of these infinite journeys as each of the human beings who could have existed in this world since the onset of its creation. Each human being carries their own life experience, hence traversing a “unique” journey to get to the vector point that defines “who” they are. For example, “to be a Kezia (me)”.




I was watching a 3Blue1Brown video on Transformers when I noticed that the examples used for the image-to-text model capture only the subject of the image and not the totality of it. For example, an image of a bird perching on some tree branches will be condensed to just a bird. In other words, only a high-level abstraction of the image is captured. The way that popular models prioritise the “main subject” seems to be based on the tendency of the human mind to highlight salient objects.


From the perspective of alt text, search, and image tagging, it seems that concise captions are more practical, though they may not be the most accurate. Most people google with a few descriptive words, and this kind of behaviour may pair better with simplified alt text than with overly descriptive alt text that may not match the user’s simple search keywords. This makes me wonder: is it more intuitive for humans to think of a broader concept first? Is it because humans are social creatures, and thus our visual communication is more optimised for the “main character” in the scene (i.e., the other human you are interacting with)?


And surely, machine learning models are trained on our human biases. That may explain why so many models output high-level descriptions of an image.


This favour towards conciseness seems to benefit daily, casual conversation and user-interface text, both topics that are averaged across most human experiences and can cater to the masses. On the other side, accurate, exhaustive description is required for highly technical or niche subjects such as scientific research, forensics, and medical reports. These topics may not be accessible to most people’s knowledge, but only to a handful of experts.


That being said, who is machine learning being trained for? Are we prioritising the general population or specialised technical domains? Should machine learning be built for everyday people, or should it be kept behind the scenes in specialised fields?

Elizabeth Kezia Widjaja © 2025 🙂