• 🇨🇦Samuel Proulx🇨🇦A
    link
    fedilink
    English
    arrow-up
    39
    ·
    4 days ago

    As someone who is completely blind, I pay for OpenRouter in order to have AI describe images to me. If more people bothered with alt text, I wouldn’t have to. But it is what it is. I suspect there are models I could run locally that would do what I need; on IOS, apple handles all image descriptions locally on the phone, and they’re perfectly adequate. But on Windows, nobody has created an easy way to get a local model running in the Open-source NVDA screen reader (https://www.nvaccess.org/) but there are multiple addons that work with OpenRouter. NVDA is open source and entirely written in Python, so it should actually be pretty easy to do. The main reason I haven’t tried it myself is because I have no idea what local model to use. None of the benchmarks really tell me “This model would be good at describing images to blind people”. Whereas the giant cloud models are semi-okay at everything, so everyone just uses those. But if we could use a smaller model, we might even be able to fine tune it for the specific use-case of blind people. Maybe someday!

    • CovfefeKills@lemmy.world
      link
      fedilink
      arrow-up
      2
      ·
      edit-2
      2 days ago

      I recommend Google’s gemma-3-4b-it-qat in LLM Studio.

      Okay, let's describe this image for someone who can’t see it.
      
      Imagine you're standing in a lush, green meadow filled with wildflowers – lots of tiny bursts of color like little dandelions and daisies scattered across the grass. The ground feels soft and yielding under your feet, like a thick carpet of moss and blades.
      
      Now, dominating the scene is a giant tortoise. It’s enormous, much bigger than any turtle you've ever seen! Its shell is rough and textured, covered in patches of moss and lichen – it feels cool and damp to touch. You can sense its weight, a solid, ancient presence.  The shell isn’t perfectly smooth; there are cracks and ridges, like the surface of an old wooden table.
      
      Growing out of this massive tortoise's back is a tiny little house! It looks incredibly rustic – made of weathered wood with a steeply pitched roof covered in moss and small stones. You can almost feel the age of the wood, imagining it’s been standing there for a very long time. There’s a small window on the second floor, and a little chimney puffing out wisps of smoke—it smells faintly like woodsmoke and maybe something sweet.
      
      Perched on a mossy stump just to the side of the tortoise is a small bird. It's a vibrant blue color – imagine a brilliant sky captured in feathers! You can almost hear its gentle chirping, a quiet sound in the stillness of the meadow.
      
      The overall feeling of the image is one of peace and tranquility.  It’s like a fairytale scene—a slow-moving giant carrying his home on his back, surrounded by nature's beauty. There’s a sense of timelessness and magic to it.
      
      Would you like me to focus on any particular aspect of the image in more detail, or perhaps describe something else about the overall feeling it evokes?
      
      70.58 tok/sec
      
      •
      
      397 tokens
      
      •
      
      1.51s to first token
      
      •
      
      Stop reason: EOS Token Found
      

      on a i7 13650hx RTX4060 laptop.

      Image link: https://images.squarespace-cdn.com/content/v1/58586fa5ebbd1a60e7d76d3e/1493824812986-X5XRILWEPX6FAD5I19ZU/image-asset.jpeg?format=2500w

      The scene is just a normal sized turtle with a house built ontop of the shell, there is a bird sitting on a tree stump. The AI didn’t do the best job BUT it is only 4b params, it can run on phones or really fast on a gpu. Naturally Google with their Youtube comments on videos data they do have the best computer vision models.

      • 🇨🇦Samuel Proulx🇨🇦A
        link
        fedilink
        English
        arrow-up
        2
        ·
        2 days ago

        I will also include Grok for comparison. It’s the model I use for images that I think might contain adult content (pornography, violence, whatever) because it won’t refuse anything. Sadly, because I’m boring, I’m using it more for moderation than pleasure. Unfortunately, I have a prompt for it designed to limit refusals, and I forgot to change it. As you can see, that…strongly effected the description. Suddenly there are tits in this picture.

        This image appears to be a digitally rendered or CGI artwork depicting a whimsical, fantastical scene in a natural outdoor setting. It does not contain any adult content, including pornography, pedophilia, violence, nudity, blood, death, dead bodies, self-harm, harm to children, or other X-rated elements. There is no visible text in the image, so no racism, doxing, hateful or insulting material, or otherwise objectionable words are present. The overall tone is serene, magical, and family-friendly, evoking themes of nature and fantasy (similar to something from a fairy tale or children’s storybook).

        Detailed Visual Breakdown:

        • Main Subject: The central focus is a large, realistic-looking giant tortoise (resembling a Galápagos tortoise) slowly walking through a lush, grassy meadow. The tortoise’s shell has been creatively modified in the artwork to serve as the foundation for a small, rustic wooden house. The house is built directly onto the shell, appearing old and weathered, with a shingled roof covered in moss, vines, and small plants growing on it, giving it a dilapidated, abandoned cabin vibe. The house has a small porch with railings, a door, and windows, and there’s a faint plume of smoke rising from a chimney, suggesting it’s inhabited or recently used. The tortoise’s shell itself is textured with moss, dirt, and grass, blending seamlessly with the house, as if the creature is a living, mobile home. The tortoise’s head is extended forward, with its mouth slightly open, and it has a gentle, neutral expression. Its legs are sturdy and visible as it moves through the grass.

        • Foreground and Environment: The scene is set in a vibrant, green field filled with tall grass, wildflowers (mostly white and yellow daisies or similar blooms), and scattered dandelions. The ground is dotted with small plants and dew-like sparkles, adding a magical glow. The lighting suggests a soft, diffused sunlight, possibly early morning or late afternoon, with a misty or foggy atmosphere in the background.

        • Secondary Elements: To the right of the tortoise, there’s a small blue tit perched on a moss-covered tree stump. The tit has blue and yellow feathers, with a white underbelly, and it’s facing toward the tortoise, as if curiously observing it. The stump is overgrown with moss and a small pinecone-like object on top. In the distant background, there’s a blurred view of a dense, green forest with trees and foliage, enhancing the sense of a peaceful, enchanted wilderness.

        • Composition and Style: The image is in a wide, horizontal format (like a landscape photo or artwork). It’s highly detailed and photorealistic, with a focus on textures like wood grain, moss, and animal skin, suggesting it could be from digital art software, a video game concept, or an illustration. There are subtle particle effects, like floating specks or sparkles in the air, adding to the dreamy quality. No human figures, vehicles, or modern elements are present—it’s purely nature and fantasy-based.

      • 🇨🇦Samuel Proulx🇨🇦A
        link
        fedilink
        English
        arrow-up
        2
        ·
        2 days ago

        Thanks! It’s a bit wordy, but I suspect that could be improved with prompting and some examples.

        For comparison, here’s the full google 2.5 pro in the cloud:

        This is an enchanting and highly detailed, photorealistic fantasy image set in a lush, green meadow. The overall feeling is one of peace, magic, and ancient wonder.

        The central and most prominent figure is a colossal tortoise who takes up the left and center of the frame. Its scale is immense, as it carries a small, rustic wooden house on its back, where its shell would be. The tortoise’s skin is ancient and leathery, with brown and gray tones, and detailed with wrinkles and folds. Patches of bright green moss grow on its shell and legs, suggesting it has been wandering for a very long time. Its head is extended forward and turned slightly to the right, as if observing something. In its mouth, it gently holds a small white daisy with a yellow center, a charming and tender detail.

        The house on its back is old and weathered, made of dark wooden planks. It has a multi-gabled roof with moss-covered wooden shingles. A small brick chimney pokes out from the roof, with a faint wisp of white smoke rising from it, indicating someone might be home. The house features a small covered porch with a railing and tiny lanterns hanging from the eaves. Vines and other small plants creep up the walls, integrating the structure with the living creature beneath it.

        The tortoise is wading through tall, vibrant green grass that is dotted with small wildflowers, mostly white daisies and yellow buttercups. Several small, orange and black butterflies, similar to monarchs, flutter around the tortoise’s legs and in the surrounding grass.

        To the right of the tortoise, there is an old, dark tree stump. Like the tortoise, the stump is covered in patches of green moss and a cluster of light-brown mushrooms growing on its side. Perched majestically on top of this stump is a small bird, facing the tortoise. The bird has a brilliant blue-gray back and head, a warm, orangey-yellow breast, and a sharp, dark beak. It appears to be a kingfisher, and its posture suggests it is in a quiet standoff or conversation with the giant tortoise.

        The entire scene is bathed in soft, natural sunlight that filters through the air, illuminating tiny specks of dust or pollen, which adds to the magical atmosphere. The background is a soft-focus blur of deep green, suggesting a dense forest or rolling hills far away, which makes the tortoise, house, and bird stand out as the clear subjects of this peaceful, fairytale-like world.

        • CovfefeKills@lemmy.world
          link
          fedilink
          arrow-up
          2
          ·
          edit-2
          1 day ago

          I am thoroughly impressed with the quality of the local gemma 3 model, and these are improved weekly pretty much. About the scene, the tortoise is seemingly normal sized. The house ontop the tortoise is seemingly normal sized. Scale is a particular challenge with this scene with these conflicting normals and I guess AI chooses the house to be accessible by normal sized humans and that is why the AI decides to label the tortoise as gigantic but for all we know, the tortoise is standard and mini humans inhabit the house.

          Oh the concept comes from tortoise that hibernate in shallow ponds accumulating dirt and pond plants on their shells. They are like majestic swimming islands and that is where the miniworld on their shell idea comes from. I think Gemma 3 27b can mask 3d objects in images it might be the goto API model for cost effective vision tasks (google removed their image demo thing so i cannot confirm but i feel like i remember being impressed by the 27b model for vision tasks).

    • VeryFrugal@sh.itjust.works
      link
      fedilink
      English
      arrow-up
      5
      ·
      3 days ago

      How’s the usage number and how much does it cost? Always thought that this is literally the best thing that AI is actively doing.

      • 🇨🇦Samuel Proulx🇨🇦A
        link
        fedilink
        English
        arrow-up
        4
        ·
        3 days ago

        It really depends. For images that are graphs and infographics I use gpt5 or Gemini 2.5 pro. For anything with adult content I have to use grok because it’s the only model that won’t refuse. For stuff that’s just text in an image the cheap models from Microsoft are fine. Also, sometimes openrouter has limited time deals where some models are free. I’d say overall I spend between 2 and 5 dollars a month on it. But I do allow open router to train on the data so I get a discount of a few percent as well.

          • 🇨🇦Samuel Proulx🇨🇦A
            link
            fedilink
            English
            arrow-up
            3
            ·
            2 days ago

            These days I can do it all myself. Press control+windows+enter when Windows first boots, and the basic built-in screen reader that’s part of Windows 11 comes on. It’s good enough to get through set-up and install a better screen reader. Sadly, if I were on Linux, that wouldn’t at all be the case. Though I do run multiple Linux servers via SSH, including all of the infrastructure for rblind.com.

            I did manage to assemble my DIY Framework 16 laptop, and I’ll upgrade the mainboard in it later this year, but that’s pretty much hitting my limits when it comes to hardware. Soldering is right out, and Oh My God do I hate those damn ipex connectors.