Hey OP, sorry about all the negativity; I think most of these commenters are pretty off-base right now. My company is building a lot of API infrastructure and I thought this was a great write-up!
Hey, I've been getting into visual processing lately, and we just started working on an offline CLI wrapper for Apple's Vision and other ML libraries: https://github.com/accretional/macos-vision. You can see some SVG art I created in a screenshot I just posted for a different comment: https://i.imgur.com/OEMPJA8.png (on the right is a cubist Plato SVG lol)
Since your app is fully offline I'd love to chat about photogenesis/your general work in this area, since there may be a good opportunity for collaboration. I've been working on some image stuff and want to build a local desktop/web application. Here are some UI mockups I've been playing with (many AI-generated, though some of the features are functional; I realized that with CSS/SVG masks you can do a ton more than you'd expect): https://i.imgur.com/SFOX4wB.png https://i.imgur.com/sPKRRTx.png But we most likely don't have all the UI/vision expertise we'd need to take them to completion.
Guys, I found out about this technology called Cascading Style Sheets recently and I think it's the missing piece we've been looking for. It lets you declaratively specify layout in a composable, hierarchical system based on something called the Document Object Model in a way that minimizes both clientside and serverside processing, based on these things called "stylesheets".
The best part is, it's super easy to customize them, read others' for inspiration or to see how they did something, or even ship multiple per site to deal with different user preferences. Through the "forms" API, and little-known browser features like URL fragments, the :target and attribute selectors, and combinators, plus "the checkbox hack", you can build extremely responsive UIs out of it by "cascading" UI updates through your site! When do you think they're going to add it to next.js?
I'm tentatively calling this new UI paradigm "no-framework" or "no package manager", not sure yet https://i.imgur.com/OEMPJA8.png
> Cascading Style Sheets recently and I think it's the missing piece we've been looking for. It lets you declaratively specify layout in a composable, hierarchical system based on something called the Document Object Model in a way that minimizes both clientside and serverside processing, based on these things called "stylesheets"
I tried that and it was an absolute nightmare. There was no way to tell where a given style is used from, or even if it's used at all, and if the DOM hierarchy changes then your styles all change randomly (with, again, no way to tell what changed or where or why). Also "minimizes clientside processing" is a myth, I don't know what the implementation is but it ends up being slower and heavier than normal. Who ever thought this was a good idea?
> There was no way to tell where a given style is used from, or even if it's used at all
It's pretty easy. Open the inspector, select an element and you will find all the styles that apply. If you didn't try to be fancy and use weird build tools, you will also get the name of the file and the line number (and maybe navigation to the line itself). In Firefox, there's even a live editor for the selected element and the CSS file.
> if the DOM hierarchy changes then your styles all change randomly
Also styles are semantics like:
- The default appearance of links should be: ...
- All links in the article section should also be: ...
- The links inside a blockquote should also be: ...
- If a link has a class 'popup' it should be: ...
- The link identified as 'login' should be: ...
There's a section on MDN about how to ensure those rules are applied in the wanted order[1].
This way, your styles shouldn't need updates that often unless you change the semantics of your DOM.
> It's pretty easy. Open the inspector, select an element and you will find all the styles that apply.
Of course it's not easy: 80% of that list will be garbage like global variables that I'd only need to see when they actually appear in a style value, not all the time.
The names are often unintuitive, and search is primitive anyway, so that's of little help. And the values are just as bad, with var(--x) indirection and !important adding needless verbosity to this aborted attempt at a programming language.
Then there's the potentially more useful "Computed" styles tab, but even for the most primitive property, width, it often fails and can't click-to-find where the style is coming from.
> Also styles are semantics like:
That's another myth. Your style could just be ReactComponentModel.ReactComponentSubmodel.hkjgsrtio.VeryImportantToIncludeHash.List.BipBop.Sub
And what do you blame the other 99% of the poor designs and papercuts on?
But at any rate, citing a broken tool is just shifting blame; it doesn't make a hard system easy.
> Open the inspector, select an element and you will find all the styles that apply.
That tells me which styles apply to an element. You also need the converse - find which elements a given style applies to - and there's no way to do that AFAIK. It's very hard to ever delete even completely unused styles, because there is no way to tell (in the general case) whether a given style is used at all.
> This way, your styles shouldn't need updates that often unless you change the semantics of your DOM.
In my experience the DOM doesn't have semantics, or to the extent that it does, they change all the time.
> You also need the converse - find which elements a given style applies to - and there's no way to do that AFAIK.
I've never needed to do this, because I pay attention to my DOM structure and can figure out from the CSS selectors where a style applies. But I've just checked, and the search bar in the Firefox Inspector supports CSS selectors.
> In my experience the DOM doesn't have semantics, or to the extent that it does, they change all the time.
The DOM semantics are those of hyperlinked documents and forms. Take a page and think about what each element means and its relation to the others. They will form a hierarchy with some generic components replicated. Then, due to how CSS is applied, you go from generic to specific elements, using the semantic structure to do the targeting.
As an example, the structure of HN's reply page is
page
  header
    logo + title
  body
    comment_box
      upvote_button + comment_metadata
      comment_text
    textbox  // reply text
    button   // reply
This and the structure of the other pages will give you an insight on how to target the relevant elements.
> A table is a grid. Lot of UI toolkits have a grid container
Sure. As long as the end result is the same grid, it shouldn't matter. But in a CSS world you switch your table for grid-layout divs (or vice versa), and suddenly some corner-case thing sitting in one grid cell somewhere in your app gets its styling flipped.
> Why does AI need that folder structure? Why not a flat list of files and let the AI agent explore with BM25 / grep, etc.
Progressive disclosure: the same reason you aren't assaulted with all the information a website has to offer at once, or handed a SQL console and told to figure it out, but instead see a portion of the information in a way that naturally leads you to the next bit of information you're looking for.
> use cases
This is essentially just where you're moving the hierarchy/compression, but at least for me these are not very disjoint and separable. I think what I actually want are adaptable LoRAs that loosely correspond to these use cases, but where a dense discriminator or other system is able to adapt and stay in sync with them too. Also, tool-calling + SQL/vector embeddings, so that you can actually get good filesystem search without it feeling like work, and let the model filter out the junk.
> let the AI calculate this at run time?
You still do want to let it do agentic RAG but I think more tools are better. We're using sqlite-vec, generating multimodal and single-mode embeddings, and trying to make everything typed into a walkable graph of entity types, because that makes it much easier to efficiently walk/retrieve the "semantic space" in a way that generalizes. A small local model needs at least enough structure to know these are the X ways available to look for something and they are organized in Y ways, oriented towards Z and A things.
Especially on-device, telling them to "just figure it out" is like dropping a toddler or autonomous vehicle into a dark room and telling them to build you a search engine lol. They need some help and also quite literally to be taught what a search engine means for these purposes. Also, if you just let them explore or write things without any kind of grounding in what you need/any kind of positive signals, they're just going to be making a mess on your computer.
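To make the "walkable graph of entity types" bit concrete, here's a toy sketch (pure-Python cosine similarity standing in for sqlite-vec, and the entity kinds/fields are made up for illustration):

```python
import math
from dataclasses import dataclass, field

@dataclass
class Entity:
    # Hypothetical typed node: kind + outgoing edges make the index "walkable".
    kind: str                      # e.g. "note", "image", "contact"
    text: str
    embedding: list[float]
    edges: list[str] = field(default_factory=list)  # ids of related entities

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def search(index: dict, query_vec, kind=None, k=3):
    # Filter by entity type first, then rank by vector similarity.
    pool = [(i, e) for i, e in index.items() if kind is None or e.kind == kind]
    pool.sort(key=lambda ie: cosine(ie[1].embedding, query_vec), reverse=True)
    return [i for i, _ in pool[:k]]

index = {
    "n1": Entity("note", "grocery list", [1.0, 0.0], edges=["n2"]),
    "n2": Entity("note", "dinner plan", [0.9, 0.1]),
    "c1": Entity("contact", "Alice", [0.0, 1.0]),
}
print(search(index, [1.0, 0.0], kind="note", k=1))  # → ['n1']
```

The point is just that the type filter plus the edge lists give a small model a few fixed, teachable "ways to look for something" instead of one undifferentiated vector soup.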
Maybe it depends on the use case, but my opinion is that if you do need to apply compression, it should be done via a tool call in real time instead of in a pipeline.
For example, if you're trying to summarize the status of a project, it's better to write a script that summarizes the status of all of the Jira tickets than to feed every ticket to an agent (in real time or via a summarization pipeline) and ask it to produce the summary.
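A toy sketch of what I mean (the ticket shape here is made up, not the real Jira API; a real script would pull from it):

```python
from collections import Counter

def summarize_tickets(tickets):
    """Aggregate ticket status deterministically; the agent gets this one
    compact summary as a tool result instead of reading every ticket."""
    by_status = Counter(t["status"] for t in tickets)
    blocked = [t["key"] for t in tickets if t.get("blocked")]
    total = len(tickets)
    done = by_status.get("Done", 0)
    pct = done * 100 // total if total else 0
    lines = [f"{total} tickets, {done}/{total} done ({pct}%)"]
    for status, n in by_status.most_common():
        lines.append(f"  {status}: {n}")
    if blocked:
        lines.append("blocked: " + ", ".join(blocked))
    return "\n".join(lines)

tickets = [
    {"key": "PROJ-1", "status": "Done"},
    {"key": "PROJ-2", "status": "In Progress", "blocked": True},
    {"key": "PROJ-3", "status": "To Do"},
]
print(summarize_tickets(tickets))
```

The summary is deterministic and a few hundred tokens no matter how many tickets there are, which is exactly the compression you want out of a tool call.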
Another small data point, I think people would prefer to ask questions of an AI model instead of reading the generated summaries.
This is exactly what we're working on, is there any application in particular you're interested in the most?
> I'm struggling collecting actual data I could use for fine-tuning myself,
Journalling or otherwise writing is by far the best way to do this IMO but it doesn't take very much audio to accurately do a voice-clone. The hard thing about journalling is that it can actually be really biased away from the actual "distribution" of you, whether it's more aspirational or emotional or less rigorous/precise with language.
What I'm starting to do is save as many of my prompts as possible, because I realized a lot of my professional writing was there, and it was actually pretty valuable data (especially paired with outputs and knowledge of what went well and what didn't) for finetuning on my own workloads. Second is assembling/curating a collection of tools and products that I can drop into each new context with LLMs and also use for finetuning them on my own needs. Unlike "knowledge repositories", these both accurately model my actual needs and work, and don't really require me to do anything unnatural.
The other thing I'm about to start doing is "natural" in a certain sense but kinda weird, basically recording myself talking to my computer (verbalizing my thoughts more so it can be embedded alongside my actions, which may be much sparser from the computer's perspective) / screen recordings of my session as I work with it. This is something I've had to look into building more specialized tools for, because it creates too much data to save all of it. But basically there are small models, transcoding libraries, and pipelines you can use for audio/temporal/visual segmentation and transcription to compress the data back down into tokens and normal-sized images.
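One cheap trick from that kind of pipeline, sketched in toy form: drop near-duplicate screen frames before any model sees them (an exact hash stands in here for a real perceptual diff or small vision model):

```python
import hashlib

def dedupe_frames(frames):
    """Keep only the indices of frames that differ from the previously
    kept frame. An exact content hash is the cheapest stand-in for a
    perceptual diff; a real pipeline would tolerate small pixel noise."""
    kept, last = [], None
    for i, frame in enumerate(frames):
        h = hashlib.sha256(frame).hexdigest()
        if h != last:
            kept.append(i)
            last = h
    return kept

# A screen that sits idle produces long runs of identical frames.
frames = [b"desk", b"desk", b"desk", b"editor", b"editor", b"desk"]
print(dedupe_frames(frames))  # → [0, 3, 5]
```

Most of a workday's screen recording is idle time, so even this naive filter cuts the data volume dramatically before the segmentation/transcription models run.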
This is basically creating a semantic search engine of yourself as you work, kinda weird, but IMO it's just much weirder that your computer can actually talk back and learn about you now. With 96GB you can definitely do it BTW. I successfully finetuned an audio workload on gemma 4 2b yesterday on a 16GB mac mini. With 96GB you could do a lot.
> letting LLMs write docs and add them to a "knowledge repository"
I think what you actually want is to send them to go looking for stuff for you, or actively seeking out "learning" about something for their own role/purposes, so they can embed the useful information and better retrieve it when they need it, or produce traces grounded in positive signals (e.g. having access to this piece of information or tool, or applying this technique or pattern, measurably improves performance at something in-distribution to whatever you have them working on) that they can use in fine-tuning themselves.
I think maybe you're misunderstanding the issue here. I have loads of data, but I'm unwilling to send it to 3rd parties, so that leaves me with gathering/generating the training data locally, but none of the models are good/strong enough for that today.
I'd love to "send them to go looking for stuff for you", but local models aren't great at this today, even with beefy hardware, and since that's about my only option, that leaves me unable to get sessions to use for the fine-tuning in the first place.
Right, that's exactly the situation I'm in too and "send them to go looking for stuff for you" without it going off the rails is the problem we've been working on.
Basically you need a squad of specialized models to do this in a mostly-structured way that ends up looking kind of like a crawling or scraping/search operation. I can share a stack of about 5-6 that are working for us directly if you want, I want to keep the exact stack on the DL for now but you can check my company's recent github activity to get an idea of it. It's basically a "browser agent" where gemma or qwen guide the general navigation/summarization but mostly focus on information extraction and normalization.
The other thing I've done, which obviously not everybody is going to want to do, is create emails and browser profiles for the browser agent (since they basically work when I'm not on the computer, but need identity to navigate the web) and run them on devices that don't have the keys to the kingdom. I also give them my phone number and their own (via an endpoint they can only call me from). That way if they run into something they have a way to escalate it, and I can do limited steering out of the loop. Obviously this is way more work than is reasonable for most people right now though so I'm hoping to show people a proper batteries-included setup for it soon.
Edit: Based on your other comment, I think maybe what you're really looking for most are "personal traces". Right now that's something we're working on with https://github.com/accretional/chromerpc (which uses the lower-level Chrome DevTools Protocol rather than Puppeteer to basically fully automate web navigation, either through an LLM or prescriptive workflows). It would be very simple to set up automation to take a screenshot and save it locally every Xm or in response to certain events, and generate traces for yourself that way, if you want. That alone provides a pretty strong base for a personal dataset.
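If it helps, the laziest version is something like this (toy sketch; capture() is a stand-in for whatever screenshot call your stack exposes, not chromerpc's actual API):

```python
import pathlib
import time

def trace_loop(capture, out_dir, interval_s=300, max_frames=None):
    """Periodically save screenshots as a personal trace.
    capture() should return PNG bytes; it's a placeholder for whatever
    screenshot mechanism you actually use (CDP, OS-level, etc.)."""
    out = pathlib.Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    n = 0
    while max_frames is None or n < max_frames:
        (out / f"trace-{n:06d}.png").write_bytes(capture())
        n += 1
        if max_frames is None or n < max_frames:
            time.sleep(interval_s)
    return n
```

Run it in the background with interval_s set to whatever cadence you can stomach storage-wise, and you have a dataset accumulating from day one.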
> that ends up looking kind of like a crawling or scraping/search operation
Sure, but what I'm talking about is that the current SOTA models are terrible even for specialized small use cases like what you describe, so you can't just throw a local model at that task and get useful sessions out of it that you can use for fine-tuning. If you want distilled data or similar, you (obviously) need to use a better model, but currently there is none that provides the privacy guarantees I need, as described earlier.
All of those things come once you have something suitable for the individual pieces, but I'm trying to say that none of the current local models come close to solving the individual pieces, so all that other stuff is just distraction before you have that in place.
Understood. I guess I'm saying "soon", but definitely agreed it's not "now" yet. I will say though, with 96GB, in a couple months you're going to be able to hold tons of Gemma 4 LoRA "specialists" in memory at the same time, and I really think it will feel like a whole new world once these are all getting trained and shared and adapted en masse. And also, you could set up personal traces now if you want. Nobody can make you, but in its laziest form it can be literally just taking screenshots of your screen periodically as you work, and that'll have applications soon.
> And also, you could set up personal traces now if you want. Nobody can make you, but in its laziest form it can be literally just
But again, you're missing my point :) I cannot, since the models I could generate useful traces from are run by platforms I'm not willing to hand over very private data to, and local models that I could use I cannot get useful traces from.
And I'm not holding out hope for agent orchestration; people haven't even figured out how to reliably get high-quality results from a single agent yet, even less so with a fleet of them. Better to realistically temper your expectations a bit :)
To make the most of these architectures I think the key is essentially moving more of the knowledge/capabilities out of the "weights" and into the complementary parts of the system in a way that's proportionate to the capabilities of the hardware.
In the past couple months there's been a kind of explosion in small models occupying a niche in this kind of AI-transcoding space. What I'm hoping we're right on the cusp of is a similar explosion in what I'd call tool-adaptation, where an LLM paired with some mostly-fixed suite of tools and problem cases can trade off some generality for a specialized (potentially hyper-specialized to the company or user) role.
The thing about transcoding-related tasks is that they generally stay in sync with what the user of the device is actively doing, which will also typically be closely aligned with the capabilities of the user's hardware and what they want to do with their computer. Most people aren't being intentional about this kind of stuff right now, partly out of habit I think, because only just now does it make sense to think of a personal computer as "stranded hardware", now that it can be steered/programmed somewhat autonomously.
I'm wondering if, with the right approach to MoE on local devices (which local LLMs are heading towards), we could basically amortize the expensive hit from loading weights in and out of VRAM through some kind of extreme batch use case that users still find useful enough to be worth the latency. LoRA is already really useful for this, but obviously sometimes you need more expertise/specialization than just a few layers' difference. Experimenting with this right now. It's the same basic principle as in the paper, except less a technical optimization and more a workload optimization. Also, it's literally the beginning of machine culture, so that's kind of cool.
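The amortization idea in toy form (made-up cost numbers, just to show why grouping requests by adapter pays the weight load once per batch instead of once per request):

```python
from collections import defaultdict

def run_batched(requests, load_cost=100, infer_cost=1):
    """Toy cost model: loading an adapter's weights into VRAM costs
    load_cost, each inference costs infer_cost. Grouping requests by
    adapter pays the load once per adapter instead of once per request."""
    by_adapter = defaultdict(list)
    for adapter, prompt in requests:
        by_adapter[adapter].append(prompt)
    cost = 0
    for adapter, prompts in by_adapter.items():
        cost += load_cost + infer_cost * len(prompts)  # one load per batch
    return cost

reqs = [("code", "p1"), ("audio", "p2"), ("code", "p3"), ("code", "p4")]
naive = sum(100 + 1 for _ in reqs)       # reload on every request
print(naive, run_batched(reqs))          # → 404 204
```

The gap widens with batch size, which is why "extreme batch" workloads that users will tolerate latency on are the interesting case.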
> To make the most of these architectures I think the key is essentially moving more of the knowledge/capabilities out of the "weights" and into the complementary parts of the system in a way that's proportionate to the capabilities of the hardware
I think that's only possible to a limited extent. Learnt skills (RL in the context of an LLM?) need to be in the weights of the model, since this reflects the model's "personalized" learning of the behavioral feedback loop. Declarative knowledge (facts) can be loaded at runtime (RAG).
That's interesting. So you want to train language, linguistic reasoning, and tool use, but otherwise strip out all knowledge in lieu of a massive context? Just grade the model on how well it can access local information, and perhaps also run tools?
Hey, I was literally just working on this today (I was racing ahead on an audio FT myself, but OP beat me by a few hours). For audio inference, definitely try running your input through VAD first to drop junk data, if necessary as one of several preprocessing steps before sending the audio to the large model. You can check out how I did it here: https://github.com/accretional/vad/blob/main/pkg/vad/vad.go
VAD is absurdly time-effective (I think like O(10s) to segment 1hr of audio or something) and reduces the false positive rate/cost of transcription and multimodal inference since you can just pass small bits of segmented audio into another model specializing in that, then encode it as text before passing it to the expensive model.
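The basic idea, in a stdlib-only toy sketch (my repo does real VAD; this just uses frame energy, the crudest possible voice/silence signal, to show where the savings come from):

```python
def segment_speech(samples, rate, frame_ms=30, threshold=0.1):
    """Split audio into (start_s, end_s) spans whose RMS energy exceeds a
    threshold. Real VADs (e.g. WebRTC, Silero) model speech rather than
    raw energy; the payoff is the same either way: transcribe only the
    voiced spans, skip the silence."""
    frame = int(rate * frame_ms / 1000)
    spans, start = [], None
    n_frames = len(samples) // frame
    for i in range(n_frames):
        chunk = samples[i * frame:(i + 1) * frame]
        rms = (sum(x * x for x in chunk) / len(chunk)) ** 0.5
        voiced = rms > threshold
        if voiced and start is None:
            start = i * frame / rate          # span opens
        elif not voiced and start is not None:
            spans.append((start, i * frame / rate))  # span closes
            start = None
    if start is not None:
        spans.append((start, n_frames * frame / rate))
    return spans

# 1s silence, 1s "speech", 1s silence (toy 1kHz rate for the demo)
audio = [0.0] * 1000 + [0.5] * 1000 + [0.0] * 1000
print(segment_speech(audio, rate=1000, frame_ms=50))  # → [(1.0, 2.0)]
```

On an hour of mostly-quiet desk audio this kind of gate is what lets you only pay transcription cost for the few minutes that actually contain voice.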
Excellent work still; your repo is much more robust and fleshed out, and I am just beelining straight to audio LoRA, not really knowing what I'm doing, as this is my first time attempting a ~real ML training project.
Definitely interested in swapping notes if you are, though. Probably the biggest thing that came out of this exercise for us was realizing that Apple actually has some really powerful local inference/data-processing tools; they're just marketed much more towards application developers, so a lot of them fly under the radar.
We just published https://github.com/accretional/macos-vision to make Apple's local OCR, image segmentation, foreground masking, facial analysis, classification, and video tracking functionality accessible via CLI, and hopefully more common in ML and data workloads. Hopefully you or someone else can get some use out of it. I definitely will from yours!
Here’s the trick: use Gemini Pro deep research to create “Advanced Hacker’s Field Guide for X” where X is the problem that you are trying to solve. Ask for all the known issues, common bugs, unintuitive patterns, etc. Get very detailed if you want.
Then feed that to Claude / Codex / Cursor. Basically, create a cheat sheet for your AI agents.
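Something like this, if you want it scripted (the template wording is just my guess at a starting point; tweak it for your domain):

```python
def field_guide_prompt(topic):
    """Hypothetical prompt template for the deep-research step."""
    return (
        f"Write an 'Advanced Hacker's Field Guide to {topic}'. "
        "Cover all known issues, common bugs, unintuitive patterns, "
        "version-specific gotchas, and workarounds, with concrete examples."
    )

# Save the research output, then prepend it to your coding agent's context:
cheat_sheet = "...deep research output goes here..."   # placeholder
agent_context = field_guide_prompt("SQLite WAL mode") + "\n\n" + cheat_sheet
```

The key is that the expensive research happens once, and every subsequent agent run gets the distilled gotcha list for free.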
I think, realistically, businesses in other parts of the world have no incentive to fully enforce ethical provenance across the entire supply chain for these kinds of products, and in most cases entirely lack the capability anyway. You'd have to run some kind of ATF-kinda thing in a third-world country where official rule of law is already dicey or absent.
They are so paranoid about scraping, or about someone building automations on top of their app that they don't want you to have, that they are willing to make their actual application borderline unusable for the power users who would actually be willing to pay for their first-party upsells and features.
It's infuriating. I have literally tried all of their paid products in various forms (they are expensive but the value is clearly there if you're a business). If only they invested as much in making them actually good as they did in preventing you from using extensions or other tools to implement the features they can't or won't, I'm sure they'd get a lot more business.
That was also a last-ditch effort to maintain pre-WW2 geopolitical structures rather than a bipolar US-sphere vs. Soviet-sphere world. Note that this was basically the nail in the coffin that led to their full-fledged decolonization in the following years. At the time the UK still held very significant military and political sway over the Middle East, East Africa, and Asia.