
1. Factory limits, basically. There's a limit to the number of fabrication lines that can produce RAM. Combine that with the current market incentive to make high bandwidth memory (HBM) over standard server DRAM: HBM starts as DRAM dies, so it competes with normal DRAM for wafer starts / cleanroom fab capacity.

2. Eventually more plants will come online. Most of the main manufacturers have announced expansions, but these can take O(years) to complete.


Are more plants coming? I think I heard it won't be many of them, because it's risky.

If the bubble bursts and RAM demand drops, then they'll have big losses. And that's not an impossible scenario over the X years that it takes to build a plant.


Not particularly. I'm not yet convinced people's mouse movements are unique enough to our identity that they're useful as a fingerprint, whereas it's very easy to classify whether something looks bezier or looks human.

Eventually I'm hoping to collect enough data here to train a biased decoding model, so you could input some randomized personality vector (which implicitly encodes slow movement, jerky motion, trackpad, mouse, etc) and have that impact the RNN generation. So in theory there would be infinite combinations from the larger subspace we're sampling from.
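One way to see why a plain bezier path is easy to spot: a cubic Bezier sampled at evenly spaced parameter values is a cubic polynomial in each coordinate, so its fourth finite differences vanish, while jittery input does not have that property. The sketch below is only an illustration under that assumption (evenly spaced samples, no easing); the function names are hypothetical and it is not the classifier discussed above.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def cubic_bezier(p0: Point, p1: Point, p2: Point, p3: Point, steps: int) -> List[Point]:
    """Sample a cubic Bezier curve at evenly spaced parameter values t in [0, 1]."""
    points = []
    for i in range(steps + 1):
        t = i / steps
        u = 1.0 - t
        x = u**3 * p0[0] + 3 * u**2 * t * p1[0] + 3 * u * t**2 * p2[0] + t**3 * p3[0]
        y = u**3 * p0[1] + 3 * u**2 * t * p1[1] + 3 * u * t**2 * p2[1] + t**3 * p3[1]
        points.append((x, y))
    return points


def max_fourth_difference(points: List[Point]) -> float:
    """Largest absolute fourth finite difference over both coordinates.

    For a cubic polynomial sampled uniformly this is zero up to rounding error.
    Requires at least five points.
    """
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    for _ in range(4):
        xs = [b - a for a, b in zip(xs, xs[1:])]
        ys = [b - a for a, b in zip(ys, ys[1:])]
    return max(abs(v) for v in xs + ys)


# A synthetic "bot" path versus the same path with alternating half-pixel jitter.
smooth = cubic_bezier((0, 0), (120, 300), (380, -50), (500, 200), steps=50)
jittery = [(x + 0.5 * (-1) ** i, y) for i, (x, y) in enumerate(smooth)]
```

Real automation tools often add easing or noise on top of the curve, so a check like this would only catch the naive case.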


Could look into addressing this. What are you trying to achieve?


So much of what Apple has lost over the last 10 years comes down to a lower bar for what counts as good enough.

You see this most obviously in software and marketing - the kinds of decisions where only a few people sign off at the end, and where "good enough" is whatever those few people decide it is. You see it less in hardware and procurement where there's a powerful review cycle and scrutiny at every level of the stack. Work there is more immediately measurable: benchmarks for performance, dollars for cost.

The "vibe" of software, or of a PDF [^1], is much harder to catch that way. There's no benchmark that flags it and most conventional executives aren't drilling down in that level of detail to see it either.

You want distributed decision-making, of course. But that only works well if it's distributed to people who've cultivated their own taste and who will make good calls under pressure. I'm not sure how much of that gets fixed by leadership change at the top. Taste isn't really something a CEO can decree into a 60,000 person org. But I've only heard good things about Ternus, so I'm optimistic. Fingers crossed for a bright new chapter.

[^1]: https://www.apple.com/promo/pdf/US_FY26_Earth_Day_Promo_Tand...


Digitizing my old tapes was one of the most rewarding side projects that I did over the last year. I managed to get in under the wire (pun intended) of Firewire compatibility on Sequoia and a long daisy-chain of adapters. But it was clear the days of this approach were numbered. I'm optimistic these 3rd party accessories will become more standardized into self-contained cheap boxes where people can easily transfer over their stuff before camcorders degrade.

My pipeline went camera -> dvrescue -> ffmpeg -> clip chunking -> gemini for auto tagging of family members and locations where things were shot.

We now have all our family's footage hosted on a NAS with Jellyfin serving over Tailscale to my parents' MacBooks. I found the clip chunking in particular made the footage a lot more watchable than just importing the two-hour-long tapes, although ymmv.
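For anyone wanting to try the chunking step, here is a minimal sketch that builds an ffmpeg command using the segment muxer (`-f segment`, `-segment_time`, `-reset_timestamps`). It assumes fixed-length chunks, which may not be how the clips above were actually split; the function name, file names, and the 180-second default are hypothetical.

```python
from typing import List


def build_chunk_command(source: str, out_pattern: str, seconds: int = 180) -> List[str]:
    """Build an ffmpeg command that splits `source` into fixed-length chunks.

    Streams are copied rather than re-encoded, so cuts land on keyframes.
    `out_pattern` is a printf-style pattern such as "clip_%03d.mp4".
    """
    return [
        "ffmpeg",
        "-i", source,
        "-map", "0",               # keep every stream from the input
        "-c", "copy",              # stream copy, no re-encode
        "-f", "segment",           # ffmpeg's segment muxer
        "-segment_time", str(seconds),
        "-reset_timestamps", "1",  # each chunk starts at timestamp zero
        out_pattern,
    ]


# Hypothetical file names, for illustration only.
command = build_chunk_command("tape_01.mp4", "tape_01_clip_%03d.mp4")
```

The list can be passed to `subprocess.run(command, check=True)` on a machine with ffmpeg installed.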


I am going to finish such a project soon myself, including some old Video8 tapes! Sounds like you're on macOS. Any reason you didn't use iMovie for the capture itself?

The Video8 tapes have already been digitized via a Digital8 camcorder, but apparently you can get even better quality out of old analog tapes with the vhsdecode project. Let's see if I ever get around to that, but at least it bypasses Firewire entirely: https://github.com/oyvindln/vhs-decode https://www.reddit.com/r/vhsdecode/


Mostly wanted to fully automate the pipeline (auto-rewind tape, scan tape head position, etc) and iMovie is just using the same AVFoundation APIs under the hood that you can call manually. Took some notes here if helpful: https://pierce.dev/notes/automating-our-home-video-imports

Wish vhsdecode was easier to use in practice! Such a cool idea but a bit too inconvenient to hack your own hardware like this...


I used dvgrab to ingest my old tapes, and ffmpeg and avisynth/QTGMC to deinterlace and encode files for easy viewing (though I keep the original .dv files).

The biggest issue I ran into was that while the audio and video were properly synced up in the original .dv file (due to it being an interleaved format), when I re-encoded the videos, the audio and video would drift out of sync as the video went on.

I was able to fix the sync issues by using dvgrab to split the original dv file into a bunch of 3 minute chunks. I then wrote a script to extract the audio track from each chunk, pad the end of the audio with milliseconds of silence to the exact length of the video track, combine the padded audio tracks, encode the combined track, and mux the fixed audio track with the encoded video. This worked really well; the silence padding is imperceptible, but the audio and video are still in sync - even after 2 hours.

A final point that needs making is that doing anything with dv files in ffmpeg (even -c:v copy) destroys the SMPTE timecodes embedded in the original file, making it much harder to split by scene.


Just because I've dealt with this exact issue in the past, it may have been a 30fps vs 29.97fps issue. For me the audio was a fixed length, but the frame rate was SLIGHTLY too fast. The problem can manifest as either too slow or too fast depending on which side is expecting 30fps vs 29.97fps.
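To put a number on that kind of mismatch, here is a small worked calculation. It assumes the nominal NTSC rate of 30000/1001 fps (about 29.97) on one side and exactly 30 fps on the other; the function name is hypothetical.

```python
from fractions import Fraction

NTSC_FPS = Fraction(30000, 1001)  # nominal "29.97" frame rate


def drift_seconds(duration_s, true_fps=NTSC_FPS, assumed_fps=Fraction(30)):
    """Seconds of A/V drift after `duration_s` seconds of footage.

    Frames recorded at `true_fps` are played back at `assumed_fps`, while the
    audio keeps its original length. A positive result means the video
    finishes early, i.e. it runs too fast relative to the audio.
    """
    frames = Fraction(duration_s) * true_fps
    played_duration = frames / assumed_fps
    return Fraction(duration_s) - played_duration


two_hour_drift = drift_seconds(2 * 60 * 60)
print(float(two_hour_drift))  # roughly 7.19 seconds over a two-hour tape
```

Swapping the two rates flips the sign, which matches the "too slow or too fast depending on which side" behaviour described above.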


I think it was just clock drift on the camcorder during the initial recording, as I'm pretty sure I tried adjusting the frequency of the audio track to make it the same duration as the video track, and the A/V sync was still wrong.

I'm so glad the audio and video tracks are stored interleaved, as it made my solution possible, and the results I got were great. By splitting the interleaved video into small enough chunks, padding the audio, and cutting it exactly to video length, the padding was practically imperceptible.

The only issue I ran into was that ffmpeg can't cut audio with any real precision. I eventually figured out that I could dump the audio track to a headerless PCM file, calculate the exact byte offsets for my cut points, and cut them with perfect precision using the head and tail commands from GNU coreutils. This was perfect because I was able to use the cat command to combine all of the padded audio chunks into a single raw PCM file, which I then made an AAC encode of with ffmpeg to mux with my original encoded video track.
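The byte-offset arithmetic can be sketched as below. This is not the script described above: it uses Python slices in place of head, tail and cat, and it assumes 48 kHz, stereo, signed 16-bit PCM, where silence is all zero bytes. The function names are hypothetical.

```python
def byte_offset(seconds: float, sample_rate: int = 48000,
                channels: int = 2, bytes_per_sample: int = 2) -> int:
    """Byte offset of a cut point in headerless PCM, aligned to a whole sample frame."""
    sample_frames = round(seconds * sample_rate)
    return sample_frames * channels * bytes_per_sample


def fit_to_length(pcm: bytes, target_seconds: float, sample_rate: int = 48000,
                  channels: int = 2, bytes_per_sample: int = 2) -> bytes:
    """Trim or zero-pad raw PCM so it lasts exactly `target_seconds`."""
    target = byte_offset(target_seconds, sample_rate, channels, bytes_per_sample)
    if len(pcm) >= target:
        return pcm[:target]                        # cut at the exact byte offset
    return pcm + b"\x00" * (target - len(pcm))     # pad the tail with silence


# Two chunks that are slightly short and slightly long for a 1-second video chunk.
short_chunk = b"\x01" * 191_000
long_chunk = b"\x01" * 193_000
combined = fit_to_length(short_chunk, 1.0) + fit_to_length(long_chunk, 1.0)
```

Concatenating the fitted chunks plays the role of `cat` here; the combined raw PCM would then be encoded and muxed separately.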


This is very likely it


Transcode to another format first that keeps the timecode?


Ffmpeg's dvvideo implementation is unfortunately just broken and mangles timecodes, even if just doing a stream copy from dvvideo to dvvideo without any re-encoding.

Fortunately, dvgrab does allow you to take the original .dv file and generate a .srt subtitle track with time stamps that you can mux into your encoded files.


If you are capturing I find dvgrab is pretty good. It's what I've been using for about 25 years now!

In the olden days, when I got paid to shoot real video on a VX2000 and edit it for people, I captured using a PCI Firewire card and dvgrab in Slackware, rewrapped with probably mencoder (shading towards ffmpeg when it became more popular, and developed!), dual-booted into Windows 2000 to cut in Premiere 5.0, then went back into Linux to transcode back to DV if I wanted to write it out to DV tape.

These days I shoot on a PD150 or DSR500 (and quite often some HDV cameras), capture via a PCIe Firewire card and dvgrab in Ubuntu, rewrap with ffmpeg, and edit in Resolve, without the dual-booting step.

If you use dvgrab it will split the capture up into separate clips on shot boundaries based on the pause/unpause markers on the tape. I have not found a way to extract good/no good from the stream, but if you're not shooting on a broadcast camera you don't have this anyway. Timecode is preserved though!

When you load it all up in Resolve, one of the options in the Cut page is "Source Tape View" which runs all your clips together by timecode, and lets you view them as though they were a continuous tape of your rushes, which is how we used to do basic assemble editing in the olden days of clunky tape decks and edit controllers with big rows of red 7-segment displays.

Edit your old home videos. You can do that now, and they'll be far more watchable.


A few years ago I did a bit more of a crude flow.

Play the footage on a tv in a dark room. Place a 4k camera on a tripod and record the tv with audio into the camera audio port.

Worked perfectly.


Actually not a terrible way to go from interlaced to progressive footage. Depending on the TV and camera


> gemini for auto tagging of family members

With all respect, reading this part made me feel uneasy.


Went through a very similar journey recently as well. In my case using a Macbook was a non-starter, as certain adapters are prohibitively expensive these days, if you can even get your hands on one. Thankfully my son has a desktop Windows PC and Firewire PCI cards are cheap and plentiful, so getting connected that way worked out. Much better than an earlier attempt via RCA cables (simple but digital -> analog -> digital is not the way to go).

My pipeline was camera -> WinDV -> DVdate (to extract exact datetimes into srt subtitles) -> Handbrake (to convert to mp4).


> Digitizing my old tapes was one of the most rewarding side projects

I also wanted to do that, but then I realised I needed to invest more time and may need some hardware, so one day I simply had enough, went to a commercial shop and had them turn all the old stuff into digital. The cost wasn't that huge either, so considering that I could also save time (doing it myself), I am ok with that investment. Hopefully the future has digital everywhere. Storage to be cheaper too, ideally.


Can you expand on the Gemini tagging part? What did you do with the tags, import them into Jellyfin after cutting the videos into parts?


Is it possible to accomplish tagging with local AI instead of Gemini?


As far as I've seen, local OSS video understanding models just really aren't there yet. I briefly looked at facial recognition models but a good amount of signal was actually in the video's audio instead of the raw video frames. Depends on the accuracy you're looking for at the end of the day.


Thanks for the reply. Let's hope local models catch up.


Waymo is such an interesting case study. For most other ~AI deployments you have strong public reaction to the proliferation of slop, non-human failure modes, cost cutting at the expense of quality, etc. But I haven't met a single person who doesn't like the experience of Waymo. They ended up cracking the code on what I suspect people really want:

- consistent car quality

- safety of the drive (conservative driving and potential fear of drivers)

- no randomly chatty driver

All of those feel like a breath of fresh air especially when stacked up against the current state of Uber & Lyft rides. People really just want consistency. I don't actually think you needed AI to get there (I've had occasional rides in black cars that provided the same experience). Waymo was just right time, right place, right price.


> but I haven't met a single person who doesn't like the experience of Waymo.

Just last week a Waymo was driving on train tracks and the rider had to jump out of the car and run because the car stopped while trains came at it. (https://www.youtube.com/watch?v=26KJvL2clTs) I bet that guy'd have something to say about the experience.


Yeah that's obviously not great but that video is nothing like what you described. You made it sound like it drove onto a mainline train track with a train barreling down the tracks that couldn't stop with the guy diving out of the car to avoid getting clobbered. It did not, it got stuck on a tram track. Not quite the same thing.


not having to talk to the driver and picking my own music are my fav parts. the novelty wears off quick and it becomes normal


I've had Waymos in SF take very strange routes. It seemed to really strongly avoid ever using Market St, generally preferring a long right-angle route over the perfect hypotenuse. Sometimes this delayed me very considerably, doubling my ride time compared to the Google Maps estimated time.

That said, I've never felt unsafe or uncomfortable. But I have jumped out halfway through the ride and grabbed an eScooter instead.


Market used to be closed to all cars (2021-2025); only taxis and busses were allowed but that changed recently:

https://www.sfmta.com/blog/creating-better-market-street-car... https://www.planetizen.com/news/2025/08/135849-sfs-market-st...

Wonder if that explains your observed preference. I'd bet Waymos will start utilizing the route again if it aligns with Google's mapping algo.


Back when I had to drive/walk in SF, I would also go quite out of my way to avoid market or mission. Especially near 6th. Self-preservation and whatnot...


There's a lot of complaints about externalities, especially when a power cut stopped all the vehicles in a city recently.


I'm not commenting on the externalities. For that I'd also cite economic impact, job loss, occasional emergency services issues, etc. I'm saying the experience when you yourself are taking a ride. I haven't met a single person who's said "this sucked - I'm going back to Uber".


I think parent was talking about how users of the service were very satisfied with it, not about externalities.


My first and only Waymo ride was super sketch. Car slowed down to ~5mph in a 35mph zone and stayed that way for 5+ minutes as other cars were swerving around us. Felt like it was going to come to a complete stop in the middle of the road. I prefer real humans.


What you're getting at is basically the difference between probabilistic models vs deterministic ones.


waymo is also a probabilistic deep learning system


Tried calling it and it left without picking us up.


At the risk of being overly pedantic, toxicologists would typically classify this as venom.

Venom is inert if digested; it's only a problem if it gets in your blood stream. So arrows that were laced with venom and thereby contaminated meat were actually perfectly safe to eat.

Poison is different. If ingested, inhaled, or absorbed it will kill you.


We Dutch solve this problem by having a single word for "poison", "venom" and "toxin"¹. Everybody still knows what you mean and nobody gets to be pedantic.

¹ and "badly compressed looping animation"


Same in Portuguese, veneno.

Although there are plenty of other opportunities for pedantry, especially when we take regionalisms, and other Portuguese speaking countries into account.


Is the word "stamppot" ?


Just "food". Any kind of Dutch food fits the description.


This is true, notably a kroket is both looping and badly compressed.


Vergif.

I don't know how you get from 'ver' to badly compressed.

(And I'm a native Flemish speaker, but living in the USA for 8+ years, so I barely, if ever speak it).


Remove Ver, add t and you got German: Gift

Vergiftet would be past tense.

Funny that in English gift is a word but entirely different meaning.

Languages are fun, especially in Europe where they're all different but all so related but everyone does not want to admit it.


> Funny that in English gift is a word but entirely different meaning.

In English it maintains its original Germanic meaning derived from the verb give.

The sense of "poison" in German comes from a euphemistic use of "gift". (Literally 'something given' but actually used to calque Greek "dosis", which also literally meant 'something given', but was used to mean 'dose [of medicine]'.)

https://en.wiktionary.org/wiki/Gift#Etymology

Summing up, the reason gift is a word in English with an entirely different meaning from what it has in German is that everyone in Germany forgot what gift meant.

(The reason it's gift and not something more like yift is the Danelaw.)


This is one of the reasons I like HN: Random knowledge transfer like this. Appreciated!

Also: in German Dosis is the word for dose.

    Die Dosis macht das Gift
(the dose makes the poison)


It's probably the same; for example, in Afrikaans it's just gif. Vergif is the verb for the action of doing it, and vergiftig is the past tense, for it having happened previously.


In Norwegian, "gift" is poison. It's also the word for married (de er gift).


In German "Mitgift" is what the bride gets from her family when she enters marriage.


> all so related but everyone does not want to admit it.

I'm laughing in Finnish..


Hehe, you found the exception that proves the rule :P


And Basque, Maltese, Turkish and Georgian.

Magyar (Hungarian) and Finnish are both Uralic languages along with Estonian and the Sámi languages, but none of these are related to the Indo-European languages common in the other parts of Europe.

And while most of Europe’s extant languages are in the Indo-European language family, there’s still a fair number of differences between Albanian, Germanic, Hellenic, Celtic, Romance and Slavic languages.


Oh for sure there are many differences, that comes with them being different languages, countries, ethnicity. You can do this on many levels.

The point was essentially what you're showing here: People focusing on all the differences instead of shared history, languages influencing each other and how we're all not that different in the end.

If you want to, even within what are nowadays countries and what outsiders would say is "one language" and "one ethnicity", you can start focusing on differences and make people dislike each other.


That’s fair. I tunneled in through a linguistic lens.


In NL, just 'gif' is sufficient


Same in Chinese (毒). But it is a better solution just not to give pedants the time of the day.


You can't really, can you?

At the very least, they'd complain about accuracy, if not time zone, or even how we should all be on UTC (do not get one started on the difference between GMT and UTC if you value your... time)


Same in Polish. You'd just call both of these "trucizna".


Not really, we have both „jad” (venom) and „trucizna” (poison).


How does this happen ? The poster above you isn't really Polish ? How can someone that claims to know Polish not know there's two different words ?


Obviously I know "jad" but I don't see any issue with calling venom "trucizna". Natural languages aren't C++ and you don't get compiler errors when you speak - to me, there is no issue calling both venoms and poison trucizna. Polish dictionary doesn't seem to contradict it either:

https://sjp.pwn.pl/slowniki/trucizna.html

The point is, both are correct(afaik) while in English venom and poison are definitely two different things.


Nobody would say „trujący wąż” (poisonous snake) or „jadowity grzyb” (venomous mushroom). The distinction is similar to English. There are exceptions and contexts where it can be used interchangeably but arguably the same is true for English.


>>Nobody would say „trujący wąż”

No? That's how I've always said it. "Ta żmija jest trująca" - don't see any issue here. Jadowity grzyb I'll agree.


This is fascinating, assuming you are both natives of Poland. Is there as much language variance in Poland as in, say, Italy ?


No idea how much variance there is in Italy so not sure how to answer that question.


Italy, the core remnant of the Roman Empire, has unmatched language diversity, often varies even from town to town. It's a colorful mosaic of micro cultures and customs where people from one region using different words for venom/poison is completely normal, in their local dialect. Everyone speaks standard Italian though.

You've never visited Italy ? They're not that far away and I'm sure you'll love it.


> The point is, both are correct(afaik) while in English venom and poison are definitely two different things.

No, the situation in English matches your description exactly: all of these things are called poison. The word venom is almost never used in natural speech.

Furthermore, if you ask English speakers what the difference between poison and venom is, by far the two most common responses will be "there isn't one" and "I don't know". icyfox is just looking to be annoying.

(Another popular option will probably be "it's called venom when you're talking about snakes", which explains roughly 100% of use of venom in natural speech.)


And in Russian we use "jad" ("яд" in Cyrillic) for both. Although there is the word "отрава", which can be used for poisons, and "яд" is closer to "venom", the difference is almost non-existent and both are often used interchangeably.


TIL. I always thought that "If it bite you -> you die = venom" and "If you eat, bite, touch -> you die = poison". But your differentiation makes more sense


That explains the words "venomous" and "poisonous" used of creatures.

It's different for the actual substances. Although it relates: a venomous creature that bites you will release its venom into your bloodstream.


>a venomous creature that bites you will release its venom into your bloodstream

unless it's a bee, wasp, hornet, scorpion, stingray, jellyfish, man-of-war, platypus, lionfish, stonefish, sea urchin, or catfish, which all have venom instead of poison, but the delivery mechanism of said venom isn't biting


I said "bite" echoing the comment I was replying to. Obviously the same applies, mutatis mutandis, to stinging etc.


If a venomous snake bites you, you die. If you bite a venomous snake, you live. If a poisonous snake bites you, you live. If you bite a poisonous snake, you die.

Or Hamlet's mother died by drinking poisoned wine. Hamlet died by being stabbed with an envenomed sword.


Not overly pedantic at all as it highlights that by using venom the hunters were able to eat what they shot.


These chemicals are derived from plants where even pedants would classify them as poisons.

The genus name Boophone is from the Greek bous = ox, and phontes= killer of, a clear warning that eating the plant can be fatal to livestock.


Huh, so telephone is killer of distance and Persephone is killer of… Persians? Grain? Vegetation?


You're mixing up phōnē (voice) and phonos (slaughter), but the truth about Persephone is actually more metal.

Her name predates Greek contacts with Persians, so the timeline doesn't fit. Instead, it comes from perthein (to destroy) + phonos, making her the "Bringer of Destruction". With a caveat that the etymology of her name is uncertain: https://en.wikipedia.org/wiki/Persephone#Name

I do like "killer of distance" for telephone, though. :)


> Instead, it comes from perthein (to destroy) + phonos, making her the "Bringer of Destruction". With a caveat that the etymology of her name is uncertain:

But... of all the theories listed there, perthein isn't among them.

And if the roots are "destroy" and "death", what would make her the "bringer" of destruction?


Fair point about the source, but the classification usually follows the mode of delivery, not the organism of origin.

Many plant-derived compounds function as venoms once introduced into the bloodstream (arrow coatings, darts, etc.), even if they’re also toxic when ingested. Curare is one example of a plant-based compound - lethal in blood, but largely harmless if eaten.

So while Boophone is absolutely a poison in the ecological sense, using it on arrows still fits the venom/toxin distinction better than a purely ingested poison. Otherwise why would people hunt with this if they got sick the second they ate the meat?


Is it really? We call it poison darts when hunters use poison from the poison dart frog to hunt.


Not pedantic, the two are different.

Thanks for clarifying.


In practice the difference is mostly semantics.

Venom is still almost always poisonous when eaten and poison is harmful when injected. 2-3% as dangerous when eaten vs injected only helps so much.


"mostly semantics"

Semantics: 1 (linguistics) the study of meanings

I am not sure what could be more important.

But perhaps you meant "word choice"?


What things are more important than the study of meanings in a linguistic context?

Well, semantics only covers an infinitesimal fraction of all meaning. Consider: if I inject arsenic into a snake's venom sac, is it now a venom? Nothing about your answer changes anything about what’s going on, yet you could still debate the question.

So when you say “what could be more important” I can only say that just about everything is more important.


But eating a rattlesnake and dying is a bad way of finding out that you have a stomach ulcer.


I am not a native speaker but I believe you are wrong. It is called poison dart for example. So injected toxins can be both called poisons and venoms.


In Spanish it's commonly "dardo venenoso" (venomous dart), not "dardo ponzoñoso" (poisonous dart). So it's probably incorrectly used in English.


Exactly half of these HN usernames actually exist. So either there are enough people on HN that follow common conventions for Gemini to guess from a more general distribution, or Gemini has memorized some of the more popular posters. The ones that are missing:

- aphyr_bot

- bio_hacker

- concerned_grandson

- cyborg_sec

- dang_fan

- edge_compute

- founder_jane

- glasshole2

- monad_lover

- muskwatch

- net_hacker

- oldtimer99

- persistence_is_key

- physics_lover

- policy_wonk

- pure_coder

- qemu_fan

- retro_fix

- skeptic_ai

- stock_watcher

Huge opportunity for someone to become the actual dang fan.


Before the AI stuff Google had those pop up quick answers when googling. So I googled something like three years ago, saw the answer, realized it was sourced from HN. Clicked the link, and lo and behold, I answered my own question. Look mah! Im on google! So I am not surprised at all that Google crawls HN enough to have it in their LLM.

I did chuckle at the 100% Rust Linux kernel. I like Rust, but that felt like a clever joke by the AI.


I laughed at the SQLite 4.0 release notes. They're on 3.51.x now. Another major release a decade from now sounds just about right.


That one got me as well - some pretty wild stuff about prompting the compiler, starship on the moon, and then there's SQLite 4.0


You can criticize it for many things but it seems to have comedic timing nailed.


The promise is backwards compatibility in the file format and C API until 2050.

https://sqlite.org/lts.html


I wouldn't be surprised if it went towards the LaTeX model instead, where there's essentially never another major version release. There's only so much functionality you need in a local-only database engine; I bet they're getting close to complete.


I'd love to see more ALTER TABLE functionality, and maybe MERGE, and definitely better JSON validation. None of that warrants a version bump, though.

You know what I'd really like, that would justify a version bump? CRDT. Automatically syncing local changes to a remote service, so e.g. an Android app could store data locally in SQLite, and the user could also log into a web site on their desktop and find all the data right there. The remote service need not be SQLite - in fact I'd prefer postgres. The service would also have to merge databases from all users into a single database... Or should I actually use postgres for authorisation but open each user's data in a replicated SQLite file? This is such a common issue, I'm surprised there isn't a canonical solution yet.
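For readers unfamiliar with the idea, here is a minimal sketch of one of the simplest CRDTs, a last-writer-wins map. It is not an SQLite feature or any existing product's API; the row layout `(timestamp, device_id, value)` and all names are assumptions made for illustration.

```python
from typing import Dict, Tuple

# Each key maps to (timestamp, device_id, value). Tuples compare left to right,
# so the newest timestamp wins and device_id breaks exact ties deterministically.
Entry = Tuple[int, str, str]


def merge(a: Dict[str, Entry], b: Dict[str, Entry]) -> Dict[str, Entry]:
    """Merge two replicas of a last-writer-wins map.

    The result does not depend on merge order, and merging a replica with
    itself changes nothing, which is what lets replicas sync in any order.
    """
    merged = dict(a)
    for key, entry in b.items():
        if key not in merged or entry > merged[key]:
            merged[key] = entry
    return merged


# A phone edited a note offline while a desktop edited it a little later.
phone = {"note:1": (100, "phone", "buy milk"), "note:2": (105, "phone", "call mum")}
desktop = {"note:1": (110, "desktop", "buy oat milk")}
synced = merge(phone, desktop)
```

Last-writer-wins silently discards the older edit, so it only suits data where that loss is acceptable; richer CRDTs exist for text and counters.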


I think the unified syncing, while neat, is way beyond what SQLite is really meant for. You'd get into so many niche situations dealing with out-of-sync master and slave 'databases' that it's hard to make an automated solution that covers them effectively, unless you force the schema into a transactional design for everything just to sort out update conflicts. E.g. your user has the app on two devices, uses one while it doesn't have an internet connection (altering the state), and then uses the app on another device before the original has a chance to sync.


Yes, it's a difficult problem. That's why I'd like it to be wrapped in a nice package away from my application logic.

Even a product that does this behind the scenes, by wrapping SQLite and exposing SQLite's wrapped interface, would be great. I'd pay for that.


If it had been about GIMP I would have laughed harder.


Be reasonable. It's only looking forward a single decade.


Every few years I stumble across the same java or mongodb issue. I google for it, find it on stackoverflow, and figure that it was me who wrote that very answer. Always have a good laugh when it happens.

Usually my memory regarding such things is quite good, but this one I keep forgetting, so much so that I don't remember what the issue is actually about xD


I've run into my own comments or blog posts more often than I care to admit...


Several decades into this, I assume all documentation I write is for my future self.

Beautifully self-serving while being a benefit to others.

Same thing with picking nails up in the road to prevent my/everyone’s flat tire.


ziggy42 is both a submitter of a story on the actual front page at the moment, and also in the AI generated future one.

See other comment where OP shared the prompt. They included a current copy of the front page for context. So it’s not so surprising that ziggy42 for example is in the generated page.

And for other usernames that are real but not currently on the home page, the LLM definitely has plenty of occurrences of HN comments and stories in its training data, so it’s not really surprising that it is able to include real usernames of people that post a lot. Their names will be occurring over and over in the training data.


one more reason to doubt that it's AI-generated


HN has been used to train LLMs for a while now, I think it was in the Pile even


It has also fetched the current page in the background, because the Jepsen post was recently on the front page.


I may die but my quips shall live forever


So many underscores for usernames, and yet, other than a newly created account, there was 1 other username with an underscore.


In 2032 new HN usernames must use underscores. It was part of the grandfathering process to help with moderating accounts generated after the AI singularity spammed too many new accounts.


my hypothesis is they trained it to snake case for lower case and that obsession carried over from programming to other spheres. It can't bring itself to make a lowercaseunseparatedname


Most LLMs, including Gemini (AFAIK), operate on tokens. lowercaseunseparatedname would be literally impossible for them to generate, unless they went out of their way to enhance the tokenizer. E.g. the LLM would need a special invisible separator token that it could output, and when preprocessing the training data the input would then be tokenized as "lowercase unseparated name" but with those invisible separators.

edit: It looks like it probably is a thing given it does sometimes output names like that. So the pattern is probably just too rare in the training data that the LLM almost always prefers to use actual separators like underscore.


The tokenization can represent uncommon words with multiple tokens. Inputting your example on https://platform.openai.com/tokenizer (GPT-4o) gives me (tokens separated by "|"):

    lower|case|un|se|parated|name
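A toy illustration of how a subword vocabulary can cover a string it has never seen as a whole word. This is greedy longest-match over a hypothetical vocabulary chosen to mirror the split above; real BPE tokenizers apply learned merge rules instead, so treat it as a sketch of the idea only.

```python
from typing import List, Set


def greedy_tokenize(text: str, vocab: Set[str]) -> List[str]:
    """Split `text` into the longest vocabulary pieces, left to right.

    Any character not covered by the vocabulary becomes its own token, so
    every input string can be represented.
    """
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first; a single character always matches.
        for j in range(len(text), i, -1):
            piece = text[i:j]
            if piece in vocab or j == i + 1:
                tokens.append(piece)
                i = j
                break
    return tokens


# Hypothetical vocabulary, not taken from any real tokenizer.
toy_vocab = {"lower", "case", "un", "se", "parated", "name"}
pieces = greedy_tokenize("lowercaseunseparatedname", toy_vocab)
print("|".join(pieces))  # lower|case|un|se|parated|name
```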


You can straight up ask Google to look up a Reddit or Hacker News user's post history. Some of it is probably just via search, because it's very recent, as in the last few days. Some of the older corpus includes deleted comments, so they must be scraping from Reddit archive APIs too, or using that deprecated Google history cache.


This is definitely based on a search or page fetch, because there are these which are all today's topics

- IBM to acquire OpenAI (Rumor) (bloomberg.com)

- Jepsen: NATS 4.2 (Still losing messages?) (jepsen.io)

- AI progress is stalling. Human equivalence was a mirage (garymarcus.com)


The OP mentioned pasting the current frontpage into the prompt.


What % of today’s front page submissions are from users that have existed 5-10 years+?

(Especially in datasets before this year?)

I’d bet half or more - but I’m not checking.


It does memorize. But that's not actually news... I remember ChatGPT 3.5 or the old 4.0 remembering some users on some subreddits, even naming the top users for each subreddit.

The thing is, most of the models were heavily post-trained to limit this...


That’s a lot more underscores than the actual distribution (I counted three users with underscores in their usernames among the first five pages of links atm).


either you only notice the xxx_yyy frequent posters or it's quite interesting that so many have this username format


Aw, I was actually a bit disappointed by how on the nose the usernames were, relative to their postings. Like the "Rust Linux Kernel" by rust_evangelist, "Fixing Lactose Intolerance" by bio_hacker, fixing a 2024 Framework by retro_fix, etc...


I was here first


We talked about this model in some depth on the last Pretrained episode: https://youtu.be/5weFerGhO84?si=Eh_92_9PPKyiTU_h&t=1743

Some interesting takeaways imo:

- Uses existing model backbones for text encoding & semantic tokens (why reinvent the wheel if you don't need to?)

- Trains on a whole lot of synthetic captions of different lengths, ostensibly generated using some existing vision LLM

- Solid text generation support is facilitated by training on all OCR'd text from the ground truth image. This seems to match how Nano Banana Pro got so good as well; I've seen its thinking tokens sketch out exactly what text to say in the image before it renders.


I used Serp via API many moons ago. The most interesting part of the company imo is their legal defense of different plans:

  Production - $150
  15,000 searches / month
  U.S. Legal Shield
i.e. "Our U.S. Legal Shield protects your right to crawl and parse public search engine data under the First Amendment. We assume scraping and parsing liability for customers on most recurring plans unless your usage is illegal."

I imagine at least some portion of companies use them just for this liability shield.


Sounds a lot like the old guarantee paid SSL certificate providers used to offer; pretty words, but meaningless in practice. (IIRC, no one ever got a payout from any of them.)

"We assume scraping and parsing liabilities for both domestic and foreign companies unless your usage is otherwise illegal" seems like a big loophole in it.


Couldn't this be laid out as, We assume scraping and parsing liability unless it is ruled as being illegal, in which case your use would be illegal and our liability shield wouldn't help you?


> unless your usage is illegal

Like copyright infringement of Google's search results?

