Hacker News

How does Codex learn the relationship between English and code?

Is it purely through the comments in the training corpus?



It's really interesting. HN's HTML is very un-semantic and is actually quite hard to work with.

    <tr class="athing" id="28191639">
      <td class="title" valign="top" align="right"><span class="rank">9.</span></td>
      <td class="votelinks" valign="top"><center><a id="up_28191639" onclick="return vote(event, this, &quot;up&quot;)" href="vote?id=28191639&amp;how=up&amp;auth=****&amp;goto=news"><div class="votearrow" title="upvote"></div></a></center>
      </td>
      <td class="title">
        <a href="http://be-n.com/spw/you-can-list-a-million-files-in-a-directory-but-not-with-ls.html" class="storylink">You can list a directory containing 8M files, but not with ls</a>
        <span class="sitebit comhead"> (<a href="from?site=be-n.com"><span class="sitestr">be-n.com</span></a>)</span>
      </td>
    </tr>
In the video, Codex picks up tr.athing as a news item. I wonder if this is actually generalized learning, or if it just picked the selector up from, e.g., a userscript that appeared in its training corpus.

Another thing that's kind of scary (and worrying if this is used for Copilot): the second prompt, to make the text uppercase, results in code that is superficially correct but semantically very wrong. innerHTML.toUpperCase() is dangerous because it uppercases not only the content but also the tag names and attributes of the HTML elements inside. This definitely broke the vote button, which uses inline JS, which is case-sensitive. It also destroys any attached event handlers, since the elements are effectively deleted and then re-created.
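To make the attribute corruption concrete, here is a minimal sketch that needs no browser, because innerHTML is just a string. The markup is adapted from the vote link in the snippet above (auth parameter dropped, link text added for illustration), and the names `voteLinkHtml` and `upperHtml` are illustrative assumptions.

```javascript
// Markup adapted from the HN vote link quoted earlier in the thread.
const voteLinkHtml =
  '<a id="up_28191639" onclick="return vote(event, this, &quot;up&quot;)" ' +
  'href="vote?id=28191639&amp;how=up&amp;goto=news">upvote</a>';

// innerHTML is an ordinary string, so toUpperCase() touches every character:
// tag names, attribute names, attribute values, and inline JavaScript.
const upperHtml = voteLinkHtml.toUpperCase();
console.log(upperHtml);

// The inline handler text is now RETURN VOTE(EVENT, THIS, ...), which is not
// valid JavaScript because keywords and identifiers are case-sensitive.
const handlerSurvived = upperHtml.includes('return vote(event, this');
// The query string in href has been rewritten as well.
const hrefSurvived = upperHtml.includes('vote?id=28191639');
console.log({ handlerSurvived, hrefSurvived });
```

Assigning `upperHtml` back to innerHTML would then make the browser parse this altered markup into brand-new elements.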

The correct way to do this is either to use CSS text-transform: uppercase or, if it is important to update the DOM itself, to recursively descend through childNodes and uppercase the nodeValue of each node whose nodeType is Node.TEXT_NODE.


> Another thing that's kind of scary (and makes it worrying if this is used for Copilot) is the second prompt to make the text uppercase results in code that is superficially correct, but is very semantically wrong - innerHTML.toUpperCase() is dangerous because it not only makes the content uppercase, it also modifies the attributes on the HTML elements inside. This definitely broke the vote button, which uses inline JS which is case sensitive. It also destroys any attached event handler since the elements are basically deleted then re-created.

This is actually an issue I have with all these Transformer-based code generators - they have no inherent constraints on safe and correct code and often seem to generate superficially correct but bad and potentially even dangerous code. I remember that the first Copilot showcase also included stuff like that (not to mention that it sometimes generates GPL'd code).

All the model does is a very complex form of association learning. It may "understand" the relationship between English and various programming languages, but you cannot code in any constraints about optimization, security, licensing etc. There is so much bad code out there on the internet and this model may have seen a lot of it.

It's also no coincidence that most demos shown so far are in very high-level dynamic languages like JavaScript and Python.


With some prompt engineering, you can get Codex to produce better results. In these examples I wrote up to `makerUpper`, and Codex wrote the rest (with temperature = 0):

    // JavaScript one-liner to make the text of element with ID athing uppercase
    const makerUpper = function(id) {
      document.getElementById(id).innerHTML = document.getElementById(id).innerHTML.toUpperCase();
    };
vs

    // JavaScript one-liner to make the text of element with ID athing uppercase while following all security best practices
    const makerUppercase = function(id) {
      const element = document.getElementById(id);
      element.textContent = element.textContent.toUpperCase();
    };


The second result is more semantically correct, but it will not work if called on tr.athing, because assigning textContent deletes the child elements that tr.athing contains and replaces them with a single text node. It is still much safer than innerHTML, which will silently corrupt attributes. It's also interesting that you need to prompt Codex for security best practices (and a bit questionable whether it even "knows" anything about best practices).

I guess part of it is that a one-liner is impossible. Here's what I would write given the prompt:

    const makeUppercase = (id) => {
      const element = document.getElementById(id);
      if (element == null) return;
      const makeChildNodeUpper = (node) => {
        if (node.nodeType === Node.TEXT_NODE) {
          node.nodeValue = node.nodeValue.toUpperCase();
        } else {
          node.childNodes.forEach(makeChildNodeUpper);
        }
      }
      makeChildNodeUpper(element);
    }
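To check that this text-node walk leaves attributes alone, here is a sketch that runs the same recursion over a tiny hand-built tree of plain objects, so it also works outside a browser. The names `upperCaseTextNodes`, `text`, and `element`, and the mock node shape, are illustrative assumptions rather than real DOM objects; the constant 3 is the DOM's value for Node.TEXT_NODE.

```javascript
// Node.TEXT_NODE is 3 and Node.ELEMENT_NODE is 1 in the DOM;
// hard-coded here so the sketch runs without a browser.
const TEXT_NODE = 3;
const ELEMENT_NODE = 1;

// Same recursion as above, but taking the root node directly.
const upperCaseTextNodes = (node) => {
  if (node.nodeType === TEXT_NODE) {
    node.nodeValue = node.nodeValue.toUpperCase();
  } else {
    node.childNodes.forEach(upperCaseTextNodes);
  }
};

// Hypothetical stand-ins for DOM nodes: just enough shape for the walk.
const text = (value) => ({ nodeType: TEXT_NODE, nodeValue: value, childNodes: [] });
const element = (tag, attributes, childNodes) =>
  ({ nodeType: ELEMENT_NODE, tag, attributes, childNodes });

const row = element('td', { class: 'title' }, [
  element('a', { href: 'from?site=be-n.com', class: 'storylink' }, [
    text('You can list a directory'),
  ]),
  text(' (be-n.com)'),
]);

upperCaseTextNodes(row);

const link = row.childNodes[0];
console.log(link.childNodes[0].nodeValue); // the text node is uppercased
console.log(link.attributes.href);         // the attribute is left as written
```

Only the nodeValue of text nodes is written, so element objects, their attributes, and (in a real DOM) their event handlers stay in place.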


> It's also interesting you need to prompt Codex for security best practices

Well, that's one of the central lessons of ML - garbage in, garbage out. There is a lot of garbage code out there and no easy a priori way to distinguish garbage code from good code.


Here’s a pretty good one-liner to make the text uppercase:

    document.getElementById(id).style.textTransform = 'uppercase';


That was in their first comment.


Completely agree. It currently tends to write unsafe, error-prone code. The next step is to figure out how to rein it in, either with new techniques or rejection sampling from a large set of possible outputs.
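As a rough illustration of the rejection-sampling idea, the sketch below filters a list of candidate completions through simple checks and keeps the first one that passes. Everything here is hypothetical: `rejectionSample`, the two checks, and the hard-coded candidates standing in for model samples. A real filter would need much stronger analysis than regex tests.

```javascript
// Hypothetical checks; each returns true if the candidate is acceptable.
const checks = [
  (code) => !/\.innerHTML\s*=/.test(code), // reject writes to innerHTML
  (code) => !/\beval\s*\(/.test(code),     // reject calls to eval()
];

// Return the first candidate that passes every check, or null if none do.
const rejectionSample = (candidates, checkList) =>
  candidates.find((code) => checkList.every((check) => check(code))) ?? null;

// Stand-ins for several sampled completions of the same prompt.
const candidates = [
  'el.innerHTML = el.innerHTML.toUpperCase();',
  'eval("el.style.textTransform = \'uppercase\'");',
  "el.style.textTransform = 'uppercase';",
];

const accepted = rejectionSample(candidates, checks);
console.log(accepted);
```

The quality of the result depends entirely on the checks: anything they do not look for passes through unchanged.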


I wonder why innerHTML has a toUpperCase method. It makes sense for innerText, of course, but case sensitivity in the HTML can definitely matter for JS and CSS. I'm guessing it's because both are just treated as JS string objects. But there is a special NodeList collection, so why not a special HtmlString?


Yup, innerHTML just returns a string, so of course you can call .toUpperCase() on it, even though doing so is unsafe.

innerHTML's history is fascinating. It was not part of the original DOM Level 1 API; Microsoft introduced it in Internet Explorer 4. It is not semantically correct (you should be using Element.textContent or examining the inner text nodes), but because it was so easy and the rest of the DOM API was so verbose, it caught on and became one of the primary ways to manipulate content in JS.

FWIW, Chrome recently proposed a Trusted Types mechanism for preventing XSS (which also has the side effect of blocking this sort of unsafe manipulation) - https://web.dev/trusted-types/, https://developer.mozilla.org/en-US/docs/Web/API/TrustedHTML


Wait, asking Codex to change something on the third item in the list is hard when you have tr tags? I feel like tables are the quintessential way of listing rows of items in HTML. What am I missing?


A list is usually represented by an ol or ul in HTML, not a table.

HN uses three trs for each item, not one. The table cells are also not consistent, because it's not actually tabular data, so each cell could have nothing (used as a spacer for layout), or more than one thing squeezed inside. The intermingling of semantic and non-semantic (presentational) elements makes "understanding" the page difficult, which incidentally also makes the page less accessible, since screen readers rely on the same mechanisms to relay information to non-sighted users.


I don't think HN markup is nearly as bad as many other modern sites.


As far as I understand, Codex is a fine-tuned GPT-3.

GPT-3 was trained on a corpus derived from "the internet" (Wikipedia, links from Reddit with enough votes, and a filtered Common Crawl). So not only would GPT-3 have been exposed to code with comments, it would likely have read code examples on Wikipedia, tutorials online, API documentation, and even answers to questions on sites like Stack Overflow.

The fine tuning itself is, as far as I know, from code only. So it would lean heavily on comments there. But it has a basis of understanding from the aforementioned sources.



