nschiefer's comments

nschiefer · on May 31, 2023

Good news for readers who prefer reputable ML papers: this work was accepted to ICLR 2023! https://openreview.net/forum?id=9XFSbDPmdW

skinner_ · on May 31, 2023

Thanks! What a self-own, ICLR Spotlight paper, that's close to the top of the reputability food chain!

nschiefer · on April 21, 2023

(Disclaimer: I’m plugging my own work here ;-))

You might enjoy this project, which ties to do basically exactly what you described: stick everything in a database and let it drive the app: https://riffle.systems/essays/prelude/

It’s still very much a research prototype but we should have some more writing out soon.

PKop · on April 21, 2023

Good essay. Particularly with web development, there is complexity around bridging the gap between server and client, and data crosses this chasm through serialization which exacerbates the problem and limits expressiveness of server languages, requiring massive duplication of code simply to serialize and duplicate on client what is present on server if one wants all the power of client interactivity and API's.

This is a big value of recent server-centric frameworks like Phoenix LiveView that provide ability to have code and data co-located and not have to duplicate so much on client and server as with SPA's while attempting to maintain some base level of client interactivity. But seems always a tension between leveraging the full power of client and full power of server.

You might find this article [0][1] informative. It disputes the idea that UI's are "pure functions of the data/model" in a compelling way, and points to this incorrect assumption as having introduced some complexity/pain in how frameworks like React work.

[0] https://blog.metaobject.com/2018/12/uis-are-not-pure-functio...

[1] https://news.ycombinator.com/item?id=31979347

vaughan · on April 22, 2023

That article was a great read.

> What is a (somewhat) pure mapping from the model is the data that is displayed in the UI, but not the entire UI.

This line was interesting to think about.

React added so much complexity and perf issues, and I really don’t know what was so bad about something like Backbone / Backbone.Marionette.

I find that in React I want all component props and data to come from an external store via subscription, instead of being passed down some tree. My UI is stable.

nschiefer · on Jan 15, 2021

This is amazing! Are you planning to roll this out to other states (that are also allowing vaccinations in the general public)?

If not, what's the limiting factor? Can I help overcome that?

patio11 · on Jan 15, 2021

Priority #1 is making sure it works in California; we'll need to sustain volunteer attention for days/weeks, etc, while working kinks out of the system.

Broadly we will probably publish the "recipe" (rather little code involved!) and encourage other teams to clone it; getting dozens of trusted, capable, committed volunteers relied heavily on professional networks in the Bay Area, and our networks might not be dense enough to replicate that for e.g. Michigan.

nschiefer · on Jan 15, 2021

Makes sense, especially the part about the "recipe" being more about organizational capacity than code.

Do you think it's important for volunteers to live in the region that they're trying to report on? Concretely, could a network based in the Bay Area make reports about availability in Michigan?

I might be able to get a sufficient group in Massachusetts, but the state hasn't approved general availability yet, despite CDC guidance.

kasrak · on Jan 15, 2021

Recipe-wise, this is a copy of the main database structure that we're using. Sharing in case it's helpful for other people spinning up similar things in other states: https://airtable.com/shrb5QLQ2DGSoau2l

patio11 · on Jan 15, 2021

No strong feelings; I'm in Japan, rather than California, but combination of local knowledge for e.g. which state resources to scrape plus some local affinity for maintaining volunteer enthusiasm seemed useful.

nschiefer · on Jan 15, 2019

You're right the Record Layer doesn't yet have join support, but note PR #306 [1]. We support aggregate indexes but can't run aggregate “reports” that aren't backed by indexes. The rationale for this is covered in the paper (see [2]), but note that doesn't preclude such support being added into or on top of Record Layer (also discussed in the paper).

The Record Layer does a whole bunch of work to deal with index maintenance (see our docs [3])). Our “index maintainer” abstraction (discussed in the paper) makes maintaining our indexes (including those that are basically materialized views) completely seamless from the user's perspective, even for updates and deletes. We also have a lot of tooling for making efficient schema migrations. For example, schema migrations are performed lazily (when the data is accessed), so they aren't limited by the 5 second transaction limit. If you add/remove/change indexes, they'll be put into a “write-only” mode where they'll keep accepting writes while an “online indexer” builds the index over multiple transactions. We even have fancy logic to automatically adjust the size of the transactions if they start failing due to contention or timeouts!

Basically, the Record Layer solves a lot (but not all) of the pain points that shows up when you don't know your access patterns from the beginning. The paper talks a bit about how CloudKit uses some of those features.

[1] https://github.com/FoundationDB/fdb-record-layer/pull/306

[2] https://foundationdb.github.io/fdb-record-layer/FAQ.html — search for “aggregation”

[3] https://foundationdb.github.io/fdb-record-layer/SchemaEvolut...

nschiefer · on Jan 15, 2019

The preprint of the paper is now up on arXiv.org: https://arxiv.org/abs/1901.04452

nschiefer · on Jan 14, 2019

(I'm from the iCloud team that works on the Record Layer.) Both building a relational database and implementing a proper SQL interface on top of it are huge projects. The SQL spec is large and complicated, so achieving true compatibility (as opposed to superficial compatibility) is challenging. Even worse, once you have a SQL interface users expect to be able to throw any SQL that they give to, say, Postgres, and have it work just as well, which requires a ton of detailed work on the query optimizer.

The client/server distinction isn't terribly strong in the FDB world. The FDB client is unusual in that it's a (stateless) part of the FDB cluster itself. You could therefore embed it in the client itself or build an RPC service around it. The Record Layer takes the same approach---it's just a Java library---so you could either embed it in the client application or build some kind of wire protocol for accessing it. One could have an embedded SQL layer like SQLite or H2 with no additional server beyond the cluster or a separate SQL layer network server that acted more like Postgres or MySQL.

The Record Layer was designed for use cases that don't need a SQL interface, so we focused on building the layer itself. That said, the Record Layer exposes a ton of extension points so there's a fluid boundary between what needs to live in its main codebase and what can be implemented on top. There are almost certainly enough extension points to implement a SQL interface as another layer on top of the Record Layer. For example, you could add totally new types of indexes outside of the Record Layer's codebase, if that were needed for SQL support. It's still a lot of work, especially on the query optimizer. Perhaps the community is up to that challenge. :-)

grandinj · on Jan 15, 2019

Apache Ignite is also using H2 as their "SQL parser and query planner" layer

nschiefer · on Oct 5, 2017

I'm a friend and colleague of Michael's at MIT. This article is very nice and gives a good summary of what made Michael so special. In a lot of ways, Michael was the animating "spirit" of the MIT theory group. He had an encyclopedic knowledge and incredibly deep understanding of basically every area of computer science, and many areas beyond; in the short year that I knew him, we had conversations about everything from convex optimization to computer architecture to rent control laws to Medieval philosophy.

Michael was a truly remarkable researcher. Ludwig's comments about him being the type that you "only see a couple of times in a generation" are accurate. I also recommend watching the start of Yin-Tat Lee's recent talk [1] at the Simons Institute. Yin-Tat is a prolific researcher himself, so his comments carry a lot of weight.

For those wondering about Michael's publication count: computer science (and especially theoretical computer science) is a "high publication" field, in part because of the nature of publishing in conferences and in part because the field is young and there are many good open problems. Still, Michael's publication record is abnormally strong and reflects his collaborative nature. Regarding the comments about co-authorship, Michael could easily have been a co-author on a dozen more papers if he had cared, since he often contributed the main ideas to projects that he never formally joined. This was definitely my experience collaborating with him. I expect that Michael will be more prolific in the next year than many living researchers, from the point of view of publishing.

His papers (incomplete list here [2]) are very well written, by the way. I recommend checking them out.

The most incredible thing about Michael was the way he learned. If you talked about something that he didn't understand, he'd quiz you about it until he did. And he did this with everyone, from brand new grad students like me to famous professors.

At the same time, Michael was incredibly generous. He liked to talk, and you could interrupt him at any time and he'd explain everything to you with astounding patience. Michael wasn't in science for glory; he just really loved learning and teaching. He's already profoundly missed and our entire community is shocked by his untimely passing. My deepest condolences go to Michael's family.

We hope to have a memorial website up soon, especially since Michael was too humble to have much of an online presence.

[1] https://www.youtube.com/watch?v=6pIheZseT1U

[2] https://scholar.google.com/citations?hl=en&user=t3kDJHQAAAAJ...

nschiefer · on July 21, 2012

Most of your feedback is perfectly fair. My algorithm was not meant for general search, not am I a "young genius" or anything of the sort. Similarly, techniques like tf-idf, latent semantic analysis, document modelling, and even pseudo-relevance feedback expansion are no longer "cutting edge" techniques.

However, your blanket characterization that "there's absolutely nothing new" in my work and that I just talked "about existing research" while "making it look like [my] own" is somewhat offensive. Based on a fairly extensive review of the literature, the algorithm that I developed is novel and seems to outperform a number of these standard techniques on short document like tweets. As for "just looking at co-occurrence," that's essentially a type of pseudo-relevance feedback expansion and is, of course, well known and easy to implement.

Please realize the difference between a news interview published online and the paper I submitted to the fair when assessing the novelty of my work and when suggesting that I made no reference to others' work.

Regarding my "definition for a 'word'", I apologize for appearing pretentious. I was asked to speak on the conference's theme of "redefinition" in relation to my research and did the best I could.

Finally, I think it's kind of strange that you automatically assumed that my parents even worked on my project. Neither of them are even familiar with the details of the project. My work was my own, and your automatic assumption of what is tantamount of broad academic fraud and plagiarism assumed particularly bad faith. In any case, thanks for your honest feedback; I try to be careful about how I come across and I'll be even more mindful in the future.

nschiefer · on July 21, 2012

You are correct. My algorithm is conceptually rather similar to a number of ones recently published. The work closest to mine in the literatures is [1], by Lafferty and Zhai at CMU in 2001.

That said, my method is somewhat different than these in the way that it explicitly treats unlinked documents as distributions over a graph of words and the theoretical framework (based on a theoretical process for document generation) employed to derive it.

[1] http://www.iro.umontreal.ca/~nie/IFT6255/lafferty-zhai.pdf

nschiefer · on July 21, 2012

Hi, this is Nicholas, a long time lurker on HN and the person in the video. I saw this thread during my morning commute to work (and was very surprised, to say the least!) and wanted to register to mention a few important details that the news articles always omit. Hopefully this helps correct a few misconceptions!

To begin, I'd like to flatly deny that I "built a better search engine." I did my (very academic) work in information retrieval and developed a new algorithm that seems to give significantly better search results (when compared to other academic search techniques, more on this later) on short documents like Twitter tweets. Specifically, my algorithm uses random walks (modelled as Markov chains) on graphs of terms representing documents to perform a type of semantic smoothing known as document expansion, where a statistical model of a document's meaning (usually based on the words that appear in the document) is expanded to include related words. My system is in no way, shape, or form a "search engine" or even comparable with something like Google---rather, it is an algorithm that could help improve search results in a real, commercial search engine.

My work is not, by far, the first to attempt document expansion. A number of related techniques, including pseudo-relevance feedback expansion, translation models, some forms of latent semantic indexing, and some of those mentioned by exg already exist. However, to my knowledge, the knowledge of my science fair juges (some of whom are active IR researchers), and the knowledge of my research mentor (also more on this later), my work is a novel method (not a synthesis of existing methods) that seems to work quite well in comparison to other, similar, algorithms on collections of small documents like tweets.

The last point is certainly important: it is simply impossible to compare my algorithm to something like Google, for several reasons. First, I'm not a software engineer or a large company; it is downright impossible for me to craft a combination of algorithms like that found in Google to get comparable results. No commercial search engine would be so foolish as to use only a single algorithm (essentially a single feature, from an ML perspective). Instead, they use hundreds or thousands. Second, it is essentially impossible to compare search engines with any level of scientific rigour. I evaluated my system using a standard corpus of data published by NIST as part of TREC (the Text REtrieval Conference), consisting not only of 16+ million tweets, but also of sample queries and the correct, human-determined results for these queries. However, to achieve statistically comparable results, many variables have to be controlled in a way that is impossible with a large, complex search engine. Instead, the academic approach compares individual algorithms one-on-one and postulates that these can be combined to give better search results in aggregate.

Specifically, my research showed that my system achieved above-median scores on the official evaluation metrics of the 2011 Microblog corpus when compared to research groups that published last November. Furthermore, my system did the best of all of the "single algorithm" systems, including those that used other document expansion techniques like I described above.

Most of my work was spent on the development of the algorithm, proofs of its convergence and asymptotic complexity, a theoretical framework, and a statistical analysis of my results. Notably absent from this list is engineering. My project is not, by any means, "a toy engineering project" as some commenters have suggested. Actually, the engineering in my project is quite poor, as that area is not one I've had much exposure to.

To briefly address my research mentor: my parents had nothing to do with my project other than providing emotional support when I was stressed. I had a research mentor at a university who I found after I did very well at the 2011 Canada-Wide Science Fair. He provided me with important computational and data resources (such as the corpus I used), but did not develop my algorithm, proofs, or code, which were my own work.

Given the recent attention of my project (and Jack Andraka's project on cancer detection), I'd like to point out a general trend in news articles about science fair projects. In general, the media has a tendency to focus on the potential applications of a project and ignore the science in it, leading to (seemingly fair) criticism. Using me as an example, the talk about "toy" projects and "synthesis" is fair given how it is portrayed in the media. Somehow, "novel IR algorithm based on Markov chain-based document expansion," even with careful (and thorough!) explanation, gets turned into "Teen builds a better search engine." Similarly, a great friend (and roommate) of mine whose project on drug combinations to treat cystic fibrosis was completely shredded on Reddit when it got significant media attention last year. In his project, he never once claimed or tried to claim that he had done anything with immediate (or even near) medical applications. Instead, he discussed his work to identify molecules that bind to different sites on the damaged protein and can work synergistically as drugs. The media spin-machine quickly turned this into "Teen cures cystic fibrosis" and other such nonsense. Even Jack's project (I know both him and his project), which is unusually "real world" has being overspun by the media. It's just what happens. Heck, people even make fun of it in upper-level science fairs, but it still happens.

Finally, thank you for the encouraging words! To finish with a shameless plug, I'd like to point out that, while fairs like ISEF tend to be very well-funded (because of the positive publicity). However, many regional and state (in the US) or national (outside of the US) youth science organizations struggle to find funding (and even volunteers) to run fairs that send people to ISEF. If you ever find yourself in a position where you can help (financially, with your time, whatever), I'd strongly encourage it. Given the impact the science fairs have had on my life, I know that I certainly will.

SiVal · on July 24, 2012

Wonderful work, Nicholas. I'm adding you to my list of good role models for my little boys. The 17-yr-old sports stars get most of the press, because their accomplishments are easily seen. You are the equivalent of a 17-yr-old basketball star with ten years of training behind him, but you play on a court that is nearly invisible (mathematical terrains and state spaces).

You are the kind of role model I want for my boys. Now, I have my work cut out for me explaining to them why. ;-)

gojomo · on July 24, 2012

Thanks for your comment.

So much of the coverage of your research was brief video, that having just this small-but-precise description is an immense help in understanding.

Many also need to hear your other message, about the distorting effects of both popular-media-attention, and of transplanting-results-outside-their-original-context. So much online discussion is knocking down decontextualized caricatures of real work... resulting in a lot of unnecessary waste and negativity.

nsomniact · on July 21, 2012

Your attitude is refreshing.