Hi, I'm one of the CKAN devs. Just wanted to say the site is fully functional ag...

jedsmith · on April 6, 2011

I have a bit of experience with Xen. If you're actually seeing a whole lot of steal (how much?), that's a bad sign because it means you're on a box with a lot of contention. In an ideal world, Xen should steal very little from you. I'm burning all four cores available to me on one of my personal Linodes, and the platform is barely stealing anything. Here's vmstat -s and uptime from that Linode for comparison:

       409198 non-nice user cpu ticks
     60878563 nice user cpu ticks
       166987 system cpu ticks
    811571786 idle cpu ticks
      4486779 IO-wait cpu ticks
           25 IRQ cpu ticks
        15388 softirq cpu ticks
       766577 stolen cpu ticks

    12:06:10 up 13 days, 14:11,  3 users,  load average: 4.00, 4.01, 4.05

I've had the pedal to the floor for a couple of days on the CPU, and only 766 kticks have been stolen (total) since I booted. If you're seeing a lot more steal than that, your host is working pretty hard to schedule the domUs fairly.
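To put that number in proportion, here is a rough sketch that turns the counters above into a steal share. The dictionary keys and the `steal_share` helper are names made up for illustration; only the tick values come from the `vmstat -s` output quoted above.

```python
# Tick counters copied from the vmstat -s output above.
# The key names are illustrative, not anything vmstat emits.
ticks = {
    "user": 409198,
    "nice": 60878563,
    "system": 166987,
    "idle": 811571786,
    "iowait": 4486779,
    "irq": 25,
    "softirq": 15388,
    "steal": 766577,
}

def steal_share(counters):
    """Return stolen ticks as a fraction of all accounted ticks."""
    total = sum(counters.values())
    return counters["steal"] / total if total else 0.0

def steal_vs_busy(counters):
    """Return stolen ticks as a fraction of non-idle, non-IO-wait ticks."""
    busy = sum(counters.values()) - counters["idle"] - counters["iowait"]
    return counters["steal"] / busy if busy else 0.0

if __name__ == "__main__":
    print(f"steal / all ticks:  {steal_share(ticks):.4%}")
    print(f"steal / busy ticks: {steal_vs_busy(ticks):.4%}")
```

On these figures steal comes out well under 0.1% of all ticks since boot, and a bit over 1% of the ticks that were not idle or waiting on IO.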

I wouldn't dare assume that I know how to run your operation better than you do; I'm just sharing my experiences with Xen. Netflix had a solution to this -- unfortunate that it was necessary, but a solution nonetheless -- which was to monitor steal closely and spin up a new instance if it skyrocketed: http://blog.sciencelogic.com/netflix-steals-time-in-the-clou...
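For what "monitor steal closely" might look like on a Linux guest, here is a minimal sketch that samples the aggregate `cpu` line of `/proc/stat` twice and reports steal over the interval. This is not Netflix's tooling; the function names, the 5-second interval and the 10% threshold are all assumptions picked for illustration, and what to do when the threshold trips (alert, replace the instance) is left out.

```python
import time

def read_cpu_ticks(stat_text):
    """Parse the aggregate 'cpu' line of /proc/stat into its first eight
    counters: user, nice, system, idle, iowait, irq, softirq, steal.
    The guest counters that follow are skipped because the kernel already
    includes them in user and nice."""
    for line in stat_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "cpu":
            return [int(value) for value in fields[1:9]]
    raise ValueError("no aggregate cpu line found")

def steal_percent(before, after):
    """Percentage of ticks stolen between two samples."""
    deltas = [b - a for a, b in zip(before, after)]
    total = sum(deltas)
    if total <= 0:
        return 0.0
    # Steal is the eighth counter; very old kernels may not report it.
    steal = deltas[7] if len(deltas) > 7 else 0
    return 100.0 * steal / total

def sample_steal(interval=5.0, path="/proc/stat"):
    """Take two samples `interval` seconds apart and return steal percent."""
    with open(path) as handle:
        before = read_cpu_ticks(handle.read())
    time.sleep(interval)
    with open(path) as handle:
        after = read_cpu_ticks(handle.read())
    return steal_percent(before, after)

if __name__ == "__main__":
    THRESHOLD = 10.0  # hypothetical cutoff, tune for your workload
    pct = sample_steal()
    status = "HIGH" if pct > THRESHOLD else "ok"
    print(f"steal over last interval: {pct:.2f}% ({status})")
```

Working from deltas rather than the since-boot totals matters here, because a long quiet period would otherwise hide a recent spike.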

Given the opportunity, I'd like to point out that I meant no disrespect in my original comment, if it wasn't clear. I was speaking more from a generality and not about CKAN specifically, a fact lost on those mindlessly downvoting me.

kindly · on April 6, 2011

No disrespect was taken. The Hacker News coverage came as a big surprise. We like to keep caching mostly turned off, and we know this is a risk. We do this because we do not want any possibility of stale data, as this is annoying to the type of users we have. We are working on a better cache invalidation scheme, but this has not been a big priority.

Your feedback is appreciated, thank you.

Edit: Our amount of steal was much much higher than that.

ebishop · on April 6, 2011

Have you considered implementing some sort of script to scan some of the large biological databases and add links/metadata for the datasets they contain?

Looking at what's in CKAN now, it seems that it's mostly datasets that are a bit more easily understood than most of the biological data that's out there, but at the same time indexing and accessing biological data is a HUGE problem for researchers in this field.

There are currently some big databases such as the data stored by the UCSC genome browser (genome.ucsc.edu/downloads.html) and all sorts of expression/small RNA data available from GEO (Gene Expression Omnibus, http://www.ncbi.nlm.nih.gov/geo/), and lots of other slightly more esoteric databases like flybase.org, which specializes in fruit fly data.

Truly doing a proper job of indexing/classifying all of this is a close-to-impossible task (and in many cases requires specialized knowledge), but there are an absurd number of publicly available biological datasets out there. If you wanted to rapidly expand the number of entries you have you could use a script to index one or two of the big databases like GEO, and fill in the metadata from what they already have.

Of course, I can also understand why you might prefer to have the majority of the datasets in your site be the sort of thing most people (or at least, non-biologists) can interpret vs. something that's highly specialized like this. Not to mention, keeping up with all the new data, and properly filling in all the metadata could be a real can of worms.

kindly · on April 8, 2011

Sorry for the late reply. Sadly, I do not understand the concerns of this field very well. There are many very large datasets referenced on CKAN, mainly links to huge triple stores. There are also many biological datasets, e.g. flybase as mentioned. These triple stores are too big to do any decent dynamic linking against, which is a big shame.

If you get the opportunity, could you repost this to ckan-discuss@lists.okfn.org? There are people on that list who understand these issues far better than I do, and they would love to hear from anyone interested.