Tmpfs inode corruption: introducing inode64 (chrisdown.name)
161 points by r4um on Sept 4, 2021 | 27 comments


Is the mount option needed for backwards compat? Any system that creates that many files should hopefully be large-file aware. Maybe it comes down to many files being created (and deleted) over the lifetime of the tmpfs mount, i.e. usage time, not size, with no counter reset.

A system that uses "our identifier space is approximately infinite" can be fragile in a surprising way. A system that uses "our identifier space is proportional to the physical size of the medium" is more humanly predictable in its limitations.


And what can you do here? Either suddenly fail with an "inodes exhausted" error (but here they aren't really exhausted, only the numbers are), or try to reuse numbers (which would slow allocation and potentially cause confusion), or...?


Reusing inodes is not that uncommon among filesystems, so I think tmpfs should have had an implementation like that.


Hey there! Article author here -- just found out this was posted here and going through the comments :-)

One of the earliest test versions of my patch actually did inode reuse using slabs just like you're suggesting, but there are a few practical issues:

1. Performance implications. We use tmpfs internally within the kernel in a lock-free manner as part of some latency-sensitive operations, and using slabs complicates that somewhat. The fact that we use tmpfs internally within the kernel makes this situation quite different from other filesystems.

2. Back when I was writing the patch, each memory cgroup had its own set of slabs, which greatly complicated being able to reuse inodes as slabs between different services (since they run in different memcgs).

After it became clear that slab recycling wouldn't work, I wrote a test patch that uses IDA instead, but I found that the performance implications were also not tenable. There are other alternative solutions but they increase code complexity/maintenance non-trivially and aren't really worth it.

A 64-bit per-superblock inode space resolves this issue without introducing any of these problems -- before you go through 2^64-1 inodes, you're going to hit other practical constraints anyway, at least for the time being :-)
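
To make the idea concrete, here's a minimal user-space sketch of a per-superblock 64-bit inode allocator. This is not the actual mm/shmem.c code (which adds per-CPU batching and other details); it just shows the shape of the approach:

    #include <stdatomic.h>
    #include <stdint.h>

    struct sb_info {
        /* one counter per superblock; start it above any reserved values */
        atomic_uint_fast64_t next_ino;
    };

    static uint64_t alloc_ino(struct sb_info *sbi)
    {
        /* monotonically increasing, never reused; wrapping a 64-bit
           counter is not a practical concern */
        return atomic_fetch_add(&sbi->next_ino, 1);
    }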


Oh that's interesting, thank you! Note that when I said "should", 1) it carries no weight and 2) I was referring to the old implementation, not what the fix should be. Going 64-bit sounds like a good option; hopefully it can become the default.


Thank you!


> Reusing inodes is not that uncommon among filesystems

I've been through that hell before and it completely sucks.

BSD UFS will occasionally reuse an inode if you replace a file in a directory by delete + create (which makes sense if you only care about the efficiency of the directory structure).

But this meant that my caches, which were keyed by device + inode + mtime, would fail to detect a file being updated.

As an obvious workaround, a 2-second delay was built into the system (apc.file_update_protection), which seemed to fix things at first, but then it too stopped working for some people.

It turned out they were pushing data out to nodes through something like rsync, which also pushes an mtime update.

Anyway, the eventual workaround was to cache-break by ctime as well as mtime.
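
In code, the fix amounts to folding ctime into the cache key next to mtime. A sketch (the key format here is made up for illustration):

    #include <stdio.h>
    #include <sys/stat.h>

    static void make_cache_key(const struct stat *st, char *buf, size_t len)
    {
        /* device + inode alone can collide after inode reuse, and rsync
           can preserve mtime, so include ctime as a tiebreaker */
        snprintf(buf, len, "%llu:%llu:%lld:%lld",
                 (unsigned long long)st->st_dev,
                 (unsigned long long)st->st_ino,
                 (long long)st->st_mtime,
                 (long long)st->st_ctime);
    }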

But it was hell to debug (months with one SEV1 ticket open), because it took a lot (a lot) of work even to trace the problem to the filesystem and figure out why file updates weren't reaching the cache (requests failing with API mismatches were the only symptom).

Would not spring this on people without an explicit warning.

PS: Looks like ext4 does reuse them, whoa


I'm pretty sure most Unix-y filesystems will reuse inodes with enough churn (eventually), because their design is generally one (or more) arrays of inodes with a free bitmap. NTFS should be very similar in behavior, though I'm not sure the file index (~MFT index, basically the inode number) is actually used much; Windows itself tends to prefer object IDs as persistent, location-independent handles for files.

Filesystem cache indexing should generally be done on the (name, type, inode, ctime) tuple (ideally including the inode generation, but it's a fuss to get and doesn't make much difference in practice).
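
On the "fuss to get" part: one way to read the inode generation on Linux is the FS_IOC_GETVERSION ioctl, which needs an open fd and isn't supported by every filesystem. A sketch:

    #include <fcntl.h>
    #include <linux/fs.h>
    #include <sys/ioctl.h>
    #include <unistd.h>

    static int get_generation(const char *path, int *gen)
    {
        int fd = open(path, O_RDONLY);
        if (fd < 0)
            return -1;
        /* note: the macro is defined as _IOR('v', 1, long), but
           filesystems write an int through the pointer, so an int
           is what's usually passed */
        int ret = ioctl(fd, FS_IOC_GETVERSION, gen);
        close(fd);
        return ret;
    }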


Similarly, XFS has the inode32/64 mount options. Since kernel 3.7 inode64 is the default: https://man7.org/linux/man-pages//man5/xfs.5.html


"The inode64 and inode32 names are used based on existing precedent from XFS."

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/lin...
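
For what it's worth, turning the option on for a tmpfs mount from C looks something like this (on kernels new enough to have it; /mnt/scratch is a hypothetical mountpoint):

    #include <stdio.h>
    #include <sys/mount.h>

    int main(void)
    {
        /* equivalent to: mount -t tmpfs -o inode64 tmpfs /mnt/scratch */
        if (mount("tmpfs", "/mnt/scratch", "tmpfs", 0, "inode64") != 0) {
            perror("mount");
            return 1;
        }
        return 0;
    }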


I have not read an article in quite some time that takes a complex topic and lays it out so clearly.

Thank you!


I enabled `pam_namespace` for per-user `/tmp` and `/var/tmp` on the systems I administer, with those filesystems being tmpfs and xfs respectively. We get odd problems weekly where one or another user can't write to `/tmp` (though nsenter fails to reproduce it), but we have never had issues with `/var/tmp`. We came from Solaris, where tmpfs was standard for `/tmp`, and it surprised me to find that it seems to be an unusual choice on Linux. Are there reasons it isn't widely deployed on Linux? I can't think why inode reuse could cause the problems I experience, but if I can enable this inode64 option, I'll give it a try.


Iirc several distributions tried out tmpfs-based /tmp a few years ago then reverted. I think the problem was just that some Linux-specific software and sysadmins had grown up expecting to be able to write a lot of data to /tmp. Limiting it to 50% of RAM (the default iirc) didn't go well. The distribution people didn't want to deal with converting those things to use /var/tmp so they backed off. I don't think it helped politically that the systemd people were pushing it, as those folks were already perceived in some circles as dictating too much about the direction of Linux.

This is all from memory and I'm on my phone, so no citations. In any case, I don't think inode numbers were the problem.


I'm somewhat intrigued that this problem remained for decades after 64-bit inode numbers became available in Linux.


I've not come across FS_IOC_GETVERSION before. Should the many user-mode applications that rely on st_dev, st_ino being enough to identify an inode also be calling this ioctl? Or is it something you only need to worry about in limited circumstances?


Can someone explain how tmpfs / /dev/shm track files if not by inode number?


Inodes are actual objects in the kernel. The inode number is supposed to be such that (device, inode, generation) uniquely identifies an inode, but the kernel itself doesn't really care about this (as long as your filesystem is not using the inode cache, which does use the inode number as a key). The way tmpfs works is that it uses the dentry cache as its actual data structure (keeping the refcount of all cache entries above 1 means they can't be evicted); those entries just directly contain pointers to inode objects. File data is stored in the page cache.


Since the contents are in-memory, you could imagine a straightforward implementation where the directory hierarchy in the tmpfs is stored in a pointer-based in-memory tree.

A file handle just needs a pointer to where in memory the contents are stored, and the inode numbers are computed on the side because they're part of the interface.

You don't want to use the pointers as inode numbers directly, since for security reasons the kernel doesn't want to export valid kernel pointers to user space.
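
You can watch this from user space: create a few files on a tmpfs like /dev/shm and print st_ino. A sketch (current kernels hand the numbers out from a counter, but that's an implementation detail, not a guarantee):

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(void)
    {
        for (int i = 0; i < 3; i++) {
            char path[64];
            struct stat st;
            snprintf(path, sizeof(path), "/dev/shm/ino-demo-%d", i);
            int fd = open(path, O_CREAT | O_RDWR, 0600);
            if (fd < 0 || fstat(fd, &st) != 0)
                return 1;
            printf("%s -> inode %llu\n", path,
                   (unsigned long long)st.st_ino);
            close(fd);
            unlink(path);
        }
        return 0;
    }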


To play with these concepts, you can write a FUSE filesystem where each file has inode 42, while being wholly separate nodes & handles. Linux VFS doesn't really use the inode number internally for anything while a direntry is alive.
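
A minimal sketch of that experiment, assuming libfuse 3 (mount with -o use_ino so the kernel honors our st_ino; compile against pkg-config fuse3):

    #define FUSE_USE_VERSION 31
    #include <fuse3/fuse.h>
    #include <string.h>
    #include <sys/stat.h>

    static int e42_getattr(const char *path, struct stat *st,
                           struct fuse_file_info *fi)
    {
        (void)fi;
        memset(st, 0, sizeof(*st));
        st->st_ino = 42;  /* every node claims inode 42 */
        if (strcmp(path, "/") == 0) {
            st->st_mode = S_IFDIR | 0755;
            st->st_nlink = 2;
        } else {
            /* pretend any other name is an empty read-only file */
            st->st_mode = S_IFREG | 0444;
            st->st_nlink = 1;
        }
        return 0;
    }

    static const struct fuse_operations e42_ops = {
        .getattr = e42_getattr,
    };

    int main(int argc, char *argv[])
    {
        return fuse_main(argc, argv, &e42_ops, NULL);
    }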


Please read the article. tmpfs was internally using a 32-bit unsigned int for its inode counter, which overflowed. The kernel already has a 64-bit-wide ino_t; the author migrated tmpfs to it.


The question was clear enough: per the article, tmpfs is able to track two distinct files with the same inode number, implying it does not use the inode number internally to differentiate between files.


I'm not sure where the confusion is coming from.

If userspace wants to distinguish files, it normally does so with an (ino_t, dev_t) pair. With some nuance wrt inode generations if you aren't holding on to files yet want to guard against ino reuse, and some funkiness with overlayfs.

Most filesystems don't expose a way to access files by ino alone, so the internal implementation generally doesn't key anything by them or rely on them itself; they're just a kind of uniqueness cookie. But keying by them somewhere in the kernel implementation is feasible, if you scope it properly.
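
i.e., the usual user-space identity check is just (a sketch):

    #include <stdbool.h>
    #include <sys/stat.h>

    static bool same_file(const struct stat *a, const struct stat *b)
    {
        /* the (dev, ino) pair; no guard against inode reuse over time */
        return a->st_dev == b->st_dev && a->st_ino == b->st_ino;
    }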


It may be that, in modern times, filesystems don't actually key things with i-node number. But wasn't that the original purpose? Otherwise, why the design where directories are basically a mapping of names to i-node numbers?

Also, the article does say this:

> On non-volatile filesystems, inode numbers are typically finite and often have some kind of implementation-defined semantic meaning (eg. implying where the inode is on disk).

If you've got other means to know where the i-node is on disk, then why would a filesystem bother encoding that into the i-node numbering scheme?

Maybe the article is wrong here. I don't know. But if the question is where the confusion is coming from, the article definitely seems to be painting a picture where normal filesystems look up things based on i-node number and tmpfs is relatively unique in not doing so.


Yes, the OG Unix filesystem simply had a fixed-size array of struct inode on the disk (fixed at filesystem-creation time; this is still the case for ext4 iirc), and the inode number (iirc it was just idx or something like that in the code) is simply the index into that array. An OG Unix directory was simply a file with the directory flag set, and the file was just an array of struct { int ino; char name[14]; } (~something like that). You actually used to be able to just open() a directory and directly read directory entries (dirents henceforth) from it, just like a file - exactly like a file, because it WAS a file.
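
The V7-style on-disk layout, for the curious (16 bytes per entry):

    /* classic V7-style directory entry: a 16-bit index into the on-disk
       inode array plus a fixed-width name; ino 0 marks a free slot */
    struct v7_dirent {
        unsigned short d_ino;
        char           d_name[14];
    };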

ext and XFS still largely work like this, though directories are now hashtables and I think XFS supports multiple independent arrays for inodes (or just for extent allocation? I don't remember). NTFS also looks a lot like a Unix file system with some weird growths on it, and not a whole lot like FAT.


There's also some funkyness with btrfs subvolumes:

https://news.ycombinator.com/item?id=28274054


Since there's no need for these inode numbers to survive a reboot, they are simply generated on the fly for a given tmpfs mount and not persisted anywhere durable.

TFA makes this clear...


While we're at it, can we also increase the default max number of hardlinks on ext4 filesystems? And the max number of command-line arguments?



