Make the result of alcove_cgroup:set/6 more consistent:
* return ok on success
* return an errno tuple if open() or write() fails
* return {error,enoent} if the cgroup does not exist, instead of []
In the case of a partial write, the code currently crashes, taking
care to close the fd in the Unix child process first. The code could
instead select on the fd and attempt another write.
Ubuntu 15.04 sets the MS_SHARED flag on system mounts. Since this flag
is inherited by container mounts, the container filesystem is visible
to the global namespace.
Use the MS_PRIVATE flag on container mounts to prevent this issue.
Enforce the use of the fork path. Having an optional fork path was nice
when working in the shell:
{ok, Child} = alcove:fork(Drv).
Instead of:
{ok, Child} = alcove:fork(Drv, []). % port process forks
However it introduced a few problems:
* made the interface inconsistent and ambiguous
alcove:kill(Drv, Pid, 9)
% vs
alcove:kill(Drv, [], Pid, 9) % the port process is sending the signal
* calls could not have optional arguments
Whether or not calls should have optional arguments is an open question
but the optional fork path would have conflicting arities:
For example, the last argument to mount/8 is used only by Solaris:
% arity 6
-spec mount(alcove_drv:ref(),iodata(),iodata(),iodata(),uint64_t() | [constant()],iodata()) -> 'ok' | {'error', file:posix() | 'unsupported'}.
% arity 7
-spec mount(alcove_drv:ref(),iodata(),iodata(),iodata(),uint64_t() | [constant()],iodata(),iodata()) -> 'ok' | {'error', file:posix() | 'unsupported'}.
% arity 7
-spec mount(alcove_drv:ref(),fork_path(),iodata(),iodata(),iodata(),uint64_t() | [constant()],iodata()) -> 'ok' | {'error', file:posix() | 'unsupported'}.
% arity 8
-spec mount(alcove_drv:ref(),fork_path(),iodata(),iodata(),iodata(),uint64_t() | [constant()],iodata(),iodata()) -> 'ok' | {'error', file:posix() | 'unsupported'}.
* because of the ambiguity in arity, each call can't have an optional
timeout
call/5 sets a timeout of 'infinity'. The timeout isn't accessible from
the "named" functions (e.g., fork, chdir, ...), which means that users
who need the timeout will have to resort to using the call/5
interface. An unfortunate side effect of using call/5 is that dialyzer
won't be able to type check the arguments.
The result of removing the functions without the fork path is that the
code ends up being simpler and more consistent.
Allow any Erlang process to send messages into the port, similar
to the way port drivers can respond directly to a process. The Unix PID
fork path is used as the key to map the port response to the Erlang PID
by the gen_server.
The current implementation is an experiment only and has a race
condition. Process A calls into the gen_server and blocks. Process B calls
into the port for the same Unix fork path and blocks. The response for
both requests will be sent to process B and process A will block forever.
In other words, the last erlang process to make a request to the fork
path becomes the controlling process. If the erlang process dies, any
data generated by the Unix process is sent to the process that started
the gen_server.
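The race can be modelled without a port at all. The sketch below is not
the alcove implementation: it assumes the gen_server state is a map from
fork path to the Erlang PID of the most recent caller. The module and
function names (fork_owner, request/3, owner/2) are hypothetical.

```erlang
-module(fork_owner).
-export([request/3, owner/2]).

%% Hypothetical model of the gen_server state: a map of
%% fork path => Erlang pid of the most recent caller.

%% Record Caller as the process waiting on ForkPath. Any earlier
%% caller for the same fork path is silently replaced.
-spec request([non_neg_integer()], pid(), map()) -> map().
request(ForkPath, Caller, State) ->
    maps:put(ForkPath, Caller, State).

%% Look up which process a port response for ForkPath is routed to.
-spec owner([non_neg_integer()], map()) -> {ok, pid()} | error.
owner(ForkPath, State) ->
    maps:find(ForkPath, State).
```

If process A and then process B call request/3 for the same fork path,
owner/2 returns B: both responses are routed to B and A is never woken.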
What the behaviour should be needs to be defined. For example, using the
gen_tcp behaviour of controlling_process would be problematic: since the
response always goes to the controlling process, a call from another
erlang process would hang. After the Unix process has called exec(),
allowing multiple processes to send in data may make sense.
Making the controlling process the only process privileged to talk to
the fork path would have some weird side effects:
* data for the fork path would have to be serialized through the
controlling process
* if a non-owning process sends a message to the fork path, either we
have to extend the type spec for each call to include
{error,not_owner} or we send a badsig exception and the client is
forced to wrap each call in a try/catch
There should also be a concept of linking between unix and erlang
processes:
* processes are unlinked
* erlang process dies: stdout/stderr from the unix process should
be dropped
* unix process dies: controlling process gets the normal exit
messages (exit_status, termsig)
* processes are linked
* erlang process dies: unix process gets a SIGKILL
* unix process dies: erlang process gets an exit(kill)
* erlang process monitors unix process (?)
* unix process dies: erlang process gets a 'DOWN' message
To test the concept works, modify the tcplxc example to talk directly to
the port from each erlang container process (the example needs much more
cleanup and should be converted to an OTP process).
Mount filesystems using the Solaris-specific version of the mount
interface. Solaris adds options and options length parameters to
mount; the options parameter takes a NUL-terminated string of
comma-separated arguments. On other Unixes the options are either
included in the mount flags
(MS_NOEXEC, ...) or in the data argument (<<"size=128M">> for tmpfs).
The behaviour of mount(2) on Solaris is bizarre: the options argument is
input/output, with the mount options placed in the buffer on return.
If MS_OPTIONSTR is present in the mount flags and the options buffer
is too small, the mount call returns -1 and errno is set to EOVERFLOW
but the mount actually succeeds! A more robust interface might truncate
the options to the size of the buffer, possibly setting the options
length to the required length, and return 0.
Surprisingly, the options buffer can also be too large. This is so
weird, it must be a bug in alcove. If the buffer exceeds a certain size,
mount returns -1 with errno set to EINVAL. The mount fails in this case:
{error,einval} = alcove:mount(Drv, [Child], "swap", Dir, "tmpfs",
[ms_optionstr], <<>>, <<"size=16m", 0:(1024*8)>>).
Since Solaris has this extra argument, mount/6,7 has to be extended to
mount/7,8, forcing all platforms to pass in an options parameter. This
parameter is ignored on all platforms except Solaris. The result is that
the mount interface is not the Linux mount(2) interface, it is some weird
hybrid that is awkward to use on all platforms (the interface does not
map to the mount(2) man page on any platform).
The value of the option parameter is also not returned to the caller on
Solaris. Options to fix this include:
* breaking out Solaris mount(2) to a Solaris specific call (mountext or
whatever)
* checking if opt is non-NULL on return and opt len > 0. If so, return:
{ok, binary()}
This extends the mount/7,8 type to:
ok | {ok,binary()} | {error,posix()}
Minimize the arguments for the port. These options can be set using
alcove:setopt/3,4.
Command line arguments should be reserved for options that can be set
only at start up.
open(2) on Linux, OpenBSD, FreeBSD and NetBSD supports the O_CLOEXEC
flag. Close on exec can be set by the caller.
This change allows passing privileged file descriptors to an unprivileged
process, by:
* opening the fd as root
* dropping privs
* calling exec
fcntl(2) can be introduced later if there are any platforms that do not
support the O_CLOEXEC flag.
Pass in an integer or a list of integers/atoms for the mount flags
parameter.
The mount flags parameter is an unsigned integer but the decode function
returns an int. While this is probably ok, there are a number of typecasts
in the lookup code that need to be cleaned up.
Accept an integer or a list of atoms/integers for the flags argument. If
a list is passed, the values are OR'ed together. The values for atoms are
looked up from the constants defined in the system header files.
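A minimal sketch of the conversion, for illustration only. The
mount_flags module, its function names and the lowercase atom spellings
are assumptions, and the constant table is a hard-coded subset of the
Linux values rather than a lookup from the system header files.

```erlang
-module(mount_flags).
-export([to_integer/1]).

%% Illustrative subset of constants using the Linux <sys/mount.h>
%% values. This table is an assumption made for the example; it is
%% not how the values are obtained from the system headers.
constant(ms_rdonly) -> 1;
constant(ms_nosuid) -> 2;
constant(ms_nodev) -> 4;
constant(ms_noexec) -> 8;
constant(ms_remount) -> 32;
constant(ms_bind) -> 4096.

%% Accept an integer or a list of atoms/integers and return the
%% flags as a single integer.
-spec to_integer(non_neg_integer() | [atom() | non_neg_integer()]) ->
    non_neg_integer().
to_integer(Flags) when is_integer(Flags) ->
    Flags;
to_integer(Flags) when is_list(Flags) ->
    lists:foldl(fun(Flag, Acc) -> value(Flag) bor Acc end, 0, Flags).

%% Integers pass through unchanged; atoms are looked up in the table.
value(Flag) when is_integer(Flag) -> Flag;
value(Flag) when is_atom(Flag) -> constant(Flag).
```

In this sketch an unknown atom crashes with a function_clause error.
Whether the real lookup should crash or return an error tuple is a
separate decision.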
Default environment variables can be overridden by the user.
Rename the proplist keyword from 'env' to 'environ', since it clashes
with the 'env' option to alcove_drv:start/1. These options should be
namespaced or put into another proplist to prevent this.
Mount /etc as a tmpfs filesystem. By default, create passwd and group
files. These can be overwritten by the user.
System files are created while running inside the read-only mount
namespace, so attempts to write to the bind mounted system directories will
fail.
Setting the mount flags on bind mounts requires two calls: the first to
perform the mount, the second to remount the filesystem with the appropriate
flags. According to the man page, the bind flag is required in both
mounts:
Note that behavior of the remount operation depends on the /etc/mtab
file. The first command stores the 'bind' flag to the /etc/mtab
file and the second command reads the flag from the file. If you
have a system without the /etc/mtab file or if you explicitly
define source and target for the remount command (then mount(8)
does not read /etc/mtab), then you have to use bind flag (or option)
for the remount command too. For example:
mount --bind olddir newdir
mount -o remount,ro,bind olddir newdir
Destroy the cgroup when the container exits. Use a constant name for the
container hostname ("alcove" + os pid). If leaking the pid is a concern,
the code could generate random bytes on startup and hash(pid, bytes).
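One way the hash(pid, bytes) idea could look, as a sketch only. The
container_name module, the choice of SHA-256, the 16 byte salt and the
truncation to 8 bytes of digest are all assumptions made for the
example. It needs the crypto application.

```erlang
-module(container_name).
-export([new_salt/0, hostname/2]).

%% Generate random bytes once, at startup.
-spec new_salt() -> binary().
new_salt() ->
    crypto:strong_rand_bytes(16).

%% Derive a hostname from the OS pid and the salt so the pid is not
%% exposed directly. Only the first 8 bytes of the digest are used,
%% hex encoded, to keep the name short.
-spec hostname(non_neg_integer(), binary()) -> binary().
hostname(Pid, Salt) ->
    <<Digest:8/binary, _/binary>> =
        crypto:hash(sha256, [integer_to_binary(Pid), Salt]),
    Hex = << <<(hex(N))>> || <<N:4>> <= Digest >>,
    <<"alcove", Hex/binary>>.

%% Map a 4-bit value to its lowercase hex digit.
hex(N) when N < 10 -> $0 + N;
hex(N) -> $a + N - 10.
```

The name stays constant for a given pid while the salt is kept, and
changes when the salt is regenerated at the next startup.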
Use a list of binaries as the namespace:
[<<"alcove">>, <<"guest1234">>] % <<"alcove/guest1234">>
is converted to:
<<"/sys/fs/cgroup/blkio/alcove/guest1234">>
<<"/sys/fs/cgroup/cpu/alcove/guest1234">>
<<"/sys/fs/cgroup/cpuacct/alcove/guest1234">>
<<"/sys/fs/cgroup/cpuset/alcove/guest1234">>
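The conversion above can be sketched as follows. The cgroup_path module
and its function names are hypothetical. The /sys/fs/cgroup mount point
is taken from the paths above, and the list of subsystems is passed in
by the caller because not every host has all of them.

```erlang
-module(cgroup_path).
-export([join/1, expand/2]).

%% Join a namespace (a list of binaries) into a relative path:
%% [<<"alcove">>, <<"guest1234">>] becomes <<"alcove/guest1234">>.
-spec join([binary()]) -> binary().
join(Namespace) ->
    iolist_to_binary(lists:join(<<"/">>, Namespace)).

%% Build one absolute path per subsystem under the cgroup mount point.
-spec expand([binary()], [binary()]) -> [binary()].
expand(Subsystems, Namespace) ->
    Path = join(Namespace),
    [<<"/sys/fs/cgroup/", Subsystem/binary, "/", Path/binary>>
     || Subsystem <- Subsystems].
```

For example, cgroup_path:expand([<<"blkio">>, <<"cpu">>],
[<<"alcove">>, <<"guest1234">>]) yields the first two paths listed
above.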
With the current cgroup limits, a fork bomb causes the container cgroup
limit to be exceeded and the fork bomb is killed. Works well.
Add a resource limit on the number of processes. It is sort of redundant
given the cgroup limit but is useful on systems without cgroup support.
Not all the cgroups may exist. For example, stock Raspbian doesn't
have cpuset. The cgroup code could be smarter about this, but it'd
complicate the example.
Rough working code for creating a Linux container, restricted by
cgroups:
erl: tcplxc:start().
shell: nc localhost 31337
Multiple containers are supervised by one port in this example.