# berty

![Erlang Punch Berty License](https://img.shields.io/github/license/erlang-punch/berty)
![Erlang Punch Berty Top Language](https://img.shields.io/github/languages/top/erlang-punch/berty)
![Erlang Punch Berty Workflow Status (main branch)](https://img.shields.io/github/actions/workflow/status/erlang-punch/berty/test.yaml?branch=main)
![Erlang Punch Berty Last Commit](https://img.shields.io/github/last-commit/erlang-punch/berty)
![Erlang Punch Berty Code Size (bytes)](https://img.shields.io/github/languages/code-size/erlang-punch/berty)
![Erlang Punch Berty Repository File Count](https://img.shields.io/github/directory-file-count/erlang-punch/berty)
![Erlang Punch Berty Repository Size](https://img.shields.io/github/repo-size/erlang-punch/berty)

A clean, safe and flexible implementation of BERT, a data-structure
format inspired by Erlang ETF.

This project is in active development, and should not be used in
production yet.

## Features

Primary features:

 - [x] High level implementation of ETF in pure Erlang
 - [x] Atoms protection and limitation
 - [ ] Fine grained filtering based on type
 - [ ] Callback function or MFA
 - [ ] Fallback to `binary_to_term` function on demand
 - [ ] Drop terms on demand
 - [ ] Term size limitation
 - [ ] Custom options for term
 - [ ] Property based testing
 - [ ] BERT parser subset
 - [ ] Depth type protection
 - [ ] Fully documented
 - [ ] +90% coverage
 - [ ] 100% compatible with standard ETF
 - [ ] 100% compatible with BERT

Secondary features:

 - [ ] Global or fine grained statistics
 - [ ] Profiling and benchmarking facilities
 - [ ] Logging facilities
 - [ ] Tracing facilities
 - [ ] ETF path
 - [ ] ETF schema
 - [ ] Custom parser subset based on behaviors
 - [x] ETF as stream of data
 - [ ] Usage example with ETF, BERT and/or custom parser
 - [ ] Low level optimization (optimized module with merl)

## Usage

Berty was created as a drop-in replacement for the `binary_to_term/1`
and `binary_to_term/2` built-in functions. In many cases, the switch is
transparent. The main idea is to protect your system from untrusted
external data, in particular from atom and memory exhaustion.

```erlang
% create an atom from scratch
Atom = term_to_binary(test).

% by default, an atom is converted to a binary
{ok, <<"test">>}
  = berty:decode(Atom).

% different methods can be used to deal with atoms.
{ok, test}
  = berty:decode(Atom, #{ atoms => {create, 0.2, warning} }).

% Other terms are supported
Terms = term_to_binary([{ok,1.0,"test",<<>>}]),
{ok, [{ok,1.0,"test",<<>>}]}
  = berty:decode(Terms).
```

More features are present, for example, dropping terms or creating
custom callbacks.

```erlang
Lists = term_to_binary([1024,<<>>,"test"]).

% let's drop all integers
{ok, [<<>>, "test"]}
  = berty:decode(Lists, #{ integer_ext => drop
                         , small_integer_ext => drop
                         }).

% let's create a custom callback
Callback = fun
  (_Term, Rest) ->
    {ok, doh, Rest}
end.
{ok, [doh, <<>>, "test"]}
  = berty:decode(Lists, #{ integer_ext => {callback, Callback}
                         , small_integer_ext => {callback, Callback}
                         }).

% let's create another one.
Callback2 = fun
  (Term, Rest) when 1024 =:= Term ->
    logger:warning("catch term ~p", [1024]),
    {ok, Term, Rest};
  (Term, Rest) -> {ok, Term, Rest}
end.

{ok, [1024, <<>>, "test"]}
  = berty:decode(Lists, #{ integer_ext => {callback, Callback2}
                         , small_integer_ext => {callback, Callback2}
                         }).
```

These are simple examples; more features are available and others will
be added. Here are the most important functions:

 - `berty:decode/1`: standard BERT decoder with default options
 - `berty:decode/2`: standard BERT decoder with custom options
 - `berty:decode/3`: custom decoder with custom options
 - `berty:encode/1`: standard BERT encoder with default options
 - `berty:encode/2`: standard BERT encoder with custom options
 - `berty:encode/3`: custom encoder with custom options
 - `berty:binary_to_term/1`: wrapper around `binary_to_term/1`
 - `berty:term_to_binary/1`: wrapper around `term_to_binary/1`

## Build

```sh
rebar3 compile
rebar3 shell
```

## Test

```sh
rebar3 as test eunit
rebar3 as test shell
```

# FAQ

## Why create another BERT implementation?

Mainly because of atom management. In fact, `binary_to_term/1` and
`term_to_binary/1` are not safe: if unknown data comes from an
untrusted source, it's quite easy to kill the node by overflowing its
atom table, and probably a full cluster as well if this data is
shared.

```erlang
% first erlang shell
file:write_file("atom1", term_to_binary([ list_to_atom("$test-" ++ integer_to_list(X)) || X <- lists:seq(1,1_000_000) ])).
% second erlang shell
file:write_file("atom2", term_to_binary([ list_to_atom("$test-" ++ integer_to_list(X)) || X <- lists:seq(1_000_000,2_000_000) ])).
```

Now restore those 2 files on another node.

```erlang
% third erlang shell
f(D), {ok, D} = file:read_file("atom1"), binary_to_term(D).
f(D), {ok, D} = file:read_file("atom2"), binary_to_term(D).
no more index entries in atom_tab (max=1048576)

Crash dump is being written to: erl_crash.dump...done
```

Doh. The Erlang VM crashed. We can fix that in many different ways;
here are a few examples:

 - avoid using the `binary_to_term/1` and `term_to_binary/1` functions
   and instead create our own parser based on the ETF specification.
   When terms are deserialized, atoms can be (1) converted to existing
   atoms, (2) converted to binaries or lists, or (3) simply dropped or
   replaced with something that signals this part of the data is
   dangerous.

 - keep our own local atom table containing all deserialized atoms. A
   soft/hard limit can be set.
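
The three atom policies from the first option can be sketched in a few lines. This is only an illustration, not Berty's implementation: the module name `atom_policy_sketch` and the policy names `existing`, `binary` and `drop` are assumptions made for this example.

```erlang
-module(atom_policy_sketch).
-export([convert/2]).

%% Convert an atom name received as a binary, without ever creating
%% a new atom. The policy names below are hypothetical.

%% (1) reuse the atom only if it already exists on this node,
%%     otherwise keep the name as a binary
convert(Name, existing) when is_binary(Name) ->
    try
        binary_to_existing_atom(Name, utf8)
    catch
        error:badarg -> Name
    end;
%% (2) always keep the name as a binary
convert(Name, binary) when is_binary(Name) ->
    Name;
%% (3) replace the atom with a marker flagging the dropped data
convert(Name, drop) when is_binary(Name) ->
    {dropped_atom, byte_size(Name)}.
```

None of these clauses can grow the atom table, whatever the input is.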

## Oh? Really? Is it serious?

In fact, a simple solution already exists: the `safe` option of
[`binary_to_term/2`](https://www.erlang.org/doc/man/erlang.html#binary_to_term-2). It
prevents the creation of new atoms (the `used` option only reports how
many bytes were read), but how many projects are using it?

- [`mojombo/bert.erl`](https://github.com/mojombo/bert.erl):
  https://github.com/mojombo/bert.erl/blob/master/src/bert.erl#L25

  ```erlang
  -spec decode(binary()) -> term().

  decode(Bin) ->
    decode_term(binary_to_term(Bin)).

  ```

- [`mojombo/ernie`](https://github.com/mojombo/ernie):
  https://github.com/mojombo/ernie/blob/master/elib/ernie_server.erl#L178

  ```erlang
  receive_term(Request, State) ->
    Sock = Request#request.sock,
      case gen_tcp:recv(Sock, 0) of
          {ok, BinaryTerm} ->
            logger:debug("Got binary term: ~p~n", [BinaryTerm]),
            Term = binary_to_term(BinaryTerm),
  ```

- [`synrc/n2o`](https://github.com/synrc/n2o):
  https://github.com/synrc/n2o/blob/master/src/services/n2o_bert.erl#L8

  ```erlang
  encode(#ftp{}=FTP) -> term_to_binary(setelement(1,FTP,ftpack));
  encode(Term)       -> term_to_binary(Term).
  decode(Bin)        -> binary_to_term(Bin).
  ```

- [`ferd/bertconf`](https://github.com/ferd/bertconf):
  https://github.com/ferd/bertconf/blob/master/src/bertconf_lib.erl#L10

  ```erlang
  decode(Bin) ->
      try validate(binary_to_term(Bin)) of
        Terms -> {ok, Terms}
      catch
        throw:Reason -> {error, Reason}
      end.
  ```

- [`a13x/aberth`](https://github.com/a13x/aberth):
  https://github.com/a13x/aberth/blob/master/src/bert.erl#L25

  ```erlang
  -spec decode(binary()) -> term().

  decode(Bin) ->
    decode_term(binary_to_term(Bin)).
  ```


- [`yuce/bert.erl`](https://github.com/yuce/bert.erl):
  https://github.com/yuce/bert.erl/blob/master/src/bert.erl#L24

  ```erlang
  -spec decode(binary()) -> term().
  decode(Bin) ->
      decode_term(binary_to_term(Bin)).
  ```

- And probably many more, as a search on
  [`searchcode.com`](https://searchcode.com/?lan=25&q=binary_to_term)
  or
  [`github.com`](https://github.com/search?q=binary_to_term+language%3AErlang&type=code&l=Erlang)
  suggests.

It's highly probable that many of those functions are hard to reach
from the outside, but some of them could be. When unknown data comes
in, `erlang:binary_to_term/1` and even `erlang:binary_to_term/2` should
be avoided or used carefully.
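
As an illustration of "used carefully", here is a minimal sketch that wraps `binary_to_term/2` with two guards. The module name `safe_decode_sketch`, the `MaxBytes` parameter and the error atoms are assumptions made for this example, and the size check only bounds the encoded payload, not the memory used by the decoded term.

```erlang
-module(safe_decode_sketch).
-export([decode/2]).

%% Decode an untrusted payload with two guards:
%%  - reject payloads larger than MaxBytes before decoding;
%%  - use the `safe` option so that no new atom is created.
decode(Bin, MaxBytes) when is_binary(Bin), byte_size(Bin) > MaxBytes ->
    {error, too_large};
decode(Bin, _MaxBytes) when is_binary(Bin) ->
    try
        {ok, binary_to_term(Bin, [safe])}
    catch
        %% raised for invalid data and for unknown atoms
        error:badarg -> {error, unsafe_or_invalid}
    end.
```

The caller gets a tagged tuple in every case, so a hostile payload is reported as an error instead of raising an exception in the calling process.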

## Why am I not aware of that?

A few articles[^erlef-atom-exhaustion][^paraxial-atom-dos] have been
written in the past to explain these problems. On my side, if I were in
charge of fixing this issue, I would probably proceed in two
steps.

In the first step, I would probably create a workaround in the atom
creation function, with a soft/hard limit. When the soft limit is
reached, warnings are displayed, but new atoms can still be created.
When the hard limit is reached, atoms can't be created anymore, and
exceptions are raised instead of crashing the host.
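
The soft/hard limit idea can be approximated in user code today. The following is a sketch under assumptions, not a patch to the VM: the module name `atom_limit_sketch` and the two ratios are hypothetical, and it returns an error tuple rather than raising an exception. It relies on `erlang:system_info(atom_count)` and `erlang:system_info(atom_limit)` to read the node's atom table usage.

```erlang
-module(atom_limit_sketch).
-export([status/0, status/4, guarded_atom/1]).

%% Hypothetical thresholds, as ratios of the node's atom limit.
-define(SOFT_RATIO, 0.80).
-define(HARD_RATIO, 0.95).

%% Classify the current atom table usage of the local node.
status() ->
    status(erlang:system_info(atom_count),
           erlang:system_info(atom_limit),
           ?SOFT_RATIO, ?HARD_RATIO).

%% Pure version, easy to test with arbitrary numbers.
status(Count, Limit, _Soft, Hard) when Count >= Limit * Hard ->
    hard_limit;
status(Count, Limit, Soft, _Hard) when Count >= Limit * Soft ->
    soft_limit;
status(_Count, _Limit, _Soft, _Hard) ->
    ok.

%% Create an atom from a binary only while the hard limit is not reached.
guarded_atom(Name) when is_binary(Name) ->
    case status() of
        hard_limit ->
            {error, atom_hard_limit};
        soft_limit ->
            logger:warning("atom soft limit reached, creating ~p", [Name]),
            {ok, binary_to_atom(Name, utf8)};
        ok ->
            {ok, binary_to_atom(Name, utf8)}
    end.
```

This only protects code paths that go through `guarded_atom/1`; atoms created elsewhere, for example by `binary_to_term/1`, still bypass the check.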

In the second step, I would probably create a flexible interface to
deal with atoms and split the problem in two:

 1. create a fixed atom store containing only atoms from source code
    (Erlang release and project); this one can't grow.

 2. create a second atom store containing atoms created dynamically
    at runtime; this one can grow.

What worries me is Mnesia. What could happen if someone manages to
store more than 2M unwanted atoms in Mnesia or DETS? How would the
cluster behave? And how could that be fixed if it's critical?

Unfortunately, I think it will totally break atom performance, but it
could be an interesting project to learn how Erlang BEAM works under
the hood.

[^erlef-atom-exhaustion]: https://erlef.github.io/security-wg/secure_coding_and_deployment_hardening/atom_exhaustion.html
[^paraxial-atom-dos]: https://paraxial.io/blog/atom-dos

## Are atoms the only issue there?

Well, it depends. If you are receiving a (very) long string or list
of terms, it will have a direct impact on memory and will eventually
lead to memory exhaustion:

```erlang
% size of the list should be checked
% if not, memory exhaustion can happen
[ $1 || _ <- lists:seq(0,160_000_000) ].
% eheap_alloc: Cannot allocate 3936326656 bytes of memory (of type "heap").
% Crash dump is being written to: erl_crash.dump...
```

The same behavior can be triggered using binaries:

```erlang
% big binaries can crash the BEAM
binary_to_term(<<131, 111, 4294967294:32/unsigned-integer, 0:8/integer, 255:8, 0:4294967280/unsigned-integer>>).
% binary_alloc: Cannot allocate 4294967293 bytes of memory (of type "binary").
% Crash dump is being written to: erl_crash.dump...
```

An ETF payload carrying a very large integer (`LARGE_BIG_EXT`) can also
have an impact on CPU usage. The following code can cause a denial of
service, especially if many processes decode such payloads at the same
time:

```erlang
% big payload, high cpu usage, no crash.
% size of the big integer must be checked
% size: 2**18-1, binary byte size: 262_150 (~262kB)
_ = binary_to_term(<<131, 111, 262_143:32/unsigned-integer, 0:8/integer, 255:2_097_144/unsigned-integer>>).

% size: 2**19-1, binary byte size: 524_294 (~524kB)
_ = binary_to_term(<<131, 111, 524_287:32/unsigned-integer, 0:8/integer, 255:4_194_296/unsigned-integer>>).

% size: 2**20-1, binary byte size: 1_048_582 (~1MB)
_ = binary_to_term(<<131, 111, 1_048_575:32/unsigned-integer, 0:8/integer, 255:8_388_600/unsigned-integer>>).
```
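
One way to limit this is to look at the header of the payload before decoding it. The sketch below is an assumption-laden illustration, not Berty's code: the module name `big_guard_sketch` and the `MaxBytes` limit are hypothetical, and only the top-level term is inspected, so a big integer nested inside a list or tuple is not caught.

```erlang
-module(big_guard_sketch).
-export([check/2]).

%% Inspect the header of an ETF payload before decoding it.
%% 131 is the ETF version byte; 110 (SMALL_BIG_EXT) and 111
%% (LARGE_BIG_EXT) announce the size of the integer in bytes.
%% MaxBytes is a hypothetical limit chosen by the caller.
check(<<131, 111, Size:32/unsigned-integer, _/binary>>, MaxBytes)
  when Size > MaxBytes ->
    {error, {big_integer_too_large, Size}};
check(<<131, 110, Size:8/unsigned-integer, _/binary>>, MaxBytes)
  when Size > MaxBytes ->
    {error, {big_integer_too_large, Size}};
%% anything else is accepted here; this function does not validate ETF
check(Bin, _MaxBytes) when is_binary(Bin) ->
    ok.
```

Covering nested terms requires walking the whole payload, which is what a high-level decoder can do at each step.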

A long node name can crash the VM during startup, because the name of
the node is encoded as an atom (an `atom_ext` term), which is limited
to 255 characters. If the full node name is longer than 255
characters, the VM crashes.

```sh
erl -sname $(pwgen -A0 252 1)
# Crash dump is being written to: erl_crash.dump...done

erl -name $(pwgen -A0 246 1)@localhost
# Crash dump is being written to: erl_crash.dump...done
```

It's highly probable other terms can have a deadly impact on a node or
a cluster.

## How to fix the root cause?

The root of the problem is atoms, and at least one
paper[^atom-garbage-collection] discusses it. Fixing the garbage
collection issue could help a lot, but if that's not possible for
whatever reason, using a high-level implementation of ETF with some way
to control what kind of data comes in might be an "okayish" solution.

The "Let it crash" philosophy is quite nice when developing high-level
applications that interact in a safe place, but it can't be applied
where uncontrolled data comes in. Some functions, like
`binary_to_term/1`, must be avoided at all costs.

[^atom-garbage-collection]: Atom garbage collection by Thomas Lindgren, https://dl.acm.org/doi/10.1145/1088361.1088369

## What about ETF schema?

This answer is a draft, a sandbox to design an Erlang ETF schema
feature.

It might be great to have a syntax to describe ETF schemas, a bit like
protobuf[^protobuf], JSON Schema[^json-schema], XML[^xml] (with
XSLT[^xlst]) or ASN.1[^asn.1]. In fact, when I started looking for
prior work on this feature, I also found the UBF[^ubf] project from Joe
Armstrong.

```erlang
schema1() ->
  integer().

schema2() ->
  tuple([[atom(ok), integer()]
        ,[atom(error), string(1024)]]).

% fun ({ok, X}) when is_integer(X) -> true;
%     ({error, X}) when is_list(X) andalso length(X) =< 1024 -> is_string(X);
%     (_) -> false.
%     (_) -> false.

schema3() ->
  tuple(
```

Here is a possible final representation.

```erlang
[{tuple, [{atom, [ok]}, {integer, []}]}
,{tuple, [{atom, [error]}, {string, [1024]}]}
]
% or
[[tuple, [2]]
,[atom, [ok,error]]
,[integer, []]
,[string, [1024]]
].
```
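
To make the idea more concrete, here is a small sketch of a validator for the first representation above, where a schema is a list of alternatives. Everything in it is hypothetical: the module name `etf_schema_sketch`, the `validate/2` function and the exact meaning given to each spec are assumptions for this draft, not an existing Berty feature.

```erlang
-module(etf_schema_sketch).
-export([validate/2]).

%% A schema is a list of alternatives; a term is valid when it
%% matches at least one of them.
validate(Schema, Term) when is_list(Schema) ->
    lists:any(fun(Spec) -> match(Spec, Term) end, Schema).

%% {tuple, Specs}: same arity, and every element matches its spec
match({tuple, Specs}, Term) when is_tuple(Term) ->
    Elements = tuple_to_list(Term),
    length(Specs) =:= length(Elements)
        andalso lists:all(fun({S, E}) -> match(S, E) end,
                          lists:zip(Specs, Elements));
%% {atom, Allowed}: the term is one of the allowed atoms
match({atom, Allowed}, Term) when is_atom(Term) ->
    lists:member(Term, Allowed);
%% {integer, []}: any integer
match({integer, []}, Term) ->
    is_integer(Term);
%% {string, [Max]}: a flat printable list of at most Max characters
match({string, [Max]}, Term) when is_list(Term) ->
    io_lib:printable_list(Term) andalso length(Term) =< Max;
match(_Spec, _Term) ->
    false.
```

With the schema shown above, `{ok, 42}` and `{error, "bad input"}` are accepted, while `{ok, "text"}` is rejected.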

[^protobuf]: https://protobuf.dev/overview/
[^json-schema]: https://json-schema.org/
[^xml]: https://en.wikipedia.org/wiki/XML
[^xlst]: https://en.wikipedia.org/wiki/XSLT
[^asn.1]: https://en.wikipedia.org/wiki/ASN.1
[^ubf]: https://ubf.github.io/ubf/ubf-user-guide.en.html

## What about an ETF path feature?

A feature similar to XPath or JSONPath is also needed, with a syntax
that is easy to write and to understand. I would like to include:

 1. pattern matching

```erlang
% how to create an ETF path?
% first example
% ETF = #{ key => #{ key2 => { ok, "test"} } }.
"test" = path(ETF, "#key#key2{ok,@}").

% second example
% ETF = [{ok, "test"}, {error, badarg}, {ok, "data"}].
[{ok, "test"},{ok, "data"}] = path(ETF, "[{ok,_}]").
% or, with an alternative path syntax: "[]{ok,_}"

% third example
% ETF = {ok, #{ <<"data">> => [<<"test">>] }}.
[<<"test">>] = path(ETF, "{ok,@}#!data").
```
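
Since the string syntax is not defined yet, the sketch below uses a list of steps instead of a path string. It is purely hypothetical: the module name `etf_path_sketch`, the step names `key`, `tag` and `select`, and the `{ok, Value} | error` return shape are assumptions for this draft and differ from the bare values returned in the examples above.

```erlang
-module(etf_path_sketch).
-export([path/2]).

%% Walk a term following a list of steps.
%% Returns {ok, Value} on success, or error when a step does not apply.
path(Term, []) ->
    {ok, Term};
%% {key, K}: descend into a map
path(Term, [{key, K} | Steps]) when is_map(Term) ->
    case maps:find(K, Term) of
        {ok, Value} -> path(Value, Steps);
        error -> error
    end;
%% {tag, T}: unwrap a {T, Value} tuple
path({T, Value}, [{tag, T} | Steps]) ->
    path(Value, Steps);
%% {select, T}: keep the {T, _} tuples of a (proper) list
path(Term, [{select, T} | Steps]) when is_list(Term) ->
    path([E || {Tag, _} = E <- Term, Tag =:= T], Steps);
path(_Term, _Steps) ->
    error.
```

For instance, the first example would be written `path(ETF, [{key, key}, {key, key2}, {tag, ok}])`, and the second one `path(ETF, [{select, ok}])`.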

## Nothing to add?

When I wrote [Serialization series — Do you speak Erlang ETF or BERT?
(part
1)](https://medium.com/@niamtokik/serialization-series-do-you-speak-erlang-etf-or-bert-part-1-ff70096b50c0)
in 2017, someone told me to check another project called
[`jem.js`](https://github.com/inaka/jem.js) and read [Replacing JSON
when talking to Erlang](http://inaka.net/blog/2016/08/17/why-json/)
([archive](https://web.archive.org/web/20180301221900/http://inaka.net/blog/2016/08/17/why-json/))
blog post. What's funny here is this:

```erlang
handle_post(Req, State) ->
  {ok, Body, Req1} = cowboy_req:body(Req),
  Decoded = erlang:binary_to_term(Body),
  Reply = do_whatever(Decoded),
  {erlang:term_to_binary(Reply), Req1, State}.
```

Yes, it is "faster and more efficient", but it can destroy your whole
platform in a few seconds. Don't do that. Please. Unfortunately,
[inaka.net](http://inaka.net) seems to be down; it would have been fun
to play with that.

## Is there a "risk analysis" for each term somewhere?

Probably, but I did not find much on that. Here is a short summary
stating, for each term, whether it is safe or not, along with the
associated risk(s).

| Terms                 | Code |    Safe? | Risks                              |
|:----------------------|-----:|---------:|:-----------------------------------|
| `ATOM_CACHE_REF`      |   82 |       no | atom exhaustion                    |
| `ATOM_EXT`            |  100 |       no | atom exhaustion                    |
| `ATOM_UTF8_EXT`       |  118 |       no | atom exhaustion                    |
| `BINARY_EXT`          |  109 |    maybe | dynamic binary length (32 bits)    |
| `BIT_BINARY_EXT`      |   77 |    maybe | dynamic bitstring length (32 bits) |
| `EXPORT_EXT`          |  113 |       no | atom exhaustion                    |
| `FLOAT_EXT`           |   99 |      yes | 31 bytes float fixed length        |
| `FUN_EXT`             |  117 |       no | atom exhaustion                    |
| `INTEGER_EXT`         |   98 |      yes | 4 bytes fixed length               |
| `LARGE_BIG_EXT`       |  111 |    maybe | dynamic integer length (32 bits)   |
| `LARGE_TUPLE_EXT`     |  105 |    maybe | dynamic tuple length (32 bits)     |
| `LIST_EXT`            |  108 |    maybe | dynamic list length (32 bits)      |
| `LOCAL_EXT`           |  121 |      yes | atom exhaustion                    |
| `MAP_EXT`             |  116 |    maybe | dynamic pair length (32 bits)      |
| `NEWER_REFERENCE_EXT` |   90 |       no | memory exhaustion                  |
| `NEW_FLOAT_EXT`       |   70 |      yes | 8 bytes fixed float                |
| `NEW_FUN_EXT`         |  112 |       no | atom exhaustion                    |
| `NEW_PID_EXT`         |   88 |       no | atom exhaustion                    |
| `NEW_PORT_EXT`        |   89 |       no | atom exhaustion                    |
| `NEW_REFERENCE_EXT`   |  114 |    maybe | dynamic reference length (16 bits) |
| `NIL_EXT`             |  106 |      yes | fixed length                       |
| `PID_EXT`             |  103 |       no | atom exhaustion                    |
| `PORT_EXT`            |  102 |       no | atom exhaustion                    |
| `REFERENCE_EXT`       |  101 |       no | atom exhaustion                    |
| `SMALL_ATOM_EXT`      |  115 |       no | atom exhaustion                    |
| `SMALL_ATOM_UTF8_EXT` |  119 |       no | atom exhaustion                    |
| `SMALL_BIG_EXT`       |  110 |    maybe | dynamic integer length (8 bits)    |
| `SMALL_INTEGER_EXT`   |   97 |      yes | 1 byte fixed length                |
| `SMALL_TUPLE_EXT`     |  104 |    maybe | dynamic tuple length (8 bits)      |
| `STRING_EXT`          |  107 |    maybe | dynamic string length (16 bits)    |
| `V4_PORT_EXT`         |  120 |       no | atom exhaustion                    |

# Resources

 - [BERT-RPC Official](https://bert-rpc.org) [(archive)](https://web.archive.org/web/20160304092040/http://bert-rpc.org/)
 - [BERT-RPC Google group](https://groups.google.com/g/bert-rpc)