Tuesday, September 18, 2018

There is More Than 0 and 1 to Binary Data in Puppet

There is More Than 0 and 1 to Binary Data In Puppet

There is More Than 0 and 1 to Binary Data in Puppet

Sometimes you have the need to manage resources that have binary data. Historically you needed luck to be on your side to make this work (more about this below), and more recently Puppet was given the ability to handle binary data by using the Puppet specific PSON on the wire format instead of JSON.

The problems caused by binary data are caused by two things:

  • Ruby does not have a data type for binary data other than a String with ASCII-8BIT encoding.
  • There is no binary data type in textual data formats such as JSON or YAML and Base64 encoding is required of these values - but how do you then differentiate between a text string and a Base64 encoded one? You don’t want for example a binary file to end up with the Base64 encoded text representation of the binary content it was supposed to have.

Historically you needed to be lucky

Luck played a big part historically as (in very old Puppet versions) the encoding of the catalog was undefined, and in more modern versions the encoding has been set to UTF-8. The problem then was that binary content could contain bytes or sequences of bytes that were invalid in the catalog’s encoding (the undefined; whatever happened to be the default encoding, or in UTF-8).

Thus, this “kind of” worked if your binary data did not contain any illegal byte sequences.

Enter PSON

To take luck out of the equation, the PSON format was invented to allow the catalog to contain ASCII-8-BIT. It also allows Ruby specific data types to be included in the format. Using PSON is bad for several reasons:

  • It is implemented in Ruby, and is thus much slower than the native JSON support in moder Ruby.
  • Using Ruby serialization breaks JSON compatibility and it is not possible to correctly read the stream with an off-the-shelf JSON parser in other languages.
  • Regressing to using ASCII-8BIT and then either keeping strings with that encoding, or trying to make them into UTF-8 and leaving those that could not be converted as ASCII-8BIT requires processing and is not really robust as it leaves some binary content looking like UTF-8 and subsequent operations may fail.

Enter Rich Data

Puppet has for some time had the experimental option --rich_data which when turned on uses a fully JSON standard representation of data that cannot be expressed natively in a textual format like JSON or YAML.

Use of rich_data is the standard in Puppet 6.0.0. A Puppet 6 agent will request the catalog in rich data format from the master. (Before Puppet 6.0.0 using rich_data on an agent did not really work, but was fine when used with apply).

So now, finally, we can send values of all Puppet data types in a catalog, including binary data. Hooray !!

There is a lot more to say about the rich data format in the catalog - and I will get back to that after we looked at binary data in more detail.

Binary Data in Ruby

As an example - here is a string (in Ruby) that is binary (since \xFF cannot be a byte on its own in UTF-8).

bad_utf8 = "hi \xFF"
bad_utf8.bytes  # is [104, 105, 32, 255]
a.split(' ') # fails with bad encoding error

If the bad_utf8 string was returned into Puppet Language and then used as a resource parameter value the behavior is undefined as we don’t know what operations someone may do on that value. If it reaches the catalog, it would (before Puppet 6.0.0) make Puppet switch to PSON since it is invalid UTF-8. In summary: bad stuff happens. What happens in Puppet 6.0.0 is described further on.

We can make the bad_utf8 value ok by forcing the encoding to be ASCII-8-BIT like this:

ok_binary = bad_utf8.force_encoding(Encoding::ASCII_8BIT)
ok_binary.bytes  # is [104, 105, 32, 255]
ok_binary.split(' ') # works and results in ["hi", "\xFF"]

After the change of encoding we still have the exact same bytes, but now Ruby string operations know that we are dealing with just bytes and not variable length encoding of characters and operations will not fail.

If we return ok_binary and use it in the Puppet Language, there is a small risk it could still fail if the value is given to a function and it tries to change encoding and doing it wrong. If the value is assigned to a resource parameter, it will (before Puppet 6.0.0) require switching to PSON, but otherwise be fine. (Again, what happens in Puppet 6.0.0 is described further on).

Binary Data in the Puppet Language

Historically binary data in the Puppet Language was only possible by calling a function that return a string with binary content (either possibly invalid UTF-8, ASCII-8BIT encoded, or any other encoding in fact since it can be set by a user). The file() function is the primary example of this.

The Binary data type was introduced in Puppet 4.8.0 together with the function binary_file() that returns a Binary. It is also possible to create a binary from a text string, from a Base64 encoded string, or from an array of integer byte values. See the documentation for Creating a Binary for the details.

No Poking of Byte Values into Strings

In contrast to Ruby, the Puppet Language does not support the \x escape sequence for strings as it is too easy to create illegal byte sequences. Puppet supports the \u escape to insert Unicode characters into a string. While you can specify a character that may not exist the result will never blow up later with a bad encoding error.

Going the other way is possible - if you have a Binary, it can be turned into a String with \xHH content for non UTF-8 Strings (where HH is two hex digits). This is for display purposes only, such a string cannot again be parsed by the Puppet Language.
See the documentation Binary value to String for the details.

The Rich Data Format

The rich data format used as standard in Puppet 6.0.0 is based on standard JSON.
When rich data values are encountered, they are encoded as a Hash (i.e. JSON Object). Data values that can represented as a String will have two reserved keys in the resulting hash __ptype, and __pvalue where __ptype is the name of the data type and __pvalue is the value in string form. For example - with either of these two values (Puppet Language):

Binary('hello', '%s') # Make binary out of text (non base64 string)

or

Binary('aGVsbG8=') # Make binary from base64 string

would appear like this in a catalog:

{"__ptype": "Binary", "__pvalue": "aGVsbG8=\n"}

In Puppet 6.0.0: When serializing, both Binary values and ASCII-8BIT encoded strings are serialized as being Binary (as in the example above). When deserializing a catalog on the agent it will transform all instances of Binary into instances of ASCII-8BIT. Types and providers thus only needs to deal with String and possibly check for the encoding of a string (to not mess things up).

Before Puppet 6.0.0: When serializing Binary and ASCII-8BIT strings are handled without any transformation, and possibly regressing to using PSON on the wire if ASCII-8BIT strings were not UTF-8 compatible (i.e. all the mess discussed earlier). Types and Providers would need to know about the Binary data type - only the File data type supported this, and then only when using puppet apply.

Learn more

There is a lot more to say about the rich data format, and you can read all the details in the Pcore Data Representation specification, and specifically the Pcore Generic Data document that describes the format used in the Puppet 6.0.0 catalog.

Rich Data and PDB

So, what does PDB do with rich data? Can it be queried etc?

Before Puppet 6.0.0: If using --rich_data a catalog with rich data would be handled the same way as if the hashes representing rich data were just hashes. In practice this kind of worked since the data was stored and could be retrieved, but both viewing and querying was difficult. Also, Reports could contain rich data hashes, and this could cause downstream problems.

In Puppet 6.0.0: A hard decision had to be made, and the least bad option was to simply stringify rich data into most reasonable human consumable form. While this is lossy (you cannot read a catalog back again and get an actual catalog with the same data as the original), it serves the current use cases where a catalog is stored in PDB.
The same approach was taken for Reports.

So, to answer the question - “Can rich data be queried?”, well, kind off, if you base your query on the string representation.

Needless to say, the entire domain of rich data; more advanced queries and producing more meaningful diffs in Reports are examples of work that is ahead of us.

Summary

  • Use binary_file() to read binary content, or when you don’t know if the file is not UTF-8 compatible.
  • It is recommended to use the Binary data type to type binary input in all functions and to prefer returning Binary over returning ASCII-8BIT string.
  • From Puppet 6.0.0 expect binary contents to be ASCII-8BIT encoded String in types and providers, but not deferred “agent side” functions.
  • Before Puppet 6.0.0 expect binary content to be either ASCII-8BIT encoded strings or instances of the Puppet Binary data type as the representation differs depending on use of --rich_data or not, or evaluating the logic on the agent, or as part of a puppet apply.

2 comments:

  1. This comment has been removed by a blog administrator.

    ReplyDelete
  2. This comment has been removed by a blog administrator.

    ReplyDelete