There is More Than 0 and 1 to Binary Data in Puppet
Sometimes you need to manage resources that contain binary data. Historically you needed luck on your side to make this work (more about this below). More recently, Puppet gained the ability to handle binary data by using the Puppet-specific PSON format on the wire instead of JSON.
Binary data causes problems for two reasons:
- Ruby does not have a data type for binary data other than a String with ASCII-8BIT encoding.
- Textual data formats such as JSON or YAML have no binary data type, so these values must be Base64 encoded. But how do you then differentiate between a text string and a Base64-encoded one? You don’t want, for example, a binary file to end up containing the Base64 text representation of the binary content it was supposed to have.
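The ambiguity in the second point can be sketched in a few lines of Ruby. This is only an illustration: the keys `content` and `label` and the sample values are made up, and `pack('m0')` is used as core Ruby's Base64 encoder.

```ruby
require 'json'

# Raw bytes that are not valid UTF-8 and so cannot go into JSON as-is.
binary = "\xFF\xD8\xFF\xE0".b

# Base64-encode so the value survives a JSON round trip.
# Array#pack('m0') is core Ruby's Base64 encoder (no line breaks).
encoded = [binary].pack('m0')

# An ordinary text value that also happens to be valid Base64.
# Hypothetical parameter value; it is NOT meant to be decoded.
text = 'aGVsbG8='

doc = JSON.generate('content' => encoded, 'label' => text)
parsed = JSON.parse(doc)

# Both values come back as plain Strings; nothing in the document says
# which of them should be decoded back into bytes.
both_strings = parsed.values.all? { |v| v.is_a?(String) }
puts both_strings
```

A reader of this JSON has to know out of band that `content` is Base64 and `label` is not, which is exactly the information the format fails to carry.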
Historically you needed to be lucky
Luck played a big part because in very old Puppet versions the encoding of the catalog was undefined, and in more modern versions it has been set to UTF-8. The problem was that binary content could contain bytes or byte sequences that were invalid in the catalog’s encoding (either whatever happened to be the default encoding, or UTF-8).
Thus, this “kind of” worked if your binary data did not contain any illegal byte sequences.
Enter PSON
To take luck out of the equation, the PSON format was invented to allow the catalog to contain ASCII-8BIT strings. It also allows Ruby-specific data types to be included in the format. Using PSON is bad for several reasons:
- It is implemented in Ruby, and is thus much slower than the native JSON support in modern Ruby.
- Using Ruby serialization breaks JSON compatibility and it is not possible to correctly read the stream with an off-the-shelf JSON parser in other languages.
- Regressing to ASCII-8BIT, and then either keeping strings in that encoding or trying to convert them to UTF-8 while leaving those that cannot be converted as ASCII-8BIT, requires processing and is not robust: some binary content ends up looking like UTF-8, and subsequent operations may fail.
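To see why the last point is fragile, here is a rough sketch of that kind of "convert to UTF-8 if possible" heuristic. The helper `guess_encoding` is hypothetical and is not Puppet's actual code; it only shows how binary bytes can be mistaken for text.

```ruby
# Hypothetical helper, not Puppet's actual code: treat a byte string as
# UTF-8 when the bytes happen to form valid UTF-8, else keep it binary.
def guess_encoding(bytes)
  candidate = bytes.dup.force_encoding(Encoding::UTF_8)
  candidate.valid_encoding? ? candidate : bytes
end

clearly_binary  = "\xFF\xFE".b   # can never be valid UTF-8
looks_like_text = "\xC3\xA9".b   # two binary bytes that are also a valid UTF-8 sequence

kept    = guess_encoding(clearly_binary)   # stays ASCII-8BIT
misread = guess_encoding(looks_like_text)  # silently becomes a UTF-8 string

puts kept.encoding
puts misread.encoding
```

The second value was binary when it went in and is text when it comes out, so code downstream can no longer tell which values were meant to be binary.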
Enter Rich Data
Puppet has for some time had the experimental option --rich_data which, when turned on, uses a fully standard JSON representation of data that cannot be expressed natively in a textual format like JSON or YAML.
Use of rich_data is the standard in Puppet 6.0.0. A Puppet 6 agent will request the catalog in rich data format from the master. (Before Puppet 6.0.0 using rich_data on an agent did not really work, but was fine when used with apply.)
So now, finally, we can send values of all Puppet data types in a catalog, including binary data. Hooray !!
There is a lot more to say about the rich data format in the catalog - and I will get back to that after we have looked at binary data in more detail.
Binary Data in Ruby
As an example - here is a string (in Ruby) that is binary (since \xFF cannot be a byte on its own in UTF-8).
bad_utf8 = "hi \xFF"
bad_utf8.bytes # is [104, 105, 32, 255]
bad_utf8.split(' ') # fails with bad encoding error
If the bad_utf8 string were returned to the Puppet Language and then used as a resource parameter value, the behavior would be undefined, as we don’t know what operations someone may perform on that value. If it reaches the catalog, it would (before Puppet 6.0.0) make Puppet switch to PSON since it is invalid UTF-8. In summary: bad stuff happens. What happens in Puppet 6.0.0 is described further on.
We can make the bad_utf8 value ok by forcing the encoding to be ASCII-8BIT like this:
ok_binary = bad_utf8.force_encoding(Encoding::ASCII_8BIT)
ok_binary.bytes # is [104, 105, 32, 255]
ok_binary.split(' ') # works and results in ["hi", "\xFF"]
After the change of encoding we still have the exact same bytes, but now Ruby string operations know that we are dealing with plain bytes rather than a variable-length encoding of characters, so the operations will not fail.
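As a small side note, the difference can also be checked with `valid_encoding?`. The sketch below uses `String#b`, which returns an ASCII-8BIT copy, rather than `force_encoding`, which changes the receiver in place; the variable names are only for illustration.

```ruby
raw = "hi \xFF"                      # UTF-8 literal containing an invalid byte
before_valid = raw.valid_encoding?   # false: 0xFF cannot stand alone in UTF-8

# String#b returns a copy tagged ASCII-8BIT and leaves the original untouched,
# unlike force_encoding, which re-tags the receiver itself.
bin = raw.b
after_valid = bin.valid_encoding?    # any byte sequence is valid as binary
same_bytes  = raw.bytes == bin.bytes # the bytes themselves never change

puts before_valid, after_valid, same_bytes
```

Working on a copy can be handy when the original string is still referenced elsewhere and should keep its encoding tag.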
If we return ok_binary and use it in the Puppet Language, there is a small risk it could still fail if the value is given to a function that tries to change the encoding and does it wrong. If the value is assigned to a resource parameter, it will (before Puppet 6.0.0) require switching to PSON, but otherwise be fine. (Again, what happens in Puppet 6.0.0 is described further on.)
Binary Data in the Puppet Language
Historically, binary data in the Puppet Language was only possible by calling a function that returns a string with binary content (either possibly invalid UTF-8, ASCII-8BIT encoded, or in fact any other encoding, since it can be set by a user). The file() function is the primary example of this.
The Binary data type was introduced in Puppet 4.8.0 together with the function binary_file() that returns a Binary. It is also possible to create a binary from a text string, from a Base64 encoded string, or from an array of integer byte values. See the documentation for Creating a Binary for the details.
No Poking of Byte Values into Strings
In contrast to Ruby, the Puppet Language does not support the \x escape sequence for strings as it is too easy to create illegal byte sequences. Puppet supports the \u escape to insert Unicode characters into a string. While you can specify a character that may not exist, the result will never blow up later with a bad encoding error.
Going the other way is possible - if you have a Binary, it can be turned into a String with \xHH content for non UTF-8 Strings (where HH is two hex digits). This is for display purposes only; such a string cannot be parsed again by the Puppet Language.
See the documentation Binary value to String for the details.
The Rich Data Format
The rich data format used as standard in Puppet 6.0.0 is based on standard JSON.
When rich data values are encountered, they are encoded as a Hash (i.e. JSON Object). Data values that can be represented as a String will have two reserved keys in the resulting hash, __ptype and __pvalue, where __ptype is the name of the data type and __pvalue is the value in string form. For example, either of these two values (Puppet Language):
Binary('hello', '%s') # Make binary out of text (non base64 string)
or
Binary('aGVsbG8=') # Make binary from base64 string
would appear like this in a catalog:
{"__ptype": "Binary", "__pvalue": "aGVsbG8=\n"}
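The shape of that hash is easy to reproduce in plain Ruby. The following is a minimal sketch, assuming nothing about Puppet's real serializer beyond the two reserved keys described above; `to_rich_data` is a made-up helper name.

```ruby
require 'json'

# Hypothetical helper for illustration; Puppet's real serializer is more general.
def to_rich_data(bytes)
  {
    '__ptype'  => 'Binary',
    # pack('m') is core Ruby's Base64 encoder; it appends a trailing newline.
    '__pvalue' => [bytes].pack('m')
  }
end

wire = JSON.generate(to_rich_data('hello'.b))
puts wire
```

Because the value is wrapped in a hash with a type name, a reader can tell it apart from an ordinary string that merely looks like Base64.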
In Puppet 6.0.0: When serializing, both Binary values and ASCII-8BIT encoded strings are serialized as being Binary (as in the example above). When deserializing a catalog, the agent transforms all instances of Binary into ASCII-8BIT encoded strings. Types and providers thus only need to deal with String, and possibly check the encoding of a string (to not mess things up).
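The agent-side transformation can be pictured roughly as follows. This is a sketch under stated assumptions, not Puppet's deserializer: `from_rich_data` is a hypothetical walker, and the `content` and `mode` keys are sample data.

```ruby
require 'json'

# Hypothetical walker, not Puppet's deserializer: replace every
# {"__ptype": "Binary", ...} hash with the bytes it encodes.
def from_rich_data(value)
  case value
  when Hash
    if value['__ptype'] == 'Binary'
      value['__pvalue'].unpack1('m')   # Base64 decode; result is ASCII-8BIT
    else
      value.transform_values { |v| from_rich_data(v) }
    end
  when Array
    value.map { |v| from_rich_data(v) }
  else
    value
  end
end

# Sample fragment; the \n inside the single-quoted Ruby string is a JSON escape.
catalog_fragment = '{"content": {"__ptype": "Binary", "__pvalue": "aGVsbG8=\n"}, "mode": "0644"}'
params = from_rich_data(JSON.parse(catalog_fragment))

puts params['content'].encoding
puts params['mode'].encoding
```

A provider receiving `params` sees only strings and can use the encoding tag to decide whether a value is binary, which is the check mentioned above.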
Before Puppet 6.0.0: When serializing, Binary and ASCII-8BIT strings are handled without any transformation, possibly regressing to PSON on the wire if ASCII-8BIT strings are not UTF-8 compatible (i.e. all the mess discussed earlier). Types and providers would need to know about the Binary data type - only the File resource type supported this, and then only when using puppet apply.
Learn more
There is a lot more to say about the rich data format, and you can read all the details in the Pcore Data Representation specification, and specifically the Pcore Generic Data document that describes the format used in the Puppet 6.0.0 catalog.
Rich Data and PDB
So, what does PDB do with rich data? Can it be queried etc?
Before Puppet 6.0.0: If using --rich_data, a catalog with rich data would be handled as if the hashes representing rich data were just hashes. In practice this kind of worked since the data was stored and could be retrieved, but both viewing and querying were difficult. Also, Reports could contain rich data hashes, and this could cause downstream problems.
In Puppet 6.0.0: A hard decision had to be made, and the least bad option was to simply stringify rich data into the most reasonable human-consumable form. While this is lossy (you cannot read a catalog back again and get an actual catalog with the same data as the original), it serves the current use cases where a catalog is stored in PDB.
The same approach was taken for Reports.
So, to answer the question - “Can rich data be queried?”, well, kind of, if you base your query on the string representation.
Needless to say, there is work ahead of us across the entire domain of rich data; more advanced queries and more meaningful diffs in Reports are two examples.
Summary
- Use binary_file() to read binary content, or when you don’t know whether the file is UTF-8 compatible.
- It is recommended to use the Binary data type to type binary input in all functions, and to prefer returning Binary over returning an ASCII-8BIT string.
- From Puppet 6.0.0, expect binary content to be an ASCII-8BIT encoded String in types and providers, but not in deferred “agent side” functions.
- Before Puppet 6.0.0, expect binary content to be either ASCII-8BIT encoded strings or instances of the Puppet Binary data type, as the representation differs depending on whether --rich_data is used, and on whether the logic is evaluated on the agent or as part of a puppet apply.