Tuesday, September 18, 2018

There is More Than 0 and 1 to Binary Data in Puppet

Sometimes you need to manage resources that have binary data. Historically you needed luck on your side to make this work (more about this below), and more recently Puppet gained the ability to handle binary data by using the Puppet-specific PSON format on the wire instead of JSON.

The problems with binary data have two causes:

Ruby does not have a data type for binary data other than a String with ASCII-8BIT encoding.
There is no binary data type in textual data formats such as JSON or YAML, so these values must be Base64 encoded - but how do you then differentiate between a text string and a Base64 encoded one? You don’t want, for example, a binary file to end up with the Base64 encoded text representation of the binary content it was supposed to have.
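The second problem can be seen in a few lines of Ruby. This is only an illustrative sketch: the hash keys 'note' and 'logo' are made up for the example, and the Base64 encoding is done with Ruby's core Array#pack rather than with anything Puppet specific.

```ruby
require 'json'

# Binary content (the bytes of "hi \xFF"), Base64 encoded so it fits in JSON.
# 'm0' is the Array#pack directive for Base64 without line feeds.
binary_content = [104, 105, 32, 255].pack('C*')
encoded = [binary_content].pack('m0')

# Hypothetical payload: a plain text value next to the Base64 encoded one.
payload = JSON.generate('note' => 'aGVsbG8=', 'logo' => encoded)
received = JSON.parse(payload)

# Both arrive as ordinary strings; nothing in the JSON says that 'logo'
# must be decoded while 'note' must be left as it is.
text_value   = received['note']
binary_value = received['logo']
decoded_logo = binary_value.unpack1('m')   # only correct if we know it was Base64
```

Without extra information in the format, the receiver has to guess which strings to decode.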

Historically you needed to be lucky

Luck played a big part historically as (in very old Puppet versions) the encoding of the catalog was undefined, and in more modern versions the encoding has been set to UTF-8. The problem then was that binary content could contain bytes or sequences of bytes that were invalid in the catalog’s encoding (either the undefined one, that is, whatever happened to be the default encoding, or UTF-8).

Thus, this “kind of” worked if your binary data did not contain any illegal byte sequences.

Enter PSON

To take luck out of the equation, the PSON format was invented to allow the catalog to contain ASCII-8BIT. It also allows Ruby specific data types to be included in the format. Using PSON is bad for several reasons:

It is implemented in Ruby, and is thus much slower than the native JSON support in modern Ruby.
Using Ruby serialization breaks JSON compatibility and it is not possible to correctly read the stream with an off-the-shelf JSON parser in other languages.
Regressing to using ASCII-8BIT and then either keeping strings with that encoding, or trying to make them into UTF-8 and leaving those that could not be converted as ASCII-8BIT, requires processing and is not really robust: it leaves some binary content looking like UTF-8, and subsequent operations may fail.
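The robustness problem in the last point can be illustrated in plain Ruby. The sketch below is not PSON code; it only shows, under the assumption of a simple "is it valid UTF-8?" check, how two bytes of binary content can pass that check and then behave differently in later string operations.

```ruby
# Two bytes of binary content that also happen to be the UTF-8 encoding of "é".
binary = [0xC3, 0xA9].pack('C*')            # an ASCII-8BIT (binary) string

# A "convert to UTF-8 if it looks valid" heuristic accepts it...
guessed = binary.dup.force_encoding(Encoding::UTF_8)
looks_like_text = guessed.valid_encoding?

# ...and from then on string operations work on characters, not bytes.
binary_length    = binary.length            # counted in bytes
guessed_length   = guessed.length           # counted in characters
binary_reversed  = binary.reverse.bytes     # byte order is swapped
guessed_reversed = guessed.reverse.bytes    # a single character, bytes unchanged
```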

Enter Rich Data

Puppet has for some time had the experimental option --rich_data which when turned on uses a fully JSON standard representation of data that cannot be expressed natively in a textual format like JSON or YAML.

Use of rich_data is the standard in Puppet 6.0.0. A Puppet 6 agent will request the catalog in rich data format from the master. (Before Puppet 6.0.0 using rich_data on an agent did not really work, but was fine when used with apply).

So now, finally, we can send values of all Puppet data types in a catalog, including binary data. Hooray !!

There is a lot more to say about the rich data format in the catalog - and I will get back to that after we have looked at binary data in more detail.

Binary Data in Ruby

As an example - here is a string (in Ruby) that is binary (since \xFF cannot be a byte on its own in UTF-8).

bad_utf8 = "hi \xFF"
bad_utf8.bytes  # is [104, 105, 32, 255]
bad_utf8.split(' ') # fails with bad encoding error

If the bad_utf8 string was returned into Puppet Language and then used as a resource parameter value the behavior is undefined as we don’t know what operations someone may do on that value. If it reaches the catalog, it would (before Puppet 6.0.0) make Puppet switch to PSON since it is invalid UTF-8. In summary: bad stuff happens. What happens in Puppet 6.0.0 is described further on.

We can make the bad_utf8 value ok by forcing the encoding to be ASCII-8BIT like this:

ok_binary = bad_utf8.force_encoding(Encoding::ASCII_8BIT)
ok_binary.bytes  # is [104, 105, 32, 255]
ok_binary.split(' ') # works and results in ["hi", "\xFF"]

After the change of encoding we still have the exact same bytes, but now Ruby string operations know that we are dealing with just bytes and not variable length encoding of characters and operations will not fail.

If we return ok_binary and use it in the Puppet Language, there is a small risk it could still fail if the value is given to a function that tries to change the encoding and does it wrong. If the value is assigned to a resource parameter, it will (before Puppet 6.0.0) require switching to PSON, but otherwise be fine. (Again, what happens in Puppet 6.0.0 is described further on).
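One way a function can get the encoding change wrong is by transcoding instead of relabeling. A minimal Ruby sketch of the difference, using only core String methods (the variable names are just for illustration):

```ruby
binary = "hi \xFF".b   # String#b returns a copy with ASCII-8BIT encoding

# force_encoding only changes the label; the bytes are untouched.
relabeled  = binary.dup.force_encoding(Encoding::UTF_8)
same_bytes = (relabeled.bytes == binary.bytes)
valid_utf8 = relabeled.valid_encoding?   # \xFF on its own is not valid UTF-8

# encode tries to convert every byte into a UTF-8 character and fails,
# since byte 0xFF has no defined character in ASCII-8BIT.
begin
  binary.encode(Encoding::UTF_8)
  transcoding_error = nil
rescue EncodingError => e
  transcoding_error = e.class
end
```

A function that may receive binary input can check value.encoding == Encoding::ASCII_8BIT before attempting any conversion.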

Binary Data in the Puppet Language

Historically binary data in the Puppet Language was only possible by calling a function that returns a string with binary content (either possibly invalid UTF-8, ASCII-8BIT encoded, or any other encoding in fact since it can be set by a user). The file() function is the primary example of this.

The Binary data type was introduced in Puppet 4.8.0 together with the function binary_file() that returns a Binary. It is also possible to create a binary from a text string, from a Base64 encoded string, or from an array of integer byte values. See the documentation for Creating a Binary for the details.

No Poking of Byte Values into Strings

In contrast to Ruby, the Puppet Language does not support the \x escape sequence for strings as it is too easy to create illegal byte sequences. Puppet supports the \u escape to insert Unicode characters into a string. While you can specify a character that may not exist, the result will never blow up later with a bad encoding error.

Going the other way is possible - if you have a Binary, it can be turned into a String with \xHH content for non UTF-8 Strings (where HH is two hex digits). This is for display purposes only, such a string cannot again be parsed by the Puppet Language.
See the documentation Binary value to String for the details.
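As a rough Ruby illustration of that kind of display form (a sketch only: the helper name display_bytes is made up here, it escapes every byte outside printable ASCII, and the exact output of Puppet's own conversion is what the documentation defines):

```ruby
# Render printable ASCII bytes as-is and every other byte as \xHH.
def display_bytes(binary)
  binary.bytes.map { |byte|
    byte.between?(0x20, 0x7E) ? byte.chr : format('\\x%02X', byte)
  }.join
end

shown = display_bytes("hi \xFF".b)   # \xFF is now four printable characters
```

The result is ordinary text meant for reading, which is why it is not something to feed back in as source.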

The Rich Data Format

The rich data format used as standard in Puppet 6.0.0 is based on standard JSON.
When rich data values are encountered, they are encoded as a Hash (i.e. JSON Object). Data values that can be represented as a String will have two reserved keys in the resulting hash, __ptype and __pvalue, where __ptype is the name of the data type and __pvalue is the value in string form. For example - with either of these two values (Puppet Language):

Binary('hello', '%s') # Make binary out of text (non base64 string)

Binary('aGVsbG8=') # Make binary from base64 string

would appear like this in a catalog:

{"__ptype": "Binary", "__pvalue": "aGVsbG8=\n"}

In Puppet 6.0.0: When serializing, both Binary values and ASCII-8BIT encoded strings are serialized as being Binary (as in the example above). When deserializing a catalog on the agent it will transform all instances of Binary into instances of ASCII-8BIT. Types and providers thus only need to deal with String and possibly check for the encoding of a string (to not mess things up).

Before Puppet 6.0.0: When serializing, Binary and ASCII-8BIT strings are handled without any transformation, possibly regressing to using PSON on the wire if ASCII-8BIT strings were not UTF-8 compatible (i.e. all the mess discussed earlier). Types and Providers would need to know about the Binary data type - only the File resource type supported this, and then only when using puppet apply.
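The Puppet 6.0.0 round trip can be sketched in Ruby as follows. This is not Puppet's serializer: the method names to_rich and from_rich and the hash keys 'content' and 'owner' are hypothetical, and only the __ptype/__pvalue shape and the Base64 value are taken from the example above.

```ruby
require 'json'

# Serializing side: wrap ASCII-8BIT strings as a rich data hash.
def to_rich(value)
  return value unless value.is_a?(String) && value.encoding == Encoding::ASCII_8BIT
  # pack('m') produces Base64 with a trailing line feed, as in the example above
  { '__ptype' => 'Binary', '__pvalue' => [value].pack('m') }
end

# Agent side: turn Binary hashes back into ASCII-8BIT strings.
def from_rich(value)
  return value unless value.is_a?(Hash) && value['__ptype'] == 'Binary'
  value['__pvalue'].unpack1('m')   # Base64 decode; the result is ASCII-8BIT
end

wire = JSON.generate('content' => to_rich('hello'.b), 'owner' => to_rich('root'))
received = JSON.parse(wire).transform_values { |v| from_rich(v) }
```

Because the type travels with the value, the receiving side no longer has to guess which strings are Base64.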

Learn more

There is a lot more to say about the rich data format, and you can read all the details in the Pcore Data Representation specification, and specifically the Pcore Generic Data document that describes the format used in the Puppet 6.0.0 catalog.

Rich Data and PDB

So, what does PDB do with rich data? Can it be queried etc?

Before Puppet 6.0.0: If using --rich_data, a catalog with rich data would be handled the same way as if the hashes representing rich data were just hashes. In practice this kind of worked since the data was stored and could be retrieved, but both viewing and querying were difficult. Also, Reports could contain rich data hashes, and this could cause downstream problems.

In Puppet 6.0.0: A hard decision had to be made, and the least bad option was to simply stringify rich data into the most reasonable human-consumable form. While this is lossy (you cannot read a catalog back again and get an actual catalog with the same data as the original), it serves the current use cases where a catalog is stored in PDB.
The same approach was taken for Reports.

So, to answer the question - “Can rich data be queried?”, well, kind of, if you base your query on the string representation.

Needless to say, there is work ahead of us in the entire domain of rich data; more advanced queries and more meaningful diffs in Reports are examples.

Summary

Use binary_file() to read binary content, or when you don’t know whether the file is UTF-8 compatible.
It is recommended to use the Binary data type to type binary input in all functions and to prefer returning Binary over returning an ASCII-8BIT string.
From Puppet 6.0.0 expect binary content to be ASCII-8BIT encoded strings in types and providers, but not in deferred “agent side” functions.
Before Puppet 6.0.0 expect binary content to be either ASCII-8BIT encoded strings or instances of the Puppet Binary data type as the representation differs depending on use of --rich_data or not, or evaluating the logic on the agent, or as part of a puppet apply.

Monday, September 17, 2018

More about undef

In my earlier blogpost Let’s talk about undef I covered the data type Undef itself. In this blog post I am going to cover what happens when you use undef in puppet manifests and in Ruby.

Over time the Puppet Language undef has been represented internally in different ways. Starting with Puppet 4 (and with future parser in Puppet 3) the compiler (i.e. the puppet language) represents undef as the Ruby nil value.

However, the resource API and the older so called 3.x function API required nil to be transformed to other values for backwards compatibility.

Undef and Functions

Puppet has two function APIs, the so called 3x (old) and 4x (since Puppet 3 with future parser).

Here is how the transformations are done for a function call to function f() in puppet and what the resulting Ruby values given to the function are in the 3x and 4x APIs:

puppet	3x API < 6.0.0	3x API >= 6.0.0	4x API
`f(undef)`	`''`	`''`	`nil`
`f([undef])`	`[:undef]`	`[nil]`	`[nil]`
`f({undef => undef})`	`{:undef => :undef}`	`{nil => nil}`	`{nil => nil}`

As you can see from the table above:

In the 3x API (in all Puppet versions) a top level value of undef is translated to the empty string.
In versions before 6.0.0, an undef nested (at any depth) inside an Array or Hash is translated to the Ruby Symbol :undef.
From Puppet 6.0.0 nested undef in the 3x API is handled the same way as for the 4x API - using the Ruby runtime value nil.
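The "3x API < 6.0.0" column of the table can be expressed as a small Ruby sketch. The method names are hypothetical and this is not the code Puppet uses; it only mirrors the rules listed above.

```ruby
# Nested values: undef (nil) becomes the symbol :undef at any depth.
def nested_to_3x(value)
  case value
  when nil   then :undef
  when Array then value.map { |v| nested_to_3x(v) }
  when Hash  then value.map { |k, v| [nested_to_3x(k), nested_to_3x(v)] }.to_h
  else value
  end
end

# Top level argument: undef becomes the empty string.
def arg_to_3x(value)
  value.nil? ? '' : nested_to_3x(value)
end

results = [nil, [nil], { nil => nil }].map { |arg| arg_to_3x(arg) }
```

For Puppet 6.0.0 and later, the nested case would simply leave nil as it is.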

Returned values

Returning values from 4.x functions works as you expect; anything set to nil is semantically the same as Puppet’s undef. This is however royally screwed up when it comes to 3x functions! Since they are given the transformed undef values (in the form of either the empty string or the symbol :undef), they could return such values back to the compiler. And thus, the compiler may end up feeding values encoded that way to 4x functions - thus exposing the 4x functions to the 3x API encoding.

The compiler treats the :undef symbol as if it was nil in terms of type checking and it will also be serialized as if it was a nil, but since there is no transformation going on for 4x functions they were exposed to this problem.

From Puppet 5.5.7 all returned values from 3x functions are subject to a transformation such that the :undef symbol is transformed back to nil.

Recommended Actions for 3x functions

For 3x functions that you maintain, the recommendation is to change them to use the 4x API since that makes it sane; you get nil for Puppet’s undef and you return nil when you want to return undef values - and it all works in harmony with Ruby. (And if you have special processing of :undef you can remove that in favor of straightforward detection/removal of nil in Ruby). Although it is an action you need to take, it is quite easy to change a 3x function to a 4x function.

If you for some reason cannot do that, and you want to maintain your function as a 3x function supporting both old and new versions, you should treat both :undef and nil as being nil - which means you may need to do operations twice. You also need to make sure you are not returning structures with :undef in them if the function is used with any Puppet version from Puppet 3 with future parser up to Puppet 5.5.7.
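For a 3x function that must run on both old and new Puppet versions, the two pieces of advice above might look like this in Ruby. The helper names undef_value? and scrub_undef are made up for this sketch.

```ruby
# Treat both encodings of undef the same way when inspecting arguments.
def undef_value?(value)
  value.nil? || value == :undef
end

# Before returning, replace any :undef with nil (at any depth) so that
# 4x functions never see the 3x encoding.
def scrub_undef(value)
  case value
  when :undef then nil
  when Array  then value.map { |v| scrub_undef(v) }
  when Hash   then value.map { |k, v| [scrub_undef(k), scrub_undef(v)] }.to_h
  else value
  end
end

args = ['a', :undef, nil, { 'k' => :undef }]
defined_args = args.reject { |a| undef_value?(a) }
returned = scrub_undef(args)
```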

Undef and Resources

Giving values to a resource is almost like giving values to functions, but not quite. A given top-level undef in the resource API means “I want the default value” - that is, it acts as if you had not given a value at all!

# Given this definition:
define myresource($param1 = 'green') { }
# These two will have exactly the same effect on param1
# setting it to the value 'green' in both resources
myresource { 'title1': param1 => undef }
myresource { 'title2':  }

If we change the default value expression to also be undef, we will get the effect of not including a value at all for that parameter. (That is, no null value will appear in the JSON for the serialized catalog sent to the agent).

# Given this definition:
define myresource($param1 = undef) { }
# These two will have exactly the same effect on param1
# neither will have the param1 set at all
myresource { 'title1': param1 => undef }
myresource { 'title2':  }

The (abbreviated) output in the catalog looks like this:

{
  "type": "Myresource",
  "title": "title1",
},
{
  "type": "Myresource",
  "title": "title2",
}
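The effect on the serialized resource can be mimicked in Ruby. This sketch assumes a simplified resource hash with a 'parameters' entry and uses a hypothetical helper name; it is not the catalog serializer itself.

```ruby
require 'json'

# Build the data for one resource, leaving out parameters that are undef (nil).
def resource_to_data(type, title, parameters)
  data = { 'type' => type, 'title' => title }
  set = parameters.compact                     # drop nil valued parameters
  data['parameters'] = set unless set.empty?   # omit the entry when nothing is set
  data
end

json = JSON.generate(resource_to_data('Myresource', 'title1', { 'param1' => nil }))
```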

Getting Undef from Hiera

When Automatic Parameter Lookup (APL) is used for a class, it is possible to bind an undef value (null in JSON and YAML data files). When a value for a parameter looked up with APL results in undef it will set the value of the parameter to undef (in contrast to when giving it in a manifest since that means - “use the default”).

The rationale for this is that APL kicks in to get a default value and it is then not meaningful to also let it return a value that means “use the default”. Instead, the result of the default value expression for a class parameter only gets used if there was no value bound at all for that parameter in hiera.

class car($color = 'blue') { }
class { 'car': color => undef }

If nothing was bound in hiera, this would result in the car’s color being 'blue' (the default expression kicks in)
If car::color was bound to 'green' in hiera, the result would be a 'green' car.
If car::color was bound to undef in hiera the result would be an undef color.
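The three outcomes follow from a lookup order that can be sketched in Ruby. This is a simplification with hypothetical names; hiera data is modeled as a plain Hash so that "bound to undef" (key present, value nil) can be told apart from "not bound" (key absent).

```ruby
def resolve_parameter(given, hiera, key, default)
  return given unless given.nil?          # an explicit non-undef value wins
  return hiera[key] if hiera.key?(key)    # APL: a bound value is used, even undef
  default                                 # nothing bound: the default expression
end

unbound     = resolve_parameter(nil, {}, 'car::color', 'blue')
bound_green = resolve_parameter(nil, { 'car::color' => 'green' }, 'car::color', 'blue')
bound_undef = resolve_parameter(nil, { 'car::color' => nil }, 'car::color', 'blue')
```

The use of key? rather than a nil check on the looked up value is what lets a bound undef override the default.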

Recommended actions for Resources

If you want to accept a parameter value of undef then declare a default value expression of undef. Then you can either get that default by not specifying a parameter value (or giving undef, which is the same thing), or you can bind undef in hiera, and all of these give you the same result.

Something like this:

class car(Optional[String] $color = undef) { }

Summary

I hope this has provided you with some of the (otherwise) hard to find details about how undef actually works in different Puppet versions. (I am also a bit sad that something like this blog post is needed - but that is a different story).