RE: [tosca] Re: Proposal: CSARs should be tarballs, not ZIPs

At some point, we need a serious set of discussions about policies. For example, Tal states below that "policies are associated with nodes". That is not my interpretation at all. Policies are associated with topologies, not with nodes (and in fact they are explicitly defined inside the "topology template" context). We had preliminary discussions last year about extending the policy grammar since (as Tal correctly points out below) policies currently have fairly weak grammar. But as you will recall, there was significant resistance against making policy grammar more flexible.

Lots of work still to be done 😊

Chris

From: Tal Liron <tliron@redhat.com>
Sent: Sunday, April 18, 2021 11:36 AM
To: Bruun, Peter Michael (CMS RnD Orchestration) <peter-michael.bruun@hpe.com>
Cc: adam souzis <adam@souzis.com>; Chris Lauwers <lauwers@ubicity.com>; tosca@lists.oasis-open.org
Subject: Re: [tosca] Re: Proposal: CSARs should be tarballs, not ZIPs

I'm not dismissing TOSCA as useful for Day 2, in fact that's one aspect we want to keep improving. What I am saying is that TOSCA should not represent runtime systems.

If I want a representation of my actual Kubernetes resources then I can just do a "kubectl get" command with a selector to retrieve those manifests. If I want a language that exactly represents those resources then ... it's those manifests. And similar APIs exist for all cloud platforms. So, why use TOSCA? Why not just work directly with resource representations?

TOSCA's value is in allowing for designs that are not beholden to implementation details. You literally create the "design language" that is appropriate for your application. You create types and relationships that are meaningful for your work.

What we can do with Day 2 is provide (=design) contours for it. Day 2 is not a wild west where everything about the workload changes completely. The changes that happen are within very specific parameters and happen in very specific ways. The service would still be recognizable even if it changed quite a bit. It's that "recognition" that we want to be able to express grammatically.

The way I tend to handle it is via policies. The policies can allow the designer to specify those contours (e.g. performance envelopes) and changes that might need to happen. For example, let's say that when you hit a certain bandwidth limit you would want to add an additional frontend processor, which would also require you to add a loadbalancer/proxy in front of it. In TOSCA I would design for that loadbalancer to be there in advance, however the initial scale of it might be "0", meaning that I would not need it upon an initial deployment. But it's still there in the design, in TOSCA. This is again why I like that we call these things "node templates", rather than just "nodes" -- it reminds us that we are describing a potential, something that could be provisioned in reality depending on changing conditions. To an extent TOSCA is taking into account Day 2 from the very beginning.

But I'll readily admit that we have a lot of work to do to improve on this expressiveness. I sometimes think we need more "conditionality" in the topology. For example, if that loadbalancer is not provisioned then should the relationships be "re-routed" around it? A TOSCA topology is a single thing, but the requirements are specified with a filtering language that might need to be more powerful, a way to express some kind of if/else for these changing conditions.

We don't really have a set of best practices for Day 2, and indeed it seems everybody has a different idea on how to handle it. And if we do want to describe these "contours", well, TOSCA policies have a very weak grammar. We can associate a policy with nodes, but ... it's a flat association. There's no way to annotate, for example, how different nodes are to be used by that policy. Thus it can make more sense to actually use a "logical" node template instead of a policy, and use TOSCA relationships to provide those annotations. Again, we don't have very clear best practices for it.

Anyway, we've gone way beyond my initial proposal regarding the CSAR format. :)

On Sun, Apr 18, 2021 at 6:19 AM Bruun, Peter Michael (CMS RnD Orchestration) <peter-michael.bruun@hpe.com> wrote:

If I design and deploy a topology using templates in one language, then that will also be the language through which I understand what I deployed. If I must then use a completely different language to design my day 2 version of that topology, and that language has the strength and tools that facilitate my day 2 topology design, then as a user I will quickly realize that I could just as well have used that other language for my day 1 design.

Of course there are scenarios where, when a topology needs modification, you can simply destroy it, and design and deploy a fresh one using TOSCA. But we should also recognize that there are scenarios where the discontinuity and downtime caused by a complete recreate of the topology is absolutely unacceptable.

For those scenarios, saying that TOSCA is for day 0-1 only, will completely disqualify TOSCA.

In my humble opinion, dismissing the use of TOSCA for day 2 scenarios is effectively going to kill TOSCA.

From: Tal Liron <tliron@redhat.com>
Sent: Saturday, April 17, 2021 10:11:56 PM
To: adam souzis <adam@souzis.com>
Cc: Bruun, Peter Michael (CMS RnD Orchestration) <peter-michael.bruun@hpe.com>; Chris Lauwers <lauwers@ubicity.com>; tosca@lists.oasis-open.org <tosca@lists.oasis-open.org>
Subject: Re: [tosca] Re: Proposal: CSARs should be tarballs, not ZIPs

Hi all,

Whether the CSAR format is useful in every situation is beside the point. It is useful in many situations. If all you're doing is storing YAML files then indeed there are many other ways to store them -- and now that we have agreed to standardize on URLs to reference imports and artifacts, anything you can host will do the trick. CSAR is more interesting when you want to distribute the referenced artifacts. I don't think you want to commit 1.3 GB virtual machine images to a git repository. Anyway, whether or not specific implementations will do other things, I think it's still a good idea to include CSAR as part of the TOSCA standard.

As for my specific proposal --

The tar format predates GNU by a lot. It was introduced in 1979. Its latest standard is IEEE Std 1003.1-2001. It is unencumbered by patents. There exist many implementations in many programming languages with diverse licenses, easily as diverse as the ZIP landscape if not more. I'm pretty sure the issues with entry name length have been solved.

gzip has a different history, as it was created by GNU, but it is likewise unencumbered and again there exist diversely licensed implementations for any important platform. By the way, though the "g" was originally chosen to refer to GNU, many implementations decide that the "g" stands for "gratis".

I'm very certain that it is possible to both create and read ".tar.gz" tarballs in any platform for proprietary, free, and open source products. You are absolutely not limited to the GNU tools that are distributed in most Linux operating systems.

By the way, .tar.gz files tend to be smaller than .zip files when using the same compression. But anyway I don't think efficient compression is our primary goal here. The point is to create a lowest common interchange standard. Specific orchestration implementations are free to do as they please. But if they want to support the OASIS CSAR standard, we want to provide something that's easy and useful.

Peter, you raised another issue, which is very interesting but I think unrelated to CSAR: that TOSCA should be more useful in Day 2. This is part of our ongoing discussion in the TOSCA ad hoc. I think that having the same language for Day 1 and Day 2 could dilute TOSCA's strengths. Specifically, it mixes two different roles: design vs. representation. It's one thing to turn a design into a running deployment, but how do you move back from a running deployment into a design that would supposedly be able to precisely create it? To do so I think would require a language that is focused specifically on the representation of runtime systems. And because runtime systems differ so widely in paradigms, semantics, and state behavior, I do not believe it is possible to create a single representation language that could work for all cloud platforms and orchestration systems.

By focusing on design TOSCA is able to be very expressive, very generic, and thus quite powerful indeed.

I'll note here that the Clout format I am working on is expressly not a runtime representation language. Rather, it is an intermediary format between design and runtime. The "nodes" it represents are what we've been calling "node representations", which encapsulate TOSCA runtime features: attributes, operations, and notifications. This is the place where design and runtime meet. Most specifically, this is where the current value of an attribute can be found.

It's impossible -- and I think undesirable -- to move from Clout back to TOSCA. How would you know where to place values that were originally generated by TOSCA function calls, or work back from the various object-oriented inheritance directions? How would you create a complex hierarchy of types from a flat representation of values? You might be able to come up with something, but it's like decompiling machine code: the result is not going to be the C source code from whence you started and it would be of very limited use, not to mention limited readability.

On Sat, Apr 17, 2021 at 2:25 PM adam souzis <adam@souzis.com> wrote:

I agree with Peter that we should make sure TOSCA templates and related artifacts can be interchanged outside of CSAR, and there are many contexts where CSAR files are secondary. Unfurl, for example, relies on git, not CSARs, for packaging and interchange. CSARs are just something it imports.

BTW, tar.gz is fairly universal at this point and many specifications and technologies rely on it -- licensing and file system compatibility are no longer concerns.

Adam

On Sat, Apr 17, 2021 at 12:57 AM Bruun, Peter Michael (CMS RnD Orchestration) <peter-michael.bruun@hpe.com> wrote:

All,

From my perspective the purpose of CSAR is to have a standardized format that is guaranteed to be understood by all TOSCA software, so that interchange of TOSCA between tools, processors, orchestrators, etc. is possible, at least at the lowest level.

For me the CSAR is, however, secondary as a representation of your TOSCA content. There are many other, equally respectable, ways that TOSCA software could represent the TOSCA content *internally* and for interchange within a software family: filesystem, git, database, etc.

So as long as any software family is able to ingest and export the CSAR formats that we choose to standardize, and assuming that such formats are reasonably open and common (zip, tar, tar.gz, etc) then I am happy enough.

I do have a small reservation with respect to gz, since the GNU licenses tend to turn any software that incorporates their licenses into open software as well. That could have some impact on how those of us who produce commercial software have to package the CSAR capabilities.

I am ancient, and I remember having had trouble with some versions of tar having a limit on the number of characters in the full file paths. I assume that modern implementations (other than GNU) do not have such limitations? Otherwise, this could be a problem with the suggestion.

Another point that I have been trying to make is that even the "Node Representations Graph" takes the form of one or more small/large/huge sets of TOSCA nodes and relationships. I believe that it ought to be possible to *export* that graph (or coherent subsets of that graph) in TOSCA template format.

This will require that the nodes in the Node Representation Graph retain a relationship to the TOSCA node types, relationship types, artifacts, interfaces, etc. from which they were created. What will be missing would be where the inputs came from (but the values would be there). The sources of properties and attributes should still be derivable based on the types and templates that the nodes and relationships were derived from. Of course there would not (necessarily) be any abstractions in that representation -- no requirements and capabilities or filters -- because at the Node Representation Graph level, all such abstractions will have been resolved.

Where the above may seem pointless from a day-0 and day-1 perspective, from a day-2 perspective this becomes a serious requirement because without that information being retained (I deliberately say "information", not postulating any specific representation of that information). So the ability to make such a low.

Also, by elevating *any* internal representation of the Node Representation Graph to be formally representable as TOSCA, the question about "dangling requirements" goes away, because within that TOSCA representation all requirements will have been completely resolved.

In my context, such a Node Representation Graph can easily contain hundreds of millions of nodes, making representation as one huge CSAR impractical, but theoretically possible.

The point is:

Any representation that can be exported as TOSCA, is TOSCA.

That broadens the view from TOSCA being equated with CSAR to TOSCA being anything that can be described using a CSAR. So representations in git, files, a database, or whatever else are TOSCA as long as they can be converted to a valid CSAR format.

Peter

From: tosca@lists.oasis-open.org [mailto:tosca@lists.oasis-open.org] On Behalf Of Tal Liron
Sent: April 16, 2021 19:24
To: tosca@lists.oasis-open.org
Subject: [tosca] Re: Proposal: CSARs should be tarballs, not ZIPs

Some additional thoughts:

Remember that the CSAR version is separate from the TOSCA version. The current CSAR version is 1.1. So my proposal here would be for CSAR 2.0 (it's a significant enough change that I think it would warrant a major semantic version change).

But, backwards compatibility would mean that systems would still be able to support CSAR 1.1, which is in ZIP. To be 100% clear: you could write a TOSCA 2.0 service template and package it in CSAR 1.1. We would have to be clear in the TOSCA 2.0 spec that this is supported.

Another thought regarding extensions -- if we move to tarballs, it might be a good idea to choose a different extension than ".csar" so that processors would easily know if they're dealing with a new-style vs. old-style container. (This is a common problem with systems that upgrade their formats.) So, perhaps something like this:

".csar" extension: means CSAR 1.1 or CSAR 1.0, meaning it's a ZIP
".csar2" extension: means CSAR 2.0 (and beyond), meaning it's a TAR
".csar2.gz" extension: GZIPped TAR

It's a bit awkward, but 100% deterministic.
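
As a rough illustration of how deterministic the extension scheme above would be, a processor could pick its reader with a few suffix checks. This is only a sketch of the proposal as written in this message: the function name `container_format` and the returned labels are hypothetical, and none of this is part of any published CSAR specification.

```python
def container_format(filename):
    """Map a file name to (CSAR generation, container type).

    Follows the extension scheme proposed in this thread. The labels
    returned here are illustrative, not taken from any specification.
    """
    name = filename.lower()
    # Check the longest suffix first so ".csar2.gz" is not mistaken
    # for something else.
    if name.endswith(".csar2.gz"):
        return ("2.x", "tar+gzip")
    if name.endswith(".csar2"):
        return ("2.x", "tar")
    if name.endswith(".csar"):
        return ("1.x", "zip")
    raise ValueError(f"not a recognized CSAR extension: {filename}")


for example in ["service.csar", "service.csar2", "service.csar2.gz"]:
    print(example, container_format(example))
```

No file content needs to be inspected to make the choice, which is the point of using distinct extensions.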

On Thu, Apr 15, 2021 at 12:03 PM Tal Liron <tliron@redhat.com> wrote:

In a conversation I had with someone who professes to "hate TOSCA" one of the issues that came up was how bad CSARs are. And one point made hit home.

CSARs are currently defined as ZIP containers. Unfortunately, ZIP is not a streaming format: its central directory is stored at the end of the file, so reading it requires random access to locations in the container. Without random access, the entire container needs to be read in order to access an individual entry. Thus any processing of a CSAR has to take place on an accessible file system, which means that if the CSAR is at a URL then the whole package would have to be downloaded first.

If you're dealing with a CSAR with very big artifacts (virtual machine images) then this quickly becomes a major burden on different parts of the system which need to process specific parts of a CSAR. This is indeed a pain point with currently existing TOSCA solutions, e.g. ONAP.

There's a reason why "tarballs" are so often preferred in packaging. A ".tar.gz" file is streamable for two reasons: gunzip is streaming decompression of a single file, and that single file is a "tape archive" (tar), which is a straightforward concatenation, likewise streaming. There is no random access. Thus a CSAR processor can choose to process just a specific entry and not have to download the entirety. It can throw away bytes that do not interest it.
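
To make the streaming argument concrete, here is a minimal sketch using Python's standard `tarfile` module in its forward-only mode (`"r|gz"`). The entry names, the sample content, and the helper names (`read_entry_streaming`, `build_sample_csar`, `ForwardOnly`) are assumptions made up for the illustration; a real processor would pass a network response body instead of the in-memory wrapper.

```python
import io
import tarfile


def read_entry_streaming(stream, wanted_name):
    """Return the bytes of one entry from a gzipped tar stream, or None.

    `stream` only needs a read() method; it is never seeked.
    """
    # Mode "r|gz" makes tarfile treat the input as a forward-only stream.
    with tarfile.open(fileobj=stream, mode="r|gz") as archive:
        for member in archive:
            if member.isfile() and member.name == wanted_name:
                # Stop here; later entries are never read.
                return archive.extractfile(member).read()
    return None


def build_sample_csar():
    """Build a small in-memory .tar.gz with hypothetical entry names."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as archive:
        for name, data in [
            ("TOSCA.meta", b"CSAR-Version: 2.0\n"),  # hypothetical content
            ("artifacts/image.qcow2", b"\x00" * 4096),  # stand-in for a big image
        ]:
            info = tarfile.TarInfo(name)
            info.size = len(data)
            archive.addfile(info, io.BytesIO(data))
    return buffer.getvalue()


class ForwardOnly:
    """Expose read() only, to mimic a non-seekable network body."""

    def __init__(self, data):
        self._inner = io.BytesIO(data)

    def read(self, size=-1):
        return self._inner.read(size)


meta = read_entry_streaming(ForwardOnly(build_sample_csar()), "TOSCA.meta")
print(meta.decode())
```

Entries are visited in archive order, so any bytes that precede the wanted entry still pass through the stream and are discarded. The saving comes from being able to stop reading once the entry has been found, which favors placing small metadata entries before large artifacts.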

Note that if one can benefit from random access to a tarball, then it's easy enough to unpack it in its entirety, and indeed in a much more efficient way than a ZIP: the tarball can be unpacked and streamed directly to the filesystem. A ZIP would still have to be downloaded first to accomplish the same function, leading indeed to more than double the storage requirement.
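
The "unpack while streaming" case can be sketched in the same way: each entry is written to disk as it arrives, so the compressed archive itself is never stored. This is an illustrative sketch only. The names `unpack_stream` and `build_sample` and the sample entries are hypothetical, and the sketch handles regular files only, skipping links and other special entries.

```python
import io
import os
import shutil
import tarfile
import tempfile


def unpack_stream(stream, dest_dir):
    """Unpack regular files from a gzipped tar stream into dest_dir.

    Returns the entry names written, in archive order.
    """
    dest_root = os.path.realpath(dest_dir)
    written = []
    with tarfile.open(fileobj=stream, mode="r|gz") as archive:
        for member in archive:
            if not member.isfile():
                continue  # links, devices, directories: skipped in this sketch
            target = os.path.realpath(os.path.join(dest_root, member.name))
            # Refuse entries that would land outside dest_dir.
            if os.path.commonpath([dest_root, target]) != dest_root:
                raise ValueError(f"unsafe entry name: {member.name}")
            os.makedirs(os.path.dirname(target), exist_ok=True)
            with archive.extractfile(member) as source, open(target, "wb") as out:
                shutil.copyfileobj(source, out)
            written.append(member.name)
    return written


def build_sample():
    """Build a small in-memory .tar.gz with hypothetical entries."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as archive:
        for name, data in [
            ("service.yaml", b"tosca\n"),
            ("artifacts/a.bin", b"\x01\x02"),
        ]:
            info = tarfile.TarInfo(name)
            info.size = len(data)
            archive.addfile(info, io.BytesIO(data))
    buffer.seek(0)
    return buffer


with tempfile.TemporaryDirectory() as workdir:
    paths = unpack_stream(build_sample(), workdir)
    print(paths)
```

Only the unpacked files occupy storage at any point, which is the contrast with downloading a ZIP in full before extracting it.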

So, it's very obvious to me that this needs to change in TOSCA 2.0 with a new CSAR specification.

My specific recommendations:

1. Let's first standardize on TAR. So a raw ".csar" extension would be exactly a "tape archive" (a tarball).

2. Let's then standardize on GZIP for the supported algorithm. So a ".csar.gz" extension would imply a GZIPped CSAR. There are many other popular algorithms used (bz2, xz) but in the interests of interoperability it's best to recommend one. The usefulness of adding the extra ".gz" is to clarify if decompression is needed, and indeed many toolchains recognize that convention automatically.
