Subject: RE: Differentiating TOSCA from HEAT


Hi Tal,

 

This makes complete sense; we are aligned in our understanding now.

 

There is nothing particularly wrong with HOT the language.

 

Agree. In fact our orchestrator has a language, "DSD", that was originally derived from HOT, so you will recognize some of the syntax, but I fixed all the issues with the HOT language that I mentioned below, and quite a few more.

 

You can think of HOT as a subset of TOSCA.

 

The way I extended HOT makes TOSCA a subset of the DSD language, but DSD is a lot easier to understand and use.

 

Basically, we made the language define a compositional graph grammar instead of just a graph. Also, we went in a different direction than TOSCA by making the DSD language Turing-complete while also being completely declarative. I know that we deliberately do not want that with TOSCA. But the graph-grammar approach does solve all the day-2 expressiveness issues that some uses of TOSCA are facing.
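To sketch the idea (hypothetical notation, not actual DSD syntax): the catalog holds productions like the one below, so the only way to grow the service graph is by applying sanctioned rewrites.

```yaml
# Hypothetical notation, NOT actual DSD syntax: one graph-grammar
# production. The catalog offers this as the only sanctioned way to
# grow the service graph, instead of letting users edit it freely.
production: AddDatabaseReplica
match:                # left-hand side: subgraph that must already exist
  primary:
    type: Database
result:               # right-hand side: the subgraph after the rewrite
  primary:
    type: Database
  replica:
    type: Database
  edges:
    - { from: replica, to: primary, type: ReplicatesFrom }
```

Applying such productions repeatedly can only produce graphs the grammar generates, which is what constrains users to meaningful modifications, as I discuss further below.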

 

The relevant one for our group is: it creates its workflows automatically for you, but there is almost no visibility into them, and definitely no hackability.

…

I hope you learn from Heat what not to do, and that you fully expose that DAG to users

 

Yes. There are two things here.

 

I fully agree that visibility into the "playbooks" is critical. Version 1 of our orchestrator, many years ago, did not have that, and we quickly realized this would not work. So we give full visibility into the playbook graph, including direct insight into why it looks the way it does.

 

As for "hackability", that is nice for small-scale DevOps-type orchestration, but when you look at thousands of transactions per minute on behalf of end-customers, you don't want users to hack each playbook individually. The paradigms you want for orchestration depend heavily on that kind of concern.

 

There is always a tradeoff between flexibility and constraint. Too many constraints on what users can do will definitely lead to frustration; I see that a lot with systems like Maven, which are really helpful if what you want to do is within their paradigm, and really frustrating if it is not. But too much flexibility also leads to frustration, because it leaves the user without guidance as to which combinations of functions and features are meaningful for some intent, and which are not. So "flexibility" is not always the same as "good".

 

Basically, your domain for orchestration is DevOps, as you say. Our orchestration domain is quite different from DevOps, and that makes a huge difference in the requirements. So nothing is right or wrong; these are just two different worlds.

 

I just believe it should be solved locally, with specificity for that resource's unique lifecycle challenges, and then locked away as a black box

 

I completely agree: always solve issues as close to their source as possible. I just work in a domain where that is not feasible today, because most of the systems we are orchestrating are nowhere near becoming cloud native. So in essence our orchestrator can be, when nothing else is doing it, the system that makes such entities declarative.

 

Welcome to the cloud-native world. :) It's best for your components to be designed to run in clouds

 

Cloud native is for compute nodes running applications with connectivity. We are not in that domain. There are lots of nodes that do not represent components in clouds.

 

Invariably there are bugs and scalability challenges

 

That is not my experience. Our orchestrator is fully cloud native and scalable, including full support for zero-downtime rolling upgrades of the cluster. We run thousands of service requests per minute and millions of events per second, and have service graphs consisting of hundreds of thousands of nodes. I can tell you from experience that it is possible and can be made to scale. I am not saying it is easy. :)

 

By the way, my middle name should be "Dag": I dream in DAGs. :)

 

As for bugs, automatically generating workflows is a lot less error-prone than hacking up workflows manually. I know of several companies that learned that the hard way, ending up with thousands of spaghetti workflows with overlapping functionality but different bugs in each. I don't think we really disagree on this.

 

Do you want users to be able to design their own DAGs? TOSCA is well suited for it. A "task" can be a node and these nodes can be connected via typed relationships.

 

We have been doing that for years, but as I mentioned, we use a "graph grammar" as our catalog, which means that users can do what you say, but can be constrained to make only meaningful modifications of the DAG that represent a clearly modeled intent.

 

Peter

 

From: Tal Liron [mailto:tliron@redhat.com]
Sent: 6 January 2022 01:50
To: Bruun, Peter Michael (CMS RnD Orchestration) <peter-michael.bruun@hpe.com>
Cc: Chris Lauwers <lauwers@ubicity.com>; tosca@lists.oasis-open.org
Subject: Re: Differentiating TOSCA from HEAT

 

On Wed, Jan 5, 2022 at 1:59 AM Bruun, Peter Michael (CMS RnD Orchestration) <peter-michael.bruun@hpe.com> wrote:

I was hoping that you could add, for our positioning of TOSCA, some more concrete details about the mentioned bad experience. What exactly was the nature of those failures, and what were the reasons for them? Your 4 bullets are too generic, I think. You mention that HEAT is slow and does not scale, and you ascribe that to the OpenStack architecture and not so much to the HOT language. Is that, in your opinion, the primary reason for the shortcomings of HEAT? If so, to ensure the success of TOSCA we would need to give some attention to scalability in our discussions.

 

There is nothing particularly wrong with HOT the language. It shares some of the same DNA as TOSCA and does a lot of the same things. The same is true for Cloudify DSL. The reason Puccini can parse all three of these languages is due to their core similarity. TOSCA is better than the others for the most part, but that's more about specific grammatical features than some essential qualitative difference. You can think of HOT as a subset of TOSCA.
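To make the subset claim concrete, here is roughly the same single-server deployment in both grammars (image names, flavors, and sizes are just placeholder values):

```yaml
# HOT: one Nova server
heat_template_version: 2018-08-31
resources:
  server:
    type: OS::Nova::Server
    properties:
      image: ubuntu-20.04    # placeholder image name
      flavor: m1.small
```

```yaml
# TOSCA Simple Profile: an equivalent abstract compute node
tosca_definitions_version: tosca_simple_yaml_1_3
topology_template:
  node_templates:
    server:
      type: tosca.nodes.Compute
      capabilities:
        host:
          properties:
            num_cpus: 1
            mem_size: 2 GB
```

Both declare a node with typed properties; TOSCA layers the richer type system (capabilities, requirements, typed relationships) on top.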

 

There are various reasons why Heat isn't great. The relevant one for our group is: it creates its workflows automatically for you, but there is almost no visibility into them, and definitely no hackability. That makes debugging very painful. It's an extremely anti-devops approach: we'll do the work, you stay away. Above, I linked to a Puccini example where I use TOSCA + an OpenStack profile to generate an Ansible playbook for deployment. The advantage, in my opinion, is that you get an actual debuggable and extensible playbook. There's no real lesson here for TOSCA specifically, but I do think Heat can be a cautionary tale for those of us wanting to implement automatic workflows in an orchestrator.

 

Concerning your views on declarative orchestration: clearly, if a single underlying management system and the components it orchestrates are all fully declarative and insensitive to sequencing, then indeed the orchestrator itself does not need to be concerned with sequencing. But at the lowest level, technology is inherently sensitive to sequencing.

 

Absolutely. I just believe it should be solved locally, with specificity for that resource's unique lifecycle challenges, and then locked away as a black box (but with access to the source code, so that devops can fix production bugs). Indeed the responsibility for implementing this functionality should best be with the component's vendor. They know it best. It's basically the operator pattern: the orchestration work should be a managerial component living in the same space as the resource it's orchestrating. Sometimes I call it "side by side" orchestration.

 

It's absurd to me that devops teams for various companies again and again try to solve the same LCM issues for whatever orchestrator they are using. Invariably there are bugs and scalability challenges. Orchestrators should not be doing generic phase-by-phase LCM, especially if they are not running in-cluster. It's a losing battle.

 

Example: Installing a VM running a database application. If the management system allows you to specify this declaratively, including the required database configuration, then the orchestrator does not need to be concerned with the sequencing. If another VM needs to run an application that uses the database, and the two VMs are created and started in arbitrary order, then either that application needs to be insensitive to situations where the database is not yet ready or the declarative management system must be aware of the dependency.

 

I strongly recommend that the application be able to stand up even if the database is not ready. This is the cornerstone of living in the cloud: it's an ephemeral environment where dependent services may come and go or just change suddenly. An orchestrator's work here is, of course, not to create the database connection. But it can assist in discovery (IP address injection?) and otherwise notifying, collecting notifications, and reconciling issues.
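For example, the two-VM scenario could be modeled with normative TOSCA 1.3 types roughly like this (the script path and names are placeholders): the orchestrator injects the database host's address, and the application is expected to retry until the database is reachable.

```yaml
tosca_definitions_version: tosca_simple_yaml_1_3
topology_template:
  node_templates:
    db_host:
      type: tosca.nodes.Compute
    dbms:
      type: tosca.nodes.DBMS
      requirements:
        - host: db_host
    db:
      type: tosca.nodes.Database
      properties:
        name: inventory            # placeholder schema name
      requirements:
        - host: dbms
    app_host:
      type: tosca.nodes.Compute
    app:
      type: tosca.nodes.SoftwareComponent
      requirements:
        - host: app_host
        - dependency: db           # typed edge; a profile could refine this to ConnectsTo
      interfaces:
        Standard:
          configure:
            implementation: scripts/configure.sh   # placeholder script
            inputs:
              # discovery via injection: the app gets an address, not a sequence
              db_address: { get_attribute: [ db_host, private_address ] }
```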

 

The point is that the temporal dependencies do not go away by themselves. The prerequisite is careful design of applications and management systems to fit into such a paradigm, and in some cases we are basically just pushing the sequencing problem down to lower-level orchestrators/management systems. And if the service topology happens to span more than one management system, then not only must each system be declarative within itself, but all the systems must be designed to interwork according to the "centrifuge" model to handle any required sequencing between them.

 

Welcome to the cloud-native world. :) It's best for your components to be designed to run in clouds, but there are also a lot of options for you if they don't. The operator pattern can allow you to create a cloud-native frontend for a component that doesn't play the game well.

 

There are good examples of this in the world of databases. Most of the popular and mature databases we use were not designed for the cloud. But operators can allow for LCM of db clusters in cloud environments, managing all the difficult aspects of geographical redundancy, auto-scaling, failovers, load-balancing, backups, etc. If such an operator is of good quality, you end up being able to treat the db cluster declaratively and not worry about low-level sequences. And then all an orchestrator needs to do is work with those declarations. (Again, that's why I prefer to call it a "coordinator".)
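For illustration, a hypothetical custom resource for such an operator (every name below is made up; real operators use their own schemas). The whole cluster's desired state is declared, and the operator owns the sequencing behind it:

```yaml
apiVersion: example.com/v1        # hypothetical API group
kind: PostgresCluster             # hypothetical kind
metadata:
  name: inventory-db
spec:
  replicas: 3                     # redundancy and failover handled by the operator
  version: "14"
  storage: 100Gi
  backups:
    schedule: "0 3 * * *"         # daily at 03:00
```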

 

This is a beautiful vision, but as you also say, we are not there, and so TOSCA will need to be able to support any sequencing requirements that are not yet within the capabilities of the systems being orchestrated.

 

I agree. But I think TOSCA is already there:

 

1) By using typed relationships you can derive various kinds of dependency graphs. There can be a graph for installation dependencies, a graph for networking configuration, etc. From these topological graphs a sequenced workflow graph (DAG) can be derived for your workflow engine of choice; a sketch follows point 2 below. (Again, I hope you learn from Heat what not to do, and that you fully expose that DAG to users.)

 

2) Do you want users to be able to design their own DAGs? TOSCA is well suited for it. A "task" can be a node and these nodes can be connected via typed relationships. I'm working on a TOSCA profile for Argo Workflows that does exactly that. I dislike the workflow grammar in TOSCA 1.3 mostly because it's superfluous. We really don't need two different grammars for creating graphs.
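As a sketch of both points (my Argo profile isn't shown here, so the Task and Sequence types below are purely illustrative): tasks are plain TOSCA nodes, a typed relationship is the only edge, and a topological sort of those edges is exactly the DAG from point 1.

```yaml
tosca_definitions_version: tosca_simple_yaml_1_3

relationship_types:
  Sequence:                       # illustrative type, not from any real profile
    derived_from: tosca.relationships.DependsOn

node_types:
  Task:                           # illustrative type: a workflow task as a node
    derived_from: tosca.nodes.Root
    properties:
      command:
        type: string

topology_template:
  node_templates:
    build:
      type: Task
      properties: { command: make }
    test:
      type: Task
      properties: { command: make test }
      requirements:
        - dependency: { node: build, relationship: Sequence }
    deploy:
      type: Task
      properties: { command: make deploy }
      requirements:
        - dependency: { node: test, relationship: Sequence }

# Derived DAG (topological order): build -> test -> deploy
```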


