OASIS Mailing List Archives: tosca message

[Date Prev] | [Thread Prev] | [Thread Next] | [Date Next] -- [Date Index] | [Thread Index] | [List Home]


Subject: Re: Differentiating TOSCA from HEAT


On Wed, Jan 5, 2022 at 1:59 AM Bruun, Peter Michael (CMS RnD Orchestration) <peter-michael.bruun@hpe.com> wrote:

I was hoping that you could add, for our positioning of TOSCA, some more concrete details about the mentioned bad experience. What exactly were the nature and reasons for those failures? Your 4 bullets are too generic, I think. You mention that HEAT is slow and does not scale, and you ascribe that to the OpenStack architecture, and not so much to the HOT language. Is that, in your opinion, the primary reason for the shortcomings of HEAT? If so, to ensure the success of TOSCA we would need to give some attention to scalability in our discussions.


There is nothing particularly wrong with HOT the language. It shares some of the same DNA as TOSCA and does a lot of the same things. The same is true for Cloudify DSL. The reason Puccini can parse all three of these languages is due to their core similarity. TOSCA is better than the others for the most part, but that's more about specific grammatical features than some essential qualitative difference. You can think of HOT as a subset of TOSCA.

There are various reasons why Heat isn't great. The relevant one for our group is: it creates its workflows automatically for you, but there is almost no visibility into them, and definitely no hackability. That makes debugging very painful. It's an extremely anti-devops approach: we'll do the work, you stay away. Above, I linked to a Puccini example where I use TOSCA + an OpenStack profile to generate an Ansible playbook for deployment. The advantage, in my opinion, is that you get an actual debuggable and extensible playbook. There's no real lesson here for TOSCA specifically, but I do think Heat can be a cautionary tale for those of us wanting to implement automatic workflows in an orchestrator.

Concerning your views on declarative orchestration, clearly if a single underlying management system and the components it orchestrates are all fully declarative and insensitive to sequencing, then indeed the orchestrator itself does not need to be concerned with sequencing. But at the lowest level, technology is inherently sensitive to sequencing.


Absolutely. I just believe it should be solved locally, with specificity for that resource's unique lifecycle challenges, and then locked away as a black box (but with access to the source code, so that devops can fix production bugs). Indeed, the responsibility for implementing this functionality is best placed with the component's vendor. They know it best. It's basically the operator pattern: the orchestration work should be a managerial component living in the same space as the resource it's orchestrating. Sometimes I call it "side by side" orchestration.

It's absurd to me that devops teams for various companies again and again try to solve the same LCM issues for whatever orchestrator they are using. Invariably there are bugs and scalability challenges. Orchestrators should not be doing generic phase-by-phase LCM, especially if they are not running in-cluster. It's a losing battle.
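To make the operator pattern concrete, here is a minimal sketch of the reconcile loop at its heart. All the names (reconcile, the desired/actual dicts) are hypothetical illustrations, not any real operator framework's API; the point is only that the component-local manager computes the delta between desired and actual state, rather than a central orchestrator driving generic phase-by-phase LCM:

```python
def reconcile(desired: dict, actual: dict) -> list:
    """Compute the actions needed to move actual state toward desired state.
    An operator runs this repeatedly, side by side with the resource it manages."""
    actions = []
    for key, want in desired.items():
        if actual.get(key) != want:
            actions.append("set %s=%s" % (key, want))
    for key in actual:
        if key not in desired:
            actions.append("delete %s" % key)
    return actions

# Hypothetical example state for a managed resource:
desired = {"replicas": 3, "version": "2.1"}
actual = {"replicas": 2, "version": "2.1", "debug": True}
print(reconcile(desired, actual))  # → ['set replicas=3', 'delete debug']
```

The vendor who knows the resource's lifecycle quirks owns this loop; the orchestrator only declares the desired state.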

Example: Installing a VM running a database application. If the management system allows you to specify this declaratively, including the required database configuration, then the orchestrator does not need to be concerned with the sequencing. If another VM needs to run an application that uses the database, and the two VMs are created and started in arbitrary order, then either that application needs to be insensitive to situations where the database is not yet ready or the declarative management system must be aware of the dependency.


I strongly recommend that the application be able to stand up even if the database is not ready. This is the cornerstone of living in the cloud: it's an ephemeral environment where dependent services may come and go or just change suddenly. An orchestrator's work here is, of course, not to create the database connection. But it can assist in discovery (IP address injection?) and otherwise notifying, collecting notifications, and reconciling issues.
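A sketch of what "standing up even if the database is not ready" can look like in practice: the application probes its dependency with backoff instead of expecting an orchestrator to sequence startup. The function and probe names here are illustrative assumptions, not from the original thread:

```python
import time

def wait_for_dependency(probe, attempts=5, base_delay=0.1):
    """Retry a readiness probe with exponential backoff, so the app
    tolerates a dependency that is not yet up (or has just moved)."""
    for attempt in range(attempts):
        if probe():
            return True
        time.sleep(base_delay * (2 ** attempt))
    return False

# Hypothetical probe: a real app would attempt an actual DB connection here.
state = {"tries": 0}
def fake_db_ready():
    state["tries"] += 1
    return state["tries"] >= 3  # becomes ready on the third try

assert wait_for_dependency(fake_db_ready, base_delay=0.01)
```

The same loop handles the "service just changed suddenly" case: on a lost connection, the app re-enters the probe rather than crashing.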

The point is that the temporal dependencies do not go away by themselves. The prerequisite is careful design of applications and management systems to fit into such a paradigm, and eventually in some cases, we are basically just pushing the sequencing problem down to lower level orchestrators/management-systems, and if the service topology happens to span more than one management system, then not only must each system be declarative within itself, but all the systems must be designed to interwork according to the "centrifuge" model to handle any required sequencing between them.


Welcome to the cloud-native world. :) It's best for your components to be designed to run in clouds, but there are also a lot of options for you if they don't. The operator pattern can allow you to create a cloud-native frontend for a component that doesn't play the game well.

There are good examples of this in the world of databases. Most of the popular and mature databases we use have not been designed for cloud. But operators can allow for LCM of db clusters in cloud environments, managing all the difficult aspects of geographical redundancy, auto-scaling, failovers, load-balancing, backups, etc. If this operator is of good quality you end up being able to treat the db cluster declaratively and not worry about low-level sequences. And then all an orchestrator needs to do is work with those declarations. (Again, that's why I prefer to call it a "coordinator".)

This is a beautiful vision, but as you also say, we are not there, and so TOSCA will need to be able to support any sequencing requirements that are not yet within the capabilities of the systems being orchestrated.


I agree. But I think TOSCA is already there:

1) By using typed relationships you can derive various kinds of dependency graphs. There can be a graph for installation dependencies, a graph for networking configuration, etc. From these topological graphs a sequenced workflow graph (DAG) can be derived for your workflow engine of choice. (Again, I hope you learn from Heat what not to do, and that you fully expose that DAG to users.)

2) Do you want users to be able to design their own DAGs? TOSCA is well suited for it. A "task" can be a node and these nodes can be connected via typed relationships. I'm working on a TOSCA profile for Argo Workflows that does exactly that. I dislike the workflow grammar in TOSCA 1.3 mostly because it's superfluous. We really don't need two different grammars for creating graphs.
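Point 1 can be sketched in a few lines: given nodes connected by typed relationships, keep only the relationship types that imply ordering and topologically sort the result. The node names and relationship types below are hypothetical stand-ins for what a TOSCA parser like Puccini would actually emit:

```python
from graphlib import TopologicalSorter  # Python 3.9+ stdlib

# Hypothetical topology: (source, relationship_type, target) triples,
# as might be derived from a template's requirements/relationships.
relationships = [
    ("app", "DependsOn", "database"),
    ("app", "ConnectsTo", "load_balancer"),
    ("database", "HostedOn", "vm1"),
    ("app", "HostedOn", "vm2"),
]

def install_order(rels, ordering_types=("DependsOn", "HostedOn")):
    """Derive a sequenced installation workflow from the relationship
    types that imply ordering; ignore the rest (e.g. ConnectsTo)."""
    graph = {}
    for source, rel_type, target in rels:
        if rel_type in ordering_types:
            graph.setdefault(source, set()).add(target)  # target must exist first
    return list(TopologicalSorter(graph).static_order())

order = install_order(relationships)
# Targets of ordering relationships come before their sources:
assert order.index("vm1") < order.index("database")
assert order.index("database") < order.index("app")
```

Filtering by relationship type is what yields the different graphs (installation, networking, etc.) from the same topology, and the resulting DAG is exactly what should be exposed to users rather than hidden Heat-style.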

