DevOps: The Times They Are A-Changin’
Out With the Old
In the operations environments I have worked in there were always strict controls
on who could access production environments, who could make changes,
when changes could be made, who could physically touch hardware, and who
could access which data centers. In these highly regulated, process-oriented
enterprises, the thought of blurring the lines between development and operations
seems like a non-starter. There is so much process and tradition standing in the way
of using a DevOps approach that it seems nearly impossible. Let’s break it down
into small pieces and see if it could be feasible.
Here are the basic steps to getting a new application built and deployed from
scratch (from an operations perspective) in a stodgy financial services environment.
If you’ve never worked in this type of environment, some of the timing of these
steps might surprise you. We are going to assume this new application project has
already been approved by management and we have the green light to proceed.
1. Place the order for development, testing, user acceptance testing, and production
infrastructure. Usually about an 8-week lead time.
2. The development team does its work while ops personnel are filling out miles
of virtual paperwork to get the infrastructure in place. Much discussion occurs
about failover, redundancy, disaster recovery, data center locations, and storage
requirements. None of this discussion includes developers, just operations and
architects.
3. New application is added to change management database to include new
infrastructure components, application components, and dependencies.
4. Operations is hopeful the developers are making good progress during the 8-week
lead time provided by the operational request process. Servers have landed and
are being racked and stacked. Hopefully we correctly estimated the number of
users, efficiency of code, and storage requirements that were used to size this
hardware. In reality we will have to see what happens during load testing and
make adjustments.
5. We’re closing in on one week until the scheduled go-live date but the
application isn’t ready for testing yet. It’s not the developers’ fault that the functional
requirements keep changing, but it is going to squeeze the testing and
deployment phases.
6. The monitoring team has installed their standard monitoring agents (usually just
traditional server monitoring) and checked off the box on the deployment
checklist.
7. It’s 2 days before go-live and we have an application to test. The load test team
has coded some form of synthetic load to be applied to the servers (a rough sketch of
what such a script might look like follows this list). Functional
testing showed that the application worked. Load testing shows slow response
times and lots of errors. Another test is scheduled for tomorrow while the
development team works frantically to figure out what went wrong with this test.
8. One day until go-live: load test session 2 still shows some slow response times and a
few errors, but nothing that will stop this application from going into production. We
call the load test a “success” and give the green light to deploy the application
onto the production servers. The app is deployed, functional testing looks good,
and we wait until tomorrow for the real test: production users!
9. Go-Live — Users hit the application, the application stalls and/or crashes, the
operations team checks the infrastructure and gets the developers the log files to
look at. Management is upset. Everyone is asking if we have any monitoring tools
that can show what is happening in the application.
10. Week one is a mess with the application working, crashing, restarting,
working again, and new emergency code releases going into production to fix
the problems. Week 2 and each subsequent week will get better until new
functionality gets released in the next major change window.
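To make step 7 a little more concrete, here is a minimal sketch of the kind of
quick-and-dirty synthetic load script a load test team might throw together at the
last minute. The target URL, request count, and concurrency are hypothetical
placeholders, and the script uses only the Python standard library.

# Minimal synthetic load sketch (hypothetical target URL and settings).
# Fires concurrent GET requests and reports response times and error counts.
import time
import urllib.error
import urllib.request
from concurrent.futures import ThreadPoolExecutor

TARGET_URL = "http://app.example.internal/health"  # placeholder endpoint
TOTAL_REQUESTS = 200
CONCURRENCY = 20

def hit_once(_):
    start = time.perf_counter()
    try:
        with urllib.request.urlopen(TARGET_URL, timeout=10) as resp:
            resp.read()
            ok = 200 <= resp.status < 400
    except (urllib.error.URLError, OSError):
        ok = False
    return time.perf_counter() - start, ok

def main():
    with ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
        results = list(pool.map(hit_once, range(TOTAL_REQUESTS)))
    times = sorted(t for t, _ in results)
    errors = sum(1 for _, ok in results if not ok)
    print(f"requests: {len(results)}  errors: {errors}")
    print(f"median: {times[len(times) // 2]:.3f}s  p95: {times[int(len(times) * 0.95)]:.3f}s")

if __name__ == "__main__":
    main()

A script like this tells you that something is slow or erroring, but nothing about
why, which is exactly the gap the second scenario closes with application-level
monitoring.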
In With the New
Part of the problem with the scenario above is that the development and operations
teams are so far removed from each other that there is little to no communication
during the build and test phases of the development lifecycle. What if we took a
small step towards a more collaborative approach, as recommended by DevOps?
How would this process change? Let’s walk through the revised steps:
1. Place the order for development, testing, user acceptance testing, and production
infrastructure. Usually about an 8-week lead time.
2. Development and operations personnel fill out the virtual paperwork together,
which creates a much more accurate picture of infrastructure requirements.
Discussions about failover, redundancy, disaster recovery, data center
locations, and storage requirements progress more quickly with better
sizing estimates and a shared understanding of the overall environment.
3. New application is added to change management database to include new
infrastructure components, application components, and dependencies.
4. Operations is fully aware of the progress the developers are making. This gives
the operations staff an opportunity to discuss monitoring requirements from
both a business and IT perspective with the developers. Operations starts
designing the monitoring architecture while the servers arrive and are
being racked and stacked. Both the development and operations teams are
comfortable with the hardware requirement estimates but understand that
they will have to see what happens during load testing and make adjustments.
Developers start using the monitoring tools in their dev environment to identify
issues before the application ever makes it to test.
5. We’re closing in on one week until the scheduled go-live date but the application
isn’t ready for testing yet. It’s not the developers’ fault that the functional
requirements keep changing, but it is going to squeeze the testing and
deployment phases.
6. The monitoring team has installed their standard monitoring agents as well
as the more advanced application performance monitoring (APM) agents
across all environments (a rough sketch of the kind of per-transaction timing and
error data such agents collect follows this list). This provides the foundation for
rapid triage during development, load testing, and production.
7. It’s 2 days before go-live and we have an application to test. The load test team
has coded a robust set of synthetic load tests based upon application monitoring
data gathered during development. This load is applied to the application,
which reveals some slow response times and some errors. The developers and
operations staff use the APM tool together during the load test to immediately
identify the problematic code and have a new release available by the end of
the original load test. This process is repeated until the slow response times and
errors are resolved.
8. One day until go-live: we were able to stress test overnight and everything
looks good. We have the green light to deploy the application onto the
production servers. The app is deployed, functional testing looks good,
business and IT metric dashboard looks good, and we wait until tomorrow for
the real test…production users!
9. Go-Live — Users hit the application, the application works well for the most
part. The APM tool is showing the developers and the operations staff some slow
response times and a couple of errors. The team agrees to implement a fix
after business hours, as the business dashboard shows that things are generally
going well. After hours, the development and operations teams collaborate on
building, testing, and deploying the new code to fix the issues identified that day.
Management is happy.
10. Week one is highly successful with issues being rapidly identified and dealt
with as they come up. Week 2 and each subsequent week are business as usual
and the development team is actively focused on releasing new functionality
while operations adapts monitoring and dashboards when needed.
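Step 6 mentions APM agents without naming a product, so the sketch below is a
hand-rolled stand-in, not any particular vendor’s agent: a simple decorator that
records per-transaction response times and error counts, which is the kind of data
both developers (in dev and test) and operations (during the load test and in
production) would be looking at together. The transaction and function names are
made up for illustration.

# Illustrative APM-style instrumentation (real agents auto-instrument frameworks
# and ship their data to a central collector instead of printing it).
import functools
import statistics
import time
from collections import defaultdict

_timings = defaultdict(list)   # transaction name -> list of durations (seconds)
_errors = defaultdict(int)     # transaction name -> error count

def traced(name):
    """Record the duration and error count of every call to the wrapped function."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return func(*args, **kwargs)
            except Exception:
                _errors[name] += 1
                raise
            finally:
                _timings[name].append(time.perf_counter() - start)
        return wrapper
    return decorator

def report():
    """Print a crude per-transaction summary, dashboard-style."""
    for name, durations in _timings.items():
        print(f"{name}: calls={len(durations)} "
              f"avg={statistics.mean(durations) * 1000:.1f}ms "
              f"errors={_errors[name]}")

@traced("lookup_account")          # hypothetical transaction a developer might wrap
def lookup_account(account_id):
    time.sleep(0.01)               # stand-in for a database or downstream call
    return {"id": account_id}

if __name__ == "__main__":
    for i in range(25):
        lookup_account(i)
    report()

The point is less the code than the habit: developers running this kind of
instrumentation from day one in dev, and operations reading the same numbers
during load testing and in production, lets both teams point at the same slow
transaction instead of trading log files after the fact.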
So which scenario sounds better to you? Have you ever been in a situation where
increased collaboration caused more problems than it solved? In this example the
overall process was kept mostly intact to ensure compliance with regulatory audit
procedures.
Developers were never granted access to production (a regulatory issue
for financial services companies), but by being tightly coupled with operations they
had access to all of the information they needed to solve the issues.
It seems to me you can make a big impact across the lifecycle of an application by
implementing parts of the DevOps philosophy in even a minor way. In this example
we didn’t even touch the automation aspects of DevOps. That’s where all of those
fun and useful tools come into play, so that is where we will pick up next time.