Securing DevOps: Security in the Cloud

I cannot say anything bad about this book, only good. It’s full of quality content and a joy to read. The title speaks for itself: it’s a book for DevOps engineers. But I would recommend it to anyone who’s even slightly interested in security, the cloud, or CI/CD. The author builds a CI/CD pipeline throughout the first part of the book and explains useful security practices in the rest. It has many real-life examples and many very good explanations. In fact, I’ve never seen a better explanation of CSRF, PKI, DREAD, and other topics.

In DevOps, everyone in the product pipeline is focused on the customer:

  • Product managers measure engagement and retention ratios.
  • Developers measure ergonomics and usability.
  • Operators measure uptime and response times.

The customer is where the company’s attention is. The satisfaction of the customer is the metric everyone aligns their goals against.

In contrast, many security teams focus on security-centric goals, such as

  • Compliance with a security standard
  • Number of security incidents
  • Count of unpatched vulnerabilities on production systems

When the company’s focus is directed outward to its customers, security teams direct their focus inward to their own environment. One wants to increase the value of the organization, while the other wants to protect its existing value. Both sides are necessary for a healthy ecosystem, but the goal disconnect hurts communication and efficiency.

Figure 1.6 A logging pipeline implements a standard tunnel where events generated by the infrastructure are analyzed and stored.

For many engineers and managers, risk management is about making large spreadsheets with colored boxes that pile up in our inbox. This is, unfortunately, too often the case and has led many organizations to shy away from risk management. In part 3 of this book, I talk about how to break away from this pattern and bring lean and efficient risk management to a DevOps organization. Managing risk is about identifying and prioritizing issues that threaten survival and growth. Colored boxes in spreadsheets can indeed help, but they’re not the main point.

A good risk-management approach must reach three targets:

  • Run in small iterations, often and quickly. Software and infrastructure change constantly, and an organization must be able to discuss risks without involving weeks of procedures.
  • Automate! This is DevOps, and doing things by hand should be the exception, not the rule.
  • Require everyone in the organization to take part in risk discussions. Making secure products and maintaining security is a team effort.

A risk-management framework that achieves all three of these targets is presented in chapter 11. When implemented properly, it can be a real asset to an organization and become a core component of the product lifecycle that everyone in the organization welcomes and seeks.

As we focused on getting the invoicer deployed, we ignored several security issues on the application, infrastructure, and CI/CD pipeline:

  • GitHub, CircleCI, and Docker Hub need access to each other. By default, we granted all three access to highly privileged accounts which, if leaked, could damage other services hosted on these accounts. Making use of accounts with fewer privileges will increase security.
  • Similarly, the credentials we used to access AWS could easily be leaked, granting a bad actor full access to the environment. Multifactor authentication and fine-grained permissions should be used to reduce the impact of a credential leak.
  • Our database security practices are subpar. Not only does the invoicer use an admin account to access PostgreSQL, but the database itself is also public. A good way to reduce the risk of a breach is to harden the security of the database.
  • The public interface to the invoicer uses clear-text HTTP, meaning that anyone on the connection path can copy and modify the data in transit. HTTPS is an easy security win and we should make use of it right away.
  • And finally, the invoicer itself is wide open to the internet. We need authentication and strong security practices to keep the application secure.

The Docker container of the invoicer is hosted on Docker Hub (step 5 of figure 2.1). You need to tell EB the location of the container so it can pull it down from Docker Hub and deploy it to the EC2 instance. The following JSON file will handle that declaration.

Listing 2.20 EB configuration indicates the location of the container

    {
        "AWSEBDockerrunVersion": "1",
        "Image": {
            "Name": "",
            "Update": "true"
        },
        "Ports": [
            {
                "ContainerPort": "8080"
            }
        ],
        "Logging": "/var/log/nginx"
    }

The JSON configuration will be read by each new instance that joins your EB infrastructure, so you need to make sure instances can retrieve the configuration by uploading it to AWS S3. Save the definition to a local file, and upload it using the command line. Make sure to change the bucket name from “invoicer-eb” to something personal, as S3 bucket names must be unique across all AWS accounts.

Listing 2.21 Uploading the application configuration to S3

aws s3 mb s3://invoicer-eb 
aws s3 cp app-version.json s3://invoicer-eb/ 

In EB, you reference the location of the application definition to create an application version named invoicer-api.

Listing 2.22 Assigning the application configuration to the EB environment

aws elasticbeanstalk create-application-version \
    --application-name "invoicer" \
    --version-label invoicer-api \
    --source-bundle "S3Bucket=invoicer-eb,S3Key=app-version.json"

And finally, instruct EB to update the environment using the invoicer-api application version you just created. With one command, tell AWS EB to pull the Docker image, place it on the EC2 instances, and run it with the environment previously configured, all in one automated step. Moving forward, the command in the following listing is the only one you’ll need to run to deploy new versions of the application.

Listing 2.23 Deploying the application configuration to the EB environment

aws elasticbeanstalk update-environment \
    --application-name invoicer \
    --environment-id e-curu6awket \
    --version-label invoicer-api

The environment update takes several minutes, and you can monitor completion in the web console. When the environment turns green, it’s been updated successfully. The invoicer has a special endpoint on /version that returns the version of the application currently running. You can test the deployment by querying the version endpoint from the command line and verifying the version returned is the one you expect.
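The version check described above can also be scripted. The following Go program is a minimal sketch, not the book's code: it assumes the /version endpoint answers with a short plain-text body (the real response format may differ), and it runs against a local stand-in server so the example is self-contained. The names fetchVersion and checkAgainstStub are hypothetical.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"strings"
)

// fetchVersion queries the /version endpoint under baseURL and returns the
// response body with surrounding whitespace trimmed.
func fetchVersion(baseURL string) (string, error) {
	resp, err := http.Get(baseURL + "/version")
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return "", fmt.Errorf("unexpected status: %s", resp.Status)
	}
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return "", err
	}
	return strings.TrimSpace(string(body)), nil
}

// checkAgainstStub starts a local stand-in for the invoicer that reports
// stubVersion (an assumption: plain text), and tells whether fetchVersion
// returns the expected value.
func checkAgainstStub(stubVersion, expected string) bool {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if r.URL.Path != "/version" {
			http.NotFound(w, r)
			return
		}
		fmt.Fprintln(w, stubVersion)
	}))
	defer srv.Close()
	got, err := fetchVersion(srv.URL)
	return err == nil && got == expected
}

func main() {
	fmt.Println("deployed version matches:", checkAgainstStub("invoicer-api", "invoicer-api"))
}
```

To check a real deployment, you would pass the public URL of your EB environment to fetchVersion instead of the stub's address.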

Specialists spend an entire career perfecting skills in WebAppSec. A single chapter can only provide an overview of the field, so we’ll focus on the elementary controls needed to bring the invoicer to a solid security level and leave pointers for you to go beyond the scope of this chapter. You can find many great resources on the subject. The following is a short list you should keep nearby:

  • The Open Web Application Security Project has a lot of excellent resources on protecting web apps. OWASP also publishes a top-10 list of vulnerabilities in web apps every few years, which is a great tool to raise security awareness in your organization.
  • Dafydd Stuttard and Marcus Pinto’s The Web Application Hacker’s Handbook: Finding and Exploiting Security Flaws (Wiley, 2011) and Michal Zalewski’s The Tangled Web: A Guide to Securing Modern Web Applications (No Starch Press, 2011) are two excellent books on the topics of breaking and securing web apps.
  • Mozilla Developer Network (MDN) is one of the best sources of information on web development techniques, JavaScript, and browser security on the internet (surely my involvement with Mozilla makes me biased, but still, MDN is a truly great resource).

Finding vulnerabilities by hand is a long and tedious task. We’re going to use OWASP Zed Attack Proxy (ZAP), an open source tool designed to scan web apps for vulnerabilities, to make our life a lot easier. ZAP is a downloadable Java application. It’s also available as a Docker container that can be retrieved via docker pull owasp/zap2docker-weekly.

Cross-site scripting and Content Security Policy

Perhaps the most prevalent web vulnerability at the time of writing is the cross-site scripting attack, commonly referred to as XSS. The ZAP baseline scan indicates that the invoicer lacks protection against XSS attacks by displaying these two failures:

  • FAIL: Web Browser XSS Protection Not Enabled
  • FAIL: Content Security Policy (CSP) Header Not Set

XSS attacks are caused by injecting fraudulent code into a website that’s later reflected to other site visitors as if it was normal content. The fraudulent code is executed in the browser of the target to do bad things, like stealing information or performing actions on the user’s behalf.

XSS attacks have grown in importance as web apps increase in complexity, to the point of becoming the most reported security issue on modern websites. We know the invoicer is vulnerable to an XSS attack, so let’s first exploit this vulnerability and then discuss how to protect against it.

You may recall from chapter 2 that the invoicer exposes several endpoints to manage invoices, one of which creates new invoices based on JSON data submitted in the body of the POST request. Consider the JSON document in the following listing as the input of an attack and pay particular attention to the description field. Instead of containing a regular string, it contains HTML code that calls the JavaScript alert() function.

Listing 3.3 Malicious invoice payload with an XSS stored in the description field

{
  "is_paid": false,
  "amount": 51,
  "due_date": "2016-05-07T23:00:00Z",
  "charges": [
    {
      "type": "physical checkup",
      "amount": 51,
      "description": "<script type='text/javascript'>alert('xss');</script>"
    }
  ]
}

Save this document to a file and POST it to the invoicer API.

Listing 3.4 Posting the malicious payload to the application

curl -X POST -d @/tmp/baddata.json

If you retrieve this invoice by pointing a browser at the /invoice/ endpoint of the API, as shown in figure 3.5, the description field is returned exactly as you sent it: as a string. Nothing malicious happens there.

But if you access the invoice through the web interface you added to the invoicer, the description field is rendered to the user as HTML, not as raw JSON. The browser then interprets the <script> block as code and executes it as part of the rendering of the page. This rendering has the effect of triggering the alert() function contained in the malicious payload and displaying an alert box, as shown in figure 3.6.

Why didn’t the malicious code get executed when you accessed the raw JSON? This is because the API endpoint that returns raw JSON also returns an HTTP header named Content-Type set to application/json. The browser notices the data isn’t an HTML document and doesn’t execute its content. XSS is only an issue on HTML pages where scripts and styles can be abused to execute malicious code. The attack is rarely an issue on web APIs, unless those can be abused to return HTML or feed data into other HTML pages.
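To make the role of the Content-Type header concrete, here is a minimal Go sketch of an API handler that labels its response as JSON. It is an illustration under assumptions, not the invoicer's actual handler: the charge type, getCharge, and contentTypeOf are hypothetical names, and the request is served in memory.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// charge is a hypothetical, trimmed-down version of an invoice charge.
type charge struct {
	Description string `json:"description"`
}

// getCharge returns the charge as JSON and labels the response accordingly,
// so a browser treats the body as data rather than as an HTML document.
func getCharge(w http.ResponseWriter, r *http.Request) {
	w.Header().Set("Content-Type", "application/json")
	json.NewEncoder(w).Encode(charge{Description: "<script>alert('xss');</script>"})
}

// contentTypeOf calls a handler in memory and returns its Content-Type header.
func contentTypeOf(h http.HandlerFunc) string {
	rec := httptest.NewRecorder()
	h(rec, httptest.NewRequest(http.MethodGet, "/invoice/1", nil))
	return rec.Result().Header.Get("Content-Type")
}

func main() {
	fmt.Println(contentTypeOf(getCharge)) // application/json
}
```

The same payload becomes dangerous only once some page inserts it into HTML, which is why the web interface, not the JSON endpoint, triggers the alert box.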

In-house TDS

Here again we take CircleCI as an example, but a similar workflow can be implemented in any CI environment, including one that you run inside your own data center. For example, when we implemented the ZAP Baseline scan with Mozilla, we ran it as part of a Jenkins deployment pipeline, on a private CI platform, to scan environments being deployed to preproduction.

You can integrate TDS into your pipeline in many different ways. For the purpose of this book, it’s easier for us to rely on a third party, but you can achieve the same results by running the entire pipeline behind closed doors. Focus on the concept, not the implementation details.

To implement this workflow, you modify the configuration of CircleCI to retrieve the ZAP container and run it against the invoicer. The invoicer will run inside its own Docker container and expose a local IP and port for ZAP to scan. These changes are applied to the config.yml file, as described in the following listing.

Listing 3.1 Configuring CircleCI to run a security scan against the invoicer

- run:
    name: Build application container
    command: |
        go install --ldflags '-extldflags "-static"' \${CIRCLE_PROJECT_USERNAME}/${CIRCLE_PROJECT_REPONAME};
        [ ! -e bin ] && mkdir bin;
        cp "${GOPATH_HEAD}/bin/${CIRCLE_PROJECT_REPONAME}" bin/invoicer;
        docker build -t ${DOCKER_REPO}/${CIRCLE_PROJECT_REPONAME} .;        
- run:
    name: Run application in background
    command: |
        docker run ${DOCKER_REPO}/${CIRCLE_PROJECT_REPONAME}        
    background: true
- run:
    name: ZAP baseline scan of application
    # Only fail on error code 1, which indicates at least one FAIL was found.
    # error codes 2 & 3 indicate WARN or other, and should not break the run
    command: |
        docker pull owasp/zap2docker-weekly && \
        docker run -t owasp/zap2docker-weekly \
            -u${DOCKER_REPO}/${CIRCLE_PROJECT_REPONAME}/master/zap-baseline.conf \
            -t || \
        if [ $? -ne 1 ]; then exit 0; else exit 1; fi;

The changes to CircleCI are submitted as a patch in a pull request, which triggers CircleCI to run the configuration. The four steps described in figure 3.5 are followed. If ZAP encounters a vulnerability, it will exit with a non-zero status code, which tells CircleCI that the build has failed. If you run this test against the source code of the invoicer from chapter 2, which doesn’t yet have mitigations in place, the scan will return four security failures, shown in the following listing.

Listing 3.2 Output of the ZAP baseline scan against the invoicer

FAIL: Web Browser XSS Protection Not Enabled
FAIL: Content Security Policy (CSP) Header Not Set
FAIL: Absence of Anti-CSRF Tokens
FAIL: X-Frame-Options Header Not Set

The output of the scan probably doesn’t mean anything to you yet, but it tells us one thing: the invoicer is insecure. In the next sections, I’ll explain what these issues are and how to mitigate them, and we’ll refer back to the baseline scan to verify that we’ve fixed them.

XSS attacks come in many different forms. The attack you just used is particularly dangerous because it stores data in the invoicer’s database persistently, and so is called persistent XSS. Other types of XSS don’t need to store data in the application database, but instead abuse the rendering of query parameters. The invoicer is vulnerable to this type of XSS as well, known as a DOM XSS attack, as it modifies the Document Object Model (DOM) of the browser. To execute it, you need to inject code into one of the query string parameters, for example, the invoiceid parameter.

Listing 3.5 DOM XSS attack using malicious code in query parameters

<script type='text/javascript'>alert('xss');</script>

When entering the URL from listing 3.5 in the browser, the web interface uses the value stored in the invoiceid parameter to render part of the page. The fraudulent JavaScript code is then added to the HTML of the page and executed. This type of XSS requires an attacker to send fraudulent links to its targets for them to click, which seems like a barrier to execution, but can in fact be easily done by hiding those links inside of phishing emails or web-page buttons.

What are inline scripts?

JavaScript code can be embedded into an HTML page in one of two ways. The code can be stored in a separate file and referenced via a <script src="..."> tag, which will retrieve the external resource from the location specified at src. Or the code can be directly added in between script tags: <script>alert('test');</script>. This second method is referred to as inline code, because the code is added directly inside the page as opposed to loaded as an external resource.

In addition to input validation and output encoding, modern web apps should make use of security features built into web browsers, the most powerful of which is probably Content Security Policy (CSP).
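Output encoding, mentioned above, can be sketched in a few lines of Go. This is an illustrative example rather than the invoicer's code: renderDescription and containsScriptTag are hypothetical helpers, and html.EscapeString from the standard library is used as one possible encoder (a template engine such as html/template would be another option).

```go
package main

import (
	"fmt"
	"html"
	"strings"
)

// renderDescription returns an HTML fragment for an invoice description,
// escaping the untrusted value so the browser treats it as text, not markup.
func renderDescription(description string) string {
	return "<p>" + html.EscapeString(description) + "</p>"
}

// containsScriptTag reports whether s contains a literal opening <script tag.
func containsScriptTag(s string) bool {
	return strings.Contains(strings.ToLower(s), "<script")
}

func main() {
	payload := "<script type='text/javascript'>alert('xss');</script>"
	out := renderDescription(payload)
	fmt.Println(out)
	fmt.Println("script tag survives encoding:", containsScriptTag(out))
}
```

Once encoded, the payload from the earlier listing is displayed as literal text instead of being executed.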

CSP enables a channel by which a web app can tell web browsers what should and should not be executed when rendering the website. The invoicer, for example, can use CSP to block XSS attacks by declaring a policy that forbids the execution of inline scripts. The declaration of the CSP is done via an HTTP header returned by the application with each HTTP response.

The policy in the following listing tells the browser to enable CSP, which blocks inline scripting by default, and only trusts content that comes from the same origin (the domain where the invoicer is hosted).

Listing 3.7 Basic CSP that forbids the execution of inline scripts

Content-Security-Policy: default-src 'self';

You can set this header to be returned alongside every request to the homepage of the invoicer via the following Go code.

Listing 3.8 Go code to return a CSP header with every request

func getIndex(w http.ResponseWriter, r *http.Request) {
    // Send the policy before any part of the body is written
    w.Header().Add("Content-Security-Policy", "default-src 'self';")
    // ...
}

You can send the CSP header from any component of the infrastructure that’s on the path of the web app, such as from the web server that sits in front of the invoicer.

Although returning security headers from the web server is a good way to ensure the headers are always set, I recommend managing CSP directly in the application code to make implementation and testing easier for developers. ZAP baseline scanning in CI will catch pages that lack the CSP header.
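One way to manage CSP in application code without repeating the header in each handler is a small middleware. The sketch below is an assumption about how this could be organized, not the invoicer's implementation: withCSP and cspHeaderFor are hypothetical names, and the wrapped handler is a stand-in page.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// cspPolicy is the basic policy from the listing above.
const cspPolicy = "default-src 'self';"

// withCSP wraps a handler so that every response carries the CSP header.
func withCSP(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Content-Security-Policy", cspPolicy)
		next.ServeHTTP(w, r)
	})
}

// cspHeaderFor issues an in-memory GET for path against a wrapped stand-in
// handler and returns the CSP header found on the response.
func cspHeaderFor(path string) string {
	app := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprint(w, "<html><body>invoicer</body></html>")
	})
	rec := httptest.NewRecorder()
	withCSP(app).ServeHTTP(rec, httptest.NewRequest(http.MethodGet, path, nil))
	return rec.Result().Header.Get("Content-Security-Policy")
}

func main() {
	fmt.Println(cspHeaderFor("/")) // default-src 'self';
}
```

A check like cspHeaderFor can also serve as a unit test that complements the ZAP baseline scan.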

The concept that one site can link to resources located on another site is a core component of the web. This model works great when sites collaborate with each other in a respectful way and don’t attempt to use hyperlinks to modify each other’s content, but it provides no protection against abuses. A CSRF attack does precisely this: abuses links between sites to force a user into performing actions they didn’t intend to perform.

Figure 3.8 A CSRF attack tricks a user visiting a malicious site (1) into sending requests to the invoicer without their approval (2).

Consider the flow presented in figure 3.8. A user is somehow tricked into visiting a malicious site, maybe via a phishing email or some other means. When connecting to the homepage of that site in step 1, the HTML returned to their browser contains an image link pointing to a URL on the invoicer. The browser, while processing the HTML to build the page, sends a GET request to the URL of the image in step 2.

No image is hosted at that URL because the GET request is meant to delete an invoice. The invoicer, knowing nothing of the ongoing attack, treats the request as legitimate and deletes invoice number 2 from the database. The malicious site successfully forced the user to forge a request that crosses over to the invoicer site; hence, the name of the attack: cross-site request forgery.

You may think, “Shouldn’t authentication on the invoicer protect against this attack?” To some extent, you’d be right, but only if the user isn’t logged in to the invoicer at the time of the attack. If the user is logged in to the invoicer and has the proper session cookies stored locally, the browser will send those session cookies along with the GET request. From the point of view of the invoicer, the deletion request is perfectly legitimate.

We can protect against CSRF attacks by using a tracking token sent to the user when the homepage is built, and then sent back by the browser when the deletion request is submitted. Because the malicious site operates blindly and has no access to the data exchanged between the invoicer and the browser, it can’t force the browser to send the token when triggering the fraudulent deletion request. The invoicer only needs to confirm that a token is present prior to taking any action. If it isn’t, the request isn’t legitimate and should be rejected.

Several techniques can be used to implement a CSRF token in the invoicer. We’ll select one that doesn’t require maintaining a state on the server side: the cryptographic algorithm, HMAC. HMAC, which stands for hash-based message authentication code, is a hashing algorithm that takes an input value and a secret key and generates a fixed-length output value (regardless of the length of the input). You can use HMAC to generate a unique token provided to a website visitor that will authenticate subsequent requests and prevent CSRF attacks.

Listing 3.12 CSRF token: the HMAC of a random value and a secret key

CSRFToken = HMAC(random value, secret key)

Your CSRF token is the result of the unique HMAC generated by the invoicer every time the homepage is requested. When the deletion request is sent by the browser to the invoicer, the HMAC is verified and, if valid, the request is processed. Figure 3.9 illustrates this CSRF token issuance and verification flow.

Figure 3.9 The invoicer issues a CSRF token to the user when they visit the homepage (the GET / request at the top). The CSRF token must be submitted alongside the POST /invoice request that follows to guarantee the user visited the homepage prior to issuing other requests and isn’t being coerced into sending the POST request through a third-party site.

When the user visits the homepage of the invoicer, the HTML document returned to the browser contains a unique CSRF token, named CSRFToken, stored as a hidden field in the form data. The following listing is an extract of the HTML page that shows the CSRF token in the hidden field of the HTML form.

Listing 3.13 The CSRF token stored in the hidden field of the HTML form

<form id="invoiceGetter" method="GET">
  <label>ID :</label>
  <input id="invoiceid" type="text" />
  <input type="hidden" name="CSRFToken" value="S1tzo02vhdM..." />
  <input type="submit" />
</form>

Upon submission of the form, the JavaScript code also provided on the homepage takes the token from the form values and places it into the X-CSRF-Token HTTP header of the request sent to the invoicer. The following listing uses the jQuery framework to send the request with the token. You can find it in the getInvoice() function in statics/invoicer-cli.js of the invoicer’s source code repository.

Listing 3.14 JavaScript code to use a CSRF token in requests

function getInvoice(invoiceid, CSRFToken) {
  $('.desc-invoice').html("<p>Showing invoice ID " + invoiceid + "</p>");
  $.ajax({
    url: "/invoice/delete/" + invoiceid,
    beforeSend: function (request) {
      request.setRequestHeader("X-CSRF-Token", CSRFToken); // token from the form
    }
  }).then( function(resp) { /* ... */ });
}

On the side of the invoicer, the endpoint handling invoice deletion retrieves the token from the HTTP header and calls checkCSRFToken() to verify the HMAC prior to processing the request. This code is shown in the following listing.

Listing 3.15 Go code to verify CSRF tokens before accepting a request

func (iv *invoicer) deleteInvoice(w http.ResponseWriter, r *http.Request) {
    if !checkCSRFToken(r.Header.Get("X-CSRF-Token")) {
        // Reject the request: 406 Not Acceptable
        w.WriteHeader(http.StatusNotAcceptable)
        w.Write([]byte("Invalid CSRF Token"))
        return
    }
    // ...
}

The invoicer verifies the submitted token by generating a second token using the data received from the user and the secret key only it has access to. If the two tokens are equal, the invoicer trusts the request received from the user. If the verification fails, the request isn’t processed and an error code is returned to the browser. Breaking this scheme requires breaking the cryptographic algorithm behind HMAC (SHA256), or gaining access to the secret key, both of which should be hard to do.
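The issuance and verification steps can be sketched in Go as follows. This is a minimal illustration under stated assumptions, not the invoicer's actual code: the token layout (random value and HMAC joined by a dot, both base64url-encoded), the function names newCSRFToken and verifyCSRFToken, and the key passed as a parameter are all hypothetical choices.

```go
package main

import (
	"crypto/hmac"
	"crypto/rand"
	"crypto/sha256"
	"encoding/base64"
	"fmt"
	"strings"
)

// mac returns HMAC-SHA256(value, key).
func mac(value, key []byte) []byte {
	h := hmac.New(sha256.New, key)
	h.Write(value)
	return h.Sum(nil)
}

// newCSRFToken draws a random value and returns "<value>.<HMAC>",
// with both parts base64url-encoded (an assumed layout).
func newCSRFToken(key []byte) (string, error) {
	value := make([]byte, 16)
	if _, err := rand.Read(value); err != nil {
		return "", err
	}
	enc := base64.RawURLEncoding
	return enc.EncodeToString(value) + "." + enc.EncodeToString(mac(value, key)), nil
}

// verifyCSRFToken recomputes the HMAC from the submitted random value and
// compares it with the submitted HMAC in constant time.
func verifyCSRFToken(token string, key []byte) bool {
	parts := strings.Split(token, ".")
	if len(parts) != 2 {
		return false
	}
	enc := base64.RawURLEncoding
	value, err1 := enc.DecodeString(parts[0])
	sum, err2 := enc.DecodeString(parts[1])
	if err1 != nil || err2 != nil {
		return false
	}
	return hmac.Equal(sum, mac(value, key))
}

func main() {
	key := []byte("hypothetical-secret-key") // in practice, a random server-side secret
	token, err := newCSRFToken(key)
	if err != nil {
		panic(err)
	}
	fmt.Println("valid token accepted:", verifyCSRFToken(token, key))
	fmt.Println("forged token accepted:", verifyCSRFToken("Zm9v.YmFy", key))
}
```

Because the server only needs the secret key to recompute the HMAC, no per-user token has to be stored, which is the stateless property the text describes.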

Back to the attack example, this time with the CSRF token enabled. The <img src> code set by the attacker on the malicious site still generates a request sent to the invoicer, but without the proper CSRF token included. The invoicer rejects it with the 406 Not Acceptable error code, as shown by the developer console of Firefox in figure 3.10.

The token dance between the application and the browser can quickly become complicated, and implementing CSRF on a large application is no small task. For this reason, many web frameworks provide automated support of CSRF tokens. It’s rare for developers to implement tokens by hand, but a good understanding of the attack and the ways to protect against it will help you guide a DevOps team in securing web apps.

If you want to learn more about OAuth2 and OpenID Connect, you should read OAuth2 in Action by Antonio Sanso and Justin Richer.

When an application relies on an IdP, the OAuth dance is too complex to be run for every request a user makes. An application must create a session once the user is authenticated, and check the validity of the session when new requests are received.

Sessions can be stateful or stateless:

  • Stateful sessions store a session ID in a database and verify that the user sent the ID with every request. Before a request is processed, the application verifies the status of the session in the database.
  • Stateless sessions don’t store data on the server side, but simply verify that the user possesses a trusted and recent session cookie. For high-performance applications, stateless sessions present the benefit of not requiring a round-trip to the database for every request.

Stateless sessions present a performance benefit but lack the ability to destroy sessions on the server side, because the server doesn’t know which sessions are active and which aren’t. With stateful sessions, destroying a session is as simple as deleting its entry from the database, which forces the user to reauthenticate.

It’s often critical to destroy sessions when bad users abuse your application, or to prevent a disgruntled employee from keeping active access after termination. Carefully consider what type of session you need based on your application and choose stateful sessions whenever possible.
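The trade-off between the two session types can be illustrated with a short Go sketch. Everything here is a simplification under assumptions: an in-memory map stands in for the sessions database, the stateless cookie uses a hypothetical "user|expiry|signature" layout signed with HMAC-SHA256, and all names are invented for the example.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strconv"
	"strings"
	"time"
)

// statefulStore stands in for the sessions table of a database.
type statefulStore map[string]string // session ID -> user

func (s statefulStore) valid(id string) bool { _, ok := s[id]; return ok }
func (s statefulStore) destroy(id string)    { delete(s, id) }

// sign returns the hex-encoded HMAC-SHA256 of payload.
func sign(payload string, key []byte) string {
	h := hmac.New(sha256.New, key)
	h.Write([]byte(payload))
	return hex.EncodeToString(h.Sum(nil))
}

// newStatelessCookie returns "user|expiry|signature"; nothing is stored server-side.
func newStatelessCookie(user string, expires time.Time, key []byte) string {
	payload := user + "|" + strconv.FormatInt(expires.Unix(), 10)
	return payload + "|" + sign(payload, key)
}

// validStatelessCookie checks signature and expiry using only the cookie and the key.
func validStatelessCookie(cookie string, now time.Time, key []byte) bool {
	i := strings.LastIndex(cookie, "|")
	if i < 0 {
		return false
	}
	payload, sig := cookie[:i], cookie[i+1:]
	if !hmac.Equal([]byte(sig), []byte(sign(payload, key))) {
		return false
	}
	j := strings.LastIndex(payload, "|")
	if j < 0 {
		return false
	}
	exp, err := strconv.ParseInt(payload[j+1:], 10, 64)
	return err == nil && now.Unix() < exp
}

// cookieValidAfterSeconds issues a cookie with the given lifetime and checks
// it after elapsed seconds, using a fixed hypothetical key and issue time.
func cookieValidAfterSeconds(ttl, elapsed int64) bool {
	key := []byte("hypothetical-signing-key")
	issued := time.Unix(1700000000, 0)
	cookie := newStatelessCookie("sam", issued.Add(time.Duration(ttl)*time.Second), key)
	return validStatelessCookie(cookie, issued.Add(time.Duration(elapsed)*time.Second), key)
}

func main() {
	sessions := statefulStore{"abc123": "sam"}
	sessions.destroy("abc123") // server-side revocation takes effect immediately
	fmt.Println("stateful session valid after destroy:", sessions.valid("abc123"))
	fmt.Println("stateless cookie valid before expiry:", cookieValidAfterSeconds(3600, 60))
	fmt.Println("stateless cookie valid after expiry:", cookieValidAfterSeconds(3600, 7200))
}
```

Note that the stateless cookie stays valid until it expires: the server has no record to delete, which is the limitation the text warns about.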

  • Security testing with ZAP can be automated in CI to provide immediate security feedback to developers.
  • Cross-site scripting attacks inject malicious code in web apps and can be blocked using character escaping and Content Security Policy.
  • Cross-site request forgery attacks abuse links between websites and should be prevented via CSRF tokens.
  • Clickjacking is an abuse of IFrames that applications can stop via CSP and X-Frame-Options headers.
  • HTTP basic authentication provides a simple way to authorize users but doesn’t protect the confidentiality of credentials while in transit.
  • Web applications should authenticate users via identity providers whenever possible to avoid storing passwords locally.
  • Programming languages provide mechanisms to keep applications up to date, which can be integrated into CI testing.

In a traditional infrastructure, we might have implemented these restrictions via firewall rules using the IP address of each server. But we’re in the cloud, and you may have noticed by now that the entire infrastructure has been set up without ever mentioning IP addresses. In fact, we’ve been so oblivious to the physical representation of the infrastructure that we don’t even know how many virtual machines, let alone physical servers, are involved in serving the application.

IaaS makes it possible to think about infrastructure and services at a level that completely abstracts physical considerations. Instead of defining network policies for the invoicer that allow IP addresses to talk to each other, we go one level higher, and authorize security groups to talk to each other.

Strong authentication requires multiple factors, preferably of different kinds, drawn from the following:

  • A knowledge factor, like a password, that can be memorized by the owner.
  • A possession factor, like the key to your house or an external device required for authentication.
  • An inherence factor, like you, or more precisely, your retina, fingerprint, or voice.

The most common way to implement 2FA on web services is to ask users for a secondary token taken from their phone after they enter their password. Several techniques exist to achieve this.

The simplest and most widespread method is to send a code to the user’s cell phone by SMS or phone call. Possession of the SIM card that holds the phone number is the second factor. This method is safe in theory; unfortunately, phone companies are too lenient in how they agree to migrate phone numbers and security researchers have successfully transferred numbers they don’t own to themselves. SMS authentication doesn’t provide any protection against a motivated attacker. It’s fine for low-security sites, but not for a bastion host.

A more secure approach uses one-time passwords (OTP). An OTP is a short code, either only valid for a single use (HOTP, where the H stands for HMAC-based) or for a short period of time (TOTP, where the T stands for time-based). The algorithm uses a variant of the HMAC we discussed in chapter 3 to protect against CSRF attacks: the user and service share a secret key that’s used to generate and verify the OTP. In the case of HOTP, a counter is also maintained on both sides. TOTP uses a timestamp instead to remove the need to store counters. Nowadays, TOTP tokens stored on users’ phones are common practice. GitHub, AWS, and many online services support this method.
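As a rough sketch of how these codes are computed, the Go program below follows the public HOTP (RFC 4226) and TOTP (RFC 6238) specifications using HMAC-SHA1 and a 30-second time step. The function names are hypothetical, and the secret is the well-known RFC test secret, not something to use in production.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha1"
	"encoding/binary"
	"fmt"
	"time"
)

// hotp computes an HMAC-based one-time password: HMAC-SHA1 over the 8-byte
// big-endian counter, dynamic truncation, then reduction to the requested
// number of decimal digits (up to 9).
func hotp(secret []byte, counter uint64, digits int) string {
	var msg [8]byte
	binary.BigEndian.PutUint64(msg[:], counter)
	h := hmac.New(sha1.New, secret)
	h.Write(msg[:])
	sum := h.Sum(nil)

	// The low 4 bits of the last byte select a 4-byte window in the digest.
	offset := sum[len(sum)-1] & 0x0f
	code := binary.BigEndian.Uint32(sum[offset:offset+4]) & 0x7fffffff

	mod := uint32(1)
	for i := 0; i < digits; i++ {
		mod *= 10
	}
	return fmt.Sprintf("%0*d", digits, code%mod)
}

// totp derives the counter from the clock instead of storing it:
// the number of 30-second steps since the Unix epoch.
func totp(secret []byte, unixSeconds int64, digits int) string {
	return hotp(secret, uint64(unixSeconds/30), digits)
}

func main() {
	secret := []byte("12345678901234567890") // test secret from RFC 4226
	fmt.Println(hotp(secret, 0, 6))          // 755224 (RFC 4226 test vector)
	fmt.Println(totp(secret, time.Now().Unix(), 6))
}
```

The service runs the same computation with its copy of the secret and accepts the login only if the codes match, usually tolerating one time step of clock drift.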

Push authentication, illustrated in figure 4.5, is the most modern technique used as a second factor, but has the downside of requiring a third party to participate in the protocol. In the push model, a user is associated with a smartphone running an application that receives the notification. When the user logs in, the service asks the third party to send a push notification to the user’s phone to complete the second-factor step. The notification pops up on the device and the user approves it with a single touch. This approach provides similar security to OTP techniques, where the secret key is stored on the user’s phone but removes the need for the user to manually enter the OTP into the service.

I mentioned earlier that SSH comes with secure configuration parameters by default. Few administrators bother changing those parameters and assume their use of SSH is safe from vulnerabilities. In this section, we’ll discuss problems common to SSH installations and how to fix them with strict configuration parameters on both the server and client side.

Start by evaluating the security of the bastion’s configuration using a command-line scanner. One such scanner is Mozilla’s ssh_scan, which can be run from a Docker container, as shown in the following listing.

Listing 4.31 Installing and executing ssh_scan Docker container against the bastion

$ docker pull mozilla/ssh_scan
$ docker run -it mozilla/ssh_scan /app/bin/ssh_scan \
    -t \
    -P config/policies/mozilla_modern.yml

The output of the scan returns a lot of information about the parameters supported by the bastion’s SSH server; we’ll discuss how to tweak those parameters in the next section. Focus on the compliance results: they give you hints about the issues in the current configuration and point to Mozilla’s modern SSH recommendations for reference.

Listing 4.32 SSH configuration fails compliance with Mozilla’s modern guidelines

    "compliance": {
        "policy": "Mozilla Modern",
        "compliant": false,
        "recommendations": [
            "Remove these Key Exchange Algos: diffie-hellman-group14-sha1",
            "Remove these MAC Algos: [email protected], [email protected], [email protected], hmac-sha1"
        ],
        "references": [
            ...
        ]
    }

Let’s dive into the configuration of SSH and make the bastion compliant with Mozilla’s modern guidelines.

The SSH agent is one of the most useful and dangerous tools in the SSH toolbox of administrators. It’s a resident program that lives on the local machine of an SSH client and holds decrypted private keys. Without an SSH agent, operators must enter the passphrases of private keys every time they initiate a connection to a remote server, which quickly becomes cumbersome. Using the ssh-add command, operators unlock and load their keys in the agent’s memory once and use them for as long as the agent lives. An optional -t parameter can be specified to expire keys after some time.

Listing 4.37 ssh-add decrypts and loads private keys into SSH agent for 30 minutes

$ ssh-add -t 1800 ~/.ssh/id_rsa_sam_2018-02-31

The main goal of the agent is to forward authentication data over the network. Imagine you want to ssh into the invoicer’s application server through the bastion. You’ll need to first ssh into the bastion, and then perform another SSH connection to the invoicer. That second connection requires a key-pair that doesn’t exist on the bastion and is only stored on your local machine. You could copy the private key over to the bastion, but that’s a major security risk. SSH-agent forwarding, represented in figure 4.9, solves this problem by allowing the second connection to tunnel through the first connection and request authentication from the agent on the operator’s machine.

Forwarding SSH agents is a powerful technique that’s popular among administrators, but few are aware of the security risk it implies: when an agent is forwarded, the operator’s authentication data is accessible to anyone with access to the agent. In effect, anyone with root access to the bastion host can use the operator’s agent. This is due to the agent creating a Unix socket on the bastion that allows subsequent SSH connections to talk back to the operator’s machine. The location of the Unix socket is stored in the SSH_AUTH_SOCK environment variable and only accessible to the user, as shown in the following listing, but root can steal the user’s identity and access the socket.

Listing 4.38 Location and permissions of the SSH-agent socket on the bastion

$ ls -al /tmp/ssh-aUoLbn8rF9/agent.15266
srwxrwxr-x 1 sam sam 0 Sep  3 14:44 /tmp/ssh-aUoLbn8rF9/agent.15266

The recommendation here is to be careful when forwarding an agent; only enable forwarding when needed and on trusted hosts. In practice, that means disabling agent forwarding by default and either using the -A parameter on the SSH command line when connecting to a server or enabling it for specific hosts. The following listing shows a configuration that enables the agent for the bastion host only.

Listing 4.39 Disabling SSH agent by default, except for the bastion

Host *
    ForwardAgent no
    ForwardAgent yes

I personally prefer to disable the agent entirely and use the -A flag on the SSH command line when the agent is needed. It’s a little more cumbersome, but if you rarely need to jump hosts, it provides better security than a permanent forward.

The better option: ProxyJump

If you’re using a modern installation of OpenSSH (starting at version 7.3), the ProxyJump option provides a safe alternative to SSH-agent forwarding. You can use ProxyJump on the command line via the -J flag:

$ ssh -J

You can also set a configuration file that automatically uses ProxyJump for any host under the domain, as follows:

Host *

Because ProxyJump doesn’t expose a socket on the intermediate bastion hosts, it isn’t subject to the same vulnerability as SSH-agent forwarding. Prefer it, assuming your infrastructure supports modern SSH.

All mature databases provide fine-grained access control and permissions, and PostgreSQL (PG) is one of the most mature relational databases. Permissions on a PG database use two core principles:

  • Users that connect to a database are identified by their role. A role carries a set of permissions and can own database objects, like tables, sequences, or indexes. Roles can also inherit from other roles, and always inherit the public role. This inheritance model allows for complex policy building, but also makes management and auditing of permissions more difficult. It’s important to note that roles are defined in the postgres database server program and are global to postgres.
  • Permissions on database objects are handled through grants. A grant gives permission to a role to perform an operation. Standard grants are SELECT, INSERT, UPDATE, DELETE, REFERENCES, USAGE, UNDER, TRIGGER, and EXECUTE, the details of which can be found in the PostgreSQL documentation. Everything that can be granted can also be revoked using the opposite operation, REVOKE.

The SQL Standard (ISO/IEC 9075-1:2011 at the time of writing) specifies the meaning of roles and grants. Most relational databases that implement this standard handle permissions in similar ways, making it easy to port one’s knowledge from one database product to another. The PostgreSQL \dp command can be used in a psql terminal to list permissions on a database. The following listing shows the output of \dp on the invoicer’s database, which doesn’t yet contain any permissions.

Listing 4.47 Permissions on the tables of the invoicer’s database

invoicer=> \dp
                              Access privileges
 Schema |      Name       |   Type   | Access privileges | Column privileges
 public | charges         | table    |                   |
 public | charges_id_seq  | sequence |                   |
 public | invoices        | table    |                   |
 public | invoices_id_seq | sequence |                   |
 public | sessions        | table    |                   |
(5 rows)

Similarly, you can list the ownership of the tables using \d. They all logically belong to the “invoicer” administrator, because it’s the only user that currently exists.

Listing 4.48 Ownership of the invoicer’s database tables

invoicer=> \d
               List of relations
 Schema |      Name       |   Type   |  Owner 
 public | charges         | table    | invoicer
 public | charges_id_seq  | sequence | invoicer
 public | invoices        | table    | invoicer
 public | invoices_id_seq | sequence | invoicer
 public | sessions        | table    | invoicer
(5 rows)

Finally, the \du command lists existing roles on the PG server, with their attributes and the roles they inherit from. Here again, it’s important to remember these roles are defined at the level of the PG server, not the invoicer database. Listing 4.49 shows the declaration of the invoicer user which inherits from the rds_superuser role. rds_superuser is an AWS RDS-specific role that grants most of the superuser permissions, with the exception of sensitive operations, like replication configuration. Although the invoicer role is specific to the RDS instance, the rds_superuser can be found on every PostgreSQL database managed by AWS.

Listing 4.49 Roles of the RDS PG server that hosts the invoicer database

invoicer=> \du
                                  List of roles
   Role name   |                   Attributes             |    Member of 
 invoicer      | Create role, Create DB                   | {rds_superuser}
               | Password valid until infinity            |
 rds_superuser | Cannot login                             | {}
 rdsadmin      | Superuser, Create role, Create DB,       | {}
               | Replication, Password valid indefinitely |
 rdsrepladmin  | No inheritance, Cannot login, Replication| {}

Now that you have a better idea of the permission model of your database, it’s time to create roles for the application, the operators, and the developers.

For the sake of the example, let’s introduce Max, a developer who would like to access technical information in the database, like table sizes, active sessions, count of entries, and so on. Max doesn’t need, or want, access to personally-identifiable information (PII), so you need to create a set of permissions that prevents him from accessing sensitive columns. You’ll start by creating a role for Max that allows him to log in.

Listing 4.53 Creating a role to allow Max to log in to the database

invoicer=> CREATE ROLE max LOGIN PASSWORD '03wafje*10923h@(&1';

Max can connect to the database using this username and password, and access any object allowed by the public role he automatically inherits. This includes table sizes and troves of information about the database instance, but should he attempt to access any of the records located in the invoicer’s tables, a “permission denied” error will immediately block his query.

Listing 4.54 Allowing Max to view the state of the database but not table records

invoicer=> \c invoicer
invoicer=> \d+
                             List of relations
 Schema |      Name       |   Type   |  Owner   |    Size    | Description
 public | charges         | table    | invoicer | 16 kB      |
 public | charges_id_seq  | sequence | invoicer | 8192 bytes |
 public | invoices        | table    | invoicer | 8192 bytes |
 public | invoices_id_seq | sequence | invoicer | 8192 bytes |
 public | sessions        | table    | invoicer | 8192 bytes |
(5 rows)
invoicer=> select * from charges;
ERROR:  permission denied for relation charges

You grant Max the permission to read (SELECT) various columns that don’t contain any sensitive information on each of the three tables of the invoicer’s database:

  • On the charges table, Max is allowed to read the charge IDs, timestamps, and invoice IDs. Max isn’t permitted access to the charge types, amounts, or descriptions.
  • On the invoices table, Max is allowed to read the invoice IDs, timestamps, and payment status. Max isn’t permitted access to the invoice amounts, payment, or due dates.
  • On the sessions table, Max is allowed to read the IDs and timestamps. Max isn’t permitted access to the session data.

Listing 4.55 Granting Max permission to read nonsensitive information

invoicer=> GRANT SELECT (id, created_at, updated_at,
           deleted_at, invoice_id) ON charges TO max;
invoicer=> GRANT SELECT (id, created_at, updated_at,
           deleted_at, is_paid) ON invoices TO max;
invoicer=> GRANT SELECT (id, created_at, updated_at,
           expires_at) ON sessions TO max;

The \dp command returns a detailed list of the permissions these directives grant Max, as shown in the following listing. Each entry in Column privileges indicates the column name, followed by the grantee role name and a letter indicating the permission. The letter r indicates read access and corresponds to the SELECT SQL statement.

Listing 4.56 Invoicer database permissions showing Max’s read-only access

invoicer=> \c invoicer
invoicer=> \dp
                   Access privileges
 Schema |      Name       |   Type   | Column privileges
 public | charges         | table    | id:              +
        |                 |          |   max=r/invoicer +
        |                 |          | created_at:      +
        |                 |          |   max=r/invoicer +
        |                 |          | updated_at:      +
        |                 |          |   max=r/invoicer +
        |                 |          | deleted_at:      +
        |                 |          |   max=r/invoicer +
        |                 |          | invoice_id:      +
        |                 |          |   max=r/invoicer
 public | charges_id_seq  | sequence |
 public | invoices        | table    | id:              +
        |                 |          |   max=r/invoicer +
        |                 |          | created_at:      +
        |                 |          |   max=r/invoicer +
        |                 |          | updated_at:      +
        |                 |          |   max=r/invoicer +
        |                 |          | deleted_at:      +
        |                 |          |   max=r/invoicer +
        |                 |          | is_paid:         +
        |                 |          |   max=r/invoicer
 public | invoices_id_seq | sequence |
 public | sessions        | table    |
(5 rows)
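
The entries in the Column privileges column above follow PostgreSQL's grantee=privileges/grantor notation. As a rough illustration of how such entries could be decoded when auditing permissions with a script, here is a minimal Python sketch. The helper name parse_acl_entry is hypothetical, and the letter table only covers the commonly documented privilege codes; unknown letters are passed through unchanged.

```python
# Privilege letters as documented by PostgreSQL for ACL entries.
PRIVILEGE_LETTERS = {
    "r": "SELECT", "w": "UPDATE", "a": "INSERT", "d": "DELETE",
    "D": "TRUNCATE", "x": "REFERENCES", "t": "TRIGGER",
    "X": "EXECUTE", "U": "USAGE", "C": "CREATE",
    "c": "CONNECT", "T": "TEMPORARY",
}


def parse_acl_entry(entry: str) -> dict:
    """Decode one 'grantee=letters/grantor' entry (hypothetical helper)."""
    grantee_and_letters, grantor = entry.strip().split("/", 1)
    grantee, letters = grantee_and_letters.split("=", 1)
    # A '*' after a letter marks the grant option; it is ignored here.
    privileges = [PRIVILEGE_LETTERS.get(c, c) for c in letters if c != "*"]
    return {
        # An empty grantee means the privilege was granted to PUBLIC.
        "grantee": grantee or "PUBLIC",
        "privileges": privileges,
        "grantor": grantor,
    }


entry = parse_acl_entry("max=r/invoicer")
print(entry)
```

Run against the output of \dp, a helper like this makes it easier to spot a role that holds more than read access on a sensitive column.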

With these permissions in place, Max can debug technical issues in the database, but can’t access any sensitive information. This type of access is often sufficient for development work and protects DevOps folks from making a mistake that would put user data at risk.

In the last phase of access-control hardening, we revisit the permissions granted to the application itself.

RSA provides almost all the security necessary to secure a communication, but one problem remains. Imagine you’re communicating with Bob for the first time. Bob tells you his public key is 29931229. You haven’t established a secure channel yet, so how can you be sure that someone isn’t tampering with this information via a man-in-the-middle (MITM)? You have no proof, unless someone else can confirm that this is indeed Bob’s public key.

In the real world, this problem is similar to how we trust passports and driver’s licenses: possessing the document itself isn’t enough. It must come from a trusted authority, like a local government agency (for a driver’s license) or a foreign government (for a passport). In the digital world, we took this exact same notion and created public-key infrastructures (PKI) to link keys to identities.

In the output of the OpenSSL command line from listing 5.2, the client and server agreed to use TLSv1.2 with the ECDHE-RSA-AES128-GCM-SHA256 cipher suite. This cryptic string has a specific meaning:

  • ECDHE is an algorithm known as Elliptic Curve Diffie-Hellman Ephemeral. It’s a mathematical construct that allows the client and server to negotiate a master key securely. We’ll discuss what “ephemeral” means in a little bit; for now, know that ECDHE is used to perform the key exchange.
  • RSA is the public-key algorithm of the certificate provided by the server. The public key in the server certificate isn’t directly used for encryption (because RSA requires exponentiation of large numbers, which is too slow for fast encryption), but instead is used to sign messages during the handshake and thus provides authentication.
  • AES128-GCM is a symmetric encryption algorithm, like Caesar’s cipher, but vastly superior. It’s a fast cipher designed to quickly encrypt and decrypt large amounts of data transiting through the communication. As such, AES128-GCM is used for confidentiality.
  • SHA256 is a hashing algorithm used to calculate fixed-length checksums of the data that transits through the connection. SHA256 is used to guarantee integrity.
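
The breakdown above can be expressed as a small Python sketch that splits the suite name into the four roles just described. It assumes the four-part OpenSSL-style layout of this particular suite; many suite names don't follow that layout, and parse_cipher_suite is a hypothetical helper, not part of any TLS library.

```python
def parse_cipher_suite(name: str) -> dict:
    """Split a name like ECDHE-RSA-AES128-GCM-SHA256 into its four roles."""
    parts = name.split("-")
    # Assumed layout: <key exchange>-<authentication>-<cipher ...>-<hash>
    if len(parts) < 4:
        raise ValueError(f"unexpected cipher suite format: {name!r}")
    return {
        "key_exchange": parts[0],
        "authentication": parts[1],
        "encryption": "-".join(parts[2:-1]),
        "integrity": parts[-1],
    }


suite = parse_cipher_suite("ECDHE-RSA-AES128-GCM-SHA256")
print(suite)
```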

The term “ephemeral” in the key exchange provides an important security feature called perfect forward secrecy (PFS).

In a non-ephemeral key exchange, the client sends the pre-master key to the server by encrypting it with the server’s public key. The server then decrypts the pre-master key with its private key. If, at a later point in time, the private key of the server is compromised, an attacker can go back to this handshake, decrypt the pre-master key, obtain the session key, and decrypt the entire traffic. Non-ephemeral key exchanges are vulnerable to attacks that may happen in the future on recorded traffic. And because people seldom change their password, decrypting data from the past may still be valuable for an attacker.

An ephemeral key exchange like DHE, or its variant on elliptic curve, ECDHE, solves this problem by not transmitting the pre-master key over the wire. Instead, the pre-master key is computed by both the client and the server in isolation, using nonsensitive information exchanged publicly. Because the pre-master key can’t be decrypted later by an attacker, the session key is safe from future attacks: hence, the term perfect forward secrecy.
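
To make the idea of computing the same secret "in isolation" more concrete, here is a toy Python sketch of a classic finite-field Diffie-Hellman exchange. It is not the elliptic-curve variant used by ECDHE, and the modulus and base are small illustrative values chosen for readability; real deployments use standardized groups of 2048 bits or more, or elliptic curves.

```python
import secrets

# Public parameters, visible to any eavesdropper (toy values, not secure).
P = 4294967291  # small prime modulus, 2**32 - 5
G = 5           # public base


def keypair():
    """Return a fresh (private, public) pair; 'ephemeral' means one per session."""
    private = secrets.randbelow(P - 3) + 2  # random value in [2, P - 2]
    public = pow(G, private, P)
    return private, public


def shared_secret(my_private: int, their_public: int) -> int:
    """Combine our private value with the peer's public value."""
    return pow(their_public, my_private, P)


client_private, client_public = keypair()
server_private, server_public = keypair()

# Only the public values cross the wire; each side derives the secret locally.
client_secret = shared_secret(client_private, server_public)
server_secret = shared_secret(server_private, client_public)
print(client_secret == server_secret)
```

Because the private values are discarded after the session and never transmitted, recording the traffic and later stealing the server's certificate key does not reveal the shared secret.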

The downside to PFS is that all those extra computational steps induce latency on the handshake and slow the user down. To avoid repeating this expensive work at every connection, both sides cache the session key for future use via a technique called session resumption. This is what the session-ID and TLS ticket are for: they allow a client and server that share a session ID to skip over the negotiation of a session key, because they already agreed on one previously, and go directly to exchanging data securely.

I could spend an entire book talking only about TLS. And as it happens, someone did: Ivan Ristic, the creator of SSL Labs, wrote a comprehensive study of TLS, PKI, and server configurations in his book Bulletproof SSL and TLS (Feisty Duck, 2017). A must-read if this short chapter doesn’t satisfy your curiosity on this fantastic protocol.

From the point of view of a CA, one of the most complex tasks when issuing certificates is verifying that the user making the request is the legitimate owner of the domain. As discussed, AWS does so by emailing the domain owner at a predefined address. Let’s Encrypt uses a more sophisticated approach that goes through a set of challenges defined in the ACME specification.

The most common challenge involves HTTP: the CA provides the operator requesting the certificate with a random string, which must be placed at a predefined location on the target website for the CA to verify ownership. For example, when requesting a certificate for a domain, the CA looks for the challenge at a well-known path on that domain.

The HTTP challenge method works well for traditional web servers, but your invoicer infrastructure doesn’t have a web server you could easily configure to serve this challenge. Instead, you’ll use the DNS challenge, which looks for the ACME challenge in a TXT record of the domain. For this challenge to work, you need two components:

  • An ACME client that can perform the handshake with Let’s Encrypt, configure the DNS, and request the certificate
  • A registrar that can be configured to serve the TXT ACME challenge

For the client, use lego, a Go client for Let’s Encrypt that supports DNS (and more) challenges. My registrar of choice is Gandi, but lego supports several DNS providers that would work just as well. Requesting a certificate for your domain can be done with a single command.

Listing 5.5 Requesting a certificate from Let’s Encrypt using a DNS challenge

$ GANDI_API_KEY=8aewloliqa80AOD10alsd lego
--email="[email protected]"
--key-type ec256 

Several guides exist to provide operators with modern TLS configurations. In this section, we’ll discuss the guide maintained by Mozilla, which provides three levels of configuration.

Many tools can help you test your TLS configuration. Most of them probe a server to test every possible configuration supported. Tools like Cipherscan, written by the author of this book, give you such reports.

A few advanced tools will also make recommendations and highlight major issues. The most popular and comprehensive of them is certainly SSL Labs, an online TLS scanner that outputs a letter grade from A through F to represent the security of a configuration. An open source alternative is Mozilla’s TLS Observatory, available as a command-line tool and a web interface. The following listing shows the output of the tlsobs command line against the invoicer.

Enabling HTTPS on the invoicer took you 90% of the way to having a secure endpoint. Tweaking it to match Mozilla’s Modern level requires creating a new configuration that only enables selected parameters, instead of using the defaults automatically provided by AWS: only TLS version 1.2 must be activated, and the list of activated cipher suites must be reduced to a minimum. AWS ELB only supports a limited set of parameters, which you need to choose from.

NOTE The configuration presented here is current at the time of writing, but will likely change over time as Mozilla evolves its guidelines and AWS supports more ciphers. Make sure to refer to the links provided and always use the latest version of the recommendations when configuring your endpoints.

Call this new configuration MozillaModernV4. The following listing shows how to create it using the AWS command line.

Listing 5.15 Creating a custom load-balancer policy mapping Mozilla’s Modern level

$ aws elb create-load-balancer-policy \
    --load-balancer-name awseb-e-c-AWSEBLoa-1VXVTQLSGGMG5 \
    --policy-name MozillaModernV4 \
    --policy-type-name SSLNegotiationPolicyType \
    --policy-attributes AttributeName=Protocol-TLSv1.2,AttributeValue=true

The next step is to assign the newly created policy to your ELB, by switching the ELB from using the ELBSecurityPolicy-2015-05 AWS default policy over to MozillaModernV4.

Listing 5.16 Assigning the MozillaModernV4 policy to the invoicer’s ELB

$ aws elb set-load-balancer-policies-of-listener \
    --load-balancer-name awseb-e-c-AWSEBLoa-1VXVTQLSGGMG5 \
    --load-balancer-port 443 \
    --policy-names MozillaModernV4

With this change in place, you’ll kick off a rebuild of the invoicer to verify the ELB passes the compliance test in the deployer logs. The configuration level is now being measured as Modern, so the deployer continues its work by triggering an update of the invoicer’s infrastructure.

Listing 5.17 Logs showing the invoicer’s ELB passes the Modern TLS configuration test

2016/08/14 16:42:46 Received webhook notification
2016/08/14 16:42:46 Verified notification authenticity
2016/08/14 16:42:46 Executing test /app/deploymentTests/
2016/08/14 16:42:49 Test /app/deploymentTests/ succeeded: 
Scanning (id 12123107)
--- Analyzers ---
* Mozilla evaluation: modern
2016/08/14 16:42:51 Deploying EBS application: {
  ApplicationName: "invoicer201605211320",
  EnvironmentId: "e-curu6awket",
  VersionLabel: "invoicer-api.

Permissions between GitHub and CircleCI

GitHub provides granular scopes such as write:repo_hook to create webhooks, and write:public_key to create SSH deployment keys, which should fulfill the needs of CircleCI. We can assume CircleCI is asking for broader permissions to do more for the user. CircleCI uses the broad repo scope to read permissions from GitHub and decide who can make changes to CircleCI projects based on their privileges on GitHub.

After permissions are enabled for your organization, only GitHub repo admins or GitHub owners will be able to make changes to project settings on CircleCI. This is useful for larger teams to make sure your project settings are only changed by team members who have admin access.

In effect, CircleCI not only uses OAuth to log the user in and create webhooks on their behalf, but it also uses OAuth to check which permissions Sam has on the repository. If Sam is an admin or has write access to the repository, she’s permitted to change settings on the CircleCI side of the project. This is a powerful feature, as it centralizes permissions management in GitHub instead of creating a second layer in CircleCI.

Creating teams and users for each Docker Hub repository is a bit of a tedious process, but it ensures a single user can only impact a single container repository. Limiting the scope of sensitive credentials will prove useful the day one of these accounts is leaked.

Trust me on this: you don’t want to spend an entire week changing passwords because you shared a single account everywhere.

The flexibility provided by IAM roles and policies can’t be overstated. In a large AWS infrastructure, where components share the same account and many resources, strong access control can help you maintain a strict security perimeter between infrastructure components. Managing these permissions certainly does have a cost, as they can be complex to write and even more complex to audit, but it’s a small price to pay for the level of security they provide to the overall platform.

IAM roles could allow you, for example, to store secrets in an S3 bucket and grant granular permissions to instances to retrieve those secrets. Many organizations use this approach, but it has the downside of storing cleartext secrets in S3. In the next section, we’ll discuss more sophisticated approaches to handling secret management in AWS.

Secret distribution suffers from the same authentication problem we discussed when considering TLS in chapter 5: you must verify the identity of new systems or risk sending secrets to fraudulent ones. This problem is called the bootstrapping of trust.

The best practice is to always store secrets as encrypted until the very last moment, when they need to be decrypted on the target systems. It’s hard to achieve, because decrypting configuration files requires first providing the instances with a decryption key, and the mechanism by which the key is transferred provides no more security than if we had passed decrypted configuration files directly.

AWS provides a solution to this problem through its Key Management Service (KMS). KMS is a cryptographic service that can be used to manage encryption keys. It works as follows:

  1. Generate an encryption key, kA.
  2. Encrypt document dX with kA and obtain edX.
  3. Encrypt kA with KMS and obtain ekA.
  4. Store both edX and ekA in a location instances can retrieve them from.
  5. Destroy dX and kA.
  6. Instance comes online and downloads edX and ekA.
  7. Instance decrypts ekA with KMS using its instance role and obtains kA.
  8. Instance decrypts edX using kA and obtains dX.
  9. dX contains the cleartext secrets used to configure the instance.

This flow is represented in figure 6.14.

Figure 6.14 The distribution of secrets via AWS KMS requires operators to encrypt configuration secrets via KMS prior to distributing them to EC2 instances, where they’re decrypted, also via KMS. This workflow keeps secrets safely encrypted until they reach their target systems and removes the need to manually distribute secret-decryption keys.
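
The nine steps above describe what is often called envelope encryption. The following Python sketch walks through them under loud assumptions: FakeKMS is a hypothetical in-memory stand-in (it is not the AWS KMS API), the role name is made up, and toy_cipher is an illustration-only XOR keystream standing in for a real authenticated cipher such as AES-GCM. It shows the flow of keys and documents, not production-grade cryptography.

```python
import hashlib
import secrets


def toy_cipher(key: bytes, data: bytes) -> bytes:
    """XOR data with a SHA-256 counter keystream (illustration only, not secure).

    Applying the function twice with the same key returns the original data.
    """
    out = bytearray()
    for offset in range(0, len(data), 32):
        block = hashlib.sha256(key + offset.to_bytes(8, "big")).digest()
        chunk = data[offset:offset + 32]
        out.extend(b ^ k for b, k in zip(chunk, block))
    return bytes(out)


class FakeKMS:
    """Hypothetical stand-in for a key management service."""

    def __init__(self, allowed_roles):
        self._master_key = secrets.token_bytes(32)  # never leaves the service
        self._allowed_roles = set(allowed_roles)

    def encrypt(self, plaintext_key: bytes) -> bytes:
        return toy_cipher(self._master_key, plaintext_key)

    def decrypt(self, encrypted_key: bytes, role: str) -> bytes:
        # Access is decided by the caller's role, as with an instance role.
        if role not in self._allowed_roles:
            raise PermissionError(f"role {role!r} may not decrypt")
        return toy_cipher(self._master_key, encrypted_key)


kms = FakeKMS(allowed_roles={"invoicer-instance"})

# Operator side (steps 1 to 5).
kA = secrets.token_bytes(32)          # 1. generate the encryption key
dX = b"db_password=example-only"      #    the cleartext document
edX = toy_cipher(kA, dX)              # 2. encrypt the document with kA
ekA = kms.encrypt(kA)                 # 3. encrypt kA with the KMS
del kA                                # 5. destroy the cleartext key

# Instance side (steps 6 to 8): only edX and ekA were stored and downloaded.
recovered_kA = kms.decrypt(ekA, role="invoicer-instance")
recovered_dX = toy_cipher(recovered_kA, edX)
print(recovered_dX == dX)
```

The point of the design is that the only long-lived secret, the master key, stays inside the key service, and the decision to release kA rests on the role of the instance rather than on a key an operator has to copy around.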

  • Lock down permissions on code repositories using organizations and teams, and audit those regularly using automated scripts.
  • Enforce the use of two-factor authentication whenever possible to prevent a password leak leading to an account compromise.
  • Limit integration with third parties, review the permissions delegated to them, and revoke delegation when no longer used.
  • Sign Git commits and tags using PGP, and write scripts to review those signatures outside the CI/CD pipeline.
  • Use limited-privileges accounts when integrating components like CircleCI and Docker Hub, and use one account per project, to compartmentalize the impact of an account leakage.
  • Evaluate how container signing could help bring increased trust to your infrastructure, but be aware of its caveats.
  • Become proficient in using AWS IAM policies and use them to grant limited and specific permissions to infrastructure components.
  • Signing code and containers provides high assurance against fraudulent modifications, but is hard to implement in practice.
  • AWS IAM roles are a powerful mechanism to grant fine-grained permissions to systems of the infrastructure.
  • Distribute secrets to systems securely using specialized tools like Mozilla Sops or HashiCorp Vault, and never store them in cleartext when at rest.

The cost of publishing standard logs is much lower than the cost of parsing logs in inconsistent formats.

The OWASP organization, which I mentioned in chapter 3 when discussing application security, provides two useful resources to decide which application events should be logged for security. The OWASP Logging Cheat Sheet is the simpler of the two and provides a high-level list of events an application should record:

  • Input validation failures; for example, protocol violations, unacceptable encodings, invalid parameter names and values
  • Output validation failures such as database-record-set mismatch, invalid data encoding
  • Authentication successes and failures
  • Authorization (access control) failures
  • Session management failures; for example, cookie session identification-value modification
  • Application errors and system events such as syntax and runtime errors, connectivity problems, performance issues, third-party service error messages, filesystem errors, file upload virus detection, configuration changes
  • Application and related systems start-ups and shut-downs, and logging initialization (starting, stopping, or pausing)
  • Use of higher-risk functionality; for example, network connections, adding or deleting users, changes to privileges, assigning users to tokens, adding or deleting tokens, use of systems administrative privileges, access by application administrators, all actions by users with administrative privileges, access to payment card holder data, use of data-encrypting keys, key changes, creation and deletion of system-level objects, data import and export including screen-based reports, submission of user-generated content—especially file uploads
  • Legal and other opt-ins such as permissions for mobile phone capabilities, terms of use, terms and conditions, personal data-usage consent, permission to receive marketing communications

{
  "CloudTrailEvent": {
    "eventVersion": "1.05",
    "userIdentity": {
      "type": "AssumedRole",
      "principalId": "AROAIO:sam",
      "arn": "arn:aws:sts::90992:assumed-role/sec-devops-prod-mfa/sam",
      "accountId": "90992"
    },
    "eventTime": "2016-11-27T15:48:39Z",
    "eventSource": "",
    "eventName": "SwitchRole",
    "awsRegion": "us-east-1",
    "sourceIPAddress": "",
    "userAgent": "Mozilla/5.0 Gecko/20100101 Firefox/52.0",
    "requestParameters": null,
    "responseElements": {
      "SwitchRole": "Success"
    },
    "additionalEventData": {
      "SwitchFrom": "arn:aws:iam::37121:user/sam",
      "RedirectTo": ""
    },
    "eventID": "794f3cac-3c86-4684-a84d-1872c620f85b",
    "eventType": "AwsConsoleSignIn",
    "recipientAccountId": "90992"
  },
  "Username": "sam",
  "EventName": "SwitchRole",
  "EventId": "794f3cac-3c86-4684-a84d-1872c620f85b",
  "EventTime": 1480261719,
  "Resources": []
}

Listing 7.7 shows an example of a CloudTrail log provided by AWS. It records a role-switching operation performed by sam to switch from one AWS account to the other. Note how many details CloudTrail stores with the event. The origin and destination accounts are present, as well as the role used to perform the switch. The IP and user agent of the client are recorded, and timestamps are stored in RFC3339 format in the UTC time zone. Logs don’t get better than this.
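
One practical benefit of RFC3339 timestamps in UTC is that they convert unambiguously to other representations. As a small illustration, this Python sketch converts the eventTime string from the log into Unix epoch seconds and compares it with the EventTime field of the same record. The helper name is hypothetical, and it only handles the exact "Z"-suffixed form without fractional seconds that appears in this event.

```python
from datetime import datetime, timezone


def rfc3339_utc_to_epoch(stamp: str) -> int:
    """Convert a 'YYYY-MM-DDTHH:MM:SSZ' timestamp to Unix epoch seconds."""
    parsed = datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%SZ")
    # The trailing Z means UTC, so attach that zone explicitly.
    return int(parsed.replace(tzinfo=timezone.utc).timestamp())


event_time = "2016-11-27T15:48:39Z"  # "eventTime" in the CloudTrail event
epoch = rfc3339_utc_to_epoch(event_time)
print(epoch == 1480261719)  # matches the "EventTime" field of the record
```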

A logging pipeline should always retain raw logs for some period of time (90 days often seems to strike a reasonable compromise between retention cost and investigative needs).

At Mozilla, we wrote our own event-processing daemon called Hindsight, which uses plugins written in Lua. We’ll discuss Hindsight further in a later chapter.

The final layer of our logging pipeline is the access layer, designed to give operators, developers, and anyone who requires it access to log data. An access layer can be as simple as an SSH bastion host used to access logs from a storage server, or as complex as an Apache Spark cluster designed to run analytical jobs on very large datasets.

These small programs aren’t sophisticated enough to deal with the input and output of log events, and instead defer this task to a dedicated data-processing brain: a piece of software called Hindsight, designed to execute analysis plugins on streams of data.

Regular expressions will bite you

I once worked for a bank that invested heavily in this type of security appliance. The security team was responsible for maintaining the WAF that protected various online services, including the consumer trading service. Every web request entering that service had to pass through hundreds of regular expressions before being allowed to reach the application server. One day, a developer from the online trading team decided to take a look at these regular expressions. I’m not certain what exactly compelled this engineer to read the contents of a file mostly filled with slashes, dollar signs, wildcards, pluses, brackets, and parentheses, but she did. And somewhere around line 418, buried in the middle of a complex regular expression, she found an ominous ‘.+’. Two innocent characters that, in effect, allowed anything to pass through: the regex equivalent of “allow everything.”

Our proud, several-thousand-euro web-application firewall that took an entire team to maintain was executing hundreds of regexes on every request every second, impacting performance and adding engineering complexity to an already complex system, for no other purpose than allowing everything to pass through. Sure, we fixed the issue quickly, but my faith in regular expressions when used for security was never truly restored. If you choose to deploy this type of security system in your organization, pay extra attention to its complexity, or this could happen to you as well.

Implementing a sliding window efficiently can be tricky because it needs to be aware of current time, have access to historical values, and remove older values without impacting the performance of the algorithm. This is where circular buffers come in. A circular buffer is a data structure that implements a sliding window using a fixed-size buffer, where the last entry is followed by the first entry, in a loop.

Figure 8.4 shows a circular buffer with eight slots, each slot corresponding to one minute. Time progresses clockwise. The current minute is marked “t0” and contains a value of 17, indicating that 17 requests have been counted in the current minute. t-1 has a counter of zero, and so does t-2. t-3 has a counter of 23, and so on. The oldest value is marked t-7 and has a value of 8. When the buffer moves forward, t-7 is overwritten and becomes t0, the old t0 becomes t-1, and so on.

Maintaining a sliding window inside a circular buffer gives you a way to flag clients who may be sending a large amount of traffic over a given period of time. You can use it to trigger alerts when a predefined threshold is passed. To do so effectively, you need to keep one circular buffer per client IP to track the count of requests sent by each client individually. In practice, this means maintaining a hash table where the key is the IP of the client and the value is the circular buffer. The memory usage of such a data structure can grow quickly, but because the circular buffer is a fixed size, it can be controlled via configuration. An analyzer implementing this threshold logic is available in the logging pipeline repository.
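The per-client sliding window described above can be sketched in a few lines of Python. This is only an illustration and not the analyzer from the logging pipeline repository: the class names, the eight one-minute slots, and the threshold value are all assumptions made for the example.

```python
import time
from collections import defaultdict
from typing import Optional


class SlidingWindowCounter:
    """Request counts over a fixed number of time slots, stored as a circular buffer."""

    def __init__(self, slots: int = 8, slot_seconds: int = 60):
        self.slots = slots
        self.slot_seconds = slot_seconds
        self.counts = [0] * slots
        self.last_tick: Optional[int] = None  # most recent time slot seen

    def _advance(self, now: float) -> int:
        """Move the buffer forward to `now`, clearing slots that expired."""
        tick = int(now // self.slot_seconds)
        if self.last_tick is None:
            self.last_tick = tick
        # Clear every slot skipped since the last update (at most one full lap).
        last_to_clear = min(tick, self.last_tick + self.slots)
        for t in range(self.last_tick + 1, last_to_clear + 1):
            self.counts[t % self.slots] = 0
        self.last_tick = max(self.last_tick, tick)
        return self.last_tick % self.slots

    def add(self, now: float, n: int = 1) -> None:
        self.counts[self._advance(now)] += n

    def total(self, now: float) -> int:
        self._advance(now)
        return sum(self.counts)


class RateAlerter:
    """Hash table of client IP -> circular buffer, with a simple threshold check."""

    def __init__(self, threshold: int, slots: int = 8):
        self.threshold = threshold
        self.windows = defaultdict(lambda: SlidingWindowCounter(slots))

    def record(self, client_ip: str, now: Optional[float] = None) -> bool:
        """Count one request and return True when the client exceeds the threshold."""
        now = time.time() if now is None else now
        window = self.windows[client_ip]
        window.add(now)
        return window.total(now) > self.threshold
```

Each client costs a fixed number of integers, so memory grows with the number of distinct clients, not with the volume of traffic. A production version would also need to evict idle clients from the hash table, which this sketch leaves out.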
Cuckoo filter, a sophisticated hash map designed to provide fast lookups with minimal storage overhead, at the cost of a small false-positive rate.
Once the latitude and longitude of an IP address are known, you need to calculate how far that location is from the normal connection area. This is done with the haversine formula, which calculates the distance between two points on a sphere.
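As a rough illustration, the haversine distance can be computed with the Python standard library alone. The function names, the mean Earth radius of 6,371 km, and the 500 km alert radius are assumptions chosen for this sketch, not values taken from the book.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean radius; the Earth is not a perfect sphere


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in kilometers between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    # Clamp to 1.0 to guard against floating-point rounding near antipodes.
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


def is_outside_usual_area(lat: float, lon: float,
                          usual_lat: float, usual_lon: float,
                          max_km: float = 500.0) -> bool:
    """Hypothetical check: flag a connection farther than max_km from the usual location."""
    return haversine_km(lat, lon, usual_lat, usual_lon) > max_km
```

For example, `haversine_km(48.8566, 2.3522, 51.5074, -0.1278)` gives the Paris to London distance, a bit under 350 km, so a user who normally connects from Paris would not trigger a 500 km alert when connecting from London.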
We can apply this technique to monitor both members of the organization and end users of the websites and services. Internally, it’s extremely useful to monitor connections to sensitive systems, like SSH bastion hosts or AWS CloudTrail logs. When an anomaly is detected, an email should immediately be sent to the impacted user and to the security team for further investigation. It’s normal to receive a small number of these alerts from time to time, but their number should be relatively low and easy to triage.
The right moment to alert is as soon as we want a human to take action and block or review an anomaly. A lot of systems will send you alerts as early as possible by default, sometimes bragging about their ability to alert you in near real-time. That sounds like a nice idea at first, but often means that your phone will beep nonstop from morning until night to review endless streams of false positives. Those systems aren’t useful because you’ll mute them within a week. You don’t want an alert to be sent to you as early as possible in the analysis process. You want it to be sent when there’s enough confidence that it’s fraudulent. An alert should be sent when the automated system has done as much as it can to qualify the event as fraudulent and requires a human brain to continue the analysis. You may not, for example, send every change of a filesystem to operators, but you may send an alert when a client violates a limit several times in a row.
The right amount of information is a balance between providing context and not overwhelming the operator or user. Ultimately, alerts must be short, no more than a dozen lines in an email, and easy to read. The problem being identified must be clearly described at the top, and additional context provided in the body of the alert. If you can’t read it in three seconds, it’s not good enough.

When alerting end users, the notification must contain enough information to help the user make an informed decision. Figure 8.6 shows an example of a notification sent to users of the Firefox Accounts services for which suspicious activity was detected. In this example, the anomaly was detected using a geoprofiling analyzer during a wave of password reuse attacks that occurred in 2016. The notification is short and contains clear instructions for the user to follow, but it lacks context, and users were left wondering what the issue with their account was, and what it meant for their data.

Figure 8.6 Email notification sent to end users of the Firefox Accounts service following detection of fraudulent activity on their accounts. The notification is short and contains clear instructions on what the user should do but lacks context about the origin of the issue. Future iterations of the notification added context, such as the location of the connection that triggered the issue, to help users understand the notification and take it more seriously. Writing good security notifications is a process that takes time and requires working with many different groups of experts, including designers, product managers, developers, and translators (this particular notification was translated into nine languages).

Figure 9.1 The four levels of detection—audit logs, IDS, endpoint security, and operator vigilance — are set up to stop the kill chain as early as possible.

A Snort rule describes malicious activity at the network level. Listing 9.1 shows an example of a rule designed to catch the activity of the Dagger backdoor. It’s made up of four parts:

  • The first line in the rule describes the rule action (alert), which will generate an alert when the rule matches. Other actions can log the activity or drop the connection entirely.
  • The second line describes the network protocol (tcp) and the connection parameter. To match this rule, the connection must go from the home network to the external network (the internet, in most cases) and have a source port of 2589 and any destination port.
  • The third part is the options for the rule. Here, we find a msg to be added to the alert and a log triggered by the rule, and information that helps organize and classify rules (metadata, classtype, sid, and rev).
  • Finally, the fourth part of the rule contains the parameters used to find connections that match the activity of the Dagger backdoor. The flow parameter describes on which part of the connection flow the rule applies; here, between server responses and back to the client. The content parameter contains binary and ASCII strings that will be used to find fraudulent packets by looking for matches inside the packet payloads. And the depth parameter puts a limit on how far inside each packet the rule should look for a match, here limited to the first 16 bytes of each payload.

Listing 9.1 Snort rule to detect the network activity of the Dagger backdoor

alert
tcp $HOME_NET 2589 -> $EXTERNAL_NET any (
    msg:       "MALWARE-BACKDOOR - Dagger_1.4.0";
    metadata:  ruleset community;
    classtype: misc-activity;
    sid:       105;
    rev:       14;
    flow:      to_client,established;
    content:   "2|00 00 00 06 00 00 00|Drives|24 00|";
    depth:     16;
)
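
To make the content and depth parameters concrete, here is a small Python sketch of how such a match could be evaluated against a packet payload. It isn't how Snort is implemented: the function names are invented for the example, escaped characters inside the content string aren't handled, and the header checks (protocol, ports, flow) are ignored entirely.

```python
def decode_snort_content(content: str) -> bytes:
    """Convert a Snort content string to raw bytes.

    Text outside |...| is taken as ASCII; text inside |...| is hex bytes.
    Escaped characters are not supported in this sketch.
    """
    out = bytearray()
    in_hex = False
    for chunk in content.split("|"):
        if in_hex:
            out += bytes.fromhex(chunk)
        else:
            out += chunk.encode("ascii")
        in_hex = not in_hex
    return bytes(out)


def payload_matches(payload: bytes, content: str, depth: int) -> bool:
    """True when the decoded pattern appears within the first `depth` bytes."""
    return decode_snort_content(content) in payload[:depth]


DAGGER_CONTENT = "2|00 00 00 06 00 00 00|Drives|24 00|"
```

The Dagger pattern decodes to exactly 16 bytes, so with a depth of 16 it can only match at the very start of a payload. A limit like this keeps the inspection cheap and reduces the chance of matching unrelated data deeper in the packet.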

In the early 2000s, Snort rules were the standard method to protect networks from virus propagations. They’re still used a lot today, but as we’ll discuss later, they can be challenging to deploy in IaaS environments, where operators don’t control the network.

Yara is both a tool and an IOC format designed to identify and classify malware. It was created by Victor Alvarez at VirusTotal to help organize and share information between analysts. Listing 9.2 shows an example of a Yara file for a Linux rootkit. The document has three parts:

  • The meta section contains information about the IOC, such as the name of its author, a creation date, or a link to further documentation.
  • The strings section contains three strings, one hexadecimal and two ASCII, that identify the rootkit.
  • The condition section applies a filter on inspected files to find the ones that match a specific set of criteria. In this example, the condition first looks for a file header that matches the ELF format (uint32(0) == 0x464c457f), and then looks for the shared object file (uint8(16) == 0x0003) ELF type. ELF stands for Executable and Linkable Format and is the file format for executables on Unix systems. If both these conditions match, Yara will look for the strings defined in the previous section. Should all of them be present in the file, it’s a match for the rootkit.
Listing 9.2 Yara rule for a Linux rootkit

rule crime_linux_umbreon : rootkit
{
    meta:
        description = "Catches Umbreon rootkit"
        reference = ""
        author = "Fernando Merces, FTR, Trend Micro"
        date = "2016-08"
    strings:
        $ = { 75 6e 66 75 63 6b 5f 6c 69 6e 6b 6d 61 70 }
        $ = "unhide.rb" ascii fullword
        $ = "rkit" ascii fullword
    condition:
        uint32(0) == 0x464c457f // Generic ELF header
        and uint8(16) == 0x0003 // Shared object file
        and all of them
}
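
The condition section can be hard to read at first, so here is a plain Python sketch of what it checks. It's a simplified illustration rather than a replacement for Yara: the function names are invented for the example, and the fullword handling is an approximation of Yara's behavior (the match must not touch alphanumeric bytes on either side).

```python
import re
import struct

# The hexadecimal string from the strings section, as raw bytes.
HEX_MARKER = bytes.fromhex("75 6e 66 75 63 6b 5f 6c 69 6e 6b 6d 61 70")
FULLWORDS = [b"unhide.rb", b"rkit"]


def fullword_present(data: bytes, word: bytes) -> bool:
    """Approximate Yara's fullword: no alphanumeric byte right before or after."""
    pattern = rb"(?<![A-Za-z0-9])" + re.escape(word) + rb"(?![A-Za-z0-9])"
    return re.search(pattern, data) is not None


def matches_rule(data: bytes) -> bool:
    """Mirror the three clauses of the rule's condition on a file's content."""
    if len(data) < 17:
        return False
    # uint32(0): little-endian 32-bit integer at offset 0.
    # The bytes 7f 45 4c 46 ("\x7fELF") read this way give 0x464c457f.
    if struct.unpack_from("<I", data, 0)[0] != 0x464C457F:
        return False
    # uint8(16): the byte at offset 16, the low byte of the ELF type field
    # on little-endian files; 3 means shared object.
    if data[16] != 0x03:
        return False
    # "all of them": every string from the strings section must be present.
    return HEX_MARKER in data and all(fullword_present(data, w) for w in FULLWORDS)
```

Reading the header first is a cheap way to skip most files before running the more expensive string searches, which is the same ordering the rule's condition uses.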

The Yara command-line tool can scan entire systems for files that match signatures of malicious files, using the yara -r rulefile.yar /path/to/scan command. The Yara Rules project collects IOCs found by security analysts during investigations and makes them freely available to anyone. It’s a great place to start working with Yara and to scan systems for IOCs. Yara is focused on file-based IOCs. It provides a powerful and sophisticated interface to scan filesystems, but not all IOCs are files. Other IOC formats, like OpenIOC, can look for indicators that aren’t based on files.

OpenIOC is a format created by Mandiant (now FireEye) to manipulate IOCs in their endpoint security tools. Mandiant came into the spotlight when they published the infamous APT1 report in 2013, which exposed the activity of a Chinese state-sponsored military unit tasked with hacking into international corporations, mostly American and European. Several IOCs published in the OpenIOC format were provided alongside the report, allowing security teams across the world to check their own environments for potential compromise.

Unlike Yara IOCs, OpenIOC uses XML, making these documents mostly unreadable to the naked eye. Listing 9.3 shows an example of an IOC document that looks for a backdoor named Sourface that targets Windows systems. It’s only a sample of the full file. If you spend enough time staring at it, you might begin to understand the structure of this format. The first part is metadata, with unique identifiers, an author, and a date. The interesting part is under the <definition> section. The section starts with an Indicator item that declares an OR operator, meaning that any IndicatorItem that follows would indicate a match (an AND operator would require every IndicatorItem to match).

Three IndicatorItems are then defined under the Indicator section, as follows:

  • The first item, named PortItem, checks if a specific remote IP is connected to the system.
  • The second item, named FileItem, checks if a file with the MD5 checksum “8c4fa713…” is present on the disk, which effectively requires calculating the MD5 checksum of all files on disk to compare them with the malicious checksum.
  • The third item, named ProcessItem, looks for a conhost.dll library loaded inside of a running process by inspecting the memory.
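
To see why the FileItem check is expensive, and how the OR operator combines item results, consider the following Python sketch. It's an illustration only: the function names are invented, and real endpoint tools use far more optimized scanners than a plain directory walk.

```python
import hashlib
import os


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large files don't have to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_files_with_md5(root: str, wanted_md5: str) -> list:
    """Walk a directory tree and return the paths whose MD5 equals wanted_md5.

    Every regular file under root is read in full, which is why this kind
    of indicator is costly to evaluate on a whole disk.
    """
    wanted = wanted_md5.lower()
    matches = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                if os.path.islink(path) or not os.path.isfile(path):
                    continue
                if md5_of_file(path) == wanted:
                    matches.append(path)
            except OSError:
                continue  # unreadable file: skip it and keep scanning
    return matches


def indicator_matches(operator: str, item_results: list) -> bool:
    """Combine IndicatorItem results: OR needs any match, AND needs all of them."""
    if operator == "OR":
        return any(item_results)
    if operator == "AND":
        return all(item_results)
    raise ValueError("unsupported operator: " + operator)
```

With an OR operator, an investigator can stop at the first item that matches, so it makes sense to evaluate cheap checks, like open network connections, before hashing the filesystem.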

Listing 9.3 Excerpt from the OpenIOC definition of the Sourface backdoor

<?xml version='1.0' encoding='UTF-8'?>
  <short_description>SOURFACE (REPORT)</short_description>
  <description>SOURFACE is a downloader that obtains a second-stage
    backdoor from a C2 server.  Over time the downloader has evolved
    and the newer versions, usually compiled with the DLL name
    'coreshell.dll'.  These variants are distinct from the older versions
    so we refer to it as SOURFACE/CORESHELL or simply CORESHELL.
  </description>
  <definition>
    <Indicator id="e16e6299-f75b..." operator="OR">
      <IndicatorItem id="590-7df8..." condition="is">
        <Context document="PortItem"
                 search="PortItem/remoteIP" type="mir"/>
        <Content type="IP"></Content>
      </IndicatorItem>
      <IndicatorItem id="5ea9f200-01f1..." condition="is">
        <Context document="FileItem"
                 search="FileItem/Md5sum" type="mir"/>
        <Content type="md5">8c4fa713c5e2b009114adda758adc445</Content>
      </IndicatorItem>
      <IndicatorItem id="3f83ca5b-9a2c..." condition="is">
        <Context document="ProcessItem" .../>
        <Content type="string">Local Settings\Application Data\conhost.dll</Content>
      </IndicatorItem>
    </Indicator>
  </definition>

OpenIOC isn’t a pretty format, but it’s powerful. Mandiant defined hundreds of terms to look for as indicators in various parts of an operating system. Though mostly focused on Windows-based systems (the tools provided by Mandiant, such as Redline and MIR, only run on Windows), OpenIOC can be used to share indicators on other system types.

It’s quite common for digital investigators to share IOCs in this format, but Yara is gradually becoming the industry standard, probably due to the ease of writing Yara rules compared with the complexity of the OpenIOC XML format. Still, OpenIOC plays an important role in sharing indicators across security communities because of its ability to share more than just file signatures.

The next and last format we’ll discuss, STIX, is similar to OpenIOC in expressiveness, but aims to be more readable and to become the de facto standard for IOC sharing.

Structured Threat Information eXpression (STIX) is an initiative supported by the OASIS Cyber Threat Intelligence Technical Committee to standardize the analysis of threats, specification of IOCs, response to compromises, and sharing of information across organizations. Unlike the formats we previously discussed, which are focused on the specification of IOCs, STIX aims to streamline the entire process of protecting organizations against attacks.

Inside STIX are two other protocols: CybOX (Cyber Observable eXpression) is an IOC document format similar to OpenIOC, and TAXII (Trusted Automated eXchange of Indicator Information) is an HTTP-based protocol for sharing information between participants of the STIX network. The TAXII protocol is particularly interesting because it solves the problem of sharing and discovering IOCs. For many years, security operators built their own tools and made their own lists of resources to collect new IOCs and feed them into their detecting infrastructure. With TAXII, this entire process is automated around a standard that many organizations and security-product vendors support.

Anyone can connect to a TAXII exchange and retrieve IOCs in STIX format. Listings 9.4 and 9.5 demonstrate querying the TAXII exchange with a client called cabby, packaged inside a Docker container. The following listing queries the discovery service of the exchange, which returns a list of collections, each containing IOCs from a different source. The sample output shows only one collection belonging to EmergingThreats, but the full command returns a dozen.

Listing 9.4 Querying available collections from the TAXII exchange

$ docker run --rm=true eclecticiq/cabby:latest taxii-collections \
--path \
--username guest --password guest
=== Data Collection Information === 
Collection Name: guest.EmergingThreats_rules
  Collection Type: DATA_FEED
  Available: True
  Collection Description: guest.EmergingThreats_rules
  Supported Content:
=== Polling Service Instance ===
    Poll Protocol:
    Poll Address:
    Message Binding:

The discovery service returns the name of each collection, which can be fed into a polling command to download the full list of STIX IOCs contained in that collection.

The following listing shows how the cabby client is used to download those IOCs. Due to the extreme verbosity of the STIX XML document, only one truncated IOC is shown in the listing, and some extra fields have been removed.

Listing 9.5 Retrieving an IP STIX IOC from the TAXII exchange

$ docker run --rm=true eclecticiq/cabby:latest taxii-poll \
--path \
--collection guest.EmergingThreats_rules \
--username guest --password guest
<stix:STIX_Package id="edge:Package-96b-38-4d-8f-8f" version="1.1.1" 
  <stix:Observables cybox_major_version="2" cybox_minor_version="1" 
   <cybox:Observable id="opensource:Observable-6-8-4-7-16b" 
    <cybox:Object id="opensource:Address-a5-0-4-b-372">
      <cybox:Properties xsi:type="AddressObj:AddressObjectType" 
       category="ipv4-addr" is_destination="true">
        <AddressObj:Address_Value condition="Equal">

Obviously, space efficiency isn’t a goal of the STIX format (or anything based on XML): sharing a single 4-byte IPv4 address requires wrapping it in 4,000 bytes of XML soup.

That aside, STIX and TAXII are open standards implemented in a small number of open source and commercial projects, and are currently the best ways to exchange IOCs.

At the time of writing, it’s too early to say whether STIX and TAXII will become widely adopted. Version 2 of the specifications simplifies them significantly, uses a JSON format instead of XML (see the following listing), and will probably be easier to support in various security tools. Keep an eye on those projects. They’ll be useful when your organization reaches the security maturity to share threat intelligence with others.

Listing 9.6 STIX v2 IOC in JSON format for the Poison Ivy backdoor

  "type": "indicator",
  "id": "indicator--a932fcc6-e032-176c-126f-cb970a5a1ade",
  "labels": [
  "name": "File hash for Poison Ivy variant",
  "pattern": "[file:hashes:sha256 = 'ef537f25c895bfa...']", 

Until then, you should focus on increasing your investigative capabilities. Now that we’ve discussed the purpose and formats of IOCs, it’s time to learn how to scan your infrastructure for them. In the next section, we’ll start investigating systems using endpoint security.

In this section, we’ll discuss the strengths and weaknesses of three open source endpoint-security platforms: GRR, by Google; MIG, by Mozilla; and osquery, by Facebook. All three implement sophisticated techniques to scan your infrastructure for IOCs. I’ll show how to test them and how they compare to each other. You may also be interested in commercial alternatives to these tools, such as Mandiant’s MIR, Encase Enterprise, or F-Response, but we won’t discuss them here.

Comparing endpoint-security solutions

GRR, MIG, and osquery are different tools that try to solve the same type of problem: organization-wide IOC hunting. Each tool makes different choices on how to solve this problem, and it’s up to you to decide which one best fits your environment.

For example, if you care about having fast interactions with your endpoints, MIG is the fastest tool of the three. If you’re looking for in-depth analysis down to the memory of your endpoints, GRR is the way to go. If you want an intermediate tool that integrates well with your logging pipeline and has a pleasant SQL interface, give osquery a try. Table 9.1 summarizes the capabilities of each tool to help you make this decision.

Table 9.1: A comparison of the strengths and weaknesses of GRR, MIG, and osquery

It’s important to note that all three solutions require a significant investment in time and engineering to deploy and use. This isn’t the type of system you deploy once and leave alone for the next couple of years. These tools are only as useful as you make them, by investing time to use and improve them every day. I don’t recommend trying to deploy an endpoint-security solution if you’re not ready to spend a third of an engineer’s time using and improving it. It doesn’t matter which tool you go with: even commercial tools will require you to spend time fine-tuning and exploiting them to provide security value.

Had this book been written a decade earlier, we would’ve spent the majority of this intrusion-detection chapter discussing network-security monitoring (NSM) and intrusion-detection systems (IDSs). Starting around the dot-com boom of the late ’90s and continuing until the democratization of IaaS, security teams spent most of their budget and time perfecting their network-security-monitoring infrastructure. At the time, it was the most efficient way to catch fraudulent behavior. In a way, it still is, but two recent developments have changed our approach:

  • IaaS providers like AWS are protective of their network and give only very explicit access to their customers. In a traditional data center, you can easily capture and analyze all the traffic that enters and leaves the main router. In AWS, GCE, Azure, and all other IaaS providers, that’s not possible, because access to physical equipment is the privilege of the provider (and giving you that access could compromise the traffic of other customers).
  • The proportion of network traffic that uses Transport Layer Security (TLS) is quickly growing, limiting the ability of network-security-monitoring tools to inspect the content of connections. Now that TLS certificates are pretty much free and easy to obtain, malware authors don’t hesitate to use them to protect the confidentiality of their fraudulent connections.

Network security monitoring may be harder to achieve and more limited in an IaaS environment, but it can still be useful. AWS, GCE, and Azure allow operators to route their outbound traffic through specific network-address translation (NAT) instances. We can use this feature to inspect the traffic that leaves the infrastructure.

To understand how this works in AWS, we need to first talk about traffic routing. In the invoicer infrastructure you built in part 1, the traffic to and from the invoicer application goes through a load balancer, as shown in figure 9.7. This route is entirely operated by AWS and you have no visibility into the network traffic until it arrives in the application.

The outbound route, however, is the one you can control. This route is used when a program located inside the infrastructure establishes a connection to the internet. In figure 9.7, this is illustrated by the virus connecting back to the attacker and being routed through the IDS. Analyzing outbound traffic won’t protect the infrastructure against a break-in, but it will help catch backdoors that retrieve tools from the internet or establish C2 channels to receive commands from their operators.

Figure 9.7 In AWS, IDS can be placed on the outbound route to catch malware establishing outbound connections.

NSM systems like Snort, Suricata, or Bro are popular choices to monitor network traffic for fraudulent activity. They typically operate in one of two modes:

  • Detection mode, by capturing a copy of the traffic, inspecting it, and generating alerts. This is what people mean when talking about IDS systems.
  • Protection mode, by positioning themselves in the middle of the traffic and blocking suspicious connections. This mode is typically called IPS.

Bro is a bit of a different beast: it’s designed to provide powerful network-analysis capabilities and doesn’t put much focus on signature-based detection the way Snort and Suricata do. We talked about the Snort signature format in the first section of this chapter, which both Snort and Suricata can make use of. Various security vendors sell their own rule-sets, which you can subscribe to and feed into your IDS system (Proofpoint Emerging Threats, Snort Talos, and others). You can also get started with a community version of the Snort Talos rules.

In the rest of this section, we’ll discuss how to set up Suricata to inspect outbound traffic on an AWS NAT instance. The AWS setup itself will be omitted, because it’s extensively documented in Amazon’s own documentation. Instead, we’ll focus on configuring the IDS to analyze traffic using Snort community rules refreshed daily, and to publish alerts into the logging pipeline, where they can be routed to operators.

Rules can be downloaded from locations that change regularly, so attempting to list URLs here wouldn’t be helpful. The Snort and Suricata documentation contains pointers that will help you find the best rule-sets. Another great tool is Oinkmaster, a companion tool for Snort and Suricata designed to regularly download various rule-sets. Its default configuration comes with sample locations that will help you get started.


  • The kill chain of an intrusion contains seven phases. They’re reconnaissance, weaponization, delivery, exploitation, installation, command and control, and actions on objective.
  • Indicators of compromise (IOCs) are pieces of information that characterize an intrusion and can be used to detect compromises across the infrastructure.
  • GRR, MIG, and osquery are endpoint-security solutions that allow investigators to inspect the systems of their infrastructure in real time.
  • Analyzing network traffic with an IDS like Suricata and commercial rule-sets will catch common attack patterns and help protect the network.
  • System-call auditing is a powerful Linux mechanism to watch for suspicious commands on critical systems, but it can become noisy.
  • People are great at finding anomalies and are often the best intrusion-detection mechanism an organization has.

The best way to avoid this catastrophic situation is to prepare your organization with an incident-response plan. The Incident Handler’s Handbook published by the SANS (sysadmin, audit, network, and security) Institute is a good place to start. It breaks down incident response into the following six phases:

  • Preparation — The first phase of incident response is to prepare yourself for the day all hell breaks loose. If you’ve never had an incident in your organization, the best way to prepare for it is to run through a fictional incident. Make it fun by gathering key people in a meeting room for four hours and running through a predefined scenario. Bonus points if you can find a Dungeons & Dragons expert to act as the Dungeon Master. The exercise will highlight the areas where you need to improve (tooling, communication, documentation, key people to involve, and others).
  • Identification — Not all alerts are security incidents. In fact, you should be careful about properly qualifying a security incident and how you go from an alert to triggering the incident-response process. This is the identification phase, where you qualify, in SANS terms, “whether a deviation from normal operations within an organization is an incident.”
  • Containment — You got breached, now what? The next phase of incident response is to contain the bleeding and prevent the attacker from progressing within your infrastructure. That means cutting access where needed, freezing or sometimes shutting down systems, and any other action that blocks the attack until you can fix the breach.
  • Eradication — When the breach is contained, you need to eradicate the threat and rebuild all compromised systems to fix the root cause and prevent further compromises. This is the phase that usually consumes the most resources. Having good DevOps practices helps a lot, by making the reconstruction of the infrastructure faster than if it was manual.
  • Recovery — Attackers often return after a successful breach, and it’s critical to continue monitoring the infrastructure closely in the aftermath of an incident. In the recovery phase, you gradually rebuild trust in the security of an infrastructure that was seriously weakened.
  • Lessons learned — Security incidents can be traumatic, but are also a great learning experience to mature the security of an organization. When the dust has settled, the team that dealt with the incident must sit down and go over their notes to identify areas that need improvement. You don’t become an incident-response expert overnight, and learning from the lessons of an incident is the best way to make everyone more responsive and better organized in the future.
Sam’s colleague Max is tasked with freezing and locking down the compromised hosts. Remembering a presentation on AWS forensics he saw at a local conference several months ago, he downloads the ThreatResponse tools to take images of EC2 instances. ThreatResponse is a collection of tools that facilitate the capture of digital forensic artifacts in AWS. The aws_ir command (listing 10.5) can, in one go, snapshot the disk of an EC2 instance, dump its live memory, and upload it, along with other instance metadata, to an S3 bucket. He isn’t sure what they will do with all that data yet, but it seems wise to capture it all. It isn’t that hard anyway: all Max has to do is list the IPs of the instances to capture, and let the script run for a while.

Listing 10.5 aws_ir captures forensic artifacts from EC2 instances

$ pip install aws_ir 
$ aws_ir instance-compromise \ 
  --instance-ip \ 
  --user ec2-user \ 
  --ssh-key ~/.ssh/private-key.pem 
aws_ir.cli - INFO - Initialization successful proceeding to incident plan.
- INFO - Initial connection to AmazonWebServices made.
- INFO - Inventory AWS Regions Complete 14 found.
- INFO - Inventory Availability Zones Complete 36 found.
- INFO - Beginning inventory of resources
- INFO - Attempting run margarita shotgun for ec2-user on with /home/max/.ssh/private-key.pem
margaritashotgun.repository - INFO - downloading as
margaritashotgun.memory [INFO] capture 90% complete
margaritashotgun.memory [INFO] capture complete: s3://cloud-response-38c5c23e79e24bc8a5d5d79103b312ff/
- INFO - memory capture completed for: ['']
Processing complete for cr-17-050411-bae0
Artifacts stored in s3://cloud-response-d9f1539a6a594531ab057f302321676f

In the background, the tool invokes the AWS API to save a snapshot of the disk volume attached to the instance, and then connects to it via SSH to install a kernel module used to capture live memory. The kernel module, called LiME, is a popular tool in digital forensics, often used by specialized teams in coordination with memory-analysis frameworks such as Volatility.

Figure 11.1 Risk levels from different organizations aren’t directly comparable, because each organization has a different risk tolerance.

It’s all about information!

Engineers who first encounter information-security models often wonder if everything should be treated as information. For example, stealing CPU power from a server to mine bitcoins may not imply stealing information from that server. Neither does launching a DoS attack against a given service. From a technical point of view, the security impact can first appear to be unrelated to information.

This is a common mistake that should be corrected early: all components of an infrastructure are designed to manage information. A DoS cuts access to the information stored on that service. A server stolen to mine bitcoin drains computing power needed to process legitimate information, and the guarantee that the information processed on that server hasn’t been tampered with vanishes.

When assessing the risks to an organization, we must focus on the information the organization handles — whether it’s provided by customers, generated internally, public, confidential, critical to be retrievable at all times, and so on. It’s the security of the information that drives everything else; the technical components are just tools involved in processing it.

When you evaluate a given system, the information it handles may not be recognizable right away but should become apparent if you dig hard enough. An SSH bastion may not be storing information, but its availability is critical to the operation of a highly sensitive database that can never be tampered with. The information in the database is what matters, and the security of the SSH bastion must be sufficient to protect the information.

The four-levels rule

We’ll always use four classification levels throughout this chapter, regardless of the type of measurement we’re making. Why four levels? It’s not arbitrary. Four seems to provide enough granularity to represent most risks while being small enough to remember the meaning of each level.

Most importantly, there’s a brain trick involved in using an even number of levels: it forces people to choose between them, because there’s no middle. Research performed at the University of Chester confirmed that, when presented with an odd number of choices, people tend to always pick the one in the middle.1 This may be fine when handling negotiations (put your preferred choice in the middle of two others that are less desirable), but it skews risk assessment negatively.

When it comes to measuring risk, you want people to make conscious decisions, not pick the easy way out. Using four levels forces people to decide between levels two and three and think through the implications of each level. This little trick can greatly increase the quality of your risk assessment.

In data systems, integrity is defined similarly. It represents the need for data to remain accurate and unaltered, throughout its entire life. Like confidentiality, integrity requirements vary for data. The corruption of an email-marketing database may have a much lower impact than the corruption of a company’s accounting database.

Here again, we can define levels, primarily by the impact on the organization. Integrity is a binary concept (data either has or lacks integrity), and it doesn’t make a lot of sense to differentiate between losing a little bit of integrity versus losing a lot of it, at least not without knowing the impact.

When the impact is established and the needs have been defined, technical measures can be taken to ensure that integrity is always present. How many controls are added to ensure the integrity of a piece of data depends on how critical that data is to the organization. For example:

  • A list of sales leads may have low integrity, because if modified, it wouldn’t significantly hurt the organization. As such, the organization may allow the marketing department to store the list in spreadsheets on their laptops, without further controls, which has the benefit of being simple and saving infrastructure resources.
  • Customers’ fitness data collected and stored by a startup may have medium integrity, because modifying it would annoy customers, but may not hurt the survival of the company. Data would be stored in a database with regular backups.
  • The communication channel between a load balancer and an application may require high integrity, because tampering with the messages forwarded by the load balancer could allow an attacker to replace legitimate requests with fraudulent ones. We’d use Transport Layer Security (TLS) to protect the integrity of that connection.
  • The source code of a financial trading application may require maximum integrity, because an attacker able to modify it could place fraudulent orders worth billions of dollars. As such, any change may require cryptographic sign-off by two senior developers, and signatures may be verified before deployment.

To be successful, a risk-management program must start from the top of the organization and identify threats that can take down the entire business. In figure 11.1, the IT division alone was unable to identify risks relevant to the entire corporation because its visibility was limited; it ranked a $5 million loss as critical, when the corporation considered it only medium.

Ranking risks is extremely difficult to do from the bottom up. Assessors who start with a limited view of the organization and work their way up have to constantly readjust their assessment levels as they learn more about the organization’s ability to survive.

A better approach is to start from the top, by asking upper management what they’re concerned about. Is a competitor threatening to take over the market? Could a bad article in the press tarnish a product’s reputation? Maybe a natural disaster could prevent the company from operating for weeks. It’s only by talking with the top strategists of an organization that an analyst can identify the top threats.

“I’m collecting a list of risks for the company and progressively ranking them. From your point of view, what’s the biggest risk this organization faces?”

“Well, that’s easy enough: we’re three months away from Christmas, the biggest shopping time of the year. I’m worried the redesign of the online store won’t be finished in time. We’re betting big on the newer version to drive up sales, and the investors are anxious to see our revenue increase before next year,” replies the CEO.

“What do you think could prevent this project from completing in time? Is it a lack of human resources, technological issues, etcetera?”

“You’d have to talk to the CTO for technological details, but I know we’ve been having difficulties hiring qualified engineers. Our current platform is also unstable and tends to crash under heavy traffic, when it doesn’t simply drop orders, which erodes customers’ trust. Those are my biggest concerns.”

We can infer a lot from this short conversation. From a business perspective, the CEO needs high availability and integrity on the online store, and they plan to achieve this through a redesign project. The identified risks are the following:

  • Productivity — Productivity is suffering from a lack of qualified resources.
  • Financial (at two levels) — Investors are expecting an increase in revenue to continue supporting the company; and customer orders are sometimes dropped. Assuming only a few orders are dropped on occasion, the investment risk is obviously the most impactful one.
  • Reputation — The platform is unstable, which could progressively drive customers to competitors.

Quantifying the impact of risks

In the previous example, we assumed that the financial risk of losing investors would be higher than the financial risk of dropping some customer orders. It feels like the right call, but we don’t have the data to back this up. You’ll sometimes find yourself making “gut feeling” decisions in risk management, but they should be the exception, not the rule. This is usually a symptom of not having enough data available to base your decisions on. When information is scarce, risk decisions are primarily qualitative, and when more information is available, risk decisions become quantitative. You should always try to acquire enough data in assessments to be as quantitative as possible, increasing the quality of your analysis.

Continuing with the model defined previously, we can identify three areas that need to be quantified: finances, reputation, and productivity. Depending on your organization and how granular you want to be in your assessment, more areas of impact may be considered. For example, the FAIR (factor analysis of information risk) risk-assessment method defines six areas instead of three. For the purpose of this chapter, we’ll keep things simple, and you can add complexity later on.


Financial impact is the easiest type to quantify. Go to the Chief Financial Officer and ask them how big a loss would put the company’s survival at risk, and you’ll likely get a straight answer. That’s your critical risk. A financial impact scale may look like the following:

  • LOW impact for anything below $100,000. Risks in this category are an inconvenience, but the organization can easily recover from them.
  • MEDIUM impact for losses up to $1,000,000. At this level of risk, middle management must get sign-off before engaging company resources.
  • HIGH impact for losses up to $10,000,000. This type of risk must be clearly understood by upper management.
  • MAXIMUM impact for losses higher than $10,000,000, which is a third of the company’s yearly revenue. Should a risk of this magnitude be realized, the survival of the company is at stake. The leadership team must not only know about those risks, but also closely monitor them on a weekly basis.
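The scale above can be sketched as a small lookup. This is only an illustration under stated assumptions: the function name `financial_impact` is hypothetical, the thresholds are the example values from the list above (every organization would set its own), and the handling of exact boundary values, such as a loss of exactly $100,000, is an assumption because the text gives only approximate bands.

```python
def financial_impact(loss_usd: float) -> str:
    """Map an estimated loss in dollars to one of the four example levels.

    Thresholds come from the example scale in the text; boundary handling
    (strictly below vs. up to and including) is an assumption.
    """
    if loss_usd < 100_000:
        return "LOW"
    if loss_usd <= 1_000_000:
        return "MEDIUM"
    if loss_usd <= 10_000_000:
        return "HIGH"
    return "MAXIMUM"


# Hypothetical losses, one per level.
for loss in (25_000, 750_000, 5_000_000, 40_000_000):
    print(f"${loss:,} -> {financial_impact(loss)}")
```

Keeping the thresholds in one place makes it easy to revisit them with the CFO when revenue changes.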


Reputation plays an important role in a lot of business relationships, and its decline may have a negative impact on the organization. The problem with reputation is how hard it is to quantify. Politicians use polls to measure their reputation against a target population, but this is hardly something small- to medium-size businesses can do regularly. An alternative approach is to rank reputation risk by the press coverage a given incident would receive. It’s not 100% accurate, but it helps drive the conversation about impact:

  • LOW impact means it’s unlikely the event would hurt the organization’s reputation.
  • MEDIUM impact would represent customers complaining about their negative experience on social media. The audience is small, and in most cases the matter can be resolved by customer service.
  • HIGH impact means the event is getting picked up by specialized press, and a small audience of customers is likely to notice it. The reputation of the organization is affected but can be recovered.
  • MAXIMUM impact represents risks that will be picked up by national press (newspaper, television, and others) and severely deteriorate the organization’s reputation. The company’s survival is at risk, and recovering customer trust would require a large effort.

Productivity

All organizations depend on their ability to produce goods or services to function. Assigning a value to risks that harm productivity is an important part of the risk-assessment process. We can quantify these by using two variables: the length of time during which productivity is impaired, and how much of the organization is impacted. Let’s first split the organization into small and large groups. Any team that represents less than 10% of the workforce is considered a small group, and anything bigger is a large group. Based on this, the productivity-impact levels are the following:

  • LOW impact would block a small group for up to a day and a large group for a few minutes.
  • MEDIUM impact would block a small group for a few days and a large group for several hours.
  • HIGH impact would block a small group for weeks and a large group for a few days. The impact on the organization would be large, projects would be delayed, and customers wouldn’t receive their products or services, but the organization could recover.
  • MAXIMUM impact would block a small group for months and a large group for weeks. At this point, the organization’s ability to produce is severely impaired, its survival is at risk, and recovery involves major effort.

It’s also possible to use a productivity-impact level to derive a financial loss, for example, by calculating the workforce cost. If 30% of the organization is unable to work for an entire week, and the average daily salary is $500, then, depending on headcount, a HIGH productivity impact may induce a MEDIUM financial impact.
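The workforce-cost calculation can be worked through with concrete numbers. The text gives the blocked share (30%), the duration (one week), and the daily salary ($500), but not the size of the organization, so the sketch below assumes a headcount of 1,000 and a five-day working week. Both are illustrative assumptions, as is the function name `workforce_cost`.

```python
def workforce_cost(headcount: int, blocked_fraction: float,
                   days_blocked: int, daily_salary_usd: float) -> float:
    """Salary cost of the employees who cannot work during the outage."""
    return headcount * blocked_fraction * days_blocked * daily_salary_usd


# Assumed values (not from the text): 1,000 employees, 5 working days.
cost = workforce_cost(headcount=1_000, blocked_fraction=0.30,
                      days_blocked=5, daily_salary_usd=500)
print(f"Estimated workforce cost: ${cost:,.0f}")
```

With these assumptions the loss is $750,000, which sits in the MEDIUM band of the example financial scale (between $100,000 and $1,000,000). A much larger headcount would push the same outage into a higher financial level.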

We now have three types of risk (confidentiality, integrity, and availability) and three areas of impact (finance, reputation, and productivity). We’re creating the outline of a framework to classify and rank risks. For a lot of organizations, measuring impacts on finance, reputation, and productivity isn’t sufficient, and more fine-grained models exist to go deeper in evaluating threats and impacts. In the next section, we’ll discuss identifying threats and measuring vulnerability in an organization.

Identifying threats and measuring vulnerability

Risk quantification is often defined as the product of threat times vulnerability times impact: R = T × V × I.

We’ve discussed quantifying impact, but not threats and vulnerability. In this section, we’ll discuss a threat-modeling model called STRIDE and a vulnerability-assessment tool called DREAD. Used together, these two models allow assessors to identify threats and measure vulnerability to better classify risks.

The STRIDE threat-modeling framework

Threat modeling is the process of identifying vectors of attack that can harm the CIA of information. The term threat modeling sounds impressive, but it’s a straightforward exercise: look at a given system and think of ways an attacker could mess with it. For example, in relation to the invoicer service in part 1, an example of a threat would be an attacker breaching the service’s access controls and retrieving invoices from all users. The confidentiality breach would likely have a high impact on the organization’s reputation.

Threat modeling requires covering the entire scope of attacks a system is exposed to. Being exhaustive is difficult, particularly when systems are large and complex, so methodologies exist to guide the exercise. STRIDE (spoofing, tampering, repudiation, information disclosure, denial of service, elevation of privilege) is one of those methodologies, developed by Microsoft to guide its own risk-assessment efforts. The acronym stands for the types of threats an analyst should cover, which Microsoft’s documentation describes as follows:

  • Identity spoofing — An example of identity spoofing is illegally accessing and then using another user’s authentication information, such as username and password.
  • Data tampering — Data tampering involves the malicious modification of data. Examples include unauthorized changes made to persistent data, such as that held in a database, and the alteration of data as it flows between two computers over an open network, such as the internet.
  • Repudiation — Repudiation threats are associated with users who deny performing an action without other parties having any way to prove otherwise—for example, a user performs an illegal operation in a system that lacks the ability to trace the prohibited operations. Nonrepudiation refers to the ability of a system to counter repudiation threats. For example, a user who purchases an item might have to sign for the item upon receipt. The vendor can then use the signed receipt as evidence that the user did receive the package.
  • Information disclosure — Information-disclosure threats involve exposing information to individuals who aren’t supposed to have access to it—for example, the ability of users to read a file that they weren’t granted access to, or the ability of an intruder to read data in transit between two computers.
  • Denial of service — DoS attacks deny service to valid users, for example, by making a web server temporarily unavailable or unusable. You must protect against certain types of DoS threats to improve system availability and reliability.
  • Privilege elevation — In this type of threat, an unprivileged user gains privileged access and thereby has sufficient access to compromise or destroy the entire system. Elevation-of-privilege threats include situations in which an attacker has effectively penetrated all system defenses and become part of the trusted system itself, a dangerous situation indeed.

Using STRIDE when evaluating the many ways a system could be attacked allows assessors to be as exhaustive as possible. Let’s run through an example, still focusing on the invoicer service, to see how STRIDE guides the analysis. As a reminder, the invoicer service is a simple web application with a database that allows users to post and retrieve medical invoices. Users connect to it with their web browsers from their personal computer, and the service is hosted on AWS. Let’s assume you haven’t yet implemented any security controls on it (no authentication, transport layer security, and so on). With this context, we can identify the following threats:

  • Identity spoofing — A malicious user could steal the identity of a legitimate user and upload fraudulent invoices on their behalf.
  • Data tampering — An attacker could compromise the database, via a SQL injection or otherwise, to remove or modify stored invoices.
  • Repudiation — A malicious user could delete their customer’s paid-invoice data from the system and deny that payment had been made.
  • Information disclosure — An attacker could leak all invoices in the database and cause great harm to the privacy of legitimate users.
  • Denial of service — An attacker could upload a large volume of invoices, overload the application, and cause a crash that would prevent legitimate users from accessing the service.
  • Privilege elevation — An attacker could breach the application servers and gain access to other critical services hosted in the infrastructure.

This still isn’t an exhaustive list of threats the invoicer service is exposed to, but you can see how the STRIDE threat model drives the analysis. Without a model to follow, it’s likely we would’ve omitted at least one or two vectors of attacks.
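One way to picture how STRIDE catches omissions is to treat the six categories as a checklist and flag any category with no recorded threat. The sketch below is an illustration only: the names `Stride`, `uncovered_categories`, and `draft_threats` are hypothetical, and the draft deliberately leaves out two categories to show the check at work.

```python
from enum import Enum


class Stride(Enum):
    SPOOFING = "Identity spoofing"
    TAMPERING = "Data tampering"
    REPUDIATION = "Repudiation"
    INFORMATION_DISCLOSURE = "Information disclosure"
    DENIAL_OF_SERVICE = "Denial of service"
    ELEVATION_OF_PRIVILEGE = "Privilege elevation"


def uncovered_categories(threats: dict[Stride, list[str]]) -> list[Stride]:
    """Return the STRIDE categories for which no threat has been recorded."""
    return [category for category in Stride if not threats.get(category)]


# A hypothetical first draft of the invoicer analysis with two gaps.
draft_threats = {
    Stride.SPOOFING: ["Upload fraudulent invoices as a legitimate user"],
    Stride.TAMPERING: ["Modify stored invoices via SQL injection"],
    Stride.INFORMATION_DISCLOSURE: ["Leak all invoices in the database"],
    Stride.DENIAL_OF_SERVICE: ["Overload the application with uploads"],
}

missing = uncovered_categories(draft_threats)
print([category.value for category in missing])
```

Here the check reports that repudiation and privilege elevation still need to be considered before the analysis can be called complete.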

STRIDE helps drive the identification of threats, but doesn’t cover the vulnerability of the organization to those threats. This is the purpose of the DREAD model, which we’ll discuss next.

The DREAD threat-modeling framework

We now have a model to identify threats to system information, and a model to quantify the impact of these threats on the organization, but how realistic are those threats?

The DREAD model helps quantify the vulnerability of an organization to a given threat. It’s another model built by Microsoft, designed to work together with STRIDE, that ranks five areas on a scale from 1–10 to evaluate the amount of risk presented by a given threat. Here’s how it works:

  • Damage potential — How great is the damage if the vulnerability is exploited?
  • Reproducibility — How easy is it to reproduce the attack?
  • Exploitability — How easy is it to launch an attack?
  • Affected users — As a rough percentage, how many users are affected?
  • Discoverability — How easy is it to find the vulnerability?

There’s some overlap between the measurements made by DREAD and the impact levels established previously, which makes using them as an exact formula difficult (see the sidebar “Scientific rigor and risk management”). The model may not always work at a mathematical level, but it’s a good way to drive vulnerability discussions during a risk assessment. For example, here’s how we’d use it on the data-tampering threat identified previously:

  • Damage potential — The attack can modify all unpaid invoices in the database and severely impair the organization’s cash flow. The damages would probably be high.
  • Reproducibility — The attack requires breaking through the application’s defenses, and there are no known attack vectors today, so reproducing it is unlikely.
  • Exploitability — The invoicer service is hosted on the public internet and accessible to everyone, so exploitability is high.
  • Affected users — All users with unpaid invoices would potentially be impacted.
  • Discoverability — The source code of the invoicer is public, so an attacker could audit it and find a hole. Best practices were used when developing the invoicer, so it’s unlikely such an issue exists; discoverability is low.

Then the scores are averaged to get the final score. If we were to give the preceding DREAD assessment the score DP = 8; R = 2; E = 10; A = 10; D = 4, then the final DREAD score for this threat would be (8 + 2 + 10 + 10 + 4) / 5 = 6.8 ≈ 7. According to our assessment, the vulnerability of the data-tampering threat is 7, or high.
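The averaging step is simple enough to check in a few lines. This is a minimal sketch rather than an official DREAD tool: the function name `dread_score` and the range check are assumptions, while the five scores are the example values given above.

```python
def dread_score(damage: int, reproducibility: int, exploitability: int,
                affected_users: int, discoverability: int) -> float:
    """Average the five DREAD ratings, each expected on a 1-10 scale."""
    scores = [damage, reproducibility, exploitability,
              affected_users, discoverability]
    for score in scores:
        if not 1 <= score <= 10:
            raise ValueError(f"DREAD ratings must be between 1 and 10, got {score}")
    return sum(scores) / len(scores)


# Example scores for the data-tampering threat: DP=8, R=2, E=10, A=10, D=4.
score = dread_score(8, 2, 10, 10, 4)
print(score, round(score))  # 6.8 7
```

Because the result is an average, one very low rating (here, reproducibility) can be masked by several high ones, which is one reason to treat the number as a conversation aid rather than an exact measurement.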

Classic risk assessments have a lot of value, but for day-to-day purposes, a lightweight approach is needed. The rapid risk-assessment (RRA) framework is a lightweight risk-assessment framework designed to take between thirty minutes and one hour to run on a project. We developed it at Mozilla to bring this high-level risk-discovery approach to all new projects and decide when to engage in more-detailed security work, such as in-depth security reviews, which take weeks to complete.

I’ve been in assessment meetings where the engineering team decided to completely redesign their project based on their new understanding of the risks. Some other projects passed with flying colors, having already thought of and made plans to mitigate all the risks identified during the assessments. Every once in a while, I see a project that ends up simpler at the end of the RRA than it was at the beginning, because the assessment showed a lot of the technical complexity wasn’t needed. Your mileage may vary, but it’s unlikely that an RRA will produce uninteresting results. If it does, the consequences will be minimal because the entire exercise only takes one hour, not three weeks.

Having covered in detail how risks should be discovered and ranked, we’ll spend the last section of this chapter discussing the lifecycle of risks in an organization.

Risk management is the set of coordinated activities that direct and control an organization with regard to risk.

  • The CIA triad (confidentiality, integrity, and availability) is a common model to categorize the security requirements of information.
  • Establishing the degree of confidentiality of information means defining exactly who should have access to it at a given time.
  • Integrity represents the need for data to remain accurate and unaltered, throughout its entire life.
  • Availability is the measure of how reachable a given piece of information is over a long period of time.
  • The impact of a given risk can be evaluated at the financial, reputational, and productivity levels.
  • STRIDE and DREAD provide models to evaluate and rank the threats an organization is exposed to.
  • The RRA framework is a lightweight process that helps security teams identify risks early in the development process of applications and services.
  • The RRA has four components: information gathering, data dictionary, risk identification, and security recommendations.
  • Recording and tracking risks is how an organization remains aware of its security posture over time.

I introduced web-application scanning with OWASP ZAP in chapter 3, when you used it to perform automated baseline scans in the CI/CD pipeline. ZAP is one of dozens of automated tools that focus on scanning web applications for vulnerabilities. Burp Suite, Arachni, SQLMap, and Nikto also fall into this category. A complete list would be difficult to compile and keep up to date, but you can check out the list managed by OWASP.

UTF-8 characters in the MySQL database of the application

MySQL supports only a subset of the Unicode character space in its default configuration, utf8. You need to enable the utf8mb4 character set on a MySQL server to properly handle the entire set of Unicode characters, encoded on four bytes. Persona’s database used the flawed utf8 of the time, and an interesting issue arose: when supplied with an email address that contained a Unicode character beyond the covered set, the database would truncate the string on the unknown character value. That vulnerability allowed an attacker to supply an email address like this: targetuser@\U0001f4a9\[email protected], where [email protected] is the email address of the victim, and is a domain controlled by the attacker. The Unicode character in the middle, \U0001f4a9, commonly known as “pile of poo,” is truncated by MySQL.
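The truncation behavior described above can be simulated without a database. The sketch below does not talk to MySQL; it only mimics the described effect of the legacy three-byte utf8 character set, which cuts a string at the first character that needs four bytes in UTF-8. The function name and both addresses (`victim@example.com`, `attacker.example`) are made-up placeholders, not the ones from the original report.

```python
def truncate_at_first_4byte_char(value: str) -> str:
    """Mimic the described truncation: drop everything from the first
    character that needs four bytes in UTF-8 (outside the three-byte range)."""
    for index, char in enumerate(value):
        if len(char.encode("utf-8")) == 4:
            return value[:index]
    return value


# Hypothetical address: the part after the four-byte character is a
# domain the attacker controls, so validation sees the attacker's domain.
submitted = "victim@example.com\U0001f4a9.attacker.example"
stored = truncate_at_first_4byte_char(submitted)

print(repr(submitted))
print(repr(stored))  # 'victim@example.com'
```

The mismatch between what the application validated (the full string) and what the database stored (the truncated one) is the core of the issue; using utf8mb4, or rejecting input that doesn't round-trip through the database unchanged, removes that gap.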

American fuzzy lop (AFL) and Radamsa are examples of file-based fuzzers that generate mutations to stress the input of an application. Radamsa is a black-box fuzzer, and AFL is a white-box fuzzer. AFL uses a technique called instrumentation to learn about the internals of a program and test its security more effectively. Instrumenting an application requires compiling it in a specific way, which is why AFL is called a white-box fuzzer. Burp Intruder (part of the Burp Suite) and ZAP both provide network-based fuzzers that can target the input of web applications. These tools take a template of the traffic the application accepts, typically by spidering it, and then mutate inputs using random generators or grammars.

Most modern languages have highly configurable and high-performing static code-analysis tools. JavaScript has ESLint, Python has Bandit, Java and C/C++ have dozens of them, and Go is progressively getting there with gas. Many of these tools can be used quickly by reusing rules created by communities of developers, inheriting best practices from other organizations.

Security Monkey is one of these tools, specifically designed to keep Netflix’s infrastructure safe, initially in AWS and later extended to GCP (Google Cloud Platform). It operates similarly to Trusted Advisor and Scout2, by retrieving configurations from the infrastructure and comparing them against a set of predefined compliance tests. Tests run automatically inside the platform and send alerts when violations are encountered. The platform also provides a web interface to control the tests and view results.

If you joined an organization and were asked to build a security program from scratch, where would you start? You can refer to your original continuous security model, repeated from chapter 1 in figure 13.1, to answer this question. Assuming it would take three years to implement the entire program, you should do the following:

  • Year 1: focus on securing the DevOps pipeline and implementing test-driven security.
  • Year 2: ramp up on fraud detection and incident response.
  • Year 3: integrate risk management and external security testing.

I once had a discussion with a fellow security engineer from another organization on the value of web-application firewalls (WAFs). His argument was that WAFs allowed his team to protect against vulnerabilities developers would inevitably leave in the websites of the organization. His team had invested a lot of time and energy into the WAFs, and they were a core part of their security infrastructure, sitting in front of every website, inspecting every request and response, blocking attacks.

I asked him if that security-engineering time wouldn’t be better invested in writing patches to the websites themselves, so that the WAFs would no longer be needed. “Impossible,” he replied, “The developers have no care for security and no interest in fixing these bugs. That’s why we have the WAFs in the first place!”

You may think this is an extreme example of a disconnect between security and engineering, but this type of negative interaction is much more common than we’d like to admit. It’s a perfect example of teams that distrust each other and don’t work together. The end result is added layers of complexity (the WAFs) when issues should be fixed directly in the applications. The business suffers, because the added complexity increases maintenance cost and delays the shipping of products. More importantly, everyone in the organization is frustrated, which inevitably leads to bad code and poor security.

NOTE Web-application firewalls have their place in a security infrastructure, particularly when protecting products that can’t be fixed easily, but they should be the last-resort solution to security problems, not the default.

I’m a big proponent of building security tools, but there is still a good case to be made for buying them from vendors every once in a while. Fraud detection is one of those areas where the competition is fierce, and lots of vendors have excellent products that, although expensive, will save you time and energy in implementing your logging pipeline.

When deciding on building versus buying, consider the following:

  • When do you need the security pipeline to be operational? No one can build a reliable infrastructure that works at high scale in less than six months, and it often takes more than a year. If you need something ready tomorrow, buy a service from a vendor that will host your logs and run the infrastructure for you.
  • How much visibility do you have into the future? Building is expensive at first, but the cost diminishes over a few years. Buying is typically going to cost you a flat fee every year. If you have five years of visibility, then building may end up costing you less in the long run.
  • Do you have the skills to build your own platform? You may have the skills to write a few scripts or simple programs, but processing millions of logs at high speed takes a whole different level of programming knowledge. Vendors may be able to provide that for you, for a fee.

Building versus buying is often a difficult decision to make. Buying always appears more expensive at first, because licensing and hardware costs are raw numbers. Building may seem more appealing, but you have to consider how much time your team will need to implement and run the full platform. Then multiply that by three, because we’re all terrible at making estimates, and you’ll have an idea of how much it will cost you to do it yourself.