Technology & Policy

On the recent side-channel attacks on Intel SGX – Technology & Policy

2017-03-01T12:00:00Z

Side-channel attacks are my favorite attack in computer security because they poke giant holes in the abstraction and security models that system designers are using. When I hear of a new attack avenue in this space, my first reaction often is “wow, that is so cool” and “I didn’t even think about that.”

In the past week, two papers have been published on arXiv detailing side-channel attacks on Intel SGX. While the existence of such attacks should be taken seriously by people designing systems using Intel SGX (which includes yours truly), these particular attacks are not very interesting.

First of all, these attacks really shouldn’t come as a surprise to anyone following this space. Cache-based side-channel attacks are a well-known attack vector. Many papers detailing new techniques have been published over the last couple of years. There’s Evict+Time and Prime+Probe, Flush+Reload, Flush+Flush, etc. Ge et al. present a good overview of the current state of the art. Intel explicitly states in their Enclave Writer’s Guide that they don’t protect against attacks at cache line or higher granularity.

Second of all, both attacks are exploiting well-known flaws in modular exponentiation implementations. We know how to do constant-time RSA. On top of that, both papers are at best slightly misleading in describing their attack targets.

From the first paper:

As our victim enclave we chose an RSA implementation from the Intel IPP crypto library in the Intel SGX SDK. The attacked decryption variant is a fixed-size sliding window exponentiation, the code is available online at [32]. The Intel IPP library includes also a variant of RSA that is hardened against cache attacks [33].

If you look at how these two variants are used, you can see that only computations with the public exponent are done with the “vulnerable” variant, whereas computations with the private exponent use the “hardened” variant. So, unless you are somehow swapping your public and private exponents, using this crypto library as documented will prevent this attack for you.

From the second paper’s abstract:

We perform a Prime+Probe cache side-channel attack on a co-located SGX enclave running an up-to-date RSA implementation that uses a constant-time multiplication primitive.

Even though the library they use might have a multiplication primitive that is constant-time, as the authors explain further on in the paper, the modular exponentiation primitive is not. In fact, the modular exponentiation algorithm is the textbook example of an algorithm with a secret-dependent branch:

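#include <stdint.h>

// Left-to-right square-and-multiply. Toy sizes only: modulus must be
// below 2^32 so that the 64-bit products cannot overflow.
uint64_t modexp(uint64_t base, uint64_t exponent, uint64_t modulus) {
    uint64_t result = 1 % modulus;
    base %= modulus;
    for (int i = 63; i >= 0; i--) {              // most significant bit first
        result = (result * result) % modulus;    // always square
        if (exponent & ((uint64_t)1 << i)) {     // access bit `i` (secret-dependent branch)
            result = (result * base) % modulus;  // multiply only when the bit is set
        }
    }
    return result;
}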

If for each iteration of the loop you can detect whether the multiplication happened or not, you can reconstruct the individual bits of the exponent. This is a fun attack, but as mentioned, this is basically the most well-known timing attack, first described by Kocher 22 years ago. Oh and the library? It implements the blinding mitigation also mentioned in that seminal paper.
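One common way to remove the secret-dependent branch is to perform the multiplication in every iteration and then pick the result with a bit mask. The following is only a minimal sketch of that "square-and-multiply-always" idea for toy operand sizes. The names `modexp_always_multiply` and `ct_select` are made up for illustration, the modulus is assumed to be below 2^32, and the `%` operator is not guaranteed to run in constant time on real hardware, so this is not hardened code and not what any particular library does.

```c
#include <stdint.h>

/* Return a when mask is all ones, b when mask is zero, without branching. */
static uint64_t ct_select(uint64_t mask, uint64_t a, uint64_t b) {
    return (a & mask) | (b & ~mask);
}

/* Square-and-multiply-always: the multiplication happens for every exponent
 * bit, so its presence no longer reveals the bit. Assumes modulus < 2^32 so
 * the 64-bit products cannot overflow (an assumption of this sketch). */
uint64_t modexp_always_multiply(uint64_t base, uint64_t exponent, uint64_t modulus) {
    uint64_t result = 1 % modulus;
    base %= modulus;
    for (int i = 63; i >= 0; i--) {
        result = (result * result) % modulus;
        uint64_t product = (result * base) % modulus;         /* always computed */
        uint64_t mask = (uint64_t)0 - ((exponent >> i) & 1);  /* all ones if bit i is set */
        result = ct_select(mask, product, result);
    }
    return result;
}
```

Calling `modexp_always_multiply(4, 13, 497)` computes 4^13 mod 497. Production libraries typically go further, for example with fixed-window exponentiation and table accesses that do not depend on the secret.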

To summarize: Yes, programs running with Intel SGX are vulnerable to side-channel attacks. The same side-channel attacks that have been used for years on modern x86 platforms. This is well-documented. SGX does present a slightly different threat model which makes deployment of side-channel attacks more likely. Hopefully everyone using SGX is implementing countermeasures. They do exist, and are already implemented by most cryptography libraries.

New OPT STEM extension rules – Technology & Policy

2016-03-09T12:00:00Z

Today, the U.S. Department of Homeland Security announced that the new rules regarding the “Optional Practical Training STEM extension” for international students in the U.S. on F-1 visas will be published in the Federal Register on Friday. The new rules are a result of a lawsuit against DHS that would result in an invalidation of the old rules. The new rules are “better” in some respects and “worse” in others. In this blog post I will review some of the most important changes and how they will affect international students.

Update 2016-03-09: A previous version of this post suggested that one could get two consecutive STEM extensions. Careful reading of the rule counters this, as does an explicit comment in the supplementary information: “DHS clarifies that the final rule, as with the proposed rule, does not allow students to obtain back-to-back STEM OPT extensions.”

Reasons for the STEM extension

The DHS is very clear in the rules and the accompanying comments that the goal of OPT is practical training for students, and not to bridge a gap in the U.S. labor force. They see practical training as a vital part of a good education, and providing OPT is a key mechanism in staying competitive in the international education market. Therefore, they basically disregarded any comments that talk about a shortage or overage in U.S. STEM workers. There is however a provision that employers can’t use a student on the STEM extension to replace a U.S. worker, and also that students need to be paid the same as other workers in similar positions. These requirements seem to be designed to assuage fears of “cheap foreign labor taking our jobs.”

Because the justification is practical training, you might ask why there is this specific extension for STEM workers, and not just a general length increase for all OPT. DHS clarifies this as follows:

[…] because of the specific nature of [STEM students’] studies and fields and the increasing need for enhancement of STEM skill application outside of the classroom. DHS also found, as noted previously, that unlike post-degree training in many non-STEM fields, training in STEM fields often involves multi-year research projects as well as multi-year grants from institutions such as the NSF.

Many STEM OPT practical training opportunities are research related, as indicated by the fact that the employer that retains the most STEM OPT students is the University of California system and that two other universities are among the top six of such employers (Johns Hopkins University and Harvard University).

Basics

First up: some straightforward changes. The length of the STEM extension is now 2 full years instead of the 17 months under the old rule. The “cap-gap extension”—where OPT is automatically extended if the student has an approved petition for an H-1B visa—is maintained. Students are now allowed to be unemployed for a maximum total of 150 days during their initial OPT and their STEM extension.

Training Plan for STEM OPT Students

One of the big changes to the STEM extension is that students and employers now need to come up with a training plan in order to get the extension. Students must fill out a new Form I-983, “Training Plan for STEM OPT Students,” together with their employer and file that with their STEM extension request.

The new article 8 CFR 214.2(f)(10)(ii)(C)(7) says:

The training plan described in the Form I-983 […] form must identify goals for the STEM practical training opportunity, including specific knowledge, skills, or techniques that will be imparted to the student, and explain how those goals will be achieved through the work-based learning opportunity with the employer; describe a performance evaluation process; and describe methods of oversight and supervision.

The form is not available at the time of writing, but there is what seems to be a draft of the instructions accompanying the form. The relevant parts of the instructions basically echo the rule above.

At this time it’s not very clear how detailed the plan needs to be. Can applicants just write down some business lingo (e.g. “improving core competencies”) or do the goals need to be more substantial? Under the current rules, it seems that the “Designated School Official” (i.e. someone who works at your school’s international office) needs to gauge whether the proposed plan meets the regulatory requirements.

Once the extension is approved, students will need to evaluate themselves annually, according to the plan. Their employers will need to sign their evaluation as well.

There are just too many unknowns at this point to really know how this will affect the types of jobs international students will be able to get and how this will affect the amount of time they have to spend on things besides their normal job responsibilities.

Employer-employee relationship

While the actual rules text doesn’t seem to touch on this, the comments and clarifications accompanying the rules suggest that it might be harder to work in “unusual” work arrangements, such as start-up companies:

There are several aspects of the STEM OPT extension that do not make it apt for certain types of arrangements, including multiple employer arrangements, sole proprietorships, employment through “temp” agencies, employment through consulting firm arrangements that provide labor for hire, and other relationships that do not constitute a bona fide employer-employee relationship.

One of these aspects seems to be that someone else at the same company you work for needs to sign your Form I-983:

[…] students cannot qualify for STEM OPT extensions unless they will be bona fide employees of the employer signing the Training Plan, and the employer that signs the Training Plan must be the same entity that employs the student and provides the practical training experience.

But:

STEM OPT extensions may be employed by new “start-up” businesses so long as all regulatory requirements are met, including that the employer adheres to the training plan requirements, remains in good standing with E-Verify, will provide compensation to the STEM OPT student commensurate to that provided to similarly situated U.S. workers, and has the resources to comply with the proposed training plan.

and

[…] any ownership interest in the employer entity (such as stock options), [must be] commensurate with the compensation provided to other similarly situated U.S. workers.

So while the rules seem to prevent running a sole proprietorship under the STEM extension, it seems entirely feasible that another employee at your startup can fulfill all the supervisory training requirements. If you’re starting out on your own, you could use your first year of regular OPT to get your company off the ground, and hopefully by the time you need to file for the STEM extension you have a co-founder or such who can provide the necessary “training.”

Summary

I’m pretty positive in general about the new rules, and am of course glad that DHS acted swiftly to make new rules after the court decision in August. I’m somewhat concerned about the rules around the training plan, but we’ll see how that’s going to work out in practice. Everything else seems to be an improvement (big or small) upon the previous rules.

Intel has full control over SGX – Technology & Policy

2015-10-13T12:00:00Z

Intel has full control over what software you can run in SGX. This might seem redundant: Intel makes the processor, so of course they have full control. Yet the truth is slightly more inconvenient. When Intel processors don’t run the instructions in your standard software (whether incorrectly or at all), that is a defect at best and a breach of contract at worst. Yet the SGX instruction set includes in its specification that Intel has the authority to make this go/no-go decision.

Let’s take a closer look at how exactly this is specified, since it is pretty well-hidden. After creating and measuring a secure enclave using ECREATE, EADD, and EEXTEND, the EINIT instruction needs to be executed before execution control can be transferred to the enclave. The EINIT instruction has 2 inputs: SIGSTRUCT and EINITTOKEN. SIGSTRUCT contains information about the enclave including an expected hash of the memory. As the name implies, SIGSTRUCT is also cryptographically signed using some key. EINITTOKEN also contains information about the enclave including the same expected hash of the memory as well as the expected public key for the signature. EINITTOKEN must be MACed using the so-called launch key. Both SIGSTRUCT and EINITTOKEN are checked by EINIT and must be valid for execution to proceed successfully.

Since the launch key is a symmetric cryptography device, surely this key is not widely distributed and most likely is CPU-specific. But how can one obtain this key? The EGETKEY instruction can be used to obtain SGX keys, including the launch key. But this is a user-mode instruction that can only be executed from inside an enclave. There seems to be a chicken-and-egg problem here: to launch an enclave, we need the launch key. To get the launch key, we need to launch an enclave! Here’s the catch: the EINITTOKEN need not be valid if SIGSTRUCT is signed by an Intel key that is baked into the processor.

Thus, Intel can distribute an Intel-signed “launch enclave” that is able to hand out correctly-MACed EINITTOKENs that can then be used to start other enclaves. But they can include whatever logic they want in the launch enclave so Intel can at its sole discretion choose not to MAC a particular EINITTOKEN.
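The go/no-go decision described above can be modelled in a few lines. This is a heavily simplified sketch based only on the description in this post: the struct layouts, field names and the function `einit_allows_launch` are hypothetical, the signature and MAC checks are reduced to precomputed booleans, and the real EINIT performs many more checks.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical, simplified stand-ins for the real structures. */
struct sigstruct_model {
    unsigned char enclave_hash[32];  /* expected hash of the enclave memory */
    unsigned char signer_hash[32];   /* hash of the key that signed SIGSTRUCT */
    bool signature_ok;               /* outcome of verifying that signature */
};

struct einittoken_model {
    bool valid;                      /* whether a token was supplied at all */
    unsigned char enclave_hash[32];  /* same expected hash as in SIGSTRUCT */
    unsigned char signer_hash[32];   /* expected signer of SIGSTRUCT */
    bool mac_ok;                     /* outcome of checking the MAC with the launch key */
};

bool einit_allows_launch(const struct sigstruct_model *sig,
                         const struct einittoken_model *tok,
                         const unsigned char measured_hash[32],
                         const unsigned char intel_signer_hash[32])
{
    if (!sig->signature_ok)
        return false;
    if (memcmp(sig->enclave_hash, measured_hash, 32) != 0)
        return false;

    /* The catch: enclaves signed by the baked-in Intel key need no token. */
    if (memcmp(sig->signer_hash, intel_signer_hash, 32) == 0)
        return true;

    /* Everything else needs a token MACed with the launch key. */
    if (!tok->valid || !tok->mac_ok)
        return false;
    if (memcmp(tok->enclave_hash, sig->enclave_hash, 32) != 0)
        return false;
    return memcmp(tok->signer_hash, sig->signer_hash, 32) == 0;
}
```

In this model, a third-party enclave can only start if something holding the launch key was willing to set `mac_ok`, which is exactly the point where the launch enclave’s policy comes in.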

As with most things SGX, this “feature” is severely underdocumented. The terms “launch key” and “launch enclave” are only mentioned a few times in the SGX programming reference and never in the whitepapers or tutorials. At the time of writing, nowhere else on the Internet is there any mention of these keywords, except for one insightful Quora answer that I wish I had read months ago.

What reason could Intel have for this architecture? Along with the fact that SGX is being disabled by default, this looks like Intel is again just setting this security technology up for failure due to the lack of widespread adoption by developers and users alike (cf. TXT, SMX, TPM).

SGX Hardware: A first look – Technology & Policy

2015-10-08T12:00:00Z

Without much fanfare, Intel has released Software Guard Extensions (SGX) in Skylake. When I say “without much fanfare,” I mean practically only the following paragraph hidden on page 3 of a press fact sheet:

BETTER SECURITY. The Skylake architecture has been designed to enable better security, including Intel® Software Guard Extensions (Intel® SGX) that can provide an additional level of hardware-based protection by putting data into a secure container on the platform, and Intel® Memory Protection Extensions (Intel® MPX) that can help prevent buffer flow attacks. [What’s a buffer flow attack? Ed.] To be fully utilized, Intel SGX and Intel MPX require additional software capabilities, which will begin to be delivered by the ecosystem later this year.

It has been extremely difficult to find actual hardware that supports SGX. BIOS support is required–the BIOS needs to set aside memory for the Enclave Page Cache (EPC)–but of course no vendor will mention anything about this on their website, nor will they (be able to) answer when you inquire regarding this specific issue.

To my delight, by using Google to search past week results for “intel sgx” for the last few months, I was finally able to find a driver download site that linked this Dell driver. According to Dell’s website, this driver is compatible with the following machine models:

Inspiron 11 i3153
Inspiron 11 i3158
Inspiron 13 i7353
Inspiron 13 i7359
Inspiron 15 i7568

At first I couldn’t find these models mentioned anywhere, but a few days later the i7359 showed up at NewEgg and then at Frys. So, I drove to Sunnyvale (where Frys had the i7359-2435SLV in stock) and I can now confirm that SGX is real.

It’s interesting to note that SGX was disabled in the BIOS by default, so most consumers will not be able to benefit from this feature at all.

The maximum size of the EPC on this laptop is 128MB. This means that enclaves requiring more memory than that will need regular paging between the EPC and main memory. It’s not clear whether such a copy would require re-encryption of the page; the EPC itself is already encrypted, so it might not be necessary.

The laptop comes with Windows 10, which runs excruciatingly slowly, which is not surprising considering it only has 4GB of RAM. I installed Arch Linux on it because it’s one of the few distros that has an installer with a very recent kernel (4.2.2), required for such new hardware.

I collected some CPUID and MSR information to see what SGX features are supported:

CPUID

                         bit 2 (SGX) is set
                         v
 7h(0h) 00000000h 029c67afh 00000000h 00000000h

     Max. enclave size 2^31 bytes (32-bit mode)
  Max. enclave size 2^36 bytes (64-bit mode) |
    No extended SSA features supported    |  |
SGX version 1 supported  |                |\/|
               v         v                vvvv
12h(0h) 00000001h 00000000h 00000000h 0000241fh

              all enclave attributes supported
              |\       all XSAVE bits supported
              vv                  vv
12h(1h) 00000036h 00000000h 0000001fh 00000000h

   EPC physical address 80200000h
        ////|     |\\\\\\\  EPC size 93.5 MiB
        vvvvv     vvvvvvvv  vvvvv     vvvvvvvv
12h(2h) 80200001h 00000000h 05d80001h 00000000h

MSR

                bit 18 (SGX_ENABLE) is set
                |   bit 0 (LOCK) is set
                v   v
3ah 00000000_00040005h (IA32_FEATURE_CONTROL)
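The annotations in the dumps above can be reproduced with a little bit-twiddling. The sketch below only decodes register values that have already been read; it does not execute CPUID or read the MSR itself. The struct and function names are made up for illustration, and the field layout follows my reading of the SGX programming reference.

```c
#include <stdbool.h>
#include <stdint.h>

/* Register values returned by one CPUID invocation. */
struct cpuid_regs {
    uint32_t eax, ebx, ecx, edx;
};

/* CPUID leaf 7h, sub-leaf 0: EBX bit 2 advertises SGX. */
bool sgx_supported(struct cpuid_regs leaf7) {
    return (leaf7.ebx >> 2) & 1;
}

/* CPUID leaf 12h, sub-leaf 0: EDX holds log2 of the maximum enclave sizes. */
unsigned max_enclave_bits_32(struct cpuid_regs leaf12_0) {
    return leaf12_0.edx & 0xff;
}

unsigned max_enclave_bits_64(struct cpuid_regs leaf12_0) {
    return (leaf12_0.edx >> 8) & 0xff;
}

/* CPUID leaf 12h, sub-leaf 2: EAX/EBX describe the EPC base and ECX/EDX the
 * EPC size. The low 12 bits of EAX and ECX are flags, not address bits. */
uint64_t epc_base(struct cpuid_regs leaf12_2) {
    return ((uint64_t)(leaf12_2.ebx & 0xfffffu) << 32) | (leaf12_2.eax & 0xfffff000u);
}

uint64_t epc_size(struct cpuid_regs leaf12_2) {
    return ((uint64_t)(leaf12_2.edx & 0xfffffu) << 32) | (leaf12_2.ecx & 0xfffff000u);
}

/* IA32_FEATURE_CONTROL (MSR 3Ah): bit 0 is LOCK, bit 18 is SGX_ENABLE. */
bool sgx_enabled_and_locked(uint64_t feature_control) {
    return (feature_control & 1) && ((feature_control >> 18) & 1);
}
```

Feeding in the values from this laptop, `epc_size` on `{0x80200001, 0, 0x05d80001, 0}` yields 0x05d80000 bytes, which is the 93.5 MiB shown in the dump.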

I’m currently writing a simple Linux kernel driver to be able to actually use SGX. I managed to generate a Page Fault using the ENCLS[EBLOCK] instruction, so at least something seems to be working.

I really wish Intel would be more forthcoming with information about and developer support for SGX. The hardly-announced release and default-disabled BIOS setting don’t warrant much hope for the future of SGX.

In the meantime, I intend to write more blog posts in the near future as I try to get SGX up and running. Here’s a cliffhanger for you: the Dell driver package mentioned earlier contains a file aesm_service.exe that contains the string “SGX EPID provisioning network failure.” I’ll try to tell you more about it next time.

Update 2015-12-09: Please see my sgx-utils repository for any open-source SGX utilities, including a bare-bones development Linux driver.

On OpenSSH and Logjam – Technology & Policy

2015-05-20T12:00:00Z

Recent work showing the feasibility of calculating discrete logarithms on large integers has put the Diffie-Hellman key exchange parameters we use every day in the spotlight. I have looked at what this means for SSH key exchange. In short, on your SSH server, do the following:

awk '{ if ($5 <= 2000) printf "#"; print }' /etc/ssh/moduli > /tmp/large_moduli
mv /tmp/large_moduli /etc/ssh/moduli

And put the following in your sshd_config:

KexAlgorithms curve25519-sha256@libssh.org,ecdh-sha2-nistp256,ecdh-sha2-nistp384,ecdh-sha2-nistp521,diffie-hellman-group14-sha1,diffie-hellman-group-exchange-sha1,diffie-hellman-group-exchange-sha256

Note that curve25519-sha256@libssh.org is only supported in OpenSSH 6.5 and up, and only works reliably in OpenSSH 6.7 and up. On your SSH client, put the following in your ssh_config:

KexAlgorithms curve25519-sha256@libssh.org,ecdh-sha2-nistp256,ecdh-sha2-nistp384,ecdh-sha2-nistp521,diffie-hellman-group14-sha1

If with this configuration you are unable to connect to some SSH servers, and you need to add diffie-hellman-group-exchange-sha1 or diffie-hellman-group-exchange-sha256 to the supported list of algorithms, you should recompile your SSH client with a DH_GRP_MIN of 2048, so that a server can’t force your client to use a weak group.

Technical details

Now follows a detailed explanation of these recommendations. The following key exchange mechanisms are supported in the current version (6.8) of OpenSSH:

curve25519-sha256@libssh.org
ecdh-sha2-nistp256
ecdh-sha2-nistp384
ecdh-sha2-nistp521
diffie-hellman-group1-sha1
diffie-hellman-group14-sha1
diffie-hellman-group-exchange-sha1
diffie-hellman-group-exchange-sha256

The first four mechanisms, curve25519-sha256@libssh.org, ecdh-sha2-nistp256, ecdh-sha2-nistp384, ecdh-sha2-nistp521, do not use prime-field Diffie-Hellman and are not affected. Previous work shows that these mechanisms are much faster when used at the same security level, so you should use them!

The diffie-hellman-group1-sha1 mechanism uses the fixed 1024-bit Oakley Group 2 (not the 768-bit group 1, as the name of the mechanism might suggest). This group is within the range of being a viable target for nation-state attackers, and should not be used.

The diffie-hellman-group14-sha1 mechanism uses the fixed 2048-bit Oakley Group 14, which should be secure enough for now.

The diffie-hellman-group-exchange-sha1 and diffie-hellman-group-exchange-sha256 mechanisms let the client and server negotiate a custom DH group. The client sends a tuple «min, n, max» to the server, indicating the client’s minimum, preferred and maximum group size. According to the RFC,

Servers and clients SHOULD support groups with a modulus length of k bits, where 1024 <= k <= 8192. The recommended values for min and max are 1024 and 8192, respectively.

The OpenSSH server selects a suitable group from a pre-generated set of groups, installed system-wide in /etc/ssh/moduli (falling back to /etc/ssh/primes), using the choose_dh function in dh.c. In case no suitable group is found, the code defaults to Oakley Group 14, which is safe. A pre-generated set is distributed with the OpenSSH source and many binary distributions and is infrequently changed. The group sizes distributed with OpenSSH are 1024, 1536, 2048, 3072, 4096, 6144, and 8192 bits, with about 30 groups per size. The OpenSSH-distributed 1024-bit groups are well-known and within the range of being a viable target for nation-state attackers, and as such should not be used.

It is possible to generate your own set of groups, in which case it would be safer to use a 1024-bit group, but you might as well go for larger groups. The ssh-keygen man page mentions that “It is important that … both ends of a connection share common moduli.” That statement should not be interpreted as “both server and client need to have the same moduli configured”, as the server sends the chosen modulus to the client. As a case in point, the OpenSSH client does not access the system-wide moduli file at all during connection setup.

Speaking of the client, it usually offers the RFC-specified minimum of 1024 bits. There is nothing preventing a server from using that value and offering a well-known (and thus weak) group. So, a standard client shouldn’t use the custom group key exchange mechanisms, unless there is a way to change the minimum group size.
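To make the negotiation concrete, here is a small model of both sides. It is a sketch under stated assumptions, not a copy of OpenSSH’s choose_dh: the selection policy (smallest available size that is at least the preferred size, otherwise the largest size in range, otherwise the 2048-bit fallback) is a simplification, and the names `pick_group_bits`, `client_accepts` and `FALLBACK_GROUP_BITS` are invented for this example.

```c
#include <stdbool.h>
#include <stddef.h>

/* Fallback when no configured group fits (Oakley Group 14 is 2048 bits). */
#define FALLBACK_GROUP_BITS 2048

/* Simplified server side: choose a group size for the client's request
 * (min, want, max) from the sizes available in the moduli file. */
int pick_group_bits(const int *available, size_t count, int min, int want, int max) {
    int best = 0;
    for (size_t i = 0; i < count; i++) {
        int bits = available[i];
        if (bits < min || bits > max)
            continue;                      /* outside the client's range */
        if (best == 0) {
            best = bits;
        } else if (best < want) {
            if (bits > best)
                best = bits;               /* still too small: move up */
        } else if (bits >= want && bits < best) {
            best = bits;                   /* tighter fit above the preference */
        }
    }
    return best != 0 ? best : FALLBACK_GROUP_BITS;
}

/* Client side: the only protection against a weak group is this range check,
 * which is why the client's minimum matters. */
bool client_accepts(int offered_bits, int client_min, int client_max) {
    return offered_bits >= client_min && offered_bits <= client_max;
}
```

With the stock sizes and a request of min 1024, preferred 1024, max 8192, this model picks 1024 bits and a client with a 1024-bit minimum accepts it. Removing the small moduli on the server or raising the client minimum to 2048 each closes that path.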

Lenovo ThinkPad HDD Password – Technology & Policy

2015-03-09T00:00:01Z

Modern SSDs (at least the ones made by Intel and Samsung) always encrypt all stored data using AES. The encryption key used is stored in nonvolatile memory on the SSD. One of the reasons for this is that securely wiping the drive then only requires overwriting the encryption key with a new random one. This way, you don’t need to erase every flash block, which would be bad for the drive’s durability. The encryption key can optionally be encrypted using a 32-byte “security password”, the configuration of which is overloaded on the ATA security feature set. If you trust the hardware manufacturer to actually implement this securely, this would seem to provide a very solid and fast option for encrypted persistent storage.

To be able to boot off of such an encrypted drive, it needs to be unlocked before the OS’s bootloader can be read, which requires BIOS support. Luckily, my Lenovo ThinkPad T420s does support this: you can configure a drive password in the BIOS setup screen and from then on the BIOS will ask for a password upon startup. Now here’s the catch: it turns out that when you take this drive and put it in a different machine, it is impossible to unlock the drive. This would mean that if my laptop dies but the drive were still intact I would be unable to access the data on the drive, even though I know the password!

A couple of weeks ago I finally decided to get to the bottom of this by reverse engineering the Lenovo UEFI BIOS on my laptop. The goal was simple: to find the code path from password input to ATA security unlock output and reproduce it. I have detailed the reverse engineering process in another blog post. Here’s the algorithm:

$\textit{AtaPassword} \gets \textrm{SHA}_{256}\left( \textrm{SHA}_{256}(\textit{Password}) \parallel \textit{AtaIdentity}_\textit{SerialNumber} \parallel \textit{AtaIdentity}_\textit{ModelNumber} \right)$

The inputs are $\textit{Password}$ which is the user-supplied password and $\textit{AtaIdentity}$ which is the ATA Identify Device data structure. The output $\textit{AtaPassword}$ gets sent to the drive. Why do they use this algorithm? It’s actually somewhat clever: the S/N and M/N act as a salt, such that a hash sniffed off of the ATA bus will only be able to unlock that one drive, and not any other drives that use the same password.

That’s the good part. The bad part is that the algorithm above is not quite complete. Here is the actual algorithm:

$\textit{PasswordHash} \gets \textrm{SHA}_{256}\left( \left( \textrm{ToScanCodes}(\textrm{LowerCase}(\textit{Password})) \parallel ␀^{64} \right)_{1:64} \right)_{1:12}$

$\textit{SN} \gets \textit{AtaIdentity}_\textit{SerialNumber} \qquad \textit{MN} \gets \textit{AtaIdentity}_\textit{ModelNumber}$

$\textit{AtaPassword} \gets \textrm{SHA}_{256}\left( \textit{PasswordHash} \parallel \textrm{SwapBytes}(\textit{SN}) \parallel \textrm{SwapBytes}(\textit{MN}) \right)$

The function $\textrm{ToScanCodes}$ translates the characters 1234567890qwertyuiopasdfghjkl;zxcvbnm␣ to integers in the ranges 2–11, 16–25, 30–39, 44–50, 57–57, respectively, while dropping other characters. $\textrm{SHA}_{256}$ is the well-known hash function. $\textrm{SwapBytes}$ is the POSIX swab function, which swaps odd and even bytes.
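The two helper functions are easy to write down in portable C. The sketch below covers only the pre-processing steps, not the SHA-256 calls. The names `to_scan_codes` and `swap_bytes` are mine, and two details are assumptions rather than verified firmware behaviour: dropped characters are skipped without leaving a gap, and input is cut off once 64 scan codes have been produced.

```c
#include <stddef.h>
#include <string.h>

/* Map one lower-case character to its scan code, or 0 if it is dropped. */
static unsigned char scan_code(char c) {
    static const char row1[] = "1234567890";  /* scan codes 2..11  */
    static const char row2[] = "qwertyuiop";  /* scan codes 16..25 */
    static const char row3[] = "asdfghjkl;";  /* scan codes 30..39 */
    static const char row4[] = "zxcvbnm";     /* scan codes 44..50 */
    const char *p;
    if (c == '\0') return 0;
    if (c == ' ') return 57;
    if ((p = strchr(row1, c)) != NULL) return (unsigned char)(2 + (p - row1));
    if ((p = strchr(row2, c)) != NULL) return (unsigned char)(16 + (p - row2));
    if ((p = strchr(row3, c)) != NULL) return (unsigned char)(30 + (p - row3));
    if ((p = strchr(row4, c)) != NULL) return (unsigned char)(44 + (p - row4));
    return 0;
}

/* Build the zero-padded 64-byte buffer that gets hashed:
 * (ToScanCodes(LowerCase(Password)) || zeros), truncated to 64 bytes. */
void to_scan_codes(const char *password, unsigned char out[64]) {
    size_t n = 0;
    memset(out, 0, 64);
    for (; *password != '\0' && n < 64; password++) {
        char c = *password;
        if (c >= 'A' && c <= 'Z')
            c = (char)(c - 'A' + 'a');        /* LowerCase */
        unsigned char code = scan_code(c);
        if (code != 0)
            out[n++] = code;                  /* other characters are dropped */
    }
}

/* Swap each pair of adjacent bytes, like POSIX swab(); len is assumed even. */
void swap_bytes(const char *in, char *out, size_t len) {
    for (size_t i = 0; i + 1 < len; i += 2) {
        out[i] = in[i + 1];
        out[i + 1] = in[i];
    }
}
```

The output of `to_scan_codes` is what goes into the inner hash, and `swap_bytes` is applied to the serial and model number strings before they are appended for the outer hash.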

There are a couple of peculiarities in the algorithm that reduce the security. First of all, I’m not sure why the characters get converted into scancodes. The UEFI BIOS is well-equipped to deal with keyboard layouts, so that just seems unnecessary. It also reduces the entropy to only about 5.2 bits per character (the base-2 logarithm of the 38 accepted characters), making short passwords very insecure. What’s worse, though, is that only 12 bytes of the password hash are used, putting an upper bound of 96 bits on the entropy. If your password is sampled uniformly at random from the available scancodes, don’t bother making it longer than 18 characters.

The other weird thing is the $\textrm{SwapBytes}$ function. This means that if your model number is Samsung␣SSD␣840␣EVO␣500GB␣…, that part of the input to the hash function will be aSsmnu␣gSS␣D48␣0VE␣O05G0␣B…. Why is that? Between the ATA Identify Device data structure being defined in terms of 16-bit words and the UEFI specification using 16-bit wide characters, while the model and serial number are encoded as 8-bit ASCII, I can only assume that someone messed up some endianness conversion somewhere.

Today, I am releasing a tool to unlock your drive. If despite all the above (96 bits is more entropy than most passwords have) you still decide to use the Lenovo BIOS to do your password management, you can use this to unlock your drive in the event of hardware failure. You will need hdparm to talk to your drive. If the password hash contains a ␀ character, you’ll need to patch hdparm to be able to use that. I tested this on my own setup, but you may want to verify it actually works before you start depending on it.

Reverse Engineering UEFI Firmware – Technology & Policy

2015-03-09T00:00:00Z

In order to figure out how my BIOS drive password worked, I had to reverse-engineer the firmware that comes with my laptop. You can find the binary blobs on the update CD that Lenovo provides, and it turns out these blobs are actually UEFI images. UEFI firmware is made up of many different loadable modules (drivers, shared libraries, etc.), which are stored in the Portable Executable (PE) image format. These modules can be extracted from the image using Nikolaj Schlej’s excellent UEFIExtract (from UEFITool). Once you have all the PE modules, the real reversing can begin.

It helps to understand how UEFI works. The Internet contains a wealth of information, and here are two articles to get you started: Getting started with UEFI development and UEFI Programming - First Steps. The main problem that makes reverse engineering hard is that while the firmware consists of over 300 loadable modules, there is no dynamic linker. Instead, the entry point of a module gets passed a pointer to a “protocol” registry. A protocol is basically an interface, or in other words a struct of function pointers. The registry is keyed by globally unique identifiers (GUIDs). To call into another module, you need to look up a GUID in the registry and then call some function returned in the interface.
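As a mental model, the registry can be pictured as a list of (GUID, interface pointer) pairs. The following is a standalone sketch of that idea and deliberately does not use the real UEFI headers or boot services. All names here (`struct guid`, `install_protocol`, `locate_protocol`, `MAX_PROTOCOLS`) are hypothetical.

```c
#include <stddef.h>
#include <string.h>

/* A GUID is just 16 bytes. */
struct guid {
    unsigned char bytes[16];
};

/* One registry slot: a GUID and the interface registered under it. */
struct protocol_entry {
    struct guid guid;
    void *interface;   /* points to a struct of function pointers */
};

#define MAX_PROTOCOLS 64

struct protocol_registry {
    struct protocol_entry entries[MAX_PROTOCOLS];
    size_t count;
};

/* Called by a module that wants to offer an interface to others.
 * Returns 0 on success, -1 if the registry is full. */
int install_protocol(struct protocol_registry *reg, const struct guid *guid, void *interface) {
    if (reg->count == MAX_PROTOCOLS)
        return -1;
    reg->entries[reg->count].guid = *guid;
    reg->entries[reg->count].interface = interface;
    reg->count++;
    return 0;
}

/* Called by a module that wants to use another module's interface.
 * Returns NULL when nothing is registered under that GUID. */
void *locate_protocol(const struct protocol_registry *reg, const struct guid *guid) {
    for (size_t i = 0; i < reg->count; i++) {
        if (memcmp(reg->entries[i].guid.bytes, guid->bytes, 16) == 0)
            return reg->entries[i].interface;
    }
    return NULL;
}
```

Because the only link between caller and callee is a 16-byte value compared at run time, nothing in a module’s binary names its dependencies, which is what makes static analysis of the call graph so awkward.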

My first strategy to get some insight into the firmware was to collect GUIDs from images and build a dependency graph. This turned out to be useless. The UEFI image contains PEI dependency sections for each image, but the GUIDs that are listed seem to have no relation to actually required protocols. Furthermore, identifying GUIDs (also known as 16 random bytes) in binaries is hard, and even when I managed to identify a section that seemed to store GUIDs, there would be many GUIDs in such a section that were never referenced from code in that image.

To figure out the dependencies, I decided to actually run the modules and see which protocols they look up and which ones they register. Wait what, run UEFI PE modules? Yes, I wrote a tool called efiperun that can load PE modules into memory and simulate enough of what a UEFI environment is supposed to look like to actually run them. Most modules will upon entry look up some standard protocols, do some initialization, and register one or more protocols that other modules can use.

With this information in hand, you can do more targeted reversing, trying to identify interfaces and function signatures. For example, LenovoTranslateService.efi installs a protocol e3abb023-b8b1-4696-98e1-8eedc3d3c63d. This protocol turns out to have the following interface:

// Interface installed by LenovoTranslateService.efi
struct interface_e3abb023_b8b1_4696_98e1_8eedc3d3c63d
{
    void(EFIAPI *translate)(void* _this, const char* input, char* output, size_t length);
};

With efiperun you can actually write code that calls into loaded EFI modules, which makes it easy to test installed interfaces. Utilizing this functionality, I was able to determine that the translate function above actually translates an ASCII string to keyboard scan codes.

When doing reverse engineering, you always end up exploring branches that turn out to be less fruitful. But the knowledge obtained exploring such a branch can be useful in exploring other ideas. Now that I’ve set the stage with the tools I’m going to use, I will describe the path that led to the discovery of the algorithm. Keep in mind that this is a reconstruction and the order in which I actually figured parts out is different.

Graphical entry point

The Lenovo firmware does not make heavy use of graphical elements, but the Hard Drive Password prompt actually does display a small pictogram. Now, judging by the filenames, there are only a few modules that deal with graphics:

SystemGraphicsConsoleDxe.efi
SystemHiiImageDisplayDxe.efi
SystemImageDecoderDxe.efi
SystemImageDisplayDxe.efi

Each of these modules installs a single protocol that doesn’t use a well-known GUID, so let’s see what modules call them. As it turns out, only SystemSplashDxe.efi calls SystemHiiImageDisplayDxe.efi (96ce4c12-55e4-4a1c-bbf3-73a5055fb364) and only LenovoPromptService.efi calls SystemImageDisplayDxe.efi (71583a77-2789-4213-a83b-eef42afe85e0). SystemSplashDxe.efi pretty much seems to be as advertised and even contains a GIF file with the ThinkPad splash image. Upon further inspection, LenovoPromptService.efi contains 21 BMP files, all related to displaying password prompts. Bingo!

Password control program

The Prompt service installs a single protocol 56350810-2cb2-4aa0-96d2-66d1b8e1aac2 which is only called by LenovoPasswordCp.efi. This module contains key code connecting various password-related modules, and I’ll assume Cp means “control program”. Besides the prompt service (for text input), it also calls into LenovoSoundService.efi (e01fc710-ba41-493b-a919-53583368f6d9, for beeping noises when you press an invalid key), LenovoTranslateService.efi (described above) and LenovoCryptService.efi (73e47354-b0c5-4e00-a714-9d0d5a4fdbfd, supposedly a crypto module; see next section).

The password control program has an interesting function at offset 0x8cc that calls only SetMem, CopyMem and the Crypto and Translate services. Here’s roughly the code for this function:

void _0x8cc(const CHAR16 in[64], UINT8 out[16])
{
    UINT8 ascii[64], scancode[64], hash[32];
    BootServices->SetMem(out,16,0);
    BootServices->SetMem(ascii,64,0);
    BootServices->SetMem(scancode,64,0);
    BootServices->SetMem(hash,32,0);
    for (int i=0;i<64;i++)
    {
        ascii[i]=in[i];
    }
    if (TranslateService)
    {
        TranslateService->Translate(TranslateService,ascii,scancode,64);
        if (CryptService)
        {
            CryptService->SHA256(CryptService,scancode,64,hash);
            BootServices->CopyMem(out,hash,16);
        }
        BootServices->SetMem(ascii,64,0);
        BootServices->SetMem(scancode,64,0);
        BootServices->SetMem(hash,32,0);
    }
    else
    {
        BootServices->SetMem(ascii,64,0);
    }
}

I’ll assume that this function is used to hash a password input by the user. There’s another interesting function at offset 0xa30, which checks whether the input CHAR16 is in the character class [0-9A-Za-z ;], which is used to limit the possible characters in the password input.
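To make the character class concrete, here is a minimal sketch of what a check like the one at offset 0xa30 could look like in C. The function name and the use of uint16_t as a stand-in for CHAR16 are assumptions for illustration; the firmware's actual code was not recovered in this form.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical reconstruction: returns true if c is in [0-9A-Za-z ;].
 * uint16_t stands in for the UEFI CHAR16 type. */
bool is_password_char(uint16_t c)
{
    return (c >= '0' && c <= '9') ||
           (c >= 'A' && c <= 'Z') ||
           (c >= 'a' && c <= 'z') ||
           c == ' ' || c == ';';
}
```

Any character outside this class, such as punctuation other than the semicolon, would be rejected at the prompt.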

I’ve made good progress identifying part of the path from password input to security unlock command, but here I’ve hit a dead end. It’s not really clear from where the password control program gets called and what happens to the hash it outputs. I’ll try a different approach next, but first let’s talk about the crypto service.

Crypto service

The password control program calls a function in the Crypto service at offset 0x26e0, which references three GUIDs that I hadn’t seen before:

69188a5f-6bbd-46c7-9c16-55f194befcdf
d0b3d668-16cf-4feb-95f5-1ca3693cfe56
6c48f74a-b4df-461f-80c4-5cae8a85b7ee

These GUIDs do not appear in any efiperun output. Instead, I just searched all images for appearances of these GUIDs, and they appear in 10 other images. A noteworthy appearance is in SystemCryptSvcRt.efi at offset 0x1c70. Offset 0x1c70 is referenced at offset 0x330, where it is immediately followed by the Unicode string “SHA256”. This is followed by a jump table at offset 0x370, which points to 3 jumps at offset 0x33c0 that jump to 3 functions at offsets 0x753c, 0x7570 and 0x760c. The function at offset 0x753c references offset 0x2258, which stores the hash initialization constants for SHA256! The rest of the SystemCryptSvcRt.efi module also contains SHA256 round constants, and similar strings and constants for other algorithms.
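Searching an image for a GUID has one catch: in memory, the first three fields of an EFI_GUID are stored little-endian, so the bytes do not match the textual form. The sketch below shows one way to do such a search; guid_t, guid_to_bytes and find_guid are hypothetical names for illustration, not the tooling actually used here.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Same field layout as EFI_GUID (hypothetical stand-in type). */
typedef struct {
    uint32_t d1;
    uint16_t d2;
    uint16_t d3;
    uint8_t  d4[8];
} guid_t;

/* Serialize a GUID the way it appears inside an x86 EFI image:
 * the first three fields little-endian, the last 8 bytes verbatim. */
void guid_to_bytes(const guid_t *g, uint8_t out[16])
{
    out[0] = (uint8_t)(g->d1);
    out[1] = (uint8_t)(g->d1 >> 8);
    out[2] = (uint8_t)(g->d1 >> 16);
    out[3] = (uint8_t)(g->d1 >> 24);
    out[4] = (uint8_t)(g->d2);
    out[5] = (uint8_t)(g->d2 >> 8);
    out[6] = (uint8_t)(g->d3);
    out[7] = (uint8_t)(g->d3 >> 8);
    memcpy(out + 8, g->d4, 8);
}

/* Return the offset of the first occurrence of g in image, or -1. */
long find_guid(const uint8_t *image, size_t len, const guid_t *g)
{
    uint8_t needle[16];
    guid_to_bytes(g, needle);
    for (size_t off = 0; off + 16 <= len; off++) {
        if (memcmp(image + off, needle, 16) == 0)
            return (long)off;
    }
    return -1;
}
```

Running a search like this over every extracted image is enough to list the modules that mention a given GUID.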

All in all this suggests that the Crypto service is a front for the cryptographic routines in SystemCryptSvcRt.efi and that the password control program calls SHA256. I wrote a small test program for the EFI shell that tests this:

void buf2hexstr(VOID*buf,CHAR16*str,UINTN len)
{
    UINTN i;
    static CHAR16 hchars[16]={'0','1','2','3','4','5','6','7','8','9','a','b','c','d','e','f'};
    UINT8* buf_=(UINT8*)buf;
    for (i=0;i<len;i++)
    {
        *(str++)=hchars[*(buf_)  >>4];
        *(str++)=hchars[*(buf_++)&0xf];
    }
}

EFI_STATUS Initialize(...)
{
    ...
    EFI_GUID guid={0x73e47354,0xb0c5,0x4e00,{0xa7,0x14,0x9d,0x0d,0x5a,0x4f,0xdb,0xfd}};
    void* intf;

    if (SystemTable->BootServices->LocateProtocol(&guid,NULL,&intf)==EFI_SUCCESS)
    {
        const char* in="TEST";
        char out[32]={0};
        CHAR16 str[13+(32*2)+2]=L"SHA256 test: ";
        ((void(*)(void*,const char*,UINTN,char*))(*(void**)intf))(intf,in,4,out);
        buf2hexstr(out,str+13,32);
        str[13+(32*2)]='\n';
        SystemTable->ConOut->OutputString(SystemTable->ConOut, str);
        SystemTable->ConOut->OutputString(SystemTable->ConOut,
            L"Expected:    94ee059335e587e501cc4bf90613e0814f00a7b08bc7c648fd865a2af6a22cc2\n");
    }
    else
    {
        SystemTable->ConOut->OutputString(SystemTable->ConOut,
            L"Unable to load CryptService protocol\n");
    }
    ...
}

Outputs:

SHA256 test: 94ee059335e587e501cc4bf90613e0814f00a7b08bc7c648fd865a2af6a22cc2
Expected:    94ee059335e587e501cc4bf90613e0814f00a7b08bc7c648fd865a2af6a22cc2

Success!

Hard-drive communication

As mentioned, I discovered how the input password got hashed, but it still needs to be sent to the drive. The UEFI standard defines the ATA Pass Thru Protocol, which can be used to send raw ATA commands to a drive. This protocol is very likely to be used for sending ATA security commands. This protocol is not loaded upon initialization by any modules, but the GUID does appear in the following modules:

FdiskOem.efi
LenovoHdpManagerDxe.efi
LenovoMfgBenchEventDxe.efi
SystemAhciAtaAtapiPassThruDxe.efi
SystemAhciBusDxe.efi
SystemAhciBusSmm.efi
SystemIdeAtaPassThruDxe.efi
SystemIdeBusDxe.efi

Wait a minute, is that second module called Lenovo Hard Drive Password Manager? Why yes, it is. There’s a bunch of code in this module, but I found an interesting function call chain for you:

offset 0xce0
- offset 0x8a0
  - CryptService.SHA256
- offset 0x144c
  - offset 0x232c
    - EFI_ATA_PASS_THRU_PROTOCOL.PassThru

The input to the SHA256 function is a parameter to the function at offset 0xce0, and data from an EFI runtime variable “LenovoHddSecInfoVar”. The PassThru function is called with an ATA_OP_SECURITY_UNLOCK command block including the hash generated just before. I assume the input to the function at offset 0xce0 is the password hash from the password control program, but what is the data in “LenovoHddSecInfoVar”? The dmpstore utility in the EFI shell will dump runtime variables. Here’s mine:

Variable BS '2D8FBE63-3A04-4EF8-A8A4-77321DB5A9AB:LenovoHddSecInfoVar' DataSize = 8
  00000000: 98 7D BC B7 00 00 00 00-                         *........*

From the code I know that the value is being used as a memory address, so let’s use the mem utility to dump that:

  B7BC7D98: .. .. .. .. .. .. .. ..-18 E0 0F B8 00 00 00 00           ........*
  B7BC7DA8: 98 DF 0F B8 00 00 00 00-.. .. .. .. .. .. .. ..  *........

Those are two more memory addresses, let’s see what’s there:

  B80FDF98: 61 53 73 6D 6E 75 20 67-53 53 20 44 34 38 20 30  *aSsmnu gSS D48 0*
  B80FDFA8: 56 45 20 4F 30 35 47 30-20 42 20 20 20 20 20 20  *VE O05G0 B      *
  B80FDFB8: 20 20 20 20 20 20 20 20-.. .. .. .. .. .. .. ..  *        

  B80FE018: 31 53 48 44 53 4E 46 41-30 42 38 35 39 34 20 45  *1SHDSNFA0B8594 E*
  B80FE028: 20 20 20 20 .. .. .. ..-.. .. .. .. .. .. .. ..  *

If you squint your eyes just right, those kind of read Samsung SSD 840 EVO 500GB and S1DHNSAFB05849E, the Model Number and Serial Number for my SSD, respectively. Piecing all this together, you get the algorithm described in my other blog post.
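The squinting can be automated. ATA identification strings are stored as 16-bit words with the two characters of each word swapped, and they are padded with spaces. A small sketch that undoes this (ata_string_decode is a hypothetical helper name, not something found in the firmware):

```c
#include <stddef.h>

/* Swap each pair of bytes in raw[0..len) and trim trailing spaces.
 * out must have room for len + 1 bytes. */
void ata_string_decode(const char *raw, size_t len, char *out)
{
    size_t i;
    for (i = 0; i + 1 < len; i += 2) {
        out[i]     = raw[i + 1];
        out[i + 1] = raw[i];
    }
    if (i < len)            /* odd length: copy the last byte as-is */
        out[i] = raw[i];
    while (len > 0 && out[len - 1] == ' ')
        len--;              /* drop the space padding */
    out[len] = '\0';
}
```

Feeding it the 16 bytes at B80FE018 from the dump above gives the serial number in readable order.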

Conclusion

As I mentioned, this story is the abridged version of how I found the password hashing algorithm. In reality, I looked at many other modules, including many hours spent looking at useless things. In the end though, I prevailed and found what I was looking for, developing a bunch of tools in the process:

efiperun: Load and run EFI PE image files in a regular OS environment.
guiddb: Scan files for GUIDs and output them in C-source file format.
memdmp: Dump UEFI memory using EFI shell.
tree: A Ruby abstraction for a firmware tree on your filesystem previously extracted by UEFIExtract.

I hope these tools are of use to someone. Patches welcome. ☺️

Parallel Nibble Sort – Technology & Policy

2015-01-28T12:00:00Z

Update July 20, 2015: The winning solution by Alexander Monakov also uses a sorting network but transposes the items to be sorted to sort 32 nibbles in parallel with a length 60 network, instead of my 4 nibbles with a depth 9 network. Hans Wennborg has a nice write-up of that solution.

Professor John Regehr at University of Utah held a small programming contest for “nibble sort”. The goal is to sort nibbles in a 64-bit value, 1024 times, as fast as possible. For example, the nibble sort of 0xbadbeef is 0xfeedbba000000000.
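To pin down what the task asks for, here is a simple scalar baseline, written as a counting sort over the 16 possible nibble values. This is only an illustration of the expected output, not the contest's reference implementation, and the name nibble_sort_baseline is an assumption.

```c
#include <stdint.h>

/* Sort the 16 nibbles of word so the largest ends up in the most
 * significant position. Counting sort: 16 buckets, one per value. */
uint64_t nibble_sort_baseline(uint64_t word)
{
    unsigned counts[16] = {0};
    for (int i = 0; i < 16; i++)
        counts[(word >> (4 * i)) & 0xf]++;

    uint64_t out = 0;
    for (int v = 15; v >= 0; v--) {
        /* Emit value v as many times as it occurred. */
        for (unsigned c = 0; c < counts[v]; c++)
            out = (out << 4) | (uint64_t)v;
    }
    return out;
}
```

For the example above, nibble_sort_baseline(0xbadbeef) returns 0xfeedbba000000000.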

Algorithm

I chose to implement the sort using a sorting network. I used the following minimum-depth network to sort 16 items, which was designed by David C. Van Voorhis.

Figure 1. From “The Art of Computer Programming, Volume 3”.

The inputs are distributed in rows on the left, and each vertical line segment compares the two numbers on the parallel lines, swapping them if the upper element is less than the lower element.

Parallelization

For each of the 9 stages, the elements to be sorted are split into two buckets, and the same indices in each bucket are compared and potentially swapped at the same time using SIMD instructions. The split for each stage is a different permutation depending on which elements are to be compared. This process was inspired by the paper “Efficient implementation of sorting on multi-core SIMD CPU architecture”. After each stage, the buckets are then again combined into a single list using the inverse permutation. The permutations of combining the buckets from the last stage and splitting them again for the next stage can be reduced to a single permutation.

function sort(input):
    b1:b2 <- input
    for i := 1 to 9:
        b1:b2 <- each_min(b1,b2):each_max(b1,b2)
        b1:b2 <- permute(step=i,b1:b2)
    return b1:b2

Algorithm 1. Parallel network sort.
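The split, min/max, combine structure is easier to see on a small case. The sketch below applies it to the well-known 3-stage network for 4 items, in plain scalar C and sorting in ascending order. It is not the 16-item, 9-stage network used here, and the LO/HI tables and function name are assumptions for illustration.

```c
#include <stdint.h>

#define STAGES 3
#define HALF   2

/* For each stage, element LO[s][k] is compared with element HI[s][k].
 * Stage 3 only needs the (1,2) comparison; the (0,3) pair is already
 * in order by then, so comparing it is harmless and fills the bucket. */
static const uint8_t LO[STAGES][HALF] = {{0, 2}, {0, 1}, {1, 0}};
static const uint8_t HI[STAGES][HALF] = {{1, 3}, {2, 3}, {2, 3}};

void network_sort4(uint8_t x[4])
{
    for (int s = 0; s < STAGES; s++) {
        uint8_t b1[HALF], b2[HALF];

        /* Split: permute the elements into two buckets. */
        for (int k = 0; k < HALF; k++) {
            b1[k] = x[LO[s][k]];
            b2[k] = x[HI[s][k]];
        }
        /* Element-wise min/max; this is the part SIMD does in one go. */
        for (int k = 0; k < HALF; k++) {
            uint8_t mn = b1[k] < b2[k] ? b1[k] : b2[k];
            uint8_t mx = b1[k] < b2[k] ? b2[k] : b1[k];
            b1[k] = mn;
            b2[k] = mx;
        }
        /* Combine: apply the inverse permutation. */
        for (int k = 0; k < HALF; k++) {
            x[LO[s][k]] = b1[k];
            x[HI[s][k]] = b2[k];
        }
    }
}
```

In a vectorized version, the combine step of one stage and the split step of the next collapse into a single shuffle, which is the reduction described above.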

Implementation

On IA-32, the smallest unit that can be processed is a byte. Every 2 of the 16 nibbles in the input word are unpacked into the lower nibble of 2 bytes for a total of 16 bytes to be sorted, and each bucket is 8 bytes. AVX2 can process 32 bytes or 4 buckets in parallel. This implementation runs more than 72× faster than the reference implementation on my test machine.
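The unpacking step can be written in scalar form as follows. This is a sketch only: a SIMD implementation would do the same with shifts, masks and interleaves on vector registers, and the byte order chosen here (lowest nibble in byte 0) is an assumption.

```c
#include <stdint.h>

/* Spread the 16 nibbles of word over 16 bytes, lowest nibble first.
 * Each output byte holds one nibble in its low 4 bits. */
void unpack_nibbles(uint64_t word, uint8_t bytes[16])
{
    for (int i = 0; i < 16; i++)
        bytes[i] = (uint8_t)((word >> (4 * i)) & 0xf);
}

/* Inverse of unpack_nibbles: gather 16 low nibbles into one word. */
uint64_t pack_nibbles(const uint8_t bytes[16])
{
    uint64_t word = 0;
    for (int i = 0; i < 16; i++)
        word |= (uint64_t)(bytes[i] & 0xf) << (4 * i);
    return word;
}
```

The first 8 bytes and the last 8 bytes of the unpacked array then serve as the two buckets.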

Discussion

Currently, the min/max operation takes ½ operation per word, while the permute operation takes 1½ operations per word. The reason the permutation requires so many instructions is that both buckets need to be in the same register for the permute operation but they need to be in separate registers for the min/max operation. By carefully considering the shuffling constants, it’s possible to do permutations #3 and #8 in 1 operation per word and ½ operation per word, respectively.

The application can be further sped up by using multiple threads to do the sorting, since each of the 1024 elements can be sorted independently. When working with much larger datasets, subsets can still be sorted independently, so this algorithm scales very well.