Introduction
The new buzz in the mobile marketplace is about Android 64-bit
systems. In September 2013, Apple released the iPhone* 5s with a 64-bit
A7 processor onboard. Thus began the mobile technology race.
It turns out that the GNU/Linux* kernel that Android is based on has been
supporting processors with 64-bit registers for a long time. Ubuntu is
"GNU/Linux" while Android is "Dalvik/Linux". Dalvik is the
process virtual machine (VM) in Google's Android operating system,
which specifically executes applications written for Android. This
makes Dalvik an integral part of the Android software stack, which is
typically used on mobile devices such as mobile phones and tablet
computers, as well as more recently on devices such as smart TVs and
wearables. Nevertheless, all developers who use the NDK have to rebuild
their programs for the new architecture, and the ease or difficulty
of this process depends on the tools that Google will provide. In
addition, Google should provide backward compatibility, i.e., 32-bit NDK
applications should run on 64-bit Android.
The first Intel 64-bit processors for mobile devices were created in the
3rd quarter of 2013 and were the new powerful multicore System on a Chip
(SoC) for mobile and desktop devices. This new SoC family consists of
Intel® Atom™ processors for tablets and 2 in 1 devices, Intel® Celeron®
processors, and Intel® Pentium® processors for 2 in 1 devices, laptops,
desktop PCs and All in One PCs.
In October 2014, Google released a preview emulator image of the
64-bit Android L for developers. This allowed them to test their
programs and rewrite code, if necessary, before the OS is released. In a
Google+ blog post, developers indicated that programs written entirely in
Java* do not require porting. They ran them “as is” in the L version of the
emulator, which supports 64-bit architecture. Those using other
languages, especially C and C++, will have to perform some steps to
build against the new Android NDK. Several older versions of
Android-based devices with 64-bit processors are on the market. However,
manufacturers may have to update them rather quickly; otherwise, there
will be a lack of software apps for users.
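One typical source change when rebuilding C/C++ code for a 64-bit target is replacing integer types that were assumed to be as wide as a pointer. The following is only a minimal sketch of that idea; the helper names (pointerToInteger, integerToPointer, truncateTo32) are hypothetical and are not part of the NDK.

```cpp
#include <cstdint>

// Hypothetical helper: stores a pointer in an integer type that is wide
// enough to hold it on both 32-bit and 64-bit ABIs.
inline std::uintptr_t pointerToInteger(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p);
}

// Hypothetical helper: converts the integer back to the original pointer.
inline const void* integerToPointer(std::uintptr_t value) {
    return reinterpret_cast<const void*>(value);
}

// Shows what legacy code effectively does when it keeps an address in a
// 32-bit integer: the upper half of a 64-bit address is discarded.
inline std::uint32_t truncateTo32(std::uintptr_t address) {
    return static_cast<std::uint32_t>(address);
}
```

Code that keeps addresses in uintptr_t (or intptr_t) builds unchanged for 32-bit and 64-bit ABIs, while code that relied on int or unsigned for this purpose may lose the upper 32 bits of an address once it is rebuilt for a 64-bit target.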
Android 64-bit L emulator
In June 2014, Google announced that Android would support 64-bit in
the coming L release. This is great news for those who want the most
performance possible out of their devices and apps. The list of benefits
highlighted by Google in this update includes a larger number of
registers, increased addressable memory space, and new instruction sets.
The Android emulator supports many hardware features likely to be found on mobile devices, including:
- An ARM* v5 CPU and the corresponding memory-management unit (MMU)
- A 16-bit LCD display
- One or more keyboards (a Qwerty-based keyboard and associated Dpad/Phone buttons)
- A sound chip with output and input capabilities
- Flash memory partitions (emulated through disk image files on the development machine)
- A GSM modem, including a simulated SIM Card
- A camera, using a webcam connected to your development computer.
- Sensors like an accelerometer, using data from a USB-connected Android device
This is a great step forward for building our favorite devices
and apps. Unfortunately, we’ll have to wait for Android L to drop before
we can enjoy these new performance boosts. A few weeks after Android L
releases, Revision 10 of the Native Development Kit (NDK) should be
posted with support for the three 64-bit architectures that will be able
to run the new version of Android: arm64-v8a, x86_64, and mips64. If
you’ve built an app using Java, your code will automatically have better
performance on the new x86 64-bit architecture. Google has updated the
NDK to revision 10b and added an emulator image developers can use to
prepare their apps to run on devices built with Intel's 64-bit chips.
Keep in mind, the NDK is only for native apps, not those built with
Java on the regular Android SDK. If you have been looking forward to
getting your apps running on 64-bit, or if you need to update to the
latest version of the NDK, hit the developer portal to get your download
started.
Developing with the x86_64 Android NDK
The Native Development Kit (NDK) is a toolset that allows you to
implement parts of your app using native code languages such as C and
C++. For certain types of apps, this can be helpful so you can reuse
existing code libraries written in these languages, but most apps do not
need the Android NDK. You need to balance the benefits of using the NDK
against its drawbacks. Notably, using native code on Android generally
does not result in a noticeable performance improvement, but it always
increases your app complexity. You should only use the NDK if it is
essential to your app and not because you simply prefer to program in
C/C++.
You can download the latest version of Android NDK from:
https://developer.android.com/tools/sdk/ndk/index.html
In this section I'll review how to compile a sample application using the Android NDK.
We will use the sample application, san-angeles, located in the Android NDK samples directory:
$ANDROID_NDK/samples/san-angeles
Native code is located in the jni/ directory:
$ANDROID_NDK/samples/san-angeles/jni
Native code is compiled for specified CPU architecture(s). Android
applications may contain libraries for several architectures in one apk
file.
To set target architectures you need to create the Application.mk file inside the jni/ directory. The following line will compile the native libraries for all supported architectures:
APP_ABI := all
Sometimes, it’s better to specify a list of target architectures.
This line compiles the libraries for x86 and ARM architectures:
APP_ABI := x86 armeabi armeabi-v7a
Because we are building a 64-bit app, we need to compile the libraries for the x86_64 architecture:
APP_ABI := x86_64
Run the following commands to change to the sample directory and build the libraries:
cd $ANDROID_NDK/samples/san-angeles
$ANDROID_NDK/ndk-build
After the successful build, open the sample in Eclipse* as an Android
application and click “Run”. Select the emulator or a connected Android
device where you want to run the application.
To support all available devices you need to compile the application
for all architectures. If the apk file size with libraries for all
architectures is too big, consider following the instructions in
Google Play Multiple APK Support to prepare a separate apk file for each platform.
Checking supported architectures
You can use this command to check which architectures are included in an apk file:
aapt dump badging file.apk
The following line lists all architectures:
native-code: 'armeabi', 'armeabi-v7a', 'x86', 'x86_64'
Another method is to open the apk file as a zip file and view subdirectories in the lib/ directory.
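Native code itself can also report which architecture it was compiled for, which helps confirm that the expected library was loaded. The sketch below assumes a GCC- or Clang-based toolchain and relies on the compilers' predefined macros; the function names compiledAbi and is64BitBuild are hypothetical.

```cpp
// Returns the NDK ABI name that matches the architecture this translation
// unit was compiled for, based on compiler-predefined macros (GCC/Clang).
inline const char* compiledAbi() {
#if defined(__x86_64__)
    return "x86_64";
#elif defined(__i386__)
    return "x86";
#elif defined(__aarch64__)
    return "arm64-v8a";
#elif defined(__arm__)
    return "armeabi";  // armeabi and armeabi-v7a are not distinguished here
#elif defined(__mips64)
    return "mips64";
#elif defined(__mips__)
    return "mips";
#else
    return "unknown";
#endif
}

// True when pointers are 8 bytes wide in this build.
inline bool is64BitBuild() {
    return sizeof(void*) == 8;
}
```

A library built with APP_ABI := x86_64 would be expected to return "x86_64" from compiledAbi(), which could be logged at startup next to the result of is64BitBuild().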
Optimization of 64-bit programs
Reducing the amount of memory an app consumes
When a program is compiled in the 64-bit mode, it consumes more
memory than its 32-bit version. This increase often goes unnoticed, but
memory consumption can sometimes be twice that of the 32-bit version.
The amount of memory consumption is determined by the following factors:
- Some objects, like pointers, require larger amounts of memory
- Data alignment and data structure padding
- Increased stack memory consumption
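The first two factors can be seen in a small structure. The sketch below uses a hypothetical Node type and assumes a common ABI where int is 4 bytes and a pointer is aligned to its own size; under that assumption the structure takes 12 bytes in a 32-bit build and 24 bytes in a 64-bit build, of which 8 bytes are padding.

```cpp
#include <cstddef>

// Hypothetical list node: one 4-byte int, one pointer, one more 4-byte int.
struct Node {
    int   id;     // 4 bytes, followed by padding when pointers are 8 bytes
    Node* next;   // 4 bytes on 32-bit ABIs, 8 bytes on 64-bit ABIs
    int   flags;  // 4 bytes, followed by tail padding on 64-bit ABIs
};

// Bytes in Node that carry no data (padding inserted by the compiler).
constexpr std::size_t nodePaddingBytes() {
    return sizeof(Node) - (sizeof(int) + sizeof(Node*) + sizeof(int));
}
```

Printing sizeof(Node) and nodePaddingBytes() in a 32-bit and a 64-bit build of the same source shows how much of the growth comes from the wider pointer and how much from alignment.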
64-bit systems have a larger amount of memory available to user
applications than 32-bit systems. So if a program takes 300 Mbytes on a
32-bit system with 2 Gbytes of memory but needs 400 Mbytes on a 64-bit
system with 8 Gbytes of memory, in relative units, the program takes
three times less memory on a 64-bit system. The one disadvantage is
performance loss. Although 64-bit programs are faster, extracting larger
amounts of data from memory might cancel all the advantages and even
reduce performance. Transferring data between the memory and
microprocessor (cache) is not very cheap.
One way to reduce memory consumption is to optimize data structures.
Another way is to use memory-saving data types. For instance, if we need
to store a lot of integer numbers and we know that their values will
never exceed UINT_MAX, we may use the type "unsigned" instead of
"size_t", as discussed in the next section.
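Both ideas can be sketched in a few lines. The example below is an illustration only: PackedNode and CompactIndexList are hypothetical names, and the size stated for PackedNode assumes a common 64-bit ABI with a 4-byte int and an 8-byte pointer.

```cpp
#include <cstddef>
#include <vector>

// Optimized layout: the widest member comes first, so no padding is needed
// between members on common 64-bit ABIs (16 bytes in total there, compared
// with 24 bytes when the pointer sits between the two ints).
struct PackedNode {
    PackedNode* next;
    int         id;
    int         flags;
};

// Memory-saving alternative to std::vector<std::size_t>: valid only while
// every stored value is known never to exceed UINT_MAX.
using CompactIndexList = std::vector<unsigned>;

// Bytes saved per stored element by using unsigned instead of size_t.
inline std::size_t bytesSavedPerElement() {
    return sizeof(std::size_t) - sizeof(unsigned);
}
```

On a typical 64-bit target bytesSavedPerElement() is 4, so a list of one million values shrinks by about 4 Mbytes. The trade-off is that the range guarantee must really hold for the data being stored.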
Using memsize-types in address arithmetic
Using the ptrdiff_t and size_t types in address arithmetic might give you
an additional performance gain along with making the code safer. For
example, using the type int, whose size differs from the pointer's
capacity, as an index results in additional data conversion commands in
the binary code. In 64-bit code, pointers are 64 bits wide while the int
type remains 32 bits wide.
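The safety aspect can be illustrated with index arithmetic near the 32-bit limit. This is a minimal sketch, assuming a 32-bit unsigned type and a 64-bit size_t; the function names are hypothetical.

```cpp
#include <cstddef>

// Index computed in 32-bit unsigned arithmetic: the sum wraps around at 2^32
// before it is widened to pointer size.
inline std::size_t nextIndexUnsigned(unsigned index, unsigned step) {
    return index + step;
}

// Index computed in size_t arithmetic: on a 64-bit target the sum is formed
// at full pointer width, so it does not wrap for this range of values.
inline std::size_t nextIndexSizeT(std::size_t index, std::size_t step) {
    return index + step;
}
```

With index equal to UINT_MAX and step equal to 1, the first function yields 0 while the second yields 4294967296 on a 64-bit target, so only the second still addresses the intended element of a very large array.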
It is not easy to give a brief example to show that size_t is better than
unsigned. To be impartial, we have to use the compiler's optimizing
capabilities. But two variants of the optimized code often get too
different to easily demonstrate their difference. We managed to create
something like a simple example after six tries. But the sample is far
from ideal because, instead of showing the unnecessary data type
conversions discussed above, it shows that the compiler can build more
efficient code when using size_t. Consider this code, which arranges array
items in the reverse order:
// array is a float*; arraySize holds the number of items
for (unsigned i = 0; i < arraySize / 2; i++)
{
    float value = array[i];
    array[i] = array[arraySize - i - 1];
    array[arraySize - i - 1] = value;
}
The variables "arraySize" and "i" in the example have the type unsigned.
You can easily replace it with size_t and compare a small fragment of
assembler code shown in Table 1.
Table 1 - Comparing the 64-bit assembler code fragments using the types unsigned and size_t
Source line: array[arraySize - i - 1] = value;
| arraySize, i : unsigned | arraySize, i : size_t |
|---|---|
| mov eax, DWORD PTR arraySize$[rsp] | mov rax, QWORD PTR arraySize$[rsp] |
| sub eax, r11d | sub rax, r11 |
| add r11d, 1 | add r11, 1 |
| add eax, -1 | |
| movss DWORD PTR [rbp + rax*4], xmm0 | movss DWORD PTR [rdi + rax*4 - 4], xmm0 |
| … | … |
The compiler managed to build more concise code when using 64-bit
registers. We do not claim that the code generated for the type unsigned
(column 1) will be slower than the code generated for the type size_t
(column 2); it is difficult to compare the speed of code execution on
contemporary processors. But you can see in this example that the
compiler built shorter code, with one instruction fewer in this fragment,
when using 64-bit types.
Now let us consider an example showing the advantages of the types
ptrdiff_t and size_t from the viewpoint of performance. For the purposes
of demonstration, we will take a simple algorithm of calculating the
minimum path length. The function FindMinPath32 is written in classic
32-bit style with unsigned types. The function FindMinPath64 differs from
it only in the way that all the unsigned types in it are replaced with
size_t types. There are no other differences! Now let us compare the
execution speeds of these two functions (Table 2).
Table 2 - The time of executing the functions FindMinPath32 and FindMinPath64
| Line | Mode and function | Function's execution time |
|---|---|---|
| 1 | 32-bit compilation mode. Function FindMinPath32 | 1 |
| 2 | 32-bit compilation mode. Function FindMinPath64 | 1.002 |
| 3 | 64-bit compilation mode. Function FindMinPath32 | 0.93 |
| 4 | 64-bit compilation mode. Function FindMinPath64 | 0.85 |
Table 2 shows execution times normalized to the execution time of the
function FindMinPath32 on a 32-bit system, which was done for clarity. The
operation time of the FindMinPath32 function in the first line is 1 on a
32-bit system. This represents our baseline as a unit of measurement.
In the second line, we see that the operation time of the FindMinPath64
function is also about 1 on a 32-bit system. No wonder, because the type
unsigned coincides with the type size_t on a 32-bit system, and there is
no difference between the FindMinPath32 and FindMinPath64 functions. A
small deviation (1.002) only indicates a small error in measurements.
In the third line, we see a performance gain of 7%. We could well
expect this result after recompiling the code for a 64-bit system.
The fourth line is of the most interest for us. The performance gain is
15%. By merely using the type size_t instead of unsigned, the compiler
built more effective code that works a further 8% faster!
This simple and obvious example shows how data that are not equal to
the size of the machine word slow down algorithm performance. Mere
replacement of the types int and unsigned with ptrdiff_t and size_t may
result in a significant performance gain. This result applies first of
all to those cases where these data types are used to index arrays, in
address arithmetic, and to arrange loops.
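If you want to repeat this kind of comparison on your own code, a small timing harness is enough. The sketch below is not the FindMinPath code measured above; it is a hypothetical harness that applies the array-reversal loop from this section with the index type as a template parameter, and the names reverseArray and timeReverse are assumptions made for illustration.

```cpp
#include <chrono>
#include <cstddef>
#include <vector>

// Reverses the array in place; Index is the type under test (for example
// unsigned or std::size_t).
template <typename Index>
void reverseArray(float* array, Index arraySize) {
    for (Index i = 0; i < arraySize / 2; i++) {
        float value = array[i];
        array[i] = array[arraySize - i - 1];
        array[arraySize - i - 1] = value;
    }
}

// Runs reverseArray<Index> `repeats` times and returns the elapsed seconds.
template <typename Index>
double timeReverse(std::vector<float>& data, int repeats) {
    auto start = std::chrono::steady_clock::now();
    for (int r = 0; r < repeats; ++r) {
        reverseArray<Index>(data.data(), static_cast<Index>(data.size()));
    }
    std::chrono::duration<double> elapsed =
        std::chrono::steady_clock::now() - start;
    return elapsed.count();
}
```

Calling timeReverse<unsigned>(data, n) and timeReverse<std::size_t>(data, n) on the same large vector, in both 32-bit and 64-bit builds, gives four numbers comparable to the four lines of Table 2. The ratios you get will depend on the compiler, the optimization flags, and the device, so they may differ from the figures in the table.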
Note: PVS-Studio is a commercial static program analysis tool for
C, C++, and C++11. Although it is not specially designed to optimize
programs, it may assist you in code refactoring and therefore make the
code more efficient. For example, you can use memsize-types when fixing potential errors related to address arithmetic, thus allowing the compiler to build a more optimized code.
Intrinsic functions
Intrinsic functions are special system-dependent functions that
perform actions that cannot be performed at the C/C++ level of code or
that perform these functions much more effectively. Actually, they let
you get rid of inline assembler code because it is often undesirable or
impossible to use.
Programs may use intrinsic functions to create faster code due to the
lack of overhead expenses on calling common functions. The code size is
a bit larger of course.
MSDN gives a list of functions that can be replaced with their intrinsic versions. Examples of these are memcpy, strcmp, etc.
Besides automatic replacement of common functions with their
intrinsic versions, you may use intrinsic functions explicitly in your
code. This might be helpful due to these factors:
- Inline assembler is not supported by the Visual C++ compiler in the 64-bit mode while intrinsic code is.
- Intrinsic functions are simpler to use as they do not require knowledge of registers and other similar low-level constructs.
- Intrinsic functions are updated in compilers while assembler code must be updated manually.
- The built-in optimizer does not work with assembler code.
- Intrinsic code is easier to port than assembler code.
Using intrinsic functions in automatic mode (with the help of
the compiler switch) will let you get some percentage of performance
gain, and using them "manually" (explicitly in your code) helps even
more. That is why using intrinsic functions is a good way to go.
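As an illustration of explicit use, the sketch below adds two pairs of doubles with SSE2 intrinsics from emmintrin.h where they are available and falls back to plain C++ elsewhere. The function name addPairs is hypothetical, and the preprocessor check is an assumption about how your toolchain signals SSE2 support (GCC and Clang define __SSE2__, Visual C++ defines _M_X64 for 64-bit builds).

```cpp
#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>  // SSE2 intrinsics
#endif

// Adds two pairs of doubles: out[i] = a[i] + b[i] for i = 0 and 1.
inline void addPairs(const double* a, const double* b, double* out) {
#if defined(__SSE2__) || defined(_M_X64)
    __m128d va = _mm_loadu_pd(a);            // unaligned load of two doubles
    __m128d vb = _mm_loadu_pd(b);
    _mm_storeu_pd(out, _mm_add_pd(va, vb));  // one packed addition
#else
    out[0] = a[0] + b[0];                    // scalar fallback on other CPUs
    out[1] = a[1] + b[1];
#endif
}
```

No registers are named anywhere in the code; the compiler allocates them and can still optimize around the intrinsic calls, which is the practical advantage over inline assembler listed above.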
Alignment
Data structure alignment is the way data is arranged and accessed in
computer memory. It consists of two separate but related issues: data
alignment and data structure padding. When a modern computer reads from
or writes to a memory address, it will do this in word-sized chunks
(e.g., 4-byte chunks on 32-bit systems) or larger. Data alignment means
putting the data at a memory offset equal to some multiple of the word
size, which increases the system's performance due to the way the CPU
handles memory. To align the data, it may be necessary to insert some
meaningless bytes between the end of the last data structure and the
start of the next, which is data structure padding.
For example, when the computer's word size is 4 bytes (a byte is 8
bits on most machines, but could be different on some systems), the data
to be read should be at a memory offset that is some multiple of 4.
When this is not the case, e.g., the data starts at the 14th byte
instead of the 16th byte, then the computer has to read two 4-byte
chunks and do some calculation before the requested data has been read,
or it may generate an alignment fault. Even though the previous data
structure ends at the 13th byte, the next data structure should start at
the 16th byte. Two padding bytes are inserted between the two data
structures to align the next data structure to the 16th byte.
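The arithmetic in this example can be written down directly. The sketch below assumes that the alignment is a power of two, which holds for typical word sizes; alignUp and paddingFor are hypothetical helper names.

```cpp
#include <cstddef>

// Rounds `offset` up to the next multiple of `alignment`.
// Assumes `alignment` is a power of two (true for typical word sizes).
constexpr std::size_t alignUp(std::size_t offset, std::size_t alignment) {
    return (offset + alignment - 1) & ~(alignment - 1);
}

// Number of padding bytes needed before an object placed at `offset`.
constexpr std::size_t paddingFor(std::size_t offset, std::size_t alignment) {
    return alignUp(offset, alignment) - offset;
}

// The case from the text: byte 13 is the last one used, so the next free
// offset is 14; with a 4-byte word the next structure starts at 16.
static_assert(alignUp(14, 4) == 16, "offset 14 moves up to 16");
static_assert(paddingFor(14, 4) == 2, "two padding bytes are inserted");
```

The same two helpers reproduce what the compiler does between the members of a structure, using each member's own alignment requirement in place of the word size.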
Although data structure alignment is a fundamental issue for all
modern computers, many computer languages and computer language
implementations handle data alignment automatically.
It is good in some cases to help the compiler by defining the
alignment manually to enhance performance. For example, Streaming SIMD
Extensions (SSE) data must be aligned on a 16-byte boundary. You may do
this in the following way:
#include <emmintrin.h>  // SSE2 intrinsics
__declspec(align(16)) double init_val[2] = { 3.14, 3.14 };  // 16-byte aligned data
__m128d vector_var = _mm_load_pd(init_val);  // aligned load (SSE2 movapd instruction)
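The __declspec(align(16)) syntax is specific to Visual C++. With the GCC- and Clang-based toolchains of the Android NDK, the same request can be written with the C++11 alignas specifier. The following is a sketch under that assumption; isAligned16 and sumInitVal are hypothetical helpers added only to make the example checkable.

```cpp
#include <cstdint>

#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>  // SSE2 intrinsics
#endif

// C++11 alignas requests 16-byte alignment without compiler-specific syntax.
alignas(16) double init_val[2] = { 3.14, 3.14 };

// True when the address is a multiple of 16 bytes.
inline bool isAligned16(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) % 16 == 0;
}

// Sums both elements; uses an aligned SSE2 load where it is available.
inline double sumInitVal() {
#if defined(__SSE2__) || defined(_M_X64)
    __m128d v = _mm_load_pd(init_val);  // requires a 16-byte aligned address
    double out[2];
    _mm_storeu_pd(out, v);              // copy both lanes back to memory
    return out[0] + out[1];
#else
    return init_val[0] + init_val[1];   // scalar fallback on other CPUs
#endif
}
```

Passing an address that is not 16-byte aligned to _mm_load_pd can fault at run time, which is why the alignment is requested on the declaration itself.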
Android Runtime
Android Runtime (ART) was developed by Google as a
replacement for Dalvik. This runtime offers a number of new features that
improve performance and smoothness of the Android platform and apps.
ART was introduced in Android 4.4 KitKat; in Android 5.0 it will
completely replace Dalvik. Unlike Dalvik, which uses a Just-In-Time (JIT)
compiler (at runtime), ART compiles an application ahead of time, during
its installation. As a result, the program executes faster and that
improves battery life.
For backward compatibility, ART uses the same byte code as Dalvik.
In addition to the potential speed increase, using ART can provide a
second important benefit. As ART runs app machine code directly (native
execution), it doesn't hit the CPU as hard as just-in-time code
compiling on Dalvik. Less CPU usage results in less battery drain, which
is a big plus for portable devices in general.
So why wasn't ART implemented earlier? Let's look at the downsides of
Ahead-of-time (AOT) compilation. First, the generated machine code
requires more space than the existing byte code. Second, the code is
pre-compiled at install time, so the installation process takes a bit
longer. Finally, it also corresponds to a larger memory footprint
at execution time. This means that fewer apps can be run concurrently.
When the first Android devices hit the market, memory and storage
capacity were significantly smaller and presented a bottleneck for
performance. This is the reason why a JIT approach was the preferred
option at that time. Today, memory is much cheaper and thus more
abundant, even on low-end devices, so ART is a logical step forward.
In perhaps the most important improvement, ART now compiles your
application to native machine code when installed on a user’s device.
Known as ahead-of-time compilation, you can expect to see large
performance gains as the compilers are set for specific architectures
(such as ARM, x86, or MIPS). This eliminates the need for just-in-time
compilation each time an application is run. Thus it takes more time to
install your application, but it will boot faster when launched as many
tasks executed at runtime on the Dalvik VM, such as class and method
verification, have already taken place.
Next, the ART team worked to optimize the garbage collector (GC).
Instead of two pauses totaling about 10ms for each GC in Dalvik, you’ll
see just one, usually under 2ms. They’ve also parallelized portions of
the GC runs and optimized collection strategies to be aware of device
states. For example, a full GC will run only when the phone is locked
and responsiveness to user interaction is no longer important. This is a
huge improvement for applications that are sensitive to dropping
frames. Additionally, future versions of ART will include a compacting
collector that will move chunks of allocated memory into contiguous
blocks to reduce fragmentation and the need to kill older applications
to allocate large memory regions.
Lastly, ART makes use of an entirely new memory allocator called
Rosalloc (runs of slots allocator). Most modern systems use allocators
based on Doug Lea’s design, which has a single global memory lock. In a
multithreaded, object-oriented environment, this interferes with the
garbage collector and other memory operations. In Rosalloc, smaller
objects common in Java are allocated in a thread-local region without
locking and larger objects have their own locks. Thus when your
application attempts to allocate memory for a new object, it doesn’t
have to wait while the garbage collector frees an unrelated region of
memory.
Currently, Dalvik is the default runtime for Android devices and ART
is optionally available on a number of Android 4.4 devices, such as
Nexus phones, Google Play edition devices, Motorola phones running stock
Android, and many other smartphones. ART is currently in development,
and feedback from developers and users is being sought. ART will
eventually replace the Dalvik runtime once it becomes completely stable.
Until then, users with compatible devices can switch from Dalvik to ART
if they’re interested in trying out this new functionality and
experiencing its performance.
To switch or enable ART, your device must be running Android 4.4
KitKat and be compatible with ART. You can easily turn on ART runtime
from “Settings” -> “Developer options” -> “Runtime option”. (Tip:
If you can’t see Developer options in Settings, then go to “About
phone”, scroll down, and tap the Build number 7 times to enable
developer options.) The phone will reboot and start optimizing the apps
for ART, which can take around 15-20 minutes, depending on the number of
apps installed on your phone. You will also notice an increase in the
size of installed apps after enabling ART runtime.
Note: After switching to ART, when you reboot your device for the
first time, it will optimize all the apps once again, which is kind of
annoying.
As Dalvik is the default runtime on Android devices, some apps might
not work on ART, though most existing apps are compatible with ART and
should work fine. But in case you experience any bugs or app crashes
with ART, then it’s wise to switch back and stay with Dalvik.
Switching to ART on devices requires you to know where to find the
switching option on the device. Google has hidden it under Settings.
Fortunately, there is a trick to enable the ART runtime on devices that
are based on Android 4.4 KitKat.
Disclaimer: Before trying this, you should make a backup of your
data. Intel won’t be responsible if your device gets bricked (won’t turn
on regardless of what you try). Try it at your own risk!
- Requires Root
- Don’t try if you have WSM Tools installed as they don’t support ART.
To enable ART, carefully follow these steps:
- Make sure your device is rooted.
- Install ‘ES File Explorer’ from the Play store.
- Open ES File Explorer, tap the menu icon from top left corner and
select Tools. In tools, enable the ‘Root Explorer’ option and grant full
root access to ES explorer when prompted.
- In ES explorer, open the Device (/) directory from Menu ->
Local-> Device. Go to the /data/property folder. Open the
persist.sys.dalvik.vm.lib file as Text and then select ES note editor.
- Edit the file by selecting the edit option from top right corner. Change the line from libdvm.so to libart.so
- Go back to the persist.sys.dalvik.vm.lib file and select ‘Yes’ to save the file. Then reboot the phone.
- The phone will reboot now and start optimizing the apps for ART. It
can take time to reboot depending on the number of apps installed on
your device.
In case you want to revert to the Dalvik runtime, simply follow
the above steps and change the text in the persist.sys.dalvik.vm.lib file
back to libdvm.so.
Conclusion
Google has released a 64-bit emulator image for the forthcoming
Android L - but only for the Intel x86 chip. The new emulator will allow
developers to build or optimize older apps for the upcoming Android L
OS and its new 64-bit architecture. Moving to 64-bit increases the
addressable memory space, and allows a larger number of registers and a
new instruction set for developers, but 64-bit apps aren't necessarily
faster.
Java apps automatically gain the benefits of 64-bit because their
byte code will be executed by the new ART runtime, which is 64-bit. This
also implies that no changes to pure Java apps are necessary. Those
built on the Android NDK will need some optimization to include the
x86_64 build target. Intel has advice on how to go about porting code
that targets ARM to x86/x64. Using the new emulator, developers will
only be able to create apps for Intel® Atom™ processor-based chips.
Intel has been providing developers with tools and good system
support for Android, particularly its Intel® Hardware Accelerated
Execution Manager (Intel® HAXM) and a range of Intel Atom OS images.
Many Android programmers regularly test on emulated Intel architecture
even though most of their deployment is to ARM devices. As well as the
new emulator there is a 64-bit upgrade to the HAXM accelerator which
should make using HAXM even more attractive. To quote Intel:
"This commitment is evident not only in the delivery of the
industry’s first 64-bit emulator image for Intel architecture, and
64-bit Intel HAXM within the Android L Developer Preview SDK, but also
in many other innovations along the way such as the first 64-bit kernel
for Android KitKat earlier this year, the 64-bit Android Native
Development Kit (NDK), and other 64-bit advancements over the last
decade."
Could it be that a change to Intel architecture might happen as part of the change from 32-bit mobile to 64-bit mobile?
The Android SDK includes a virtual mobile device emulator that runs
on your computer. The emulator lets you prototype, develop, and test
Android applications without using a physical device. The Android
emulator mimics all of the hardware and software features of a typical
mobile device, except that it cannot place actual phone calls. It
provides a variety of navigation and control keys, which you can "press"
using your mouse or keyboard to generate events for your application.
It also provides a screen in which your application is displayed, along
with any other active Android applications.
To let you model and test your application more easily, the emulator
utilizes Android Virtual Device (AVD) configurations. AVDs let you
define certain hardware aspects of your emulated phone and allow you to
create many configurations to test many Android platforms and hardware
permutations. Once your application is running on the emulator, it can
use the services of the Android platform to invoke other applications,
access the network, play audio and video, store and retrieve data,
notify the user, and render graphical transitions and themes.