Thoughts on Dell XPS 13 as a developer laptop

Recently I started using a Dell XPS 13 9360. It has an i7 CPU and a 256GB SSD, so it seemed like it would be a great fit for development work and a significant upgrade over my old laptop, a Lenovo Thinkpad Carbon X1 with an i5.

I chose the XPS 13 based on the pretty much unanimously glowing reviews to be found online.

Based on these reviews it seemed like a no-brainer to pick the XPS 13 over the equivalent Thinkpad Carbon X1, which was several hundred pounds more expensive.

The Dell sales experience is always pretty slick, but once I got hold of the laptop and started trying to do real work with it my experience pretty quickly started to sour. Now, I don’t want to suggest it’s a bad product: it clearly packs a lot of modern technology into a small package and is in many ways better specced than the Lenovo product I was looking at. The screen in particular is very nice, but beyond that I don’t have too much positive to say about it.

The Keyboard

This is the biggest problem for me. The keyboard is just nowhere near as good as the Lenovo’s. The key response is soft and the travel is low.

The two images above show the depth of the keys on the two laptops (Dell left, Lenovo right). The images don’t give an indication of the travel on the keys – the Lenovo is much firmer with a longer travel, which to my hands is much more comfortable to type on. You can also see that the Ctrl and Fn keys are swapped. This is an arbitrary choice, but the Ctrl key on the Dell is also considerably smaller than the Lenovo’s, and as an emacs user I much prefer a larger Ctrl key as I’m hitting it a lot.

On the other side of the keyboard there is another awkward design choice. The PgUp and PgDn keys on the Dell require the Fn key to be pressed rather than being standalone keys, which makes them, in my opinion, quite useless. There’s even space on the keyboard that could have been used for physical keys like on the Thinkpad, so this seems an odd choice.

When I’m working at home I use a Dell wireless keyboard which is something of an improvement over the laptop keyboard but still not in the same class as, for example, the Microsoft keyboards.

USB-C

The only video output available is USB-C. I bought a Dell monitor with the laptop, but it doesn’t come with a usable cable. Adapter cables are available but they can be expensive, so it seems unfortunate that Dell don’t help their customers here. Technically USB-C is a more flexible connector than, for example, HDMI, but this doesn’t seem to be practically useful with the current range of adapters on the market.

If you do presentations, you will also need to make sure you carry the right kind of adapter with you. Most venues will have adapters for HDMI and DisplayPort, but I have yet to find anywhere that provides USB-C.

Webcam

The webcam is positioned underneath the screen, just above the Esc key. This is a really awkward place for a camera for video conferencing: it means you need to push the screen quite far back, otherwise the picture is of your chest, and even then the angle is quite odd, with your correspondent getting the feeling of looking up your nose.

Performance

Going from a four-year-old laptop to a new one, and switching from an i5 to an i7, I was expecting a serious performance boost. In practical terms, however, that doesn’t seem to have materialised, probably due to thermal throttling. My dmesg is full of messages like:

[1847368.552137] CPU1: Core temperature above threshold, cpu clock throttled (total events = 97580)
[1847368.552138] CPU3: Core temperature above threshold, cpu clock throttled (total events = 97581)
[1847368.552139] CPU2: Package temperature above threshold, cpu clock throttled (total events = 116408)
[1847368.552140] CPU0: Package temperature above threshold, cpu clock throttled (total events = 116405)
[1847368.552142] CPU3: Package temperature above threshold, cpu clock throttled (total events = 116407)
[1847368.552145] CPU1: Package temperature above threshold, cpu clock throttled (total events = 116405)

The fan is also very prone to coming on and is quite loud, so you have to get used to that noise if you’re going to be doing a lot of builds.

Build Quality

Overall the Dell build quality feels weaker than the Lenovo’s. It starts with the body, which is a mix of metal and plastic where the Carbon X1 feels more like a single piece of material. The keys on the Dell are soft and slightly loose, adding to the plasticky feel. The USB-C circuitry also seems to be implicated in some singing capacitors – writing large amounts of output to the screen can be heard as a high-pitched tone.

I’m also not much of a fan of the Dell power plugs (on the left, Lenovo right). For a UK plug they are very thin, and the triangular shape makes them difficult to hold and apply any force to. It’s a small detail, but I would much rather have the functional and solid plug that Lenovo supply.

Conclusion

I’ve raised a number of issues I have with the XPS 13, but it is still a decent laptop that packs a lot of modern tech in at a competitive price. However, it still feels to me like a high-end mid-range laptop rather than a genuine contender against the Thinkpad Carbon X1 or the Apple laptops, and if you have the budget I would definitely recommend the Thinkpad, even if the spec is a little lower.


Debugging memory leaks in Python

Recently I noticed a Python service on our embedded device was leaking memory. It was a long-running service with a number of library dependencies in Python and C, and contained thousands of lines of code. I don’t consider myself a Python expert so I wasn’t entirely sure where to start.

After a quick Google search I turned up a number of packages that looked like they might help, but none of them turned out to be quite what was needed. Some were too basic and just gave an overall summary of memory usage, when what I really wanted was counts of which objects were leaking. Guppy hasn’t been updated in some time and seems quite complex. Pympler seemed to work, but with an excessive amount of runtime overhead that made it impractical on a resource-constrained system. At this point I was close to giving up on the tools and embarking on a time-consuming code review to try to track down the problem. Luckily, before I disappeared down that rabbit hole I came across the tracemalloc package.

Tracemalloc is a package added in Python 3.4 that allows tracing of allocations from Python. It’s part of the standard library, so it should be available anywhere there is a modern Python installation. It has functionality to provide snapshots of object counts on the heap, which was just what I needed. There is some runtime overhead, which results in spikes in CPU time, presumably when walking the heap, but it seems more efficient than the alternatives.

An example of how you can use the tracemalloc package:

import time
import tracemalloc

snapshot = None

def trace_print():
    global snapshot
    snapshot2 = tracemalloc.take_snapshot()
    snapshot2 = snapshot2.filter_traces((
        tracemalloc.Filter(False, "<frozen importlib._bootstrap>"),
        tracemalloc.Filter(False, "<unknown>"),
        tracemalloc.Filter(False, tracemalloc.__file__)
    ))
    
    if snapshot is not None:
        print("================================== Begin Trace:")
        top_stats = snapshot2.compare_to(snapshot, 'lineno', cumulative=True)
        for stat in top_stats[:10]:
            print(stat)
    snapshot = snapshot2

l = []

def leaky_func(x):
    global l
    l.append(x)

if __name__=='__main__':
    i = 0
    tracemalloc.start()
    while True:
        leaky_func(i)
        i += 1
        time.sleep(1)
        if i % 10 == 0:
            trace_print()

This should print a snapshot of the state of the heap every 10 seconds, such as the one below:

leak.py:27: size=576 B (+112 B), count=1 (+0), average=576 B
/usr/lib64/python3.5/posixpath.py:52: size=256 B (+64 B), count=4 (+1), average=64 B
/usr/lib64/python3.5/re.py:246: size=4672 B (+0 B), count=73 (+0), average=64 B
/usr/lib64/python3.5/sre_parse.py:528: size=1792 B (+0 B), count=28 (+0), average=64 B
/usr/lib64/python3.5/sre_compile.py:553: size=1672 B (+0 B), count=4 (+0), average=418 B
/usr/lib64/python3.5/sre_parse.py:72: size=736 B (+0 B), count=4 (+0), average=184 B
/usr/lib64/python3.5/fnmatch.py:70: size=704 B (+0 B), count=4 (+0), average=176 B
/usr/lib64/python3.5/sre_parse.py:524: size=560 B (+0 B), count=1 (+0), average=560 B

The numbers can be interpreted as follows:

  • size, the total size of allocations at this call site
  • count, the total number of allocations at this call site
  • average, the average size of allocations at this call site

The numbers in brackets indicate the amount by which each value increased or decreased since the last snapshot. So to look for our memory leak we would expect to see a positive number in the brackets next to size and count. In the example above we can see a positive size increase next to leak.py line 27, which matches up with our leaky function.

Go Toolchain Primer

A toolchain is a package composed of the compiler and ancillary tools, libraries and runtime for a language, which together allow you to build and run code written in that language. The GNU toolchain is the most commonly used toolchain on Linux and allows building programs written in C, C++, Fortran and a host of other languages too.

gc

The first Go toolchain to be made available, and the one most people are referring to when they talk about Go, is gc. gc (not to be confused with GC, which usually refers to the garbage collector) is the compiler and toolchain which evolved from the Plan 9 toolchain and includes its own compiler, assembler, linker and tools, as well as the Go runtime and standard library. With Go 1.5 the parts of the toolchain that were written in C have been rewritten in Go, so the Plan 9 legacy is gone in terms of code but remains in spirit. The toolchain supports i386, x86_64, arm, arm64 and powerpc64, and the code is BSD licensed.

gccgo

gccgo extends the gcc project to support Go. gcc is a widely used compiler that, along with GNU binutils for the linker and assembler, supports a large number of processor architectures. gccgo currently only supports Go 1.4, but Go 1.5 support is in the works. The advantage of being able to use the gcc compiler infrastructure is that, as well as supporting more processor architectures than gc, gccgo can take advantage of the more advanced middle-end and backend optimizations that gcc has developed over the years, which could lead to faster generated code. The GNU toolchain and gccgo are GPLv3 licensed, which some people may find problematic, but they are what many of the Linux distributions use to support Go on architectures not supported by gc, like SPARC, S/390 or MIPS.

llgo

LLVM is a compiler infrastructure project similar to gcc, with a couple of key differences. Firstly, it was developed from the ground up in C++ at a much later date than gcc, so the code is generally more modern in style and has a clearer structure. Secondly, it is BSD licensed, which has attracted a number of large companies such as Apple and Google to get heavily involved (in fact Apple employs the project’s founder). llgo is a Go compiler built on top of the LLVM compiler infrastructure. It is BSD licensed and supports nearly as many architectures as gccgo, but it feels like a less mature project and fewer people seem to be using it, at least publicly.

Why so many toolchains?

One thing that may appear odd is that all three toolchains are predominantly developed by Google engineers. All the toolchains contain some of the same components – the runtime and libraries are largely shared between the projects, gccgo and llgo share a compiler frontend (language parser), and all the compilers are similarly ahead-of-time and generate native code. Perhaps Google feels that diversity of implementations is good for the language – I would be inclined to agree – and it looks like Google is spending its efforts relatively sparingly on gccgo and llgo, so the development cost of that strategy may not be that high.

I would suggest most people should just stick with gc for their Go development needs, but it will be interesting to see in which directions the various toolchains develop. In later posts I will go into a bit more depth about the relative pros and cons of each.

Channel buffering in Go

Go provides channels as a core concurrency primitive. Channels can be used to pass objects between goroutines safely and to construct quite complex concurrent data structures. However, they are still quite a low-level construct and it is not always clear how best to use them.

I was reminded of this when I saw Evan Huus’s excellent talk on Complex Concurrency Patterns in Go at the Golang UK conference. One of the points he made was that on his team infinite buffering was considered a “code smell”. This is something I find interesting, particularly as someone who has written a bit of Erlang code. In the Erlang concurrency model there are no channels, but every process (processes in Erlang are lightweight, like goroutines) has an input message queue, which is effectively unbounded in size. In Go channels are first-class objects and are created with a finite buffer size, which defaults to zero.

On the face of it the Go approach is appealing. Nobody wants runaway resource allocation in their program and Go provides a useful way of, for example, throttling excessively fast producers in a producer-consumer system to prevent them filling up memory with unprocessed data. But how large should you make channel buffers?

Unbuffered channels are the default in Go but can have some undesirable side effects. For example, a producer and consumer connected by an unbuffered channel can suffer reduced parallelism, as the producer blocks waiting for the consumer to finish working and the consumer blocks waiting for the producer to produce output. In the following code, increasing the channel buffer from zero to one causes a speedup, at least on my machine:

package main

import (
	"fmt"
	"math/rand"
	"time"
)

func producer(ch chan bool) {
	for i := 0; i < 100000; i++ {
		time.Sleep(time.Duration(10 * rand.Float64() * float64(time.Microsecond)))
		ch <- true
	}
	close(ch)
}

func consumer(ch chan bool) {
	for _ = range ch {
		time.Sleep(time.Duration(10 * rand.Float64() * float64(time.Microsecond)))
	}
}

func unbuffered() {
	ch := make(chan bool, 0)
	go producer(ch)
	consumer(ch)
}

func buffered() {
	ch := make(chan bool, 1)
	go producer(ch)
	consumer(ch)
}

func main() {
	startTime := time.Now()
	unbuffered()
	fmt.Printf("unbuffered: %v\n", time.Since(startTime))
	startTime = time.Now()
	buffered()
	fmt.Printf("buffered: %v\n", time.Since(startTime))
}

Unbuffered channels are also more prone to deadlock. It can be argued whether this is a good thing or a bad thing: deadlocks that may not be apparent with buffered channels become visible with unbuffered channels, which allows them to be fixed, and the behaviour of an unbuffered system is generally simpler and easier to reason about.

So if we decide we need to create a buffered channel, how large should that buffer be? That question turns out to be pretty hard to answer. Channel buffers are allocated with malloc when the channel is created, so the buffer size is not an upper bound on the size of the buffer but the actual allocated size – the larger the buffer, the more system memory it will consume and the worse cache locality it will have. This means we can’t just use a very large number and treat the channel as if it were infinitely buffered.

In the example below, which is inspired by an open source project, we have a logging API that internally has a channel that is used to pass log messages from the API entry point to the log backend which writes the log messages to a file or network based on the configuration:

type Logger struct {
	logChan chan string
}

func NewLogger() *Logger {
	return &Logger{logChan: make(chan string, 100)}
}

func (l *Logger) WriteString(str string) {
	l.logChan <- str
}

In this case the channel is providing two things. Firstly, it is providing synchronization so multiple goroutines can call into the API safely at the same time. This is a fairly uncontentious use of a channel. Secondly, it is providing a buffer to prevent callers of the API from blocking unduly when many logging calls are made around the same time – it won’t increase throughput but will help if logging is bursty. But is this a good use of a channel? And is 100 the right size to use?

The two questions are somewhat intertwined in my opinion. Channels make good buffers for small buffer sizes. As mentioned above, the storage for a channel is allocated up front when the channel is created, so a large buffer can be implemented more space-efficiently in other ways. The right size of buffer depends on many things: in this example, the number of goroutines logging, how bursty the logging is, the tolerance for blocking, the size of the machine’s memory and probably many others. For this reason the API should provide a way of setting the size of the channel buffer, so users can adapt it to their needs rather than relying on a hardcoded size.

So a good size for a buffer could be zero, one or some other small number – say five or less – or a variable size that can be set via the API, but it is unlikely, in my opinion, to be a large constant value like 100. Infinite buffer sizes – that is, a buffer limited only by available memory – are sometimes useful and not something we should dismiss altogether, although they are obviously not possible to implement directly with channels as provided by Go. It is possible to create a buffer goroutine with an input channel and an output channel that implements a memory-efficient variable-sized buffer, for example as Evan does here, and this is the best choice if you expect your buffering requirements to be large.

Using SystemTap userspace static probes

One of the new features in glibc 2.19 was a set of SystemTap static probes in the malloc subsystem to allow a better view into its inner workings. SystemTap static probe points expand to only a single nop instruction when not enabled, and each takes a fixed number of arguments which are passed through to your SystemTap probe. I wanted to use these probes to analyze the performance of a malloc workload, so I wrote a SystemTap script to log events in the malloc subsystem.
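
For context, this is roughly what a static probe point looks like from the application side. The sketch below uses the STAP_PROBE macros from sys/sdt.h (packaged as systemtap-sdt-devel on Fedora); the "myapp" provider and "buffer_resize" probe names are made up for illustration, and glibc defines its malloc probes through its own wrapper around the same mechanism.

#include <stdlib.h>
#include <sys/sdt.h>

static char *resize_buffer(char *buf, size_t new_size)
{
    char *p = realloc(buf, new_size);
    /* Expands to a single nop plus an ELF note describing the probe and
     * its two arguments; it costs essentially nothing unless a probe is
     * attached. */
    STAP_PROBE2(myapp, buffer_resize, p, new_size);
    return p;
}

int main(void)
{
    char *buf = resize_buffer(NULL, 4096);
    free(buf);
    return 0;
}

A probe point declared like this can be attached with probe process("./a.out").mark("buffer_resize"), in the same way as the glibc probes used by the script below.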

To get this script to work on Fedora 20 I had to install the git version of SystemTap, as otherwise some of the probes failed to parse their arguments correctly. The script can be run like this:


# stap malloc.stp -c /usr/bin/ls

It’s also possible to run this script against a non-installed version of glibc if you modify the globs in the script to match the path to your libc and run it with the appropriate library path:


# stap malloc.stp -c "env 'LD_LIBRARY_PATH=.../glibc-build:.../glibc-build/nptl' /usr/bin/ls"

The script is very simple and just prints a timestamp, the name of the probe point and the arguments, but I hope someone will find it useful.


probe process("/lib*/libc.so.*").mark("memory_heap_new") {
printf("%d:memory_heap_new heap %x size %dn",
gettimeofday_ms(), $arg1, $arg2)
}

probe process("/lib*/libc.so.*").mark("memory_heap_more") {
printf("%d:memory_heap_more heap %x size %dn",
gettimeofday_ms(), $arg1, $arg2)
}

probe process("/lib*/libc.so.*").mark("memory_heap_less") {
printf("%d:memory_heap_less heap %x size %dn",
gettimeofday_ms(), $arg1, $arg2)
}

probe process("/lib*/libc.so.*").mark("memory_heap_free") {
printf("%d:memory_heap_free heap %x size %dn",
gettimeofday_ms(), $arg1, $arg2)
}

probe process("/lib*/libc.so.*").mark("memory_arena_new") {
printf("%d:memory_arena_new arena %x size %dn",
gettimeofday_ms(), $arg1, $arg2)
}

probe process("/lib*/libc.so.*").mark("memory_arena_reuse_free_list") {
printf("%d:memory_arena_reuse_free_list free_list %xn",
gettimeofday_ms(), $arg1)
}

probe process("/lib*/libc.so.*").mark("memory_arena_reuse_wait") {
printf("%d:memory_arena_reuse_wait mutex %d arena %x avoid_arena %xn",
gettimeofday_ms(), $arg1, $arg2, $arg3)
}

probe process("/lib*/libc.so.*").mark("memory_arena_reuse") {
printf("%d:memory_arena_reuse arena %x avoid_arena %xn",
gettimeofday_ms(), $arg1, $arg2)
}

probe process("/lib*/libc.so.*").mark("memory_arena_retry") {
printf("%d:memory_arena_retry arena %x bytes %dn",
gettimeofday_ms(), $arg2, $arg1)
}

probe process("/lib*/libc.so.*").mark("memory_sbrk_more") {
printf("%d:memory_sbrk_more brk %x change %dn",
gettimeofday_ms(), $arg1, $arg2)
}

probe process("/lib*/libc.so.*").mark("memory_sbrk_less") {
printf("%d:memory_sbrk_less brk %x change %dn",
gettimeofday_ms(), $arg1, $arg2)
}

probe process("/lib*/libc.so.*").mark("memory_malloc_retry") {
printf("%d:memory_malloc_retry bytes %dn",
gettimeofday_ms(), $arg1)
}

probe process("/lib*/libc.so.*").mark("memory_mallopt_free_dyn_thresholds") {
printf("%d:memory_mallopt_free_dyn_thresholds mmap %d trim %dn",
gettimeofday_ms(), $arg1, $arg2)
}

probe process("/lib*/libc.so.*").mark("memory_realloc_retry") {
printf("%d:memory_realloc_retry bytes %d oldmem %xn",
gettimeofday_ms(), $arg1, $arg2)
}

probe process("/lib*/libc.so.*").mark("memory_memalign_retry") {
printf("%d:memory_memalign_retry bytes %d alignment %dn",
gettimeofday_ms(), $arg1, $arg2)
}

probe process("/lib*/libc.so.*").mark("memory_calloc_retry") {
printf("%d:memory_calloc_retry bytes %dn",
gettimeofday_ms(), $arg1)
}

probe process("/lib*/libc.so.*").mark("memory_mallopt") {
printf("%d:memory_mallopt param %d value %dn",
gettimeofday_ms(), $arg1, $arg2)
}

probe process("/lib*/libc.so.*").mark("memory_mallopt_mxfast") {
printf("%d:memory_mallopt_mxfast new %d old %dn",
gettimeofday_ms(), $arg1, $arg2)
}

probe process("/lib*/libc.so.*").mark("memory_mallopt_trim_threshold") {
printf("%d:memory_mallopt_trim_threshold new %d old %d dyn_threshold %dn",
gettimeofday_ms(), $arg1, $arg2, $arg3)
}

probe process("/lib*/libc.so.*").mark("memory_mallopt_top_pad") {
printf("%d:memory_mallopt_top_pad new %d old %d dyn_threshold %dn",
gettimeofday_ms(), $arg1, $arg2, $arg3)
}

probe process("/lib*/libc.so.*").mark("memory_mallopt_mmap_threshold") {
printf("%d:memory_mallopt_mmap_threshold new %d old %d dyn_threshold %dn",
gettimeofday_ms(), $arg1, $arg2, $arg3)
}

probe process("/lib*/libc.so.*").mark("memory_mallopt_mmap_max") {
printf("%d:memory_mallopt_mmap_max new %d old %d dyn_threshold %dn",
gettimeofday_ms(), $arg1, $arg2, $arg3)
}

probe process("/lib*/libc.so.*").mark("memory_mallopt_check_action") {
printf("%d:memory_mallopt_check_action new %d old %dn",
gettimeofday_ms(), $arg1, $arg2)
}

probe process("/lib*/libc.so.*").mark("memory_mallopt_perturb") {
printf("%d:memory_mallopt_perturb new %d old %dn",
gettimeofday_ms(), $arg1, $arg2)
}

probe process("/lib*/libc.so.*").mark("memory_mallopt_arena_test") {
printf("%d:memory_mallopt_arena_test new %d old %dn",
gettimeofday_ms(), $arg1, $arg2)
}

probe process("/lib*/libc.so.*").mark("memory_mallopt_arena_max") {
printf("%d:memory_mallopt_arena_max new %d old %dn",
gettimeofday_ms(), $arg1, $arg2)
}
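
ls doesn’t exercise the allocator very hard, so it can be useful to run the script against a small test program instead. The one below is a made-up example, not part of the original script, that just allocates, touches and frees blocks of varying sizes to produce a steadier stream of heap events:

#include <stdlib.h>
#include <string.h>

int main(void)
{
    /* Allocate and free blocks of varying sizes to trigger heap growth
     * and trim probes in the output. */
    for (int i = 0; i < 1000; i++) {
        size_t size = (size_t)(i % 64 + 1) * 4096;
        char *p = malloc(size);
        if (p != NULL) {
            memset(p, 0, size); /* touch the memory so pages are really used */
            free(p);
        }
    }
    return 0;
}

Compile it and run the script with stap malloc.stp -c ./malloc-test (the binary name is arbitrary).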

Canon PIXMA MG6350 drivers for Linux

I recently bought a Canon PIXMA MG6350 printer for my home office. Before buying it I found Canon had a set of drivers available for Linux, so I assumed it was reasonably well supported. However, the binary packages available from the Canon support site had out-of-date dependencies for Fedora 20 so weren’t installable, but there was a source package available, so I grabbed that.

On the positive side Canon have provided a mostly GPL CUPS printer driver package for Linux, which is to be commended, but unfortunately it doesn’t build out of the box on modern systems and contains a handful of proprietary binary libraries. I spent a bit of time hacking it to build and fixing some compile warnings, and pushed the result to GitHub:

https://github.com/willnewton/cnijfilter

The following commands will build an RPM for the MG6350; you will need to modify them slightly for other printers in the family:


# git archive --prefix=cnijfilter-source-3.80-2/ -o ~/rpmbuild/SOURCES/cnijfilter-source-3.80-2.tar.gz HEAD
# rpmbuild -ba cnijfilter-common.spec --define="MODEL mg6300" --define="MODEL_NUM 408" --with build_common_package

As I mentioned above, there are unfortunately some binary libraries in the package which seem to be essential, and the code quality in general seems pretty poor. A number of compiler warnings remain that point to moderately serious issues in the bits of code in question. There’s a lot of copy-and-paste reuse, and the code is full of fixed-size buffers and dangerous assumptions. It lets me print documents from my laptop so I am not entirely unhappy, although it would be nice if Canon would engage with the community on getting these drivers fully open sourced and integrated properly into CUPS.

setcontext and signal handlers

setcontext is a C library function that, along with getcontext, allows you to perform a non-local jump from one context to another. These functions are often used when implementing coroutines or custom threading libraries. longjmp and setjmp provide similar functionality, but setcontext was an attempt to fix the shortcomings of these functions and standardize behaviour, although in POSIX 2008 the specification of setcontext and related functions was removed due to the difficulty of implementing them in a truly portable manner.
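
As a reminder of what these functions are normally used for, here is a minimal coroutine-style sketch (not yet involving signals): a second context is created with getcontext and makecontext, given its own stack, and switched to and from with swapcontext.

#include <stdio.h>
#include <ucontext.h>

static ucontext_t main_ctx, co_ctx;
static char co_stack[64 * 1024];

static void coroutine(void)
{
    printf("in coroutine\n");
    swapcontext(&co_ctx, &main_ctx);   /* yield back to main */
    printf("coroutine resumed\n");
    /* returning resumes the context in uc_link, i.e. main_ctx */
}

int main(void)
{
    getcontext(&co_ctx);               /* initialise the context structure */
    co_ctx.uc_stack.ss_sp = co_stack;  /* give the coroutine its own stack */
    co_ctx.uc_stack.ss_size = sizeof(co_stack);
    co_ctx.uc_link = &main_ctx;
    makecontext(&co_ctx, coroutine, 0);

    swapcontext(&main_ctx, &co_ctx);   /* run the coroutine until it yields */
    printf("back in main\n");
    swapcontext(&main_ctx, &co_ctx);   /* resume it until it finishes */
    printf("done\n");
    return 0;
}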

In the beginning there were setjmp and longjmp. setjmp captures the current register values into a data structure and longjmp restores those values at a later point in program execution, causing control flow to jump back to the point where setjmp was called. This works fine unless the place you are jumping from is a signal handler. In that case the problem is that longjmp will not restore the signal mask, so the signal you were handling will not be unmasked. To fix this, sigsetjmp and siglongjmp were introduced, which save and restore the signal mask.
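
A minimal sketch of that pattern: the second argument of 1 tells sigsetjmp to save the current signal mask, so that siglongjmp restores it when jumping out of the handler. SIGALRM here is just a stand-in signal.

#include <setjmp.h>
#include <signal.h>
#include <stdio.h>
#include <unistd.h>

static sigjmp_buf env;

static void handler(int sig)
{
    (void)sig;
    siglongjmp(env, 1);  /* jump back to main, restoring the saved signal mask */
}

int main(void)
{
    struct sigaction sa = { 0 };
    sa.sa_handler = handler;
    sigaction(SIGALRM, &sa, NULL);

    if (sigsetjmp(env, 1) == 0) {  /* 1 = save the signal mask */
        alarm(1);                  /* deliver SIGALRM in one second */
        pause();                   /* wait for the signal */
    } else {
        printf("jumped out of the signal handler\n");
    }
    return 0;
}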

However, this doesn’t mean it is necessarily safe to jump out of a signal handler even if you are using siglongjmp. The problem you will often hit is that if the signal was delivered at an arbitrary point in program execution there may be locks held or other global resources that need to be released. The only way to avoid this is to block the appropriate signal across any part of the code that may not behave well when interrupted in this way. Unfortunately, without auditing all third-party libraries, that probably means you can only enable handling of such a signal across very small regions of your program.
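
A minimal sketch of that blocking approach, using sigprocmask to hold back the signal (again SIGALRM as a stand-in) while a critical section runs; in a multi-threaded program pthread_sigmask is the equivalent call:

#include <signal.h>

void critical_section(void)
{
    sigset_t block, old;

    sigemptyset(&block);
    sigaddset(&block, SIGALRM);

    /* Keep SIGALRM from being delivered while we hold locks or touch
     * global state that must not be abandoned by a non-local jump. */
    sigprocmask(SIG_BLOCK, &block, &old);

    /* ... take locks, update shared data structures ... */

    /* Restore the previous mask; a pending SIGALRM is delivered now. */
    sigprocmask(SIG_SETMASK, &old, NULL);
}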

setcontext and getcontext also restore and save the signal mask, and setcontext can be used to exit from a signal handler with the caveats expressed above. However, there is another way in which signal handling and setcontext interact. The context structure used by setcontext and getcontext is of type ucontext_t. If you install your signal handler using sigaction with the SA_SIGINFO flag and the sa_sigaction field of struct sigaction, your handler gets a third argument of type void * which can be cast to ucontext_t *, describing the context that was interrupted by the signal. So what happens if you pass this structure to setcontext?
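
A minimal sketch of how that third argument arrives, with SIGUSR1 standing in for a real signal; whether the commented-out setcontext call would actually do anything sensible is exactly the question below:

#include <signal.h>
#include <ucontext.h>

static void handler(int sig, siginfo_t *info, void *ctx)
{
    /* The interrupted context, saved when the signal was delivered. */
    ucontext_t *uc = (ucontext_t *)ctx;
    (void)sig;
    (void)info;
    (void)uc;
    /* setcontext(uc);  -- unspecified behaviour on most Linux ports, see below */
}

int main(void)
{
    struct sigaction sa = { 0 };
    sa.sa_sigaction = handler;
    sa.sa_flags = SA_SIGINFO;
    sigaction(SIGUSR1, &sa, NULL);
    raise(SIGUSR1);
    return 0;
}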

Well the answer is, on Linux at least, “probably nothing good”. The original definition of setcontext specified that “program execution continues with the program instruction following the instruction interrupted by the signal”, but more recent versions of the standards removed that requirement, so the result is now unspecified. Some glibc ports, such as powerpc, mips and tile, do support restoring signal-handler-created contexts in the spirit of the original specification, but the rest, including x86 and ARM, do not. As such it is not possible to rely on being able to restore a signal-handler-created context with setcontext on Linux. It would be interesting to know if any proprietary Unixes support restoring these contexts and whether any applications actually use the functionality.