• The Counting Man

    She pulled the papery dragon insect from her nose. Its hyper-compressed form exploded instantly into an airy, delicate pattern of scales and space, and it flew around the room before perching on top of the bookshelf. She had been a sick child, and until she figured out that these dragons grew in her nose, life had been confusing and hard. It made it harder that no one, not even her parents, believed in the existence of her dragons, and she had no way to show them. They would emerge in her solitary living space, and fly out the window to hide in the trees to live a shady, verdant life. Did they know something that she did not? It was back in that window when she spent a lot of time wandering around the city. In the reflections of her face in the shop windows she was reminded of being in the present. But her mind mirrored the setting sun and the reflections on the puddles of the water that had not yet come to be. This was her present. She was constantly thinking of her future, and ruminating about her past.

    On a cold December morning, she skipped her flying lessons to pay a visit to the old man that counted things. He lived three paces left of the best smelling bakery in the city, and just underneath one’s nose when the smell of strawberry and dough reached a nutty sweetness that indicates done-ness. If you counted one half a step too far, you would surely miss him. If the pastries burnt, you had already gone too far.

    Nobody knew where he came from, how he persisted, or even how old he was. When the city governor miscalculated the earnings to expenditures of the city, a visit to the old man cleared the digital slate. When a distracted goose misplaced her goslings, the old man was just a waddle away to account for them properly. The girl imagined that he subsisted on the gears turning in his head, the infinite space of numbers that gave him beauty and meaning, and the leftover croissants from the bakery.

    On that afternoon, even the drones were not flying, and the girl was stuck in a while loop unable to let go of her conditions of the past. Thus, her desire to visit the counting man wasn’t to actually perform addition and subtraction of objects, but because she might count on having his company during this time of loneliness. His space was in perfect logical order, and in parallel, in complete unaccounted-for chaos. Today he had arranged a surface of glass shapes, and had removed the ground under them so you might fall to smithereens if you slipped into a margin of error. A wave of chill crashed over the small of her lower back. It was a situation of danger. But when you have a mindset like the girl’s, a risk that might topple some internal barrier presents itself only as an exciting opportunity. And so she stepped forward onto the surface.

    Her moves were cautious at first, pressing gently on the glossy shapes to estimate how well they liked her. But soon, she felt her heart relax, her mind release into the rhythm, and she gave in to allowing the stones to capture the memory of her feet. She did this until exhaustion, and then stepped off of the glass back onto the crisp, well-defined apple earth. Her eyes moved from her exposed toes to the feet, knees, and finally face, of the old counting man.

    “Can you tell me how many?” she asked. Although she had lost track of time, she was confident that she had done it so quickly - and imagined her dance as a well-scoped problem to tackle to ensure that her strategy was robust. She had hit the smaller states of the glass continent by brushing her feet over them in a horizontal motion, and having reassurance of her influence when they glowed and smiled.

    The old man also smiled. “You danced over great depths, and gave the glass much memory today. Perhaps you should come back tomorrow, and I might count then.”

    This practice continued, day after day, and the girl learned to dance genuinely. The shapes would dance with her sometimes too, and change their locations, as shapes often like to do. She approached the same task with a clean pair of feet each day, and a new trick in her mind for how to make sure she covered all the space. It wasn’t until the following December that the old man added a note to the end of their daily routine.

    “You know,” added the old man. “Today you’ve visited me 365 times. I think to make the calculation easier, I’ll call that ‘1.’ Maybe we should do this again sometime?”

    As the girl opened her lungs to release enough air to respond with “Yes, we should!” for the first time in the presence of another, a tiny dragon emerged from her nose. Its tiny body, painted with red and gold, flew to perch on the old counting man’s right ear. A beat of uncertainty punched her in the stomach, and she was both surprised and terrified of exposing her deepest vulnerability. “Such a small spirit,” he responded calmly, “but within it is accrued such vibrancy and hope.” It was then that the girl again found control over the air to expel the message curled in her neck.

    “Yes, I’ve never seen one like that before. And I of course will continue to visit. But I do not wish to count, because I will be dancing.”

    Neither needed to say more. The old man stretched out his pinky, an offering to the tiny dragon to sit on, and he streamed the dragon from his ear toward a tree where he might enjoy watching the leaves grow. Both he and she knew that it wasn’t about the counting at all. The girl first went to visit the old man in hopes that it might alleviate her loneliness, a hidden desire behind searching for the secrets of the bakery. She visited him again because the dancing gave her a parcel of meaning. She continued to visit because she had found herself in the movements of her feet.

    “My dear friend,” the girl said to the counting man one day. “The sun rises and falls, and someone, somewhere, is apologizing to their selves of the past, feeling loss for the dances not done, and destroying the present with rumination about a future that is never truly reached.” She paused, anticipating some sign that he knew that this insight could only come from personal experience. “Is this an optimal way to live one’s life?”

    “It might be, for some,” he responded. “But for you, you just keep dancing.”

    It was only when the girl stopped counting that she realized that she could count on the things that gave her meaning, and old man time would manage the rest. And little did she know, he had come to count on her too.


  • Containers for Academic Products

    Why do we do research? Arguably, the products of academia are discoveries and ideas that further the human condition. We want to learn things about the world to satisfy human curiosity and need for innovation, but we also want to learn how to be happier, healthier, and generally maximize the goodness of life experience. For this reason, we give a parcel of our world’s resources to people with the sole purpose of producing research in a particular domain. We might call this work the academic product.

    The Lifecycle of the Academic Product

    Academic products typically turn into manuscripts. The manuscripts are published in journals, ideally the article has a few nice figures, and once in a while the author takes it upon him or herself to include a web portal with additional tools, or a code repository to reproduce the work. Most academic products aren’t so great: they get a few reads and then join the pile of syntactical garbage that is a large portion of the internet. For another subset, however, the work is important. If it’s important and impressive, people might take notice. If it’s important but not so impressive, there is the terrible reality that these products go to the same infinite dumpster, but they don’t belong there. This is definitely an inefficiency, so let’s step back a second here and think about how we can do better. First, let’s break down these academic product things, and start with some basic questions:

    • What is the core of an academic product?
    • What does the ideal academic product look like?
    • Why aren’t we achieving that?

    What is the core of an academic product?

    This is going to vary a bit, but for most of the work I’ve encountered, there is some substantial analysis that leads to a bunch of data files that should be synthesized. For example, you may run different processing steps for images on a cluster, or permutations of a statistical test, and then output some compiled thing to do (drumroll) your final test. Or maybe your final thing isn’t a test, but a model that you have shown can work with new data. And then you share that result. It probably can be simplified to this:

    [A|data] --> [B|preprocessing and analysis] --> [C|model or result] --> [D|synthesize/visualize] --> [E|distribute]

    We are moving A data (the behavior we have measured, the metrics we have collected, the images we have taken) through B preprocessing and analysis (some tube to handle noise, use statistical techniques to say intelligent things about it) to generate C (results or a model) that we must intelligently synthesize, meaning visualization or explanation (D) by way of story (ahem, manuscript) and this leads to E, some new knowledge that improves the human condition. This is the core of an academic product.
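
    To make the A through E flow concrete, here is a minimal sketch in Python. Everything in it is hypothetical and invented for illustration (the function names, the toy data, the choice of a mean as the "result"); it only shows how each stage can be a small, replaceable step that hands its output to the next one.

```python
from statistics import mean


def preprocess(data):
    """B: handle noise. Here that just means dropping missing values."""
    return [value for value in data if value is not None]


def analyze(clean):
    """C: produce a result. A mean stands in for a real model or test."""
    return {"n": len(clean), "mean": mean(clean)}


def synthesize(result):
    """D: turn the result into something a human can read."""
    return f"mean of {result['n']} observations: {result['mean']:.2f}"


def run_pipeline(data):
    """A -> B -> C -> D. Distributing the output (E) is left to the caller."""
    return synthesize(analyze(preprocess(data)))


if __name__ == "__main__":
    # A: toy measurements, with one missing value
    print(run_pipeline([1.0, None, 2.0, 3.0]))
```

    Because each stage is a plain function, feeding new data into `run_pipeline` updates the result without touching the stages themselves, which is the "freely flowing pipe" idea in miniature.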

    What does the ideal academic product look like?

    In an ideal world, the above would be a freely flowing pipe. New data would enter the pipe that matches some criteria, and flow through preprocessing, analysis, a result, and then an updated understanding of our world. In the same way that we subscribe to social media feeds, academics and people alike could subscribe to hypotheses, and get an update when the status of the world changes. Now we move into idealistic thought that this (someday) could be a reality if we improve the way that we do science. The ideal academic product is a living thing. The scientist is both the technician and artist to come up with this pipeline, make a definition of what the input data looks like, and then provide it to the world.

    The entirety of this pipeline can be modular, meaning running in containers that include all of the software and files necessary for the component of the analysis. For example, steps A (data) and B (preprocessing and analysis) are likely to happen in a high performance computing (HPC) environment, and you would want your data and analysis containers run at scale there. There is a lot of discussion going on about using local versus “cloud” resources, and I’ll just say that it doesn’t matter. Whether we are using a local cluster (e.g., SLURM) or Google Cloud, these containers can run in both. Other scientists can also use these containers to reproduce your steps. I’ll point you to Singularity, and you can follow along with us at researchapps for a growing example of using containers for scientific compute, along with other things.

    For the scope of this post, we are going to be talking about how to use containers for the middle to tail end of this pipeline. We’ve completed the part that needs to be run at scale, and now we have a model that we want to perhaps publish in a paper, and provide for others to run on their computer.

    Web Servers in Singularity Containers

    Given that we can readily digest things that are interactive or visual, and given that containers are well suited for including much more than a web server (e.g., software dependencies like Python or R to run an analysis or model that generates a web-based result) I realized that sometimes my beloved Github pages or a static web server aren’t enough for reproducibility. So this morning I had a rather silly idea. Why not run a webserver from inside of a Singularity container? Given the seamless nature of these things, it should work. It did work. I’ve started up a little repository https://github.com/vsoch/singularity-web of examples to get you started, just what I had time to do this afternoon. I’ll also go through the basics here.

    How does it work?

    The only prerequisite is that you install Singularity. Singularity is already available at just over 40 supercomputer centers all over the place. How is this working? We basically follow these steps:

    1. create a container
    2. add files and software to it
    3. tell it what to run

    In our example here, at the end of the analysis pipeline we are interested in containing things that produce a web-based output. You could equally imagine using a container to run and produce something for a step before that. You could go old school and do this on a command by command basis, but I (personally) find it easiest to create a little build file to preserve my work, and this is why I’ve pushed this development for Singularity, and specifically for it to look a little bit “Dockery,” because that’s what people are used to. I’m also a big fan of bootstrapping Docker images, since there are ample around. If you want to bootstrap something else, please look at our folder of examples.

    The Singularity Build file

    The containers are built from a specification file called Singularity, which is just a stupid text file with different sections that you can throw around your computer. It has two parts: a header, and then sections (%runscript,%post). Actually there are a few more, mostly for more advanced usage that I don’t need here. Generally, it looks something like this:

    Bootstrap: docker
    From: ubuntu:16.04

    %runscript
         exec /usr/bin/python "$@"

    %post
         apt-get update
         apt-get -y install python

    Let’s talk about what the above means.

    The Header

    The first line, Bootstrap, says that we are going to bootstrap a docker image, specifically using the (From field) ubuntu:16.04. What the heck is bootstrapping? It means that I’m too lazy to start from scratch, so I’m going to start with something else as a template. Ubuntu is an operating system; instead of starting with nothing, we are going to dump that into the container and then add stuff to it. You could choose another distribution that you like; I just happen to like Debian.


    %post

    Post is the section where you put commands you want to run once to create your image. This includes:

    • installation of software
    • creation of files or folders
    • moving data, files into the container image
    • analysis things

    The list is pretty obvious, but what about the last one, analysis things? Yes, let’s say that we had a script thing that we wanted to run just once to produce a result that would live in the container. In this case, we would have that thing run in %post, and then give some interactive access to the result via the %runscript. In the case that you want your image to be more like a function and run the analysis (for example, if you want your container to take input arguments, run something, and deliver a result), then this command should go in the %runscript.


    %runscript

    The %runscript is the thing executed when we run our container. For this example, we are having the container execute python, with whatever input arguments the user has provided (that’s what the weird $@ means). And note that the command exec basically hands the current running process to this python call.
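
    If the exec and $@ combination feels abstract, the same idea can be sketched in Python on a POSIX system. This is an illustration only, not how Singularity itself is implemented: the `WRAPPER` text and the `run_wrapper` helper are made-up names. A small wrapper program prints its process ID, then uses `os.execv` to replace itself with a fresh interpreter, forwarding whatever arguments it was given.

```python
import subprocess
import sys

# A tiny wrapper program. It prints its own PID, then replaces itself with
# a new interpreter and forwards its arguments, which is the Python
# analogue of the shell line: exec python "$@"
WRAPPER = """
import os, sys
print(os.getpid(), flush=True)
report = "import os, sys; print(os.getpid()); print(' '.join(sys.argv[1:]))"
os.execv(sys.executable, [sys.executable, "-c", report] + sys.argv[1:])
"""


def run_wrapper(*user_args):
    """Run the wrapper with user_args; return (pid_before, pid_after, args)."""
    result = subprocess.run(
        [sys.executable, "-c", WRAPPER, *user_args],
        capture_output=True,
        text=True,
        check=True,
    )
    pid_before, pid_after, forwarded = result.stdout.splitlines()
    return int(pid_before), int(pid_after), forwarded.split()


if __name__ == "__main__":
    before, after, forwarded = run_wrapper("--flag", "value")
    print("same process:", before == after)
    print("forwarded arguments:", forwarded)
```

    The PID printed before and after the exec is the same because no new process is created; the wrapper is simply replaced by the program it hands off to, and the user's arguments arrive untouched.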

    But you said WEB servers in containers

    Ah, yes! Let’s look at what a Singularity file would look like that runs a webserver, here is the first one I put together this afternoon:

    Bootstrap: docker
    From: ubuntu:16.04

    %runscript
         cd /data
         exec python3 -m http.server 9999

    %post
         mkdir /data
         echo "<h2>Hello World!</h2>" >> /data/index.html
         apt-get update
         apt-get -y install python3

    It’s very similar, except instead of exposing python, we are using python to run a local webserver, for whatever is in the /data folder inside of the container. For full details, see the nginx-basic example. We change directories to data, and then use python to start up a little server on port 9999 to serve that folder. Anything in that folder will then be available to our local machine on port 9999, meaning the address localhost:9999.
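
    You can try the serving step without a container at all. The sketch below uses only the Python standard library to do roughly what the runscript does: it writes an index.html into a temporary folder (standing in for /data), serves that folder, and fetches the page back. The helper names are hypothetical, and it asks the operating system for any free port rather than using 9999 so it cannot collide with something already running.

```python
import functools
import http.client
import http.server
import tempfile
import threading
from pathlib import Path


def serve_directory(directory, port=0):
    """Serve `directory` over HTTP in a background thread (port 0 = any free port)."""
    handler = functools.partial(
        http.server.SimpleHTTPRequestHandler, directory=str(directory)
    )
    server = http.server.ThreadingHTTPServer(("127.0.0.1", port), handler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def fetch_index_from_temp_server():
    """Create a stand-in for /data, serve it, and return the page body."""
    with tempfile.TemporaryDirectory() as data:
        Path(data, "index.html").write_text("<h2>Hello World!</h2>\n")
        server = serve_directory(data)
        try:
            port = server.server_address[1]
            connection = http.client.HTTPConnection("127.0.0.1", port, timeout=10)
            connection.request("GET", "/")
            body = connection.getresponse().read().decode()
            connection.close()
            return body
        finally:
            server.shutdown()
            server.server_close()


if __name__ == "__main__":
    print(fetch_index_from_temp_server())
```

    The container adds nothing magical to this step; what it contributes is that the Python install, the files, and the command to run all travel together in one image.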



    The nginx-basic example will walk you through what we just talked about, creating a container that serves static files, either within the container (files generated at time of build and served) or outside the container (files in a folder bound to the container at run time). What is crazy cool about this example is that I can serve files from inside of the container, perhaps produced at container generation or runtime. In this example, my container image is called nginx-basic.img, and by default it’s going to show me the index.html that I produced with the echo command in the %post section:

    Serving HTTP on port 9999 ...

    or I can bind a folder on my local computer with static web files (the . refers to the present working directory, and -B or --bind are the Singularity bind parameters) to my container and serve them the same!

    singularity run -B .:/data nginx-basic.img 

    The general format is either:

    singularity [run/shell] -B <src>:<dest> nginx-basic.img 
    singularity [run/shell] --bind <src>:<dest> nginx-basic.img 

    where <src> refers to the local directory, and <dest> is inside the container.


    The nginx-expfactory example takes software that I published in graduate school and shows an example of how to wrap a bunch of dependencies in a container, and then allow the user to use it like a function with input arguments. This is a super common use case for science publication type things - you want to let someone run a model / analysis with custom inputs (whether data or parameters), meaning that the container needs to accept input arguments and optionally run / download / do something before presenting the result. This example shows how to build a container to serve the Experiment Factory software, and let the user execute the container to run a web-based experiment:

    ./expfactory stroop
    No battery folder specified. Will pull latest battery from expfactory-battery repo
    No experiments, games, or surveys folder specified. Will pull latest from expfactory-experiments repo
    Generating custom battery selecting from experiments for maximum of 99999 minutes, please wait...
    Found 57 valid experiments
    Found 9 valid games
    Preview experiment at localhost:9684


    Finally, nginx-jupyter fits nicely with the daily life of most academics and scientists that like to use Jupyter Notebooks. This example will show you how to put the entire Jupyter stuffs and python in a container, and then run it to start an interactive notebook in your browser.

    The ridiculously cool thing in this example is that when you shut down the notebook, the notebook files are saved inside the container. If you want to share it? Just send over the entire thing! The other cool thing? If we run it this way:

    sudo singularity run --writable jupyter.img

    Then the notebooks are stored in the container at /opt/notebooks (or a location of your choice, if you edit the Singularity file). For example, here we are shelling into the container after shutting it down, and peeking. Are they there?

      singularity shell jupyter.img 
      Singularity: Invoking an interactive shell within container...
      Singularity.jupyter.img> ls /opt/notebooks

    Yes! And if we run it this way:

    sudo singularity run -B $PWD:/opt/notebooks --writable jupyter.img

    We get the same interactive notebook, but the files are plopping down into our present working directory $PWD, which you now have learned is mapped to /opt/notebooks via the bind command.

    How do I share them?

    Speaking of sharing these containers, how do you do it? You have a few options!

    Share the image

    If you want absolute reproducibility, meaning that the container that you built is set in stone, never to be changed, and you want to hand it to someone, have them install Singularity and send them your container. This means that you just build the container and give it to them. It might look something like this:

          sudo singularity create theultimate.img
          sudo singularity bootstrap theultimate.img Singularity

    In the example above I am creating an image called theultimate.img and then building it from a specification file, Singularity. I would then give someone the image itself, and they would run it like an executable, which you can do in many ways:

          singularity run theultimate.img

    They could also shell into it to look around, with or without sudo to make changes (breaks reproducibility, your call, bro).

          singularity shell theultimate.img
          sudo singularity shell --writable theultimate.img

    Share the build file Singularity

    In the case that the image is too big to attach to an email, you can send the user the build file Singularity and he/she can run the same steps to build and run the image. Yeah, it’s not the exact same thing, but it’s captured most dependencies, and provided that you are using versioned packages and inputs, you should be pretty ok.

    Singularity Hub

    Also under development is a Singularity Hub that will automatically build images from the Singularity files upon pushes to connected Github repos. This will hopefully be offered to the larger community in the coming year, 2017.

    Why aren’t we achieving this?

    I’ll close with a few thoughts on our current situation. A lot of harshness has come down in the past few years on the scientific community, especially Psychology, for not practicing reproducible science. Having been a technical person and a researcher, my opinion is that it’s asking too much. I’m not saying that scientists should not be accountable for good practices. I’m saying that without good tools and software, doing these practices isn’t just hard, it’s really hard. Imagine if a doctor wasn’t just required to understand his specialty, but had to show up to the clinic and build his tools and exam room. Imagine if he also had to cook his lunch for the day. It’s funny to think about this, but this is sort of what we are asking of modern day scientists. They must not only be domain experts, manage labs and people, write papers, plead for money, but they also must learn how to code, make websites and interactive things, and be linux experts to run their analyses. And if they don’t? They probably won’t have a successful career. If they do? They probably still will have a hard time finding a job. So if you see a researcher or scientist this holiday season? Give him or her a hug. He or she has a challenging job, and is probably making a lot of sacrifices for the pursuit of knowledge and discovery.

    I had a major epiphany during the final years of my graduate school that the domain of my research wasn’t terribly interesting, but rather, the problems wrapped around doing it were. This exact problem that I’ve articulated above - the fact that researchers are spread thin and not able to maximally focus on the problem at hand - is a problem that I find interesting, and I’ve decided to work on. The larger problem, that tools for researchers are lacking because it’s not a domain that makes money, and that there is an entire layer of research software engineers missing from campuses across the globe, is also something that I care a lot about. Scientists would have a much easier time giving us data pipes if they were provided with infrastructure to generate them.

    How to help?

    If you use Singularity in your work, please comment here, contact me directly, or write to researchapps@googlegroups.com so I can showcase your work! Please follow along with the open source community to develop Singularity, and if you are a scientist, please reach out to me if you need applications support.


  • Tweezers

    It was around my 13th birthday when my mom gave me the gift of self-scrutiny. It was an electronic makeup mirror, a clone of her own, that lit up and magnified my face from both sides with an array of daylight themes. Did I want an evening tone? There was soft blue for that. Morning? A yellow, fresh hue. With this mirror came my very own pair of tweezers - the most important tool to commence the morning ritual of plucking away hairs and insecurities. I had watched her do it for my entire childhood, and it was calming in a way. She would be fresh from bath or shower, sometimes in a kerchief, sometimes with hair tied back, and sit down in a tiny chair in front of a mirror parked front and center at a little table with an opening for her legs and a three paneled mirror behind that. If it were a painting, it could be called a triptych. She would first embark on the arduous task of removing any deviant hairs. I remember my hopes for an equivalently flawless and hairfree nose line were shattered when she told me that if I plucked for 20 or more years, they just wouldn’t grow back. That’s a long time when you are 13. At some point she would apply a palm’s dabble of an Estee Lauder white cream - a smooth, sweet smelling cream that I watched in my lifetime go from $40 to $44 a bottle, and likely is now higher. It was, she told me, the secret to her flawless skin. I’m now more convinced it’s some Puerto Rican gene, but it’s a moot point. This cream had a shiny, rounded gold screw top that accented the rectangular, glassy white bottle. The version made years ago had a subtle cream tint to it, and the more modern one is a silkier white. I stopped using it in favor of a non-scented $3.99 bottle of face cream from Trader Joe’s at the onset of graduate school when my allergic-ness kicked in, and the fancy stuff started to give me rashes.

    But back to that memory. The smell of the cream and removal of hairs from nose, brow, and (later in her life) chin was followed by the opening of the makeup drawer. What treasures lay inside! She had brow pencils, an entire section of lipsticks, terrifying black wands that I learned to be mascara, and lip liners. She would first apply a fine line of some kind of “under eye” concealer, to hide any appearance of bags from stress or tiredness. My Mom had (and still has) amazing nails. They were long and clean, and she could jab a nail into the creamy beige stick, and then swipe it under each eye to get a perfect line. A quick rub then would dissolve the line into her skin. The eyes were then complete with a swipe of a dark liner and shadow. The process of applying lipstick was my favorite to watch, because it was like a coloring book on your face, and I could always smell the weird scent of the stuff. She would draw a line cleanly around the lip, sometimes too far over the top, and then color it in like a master painter. It was a brave commitment, because it meant that the lipstick would likely wear off toward the end of the day, leaving only the liner. Never to fear! This is why a lipstick was never far away in a purse, and a husband or child close by to ask about the appearance of the liner. She told me many times growing up that her mother never wore lipstick, and told her it (in more proper words) made her too becoming to men. When I refused to touch it, she told me that it was one of those things that skips generations. I’ll never have a daughter so I don’t have to find out, but I can imagine there is a strong propensity to not do some of the things that you observe your Mom to do. When the face and makeup were complete, the hair would fall out of the kerchief, and my Mom was flawless. Me on the other hand, well here I was in all my unibrow glory.

    And there I was. I put my little setup on the floor of my childhood room, and plugged the mirror into the wall and directly under my bulletin board scattered with childhood photos and Hello Kitty lights. I would finish my two Eggo waffles, cut into perfect squares with just the right amount of syrup in the crevices, and go straight up the stairs to start my own ritual. Coming armed with the contribution of my Dad’s gene pool, I had much thicker and darker lashes and eyebrows, and so my uni-brow was extra Frida like. It’s funny how before adolescence, I had never really noticed it, and then it was immediately a marker of some kind of gross, ethnic hair. I painstakingly removed it, many times plucking out too many hairs. The damn caterpillar always grew back in full force, and the voice of my mom… “20 years…” repeated in my head. As for makeup, it never clicked with me. I went through an eye shadow phase in 8th grade, and went nuts with colors, dusts, and creams, and this was also when eye glitter from the store “The Icing” was the rage. The silly little screw-top bottles had this strange goo inside that would completely dry up when left even the slightest bit open. The nicer ones I had were “roll on” style, usually the “Smackers” brand. I still have some of those chapsticks and various teenager “makeup” - they are sitting in my apartment organized in a plastic bin from college one room over. It’s really the texture and scent that I kept them for - it’s an immediate warp back into the past. As for eye mascara, my first (and last) experience was with some sort of waterproof version, which I tried from my friend Kara at a sports club called Hampshire Hills in middle school. I didn’t know about makeup remover, and in sheer horror when I couldn’t get it off and resorted to using fingers to pull out my eyelashes, I never touched it again. I still won’t.

    When I entered college, I didn’t bring that mirror, and was long over any kind of makeup beyond chapstick and tinted pimple cream. “Pimples” I’ve come to learn are actually pretty unlikely for many, but brought on by anxiety and some maladaptive desire to pick at imperfections. It’s a constant struggle, but I’m getting better! I brought those tweezers along, and every few years my Mom would buy me a new pair - always a really nice pair - because the ones in the drug store weren’t suitable for the task. Did I stick to my duty? Sort of, kind of. A combination of laziness and lacking setup for my ritual led to the abolishment of the daily plucking in exchange for a weekly or bi-weekly grooming. I went through phases of forgetting, and finally, not really caring, and only doing some touch up if there was a formal event like a graduation or wedding.

    I just turned 30. It’s now been (almost) those 20 years (and well, I haven’t been true to my duty), but the hairs always came back. At some point in graduate school, I just stopped caring. The higher level awareness of this entire experience is the realization that this focus, this kind of ritual that derives self-worth from obtaining a particular appearance and going to extremes to achieve it, is just ridiculous. Nobody has to do this crap. It took me years to stumble on this; as a matter of fact, I only derived this insight when I found my groove, and I found my meaning, both personally and in the things I am passionate about doing. Maybe it is some kind of “growing up” milestone, but if we had to put eggs in baskets to correspond with how we evaluate ourselves, I just don’t care to put any eggs in the “how do my eyebrows look right now?” and “should I gussy up a bit?” baskets. It’s much more fun to be silly, express your ideas and personality with intensity, and it doesn’t matter if the hairs on your face are groomed or not. When I’m tired because I’ve worked hard in a day, you are going to see that, and it’s just going to be that. As I age, you are going to know and see me, and not some mask that I hide over me to meet some idealistic plastic version of myself. Everyone else is too busy with their own insecurity and heavy self-awareness to care. This self-realization that sort of happened, but came to awareness after the fact, has been empowering on many levels. I can look at the tweezer in my bathroom, have awareness of a fast fading ritualistic memory, and then just walk away. And now I want the opposite - I am rooting for every hair on my face to grow back in all its little glory!

    This post is dedicated to my Mom. As a grown up now myself, I wish, in retrospect, that she would have seen and appreciated her beauty without the makeup on - because it seems too easy to get used to a red lip, or a shadowed eye, and feel lacking without them. Mom - the times when you didn’t want to be seen outside with your Pajamas on, when you checked your lipstick in the car mirror just to be safe, or when you put on makeup in the middle of the night because you felt vulnerable, you didn’t need to do that. Radiating your inner energy and joy would have been the light that blinded people from any imperfection in your skin that was possible to see. I know that I will be loved, and I will be the same, irrespective of these things. It is not the look of a face that gives inner beauty, and so to close this post, I do not wish to validate the societal standard of evaluating women based on looks by saying something about looks. Mom - with and without your makeup, with and without your insecurities, you are beautiful. As your hair gets white, and your face more aged, I hope you can really feel that. I know that I do.


  • Python Environments, A User Guide

    Do you want to run Python? I can help you out! This documentation is specific to the farmshare2 cluster at Stanford, on which there are several versions of python available. The python convention is that python v2 is called ‘python’, and python v3 is called ‘python3’. They are not directly compatible, and in fact can be thought of as entirely different software.

    How do I know which python I’m calling?

    Like most Linux software, when you issue a command, your shell looks through the folders listed in a variable called $PATH, in order, and runs the first executable it finds with that name. The same is true for python and python3. Let’s take a look at some of the defaults:

    # What python executable is found first?
    rice05:~> which python
    # What version of python is this?
    rice05:~> python --version
    Python 2.7.12
    # And what about python3?
    rice05:~> which python3
    # And python3 version
    rice05:~> python3 --version
    Python 3.5.2
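
    You can ask the same questions from inside Python. Here is a minimal sketch (the function name describe_interpreter is just something I made up for illustration) that reports which interpreter is running and what the names python and python3 resolve to on your $PATH. It assumes you run it with python3, since shutil.which is not available in python 2.7.

```python
import shutil
import sys


def describe_interpreter():
    """Return basic facts about the interpreter running this code."""
    return {
        # Full path of the interpreter that is running right now
        "executable": sys.executable,
        # Version string such as "3.5.2"
        "version": "%d.%d.%d" % sys.version_info[:3],
        # First match for each name on $PATH, or None if nothing is found
        "python_on_path": shutil.which("python"),
        "python3_on_path": shutil.which("python3"),
    }


if __name__ == "__main__":
    for key, value in describe_interpreter().items():
        print(key, "=", value)
```

    If "executable" and "python3_on_path" differ, the python3 you get by typing the command is not the one running your script.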

    This is great, but what if you want to use a different version? As a reminder, most clusters like Farmshare2 come with packages and modules, and you can also install your own custom software (here’s a refresher if you need it). Let’s talk about the different options for extending the provided environments, or creating your own environment. First, remember that for all of your scripts, the first line instructs what executable to use. So make sure to have this at the top of your script:

    #!/usr/bin/env python

    Now, what to do when the default python doesn’t fit your needs? You have many choices:

    1. Install to a User Library if you want to continue using a provided python, but add a module of your choice to a personal library
    2. Install a conda environment if you need standard scientific software modules, and don’t want the hassle of compiling and installing them.
    3. Create a virtual environment if you want more control over the version and modules

    1. Install to a User Library

    The reason that you can’t install to the shared python or python3 is because you don’t have access to the site-packages folder, which is where the modules are looked for automatically by python. But don’t despair! You can install to your (very own) site-packages by simply appending the --user argument to the install command. For example:

    # Install the pokemon-ascii package
    pip install pokemon --user
    # Where did it install to?
    rice05:~> python
    Python 2.7.12 (default, Jul  1 2016, 15:12:24) 
    [GCC 5.4.0 20160609] on linux2
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import pokemon
    >>> pokemon.__file__

    As you can see above, your --user packages install to a site packages folder for the python version under .local/lib. You can always peek into this folder to see what you have installed.

    rice05:~> ls $HOME/.local/lib/python2.7/site-packages/
    nibabel			 pokemon		      virtualenv.py
    nibabel-2.1.0.dist-info  pokemon-0.32.dist-info       virtualenv.pyc
    nisext			 virtualenv-15.0.3.dist-info  virtualenv_support
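
    If you don’t want to guess the folder name, the standard library site module can tell you where --user installs are expected to go for the python you are running. This is a small sketch (user_site_info is a made-up helper name); note that some virtual environments disable or replace the user site, so treat the result as a hint.

```python
import site
import sys


def user_site_info():
    """Report where `pip install --user` packages are expected to live."""
    user_site = site.getusersitepackages()
    return {
        # Base folder for user installs, typically ~/.local on Linux
        "user_base": site.getuserbase(),
        # The site-packages folder under that base for this python version
        "user_site": user_site,
        # False or None means the user site is turned off for this interpreter
        "enabled": bool(site.ENABLE_USER_SITE),
        # Is the folder actually being searched right now?
        "on_sys_path": user_site in sys.path,
    }


if __name__ == "__main__":
    for key, value in user_site_info().items():
        print(key, "=", value)
```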

    You probably now have two questions.

    1. How does python know to look here, and
    2. How do I check what other folders are being checked?

    How does Python find modules?

    You can look at the sys.path variable, a list of paths on your machine, to see where Python is going to look for modules:

    rice05:~> python
    Python 2.7.12 (default, Jul  1 2016, 15:12:24) 
    [GCC 5.4.0 20160609] on linux2
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import sys
    >>> sys.path
    ['', '/usr/lib/python2.7', '/usr/lib/python2.7/plat-x86_64-linux-gnu', '/usr/lib/python2.7/lib-tk', '/usr/lib/python2.7/lib-old', '/usr/lib/python2.7/lib-dynload', '/home/vsochat/.local/lib/python2.7/site-packages', '/usr/local/lib/python2.7/dist-packages', '/usr/lib/python2.7/dist-packages']

    Above we can see that the standard library folders are searched before your user folder (.local), and your user folder is searched before the system dist-packages. So if a module with the same name exists in the standard library, it wins over the one you installed with --user. Did you notice that the first entry is an empty string? This means that your present working directory will be searched first. If you have a file called pokemon.py in this directory and then you do import pokemon, it’s going to use the file in the present working directory.
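
    When you aren’t sure which copy of a module would win, you can ask Python without importing it. This sketch uses importlib.util.find_spec from the python3 standard library; the helper name where_would_import_come_from is mine, not something provided by Python.

```python
import importlib.util


def where_would_import_come_from(module_name):
    """Return the file a top-level module would be loaded from, or None.

    The module itself is not imported, so this is safe to use for a quick check.
    """
    spec = importlib.util.find_spec(module_name)
    if spec is None:
        return None
    return spec.origin


if __name__ == "__main__":
    for name in ["json", "pokemon", "numpy"]:
        print(name, "->", where_would_import_come_from(name))
```

    A result of None means no folder in sys.path has that module, which is usually the first thing to rule out when an import fails.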

    How can I dynamically change the paths?

    The fact that these paths are stored in a variable means that you can dynamically add / tweak paths in your scripts. For example, when I fire up python3 and load numpy, it is loaded from the first folder in sys.path that contains it:

    >>> import numpy
    >>> numpy.__path__

    And I can change this behavior by removing or appending paths to this list before importing. Additionally, you can add paths to the environmental variable $PYTHONPATH to add folders with modules (read about PYTHONPATH here). First you add the folder to the variable:

    # Here is setting an environment variable with csh
    rice05:~> setenv PYTHONPATH /home/vsochat:$PYTHONPATH
    # And here with bash
    rice05:~> export PYTHONPATH=/home/vsochat:$PYTHONPATH
    # Did it work?
    rice05:~> echo $PYTHONPATH

    Now when we run python, we see the path has been added near the beginning of sys.path, right after the empty string:

    rice05:~> python
    Python 2.7.12 (default, Jul  1 2016, 15:12:24) 
    [GCC 5.4.0 20160609] on linux2
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import sys
    >>> sys.path
    ['', '/home/vsochat', '/usr/lib/python2.7', '/usr/lib/python2.7/plat-x86_64-linux-gnu', '/usr/lib/python2.7/lib-tk', '/usr/lib/python2.7/lib-old', '/usr/lib/python2.7/lib-dynload', '/home/vsochat/.local/lib/python2.7/site-packages', '/usr/local/lib/python2.7/dist-packages', '/usr/lib/python2.7/dist-packages']
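
    Changing sys.path from inside a script works the same way. Here is a self-contained sketch that writes a tiny module into a temporary folder, puts that folder at the front of sys.path, and imports it. The module name my_demo_module is hypothetical and only exists for this example.

```python
import importlib
import os
import sys
import tempfile

# Make a throwaway folder containing a tiny module
folder = tempfile.mkdtemp()
with open(os.path.join(folder, "my_demo_module.py"), "w") as handle:
    handle.write("GREETING = 'hello from the temp folder'\n")

# Put the folder at the front of the search list, before importing
sys.path.insert(0, folder)
# Make sure the import system notices the newly written file
importlib.invalidate_caches()

import my_demo_module

print(my_demo_module.GREETING)
print(my_demo_module.__file__)
```

    Using insert(0, ...) makes your folder win over everything else, while append(...) makes it the last resort.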


    How do I see more information about my modules?

    You can look to see if a module has a __version__, a __path__, or a __file__, each of which will tell you details that you might need for debugging. Keep in mind that not every module has a version defined.

    rice05:~> python
    Python 2.7.12 (default, Jul  1 2016, 15:12:24) 
    [GCC 5.4.0 20160609] on linux2
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import numpy
    >>> numpy.__version__
    >>> numpy.__file__
    >>> numpy.__path__
    >>> numpy.__dict__

    If you are really desperate for seeing what functions the module has available, take a look at (for example, for numpy) numpy.__dict__.keys(). While this doesn’t work on the cluster, if you load a module in IPython you can press TAB to autocomplete for available options, and add a single or double _ to see the hidden ones like __path__.
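
    Since not every module defines all of these attributes, it can be handy to collect them with defaults instead of hitting an AttributeError. A rough sketch (module_details is a made-up helper, and I use the standard library json module so it runs anywhere):

```python
import importlib


def module_details(module_name):
    """Import a module and gather the attributes that help with debugging."""
    module = importlib.import_module(module_name)
    return {
        # Not every module defines a version, so fall back to None
        "version": getattr(module, "__version__", None),
        # File the module was loaded from (missing for some built-ins)
        "file": getattr(module, "__file__", None),
        # Only packages (folders) have a __path__
        "path": list(getattr(module, "__path__", [])),
        # Names that don't start with an underscore
        "public_names": [name for name in dir(module) if not name.startswith("_")],
    }


if __name__ == "__main__":
    details = module_details("json")
    print(details["file"])
    print(details["public_names"][:5])
```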

    How do I ensure that my package manager is up to date?

    We’ve hit a conundrum! How does one “pip install pip”? And further, how do we ensure we are using the pip version associated with the currently active python? The same way that you would upgrade any other module, using the --upgrade flag:

    rice05:~> python -m pip install --user --upgrade pip
    rice05:~> python -m pip install --user --upgrade virtualenv

    And note that you can do this for virtual environments (virtualenv) as well.
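
    The reason for calling python -m pip instead of plain pip is that it ties pip to the interpreter you name. The same trick works from a script by using sys.executable. This sketch only builds the command (pip_command is a hypothetical helper) and leaves running it up to you:

```python
import sys


def pip_command(*args):
    """Build a pip command that is tied to the interpreter running this code."""
    return [sys.executable, "-m", "pip"] + list(args)


if __name__ == "__main__":
    command = pip_command("install", "--user", "--upgrade", "pip")
    print(" ".join(command))
    # To actually run it, you could pass the list to subprocess.check_call(command)
```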

    2. Install a conda environment

    There is a core set of scientific software modules that are quite annoying to install, and this is where anaconda and miniconda come in. These are packaged virtual environments that you can easily install with pre-compiled versions of all your favorite modules (numpy, scikit-learn, pandas, matplotlib, etc.). We are going to be following instructions from the miniconda installation documentation. Generally we are going to do the following:

    • Download the installer
    • Run it to install, and install to our home folder
    • (optional) add it to our path
    • Install additional modules with conda

    First get the installer from here, and you can use wget to download the file to your home folder:

    rice05:~> cd $HOME
    rice05:~> wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
    # Make it executable
    rice05:~> chmod u+x Miniconda3-latest-Linux-x86_64.sh 

    Then run it! If you do it without any command line arguments, it’s going to ask you to agree to the license, and then interactively specify installation parameters. The easiest thing to do is skip this: using the -b parameter will automatically agree and install to miniconda3 in your home directory:

    rice05:~> ./Miniconda3-latest-Linux-x86_64.sh -b
    (installation continues here)

    If you want to add the miniconda to your path, meaning that it will be loaded in preference to all other pythons, then you can add it to your .profile:

    echo 'export PATH=$HOME/miniconda3/bin:$PATH' >> $HOME/.profile

    Then source your profile to make the python path active, or log in and out of the terminal to do the same:

    source /home/vsochat/.profile

    Finally, to install additional modules to your miniconda environment, you can use either conda (for pre-compiled binaries) or the pip that comes installed with the miniconda environment (in the case that the conda package manager doesn’t include it).

    # Scikit learn is included in the conda package manager
    /home/vsochat/miniconda3/bin/conda install -y scikit-learn
    # Pokemon ascii is not
    /home/vsochat/miniconda3/bin/pip install pokemon

    3. Install a virtual environment

    If you don’t want the bells and whistles that come with anaconda or miniconda, then you probably should go for a virtual environment. The Hitchhiker’s Guide to Python has a great introduction, and we will go through the steps here as well. First, let’s make sure we have the most up to date version for our current python:

    rice05:~> python -m pip install --user --upgrade virtualenv

    Since we are installing this to our user (.local) folder, we need to make sure the bin (with executables for the install) is on our path, because it usually won’t be:

    # Ruhroh!
    rice05:~/myproject> which virtualenv
    virtualenv: Command not found.
    # That's ok, we know where it is!
    rice05:~/myproject> export PATH=/home/vsochat/.local/bin:$PATH
    # (and for csh)
    rice05:~/myproject> setenv PATH /home/vsochat/.local/bin:$PATH
    # Did we add it?
    rice05:~/myproject> which virtualenv

    You can also add this to your $HOME/.profile if you want it sourced each time.

    Now we can make and use virtual environments! It is as simple as creating it, and activating it:

    rice05:~> mkdir myproject
    rice05:~> cd myproject
    rice05:~/myproject> virtualenv venv
    New python executable in /home/vsochat/myproject/venv/bin/python
    Installing setuptools, pip, wheel...done.
    rice05:~/myproject> ls

    To activate our environment, we use the executable activate in the bin provided. If you take a look at the files in bin, there is an activate file for each kind of shell, and there is also the executables for python and the package manager pip:

    rice05:~/myproject> ls venv/bin/
    activate       activate_this.py  pip	 python     python-config
    activate.csh   easy_install	 pip2	 python2    wheel
    activate.fish  easy_install-2.7  pip2.7  python2.7

    Here is how we would activate for csh:

    rice05:~/myproject> source venv/bin/activate.csh 
    [venv] rice05:~/myproject> 

    Notice any changes? The name of the active virtual environment is added to the terminal prompt! Now if we look at the python and pip versions running, we see we are in our virtual environment:

    [venv] rice05:~/myproject> which python
    [venv] rice05:~/myproject> which pip

    Again, you can add the source command to your $HOME/.profile if you want it to be loaded automatically on login. From here you can move forward with using python setup.py install (for local module files) and pip install MODULE to install software to your virtual environment.
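
    If you forget whether an environment is active, a script can check for itself. This is a sketch based on two conventions I’m aware of: older virtualenv releases set sys.real_prefix, while python3’s venv (and newer virtualenv) make sys.base_prefix differ from sys.prefix. The function name in_virtual_environment is my own.

```python
import sys


def in_virtual_environment():
    """Best-effort check for an active virtualenv or venv."""
    # Older virtualenv releases add this attribute to sys
    if hasattr(sys, "real_prefix"):
        return True
    # venv and newer virtualenv: the base prefix points at the original install
    base_prefix = getattr(sys, "base_prefix", sys.prefix)
    return base_prefix != sys.prefix


if __name__ == "__main__":
    print("virtual environment active:", in_virtual_environment())
    print("prefix:", sys.prefix)
```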

    To exit from your environment, just type deactivate:

    [venv] rice05:~/myproject> deactivate

    PROTIP You can specify commands to your virtualenv creation to include the system site packages in your environment. This is useful for modules like numpy that require compilation (lib/blas, anyone?) that you don’t want to deal with:

    rice05:~/myproject> virtualenv venv --system-site-packages

    Reproducible Practices

    Whether you are a researcher or a software engineer, you are going to run into the issue of wanting to share your code, and someone on a different cluster running it. The best solution is to container-ize everything, and for this we recommend using Singularity. However, let’s say that you’ve been a bit disorganized, and you want to quickly capture your current python environment either for a requirements.txt file, or for a container configuration? If you just want to glance and get a “human readable” version, then you can do:

    rice05:~> pip list
    biopython (1.66)
    decorator (4.0.6)
    gbp (0.7.2)
    nibabel (2.1.0)
    numpy (1.11.0)
    pip (8.1.2)
    pokemon (0.32)
    Pyste (0.9.10)
    python-dateutil (2.4.2)
    reportlab (3.3.0)
    scipy (0.18.1)
    setuptools (28.0.0)
    six (1.10.0)
    virtualenv (15.0.1)
    wheel (0.29.0)

    If you want your software printed in the format that will populate the requirements.txt file, then you want:

    rice05:~> pip freeze

    And you can print this right to file:

    # Write to new file
    rice05:~> pip freeze > requirements.txt
    # Append to file
    rice05:~> pip freeze >> requirements.txt
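
    If you would rather capture the environment from inside a running analysis, python 3.8 and later ship importlib.metadata, which can list installed distributions. This sketch prints name==version lines in the same spirit as pip freeze. It is not a drop-in replacement (it does not handle editable installs or packages installed from a repository URL the way pip does), and freeze_lines is a made-up name.

```python
from importlib import metadata  # available in Python 3.8 and later


def freeze_lines():
    """Return sorted 'name==version' strings for installed distributions."""
    lines = set()
    for dist in metadata.distributions():
        name = dist.metadata["Name"]
        # Skip broken installs that have no readable name
        if name:
            lines.add("%s==%s" % (name, dist.version))
    return sorted(lines, key=str.lower)


if __name__ == "__main__":
    print("\n".join(freeze_lines()))
```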

  • The Constant Struggle


  • Contained Environments for Software for HPC

    I was recently interested in doing what most research groups do, setting up a computational environment that would contain version controlled software, and easy ways for users in a group to load it. There are several strategies you can take. Let’s first talk about those.

    Strategies for running software on HPC

    Use the system default

    Yes, your home folder is located on some kind of server with an OS, and whether RHEL, CentOS, Ubuntu, or something else, it likely comes with, for example, standard python. However, you probably don’t have any kind of root access, so a standard install (let’s say we are installing the module pokemon) like any of the following won’t work:

    # If you have the module source code, with a setup.py
    python setup.py install
    # Install from package manager, pip
    pip install pokemon
    # use easy_install
    easy_install pokemon

    Each of the commands above would attempt to install to the system python (something like /usr/local/lib/pythonX.X/site-packages/) and then you would get a permission denied error.

    OSError: [Errno 13] Permission denied: '/usr/local/lib/python2.7/dist-packages/pokemon-0.1-py2.7.egg/EGG-INFO/entry_points.txt'

    Yes, each of the commands above needs a sudo, and you aren’t sudo, so you can go home and cry about it. Or you can install to a local library with something like this:

    # Install from package manager, pip, but specify as user
    pip install pokemon --user

    I won’t go into details, but you could also specify a --prefix to be some folder you can write to, and then add that folder to your PYTHONPATH. This works, but it’s not ideal for a few reasons:

    • if you need to capture or package your environment for sharing, you would have a hard time.
    • if you install to your $HOME folder, it’s likely not accessible by your labmates. Everyone ends up keeping a redundant copy, and you can’t be sure that if they run something, they will be using the same versions of software.

    Thus, what are some other options?

    Use a virtual environment

    Python has a fantastic thing called virtual environments, or more commonly seen as venv. It’s actually a package that you install, and then you use it to create an environment for your project and activate it:

    # Install the package
    pip install virtualenv --user
    virtualenv myvenv

    There are also ones that come prepackaged with scientific software that (normally) are quite annoying to compile like anaconda and miniconda (he’s a MINI conda! :D). And then you would install and do stuff, and your dependencies would be captured in that environment. More details and instructions can be found here. What are problems with this approach?

    • It’s still REALLY redundant for each user to maintain different virtual environments
    • Personally, I just forget which one is active, and then do stupid things.

    For all of the above, you could use pip freeze to generate a list of packages and versions for some requirements.txt file, or to save with your analysis for documentation sake:

    pip freeze >> requirements.txt
    # Inside looks like this

    Use a module

    Most clusters now use modules to manage versions of software and environments. What it comes down to is running a command like this:

    # What versions of python are available?
    module spider python
    Rebuilding cache, please wait ... (written to file) done.
      For detailed information about a specific "python" module (including how to load the modules) use the module's full name.
      For example:
         $ module spider python/3.3.2

    Nice! Let’s load 2.7.5. I’m old school.

    module load python/2.7.5

    What basically happens, behind the scenes, is that there is a file written in a language called lua that adds folders to the beginning of your path with the particular path to the software, and possibly maps the locations as well. We can use the module software to show us this code:

    # Show me the lua!
    module show python/2.7.5
    whatis("Provides Python 2.7.5 ")
    help([[ This module provides support for the
            Python 2.7.5 via Redhat Software Collections.
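
    The "add folders to the beginning of your path" idea is easy to see in a toy example. This sketch makes two temporary folders that each hold an executable with the same (made up) name mytool, and uses shutil.which to show that whichever folder comes first in the search path wins. It assumes a Linux-like system and python3, and the folder names stand in for the real system and module locations.

```python
import os
import shutil
import stat
import tempfile


def make_fake_tool(folder, name="mytool"):
    """Create a tiny executable script called `name` inside `folder`."""
    path = os.path.join(folder, name)
    with open(path, "w") as handle:
        handle.write("#!/bin/sh\n")
    # Mark the file as executable for the current user
    os.chmod(path, os.stat(path).st_mode | stat.S_IXUSR)
    return path


# Stand-ins for the system folder and the folder a module would add
system_dir = tempfile.mkdtemp()
module_dir = tempfile.mkdtemp()
make_fake_tool(system_dir)
make_fake_tool(module_dir)

# Before "module load": only the system folder is searched
before = shutil.which("mytool", path=system_dir)
# After "module load": the module folder is prepended, so it is found first
after = shutil.which("mytool", path=os.pathsep.join([module_dir, system_dir]))

print("before:", before)
print("after: ", after)
```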

    I won’t get into the hairy details, but this basically shows that we are adding paths (managed by an administrator) to give us access to a different version of python. This helps with versioning, but what problems do we run into?

    • We still have to install additional packages using --user
    • We don’t have control over any of the software configuration, we have to ask the admin
    • This is specific to one research cluster, who knows if the python/2.7.5 is the same on another one. Or if it exists at all.

    Again, it would work, but it’s not great. What else can we do? Well, we could try to use some kind of virtual machine… oh wait we are on a login node with no root access, nevermind. Let’s think through what we would want.

    An ideal software environment

    Ideally, I want all my group members to have access to it. My pokemon module version should be the same as yours. I also want total control of it. I want to be able to install whatever packages I want, and configure however I want. The first logical thing we know is that whatever we come up with, it probably is going to live in a group shared space. It also then might be handy to have equivalent lua files to load our environments, although I’ll tell you off the bat I haven’t done this yet. When I was contemplating this for my lab, I decided to try something new.

    Singularity for contained software environments

    A little about Singularity

    We will be using Singularity containers that don’t require root privileges to run on the cluster for our environments. Further, we are going to “bootstrap” Docker images so we don’t have to start from nothing! You can think of this like packaging an entire software suite (for example, python) into a container that you can then run as an executable:

      $ ./python3 
      Python 3.5.2 (default, Aug 31 2016, 03:01:41) 
      [GCC 4.9.2] on linux
      Type "help", "copyright", "credits" or "license" for more information.

    Even the environment gets carried through! Try this:

      import os

    We are soon to release a new version of Singularity, and one of the simple features that I’ve been developing is an ability to immediately convert a Docker image into a Singularity image. The first iteration relied upon using the Docker Engine, but the new bootstrap does not. Because… I (finally) figured out the Docker API after many struggles, and the bootstrapping (basically starting with a Docker image as base for a Singularity image) is done using the API, sans need for the Docker engine.

    As I was thinking about making a miniconda environment in a shared space for my lab, I realized - why am I not using Singularity? This is one of the main use cases, but no one seems to be doing it yet (at least as determined by the Google Group and Slack). This was my goal - to make contained environments for software (like Python) that my lab can add to their path, and use the image as an executable equivalently to calling python. The software itself, and all of the dependencies and installed modules are included inside, so if I want a truly reproducible analysis, I can just share the image. If I can’t handle about ~1GB to share, I can minimally share the file to create it, called the definition file. Let’s walk through the steps to do this. Or if you want, skip this entirely and just look at the example repo.

    Singularity Environments

    The basic idea is that we can generate “base” software environments for labs to use on research clusters. The general workflow is as follows:

    1. On your local machine (or an environment with sudo) build the contained environment
    2. Transfer the contained environment to your cluster
    3. Add the executable to your path, or create an alias.

    We will first be reviewing the basic steps for building and deploying the environments.

    Step 0. Setup and customize one or more environments

    You will first want to clone the repository, or if you want to modify and save your definitions, fork and then clone the fork first. Here is the basic clone:

          git clone https://www.github.com/radinformatics/singularity-environments
          cd singularity-environments

    You can then move right into building one or more containers, or optionally customize environments first.

    Step 1. Build the Contained Environment

    First, you should use the provided build script to generate an executable for your environment:

          ./build.sh python3.def

    The build script is really simple - it just grabs the size (if provided), checks the number of arguments, and then creates an image and runs bootstrap (note in the future this operation will likely be one step):

    # Check that the user has supplied at least one argument
    if (( "$#" < 1 )); then
        echo "Usage: build.sh [image].def [options]\n"
        echo "Example:\n"
        echo "       build.sh python.def --size 786"
        exit 1
    fi
    # Pop off the image name
    def=$1
    shift
    # If there are no more args, use the default size
    if [ "$#" -eq 0 ]; then
        args="--size 1024*1024B"
    else
        args="$@"
    fi
    # Continue if the image is found
    if [ -f "$def" ]; then
        # The name of the image is the definition file minus extension
        imagefile="${def%%.*}"
        echo "Creating $imagefile using $def..."
        sudo singularity create $args $imagefile
        sudo singularity bootstrap $imagefile $def
    fi

    Note that the only two commands you really need are:

    sudo singularity create $args $imagefile
    sudo singularity bootstrap $imagefile $def

    I mostly made the build script because I was lazy. This will generate a python3 executable in the present working directory. If you want to change the size of the container, or add any custom arguments to the Singularity bootstrap command, you can add them after your image name:

          ./build.sh python3.def --size 786
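
    If bash isn’t your thing, the same logic fits in a few lines of Python. This is only a sketch that builds the two commands as lists so you can look at them before running anything. The function name build_commands is hypothetical, and unlike the bash script it passes no size option unless you supply one (that is my simplification).

```python
def build_commands(def_file, extra_args=None):
    """Return the create and bootstrap commands for a definition file."""
    # Mirror bash ${def%%.*}: drop everything from the first dot onward
    image = def_file.split(".", 1)[0]
    args = list(extra_args) if extra_args else []
    return [
        ["sudo", "singularity", "create"] + args + [image],
        ["sudo", "singularity", "bootstrap", image, def_file],
    ]


if __name__ == "__main__":
    for command in build_commands("python3.def", ["--size", "786"]):
        print(" ".join(command))
        # To actually run it, you could pass the list to subprocess.check_call(command)
```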

    Note that the maximum size, if not specified, is 1024*1024BMiB. The python3.def file will need the default size to work, otherwise you run out of room and get an error. This is also true for R (r-base), which I used --size 4096 to work. That R, it’s a honkin’ package!

    Step 2. Transfer the contained environment to your cluster

    You are likely familiar with FTP, or hopefully your cluster uses a secure file transfer (sFTP). You can also use a command line tool scp. For the Sherlock cluster at Stanford, since I use Linux (Ubuntu), my preference is for gftp.

    Step 3. Add the executable to your path

    Let’s say we are working with a python3 image, and we want this executable to be called before the python3 that is installed on our cluster. We need to either add this python3 to our path (BEFORE the old one) or create an alias.

    Add to your path

    You likely want to add this to your .bash_profile, .profile, or .bashrc:

          mkdir $HOME/env
          cd $HOME/env
          # (and you would transfer or move your python3 here)

    Now add to your .bashrc:

          echo "PATH=$HOME/env:$PATH; export PATH;" >> $HOME/.bashrc

    Create an alias

    This will vary for different kinds of shells, but for bash you can typically do:

          alias aliasname='commands'
          # Here is for our python3 image
          alias python3='/home/vsochat/env/python3'

    For both of the above, you should test to make sure you are getting the right one when you type python3:

          which python3

    The definition files in this base directory are for base (not hugely modified) environments. But wait, what if you want to customize your environments?

    I want to customize my environments (before build)!

    The definition files can be modified before you create the environments! First, let’s talk a little about this Singularity definition file that we use to bootstrap.

    A little about the definition file

    Okay, so this folder is filled with *.def files, and they are used to create these “executable environments.” What gives? Let’s take a look quickly at a definition file:

          Bootstrap: docker
          From: python:3.5

          %runscript
              exec /usr/local/bin/python

          %post
              apt-get update
              apt-get install -y vim
              mkdir -p /scratch
              mkdir -p /local-scratch

    The first two lines might look (sort of) familiar, because “From” is a Dockerfile spec. Let’s talk about each:

    • Bootstrap: is telling Singularity what kind of Build it wants to use. You could actually put some other kind of operating system here, and then you would need to provide a Mirror URL to download it. The “docker” argument tells Singularity we want to use the guts of a particular Docker image. Which one?
    • From: is the argument that tells Singularity bootstrap “from this image.”
    • runscript: is the one (or more) commands that are run when someone uses the container as an executable. In this case, since we want to use the python 3.5 that is installed in the Docker container, we have the executable call that path.
    • post: is a bunch of commands that you want run once (“post” bootstrap), and thus this is where we do things like install additional software or packages.
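
    To make the structure concrete, here is a toy parser that splits a definition file into its header fields and its sections. It assumes the usual Singularity layout, where header lines look like "Key: value" and each section starts with a line such as %runscript or %post. This is purely an illustration that I wrote for this post, not how Singularity itself reads the file, and parse_definition is a made-up name.

```python
def parse_definition(text):
    """Split a definition file into (header fields, section command lists)."""
    header = {}
    sections = {}
    current = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        if line.startswith("%"):
            # A new section such as %runscript or %post begins here
            current = line[1:].strip()
            sections[current] = []
        elif current is None:
            # Header lines come before any section and look like "Key: value"
            key, _, value = line.partition(":")
            header[key.strip()] = value.strip()
        else:
            sections[current].append(line)
    return header, sections


example = """
Bootstrap: docker
From: python:3.5

%runscript
    exec /usr/local/bin/python

%post
    apt-get update
    apt-get install -y vim
"""

header, sections = parse_definition(example)
print(header)
print(sections)
```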

    Making changes

    It follows logically that if you want to install additional software, do it in post! For example, you could add a pip install [something], and since the container is already bootstrapped from the Docker image, pip should be on the path. For example, here is how I would look around the container via python:

          Python 3.5.2 (default, Aug 31 2016, 03:01:41) 
          [GCC 4.9.2] on linux
          Type "help", "copyright", "credits" or "license" for more information.
          >>> import os
          >>> os.system('pip --version')
          pip 8.1.2 from /usr/local/lib/python3.5/site-packages (python 3.5)

    or using the Singularity shell command to bypass the runscript (/usr/local/bin/python) and just poke around the guts of the container:

          $ singularity shell python3
          Singularity: Invoking an interactive shell within container...
          Singularity.python3> which pip

    If you would like any additional docs on how to do things, please post an issue or just comment on this post. I’m still in the process of thinking about how to best build and leverage these environments.

    I want to customize my environments! (after build)

    Let’s say you have an environment (node6, for example), and you want to install a package with npm (which is located at /usr/local/bin/npm), but then when you run the image:


    it takes you right into the node terminal. What gives? How do you do it? You use the Singularity shell, with write mode, and we first want to move the image back to our local machine, because we don’t have sudo on our cluster. We then want to use the writable option:

          sudo singularity shell --writable node6
          Singularity: Invoking an interactive shell within container...

    Then we can make our changes, and move the image back onto the cluster.

    A Cool Example

    The coolest example I’ve gotten working so far is using Google’s TensorFlow (the basic version without GPU - testing that next!) via a container. Here is the basic workflow:

    ./build.sh tensorflow.def --size 4096
    # building... building...
    Python 2.7.6 (default, Jun 22 2015, 17:58:13) 
    [GCC 4.8.2] on linux2
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import tensorflow

    Ok, cool! That takes us into the python installed in the image (with tensorflow), and I could run stuff interactively here. What I first tried was the “test” example, to see if it worked:

    singularity shell tensorflow
    python -m tensorflow.models.image.mnist.convolutional

    Note that you can achieve this functionality without shelling into the image if you specify that the image should take command line arguments, something like this in the definition file:

    exec /usr/local/bin/python "$@"

    and then run like this!

    ./tensorflow -m tensorflow.models.image.mnist.convolutional
    Extracting data/train-images-idx3-ubyte.gz
    Extracting data/train-labels-idx1-ubyte.gz
    Extracting data/t10k-images-idx3-ubyte.gz
    Extracting data/t10k-labels-idx1-ubyte.gz
    Step 0 (epoch 0.00), 6.6 ms
    Minibatch loss: 12.054, learning rate: 0.010000
    Minibatch error: 90.6%
    Validation error: 84.6%

    Another added feature, done specifically when I realized that there are different Docker registries, is an ability to specify the Registry and to use a Token (or not):

    Bootstrap: docker
    From: tensorflow/tensorflow:latest
    IncludeCmd: yes
    Registry: gcr.io
    Token: no

    Final Words

    Note that this software is under development, I think the trendy way to say that is “bleeding edge,” and heck, I came up with this idea and wrote all this code most of yesterday, and so this is just an initial example to encourage others to give this a try. We don’t (yet) have a hub to store all these images, so in the meantime if you make environments, or learn something interesting, please share! I’ll definitely be adding more soon, and customizing the ones I’ve started for my lab.


  • Thirty Days Hath September

    Thirty days hath September
    for as long as I remember
    until next year, at thirty one,
    What things may duly come?

    My preponderance prodded at the Dish,
    why Vanessasaur, what do you wish?
    Thoughts climbing slopes, far up and down
    from the foothills side, back toward town.

    First came hopes, lithe and fun!
    Always starting and never done
    Always gaining and never won
    Should I quicker jump, or faster run?

    Followed by fears, meek and shy
    Sometimes cunning, never spry
    Religiously truthful, will not lie
    Can I evade darkness before I die?

    Quick in tow was love and heart
    Forever caring, emotionally smart
    Always defeated yet finding start
    Might I surpass ego to play my part?

    Last came dream, fickle and sweet,
    Traversing muddy forest, stinky feet!
    Falling constantly, never to defeat
    Will I traverse mountains, promises keep?

    I thought long and hard
    about what I might expect
    I imagined every card,
    that fate keeps in her deck

    Thirty years this September
    that’s as long as I remember
    For future passed and what may come
    Ask me when I’m thirty one :)


  • The Docker APIs in Bash

    Docker seems to have a few different APIs, all under heavy development, and this state almost guarantees that mass confusion will ensue. The documentation isn’t sparse (but it’s confusing) so I want to take some time to talk through some recent learning, in case it is helpful. A warning: this reflects what I was able to figure out over about 24 hours of working on things and optimizing my solutions for time and efficiency. Some of this is likely wrong or missing information, and I hope that others comment to discuss things that I missed! While Python makes things really easy, for the work that I was doing I decided to make it more challenging (fun?) by trying to accomplish everything using bash. First, I’ll briefly mention the different APIs that I stumbled upon:

    • The Docker Remote API: This seems to be an API for the client to issue commands via the Docker Daemon. I have no idea why it’s called “remote.” Maybe the user is considered “remote” ? This might be where you would operate to develop some kind of desktop application that piggy backs on a user’s local Docker. Please correct me if I am messing this up.
    • The Docker Hub API: A REST API for Docker Hub. This is where (I think) a developer could build something to work with “the images out in the internet land.” I (think) I’ve also seen this referred to as the “Docker HUB registry API” (different from the next one…)
    • Docker Registry API: Is this an interface between some official Docker registry and a Docker engine? I have no idea what’s going on at this point.

    I found the most helpful (more than the docs above) to be the comments on this Github issue. Specifically, @baconglobber. You’re the man. Or the baconglobber, whichever you prefer. Regardless, I’ll do my best to talk through some details of API calls that I found useful for the purpose of getting image layers without needing to use the Docker engine. If that task is interesting to you, read on.

    The Docker Remote API

    Docker seems to work by way of an API, meaning a protocol that the engine can use under the hood to send commands to the Hub to push and pull images, along with all the other commands you’ve learned to appreciate. If you look at the docs for the remote API, what seems to be happening is that the user sends commands to his or her Docker Daemon, and then they can interact with this API. I started to test this, but didn’t read carefully that my curl version needs to be at least 7.40. Here (was) my version:

          curl -V
          curl 7.35.0 (x86_64-pc-linux-gnu) libcurl/7.35.0 OpenSSL/1.0.1f zlib/1.2.8 libidn/1.28 librtmp/2.3
          Protocols: dict file ftp ftps gopher http https imap imaps ldap ldaps pop3 pop3s rtmp rtsp smtp smtps telnet tftp 
          Features: AsynchDNS GSS-Negotiate IDN IPv6 Largefile NTLM NTLM_WB SSL libz TLS-SRP 

    Oups. To upgrade curl on Ubuntu 14.04 you can’t use standard package repos, so here is a solution that worked for me (basically install from source). Note that you need to open a new terminal for the changes to take effect. But more specifically, I wanted a solution that didn’t need to use the Docker daemon, and I’m pretty sure this isn’t what I wanted. However, anyone wanting to create an application that works with Docker on a user’s machine, this is probably where you should look. There are a LOT of versions:

    and you will probably get confused as I did and click “learn by example” expecting the left sidebar to be specific to the API (I got there via a Google search) and then wonder why you are seeing tutorials for standard Docker:

    What’s going on?! Rest assured, probably everyone is having similar confusions, because the organization of the documentation feels like being lost in wikiland. At least that’s how I felt. I will point you to the client API libraries because likely you will be most diabolical digging directly into Python, Java, or your language of choice (they even have web components?! cool!) For my specific interest, I then found the Docker Hub API.

    Docker Hub API

    This is great! This seems to be the place where I can interact with Docker Hub, meaning getting lists of images, tags, image ids, and all the lovely things that would make it possible to work with (and download) the images (possibly without needing to have the Docker engine running). The first confusion that I ran into was a simple question - what is the base url for this API? I was constantly confused about what endpoint I should be using at pretty much every call I tried, and only “found something that worked” by way of trying every one. Here are a handful of the ones that returned responses sometimes, sometimes not:

    • https://registry.hub.docker.com/v1/
    • https://registry.hub.docker.com/v2/
    • https://registry-1.docker.io/v1/
    • https://registry-1.docker.io/v2/
    • https://cdn-registry-1.docker.io/v1/

    The only thing that is (more) intuitive is that if you know what a CDN is, you would guess that the CDN is where images, or some filey things, might be located.

    So we continue in the usual state of things, when it comes to programming and web development. We have a general problem we want to solve, or goal we want to achieve, we are marginally OK at the scripting language (I’m not great at bash, which is why I chose to use it), and the definition of our inputs and resources needs to be figured out as we go. But… we have the entire internet to help us! And we can try whatever we like! This, in my opinion, is exactly the kind of environment that is most fun to operate in.

    The organization of images

    Before we jump into different commands, I want to review what the parameters are, meaning the terms that Docker uses to describe images. When I first started using Docker, I would see something like this in the Dockerfile:

         FROM ubuntu:latest

    and I’m pretty sure it’s taken me unexpectedly long to have a firm understanding of all the possible versions and variables that go into that syntax (and this might make some of the examples in the API docs more confusing if a user doesn’t completely get it). For example, I intuited that if a “namespace” isn’t specified, the default is “library.” So this:

         ubuntu:14.04

    is equivalent to:

         library/ubuntu:14.04

    where “library” is considered the namespace, “ubuntu” is the repo name, and “14.04” is considered the “tag.” Since Docker images are basically combinations of layers, each of which is some tar-guzzed up group of files (does anyone else say that in their head?), I’m guessing that a tag basically points to a specific group of layers that, when combined, complete the image. The tag that I’m most used to is called “latest”, so the second thing I intuited is that if a user doesn’t specify a tag:

         library/ubuntu

    that would imply we want the latest, e.g.,

         library/ubuntu:latest
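
    The defaults described above fit in a few lines of Python. This is only a sketch: parse_image_name is a made-up helper, not part of any Docker library, and it ignores names that carry a registry hostname (such as gcr.io/...).

```python
def parse_image_name(image, default_namespace="library", default_tag="latest"):
    """Split a name such as 'ubuntu:14.04' into namespace, repo name, and tag."""
    # Anything after the colon is the tag (empty if there is no colon)
    name, _, tag = image.partition(":")
    # Anything before the slash is the namespace
    if "/" in name:
        namespace, repo_name = name.split("/", 1)
    else:
        namespace, repo_name = default_namespace, name
    return {"namespace": namespace,
            "repo_name": repo_name,
            "tag": tag or default_tag}


print(parse_image_name("ubuntu"))
print(parse_image_name("library/ubuntu:14.04"))
```

    With these defaults, "ubuntu", "library/ubuntu", and "library/ubuntu:latest" all come back as the same namespace, repo name, and tag.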


    Getting repo images

    My first task was to find a list of images associated with a repo. Let’s pretend that I’m interested in ubuntu, version 14.04.

    namespace=library; repo_name=ubuntu
    repo_images=$(curl -si https://registry.hub.docker.com/v1/repositories/$namespace/$repo_name/images)

    It returns this page, a list of things that looks like this:

    {"checksum": "tarsum+sha256:aa74ef4a657880d495e3527a3edf961da797d8757bd352a99680667373ddf393", "id": "9cc9ea5ea540116b89e41898dd30858107c1175260fb7ff50322b34704092232"}

    If you aren’t familiar, a checksum is a string of characters that you can generate on your local machine using a tool, and “check” against to ensure that you, for example, downloaded a file in its entirety. Also note that I found the (old?) registry endpoint (version 1.0) to work. What I was interested in were the “id” variables. What we’ve found here is a list of image layer ids that are associated with the ubuntu repo, in the library namespace (think of a namespace like a collection of images). However, what this doesn’t give me is which layers I’m interested in - some are probably for 14.04.1, and some not. What I need now is some kind of mapping from a tag (e.g., 14.04.1) to the layers that I want.
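
    To make the checksum idea concrete, here is a small Python sketch that hashes a local file with SHA-256 and compares it against an expected value. The helper names are invented for illustration. Note that the “tarsum+sha256” prefix in the response suggests Docker computes its checksum over the tar contents in its own way, so a plain file hash like this one is not expected to reproduce that exact string.

```python
import hashlib
import os
import tempfile


def sha256_of_file(path, chunk_size=65536):
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected_hex):
    """True if the file's digest equals the expected hex string."""
    return sha256_of_file(path) == expected_hex.lower()


# Demonstrate on a throwaway file
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"hello layers")
    demo_path = tmp.name

demo_digest = sha256_of_file(demo_path)
demo_ok = matches_checksum(demo_path, demo_digest)
print(demo_digest, demo_ok)
os.remove(demo_path)
```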

    Which layers for my tag?

    Out of all the image layers belonging to “ubuntu,” which ones do I need for my image of interest, 14.04.1? For this, I found that a slight modification of the url provides better details, and I’m including an if statement that will fire if the image manifest is not found (a.k.a., the text returned by the call is just “Tag not found”). I’m not sure why, but this call always took a long time (about 22 seconds), at least given the amount of information it returns:

    # Again use Ubuntu... but this time define a tag!
    repo_tag=14.04.1
    layers=$(curl -k https://registry.hub.docker.com/v1/repositories/$namespace/$repo_name/tags/$repo_tag)
    # Were any layers found?
    if [ "$layers" = "Tag not found" ]; then
        echo "Ahhhhh!"
        exit 1
    fi

    When it works, you see:

    echo $layers
     {"pk": 20355486, "id": "5ba9dab4"}, 
     {"pk": 20355485, "id": "51a9c7c1"}, 
     {"pk": 20355484, "id": "5f92234d"}, 
     {"pk": 20355483, "id": "27d47432"}, 
     {"pk": 20355482, "id": "511136ea"}

    If the image tag isn’t found:

    Tag not found

    There is a big problem with this call, and that has to do with the tag “latest,” and actually versioning of tags as well. If I define my tag to be “latest,” or even a common Ubuntu version (14.04) I get the “Tag not found” error. You can get all of the tag names of the image like so:

    tags=$(curl -k https://registry.hub.docker.com/v1/repositories/$namespace/$repo_name/tags)
    # Iterate through them, print to screen
    echo $tags | grep -Po '"name": "(.*?)"' | while read a; do
        tag_name=`echo ${a/\"name\":/}`
        tag_name=`echo ${tag_name//\"/}`
        echo $tag_name
    done

    There isn’t one called latest, and there isn’t even one called 14.04 (but there is 14.04.1, 14.04.2, and 14.04.3). Likely I need to dig a bit deeper and find out exactly how a “latest” tag is asserted to belong to the (latest) version of a repo, but arguably as a user I expect this tag to be included when I retrieve a list for the repo. It was confusing. If anyone has insight, please comment and share!
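
    Picking fields out of JSON with grep works, but it is brittle. As a hedged alternative, here is how the same tag listing could be handled with Python’s json module. The sample response is hypothetical: only the “name” key is known from the grep above, and the “layer” values are placeholders.

```python
import json

# Hypothetical response body; the "layer" values are made up
sample_tags = """[{"layer": "aaaa1111", "name": "14.04.1"},
                  {"layer": "bbbb2222", "name": "14.04.2"},
                  {"layer": "cccc3333", "name": "14.04.3"}]"""


def tag_names(response_text):
    """Return every tag name found in the JSON response."""
    return [entry["name"] for entry in json.loads(response_text)]


def find_tags(response_text, prefix):
    """Return the tag names that start with the given prefix."""
    return [name for name in tag_names(response_text) if name.startswith(prefix)]


names = tag_names(sample_tags)
print(names)
print("latest" in names)
print(find_tags(sample_tags, "14.04"))
```

    A prefix search like find_tags is one way to cope with a requested tag such as 14.04 that only exists as 14.04.1, 14.04.2, and so on.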

    Completing an image ID

    The final (potentially) confusing detail is the fact that the whole image ids have 64 characters, e.g., 5807ff652fea345a7c4141736c7e0f5a0401b30dfe16284a1fceb24faac0a951, but have you ever noticed that when you do docker images to list your images you see only 12 characters, or that the ids referenced in the manifest above have only 8?

    {"pk": 20355486, "id": "5ba9dab4"}

    The reason (I would guess) is because, given that we are looking at layer ids for a single tag within a namespace, it’s unlikely we need that many characters to distinguish the images, so reporting (and having the user reference just 8) is ok. However, given that I can look ahead and see that the API command to download and get meta-data for an image needs the whole thing, I now need a way to compare the whole list for the namespace to the layers (smaller list with shorter ids) above.

    Matching a shorter to a longer string in bash

    I wrote a simple loop to accomplish this, given the json object of layers I showed above ($layers) and the result of the images call ($repo_images):

    echo $layers | grep -Po '"id": "(.*?)"' | while read a; do
        # remove "id": and extra "'s
        image_id=`echo ${a/\"id\":/}`
        image_id=`echo ${image_id//\"/}`
        # Find the full image id for each tag, meaning everything up to the quote
        image_id=$(echo $repo_images | grep -o -P $image_id'.+?(?=\")')
        # If the image_id isn't empty, get the layer
        if [ ! -z "$image_id" ]; then
            echo "WE FOUND IT! DO STUFF!"
        fi
    done
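
    For comparison, the same matching can be sketched in Python. complete_ids is a hypothetical helper; unlike the grep above it only accepts a short id that matches the start of a full id, and it returns None when there is no match or more than one. The ids are the ones that appear in this post.

```python
import json


def complete_ids(short_ids, full_ids):
    """Map each short id to the single full id it is a prefix of (or None)."""
    completed = {}
    for short_id in short_ids:
        matches = [full for full in full_ids if full.startswith(short_id)]
        # Only accept an unambiguous match
        completed[short_id] = matches[0] if len(matches) == 1 else None
    return completed


# Short ids as returned for a tag, full ids as returned for the repo
layers = json.loads('[{"pk": 20355486, "id": "5ba9dab4"}, {"pk": 20355482, "id": "511136ea"}]')
repo_images = [
    "511136ea3c5a64f264b78b5433614aec563103b4d4702f3ba7d4d2698e22c158",
    "9cc9ea5ea540116b89e41898dd30858107c1175260fb7ff50322b34704092232",
]

completed = complete_ids([layer["id"] for layer in layers], repo_images)
for short_id, full_id in completed.items():
    print(short_id, "->", full_id)
```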

    Obtaining a Token

    Ok, at this point we have our (longer) image ids associated with some tag (inside the loop above), and we want to download them. For these API calls, we need a token. What I mean is that we need to have a curl command that asks the Docker remote API for permission to do something, and then if this is OK, it will send us back some nasty string of letters and numbers that, if we include it in the header of a second command, it will validate and say “oh yeah, I remember you! I gave you permission to read/pull/push to that image repo.” In this case, I found two ways to get a token. The first (which produced a token that worked in a second call for me) was making a request to get images (as we did before), but then adding content to the header to ask for a token. The token is then returned in the response header. In bash, that looks like this:

    token=$(curl -si https://registry.hub.docker.com/v1/repositories/$namespace/$repo_name/images -H 'X-Docker-Token: true' | grep X-Docker-Token)
    token=$(echo ${token/X-Docker-Token:/})
    token=$(echo Authorization\: Token $token)

    The token thing looks like this:

    echo $token
    Authorization: Token signature=d041fcf64c26f526ac5db0fa6acccdf42e1f01e6,repository="library/ubuntu",access=read

    Note that depending on how you do this in bash, you might see some nasty carriage return (^M) characters. This was actually for the second version of the token I tried to retrieve, but I saw similar ones for the call above:

    The solution I found to remove them was:

    token=$(echo "$token"| tr -d '\r')  # get rid of ^M, eww
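
    A likely source of the ^M is the response itself: HTTP header lines end in \r\n, and grep keeps the \r. Here is a Python sketch of the same cleanup, where token_header is a made-up helper that takes one raw header line and builds the Authorization header from it.

```python
def token_header(raw_header_line):
    """Turn a raw 'X-Docker-Token: ...' response line into an Authorization header."""
    name, _, value = raw_header_line.partition(":")
    if name.strip().lower() != "x-docker-token":
        raise ValueError("not an X-Docker-Token header: %r" % raw_header_line)
    # strip() removes the leading space and the trailing \r\n in one go
    return {"Authorization": "Token %s" % value.strip()}


# A header line as it arrives over the wire, carriage return included
raw = 'X-Docker-Token: signature=d041fcf64c26f526ac5db0fa6acccdf42e1f01e6,repository="library/ubuntu",access=read\r\n'
header = token_header(raw)
print(repr(header["Authorization"]))
```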

    I thought that it might be because I generated the variable with an echo without -n (which indicates to not make a newline), however even with this argument I saw the newline trash appear. In retrospect I should have tried -ne and also printf, but oh well, will save this for another day. I then had trouble with double quotes with curl, so my hacky solution was to write the cleaned call to file, and then use cat to pipe it into curl, as follows:

    # $url holds the endpoint followed by -H '<token header>', as in the download example further down
    echo $url > somefile.url
    response=$(cat somefile.url | xargs curl)
    # For some url that has a streaming response, you can also pipe directly into a file
    cat somefile.url | xargs curl -L >> somefile.tar.gz
    # Note the use of -L, this will ensure if there is a redirect, we follow it!

    If you do this in Python, you would likely use the requests module and make a requests.get to GET the url, add the additional header, and then get the token from the response header:

    import requests
    namespace, repo_name = "library", "ubuntu"
    # Header values must be strings
    header = {"X-Docker-Token": "true"}
    url = "https://registry.hub.docker.com/v1/repositories/%s/%s/images" %(namespace,repo_name)
    response = requests.get(url,headers=header)

    Then we see the response status is 200 (success!) and can peep into the headers to find the token:

    # 200
    # {'x-docker-token': 'signature=5f6f83e19dfac68591ad94e72f123694ad4ba0ca,repository="library/ubuntu",access=read', 'transfer-encoding': 'chunked', 'strict-transport-security': 'max-age=31536000', 'vary': 'Cookie', 'server': 'nginx/1.6.2', 'x-docker-endpoints': 'registry-1.docker.io', 'date': 'Mon, 19 Sep 2016 00:19:28 GMT', 'x-frame-options': 'SAMEORIGIN', 'content-type': 'application/json'}
    token = response.headers["x-docker-token"]
    # 'signature=5f6f83e19dfac68591ad94e72f123694ad4ba0ca,repository="library/ubuntu",access=read'
    # Then the header token is just a dictionary with this format
    header_token = {"Authorization":"Token %s" %(token)}
    # {'Authorization': 'Token signature=5f6f83e19dfac68591ad94e72f123694ad4ba0ca,repository="library/ubuntu",access=read'}

    And here is the call that didn’t work for me using version 2.0 of the API. I should be more specific - this call to get the token did work, but I never figured out how to correctly pass it into the version 2.0 API. I read that the default token lasts for 60 seconds, and also that the token should be formatted as Authorization: Bearer [token], but I got continually hit with

    '{"errors":[{"code":"UNAUTHORIZED","message":"authentication required","detail":[{"Type":"repository","Name":"ubuntu","Action":"pull"}]}]}\n'

    The interesting thing is that if we look at header info for the call to get images (which uses the “old” registry.hub.docker.com, e.g., https://registry.hub.docker.com/v1/repositories/library/ubuntu/images), we see that the response is coming from registry-1.docker.io:

    In [148]: response.headers
    Out[148]: {'x-docker-token': 'signature=f960e1e0e745965069169dbb78194bd3a4e8a10c,repository="library/ubuntu",access=read', 'transfer-encoding': 'chunked', 'strict-transport-security': 'max-age=31536000', 'vary': 'Cookie', 'server': 'nginx/1.6.2', 'x-docker-endpoints': 'registry-1.docker.io', 'date': 'Sun, 18 Sep 2016 21:26:51 GMT', 'x-frame-options': 'SAMEORIGIN', 'content-type': 'application/json'}

    When I saw this I said “Great! It must just be a redirect, and maybe I can use that (I think newer URL) to make the initial call.” But when I change registry.hub.docker.com to registry-1.docker.io, it doesn’t work. Boo. I’d really like to get, for example, the call https://registry-1.docker.io/v2/ubuntu/manifests/latest to work, because its counterpart with the older endpoint (below) doesn’t seem to work (sadface). I bet with the right token, and a working call, the tag “latest” will be found here, and resolve the issues I was having using the first token and call. This call for “latest” really should work :/
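
    For anyone who wants to keep poking at version 2.0, here is a sketch (not verified against the live service) of how the token flow is usually described: ask an auth service for a token scoped to a repository, then send it as Authorization: Bearer <token>. The auth URL and service name below are assumptions about Docker Hub (a registry normally advertises its own values in the Www-Authenticate header of a 401 response), and all of the function names are made up. Only the URL and header construction runs here; get_manifest makes network calls and is defined but not called. Note the explicit library/ namespace in the manifest path, which official images appear to need under v2.

```python
import json
import urllib.parse
import urllib.request

# Assumed values for Docker Hub, not taken from this post
AUTH_URL = "https://auth.docker.io/token"
SERVICE = "registry.docker.io"
REGISTRY = "https://registry-1.docker.io"


def token_url(namespace, repo_name, action="pull"):
    """Build the URL that asks the auth service for a scoped token."""
    scope = "repository:%s/%s:%s" % (namespace, repo_name, action)
    query = urllib.parse.urlencode({"service": SERVICE, "scope": scope})
    return "%s?%s" % (AUTH_URL, query)


def bearer_header(token):
    """Format the header; note there is no colon after 'Bearer'."""
    return {"Authorization": "Bearer %s" % token}


def manifest_url(namespace, repo_name, tag="latest"):
    """Build the v2 manifest URL, namespace included."""
    return "%s/v2/%s/%s/manifests/%s" % (REGISTRY, namespace, repo_name, tag)


def get_manifest(namespace, repo_name, tag="latest"):
    """Network call (not run here): fetch a token, then the manifest."""
    with urllib.request.urlopen(token_url(namespace, repo_name)) as response:
        token = json.loads(response.read().decode("utf-8"))["token"]
    request = urllib.request.Request(manifest_url(namespace, repo_name, tag),
                                     headers=bearer_header(token))
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


print(token_url("library", "ubuntu"))
print(manifest_url("library", "ubuntu"))
```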

    Downloading a Layer

    I thought this was the coolest part - the idea that I could use an API to return a data stream that I could pipe right into a .tar.gz file! I already shared most of this example, but I’ll do it quickly again to add some comment:

    # Variables for the example
    image_id=511136ea3c5a64f264b78b5433614aec563103b4d4702f3ba7d4d2698e22c158 # I think this is empty, but ok for example
    # Get the token again
    token=$(curl -si https://registry.hub.docker.com/v1/repositories/$namespace/$repo_name/images -H 'X-Docker-Token: true' | grep X-Docker-Token)
    token=$(echo ${token/X-Docker-Token:/})
    token=$(echo Authorization\: Token $token)
    # Put the entire URL into a variable, and echo it into a file removing the annoying newlines
    url=$(echo https://cdn-registry-1.docker.io/v1/images/$image_id/layer -H \'$token\')
    url=$(echo "$url"| tr -d '\r')
    echo $url > $image_id"_layer.url"
    echo "Downloading $image_id.tar.gz..."
    cat $image_id"_layer.url" | xargs curl -L >> $image_id.tar.gz

    I also tried this out in Python so I could look at the response header. Interestingly, they are using AWS CloudFront/S3. Seems like everyone does :)

    {'content-length': '32', 'via': '1.1 a1aa00de8387e7235a256b2a5b73ede8.cloudfront.net (CloudFront)', 'x-cache': 'Hit from cloudfront', 'accept-ranges': 'bytes', 'server': 'AmazonS3', 'last-modified': 'Sat, 14 Nov 2015 09:09:44 GMT', 'connection': 'keep-alive', 'etag': '"54a01009f17bdb7ec1dd1cb427244304"', 'x-amz-cf-id': 'CHL-Z0HxjVG5JleqzUN8zVRv6ZVAuGo3mMpMB6A6Y97gz7CrMieJSg==', 'date': 'Mon, 22 Aug 2016 16:36:41 GMT', 'x-amz-version-id': 'mSZnulvkQ2rnXHxnyn7ciahEgq419bja', 'content-type': 'application/octet-stream', 'age': '3512'}
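
    The Python counterpart of piping into a file is to copy the response body to disk in chunks. A minimal sketch, with save_stream as a hypothetical helper and an in-memory stream standing in for the real response (an object returned by urllib.request.urlopen could be passed in the same way):

```python
import io
import os
import shutil
import tempfile


def save_stream(stream, destination, chunk_size=1024 * 1024):
    """Copy a file-like response body to disk in chunks; return bytes written."""
    # "wb" overwrites, so a re-run does not append to an earlier download
    with open(destination, "wb") as handle:
        shutil.copyfileobj(stream, handle, chunk_size)
    return os.path.getsize(destination)


# Stand-in for a real response body: 32 bytes starting with the gzip magic number
fake_body = io.BytesIO(b"\x1f\x8b" + b"\x00" * 30)
destination = os.path.join(tempfile.mkdtemp(), "layer.tar.gz")
written = save_stream(fake_body, destination)
print(written)
```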

    Overall Comments

    In the end, I got a working solution to do stuff with the tarballs for a specific docker image/tag, and my strategy was brute force - I tried everything until something worked, and if I couldn’t get something seemingly newer to work, I stuck with the older thing that did. That said, it would be great to have more examples provided in the documentation. I don’t mean something that looks like this:

        PUT /v1/repositories/foo/bar/ HTTP/1.1
        Host: index.docker.io
        Accept: application/json
        Content-Type: application/json
        Authorization: Basic akmklmasadalkm==
        X-Docker-Token: true
        [{"id": "9e89cc6f0bc3c38722009fe6857087b486531f9a779a0c17e3ed29dae8f12c4f"}]

    I mean a script written in some language, showing me the exact flow of commands to get that to work (because largely when I’m looking at something for the first time you can consider me as useful and sharp as cheddar cheese on holiday in the Bahamas). For example, if you do anything with a Google API, they will give you examples in any and every language you can dream of! But you know, Google is amazing and awesome, maybe everyone can’t be like that :)

    I’ll finish by saying that, after all that work in bash, we decided to be smart about this and include a Python module, so I re-wrote the entire thing in Python. This let me better test the version 2.0 of the registry API, and unfortunately I still couldn’t get it to work. If anyone has a concrete example of what a header should look like with Authentication tokens and such, please pass along! Finally, Docker has been, is, and probably always will be, awesome. I have a fiendish inkling that very soon all of these notes will be rendered outdated, because they are going to finish up and release their updated API. I’m super looking forward to it.


  • Promise me this...

    Promise me this, promise me that. If you Promise inside of a JavaScript Object, your this is not going to be ‘dat!

    The desired functionality

    Our goal is to update some Object variable using a Promise. This is a problem that other JavaScript developers are likely to face. Specifically, let’s say that we have an Object, and inside the Object is a function that uses a JavaScript Promise:

    function someObject() {
        this.value = 1;
        this.setValue = function(filename) {
            this.readFile(filename).then(function(newValue){
                /* update this.value to be value */
            });
        }
    }

    In the example above, we define an Object called someObject, and then want to use a function setValue to read in some filename and update the Object’s value (this.value) to whatever is read in the file. The reading of the file is done by the function readFile, which does its magic and returns the new value in the variable newValue. If you are familiar with JavaScript Promises, you will recognize the .then(**do something**) syntax, which means that the function readFile returns a Promise. You will also know that inside of the .then() function we are in the JavaScript Promise space. First, let’s pretend that our data file is very stupid, and is a text file with a single value, 2:

    2


    First we will create the Object, and see that the default value is set to 1:

    var myObject = new someObject()
    console.log(myObject.value)
    >> 1

    Great! Now let’s define our file that has the updated value (2), and call the function setValue:

    var filename = "data.txt"
    myObject.setValue(filename)

    We would then expect to see the following:

    console.log(myObject.value)  // once the Promise has resolved
    >> 2

    The intuitive solution

    My first attempt is likely what most people would try - referencing the Object variable as this.value to update it inside the Promise, which looks like this:

    function someObject() {
        this.value = 1;
        this.setValue = function(filename) {
            this.readFile(filename).then(function(newValue){
                this.value = newValue;
            });
        }
    }

    But when I ran the more complicated version of this toy example, I didn’t see the value update. In fact, since I hadn’t defined an initial value, my Object variable was still undefined. For this example, we would see that the Object value isn’t updated at all:

    var filename = "data.txt"
    myObject.setValue(filename)
    console.log(myObject.value)
    >> 1

    Debugging the error

    Crap, what is going on? Once I checked that I wasn’t referencing the Object variable anywhere else, I asked the internet, and didn’t find any reasonable solution that wouldn’t require making my code overly complicated or weird. I then decided to debug the issue properly myself. The first assumption I questioned was the idea that the this inside of my Promise probably wasn’t the same Object this that I was trying to refer to. When I did a console.log(this), I saw the following:

    Window {external: Object, chrome: Object, document: document, wb: webGit, speechSynthesis: SpeechSynthesis…}

    uhh… what? My window object? I should have seen the someObject variable myObject, which is what I’d have seen referencing this anywhere within someObject (but clearly outside of a Promise):

    someObject {value: 1}

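    This (no pun intended) happens because a plain function handed to .then() is called without an owning object, so in non-strict code its this falls back to the global window. It means that I needed something like a pointer to carry into the function, and refer to the object. Does JavaScript do pointers?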

    Solution: a quasi pointer

    JavaScript doesn’t actually have pointers in the way that a CS person would think of them, but you can pass Objects and they refer to the same thing in memory. So I came up with two solutions to the problem, and both should work. One is simple, and the second should be used if you need to be passing around the Object (myObject) through your Promises.

    Solution 1: create a holder

    We can create a holder Object for the this variable, and reference it inside of the Promise:

    function someObject() {
        this.value = 1;
        this.setValue = function(filename) {
            var holder = this;
            this.readFile(filename).then(function(newValue){
                holder.value = newValue;
            });
        }
    }

    This will result in the functionality that we want, and we will actually be manipulating the myObject by way of referencing holder. Ultimately, we replace the value of 1 with 2.

    Solution 2: pass the object into the promise

    If we have some complicated chain of Promises, or for some reason if we can’t access the holder variable (I haven’t proven or asserted this to be an issue, but came up with this solution in case someone else finds it to be one) then we need to pass the Object into the Promise(s). In this case, our function might look like this:

    function someObject() {
        this.value = 1;
        this.setValue = function(filename) {
            this.readFile(filename, this).then(function(response){
                // Here is the newValue
                console.log(response.newValue);
                // Here is the passed myObject
                var myObject = response.args;
                // Set the value into my Object
                myObject.value = response.newValue;
            });
        }
    }

    and in order for this to work, the function readFile needs to know to add the input parameter this as args to the response data. For example, here it is done with a web worker:

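    this.readFile = function (filename,args) {
        return new Promise((resolve, reject) => {
            const worker = new Worker("js/worker.js");
            worker.onerror = (e) => { reject(e); };
            worker.onmessage = (e) => {
                e.data.args = args;
                resolve(e.data);
            };
            // Assumed message shape: the "getData" command with the filename as args
            worker.postMessage({command: "getData", args: filename});
        });
    };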

    In the above code, we are returning a Promise, and we create a Worker (worker), and have a message sent to him with a command called “getData” and the args filename. For this example, all you need to know is that he returns an Object (e) that has data (e.data) that looks like this:

    { "newValue": 2 }

    so we basically add an “args” field that contains the input args (myObject), and return a variable that looks like this:

    { "newValue": 2, "args": myObject }

    and then wha-la, in our returning function, since the response variable holds the data structure above, we can then do this last little bit:

    // Here is the passed myObject
    var myObject = response.args;
    // Set the value into my Object
    myObject.value = response.newValue;

    Pretty simple! This of course would only work given that an object is returned as the response, because obviously you can’t add an object onto a string or number. In retrospect, I’m not sure this was deserving of an entire post, but I wanted to write about it because it’s weird and confusing until you look at the this inside the Promise. I promise you, it’s just that simple :)


  • Gil Eats

    I have a fish, and his name is Gil, and he eats quite a bit. The question of what Gil eats (and where he gets it from) is interesting to me as a data science problem. Why? Our story starts with the unfortunate reality (for my fish) that I am allergic to being domestic: the extent that I will “cook” is cutting up vegetables and then microwaving them for a salad feast. This is problematic for someone who likes to eat many different things, and so the solution is that I have a lot of fun getting Gil food from many of the nice places to eat around town. Given that this trend will likely continue for many years, and given that we now live in the land of infinite culinary possibilities (Silicon Valley programmers are quite eclectic and hungry, it seems), I wanted a way to keep a proper log of this data. I wanted pictures, locations, and reviews or descriptions, because in 5, 10, or more years, I want to do some awesome image and NLP analyses with these data. Step 1 of this, of course, is collecting the data to begin with. I knew I needed a simple database and web interface. But before any of that, I needed a graphical representation of my Gil fish:


    My desire to make this application goes a little bit deeper than keeping a log of pictures and reviews. I think pretty often about the lovely gray area between data analysis, web-based tools, and visualization, and it seems problematic to me that a currently highly desired place to deploy these things (a web browser) is not well fit for things that are large OR computationally intensive. If I want to develop new technology to help with this, I need to first understand the underpinnings of the current technology. As a graduate student I’ve been able to mess around with the very basics, but largely I’m lacking the depth of understanding that I desire.

    On what level are we thinking about this?

    When most people think of developing a web application, their first questions usually revolve around “What should we put in the browser?” You’ll have an intimate discussion about the pros and cons of React vs. Angular, or perhaps some static website technology versus using a full fledged server (Jekyll? Django? ClojureScript? nodeJS?). What is completely glossed over is the standard that we start our problem solving from within the browser box, but I would argue that this unconscious assumption that the world of the internet is inside of a web browser must be questioned. When you think about it, we can take on this problem from several different angles:

    1. From within the web browser (already discussed). I write some HTML/JavaScript with or without a server for a web application. I hire some mobile brogrammers to develop an Android and iOS app and I’m pouring in the shekels!

    2. Customize the browser. Forget about rendering something within the constraints of your browser. Figure out how the browser works, and then change that technology to be more optimized. You might have to work with some standards committees for this one, which might take you decades, your entire life, or just never happen entirely.

    3. Customize the server, such as with an nginx (pronounced “engine-X”) module. Imagine, for example, you just write a module that tells the server to render or perform some other operation on a specific data type, and then serve the data “statically” or generate more of an API.

    4. The Headless Internet. Get rid of the browser entirely, and harness the same web technologies to do things with streams of data, notifications, etc. This is how I think of the future - we will live in a world where all real world things and devices are connected to the internet, sending data back and forth from one another, and we don’t need this browser or a computer thing for that.

    There are so many options!

    How does a web browser work?

    I’ve had ample experience with making dinky applications from within a browser (item 1) and already knew that was limited, and so my first thinking was that I might try to customize a browser itself. I stumbled around and found an amazing overview. The core or base technology seems simple - you generate a DOM from static syntax (HTML) files, link some styling to it based on a scoring algorithm, and then parse those things into a RenderTree. The complicated part comes when the browser has to do a million checks to walk up and down that tree to find things, and make sure that the user (the person who wrote the syntax) didn’t mess something up. This is why you can have some pretty gross HTML and it still renders nicely, because the browser is optimized to handle a lot of developer error before passing on the “I’m not loading this or working” error to the user. I was specifically very interested in the core technology to generate parsers, and wound up creating a small one of my own. This was also really good practice because knowing C/C++ (I don’t have formal CS training so I largely don’t) is something else that is important. Python is great, but it’s not “real” programming because you don’t compile anything. Google is also on to this; they’ve created Native Client to run C/C++ natively in a browser. I’m definitely going to check this out soon.

    I thought that it would be a reasonable goal to first try and create my own web browser, but reading around forums, this seemed like a big feat for a holiday weekend project. This chopped item #2 off of my list above. Another idea would be to create a custom nginx module (item #3) but even with a little C practice I wasn’t totally ready this past weekend (but this is definitely something I want to do). I realized, then, that the best way to understand how a web browser worked would be to start with getting better at current, somewhat modern technology. I decided that I wanted to build an application with a very specific set of goals.

    The Goals of Gil Eats

    I approached this weekend fun with the following goals. Since this is for “Gil Eats” let’s call them geats:

    • I want to learn about, understand, and implement an application that uses JavaScript Promises, Web Workers (hello parallel processing!), and Service Workers (control over resources/caching).
    • The entire application is served statically, and the goal achieved with simple technology available to everyone (a.k.a, no paying for an instance on AWS).
    • Gil can take a picture of his dinner, and upload it with some comments or review.
    • The data (images and comments) are stored in a web-accessible (and desktop-accessible, if desired) location, in an organized fashion, making them immediately available for further (future!) analysis.

    This was a pretty hefty load for just a holiday weekend, but my excitement about these things (and continually waking up in the middle of the night to “just try one more thing!”) made it possible, and I was able to create Gil Eats.

    Let’s get started!

    My Workflow

    I never start with much of a plan in mind, aside from a set of general application goals (listed above). The entire story of the application’s development would take a very long time to tell, so I’ll summarize. I started with a basic page that used the Dropbox API to list files in a folder. On top of that I added a very simple Google Map, and eventually I added the Places API to use as a discrete set of locations to link restaurant reviews and photos. The “database” itself is just a folder in Dropbox, and it has images, a json file of metadata associated with each image, and a master db.json file that gets rendered automatically when a user visits the page (sans authentication), and updated when a user is authenticated and fills out the form. I use Web Workers to retrieve all external requests for data, and use Service Workers to cache local files (probably not necessary, but it was important for me to learn this). The biggest development in my learning was with the Promises. With Promises you can get rid of “callback hell” that is (or was) definitive of JavaScript, and be assured that one event finishes before the next. You can create a single Promise that takes two handles to resolve (aka, woohoo it worked here’s the result!), or reject (nope, it didn’t work, here’s the error), or do things like chain promises or wrap them in calling functions. I encourage you to read about these Promises, and give them a try, because they are now native in most browsers, and (I think) are greatly improving the ease of developing web applications.
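
    As a minimal sketch of the resolve / reject idea described above: the function names (fetchMetadata, describe) and the fake records are made up for illustration, and this is not the actual Gil Eats code. It only shows a Promise created with its two handles, and then chained so that one step finishes before the next starts.

```javascript
// Hypothetical example: wrap an asynchronous lookup in a Promise.
// `records` stands in for data that would really come from an API call.
function fetchMetadata(records, id) {
    return new Promise(function (resolve, reject) {
        // setTimeout simulates a slow, asynchronous request
        setTimeout(function () {
            if (Object.prototype.hasOwnProperty.call(records, id)) {
                resolve(records[id]);                          // it worked, here is the result
            } else {
                reject(new Error("No record with id " + id));  // it failed, here is the error
            }
        }, 10);
    });
}

// Chaining: each .then() runs only after the previous step has finished,
// and whatever it returns is handed to the next step.
function describe(records, id) {
    return fetchMetadata(records, id)
        .then(function (record) {
            return record.restaurant + " (" + record.stars + " stars)";
        })
        .catch(function (error) {
            return "Could not load review: " + error.message;
        });
}

var demoRecords = { abc123: { restaurant: "Clam Shack", stars: 4 } };
describe(demoRecords, "abc123").then(function (text) { console.log(text); });
describe(demoRecords, "missing").then(function (text) { console.log(text); });
```

    The second call shows the reject path: the error is caught once, at the end of the chain, instead of inside nested callbacks.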

    Now let’s talk a bit about some of the details, and problems that I encountered, and how I dealt with them.

    The Interface

    The most appropriate visualization for this goal was a map. I chose a simple “the user is authenticated, show them a form to upload” or “don’t do that” interface, which appears directly below the map.

    The form looks like this:

    The first field in the form is linked with the Google Maps Places API, so when you select an address it jumps to it on the map. The date field is filled in automatically from the current date, and the format is controlled by way of a date picker:

    You can then click on a place marker, and see the uploads that Gil has made:

    If you click on an image, of course it’s shown in all its glory, along with Gil’s review and the rating (in stars):

    Speaking of stars, I changed the radio buttons and text input into a custom stars rating, which uses font awesome icons, as you can see in the form above. The other great thing about Google Maps is that you can easily add a Street View, so you might plop down onto the map and (really) see some of the places that Gil has frequented!

    The “database”

    Dropbox has a nice API that let me easily create an application, and then have (Gil) authenticate into his account in order to add a restaurant, which is done with the form shown above. The data (images and json with comments/review) are saved immediately to the application folder:

    How is the data stored?
    When Gil uploads a new image and review, it’s converted to a json file, a unique ID is generated based on the data and upload timestamp, and the data and image file are both uploaded to Dropbox with an API call. At the same time, the API is used to generate shared links for the image and data, and those are written into an updated master data file. The master data file knows which set of flat files belong together (images rendered together for the same location) because the location has a unique ID generated based on its latitude and longitude, which isn’t going to change because we are using the Places API. The entire interface is then updated with new data, down to closing the info window for a location given that the user has it open, so he or she can re-open it to see the newly uploaded image. If a user (that isn’t Gil) logs into the application, the url for his or her new database is saved to a cookie, so (hopefully) it will load the correct map the next time he or she visits. Yes, this means that theoretically other users can use Gil’s application for their data, although this needs further testing.
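
    To make the grouping idea concrete, here is a rough sketch of how a location ID could be derived from latitude and longitude, and how records could be filed under it in a master data object. The hash function, the field names, and the shape of the master object are all assumptions for illustration, not the real Gil Eats format.

```javascript
// Hypothetical sketch: none of these names come from the Gil Eats code.

// A small deterministic string hash (a djb2 variant), returned as hex.
function hashString(text) {
    var hash = 5381;
    for (var i = 0; i < text.length; i++) {
        // hash * 33 + character code, truncated to 32 bits
        hash = ((hash << 5) + hash + text.charCodeAt(i)) | 0;
    }
    // >>> 0 converts to an unsigned value before formatting as hex
    return (hash >>> 0).toString(16);
}

// The same latitude and longitude always produce the same location ID.
function locationId(latitude, longitude) {
    return hashString(latitude.toFixed(6) + "," + longitude.toFixed(6));
}

// File a record (with its shared links) under its location in the master data.
function addToMaster(master, record) {
    var id = locationId(record.latitude, record.longitude);
    if (!master[id]) {
        master[id] = { latitude: record.latitude, longitude: record.longitude, records: [] };
    }
    master[id].records.push({ image: record.imageLink, data: record.dataLink });
    return master;
}
```

    Because the ID depends only on the coordinates, two uploads for the same place land in the same entry, which is what lets the images be rendered together.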

    A Flat File Database? Are you nuts?
    Probably, yes, but for a small application for one user I thought it was reasonable. I know what you are thinking: flat file databases can be dangerous. A flat file database with a master file that keeps a record of all of these public shared links (so that a non-authenticated person, and more importantly, anyone who isn’t Gil, can load them on his or her map) means that if the file gets too big, it’s going to be slow to read, write, or just retrieve. I made this decision for several reasons, the first of which is that only one user (Gil) is likely to be writing to it at once, so we won’t have conflicts. The second is that it will take many years for Gil to eat enough to make the db.json file big enough to slow down the application (and I know he is reading this now and taking it as a personal challenge!). When this time comes, I’ll update the application to store and load data based on geographic zones. It’s very easy to get the “current view” of the box in the Google Map, and I already have hashes for the locations, so it should be fairly easy to generate “sub master” files that can be indexed based on where the user goes in the map, and then load smaller sets of data at once.
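
    One possible shape for those “sub master” files, sketched under heavy assumptions: a one-degree grid, a made-up file naming scheme (db-LAT_LNG.json), and a plain bounds object with south, west, north, and east edges. It also ignores views that cross the antimeridian. None of this exists in the application yet.

```javascript
// Hypothetical sketch of zone-based "sub master" files; the one-degree grid
// and the file naming are assumptions, not part of the current application.

// Name of the grid cell that contains a point, e.g. "37_-123".
function zoneKey(latitude, longitude) {
    return Math.floor(latitude) + "_" + Math.floor(longitude);
}

// List the sub master files needed to cover the current map view.
// `bounds` holds the south, west, north, and east edges in degrees.
function filesForView(bounds) {
    var files = [];
    for (var lat = Math.floor(bounds.south); lat <= Math.floor(bounds.north); lat++) {
        for (var lng = Math.floor(bounds.west); lng <= Math.floor(bounds.east); lng++) {
            files.push("db-" + zoneKey(lat, lng) + ".json");
        }
    }
    return files;
}

console.log(filesForView({ south: 37.2, west: -122.5, north: 38.1, east: -121.9 }));
```

    With a grid like this, a new record only touches the one small file for its cell, and the map only fetches the cells that are in view.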

    Some application Logic

    • The minimum required data to add a new record is an image file and an address.
    • I had first wanted to have only individual files, and then load them dynamically based on knowing some Dropbox folder address. Dropbox doesn’t actually let you do this - each file has to have its own “shared” link. When I realized this, I came up with my “master database” file solution, but then I was worried about writing to that file and potentially losing data if there was an error in that operation. This is why I made the application so that the entire master database can be re-generated fairly easily. A record can be added or deleted in the user’s Dropbox, and the database will update to reflect the change.
    • A common bug I encountered: when you have a worker running via a Promise, the Promise will only be resolved if you post a message back. I forgot to do this and was getting a pending Promise returned. The same is true if you have chained Promises, or Promises inside of other Promises - you have to return something or the (final) returned variable is undefined.
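
    The last bug in the list can be reproduced in a few lines. This is a sketch, not the application’s code: askWorker and fakeWorker are hypothetical names, and fakeWorker is a stand-in object with the same interface as a Web Worker (postMessage, onmessage, onerror) so that the example runs outside of a browser.

```javascript
// Wrap one request to a worker in a Promise. `worker` is anything with the
// Web Worker interface (postMessage, onmessage, onerror).
function askWorker(worker, request) {
    return new Promise(function (resolve, reject) {
        // The Promise settles only when the worker posts a message back.
        worker.onmessage = function (event) { resolve(event.data); };
        worker.onerror = function (error) { reject(error); };
        worker.postMessage(request);
    });
}

// A stand-in for a real worker. If `replies` is false it never answers,
// which leaves the Promise pending forever.
function fakeWorker(replies) {
    var worker = {
        onmessage: null,
        onerror: null,
        postMessage: function (request) {
            if (replies) {
                setTimeout(function () {
                    worker.onmessage({ data: "done: " + request });
                }, 0);
            }
        }
    };
    return worker;
}

// Chaining: the inner Promise must be returned, otherwise the next step
// receives undefined.
function withReturn() {
    return Promise.resolve("first").then(function (value) {
        return askWorker(fakeWorker(true), value);
    });
}

function withoutReturn() {
    return Promise.resolve("first").then(function (value) {
        askWorker(fakeWorker(true), value);   // bug: the result is not returned
    });
}

withReturn().then(function (result) { console.log(result); });      // "done: first"
withoutReturn().then(function (result) { console.log(result); });   // undefined
```

    In a real worker script the reply is the call to postMessage inside the worker’s own onmessage handler; leave that out and the wrapping Promise stays pending, exactly like fakeWorker(false).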

    Things I Learned

    Get rid of JQuery
    It’s very common (and easy) to use JQuery for really simple operations like setting cookies, and easily selecting divs. I mean, why would I want to do this:

    var value = document.getElementById("answer").value;

    When I can do this?

    var value = $("#answer").val();

    However, I realized that, for my future learning and being a better developer, I should largely try to develop applications that are simple (and don’t use JQuery). Don’t get me wrong, I like it a lot, but it’s not always necessary, and it’s kind of honkin’.

    Better Organize Code
    The nice thing about Python, and most object-oriented programming languages, is that the organization of the code, along with dependencies and imports, is very intuitive to me. JavaScript is different because it feels like you are throwing everything into a big pot of soup, all at once, and it’s either floating around in the pot somewhere or just totally missing. This makes variable conflicts likely, and I’ve noticed makes it very easy to write poorly documented, error-prone, and messy code. I tried to keep things organized as I went, and at the end was overtaken with my code’s overall lack of simplicity. I’m going to get a lot better at this. I want to get intuition about how to best organize, and write better code. The overall goal seems like it should be to take a big, hairy function and split it into smaller, modular ones, and then reuse them a lot. I need to also learn how to write proper tests for JS.

    Think about the user
    When Gil was testing it, he was getting errors in the console that a file couldn’t be created, because there was a conflict. This happened because he was entering an image and data in the form, and then changing just the image, and trying to upload again. This is a very likely use case (upload a picture of the clam chowder, AND the fried fish, Romeo!), but I didn’t even think of it when I was generating the function for a unique id (only based on the other fields in the form). I then added a variable that definitely would change, the current time stamp with seconds included. I might have used the image name, but then I was worried that a user would try to upload images with the same name, for the same restaurant and review. Anyway, the lesson is to think of how your user is going to use the application!
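
    A sketch of the fix, with made-up field names (address, review, stars) and a hypothetical recordId function rather than the one in the application: adding a timestamp with seconds means that re-submitting the same form with a different image no longer produces the same ID.

```javascript
// Hypothetical sketch: build a record ID from the form fields plus a timestamp.
function recordId(fields, now) {
    // `now` defaults to the current time; it is a parameter so it can be tested
    var timestamp = (now || new Date()).toISOString();   // includes seconds
    var parts = [fields.address, fields.review, fields.stars, timestamp];
    // Keep only characters that are safe in a file name
    return parts.join("-").replace(/[^A-Za-z0-9]+/g, "-").toLowerCase();
}

var form = { address: "1 Main St", review: "Great chowder", stars: 5 };
var chowder = recordId(form, new Date(Date.UTC(2016, 0, 1, 12, 0, 0)));
var friedFish = recordId(form, new Date(Date.UTC(2016, 0, 1, 12, 0, 1)));
console.log(chowder !== friedFish);   // true: one second apart, different IDs
```

    Two uploads within the same millisecond would still collide, so this is only as unique as a single user filling out a form needs it to be.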

    Think about the platform
    I didn’t think much about where Gil might be using this, other than his computer. Unfortunately I didn’t test on mobile, because the Places API needs a different key depending on the mobile platform. Oops. My solution was to do a check for the platform, and send the user to a “ruh roh” page if he or she is on mobile. In the future I will develop a proper mobile application, because this seems like the most probable place to want to upload a picture.
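
    A platform check along these lines can be sketched as below. The regular expression and the page name mobile.html are assumptions for illustration; user agent sniffing is only a heuristic, and this is not necessarily how the application does it.

```javascript
// Hypothetical sketch: guess whether a user agent string belongs to a mobile
// device. User agent sniffing is a heuristic, not a guarantee.
function isMobile(userAgent) {
    return /Android|iPhone|iPad|iPod|Mobi/i.test(userAgent);
}

// In a browser, send mobile visitors elsewhere; "mobile.html" is a made-up name.
if (typeof window !== "undefined" && typeof navigator !== "undefined" && isMobile(navigator.userAgent)) {
    window.location.href = "mobile.html";
}
```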

    Easter Eggs

    I’m closing up shop for today (it’s getting late, I still need to get home, have dinner, and then wake up tomorrow for my first day of a new job!! :D) but I want to close with a few fun Easter Eggs! First, if you drag “Gil” around (the coffee cup you see when the application starts) he will write you a little message in the console:

    The next thing is that if you click on “Gil” in Gil's Eats you will turn the field into an editable one, and you can change the name!

    …and your edits will be saved in localStorage so that the name is your custom one next time:

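        function saveEdits() {
            var editElem = document.getElementById("username");
            // Remove any line breaks (<br> tags) added while editing
            var username = editElem.innerHTML.replaceAll('<br>', '');
            localStorage.userEdits = username;
            editElem.innerHTML = username;
        }
        var el = document.getElementById("username");
        // "contentchange" does not appear to be a standard DOM event, so the
        // onkeyup attribute on the tag is what actually triggers saveEdits
        el.addEventListener("contentchange", saveEdits, false);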

    and the element it operates on is all made possible with a

    contenteditable="true" onkeyup="saveEdits()"

    in the tag. You’ll also notice I remove any line breaks that you add. My instinct was to press enter when I finished, but actually you click out of the box.
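
    The other half, putting the saved name back on the next visit, might look something like this sketch. restoreEdits is a hypothetical helper written for this example; it takes the storage and the element as arguments so that it can be tried outside of a browser.

```javascript
// Hypothetical sketch: put the saved name back when the page loads.
// `storage` is localStorage in a browser; `element` is the editable tag.
function restoreEdits(storage, element) {
    var saved = storage.userEdits;
    if (saved) {
        element.innerHTML = saved;
    }
    return element.innerHTML;
}

// In a browser this would run once the page has loaded:
if (typeof document !== "undefined" && typeof localStorage !== "undefined") {
    var nameElem = document.getElementById("username");
    if (nameElem) {
        restoreEdits(localStorage, nameElem);
    }
}
```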

    Bugs and Mysteries, and Conclusions

    I’m really excited about this, because it’s my first (almost completely working) web application that is completely static (notice the GitHub Pages hosting?) and works with several APIs and a (kind of) database. I’m excited to learn more about creating custom elements, and creating object oriented things in JavaScript. It’s also going to be pretty awesome in a few years to do some image processing and text analysis with Gil’s data! Where does he go? Can I predict the kind of food or some other variable from the review? Vice versa? Can the images be used to link his reviews with Yelp? Can I see changes in his likes and dislikes over time? Can I predict things about Gil based on his ratings? Can I filter the set to some subset, and generate a “mean” image of that meal type? I would expect that as we collect more data, I’ll start to make some fun visualizations, or even simple filtering and plotting. Until then, time to go home! Later gator! Here is Gil Eats

    and the code


  • Poldracklab, and Informatics

    Why, hello there! I hear you are a potentially interested graduate student, and perhaps you are interested in data structures, and/or imaging methods? If so, why, you’ve come to the right place! My PI Russ Poldrack recently wrote a nice post to advertise the Poldrack lab for graduate school. Since I’m the first (and so far, only) student out of BMI to make it through (and proudly graduate from the Poldracklab), I’d like to do the same and add on to some of his comments. Please feel free to reach out to me if you want more detail, or have specific questions.

    Is graduate school for you?

    Before we get into the details, you should be sure about this question. Graduate school is a long, challenging process, and if you don’t feel it in your heart that you want to pursue some niche passion or learning for that long, given the opportunity cost of making a lower income bracket “salary” for 5 years, look elsewhere. If you aren’t sure, I recommend taking a year or two to work as an RA (research assistant) in a lab doing something similar to what you want to do. If you aren’t sure what you want to study (my position when I graduated from college in 2009), then I recommend an RAship in a lab that will maximize your exposure to many interesting things. If you have already answered the harder questions about the kind of work that gives you meaning, and can say a resounding “YES!” to graduate school, then please continue.

    What program should I apply to?

    Russ laid out a very nice description of the choices, given that you are someone that is generally interested in psychology and/or informatics. You can study these things via Biomedical Informatics (my program), Neuroscience, or traditional Psychology. If you want to join Poldracklab (which I highly recommend!) you would probably be best off choosing one of these programs. I will try to break it down into categories as Russ did.

    • Research: This question is very easy for me to answer. If you have burning questions about human brain function, cognitive processes, or the like, and are less interested in the data structures or methods to get you answers to those questions, don’t be in Biomedical Informatics. If you are more of an infrastructure or methods person, and your idea of fun is programming and building things, you might on the other hand consider BMI. That said, there is huge overlap between the three domains. You could likely pursue the exact same research in all three, and what it really comes down to is what you want to do after, and what kind of courses you want to take.
    • Coursework: Psychology and neuroscience have a solid set of core requirements that will give you background and expertise in understanding neurons, (what we know) about how brains work, and (some) flavor of psychology or neuroscience. The hardest course (I think) in neuroscience is NBIO 206, a medical school course (I took as a domain knowledge course) that will have you studying spinal pathways, neurons, and all the different brain systems in detail. It was pretty neat, but I’m not sure it was so useful for my particular research. Psychology likely will have you take basic courses in Psychology (think Cognitive, Developmental, Social, etc.) and then move up to smaller seminar courses and research. BMI, on the other hand, has firm but less structured requirements. For example, you will be required to take core Stats and Computer Science courses, and core Informatics courses, along with some “domain of knowledge.” The domain of knowledge, however, can be everything from genomics to brains to clinical science. The general idea is that we learn to be experts in understanding systems and methods (namely machine learning) and then apply that expertise to solve “some” problem in biology or medicine. Hence the name “Bio-medical” Informatics.
    • Challenge: As someone who took Psychology courses in College and then jumped into Computer Science / Stats in graduate school, I can assuredly say that I found the latter much more challenging. The Psychology and Neuroscience courses I’ve taken (a few at Stanford) tend to be project and writing intensive with tests that mainly require lots of memorization. In other words, you have a lot of control over your performance in the class, because working hard consistently will correlate with doing well. On the other hand, the CS and Stats courses tend to be problem set and exam intensive. This means that you can study hard and still take a soul crushing exam, work night and day on a problem set, get a 63% (and question your value as a human being), and then go sit on the roof for a while. TLDR: graduate courses, especially at Stanford, are challenging, and you should be prepared for it. You will learn to detach your self-worth from some mark on a paper, and understand that you are building up an invaluable toolbelt to start to build the foundation of your future career. You will realize that classes are challenging for everyone, and if you work hard (go to problem sessions, do your best on exams, ask for help when you need it) people tend to notice these things, and you’re going to make it through. As a matter of fact, once you make it through it really is sunshine and rainbows! You get to focus on your research and build things, which basically means having fun all the time :)
    • Career: It’s hard not to notice that most who graduate from BMI, if they don’t continue as a postdoc or professor in academia, get some pretty awesome jobs in industry or what I call “academic industry.” The reason is that the training is perfect for the trendy job of “data scientist,” and so coming out of Stanford with a PhD in this area, especially with some expertise in genomics, machine learning, or medicine, gives you a highly sought after skill set, and is a sound choice given indifference. You probably would only do better with Statistics or Computer Science, or Engineering. If you definitely want to stay in academia and/or Psychology, you would be fine in any of the three programs. However, if you are unsure about your future wants and desires (academia or industry?) you would have a slight leg up with BMI, at least on paper.
    • Uncertainty: We all change our minds sometime. If you have decided that you love solving problems using algorithms but are unsure about imaging or brain science, then I again recommend BMI, because you have the opportunity to rotate in multiple labs, and choose the path that is most interesting to you. There are (supposed to be) no hard feelings, and no strings attached. You show up, bond (or not) with the lab, do some cool work (finish or not, publish or not) and then move on.
    • Admission: Ah, admissions, what everyone really wants to know about! I think most admissions are a crapshoot - you have a lot of highly and equally qualified individuals, and the admissions committees apply some basic thresholding of the applications, and then go with gut feelings, offer interviews to 20-25 students (about 1/5 or 1/6 of the total maybe?) and then choose the most promising or interesting bunch. From a statistics point of view, BMI is a lot harder to be admitted to (I think). I don’t have complete numbers for Psychology or Neuroscience, but the programs tend to be bigger, and they admit about 2-3X the number of students. My year in BMI, the admissions rate was about 4-5% (along the lines of 6 accepted for about 140-150 applications) and the recently published statistics cite 6 accepted for 135 applications. That works out to roughly a 4.4% admissions rate, which is pretty low. So perhaps you might just apply to both, to maximize your chances for working with Poldracklab!
    • Support: Support comes down to the timing of having people looking out for you during your first (and second) year experiences, and this is where BMI is very different from the other programs. You enter BMI and go through what are called “rotations” (three is about average) before officially joining a lab (usually by the end of year two), and this happens during the first two years. This period also happens to be the highest stress time of the graduate curriculum, and if a student is going to feel a lack of support, overworked, or sad, it is most likely to happen during this time. I imagine this would be different in Psychology, because you are part of a lab from Day 1. In this case, the amount of support that you get is highly dependent on your lab. Another important component of making this decision is asking yourself if you are the kind of person that likes having a big group of people to be sharing the same space with, always available for feedback, or if you are more of a loner. I was an interesting case because I am strongly a loner, and so while the first part of graduate school felt a little bit like I was floating around in the clouds, it was really great to be grounded for the second part. That said, I didn’t fully take advantage of the strong support structure that Poldracklab had to offer. I am very elusive, and continued to float when it came to pursuing an optimal working environment (which for me wasn’t sitting at a desk in Jordan Hall). You would only find me in the lab for group meetings, and because of that I probably didn’t bond or collaborate with my lab to the maximum that I could. However, it’s worth pointing out that despite my different working style, I was still made to feel valued and involved in many projects, another strong indication of a flexible and supportive lab.

    How is Poldracklab different from other labs?

    Given some combination of interest in brain imaging and methods, Poldracklab is definitely your top choice at Stanford, in my opinion. I had experience with several imaging labs during my time, and Poldracklab was by far the most supportive, resource providing, and rich in knowledge and usage of modern technology. Most other labs I visited were heavily weighted to one side - either far too focused on some aspect of informatics to the detriment of what was being studied, or too far into answering a specific question and relying heavily on plugging data into opaque, sometimes poorly understood processing pipelines. In Poldracklab, we wrote our own code, we constantly questioned if we could do it better, and the goal was always important. Russ was never controlling or limiting in his advising. He provided feedback whenever I asked for it, brought together the right people for a discussion when needed, and let me build things freely and happily. We were diabolical!

    What does an advisor expect of me?

    I think it’s easy to have an expectation that, akin to secondary school, Medical School, or Law School, you sign up for something, go through a set of requirements, pop out of the end of the conveyor belt of expectation, and then get a gold star. Your best strategy will be to throw away all expectation, and follow your interests and learning like a mysterious light through a hidden forest. If you get caught up in trying to please some individual, or some set of requirements, you are both selling yourself and your program short. The best learning and discoveries, in my opinion, come from the mind that is a bit of a drifter and explorer.

    What kind of an advisor is Russ?

    Russ was a great advisor. He is direct, he is resourceful, and he knows his stuff. He didn’t impose any kind of strict control over the things that I worked on, the papers that I wanted to publish, or even how frequently we met. It was very natural to meet when we needed to, and I always felt that I could speak clearly about anything and everything on my mind. When I first joined it didn’t seem to be a standard to do most of our talking on the white board (and I was still learning to do this myself to move away from the “talking head” style meeting), but I just went for it, and it made the meetings fun and interactive. He is the kind of advisor that is also spending his weekends playing with code, talking to the lab on Slack, and let’s be real, that’s just awesome. I continued to be amazed and wonder how in the world he did it all, still catching the Caltrain to make the ride all the way back to the city every single day! Lab meetings (unless it was a talk that I wasn’t super interested in) were looked forward to because people were generally happy. The worst feeling is having an advisor that doesn’t remember what you talked about from week to week, can’t keep up with you, or doesn’t know his or her stuff. It’s unfortunately more common than you think, because being a PI at Stanford, and keeping your head above the water with procuring grants, publishing, and maintaining your lab, is stressful and hard. Regardless, Russ is so far from the bad advisor phenotype. I’d say in a heartbeat he is the best advisor I’ve had at Stanford, on equal par with my academic advisor (who is also named Russ!), who is equally supportive and carries a magical, fun quality. I really was quite lucky when it came to advising! One might say, Russ to the power of two lucky!

    Do I really need to go to Stanford?

    All this said, if you know what you love to do, and you pursue it relentlessly, you are going to find happiness and fulfillment, and there is no single school that is required for that (remember this?). I felt unbelievably blessed when I was admitted, but there are so many instances when opportunities are presented by sheer chance, or because you decide that you want something and then take proactive steps to pursue it. Just do that, and you’ll be ok :)

    In a nutshell

    If you pursue what you love, maximize fun and learning, take some risk, and never give up, graduate school is just awesome. Poldracklab, for the win. You know what to do.


  • Thesis Dump

    I recently submitted my completed thesis (cue albatross soaring, egg hatching, sunset roaring), but along the way I wanted a simple way to turn it into a website. I did this for fun, but it proved to be useful because my advisor wanted some of the text and didn’t want to deal with LaTeX. I used Overleaf because it had a nice Stanford template, and while it still pales in comparison to the commenting functionality that Google Docs offers, it seems to be the best currently available collaborative, template-based, online LaTeX editor. If you are familiar with it, you also know that you have a few options for exporting your documents. You can of course export code (meaning .tex files for text, .bib for something like a bibliography, and .sty for styles, all zipped up), or you can have Overleaf compile it for you and download it as a PDF.

    Generating a site

    The task at hand is to convert something in LaTeX to HTML. If you’ve used LaTeX before, you know that there are already tools for that (htlatex and docs). The hard part in this process was really just installing dependencies, a task that Docker is well suited for. Thus, I generated a Docker image that extracts files from the Overleaf zip and runs an htlatex command to generate a static version of the document with appropriate links. Then you can push the static files to your GitHub Pages, and you’re done! I have complete instructions in the README, and you can see the final generated site. It’s nothing special, basically just white bread that could use some improving upon, but it meets its need for now. The one tiny feature is that you can specify the Google Font that you want by changing one line in generate.sh (default is Source Serif Pro):

    docker exec $CONTAINER_ID python /code/generate.py "Space Mono"

    Note that “Space Mono” is provided as a command line argument in this example. In the current file nothing is specified, so the font defaults to Source Serif Pro. Here is a look at the final output with Source Serif Pro:

    Advice for Future Students

    The entire thesis process wasn’t really as bad as many people make it out to be. Here are some pointers, for those who are in the process of writing their theses or have not yet started.

    • Choose a simple, well-scoped project. Sure, you could start your dream work now, but it will be a lot easier to complete a well defined project, nail your PhD, and then save the world after. I didn’t even start the work that became my thesis until about a year and a half before the end of graduate school, so don’t panic if you feel like you are running out of time.
    • Early in graduate school, focus on papers. The reason is that you literally can have a paper be an entire chapter, and boum there alone you’ve banged out 20-30 pages! Likely you will want to rewrite some of the content to have a different organization and style, but the meat is high quality. Having published work in a thesis is a +1 for the committee because it makes it easy for them to consider the work valid.
    • Start with an outline, and write a story around it. The biggest “new writing” I had to do for mine was an introduction with sufficient background and meat to tie all the work that I had done together. Be prepared to change this story, depending on feedback from your committee. I had started with a theme of “reproducible science,” but ultimately finished with a more niche, focused project.
    • For the love of all that is good, don’t put your thesis into LaTeX until AFTER it’s been edited, reviewed, and you’ve defended, made changes, and then have had your reading committee edit it again. I made the mistake of having everything ready to go for my defense, and going through another round of edits was a nightmare afterward. Whatever you do, there is going to be a big chunk of time that must be devoted to converting a Google Doc into LaTeX. I chose to do it earlier, but the cost of that is a document that is harder to change later. If I did this again, I would have just done this final step when it was intended for, at the end!
    • Most importantly, graduate school isn’t about a thesis. Have fun, take risks, and spend much more time doing those other things. The thesis I finished, to be completely honest, is pretty niche, dry, and might only be of interest to a few people in the world. The real richness in graduate school, for me, was everything but the thesis! I wrote a poem about this a few months ago for a talk, and it seems appropriate to share it here:
    I don't mean to be facetious,
    but graduate school is not about a thesis.
    To be tenacious, tangible, and thoughtful,
    for inspired idea you must be watchful.
    The most interesting things you might miss
    because they can come with a scent of risk.
    In this talk I will tell a story,
    of my thinking throughout this journey.
    I will try to convince you, but perhaps not
    that much more can be learned and sought
    if in your work you are not complacent,
    if you do not rely on others for incent.
    When you steer away from expectation,
    your little car might turn into innovation.
    Graduate school between the lines,
    has hidden neither equation nor citation.
    It may come with a surprise -
    it's not about the dissertation.

    Uploading Warnings

    A quick warning - the downloaded PDF wasn’t considered by the Stanford online Axess portal to be a “valid PDF”:

    and before you lose your earlobes, know that if you open the PDF in any official Adobe Reader (I used an old version of Adobe Reader on a Windows machine) and save it again, the upload will work seamlessly! Also don’t panic when you first try to do this, period, and see this message:

    As the message says, if you come back in 5-10 minutes the page will exist!


  • Pokemon Ascii Avatar Generator

    An avatar is a picture or icon that represents you. In massive online multiplayer role playing games (MMORPGs) your “avatar” refers directly to your character, and the computer gaming company Origin Systems took this symbol literally in its Ultima series of games by naming the lead character “The Avatar.”

    Internet Avatars

    If you are a user of this place called the Internet, you will notice in many places that an icon or picture “avatar” is assigned to your user. Most of this is thanks to a service called Gravatar that makes it easy to generate a profile that is shared across sites. For example, in developing Singularity Hub I found that there are many Django plugins that make adding a user avatar to a page as easy as adding an image with a source (src) like https://secure.gravatar.com/avatar/hello.

    The final avatar might look something like this:

    This is the “retro” design, and in fact we can choose from one of many:

    Command Line Avatars?

    I recently started making a command line application that would require user authentication. To make it more interesting, I thought it would be fun to give the user an identity, or minimally, something nice to look at when starting up the application. My mind immediately drifted to avatars, because an access token required for the application could equivalently be used as a kind of unique identifier, and a hash generated to produce an avatar. But how can we show any kind of graphic in a terminal window?

    Ascii to the rescue!

    Remember chain emails from the mid 1990s? There was usually some message compelling you to send the email to ten of your closest friends or face immediate consequences (cue diabolical flames and screams of terror). And on top of being littered with exploding balloons and kittens, ascii art was a common thing.

     __     __        _           _                     _ 
     \ \   / /       | |         | |                   | |
      \ \_/ /__  __ _| |__     __| | __ ___      ____ _| |
       \   / _ \/ _` | '_ \   / _` |/ _` \ \ /\ / / _` | |
        | |  __/ (_| | | | | | (_| | (_| |\ V  V / (_| |_|
        |_|\___|\__,_|_| |_|  \__,_|\__,_| \_/\_/ \__, (_)
                                                   __/ |  

    Pokemon Ascii Avatars!

    I had a simple goal - to create a command line based avatar generator that I could use in my application. Could there be any cute, sometimes scheming characters that could be helpful toward this goal? Pokemon!! Of course :) Thus, the idea for the pokemon ascii avatar generator was born. If you want to skip the fluff and description, here is pokemon-ascii.

    Generate a pokemon database

    Using the Pokemon Database I wrote a script that produces a data structure that is stored with the module, and makes it painless to retrieve meta data and the ascii for each pokemon. The user can optionally run the script again to re-generate/update the database. It’s quite fun to watch!

    The Pokemon Database has a unique ID for each pokemon, and so those IDs are the keys for the dictionary (the json linked above). I also store the raw images, in case they are needed and not available, or in case (in the future) we want to generate the ascii programmatically (for example, to change the size or characters). I chose this “pre-generate” strategy over creating the ascii from the images on the fly because it’s slightly faster, but there are definitely good arguments for doing the latter.

    Method to convert image to ascii

    I first started with my own intuition, and decided to read in an image using the Image class from PIL, converting the RGB values to integers, and then mapping the integers onto the space of ascii characters, so each integer is assigned an ascii. I had an idea to look at the number of pixels that were represented in each character (to get a metric of how dark/gray/intense each one was), so that the integer with value 0 (no color) could be mapped to an empty space. I would be interested if anyone has insight for how to derive this information. The closest thing I came to was determining the number of bits that are needed for different data types:

    # String
    # Integer
    # Unicode
    # Boolean
    # Float

    Interesting, a float is equivalent to an integer. What about if there are decimal places?


    Nuts! I should probably not get distracted here. I ultimately decided it would be most reasonable to just make this decision visually. For example, the @ character is a lot thicker than a ., so it would be farther to the right in the list. My first efforts rendering a pokemon looked something like this:

    I then was browsing around, and found a beautifully done implementation. The error in my approach was not normalizing the image first, and so I was getting a poor mapping between image values and characters. With the normalization, my second attempt looked much better:

    I ultimately modified this code slightly to account for the fact that characters tend to be thinner than they are tall. This meant that, even though the proportion / size of the image was “correct” when rescaling it, the images always looked too tall. To adjust for this, I modified the function to double the new width relative to the new height:

    def scale_image(image, new_width):
        """Resizes an image preserving the aspect ratio."""
        (original_width, original_height) = image.size
        aspect_ratio = original_height/float(original_width)
        new_height = int(aspect_ratio * new_width)
        # This scales it wider than tall, since characters are biased
        new_image = image.resize((new_width*2, new_height))
        return new_image
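
    The normalize-then-map step described above can be sketched roughly as follows. This is only my illustration and not the code from the implementation I borrowed: it works on a plain nested list of grayscale values instead of a PIL image, and the character list (ordered from visually light to heavy) and the function names are assumptions.

```python
# Characters ordered from visually light to heavy. The ordering is a
# hand-picked assumption, not the list used by pokemon-ascii.
ASCII_CHARS = [" ", ".", ",", ":", ";", "+", "*", "?", "%", "#", "@"]


def normalize(pixels):
    """Rescale grayscale values so the smallest is 0.0 and the largest is 1.0."""
    flat = [value for row in pixels for value in row]
    low, high = min(flat), max(flat)
    span = float(high - low) or 1.0  # avoid dividing by zero on flat images
    return [[(value - low) / span for value in row] for row in pixels]


def pixels_to_ascii(pixels, chars=ASCII_CHARS):
    """Map each normalized intensity onto a character, one text line per row."""
    last = len(chars) - 1
    lines = []
    for row in normalize(pixels):
        lines.append("".join(chars[int(round(value * last))] for value in row))
    return "\n".join(lines)
```

    Without the call to normalize, an image whose values only span a narrow range would land on just a few neighboring characters, which is the poor mapping I was seeing in my first attempt.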

    Huge thanks, and complete credit, goes to the author of the original code, and a huge thanks for sharing it! This is a great example of why people should share their code - new and awesome things can be built, and the world generally benefits!

    Associate a pokemon with a unique ID

    Now that we have ascii images, each associated with a number from 1 to 721, we want to be able to take some unique identifier (like an email or name) and consistently return the same image. I thought about this, and likely the basis for all of these avatar generators is to use the ID to generate a hash, and then have a function or algorithm that takes the hash and maps it onto an image, or (cooler) selects from some range of features (e.g., nose, mouth, eyes) to generate a truly unique avatar. I came up with a simple algorithm to do something like this. I take the hash of a string, and then use modulus to get the remainder of that number divided by the number of pokemon in the database. This means that, given that the database doesn’t change, and given that the pokemon have unique IDs in the range of 1 to 721, you should always get the same remainder, and this number will correspond (consistently!) with a pokemon ascii. The function is pretty simple, and it looks like this:

    import numpy
    from pokemon.master import catch_em_all, get_pokemon

    def get_avatar(string,pokemons=None,print_screen=True,include_name=True):
        '''get_avatar will return a unique pokemon for a specific avatar based on the hash
        :param string: the string to look up
        :param pokemons: an optional database of pokemon to use
        :param print_screen: if True, will print ascii to the screen (default True) and not return
        :param include_name: if True, will add name (minus end of address after @) to avatar
        '''
        if pokemons is None:
            pokemons = catch_em_all()
        # The IDs are numbers between 1 and the max, so shift the remainder (0 to max-1) up by one
        number_pokemons = len(pokemons)
        pid = numpy.mod(hash(string),number_pokemons) + 1
        pokemon = get_pokemon(pid=pid,pokemons=pokemons)
        avatar = pokemon[pid]["ascii"]
        if include_name:
            avatar = "%s\n\n%s" %(avatar,string.split("@")[0])
        if print_screen:
            print(avatar)
        else:
            return avatar

    …and the function get_pokemon takes care of retrieving the pokemon based on the id, pid.
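
    One caveat if you try this under Python 3: the built-in hash of a string is salted per process by default (unless PYTHONHASHSEED is set), so the same identifier would not give the same remainder across runs. Here is a minimal sketch of the same modulus idea with a digest from hashlib swapped in for hash. The function name stable_id and the default of 721 are my assumptions for illustration, and this is not what pokemon-ascii itself does.

```python
import hashlib


def stable_id(string, number_pokemons=721):
    """Map a string onto an ID between 1 and number_pokemons, reproducibly."""
    digest = hashlib.sha256(string.encode("utf-8")).hexdigest()
    # The remainder lies in 0..number_pokemons-1, so shift it into 1..number_pokemons
    return int(digest, 16) % number_pokemons + 1
```

    Because the digest only depends on the bytes of the string, the same identifier maps to the same ID on any machine, as long as the number of pokemon in the database doesn't change.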


    On the surface, this seems very silly, however there are many good reasons that I would make something like this. First, beautiful, or fun details in applications make them likable. I would want to use something that, when I fire it up, subtly reminds me that in my free time I am a Pokemon master. Second, a method like this could be useful for security checks. A user could learn some image associated with his or her access token, and if this ever changed, he/she would see a different image. Finally, a detail like this can be associated with different application states. For example, whenever there is a “missing” or “not found” error returned for some function, I could show Psyduck, and the user would learn quickly that seeing Psyduck means “uhoh.”

    There are many more nice uses for simple things like this, what do you think?


    The usage is quite simple, and this is taken straight from the README:

          usage: pokemon [-h] [--avatar AVATAR] [--pokemon POKEMON] [--message MESSAGE] [--catch]
          generate pokemon ascii art and avatars
          optional arguments:
            -h, --help         show this help message and exit
            --avatar AVATAR    generate a pokemon avatar for some unique id.
            --pokemon POKEMON  generate ascii for a particular pokemon (by name)
            --message MESSAGE  add a custom message to your ascii!
            --catch            catch a random pokemon!


    You can install directly from pip:

          pip install pokemon

    or for the development version, clone the repo and install manually:

          git clone https://github.com/vsoch/pokemon-ascii
          cd pokemon-ascii
          sudo python setup.py sdist
          sudo python setup.py install

    Produce an avatar

    Just use the --avatar tag followed by your unique identifier:

          pokemon --avatar vsoch

    You can also use the functions on command line (from within Python):

          from pokemon.skills import get_avatar
          # Just get the string!
          avatar = get_avatar("vsoch",print_screen=False)
          print(avatar)
          # Remove the name at the bottom, print to screen (default)
          avatar = get_avatar("vsoch",include_name=False)

    Randomly select a Pokemon

    You might want to just randomly get a pokemon! Do this with the --catch command line argument!

          pokemon --catch

    You can equivalently use the --message argument to add a custom message to your catch!

          pokemon --catch --message "You got me!"
          You got me!

    You can also catch pokemon in your python applications. If you are going to be generating many, it is recommended to load the database once and provide it to the function, otherwise it will be loaded each time.

          from pokemon.master import catch_em_all, get_pokemon
          pokemons = catch_em_all()
          catch = get_pokemon(pokemons=pokemons)

    I hope that you enjoy pokemon-ascii as much as I did making it!


  • How similar are my operating systems?

    A question that has spun out of one of my projects that I suspect would be useful in many applications but hasn’t been fully explored is comparison of operating systems. If you think about it, for the last few decades we’ve generated many methods for comparing differences between files. We have md5 sums to make sure our downloads didn’t poop out, and command line tools to quickly look for differences. We now have to take this up a level, because our new level of operation isn’t on a single “file”, it’s on an entire operating system. It’s not just your Mom’s computer, it’s a container-based thing (e.g., Docker or Singularity for non sudo environments) that contains a base OS plus additional libraries and packages. And then there is the special sauce, the application or analysis that the container was birthed into existence to carry out. It’s not good enough to have “storagey places” to dump these containers, we need simple and consistent methods to computationally compare them, organize them, and let us explore them.

    Similarity of images means comparing software

    An entire understanding of an “image” (or more generally, a computer or operating system) comes down to the programs installed, and files included. Yes, there might be various environment variables, but I would hypothesize that the environment variables found in an image have a rather strong correlation with the software installed, and we would do pretty well to understand the guts of an image from the body without the electricity flowing through it. This would need to be tested, but not quite yet.

    Thus, since we are working in Linux land, our problem is simplified to comparing file and folder paths. Using some software that I’ve been developing I am able to quickly derive lists of both of those things (for example, see here), and as a matter of fact, it’s not very hard to do the same thing with Docker (and I plan to do this en masse soon).

    Two levels of comparisons: within and between images

    To start my thinking, I simplified this idea into two different comparisons. We can think of each file path like a list of sorted words. Comparing two images comes down to comparing these lists. The two comparisons we are interested in are:

    • Comparing a single file path to a second path, within the same image, or from another image.
    • Comparing an entire set of file paths (one image) to a different set (a second image).

    I see each of these mapping nicely to a different goal and level of detail. Comparing a single path is a finer operation that is going to be useful to have detailed understanding about differences between two images, and within one image it is going to let me optimize the comparison algorithm by first removing redundant paths. For example, take a look at the paths below:


    We don’t really need the first one because it’s represented in the second one. However, if some Image 1 has the first but not the second (and we are doing a direct comparison of things) we would miss this overlap. Thus, since I’m early in developing these ideas, I’m just going to choose the slower, less efficient method of not filtering anything yet. So how are we comparing images anyway?
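
    For later reference, the filtering step that I'm skipping for now could look something like the sketch below. It treats a path as redundant when it is a parent folder of some other path in the same set. The function name and the example paths in the usage note are hypothetical, since I haven't settled on an approach.

```python
def drop_redundant(paths):
    """Keep only paths that are not a parent folder of another path in the set."""
    unique = set(p.rstrip("/") for p in paths)
    parents = set()
    for path in unique:
        parts = path.split("/")
        # Every proper prefix of a path is one of its ancestors
        for i in range(1, len(parts)):
            parents.add("/".join(parts[:i]))
    return sorted(unique - parents)
```

    For example, given the hypothetical paths /usr/lib and /usr/lib/python2.7, only the second would be kept. This is exactly the case where a direct comparison against an image that has /usr/lib alone would miss the overlap.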

    Three frameworks to start our thinking

    Given that we are comparing lists of files and/or folders, we can approach this problem in three interesting ways:

    1. Each path is a feature thing. I’m just comparing sets of feature things.
    2. Each path is a list of parent –> child relationships, and thus each set of paths is a graph. We are comparing graphs.
    3. Each path is a document, and the components of the path (folders to files) are words. The set of paths is a corpus, and I’m comparing different corpora.
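
    To make the second and third frameworks concrete, here is a small sketch of how one path could be turned into words or into parent and child edges (the first framework needs no conversion, since a set of path strings is already a set of features). The function names are mine, for illustration only.

```python
def path_to_words(path):
    """Framework 3: treat a path as a document whose words are its components."""
    return [part for part in path.split("/") if part]


def path_to_edges(path):
    """Framework 2: treat a path as a chain of (parent, child) relationships."""
    words = path_to_words(path)
    nodes = ["/"] + ["/" + "/".join(words[: i + 1]) for i in range(len(words))]
    return list(zip(nodes[:-1], nodes[1:]))
```

    Taking the union of the edges over all paths in an image would give the graph for that image, and the list of word lists would give its corpus.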

    Comparison of two images

    I would argue that this is the first level of comparison, meaning the rougher, higher level comparison that asks “how similar are these two things, broadly?” In this framework, I want to think about the image paths like features, and so a similarity calculation can come down to comparing two sets of things, and I’ve made a function to do this. It comes down to a ratio between the things they have in common (intersect) over the entire set of things:

          score = 2.0*len(intersect) / (len(pkg1)+len(pkg2))

    I wasn’t sure if “the entire set of things” should include just folder paths, just file paths, or both, and so I decided to try all three approaches. As I mentioned previously, it also would need to be determined if we can further streamline this approach by filtering down the paths first. I started running this on my local machine, but then realized how slow and lame that was. I then put together some cluster scripts in a jiffy, and the entire thing finished before I had finished the script to parse the result. Diabolical!
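
    As a self-contained sketch of the score above (the function I actually used lives in the software I mentioned, so take the name and the handling of empty inputs here as assumptions), the calculation is the Sørensen-Dice coefficient on two sets of paths:

```python
def similarity(pkg1, pkg2):
    """Return 2*|intersection| / (|pkg1| + |pkg2|) for two collections of paths."""
    pkg1, pkg2 = set(pkg1), set(pkg2)
    if not pkg1 and not pkg2:
        return 1.0  # assumption: two empty images count as identical
    intersect = pkg1 & pkg2
    return 2.0 * len(intersect) / (len(pkg1) + len(pkg2))
```

    The score is 1.0 for identical sets and 0.0 for sets with nothing in common. Running it on folders only, files only, or both just means changing which paths go into pkg1 and pkg2.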

    I haven’t had a chance to explore these comparisons in detail yet, but I’m really excited, because there is nice structure in the data. For example, here is the metric comparing images using both files and folders:

    A shout out to plotly for the amazingly easy to use python API! Today was the first time I tried it, and I was very impressed how it just worked! I’m used to coding my own interactive visualizations from scratch, and this was really nice. :) I’m worried there is a hard limit on the number of free graphs I’m allowed to have, or maybe the number of views, and I feel a little squirmy about having it hosted on their server… :O

    Why do we want to compare images?

    Most “container storage” places don’t do a very good job of understanding the guts inside. If I think about Docker Hub, or Github, there are a ton of objects (scripts, containers, etc.) but the organization is largely manual with some search feature that is (programmatically) limited to the queries you can do. What we need is a totally automated, unsupervised way of categorizing and organizing these containers. I want to know if the image I just created is really similar to others, or if I could have chosen a better base image. This is why we need a graph, or a mapping of the landscape of images - first to understand what is out there, and then to help people find what they are looking for, and map what they are working on into the space. I just started this pretty recently, but here is the direction I’m going to stumble in.

    Generating a rough graph of images

    The first goal is to get an even bigger crapton of images, and try to do an estimate of the space. Graphs are easy to work with and visualize, so instead of sets (as we used above) let’s now talk about this in a graph framework. I’m going to try the following:

    1. Start with a big list of (likely) base containers (e.g., Docker library images)
    2. Derive similarity scores based on the rough approach above. We can determine likely parents / children based on one image containing all the paths of another plus more (a child), or a subset of the paths of the other (a parent). This will give us a bunch of tiny graphs, and pairwise similarity scores for all images.
    3. Within the tiny graphs, define potential parent nodes (images) as those that have not been found to be children of any other images.
    4. For all neighbors / children within a tiny graph, do the equivalent comparison, but now on the level of files to get a finer detail score.
    5. Find a strategy to connect the tiny graphs. The similarity scores can do well to generate a graph of all nodes, but we would want a directional graph with nice detail about software installed, etc.
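
    Steps 2 and 3 above can be sketched with plain set comparisons. This is a rough illustration under my own assumptions: images is a hypothetical dictionary mapping an image name to the set of paths inside it, and neither function exists in my software yet.

```python
def find_children(images):
    """Return (parent, child) pairs where the child has every parent path plus more.

    images maps an image name to the set of paths found inside it.
    """
    pairs = []
    for parent, parent_paths in images.items():
        for child, child_paths in images.items():
            # A strict subset means the child contains all parent paths and extras
            if parent != child and parent_paths < child_paths:
                pairs.append((parent, child))
    return sorted(pairs)


def find_roots(images):
    """Step 3: images that were never found to be the child of another image."""
    children = set(child for _, child in find_children(images))
    return sorted(set(images) - children)
```

    Note that this is a pairwise comparison over all images, so it is only reasonable for the first rough pass over a list of likely bases.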

    The last few points are kind of rough, because I’m not ready yet to think about how to fine tune the graph given that I need to build it first. I know a lot of researchers think everything through really carefully before writing any code or trying things, but I don’t have patience for planning and not doing, and like jumping in, starting building, and adjusting as I go. On second thought, I might even want to err away from Singularity to give this a first try. If I use Docker files that have a clear statement about the “parent” image, that means that I have a gold standard, and I can see how well the approach does to find those relationships based on the paths alone.

    Classifying a new image into this space

    Generating a rough heatmap of image similarity (and you could make a graph from this) isn’t too wild an idea, as we’ve seen above. The more challenging part, and the reason that this functionality is useful, is quickly classifying a new image into this space. Why? I’d want to, on the command line, get either a list or open a web interface to immediately see the differences between two images. I’d want to know if the image that I made is similar to something already out there, or if there is a base image that removes some of the redundancy for the image that I made. What I’m leading into is the idea that I want visualizations, and I want tools. Our current understanding of an operating system looks like this:

    Yep, that’s my command line. Everything that I do, notably in Linux, I ssh, open a terminal, and I’ll probably type “ls.” If I have two Linuxy things like containers, do we even have developed methods for comparing them? Do they have the same version of Python? Is one created from the other? I want tools and visualization to help me understand these things.

    We don’t need pairwise comparisons - we need bases

    It would be terrible if, to classify a new image into this space, we had to compare it to every image in our database. We don’t need to, because we can compare it to some set of base images (the highest level of parent nodes that don’t have parents), and then classify it into the graph by walking down the tree, following the most similar path(s). These “base” images we might determine easily based on something like Dockerfiles, but I’d bet we can find them with an algorithm. To be clear, a base image is a kind of special case, for example, those “official” Docker library images like Ubuntu, or Nginx, or postgres that many others are likely to build off of. They are likely to have few to no parent images themselves. It is likely the case that people will add on to base images, and it is less likely they will subtract from them (when is the last time you deleted stuff from your VM when you were extending another image?). Thus, a base image can likely be found by doing the following:

    • Parse a crapton of Docker files, and find the images that are most frequently used
    • Logically, an image that extends some other image is a child of that image. We can build a graph/tree based on this
    • We can cut the tree at some low branch to define a core set of bases.
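
    The first bullet could start out as simply as the sketch below, which counts the image named on the first FROM line of each Dockerfile. The function names are hypothetical, and it deliberately ignores things like multi-stage builds and flags on the FROM line, so treat it as a starting point only.

```python
from collections import Counter


def parent_image(dockerfile_text):
    """Return the image named on the first FROM line, or None if there is none."""
    for line in dockerfile_text.splitlines():
        parts = line.strip().split()
        if len(parts) >= 2 and parts[0].upper() == "FROM":
            return parts[1]
    return None


def rank_bases(dockerfiles):
    """Count how often each parent image is used across many Dockerfiles."""
    counts = Counter()
    for text in dockerfiles:
        parent = parent_image(text)
        if parent is not None:
            counts[parent] += 1
    return counts.most_common()
```

    The (child, parent) pairs that fall out of parent_image are also the edges needed for the tree in the second bullet.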

    Questions and work in progress!

    I was working on something entirely different when I stumbled on this interesting problem. Specifically, I want a programmatic way to automatically label the software in an image. In order to do this, I need to derive interesting “tags.” An interesting tag is basically some software that is installed on top of the base OS. You see how this developed - I needed to derive a set of base OS, and I needed a way to compare things to them. I’ll get back to that, along with the other fun project that I’ve started to go with this - developing visualizations for comparing operating systems! This is for another day! If you are interested in the original work, I am developing a workflow interface using Singularity containers called Singularity Hub. Hubba, hubba!


  • Service Worker Resource Saver

    If you are like me, you probably peruse a million websites in a day. Yes, you’re an internet cat! If you are a “tabby” then you might find links of interest and leave open a million tabs, most definitely to investigate later (really, I know you will)! If you are an “Octocat” then “View Source” is probably your right click of choice, and you are probably leaving open a bunch of raw “.js” or “.css” files to look at something later. If you are an American cat, you probably have a hodge-podge of random links and images. If you are a perfectionist cat (siamese?), you might spend an entire afternoon searching for the perfect image of a donut (or other thing), and have some sub-optimal method for saving them. Not that I’ve ever done that…

    TLDR: I made a temporary stuff saver using service workers. Read on to learn more.

    How do we save things?

    There are an ungodly number of ways to keep lists of things, specifically Google Docs and Google Drive are my go-to places, and many times I like to just open up a new email and send myself a message with said lists. For more permanent things I’m a big fan of Google Keep and Google Save, but this morning I found a use case that wouldn’t quite be satisfied by any of these things. I had a need to keep things temporarily somewhere. I wanted to copy paste links to images and be able to see them all quickly (and save my favorites), but not clutter my well organized and longer term Google Save or Keep with these temporary lists of resources.

    Service Workers, to the rescue!

    This is a static URL that uses a service worker with the postMessage interface to send messages back and forth between a service worker and a static website. This means that you can save and retrieve links, images, and script URLs across windows and sessions! This is pretty awesome, because perhaps when you have a user save stuff you rely on the cache, but what happens if they clear it? You could use some kind of server, but what happens when you have to host things statically (Github pages, I’m looking at you!)? There are so many simple problems where you have some kind of data in a web interface that you want to save, update, and work with across pages, and service workers are perfect for that. Since this was my first go, I decided to do something simple and make a resource saver. This demo is intended for Chrome, and I haven’t tested in other browsers. To modify, stop, or remove workers, visit chrome://serviceworker-internals.

    How does it work?

    I wanted a simple interface where I could copy paste a link, and save it to the cache, and then come back later and click on a resource type to filter my resources:

    I chose material design (lite) because I’ve been a big fan of its flat simplicity, and clean elements. I didn’t spend too much time on this interface design. It’s pretty much some buttons and an input box!

    The gist of how it works is this: you check if the browser can support service workers:

    if ('serviceWorker' in navigator) {
      Stuff.setStatus('Ruh roh!');
    } else {
      Stuff.setStatus('This browser does not support service workers.');
    }

    Note that the “Stuff” object is simply a controller for adding / updating content on the page. Given that we have browser support, we then register a particular javascript file, our service controller commands, to the worker:

    navigator.serviceWorker.register('service-worker.js')
        // Wait until the service worker is active.
        .then(function() {
          return navigator.serviceWorker.ready;
        })
        // ...and then show the interface for the commands once it's ready.
        .then(function() { /* interface setup elided here */ })
        .catch(function(error) {
          // Something went wrong during registration. The service-worker.js file
          // might be unavailable or contain a syntax error.
          Stuff.setStatus(error);
        });

    The magic of what the worker does, then, is encompassed in the “service-worker.js” file, which I borrowed from Google’s example application. This is important to take a look over and understand, because it defines different event listeners (for example, “activate” and “message”) that describe how our service worker will handle different events. If you look through this file, you are going to see a lot of the function “postMessage”, and actually, this is the service worker API way of getting some kind of event from the browser to the worker. It makes sense, then, that if you look in our javascript file that has different functions fire off when the user interacts with buttons on the page, you are going to see a ton of a function sendMessage that opens up a MessageChannel and sends our data to the worker. It’s like browser ping pong, with data instead of ping pong balls. You can view in the console of the demo and type in any of “MessageChannel”, “sendMessage” or “postMessage” to see the functions in the browser:

    If we look closer at the sendMessage function, it starts to make sense what is going on. What is being passed back and forth are Promises, which help developers (a bit) with the callback hell that is characteristic of Javascript. I haven’t had huge experience with using Promises (or service workers), but I can tell you this is something to start learning and trying out if you plan to do any kind of web development:

    function sendMessage(message) {
      // This wraps the message posting/response in a promise, which will resolve if the response doesn't
      // contain an error, and reject with the error if it does. If you'd prefer, it's possible to call
      // controller.postMessage() and set up the onmessage handler independently of a promise, but this is
      // a convenient wrapper.
      return new Promise(function(resolve, reject) {
        var messageChannel = new MessageChannel();
        messageChannel.port1.onmessage = function(event) {
          if (event.data.error) {
            reject(event.data.error);
          } else {
            resolve(event.data);
          }
        };
        // This sends the message data as well as transferring messageChannel.port2 to the service worker.
        // The service worker can then use the transferred port to reply via postMessage(), which
        // will in turn trigger the onmessage handler on messageChannel.port1.
        // See https://html.spec.whatwg.org/multipage/workers.html#dom-worker-postmessage
        navigator.serviceWorker.controller.postMessage(message, [messageChannel.port2]);
      });
    }

    The documentation is provided from the original example, and it’s beautiful! The simple functionality I added is to parse the saved content into different types (images, script/style and other content)

    …as well as download a static list of all of your resources (for quick saving).

    More content-specific link rendering

    I’m wrapping up playing around for today, but wanted to leave a final note. As usual, after an initial bout of learning I’m unhappy with what I’ve come up with, and want to minimally comment on the ways it should be improved. I’m just thinking of this now, but it would be much better to have one of the parsers detect video links (from youtube or what not) and then have them rendered in a nice player. It would also make sense to have a share button for one or more links, with parsing into a data structure to be immediately shared, or sent to something like a Github gist. I’m definitely excited about the potential for this technology in web applications that I’ve been developing. For example, in some kind of workflow manager, a user would be able to add functions (or containers, in this case) to a kind of “workflow cart” and then when he/she is satisfied, click an equivalent “check out” button that renders the view to dynamically link them together. I also imagine this could be used in some way for collaboration on documents or web content, although I need to think more about this one.

    Demo the Stuff Saver


  • Neo4J and Django Integration

    What happens when a graph database car crashes into a relational database car? You get neo4-django, of course! TLDR: you can export cool things like this from a Django application:

    Neo4j-Django Gist

    I’ve been working on the start of version 2.0 of the Cognitive Atlas, and the process has been a little bit like stripping a car, and installing a completely new engine while maintaining the brand and look of the site. I started with pages that looked like this:

    meaning that fixing up this site comes down to inferring the back end functionality from this mix of Javascript / HTML and styling, and turning them into Django templates working with views that have equivalent functionality.

    Neo For What?

    Neo4J is a trendy graph database that emerged in 2007, but I didn’t really hear about it until 2014 or 2015 when I played around with it to visualize the nidm data model, a view of the Cognitive Atlas and of the NIF ontology (which seems like it was too big to render in a gist). It’s implemented in Java, and what makes it a “graph” database is the fact that it stores nodes and relationships. This is a much different model than traditional relational databases, which work with things in tables. There are pros and cons of each, however for a relatively small graph that warrants basic similarity metrics, graph visualizations, and need for an API, I thought Neo4j was a good use case. Now let’s talk about how I got it to work with Django.

    Django Relational Models

    Django is based on the idea of models. A model is a class of objects directly linked to objects in the relational database, so if I want to keep track of my pet marshmallows, I might make a django application called “marshdb” and I can do something like the following:

    from django.db import models
    class Marshmallow(models.Model):
        is_toasted = models.BooleanField(default=True)
        name = models.CharField(max_length=30)

    and then I can search, query, and interact with my marshmallow database with very intuitive functionality:

    from marshdb.models import Marshmallow
    # All the marshmallows!
    nomnom = Marshmallow.objects.all()
    # Find me the toasted ones
    toasted_mallows = Marshmallow.objects.filter(is_toasted=True)
    # How many pet marshmallows do I have?
    marshmallow_count = Marshmallow.objects.count()
    # Find Larry
    larry = Marshmallow.objects.get(name="Larry")


    Django is fantastic - it makes it possible to create an entire site and database backend, along with whatever plugins you might want, even in the span of a weekend! My first task was to figure out how to integrate a completely different kind of database into a relational infrastructure. Django provides ample detail on how to instantiate your own models, but integrating a non-relational database is not trivial. I found neo4django, but it wasn’t updated for newer versions of Django, and it didn’t seem to be taking a clean and simple approach to integrating Neo4j. Instead, I decided to come up with my own solution.

    Step 1: Dockerize!

    Deployment and development is much easier with Docker, period. Need neo4j run via Docker? Kenny Bastani (holy cow he’s just in San Mateo! I could throw a rock at him!) has a solution for that! Basically, I bring in the neo4j container:

      image: kbastani/docker-neo4j:latest
      ports:
       - "7474:7474"
       - "1337:1337"
      links:
       - mazerunner
       - hdfs

    and then link it to a docker image that is running the Django application:

        image: vanessa/cogat-docker
        command: /code/uwsgi.sh
        restart: always
        volumes:
            - .:/code
            - /var/www/static
        links:
            - postgres
            - graphdb

    You can look at the complete docker-compose file, and Kenny’s post on the mazerunner integration for integrating graph analytics with Apache Spark.

    This isn’t actually the interesting part, however. The fun and interesting bit is getting something that looks like a Django model for the user to interact with that entirely isn’t :).

    Step 2: The Query Module

    As I said previously, I wanted this to be really simple. I created a Node class that includes the same basic functions as a traditional Django model (get, all, filter, etc.), and added a few new ones:

        def link(self,uid,endnode_id,relation_type,endnode_type=None,properties=None):
            '''link will create a new link (relation) from a uid to a relation, first confirming
            that the relation is valid for the node
            :param uid: the unique identifier for the source node
            :param endnode_id: the unique identifier for the end node
            :param relation_type: the relation type
            :param endnode_type: the type of the second node. If not specified, assumed to be same as startnode
            :param properties: properties to add to the relation
            '''

    … blarg blarg blarg

        def cypher(self,uid,lookup=None,return_lookup=False):
            '''cypher returns a data structure with nodes and relations for an object to generate a gist with cypher
            :param uid: the node unique id to look up
            :param lookup: an optional lookup dictionary to append to
            :param return_lookup: if true, returns a lookup with nodes and relations that are added to the graph
            '''
            base = self.get(uid)[0]

    and then I might instantiate it like this for the “Concept” node:

    class Concept(Node):
        def __init__(self):
            self.name = "concept"
            self.fields = ["id","name","definition"]
            self.relations = ["PARTOF","KINDOF","MEASUREDBY"]
            self.color = "#3C7263" # sea green

    and you can see that generally, I just need to define the fields, relations, and name of the node in the graph database to get it working. Advanced functionality that might be needed for specific node types can be implemented for those particular classes.
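
    As a rough illustration of the pattern (and not the Cognitive Atlas code itself), here is a minimal sketch of a Node base class that only builds Cypher query text from the name, fields, and relations that a subclass declares. The method names get_query, all_query, and link_query are hypothetical, the example ids are made up, and a real version would send the queries to Neo4j and use query parameters instead of string formatting.

```python
class Node:
    """Hypothetical base class: it builds Cypher text and never talks to a database."""

    name = "node"
    fields = ["id", "name"]
    relations = []

    def _return_fields(self):
        return ",".join("n.%s" % field for field in self.fields)

    def get_query(self, uid):
        # Look up one node of this type by its unique id.
        # String formatting is for illustration only; real code should pass
        # the id as a query parameter to avoid injection.
        return "MATCH (n:%s) WHERE n.id = '%s' RETURN %s" % (
            self.name, uid, self._return_fields())

    def all_query(self):
        # Return every node of this type
        return "MATCH (n:%s) RETURN %s" % (self.name, self._return_fields())

    def link_query(self, uid, endnode_id, relation_type, endnode_type=None):
        # First confirm that the relation is valid for this node type
        if relation_type not in self.relations:
            raise ValueError("%s is not a valid relation for %s"
                             % (relation_type, self.name))
        # If not specified, the end node is assumed to be the same type
        endnode_type = endnode_type or self.name
        return ("MATCH (a:%s),(b:%s) WHERE a.id = '%s' AND b.id = '%s' "
                "CREATE (a)-[r:%s]->(b) RETURN r"
                % (self.name, endnode_type, uid, endnode_id, relation_type))


class Concept(Node):
    name = "concept"
    fields = ["id", "name", "definition"]
    relations = ["PARTOF", "KINDOF", "MEASUREDBY"]


if __name__ == "__main__":
    concept = Concept()
    print(concept.get_query("trm_a"))
    print(concept.link_query("trm_a", "trm_b", "KINDOF"))
```

    The point of the design is that a new node type only has to declare data, while the query logic lives in one place.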

    Functionality for any node in the graph can be added to the “Node” class. The function “link” for example, will generate a relationship between an object and some other node, and “cypher” will produce node and link objects that can be rendered immediately into a neo4j gist. This is where I see the intersection of Django and Neo4j - adding graph functions to their standard model. Now how to visualize the graph? I like developing my own visualizations, and made a general, searchable graph run by the application across all node types:

    However I realized that a user is going to want more power than that to query, make custom views, and further, share them. The makers of Neo4j were smart, and realized that people might want to share snippets of code as github gists to make what they call a graph gist. I figured why not generate a URL to render this cypher code that can then immediately be rendered into a preview, and then optionally exported and saved by the user? The awesome part of this is that it sends the computing of the graph part off of the Cognitive Atlas server, and you can save views of the graph. For example, here is a gist that shows a view of the working memory fMRI task paradigm. If you’re a visual learner, you can learn from looking at the graph itself:

    You can see example cypher queries, with results rendered into clean tables:

    and hey, you can write your own queries against the graph!

    This is a work in progress and it’s not perfect, but I’m optimistic about the direction it’s going in. If more ontologies / graph representations of knowledge were readily explorable, and sharable in this way, the semantic web would be a lot easier to understand and navigate.

    Relational Database Help

    Why then should we bother to use a relational database via Django? I chose this strategy because it keeps the model of the Cognitive Atlas separate from any applications deploying or using it. It provides a solid infrastructure for serving a RESTful API:

    and basic functionalities like storing information about users, and any (potential) future links to automated methods to populate it, etc.

    General Thinking about Visualization and Services

    This example gets at a general strategy that is useful to consider when building applications, and that is the idea of “outsourcing” some of your analysis or visualization to third parties. In the case of things that just need a web server, you might store code (or text) in a third party service like Github or Dropbox, and use something like Github Pages or another third party to render a site. In the case of things that require computation, you can take advantage of Continuous Integration to do much more than run tests. In this example, we outsourced a bit of computation and visualization. In the case of developing things that are useful for people, I sometimes think it is more useful to build a generic “thing” that can turn some standard data object (eg, some analysis result, data, or text file) and render it into some more programmatic data structure that can plug into (some other tool) that makes it relatable to other individual’s general “things.” I will spend some time in another post to more properly articulate this idea, but the general take away is that as a user you should be clever when you are looking for a certain functionality, and as a developer you should aim to provide general functions that have wide applicability.

    Cognitive Atlas 2.0

    The new version of the Cognitive Atlas has so far been a fun project I’ve worked on in free time, and I would say you can expect to see cool things develop in the next year or two, even if I’m not the one to push the final changes. In the meantime, I encourage all researchers working with behavioral or cognitive paradigms, perhaps using the Experiment Factory or making an assertion about a brain map capturing a cognitive paradigm in the NeuroVault database, to do this properly by defining paradigms and cognitive concepts in the current version of the Cognitive Atlas. If you have feedback or want to contribute to developing this working example of integrating Neo4j and Django, please jump in. Even a cool idea would be a fantastic contribution. Time to get back to work! Well, let’s just call this “work,” I can’t say I’m doing much more than walking around and smiling like an idiot in this final lap of graduate school. :)


  • The Elusive Donut

    Elusive Donut

    A swirl of frosting and pink
    really does make you think
    Take my hunger away, won’t ‘ut?
    Unless you’ve browsed an elusive donut!

    elusive donut


  • Interactive Components for Visualizations

    If you look at most interactive visualizations that involve something like D3, you tend to see lots of circles, bars, and lines. There are several reasons for this. First, presenting information simply and cleanly is optimal to communicate an idea. If you show me a box plot that uses different Teletubbies as misshapen bars, I am likely going to be confused, a little angry, and miss the data lost in all the tubby. Second, basic shapes are the default available kind of icon built into these technologies, and any variability from that takes extra work.

    Could there be value in custom components?

    This raises the question - if we were able to, would the world of data science specific to generating interactive visualizations be improved with custom components? I imagine the answer to this question is (as usual), “It depends.” The other kind of common feature you see in something like D3 is a map. The simple existence of a map, and an ability to plot information on it, adds substantially to our ability to communicate something meaningful about a pattern across geographical regions. The human mind is quicker to understand something with geographic salience overlaid on a map than the same information provided in some kind of chart with labels corresponding to geographic regions. Thus, I see no reason that we cannot have other simple components for visualizations that take advantage of our familiarity with a domain or space to bring some intuitive understanding of it.

    A Bodymap

    My first idea (still under development) was to develop a simple map for the human body. I can think of so many domains crossing medicine, social media, and general science that have a bodygraphic salience. I was in a meeting with radiologists many weeks ago, and I found it surprising that there weren’t standard templates for an entire body (we have them for brain imaging). A standard template for a body in the context of radiology is a much different goal than one for a visualization, but the same reality rings true. I decided that a simple approach would be to take a simple representation, transform it into a bunch of tiny points, and then annotate different sets of points with different labels (classes). The labels can then be selected dynamically with any kind of web technology (d3, javascript, jquery, etc.) to support an interactive visualization. For example, we could parse a set of documents, extract mentions of body parts, and then use the Bodymap like a heatmap to show the prevalence of the different terms.
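
    The heatmap idea can be sketched in a few lines. This is only an assumption about how the counting might work: the function names are made up, the matching is limited to single-word terms, and the scaling to a 0..1 "heat" value is one arbitrary choice among many.

```python
import re
from collections import Counter


def count_body_terms(documents, terms):
    """Count how often each (single word, lowercase) term appears across documents."""
    counts = Counter({term: 0 for term in terms})
    for text in documents:
        for word in re.findall(r"[a-z]+", text.lower()):
            if word in counts:
                counts[word] += 1
    return counts


def to_heat(counts):
    """Scale counts to 0..1 so they can drive color or opacity on the map."""
    top = max(counts.values(), default=0) or 1
    return {term: count / top for term, count in counts.items()}


if __name__ == "__main__":
    documents = ["My head hurts and my knee hurts", "Head over heels", "knee knee"]
    print(to_heat(count_body_terms(documents, ["head", "knee", "eye"])))
```

    Each heat value would then be applied to the set of points that carry the matching label.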

    Generating an svg pointilism from any png image

    My first task was to be able to take any png image and turn it into a set of points. I first stupidly opened up Inkscape and figured out how to use the clone tool to generate a bunch of points. Thankfully I realized quickly that before I made my BodyMap tool, I needed general functions for working with images and svg. I am in the process of creating svgtools for this exact goal! For example, with this package you can transform a png image into an svg (pointilism thing) with one function:

    from svgtools.generate import create_pointilism_svg
    # Select your png image!
    png_image = "data/body.png"
    # uid_base will be the id of the svg
    # sample rate determines the space between points (larger --> more space)


    I expect to be adding a lot of different (both manual and automated) methods here for making components, so keep watch of the package if interested.
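
    To show the general idea of sampling an image into points, here is a small standard-library sketch. It is not the svgtools implementation: it assumes the png has already been read into a 2D list of 0/1 values (a mask), the function name mask_to_point_svg is hypothetical, and it writes circles where the real package may use other elements.

```python
def mask_to_point_svg(mask, uid_base="pointilism", sample_rate=2, radius=1):
    """Turn a 2D 0/1 mask into an svg string with one circle per sampled pixel.

    sample_rate is the step between sampled pixels (larger means more space).
    """
    height = len(mask)
    width = len(mask[0]) if mask else 0
    circles = []
    for y in range(0, height, sample_rate):
        for x in range(0, width, sample_rate):
            if mask[y][x]:
                # Each point gets its own id so that it can be annotated later
                circles.append('<circle id="%s_%d" cx="%d" cy="%d" r="%d"/>'
                               % (uid_base, len(circles), x, y, radius))
    return ('<svg xmlns="http://www.w3.org/2000/svg" id="%s" width="%d" height="%d">\n%s\n</svg>'
            % (uid_base, width, height, "\n".join(circles)))


if __name__ == "__main__":
    mask = [[1] * 4 for _ in range(4)]
    print(mask_to_point_svg(mask, sample_rate=2))
```

    With a 4 x 4 mask of ones, a sample rate of 2 keeps four points and a sample rate of 1 keeps all sixteen.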

    This allowed me to transform this png image:

    into a “pointilism svg” (this is for a sampling rate of 8, meaning that I add more space between the points)

    actual svg can be seen here

    Great! Now I need some way to label the paths with body regions, so I can build stuff. How to do that?

    Terms and relationships embedded in components

    We want to be able to (manually and automatically) annotate svg components with terms. This is related to a general idea that I like to think about - how can we embed data structures in visualizations themselves? An svg object (a scalable vector graphic) is in fact just an XML document, which is also a data structure for holding (you guessed it, data!). Thus, if we take a set of standard terms and relationships between them (i.e., an ontology), we can represent the terms as labels in an image, and the relationships by the relationship between the objects (e.g., “eye” is “part of” the “head” is represented by way of the eye literally being a part of the head!). My first task, then, was to take terms from the Foundational Model of Anatomy (FMA) and use them as tags for my BodyMap.

    A little note about ontologies - they are usually intended for a very specific purpose. For example, the FMA needs to be detailed enough for use in science and medicine. However, if I’m extracting “body terms” from places like Twitter or general prose, I can tell you with almost certainty that you might find a term like “calf” but probably not “gastrocnemius.” My first task was to come up with a (very simple) list of terms from the FMA that I thought would be likely to be seen in general conversation or places like the Twitterverse. It’s not an all-encompassing set, but it’s a reasonable start.

    Annotation of the BodyMap

    I then had my svg for annotation, and I had my terms, how to do the annotation? I built myself a small interface for this goal exactly. You load your svg images and labels, and then draw circles around points you want to select, for example here I have selected the head:

    and then you can select terms from your vocabulary:

    and click annotate! The selection changes to indicate that the annotation has been done.

    Selecting a term and clicking “view” will highlight the annotation, in case you want to see it again. When you are finished, you can save the svg, and see that the annotation is present for the selected paths via an added class attribute:

    This is the simple functionality that I desired for this first round, and I imagine I’ll add other things as I need them. And again, ideally we will have automated methods to achieve these things in the long run, and we would also want to be able to take common data structures and images, convert them seamlessly into interactive components, and maybe even have a database for users to choose from. Imagine if we had a database of standard components for use, we could use them as features to describe visualizations, and get a sense of what the visualization is representing by looking at it statically. We could use methods from image processing and computer vision to generate annotated components automatically, and blur the line between what is data and what is visual. Since this is under development and my first go, I’ll just start by doing this annotation myself. I just created the svgtools package and this interface today, so stay tuned for more updates!
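
    Since the annotation ends up as a class attribute in the XML, the core of it can be sketched with the standard library. This is an illustration under assumptions, not the interface's code: the function name annotate is made up, the points are assumed to be circle elements, and the "drawn circle" selection is reduced to a center and a radius.

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"


def annotate(svg_text, label, center, radius):
    """Add `label` to the class attribute of every circle inside the selection."""
    ET.register_namespace("", SVG_NS)  # keep the default svg namespace on output
    root = ET.fromstring(svg_text)
    center_x, center_y = center
    for circle in root.iter("{%s}circle" % SVG_NS):
        x, y = float(circle.get("cx")), float(circle.get("cy"))
        if (x - center_x) ** 2 + (y - center_y) ** 2 <= radius ** 2:
            classes = circle.get("class", "").split()
            if label not in classes:
                classes.append(label)
            circle.set("class", " ".join(classes))
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    svg = ('<svg xmlns="http://www.w3.org/2000/svg">'
           '<circle cx="0" cy="0" r="1"/><circle cx="10" cy="10" r="1"/></svg>')
    print(annotate(svg, "head", center=(0, 0), radius=2))
```

    Because a point can carry several classes, nested terms (a point that is both "head" and "face") fall out naturally.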

    annotation interface demo


  • Visualizations, Contain yourselves!

    Visualizing things is really challenging. The reason is that it’s relatively easy to make a visualization that is too complex for what it’s trying to show, and it’s much harder to make a visualization catered for a specific analysis problem. Simplicity is usually the best strategy, but while standard plots (e.g., scatter, box and whisker, histogram) are probably ideal for publications, they aren’t particularly fun to think about. You also have the limitations of your medium and where you paint the picture. For example, a standard web browser will get slow when you try to render ~3000 points with D3. In these cases you are either trying to render too many, or you need a different strategy (e.g., render the points on canvas, trading interactivity for the ability to show more of them).

    I recently embarked on a challenge to visualize a model defined at every voxel in the brain (a voxel is a little 3D cube of brain landscape associated with an X,Y,Z coordinate). Why would I want to do this? I won’t go into details here, but with such models you could predict what a statistical brain map might look like based on cognitive concepts, or predict a set of cognitive concepts from a brain map. This work is still being prepared for publication, but we needed a visualization because the diabolical Poldrack is giving a talk soon, and it would be nice to have some way to show output of the models we had been working on. TLDR: I made a few Flask applications and shoved them into Docker containers with all necessary data, and this post will review my thinking and design process. The visualizations are in no way “done” (whatever that means) because there are details and fixes remaining.

    Step 1: How to cut the data

    We have over 28K models, each built from a set of ~100 statistical brain maps (yes, tiny data) with 132 cognitive concepts from the Cognitive Atlas. When you think of the internet, it’s not such big data, but it’s still enough to make putting it in a single figure challenging. Master Poldrack had sent me a paper from the Gallant Lab, and directed me to Figure 2:

    Gallant lab figure 2

    I had remembered this work from the HIVE at Stanford, and what I took away from it was the idea for the strategy. If we wanted to look at the entire model for a concept, that’s easy, look at the brain maps. If we want to understand all of those brain maps at one voxel, then the visualization needs to be voxel-specific. This is what I decided to do.

    Step 2: Web framework

    Python is awesome, and the trend for neuroimaging analysis tools is moving toward Python dominance. Thus, I decided to use a small web framework called Flask that makes data –> server –> web almost seamless. It takes a template approach, meaning that you write views for a python-based server to render, and they render using jinja2 templates. You can literally make a website in under 5 minutes.

    Step 3: Data preparation

    This turned out to be easy. I could generate either tab delimited or python pickled (think a compressed data object) files, and store them with the visualizations in their respective Github repos.

    Regions from the AAL Atlas

    At first, I generated views to render a specific voxel location, some number from 1..28K that corresponded with an X,Y,Z coordinate. The usability of this is terrible. Is someone really going to remember that voxel N corresponds to “somewhere in the right Amygdala?” Probably not. What I needed was a region lookup table. I wasn’t decided yet about how it would work, but I knew I needed to make it. First, let’s import some bread and butter functions!

    import numpy
    import pandas
    import nibabel
    import requests
    import xmltodict
    from nilearn.image import resample_img
    from nilearn.plotting import find_xyz_cut_coords

    The requests library is important for getting anything from a URL into a python program. nilearn is a nice machine learning library for python (that I usually don’t use for machine learning at all, but rather for the helper functions), and xmltodict will do exactly that, convert an xml file into a superior data format :). First, we are going to use the NeuroVault REST API to obtain both a nice brain map and its labels. The script that runs this particular python script has already downloaded the brain map itself, so now we are going to load it, resample to a 4mm voxel (to match the data in our model), and then associate a label with each voxel:

    data = nibabel.load("AAL2_2.nii.gz")
    img4mm = nibabel.load("MNI152_T1_4mm_brain_mask.nii.gz")
    # Use nilearn to resample - nearest neighbor interpolation to maintain atlas
    aal4mm = resample_img(data,interpolation="nearest",target_affine=img4mm.get_affine())
    # Get labels
    labels = numpy.unique(aal4mm.get_data()).tolist()
    # We don't want to keep 0 as a label
    labels = [label for label in labels if label != 0]
    url = "http://neurovault.org/api/atlases/14255/?format=json"
    response = requests.get(url).json()

    We now have a json object with a nice path to the labels xml! Let’s get that file, convert it to a dictionary, and then parse away, Merrill.

    # This is an xml file with label descriptions
    xml = requests.get(response["label_description_file"])
    doc = xmltodict.parse(xml.text)["atlas"]["data"]["label"]  # convert to a superior data structure :)

    Pandas is a module that makes nice data frames. You can think of it like a numpy matrix, but with nice row and column labels, and functions to sort and find things.

    # We will store region voxel value, name, and a center coordinate
    regions = pandas.DataFrame(columns=["value","name","x","y","z"])
    # Count is the row index, fill in data frame with region names and indices
    count = 0
    for region in doc:
        regions.loc[count,"value"] = int(region["index"])
        regions.loc[count,"name"] = region["name"]
        count += 1

    I didn’t actually use this in the visualization, but I thought it might be useful to store a “representative” coordinate for each region:

    # USE NILEARN TO FIND REGION COORDINATES (the center of the largest activation connected component)
    for region in regions.iterrows():
        label = region[1]["value"]
        roi = numpy.zeros(aal4mm.shape)
        roi[aal4mm.get_data()==label] = 1
        nii = nibabel.Nifti1Image(roi,affine=aal4mm.get_affine())
        x,y,z = [int(x) for x in find_xyz_cut_coords(nii)]
        regions.loc[region[0],["x","y","z"]] = [x,y,z]

    and then save the data to file, both the “representative” coords, and the entire aal atlas as a squashed vector, so we can easily associate the 28K voxel locations with regions.

    # Save data to file for application
    # We will also flatten the brain-masked imaging data into a vector,
    # so we can select a region x,y,z based on the name
    region_lookup = pandas.DataFrame(columns=["aal"])
    region_lookup["aal"] = aal4mm.get_data()[img4mm.get_data()!=0]


    For this first visualization, that was all that was needed in the way of data prep. The rest of the files I already had on hand, nicely formatted, from the analysis code itself.
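
    The "squashed vector" trick is easier to see on a toy example. The sketch below uses plain Python lists instead of numpy and nibabel, so it is only a model of the idea: both volumes are assumed to be already raveled into flat lists in the same voxel order, and the function names and region values are made up for illustration.

```python
def flatten_masked(atlas, mask):
    """Keep the atlas value of every voxel that falls inside the brain mask."""
    return [value for value, inside in zip(atlas, mask) if inside != 0]


def voxels_by_region(region_lookup, region_names):
    """Map each region name to the indices of its voxels in the masked vector."""
    by_region = {}
    for voxel_idx, value in enumerate(region_lookup):
        name = region_names.get(value)
        if name is not None:  # value 0 (no label) has no name and is skipped
            by_region.setdefault(name, []).append(voxel_idx)
    return by_region


if __name__ == "__main__":
    atlas = [0, 1, 1, 2, 0, 2]  # toy region value per voxel
    mask = [0, 1, 1, 1, 1, 0]   # toy brain mask
    lookup = flatten_masked(atlas, mask)
    print(voxels_by_region(lookup, {1: "Amygdala_R", 2: "Hippocampus_L"}))
```

    The index into the masked vector is the same voxel index that the models use, which is what makes the lookup useful.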

    Step 4: First Attempt: Clustering

    My first idea was to do a sort of “double clustering.” I scribbled the following into an email late one night:

    …there are two things we want to show. 1) is relationships between concepts, specifically for that voxel. 2) is the relationship between different contrasts, and then how those contrasts are represented by the concepts. The first data that we have that is meaningful for the viewer are the tagged contrasts. For each contrast, we have two things: an actual voxel value from the map, and a similarity metric to all other contrasts (spatial and/or semantic). A simple visualization would produce some clustering to show to the viewer how the concepts are similar / different based on distance. The next data that we have “within” a voxel is information about concepts at that voxel (and this is where the model is integrated). Specifically - a vector of regression parameters for that single voxel. These regression parameter values are produced via the actual voxel values at the map (so we probably would not use both). What I think we want to do is have two clusterings - first cluster the concepts, and then within each concept bubble, show a smaller clustering of the images, clustered via similarity, and colored based on the actual value in the image (probably some shade of red or blue).

    Yeah, please don’t read that. The summary is that I would show clusters of concepts, and within each concept cluster would be a cluster of images. Distance on the page, from left to right, would represent the contribution of the concept cluster to the model at the voxel. This turned out pretty cool:

    You can mouse over a node, which is a contrast image (a brain map) associated with a particular cognitive concept, and see details (done by way of tipsy). Only concepts that have a weight (weight –> importance in the model) that is not zero are displayed (and this reduces the complexity of the visualization quite a bit), and the nodes are colored and sized based on their value in the original brain map (red/big –> positive, and blue/small –> negative):

    You can use the controls in the top right to expand the image, save as SVG, link to the code, or read about the application:

    You can also select a region of choice from the dropdown menu, which uses select2 to complete your choice. At first I showed the user the voxel location I selected as “representative” for the region, but I soon realized that there were quite a few large regions in the AAL atlas, and that it would be incorrect and misleading to select a representative voxel. To embrace the variance within a region but still provide meaningful labels, I implemented it so that a user can select a region, and a random voxel from the region is selected:

        # Look up the value of the region
        value = app.regions.value[app.regions.name==name].tolist()[0]
        # Select a voxel coordinate at random
        voxel_idx = numpy.random.choice(app.region_lookup.index[app.region_lookup.aal == value],1)[0]
        return voxel(voxel_idx,name=name)
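
    For a version of the same selection that can be run without Flask or pandas, here is a minimal sketch. It assumes the regions are held in a plain dictionary of name to atlas value and the lookup in a flat list; the function name and the explicit error for a region with no voxels are additions for illustration, not part of the application.

```python
import random


def random_voxel_for_region(name, regions, region_lookup, rng=random):
    """Pick a random voxel index whose atlas value matches the named region."""
    value = regions[name]
    candidates = [idx for idx, v in enumerate(region_lookup) if v == value]
    if not candidates:
        # A small region could in principle have no voxels left after resampling
        raise LookupError("no voxels found for region %s" % name)
    return rng.choice(candidates)


if __name__ == "__main__":
    regions = {"Amygdala_R": 1, "Hippocampus_L": 2}  # toy values
    region_lookup = [1, 1, 2, 0]
    print(random_voxel_for_region("Amygdala_R", regions, region_lookup))
```

    Passing a seeded random.Random instance as rng makes the choice reproducible, which is handy for testing.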

    Typically, Flask view functions return… views :). In this case, the view returned is the original one that I wrote (the function is called voxel) to render a view based on a voxel id (from 1..28K). The user just sees a dropdown to select a region:

    Finally, since there are multiple images tagged with the same concept in an image, you can mouse over a concept label to highlight those nodes in the image. You can also mouse over a concept label to highlight all the concepts associated with the image. We also obtain a sliced view of the image from NeuroVault to show to the user.

    Check out the full demo

    Step 5: Problems with First Attempt

    I first thought it was a pretty OK job, until my extremely high-standard brain started to tell me how crappy it was. The first problem is that the same image is shown for every concept it’s relevant for, and that’s both redundant and confusing. It also makes no sense at all to be showing an entire brain map when the view is defined for just one voxel. What was I thinking?

    The second problem is that the visualization isn’t intuitive. It’s a bunch of circles floating in space, and you have to read the “about” very carefully to say “I think I sort of get it.” I tried to use meaningful things for color, size, and opacity, but it doesn’t really give you a sense of anything other than, maybe, magnetic balls floating in gray space.

    I thought about this again. What a person really wants to know, quickly, are

    1) which cognitive concepts are associated with the voxel?
    2) How much?
    3) How do the concepts relate in the ontology?

    I knew very quickly that the biggest missing component was some representation of the ontology. How was “recognition” related to “memory”? Who knows! Let’s go back to the drawing board, but first, we need to prepare some new data.

    Step 6: Generating a Cognitive Atlas Tree

    A while back I added some functions to pybraincompare to generate d3 trees from ontologies, or anything you could represent with triples. Let’s do that with the concepts in our visualization to make a simple json structure that has nodes with children.

    from pybraincompare.ontology.tree import named_ontology_tree_from_tsv
    from cognitiveatlas.datastructure import concept_node_triples
    import pickle
    import pandas
    import re

    First we will read in our images, and we only need to do this to get the image contrast labels (a contrast is a particular combination / subtraction of conditions in a task, like “looking at pictures of cats minus baseline”).

    # Read in images metadata
    images = pandas.read_csv("../data/contrast_defined_images_filtered.tsv",sep="\t",index_col="image_id")

    The first thing we are going to do is generate a “triples data structure,” a simple format I came up with that would be simple for pybraincompare to understand that would allow it to render any kind of graph into the tree. It looks like this:

      id  parent  name
      1   none    BASE                  # there is always a base node
      2   1       MEMORY                # high level concept groups
      3   1       PERCEPTION
      4   2       WORKING MEMORY        # concepts
      5   2       LONG TERM MEMORY
      6   4       image1.nii.gz         # associated images (discovered by way of contrasts)
      7   4       image2.nii.gz

    Each node has an id, a parent, and a name. For the next step, I found the unique contrasts represented in the data (we have more than one image for contrasts), and then made a lookup to find sets of images based on the contrast.

    # We need a dictionary to look up image lists by contrast ids
    unique_contrasts = images.cognitive_contrast_cogatlas_id.unique().tolist()
    # Images that do not match the correct identifier will not be used (eg, "Other")
    expression = re.compile("cnt_")  # ids must start with the "cnt_" prefix
    unique_contrasts = [u for u in unique_contrasts if expression.match(u)]
    image_lookup = dict()
    for u in unique_contrasts:
       image_lookup[u] = images.index[images.cognitive_contrast_cogatlas_id==u].tolist()

    To make the table I showed above, I had added a function to the Cognitive Atlas API python wrapper called concept_node_triples.

    output_triples_file = "../data/concepts.tsv"
    # Create a data structure of tasks and contrasts for our analysis
    relationship_table = concept_node_triples(image_dict=image_lookup,output_file=output_triples_file)

    The function includes the contrast images themselves as nodes, so let’s remove them from the data frame before we generate and save the JSON object that will render into a tree:

    # We don't want to keep the images on the tree
    keep_nodes = [x for x in relationship_table.id.tolist() if not re.search("node_",x)]
    relationship_table = relationship_table[relationship_table.id.isin(keep_nodes)]
    tree = named_ontology_tree_from_tsv(relationship_table,output_json=None)


    Boum! Ok, now back to the visualization!

    Step 7: Second Attempt: Tree

    For this attempt, I wanted to render a concept tree in the browser, with each node in the tree corresponding to a cognitive concept, and colored by the “importance” (weight) in the model. As before, red would indicate positive weight, and blue negative (this is a standard in brain imaging, by the way). To highlight the concepts that are relevant for the particular voxel model, I decided to make the weaker nodes more transparent, and nodes with no contribution (weight = 0) completely invisible. However, I would maintain the tree structure to give the viewer a sense of distance in the ontology (distance –> similarity). This tree would also solve the problem of understanding relationships between concepts. They are connected!
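
    The coloring rule can be written down in a few lines. This is only a sketch of the idea, assuming a linear mapping from absolute weight to opacity; the function name node_style and the scaling are made up for illustration and are not the code used in the demo.

```python
def node_style(weight, max_abs_weight):
    """Map a regression weight to a fill color and opacity (illustrative rule)."""
    if weight == 0 or max_abs_weight == 0:
        # No contribution: keep the node in the tree, but invisible
        return {"color": "none", "opacity": 0.0}
    color = "red" if weight > 0 else "blue"
    # Weaker nodes are more transparent (linear scaling is an assumption)
    opacity = min(abs(weight) / max_abs_weight, 1.0)
    return {"color": color, "opacity": opacity}

print(node_style(0.5, 1.0))
print(node_style(-2.0, 2.0))
```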

    As before, mousing over a node provides more information:

    and the controls are updated slightly to include a “find in page” button:

    Which, when you click on it, brings up an overlay where you can select any cognitive concepts of your choice with clicks, and they will light up on the tree!

    If you want to know the inspiration for this view, it’s a beautiful installation at the Stanford Business School that I’m very fond of:

    The labels were troublesome, because if I rendered too many it was cluttered and unreadable, and if I rendered too few it wasn’t easy to see what you were looking at without mousing over things. I found a rough function that helped a bit, but my quick fix was to simply limit the labels shown based on the number of images (count) and the regression parameter weight:

        // Add concept labels
        var labels = node.append("text")
            .attr("dx", function (d) { return d.children ? -2 : 2; })
            .attr("dy", 0)
            .style("font","14px sans-serif")
            .style("text-anchor", function (d) { return d.children ? "end" : "start"; })
            .html(function(d) {
                // Only show label for larger nodes with regression parameter >= +/- 0.5
                if ((counts[d.nid]>=15) && (Math.abs(regparams[d.nid])>=0.5)) {
                    return d.name;
                }
                // All other nodes get no label
                return "";
            });

    Check out the full demo

    Step 8: Make it reproducible

    You can clone the repo on your local machine and run the visualization with native Flask:

        git clone https://github.com/vsoch/cogatvoxel
        cd cogatvoxel
        python index.py

    Notice anything missing? Yeah, how about installing dependencies, and what if the version of python you are running isn’t the one I developed it in? Eww. The easy answer is to Dockerize! It was relatively easy to do: I use docker-compose to grab an nginx (web server) image, and my image vanessa/cogatvoxeltree built on Docker Hub. The Docker Hub image is built from the Dockerfile in the repo, which installs dependencies, maps the code to a folder in the container called /code and then exposes port 8000 for Flask:

    FROM python:2.7
    RUN apt-get update && apt-get install -y \
        libopenblas-dev \
        gfortran \
        libhdf5-dev
    MAINTAINER Vanessa Sochat
    RUN pip install --upgrade pip
    RUN pip install flask
    RUN pip install numpy
    RUN pip install gunicorn
    RUN pip install pandas
    ADD . /code
    WORKDIR /code
    EXPOSE 8000

    Then the docker-compose file uses this image, along with the nginx web server (this is pronounced “engine-x” and I’ll admit it took me probably 5 years to figure that out).

    web:
      image: vanessa/cogatvoxeltree
      restart: always
      expose:
        - "8000"
      volumes:
        - /code/static
      command: /usr/local/bin/gunicorn -w 2 -b :8000 index:app

    nginx:
      image: nginx
      restart: always
      ports:
        - "80:80"
      volumes:
        - /www/static
      volumes_from:
        - web
      links:
        - web:web

    It’s probably redundant to again expose port 8000 in my application (the top one called “web”), and add /www/static to the web server static. To make things easy, I decided to use gunicorn to manage serving the application. There are many ways to skin a cat, there are ways to run a web server… I hope you choose web servers over skinning cats.

    That’s about it. It’s a set of simple Flask applications to render data into a visualization, and it’s containerized. To be honest, I think the first is a lot cooler, but the second is on its way to a better visualization for the problem at hand. There is still a list of things that need fixing and tweaking (for example, not giving the user control over the threshold for showing the node and links is not ok), but I’m much happier with this second go. On that note, I’ll send a cry for reproducibility out to all possible renderings of data in a browser…

    Visualizations, contain yourselves!


  • Wordfish: tool for standard corpus and terminology extraction

    If pulling a thread of meaning from woven text
    is that which your heart does wish.
    Not so absurd or seemingly complex,
    if you befriend a tiny word fish.


    I developed a simple tool for standard extraction of terminology and corpus, Wordfish, that is easily deployed to a cluster environment. I’m graduating (hopefully, tentatively, who knows) soon, and because publication is unlikely, I will write about the tool here, in case it is useful to anyone. I did this project for fun, mostly because I found DeepDive to be overly complicated for my personal goal of extracting a simple set of terms from a corpus in the case that I couldn’t define relationships a priori (I wanted to learn them from the data). Thus I used neural networks (word2vec) to learn term relationships based on their context. I was able to predict reddit boards for different mental illness terms with high accuracy, and it sort of ended there because I couldn’t come up with a good application in Cognitive Neuroscience, and no “real” paper is going to write about predicting reddit boards. I was sad to not publish something, but then I realized I am empowered to write things on the internet. :) Not only that, I can make up my own rules. I don’t have to write robust methods with words; I will just show and link you to code. I might even just use bullet points instead of paragraphs. For results, I’ll link to ipython notebooks. I’m going to skip over the long prose and trust that if you want to look something up, you know how to use Google and Wikipedia. I will discuss the tool generally, and show an example of how it works. First, an aside about publication in general - feel free to skip this if you aren’t interested in discussing the squeaky academic machine.

    Why sharing incomplete methods can be better than publication

    It’s annoying that there is not a good avenue, or even more so, that it’s not desired or acceptable, to share a simple (maybe even incomplete) method or tool that could be useful to others in a different context. Publication requires the meaningful application. It’s annoying that, as researchers, we salivate for these “publication” things when the harsh reality is that this slow, inefficient process results in yet another PDF/printed thing with words on a page, offering some rosy description of an analysis and result (for which typically minimal code or data is shared) that makes claims that are over-extravagant in order to be sexy enough for publication in the first place (I’ve done quite a bit of this myself). A publication is a static thing that, at best, gets cited as evidence by another paper (and likely the person making the citation did not read the paper to full justice). Maybe it gets parsed from pubmed in someone’s meta analysis to try and “uncover” underlying signal across many publications that could have been transparently revealed in some data structure in the first place. Is this workflow really empowering others to collaboratively develop better methods and tools? I think not. Given the lack of transparency, I’m coming to realize that it’s much faster to just share things early. I don’t have a meaningful biological application. I don’t make claims that this is better than anything else. This is not peer reviewed by some three random people that gives it a blessing like from a rabbi. I understand the reasons for these things, but the process of conducting research, namely hiding code and results toward that golden nugget publication PDF, seems so against a vision of open science. Under this context, I present Wordfish.

    Wordfish: tool for standard corpus and terminology extraction



    The extraction of entities and relationships between them from text is becoming common practice. The number of application program interfaces (APIs) for extracting text from social networks, blogging platforms, feeds, and standard sources of knowledge is continually expanding, offering an extensive and sometimes overwhelming source of data for the research scientist. While large corporations might have exclusive access to data and robust pipelines for easily obtaining the data, the individual researcher is typically granted limited access, and commonly must devote substantial amounts of time to writing extraction pipelines. Unfortunately, these pipelines are usually not extendable beyond the dissemination of any result, and the process is inefficiently repeated. Here I present Wordfish, a tiny but powerful tool for the extraction of corpus and terms from publicly available sources. Wordfish brings standardization to the definition and extraction of terminology sources, providing an open source repository for developers to write plugins to extend their specific terminologies and corpus to the framework, and giving research scientists an easy way to select from these corpus and terminologies to perform extractions and drive custom analysis pipelines. To demonstrate the utility of this tool, I use Wordfish in a common research framework: classification. I build deep learning models to predict Reddit boards from post content with high accuracy. I hope that a tool like Wordfish can be extended to include substantial plugins, and can allow easier access to ample sources of textual content for the researcher, and a standard workflow for developers to add a new terminology or corpus source.


    While there is much talk of “big data,” when you peek over your shoulder and look at your colleague’s dataset, there is a pretty good chance that it is small or medium sized. When I wanted to extract terms and relationships from text, I went to DeepDive, the ultimate powerhouse to do this. However, I found that setting up a simple pipeline required database and programming expertise. I have this expertise, but it was tenuous. I thought that it should be easy to do some kind of NLP analysis, and combine across different corpus sources. When I started to think about it, we tend to reuse the same terminologies (eg, an ontology) and corpus (pubmed, reddit, wikipedia, etc), so why not implement an extraction once, and then provide that code for others? This general idea would make a strong distinction between a developer, meaning an individual best suited to write the extraction pipeline, and the researcher, an individual best suited to use it for analysis. This sets up the goals of Wordfish: to extract terms from a corpus, and then do some higher level analysis, and make it standardized and easy.

    Wordfish includes data structures that can capture an input corpus or terminology, and provides methods for retrieval and extraction. Then, it allows researchers to create applications that interactively select from the available corpus and terminologies, deploy the applications in a cluster environment, and run an analysis. This basic workflow is possible and executable without needing to set up an entire infrastructure and re-writing the same extraction scripts that have been written a million times before.


    The overall idea behind the infrastructure of wordfish is to provide terminologies, corpus, and an application for working with them in a modular fashion. This means that Wordfish includes two things, wordfish-plugins and wordfish-python. Wordfish plugins are modular folders, each of which provides a standard data structure to define extraction of a corpus, terminology or both. Wordfish python is a simple python module for generating an application, and then deploying the application on a server to run analyses.

    Wordfish Plugins

    A wordfish plugin is simply a folder with typically two things: a functions.py file to define functions for extraction, and a config.json that is read by wordfish-python to deploy the application. We can look at the structure of a typical plugin:


    Specifically, the functions.py has the following functions:

    1) extract_terms: function to call to return a data structure of terms
    2) extract_text: function to call to return a data structure of text (corpus)
    3) extract_relations: function to call to return a data structure of relations
    4) functions.py: is the file in the plugin folder to store these functions

    The requirement of every functions.py is an import of general functions from wordfish-python that will save a data structure for a corpus, terminology, or relationships:

    	from wordfish.corpus import save_sentences
    	from wordfish.terms import save_terms
    	from wordfish.terms import save_relations
    	from wordfish.plugin import generate_job

    The second requirement is a function, go_fish, which is the main function to be called by wordfish-python under the hood. In this function, the user writing the plugin can make as many calls to generate_job as necessary. A call to generate_job means that a slurm job file will be written to run a particular function (func) with a specified category or extraction type (e.g., terms, corpus, or relations). This second argument helps the application determine how to save the data. A go_fish function might look like this:

    	def go_fish():    

    The above will generate slurm job files to be run to extract terms and relations. If input arguments are required for the function, the specification can look as follows:


    where inputs is a dictionary whose keys are variable names and whose values are the variable values. The addition of the batch_num variable also tells the application to split the extraction into a certain number of batches, corresponding to SLURM jobs. This is needed in the case that running a node on a cluster is limited to some amount of time, and the user wants to further parallelize the extraction.
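
    To make the two cases concrete, here is a self-contained sketch. The argument names func, category, inputs, and batch_num come from the description above, but the exact signature is an assumption, and the generate_job defined here is only a stand-in that records jobs in a list instead of writing SLURM job files. The "email" input is a made-up example value.

```python
# Stand-in for wordfish.plugin.generate_job so the sketch runs on its own.
# The real function writes SLURM job files; this signature is assumed.
jobs = []

def generate_job(func, category, inputs=None, batch_num=1):
    """Record one job per batch instead of writing a job file."""
    for batch in range(batch_num):
        jobs.append({"func": func, "category": category,
                     "inputs": inputs or {}, "batch": batch})

def go_fish():
    # No input arguments: one job each for terms and relations
    generate_job(func="extract_terms", category="terms")
    generate_job(func="extract_relations", category="relations")
    # With input arguments, split across 3 batches (values are made up)
    generate_job(func="extract_text", category="corpus",
                 inputs={"email": "user@example.com"}, batch_num=3)

go_fish()
print(len(jobs))
```

    The category tells the application where to save the result, and each batch would correspond to one job file on the cluster.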

    Extract terms

    Now we can look in more detail at the extract_terms function. For example, here is this function for the Cognitive Atlas. The extract_terms function will return a JSON structure of terms:

    	def extract_terms(output_dir):
    	    terms = get_terms()

    You will notice that the extract_terms function uses another function that is defined in functions.py, get_terms. The user is free to include in the wordfish-plugin folder any number of additional files or functions that assist toward the extraction. Here is what get_terms looks like:

        def get_terms():
            terms = dict()
            concepts = get_concepts()
            for c in range(len(concepts)):
                concept_id = concepts[c]["id"]
                # Source key "definition_text" is assumed here
                meta = {"name":concepts[c]["name"],
                        "definition":concepts[c]["definition_text"]}
                terms[concept_id] = meta
            return terms

    This example is again from the Cognitive Atlas, and we are parsing cognitive concepts into a dictionary of terms. For each cognitive concept, we are preparing a dictionary (JSON data structure) with the fields name and definition. We then put that into another dictionary, terms, with the key as the unique id. This unique id is important in that it will be used to link between term and relations definitions. You can assume that the other functions (e.g., get_concepts) are defined in the functions.py file.

    Extract relations

    For extract_relations we return a tuple of the format (term1_id,term2_id,relationship):

    	def extract_relations(output_dir):
    	    links = []
    	    terms = get_terms()
    	    concepts = get_concepts()
    	    for concept in concepts:
    		if "relationships" in concept:
    		    for relation in concept["relationships"]:   
    		        relationship = "%s,%s" %(relation["direction"],relation["relationship"])
    		        tup = (concept["id"],relation["id"],relationship) 

    Extract text

    Finally, extract_text returns a data structure with some unique id and a blob of text. Wordfish will parse and clean up the text. The data structure for a single article is again, just JSON:

                corpus[unique_id] = {"text":text,"labels":labels}

    Fields include the actual text, and any associated labels that are important for classification later. The corpus (a dictionary of these data structures) gets passed to save_sentences


    More detail is provided in the wordfish-plugin README

    The plugin controller: config.json

    The plugin is understood by the application by way of a folder’s config.json, which might look like the following:

                  "name": "NeuroSynth Database",
                  "tag": "neurosynth",
                  "corpus": "True",
                  "terms": "True",
                  "labels": "True",
                  "dependencies": {
                                    "python": [ 
                                     "plugins": ["pubmed"]
                  "arguments": {
                  "contributors": ["Vanessa Sochat"], 
                  "doi": "10.1038/nmeth.1635",

    1) name: a human readable description of the plugin

    2) tag: a term (no spaces or special characters) that corresponds with the folder name in the plugins directory. This is a unique id for the plugin.

    3) corpus/terms/relationships: boolean, each “True” or “False” should indicate if the plugin can return a corpus (text to be parsed by wordfish) or terms (a vocabulary to be used to find mentions of things), or relations (relationships between terms). This is used to parse current plugins available for each purpose, to show to the user.

    4) dependencies: should include “python” and “plugins.” Python corresponds to python packages that are dependencies, and these packages are installed by the overall application. Plugins refers to other plugins that are required, such as pubmed. This is an example of a plugin that does not offer to extract a specific corpus, terminology, or relations, but can be included in an application for other plugins to use. In the example above, the neurosynth plugin requires retrieving articles from pubmed, so the plugin developer specifies needing pubmed as a plugin dependency.

    5) arguments: a dictionary with (optionally) corpus and/or terms. The user will be asked for these arguments to run the extract_text and extract_terms functions.

    6) contributors: a name/orcid ID or email of researchers responsible for creation of the plugins. This is for future help and debugging.

    7) doi: a reference or publication associated with the resource. Needed if it’s important for the creator of the plugin to ascribe credit to some published work.
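
    The field list above can be turned into a quick sanity check. The following is a minimal sketch, assuming a flat config with exactly the fields described; the example values and the check_config helper are hypothetical and are not part of wordfish-python.

```python
import json

# Hypothetical config for illustration; values are not from a real plugin
config_text = """
{
  "name": "Example Plugin",
  "tag": "example",
  "corpus": "True",
  "terms": "False",
  "relationships": "False",
  "dependencies": {"python": ["pandas"], "plugins": []},
  "arguments": {},
  "contributors": ["Jane Doe"],
  "doi": ""
}
"""

REQUIRED = ["name", "tag", "corpus", "terms", "relationships",
            "dependencies", "arguments", "contributors", "doi"]

def check_config(config):
    """Return a list of problems found in a plugin config dictionary."""
    problems = ["missing field: %s" % f for f in REQUIRED if f not in config]
    # The three capability flags are strings, not JSON booleans
    for flag in ("corpus", "terms", "relationships"):
        if config.get(flag) not in ("True", "False"):
            problems.append("%s must be \"True\" or \"False\"" % flag)
    # Dependencies should list both python packages and other plugins
    for key in ("python", "plugins"):
        if key not in config.get("dependencies", {}):
            problems.append("dependencies needs a %r list" % key)
    return problems

config = json.loads(config_text)
print(check_config(config))
```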

    Best practices for writing plugins

    Given that a developer is writing a plugin, it is generally good practice to print to the screen what is going on, and how long it might take, as a courtesy to the user, if something needs review or debugging.

    “Extracting relationships, will take approximately 3 minutes”

    The developer should also use clear variable names, well documented and spaced functions (one liners are great in python, but it’s sometimes more understandable for the reader to write out a loop), and give attribution for code that is not his or her own. Generally, the developer should just follow good practice as a coder and human being.

    Functions provided by Wordfish

    While most users and clusters have internet connectivity, it cannot be assumed, and attempting to access an online resource without it could trigger an error. If a plugin has functions that require connectivity, Wordfish provides a function to check:

          from wordfish.utils import has_internet_connectivity
          if has_internet_connectivity():
              pass  # Do analysis here

    If the developer needs a github repo, Wordfish has a function for that:

          from wordfish.vm import download_repo
          repo_directory = download_repo(repo_url="https://github.com/neurosynth/neurosynth-data")

    If the developer needs a general temporary place to put things, tempfile is recommended:

          import tempfile
          tmpdir = tempfile.mkdtemp()

    Wordfish has other useful functions for downloading data, or obtaining a url. For example:

          from wordfish.utils import get_url, get_json
          from wordfish.standards.xml.functions import get_xml_url
          myjson = get_json(url)
          webpage = get_url(url)
          xml = get_xml_url(url)

    Custom Applications with Wordfish Python

    The controller, wordfish-python, is a Flask application that provides the user (who is just wanting to generate an application) with an interactive web interface for doing so. It is summarized nicely in the README:

    Choose your input corpus, terminologies, and deployment environment, and an application will be generated to use deep learning to extract features for text, and then entities can be mapped onto those features to discover relationships and classify new texty things. Custom plugins will allow for dynamic generation of corpus and terminologies from data structures and standards of choice from wordfish-plugins. You can have experience with coding (and use the functions in the module as you wish), or no experience at all, and let the interactive web interface walk you through generation of your application.

    Installation can be done via github or pip:

          pip install git+git://github.com/word-fish/wordfish-python.git
          pip install wordfish

    And then the tool is called to open up a web interface to generate the application:


    The user then selects terminologies and corpus.

    And a custom application is generated, downloaded as a zip file in the browser. A “custom application” means a folder that can be dropped into a cluster environment, and run to generate the analysis.

    Installing in your cluster environment

    The user can drop the folder into a home directory of the cluster environment, and run the install script to install the package itself, and generate the output folder structure. The only argument that needs to be supplied is the base of the output directory:

          bash install.sh $WORK

    All scripts for the user to run are in the scripts folder here:

          cd $WORK/scripts

    Each of these files corresponds to a step in the pipeline, and is simply a list of commands to be run in parallel. The user can use launch, or submit each command to a SLURM cluster. A basic script is provided to help submit jobs to a SLURM cluster, and this could be tweaked to work with other clusters (e.g., SGE).

    Running the Pipeline

    The install script simply runs run.py, which generates all output folders and running scripts. After the installation of the custom application is complete, the user has a few options for running:

    1) submit the commands in serial, locally. The user can run a job file with bash, bash run_extraction_relationships.job
    2) submit the commands to a launch cluster, something like launch -s run_extraction_relationships.job
    3) submit the commands individually to a slurm cluster. This will mean reading in the file, and submitting each script with a line like sbatch -p normal -j myjob.job [command line here]
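
    For the third option, a small helper can turn each line of a job file into a submission command. This is a sketch under assumptions: it only builds the command strings and does not submit anything, the partition name and the example job lines are made up, and it uses sbatch's --wrap option to pass each command line.

```python
import shlex

def sbatch_commands(job_lines, partition="normal"):
    """Turn each non-empty line of a .job file into an sbatch command string."""
    commands = []
    for line in job_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # Quote the command so the shell passes it to --wrap as one argument
        commands.append("sbatch -p %s --wrap=%s" % (partition, shlex.quote(line)))
    return commands

# Made-up job lines standing in for the contents of a generated .job file
example = ["python extract.py --batch 0", "", "python extract.py --batch 1"]
for command in sbatch_commands(example):
    print(command)
```

    In practice the list of lines would come from reading a file in the scripts folder, and each string could be handed to the shell.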

    Output structure

    The jobs are going to generate output to fill in the following file structure in the project base folder, which again is defined as an environmental variable when the application is installed (files that will eventually be produced are shown):


    The folders are generated dynamically by the run.py script for each corpus and terms plugin based on the tag variable in the plugin’s config.json. Relationships, by way of being associated with terms, are stored in the equivalent folder, and the process is only separate because it is not the case that all plugins for terms can have relationships defined. The corpus are kept separate at this step as the output has not been parsed into any standard unique id space. Wordfish currently does not do this, but if more sophisticated applications are desired (for example with a relational database), this would be a good strategy to take.


    Once the user has files for corpus and terms, he could arguably do whatever he pleases with them. However, I found the word2vec neural network to be incredibly easy and cool, and have provided a simple analysis pipeline to use it. This example will merge all terms and corpus into a common framework, and then show examples of how to do basic comparisons, and vector extraction (custom analyses scripts can be based off of this). We will do the following:

    1) Merge terms and relationships into a common corpus
    2) For all text extract features with deep learning (word2vec)
    3) Build classifiers to predict labels from vectors

    Word2Vec Algorithm

    First, what is a word2vec model? Generally, Word2Vec is a neural network implementation that will allow us to learn word embeddings from text, specifically a matrix of words by some N features that best predict the neighboring words for each term. This is an interesting way to model a text corpus because it’s not about occurrence, but rather context, of words, and we can do something like compare a term “anxiety” in different contexts. If you want equations, see this paper.

    The problem Wordfish solves

    Wordfish currently implements Word2Vec. Word2Vec is an unsupervised model. Applications like DeepDive take the approach that a researcher knows what he or she is looking for, requiring definition of entities as a first step before their extraction from a corpus. This is not ideal when a researcher has no idea about these relationships, or lacks either positive or negative training examples. In terms of computational requirements, DeepDive also has some that are unrealistic. For example, using the Stanford Parser is required to determine parts of speech and perform named entity recognition. While this approach is suitable for a large scale operation to mine very specific relationships between well-defined entities in a corpus, for the single researcher that wants to do simpler natural language processing, and perhaps doesn’t know what kind of relationships or entities to look for, it is too much. This researcher may want to search for some terms of interest across a few social media sources, and build models to predict one type of text content from another. The researcher may want to extract relationships between terms without having a good sense of what they are to begin with, and definition of entities, relationships, and then writing scripts to extract both should not be a requirement. While it is reasonable to ask modern day data scientists to partake in small amounts of programming, substantial setting up of databases and writing extraction pipelines should not be a requirement. A different approach that is taken by Wordfish is to provide plugins for the user to interactively select corpus and terminology, deploy their custom application in their computational environment of choice, and perform extraction using the tools that are part of their normal workflows, which might be a local command line or computing cluster.

    Even when the DeepDive approach makes sense, the reality is that setting up the infrastructure to deploy DeepDive is really hard. When we think about it, the two applications are solving entirely different problems. All we really want to do is discover how terms are related in text. We can probably do ok to give DeepDive a list of terms, but then to have to “know” something about the features we want to extract, and have positive and negative cases for training is really annoying. If it’s annoying for a very simple toy analysis (finding relationships between cognitive concepts) I can’t even imagine how that annoyingness will scale when there are multiple terminologies to consider, different relationships between the terms, and a complete lack of positive and negative examples to validate. This is why I created Wordfish, because I wanted an unsupervised approach that required minimal set up to get to the result. Let’s talk a little more about the history of Word2Vec from this paper.

    The N-Gram Model

    The N-gram model (I think) is a form of Markov model, where we model the probability of a word, P(word), given the words that came before it. The authors note that N-gram models work very well for large data, but in the case of smaller datasets, more complex methods can make up for it. However, it follows logically that a more complex model on a large dataset gives us the best of all possible worlds. Thus, people started using neural networks for these models instead.

    simple models trained on huge amounts of data outperform complex systems trained on less data.
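
    To make the N-gram idea concrete, here is a minimal sketch of the simplest case, a bigram model, where the probability of a word is estimated from counts of what followed the previous word. The toy sentence and the function name are made up for illustration, and there is no smoothing.

```python
from collections import Counter, defaultdict

def bigram_probabilities(tokens):
    """Estimate P(word | previous word) from raw counts (no smoothing)."""
    pair_counts = defaultdict(Counter)
    for previous, word in zip(tokens, tokens[1:]):
        pair_counts[previous][word] += 1
    # Normalize the counts for each previous word so they sum to 1
    return {previous: {word: count / sum(followers.values())
                       for word, count in followers.items()}
            for previous, followers in pair_counts.items()}

tokens = "the fish ate the worm and the fish swam".split()
probs = bigram_probabilities(tokens)
print(probs["the"])
```

    With so little text most word pairs are never seen at all, which is why counts like these only work well on very large corpora.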

    The high level idea is that we are going to use neural networks to represent words as vectors, word “embeddings.” Training is done with stochastic gradient descent and backpropagation.

    How do we assess the quality of word vectors?

    Similar words tend to be close together, and given a high dimensional vector space, multiple representations/relationships can be learned between words (see top of page 4). We can also perform algebraic computations on vectors and discover cool relationships, for example, the vector for V(King) - V(Man) + V(Woman) is close to V(Queen). The most common metric to compare vectors seems to be cosine distance. The interesting thing about this approach reported here is that by combining individual word vectors we can easily represent phrases, and learn new interesting relationships.
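
    Here is a small sketch of that vector arithmetic using cosine similarity (cosine distance is one minus this value). The 3-dimensional vectors are invented so the analogy works out exactly; real word2vec embeddings have hundreds of dimensions and only land near the expected word.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings", made up for illustration
vectors = {
    "king":  [1.0, 1.0, 0.0],
    "man":   [1.0, 0.0, 0.0],
    "woman": [0.0, 0.0, 1.0],
    "queen": [0.0, 1.0, 1.0],
    "fish":  [1.0, 0.0, 1.0],
}

# V(King) - V(Man) + V(Woman)
query = [k - m + w for k, m, w in
         zip(vectors["king"], vectors["man"], vectors["woman"])]

# Rank the remaining words by similarity to the query vector
candidates = [w for w in vectors if w not in ("king", "man", "woman")]
best = max(candidates, key=lambda w: cosine_similarity(query, vectors[w]))
print(best)
```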

    Two different algorithm options

    You can implement a continuous bag of words (CBOW) or skip-gram model:

    1) CBOW: predicts the word given the context (other words)
    2) skip-gram: predicts other words (context) given a word (this seems more useful for what we want to do)
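
    The difference is easiest to see in the training examples each one would generate from the same sentence. This is a sketch of the pairing step only (no network is trained), and the function name and window size are chosen for illustration.

```python
def training_pairs(tokens, window=1):
    """Return (cbow, skipgram) training examples for a token list."""
    cbow, skipgram = [], []
    for i, target in enumerate(tokens):
        start = max(0, i - window)
        # Words within the window on either side of the target
        context = tokens[start:i] + tokens[i + 1:i + 1 + window]
        cbow.append((context, target))                 # context -> word
        skipgram.extend((target, c) for c in context)  # word -> each context word
    return cbow, skipgram

cbow, skipgram = training_pairs(["tiny", "word", "fish"], window=1)
print(cbow[1])
print(skipgram)
```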

    They are kind of like inverses of one another, and the best way to show this is with a picture:


    Discarding Frequent Words

    The paper notes that having frequent words in text is not useful, and that during training, frequent words are discarded with a particular probability based on the frequency. They use this probability in a sampling procedure when choosing words to train on so the more frequent words are less likely to be chosen. For more details, see here, and search Google.
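
    The discard probability given in the word2vec paper (Mikolov et al., 2013) is P(w) = 1 - sqrt(t / f(w)), where f(w) is the word's frequency in the corpus and t is a small threshold (around 1e-5). A minimal sketch follows; the function name is made up, and implementations may use slightly different variants of the formula.

```python
import math

def discard_probability(frequency, threshold=1e-5):
    """P(discard) = 1 - sqrt(t / f), clipped at 0 for rare words."""
    if frequency <= 0:
        return 0.0
    return max(0.0, 1.0 - math.sqrt(threshold / frequency))

# A word making up 1% of the corpus is discarded most of the time,
# while a word at or below the threshold frequency is always kept
print(discard_probability(0.01))
print(discard_probability(1e-5))
```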

    Building Word2Vec Models

    First, we will train a simple word2vec model for each corpus. To do this, we can import functions from Wordfish, which is installed by the application we generated above.

    	from wordfish.analysis import build_models, save_models, export_models_tsv, load_models, extract_similarity_matrix, export_vectors, featurize_to_corpus
    	from wordfish.models import build_svm
    	from wordfish.corpus import get_corpus, get_meta, subset_corpus
    	from wordfish.terms import get_terms
    	from wordfish.utils import mkdir
    	import os

    Installation of the application also writes the environment variable WORDFISH_HOME to your bash profile, so we can reference it easily:

    	base_dir = os.environ["WORDFISH_HOME"]

    It is generally good practice to keep all components of an analysis well organized in the output directory. It makes sense to store analyses, models, and vectors:

    	# Setup analysis output directories
    	analysis_dir = mkdir("%s/analysis" %(base_dir))
    	model_dir = mkdir("%s/models" %(analysis_dir))
    	vector_dir = mkdir("%s/vectors" %(analysis_dir))

    Wordfish then has nice functions for generating a corpus, meaning removing stop words and excess punctuation, plus the other typical steps in NLP analyses. The function get_corpus returns a dictionary, with the key being the unique id of the corpus (the folder name, tag of the original plugin). We can then use the subset_corpus function if we want to split the corpus into the different groups (defined by the labels we specified in the initial data structure):

    	# Generate more specific corpus by way of file naming scheme
    	corpus = get_corpus(base_dir)
    	reddit = corpus["reddit"]
    	disorders = subset_corpus(reddit)

    We can then train corpus-specific models, meaning word2vec models.

    	# Train corpus specific models
    	models = build_models(corpus)

    Finally, we can export models to tsv, export vectors, and save the model so we can easily load again.

    	# Export models to tsv, export vectors, and save

    I want to note that I used gensim for learning and some methods. The work and examples from Dato are great!

    Working with models

    Wordfish provides functions for easily loading a model that is generated from a corpus:

    	model = load_models(base_dir)["neurosynth"]

    You can then do simple things, like find the most similar words for a query word:

    	# [('aggression', 0.77308839559555054), 
    	#   ('stress', 0.74644440412521362), 
    	#   ('personality', 0.73549789190292358), 
    	#   ('excessive', 0.73344630002975464), 
    	#   ('anhedonia', 0.73305755853652954), 
    	#   ('rumination', 0.71992391347885132), 
    	#   ('distress', 0.7141801118850708), 
    	#   ('aggressive', 0.7049030065536499), 
    	#   ('craving', 0.70202392339706421), 
    	#   ('trait', 0.69775849580764771)]

    It’s easy to see that corpus context is important - here is finding similar terms for the “reddit” corpus:

    	model = load_models(base_dir)["reddit"]
    	# [('crippling', 0.64760375022888184), 
    	# ('agoraphobia', 0.63730186223983765), 
    	# ('generalized', 0.61023455858230591), 
    	# ('gad', 0.59278655052185059), 
    	# ('hypervigilance', 0.57659250497817993), 
    	# ('bouts', 0.56644737720489502), 
    	# ('depression', 0.55617612600326538), 
    	# ('ibs', 0.54766887426376343), 
    	# ('irritability', 0.53977066278457642), 
    	# ('ocd', 0.51580017805099487)]

    Here are examples of performing addition and subtraction with vectors:

    	# [('ibs', 0.50205761194229126), 
    	# ('undereating', 0.50146859884262085), 
    	# ('boredom', 0.49470821022987366), 
    	# ('overeating', 0.48451068997383118), 
    	# ('foods', 0.47561675310134888), 
    	# ('cravings', 0.47019645571708679), 
    	# ('appetite', 0.46869537234306335), 
    	# ('bingeing', 0.45969703793525696), 
    	# ('binges', 0.44506731629371643), 
    	# ('crippling', 0.4397256076335907)]
    	model.most_similar(positive=['bipolar'], negative=['manic'])
    	# [('nos', 0.36669495701789856), 
    	# ('adhd', 0.36485755443572998), 
    	# ('autism', 0.36115738749504089), 
    	# ('spd', 0.34954413771629333), 
    	# ('cptsd', 0.34814098477363586), 
    	# ('asperger', 0.34269329905509949), 
    	# ('schizotypal', 0.34181860089302063), 
    	# ('pi', 0.33561226725578308), 
    	# ('qualified', 0.33355745673179626), 
    	# ('diagnoses', 0.32854354381561279)]

    And to get the raw vector for a word:


    Extracting term similarities

    To extract a pairwise similarity matrix, you can use the function extract_similarity_matrix. These are the data driven relationships between terms that the Wordfish infrastructure provides:

    	# Extract a pairwise similarity matrix
    	wordfish_sims = extract_similarity_matrix(models["neurosynth"])
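
    To give a feel for what a pairwise similarity matrix contains, here is a small standalone sketch. It assumes cosine similarity between term vectors, which may or may not be exactly what extract_similarity_matrix computes. The vectors are toy values and similarity_matrix is a hypothetical helper, not the Wordfish function.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def similarity_matrix(vectors):
    """Return (terms, matrix), where matrix[i][j] is the cosine
    similarity between terms[i] and terms[j]."""
    terms = sorted(vectors)
    matrix = [
        [cosine_similarity(vectors[row], vectors[col]) for col in terms]
        for row in terms
    ]
    return terms, matrix


# Toy term vectors, invented for illustration only.
toy_vectors = {
    "anxiety": [0.8, 0.1, 0.3],
    "stress": [0.7, 0.2, 0.4],
    "memory": [0.1, 0.9, 0.2],
}

terms, matrix = similarity_matrix(toy_vectors)
for term, row in zip(terms, matrix):
    print(term, [round(value, 2) for value in row])
```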


    Finally, here is an example of predicting neurosynth abstract labels using the pubmed neurosynth corpus. We first want to load the model and metadata for neurosynth, meaning the labels for each text:

    	model = load_models(base_dir,"neurosynth")["neurosynth"]
    	meta = get_meta(base_dir)["neurosynth"]

    We can then use the featurize_to_corpus method to get labels and vectors from the model, and the build_svm function to build a simple, cross-validated classifier to predict the labels from the vectors:

    	vectors,labels = featurize_to_corpus(model,meta)
    	classifiers = build_svm(vectors=vectors,labels=labels,kernel="linear")

    The way this works is to take a new post from reddit with an unknown label, use the Word2vec word embeddings as a lookup, and generate a vector for the new post by taking the mean of the embeddings of its words. It’s a simple approach that could be improved upon, but it seemed to work reasonably well.
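
    The mean-vector step can be sketched in a few lines. This is an illustration under my own assumptions, not the featurize_to_corpus implementation: the embeddings are toy values, the text is split on whitespace, words missing from the vocabulary are skipped, and mean_vector is a hypothetical name.

```python
def mean_vector(text, embeddings):
    """Average the embeddings of the words in text that are in the
    vocabulary. Returns None if no word is found."""
    words = [word for word in text.lower().split() if word in embeddings]
    if not words:
        return None
    dim = len(embeddings[words[0]])
    return [
        sum(embeddings[word][k] for word in words) / len(words)
        for k in range(dim)
    ]


# Toy two-dimensional embeddings, invented for illustration only.
embeddings = {
    "panic": [1.0, 0.0],
    "attack": [0.0, 1.0],
}

# "today" is not in the toy vocabulary, so it is ignored.
print(mean_vector("Panic attack today", embeddings))
```

    The resulting vector has the same length as a single word embedding, no matter how long the post is, which is what makes it usable as a fixed-size input to a classifier.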

    Classification of Disorder Using Reddit

    A surprisingly untapped resource is Reddit, a forum with different “boards,” each a place to write about a topic of interest. It has largely gone unnoticed that individuals use Reddit to seek social support and advice. For example, the Depression board is predominantly filled with posts from individuals with Depression writing about their experiences, and the Computer Science board might be predominantly questions or interesting facts about computers or being a computer scientist. From the mindset of a research scientist who might be interested in Reddit as a source of language, a Reddit board can be thought of as a context. Individuals who post to the board, whether or not they have an “official” status related to the board, are expressing language in the context of the topic. Thus, it makes sense that we can “learn” a particular language context that is relevant to the board, and possibly use the understanding of this context to identify it in other text. So I built 36 word embedding models across 36 Reddit boards, each representing the language context of the board, or specifically, the relationships between the words. I used these models to look at the context of words across different boards. I also built one master “reddit” model, and used this model in the classification framework discussed previously.

    The classification framework was used for two applications: predicting reddit boards from reddit posts, and doing the same, but using the neurosynth corpus to build the Word2Vec model (the idea being that papers about cognitive neuroscience and mental illness might produce word vectors that are more relevant for reddit boards about mental illness groups). For both of these, the high-level idea is that we want to predict a board (grouping) based on a model built from all of reddit (or some other corpus). The corpus used to derive the word vectors gives us the context, meaning the relationships between terms (and this is done across all boards with no knowledge of classes or board types), and then we can take each entry and calculate an average vector for it by averaging the word embeddings of the words present in it. Specifically we:

    1) generate word embeddings model (M) for entire reddit corpus (resulting vocabulary is size N=8842)
    2) For each reddit post (having a board label like "anxiety"):
    - generate a vector that is an average of word embeddings in M

    Then for each pair of boards (for example, “anxiety” and “atheist”):

    1) subset the data to all posts for “anxiety” and “atheist”
    2) randomly hold out 20% for testing, rest 80% for training
    3) build an SVM to distinguish the two classes, for each of rbf, linear, and poly kernel
    4) save accuracy metrics
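
    The pairing and hold-out steps above can be sketched with the standard library alone. This only covers steps 1 and 2 (subsetting to a pair of boards and splitting 80/20). The SVM training and scoring are left as a comment because they depend on the classifier library. The function name pairwise_splits and the toy posts are hypothetical.

```python
import random
from itertools import combinations


def pairwise_splits(posts_by_board, test_fraction=0.2, seed=0):
    """For every pair of boards, yield ((board_a, board_b), train, test),
    where train and test are lists of (post, board_label) tuples."""
    rng = random.Random(seed)
    for board_a, board_b in combinations(sorted(posts_by_board), 2):
        # Step 1: subset the data to all posts from the two boards.
        labeled = [(post, board_a) for post in posts_by_board[board_a]]
        labeled += [(post, board_b) for post in posts_by_board[board_b]]
        # Step 2: randomly hold out a fraction for testing.
        rng.shuffle(labeled)
        n_test = int(round(len(labeled) * test_fraction))
        yield (board_a, board_b), labeled[n_test:], labeled[:n_test]


# Toy data: five placeholder posts per board.
posts_by_board = {
    board: ["%s post %d" % (board, i) for i in range(5)]
    for board in ["anxiety", "atheist", "science"]
}

for pair, train, test in pairwise_splits(posts_by_board):
    # Steps 3 and 4 would go here: featurize each post, fit an SVM per
    # kernel on train, and record its accuracy on test.
    print(pair, len(train), len(test))
```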


    How did we do?

    Can we classify reddit posts?

    The full result has accuracies that are mixed. What we see is that some boards can be well distinguished, and some not. When we extend to use the neurosynth database to build the model, we don’t do as well, likely because the corpus is much smaller, and we remember from the paper that a larger corpus tends to do better.

    Can we classify neurosynth labels?

    A neurosynth abstract comes with a set of labels for terms that are (big hand waving motions) “enriched.” Thus, given labels for a paragraph of text (corresponding to the neurosynth term) I should be able to build a classifier that can predict the term. The procedure is the same as above: an abstract is represented as the mean of the word embeddings of its words. The results are also pretty good for a first go, but I bet we could do better with a multi-class model or approach.

    Do different corpora provide different context (and thus term relationships)?

    This portion of the analysis used the Word2Vec models generated for specific reddit boards. Before I delved into classification, I had just wanted to generate matrices that show relationships between words, based on a corpus of interest. I did this for NeuroSynth, as well as for a large sample (N=121862) of reddit posts across 34 boards, including disorders and random words like “politics” and “science.” While there was interesting signal in the different relationship matrices, really the most interesting thing we might look at is how a term’s relationships vary with the context. As a matter of fact, I would say context is extremely important to think about. For example, someone talking about “relationships” in the context of “anxiety” is different than someone talking about “relationships” in the context of “sex” (or not). I didn’t upload these all to github (I have over 3000, one for each neurosynth term), but it’s interesting to see how a particular term changes across contexts.

    Each matrix (pdf file) in the folder above is one term from neurosynth. What the matrix for a single term shows is different contexts (rows) and the relationship to all other neurosynth terms (columns). Each cell value shows the word embedding of the global term (in the context specified) against the column term. The cool thing for most of these is that we see general global patterns, meaning that the context is similar, but then there are slight differences. I think this is hugely cool and interesting and could be used to derive behavioral phenotypes. If you would like to collaborate on something to do this, please feel free to send me an email, tweet, Harry Potter owl, etc.


    Wordfish provides standard data structures and an easy way to extract terms, corpus, and perform a classification analysis, or extract similarity matrices for terms. It’s missing a strong application. We don’t have anything suitable in cognitive neuroscience, at least off the bat, and if you might have ideas, I’d love to chat. It’s very easy to write a plugin to extend the infrastructure to another terminology or corpus. We can write one of those silly paper things. Or just have fun, which is much better. The application to deploy Wordfish plugins, and the plugins themselves are open source, meaning that they can be collaboratively developed by users. That means you! Please contribute!


    The limitations have to do with the fact that this is not a finished application. Much fine tuning could be done toward a specific goal, or to answer a specific question. I usually develop things with my own needs in mind, and add functionality as I go and as it makes sense.


    For my application, it wasn’t a crazy idea to store each corpus entry as a text file, and I had only a few thousand terms. Thus, I was content using flat text files to store data structures. I had plans for integration of “real” databases, but the need never arose. This would not be ideal for a much larger corpus, for which using a database would be optimal. If the need for a larger corpus came up, I would add this functionality to the application.

    Deployment Options

    Right now the only option is to generate a folder and install on a cluster, and this is not ideal. Better would be options to deploy to a local or cloud-hosted virtual machine, or even a Docker image. This is another future option.


    It would eventually be desired to relate analyses to external data, such as brain imaging data. For example, NeuroVault is a database of whole-brain statistical maps with annotations for terms from the cognitive atlas, and we may want to bring in maps from NeuroVault at some point. Toward this aim a separate wordfish-data repo has been added. Nothing has been developed here yet, but it’s in the queue.

    And this concludes my first un-paper paper. It took an afternoon to write, and it feels fantastic. Go fish!