Short teaching module: An adventurous introduction to the command-line

Background and inspiration. Figuring out how to run scientific software through a command-line interface is a key skill for students exploring bioinformatics. There are many good resources that explain how to navigate around one’s computer using interfaces like BASH, the MacOSX Terminal, or Windows PowerShell.  However, many of the tutorials that I explored when looking for resources for my students were pretty dry.  I wanted something I could give to students on the first or second day of class, and get them hooked on using command line interfaces for the rest of the term.  In classes that mix students with a wide range of computational expertise, I also wanted something that would be fun and engaging even if you are already familiar with using the command-line.

Sean O’Neil, a bioinformatics trainer at Oregon State University, does a masterful job of explaining the concept of directory structure and how to navigate it on the command-line. While explaining this idea, he drew a little stick figure onto the tree of directories to represent the user’s current location – the current working directory. I liked this metaphor that each directory is a place, and that commands like cd represent walking yourself from place to place. It seemed to resonate with students.

directory_structure-01

In this short teaching module, I’ve taken this metaphor and extended it into a short choose-your-own-adventure style exercise. In the exercise, students explore a whimsically haunted mansion, trying to find their little brother. Each directory represents a different location along the way. There are many surprises, from hidden passageways to a dragon, to be found along the way.

Educational goals:

  • Improve understanding of directory structure, absolute paths, and relative paths
  • Gain comfort and familiarity with navigating on the command-line using the commands cd, ls, and pwd
  • Understand how to read a text file with more or edit a text file with pico
  • Understand how to move a file with mv
  • Use tab-completion to reduce typos and work faster
  • Reuse previously issued commands

Time required:

I usually spend 60-90 minutes introducing the command-line, including a mini-lecture and the exercise. Of that, I would expect to spend anywhere from 45-60 minutes on the exercise.  Some students who already know how to navigate the command line will finish earlier (but may still tend to take some time exploring the different areas).

Materials:

Handout with summary of common commands: Commandline_Cheatsheet.pdf

The zipped directory structure for the exercise: treasure_hunt.zip

Example introductory slides: Commandline_treasurehunt_slides

Students will need a Windows or MacOSX computer/laptop with internet access, one handout each, a piece of blank paper, and a pen.

Instructions:

I give an intro using some version of the PowerPoint slides linked above, and give everyone the handout. After the intro, I demonstrate how to open PowerShell on Windows or Terminal on MacOSX.

I then explain how to download the materials (there are some tricks to this- see hints below).

Finally, I tell students to map the directories that they explore and label what they find. This forces students new to using the command-line to keep track of where they are and where different locations sit relative to one another. It could probably make a good worksheet, but so far I’ve just had them do so on a piece of blank paper.

As they wander around the mansion, they are looking for their little brother, who they want to return to the directory labelled home_sweet_home (by using the mv command). Bonus points if they can return magical creatures to the zoo and assemble the ingredients for a tasty mushroom soup.
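For instructors who would like to see the same moves outside the shell, here is a minimal Python sketch of the rescue step. It is only an illustration: the mansion/library folders and the little_brother.txt file name are made up for this example (only home_sweet_home comes from the exercise), and the real treasure_hunt.zip layout will differ. Each line is annotated with the shell command it mirrors.

```python
import os
import shutil
import tempfile
from pathlib import Path


def build_demo_mansion(root: Path) -> None:
    """Create a tiny, hypothetical directory tree to explore."""
    (root / "mansion" / "library").mkdir(parents=True)
    (root / "home_sweet_home").mkdir()
    (root / "mansion" / "library" / "little_brother.txt").write_text("Found me!\n")


def rescue_brother(root: Path) -> list:
    """Walk the tree, move the file home, and return what `ls` would show there."""
    start = Path.cwd()                       # pwd (remember where we began)
    try:
        os.chdir(root / "mansion")           # cd with an absolute path
        os.chdir("library")                  # cd library (a relative path)
        # mv little_brother.txt ../../home_sweet_home/
        shutil.move("little_brother.txt", "../../home_sweet_home/")
        os.chdir("../../home_sweet_home")    # cd ../../home_sweet_home
        return sorted(os.listdir("."))       # ls
    finally:
        os.chdir(start)                      # go back to where we started


# Run the demo inside a throwaway temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    demo_root = Path(tmp)
    build_demo_mansion(demo_root)
    print(rescue_brother(demo_root))
```

The relative path ../../home_sweet_home only works because of where the "stick figure" is standing when the move happens, which is the same point the hand-drawn map is meant to drive home.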

There is an instructions.txt file in the first directory that gives an overview of what to do and reviews some useful commands with examples.

Hints:

  • Expect to spend some significant hands-on time with students at the start of the exercise. For larger groups, it is very useful if you have some helpers or more experienced students who know enough to help others get set up.
  • There is a certain chicken-and-the-egg problem with doing an exercise on navigating the command line. Many undergraduate biology students won’t know that their computer has directories, much less how to save a file in a specific one. So if they start by downloading the exercise materials to an arbitrary directory, they won’t know how to get to it on the command line, since that’s what the exercise is supposed to teach! Here’s what I do instead: FIRST show the students how to find the terminal or PowerShell, SECOND have them use the pwd command to find the absolute path for their current directory and only THIRD have them download the zipped directory structure into their current working directory (you’ll have to explain how to right or control-click and select ‘Save As…’).
  • The details of platform-specific differences in commands can take a long time to explain. For example, Windows PowerShell has different commands than bash or the MacOSX terminal.  I’ve tried to emphasize commands that have aliases across all systems, meaning that even if they wouldn’t necessarily be how a PowerShell user would first think to enter a command, they’ll still work on that platform.
  • While most students seem to very much enjoy the exercise, it might be a good idea to have handy a regular set of file folders from a recent analysis project in case there is a student that for whatever reason finds the theme distracting.
  • As I show students cd, ls, etc., I try to emphasize that these are all just programs, with arguments and parameters.
  • At the end of the exercise, be sure to re-emphasize that this is just a series of folders and text files, just like all others on their machine. Some students who have never seen a command line interface before might think this is some sort of actual game. It would be useful to practice the same commands for a bioinformatics-oriented task in the next lesson to drive home this point.
  • If you’re using this in a workshop that uses a central server, you can place the exercise in a publicly accessible directory, then have students connect to the server and copy a version to their home directory to get started.  In some ways this is probably a little bit easier than downloading locally.

Feature image credit: Das Geisterhaus, Harald Hoyer from Schwerin, Germany (Wikimedia Commons)

 


The Data Deleters: a bill proposed by Senators Lee and Rubio would wipe people of color off federal maps

the_data_deleters_r8

High-quality data is now a matter of life and death. We depend on these data to plan safe roads, regulate toxic chemicals, and track infectious diseases. In my work as a microbiologist, we’ve seen amazing advances in medical and environmental research by making large, standardized DNA sequencing datasets accessible to everyone.

The same applies to geographic data in the social sciences. Where racial disparities are occurring, high-quality datasets will help to uncover them. Where issues like housing availability are closer to equitable, analysis of Big Data can suggest what policies helped. These connections bind data science to civil rights, in areas from housing discrimination to environmental racism to voter suppression.

Many of us have spoken out against recent assaults on climate change data maintained by the EPA. Fundamental geographic databases crucial for policy analysis are now also under attack.

A stealth anti-science and anti-civil rights provision is embedded in proposed Senate Bill S.103, the “Local Zoning Decisions Protection Act of 2017”, and its companion bill H.R. 482 in the House. Section 3 of those bills would supersede all existing laws to ban all federal funding for using or maintaining geospatial databases that track racial disparities in our communities. It would also ban databases that track disparities in access to affordable housing. I confirmed that this language is still present as of 1/30/17.

The key provision, Section 3, is only one sentence long (you can read the official full text of the bills for yourself at the links above):

SEC. 3. PROHIBITION ON USE OF FEDERAL FUNDS.

Notwithstanding any other provision of law, no Federal funds may be used to design, build, maintain, utilize, or provide access to a Federal database of geospatial information on community racial disparities or disparities in access to affordable housing.

The provision is absurdly broad and intentionally vague. Using federal funds to do any research at all on racial discrimination could potentially violate it, if that research involves a ‘database’ that records ‘geospatial information’ – a low bar that most research studies would trip over.

We scientists like to hedge, to find nuance.  But this provision is quite simple. There is no rational, non-discriminatory basis for a blanket ban on studying inequality using federal databases. If you thought racial discrimination in housing were minimal, you’d want more data out in the open, not less. You’d want giant billboards up everywhere showing how fair our housing policy was. The only reason to place a blanket ban on researchers building and using geospatial databases of racial disparities is if you know discrimination is happening, and you want to cover it up.  

That is precisely what Senator Mike Lee (R-UT), Senator Marco Rubio (R-FL) and 22 representatives from the House have done in this bill.

I’m sure that, when pressed, some might argue that state databases will take up some of the slack.  Perhaps, but that would still produce a situation worse than the status quo, in which we have both federal and state databases.  It would also make national analyses of racial disparities much harder. Data science benefits greatly from consistently annotated and easily accessible datasets.  The most likely effect of eliminating federal databases of racial inequality would be to have fewer data, and to make those data that are available  harder to access and compare.  This will limit the power of research studies, and potentially cast doubt on their results.

Discriminated against in housing?  Now they can claim there is ‘no scientific data’ to back it up.  Because they deleted it.

Trying to get a federal research grant to study racial disparities in zoning law? You won’t, because that study would require you to build a database, and now funds can’t be used for that purpose.

Fighting a chemical plant in your backyard? Good luck arguing that its placement was discriminatory if your expert witnesses don’t have the data on community racial disparities to build you a map.

This provision has no purpose other than to shield discrimination. Its effect is to literally wipe the concerns of people of color from Federal maps.

As scientists, we will safeguard key datasets from political interference. As citizens, we will resist cowardly attempts to shield discrimination and racism from the bright light of public scrutiny. As decent human beings, we will never support legislation that makes it harder for people in our communities to drink clean water or find a place to live, no matter what the color of their skin.

The House bill is currently in the House Committee on Financial Services. The Senate bill is in the Banking, Housing, and Urban Affairs committee. Citizens from all over the country have expressed their shock and outrage over these bills on social media, calling for their rejection. Advocates and scientists from every state are now calling their representatives to demand action, and spreading the word online. A list of senators and representatives sponsoring this bill can be found below. If your representatives are on this list, please call them and express your opposition. If they are not, please call your representatives and make them aware that they should oppose this bill. If you live in Utah or Florida, I especially urge you to demand that Senators Lee and Rubio withdraw this bill, and let them know that this betrayal of fundamental American values will not soon be forgotten.

Thanks everyone for your help and support.

Acknowledgements: I would like to thank Dr. Jan Marie Eberth, a professor at the University of South Carolina’s Arnold School and Deputy Director of the South Carolina Rural Health Research Center, who alerted me to this bill.

P.S. Here are the Senators and Representatives sponsoring or co-sponsoring the bill (see the official congress.gov page here).

Senate:

Sponsor: Sen. Lee, Mike [R-UT] (Salt Lake City Office, Phone: 801-524-5933, Fax: 801-524-5730)

Co-sponsor: Sen. Rubio, Marco [R-FL]* (Phone: Miami office: (305) 418-8553; Orlando office: (407) 254-2573; Tampa office: (813) 287-5035)

House:

Sponsor: Rep. Gosar, Paul A. [R-AZ-4] (Phone: (480) 882-2697)

Cosponsors: (as of 1/31/17. See here for current official cosponsors):

Rep. Biggs, Andy [R-AZ-5] 01/24/2017
Rep. Franks, Trent [R-AZ-8]* 01/12/2017
Rep. McClintock, Tom [R-CA-4]* 01/12/2017
Rep. Rohrabacher, Dana [R-CA-48]* 01/12/2017
Rep. Buck, Ken [R-CO-4]* 01/12/2017
Rep. Webster, Daniel [R-FL-11]* 01/12/2017
Rep. Yoho, Ted S. [R-FL-3]* 01/12/2017
Rep. Blum, Rod [R-IA-1]* 01/12/2017
Rep. King, Steve [R-IA-4]* 01/12/2017
Rep. Massie, Thomas [R-KY-4]* 01/12/2017
Rep. Smith, Jason [R-MO-8]* 01/12/2017
Rep. Joyce, David P. [R-OH-14] 01/20/2017
Rep. Duncan, Jeff [R-SC-3]* 01/12/2017
Rep. Blackburn, Marsha [R-TN-7]* 01/12/2017
Rep. DesJarlais, Scott [R-TN-4]* 01/12/2017
Rep. Duncan, John J., Jr. [R-TN-2]* 01/12/2017
Rep. Babin, Brian [R-TX-36]* 01/12/2017
Rep. Burgess, Michael C. [R-TX-26]* 01/12/2017
Rep. Poe, Ted [R-TX-2]* 01/12/2017
Rep. Sessions, Pete [R-TX-32]* 01/12/2017
Rep. Brat, Dave [R-VA-7]* 01/12/2017
Rep. Grothman, Glenn [R-WI-6]* 01/12/2017

 

Changelog:

2/3/17 Updated to fix a typo (I put 1/30/16 instead of 1/30/17).  Thanks to Greg Caporaso  for spotting the mistake.

GCMP: Visit to Penn State and First look at Australia Data

 

IMG_0189-1024x768

We recently had a great trip to Penn State, where we visited with Mónica Medina and her group. Ryan McMinds, Becky Vega Thurber and I headed out there to work with Mónica’s group on the Global Coral Microbiome Project (GCMP). Mónica, Joe Pollock, and the whole lab were amazing hosts. We stayed in Mónica’s house and had the chance to spend some time with her lovely tia and wonderful daughters.

The overall project aims to understand the microbes living on reef-building corals, which are thought to play key roles in corals’ resistance or vulnerability to environmental stressors like climate change and algal competition.  We are working with the Earth Microbiome Project to assess bacterial diversity in a large global collection of coral samples. In the meantime, we are moving forward with a subset of samples collected in Australia.

For this project, we have enough preliminary data from our previous work and the literature on coral microbiomes to form fairly specific hypotheses.  So we decided to be fairly formal about framing our key hypotheses, the testable predictions for each hypothesis, and planning ahead of time many of the specific analyses we’d do to test those predictions.

Some of the key questions we’re trying to address include:

  • How do different ‘habitats’ within a coral, such as mucus vs. tissue vs. skeleton differ in microbial community?  We predicted that the microbial community in the coral surface mucus layer (SML), which is a key interface between the coral and its environment, will be more strongly influenced by local environmental factors than the microbial community within coral tissues. We predicted that the tissue community would be more driven by the evolutionary history of the coral.
  • Have distantly related corals with similar life-history strategies converged on similar microbiomes?  We are testing a number of concrete predictions in this area for features ranging from the abundance of microbial antibiotic production pathways (we predict there will be more in stress-resistant corals) to the extent of inter-individual variability in different types of corals.
  • Can we identify clear cases of co-evolution between corals and their microbes?  A key prediction of the coral holobiont theory is corals and their bacteria are symbiotic partners that have co-evolved over long periods of time.  This is a challenging idea to test, but our sampling scheme was designed to have enough power to try to address these questions.  We’re first assessing what bacteria, archaea and Symbiodinium lineages are found in most or all of our coral specimens, and will then move on to evolutionary analyses of these groups. 

These are just a few of the ideas we’re kicking around at the moment. Any of these predictions may very well be incorrect, and we’re happy to find that out – mostly we’re excited to have data in hand, and grateful to the large network of coral scientists that helped us get to the point where we can start testing these predictions in a more  definitive way than previously possible. The results from some of these evolutionary and ecological questions will help to inform model-building in later stages of the project, where we are going to test whether incorporating information on microbial diversity can improve models of which coral species are vulnerable to disease and bleaching.

During our visit, we worked with Joe Pollock to start analyzing DNA sequence data for this project, and with Styles Smith to connect these data to ongoing bacterial genome sequencing efforts. Having everyone in the same room proved to be very useful for advancing the project rapidly. We now have Symbiodinium (ITS2) data for most of these samples, and bacterial/archaeal (16S rRNA gene) data for all of them. Although long OTU-picking and beta-diversity runs ate up the first few days, we were able to summarize this sample set into some nice tables for publication; conduct basic quality-control, OTU picking, and core diversity analyses; revise our taxonomic annotations of Symbiodinium diversity (more on this later); set up our organizational system; and get a preliminary look at what the data are telling us about four or five of our key predictions.

Right now we’re working on a Dropbox model, with standardized folders (input/output/procedure subfolders) for different sub-analyses, and using IPython notebooks or bash scripts to record analysis procedures. So far this is working fairly well, although it helps being in the same place to rapidly coordinate what’s happening in each folder, especially if multiple people are contributing to the same analysis. Disambiguating those types of simultaneous contributions is a place where a more formal version control system like Git, hosted on GitHub, could be advantageous. For example, the Earth Microbiome Project is coordinating their analysis in this way (see here). We may still go there, but for now the team is small and connected enough that we might avoid the overhead.
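As a rough illustration of the input/output/procedure layout described above, a small helper like the following could stamp out the same three subfolders for each new sub-analysis. This is a sketch only: the function name and the example analysis name are hypothetical, and the post does not say how the folders are actually created.

```python
import tempfile
from pathlib import Path

# The three standardized subfolders named in the post.
SUBFOLDERS = ("input", "output", "procedure")


def create_analysis_folder(project_root, analysis_name):
    """Create <project_root>/<analysis_name>/{input,output,procedure}.

    Safe to call repeatedly: folders that already exist are left alone.
    """
    base = Path(project_root) / analysis_name
    for sub in SUBFOLDERS:
        (base / sub).mkdir(parents=True, exist_ok=True)
    return base


# Demo in a throwaway directory; "beta_diversity" is a made-up analysis name.
with tempfile.TemporaryDirectory() as tmp:
    folder = create_analysis_folder(tmp, "beta_diversity")
    print(sorted(p.name for p in folder.iterdir()))
```

Keeping the layout in one place means every collaborator's sub-analysis looks the same, which is most of the benefit of the convention.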

In any case, many thanks to everyone in Mónica’s group for a great visit. As this analysis matures,  I’ll try to write more on approaches we’re taking to connect our microbiological data to coral functional traits and life-history strategies.

Short Teaching Module: Perspectives on Microbial Community Change in Health and Disease

I recently put together a short interactive teaching module on microbial community change in health and disease. Students reacted well to the exercise, so I thought I’d share it here (see below for materials). The basic idea is that after a short introduction getting folks excited about the microbiome, students break up into small groups and try to figure out what kinds of community changes might underlie some disease scenarios. The group then discusses these ideas together, and relates each scenario to a real example.

The main goal of the lesson is to introduce many core ideas in microbial ecology, like alpha-diversity, beta-diversity, richness, evenness, etc. in a very short period of time, using examples that will be relevant to many folks. A secondary goal is to introduce the utility of tables of bacterial abundance across samples for sorting out these different patterns. A natural follow-on would be to actually convert those tables to electronic form and have students use them in microbial community analysis tools, or write their own Python scripts to quantify these patterns (which could be improved with statistical tests later on).
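As one possible starting point for that follow-on, here is a minimal Python sketch of the kind of script students might write. The counts are invented for illustration and are not taken from the lesson handouts; the metrics shown (richness, Shannon diversity, Pielou's evenness, and Bray-Curtis dissimilarity) are common choices, but the lesson itself does not prescribe specific formulas.

```python
import math


def richness(counts):
    """Alpha-diversity: number of microbes observed at least once."""
    return sum(1 for c in counts if c > 0)


def shannon(counts):
    """Shannon diversity index (natural log) of one sample."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


def pielou_evenness(counts):
    """Shannon diversity divided by its maximum, log(richness).

    Returns 0.0 when only one microbe is present (a convention chosen here).
    """
    s = richness(counts)
    return shannon(counts) / math.log(s) if s > 1 else 0.0


def bray_curtis(a, b):
    """Beta-diversity: Bray-Curtis dissimilarity between two samples."""
    return sum(abs(x - y) for x, y in zip(a, b)) / (sum(a) + sum(b))


# Hypothetical counts of four microbes in two samples.
healthy = [10, 10, 10, 10]   # perfectly even community
diseased = [37, 1, 1, 1]     # same richness, dominated by one microbe

for name, sample in (("healthy", healthy), ("diseased", diseased)):
    print(name, richness(sample), round(pielou_evenness(sample), 3))
print("Bray-Curtis:", bray_curtis(healthy, diseased))
```

The two made-up samples have identical richness but very different evenness, which is a handy way to show students why one number rarely summarizes a community.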

Scenario_1_specific_pathogen-01

For example, this first scenario illustrates a case where there is one microbe (the red dots) that is present in all the diseased samples, but none of the healthy ones. This might represent a classic bacterial disease caused by a single pathogen, and a likely candidate for fulfillment of Koch’s Postulates.

Several of the easier scenarios, like this one, also have microbe-microbe interactions embedded in them. For example, in the above scenario, the aqua and blue microbes are strongly negatively correlated in samples from both healthy and diseased patients. These bonus patterns can give groups that quickly get a solid idea about what might be causing disease something more meaty to explore while other groups continue thinking about their scenarios. They also introduce an alternative way of looking at microbial communities that will be important later on. Finally,  these microbe-microbe interactions also help illustrate the utility of tables for spotting and quantifying patterns in microbial communities.

Scenario_1_table_only

In this case, a table of the microbes across samples isn’t really needed to see the main trend with disease (red dots, bottom row), but might help identify microbe-microbe interactions that are harder to spot visually. Here, the aqua and blue microbes (2nd and 3rd rows) trade off in abundance across samples.
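To make the two patterns in this scenario concrete, here is a small Python sketch that works on a table of counts. The numbers are invented to mimic the scenario and are not copied from the handout; the function names are illustrative only.

```python
import math

# Hypothetical table: one row per microbe, one column per sample.
labels = ["healthy", "healthy", "healthy", "diseased", "diseased", "diseased"]
table = {
    "aqua": [5, 0, 4, 1, 0, 5],
    "blue": [0, 5, 1, 4, 5, 0],
    "red":  [0, 0, 0, 3, 4, 2],
}


def disease_only(table, labels):
    """Microbes present in every diseased sample and absent from every healthy one."""
    return [
        name for name, row in table.items()
        if all(c > 0 for c, lab in zip(row, labels) if lab == "diseased")
        and all(c == 0 for c, lab in zip(row, labels) if lab == "healthy")
    ]


def pearson(x, y):
    """Pearson correlation coefficient between two rows of counts."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


print(disease_only(table, labels))
print(round(pearson(table["aqua"], table["blue"]), 3))
```

A strongly negative correlation between two rows is the numerical counterpart of the "trade off" that is hard to see by eye in the diagram. It is only a first pass: a correlation on raw counts says nothing about why the two microbes exclude each other.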

Here’s a handout summarizing the different scenarios from the lesson:

Scenario_Summary-01

 

Here are the lesson materials:

Overview and Guide: Perspectives_on_Microbial_Communities_Exercise_Overview_r2

Lesson PowerPoint (via SlideShare) [link]

Handouts: All handout files, including the original Illustrator files (16 MB download), as a single zip [link]. The files include all of the scenarios without tables, their associated tables alone (basically an OTU table for each scenario), and the diagram with the table. The summary handout image is also included.

 

 

 

US-Indonesia Kavli Frontiers of Science Videos

This summer, I had the amazing opportunity to travel to Makassar, Indonesia to attend the US-Indonesia Kavli Frontiers of Science Forum. This conference is unique in that it features U.S., Australian, and Indonesian scientists from diverse disciplines. So for example, you might -purely hypothetically *cough*- be giving a talk on coral microbes, and get some really interesting questions about microbial metabolism and detecting alien life in (very) remote sensing data from Jason Rhodes, from NASA’s Jet Propulsion Laboratory. I found it to be a very refreshing chance to interact with a broader range of scientists than I might normally.

The conference also had an element of diplomacy to it – it is one of several scientific exchange programs initiated to follow through on the commitments laid out in Obama’s 2009 Cairo speech, which called for greater cultural and scientific exchange with predominantly Muslim countries.

The Amirul Mukminin Mosque (Masjid Amirul Mukminin) in downtown Makassar, seen from our conference hotel.

Certainly our hosts were incredibly kind and generous, and the folks I interacted with both off and on the conference site were very welcoming. It was great to meet Jamaluddin Jompa, who was the first Indonesian scientist to give a keynote at the International Coral Reef Symposium, and many young Indonesian scientists. I also found Bahasa Indonesia to be an incredibly fun and approachable language – though unfortunately I don’t think there will be many opportunities to practice it around here. My fledgling attempts were enough to egg Mónica Medina into including a short paragraph in Indonesian in her opening statement, so I’ll count that as a win.

In any case, I’m revisiting this older trip because the conference has recently posted video from all of the talks. These are all intended for a broad audience, and are generally quite approachable. Be warned, they have archives going back several years, and it’s pretty easy to burn an afternoon checking them out.

My talk on coral microbiomes focused on a long-term field experiment studying the effects of nutrient pollution and overfishing on corals and their microbes in the Florida Keys (Vega Thurber and Burkepile labs), and the Global Coral Microbiome Project, which seeks to characterize the microbial diversity of evolutionarily diverse corals from many sites around the globe (Vega Thurber and Medina labs – with help and collaboration from many, many others).

Kim Ritchie’s intro on beneficial microbes in the ocean is here:

 

A menu with all the talks can be found on the Kavli website [link]. I would particularly recommend the talks by Vikram Ravi on supermassive black hole evolution; Christopher Mores on his experiences running an Ebola clinic; Maxime Aubert on the discovery of the oldest dated cave paintings in Indonesia; and Enid Montague on developing apps to improve hospital visits. Kiki Vierdayanti also gave a very interesting talk on X-ray emissions from black holes, with some interesting comments on what it’s like doing astrophysics as an Indonesian researcher.

Enjoy the videos, and if you ever have the chance to attend one of these I would strongly encourage it.

 

 

 

Peter Norvig’s IPython notebook on probability

By way of Daniel McDonald, I recently came across Peter Norvig’s IPython notebook exploring probability. Peter Norvig is the director of research at Google, and his courses tend to do a great job of breaking down complex concepts into digestible ideas and clean code. This notebook is no exception.

The notebook explores basic probability and the Monty Hall problem using some straightforward code for exploring sample spaces. It then extends this code to deal with some statistical ‘paradoxes’ like the Two Child Problem. Some of the most interesting parts of the discussion hinge on the sample spaces that would be required to make particular results true.
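To give a flavor of the sample-space approach, here is a minimal sketch of my own (not code from the notebook) that enumerates the Monty Hall problem. It assumes the standard rules: the car is equally likely to be behind any of three doors, the contestant picks a door at random, and the host always opens an unpicked door hiding a goat before offering a switch. Under those assumptions, switching wins exactly when the first pick was wrong, so the host's choice does not need to be enumerated separately. The names P and space are illustrative.

```python
from fractions import Fraction
from itertools import product

DOORS = (1, 2, 3)

# Each outcome is a (car_door, first_pick) pair; all nine are equally likely.
space = list(product(DOORS, DOORS))


def P(event, space):
    """Probability of an event: favorable outcomes over all outcomes."""
    favorable = [outcome for outcome in space if event(outcome)]
    return Fraction(len(favorable), len(space))


def stay_wins(outcome):
    """Staying wins when the first pick was the car."""
    car, pick = outcome
    return car == pick


def switch_wins(outcome):
    """Switching wins when the first pick was a goat."""
    car, pick = outcome
    return car != pick


print("stay:  ", P(stay_wins, space))    # 1/3
print("switch:", P(switch_wins, space))  # 2/3
```

Changing the assumptions changes the sample space, and with it the answer, which is exactly the point the notebook's discussion of "paradoxes" turns on.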

In any case, if you are interested in statistics and/or Python, this is a good read. If you like the notebook, you may also be interested in checking out his free online course on programming principles (h/t Justin Kuczynski), or “Artificial Intelligence: A Modern Approach”, his canonical text on artificial intelligence, which has accumulated a few* citations.

*26,167