Building a data science portfolio
By Lindsey Doermann
January 11, 2019
ChemE capstone projects pair students with industry and academic partners
What does natural language processing have to do with protein expression? On the surface, not a whole lot. But in spring 2018, Josh Smith made a connection between the two — and found it rather powerful. Smith and fellow ChemE graduate students Jay Rutherford and Christopher Nyambura employed the principles of a machine-learning algorithm that's used to glean meaning from gobs of text to create a kind of “grammar” for proteins. From there, they trained an algorithm to predict how well different proteins could be synthesized.
Smith’s team was one of 12 to work with an industry or academic partner on a data-intensive capstone project. Graduate students complete these capstones to round out a nearly year-long data science training program led by ChemE professors David Beck and Jim Pfaendtner. ChemE and UW’s Clean Energy Institute partner to offer the program, which started with an emphasis on clean energy applications. With every year, it’s taking on a broader molecular science scope to equip students with the skills to handle complex data sets across all areas of ChemE research.
“Molecules don’t limit themselves to clean tech,” says Beck. “The fundamental aspects of data science are applicable to many areas.” So Smith’s team worked with pharmaceutical company Novo Nordisk on a bioinformatics project relating to protein expression. Other teams took on problems ranging from chiller plant efficiency to sensor accuracy to chemical toxicity.
The master's and Ph.D. students who completed the data science capstones came from across the College of Engineering and the Department of Chemistry, with strong ChemE representation.
No previous experience in data science was required; in the winter quarter, students took two data science classes to learn concepts and tools, then applied that training to real-world problems in the spring. “For most students, all they’ve seen is the academic side,” says program manager Kelly Thornton. “This [program] shows them other paths.”
Beck and Pfaendtner worked with sponsors to scope out data-intensive projects that teams of three to five students could complete in one quarter. Then they found students with the interest and the right mix of skills to take them on. Smith said his team met with its Novo Nordisk sponsor right away to create a strategy, and they corresponded regularly throughout the project. To top it off, they got to check out the company’s facilities in the South Lake Union area of Seattle.
Additionally, the student teams all got together every two weeks for a “stand-up,” during which a representative from each project shared their status, goals for the upcoming weeks, and issues blocking progress. Beck found that teams had a lot to share with each other; he said the “few” minutes allotted to each team often turned into many, as classmates exchanged resources and offered up new ideas. The quarter concluded with a showcase at which students presented their projects. In addition to displaying their work at a poster session, each team gave a one-slide presentation in the style of an elevator pitch.
In explaining their project, Smith and his group showed how they sought a more streamlined way of predicting which proteins would be easy or difficult to produce. Recombinant proteins have many applications in biology and medicine, including biopharmaceuticals and other areas of interest to Novo Nordisk. A faster way to screen peptide sequences could in theory lead to more efficient drug development or improved protein manufacturing processes.
Existing methods of determining protein expression levels have involved running data through multiple steps using a suite of software. Novo Nordisk challenged the team to simplify this process. The students’ approach drew from natural language processing algorithms that determine the thrust of movie reviews. The machine “reads” scores of articles and essentially learns if words have positive or negative connotations; then it can group films into hits and flops.
The data science team trained its algorithm on a relatively small “big data” set of 45,000 peptide sequences. The way Smith tells it, they treated each amino acid as a word and each peptide sequence as a sentence. And they found that their model did in fact learn — it grouped similar proteins together. Beck calls their result a “beautiful single model,” and the students are now preparing a paper to submit for publication.
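To make the "amino acid as word, peptide as sentence" idea concrete, here is a minimal sketch in Python. It is not the team's model, which the article does not describe in detail. It uses a simple bag-of-words count and cosine similarity as a stand-in, and the peptide strings and function names are hypothetical, chosen only for illustration.

```python
from collections import Counter
from math import sqrt

# One-letter codes for the 20 standard amino acids: the "vocabulary".
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def tokenize(peptide):
    """Split a peptide 'sentence' into amino-acid 'words'."""
    tokens = list(peptide.upper())
    unknown = set(tokens) - set(AMINO_ACIDS)
    if unknown:
        raise ValueError(f"unexpected residue codes: {sorted(unknown)}")
    return tokens


def bag_of_words(peptide):
    """Count how often each amino acid appears (sequence order is ignored)."""
    counts = Counter(tokenize(peptide))
    return [counts[aa] for aa in AMINO_ACIDS]


def cosine_similarity(u, v):
    """Return the cosine of the angle between two count vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Hypothetical toy peptides, invented for this example (not real data).
peptides = {
    "hydrophobic_1": "AVLIVAL",
    "hydrophobic_2": "LIVAAVL",
    "charged_1": "KDEKRDE",
}

vectors = {name: bag_of_words(seq) for name, seq in peptides.items()}

if __name__ == "__main__":
    names = list(vectors)
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            score = cosine_similarity(vectors[first], vectors[second])
            print(f"{first} vs {second}: {score:.2f}")
```

In this toy version, peptides with similar amino acid composition score as similar, which loosely mirrors the grouping behavior described above. A learned model would presumably also capture the order and context of residues, which a plain count cannot.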
Smith is now finishing his Ph.D. and hopes to transition into a data science role within the biotech sector. He says that what he found challenging about the capstone project — the exchange of information between disparate fields — was also inspiring. He sees how data science can create a bridge between experts in the hard sciences and those in machine learning — and generate novel solutions as a result.
On top of this, Beck and Pfaendtner see a unique opportunity for students to build soft skills such as project management and stakeholder engagement. The training sets up a win-win: preparing students for new, highly prized data science jobs, and giving industry partners a leg up in leveraging what might otherwise be unwieldy data streams. Look out for this new breed of chemical engineer.
Some of the data science capstone projects that ChemE students completed in spring 2018
- Characterizing proteins with neural networks, with sponsor Novo Nordisk
- Detecting sensor drift in chiller plants, with sponsor Optimum Energy
- Predicting climate change sentiments by applying deep learning to Twitter data, with sponsor KPMG
- Using image processing and self-learning to identify the best rooftops for solar panels in Anchorage, Alaska, with sponsor Alaska Center for Energy and Power
- Predicting chemical toxicity using deep learning, with sponsor Pacific Northwest National Laboratory
Does your company or organization have a data science project that ChemE students could take on?
Program directors are currently developing capstone projects for trainees for the spring 2019 quarter. Please contact Jim Pfaendtner at firstname.lastname@example.org to explore the capstone program and other ways to connect around data science.