Comparison of our analysis of tyrosine kinase signaling in Monosiga brevicollis with that of the Pincus et al report

A related analysis of pTyr signaling genes in Monosiga has recently been published by Pincus et al (PNAS: 105: 9680-84). While both analyses show that phosphotyrosine (pTyr) signaling is extensive in Monosiga, their data look dramatically different from ours. While small variations are likely in any large-scale analysis of draft genomes, these differences are substantial enough to be worthy of comment. We do not yet have access to their raw data, but based on their report, here are some distinctions and how they may have come about. This is also a somewhat rare chance to compare two approaches to the same computational question and to address the reasons for the different results.

Why do we find so many more genes?

Many differences appear to be due to analytical approach: they used the gene predictions from the genome project and SMART domain profiles without further analysis, while our analysis benefited from extensive re-analysis of gene models (for instance, 102 of the 128 TKs are modified or novel relative to the genome predictions), and from downstream testing of predicted domains with multiple approaches and manual curation.

Using this more complete analysis, we see strong support for 128 tyrosine kinases, of which over 100 fall into multi-gene families, while Pincus et al find 45. Similarly, we see 123 SH2 proteins compared with their 100, and 39 PTPs in place of their 34 (they do not look at PTB domain proteins). Some of the gene counts in other organisms also appear to be slightly off, even from published counts, though they do not provide exact counts for most of their data (e.g. humans have 90 TKs (Manning et al, Robinson et al), not 85 as stated). This is likely due to over-reliance on one automated assignment method (the SMART HMMs). For kinases in particular, it may be that some TKs were mis-classified as serine-threonine kinases, for which there is a distinct SMART model.

Why do we miss so many odd domain combinations?

We see TK, PTP, SH2 and PTB domains in many new domain combinations, but they are mostly with known signaling or adaptor domains. Pincus et al see an even wider set of domain combinations, many unrelated to signaling. Several of these are possible gene prediction artefacts, where two neighboring ORFs are fused by the genome project gene predictions. Most unusual domain combinations have long introns (>700 bp) separating the signaling portion from the unusual portion, while such introns are very rare in the more typical genes. In some cases we find EST support in M. brevicollis or M. ovata for truncating the ORF before crossing the intron. For these reasons, we truncated several sequences, removing domains such as the ETC, ATP11, Rib L36 and Topoisomerase domains. In other cases, we keep the genome center predictions but see no evidence for domains such as HDAC or HDAC interaction domains, or Guanylate Kinases, and we suspect that some weak domain hits like TNFR or GCC2/3 may be false positives due to high cysteine content and semi-conserved spacing of cysteine residues. While these regions include some Monosiga-specific conserved motifs, many with CxxC sequences, not all can be explained by such motifs, so we have labeled such domains simply as "Cys-rich". Given the vagaries of domain detection, we cannot claim that our dataset is perfectly correct either, in terms of either ORF sequence or domain identities.
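The long-intron criterion above lends itself to a simple automated screen. The following is a minimal sketch of how such a check could be written, not the curation procedure used for this dataset: the `Domain` record, the function names, the forward-strand genomic coordinates and the example gene model are all hypothetical, and only the 700 bp threshold is taken from the text. Any flagged model would still need EST support or manual inspection before truncation.

```python
from dataclasses import dataclass

LONG_INTRON_BP = 700  # threshold quoted in the text; everything else is assumed


@dataclass
class Domain:
    """Hypothetical record: a predicted domain mapped to genomic coordinates."""
    name: str
    start: int          # genomic start, forward strand, 1-based inclusive
    end: int            # genomic end, 1-based inclusive
    is_signaling: bool  # True for TK, PTP, SH2, PTB and similar domains


def long_introns(exons, min_len=LONG_INTRON_BP):
    """Return (start, end) intervals of introns longer than min_len.

    exons is a list of (start, end) tuples sorted by position.
    """
    found = []
    for (_, prev_end), (next_start, _) in zip(exons, exons[1:]):
        if next_start - prev_end - 1 > min_len:
            found.append((prev_end + 1, next_start - 1))
    return found


def suspect_fusions(exons, domains, min_len=LONG_INTRON_BP):
    """Flag long introns that separate a signaling from a non-signaling domain.

    Returns a list of (intron, left_domain_name, right_domain_name).
    """
    hits = []
    for intron in long_introns(sorted(exons), min_len):
        i_start, i_end = intron
        left = [d for d in domains if d.end < i_start]
        right = [d for d in domains if d.start > i_end]
        for a in left:
            for b in right:
                if a.is_signaling != b.is_signaling:
                    hits.append((intron, a.name, b.name))
    return hits


# Invented gene model: two signaling domains, then an unrelated domain
# lying beyond an 899 bp intron.
example_exons = [(100, 400), (450, 900), (1800, 2400)]
example_domains = [
    Domain("SH2", 120, 380, True),
    Domain("TyrKc", 460, 880, True),
    Domain("Rib_L36", 1850, 2000, False),
]
for hit in suspect_fusions(example_exons, example_domains):
    print(hit)
```

A screen of this kind only prioritizes candidates: a long intron between unrelated domains is suggestive of a fused prediction, not proof of one.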

Differences in domain architectures highlighted in Figure 3

In Pincus et al, Figure 3a lists the STAT-SH2 combination as bilaterian-specific. However, this combination is not only found in Monosiga, but is also found in as divergent an organism as Dictyostelium discoideum, and presumably has been secondarily lost in fungi. Some of the other domains are unclear: for instance, for the three Monosiga-specific domains, there are four IDs listed in the supplement, but none has any obvious similarity to an HDAC-interaction domain (Gene ID 10594 has the SAM and SH2 domains as marked, but even in the genomic sequence, we see no sign of an HDAC), and none has a PTP domain (Gene 23461 has a B41/FERM domain, but the proteins that we see with PDZ and PTP domains both lack B41 and GuK domains, and are orthologs of PTP-BAS or PTP-H1, and so are not Monosiga-specific). For Mbre+bilaterian, we do not see any SH2-RasGEF protein, though we do see the opposite combination (10070 has RasGEF-SH2-SH2), so we have classified this architecture (SH2D3) as metazoan-specific. The SH2-Ank-Ank architecture seen only in Mbre and Nvec is also confusing: the Mbre ID is not given, the Nvec ID (51792) is not found in the JGI gene list, and we do not see such an architecture in Mbre. SH2-Ank-Ank architectures are seen as part of shark-family kinases in invertebrates, but not on their own.

In Figure 3b, the upper portion shows a Jak kinase in coelomates. However, the N-terminal kinase domain labeled as a generic STYK (serine-threonine-tyrosine kinase) is in fact a degenerate tyrosine kinase domain of the same (Jak) family, and contrary to the figure, Drosophila Jak also contains this pseudokinase domain, though it is highly degenerate. Several Monosiga kinases have SH2-TK combinations, and two (CTKA1, UTK05) even have SH2-TK-TK combinations, but none is clearly orthologous to Jak, so while these domain combinations have been used in both organisms, it is not clear whether this reflects common origin, convergent evolution, or chance. The lower PTP panel appears to include human RPTP K/M/N/L (these are usually taken as having 4 FN3 repeats); the Monosiga counterpart is unclear, as the only genes with PTP and FN3 domains in our study are cadherins and a gene (N23) not predicted by the genome project. The Fused repeats are also somewhat questionable, since we see many weak extracellular hits to these and other Cys-rich domains that appear to be dominated by their amino acid composition rather than by their sequence.
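One way to judge whether a weak hit is driven by composition is to summarize the cysteine content and cysteine spacing of the matched region. The sketch below is illustrative only: the function names and the example peptide are invented, and no cutoff for calling a region "Cys-rich" is implied, since none is stated here.

```python
import re


def cysteine_fraction(seq):
    """Fraction of residues in seq that are cysteine (0.0 for an empty string)."""
    seq = seq.upper()
    return seq.count("C") / len(seq) if seq else 0.0


def cysteine_spacings(seq):
    """Distances between consecutive cysteines, in residues."""
    positions = [i for i, aa in enumerate(seq.upper()) if aa == "C"]
    return [b - a for a, b in zip(positions, positions[1:])]


def count_cxxc(seq):
    """Number of CxxC motifs, counting overlapping occurrences."""
    return len(re.findall(r"(?=C..C)", seq.upper()))


# Invented example region, not taken from any Monosiga gene.
region = "GCPGCTKCAEGCSQ"
print(cysteine_fraction(region))
print(cysteine_spacings(region))
print(count_cxxc(region))
```

A region with high cysteine content and regular spacing will tend to score weakly against many unrelated Cys-rich domain models, which is why such hits are better treated with caution than assigned to a specific family.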
