Research Blog
Programming
Improvements
Jul 1st
TypeA
- Removed all motifs where both similar drugs bind to the target.
- 30,231 motifs were originally identified, now 27,499 (-2,732).
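A rough sketch of how this kind of filter could look, not the actual code. The `TypeAMotif` class, the `bindsTo` map and the method name are all made up for illustration, and nodes are represented by plain integer ids rather than Ondex concepts.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical motif instance: Drug1 <--> Drug2 --> Target, ids only.
final class TypeAMotif {
    final int drug1, drug2, target;

    TypeAMotif(int drug1, int drug2, int target) {
        this.drug1 = drug1;
        this.drug2 = drug2;
        this.target = target;
    }
}

final class TypeAFilter {
    /**
     * Keeps only motifs where at least one of the two similar drugs is NOT
     * already known to bind the target. bindsTo maps a drug id to the ids
     * of the targets it binds (an assumed, simplified view of the graph).
     */
    static List<TypeAMotif> removeBothBind(List<TypeAMotif> motifs,
                                           Map<Integer, Set<Integer>> bindsTo) {
        List<TypeAMotif> kept = new ArrayList<>();
        for (TypeAMotif m : motifs) {
            boolean drug1Binds = bindsTo
                    .getOrDefault(m.drug1, Collections.<Integer>emptySet())
                    .contains(m.target);
            boolean drug2Binds = bindsTo
                    .getOrDefault(m.drug2, Collections.<Integer>emptySet())
                    .contains(m.target);
            if (!(drug1Binds && drug2Binds)) {
                kept.add(m);
            }
        }
        return kept;
    }
}
```

The difference between the size of the input list and the size of the returned list gives the number of removed motifs.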
TypeB
- Removed motifs where the drug binds to both targets.
- 22,864 motifs were originally identified, now 19,185 (-3,679).
TypeC
- Removed motifs where protein(1) inv_in disease(2) or protein(2) inv_in disease(1).
- 70,866 motifs were originally identified in the dataset, now 70,715 (-151).
- Tried: Remove motifs where the target connects to protein(2). This did not take into account that some links are has_similar_sequence/structure and not is_a. 200 (-70,515).
- Removed motifs where target is_a protein(2); this made no difference to the total.
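To show why the relation type matters in the check above, here is a minimal sketch. The `Relation` class and `linkedBy` method are hypothetical and use plain ids and type names; this is not how Ondex represents relations, and treating links as undirected is my assumption.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical, simplified relation: two node ids and a relation type name.
final class Relation {
    final int from, to;
    final String type;

    Relation(int from, int to, String type) {
        this.from = from;
        this.to = to;
        this.type = type;
    }
}

final class RelationCheck {
    /** True if a relation of exactly this type links a and b (either direction). */
    static boolean linkedBy(List<Relation> relations, int a, int b, String type) {
        for (Relation r : relations) {
            boolean sameEnds = (r.from == a && r.to == b) || (r.from == b && r.to == a);
            if (sameEnds && r.type.equals(type)) {
                return true;
            }
        }
        return false;
    }

    /** True if any relation at all links a and b, whatever its type. */
    static boolean linkedByAny(List<Relation> relations, int a, int b) {
        for (Relation r : relations) {
            if ((r.from == a && r.to == b) || (r.from == b && r.to == a)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Target (id 1) is linked to protein(2) (id 2) only by similarity.
        List<Relation> relations =
                Arrays.asList(new Relation(1, 2, "has_similar_sequence"));
        System.out.println(linkedByAny(relations, 1, 2));      // any link
        System.out.println(linkedBy(relations, 1, 2, "is_a")); // is_a only
    }
}
```

Filtering on "any link" removes a motif as soon as a similarity link exists, while filtering on is_a alone leaves it in, which matches the two very different totals seen above.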
Ondex
- Selecting a node with a certain id in Ondex?
- Too hard when there are 500k nodes
- Search doesn’t work
- Ondex Console: how is this used? Help is, surprisingly, unavailable.
Cutting down
Jun 30th
~70k seems a high number for the TypeC motif.
This high number is due to the multiple diseases involved with Protein(1) and the multiple diseases involved with Protein(2). Each of these is counted as a separate motif.
- Do not count motifs where Protein(2) is known to interact with Disease(1), or Protein(1) with Disease(2) (not interested). This only cut the results by ~200 (70,510).
- Remove motifs where Target interacts with Protein(1) and Protein(2). This cut the result to 200 motifs.
- Only 200 were left because I only checked whether Target was connected to Protein(2), but relations other than “is_a”, such as “h_s_s”, can also connect these nodes.
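A toy illustration of why the TypeC count grows so quickly, assuming each (Disease(1), Disease(2)) combination for a protein pair is counted as its own motif instance. The class name and the disease labels are placeholders, not data from the dataset.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class TypeCCounting {
    /** Enumerates one instance per combination of a disease from each protein. */
    static List<String> instances(List<String> diseases1, List<String> diseases2) {
        List<String> out = new ArrayList<>();
        for (String d1 : diseases1) {
            for (String d2 : diseases2) {
                out.add(d1 + " / " + d2);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Placeholder labels: 3 diseases for Protein(1), 2 for Protein(2).
        List<String> diseases1 = Arrays.asList("D1a", "D1b", "D1c");
        List<String> diseases2 = Arrays.asList("D2a", "D2b");
        System.out.println(instances(diseases1, diseases2).size()); // prints 6
    }
}
```

So a single protein pair with 3 and 2 associated diseases already contributes 6 instances under this assumption.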
Mare
Jun 28th
Motif Finder
Exported the datasets ready for use with software (a motif finder) that searches for overrepresented motifs. (Hypergeometric tests: 1 day's work.) Cytoscape has a plugin.
- ib_2010_data.1.xml.gz -> .dot (50mb)
- ib_2010_data.1.xml.gz -> graph htm??? (can't remember, but it was 450mb)
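For reference, a minimal sketch of the upper-tail hypergeometric probability that an overrepresentation test is usually based on. This is a generic textbook calculation written for illustration, not what any particular motif finder or Cytoscape plugin does, and how motif counts would map onto the four parameters is left open here.

```java
final class Hypergeometric {
    /** Natural log of the binomial coefficient C(n, k). */
    static double logChoose(int n, int k) {
        if (k < 0 || k > n) {
            return Double.NEGATIVE_INFINITY; // C(n, k) is 0 in this case
        }
        double sum = 0.0;
        for (int i = 1; i <= k; i++) {
            sum += Math.log(n - k + i) - Math.log(i);
        }
        return sum;
    }

    /**
     * P(X >= observed) when drawing `draws` items without replacement from a
     * population of `population` items, `successes` of which are "hits".
     * Assumes 0 <= successes <= population and 0 <= draws <= population.
     */
    static double upperTail(int population, int successes, int draws, int observed) {
        double p = 0.0;
        int max = Math.min(successes, draws);
        double logDenominator = logChoose(population, draws);
        for (int i = Math.max(observed, 0); i <= max; i++) {
            p += Math.exp(logChoose(successes, i)
                    + logChoose(population - successes, draws - i)
                    - logDenominator);
        }
        return Math.min(1.0, p);
    }
}
```

A small p-value from `upperTail` would suggest that the observed count is higher than expected by chance under this model.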
Motif Type C
Trying to define another motif, struggling… similar targets?
Plan
Results/Code/Other program (June-July 2010)
4 Weeks
- Week 1 – Code/Motif Finder (i.e. Motif A,B,C… x,y,z instances)
- Week 2 – Java/Score/FoundExamples
- Week 3 – Score
- Week 4 – Score
Paper (July-August 2010)
- Intro
- Background
- Methods
- Results
- TypeA
- TypeB
- TypeC
- Discussion
- Future work
- Abstract
- Definitions of S.M.
- Searching
- Scoring
Cleaning
Jun 20th
Cleaned up the code.
- Removed deprecated stuff
- Added some comments
Also fixed the name servers.
Motif Definition 2
Jun 17th
The second motif that was defined has been completed in Java.
Example output from Java (cutoff dataset):
The code was run on the updated dataset:
- There were 70,866 instances of this motif definition in the dataset.
- However, this doesn’t address the issue of the OMIM database, so many of these will not have true “Diseases”. This would probably be best addressed when trying to score this type, if it’s even possible to do this.
Cross Checking
Jun 15th
Linux (Ubuntu) was successfully installed onto the machine along with the latest version of Ondex. The large dataset seemed to work fine in Linux with 3.6 GB of memory allowed for the program. I searched for Chlorpromazine within the dataset to get the node IDs of the semantic motif discussed in the paper.
- Drug 1 – Chlorpromazine – ID:120943
- Drug 2 – Trimeprazine – ID:120836
- Target – Histamine H1 Receptor – ID:715
I then traversed the output of the Java code in Ondex Mini, and successfully found the motif:
Definition of a more complex example
- Started an implementation of a more complex motif i.e. look for similar proteins.
- Developed it using the chart created on a previous post.
- Seems to be working okay, although there are some issues with the Disease output and when to increment the motif counter/add to the motif Set.
Abstract Motifs
Jun 14th
Developed a Java class to represent an abstractMotif as described in the paper (it has fields for drug1, drug2 and the potential target for drug1 (see previous post)). Created a TreeSet (which allows automated sorting of a Set, currently sorting by drug1.getId()) to store motifs found within the dataset; these are added in the method when a target is discovered. These motifs can then be manipulated when trying to score this type of motif.
It would be a good idea to cross check this with Chlorpromazine, but I need to get Linux up and running properly in a dual-boot due to memory issues in Windows (also not being able to print concept names is a problem).
Wrote a toString() method for a nicer output for testing purposes:
Drug1 <--> Drug2 --> Target
I wanted the Concept’s name, for example Chlorpromazine:
Chlorpromazine <--> Trimeprazine --> Histamine H1 Receptor
However, the Ondex API is not working for printing names, so I am using getId() values for now.
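A minimal sketch of a motif class along these lines, for illustration only. It stores plain integer ids instead of Ondex concepts, and the class, field and method names are my own placeholders. One thing worth noting: a TreeSet treats two elements as duplicates when the comparator returns 0, so a comparator on the drug1 id alone would keep only one motif per drug1. The sketch therefore breaks ties on drug2 and target, which is an assumption on my part and not necessarily what the real code does.

```java
import java.util.Comparator;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical stand-in for the abstract motif: ids only, no Ondex types.
final class AbstractMotifSketch {
    final int drug1Id, drug2Id, targetId;

    AbstractMotifSketch(int drug1Id, int drug2Id, int targetId) {
        this.drug1Id = drug1Id;
        this.drug2Id = drug2Id;
        this.targetId = targetId;
    }

    // Same layout as the output format: Drug1 <--> Drug2 --> Target
    @Override
    public String toString() {
        return drug1Id + " <--> " + drug2Id + " --> " + targetId;
    }
}

final class MotifStore {
    // Sort by drug1 first, then break ties so distinct motifs are not dropped.
    static final Comparator<AbstractMotifSketch> ORDER =
            Comparator.comparingInt((AbstractMotifSketch m) -> m.drug1Id)
                    .thenComparingInt(m -> m.drug2Id)
                    .thenComparingInt(m -> m.targetId);

    static Set<AbstractMotifSketch> newMotifSet() {
        return new TreeSet<>(ORDER);
    }

    public static void main(String[] args) {
        Set<AbstractMotifSketch> motifs = newMotifSet();
        // Node ids from the Chlorpromazine / Trimeprazine / H1 receptor example.
        motifs.add(new AbstractMotifSketch(120943, 120836, 715));
        for (AbstractMotifSketch m : motifs) {
            System.out.println(m);
        }
    }
}
```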
Output
Here is some current output of the basic semantic motif definition.
Differences
This new implementation differs from the code previously developed by Cockell, S.J. for the 2010 paper. In cutoff_data.xml.gz, 157 motifs were identified, the same as with the other method. However, the results differ on the larger dataset ib2010_data.xml.gz: previously 26,693 motifs were identified, compared with 29,633 using the new implementation, a difference of 2,940.
Note
As a reminder, the following was discovered in the proper dataset:
- Drugs seem to have the ConceptClass “Comp:Compound”.
- Rather than the ConceptClass “Compound”.
- There are 70 of these, none of which seem to bind to a target.
- There are over 8k concepts of the former, so I will use this.
Chart
Jun 10th
Graph of what the basic motifs for protein similarity and drug similarity look like, so I can see the names of each conceptClass and, more importantly, each relationType.
Potential solution (in progress) to this:-
The code appears to be working: it traverses and prints out the information of the nodes (although not each individual motif).
June 8
Jun 8th
Today:
GDS.getValue returns the “Tanimoto” value as an Object. Converted it to a double so that it can be compared with a double cutoff_value in start(), before the compounds are examined. Used 1.0 as the cutoff, which cut down the number of motifs found by 81%.
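A small sketch of the Object-to-double conversion and cutoff comparison, for illustration. The helper class and method names are hypothetical and nothing here calls the Ondex API; it only assumes the value arrives as a plain Object. Whether the real comparison is >= or > is also an assumption.

```java
// Hypothetical helper; not part of the Ondex API.
final class TanimotoCutoff {
    /** Converts an attribute value that arrives as an Object into a double. */
    static double toDouble(Object value) {
        if (value instanceof Number) {
            return ((Number) value).doubleValue();
        }
        // Fall back to parsing, e.g. if the value was stored as a String.
        return Double.parseDouble(String.valueOf(value));
    }

    /** True if the similarity score reaches the cutoff (>= is assumed here). */
    static boolean passes(Object value, double cutoff) {
        return toDouble(value) >= cutoff;
    }

    public static void main(String[] args) {
        double cutoffValue = 1.0;
        System.out.println(passes(Double.valueOf(1.0), cutoffValue));
        System.out.println(passes("0.85", cutoffValue));
    }
}
```

With a cutoff of 1.0, only pairs with the maximum Tanimoto score pass, which would fit with the large drop in motifs found.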






