>> okay, thank you very much, george, for the introduction. so, what i'm going to talk about today is actually not the core database work of my team; i'm going to talk about the integration of human and mouse phenotype data. so, by way of an introductory slide, basically,
all of the work that my team does is united by attempts to add value to data through curation and better semantic representation. it's all about functional genomics, and the data comes from many different sources. we have [inaudible], human and mouse projects,
and this one is going to focus on phenotype representation and integration for human and mouse. this work is primarily funded by two nhgri grants: one for the gwas catalogue and one for komp2 data coordination. the first part of the talk is going to be about human gwas phenotype information, and part one is titled "a picture paints a thousand traits" because this is really all about the visualization of data and how we integrate visual representations of data
with more traditional resources, like databases such as ensembl. so i'll talk a bit about the data and the ontology we developed, a bit about visualization, and about extension to other domains. this is kind of important because all of the code that we produce, for the whole of the ebi and all of the projects we work on, is open source, so the code is portable and can be reused by other projects, and we're happy if people would like to reuse it. this project is called the goci project; it's a collaboration between the ebi and the office of population genomics at the nhgri, teri manolio's group,
who produce the gwas catalogue. it's a catalogue of associations identified in the literature between snps and identifiable traits in human. i have a slide here on gwas, though i suspect the audience doesn't need to be told what a gwas is. but effectively, the data that's represented here is high-density data which is then curated by the curators of the gwas catalogue. the gwas catalogue basically is a manual curation
of published human gwas studies. there's a weekly literature search to identify new studies, and all of the data is manually extracted into a web interface by a curator. we've built some tooling to help with that, for example to extract data from tables where there are more than fifty snps. one very critical thing is that the data is all double-checked: everything is extracted by one person, who's an expert geneticist, and then checked by another person who's also an expert geneticist. and then any issues are discussed,
and that results in a very high-quality, well-curated data set. this data is presented on the gwas diagram, and as of the 17th of october this year there were almost 1,400 publications, 7,200 snps, over 700 distinct traits and nearly 8,300 snp-trait associations. the primary place to view this, until the work that i'm describing here, was actually through the nhgri's catalogue, and it's at the url here. on all the slides today, the urls are present at the bottom of the slide,
so you can actually go straight to google if you're interested. so, the inclusion criteria for the gwas catalogue: at least 100,000 snps have to be assayed. there's a p-value inclusion threshold of ten to the minus five, and only one snp per gene or region of high ld is extracted. information about the paper, such as title, journal, publication date, trait and ethnicity, is extracted along with information on the snps, including rsid, strongest allele, gene, risk allele frequency, p-value, odds ratio, confidence intervals and standard error.
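as a minimal sketch of the inclusion criteria just described, assuming each association arrives as a simple record: the two thresholds come from the talk, but the field names, the inclusive boundaries and the rule of keeping the most significant snp per region are assumptions made for illustration, and the rsids are made-up placeholders.

```python
from dataclasses import dataclass

MIN_SNPS_ASSAYED = 100_000   # study-level threshold quoted in the talk
MAX_P_VALUE = 1e-5           # association-level threshold quoted in the talk


@dataclass(frozen=True)
class Association:
    rsid: str
    region: str        # hypothetical label for a gene or high-LD region
    p_value: float
    snps_assayed: int  # number of SNPs assayed by the reporting study


def eligible(associations):
    """Return at most one association per region that passes both thresholds."""
    best = {}
    for a in associations:
        if a.snps_assayed < MIN_SNPS_ASSAYED or a.p_value > MAX_P_VALUE:
            continue
        current = best.get(a.region)
        # Assumption: the most significant SNP represents its region.
        if current is None or a.p_value < current.p_value:
            best[a.region] = a
    return sorted(best.values(), key=lambda a: a.p_value)


# Made-up placeholder rows, not catalogue data.
rows = [
    Association("demo_rs1", "GENE_A", 3e-9, 500_000),
    Association("demo_rs2", "GENE_A", 2e-6, 500_000),  # same region, weaker
    Association("demo_rs3", "GENE_B", 4e-4, 500_000),  # p-value too large
    Association("demo_rs4", "GENE_C", 1e-7, 50_000),   # too few SNPs assayed
]
print([a.rsid for a in eligible(rows)])
```

in the real catalogue this selection is an expert manual step, so the sketch only shows the shape of the rule, not how the curators apply it.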
whether a snp has been previously reported is also recorded. this is actually quite challenging information to extract, and that's why this is a very manual, expert process. the gwas data is mostly based on information in publications, but there is some supplementary annotation done by the ncbi, for example base pair location, assignment of cytogenetic bands, and identification of upstream and downstream genetic loci. there's also a mesh mapping performed on gwas traits, and mesh was the vocabulary that was initially used
to annotate gwas traits before the work i'm about to describe. importantly, the gwas catalogue data is consumed by resources like ensembl, where it forms a large part of the variation data that's displayed against the genome. i'm showing here a schema diagram of the gwas catalogue, with some examples taken from a recent paper in nature, and this basically forms the schema for all of the representations that we're showing here: we have a study, trait association, trait, snp, gene, cytogenetic band and chromosome.
and cytogenetic band is very important in this visualization because all of the visualization is by the karyotype. it makes for a very familiar visualization for geneticists and molecular biologists, and a lot of people seem to find it a much easier representation than the more common track-based representation. some interplay between the two is probably the optimal visual representation for navigating this data, and that's what we're trying to get to.
so our project really was to develop and maintain an ontology for the gwas data. the gwas term here in these slides is in square brackets because actually the ontology is used for a lot of things, lots of functional genomics data sets and not just gwas, and i'll talk more about that in a moment. we wanted to support the very manual process that the curators do, to try and increase efficiency. one of the problems is that the number of curators stays fairly constant but the volume of data grows all the time.
and we wanted to ensure that we could support accuracy as well, so we could detect cases where, for example, trait associations were quite different between papers but maybe there was some commonality. one of my motivations was that we wanted to integrate the catalogue data into our ebi resources, and to do that we needed a semantic layer over the top of it. so this is a screenshot of the gwas catalog trait list.
and effectively, this was just one long list of 700-plus traits. so if you wanted to get all of the papers related to diabetes, or all of the insulin resistance measurements for example, or all of the glucose measurements, it was actually kind of tricky, because you had to scan the list of 700 traits and try to figure out which of those you wanted. and then you couldn't dynamically visualize this on the diagram. additionally, the traits are highly diverse, so they include things which i would consider to be a true phenotype, like hair color and eye color.
they include treatment responses: there were a lot of drug responses, response to anti-cancer agents, response to anti-psychotics. a lot of diseases, both genetic and common. and a lot of chemical and drug names, and also enzyme or protein names, so for example c-reactive protein is a common annotation. on top of that, the traits are often compound and/or context-dependent. it's quite common to see two diseases that are coincident, or disease and treatment, or disease and phenotype. and in this example, parkinson's disease interaction with caffeine,
there's a kind of lifestyle information there as well as parkinson's disease. so it's fairly complicated information with a lot of implicit knowledge, and the fact that it's implicit makes it quite hard to query and visualize. a lot of the work we've done has been to tease out the implicit knowledge. we basically want to integrate the traits into a structured hierarchy and to have links between them, so that we can show multiple parentage. for example, we just put in familial hypercholesterolemia:
it's a genetic disorder and it's a metabolic disease. we want to answer questions like "show me all snps associated with type 2 diabetes and metabolic syndrome", so you could do combinatorial queries. all of these things were not possible in the version of the gwas catalogue that we started working with. so really, we're taking the information that the curators have very carefully extracted and just binning it and mapping it in a different way.
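a minimal sketch of why a hierarchy with multiple parentage makes that kind of query possible. only the two parents of familial hypercholesterolemia come from the talk; the rest of the toy hierarchy, the function names and the placeholder rsids are assumptions for illustration and are not efo's actual structure. the query is read here as "associated with either trait".

```python
# Toy trait hierarchy: each term maps to its direct parents (a DAG, so a
# term may have more than one parent).
PARENTS = {
    "familial hypercholesterolemia": {"genetic disorder", "metabolic disease"},
    "type 2 diabetes": {"diabetes mellitus"},
    "diabetes mellitus": {"metabolic disease"},
    "metabolic syndrome": {"metabolic disease"},
    "genetic disorder": {"disease"},
    "metabolic disease": {"disease"},
}


def ancestors(term):
    """Collect every term reachable by following parent links upwards."""
    seen = set()
    stack = [term]
    while stack:
        for parent in PARENTS.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def is_a(term, query):
    """True if term is the query term or sits anywhere below it."""
    return term == query or query in ancestors(term)


# Placeholder SNP-trait pairs, not catalogue data.
ASSOCIATIONS = [
    ("demo_rs1", "type 2 diabetes"),
    ("demo_rs2", "metabolic syndrome"),
    ("demo_rs3", "familial hypercholesterolemia"),
    ("demo_rs4", "hair color"),
]


def snps_for(*queries):
    """SNPs whose trait falls under at least one of the query terms."""
    return sorted({rsid for rsid, trait in ASSOCIATIONS
                   if any(is_a(trait, q) for q in queries)})


print(snps_for("type 2 diabetes", "metabolic syndrome"))
print(snps_for("metabolic disease"))
```

querying on a parent term such as "metabolic disease" picks up every trait filed underneath it, which a flat list of 700 trait strings cannot do.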
so we had two options for ontology integration. we could have just created a new gwas ontology and distributed that, or we could have integrated with an existing ontology. my group had already built an ontology called the "experimental factor ontology", and it's an actively developed owl-format ontology. critically, it's an application ontology, so it re-uses concepts from other ontologies but it shapes them to look like the data, which works really well for visualizing data. the current sources of data which efo is used
for include the arrayexpress database at the ebi, and it's being used inside ensembl. it's used by the gwas catalogue now, and it's also being re-used for the biosamples database. at the start of this project it looked very similar to the kind of data that we had, so it had lots of [inaudible] data but it was light on genetic data. it's remapped every month to several resources, including the chebi ontology of small molecules, the gene ontology,
ncbi taxonomy, so it's fairly interoperable. it turned out that it was well suited to covering the diverse set of gwas traits that we had. without any work at all, we already had 20 percent coverage when we did some alignment of the data to the ontology. we also looked at some other ontologies in the domain, and we found that only the disease ontology gave us anything like that, and it wasn't at a comparable level. so what we've done is effectively take the entire trait set
from the gwas catalog, map it into efo, and also add a lot of new terms: 450 new terms have been added to give us 98 percent coverage of the catalogue. that missing 2 percent just represents the fact that we lag slightly behind the curation effort, and we catch up every month when we have a new ontology release. so there are a lot of novel term additions, such as enzymes, antibodies, proteins and serum, and all of these were not trivial to add, because it wasn't just a question of adding which protein
was being assayed; we also had to know why it was being assayed, for example that it was involved in some inflammation response. and that's an interesting query that we could add to the catalogue. there were many response-to-drug terms, so i added over 50 of those. they were actually all added to go, because the gene ontology already holds response-to-drug terms. there's also a lot of information about behavioral studies, and these were not things that were available in any existing ontology. and because efo is fairly widely used within the ebi
and by some external projects, often because it looks like the data from those projects and makes a good visualization, there's high integration potential here. one thing that became very clear is that about 30 percent of the gwas data is published with measurement traits, where a measurement will be stated but there's always an implicit link to some disease or to some biological question. we want to capture that, because to get cross-species mappings between, for example, mouse data and human data, understanding what's measured is as
important as understanding the inference that the clinician makes, because all of the model organism data is assayed and described at the measurement level, as a measurement of a trait, rather than as a high-level description of the trait. so we need both pieces of information. much of the work we've done has been to allow us to integrate across species, and we'll get to that at the end of the presentation. so what we did was create a small schema ontology to model the concepts in the catalogue that i showed
on the previous slide, in that schema diagram. we built a java application to take all of the data from the gwas database and convert it into owl individuals, so this is basically an in-memory database. and we reasoned over this using hermit. this is a fairly serious reasoning activity; it takes around 10 hours on a desktop machine. so you can reason over this thing overnight and generate a new visualization the next day.
the release cycle of the gwas catalog is quarterly at the moment, and a lot of that was down to a very large data freeze and then three weeks of manual work to get to a release. so reasoning for 10 hours is a fairly small overhead for the release frequency we need, and we do have some ideas on how to speed up the reasoning as well. it's never been optimized. we're just running on a standard desktop machine, and actually for our needs that works fine.
so, what we are able to do now is use the gwas knowledge base that we just built, which contains 12,000 individuals and almost 44,000 axioms, to answer queries such as "show me all snps associated with type 2 diabetes and metabolic syndrome with a p-value of 10 to the minus 5, from papers published before january 2010". we can layer in queries looking at combinations of diseases. we can restrict it to a single paper or multiple papers, or to papers which are meta-analyses. all of these things can be readily implemented because we already have all of the axiomatization in the knowledge base.
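the layered query above can be sketched as a chain of filters. this is a plain-python stand-in, not the owl and reasoner machinery the talk describes; the record fields, the inclusive p-value cutoff and the placeholder rows are all assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Record:
    rsid: str
    trait: str
    p_value: float
    published: date
    meta_analysis: bool


def query(records, traits, max_p, before, meta_analysis=None):
    """Filter records by trait set, p-value cutoff and publication date.

    meta_analysis=None means "do not filter on study type".
    """
    hits = []
    for r in records:
        if r.trait not in traits:
            continue
        if r.p_value > max_p or r.published >= before:
            continue
        if meta_analysis is not None and r.meta_analysis != meta_analysis:
            continue
        hits.append(r.rsid)
    return hits


# Placeholder rows, not catalogue data.
records = [
    Record("demo_rs1", "type 2 diabetes", 2e-8, date(2009, 6, 1), False),
    Record("demo_rs2", "metabolic syndrome", 5e-6, date(2008, 3, 15), True),
    Record("demo_rs3", "type 2 diabetes", 1e-9, date(2011, 2, 1), False),
    Record("demo_rs4", "metabolic syndrome", 3e-4, date(2009, 1, 1), False),
    Record("demo_rs5", "hair color", 1e-12, date(2007, 1, 1), False),
]

wanted = {"type 2 diabetes", "metabolic syndrome"}
print(query(records, wanted, 1e-5, date(2010, 1, 1)))
print(query(records, wanted, 1e-5, date(2010, 1, 1), meta_analysis=True))
```

in the knowledge base the trait test would be a class membership check answered by the reasoner rather than a set lookup, which is what lets a query on a high-level disease pull in all of its subtypes.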
so it's actually a new way of using semantic web technologies to visualize the gwas data. it can be extended to model other relationships, the schema is very simple, and we can include other data sources. it's fairly simple to do this, and i'll describe at the end some other communities who are interested in adapting the back end of this to other scenarios. so the motivation has all been about improving the visualization. the gwas catalog is very rich; it
has this beautiful karyotype diagram, which will be shown in a moment, but it was very, very painful to produce. it was done by hand and looks really attractive. for me, this has been a really nice project because i don't often get to work on something that looks really great. usually we have an interface, and in this case we have a really beautiful interface. so, the gwas diagram as such was great for showing the evolution of the data, but because it was very static and hand-drawn
it was hard to integrate. i've just put some screenshots of covers down the right-hand side of the slide, and i guess you'll recognize this as a nature cover. the important thing here, if the audience can read this, is that darryl leja actually used to draw the gwas diagram. so he spent three weeks every quarter sitting down with teri manolio producing a revised diagram, when he could have been doing really beautiful things like this.
so, the gwas diagram was a visualization of all of the snp-trait associations with a p-value of less than 5 times 10 to the minus 8, and it was generated manually, quarterly. it was a static pdf or powerpoint image. one of the problems was that there are now too many traits and colors to reliably identify a single feature. but it does work really well to show how the catalog has grown over time. so i'm just going to show an animation here, and these are darryl's original drawings from the powerpoint presentation
and the pdf that he produced. you can see that over time, as more snps appear on the diagram, it gets richer and richer and richer. 2011 was the last time that the diagram was produced by hand, and we were asked to try to address this because it was becoming too laborious as more and more snps were being reported. and as the diagram was getting more dense, it was very hard to actually place these manually on the diagram. so we started to address this problem.
and i think this is the key problem: the key is the key problem. there are more colors here than the human eye can distinguish reliably. if you print it on a poster and look at it, it looks great. if you put it on a screen, it's really hard, given screen resolutions and what the human eye can distinguish, to map this trait list onto the actual diagram. so what we have done is programmatically generate the diagram from the gwas knowledge base. we used little "renderlets."
a renderlet is just a snippet of code that generates components of the diagram from the knowledge base; it just knows how to draw the owl classes on the diagram and how to lay out the glyphs. the colors of the dots are based on ontology classes, and i'll show that in a moment. all of this is sitting on a scalable vector graphics back end: we took an svg version of the karyotype and we've got javascript and some stylesheet markup, and none of this is particularly complicated.
but putting all the bits together turned out to be tricky. it turned out that the code wasn't complicated; the rendering, the knowledge base and the owl weren't particularly complicated, and we were able to repurpose code that we already had. but making it look nice is a really human skill, so we worked very closely with a colleague at nhgri, so that the users would see something that they were familiar with that would lead them into the additional features that we had added.
and so this is what we've produced, and i guess it looks very similar to the original diagram. however, this thing is fully interactive. all of the dots are colored by a high-level ontology class, and we use an algorithmic approach to decide how to arrange the classes, so there has to be enough data for a class to get a color assigned. we've limited the color choices so that they're distinguishable by the human eye. actually, this is slightly more than we would like;
ultimately i think we would like 12 and i think we've got 18 on here. but this turned out to be a nice breakdown across the kinds of data that we have. each dot represents a snp-trait association. if you mouse over the dots you get a description of the trait; that trait isn't the term mapped in the ontology, it's the original annotation that was supplied by the curator. in some cases these are mapped to a higher-level term inside the ontology, so we don't necessarily model all of the data at the trait level.
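a minimal sketch of a color assignment along the lines described: a high-level class only gets its own color if it has enough data, and the palette is capped. the talk does not spell out the actual algorithm, so the ranking by count, the threshold, the fallback grey and the function name are all assumptions for illustration.

```python
from collections import Counter


def assign_colors(class_per_dot, palette, min_count, fallback="#999999"):
    """Map each high-level class to a color.

    Classes are ranked by how many dots they cover; those with at least
    min_count dots take palette colors in rank order until the palette
    runs out. Everything else shares the fallback color.
    """
    counts = Counter(class_per_dot)
    ranked = [cls for cls, n in counts.most_common() if n >= min_count]
    chosen = dict(zip(ranked, palette))  # zip stops at the shorter input
    return {cls: chosen.get(cls, fallback) for cls in counts}


# Placeholder data: one high-level class label per dot on the diagram.
dots = (["cancer"] * 5
        + ["cardiovascular disease"] * 3
        + ["hair color"])

colors = assign_colors(dots, palette=["#e41a1c", "#377eb8"], min_count=2)
for cls, color in colors.items():
    print(cls, color)
```

capping the palette is what keeps the key readable: with a palette of 12 entries, at most 12 classes are ever distinguished and the remainder collapse into one neutral color.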
and that was a decision we took because we wanted to maintain the granularity of the annotation and allow the curators to record whatever they wanted, but still map it to something that exists in the ontology. this diagram is fully zoomable and it uses the same karyotype that's used for the gwas visualization that currently exists on the gwas catalog site, so it's familiar. the color assignment is different, but the key is much simpler. you can zoom in and out.
so i've just shown you an example here at the bottom of chromosome 6, where we have a very dense region of snps that are fanning out. you can zoom in and out and see how the key corresponds, and by picking out, for example, the light-blue color, you can see that those are all hematological measurements. hematological measurement is an example where we added lots of terms to describe those measurements, because they weren't available in any ontology previously. one of the key things is that we can now filter the diagram by the ontology,
and so here i'm showing a screenshot from a production site where we've filtered it for cancer. basically, we'll pull up any terms for cancer, and we use all of the synonyms inside the experimental factor ontology to do this, and that's hidden from the user. in that version we now also have an autocomplete box supplied by the national center for biomedical ontology, so you're able to type in a string and get some suggestions of what kind of terms you could search on.
and one of the nice features is that when you visualize something by disease, as in this example, you retain the context of all of the other snps; they're just greyed out, which means you can still mouse over them, so you can see the trait context of any given snp as well as the results of your query. the mouse-over is still working, so here i've moused over one of these pink snps, which are all cancers, and we do have some subsumption in the ontology. so the red dots are digestive system cancers, which are colored red
because they're also digestive system disorders. this is kind of a problem with color assignment that in some ways we would like to [inaudible] on and make richer, but we haven't yet addressed that problem. it's to do with the fact that there are multiple ways to classify disease. we limited our classification to that available in our ontology, which we think is a good fit for the data. but it's certainly not a clinician's classification of disease; it's more of a biomedical translational researcher's classification of disease.
and this assignment was also based on an understanding of the query logs provided to us by the nhgri, and it turns out that 99 percent of queries are about 10 common diseases or phenotypes associated with those 10 common diseases. so if you look at this screenshot of the trait-specific view, we've actually canned a bunch of queries so far, for cancers, for cardiovascular disease, for diabetes, for other metabolic disorders, so if you're interested in the current snp assignment across the karyotype for a disease scenario,
you can actually download those and use them in your own presentation. this is a shot from that site and the url is at the bottom of the slide. we've now put in dynamic links, so when you click on a snp you can see the rsid, and you can see the p-value assigned from the paper by the curator. you can see a link for the ontology term and the definition, which is clickable, and also a link to the paper, and the rsid links to ensembl. so you can see this in the genomic context, and you can go straight from the diagram to the genomic context so that you can look at, for example,
whether this is in a coding region, whether this is an enhancer region, something like that. there would be no reason why we could not layer certain other kinds of information onto the visualization; for example we could show all genic snps, and that would be quite easy to do as well. we haven't yet done that. we're about to take this implementation to ashg, where we will get some feedback from people
on additional features they would like. so, summarizing the key features that we've got: a dynamic, zoomable diagram with mouse-over showing traits, and a new simplified color scheme with high-level trait categories where there is a lot of data. mental health disorders, cancer and cardiovascular diseases would be examples of those. the diagram is interactive. right now, the interactivity is focused around the trait annotation
but we will be implementing other things in the future, and the dots basically link directly into both phenotype-derived catalogues and into resources like ensembl. we can add in other resources as well: we could add in dbsnp, we could add in ucsc, depending on the context. in terms of future work, we're going to enhance the link status of the diagram; right now these are running on a dev site, and they need to go to production sites. we would like to reformat the nhgri page a little bit
so that the diagram is really an entry point to the catalogue, and a search tool as well as a visualization. our mid-term goals are more complicated filtering scenarios, by p-value, by region, by study, and combinations of these with trait. we would like to show this semantic information alongside feature tracks from ensembl, and that requires us to load all of this data into ensembl, including the semantic information, and that work is in progress. and our longer-term goals are to look
at different resolution strategies for high-density regions. for example, the mhc has a very large number of variants mapping to it, and we would like to be able to expand that and make the viewing slightly smarter. we'd like to be able to use the ontology inside ensembl, and we would like to do some cross-species mapping, for example to map [inaudible]. there's also additional information which is extracted by the curators which is not shown on here. for example, ethnicity is not shown, cohort information is not shown,
and whether a snp is part of a meta-analysis or has been reported by multiple papers is not visualized on the diagram. those are all things which would add to the utility, and those are things that we could do. so we've restructured the gwas catalogue data using an owl-based approach. we've remapped the catalogue data and we've removed all the manual processing from the catalogue visualization. all of this uses semantic web technology based querying, so the queries are done live for the trait information
and actually it's using the owl back end in memory for this visualization, and it works really well, and a lot of the code was reused from another project. from my point of view, i was working very much on the mapping side and the reannotation of the data to try and integrate it with the ontology. because so much of the data was novel and didn't look like any existing ontologies, we underestimated how hard that would be. but in terms of coding, this worked very well, and actually the visualization turned out to be fairly straightforward to do.
we've been talking to some other communities. this is a screenshot of a browser from an organization called tgac, the genome analysis centre in the uk. they hold genomic information for many, many plant species, and they would like to use this visualization to layer on plant eqtls and also to look at traits favored by plant breeders, and layer those over the genome. so we've done a little mock-up here of how we think this could overlay the genome and how it could be used to visualize trait information from other species.
we know that we have a rich and well-described data set. we can use data from any new data source alongside the existing data as long as it obeys the same semantics. everything is transferable, so we can build new visualizations from different data sources as well, so we have a sort of species-neutral approach here. at the moment the ontology that's used to visualize everything is efo, but that could actually be swapped out for something like mouse phenotype and the code would still work.
we would just have to remap the data, and we have some tooling that will allow us to remap the data as well, so that doesn't have to be done manually a second time. so effectively, it's a fairly portable back end for this kind of visualization. it's not tied to the karyotype. we use a karyotype; the plant community uses a very simple line diagram for the chromosomes, and it seems that it will work just as well for that data set. it would be interesting to add comparative views,
so it would be nice to look at mapping between different genomes, with more circos-like views to show synteny, and we would also like to bridge the gap to the transcriptome as well, to show for example transcribed regions. that would allow us to start to look at eqtls and the information that we have inside arrayexpress, and to link that with some of the variant information that may be affecting transcription. so really, it's quite an exciting tool for us that we would be able to use in lots of different scenarios within ebi,
and with collaborating groups outside. so that's all i have on this part of the talk. i'm roughly on schedule. i should thank the people who worked on this project at nhgri, in teri manolio's group: peggy hall, lucia hindorff, heather junkins, kent klemm, darryl leja, and the people who have produced all of the data and the previous visualizations. and in my team, some of these people contributed a lot, and the ones in bold are the ones who have really done the bulk
of the work: jackie macarthur, joannella morales and dani welter. and this is funded by a grant to me and [inaudible] from nhgri. okay, so i'm going to switch gears now, and talk about mouse data. and at the end i'll talk about integrating human and mouse data [inaudible]. so, the second part of my talk is about mouse phenotype data. ebi is one member of a consortium consisting of ebi, the medical research council at harwell in the uk and the wellcome trust sanger institute.
we are funded as the mpi2 consortium to deliver the komp2 mouse phenotype data to the biomedical community. komp2 is knockout mouse program 2. it's a common fund project. every protein-coding gene in the mouse genome is going to be knocked out and a standard set of phenotyping tests applied to the mice, generating enormous amounts of data at different kinds of granularity. i'll walk through some of the data that we expect and some of the data we already have, and talk about how we're going
to start to integrate this. really, one of my motivations in being involved in this project is not the laboratory research on the mouse, because we're not a mouse house at ebi. but we do have an enormous amount of data around human and mouse; it probably represents something like 50 percent of our total data holdings across many resources. and we want to be able to project the mouse data onto the human data and make it available for translational research.
and so that's what the second part of the talk is concerned with. komp2 has a common phenotyping scheme which uses a set of previously available es cells. there's cohort breeding and a phenotyping program which is undertaken by mouse phenotyping centers. we support them in extracting data from their lims systems, and then all of that information effectively is shown in a database. this is a slide from the common fund website which describes the relationship between the existing components,
and what i've shown here in red is how we then handle that data as we acquire it from the phenotyping centers for display. so there's a large data coordinating center [inaudible] mrc harwell, which does all of the data acquisition from the lims systems, applies qc, and also coordinates all of the standard operating procedures in a database called impress. that turns out to be incredibly important, and it's actually an enormous amount of work to get an agreed sop between something like six or seven phenotyping centers.
that's essential to ensure that the data will be comparable, and it means going into things like housing and husbandry, the design of the experiments, how often phenotyping is performed, who performs it, and whether it's performed blinded for alleles, for example. all of that information then goes into the first database at mrc harwell, where it's run through qc, and once the data are qc'd they're exported to the core data archive, which is held at the ebi. the most critical people, i think, in the early phase of the project are the data wranglers, so we have six data wranglers
between the wellcome trust sanger institute and mrc harwell, and their job is to [inaudible] getting the data from the phenotyping centers, to qc it and to provide feedback to the phenotyping centers to ensure the data is consistent. that includes things like understanding what the baseline of the data looks like, so you can see if you have instrumentation drift, or if a new instrument has been installed, or if you have a new operator, and to work out, when something looks deviant,
whether it's an instrumental effect or an operator effect. that challenge should not be underestimated, and we have invested a lot of resource in it. once the data has gone through qc, it's pushed up to the core data archive, and then we have a view on that database for the community. the data is all hosted from a single point of access, and that is mousephenotype.org. there are two sites running on this domain: there's mousephenotype.org and there's a dev site as well.
the dev site is basically where we put new features online to get feedback on them from the community before we deploy them to the mousephenotype.org site. and the way we do that is they go up there for comments, but also we actually have user experience people who go and sit with a bunch of researchers and watch them work through the features to see how they perform. so right now, with the mouse production, we don't expect big data flows for this project yet. some of the centers already have mice in production but some have just started, and so we expect big data flow to be around year three.
right now we're in year one but we actually have built most of the infrastructure to get the data through the system already, all the way through to the interface. what we are now developing is new features to go on the mousephenotype.org site, and so this is an example of all of the gene symbols at the moment, the mgi identifier from the jackson lab, and the status, and this is already here so that if you have an interest in a particular gene, you're already able to register to get an alert when that gene goes into production and also when we get data on that gene.
so that you can get to that data straight away. and none of the data is closed: all of the data through the qc process and all of the data through to the release is open and accessible to anyone. this is an example of an sop. this one is the eye morphology. these are held in a database called impress which stores all of the information about the protocol, for example the instrumentation, but it also stores ontology terms
that describe how the phenotype has been described, and so in this case we've got an example which is abnormal eye, the presence of it, whether it's on the left, the right or both sides, and an mp term, so this is a term from the mouse phenotype ontology developed at the jackson lab, has been supplied. and all of the protocols have to have an mp term assigned, and this is critical work because it's not possible for the people in the centers to assign a phenotype term as they generate the data. so by typing the sops strongly with ontology terms,
and having those applied consistently, it gives us a way to consistently describe the data and to ensure that we don't have to try and mine that information later out of free text. so the way that the protocol is applied, and also how it is annotated and will later be used for display purposes and query purposes, is controlled at the start of the process by registering a protocol in impress. that is the only way we can make this scale, and this has been a very early phase [inaudible].
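as an illustration of the idea of tying each sop parameter to an ontology term at registration time, here is a minimal python sketch. it is not the impress schema or api: the class names, the `register` and `annotate` methods, and the `MP:EXAMPLE...` identifiers are all hypothetical and only show how a registry could refuse a protocol whose parameters lack a term, so that annotation never depends on free text later.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Parameter:
    """One measured or observed parameter of an sop (hypothetical structure)."""
    name: str
    mp_terms: tuple  # ontology term ids agreed for this parameter


@dataclass
class ProtocolRegistry:
    """Toy registry: a protocol is accepted only if every parameter has a term."""
    protocols: dict = field(default_factory=dict)

    def register(self, protocol_name, parameters):
        # Reject the whole protocol if any parameter is missing its ontology term.
        missing = [p.name for p in parameters if not p.mp_terms]
        if missing:
            raise ValueError(f"parameters without an mp term: {missing}")
        self.protocols[protocol_name] = list(parameters)

    def annotate(self, protocol_name, parameter_name):
        # Look up the agreed terms instead of mining free text afterwards.
        for p in self.protocols[protocol_name]:
            if p.name == parameter_name:
                return p.mp_terms
        raise KeyError(parameter_name)


registry = ProtocolRegistry()
registry.register(
    "eye morphology",
    [Parameter("abnormal eye (left, right or both)", ("MP:EXAMPLE0001",))],  # placeholder id
)
```

because the check happens once, at registration, every data point submitted against a registered parameter inherits a consistent term without any per-center decision.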
right now i think we're at around 20 agreed protocols and they have mp terms agreed. in some cases, we are asking for extra mp terms from the jackson lab because we would like to describe this in richer detail, and that's not typically performed by the data wranglers at harwell and at the sanger. we are a very image-heavy project and this has always been a slight concern to me. we have a lot of lacz images coming through, and we now also have other kinds of images, with some of the groups doing mri and micro ct.
that's where we have an image transfer format in progress which allows us to bring in image information and image annotation, because different groups annotate images in slightly different ways, but in a technology-neutral way, so it's extensible for future technology. we needed some image visualization software. the sanger already have this built into their lims system, so we have modularized that component and it will be our image viewer for the project. so, we're now already getting expressions like this, so visible dysmorphologies and other phenotypes,
in the short term from x-ray and histopathology. so this is an example here, which shows the kind of information that we have, and i'll talk about this in more detail on the next slide. so, here we're really presenting data in different contexts. so, this is actually an example from a paper that was a nice example of how we have different phenotypes expressed at different levels of granularity. so, this is hey2, a bhlh transcription factor, and there's a little screen shot there of this gene in ensembl.
there's a gross phenotype in the homs of these animals, which shows as runting, and very few of the homs survived the [inaudible]. the hearts are so small. there's a cardiac defect, so we have a full body phenotype and we also have an organ phenotype, and so we are having to deal with all of these different levels of phenotypes and try and present them somehow for users in a sort of system that also handles gene level information and that's fairly simple to query, because we really want this to be accessible and we don't want to have
to build a very complicated advanced interface that no one will use. so, we split this into three concepts, sort of pages. the first is a gene detail page which has the gene, its variants and its associated phenotypes, and the tissues which the gene is expressed in if known, if we get lacz images for example. we have phenotype feature pages, where we have one phenotype on that page plus its associated genes plus the affected tissues, and the tissue detail page where we have an anatomical entity,
the genes expressed in that, and the phenotypes that impact that tissue. and so you can come in at any of these points depending on your kind of query interest, and it also means that if you don't necessarily know a lot about the anatomy of the mouse then you can actually come in and express queries about, for example, tissues, for example heart or liver, if you're an expert in those, and you don't have to understand the context of the mouse to do a query. so this is an example of what we have now. so, we have taken mendelian's process and embedded it in our webpages
because we want another fine detail about the phenotype information. we have associated images. so, here we've got histopathology images, and this is how we're planning to lay these out in a kind of gallery view. one of our challenges is to kind of reduce the number of images to those which are representative for a given strain of mouse, while also being able to display a lot of images in this view, and to come up with gallery views for people who are experts to try and scan through them and look for aberrant images.
so, all of this is work in progress. here's a screen shot taken off a developer's machine inside ebi. we have a major meeting coming up next month in toronto and some of this layout will be on our dev website there for user experience. then at the bottom here we've got the actual allele map showing the context of the allele and also the strain of origin as well. one important feature to note is that in all cases you can actually order the mouse. and we would like people to be able,
to get the mice while they're freely available, so while they're still on the shelf as opposed to when they've been frozen down, and so there are active links for people who would like to get them, because it's easy to do that as the mice are being produced. and so this is a mock-up of our phenotype page. in this case, it's cataracts. so, what we do is we put one iconic image per mouse phenotype term. and in this image, there are certain cases, for example,
if it's something like a glucose tolerance test, where there would be a graph rather than an actual image of an anatomy part. there's a link to the sops that i showed you before that generated it. strains with that phenotype are included here, and this is something that we would like to add in, an explore section where we can look at associated phenotypes. so, that would be a term which would be a sibling term or, for example, a mapped human phenotype ontology term where you would be able to get human disease data, and i'll talk
in a moment about how we can get disease associations. we've also been experimenting with image galleries, and so this is really a work in progress. so we are thinking about how, as that image data start to appear, we will lay them out in context with the genomic information. and i think it's quite a tangent project, but this represents a really useful start on this and i think our users will really provide us lots of feedback on how we can improve this.
this is an example of how we can lay out the genetic variants in here, so we've got gene variants with the phenotype shown. so, if you're interested, you can see all of the information about genotype, and this is how we expect these images to be laid out on the page. and obviously these pages are quite information-dense, and so a lot of this would be hidden until the user starts to click on it. so i thought in conclusion, in the last few minutes, i would talk about the sort of cross phenotype integration challenges that we have, both from our experience with the gwas data
and our experience with komp2 data. so in the mouse, we have measured traits. we have fully open raw and processed data. in human, we have traits based on clinical inference and that has implicit links to measurements, but nowhere is that made explicit quite often not even in the paper. in the mouse, we have sops for all animals across all sites. in human we have study specific parameters which have to be extracted from the literature.
we have anatomical differences: mice have paws, and prostate is a very different kind of entity, with a different level of complexity in mouse versus human. in mouse we have standard statistical phenodeviant calling. in human we have different statistical pipelines per study. we have different terminologies. in mouse, we use mouse anatomy, mouse phenotype, mouse pathology. human terminologies include mesh, icd, snomed-ct, hpo. we use the [inaudible] a lot here to account for data. the data for mouse are open, the human data are typically restricted
at the raw level, and we have to go through a data access committee, and the data often have to be extracted from the literature. in the mouse we have inbred strains with detailed allele level descriptions. in human we have variable populations, and actually describing ethnicity is really a major challenge, both for the gwas catalogue and also for other data that we hold at ebi; it's very, very difficult to have a standard coding scheme for ethnicity. and in the mouse we have very high granularity data, and in the human we have often low granularity data
because of all of the preceding points. so we're kind of trying to reconcile things that are fairly different in granularity and actually are relevant in different ways to different communities. and so this is all hugely challenging, but actually we are getting some traction on that, partly because of the way the mouse community has worked and also because of two languages i'll talk about next. and so there are some fairly pragmatic approaches to integrating mouse and human phenotypes.
so one of the resources we can use is the asserted curated mappings that come from the jackson lab. and i put the allele in here, so for example if you choose parkinson's disease, there are six mice which are asserted by the jackson lab curators to be models of that disease. and this can be incredibly useful because you can use it as a control set and you can also use it as an exploration into the mouse data. but we don't have curated data for all of the komp2 data because the data haven't yet arrived.
it will be a while before they're curated from the literature. and so we're also able to use terminology-based approaches which help describe the mouse. so, by using the terms annotated to a phenotype, in our case annotated to the standard operating procedures that we use in komp2, we're able to decompose these. so some of them are compound. so for example abnormal eye, or enlarged eye, or small eye, have two components within them:
the modifier about the eye, plus the anatomy part, the eye. and by doing that, and using the structure of the mp and structures that correspond in human ontologies, we can use a predictive approach to suggest candidate mice for human disease based on shared phenotype, and we can also do things like phenoclustering, that is, looking at mice with a common set of phenotypes to see if we can observe a biological relationship between the diseases, for example elements in a shared pathway. so, this is data from our collaborators at the wellcome trust sanger institute
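to make the decomposition idea concrete, here is a small python sketch that splits a compound phenotype label into a modifier and an anatomy part and then compares mouse and human labels on those components. it is only an assumption-laden illustration: real decompositions come from the logical definitions in the ontologies, not from splitting strings, and the `MODIFIERS` list, the `ANATOMY_MAP` (including the paw-to-hand pairing) and the function names are hypothetical.

```python
# Illustrative modifier vocabulary (hypothetical, not taken from any ontology).
MODIFIERS = ("abnormal", "enlarged", "small")

# Hypothetical mouse-to-human anatomy correspondences, for illustration only.
ANATOMY_MAP = {"eye": "eye", "paw": "hand"}


def decompose(label):
    """Split 'enlarged eye' into ('enlarged', 'eye'); modifier is None if not recognised."""
    words = label.lower().split()
    if len(words) >= 2 and words[0] in MODIFIERS:
        return words[0], " ".join(words[1:])
    return None, label.lower()


def shared_components(mouse_labels, human_labels):
    """Return the (modifier, human anatomy) pairs found in both annotation sets."""
    mouse = set()
    for label in mouse_labels:
        modifier, anatomy = decompose(label)
        # Translate mouse anatomy into the human vocabulary where a mapping exists.
        mouse.add((modifier, ANATOMY_MAP.get(anatomy, anatomy)))
    human = {decompose(label) for label in human_labels}
    return mouse & human


candidates = shared_components(["small eye", "abnormal paw"], ["small eye", "abnormal hand"])
```

a mouse line whose annotations share more components with a disease's phenotypes would be a stronger candidate, and the same component sets could serve as features for phenoclustering.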
who are using an algorithm called owlsim which is described in this paper. and this is an example of a part of the human phenotype ontology developed by peter robinson in berlin, and part of the mouse phenotype ontology from the jackson lab. and in this case we've got abnormalities of the jaw or the mandible. and we have here some annotations that correspond to human and mouse. and we have a clear lexical equivalent at the level of abnormality of the mandible. and here we've got a different subsumption hierarchy underneath it.
and if we have annotations to low-level terms, we have to use the structure of the ontology to actually infer the relationship between these. but actually we can use the information content of the ontology to tell us which are good matches and also to rank the matches for us, and the algorithm that is used, owlsim, basically does that given the information content. and it will allow you to prioritize matches between hpo annotated terms and the mouse phenotype annotated terms. there are different tools of data.
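the ranking by information content can be sketched in python as follows. this is a generic resnik-style calculation on a toy hierarchy, not owlsim's actual implementation: the term names, the `PARENTS` graph and the annotation counts in `DIRECT_COUNTS` are all made up for illustration. a term's information content is the negative log of the fraction of annotations at or below it, and two terms are scored by their most informative common ancestor.

```python
import math

# Toy subsumption hierarchy (child -> parents); names are illustrative only.
PARENTS = {
    "abnormal head": [],
    "abnormal jaw": ["abnormal head"],
    "abnormal mandible": ["abnormal jaw"],
    "small mandible": ["abnormal mandible"],
    "abnormal maxilla": ["abnormal jaw"],
}

# Made-up numbers of direct annotations per term.
DIRECT_COUNTS = {
    "abnormal head": 10,
    "abnormal jaw": 5,
    "abnormal mandible": 3,
    "small mandible": 1,
    "abnormal maxilla": 1,
}


def ancestors(term):
    """Return the term itself plus everything that subsumes it."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def information_content():
    """IC(t) = -log(annotations at or below t / all annotations)."""
    total = sum(DIRECT_COUNTS.values())
    cumulative = dict.fromkeys(PARENTS, 0)
    for term, count in DIRECT_COUNTS.items():
        for ancestor in ancestors(term):
            cumulative[ancestor] += count
    return {term: -math.log(count / total) for term, count in cumulative.items()}


IC = information_content()


def mica_score(term_a, term_b):
    """Score a pair by the information content of its most informative common ancestor."""
    common = ancestors(term_a) & ancestors(term_b)
    return max((IC[t] for t in common), default=0.0)


# Rank candidate matches for one query term, best match first.
ranked = sorted(
    ["abnormal mandible", "abnormal maxilla", "abnormal head"],
    key=lambda candidate: mica_score("small mandible", candidate),
    reverse=True,
)
```

rare, specific shared ancestors score higher than broad ones, which is why a match at the level of the mandible outranks one that only agrees at the level of the jaw or the head.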
this is a screen shot from this paper, with data provided by our collaborators, to show that owlsim at the moment performs well, dr. oelrich was out there. and i imagine as more data become available, this tool will perform better over time. so, this algorithm is actually implemented in a tool called mouse finder. so, it will allow you to query by a known disease or omim gene name or hp term.
well, i've shown an example here for palmoplantar keratoderma. so that's basically scaly skin, particularly on the palms of the hands and feet, or the paws. and the reason i chose this was that it's an interesting example of how you can layer mouse data on human data and actually get to see some predictive candidates. so in this case, i looked up palmoplantar keratoderma, and this is the ranked list of mouse models. you can see that there's a link to omim genes and mouse model curation.
but actually in this case there is no curated mouse model for palmoplantar keratoderma. but some work was done at the sanger using the mouse genetics pipeline, and so this is data that went through the knockout process prior to komp2 but will be included in the presentation of the komp2 data at the url that i showed. it actually has some mp term annotations of hyperkeratosis and parakeratosis, and these are terms which are shared with the omim annotations from the human phenotype ontology.
and so if you do a comparison of the two ontologies and layer over the mouse data, where we actually have mouse phenotype data, so as well as just comparing the ontologies we're including data in this, you can actually get the disease prediction out for krt76. and i actually had to look at some phenotype information in the slides in this. this is a screen shot of the human keratoderma. and actually there are some very closely related genes in mouse which have been knocked out that show a very similar physical phenotype, so very scaly claws.
and so by doing this, and presenting images, we can actually present predictive information in the context of the mouse data. and particularly for the human data, it turns out that our users, particularly users of ensembl at ebi, are extremely comfortable with predicted data. they are very happy to understand that the information they're looking at is a prediction. and if you label it clearly enough, that's actually a good way to present the different data to a human user who's interested in, say, translational research, where you can label it
as predicted, and those present avenues for discovery for the user. and so this is really something that we see as the way forward for integrating these data, and these approaches are something we'll be exploring in the context of komp2. so the future for mousephenotype.org, which is the website where all of this data is available through the international mouse phenotyping consortium, is that we'll apply standard phenodeviant calling pipelines for all our legacy and emerging data. right now it's being applied to our legacy data
because the emerging data is probably three or four months away, and the first data has started to come to the system. a mixed model statistical procedure has been developed, and there's a reference to the paper that's just been accepted in plos one for that method. we expect to expose our data via ensembl. ensembl gets something like 20,000 unique users per month and we know that 90 percent of the queries for human and mouse are disease based. so we expect that this disease phenotype information is the kind of way we would
like to present the data to our users. we expect predictive tools like the one i've shown, but others are available, to identify common phenotypes between human and mouse, and probably also including [inaudible] as the centers in this project start to produce more data. we expect to do phenoclustering of the mouse data. and our aim really is to provide improved access to mice and to the mouse data for translational researchers. and really the value of this project is that the data is there
but actually secondary phenotyping, the fact that someone will take a mouse that looks like it's a useful model for their system and then provide data back to us as secondary phenotyping after they've, say, done an infection challenge, will be a really critical way that the resource that we're providing grows. we have not reached peak data flow. we're pretty much at peak coding, so we have a critical mass of coders generating the infrastructure, so that's to acquire the data, to qc it and to push it through to the web.
as we move through the project, we expect that we'll be supporting the users by acquiring data from the lims systems but also querying data and really changing the ways that we present the data. so the personnel will shift later in the project to improve the presentation. and i guess i'm pretty much at my conclusion. so we have integrated, remapped, teased out measurements and generated a reusable technology for visualizing gwas data. we have a comprehensive infrastructure for mouse data and a single point of access for the community.
we are addressing phenotype integration challenges. we're using existing terminologies. it's rare that we develop new terminologies. occasionally, those have to be developed. we're mapping between terminologies. we're using inference to predict candidate genes based on ontological structure. we're adding in mouse and human phenotype data to refine and validate these predictions. we're presenting mouse data in the context of human disease and figuring out how
to lay out images for mouse biologists and for human disease specialists. and really it's all about leveraging existing resources. we couldn't do this work without the mouse phenotype ontology, the human phenotype ontology and jackson lab curated models and omim. we expect to improve results as data appear and we're already seeing interesting stories from our phenotyping centers and the kind of things that they're discovering. i should acknowledge our funding, so the gwas work is funded by an nhgri grant.
the mouse work is funded by an nih common fund project
and my own team funding comes from around 10 different projects and embl core funds. and i think with that, i'll finish and can take any questions.