APAC

Principal Scientific Data Architect

Posted a day ago

India

⭐ 10+ years experience

AI Summary

Lead the design and implementation of a software-defined data framework using Schema as Code and Data as Code within the GCP ecosystem. Bridge the gap between cloud engineering and life sciences to power in-silico molecular discovery and Agentic AI frameworks.

About Xebia

Xebia is a trusted advisor in the modern era of digital transformation, serving hundreds of leading brands worldwide with end-to-end IT solutions. The company has experts specializing in technology consulting, software engineering, AI, digital products and platforms, data, cloud, intelligent automation, agile transformation, and industry digitization. In addition to providing high-quality digital consulting and state-of-the-art software development, Xebia has a host of standardized solutions that substantially reduce the time-to-market for businesses.

Xebia also offers a diverse portfolio of training courses to help support forward-thinking organizations as they look to upskill and educate their workforce to capitalize on the latest digital capabilities. The company has a strong presence across 16 countries with development centres across the US, Latin America, Western Europe, Poland, the Nordics, the Middle East, and Asia Pacific.

Job Description: Principal Scientific Data Architect (Google Cloud Platform Ecosystem)

Role Overview

We are seeking a highly specialized Principal Scientific Data Architect to bridge the gap between advanced Google Cloud engineering and life sciences discovery. This role will redefine how scientific data is structured, scaled, and consumed across our R&D, Onyx, and CMC (Chemistry, Manufacturing, and Controls) divisions.

Operating natively within the Google Cloud Platform (GCP) and Databricks on GCP ecosystem, you will lead the transition toward a fully automated, software-defined data framework by implementing Schema as Code, Data as Code, and metadata-driven Configuration Data Engineering. The ideal candidate combines elite cloud data architecture expertise with deep scientific literacy, enabling the design of data systems that directly power in-silico molecular discovery and autonomous Agentic AI frameworks.

Key Responsibilities

GCP-Native Data Architecture & Paradigm Shifts

Schema as Code: Design and implement version-controlled, programmatically managed data schemas natively integrated with Google BigQuery. Ensure schemas evolve seamlessly using GCP DevOps tools (Cloud Build, Artifact Registry) and Terraform.
Data as Code: Treat data assets with software engineering rigor. Implement data versioning, programmability, and automated quality testing using BigQuery features (like Table Snapshots and Time Travel), dbt, and Delta Lake on GCP.
Configuration Data Engineering: Architect highly optimized, metadata-driven, configuration-led data pipelines using Google Cloud Composer (Airflow) or Dataflow to abstract infrastructure complexity.

Scientific Domain Integration

Translate complex biological and chemical concepts (e.g., molecular modalities, chemical structures, solubility traits) into highly scalable logical and physical data models within BigQuery and Databricks.
Collaborate closely with computational chemists, biologists, and AI engineers to ensure the data architecture natively supports predictive in-silico modeling.
Design robust data layouts that allow autonomous AI agents to easily "dip into" molecular data, extract properties, and explain molecular behavior.

Platform & Ecosystem Strategy

Optimize the interoperability between Databricks on GCP (Lakehouse architecture) and enterprise-wide Google BigQuery storage and analytics.
Inform the integration of semantic web technologies and knowledge graphs (e.g., Stardog) into the overarching Google Cloud data fabric.
Ensure data availability and high-performance querying for downstream multi-agent AI ecosystems (Agentic Hubs built on Google Cloud's AI suite or custom frameworks).

Required Skills & Qualifications

Scientific Domain Knowledge

Mandatory: Strong background or proven experience working inside life sciences, pharmaceuticals, biotech, or scientific research organizations.
Ability to converse fluently with scientists regarding therapeutic modalities, molecular properties, and R&D pipelines without needing to be a wet-lab scientist.

GCP & Technical Architecture Expertise

GCP Data Stack: Mastery of Google BigQuery (including BigLake, Analytics Hub, and nested JSON schemas) and Databricks on GCP.
Software-Defined Data: Proven track record of implementing Schema as Code and Data as Code paradigms using tools like Terraform, dbt, and Git-based CI/CD workflows.
Pipeline Automation: Deep experience with configuration-driven pipeline orchestrators, specifically Google Cloud Composer / Apache Airflow.
Modeling & Semantics: Strong understanding of relational, dimensional, and graph-based data modeling. Familiarity with knowledge graphs (e.g., Stardog) or biomedical ontologies is a major plus.

Soft Skills & Leadership

Abstract Thinking: Ability to conceptualize and suggest complex in-silico data solutions at a high strategic level without getting bogged down by immediate technology limitations.
Communication: Exceptional ability to articulate the business and scientific value of pure data architecture to non-technical executive stakeholders.

Preferred Qualifications

Google Cloud Professional Data Engineer or Professional Cloud Architect certification.
Degree in Computer Science, Data Engineering, Bioinformatics, Computational Chemistry, or a related quantitative field.
Experience setting up GCP data foundations specifically engineered to feed Large Language Models (e.g., Vertex AI / Gemini) and autonomous AI agents.

Location: Not a constraint

Some useful links:

Xebia | Creating Digital Leaders.

https://www.linkedin.com/company/xebia/mycompany/

http://twitter.com/xebiaindia

http://www.youtube.com/XebiaIndia