
Documentation Index

Fetch the complete documentation index at: https://docs.hawkings.education/llms.txt

Use this file to discover all available pages before exploring further.

You upload student work. Hawkings grades it against your rubric. You get a numeric score, per-criterion breakdown, markdown feedback, inline annotations, and next-step insights — with terminal guarantees and full reproducibility.
const evaluation = await hk.evaluations.create({
  submission: "sub_01J3K8R2QF...",
});

const done = await hk.evaluations.waitFor(evaluation.id);
console.log(done.result.score, done.result.feedback_markdown);

Four resources

Resource | ID prefix | Mutable | Owns
Assignment | asg_ | yes | What is being graded against. References a Rubric and a model.
Rubric | rub_ | versioned | Criteria, weights, scale, guidance. Validatable on its own.
Submission | sub_ | no | The student’s work. Immutable once accepted.
Evaluation | eval_ | state machine | An async job: one submission, one rubric, one model.
A Submission may have many Evaluations — re-grades, model upgrades, calibration runs, human-override cycles. The Evaluation is what you wait on. The Submission is what you keep.
Why split Submission from Evaluation? Re-grading is the most common real operation: rubrics evolve, models improve, teachers ask for second opinions. With one combined record the second grade overwrites the first and you lose history. With separate records you keep both, diff them, and roll back.
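A minimal sketch of that re-grade flow, using only the calls shown in the quickstart (the submission is assumed to already exist):
// First grade: kept forever, never overwritten.
const first = await hk.evaluations.create({ submission: submission.id });
const firstResult = await hk.evaluations.waitFor(first.id);

// Later: rubric updated or model upgraded. Enqueue a second Evaluation on the same Submission.
const second = await hk.evaluations.create({ submission: submission.id });
const secondResult = await hk.evaluations.waitFor(second.id);

// Both records survive; diff them before deciding which grade to publish.
console.log(firstResult.result.score, "→", secondResult.result.score);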

End-to-end in five calls

The complete grading flow, top to bottom, nothing hidden:
// 1. Define the rubric. Lint-checked before saving.
const rubric = await hk.rubrics.create({
  name: "Essay rubric — research methods",
  criteria: [
    { id: "thesis",     weight: 0.30, max: 30, guidance: "Clear, arguable, specific." },
    { id: "evidence",   weight: 0.40, max: 40, guidance: "At least 3 primary sources; cited correctly." },
    { id: "structure",  weight: 0.20, max: 20, guidance: "Logical paragraphing; transitions." },
    { id: "mechanics",  weight: 0.10, max: 10, guidance: "Grammar, spelling, formatting." },
  ],
  scale: { min: 0, max: 100, passing: 60 },
});

// 2. Create the assignment that uses it.
const assignment = await hk.assignments.create({
  name: "RP1 — Research proposal",
  rubric: rubric.id,
  model: "claude-sonnet-4-6",
  human_review: "required",
  ocr: { enabled: true, language: "es" },
  external_id: "moodle-assignment-29",
});

// 3. Upload the student's work.
const submission = await hk.submissions.create({
  assignment: assignment.id,
  student: { external_id: "moodle-user-15", email: "jm@example.edu", name: "José María" },
  files: [{ name: "RP1.docx", body: fileStream }],
  external_id: "moodle-submission-12",
});

// 4. Enqueue evaluation. Returns immediately with status "queued".
const evaluation = await hk.evaluations.create({ submission: submission.id });

// 5. Wait for the terminal state. Webhooks in production; helper in scripts.
const done = await hk.evaluations.waitFor(evaluation.id);
console.log(done.result.score, done.result.feedback_markdown);
Five calls. No file is uploaded twice. No rubric travels with a submission. The job lifecycle is explicit. The integrator never has to write polling logic in production code — webhooks are first-class.

The Evaluation lifecycle

Every Evaluation lives on this state machine. All three terminal states are guaranteed: the platform itself transitions stuck jobs to failed within 30 minutes, so no evaluation stays pending forever.
                          ┌──────────┐
                POST ──→  │  queued  │
                          └────┬─────┘
                               │ worker picks up

                          ┌──────────┐
                          │ running  │
                          └────┬─────┘

                ┌──────────────┼──────────────┐
                ▼              ▼              ▼
           succeeded        failed       canceled
        (result set)  (failure_reason)  (canceled_at)
Each terminal state populates specific fields; the others stay null:
State | Always populates
succeeded | result.score, result.feedback_markdown, result.breakdown
failed | failure_reason.type, failure_reason.message, failure_reason.request_id
canceled | canceled_at, canceled_by
failure_reason.type is a closed enum. Switch on it:
switch (evaluation.failure_reason?.type) {
  case "rubric_invalid":         /* rubric malformed or missing fields */
  case "submission_unreadable":  /* corrupt, encrypted, no extractable text */
  case "submission_too_short":   /* below minimum length threshold */
  case "submission_too_long":    /* exceeds configured ceiling */
  case "submission_off_topic":   /* response doesn't address the prompt */
  case "submission_only_images": /* image-only and OCR disabled */
  case "model_error":            /* LLM non-recoverable after internal retries */
  case "timeout":                /* didn't complete within 30 minutes */
  case "canceled_by_user":       /* you called evaluations.cancel() */
}
Each value has a public docs page with a reproducer.

Assignment

The long-lived configuration of what is being graded.
{
  "id": "asg_01J3K8Q7N5W4F2H6X9T2Y0M8E7",
  "object": "assignment",
  "name": "RP1 — Research proposal",
  "rubric": "rub_01J3K8PT9R3M2N1P6Q5K8S2W1",
  "model": "claude-sonnet-4-6",
  "human_review": "required",
  "ocr": { "enabled": true, "language": "es" },
  "language": "es",
  "max_score": 100,
  "context": {
    "course": "crs_01J3K8Q0...",
    "materials": ["mat_01J3K8...", "mat_01J3K8..."]
  },
  "external_id": "moodle-assignment-29",
  "metadata": {},
  "created_at": "2026-05-11T09:14:01Z"
}

Operations

POST /v1/assignments — create. GET /v1/assignments/{id} — retrieve. PATCH /v1/assignments/{id} — update (rubric, model, human_review, ocr). GET /v1/assignments?external_id=… — list, filterable.
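For example, pointing an existing assignment at a newer model or toggling OCR; hk.assignments.update is assumed here as the SDK wrapper for the PATCH endpoint:
// Assumed SDK call for PATCH /v1/assignments/{id}; only rubric, model, human_review, and ocr are updatable.
await hk.assignments.update(assignment.id, {
  model: "claude-sonnet-4-6",
  ocr: { enabled: true, language: "es" },
});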

Course context

Assignments may reference course Materials (textbook chapters, lecture slides, the syllabus). The grader uses these as authoritative context. A claim that contradicts the course materials is flagged in the breakdown — even if it’d score well in isolation.
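A sketch of attaching that context at creation time, reusing the context shape from the resource JSON below (whether assignments.create accepts it directly is an assumption; the IDs are the truncated placeholders from that example):
const assignment = await hk.assignments.create({
  name: "RP1 — Research proposal",
  rubric: rubric.id,
  model: "claude-sonnet-4-6",
  context: {
    course: "crs_01J3K8Q0...",
    materials: ["mat_01J3K8...", "mat_01J3K8..."],   // textbook chapters, slides, syllabus
  },
});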

Rubric

A first-class, validatable, versioned resource.
{
  "id": "rub_01J3K8PT9R3M2N1P6Q5K8S2W1",
  "object": "rubric",
  "version": 3,
  "name": "Essay rubric — research methods",
  "criteria": [
    { "id": "thesis",    "weight": 0.30, "max": 30, "guidance": "Clear, arguable, specific." },
    { "id": "evidence",  "weight": 0.40, "max": 40, "guidance": "At least 3 primary sources; cited correctly." },
    { "id": "structure", "weight": 0.20, "max": 20, "guidance": "Logical paragraphing; transitions." },
    { "id": "mechanics", "weight": 0.10, "max": 10, "guidance": "Grammar, spelling, formatting." }
  ],
  "scale": { "min": 0, "max": 100, "passing": 60 },
  "calibration": {
    "examples_count": 7,
    "agreement_score": 0.91
  },
  "warnings": [],
  "created_at": "2026-05-11T09:13:55Z"
}

Operations

POST /v1/rubrics — create. POST /v1/rubrics/validate — dry-run lint, returns warnings without saving. Use in CI. POST /v1/rubrics/{id}/versions — replace contents; bumps version. POST /v1/rubrics/{id}/calibrate — attach golden examples.
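A sketch of that CI lint step, assuming an hk.rubrics.validate helper that wraps POST /v1/rubrics/validate and returns the warnings array shown below:
const lint = await hk.rubrics.validate({
  name: "Essay rubric — research methods",
  criteria,                                        // same shape as in rubrics.create
  scale: { min: 0, max: 100, passing: 60 },
});

const blocking = lint.warnings.filter((w) => w.severity === "blocking");
if (blocking.length > 0) {
  throw new Error(`Rubric lint failed: ${blocking.map((w) => w.code).join(", ")}`);
}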

Lint warnings

The validation pass catches what we see in the wild:
{
  "warnings": [
    {
      "code": "looks_like_guide_not_rubric",
      "message": "This document explains how to write rubrics rather than defining one. Did you mean to upload the rubric itself?",
      "severity": "blocking"
    },
    {
      "code": "criteria_weights_dont_sum_to_one",
      "message": "Weights sum to 1.2; they must sum to 1.0.",
      "severity": "blocking"
    },
    {
      "code": "criterion_guidance_missing",
      "message": "Criterion 'sources' has no guidance — AI scoring will be inconsistent.",
      "severity": "warning"
    },
    {
      "code": "scale_inconsistent_with_criteria_max",
      "message": "Criterion max values sum to 100 but scale max is 50.",
      "severity": "blocking"
    }
  ]
}
Warnings with severity blocking prevent creation. Those with severity warning persist and surface in the dashboard, the SDK response, and assignments.preview().

Calibration

Upload 5–20 hand-graded examples. The platform uses them as few-shot anchors so the AI scores like your institution scores, not like a generic model would.
await hk.rubrics.calibrate(rubric.id, {
  examples: [
    { text: "...", scores: { thesis: 28, evidence: 32, structure: 18, mechanics: 9 }, notes: "Strong proposal, weak citations." },
    { text: "...", scores: { thesis: 15, evidence: 12, structure: 14, mechanics: 7 }, notes: "Underdeveloped throughout." },
    // ... 5–20 total
  ],
});
The response includes an agreement_score (0–1, the rank correlation between AI scores and your hand scores on a held-out subset). Below 0.7 means the rubric is too subjective; below 0.5 means the rubric is not measuring what you think it’s measuring.
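A sketch of acting on that number, assuming the calibrate call resolves to the calibration block shown on the rubric:
const calibration = await hk.rubrics.calibrate(rubric.id, { examples });   // examples as above

if (calibration.agreement_score < 0.5) {
  throw new Error("Rubric isn't measuring what you think it is; rewrite the criteria.");
}
if (calibration.agreement_score < 0.7) {
  console.warn("Low agreement: tighten criterion guidance or add more hand-graded examples.");
}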

Submission

Immutable, multi-modal, preflight-checked at the door.
{
  "id": "sub_01J3K8R2QF6T9V8B0D2C5A1N4K",
  "object": "submission",
  "assignment": "asg_01J3K8Q7...",
  "student": {
    "id": "stu_01J3K8QP...",
    "external_id": "moodle-user-15",
    "email": "jm@example.edu",
    "name": "José María"
  },
  "content": {
    "text": null,
    "files": [
      {
        "id": "file_01J3K8R2A...",
        "name": "RP1.docx",
        "content_type": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
        "size_bytes": 48201,
        "extracted_text_chars": 5421,
        "extracted_language": "es",
        "preflight": { "readable": true, "warnings": [] }
      }
    ],
    "audio": null,
    "video": null,
    "code": null
  },
  "external_id": "moodle-submission-12",
  "received_at": "2026-05-11T09:20:11Z"
}

Operations

POST /v1/submissions — create. Synchronous preflight: parses every file, counts extracted text, runs OCR if enabled, detects language. Blocking issues fail the request with a 422 before the submission is persisted. GET /v1/submissions/{id} — retrieve. GET /v1/submissions?assignment=asg_… — list.

Multi-modal natively

Submissions accept any of: text, document files (pdf, docx, md, html), audio (mp3, wav, m4a — transcribed), video (mp4 — transcribed + keyframe analysis), code (zip or single file — language-aware extraction), images (jpg, png — OCR if enabled).
await hk.submissions.create({
  assignment: assignment.id,
  student: { external_id: "..." },
  content: {
    audio: { url: "https://your-cdn.example/oral-exam.mp3", language: "es" },
  },
});
The grader sees the transcript and the original media; the rubric is evaluated against both.

Preflight at the door

Preflight is synchronous and blocking. The submission is rejected before persistence if:
  • The file is encrypted or password-protected.
  • A PDF contains only images and OCR is disabled.
  • Extracted text is below the assignment’s minimum length.
  • Total payload exceeds the configured ceiling.
This kills the most common silent-failure path: a PDF that’s actually a scanned image arriving, being accepted, and timing out in the grading job 5 minutes later.
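A sketch of catching that rejection at upload time; the error object shape (status, error) is an assumption, and the exact code per preflight rule isn't documented here, so this only switches on the envelope's type:
try {
  await hk.submissions.create({
    assignment: assignment.id,
    student: { external_id: "moodle-user-15" },
    files: [{ name: "RP1.pdf", body: fileStream }],
  });
} catch (err: any) {
  if (err.status === 422 && err.error?.type === "invalid_request_error") {
    // Preflight rejection: nothing was persisted. Surface err.error.message to the uploader.
    console.error(err.error.code, err.error.message);
  } else {
    throw err;
  }
}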

Evaluation

The async grading job. The artifact you wait on.
{
  "id": "eval_01J3K8S3T0X5Y9Z2W4Q7P1B6C8",
  "object": "evaluation",
  "submission": "sub_01J3K8R2QF...",
  "rubric": "rub_01J3K8PT9R...",
  "rubric_version": 3,
  "model": "claude-sonnet-4-6-20260301",
  "seed": 42,
  "status": "running",
  "result": null,
  "failure_reason": null,
  "usage": null,
  "created_at": "2026-05-11T09:20:14Z",
  "started_at": "2026-05-11T09:20:16Z",
  "completed_at": null
}

Operations

POST /v1/evaluations — enqueue. GET /v1/evaluations/{id} — retrieve. POST /v1/evaluations/{id}/cancel — cancel queued or running.
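Canceling is a single call; here is a sketch that cancels anything not yet terminal (retrieve and cancel are assumed SDK wrappers for the endpoints above):
const current = await hk.evaluations.retrieve(evaluation.id);

if (current.status === "queued" || current.status === "running") {
  await hk.evaluations.cancel(evaluation.id);   // ends in the "canceled" terminal state
}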

Result envelope (on success)

{
  "status": "succeeded",
  "result": {
    "score": 87,
    "score_raw": 87,
    "scale": { "min": 0, "max": 100, "passing": 60 },
    "pass": true,
    "confidence": 0.84,
    "breakdown": [
      { "criterion": "thesis",    "score": 26, "max": 30, "rationale": "Specific and arguable; could narrow further." },
      { "criterion": "evidence",  "score": 34, "max": 40, "rationale": "3 primary sources; one citation incomplete." },
      { "criterion": "structure", "score": 18, "max": 20, "rationale": "Clear paragraphing; transitions could be tighter." },
      { "criterion": "mechanics", "score": 9,  "max": 10, "rationale": "Two minor typos." }
    ],
    "feedback_markdown": "## Strengths\nYour thesis is specific...\n\n## Areas to improve\n...",
    "feedback_for_student": "Great research direction! A few small things to tighten...",
    "annotations": [
      { "quote": "Einstein in 1915", "comment": "Year is right but cite the paper itself.", "criterion": "evidence", "start": 1240, "end": 1257 }
    ],
    "insights": {
      "next_steps": ["Practice paraphrasing citations", "Review APA format §6.3"],
      "concepts_strong": ["formulating arguable claims"],
      "concepts_weak":   ["primary vs secondary sources"]
    },
    "flags": []
  },
  "usage": {
    "input_tokens": 18420,
    "output_tokens": 2104,
    "model": "claude-sonnet-4-6-20260301",
    "cost_usd": 0.0683
  },
  "completed_at": "2026-05-11T09:20:48Z"
}

Result envelope (on failure)

{
  "status": "failed",
  "result": null,
  "failure_reason": {
    "type": "submission_only_images",
    "message": "The PDF contains only scanned images. Enable OCR on the assignment or upload a text-based PDF.",
    "request_id": "req_01J3K8S3T..."
  },
  "completed_at": "2026-05-11T09:20:22Z"
}

Reproducibility

Every evaluation pins three things:
  • model — the exact model version (claude-sonnet-4-6-20260301, not the floating alias).
  • rubric_version — the rubric as it was at the moment of evaluation.
  • seed — deterministic re-runs return identical output (within model determinism limits).
Re-running an evaluation with the same (submission, rubric_version, model, seed) returns the same result. This is what makes grading auditable.
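A sketch of an audit re-run; whether evaluations.create accepts explicit rubric_version, model, and seed overrides is an assumption, and the pinned values come straight off the original evaluation:
const original = await hk.evaluations.retrieve("eval_01J3K8S3T0X5Y9Z2W4Q7P1B6C8");

const rerun = await hk.evaluations.create({
  submission: original.submission,
  rubric_version: original.rubric_version,   // 3
  model: original.model,                     // "claude-sonnet-4-6-20260301"
  seed: original.seed,                       // 42
});

const done = await hk.evaluations.waitFor(rerun.id);
console.assert(done.result.score === original.result.score);   // identical, within model determinism limits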

Two flags worth knowing about

flags is an array; absent flags mean “not detected”. Today we ship:
  • "likely_ai_generated" — the submission was probably written by an LLM. We surface this for the teacher’s awareness; we don’t act on it.
  • "off_topic" — the response doesn’t engage with the prompt. The score is still computed but is meaningless; the teacher should look.
We don’t ship a plagiarism flag. Plagiarism detection is a different product with different liability; integrate Turnitin or similar alongside.
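A sketch of acting on those flags, treating flags as an array of string codes (the notify helper is your own, not part of the SDK):
const { result } = await hk.evaluations.waitFor(evaluation.id);

if (result.flags.includes("off_topic")) {
  // Score exists but is meaningless; hold the grade and route to the teacher.
  await notifyTeacherForManualReview(evaluation.id);
} else if (result.flags.includes("likely_ai_generated")) {
  // Informational only: surface next to the grade, take no automatic action.
  console.warn("Submission flagged as likely AI-generated.");
}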

Batch

Grade a whole class at once. Ideal for the teacher’s “submit all” moment after a class deadline.
const batch = await hk.evaluationBatches.create({
  evaluations: submissions.map(s => ({ submission: s.id })),
});

const completed = await hk.evaluationBatches.waitFor(batch.id);
console.log(`${completed.succeeded_count}/${completed.total_count} graded`);

for await (const evaluation of hk.evaluationBatches.results(batch.id)) {
  writeGradeToLms(evaluation);   // stream each result as it reaches a terminal state
}
Batches scale automatically; per-evaluation rate limits don’t apply. A batch of 30 typically completes in 1–2 minutes; a batch of 1,000 in under 10. One failed evaluation doesn’t fail the batch; each evaluation reaches its own terminal state independently. Batch evaluations cost 50% less per token. A single evaluation_batch.completed webhook fires when the batch reaches its terminal state, rather than one event per evaluation.

Preview

Before processing 100 real submissions, run one through the rubric with a sample answer. No persistence, no cost charged to the production ledger.
const preview = await hk.evaluations.preview({
  rubric: rubric.id,
  model: "claude-sonnet-4-6",
  submission: { content: { text: "Sample student response..." } },
});

console.log(preview.result.score, preview.result.feedback_markdown);
Use this in your CI when you change a rubric — snapshot the preview output for 5 canonical answers and diff in PRs.
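A sketch of that guard with Vitest-style snapshots; the test framework and the canonical answers are yours, only evaluations.preview comes from this page:
import { test, expect } from "vitest";

const canonicalAnswers = ["...", "...", "..."];   // known-strong and known-weak responses you maintain

test("rubric change keeps canonical scores stable", async () => {
  for (const text of canonicalAnswers) {
    const preview = await hk.evaluations.preview({
      rubric: rubric.id,
      model: "claude-sonnet-4-6",
      submission: { content: { text } },
    });
    expect(preview.result.score).toMatchSnapshot();
  }
});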

Teacher review

The post-AI workflow is a first-class resource, not a side-effect.
{
  "id": "rev_01J3K8T...",
  "object": "evaluation_review",
  "evaluation": "eval_01J3K8S3T...",
  "verdict": "overridden",
  "score": 92,
  "comments_markdown": "I'm giving extra credit for the original source...",
  "reviewer": { "id": "usr_01J3K8...", "name": "Prof. M. García" },
  "created_at": "2026-05-11T15:42:01Z"
}

Operations

POST /v1/evaluations/{id}/reviews — accept, override, or reject.
// Teacher accepts AI grade verbatim
await hk.evaluations.review(evaluation.id, { verdict: "accepted" });

// Teacher overrides
await hk.evaluations.review(evaluation.id, {
  verdict: "overridden",
  score: 92,
  comments_markdown: "Extra credit for primary source work.",
});

// Teacher rejects AI entirely; submission stays ungraded
await hk.evaluations.review(evaluation.id, {
  verdict: "rejected",
  comments_markdown: "Off-topic — student misunderstood the prompt.",
});
A Submission has a derived final_score that follows precedence: latest accepted/overridden review → evaluation result → null.
const sub = await hk.submissions.retrieve(submission.id);
console.log(sub.final_score);   // 92  (from the review)
console.log(sub.final_source);  // "review"
If the assignment has human_review: "required", final_score is null until a review exists, regardless of the evaluation status.

Webhooks

Production integrators subscribe instead of polling. Every event is signed with HMAC-SHA256.
Event | When
submission.received | Submission persisted, preflight passed.
evaluation.queued | Job accepted.
evaluation.succeeded | Terminal: scored, feedback ready.
evaluation.failed | Terminal: see failure_reason.type.
evaluation.canceled | Terminal: caller invoked evaluations.cancel().
evaluation_review.created | Teacher reviewed an evaluation.
evaluation_batch.completed | Batch reached its terminal state.
rubric.warning | A lint warning surfaced post-grading.

Signature verification

The signature header is Hawkings-Signature: t=<timestamp>,v1=<hmac>.
import { verifyWebhook } from "@hawkings/sdk/webhooks";

app.post("/hawkings-webhook", (req, res) => {
  const event = verifyWebhook({
    payload: req.rawBody,
    signature: req.headers["hawkings-signature"],
    secret: process.env.HAWKINGS_WEBHOOK_SECRET!,
  });

  if (event.type === "evaluation.succeeded") {
    writeGradeToLms(event.data);
  }

  res.status(200).end();
});
verifyWebhook throws on a signature mismatch or a timestamp older than 5 minutes (replay protection).

Delivery & retry

Non-2xx responses are retried with exponential backoff for 24 hours. Dedupe on your end by event.id. The dashboard at app.hawkings.education/webhooks shows every delivery, replayable with one click.
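A sketch of that dedupe keyed on event.id; the Set here is in-memory for brevity, back it with your database in production:
const seen = new Set<string>();

function handleEvent(event: { id: string; type: string; data: unknown }) {
  if (seen.has(event.id)) return;   // duplicate delivery (retry or manual replay): ignore
  seen.add(event.id);

  if (event.type === "evaluation.succeeded") {
    writeGradeToLms(event.data);
  }
}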

Idempotency

Every POST accepts an Idempotency-Key header. Same key + same body returns the same response — even on the 100th retry.
await hk.evaluations.create(
  { submission: submission.id },
  { idempotency_key: `eval-moodle-${submissionRemoteId}` },
);
This eliminates the duplicate-grade problem entirely. The Moodle plugin can retry on network failure without worrying about creating phantom evaluations. Same key + different body throws IdempotencyError. Records live 24 hours.

External IDs

You have IDs in your LMS. We have ours. The rules:
  • URL keys are always Hawkings IDs. asg_…, sub_…, eval_…, rub_…. Globally unique, prefixed, never reused.
  • external_id is a queryable field on every resource. Your reference, scoped to your workspace.
  • Lookup by external_id: GET /v1/assignments?external_id=moodle-assignment-29 returns a list (always; even size 1) so duplicates are detectable.
Never put your external ID in the URL. GET /v1/assignments/moodle-assignment-29 is not a valid endpoint. URL keys are always Hawkings IDs. This keeps URLs cacheable, prevents collisions across tenants, and means renaming an assignment in your LMS never breaks our URLs.
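A sketch of the lookup, assuming an hk.assignments.list wrapper over the filterable list endpoint and a data array on the response:
const page = await hk.assignments.list({ external_id: "moodle-assignment-29" });

if (page.data.length === 0) {
  // Not linked yet: create the assignment and store its asg_… id alongside your LMS record.
} else if (page.data.length > 1) {
  console.warn("Duplicate assignments share this external_id; pick one and archive the rest.");
}

const assignment = page.data[0];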

Errors

Every error response uses the same envelope:
{
  "error": {
    "type": "invalid_request_error",
    "code": "rubric_invalid",
    "message": "Criterion 'sources' has no max value.",
    "fields": { "criteria.2.max": "is required" },
    "request_id": "req_01J3K8Q8..."
  }
}
type is a closed enum: authentication_error, permission_error, not_found, invalid_request_error, rate_limit_error, idempotency_error, api_error, service_unavailable. code is finer-grained and stable. Switch on code for app logic; switch on type for retry decisions. fields is present on every validation error, keyed by JSON-pointer path. request_id is always present, on every response (success and failure). Include it in every support ticket.
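A sketch of that split; the error class shape (an error property carrying the envelope) is an assumption, and retryLater is your own scheduler:
async function createEvaluationSafely(submissionId: string) {
  try {
    return await hk.evaluations.create({ submission: submissionId });
  } catch (err: any) {
    switch (err.error?.type) {
      case "rate_limit_error":
      case "service_unavailable":
        return retryLater(() => hk.evaluations.create({ submission: submissionId }));   // transient: back off and retry
      case "invalid_request_error":
        console.error(err.error.code, err.error.fields);   // fix the request; don't retry
        throw err;
      default:
        throw err;   // log err.error.request_id and include it in the support ticket
    }
  }
}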

Test mode

Use a sandbox API key (prefix hk_test_) to integrate without spending real LLM tokens. Sandbox evaluations:
  • Return deterministic mock results computed from a hash of the submission.
  • Skip the LLM entirely (usage.cost_usd is 0).
  • Complete in under 100 ms.
  • Honor the full state machine, including failed outcomes, so you can exercise every branch of your integration.
const hk = new Hawkings({ api_key: "hk_test_..." });

const evaluation = await hk.evaluations.create({ submission: sub.id });
const done = await hk.evaluations.waitFor(evaluation.id);
expect(done.result.score).toBeDefined();
Test-mode data lives in a separate tenant; nothing leaks into production.