1. Industry workflow: schedules and rates
The cadence splits by data type and access method. Every four hours, workers fetch public vessel-schedule tables through the OmniScrape API using css_extractor against the schedule rows, which is enough resolution to catch most ETA revisions without hammering the source. Once a day, a BaaS session logs into the carrier portal your company contracts with, navigates to the rate-quote page, and exports the rate table, which is visible only after authentication.
Every timestamp is normalized to UTC at ingest, because port dashboards routinely display local time with no offset and a naive parser will silently shift an ETA by hours. The alerting rule is deliberately simple: flag any vessel whose ETA has slipped more than 24 hours since the prior scrape. That single signal drives most of the operational value, feeding exception handling in the TMS before a planner notices the delay manually.
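The alerting rule can be sketched in a few lines. This is a minimal illustration, assuming each scrape yields a mapping from IMO number to a timezone-aware UTC ETA; the function name `flag_slipped_etas` and the dictionary layout are hypothetical and not part of any OmniScrape or TMS API.

```python
from datetime import datetime, timedelta, timezone

# Threshold taken from the alerting rule described above.
SLIP_THRESHOLD = timedelta(hours=24)


def flag_slipped_etas(previous, current, threshold=SLIP_THRESHOLD):
    """Return {imo: slip} for vessels whose ETA moved later by more than threshold.

    previous and current map IMO number -> timezone-aware UTC ETA from two
    consecutive scrapes (hypothetical layout, for illustration only).
    """
    flagged = {}
    for imo, new_eta in current.items():
        old_eta = previous.get(imo)
        if old_eta is None:
            continue  # first sighting: nothing to compare against
        slip = new_eta - old_eta
        if slip > threshold:
            flagged[imo] = slip
    return flagged


# Example: the ETA moved from 25 June 14:30Z to 26 June 20:30Z, a 30-hour slip.
prior = {"9876543": datetime(2026, 6, 25, 14, 30, tzinfo=timezone.utc)}
latest = {"9876543": datetime(2026, 6, 26, 20, 30, tzinfo=timezone.utc)}
alerts = flag_slipped_etas(prior, latest)
```

Comparing only against the prior scrape keeps the rule stateless beyond one stored snapshot per vessel; a vessel that slips 10 hours on three consecutive scrapes would not trip this version, so a cumulative comparison may be worth considering if that pattern matters.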
The fetch frequency is a deliberate tradeoff. Polling more often than every four hours on public dashboards rarely yields fresher data — most ports batch their ETA updates — and increases the chance of triggering rate limits or IP blocks. For high-value lanes where a single vessel carries a significant cargo value, you can reduce the interval to two hours on those specific URLs without changing the default for the broader fleet.
2. Example data schema
The schema centers on normalized UTC timestamps and a stable vessel identity. Store the IMO number rather than the vessel name, since names change with charter and ownership while the IMO is permanent. Validate its seven-digit format at ingest to catch parser drift early — a missing leading digit or an extra character is a reliable indicator that the table layout shifted.
Recording the source dimension — port dashboard versus carrier portal — lets you weight conflicting ETAs when two sources disagree, which they frequently do. When the carrier portal and the port authority dashboard show different ETAs for the same vessel, the carrier portal is usually more current because it reflects the vessel's own reporting rather than the port's berth planning system. Store both and let the downstream consumer decide which to trust for their use case.
{
  "vessel_imo": "9876543",
  "vessel_name": "MSC AURORA",
  "port_code": "USLAX",
  "terminal": "APM Pier 400",
  "eta_utc": "2026-06-25T14:30:00Z",
  "ata_utc": null,
  "status": "delayed",
  "delay_hours": 18,
  "carrier": "maersk",
  "voyage_number": "AE-1234W",
  "scraped_at": "2026-06-23T12:00:00Z",
  "source": "port_dashboard",
  "source_url": "https://port.example/vessel-schedule",
  "parse_version": "2"
}
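The IMO check at ingest could look something like the sketch below. It goes one step beyond the seven-digit format test described above by also verifying the IMO check digit (the first six digits are weighted 7 down to 2, and the last digit of the sum must equal the seventh digit). The helper name `is_valid_imo` is hypothetical.

```python
import re

IMO_PATTERN = re.compile(r"[0-9]{7}")


def is_valid_imo(value: str) -> bool:
    """Check the seven-digit format and the IMO check digit."""
    if not IMO_PATTERN.fullmatch(value):
        return False
    digits = [int(ch) for ch in value]
    # Weights 7, 6, 5, 4, 3, 2 apply to the first six digits.
    weighted_sum = sum(d * w for d, w in zip(digits[:6], range(7, 1, -1)))
    return weighted_sum % 10 == digits[6]
```

The checksum catches single-digit misreads that a pure length check would let through, which is useful when a shifted column yields a plausible-looking seven-digit number.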
3. Example API request (public schedule table)
For public schedule pages, use mode 'auto' with output_format 'html' and pin a js_wait_selector to the schedule table element so the fetch does not return before the client-side table renders. Port dashboards are notorious for a multi-second hydration delay where the page skeleton loads immediately but the actual schedule rows are injected by JavaScript after a data fetch completes.
Leave css_selectors empty here and parse the table in your worker, where you can map column headers to fields and handle the inevitable layout quirks per carrier. The js_wait_timeout of 12 seconds reflects reality: these dashboards are slow under load, and a six-second timeout that works on a fast day will intermittently return an empty table skeleton during peak hours. Use the returned data.content field — not data.html — to get the full rendered HTML.
When the response comes back, check metadata.method_used to confirm js_rendering was engaged. If the field shows 'fast', the table selector was found before JavaScript ran, which usually means the page is server-rendered and you can drop the js_wait_selector for a faster, cheaper fetch on subsequent runs.
POST https://api.omniscrape.io/v1/scrape
X-API-Key: YOUR_KEY
Content-Type: application/json

{
  "url": "https://port.example/vessel-schedule",
  "mode": "auto",
  "output_format": "html",
  "js_wait_selector": "table.schedule",
  "js_wait_timeout": 12000
}
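A worker might wrap the request above roughly as follows. This is a sketch using only the Python standard library; the helper names are hypothetical, and the exact nesting of `data.content` and `metadata.method_used` in the response body is an assumption based on the field names mentioned in this guide, so verify it against a real response before relying on it.

```python
import json
import urllib.request

API_URL = "https://api.omniscrape.io/v1/scrape"


def build_scrape_request(target_url: str, api_key: str) -> urllib.request.Request:
    """Build the POST request with the same payload as the example above."""
    payload = {
        "url": target_url,
        "mode": "auto",
        "output_format": "html",
        "js_wait_selector": "table.schedule",
        "js_wait_timeout": 12000,
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"X-API-Key": api_key, "Content-Type": "application/json"},
        method="POST",
    )


def read_scrape_response(body: dict) -> tuple[str, bool]:
    """Return (html, server_rendered) from a decoded response body.

    Assumption: the HTML sits at body["data"]["content"] and the method at
    body["metadata"]["method_used"].
    """
    html = body["data"]["content"]
    method = body.get("metadata", {}).get("method_used")
    # 'fast' suggests the page is server-rendered, so the JS wait can be dropped.
    return html, method == "fast"


def fetch_schedule(target_url: str, api_key: str) -> tuple[str, bool]:
    """Send the request and decode the JSON response."""
    request = build_scrape_request(target_url, api_key)
    with urllib.request.urlopen(request, timeout=60) as response:
        return read_scrape_response(json.load(response))
```

Keeping the request builder and the response reader separate from the network call makes both easy to unit test without hitting the API.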
4. Authenticated carrier portal (BaaS)
Rate tables and contracted schedules live behind a login, which is where Browser-as-a-Service earns its place. Connect Playwright over CDP to wss://browser.omniscrape.io, script the login once, navigate to the rate page, wait for the table to render, and grab the HTML via page.content(). The critical operational detail is closing the browser the instant you have the content — BaaS bills by the minute, and a session left open across cron ticks turns a cheap daily job into a runaway cost line.
Keep credentials in a secrets manager and inject them as environment variables at runtime, never in the repo or the script body. A leaked carrier login is both a security incident and a contractual breach. The script itself is ordinary Playwright web scraping — the only difference is the browser runs remotely on OmniScrape's infrastructure rather than on your worker, so there is no headless Chrome to manage, no dependency conflicts, and no residential IP fingerprinting to configure.
For portals that add a CAPTCHA or aggressive bot detection even after a valid login, the techniques covered in the headless browser scraping guide, namely fingerprint consistency and human-like navigation timing, keep the session alive longer. Specifically: randomize the delay between the password fill and the submit click, avoid submitting the form programmatically if a real click works, and do not navigate to the rate page immediately after login. Instead, add a brief wait_for_load_state('networkidle') to let the portal's session validation complete before moving on.
import os

from playwright.async_api import async_playwright


async def carrier_rates() -> str:
    """Log in to the carrier portal and return the rendered rate-table HTML."""
    async with async_playwright() as p:
        # The browser runs remotely; KEY is a placeholder for your API key.
        browser = await p.chromium.connect_over_cdp(
            "wss://browser.omniscrape.io?apikey=KEY&render_media=false"
        )
        try:
            page = await browser.new_page()

            # Log in with credentials injected from the environment
            await page.goto("https://carrier-portal.example/login")
            await page.fill("#user", os.environ["CARRIER_USER"])
            await page.fill("#pass", os.environ["CARRIER_PASS"])
            await page.click("button[type=submit]")

            # Wait for the authenticated rate table to render
            await page.wait_for_load_state("networkidle")
            await page.wait_for_selector("table.rates", timeout=15000)
            return await page.content()
        finally:
            # Close even when a step fails or times out: BaaS bills per minute
            await browser.close()
5. Pipeline architecture
The topology runs from a registry of port and carrier URLs into a scheduler that routes public pages to the OmniScrape /v1/scrape endpoint and authenticated portals to a BaaS pool. A table parser turns HTML into rows, a timezone normalizer converts port-local times to UTC, and the cleaned rows flow into the TMS over a webhook while exceptions fan out to a Slack ops channel. A nightly job compares predicted ETAs against actual arrivals to produce the accuracy report that justifies the whole system.
Airflow handles scheduling and BaaS pool sizing, because authenticated sessions are the expensive resource and you want exactly one logged-in session per carrier at a time rather than a thundering herd. Parsed rows that fail validation — bad IMO format, missing ETA, unparseable timezone — land in a dead-letter queue so a layout change degrades one carrier gracefully instead of corrupting the warehouse. Each dead-letter entry includes the raw HTML snippet that failed parsing so a developer can reproduce the failure without re-running the scrape.
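The validation-then-route step might be sketched as below. The failure reasons mirror the three named above; the row field `eta_local` and the function names are hypothetical, and a real pipeline would write the dead-letter entries to a queue rather than a list.

```python
import re
from zoneinfo import ZoneInfo, ZoneInfoNotFoundError

IMO_RE = re.compile(r"[0-9]{7}")


def _timezone_ok(name) -> bool:
    """True when name is a loadable IANA timezone key."""
    if not name:
        return False
    try:
        ZoneInfo(name)
    except (ZoneInfoNotFoundError, ValueError, OSError):
        return False
    return True


def validation_errors(row: dict) -> list[str]:
    """Return the list of failure reasons for one parsed row (empty if valid)."""
    errors = []
    if not IMO_RE.fullmatch(str(row.get("vessel_imo", ""))):
        errors.append("bad_imo_format")
    if not row.get("eta_local"):  # hypothetical field holding the scraped ETA text
        errors.append("missing_eta")
    if not _timezone_ok(row.get("port_timezone")):
        errors.append("unparseable_timezone")
    return errors


def route_rows(rows, raw_html_snippet: str):
    """Split rows into (valid, dead_letter); each dead-letter entry keeps the raw HTML."""
    valid, dead_letter = [], []
    for row in rows:
        errors = validation_errors(row)
        if errors:
            dead_letter.append(
                {"row": row, "errors": errors, "raw_html": raw_html_snippet}
            )
        else:
            valid.append(row)
    return valid, dead_letter
```

Recording every failure reason per row, rather than stopping at the first, makes the dead-letter breakdown by reason more useful when a layout change breaks several fields at once.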
For the /v1/scrape path, use a session_id tied to the port URL so OmniScrape can reuse the same IP across the four-hour polling cycle. This reduces the chance of a source treating the repeated fetches as a distributed scan rather than a returning visitor. Rotate the session_id daily rather than per-request.
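One way to get a session_id that is stable within a day and rotates at the UTC date boundary is to derive it from the port URL and the date. This is a sketch only: the `port-` prefix, the hash length, and the helper name are assumptions, and the format OmniScrape accepts for session_id should be checked against its documentation.

```python
import hashlib
from datetime import date, datetime, timezone
from typing import Optional


def daily_session_id(port_url: str, day: Optional[date] = None) -> str:
    """Derive a session_id that is constant per URL per UTC day."""
    day = day or datetime.now(timezone.utc).date()
    digest = hashlib.sha256(f"{port_url}|{day.isoformat()}".encode("utf-8")).hexdigest()
    return f"port-{digest[:16]}"
```

Because the value is computed rather than stored, every worker that polls the same URL on the same day arrives at the same session_id without coordinating.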
6. PDF rate sheets
A frustrating share of carriers still publish rates as PDF only, and OmniScrape's html output path does nothing for a binary document. Route these to a dedicated PDF parser (pdfplumber for Python is a practical starting point for structured tables) or to a manual upload workflow for carriers that publish PDFs on no fixed schedule.
Resist the temptation to OCR PDF rate sheets at scale without rigorous accuracy validation. An OCR error that turns $1,800 into $1,300 on a rate sheet propagates straight into quotes and is nearly impossible to audit after the fact because the source document looks correct. Where a carrier offers both a PDF and a portal table, always prefer the portal table: structured HTML is an order of magnitude more reliable to parse than a PDF layout that shifts every quarter when the carrier updates their template.
Track which carriers are PDF-only as a known limitation in the pipeline registry so stakeholders understand why their coverage lags the portal-based carriers. That transparency also creates pressure on the carrier relationship to request an API or portal alternative, which is a better long-term outcome than an increasingly brittle PDF parser.
7. Metrics to track
Operations trusts ETA mean absolute error above every other number, so the right move when accuracy slips is to fix the parser and timezone handling before adding carriers — breadth without accuracy just spreads the error around. Segment the MAE by carrier and port so you can identify which source is degrading the aggregate rather than averaging over the problem.
BaaS session duration is the cost canary. A session that should complete in 40 seconds creeping toward three minutes usually means the portal added an interstitial step — a new cookie consent banner, an MFA prompt, or a 'what's new' modal — that the script is waiting through rather than dismissing. Alerting on session duration above a threshold catches these changes before they become billing surprises.
- ETA accuracy vs actual arrival — mean absolute error in hours, segmented by carrier and port
- Rate sheet freshness — staleness in days since last successful parse
- Carrier coverage — percentage of contracted carriers with a successful scrape in the last 24 hours
- Exception detection rate — delays caught before a planner noticed manually
- BaaS session duration — minutes per carrier login; creep above baseline signals a portal change
- Table parse failure rate — percentage of fetches where the parser produced zero valid rows
- Dead-letter queue depth — rows failing validation per day, broken down by failure reason
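The first metric in the list, segmented ETA mean absolute error, can be computed along these lines. The record layout reuses field names from the example schema but assumes `eta_utc` and `ata_utc` have already been parsed into timezone-aware datetimes; the function name is hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timezone


def eta_mae_by_segment(records):
    """Return {(carrier, port_code): mean absolute ETA error in hours}.

    Records without an actual arrival (ata_utc is None) are skipped.
    """
    errors = defaultdict(list)
    for rec in records:
        if rec["ata_utc"] is None:
            continue
        error_hours = abs((rec["eta_utc"] - rec["ata_utc"]).total_seconds()) / 3600
        errors[(rec["carrier"], rec["port_code"])].append(error_hours)
    return {segment: sum(vals) / len(vals) for segment, vals in errors.items()}
```

Keying on the (carrier, port) pair means one degrading source shows up as its own outlier instead of nudging a fleet-wide average.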
8. Time zone normalization
Time zone handling is where logistics scrapers most often go quietly wrong. Port dashboards routinely show '14:30' with no UTC offset, and assuming UTC or your server's local zone will misplace the ETA by anywhere from a few to twelve-plus hours depending on the port. Store a port_timezone field in the URL registry — keyed by port_code — and convert explicitly with Python's zoneinfo module, so USLAX times resolve through America/Los_Angeles and Singapore through Asia/Singapore.
Daylight saving transitions add another trap. A port that is UTC-8 in winter is UTC-7 in summer, which is exactly why you store the IANA timezone name rather than a fixed numeric offset. Letting zoneinfo apply the correct offset for a given date is the only approach that survives the twice-yearly clock changes without manual patching. A fixed offset of -8 will produce wrong ETAs for roughly two-thirds of the year at any US West Coast port, since daylight saving time there runs from March to November.
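A minimal sketch of the conversion with zoneinfo follows. The in-memory `PORT_TIMEZONES` dictionary stands in for the registry lookup, the timestamp format is an assumption about what the dashboard displays, and SGSIN is used here as the port code for Singapore.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

# Stand-in for the port_timezone field in the URL registry, keyed by port_code.
PORT_TIMEZONES = {
    "USLAX": "America/Los_Angeles",
    "SGSIN": "Asia/Singapore",
}


def local_to_utc(local_text: str, port_code: str, fmt: str = "%Y-%m-%d %H:%M") -> datetime:
    """Interpret a naive port-local timestamp and convert it to UTC."""
    naive = datetime.strptime(local_text, fmt)
    local = naive.replace(tzinfo=ZoneInfo(PORT_TIMEZONES[port_code]))
    return local.astimezone(timezone.utc)


summer = local_to_utc("2026-06-25 14:30", "USLAX")  # UTC-7 applies in June
winter = local_to_utc("2026-01-15 14:30", "USLAX")  # UTC-8 applies in January
```

The same wall-clock time at USLAX lands an hour apart in UTC depending on the season, which is the error a fixed offset would introduce.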
After normalization, store the original scraped timestamp string alongside the converted UTC value in the dead-letter schema. When a timezone conversion produces an ETA that is more than 48 hours in the past or more than 30 days in the future, treat it as a parse error rather than a valid ETA — it almost always means the source format changed rather than a genuine extreme schedule.
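The plausibility window described above might be enforced with a check like this one. The function name and the injectable `now` parameter are illustrative choices, the latter so the check can be tested deterministically.

```python
from datetime import datetime, timedelta, timezone

MAX_PAST = timedelta(hours=48)
MAX_FUTURE = timedelta(days=30)


def eta_is_plausible(eta_utc: datetime, now: datetime = None) -> bool:
    """True when the ETA is within 48 hours behind and 30 days ahead of now."""
    now = now or datetime.now(timezone.utc)
    return now - MAX_PAST <= eta_utc <= now + MAX_FUTURE
```

Rows that fail this check would go to the dead-letter queue together with the original scraped string, so the source format change can be diagnosed from the raw value.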
9. Authorization and compliance
Only scrape carrier portals your company holds a contract and credentials for, and treat the terms of that contract as the boundary of what is permissible. Some carrier terms of service restrict automated access even for authenticated, paying customers, so route any new portal through legal review before pointing a BaaS session at it. The engineering cost of building the scraper is trivial compared to the relationship cost of violating a carrier agreement.
Credential storage belongs in a secrets manager — AWS Secrets Manager, HashiCorp Vault, or GCP Secret Manager — with rotation policies enforced at the infrastructure level. Never hardcode credentials in the repo or bake them into a container image. A leaked carrier login is both a security incident and a contractual breach that can result in portal access termination for the entire organization.
Document which carriers have approved automated access in a registry that is auditable by legal and operations, not just engineering. When a carrier updates their terms, you need to be able to identify every automated workflow touching that portal within minutes, not days. Include the date of legal review, the relevant contract clause, and the name of the approver in the registry entry.
10. Rollout phases
Begin with two public port dashboards where there is no authentication risk and the parsing patterns can be proven in production. This phase validates the scheduler, the timezone normalization, the dead-letter queue, and the TMS webhook before any BaaS spend is involved. Run it long enough to collect a week of ETA accuracy data against actual arrivals.
Phase two adds a single authenticated carrier so the BaaS login flow, session hygiene, credential injection, and session duration alerting get exercised on one target before they multiply. Choose the carrier with the simplest portal — no MFA, no CAPTCHA, a straightforward rate table — so the first BaaS integration is a clean proof of concept rather than a debugging exercise.
Phase three expands to full lane coverage once the ETA accuracy metric is stable and the cost per session is understood and budgeted. Each phase should run long enough to produce a clean accuracy report before the next begins, because adding carriers faster than you can validate them inflates the error rate without anyone being able to identify which source is to blame. A measured rollout also keeps BaaS spend predictable while you learn each portal's quirks — session duration, login flow steps, table selector stability — before they become production incidents.
Frequently asked questions
When should I use the OmniScrape API versus a BaaS session?
Use the /v1/scrape endpoint with mode 'auto' or 'js_rendering' for any page that does not require you to be logged in — public port authority dashboards, publicly visible vessel trackers, and open sailing schedule pages. Use BaaS when the data only exists behind a login: contracted rate portals, shipper-specific schedule views, and booking confirmations. The distinction is not about page complexity but about whether valid credentials are required to see the data at all.
How do I parse HTML schedule tables reliably?
Fetch the page as output_format 'html' and read the response from body.data.content. Pass that HTML to pandas.read_html() in your worker, which handles colspan and rowspan merging that manual BeautifulSoup parsing typically misses. After parsing, validate the resulting DataFrame's column names against an expected schema before writing any rows — a renamed or reordered column will otherwise map data into the wrong fields silently, and you will not notice until the ETA accuracy metric degrades.
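The column check can be kept independent of pandas by operating on the header list, for example `list(df.columns)` from the parsed table. In this sketch the expected headers and their target field names are hypothetical; a real mapping would be versioned per source in the registry.

```python
# Hypothetical mapping from normalized table headers to schema fields.
EXPECTED_COLUMNS = {
    "vessel": "vessel_name",
    "imo": "vessel_imo",
    "eta": "eta_local",
    "terminal": "terminal",
}


def map_columns(columns) -> dict:
    """Map the table's actual headers to schema fields, by name rather than position.

    Raises ValueError when an expected header is missing, so a renamed column
    fails loudly instead of shifting data into the wrong field.
    """
    normalized = {str(col).strip().lower(): col for col in columns}
    missing = sorted(set(EXPECTED_COLUMNS) - set(normalized))
    if missing:
        raise ValueError(f"schedule table is missing expected columns: {missing}")
    return {normalized[header]: field for header, field in EXPECTED_COLUMNS.items()}
```

The returned dictionary can be passed to `DataFrame.rename(columns=...)`. Because matching is by header name, a reordered table still maps correctly, while a renamed header stops the write.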
How long should a BaaS session stay open?
Log in, wait for the target table, capture the HTML, and close the browser immediately. The entire session should complete in under 60 seconds for a simple portal. Never leave a session open between cron ticks to 'save' a login state — BaaS bills per minute and an idle open session is pure waste plus an additional opportunity for the portal to detect and terminate the connection. If re-authenticating every run is too slow, use session cookies persisted to a secrets manager and restore them at the start of each session rather than keeping the browser alive.
What if the schedule data loads from a JSON XHR rather than an HTML table?
That is usually the cleaner path. Inside a BaaS session, intercept the network response using page.route() or page.wait_for_response() to capture the XHR payload directly, then parse the JSON rather than scraping the rendered table. The underlying API payload is far more stable than the DOM constructed from it — column names do not move, data types are consistent, and you skip the HTML parsing layer entirely. Document the XHR endpoint URL in the registry so future maintainers know to monitor it for changes.
How do I handle portals that add MFA or CAPTCHA after login?
For MFA, the practical approach is to use a service account with MFA disabled if the carrier permits it, or to implement TOTP generation using the pyotp library with the seed stored in the secrets manager. For CAPTCHAs on an authenticated portal, enable_solver: true on the OmniScrape API request handles the challenge resolution automatically. Inside a BaaS session, OmniScrape's infrastructure handles common CAPTCHA types transparently. If the portal uses a custom challenge that the solver does not handle, contact OmniScrape support; custom challenge types are evaluated case by case.
What is the right polling frequency for vessel ETAs?
Every four hours covers the vast majority of ETA revisions for standard ocean freight. Most port authority systems batch their updates rather than publishing in real time, so polling more frequently rarely yields fresher data and increases the risk of triggering rate limits. For high-value lanes — a single vessel carrying a significant cargo value or time-sensitive goods — reduce the interval to two hours for those specific URLs without changing the default for the broader fleet. Never poll faster than every 30 minutes regardless of cargo value; the data sources do not update that frequently.
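These rules translate into a small interval lookup for the scheduler. The function name, the `overrides` parameter, and the idea of clamping rather than rejecting a too-short interval are illustrative choices, not part of any scheduler API.

```python
from datetime import timedelta

DEFAULT_INTERVAL = timedelta(hours=4)
HIGH_VALUE_INTERVAL = timedelta(hours=2)
MIN_INTERVAL = timedelta(minutes=30)  # hard floor regardless of cargo value


def polling_interval(url, high_value_urls, overrides=None) -> timedelta:
    """Pick the polling interval for one URL, never going below the floor."""
    overrides = overrides or {}
    if url in overrides:
        interval = overrides[url]
    elif url in high_value_urls:
        interval = HIGH_VALUE_INTERVAL
    else:
        interval = DEFAULT_INTERVAL
    return max(interval, MIN_INTERVAL)
```

Applying the floor in one place means a misconfigured per-URL override cannot push the pipeline into polling faster than the sources update.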
How do I keep the pipeline running when a carrier portal changes its layout?
The dead-letter queue is the first line of defense: when the table parser produces zero valid rows, the raw HTML is preserved so you can reproduce the failure and update the selector without re-running the scrape. Set an alert on parse failure rate above a threshold — two consecutive zero-row fetches from the same carrier URL is a reliable layout change signal. Version your CSS selectors and column-to-field mappings in the registry so you can roll back if a 'fix' turns out to be a temporary A/B test on the portal's side. For BaaS flows, record the sequence of wait_for_selector calls so a new modal or interstitial is immediately obvious when the session duration alert fires.
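The "two consecutive zero-row fetches" signal can be tracked with a per-URL streak counter. This is an in-memory sketch with a hypothetical class name; a production version would persist the streaks between scheduler runs.

```python
from collections import defaultdict


class LayoutChangeDetector:
    """Flag a URL once it produces `threshold` consecutive zero-row fetches."""

    def __init__(self, threshold: int = 2):
        self.threshold = threshold
        self._streaks = defaultdict(int)

    def record(self, url: str, valid_row_count: int) -> bool:
        """Record one fetch result; return True when an alert should fire."""
        if valid_row_count > 0:
            self._streaks[url] = 0  # any successful parse resets the streak
            return False
        self._streaks[url] += 1
        return self._streaks[url] >= self.threshold
```

Requiring consecutive failures filters out a single empty table caused by a slow render, while still firing within two polling cycles of a real layout change.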