Introduction
You can use a collection of Watson Data REST APIs associated with Watson Studio and Watson Knowledge Catalog to manage data-related assets and the people who need to use these assets.
Refine data Use the sampling APIs to create representative subsets of the data on which to test and refine your data cleansing and shaping operations. To better understand the contents of your data, you can create profiles of your data assets that include a classification of the data and additional distribution information which assists in determining the data quality.
Catalog data Use the catalog APIs to create catalogs to administer your assets, associate properties with those assets, and organize the users who use the assets. Assets can be notebooks or connections to files, database sources, or data assets from a connection.
Data policies Use the data policy APIs to implement data policies and a business glossary that fits to your organization to control user access rights to assets and to make it easier to find data.
Ingest streaming data Use the streams flow APIs to hook up continuous, unidirectional flows of massive volumes of moving data that you can analyze in real time.
API Endpoint
https://api.dataplatform.cloud.ibm.com
Creating an IAM bearer token
Before you can call a Watson Data API you must first create an IAM bearer token. Each token is valid only for one hour, and after a token expires you must create a new one if you want to continue using the API. The recommended method to retrieve a token programmatically is to create an API key for your IBM Cloud identity and then use the IAM token API to exchange that key for a token.
You can create a token in IBM Cloud or by using the IBM Cloud command line interface (CLI).
To create a token in the IBM Cloud:
- Log in to IBM Cloud and select Manage > Security > Platform API Keys.
- Create an API key for your own personal identity, copy the key value, and save it in a secure place. After you leave the page, you will no longer be able to access this value.
- With your API key, set up Postman or another REST API tool and run the curl command shown under "Curl command with API key to retrieve token".
- Use the value of the access_token property for your Watson Data API calls. Set the access_token value as the authorization header parameter for requests to the Watson Data APIs. The format is Authorization: Bearer <access_token_value_here>. For example: Authorization: Bearer eyJraWQiOiIyMDE3MDgwOS0wMDowMDowMCIsImFsZyI6IlJTMjU2In0...
To create a token by using the IBM Cloud CLI:
Follow the steps to install the CLI, log in to IBM Cloud, and get the token described here.
Remove Bearer from the returned IAM token value in your API calls.
Curl command with API key to retrieve token
curl "https://iam.ng.bluemix.net/identity/token" -d "apikey=YOUR_API_KEY_HERE&grant_type=urn%3Aibm%3Aparams%3Aoauth%3Agrant-type%3Aapikey" -H "Content-Type: application/x-www-form-urlencoded" -H "Authorization: Basic Yng6Yng="
Response
{
"access_token": "eyJraWQiOiIyMDE3MDgwOS0wMDowMDowMCIsImFsZyI6...",
"refresh_token": "zmRTQFKhASUdF76Av6IUzi9dtB7ip8F2XV5fNgoRQ0mbQgD5XCeWkQhjlJ1dZi8K...",
"token_type": "Bearer",
"expires_in": 3600,
"expiration": 1505865282
}
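The same exchange can be scripted. The following is a minimal Python sketch, using only the standard library, that builds the request shown in the curl command and formats the header for later calls. The helper names build_token_request and bearer_header are illustrative and are not part of any IBM SDK. The request is built but not sent.

```python
import urllib.parse
import urllib.request

# Endpoint and Basic credentials copied from the curl command above.
IAM_TOKEN_URL = "https://iam.ng.bluemix.net/identity/token"


def build_token_request(api_key: str) -> urllib.request.Request:
    """Build (but do not send) the POST that exchanges an API key for a token."""
    form = urllib.parse.urlencode({
        "apikey": api_key,
        "grant_type": "urn:ibm:params:oauth:grant-type:apikey",
    }).encode("ascii")
    return urllib.request.Request(
        IAM_TOKEN_URL,
        data=form,
        headers={
            "Content-Type": "application/x-www-form-urlencoded",
            "Authorization": "Basic Yng6Yng=",
        },
        method="POST",
    )


def bearer_header(token_response: dict) -> dict:
    """Turn the parsed JSON token response into the header for Watson Data API calls."""
    return {"Authorization": "Bearer " + token_response["access_token"]}
```

To send the request, pass the object to urllib.request.urlopen and parse the response body with json.load. Because the response reports expires_in as 3600 seconds, a long-running client would need to repeat the exchange before the token expires.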
Versioning
Watson Data API has a major, minor, and patch version, following industry conventions on semantic versioning: Using the version number format MAJOR.MINOR.PATCH, the MAJOR version is incremented when incompatible API changes are made, the MINOR version is incremented when functionality is added in a backwards-compatible manner, and the PATCH version is incremented when backwards-compatible bug fixes are made. The service major version is represented in the URL path.
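As an illustration of how a client might reason about these rules, the following Python sketch compares two version strings. The function names and the version numbers in the usage note are hypothetical; the check simply encodes the semantic versioning convention described above.

```python
def parse_version(text):
    """Split "MAJOR.MINOR.PATCH" into a tuple of three integers."""
    major, minor, patch = (int(part) for part in text.split("."))
    return major, minor, patch


def is_backward_compatible(client_built_for, service_version):
    """True if a client written for one version should work against the service version."""
    client = parse_version(client_built_for)
    service = parse_version(service_version)
    # Same MAJOR means no incompatible changes; the service must be at least as new.
    return client[0] == service[0] and service >= client
```

Under these assumptions, a client written for 2.3.1 would be expected to work against 2.5.0 but not against 3.0.0.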
Sorting
Some of the Watson Data API collections provide custom sorting support. Custom sorting is implemented using the sort query parameter. Service collections can also support single-field or multi-field sorting. The sort parameter in collections that support single-field sorting can contain any one of the valid sort fields.
For example, the following expression would sort accounts on company name (ascending):
GET /v2/accounts?sort=company_name
You can also add a + or - character, indicating "ascending" or "descending," respectively.
For example, the expression below would sort on the last name of the account owner, in descending order:
GET /v2/accounts?sort=-owner.last_name
The sort parameter in collections that support sorting on multiple fields can contain a comma-separated sequence of fields (each, optionally, with a + or -) in the same format as the single-field sorting. Sorts are applied to the data set in the order that they are provided. For example, the expression below would sort accounts first on company name (ascending) and second on owner last name (descending):
GET /v2/accounts?sort=company_name,-owner.last_name
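A small helper can assemble the sort value from a list of fields. This Python sketch is illustrative only; build_sort is a hypothetical name, and the accounts collection is the one used in the examples above.

```python
def build_sort(fields):
    """Build a sort parameter value from (field_name, ascending) pairs."""
    parts = []
    for name, ascending in fields:
        # Ascending is the default, so only descending needs a prefix.
        parts.append(name if ascending else "-" + name)
    return ",".join(parts)


print("/v2/accounts?sort=" + build_sort([("company_name", True), ("owner.last_name", False)]))
```

The sketch omits the optional + prefix on purpose: some URL decoders interpret a literal + in a query string as a space, so leaving it out avoids the need to percent-encode it.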
Filtering
Some of the Watson Data API collections provide filtering support. You can specify one or more filters where each supported field is required to match a specific value for basic filtering. The query parameter names for a basic filter must exactly match the name of a primitive field on a resource in the collection or a nested primitive field where the '.' character is the hierarchical separator. The only exception to this rule is for primitive arrays. In primitive arrays, such as tags, a singular form of the field is supported as a filter that matches the resource if the array contains the supplied value. Some of the Watson Data API collections can also support extended filtering comparisons for the following field types: Integer and float, date and date/time, identifier and enumeration, and string.
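A basic filter is therefore just a set of field=value query parameters. The Python sketch below shows one way to encode them; the accounts collection and the field names are borrowed from the Sorting examples, the tag filter assumes the collection has a tags array, and the values are made up.

```python
import urllib.parse


def basic_filter_query(filters):
    """Encode field/value pairs as basic-filter query parameters."""
    # Nested fields use "." as the separator, for example "owner.last_name".
    return urllib.parse.urlencode(filters)


print("/v2/accounts?" + basic_filter_query({"owner.last_name": "Smith", "tag": "finance"}))
```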
Rate Limiting
The following rate limiting headers are supported by some of the Watson Data service APIs:
- X-RateLimit-Limit: If rate limiting is active, this header indicates the number of requests permitted per hour.
- X-RateLimit-Remaining: If rate limiting is active, this header indicates the number of requests remaining in the current rate limit window.
- X-RateLimit-Reset: If rate limiting is active, this header indicates the time at which the current rate limit window resets, as a UNIX timestamp.
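A client can use these headers to decide how long to pause before the next call. The following Python sketch assumes the response headers are available as a plain dictionary keyed by the exact header names above; the function name is hypothetical.

```python
import time


def seconds_until_reset(headers, now=None):
    """Return seconds to wait before the next call, or 0 if no wait is needed."""
    remaining = headers.get("X-RateLimit-Remaining")
    reset = headers.get("X-RateLimit-Reset")
    if remaining is None or reset is None:
        return 0  # headers absent: rate limiting is not active for this service
    if int(remaining) > 0:
        return 0  # requests are still available in the current window
    now = time.time() if now is None else now
    # X-RateLimit-Reset is a UNIX timestamp, so the wait is the difference from now.
    return max(0, int(reset) - int(now))
```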
Error Handling
Responses with 400-series or 500-series status codes are returned when a request cannot be completed. The body of these responses follows the error model, which contains a code field to identify the problem and a message field to explain how to solve the problem. Each individual endpoint has specific error messages. All responses with 500 or 503 status codes are logged and treated as a critical failure requiring an emergency fix.
Connections
A connection is the information necessary to create a connection to a data source or a repository. You create a connection asset by providing the connection information.
List data source types
Data sources are where data can be written or read and might include relational database systems, file systems, object storage systems and others.
To list supported data source types, call the following GET method:
GET /v2/datasource_types
The response to the GET method includes information about each of the sources and targets that are currently supported. The response includes a unique ID property value metadata.asset_id, name, and a label. The metadata.asset_id property value should be used for the data source in other APIs that reference a data source type. Additional useful information such as whether that data source can be used as a source or target (or both) is also included.
Use the connection_properties=true query parameter to return a set of properties for each data source type that is used to define a connection to it. Use the interaction_properties=true query parameter to return a set of properties for each data source type that is used to interact with a created connection. Interaction properties for a relational database might include the table name and schema from which to retrieve data.
Use the _sort query parameter to order the list of data source types returned in the response.
A default maximum of 100 data source type entries are returned per page of results. Use the _limit query parameter with an integer value to specify a lower limit.
More data source types than those on the first page of results might be available. Additional properties generated from the page size initially specified with _limit are returned in the response. Call a GET method using the value of the next.href property to retrieve the next page of results. Call a GET method using the value in the prev.href property to retrieve the previous page of results. Call a GET method using the value in the last.href property to retrieve the last page of results.
These URIs use the _offset and _limit query parameters to retrieve a specific block of data source types from the full list. Alternatively, you can use a combination of the _offset and _limit query parameters to retrieve a custom block of results.
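Following next.href repeatedly walks the whole list. The Python sketch below shows the loop; it assumes that the final page omits the next property, which is not stated above, and it takes the HTTP call as a parameter so that any client library can be plugged in. The name iter_pages is hypothetical.

```python
def iter_pages(first_page, fetch):
    """Yield each page, following next.href until no next link is present.

    fetch is any callable that takes a URL and returns the parsed JSON page.
    """
    page = first_page
    while True:
        yield page
        next_link = page.get("next")
        # Assumption: the last page has no "next" property (or no href in it).
        if not next_link or "href" not in next_link:
            break
        page = fetch(next_link["href"])
```

Passing the fetch function in keeps authentication and error handling out of the paging logic, so the loop can be exercised with canned pages.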
Create a connection
Connections to any of the supported data source types returned by the previous method can be created and persisted in a catalog or project.
To create a connection, call the following POST method:
POST /v2/connections
A new connection can be created in a catalog or project. Use the catalog_id or project_id query parameter to specify where to create the connection asset. Either catalog_id or project_id is required.
The request body for the method is a UTF-8 encoded JSON document and includes the data source type ID (obtained in the List data source types section), its unique name in the catalog or project space, and a set of connection properties specific to the data source. Some connection properties are required.
The following example shows the request body used for creating a connection to IBM dashDB:
{
"datasource_type": "cfdcb449-1204-44ba-baa6-9a8a878e6aa7",
"name":"My-DashDB-Connection",
"properties": {
"host":"dashDBhost.com",
"port":"50001",
"database":"MYDASHDB",
"password": "mypassword",
"username": "myusername"
}
}
By default, the physical connection to the data source is tested when the connection is created. Use the test=false query parameter to disable the connection test.
A response payload containing a connection ID and other metadata is returned when a connection is successfully created. Use the connection ID as path parameter in other REST APIs when a connection resource must be referenced.
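The pieces of the create call can be put together as in the following Python sketch. It only builds the path and the encoded body and does not send anything. The function name is hypothetical, and rejecting the case where both IDs are supplied is an assumption that goes beyond what is stated above.

```python
import json
import urllib.parse


def create_connection_request(body, catalog_id=None, project_id=None, test=True):
    """Return (path_with_query, encoded_body) for POST /v2/connections."""
    # The text requires catalog_id or project_id; refusing both at once is an assumption.
    if (catalog_id is None) == (project_id is None):
        raise ValueError("specify exactly one of catalog_id or project_id")
    if catalog_id is not None:
        params = {"catalog_id": catalog_id}
    else:
        params = {"project_id": project_id}
    if not test:
        params["test"] = "false"  # skip the default connection test
    path = "/v2/connections?" + urllib.parse.urlencode(params)
    # The request body must be a UTF-8 encoded JSON document.
    return path, json.dumps(body).encode("utf-8")
```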
Discover connection assets
Data sources contain data and metadata describing the data they contain.
To discover or browse the data or metadata in a data source, call the following GET method:
GET /v2/connections/{connection_id}/assets?path=
Use the catalog_id or project_id query parameter to specify where the connection asset was created. Either catalog_id or project_id is required.
connection_id is the ID of the connection asset returned from the POST https://{service_URL}/v2/connections method, which created the connection asset.
The path query parameter is required and is used to specify the hierarchical path of the asset within the data source to be browsed. In a relational database, for example, the path might represent a schema and table. For a file object, the path might represent a folder hierarchy.
Each asset in the assets array returned by this method includes a property containing its path in the hierarchy to facilitate the next call to drill down deeper in the hierarchy.
For example, starting at the root path in an RDBMS will return a list of schemas:
{
"path": "/",
"asset_types": [
{
"type": "schema",
"dataset": false,
"dataset_container": true
}
],
"assets": [
{
"id": "GOSALES",
"type": "schema",
"name": "GOSALES",
"path": "/GOSALES"
}
],
"fields": [],
"first": {
"href": "https://wdp-dataconnect-ys1dev.stage1.mybluemix.net/v2/connections/4b28b5c1-d818-4ad2-bcf9-7de08e776fde/assets?catalog_id=75a3062b-e40f-4bc4-9519-308ee1b5b251&_offset=0&_limit=100"
},
"prev": {
"href": "https://wdp-dataconnect-ys1dev.stage1.mybluemix.net/v2/connections/4b28b5c1-d818-4ad2-bcf9-7de08e776fde/assets?catalog_id=75a3062b-e40f-4bc4-9519-308ee1b5b251&_offset=0&_limit=100"
},
"next": {
"href": "https://wdp-dataconnect-ys1dev.stage1.mybluemix.net/v2/connections/4b28b5c1-d818-4ad2-bcf9-7de08e776fde/assets?catalog_id=75a3062b-e40f-4bc4-9519-308ee1b5b251&_offset=100&_limit=100"
}
}
Drill down into the GOSALES schema using the path
property for the GOSALES schema asset to discover the list of table assets in the schema.
GET /v2/connections/{connection_id}/assets?catalog_id={catalog_id}&path=/GOSALES
The list of table type assets is returned in the response.
{
"path": "/GOSALES",
"asset_types": [
{
"type": "table",
"dataset": true,
"dataset_container": false
}
],
"assets": [
{
"id": "BRANCH",
"type": "table",
"name": "BRANCH",
"description": "BRANCH contains address information for corporate offices and distribution centers.",
"path": "/GOSALES/BRANCH"
},
{
"id": "CONVERSION_RATE",
"type": "table",
"name": "CONVERSION_RATE",
"description": "CONVERSION_RATE contains currency exchange values.",
"path": "/GOSALES/CONVERSION_RATE"
}
],
"fields": [],
"first": {
"href": "https://wdp-dataconnect-ys1dev.stage1.mybluemix.net/v2/connections/4b28b5c1-d818-4ad2-bcf9-7de08e776fde/assets?catalog_id=75a3062b-e40f-4bc4-9519-308ee1b5b251&_offset=0&_limit=100"
},
"prev": {
"href": "https://wdp-dataconnect-ys1dev.stage1.mybluemix.net/v2/connections/4b28b5c1-d818-4ad2-bcf9-7de08e776fde/assets?catalog_id=75a3062b-e40f-4bc4-9519-308ee1b5b251&_offset=0&_limit=100"
},
"next": {
"href": "https://wdp-dataconnect-ys1dev.stage1.mybluemix.net/v2/connections/4b28b5c1-d818-4ad2-bcf9-7de08e776fde/assets?catalog_id=75a3062b-e40f-4bc4-9519-308ee1b5b251&_offset=100&_limit=100"
}
}
Use the fetch query parameter with a value of either data, metadata, or both. Data can only be fetched for data set assets. In the response above, note that the entry in asset_types has a type property value of table. Its dataset property value is true. This means that data can be fetched from table type assets. However, if you fetched assets from the connection root, the response would contain schema asset types, which are not data sets, so data cannot be fetched from them.
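The drill-down described in this section can be automated by using the dataset and dataset_container flags to decide whether to stop or descend. The Python sketch below is a simplified illustration: find_datasets is a hypothetical name, the browse callable stands in for the GET call, and paging within each level is ignored.

```python
def find_datasets(browse, path="/"):
    """Recursively collect the paths of assets whose asset type is a data set.

    browse is any callable that takes a path and returns the parsed response
    of GET /v2/connections/{connection_id}/assets for that path.
    """
    response = browse(path)
    # Map each asset type to its flags, e.g. {"table": {"dataset": True, ...}}.
    kinds = {t["type"]: t for t in response.get("asset_types", [])}
    found = []
    for asset in response.get("assets", []):
        kind = kinds.get(asset["type"], {})
        if kind.get("dataset"):
            found.append(asset["path"])  # data can be fetched from this asset
        elif kind.get("dataset_container"):
            found.extend(find_datasets(browse, asset["path"]))  # drill down
    return found
```

With the two responses shown above, starting at the root would return the paths of the BRANCH and CONVERSION_RATE tables.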
A default maximum of 100 metadata assets are returned per page of results. Use the _limit query parameter with an integer value to specify a lower limit. More assets than those on the first page of results might be available.
Additional properties generated from the page size initially specified with _limit are returned in the response. Call a GET method using the value of the next.href property to retrieve the next page of results. Call a GET method using the value in the prev.href property to retrieve the previous page of results. Call a GET method using the value in the last.href property to retrieve the last page of results.
These URIs use the _offset and _limit query parameters to retrieve a specific block of assets from the full list. Alternatively, use a combination of the _offset and _limit query parameters to retrieve a custom block of results.
Specify properties for reading delimited files
When reading a delimited file using this method, specify property values to correctly parse the file based on its format. These properties are passed to the method as a JSON object using the properties query parameter. The default file format (property file_format) is a CSV file. If the file is a CSV, the following property values are set by default:
Property Name | Property Description | Default Value | Value Description |
---|---|---|---|
quote_character | quote character | double_quote | double quotation mark |
field_delimiter | field delimiter | comma | comma |
row_delimiter | row delimiter | carriage_return_linefeed | carriage return followed by line feed |
escape_character | escape character | double_quote | double quotation mark |
For CSV file formats, these property values cannot be overridden. If it is necessary to modify these properties to properly read a delimited file, set the file_format property to delimited. For generic delimited files, these properties have the following values:
Property Name | Property Description | Default Value | Value Description |
---|---|---|---|
quote_character | quote character | none | no character is used for a quote |
field_delimiter | field delimiter | null | no field delimiter value is set by default |
row_delimiter | row delimiter | new_line | any new line representation |
escape_character | escape character | none | no character is used for an escape |
This example sets file format properties for a generic delimited file:
GET https://{service_URL}/v2/connections/{connection_id}/assets?catalog_id={catalog_id}&path=/myFolder/myFile.txt&fetch=data&properties={"file_format":"delimited", "quote_character":"single_quote","field_delimiter":"colon","escape_character":"backslash"}
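The example above shows the JSON object unencoded for readability. When building the URL in code, the braces, quotation marks, and spaces generally need to be percent-encoded. The following Python sketch shows one way to do that with the standard library; assets_query is a hypothetical helper name.

```python
import json
import urllib.parse


def assets_query(catalog_id, path, properties):
    """Build the query string for reading a delimited file with custom parsing properties."""
    return urllib.parse.urlencode({
        "catalog_id": catalog_id,
        "path": path,
        "fetch": "data",
        # Serialize the properties object, then let urlencode percent-encode it.
        "properties": json.dumps(properties, separators=(",", ":")),
    })


query = assets_query(
    "{catalog_id}",
    "/myFolder/myFile.txt",
    {
        "file_format": "delimited",
        "quote_character": "single_quote",
        "field_delimiter": "colon",
        "escape_character": "backslash",
    },
)
```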
For more information about this method see the REST API Reference.
Discover assets using a transient connection
A data source's assets can be discovered without creating a persistent connection.
To browse assets without first creating a persistent connection, call the following POST method:
POST https://{service_URL}/v2/connections/assets?path=
This method is identical in behavior to the GET method in the Discover connection assets section except for two differences:
- You define the connection properties in the request body of the REST API. You do not reference the connection ID of a persistent connection with a query parameter. The same JSON object used to create a persistent connection is used in the request body.
- You do not specify a catalog or project ID with a query parameter.
See the previous section to learn how to set properties used to read delimited files.
For more information about this method see the REST API Reference.
Update a connection
To modify the properties of a connection, call the following PATCH method:
PATCH /v2/connections/{connection_id}
connection_id is the ID of the connection asset returned from the POST https://{service_URL}/v2/connections method, which created the connection asset.
Use the catalog_id or project_id query parameter to specify where the connection asset was created. Either catalog_id or project_id is required.
Set the Content-Type header to application/json-patch+json. The request body contains the connection properties to update using a JSON object in JSON Patch format.
Change the port number of the connection and add a description using this JSON Patch:
[
{
"op": "add",
"path": "/description",
"value": "My new PATCHed description"
},
{
"op":"replace",
"path":"/properties/port",
"value":"40001"
}
]
By default, the physical connection to the data source is tested when the connection is modified. Use the test=false query parameter to disable the connection test.
For more information about this method see the REST API Reference.
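A JSON Patch body like the one in this section can be generated rather than written by hand. The Python sketch below is a hedged illustration: connection_patch is a hypothetical helper that emits an add operation for the description and a replace operation for each connection property, escaping property names as JSON Pointer tokens.

```python
import json


def connection_patch(description=None, properties=None):
    """Build a JSON Patch document for PATCH /v2/connections/{connection_id}."""
    ops = []
    if description is not None:
        ops.append({"op": "add", "path": "/description", "value": description})
    for name, value in (properties or {}).items():
        # JSON Pointer escaping (RFC 6901): "~" becomes "~0" and "/" becomes "~1".
        token = name.replace("~", "~0").replace("/", "~1")
        ops.append({"op": "replace", "path": "/properties/" + token, "value": value})
    return ops


headers = {"Content-Type": "application/json-patch+json"}
body = json.dumps(connection_patch("My new PATCHed description", {"port": "40001"}))
```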
Delete a connection
To delete a persistent connection, call the following DELETE method:
DELETE /v2/connections/{connection_id}
connection_id is the ID of the connection asset returned from the POST https://{service_URL}/v2/connections method, which created the connection asset.
Use the catalog_id or project_id query parameter to specify where the connection asset was created. Either catalog_id or project_id is required.
Schedules
Introduction
Schedules allow you to run a data flow, a notebook, a data profile, or any other given source more than once. Schedules support the repeat types hour, day, week, month, and year, and two repeat end options: an end date and a maximum number of runs.
Create a schedule
To create a schedule in a specified catalog or project, call the following POST method:
HTTP Method: POST
URI: /v2/schedules
Before you create a schedule, you must consider the following points:
- You must have a valid IAM token to make REST API calls and a project or catalog ID.
- You must be authorized (be assigned the correct role) to create schedules in the catalog or project.
- The start and end dates must be in the following format: YYYY-MM-DDTHH:mm:ssZ or YYYY-MM-DDTHH:mm:ss.sssZ (specified in RFC 3339).
- The supported repeat types are hour, day, week, month, and year.
- There are 2 repeat end options, namely max_invocations and end_date.
- The supported repeat interval is 1.
- There are 3 statuses for schedules, namely enabled, disabled, and finished. To create a schedule, the status must be enabled. The scheduling service updates the status to finished once it has finished running. You can stop or pause the scheduling service by updating the status to disabled.
- You can update the endpoint URI in the target HREF. Supported target methods are POST, PUT, PATCH, DELETE, and GET.
- Set generate_iam_token=true. When this option is set to true, the scheduling service generates an IAM token and passes it to the target URL at runtime. This IAM token is required to run schedules automatically at the scheduled intervals. This token is not to be confused with the IAM token required to make Watson Data API REST calls.
This POST method creates a schedule in a catalog with a defined start and a given end date:
{
"catalog_id": "aeiou",
"description": "aeiou",
"name": "aeiou",
"tags": ["aeiou"],
"start_date": "2017-08-22T01:02:14.859Z",
"status": "enabled",
"repeat": {
"repeat_interval": 1,
"repeat_type": "hour"
},
"repeat_end": {
"end_date": "2017-08-24T01:02:14.859Z"
},
"target": {
"href": "https://api.dataplatform.cloud.ibm.com/v2/data_profiles?start=false",
"generate_iam_token": true,
"method": "POST",
"payload": "aeiou",
"headers": [
{
"name": "content-type",
"value": "application/json",
"sensitive": false
}
]
}
}
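Several of the points listed before the example can be checked on the client before the request is sent. The following Python sketch is one possible pre-flight check, not the service's own validation; the function name and the messages are hypothetical.

```python
import re

# Matches YYYY-MM-DDTHH:mm:ssZ and YYYY-MM-DDTHH:mm:ss.sssZ.
DATE_RE = re.compile(r"^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(\.\d{3})?Z$")
REPEAT_TYPES = {"hour", "day", "week", "month", "year"}


def schedule_problems(schedule):
    """Return a list of problems found; an empty list means none were found."""
    problems = []
    if "catalog_id" not in schedule and "project_id" not in schedule:
        problems.append("a catalog or project ID is required")
    if schedule.get("status") != "enabled":
        problems.append("status must be 'enabled' when creating a schedule")
    if "start_date" in schedule and not DATE_RE.match(schedule["start_date"]):
        problems.append("start_date is not in the expected format")
    repeat = schedule.get("repeat", {})
    if repeat.get("repeat_type") not in REPEAT_TYPES:
        problems.append("unsupported repeat_type")
    if repeat.get("repeat_interval") != 1:
        problems.append("repeat_interval must be 1")
    end = schedule.get("repeat_end", {})
    if "end_date" in end and not DATE_RE.match(end["end_date"]):
        problems.append("end_date is not in the expected format")
    return problems
```

Run against the request body shown above, the check reports no problems.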
Get multiple schedules in a catalog or project
To get all schedules in the specified catalog or project, call the following GET method:
HTTP Method: GET
URI: /v2/schedules
You need the following information to get multiple schedules:
A valid IAM token, schedule ID, and the catalog or project ID.
You must be authorized to get schedules in the catalog or project.
You can filter the returned results by using the options entity.schedule.name and entity.schedule.status and can filter matching types by using StartsWith(starts:) and Equals(e:).
You can sort the returned results either in ascending or descending order by using one or more of the following options: entity.schedule.name, metadata.create_time, and entity.schedule.status.
Get a schedule
To get a schedule in the specified catalog or project, call the following GET method:
HTTP Method: GET
URI: /v2/schedules/{schedule_id}
You need the following information to get a schedule:
A valid IAM token, schedule ID, and the catalog or project ID.
You must be authorized to get a schedule in the catalog or project.
Update a schedule
To update a schedule in the specified catalog or project, call the following PATCH method:
HTTP Method: PATCH
URI: /v2/schedules/{schedule_id}
You need the following information to update a schedule:
A valid IAM token, schedule ID, and the catalog or project ID.
You must be authorized to update a schedule in the catalog or project.
You can update all the attributes under entity but can't update the attributes under metadata.
Patch supports the replace, add, and remove operations. The replace operation can be used with all the attributes under entity. The add and remove operations can only be used with the repeat end options, namely max_invocations and end_date.
The start and end dates must be in the following format: YYYY-MM-DDTHH:mm:ssZ or YYYY-MM-DDTHH:mm:ss.sssZ (specified in RFC 3339).
This PATCH method replaces the repeat type, removes the max invocations and adds an end date:
[
{
"op": "remove",
"path": "/entity/schedule/repeat_end/max_invocations",
"value": 20
},
{
"op": "add",
"path": "/entity/schedule/repeat_end/end_date",
"value": "date"
},
{
"op": "replace",
"path": "/entity/schedule/repeat/repeat_type",
"value": "week"
}
]
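The rules about which operations apply to which attributes can be checked before the patch is sent. The Python sketch below is an illustrative client-side check; the function name is hypothetical, and the paths are the ones used in the example above.

```python
REPEAT_END_PATHS = {
    "/entity/schedule/repeat_end/max_invocations",
    "/entity/schedule/repeat_end/end_date",
}


def check_schedule_patch(ops):
    """Return the operations that break the rules described for schedule updates."""
    rejected = []
    for op in ops:
        kind, path = op.get("op"), op.get("path", "")
        if kind == "replace":
            ok = path.startswith("/entity/")  # metadata attributes cannot be updated
        elif kind in ("add", "remove"):
            ok = path in REPEAT_END_PATHS  # only the repeat end options
        else:
            ok = False  # only replace, add, and remove are supported
        if not ok:
            rejected.append(op)
    return rejected
```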
Delete a schedule
To delete a schedule in the specified catalog or project, call the following DELETE method:
HTTP Method: DELETE
URI: {GATEWAY_URL}/v2/schedules/{schedule_id}
":guid" represents the schedule_id of the deleted schedule.
You need the following information to delete a schedule:
A valid IAM token, schedule ID, and the catalog or project ID.
You must be authorized to delete a schedule in the catalog or project.
Delete multiple schedules
To delete multiple schedules in the specified catalog or project, call the following DELETE method:
HTTP Method: DELETE
URI: {GATEWAY_URL}/v2/schedules
":guid" represents the schedule_id of the deleted schedule.
You need the following information to delete multiple schedules:
A valid IAM token, schedule ID, and the catalog or project ID.
You must be authorized to delete schedules in the catalog or project.
A comma-separated list of the schedule IDs. If schedule IDs are not listed in the parameter schedule_ids, the scheduling service will delete all the schedules in the catalog or project.
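Because an empty schedule_ids parameter deletes every schedule, a client may want a guard against sending one by accident. The following Python sketch shows such a guard. The function name is hypothetical, and passing the catalog ID as a catalog_id query parameter is an assumption, since this section does not say how the ID is supplied.

```python
import urllib.parse


def delete_schedules_path(schedule_ids, catalog_id):
    """Build the path and query for deleting specific schedules in a catalog."""
    ids = [s for s in schedule_ids if s]
    if not ids:
        # An empty schedule_ids parameter would delete every schedule, so refuse.
        raise ValueError("refusing to build a delete-all request")
    query = urllib.parse.urlencode(
        # Assumption: the catalog ID is passed as a catalog_id query parameter.
        {"catalog_id": catalog_id, "schedule_ids": ",".join(ids)},
        safe=",",  # keep the list separator readable instead of %2C
    )
    return "/v2/schedules?" + query
```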
Catalogs
Watson Knowledge Catalog helps you easily organize, find and share data assets, analytical assets, etc. for many data science projects and for the users who need to use those assets.
You can use the Catalog API to create catalogs which are rich metadata repositories for organizing and exploring metadata.
There are two phrases that will be used repeatedly throughout this (and the "Assets" and "Asset Types") documentation:
- asset resource: The primary content of the asset. Many assets have a resource that is stored in an external repository: a data file, connected data set, notebook file, dashboard definition, or model definition.
- asset metadata: The information about the asset resource. Each asset has a primary metadata document in a project or catalog and might have additional metadata documents.
See the Asset Terminology section for more information about those two phrases.
There is one special user-provided storage that must be specified by the creator of a catalog at the time the catalog is created: a Cloud Object Storage bucket for public cloud deployment and a file system for hybrid cloud deployment. We'll informally call that the "catalog's bucket". The creator of the catalog owns that bucket, but by providing that bucket's identification info during catalog creation, the catalog creator is allowing the Watson Knowledge Catalog graphical user interface to store asset resources in that bucket and is allowing other Watson Knowledge Catalog APIs to store (extended) asset metadata in that bucket.
If a user wants to store and retrieve asset resources (like spreadsheets, images, etc.) in the catalog's bucket, then that user can use the Assets API to assist in that process.
In some cases, one of the other Watson Knowledge Catalog APIs (for example, the "Profiling" API) will store (extended) asset metadata documents in the catalog's bucket.
This section describes some of the individual Catalog APIs.
Get a Catalog
You can get metadata about a catalog using the get Catalog API. (Note: you aren't retrieving the actual data catalog with the GET Catalog API - you're just retrieving metadata that describes the catalog.)
Get Catalog - Request URL:
GET {service_URL}/v2/catalogs/{catalog_id}
Get Catalog - Response Body:
{
"metadata": {
"guid": "c6f3cbd8-2b7f-42fb-aa60-___",
"url": "https://api.dataplatform.cloud.ibm.com/v2/catalogs/c6f3cbd8-2b7f-42fb-aa60-___",
"creator_id": "IBMid-___",
"create_time": "2018-11-06T17:40:32Z"
},
"entity": {
"name": "CatalogForGettingStartedDoc",
"description": "Catalog created for Getting Started doc",
"generator": "Your catalog generator",
"bss_account_id": "12345___",
"capacity_limit": 0,
"is_governed": false,
"saml_instance_name": "IBM w3id"
},
"href": "https://api.dataplatform.cloud.ibm.com/v2/catalogs/c6f3cbd8-2b7f-42fb-aa60-___"
}
In this case, the response for the Get Catalog request is identical to the response for the Create Catalog request. If more activity had occurred with the catalog between the Create Catalog and the Get Catalog requests then there might have been some differences between the two responses.
Get Catalogs
To obtain the metadata for all the catalogs that you have access to (that is, the catalogs of which you are a collaborator), you can call the GET Catalogs API.
Get Catalogs - Request URL:
GET {service_URL}/v2/catalogs
Note: the above URL is the simplest URL for getting catalogs because it doesn't contain any parameters. There are a number of optional parameters (limit, bookmark, skip, include, bss_account_id) to the above URL that you can make use of to limit the number of catalogs for which metadata is returned.
Get Catalogs - Response Body:
{
"catalogs": [
{
"metadata": {
"guid": "c6f3cbd8-2b7f-42fb-aa60-___",
"url": "https://api.dataplatform.cloud.ibm.com/v2/catalogs/c6f3cbd8-2b7f-42fb-aa60-___",
"creator_id": "IBMid-___",
"create_time": "2018-11-06T17:40:32Z"
},
"entity": {
"name": "CatalogForGettingStartedDoc",
"description": "Catalog created for Getting Started doc",
"generator": "Your catalog generator",
"bss_account_id": "12345___",
"capacity_limit": 0,
"is_governed": false,
"saml_instance_name": "IBM w3id"
},
"href": "https://api.dataplatform.cloud.ibm.com/v2/catalogs/c6f3cbd8-2b7f-42fb-aa60-___"
}
],
"nextBookmark": "g1AAAAFCeJzLYWBgYMlgTmHQSklKzi9KdUhJMjT___",
"nextSkip": 0
}
In the above example, metadata for only one catalog is returned - the catalog created above. An advantage of calling the GET Catalogs API is you don't have to remember the ID of any particular catalog in order to get the metadata for that catalog.
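A response like this one can be reduced to the identifiers a client usually needs. The Python sketch below is illustrative; summarize_catalogs is a hypothetical name, and it only reads fields that appear in the response body above.

```python
def summarize_catalogs(response):
    """Return (guid, name) pairs plus the bookmark reported for the next page, if any."""
    pairs = [
        (catalog["metadata"]["guid"], catalog["entity"]["name"])
        for catalog in response.get("catalogs", [])
    ]
    return pairs, response.get("nextBookmark")
```

The returned bookmark is presumably the value to supply in the bookmark parameter of a follow-up request, but that usage is an assumption and should be checked against the REST API Reference.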
Assets
From a high level, an asset is an item of data or data analysis in a project or catalog. Most of these assets consist of two parts:
Asset resource: The primary content of the asset. Many assets have a resource that is stored in an external repository: a data file (e.g. text file, image, video, etc.), connected data set (e.g. database table), notebook file, dashboard definition, or model definition. The Assets API does not affect this part of the asset. Think of this as the object that's being described by asset metadata (i.e., an asset resource is a "describee").
Asset metadata: The information about the asset resource. Each asset has a primary metadata document in a project or catalog and might have additional metadata documents. This is the part of the asset that you can get, create, or operate on with the Assets API. Think of this as the object that's doing the describing of an asset resource (ie, asset metadata is a "describer").
A library is a useful analogy for understanding the scope of the Assets API. A library contains a set of books and an index. The index, or card catalog, contains a card about each book. A card has information about the book, including the location of the book. A Watson project or catalog contains only the card catalog part of the library. The books, or asset resources, are elsewhere. Consequently, the Assets API can return the location of an asset resource, but not affect the asset resource in any way.
The term asset encapsulates the following:
- [1] asset resource: the primary / initial resource that a user wants described by a primary metadata document.
- [2] primary metadata document: a document added to a catalog to describe an asset resource.
- [3] attributes: chunks of data inside a primary metadata document that describe either the asset resource or a secondary / extended metadata document.
- [4] secondary / extended metadata documents: additional documents containing information related to the asset resource. Attached to the primary metadata document. Can be generated by catalog processes, such as profiling.
- [5] a combination of all of the above: the Watson Knowledge Catalog UI presents information from each of the above on a single page and calls all that information an "asset".
For example, when you call the Get Assets API, you receive asset metadata (in a primary metadata document). The asset metadata might point to the location of the asset resource, but the Get Assets API does not return the asset resource. Similarly, when you run the Create Assets API, you create a primary metadata document that can, eventually, include the location of an existing asset resource.
This overview section provides a picture of the parts of a "primary metadata document" and then explains the parts of that picture. The picture provides a kind of "map" of a primary metadata document, so it's recommended to spend a few minutes studying it. Readers who prefer API examples can skip over the explanation of that picture that follows, and go straight to the Assets API Examples section. However, the Assets API Examples section will often refer back to the terms and explanations discussed in this Assets API Overview section.
Note: when calling any of the endpoints in the Assets API you must specify either a catalog ID or a project ID to indicate whether the metadata for an asset is (to be) in a catalog or a project. Because the Assets API endpoints can be applied to either a catalog or a project, rather than repeating the phrase "either a catalog or a project" over and over throughout the rest of this documentation, only the term "catalog" will be used. The possibility of instead using a "project" will be implied.
Asset Primary Metadata Document (or Card)
A primary metadata document is a document that contains the primary metadata for an asset resource. Once a primary metadata document has been created and stored in the catalog, it's often informally said that that asset resource has been "cataloged", or "added to the catalog". Note: being cataloged, or added to the catalog, does not mean the asset resource has been moved or copied and is now physically stored inside the catalog - it just means a primary metadata document has been created for that asset resource, and that primary metadata document is now stored in the catalog.
Almost every Assets API endpoint revolves around creating, reading, modifying or deleting a primary metadata document. JSON is natively used to store primary metadata documents in a catalog, and to transfer those documents in Assets API REST calls. So, JSON examples of primary metadata documents will be used throughout this documentation.
In this documentation, the term card (as in, an index card in a library's catalog) will often be used as a short nickname for the phrase "primary metadata document". In this documentation, "card" and "primary metadata document" mean exactly the same thing. The term "card" just saves us from reading and writing the lengthier phrase "primary metadata document" over and over.
A primary metadata document (ie, card) is a JSON object that's composed of up to three top-level fields, named as follows:
- "metadata": a JSON object containing metadata common to all asset types
- "entity": a JSON object containing attributes, each containing metadata specific to one asset type
- "attachments": an optional JSON array, each item of which is a JSON object containing metadata for an attached (ie, externally stored) asset resource or extended metadata document
For a pictorial representation of a primary metadata document (ie, card) and its associated asset resource and extended metadata documents, see the Parts of a Primary Metadata Document figure below:
In particular, note that:
- red rectangles are used in the figure to highlight the three top-level fields of a card.
- the green rectangles illustrate how important the name of the primary asset type is in relating various parts of the card, and the attached asset resource, to each other. In the example figure, the value of "metadata.asset_type" is "data_asset". The value you'll see in your card depends on the "asset_type" you've specified for your asset.
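As a minimal sketch of the three-part structure described above, the following Python snippet splits a card into its parts. The sample card is hypothetical and heavily trimmed; only the field names are taken from this section.

```python
import json

# Hypothetical, trimmed-down card; only the field names come from this section.
CARD_JSON = """
{
  "metadata": {"asset_type": "data_asset", "asset_attributes": ["data_asset"]},
  "entity": {"data_asset": {"mime_type": "text/csv"}},
  "attachments": [{"asset_type": "data_asset"}]
}
"""

def card_parts(card):
    """Return the three top-level parts; "attachments" is optional, so default to []."""
    return card["metadata"], card["entity"], card.get("attachments", [])

card = json.loads(CARD_JSON)
metadata, entity, attachments = card_parts(card)
primary_type = metadata["asset_type"]
# The primary asset type name also names an attribute inside "entity".
print(primary_type, primary_type in entity, len(attachments))
```

Defaulting "attachments" to an empty list lets the same helper handle cards, such as connection cards, that have no attachments.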
"metadata" field of a Primary Metadata Document
The "metadata" field of a primary metadata document (ie, of a card) is a JSON object that contains metadata fields that are common across all types of assets. (See the top red rectangle in the parts figure.) The Assets API specifies the names of the fields that go into the "metadata" part of the card. The user must supply values for some of the fields in "metadata"; the values of other fields in "metadata" will be filled in by the Assets API during the life of the card. Here's a list of some of the fields inside "metadata" (see example cards in the Get Asset section for more extensive lists):
- "asset_id":
- The ID of the card (ie, primary metadata document) rather than of the asset resource described by the card.
- Created internally by the Assets API at the time the card is created. That is, you do not supply this value.
- "asset_type":
- You must supply this value.
- Declares the primary asset type of this card.
- Describes the type of the asset resource attached (if any) to this card.
- Specifies the name of the primary attribute in this card.
- See Asset Types for more details on asset types.
- "asset_attributes":
- You must not supply any value for this field when creating a primary metadata document. The Assets APIs maintain the contents of this field.
- An array of attribute names (only the names, not the actual attributes).
- Each attribute / asset type name listed in this array will have a correspondingly named attribute in the "entity" field of the card.
- The name of each attribute must match the name of an existing asset type, so this is also an array of the names of the primary and secondary / extended asset types used by this card.
- "name": the name of the asset resource this card describes
- "description": a description of the asset resource
- "origin_country": the originating country for the asset resource
- "tags": an array of terms that users want to associate with the asset resource
- "rov": Rules Of Visibility. The most common values are:
- "mode": -1 - this is the default, which corresponds to "mode" : 0, public (see below)
- "mode": 0 - if you want public visibility, in which everybody can view and search the values of the asset's primary metadata document (card), and preview the asset's data, then you would set this field as follows. Note: access can still be denied based on actionable governance policy rules.
"rov": {
"mode": 0,
"collaborator_ids": []
}
- "mode": 8 - if you want private visibility, in which only users listed as members of the asset (as denoted by collaborator_ids list) can view and search the values of the asset's primary metadata document (card), and preview the asset's data, then you would set this field as follows. Note: access can still be denied based on actionable governance policy rules.
"rov": {
"mode": 8,
"collaborator_ids": [
{
"IBMid-06___": {
"user_iam_id": "IBMid-06___"
}
},
{
"IBMid-27___": {
"user_iam_id": "IBMid-27___"
}
}
]
}
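The two "rov" shapes above can be produced with a small helper. This is only a sketch: the function name make_rov and the constants are hypothetical, and it deliberately handles just the two modes shown above (0 and 8).

```python
ROV_PUBLIC = 0   # everybody can view and search the card
ROV_PRIVATE = 8  # only members listed in collaborator_ids can

def make_rov(mode, member_iam_ids=()):
    """Build an "rov" object in the shape used by the examples above."""
    if mode not in (ROV_PUBLIC, ROV_PRIVATE):
        raise ValueError(f"unsupported rov mode: {mode}")
    # Each collaborator entry is keyed by the IAM ID and repeats it as "user_iam_id".
    collaborators = [{iam_id: {"user_iam_id": iam_id}} for iam_id in member_iam_ids]
    return {"mode": mode, "collaborator_ids": collaborators}

print(make_rov(ROV_PUBLIC))
print(make_rov(ROV_PRIVATE, ["IBMid-06___", "IBMid-27___"]))
```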
"entity" field of a Primary Metadata Document
The "entity" field of a card (ie, primary metadata document) is a JSON object that contains additional JSON objects called attributes, each of which contains metadata fields that are specific to one asset type. (See the middle red rectangle in the parts figure.) The only contents of the "entity" field are attributes, which are discussed in the next section.
Note: the fact that the "entity" section contains attributes for more than one asset type does not mean that a single card contains metadata for more than one asset resource. A card always contains metadata for exactly one asset resource, and that asset resource will have exactly one attribute associated with it (see primary attribute below). All the other attributes in the "entity" field contain extended metadata describing the single asset resource that the card was created for. Really, asset types ought to be thought of as attribute types because asset types literally define (some of) the fields that will appear in attributes.
Attributes
An attribute:
- is contained directly inside the "entity" field of the primary metadata document
- is identically named with, and has fields that are partially defined by, an Asset Type
- describes an asset resource or something related to that asset resource, such as an extended metadata document
There is one attribute in the "entity" field for each attribute name that appears in the "metadata.asset_attributes" array. So, for example, if the "metadata.asset_attributes" array contains these two attribute names:
"metadata": {
...
"asset_attributes": [
"data_asset",
"data_profile"
],
}
then the "entity" field will contain these two correspondingly named attributes:
"entity": {
"data_asset": { // attribute name matches "data_asset" in "metadata.asset_attributes"
...attribute contents...
},
"data_profile": { // attribute name matches "data_profile" in "metadata.asset_attributes"
...attribute contents...
}
}
The name of each attribute in "entity" must also match the name of an existing asset type. That is, an attribute named "X" will contain metadata related to an asset type also named "X". So, an attribute's name can be thought of as simultaneously telling us that attribute's "type". For example, in this asset metadata document example, both the attribute names "data_asset" and "data_profile" refer to asset types with those same names.
There is one special attribute that will be referred to as the primary attribute. The primary attribute is the main attribute used to describe an asset resource. Every primary metadata document will have exactly one primary attribute. The name of the primary attribute is the same as the name that appears in the "metadata.asset_type" field.
Any attribute other than the primary attribute is a "secondary" / "extended" attribute whose name must match the name of a secondary / extended asset type. A common example of an attribute for extended metadata is named "data_profile", which is created by the Profiling API. For example, see the underlined names in the Parts of a Primary Metadata Document figure, or the "entity.data_profile" field in this asset metadata document.
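The naming rules above can be checked mechanically. The following is a sketch only: check_attributes is a hypothetical helper, not part of the Assets API, and the sample card is trimmed to the fields the check needs.

```python
def check_attributes(card):
    """Return a list of problems; an empty list means the card looks consistent."""
    metadata, entity = card["metadata"], card["entity"]
    problems = []
    # Every name in metadata.asset_attributes needs a same-named attribute in entity.
    for name in metadata.get("asset_attributes", []):
        if name not in entity:
            problems.append(f"asset_attributes lists {name!r} but entity has no such attribute")
    # The primary attribute is the one named by metadata.asset_type.
    primary = metadata["asset_type"]
    if primary not in entity:
        problems.append(f"primary attribute {primary!r} is missing from entity")
    return problems

# Hypothetical, trimmed card used only for illustration.
card = {
    "metadata": {"asset_type": "data_asset", "asset_attributes": ["data_asset", "data_profile"]},
    "entity": {"data_asset": {}, "data_profile": {}},
}
print(check_attributes(card))
```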
Although the Assets API restricts the names of attribute objects to match the names of asset types, the Assets API does not (in general) specify what the contents of those attributes should be. So, in some sense, the fields within an attribute are the opposite of the fields within the "metadata" field:
- the Assets API "owns" (or, specifies) which fields go inside "metadata"
- the user "owns" (or, specifies) which fields go inside the attributes (except for some fields of already available asset types)
The following example shows two attributes, whose names must match asset types, but whose contents are (for the most part) up to the user:
"entity": {
"data_asset": { // attribute name must match some asset type's name
...
data_asset *type creator* and
data_asset *attribute creator*
decide what fields go here
...
},
"data_profile": { // attribute name must match some asset type's name
...
data_profile *type creator* and
data_profile *attribute creator*
decide what fields go here
...
}
}
Because the Asset Types API is itself the creator of some already available asset types, the Asset Types API specifies some of the fields for any attribute whose name corresponds to one of those already available asset types. For example, see the discussion of the already available asset type called "data_asset".
Note: there is a GET attribute API that can be used to retrieve just the attributes in the "entity" section of the primary metadata document, instead of the entire primary metadata document as returned by the GET asset API.
"attachments" (optional) field of a Primary Metadata Document
The "attachments" field of a card (ie, primary metadata document) is a JSON array, each item of which contains metadata for one attachment. (See the bottom red rectangle in the parts figure.)
The following items are involved in an attachment:
- the "attachments" array in the primary metadata document
- an attachment item in the "attachments" array
- a metadata document that will be returned from a call to the GET Attachment API. That metadata document will contain information that points to, and can be used to retrieve, either:
  - the asset resource being described by the primary metadata document
  - an extended metadata document stored in the catalog's bucket and containing extended metadata for the asset resource
Each attribute in the "entity" field can have a corresponding attachment item in the "attachments" array. An attribute and its corresponding attachment item are related to each other by using the name of the attribute as the value for the attachment item's "asset_type" field. For example, notice in the following card snippet how the attribute name "data_asset" is used to link that "data_asset" attribute to its attachment item in the "attachments" array:
"entity": {
...other attributes
"data_asset": { // <-- attribute's name matches its...
...
},
...other attributes
},
"attachments": [
...other attachment items
{
...
"asset_type": "data_asset", // <-- ...attachment's asset_type
...
"connection_id": "...", // connection_ fields are one way
"connection_path": "...", // that item points to attached object
...
},
...other attachment items
]
Notice also in the above card snippet that, in this case, the attachment item contains two "connection_..." fields that point to the attachment object located in external storage. So, an attribute has an attachment item which points to an attachment object.
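The link between an attribute and its attachment item can be followed in code by matching on "asset_type". This is a sketch under the assumptions stated in the comments; attachments_for is a hypothetical helper and the identifiers in the sample card are made up.

```python
def attachments_for(card, attribute_name):
    """Return the attachment items whose "asset_type" equals the attribute's name."""
    return [
        item for item in card.get("attachments", [])
        if item.get("asset_type") == attribute_name
    ]

# Hypothetical, trimmed card; the IDs and paths are placeholders.
card = {
    "entity": {"data_asset": {}, "data_profile": {}},
    "attachments": [
        {"asset_type": "data_asset", "connection_id": "conn-1", "connection_path": "bucket/Sample.csv"},
        {"asset_type": "data_profile", "object_key": "data_profile_1"},
    ],
}
for name in card["entity"]:
    print(name, len(attachments_for(card, name)))
```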
Like the fields of "metadata", the fields of an attachment item are specified by the Assets API. Some of the most important fields in an attachment item are:
- "asset_type":
  - describes the type of the attachment
  - figuratively connects the attachment item to the attribute with the same name
- "connection_id" and "connection_path" (optional):
  - this pair of fields specifies the ID of a WDP Connection and a path in the associated data repository that points to the attached object
  - always used for an attached asset like a database table
  - can also be used for an attached asset resource (eg, spreadsheet) that can be stored in the catalog's bucket
  - the presence of these two fields means the attachment will be known as a remote attachment
- "object_key" and "handle" (optional):
  - the presence of these two fields means the attachment will be known as a referenced attachment
For any attachment, only one of the following two pairs of fields will be used: "connection_id" and "connection_path" (ie, remote attachment), or "object_key" and "handle" (ie, referenced attachment).
Interestingly, being remote does not tell you whether or not an attachment is in the catalog. Remote only tells you how the attached object can be retrieved: by using a connection.
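As a sketch of the remote versus referenced distinction, the following hypothetical helper classifies an attachment item by which pair of fields it carries. The sample items are trimmed and their values are placeholders.

```python
def attachment_kind(item):
    """Classify an attachment item as "remote", "referenced", or "unknown"."""
    has_connection = "connection_id" in item and "connection_path" in item
    has_object = "object_key" in item and "handle" in item
    if has_connection and not has_object:
        return "remote"      # retrieved by using a connection
    if has_object and not has_connection:
        return "referenced"  # retrieved by using the object key and handle
    return "unknown"         # neither pair, or both, which the text says should not happen

remote_item = {"asset_type": "data_asset", "connection_id": "conn-1", "connection_path": "bucket/Sample.csv"}
referenced_item = {"asset_type": "data_profile", "object_key": "data_profile_1", "handle": {"bucket": "b"}}
print(attachment_kind(remote_item), attachment_kind(referenced_item))
```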
An attachment item (in the card) points to one of two kinds of attached object (in external storage):
1) an asset, or
2) an extended metadata document.
Those are briefly discussed in the next 2 sections.
Asset Resource Attachment
The most typical attachment object is the asset resource being described by the card.
Follow the green arrows in the Parts of a Primary Metadata Document figure to see how:
- the asset's type name leads to
- an attribute name, which leads to
- a primary attribute, which leads to
- an attachment metadata item for that attribute, which finally leads to
- the attached asset resource.
For a full example that shows an attachment metadata item for an attached csv file, see the (only) item in the "attachments" array in Get Asset - CSV File - Response Body - Before Profiling.
Extended Metadata Document Attachment(s)
The other kind of attachment objects are extended metadata documents. A card can have 0, 1, or many attached extended metadata documents. These documents each contain a related set of (additional) metadata describing the asset resource. Extended metadata documents are stored externally in the catalog's bucket.
See the underlined "data_profile" type name in the Parts of a Primary Metadata Document figure for a visualization of how, for one extended metadata document, the three parts ("metadata", "entity", "attachments") of a card are related to each other.
See the second item in the "attachments" array in Get Asset - CSV File - Response Body - After Profiling for an example showing an attachment item for a "data_profile" extended metadata document.
Uses of "asset_type" value
From the previous sections, you can see that the "asset_type" value shows up in:
- the "metadata.asset_type" field
- the "metadata.asset_attributes" array
- a field (ie, object) in the "entity" field. This object is the primary attribute.
- the asset_type field of the primary attribute's attachment (if such an attachment exists, which it typically does). This (primary) attachment will be the asset resource (eg, database table, spreadsheet, csv file, etc.).
For example, see the Parts of a Primary Metadata Document figure above, where the name of the primary attribute is, in this case, "data_asset" and is highlighted with green rectangles in all the places it's used. The path shown by the green arrows in the figure starts at the "metadata.asset_type" field and ends at the asset resource, in this case a file called Sample.csv.
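The path from "metadata.asset_type" to the asset resource can be walked in a few lines. This is an illustrative sketch only: trace_primary is a hypothetical helper, and the sample card is trimmed with placeholder values.

```python
def trace_primary(card):
    """Follow the "asset_type" value through the four places it appears in a card."""
    metadata = card["metadata"]
    asset_type = metadata["asset_type"]
    # The first attachment item carrying the same asset_type, if any.
    attachment = next(
        (a for a in card.get("attachments", []) if a.get("asset_type") == asset_type),
        None,
    )
    return {
        "asset_type": asset_type,
        "listed_in_asset_attributes": asset_type in metadata.get("asset_attributes", []),
        "primary_attribute": card["entity"].get(asset_type),
        "primary_attachment": attachment,
    }

# Hypothetical, trimmed card.
card = {
    "metadata": {"asset_type": "data_asset", "asset_attributes": ["data_asset"]},
    "entity": {"data_asset": {"mime_type": "text/csv"}},
    "attachments": [{"asset_type": "data_asset", "connection_path": "bucket/Sample.csv"}],
}
trace = trace_primary(card)
print(trace["asset_type"], trace["primary_attachment"]["connection_path"])
```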
Other Assets API Objects
Finally, here is a brief list of some of the remaining objects that can be manipulated with the Assets APIs:
- owner
- the owner of the asset
- collaborators
- users who are allowed to see and possibly edit (some parts of) the asset
- perms
- permissions for viewing / editing an asset
- ratings
- indications of how popular or useful the asset is
- stats
- statistics on how often and when the asset was viewed or edited, and who did that viewing or editing.
Getting an Asset
It's important to understand that the GET Asset API does not return an asset resource like a database table, a spreadsheet, a csv file, etc. Instead, it returns a primary metadata document (ie, card) that describes an asset resource.
Obviously, a primary metadata document (ie, card) must have been created before it can be retrieved. Still, it's instructive to see actual examples of a card and its parts before attempting to create those things. After all, many users will retrieve cards that were previously created by someone else.
This and the following sections show how to retrieve asset metadata and attachments (eg, an asset resource and extended metadata documents).
Getting an Asset - for a Connection
We'll start by retrieving a common primary metadata document (ie, card): one for a "connection" asset type. This is a simple card because it has no attachments. That makes it an easy example to start with, even though many of the other cards you'll encounter do have attachments.
Use the following GET Asset API to retrieve the primary metadata document for a connection. Note that this requires that you know and supply the IDs of both the primary metadata document (ie, card) and of the catalog that contains the card. Either someone has given you both of those IDs or you can browse to the asset's page using the Watson Knowledge Catalog UI and then extract both the catalog ID and the primary metadata document ID from within the URL in the browser's address bar.
Getting an Asset - Request URL:
GET {service_URL}/v2/assets/{asset_id}?catalog_id={catalog_id}
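A minimal Python sketch of this request, using only the standard library, might look as follows. The function names are hypothetical. It assumes that {service_URL} is the API endpoint listed in the introduction and that the IAM token is sent as an "Authorization: Bearer ..." header; verify both against your environment before relying on them.

```python
import json
import urllib.parse
import urllib.request

# Assumption: {service_URL} is the API endpoint listed in the introduction.
SERVICE_URL = "https://api.dataplatform.cloud.ibm.com"

def get_asset_url(asset_id, catalog_id, service_url=SERVICE_URL):
    """Build the GET Asset request URL for a card in a catalog."""
    query = urllib.parse.urlencode({"catalog_id": catalog_id})
    return f"{service_url}/v2/assets/{urllib.parse.quote(asset_id, safe='')}?{query}"

def get_asset(asset_id, catalog_id, bearer_token):
    """Fetch and decode a primary metadata document (not called in this sketch)."""
    request = urllib.request.Request(
        get_asset_url(asset_id, catalog_id),
        # Assumption: the IAM token is passed as a standard Bearer authorization header.
        headers={"Authorization": f"Bearer {bearer_token}", "Accept": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)

print(get_asset_url("my-asset-id", "my-catalog-id"))
```

For a card stored in a project rather than a catalog, the query parameter would differ. That variant is not shown here.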
The following is the primary metadata document (ie, card) that's returned.
Note: you may find it helpful to look at the Parts of a Primary Metadata Document Figure before looking at the following Response Body.
Getting an Asset - Connection - Response Body:
{
"metadata": {
"rov": {
"mode": 0,
"collaborator_ids": {}
},
"usage": {
"last_updated_at": "2018-11-06T17:40:37Z",
"last_updater_id": "IBMid-___",
"last_update_time": 1541526037227,
"last_accessed_at": "2018-11-06T17:40:37Z",
"last_access_time": 1541526037227,
"last_accessor_id": "IBMid-___",
"access_count": 0
},
"name": "ConnectionForCSVFile",
"description": "Connection for CSV file",
"tags": [],
"asset_type": "connection",
"origin_country": "us",
"rating": 0,
"total_ratings": 0,
"catalog_id": "c6f3cbd8-2b7f-42fb-aa60-___",
"created": 1541526037227,
"created_at": "2018-11-06T17:40:37Z",
"owner_id": "IBMid-___",
"size": 0,
"version": 2,
"asset_state": "available",
"asset_attributes": [
"connection"
],
"asset_id": "070e9be2-40a8-4e0e-___",
"asset_category": "SYSTEM"
},
"entity": {
"connection": {
"datasource_type": "193a97c1-4475-4a19-b90c-295c4fdc6517",
"context": "source,target",
"properties": {
"bucket": "catalogforgettingsta___",
"secret_key": "{wdpaes}12345___=",
"api_key": "{wdpaes}eo/12345_=",
"resource_instance_id": "crn:v1:bluemix:public:cloud-object-storage:global:a/12345c___:7240b198-b0f6-___::",
"access_key": "12345___",
"region": "us-geo",
"url": "https://s3.us-south.objectstorage.softlayer.net"
},
"flags": []
}
},
"href": "https://api.dataplatform.cloud.ibm.com/v2/assets/070e9be2-40a8-4e0e-___?catalog_id=c6f3cbd8-2b7f-42fb-aa60-___"
}
The above response has two of the three primary groups of metadata that were described in the Primary Metadata Document section: "metadata" and "entity".
As discussed in the Assets API Overview section, the contents of the "metadata" field are common to all primary metadata documents (ie, cards). The set of fields in "metadata" is completely defined by the Assets API. The values for some of those fields must be provided by the creator of the card, while other fields' values will be populated by various Assets APIs during the life of the card. Note the following fields' values in particular:
- "metadata" fields whose values are provided by the creator of the card:
  - "name": "ConnectionForCSVFile"
  - "description": "Connection for CSV file"
  - "asset_type": "connection"
  - "asset_attributes": ["connection"]
- "metadata" fields whose values are set by various Assets APIs during the life of the card:
  - "usage": contains various statistics describing usage of the card/asset
  - "catalog_id": the ID of the catalog that contains the card
  - "created_at": the time and date at which the card was created
  - "asset_id": the ID of the card (not the asset resource)
For more info about the "metadata" fields, see the discussion on "metadata" in the Assets API Overview section above.
The contents of the "entity" field are only partially defined by the Assets API. In particular, the "entity" field shown in the above card contains a field whose name must match the value in "metadata.asset_type", in this case, "connection". That field is the primary attribute.
On the other hand, both the names and the values of all the fields inside the primary attribute "entity.connection" are completely determined by the creator of the "connection" asset type and the creator of the "connection" attribute. The Assets API does not, in general, decide what fields go inside the primary attribute (or any other attribute). In the example "connection" attribute above, some of the more interesting fields are:
- "datasource_type" - specifies the ID of the type of the data source to which a connection will be formed.
- "properties" - specifies connection metadata specific to the type of the datasource. The exact contents of this field will change according to the type of the datasource.
For more info on the contents of "entity" in general, see the discussion on "entity" in the Assets API Overview section.
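When inspecting a connection card, it can be convenient to pull out the primary attribute and mask credential-like values before logging them. The sketch below is illustrative only: summarize_connection is a hypothetical helper, and the set of sensitive property names is an assumption taken from the example response above, not a list defined by the API.

```python
# Assumption: these names come from the example response only; other
# datasource types may use different property names.
SENSITIVE_KEYS = {"secret_key", "api_key", "access_key"}

def summarize_connection(card):
    """Return the datasource type and a masked copy of the connection properties."""
    if card["metadata"]["asset_type"] != "connection":
        raise ValueError("not a connection card")
    connection = card["entity"]["connection"]  # the primary attribute
    properties = {
        key: ("***" if key in SENSITIVE_KEYS else value)
        for key, value in connection.get("properties", {}).items()
    }
    return {"datasource_type": connection["datasource_type"], "properties": properties}

# Hypothetical, trimmed connection card with placeholder values.
card = {
    "metadata": {"asset_type": "connection"},
    "entity": {"connection": {
        "datasource_type": "example-type-id",
        "properties": {"bucket": "my-bucket", "api_key": "not-a-real-key"},
    }},
}
print(summarize_connection(card))
```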
Notice the above card contains no "attachments" array. That means there is no attached asset resource associated with this card. A natural question is: how can "connection" asset metadata exist for, or describe, a non-existent "connection" asset resource? Actually, a "connection" asset resource does exist, but only when the metadata in the connection's primary metadata document is used to create a client-server connection at runtime.
Get Asset - for a CSV File
This section shows a far more typical example in which the primary metadata document (ie, card) does have an attached asset resource - in this case, a csv file named Sample.csv. Here are the contents of the Sample.csv file:
Sample.csv file contents
Name,Number
abc,123
def,456
Use the GET Asset API to retrieve the asset metadata for the Sample.csv asset resource. Note: the GET Asset API only returns a primary metadata document (ie, card) that describes the Sample.csv file - it does not return the actual Sample.csv file.
Get Asset - Request URL:
GET {service_URL}/v2/assets/{asset_id}?catalog_id={catalog_id}
It's instructive to show two different versions of the primary metadata document for the Sample.csv asset:
- Before profiling (which returns a small metadata document - without extended metadata)
- After profiling (which returns a much larger metadata document - with extended metadata)
Note: you may find it helpful to look at the Parts of a Primary Metadata Document Figure before looking at either of the following two Get Asset Response Bodies.
Here is the smaller primary metadata document that exists before the Profile API is invoked on the Sample.csv file.
Get Asset - CSV File - Response Body - Before Profiling:
{
"metadata": {
"name": "Sample.csv",
"description": "A simple csv file.",
"asset_type": "data_asset",
"rov": {
"mode": 0,
"collaborator_ids": {}
},
"usage": {
"last_updated_at": "2018-11-06T17:45:23Z",
"last_updater_id": "IBMid-___",
"last_update_time": 1541526323713,
"last_accessed_at": "2018-11-06T17:45:23Z",
"last_access_time": 1541526323713,
"last_accessor_id": "IBMid-___",
"access_count": 0
},
"origin_country": "united states",
"rating": 0,
"total_ratings": 0,
"catalog_id": "c6f3cbd8-2b7f-42fb-aa60-___",
"created": 1541526321437,
"created_at": "2018-11-06T17:45:21Z",
"owner_id": "IBMid-___",
"size": 0,
"version": 2,
"asset_state": "available",
"asset_attributes": [
"data_asset"
],
"asset_id": "45f4ab8c-37d5-45a1-8adf-___",
"asset_category": "USER"
},
"entity": {
"data_asset": {
"mime_type": "text/csv",
"dataset": false
}
},
"attachments": [
{
"id": "b8c7a390-e857-4c34-add8-___",
"version": 2,
"asset_type": "data_asset",
"name": "remote",
"description": "remote",
"connection_id": "070e9be2-40a8-4e0e-___",
"connection_path": "catalogforgettingsta-datacatalog-r1s___/data_asset/Sample_SyjEQUy6m.csv",
"create_time": 1541526323713,
"size": 0,
"is_remote": true,
"is_managed": false,
"is_referenced": false,
"is_object_key_read_only": false,
"is_user_provided_path_key": true,
"transfer_complete": true,
"is_partitioned": false,
"complete_time_ticks": 1541526323713,
"user_data": {},
"test_doc": 0,
"usage": {
"access_count": 0,
"last_accessor_id": "IBMid-___",
"last_access_time": 1541526323713
}
}
],
"href": "https://api.dataplatform.cloud.ibm.com/v2/assets/45f4ab8c-37d5-45a1-8adf-___?catalog_id=c6f3cbd8-2b7f-42fb-aa60-___"
}
The above primary metadata document has all three primary groups of metadata ("metadata", "entity", and "attachments") that were described in the Assets API Overview section.
The contents of the "metadata" field are very similar to those shown above for the Connection card example. The most important difference is the value that the user specified as the "asset type" for the Sample.csv asset, namely "data_asset". That asset type name shows up in two places inside the "metadata" section of the primary metadata document:
- "metadata":
  - "asset_type": "data_asset"
  - "asset_attributes": ["data_asset"]
As discussed in the Attributes section, the fact that "metadata.asset_type" has the value "data_asset" means the "entity" field of the card must contain a primary attribute called "data_asset". The Asset Types API provides the predefined asset type "data_asset". That "data_asset" type definition declares that there are two mandatory fields in a "data_asset" attribute: "mime_type" and "dataset", as can be seen in the card above and repeated here:
- "entity":
  - "data_asset":
    - "mime_type": "text/csv"
      - specifies the mime type of the asset resource. Here, the mime type indicates that the asset resource is a text csv file.
    - "dataset": false
      - false because there is no "columns" field in this primary attribute.
      - Note: false does not mean there are no columns in the asset resource. Clearly, our Sample.csv file does have columns. The problem here is that no one has (yet) told the card that the asset resource has columns. Compare this "data_asset" attribute to the one shown in the next example Get Asset - CSV File - Response Body - After Profiling, where the value of "dataset" has been changed to true, and the primary attribute does have a "columns" field.
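The relationship between "dataset" and "columns" can be expressed as a small lookup. This is a sketch only: column_names is a hypothetical helper, and the two sample attributes are trimmed versions of the before-profiling and after-profiling examples in this section.

```python
def column_names(card):
    """Return the column names the card knows about, or None when "dataset" is false."""
    attribute = card["entity"]["data_asset"]  # the primary attribute
    if not attribute.get("dataset"):
        # The card has not been told about columns, even if the file has them.
        return None
    return [column["name"] for column in attribute.get("columns", [])]

before = {"entity": {"data_asset": {"mime_type": "text/csv", "dataset": False}}}
after = {"entity": {"data_asset": {
    "mime_type": "text/csv",
    "dataset": True,
    "columns": [{"name": "Name"}, {"name": "Number"}],
}}}
print(column_names(before), column_names(after))
```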
Unlike in the Connection card example above, the card for the Sample.csv file does have an "attachments" field. In this case, the "attachments" array has one item in it. That item contains metadata that points to the attached asset resource (ie, the Sample.csv file). Some of the more interesting fields in that attachment item are:
- "id": "b8c7a390-e857-4c34-add8-___"
  - identifies the metadata document that points to the attached asset resource
- "asset_type": "data_asset"
  - matches the name of the primary attribute in "entity", so linking the primary attribute to this attachment item and designating this item as the item that points to the asset resource.
- "connection_id": "070e9be2-40a8-4e0e-___"
  - identifies a connection primary metadata document (ie, card) which contains credentials and other info that can be used to connect to the external repository that contains the attached asset resource (ie, the "Sample.csv" file)
  - not coincidentally, the particular connection card referred to by "070e9be2-40a8-4e0e-___" is the exact same connection card shown above in the Getting an Asset - for a Connection section
- "connection_path": "catalogforgettingsta-datacatalog-r1s___/data_asset/Sample_SyjEQUy6m.csv"
  - identifies the path in the external repository that contains the attached asset (ie, the "Sample.csv" file)
- "is_remote": true
  - as discussed in the "attachments" overview section, is_remote is true because "connection_id" and "connection_path" are being used to describe how to get the Sample.csv asset resource.
- "is_referenced": false (at most one of "is_referenced" and "is_remote" will be true)
Get Asset - CSV File - Response Body - After Profiling:
Now, let's compare what GET {service_URL}/v2/assets/{asset_id}?catalog_id={catalog_id} returns for the same asset after the Profile API has been invoked on the Sample.csv file:
{
"metadata": {
"rov": {
"mode": 0,
"collaborator_ids": {}
},
"usage": {
"last_updated_at": "2018-11-12T15:33:34Z",
"last_updater_id": "iam-ServiceId-12345___",
"last_update_time": 1542036814782,
"last_accessed_at": "2018-11-12T15:33:34Z",
"last_access_time": 1542036814782,
"last_accessor_id": "iam-ServiceId-12345___",
"access_count": 0
},
"name": "Sample.csv",
"description": "Simple csv file for experiment for getting started document.",
"tags": [],
"asset_type": "data_asset",
"origin_country": "united states",
"rating": 0,
"total_ratings": 0,
"catalog_id": "c6f3cbd8-2b7f-42fb-aa60-___",
"created": 1541526321437,
"created_at": "2018-11-06T17:45:21Z",
"owner_id": "IBMid-___",
"size": 9238,
"version": 2,
"asset_state": "available",
"asset_attributes": [
"data_asset",
"data_profile"
],
"asset_id": "45f4ab8c-37d5-45a1-8adf-___",
"asset_category": "USER"
},
"entity": {
"data_asset": {
"mime_type": "text/csv",
"dataset": true,
"columns": [
{
"name": "Name",
"type": {
"type": "varchar",
"length": 1024,
"scale": 0,
"nullable": true,
"signed": false
}
},
{
"name": "Number",
"type": {
"type": "varchar",
"length": 1024,
"scale": 0,
"nullable": true,
"signed": false
}
}
]
},
"data_profile": {
"971e9c66-be4c-44b4-91f3-___": {
"metadata": {
"guid": "971e9c66-be4c-44b4-91f3-___",
"asset_id": "971e9c66-be4c-44b4-91f3-___",
"dataset_id": "45f4ab8c-37d5-45a1-8adf-___",
"url": "https://api.dataplatform.cloud.ibm.com/v2/data_profiles/971e9c66-be4c-44b4-91f3-___?catalog_id=c6f3cbd8-2b7f-42fb-aa60-___&dataset_id=45f4ab8c-37d5-45a1-8adf-___",
"catalog_id": "c6f3cbd8-2b7f-42fb-aa60-___",
"created_at": "2018-11-12T15:32:53.902Z",
"accessed_at": "2018-11-12T15:32:53.902Z",
"owner_id": "IBMid-___",
"last_updater_id": "IBMid-___"
},
"entity": {
"data_profile": {
"options": {
"disable_profiling": false,
"max_row_count": 5000,
"max_distribution_size": 100,
"max_numeric_stats_bins": 200,
"classification_options": {
"disabled": false,
"use_all_ibm_classes": true,
"ibm_class_codes": [],
"custom_class_codes": []
}
},
"execution": {
"status": "finished",
"is_supported": true,
"dataflow_id": "3f1ace02-4d40-451d-9bc7-___",
"dataflow_run_id": "f774f92f-5a61-49ca-8a68-___"
},
"columns": [],
"attachment_id": "8d614be0-6900-403b-ab50-___"
}
},
"href": "https://api.dataplatform.cloud.ibm.com/v2/data_profiles/971e9c66-be4c-44b4-91f3-___?catalog_id=c6f3cbd8-2b7f-42fb-aa60-___&dataset_id=45f4ab8c-37d5-45a1-8adf-___"
},
"attribute_classes": [
"NoClassDetected",
"Organization Name"
]
}
},
"attachments": [
{
"id": "b8c7a390-e857-4c34-add8-___",
"version": 2,
"asset_type": "data_asset",
"name": "remote",
"description": "remote",
"connection_id": "070e9be2-40a8-4e0e-___",
"connection_path": "catalogforgettingsta-datacatalog-r1s___/data_asset/Sample_SyjEQUy6m.csv",
"create_time": 1541526323713,
"size": 0,
"is_remote": true,
"is_managed": false,
"is_referenced": false,
"is_object_key_read_only": false,
"is_user_provided_path_key": true,
"transfer_complete": true,
"is_partitioned": false,
"complete_time_ticks": 1541526323713,
"user_data": {},
"test_doc": 0,
"usage": {
"access_count": 0,
"last_accessor_id": "IBMid-___",
"last_access_time": 1541526323713
}
},
{
"id": "8d614be0-6900-403b-ab50-___",
"version": 2,
"asset_type": "data_profile",
"name": "data_profile_971e9c66-be4c-44b4-91f3-___",
"object_key": "data_profile_971e9c66-be4c-44b4-91f3-___",
"create_time": 1542036813627,
"size": 9238,
"is_remote": false,
"is_managed": false,
"is_referenced": true,
"is_object_key_read_only": false,
"is_user_provided_path_key": true,
"transfer_complete": true,
"is_partitioned": false,
"complete_time_ticks": 1542036813627,
"user_data": {},
"test_doc": 0,
"handle": {
"bucket": "catalogforgettingsta-datacatalog-r1s___",
"location": "us-geo",
"key": "data_profile_971e9c66-be4c-44b4-91f3-___",
"upload_id": "done",
"max_part_num": 1
},
"usage": {
"access_count": 0,
"last_accessor_id": "iam-ServiceId-12345___",
"last_access_time": 1542036813627
}
}
],
"href": "https://api.dataplatform.cloud.ibm.com/v2/assets/45f4ab8c-37d5-45a1-8adf-___?catalog_id=c6f3cbd8-2b7f-42fb-aa60-___"
}
Let's look at a few of the most important differences between the primary metadata document for the Sample.csv file before and after profiling:
- "metadata":
  - "asset_attributes": ["data_asset", "data_profile"]
    - Note that the "data_profile" attribute name has been added.
- "entity":
  - "data_asset":
    - "columns": the Profile API has added the "columns" field to the data_asset attribute.
    - "dataset": the Profile API caused this to change from false to true because of the newly added "columns" field.
  - "data_profile":
    - this "data_profile" attribute is entirely new, and was added by the Profile API.
    - the name of this secondary attribute matches the name of the secondary asset type "data_profile", which was (previously) created by the Profile API.
    - the contents of this "data_profile" attribute were entirely decided by the Profile API, not by the Assets API.
    - this attribute contains a lot of extended metadata about the "data_profile" run that produced a "data_profile" extended metadata document.
- "attachments":
  - a new item has been added to the "attachments" array
  - that new item contains the following metadata about an extended metadata document:
    - "id": "8d614be0-6900-403b-ab50-___"
    - "asset_type": "data_profile". Note that the value "data_profile" matches the name of the "data_profile" attribute that this attachment item belongs to, thereby linking the attachment item and the attribute.
    - "handle": contains various fields pointing to the actual attached extended metadata document, which is located in some external repository. That extended metadata document will contain a great deal more metadata about the asset resource, that is, about the "Sample.csv" file.
The next section shows how to retrieve the extended metadata document that's referred to by the new "data_profile" "attachments" item just described.
Get Attachment - Extended Metadata Document:
The following example builds on the GET Asset example from the previous section and shows how to retrieve an attachment that is an extended metadata document.
An attachment can be retrieved in 4 steps.
Step 1: Choose the asset_type of the attachment you want to retrieve.
The only choices you have for asset_type in a given primary metadata document are listed in that document's "metadata.asset_attributes" field. In the example above those values are:
- "data_asset"
- "data_profile"
The asset_type of the extended metadata document we want is "data_profile".
Step 2: Get the "id" of the "attachments" item whose "asset_type" field has the value you chose in Step 1.
In the primary metadata document, look for the only "attachments" item whose "asset_type" field has the value you chose in Step 1, namely "data_profile". In our example primary metadata document above, that "attachments" item has the "id" value "8d614be0-6900-403b-ab50-___".
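Steps 1 and 2 can be sketched in a few lines of Python. This is only an illustration that operates on an already retrieved primary metadata document held as a dictionary; the helper name `find_attachment_id` is hypothetical and not part of any Watson Data API client.

```python
def find_attachment_id(card: dict, asset_type: str) -> str:
    """Return the "id" of the single attachment with the given asset_type."""
    # Step 1: the chosen asset_type must be listed in "metadata.asset_attributes".
    allowed = card["metadata"]["asset_attributes"]
    if asset_type not in allowed:
        raise ValueError(f"{asset_type!r} is not one of {allowed}")
    # Step 2: find the "attachments" item whose "asset_type" matches.
    matches = [
        item["id"]
        for item in card.get("attachments", [])
        if item.get("asset_type") == asset_type
    ]
    if len(matches) != 1:
        raise LookupError(f"expected one {asset_type!r} attachment, found {len(matches)}")
    return matches[0]


# Trimmed copy of the "After Profiling" sample document shown above.
card = {
    "metadata": {"asset_attributes": ["data_asset", "data_profile"]},
    "attachments": [
        {"id": "b8c7a390-e857-4c34-add8-___", "asset_type": "data_asset"},
        {"id": "8d614be0-6900-403b-ab50-___", "asset_type": "data_profile"},
    ],
}
print(find_attachment_id(card, "data_profile"))
```

Passing "data_asset" instead selects the attachment that points at the asset resource itself.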
Step 3: Invoke the Get Attachment API to get attachment metadata for the attached extended metadata document.
Get Asset Attachment - Request URL:
GET /v2/assets/{asset_id}/attachments/{attachment_id}
The values for the above URL parameters are obtained as follows:
- {asset_id}: the value that appears in the "metadata.asset_id" field of the above primary metadata document, namely "45f4ab8c-37d5-45a1-8adf-___"
- {attachment_id}: the value of "id" that was obtained in Step 2, namely "8d614be0-6900-403b-ab50-___".
Invoke the above GET Attachment API with the above values, which will return an attachment metadata document as shown in the following response body:
Get Asset Attachment - Response Body:
{
"attachment_id": "8d614be0-6900-403b-ab50-___",
"asset_type": "data_profile",
"is_partitioned": false,
"name": "data_profile_971e9c66-be4c-44b4-91f3-___",
"created_at": "2018-11-12T15:33:33Z",
"object_key": "data_profile_971e9c66-be4c-44b4-91f3-___",
"object_key_is_read_only": false,
"bucket": {
"bucket_name": "catalogforgettingsta-datacatalog-r1s___",
"bluemix_cos_connection": {
"viewer": {
"bucket_connection_id": "5b6bc03d-577d-4609-b3a4-___"
},
"editor": {
"bucket_connection_id": "070e9be2-40a8-4e0e-a468-___"
}
}
},
"url": "https://s3.us-south.objectstorage.softlayer.net/catalogforgettingsta-datacatalog-r1s___/data_profile_971e9c66-be4c-44b4-91f3-___?response-content-disposition=attachment%3B%20filename%3D%22data_profile_971e9c66-be4c-44b4-91f3-___%22&X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Date=20190423T162446Z&X-Amz-SignedHeaders=host&X-Amz-Expires=86400&X-Amz-Credential=d2d518b66ac64de___%2F2019___%2Fus-geo%2Fs3%2Faws4_request&X-Amz-Signature=ce7322d7291396c511a6df38635df4e85b7c78c173___",
"transfer_complete": true,
"size": 9238,
"user_data": {},
"creator_id": "iam-ServiceId-12345___",
"usage": {
"access_count": 1,
"last_accessor_id": "IBMid-___",
"last_access_time": 1556036686480
},
"href": "https://api.dataplatform.cloud.ibm.com/v2/assets/45f4ab8c-37d5-45a1-8adf-726c65b68008/attachments/8d614be0-6900-403b-ab50-___?catalog_id=c6f3cbd8-2b7f-42fb-aa60-___"
}
It's important to understand that the GET Attachment API only returns a metadata document that describes where, or how, an attached asset resource or extended metadata document can be accessed or retrieved.
The most important field in the above response is "url", which contains a signed URL that can be used to retrieve the actual extended metadata document. Note that the "url" points to a completely different server than the server that responds to "Assets API" calls! Extended metadata documents are not stored in the catalog.
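As a rough sketch of Step 3, the following Python builds the Get Attachment request URL and then pulls the signed "url" out of an attachment metadata document. The helper names are hypothetical. Passing `catalog_id` as a query parameter is an assumption based on the "href" value in the sample response, and an actual call would also need the IAM bearer token in an `Authorization` header. No network request is made here.

```python
from urllib.parse import quote, urlencode, urlsplit

# Endpoint as it appears in the "href" values of the sample responses.
API_ENDPOINT = "https://api.dataplatform.cloud.ibm.com"


def attachment_metadata_url(asset_id: str, attachment_id: str, catalog_id: str) -> str:
    """Build the Get Attachment URL (hypothetical helper)."""
    path = "/v2/assets/{}/attachments/{}".format(
        quote(asset_id, safe=""), quote(attachment_id, safe="")
    )
    # Assumption: catalog_id is a query parameter, as in the sample "href".
    return API_ENDPOINT + path + "?" + urlencode({"catalog_id": catalog_id})


def signed_url_and_host(attachment_doc: dict) -> tuple:
    """Return the signed "url" and the host name that serves it."""
    signed_url = attachment_doc["url"]
    return signed_url, urlsplit(signed_url).netloc


request_url = attachment_metadata_url(
    "45f4ab8c-37d5-45a1-8adf-___",
    "8d614be0-6900-403b-ab50-___",
    "c6f3cbd8-2b7f-42fb-aa60-___",
)
print(request_url)
```

Comparing the host returned by `signed_url_and_host` with the API endpoint host makes it visible that the extended metadata document is served from object storage, not from the catalog.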
Step 4: Use the "url" in the response from Step 3 to call the relevant server to get the extended metadata document.
The simplest way to use that "url" value is to paste it into the address bar of a browser, and let the browser retrieve the extended metadata document. Here's a peek at some of the contents of the large extended metadata document that can be retrieved using that "url" value. That large extended metadata document was created by the Profile API and contains a great deal of extended metadata about our small Sample.csv file:
{
"summary": {
"version": "1.9.3",
"row_count": 2,
"score": 1,
"score_stats": {
"n": 2,
"mean": 1.0,
"variance": 0.0,
"stddev": 0.0,
"min": 1.0,
"max": 1.0,
"sum": 2.0
},
...
},
"columns": [{
"name": "Name",
"value_analysis": {
"distinct_count": 2,
"null_count": 0,
"empty_count": 0,
"unique_count": 2,
"max_value_frequency": 1,
"min_string": "abc",
"max_string": "def",
"inferred_type": {
"type": {
"length": 3,
"precision": 0,
"scale": 0,
"type": "STRING"
}
},
...
}, {
"name": "Number",
"value_analysis": {
"distinct_count": 2,
"null_count": 0,
"empty_count": 0,
"unique_count": 2,
"max_value_frequency": 1,
"min_string": "123",
"max_string": "456",
"min_number": 123.0,
"max_number": 456.0,
"inferred_type": {
"type": {
"length": 3,
"precision": 3,
"scale": 0,
"type": "INT16"
}
},
...
]
}
Get Attachment - Asset Resource:
The 4 steps given above to retrieve an extended metadata document can also be used to retrieve an asset resource like the Sample.csv file example.
The main difference is that in Step 1 you would choose the asset_type "data_asset" because that is the primary asset type of the primary metadata document, i.e., the asset_type that identifies both the primary attribute and the primary attachment (the asset resource).
Create Asset: book
Before you can create a primary metadata document (i.e., card) the asset type that you want to use for that card must already exist. You can use one of the already available asset types, or you can use an asset type that you have created.
The Create Asset Type: book section shows how to create an asset type named book. In this section, that asset type will be used to create a primary metadata document for a book asset resource. That primary metadata document will have:
- a "metadata.asset_type" field with the value "book"
- a primary attribute called "book".
Use the following endpoint to create a primary metadata document for a book asset resource:
Create Asset: book - Request URL:
POST {service_URL}/v2/assets?catalog_id={catalog_id}
Create Asset: book - Request Body:
{
"metadata": {
"name": "Getting Started with Assets",
"description": "Describes how to create and use metadata for assets",
"tags": ["getting", "started", "documentation"],
"asset_type": "book",
"origin_country": "us",
"rov": {
"mode": 0
}
},
"entity": {
"book": {
"author": {
"first_name": "Tracy",
"last_name": "Smith"
},
"price": 29.95
}
}
}
The above request body specifies the preliminary contents for the primary metadata document about to be created. Most of the fields have been described previously in the Asset's Primary Metadata Document section. However, there are a few things to note in particular about the above request:
- "metadata": you supply the values of only some of the fields that will end up appearing inside the "metadata" field of the primary metadata document about to be created, including:
  - "asset_type": the value "book" matches the name of the asset type for this document
  - "name": the name to use for the asset being described by this document
  - "description": a description for the asset
  Notice that you do not supply a "metadata.asset_attributes" field in the request body. If you include a "metadata.asset_attributes" field in your Create Asset request body then the request will be rejected because it tried to supply a reserved value. The Assets API reserves control of the contents of the "metadata.asset_attributes" field.
- "entity": you supply the entire contents of the "entity" field
  - "book":
    - this is the primary attribute of the primary metadata document
    - the name of this attribute matches the name of the corresponding primary asset type "book"
    - contains metadata describing a book (does not contain the actual book asset resource)
Notice the above "book" attribute doesn't contain a field called "title" - a field which might be expected in an attribute for a book. In this case, we've chosen to put the title of the book in the "metadata.name" field of the card. However, the creator of the "book" attribute is free to include whatever fields they want in that attribute, including a field called "title" if desired.
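The rules above can be captured in a small Python helper that assembles a Create Asset request body. This is a sketch only: `build_create_asset_body` is a hypothetical name, and the client-side check merely mirrors the documented behaviour that a request supplying "metadata.asset_attributes" is rejected.

```python
import json

# Fields the Assets API controls itself, per the note above.
RESERVED_METADATA_FIELDS = {"asset_attributes"}


def build_create_asset_body(metadata: dict, asset_type: str, attribute: dict) -> dict:
    """Assemble a request body whose primary attribute is named after asset_type."""
    reserved = RESERVED_METADATA_FIELDS.intersection(metadata)
    if reserved:
        raise ValueError(f"do not supply reserved metadata fields: {sorted(reserved)}")
    body_metadata = dict(metadata, asset_type=asset_type)
    # The primary attribute name must match "metadata.asset_type".
    return {"metadata": body_metadata, "entity": {asset_type: attribute}}


body = build_create_asset_body(
    metadata={
        "name": "Getting Started with Assets",
        "description": "Describes how to create and use metadata for assets",
        "tags": ["getting", "started", "documentation"],
        "origin_country": "us",
        "rov": {"mode": 0},
    },
    asset_type="book",
    attribute={"author": {"first_name": "Tracy", "last_name": "Smith"}, "price": 29.95},
)
print(json.dumps(body, indent=2))
```

The printed JSON has the same content as the request body shown earlier and could be sent with any HTTP client.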
Create Asset: book - Response Body:
{
"metadata": {
"rov": {
"mode": 0,
"collaborator_ids": {}
},
"usage": {
"last_updated_at": "2019-04-30T14:37:57Z",
"last_updater_id": "IBMid-___",
"last_update_time": 1556635077746,
"last_accessed_at": "2019-04-30T14:37:57Z",
"last_access_time": 1556635077746,
"last_accessor_id": "IBMid-___",
"access_count": 0
},
"name": "Getting Started with Assets",
"description": "Describes how to create and use metadata for assets",
"tags": [
"getting",
"started",
"documentation"
],
"asset_type": "book",
"origin_country": "us",
"rating": 0,
"total_ratings": 0,
"catalog_id": "c6f3cbd8-___",
"created": 1556635077746,
"created_at": "2019-04-30T14:37:57Z",
"owner_id": "IBMid-___",
"size": 0,
"version": 2,
"asset_state": "available",
"asset_attributes": [
"book"
],
"asset_id": "3da5389d-d4a4-43da-be1f-___",
"asset_category": "USER"
},
"entity": {
"book": {
"author": {
"first_name": "Tracy",
"last_name": "Smith"
},
"price": 29.95
}
},
"href": "https://api.dataplatform.cloud.ibm.com/v2/assets/3da5389d-d4a4-43da-be1f-___?catalog_id=c6f3cbd8-___",
"asset_id": "3da5389d-d4a4-43da-be1f-___"
}
Notice that the card returned in the Create Asset Response Body has many more fields than were present in the Request Body. The Create Asset API has added a lot of information to the "metadata" part of the primary metadata document:
- "asset_id": most importantly, the Create Asset API has given your primary metadata document an id
- "owner_id": the API has made the caller of the API the owner of the asset
- "created_at": the API has recorded the time at which the metadata document was created. In general, this is not the same as the time at which an attached asset resource was created (although in this case there is no attached asset resource).
- "total_ratings": contains the number of ratings this asset has received. It is 0 for now because the primary metadata document is brand new.
- "usage": usage statistics. Since this is a brand new card these statistics don't yet contain much interesting data.
- "asset_attributes": notice that the Create Asset API has added the name of the primary attribute to this array.
On the other hand, notice that the Create Asset API did not modify the contents of the "entity" field in any way. In particular, the Create Asset API did not modify the contents of the primary attribute "book".
Your catalog now contains a primary metadata document for a "book" asset resource.
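One way to see what the Create Asset API added is to compare the request and response documents programmatically. The following Python is a minimal sketch using trimmed copies of the bodies above; `summarize_create_response` is a hypothetical helper.

```python
def summarize_create_response(request_body: dict, response_body: dict) -> tuple:
    """Return the metadata fields added by the API and whether "entity" is unchanged."""
    added = sorted(set(response_body["metadata"]) - set(request_body["metadata"]))
    entity_unchanged = response_body["entity"] == request_body["entity"]
    return added, entity_unchanged


book = {"author": {"first_name": "Tracy", "last_name": "Smith"}, "price": 29.95}
request_body = {
    "metadata": {"name": "Getting Started with Assets", "asset_type": "book"},
    "entity": {"book": book},
}
# Trimmed version of the response body shown above.
response_body = {
    "metadata": {
        "name": "Getting Started with Assets",
        "asset_type": "book",
        "asset_attributes": ["book"],
        "asset_id": "3da5389d-d4a4-43da-be1f-___",
        "owner_id": "IBMid-___",
        "created_at": "2019-04-30T14:37:57Z",
    },
    "entity": {"book": book},
}
added_fields, entity_unchanged = summarize_create_response(request_body, response_body)
print(added_fields, entity_unchanged)
```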
Asset Types
Asset Types serve multiple purposes in the Assets API. Asset types fall into two categories:
Primary asset type:
- describes the primary type of an asset
- every primary metadata document (i.e., card) will have exactly one primary asset type, whose name will be stored in the card's "metadata.asset_type" field
- every card will have exactly one primary attribute whose name matches the name of the primary asset type
- a very common example of a primary asset type is the "data_asset" type, examples of which are shown throughout this documentation
Secondary / Extended asset type:
- a secondary / extended asset type describes an inter-related group of additional metadata for an asset resource
- a primary metadata document can have 0, 1, or many secondary / extended asset types
- information for a secondary / extended asset type is stored in a secondary / extended attribute in a primary metadata document
- a very common example of a secondary / extended asset type is "data_profile". See Get Asset - CSV File - Response Body - After Profiling for an example "data_profile" attribute.
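The distinction between primary and secondary attributes can be illustrated with a short Python sketch that splits the attributes of a card according to its "metadata.asset_type" value. `classify_attributes` is a hypothetical helper, and the card is a trimmed copy of the "After Profiling" example.

```python
def classify_attributes(card: dict) -> tuple:
    """Return (primary attribute name, list of secondary attribute names)."""
    primary = card["metadata"]["asset_type"]
    attribute_names = list(card["entity"])
    # Every card has exactly one primary attribute named after its primary type.
    if primary not in attribute_names:
        raise ValueError(f'no primary attribute named "{primary}" in "entity"')
    secondary = [name for name in attribute_names if name != primary]
    return primary, secondary


card = {
    "metadata": {"asset_type": "data_asset"},
    "entity": {"data_asset": {"mime_type": "text/csv"}, "data_profile": {}},
}
print(classify_attributes(card))
```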
The names of various asset types are used in the following ways, all at once, within a single primary metadata document:
- describe the type of an asset resource, via the "metadata.asset_type" field
- describe the type of an object that contains extended information for an asset resource. For example, the type of an extended metadata document via an "attachments[_].asset_type" field.
- assign names and types to attributes in the "entity" field of a primary metadata document
- implicitly tie various related parts of a primary metadata document to each other. For example, see the green rectangles and arrows in the Parts of a Primary Metadata Document Figure.
The content, or definition, of an asset type serves the following purposes:
- tell the catalog what fields of an attribute should be indexed for searching
- specify search paths and cross attribute searching
- specify additional features like relationships and external asset previews (both of which are beyond the scope of this document)
An asset type must exist in the catalog before it can be used for any of the above purposes.
As of this writing there are several asset types available, including the following:
- data_asset
- folder_asset
- policy_transform
- asset_terms
- column_info
- connection
- ai_training_definition
- data_flow
- activity
- notebook
- machine-learning-stream
- dashboard
- data_profile_nlu
You are free to use any of the above asset types. You do not have to create any of them, and you are not allowed to create or overwrite them.
Use the Create Asset Type API to create your own asset type. See Asset Type Fields for an overview of the specification of an asset type. See Create Asset Type: book for an example of creating an asset type.
Asset Type Fields
Here is a description for each of the fields in the definition of an asset type. You supply values for these fields when creating an asset type. You will see those same values returned when you get a list of asset types or get a specific asset type.
- "name":
  - the name and identifier for the asset type
  - should contain only lowercase letters
  - will be used in various places in primary metadata documents (see the Asset Types section for the places where asset type names appear)
  - can be used in catalog searches of attribute contents
- "description": a description for this asset type
- "fields":
  - an array that contains information for the fields in the corresponding attribute that should be indexed for subsequent searches.
  - does not (necessarily) describe all the fields in attributes of this asset type.
  - there must be at least one item in this array. In other words, there must be at least one index for an asset type.
  - see the following Fields Table for a description of the contents of an item in the "fields" array
  - see "fields" and "properties" Note below
- "properties":
  - an object that contains "non-index" information for the fields in the corresponding attribute. This information is typically used by UIs that display/edit assets.
  - does not (necessarily) describe all the fields in attributes of this asset type.
  - see the following Properties Table for a description of the contents of an item in the "properties" object
  - see "fields" and "properties" Note below
- "external_asset_preview": beyond the scope of this document
- "relationships": beyond the scope of this document
Note: "fields" and "properties" can, optionally, both be used to describe the exact same field in an attribute. Whether you use "fields" and/or "properties" depends on what you want to specify for a field. For example, if you're creating an asset type named "person" and a person has a field called "birthdate" (resulting in "entity.person.birthdate" being present in the primary metadata document) then:
- if you want birthdate to be indexed (for searching) then you would include an entry in the "fields" array for birthdate
- if you want a UI to understand/display the birthdate properly then you would include an entry in the "properties" object for that same birthdate field
See this example which shows both an example "fields" array and an example "properties" object.
Fields Table:
Key | Description | Example | Required |
---|---|---|---|
key | the name of both the field that will appear in an attribute for this asset type, and the name of the corresponding index for that attribute field | data_asset.mime_type | Yes |
type | the data type of the field being indexed | boolean, or number, or string | Yes |
facets | beyond the scope of this document | true or false | No. Defaults to false. |
search_path | a json path that locates a field in the attribute | See Search Path Examples below. | Yes |
is_searchable_across_types | specifies whether this field can be used in a query without specifying the asset type | true or false | No. Defaults to false. |
Properties Table:
Name | Type | Description |
---|---|---|
type | String | Specifies the data type for the property. This value is required. Possible types are: string, number |
description | String | A displayable string to describe the property. |
is_array | boolean | true if the property value is multi-valued (json array). |
required | boolean | true if the property requires a value to be set. |
hidden | boolean | true if the application UI should not display the property or value. |
readonly | boolean | true if the property should not be changed once set. |
default_value | matches the "type" | A value that should be set if no value is provided when the asset attribute is created. |
placeholder | string | A string an application UI can use as a prompt before a value is entered. |
values | array, elements matching "type" | An array of allowed values for the property. Used to describe a limited enumeration or "choice list". |
minimum | integer/number | For an integer or number property, the minimum allowed value. |
maximum | integer/number | For an integer or number property, the maximum allowed value. If both minimum and maximum are specified, minimum must be less than or equal to maximum. |
min_length | integer | For a string property, the minimum allowed length. If specified, must be greater than or equal to zero. |
max_length | integer | For a string property, the maximum allowed length. If specified, must be greater than or equal to zero. If both min_length and max_length are specified, min_length must be less than or equal to max_length. |
properties | object | For a property of type 'object', the recursive definition of the properties, described as in this table. This allows describing nested object-valued properties. |
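To make the "person" and "birthdate" note concrete, here is a hedged Python sketch of an asset type definition that describes the same field in both "fields" and "properties". The "person" type, its values, and the `check_asset_type` helper are all hypothetical; the checks only encode the rules stated above (at least one "fields" item, and the keys marked as required in the Fields Table).

```python
import json

# Keys marked "Required: Yes" in the Fields Table.
REQUIRED_FIELD_KEYS = ("key", "type", "search_path")

# Hypothetical "person" asset type: birthdate is indexed and described for UIs.
person_type = {
    "name": "person",
    "description": "Person asset type (illustrative only)",
    "fields": [
        {
            "key": "birthdate",
            "type": "string",
            "search_path": "birthdate",
            "is_searchable_across_types": False,
        }
    ],
    "properties": {
        "birthdate": {
            "type": "string",
            "description": "Date of birth",
            "required": False,
        }
    },
}


def check_asset_type(definition: dict) -> list:
    """Return a list of problems found in an asset type definition."""
    problems = []
    fields = definition.get("fields", [])
    if not fields:
        problems.append('"fields" must contain at least one item')
    for position, field in enumerate(fields):
        for required in REQUIRED_FIELD_KEYS:
            if required not in field:
                problems.append(f'fields[{position}] is missing "{required}"')
    return problems


print(json.dumps(person_type, indent=2))
print(check_asset_type(person_type))
```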
Search Path Examples
See the request body in Create Asset Type: book for an example of where a search path is used in the definition of an asset type.
- Note: when you specify a search path in the definition of an asset type's "field", you only specify the path within the correspondingly named attribute. You needn't specify the attribute name. For example, if you have an attribute called "book" that has a field called "author.last_name" within it, you only need to specify "author.last_name" as the search path - not "book.author.last_name".
See Search Asset Type: attribute - book for an example of where a search path is used in the body of a search.
- Note: when you specify a search path in the body of a search you must specify the name of the attribute being searched. For example, if you have an attribute called "book" that has a field called "author.last_name" within it, you would include the name of the attribute in the search path: "book.author.last_name".
- "price": a simple path contains just the name of the field to be searched. In this case the attribute being searched should have a simple field called "price".
- "tags[]": traverse a json array called "tags". Because tags[] is not followed by any further names it must be a basic type (e.g. string, boolean, or number), and so its elements will be indexed directly.
- "asset_terms[].name": this search path indicates a path starting with a json object named "term_assignments" at the top, traversing through a json array named asset_terms (you use the [] at the end of the field name to indicate it's an array), landing on another json object that has a field called "name". The "name" field will be indexed.
- "asset_terms[0].name": same as above but only the first element in the "asset_terms" array will be traversed.
- "columns.*.tags[]": traverse an object called "columns" followed by any column name (the '*' indicates a wildcard), followed by a json array called "tags". Because tags[] is not followed by any further names it must be a basic type (e.g. string, boolean, or number), and so its elements will be indexed directly.
- "column_tags.*[]": the json object "column_tags" contains a series of arrays indicated by *[]. The name of the array object doesn't matter - we want to index it.
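The path syntax described above can be explored with a small resolver written in Python. This is purely an illustration of how the examples read, not the catalog's indexing implementation: `resolve_search_path` is a hypothetical function, and it supports only plain names, `name[]`, `name[N]`, `*` and `*[]`.

```python
import re

# One path segment: a name or '*', optionally followed by [] or [N].
_SEGMENT = re.compile(r"^(?P<name>\*|\w+)(?:\[(?P<index>\d*)\])?$")


def resolve_search_path(attribute: dict, path: str) -> list:
    """Return every value inside `attribute` that `path` selects."""
    current = [attribute]
    for segment in path.split("."):
        match = _SEGMENT.match(segment)
        if match is None:
            raise ValueError(f"unsupported path segment: {segment!r}")
        name, index = match.group("name"), match.group("index")
        selected = []
        for value in current:
            if not isinstance(value, dict):
                continue
            if name == "*":
                children = list(value.values())  # wildcard: any field name
            else:
                children = [value[name]] if name in value else []
            for child in children:
                if index is None:  # plain field, no brackets
                    selected.append(child)
                elif not isinstance(child, list):
                    continue  # brackets require a json array
                elif index == "":  # name[]: every element
                    selected.extend(child)
                elif int(index) < len(child):  # name[N]: one element
                    selected.append(child[int(index)])
        current = selected
    return current


book = {"author": {"last_name": "Smith"}, "price": 29.95, "tags": ["getting", "started"]}
print(resolve_search_path(book, "author.last_name"))
print(resolve_search_path(book, "tags[]"))
```

Note that the paths here are relative to the attribute, as in an asset type definition; in the body of a search the same path would be prefixed with the attribute name, for example "book.author.last_name".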
data_asset Type
"data_asset" is by far the most commonly used already available asset type. It can be seen in:
- the Parts of a Primary Metadata Document Figure
- many of the examples in the Assets API Examples and Asset Types API Examples sections
- the default asset type used when you drag an asset resource file onto the Create Asset page.
The reason "data_asset" is so popular is that it is a generic asset type that allows you to declare a specific type for a given asset resource without explicitly creating an asset type named after that specific type. For example, say you want to create a primary metadata document for a csv file. You could first create a specific asset type named, say, "csv_file", and then create a primary metadata document (for that csv file) and specify "csv_file" as the value for "metadata.asset_type". However, you can avoid creating a specific "csv_file" asset type by instead using the generic "data_asset" asset type and then using the "mime_type" field of the "data_asset" attribute to declare that the specific type of your asset resource is a csv file. To do so, the primary metadata document for the csv file would have:
- a "metadata.asset_type" value of the generic type "data_asset"
- an "entity.data_asset.mime_type" value of the specific type "text/csv".
The fields "asset_type" and "mime_type" both describe the "type" of the asset resource. However:
- the type specified by the "metadata.asset_type" field (i.e., "data_asset") is generic
- the type specified by the "entity.data_asset.mime_type" field (i.e., "text/csv") is specific
It is the "mime_type" field of the data_asset type that allows you to declare a specific type for an asset without creating that specific type(!).
So, in its most basic use, the "data_asset" asset type is a very "lite" asset type. It's used to avoid creating many other "heavier" asset types. However, if you need to create more complex attributes with indexes for specific fields in your attribute then you will have to create your own asset type (see Create Asset Type: book for an example).
The other two fields of the type "data_asset" are "dataset" and "columns".
- a "dataset" value of false means that the "columns" field is absent in a "data_asset" attribute
- a "dataset" value of true means that the "columns" field is present in a "data_asset" attribute
The "columns" field of a "data_asset" attribute is optionally used to specify metadata for columns of assets that have columns, like csv files, spreadsheets, database tables, etc.
The full definition of the "data_asset" type is shown in Get Asset Type: data_asset - Response Body.
See Get Asset - CSV File - Response Body - Before Profiling and Get Asset - CSV File - Response Body - After Profiling for examples where a "data_asset" is used for a csv asset resource.
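The relationship between "mime_type", "dataset" and "columns" can be sketched as a small Python builder for a "data_asset" attribute. `build_data_asset_attribute` is a hypothetical helper; it simply keeps "dataset" in step with the presence of "columns", as described above.

```python
from typing import Optional


def build_data_asset_attribute(mime_type: str, columns: Optional[list] = None) -> dict:
    """Build a "data_asset" attribute; "dataset" is true only when columns are given."""
    attribute = {"mime_type": mime_type, "dataset": columns is not None}
    if columns is not None:
        attribute["columns"] = columns
    return attribute


# Before profiling: only the specific type is declared.
before = build_data_asset_attribute("text/csv")
# After profiling: column metadata is present, so "dataset" becomes true.
after = build_data_asset_attribute(
    "text/csv",
    columns=[
        {"name": "Name", "type": {"type": "varchar", "length": 1024}},
        {"name": "Number", "type": {"type": "varchar", "length": 1024}},
    ],
)
print(before)
print(after["dataset"], [column["name"] for column in after["columns"]])
```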
Get Asset Types
You can get a list of the asset types in a catalog using the following Asset Types API:
Get Asset Types - Request URL:
GET {service_URL}/v2/asset_types?catalog_id={catalog_id}
Get Asset Types - Response Body:
{
"resources": [
{
"description": "Data Asset Type",
"fields": [
{
"key": "dataset",
"type": "boolean",
"facet": true,
"is_array": false,
"is_searchable_across_types": false
},
{
"key": "mime_type",
"type": "string",
"facet": true,
"is_array": false,
"is_searchable_across_types": false
},
{
"key": "columns",
"type": "string",
"facet": true,
"is_array": true,
"search_path": "columns[].name",
"is_searchable_across_types": true
}
],
"external_asset_preview": {},
"relationships": [],
"name": "data_asset",
"version": 3
},
{
"description": "An asset type you can use to describe the columns of a data asset. Normally attached as a property to an existing data asset.",
"fields": [
{
"key": "column_info_term_display_name",
"type": "string",
"facet": true,
"is_array": false,
"search_path": "*.column_terms[].term_display_name",
"is_searchable_across_types": true
},
{
"key": "column_info_term_id",
"type": "string",
"facet": true,
"is_array": false,
"search_path": "*.column_terms[].term_id",
"is_searchable_across_types": false
},
{
"key": "column_info_tag",
"type": "string",
"facet": true,
"is_array": false,
"search_path": "*.column_tags[]",
"is_searchable_across_types": true
},
{
"key": "column_info_description",
"type": "string",
"facet": false,
"is_array": false,
"search_path": "*.column_description",
"is_searchable_across_types": true
},
{
"key": "column_info_omrs_guid",
"type": "string",
"facet": true,
"is_array": false,
"search_path": "*.omrs_guid",
"is_searchable_across_types": true
}
],
"external_asset_preview": {},
"relationships": [],
"name": "column_info",
"version": 4
},
{
"description": "An asset type that you can use to assign terms from a business glossary to any asset. Attach items of this type as attributes to other assets.",
"fields": [
{
"key": "asset_term_display_name",
"type": "string",
"facet": true,
"is_array": false,
"search_path": "list[].term_display_name",
"is_searchable_across_types": true
},
{
"key": "asset_term_id",
"type": "string",
"facet": true,
"is_array": false,
"search_path": "list[].term_id",
"is_searchable_across_types": false
}
],
"external_asset_preview": {},
"relationships": [],
"name": "asset_terms",
"version": 1
},
...
]
}
See Asset Type Fields for descriptions of the fields in each of the above asset types.
In a scenario in which the user has not yet created any of their own asset types, the result will contain only the pre-existing, global, asset types. For brevity, the actual sample result shown above includes only a subset of those asset types. Try the GET Asset Types API on your catalog to see the complete set of pre-existing, global, asset types.
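Once a Get Asset Types response has been parsed, a few lines of Python can summarize it. The sketch below lists each asset type together with the field keys that are searchable across types; `summarize_asset_types` is a hypothetical helper, and the input is a trimmed copy of the sample response above.

```python
def summarize_asset_types(response: dict) -> dict:
    """Map each asset type name to its keys with is_searchable_across_types set."""
    summary = {}
    for asset_type in response.get("resources", []):
        summary[asset_type["name"]] = [
            field["key"]
            for field in asset_type.get("fields", [])
            if field.get("is_searchable_across_types")
        ]
    return summary


# Trimmed version of the sample response body.
response = {
    "resources": [
        {
            "name": "data_asset",
            "fields": [
                {"key": "dataset", "is_searchable_across_types": False},
                {"key": "mime_type", "is_searchable_across_types": False},
                {"key": "columns", "is_searchable_across_types": True},
            ],
        },
        {
            "name": "asset_terms",
            "fields": [
                {"key": "asset_term_display_name", "is_searchable_across_types": True},
                {"key": "asset_term_id", "is_searchable_across_types": False},
            ],
        },
    ]
}
print(summarize_asset_types(response))
```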
Get Asset Type: data_asset
You can get an individual asset type in a catalog using the following Asset Types API:
Get Asset Type: data_asset - Request URL:
GET {service_URL}/v2/asset_types/{type_name}?catalog_id={catalog_id}
Supplying "data_asset" as the value for the {type_name} parameter in the above url will produce a response like the following:
Get Asset Type: data_asset - Response Body:
{
"description": "Data Asset Type",
"fields": [
{
"key": "mime_type",
"type": "string",
"facet": true,
"is_array": false,
"is_searchable_across_types": false
},
{
"key": "dataset",
"type": "boolean",
"facet": true,
"is_array": false,
"is_searchable_across_types": false
},
{
"key": "columns",
"type": "string",
"facet": true,
"is_array": true,
"search_path": "columns[].name",
"is_searchable_across_types": true
}
],
"external_asset_preview": {},
"relationships": [],
"name": "data_asset",
"version": 3
}
See Asset Type Fields for descriptions of the fields in the above asset type definition.
Since an asset type called "data_asset" exists, you can create a primary metadata document (i.e., card) with a "metadata.asset_type" value of "data_asset". That card must then also have a primary attribute called "data_asset".
The most interesting item in the "fields" array in the above "data_asset" asset type definition is the item with "key" value "mime_type". That item means that a primary attribute named "data_asset" will have a field called "mime_type". The value of that "mime_type" attribute field will declare the specific type of the asset resource represented by the primary metadata document. For example, see the field "entity.data_asset.mime_type" in Get Asset - CSV File - Response Body - Before Profiling where the "mime_type" value is "text/csv".
Notice the "data_asset" attribute in Get Asset - CSV File - Response Body - Before Profiling only contains two fields: "mime_type" and "dataset". The "columns" field specified in the definition of the "data_asset" asset type is not present in the "data_asset" attribute.
Now compare all the items in the "fields" array in the above "data_asset" asset type definition with the "entity.data_asset" attribute fields as shown, for example, in Get Asset - CSV File - Response Body - After Profiling. Notice that now all the fields described in the "fields" array of the "data_asset" type are present as fields in the "entity.data_asset" attribute. In particular, profiling has added the "columns" field to the "data_asset" attribute.
The Before Profiling and After Profiling examples illustrate that not all the fields defined in an asset type need be present in a corresponding attribute.
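The comparison above can be sketched in Python. This is an illustration only: the helper name missing_fields is hypothetical, the asset type is a trimmed copy of the definition shown earlier, and the two attribute values are made-up stand-ins for the Before Profiling and After Profiling responses. The sketch only compares top-level field keys.

```python
def missing_fields(asset_type, attribute):
    """Return keys defined in the asset type that are absent from the attribute."""
    defined = [field["key"] for field in asset_type["fields"]]
    return [key for key in defined if key not in attribute]


# Trimmed copy of the "data_asset" asset type definition shown above.
data_asset_type = {
    "name": "data_asset",
    "fields": [
        {"key": "mime_type", "type": "string"},
        {"key": "dataset", "type": "boolean"},
        {"key": "columns", "type": "string", "search_path": "columns[].name"},
    ],
}

# Hypothetical "entity.data_asset" attributes before and after profiling.
before_profiling = {"mime_type": "text/csv", "dataset": True}
after_profiling = {
    "mime_type": "text/csv",
    "dataset": True,
    "columns": [{"name": "ID"}],
}

print(missing_fields(data_asset_type, before_profiling))  # ['columns']
print(missing_fields(data_asset_type, after_profiling))   # []
```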
Create Asset Type: book
Say you have a book asset resource and you want to create a primary metadata document to describe that book. You will first need to create an asset type called "book" (as shown below) so you can then:
- use the name of that asset type as the value for the "metadata.asset_type" field in the primary metadata document
- create a primary attribute named "book" that will contain data about your book.
Say you want that primary attribute to look like the following:
"book": {
"author": {
"first_name": "Tracy",
"last_name": "Smith"
},
"price": 29.95
}
The above "book" attribute has:
- one complex field called "author" (complex fields are allowed in attributes)
- one simple field called "price".
For this example, assume you'll want to be able to search inside the "author.last_name" field of "book" attributes.
To create an asset type named "book" that will allow you to do all of the above, use a request like the following:
Create Asset Type: book - Request URL:
POST {service_URL}/v2/asset_types?catalog_id={catalog_id}
Create Asset Type: book - Request Body:
{
"name": "book",
"description": "Book asset type",
"fields": [
{
"key": "author.last_name",
"type": "string",
"facet": false,
"is_array": false,
"search_path": "author.last_name",
"is_searchable_across_types": true
}
],
"properties": {
"price" : {
"type": "number",
"description": "Suggested retail price"
}
}
}
The purpose of most of the fields used in the above request was described in the Asset Type Fields section. Here are some things to note specifically in the above request:
- "name": uses only lowercase letters, i.e., "book"
- "fields": even though our goal attribute has multiple fields in it, there is only one item in the asset type's "fields" array. That is because the "fields" array should only contain items for the fields of an attribute that we want the catalog to create an index for. In this case, we only want an index for the "author.last_name" field of "book" attributes.
  - "key": the name of the attribute field that we want indexed, and the name for that index. In this case, "author.last_name".
  - "type": the type of the "author.last_name" field is "string"
  - "facet": an explanation of this field is beyond the scope of this document
  - "is_array": false because "author.last_name" is not an array
  - "search_path": this is the path inside the attribute to the value that we want indexed
  - "is_searchable_across_types": an explanation of this field is beyond the scope of this document
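As a rough illustration of the notes above, the "name" and "fields" parts of the request body can be assembled programmatically. The helper names indexed_field and asset_type_body are hypothetical, the lowercase check simply mirrors the note about "name", and the "properties" part of the request is left out for brevity. Nothing is sent to the service.

```python
import json


def indexed_field(path, field_type="string", is_array=False):
    """Build one "fields" item that indexes the attribute value at `path`."""
    return {
        "key": path,
        "type": field_type,
        "facet": False,
        "is_array": is_array,
        "search_path": path,
        "is_searchable_across_types": True,
    }


def asset_type_body(name, description, fields):
    """Assemble a request body for POST /v2/asset_types."""
    if name != name.lower():
        raise ValueError("asset type names use only lowercase letters")
    return {"name": name, "description": description, "fields": fields}


body = asset_type_body(
    "book", "Book asset type", [indexed_field("author.last_name")]
)
print(json.dumps(body, indent=2))
```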
Create Asset Type: book - Response Body:
{
"description": "Book asset type",
"fields": [
{
"key": "author.last_name",
"type": "string",
"facet": false,
"is_array": false,
"search_path": "author.last_name",
"is_searchable_across_types": true
}
],
"relationships": [],
"name": "book",
"version": 1
}
The response to the POST /v2/asset_types API echoes the input, with two additional fields:
- relationships: an explanation of the contents of this field is beyond the scope of this document
- version: the version of the newly created asset type
You now have an asset type called "book" that specifies one indexed, searchable field called "author.last_name". See Create Asset: book for an example of the ways in which that "book" asset type can be used when creating a primary metadata document.
Search Asset Type: attribute - book
The Search Asset Type API can be used to search inside a catalog for all the primary metadata documents that satisfy both of the following conditions:
- have a "metadata.asset_type" value that matches the asset type name specified in the {type_name} URL parameter
- have an attribute whose fields' values match those specified in the request body.
Recall that one of the primary reasons for creating an asset type is to specify fields in attributes (named after that asset type) that will be indexed for searching. The Create Asset Type: book section showed how to create an asset type named "book". The Create Asset: book section showed how to create a primary metadata document whose "metadata.asset_type" value and primary attribute name are both "book". So, if you use the value "book" for the {type_name} parameter in the URL below, and if you supply the following request body, then you'll get back matching metadata for books.
Search Asset Type: attribute - book - Request URL
POST {service_URL}/v2/asset_types/{type_name}/search?catalog_id={catalog_id}
Search Asset Type: attribute - book - Request Body:
{
"query":"book.author.last_name:Smith"
}
Notice how the query specifies both the attribute (book) to be searched and the search path (author.last_name) within that attribute. The value to match is specified after the colon (:). In this case, the value is Smith.
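The query format described above can be sketched as a small helper that builds the URL and body for a search without sending anything. The function name search_request and the catalog ID placeholder are assumptions made for illustration; the endpoint and query syntax are taken from the request shown above.

```python
import json
from urllib.parse import quote, urlencode

SERVICE_URL = "https://api.dataplatform.cloud.ibm.com"


def search_request(type_name, attribute, search_path, value, catalog_id):
    """Return (url, body) for a Search Asset Type call; nothing is sent."""
    url = (
        f"{SERVICE_URL}/v2/asset_types/{quote(type_name)}/search?"
        + urlencode({"catalog_id": catalog_id})
    )
    # The query names the attribute, then the search path, then the value.
    body = json.dumps({"query": f"{attribute}.{search_path}:{value}"})
    return url, body


url, body = search_request(
    "book", "book", "author.last_name", "Smith",
    "my-catalog-id",  # placeholder catalog ID
)
print(url)
print(body)
```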
The following is the result of the above search:
Search Asset Type: attribute - book - Response Body:
{
"total_rows": 1,
"results": [
{
"metadata": {
"rov": {
"mode": 0,
"collaborator_ids": {}
},
"usage": {
"last_updated_at": "2019-05-01T18:58:51Z",
"last_updater_id": "IBMid-___",
"last_update_time": 1556737131140,
"last_accessed_at": "2019-05-01T18:58:51Z",
"last_access_time": 1556737131140,
"last_accessor_id": "IBMid-___",
"access_count": 0
},
"name": "Getting Started with Assets",
"description": "Describes how to create and use metadata for assets",
"tags": [
"getting",
"started",
"documentation"
],
"asset_type": "book",
"origin_country": "us",
"rating": 0,
"total_ratings": 0,
"catalog_id": "c6f3cbd8-___",
"created": 1556635077746,
"created_at": "2019-04-30T14:37:57Z",
"owner_id": "IBMid-___",
"size": 0,
"version": 0,
"asset_state": "available",
"asset_attributes": [
"book"
],
"asset_id": "3da5389d-d4a4-43da-be1f-___",
"asset_category": "USER"
},
"href": "https://api.dataplatform.cloud.ibm.com/v2/assets/3da5389d-d4a4-43da-be1f-___?catalog_id=c6f3cbd8-___"
}
]
}
In this case, there is only one primary metadata document returned in the "results" array (namely, the primary metadata document that was created in the Create Asset: book section). In general, there can be many matching documents in the "results" array.
Notice the results of an Asset Type Search, as shown above, only contain the "metadata" section of a primary metadata document. In particular, the "entity" section that contains the attributes is not returned. That is done to reduce the size of the response because, in general, the "entity" section of a primary metadata document can be much larger than the "metadata" section. Use the value of the "metadata.asset_id" in one of the items in "results" to retrieve either:
- the entire primary metadata document (using the GET Asset API), or
- just the attributes of the primary metadata document (using the GET Attributes API).
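A minimal sketch of that follow-up step: collect the "metadata.asset_id" and "href" of every item in "results" so each asset can be fetched afterwards. The helper name asset_links is hypothetical and the IDs in the sample response are placeholders.

```python
def asset_links(search_response):
    """Map each result's asset_id to its href from a search response."""
    return {
        result["metadata"]["asset_id"]: result["href"]
        for result in search_response.get("results", [])
    }


# Trimmed search response with placeholder IDs.
sample = {
    "total_rows": 1,
    "results": [
        {
            "metadata": {"asset_id": "asset-1", "asset_type": "book"},
            "href": "https://api.dataplatform.cloud.ibm.com/v2/assets/asset-1?catalog_id=catalog-1",
        }
    ],
}

for asset_id, href in asset_links(sample).items():
    print(asset_id, href)
```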
Notes:
- searching is not limited to just primary attributes (like book above). Searches may also be performed on:
  - secondary, or extended, attributes
  - the "metadata" field of a primary metadata document, as shown in the next section.
- other parameters available for searches are:
  - limit (number): limit number of search results
  - sort (string): sort columns for search results
  - counts: beyond the scope of this document
  - drilldown: beyond the scope of this document
Search Asset Type: metadata - name
You're not limited to searching within attributes (like the attribute search shown in the previous section). You can also search within the "metadata" section of a primary metadata document.
Search Asset Type: metadata - name - Request URL:
POST {service_URL}/v2/asset_types/{type_name}/search?catalog_id={catalog_id}
Search Asset Type: metadata - name - Request Body:
{
"query":"asset.name:Getting Started with Assets"
}
Notice the query signifies that the search should take place in the "metadata" section of the primary metadata document by using the term asset at the beginning of the search path. Then the field to be searched within "metadata" is specified: name in the example above. The value to match is specified after the colon (:). In this case, the value is Getting Started with Assets.
The following is the result of the above search:
Search Asset Type: metadata - name - Response Body:
{
"total_rows": 1,
"results": [
{
"metadata": {
"rov": {
"mode": 0,
"collaborator_ids": {}
},
"usage": {
"last_updated_at": "2019-04-30T17:27:56Z",
"last_updater_id": "IBMid___",
"last_update_time": 1556645276827,
"last_accessed_at": "2019-04-30T17:27:56Z",
"last_access_time": 1556645276827,
"last_accessor_id": "IBMid___",
"access_count": 0
},
"name": "Getting Started with Assets",
"description": "Describes how to create and use metadata for assets",
"tags": [
"getting",
"started",
"documentation"
],
"asset_type": "book",
"origin_country": "us",
"rating": 0,
"total_ratings": 0,
"catalog_id": "c6f3cbd8-___",
"created": 1556635077746,
"created_at": "2019-04-30T14:37:57Z",
"owner_id": "IBMid-___",
"size": 0,
"version": 0,
"asset_state": "available",
"asset_attributes": [
"book"
],
"asset_id": "3da5389d-d4a4-43da-be1f-___",
"asset_category": "USER"
},
"href": "https://api.dataplatform.cloud.ibm.com/v2/assets/3da5389d-d4a4-43da-be1f-___?catalog_id=c6f3cbd8-___"
}
]
}
In this case, the result is the same as was described in Search Asset Type: attribute - book - Response Body. See that section for more details.
Data Flows
Introduction
A data flow can read data from a large variety of sources, process that data using pre-defined operations or custom code, and then write it to one or more targets. The runtime engine can handle large amounts of data so it's ideally suited for reading, processing, and writing data at volume.
The sources and targets that are supported include both Cloud and on-premises offerings as well as data assets in projects. Cloud offerings include IBM Cloud Object Storage, Amazon S3, and Azure, among others. On-premises offerings include IBM Db2, Microsoft SQL Server, and Oracle, among others.
For a list of the supported connectivity and the properties they support, see IBM Watson Data API Data Flows Service - Data Asset and Connection Properties.
Creating a data flow
The following example shows how to create a data flow that reads data from a table on IBM Db2 Warehouse on Cloud (previously called IBM dashDB), filters the data, and writes the data to a data asset in the project. The data flow created for this example will contain a linear pipeline, although in the general case, the pipeline forms a directed acyclic graph (DAG).
Environments
Begin by creating a connection to an existing IBM Db2 Warehouse on Cloud instance to use as the source of the data flow. For further information on the connections service, see Connections.
Defining a source in a data flow
A data flow can contain one or more data sources. A data source is defined as a binding node in the data flow pipeline, which has one output and no inputs. The binding node must reference either a connection or a data asset. Depending on the type of connection or data asset, additional properties might also need to be specified. Refer to IBM Watson Data API Data Flows Service - Data Asset and Connection Properties to determine which properties are applicable for a given connection, and which of those are required. For IBM Db2 Warehouse on Cloud both select_statement and table_name are required, so you must include values for those in the data flow.
For the following example, reference the connection you created earlier. The binding node for the data flow's source is:
{
"id": "source1",
"type": "binding",
"connection": {
"properties": {
"schema_name": "GOSALESHR",
"table_name": "EMPLOYEE"
},
"ref": "85be3e09-1c71-45d3-8d5d-220d6a6ea850"
},
"outputs": [
{
"id": "source1Output"
}
]
}
The outputs object declares the ID of the output port of this source as source1Output so that other nodes can read from it. You can see the schema and table name have been defined, and that the connection with ID 85be3e09-1c71-45d3-8d5d-220d6a6ea850 is being referenced.
Defining an operation in a data flow
A data flow can contain zero or more operations, with a typical operation having one or more inputs and one or more outputs. An operation input is linked to the output of a source or another operation. An operation can also have additional parameters which define how the operation performs its work. An operation is defined as an execution node in the data flow pipeline.
The following example creates a filter operation so that only rows with a value greater than 2010-01-01 in the DATE_HIRED field are retained. The execution node for our filter operation is:
{
"id":"operation1",
"type":"execution_node",
"op":"com.ibm.wdp.transformer.FreeformCode",
"parameters":{
"FREEFORM_CODE":"filter(DATE_HIRED>'2010-01-01*')"
},
"inputs":[
{
"id":"inputPort1",
"links":[
{
"node_id_ref":"source1",
"port_id_ref":"source1Output"
}
]
}
],
"outputs":[
{
"id":"outputPort1"
}
]
}
The inputs attribute declares an input port with ID inputPort1 which references the output port of the source node (node ID source1 and port ID source1Output). The outputs attribute declares the ID of the output port of this operation as outputPort1 so that other nodes can read from it. For this example, the operation is defined as a freeform operation, denoted by the op attribute value of com.ibm.wdp.transformer.FreeformCode. A freeform operation has only a single parameter named FREEFORM_CODE whose value is a snippet of sparklyr code. In this snippet of code, a filter function is called with the arguments to retain only those rows with a value greater than 2010-01-01 in the DATE_HIRED field.
Defining a target in a data flow
A data flow can contain zero or more targets. A target is defined as a binding node in the data flow pipeline which has one input and no outputs. As with the source, the binding node must reference either a connection or a data asset. When using a data asset as a target, specify either the ID or name of an existing data asset.
In the following example, a data asset is referenced by its name. The binding node for the data flow's target is:
{
"id": "target1",
"type": "binding",
"data_asset": {
"properties": {
"name": "my_shapedFile.csv"
}
},
"inputs": [
{
"links": [
{
"node_id_ref": "operation1",
"port_id_ref": "outputPort1"
}
],
"id": "target1Input"
}
]
}
The inputs object declares an input port with ID target1Input which references the output port of our operation node (node ID operation1 and port ID outputPort1). The name of the data asset to create or update is specified as my_shapedFile.csv. Unless otherwise specified, this data asset is assumed to be in the same catalog or project as that which contains the data flow.
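Because every input port refers to another node's output port by ID, a mistyped reference is easy to introduce. The following sketch checks a list of nodes for links that point at no declared output port. It is a local sanity check only, not part of the Data Flows API, and the function name dangling_links is hypothetical. The nodes are trimmed copies of the source, operation, and target nodes described above.

```python
def dangling_links(nodes):
    """Return (node_id, input_port_id) pairs whose link matches no output port."""
    outputs = {
        (node["id"], port["id"])
        for node in nodes
        for port in node.get("outputs", [])
    }
    problems = []
    for node in nodes:
        for port in node.get("inputs", []):
            for link in port.get("links", []):
                if (link["node_id_ref"], link["port_id_ref"]) not in outputs:
                    problems.append((node["id"], port["id"]))
    return problems


nodes = [
    {"id": "source1", "type": "binding", "outputs": [{"id": "source1Output"}]},
    {
        "id": "operation1",
        "type": "execution_node",
        "inputs": [{
            "id": "inputPort1",
            "links": [{"node_id_ref": "source1", "port_id_ref": "source1Output"}],
        }],
        "outputs": [{"id": "outputPort1"}],
    },
    {
        "id": "target1",
        "type": "binding",
        "inputs": [{
            "id": "target1Input",
            "links": [{"node_id_ref": "operation1", "port_id_ref": "outputPort1"}],
        }],
    },
]

print(dangling_links(nodes))  # []
```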
Defining a parameterized property in a data flow
Properties contained within a data flow can be parameterized, allowing the values associated with a property to be replaced at run time. The paths referencing the parameterized properties are contained within the external parameters of the data flow pipeline. A path can be defined as an RFC 6902 path; however, a path that contains the ID of an object within an array, in place of its index, is also supported. So instead of:
/entity/pipelines/0/nodes/0/connection/table_name
you could also use:
/entity/pipelines/<pipeline_id>/nodes/<node_id>/connection/table_name
Any external parameters that are defined as being required must be reconciled when the data flow is run. Any external parameters that are defined as not being required and that are not reconciled when the data flow is run will default to using the property values already contained within the data flow.
In the following example, the external parameter references a filter property within the data flow that may be reconciled when the data flow is run. The external parameters for the data flow's pipeline are:
[
{
"name": "freeform_update",
"required": false,
"paths": [
"/entity/pipeline/pipelines/pipeline1/nodes/operation1/parameters/FREEFORM_CODE"
]
}
]
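To make the two path styles concrete, here is a small sketch of how such a path could be resolved against a document. It is not the service's implementation. It assumes the stored data flow nests its pipeline under "entity", as the path in the example suggests, and it ignores JSON Pointer escape sequences. The function name resolve and the trimmed data_flow document are made up for illustration.

```python
def resolve(document, path):
    """Follow `path` through dicts and lists; list items match by index or "id"."""
    current = document
    for segment in path.strip("/").split("/"):
        if isinstance(current, list):
            if segment.isdigit():
                current = current[int(segment)]
            else:
                current = next(
                    item for item in current if item.get("id") == segment
                )
        else:
            current = current[segment]
    return current


# Trimmed, assumed shape of a stored data flow.
data_flow = {
    "entity": {
        "pipeline": {
            "pipelines": [{
                "id": "pipeline1",
                "nodes": [
                    {"id": "source1"},
                    {
                        "id": "operation1",
                        "parameters": {
                            "FREEFORM_CODE": "filter(DATE_HIRED>'2010-01-01*')"
                        },
                    },
                ],
            }]
        }
    }
}

by_id = "/entity/pipeline/pipelines/pipeline1/nodes/operation1/parameters/FREEFORM_CODE"
by_index = "/entity/pipeline/pipelines/0/nodes/1/parameters/FREEFORM_CODE"
print(resolve(data_flow, by_id) == resolve(data_flow, by_index))  # True
```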
Creating the data flow
Putting it all together, you can now call the API to create the data flow with the following POST method:
POST /v2/data_flows
The new data flow can be stored in a catalog or project. Use either the catalog_id or project_id query parameter, depending on where you want to store the data flow asset. An example request to create a data flow is shown below:
POST /v2/data_flows?project_id=ff1ab70b-0553-409a-93f9-ccc31471c218
Request payload:
{
"name": "my_dataflow",
"pipeline": {
"doc_type": "pipeline",
"version": "2.0",
"primary_pipeline": "pipeline1",
"pipelines": [
{
"id": "pipeline1",
"nodes": [
{
"id": "source1",
"type": "binding",
"connection": {
"properties": {
"schema_name": "GOSALESHR",
"table_name": "EMPLOYEE"
},
"ref": "85be3e09-1c71-45d3-8d5d-220d6a6ea850"
},
"outputs": [
{
"id": "source1Output"
}
]
},
{
"id": "operation1",
"type": "execution_node",
"op": "com.ibm.wdp.transformer.FreeformCode",
"parameters": {
"FREEFORM_CODE": "filter(DATE_HIRED>'2010-01-01*')"
},
"inputs": [
{
"id": "inputPort1",
"links": [
{
"node_id_ref": "source1",
"port_id_ref": "source1Output"
}
]
}
],
"outputs": [
{
"id": "outputPort1"
}
]
},
{
"id": "target1",
"type": "binding",
"data_asset": {
"properties": {
"name": "my_shapedFile.csv"
}
},
"inputs": [
{
"links": [
{
"node_id_ref": "operation1",
"port_id_ref": "outputPort1"
}
],
"id": "target1Input"
}
]
}
],
"runtime_ref": "runtime1"
}
],
"runtimes": [
{
"name": "Spark",
"id": "runtime1"
}
],
"external_parameters": [
{
"name": "freeform_update",
"required": false,
"paths": [
"/entity/pipeline/pipelines/pipeline1/nodes/operation1/parameters/FREEFORM_CODE"
]
}
]
}
}
The response will contain a dataflow ID which you will need later to run the data flow you created.
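As a sketch of how the call might be issued from Python's standard library, the following builds the request object without sending it. The function name create_data_flow_request is hypothetical, the token is a placeholder, the pipeline passed in is a stub rather than the full payload above, and the Authorization header format is an assumption based on the IAM bearer token described earlier in this document.

```python
import json
import urllib.request
from urllib.parse import urlencode

SERVICE_URL = "https://api.dataplatform.cloud.ibm.com"


def create_data_flow_request(token, project_id, name, pipeline):
    """Build (but do not send) the POST /v2/data_flows request."""
    url = f"{SERVICE_URL}/v2/data_flows?" + urlencode({"project_id": project_id})
    payload = json.dumps({"name": name, "pipeline": pipeline}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=payload,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",  # assumed header format
            "Content-Type": "application/json",
        },
    )


request = create_data_flow_request(
    token="<IAM bearer token>",  # placeholder, not a real token
    project_id="ff1ab70b-0553-409a-93f9-ccc31471c218",
    name="my_dataflow",
    pipeline={"doc_type": "pipeline", "version": "2.0", "pipelines": []},
)
print(request.get_method(), request.full_url)
# To send the request: urllib.request.urlopen(request)
```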
Working with data flow runs
What is a data flow run?
Each time a data flow is run, a new data flow run asset is created and stored in the project or catalog to record this event. This asset stores detailed metrics such as how many rows were read and written, a copy of the data flow that was run, and any logs from the engine. During a run, the information in the asset is updated to reflect the current state of the run. When the run completes (successfully or not), the information in the asset is updated one final time. If and when the data flow is deleted, any run assets of that data flow are also deleted.
As part of a data flow run it is possible to specify runtime values specific to this particular run, that will be used to override any parameterized properties defined when creating the associated data flow.
There are four components of a data flow run, which are accessible using different APIs.
- Summary (GET /v2/data_flows/{data_flow_id}/runs/{data_flow_run_id}). A quick, at-a-glance view of a run with a summary of how many rows in total were read and written.
- Detailed metrics (GET /v2/data_flows/{data_flow_id}/runs/{data_flow_run_id}/metrics). Detailed metrics for each binding node in the data flow (link sources and targets).
- Data flow (GET /v2/data_flows/{data_flow_id}/runs/{data_flow_run_id}/origin). A copy of the data flow that was run at that point in time. (Remember that data flows can be modified between runs.)
- Logs (GET /v2/data_flows/{data_flow_id}/runs/{data_flow_run_id}/logs). The logs from the engine, which are useful for diagnosing run failures.
Run state life cycle
A data flow run has a defined life cycle, which is shown by its state attribute. The state attribute can have one of the following values:
- starting: The run was created but was not yet submitted to the engine.
- queued: The run was submitted to the engine and it is pending.
- running: The run is currently in progress.
- finished: The run finished and was successful.
- error: The run did not complete. An error occurred either before the run was sent to the engine or while the run was in progress.
- stopping: The run was canceled but it is still running.
- stopped: The run is no longer in progress.
The run states that define phases of progress are: starting, queued, running, stopping. The run states that define states of completion are: finished, error, stopped.
The following are typical state transitions you would expect to see:
- The run completed successfully: starting -> queued -> running -> finished.
- The run failed (for example, connection credentials were incorrect): starting -> queued -> running -> error.
- The run could not be sent to the engine (for example, the connection referenced does not exist): starting -> error.
- The run was stopped (for example, at the user's request): starting -> queued -> running -> stopping -> stopped.
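The split between progress states and completion states can be captured in a few lines of Python. The names IN_PROGRESS, COMPLETED, and is_complete are hypothetical; the state values are the ones listed above.

```python
IN_PROGRESS = ("starting", "queued", "running", "stopping")
COMPLETED = ("finished", "error", "stopped")


def is_complete(state):
    """Return True for a completion state, False for a progress state."""
    if state in COMPLETED:
        return True
    if state in IN_PROGRESS:
        return False
    raise ValueError(f"unknown run state: {state!r}")


# A typical successful run, as listed above.
typical = ["starting", "queued", "running", "finished"]
print([is_complete(state) for state in typical])  # [False, False, False, True]
```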
Run a data flow
To run a data flow, call the following POST API:
POST /v2/data_flows/{data_flow_id}/runs?project_id={project_id}
The value of data_flow_id is the metadata.asset_id from your data flow. An example response from this API call might be:
{
"metadata": {
"asset_id": "ed09488c-6d51-48c4-b190-7096f25645d5",
"asset_type": "data_flow_run",
"create_time": "2017-12-21T10:51:47.000Z",
"creator": "[email protected]",
"href": "https://api.dataplatform.cloud.ibm.com/v2/data_flows/cfdacdb4-3180-466f-8d4c-be7badea5d64/runs/ed09488c-6d51-48c4-b190-7096f25645d5?project_id=ff1ab70b-0553-409a-93f9-ccc31471c218",
"project_id": "ff1ab70b-0553-409a-93f9-ccc31471c218",
"usage": {
"last_modification_time": "2017-12-21T10:51:47.923Z",
"last_modifier": "[email protected]",
"last_access_time": "2017-12-21T10:51:47.923Z",
"last_accessor": "[email protected]",
"access_count": 0
}
},
"entity": {
"data_flow_ref": "cfdacdb4-3180-466f-8d4c-be7badea5d64",
"name": "my_dataflow",
"rov": {
"mode": 0,
"members": []
},
"state": "starting",
"tags": []
}
}
Creating a parameter set
A data flow can be run with parameter replacements that reference a created parameter set.
Each parameter is contained within a parameter set. A parameter can be one of the following types: string, object, array, boolean, or integer. The value should conform to the type specified.
To create a parameter set call the following POST API:
POST /v2/data_flows/parameter_sets?project_id={project_id}
Request payload:
{
"name": "my_parameter_set",
"parameters": [
{
"name": "TheTableName",
"literal_value": {
"type": "string",
"value": "Employee"
}
},
{
"name": "param2",
"literal_value": {
"type": "object",
"value": {
"type": "string",
"value": "Test Value"
}
}
},
{
"name": "param3",
"literal_value": {
"type": "boolean",
"value": true
}
},
{
"name": "param4",
"literal_value": {
"type": "array",
"value": [
"string1",
"string2"
]
}
},
{
"name": "param5",
"literal_value": {
"type": "integer",
"value": 1
}
}
]
}
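Since each value should conform to its declared type, a payload can be checked locally before it is posted. The sketch below is a client-side check only; the names TYPE_CHECKS and invalid_parameters are hypothetical, and the mapping from the declared types to Python types is an assumption. The sample deliberately gives param5 a string value to show a mismatch.

```python
TYPE_CHECKS = {
    "string": lambda value: isinstance(value, str),
    "object": lambda value: isinstance(value, dict),
    "array": lambda value: isinstance(value, list),
    "boolean": lambda value: isinstance(value, bool),
    # bool is a subclass of int in Python, so exclude it explicitly.
    "integer": lambda value: isinstance(value, int) and not isinstance(value, bool),
}


def invalid_parameters(parameter_set):
    """Return names of parameters whose value does not match the declared type."""
    bad = []
    for parameter in parameter_set["parameters"]:
        literal = parameter["literal_value"]
        check = TYPE_CHECKS.get(literal["type"])
        if check is None or not check(literal["value"]):
            bad.append(parameter["name"])
    return bad


parameter_set = {
    "name": "my_parameter_set",
    "parameters": [
        {"name": "TheTableName",
         "literal_value": {"type": "string", "value": "Employee"}},
        {"name": "param3",
         "literal_value": {"type": "boolean", "value": True}},
        {"name": "param5",
         "literal_value": {"type": "integer", "value": "1"}},  # wrong on purpose
    ],
}

print(invalid_parameters(parameter_set))  # ['param5']
```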
Run a data flow with parameter replacement
At run time, parameter replacement properties can be included in the request body. These properties are specific to this particular run, and are used to replace the associated values of the parameterized properties defined when creating the related data flow. A parameter replacement property can be either a reference to an existing parameter within a stored parameter set, or a straightforward replacement object defined as a literal value.
Each parameter replacement defines a name, which is used to match with the name of an external parameter defined in the data flow. Once the association has been successfully made, the runtime value replaces the default value currently contained within the data flow.
An important point to note here is that the stored data flow is left unchanged. The values are only overridden for this particular run.
To run a data flow with parameter replacement call the following POST API:
POST /v2/data_flows/{data_flow_id}/runs?project_id={project_id}
Request payload:
{
"param_replacements": [
{
"reference_value": {
"parameter_set_ref": "6a750da0-7dc4-427a-b35d-939bb5be87f5",
"parameter_set_param_name": "TheTableName"
},
"name": "table_name_update"
},
{
"literal_value": {
"value": "filter(DATE_HIRED>'2018-01-01*')"
},
"name": "freeform_update"
}
]
}
The value of data_flow_id is the metadata.asset_id from your data flow.
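Because replacements are matched to external parameters by name, and required external parameters must be reconciled when the data flow is run, it can be useful to check a run payload before submitting it. The following is a sketch of such a check. The function name missing_required is hypothetical, and the table_name_update external parameter with its path is assumed here to be defined as required purely for the sake of the example.

```python
def missing_required(external_parameters, param_replacements):
    """Names of required external parameters that no replacement reconciles."""
    supplied = {replacement["name"] for replacement in param_replacements}
    return [
        parameter["name"]
        for parameter in external_parameters
        if parameter.get("required") and parameter["name"] not in supplied
    ]


external_parameters = [
    {"name": "freeform_update", "required": False, "paths": [
        "/entity/pipeline/pipelines/pipeline1/nodes/operation1/parameters/FREEFORM_CODE"
    ]},
    # Hypothetical required parameter, for illustration only.
    {"name": "table_name_update", "required": True, "paths": [
        "/entity/pipeline/pipelines/pipeline1/nodes/source1/connection/table_name"
    ]},
]

param_replacements = [
    {"name": "freeform_update",
     "literal_value": {"value": "filter(DATE_HIRED>'2018-01-01*')"}},
]

print(missing_required(external_parameters, param_replacements))
# ['table_name_update']
```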
An example response from this API call might be:
{
"metadata": {
"asset_id": "ed09488c-6d51-48c4-b190-7096f25645d5",
"asset_type": "data_flow_run",
"create_time": "2017-12-21T10:51:47.000Z",
"creator": "[email protected]",
"href": "https://api.dataplatform.cloud.ibm.com/v2/data_flows/cfdacdb4-3180-466f-8d4c-be7badea5d64/runs/ed09488c-6d51-48c4-b190-7096f25645d5?project_id=ff1ab70b-0553-409a-93f9-ccc31471c218",
"project_id": "ff1ab70b-0553-409a-93f9-ccc31471c218",
"usage": {
"last_modification_time": "2017-12-21T10:51:47.923Z",
"last_modifier": "[email protected]",
"last_access_time": "2017-12-21T10:51:47.923Z",
"last_accessor": "[email protected]",
"access_count": 0
}
},
"entity": {
"data_flow_ref": "cfdacdb4-3180-466f-8d4c-be7badea5d64",
"name": "my_dataflow",
"rov": {
"mode": 0,
"members": []
},
"state": "starting",
"tags": []
}
}
Get a data flow run summary
To retrieve the latest summary of a data flow run, call the following GET method:
GET /v2/data_flows/{data_flow_id}/runs/{data_flow_run_id}?project_id={project_id}
The value of data_flow_id is the metadata.asset_id from your data flow. The value of data_flow_run_id is the metadata.asset_id from your data flow run. An example response from this API call might be:
{
"metadata": {
"asset_id": "ed09488c-6d51-48c4-b190-7096f25645d5",
"asset_type": "data_flow_run",
"create_time": "2017-12-21T10:51:47.000Z",
"creator": "[email protected]",
"href": "https://api.dataplatform.cloud.ibm.com/v2/data_flows/cfdacdb4-3180-466f-8d4c-be7badea5d64/runs/ed09488c-6d51-48c4-b190-7096f25645d5?project_id=ff1ab70b-0553-409a-93f9-ccc31471c218",
"project_id": "ff1ab70b-0553-409a-93f9-ccc31471c218",
"usage": {
"last_modification_time": "2017-12-21T10:51:47.923Z",
"last_modifier": "[email protected]",
"last_access_time": "2017-12-21T10:51:47.923Z",
"last_accessor": "[email protected]",
"access_count": 0
}
},
"entity": {
"data_flow_ref": "cfdacdb4-3180-466f-8d4c-be7badea5d64",
"engine_state": {
"session_cookie": "route=Spark; HttpOnly; Secure",
"engine_run_id": "804d17bd-5ed0-4d89-ba38-ab7890d61e45"
},
"name": "my_dataflow",
"rov": {
"mode": 0,
"members": []
},
"state": "finished",
"summary": {
"completed_date": "2018-01-03T16:58:05.726Z",
"engine_elapsed_secs": 9,
"engine_completed_date": "2018-01-03T16:58:05.360Z",
"engine_started_date": "2018-01-03T16:57:56.211Z",
"engine_status_date": "2018-01-03T16:58:05.360Z",
"engine_submitted_date": "2018-01-03T16:57:46.044Z",
"total_bytes_read": 95466,
"total_bytes_written": 42142,
"total_rows_read": 766,
"total_rows_written": 336
},
"tags": []
}
}
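A common use of the summary call is to poll it until the run reaches a completion state. The sketch below shows one way to structure that loop. The function name wait_for_run, the polling interval, and the poll limit are arbitrary choices, and fetch_run stands in for whatever code performs the GET call and returns the parsed response. A fake fetcher is used here so the example runs without network access.

```python
import time

COMPLETED_STATES = {"finished", "error", "stopped"}


def wait_for_run(fetch_run, interval_secs=5.0, max_polls=120, sleep=time.sleep):
    """Poll `fetch_run` until the run reaches a completion state, then return it."""
    for _ in range(max_polls):
        run = fetch_run()
        if run["entity"]["state"] in COMPLETED_STATES:
            return run
        sleep(interval_secs)
    raise TimeoutError("run did not reach a completion state")


# Stand-in for the GET call: replays a typical successful life cycle.
states = iter(["starting", "queued", "running", "finished"])


def fake_fetch():
    return {"entity": {"state": next(states)}}


final = wait_for_run(fake_fetch, sleep=lambda secs: None)
print(final["entity"]["state"])  # finished
```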
Troubleshooting a failed run
If a data flow run fails, the state attribute is set to the value error. In addition to this, the run asset itself has an attribute called error which is set to a concise description of the error (where available from the engine). If this information is not available from the engine, a more general message is set in the error attribute. This means that the error attribute is never left unset if a run fails. The following example shows the error payload produced if a schema specified in a source connection's properties doesn't exist:
{
"error": {
"trace": "1c09deb8-c3f9-4dc1-ad5a-0fc4e7c97071",
"errors": [
{
"code": "runtime_failed",
"message": "While the process was running a fatal error occurred in the engine (see logs for more details): SCAPI: CDICO2005E: Table could not be found: \"BADSCHEMAGOSALESHR.EMPLOYEE\" is an undefined name.. SQLCODE=-204, SQLSTATE=42704, DRIVER=4.20.4\ncom.ibm.connect.api.SCAPIException: CDICO2005E: Table could not be found: \"BADSCHEMAGOSALESHR.EMPLOYEE\" is an undefined name.. SQLCODE=-204, SQLSTATE=42704, DRIVER=4.20.4\n\tat com.ibm.connect.jdbc.JdbcInputInteraction.init(JdbcInputInteraction.java:158)\n\t...",
"extra": {
"account": "2d0d29d5b8d2701036042ca4cab8b613",
"diagnostics": "[PROJECT_ID-ff1ab70b-0553-409a-93f9-ccc31471c218] [DATA_FLOW_ID-cfdacdb4-3180-466f-8d4c-be7badea5d64] [DATA_FLOW_NAME-my_dataflow] [DATA_FLOW_RUN_ID-ed09488c-6d51-48c4-b190-7096f25645d5]",
"environment_name": "ypprod",
"http_status": 400,
"id": "CDIWA0129E",
"source_cluster": "NULL",
"service_version": "1.0.471",
"source_component": "WDP-DataFlows",
"timestamp": "2017-12-19T19:52:09.438Z",
"transaction_id": "71c7d19b-a91b-40b1-9a14-4535d76e9e16",
"user": "[email protected]"
}
}
]
}
}
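The message in an error payload can include a long stack trace, so a short summary is often easier to read. The following sketch pulls the code and the first line of each message out of a payload shaped like the one above. The function name summarize_errors is hypothetical and the sample payload is a trimmed copy of the example.

```python
def summarize_errors(run_error):
    """Return (code, first line of message) for each entry in an error payload."""
    return [
        (entry["code"], entry["message"].split("\n", 1)[0])
        for entry in run_error["error"]["errors"]
    ]


# Trimmed copy of the error payload shown above.
payload = {
    "error": {
        "trace": "1c09deb8-c3f9-4dc1-ad5a-0fc4e7c97071",
        "errors": [
            {
                "code": "runtime_failed",
                "message": (
                    "CDICO2005E: Table could not be found: "
                    "\"BADSCHEMAGOSALESHR.EMPLOYEE\" is an undefined name.\n"
                    "\tat com.ibm.connect.jdbc.JdbcInputInteraction.init"
                    "(JdbcInputInteraction.java:158)"
                ),
            }
        ],
    }
}

for code, message in summarize_errors(payload):
    print(f"{code}: {message}")
```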
To get the logs produced by the engine, use the following API:
GET /v2/data_flows/{data_flow_id}/runs/{data_flow_run_id}/logs?project_id={project_id}
Data Profiles
Introduction
Data profiles contain classification and information about the distribution of your data, which helps you to understand your data better and make the appropriate data shaping decisions.
Data profiles are automatically created when a data set is added to a catalog with data policy enforcement. The profile summary helps you in analyzing your data more closely and in deciding which cleansing operations on your data will provide the best results for your use-case. You can also perform CRUD operations on data profiles for data sets in catalogs or projects without data policy enforcement.
Create a data profile
You can use this API to:
- Create a data profile
- Create and execute a data profile
To create a data profile for a data set in a specified catalog or project and not execute it, call the following POST method:
POST /v2/data_profiles?start=false
OR
POST /v2/data_profiles
To create a data profile for a data set in a specified catalog or project and execute it, call the following POST method:
POST /v2/data_profiles?start=true
The minimal request payload required to create a data profile is as follows:
{
"metadata": {
"dataset_id": "{DATASET_ID}",
"catalog_id": "{CATALOG_ID}"
}
}
OR
{
"metadata": {
"dataset_id": "{DATASET_ID}",
"project_id": "{PROJECT_ID}"
}
}
The request payload can have an entity part, which is optional:
{
"metadata": {
"dataset_id": "{DATASET_ID}",
"catalog_id": "{CATALOG_ID}"
},
"entity": {
"data_profile": {
"options": {
"max_row_count": {MAX_ROW_COUNT_VALUE},
"max_distribution_size": {MAX_SIZE_OF_DISTRIBUTIONS},
"max_numeric_stats_bins": {MAX_NUMBER_OF_STATIC_BINS},
"classification_options": {
"disabled": {BOOLEAN_TO_ENABLE_OR_DISABLE_CLASSIFICATION_OPTIONS},
"class_codes": {DATA_CLASS_CODE},
"items": {ITEMS}
}
}
}
}
}
The following parameters can be specified in the URI and the payload:
- start: Specifies whether to start the profiling service immediately after the data profile is created. The default is false.
- max_row_count: Specifies the maximum number of rows to perform profiling on. If no value is provided or if the value is invalid (negative), the default is 5000 rows.
- row_percentage: Specifies the percentage of rows to perform profiling on. A value is invalid if it is less than 0 or greater than 100.
- max_distribution_size: Specifies the maximum size of various distributions produced by the profiling process. If no value is provided, the default is 100.
- max_numeric_stats_bins: Specifies the maximum number of bins to use in the numerical statistics. If no bin size is provided, the default is 100 bins.
- classification_options: Specifies the various options available for classification.
  - disabled: If true, the classification options are disabled and default values are used.
  - class_codes: Specifies the data class code to consider during profiling.
  - items: Specifies the items.
Note: You can get various data class codes through the data class service.
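Putting the URI parameter and the payload together, a request for this API could be assembled as sketched below. The function name data_profile_request is hypothetical, the IDs are placeholders, and the rule that exactly one of catalog_id or project_id is given is an assumption drawn from the two minimal payloads shown above. Nothing is sent to the service.

```python
import json
from urllib.parse import urlencode


def data_profile_request(dataset_id, catalog_id=None, project_id=None,
                         start=False, options=None):
    """Return (path, body) for POST /v2/data_profiles; nothing is sent."""
    if (catalog_id is None) == (project_id is None):
        raise ValueError("specify exactly one of catalog_id or project_id")
    metadata = {"dataset_id": dataset_id}
    if catalog_id is not None:
        metadata["catalog_id"] = catalog_id
    else:
        metadata["project_id"] = project_id
    body = {"metadata": metadata}
    if options:
        # The entity part is optional and carries the profiling options.
        body["entity"] = {"data_profile": {"options": options}}
    path = "/v2/data_profiles?" + urlencode({"start": str(start).lower()})
    return path, body


path, body = data_profile_request(
    "my-dataset-id",             # placeholder dataset ID
    project_id="my-project-id",  # placeholder project ID
    start=True,
    options={"max_row_count": 1000},
)
print(path)
print(json.dumps(body, indent=2))
```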
To create a data profile for a data set, the following steps must be completed:
- You must have a valid IAM token to make REST API calls and a project or catalog ID.
- You must have an IBM Cloud Object Storage bucket, which must be associated with your catalog in the project.
- The data set must be added to your catalog in the project.
- Construct a request payload to create a data profile with the values required in the payload.
- Send a POST request to create a data profile.
When you call the method, the payload is validated. If a required value is not specified or a value is invalid, you get a response message with an HTTP status code of 400 and information about the invalid or missing values.
The response of the method includes a location header with a value that indicates the location of the profile that was created. The response body also includes an href field, which contains the location of the created profile.
The execution.status of the profile is none if the start parameter is not set or is set to false. Otherwise, it is submitted or any other state, depending on the profiling execution status.
The following are possible response codes for this API call:
Response HTTP status | Cause | Possible Scenarios |
---|---|---|
201 | Created | A data profile was created. |
400 | Bad Request | The request payload either had some invalid values or invalid/unwanted parameters. |
401 | Unauthorized | Invalid IAM token was provided in the request header. |
403 | Forbidden | User is not allowed to create a data profile. |
500 | Internal Server Error | Some runtime error occurred. |
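The create call described above can be sketched in Python with only the standard library. This is a minimal sketch under stated assumptions, not an official client: the helper name build_create_profile_request is hypothetical, the Authorization: Bearer header format is assumed from the IAM bearer token requirement, and the request is only constructed, not sent.

```python
import json
import urllib.request

API_ENDPOINT = "https://api.dataplatform.cloud.ibm.com"


def build_create_profile_request(token, dataset_id, catalog_id=None,
                                 project_id=None, start=True, options=None):
    """Build (but do not send) a POST /v2/data_profiles request.

    Hypothetical helper: exactly one of catalog_id or project_id is expected.
    """
    if (catalog_id is None) == (project_id is None):
        raise ValueError("specify exactly one of catalog_id or project_id")

    metadata = {"dataset_id": dataset_id}
    if catalog_id is not None:
        metadata["catalog_id"] = catalog_id
    else:
        metadata["project_id"] = project_id

    payload = {"metadata": metadata}
    if options:
        # Optional entity part, for example {"max_row_count": 1000}.
        payload["entity"] = {"data_profile": {"options": options}}

    start_flag = "true" if start else "false"
    return urllib.request.Request(
        f"{API_ENDPOINT}/v2/data_profiles?start={start_flag}",
        data=json.dumps(payload).encode("utf-8"),
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",  # assumed header format
            "Content-Type": "application/json",
        },
    )
```

Passing the returned object to urllib.request.urlopen would send the request. Keeping construction separate from sending makes the payload easy to inspect before any call is made.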
Get a data profile
To get a data profile for a data set in a specified catalog or project, call the following GET method:
GET /v2/data_profiles/{PROFILE_ID}?catalog_id={CATALOG_ID}&dataset_id={DATASET_ID}
OR
GET /v2/data_profiles/{PROFILE_ID}?project_id={PROJECT_ID}&dataset_id={DATASET_ID}
The value of PROFILE_ID is the value of metadata.guid from the successful response payload of the create data profile call.
For other runtime errors, you might get an HTTP status code of 500, indicating that profiling did not finish as expected.
The following are possible response codes for this API call:
Response HTTP status | Cause | Possible Scenarios |
---|---|---|
200 | Success | Data profile is created and executed. |
202 | Accepted | Data profile is created and under execution. |
401 | Unauthorized | Invalid IAM token was provided in the request header. |
403 | Forbidden | User is not allowed to get the data profile. |
404 | Not Found | The data profile specified was not found. |
500 | Internal Server Error | Some runtime error occurred. |
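Because a 202 response means that the profile is still under execution, a caller typically repeats the GET call until it returns 200. The following Python sketch shows one way to build the request path and to decide what to do with each status code. The function names are hypothetical, and the retry decision is an assumption based on the response codes listed above.

```python
from urllib.parse import urlencode


def get_profile_path(profile_id, dataset_id, catalog_id=None, project_id=None):
    """Return the path and query string for GET /v2/data_profiles/{PROFILE_ID}."""
    if (catalog_id is None) == (project_id is None):
        raise ValueError("specify exactly one of catalog_id or project_id")
    if catalog_id is not None:
        scope = ("catalog_id", catalog_id)
    else:
        scope = ("project_id", project_id)
    query = urlencode([scope, ("dataset_id", dataset_id)])
    return f"/v2/data_profiles/{profile_id}?{query}"


def next_action(status_code):
    """Map a GET response status code to 'done', 'retry', or 'error'."""
    if status_code == 200:  # data profile is created and executed
        return "done"
    if status_code == 202:  # data profile is created and under execution
        return "retry"
    return "error"          # any other code in the table, or an unexpected one
```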
Update a data profile
To update a data profile for a data set in a specified catalog or project, call the following PATCH method:
PATCH /v2/data_profiles/{PROFILE_ID}?catalog_id={CATALOG_ID}&dataset_id={DATASET_ID}
OR
PATCH /v2/data_profiles/{PROFILE_ID}?project_id={PROJECT_ID}&dataset_id={DATASET_ID}
The value of PROFILE_ID is the value of metadata.guid from the successful response payload of the create data profile call.
The JSON request payload must be as follows:
[
{
"op": "add",
"path": "string",
"from": "string",
"value": {}
}
]
During update, the entire data profile is replaced, apart from any read-only or response-only attributes.
If profiling processes are running and the start parameter is set to true, the data profile is updated only if the stop_in_progress_runs parameter is set to true.
The updates must be specified by using the JSON patch format, described in RFC 6902.
Modify asset level classification
This API is used for CRUD operations on asset level classification.
To modify the asset level classification details in the data_profile parameter for a data set in a specified catalog or project, call the following PATCH method:
PATCH /v2/data_profiles/classification?catalog_id={CATALOG_ID}&dataset_id={DATASET_ID}
OR
PATCH /v2/data_profiles/classification?project_id={PROJECT_ID}&dataset_id={DATASET_ID}
The JSON request payload must be structured in the following way:
[
{
"op": "add",
"path": "/data_classification",
"value": [
{
"id":"{ASSET_LEVEL_CLASSIFICATION_ID}",
"name":"{ASSET_LEVEL_CLASSIFICATION_NAME}"
}
]
}
]
The path attribute must be set exactly as shown in the previous JSON request payload; otherwise, you get a validation error with an HTTP status code of 400.
The values of ASSET_LEVEL_CLASSIFICATION_ID and ASSET_LEVEL_CLASSIFICATION_NAME can be PII and PII details, respectively.
The data updates must be specified by using the JSON patch format, described in RFC 6902 [https://tools.ietf.org/html/rfc6902]. For more details about JSON patch, see [http://jsonpatch.com].
A successful response has an HTTP status code of 200 and lists the asset level classifications.
The following are possible response codes for this API call:
Response HTTP status | Cause | Possible Scenarios |
---|---|---|
200 | Success | Asset Level Classification is added to the asset. |
400 | Bad Request | The request payload either had some invalid values or invalid/unwanted parameters. |
401 | Unauthorized | Invalid IAM token was provided in the request header. |
403 | Forbidden | User is not allowed to add asset level classification to the asset. |
500 | Internal Server Error | A runtime error occurred. |
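The classification payload shown earlier can be generated instead of written by hand, which helps keep the path attribute fixed. The following Python sketch is an illustration only; the function name build_classification_patch is hypothetical.

```python
import json


def build_classification_patch(classifications):
    """Build the JSON patch body for PATCH /v2/data_profiles/classification.

    classifications is an iterable of (id, name) pairs.
    """
    return [
        {
            "op": "add",
            # The path must match the documented value, or the call returns 400.
            "path": "/data_classification",
            "value": [{"id": cid, "name": name} for cid, name in classifications],
        }
    ]


# Serialize the patch for use as the request body.
body = json.dumps(build_classification_patch([("PII", "PII details")]))
```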
Delete a data profile
To delete a data profile for a data set in a specified catalog or project, call the following DELETE method:
DELETE /v2/data_profiles/{PROFILE_ID}?catalog_id={CATALOG_ID}&dataset_id={DATASET_ID}&stop_in_progress_profiling_runs=false
OR
DELETE /v2/data_profiles/{PROFILE_ID}?project_id={PROJECT_ID}&dataset_id={DATASET_ID}&stop_in_progress_profiling_runs=true
The value of PROFILE_ID is the value of metadata.guid from the successful response payload of the create data profile call.
You can't delete a profile if the profiling execution status is running and the query parameter stop_in_progress_profiling_runs is set to false.
A successful response has an HTTP status code of 204.
Troubleshooting your way out if something goes wrong
If a call to any of the API endpoints fails and you cannot pinpoint the issue from the error message (mostly in cases of an Internal Server Error with a 500 HTTP status code), you can retrieve the profiling data flow run logs and look at all the steps behind the scenes to figure out what went wrong.
One possible scenario is that the profiling data flow did not complete as expected. A common culprit is that profiling data flows are not able to connect to sources or targets based on the connection information that is specified in the request payload. From a profiling perspective, this means that either the connection was not created for the catalog or project, or the attachment for the data set has inconsistent interaction properties (in the case of a remote attachment).
To get the profiling data flow run logs, call the following GET method:
GET /v2/data_flows/{DATA_FLOW_ID}/runs/{DATA_FLOW_RUN_ID}/logs?catalog_id={CATALOG_ID}
OR
GET /v2/data_flows/{DATA_FLOW_ID}/runs/{DATA_FLOW_RUN_ID}/logs?project_id={PROJECT_ID}
The values of DATA_FLOW_ID and DATA_FLOW_RUN_ID are present in the response payload of the GET profile call at the paths entity.data_profile.execution.dataflow_id and entity.data_profile.execution.dataflow_run_id, respectively.
The response to the GET method includes information about each log event, including the event time, message type, and message text.
A maximum of 100 logs is returned per page. To specify a lower limit, use the limit query parameter with an integer value. More logs than those on the first page might be available. To get the next page, call a GET method using the value of the next.href member from the response payload.
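The paging behavior described above can be sketched as a small Python generator. This is a sketch under assumptions: fetch_json is a hypothetical caller-supplied function that performs an authenticated GET and returns the decoded JSON response as a dict, and a response without a next member is assumed to be the last page.

```python
def iter_log_pages(fetch_json, first_url):
    """Yield each page of run logs, following next.href until it is absent.

    fetch_json is a hypothetical callable: it takes a URL and returns the
    decoded JSON response body as a dict.
    """
    url = first_url
    while url:
        page = fetch_json(url)
        yield page
        # Assumption: the last page has no "next" member.
        url = (page.get("next") or {}).get("href")
```

Injecting fetch_json keeps the paging logic independent of the HTTP library and of how the bearer token is attached.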
Stream Flows
Introduction
The streams flow service provides APIs to create, update, delete, list, start, and stop stream flows.
A streams flow is a continuous flow of massive volumes of moving data that real-time analytics can be applied to. A streams flow can read data from a variety of sources, process that data by using analytic operations or your custom code, and then write it to one or more targets. You can access and analyze massive amounts of changing data as it is created. Regardless of whether the data is structured or unstructured, you can leverage data at scale to drive real-time analytics for up-to-the-minute business decisions.
The sources that are supported include Kafka, Message Hub, MQTT, and Watson IoT. Targets that are supported include Db2 Warehouse on Cloud, Cloud Object Storage, and Redis. Analytic operators that are supported include Aggregation, Python Machine Learning, Code, and Geofence.
Authorization
Authorization is done via an Identity and Access Management (IAM) bearer token. All API calls require this bearer token in the header.
Create a Streams Flow
1. Streaming Analytics instance ID
The streams flow is submitted to a Streaming Analytics service for compilation and running. When creating a flow, the Streaming Analytics instance ID must be provided. The instance ID can be found in the service credentials, which can be accessed from the service dashboard.
2. The pipeline graph
The streams flow represents its source, targets, and operations in a pipeline graph. The pipeline graph can be generated by choosing the relevant operators in the Streams Designer canvas. To retrieve a pipeline graph created by the Streams Designer, use:
GET /v2/streams_flows/85be3e09-1c71-45d3-8d5d-220d6a6ea850?project_id=ff1ab70b-0553-409a-93f9-ccc31471c218
This will return a streams flow containing a pipeline field in the entity. This pipeline object can be copied and submitted into another flow via:
POST /v2/streams_flows/?project_id=ff1ab70b-0553-409a-93f9-ccc31471c218
Request Payload:
{
"name": "My Streams Flow",
"description": "A Sample Streams Flow.",
"engines": {
"streams": {
"instance_id": "8ff81caa-1076-41ce-8de1-f4fe8d79e30e"
}
},
"pipeline": {
"doc_type": "pipeline",
"version": "1.0",
"json_schema": "http://www.ibm.com/ibm/wdp/flow-v1.0/pipeline-flow-v1-schema.json",
"id": "",
"app_data": {
"ui_data": {
"name": "mqtt 2"
}
},
"primary_pipeline": "primary-pipeline",
"pipelines": [
{
"id": "primary-pipeline",
"runtime": "streams",
"nodes": [
{
"id": "messagehubsample_29xse4zvabe",
"type": "binding",
"op": "ibm.streams.sources.messagehubsample",
"outputs": [
{
"id": "target",
"schema_ref": "schema0",
"links": [
{
"node_id_ref": "mqtt_o6are9c4f",
"port_id_ref": "source"
}
]
}
],
"parameters": {
"schema_mapping": [
{
"name": "time_stamp",
"type": "timestamp",
"path": "/time_stamp"
},
{
"name": "customerId",
"type": "double",
"path": "/customerId"
},
{
"name": "latitude",
"type": "double",
"path": "/latitude"
},
{
"name": "longitude",
"type": "double",
"path": "/longitude"
}
]
},
"connection": {
"ref": "EXAMPLE_MESSAGE_HUB_CONNECTION",
"project_ref": "EXAMPLE",
"properties": {
"asset": {
"path": "/geofenceSampleData",
"type": "topic",
"name": "Geospatial data",
"id": "geofenceSampleData"
}
}
},
"app_data": {
"ui_data": {
"label": "Sample Data",
"x_pos": 60,
"y_pos": 90
}
}
},
{
"id": "mqtt_o6are9c4f",
"type": "binding",
"op": "ibm.streams.targets.mqtt",
"parameters": {},
"connection": {
"ref": "cd5388c3-b203-4c77-803b-bc902d864a30",
"project_ref": "a912d673-54d3-4e5c-800f-5088554d3aa8",
"properties": {
"asset": "t"
}
},
"app_data": {
"ui_data": {
"label": "MQTT",
"x_pos": 420,
"y_pos": 90
}
}
},
{
"id": "mqtt_y84zc3vfche",
"type": "binding",
"op": "ibm.streams.sources.mqtt",
"outputs": [
{
"id": "target",
"schema_ref": "schema1",
"links": [
{
"node_id_ref": "debug_9avg3zdig25",
"port_id_ref": "source"
}
]
}
],
"parameters": {
"schema_mapping": [
{
"name": "time_stamp",
"type": "timestamp",
"path": "/time_stamp"
},
{
"name": "customerId",
"type": "double",
"path": "/customerId"
},
{
"name": "latitude",
"type": "double",
"path": "/latitude"
},
{
"name": "longitude",
"type": "double",
"path": "/longitude"
}
]
},
"connection": {
"ref": "cd5388c3-b203-4c77-803b-bc902d864a30",
"project_ref": "a912d673-54d3-4e5c-800f-5088554d3aa8",
"properties": {
"asset": "t"
}
},
"app_data": {
"ui_data": {
"label": "MQTT",
"x_pos": -120,
"y_pos": -210
}
}
},
{
"id": "debug_9avg3zdig25",
"type": "binding",
"op": "ibm.streams.targets.debug",
"parameters": {},
"app_data": {
"ui_data": {
"label": "Debug",
"x_pos": 240,
"y_pos": -270
}
}
}
]
}
],
"schemas": [
{
"id": "schema0",
"fields": [
{
"name": "time_stamp",
"type": "timestamp"
},
{
"name": "customerId",
"type": "double"
},
{
"name": "latitude",
"type": "double"
},
{
"name": "longitude",
"type": "double"
}
]
},
{
"id": "schema1",
"fields": [
{
"name": "time_stamp",
"type": "timestamp"
},
{
"name": "customerId",
"type": "double"
},
{
"name": "latitude",
"type": "double"
},
{
"name": "longitude",
"type": "double"
}
]
}
]
}
}
Streams Flow Lifecycle
After a streams flow is created, it is in the STOPPED state unless it has been submitted as a job to be started. When a job is started, a Cloudant asset is created to track the status of the streams flow run. The start job operation can take up to a minute to complete, during which time the streams flow is in the STARTING state. Once the submission and compilation have completed, the streams flow is in the RUNNING state.
To change the run state, use the POST API:
POST /v2/streams_flows/85be3e09-1c71-45d3-8d5d-220d6a6ea850/runs?project_id=ff1ab70b-0553-409a-93f9-ccc31471c218
Request Payload:
{
"state": "started",
"allow_streams_start": true
}
To start the streams flow run, use { state: started }. To stop the flow's run, use { state: stopped }.
Specify "allow_streams_start" to start the Streaming Analytics service in the event that it is stopped.
The start job operation triggers a long-running process on the Streaming Analytics service instance. During this time, the progress and status of this job can be viewed with:
GET https://api.dataplatform.cloud.ibm.com/v2/streams_flows/85be3e09-1c71-45d3-8d5d-220d6a6ea850/runs?project_id=ff1ab70b-0553-409a-93f9-ccc31471c218
A version of the pipeline that has been deployed is saved to represent the Runtime Pipeline. The streams flow can still be edited in the Streams Designer, and the edits have no impact on the deployed Runtime Pipeline until the user stops the running flow and starts it again.
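The run state request body can be sketched in Python as follows. The function name build_run_state_payload is hypothetical. The sketch assumes that only the two documented states are accepted, and it includes allow_streams_start only when a start is requested, which is a design choice of this example rather than a documented requirement.

```python
import json

VALID_STATES = ("started", "stopped")


def build_run_state_payload(state, allow_streams_start=False):
    """Return the JSON body for POST /v2/streams_flows/{id}/runs."""
    if state not in VALID_STATES:
        raise ValueError(f"state must be one of {VALID_STATES}")
    payload = {"state": state}
    if state == "started" and allow_streams_start:
        # Asks the service to start a stopped Streaming Analytics instance.
        payload["allow_streams_start"] = True
    return json.dumps(payload)
```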
Metadata Discovery
Metadata Discovery can be used to automatically discover assets from a connection. The connection used for a discovery run can be associated with a catalog or project, but new data assets will be created in a project. Each asset that is discovered from a connection is added as a data asset to the project.
For a list of the supported types of connections against which the Metadata Discovery service can be invoked, see Discover data assets from a connection.
In general, the discovery process takes a significant amount of time. Therefore, the API to create a discovery run actually only queues a discovery run and then returns immediately (typically before the discovery run is even started). Subsequent calls to different APIs can then be made to monitor the progress of the discovery run (see Monitoring a metadata discovery run and Retrieving discovered assets).
The following example shows a request to create a metadata discovery run. It assumes that a project, a connection, and a catalog have already been created, and that their IDs are known by the caller. If a catalog is provided (as in the following example), the connection is associated with the catalog. If no catalog is provided, the connection is associated with the project.
Note: In the following examples, the discovered assets are found in a connection to a DB2 database, but the details of the database are hidden within the connection. So, the caller of the data_discoveries API specifies the database to discover indirectly via the connection.
API request - Create discovery run:
POST /v2/data_discoveries
Request payload:
{
"entity": {
"catalog_id": "816882fa-dcda-46e1-8c6b-fa23c3cbad14",
"connection_id": "f638398f-fcc7-4856-b78d-5c8efa5b9282",
"project_id": "960f6aff-295f-4de1-a9d7-f3b6805b3590"
}
}
In the example request payload, you can see the ID of the connection whose assets will be discovered, and the ID of the project into which the newly created assets will be added.
{
"metadata": {
"id": "dcb8a234ad5e438d904a4cdbe0ba70e2",
"invoked_by": "IBMid-50S...",
"bss_account_id": "e348e...",
"created_at": "2018-06-22T15:42:02.843Z"
},
"entity": {
"status": "CREATED",
"connection_id": "f638398f-fcc7-4856-b78d-5c8efa5b9282",
"catalog_id": "816882fa-dcda-46e1-8c6b-fa23c3cbad14",
"project_id": "960f6aff-295f-4de1-a9d7-f3b6805b3590"
}
}
In the response, you can see that the discovery run was created with the ID dcb8a234ad5e438d904a4cdbe0ba70e2, which you'll need to use if you want to get the status of the discovery run that you just created. Also shown in the response are:
- invoked_by: the IAM ID of the account that kicked off the discovery process
- bss_account_id: the BSS account ID of the catalog
- created_at: the creation date and time of the discovery job
To get the status of a discovery run, use the GET data_discoveries API. You can request the status of a discovery run as often as desired. The following sections show a few such calls to illustrate the progression of a discovery run.
API Request - Get status of discovery run:
GET /v2/data_discoveries/dcb8a234ad5e438d904a4cdbe0ba70e2
There is no request payload for the previous GET data_discoveries request. Instead, the ID of the discovery run whose status is being requested is supplied as a path parameter. In the previous URL, use the discovery run ID that was returned by the earlier call to POST data_discoveries. If you no longer have access to the ID of the discovery run for which you want to see status information, see the section Call Discovery API to get the ID of a metadata discovery run.
The following examples show various responses to the same GET data_discoveries monitor request previously shown, made at various points during the discovery run.
Response to status request immediately after creation of a discovery run:
{
"metadata": {
"id": "dcb8a234ad5e438d904a4cdbe0ba70e2",
"invoked_by": "IBMid-50S...",
"bss_account_id": "e348e...",
"created_at": "2018-06-22T15:42:02.843Z"
},
"entity": {
"status": "CREATED",
"connection_id": "f638398f-fcc7-4856-b78d-5c8efa5b9282",
"catalog_id": "816882fa-dcda-46e1-8c6b-fa23c3cbad14",
"project_id": "960f6aff-295f-4de1-a9d7-f3b6805b3590"
}
}
In the previous response, you can see that the status of the discovery run has not yet changed: it is still CREATED. This is because the request to discover assets is put into a queue and is initiated in the order in which it was received.
Response to status request immediately after a discovery run has actually started:
{
"metadata": {
"id": "dcb8a234ad5e438d904a4cdbe0ba70e2",
"invoked_by": "IBMid-50S...",
"bss_account_id": "e348e...",
"created_at": "2018-06-22T15:42:02.843Z",
"started_at": "2018-06-22T15:42:06.167Z",
"ref_project_connection_id": "2526ed95-dedd-4904-bb31-c06d9cb1e105"
},
"entity": {
"statistics": {
},
"status": "RUNNING",
"connection_id": "f638398f-fcc7-4856-b78d-5c8efa5b9282",
"catalog_id": "816882fa-dcda-46e1-8c6b-fa23c3cbad14",
"project_id": "960f6aff-295f-4de1-a9d7-f3b6805b3590"
}
}
Notice that the status has changed to RUNNING, which indicates that the discovery process has actually started. Also, the metadata field has some additional fields:
- started_at: the date and time at which the discovery run started
- ref_project_connection_id: a reference to a cloned project connection ID, internally set when a discovery is created for a connection in a catalog
In addition, notice that a new statistics object was introduced into the response body. In the response, that object is empty because the discovery run, which has just started, hasn't yet discovered any assets.
Response to status request after some assets were discovered:
{
"metadata": {
"id": "dcb8a234ad5e438d904a4cdbe0ba70e2",
"invoked_by": "IBMid-50S...",
"bss_account_id": "e348e...",
"created_at": "2018-06-22T15:42:02.843Z",
"started_at": "2018-06-22T15:42:06.167Z",
"discovered_at": "2018-06-22T15:42:27.970Z",
"ref_project_connection_id": "2526ed95-dedd-4904-bb31-c06d9cb1e105"
},
"entity": {
"statistics": {
"discovered": 128,
"submit_succ": 128
},
"status": "RUNNING",
"connection_id": "f638398f-fcc7-4856-b78d-5c8efa5b9282",
"catalog_id": "816882fa-dcda-46e1-8c6b-fa23c3cbad14",
"project_id": "960f6aff-295f-4de1-a9d7-f3b6805b3590"
}
}
Notice the statistics object now contains two fields:
- discovered: the number of assets discovered so far during the discovery run
- submit_succ: the number of assets successfully submitted for creation so far during the discovery run. A discovered asset goes through an internal pipeline with various stages from being discovered at the connection to being created in the project. Here, submitted means the asset was submitted to the internal pipeline.
Refer to Watson Data API schema for the complete list of the possible fields that might show up in the statistics object.
Because the discovery run isn't yet finished, the status in the previous response is still RUNNING.
Response to status request after the discovery run was completed:
{
"metadata": {
"id": "dcb8a234ad5e438d904a4cdbe0ba70e2",
"invoked_by": "IBMid-50S...",
"bss_account_id": "e348e...",
"created_at": "2018-06-22T15:42:02.843Z",
"started_at": "2018-06-22T15:42:06.167Z",
"discovered_at": "2018-06-22T15:42:27.970Z",
"processed_at": "2018-06-22T15:42:45.877Z",
"finished_at": "2018-06-22T15:43:14.969Z",
"ref_project_connection_id": "2526ed95-dedd-4904-bb31-c06d9cb1e105"
},
"entity": {
"statistics": {
"discovered": 179,
"submit_succ": 179,
"create_succ": 179
},
"status": "COMPLETED",
"connection_id": "f638398f-fcc7-4856-b78d-5c8efa5b9282",
"catalog_id": "816882fa-dcda-46e1-8c6b-fa23c3cbad14",
"project_id": "960f6aff-295f-4de1-a9d7-f3b6805b3590"
}
}
Notice the status field has changed to COMPLETED to indicate that the discovery run is finished. Other response fields to note:
- finished_at: the date and time at which the discovery run finished
- discovered: indicates that 179 assets were discovered at the connection
- submit_succ: indicates that 179 of the discovered assets were successfully submitted to the discovery run's internal asset processing pipeline
- create_succ: indicates that 179 assets were successfully created in the project
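The status progression shown in the previous responses (CREATED, then RUNNING, then COMPLETED) lends itself to a simple polling loop. The following Python sketch is an illustration under assumptions: fetch_status is a hypothetical caller-supplied function that performs the GET data_discoveries call and returns the decoded JSON response, and only COMPLETED and ABORTED (the status set when a run is stopped, described later in this section) are treated as terminal. Other terminal statuses might exist and would need to be added.

```python
import time

# Assumption: these are the only terminal statuses handled by this sketch.
TERMINAL_STATUSES = {"COMPLETED", "ABORTED"}


def wait_for_discovery(fetch_status, interval_seconds=5.0, max_polls=120,
                       sleep=time.sleep):
    """Poll a discovery run until it reaches a terminal status.

    fetch_status is a hypothetical callable that returns the decoded JSON
    response of GET /v2/data_discoveries/{id} as a dict.
    """
    for _ in range(max_polls):
        response = fetch_status()
        if response["entity"]["status"] in TERMINAL_STATUSES:
            return response
        sleep(interval_seconds)
    raise TimeoutError("discovery run did not finish within the polling budget")
```

The sleep parameter is injectable so that the loop can be exercised without real delays.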
At any time during or after a discovery run, you can call Asset APIs to get the list of metadata for the currently discovered assets in the project. To retrieve metadata for any list of assets, you can make the following call:
POST /v2/asset_types/{type_name}/search?project_id={project_id}
More specifically, to find the metadata for discovered assets, the value to use for the {type_name} path parameter is discovered_asset. So, for the discovery run we created, the call to retrieve metadata for the discovered assets would look like this:
API Request - Get metadata for discovered assets:
POST /v2/asset_types/discovered_asset/search?project_id=960f6aff-295f-4de1-a9d7-f3b6805b3590
where the project_id query parameter value 960f6aff-295f-4de1-a9d7-f3b6805b3590 is the same value that was specified in the body of the POST request that was used to create the discovery run.
In addition, the ID of the connection that the discovery was run against has to be specified in the body of the POST, like this:
{
"query": "discovered_asset.connection_id:\"f638398f-fcc7-4856-b78d-5c8efa5b9282\""
}
Here is part of the response body for the previous query:
{
"total_rows": 179,
"results": [
{
"metadata": {
"name": "EMP_SURVEY_TOPIC_DIM",
"description": "Warehouse table EMP_SURVEY_TOPIC_DIM describes employee survey questions for employees of the Great Outdoors Company, in supported languages.",
"tags": [
"discovered",
"GOSALESDW"
],
"asset_type": "data_asset",
"origin_country": "ca",
"rating": 0.0,
"total_ratings": 0,
"sandbox_id": "960f6aff-295f-4de1-a9d7-f3b6805b3590",
"catalog_id": "a682c698-6019-437d-a0b9-224aa0a4dbc9",
"created": 0,
"created_at": "2018-06-22T15:41:47Z",
"owner": "[email protected]",
"owner_id": "IBMid-50S...",
"size": 0,
"version": 0.0,
"usage": {
"last_update_time": 1.52968210955E12,
"last_updater_id": "iam-ServiceId-87f49...",
"access_count": 0.0,
"last_accessor_id": "iam-ServiceId-87f49...",
"last_access_time": 1.52968210955E12,
"last_updater": "ServiceId-87f49...",
"last_accessor": "ServiceId-87f49..."
},
"asset_state": "available",
"asset_attributes": [
"data_asset",
"discovered_asset"
],
"rov": {
"mode": 0
},
"asset_category": "USER",
"asset_id": "e35cfd4d-590f-40a5-b75c-ec07c0a4bcbc"
}
},
.
.
.
}
Notice that the total_rows value 179 matches the create_succ value that was returned in the result of the API call to get the final status of the completed discovery run.
The results array in the previous response body has an entry containing metadata for each asset that was discovered by the discovery run. In the previous code snippet, for brevity, only the first of the 179 entries is shown. The metadata created by the discovery run includes:
- name: in this case, the name of the DB2 table that was discovered
- description: a description of the table as provided by DB2
- tags: these are useful for searching. The discovered tag is one of the tags set for a discovered asset.
- asset_type: the type of the asset that was created in the project
Each entry in the results array also contains an href field that points to the actual asset that was created by the discovery run.
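The search request body and the handling of its response can be sketched in Python as follows. The function names are hypothetical, and the sketch assumes that the response has the shape shown in the previous example (a results array whose entries contain metadata.name).

```python
import json


def build_discovered_asset_query(connection_id):
    """Return the JSON body for POST /v2/asset_types/discovered_asset/search."""
    query = f'discovered_asset.connection_id:"{connection_id}"'
    return json.dumps({"query": query})


def asset_names(search_response):
    """Extract the asset names from a decoded search response."""
    return [entry["metadata"]["name"]
            for entry in search_response.get("results", [])]
```

json.dumps takes care of escaping the double quotation marks around the connection ID, which is easy to get wrong when the body is assembled by hand.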
There might be times in which you no longer have the ID of the metadata discovery run whose status you're interested in, and so might not be able to call the following API for the specific discovery run you're interested in (which requires that ID):
GET /v2/data_discoveries/dcb8a234ad5e438d904a4cdbe0ba70e2
The following example illustrates how to get the IDs of metadata discovery runs for the connection and catalog that were used in the previous call to create a discovery run:
API Request - Get information for discovery runs:
GET /v2/data_discoveries?offset=0&limit=1000&connection_id=f638398f-fcc7-4856-b78d-5c8efa5b9282&catalog_id=816882fa-dcda-46e1-8c6b-fa23c3cbad14
Note that the values of the query parameters connection_id and catalog_id correspond to the values for the identically named fields in the payload for the previous request to create a discovery run.
Notice also that you can use the offset and limit query parameters to focus on a particular subset of the full list of related discoveries.
The response payload will look like this:
{
"resources": [
{
"metadata": {
"id": "dcb8a234ad5e438d904a4cdbe0ba70e2",
"invoked_by": "IBMid-50S...",
"bss_account_id": "e348e...",
"created_at": "2018-06-22T15:42:02.843Z",
"started_at": "2018-06-22T15:42:06.167Z",
"discovered_at": "2018-06-22T15:42:27.970Z",
"processed_at": "2018-06-22T15:42:45.877Z",
"finished_at": "2018-06-22T15:43:14.969Z",
"ref_project_connection_id": "2526ed95-dedd-4904-bb31-c06d9cb1e105"
},
"entity": {
"statistics": {
"discovered": 179,
"submit_succ": 179,
"create_succ": 179
},
"status": "COMPLETED",
"connection_id": "f638398f-fcc7-4856-b78d-5c8efa5b9282",
"catalog_id": "816882fa-dcda-46e1-8c6b-fa23c3cbad14",
"project_id": "960f6aff-295f-4de1-a9d7-f3b6805b3590"
}
}
],
"first": {
"href": "http://localhost:9080/v2/data_discoveries?offset=0&limit=1000&connection_id=f638398f-fcc7-4856-b78d-5c8efa5b9282&catalog_id=816882fa-dcda-46e1-8c6b-fa23c3cbad14"
},
"next": {
"href": "http://localhost:9080/v2/data_discoveries?offset=1000&limit=1000&connection_id=f638398f-fcc7-4856-b78d-5c8efa5b9282&catalog_id=816882fa-dcda-46e1-8c6b-fa23c3cbad14"
},
"limit": 1000,
"offset": 0
}
Anything that matches the query criteria is returned in the resources array. In the previous response, there is only one entry, and it corresponds to the discovery run that was created in the previous Create a metadata discovery run section.
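The list request shown earlier can be assembled with the standard library so that the query parameters are encoded consistently. The following Python sketch is illustrative; the function name discovery_list_path is hypothetical, and the default values for offset and limit are simply the ones used in the example request.

```python
from urllib.parse import urlencode


def discovery_list_path(connection_id, catalog_id, offset=0, limit=1000):
    """Return the path and query string for GET /v2/data_discoveries."""
    query = urlencode({
        "offset": offset,
        "limit": limit,
        "connection_id": connection_id,
        "catalog_id": catalog_id,
    })
    return f"/v2/data_discoveries?{query}"


def discovery_ids(list_response):
    """Extract the discovery run IDs from a decoded list response."""
    return [run["metadata"]["id"] for run in list_response.get("resources", [])]
```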
There might be times when you want to stop a discovery run before it's completed. To do so, use the PATCH data_discoveries API. The following example illustrates how to abort a discovery run (a different discovery run than the one used in the previous examples):
API Request - Abort a discovery run:
PATCH /v2/data_discoveries/09cbff0981f84c51be4b4d93becc17b0
The previous PATCH request requires the following request body to set the status of the discovery run to "ABORTED":
{
"op": "replace",
"path": "/entity/status",
"value": "ABORTED"
}
The response payload will look like this:
{
"metadata": {
"id": "09cbff0981f84c51be4b4d93becc17b0",
"invoked_by": "IBMid-50S...",
"bss_account_id": "e348e...",
"created_at": "2018-06-22T15:45:54.638Z",
"started_at": "2018-06-22T15:45:56.202Z",
"finished_at": "2018-06-22T15:46:02.274Z",
"ref_project_connection_id": "2526ed95-dedd-4904-bb31-c06d9cb1e105"
},
"entity": {
"statistics": {
},
"status": "ABORTED",
"connection_id": "f638398f-fcc7-4856-b78d-5c8efa5b9282",
"catalog_id": "816882fa-dcda-46e1-8c6b-fa23c3cbad14",
"project_id": "960f6aff-295f-4de1-a9d7-f3b6805b3590"
}
}
Notice in the previous response payload that the status has now been set to ABORTED.
Any assets discovered before the run was aborted remain discovered. In the example, the abort occurred so quickly after the creation of the discovery run that no assets had been discovered, hence the statistics object is empty.
Data Samples
Introduction
A data sample is a representative subset of a data set that you create before you begin processing the entire data set. Creating a data sample enables you to test and refine the operations that cleanse and shape data on a smaller portion of the data set. Working on a data subset helps you determine the quality and appropriateness of your data transformations for the type of data analysis you plan before you run those operations on the entire data set.
Create a data sample
To create a data sample for a data set in a specified catalog or project, call the following POST method:
POST /v2/data_samples
The JSON request payload for a data set in a catalog must be structured in the following way:
{
"dataset_id": "{DATASET_ID}",
"catalog_id": "{CATALOG_ID}",
"algorithm": {
"type": "{TYPE_OF_ALGORITHM}",
"seed": {SEED_VALUE},
"fraction": {FRACTION_VALUE}
}
}
The JSON request payload for a data set in a project must be structured in the following way:
{
"dataset_id": "{DATASET_ID}",
"project_id": "{PROJECT_ID}",
"algorithm": {
"type": "{TYPE_OF_ALGORITHM}",
"seed": {SEED_VALUE},
"max_rows": {MAX_ROWS_VALUE}
}
}
You can use either fraction or max_rows in the algorithm field of the payload to limit the size of the sample that you want to create. Notice that these fields are optional, including the type field, which specifies the algorithm to use for sampling. If no type is specified, the RANDOM algorithm is used by default. Currently, RANDOM is the only supported algorithm.
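The payload rules above can be sketched as a small Python helper. The function name build_sample_payload is hypothetical, and the sketch assumes that fraction and max_rows are mutually exclusive, which is how the either-or wording above is interpreted here. No range check is applied to fraction because the valid range is not stated.

```python
def build_sample_payload(dataset_id, catalog_id=None, project_id=None,
                         fraction=None, max_rows=None, seed=None):
    """Build the request payload for POST /v2/data_samples as a dict."""
    if (catalog_id is None) == (project_id is None):
        raise ValueError("specify exactly one of catalog_id or project_id")
    if fraction is not None and max_rows is not None:
        raise ValueError("use either fraction or max_rows, not both")

    # RANDOM is the default and currently the only supported algorithm.
    algorithm = {"type": "RANDOM"}
    if seed is not None:
        algorithm["seed"] = seed
    if fraction is not None:
        algorithm["fraction"] = fraction
    if max_rows is not None:
        algorithm["max_rows"] = max_rows

    payload = {"dataset_id": dataset_id, "algorithm": algorithm}
    if catalog_id is not None:
        payload["catalog_id"] = catalog_id
    else:
        payload["project_id"] = project_id
    return payload
```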
To create a data sample for a data set, perform the following steps:
- You must have a valid IAM token to make REST API calls and a project or catalog ID.
- You must have an IBM Cloud Object Storage bucket, which you must associate with your catalog in the project.
- The data set must be added to your catalog in the project.
- Construct a request payload for creating a data sample with the values required in the payload.
- Send a POST request to create the sample.
When you call the method, the payload is validated. If a required value is not specified or a value is invalid, you get a response message with an HTTP status code of 400 and information about the invalid or missing values.
The response of the method includes a location header with a value that indicates the location of the sample that was created. The response body also includes an entity.href field, which contains the location of the created sample.
The following example shows a success response:
{
"metadata": {
"asset_id": "93d3b425-9569-4f70-a53f-9192814769bd",
"dataset_id": "a0572944-86a6-49ee-9da3-cb45d73e8d8a",
"catalog_id": "3239e296-5aba-4256-aa9a-bfcf7b974e23",
"owner": "[email protected]",
"created_at": "2017-09-09T16:04:33.238Z"
},
"entity": {
"data_sample": {
"catalog_id": "3239e296-5aba-4256-aa9a-bfcf7b974e23",
"dataset_id": "a0572944-86a6-49ee-9da3-cb45d73e8d8a",
"algorithm": {
"type": "RANDOM",
"with_replacement": false,
"seed": 1,
"max_rows": 10000
},
"href": "https://wdp-dataconnect-ys1dev.stage1.mybluemix.net/v2/data_samples/93d3b425-9569-4f70-a53f-9192814769bd?dataset_id=a0572944-86a6-49ee-9da3-cb45d73e8d8a&catalog_id=3239e296-5aba-4256-aa9a-bfcf7b974e23",
"sample_execution": {
"status": "initiated"
}
}
}
}
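A short sketch of reading such a response in Python follows. It assumes the body has already been parsed from JSON into a dictionary; the helper name sample_id_and_status is hypothetical, and the dictionary below is a trimmed copy of the example response.

```python
from typing import Any, Dict, Tuple


def sample_id_and_status(body: Dict[str, Any]) -> Tuple[str, str]:
    """Return (sample ID, execution status) from a create-sample response body."""
    # metadata.asset_id is the value used as SAMPLE_ID in later calls.
    sample_id = body["metadata"]["asset_id"]
    status = body["entity"]["data_sample"]["sample_execution"]["status"]
    return sample_id, status


# Trimmed copy of the example success response.
response_body = {
    "metadata": {"asset_id": "93d3b425-9569-4f70-a53f-9192814769bd"},
    "entity": {"data_sample": {"sample_execution": {"status": "initiated"}}},
}
sample_id, status = sample_id_and_status(response_body)
```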
List all data samples
To list all data samples for a data set in a specified catalog or project, call the following GET method:
GET /v2/data_samples?catalog_id={CATALOG_ID}&dataset_id={DATASET_ID}
OR
GET /v2/data_samples?project_id={PROJECT_ID}&dataset_id={DATASET_ID}
The following example shows a success response:
{
"resources": [
{
"metadata": {
"asset_id": "93d3b425-9569-4f70-a53f-9192814769bd",
"dataset_id": "a0572944-86a6-49ee-9da3-cb45d73e8d8a",
"catalog_id": "3239e296-5aba-4256-aa9a-bfcf7b974e23",
"owner": "[email protected]",
"created_at": "2017-09-09T16:04:33.238Z"
},
"entity": {
"data_sample": {
"algorithm": {
"type": "RANDOM",
"with_replacement": false,
"seed": 1,
"max_rows": 10000
},
"sample_execution": {
"activity_id": "828b28dd-8cc3-4a29-b42c-9dbdc737aa98",
"activity_run_id": "0f54a748-2a73-43f6-b8b5-1da9a2b160ba",
"status": "finished"
}
}
}
}
]
}
Get the data sample for a data set
To get a data sample for a data set in a specified catalog or project, call the following GET method:
GET /v2/data_samples/{SAMPLE_ID}?catalog_id={CATALOG_ID}&dataset_id={DATASET_ID}
OR
GET /v2/data_samples/{SAMPLE_ID}?project_id={PROJECT_ID}&dataset_id={DATASET_ID}
The value of SAMPLE_ID is the value of metadata.asset_id from the successful response payload of the create sample call.
The following example shows a success response:
{
"metadata": {
"asset_id": "93d3b425-9569-4f70-a53f-9192814769bd",
"dataset_id": "a0572944-86a6-49ee-9da3-cb45d73e8d8a",
"catalog_id": "3239e296-5aba-4256-aa9a-bfcf7b974e23",
"owner": "[email protected]",
"created_at": "2017-09-09T16:04:33.238Z"
},
"entity": {
"data_sample": {
"algorithm": {
"type": "RANDOM",
"with_replacement": false,
"seed": 1,
"max_rows": 10000
},
"sample_execution": {
"activity_id": "828b28dd-8cc3-4a29-b42c-9dbdc737aa98",
"activity_run_id": "0f54a748-2a73-43f6-b8b5-1da9a2b160ba",
"status": "finished"
}
}
}
}
Get the data in a data sample
To get the data in a data sample in a specified catalog or project, call the following GET method:
GET /v2/data_samples/{SAMPLE_ID}/data?catalog_id={CATALOG_ID}&dataset_id={DATASET_ID}&_limit={LIMIT}&_offset={OFFSET}
OR
GET /v2/data_samples/{SAMPLE_ID}/data?project_id={PROJECT_ID}&dataset_id={DATASET_ID}&_limit={LIMIT}&_offset={OFFSET}
The value of SAMPLE_ID is the value of metadata.asset_id from the successful response payload of the create sample call.
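To illustrate how the _limit and _offset parameters fit into the request, here is a hedged Python sketch that assembles the URL for one page of sample data. The helper name sample_data_url is hypothetical; the base URL is the API endpoint listed in the introduction.

```python
from typing import Dict, Optional, Union
from urllib.parse import urlencode

BASE_URL = "https://api.dataplatform.cloud.ibm.com"


def sample_data_url(
    sample_id: str,
    dataset_id: str,
    catalog_id: Optional[str] = None,
    project_id: Optional[str] = None,
    limit: Optional[int] = None,
    offset: Optional[int] = None,
) -> str:
    """Build the URL for GET /v2/data_samples/{SAMPLE_ID}/data."""
    if (catalog_id is None) == (project_id is None):
        raise ValueError("Specify exactly one of catalog_id or project_id")
    params: Dict[str, Union[str, int]] = {}
    if catalog_id is not None:
        params["catalog_id"] = catalog_id
    else:
        params["project_id"] = project_id
    params["dataset_id"] = dataset_id
    # Paging parameters are added only when the caller supplies them.
    if limit is not None:
        params["_limit"] = limit
    if offset is not None:
        params["_offset"] = offset
    return f"{BASE_URL}/v2/data_samples/{sample_id}/data?{urlencode(params)}"
```

Calling the helper repeatedly with an increasing offset is one way to walk through the sample data page by page.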
Update a data sample
To update a data sample for a data set in a specified catalog or project, call the following PATCH method:
PATCH /v2/data_samples/{SAMPLE_ID}?catalog_id={CATALOG_ID}&dataset_id={DATASET_ID}
OR
PATCH /v2/data_samples/{SAMPLE_ID}?project_id={PROJECT_ID}&dataset_id={DATASET_ID}
The value of SAMPLE_ID is the value of metadata.asset_id from the successful response payload of the create sample call.
The JSON request payload to change the seed value must be structured in the following way:
[{
"op":"replace",
"path":"/entity/data_sample/algorithm/seed",
"value":10
}]
This API does not allow you to update the data sample metadata, for example, the creation time, the modification time, or the creator of the sample. You also cannot modify data sample container details such as the attachment URL.
However, you can modify the algorithm parameters of the sample by specifying new values for the seed, with_replacement, and fraction attributes.
If the sampling process is still running, the data sample is not updated unless the stop_in_progress_runs parameter is set to true. To start the sampling process again as soon as the sample is updated, set the start parameter to true.
The data updates must be specified by using the JSON Patch format, described in RFC 6902.
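The following Python sketch shows what the replace operation in such a patch means by applying it locally to a dictionary. It is an illustration only, not the service's implementation: the helper names are hypothetical, and the pointer handling is simplified to nested objects (no arrays and no escaped characters).

```python
import copy
from typing import Any, Dict, List


def seed_patch(new_seed: int) -> List[Dict[str, Any]]:
    """Build the JSON Patch body that replaces the sampling seed."""
    return [{
        "op": "replace",
        "path": "/entity/data_sample/algorithm/seed",
        "value": new_seed,
    }]


def apply_replace_ops(document: Dict[str, Any], patch: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Apply 'replace' operations to a copy of the document (objects only)."""
    result = copy.deepcopy(document)
    for operation in patch:
        if operation["op"] != "replace":
            raise ValueError("Only 'replace' is handled in this sketch")
        *parents, leaf = operation["path"].lstrip("/").split("/")
        target = result
        for key in parents:
            target = target[key]
        # RFC 6902 requires the target location of a replace to exist.
        if leaf not in target:
            raise KeyError(operation["path"])
        target[leaf] = operation["value"]
    return result


sample = {"entity": {"data_sample": {"algorithm": {"type": "RANDOM", "seed": 1}}}}
updated = apply_replace_ops(sample, seed_patch(10))
```

After the call, the seed in updated is 10 while the original sample dictionary is unchanged.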
The following example shows a success response:
{
"metadata": {
"asset_id": "93d3b425-9569-4f70-a53f-9192814769bd",
"dataset_id": "a0572944-86a6-49ee-9da3-cb45d73e8d8a",
"catalog_id": "3239e296-5aba-4256-aa9a-bfcf7b974e23",
"owner": "[email protected]",
"created_at": "2017-09-09T16:04:33.238Z"
},
"entity": {
"data_sample": {
"algorithm": {
"type": "RANDOM",
"with_replacement": false,
"seed": 10,
"max_rows": 10000
},
"sample_execution": {
"activity_id": "828b28dd-8cc3-4a29-b42c-9dbdc737aa98",
"activity_run_id": "0f54a748-2a73-43f6-b8b5-1da9a2b160ba",
"status": "finished"
}
}
}
}
Delete a data sample
To delete a data sample for a data set in a specified catalog or project, call the following DELETE method:
DELETE /v2/data_samples/{SAMPLE_ID}?catalog_id={CATALOG_ID}&dataset_id={DATASET_ID}
OR
DELETE /v2/data_samples/{SAMPLE_ID}?project_id={PROJECT_ID}&dataset_id={DATASET_ID}
The value of SAMPLE_ID is the value of metadata.asset_id from the successful response payload of the create sample call.
A successful response is received with an HTTP status code of 204.
Delete all data samples
To delete all data samples for a data set in a specified catalog or project, call the following DELETE method:
DELETE /v2/data_samples?catalog_id={CATALOG_ID}&dataset_id={DATASET_ID}
OR
DELETE /v2/data_samples?project_id={PROJECT_ID}&dataset_id={DATASET_ID}
A successful response is received with an HTTP status code of 204.
Troubleshooting your way out if something goes wrong
If a call to any of the API endpoints fails and the error message does not let you pinpoint the cause (typically when the HTTP status code is 500, Internal Server Error), you can retrieve the activity run logs and review the individual steps of the run to find out what went wrong.
Possible causes are that the activity did not complete as expected, finished with errors, or was aborted. A common culprit is that the activity cannot connect to sources or targets by using the connection information that is specified in the request payload. From a sampling perspective, this means either that the connection was not created for the catalog or project, or that the attachment for the data set has inconsistent interaction properties (in the case of a remote attachment).
To get the activity run logs, call the following GET method:
GET /v2/activities/{ACTIVITY_ID}/activityruns/{ACTIVITY_RUN_ID}/logs?catalog_id={CATALOG_ID}
OR
GET /v2/activities/{ACTIVITY_ID}/activityruns/{ACTIVITY_RUN_ID}/logs?project_id={PROJECT_ID}
The values of ACTIVITY_ID and ACTIVITY_RUN_ID are in the response payload of the GET sample call at the paths entity.data_sample.sample_execution.activity_id and entity.data_sample.sample_execution.activity_run_id, respectively.
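As a small sketch, the lookup of both IDs and the construction of the logs request path could be written in Python as follows. The helper name activity_logs_path is hypothetical, and the input is assumed to be the parsed JSON body of a GET sample response.

```python
from typing import Any, Dict, Optional


def activity_logs_path(
    sample: Dict[str, Any],
    catalog_id: Optional[str] = None,
    project_id: Optional[str] = None,
) -> str:
    """Build the activity run logs path from a GET sample response body."""
    if (catalog_id is None) == (project_id is None):
        raise ValueError("Specify exactly one of catalog_id or project_id")
    execution = sample["entity"]["data_sample"]["sample_execution"]
    activity_id = execution["activity_id"]
    run_id = execution["activity_run_id"]
    if catalog_id is not None:
        scope = f"catalog_id={catalog_id}"
    else:
        scope = f"project_id={project_id}"
    return f"/v2/activities/{activity_id}/activityruns/{run_id}/logs?{scope}"
```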
The response to the GET method includes information about each log event, including the event time, message type, and message text.
A maximum of 100 logs is returned per page. To specify a lower limit, use the _limit query parameter with an integer value. More logs than those on the first page might be available. To get the next page, call a GET method using the value of the next.href member from the response payload. This URI includes the _start query parameter, which contains the bookmark token for the next page.
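The paging loop described here can be sketched in Python as a generator that follows next.href until no further page is advertised. The fetch argument is a hypothetical callable that performs an authenticated GET and returns the parsed JSON body; that the last page omits the next member is an assumption.

```python
from typing import Any, Callable, Dict, Iterator, Optional


def iter_pages(
    fetch: Callable[[str], Dict[str, Any]],
    first_url: str,
) -> Iterator[Dict[str, Any]]:
    """Yield each page of results, following next.href between pages."""
    url: Optional[str] = first_url
    while url:
        page = fetch(url)
        yield page
        # Assumption: the final page has no 'next' member (or an empty one).
        url = (page.get("next") or {}).get("href")
```

Keeping the HTTP call behind the fetch callable lets the same loop be used with any client library and makes it easy to test without network access.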
Lineage
Introduction
The lineage of an asset includes information about all events, and other assets, that have led to its current state and its further usage. Asset and Event are the two main entities that are part of the lineage data model. An asset can either be generated from or used in subsequent events. An event can be any of:
- asset-generation-events
- asset-modification-events
- asset-usage-events.
Use the Lineage API to publish events on an asset or to query the lineage of an asset.
Publish a lineage event
The following example shows a sample lineage event that can be posted when a data set is published from a project to a catalog:
Request URL
POST /v2/lineage_events
Request Body
{
"message_version": "v1",
"user_id": "IAM-Id_of_User",
"account_id": "e86f2b06b0b267d559e7c387ceefb089",
"event_details": {
"event_id": "sample-event1",
"event_type": "DATASET_PUBLISHED",
"event_category": [
"additions"
],
"event_time": "2018-04-03T14:01:08.603Z",
"event_source_service": "Watson Knowledge Catalog"
},
"generates_assets": [
{
"id": "9f9c961a-78d1-4c06-a601-4b5890fdataset03",
"asset_type": "DataSet",
"relation": {
"name": "Created"
},
"properties": {
"dataset": {
"type": "dataset",
"value": {
"id": "9f9c961a-78d1-4c06-a601-4b5890fdataset03",
"name": "Asset Name in Catalog XX",
"catalog_id": "9f9c961a-78d1-4c06-a601-4b589catalog"
}
},
"catalog": {
"type": "catalog",
"value": {
"id": "9f9c961a-78d1-4c06-a601-4b589catalog"
}
}
}
}
],
"uses_assets": [
{
"id": "9f9c961a-78d1-4c06-a601-4b5890fdataset02",
"asset_type": "DataSet",
"relation": {
"name": "Used"
},
"properties": {
"dataset": {
"type": "dataset",
"value": {
"id": "9f9c961a-78d1-4c06-a601-4b5890fdataset02",
"name": "2017_sales_data",
"project_id": "9f9c961a-78d1-4c06-a601-4b589project"
}
},
"project": {
"type": "project",
"value": {
"id": "9f9c961a-78d1-4c06-a601-4b589project"
}
}
}
}
]
}
Response Body
{
"metadata": {
"id": "01014d1f-31cf-4956-bd41-7a77ba14004c",
"source_event_id": "sample-event1"
}
}
The id generated in the response can be used to query the details of the published event with the following request:
Request URL
GET /v2/lineage_events/01014d1f-31cf-4956-bd41-7a77ba14004c
For more details on each field in the lineage event JSON payload, refer to the Lineage Events section of API documentation.
Query lineage of an asset
The lineage of an asset involved in the sample event can be queried using the following request:
Request URL
GET /v2/asset_lineages/9f9c961a-78d1-4c06-a601-4b5890fdataset03
Response Body
{
"resources": [
{
"metadata": {
"id": "01014d1f-31cf-4956-bd41-7a77ba14004c",
"source_event_id": "sample-event1",
"created_at": "2018-04-03T14:01:08.603Z",
"created_by": "IAM-Id_of_User"
},
"entity": {
"type": "DATASET_PUBLISHED",
"generates_assets": [
{
"id": "9f9c961a-78d1-4c06-a601-4b5890fdataset03",
"type": "DataSet",
"relation": {
"name": "Created"
},
"properties": {
"catalog": {
"type": "catalog",
"value": {
"id": "9f9c961a-78d1-4c06-a601-4b589catalog"
}
},
"dataset": {
"type": "dataset",
"value": {
"id": "9f9c961a-78d1-4c06-a601-4b5890fdataset03",
"name": "Asset Name in Catalog XX",
"catalog_id": "9f9c961a-78d1-4c06-a601-4b589catalog"
}
}
}
}
],
"uses_assets": [
{
"id": "9f9c961a-78d1-4c06-a601-4b5890fdataset02",
"type": "DataSet",
"relation": {
"name": "Used"
},
"properties": {
"dataset": {
"type": "dataset",
"value": {
"id": "9f9c961a-78d1-4c06-a601-4b5890fdataset02",
"name": "2017_sales_data",
"project_id": "9f9c961a-78d1-4c06-a601-4b589project"
}
},
"project": {
"type": "project",
"value": {
"id": "9f9c961a-78d1-4c06-a601-4b589project"
}
}
}
}
],
"properties": {
"event_time": "2018-04-03T14:01:08.603Z",
"event_category": [
"additions"
],
"event_source_service": "Watson Knowledge Catalog"
}
}
}
],
"limit": 50,
"offset": 0,
"first": {
"href": "https://api.dataplatform.cloud.ibm.com/v2/asset_lineages/9f9c961a-78d1-4c06-a601-4b5890fdataset03?offset=0&_=1528182675331"
}
}
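To show how a client might read this response, the following Python sketch reduces each lineage event to its type and the IDs of the assets it generated and used. The helper name summarize_lineage is hypothetical, and the field names are taken from the example response above.

```python
from typing import Any, Dict, List


def summarize_lineage(response: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Summarize each lineage event as its type plus generated and used asset IDs."""
    summary = []
    for event in response.get("resources", []):
        entity = event["entity"]
        summary.append({
            "event_type": entity["type"],
            "generated": [asset["id"] for asset in entity.get("generates_assets", [])],
            "used": [asset["id"] for asset in entity.get("uses_assets", [])],
        })
    return summary
```

For the example response, the summary contains one DATASET_PUBLISHED event that generated the catalog data set from the project data set.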
copyright: years: 2019 lastupdated: "2019-02-01"
Methods
Get list of jobs under a project or a space.
Lists the jobs in the specified project or space (either project_id or space_id must be set).
GET /v2/jobs
Request
Custom Headers
Identity Access Management (IAM) bearer token.
Query Parameters
The ID of the project to use. project_id or space_id is required.
The ID of the space to use. catalog_id, project_id, or space_id is required.
Optionally get all jobs associated with a particular asset.
Optionally get all jobs associated with the particular asset ref type.
The ID of the job run. Can be used to search for the parent job of a job run.
The limit of the number of items to return, for example limit=50. If not specified a default of 100 will be used.
Constraints: value ≥ 1
Default:
100
Response
Array of all jobs.
Status Code
Success.
Bad request. See the error message for details.
You are not authorized to access the service. See response for more information.
You are not permitted to perform this action. See response for more information.
The resources you specified cannot be found.
An error occurred. See response for more information.
No Sample Response
Create a new job.
Creates a new job in the specified project or space (either project_id or space_id must be set).
POST /v2/jobs
Request
Custom Headers
Identity Access Management (IAM) bearer token.
Query Parameters
The ID of the project to use. project_id or space_id is required.
The ID of the space to use. catalog_id, project_id, or space_id is required.
The job to be created. Note: either asset_ref or asset_ref_type must be provided, not both.
The name of the job
Example:
Name
ID of associated asset to run (provide either asset_ref or asset_ref_type).
Example:
ff1ab70b-0553-409a-93f9-ccc31471c218
The type of asset to run (provide either asset_ref or asset_ref_type).
Example:
notebook
The description of the job.
Example:
Description.
A cron string defining when the job should be run. If an empty string is provided it means the job is not scheduled to run.
Example:
0 3 21 13 1 ? 2019
Indicate a repeated job
Example:
true
A timestamp in epoch time, the scheduled job will be triggered after this timestamp.
Example:
1547578689512
A timestamp in epoch time, the scheduled job will be triggered before this timestamp.
Example:
1547578689512
schedule_info
job
Response
AssetMetadata Model
The underlying job definition.
Status Code
Created.
Bad request. See the error message for details.
You are not authorized to access the service. See response for more information.
You are not permitted to perform this action. See response for more information.
An error occurred. See response for more information.
No Sample Response
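For the create job request above, a minimal Python sketch of assembling the request body follows. It enforces the documented rule that either asset_ref or asset_ref_type is provided, but not both. The helper name build_job_body is hypothetical, and the key names name and description as well as the wrapping job object are assumptions inferred from the parameter descriptions, not a confirmed schema.

```python
from typing import Any, Dict, Optional


def build_job_body(
    name: str,
    asset_ref: Optional[str] = None,
    asset_ref_type: Optional[str] = None,
    description: Optional[str] = None,
) -> Dict[str, Any]:
    """Build a body for POST /v2/jobs (key names are assumptions)."""
    # Documented rule: provide either asset_ref or asset_ref_type, not both.
    if (asset_ref is None) == (asset_ref_type is None):
        raise ValueError("Provide either asset_ref or asset_ref_type, not both")
    job: Dict[str, Any] = {"name": name}
    if asset_ref is not None:
        job["asset_ref"] = asset_ref
    else:
        job["asset_ref_type"] = asset_ref_type
    if description is not None:
        job["description"] = description
    # Assumption: the job definition is wrapped in a top-level "job" object.
    return {"job": job}
```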
Get the information of job identified by the ID.
Gets the information for a single job in the specified project or space (either project_id or space_id must be set).
GET /v2/jobs/{job_id}
Request
Custom Headers
Identity Access Management (IAM) bearer token.
Path Parameters
The ID of the job to use. Each job has a unique ID.
Query Parameters
The ID of the project to use. project_id or space_id is required.
The ID of the space to use. catalog_id, project_id, or space_id is required.
Response
AssetMetadata Model
The underlying job definition.
Status Code
Success.
Bad request. See the error message for details.
You are not authorized to access the service. See response for more information.
You are not permitted to perform this action. See response for more information.
The resources you specified cannot be found.
An error occurred. See response for more information.
No Sample Response
Delete a specific job.
Deletes a specific job in a project or space (either project_id or space_id must be set). If the deletion of the job and its runs takes some time to finish, a 202 response is returned and the deletion continues asynchronously. All the job runs associated with the job are also deleted. If the job is still running, it is not deleted.
DELETE /v2/jobs/{job_id}
Request
Custom Headers
Identity Access Management (IAM) bearer token.
Path Parameters
The ID of the job to use. Each job has a unique ID.
Query Parameters
The ID of the project to use. project_id or space_id is required.
The ID of the space to use. catalog_id, project_id, or space_id is required.
Response
Status Code
The requested operation completed successfully.
Bad request. See the error message for details.
You are not authorized to access the service. See response for more information.
You are not permitted to perform this action. See response for more information.
The resources you specified cannot be found.
An error occurred. See response for more information.
No Sample Response
Update information of a job.
Updates specific attributes of a job in the specified project or space (either project_id or space_id must be set). You must specify the updates by using the JSON patch format, described in RFC 6902. Use 'last_run_initiator' for the initiator of the last job run, use 'last_run_status' for the status of the last job run.
PATCH /v2/jobs/{job_id}
Request
Custom Headers
Identity Access Management (IAM) bearer token.
Path Parameters
The ID of the job to use. Each job has a unique ID.
Query Parameters
The ID of the project to use. project_id or space_id is required.
The ID of the space to use. catalog_id, project_id, or space_id is required.
Updates to make to the job run.
The operation to be performed. Allowable values: [add, replace].
Example:
replace
A JSON pointer to the field to update, for example, /metadata/name or /entity/job/configuration.
Example:
/metadata/name (or /entity/job/configuration)
- Example:
Response
AssetMetadata Model
The underlying job definition.
Status Code
Success.
You are not authorized to access the service. See response for more information.
You are not permitted to perform this action. See response for more information.
The resources you specified cannot be found.
An error occurred. See response for more information.
No Sample Response
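For the job update request above, the sketch below builds one JSON Patch operation in Python and rejects operations outside the documented allowable values (add, replace). The helper name job_patch_operation and the job name in the usage line are hypothetical; the value member comes from RFC 6902.

```python
import json
from typing import Any, Dict

ALLOWED_OPS = {"add", "replace"}


def job_patch_operation(op: str, path: str, value: Any) -> Dict[str, Any]:
    """Build one JSON Patch operation for PATCH /v2/jobs/{job_id}."""
    if op not in ALLOWED_OPS:
        raise ValueError(f"Unsupported operation: {op}")
    # A JSON pointer to a field starts with a slash.
    if not path.startswith("/"):
        raise ValueError("path must be a JSON pointer such as /metadata/name")
    return {"op": op, "path": path, "value": value}


# The request body is a JSON array of operations.
body = json.dumps([job_patch_operation("replace", "/metadata/name", "Nightly refresh")])
```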
Get list of runs of a job.
Lists the job runs for a specific job in the specified project or space (either project_id or space_id must be set). Only the metadata and certain elements of the entity component of each run are returned.
GET /v2/jobs/{job_id}/runs
Request
Custom Headers
Identity Access Management (IAM) bearer token.
Path Parameters
The ID of the job to use. Each job has a unique ID.
Query Parameters
The ID of the project to use. project_id or space_id is required.
The ID of the space to use. catalog_id, project_id, or space_id is required.
Response
Array of all jobs.
Status Code
Success.
You are not authorized to access the service. See response for more information.
You are not permitted to perform this action. See response for more information.
An error occurred. See response for more information.
No Sample Response
Start a run for a job.
Starts the specified job contained in a project or space (either project_id or space_id must be set).
POST /v2/jobs/{job_id}/runs
Request
Custom Headers
Identity Access Management (IAM) bearer token.
Path Parameters
The ID of the job to use. Each job has a unique ID.
Query Parameters
The ID of the project to use. project_id or space_id is required.
The ID of the space to use. catalog_id, project_id, or space_id is required.
The configuration of the job run to use. If not provided, use the configuration of the associated job.
The environment variables, only for Notebook and Script jobs.
Example:
configuration
job_run
Response
AssetMetadata Model
Status Code
The requested operation completed successfully.
You are not authorized to access the service. See response for more information.
You are not permitted to perform this action. See response for more information.
An error occurred. See response for more information.
No Sample Response
Get a specific run of a job.
Gets the info for a single job run from the specified project or space (either project_id or space_id must be set).
GET /v2/jobs/{job_id}/runs/{run_id}
Request
Custom Headers
Identity Access Management (IAM) bearer token.
Path Parameters
The ID of the job to use. Each job has a unique ID.
The ID of the job run.
Query Parameters
The ID of the project to use. project_id or space_id is required.
The ID of the space to use. catalog_id, project_id, or space_id is required.
Response
AssetMetadata Model
Status Code
Success.
You are not authorized to access the service. See response for more information.
You are not permitted to perform this action. See response for more information.
An error occurred. See response for more information.
No Sample Response
Delete a run.
Delete the specified job run in a project or space (either project_id or space_id must be set).
DELETE /v2/jobs/{job_id}/runs/{run_id}
Request
Custom Headers
Identity Access Management (IAM) bearer token.
Path Parameters
The ID of the job to use. Each job has a unique ID.
The ID of the job run.
Query Parameters
The ID of the project to use. project_id or space_id is required.
The ID of the space to use. catalog_id, project_id, or space_id is required.
Response
Status Code
The requested operation is in progress.
The requested operation completed successfully.
You are not authorized to access the service. See response for more information.
You are not permitted to perform this action. See response for more information.
The resources you specified cannot be found.
An error occurred. See response for more information.
No Sample Response
Cancel a run.
Cancels a job run that is in the running state.
POST /v2/jobs/{job_id}/runs/{run_id}/cancel
Request
Custom Headers
Identity Access Management (IAM) bearer token.
Path Parameters
The ID of the job to use. Each job has a unique ID.
The ID of the job run.
Query Parameters
The ID of the project to use. project_id or space_id is required.
The ID of the space to use. catalog_id, project_id, or space_id is required.
An empty body.
Example: {}
Response
Status Code
The requested operation completed successfully.
You are not authorized to access the service. See response for more information.
You are not permitted to perform this action. See response for more information.
An error occurred. See response for more information.
No Sample Response
Retrieve runtime log of a run.
Gets the logs for a job run in the specified project or space (either project_id or space_id must be set).
GET /v2/jobs/{job_id}/runs/{run_id}/logs
Request
Custom Headers
Identity Access Management (IAM) bearer token.
Path Parameters
The ID of the job to use. Each job has a unique ID.
The ID of the job run.
Query Parameters
The ID of the project to use. project_id or space_id is required.
The ID of the space to use. catalog_id, project_id, or space_id is required.
The limit of the number of lines to return, for example limit=50. If not specified, all logs will be returned.
Response
Response of the get job run log request.
Array of log strings, split by line.
Total number of lines available.
Example:
100
Status Code
Success.
You are not authorized to access the service. See response for more information.
You are not permitted to perform this action. See response for more information.
An error occurred. See response for more information.
No Sample Response
Create project as a transaction
Creates a new project with the provided parameters, including all the storage and credentials, in a single transaction. This endpoint creates a new COS bucket with a generated unique name, all credentials, and the asset container, and calls all the required atomic APIs to fully configure a new project. Attempts to use duplicate project names result in an error. NOTE: When creating projects programmatically, always use this endpoint, not /v2/projects.
This endpoint can also be used to create a project from an exported Watson Studio .zip file. In this case, a new transaction is initiated to create assets under the project. A transaction ID along with a URL is returned in the response of this API. Because this transaction can take time, you can view the current status of the transaction by using the returned URL.
NOTE: This feature is only available in the private cloud.
POST /transactional/v2/projects
Authentication
Request
Custom Headers
Allowable values: [application/json, multipart/form-data]
Project metadata required to create a project.
The name of the new project. The name must be a non-empty String. This does not need to be unique.
Example:
Project Name
A tag to indicate where this project was generated. This is only intended for use in metrics. It does not need to be unique and all consumers of this API should use a consistent string for their 'generator' field. The value is stored in the project metadata for future consumption in metrics.
Example:
DAP-Projects
A description for the new project.
Example:
A project description.
A value of true makes the project public.
Default:
false
List of user defined tags that are attached to the project.
Set to true if project members should be scoped to the account and/or SAML of the creator.
Default:
false
Object storage properties to be associated with the project.
List of computes to be associated with the project.
List of tools to be associated with the project.
Allowable values: [watson_visual_recognition, jupyter_notebooks, dashboards, streams_designer, spss_modeler, experiments, data_refinery]
Response
Description of create transactional project API response body.
API to access the newly created project.
Status Code
Created
Accepted
Bad Request
Unauthorized
Forbidden
Not Found
Internal Server Error
{ "location": "/v2/projects/b2549f22-7565-4193-9434-9b77e15757cc" }
{ "location": "/v2/projects/b2549f22-7565-4193-9434-9b77e15757cc", "transaction_detail": { "id": "dcff12a9-3f9e-4d10-b4c4-f121f681d81b", "links": { "self": "/transactional/v2/projects/b2549f22-7565-4193-9434-9b77e15757cc/transactions/dcff12a9-3f9e-4d10-b4c4-f121f681d81b" } } }
{ "code": "400", "error": "Bad Request", "reason": "Error fetching projects. Status code: 400", "message": "The server cannot or will not process the request due to an apparent client error (e.g. malformed request syntax).", "description": "[400] Bad Request: Error fetching projects. Status code: 400. The server cannot or will not process the request due to an apparent client error (e.g. malformed request syntax)." }
{ "code": "401", "error": "Unauthorized", "reason": "Unable to verify token via IAM Auth server.", "message": "Authentication failed.", "description": "[401] Unauthorized: Authentication failed." }
{ "code": "403", "error": "Forbidden", "reason": "Invalid bearer token: Access token is invalid.", "message": "The target operation is strictly forbidden due to schema constraints.", "description": "[403] Forbidden: Invalid bearer token: Access token is invalid. The target operation is strictly forbidden due to schema constraints." }
{ "code": "404", "error": "Not Found", "reason": "Transaction 09f5e19b-5a30-428d-95f4-ff93590f3c071 is not associated with project 522d3ffe-0787-4bee-a616-dc12a19c9a76", "message": "Resource requested by the client was not found.", "description": "[404] Not Found: Transaction 09f5e19b-5a30-428d-95f4-ff93590f3c071 is not associated with project 522d3ffe-0787-4bee-a616-dc12a19c9a76. Resource requested by the client was not found." }
{ "code": "500", "error": "Internal Server Error", "reason": "Error creating project: 522d3ffe-0787-4bee-a616-dc12a19c9a76", "message": "The API encountered an unexpected condition which prevented it from fulfilling the request.", "description": "[500] Internal Server Error: Error creating project: 522d3ffe-0787-4bee-a616-dc12a19c9a76." }
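The first two example bodies above differ in whether a transaction_detail member is present. As a hedged sketch, a Python client could use that difference to decide whether there is a transaction to poll. The helper name transaction_status_path is hypothetical, and the assumption is that transaction_detail appears only when the project is created from an exported .zip file.

```python
from typing import Any, Dict, Optional


def transaction_status_path(body: Dict[str, Any]) -> Optional[str]:
    """Return the transaction status path from a create-project response, if any."""
    detail = body.get("transaction_detail")
    if not detail:
        # No transaction to poll; the project location is in body["location"].
        return None
    return detail["links"]["self"]
```

When a path is returned, it matches the GET /transactional/v2/projects/{guid}/transactions/{id} endpoint and can be requested to check the status of the import.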
Delete project as a transaction
Deletes the project with the given GUID, along with its COS bucket and all the files in it, all credentials, and the asset container, in the reverse order of the project creation transaction. When deleting projects programmatically, always use this endpoint, not /v2/projects/{guid}.
DELETE /transactional/v2/projects/{guid}
Authentication
Request
Path Parameters
The GUID for the project to be deleted.
Response
Status Code
No Content
Bad Request
Unauthorized
Forbidden
Not Found
Internal Server Error
{ "code": "400", "error": "Bad Request", "reason": "Error fetching projects. Status code: 400", "message": "The server cannot or will not process the request due to an apparent client error (e.g. malformed request syntax).", "description": "[400] Bad Request: Error fetching projects. Status code: 400. The server cannot or will not process the request due to an apparent client error (e.g. malformed request syntax)." }
{ "code": "401", "error": "Unauthorized", "reason": "Unable to verify token via IAM Auth server.", "message": "Authentication failed.", "description": "[401] Unauthorized: Authentication failed." }
{ "code": "403", "error": "Forbidden", "reason": "Invalid bearer token: Access token is invalid.", "message": "The target operation is strictly forbidden due to schema constraints.", "description": "[403] Forbidden: Invalid bearer token: Access token is invalid. The target operation is strictly forbidden due to schema constraints." }
{ "code": "404", "error": "Not Found", "reason": "Transaction 09f5e19b-5a30-428d-95f4-ff93590f3c071 is not associated with project 522d3ffe-0787-4bee-a616-dc12a19c9a76", "message": "Resource requested by the client was not found.", "description": "[404] Not Found: Transaction 09f5e19b-5a30-428d-95f4-ff93590f3c071 is not associated with project 522d3ffe-0787-4bee-a616-dc12a19c9a76. Resource requested by the client was not found." }
{ "code": "500", "error": "Internal Server Error", "reason": "Error creating project: 522d3ffe-0787-4bee-a616-dc12a19c9a76", "message": "The API encountered an unexpected condition which prevented it from fulfilling the request.", "description": "[500] Internal Server Error: Error creating project: 522d3ffe-0787-4bee-a616-dc12a19c9a76." }
Get status of import transaction
Gets the status of an import transaction that was created by using the create project as a transaction API.
GET /transactional/v2/projects/{guid}/transactions/{id}
Authentication
Request
Path Parameters
The GUID for the project on which transaction was created.
The transaction ID provided by create project as a transaction endpoint.
Response
Description of import transactional status API response.
Project GUID.
Name of the project.
The date and time the status was created in UTC (ISO 8601)
The date and time the status was last updated in UTC (ISO 8601)
Last known status of the transaction
Description of the last known transaction status
Description of all last known import status.
Status Code
OK
Bad Request
Unauthorized
Forbidden
Not Found
Internal Server Error
{ "code": "400", "error": "Bad Request", "reason": "Error fetching projects. Status code: 400", "message": "The server cannot or will not process the request due to an apparent client error (e.g. malformed request syntax).", "description": "[400] Bad Request: Error fetching projects. Status code: 400. The server cannot or will not process the request due to an apparent client error (e.g. malformed request syntax)." }
{ "code": "401", "error": "Unauthorized", "reason": "Unable to verify token via IAM Auth server.", "message": "Authentication failed.", "description": "[401] Unauthorized: Authentication failed." }
{ "code": "403", "error": "Forbidden", "reason": "Invalid bearer token: Access token is invalid.", "message": "The target operation is strictly forbidden due to schema constraints.", "description": "[403] Forbidden: Invalid bearer token: Access token is invalid. The target operation is strictly forbidden due to schema constraints." }
{ "code": "404", "error": "Not Found", "reason": "Transaction 09f5e19b-5a30-428d-95f4-ff93590f3c071 is not associated with project 522d3ffe-0787-4bee-a616-dc12a19c9a76", "message": "Resource requested by the client was not found.", "description": "[404] Not Found: Transaction 09f5e19b-5a30-428d-95f4-ff93590f3c071 is not associated with project 522d3ffe-0787-4bee-a616-dc12a19c9a76. Resource requested by the client was not found." }
{ "code": "500", "error": "Internal Server Error", "reason": "Error creating project: 522d3ffe-0787-4bee-a616-dc12a19c9a76", "message": "The API encountered an unexpected condition which prevented it from fulfilling the request.", "description": "[500] Internal Server Error: Error creating project: 522d3ffe-0787-4bee-a616-dc12a19c9a76." }
List defined connections
Lists defined connections.
Use the following parameters to sort the results:
| Field | Example |
| ------------------------- | ----------------------------------- |
| entity.name | ?_sort=+entity.name |
| metadata.create_time | ?_sort=-metadata.create_time |
Use the following parameters to filter the results:
| Field | Example |
| ------------------------- | ----------------------------------- |
| entity.name | ?entity.name=MyConnection |
| entity.datasource_type | ?entity.datasource_type=<asset_id> |
| entity.context | ?entity.context=source |
| entity.properties | ?entity.properties={"name":"value"} |
| entity.flags | ?entity.flags=+restricted |
| metadata.creator_id | ?metadata.creator_id=userid |
Filtering is done by specifying the fields to filter on.
To filter on the properties of a connection, provide the exact values to compare in the entity.properties field. Every value supplied must exactly match a property of the connection.
The entity.flags field specifies the flags a connection can have to be included in the list results. By default, all connections with no flags are returned.
Adding the name of a flag to entity.flags will add the connections with that flag to the list results. The name can be optionally prefixed with a plus sign (+) to indicate that it is being added.
Adding the name of a flag to entity.flags with a minus sign (-) prefix will remove those connections with that flag from the list results. All additions are done before the subtractions.
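The flag rules above can be read as a small set operation. The following sketch simulates that reading locally. It is an interpretation of the description, not the service's implementation, and the function name and the shape of the connection dictionaries are hypothetical.

```python
def select_by_flags(connections, flags_param=""):
    """Simulate the entity.flags filter on a local list of connections.

    Each connection is a dict with a "name" and a list of "flags"
    (an assumed shape, used only for this illustration).
    """
    added, removed = set(), set()
    for token in (t.strip() for t in flags_param.split(",")):
        if not token:
            continue
        if token.startswith("-"):
            removed.add(token[1:])
        else:
            added.add(token.lstrip("+"))  # the plus sign is optional

    selected = []
    for conn in connections:
        flags = set(conn.get("flags", []))
        # Additions first: unflagged connections are always candidates,
        # flagged ones only if one of their flags was added.
        included = not flags or bool(flags & added)
        # Subtractions second: drop anything carrying a removed flag.
        if included and not (flags & removed):
            selected.append(conn["name"])
    return selected


connections = [
    {"name": "a", "flags": []},
    {"name": "b", "flags": ["restricted"]},
    {"name": "c", "flags": ["restricted", "personal_credentials"]},
]
print(select_by_flags(connections))
print(select_by_flags(connections, "+restricted"))
print(select_by_flags(connections, "restricted,-personal_credentials"))
```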
GET /v2/connections
Request
Custom Headers
Identity Access Management (IAM) bearer token.
Allowable values: [
application/json;charset=utf-8
,application/json
]
Query Parameters
The ID of the catalog to use. One of catalog_id, project_id, or space_id is required.
The ID of the project to use. One of catalog_id, project_id, or space_id is required.
The ID of the space to use. One of catalog_id, project_id, or space_id is required.
The field to sort the results on, including whether to sort ascending (+) or descending (-), for example, sort=-metadata.create_time
The page token indicating where to start paging from.
The maximum number of items to return, for example, limit=50. If not specified, a default of 100 is used.
Constraints: value ≥ 1
Default:
100
The creator of the connection.
The name of the connection.
The data source type of the connection.
The context of the connection. Can be one of "source", "target", or "source,target".
The properties of the connection that must match for the connection to be included in the list.
A comma-separated list of flags that must be present for the connection to be included in the list. If not provided, only connections with no flags are returned.
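The query parameters above can be combined into a request URL. The sketch below only builds the URL and does not send anything. It assumes the sort parameter is named `sort`, as in the parameter description (the table earlier shows `_sort`), and the helper name `list_connections_url` is hypothetical.

```python
from typing import Optional
from urllib.parse import urlencode

BASE_URL = "https://api.dataplatform.cloud.ibm.com"  # endpoint from the introduction


def list_connections_url(project_id: str, sort: Optional[str] = None,
                         limit: int = 100, filters: Optional[dict] = None) -> str:
    """Build a GET /v2/connections URL scoped to a project."""
    if limit < 1:
        raise ValueError("limit must be at least 1")
    params = {"project_id": project_id, "limit": limit}
    if sort:
        params["sort"] = sort  # assumed parameter name, see lead-in
    params.update(filters or {})
    # urlencode percent-encodes reserved characters, so "+restricted"
    # is sent as "%2Brestricted" rather than being read as a space.
    return f"{BASE_URL}/v2/connections?{urlencode(params)}"


url = list_connections_url(
    "522d3ffe-0787-4bee-a616-dc12a19c9a76",
    sort="-metadata.create_time",
    limit=50,
    filters={"entity.name": "MyConnection"},
)
print(url)
```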
Response
A page from a collection of connections.
List of connections
The number of assets skipped before this page.
The total number of assets available.
Status Code
Connections with metadata.
You are not authorized to list the defined connections.
You are not permitted to perform this action.
The service is currently receiving more requests than it can process in a timely fashion. Please retry submitting your request later.
An error occurred. The defined connections cannot be listed.
A timeout occurred when processing your request. Please retry later.
No Sample Response
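Two of the responses above ask the caller to retry later. A minimal sketch of a retry loop with exponential backoff follows. The numeric status codes are not listed in this section, so the set `RETRYABLE` (429 and 504) is an assumption, and `send` is a hypothetical callable that performs the request and returns a `(status, body)` pair.

```python
import time

# Assumed codes for "too many requests" and "timeout"; adjust to the
# codes the service actually returns.
RETRYABLE = {429, 504}


def call_with_retry(send, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call send() until it returns a non-retryable status or attempts run out."""
    for attempt in range(max_attempts):
        status, body = send()
        if status not in RETRYABLE or attempt == max_attempts - 1:
            return status, body
        # Wait 1x, 2x, 4x, ... the base delay before the next attempt.
        sleep(base_delay * 2 ** attempt)


# Demonstration with canned responses and a recording stand-in for sleep.
responses = iter([(429, ""), (504, ""), (200, "ok")])
delays = []
result = call_with_retry(lambda: next(responses), sleep=delays.append)
print(result, delays)
```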
Request
Custom Headers
Identity Access Management (IAM) bearer token.
Allowable values: [
application/json;charset=utf-8
,application/json
]
Allowable values: [
application/json;charset=utf-8
,application/json
]
Query Parameters
The ID of the catalog to use. One of catalog_id, project_id, or space_id is required.
The ID of the project to use. One of catalog_id, project_id, or space_id is required.
The ID of the space to use. One of catalog_id, project_id, or space_id is required.
Whether to test the connection before saving it. If a connection cannot be established, the connection is not saved.
Default:
true
The definition of the connection.
The ID of the data source type to connect to, for example, "cfdcb449-1204-44ba-baa6-9a8a878e6aa7".
The name of the connection.
The description of the connection.
The ID of the catalog that this connection refers to for properties values.
The ID of the connection in the reference catalog that this connection refers to for properties values.
The ID of the Secure Gateway to use with the connection. A Secure Gateway is needed when connecting to an on-premises data source. This is the ID of the Secure Gateway created with the Secure Gateway service. Your Secure Gateway Client running on-premises must be connected to the gateway with this ID. For example, "E9oXGRIhv1e_prod_ng".
Specifies how a connection is to be treated internally.
Allowable values: [
restricted
,internal_use_only
,personal_credentials
]
The country from which the data originated, as an ISO 3166 country code.
The owner or creator of the connection. Provided when a service ID token is used to create the connection.
Rules of visibility for connections.
Connection properties.
properties
Interaction properties allowed for a connection.
Custom data to be associated with a given object
source_system
The asset category
Allowable values: [
user
,system
]
Response
extendedProperties
additionalProperties
configurationProperties
securedProperties
Status Code
The connection was created.
The connection test failed. See the error message for details.
You are not authorized to define a connection.
You are not permitted to perform this action.
A connection with the same name already exists. Specify another name.
The service is currently receiving more requests than it can process in a timely fashion. Please retry submitting your request later.
An error occurred. A connection was not created.
A timeout occurred when processing your request. Please retry later.
No Sample Response
Discover assets
Discovers assets from the data source accessed via a connection description.
POST /v2/connections/assets
Request
Custom Headers
Identity Access Management (IAM) bearer token.
Allowable values: [
application/json;charset=utf-8
,application/json
]
Allowable values: [
application/json;charset=utf-8
,application/json
]
Query Parameters
Path of the asset.
The maximum number of items to return, for example, limit=50. If not specified, a default of 100 is used.
Constraints: value ≥ 1
Default:
100
The 0-based index of the first result to return, for example, offset=200. If not specified, the default offset of 0 is used.
Constraints: value ≥ 0
Default:
0
Specify whether to return the asset's metadata, the asset's data, interaction properties, connection properties, or data source type. If not specified, metadata is used by default. This parameter only applies when requesting details about a data set. To specify multiple fetch values, use a comma-separated string, such as fetch=data,metadata,interaction,connection,datasource_type.
Specify whether to return additional asset-specific details. If not specified, these details are not returned.
Specify whether assets are discovered for the purpose of reading (source) or writing (target). If not specified, source is used by default.
Allowable values: [
source
,target
]
A JSON object containing a set of properties to configure aspects of the asset browsing.
A JSON object containing a set of properties to define filtering of the assets to be returned.
The connection definition.
The ID of the data source type to connect to, for example, "cfdcb449-1204-44ba-baa6-9a8a878e6aa7".
The name of the connection.
The description of the connection.
The ID of the catalog that this connection refers to for properties values.
The ID of the connection in the reference catalog that this connection refers to for properties values.
The ID of the Secure Gateway to use with the connection. A Secure Gateway is needed when connecting to an on-premises data source. This is the ID of the Secure Gateway created with the Secure Gateway service. Your Secure Gateway Client running on-premises must be connected to the gateway with this ID. For example, "E9oXGRIhv1e_prod_ng".
Specifies how a connection is to be treated internally.
Allowable values: [
restricted
,internal_use_only
,personal_credentials
]
The country from which the data originated, as an ISO 3166 country code.
The owner or creator of the connection. Provided when a service ID token is used to create the connection.
Rules of visibility for connections.
Connection properties.
properties
Interaction properties allowed for a connection.
Custom data to be associated with a given object
source_system
The asset category
Allowable values: [
user
,system
]
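The discovery request combines query parameters with a JSON connection definition in the body. The sketch below constructs such a request with the standard library without sending it. The `Authorization: Bearer` header form and the body field names `datasource_type` and `name` are assumptions, since this section describes the fields but does not show a sample body.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request

BASE_URL = "https://api.dataplatform.cloud.ibm.com"  # endpoint from the introduction


def build_discovery_request(token, connection, fetch=("metadata",), limit=100, offset=0):
    """Build (but do not send) a POST /v2/connections/assets request."""
    query = urlencode({
        "limit": limit,
        "offset": offset,
        "fetch": ",".join(fetch),  # multiple fetch values are comma-separated
    })
    return Request(
        f"{BASE_URL}/v2/connections/assets?{query}",
        data=json.dumps(connection).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",  # assumed header form
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Hypothetical connection definition; the field names are assumptions.
connection = {
    "datasource_type": "cfdcb449-1204-44ba-baa6-9a8a878e6aa7",
    "name": "MyConnection",
}
request = build_discovery_request("<token>", connection,
                                  fetch=("data", "metadata"), limit=50, offset=200)
print(request.get_method(), request.full_url)
```

Passing the returned object to `urllib.request.urlopen` would send it. That step is left out here because it needs a valid token.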
Response
A page from a collection of discovered assets.
The path of the asset.
An ID for the asset.
Properties defining the returned assets.
properties
Discovered types
Discovered assets
Discovered fields
The definition of a data source type.
Connection properties.
connection_properties
The interaction properties needed to find the asset.
interaction_properties
Extended metadata properties
The data returned when the fetch parameter contains the value "data".
Details about a discovered asset.
details
The number of assets skipped before this page.
The total number of assets available.
Log events created during the discovery of the assets.
Status Code
The discovered assets.
You are not authorized to discover assets.
You are not permitted to perform this action.
The service is currently receiving more requests than it can process in a timely fashion. Please retry submitting your request later.
An error occurred. No assets were found.
A timeout occurred when processing your request. Please retry later.
No Sample Response
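Because the response reports how many assets were skipped and how many exist in total, a client can page through results by advancing the offset. A minimal sketch follows. The response keys `assets` and `total_count` are assumptions, because the field names are not shown in this section, and `fetch_page` is a hypothetical callable that performs one discovery request.

```python
def iter_assets(fetch_page, limit=100):
    """Yield every asset by requesting successive pages of size `limit`."""
    offset = 0
    while True:
        page = fetch_page(offset=offset, limit=limit)
        items = page["assets"]          # assumed key
        yield from items
        offset += len(items)
        # Stop on an empty page or once the reported total is reached.
        if not items or offset >= page["total_count"]:  # assumed key
            break


# Demonstration against an in-memory stand-in for the service.
ALL_ASSETS = ["t1", "t2", "t3", "t4", "t5"]
seen_offsets = []


def fake_fetch_page(offset, limit):
    seen_offsets.append(offset)
    return {"assets": ALL_ASSETS[offset:offset + limit], "total_count": len(ALL_ASSETS)}


collected = list(iter_assets(fake_fetch_page, limit=2))
print(collected, seen_offsets)
```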
Request
Custom Headers
Identity Access Management (IAM) bearer token.
Allowable values: [
application/json;charset=utf-8
,application/json
]
Path Parameters
The ID of the data asset.
Query Parameters
The ID of the catalog to use. One of catalog_id, project_id, or space_id is required.
The ID of the project to use. One of catalog_id, project_id, or space_id is required.
The ID of the space to use. One of catalog_id, project_id, or space_id is required.
The maximum number of items to return, for example, limit=50. If not specified, a default of 100 is used.
Constraints: value ≥ 1
Default:
100
The 0-based index of the first result to return, for example, offset=200. If not specified, the default offset of 0 is used.
Constraints: value ≥ 0
Default:
0
Specify whether to return the asset's metadata, the asset's data, interaction properties, connection properties, or data source type. If not specified, metadata is used by default. This parameter only applies when requesting details about a data set. To specify multiple fetch values, use a comma-separated string, such as fetch=data,metadata,interaction,connection,datasource_type.
Specify whether to return additional asset-specific details. If not specified, these details are not returned.
Specify whether the asset is discovered for the purpose of reading (source) or writing (target). If not specified, source is used by default.
Allowable values: [
source
,target
]
A JSON object containing a set of properties to configure aspects of the asset browsing.
A JSON object containing a set of properties to define filtering of the assets to be returned.
Path of the asset.
Response
A page from a collection of discovered assets.
The path of the asset.
An ID for the asset.
Properties defining the returned assets.
properties
Discovered types
Discovered assets
Discovered fields
The definition of a data source type.
Connection properties.
connection_properties
The interaction properties needed to find the asset.
interaction_properties
Extended metadata properties
The data returned when the fetch parameter contains the value "data".
Details about a discovered asset.
details
The number of assets skipped before this page.
The total number of assets available.
Log events created during the discovery of the assets.
Status Code
The discovered asset.
You are not authorized to discover the asset.
You are not permitted to perform this action.
The asset cannot be found.
The service is currently receiving more requests than it can process in a timely fashion. Please retry submitting your request later.
An error occurred. No assets were discovered in the data source.
A timeout occurred when processing your request. Please retry later.
No Sample Response
Request
Custom Headers
Identity Access Management (IAM) bearer token.
Allowable values: [
application/json;charset=utf-8
,application/json
]
Query Parameters
The ID of the catalog to use. One of catalog_id, project_id, or space_id is required.
The ID of the project to use. One of catalog_id, project_id, or space_id is required.
The ID of the space to use. One of catalog_id, project_id, or space_id is required.
The page token indicating where to start paging from.
The maximum number of items to return, for example, limit=50. If not specified, a default of 100 is used.
Constraints: value ≥ 1
Default:
100
Response
Result of connection upgrade request.
Status Code
The connections were upgraded.
Bad request. See the error message for details.
You are not authorized to define a connection.
You are not permitted to perform this action.
An error occurred. Connections were not updated.
A timeout occurred when processing your request. Please retry later.
No Sample Response
Request
Custom Headers
Identity Access Management (IAM) bearer token.
Allowable values: [
application/json;charset=utf-8
,application/json
]
Path Parameters
The ID of the connection.
Query Parameters
The ID of the catalog to use. One of catalog_id, project_id, or space_id is required.
The ID of the project to use. One of catalog_id, project_id, or space_id is required.
The ID of the space to use. One of catalog_id, project_id, or space_id is required.
Response
extendedProperties
additionalProperties
configurationProperties
securedProperties
Status Code
The connection object.
You are not authorized to get details about the connection.
You are not permitted to perform this action.
The connection cannot be found.
The service is currently receiving more requests than it can process in a timely fashion. Please retry submitting your request later.
An error occurred. The connection definition details cannot be retrieved.
A timeout occurred when processing your request. Please retry later.
No Sample Response
Delete connection
Deletes a connection definition. This call does not check whether the connection is used by activities, data sets, or other assets. The caller must verify that the connection is unused before deleting it.
DELETE /v2/connections/{connection_id}
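Since the service does not check for usage, the check has to live in the client. One way to structure that is sketched below. Both callables are hypothetical: `find_references` stands for whatever lookup your application uses to find assets that depend on the connection, and `send_delete` stands for the code that issues the DELETE request for a given path.

```python
def delete_connection_if_unused(connection_id, find_references, send_delete):
    """Delete a connection only when no known asset still refers to it."""
    references = list(find_references(connection_id))
    if references:
        raise RuntimeError(
            f"connection {connection_id} is still used by: {', '.join(references)}"
        )
    # Path taken from the endpoint definition above.
    return send_delete(f"/v2/connections/{connection_id}")


# Demonstration with stand-ins that record the call instead of sending it.
sent_paths = []
status = delete_connection_if_unused(
    "abc",
    find_references=lambda connection_id: [],
    send_delete=lambda path: sent_paths.append(path) or 204,
)
print(status, sent_paths)
```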
Request
Custom Headers
Identity Access Management (IAM) bearer token.
Path Parameters
The ID of the connection.
Query Parameters
The ID of the catalog to use. One of catalog_id, project_id, or space_id is required.