Extract skills

The `/extract` endpoint is a REST service that accepts POST requests.
The service expects a JSON body with up to four fields: the input text (`text`), the text's language (`language`), the skill validation threshold (`threshold`) and the output language (`output_language`).
The `text` and `language` fields are mandatory.
The input text must be UTF-8 encoded.

This endpoint will:

- extract the known skills (defined in the skill taxonomy) from the input text
- validate the extracted skills in context
- return the validated skills (with their position, confidence and normalized description)

Remember to send your authentication token with each request (see the Authentication page).

Endpoint

Method	Media	URL	Description
POST	`application/json`	`{{domain}}/extract`	Extract skills from input text
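
For readers calling the endpoint from Python, the request described in the table can be sketched with the standard library alone. The URL is the one used in the Example section of this page; the token value and the helper name `build_extract_request` are placeholders introduced for illustration. The request is only constructed here, not sent.

```python
import json
import urllib.request

# URL taken from the curl example on this page; substitute your own {{domain}}.
EXTRACT_URL = "https://api.textkernel.nl/skills/v2/extract"


def build_extract_request(token, payload, url=EXTRACT_URL):
    """Build (but do not send) a POST request for the /extract endpoint."""
    # The documentation requires UTF-8 encoded input text.
    body = json.dumps(payload, ensure_ascii=False).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Accept": "application/json",
            "Content-Type": "application/json",
        },
    )


request = build_extract_request(
    "YOUR_TOKEN",  # placeholder, see the Authentication page
    {"text": "I am a Java/J2EE developer.", "language": "en"},
)
print(request.get_method(), request.full_url)
# Sending it would be: urllib.request.urlopen(request)
```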

Input parameters

Parameter	Type	Default	Description
`text`	`str`	None	The text to extract skills from
`language`	`str`	None	The language of the input text, as an ISO 639-1 code
`threshold`	`float`	0.5	The minimum confidence threshold for including a skill in the response
`output_language`	`str`	same as `language`	The language (`ISO639-1` code) or locale (`ISO639-1_ISO3166-1` code) of the normalized skills

Set the `language` field to one of the supported languages (ISO 639-1 code format).

The input text should be written in one of the supported languages.
The input text is limited to 50 000 characters. If you exceed the limit, you will receive a `403 Forbidden` HTTP response.
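
The input rules above (mandatory `text` and `language`, optional `threshold` and `output_language`, the 50 000 character limit) can be checked on the client before a request is sent. The following is a minimal sketch, not part of the API: the function name `build_extract_payload` is hypothetical, and it assumes the limit is counted in Unicode code points, which the documentation does not specify.

```python
import json

MAX_TEXT_CHARS = 50_000  # limit stated in the documentation


def build_extract_payload(text, language, threshold=None, output_language=None):
    """Return the JSON body for POST /extract as UTF-8 encoded bytes."""
    if not text:
        raise ValueError("text is mandatory")
    if not language:
        raise ValueError("language is mandatory")
    # Assumption: the limit is measured with len(), i.e. in code points.
    if len(text) > MAX_TEXT_CHARS:
        raise ValueError(
            f"text has {len(text)} characters; the limit is {MAX_TEXT_CHARS}"
        )
    payload = {"text": text, "language": language}
    # Optional fields are only sent when set, so the server defaults apply otherwise.
    if threshold is not None:
        payload["threshold"] = float(threshold)
    if output_language is not None:
        payload["output_language"] = output_language
    return json.dumps(payload, ensure_ascii=False).encode("utf-8")


body = build_extract_payload("I am a Java/J2EE developer.", "en", threshold=0.5)
print(body.decode("utf-8"))
```

Leaving out the optional fields, rather than sending explicit defaults, keeps the request aligned with whatever defaults the service applies.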

Setting the threshold to a lower value than the default will result in more skill extractions per document, but also increases the chance that the results contain ambiguity-related errors (e.g., erroneously normalizing the word "access" to Microsoft Access). A higher threshold has the opposite effect: fewer skills will be extracted, but there is also a lower chance of false positives. The recommended value is 0.5 (default) for CVs and vacancies, and 0.3 for other HR documents. If the input is a single skill or a list of skills, use the Normalize Endpoint instead.
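
As an illustration of the trade-off, the sketch below picks the recommended threshold for a document type and applies the same cut-off on the client. The document type labels, both function names and the sample skills are made up for this example, and comparing with `>=` is an assumption about how the service applies the threshold.

```python
# Recommended values quoted from the paragraph above.
RECOMMENDED_THRESHOLDS = {"cv": 0.5, "vacancy": 0.5}
OTHER_HR_DOCUMENT_THRESHOLD = 0.3


def recommended_threshold(document_type):
    """Return the suggested threshold for a (hypothetical) document type label."""
    return RECOMMENDED_THRESHOLDS.get(document_type.lower(), OTHER_HR_DOCUMENT_THRESHOLD)


def keep_confident(skills, threshold):
    """Client-side filter: keep skills whose confidence is at least `threshold`."""
    return [skill for skill in skills if skill["confidence"] >= threshold]


# Made-up skills, only to show how the cut-off changes the result size.
skills = [
    {"description": "Java", "confidence": 0.9},
    {"description": "Microsoft Access", "confidence": 0.35},
]
print(len(keep_confident(skills, recommended_threshold("cv"))))
print(len(keep_confident(skills, recommended_threshold("appraisal"))))
```

With the lower threshold the ambiguous second skill is kept; with the default it is dropped.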

Set the `output_language` field only if you want the normalized skill descriptions in a language other than the document language.
In that case, set it to one of the supported languages (ISO639-1 code format) or locales (ISO639-1_ISO3166-1 code format).
Whenever a skill cannot be normalized in the requested language, it is normalized in English instead.

See the Overview for the list of supported languages.

Response

Status	Content type	Content description
`200` (OK)	`application/json`	A JSON object containing: `skills`, an array of skill objects; `truncated`, a boolean indicating whether the input text has been truncated; `version`, the API version; `meta`, an object including the taxonomy version (release date)
`400` (Bad request)		The input request body is incorrect
`404` (Not Found)		The language is not supported
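
A client can map these status codes to readable errors. The sketch below is one possible way to do so: `ExtractError` and `check_status` are hypothetical names, and the 403 and 429 entries come from the input limit and rate limit sections of this page rather than from the table above.

```python
# Status codes and meanings as listed in this documentation.
STATUS_MEANINGS = {
    200: "OK",
    400: "The input request body is incorrect",
    403: "The input text exceeds the 50 000 character limit",
    404: "The language is not supported",
    429: "The request rate limit was exceeded",
}


class ExtractError(Exception):
    """Hypothetical exception type for non-200 responses."""

    def __init__(self, status, message):
        super().__init__(f"{status}: {message}")
        self.status = status


def check_status(status):
    """Return normally on 200, raise ExtractError for any other status."""
    if status == 200:
        return
    raise ExtractError(status, STATUS_MEANINGS.get(status, "Unexpected status"))


try:
    check_status(404)
except ExtractError as error:
    print(error)
```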

Example

$ curl -X POST https://api.textkernel.nl/skills/v2/extract \
    -H "Authorization: Bearer $TOKEN" \
    -H "accept: application/json" -H "Content-Type: application/json" \
    -d '{ "text":"I am a Java/J2EE developer.", "language": "en", "threshold": 0.5 }'

{
  "meta": {
    "taxonomy_version": "2020-12-04T16:54:07.021406"
  },
  "skills": [
    {
      "category": "IT Skill",
      "code_id": "KS123KG6DL8N3D5ZW036",
      "confidence": 1.0,
      "description": "Java Platform Enterprise Edition (J2EE)",
      "matches": [
        {
          "begin_span": 7,
          "end_span": 15,
          "likelihood": 1.0,
          "surface_form": "Java/J2EE"
        }
      ]
    }
  ],
  "truncated": false,
  "version": "1.15.1"
}
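
The example response can be read back in Python to recover the matched text from the span offsets. This sketch uses the request text and response body shown above; the function name `surface_forms` is introduced here for illustration. In this example the offsets are 0-based and `end_span` points at the last matched character, so the slice ends at `end_span + 1`; whether that holds for every response is an assumption.

```python
import json

text = "I am a Java/J2EE developer."

# Response body copied from the example above.
raw = """
{
  "meta": {"taxonomy_version": "2020-12-04T16:54:07.021406"},
  "skills": [
    {
      "category": "IT Skill",
      "code_id": "KS123KG6DL8N3D5ZW036",
      "confidence": 1.0,
      "description": "Java Platform Enterprise Edition (J2EE)",
      "matches": [
        {"begin_span": 7, "end_span": 15, "likelihood": 1.0, "surface_form": "Java/J2EE"}
      ]
    }
  ],
  "truncated": false,
  "version": "1.15.1"
}
"""
response = json.loads(raw)


def surface_forms(text, response):
    """Return (description, matched text) pairs, slicing the text by span."""
    results = []
    for skill in response["skills"]:
        for match in skill["matches"]:
            # end_span is treated as inclusive, as in the example response.
            snippet = text[match["begin_span"]:match["end_span"] + 1]
            results.append((skill["description"], snippet))
    return results


print(surface_forms(text, response))
```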

Response fields

Fields on each extracted skill:

Field	Type	Value
`code_id`	`str`	The code id of the normalized skill from the Taxonomy (unique across all languages)
`description`	`str`	The description of the normalized skill concept from the taxonomy
`category`	`str`	The category of the extracted skill. See the Overview for the list of supported categories.
`confidence`	`float`	Overall confidence that the extracted term actually refers to a skill in the context of the text (the average of the `likelihood` values of the individual matches)
`iso_code`	`str`	The language ISO 639-1 code (only for language skills)
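`match.begin_span`	`int`	Start position of the match in the input text (a 0-based character offset in the example above)
`match.end_span`	`int`	End position of the match in the input text (inclusive in the example above)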
`match.surface_form`	`str`	The skill description as found in the input text. The evidence of the normalized skill.
`match.likelihood`	`float`	Confidence that the extracted term actually refers to a skill in the context of the text.
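
The relationship between `confidence` and the per-match `likelihood` values described in the table can be written out as a plain average. This is a sketch of that description only; the function name is hypothetical, and any rounding the service may apply is not documented here.

```python
from statistics import fmean


def overall_confidence(matches):
    """Average the likelihood of each match, as the confidence field is described."""
    return fmean([match["likelihood"] for match in matches])


# Single match from the example response on this page.
print(overall_confidence([{"likelihood": 1.0}]))

# Two made-up matches, to show the averaging.
print(overall_confidence([{"likelihood": 0.5}, {"likelihood": 1.0}]))
```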

Rate limits

Accounts have a limited request rate. If you exceed the limit, you will receive `429 Too Many Requests` HTTP responses.

Plan	Limit (requests)	Per
Standard	500	Minute
Demo	30	Minute
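
One common way to stay within these limits is to space requests out and to retry with a growing delay after a 429 response. The sketch below assumes a caller-supplied `send` function returning a `(status, body)` pair; that function, the helper names and the exponential backoff schedule are illustrative choices, not something the service prescribes.

```python
import time

# Limits from the table above, in requests per minute.
PLAN_LIMITS_PER_MINUTE = {"Standard": 500, "Demo": 30}


def min_interval_seconds(plan):
    """Smallest average gap between requests that stays within the plan limit."""
    return 60.0 / PLAN_LIMITS_PER_MINUTE[plan]


def call_with_retry(send, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call send() until the status is not 429 or the attempts run out.

    Waits base_delay * 2**attempt seconds between attempts.
    """
    for attempt in range(max_attempts):
        status, body = send()
        if status != 429:
            return status, body
        if attempt < max_attempts - 1:
            sleep(base_delay * 2 ** attempt)
    return status, body


print(min_interval_seconds("Demo"))
```

Passing `sleep` as a parameter keeps the helper easy to test without real waiting.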